Req Faster Scraping
#1
I am using a Fritz!Box 7390 with attached USB storage as a poor man's NAS to serve video content to my LAN. The NAS throughput is around 3.5 Mbit/s, so it's pretty slow but sufficient to stream most everything. However, scraping of new content like weekly updated TV shows is very slow, especially over Wi-Fi. That's why I wanted to look into the current method of scraping. I'm not a programmer and I have no idea how scraping is currently implemented, but it seems to me there's room for improvement.

So, my idea is to speed it up with checksums. My folder tree is pretty big, so XBMC has to check a lot of directories for changes. What if, for any given directory, a checksum calculated from its contents were stored, and the scraper only checked the contents of that folder if the checksum had changed? Obviously, larger trees would benefit more from this than smaller trees.

Example:
I have two main folders, MOVIES and TV. The movie folder is not updated that often, so if its checksum remains the same, the scraper would ignore it. Under MOVIES, I have folders according to genre, like SF, HORROR, COMEDY, etc. The same principle would apply if only the contents of one folder change. Alternatively, one could make folders such as A-E, F-H, etc. to benefit from the idea.
In my TV folder, there are like 30 TV shows. Again, reading only the checksums should be lightning fast.

The checksum values could be stored as a file in each folder for portability or in the XBMC roaming folder - this could be a preference setting.
#2
XBMC already does this, I think. When you look at the debug log (wiki) it should even mention when it skips a directory due to no changes. It should say something about "fast hash". It can still take some time to check a lot of sub folders, so there could be other issues at play here.
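For anyone curious what that kind of skip looks like, here is a minimal Python sketch. It is not XBMC's actual code: `listing_hash`, `scan` and the `stored` dict are made-up names, and it assumes the hash is built from each entry's name, size and modification time.

```python
import hashlib
import os


def listing_hash(path):
    """Hash a folder's immediate entries (name, size, mtime) without opening any file."""
    digest = hashlib.md5()
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        info = entry.stat()
        digest.update(f"{entry.name}|{info.st_size}|{info.st_mtime_ns}".encode())
    return digest.hexdigest()


def scan(root, stored):
    """Walk every folder under root and return the ones whose listing hash changed.

    `stored` maps folder path -> hash from the previous run and is updated in place.
    """
    changed = []
    for folder, _subdirs, _files in os.walk(root):
        current = listing_hash(folder)
        if stored.get(folder) != current:
            changed.append(folder)  # only these folders would need a real rescan
            stored[folder] = current
    return changed
```

Note that even in this sketch every folder still gets listed once per scan; only the per-file work is skipped for unchanged folders.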
#3
Calculating the checksum from a huge directory structure of course also needs time - don't forget that (there is IO involved ... and IO is always a bottleneck).
#4
(2014-05-21, 14:38)Memphiz Wrote: Calculating the checksum from a huge directory structure of course also needs time - don't forget that (there is IO involved ... and IO is always a bottleneck).

True, but it should be faster the second time around. Maybe the algorithm could be optimized for a structure of many branches... I'm guessing for a flat structure hashes won't help much.

(2014-05-21, 12:05)Ned Scott Wrote: XBMC already does this, I think. When you look at the debug log (wiki) it should even mention when it skips a directory due to no changes. It should say something about "fast hash". It can still take some time to check a lot of sub folders, so there could be other issues at play here.

I've checked the debug log and it confirms my suspicion that fast hash is used for each and every folder. That means that XBMC is still crawling the entire tree structure but not the contents of unchanged folders. My suggestion was to take this a step further and store hashes for parent folders, which would save lots of time by checking only these meta hashes for those who have at least two parent folders like TV and MOVIES, for example.

It takes around 10 mins to scan my entire library this way via LAN. The time could probably be cut in half if extrafanart/ and extrathumbs/ folders could be excluded from scanning.
#5
I've always considered that the time taken with scans is primarily down to scraping TVDB or IMDB or whatever, rather than scooting through content that's already in the library.

I'm eager to get a fibre-op connection as my 3.5Mb/s connection is barely adequate for OTA HD streaming. I'd imagine this would also speed up the scan as it'll scrape new content quicker.

Correct me if I'm wrong!
#6
(2014-05-21, 16:54)HeresJohnny Wrote: It takes around 10 mins to scan my entire library this way via LAN. The time could probably be cut in half if extrafanart/ and extrathumbs/ folders could be excluded from scanning.

You can do that, see the excludefromscan setting in advancedsettings.xml (wiki).
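The exclusion is regex-based: a path that matches one of the configured patterns is skipped. A small Python sketch of the effect follows. The two patterns are only a guess at what would fit extrafanart/ and extrathumbs/ and are not copied from the wiki; `is_excluded` is a made-up helper, and the case-insensitive matching is a choice made for this sketch.

```python
import re

# Assumed patterns for the two artwork folders mentioned above.
EXCLUDE_PATTERNS = [r"[\\/]extrafanart[\\/]", r"[\\/]extrathumbs[\\/]"]


def is_excluded(path, patterns=EXCLUDE_PATTERNS):
    """Return True if any exclusion regex matches the path (case-insensitive here)."""
    return any(re.search(p, path, re.IGNORECASE) for p in patterns)
```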
#7
(2014-05-21, 17:35)Aenima99x Wrote:
(2014-05-21, 16:54)HeresJohnny Wrote: It takes around 10 mins to scan my entire library this way via LAN. The time could probably be cut in half if extrafanart/ and extrathumbs/ folders could be excluded from scanning.

You can do that, see here advancedsettings - excludefromscan

Thanks, I had tried that just before you wrote it and the time saved is negligible after all. It seems that fast hash is optimized well in this regard.
#8
(2014-05-21, 17:33)JesusOnEez Wrote: I've always considered that the time taken with scans is primarily down to scraping TVDB or IMDB or whatever, rather than scooting through content that's already in the library.

I'm eager to get a fibre-op connection as my 3.5Mb/s connection is barely adequate for OTA HD streaming. I'd imagine this would also speed up the scan as it'll scrape new content quicker.

Correct me if I'm wrong!

The metadata sites are actually incredibly quick and efficient these days.
#9
There are also addons (watchdog) and external programs (sickbeard) that will send an update request to XBMC to only scan a single folder when a new file is added to them.
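As a rough illustration of such an update request: XBMC's JSON-RPC interface has a `VideoLibrary.Scan` method that accepts a `directory` parameter. The Python sketch below builds that request. The URL and the share path in the usage note are placeholders, `build_scan_request` and `send` are made-up helper names, and it assumes the web server is enabled in XBMC's settings.

```python
import json
import urllib.request


def build_scan_request(directory):
    """Build a JSON-RPC payload asking XBMC to scan a single folder."""
    return {
        "jsonrpc": "2.0",
        "method": "VideoLibrary.Scan",
        "params": {"directory": directory},
        "id": 1,
    }


def send(payload, url="http://localhost:8080/jsonrpc"):
    """POST the payload to XBMC's web server (placeholder URL) and return the decoded reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Something like `send(build_scan_request("smb://nas/TV/Some Show/"))` would then ask XBMC to scan only that one folder instead of the whole source.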
#10
(2014-05-21, 16:54)HeresJohnny Wrote: My suggestion was to take this a step further and store hashes for parent folders, which would save lots of time by checking only these meta hashes for those who have at least two parent folders like TV and MOVIES, for example.
Yeah, if only it was that easy. It's not how it works. You have to look at the contents of the folder to compute the hash. XBMC already uses a trick where it looks at the last-modified time of a folder. But this doesn't tell you if there are any changes in subfolders, so you have to look at them too.

It's my understanding as well that just checking the tree for new files isn't really the bottleneck here. Even at tens of thousands of files it shouldn't take more than seconds.
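To make that concrete, here is a rough Python sketch (made-up helper names, not XBMC code, and it hashes entry names only to keep things short). A hash of a folder's own listing stays the same when something changes two levels down, so a parent hash you can trust has to be built from every subfolder, which means walking the tree anyway.

```python
import hashlib
import os


def shallow_hash(path):
    """Hash only the names of a folder's immediate entries."""
    names = sorted(os.listdir(path))
    return hashlib.md5("\n".join(names).encode()).hexdigest()


def recursive_hash(path):
    """Combine a folder's own listing with the recursive hashes of its subfolders."""
    digest = hashlib.md5(shallow_hash(path).encode())
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        if os.path.isdir(child):
            # Every subfolder must be visited to compute the parent's hash.
            digest.update(recursive_hash(child).encode())
    return digest.hexdigest()
```

Adding an episode under TV/Show/ leaves `shallow_hash` of the top folder untouched, while `recursive_hash` changes, but only because it listed every folder on the way down.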