Kodi Community Forum

Full Version: Scraper performance in Kodi v20
I just finished testing scrapers for Kodi v20. Make of the results what you will...

TVShows

php:
TVShow scraping speed test (20 tvshows, 1,273 episodes)
=======================================================

The Movie DataBase (XML) - 02mins 33secs (153 secs total)
XEM (XML)                - 04mins 32secs (272 secs total)
TMDB TVShows (Python)    - 08mins 25secs (505 secs total)
TVMaze (Python)          - 10mins 39secs (639 secs total)
The TVDB v4 (Python)     - 19mins 14secs (1,154 secs total)
The TVDB New (Python)    - 29mins 24secs (1,764 secs total)

* Tested using latest Kodi v20 nightly, default scraper options.

Movies

php:
Test setup - Latest nightly, 997 movies, local artwork pre-downloaded, default settings
===============================================================
TMDB Movie (Python)  - 12min 01sec
TMDB Movie (XML)     - 12min 43sec
Universal Movie Scraper - 1hr 21min 44sec

Both tests used the sample library from the wiki.
I ran the test also, using the same test files. A few differences...

php:
TV Show Scraping
=====================
TheMovieDB XML - 1273 episodes; 02:42
TMDB TV Shows - 1273 episodes; 15:38
TVDB v4 - 1273 episodes; 19:40
TV Maze - 1269 episodes; 04:55

php:
Movie Scraping
=====================
Universal Movie Scraper - 0 movies; 92:41
TVDB v4 (Python) - 949 movies; 32:37
TheMovieDB XML - 997 movies; 36:14
TheMovieDatabase Python - 997 movies; 47:51
* UMS is broken
** TVDB seems to be the least accurate as it missed 50 movies
Are Python scrapers slower due to the startup time of the Python interpreter? Is there a way Kodi could pass a bulk lot of file paths to the scraper, have it process them all, and get back a list of results? I suspect that would require big changes to the Kodi system. But imagine if some APIs could search for 10 movies at once, etc.
(2022-10-02, 07:24) matthuisman Wrote: Are Python scrapers slower due to the startup time of the Python interpreter? Is there a way Kodi could pass a bulk lot of file paths to the scraper, have it process them all, and get back a list of results? I suspect that would require big changes to the Kodi system. But imagine if some APIs could search for 10 movies at once, etc.

I believe it's a combination of Python being slower (or doing more things) and the actual APIs and how many times they need to be contacted.

@pkscout knows more about the inner workings I think.

I'd be interested in breaking it down somehow, to see where the extra processing time comes from.

Clearly, taking 4x as long to scrape TV shows in v20 is a bit of an issue.
I don't actually know as much about the actual work done in Core, but my impression is that, yes, Python introduces additional processing time. There is also the issue of the ways in which Python is currently able to interact with Core, and some of that can't be fixed until we fully remove support for the XML scrapers (specifically, I believe, some work around batch saving; right now we have to save every episode one at a time). That's obviously a tough decision, as the XML scrapers sort of work for most people and are really fast. I'm personally hoping we can remove them for v21 and work during the next development cycle to squeeze everything we can from the Python scrapers.

Comparing apples to apples, the TV show XML scraper for The Movie Database is clearly faster than the Python version. Some of that is likely additional API calls to support additional information in the Python scraper, but the rest is the difference between Python and XML. The differences between the various Python scrapers probably have more to do with the backend APIs than anything. I used the TVMaze scraper code as a starting point for the TMDb TV Shows scraper, so when you see time differences between those, it's almost all API related.
I just tried the TheMovieDB scraper on my TV shows with a local database, and things were indeed scraped quite a bit faster.
However, at least some TV shows were not fully scraped, although all relevant info is present on the TheMovieDB website.

Example: "She-Hulk Attorney At Law" has 9 episodes on the website, 8 episode files are present/have been aired, and only the first 5 were scraped, probably because there were Kodi-exported nfo files available from TheTVDB.

This log part was shown after scraping the 5th episode:
Code:
2022-10-10 11:13:47.050 T:127611 ERROR <general>: CCurlFile::FillBuffer - Failed: Server returned nothing (no headers, no data)(52)
2022-10-10 11:13:47.050 T:127611 ERROR <general>: CCurlFile::Open failed with code 0 for 92783:

2022-10-10 11:13:47.050 T:127611 ERROR <general>: Run: Unable to parse web site
2022-10-10 11:13:47.054 T:127611 WARNING <general>: No information found for item '/mnt/clrn/wd40/TVSHOWS/She-Hulk Attorney at Law/', it won't be added to the library.
(2022-10-02, 07:24) matthuisman Wrote: Are Python scrapers slower due to the startup time of the Python interpreter? Is there a way Kodi could pass a bulk lot of file paths to the scraper, have it process them all, and get back a list of results? I suspect that would require big changes to the Kodi system. But imagine if some APIs could search for 10 movies at once, etc.

It's not because of Python. It's because of how Kodi calls the scraper. Kodi calls the scraper like a script, which means it gets a new Python environment on every call. This means no reuse of HTTP connections, no memoization, etc.

As an example, here is the per-call time of a Python method that searches for a TV show on Jellyfin. The Jellyfin API is accessed over HTTPS.

Code:
per call: 0.2676467499999995

And here is the same call made 100 times, but using a connection pool and memoization:

Code:
per call: 0.01529335788999998
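For anyone wondering why the two numbers above differ by roughly 17x, here is a minimal sketch of the idea. It does not touch the network: `FakeClient`, `search_fresh`, `search_cached` and `run_demo` are all made-up names for illustration, and the "connection" is just a counter standing in for the TCP/TLS handshake. It is not the code that produced the timings above.

```python
import functools


class FakeClient:
    """Hypothetical stand-in for an HTTPS client; only counts what happens."""
    connections_opened = 0
    requests_sent = 0

    def __init__(self):
        FakeClient.connections_opened += 1  # models the TCP/TLS handshake

    def get(self, path):
        FakeClient.requests_sent += 1
        return {"path": path}


def search_fresh(title):
    """Roughly what a fresh interpreter per call amounts to: new client, no cache."""
    return FakeClient().get("/search?q=" + title)


_client = None  # module-level; only survives if the interpreter is reused


def _shared_client():
    global _client
    if _client is None:
        _client = FakeClient()
    return _client


@functools.lru_cache(maxsize=256)
def search_cached(title):
    """Reuses one client and memoizes the result per title."""
    return _shared_client().get("/search?q=" + title)


def run_demo(calls=100):
    """Return (connections, requests) for the fresh and the cached variant."""
    global _client
    _client = None
    search_cached.cache_clear()

    FakeClient.connections_opened = FakeClient.requests_sent = 0
    for _ in range(calls):
        search_fresh("She-Hulk")
    fresh = (FakeClient.connections_opened, FakeClient.requests_sent)

    FakeClient.connections_opened = FakeClient.requests_sent = 0
    for _ in range(calls):
        search_cached("She-Hulk")
    cached = (FakeClient.connections_opened, FakeClient.requests_sent)
    return fresh, cached
```

Calling `run_demo(100)` shows the fresh variant opening a connection for every call, while the cached variant opens one connection and sends one request. In a real scraper the shared client would be a pooled HTTP session, and none of this helps unless the module stays loaded between calls.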
Could someone who tested the TVmaze scraper open the scraper's addon.xml file with a text editor, find the string <reuselanguageinvoker>false</reuselanguageinvoker>, change it to true, then re-test scraping to see if there is any performance improvement?
OK, I did the tests myself with my own TV shows library using my TVmaze scraper. With reuselanguageinvoker enabled, scraping is more than 2x faster. So reuselanguageinvoker is strongly recommended for scrapers. I'll enable it in the next TVmaze release.
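If you would rather script that edit than do it by hand, something like the following sketch should work. The sample addon.xml is a trimmed-down, hypothetical file (the addon id and the exact placement of the element inside the extension are assumptions on my part), and `set_reuse_language_invoker` is a made-up helper name. The function simply flips every `<reuselanguageinvoker>` element it finds, wherever it sits.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down addon.xml; a real scraper's file contains more,
# and the element may sit somewhere else in the tree.
SAMPLE_ADDON_XML = """<addon id="metadata.example.scraper" version="1.0.0">
  <extension point="xbmc.metadata.scraper.tvshows" library="main.py">
    <reuselanguageinvoker>false</reuselanguageinvoker>
  </extension>
</addon>"""


def set_reuse_language_invoker(xml_text, enabled=True):
    """Return xml_text with every <reuselanguageinvoker> set to true/false."""
    root = ET.fromstring(xml_text)
    nodes = list(root.iter("reuselanguageinvoker"))
    if not nodes:
        raise ValueError("no <reuselanguageinvoker> element found")
    for node in nodes:
        node.text = "true" if enabled else "false"
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree drops comments and the XML declaration when round-tripping like this, so for a one-off test, editing the file in a text editor is just as good.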

I remember that there was a memory leak bug with reuselanguageinvoker enabled, but IIRC it was fixed, so using reuselanguageinvoker should be safe now. But a scraper (or any other type of plugin, for that matter) should be careful about using or modifying global state variables and about import-time side effects.

Another general recommendation is to use caching as much as possible. But that may not be possible with a specific data provider. For example, TVmaze allows prefetching all episode info at once, and the scraper uses an in-memory cache while populating episode details, thus completely avoiding extra API calls or disk reads. On the other hand, TheTVDB API does not allow prefetching, so a scraper needs to make an API call per episode, which slows down scraping.
I just did the same test with the TMDb TV Show scraper. On my Vero 4K+ (which is produced by the OSMC folks and is akin to a Raspberry Pi for general compute power), there was a nearly 4x improvement with the reuselanguageinvoker option enabled. I'm also going to enable it and push out new versions for Matrix and Nexus.
Now that's what I call progress ;)
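The difference between the two approaches can be sketched like this. `FakeEpisodeApi` is a hypothetical provider that only counts requests; it is not the TVmaze or TheTVDB client, and the two scrape functions are simplified stand-ins for what a scraper does when filling in episode details.

```python
class FakeEpisodeApi:
    """Hypothetical data provider that counts the requests it receives."""

    def __init__(self, episodes):
        self._episodes = episodes  # {(season, number): title}
        self.calls = 0

    def get_episode(self, season, number):
        self.calls += 1
        return self._episodes[(season, number)]

    def get_all_episodes(self):
        self.calls += 1
        return dict(self._episodes)


def scrape_per_episode(api, wanted):
    """One request per episode, as with a provider that has no bulk fetch."""
    return {key: api.get_episode(*key) for key in wanted}


def scrape_with_prefetch(api, wanted):
    """One bulk request, then every lookup is served from memory."""
    cache = api.get_all_episodes()
    return {key: cache[key] for key in wanted}
```

For a 9-episode season the first function makes 9 requests and the second makes 1, with identical results. Over a library of 1,273 episodes that difference adds up quickly.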

Thanks all.