Kodi Community Forum

Utilize IMDb datasets to enhance tmm's IMDb scraper
Did you know IMDb offers dataset TSV files that are updated daily?
They contain the full list of primary/original/localized titles (for both movies and TV shows) with the corresponding IDs, along with some metadata.
Take a look at what data is available (https://www.imdb.com/interfaces/); it would be great if you could incorporate these into tmm's search & scrape functionality for the IMDb scraper.

Since the datasets give you the full list of available titles along with their IDs, you would no longer need to rely on the site's limited search functionality or try to guess probable URLs.
Currently, adult films are not available in the site's search results, and the site's distinction between adult and non-adult films seems quite arbitrary and unpredictable. Using these datasets would remove such restrictions. It would also allow the user to search by any kind of title (primary/original/localized) with more accuracy. For TV shows, any episode ID could easily be obtained from the season/episode number.
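
To illustrate the episode lookup, here is a minimal Python sketch. It assumes the episode file uses the columns tconst, parentTconst, seasonNumber and episodeNumber, with the literal \N marking missing values. The function name and the sample IDs are made up for illustration; this is not tmm code.

```python
import csv
import io

def build_episode_index(tsv_file):
    """Map (show_id, season, episode) -> episode_id from episode-style rows."""
    index = {}
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: missing season/episode numbers are written as \N.
        if row["seasonNumber"] == r"\N" or row["episodeNumber"] == r"\N":
            continue
        key = (row["parentTconst"], int(row["seasonNumber"]), int(row["episodeNumber"]))
        index[key] = row["tconst"]
    return index

# Tiny made-up sample in the assumed column layout (IDs are placeholders).
SAMPLE = (
    "tconst\tparentTconst\tseasonNumber\tepisodeNumber\n"
    "tt0000101\ttt0000100\t1\t1\n"
    "tt0000102\ttt0000100\t1\t2\n"
    "tt0000103\ttt0000100\t\\N\t\\N\n"
)

episode_index = build_episode_index(io.StringIO(SAMPLE))
print(episode_index.get(("tt0000100", 1, 2)))
```

Once the index is built, each season/episode lookup is a single dictionary access, so no request to the site is needed to find an episode ID.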

Another area it could greatly improve is updating existing metadata.
Imagine a user with 10k movies trying to update IMDb ratings. Using the dataset would update the user's database and NFOs almost instantly, rather than hitting the IMDb server 10k times and parsing the fetched results one by one.
Although not all metadata is included in the datasets, and fetching would still be necessary in many cases, this would still help reduce time and resource use for a lot of tasks.
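
A rough Python sketch of such a bulk ratings update follows. It assumes the ratings file has the columns tconst, averageRating and numVotes and is distributed gzip-compressed. The library structure, the function names and the sample values are hypothetical, and the file is built in memory here instead of being downloaded.

```python
import csv
import gzip
import io

def load_ratings(tsv_file):
    """Read ratings-style rows into {tconst: (rating, votes)}."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        row["tconst"]: (float(row["averageRating"]), int(row["numVotes"]))
        for row in reader
    }

def update_library(library, ratings):
    """Update movie dicts in place; return how many entries changed."""
    changed = 0
    for movie in library:
        new = ratings.get(movie["imdb_id"])
        if new is not None and new != (movie.get("rating"), movie.get("votes")):
            movie["rating"], movie["votes"] = new
            changed += 1
    return changed

# Made-up sample, gzip-compressed in memory to mimic a downloaded file.
RAW = "tconst\taverageRating\tnumVotes\ntt0000001\t7.5\t1200\ntt0000002\t6.1\t85\n"
compressed = gzip.compress(RAW.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8", newline="") as fh:
    ratings = load_ratings(fh)

# Hypothetical library entries; a real tool would read these from its database.
library = [
    {"imdb_id": "tt0000001", "rating": 7.4, "votes": 1100},
    {"imdb_id": "tt0000002", "rating": 6.1, "votes": 85},
    {"imdb_id": "tt0000009", "rating": 5.0, "votes": 10},  # not in the dataset
]
print(update_library(library, ratings))
```

The whole file is parsed once and every movie is then a dictionary lookup, which is where the saving over 10k separate requests would come from.
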
This was already on our minds a few years back.
Unfortunately, there are a few drawbacks.

- a complete set is about 1 GB of compressed download; even updating once a week, this is too big
- a local search in the 800 MB title CSV takes about 7 secs on an SSD; an online search is still faster
- these sets contain only basic metadata; e.g. "plot" and other needed fields are not in there
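
For context on the second point, a local search over a flat title file is presumably a linear scan along these lines. This Python sketch only illustrates that pattern and is not the code that was measured. The column names follow the assumed layout of the title file (the real file has more columns), and the sample rows are invented.

```python
import csv
import io

def scan_titles(tsv_file, query):
    """Return (tconst, primaryTitle, startYear) for rows whose title equals query."""
    needle = query.casefold()
    hits = []
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:  # every row is visited, so cost grows with file size
        titles = (row["primaryTitle"].casefold(), row["originalTitle"].casefold())
        if needle in titles:
            hits.append((row["tconst"], row["primaryTitle"], row["startYear"]))
    return hits

# Invented rows in the assumed layout; \N marks a missing value.
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tstartYear\n"
    "tt0000001\tmovie\tExample Movie\tBeispielfilm\t2001\n"
    "tt0000002\tmovie\tAnother Title\tAnother Title\t\\N\n"
)

print(scan_titles(io.StringIO(SAMPLE), "beispielfilm"))
```

Because each query reads the file from start to end, the time per search scales with the size of the file, not with the number of matches.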

So we discarded the idea and kept the current approach for IMDb.

BUT:
For v4, we have already implemented a "ratings only" IMDb download, which is updated once in a while and should speed things up, especially ratings scraping (from IMDb).

hth