2017-11-21, 16:32
(2017-11-19, 00:16)nabsltd Wrote: Is there a way to improve the accuracy of IMDB studio name scraping?
Your second approach will surely not work, as detecting and handling the country of origin from IMDb is a nightmare.
For example, Iron Man 2 shows:
Quote:
Production Companies
- Paramount Pictures (presents) (A Jon Favreau Film)
- Marvel Entertainment (presents)
- Marvel Studios
- Fairview Entertainment (in association with)
Distributors
- Bontonfilm (2010) (Czech Republic) (theatrical)
- Concorde Filmverleih (2010) (Germany) (theatrical)
- Constantin-Film (2010) (Austria) (theatrical)
- Finnkino (2010) (Finland) (theatrical)
- Forum Cinemas (2010) (Estonia) (theatrical)
- Forum Cinemas (2010) (Lithuania) (theatrical)
- Forum Cinemas (2010) (Latvia) (theatrical)
- Paramount Pictures (2010) (Australia) (theatrical)
- Paramount Pictures (2010) (Canada) (theatrical)
- Paramount Pictures (2010) (Spain) (theatrical)
- Paramount Pictures (2010) (France) (theatrical)
- Paramount Pictures (2010) (UK) (theatrical)
- Paramount Pictures (2010) (Japan) (theatrical)
- Paramount Pictures (2010) (USA) (theatrical)
The scraper always seems to grab the first entry and toss anything surrounded by parentheses. After looking at about a dozen movies on IMDB and comparing them to The Movie Database, it looks like a better rule is to use the last entry that doesn't have any parentheses on the line. This seems much more likely to give the "real" studio (like Warner Bros., Disney, Paramount) that most people will associate with the movie. There are some cases where it doesn't work: for example, all the Indiana Jones movies would come out as "Lucasfilm" instead of "Paramount Pictures".
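To make the "last entry without parentheses" rule concrete, here is a rough Python sketch. It is not the scraper's actual code: the function name pick_studio and the idea that the "Production Companies" lines are already available as plain strings are assumptions made only for illustration.

```python
import re

def pick_studio(production_companies):
    """Hypothetical helper: choose a studio from IMDb 'Production Companies' lines.

    Rule: prefer the LAST line that carries no parenthesised note.
    Fallback: the first line with its parenthesised notes stripped,
    which mirrors the behaviour described for the current scraper.
    """
    plain = [line.strip() for line in production_companies if "(" not in line]
    if plain:
        return plain[-1]
    if production_companies:
        # Remove every "(...)" group, e.g. "(presents)".
        return re.sub(r"\s*\([^)]*\)", "", production_companies[0]).strip()
    return None

iron_man_2 = [
    "Paramount Pictures (presents) (A Jon Favreau Film)",
    "Marvel Entertainment (presents)",
    "Marvel Studios",
    "Fairview Entertainment (in association with)",
]
studio = pick_studio(iron_man_2)
```

With the Iron Man 2 list above, the only line without parentheses is "Marvel Studios", so that is what the rule picks.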
On the other hand, taking the "(country of origin) (theatrical)" entry from "Distributors" seems to be even more accurate. This still doesn't work every time (The Avengers should be "Marvel Studios" based on the logo at the beginning of the movie, but no rule set that catches it works for most other movies), but, again, it's much better than the current behavior for most movies.
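The distributor rule could look roughly like the sketch below. It assumes the country of origin is already known and passed in, which is the hard part in practice; pick_distributor and its parameters are made-up names for illustration, not anything the scraper provides.

```python
import re

def pick_distributor(distributors, country):
    """Hypothetical helper: first distributor tagged (country) and (theatrical)."""
    for line in distributors:
        # Collect every parenthesised note, e.g. ["2010", "USA", "theatrical"].
        notes = re.findall(r"\(([^)]*)\)", line)
        if country in notes and "theatrical" in notes:
            # The company name is everything before the first "(".
            return line.partition("(")[0].strip()
    return None

distributors = [
    "Bontonfilm (2010) (Czech Republic) (theatrical)",
    "Concorde Filmverleih (2010) (Germany) (theatrical)",
    "Paramount Pictures (2010) (UK) (theatrical)",
    "Paramount Pictures (2010) (USA) (theatrical)",
]
us_studio = pick_distributor(distributors, "USA")
```

For the excerpt above with country "USA" this gives "Paramount Pictures"; if no line matches both notes it returns None, so a caller would still need a fallback.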
A current solution is to configure the scraper to use The Movie Database for studios, but since it defaults to IMDB, a lot of people might not change it. Perhaps that's one place the default should be changed?
The first approach could be done, but it would take significant effort and I am not sure it would be worth the hassle. What the scraper currently does is parse the main page (e.g. http://akas.imdb.com/title/tt1228705/) and take the first entry from the companies listed there, so there is no need to strip parentheses: they do not appear on that page.
Changing the default to TMDb is a good idea and I will do it right now! Thanks for the catch!