Is there a way to improve the accuracy of IMDB studio name scraping?
For example, Iron Man 2 shows:
Quote:
Production Companies
- Paramount Pictures (presents) (A Jon Favreau Film)
- Marvel Entertainment (presents)
- Marvel Studios
- Fairview Entertainment (in association with)
Distributors
- Bontonfilm (2010) (Czech Republic) (theatrical)
- Concorde Filmverleih (2010) (Germany) (theatrical)
- Constantin-Film (2010) (Austria) (theatrical)
- Finnkino (2010) (Finland) (theatrical)
- Forum Cinemas (2010) (Estonia) (theatrical)
- Forum Cinemas (2010) (Lithuania) (theatrical)
- Forum Cinemas (2010) (Latvia) (theatrical)
- Paramount Pictures (2010) (Australia) (theatrical)
- Paramount Pictures (2010) (Canada) (theatrical)
- Paramount Pictures (2010) (Spain) (theatrical)
- Paramount Pictures (2010) (France) (theatrical)
- Paramount Pictures (2010) (UK) (theatrical)
- Paramount Pictures (2010) (Japan) (theatrical)
- Paramount Pictures (2010) (USA) (theatrical)
The scraper always seems to grab the first entry and discard anything in parentheses. After looking at about a dozen movies on IMDB and comparing them to The Movie Database, it looks like a better rule is to use the last entry that has no parentheses anywhere on its line. This is much more likely to give the "real" studio (like Warner Bros., Disney, Paramount) that most people associate with the movie. It doesn't always work: all the Indiana Jones movies, for example, would come out as "Lucasfilm" instead of "Paramount Pictures".
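To make the proposed rule concrete, here is a rough Python sketch of it. This is not the scraper's actual code (I haven't looked at how it is implemented); the function name `pick_studio` and the list-of-strings input are just assumptions for illustration. It falls back to what the scraper appears to do now (first entry, parentheses stripped) when every line has parentheses.

```python
import re

def pick_studio(production_companies):
    """Hypothetical helper: pick a studio from IMDB "Production Companies" lines.

    Rule from the post: use the LAST entry with no parentheses on its line.
    Fallback (assumed to mirror current behaviour): first entry with any
    parenthesised text stripped.
    """
    # Entries with no parenthetical notes at all, in their original order.
    plain = [c.strip() for c in production_companies if "(" not in c]
    if plain:
        return plain[-1]
    if not production_companies:
        return None
    # Strip every "(...)" group from the first entry.
    return re.sub(r"\s*\([^)]*\)", "", production_companies[0]).strip()

# The Iron Man 2 entries quoted above.
iron_man_2 = [
    "Paramount Pictures (presents) (A Jon Favreau Film)",
    "Marvel Entertainment (presents)",
    "Marvel Studios",
    "Fairview Entertainment (in association with)",
]

print(pick_studio(iron_man_2))      # Marvel Studios
print(pick_studio(iron_man_2[:2]))  # Paramount Pictures (fallback path)
```

For the Iron Man 2 list this picks "Marvel Studios", since that is the only line without parentheses.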
On the other hand, taking the "(country of origin) (theatrical)" entry from "Distributors" seems to be even more accurate. It still doesn't work every time (The Avengers should be "Marvel Studios" based on the logo at the beginning of the movie, but no rule set that catches it works for most other movies), but, again, it's much better than the current behavior for most movies.
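The distributor rule could look something like the sketch below. Again, this is only an illustration under assumptions: `pick_distributor` is a made-up name, the country of origin is passed in by the caller (the scraper would have to get it from elsewhere on the IMDB page), and the lines are assumed to follow the "Name (year) (country) (type)" pattern shown in the quote.

```python
import re

def pick_distributor(distributors, country):
    """Hypothetical helper: return the first distributor tagged with both
    the given country and "theatrical", or None if there is no such line."""
    for line in distributors:
        # Everything before the first "(" is taken to be the company name.
        name = line.split("(", 1)[0].strip()
        # Collect the contents of every "(...)" group on the line.
        tags = re.findall(r"\(([^)]*)\)", line)
        if country in tags and "theatrical" in tags:
            return name
    return None

# A few of the Iron Man 2 distributor lines quoted above.
distributors = [
    "Bontonfilm (2010) (Czech Republic) (theatrical)",
    "Concorde Filmverleih (2010) (Germany) (theatrical)",
    "Paramount Pictures (2010) (UK) (theatrical)",
    "Paramount Pictures (2010) (USA) (theatrical)",
]

print(pick_distributor(distributors, "USA"))      # Paramount Pictures
print(pick_distributor(distributors, "Germany"))  # Concorde Filmverleih
```

If no line matches the country, the function returns None, so a caller could fall back to the production company rule in that case.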
The current workaround is to configure the scraper to use The Movie Database for studios, but since it defaults to IMDB, a lot of people might never change it. Perhaps that's one place where the default should be changed?