Using Bing for better scraper searching
#1
So I got tired of the existing scrapers returning incorrect results for about 1/3rd of my movies. It turns out that while the sites we scrape have lots of great data, their search engines range from inaccurate (imdb's) to completely broken (tmdb's). They both choke when your naming doesn't exactly match the official name of the movie and really don't like foreign movies.

Any real search engine can handle these cases just fine. Looking around I found that Bing has a very nice, easy to use developer API for accessing their search results. Google and Yahoo both also have APIs, but they are only for use as part of an AJAX website (Google's FAQ says they'll block you if you scrape their results). The Bing ToU allows "end-user-facing website or application".

Anyway, I edited the existing IMDB scraper to do a Bing search of "site:imdb.com movie (year)" and parse the returned XML. The actual data is still scraped from IMDB, I just changed the search part. For my collection of ~200 movies (hollywood, anime, foreign, etc) Bing got the correct imdb link 100% of the time.
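A minimal Python sketch of that search step, for anyone who wants to try the idea outside the scraper XML. It is not the scraper itself. The endpoint URL and the parameter names (`AppId`, `Query`, `Sources`) are assumptions about the Bing XML API, the sample response is hand-written, and the title id in it is a placeholder.

```python
import re
from urllib.parse import urlencode

# Assumption: XML endpoint of the Bing developer API; check the current docs.
BING_ENDPOINT = "http://api.bing.net/xml.aspx"

# Pattern for an IMDB title link inside the returned XML.
IMDB_TITLE_RE = re.compile(r"http://www\.imdb\.com/title/(tt[0-9]+)/?")


def build_search_url(title, year, app_id, site="imdb.com"):
    """Build the request URL for a 'site:imdb.com movie (year)' search."""
    query = f"site:{site} {title} ({year})"
    params = {"AppId": app_id, "Query": query, "Sources": "Web"}
    return BING_ENDPOINT + "?" + urlencode(params)


def first_imdb_id(response_xml):
    """Return the first IMDB title id found in the response text, or None."""
    match = IMDB_TITLE_RE.search(response_xml)
    return match.group(1) if match else None


# Hand-written stand-in for a response; the id is a placeholder.
SAMPLE_RESPONSE = """<SearchResponse><web:Web><web:Results>
<web:WebResult><web:Url>http://www.imdb.com/title/tt0000000/</web:Url></web:WebResult>
</web:Results></web:Web></SearchResponse>"""

if __name__ == "__main__":
    print(build_search_url("Akira", 1988, "YOUR_APP_ID"))
    print(first_imdb_id(SAMPLE_RESPONSE))
```

Only the search changes here; the id that comes back is then handed to the normal IMDB detail scraping.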

This method could be used for any scraper by replacing "site:imdb.com" with "site:themoviedb.org" or something else.

Question for an XBMC admin: Bing's API requires an AppID, just like TMDB's. I signed up for one personally but would rather not release the scraper using my AppID. Would it be possible for someone @xbmc.org to sign up for an official AppID that can be used? It's an online process that takes about 5 minutes.

Once the AppID is squared away I'll release my bing_imdb.xml file, and post a little guide for how to add bing to other scrapers.
Reply
#2
I made this appid: 16E50AB9947899C41433EB944C60174737855036
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


Reply
#3
you don't need an api to scrape, you know. if i'm not mistaken the imdb scraper did use google before. i've tested google and it's not perfect. there are pages missing and it can have problems with aka titles. whether google is better than imdb all depends on which movies you have, i guess. maybe you should try this before completely ruling out imdb.

and btw google will only block you for a small amount of time if you do like 500 searches in <1min.
Reply
#4
Thanks for the AppID Pike. Here are the 2 files for my modification of the imdb scraper. Drop them into the system/scrapers/video directory of your install.

http://www.hackish.org/~rufus/bing_imdb.xml
http://www.hackish.org/~rufus/bing_imdb.png

Can people try it out and report how it works? It works perfectly for me.

To use this for other scrapers simply replace the CreateSearchUrl and GetSearchResults parts with the ones from my file. In the CreateSearchUrl replace site:imdb.com with site:somethingelse.com. In the GetSearchResults replace "http://www.imdb.com/title/(tt[0-9]+)/?" with the full url you're expecting.
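The same two swaps can be sketched in Python as a per-site table holding the `site:` filter and the URL pattern. This only illustrates the idea and is not the scraper XML. The imdb.com pattern is the one quoted above; the themoviedb.org pattern is a guess at that site's URL layout, and the names `SITES`, `search_query` and `first_id` are made up for the example.

```python
import re

# Per-site settings: the domain for the "site:" filter and the result URL
# pattern whose first group is the id we want.
# The themoviedb.org pattern is an assumption, not taken from any scraper.
SITES = {
    "imdb": ("imdb.com", r"http://www\.imdb\.com/title/(tt[0-9]+)/?"),
    "tmdb": ("themoviedb.org", r"http://www\.themoviedb\.org/movie/([0-9]+)"),
}


def search_query(site_key, title, year):
    """Build the 'site:<domain> movie (year)' query for the chosen site."""
    domain, _ = SITES[site_key]
    return f"site:{domain} {title} ({year})"


def first_id(site_key, urls):
    """Return the id from the first result URL that matches the site's pattern."""
    _, pattern = SITES[site_key]
    for url in urls:
        match = re.match(pattern, url)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    print(search_query("tmdb", "Akira", 1988))
    # Placeholder URLs; the first one is a non-title page and is skipped.
    print(first_id("imdb", [
        "http://www.imdb.com/name/nm0000000/",
        "http://www.imdb.com/title/tt0000000/",
    ]))
```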
Reply
#5
Also I agree that turning on sorted="yes" would vastly improve the default IMDB scraper's results. IMDB's search is doing far more intelligent sorting than XBMC's basic string comparison.
Reply
#6
I've said this about 100 times.

Change the imdb scraper to return ALL titles that the search brings, rather than just the links. The problem is that it's not returning the AKA title names that the search page gives.

eg Infernal Affairs

Should include both "The Departed" and "Infernal Affairs" results, both linking to the same movie.

That way the fuzzy matching in XBMC will work perfectly, _enhancing_ the IMDb results for those cases where it doesn't work well.

Can you find any other cases where the IMDb search would then require sorted="yes" ?
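A rough Python sketch of that idea, to show why the extra rows matter. It uses `difflib` as a stand-in for XBMC's fuzzy matching (the real matcher may score differently), and the rows and links are placeholders rather than actual IMDb output.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(wanted, rows):
    """Pick the (title, link) row whose title is closest to the wanted name."""
    return max(rows, key=lambda row: similarity(wanted, row[0]))


# One row per title the search page shows, AKA names included.
# Both rows point at the same (placeholder) link.
ROWS = [
    ("The Departed", "link-A"),
    ("Infernal Affairs", "link-A"),
]

if __name__ == "__main__":
    for title, link in ROWS:
        print(title, link, round(similarity("Infernal Affairs", title), 2))
    print(best_match("Infernal Affairs", ROWS))
```

Without the AKA row the only candidate is a poor string match; with it the wanted name matches exactly.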

Cheers,
Jonathan
Reply
#7
Hi, what changes do I need to make to get IMDB to return all titles?

I am having this issue, as it's not returning the AKA titles.

Cheers
Si
Reply
#8
You need to modify the scraper XML file. This involves parsing the results IMDb gives using regular expressions to generate a set of XML results that XBMC then uses.

Cheers,
Jonathan
Reply
#9
So I did some quick tests using my Anime collection. Here are the 13 movies I used for testing:
Code:
Akira-1988
Cowboy.Bebop-The.Movie-2001
Fullmetal.Alchemist_the.Conqueror.of.Shamballa-2005
Ghost.In.The.Shell-1995
Ghost.in.the.Shell_Innocence-2004
Grave.of.the.Fireflies-1988
Metoroporisu-2001
Mononoke.Hime-1997
Nausicaa-1984
Ninja.Scroll-1993
Ponyo.On.The.Cliff-2008
Spirited.Away-2001
Wonderful.Days-2003

My Bing+IMDB scraper gets all 13/13 correct.

The default IMDB scraper (unsorted) gets only 7/13. The ones it gets wrong:
Code:
Fullmetal.Alchemist_the.Conqueror.of.Shamballa-2005
Ghost.In.The.Shell-1995
Ghost.in.the.Shell_Innocence-2004
Nausicaa-1984
Ninja.Scroll-1993
Spirited.Away-2001

Using the default IMDB scraper and adding the sorted="yes" tag gets 12/13. It can't find this one at all (doesn't even get a wrong hit):
Code:
Ghost.in.the.Shell_Innocence-2004

jmarshall: why bother adding a bunch more code to scrape aka names from IMDB results? Is there any case where XBMC's re-ordering of names provides better results than IMDB's sort order?
Reply
#10
Because IMDb may hit the "popular" result first, even though the actual result may be further down?
Reply
#11
That's always possible, but I've yet to find a case where XBMC's resorting is (or with aka names could be) better than using IMDB's sorting order.

In the case of Infernal Affairs (which was mentioned earlier) both The Departed and Mou gaan dou list it as an AKA name, so re-sorting with AKA names could give you either one of the two. Going with the default IMDB sort gives you The Departed, which is still wrong. My scraper using Bing gives the correct result (Mou gaan dou).

Another issue I just discovered: IMDB's search ignores years. Searching for Oceans Eleven (1960) and Oceans Eleven (2001) gives identical results. My Bing-based search gives you the correct one.

The summary from all of my testing so far:
Existing IMDB scraper: lots of issues, never gets the best results
IMDB scraper with sorted="yes": fixes most of the issues, but not all of them
Bing+IMDB scraper: I have yet to find a case where it was not correct
Reply
#12
Nope, the correct result for this movie was in fact The Departed :p

XBMC's sorting checks the year in addition to the title, so +1 from there as well.
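As an illustration only, a year-aware ranking could look something like the Python below. This is not XBMC's actual algorithm: the `difflib` similarity and the `YEAR_BONUS` weight are assumptions picked for the example, and the links are placeholders.

```python
from difflib import SequenceMatcher

YEAR_BONUS = 0.5  # arbitrary weight for a matching year (assumption)


def score(wanted_title, wanted_year, title, year):
    """Title similarity, plus a bonus when the candidate's year matches."""
    value = SequenceMatcher(None, wanted_title.lower(), title.lower()).ratio()
    if wanted_year is not None and year == wanted_year:
        value += YEAR_BONUS
    return value


def pick(wanted_title, wanted_year, candidates):
    """Return the best (title, year, link) candidate under the score above."""
    return max(
        candidates,
        key=lambda c: score(wanted_title, wanted_year, c[0], c[1]),
    )


# Two films with the same title; only the year tells them apart.
CANDIDATES = [
    ("Ocean's Eleven", 1960, "link-1960"),
    ("Ocean's Eleven", 2001, "link-2001"),
]

if __name__ == "__main__":
    print(pick("Oceans Eleven", 2001, CANDIDATES))
    print(pick("Oceans Eleven", 1960, CANDIDATES))
```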

I doubt that improving the IMDb scraper via my suggested route would work better than <arbitrary search engine>, though I'm just guessing there. The main advantage is not relying on said search engine.

Cheers,
Jonathan
Reply
#13
rufus210,

Cheers mate - the scraper you have provided has vastly improved my matches!

100% thus far. Good work.
Reply
#14
i don't know how you think this is supposed to be done. the only way i can see is to list movies for each aka title. and honestly, i don't see how you can justify that.

btw; first one i tried, "Dream 2008", failed. unsorted imdb succeeds. you need to test more than 17 movies rufus. (will test the rest of mine later)
Reply
#15
you search on imdb.

it tosses up a list of movies, including aka titles. we grab all the titles (something we don't do now). the sorting does its magic. voilà.
Reply
