Kodi Community Forum
Amazon US Scraper not working - Printable Version

+--- Thread: Amazon US Scraper not working (/showthread.php?tid=66628)


Amazon US Scraper not working - GregK - 2010-01-09 08:48

Hello,
I've spent a few hours trying to get the Amazon US scraper working. I've tried the release version and the version from SVN, and I've searched Google and the XBMC forums without any luck. I turned on debug logging and got these results when I attempted to scrape a movie:
Code:
01:46:25 T:2899311472 M:524914688   DEBUG: InternalFindMovie: Searching for 'bear in the big blue house' using Amazon US scraper (file: 'amazonus.xml', content: 'movies', language: 'en', date: '2009-05-22', framework: '1.0')
01:46:25 T:2899311472 M:524914688   DEBUG: FileCurl::Open(0xbfa9ef6c) http://www.amazon.com/s/ref=nb_ss_d_h_?url=search-alias%3Ddvd&field-keywords=bear%20in%20the%20big%20blue%20house
01:46:26 T:2899311472 M:524918784   DEBUG: FileCurl::Close(0xbfa9ef6c) http://www.amazon.com/s/ref=nb_ss_d_h_?url=search-alias%3Ddvd&field-keywords=bear%20in%20the%20big%20blue%20house
01:46:26 T:2899311472 M:524918784   DEBUG: scraper: GetSearchResults returned <?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results></results>
01:46:26 T:2899311472 M:524918784   ERROR: Process: Error looking up movie Bear in the big blue house
01:46:26 T:2899311472 M:524918784   DEBUG: Thread 2899311472 terminating
01:46:26 T:3078846352 M:524918784    INFO: Loading skin file: DialogKeyboard.xml
01:46:26 T:3078846352 M:524918784   DEBUG: Load DialogKeyboard.xml: 15.29ms
01:46:26 T:3078846352 M:524918784   DEBUG: ------ Window Init (DialogKeyboard.xml) ------
01:46:26 T:3078846352 M:524918784   DEBUG: Alloc resources: 2.77ms (0.00 ms skin load)
01:46:26 T:3078846352 M:524419072   DEBUG: ------ Window Deinit (DialogProgress.xml) ------

My movie collection has several kids' movies that IMDb does not handle well. I would love to be able to scrape from Amazon. Can anyone help me?

Thanks in advance.

P.S.
Here is the amazonus.xml file.

Code:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Initial basic version doing Studio and Thumb believed to have been written by C-Quel -->
<!-- Then updated by John Lockwood to scrape Title, Year, MPAA, Runtime, Rating, Votes, Plot, Actors, Directors -->
<!-- This version 1.1 dated 12/01/09 includes fix by C-Quel for processing results from Amazon to match recent change -->
<!-- Version 1.1 also now supports the Writers field -->
<scraper framework="1.0" date="2009-05-22" content="movies" name="Amazon US" thumb="amazonus.png" language="en">
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="&lt;url&gt;http://www.amazon.com/s/ref=nb_ss_d_h_?url=search-alias%3Ddvd&amp;amp;field-keywords=\1&lt;/url&gt;" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;url&gt;\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" clear="yes" noclean="1">productTitle&quot;&gt;&lt;a href=&quot;([^&quot;]*)&quot;&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>
            <expression clear="yes" noclean="1"/>
        </RegExp>
    </GetSearchResults>
    <GetDetails clearbuffers="no" dest="3">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="5">
                <expression noclean="1">&lt;title&gt;[Amazon.com: ]*([^:]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="5+">
                <expression trim="1">[ \[\(]([0-9]{4})[ \]\)][^&lt;]*&lt;/span&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;top250&gt;\1&lt;/top250&gt;" dest="5+">
                <expression>Top 250: #([0-9]*)&lt;/a&gt;</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;mpaa&gt;G&lt;/mpaa&gt;" dest="5+">
                <RegExp input="$$1" output="\1" dest="9">
                    <expression>&lt;b&gt;Rating: &lt;/b&gt;[^_]*/(g)._</expression>
                </RegExp>
                <expression>(g)</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;mpaa&gt;PG&lt;/mpaa&gt;" dest="5+">
                <RegExp input="$$1" output="\1" dest="9">
                    <expression>&lt;b&gt;Rating: &lt;/b&gt;[^_]*/(pg)._</expression>
                </RegExp>
                <expression>(pg)</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;mpaa&gt;PG-13&lt;/mpaa&gt;" dest="5+">
                <RegExp input="$$1" output="\1" dest="9">
                    <expression>&lt;b&gt;Rating: &lt;/b&gt;[^_]*/(pg-13)._</expression>
                </RegExp>
                <expression>(pg-13)</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;mpaa&gt;R&lt;/mpaa&gt;" dest="5+">
                <RegExp input="$$1" output="\1" dest="9">
                    <expression>&lt;b&gt;Rating: &lt;/b&gt;[^_]*/(r)._</expression>
                </RegExp>
                <expression>(r)</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;mpaa&gt;NC-17&lt;/mpaa&gt;" dest="5+">
                <RegExp input="$$1" output="\1" dest="9">
                    <expression>&lt;b&gt;Rating: &lt;/b&gt;[^_]*/(nc-17)._</expression>
                </RegExp>
                <expression>(nc-17)</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;mpaa&gt;UNRATED&lt;/mpaa&gt;" dest="5+">
                <RegExp input="$$1" output="\1" dest="9">
                    <expression>&lt;b&gt;Rating: &lt;/b&gt;[^_]*/(unrated)._</expression>
                </RegExp>
                <expression>(unrated)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;certification&gt;\1&lt;/certification&gt;" dest="5+">
                <expression repeat="yes">Classification:&lt;/b&gt;[^&gt;]*alt=&quot;([0-9]*)&quot;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;tagline&gt;\1&lt;/tagline&gt;" dest="5+">
                <expression>&lt;h5&gt;Tagline:&lt;/h5&gt;([^&lt;]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="5+">
                <expression trim="1">Run Time:&lt;/b&gt;[^0-9]*([^&lt;]*)&lt;/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1.\2&lt;/rating&gt;&lt;votes&gt;\3&lt;/votes&gt;" dest="5+">
                <expression noclean="1">Average Customer Review&lt;/b&gt;[^_]*stars-([0-9])-([0-9])[^)]*&gt;([0-9]*) customer reviews&lt;/a&gt;\)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="5+">
                <expression repeat="yes">&quot;/Sections/Genres/[^/]*/&quot;&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;studio&gt;\1&lt;/studio&gt;" dest="5+">
                <expression>Studio:&lt;/b&gt;  ([^&lt;]*)&lt;/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;outline&gt;\2&lt;/outline&gt;&lt;plot&gt;\2&lt;/plot&gt;" dest="5+">
                <expression trim="1">Plot (Outline|Summary):&lt;/h5&gt;([^&lt;]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5+">
                <expression trim="1">&lt;b&gt;Product Description&lt;/b&gt;&lt;br /[^&gt;]*&gt;([^&lt;]+)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\101.L.jpg&lt;/thumb&gt;" dest="5+">
                <expression noclean="1">&quot;original_image&quot;, &quot;([^&quot;]*)AA2[0-9]0_\.jpg&quot;</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;credits&gt;\1&lt;/credits&gt;" dest="5+">
                <RegExp input="$$1" output="\1" dest="9">
                    <expression noclean="1">&lt;b&gt;Writers:&lt;/b&gt; ([^\n]*&lt;/a&gt;)</expression>
                </RegExp>
                <expression noclean="1" repeat="yes">[^&gt;]*&gt;([^&lt;]+)&lt;/a&gt;</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
                <RegExp input="$$1" output="\1" dest="9">
                    <expression noclean="1">&lt;b&gt;Directors:&lt;/b&gt; ([^\n]*&lt;/a&gt;)</expression>
                </RegExp>
                <expression noclean="1" repeat="yes">[^&gt;]*&gt;([^&lt;]+)&lt;/a&gt;</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;/actor&gt;" dest="5+">
                <RegExp input="$$1" output="\1" dest="9">
                    <expression noclean="1">&lt;b&gt;Actors:&lt;/b&gt; ([^\n]*&lt;/a&gt;)</expression>
                </RegExp>
                <expression noclean="1" repeat="yes">[^&gt;]*&gt;([^&lt;]+)&lt;/a&gt;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>
</scraper>



- GregK - 2010-01-11 08:15

Update:

Very, very strange.
I downloaded Plex on my Mac and used it to do a scrape on Amazon US, and it worked. So I opened the package contents and copied the amazonus.xml file from my Mac to my XBMC installation on my Ubuntu machine (Karmic Koala 9.10), and there it fails.

I am really confused now.

Any advice?

Thanks in advance.....


- mkortstiege - 2010-01-11 09:22

Please file a new bug report (xbmc.org/trac).


- GregK - 2010-01-11 20:42

Bug posted:

http://trac.xbmc.org/ticket/8466


- jelockwood - 2010-01-24 20:35

Damn. I am the person who got the current version working. By the way, XBMC does still include a copy of the Amazon scrapers as standard. It is (or was) identical to the Plex version, since I sent copies to both teams.

I have just finished working on it with Plex to fix some lesser problems, and it now seems to work reasonably well there. However, I just tried it in XBMC on a Mac and see the same problem as you: it does not find any movies at all, not even if you edit the movie title to make a match more likely.

Unfortunately the part of a scraper to do with finding and processing a list of movies is the bit I find hardest to understand - most of my fixes involve getting information for an individual movie assuming the prior bit to find the movie has already worked.

By the way, the same problem seems to apply to the Amazon.co.uk scraper.

I wonder if it is something as basic as the User-Agent that XBMC is presenting to Amazon, and Amazon now rejecting connections from some types of client? It could be that Plex sends a different User-Agent.

If anyone else is willing and able to look in to fixing the part to get a working list of movies, I can then provide my changes for fixing User Votes, Movie Ratings, Plot, etc.

C-Quel was very helpful last time a similar problem happened.


- jelockwood - 2010-01-27 01:10

I have done some testing. Although XBMC sends the same search URL as before, and the same one Plex (for Mac) sends with the same scraper, Amazon is returning a results page in a totally different format, so different that my scrapers fail to understand it.
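A minimal sketch of why a changed page layout produces the empty <results></results> seen in the debug log. The expression is the search-result one from the amazonus.xml posted at the start of the thread, with the XML entities decoded. Both HTML snippets are made up for illustration (they are not captured Amazon output), and Python's re module stands in for XBMC's own regular expression engine.

```python
import re

# Search-result expression from the posted amazonus.xml (entities decoded).
RESULT = re.compile(r'productTitle"><a href="([^"]*)">([^<]*)</a>')

XML_HEADER = '<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>'


def get_search_results(html):
    """Mimic GetSearchResults: one <entity> per match, wrapped in <results>."""
    entities = "".join(
        f"<entity><title>{title}</title><url>{url}</url></entity>"
        for url, title in RESULT.findall(html)
    )
    return f"{XML_HEADER}<results>{entities}</results>"


# Hypothetical markup in the layout the scraper expects.
expected_layout = (
    '<div class="productTitle"><a href="http://www.amazon.com/dp/EXAMPLE">'
    "Bear in the Big Blue House</a></div>"
)

# Hypothetical markup in some other layout: same data, different tags.
other_layout = (
    '<h3 class="title"><a class="title" href="http://www.amazon.com/dp/EXAMPLE">'
    "Bear in the Big Blue House</a></h3>"
)

print(get_search_results(expected_layout))
print(get_search_results(other_layout))
```

With the second snippet the expression matches nothing, so the output is the bare header followed by <results></results>, which is the same shape as the line logged by XBMC.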

I could in theory write a completely separate version for XBMC, but supporting one set is hard enough, let alone two. Clearly XBMC in conjunction with Amazon is producing different results, probably, as I suggested, because of the different User-Agent being used. My tests with a web browser and a changed User-Agent seem to confirm this.

I tried changing the URL in the scraper to 'spoof' the User-Agent, but I must be getting something wrong as it still did not work. Could someone more knowledgeable have a look and suggest the correct format?
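For anyone who wants to repeat this kind of User-Agent comparison outside a browser, here is a rough sketch using Python's standard library. The search URL is the one from the debug log earlier in the thread. The two User-Agent strings are placeholders, not the values XBMC or Plex actually send, and the 'productTitle"' marker is simply the text the posted scraper looks for.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Search URL prefix taken from the debug log at the start of the thread.
SEARCH_BASE = (
    "http://www.amazon.com/s/ref=nb_ss_d_h_"
    "?url=search-alias%3Ddvd&field-keywords="
)

# Placeholder User-Agent strings (assumptions, substitute the real ones).
USER_AGENTS = {
    "media-center": "ExampleMediaCenter/1.0",
    "browser": "Mozilla/5.0 (X11; Linux i686)",
}


def build_request(title, user_agent):
    """Build the search request for a title with an explicit User-Agent."""
    return Request(SEARCH_BASE + quote(title), headers={"User-Agent": user_agent})


def has_expected_markup(title, user_agent, marker='productTitle"'):
    """Fetch the results page and report whether the marker text is present.

    This performs a real network request when called.
    """
    with urlopen(build_request(title, user_agent), timeout=30) as response:
        html = response.read().decode("iso-8859-1", errors="replace")
    return marker in html
```

Calling has_expected_markup("bear in the big blue house", agent) once per entry in USER_AGENTS and comparing the answers would show whether the layout depends on the User-Agent. What the site returns today may of course differ from what it returned when this thread was written.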


- mkortstiege - 2010-01-27 09:49

Mind adding your latest findings to the ticket (http://trac.xbmc.org/ticket/8466)?


- jelockwood - 2010-01-27 12:58

vdrfan Wrote:Mind adding your latest findings to the ticket (http://trac.xbmc.org/ticket/8466)?

Updated ticket as requested. I included my (failed) attempt to add a User-Agent spoof.

Note: it is not the User-Agent I tried that was wrong, it is that XBMC and Amazon are not treating it as a User-Agent. This is almost certainly due to me getting the syntax wrong. Hence my plea for help.

--------
Good news!
I have got the User-Agent override working. I am now testing corrections to scrape some fields that have been broken by changes at Amazon.


Amazon scrapers now fixed - jelockwood - 2010-02-01 00:59

Recently I came across some reports of various issues with the Amazon.com and Amazon.co.uk scrapers I wrote.

These included -
  • No user rating and votes count
  • No MPAA (or BBFC) rating/certification
  • No plot
  • Some movie titles missing the initial "A" or "An", e.g. "A Bridge Too Far" became "Bridge Too Far"
I myself found -
  • That a film title containing a : (colon) was shortened, and
  • That it was potentially possible for a movie with no actors listed by Amazon to list the director incorrectly as an actor
I have addressed all these issues, but as no one responded to my requests for a list of problematic films, it is possible that specific cases are still not fixed.

The new fixed versions can be downloaded via http://homepage.mac.com/jelockwood/.Public/amazon-xbmc.zip
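Both title problems listed above are consistent with the title expression in the amazonus.xml posted at the start of the thread. [Amazon.com: ]* is a character class, not a literal prefix, so it also swallows a leading "A " or "An ", and ([^:]*) stops at the first colon. The sketch below reproduces this with Python's re module. The page titles, including the ": Movies &amp; TV" suffix, are assumed for illustration, and NEW_TITLE is a hypothetical alternative, not the expression used in the released fix.

```python
import re

# Title expression from the posted amazonus.xml (entities decoded).
OLD_TITLE = re.compile(r"<title>[Amazon.com: ]*([^:]*)")

# Hypothetical alternative: literal prefix, assumed literal suffix,
# and a lazy capture that is allowed to contain colons.
NEW_TITLE = re.compile(
    r"<title>(?:Amazon\.com: )?(.*?)(?:: Movies &amp; TV)?</title>"
)


def extract(pattern, html):
    """Return the first captured title, or None when nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else None


# Assumed page titles; the real Amazon format may differ.
samples = [
    "<title>Amazon.com: A Bridge Too Far: Movies &amp; TV</title>",
    "<title>Amazon.com: Kung Fu: The Legend Continues: Movies &amp; TV</title>",
]

for html in samples:
    print(repr(extract(OLD_TITLE, html)), "->", repr(extract(NEW_TITLE, html)))
```

With the old expression the first sample loses its leading "A" and the second is cut at the colon, while the alternative keeps both titles whole under the assumed suffix.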

Note: These are no longer identical to the Plex version. For the XBMC version I also had to include a User-Agent 'spoof' to get round Amazon returning totally differently formatted pages to XBMC only.


- Trenton_net - 2010-10-18 06:39

Hi Everyone,

I was trying to use the fix posted above with XBMC for Ubuntu (the version released for Lucid), and it seems I still can't scrape anything from Amazon US. Does anyone know how to fix this, or what the problem might be? Did they change the formatting again?

UPDATE:

If I use the scraper files from this link it seems to work slightly better:

http://homepage.mac.com/jelockwood/scrapers.html

When I use those scraper files, I get results back, but only a few and only with simple keywords. For example, if I want to find the movie "Anna in Kung Fu Land", Amazon.com shows it just fine. In XBMC, nothing shows up. If I type in just "Kung Fu" I get a few hits like "Kung Fu: The Legend Continues", etc., but not "Anna in Kung Fu Land".

The same thing happens with movies like "Butterfly and Sword". On Amazon.com the movie shows up fine. In XBMC, nothing shows up, and if I use simpler terms like "Butterfly" or "Sword" I get a few related hits like "The Butterfly Effect", but no exact hit for "Butterfly and Sword".

Perhaps someone who has the Amazon.com scraper working and has experience with scraper development could try those search terms and see what's wrong with them? Or perhaps I'm not using the latest version of the Amazon.com scraper?


- jelockwood - 2010-10-21 17:48

The version included with XBMC 9.11 is way out of date. Amazon also keeps changing its HTML, which unfortunately breaks these scrapers. I am quite busy at the moment, but now that I know there are issues I will try to get round to having a look and updating it (again). Either keep watching this thread, or PM me in a few days to see how things are going.


- olympia - 2010-10-21 17:59

Could you please help me understand what amazon.com offers over imdb.com?

I've found all these movies on IMDb. Is there anything special that is only available on amazon.com and not on IMDb?

I'm just trying to understand the use case.


- Trenton_net - 2010-10-24 11:10

Ah, well, Amazon.com tends to have more complete information since it's a commercial website. That is, for more obscure international titles, they tend to document the item faster and more accurately with their own product scans and information.

For Chinese/HK movies (which is my use case), you'll usually find cover art and information almost immediately once the title breaks for international release. Other sources like IMDb will usually have information as well, but that's (I think) all user-submitted, and there is no financial motivation for them to get cover art, movie information, etc. unless someone submits them.


- jelockwood - 2010-10-25 18:36

For movies etc. IMDb is indeed normally the best choice. However, Amazon is a better choice for non-movie titles, e.g. DVDs of live comedy shows, music concerts, documentaries, some children's titles, and some other esoteric DVDs. A lot of these would not count as TV shows, so TheTVDB.com does not help either.

I did not get a chance to work on it this weekend, but I will try to do so soon.


- olympia - 2010-10-27 15:56

I fixed the scraper for amazon.com, but the quality of the images is crap and the amount of available information is extremely low.

I'm still not sure why anyone would want to use this.

Could you please test and confirm whether this is what you wanted?