Rotten Tomatoes Scraper
Okay, I've got a pretty good skeleton of a scraper going here, but no fanart or thumbs yet. I'm slightly at a loss as to how to tackle that, since other scrapers look like they use the IMDb ID to fetch artwork. I can do the reverse (i.e., look up an RT movie from its IMDb ID with http://www.rottentomatoes.com/alias?type=imdbid&s=[imdb id #]), but RT movie pages don't contain the IMDb ID or any way to look it up from within the site. So should I:

1) rewrite the scraper to look up the IMDb ID with CreateSearchUrl and call all the RT stuff with functions?

or

2) look up the IMDb title with a call to Google or the like? I think PTGate looks up the IMDb ID like this if the movie's page doesn't include it.

Both methods seem less than ideal, but they're the only options I can think of right now.
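
To make the reverse lookup mentioned above concrete, here is a minimal Python sketch that only builds the alias URL quoted earlier. The function name `rt_alias_url` is made up for illustration, and the ID is passed through exactly as given; whether RT expects the "tt" prefix is an assumption that would need checking.

```python
from urllib.parse import urlencode

# Alias endpoint quoted in the post; everything else here is illustrative.
RT_ALIAS_URL = "http://www.rottentomatoes.com/alias"


def rt_alias_url(imdb_id: str) -> str:
    """Build the RT alias URL for an IMDb ID (hypothetical helper).

    The ID is used as-is; stripping or adding a "tt" prefix is left
    to the caller because the expected format is not confirmed.
    """
    query = urlencode({"type": "imdbid", "s": imdb_id})
    return f"{RT_ALIAS_URL}?{query}"


print(rt_alias_url("0111161"))
```

This only covers the IMDb-to-RT direction; going from an RT page back to an IMDb ID would still need one of the two options above.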

Here's what I've done so far anyway:


Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper name="Rotten Tomatoes 0.5" date="2009-08-05" content="movies" framework="1.0" thumb="rottentomatoes.png" language="">
  <GetSettings dest="3">
    <RegExp input="$$5" output="&lt;settings&gt;\1&lt;/settings&gt;" dest="3">
      <RegExp input="$$1" output="&lt;setting label=&quot;Location&quot; type=&quot;labelenum&quot; values=&quot;us|au|uk&quot; id=&quot;locality&quot; default=&quot;au&quot;&gt;&lt;/setting&gt;" dest="5">
        <expression></expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;setting label=&quot;Retrieve Classification Reason&quot; type=&quot;bool&quot; id=&quot;classreason&quot; default=&quot;false&quot;&gt;&lt;/setting&gt;" dest="5+">
        <expression></expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetSettings>
  <NfoUrl dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1">(http://$INFO[locality]\.rottentomatoes\.com/m/[A-Za-z0-9_]*)</expression>
    </RegExp>
  </NfoUrl>
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://$INFO[locality].rottentomatoes.com/search/full_search.php?search=\1" dest="3">
      <expression noclean="1"></expression>
    </RegExp>
  </CreateSearchUrl>
  <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2 (\3)&lt;/title&gt;&lt;url&gt;http://$INFO[locality].rottentomatoes.com/m/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">&lt;a href=&quot;/m/([^&quot;]*)&quot;&gt;([^&lt;]*).*?([0-9]{4})</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetSearchResults>
  <GetDetails dest="3">  
    <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
      <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;&lt;originaltitle&gt;\1&lt;/originaltitle&gt;&lt;year&gt;\2&lt;/year&gt;" dest="8">
        <expression noclean="1" trim="1">&lt;h1.class=&quot;movie_title clearfix&quot;&gt;([\S\s]*)\(([0-9]{4})\)&lt;/h1&gt;[\S\s]*dialog_content clearfix</expression>
      </RegExp>
      
      <RegExp input="$$7" output="&lt;director&gt;\1&lt;/director&gt;" dest="8+">
        <RegExp input="$$1" output="\1" dest="7">
          <expression noclean="1">&lt;p class=&quot;movie_crew_shortened[\S\s]*Director:([\S\s]*)movie_crew_all</expression>
        </RegExp>
        <expression repeat="yes" noclean="1">&lt;a.href=&quot;[^&gt;]*&gt;([A-Za-z ]*)</expression>
      </RegExp>
      <RegExp conditional="!classreason" input="$$1" output="&lt;mpaa&gt;\1&lt;/mpaa&gt;" dest="8+">
        <expression>&lt;div id=&quot;movie_stats&quot;&gt;[\S\s]*&lt;span class=&quot;content&quot;&gt;([^&lt;]*)[\S\s]*\[See.Full.Rating\]</expression>
      </RegExp>
      <RegExp conditional="classreason" input="$$1" output="&lt;mpaa&gt;\1 \2&lt;/mpaa&gt;" dest="8+">
        <expression>&lt;div id=&quot;movie_stats&quot;&gt;[\S\s]*&lt;span class=&quot;content&quot;&gt;([^&lt;]*)[\S\s]*\[See.Full.Rating\][\S\s]*movie_rating_reason&quot;.style=&quot;display:.none&quot;&gt;([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="8+">
        <expression>Runtime:[^0-9]*([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;thumb&gt;&lt;url spoof=&quot;http://www.culturalianet.com&quot;&gt;http://www.culturalianet.com/imatges/articulos/\1-1.jpg&lt;/url&gt;&lt;/thumb&gt;" dest="8+">
        <expression>imatges/articulos/([0-9]*)-</expression>
      </RegExp>
      <RegExp input="$$7" output="&lt;credits&gt;\1&lt;/credits&gt;" dest="8+">
        <RegExp input="$$1" output="\1" dest="7">
          <expression noclean="1">class=&quot;label&quot;&gt;Screenwriter:&lt;/span&gt;([\S\s]*)Story:&lt;/span&gt;</expression>
        </RegExp>
        <expression repeat="yes" noclean="1">&lt;a.href=&quot;[^&gt;]*&gt;([A-Za-z ]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
        <expression>&lt;li class=&quot;ui-tabs-selected&quot;&gt;&lt;a title=&quot;([0-9]{2,3})</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;votes&gt;\1&lt;/votes&gt;" dest="8+">
        <expression>&lt;p&gt;Reviews Counted: ([0-9]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
        <expression noclean="1">&lt;span.class=&quot;label&quot;&gt;Genre:&lt;/span&gt;.&lt;span class=&quot;content&quot;&gt;&lt;a.href=&quot;/movie/browser.php\?genre=[0-9]*&quot;&gt;([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;&lt;/role&gt;&lt;/actor&gt;" dest="8+">
        <RegExp input="$$1" output="\1" dest="7">
          <expression noclean="1">&lt;span class=&quot;label&quot;&gt;Starring:([\S\s]*)&lt;p class=&quot;movie_cast_all&quot;</expression>
        </RegExp>
        <expression repeat="yes" noclean="1">&lt;a.href=&quot;[^&gt;]*&gt;([A-Za-z ]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
        <expression>&lt;span id=&quot;movie_synopsis_all&quot; style=&quot;display: none;&quot;&gt;([\S\s]*)&lt;a href=&quot;#&quot; id=&quot;movie_synopsis_link</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetDetails>
</scraper>
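
For anyone wanting to try the GetSearchResults expression outside XBMC, here is a rough Python sketch that applies the same pattern (with the XML entities decoded) to a small HTML snippet. The sample markup and the `parse_results` helper are invented for this example, and the real RT search page may well look different. Also note that Python's `.` does not match newlines by default, which may not be how the scraper's regex engine behaves.

```python
import re

# The search-result expression from GetSearchResults, with XML entities decoded.
RESULT_RE = re.compile(r'<a href="/m/([^"]*)">([^<]*).*?([0-9]{4})')

# Hypothetical markup, invented for this example; the real page may differ.
SAMPLE_HTML = """
<li><a href="/m/matrix/">The Matrix</a> <span>1999</span></li>
<li><a href="/m/matrix_reloaded/">The Matrix Reloaded</a> <span>2003</span></li>
"""


def parse_results(html, locality="au"):
    """Return (title, url) pairs shaped like the scraper's <entity> output."""
    results = []
    for slug, title, year in RESULT_RE.findall(html):
        # Mirrors the output template: "\2 (\3)" for the title, "/m/\1" for the URL.
        url = f"http://{locality}.rottentomatoes.com/m/{slug}"
        results.append((f"{title} ({year})", url))
    return results


for title, url in parse_results(SAMPLE_HTML):
    print(title, url)
```

The `locality` default of "au" matches the default in GetSettings; swapping in "us" or "uk" changes only the host name.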