Kodi Community Forum

Full Version: MovieMeter.nl (Dutch Movies) Scraper development...
good point
Has this scraper made any progress? :) I'd still love my movie information in Dutch.
Sadly I can't report much progress. As I said in my latest post in this thread, I think I have the basic functionality done but need to test it in XBMC. I just haven't found the time these last few weeks to do this. I'm still determined to finish the scraper and add support for impawards, movieposterdb, fanart and additional IMDb info. But sadly I can't give you a timeframe for all this. I'm working on it in the (very) little spare time that I have.
Trazer Wrote: Sadly I can't report much progress. As I said in my latest post in this thread, I think I have the basic functionality done but need to test it in XBMC. I just haven't found the time these last few weeks to do this. I'm still determined to finish the scraper and add support for impawards, movieposterdb, fanart and additional IMDb info. But sadly I can't give you a timeframe for all this. I'm working on it in the (very) little spare time that I have.

Sounds very ambitious :) I have plenty of time; it's just nice to hear you're planning on finishing it in the future. Hopefully someone pops up to offer you help and speed everything up a bit; there are plenty of Dutch/Flemish XBMC users out there.
Missaar Wrote: Sounds very ambitious :) I have plenty of time; it's just nice to hear you're planning on finishing it in the future. Hopefully someone pops up to offer you help and speed everything up a bit; there are plenty of Dutch/Flemish XBMC users out there.

True!

Maybe you can ask for help in the Dutch community topic. :)
Finally had some time to fiddle with the scraper. It seems moviemeter.nl changed something on the main page, so I had to change the regexp for retrieving the hashcode. At least that part is working again.
I have a question about chaining for the GetSearchResults function. Does scrap.exe support this, or can I only test this using XBMC? My guess is that it's the latter. Am I guessing right?
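For anyone wondering what "retrieving the hashcode" involves, here is a minimal sketch in Python (not scraper XML). It assumes the main page embeds the hash in a `search.php?hash=...` URL, as in the quicksearch JavaScript quoted later in this thread, and that the hash is always 32 hex characters. The sample page is abbreviated from that snippet; `extract_search_hash` is a name made up for this illustration.

```python
import re

# Abbreviated copy of the quicksearch snippet quoted later in this thread.
# Assumption: the live main page embeds the hash in the same way.
SAMPLE_PAGE = """
if ($('quicksearch')) {
    new Searcher.Ajax.Json('quicksearch', 'http://www.moviemeter.nl/calls/search.php?hash=29918b11647fdd3755d59e6ac45d4977&qs=1', {
        'postVar': 'search'
    });
}
"""

# Assumption: the hash is always 32 lowercase hex characters.
HASH_RE = re.compile(r"search\.php\?hash=([0-9a-f]{32})")


def extract_search_hash(html):
    """Return the hash embedded in the page, or None if it is not found."""
    match = HASH_RE.search(html)
    return match.group(1) if match else None


print(extract_search_hash(SAMPLE_PAGE))
```

Anchoring on `search.php?hash=` rather than on the surrounding JavaScript should make the pattern a bit less sensitive to layout changes on the main page, though any site change can still break it.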
yeah, scrap is totally deprecated as we lost the source code :/
Wow, that was a fast reply. Thanks, much appreciated.

Time for me to set up XBMC for Windows to test the scraper properly and expand its functionality.
Where can I find the MovieMeter scraper?
It's still under construction. ;)
Hi, I made a PHP script that hopefully someone can translate into a working scraper file.

If you need more info, please reply.

PHP Code:
<?php
/*
if ($('quicksearch')) {
               new Searcher.Ajax.Json('quicksearch', 'http://www.moviemeter.nl/calls/search.php?hash=29918b11647fdd3755d59e6ac45d4977&qs=1', {
                       'postVar': 'search',
                       'quicksearch': true,
                       'maxChoices': 12,
                       'overflow':true,
                       'basic':true
                   });
}
*/

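$term = 'jurassic park';
// the hash comes from the quicksearch snippet above; urlencode() keeps the space in the term from breaking the URL
$url = 'http://www.moviemeter.nl/calls/search.php?hash=29918b11647fdd3755d59e6ac45d4977&qs=1&search=' . urlencode($term);

$str = file_get_contents($url);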

//json response example for search "jurassic park"
//$str = '["header_films_0_3",{"i":"365","ty":"f","t":"Jurassic Park","a":"","y":"1993","img":"%3Cimg src%3D%22http%3A%2F%2Fwww.moviemeter.nl%2Fimages%2Fcovers%2Fthumbs%2F0%2F365.jpg%22 class%3D%22thumbnail%22 alt%3D%22Jurassic Park %281993%29%22 %2F%3E","px":75,"h":"%3Cp class%3D%22subtext%22%3EAvontuur %2F Science-Fiction%2C 127 minuten%3Cbr %2F%3Egeregisseerd door Steven Spielberg%3Cbr %2F%3Emet Sam Neill%2C Jeff Goldblum en Laura Dern%3Cbr %2F%3E%3C%2Fp%3E"},{"i":"341","ty":"f","t":"Jurassic Park III","a":"Jurassic Park 3","y":"2001","img":"%3Cimg src%3D%22http%3A%2F%2Fwww.moviemeter.nl%2Fimages%2Fcovers%2Fthumbs%2F0%2F341.jpg%22 class%3D%22thumbnail%22 alt%3D%22Jurassic Park III %282001%29%22 %2F%3E","px":75,"h":"%3Cp class%3D%22subtext%22%3EScience-Fiction %2F Actie%2C 92 minuten%3Cbr %2F%3Egeregisseerd door Joe Johnston%3Cbr %2F%3Emet Sam Neill%2C William H. Macy en T%E9a Leoni%3Cbr %2F%3E%3C%2Fp%3E"},{"i":"364","ty":"f","t":"Lost World%3A Jurassic Park%2C The","a":"Jurassic Park 2","y":"1997","img":"%3Cimg src%3D%22http%3A%2F%2Fwww.moviemeter.nl%2Fimages%2Fcovers%2Fthumbs%2F0%2F364.jpg%22 class%3D%22thumbnail%22 alt%3D%22Lost World%3A Jurassic Park%2C The %281997%29%22 %2F%3E","px":75,"h":"%3Cp class%3D%22subtext%22%3EScience-Fiction %2F Avontuur%2C 129 minuten%3Cbr %2F%3Egeregisseerd door Steven Spielberg%3Cbr %2F%3Emet Jeff Goldblum%2C Julianne Moore en Vince Vaughn%3Cbr %2F%3E%3C%2Fp%3E"},"header_directors_0_0","header_topics_0_2",{"i":"1424","t":"Jurassic Park 4 %28Film %3E Nieuws%29","ty":"t"},{"i":"5679","t":"Favoriete dino uit de Jurassic Park reeks %28Film %3E Toplijsten en favorieten%29","ty":"t"},"header_users_0_2",{"i":"37185","t":"JurassicPark","ty":"u","img":"%3Cimg src%3D%22http%3A%2F%2Fwww.moviemeter.nl%2Fimages%2Fuser_unknown.jpg%22 class%3D%22avatar%22 %2F%3E","px":54,"h":"%3Cp class%3D%22subtext%22%3Eingeschreven sinds 15 augustus 2006%3Cbr %2F%3E632 stemmen%2C 509 berichten%3C%2Fp%3E"},{"i":"1720","t":"Jurassic Smurf","ty":"u"}]';

//echo $str;
//echo '<hr />';
//i = id, t = movie title, y = movie year
preg_match_all('|"i":"(.*)".*"t":"(.*)".*"y":"(.*)".*|iUm', $str, $ids);

$detail_urls = array();
if (!empty($ids[1])) {
    foreach ($ids[1] as $id) {
        array_push($detail_urls, 'http://www.moviemeter.nl/film/' . $id);
    }
}

//echo 'matches<pre>';
//print_r($detail_urls);
//echo '</pre>';

//parse a detail url
$contents = file_get_contents('http://www.moviemeter.nl/film/364');
// strip all line breaks so the pattern below can match across the whole block
$contents = str_replace(array("\r\n", "\r", "\n"), '', $contents);

preg_match_all('|.*<div id="film_info">(.*)<br />(.*)<br />(.*)<br />(.*)<br />(.*)<br />(.*)<br />(.*)<br />(.*)<br />.*</div>.*|iUm', $contents, $movie_info);

echo 'country:' . $movie_info[1][0];
echo '<br />';
echo 'genre(s):' . $movie_info[2][0];
echo '<br />';
echo 'movie length:' . $movie_info[3][0];
echo '<br />';
echo 'director:' . $movie_info[5][0];
echo '<br />';
echo 'actors:' . $movie_info[6][0];
echo '<br />';
echo 'movie info:' . $movie_info[8][0];
echo '<br />';
?>
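Since the search response in the script above looks like a JSON array, a JSON parser may be more robust than a regex over the raw string. The sketch below is in Python rather than PHP and rests on two assumptions drawn only from the sample response: entries with `"ty":"f"` are films, and the percent-encoded text is Latin-1 (the sample contains `%E9`). The sample is abbreviated (the `img` and `h` fields are dropped) and `film_results` is a hypothetical helper name.

```python
import json
from urllib.parse import unquote

# Abbreviated from the sample response quoted in the script above
# (the "img" and "h" fields are left out for brevity).
SAMPLE = (
    '["header_films_0_3",'
    '{"i":"365","ty":"f","t":"Jurassic Park","a":"","y":"1993"},'
    '{"i":"364","ty":"f","t":"Lost World%3A Jurassic Park%2C The",'
    '"a":"Jurassic Park 2","y":"1997"},'
    '"header_topics_0_2",'
    '{"i":"1424","t":"Jurassic Park 4 %28Film %3E Nieuws%29","ty":"t"}]'
)


def film_results(raw):
    """Return id, title, year and detail URL for every film entry."""
    results = []
    for entry in json.loads(raw):
        # Skip the header strings and the non-film entries (topics, users).
        # Assumption: "ty" == "f" marks a film.
        if not isinstance(entry, dict) or entry.get("ty") != "f":
            continue
        results.append({
            "id": entry["i"],
            # Assumption: titles are percent-encoded Latin-1.
            "title": unquote(entry["t"], encoding="latin-1"),
            "year": entry.get("y", ""),
            "url": "http://www.moviemeter.nl/film/" + entry["i"],
        })
    return results


for film in film_results(SAMPLE):
    print(film["id"], film["title"], film["year"], film["url"])
```

Filtering on the entry type also keeps topic and user ids out of the list of detail URLs.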
Currently I have something working to get the movie description from MovieMeter, but I am having a problem with the following expression:

<expression>&lt;div id=&quot;film_info&quot;&gt;(.*)[^&lt;div]</expression>

source string = 'fdqskfdq<div id="film_info">MOVIE CONTENT THAT I NEED<div>fdqsfqds</div></div>fsdjlk';

I am trying to get all the data between <div id="film_info"> and its closing </div>, but I get a lot more (including the comments); see for example http://www.moviemeter.nl/film/365

Does anyone know the right regex for this?
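One likely cause: `[^<div]` is a character class, so it matches any single character other than `<`, `d`, `i` or `v`; it does not mean "not followed by `<div`". Combined with the greedy `(.*)`, the match runs far past the wanted block. A minimal sketch using a lazy quantifier, written in Python and tested only against the sample string from the post; whether the live page has the same structure, and whether the scraper's regex engine treats `.*?` the same way, are assumptions.

```python
import re

# The sample string from the post above.
source = ('fdqskfdq<div id="film_info">MOVIE CONTENT THAT I NEED'
          '<div>fdqsfqds</div></div>fsdjlk')

# (.*?) is lazy: it stops at the first following "<div" instead of
# the last one. DOTALL lets "." match line breaks in the real page.
FILM_INFO_RE = re.compile(r'<div id="film_info">(.*?)<div', re.DOTALL)


def film_info(html):
    """Return the text between the film_info div and the next <div, or None."""
    match = FILM_INFO_RE.search(html)
    return match.group(1) if match else None


print(film_info(source))
```

In a scraper file the same pattern would need its `<`, `>` and `"` characters written as XML entities, as in the expression above.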
I have some working code for MovieMeter:

Code:
<scraper name="Moviemeter" content="movies" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<!-- By fjskmdl 2 jan 2009 -->
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="http://www.moviemeter.nl/calls/search.php?hash=3d669ba0d93914426945f6985e135be6&amp;qs=1&amp;search=\1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3&lt;/title&gt;&lt;url&gt;http://www.moviemeter.nl/film/\2&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">({&quot;i&quot;:&quot;([0-9]+)&quot;,&quot;ty&quot;:&quot;[a-z]*&quot;,&quot;t&quot;:&quot;(.[^&quot;]*).[^}])</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>
  <GetDetails dest="3">
    <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
        <!-- title,year -->
        <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;&lt;year&gt;\2&lt;/year&gt;" dest="8">
            <expression trim="1" noclean="1">&lt;h1&gt;([^\(]*)\(([^\(]*)</expression>
        </RegExp>
        <!--Director-->
        <RegExp input="$$1" output="&lt;director&gt;\2&lt;/director&gt;" dest="8+">
            <expression repeat="yes">geregisseerd door ([^&gt;]*)&gt;([^&lt;]*)</expression>
        </RegExp>
        <!--Actors -->
        <RegExp input="$$1" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;&lt;/role&gt;&lt;/actor&gt;" dest="8+">
            <expression>met ([^&lt;]*)</expression>
        </RegExp>

        <!-- Runtime -->
        <RegExp input="$$1" output="&lt;runtime&gt;\1 minuten&lt;/runtime&gt;" dest="8+">
            <expression repeat="yes">([0-9]+) minuten</expression>
        </RegExp>
        <!-- Thumbnail -->
        <RegExp input="$$1" output="&lt;thumb&gt;&lt;url spoof=&quot;http://www.moviemeter.nl&quot;&gt;http://www.moviemeter.nl/images/covers/\1/\2.jpg&lt;/url&gt;&lt;/thumb&gt;" dest="8+">
            <expression>http://www.moviemeter.nl/images/covers/([0-9]+)/([0-9]+)\.jpg</expression>
        </RegExp>

        <!--rating -->
        <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
            <expression>gemiddelde &lt;b&gt;([0-9,]+)([^&lt;]*)&lt;/b&gt;</expression>
        </RegExp>

        <!-- nr votes -->
        <RegExp input="$$1" output="&lt;votes&gt;\1&lt;/votes&gt;" dest="8+">
            <expression>&lt;b&gt;([0-9]+)&lt;/b&gt; stemmen</expression>
        </RegExp>

        <!-- genre -->
        <RegExp input="$$1" output="&lt;genre&gt;\2&lt;/genre&gt;" dest="8+">
            <expression>film_info&quot;&gt;([^&lt;]*)&lt;br /&gt;([^&lt;]*)</expression>
        </RegExp>
        <!-- Plot -->
        <RegExp input="$$1" output="&lt;plot&gt;\7&lt;/plot&gt;" dest="8+">
            <expression repeat="yes">&lt;div id=&quot;film_info&quot;&gt;([^&lt;]*)&lt;br /&gt;([^&lt;]*)&lt;br /&gt;([^&lt;]*)&lt;br /&gt;&lt;br /&gt;geregisseerd door &lt;a href=&quot;http://www\.moviemeter\.nl/director/([0-9]+)&quot;([^&lt;]*)&lt;/a&gt;&lt;br /&gt;([^&lt;]*)&lt;br /&gt;&lt;br /&gt;([^&lt;]*)</expression>
        </RegExp>
        <expression noclean="1"/>
        </RegExp>
    </GetDetails>
</scraper>

Bugs: when the rating is 2,98 it shows 2.00
Cast: all persons are shown on one line
Yay! I see you figured out the scraper syntax :)

The problem with the comma-separated numbers is that they are simply not valid floating-point numbers (they are parsed as %f in an sscanf-like function, if that tells you anything). You need to translate them to use a dot.
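To illustrate why 2,98 ends up as 2.00: a `%f`-style parse reads the longest leading number and stops at the first character that cannot belong to it, which here is the comma. The Python sketch below only emulates that behaviour roughly with a regex (no exponent support); it is not XBMC's actual code, and both function names are made up for the example.

```python
import re


def parse_like_scanf(text):
    """Rough emulation of sscanf("%f"): parse the leading number only."""
    match = re.match(r"\s*[+-]?(\d+(\.\d*)?|\.\d+)", text)
    return float(match.group(0)) if match else None


def normalize_decimal(text):
    """Turn a decimal comma into a decimal point."""
    return text.replace(",", ".")


print(parse_like_scanf("2,98"))                     # parsing stops at the comma
print(parse_like_scanf(normalize_decimal("2,98")))  # the full value is read
```

In the scraper itself the same effect would have to come from the regular expressions, for instance by capturing the digits before and after the comma separately and writing them out with a dot in between.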

The cast problem is just the expression: there is no repeat on it (but I assume you knew that).
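As a sketch of the splitting that a repeating expression would need to do, here is the idea in Python. It assumes the cast line has the form seen in the sample search response earlier in the thread ("met Sam Neill, Jeff Goldblum en Laura Dern"), with names separated by commas and a final "en"; `split_cast` is a hypothetical helper.

```python
import re


def split_cast(line):
    """Split a Dutch cast line into a list of individual names."""
    names = line.strip()
    # Drop the leading "met " ("with") if present.
    if names.startswith("met "):
        names = names[len("met "):]
    # Names are separated by commas, with " en " ("and") before the last.
    parts = re.split(r",\s*|\s+en\s+", names)
    return [part.strip() for part in parts if part.strip()]


print(split_cast("met Sam Neill, Jeff Goldblum en Laura Dern"))
```

Each resulting name would then become its own actor entry instead of one long line.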
Hi,

This scraper is going to stop working soon because of some changes in the HTML of the site I'm going to make. However, I'm creating an XML-RPC API (web service) for accessing the MovieMeter.nl film information. Would it be possible for you to change your scripts so this API is used instead of scraping the HTML? If someone wants to test using this API, please contact me at [email protected]
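For readers unfamiliar with XML-RPC, here is a rough idea of what a call looks like on the wire, built with Python's standard `xmlrpc.client` module. The method name `film.search` and its single search-term parameter are purely hypothetical: the post above does not describe the API's endpoint, methods or parameters.

```python
import xmlrpc.client


def build_search_request(term):
    """Serialize an XML-RPC call with one string parameter.

    "film.search" is a made-up method name used only for illustration;
    the real API may look completely different.
    """
    return xmlrpc.client.dumps((term,), methodname="film.search")


request_xml = build_search_request("jurassic park")
print(request_xml)

# Decoding the request again shows the structure a server would see.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

The point is that the reply comes back as structured values rather than HTML, so no regular expressions are needed on the client side.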