moviemaze.de scraper development - help needed
#1
Question 
Hi there,

I'm currently developing a scraper for http://www.moviemaze.de. I've managed to scrape most of the information the site provides, but I'm having trouble with the following issues:

Director:

Moviemaze.de HTML
[HTML] <td valign=top class="standard">
<span class="fett">Regie:</span>
</td>

<td valign=top class="standard_justify">
Guillermo Del Toro </td>
</tr>
[/HTML]

my regex:
Code:
<RegExp input="$$6" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="6">
    <expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>
    </RegExp>
    <expression>([A-Za-z0-9 ,.]+)</expression>
</RegExp>

scrap.exe delivers the right results, but XBMC displays "TRIM" as the result.
I fetch the result in two steps because the name is surrounded by tabs.
Any ideas why XBMC displays TRIM?


Actors:

Moviemaze.de HTML
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane) </td>

</tr>[/HTML]

My regex:
Code:
<RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;\2&lt;/role&gt;&lt;/actor&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="2">
        <expression repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
    </RegExp>
    <expression repeat="yes" trim="1">([A-Za-z \-]*)\(([A-Za-z \-]*)\)</expression>
</RegExp>

It works, but it misses the actors whose names are wrapped in an a href link, and I have not managed to find a solution.


German Umlaut [äöüß]:

I can't match words that contain German umlauts or ß, i.e. [äöüß].
I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9&auml;&Auml;&ouml;&Ouml;&uuml;&Uuml; ,.]+) without success.


Can somebody please help me?

regards,
w00dst0ck
#2
if that's a c&p:
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

actors; try adding the href stuff with a ?, i.e. make it optional
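
A minimal sketch of the "make it optional" idea, written in Python rather than scraper XML so it can be run on its own. Python's `re` module is not XBMC's regex engine, and the pattern and variable names here are illustrative assumptions, not the scraper's actual expression. The sample text is a shortened copy of the cast cell from the first post.

```python
import re

# Shortened copy of the cast cell quoted in the first post.
cell = (
    '<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), '
    '<a href="/celebs/114/1.html">Heath Ledger</a> (Joker), '
    'Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon)'
)

# (?:<a [^>]*>)? and (?:</a>)? make the link markup optional, so linked
# and unlinked names go through the same pattern.
ACTOR = re.compile(r'(?:<a [^>]*>)?([^<>(),]+?)(?:</a>)?\s*\(([^)]*)\)')

actors = [(name.strip(), role.strip()) for name, role in ACTOR.findall(cell)]

for name, role in actors:
    print(f"{name} as {role}")
```

The name group excludes `<`, `>`, `(`, `)` and `,`, so it cannot swallow markup or run into the next entry.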

umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)
#3
Thanks for the reply!

spiff Wrote: if that's a c&p:
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

I added the missing " but it makes no difference. I also tried without the trim option, but then no information is returned at all.


spiff Wrote:umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)

How do I set the encoding of the scraper?
The moviemaze.de page is encoded with iso-8859-1 and my results are generated with:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
#4
You set the encoding of the scraper XML file with exactly the kind of header you just pasted.
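
Why the encodings have to agree can be shown with raw bytes. The following Python sketch is an illustration only (it says nothing about how XBMC handles encodings internally): the same character class is encoded once as UTF-8 and once as ISO-8859-1, then matched against a name encoded the way the page is said to be served.

```python
import re

# The page is ISO-8859-1, so "ö" arrives as the single byte 0xF6.
page_bytes = "Matthias Schweighöfer".encode("iso-8859-1")

char_class = "[A-Za-zäöüß ]+"
pattern_utf8 = char_class.encode("utf-8")         # pattern saved as UTF-8
pattern_latin1 = char_class.encode("iso-8859-1")  # pattern saved as ISO-8859-1

# In UTF-8, "ö" is the two bytes 0xC3 0xB6, so the class does not contain 0xF6
# and the match stops right before the umlaut.
mismatched = re.match(pattern_utf8, page_bytes).group(0)

# With matching encodings the whole name is captured.
matched = re.match(pattern_latin1, page_bytes).group(0)

print(mismatched)
print(matched.decode("iso-8859-1"))
```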
#5
w00dst0ck Wrote:Added the missing " but there is no difference. Also tried without the trim option. But that results in no information provided.

It seems you're also missing a ">" after "standard_justify". It's not required for the match, but without it the ">" is returned in \2, and that is the problem.
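
The effect of the missing ">" can be reproduced in Python. This is a sketch under assumptions: Python's `re` with `re.DOTALL` stands in for the scraper engine, and the two tabs in front of the name are made up to mimic the padding described in the first post.

```python
import re

# Director cell from the first post; the leading tabs are an assumption.
html = (
    '<td valign=top class="standard">\n'
    '<span class="fett">Regie:</span>\n'
    '</td>\n\n'
    '<td valign=top class="standard_justify">\n'
    '\t\tGuillermo Del Toro </td>\n'
    '</tr>'
)

# Original pattern: nothing consumes the ">" that closes the <td ...> tag.
without_gt = re.search(r'Regie([^"]*)"standard_justify"(.*?)<', html, re.DOTALL)
# Corrected pattern: the ">" is matched outside the capture group.
with_gt = re.search(r'Regie([^"]*)"standard_justify">(.*?)<', html, re.DOTALL)

leaked = without_gt.group(2)          # begins with the stray ">"
director = with_gt.group(2).strip()   # strip() drops the surrounding whitespace

print(repr(leaked))
print(repr(director))
```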

For umlauts, as spiff already said, use "iso-8859-1". But if you're using wildcard matches, be sure to include "&", "#", ";" and digits, because the umlauts appear like this in the page source:

Heerf & # 2 5 2 ; hrer
Ma & # 2 2 3 ; arbeit

Heerführer
Maßarbeit

I had to space out the entities so they show up here; the spaces are of course not part of the source.
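
A small Python sketch of the same point: a character class that also allows "&", "#", ";" and digits captures words containing numeric entities, which can then be decoded. The `<td>` wrapper is a made-up fragment for illustration, and `html.unescape` is Python's standard-library decoder, not something the scraper provides.

```python
import html
import re

# Made-up fragment holding the two words from the post above.
source = "<td>Heerf&#252;hrer</td><td>Ma&#223;arbeit</td>"

# Letters and digits plus the characters that make up a numeric entity.
WORD = re.compile(r">([A-Za-z0-9&#;]+)<")

raw = WORD.findall(source)
decoded = [html.unescape(word) for word in raw]

print(raw)
print(decoded)
```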

XBMC Linux Ubuntu 8.04 - Antec Fusion Black
Gigabyte MA78GM-S2H - AMD Athlon 64 X2 BE-2350
Corsair 2Go DDRII PC6400 - Samsung Spinpoint 500 Go
Sony Bravia KDL-40W4000 - Logitech Harmony 555
#6
Gaarv Wrote:Seems you're missing a ">" too after "standard_justify". Its not required to match, but it will be returned in /2 so thats a problem.

Thanks, that solves the problem.

The problem with the umlauts is solved by using (.*)

Next I will try to solve my href problem...
#7
I did this quickly, so you will have to adapt it or limit where it applies on the page, but it gives the idea:

Code:
[^=",]*([A-Za-z0-9 ./]*\([A-Za-z0-9 ./]*\))

In the example you gave, this matches all actor names and roles, sometimes with a </a> that needs to be cleaned up.

#8
Hi there, and thanks for your help.
I have only one problem left, in the actors part.
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
Til Schweiger (Ludo), Nora Tschirner (Anna), Matthias Schweighöfer (Moritz), Alwara Höfels (Miriam), Jürgen Vogel (Jürgen Vogel), Rick Kavanian (Chefredakteur), Armin Rohde (Bello), Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz </td>
</tr>[/HTML]


Code:
<!--Actors-->    
    <RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\5&lt;/role&gt;&lt;/actor&gt;" dest="5+">
        <RegExp input="$$1" output="\2" dest="2">
            <expression trim="2" repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
        </RegExp>
        <expression repeat="yes">(&lt;a href\="[^&gt;]*&gt;)?(.*?)(&lt;/a&gt;)?( \((.*?)\))?, </expression>
    </RegExp>

The 10 tabs in front of the first actor name end up in \2. I've tried to strip them with [\t]{10}?
but I think the XBMC scraper engine doesn't understand \t.
Any idea how I can solve this?
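
For comparison, here is how the same cell could be handled in Python, where the leading tabs are removed with `strip()` before splitting the entries. This is a sketch, not the scraper's expression: the pattern and names are assumptions, the sample is a shortened copy of the cell above, and actors without a role get an empty string.

```python
import re

# Shortened copy of the cast cell above, including the ten leading tabs.
cell = (
    "\t" * 10
    + "Til Schweiger (Ludo), Nora Tschirner (Anna), "
    + "Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz </td>"
)

# Cut at the closing tag, then drop the leading tabs and trailing space.
text = cell.split("</td>")[0].strip()

# A name, an optional "(role)", then a comma or the end of the text.
ENTRY = re.compile(r"\s*([^,(]+?)\s*(?:\(([^)]*)\))?\s*(?:,|$)")

cast = [(m.group(1), m.group(2) or "") for m in ENTRY.finditer(text)]

for name, role in cast:
    print(name, "->", role or "(no role listed)")
```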
#9
\\t
#10
Found another solution.

Will submit the working scraper at http://trac.xbmc.org/ticket/4563
#11
Need some help again!

I don't know what's wrong. The regex works in a regex tester, so it must be the surrounding code.

Code:
<!--URL to Trailer-->
    <RegExp input="$$1" output="&lt;url function=&quot;GetTrailerLink&quot;&gt;http://www.moviemaze.de/media/trailer/\1.html&lt;/url&gt;" dest="5+">
        <expression>href=&quot;/media/trailer/(.*?).html&quot; ti</expression>
    </RegExp>

    <expression noclean="1"></expression>
    </RegExp>
</GetDetails>

<!--Trailer-->
    <GetTrailerLink dest="5">
        <RegExp input="$$2" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;&gt;&lt;details&gt;\1&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;trailer urlencoded=&quot;yes&quot;&gt;http://www.moviemaze.de/media/trailer/delivery/\1.mov&lt;/trailer&gt;" dest="2">
                <expression>delivery/(.*?).mov&quot;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetTrailerLink>

Is there a way to offer more than one trailer?
The site provides many trailers in different sizes, and it would be great if the trailer could be selected in XBMC.
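
Collecting every trailer link, rather than only the first, is straightforward on the regex side. The Python sketch below only illustrates that part; whether XBMC can offer a choice between several trailers is a separate question. The page fragment and file names are hypothetical, since the real trailer-page markup is not shown in this thread.

```python
import re

# Hypothetical trailer-page fragment; the real moviemaze.de markup may differ.
page = (
    '<a href="/media/trailer/delivery/abc_small.mov">small</a>'
    '<a href="/media/trailer/delivery/abc_large.mov">large</a>'
)

BASE = "http://www.moviemaze.de/media/trailer/delivery/"

# findall returns every match; the dot before "mov" is escaped so it only
# matches a literal ".".
names = re.findall(r'delivery/(.*?)\.mov"', page)
trailers = [f"{BASE}{name}.mov" for name in names]

for url in trailers:
    print(url)
```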
#12
w00dst0ck Wrote:Is there a way to implement more than one trailer? The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.
I am guessing that will count as a feature request, please submit a new ticket on trac http://trac.xbmc.org

#13
Sometimes the xbmc.log helps to solve a problem.
Submitted the working version with trailer support to trac.
Also submitted a feature request.
#14
I have the bulk of the fanart support done.

To get the missing IMDb number, I've created a Google wrapper.
Code:
<!--URL to Google and Fanart-->
<RegExp conditional="fanart" input="$$1" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.google.com/search?q=site:imdb.com+moviemaze+\2+\1&lt;/url&gt;" dest="5+">
<expression>&lt;h2&gt;\(([^,]*), ([0-9]{4})</expression>
</RegExp>

The generated URL is:
Code:
http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight

And it results in:
Code:
INFO: Get URL: http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight
ERROR: Server returned: 400 Bad Request

I've discovered that the spaces in "The Dark Knight" have to be replaced with "+".
But I don't know how to replace that character with a regex. Any ideas?
#15
something along these lines:

<RegExp input="$$1" output="\1+\2" dest="4">
<expression repeat="yes" noclean="1,2">(.*?) (.*)</expression>
</RegExp>
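
For reference, this is what the target URL looks like once every space in the query is replaced with "+". The sketch uses Python's standard `urllib.parse.quote_plus` purely to show the expected result; it is not something the scraper XML can call, and the variable names are illustrative.

```python
from urllib.parse import quote_plus

title, year = "The Dark Knight", "2008"

# quote_plus turns spaces into "+" and percent-encodes unsafe characters;
# safe=":" keeps the colon in "site:imdb.com" readable.
query = f"site:imdb.com moviemaze {year} {title}"
url = "http://www.google.com/search?q=" + quote_plus(query, safe=":")

# For a plain title, a simple replacement gives the same "+" form.
plain = title.replace(" ", "+")

print(url)
print(plain)
```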