Quick Scraper Question (Hope so:))
#1
Hi everyone,

I'm trying to write a scraper but I'm stuck on one step.

I use scrap.exe to test my scraper:

CreateSearchUrl returns the right result.

GetSearchResults returns the right result.

The details URL is right.

But GetDetails returns nothing, with the error: Unable to parse details.xml

Here's my code:

PHP Code:
<scraper name="TEST" content="movies" thumb="cinefacts.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" language="de">
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3 \4&lt;/title&gt;&lt;url&gt;http://www.cinefacts.de/kino/\1/\2/filmdetails.html&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">&gt;&lt;a href=&quot;/kino/([0-9]*)/(.[^\/]*)/filmdetails.html&quot;&gt;[^&gt;]*(.[^&lt;]*)&lt;/b&gt;&lt;/a&gt;&lt;br&gt;[^&gt;]*[^\t]+\t+[^&nbsp;]+[^0-9]+([^&lt;]+)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <!-- Title -->
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="5+">
                <expression trim="1" noclean="1">&lt;h1&gt;([^&lt;]*)</expression>
            </RegExp>
        </RegExp>
    </GetDetails>
</scraper>

Maybe someone could have a quick look at this and point me in the right direction.

Thanks so much in advance

Schenk
#2
unfortunately scrap.exe is outdated and we lost the source.

and the reason it does not work is that you are missing the expression for the outermost RegExp in GetDetails, i.e.
Code:
....
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
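
Put together with the GetDetails block from the first post, the fix might look like the sketch below. Only the empty `<expression noclean="1"/>` line is new; it mirrors what the outer RegExp in CreateSearchUrl and GetSearchResults already has, and everything else is copied unchanged from the original post.

```xml
<GetDetails dest="3">
    <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
        <!-- Title: dest="5+" appends the title tag to buffer 5 -->
        <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="5+">
            <expression trim="1" noclean="1">&lt;h1&gt;([^&lt;]*)</expression>
        </RegExp>
        <!-- New: the expression the outer RegExp was missing -->
        <expression noclean="1"/>
    </RegExp>
</GetDetails>
```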
#3
Hi Spiff,

thanks for your answer, that solved the problem with scrap.exe :)

But now I tried it in XBMC and it doesn't work. I know scrap.exe is outdated, but is there any way to see at which point XBMC gets stuck with my scraper, or better, why it doesn't work? Are there any scraper logs? At the moment I have absolutely no clue where to start looking for the error, because with scrap.exe it works just fine. Thanks again for any hints or info.

Greetz

Schenk
#4
my answer depends on which of these applies;
1) you speak c++ and can compile
or
2) you can compile
or
3) neither
#5
:D

maybe 2), more likely 3)

Could you explain why?

Thanks

Schenk
#6
if 1) i could have gotten away with instructions
2) means i'll have to do a patch for you which i will do shortly - here it is; http://dureks.dyndns.org:8080/scraperlog.diff
3) means i don't have to do anything

:)
#7
spiff Wrote:if 1) i could have gotten away with instructions
2) means i'll have to do a patch for you which i will do shortly
3) means i don't have to do anything

:)


2) sounds like something I could try
3) makes me cry, because I want that Cinefacts scraper working :)
#8
little side note:

I made a cinefacts.de scraper for MediaPortal, but I have now switched to XBMC and would like to use it here. It was already hard for me to do this in MP; in XBMC I'm getting depressed because it's totally different :)
#9
heh, different does not mean bad. don't give up, you'll get the hang of it =P
#10
Spiff, I know I'm being kind of lazy, but is there a compiled version with your patch to download, or do I really have to compile it myself? That really scares me :o
#11
that was the prerequisite for 2)
#12
Hey Spiff,

I don't want to waste your time, but I have one question left. I'm making progress with my scraper and the first parts work very well. I'm now parsing the genres, which works, but the output looks like "Action, Thriller, Horror". How do I get rid of the commas?

Thanks in advance

Schenk
#13
Code:
<RegExp input="$$2" output="\1\2" dest="3">
    <expression noclean="1,2" repeat="yes">(.*?),(.*)</expression>
</RegExp>

also you should use multiple <genre> tags so maybe something like this?
Code:
<RegExp input="$$2" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="3">
    <expression noclean="1" repeat="yes">(.*?),</expression>
</RegExp>
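
One thing to watch with the second snippet: `(.*?),` only matches text that is followed by a comma, so for "Action, Thriller, Horror" the last genre would be skipped unless the list ends with a comma. A possible variant is sketched below. It assumes the genre list sits in buffer 2 as in the snippets above, and that `trim` strips the leading space after each comma the same way it is used for the title in the first post.

```xml
<!-- Sketch: one genre tag per comma-separated entry, including the last one -->
<RegExp input="$$2" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="3">
    <!-- [^,]+ matches a run of non-comma characters, so no trailing comma is needed -->
    <expression trim="1" noclean="1" repeat="yes">([^,]+)</expression>
</RegExp>
```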
#14
Another question, sorry in advance:

If my movie title contains a German umlaut (ä), the scraper can't find the movie, but when I write "ae" instead of the umlaut it is found. I tried all the encoding stuff but can't find the answer.

Here's my code:

PHP Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<scraper name="Cinefacts.de" content="movies" thumb="cinefacts.jpg" language="de">
    
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3 (\4)&lt;/title&gt;&lt;url&gt;http://www.cinefacts.de/kino/\1/\2/filmdetails.html&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">&gt;&lt;a href=&quot;/kino/([0-9]*)/(.[^\/]*)/filmdetails.html&quot;&gt;[^&lt;]*&lt;b title=&quot;([^&quot;]*)&quot; class=&quot;headline&quot;&gt;[^&lt;]+&lt;/b&gt;&lt;/a&gt;&lt;br&gt;[^&lt;]+&lt;br&gt;+[^0-9]+([^&lt;]*)&lt;/td&gt;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>

</scraper> 


thanks again for any hints!!!

Schenk
#15
Schenk2302 Wrote:If my movie title contains a German umlaut (ä), the scraper can't find the movie, but when I write "ae" instead of the umlaut it is found. I tried all the encoding stuff but can't find the answer.


try escaping the character code with '\xE4'
not sure if that's included in the regular expression engine though