
Schenk2302 · 2009-04-22, 22:01

Hi everyone,

I'm trying to write a scraper but I'm stuck on one step.

I use scrap.exe to test my scraper:

CreateSearchUrl returns the expected result.

GetSearchResults returns the expected result.

The details URL is correct.

But GetDetails returns nothing, with the error: Unable to parse details.xml

Here's my code:

PHP Code:
<scraper name="TEST" content="movies" thumb="cinefacts.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" language="de">
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3 \4&lt;/title&gt;&lt;url&gt;http://www.cinefacts.de/kino/\1/\2/filmdetails.html&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">&gt;&lt;a href=&quot;/kino/([0-9]*)/(.[^\/]*)/filmdetails.html&quot;&gt;[^&gt;]*(.[^&lt;]*)&lt;/b&gt;&lt;/a&gt;&lt;br&gt;[^&gt;]*[^\t]+\t+[^&nbsp;]+[^0-9]+([^&lt;]+)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <!--Title -->
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="5+">
                <expression trim="1" noclean="1">&lt;h1&gt;([^&lt;]*)</expression>
            </RegExp>
        </RegExp>
    </GetDetails>
</scraper> 

Maybe someone could have a quick look at this and tell me the direction to get it right.

Thanks so much in advance

Schenk

**spiff** · (This post was last modified: 2009-04-22, 22:19 by spiff.)

unfortunately scrap.exe is outdated and we lost the source.

and the reason it does not work is that you are missing the expression for the outermost RegExp in GetDetails, i.e.

Code:
....
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>
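
To see why the missing expression matters, here is a minimal Python sketch of how one nested RegExp step might behave. It is not XBMC code: `apply_regexp`, `page`, and `buffer5` are hypothetical names, and the sketch assumes that an empty expression passes the whole input through as \1, while a step with no expression at all produces nothing (which would match the empty GetDetails result reported above).

```python
import re

def apply_regexp(expression, input_text, output_template):
    """Hypothetical, simplified stand-in for one scraper <RegExp> step."""
    if expression is None:
        # Assumption: a step with no <expression> child yields nothing.
        return ""
    if expression == "":
        # Empty <expression/>: the whole input becomes \1.
        groups = [input_text]
    else:
        match = re.search(expression, input_text)
        if match is None:
            return ""
        groups = list(match.groups())
    result = output_template
    for index, value in enumerate(groups, start=1):
        # Substitute \1, \2, ... in the output template.
        result = result.replace("\\" + str(index), value)
    return result

page = "<html><h1>Some Movie</h1></html>"  # made-up details page

# Inner step: extract the title into what the scraper calls buffer 5.
buffer5 = apply_regexp(r"<h1>([^<]*)", page, r"<title>\1</title>")

# Outer step without an expression (the original scraper) vs. with an empty one.
broken = apply_regexp(None, buffer5, r"<details>\1</details>")
fixed = apply_regexp("", buffer5, r"<details>\1</details>")

print(repr(broken))
print(fixed)
```

In this model the outer step only wraps the collected fields in `<details>` once it has an expression to match against, even an empty one.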

Schenk2302 · 2009-04-22, 22:30

Hi Spiff,

thanks for your answer, that solved the problem with scrap.exe :)

But now I've tried it in XBMC and it doesn't work. I know scrap.exe is outdated, but is there any way to see at which point XBMC gets stuck with my scraper, or better, why it doesn't work? Are there any scraper logs?

At this point I have absolutely no clue where to start looking for the error, because with scrap.exe it works fine. Thanks again for any hints or info.

Greetz

Schenk

**spiff** · 2009-04-22, 22:32

my answer depends on which of these applies;
1) you speak c++ and can compile
or
2) you can compile
or
3) neither

Schenk2302 · 2009-04-22, 22:36

maybe 2), or rather 3)

Could you explain why?

Thanks

Schenk

**spiff** · (This post was last modified: 2009-04-22, 22:41 by spiff.)

if 1, i could have gotten away with instructions
2 means i'll have to do a patch for you, which i will do shortly - here it is: http://dureks.dyndns.org:8080/scraperlog.diff
3 means i don't have to do anything

:)

Schenk2302 · 2009-04-22, 22:40

spiff wrote: if 1, i could have gotten away with instructions
2 means i'll have to do a patch for you, which i will do shortly
3 means i don't have to do anything

2 sounds like something I could try.
3 makes me cry, because I want that Cinefacts scraper working :)

Schenk2302 · 2009-04-22, 22:43

little side note:

I made a cinefacts.de scraper for MediaPortal, but I've now switched to XBMC and would like to use it here. Even in MP it was hard for me, and in XBMC I'm getting depressed because it's totally different :)

**spiff** · 2009-04-22, 22:45

heh, different does not mean bad. don't give up, you'll get the hang of it =P

Schenk2302 · 2009-04-22, 22:53

Spiff, I know I'm being kind of lazy, but is there a compiled version with your patch to download, or do I really have to compile it on my own? That really scares me :o

**spiff** · 2009-04-22, 22:55

that was the prerequisite for 2)

Schenk2302 · 2009-04-23, 22:16

Hey Spiff,

I don't want to waste your time, but I have one question left. I'm making progress with my scraper and the first parts work very well. Now I'm parsing the genres, and that works, but the output looks like: Action, Thriller, Horror. How do I get rid of the commas?

Thanks in advance

Schenk

**spiff** · 2009-04-23, 22:41

Code:
<RegExp input="$$2" output="\1\2" dest="3">
    <expression noclean="1,2" repeat="yes">(.*?),(.*)</expression>
</RegExp>

also you should use multiple <genre> tags so maybe something like this?

Code:
<RegExp input="$$2" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="3">
    <expression noclean="1" repeat="yes">(.*?),</expression>
</RegExp>
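
As a rough check of the one-tag-per-genre idea, here is a small Python sketch using the genre string from the question. It uses Python's `re` module, and whether XBMC's regular expression engine behaves identically is an assumption. It suggests that a pattern requiring a trailing comma skips the last genre, so a pattern matching "everything up to the next comma" may be worth trying instead; the variable names are made up for illustration.

```python
import re

genres_text = "Action, Thriller, Horror"  # sample output from the question

# Pattern from the post above: only captures items that are followed by a comma.
comma_terminated = re.findall(r"(.*?),", genres_text)

# Alternative: capture every run of non-comma characters, skipping leading spaces.
all_items = re.findall(r"\s*([^,]+)", genres_text)

# Emit one <genre> tag per item, trimming any stray whitespace.
genre_xml = "".join(f"<genre>{item.strip()}</genre>" for item in all_items)

print(comma_terminated)
print(all_items)
print(genre_xml)
```

If the source string always ended in a comma, the pattern from the post would capture every genre as well.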

Schenk2302 · 2009-04-26, 01:37

And another question, sorry in advance:

If my movie title contains a German umlaut (ä), the scraper can't find the movie, but when I write ae instead of the umlaut, it is found. I've tried all the encoding stuff but can't find the answer:

Here's my code:

PHP Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<scraper name="Cinefacts.de" content="movies" thumb="cinefacts.jpg" language="de">
    
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3 (\4)&lt;/title&gt;&lt;url&gt;http://www.cinefacts.de/kino/\1/\2/filmdetails.html&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">&gt;&lt;a href=&quot;/kino/([0-9]*)/(.[^\/]*)/filmdetails.html&quot;&gt;[^&lt;]*&lt;b title=&quot;([^&quot;]*)&quot; class=&quot;headline&quot;&gt;[^&lt;]+&lt;/b&gt;&lt;/a&gt;&lt;br&gt;[^&lt;]+&lt;br&gt;+[^0-9]+([^&lt;]*)&lt;/td&gt;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>

</scraper> 

thanks again for any hints!!!

Schenk

Nicezia · 2009-04-26, 10:47

Schenk2302 wrote: If my movie title contains a German umlaut (ä), the scraper can't find the movie, but when I write ae instead of the umlaut, it is found.

try escaping the character code with '\xE4'
not sure if that's supported by the regular expression engine though
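
One possible explanation, offered as a guess rather than a diagnosis: the byte 0xE4 is "ä" in ISO-8859-1, while the same letter takes two bytes in UTF-8. If the search string is sent as UTF-8 and the site expects ISO-8859-1 (an assumption; the scraper above only declares iso-8859-1 for its own result XML), the two sides would disagree about what was typed. The Python sketch below shows how the same title percent-encodes under each charset; the title "Männer" is a made-up example.

```python
from urllib.parse import quote

BASE_URL = "http://www.cinefacts.de/suche/suche.php?name="  # from the scraper above

title = "Männer"  # hypothetical search term containing an umlaut

# The same character becomes different bytes depending on the charset.
utf8_query = quote(title, encoding="utf-8")
latin1_query = quote(title, encoding="iso-8859-1")

print(BASE_URL + utf8_query)    # umlaut sent as two bytes
print(BASE_URL + latin1_query)  # umlaut sent as the single byte 0xE4
```

Comparing what the site returns for each of the two URLs in a browser would show which encoding it actually expects.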