ScraperEdit for XBMC (Java)
#46
To address the issue of the imports, I decided to write an AddonEdit program that looks similar to ScraperEdit. It should make it possible to edit addon.xml files and to launch ScraperEdit on the referenced scraper.
Currently in very early alpha state...
Reply
#47
I may have run into a bug...

When I run a certain regexp in CreateSearchUrl, "Find Matches" returns the correct \1 and \2, but when it is run in the debugger, no matches are found. This depends on the presence of a certain search group, i.e., " \(HQ\)". The details:

Input / Title to scrape:
Code:
The Palm Beach Story (HQ) arte 2015-02-16 20h15.mov

Regexp with marker:
Code:
(.+)( \(HQ\).*\.[a-zA-Z0-9]{3})

Returns of "Find Matches" for this regexp:
Code:
\1 = "The Palm Beach Story"
\2 = " (HQ) arte 2015-02-16 20h15.mov"

Regexp without marker:
Code:
(.+)(\.[a-zA-Z0-9]{3})
Returns of "Find Matches" for this regexp:
Code:
\1 = "The Palm Beach Story (HQ) arte 2015-02-16 20h15"
\2 = ".mov"


Debugger Log, running both regexps:
Code:
17:41:44.085 [INFO] Entering Function: CreateSearchUrl
17:41:44.491 [FINE] RUNNING
17:41:44.492 [INFO] Executing RegExp: (.+)(\.[a-zA-Z0-9]{3})
17:41:44.900 [FINE] Loading variable = 1
17:41:44.901 [FINER] executing expression
17:41:44.902 [FINER] Match found = 0 - 55
17:41:44.904 [FINER] Match = \1
17:41:44.906 [FINE] RUNNING
17:41:44.907 [INFO] Executing RegExp: (.+)( \(HQ\).*\.[a-zA-Z0-9]{3})
17:41:45.317 [FINE] Loading variable = 1
17:41:45.319 [FINER] executing expression
17:41:45.323 [FINE] RUNNING
17:41:45.324 [INFO] Leaving Function: CreateSearchUrl

The debugger does not return a \1 on the regexp with the marker. Testing with \2: the same.

For \1 in the regexp without the marker, even with "URL encode" unchecked, the "()" have been replaced with "%28" and "%29", and the spaces with "+" (can be seen in the return variable, $$x). The total input is 51 characters long, the regexp without marker returns characters 1-55, so maybe the replacement happened before the regexp match, thus spoiling it?
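
The length arithmetic above is consistent with the title having been URL-encoded before the match. The following is a minimal Java sketch of that check, assuming the encoding behaves like `java.net.URLEncoder` (an assumption; the thread does not say which call ScraperEdit uses). The class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Pattern;

final class EncodeBeforeMatchDemo {
    // Title and regexps copied from the post above.
    static final String TITLE = "The Palm Beach Story (HQ) arte 2015-02-16 20h15.mov";
    static final String WITH_MARKER = "(.+)( \\(HQ\\).*\\.[a-zA-Z0-9]{3})";
    static final String WITHOUT_MARKER = "(.+)(\\.[a-zA-Z0-9]{3})";

    // Assumption: form-style URL encoding, as done by java.net.URLEncoder.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // True if the regexp matches anywhere in the input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String encoded = encode(TITLE);
        System.out.println(TITLE.length() + " characters raw, "
                + encoded.length() + " characters encoded");
        System.out.println("encoded title: " + encoded);
        System.out.println("marker regexp on raw title:     " + matches(WITH_MARKER, TITLE));
        System.out.println("marker regexp on encoded title: " + matches(WITH_MARKER, encoded));
        System.out.println("plain regexp on encoded title:  " + matches(WITHOUT_MARKER, encoded));
    }
}
```

Under this assumption, each parenthesis grows from one character to three ("%28", "%29") while each space becomes a single "+", which accounts for 51 growing to 55. The literal " (HQ)" no longer exists in the encoded string, so only the regexp without the marker can still match.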

Did I overlook something crucial? Please help.
y

P.S.:
ScraperEdit v0.1.2-66
Java 8-Update 31
Mac OS 10.10.2
Reply
#48
(2015-02-20, 19:02)yaxlir Wrote: I may have run into a bug...
Could you provide me with the scraper you use?
I tried to reproduce the "bug", but did not succeed...
Reply
#49
(2015-03-22, 16:44)UsagiYojimbo Wrote:
(2015-02-20, 19:02)yaxlir Wrote: I may have run into a bug...
Could you provide me with the scraper you use?
I tried to reproduce the "bug", but did not succeed...


This is the scraper:
Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scraper thumb="icon.png" date="2015-02-12" content="movies" framework="1.1" name="TV Spielfilm Filmarchiv">
    <NfoUrl dest="3">
        <RegExp dest="3" output="\1" input="$$1">
            <expression></expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl dest="3">
        <RegExp dest="3" output="&lt;url spoof=&quot;http://www.google.de&quot;&gt;http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=\1&lt;/url&gt;" input="$$1">
            <expression trim="1" encode="1" clear="yes">(.+)</expression>
        </RegExp>
        <RegExp dest="3" output="&lt;url spoof=&quot;http://www.google.de&quot;&gt;http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=\1&lt;/url&gt;" input="$$1">
            <expression trim="1" encode="1">(.+)(\((HQ|HD|SD)\).*)</expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp dest="8" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results sorted=&quot;yes&quot;&gt;\1&lt;/results&gt;" input="$$5">
            <RegExp dest="5" output="&lt;entity&gt;&lt;title&gt;\3. \2. \4&lt;/title&gt;&lt;url&gt;\1&lt;/url&gt;&lt;/entity&gt;" input="$$1">
                <expression clear="yes" repeat="yes" noclean="1">&lt;article class="post"&gt;[\n \t]*&lt;a href="(http://www\.tvspielfilm\.de/kino/filmarchiv/film/[^\n]+,[0-9]+,ApplicationMovie\.html)[^&lt;]*&lt;img src[^&lt;]*&lt;div [^&lt;]*&lt;span class="sub-title"&gt;([^&lt;]*)&lt;/span&gt;[^&lt;]*&lt;h3&gt;([^&lt;]*)&lt;/h3&gt;[^&lt;]*&lt;p&gt;([^&lt;]*)&lt;/p&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp dest="3" output="&lt;details&gt;\1&lt;/details&gt;" input="$$8">
            <RegExp dest="8" output="&lt;title&gt;\1&lt;/title&gt;" input="$$1">
                <expression noclean="1" trim="1">&lt;h1 class="film-title"&gt;([^&lt;]*)&lt;/h1&gt;</expression>
            </RegExp>
            <RegExp dest="8+" output="&lt;votes&gt;\1&lt;/votes&gt;" input="$$6">
                <RegExp dest="6" output="\1 \2/3   " input="$$1">
                    <expression clear="yes" repeat="yes" noclean="1" trim="1">&lt;li[^&lt;]*&lt;span&gt;([^&lt;]*)&lt;/span&gt;[^&lt;]*&lt;ul class="red-br-rating"&gt;[^&lt;]*&lt;li class="active([0-3])"&gt;&lt;/li&gt;</expression>
                </RegExp>
                <RegExp dest="6+" output="TVS: \1   " input="$$1">
                    <expression noclean="1" trim="1">&lt;div class="editorial-rating big([1-3])"&gt;&lt;/div&gt;</expression>
                </RegExp>
                <RegExp dest="6+" output="Community: \2/5   " input="$$1">
                    <expression noclean="2" trim="2">&lt;li[^&lt;]*&lt;span&gt;([^&lt;]*)&lt;/span&gt;[^&lt;]*&lt;ul class="community-rating"&gt;[^&lt;]*&lt;li class="active([0-5])"&gt;&lt;/li&gt;</expression>
                </RegExp>
                <expression noclean="1"></expression>
            </RegExp>
            <RegExp dest="8+" output="&lt;tagline&gt;\1&lt;/tagline&gt;&lt;plot&gt;\2&lt;/plot&gt;" input="$$1">
                <expression noclean="1" trim="1">&lt;div class="description-text"&gt;[^&lt;]*&lt;h3&gt;([^&lt;]*)&lt;/h3[^&lt;]*&lt;p&gt;([^&lt;]*)&lt;/p&gt;</expression>
            </RegExp>
            <RegExp dest="8+" output="\1" input="$$7">
                <RegExp dest="7" output="&lt;genre&gt;\2&lt;/genre&gt;" input="$$1">
                    <expression clear="yes" noclean="2" trim="2">&lt;dt&gt;Genre:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;originaltitle&gt;\2&lt;/originaltitle&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Originaltitel:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;country&gt;\2&lt;/country&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Land:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;year&gt;\2&lt;/year&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Jahr:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;runtime&gt;\2&lt;/runtime&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Länge:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;rating&gt;\2&lt;/rating&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Altersfreigabe:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;director&gt;\2&lt;/director&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Regie:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <expression repeat="yes" noclean="1" trim="1"></expression>
            </RegExp>
            <RegExp dest="8+" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\4&lt;/role&gt;&lt;/actor&gt;" input="$$1">
                <expression repeat="yes" noclean="2" trim="2">&lt;span class="name"&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?[^&lt;]*&lt;/span&gt;[^&lt;]*&lt;span class="role"&gt;([^&lt;]*)&lt;/span&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
</scraper>


Scraping for "Ben Hur" logs
Code:
17:16:42.127 [FINEST] URL = http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben%2BHur
17:16:42.130 [FINE] Downloading URL: http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben%2BHur

And for "Ben Hur (HQ)" logs
Code:
17:18:04.625 [FINEST] URL = http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben%2BHur%2B%2528HQ%2529
17:18:04.629 [FINE] Downloading URL: http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben%2BHur%2B%2528HQ%2529

The difference is some encoding of my regexp "(.+)(\((HQ|HD|SD)\).*)" in the second case, I guess. There are similar problems in later versions of the scraper with info in "[ ]", like "[Musical]".

Thanks for your work!
Reply
#50
Update: unchecking URL encode in both regexps logs

Code:
20:45:50.957 [FINEST] URL = http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben+Hur
20:45:50.960 [FINE] Downloading URL: http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben+Hur
and
Code:
20:46:29.080 [FINEST] URL = http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben+Hur+%28HQ%29
20:46:29.081 [FINE] Downloading URL: http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben+Hur+%28HQ%29

The output is set to
Code:
<url spoof="http://www.google.de">http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=\1</url>
and only the part before the " (HQ)" should be in the URL.
Reply
#51
(2015-02-20, 19:02)yaxlir Wrote: For \1 in the regexp without the marker, even with "URL encode" unchecked, the "()" have been replaced with "%28" and "%29", and the spaces with "+" (can be seen in the return variable, $$x). The total input is 51 characters long, the regexp without marker returns characters 1-55, so maybe the replacement happened before the regexp match, thus spoiling it?

...

P.S.:
ScraperEdit v0.1.2-66
I had to look it up in the sources and in the SVN history to find out that, somewhere between versions 0.1.2-60 and 0.1.2-65, a "URL encode" function call was added to the very beginning of the debugging process. That means that the scraper got an already URL-encoded title in $$1.
I cannot remember why it was added, though, and there is no mention of it in the CHANGELOG file either...
Maybe I should bind it to a configuration setting...
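
To illustrate the effect of an already encoded title in $$1, here is a minimal Java sketch. It assumes `java.net.URLEncoder`-style encoding for both the debugger's early call and the scraper's `encode="1"` step (an assumption, not taken from the ScraperEdit sources); the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Pattern;

final class DoubleEncodeDemo {
    // Expression of the second RegExp in CreateSearchUrl, copied from the scraper above.
    static final String QUALITY_MARKER = "(.+)(\\((HQ|HD|SD)\\).*)";

    // Assumption: both encoding steps behave like java.net.URLEncoder.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // True if the quality-marker expression matches anywhere in the input.
    static boolean markerMatches(String input) {
        return Pattern.compile(QUALITY_MARKER).matcher(input).find();
    }

    public static void main(String[] args) {
        String title = "Ben Hur (HQ)";
        String once = encode(title);  // what the scraper would receive in $$1
        String twice = encode(once);  // after encode="1" is applied on top
        System.out.println("once:  " + once);
        System.out.println("twice: " + twice);
        System.out.println("marker matches raw title:     " + markerMatches(title));
        System.out.println("marker matches encoded title: " + markerMatches(once));
    }
}
```

Under these assumptions the twice-encoded string is the same query value that appears in the "Ben Hur (HQ)" log above, and the marker expression cannot match the pre-encoded title because the literal parentheses are gone. That would leave only the plain "(.+)" expression to produce the URL.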
Reply
#52
Great you could find the bug!
Please let us know when a fixed version is available.

And a big thanks for your work!
Reply
#53
(2015-03-23, 23:23)yaxlir Wrote: Please let us know when a fixed version is available.
Version 0.1.2-72 is available on SF.net.

Will change XBMC logo to Kodi logo in near future...
Reply
#54
(2015-03-25, 01:31)UsagiYojimbo Wrote: Version 0.1.2-72 is available on SF.net.

Works great with the filenames now.

Another thing: now the log includes the complete page loaded by GetSearchResults

Code:
11:30:19.563 [FINE] RUNNING
11:30:19.565 [INFO] Executing RegExp: <article class="post">[\n \t]*<a href="(http://www\.tvspielfilm\.de/kino/filmarchiv/film/[^\n]+,[0-9]+,ApplicationMovie\.html)[^<]*<img src[^<]*<div [^<]*<span class="sub-title">([^<]*)</span>[^<]*<h3>([^<]*)</h3>[^<]*<p>([^<]*)</p>
11:30:20.129 [FINE] clear dest
11:30:20.130 [FINE] Loading variable = 1
11:30:20.133 [FINER] executing expression
11:30:20.133 [FINE] input = <!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="de-DE">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=1146" />
<title>Ben Hur (1959) - Film-Suche - TV SPIELFILM</title>
    <meta name="date" content="2015-03-25T12:00:00+01:0

[ … snipped … ]

</body>
</html>
11:30:37.557 [FINE] output = <actor><name>\2</name><role>\4</role></actor>
11:30:37.573 [FINE] expression = <span class="name">([^<]*<a [^>]*>)?([^<]*)(</a>)?[^<]*</span>[^<]*<span class="role">([^<]*)</span>
11:30:37.575 [FINER] Match found = 114.901 - 115.142
11:30:37.616 [FINER] Match = \2
11:30:37.644 [FINER] Match = \4

Intended?
Reply
#55
(2015-03-25, 12:42)yaxlir Wrote: Another thing: now the log includes the complete page loaded by GetSearchResults

Intended?
I do not remember the reason, but I think it is.
I turned the log file's level up, however. It is now set to FINEST, so log files became more detailed.
(The old setting was FINER, so FINE messages were always in the log file.)
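
The relationship between the two settings can be sketched with the standard `java.util.logging` levels. This is only an illustration of the level ordering; it does not show ScraperEdit's actual logging setup, and the class and method names are hypothetical.

```java
import java.util.logging.Level;

final class LogLevelDemo {
    // A record is written when its level is at least as severe as the threshold.
    // In java.util.logging: FINE (500) > FINER (400) > FINEST (300).
    static boolean passes(Level threshold, Level message) {
        return message.intValue() >= threshold.intValue();
    }

    public static void main(String[] args) {
        Level[] messages = {Level.INFO, Level.FINE, Level.FINER, Level.FINEST};
        for (Level threshold : new Level[] {Level.FINER, Level.FINEST}) {
            for (Level message : messages) {
                System.out.println("threshold " + threshold + ", message " + message
                        + ": " + (passes(threshold, message) ? "logged" : "dropped"));
            }
        }
    }
}
```

With a FINER threshold, FINE messages such as the "input = ..." line already pass; moving to FINEST only adds the FINEST messages on top.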
Reply
#56
(2015-03-27, 09:28)UsagiYojimbo Wrote: I turned the log file's level up, however. It is now set to FINEST, so log files became more detailed.
(The old setting was FINER, so FINE messages were always in the log file.)

I see. You could make it into a settings value for the user to choose if you are *really* bored :-) but ScraperEdit does what it was made to do, and it does it well already.

Thank you for your work. I would have given up on my scraper without it.
Reply
#57
(2015-03-27, 11:09)yaxlir Wrote: You could make it into a settings value for the user to choose if you are *really* bored :-) ScraperEdit does what it was made to do, and it does it well already.

Thank you for your work. I would have given up on my scraper without it.
I am glad that it helps you!

BTW, inside the JAR file (which is a ZIP archive) there is a logging.properties file that controls the logging...
Reply
#58
(2015-03-27, 21:28)UsagiYojimbo Wrote: BTW, inside the JAR file, (which is a ZIP archive,) there is a logging.properties file, that controls the logging...

Cool!
Reply
#59
is it possible to use this editor to adapt the default tvdb scraper? i'd like that scraper to use the imdb rating per episode. any help would really be appreciated.
Reply
#60
(2015-03-29, 15:11)bbbutch Wrote: is it possible to use this editor to adapt the default tvdb scraper? i'd like that scraper to use the imdb rating per episode. any help would really be appreciated.

It might be easier to use the Universal scraper and set the data sources inside it.
Reply
