ScraperEdit for XBMC (Java)

UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #46
To address the issue of the imports, I decided to write an AddonEdit program that looks similar to ScraperEdit. It should make it possible to edit addon.xml files and to launch ScraperEdit on the referenced scraper.
Currently in very early alpha state...
(This post was last modified: 2014-06-25 20:59 by UsagiYojimbo.)
yaxlir Offline
Junior Member
Posts: 17
Joined: Feb 2015
Reputation: 0
Post: #47
I may have run into a bug...

When I run a certain regexp in CreateSearchUrl, "Find Matches" returns the correct \1 and \2, but when it is run in the debugger, no matches are found. This depends on the presence of a certain marker in the regexp, i.e., " \(HQ\)". The details:

Input / Title to scrape:
Code:
The Palm Beach Story (HQ) arte 2015-02-16 20h15.mov

Regexp with marker:
Code:
(.+)( \(HQ\).*\.[a-zA-Z0-9]{3})

Returns of "Find Matches" for this regexp:
Code:
\1 = "The Palm Beach Story"
\2 = " (HQ) arte 2015-02-16 20h15.mov"

Regexp without marker:
Code:
(.+)(\.[a-zA-Z0-9]{3})
Returns of "Find Matches" for this regexp:
Code:
\1 = "The Palm Beach Story (HQ) arte 2015-02-16 20h15"
\2 = ".mov"


Debugger Log, running both regexps:
Code:
17:41:44.085 [INFO] Entering Function: CreateSearchUrl
17:41:44.491 [FINE] RUNNING
17:41:44.492 [INFO] Executing RegExp: (.+)(\.[a-zA-Z0-9]{3})
17:41:44.900 [FINE] Loading variable = 1
17:41:44.901 [FINER] executing expression
17:41:44.902 [FINER] Match found = 0 - 55
17:41:44.904 [FINER] Match = \1
17:41:44.906 [FINE] RUNNING
17:41:44.907 [INFO] Executing RegExp: (.+)( \(HQ\).*\.[a-zA-Z0-9]{3})
17:41:45.317 [FINE] Loading variable = 1
17:41:45.319 [FINER] executing expression
17:41:45.323 [FINE] RUNNING
17:41:45.324 [INFO] Leaving Function: CreateSearchUrl

The debugger does not return a \1 on the regexp with the marker. Testing with \2: the same.

For \1 in the regexp without the marker, even with "URL encode" unchecked, the "()" have been replaced with "%28" and "%29", and the spaces with "+" (this can be seen in the return variable, $$x). The total input is 51 characters long, but the regexp without the marker reports a match spanning 0 - 55, so maybe the replacement happened before the regexp match, thus spoiling it?

Did I overlook something crucial? Please help.
y

P.S.:
ScraperEdit v0.1.2-66
Java 8-Update 31
Mac OS 10.10.2
(This post was last modified: 2015-02-24 15:09 by yaxlir.)
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #48
(2015-02-20 19:02)yaxlir Wrote:  I may have run into a bug...
Could you provide me with the scraper you use?
I tried to reproduce the "bug", but did not succeed...
yaxlir Offline
Junior Member
Posts: 17
Joined: Feb 2015
Reputation: 0
Post: #49
(2015-03-22 16:44)UsagiYojimbo Wrote:  
(2015-02-20 19:02)yaxlir Wrote:  I may have run into a bug...
Could you provide me with the scraper you use?
I tried to reproduce the "bug", but did not succeed...


This is the scraper:
Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scraper thumb="icon.png" date="2015-02-12" content="movies" framework="1.1" name="TV Spielfilm Filmarchiv">
    <NfoUrl dest="3">
        <RegExp dest="3" output="\1" input="$$1">
            <expression></expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl dest="3">
        <RegExp dest="3" output="&lt;url spoof=&quot;http://www.google.de&quot;&gt;http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=\1&lt;/url&gt;" input="$$1">
            <expression trim="1" encode="1" clear="yes">(.+)</expression>
        </RegExp>
        <RegExp dest="3" output="&lt;url spoof=&quot;http://www.google.de&quot;&gt;http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=\1&lt;/url&gt;" input="$$1">
            <expression trim="1" encode="1">(.+)(\((HQ|HD|SD)\).*)</expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp dest="8" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results sorted=&quot;yes&quot;&gt;\1&lt;/results&gt;" input="$$5">
            <RegExp dest="5" output="&lt;entity&gt;&lt;title&gt;\3. \2. \4&lt;/title&gt;&lt;url&gt;\1&lt;/url&gt;&lt;/entity&gt;" input="$$1">
                <expression clear="yes" repeat="yes" noclean="1">&lt;article class="post"&gt;[\n \t]*&lt;a href="(http://www\.tvspielfilm\.de/kino/filmarchiv/film/[^\n]+,[0-9]+,ApplicationMovie\.html)[^&lt;]*&lt;img src[^&lt;]*&lt;div [^&lt;]*&lt;span class="sub-title"&gt;([^&lt;]*)&lt;/span&gt;[^&lt;]*&lt;h3&gt;([^&lt;]*)&lt;/h3&gt;[^&lt;]*&lt;p&gt;([^&lt;]*)&lt;/p&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp dest="3" output="&lt;details&gt;\1&lt;/details&gt;" input="$$8">
            <RegExp dest="8" output="&lt;title&gt;\1&lt;/title&gt;" input="$$1">
                <expression noclean="1" trim="1">&lt;h1 class="film-title"&gt;([^&lt;]*)&lt;/h1&gt;</expression>
            </RegExp>
            <RegExp dest="8+" output="&lt;votes&gt;\1&lt;/votes&gt;" input="$$6">
                <RegExp dest="6" output="\1 \2/3   " input="$$1">
                    <expression clear="yes" repeat="yes" noclean="1" trim="1">&lt;li[^&lt;]*&lt;span&gt;([^&lt;]*)&lt;/span&gt;[^&lt;]*&lt;ul class="red-br-rating"&gt;[^&lt;]*&lt;li class="active([0-3])"&gt;&lt;/li&gt;</expression>
                </RegExp>
                <RegExp dest="6+" output="TVS: \1   " input="$$1">
                    <expression noclean="1" trim="1">&lt;div class="editorial-rating big([1-3])"&gt;&lt;/div&gt;</expression>
                </RegExp>
                <RegExp dest="6+" output="Community: \2/5   " input="$$1">
                    <expression noclean="2" trim="2">&lt;li[^&lt;]*&lt;span&gt;([^&lt;]*)&lt;/span&gt;[^&lt;]*&lt;ul class="community-rating"&gt;[^&lt;]*&lt;li class="active([0-5])"&gt;&lt;/li&gt;</expression>
                </RegExp>
                <expression noclean="1"></expression>
            </RegExp>
            <RegExp dest="8+" output="&lt;tagline&gt;\1&lt;/tagline&gt;&lt;plot&gt;\2&lt;/plot&gt;" input="$$1">
                <expression noclean="1" trim="1">&lt;div class="description-text"&gt;[^&lt;]*&lt;h3&gt;([^&lt;]*)&lt;/h3[^&lt;]*&lt;p&gt;([^&lt;]*)&lt;/p&gt;</expression>
            </RegExp>
            <RegExp dest="8+" output="\1" input="$$7">
                <RegExp dest="7" output="&lt;genre&gt;\2&lt;/genre&gt;" input="$$1">
                    <expression clear="yes" noclean="2" trim="2">&lt;dt&gt;Genre:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;originaltitle&gt;\2&lt;/originaltitle&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Originaltitel:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;country&gt;\2&lt;/country&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Land:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;year&gt;\2&lt;/year&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Jahr:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;runtime&gt;\2&lt;/runtime&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Länge:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;rating&gt;\2&lt;/rating&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Altersfreigabe:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <RegExp dest="7+" output="&lt;director&gt;\2&lt;/director&gt;" input="$$1">
                    <expression noclean="2" trim="2">&lt;dt&gt;Regie:?&lt;/dt&gt;[^&lt;]*&lt;dd&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?&lt;(dt|[^&gt;]*)&gt;</expression>
                </RegExp>
                <expression repeat="yes" noclean="1" trim="1"></expression>
            </RegExp>
            <RegExp dest="8+" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\4&lt;/role&gt;&lt;/actor&gt;" input="$$1">
                <expression repeat="yes" noclean="2" trim="2">&lt;span class="name"&gt;([^&lt;]*&lt;a [^&gt;]*&gt;)?([^&lt;]*)(&lt;/a&gt;)?[^&lt;]*&lt;/span&gt;[^&lt;]*&lt;span class="role"&gt;([^&lt;]*)&lt;/span&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
</scraper>


Scraping for "Ben Hur" logs
Code:
17:16:42.127 [FINEST] URL = http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben%2BHur
17:16:42.130 [FINE] Downloading URL: http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben%2BHur

And for "Ben Hur (HQ)" logs
Code:
17:18:04.625 [FINEST] URL = http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben%2BHur%2B%2528HQ%2529
17:18:04.629 [FINE] Downloading URL: http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben%2BHur%2B%2528HQ%2529

The difference is some encoding of my regexp "(.+)(\((HQ|HD|SD)\).*)" in the second case, I guess. There are similar problems in later versions of the scraper with info in "[ ]", like "[Musical]".
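
The two logged URLs look like the title was form-encoded twice before being placed in the query string. Here is a minimal sketch of that reading, assuming Java's standard `URLEncoder` behaves like whatever encoder ScraperEdit uses internally (the class and method names are made up for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class DoubleEncodeSketch {
    // Form-encodes the text the given number of times in a row.
    // Hypothetical helper, not part of ScraperEdit.
    static String encodeTimes(String text, int times) {
        String result = text;
        try {
            for (int i = 0; i < times; i++) {
                result = URLEncoder.encode(result, "UTF-8");
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String title = "Ben Hur (HQ)";
        System.out.println(encodeTimes(title, 1)); // Ben+Hur+%28HQ%29
        System.out.println(encodeTimes(title, 2)); // Ben%2BHur%2B%2528HQ%2529
    }
}
```

The second result is the same query value as in the "Ben Hur (HQ)" log, and encoding "Ben Hur" twice gives the "Ben%2BHur" of the first log. Under this reading the "(HQ)" part is never stripped, presumably because the second regexp does not match an input in which the parentheses have already been replaced.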

Thanks for your work!
yaxlir Offline
Junior Member
Posts: 17
Joined: Feb 2015
Reputation: 0
Post: #50
Update: unchecking URL encode in both regexps logs

Code:
20:45:50.957 [FINEST] URL = http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben+Hur
20:45:50.960 [FINE] Downloading URL: http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben+Hur
and
Code:
20:46:29.080 [FINEST] URL = http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben+Hur+%28HQ%29
20:46:29.081 [FINE] Downloading URL: http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=Ben+Hur+%28HQ%29

The output is set to
Code:
<url spoof="http://www.google.de">http://www.tvspielfilm.de/kino/filmarchiv/suche/?q=\1</url>
and only the part before the " (HQ)" should be in the URL.
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #51
(2015-02-20 19:02)yaxlir Wrote:  For \1 in the regexp without the marker, even with "URL encode" unchecked, the "()" have been replaced with "%28" and "%29", and the spaces with "+" (this can be seen in the return variable, $$x). The total input is 51 characters long, but the regexp without the marker reports a match spanning 0 - 55, so maybe the replacement happened before the regexp match, thus spoiling it?

...

P.S.:
ScraperEdit v0.1.2-66
I had to look it up in the sources and in the SVN history to find out that, somewhere between versions 0.1.2-60 and 0.1.2-65, a "URL encode" function call was added to the very beginning of the debugging process. That means that the scraper got an already URL-encoded title in $$1.
I cannot remember why it was added, though, and there is no mention of it in the CHANGELOG file either...
Maybe I should bind it to a configuration setting...
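
To illustrate why a pre-encoded $$1 defeats the regexp with the marker, here is a small sketch using `java.util.regex` and `URLEncoder`. It assumes both behave like the matcher and encoder inside ScraperEdit; the class and helper names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreEncodedInputSketch {
    // The regexp with the " (HQ)" marker from post #47.
    static final Pattern WITH_MARKER =
            Pattern.compile("(.+)( \\(HQ\\).*\\.[a-zA-Z0-9]{3})");

    // Form-encodes the text once (stand-in for the debugger's encode step).
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String raw = "The Palm Beach Story (HQ) arte 2015-02-16 20h15.mov";
        String encoded = encode(raw);

        System.out.println(raw.length());      // 51
        System.out.println(encoded.length());  // 55
        System.out.println(firstGroup(WITH_MARKER, raw));      // The Palm Beach Story
        System.out.println(firstGroup(WITH_MARKER, encoded));  // null
    }
}
```

The encoded title is 55 characters long because each parenthesis grows from one character to three, which agrees with the "Match found = 0 - 55" line in the debugger log. The literal " (HQ)" no longer occurs in the encoded text, so the regexp with the marker cannot match it.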
yaxlir Offline
Junior Member
Posts: 17
Joined: Feb 2015
Reputation: 0
Post: #52
Great you could find the bug!
Please let us know when a fixed version is available.

And a big thanks for your work!
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #53
(2015-03-23 23:23)yaxlir Wrote:  Please let us know when a fixed version is available.
Version 0.1.2-72 is available on SF.net.

I will change the XBMC logo to the Kodi logo in the near future...
(This post was last modified: 2015-03-25 01:32 by UsagiYojimbo.)
yaxlir Offline
Junior Member
Posts: 17
Joined: Feb 2015
Reputation: 0
Post: #54
(2015-03-25 01:31)UsagiYojimbo Wrote:  Version 0.1.2-72 is available on SF.net.

Works great with the filenames now.

Another thing: now the log includes the complete page loaded by GetSearchResults

Code:
11:30:19.563 [FINE] RUNNING
11:30:19.565 [INFO] Executing RegExp: <article class="post">[\n \t]*<a href="(http://www\.tvspielfilm\.de/kino/filmarchiv/film/[^\n]+,[0-9]+,ApplicationMovie\.html)[^<]*<img src[^<]*<div [^<]*<span class="sub-title">([^<]*)</span>[^<]*<h3>([^<]*)</h3>[^<]*<p>([^<]*)</p>
11:30:20.129 [FINE] clear dest
11:30:20.130 [FINE] Loading variable = 1
11:30:20.133 [FINER] executing expression
11:30:20.133 [FINE] input = <!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="de-DE">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=1146" />
<title>Ben Hur (1959) - Film-Suche - TV SPIELFILM</title>
    <meta name="date" content="2015-03-25T12:00:00+01:0

[ … snipped … ]

</body>
</html>
11:30:37.557 [FINE] output = <actor><name>\2</name><role>\4</role></actor>
11:30:37.573 [FINE] expression = <span class="name">([^<]*<a [^>]*>)?([^<]*)(</a>)?[^<]*</span>[^<]*<span class="role">([^<]*)</span>
11:30:37.575 [FINER] Match found = 114.901 - 115.142
11:30:37.616 [FINER] Match = \2
11:30:37.644 [FINER] Match = \4

Intended?
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #55
(2015-03-25 12:42)yaxlir Wrote:  Another thing: now the log includes the complete page loaded by GetSearchResults

Intended?
I do not remember the reason, but I think it is.
I turned the log file's level up, however. It is now set to FINEST, so log files became more detailed.
(The old setting was FINER, so FINE messages were always in the log file.)
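
For reference, a small sketch of how `java.util.logging` levels nest, assuming ScraperEdit uses that framework (the level names in the logs suggest so). The class and helper names are made up; the real threshold may sit on a file handler rather than on a logger, but the comparison rule is the same.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogLevelSketch {
    // Reports whether a logger set to 'configured' accepts a message
    // logged at 'message' level. Hypothetical helper for illustration.
    static boolean accepts(Level configured, Level message) {
        Logger logger = Logger.getAnonymousLogger();
        logger.setLevel(configured);
        return logger.isLoggable(message);
    }

    public static void main(String[] args) {
        // A FINER threshold already lets FINE messages through...
        System.out.println(accepts(Level.FINER, Level.FINE));    // true
        // ...but drops FINEST ones, such as the "URL = ..." lines.
        System.out.println(accepts(Level.FINER, Level.FINEST));  // false
        // A FINEST threshold accepts both.
        System.out.println(accepts(Level.FINEST, Level.FINEST)); // true
    }
}
```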
(This post was last modified: 2015-03-27 09:29 by UsagiYojimbo.)
yaxlir Offline
Junior Member
Posts: 17
Joined: Feb 2015
Reputation: 0
Post: #56
(2015-03-27 09:28)UsagiYojimbo Wrote:  I turned the log file's level up, however. It is now set to FINEST, so log files became more detailed.
(The old setting was FINER, so FINE messages were always in the log file.)

I see. You could make it into a settings value for the user to choose if you are *really* bored :-) but ScraperEdit does what it was made to do, and it does it well already.

Thank you for your work. I would have given up on my scraper without it.
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #57
(2015-03-27 11:09)yaxlir Wrote:  You could make it into a settings value for the user to choose if you are *really* bored :-) ScraperEdit does what it was made to do, and it does it well already.

Thank you for your work. I would have given up on my scraper without it.
I am glad that it helps you!

BTW, inside the JAR file (which is a ZIP archive) there is a logging.properties file that controls the logging...
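
As a rough illustration of how such a properties file drives `java.util.logging`, the sketch below feeds properties text to the `LogManager` and reads the resulting level back. The logger name `demo.scraperedit` and the properties content are assumptions for the example, not the actual contents of ScraperEdit's logging.properties.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class LoggingPropertiesSketch {
    // Hypothetical logger name; kept in a static field so it is not
    // garbage-collected between configuration reloads.
    static final Logger LOG = Logger.getLogger("demo.scraperedit");

    // Applies logging.properties-style text and returns the level that
    // the demo logger ends up with. Note: this resets the JVM-wide
    // logging configuration, so only try it in a throwaway program.
    static Level applyConfig(String propertiesText) {
        try {
            LogManager.getLogManager().readConfiguration(
                    new ByteArrayInputStream(
                            propertiesText.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return LOG.getLevel();
    }

    public static void main(String[] args) {
        System.out.println(applyConfig("demo.scraperedit.level = FINER\n"));
        System.out.println(applyConfig("demo.scraperedit.level = FINEST\n"));
    }
}
```

In a real logging.properties, a line of the form `<logger name>.level = FINER` would be the kind of entry to change; which names and handlers ScraperEdit actually configures has to be checked in the file itself.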
yaxlir Offline
Junior Member
Posts: 17
Joined: Feb 2015
Reputation: 0
Post: #58
(2015-03-27 21:28)UsagiYojimbo Wrote:  BTW, inside the JAR file (which is a ZIP archive) there is a logging.properties file that controls the logging...

Cool!
bbbutch Offline
Junior Member
Posts: 1
Joined: Mar 2015
Reputation: 0
Post: #59
Is it possible to use this editor to adapt the default tvdb scraper? I'd like that scraper to use the imdb rating per episode. Any help would really be appreciated.
yaxlir Offline
Junior Member
Posts: 17
Joined: Feb 2015
Reputation: 0
Post: #60
(2015-03-29 15:11)bbbutch Wrote:  is it possible to use this editor to adapt the default tvdb scraper? i'd like that scraper to use the imdb rating per episode. any help would really be appreciated.

It might be easier to use the Universal scraper and set the data sources inside it.