Yet another request for help with "CreateSearchURL" function
#1
Hello,

I'm learning to write my own scraper. I've been struggling with this problem for a while now, and reading through other threads hasn't turned up a solution.
Could I ask for a little help? I must be missing something trivial or obvious...

1) Maybe there's a problem with the "name" attribute? The value is not the same as in the corresponding "addon.xml".
2) Am I right that, if the "CreateSearchURL" function is correct, the constructed URL is written to the debug log?

The problem is the following:
Code:
19:41:30 T:139925521422080   DEBUG: VideoInfoScanner: No NFO file found. Using title search for '/share/HDD1_Filmy/CSFD_scraper_test/atv_cerna_labut_1024x576.m4v'
19:41:30 T:139925521422080   DEBUG: FindMovie: Searching for 'atv cerna labut 1024x576' using CSFD movies SYN1 scraper (path: '/root/.xbmc/addons/metadata.csfd.cz', content: 'movies', version: '1.0.0')
19:41:30 T:139925521422080   ERROR: Run: Unable to parse web site

And this is the scraper I have so far:
Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scraper language="cs" thumb="icon.png" date="2013-10-26" content="movies" framework="1.1" name="SYN1_CSFD_scraper">

    <CreateSearchUrl dest="2">
        <RegExp dest="2" output="&lt;url&gt;http://www.csfd.cz/hledat/?q=\1&lt;/url&gt;" input="$$1">
            <expression clear="yes">atv (.*) [0-9]+[xX][0-9]+</expression>
        </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="3">
        <RegExp dest="3" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" input="$$2">
            <RegExp dest="2" output="&lt;entity&gt;&lt;title&gt;\2 (\3)&lt;/title&gt;&lt;url&gt;www.csfd.cz/film/\1/&lt;/url&gt;&lt;/entity&gt;" input="$$1">
                <expression repeat="yes">&lt;a href=&quot;/film/([^/]*)/&quot;[^&gt;]*&gt;([^&lt;]+)&lt;/a&gt;(?:[\s]*&lt;span[^&gt;]*&gt;[^&lt;]*&lt;/span&gt;[\s]*)?&lt;/h3&gt;[\s]*&lt;p&gt;[^0-9]*([0-9]{4,4})&lt;/p&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetSearchResults>
    
   <GetDetails dest="2">
        <RegExp dest="2" output="&lt;details&gt;\1&lt;/details&gt;" input="$$3">
            <RegExp dest="3+" output="&lt;title&gt;\1&lt;/title&gt;&lt;sorttitle&gt;\1&lt;/sorttitle&gt;" input="$$1">
                <expression>&lt;div class=&quot;info&quot;&gt;(?:[\s]*&lt;[^&gt;]+&gt;[\s]*)+?&lt;h1&gt;[\s]*([^&lt;]+?)[\s]*&lt;/h1&gt;</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;originaltitle&gt;\1&lt;/originaltitle&gt;" input="$$1">
                <expression>&lt;ul class=&quot;names&quot;&gt;[\s]*&lt;li&gt;[\s]*&lt;[^&gt;]+&gt;[\s]*&lt;h3&gt;([^&lt;]+)&lt;/h3&gt;</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;year&gt;\1&lt;/year&gt;" input="$$1">
                <expression>&lt;p class=&quot;origin&quot;&gt;[^&lt;]*?, ([0-9]{4,4})</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;director&gt;\1&lt;/director&gt;" input="$$1">
                <expression>&lt;h4&gt;Režie:(?:[\s]*&lt;[^&gt;]+&gt;[\s]*)+?&lt;a href=&quot;[^&quot;]+&quot;&gt;([^&lt;]+)&lt;/a&gt;</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;top250&gt;\1&lt;/top250&gt;" input="$$1">
                <expression>&lt;p class=&quot;charts&quot;&gt;[\s]+&lt;a href=&quot;[^&quot;]+&quot;&gt;([0-9]+)\. nejlepší film&lt;/a&gt;</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;rating&gt;\1&lt;/rating&gt;" input="$$1">
                <expression>&lt;div id=&quot;rating&quot;&gt;[\s]*&lt;h2 class=&quot;average&quot;&gt;([0-9]{2,2}%)&lt;/h2&gt;</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;plot&gt;\1&lt;/plot&gt;" input="$$1">
                <expression noclean="1">&lt;div id=&quot;plots&quot;[^&gt;]*&gt;(?:[\s]*&lt;[^&gt;]+&gt;[\s]*)+?&lt;h3&gt;[\s]*Obsah[\s]*&lt;/h3&gt;(?:[\s]*&lt;[^&gt;]+&gt;[\s]*)+?&lt;img[^&gt;]*&gt;[\s]*(.*?)&lt;span</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;runtime&gt;\1&lt;/runtime&gt;" input="$$1">
                <expression>&lt;p class=&quot;origin&quot;&gt;[^&lt;]*?, ([0-9]{2,3}) min</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;thumb&gt;&lt;url spoof=&quot;http://www.csfd.cz&quot;&gt;img.csfd.cz/files/images/film/posters/\1/\2/\3.\4&lt;/url&gt;&lt;/thumb&gt;" input="$$1">
                <expression repeat="yes">&lt;div class=&quot;image&quot; style=&quot;background-image: url\('\\/\\/img\\.csfd\\.cz\\/files\\/images\\/film\\/posters\\/([^\\]+)\\/([^\\]+)\\/([^\\]+)\\\.([a-zA-Z]+)</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;id&gt;\1&lt;/id&gt;" input="$$1">
                <expression>&lt;link rel=&quot;canonical&quot; href=&quot;http://www.csfd.cz/film/([^\-]+)-[^&quot;]+&quot;</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;genre&gt;\1&lt;/genre&gt;" input="$$1">
                <expression>&lt;p class=&quot;genre&quot;&gt;([^/\s]+)</expression>
            </RegExp>
            <RegExp dest="3+" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;/actor&gt;" input="$$1">
                <expression>&lt;h4&gt;Hrají:&lt;/h4&gt;[\s]*&lt;span[^&gt;]*&gt;[\s]*&lt;a href=&quot;[^&quot;]+&quot;&gt;([^&lt;]+)&lt;/a&gt;</expression>
            </RegExp>
            <expression></expression>
        </RegExp>
    </GetDetails>

</scraper>
#2
(2013-10-26, 20:06)DarkAngeL Wrote: 2) As I take it, if the "CreateSearchURL" function is correct, it will output the constructed URL into the debug log?
...
Code:
<CreateSearchUrl dest="2">
        <RegExp dest="2" output="&lt;url&gt;http://www.csfd.cz/hledat/?q=\1&lt;/url&gt;" input="$$1">
            <expression clear="yes">atv (.*) [0-9]+[xX][0-9]+</expression>
        </RegExp>
    </CreateSearchUrl>
The problem is likely that the file name the regex runs on has already been URL-encoded by XBMC, i.e. it actually looks like 'atv%20cerna%20labut%201024x576'.
Your expression expects literal spaces, which the encoded string no longer contains, so it never matches.
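
To check this outside XBMC, here is a minimal Python sketch. It assumes the title reaches the scraper percent-encoded (spaces as %20) and that Python's re module behaves like XBMC's regex engine for these simple patterns. The adjusted expression is only one possible fix, not necessarily the one you should ship.

```python
import re
from urllib.parse import quote

# Cleaned title exactly as it appears in the debug log.
title = "atv cerna labut 1024x576"

# Assumption: the scraper sees the title percent-encoded.
encoded = quote(title)

# The expression from the scraper, which expects literal spaces.
original = re.compile(r"atv (.*) [0-9]+[xX][0-9]+")

# Hypothetical adjusted expression that expects %20 instead of spaces.
adjusted = re.compile(r"atv%20(.*)%20[0-9]+[xX][0-9]+")

print(encoded)                    # the string the regex actually runs on
print(original.search(encoded))   # None: no literal spaces to match
match = adjusted.search(encoded)
print(match.group(1))             # captured title, still percent-encoded
```

The captured group stays percent-encoded, so it can go straight into the q= parameter of the search URL.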
#3
Awesome, thanks a lot! I wouldn't have thought of that, at least not before spending another 10 or 20 hours on the matter.

One last question, if I may:

Let's assume the following HTML input:
Code:
<h4>Hrají:</h4><span data-truncate="340">
<a href="/tvurce/158-natalie-portman/">Natalie Portman</a>, <a href="/tvurce/18093-mila-kunis/">Mila Kunis</a>, <a href="/tvurce/2024-vincent-cassel/">Vincent Cassel</a>

How should I match the actors' names one after another so that an <actor> element is generated for each of them in the "GetDetails" function? Generally speaking, the following regex is not enough:
Code:
<a href="/tvurce/[^"]+/">([^<]+)</a>

It is not enough because the page contains many more such links that have nothing to do with the movie's list of actors. The regex has to be tied to the context shown in the input above, and I can't figure out a "clean" way to capture every actor without imposing an upper limit. Is that even possible?
#4
The usual trick is to do it in two steps.

First capture everything after "Hrají:" and before the next section, and store it in a spare buffer. Then run your regexp on that buffer rather than on $$1, so that it only has the actor links to match against.

So something like:
Code:
<h4>Hrají:</h4><span[^>]*>[^<]*((?:<a[^>]*>[^<]*</a>[^<]*)*)
should match the example you've given, though it is probably simpler to use something like <h4>Hrají:</h4>(.*?)<h4>Blah blah</h4>, depending on how consistent the layout is.

Then you'd just go:
Code:
<RegExp input="$$1" output="\1" dest="10">
    <expression noclean="1" clear="yes">&lt;h4&gt;Hrají:&lt;/h4&gt;&lt;span[^&gt;]*&gt;[^&lt;]*((?:&lt;a[^&gt;]*&gt;[^&lt;]*&lt;/a&gt;[^&lt;]*)*)</expression>
</RegExp>
<RegExp input="$$10" output="etc., etc." dest="3+">
    <expression repeat="yes">&lt;a href=&quot;/tvurce/[^&quot;]+/&quot;&gt;([^&lt;]+)&lt;/a&gt;</expression>
</RegExp>
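
The same two-step idea can be tried offline. This is a rough Python sketch, not scraper code: the "Unrelated Person" link is a made-up stand-in for the other /tvurce/ links on the page, actor_block plays the role of the spare buffer, and the <actor> output format is taken from the scraper in the first post.

```python
import re

# Sample input from the question, preceded by a hypothetical unrelated
# /tvurce/ link that must NOT end up in the actor list.
html = (
    '<a href="/tvurce/1-someone-else/">Unrelated Person</a>'
    '<h4>Hrají:</h4><span data-truncate="340">\n'
    '<a href="/tvurce/158-natalie-portman/">Natalie Portman</a>, '
    '<a href="/tvurce/18093-mila-kunis/">Mila Kunis</a>, '
    '<a href="/tvurce/2024-vincent-cassel/">Vincent Cassel</a>\n'
    '</span>'
)

# Step 1: isolate the run of links that follows the "Hrají:" heading.
block_re = re.compile(
    r'<h4>Hrají:</h4><span[^>]*>[^<]*((?:<a[^>]*>[^<]*</a>[^<]*)*)'
)
found = block_re.search(html)
actor_block = found.group(1) if found else ""  # the "spare buffer"

# Step 2: run the simple link regex on that block only.
names = re.findall(r'<a href="/tvurce/[^"]+/">([^<]+)</a>', actor_block)

# Emit one <actor> element per name, as GetDetails would.
details = "".join(f"<actor><name>{name}</name></actor>" for name in names)
print(details)
```

Step 1 has to keep the HTML tags intact, which is why the first expression in the scraper version carries noclean="1". Without them the second regex would have no links left to match.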
#5
As usual, I tend to overcomplicate things and overlook the easy and obvious solutions. Thank you very much for the help.