How to code "&" in the URL function in a scraper
#1
Question 
Hello,

could you please give me a hint about the correct coding of the "&" sign in the url function? I want to use the url function in a scraper, which I thought would be no problem at all. Everything works fine, but if there is a "&" in the URL I want to fetch/scrape, the "&" is somehow escaped, resulting in an error.
My test system is Win7 x64 in VMware with XBMC Camelot 9.11 installed.

The following four variants do not work:
Code:
<RegExp input="$$1" output="&lt;url function=&quot;GetInfos&quot;&gt;http://www.xxxx.de/search.php?Sea=mytitle&Kat=dvd&page=results&lt;/url&gt;" dest="5">
Code:
<RegExp input="$$1" output="&lt;url function=&quot;GetInfos&quot;&gt;http://www.xxxx.de/search.php?Sea=mytitle&amp;Kat=dvd&amp;page=results&lt;/url&gt;" dest="5">
Code:
<RegExp input="$$1" output="&lt;url function=&quot;GetInfos&quot;&gt;http://www.xxxx.de/search.php?Sea=mytitle&&amp;Kat=dvd&&amp;page=results&lt;/url&gt;" dest="5">
Code:
<RegExp input="$$1" output="&lt;url function=&quot;GetInfos&quot;&gt;http://www.xxxx.de/search.php?Sea=mytitle%26Kat=dvd%26page=results&lt;/url&gt;" dest="5">

In the log file, the error is always:
Code:
CIMDB::InternalGetDetails: Unable to parse web site [http://www.xxxx.de/search.php?Sea=mytitle]
=> XBMC (or more likely I?) has a problem handling "&" in the URL function. What is the right way to do it?

Regards,

Eisbahn
Reply
#2
%26 should do it in theory.

However, the entire URL after the parameter (I guess that is the title in this case) should be URL-encoded, so try URL-encoding the = as well.

BTW, the real URL, the full scraper code and a full debug log would help a lot in giving you better support.
Reply
#3
Hello olympia,

I shortened it for easier reading and thought all the information was included. Sadly it does not work at all, and it gets even worse: XBMC hangs and keeps calling the URL function, because $$2 is not filled and is left out of the URL completely (hint: if $$2 is replaced by tt0195234, the IMDb ID for "Grasgeflüster", it works fine). Sadly the forum allows only 10k characters, so only the first part of the scraper code is here:
Code:
<?xml version="1.0" encoding="utf-8"?><scraper framework="11" date="2010-07-10" name="DE_IMDb_v1.0.0" content="movies" thumb="imdb_de.png" language="de">
    <GetSettings dest="3">
        <RegExp input="$$5" output="&lt;settings&gt;\1&lt;/settings&gt;" dest="3">
            <RegExp input="$$1" output="&lt;setting label=&quot;mehr als die ersten beiden Genre/Kategorisierungen importieren&quot; type=&quot;bool&quot; id=&quot;getallgenre&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5">
                <expression />
            </RegExp>
            <RegExp input="$$1" output="&lt;setting label=&quot;Infos zu allen am Film beteiligten anfügen&quot; type=&quot;bool&quot; id=&quot;fullcredits&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5+">
                <expression />
            </RegExp>
            <RegExp input="$$1" output="&lt;setting label=&quot;Fanart von TMDB aktivieren&quot; type=&quot;bool&quot; id=&quot;fanart&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5+">
                <expression />
            </RegExp>
            <RegExp input="$$1" output="&lt;setting label=&quot;Cover/Poster von TMDB sammeln&quot; type=&quot;bool&quot; id=&quot;tmdbthumbs&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5+">
                <expression />
            </RegExp>
            <RegExp input="$$1" output="&lt;setting label=&quot;Cover/Poster von MoviePosterDB einfügen&quot; type=&quot;bool&quot; id=&quot;movieposterdb&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5+">
                <expression />
            </RegExp>
            <RegExp input="$$1" output="&lt;setting label=&quot;IMDb Cover/Poster- und DarstellerInnen Thumbnailgröße [Pixel]&quot; type=&quot;labelenum&quot; values=&quot;192|256|384|512|1024&quot; id=&quot;imdbscale&quot; default=&quot;512&quot;&gt;&lt;/setting&gt;" dest="5+">
                <expression />
            </RegExp>
            <RegExp input="$$1" output="&lt;setting label=&quot;Handlung von OFDB abfragen wenn IMDB keine enthält&quot; type=&quot;bool&quot; id=&quot;getofdbplot&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5+">
                <expression />
            </RegExp>
            <expression noclean="1" />
        </RegExp>
    </GetSettings>
    <NfoUrl dest="2">
        <RegExp input="$$1" output="&lt;url&gt;http://www.imdb.de/title/tt\1/&lt;/url&gt;&lt;id&gt;tt\1&lt;/id&gt;" dest="2">
            <expression clear="yes" noclean="1">imdb.de/title\?([0-9]*)</expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;url&gt;http://www.imdb.de/title/tt\1/&lt;/url&gt;&lt;id&gt;tt\1&lt;/id&gt;" dest="2+">
            <expression clear="yes" noclean="1">imdb.de/title/tt([0-9]*)</expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl SearchStringEncoding="utf-8" dest="4">
        <RegExp input="$$1" output="&lt;url&gt;http://www.imdb.de/find?s=tt;q=\1$$3&lt;/url&gt;" dest="4">
            <RegExp input="$$2" output="%20(\1)" dest="3">
                <expression clear="yes">(.+)</expression>
            </RegExp>
            <expression noclean="1" />
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="5">
        <RegExp input="$$3" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="5">
            <RegExp input="$$1" output="\1" dest="2">
                <expression clear="yes">&lt;link rel="canonical" href="http://www.imdb.de/title/([t0-9]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;year&gt;\2&lt;/year&gt;&lt;url&gt;http://www.imdb.de/title/$$2/&lt;/url&gt;&lt;id&gt;$$2&lt;/id&gt;&lt;/entity&gt;" dest="3">
                <expression clear="yes" noclean="1">&lt;meta name="title" content="([^"]*) \(([0-9]*)\)</expression>
            </RegExp>
            <RegExp input="$$1" output="\1" dest="4">
                <expression noclean="1">(&gt;&lt;a href="/title.*)</expression>
            </RegExp>
            <RegExp input="$$4" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;year&gt;\3&lt;/year&gt;&lt;url&gt;http://www.imdb.de/title/\1/&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;&lt;/entity&gt;" dest="3+">
                <expression repeat="yes" noclean="1,2">&gt;&lt;a href="/title/([t0-9]*)/[^&gt;]*&gt;([^&lt;]*)&lt;/a&gt; *\(([0-9]*)</expression>
            </RegExp>
            <expression clear="yes" noclean="1" />
        </RegExp>
    </GetSearchResults>
    <GetDetails clearbuffers="no" dest="4">
        <RegExp input="$$4" output="&lt;details&gt;\1&lt;/details&gt;" dest="4">
            <RegExp input="" output="&lt;id&gt;$$2&lt;/id&gt;" dest="4">
                <expression />
            </RegExp>
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="4+">
                <expression trim="1" noclean="1">&lt;h1&gt;([^&lt;]*)</expression>
            </RegExp>
            <RegExp input="$$5" output="&lt;originaltitle&gt;\1&lt;/originaltitle&gt;" dest="4+">
                <RegExp input="$$1" output="\1" dest="5">
                    <expression trim="1" noclean="1">&lt;h5&gt;Auch bekannt als:&lt;/h5&gt;&lt;div class="info-content"&gt;([^\n]*)</expression>
                </RegExp>
                <expression trim="1">([^&gt;\n]*)(?:&lt;em&gt;\(Originaltitel\)&lt;/em&gt;)</expression>
            </RegExp>
            <RegExp input="$$5" output="&lt;sorttitle&gt;\1 \2&lt;/sorttitle&gt;" dest="4+">
                <expression repeat="yes" trim="1,2">[&lt;br&gt;]?(.*?)(?:&lt;br&gt;)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;&lt;votes&gt;\2&lt;/votes&gt;" dest="4+">
                <expression trim="1,2">&lt;b&gt;([0-9,]+)/10&lt;/b&gt;[^&lt;]*&lt;a href="ratings" class="tn15more"&gt;([.0-9]+)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="4+">
                <expression>&lt;span&gt;\(([0-9][0-9][0-9][0-9])</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;top250&gt;\1&lt;/top250&gt;" dest="4+">
                <expression>Top 250: #([0-9]*)&lt;/a&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;premiered&gt;\1&lt;/premiered&gt;" dest="4+">
                <expression>&lt;h5&gt;Premierendatum:&lt;/h5&gt;\n&lt;div class="info-content"&gt;\n(.*?)[^\n]&lt;a</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;certification&gt;\1 \3&lt;/certification&gt;" dest="4+">
                <expression repeat="yes" trim="1,3">([^/&lt;&gt;|"(\n]+:[^&lt;"\( #\n|:=.]+)[ \n]+(&lt;i&gt;([^&lt;]*)&lt;/i&gt;)?[ \n]</expression>
            </RegExp>
            <RegExp input="$$5" output="&lt;mpaa&gt;\1 \3&lt;/mpaa&gt;" dest="4+">
                <RegExp input="$$1" output="\1" dest="5">
                    <expression trim="1">Altersfreigabe:&lt;/h5&gt;&lt;div class="info-content"&gt;(.*?)[^&lt;]&lt;/div&gt;</expression>
                </RegExp>
                <expression repeat="yes" trim="1,3">Deutschland:([^&lt;|]*)(&lt;i&gt;([^&lt;]*)&lt;/i&gt;?)?</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="4+">
                <expression trim="1">&lt;h5&gt;L&amp;#xE4;nge:&lt;/h5&gt;&lt;div class="info-content"&gt;([0-9]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;studio&gt;\1&lt;/studio&gt;" dest="4+">
                <expression repeat="yes">"/company/[^/]*/"&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;outline&gt;\1&lt;/outline&gt;&lt;tagline&gt;\1&lt;/tagline&gt;&lt;plot&gt;\1&lt;/plot&gt;" dest="4+">
                <expression>Handlung:&lt;/h5&gt;\n&lt;div class="info-content"&gt;\n([^&lt;]+)</expression>
            </RegExp>
            <RegExp input="" output="&lt;url function=&quot;GetIMDBPlot&quot;&gt;$$3plotsummary&lt;/url&gt;" dest="4+">
                <expression />
            </RegExp>
            <expression noclean="1" />
        </RegExp>
    </GetDetails>
    <GetIMDBPlot clearbuffers="no" dest="6">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
            <RegExp input="$$1" output="\1" dest="5">
                <expression clear="yes">&lt;div id="swiki.2.1"&gt;\n\n([^\n]+)</expression>
            </RegExp>
            <RegExp input="$$5" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5">
                <expression>(.+)</expression>
            </RegExp>
            <RegExp conditional="getofdbplot" input="$$5" output="&lt;url function=&quot;GetOFDBLink&quot;&gt;http://www.ofdb.de/view.php?SText=$$2&amp;Kat=IMDb&amp;page=suchergebnis&lt;/url&gt;" dest="5">
                <expression>^$</expression>
            </RegExp>
            <expression noclean="1" />
        </RegExp>
    </GetIMDBPlot>
    <GetOFDBLink clearbuffers="no" dest="6">
        <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetOFDBOutTagline&quot;&gt;http://www.ofdb.de/\1&lt;/url&gt;&lt;/details&gt;" dest="6">
            <expression>&lt;br&gt;1. &lt;a href=".*?([^"]+)</expression>
        </RegExp>
    </GetOFDBLink>
    <GetOFDBOutTagline clearbuffers="no" dest="6">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
            <RegExp input="$$1" output="&lt;outline&gt;\1&lt;/outline&gt;&lt;tagline&gt;\1&lt;/tagline&gt;&lt;plot&gt;\1&lt;/plot&gt;" dest="5">
                <expression>&lt;b&gt;Inhalt:&lt;/b&gt;([^&lt;]+)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;url function=&quot;GetOFDBPlot&quot;&gt;http://www.ofdb.de/plot/\1&lt;/url&gt;" dest="5+">
                <expression>&lt;a href="plot/([^"]+)</expression>
            </RegExp>
            <expression noclean="1" />
        </RegExp>
    </GetOFDBOutTagline>
    <GetOFDBPlot clearbuffers="no" dest="6">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
            <RegExp input="$$1" output="\1" dest="5">
                <expression noclean="1">Eine Inhaltsangabe von(.*)Zur &amp;Uuml;bersichtsseite des Films</expression>
            </RegExp>
            <!--<br>([^<]+)(?:</font>)-->
            <RegExp input="$$5" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5">
                <expression noclean="1">&lt;br&gt;([^&lt;]+)(?:&lt;/font&gt;)</expression>
            </RegExp>
            <expression noclean="1" />
        </RegExp>
    </GetOFDBPlot>
</scraper>

and the complete error log on pastebin: <http://pastebin.com/jy8SVBDC>

Regards,

Eisbahn
Reply
#4
The ID is only available in $$2 in the GetDetails function, so you have to rescrape it.

The following should do what you want:

Code:
    <GetIMDBPlot clearbuffers="no" dest="6">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
            <RegExp input="$$1" output="\1" dest="2">
                <expression clear="yes">&lt;a href="/title/(tt[0-9]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="\1" dest="5">
                <expression clear="yes">&lt;div id="swiki.2.1"&gt;\n\n([^\n]+)</expression>
            </RegExp>
            <RegExp input="$$5" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5">
                <expression>(.+)</expression>
            </RegExp>
            <RegExp conditional="getofdbplot" input="$$5" output="&lt;url function=&quot;GetOFDBLink&quot;&gt;http://www.ofdb.de/view.php?SText=$$2&amp;Kat=IMDb&amp;page=suchergebnis&lt;/url&gt;" dest="5">
                <expression>^$</expression>
            </RegExp>
            <expression noclean="1" />
        </RegExp>
    </GetIMDBPlot>
Reply
#5
Hi Olympia,

thanks for your reply. For a better understanding: why are the buffers (e.g. $$2) cleared?
My GetDetails has clearbuffers="no" (which to me means the buffers are left untouched after the function ends), and my sub-functions do not clear the buffers either.
Because I need the ID for many things, I do not want to rescrape it in every function. So I decided to set clearbuffers="no" and have some useful information present right at the start of my URL functions (which always fetch new HTML into $$1, overwriting the old HTML).
- At which step are URL functions called in the GetDetails (or another) function? (My assumption: after the last RegExp of the function has finished.)
- In which order are URL functions called? (I assume first come, first served: the first one in the details tags is processed first.)
- When are the URL functions left? (I think after the whole calling function is done.)
- At which step are buffers cleared?

It would be nice to get some more information so I can code my scraper better.

Thanks in advance,

Eisbahn

P.S.: Do you offer Git or SVN access for the next XBMC version (I do not have it, but I could offer my scraper to all users)? Otherwise I can offer a zip; I would just need to get some webspace under my real name/domain.
Reply
#6
If you stuff something explicitly into a buffer and you're using the clearbuffers="no" attribute, then it won't be cleared between function calls.

The ID in $$2 and the web URL in $$3 get cleared because they are only offered as default buffer values in the GetDetails function. That means if you stuff the ID explicitly into a buffer in GetDetails, you can use it in all the functions you call until you clear or overwrite it.
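
A minimal, untested sketch of that idea, assuming buffer 8 is not used anywhere else in the scraper (the buffer number is an arbitrary choice, not something taken from the scraper above). The first RegExp would go inside the outer RegExp of GetDetails and copies the default ID from $$2 into buffer 8; the second shows GetIMDBPlot building the OFDB URL from $$8 instead of $$2:

```xml
<!-- In GetDetails (clearbuffers="no"): copy the ID out of the default buffer $$2.
     Buffer 8 is an assumption; pick any buffer the scraper does not use. -->
<RegExp input="$$2" output="\1" dest="8">
    <expression>(.+)</expression>
</RegExp>

<!-- In GetIMDBPlot (clearbuffers="no"): read the ID back from $$8. -->
<RegExp conditional="getofdbplot" input="$$5" output="&lt;url function=&quot;GetOFDBLink&quot;&gt;http://www.ofdb.de/view.php?SText=$$8&amp;Kat=IMDb&amp;page=suchergebnis&lt;/url&gt;" dest="5">
    <expression>^$</expression>
</RegExp>
```

With this, the ID does not have to be rescraped in every function, as long as every function in the chain keeps clearbuffers="no" and none of them writes to buffer 8.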

The answer is basically yes to all your other questions.

Please note that you have to convert the scraper to an addon to make it compatible with the next stable version.
Reply
#7
olympia Wrote: Please note that you have to convert the scraper to an addon to make it compatible with the next stable version.

For XBMC v9 the current version is here:
<http://forum.xbmc.org/showpost.php?p=566589&postcount=25>

Any docs for v10 ready? On <http://wiki.xbmc.org/index.php?title=Addons_for_XBMC#Types_of_Extension> there is sadly nothing...
Same with the following problems in v9:
<premiered>premiere date</premiered> not imported into/exported to XBMC
<aired>???</aired> only for TV shows/series?
<set>???</set> don't know what this is
<artist>???</artist> difference to actor?
<status>???</status> don't know what this is
<certification>age rating for all countries except Germany</certification> not imported into/exported to XBMC
<sorttitle>alternative film titles</sorttitle> only the first title is imported into/exported to XBMC
<code>???</code> don't know what this is; I think it's the codec, so there is no point importing anything into this field

Any help?

Regards,

Eisbahn
Reply
#8
Eisbahn Wrote:Same with the following problems in v9:
<premiered>premiere date</premiered> not imported into/exported to XBMC
<aired>???</aired> only for TV shows/series?
<set>???</set> don't know what this is
<artist>???</artist> difference to actor?
<status>???</status> don't know what this is
<certification>age rating for all countries except Germany</certification> not imported into/exported to XBMC
<sorttitle>alternative film titles</sorttitle> only the first title is imported into/exported to XBMC
<code>???</code> don't know what this is; I think it's the codec, so there is no point importing anything into this field
Some info:
<premiered> = date of the first premiere, for movies or series
<aired> = first air date, for episodes only
<set> = like a box set or a movie series (e.g. Star Trek)
<artist> = only for albums (music)
<sorttitle> = alternative title, for sorting only, e.g. with "the" removed from the beginning
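
As an illustration of the <sorttitle> case, here is an untested sketch of a RegExp that could sit next to the <title> RegExp in GetDetails and emit the page title with a leading article dropped. The article list (The/Der/Die/Das) and the reuse of the <h1> pattern from the scraper above are assumptions for illustration only:

```xml
<!-- Hypothetical: build a sort title from the <h1> heading, skipping a leading article.
     The list of articles is an assumption; extend it as needed. -->
<RegExp input="$$1" output="&lt;sorttitle&gt;\1&lt;/sorttitle&gt;" dest="4+">
    <expression trim="1" noclean="1">&lt;h1&gt;\s*(?:The |Der |Die |Das )?([^&lt;]*)</expression>
</RegExp>
```

Since only one sort title is used, this would replace the existing RegExp that writes the alternative titles into <sorttitle> rather than being added alongside it.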
Reply
