Scraper functions question

ababak Offline
Junior Member
Posts: 8
Joined: Apr 2009
Reputation: 0
Location: Kiev, Ukraine
Post: #1
Hello!

I am trying to merge a locally acquired movie thumb with function-returned data. It looks like I can't send plain text through the $$ buffers? Sending <thumbs><thumb>url</thumb></thumbs> works only when it is enclosed in <details></details> tags. Is there any way I can merge this returned data with local <thumb>url</thumb> tags? I am also wondering how it is merged with the documented <GetDetails> data, which is already in the <details></details> format, when I am collecting it in some buffer using "4+".

Thank you!
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #2
have a look at how the allmusic scraper passes the thumb tags (i chose this one for clarity). the trick is the usage of the clearbuffers parameter on the functions. (if i understand your question correctly).

after every function we call the load() on the returned xml. this is an additive procedure, i.e. we keep what has been added before. but that does not mean that we can load several <thumbs> tags, hence the trick with clearbuffers
(This post was last modified: 2009-04-09 22:19 by spiff.)
ababak Offline
Post: #3
Hello spiff! Thank you for your reply.

Are you talking about bundled /Applications/Plex.app/Contents/Resources/Plex/system/scrapers/music/allmusic.xml ? I can't see any clearbuffers there...

Could you confirm whether arbitrary text can be passed as the return value of a function?

How is the <details></details> output returned by a custom function merged with the GetDetails <details></details>?
spiff Offline
Post: #4
plex? this is xbmc.. they apparently haven't updated the scrapers then.

http://trac.xbmc.org/browser/branches/li...lmusic.xml

Code:
<RegExp input="$$1" output="&lt;thumb&gt;http://image.allmusic.com/00/amg/pic200/dr\1\200/\1\2\3\4/\1\2\3\4\5.jpg&lt;/thumb&gt;" dest="7+">
                     <expression noclean="1" repeat="yes">&quot;([A-Z^])([0-9^])([0-9^])([0-9^])([^&quot;]*)&quot;</expression>
</RegExp>
this builds a list of the allmusic available artist thumbs in buffer 7. we want to add htbackdrop thumbs, and that requires another scrape, the GetThumbs function. now, we flag GetArtistDetails with clearbuffers="no". this means that when that function is finished we do NOT clear the contents of the buffers. we then enter GetThumbs
Code:
<RegExp input="$$13" output="&lt;details&gt;&lt;thumbs&gt;\1$$7&lt;/thumbs&gt;&lt;/details&gt;"
note the usage of $$7 here - this is the list that was built earlier in GetArtistDetails. since we do not clear the buffers after the GetArtistDetails function has run, this is still available.
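
The buffer behaviour described above can be modelled roughly in Python. This is only a toy sketch, not XBMC's actual code: the class `ScraperBuffers`, the functions `get_artist_details` and `get_thumbs`, and the `example.invalid` URLs are all made up for illustration, and the only thing assumed about the engine is what is stated above (buffers are cleared after a function unless it is flagged clearbuffers="no").

```python
class ScraperBuffers:
    """Toy model of the numbered $$N scraper buffers (hypothetical, not XBMC code)."""

    def __init__(self):
        self.buffers = {}

    def run(self, function, clearbuffers=True):
        """Run one scraper function, then clear the buffers unless told not to."""
        result = function(self.buffers)
        if clearbuffers:
            self.buffers.clear()
        return result


def get_artist_details(buffers):
    # Stand-in for the RegExp that appends thumbs to buffer 7 (dest="7+").
    buffers[7] = buffers.get(7, "") + "<thumb>http://example.invalid/a.jpg</thumb>"
    return "<details><name>artist</name></details>"


def get_thumbs(buffers):
    # Reuses $$7; it is empty if the previous function cleared the buffers.
    new_thumb = "<thumb>http://example.invalid/b.jpg</thumb>"
    return "<details><thumbs>" + new_thumb + buffers.get(7, "") + "</thumbs></details>"


# Case 1: the first function keeps its buffers, so $$7 survives.
kept = ScraperBuffers()
kept.run(get_artist_details, clearbuffers=False)
with_kept = kept.run(get_thumbs)

# Case 2: the buffers are cleared, so only the new thumb is returned.
cleared = ScraperBuffers()
cleared.run(get_artist_details, clearbuffers=True)
with_cleared = cleared.run(get_thumbs)
```

In this model `with_kept` contains both thumbs in a single <thumbs> tag, while `with_cleared` contains only the one added by `get_thumbs`.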

hope that explains it.

no you cannot pass arbitrary data as the result of a function. once you see the allmusic code you'll get the point. as i already explained; after every function we call load() on the returned string (xml). this is an additive procedure, i.e. any new tags present will get loaded. if you return a tag that has been returned earlier, it's overridden. hence the need to do the clearbuffers trick
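
As a rough illustration of the additive load() step described above, here is a small Python model. It is a sketch under assumptions, not the real implementation: the function name `load`, the dictionary used to hold the loaded tags, and the sample tags are hypothetical, and the requirement that the result be wrapped in <details> is taken from what was observed earlier in this thread.

```python
import xml.etree.ElementTree as ET


def load(details, returned_xml):
    """Merge the top-level tags of one function result into `details`.

    Toy model: tags not seen before are added, and a tag that was
    already loaded is replaced by the newer one.
    """
    # Plain text that is not XML fails to parse here.
    root = ET.fromstring(returned_xml)
    if root.tag != "details":
        raise ValueError("result must be wrapped in <details>")
    for child in root:
        details[child.tag] = ET.tostring(child, encoding="unicode")
    return details


details = {}
load(details, "<details><title>Movie</title><thumbs><thumb>a.jpg</thumb></thumbs></details>")
# A second result adds nothing new, but its <thumbs> replaces the earlier one.
load(details, "<details><thumbs><thumb>b.jpg</thumb></thumbs></details>")
```

After both calls the model still holds <title>, but only the second <thumbs>, which is why the thumbs from the first function have to be carried over in a buffer and returned together.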
(This post was last modified: 2009-04-09 23:06 by spiff.)
ababak Offline
Post: #5
Ah, ok, sorry, I didn't realize the scrapers' code base isn't shared between these projects. I'll have a look. Thank you very much!
ababak Offline
Post: #6
Thanks, spiff! That explained everything I asked!
ababak Offline
Post: #7
Two more questions.

1. What is the best way to replace strings in a buffer? For example, I need to replace &nbsp; with spaces.
2. How do I use the "cache" parameter in the url tag? I've seen it in several scrapers but can't understand what it does and how it works.
(This post was last modified: 2009-04-10 08:49 by ababak.)
spiff Offline
Post: #8
1) optimality can be discussed but this works
Code:
<RegExp input="$$2" output="\1&amp;amp;\2" dest="3">
  <expression noclean="1,2" repeat="yes">(.*?)&amp;(.+)</expression>
</RegExp>
2) cache is just a local file name. that way you can run several functions on the same page without redownloading it.
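
The idea behind the cache parameter can be sketched in Python. This is a hedged illustration only, not how XBMC implements it: `fetch_cached`, `fake_download`, the file name `movie.html` and the `example.invalid` URL are hypothetical, and the only behaviour assumed is the one stated above (the page is stored under a local file name and reused instead of being downloaded again).

```python
import os
import tempfile


def fetch_cached(url, cache_name, cache_dir, download):
    """Return the page for `url`, reusing the local file `cache_name` if present."""
    path = os.path.join(cache_dir, cache_name)
    if os.path.exists(path):
        # Cache hit: read the stored copy, no download.
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    page = download(url)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(page)
    return page


calls = []


def fake_download(url):
    # Stand-in for a real HTTP request; records every call.
    calls.append(url)
    return "<html>page for " + url + "</html>"


with tempfile.TemporaryDirectory() as cache_dir:
    url = "http://example.invalid/movie"
    first = fetch_cached(url, "movie.html", cache_dir, fake_download)
    second = fetch_cached(url, "movie.html", cache_dir, fake_download)
```

Both calls return the same page, but `fake_download` runs only once; the second call is served from the local file.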
pancheto Offline
Junior Member
Posts: 36
Joined: Nov 2011
Reputation: 0
Location: Santiago de Compostela
Post: #9
please forgive me for bumping into this discussion, but I've been struggling for quite a while with the clearbuffers="no" option, and this is the first time I've seen something close to a worked example. it almost answers my doubts, but unfortunately I still can't get it right.

I'm currently helping to develop the FilmAffinity scraper for XBMC. I have 3 buffers filled inside the GetDetails function ($$11, $$12 and $$13) whose content I would love to see inside a custom function I call to get the IMDB id. the reason is that if I'm able to pass those variables along, I'd be able to parse the function results more accurately. this is the (summarized version of the) code:

Code:
<GetDetails dest="3" clearbuffers="no">
    <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
        <RegExp input="$$1" output="\1" dest="11">
            <expression trim="1" noclean="1">movie.gif&quot; border=&quot;0&quot;&gt; (.*?)(\(AKA|&lt;)</expression>
        </RegExp>
        <RegExp input="$$1" output="\1" dest="12">
            <expression trim="1">T&amp;Iacute\;TULO ORIGINAL&lt;/th&gt;\s*&lt;td&gt;&lt;strong&gt;(.*?)(&lt;|\(AKA)</expression>
        </RegExp>
        <RegExp input="$$1" output="\1" dest="13">
            <expression>A&Ntilde;O&lt;/th&gt;\s*&lt;td&gt;.*?\s*([0-9]{4})\s*&lt;/td&gt;</expression>
        </RegExp>
        <RegExp conditional="!EnableFastSearch" input="$$9" output="&lt;url function=&quot;GetIMDBid&quot;&gt;\1&lt;/url&gt;" dest="5+">
            <RegExp conditional="!GoogleAdvSearch" input="" output="http://www.imdb.com/xml/find?xml=1&nr=1&tt=on&q=" dest="9">
                <expression />
            </RegExp>
            <RegExp conditional="GoogleAdvSearch" input="" output="http://www.google.com/search?q=site:imdb.com" dest="9">
                <expression />
            </RegExp>
            <RegExp input="$$12" output="+\1" dest="9+">
                <!-- unimos con '+' cada palabra -->
                <expression repeat="yes">(\w+)</expression>
            </RegExp>
            <RegExp input="$$13" output="+(\1)" dest="9+">
                <expression />
            </RegExp>
            <expression />
        </RegExp>

even though I'm setting the clearbuffers="no" option in the GetDetails function definition, I'm not seeing these $$12 and $$13 variables' content inside the GetIMDBid function. what should I modify to get it right?

EDIT: I've opened a new thread in the "scraper development"'s root with this same post content, as I'm positive other people may find any answer very interesting, due to the lack of documentation on this issue.

EDIT2: this issue has now been solved here
(This post was last modified: 2012-12-09 20:10 by pancheto.)