Updating an Existing Scraper

nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Question  Updating an Existing Scraper
Post: #1
Hi

I was trying to use the existing TMDB scraper for my movie collection. Unfortunately, it seems to send the whole file name to the search API, and all my files are named like "<Director Name> - <Title> [part1-2] (<Year>)". So I need to change the regexps a little to match my file names. I found a guide here:
http://wiki.xbmc.org/index.php?title=HOW...s_guide%29

Unfortunately, it's very difficult to understand from that guide how to work with existing scrapers.
Here is an excerpt from TMDB scraper:
Code:
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://api.themoviedb.org/2.1/Movie.search/$INFO[language]/xml/57983e31fb435df4df77afb854740ea9/\1&lt;/url&gt;" dest="3">
        <RegExp input="$$2" output="+\1" dest="4">
            <expression clear="yes">(.+)</expression>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</CreateSearchUrl>
The regexp I need for my files is simple and obvious:
Code:
.+\s-\s(.+)\s(part[1-9]\s)?\(.+\)

I just don't get how the inner regexp from the scraper is connected to the outer regexp: the inner one works with buffers 2 and 4 and the outer one works with 1 and 3. So to me they shouldn't be related at all...

Furthermore, the inner regexp is (.+), which should wipe out everything, and the output is +\1. Does that mean it just adds "+" in front of the string?

Please explain how I can modify the scraper above to use my regexp.
bambi73 Offline
Senior Member
Posts: 217
Joined: Jan 2010
Reputation: 0
Location: Czech Republic
Post: #2
nucleo Wrote: I just don't get how the inner regexp from the scraper is connected to the outer regexp: the inner one works with buffers 2 and 4 and the outer one works with 1 and 3. So to me they shouldn't be related at all...

Furthermore, the inner regexp is (.+), which should wipe out everything, and the output is +\1. Does that mean it just adds "+" in front of the string?
The inner one is processed first, then the outer one.
As Spiff put it in another thread: expressions are evaluated in a LIFO/depth-first fashion, i.e. dig into the deepest one and evaluate that first.

And you are right, the inner regexp makes no sense here: its result is not used, and I have no idea what is in $$2 in CreateSearchUrl. IMHO nothing.

nucleo Wrote: Please explain how I can modify the scraper above to use my regexp.

Something like:
Code:
<CreateSearchUrl dest="3">
  <RegExp input="$$1" output="&lt;url&gt;http://api.themoviedb.org/2.1/Movie.search/$INFO[language]/xml/57983e31fb435df4df77afb854740ea9/\1&lt;/url&gt;" dest="3">
    <expression noclean="1">.+?\s-\s(.+)\s(?:part[1-9]\s)?\(.+\)</expression>
  </RegExp>
</CreateSearchUrl>

I added ? to the first .+ to make it lazy; otherwise \s-\s will match the last occurrence of " - ", which can be inside the movie name.
BTW, I haven't tested it, so it comes without any warranty ;)
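
The effect of the lazy first group can be checked outside XBMC. A minimal sketch using Python's re module, on the assumption that XBMC's regex engine follows the same Perl-style greedy/lazy rules; the file name with a second " - " inside the title is made up for illustration:

```python
import re

# Hypothetical file name whose title itself contains " - ".
name = "John Doe - Alpha - Beta (2004)"

# Greedy first group: the separator is matched at the LAST " - ".
greedy = re.search(r".+\s-\s(.+)\s(?:part[1-9]\s)?\(.+\)", name)

# Lazy first group: the separator is matched at the FIRST " - ".
lazy = re.search(r".+?\s-\s(.+)\s(?:part[1-9]\s)?\(.+\)", name)

print(greedy.group(1))  # title cut at the last " - "
print(lazy.group(1))    # title starts right after the first " - "
```

With the greedy version only the tail of the title survives, which is the problem the extra ? is meant to avoid.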
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #3
Thanks for your reply!

Actually, you also added "?:" before part[1-9]. I'm not sure what for; anyway, it doesn't work with or without it. This is the main problem. I had already tried using it the way you suggested, and XBMC always says "Unable to connect to remote server". However, the default scraper works (though it gets the wrong info, because it uses "Paul Haggis - Crash (2004)" instead of just "Crash").

Unfortunately, I cannot test regexps with ScraperXML, because I'm on Ubuntu. So I just rely on XBMC itself... And the message is not very promising.

BTW, this additional inner regexp is present in every scraper, for example in the IMDB one. But there the output is %20\1 instead of +\1.
%20 and + remind me of spaces in HTTP URLs, but how does the expression work, and why does it place its output into $$4, which is never used? Is it some hidden, undocumented functionality like in the Win API? :)

The manuals say somewhere that XBMC will offer you a list of candidates for each file, but it offers nothing to me. Can I enable it somehow?
olympia Offline
Team-Kodi Member
Posts: 2,503
Joined: May 2008
Reputation: 32
Post: #4
@bambi73
$$2 is the year from the filename, and it is used in the imdb scraper.

Not sure why it is there in the tmdb scraper. There it is indeed not used, so it is probably a leftover from the past.
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #5
olympia, could you please give an example of where I can place my regexp in the tmdb or imdb scraper?
bambi73 Offline
Senior Member
Posts: 217
Joined: Jan 2010
Reputation: 0
Location: Czech Republic
Post: #6
nucleo Wrote: Thanks for your reply!

Actually, you also added "?:" before part[1-9]. I'm not sure what for; anyway, it doesn't work with or without it. This is the main problem. I had already tried using it the way you suggested, and XBMC always says "Unable to connect to remote server". However, the default scraper works (though it gets the wrong info, because it uses "Paul Haggis - Crash (2004)" instead of just "Crash").
After olympia's response I got a bit suspicious, and yes, XBMC already removes the year from the file name and passes it in $$2. So you only need

Code:
.+?\s-\s(.+?)(?:\spart[1-9])?

(?: ) makes the group non-capturing, so it doesn't produce a replacement string in \2.
EDIT: added ? to the movie name group too. It should be lazy as well, because otherwise "part" will become part of the movie name: if the movie group is greedy, it will always force the part group to match zero times.
Again, not tested, only an armchair guess :)
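
Both claims can be checked with a short Python sketch. It assumes that XBMC's regex engine treats (?: ) and greedy quantifiers the way Python's re module does; the sample name is the un-encoded title from this thread with a part marker appended:

```python
import re

name = "Paul Haggis - Crash part1"

# Greedy movie group followed by the optional, non-capturing part group.
non_capturing = re.compile(r".+?\s-\s(.+)(?:\spart[1-9])?")
# Same pattern, but the part group is an ordinary capturing group.
capturing = re.compile(r".+?\s-\s(.+)(\spart[1-9])?")

match = non_capturing.search(name)
print(match.group(1))        # the greedy group swallows " part1" as well
print(non_capturing.groups)  # number of capturing groups: only \1 exists
print(capturing.groups)      # here the part group would become \2
```

The greedy (.+) consumes the part marker before the optional group gets a chance, so that group always matches zero times.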

nucleo Wrote: Unfortunately, I cannot test regexps with ScraperXML, because I'm on Ubuntu. So I just rely on XBMC itself... And the message is not very promising.

BTW, this additional inner regexp is present in every scraper, for example in the IMDB one. But there the output is %20\1 instead of +\1.
%20 and + remind me of spaces in HTTP URLs, but how does the expression work, and why does it place its output into $$4, which is never used? Is it some hidden, undocumented functionality like in the Win API? :)

The manuals say somewhere that XBMC will offer you a list of candidates for each file, but it offers nothing to me. Can I enable it somehow?
Turn on debugging in the XBMC settings and you will see the string returned by the parser functions in the log. You can exploit this return string to see the value of any buffer in a function. Simply do something like:

Code:
<RegExp input="$$1" output="&lt;url&gt;http://akas.imdb.com/find?s=tt;q=\1$$4&lt;/url&gt;  ##1=$$1  ##2=$$2" dest="3">
Because it comes after the closing element, it doesn't hurt the XML parser (at least not visibly ;)) and you see it in the log. Quite simple but useful.

olympia Wrote:@bambi73
$$2 is the year from the filename and it is used in case of imdb scraper.
Good to know. I have never worked on a movie scraper, so this is new info for me :)
(This post was last modified: 2011-04-09 20:35 by bambi73.)
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #7
Thanks a lot, I followed your advice and enabled logging. It seems that my regexp produces nothing, because according to the logs the output URL contains only the static part, without the generated group \1. So I'll try some simpler regexps; maybe there is a mistake somewhere.
mortstar Offline
Senior Member
Posts: 282
Joined: Aug 2010
Reputation: 3
Post: #8
nucleo Wrote: Thanks a lot, I followed your advice and enabled logging. It seems that my regexp produces nothing, because according to the logs the output URL contains only the static part, without the generated group \1. So I'll try some simpler regexps; maybe there is a mistake somewhere.

Try using ScraperXML to see how your scraper flows.

You can use the test engine to see what is held in the buffer at each stage. You can also test your regex.

nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #9
As I already mentioned in one of my posts, I'm on Ubuntu and I don't know how to run ScraperXML there. I tried to run it under Wine, but without success.
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #10
bambi73 Wrote: EDIT: added ? to the movie name group too. It should be lazy as well, because otherwise "part" will become part of the movie name: if the movie group is greedy, it will always force the part group to match zero times.
Again, not tested, only an armchair guess :)

Just noticed your edit. Yes, you are right. I'm not that strong in regexps and have never used lazy groups.
olympia Offline
Team-Kodi Member
Posts: 2,503
Joined: May 2008
Reputation: 32
Post: #11
Be aware that the title in $$1 in CreateSearchUrl is URL-encoded, so you have to write the regexp for paul%20haggis%20%2d%20crash in this example.

So something like:
Code:
<expression noclean="1">.+?%20%2d%20(.+?)(?:%20part[1-9]%20)?$</expression>
or thereabout.
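
This also explains why the earlier whitespace-based patterns produced nothing. A small Python check, assuming XBMC's regex engine behaves like Python's re here; the input is the encoded example string given above:

```python
import re

# $$1 as described above: URL-encoded, so it contains no literal spaces or dashes.
encoded = "paul%20haggis%20%2d%20crash"

# Pattern written for the raw file name: \s and "-" never occur in the input.
plain = re.search(r".+?\s-\s(.+?)(?:\spart[1-9])?$", encoded)

# Pattern written for the encoded form.
url_aware = re.search(r".+?%20%2d%20(.+?)(?:%20part[1-9]%20)?$", encoded)

print(plain)               # None: nothing to substitute for \1
print(url_aware.group(1))  # the encoded title
```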
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #12
olympia Wrote: Be aware that the title in $$1 in CreateSearchUrl is URL-encoded, so you have to write the regexp for paul%20haggis%20%2d%20crash in this example.

So something like:
Code:
<expression noclean="1">.+?%20%2d%20(.+?)(?:%20part[1-9]%20)?$</expression>
or thereabout.

Wow, that's really helpful! Thanks for the tip, I'll try.
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Thumbs Up   
Post: #13
Finally I got everything working the way I want. Many thanks to olympia and bambi73. Without you I would never have sorted this out.

I'm posting my results here in case somebody finds them useful for organizing their own movie collection.

File names should contain the movie title and the year. I tried 2 formats, and both work perfectly:
1. Sidney Lumet - Dog Day Afternoon (1975).part1.avi
2. Sidney Lumet - Dog Day Afternoon part1 (1975).avi
And the third is the obvious one, for when the movie isn't split into parts:
3. Sidney Lumet - Dog Day Afternoon (1975).avi

There are 3 very important points I've learned about XBMC scraping:

1. It automatically cuts the year and the file extension from the file name before the scraper even starts working, so
Sidney Lumet - Dog Day Afternoon (1975).avi
becomes
Sidney Lumet - Dog Day Afternoon
This is what reaches the scraper in buffer $$1 (well, not exactly this, see item 3).
The buffer $$2 will in this case contain "1975": the year with the parentheses stripped, set before the scraper even starts.

2. It automatically recognizes words like "part[1-9]" and "cd[1-9]", cuts them off, and displays several parts as one item in the movie library. No further action is required from the scraper. Thus
Sidney Lumet - Dog Day Afternoon (1975).part1.avi
and
Sidney Lumet - Dog Day Afternoon (1975).part2.avi
are scraped as one item, which is
Sidney Lumet - Dog Day Afternoon (well, not exactly this, see item 3)
in buffer $$1, before the scraper applies its regular expressions.

3. Items in $$1 reach the scraper URL-encoded and lower-cased. Thus in our example $$1 will actually contain
sidney%20lumet%20%2d%20dog%20day%20afternoon
All spaces are replaced with %20 and the dash is replaced with %2d.
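
The three points can be summarized in a toy Python model. This is only a sketch of the behaviour observed above, not XBMC's actual code: the function names preprocess and percent_encode are hypothetical, and the exact rules XBMC uses for extensions, part markers and encoding are assumptions.

```python
import re

def percent_encode(text):
    """Lower-case the text and percent-encode everything except ASCII letters and digits."""
    encoded = []
    for ch in text.lower():
        if ch.isascii() and ch.isalnum():
            encoded.append(ch)
        else:
            encoded.extend(f"%{byte:02x}" for byte in ch.encode("utf-8"))
    return "".join(encoded)

def preprocess(filename):
    """Return (title, year): roughly what the post reports for $$1 and $$2."""
    name = re.sub(r"\.[A-Za-z0-9]{2,4}$", "", filename)        # point 1: drop the extension
    name = re.sub(r"[ .](?:part|cd)[1-9]\b", "", name,
                  flags=re.IGNORECASE)                          # point 2: drop part/cd markers
    year = re.search(r"\((\d{4})\)", name)                      # point 1: the year goes to $$2
    name = re.sub(r"\s*\(\d{4}\)", "", name).strip()
    return percent_encode(name), (year.group(1) if year else "")  # point 3: encoding

for filename in (
    "Sidney Lumet - Dog Day Afternoon (1975).part1.avi",
    "Sidney Lumet - Dog Day Afternoon part1 (1975).avi",
    "Sidney Lumet - Dog Day Afternoon (1975).avi",
):
    print(preprocess(filename))
```

In this model all three naming formats end up with the same pair of buffers, which matches the observation that the parts are scraped as one item.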

I had to modify the default scrapers a little so that they work with my file naming. Here is what I have for now:

1. TMDB scraper (on Ubuntu: ~/.xbmc/addons/metadata.themoviedb.org/tmdb.xml)
Code:
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://api.themoviedb.org/2.1/Movie.search/$INFO[language]/xml/57983e31fb435df4df77afb854740ea9/\1+$$2&lt;/url&gt;" dest="3">
        <expression>.+%20%2d%20(.+)</expression>
    </RegExp>
</CreateSearchUrl>
There was an inner regexp and I removed it, because it does absolutely nothing. I added "+$$2" to the URL so that it also searches by year. This is supported functionality of the TMDB public API, so I don't know why it was not used in the default scraper. I also used my own regexp to parse the file names. I tried ".+?" instead of ".+" as bambi73 suggested, but it turned out to be too "lazy": in my tests it may capture just "D" from "Dog Day Afternoon". I'm sorry if I'm wrong here; I'm not as strong in regexps as bambi73.
Unfortunately, TMDB appeared not to contain information about some of my movies (Woody Allen - Manhatten. What the heck, is it that rare?). That's why I also used another scraper, IMDB.
EDIT: It was my typing mistake. Manhatten should be ManhattAn. Of course, TMDB could have found the misspelled title too, but anyway I'm happy it was found at all :)
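
The one-letter capture is what a lazy group does when nothing after it forces it to grow. A Python sketch, assuming XBMC's regex engine follows the same Perl-style rules; the input is the encoded $$1 value from item 3:

```python
import re

encoded = "sidney%20lumet%20%2d%20dog%20day%20afternoon"

# Lazy title group with only an optional group after it: one character is
# already a valid match, so the engine stops there.
unanchored = re.search(r".+?%20%2d%20(.+?)(?:%20part[1-9])?", encoded)

# The same pattern anchored with $ has to stretch the lazy group to the end.
anchored = re.search(r".+?%20%2d%20(.+?)(?:%20part[1-9])?$", encoded)

# The greedy pattern used in the scraper above.
greedy = re.search(r".+%20%2d%20(.+)", encoded)

print(unanchored.group(1))
print(anchored.group(1))
print(greedy.group(1))
```

The $ at the end of olympia's expression serves this purpose; since XBMC already removes the part marker, the plain greedy pattern gives the same title here.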


2. IMDB scraper (on Ubuntu: ~/.xbmc/addons/metadata.imdb.org/imdb.xml)
Code:
<CreateSearchUrl dest="3" SearchStringEncoding="iso-8859-1">
    <RegExp input="$$1" output="&lt;url&gt;http://akas.imdb.com/find?s=tt;q=\1$$4&lt;/url&gt;" dest="3">
        <RegExp input="$$2" output="%20(\1)" dest="4">
            <expression clear="yes">(.+)</expression>
        </RegExp>
        <expression noclean="1">.+%20%2d%20(.+)</expression>
    </RegExp>
</CreateSearchUrl>

Again, the same regexp, but now there is also an inner one for the year. It was already there and I didn't touch it, though I find it strange to use an inner regexp that just adds %20 before the year and stores it in buffer $$4, when you could add "%20($$2)" to the URL directly. Anyway, the scraper works 95% of the time for me, so please consider making this change yourself if you need it.
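
How the nested RegExp feeds the outer one can be illustrated with a toy model in Python. This is not XBMC's implementation: run_regexp and create_search_url are hypothetical helpers, the url wrapper is omitted, and the rule that a non-matching expression leaves its buffer empty is an assumption.

```python
import re

def run_regexp(expression, output, source):
    """Fill \\1..\\9 in `output` from the first match in `source`; "" if nothing matches."""
    match = re.search(expression, source)
    if match is None:
        return ""
    return re.sub(r"\\([1-9])", lambda ref: match.group(int(ref.group(1))) or "", output)

def create_search_url(title, year):
    buffers = {1: title, 2: year, 3: "", 4: ""}
    # Depth-first: the inner RegExp runs first and fills buffer 4 from buffer 2.
    buffers[4] = run_regexp(r"(.+)", r"%20(\1)", buffers[2])
    # The outer RegExp then expands $$4 in its output and fills buffer 3 from buffer 1.
    outer_output = r"http://akas.imdb.com/find?s=tt;q=\1" + buffers[4]
    buffers[3] = run_regexp(r".+%20%2d%20(.+)", outer_output, buffers[1])
    return buffers[3]

title = "sidney%20lumet%20%2d%20dog%20day%20afternoon"
print(create_search_url(title, "1975"))
print(create_search_url(title, ""))
```

In this model the inner RegExp contributes nothing when no year is available, whereas a literal "%20($$2)" would leave "%20()" in the URL. Whether XBMC behaves the same way is not confirmed in this thread.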

If this information is useful and somebody can point me to the corresponding wiki page, I can add it there. Or somebody can do it for me if I cannot access that wiki.
(This post was last modified: 2011-04-10 22:58 by nucleo.)
olympia Offline
Team-Kodi Member
Posts: 2,503
Joined: May 2008
Reputation: 32
Post: #14
Actually, both (the tmdb year addition and the imdb inner regex) are good catches. Thank you for sharing. I will tune the official scrapers accordingly.

Not sure why the year was not added to the tmdb search URL before. One possibility is that the API did not support it yet at the time the scraper was written.