Help needed with development of new scraper for filmdelta.se (Swedish Movie Scraper)?
#1
Question 
Hi all.
I'm a complete scraper-making newbie, so I hope my questions below aren't too simple; that would make me feel stupid. I tried asking on the Swedish XBMC forum at xbmc.nu, but it seems I need international help :-)

I'm trying to write a scraper for the site filmdelta.se, which is the biggest movie database with Swedish movie information. These are my concrete problems right now:

First problem: I can't seem to get thumbs working. My scraper returns a valid URL to a JPEG image (I tried downloading it with Firefox, wget and curl), but XBMC somehow doesn't accept it. From my xbmc.log:

Code:
20:29:42 T:2185611600 M:3250929664    INFO: Creating thumb from: http://www.filmdelta.se/functions/download.php?id=41432 as: special://masterprofile/Thumbnails/Video/b/bd0e3e2f.tbn
20:29:42 T:2185611600 M:3250929664   DEBUG: FileCurl::Open(0x7f9b82458c90) http://www.filmdelta.se/functions/download.php?id=41432
20:29:42 T:2185611600 M:3250929664    INFO: easy_aquire - Created session to http://www.filmdelta.se
20:29:43 T:2185611600 M:3250475008   DEBUG: FileCurl::Close(0x7f9b82458c90) http://www.filmdelta.se/functions/download.php?id=41432
20:29:43 T:2185611600 M:3250475008    INFO: Creating album thumb from memory: special://masterprofile/Thumbnails/Video/b/bd0e3e2f.tbn
20:29:43 T:2185611600 M:3250475008    INFO:   msg: PICTURE::CreateThumbnailFromMemory: Unable to determine image type.
20:29:43 T:2185611600 M:3250475008   ERROR: PICTURE::CreateAlbumThumbnailFromMemory: exception: memfile FileType: .php

The .tbn file seems to be deleted immediately, giving me no real chance of checking what XBMC actually received...

Second problem: For movies where I don't get an exact hit on filmdelta.se (i.e. searches that return a whole list instead of a single movie), XBMC claims that my GetSearchResults returns nothing, though when I try the function myself (in Nicezia's ScraperXML editor) it works. Below is a snippet from xbmc.log when XBMC is trying to pinpoint "Sagan om ringen" (which is Swedish for Lord of the Rings).

Code:
07:29:54 T:995793232 M:2989572096   DEBUG: DoScan Scanning dir '/home/daniel/fakemovies/Sagan om ringen/' as not in the database
07:29:54 T:995793232 M:2989572096   DEBUG: Hash[movies,/home/daniel/fakemovies/Sagan om ringen/]:DB=[],Computed=[0FD468F7F7EF9C804D7EE4DDECC0F23B]
07:29:54 T:995793232 M:2989572096   DEBUG: No NFO file found. Using title search for '/home/daniel/fakemovies/Sagan om ringen/sagan.avi'
07:29:54 T:995793232 M:2989572096   DEBUG: InternalFindMovie: Searching for 'sagan om ringen' using filmdelta scraper (file: 'filmdelta.xml', content: 'movies', language: 'sv', date: '2009-07-23', framework: '1.0')
07:29:54 T:995793232 M:2989572096   DEBUG: FileCurl::Open(0x365fe60) http://www.filmdelta.se/search.php?string=sagan%20om%20ringen&type=movie
07:29:59 T:995793232 M:2988548096   DEBUG: FileCurl::Close(0x365fe60) http://www.filmdelta.se/search.php?string=sagan%20om%20ringen&type=movie
07:29:59 T:995793232 M:2988548096   DEBUG: scraper: GetSearchResults returned <results></results>
07:29:59 T:995793232 M:2988548096   DEBUG: Not adding item to library as no info was found :(
07:29:59 T:995793232 M:2988548096   DEBUG: DoScan - Finished dir: /home/daniel/fakemovies/Sagan om ringen/

I'll probably have more questions later. And my xml is probably full of errors. But this is a starting point...

I don't see a function for uploading files here. I guess the easiest way to get hold of my xml is to look in this thread at xbmc.nu:

http://www.xbmc.nu/index.php?option=com_...=3&id=2155

(my latest version is on page 2, post 2167)

/Daniel
#2
For your first issue, XBMC doesn't like that the URL has a .php extension (even though the image downloads successfully).

You should probably file a trac issue.

As for not uploading here, just use pastebin and provide the link.
42.7% of all statistics are made up on the spot

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
#3
Okay, I may have fixed the image loading: if it doesn't recognize the extension, it should now probe the buffer to determine the type of file.

Not sure how I can test, but if you can update to revision 21864 and test, that would be great.
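
For anyone wondering what "probing the buffer" can look like, here is a minimal Python sketch. It is not the actual XBMC change (that lives in the XBMC source at revision 21864); the function names and the list of signatures are assumptions made up for illustration. The idea is to trust a known extension first and otherwise look at the first few bytes of the downloaded data.

```python
import os
from typing import Optional
from urllib.parse import urlparse

# Extensions we are willing to trust without looking at the data.
KNOWN_EXTENSIONS = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png",
                    ".gif": "gif", ".bmp": "bmp"}

# Leading "magic" bytes of a few common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
]


def sniff_image_type(data: bytes) -> Optional[str]:
    """Guess the image format from the start of the buffer, or None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def image_type(url: str, data: bytes) -> Optional[str]:
    """Use the URL extension if it is a known one, else probe the buffer."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext in KNOWN_EXTENSIONS:
        return KNOWN_EXTENSIONS[ext]
    return sniff_image_type(data)
```

With the thumb URL from the log above, the extension is ".php", so the lookup falls through to the byte check, and a buffer starting with the JPEG marker is still recognized.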
#4
tslayer Wrote:Okay, I may have fixed the image loading: if it doesn't recognize the extension, it should now probe the buffer to determine the type of file.

Not sure how I can test, but if you can update to revision 21864 and test, that would be great.

Yes, that did the trick. Thank you! Now I realize, though, that the images at filmdelta aren't high enough resolution to be useful in XBMC...

My second problem remains though. Anybody?

/Daniel
#5
Sorry, can't help with scrapers.

Have you tried to write your scraper using the new 3rd party tool that was created to help with the development of scrapers?
#6
I won't dig to help ;-) Make the file easily available and I'll see if I can help resolve your issues.
#7
tslayer Wrote:Have you tried to write your scraper using the new 3rd party tool that was created to help with the development of scrapers?

That's the tool I'm using. And that's the tool whose test function claims that my function works...

spiff Wrote:I won't dig to help ;-) Make the file easily available and I'll see if I can help resolve your issues.

Ok, ok... http://pastebin.com/mb357c73

Guess you'll have to do some digging anyway though to find out what's wrong ;-)

/Daniel
#8
Added a slightly modified version (credits, fanart and indentation) to this ticket: http://trac.xbmc.org/ticket/6958. Poster download works fine using SVN revision 21864 or later (thanks to tslayer).
Always read the online manual (wiki), FAQ (wiki) and search the forum before posting.
Do not PM or e-mail Team-Kodi members directly asking for support. Read/follow the forum rules (wiki).
Please read the pages on troubleshooting (wiki) and bug reporting (wiki) before reporting issues.
#9
Daniel Malmgren Wrote:That's the tool I'm using. And that's the tool whose test function claims that my function works...



Ok, ok... http://pastebin.com/mb357c73

Guess you'll have to do some digging anyway though to find out what's wrong ;-)

/Daniel

I would personally suggest using (.*) sparingly, as it is greedy and can pick up extra stuff. If you must use .*, try (.*?), which is lazy and stops at the first place where whatever follows it can match. (.+) and (.+?) work the same way, except that there has to be at least one character.

However, I'm not sure why my program would return something while XBMC returns nothing, as I'm using the same regular expression options that XBMC uses.

EDIT: I tested on my end (with the pastebin link you posted) and ScraperXML returns no results as well (though I'm working with a more recent version than the one up for download, which I'll upload today).
However, it doesn't seem to be the (.*), as the expression doesn't match anything at all on the downloaded page (http://www.filmdelta.se/search.php?strin...type=movie)
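
To make the greedy-versus-lazy advice above concrete, here is a small sketch using Python's `re` module rather than the XBMC scraper engine, so treat it as an illustration of general regex behaviour only. The HTML strings are invented for the example and are not taken from filmdelta.se.

```python
import re

# Two links on one line, which is where a greedy group overshoots.
html = '<a href="/filmer/1">First</a><a href="/filmer/2">Second</a>'

# Greedy: (.*) runs to the LAST place where '">' can still match.
greedy = re.findall(r'<a href="(.*)">', html)

# Lazy: (.*?) stops at the FIRST place where '">' can match.
lazy = re.findall(r'<a href="(.*?)">', html)

# Negated class: cannot cross a quote, so it never overshoots.
negated = re.findall(r'<a href="([^"]*)">', html)

# (.*?) accepts an empty value, (.+?) needs at least one character.
empty = '<a href="">x</a>'
star = re.findall(r'<a href="(.*?)">', empty)
plus = re.findall(r'<a href="(.+?)">', empty)

print(greedy)   # one oversized capture spanning both links
print(lazy)     # one capture per link
print(negated)  # same as lazy here
print(star, plus)
```

The negated character class is often the safest of the three when the capture is delimited by a single known character such as a quote.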
ScraperXML Open Source Web Scraper Library compatible with XBMC XML Scrapers


I Suck, and if you act now by sending only $19.95 and a self addressed stamped envelop, so can you!

#10
Nicezia Wrote:I would personally suggest using (.*) sparingly, as it is greedy and can pick up extra stuff. If you must use .*, try (.*?), which is lazy and stops at the first place where whatever follows it can match. (.+) and (.+?) work the same way, except that there has to be at least one character.

However, I'm not sure why my program would return something while XBMC returns nothing, as I'm using the same regular expression options that XBMC uses.

EDIT: I tested on my end (with the pastebin link you posted) and ScraperXML returns no results as well (though I'm working with a more recent version than the one up for download, which I'll upload today).
However, it doesn't seem to be the (.*), as the expression doesn't match anything at all on the downloaded page (http://www.filmdelta.se/search.php?strin...type=movie)

Ok, I'll try tidying up the usage of ".*"

I'm quite sure that isn't the problem though. The expression that should return something from that url is
Code:
<li><a href="http://www.filmdelta.se/filmer/([^"]*)">([^<]*)</a>
, which doesn't contain any ".*" ugliness.
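
As a quick way to try that expression outside XBMC, here is a minimal Python sketch. The pattern is copied as posted above; the sample markup, the movie paths and the function name are assumptions invented for the example, since the real search page is not reproduced in this thread.

```python
import re

# The expression exactly as posted above.
PATTERN = re.compile(
    r'<li><a href="http://www.filmdelta.se/filmer/([^"]*)">([^<]*)</a>'
)

# Hypothetical markup in the shape the expression expects.
sample = (
    '<ul>'
    '<li><a href="http://www.filmdelta.se/filmer/12345/example_one/">Example One</a></li>'
    '<li><a href="http://www.filmdelta.se/filmer/67890/example_two/">Example Two</a></li>'
    '</ul>'
)


def search_results(html):
    """Return (title, path) pairs; group 1 is the path, group 2 the title."""
    return [(title, path) for path, title in PATTERN.findall(html)]


print(search_results(sample))
print(search_results('<div>markup in some other shape</div>'))
```

An empty list from the second call is the Python-side equivalent of the `<results></results>` line in the log: the page was downloaded, but nothing on it matched.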

I'll try again with your new editor version when you upload it and see if I find the error...

/Daniel

edit: Ok, I found the error. They have obviously redone the search results page on filmdelta.se since I wrote my first expression, and stupidly enough I had only tested against a page I saved earlier. That's why my tests worked and XBMC's didn't :-)
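
One way to catch this kind of mistake earlier is to run the same expression against both the saved copy and a fresh download and compare. The sketch below is only an illustration: the helper names are made up, the two HTML strings are invented stand-ins for the old and new page layouts, and the UTF-8 decoding in `fetch` is an assumption about the site.

```python
import re
import urllib.request

# Same expression as in the post above.
PATTERN = re.compile(
    r'<li><a href="http://www.filmdelta.se/filmer/([^"]*)">([^<]*)</a>'
)


def fetch(url):
    """Download a page as text (encoding is assumed; bad bytes are replaced)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fixture_is_stale(saved_html, live_html, pattern=PATTERN):
    """True when the pattern matches the saved copy but not the live page."""
    return bool(pattern.search(saved_html)) and not pattern.search(live_html)


# Invented stand-ins: an old list-style layout and a redesigned one.
saved = '<li><a href="http://www.filmdelta.se/filmer/12345/example/">Example</a></li>'
live = '<div class="hit"><a href="http://www.filmdelta.se/filmer/12345/example/">Example</a></div>'

print(fixture_is_stale(saved, live))
```

In real use, `live` would come from `fetch()` with the search URL shown in the log, and `saved` from the file on disk.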
#11
Uhm, what's the current problem with the scraper? The modified version I've attached to the ticket (see above) is working fine for me.
#12
Daniel Malmgren Wrote:edit: Ok, I found the error. They have obviously redone the search results page on filmdelta.se since I wrote my first expression, and stupidly enough I had only tested against a page I saved earlier. That's why my tests worked and XBMC's didn't :-)

The nature of web-scraping :-P
The page is always bound to change ;-)
#13
Ok. Quite a small effort involved in fixing that one. Now "Sagan om ringen" actually returns quite a list of hits. My XBMC chose the wrong one though. From the wiki (at http://wiki.xbmc.org/?title=Scrapers) I understand that if GetSearchResults returns more than one result, the user gets to choose which one is correct. When does this choosing happen? I've never actually seen it.

New xml at http://pastebin.com/m648a3981

/Daniel
#14
let me be the first to say the obligatory;

doh!
#15
It happens if you do a manual lookup using 'movie information' from the context menu.