OFDB scraper
#1
Hi,

This is my first "release" of the ofdb (German version of imdb) scraper.

The main features work, but there are some problems:

- Umlauts are not readable (??site encoding??)

- I parse the genres into individual tags (not just one genre tag; maybe one could change this in the scraper parser code (in databases you should not store lists in attributes Smile)). If this will not be changed, then I will change the parser...

- atm no original title, since there is no tag for it
- mpaa is only fetched if the movie was shown in cinemas... (addition here: maybe some possibility to check whether a regex failed. example:

PHP Code:
<regex name="theregextocheck" .../>
<regex name="aName" condition="theregextocheck"> <!-- this one will only be called if the condition matches -->
cheers morte
Reply
#2
i have no idea how ofdb expects its urls to be encoded. it certainly does not use normal url encoding (umlauts SHOULD be encoded, but it does not accept that).

several <genre> tags, sure. i don't see why this is easier to do than the current / separated list though. as for storing them like that in the db, it's to speed up the queries; constructing the / separated string each time takes time....

just add the original title as a tag, we will get it in there eventually.

as for those conditions, they are easily simulated using two regexps + clear="#ofbuffer". let one regexp grab the conditional block, clearing the buffer no matter what. then the next expression will either have nothing to look through (and hence fail), or you did grab something during the first expression so they will succeed....
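to make the two-step idea concrete, here is a rough sketch in Python rather than scraper XML. the page markup (`<div class="cinema">`), the rating pattern and the function name are made up for illustration; only the control flow matters: the buffer is reset no matter what, so the second expression can only match if the first one grabbed something.

```python
import re

def extract_rating(page_html):
    """Simulate a conditional regexp with two steps and one buffer.

    The markup and patterns below are hypothetical, not taken from ofdb.
    """
    # Step 1: reset the buffer unconditionally, then fill it only on a match.
    buffer = ""
    block = re.search(r'<div class="cinema">(.*?)</div>', page_html, re.S)
    if block:
        buffer = block.group(1)

    # Step 2: search the buffer only. If step 1 grabbed nothing,
    # the buffer is empty and this expression fails as well.
    rating = re.search(r"FSK\s*(\d+)", buffer)
    return rating.group(1) if rating else None
```

a page without the cinema block yields `None` even if the rating text appears elsewhere on the page, which is the behaviour the condition was supposed to give.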
Reply
#3
encoding the urls is one thing... but the other thing is that you can't read the results. in the list where you choose the movie, the umlauts work like a charm, but e.g. the plot is not readable if there are umlauts in it.

the thing with the genre is given through a db design pattern... it is normal not to store lists in an attribute. it also should make the genre-queries easier, since you do not have to split the result of the genre but only to iterate through all genre tags to find the movies with the needed genre...
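a small sketch of what i mean, using Python's sqlite3 module. the table and column names are invented for this example and are not the actual XBMC video database schema; the point is only that each genre is its own row, so a genre lookup is a plain join and no string has to be split.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE genre (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE movie_genre (movie_id INTEGER, genre_id INTEGER);
""")

def add_movie(title, genres):
    """Insert a movie and link it to one row per genre."""
    movie_id = conn.execute(
        "INSERT INTO movie (title) VALUES (?)", (title,)
    ).lastrowid
    for name in genres:
        # Create the genre only if it does not exist yet.
        conn.execute("INSERT OR IGNORE INTO genre (name) VALUES (?)", (name,))
        genre_id = conn.execute(
            "SELECT id FROM genre WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute("INSERT INTO movie_genre VALUES (?, ?)", (movie_id, genre_id))
    return movie_id

def movies_with_genre(name):
    """Return all titles tagged with the given genre, sorted by title."""
    rows = conn.execute(
        """SELECT m.title FROM movie m
           JOIN movie_genre mg ON mg.movie_id = m.id
           JOIN genre g ON g.id = mg.genre_id
           WHERE g.name = ? ORDER BY m.title""",
        (name,),
    )
    return [row[0] for row in rows]
```

the trade-off spiff mentions still applies: showing all genres of one movie as a single string needs an extra query or a join, which a pre-built / separated string avoids.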

the original title will be added, and the condition thing will be tested Smile

thanks for the response

morte
Reply
#4
spiff: Any chance of integrating this ofdb scraper into SVN when it's stable and finished? Or should users download this scraper themselves?

Don't know how it is planned for all future scrapers?

sCAPe
Reply
#5
yes ofc we will stick it in svn.

everything deemed to be of high enough quality will be stuck in svn. the more the merrier
Reply
#6
So here is the version with the original title (<originaltitle>)!

It also reorders the title: Matrix, the => The Matrix
If someone does not want this behaviour, tell me and I'll write another version..
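For anyone curious, the reordering amounts to roughly the following, sketched in Python instead of scraper regexps. The article list and the function name are my own assumptions for this example; the scraper itself may cover more or fewer cases.

```python
import re

# Assumed set of trailing articles; extend as needed.
ARTICLES = ("the", "a", "an", "der", "die", "das")

def reorder_title(title):
    """Turn 'Matrix, the' into 'The Matrix'; leave other titles alone."""
    title = title.strip()
    # Greedy match: everything up to the last comma, then one trailing word.
    match = re.match(r"^(.*),\s*(\w+)$", title)
    if match and match.group(2).lower() in ARTICLES:
        return f"{match.group(2).capitalize()} {match.group(1)}"
    return title
```

Titles without a trailing ", article" part are returned unchanged, so something like "Die Hard" is not touched.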

CYA Morte
Reply
#7
Quote:<RegExp input="$$1" output="&lt;horst&gt;\1&lt;/horst&gt;" dest="8">
.. horst Big Grin

Thanks
asciii
Reply
#8
well, that's some debugging tag... forgot to delete it.. but as long as it works... Smile

Rolleyes AND NO: MY NAME IS NOT HORST! Big Grin

cheers morte
Reply
#9
THX!
I'll try it later this evening and report back!

BTW: Horst is a nice name for a debugging tag Wink Are there Karl and Hans in the script, too *LOL* jk
Reply
#10
Quote:encoding the urls is one thing... but the other thing is that you can't read the results. in the list where you choose the movie, the umlauts work like a charm, but e.g. the plot is not readable if there are umlauts in it.
That's the only "bug" I noticed so far. Nice scraper!
Reply
#11
Revision: 7816
http://svn.sourceforge.net/xbmc/?rev=7816&view=rev
Author: spiff_
Date: 2007-02-12 10:06:31 -0800 (Mon, 12 Feb 2007)

Log Message:
-----------
fixed: various encoding related stuff in scraper code. should fix ofdb.

Revision: 7817
http://svn.sourceforge.net/xbmc/?rev=7817&view=rev
Author: spiff_
Date: 2007-02-12 10:22:03 -0800 (Mon, 12 Feb 2007)

Log Message:
-----------
added: original title column to video database (still not accessible from ui).

Revision: 7819
http://svn.sourceforge.net/xbmc/?rev=7819&view=rev
Author: spiff_
Date: 2007-02-12 12:26:44 -0800 (Mon, 12 Feb 2007)

Log Message:
-----------
changed: allow multiple <genre>, <director> and <credits> tags in scraper xml output.

now, what did you have in mind. just an info label or something more substantial?

also; please confirm my fixes, then feel free to hand me a dehorstified version and i'll add it to svn.

suggested change:

<!--Genre-->
<RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="5+">
<expression repeat="yes">view.php\?page=genre&amp;Genre=[^&quot;]+&quot;&gt;([^&lt;]*)&lt;</expression>
</RegExp>
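
for reference, the same extraction written out in Python, with the xml entities decoded. this is only a sketch: the sample html in the usage note is made up, and i am assuming that repeat="yes" applies the expression to every match and that dest="5+" appends each result to buffer 5.

```python
import re

# The <expression> above with &amp; &quot; &gt; &lt; decoded.
GENRE_RE = re.compile(r'view.php\?page=genre&Genre=[^"]+">([^<]*)<')

def genre_tags(page_html):
    """Emit one <genre> tag per match, like a repeating, appending RegExp."""
    return "".join(
        f"<genre>{name}</genre>" for name in GENRE_RE.findall(page_html)
    )
```

feeding it two genre links such as `<a href="view.php?page=genre&Genre=Action">Action</a>` and the same for Thriller gives two separate `<genre>` tags instead of one / separated string.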
Reply
#12
Hi,

an infolabel should be adequate for the original title.

THANKS for all the fixes!!
(i'll test it asap)

Quote:dehorstified version
Laugh Laugh Laugh

the version without horst is attached Smile

cheers morte
Reply
#13
Now the scraper seems to work!
Reply
#14
Hi,

@morte0815

Great Work!!!
A few hours ago I noticed that spiff made some changes to get ofdb working, and I thought I would make a second try at getting my own scraper working, but a second SVN checkout showed me that you've already done it!

And it works fine... with some problems for me... Huh

The plot (parsing of Inhaltsangabe.htm) does not work for all movies.

It works for only 1 out of 9 movies from 2007
and 6 out of 16 from 2006
and 6 out of 16 of 2006

Movies that don't work, for example (2006):
"Catch a Fire", "Haus am See", "Departed", "Das Parfum", "Dead or Alive",
"Spiel der Macht"
The plot html page for these movies seems to be a bit different, but it's too late for me now to have a deep look...

@morte0815 & spiff

There are sometimes problems with the minus character '-'.
Some movies can only be found by removing the '-' from the search string, even though the title on ofdb contains it....
Example: "Departed - Unter Feinden" will only be found if I remove the '-'. Then "Departed - Unter Feinden (2006)..." is found. In the browser search it works with the '-'. Could this be a problem with different but identical-looking characters?
Perhaps it is possible to remove these characters from the search string; this should fix the problem...
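
A rough sketch of what I mean, in Python. The list of dash look-alikes is my own guess at likely candidates (hyphen, en dash, em dash, minus sign and so on), and I have not verified which character actually causes the failure:

```python
import re

# ASCII hyphen plus Unicode look-alikes (U+2010..U+2014 and the minus sign).
DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2212"

def strip_dashes(query):
    """Replace dash-like characters with spaces and collapse whitespace."""
    without = re.sub(f"[{re.escape(DASHES)}]", " ", query)
    return re.sub(r"\s+", " ", without).strip()
```

With this, "Departed - Unter Feinden" becomes "Departed Unter Feinden" before it is sent to the search, whichever dash variant was typed.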

There are problems with articles like "Der" and "Das" too.
Example: "Das Haus am See" will only be found if I manually remove the "Das". Then the movie is found: "Haus am See, Das / Lake House, The (2006)". This doesn't work in the browser either, so I think that's a problem with ofdb...

BTW:
Will the moviename.nfo file in the movie folder be taken into account for the URL search?

Disclaimer:
This is no criticism. I'm only trying to help turn this great scraper into a fantastic scraper... Laugh

Thanks,
StompSC
Reply
#15
1. thanks for your help. Big Grin

2. the problems:
I simply didn't test movies where a line break occurred in the plot. So here is an updated version.

The thing with "der, die, das, the": I know of the problem, but I do not have any idea how to fix this... I tried simply cutting off those words if they occur at the beginning, but then it didn't find "die hard". So if you have an idea please let me know.
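
One direction that might be worth trying, sketched in Python: search with the full title first and only fall back to the version without the leading article if nothing is found. The function name and the article set are hypothetical, and whether the scraper format allows such a fallback search at all is an open question.

```python
# Assumed list of leading articles that ofdb moves to the end of the title.
ARTICLES = {"der", "die", "das", "the"}

def search_candidates(title):
    """Return search strings in the order they should be tried."""
    candidates = [title]
    first, _, rest = title.partition(" ")
    # Only add the stripped variant as a fallback, never as a replacement.
    if first.lower() in ARTICLES and rest:
        candidates.append(rest)
    return candidates
```

"Die Hard" would still be tried as "Die Hard" first, and "Das Haus am See" would fall back to "Haus am See" when the first search returns nothing.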

the "-"-problem: i will try to fix this.

so long

morte!
Reply
 