Culturalia.net scraper
#1
Hi there,

After many, many years of enjoying XBMC, I think I can contribute something that Spanish speakers will like.

I have created (and am still working on) a scraper for the Spanish cinema site http://www.culturalia.net

With it you can get Spanish thumbs, plots and all the other fields included in the XBMC database.

I have contacted Culturalia.net's administrator asking for permission to make this scraper public.

Until that happens (hopefully soon), whoever wants to try the scraper must contact me and ask to join the beta testing group.

Before going into detail on the features of the scraper, let me state something here.
This scraper is one full day's work, starting with reading about scrapers, XML and regex (I had never used them in my life), so don't expect it to be perfect. Don't even expect it to work at all... That way, anything you get will be very welcome.

Of course the code will need much improvement, as this is my first attempt at doing something useful with XML.
But the good news is that I learn fast and I have a lot of spare time. So expect improvements. I have taken a lot from XBMC. Now it is my time to give back...

That being said... The current version of the scraper (internally I call it beta1) can get data from culturalia.net for the following fields:

<Title> - Because I like it this way, I have entered "Title (OriginalTitle)" here, since I cannot see the original title any other way.

<Originaltitle> - I didn't find this in the documentation (http://www.xboxmediacenter.com/wiki/inde...craper.xml) but in a post over here, so I thought it might be useful in the future.

<year>
<director>
<Credits> - This is the writer information; just the first one. I think XBMC can only handle one.

<plot>

<mpaa> - Here I enter the information about who is allowed to see the film. Because I don't see this information displayed anywhere, I have also entered it in the <outline> field (culturalia has no info for that and I didn't want to repeat the long plot field). This will probably be removed once the beta stage is over.

<outline> - see <mpaa>

<runtime> - in minutes

<rating> and <votes> - This info in culturalia is very similar to IMDB's, so it fits quite well. During beta testing, due to some problems I'm having (see "known bugs"), these fields are replicated in <tagline>. In the final version this will be removed.

<tagline> - culturalia.net has no data for this field (taglines are not very common in Spanish advertising). See <rating> and <votes> for its current use during beta testing.

<genre> - finally Spanish genres (I've been waiting for this a loooong time)

<thumb> - Spanish poster (first time available on XBMC). Thanks Culturalia for this...

<actor> - list of actors. There is no role field in culturalia, so no role info is filled in.

Nothing else for the moment. But I think it is enough for starters, as it covers all the fields visible in current XBMC versions.


KNOWN BUGS

Of course, when you start a new program you also start a list of bugs, and this one won't be the exception. Some of the bugs listed here might be caused by the XBMC build I'm using (rev7767), so I'm trying to update as I write (and you read). At this moment these are the known bugs for culturalia_beta1.xml:

1. The <Votes> field shows wrong numbers even though I have checked that I'm parsing the field correctly from culturalia's page (you can see it in the <tagline> field). It might be a problem with my XBMC build, as this also happens with the IMDB scraper.
2. Problem with Spanish special characters (as in French, we use more than the ASCII characters). I don't have much experience with code pages in XML, but I have tried setting the encoding to UTF-8 in the XML header. I don't know if this is supposed to work. According to another post about the filmweb scraper, which had the same problem (and according to spiff), this might be solved in rev7829 and above... so again, we'll see when I can update my XBMC.
Anyway, it is weird that Spanish characters show perfectly fine in the Genre field... I haven't looked into it much yet. We'll see after the update.
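
One possible explanation for garbled accented characters, sketched in Python purely as an illustration. It assumes the site serves ISO-8859-1 bytes, which nothing in this thread confirms, and the sample text is made up. Declaring UTF-8 in a header does not change the bytes themselves, so they have to be decoded with the codec they were written in.

```python
# Hypothetical sample text; assumes (unconfirmed) that the site serves ISO-8859-1.
plot = "Película española: acción y corazón"

# Bytes as they might arrive from the web server under that assumption.
raw = plot.encode("iso-8859-1")

# Treating those bytes as UTF-8 garbles every accented letter.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the original codec and re-encoding gives real UTF-8 output.
right = raw.decode("iso-8859-1")
utf8_bytes = right.encode("utf-8")

print(wrong)
print(right)
```

If the bytes really are Latin-1, the first line printed shows replacement characters where the accents should be, and the second shows the text intact.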

MISSING/INCOMPLETE

1. I have done nothing with the NfoUrl, as I'm not using nfo files for this. But I guess I will add it when someone asks.
2. Any other fields not listed here (documented or not) are, of course, also missing. Again, no problem adding anything on request if the data is available in culturalia.

What I'm not going to do (for the moment) is take requests to adapt this to other sites.

And that is all for the moment. Now I have to get as many people as possible to test it, and then fix what they will surely find.

Hope I get a positive response from Culturalia ASAP to make the beta available for everybody.

Thanks for taking the time to read this and don't hesitate to contact me if you want to test this.

Finally, I want to apologize for my poor English. But that is why I needed this scraper, isn't it?

Best regards,

Jurrabi.
Always read the XBMC Online Manual,Frequently Asked Questions and search the forum before posting.
For troubleshooting and bug reporting use -> Log file.

#2
use <title> for the spanish and <originaltitle> for the english - that's how it is supposed to be.

<director>, <credits>, <genre> and <actor> can have multiple tags.

and cheers - keep pushin :)
#3
spiff Wrote:use <title> for the spanish and <originaltitle> for the english - that's how it is supposed to be.

Yes. For the moment I'm adding the original title to Title just to see it in the main view. But if the scraper gets published I'll do it the right way.

spiff Wrote:<director>, <credits>, <genre> and <actor> can have multiple tags.
I didn't know that for credits. I'll change it. The rest are already OK.

spiff Wrote:and cheers - keep pushin :)

Always, my friend. And thx for your comments.

Jur.
#4
last minute update.

Still no response from Culturalia.net people... keep waiting my friends.

I just updated to XBMC rev 7841 (the MCE1.0 build) and now I'm in the middle of extensive testing.

The Spanish characters problem is solved (thx to spiff), but the votes issue is still there. Any ideas? I'll keep looking.

Last night I also detected issues with some films that didn't get a plot, actors or thumb. Today I'll recheck against the info in culturalia.net. It might be that those films don't have that info in the site's database. I'll keep you posted. But you'll be glad to know that 95% of movies worked just fine.

Maybe it's worth mentioning that I'm searching by the Spanish title. This sometimes keeps XBMC from finding the movie at the first attempt, mainly when the file is named after the original title (not surprising). In these cases you have to use the manual search and enter the Spanish title. I can live with that, but this behaviour would improve if it were possible to add a second search URL (for the original title). Is that possible? As I say, I can live with this. Can you?

Thanks for your time,

Jur.
#5
24/02/2007 11:00 GMT
--------------------
Changes in beta2:

BUG FIXES
1. <Actor> was not filled correctly for actors that don't have a personal profile. (FIXED)
2. <Genre> was not being filled correctly when the genre had more than one word (science fiction) (FIXED)
3. <Director> field only got the first one. (FIXED)
4. <Director> Field. Fixed problem if Director does not have profile (not tested as I don't know any real example) (FIXED)
5. <Credits> field only got the first one. (FIXED)
6. <Credits> Field. Fixed problem if writer does not have profile (not tested as I don't know any real example) (FIXED)
7. <Plot> wasn't filled correctly when text formatting tags were used in the plot (<i></i>). Because I'm not a better regexp programmer, I had to make a compromise here. Now it works if it finds formatting, but it won't work if it finds an equals sign (=). If someone finds a film with this problem, let me know and I'll try to find a better solution. (Partially FIXED)
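
For item 7, here is a minimal Python sketch of one way to keep plot text that contains both inline formatting tags and an equals sign. It is not the scraper's actual expression (which isn't shown in this thread) and it is not XBMC scraper XML. The `<div class="plot">` container and the sample text are invented for the example.

```python
import re

# Hypothetical markup; the real culturalia.net page layout is not shown here.
html = '<div class="plot">Un hombre <i>normal</i> descubre que 2 + 2 = 5.</div>'

# Capture everything inside the assumed plot container, non-greedily,
# instead of excluding specific characters such as "<" or "=".
match = re.search(r'<div class="plot">(.*?)</div>', html, re.DOTALL)
raw_plot = match.group(1) if match else ""

# Drop inline tags such as <i> and </i> while keeping the text between them.
plot = re.sub(r"</?[a-zA-Z][^>]*>", "", raw_plot)

# Collapse any leftover runs of whitespace.
plot = re.sub(r"\s+", " ", plot).strip()

print(plot)
```

The idea is to capture up to a known closing delimiter first and strip tags in a second pass, so neither formatting nor "=" needs special handling in the capture itself.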


KNOWN BUGS

2. The <Votes> field doesn't show correctly in XBMC rev7841. I'm pretty sure this is not a problem with the scraper, but I'll keep looking.

3. When a film has no thumb in culturalia.net, Get Thumb (the menu option on the film info page) offers the thumb of the previously searched film as the IMDB Thumb (time to change the texts, don't you think?). I have to figure out whether this is a problem with XBMC or has something to do with buffer clearing within the scraper.

TO DO LIST.

Apart from solving the bugs, I want to look at the following (priority listed in brackets):

1. Look at the NfoUrl tags. As I'm not using NFO files to help the search I'm not very interested in this... are you? (LOW)

Best regards,

Jur.
#6
jurrabi: The NfoUrl is very easy; it will be a one-liner. Just have a look at this:

PHP Code:
<NfoUrl dest="3">
    <RegExp input="$$1" output="http://akas.imdb.com/title/tt\1/" dest="3">
        <expression noclean="1">imdb.com/title/tt([0-9]*)</expression>
    </RegExp>
</NfoUrl>
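
As a rough illustration of what that expression does, the same match can be tried outside XBMC in Python. This is only a sketch: the .nfo contents and the ID tt0123456 are placeholders, and the dot is escaped here for strictness, which the snippet above does not do.

```python
import re

# Hypothetical .nfo contents; the title ID is a placeholder, not a real lookup.
nfo_text = "Some Movie (2007)\nhttp://www.imdb.com/title/tt0123456/\n"

# Same pattern as the <expression> above, with the dot escaped.
match = re.search(r"imdb\.com/title/tt([0-9]*)", nfo_text)

# \1 in the scraper's output attribute corresponds to group(1) here.
url = f"http://akas.imdb.com/title/tt{match.group(1)}/" if match else None

print(url)
```

A culturalia version would presumably swap in that site's URL pattern for both the expression and the output.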

Keep up the good work!
#7
Thanks DonJ.

Right now I'm writing some short documentation for beta testers and will make Beta 2 available during the day.
But that will go into Beta 3 for sure.
#8
So, here it is. The Beta 2 of Scraper Culturalia goes public. Download here.

I still have no response from culturalia, but I want to get testers, so... If they deny permission I'll take it down.

In the attached file you can find:

1. Culturalia_beta2.xml - The scraper itself
2. Culturalia_beta2.gif - Logo for selection screen
3. Lee ESTO primero.txt - Instructions (in Spanish, sorry)
4. changelog.txt - as it says...


Please post feedback in this thread.

Thanks in advance,

Jur.
#9
OK. Now that Beta2 is on the road, it is time to get moving towards beta3...

Maybe DonJ you can give me a hand with the NFOUrl stuff.

I have entered the appropriate code in the xml and created some sample nfo's, but I cannot get it to work.

It seems that the scraper is not seeing the NFO file at all. According to this Doc, NFO files take precedence even over scraper settings, but I failed to make it work.
I have also failed in my tests with other scrapers that have NFO functionality implemented.

Is there maybe a minimum XBMC revision where the NFO stuff starts to work?

Thanks in advance.

Jur.
#10
You probably have to update xbmc, I only added the feature 3 days ago or so. Wait for the next t3ch release or build one yourself..
#11
Thanks DonJ,

It's been a while since I last built an XBMC revision. And now with the new installation of Vista... I'd better wait. ;)

Thanks for your response.

Jur.
#12
I just tested the nfo functionality on rev7937 and it works perfectly.

Thanks DonJ.

P.S. Beta3 is due tomorrow and it is right on schedule.
When it's released I will post it here so everybody has a chance to test it.
#13
I have a small bug related to the special Spanish characters.

Everything works fine in the movie card, but in the selection screen (when the search finds more than one result) the localized characters appear as squares...

Any chance this is a problem with XBMC? 'Cause I don't see a way to fix it in the scraper.

Thanks in advance,

Jur.
#14
DonJ,

I also find that the NfoUrl functionality doesn't work when the movie is a directory (not a file).
It doesn't work with the NFO file inside the directory, named after the avi, nor with it outside the directory, named after the directory.

Maybe it will be easier to understand with sample data (sorry about my English):

Movie file in Directory C:\Movies\Saw\saw.avi

Choice 1: NFO file in C:\Movies\Saw.nfo -> It doesn't get it.
Choice 2: NFO file in C:\Movies\Saw\saw.nfo -> It doesn't get it either.

Any ideas?

Thanks in advance,

Jur.
#15
oh darn missed one utf8 or not spot. will fix