Kodi Community Forum

Full Version: Clean scraping API
For those who know, and those who don't: I will be working throughout the summer on a new generalized scraper API for xbmc. I will create it outside xbmc and in Python, which will hopefully make it usable in xbmc quickly.

As a first step I've created a script which gathers scraped data; I hope it will be released into the official repository soon. The data this script generates will be used a) to give us an idea of how well the old engine works and b) as data to train a new engine on. All data will be anonymous, and I will write a blog post about it when it hits the official repository.

So, to highlight what I want to achieve with a new engine and what I think is important:
  • It must be generalized: adding a new media type should be trivial, and no part of the core should be bound to media types.
  • Which fields are of interest is not tied to the engine; any scraper may add metadata as it sees fit. The users of the engine (xbmc and skinners) may choose what data they understand, but scrapers can emit any type of data they want.
  • Parallelism! As much as possible needs to be parallelism-friendly, ideally not only between files but across all parts of scraping a single file too.
  • Everything is linked: a movie can have a soundtrack and a game associated with it. The director of a movie can be the singer in a band and share photographs on a site.
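
To make the four goals above concrete, here is a minimal sketch in Python. It is not the planned API: the names `title_scraper`, `link_scraper`, `scrape` and `scrape_all` are hypothetical, and the scrapers return made-up values. It only illustrates scrapers as plain callables that emit arbitrary metadata, links expressed as ordinary metadata, and parallelism both across files and across the scrapers of one file.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scrapers: plain callables that take an item description and
# return a dict of whatever metadata they can provide. Nothing here is
# tied to a media type.
def title_scraper(item):
    return {"title": item["filename"].rsplit(".", 1)[0]}

def link_scraper(item):
    # Links to other items are just more metadata (illustrative values).
    return {"related": [("soundtrack", item["filename"] + ":ost")]}

def scrape(item, scrapers):
    """Run every scraper for one item in parallel and merge the results."""
    merged = {}
    with ThreadPoolExecutor() as pool:
        for result in pool.map(lambda scraper: scraper(item), scrapers):
            merged.update(result)
    return merged

def scrape_all(items, scrapers):
    """Scrape many items in parallel; results keep the order of items."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda item: scrape(item, scrapers), items))

results = scrape_all(
    [{"filename": "Alien.mkv"}, {"filename": "Doom.iso"}],
    [title_scraper, link_scraper],
)
```

Because the engine in this sketch only merges dicts, a new media type or a new field needs no change to `scrape` or `scrape_all`; the consumer decides which keys it understands.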

I will go over my plan here in more detail later, but first I'd love to know from all current scraper creators what features they would like to see and, perhaps even more important, what features they like about the old system.

I hope it's going to be a great summer!
All I want is to stay 100% compatible with the old one :)
Having the ability to add custom metadata on top of this would be just wonderful.
topfs2, sounds very exciting, will definitely hand over my data.

I realize this is probably outside of the scope of your project, but one thing which would be quite useful imo and might link into your work, is to make the "Scan for new contents" function (which triggers scraping) more customizable i.e. make the scanner a type of service addon!

Scanning for new media on start-up usually slows down my atv2 significantly and takes ages as my NAS is sizeable. Hence, being able to customize the scanning process would be extremely helpful. The scanner could for instance only monitor certain folders, make use of file system specific features in order to pick up new files more efficiently etc. etc.
maybe also take parental control into account, like having scrapers for just movie ratings. But as I understood the new engine would easily allow this.
First, thanks for the suggestions and the start of the discussion! I appreciate it very much!

(2012-06-14, 22:38)olympia Wrote: [ -> ]All I want is to stay 100% compatible with the old one :)
Having the ability to add custom metadata on top of this would be just wonderful.

Well, I don't think I'll make it backwards compatible at the core, as this would be much too limiting for the design. However, I'm sure we will make a wrapper from the old to the new; I even suggested it in my GSoC proposal. I understand that losing all the old scrapers, or forcing developers to convert, is a bit too big of a burden :) However, I hope that everyone will want to move to the new API :P

Anyway, this is why I need to know what was good about the old one, as it might be an all-or-nothing choice for scrapers: either they use the old API (which we wrap) or they use the new one.

(2012-06-15, 01:36)DonJ Wrote: [ -> ]topfs2, sounds very exciting, will definitely hand over my data.

I realize this is probably outside of the scope of your project, but one thing which would be quite useful imo and might link into your work, is to make the "Scan for new contents" function (which triggers scraping) more customizable i.e. make the scanner a type of service addon!

Scanning for new media on start-up usually slows down my atv2 significantly and takes ages as my NAS is sizeable. Hence, being able to customize the scanning process would be extremely helpful. The scanner could for instance only monitor certain folders, make use of file system specific features in order to pick up new files more efficiently etc. etc.

I'm not 100% sure I follow; I'd love some more examples. What I'm not sure about is whether you want the configuration as an xbmc user, as xbmc (the code using this engine), or as a scraper developer. I haven't 100% decided what will trigger the scanning. What I'm mostly focusing on now is what to do when you know file X exists and want to gather data on it. I'd love some thoughts on the actual scanning process too, if it's of interest for this project.

(2012-06-15, 10:27)da-anda Wrote: [ -> ]maybe also take parental control into account, like having scrapers for just movie ratings. But as I understood the new engine would easily allow this.

Hmm, this is a good question. I'm not sure where parental control should lie; it's an interesting topic I need to take into account. On the one hand, which scrapers exist is known within the engine, and which scrapers are run is definitely of interest for parental control.

I'm not sure what you mean by movie ratings. What issues are there with movie ratings? Are you referring to the comments (with bad language), or to the case where the rating itself is safe but the images or description may not be?
So you're saying ALL scrapers will need to be rewritten to accommodate the new API?
Can you already give a hint what converting involves? Is it just like changing the tags, or the whole concept will change?

Are you actually going to keep the buffer and regexp based scraper framework, or will it be something different?
(2012-06-16, 12:02)olympia Wrote: [ -> ]So you're saying ALL scrapers will need to be rewritten to accommodate the new API?
Can you already give a hint what converting involves? Is it just like changing the tags, or the whole concept will change?

Are you actually going to keep the buffer and regexp based scraper framework, or will it be something different?

Yeah, but obviously I will add code so that the old scrapers work just fine in the new engine. What I will not guarantee is that a scraper can mix and match between the new API and the old (perhaps it will work, but I won't guarantee it :P).
So the concept will change in some ways, I guess. I'm still sketching and want input on how best to do it.
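
One way such a wrapper could look is a small adapter, sketched below. Everything in it is an assumption made for illustration: `legacy_get_details` stands in for an old-style scraper and is assumed to return an XML string, and `LegacyScraperAdapter` is a hypothetical name, not part of any existing or planned API.

```python
import xml.etree.ElementTree as ET

def legacy_get_details(filename):
    # Stand-in for an old-style scraper; assumed to return an XML string.
    return "<details><title>Alien</title><year>1979</year></details>"

class LegacyScraperAdapter:
    """Wraps an old-style scraper so it behaves like a dict-emitting one."""

    def __init__(self, legacy_func):
        self._legacy_func = legacy_func

    def __call__(self, item):
        root = ET.fromstring(self._legacy_func(item["filename"]))
        # Every direct child element becomes one metadata field.
        return {child.tag: child.text for child in root}

wrapped = LegacyScraperAdapter(legacy_get_details)
metadata = wrapped({"filename": "Alien.mkv"})
```

With an adapter like this the engine would only ever see one kind of scraper, which is why mixing old and new calls inside a single scraper is a separate question from running old scrapers unchanged.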

What I want is to move away from using only regexp. Regexp is great for lots of things, but it's not the best for everything; for HTML traversal in particular there are better tools. So I want the new engine to allow for more tools, such as things similar to jQuery and BeautifulSoup. But as I said, this is why I'm making this post, so we can discuss what's wanted :)
So basically, we can have game/comic scrapers, and an extra art scraper? (like the addon used in skins). That's amazing :)
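
As an illustration of the difference, the sketch below extracts the same data from a made-up HTML snippet twice: once with a regexp and once by walking the markup. To stay self-contained it uses Python's standard `html.parser` rather than BeautifulSoup, and `ClassTextParser` is a hypothetical helper written for this example only.

```python
import re
from html.parser import HTMLParser

# Made-up page fragment for the example.
HTML = ('<div class="movie"><span class="title">Alien</span>'
        '<span class="year">1979</span></div>')

# Regexp approach: works, but depends on the exact attribute layout.
match = re.search(r'<span class="title">([^<]*)</span>', HTML)
regex_title = match.group(1) if match else None

class ClassTextParser(HTMLParser):
    """Collects element text, keyed by the element's class attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._stack = []  # class attribute of each currently open element

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute the text to the innermost open element, if it has a class.
        if self._stack and self._stack[-1]:
            self.fields[self._stack[-1]] = data.strip()

parser = ClassTextParser()
parser.feed(HTML)
parser.close()
```

The parser version does not care about attribute order or extra attributes on the tags, which is the kind of robustness a regexp has to be written very carefully to get. The sketch ignores void elements such as `<br>`, which a real traversal tool handles for you.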

I'd suggest the ability to only update 1 field, like 'update every ratings of every movie', without rescraping anything else. (that would also mean bypass the search and directly use the ID in the db).
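
A rough sketch of what such a single-field update could look like, assuming the database already stores the remote ID found during the first full scrape. The `library` layout, the IDs, the rating values and the functions `fetch_field` and `update_field` are all invented for the example.

```python
# Hypothetical library: database id -> stored metadata, including the
# remote id that was resolved during the first full scrape.
library = {
    1: {"title": "Movie A", "remote_id": "example-1", "rating": 7.0},
    2: {"title": "Movie B", "remote_id": "example-2", "rating": 7.1},
}

def fetch_field(remote_id, field):
    # Stand-in for a network lookup by id; the values are made up.
    fake_remote = {
        "example-1": {"rating": 8.5},
        "example-2": {"rating": 8.4},
    }
    return fake_remote[remote_id][field]

def update_field(library, field, fetch):
    """Refresh one field for every item, leaving all other fields alone."""
    for entry in library.values():
        # No search step: the stored remote id is used directly.
        entry[field] = fetch(entry["remote_id"], field)

update_field(library, "rating", fetch_field)
```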


Looks amazing by the way :)
(2012-06-16, 11:30)topfs2 Wrote: [ -> ]Not sure how you mean with movie ratings? What issues are there with movie ratings? Are you referring to the comments (with bad language) or that just the rating is safe but the images or description may not?
I was referring to scrapers that scrape parental ratings, like PG*, FSK* etc., so that users who need parental control can update their collection with parental metadata (whatever is out there besides PG etc.), which itself could be used for access restrictions.

Not sure if parental control of the scrapers themselves is needed, but some sort of field protection (exclude from updates) would be nice. Or maybe some kind of access restriction (so that not everybody can trigger scrapers) would be good (if not yet possible), but that's more of a core issue.
I doubt that this is the right place, but scraping information for movie sets (like fanart, poster and overview) would be a nice feature to have. If it is something that the API has to allow first, then consider it a feature request.
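
The field protection idea can be sketched as a merge step that skips protected fields. This is only an illustration; `merge_scraped` and its `protected` parameter are hypothetical names, and the stored values are examples.

```python
def merge_scraped(stored, scraped, protected=()):
    """Return stored metadata updated with scraped values.

    Fields listed in `protected` keep their stored value if they have one.
    """
    merged = dict(stored)
    for field, value in scraped.items():
        if field in protected and field in stored:
            continue  # excluded from updates
        merged[field] = value
    return merged

stored = {"title": "Alien", "certification": "FSK 16", "plot": "old plot"}
scraped = {"certification": "R", "plot": "new plot"}
merged = merge_scraped(stored, scraped, protected={"certification"})
```

Here a rescrape refreshes the plot while the manually chosen certification survives.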
Just like some of the comments here:
http://xbmc.org/topfs2/2012/06/20/gsoc-2...gathering/
I couldn't install the script through the addon browser.
After downloading it manually from here:
http://mirrors.xbmc.org/addons/eden/scri...2.scraper/

This is the error I got after starting the script:
xbmc log
Same error as above for me.
Downloading the script should work now but there are two bugs in the script. I've let someone know but it might take a bit till it's updated.

Btw, once it starts working, don't be surprised if there's no visual feedback on which control is focused. You'll start on the topmost control for movie data collection. By pressing Enter you can see the toggle button change its state. Just navigate up and down with the arrow keys, and when you're down, press right once and press Enter. It's not very user-friendly, but you'll only use it once anyway.
(2012-06-20, 10:08)Hitcher Wrote: [ -> ]Same error as above for me.

Darn, I thought I fixed the correct one. Will ask someone to push a fix.

Edit:
Correct version has been pushed to repo
(2012-06-17, 01:19)Maxoo Wrote: [ -> ]I'd suggest the ability to only update 1 field, like 'update every ratings of every movie', without rescraping anything else. (that would also mean bypass the search and directly use the ID in the db).
+1 :)

I hate it when I scrape a movie right at the beginning of its release, as the user ratings are usually quite wrong when only 10-20 people have voted. It would be great to rescrape the rating once a month or thereabouts, when more people have given their rating.

There is already an addon available that does update IMDb user ratings, but that is only for IMDb and it would be better if XBMC could do it for any movie site source.
http://forum.xbmc.org/showthread.php?tid=107331
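
The monthly refresh suggested above boils down to a staleness check per field. A minimal sketch, assuming the library records when each rating was last scraped; `needs_refresh`, the 30-day default and the timestamps are illustrative only.

```python
from datetime import datetime, timedelta

def needs_refresh(last_scraped, now, max_age=timedelta(days=30)):
    """True if the field was last scraped at least max_age ago."""
    return now - last_scraped >= max_age

now = datetime(2012, 7, 20)
ratings_scraped_at = {
    "Movie A": datetime(2012, 6, 1),   # 49 days before `now`
    "Movie B": datetime(2012, 7, 10),  # 10 days before `now`
}
stale = [
    title
    for title, when in ratings_scraped_at.items()
    if needs_refresh(when, now)
]
```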