[Proposal] Separating the scraper into a library.
#16
Just wanted to throw my 2 cents in. From an addon dev point of view, no matter how it's implemented, making the scrapers usable by python addon devs would be a huge advantage. There are multiple addons out there (the most complete of which is Eldorado's Metahandlers lib) which have to reinvent the wheel to provide a decent eye candy experience.

Currently, the addons feel like a whole separate experience from local media. I believe the difference between scripts and addons was meant to be that addons emulated local content, and therefore should be as seamless as possible. I think that enabling addons to access the same scrapers used for true local content would go a long way toward making that happen.

I admit that I don't fully understand the description and intention of your project, but from what I understand, this seems to be within scope.
#17
Smile 
(2012-03-26, 09:34)dzan Wrote: This seems like a very useful project. Again, it's mostly up to scraper developers to use it, but it might come in handy when working on the modular scraping ( I would have to ask around about depending on such a new initiative ). I could write an optional library function accepting the show's name and the desired source, which would then return a URL from that source instead of having to search the source ( e.g. tvdb ) directly. So the first 'block' of XML scrapers could use it and wouldn't need to parse each specific website anymore... Good stuff to think about Smile

If you have any questions about XEM or need a feature, feel free to contact me either here or on freenode in #sickbeard or #xem
#18
+1 for getting this done in GSoC 2012. More flexible scrapers are needed to provide a much improved meta-data experience. And isn't metadata the thing that makes media centers such a great thing?
#19
(2012-03-26, 18:47)jim0thy Wrote: Just thought I'd check. Smile You're right that it can be quite confusing for non-technically minded people to get their sources configured correctly.
No problem. Indeed it's hard for us ( geeks ) to imagine what "regular" people must go through when using some of our stuff sometimes Big Grin

(2012-03-26, 18:47)jim0thy Wrote: What would be nice would be if XBMC could return a list via an API call of all the parameters it requires from a scraper. That way a possible scraper-building front end could be developed that would allow for the end-user to provide the path to the data source's API, then match the returned values to the relevant XBMC fields. The aim being to automatically generate the scraper code for new data sources without the end-user having to get their hands dirty with the code.

Returning the parameters via an API call would allow for the front-end to keep itself in-sync with any changes to the XBMC library fields.
This is a pretty advanced idea, but I don't see why it wouldn't be possible even with the current XML scrapers. You know how an XML scraper is constructed, so one could write a GUI the way you describe to fill in the blanks in the XML template.
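
To illustrate ( a purely hypothetical sketch, no such call exists in XBMC today ), the introspection you describe could be as simple as enumerating the fields the library knows about:

Code:
#include <string>
#include <vector>

// Hypothetical sketch: a call returning the metadata fields XBMC's
// library expects for a given media type, so a scraper-builder GUI
// could keep itself in sync with library changes automatically.
std::vector<std::string> GetRequiredScraperFields(const std::string& mediaType)
{
    if (mediaType == "movie")
        return {"title", "year", "plot", "rating", "genre", "thumb", "fanart"};
    if (mediaType == "tvshow")
        return {"title", "premiered", "plot", "episodeguide", "thumb", "fanart"};
    return {};
}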

(2012-03-26, 21:09)Bstrdsmkr Wrote: Just wanted to throw my 2 cents in. From an addon dev point of view, no matter how it's implemented, making the scrapers usable by python addon devs would be a huge advantage. There are multiple addons out there (the most complete of which is Eldorado's Metahandlers lib) which have to reinvent the wheel to provide a decent eye candy experience.

Currently, the addons feel like a whole separate experience from local media. I believe the difference between scripts and addons was meant to be that addons emulated local content, and therefore should be as seamless as possible. I think that enabling addons to access the same scrapers used for true local content would go a long way toward making that happen.

I admit that I don't fully understand the description and intention of your project, but from what I understand, this seems to be within scope.
Thanks for your feedback! I already concluded from a talk with spiff that there are huge benefits to exposing the scraper lib in the addon fashion too. I haven't really looked into this yet, but I'm pretty sure that once everything is separated into a lib it wouldn't be hard to expose this functionality that way as well. So I'm cautiously saying my work would lead to what you want.

(2012-03-27, 02:34)lad1337 Wrote: if you have any questions about xem or need a feature fell free to contact me either here or on freenode in #sickbeard or #xem
Thanks! If I implement some callback to "translate" movie/show IDs in the library using XEM, I'll contact you for sure.

(2012-03-27, 12:58)Syncopation Wrote: +1 for getting this done in GSoC 2012. More flexible scrapers are needed to provide a much improved meta-data experience. And isn't metadata the thing that makes media centers such a great thing?
Thx!
#20
I've just discovered this thread having raised a 'similar' proposal in the feature suggestions thread (for non developers).

I'm a former software architect for elements of the Symbian operating system, along with being a former developer on a number of handsets for Nokia, SonyEricsson and Motorola. Sadly, like most folks I stopped coding and spent more time drawing UML, then more time in front of managers, before becoming a manager myself. Suffice to say, I don't know the XBMC code, but I'm not a total 'flake' either.

At the risk of sounding patronising, or being too simplistic, I think it's important to get right back to basics:

1) We have media on the user's system, and we 'force' the user to at least classify it as TV, Movie, Music etc. Whether this is entirely necessary is up for debate, but it's how things stand right now.

2) In some cases, we ask the user to go a little further still and help us 'classify' media with more detail by having them stick to SOME semblance of a scheme... e.g. using a filename that might help define the media, or a TV season/episode scheme that regex can parse etc.

3) Based on what we can 'deduce' from 1 and 2 we then try to obtain additional meta information and content (trailers, thumbs, fanart etc).

3a) It SEEMS that XBMC includes some 'core' functionality that attempts to find some additional meta info and content from 'local' sources - typically side by side with, or within a sub folder of, the media itself... (I'm talking about the NFO file, tbn, jpg etc.)

3b) If we can't find that 'local' stuff, or we deem there wasn't enough found, or the local stuff explicitly pointed to online resources, we then use 'scrapers' to try and get meta info and content from online resources.

My assertion is this:

3a and 3b are the same thing.... 'an attempt to obtain meta info and content'. The highly abstract concept is that XBMC is saying to a scraper:

Code:
XBMC                                 Scraper                              Level 42

This is what I know so far,
find more info please
----------------------------------->
                                     He's asked me to ask you
                                     ----------------------------------->

                                     It's a lousy movie from 1982
                                     <-----------------------------------
Lousy Movie (1982)
<-----------------------------------



How 'Scraper' finds its information should be of no concern to XBMC. XBMC only needs to provide the scraper with as much information as it possibly can in order to 'help' the scraper.
It would be a mistake to make assumptions on how Scraper might do its work... so assuming it will search online, and only providing it with simplistic hints such as 'we think the movie is called Lady In Red', isn't enough.

We COULD provide it with:
- The media resides at 'C:\my movies\lady in red\man in blue.mp4'
- We THINK the movie title is Lady In Red

This allows the scraper to think for itself!... it can go along with our suggestion of 'lady in red' OR it can attempt to be smarter and opt for 'man in blue'.
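
To put that in code terms ( a purely illustrative sketch, none of these names exist in XBMC ):

Code:
#include <string>

// Purely illustrative sketch: none of these names exist in XBMC.
// The idea is to hand the scraper everything we know and let it decide.
struct ScrapeHints
{
    std::string mediaPath;    // e.g. "C:\\my movies\\lady in red\\man in blue.mp4"
    std::string guessedTitle; // e.g. "Lady In Red" (only a guess)
    std::string mediaType;    // "movie", "tvshow", "music", ...
};

struct ScrapeResult
{
    std::string title; // e.g. "Man In Blue" if the scraper knew better
    int year = 0;
    // plot, thumbs, fanart, etc.
};

// A scraper is anything that turns hints into metadata; whether it
// looks online or at the local filesystem is its own business.
class IScraper
{
public:
    virtual ~IScraper() = default;
    virtual ScrapeResult Scrape(const ScrapeHints& hints) = 0;
};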


With me so far?

So here's my particular small suggestion in the grand scheme of things (and I'll comment more on the bigger picture later!)....

***** Make the 'local meta / content' detection a scraper like all other scrapers (putting aside the current limitations of scraper APIs). *****

The fact that that content exists in the local file system as opposed to an online resource simply doesn't matter. The local file system IS a 'scrapeable resource' and should be scraped by a scraper.

That means of course that scrapers would have to be able to exercise logic and effectively be 'executable modules' in some way. But it makes sense to me that scrapers have this ability.


Benefits:
It abstracts metadata retrieval away from core XBMC in a consistent manner through the use of 'scrapers'.
XBMC makes no assumptions whatsoever on how a scraper deduces the information
It allows (theoretically) for an entirely different local NFO / tbn / jpg scheme to be implemented as long as there's a scraper that supports it
It moves the NFO / tbn / jpg scanning functionality out of XBMC core and into a scraper

Cons:
It's probably a lot of work initially and widening of scraper capability
There's more to come.... e.g. a strategy for 'daisy chaining' scrapers so that Meta Data and content can progressively be enhanced (sequentially / via priority)
There's even scope for a more complex parallel scrape where multiple sources of Meta Data and content are collated and rationalised.

It's a lot of words, but a simple concept, and I THINK it's in keeping with the OP's line of thinking.

If it's way off, I'll gladly drop out and leave you guys to it. I'm at an 'abstract' level... you guys are at a practical level... but there's a chance somewhere in between lies perfection ;-)


I'll possibly come back with the 'daisy chaining' / sequentially scraping stuff later....
#21
First of all, thanks a lot for your reply! I really appreciate feedback like this and the opinions of experienced developers; things like this make me think about the code, see the flaws in my plan, etc...

(2012-04-07, 03:19)AnalogKid Wrote: I've just discovered this thread having raised a 'similar' proposal in the feature suggestions thread (for non developers).

I'm a former software architect for elements of the Symbian operating system, along with being a former developer on a number of handsets for Nokia, SonyEricsson and Motorola. Sadly, like most folks I stopped coding and spent more time drawing UML, then more time in front of managers, before becoming a manager myself. Suffice to say, I don't know the XBMC code, but I'm not a total 'flake' either.

At the risk of sounding patronising, or being too simplistic, I think it's important to get right back to basics:

1) We have media on the user's system, and we 'force' the user to at least classify it as TV, Movie, Music etc. Whether this is entirely necessary is up for debate, but it's how things stand right now.

2) In some cases, we ask the user to go a little further still and help us 'classify' media with more detail by having them stick to SOME semblance of a scheme... e.g. using a filename that might help define the media, or a TV season/episode scheme that regex can parse etc.

3) Based on what we can 'deduce' from 1 and 2 we then try to obtain additional meta information and content (trailers, thumbs, fanart etc).

3a) It SEEMS that XBMC includes some 'core' functionality that attempts to find some additional meta info and content from 'local' sources - typically side by side with, or within a sub folder of, the media itself... (I'm talking about the NFO file, tbn, jpg etc.)

3b) If we can't find that 'local' stuff, or we deem there wasn't enough found, or the local stuff explicitly pointed to online resources, we then use 'scrapers' to try and get meta info and content from online resources.

My assertion is this:

3a and 3b are the same thing.... 'an attempt to obtain meta info and content'. The highly abstract concept is that XBMC is saying to a scraper:

[sequence diagram from the post above]

You are right, this is more or less how it's done right now. I do think you oversimplified the scraper's role a bit in the scheme: the scraper has to determine what data from the online resource matches XBMC's request. It sort of maps "title", "plot", ... to a part of the data. So it contains a bit more intelligence than you suggest, but I'm guessing you know this and just wanted to keep things simple here.

(2012-04-07, 03:19)AnalogKid Wrote: How 'Scraper' finds its information should be of no concern to XBMC. XBMC only needs to provide the scraper with as much information as it possibly can in order to 'help' the scraper.
It would be a mistake to make assumptions on how Scraper might do its work... so assuming it will search online, and only providing it with simplistic hints such as 'we think the movie is called Lady In Red', isn't enough.

We COULD provide it with:
- The media resides at 'C:\my movies\lady in red\man in blue.mp4'
- We THINK the movie title is Lady In Red

This allows the scraper to think for itself!... it can go along with our suggestion of 'lady in red' OR it can attempt to be smarter and opt for 'man in blue'.


With me so far?

Yes I am Wink You are 100% correct in stating that XBMC shouldn't have to know or care how it gets the metadata. It should just have to pass "something" the file name, and possibly whether it's a TV show or a movie, and get the metadata in return. My whole proposal is about this "something": it would be the library I create! A total separation from the current codebase, not depending on anything else. The library would be used by the VideoInfoScanner class, for example. This class would get an instance of it initialized with the chosen scraper and then either hand it some callbacks ( if the library itself is threaded, which some said wouldn't be a good idea and I rather agree ) or do some API calls in its own thread ( which is the path I'll take, 90% sure ).

I hope that after a while the file iteration would also be separated from XBMC and become part of the lib, so that all XBMC would have to do is pass some paths to movie/show folders and register callbacks ( in that case the threading would have to be done in the lib, I'd imagine ).
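
To make that concrete, here is a very rough sketch of the kind of interface I'm thinking about ( every name here is hypothetical, nothing is final ):

Code:
#include <functional>
#include <string>
#include <vector>

// Very rough sketch of the separated library's surface.
// Every name is hypothetical; this shows the shape, not a final API.
struct Metadata
{
    // title, plot, artwork URLs, ...
};

class ScraperLib
{
public:
    explicit ScraperLib(const std::string& scraperId); // the chosen scraper

    // Synchronous call from the caller's own thread, the path I'll
    // most likely take for VideoInfoScanner:
    Metadata ScrapeFile(const std::string& path);

    // Later on: hand the lib whole source paths plus a callback and
    // let it do the file iteration (and threading) itself:
    void ScanSources(const std::vector<std::string>& sourcePaths,
                     std::function<void(const std::string& file,
                                        const Metadata& meta)> onScraped);
};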

(2012-04-07, 03:19)AnalogKid Wrote: So here's my particular small suggestion in the grand scheme of things (and I'll comment more on the bigger picture later!)....

***** Make the 'local meta / content' detection a scraper like all other scrapers (putting aside the current limitations of scraper APIs). *****

The fact that that content exists in the local file system as opposed to an online resource simply doesn't matter. The local file system IS a 'scrapeable resource' and should be scraped by a scraper.

That means of course that scrapers would have to be able to exercise logic and effectively be 'executable modules' in some way. But it makes sense to me that scrapers have this ability.

You are right that on a very abstract level using the data in the NFO and data from an online source makes no difference. Both are paths that end up with XBMC acquiring metadata about a file. However, extreme abstraction is never very good in my opinion; in this case, and for my idea, there really is a big difference... the availability, speed and all the other factors related to the use of a network connection. These things are of no importance when using local data. Also, there isn't a lot of local data to work with when new movies come in, so I don't really see how it would benefit the end-user or scraper developers much to have a scraper for it. Also ( I'm not entirely sure ) I think currently both sources are used in conjunction, but you would probably cover this with the daisy chaining.

All in all I think you are right on a software architecture level about treating local data the same but I have my questions about how far this concept should be followed for overall convenience in the actual code.

(2012-04-07, 03:19)AnalogKid Wrote: Benefits:
It abstracts metadata retrieval away from core XBMC in a consistent manner through the use of 'scrapers'.
XBMC makes no assumptions whatsoever on how a scraper deduces the information
It allows (theoretically) for an entirely different local NFO / tbn / jpg scheme to be implemented as long as there's a scraper that supports it
It moves the NFO / tbn / jpg scanning functionality out of XBMC core and into a scraper

The first two benefits would be achieved by the library anyway, since it's a separate library. About the last benefit, I already expressed my doubts about its usefulness, but if you replace the word "scraper" with "library" the proposal achieves it too. About the possibility of implementing an entirely different local scheme... I really don't see the use, but since this would also be part of the library it wouldn't be that hard to do anyway, and if there is real demand for it I could abstract this a bit more in the lib so it becomes even simpler and allows different schemes without treating them like remote scraping.

(2012-04-07, 03:19)AnalogKid Wrote: Cons:
It's probably a lot of work initially and widening of scraper capability
I wouldn't overestimate the extra work this brings either; if the back-end is adapted a bit and XML scrapers can take "file:///" URLs ( after all, those are UNIFORM resource locators ) it's definitely doable. I'm just not really convinced it's worth it.

(2012-04-07, 03:19)AnalogKid Wrote: There's more to come.... e.g. a strategy for 'daisy chaining' scrapers so that Meta Data and content can progressively be enhanced (sequentially / via priority)
There's even scope for a more complex parallel scrape where multiple sources of Meta Data and content are collated and rationalised.

It's a lot of words, but a simple concept, and I THINK it's in keeping with the OP's line of thinking.

If it's way off, I'll gladly drop out and leave you guys to it. I'm at an 'abstract' level... you guys are at a practical level... but there's a chance somewhere in between lies perfection ;-)

I'll possibly come back with the 'daisy chaining' / sequentially scraping stuff later....
Parallel scraping of different parts of the metadata could be done with my initial suggestion for "modular scrapers" too. I am very curious about what exactly you mean by "daisy chaining" scrapers, so please continue your post!

Thanks again for your input! It's a good idea and I'll ask the XBMC developers' opinion on it too.

Greets
#22
Good points; a good chunk of them are exactly how things almost work today.
E.g. chaining and sequential scraping is the core idea of the XML scrapers, and adding logic around them to concatenate results from several would be 15 minutes of coding.
I never went there because what we have suffices for me, and the community has focused on reimplementing things in (insert various managed languages).

I strongly disagree on one point. To me, reading nfo files etc. is completely separate from a scraper library. The latter takes data in one format and converts it to another, while nfo files are already formatted data. Which shows in the current code, in that the only difference between the nfo code path and the scrape path is the scraping, plus the 5 lines or so to load the .nfo.
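
Roughly ( a simplified sketch, not the actual XBMC code ):

Code:
#include <string>

struct Metadata { /* title, plot, ... */ };

// Hypothetical helpers, only for the sketch:
std::string FindNfoFor(const std::string& file);
Metadata ParseNfo(const std::string& nfoPath);
Metadata RunScraper(const std::string& file);

// Simplified sketch (not the actual XBMC code): the nfo path and the
// scrape path differ only in the scraping step itself.
Metadata GetMetadata(const std::string& file)
{
    const std::string nfo = FindNfoFor(file); // the "5 lines or so"
    if (!nfo.empty())
        return ParseNfo(nfo);  // the nfo is already formatted data
    return RunScraper(file);   // scraping converts one format to another
}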

Anyways, just wanted to point out that what is there already is quite thought through, just underutilized. Secondly, sentences like 'it is probably a lot of work to make this happen' can be extremely hurtful to the community and can easily lead to duplicated and unnecessary work (in this context). If you don't know how the current code works, don't pretend you do. Because you portray yourself as an expert, you will be interpreted as one, no matter the number of half-hearted disclaimers Wink
#23
I think what AnalogKid is getting at as far as scraping nfo and other local data is that regardless of the source, metadata is metadata. Say you use program X to manage a series or two for whatever reason. Program X saves said metadata locally (whether nfo or not is unimportant). You could then write a scraper for this local metadata. Additionally, you could scrape metadata from another source or two and prioritize them.

Maybe you like titles from thetvdb.com, while tvrage.com has your preferred ratings and program X gets the thumbnails you want. You should be able to rank scrapers for specified fields (scrapers should of course declare in their programming what fields they generate), and it shouldn't matter if the data is local or not. I know that's an advanced example, but you could include a generic priority as well that automatically fills out the field priority.
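
In rough code ( made-up names, just to show the idea; "programX-local" being the local-metadata scraper ):

Code:
#include <map>
#include <string>
#include <vector>

// Made-up names, just to show the idea: per-field scraper ranking,
// where a local source ("programX-local") ranks like any other scraper.
using FieldName = std::string;
using ScraperId = std::string;
using ScraperResults = std::map<ScraperId, std::map<FieldName, std::string>>;

std::map<FieldName, std::vector<ScraperId>> fieldPriority = {
    {"title",  {"thetvdb.com", "tvrage.com", "programX-local"}},
    {"rating", {"tvrage.com", "thetvdb.com"}},
    {"thumb",  {"programX-local", "thetvdb.com"}},
};

// The first scraper in a field's list that produced a value wins.
std::string Resolve(const FieldName& field, const ScraperResults& results)
{
    for (const ScraperId& scraper : fieldPriority[field])
    {
        auto byScraper = results.find(scraper);
        if (byScraper == results.end())
            continue;
        auto value = byScraper->second.find(field);
        if (value != byScraper->second.end() && !value->second.empty())
            return value->second;
    }
    return ""; // no scraper produced this field
}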

Anyways, that's my 2 cents.
#24
I do believe 'metadata is metadata'.

I think the perception of 'scraper' can be slightly varied too... some see it as a parser to 'scrape' websites for key information (or, if you're lucky, a nicely structured XML), whilst others see it as a 'metadata search engine' that MAY parse a website or XML doc, or might just scour a filesystem or database for it.

On a side note: if you're going to quote somebody, or even just paraphrase, be fair about it. Nobody said 'it's probably a lot of work to make this happen'. The phrase was 'it's probably a lot of work initially'. And if someone says 'it's probably easy' - what then? A criticism about trivialising it?
And, if such phrases can easily result in duplicated work, then I would vehemently argue that a developer ought to think things through prior to jumping in and coding something.

A discussion forum for ideas is just that, an activity PRIOR to jumping in. If a comment in a forum results in developers coding stuff - then more fool them.

You have a choice - assume anybody trying to offer some input is an egotist passing themselves off as an 'expert' (and that their disclaimers are merely a ruse), OR accept input for what it is... it might have some value, might be a million miles wide of the mark, or might just be assigned to the junk folder.
There's no 'kudos' to be gained by trying to be an expert here, and I'm surprised anybody might have a mindset that would consider it.

Anyway, back on topic...

I disagree that the reading of an NFO file is different from 'scraping'. A website is formatted data too, just not necessarily in a format convenient for a machine reader to grab info from without some smart parsing and, in some cases, reliance on 'probability'. NFO just happens to be ONE format that's widely promoted (and adopted) by the community.
I don't have any desire to change that - seems to work for 99% of cases, it's just an example of local metadata working well hence being a prime candidate as a reference 'local scraper'.

Why would you ever have a different local scraper? Well, one example might be that you're migrating from another system and the metadata of that system can also be found in the filesystem, just with a different structure and different filenames.

You might also want to make the filesystem 'scrape' act in the same manner as any other scraper so that it can be placed at any location in 'the chain'... rather than taking maximum priority (NFO > IMDB > TMDB), it can be shuffled (potentially by the user) to (IMDB > TMDB > NFO). (To eternalsword's point, that could in theory go right down to field level.)


#25
No. NFOs aren't some random file format; they're a dump of the data returned from the scrapers.
#26
(2012-03-25, 15:17)DonJ Wrote: I think a 'smart' file scanner would definitely benefit the average xbmc user the most. The learning barrier to set up paths is too big a hurdle to make use of the library feature for some.

I disagree that the biggest problem is defining paths. I can't imagine anyone having music, videos and pictures all in one folder. Usage of subfolders is a different question altogether.

What I see as problematic is the absence of a "media sources" item in settings, with pre-defined sections for Movies, Music, TV, Music videos and Videos (videos that don't fit in any of the main categories) libraries.

You may argue that this is more or less how it is now, but the difference is that the current situation enforces a files vs library duality (changed for the better in Eden), and if metadata is not found, an item will not show in your library at all. IMHO, XBMC should use _only_ the library for browsing media: if you want to take a look at your media "from the files level" you would still use movies.db with the filter "By Folder".

So if you have Movies, you add folder(s) to this section. If you define a metadata agent as well, it will try to fetch info from the web. If you don't, it should still add all items found in that folder to the library, using file names as item labels and populating info extracted from the files. You can still filter this content by duration, resolution, or by (sub)folder:

- process given folder and add items to appropriate library (Movies,TV, Music, Music Videos, Videos)
- add data found in files
- add to that any additional data found in folder (folder.jpg, NFO files)
- if one is defined, try to fill up any _missing_ data using a scraper - i.e. don't fetch duration if it was read from the file
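
In rough code, that "fill only what's missing" rule is trivial ( hypothetical names ):

Code:
#include <string>

// Sketch of the "fill only missing data" rule: values already read
// from the file or its folder are never overwritten by the scraper.
// Hypothetical names throughout.
struct Item
{
    std::string label;    // defaults to the file name
    int durationSecs = 0; // read from the file itself
    std::string plot;     // possibly from a local NFO
};

void FillMissing(Item& item, const Item& scraped)
{
    if (item.plot.empty())
        item.plot = scraped.plot;
    if (item.durationSecs == 0)
        item.durationSecs = scraped.durationSecs;
    // and so on for every other field: scraper data only fills gaps
}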

#27
(2012-04-08, 10:14)spiff Wrote: No. NFOs aren't some random file format; they're a dump of the data returned from the scrapers.

You are making an assumption that the NFO is generated by XBMC's scrapers, which is not necessarily the case. I have one series I manage with Sickbeard because, for whatever reason, XBMC wouldn't scrape it right. So Sickbeard generates the NFO, and I then pull it into XBMC. While Sickbeard does acquire all the info I want, that's not always guaranteed to be the case. Granted, in this case fixing the scraper would be a better resolution, but that might not always be true when dealing with third-party applications. If pulling data from the NFO is outside the scope of scraping, it wouldn't be possible to chain it with additional metadata from another source.

Think also of the fields in the NFO that are manually entered, for example the sorttitle field. None of the scrapers that I've seen ever generate this field. Maybe I write a little application with which I can fill out this field, or whatever other fields, prior to loading into XBMC or whatever other media application utilizes this proposed library. If you don't do this first, you end up having to export the currently compiled data in individual NFO format (at this time, that means exporting the entire library), remove the affected file from the library, edit the NFO for that file, and re-add the file. Having the NFO be part of the chain would simplify the process.
#28
(2012-04-08, 13:41)pecinko Wrote:
(2012-03-25, 15:17)DonJ Wrote: I think a 'smart' file scanner would definitely benefit the average xbmc user the most. The learning barrier to set up paths is too big a hurdle to make use of the library feature for some.

I disagree that the biggest problem is defining paths. I can't imagine anyone having music, videos and pictures all in one folder. Usage of subfolders is a different question altogether.

What I see as problematic is the absence of a "media sources" item in settings, with pre-defined sections for Movies, Music, TV, Music videos and Videos (videos that don't fit in any of the main categories) libraries.

You may argue that this is more or less how it is now, but the difference is that the current situation enforces a files vs library duality (changed for the better in Eden), and if metadata is not found, an item will not show in your library at all. IMHO, XBMC should use _only_ the library for browsing media: if you want to take a look at your media "from the files level" you would still use movies.db with the filter "By Folder".

So if you have Movies, you add folder(s) to this section. If you define a metadata agent as well, it will try to fetch info from the web. If you don't, it should still add all items found in that folder to the library, using file names as item labels and populating info extracted from the files. You can still filter this content by duration, resolution, or by (sub)folder:

- process given folder and add items to appropriate library (Movies,TV, Music, Music Videos, Videos)
- add data found in files
- add to that any additional data found in folder (folder.jpg, NFO files)
- if one is defined, try to fill up any _missing_ data using a scraper - i.e. don't fetch duration if it was read from the file

Thanks for your input! It further confirms that I should do the "smart file recursion" only if there is time left, and focus on extending scraper functionality first. About your suggestion of removing the difference between a library view and a file view: this is beyond the scope of my proposal. I myself find the file view rather useful sometimes, but I can see the improvement in what you suggest too. I remember reading somewhere about an option doing just what you said: putting "unmatched movies" in the library too. I don't recall if this was some skin's option or an addon or just my dreams, but I do remember such a thing.


About the other subject:
I'm just new to XBMC's code, so judging this isn't really my place. It's not part of my initial proposal, but if it were decided that it's better this way ( treating the local file system as just another metadata source through a scraper ) I would of course do it. My personal opinion is that on a very abstract level it's indeed the same thing, but on a code level it wouldn't make sense to treat it the same. It's just the same "idea", if you ask me.
#29
I don't really know the current scraping code, but my 2 cents about the proposal for modular scrapers:
It sounds like a nice idea to be able to compose your own scraper out of a set of small scraping modules (rating from imdb, plot from wikipedia, etc.)
However, I don't think this is interesting for 99% of the users and it will make the interface much more complex. XBMC is supposed to be controlled with a remote from the couch, and I can't imagine this making things more user friendly. Furthermore, a lot of people think the current scraping is slow. I don't know if there is any specific part taking up most of the time (xml parsing, url I/O, ...), but this sounds like it would make things even slower. (Being able to scrape off-site would of course overcome this problem for users with a NAS).

So I think a little more thought needs to be spent on the costs of this flexibility. I only see benefits listed in the first post, but adding flexibility almost always comes with some drawback in performance, ease-of-use etc.
But again, I don't know the current scraping code well.

What I do have enough knowledge about are recommender systems: From what I understand you want to use 2-way communication for feedback to train a recommender system. But all applicable methods I can think of suffer from cold-start. A movie library of the average user (I guess that would be around 200 movies) probably wouldn't be large enough to get a good recommendation model (or there should be really clear patterns, e.g. all movies share a common actor, only original movies and no remakes, ... but I guess not a lot of libraries show this kind of clear patterns).
#30
First of all, thanks a lot for your feedback. I do think you are getting the wrong impression Big Grin

(2012-04-09, 14:47)sebak Wrote: I don't really know the current scraping code, but my 2 cents about the proposal for modular scrapers:
It sounds like a nice idea to be able to compose your own scraper out of a set of small scraping modules (rating from imdb, plot from wikipedia, etc.)
However, I don't think this is interesting for 99% of the users and it will make the interface much more complex. XBMC is supposed to be controlled with a remote from the couch, and I can't imagine this making things more user friendly.

I do think it will be useful for a meaningful portion of the users. I don't know if English is your mother tongue, but if it's not, you probably have a different source for plots than most scrapers use. You would want the benefits of some of the excellent scrapers, but combined with another one for the plots, taglines, ...

A second benefit from this approach, which I think is far greater, is the fact that you will be able to rescrape only certain parts of the metadata. An update for the IMDB scores, new background fanart, ... without having to redownload it all. This wouldn't be a hard GUI to grasp: just a dialog with all modules and an update button next to each one.

Scrapers would still come as a "whole" too. The difference being that with some extra XML tags and attributes they'd now be more of a "group of scrapers". Something like this, simplified:
Code:
<Scraper>
    generic data

    <PlotScraper>
        plot regexes, ...
    </PlotScraper>

    <PosterScraper>
        regexes and such
    </PosterScraper>

    ...
</Scraper>

So for normal users there would be no difference: they just select one scraper. For advanced users a dialog would exist somewhere with a list of the modules and a dropdown of available scrapers for each ( gathered from all installed ones ).
They would also be able to download specific module scrapers.
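
Under the hood, that advanced-user dialog could boil down to something as simple as this ( a sketch, hypothetical names ):

Code:
#include <map>
#include <string>

// Sketch, hypothetical names: each metadata module is mapped to the
// scraper the advanced user picked for it; normal users never see this.
std::map<std::string, std::string> moduleChoice = {
    {"plot",   "wikipedia"},   // e.g. plots in your own language
    {"rating", "imdb"},
    {"poster", "themoviedb"},
};

// The "update" button next to one module re-runs only that module,
// so e.g. IMDB scores can be refreshed without redownloading it all.
void UpdateModule(const std::string& module)
{
    const std::string& scraper = moduleChoice[module];
    (void)scraper; // run just that sub-scraper, write back only its fields
}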

I think these are some of the great benefits this would bring, and it would definitely be implementable without extra complexity for normal users.

(2012-04-09, 14:47)sebak Wrote: Furthermore, a lot of people think the current scraping is slow. I don't know if there is any specific part taking up most of the time (xml parsing, url I/O, ...), but this sounds like it would make things even slower. (Being able to scrape off-site would of course overcome this problem for users with a NAS).

I'm not yet an expert on the code myself, but I did have some in-depth looks at the scraper stuff and I am pretty sure this won't significantly increase the scraping time. In the end the same amount of data is downloaded; it's just done more configurably. Existing scrapers don't always scrape from one resource either, so they already have to set up connections with different hosts. Also, I don't think users would mind if scraping took 10%-20% longer. It's not something they do every minute.

(2012-04-09, 14:47)sebak Wrote: So I think a little more thought needs to be spent on the costs of this flexibility. I only see benefits listed in the first post, but adding flexibility almost always comes with some drawback in performance, ease-of-use etc.
But again, I don't know the current scraping code well.

What I do have enough knowledge about are recommender systems: From what I understand you want to use 2-way communication for feedback to train a recommender system. But all applicable methods I can think of suffer from cold-start. A movie library of the average user (I guess that would be around 200 movies) probably wouldn't be large enough to get a good recommendation model (or there should be really clear patterns, e.g. all movies share a common actor, only original movies and no remakes, ... but I guess not a lot of libraries show this kind of clear patterns).

I too have knowledge of recommender systems ( I did a project on them for university and had to read a lot of research papers ) and you are only partially right about the cold-start problem. There are solutions and workarounds. A recommender system is not the focus of this project; it's an example of what 2-way communication would allow us to do. I don't think XBMC will ever build one, but existing ones could use XBMC to acquire more data. One user's library will indeed probably not suffice, but that's the reason a feedback channel would be implemented. If we would just use the user's local library for him alone, this wouldn't be necessary anyway. But please note, it's just an example of the possible usages; I see it being used much sooner for authentication on some metadata providers.
