Clean scraping API
#31
(2012-07-07, 12:43)topfs2 Wrote: You can now follow the work in https://github.com/topfs2/heimdall

I'm closely following your work on heimdall. It seems you have made a lot of progress recently.
I have a rough idea of how scraping sources work, but no idea how heimdall works with xbmc.
Are you developing a separate xbmc branch that can work with heimdall? If so, will you open that branch soon as well?


#32
(2012-08-22, 03:28)kimp93 Wrote:
(2012-07-07, 12:43)topfs2 Wrote: You can now follow the work in https://github.com/topfs2/heimdall

I'm closely following your work on heimdall. It seems you have made a lot of progress recently.
I have a rough idea of how scraping sources work, but no idea how heimdall works with xbmc.
Are you developing a separate xbmc branch that can work with heimdall? If so, will you open that branch soon as well?

I actually haven't connected it to xbmc in any capacity. The reason for this is simple time constraints: I wanted to focus all possible time on heimdall and not xbmc :) I have had my mentor steering me on what the xbmc side needs from the API, and most of the functions ought to work if implemented. Another reason is that we don't want only xbmc to use the API; the end goal is that any application could use heimdall for finding information about an item, i.e. all nfo editors could focus on making good UIs instead of the actual scraping process, or if they want to focus on it they can do that in heimdall and xbmc will benefit.

There are some parts which aren't fully finished yet though, namely searching for data for subjects without file identification, e.g. if xbmc wanted to retrieve new covers for the artist Foo Fighters (not coupled to any track or file). Most of the code does work, but the redirection isn't fully functional in all cases.
If you have problems please read this before posting

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


"Well Im gonna download the code and look at it a bit but I'm certainly not a really good C/C++ programer but I'd help as much as I can, I mostly write in C#."
#33
Thanks for making it clearer. I can see how other applications and nfo editors could benefit from heimdall.
I'm still not quite clear on how heimdall will work with xbmc.
For working with xbmc, will it run as a service addon, or will people need to install Python to run heimdall?
If I make a .py file for a third-party source, does it need to be part of heimdall, or can I simply import heimdall when I make a scraper addon?
#34
Most of that is still unknown. I've designed heimdall so that it could run as a separate process, it could also act as a library, and it could even run on a separate machine; the API will work regardless. So heimdall could run inside each application, on each system, or even be shared between applications/machines on a network.

To answer your question, how xbmc will use it is unknown. Most likely it will use it as a library first, and perhaps move it to a process later. To facilitate that, we could most likely do it as a service addon.
My main focus has been the API of the "scrapers", but I'll try to explain how the API between heimdall and xbmc might look.

In pseudo SQL:
heimdall extract [ title, date ... ] on SUBJECT with [ IMDB, TMDB .... ]

But it will most likely use some RPC, perhaps JSON RPC. An example with that would be:
Code:
{
    "jsonrpc": "2.0",
    "method": "heimdall.Extract",
    "params": {
        "properties": [
            "title",
            "date"
        ],
        "subject": {
            "http://purl.org/dc/elements/1.1/identifier": "file:///home/foo/bar.mp3"
        },
        "modules": [
            "com.imdb",
            "org.themoviedb"
        ]
    },
    "id": 1
}
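As a sketch of what a client call might look like, here is a small Python helper that builds the request payload above. The method name `heimdall.Extract` and the parameter layout are taken straight from the example; the transport (HTTP, socket, in-process) was never decided in this thread, so this only constructs and prints the JSON.

```python
import json

def build_extract_request(properties, subject, modules, request_id=1):
    """Build a heimdall.Extract JSON-RPC 2.0 payload (hypothetical API)."""
    return {
        "jsonrpc": "2.0",
        "method": "heimdall.Extract",
        "params": {
            "properties": properties,
            "subject": subject,
            "modules": modules,
        },
        "id": request_id,
    }

request = build_extract_request(
    ["title", "date"],
    {"http://purl.org/dc/elements/1.1/identifier": "file:///home/foo/bar.mp3"},
    ["com.imdb", "org.themoviedb"],
)
print(json.dumps(request, indent=4))
```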

If you want to extend heimdall, it is still somewhat unclear how this will work. The idea is that heimdall will have its own addons/modules. So xbmc will ask heimdall what addons it has and showcase them in the GUI, so that the user can "install" IMDB scraping, for example. It might be that xbmc/nfo editors could send modules (it's Python, after all), but that might pose a security risk, so it's undecided.

EDIT: So in short, if you want to extend heimdall and add new modules, you'll most likely submit those to heimdall directly rather than the XBMC repo.

Reply
#35
What is the status of this project? Is it finished or on hold right now or has discussion and/or implementation just moved to another location?

I am asking because I am at the beginning of rewriting all my scrapers for Rom Collection Browser. I had a first look at heimdall and I am quite impressed. If you want to have a look at my first attempts you can monitor them here: https://github.com/maloep/heimdall/tree/master/src . Comments are welcome, especially if you see anything where I may have misread the design goals of your API.

I am also unsure how to integrate this into RCB. I have added some features to RCB's current scraping process that I don't want to miss in the future (name guessing, interactive scraping, ...). So I am also struggling to keep my scrapers generalized enough that other clients could make use of them.

It may be too early to answer these things but if you are interested I might come up with some thoughts/questions when I make progress with my implementation.
#36
LONG LIVE HEIMDALL may your WRATH DO BATTLE WITH UNWORTHY SCRAPERS APIS
#37
(2013-03-03, 10:10)malte Wrote: What is the status of this project? Is it finished or on hold right now or has discussion and/or implementation just moved to another location?

I am asking because I am at the beginning of rewriting all my scrapers for Rom Collection Browser. I had a first look at heimdall and I am quite impressed. If you want to have a look at my first attempts you can monitor them here: https://github.com/maloep/heimdall/tree/master/src . Comments are welcome, especially if you see anything where I may have misread the design goals of your API.

I am also unsure how to integrate this into RCB. I have added some features to RCB's current scraping process that I don't want to miss in the future (name guessing, interactive scraping, ...). So I am also struggling to keep my scrapers generalized enough that other clients could make use of them.

It may be too early to answer these things but if you are interested I might come up with some thoughts/questions when I make progress with my implementation.

Cool, I actually did a trial at a gamesdb scraper as well, but I didn't commit it for some reason. Looks like yours is far more than what mine was anyway :)
I think this could be a perfect starting point to get heimdall into xbmc, tbh: make it available as a library for scripts and we take it from there.

Regarding the status, I've been rather swamped at work sadly, so no real progress.

If it would help, I could probably make it available as a library in the xbmc repo. Would it?
Reply
#38
hehe i woke the dragon

I think a library in the official repo would be a good thing. It would get the project lots of exposure, and hopefully encourage pull requests on github for new scrapers and code. From my work with facebook's Open Graph, using basic graph algorithms on categorized nodes, I'm pretty interested in this kind of thing.

I've already got a pretty big project under my belt. When I need a break, I'll head over this way to help with XBMC integration, bug fixes, and scraper development, especially seeing how much (possibly useless) data/connections we might be able to harvest from Facebook.
#39
heimdall as a library in the xbmc repo would be a good thing in general, but it's not absolutely necessary for me at the current stage. I guess when I have finished the scraper, integrated it into RCB, and want to submit it to the official repo, that would be the point where having heimdall in the repo already would help. In the meantime I can just reference it locally as I do now. But as garbear said, it may encourage others to start working with it.

As far as I have tested it, I am quite happy with the current status. I did not run into any bugs or anything like that. But I am just using heimdall as a kind of task-processing and threading API atm, and for that it already works well :)

For integration in RCB or other clients I guess there are three ways of doing it (or a combination of all three):
1. heimdall as a complete standalone project. Like in your pseudo SQL statement above, a client could invoke heimdall via RPC or in a command-line style with some parameters and a defined result set.
2. heimdall as task processing API including scrapers
3. heimdall as task processing API without scrapers

I guess you have no. 1 in mind. With my current approach I think I am sitting somewhere between 2 and 3. So if I were to start integrating heimdall into RCB today, I would simply take the code from scrapegame.py and thegamesdb.py and put it somewhere inside RCB. But I will need a deeper look into all this before I start integrating anything. Chances are good that I will change my point of view along the way.
#40
Glad you like it :)

I think you highlighted the three cases I tried to design around quite well.

What I think most would use it as (and probably xbmc as well in the beginning) is 2, and with addons etc. they would use it as 3, so, like you, somewhere between 2 and 3. I mostly wanted it to have the possibility to go towards 1, as it would be incredibly nice if more applications than just xbmc used the same daemon to fetch metadata. But I guess that's a rather far-fetched project, and a project like that could easily use heimdall as 2-3 to achieve 1. So I really think 2-3 is by far the most important use case.
Reply
#41
ok, topfs2. I'm determined to do a scraper, and I'm determined to do it right.

I'm stuck on an easy concept. Basically, https://github.com/garbear/heimdall/comm...0e1f#L2R20 . I've defined the platform predicate in two namespaces and I use Heimdall for the conversion. I assume this translation part is an appropriate use of Heimdall (stop me now if it's an abuse...)

What concerns me more is: should I be using two virtually identical predicates? TheGamesDB, MobyGames, and other sites aren't 1:1 mappings, so they're not exactly identical in all situations (differing platform names, for one; and e.g. multiple Atari systems can be grouped under a single one on a different site). However, even though the objects could share an owl:sameAs edge, they have distinct IDs and are useless across different scrapers.

How can I represent all platforms, including collections of platforms (like the Atari groups mentioned above), using a single predicate? Or am I correct that I'm thinking about this the wrong way, and that multiple predicates are the right way to do things?

Thanks for your insight topfs2!
#42
(2013-03-30, 16:09)garbear Wrote: I'm stuck on an easy concept. Basically, https://github.com/garbear/heimdall/comm...0e1f#L2R20 . I've defined the platform predicate in two namespaces and I use Heimdall for the conversion. I assume this translation part is an appropriate use of Heimdall (stop me now if it's an abuse...)

As I understand it, you have one identifier, games.platform, which you want to map to another, site-specific, identifier? For this I think owl:sameAs is the way to do it. But that would only really work if the subject is the platform, which is probably not the case?
Another way would simply be to let the URI specify which ID it is, i.e. games.platform = [ thegamesdb.org/ID, mobygames.org/ID ]. It should be rather clear from the URI which is of interest and which is not?
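To illustrate the URI idea in Python: because each ID is a full, site-scoped URI, a client can pick out the one from the site it cares about. The URIs and the `games.platform` key below follow the pattern from this discussion and are illustrative, not real heimdall output.

```python
from urllib.parse import urlparse

# Hypothetical subject: games.platform holds site-scoped URIs, so the URI
# itself says which database an ID belongs to.
subject = {
    "games.platform": [
        "http://thegamesdb.net/platform/4",
        "http://www.mobygames.com/platform/snes",
    ]
}

def platform_id_for(subject, domain):
    """Return the platform URI belonging to the given site, or None."""
    for uri in subject.get("games.platform", []):
        if urlparse(uri).netloc.endswith(domain):
            return uri
    return None

print(platform_id_for(subject, "thegamesdb.net"))
```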

(2013-03-30, 16:09)garbear Wrote: What concerns me more is: should I be using two virtually identical predicates? TheGamesDB, MobyGames, and other sites aren't 1:1 mappings, so they're not exactly identical in all situations (differing platform names, for one; and e.g. multiple Atari systems can be grouped under a single one on a different site). However, even though the objects could share an owl:sameAs edge, they have distinct IDs and are useless across different scrapers.

How can I represent all platforms, including collections of platforms (like the Atari groups mentioned above), using a single predicate? Or am I correct that I'm thinking about this the wrong way, and that multiple predicates are the right way to do things?

I think you've outlined the biggest problems with the semantic web and ontologies. It gets very confusing, and it's rather hard to know when to add new ontologies or when to adapt a site to an already-created ontology. At least this is what I find extremely annoying :)

I'm not 100% versed in how consoles work, so I might have misunderstood this, but for platform names I think the best would be to create a global ID for each system and put it in an ontology, if one doesn't already exist. Each scraper (moby and thegamesdb) would need to use this. They could obviously provide their site-specific IDs in addition. With groups this obviously gets much more complex. Perhaps make the IDs hierarchical? So you have nintendo.nes, and then you could provide just nintendo. I guess the problem is more that they might combine wii and wii u, or snes and nes? And that it might be hard to create these groups naturally?
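The hierarchical-ID idea can be sketched in a few lines. The IDs (`nintendo.nes` etc.) are invented for illustration; a dotted prefix check lets a query for the group `nintendo` match any system underneath it:

```python
def matches_platform(item_platform, query):
    """True if a hierarchical platform ID falls under the queried group.

    An exact ID matches itself, and any dotted descendant matches its
    ancestors (so 'nintendo.nes' falls under 'nintendo').
    """
    return item_platform == query or item_platform.startswith(query + ".")

assert matches_platform("nintendo.nes", "nintendo")      # group query
assert matches_platform("nintendo.nes", "nintendo.nes")  # exact query
assert not matches_platform("nintendo.snes", "nintendo.nes")
```

This sidesteps explicit group objects for simple cases, though it can't express cross-family groups (e.g. a site lumping snes and nes together), which is the harder problem mentioned above.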


In general I'm somewhat starting to think that losing the semantic web and ontology in favor of a less globally unique document structure might be beneficial.
So have each scraper return dicts with properties that are less globally unique. For example:

{
    "gamePlatform": "SNES"
}

And if it's needed to have something more globally unique, just prefix it with the site, e.g.
{
    "imdbID": "tt812393"
}

or

{
    "mobyPlatformID": X,
    "thegamesdbID": Y
}

In the end this is essentially the same thing as semantics and ontology, but a bit less tiresome: no combining multiple ontologies, translating between them, or making sure a property exists in the ontology (for example, on your improvement PR I can't find the page stating whether trailers are in the video ontology), etc.

Thanks for pushing this and helping me with heimdall!
Reply
#43
(2013-03-31, 20:00)topfs2 Wrote: As I understand it, you have one identifier, games.platform, which you want to map to another, site-specific, identifier? For this I think owl:sameAs is the way to do it. But that would only really work if the subject is the platform, which is probably not the case?
Another way would simply be to let the URI specify which ID it is, i.e. games.platform = [ thegamesdb.org/ID, mobygames.org/ID ]. It should be rather clear from the URI which is of interest and which is not?
Letting the URI specify which ID it is solves my dilemma. Now I'm struggling with a new concept: thegamesdb platforms have two URIs, http://thegamesdb.net/platform/4 and http://thegamesdb.net/platform/nintendo-gameboy . Both are returned by "thegamesdb.listplatforms". Only the ID is returned by "thegamesdb.getgame", and only the label can be used to query "thegamesdb.getgame". I know, their API is kinda screwed up ;) My point is: how to settle this new source of cognitive dissonance?

topfs2 Wrote:I think you've outlined the biggest problems with the semantic web and ontologies. It gets very confusing, and it's rather hard to know when to add new ontologies or when to adapt a site to an already-created ontology. At least this is what I find extremely annoying :)

In general I'm somewhat starting to think that losing the semantic web and ontology in favor of a less globally unique document structure might be beneficial.
So have each scraper return dicts with properties that are less globally unique. For example:

In the end this is essentially the same thing as semantics and ontology, but a bit less tiresome: no combining multiple ontologies, translating between them, or making sure a property exists in the ontology (for example, on your improvement PR I can't find the page stating whether trailers are in the video ontology), etc.
IMO there aren't many reasons a program should include more than a single ontology in its operating logic. For instance, the fully-inferred thegamesdb object has 6 ontologies and two dict keys, not to mention interspersing ontology vocabulary with flat data classes. If it aims to be multi-domain, interoperability with other ontologies should probably be abstracted inside an external function that does the translation; that way the influence of ontology choices can be hidden from the program. The program is built around the semantic web and ontology, so they definitely belong in the program, but let's leave the rest of the ontology crud behind :)
#44
garbear Wrote:ok, topfs2. I'm determined to do a scraper, and I'm determined to do it right.
Great that you took up the task. I am planning to resume my work on the scrapers for RCB in some days/weeks, so I think I can greatly benefit from your additions. But I guess there may still be some things that are too RCB-specific to implement in XBMC's scrapers themselves.

Some questions about your current implementation and your future plans (sorry for interrupting your academic discussion; feel free to ignore or park my post for later consideration):

1. artwork download:
I have seen that you removed the artwork download in the latest commits. Do you want to remove it completely from heimdall, or do you just plan to move it to another module?

2. platform detection:
It looks like you do platform detection via file extension. I am afraid this will always be error-prone, at least when you have to deal with multi-platform extensions like .img or .bin. Any plans how to solve this? Of course this depends on the way you plan to integrate all this into XBMC/RetroPlayer, but isn't it possible to let the user define the platform before (s)he starts scraping? Usually all users that I have dealt with in the past have organized their collections separated by platform. So it should be easy to define sources on a per-platform basis and let the user specify the platform per folder/source.

3. title matching:
As with games, movies and music, finding the correct item by title is an uphill battle. XBMC's current scrapers often fail to find movies although they are present on the searched site, or they come back with mismatches. Not sure how well "difflib.get_close_matches()" handles this task, but I can imagine some scenarios where this approach might fail. For example, "Super Mario Bros.", "Super Mario Bros. 2" and "New Super Mario Bros." might look quite similar to a computer algorithm, some users will name their games "Super Mario Bros. II" instead of "Super Mario Bros. 2", etc.

In RCB I added some more logic to this part of the program to get better scrape results, e.g. trying to find 100% matches as automatically as possible (replacing metadata in [] and (), handling sequel numbers as digits or Roman numerals, ...) and also offering an interactive mode where the user gets a list of matches and can select the correct item manually. Is this a feature that you might consider as part of heimdall?
#45
(2013-04-02, 14:25)malte Wrote: 1. artwork download:
I have seen that you removed the artwork download in latest commits. Do you want to remove it completely from heimdall or do you just plan to move it to another module?

Personally I think heimdall should just fetch the knowledge and info, i.e. the URL to the artwork. Obviously there isn't anything really wrong with doing the actual download in the tasks either :)

(2013-04-02, 14:25)malte Wrote: 2. platform detection:
It looks like you do platform detection via file extension. I am afraid this will always be error-prone, at least when you have to deal with multi-platform extensions like .img or .bin. Any plans how to solve this? Of course this depends on the way you plan to integrate all this into XBMC/RetroPlayer, but isn't it possible to let the user define the platform before (s)he starts scraping? Usually all users that I have dealt with in the past have organized their collections separated by platform. So it should be easy to define sources on a per-platform basis and let the user specify the platform per folder/source.

This is also a problem with audio/video items, where you have some containers which may or may not contain video. The way it's done in heimdall atm is that, for those containers which are certain, it will set the type directly. For those where it can't deduce the type, it will have to probe the info.

As heimdall's scheduling is implemented, it's impossible to revert. So if you set item.media.video, you cannot go back and change it to item.media.audio, or even item.media. So any task should only report what it is very certain of.
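A minimal sketch of that monotonic rule (the class and method names are illustrative, not heimdall's actual code): a subject's type may be refined, but never reverted or contradicted.

```python
# Illustrative sketch of "facts are monotonic": refinement only.
class Subject:
    def __init__(self):
        self.type = None

    def set_type(self, new_type):
        # Allow only refinement: the new type must equal or extend the old.
        if self.type is not None and new_type != self.type \
                and not new_type.startswith(self.type + "."):
            raise ValueError(f"cannot revert {self.type!r} to {new_type!r}")
        self.type = new_type

s = Subject()
s.set_type("item.media")
s.set_type("item.media.video")       # fine: a refinement
try:
    s.set_type("item.media.audio")   # rejected: contradicts an earlier fact
except ValueError as e:
    print(e)
```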

This is a harder problem than audio/video though, as it might not be possible/easy to guess the platform from the file. So it could borrow from the video -> movie/tv show guess, i.e. have one task which reacts on the file extension (or mimetype) but skips if the platform is already set.

e.g.
supply: [ required(dc.type), not(games.platform) ]

then you could in RCB provide this subject for scraping

{
    "dc.title": "Super Cool game",
    "games.platform": PresetPlatformID
}
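Put together, the gating idea above could look roughly like this in Python. `required`/`absent` and the task class are invented names sketching the supply-rule behavior, not heimdall's real API:

```python
# Sketch of the supply rule: a platform-guessing task runs only when
# dc.type is known and games.platform has NOT been preset by the client.
def required(key):
    return lambda subject: key in subject

def absent(key):
    return lambda subject: key not in subject

class GuessPlatformTask:
    # Mirrors: supply: [ required(dc.type), not(games.platform) ]
    supply = [required("dc.type"), absent("games.platform")]

    @classmethod
    def runnable(cls, subject):
        return all(rule(subject) for rule in cls.supply)

preset = {
    "dc.title": "Super Cool game",
    "dc.type": "item.media.game",
    "games.platform": "PresetPlatformID",
}
unknown = {"dc.type": "item.media.game"}

print(GuessPlatformTask.runnable(preset))   # platform preset: task skipped
print(GuessPlatformTask.runnable(unknown))  # platform unknown: task runs
```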

(2013-04-02, 14:25)malte Wrote: 3. title matching:
As with games, movies and music, finding the correct item by title is an uphill battle. XBMC's current scrapers often fail to find movies although they are present on the searched site, or they come back with mismatches. Not sure how well "difflib.get_close_matches()" handles this task, but I can imagine some scenarios where this approach might fail. For example, "Super Mario Bros.", "Super Mario Bros. 2" and "New Super Mario Bros." might look quite similar to a computer algorithm, some users will name their games "Super Mario Bros. II" instead of "Super Mario Bros. 2", etc.

Indeed, this will always be a big problem. At the moment the SearchTasks only return a URL (without any other data). This is due to https://github.com/topfs2/heimdall/issues/7; if they could return more complex items, they could return a title and a certainty percentage. That way RCB, for example, could choose only games with a hit rate above X%. (This is how the current scraper engine in xbmc does it, IIRC.)
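As a concrete illustration of the certainty-percentage idea: difflib's SequenceMatcher (which get_close_matches uses under the hood) already yields a 0..1 ratio that a client like RCB could threshold on. The titles are the examples from the posts above; the function name is invented.

```python
import difflib

candidates = ["Super Mario Bros.", "Super Mario Bros. 2", "New Super Mario Bros."]

def ranked_matches(query, candidates, cutoff=0.8):
    """Rank candidate titles by similarity ratio, dropping those below cutoff."""
    scored = [
        (title, difflib.SequenceMatcher(None, query.lower(), title.lower()).ratio())
        for title in candidates
    ]
    return sorted(
        ((t, r) for t, r in scored if r >= cutoff),
        key=lambda pair: pair[1],
        reverse=True,
    )

# All three titles score high against "Super Mario Bros. II", which is why
# naming variants ("II" vs "2") need normalization before fuzzy matching.
for title, ratio in ranked_matches("Super Mario Bros. II", candidates, cutoff=0.5):
    print(f"{title}: {ratio:.2f}")
```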

Cheers,
Tobias
Reply
