Clean scraping API
#49
garbear Wrote:I took your artwork downloaders to the chop shop. They don't belong in a scraper module, because they aren't particular to a single scraper.
I see. I had in mind to move them to a separate module once I started implementing other scrapers. But I can understand if you don't want them to be part of heimdall at all. Maybe I will re-add them in an RCB-specific version later (as you might have noticed, I am not an elegant coder in the first place). From a pragmatic point of view, downloading artwork is the most time-consuming task in the whole scraping process, and doing it multithreaded saves some valuable seconds per game. So before I start writing up some crappy threading thing in RCB, I would prefer to reuse what Tobias has already implemented.
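
A rough sketch of what multithreaded artwork downloading could look like with nothing but the standard library. This is not the code that was removed from heimdall; `download_all` and its `fetch` parameter are hypothetical names used only for illustration. The download function is passed in so the threading part stays independent of any particular scraper.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL in a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    that was raised, so one broken image does not abort the whole game.
    `fetch` is a hypothetical callable supplied by the caller.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the downloads overlap.
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

In real use `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`. Since the work is almost entirely waiting on the network, plain threads are enough to overlap the downloads.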

garbear Wrote:EDIT: by best, I meant readily thought of at 2am. A Bayesian filter and some basic discrimination training should make an interactive mode almost unnecessary.
I am not yet convinced that this will work, but I am ready to be surprised (and to test). If it helps: I created a kind of test set some years ago to test some of the more problematic games with RCB. Maybe I can dig it out or recreate it.

garbear Wrote:
topfs2 Wrote:Indeed, this will always be a big problem. ATM the SearchTasks only return a URL (without any other data). This is due to https://github.com/topfs2/heimdall/issues/7; if they could return more complex items, they could return a title and a certainty percentage. That way RCB, for example, could choose only games with a hit rate above X %. (This is how the current scraper engine in XBMC does it, IIRC.)
Being able to retrieve a name, year, thumbnail and possibly other info alongside the URL would be helpful for the client so that it can solve possible disambiguation problems in the way that the client best sees fit.
I am not sure I understand that right. Will the client get one result and decide whether it is the correct one, or will it be a two-step approach where the client gets a list of results and starts a second run with the correct item?
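
To make the two-step variant concrete, here is a minimal sketch under stated assumptions. It is not heimdall's API: `SearchResult`, `score_candidates` and `pick` are hypothetical names, and the certainty is simply a string similarity from `difflib`, standing in for whatever scoring the real search task would use. Step one returns a ranked list; step two either picks the top hit automatically or returns nothing, so the client can ask the user.

```python
from collections import namedtuple
from difflib import SequenceMatcher

# Hypothetical result shape: enough data for the client to disambiguate.
SearchResult = namedtuple("SearchResult", "title year url certainty")


def score_candidates(query, candidates):
    """Rank (title, year, url) tuples by similarity to the query, best first."""
    scored = []
    for title, year, url in candidates:
        certainty = SequenceMatcher(None, query.lower(), title.lower()).ratio()
        scored.append(SearchResult(title, year, url, certainty))
    return sorted(scored, key=lambda r: r.certainty, reverse=True)


def pick(results, threshold=0.9):
    """Return the best result if it clears the threshold, else None."""
    if results and results[0].certainty >= threshold:
        return results[0]
    return None
```

A client like RCB would call `score_candidates` once, try `pick`, and only start the second run (scraping the chosen URL) after either the threshold or the user has settled on one item.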

garbear Wrote:Malte, have you started to work on other game scrapers yet? I'm planning one that parses ROMs directly for available embedded data.
Not yet. But I wanted to reimplement all scrapers that RCB currently supports and maybe add one or two that are currently not supported. This is the list I have in mind:

Online scrapers:
- thegamesdb
- archive.vg (xml API)
- giantbomb (xml API)
- mobygames (html scraping)
- some alternative to MAWS to scrape Arcade games
- maybe GameFAQ (see question below)

Offline scrapers:
- nfo files
- emuxtras desc files
- maybe other desc files that are provided for MAME or other rom sets

But I guess some of these sources may just remain RCB specific and are not relevant to heimdall.

As mobygames is one of the most complete sources available atm, I think I would continue with this one. If you want to start on it yourself, just let me know and I will sit still. Otherwise I could do the groundwork again and you could chime in where I sucked. If I gave it a try, I would start with BeautifulSoup. Any better ideas?
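
For comparison, a small sketch of the same kind of HTML scraping using only the standard library's `html.parser` instead of BeautifulSoup, so it runs without extra dependencies. The markup is an assumption: the `game-title` class and the sample HTML are made up for illustration and are not mobygames' real page structure.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (href, text) for every <a class="game-title"> link.

    The class name "game-title" is hypothetical; a real scraper would
    have to match whatever markup the site actually uses.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read
        self._text = []     # text fragments inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "game-title":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
```

Feeding it a search results page with `parser.feed(html)` leaves the matches in `parser.links`. BeautifulSoup does the same job with less code and copes better with broken markup, which is the usual reason to prefer it for html scraping.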

Another question about implementing scrapers: as heimdall might be a more official part of XBMC, how do you handle scraping permissions for the scraped sites? For example, I got official permission to scrape data from thegamesdb, archive.vg and giantbomb. Mobygames did not respond to my request, but they do not explicitly forbid scraping in their terms of use, so I decided it would be OK to implement it. I have no idea whether XBMC can go the same route or whether we have to ask again. GameFAQ also did not respond, but they explicitly forbid automated scraping, so I decided not to implement it. Maybe they will respond if XBMC asks officially for scraping permission.

I also want to get as much information as possible from the scrapers. E.g. thegamesdb also provides information and artwork for consoles and publishers/developers. Do you plan to use this in heimdall/RetroPlayer too?



