Clean scraping API
#50
(2013-04-03, 12:56) malte Wrote: I see :) I had in mind to move them to a separate module once I started implementing other scrapers. But I can understand if you don't want them to be part of heimdall at all. Maybe I will re-add them in an RCB-specific version later (as you might have noticed, I am not an elegant coder in the first place). From a pragmatic point of view, downloading artwork is the most time-consuming task in the whole scraping process, and doing it multithreaded saves some valuable seconds per game. So before I start writing up some crappy threading thing in RCB, I prefer to reuse what Tobias has already implemented.
In my last post I mentioned one of the theories that a pragmatic point of view might overlook. And your pragmatism didn't go unnoticed :) Artwork downloaders might not belong in a metadata extractor, but the dataflow paradigm in heimdall is perfect for executing a large number of heavy tasks in parallel as their data becomes available. I think that when Tobias makes heimdall an XBMC library, you can import the package and supply your own artwork downloader module. That way the download locations, configuration, etc. can be hidden from heimdall.
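
To illustrate the idea of a client-supplied downloader, here is a minimal sketch. It is not heimdall's API: `download_all` and `fake_fetch` are hypothetical names, and the thread pool from Python 3's standard library stands in for heimdall's own task scheduling. The client passes in its own `fetch` callable, so download locations and configuration stay on the client side.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL in parallel.

    Returns a dict mapping each URL to its result, or to the exception
    that fetch raised for that URL.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was started for.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed download should not abort the others.
                results[url] = exc
    return results


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP download so the sketch runs offline.
    return "artwork bytes for " + url
```

A client like RCB would replace `fake_fetch` with its own function that knows where to store the files.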

malte Wrote: Not yet convinced that this will work, but I am ready to be surprised (and to test). If it helps: I created a kind of test set some years ago to test some of the more problematic games with RCB. Maybe I am able to dig it out or to recreate it.
Cool, I'd like to see the test. As with any sort of natural-language search, domain knowledge and machine learning will probably be the key to optimality. The last filter I wrote somehow got dumber as it went on... not a good sign :)

malte Wrote: Not sure if I understand that right. Will the client get one result and decide if it is the correct one, or will it be a two-step approach where the client gets a list of results and starts a second run with the correct item?
Probably a callback function that takes a list of results and returns the correct/chosen one, or an error code to abort the current task.
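
As a rough sketch of what such a callback could look like (all names here are hypothetical, nothing is decided or implemented in heimdall yet): the scraper hands the candidate list to a client function, which returns either one of the candidates or an `ABORT` sentinel standing in for the error code.

```python
ABORT = object()  # hypothetical sentinel standing in for the "error code"


def resolve(candidates, choose):
    """Ask the client callback to pick one candidate.

    Returns the chosen candidate, or None if the callback aborts the task.
    """
    choice = choose(candidates)
    if choice is ABORT:
        return None
    if choice not in candidates:
        raise ValueError("callback must return one of the candidates or ABORT")
    return choice


def choose_exact_title(wanted):
    """Build a callback that accepts only a case-insensitive exact title match."""
    def choose(candidates):
        for candidate in candidates:
            if candidate["title"].lower() == wanted.lower():
                return candidate
        return ABORT
    return choose
```

An interactive client could instead pass a callback that shows the list in a dialog and returns whatever the user selects.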

malte Wrote: As MobyGames is one of the most complete available sources atm, I think I would continue with this one. If you want to start this yourself, just let me know and I will sit still. Otherwise I could do the groundwork again and you could chime in where I sucked. If I gave it a try, I would start with BeautifulSoup. Any better ideas?
I'm of the opinion that the dude with the 10,000-line Python program knows what he's doing :) And I'm pretty wrapped up in my filter right now. If you push to a development branch, I'll probably check in after a few days to make sure Heimdall is being worshiped properly ;)

malte Wrote: Another question about implementing scrapers: as heimdall might become a more official part of XBMC, how do you handle scraping permissions for the scraped sites? For example, I got official permission to scrape data from thegamesdb, archive.vg and giantbomb. MobyGames did not respond to my request, but they do not explicitly forbid scraping in their terms of use, so I decided it would be ok to implement it. No idea if XBMC can go the same route or if we have to ask again. GameFAQ also did not respond, but they explicitly forbid automated scraping, so I decided not to implement it. Maybe they will respond if XBMC asks officially for scraping permission.
For now, Heimdall and XBMC are separate, so TOS negotiations are handled separately.

malte Wrote: I also want to get as much information as possible from the scrapers. E.g. thegamesdb also provides information and artwork for consoles and publishers/developers. Do you plan to use this in heimdall/RetroPlayer too?
The more the merrier :) I think you would target consoles or publishers as subjects in Heimdall, like the container.audio.Artist and container.audio.Album classes.