2013-04-02, 16:56
(2013-04-02, 14:25)malte Wrote: 1. artwork download: I have seen that you removed the artwork download in latest commits. Do you want to remove it completely from heimdall or do you just plan to move it to another module?
I took your artwork downloaders to the chop shop (the rest of the tgdb scraper work was the perfect groundwork). They don't belong in a scraper module, because they aren't particular to a single scraper. Also, because they "supply" a file instead of data, heimdall sees an empty supply[], which is a conceptual problem. For example, topfs2 earlier mentioned SQL-like queries; if we want to optimize for extracting small amounts of data, backward chaining could be added to heimdall, and that relies on conclusions being present.
Conclusion-less tasks aren't bad per se; in fact, they might be a useful tool in heimdall's data-driven programming environment. I feel that they should be discouraged, however, so that heimdall can focus purely on enhancing its metadata processing algorithms.
malte Wrote: 2. platform detection: It looks like you do platform detection via file extension...
One of the principles behind heimdall is that, ideally, it only runs tasks when necessary. So topfs2 brought up adding a demand = [demands.none(game.platform)] line; that way the client can specify the platform and cause that task to be skipped.
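To illustrate the idea of skipping detection when the client already knows the platform, here is a minimal sketch in plain Python. It does not use heimdall's task API; the function name, the metadata dict and the extension table are all hypothetical and only stand in for what the demand line above would express declaratively.

```python
import os

# Hypothetical extension table for illustration; not heimdall's actual mapping.
EXTENSION_TO_PLATFORM = {
    ".nes": "Nintendo Entertainment System",
    ".smc": "Super Nintendo",
    ".gb": "Game Boy",
}


def detect_platform(metadata, path):
    """Return a copy of metadata with a platform, guessing only if none was given."""
    if metadata.get("platform"):
        # The client supplied the platform, so the detection step is skipped.
        return dict(metadata)
    extension = os.path.splitext(path)[1].lower()
    guessed = EXTENSION_TO_PLATFORM.get(extension)
    result = dict(metadata)
    if guessed is not None:
        result["platform"] = guessed
    return result
```

In heimdall itself the skip would presumably come from the task's demands not being met rather than from an explicit if, but the effect for the client is the same: a supplied platform always wins over the extension guess.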
malte Wrote: 3. title matching: In RCB I added some more logic to this part of the program to get better scrape results. E.g. trying to find 100% matches as automatic as possible (replacing metadata in [] and (), handling sequel numbers with digits or romes, ...) and also offer an interactive mode where the user gets a list of matches and can select the correct item manually. Is this a feature that you might consider as part of heimdall?
The best approach here is probably similar to my platform comparison algorithm: canonicalize the titles first (translate II to 2, brothers to bros, etc., like RCB does currently) and then run a fuzzy string comparison on the canonicalized titles.
EDIT: by "best", I meant "readily thought of at 2am". A Bayesian filter and some basic discrimination training should make an interactive mode almost unnecessary.
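A rough sketch of the canonicalize-then-fuzzy-compare idea, using only the standard library. This is not RCB's or heimdall's actual code: the replacement tables are small illustrative samples, the function names are made up, and difflib's SequenceMatcher is just one possible choice of fuzzy comparison.

```python
import re
from difflib import SequenceMatcher

# Illustrative tables only; a real scraper would need longer lists.
# "i" is left out on purpose so the ordinary word "I" is not rewritten.
ROMAN_NUMERALS = {"ii": "2", "iii": "3", "iv": "4", "v": "5",
                  "vi": "6", "vii": "7", "viii": "8", "ix": "9"}
ABBREVIATIONS = {"brothers": "bros"}


def canonicalize(title):
    """Reduce a title to a normalized form suitable for comparison."""
    # Drop tags in [] and (), such as region or dump markers.
    title = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", title)
    # Lowercase and replace punctuation with spaces.
    title = re.sub(r"[^a-z0-9]+", " ", title.lower())
    words = []
    for word in title.split():
        word = ROMAN_NUMERALS.get(word, word)
        word = ABBREVIATIONS.get(word, word)
        words.append(word)
    return " ".join(words)


def similarity(a, b):
    """Fuzzy similarity between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, canonicalize(a), canonicalize(b)).ratio()


def best_match(query, candidates):
    """Return the candidate whose canonical title is closest to the query."""
    return max(candidates, key=lambda candidate: similarity(query, candidate))
```

With this, "Super Mario Brothers III (USA) [!]" and "Super Mario Bros. 3" both canonicalize to the same string, so the fuzzy step only has to deal with real differences in the titles.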
(2013-04-02, 15:03)topfs2 Wrote: Indeed, this will always be a big problem. ATM The SearchTasks only returns a url (without any other data). This is due to https://github.com/topfs2/heimdall/issues/7 and if they could return more complex items they could return title and certainty percentage. That way RCB for example could only choose games with more than a hitrate of X %. (This is how current scraper engine in xbmc does IIRC).
Being able to retrieve a name, year, thumbnail and possibly other info alongside the URL would be helpful for the client, so that it can solve possible disambiguation problems in whatever way it sees fit.
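As a sketch of what richer search results could look like from the client's side: the SearchResult type, its field names and the confident_matches helper below are hypothetical, not something heimdall provides today, and the certainty is assumed to be a value between 0.0 and 1.0.

```python
from collections import namedtuple

# Hypothetical result shape; heimdall's SearchTasks currently return only a URL.
SearchResult = namedtuple(
    "SearchResult", ["url", "title", "year", "thumbnail", "certainty"])


def confident_matches(results, threshold):
    """Keep results at or above the certainty threshold, best first."""
    kept = [result for result in results if result.certainty >= threshold]
    return sorted(kept, key=lambda result: result.certainty, reverse=True)
```

A client like RCB could take the top entry automatically when the list is non-empty, and fall back to showing titles, years and thumbnails to the user when nothing clears the threshold.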
Malte, have you started to work on other game scrapers yet? I'm planning one that parses ROMs directly for available embedded data.