I have finished a first working version of the "parser framework" mentioned above. If you are interested, you can get it
here.
It is very simple and naive, but it works for my needs. If anyone wants to use and improve it, feel free, but please let me know about any fixes or improvements you make.
Usage from code is quite simple. This code returns a dictionary with all results found:
Code:
descParser = DescriptionParserFactory.getParser(parseInstruction)
results = descParser.parseDescription(descFile)
parseInstruction must be a locally available XML file that tells the parser how to parse the input and how to name the results in the dictionary.
descFile can be a local file or a URL pointing to the content that should be parsed.
Examples of parse instructions look like this:
XML Input (Mame.dat):
Code:
<datafile build="0.138 (May 15 2010)" debug="no">
<game name="puckman" sourcefile="pacman.c">
<description>PuckMan (Japan set 1)</description>
<year>1980</year>
<manufacturer>Namco</manufacturer>
<rom name="namcopac.6e" size="4096" crc="fee263b3" sha1="87117ba5082cd7a615b4ec7c02dd819003fbd669"/>
</game>
</datafile>
parseInstruction:
Code:
<parserConfig>
<GameGrammar type="xml" root="game">
<crc>rom/@crc</crc>
<Game>./@name</Game>
<ReleaseYear>year</ReleaseYear>
<Developer>manufacturer</Developer>
<Description>description</Description>
</GameGrammar>
</parserConfig>
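To make the mapping between the XML input and the parseInstruction easier to follow, here is a minimal standalone sketch of how such a grammar could be evaluated. It is not the code of the parser framework itself: the function names evaluate and parse_games are made up for this illustration, it only uses Python's standard xml.etree.ElementTree, and it assumes that every result is stored as a list of strings (the real parser may structure its dictionary differently).

```python
import xml.etree.ElementTree as ET

MAME_DAT = """<datafile build="0.138 (May 15 2010)" debug="no">
  <game name="puckman" sourcefile="pacman.c">
    <description>PuckMan (Japan set 1)</description>
    <year>1980</year>
    <manufacturer>Namco</manufacturer>
    <rom name="namcopac.6e" size="4096" crc="fee263b3" sha1="87117ba5082cd7a615b4ec7c02dd819003fbd669"/>
  </game>
</datafile>"""

PARSE_INSTRUCTION = """<parserConfig>
  <GameGrammar type="xml" root="game">
    <crc>rom/@crc</crc>
    <Game>./@name</Game>
    <ReleaseYear>year</ReleaseYear>
    <Developer>manufacturer</Developer>
    <Description>description</Description>
  </GameGrammar>
</parserConfig>"""


def evaluate(element, expression):
    """Evaluate a path such as 'year', 'rom/@crc' or './@name' (hypothetical helper)."""
    path, sep, attribute = expression.rpartition("@")
    if sep:
        # ElementTree cannot select attributes in a path, so split it off by hand.
        path = path.rstrip("/") or "."
        nodes = element.findall(path)
        return [n.get(attribute) for n in nodes if n.get(attribute) is not None]
    return [(n.text or "").strip() for n in element.findall(expression)]


def parse_games(grammar_xml, data_xml):
    """Return one dictionary per element named by the grammar's root attribute."""
    grammar = ET.fromstring(grammar_xml).find("GameGrammar")
    root_tag = grammar.get("root")
    results = []
    for game in ET.fromstring(data_xml).iter(root_tag):
        # Each child of GameGrammar names a result key; its text is the path.
        results.append({rule.tag: evaluate(game, rule.text) for rule in grammar})
    return results


if __name__ == "__main__":
    for game in parse_games(PARSE_INSTRUCTION, MAME_DAT):
        print(game)
```

The tag names inside GameGrammar become the dictionary keys, and the element text is the path that is looked up relative to each game element.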
Flat file Input (Key/Value):
Code:
Game: Dogfight
Platform: Amiga
Region:
Media:
Controller:
Genre: Action
Release Year: 1990
Developer: ????
Publisher: ????
Players: 2 Players
URL: http://www.mobygames.com/game/amiga/dogfight
Description:
Dogfight is a two-player game
********************************************************************
parseInstruction:
Code:
<parserConfig>
<GameGrammar type="multiline">
<SkippableContent>Game: </SkippableContent>
<Game restOfLine="true"></Game>
<SkippableContent>Platform: </SkippableContent>
<Platform delimiter="," restOfLine="true"></Platform>
<SkippableContent>Region: </SkippableContent>
<Region delimiter="," restOfLine="true"></Region>
<SkippableContent>Media: </SkippableContent>
<Media delimiter="," restOfLine="true"></Media>
<SkippableContent>Controller: </SkippableContent>
<Controller delimiter="," restOfLine="true"></Controller>
<SkippableContent>Genre: </SkippableContent>
<Genre delimiter="," restOfLine="true"></Genre>
<SkippableContent>Release Year: </SkippableContent>
<ReleaseYear delimiter="," restOfLine="true"></ReleaseYear>
<SkippableContent>Developer: </SkippableContent>
<Developer delimiter="," restOfLine="true"></Developer>
<SkippableContent>Publisher: </SkippableContent>
<Publisher restOfLine="true"></Publisher>
<SkippableContent>Players: </SkippableContent>
<Players delimiter="," restOfLine="true"></Players>
<SkippableContent>URL: </SkippableContent>
<URL delimiter="," restOfLine="true"></URL>
<SkippableContent restOfLine="true">Description:</SkippableContent>
<Description skipTo="*LineEnd"></Description>
<SkippableContent restOfLine="true"></SkippableContent>
</GameGrammar>
</parserConfig>
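As a rough idea of what a result for the key/value input could look like, here is a small standalone sketch. It does not interpret the parseInstruction and it is not how the framework works internally: parse_key_value_record is a hypothetical helper, and it assumes that values are split at commas into lists, that spaces are removed from keys ("Release Year" becomes "ReleaseYear"), and that the description ends at the line of asterisks.

```python
SAMPLE = """Game: Dogfight
Platform: Amiga
Region:
Media:
Controller:
Genre: Action
Release Year: 1990
Developer: ????
Publisher: ????
Players: 2 Players
URL: http://www.mobygames.com/game/amiga/dogfight
Description:
Dogfight is a two-player game
********************************************************************"""


def parse_key_value_record(text, list_delimiter=","):
    """Parse one 'Key: value' record into a dictionary (illustrative only)."""
    result = {}
    lines = iter(text.splitlines())
    for line in lines:
        # Split at the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().replace(" ", "")
        if key == "Description":
            # Collect the following lines until the asterisk separator.
            body = []
            for desc_line in lines:
                if desc_line.startswith("*"):
                    break
                body.append(desc_line)
            result[key] = "\n".join(body).strip()
            break
        result[key] = [item.strip() for item in value.split(list_delimiter) if item.strip()]
    return result


if __name__ == "__main__":
    print(parse_key_value_record(SAMPLE))
```

Empty entries such as "Region:" end up as empty lists in this sketch.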
A simple scraping example could look like this (scraping a mobygames search
http://www.mobygames.com/search/quick?q=...ctraiser):
Code:
<parserConfig>
<GameGrammar type="multiline">
<SkippableContent skipTo="&lt;span style=&quot;white-space: nowrap&quot;&gt;&lt;a href=&quot;" closeStmnt="true"></SkippableContent>
<SkippableContent closeStmnt="true">&lt;span style="white-space: nowrap"&gt;&lt;a href="</SkippableContent>
<url closeStmnt="true" skipTo="&quot;"></url>
</GameGrammar>
</parserConfig>
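The idea behind this instruction (skip ahead to a marker, skip the marker itself, then read up to the next double quote) can be sketched with plain string searching. This is only an illustration of the approach, not the parser's implementation: extract_links and the sample HTML are made up for the example.

```python
# Hypothetical snippet standing in for a search result page.
SAMPLE_HTML = (
    '<div><span style="white-space: nowrap"><a href="/game/example-one">One</a></span></div>'
    '<div><span style="white-space: nowrap"><a href="/game/example-two">Two</a></span></div>'
)

MARKER = '<span style="white-space: nowrap"><a href="'


def extract_links(html, marker=MARKER):
    """Collect every value that follows the marker, up to the next double quote."""
    urls = []
    pos = html.find(marker)
    while pos != -1:
        start = pos + len(marker)     # skip the marker itself
        end = html.find('"', start)   # read up to the closing quote
        if end == -1:
            break
        urls.append(html[start:end])
        pos = html.find(marker, end)  # continue with the next match
    return urls


if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```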
The first version of the flat file parser was implemented to parse multiline key/value files like the first flat file example, so it does not handle scraping web sites very well at the moment. I plan to improve this in the future.
I am using this parser in my Rom Browser script, so there may be some more documentation in the related Wiki. At the moment there is just a short description of the older flat file parser, but most of it should still be valid.