malte · 2010-09-07, 14:00

I need to parse several different flat-file and XML formats from inside a Python addon. The files may be available locally, but it should also be possible to download the content from web sites (very simple scraping). Is there anything out there that can do this (or part of it)?

I already looked at other scripts and the forums but did not find a general solution; most scripts seem to have their own code for one specific purpose. I found the C# scraperXml lib, but it does not seem to be usable from Python (and I really don't like regular expressions).

I started to implement something like this (using pyparsing for the flat files and elementtree/xpath for XML). It should use simple XML configuration files that define the parse instructions for the different formats. But before I get too far into the details, I would like to ask whether there is something I have not seen yet.

Thanks in advance,
malte

malte · 2010-10-01, 09:10

I have finished a first working version of the above-mentioned "parser framework". If you are interested, you can get it here.

It is very stupid and simple but it works for my needs. If anyone wants to use and improve it, feel free. But let me know if you have done fixes or improvements.

Usage from code is quite simple. This code will return a dictionary with all found results:

Code:
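# parseInstruction: path to a local XML file with the parse instructions
descParser = DescriptionParserFactory.getParser(parseInstruction)
# descFile: local file or URL of the content to parse
results = descParser.parseDescription(descFile)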

parseInstruction must be a locally available XML file that tells the parser how to parse the input and how to name the results in the dictionary.
descFile can be a local file or a URL to the content that should be parsed.
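How the framework itself resolves descFile is not shown in this thread. As a rough illustration only, accepting either a local path or a URL could be sketched like this; `read_description` is a hypothetical helper name, not part of the framework, and the UTF-8 decoding is an assumption:

```python
import urllib.request


def read_description(source):
    """Return the text behind `source`, a local path or an http(s) URL.

    Hypothetical helper for illustration; not part of the parser framework.
    """
    if source.startswith(("http://", "https://")):
        # Remote content: download it and decode it (UTF-8 is an assumption).
        with urllib.request.urlopen(source) as response:
            return response.read().decode("utf-8", errors="replace")
    # Anything else is treated as a path on the local file system.
    with open(source, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

The returned string can then be handed to whatever parser matches the format.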

Examples of the parseInstructions look like this:

XML Input (Mame.dat):

Code:
<datafile build="0.138 (May 15 2010)" debug="no">
    <game name="puckman" sourcefile="pacman.c">
        <description>PuckMan (Japan set 1)</description>
        <year>1980</year>
        <manufacturer>Namco</manufacturer>
        <rom name="namcopac.6e" size="4096" crc="fee263b3" sha1="87117ba5082cd7a615b4ec7c02dd819003fbd669"/>
    </game>
</datafile>

parseInstruction:

Code:
<parserConfig>
    <GameGrammar type="xml" root="game">
        <crc>rom/@crc</crc>
        <Game>./@name</Game>
        <ReleaseYear>year</ReleaseYear>
        <Developer>manufacturer</Developer>
        <Description>description</Description>
    </GameGrammar>
</parserConfig>
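To make the idea concrete, here is a minimal sketch of how such an XML parseInstruction could be applied with the standard library's ElementTree. It is not the framework's actual code: `parse_xml` and `evaluate` are hypothetical names, and because ElementTree's path support cannot return attribute values directly, the trailing `/@attr` part is handled by hand. Returning a list of values per key is also an assumption.

```python
import xml.etree.ElementTree as ET

MAME_DAT = """<datafile build="0.138 (May 15 2010)" debug="no">
    <game name="puckman" sourcefile="pacman.c">
        <description>PuckMan (Japan set 1)</description>
        <year>1980</year>
        <manufacturer>Namco</manufacturer>
        <rom name="namcopac.6e" size="4096" crc="fee263b3" sha1="87117ba5082cd7a615b4ec7c02dd819003fbd669"/>
    </game>
</datafile>"""

PARSE_INSTRUCTION = """<parserConfig>
    <GameGrammar type="xml" root="game">
        <crc>rom/@crc</crc>
        <Game>./@name</Game>
        <ReleaseYear>year</ReleaseYear>
        <Developer>manufacturer</Developer>
        <Description>description</Description>
    </GameGrammar>
</parserConfig>"""


def evaluate(node, path):
    """Evaluate a tiny path subset: 'child', 'child/@attr' or './@attr'."""
    element_path, has_attribute, attribute = path.partition("/@")
    # "." (or nothing) before "/@" means the attribute sits on the node itself.
    targets = [node] if element_path in ("", ".") else node.findall(element_path)
    if has_attribute:
        return [t.get(attribute) for t in targets if t.get(attribute) is not None]
    return [t.text or "" for t in targets]


def parse_xml(instruction_xml, input_xml):
    """Return one dictionary per element named by the grammar's root attribute."""
    grammar = ET.fromstring(instruction_xml).find("GameGrammar")
    root = ET.fromstring(input_xml)
    results = []
    for node in root.iter(grammar.get("root")):
        # Each child of the grammar maps a result key to a path expression.
        results.append({rule.tag: evaluate(node, rule.text.strip()) for rule in grammar})
    return results


results = parse_xml(PARSE_INSTRUCTION, MAME_DAT)
print(results[0]["Game"], results[0]["crc"])
```

Each child element of GameGrammar becomes a key in the result dictionary, so a new field only needs another line in the configuration.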

Flat file Input (Key/Value):

Code:
Game: Dogfight
Platform: Amiga
Region: 
Media: 
Controller: 
Genre: Action
Release Year: 1990
Developer: ????
Publisher: ????
Players: 2 Players
URL: http://www.mobygames.com/game/amiga/dogfight
Description:
Dogfight is a two-player game
********************************************************************

parseInstruction:

Code:
<parserConfig>
    <GameGrammar type="multiline">
        <SkippableContent>Game: </SkippableContent>
        <Game restOfLine="true"></Game>
        <SkippableContent>Platform: </SkippableContent>
        <Platform delimiter="," restOfLine="true"></Platform>
        <SkippableContent>Region: </SkippableContent>
        <Region delimiter="," restOfLine="true"></Region>
        <SkippableContent>Media: </SkippableContent>
        <Media delimiter="," restOfLine="true"></Media>
        <SkippableContent>Controller: </SkippableContent>
        <Controller delimiter="," restOfLine="true"></Controller>
        <SkippableContent>Genre: </SkippableContent>
        <Genre delimiter="," restOfLine="true"></Genre>
        <SkippableContent>Release Year: </SkippableContent>
        <ReleaseYear delimiter="," restOfLine="true"></ReleaseYear>
        <SkippableContent>Developer: </SkippableContent>
        <Developer delimiter="," restOfLine="true"></Developer>
        <SkippableContent>Publisher: </SkippableContent>
        <Publisher restOfLine="true"></Publisher>
        <SkippableContent>Players: </SkippableContent>
        <Players delimiter="," restOfLine="true"></Players>
        <SkippableContent>URL: </SkippableContent>
        <URL delimiter="," restOfLine="true"></URL>
        <SkippableContent restOfLine="true">Description:</SkippableContent>
        <Description skipTo="*LineEnd"></Description>
        <SkippableContent restOfLine="true"></SkippableContent>
    </GameGrammar>
</parserConfig>
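For readers who want to see what this grammar amounts to, the following is a simplified sketch in plain Python string handling. It deliberately does not use pyparsing and is not the framework's implementation. It assumes that `delimiter=","` means the value is split into a list, that fields without a delimiter stay plain strings, and that the description runs until the line of asterisks; `parse_record` and `single_valued` are hypothetical names.

```python
RECORD = """Game: Dogfight
Platform: Amiga
Region:
Genre: Action
Release Year: 1990
Players: 2 Players
URL: http://www.mobygames.com/game/amiga/dogfight
Description:
Dogfight is a two-player game
********************************************************************
"""


def parse_record(text, single_valued=("Game", "Publisher")):
    """Parse one key/value record into a dictionary (simplified sketch)."""
    result = {}
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith("*"):
            break  # the asterisk line closes the record
        key, found, value = line.partition(":")
        if not found:
            continue  # not a key/value line
        key = key.replace(" ", "")  # "Release Year" -> "ReleaseYear"
        if key == "Description":
            # The description spans the following lines up to the asterisks.
            body = []
            for description_line in lines:
                if description_line.startswith("*"):
                    break
                body.append(description_line)
            result[key] = "\n".join(body).strip()
            break
        value = value.strip()
        if key in single_valued:
            result[key] = value
        else:
            # Assumed meaning of delimiter=",": split into a list of values.
            result[key] = [part.strip() for part in value.split(",") if part.strip()]
    return result


record = parse_record(RECORD)
print(record["Game"], record["ReleaseYear"], record["Description"])
```

Splitting on the first colon only keeps values such as the URL intact, since everything after that colon is treated as the value.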

A simple scraping example could look like this (scraping a mobygames search http://www.mobygames.com/search/quick?q=...ctraiser):

Code:
<parserConfig>
    <GameGrammar type="multiline">
        <SkippableContent skipTo="&lt;span style=&quot;white-space: nowrap&quot;&gt;&lt;a href=&quot;" closeStmnt="true"></SkippableContent>
        <SkippableContent closeStmnt="true">&lt;span style=&quot;white-space: nowrap&quot;&gt;&lt;a href=&quot;</SkippableContent>
        <url closeStmnt="true" skipTo="&quot;"></url>
    </GameGrammar>
</parserConfig>
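Read as a recipe, this instruction skips ahead to a fixed marker, consumes the marker, and captures everything up to the next double quote. A minimal sketch of that idea with plain string searching follows. It is an interpretation, not the framework's code: `extract_urls` is a hypothetical name, the sample HTML is made up, and collecting every match on the page is an assumption.

```python
MARKER = '<span style="white-space: nowrap"><a href="'


def extract_urls(html, marker=MARKER, terminator='"'):
    """Collect the text between each marker and the next terminator."""
    urls = []
    position = 0
    while True:
        start = html.find(marker, position)
        if start == -1:
            break  # no further marker on the page
        start += len(marker)  # skip over the marker itself
        end = html.find(terminator, start)
        if end == -1:
            break  # unterminated attribute, stop here
        urls.append(html[start:end])
        position = end + 1
    return urls


# Made-up sample input, only to show the mechanics.
SAMPLE_HTML = (
    '<div><span style="white-space: nowrap"><a href="/game/example-one">One</a></span>'
    '<span style="white-space: nowrap"><a href="/game/example-two">Two</a></span></div>'
)
print(extract_urls(SAMPLE_HTML))
```

Matching on literal markup like this breaks as soon as the site changes its HTML, which is one reason a dedicated HTML parser is usually more robust for scraping.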

The first version of the flat file parser was implemented to parse multiline key/value files like the first flat file example, so it does not handle scraping web sites very well atm. I think I will try to improve this in the future.

I am using this parser in my Rom Browser script, so there may be some more documentation in the related Wiki. Atm there is just a short description of the older flat file parser. But most of it should still be valid.

galvanash · 2010-10-01, 09:19

malte wrote: I need to parse several different flat file and xml formats from inside a python addon. The files may be available locally but it should also be possible to download the content from web sites (very simple scraping). Is there anything out there that can do this (or a part of it)?

I have not used it myself, but I have seen quite a few plugins/scripts using this:

http://www.crummy.com/software/BeautifulSoup/

It is mostly for scraping, but I'm pretty sure you could use it with local files.

malte · 2010-10-01, 19:32

Thanks a lot. I saw this already but thought it would just be an xml parser like elementtree. Will check it again.