v14 CSS Selector Script
#1
Hi Guys,

I've been working on a script to parse HTML in XBMC. As far as I can tell, there is no straightforward way in XBMC to parse HTML and query the DOM, because XBMC uses Python 2.6. So I wrote a class called HTMLParserEx, based on Python's HTMLParser, to parse the HTML. I also wrote a second class that works like a CSS selector engine, but a very simple one.
BTW, I tried the HTML parser code on a few bad HTML mark-ups and it works. However, if the HTML is really messy it doesn't parse properly, because of the base XML parser in Python. The CSS selector is still useful, though.
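
For anyone wondering what "based on HTMLParser" can look like, here is a minimal sketch. It is not the HTMLParserEx code from the pastebin; the class name TreeBuilderParser, the VOID_TAGS list and the 'document' wrapper element are all my own assumptions for illustration. It builds an ElementTree from the HTMLParser callbacks and ignores closing tags that were never opened, which is one simple way to survive mildly broken mark-up.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 (XBMC)
except ImportError:
    from html.parser import HTMLParser     # Python 3
import xml.etree.ElementTree as ET

# Assumption: a small list of tags that never get a closing tag.
VOID_TAGS = set(['br', 'hr', 'img', 'input', 'link', 'meta'])


def _attrs(pairs):
    # HTMLParser reports valueless attributes as None; store '' instead.
    return dict((k, v or '') for k, v in pairs)


class TreeBuilderParser(HTMLParser):
    """Hypothetical sketch: turn HTMLParser events into an element tree."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.root = ET.Element('document')   # artificial wrapper element
        self.stack = [self.root]             # currently open elements

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag, _attrs(attrs))
        if tag not in VOID_TAGS:
            self.stack.append(el)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img/>: add it, but never open it.
        ET.SubElement(self.stack[-1], tag, _attrs(attrs))

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this tag.
        # A closing tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text after a child element belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or '') + data
        else:
            current.text = (current.text or '') + data


parser = TreeBuilderParser()
parser.feed('<div><p>one<p>two</div></span><a href="x">link<img src="i.png"></a>')
parser.close()
```

Note that the unclosed p tags simply end up nested inside each other here. A real parser would need more rules than this to match what browsers do.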

Supported CSS Selector:
Code:
#id           : div#this_is_the_id or simply #the-ID
.class        : a.this_is_the_class or div.the-class
[attr=""]     : [key*=part-of-value] or [key=exact-value] or [key^=start-of-value] or [key]
tag           : div or img
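
To make the list above concrete, here is a rough sketch of how one of these simple selectors could be tested against a single element. This is not the code from the pastebin: SELECTOR_RE and matches() are hypothetical names, and I assume the parts always come in the order tag, #id, .class, [attr] and that the element's attributes are available as a plain dict.

```python
import re

# Hypothetical pattern: optional tag, #id, .class and one [attr] part.
SELECTOR_RE = re.compile(
    r'^(?P<tag>[A-Za-z][A-Za-z0-9]*)?'
    r'(?:#(?P<id>[\w-]+))?'
    r'(?:\.(?P<cls>[\w-]+))?'
    r'(?:\[(?P<key>[\w-]+)(?:(?P<op>[*^]?=)(?P<value>[^\]]*))?\])?$'
)


def matches(selector, tag, attrs):
    """Return True if an element (tag name + attribute dict) fits selector."""
    m = SELECTOR_RE.match(selector)
    if not selector or m is None:
        raise ValueError('unsupported selector: %r' % selector)
    part = m.groupdict()

    if part['tag'] and part['tag'].lower() != tag.lower():
        return False
    if part['id'] and attrs.get('id') != part['id']:
        return False
    if part['cls'] and part['cls'] not in attrs.get('class', '').split():
        return False

    if part['key']:
        if part['key'] not in attrs:          # [key]: attribute must exist
            return False
        actual = attrs[part['key']]
        value = (part['value'] or '').strip('"\'')
        if part['op'] == '=' and actual != value:
            return False
        if part['op'] == '*=' and value not in actual:
            return False
        if part['op'] == '^=' and not actual.startswith(value):
            return False
    return True


link = {'href': 'forumdisplay.php?fid=3', 'class': 'nav the-class'}
is_forum_link = matches('a[href*=forumdisplay.php?]', 'a', link)
```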

Refining the filter (a second find only goes through the elements selected by the first find):
Code:
g = cssSelector(root)
g.find('a[href*=forumdisplay.php?]')
g.find('img')
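
The chaining idea can be sketched like this. Again, this is only an illustration and not the pastebin code: ChainSelector and walk() are made-up names, it only matches on tag name, and it assumes that "going through the first selected DOMs" means searching the descendants of each selected element.

```python
import xml.etree.ElementTree as ET


def walk(el):
    """Yield el and all of its descendants in document order."""
    yield el
    for child in el:
        for sub in walk(child):
            yield sub


class ChainSelector(object):
    """Hypothetical sketch: each find() narrows the previous selection."""

    def __init__(self, root):
        self.selected = [root]

    def find(self, tag):
        found = []
        for el in self.selected:
            for sub in walk(el):
                # Only descendants count, and each element is kept once.
                if sub is not el and sub.tag == tag and sub not in found:
                    found.append(sub)
        self.selected = found
        return self


root = ET.fromstring(
    '<div>'
    '<a href="x"><img src="1.png"/></a>'
    '<img src="2.png"/>'
    '<a href="y"><span><img src="3.png"/></span></a>'
    '</div>'
)
g = ChainSelector(root)
g.find('a')      # selects both links
g.find('img')    # only images inside those links; 2.png is skipped
sources = [el.get('src') for el in g.selected]
```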

Here is the code:
http://pastebin.com/fkF154Wm

And here is some sample code that uses it. It also compares the selector class against an equivalent regex (RE).

Code:
import re
import time

import HTMLParserEx

htmlSource = ''

timer = time.time()

with HTMLParserEx.httpConn() as conn:
  htmlSource = conn.request('http://forum.kodi.tv/index.php').read()

print 'DONWLOAD TIME:', str(round(( time.time() - timer ) * 1000 ))
timer = time.time()

parser = HTMLParserEx.HTMLParserEx()
parser.feed( htmlSource )
root = parser.close()

print 'PARSE TIME:', str(round(( time.time() - timer ) * 1000 ))
timer = time.time()

g = HTMLParserEx.cssSelector(root)
g.find('a[href*=forumdisplay.php?]')

print 'DOM CSS SELECTOR FILTER TIME:', str(round(( time.time() - timer ) * 1000 ))
timer = time.time()

a = re.compile(r'(<a\b[^h]*href="forumdisplay\.php\?[^>]*>(.*?)</a>)', re.DOTALL | re.IGNORECASE).findall( htmlSource )

print 'DOM REGEX FILTER TIME:', str(round(( time.time() - timer ) * 1000 ))

for idx, el in enumerate( g.selected ):
  print str( idx ) + " " + str( el.tag ) + ' ' + el.text + ' ' + g.toString( el )

Results:
Code:
DONWLOAD TIME: 534.0
PARSE TIME: 36.0
DOM CSS SELECTOR FILTER TIME: 4.0
DOM REGEX FILTER TIME: 1.0

Hope it's useful. Share your thoughts.