I also need help with parsing html
#1
hi,

could one of you get me started on cachedhttp and parsing html, please?

i just want to download http://www.fluidboxmedia.de/test.html via cachedhttp.py and then parse the content and get it displayed in xbmc

the first part should be easy (please correct me if i'm wrong):
Quote:import cachedhttp

c=mycachedhttp()
data=c.urlopen('http://www.fluidboxmedia.de/test.html')

the test.html just contains one table:
Quote:<table>
<tr><td width="100" height="20">header1</td><td width="900" height="20">header2</td><tr>
<tr><td width="100" height="20">row1cell1</td><td width="200" height="20">row1cell2</td><tr>
<tr><td width="100" height="20">row2cell1</td><td width="200" height="20">row2cell2</td><tr>
<tr><td width="100" height="20">row3cell1</td><td width="200" height="20">row3cell2</td><tr>
</table>

can you please show me how i would use re to get the text from the cells from rows 2-4 (in a list maybe)? i can learn best by looking at easy examples. you would help me a lot!

thanks in advance!
#2
this is what you look for:

Quote:result = re.findall('<tr><td width="100" height="20">(.+?)</td><td width="200" height="20">(.+?)</td><tr>', data, re.IGNORECASE | re.DOTALL)

print result

hope that helps, not tested yet

cu lolol
#3
just tested: my sample doesn't work...

i think i made a mistake with the "greedy" something...
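
a guess at what went wrong, as a small sketch: with re.DOTALL the dot also matches newlines, so the non-greedy (.+?) in the first pattern can run past the header row (whose second cell has width="900") until it finds a width="200" cell on the next line. this assumes the page is exactly the table quoted in the first post; the names with_dotall and without_dotall are only for illustration.

```python
import re

# sample page content copied from the first post
# (the rows end in <tr> rather than </tr>, as in the original file)
data = '''<table>
<tr><td width="100" height="20">header1</td><td width="900" height="20">header2</td><tr>
<tr><td width="100" height="20">row1cell1</td><td width="200" height="20">row1cell2</td><tr>
<tr><td width="100" height="20">row2cell1</td><td width="200" height="20">row2cell2</td><tr>
<tr><td width="100" height="20">row3cell1</td><td width="200" height="20">row3cell2</td><tr>
</table>'''

pattern = ('<tr><td width="100" height="20">(.+?)</td>'
           '<td width="200" height="20">(.+?)</td><tr>')

# with DOTALL the first match starts in the header row and spills into row 1
with_dotall = re.findall(pattern, data, re.IGNORECASE | re.DOTALL)

# without DOTALL a match cannot cross a line break, so the header row is skipped
without_dotall = re.findall(pattern, data, re.IGNORECASE)

print(with_dotall[0][0])   # starts with 'header1' and drags in leftover markup
print(without_dotall)      # only the three data rows
```

so on this sample, simply dropping re.DOTALL from the first pattern already returns just the three data rows.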

here the correct pattern:

Quote:result = re.findall('<tr><td .+?>(.+?)</td><td .+?>(.+?)</td><tr>', data, re.IGNORECASE)

gives back:
Quote:>>>
[('header1', 'header2'), ('row1cell1', 'row1cell2'), ('row2cell1', 'row2cell2'), ('row3cell1', 'row3cell2')]
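
since the question asked only for the cells of rows 2-4, here is a minimal sketch of dropping the header row from that result with a slice. the sample string stands in for the downloaded page, and parse_rows is a hypothetical helper name, not part of cachedhttp or re.

```python
import re

# stand-in for the downloaded page, copied from the first post
data = '''<table>
<tr><td width="100" height="20">header1</td><td width="900" height="20">header2</td><tr>
<tr><td width="100" height="20">row1cell1</td><td width="200" height="20">row1cell2</td><tr>
<tr><td width="100" height="20">row2cell1</td><td width="200" height="20">row2cell2</td><tr>
<tr><td width="100" height="20">row3cell1</td><td width="200" height="20">row3cell2</td><tr>
</table>'''

def parse_rows(html):
    # hypothetical helper: returns one (cell1, cell2) tuple per table row
    pattern = '<tr><td .+?>(.+?)</td><td .+?>(.+?)</td><tr>'
    return re.findall(pattern, html, re.IGNORECASE)

rows = parse_rows(data)   # header row plus three data rows
body = rows[1:]           # skip the header, keep rows 2-4 of the table

for cell1, cell2 in body:
    print(cell1 + ' | ' + cell2)
```

the pattern relies on every row sitting on its own line and ending in <tr>, as in test.html; a page laid out differently would need a different pattern.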

cu lolol
#4
thanks for your help.. i will try it out and see if i get it, hehe