Washing away html tags from text?
#1
i need a way to remove html tags from a string so i can put it into a normal textbox. is there any convenient python method to do this? or can it be done with a regex?

i've managed to build a regex that finds all "<>" tags in a text, but i want the opposite. "<.*?>" will find all tags, but how can i negate this?

any ideas how to remove the html tags from the text below?
Quote:endorsements, such as the numerous commercials they've appeared in for <i>butterfinger</i> candy bars.<br>
<br>
<b>fox broadcasting schedule (all times eastern)</b><br>
december 1989 - may 1990: sunday 8:00-8:30<br>
october 1990 - may 1994: thursday 8:00-8:30<br>
Check out my XBMC scripts at http://xbmc.ramfelt.se
#2
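you can also replace text with a regex, not just search for it. re.sub() swaps every match for a replacement string, so an empty replacement removes the tags.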

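Quote:
import re

text = "<br>hello world<br>!!<br>"
replacetext = ""
pattern = "<(.*?)>"  # non-greedy: stops at the first ">" after each "<"

# the parenthesised print works on both python 2 and python 3
print(re.sub(pattern, replacetext, text))

output will be hello world!!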
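
applied to part of the sample text from post #1, the same call might look like this. a minimal sketch: the variable names `sample` and `cleaned` are made up here, and the pattern assumes no ">" appears inside an attribute value.

```python
import re

# a few lines of the quoted text from post #1
sample = (
    "for <i>butterfinger</i> candy bars.<br>\n"
    "<br>\n"
    "<b>fox broadcasting schedule (all times eastern)</b><br>\n"
    "december 1989 - may 1990: sunday 8:00-8:30<br>\n"
)

# replace every "<...>" run with an empty string; newlines are left alone
cleaned = re.sub(r"<.*?>", "", sample)

print(cleaned)
```

the line breaks in the source survive, so the result can go straight into a textbox.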



#3
ahh, didn't know about that method. thanks a lot, it worked like a charm.

btw, is there any point in using re.compile() instead of passing the pattern string directly every time? i mean, the compiled regex takes up memory, etc.
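
for comparison, here are both forms side by side. a small sketch with made-up names (`TAG`, `lines`): the results are identical, and since the re module keeps its own cache of recently used patterns, the difference for a handful of regexes is mostly about readability and skipping a lookup per call rather than memory.

```python
import re

# compiled once and reused; the name TAG is invented for this sketch
TAG = re.compile(r"<.*?>")

lines = ["<b>one</b>", "two<br>", "<i>three</i>"]

# compiled pattern object: call .sub() on it directly
with_compiled = [TAG.sub("", line) for line in lines]

# module-level function: the pattern string is passed on every call
with_string = [re.sub(r"<.*?>", "", line) for line in lines]

print(with_compiled)
print(with_compiled == with_string)
```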

i've just started using regex and it seems quite advanced, but i wonder if it is "better" to use str.find() to find specific parts of a web page than to use a regex for it. perhaps it is just a matter of taste.
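
to make the comparison concrete, here is one lookup done both ways. this is only a sketch: the `page` string and the helper names `title_with_find` and `title_with_regex` are invented for illustration. str.find() is fine when the markers are fixed strings, while the regex version also copes with attributes and mixed case.

```python
import re

page = "<html><head><title>episode guide</title></head><body>...</body></html>"

def title_with_find(html):
    """Locate fixed markers with str.find(); returns None if either is missing."""
    start = html.find("<title>")
    if start == -1:
        return None
    start += len("<title>")
    end = html.find("</title>", start)
    if end == -1:
        return None
    return html[start:end]

def title_with_regex(html):
    """Same lookup with a regex that tolerates attributes and upper case."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None

print(title_with_find(page))
print(title_with_regex(page))
```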
#4
here's a more advanced method of html stripping that i am using. it also decodes entities like &amp; and so on.


htmlstrip.py from ooba
Quote:
# scriptname : htmlstrip.py
# version : 0.3 beta
# author : van der phunck aka aslak grinsted. as@Phunck.cmo <- not cmo but com
# desc : can scrub html and do htmldecoding of &oslash;.
#
#
# useful in many scripts
#
#

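# note: python 2 code. in python 3, htmlentitydefs is html.entities and unichr is chr.
import re
import htmlentitydefs

# matches numeric (&#248;) and named (&oslash;) entities, ended by ";" or whitespace
entpat = re.compile(r"&#?([\w\d]+)[;\s]", re.IGNORECASE)

def htmldecode(name):  # make titles look nicer in menu
    nameout = ''
    lastidx = 0
    for match in entpat.finditer(name):
        try:
            # numeric entity, e.g. &#248;
            ent = unichr(int(match.group(1)))
        except (ValueError, OverflowError):
            try:
                # named entity, e.g. &oslash;
                ent = unichr(htmlentitydefs.name2codepoint[match.group(1)])
            except KeyError:
                ent = '?'
        nameout = nameout + name[lastidx:match.start()] + ent.encode('iso-8859-1', 'replace')
        lastidx = match.end()
    nameout = nameout + name[lastidx:]
    nameout = nameout.replace('\xa0', ' ')  # make nbsp into normal space
    return nameout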


rcomment = re.compile(r'<!--.*?-->', re.IGNORECASE | re.DOTALL)
rscript = re.compile(r'<script.*?</script>', re.IGNORECASE | re.DOTALL)

rtd = re.compile(r'<td[^<>]*>', re.IGNORECASE)
rwhitespace = re.compile(r'\s{2,}', re.IGNORECASE)


rbodystart = re.compile(r'^.*<body[^<>]*>', re.IGNORECASE | re.DOTALL)
rbodyend = re.compile(r'</body.*$', re.IGNORECASE | re.DOTALL)

rli = re.compile(r'<li[^<>]*>', re.IGNORECASE)

# tags that should become a line break in the plain text
rbr = re.compile(r'<(?:br|p|/?div|/?table|/tr|hr)[^<>]*>', re.IGNORECASE)

# any remaining tag
rtag = re.compile(r'<([^<>]*)>', re.IGNORECASE)

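def htmlstrip(data):
    # drop comments, everything outside <body>, and scripts
    data = rcomment.sub('', data)
    data = rbodystart.sub('', data)
    data = rbodyend.sub('', data)
    data = rscript.sub('', data)
    # table cells become spaces, then runs of whitespace collapse to one space
    data = rtd.sub(' ', data)
    data = rwhitespace.sub(' ', data)

    # block-level tags become newlines, list items become bullets
    data = rbr.sub('\n', data)
    data = rli.sub('\n* ', data)

    # strip whatever tags are left, then decode entities
    data = rtag.sub('', data)
    return htmldecode(data)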
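
for anyone on python 3, where htmlentitydefs and unichr no longer exist, a similar result can be had from the standard library's html.parser. this is a minimal sketch and not a port of htmlstrip.py: `TextExtractor`, `BLOCK_TAGS` and `html_to_text` are names invented here, and unlike the script above it does not trim to the body or collapse whitespace.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    # hypothetical choice of tags that should start a new line
    BLOCK_TAGS = {"br", "p", "div", "table", "tr", "hr", "li"}

    def __init__(self):
        # convert_charrefs=True means entities reach handle_data already decoded
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside a script or style element
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.parts)

print(html_to_text("<b>caf&eacute; &amp; bar</b><br>next<script>var x = 1;</script>"))
```

a real parser avoids the corner cases that trip up "<.*?>", such as a ">" inside an attribute value, at the cost of a bit more code.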



