htmlcleaner
Get it from its git repo.
There are instructions on how to use it in the .py file.

What does it do?
It provides an easy way to clean annoying HTML entities out of your scraped page source.
If you have things like these showing up in XBMC: &amp; &AElig;
then you can use this to convert them to what they should be.

I think BeautifulSoup does this automatically, but I mostly just use re.compile, which doesn't.

Anyway, it's called htmlcleaner, and it is a bastardised, cut-down version of the fairly well-known scraper utility 'html2text', with a few of my own modifications added.

It can either convert the entities to unicode (so you get characters with accents etc.) or just strip them down to ASCII pseudo-replacements (i.e. replace an accented e with a plain e).
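
To give a rough idea of the two modes, here is a minimal sketch that uses only the Python 3 standard library (html.unescape and unicodedata). This is not htmlcleaner's actual code or API: the function names entities_to_unicode and entities_to_ascii are made up for illustration, and the real module may handle edge cases differently.

```python
import html
import unicodedata


def entities_to_unicode(text):
    # Hypothetical helper: decode HTML entities into real unicode characters.
    return html.unescape(text)


def entities_to_ascii(text):
    # Hypothetical helper: decode entities, then strip accents down to ASCII.
    decoded = html.unescape(text)
    # NFKD splits an accented letter into the base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", decoded)
    # Encoding with "ignore" drops everything that has no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")


scraped = "Caf&eacute; &amp; bar"
print(entities_to_unicode(scraped))  # Café & bar
print(entities_to_ascii(scraped))    # Cafe & bar
```

Note that in this sketch a character with no decomposition (Æ, for example) is simply dropped in the ASCII mode rather than replaced with something like "AE".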


I will probably update it every now and then, as I need it for my Icefilms addon.