What's the fastest way of parsing HTML?
#1
Hi,

Right now I am using a complex set of regular expressions so I can get basically anything out of raw HTML input. However, that doesn't seem to be the optimal approach, and devs keep telling me not to use regex for HTML parsing. I saw that many add-ons use BS4 or some BeautifulSoup version. I made some benchmarks and it's even slower than my regex when I have to extract a lot of elements. However, the pure lxml + XPath combination seems to be the absolute winner, as the parsing is lightning fast. The only issue is that it's not entirely in Python, so I would have to ship a precompiled binary for each platform in order to stay "universal". It'd be especially hacky on Android.
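For context, a simplified sketch of the regex approach I mean (the pattern and sample markup here are illustrative, not my real code):

```python
import re

# Simplified sketch of the regex approach: pull every href out of
# raw HTML with a single compiled pattern.
HTML = '<div><a href="/a">A</a> <a href="/b">B</a></div>'
HREF_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)

print(HREF_RE.findall(HTML))  # ['/a', '/b']
```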

So I am asking: what's the best way to parse loads of HTML quickly?
Reply
#2
Quote:The only issue with that, that it's not entirely in Python so I would have to ship a precompiled binary for each platform in order to stay "universal". 
AFAIK that is a total no-go; you can't include any kind of precompiled binary with an add-on.

How are you actually parsing with BS4? Chances are you can speed things up a fair bit using the SoupStrainer class to parse only part of a document. e.g.


from bs4 import BeautifulSoup, SoupStrainer

only_a_tags = SoupStrainer("a")
links = BeautifulSoup(doc, "html.parser", parse_only=only_a_tags)

Otherwise you can focus your searches by calling extract() on elements and then doing the finds on those rather than on the entire HTML page. This also frees up a lot of memory, as it removes the tag or string from the tree.
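For example, a minimal sketch (markup and names invented for illustration):

```python
from bs4 import BeautifulSoup

# Detach the container from the tree with extract(), then run the
# finds on that much smaller subtree instead of the whole page.
doc = ('<html><body><div id="items">'
       '<span class="x">1</span><span class="x">2</span>'
       '</div><p>footer</p></body></html>')
soup = BeautifulSoup(doc, "html.parser")

container = soup.find("div", id="items").extract()
values = [s.get_text() for s in container.find_all("span", class_="x")]
print(values)  # ['1', '2']
```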

Apologies if this is something you have already tried or are aware of.
Reply
#3
lxml and its combinations (lxml + bs4 or lxml + html5lib) are indeed the fastest options, but they require shipping binary components for each target platform.
html5lib with the built-in ElementTree tree builder is reliable and relatively fast, but it supports only a limited set of element selectors.
bs4 + html5lib is the most reliable and powerful combination, but being pure Python it's the slowest of all the options.
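To illustrate the limited selectors, here's a sketch using stdlib ElementTree directly (html5lib's "etree" builder hands back the same element type; the markup is made up):

```python
import xml.etree.ElementTree as ET

# html5lib's default tree builder produces standard ElementTree
# elements, so you only get ElementTree's small XPath subset:
# tag names, *, //, ./, [@attr] and the like -- not full XPath.
root = ET.fromstring('<div><a href="/a">A</a><p><a href="/b">B</a></p></div>')

hrefs = [a.get("href") for a in root.iter("a")]  # all <a>, any depth
direct = root.findall("./a")                     # direct children only
print(hrefs, len(direct))  # ['/a', '/b'] 1
```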
Raspberry PI 2 + LibreELEC 8 (Kodi 17)
Samsung Galaxy Tab A 10.1 + Kodi 17 for Android
Reply
#4
(2019-06-25, 20:44)fraserc Wrote:
Quote:AFAIK that is a total no go, you can't include any kind of precompiled binary with an add-on

How are you actually parsing with BS4? Chances are you can speed things up a fair bit using the SoupStrainer class to parse only part of a document. [...]
Correct, but I once had a very complicated HTML page where I had to first extract everything inside the container div tags, then iterate over the results and extract specific class values for every single item. That doesn't sound too hard to parse, and it was still smooth on my computer, but I could feel it slowing down on the Raspberry and my phone. So right now I am trying to cache results and also apply the parse_only trick where possible. But I have used lxml and its XPath in many different Python projects, and it seems to be the killer solution. Too bad it's not shipped with Kodi.
Reply
#5
(2019-06-25, 21:02)Roman_V_M Wrote: lxml and its combinations - lxml + bs4 or lxml + html5lib are indeed the fastest option but it require shipping binary components for each target platform.
[...]

lxml also has a Python binding, or I'm not sure what to call it exactly, but with its XPath methods I measured the fastest parsing, so not even bs4 is required. [link]
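A minimal sketch of what I mean (assuming lxml is importable; the markup is illustrative):

```python
from lxml import html

# lxml's html submodule parses the page in C and exposes full XPath,
# so attribute extraction is a one-liner.
doc = '<div><a href="/a">A</a><a href="/b">B</a></div>'
tree = html.fromstring(doc)

print(tree.xpath('//a/@href'))  # ['/a', '/b']
```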

Well, sad news, but thank you all for confirming.
Reply
#6
Regarding parsing HTML with regexps: regexps can be used for picking out relatively simple patterns, and they are indeed very fast because they don't have the overhead of building a parse tree. But with complex parsing criteria they simply become an unmaintainable mess. That is why it's generally not recommended to parse HTML with regexps, but, of course, you should take your specific situation into account.
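A small sketch of where regexps fall over (invented markup):

```python
import re

# A non-greedy pattern stops at the FIRST closing tag, so any nested
# element of the same name truncates the match.
html = '<div class="c"><div>inner</div>tail</div>'
m = re.search(r'<div class="c">(.*?)</div>', html, re.DOTALL)

print(m.group(1))  # '<div>inner' -- the "tail" part is lost
```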
Reply
 