2019-06-24, 15:22
Hi,
Right now I am using a complex set of regular expressions to pull basically anything out of raw HTML input. That doesn't seem to be the best approach, though, and most devs keep saying not to use regex for HTML parsing. I saw that many addons use BS4 or some other BeautifulSoup version. I ran some benchmarks, and BeautifulSoup is even slower than my regex when I have to extract a lot of elements. The pure lxml + XPath combination, however, is the clear winner: the parsing is lightning fast. The only issue is that lxml is not pure Python, so I would have to ship a precompiled binary for each platform in order to stay "universal". That would be especially hacky on Android.
So my question is: what's the best way to parse large amounts of HTML quickly?
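For anyone who wants to reproduce this kind of comparison, here is a minimal sketch using only the standard library, so there is no binary to ship. It is not the benchmark behind the numbers above: the sample HTML, the pattern, and the names `SAMPLE`, `links_regex`, `LinkCollector` and `links_htmlparser` are made up for illustration, and `html.parser` is only a pure-Python baseline, not one of the libraries mentioned above.

```python
import re
import timeit
from html.parser import HTMLParser

# Hypothetical sample input; real pages will be messier than this.
SAMPLE = "<ul>" + "".join(
    f'<li><a href="/item/{i}">Item {i}</a></li>' for i in range(1000)
) + "</ul>"

# Naive pattern: only matches <a href="..."> with href as the first attribute.
HREF_RE = re.compile(r'<a\s+href="([^"]*)"')


def links_regex(html):
    """Extract href values with a regular expression."""
    return HREF_RE.findall(html)


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_htmlparser(html):
    """Extract href values with the stdlib, pure-Python HTML parser."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Time both extractors on the same input.
    for fn in (links_regex, links_htmlparser):
        seconds = timeit.timeit(lambda: fn(SAMPLE), number=20)
        print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```

Both functions return the same list for this input, so the timings compare like with like. Timings on a synthetic page like this one may not carry over to real-world markup.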