2019-12-04, 06:18
Hi all,
I have written a parsing library to scrape CGTN videos and their metadata. I am now building a Kodi video add-on on top of it.
The problem is that Kodi does not seem to support the concurrent.futures module (part of the standard library since Python 3.2, and available for Python 2 through the futures backport).
- Does anyone know whether this is available as a script module?
- Is there an alternative if not available?
My scraping library is available at https://github.com/boeboe/cgtn-videos
The reason I introduced concurrency is to decrease parsing time (sometimes I need to make an extra HTTP call per scraped list item to get all the data I need or want). This cut the scraping time for some topics from about 60 seconds to about 5 seconds, which was exactly the performance boost needed. An example:
python:
import concurrent.futures

def __process_m3u8_links(videos):
    """Helper method that checks whether the m3u8 links are valid and fixes them if not."""
    result_videos = []
    # One worker thread per video, so every link check runs in parallel
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(videos)) as executor:
        future_to_video_m3u8 = {
            executor.submit(NewsParser.__check_m3u8_link, video): video
            for video in videos
        }
        # Collect results in completion order, not in input order
        for future in concurrent.futures.as_completed(future_to_video_m3u8):
            result_videos.append(future.result())
    return result_videos
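One fallback I can imagine, if concurrent.futures stays unavailable, is the plain threading module together with a queue, since both are in the standard library of Python 2 and Python 3. The following is only a minimal sketch under that assumption: map_concurrently, its max_workers parameter and fake_check are names made up for this example and are not part of my library or of Kodi.

```python
import threading

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2


def map_concurrently(func, items, max_workers=8):
    """Apply func to every item using a small, fixed number of threads.

    Results are returned in input order. If func raises for an item,
    the exception object is stored in place of that item's result.
    """
    items = list(items)
    results = [None] * len(items)

    # Fill the queue up front so workers can stop as soon as it is empty.
    tasks = queue.Queue()
    for index, item in enumerate(items):
        tasks.put((index, item))

    def worker():
        while True:
            try:
                index, item = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                results[index] = func(item)
            except Exception as error:  # keep going if one item fails
                results[index] = error

    thread_count = min(max_workers, len(items))
    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.daemon = True
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_check(video):
    """Hypothetical stand-in for a per-video link check."""
    return video + ".m3u8"


checked = map_concurrently(fake_check, ["a", "b", "c"], max_workers=2)
```

Unlike my ThreadPoolExecutor version, this caps the number of threads instead of starting one per video, and it also copes with an empty list, where ThreadPoolExecutor(max_workers=0) would raise a ValueError.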
Are there any other best practices for concurrent/parallel scraping?
PS: I also had to refactor my scraping library to avoid Enum, as this was not supported either, although enum is in the standard library since Python 3.4 and is back-ported to 3.3, 3.2, 3.1, 2.7, 2.6, 2.5, and 2.4.
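To illustrate the kind of Enum-free replacement I mean, here is a rough sketch using a plain class with constants. The class name, member names and helper methods are invented for this example and are not the ones used in the library.

```python
class VideoType(object):
    """Plain-class stand-in for an Enum (all names are hypothetical)."""

    NEWS = "news"
    LIVE = "live"
    DOCUMENTARY = "documentary"

    @classmethod
    def values(cls):
        """Return all constant values, sorted alphabetically."""
        return sorted(
            value for name, value in vars(cls).items()
            if name.isupper()  # only the upper-case constants above
        )

    @classmethod
    def is_valid(cls, value):
        """Check whether value is one of the defined constants."""
        return value in cls.values()
```

This gives up Enum's identity and immutability guarantees, but it runs unchanged on Python 2 and Python 3 without any extra dependency.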
Best regards,
Boeboe