
Loewkie · 2016-03-20, 21:32

I'm trying to scrape all URLs from within a JS script on a webpage using BeautifulSoup. The script looks like this.

I want to scrape all the video "url" links from within this JS script. I'm pretty convinced I cannot do this with BeautifulSoup, so what could I use instead (and preferably, what is the correct code)?
The URL also changes every time the page is loaded (and expires after a set time), so is it possible to write a bit of code that always gets a fresh URL?

**Roman_V_M** · 2016-03-21, 12:41

I guess you need to use BS to extract the <script> tag contents and then use regexes to pick the necessary elements out of them. But in this case, as far as I can see, all the necessary data is held in a JSON object, so you can use a 3-step process to convert the data into a Python dict that can then be processed as you like.
The process:
1. Pick out <script> tag contents using BS.
2. Pick out the JSON object using regexes.
3. Convert the JSON object to a Python dict using json.loads.
And with the dict you can do whatever you want.
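
The three steps above can be sketched roughly as follows. Since the actual script is not shown in the thread, the sample page, the variable name `playerConfig` and the `sources`/`url` keys are all made up for illustration. The sketch uses a regex only to find where the object starts and lets `json.JSONDecoder.raw_decode` read the object itself, which avoids matching nested braces by hand. It also assumes the object is strict JSON (double-quoted keys and strings). If BeautifulSoup is not installed, it falls back to the standard library's `html.parser`.

```python
import json
import re
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)
except ImportError:
    BeautifulSoup = None  # use the standard-library fallback below

# Hypothetical page: the real script from the thread is not shown,
# so the variable name and keys here are assumptions.
SAMPLE_HTML = """
<html><body>
<script>
var playerConfig = {"title": "Demo", "sources": [
    {"url": "http://example.com/a.mp4", "quality": "720p"},
    {"url": "http://example.com/b.mp4", "quality": "360p"}
]};
</script>
</body></html>
"""


class _ScriptCollector(HTMLParser):
    """Collects the text inside every <script> tag (fallback for step 1)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def script_texts(html):
    """Step 1: return the contents of every <script> tag as strings."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        return [tag.string or "" for tag in soup.find_all("script")]
    collector = _ScriptCollector()
    collector.feed(html)
    return collector.scripts


def extract_json(script, var_name):
    """Steps 2 and 3: find `var <name> = {` and decode the object to a dict."""
    pattern = r"\bvar\s+%s\s*=\s*(?=\{)" % re.escape(var_name)
    match = re.search(pattern, script)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing semicolon).
    data, _end = json.JSONDecoder().raw_decode(script, match.end())
    return data


def video_urls(html, var_name="playerConfig"):
    """Return the "url" values from the first script that defines var_name."""
    for script in script_texts(html):
        data = extract_json(script, var_name)
        if data is not None:
            return [source["url"] for source in data.get("sources", [])]
    return []


print(video_urls(SAMPLE_HTML))
```

For a real page, the variable name and the path to the URLs inside the dict would have to be adapted to whatever the script actually contains.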

Loewkie · 2016-03-21, 13:17

That is actually a very good idea. There is only one problem.

The code shown by "inspect element" (which contains the .mp4 URLs) is different from the page source (which does not contain the .mp4 URLs).
Is there a way to let BS search in the "inspect element" code rather than the "view source" code?

**Roman_V_M** · 2016-03-21, 14:33

I understand. You need to parse the code that exists after the page's JavaScript has run. This means you need a Python web client that not only loads web pages from URLs but, unlike urllib2 or requests, also executes JavaScript, thus emulating an actual web browser. For this, google "Selenium Python" and "script.module.webdriver". I haven't used either of these, so I can't help you further.
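
For reference, a minimal and untested-in-Kodi sketch of the Selenium route might look like this. It assumes the `selenium` package and a matching browser driver (Firefox here) are installed. The function name `fetch_rendered_html` and the `driver_factory` parameter are made up for this example. Whether this works inside a Kodi addon is not confirmed in this thread.

```python
def fetch_rendered_html(url, driver_factory=None):
    """Load url in a real browser and return the HTML after scripts have run.

    driver_factory is a hypothetical hook: any callable returning an object
    with get(), page_source and quit(). It defaults to Selenium's Firefox.
    """
    if driver_factory is None:
        # Imported lazily so the module loads even without Selenium.
        from selenium import webdriver  # third-party package
        driver_factory = webdriver.Firefox

    driver = driver_factory()
    try:
        driver.get(url)  # the browser loads the page and executes its scripts
        return driver.page_source  # the DOM serialized at this moment
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling the function again loads the page again, so each call should see whatever fresh URLs the page generates at that time. If the page fills in the URLs some time after loading, `page_source` may be read too early and an explicit wait would be needed before reading it.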