Scraping m3u8 url
#1
Hi everyone,

I'm working on an addon to grab the local weather report from this page (http://www.nbcnewyork.com/weather/).

I can find the m3u8 link by looking at network activity, but don't know how to find it in python (urllib2 only grabs the html, which doesn't have the link).

I am guessing this is a common, beginner addon development question, but I haven't found anything helpful in the forums. Could someone point me in the right direction as to how to grab this link in python?

Thanks,
Dan
#2
(2017-04-13, 19:13)cardinaldv25 Wrote: Hi everyone,

I'm working on an addon to grab the local weather report from this page (http://www.nbcnewyork.com/weather/).

I can find the m3u8 link by looking at network activity, but don't know how to find it in python (urllib2 only grabs the html, which doesn't have the link).

I am guessing this is a common, beginner addon development question, but I haven't found anything helpful in the forums. Could someone point me in the right direction as to how to grab this link in python?

Thanks,
Dan

You need to fetch the website's HTML source, then pull the link out of it with either BeautifulSoup or regex.

BTW, viewing the source is easy in most browsers: right-click on the page and select the view source or inspect option.
Image Lunatixz - Kodi / Beta repository
Image PseudoTV - Forum | Website | Youtube | Help?
#3
(2017-04-13, 21:27)Lunatixz Wrote:
You need to fetch the website's HTML source, then pull the link out of it with either BeautifulSoup or regex.

BTW, viewing the source is easy in most browsers: right-click on the page and select the view source or inspect option.


Thanks for your response. I guess I wasn't clear; the link is not in the source (I was treating the source as html). It seems like there is some javascript going on behind the scenes that I do not have access to.
#4
(2017-04-13, 21:33)cardinaldv25 Wrote:
Thanks for your response. I guess I wasn't clear; the link is not in the source (I was treating the source as html). It seems like there is some javascript going on behind the scenes that I do not have access to.

There are some ways to work around this using python but it's a little advanced for a "beginner project".
Look into dryscrape and other utilities to help parse running scripts... or even sniffers and snoopers to catch the link...
#5
There are tools in your browser to help you figure out what additional requests it is making.
I found this post really helpful
https://www.reddit.com/r/learnpython/com...ython_but/
#6
Actually, all the dynamic info you need is in the source HTML (http://www.nbcnewyork.com/weather/).
In particular, the video ID is there. If you look through the HTML you will see this (you can view the source in Chrome by pressing Ctrl+U while viewing the page):
Code:
http://www.nbcnewyork.com/portableplayer/?cmsID=419435644&videoID=tkrHIYoJkedU&origin=nbcnewyork.com&sec=weather&subsec=&width=600&height=360

The key thing you need is the video ID. NBC uses theplatform as their service for metadata and link lookups.
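As a rough sketch, the video ID can be pulled out of a player URL like the one above with the standard library's query-string parser. The helper name video_id_from_player_url is made up for illustration, and this is written for Python 3 (on Python 2 the same functions live in the urlparse module):

```python
from urllib.parse import urlparse, parse_qs


def video_id_from_player_url(player_url):
    """Return the videoID query parameter of a portableplayer URL, or None."""
    query = parse_qs(urlparse(player_url).query)
    values = query.get('videoID')
    return values[0] if values else None


# The player URL found in the page source, as quoted in this post.
player_url = ('http://www.nbcnewyork.com/portableplayer/?cmsID=419435644'
              '&videoID=tkrHIYoJkedU&origin=nbcnewyork.com&sec=weather'
              '&subsec=&width=600&height=360')

print(video_id_from_player_url(player_url))
```

This only handles a URL you have already isolated; finding that URL inside the page HTML is a separate step.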
If you do a trace in a browser you will see the theplatform request: (You can get a trace in Chrome by pressing F12 when viewing the page and clicking on the network tab)
Code:
https://link.theplatform.com/s/Yh1nAC/tkrHIYoJkedU?manifest=m3u&formats=m3u,mpeg4,webm,ogg&format=SMIL&embedded=true&tracking=true

You can see the video ID "tkrHIYoJkedU" from the html source in the link request above. The theplatform link request returns:
Code:
<smil xmlns="http://www.w3.org/2005/SMIL21/Language">
<head>
</head>
<body>
<seq>
    <ref src="http://ads.freewheel.tv" type="application/smil+xml" no-skip="true" tags="preroll">
    </ref>
    <video src="https://nbclim-f.akamaihd.net/i/Prod/NBCU_LM_VMS_-_WNBC/683/67/WNBC_000000015849066__431887.mp4,,.csmil/master.m3u8" title="Early Evening Forecast for Thursday April 13, 2017" abstract="Janice Huff&apos;s forecast for April 13." copyright="NBCUniversal, Inc." dur="134434ms" guid="pV8lgN6JQEWSPl2_61_1m_WQ4K2iybDC" categories="weather/video" keywords="no keywords" type="application/x-mpegURL" height="352" width="624">
        <param name="trackingData" value="aid=2173816108|b=986529|bc=NBCU-NBCL|br=Chrome 57|cc=US|ci=1|cid=920911939813|d=1492137158085|l=134434|mediaPid=8ozkTeWZtNbu|os=Windows 8.1|pd=1492124180000|pid=tkrHIYoJkedU|prid=37017438|rc=ME|rid=920915523512"/>
    </video>
    <ref src="http://ads.freewheel.tv" type="application/smil+xml" no-skip="true" tags="postroll">
    </ref>
</seq>
</body>
</smil>

You can see the m3u8 following <video src=" in the above xml.
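If you would rather parse the XML than search it as text, something along these lines should work with the standard library. It is only a sketch: parse_smil and parse_duration_ms are hypothetical helper names, the sample is a trimmed copy of the response above, and it assumes the response keeps the same SMIL namespace and attribute names.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <smil> root element in the response above.
SMIL_NS = '{http://www.w3.org/2005/SMIL21/Language}'


def parse_duration_ms(dur):
    """Turn a value like '134434ms' into an int, or None if it looks different."""
    if dur and dur.endswith('ms'):
        return int(dur[:-2])
    return None


def parse_smil(smil_text):
    """Return url, title and duration of the first <video> element, or None."""
    root = ET.fromstring(smil_text)
    video = root.find('.//' + SMIL_NS + 'video')
    if video is None:
        return None
    return {
        'url': video.get('src'),
        'title': video.get('title'),
        'duration_ms': parse_duration_ms(video.get('dur')),
    }


# Trimmed copy of the response quoted above (ads and tracking removed).
SAMPLE = '''<smil xmlns="http://www.w3.org/2005/SMIL21/Language">
<body><seq>
<video src="https://nbclim-f.akamaihd.net/i/Prod/NBCU_LM_VMS_-_WNBC/683/67/WNBC_000000015849066__431887.mp4,,.csmil/master.m3u8" title="Early Evening Forecast for Thursday April 13, 2017" dur="134434ms"/>
</seq></body>
</smil>'''

info = parse_smil(SAMPLE)
print(info['url'])
```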


So, to write your program you need the following steps (greatly abbreviated, and assuming you know Python):

1) You would read in the source html page (http://www.nbcnewyork.com/weather/)
2) Search the html using regex 'http://www.nbcnewyork.com/portableplayer/.+?&videoID=(.+?)&' which will return the video id.

3) Plug the video id into the theplatform url:

url = 'https://link.theplatform.com/s/Yh1nAC/'+videoID+'?manifest=m3u&formats=m3u,mpeg4,webm,ogg&format=SMIL&embedded=true&tracking=true'

4) read the html page (actually xml) given by the url.

5) You can then parse or search the xml for <video src="(.+?)" to get the m3u8 url, as well as metadata such as title, date, duration
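Put together, the steps above might look roughly like this. Treat it as a sketch under assumptions rather than a tested addon: the function names are made up for illustration, it is written for Python 3 (on Python 2 you would use urllib2.urlopen instead), and it assumes the page and the theplatform response still look like the samples in this post.

```python
import re
from urllib.request import urlopen

# Step 2 pattern from above, with the literal dots escaped.
PLAYER_RE = re.compile(
    r'http://www\.nbcnewyork\.com/portableplayer/.+?&videoID=(.+?)&')
# Step 5 pattern from above.
VIDEO_SRC_RE = re.compile(r'<video src="(.+?)"')
# Step 3 URL, with a placeholder where the video ID goes.
LINK_TEMPLATE = ('https://link.theplatform.com/s/Yh1nAC/{video_id}'
                 '?manifest=m3u&formats=m3u,mpeg4,webm,ogg&format=SMIL'
                 '&embedded=true&tracking=true')


def fetch(url):
    """Download a URL and return the body as text (steps 1 and 4)."""
    return urlopen(url).read().decode('utf-8', 'replace')


def find_video_id(html):
    """Step 2: return the video ID found in the page HTML, or None."""
    match = PLAYER_RE.search(html)
    return match.group(1) if match else None


def build_link_url(video_id):
    """Step 3: plug the video ID into the theplatform URL."""
    return LINK_TEMPLATE.format(video_id=video_id)


def find_m3u8(smil):
    """Step 5: return the first <video src="..."> value, or None."""
    match = VIDEO_SRC_RE.search(smil)
    return match.group(1) if match else None


def get_weather_m3u8(page_url='http://www.nbcnewyork.com/weather/',
                     fetch=fetch):
    """Run steps 1 to 5; 'fetch' can be swapped out for offline testing."""
    html = fetch(page_url)
    video_id = find_video_id(html)
    if video_id is None:
        return None
    return find_m3u8(fetch(build_link_url(video_id)))
```

Calling get_weather_m3u8() does the two downloads and returns the m3u8 URL, or None if either pattern fails to match. If the page writes the ampersands in the player URL as &amp;, the step 2 pattern would need adjusting.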

(Note: I don't know where you are but many NBC videos are geo-blocked to the US. The above may only work reliably in the US)
Also, for all I know the link request for theplatform is static (which would mean the videoID is static too) and https://link.theplatform.com/s/Yh1nAC/tk...cking=true will always return the current m3u8. You would have to experiment to figure that out.
#7
Thanks for all the awesome help here. Looks like my problem was that I wasn't able to identify theplatform. I'll implement all this and try to reverse engineer how you identified theplatform.

Thanks,
Dan
#8
I don't know if this will help, but I modified the Yahoo weather app to automatically detect your location and set it at startup. I created a separate .py file inside the package folder called location.py.



location.py:
import json
import urllib
import urllib2

import xbmcaddon

# Look up the approximate location from the public IP address.
web = json.loads(urllib2.urlopen('http://freegeoip.net/json/').read())
City = web['city']
State = web['region_code']
Country = '(' + web['country_code'] + ')'

# Resolve "City, State" to a Yahoo WOEID through YQL.
q = City + ', ' + State
yql = 'select * from geo.places where text="' + q + '"'
r = json.loads(urllib2.urlopen('http://query.yahooapis.com/v1/public/yql?q=' + urllib.quote(yql) + '&format=json').read())
ID = r['query']['results']['place']['woeid']

# Store the location in the Yahoo weather addon's settings.
Set = xbmcaddon.Addon(id='weather.yahoo').setSetting
try:
    Set('Location1', City + ' ' + Country)
    Set('Location1id', ID)
except Exception:
    pass


Then I added this to the addon.xml:

<extension library="location.py" point="xbmc.service" start="startup"/>