2011-04-09, 17:56
I'm developing a script for http://www.soccertvlive.net as my first go at an addon while learning Python.
Among the issues currently in the script, the piece I'm getting most frustrated with is the step where I need to collect the list of all the video page URLs.
There are 3 levels to navigate to get to the actual video:
Sport -> Channel -> Video Link
I'm having trouble at the 3rd level, trying to grab the video page URLs.
If I open Firefox and try to go directly to:
Code:
http://www.soccertvlive.net/watch/49039/1/watch-espn-sportscenter.html
it looks like I get redirected to the main page and need to click through the links to get there.
The same thing happens in my script: when I request that URL, I get the HTML of the main page back instead. Here's the full script:
Code:
import sys, xbmc, xbmcgui, xbmcplugin, urllib, urllib2, re

MainUrl = 'http://www.soccertvlive.net'
pluginhandle = int(sys.argv[1])

def getURL(url):
    try:
        txdata = None
        txheaders = {'Referer': 'http://www.soccertvlive.net/',
                     'X-Forwarded-For': '12.13.14.15',
                     'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US;rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729)',
                     }
        req = urllib2.Request(url, txdata, txheaders)
        response = urllib2.urlopen(req)
        link = response.read()
        response.close()
    except urllib2.URLError, e:
        # only HTTPError has a .code attribute; a plain URLError does not
        error = 'Error: ' + str(getattr(e, 'code', e))
        xbmcgui.Dialog().ok(error, error)
        print error
        return False
    else:
        return link
def MainPage():
    xbmcplugin.setContent(pluginhandle, 'Sports')
    mode = 1
    html = getURL(MainUrl)
    # [^>]* tolerates whatever attributes follow class=, and \s* absorbs
    # the whitespace/newline variations between the <li> and the <a>
    match = re.compile(
        r'<li class=[^>]*>\s*<a href="(.+?)">(.+?)</a>',
        re.DOTALL
        ).findall(html)
    for u, name in match:
        sysname = urllib.quote_plus(name)
        sysurl = urllib.quote_plus(MainUrl + u)
        u = sys.argv[0] + "?url=" + sysurl + "&mode=" + str(mode) + "&name=" + sysname
        liz = xbmcgui.ListItem(name)
        liz.setInfo(type="Video", infoLabels={"Title": name,
                                              "Season": 1,
                                              "TVShowTitle": "Sport"})
        xbmcplugin.addDirectoryItem(handle=pluginhandle, url=u, listitem=liz, isFolder=True)
    xbmcplugin.endOfDirectory(pluginhandle)
def Sport():
    xbmcplugin.setContent(pluginhandle, 'Sport')
    mode = 2
    html = getURL(url)
    # \s* between the tags plus re.DOTALL rides out the newline and
    # indentation differences between channel entries
    match = re.compile(
        r"""<a class="accordlink" href='(.+?)' target="_blank">\s*"""
        r"""<img class="chimg" alt="(.+?)" src="(.+?)"/>\s*"""
        r"""<span>\s*(.+?)\s*</span>""",
        re.DOTALL
        ).findall(html)
    for u, name, icon, time in match:
        icon = MainUrl + icon
        name = time + " - " + name
        sysname = urllib.quote_plus(name)
        sysurl = urllib.quote_plus(url + u)
        u = sys.argv[0] + "?url=" + sysurl + "&mode=" + str(mode) + "&name=" + sysname
        liz = xbmcgui.ListItem(name, iconImage=icon)
        liz.setInfo(type="Video", infoLabels={"Title": name,
                                              "Season": 1,
                                              "TVShowTitle": "Sport"})
        #liz.setProperty('fanart_image', fanart)
        xbmcplugin.addDirectoryItem(handle=pluginhandle, url=u, listitem=liz, isFolder=True)
    xbmcplugin.endOfDirectory(pluginhandle)
def Links():
    xbmcplugin.setContent(pluginhandle, 'Sport')
    mode = 3
    html = getURL(url)
    match = re.compile(
        r"""<a style='font-size:12pt;color:limef;' title='(.+?)'href='(.+?)'>.+?</a>"""
        ).findall(html)
    for name, u in match:
        # no time is captured on this page, so don't prepend one
        # (referencing an undefined 'time' here raised a NameError)
        liz = xbmcgui.ListItem(name)
        liz.setInfo(type="Video", infoLabels={"Title": name,
                                              "Season": 1,
                                              "TVShowTitle": "Sport"})
        #liz.setProperty('fanart_image', fanart)
        xbmcplugin.addDirectoryItem(handle=pluginhandle, url=u, listitem=liz, isFolder=True)
    xbmcplugin.endOfDirectory(pluginhandle)
def get_params():
    param = {}
    paramstring = sys.argv[2]
    if len(paramstring) >= 2:
        # drop the leading '?' and any trailing '/' before splitting
        cleanedparams = paramstring.replace('?', '')
        if cleanedparams.endswith('/'):
            cleanedparams = cleanedparams[:-1]
        for pair in cleanedparams.split('&'):
            splitparams = pair.split('=')
            if len(splitparams) == 2:
                param[splitparams[0]] = splitparams[1]
    return param
params = get_params()
url = None
name = None
mode = None
try:
    url = urllib.unquote_plus(params["url"])
except:
    pass
try:
    name = urllib.unquote_plus(params["name"])
except:
    pass
try:
    mode = int(params["mode"])
except:
    pass

print "Mode: " + str(mode)
print "URL: " + str(url)
print "Name: " + str(name)

if mode is None or url is None or len(url) < 1:
    MainPage()
elif mode == 1:
    Sport()
elif mode == 2:
    Links()
elif mode == 3:
    print "Get Rtmp"
elif mode == 4:
    print "Random Episode"
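As an aside on get_params: the standard library already knows how to split query strings, so the hand-rolled parser isn't strictly needed. Here's a minimal sketch, written as a hypothetical qs_params that takes the query string as an argument instead of reading sys.argv (the import fallback is only so the same snippet also runs on Python 3):

```python
try:
    from urlparse import parse_qs        # Python 2, matching the addon
except ImportError:
    from urllib.parse import parse_qs    # fallback so the sketch runs on Python 3 too

def qs_params(paramstring):
    # qs_params is a hypothetical stand-in for get_params above:
    # strip the leading '?' and any trailing '/' XBMC may append
    qs = paramstring.lstrip('?').rstrip('/')
    # parse_qs returns {key: [values]}; keep the first value of each
    return dict((k, v[0]) for k, v in parse_qs(qs).items())
```

One thing to watch: parse_qs already percent-decodes the values, so with this version the later urllib.unquote_plus calls would become redundant.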
The odd thing is that when I wrote a quick test script in IDLE, it worked fine and still does:
Code:
import urllib2, re

url = 'http://www.soccertvlive.net/watch/49039/1/watch-espn-sportscenter.html'
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
html = response.read()
response.close()
match = re.compile(
    r"""<a style='font-size:12pt;color:limef;' title='(.+?)'href='(.+?)'>.+?</a>"""
    ).findall(html)
for name, url in match:
    print name, url
Any tips on getting it to return the correct HTML? Or is there perhaps a better way to scrape this site?
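In case it helps anyone poking at the same problem: one common cause of this kind of redirect is a session cookie that the site sets when you first land on the main page, so direct requests without it bounce back. (Another difference worth checking: the IDLE test sends only a User-Agent, while the addon's getURL also sends Referer and X-Forwarded-For.) I'm not certain cookies are what this site checks, but a sketch of warming up a shared cookie jar with urllib2/cookielib would look like this (the import fallback is only so the sketch also runs under Python 3):

```python
try:
    import urllib2, cookielib            # Python 2, as in the addon
except ImportError:
    import urllib.request as urllib2     # Python 3 fallback, for testing the sketch
    import http.cookiejar as cookielib

MainUrl = 'http://www.soccertvlive.net'

# a single opener shared across requests, so any cookie the main page
# sets is sent back automatically on the watch-page request
cookiejar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0'),
                     ('Referer', MainUrl + '/')]

def getURLWithSession(url):
    # hypothetical variant of getURL: visit the main page once to pick
    # up any session cookie, then fetch the page actually wanted
    if len(cookiejar) == 0:
        opener.open(MainUrl).close()
    response = opener.open(url)
    html = response.read()
    response.close()
    return html
```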
My other issues are with the site's HTML itself: variations in the markup mean I can't grab the full list of main sections (sports) in the MainPage function, and the event times come out a bit mangled for the same reason.
I could also use some help using re.compile().findall more effectively; I'm sure there are better ways of searching this HTML.
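On the regex side, the usual trick for markup that varies is to stop matching exact whitespace and attribute runs: [^>]* for attribute soup, \s* between tags, and re.DOTALL so .+? can cross newlines. A self-contained example against two invented variants of a menu entry (the HTML below is made up, not copied from the site):

```python
import re

# two invented variants of the same menu entry: different attributes,
# different whitespace/newlines between the <li> and the <a>
html = """
<li class="odd first"><a href="/football.html">Football</a></li>
<li class=even id=x>
    <a href="/tennis.html">Tennis</a></li>
"""

# [^>]* swallows whatever attributes follow class=, \s* rides out the
# newlines and indentation, re.DOTALL lets .+? span lines if needed
pattern = re.compile(r'<li class=[^>]*>\s*<a href="(.+?)">(.+?)</a>', re.DOTALL)
matches = pattern.findall(html)
```

Both variants match here, where a pattern with a literal newline baked in would only catch one of them.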
The last thing to do once I have the video page links is figuring out how to grab the RTMP video links from each page. Could anyone point me to some good examples? I've already found them manually and pulled them with rtmpdump.
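For the RTMP step, since the stream URLs were already findable by hand, one hedged approach is simply to scan each video page (or the player setup block embedded in it) for rtmp:// URLs. Where the URL actually lives (plain source, JavaScript, or a flashvars/SWF parameter) depends on the player, so treat this as a sketch against invented sample HTML:

```python
import re

# invented sample of what a player setup block might look like; the
# real pages may bury the URL in a flashvars/SWF parameter instead
html = """
<script>
  streamer: 'rtmp://live.example.com/stream',
  file: 'espn_sc',
</script>
"""

# rtmp, rtmpe, rtmpt, rtmpte all start with 'rtmp'; stop at quotes,
# whitespace, or angle brackets
rtmp_links = re.findall(r"rtmp[a-z]*://[^\s'\"<>]+", html)
```

rtmpdump would then take the matched URL via -r, and in an addon the URL could be handed to the ListItem as its path (often with playpath/swfUrl details appended, matching what rtmpdump needed on the command line).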