Help with new script, parsing html
#1
I'm developing a script for http://www.soccertvlive.net as my first go at an addon and at learning Python.

Among a few issues currently in the script, the piece I'm getting most frustrated with is the step where I need to get the list of all video page URLs.

There are 3 levels to navigate to get to the actual video:
Sport->Channel->Video Link

I'm having trouble with the 3rd level, trying to grab the video page URLs.

If I open Firefox and try to go directly to:
Code:
http://www.soccertvlive.net/watch/49039/1/watch-espn-sportscenter.html
it looks like I get redirected to the main page and need to click through the links to get there.

The same thing happens in my script: when I request that URL, I get the main page's HTML returned instead.

Code:
import sys, re, urllib, urllib2
import xbmc, xbmcgui, xbmcplugin

MainUrl = 'http://www.soccertvlive.net'      
pluginhandle = int(sys.argv[1])                

def getURL( url ):
    try:
        txdata = None
        txheaders = {'Referer': 'http://www.soccertvlive.net/',
            'X-Forwarded-For': '12.13.14.15',
            'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US;rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729)',    
        }
        req = urllib2.Request(url, txdata, txheaders)
        response = urllib2.urlopen(req)
        link=response.read()
        response.close()
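    except urllib2.URLError, e:
        # Only HTTPError carries .code; a plain URLError just has .reason
        error = 'Error: ' + str(getattr(e, 'code', getattr(e, 'reason', e)))
        xbmcgui.Dialog().ok(error,error)
        print error
        return ''  # empty string, so callers can still run findall() on the result
    else:
        return link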
        
def MainPage():
    xbmcplugin.setContent(pluginhandle, 'Sports')
    mode=1
    html=getURL(MainUrl)
    match = re.compile(
       '''<li class=.+?
            <a href="(.+?)">(.+?)</a>'''
       ).findall(html)
    for u,name in match:
       sysname = urllib.quote_plus(name)
       sysurl = urllib.quote_plus(MainUrl+u)
       u = sys.argv[0] + "?url=" + sysurl + "&mode=" + str(mode) + "&name=" + sysname
       liz=xbmcgui.ListItem(name)
       liz.setInfo( type="Video", infoLabels={ "Title": name,
                                                    "Season":int(1),
                                                    "TVShowTitle":"Sport"
                                                    })
       xbmcplugin.addDirectoryItem(handle=pluginhandle,url=u,listitem=liz,isFolder=True)
    xbmcplugin.endOfDirectory(pluginhandle)


def Sport():
    xbmcplugin.setContent(pluginhandle, 'Sport')
    mode=2
    html=getURL(url)
    match = re.compile(
       """<a class="accordlink"  href='(.+?)' target="_blank">
                    <img class="chimg" alt="(.+?)" src="(.+?)"/>
                    <span>
                        &nbsp;
                                (.+?)                    </span>"""
       ).findall(html)
    for u,name,icon,time in match:
       icon=MainUrl + icon
       name=time + " - " + name
       sysname = urllib.quote_plus(name)
       sysurl = urllib.quote_plus(url+u)
       u = sys.argv[0] + "?url=" + sysurl + "&mode=" + str(mode) + "&name=" + sysname
       liz=xbmcgui.ListItem(name, iconImage=icon)
       liz.setInfo( type="Video", infoLabels={ "Title": name,
                                                    "Season":int(1),
                                                    "TVShowTitle":"Sport"
                                                    })
       #liz.setProperty('fanart_image',fanart)
       xbmcplugin.addDirectoryItem(handle=pluginhandle,url=u,listitem=liz,isFolder=True)
    xbmcplugin.endOfDirectory(pluginhandle)

def Links():
    xbmcplugin.setContent(pluginhandle, 'Sport')
    mode=3
    html=getURL(url)
    match = re.compile(
       """<a style='font-size:12pt;color:limef;'  title='(.+?)'href='(.+?)'>.+?</a>"""
       ).findall(html)
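    for name,u in match:
       # No 'time' is captured by this pattern (that line came from Sport()), so the title is used as-is
       sysname = urllib.quote_plus(name)
       sysurl = urllib.quote_plus(u)
       u = sys.argv[0] + "?url=" + sysurl + "&mode=" + str(mode) + "&name=" + sysname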
       liz=xbmcgui.ListItem(name)
       liz.setInfo( type="Video", infoLabels={ "Title": name,
                                                    "Season":int(1),
                                                    "TVShowTitle":"Sport"
                                                    })
       #liz.setProperty('fanart_image',fanart)
       xbmcplugin.addDirectoryItem(handle=pluginhandle,url=u,listitem=liz,isFolder=True)
    xbmcplugin.endOfDirectory(pluginhandle)    
    
def get_params():
    # Parse the plugin query string (sys.argv[2], e.g. "?url=...&mode=1") into a dict
    param={}
    paramstring=sys.argv[2]
    if len(paramstring)>=2:
        cleanedparams=paramstring.replace('?','')
        # Drop a single trailing slash, if any, before splitting
        if cleanedparams.endswith('/'):
            cleanedparams=cleanedparams[:-1]
        for pair in cleanedparams.split('&'):
            splitparams=pair.split('=')
            if len(splitparams)==2:
                param[splitparams[0]]=splitparams[1]
    return param
        
params=get_params()    
url=None
name=None
mode=None

try:
        url=urllib.unquote_plus(params["url"])
except:
        pass
try:
        name=urllib.unquote_plus(params["name"])
except:
        pass
try:
        mode=int(params["mode"])
except:
        pass

print "Mode: "+str(mode)
print "URL: "+str(url)
print "Name: "+str(name)
if mode is None or url is None or len(url)<1:
    MainPage()
elif mode==1:
    Sport()
elif mode==2:
    Links()
elif mode==3:
    print "Get Rtmp"
elif mode==4:
    print "Random Episode"

The odd thing is that when I wrote a quick test script in IDLE, it worked fine and continues to do so:

Code:
import re, urllib2

url='http://www.soccertvlive.net/watch/49039/1/watch-espn-sportscenter.html'
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
html = response.read()
response.close()
match = re.compile(
       """<a style='font-size:12pt;color:limef;'  title='(.+?)'href='(.+?)'>.+?</a>"""
   ).findall(html)
for name,url in match:
    print name, url

Any tips on getting it to return the correct HTML? Perhaps there's a better way to scrape this site?

My other issues come from variations in their HTML: I'm not able to grab the full list of main sections (sports) in the MainPage function, and the event times come out a bit messed up as well.

I could use some help on using re.compile().findall better; I'm sure there are better ways of searching this HTML.
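
One way to make the findall patterns less brittle is to stop matching the page's exact whitespace and attribute order. This is only a minimal sketch, assuming the link markup looks roughly like the anchors in the pattern above; the sample HTML and the names LINK_RE and find_links are made up for illustration.

```python
import re

# Hypothetical sample; the real page markup may differ.
SAMPLE_HTML = """
<a style='font-size:12pt;color:limef;'  title='ESPN Link 1'href='/watch/49039/1/watch-espn-sportscenter.html'>Link 1</a>
<a style='font-size:12pt;'
   title='ESPN Link 2'   href='/watch/49039/2/watch-espn-sportscenter.html'>Link 2</a>
"""

# [^>]*? skips whatever attributes come before title;
# \s* tolerates any spacing (none, spaces, newlines) between title and href.
LINK_RE = re.compile(r"<a\b[^>]*?title='([^']*)'\s*href='([^']*)'", re.IGNORECASE)

def find_links(html):
    """Return (title, href) pairs for every matching anchor."""
    return LINK_RE.findall(html)

for title, href in find_links(SAMPLE_HTML):
    print(title + " -> " + href)
```

Both anchors are found even though their style attributes and spacing differ, which is the kind of variation that breaks a pattern copied character for character from the page source.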

The last thing to do once I have the video page links is to figure out how to grab the RTMP video links from each page. Could anyone point me to some good examples? So far I've found them manually and pulled them using rtmpdump.
#2
I have to say, for a first plugin, your code is looking pretty neat.

When I tried that link in my browser, I didn't get redirected.

As for re.compile, I normally use a combination of re.search (to check whether something exists in the HTML) and a very simple re.compile (if I need to scrape some text or links).
But what many XBMC addon coders recommend is BeautifulSoup (I never really got the hang of it); it's supposed to be one of the best Python scrapers.
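
For comparison, here is roughly what a parser-based approach looks like. BeautifulSoup is a third-party library, so this sketch swaps in the standard library's HTMLParser to stay self-contained; it is not BeautifulSoup's API. The class name LinkCollector and the sample HTML are hypothetical.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class LinkCollector(HTMLParser):
    """Collect (title, href) from every <a> tag that has both attributes."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        if 'title' in attrs and 'href' in attrs:
            self.links.append((attrs['title'], attrs['href']))

# Made-up fragment: one anchor with a title, one without.
sample = "<div><a title='Link 1' href='/watch/1.html'>one</a> <a href='/about.html'>about</a></div>"
collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

The parser hands over attributes by name, so attribute order, quoting style and spacing stop mattering, at the cost of a bit more code than a one-line regex.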


Unfortunately, I'm not sure how you can do the RTMP thing. My guess is that you might need to dump them all (you might be able to script this) and then just pack them in the plugin.
#3
I'm a developer by day, so clean code is a must! But it took me a while of hacking away and looking at other scripts to get the hang of how the basics work.

It's very odd: sometimes I get redirected and sometimes not, but in the script I get the home page HTML returned every time.

I'm wondering if it's due to a Python version difference: I have 2.7.1 installed for testing, and XBMC's is 2.4? But if so, and the XBMC version returns the main page HTML instead, how do I detect and handle this?
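
One possible way to detect it is to compare the URL that was requested with the URL the response actually came from, since the response object's geturl() reports the final address after any HTTP redirects. This is a minimal sketch, assuming the site uses an ordinary HTTP redirect; if it serves the home page content at the original URL, or redirects via JavaScript, this check will not catch it. The helper names was_redirected and fetch are made up.

```python
try:
    from urllib2 import Request, urlopen          # Python 2 (XBMC)
except ImportError:
    from urllib.request import Request, urlopen   # Python 3

def was_redirected(requested_url, final_url):
    """True if the server sent us somewhere other than the page we asked for."""
    return requested_url.rstrip('/') != final_url.rstrip('/')

def fetch(url, headers=None):
    """Return (html, redirected) for url.

    Hypothetical helper; it is not called below because it needs network access.
    """
    response = urlopen(Request(url, None, headers or {}))
    try:
        html = response.read()
        final_url = response.geturl()   # URL after any HTTP redirects were followed
    finally:
        response.close()
    return html, was_redirected(url, final_url)

page = 'http://www.soccertvlive.net/watch/49039/1/watch-espn-sportscenter.html'
print(was_redirected(page, page))
print(was_redirected(page, 'http://www.soccertvlive.net/'))
```

In the addon, a check like this could be used to log the final URL and skip the findall step when the page that came back is not the one requested.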

Thanks, I'll take a look at those functions! Finding the right keywords to google is sometimes tough.

Hopefully getting the RTMP links is not too tough! These feeds come and go, and I'm not sure how to find them without sniffing tools, so hopefully it can be scripted.
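
If the stream address happens to sit in the page source as plain text, pulling it out could be scripted with the same re tools. This is a rough sketch under that assumption only; many sites build the address in Flash or JavaScript instead, in which case nothing will match. The sample fragment and the names RTMP_RE and find_rtmp_urls are made up for illustration.

```python
import re

# Assumption: the stream address appears literally in the HTML,
# e.g. inside a flashvars string or an embed/param tag.
# Matches rtmp://, rtmpe://, rtmpt:// and similar scheme variants.
RTMP_RE = re.compile(r"""rtmp[a-z]{0,2}://[^\s"'<>&]+""", re.IGNORECASE)

def find_rtmp_urls(html):
    """Return the unique RTMP URLs found in html, in order of appearance."""
    seen = []
    for url in RTMP_RE.findall(html):
        if url not in seen:
            seen.append(url)
    return seen

# Hypothetical page fragment.
sample = '''<embed flashvars="streamer=rtmp://example.com/live&file=espn.flv" />
<param name="src" value="rtmp://example.com/live" />'''
print(find_rtmp_urls(sample))
```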
#4
There are a few differences between 2.4.4 and 2.7, but not that many. The older versions of IDLE are really crappy, so I'd recommend sticking with 2.7 (it's what most addon coders use).

I'm not sure about the whole redirect scenario; good luck with working it out.
#5
XBMC will be moving to external Python soon, if it hasn't already. For now you can compile XBMC with a flag (--external-python, I think) to use Python 2.7.

Rtmpdump is fully supported with authentication in XBMC. Ask BlueCop about his support for it in his Hulu plugin.
The normal XBMC log IS NOT a debug log, to enable debug logging you must toggle it on under XBMC Settings - System or in advancedsettings.xml. Use XBMC Debug Log Addon to retrieve it.
