Parsing DOM element from web page?
#1
Hi,
I'm trying to grab the movie link from
http://diziport.com/gilmore_girls_4_sezo.../12_bolum/
In the page source the div 'playxf' is empty, but with Opera's 'Inspect Element' I can see another div inside it, '_playercontainer', which holds the movie link.
I suppose I have to read the data from the DOM like a browser does, but I'm a newbie and don't know where to start. Can anyone point me to some sample code or show me the way?
In the picture, the upper part shows the page source and the lower part shows the HTML DOM.


[Image: screenshot of the page source and the HTML DOM]
Reply
#2
You can just use regular expressions for parsing. Take a look at the subtitles addon.
If the HTML source does not contain what you're searching for, it has probably been generated by JavaScript.
Reply
#3
I have the same problem with the StreamPro plugin.

It looks like JavaScript is used to generate the additional HTML on the page.

Is there a standard way to handle this in the XBMC world? Or is it impossible to scrape websites that do a lot of JavaScript HTML generation?
Reply
#4
Looks like it has been done with JavaScript.
I need to do a similar thing for 2shared (execute JavaScript). I had exactly the same problem: I couldn't find the link in the source, but when I did Inspect Element in Google Chrome, it gave me the file link.
I know there's SpiderMonkey and WebKit. My advice would be to search Google for 'execute javascript on webpage python' or something like that.

If I find a way I'll share! PM me if you succeed.
Reply
#5
I'm no expert, but Google says I have to use Python's minidom parser or BeautifulSoup.py. I read some wiki pages and examples but couldn't understand them well.
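
One thing worth keeping in mind here: an HTML parser only sees the markup that was downloaded and does not run the page's scripts, so it will report the same empty div as "view source". A minimal sketch using the standard library's html.parser (Python 3); the raw_source string is a made-up stand-in for the downloaded page, not the site's real markup:

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text found inside the <div> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get('id') == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the stripped text inside the div with id=target_id."""
    parser = DivTextCollector(target_id)
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks).strip()


# Hypothetical stand-in for the page as downloaded (no JavaScript executed).
raw_source = ('<html><body><div id="playxf"></div>'
              '<script>billy();</script></body></html>')
print(repr(div_text(raw_source, 'playxf')))
```

The parser finds the div but nothing inside it, because the content is only added later by the script.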
Reply
#6
I'm still caught up on the idea of using WebKit
(Plex uses this in its plugin framework).
EDIT: http://www.google.co.uk/search?sourceid=...M+Bindings
Reply
#7
The <div> with id 'playxf' is updated via javascript from an XHR request. What you need to do is:

(1) Download the original page from your post:
http://diziport.com/gilmore_girls_4_sezon-izle/12_bolum/


In the source for the page, find the following section:
Code:
function billy()
{
    
    islem("nesne.php","get","olay=sayac&sid=1c9c35a6f2ed8bf64befc13f7eb54b4c","playxf",false);
    
}

(2) You need to extract
Code:
olay=sayac&sid=1c9c35a6f2ed8bf64befc13f7eb54b4c
to build your new URL.

Your new URL will be similar to the following:
Code:
http://diziport.com/nesne.php?olay=sayac&sid=7c4a2711a6772a6d4b2cb3a2beef04a1


You need to add a Referer header or the site returns an HTTP 404.
Code:
Referer: http://diziport.com/kizim_nerede-izle/3_bolum/
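
As a rough sketch of how that header can be attached on Python 3 (where urllib2 has become urllib.request): the build_request helper name is made up for illustration, and nothing is sent over the network until urlopen is called.

```python
import urllib.request


def build_request(url, referer):
    """Build (but do not send) a GET request carrying a Referer header."""
    return urllib.request.Request(url, headers={'Referer': referer})


req = build_request(
    'http://diziport.com/nesne.php?olay=sayac&sid=7c4a2711a6772a6d4b2cb3a2beef04a1',
    'http://diziport.com/kizim_nerede-izle/3_bolum/',
)
print(req.get_method(), req.full_url)
print(req.get_header('Referer'))

# To actually send it:
# src2 = urllib.request.urlopen(req).read()
```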

Now issue a GET request, and from the returned source:
Code:
<div id="_playerContainer">Flash player\xfdn\xfdz g\xfcncel de\xf0il. L\xfctfen <a href="http://get.adobe.com/tr/flashplayer/" target="_blank" rel="nofollow">bu adresten</a> en son s\xfcr\xfcm\xfc indiriniz.</div> \n<script language="javascript" type="text/javascript"> \n                var diziPortVid = new SWFObject("https://www.vidyoda.com/player/_vidyodaPlayer.swf", "_vidyodaPlayer" ,"567", "369", "10.0.0", "#000000");\n                diziPortVid.addParam("allowfullscreen", "true");\n                diziPortVid.addParam("allowscriptaccess", "always");\n                diziPortVid.addParam("wmode", "transparent");\n                diziPortVid.addParam(\'flashvars\', \'autoLoad=true&autoStart=false&channelID=WIMa1s2eh4E=&categoryID=hrGz0JHypH0=&strSource=http://video.ak.fbcdn.net/cfs-ak-snc6/78825/127/137133719679170_22061.mp4\');\n                diziPortVid.write("_playerContainer");\n</script>
you can extract your mp4 URL (the value of strSource).

Some sample code:

Code:
>>> import urllib2
>>> import re
>>> url = 'http://diziport.com/gilmore_girls_4_sezon-izle/12_bolum/'
>>> src = urllib2.urlopen(url).read()
>>> p = r'islem\("(?P<path>.+?)","get","(?P<qs>.+?)"'
>>> m = re.search(p, src)
>>> m.group('path')
'nesne.php'
>>> m.group('qs')
'olay=sayac&sid=1c9c35a6f2ed8bf64befc13f7eb54b4c'
>>> path = m.group('path')
>>> qs = m.group('qs')
>>> url2 = 'http://diziport.com/%s?%s' % (path, qs)
>>> req = urllib2.Request(url2)
>>> req.add_header('Referer', 'http://diziport.com/kizim_nerede-izle/3_bolum/')
>>> src2 = urllib2.urlopen(req).read()
>>> src2
'<div id="_playerContainer">Flash player\xfdn\xfdz g\xfcncel de\xf0il. L\xfctfen <a href="http://get.adobe.com/tr/flashplayer/" target="_blank" rel="nofollow">bu adresten</a> en son s\xfcr\xfcm\xfc indiriniz.</div> \n<script language="javascript" type="text/javascript"> \n                var diziPortVid = new SWFObject("https://www.vidyoda.com/player/_vidyodaPlayer.swf", "_vidyodaPlayer" ,"567", "369", "10.0.0", "#000000");\n                diziPortVid.addParam("allowfullscreen", "true");\n                diziPortVid.addParam("allowscriptaccess", "always");\n                diziPortVid.addParam("wmode", "transparent");\n                diziPortVid.addParam(\'flashvars\', \'autoLoad=true&autoStart=false&channelID=WIMa1s2eh4E=&categoryID=hrGz0JHypH0=&strSource=http://video.ak.fbcdn.net/cfs-ak-snc6/78825/127/137133719679170_22061.mp4\');\n                diziPortVid.write("_playerContainer");\n</script>'
>>> p2 = r'strSource=(.+?)\''
>>> m2 = re.search(p2, src2)
>>> m2.group(1)
'http://video.ak.fbcdn.net/cfs-ak-snc6/78825/127/137133719679170_22061.mp4'
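
The session above assumes both patterns match. If the site changes its markup, re.search returns None and .group() raises an AttributeError, so in an addon it may be safer to wrap the two regex steps in small functions that return None instead. A sketch of that idea; the function names and the base parameter are my own, and the sample strings are cut down from the output shown above:

```python
import re

ISLEM_RE = re.compile(r'islem\("(?P<path>.+?)","get","(?P<qs>.+?)"')
SOURCE_RE = re.compile(r"strSource=(.+?)'")


def xhr_url(page_src, base='http://diziport.com/'):
    """Build the XHR URL from the islem(...) call, or None if it is absent."""
    m = ISLEM_RE.search(page_src)
    if m is None:
        return None
    return '%s%s?%s' % (base, m.group('path'), m.group('qs'))


def video_url(fragment):
    """Pull the strSource value out of the returned fragment, or None."""
    m = SOURCE_RE.search(fragment)
    return m.group(1) if m else None


# Sample inputs cut down from the page source and response shown above.
page_src = ('function billy()\n{\n    islem("nesne.php","get",'
            '"olay=sayac&sid=1c9c35a6f2ed8bf64befc13f7eb54b4c","playxf",false);\n}')
fragment = ("diziPortVid.addParam('flashvars', 'autoLoad=true&autoStart=false"
            "&channelID=WIMa1s2eh4E=&categoryID=hrGz0JHypH0=&strSource="
            "http://video.ak.fbcdn.net/cfs-ak-snc6/78825/127/137133719679170_22061.mp4');")

print(xhr_url(page_src))
print(video_url(fragment))
print(xhr_url('<html>no script here</html>'))
```

Only the string handling is shown here; the download itself still needs the Referer header described earlier.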
Reply
#8
Thank you for the explanation and sample code.
Now I have something to work on again. Thanks, and thanks again!
Reply
#9
I wrote the code with your help. Here is the video section:
Code:
import re
import urllib2

# addLink, addDir, __language__ and MAINMENU are defined elsewhere in the addon.
def VIDEOLINKS(url):
    # Fetch the episode page with a browser-like User-Agent.
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
    response = urllib2.urlopen(req)
    link = response.read()
    response.close()
    # Replace Turkish characters with plain ASCII equivalents.
    link = link.replace('\xf6',"o").replace('\xd6',"O").replace('\xfc',"u").replace('\xdd',"I").replace('\xfd',"i").replace('\xe7',"c").replace('\xde',"s").replace('\xfe',"s").replace('\xc7',"c").replace('\xf0',"g")
    # Find the islem("nesne.php","get","olay=...") call and build the XHR URL.
    match = re.compile(r'islem\("(.+?)","get","(.+?)"').findall(link)
    for path, code in match:
        url2 = 'http://diziport.com/%s?%s' % (path, code)
        req = urllib2.Request(url2)
        # The site rejects the request without a Referer header.
        req.add_header('Referer', url)
        response = urllib2.urlopen(req)
        link2 = response.read()
        response.close()
        # The mp4 address is the strSource value in the flashvars string.
        movie = re.compile(r"strSource=(.+?)'").findall(link2)
        for video_url in movie:
            addLink(__language__(30003), video_url, '')
    # Add links to the remaining pages.
    page = re.compile('<li><a href="(.+?)"  rel="nofollow">(.+?)</a></li>').findall(link)
    for page_url, name in page:
        addDir('>> '+name, 'http://diziport.com/'+page_url, 5, 'special://home/addons/plugin.video.diziport/resources/images/next.png')
    MAINMENU(url)

All the code is ready for testing. You can find it via my signature: install diziport from my repository.
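
A possible alternative to the long chain of replace() calls, sketched for Python 3: the \xfd, \xfc and \xf0 bytes in the response look like ISO-8859-9 (Latin-5) Turkish text, so the bytes can be decoded first and then folded to ASCII with a translation table. The encoding is an assumption (it is not stated anywhere in the thread), and the to_ascii name is made up for illustration.

```python
# Assumption: the site serves ISO-8859-9 (Latin-5); the thread does not say so.
ENCODING = 'iso-8859-9'

# Turkish letters (written as their Latin-5 bytes) and ASCII stand-ins.
TURKISH = b'\xf6\xd6\xfc\xdc\xfd\xdd\xe7\xc7\xfe\xde\xf0\xd0'.decode(ENCODING)
ASCII_TABLE = str.maketrans(TURKISH, 'oOuUiIcCsSgG')


def to_ascii(raw_bytes):
    """Decode Latin-5 bytes and replace Turkish letters with ASCII ones."""
    return raw_bytes.decode(ENCODING).translate(ASCII_TABLE)


# Sample bytes taken from the response shown earlier in the thread.
raw = b'Flash player\xfdn\xfdz g\xfcncel de\xf0il.'
print(to_ascii(raw))
```

Unlike the replace() chain, this keeps upper and lower case apart (for example \xde becomes "S" and \xfe becomes "s").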
Reply
#10
If you are still interested, let me know and I will provide any code required.
Reply


