Parsing DOM element from web page?

drascom Offline
Senior Member
Posts: 129
Joined: Dec 2008
Reputation: 0
Question  Parsing DOM element from web page?
Post: #1
Hi,
I'm trying to grab the movie link from
http://diziport.com/gilmore_girls_4_sezo.../12_bolum/
In the page source the div 'playxf' is empty, but with Opera's 'Inspect Element' I can see there is another div, '_playercontainer', inside the playxf div, and it contains the movie link.
I suppose I have to read the data from the DOM like a browser does, but I'm a newbie and don't know where to start. Can anyone point me to some sample code or show me the way?
In the picture below, the upper part shows the page source and the lower part shows the HTML DOM.


[Image: resim_204502695.JPG]
(This post was last modified: 2010-12-28 18:44 by drascom.)
lzoubek Offline
Member
Posts: 81
Joined: Aug 2010
Reputation: 0
Post: #2
You can just use regular expressions for parsing. Take a look at the subtitles addon.
If the HTML code does not contain what you're searching for, it has probably been generated by JavaScript.
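One quick way to check this is to look through the inline scripts in the downloaded source for any mention of the element's id. A minimal sketch, assuming Python 3; `scripts_mentioning` and the sample HTML are made up for illustration and are not taken from any addon:

```python
import re

def scripts_mentioning(html, element_id):
    """Return the bodies of inline <script> blocks that mention element_id."""
    # A regex is good enough for a quick look; it is not a real HTML parser.
    bodies = re.findall(r'<script\b[^>]*>(.*?)</script>', html, re.S | re.I)
    return [body.strip() for body in bodies if element_id in body]

# Hypothetical sample page: the div is empty and a script fills it later.
sample = '''
<div id="playxf"></div>
<script type="text/javascript">
function billy() { islem("nesne.php","get","olay=sayac&sid=abc","playxf",false); }
</script>
<script>var unrelated = 1;</script>
'''

hits = scripts_mentioning(sample, 'playxf')
for body in hits:
    print(body)
```

If a script shows up here, the URL it requests is usually the next thing worth fetching by hand.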
xxxnelly Offline
Senior Member
Posts: 195
Joined: Dec 2010
Reputation: 1
Post: #3
I have the same problem with the StreamPro plugin.

It looks like JavaScript is used to generate the additional HTML on the page.

Is there a standard way to handle this in the XBMC world? Or is it impossible to scrape websites that do a lot of JavaScript HTML generation?
anarchintosh Offline
Senior Member
Posts: 279
Joined: Jul 2010
Reputation: 4
Post: #4
Looks like it has been done with JavaScript.
I need to do a similar thing for 2shared (execute JavaScript). I had exactly the same problem: I couldn't find the link in the source, but when I did Inspect Element in Google Chrome, it gave me the file link.
I know there's SpiderMonkey and WebKit. My advice would be to search Google for 'execute javascript on webpage python' or something like that.

If I find a way I'll share! PM me if you succeed.
drascom Offline
Senior Member
Posts: 129
Joined: Dec 2008
Reputation: 0
Post: #5
I'm no expert, but Google says I have to use Python's minidom parser or beautifulsoup.py. I read some wiki pages and examples but couldn't understand them well.
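For what it's worth, an HTML/DOM parser only sees the markup the server sent; it does not run the page's JavaScript, so on the raw source it will still find playxf empty. A minimal sketch with the standard library's html.parser, assuming Python 3; the `DivText` class, the `div_text` helper and both sample strings are hypothetical and only illustrate the point:

```python
from html.parser import HTMLParser

class DivText(HTMLParser):
    """Collect the text found inside the <div> with the given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0       # > 0 while we are inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth:
            self.depth += 1                      # nested div inside the target
        elif dict(attrs).get('id') == self.div_id:
            self.depth = 1                       # entering the target div

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def div_text(html, div_id):
    parser = DivText(div_id)
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks).strip()

# What the server sends (roughly) versus what the inspector shows after scripts ran.
raw_source = '<html><body><div id="playxf"></div></body></html>'
after_js = ('<html><body><div id="playxf"><div id="_playerContainer">'
            'player here</div></div></body></html>')

print(repr(div_text(raw_source, 'playxf')))
print(repr(div_text(after_js, 'playxf')))
```

So a parser helps once you have the right HTML, but getting that HTML means either running the script or repeating the request the script makes.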
anarchintosh Offline
Senior Member
Posts: 279
Joined: Jul 2010
Reputation: 4
Post: #6
I'm still caught up with the idea of using WebKit
(Plex uses this in its plugin framework)
EDIT: http://www.google.co.uk/search?sourceid=...M+Bindings
(This post was last modified: 2011-01-07 17:25 by anarchintosh.)
jbel Offline
Senior Member
Posts: 109
Joined: Apr 2009
Reputation: 5
Location: nyc
Post: #7
The <div> with id 'playxf' is updated via javascript from an XHR request. What you need to do is:

(1) Download the original page from your post:
http://diziport.com/gilmore_girls_4_sezon-izle/12_bolum/


In the source for the page, find the following section:
Code:
function billy()
{
    
    islem("nesne.php","get","olay=sayac&sid=1c9c35a6f2ed8bf64befc13f7eb54b4c","playxf",false);
    
}

(2) You need to extract
Code:
olay=sayac&sid=1c9c35a6f2ed8bf64befc13f7eb54b4c
to build your new URL.

Your new URL will be similar to the following:
Code:
http://diziport.com/nesne.php?olay=sayac&sid=7c4a2711a6772a6d4b2cb3a2beef04a1


You need to add a Referer header or the site returns an HTTP 404.
Code:
Referer: http://diziport.com/kizim_nerede-izle/3_bolum/

Now issue a GET request and from the returned source:
Code:
<div id="_playerContainer">Flash player\xfdn\xfdz g\xfcncel de\xf0il. L\xfctfen <a href="http://get.adobe.com/tr/flashplayer/" target="_blank" rel="nofollow">bu adresten</a> en son s\xfcr\xfcm\xfc indiriniz.</div> \n<script language="javascript" type="text/javascript"> \n                var diziPortVid = new SWFObject("https://www.vidyoda.com/player/_vidyodaPlayer.swf", "_vidyodaPlayer" ,"567", "369", "10.0.0", "#000000");\n                diziPortVid.addParam("allowfullscreen", "true");\n                diziPortVid.addParam("allowscriptaccess", "always");\n                diziPortVid.addParam("wmode", "transparent");\n                diziPortVid.addParam(\'flashvars\', \'autoLoad=true&autoStart=false&channelID=WIMa1s2eh4E=&categoryID=hrGz0JHypH0=&strSource=http://video.ak.fbcdn.net/cfs-ak-snc6/78825/127/137133719679170_22061.mp4\');\n                diziPortVid.write("_playerContainer");\n</script>
you can extract your mp4 URL.

Some sample code:

Code:
>>> import urllib2
>>> import re
>>> url = 'http://diziport.com/gilmore_girls_4_sezon-izle/12_bolum/'
>>> src = urllib2.urlopen(url).read()
>>> p = r'islem\("(?P<path>.+?)","get","(?P<qs>.+?)"'
>>> m = re.search(p, src)
>>> m.group('path')
'nesne.php'
>>> m.group('qs')
'olay=sayac&sid=1c9c35a6f2ed8bf64befc13f7eb54b4c'
>>> path = m.group('path')
>>> qs = m.group('qs')
>>> url2 = 'http://diziport.com/%s?%s' % (path, qs)
>>> req = urllib2.Request(url2)
>>> req.add_header('Referer', 'http://diziport.com/kizim_nerede-izle/3_bolum/')
>>> src2 = urllib2.urlopen(req).read()
>>> src2
'<div id="_playerContainer">Flash player\xfdn\xfdz g\xfcncel de\xf0il. L\xfctfen <a href="http://get.adobe.com/tr/flashplayer/" target="_blank" rel="nofollow">bu adresten</a> en son s\xfcr\xfcm\xfc indiriniz.</div> \n<script language="javascript" type="text/javascript"> \n                var diziPortVid = new SWFObject("https://www.vidyoda.com/player/_vidyodaPlayer.swf", "_vidyodaPlayer" ,"567", "369", "10.0.0", "#000000");\n                diziPortVid.addParam("allowfullscreen", "true");\n                diziPortVid.addParam("allowscriptaccess", "always");\n                diziPortVid.addParam("wmode", "transparent");\n                diziPortVid.addParam(\'flashvars\', \'autoLoad=true&autoStart=false&channelID=WIMa1s2eh4E=&categoryID=hrGz0JHypH0=&strSource=http://video.ak.fbcdn.net/cfs-ak-snc6/78825/127/137133719679170_22061.mp4\');\n                diziPortVid.write("_playerContainer");\n</script>'
>>> p2 = r'strSource=(.+?)\''
>>> m2 = re.search(p2, src2)
>>> m2.group(1)
'http://video.ak.fbcdn.net/cfs-ak-snc6/78825/127/137133719679170_22061.mp4'
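The same steps can also be wrapped in small functions so the parsing can be tried on saved page source without hitting the site. This is only a sketch, assuming Python 3 (where urllib2 became urllib.request); the function names are made up here, and sending the episode page itself as the Referer is an assumption rather than something confirmed above:

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

ISLEM_RE = re.compile(r'islem\("(?P<path>.+?)","get","(?P<qs>.+?)"')
SOURCE_RE = re.compile(r"strSource=(.+?)'")

def build_player_url(page_url, page_src):
    """Build the XHR URL from the islem(...) call, or return None."""
    m = ISLEM_RE.search(page_src)
    if m is None:
        return None
    # The leading slash makes the path relative to the site root.
    return urljoin(page_url, '/%s?%s' % (m.group('path'), m.group('qs')))

def extract_mp4(player_src):
    """Pull the strSource value out of the flashvars string, or return None."""
    m = SOURCE_RE.search(player_src)
    return m.group(1) if m else None

def fetch(url, referer=None):
    """GET url, optionally with a Referer header, and return the body as text."""
    req = Request(url)
    if referer:
        req.add_header('Referer', referer)
    with urlopen(req) as resp:
        # latin-1 maps every byte, and the regexes only look at ASCII.
        return resp.read().decode('latin-1')

def find_movie(page_url):
    """Network version of the whole flow (not called in this sketch)."""
    player_url = build_player_url(page_url, fetch(page_url))
    if player_url is None:
        return None
    return extract_mp4(fetch(player_url, referer=page_url))

# Offline check against the snippets quoted in this post.
page_url = 'http://diziport.com/gilmore_girls_4_sezon-izle/12_bolum/'
page_src = 'islem("nesne.php","get","olay=sayac&sid=1c9c35a6f2ed8bf64befc13f7eb54b4c","playxf",false);'
player_src = ("diziPortVid.addParam('flashvars', 'autoLoad=true&strSource="
              "http://video.ak.fbcdn.net/cfs-ak-snc6/78825/127/137133719679170_22061.mp4');")

print(build_player_url(page_url, page_src))
print(extract_mp4(player_src))
```

Both helpers return None when the pattern is missing, so a changed page layout shows up as a missing link instead of an exception.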
drascom Offline
Senior Member
Posts: 129
Joined: Dec 2008
Reputation: 0
Post: #8
Thank you for the explanation and the sample code.
Now I have something to work on again. Thanks, and thanks again.
drascom Offline
Senior Member
Posts: 129
Joined: Dec 2008
Reputation: 0
Post: #9
I wrote the code with your help. Here is the video section:
Code:
import re
import urllib2

# addLink, addDir, MAINMENU and __language__ are defined elsewhere in the addon.
def VIDEOLINKS(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
    response = urllib2.urlopen(req)
    link = response.read()
    response.close()
    # Swap single-byte encoded Turkish letters for plain ASCII ones.
    link = link.replace('\xf6', "o").replace('\xd6', "O").replace('\xfc', "u").replace('\xdd', "I").replace('\xfd', "i").replace('\xe7', "c").replace('\xde', "s").replace('\xfe', "s").replace('\xc7', "c").replace('\xf0', "g")
    # Find the islem("nesne.php","get","olay=...") call that fills the player div.
    match = re.compile(r'islem\("(.+?)","get","(.+?)"').findall(link)
    if match:
        # As in the original loop, the last match wins.
        path, code = match[-1]
        url2 = 'http://diziport.com/%s?%s' % (path, code)
        req = urllib2.Request(url2)
        # The site checks the Referer header on this request.
        req.add_header('Referer', url)
        response = urllib2.urlopen(req)
        link2 = response.read()
        response.close()
        movie = re.compile(r"strSource=(.+?)'").findall(link2)
        for movie_url in movie:
            addLink(__language__(30003), movie_url, '')
    # Links to the other pages of the listing.
    page = re.compile(r'<li><a href="(.+?)"  rel="nofollow">(.+?)</a></li>').findall(link)
    for page_url, name in page:
        addDir('>> ' + name, 'http://diziport.com/' + page_url, 5, 'special://home/addons/plugin.video.diziport/resources/images/next.png')
    # Loop variables are renamed above so that url is still the page URL here.
    MAINMENU(url)

All the code is ready for testing. You can find it in my signature; install diziport from my repository.
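The next-page part can be tried on its own, outside XBMC. A rough sketch, assuming Python 3; `next_pages`, the base URL argument and the sample markup are hypothetical, and the pattern is the one from VIDEOLINKS with the two spaces before rel= relaxed to any whitespace:

```python
import re
from urllib.parse import urljoin

# Same pattern as in VIDEOLINKS, but tolerant of the spacing before rel=.
PAGE_RE = re.compile(r'<li><a href="(.+?)"\s+rel="nofollow">(.+?)</a></li>')

def next_pages(base_url, page_src):
    """Return (name, absolute_url) pairs for the page links in page_src."""
    return [(name, urljoin(base_url, href))
            for href, name in PAGE_RE.findall(page_src)]

# Made-up markup; one href is relative, the other starts with a slash.
sample = ('<ul><li><a href="dizi/2_bolum/"  rel="nofollow">2</a></li>'
          '<li><a href="/dizi/3_bolum/" rel="nofollow">3</a></li></ul>')

pages = next_pages('http://diziport.com/', sample)
for name, url in pages:
    print(name, url)
```

urljoin avoids the double slash that plain string concatenation would produce when an href already starts with '/'.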
(This post was last modified: 2011-01-16 21:17 by drascom.)
StreamProTv Offline
Junior Member
Posts: 3
Joined: Jul 2011
Reputation: 0
Post: #10
If you are still interested, let me know and I shall provide any code required.