How to parse a dynamic web page
#1
Hi,
I'm stuck trying to parse a webpage for a new addon I'm trying to write.
I don't have much knowledge of JavaScript and HTML, but it seems the web page I'm trying to parse is built by JS functions.
this is the page I'm trying to parse: http://shironet.mako.co.il/artist?type=d...1&prfid=41
when I try to get the contents of this page through my kodi addon (using urllib2 or t0mm0) I get the following response:

Code:
<html><head><meta charset="utf-8"></head><body><script>window.rbzns = {fiftyeightkb: 3600000, days_in_week : 1};</script><script src="//d1a702rd0dylue.cloudfront.net/js/sugarman/v7/flat.js"></script><script>rbzns.challdomain="shironet.mako.co.il"; rbzns.ctrbg="V32/QtF3Pgi6Dqt1ARD+4KIc7mDk8muFK4YT2EpGDH2GuQJ8YW7yps0GJb4Env2G6ZtynOCpu2ncPr94nCR/J9MJ7yfHAXdmAtdIWKk1bw2gsNdKYGkIF9cQ7iMvqvejyO2NxcNNXO6p7i3blEW+R/jYbbWycDq+cxOK+VKlVC7h8xPdu8T6VaoO8eKbnfjxQy3kCY+wqxvgQoml4jDwlh9fD5+UWYURw9b29rIBjHFGtJZT2Qn2t6pTdv8yzwLm";rbzns.rbzreqid="rbz-mako-reblazer05313434373137313430342c0b4411ceb84627"; winsocks(true);</script></body></html>

But when I check the page source in Chrome, I can see all the data I'm looking for.
What am I doing wrong? How can I solve this in my addon?
please help.
thanks.
Reply
#2
Go into the developer tools in your browser and select the "Network" view. Re-request the page, record the requested URLs, and save them out as a HAR file. (The Chrome interface is pretty intuitive, but Google is your best friend for answers about the development tools.) Once you've saved the HAR, load it in a text editor and look at what's happening. Most of the time the whole story is right in front of you.
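If you're comfortable with Python, you don't even need a text editor for this step: a HAR file is plain JSON, so a few lines will dump every recorded request URL and status. An untested sketch (the `list_har_requests` name is mine; the key layout follows the HAR format, where entries live under log.entries):

```python
import json

def list_har_requests(har_path):
    """Return (status, url) for every request recorded in a HAR file.

    A HAR file is plain JSON: the recorded requests live under
    log.entries, each entry holding a request and a response object.
    """
    with open(har_path, encoding="utf-8") as f:
        har = json.load(f)
    return [(entry["response"]["status"], entry["request"]["url"])
            for entry in har["log"]["entries"]]
```

Scanning that list usually shows which request actually carries the content and which ones are just challenge/redirect noise.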
Reply
#3
thanks for the quick reply!
I did what you suggested, but looking at the result I still can't understand what to do (or what I'm doing wrong).
here is a link to my HAR file:
http://filebin.ca/2M9H6OYbGRo6/shironet.mako.co.il.har
I can see the content text I am looking for in the first response. How do I perform a proper request in Python to get this content?
Reply
#4
Whoa, I just had a look at the content on the site page you're looking at. There is video content blocked due to copyright restrictions, which is a pretty good indicator that this is not a legal site. If so, it's not for discussion on the kodi.org forum.
Reply
#5
I don't know what you are talking about.
This site is the biggest music DB in Israel.
All I'm trying to extract from it is artist names, their discographies, and image links.
I think this is pretty legit.
Can you please help me or refer me to guides that can help?
Reply
#6
(2015-11-11, 07:40)deran Wrote: I don't know what you are talking about.
This site is the biggest music DB in Israel.
All I'm trying to extract from it is artist names, their discographies, and image links.
I think this is pretty legit.
Can you please help me or refer me to guides that can help?

Don't know, but have a look here: http://shironet.mako.co.il/artist?type=l...rkid=22236
That looks like a copyright infringement, but I may be wrong.

The URL to the HAR doesn't work for me. Post it again and check that it can be viewed. I'll help with the metadata-only scrape, if I can.
Reply
#7
(2015-11-11, 07:47)learningit Wrote:
(2015-11-11, 07:40)deran Wrote: I don't know what you are talking about.
This site is the biggest music DB in Israel.
All I'm trying to extract from it is artist names, their discographies, and image links.
I think this is pretty legit.
Can you please help me or refer me to guides that can help?

Don't know, but have a look here: http://shironet.mako.co.il/artist?type=l...rkid=22236
That looks like a copyright infringement, but I may be wrong.

The URL to the HAR doesn't work for me. Post it again and check that it can be viewed. I'll help with the metadata-only scrape, if I can.

No, this site is totally legit.
Again, all I'm trying to do is extract artist metadata. No video or YouTube links at all.
Try this one:
http://www.filedropper.com/shironetmakocoil
thanks!
Reply
#8
I had a look. Your ability to scrape that page will depend on your programming skills. Here's what's happening:

1) The first time you request http://shironet.mako.co.il/artist?type=d...1&prfid=41 without the proper cookies set, the response invokes a JavaScript file at http://d1a702rd0dylue.cloudfront.net/js/...v7/flat.js with a few parameters (rbzns.challdomain, rbzns.ctrbg and rbzns.rbzreqid), which can be seen in the returned HTML in your first post.

2) The JS sets up the cookies so that on the second request the Cookie header looks like:
Code:
Cookie: uniqueId=ad3266c9-b35d-be40-986a-eb195e66924d; WT_FPC=id=28a7990c9ab695fd1251447243886122:lv=1447275492634:ss=1447275492634; _ga=GA1.3.1657243271.1447218686; JSESSIONID=0FA860512A3051982CBF11B7E3A09166; _ga=GA1.4.1657243271.1447218686; rbzreqid=rbz-mako-reblazer05313434373235313839346abe0b40da203ff0; rbzid=V3Y3NjR3SWkzanoyZUhuU3hMZ1RxVE1SaFBXVmNXRWRudkVXZWZsWnBvblJ0RGpCVmp2Y3djZ2xBa1hNcitTWExOWTk1anVHa1ltOVFsUFBkS29pOWlISUpQNWRZZ25nZjFXNWp5MkpTVlhHTjBNaStUandxSFR0OEowdVcyZzc1Z3dJY21qOXlFTktBWjl4QUxTQ0k3Uy9OVlRBTmxORmh0YiszTWI1c0JQdm5TM3lJUHFxSUNkU0NXb2FUMzV2a3d5MjArb0k1cEtLbWxnaTRZcUdoU0l2TDg2VDRXN2txcys3TGRjaDh6QjNZME9jWTgvVnMvRkFWbjVzb01xN3BEUFhLTmZxYU1WK3h3TzRDOWhkQW1DVmRDOG03MGlHT3pkVVlJZjI2dVU9QEBAMEBAQDM3MDM3MDM2NzAw

3) When you request the same URL http://shironet.mako.co.il/artist?type=d...1&prfid=41 with the above cookie set, it magically returns the HTML page you want to scrape. Unfortunately, the cookie has a rather short validity period, so it needs to be refreshed from time to time.
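To confirm that cookie replay really is all it takes, you can copy the Cookie header straight out of the browser's dev tools and attach it to your own request. A rough sketch using Python 3's urllib.request (urllib2 in Kodi's Python 2 of that era works the same way; `build_request` and the sample values are mine, and the captured cookie only works while its session is still valid):

```python
import urllib.request

def build_request(url, cookie_string):
    """Build a GET request that replays a cookie captured in the browser.

    cookie_string is the raw Cookie header value copied from dev tools,
    e.g. "rbzreqid=...; rbzid=...". It expires quickly, so this only
    works while the captured session is still valid.
    """
    req = urllib.request.Request(url)
    req.add_header("Cookie", cookie_string)
    req.add_header("User-Agent", "Mozilla/5.0")  # look like a browser
    return req

# Actually fetching (needs a fresh cookie, so left commented out):
# html = urllib.request.urlopen(build_request(page_url, cookie)).read()
```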

4) If you want to scrape this page dynamically, you need to emulate the script http://d1a702rd0dylue.cloudfront.net/js/...v7/flat.js to turn the JS parameters into the required cookies. This is probably more work than you want to do, unless the code to do it already exists somewhere (I haven't come across it before, but there are people out there with lots more knowledge than I have).
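Pulling the three parameters out of the challenge page is at least the easy half: they're plain string assignments, so a regex will do. An untested sketch (`extract_rbzns_params` is my name for it; turning those values into the cookie is the part that still needs flat.js's logic):

```python
import re

# Matches assignments of the form rbzns.name="value" in the challenge page.
RBZNS_ASSIGN = re.compile(r'rbzns\.(\w+)\s*=\s*"([^"]*)"')

def extract_rbzns_params(challenge_html):
    """Collect rbzns.challdomain, rbzns.ctrbg and rbzns.rbzreqid from
    the HTML returned on the first, cookie-less request."""
    return dict(RBZNS_ASSIGN.findall(challenge_html))
```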

There are scripts/tools that convert JS into Python; they were used in the version of the YouTube addon prior to Bromix writing a proper addon using the API. There is also an open-source JS engine available from Google, the one used in Chrome. Another way is to stare at the JS code and rewrite it in Python. Yet another is to launch a browser that requests the page and then passes the HTML back to a Python addon.

The list goes on, but it may just be easier to find another source for your metadata.
Reply
#9
Thanks a lot.
Yes, I had already figured this would be the answer. Seems like they protect their DB very closely.
Reply
#10
How about parsing a JS-heavy page using the Node.js ecosystem running as a web server, so that you have a REST URL you can request from your add-on, and the Node.js script returns the data as JSON?

Sent from my MotoG3-TE
Reply
#11
Again, as I said in another topic, the correct question is not "how do I do something in Kodi?" but "how do I do something in Python?".

There is the Selenium Python library, which lets you drive various browser engines. There is also the PhantomJS headless browser engine for interacting with web pages. This combination is often used for web-site scraping or automated testing. The only problem is that you need to provide a browser engine binary compiled for your target platform.
Reply
#12
(2017-02-01, 16:00)Roman_V_M Wrote: Again, as I said in another topic, the correct question is not "how do I do something in Kodi?" but "how do I do something in Python?".

There is the Selenium Python library, which lets you drive various browser engines. There is also the PhantomJS headless browser engine for interacting with web pages. This combination is often used for web-site scraping or automated testing. The only problem is that you need to provide a browser engine binary compiled for your target platform.

Thank you, but:
I know how to read dynamic web pages with Selenium + Python; what I don't know is how to get Selenium into the Python built into Kodi, because there is no module for it available to download.
Reply


