Need help with scraping error
#1
Hey guys, I'm new to XBMC add-on development and just starting out. I was getting the hang of it until I hit a roadblock. I'm trying to do a simple scrape of a site using the code below:

import urllib2

req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
link = response.read()
response.close()


Now, this code works fine with all the sites I've tried so far, but with "http://khmerportal.com/videos" it keeps giving me a 403 Access Denied error and tells me to go to "http://www.ioerror.us/bb2-support-key?key=17566707", which suggests the site is running some kind of software like "Bad Behavior" that protects it from malicious scripts. I've been able to scrape this site just fine with .NET, so I'm not sure what in the Python library is causing the site to detect the script. Any help would be appreciated.

Oh, and if you can't help with the error, can you at least see whether you can scrape "http://khmerportal.com/videos" successfully without getting the same error?
Though a donation is not necessary, just in case you want to: Donate Here
#2
Looks like they're setting a cookie and checking that you send it back:

Set-Cookie:2c11dd65e346e5d53370eae0c74e1ee3=3c7c5569b02a89ba0a802a45aa7f90e1; path=/, bb2_screener_=1334813409+68.34.222.67; path=/videos

You'll probably need to grab this info and pass it to/from each page you visit
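A minimal sketch of that cookie round-trip using only the standard library (this was urllib2/cookielib in Python 2; shown here with the Python 3 names urllib.request/http.cookiejar). The opener keeps every Set-Cookie it receives in a jar and replays it on later requests through the same opener, which is what a site like this expects:

```python
# Sketch: persist cookies across requests so a cookie like bb2_screener_,
# set on the first response, is sent back automatically on every later one.
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Present a browser-like User-Agent on every request made through this opener.
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 '
                      '(KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11')]

def fetch(url):
    """GET a page through the shared opener; cookies accumulate in `jar`."""
    with opener.open(url) as response:
        return response.read()
```

Every call to `fetch` reuses the same jar, so the second request automatically carries whatever the first response set.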
#3
Ok, but do you have a sample of how to grab this cookie and pass it along? The only examples I've seen involve using urlopen, but if the site gives me a 403 error no matter what page I'm parsing, how would I get the cookie value? I checked, and even though the cookie name is the same as what you've posted, the value is different for me, so I can't hard-code it.
#4
The easiest way is probably going to be through the t0mm0.common library: http://t0mm0.github.com/xbmc-urlresolver...n/net.html
It's in the official repo so it's easy to include as a dependency in your addon. Here's a quick crash course in writing scraper addons for non-cooperative sites:

1. Download Google Chrome
2. Disable cookies
3. Enable incognito mode

This will be a pretty decent simulation of what you can expect from python scripting.

4. Press ctrl + shift + i to bring up the inspection window
5. Click the Network tab of the inspection window
6. In the toolbar across the bottom, click the little black circle (it will turn red which indicates that the log won't clear when you load a new page)
7. Type in your address and press enter to load the page

You'll see a bunch of things pop into the logs. Each of these is a request and response to and from the server. It also includes ALL of the information that was sent to the server. Since they can ONLY track you based on this information, this is what you need to know.

8. Scroll back to the top of the inspection window to see what the first request was

In this case, it should just say "videos" under name

9. Click on the very first request (videos in this example) and then on the Headers tab to the right

You should see something similar to the following:
Quote:
Request URL:http://khmerportal.com/videos
Request Method:GET
Status Code:200 OK

Request Headers

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Host:khmerportal.com
If-Modified-Since:Fri, 20 Apr 2012 00:57:30 GMT
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11

Response Headers

Cache-Control:post-check=0, pre-check=0
Connection:Keep-Alive
Content-Type:text/html; charset=utf-8
Date:Fri, 20 Apr 2012 00:59:19 GMT
Expires:Mon, 1 Jan 2001 00:00:00 GMT
Keep-Alive:timeout=5, max=100
Last-Modified:Fri, 20 Apr 2012 00:59:19 GMT
P3P:CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Pragma:no-cache
Server:Apache/2.2.17 (Unix) mod_ssl/2.2.17 OpenSSL/0.9.8e-fips-rhel5 mod_bwlimited/1.4
Set-Cookie:2c11dd65e346e5d53370eae0c74e1ee3=ec92f78df9e959085c4b0ef013a0d66c; path=/, bb2_screener_=1334883559+68.34.222.76; path=/videos
Transfer-Encoding:chunked
X-Content-Encoded-By:Khmer Portal
X-Powered-By:Khmer Portal

The request headers are what we sent to the server, the response headers are what it sent back. Not all of these are going to be required, so you'll want to experiment a little to figure out what you need in your request to get the desired result. Below are a couple of examples using t0mm0.common that should give you everything you need:

Requesting a page and printing the html it sends back:
Code:
from t0mm0.common.net import Net
net = Net()
response = net.http_GET('http://khmerportal.com/videos')
print response

Requesting a page, saving the cookie it gives you, and requesting a new page using that cookie:
Code:
from t0mm0.common.net import Net
net = Net()
first_response = net.http_GET('http://khmerportal.com/videos')
net.save_cookies('c:\\cookies.txt')

net.set_cookies('c:\\cookies.txt')
second_response = net.http_GET('http://khmerportal.com/videos/viewvideo/4947/chinese/white-hair')

Now, you just need to integrate the headers that you discovered were necessary above:
Code:
from t0mm0.common.net import Net
net = Net()

header_dict = {}
header_dict['Host'] = 'khmerportal.com'
header_dict['Accept-Language'] = 'en-US,en;q=0.8'

first_response = net.http_GET('http://khmerportal.com/videos', headers=header_dict)
net.save_cookies('c:\\cookies.txt')

net.set_cookies('c:\\cookies.txt')
second_response = net.http_GET('http://khmerportal.com/videos/viewvideo/4947/chinese/white-hair')

And now you have a python web browser capable of fooling 95% of the detectors out there.
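If you'd rather not depend on t0mm0.common, the same save_cookies/set_cookies round-trip can be approximated with only the standard library. This is a sketch of my assumption about what the library does internally; the file path and jar format here are my choices, not the library's:

```python
# Sketch: persist cookies to disk between requests (or between runs),
# roughly mirroring t0mm0.common.net's save_cookies/set_cookies.
import http.cookiejar
import os
import tempfile
import urllib.request

jar = http.cookiejar.LWPCookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# save_cookies equivalent: after the first request, write the jar to disk.
path = os.path.join(tempfile.gettempdir(), 'khmerportal_cookies.txt')
jar.save(path, ignore_discard=True)

# set_cookies equivalent: load the saved jar back before the next request.
jar2 = http.cookiejar.LWPCookieJar()
jar2.load(path, ignore_discard=True)
```

`ignore_discard=True` matters here: session cookies like the bb2_screener_ one are normally discarded at "browser close", so without it they would never survive the save/load round-trip.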
#5
Hey, thanks for the help, but I'm not sure if your examples are just preliminary code to get me started or if you actually ran them and they worked for you. I executed your code and it works (didn't generate an error, and I got a response) for the other sites I've tested, but it gives me the same 403 error when connecting to http://khmerportal.com/. After some searching, though, one thing did work: I've managed to scrape the site content using httplib.HTTPConnection, so I've managed to overcome one hurdle. But now, to get to the premium content, I have to log in, so I used the script below to try to log in, but it won't take; each time I get a "not a registered user" page, even though I've passed the exact POST variables I grabbed from the site. Thanks again for any insight you can give. Sorry for how the code looks, but I don't know how to format code in a forum post; I don't see any options, so I assume I need to actually know and enter the tokens.

conn = httplib.HTTPConnection(host=strdomain,timeout=30)
req = '/index.php?option=com_user&task=login'
headers = {}
headers['Accept']='text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
headers['Accept-Encoding'] = 'gzip, deflate'
headers['Referer'] = 'http://khmerportal.com/forum/index'
headers['Content-Type'] = 'application/x-www-form-urlencoded'
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 5.2; rv:11.0) Gecko/20100101 Firefox/11.0'
headers['Connection'] = 'keep-alive'
headers['Host']='khmerportal.com'
headers['Accept-Language']='en-us,en;q=0.5'
headers['Pragma']='no-cache'
headers['Cookie'] = cookieval
urlvar='username=myusername&passwd=mypassword&remember=yes&submit=Login&return=L3ZpZGVvcy92aWV3dmlkZW8vMjA5OS9jaGluZXNlL3BhcmNobWVudC0wOA%3D%3D&5ce54ae2174100a3df7efc6512bbd02a=1'
print headers
try:
    conn.request('POST', req, urlvar, headers)
except:
    print 'connection failed'
response = conn.getresponse()
link = response.read()
content = ''.join(link.splitlines()).replace('\t', '')
ckvalue = response.getheader('Set-Cookie')
print 'content: ' + link
conn.close()
#6
Use the "Post Reply" button instead of "Quick Reply" to get the full interface with the formatting buttons. You can use code tags for syntax highlighted code.

I hadn't tested that code against this site in particular; it's just what I use as my go-to scraper.

It looks like you're hardcoding the return= part above. Generally, if a site is that touchy, those tokens are only good for one request and you get a new token in the response headers of each request to be used in the next request.
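One way to avoid hardcoding those one-shot tokens is to fetch the form page first and scrape out every hidden input, then include those name/value pairs in the POST body. A rough sketch; the field names in the sample HTML are made up from values seen earlier in this thread, and real markup may need a proper parser rather than a regex:

```python
# Sketch: pull hidden <input> name/value pairs out of a login form so each
# POST carries the token the server just issued, instead of a stale one.
import re

def hidden_fields(html):
    """Collect name/value pairs from hidden inputs in an HTML fragment."""
    pairs = {}
    pattern = (r'<input[^>]*type=["\']hidden["\'][^>]*'
               r'name=["\']([^"\']+)["\'][^>]*value=["\']([^"\']*)["\']')
    for m in re.finditer(pattern, html, re.I):
        pairs[m.group(1)] = m.group(2)
    return pairs

# Hypothetical form fragment, patterned on the return=/token fields above:
sample = ('<form><input type="hidden" name="return" value="L3ZpZGVvcw==">'
          '<input type="hidden" name="5ce54ae2174100a3df7efc6512bbd02a" '
          'value="1"></form>')
```

Calling `hidden_fields(sample)` gives a dict you can urlencode into the next POST, so the token is always the fresh one from the page you just loaded.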
#7
Actually, I have a separate function that parses the site, grabs the login variables, builds the string, and returns it in that format. I was just too lazy to post it, so I hardcoded one of the outputs I get from the function in this example. I'll keep trying various things. Thanks for the help; at least what you've shown me here will help me scrape other sites.
#8
Gotcha. The next thing to check is that you're following a logical/possible path. They could track your progress using those tokens, so if you go from one page to another and it's not possible to follow that progression naturally, maybe that's how they're busting you?

Another related possibility is redirects. If you're being redirected silently and not grabbing the token in between, that could cause problems too.
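To make those silent redirects visible, you can install a handler that refuses to follow them, so each 3xx response (with its Location and Set-Cookie headers) surfaces as an HTTPError you can inspect. A sketch using Python 3's urllib.request; urllib2 in Python 2 works the same way:

```python
# Sketch: stop urllib from silently following redirects so you can see
# the intermediate responses (and any cookies/tokens they carry).
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urlopen raise HTTPError for the 3xx response
        # instead of transparently fetching the new location.
        return None

opener = urllib.request.build_opener(NoRedirect)

# Usage (network call, so commented out here):
# try:
#     opener.open('http://khmerportal.com/videos')
# except urllib.error.HTTPError as e:
#     print(e.code, e.headers.get('Location'), e.headers.get('Set-Cookie'))
```

If the login flow bounces you through an intermediate page, this is the quickest way to catch the cookie or token it hands out along the way.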
#9
It looks like it's not just me. I saw that there was a web browser add-on in the program add-on list, and I assume it essentially renders whatever web page you enter into the URL bar. I figured that if that add-on worked for this site, I could use the same technique in my script. Well, if you use that add-on and browse to the khmerportal.com site, it gives the exact same error I'm getting from my script. That at least indicates I might not be doing anything wrong; it might just not be something a Python amateur like me can handle. What I know so far is that the site is using an open-source spam/script-prevention program called Bad Behavior (http://bad-behavior.ioerror.us/). I can't seem to find any discussion anywhere about scraping sites with this software installed. I guess this thing really works, lol.
#10
Every lock can be unlocked.
Looks like you should be able to use this: https://gist.github.com/tuxsoul/bad-beha...es.inc.php
to get a better idea of what it's catching on.
#11
Well, I'm going to leave the registered content of the Khmerportal site aside for a while until I get better at this. But at least now my add-on is working for the public videos. Thanks for your help.
