[Solved, maybe] How do request headers in urllib2 work?
#1
I have a problem with the Russia Today News addon. Occasionally the server returns responses compressed with Content-Encoding: gzip, which makes the page difficult to parse. I've tried to force the Accept-Encoding request header to refuse gzip, both by omitting gzip from the header (as below) and by explicitly setting it to 'gzip;q=0, deflate, sdch'. Neither appears to work. I've also tried a variety of upper/lower case alternatives, thinking that there may be some case sensitivity issue. I'm not sure that I'm using the add_header method correctly; I assume that it overwrites the current/default values rather than just adding to them when the actual request goes out. If the usage is correct, I think that the server is actually ignoring the Accept-Encoding header and sending the content gzip-encoded anyway, which is new to me; I normally get back a 406 error code if the server can't send the requested encoding. Are there any library routines which can uncompress the gzip content if I can't get the server to stop sending gzip? Thanks


log("RT -- RT Live main page")
req = urllib2.Request("http://rt.com/shows/")
req.add_header('User-Agent', USER_AGENT)
req.add_header('Accept',"text/html")
req.add_header('Accept-Encoding', 'deflate,sdch')
req.add_header('Accept-Language', 'en-US,en;q=0.8')
req.add_header('Cookie','hide_ce=true')
log("request headers from req.header_items() = "+str(req.header_items()))
try:
    response = urllib2.urlopen(req)
    link1 = response.read()
    response.close()
except:
    link1 = ""

# getheader() returns None when the header is absent, so convert before concatenating
log("RT.COM Response Content-Encoding = " + str(response.info().getheader('Content-Encoding')))

xbmc.log:

13:34:34 T:5820 NOTICE: request headers = [('Accept-language', 'en-US,en;q=0.8'), ('Cookie', 'hide_ce=true'), ('Accept-encoding', 'deflate,sdch'), ('Accept', 'text/html'), ('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')]
13:34:34 T:6828 DEBUG: ### [Qlock] - Delaying 3 secs
13:34:34 T:6500 DEBUG: ------ Window Init (DialogBusy.xml) ------
13:34:34 T:5820 DEBUG: Russia Today News: RT.COM Response Content-Encoding = gzip
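
On the add_header question: the log above already hints at the answer, since every header name comes back in capitalized form ('Accept-encoding', 'User-agent'). The following is a minimal offline sketch, not taken from the add-on, that checks this behaviour without touching the network. It assumes Python 2's urllib2 and falls back to urllib.request (the Python 3 counterpart) only so that it can be run on either interpreter. It suggests that add_header replaces an existing value for the same name and that the case of the name you pass does not matter. That would mean the request in the post is being built as intended, and the server is simply choosing not to honor the header.

```python
# Offline check of how Request.add_header stores headers (no request is sent).
try:
    import urllib2  # Python 2, as used in the add-on
except ImportError:
    import urllib.request as urllib2  # Python 3 stand-in so the sketch runs on both

req = urllib2.Request("http://rt.com/shows/")
req.add_header('Accept-Encoding', 'gzip')
# Same header name in a different case: the name is normalized with
# str.capitalize(), so this call replaces the earlier value instead of
# adding a second header.
req.add_header('ACCEPT-ENCODING', 'deflate,sdch')

headers = dict(req.header_items())
print(headers)
```

Note that urllib2 does not decompress a response body on its own, so if the server sends gzip regardless, the add-on has to decode it itself.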
#2
Update:
After trying everything I could think of to stop the server from sending gzip-encoded text, I decided to check for a gzip Content-Encoding and unzip it myself. It turned out to be easier than I thought. I changed the try: shown above to:

try:
    response = urllib2.urlopen(req)
    if response.info().getheader('Content-Encoding') == 'gzip':
        log("RT -- Content Encoding == 'gzip'")
        # Wrap the compressed bytes in a file-like object for GzipFile
        buf = StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        link1 = f.read()
    else:
        link1 = response.read()
    response.close()
except:
    link1 = ""


and added the following to the start of the module:

from StringIO import StringIO
import gzip
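
The same check can be pulled into a small helper, which makes it testable without the server. This is only a sketch under a few assumptions: decode_body is a hypothetical name, io.BytesIO is swapped in for StringIO so the code runs on both Python 2.7 and Python 3, and the header value is lower-cased and stripped in case the server sends 'GZIP' or adds whitespace (the original compares against 'gzip' exactly). The gzip payload is built in memory purely to stand in for the server response.

```python
import gzip
import io

def decode_body(raw, content_encoding):
    """Return raw unchanged unless the Content-Encoding value says gzip.

    decode_body is a hypothetical helper name, not part of the add-on.
    """
    # The header may be missing (None) or differ in case or whitespace.
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw

# Build a gzip payload in memory to stand in for the server response.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(b'<html>RT shows</html>')
compressed = buf.getvalue()

print(decode_body(compressed, 'gzip'))
print(decode_body(b'plain text', None))
```

In the add-on this would presumably be called as link1 = decode_body(response.read(), response.info().getheader('Content-Encoding')) inside the existing try block.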
#3
You should really consider using add-on:requests (wiki). It decompresses gzip automatically and will save you a headache.
