Issue scraping, possible encryption
#1
My TMZ addon stopped working recently. I started digging around and noticed that the web pages it downloads are scrambled (or encrypted). This is the first time I've come across this issue.

Code:
import urllib2
req = urllib2.Request('http://tmz.com/videos')
content = urllib2.urlopen(req)
html = content.read()
content.close()
print html

XBMC returns: TypeError: argument 1 must be string without null bytes, not str
Python returns: random text

I also tried curl and got random text. Wireshark didn't help either.

If anyone has time, please take a look and let me know what you find. Big Grin
Reply
#2
Took a look; it appears their servers are just returning gzipped content by default, even if the client doesn't advertise support for it.

Code:
$ curl http://www.tmz.com/videos | gunzip | head
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html class="videos-section" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">
  <head><script>    var __exp = new Date(new Date().valueOf() + 604800000).toGMTString();
    document.cookie = '__fwInfo=1; path=/; expires=' + __exp + '; domain=' + document.domain + ';';
</script>
        <script type="text/javascript">var _sf_startpt=(new Date()).getTime()</script>

    <title>Celebrity Videos | TMZ.com </title>

    <meta name="robots" content="all"/>
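One quick way to confirm this from Python before reaching for gunzip (just a sketch, not part of the addon's code) is to check the first two bytes of the response: every gzip stream starts with the magic bytes 0x1f 0x8b.

```python
import gzip
import io


def gunzip_if_needed(data):
    """Decompress data if it looks like a gzip stream, else return it as-is."""
    # gzip streams always begin with the magic bytes 0x1f 0x8b
    if data[:2] == b'\x1f\x8b':
        return gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data
```

This also makes a scraper robust against the server changing its mind later: plain responses pass through untouched.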
Reply
#3
There's likely a better way but...

Code:
import urllib2
import gzip, StringIO

req = urllib2.Request('http://tmz.com/videos')
content = urllib2.urlopen(req)
# the response body is gzipped, so wrap the raw bytes in a
# file-like object and let GzipFile decompress them
gzip_filehandle = gzip.GzipFile(fileobj=StringIO.StringIO(content.read()))
html = gzip_filehandle.read()
content.close()
print html
Reply
#4
(2012-12-05, 09:24)jbel Wrote: Took a look, it appears their servers are just returning gzipped content by default, even if the client doesn't support it.

Code:
$ curl http://www.tmz.com/videos | gunzip | head
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html class="videos-section" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">
  <head><script>    var __exp = new Date(new Date().valueOf() + 604800000).toGMTString();
    document.cookie = '__fwInfo=1; path=/; expires=' + __exp + '; domain=' + document.domain + ';';
</script>
        <script type="text/javascript">var _sf_startpt=(new Date()).getTime()</script>

    <title>Celebrity Videos | TMZ.com </title>

    <meta name="robots" content="all"/>

Ok. That makes sense. I'm curious how you figured that out.

(2012-12-05, 15:15)divingmule Wrote: There's likely a better way but...

Code:
import urllib2
import gzip,StringIO
req = urllib2.Request('http://tmz.com/videos')
content = urllib2.urlopen(req)
gzip_filehandle=gzip.GzipFile(fileobj=StringIO.StringIO(content.read()))
html = gzip_filehandle.read()
content.close()
print html

Wow. Thanks. That works. Rofl
Reply
#5
(2012-12-05, 17:54)stacked Wrote: Ok. That makes sense. I'm curious on how you figured that out.

Took a peek at the HTTP headers being returned with curl and saw this: Content-Encoding: gzip

Code:
$ curl -i http://www.tmz.com/videos -s | head -20
HTTP/1.1 200 OK
Date: Wed, 05 Dec 2012 17:50:18 GMT
Server: Apache/2.2.14 (Ubuntu)
X-Powered-By: Crowd Fusion
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Wed, 05 Dec 2012 17:50:18 GMT
Vary: Accept-Encoding
Content-Type: text/html; charset="utf-8"
Cache-Control: private
Set-Cookie: phpsessionid=l28mbuj1erqma6s0sphsdmh067; expires=Fri, 07-Dec-2012 17:50:18 GMT; path=/; domain=www.tmz.com
Set-Cookie: SERVERID=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/
Content-Encoding: gzip
FEA: 8
X-LLNW-FW-Debug: data:text/plain;charset=utf-8;base64,dmVyc2lvbkluZm86CiAgdmVyc2lvbjogMTMuMy4wLjAKICByZXZpc2lvbjogMDIxMjEyLTE1MjcxMgogIGJ1aWxkdGFnOiBqZW5raW5zLWJvYi12MTMtMTA1CnN5c3RlbUluZm86CiAgc3lzdGVtSWRlbnRpZmllcjogaWFkCiAgc2VydmVyTmFtZTogd2FhOC5pYWQubGxudy5uZXQK
Transfer-Encoding: chunked
Connection: keep-alive

If you don't mind adding a dependency, you could also use the requests library. It automatically handles gzip decoding. It's already in the eden/frodo repos.

Code:
>>> import requests
>>> resp = requests.get('http://www.tmz.com/videos').content
>>> resp[:200]
'  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html class="videos-section" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="'
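For completeness, if you'd rather stay with the standard library, zlib can decode a gzip stream directly and skips the StringIO/GzipFile dance: passing 16 + MAX_WBITS as the wbits argument tells zlib to expect a gzip header and trailer. A minimal sketch:

```python
import zlib


def gunzip(data):
    # wbits = 16 + MAX_WBITS switches zlib into gzip mode
    # (header + CRC trailer), matching what the server sends
    return zlib.decompress(data, 16 + zlib.MAX_WBITS)
```

Same result as the GzipFile approach, just a one-liner.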
Reply
