Issue scraping, possible encryption
#1
My TMZ addon stopped working recently. I started digging around and noticed that the web pages it downloads are scrambled (or encrypted). This is the first time I've come across this issue.

Code:
import urllib2
req = urllib2.Request('http://tmz.com/videos')
content = urllib2.urlopen(req)
html = content.read()
content.close()
print html

XBMC returns: TypeError: argument 1 must be string without null bytes, not str
Python returns: random text

I also tried curl and got random text. Wireshark didn't help either.

If anyone has time, please take a look and let me know what you find. Big Grin
Reply
#2
Took a look; it appears their servers are just returning gzipped content by default, even if the client doesn't advertise support for it.

Code:
$ curl http://www.tmz.com/videos | gunzip | head
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html class="videos-section" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">
  <head><script>    var __exp = new Date(new Date().valueOf() + 604800000).toGMTString();
    document.cookie = '__fwInfo=1; path=/; expires=' + __exp + '; domain=' + document.domain + ';';
</script>
        <script type="text/javascript">var _sf_startpt=(new Date()).getTime()</script>

    <title>Celebrity Videos | TMZ.com </title>

    <meta name="robots" content="all"/>
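One quick way to confirm this from Python before reaching for gunzip (just a sketch, not part of the addon's code) is to check the first two bytes of the response: every gzip stream starts with the magic bytes 0x1f 0x8b.

```python
import gzip
import io


def gunzip_if_needed(data):
    """Decompress data if it looks like a gzip stream, else return it as-is."""
    # gzip streams always begin with the magic bytes 0x1f 0x8b
    if data[:2] == b'\x1f\x8b':
        return gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data
```

This also makes a scraper robust against the server changing its mind later: plain responses pass through untouched.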
Reply
#3
There's likely a better way but...

Code:
import urllib2
import gzip, StringIO

req = urllib2.Request('http://tmz.com/videos')
content = urllib2.urlopen(req)
# the response body is gzipped, so wrap the raw bytes in a
# file-like object and let GzipFile decompress them
gzip_filehandle = gzip.GzipFile(fileobj=StringIO.StringIO(content.read()))
html = gzip_filehandle.read()
content.close()
print html
Reply
#4
(2012-12-05, 09:24)jbel Wrote: Took a look, it appears their servers are just returning gzipped content by default, even if the client doesn't support it.

Code:
$ curl http://www.tmz.com/videos | gunzip | head
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html class="videos-section" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">
  <head><script>    var __exp = new Date(new Date().valueOf() + 604800000).toGMTString();
    document.cookie = '__fwInfo=1; path=/; expires=' + __exp + '; domain=' + document.domain + ';';
</script>
        <script type="text/javascript">var _sf_startpt=(new Date()).getTime()</script>

    <title>Celebrity Videos | TMZ.com </title>

    <meta name="robots" content="all"/>

Ok. That makes sense. I'm curious how you figured that out.

(2012-12-05, 15:15)divingmule Wrote: There's likely a better way but...

Code:
import urllib2
import gzip,StringIO
req = urllib2.Request('http://tmz.com/videos')
content = urllib2.urlopen(req)
gzip_filehandle=gzip.GzipFile(fileobj=StringIO.StringIO(content.read()))
html = gzip_filehandle.read()
content.close()
print html

Wow. Thanks. That works. Rofl
Reply
#5
(2012-12-05, 17:54)stacked Wrote: Ok. That makes sense. I'm curious on how you figured that out.

Took a peek at the HTTP headers being returned with curl and saw this: Content-Encoding: gzip

Code:
$ curl -i http://www.tmz.com/videos -s | head -20
HTTP/1.1 200 OK
Date: Wed, 05 Dec 2012 17:50:18 GMT
Server: Apache/2.2.14 (Ubuntu)
X-Powered-By: Crowd Fusion
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Wed, 05 Dec 2012 17:50:18 GMT
Vary: Accept-Encoding
Content-Type: text/html; charset="utf-8"
Cache-Control: private
Set-Cookie: phpsessionid=l28mbuj1erqma6s0sphsdmh067; expires=Fri, 07-Dec-2012 17:50:18 GMT; path=/; domain=www.tmz.com
Set-Cookie: SERVERID=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/
Content-Encoding: gzip
FEA: 8
X-LLNW-FW-Debug: data:text/plain;charset=utf-8;base64,dmVyc2lvbkluZm86CiAgdmVyc2lvbjogMTMuMy4wLjAKICByZXZpc2lvbjogMDIxMjEyLTE1MjcxMgogIGJ1aWxkdGFnOiBqZW5raW5zLWJvYi12MTMtMTA1CnN5c3RlbUluZm86CiAgc3lzdGVtSWRlbnRpZmllcjogaWFkCiAgc2VydmVyTmFtZTogd2FhOC5pYWQubGxudy5uZXQK
Transfer-Encoding: chunked
Connection: keep-alive

If you don't mind adding a dependency, you could also use the requests library. It automatically handles gzip decoding. It's already in the eden/frodo repos.

Code:
>>> import requests
>>> resp = requests.get('http://www.tmz.com/videos').content
>>> resp[:200]
'  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html class="videos-section" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="'
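For completeness, if you'd rather stay with the standard library, zlib can decode a gzip stream directly and skips the StringIO/GzipFile dance: passing 16 + MAX_WBITS as the wbits argument tells zlib to expect a gzip header and trailer. A minimal sketch:

```python
import zlib


def gunzip(data):
    # wbits = 16 + MAX_WBITS switches zlib into gzip mode
    # (header + CRC trailer), matching what the server sends
    return zlib.decompress(data, 16 + zlib.MAX_WBITS)
```

Same result as the GzipFile approach, just a one-liner.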
Reply
