Kodi Community Forum

Full Version: Scrape some text from a page?
In my addon I want to scrape some JSON and turn it into a dict so I can grab the data I'm interested in. What's the easiest way to do this?

The string is all one line in the page and looks like this:

Code:
<script>var data = { "foo": "bar" };</script>


I want to somehow get it into a Python dict. I'm assuming I would scrape just the JSON object with a regex and then convert it from JSON to a dict.

Let's assume the URL is http://example.com. I'm not sure how to make an HTTP request in my addon.
Kodi's Python includes the whole Standard Library, which is very versatile. For example, a web client can be implemented with the urllib2 module.
Code:
import urllib2
import socket

def load_page(url):
    """
    Minimalistic web-client
    """
    request = urllib2.Request(url, None,
                              {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
                               'Accept-Charset': 'UTF-8',
                               'Accept': 'text/html'})
    try:
        session = urllib2.urlopen(request, None)
    except (urllib2.URLError, socket.timeout):
        page = '404'
    else:
        page = session.read()
        session.close()
    return page
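For anyone on Python 3 (which, as far as I know, Kodi 19 and later use), urllib2 no longer exists; its pieces moved to urllib.request and urllib.error. Below is a minimal sketch of the same client under that assumption. The timeout parameter, the UTF-8 decoding and returning None instead of '404' on failure are my own choices for illustration, not part of the code above.

```python
import socket
import urllib.error
import urllib.request

# Headers copied from the urllib2 version above.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Accept-Charset': 'UTF-8',
    'Accept': 'text/html',
}


def load_page(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    request = urllib.request.Request(url, None, HEADERS)
    try:
        # The context manager closes the connection for us.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            raw = response.read()
    except (urllib.error.URLError, socket.timeout):
        return None
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    return raw.decode('utf-8', errors='replace')
```

Returning None rather than the string '404' lets the caller tell a failed request apart from a page whose body really is "404".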
To add to the above:
Code:
import json
import re

html = load_page('http://example.com')
# Capture everything between "var data = " and the next semicolon
stuff = re.compile('<script>var data = (.+?);', re.DOTALL).search(html).group(1)
js = json.loads(stuff)

# js['foo'] should equal 'bar'
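
The regex-then-json.loads step can be tried without any network access by running it against the sample string from the question. This is only a sketch: the function name extract_data and the slightly stricter pattern are hypothetical, and the pattern assumes the page really contains "var data = {...};" inside a script tag. A "};" inside a string value would cut the match short, so treat it as a starting point.

```python
import json
import re

# Assumed page layout: <script>var data = {...};</script>
DATA_RE = re.compile(r'<script>\s*var data\s*=\s*(\{.*?\})\s*;', re.DOTALL)


def extract_data(html):
    """Return the embedded JSON object as a dict, or None if missing or invalid."""
    match = DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except ValueError:
        # json.loads raises ValueError (or a subclass) on malformed JSON.
        return None


sample = '<html><script>var data = { "foo": "bar" };</script></html>'
data = extract_data(sample)
print(data['foo'])  # prints: bar
```

Checking for None before calling group(1) avoids the AttributeError you would otherwise get when the page layout changes and the pattern stops matching.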

One other extremely irritating thing about web servers is that they can return gzip-compressed data even when you specifically tell them not to. In that case you need to modify the minimalistic web-client from this:

Code:
    else:
        page = session.read()
        session.close()
    return page

to this:
Code:
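# at the top of the module
import gzip
from StringIO import StringIO

    else:
        if session.info().getheader('Content-Encoding') == 'gzip':
            # Wrap the compressed bytes in a file-like object for GzipFile
            buf = StringIO(session.read())
            f = gzip.GzipFile(fileobj=buf)
            page = f.read()
        else:
            page = session.read()
        # Close inside the else branch: session only exists if urlopen succeeded
        session.close()
    return page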
(2015-05-09, 00:14)learningit Wrote: [ -> ]One other extremely irritating thing with web servers is that they can return the data encoded gzip (compressed) even though you specifically tell the web server not to. In that case you need to modify the Minimalistic web-client from this:

I wanted to stick to the minimal, but you are right. However, IMO your code is a bit excessive. The following will do the trick if the page contents may be returned gzipped:
Code:
import urllib2
import socket
import zlib

def load_page(url, data=None):
    """
    Minimalistic web-client
    """
    request = urllib2.Request(url, None,
                           {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
                           'Accept-Charset': 'UTF-8',
                           'Accept': 'text/html'})
    try:
        session = urllib2.urlopen(request, data)
    except (urllib2.URLError, socket.timeout):
        page = '404'
    else:
        page = session.read()
        if session.info().getheader('Content-Encoding') == 'gzip':
            page = zlib.decompress(page, zlib.MAX_WBITS + 16)
        session.close()
    return page
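To see that the GzipFile approach and the one-line zlib approach give the same result, here is a small Python 3 sketch that compresses a payload locally and decompresses it both ways. The function names are hypothetical, and io.BytesIO stands in for StringIO because Python 3 response bodies are bytes.

```python
import gzip
import io
import zlib


def gunzip_with_gzipfile(body):
    """Decompress gzip bytes through a file-like wrapper."""
    with gzip.GzipFile(fileobj=io.BytesIO(body)) as f:
        return f.read()


def gunzip_with_zlib(body):
    """Decompress gzip bytes in one call; adding 16 to wbits selects the gzip format."""
    return zlib.decompress(body, zlib.MAX_WBITS + 16)


# Build a gzip payload locally so the comparison needs no network access.
original = b'<script>var data = { "foo": "bar" };</script>'
compressed = gzip.compress(original)

print(gunzip_with_gzipfile(compressed) == gunzip_with_zlib(compressed))  # prints: True
```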
(2015-05-09, 11:18)Roman_V_M Wrote: [ -> ]
I've wanted to stick to the minimal, but you are right. However, IMO your code is a bit excessive. The following will do the trick, if page contents may be returned gzipped:
Code:
import requests
requests.get(url).text
It "just works". Life's too short for urllib2.
Yep, they all work. The bigger point I was trying to make is that getting gzipped text back from the web server when you didn't list gzip as an accepted encoding can be really frustrating for someone who hasn't dealt with it before.
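
One way to guard against that surprise, sketched here as an alternative to checking the Content-Encoding header, is to look at the first two bytes of the body: every gzip stream starts with 0x1f 0x8b. The function name maybe_gunzip is hypothetical and the example assumes Python 3 bytes.

```python
import gzip
import zlib

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def maybe_gunzip(body):
    """Return body decompressed if it looks like gzip, otherwise unchanged."""
    if body[:2] != GZIP_MAGIC:
        return body
    try:
        return zlib.decompress(body, zlib.MAX_WBITS + 16)
    except zlib.error:
        # Started with the magic bytes but was not valid gzip; leave it alone.
        return body


plain = b'{ "foo": "bar" }'
print(maybe_gunzip(gzip.compress(plain)) == plain)  # prints: True
print(maybe_gunzip(plain) == plain)                 # prints: True
```

This works even when the server omits or mislabels the header, at the cost of a small chance of misreading binary data that happens to begin with the same two bytes.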