py+soup basic parse question
I'm pretty comfortable with bash, cut, grep and awk, but doing the same stuff in py+soup is doing my head in. So far I can fetch the 'desc' class from an IMDB watch list, but I can't turn it into 'keys' or 'variables' that I can do anything useful with. Here is my basic tutorial script:

from bs4 import BeautifulSoup
from mechanize import Browser
import urllib2
import re

soup = BeautifulSoup(
for eachmovie in movies:
#    print eachmovie['href']+","+eachmovie.string
    print eachmovie

This will output something like:
<div class="desc">
<a href="/title/tt0187078/">Gone in Sixty Seconds</a>
<div class="desc">
<a href="/title/tt0477472/">Solo</a>
<div class="desc">
<a href="/title/tt0086250/">Scarface</a>
<div class="desc">
<a href="/title/tt0072890/">Dog Day Afternoon</a>
Which is cool and all, but I want clean strings I can feed into xbmc.

Can anyone help me carve this text up into something useful?

To be more specific, I want: the IMDB ID (the tt string, ^tt[0-9]{7} as a regex), the IMDB URL (/title/id/), the title and of course the thumbnail, i.e. (imdbid, url, title, thumbnail).
I have imdbpy, which is great for fetching stuff once I have a name or an ID, but here I just want that info for a given watchlist.
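For the ID and URL part, here is a minimal sketch using only the standard re module. It assumes every href looks like /title/tt0187078/, as in the output above; the helper name split_href is made up for illustration and is not part of bs4 or imdbpy.

```python
import re

# Assumes hrefs look like "/title/tt0187078/", as in the output above.
# The pattern mirrors the tt[0-9]{7} regex from the question.
TITLE_HREF = re.compile(r"^/title/(tt[0-9]{7})/")

def split_href(href):
    """Return (imdb_id, url) for a title href, or None if it does not match."""
    match = TITLE_HREF.match(href)
    if match is None:
        return None
    imdb_id = match.group(1)
    return imdb_id, "/title/%s/" % imdb_id

print(split_href("/title/tt0187078/"))
```

Returning None for anything that is not a title link makes it easy to skip stray links in a loop.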
Check eachmovie.a['href'] and eachmovie.a.string (or eachmovie.text)
.text works and returns the title, very neatly.

.a.string produces an error:
print "HREF=" + eachmovie.a.string
AttributeError: 'NoneType' object has no attribute 'string'

.a['href'] returns:
print "HREF=" + eachmovie.a['href']
TypeError: 'NoneType' object is not subscriptable
tried this too:
print "HREF=" + eachmovie.index

which returns:
<bound method Tag.index of <div class="desc">
<a href="/title/tt0072890/">Dog Day Afternoon</a>

It's as if there are no tags inside my selection... will keep at it.
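One possible reading of those errors: in bs4, eachmovie.a is shorthand for eachmovie.find('a'), which returns None when a matched element contains no <a> tag, so a single link-less element in the result set is enough to raise them. A small sketch of guarding against that, using made-up sample HTML rather than the real watchlist page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second div has no <a>, which is enough to trigger the error.
html = """
<div class="desc"><a href="/title/tt0477472/">Solo</a></div>
<div class="desc">No link here</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for eachmovie in soup.find_all("div", class_="desc"):
    link = eachmovie.a  # same as eachmovie.find("a"); None if there is no <a>
    if link is None:
        continue  # skip entries without a link instead of crashing
    rows.append((link["href"], link.get_text(strip=True)))

print(rows)
```

Whether this is what actually happened here depends on what the original find matched, which is not shown above.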
got it:

print eachmovie.text
print eachmovie.a['href']
print eachmovie.img

after changing the soup find to:


Dog Day Afternoon

<img class="loadlate hidden zero-z-index" height="209" loadlate=",0,140,209_.jpg" src="" width="140"/>

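# 'link' is assumed to hold the watchlist page HTML fetched earlier.
# convertEntities is a BeautifulSoup 3 argument; under bs4 it has no effect, since entities are always decoded.
soup = BeautifulSoup(link, convertEntities=BeautifulSoup.HTML_ENTITIES)
# Each watchlist entry is a div with class "list_item grid".
items = soup('div', attrs={'class' : "list_item grid"})
for i in items:
    thumb = i.a.img['src']      # first link wraps the poster image
    name = i('a')[1].string     # second link holds the title text
    href = i('a')[1]['href']    # e.g. /title/tt0072890/
    print(name, href, thumb)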
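Putting the pieces together, here is a hedged sketch that builds the (imdbid, url, title, thumbnail) tuples asked for at the top of the thread. The sample markup is invented from the snippets posted here and the real page may differ; "poster.jpg" is a placeholder. Falling back to the loadlate attribute is an assumption based on the img output above, where src was empty.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippets in this thread; the real page may differ.
html = """
<div class="list_item grid">
  <a href="/title/tt0072890/"><img class="loadlate" loadlate="poster.jpg" src=""/></a>
  <div class="desc"><a href="/title/tt0072890/">Dog Day Afternoon</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
movies = []
for item in soup.find_all("div", class_="list_item"):
    links = item.find_all("a")
    if len(links) < 2 or links[0].img is None:
        continue  # not a complete entry, skip it
    href = links[1]["href"]
    id_match = re.search(r"tt[0-9]{7}", href)
    if id_match is None:
        continue  # second link is not a title link
    img = links[0].img
    # src was empty in the output above, so fall back to the loadlate attribute
    thumb = img.get("src") or img.get("loadlate", "")
    movies.append((id_match.group(0), href, links[1].get_text(strip=True), thumb))

for movie in movies:
    print(",".join(movie))
```

Each tuple is plain strings, so it can be joined into a CSV-style line or handed on to imdbpy by ID.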
