py+soup basic parse question
#1
Hi,
I'm pretty comfortable with bash, cut, grep and awk, but doing the same stuff in py+soup is doing my head in. So far I can fetch the 'desc' class from an IMDB watch list, but I can't turn it into 'keys' or 'variables' that I can do anything useful with. Here is my basic tutorial script:

Code:
from bs4 import BeautifulSoup
import urllib2

url = "http://www.imdb.com/user/ur35645275/watchlist"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
# Collect every <div class="desc"> block on the page
movies = soup.findAll('div', {'class': 'desc'})
for eachmovie in movies:
#    print eachmovie['href']+","+eachmovie.string
    print eachmovie

This will output something like:
Quote:
<div class="desc">
<a href="/title/tt0187078/">Gone in Sixty Seconds</a>
</div>
<div class="desc">
<a href="/title/tt0477472/">Solo</a>
</div>
<div class="desc">
<a href="/title/tt0086250/">Scarface</a>
</div>
<div class="desc">
<a href="/title/tt0072890/">Dog Day Afternoon</a>
</div>
Which is cool and all, but I want clean strings I can feed into xbmc.

Can anyone help me carve this text up into something useful?

To be more specific, I want: the IMDB ID (the tt string, ^tt[0-9]{7} as a regex), the IMDB URL (/title/id/), the title and of course the thumbnail (imdbid,url,title,thumbnail).
I have imdbpy, which is great for fetching stuff once I have a name or an ID, but here I just want that info for a given watchlist.
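
For the ID part, a minimal sketch: since the href already has the form /title/id/, the ID can be pulled out of it with a regular expression. The helper name imdb_id_from_href is hypothetical. The pattern is not anchored with ^ because the tt string sits in the middle of the href, and it accepts seven or more digits as a precaution rather than exactly seven.

```python
import re

# Matches the ID inside an href such as /title/tt0187078/.
# Seven or more digits is a precaution; the question assumes exactly seven.
IMDB_ID = re.compile(r"/title/(tt\d{7,})/")

def imdb_id_from_href(href):
    """Return the tt... ID embedded in href, or None if there is none."""
    match = IMDB_ID.search(href)
    return match.group(1) if match else None

print(imdb_id_from_href("/title/tt0187078/"))
```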
#2
Try eachmovie.a['href'] for the link and eachmovie.a.string (or eachmovie.text) for the title.
#3
.text works and returns the title, very neatly.

.a.string produces an error:
print "HREF=" + eachmovie.a.string
AttributeError: 'NoneType' object has no attribute 'string'

.a['href'] fails as well:
print "HREF=" + eachmovie.a['href']
TypeError: 'NoneType' object is not subscriptable
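
Both errors say the same thing: eachmovie.a evaluated to None, which is what BeautifulSoup gives back when a tag has no <a> descendant. One possible explanation (an assumption, since the full page is not shown here) is that at least one 'desc' div on the page has no link in it. A minimal sketch that skips such blocks, using made-up sample markup and Python 3 style bs4 calls:

```python
from bs4 import BeautifulSoup

# Sample markup is invented for illustration: one block without a link, one with.
html = """
<div class="desc">No link in this one</div>
<div class="desc">
<a href="/title/tt0477472/">Solo</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for eachmovie in soup.find_all("div", {"class": "desc"}):
    link = eachmovie.a          # first <a> descendant, or None if there is none
    if link is None:
        continue                # skip blocks without a link instead of crashing
    rows.append((link["href"], link.get_text(strip=True)))

print(rows)
```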
#4
tried this too:
print "HREF=" + eachmovie.index

which returns:
<bound method Tag.index of <div class="desc">
<a href="/title/tt0072890/">Dog Day Afternoon</a>
</div>>

It's as if there are no tags inside my selection... will keep at it.
#5
got it:

print eachmovie.text
print eachmovie.a['href']
print eachmovie.img

after changing the soup find to:

movies=soup.findAll('div',{'class':'list_item'})

returns:
Dog Day Afternoon


/title/tt0072890/
<img class="loadlate hidden zero-z-index" height="209" loadlate="http://ia.media-imdb.com/images/M/MV5BMTQyNjQ5NjczM15BMl5BanBnXkFtZTYwNDA4MTk4._V1._SY209_CR1,0,140,209_.jpg" src="http://i.media-imdb.com/images/SFaa265aa19162c9e4f3781fbae59f856d/nopicture/medium/film.png" width="140"/>

#6
Code:
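from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; convertEntities is not a bs4 option

# link is assumed to hold the watchlist page HTML, e.g. urllib2.urlopen(url).read()
soup = BeautifulSoup(link, convertEntities=BeautifulSoup.HTML_ENTITIES)
items = soup('div', attrs={'class' : "list_item grid"})
for i in items:
    thumb = i.a.img['src']      # img inside the first link of the item
    name = i('a')[1].string     # text of the second link, taken as the title
    href = i('a')[1]['href']
    print(name, href, thumb)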
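
Putting the thread together, here is a hedged sketch that returns (imdbid, url, title, thumbnail) tuples. The sample markup is an assumption modelled on the output quoted earlier in the thread, parse_watchlist is a hypothetical helper name, and preferring the loadlate attribute over src is a guess based on the img tag printed in post #5, where src points at a placeholder film.png. It is written for Python 3 with bs4.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup: poster link first, title link inside div.desc.
html = """
<div class="list_item grid">
<a href="/title/tt0072890/"><img class="loadlate" loadlate="http://example.invalid/poster.jpg" src="http://example.invalid/film.png"/></a>
<div class="desc"><a href="/title/tt0072890/">Dog Day Afternoon</a></div>
</div>
"""

def parse_watchlist(markup):
    """Return a list of (imdbid, url, title, thumbnail) tuples."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for item in soup.select("div.list_item"):
        link = item.select_one("div.desc a")
        if link is None:
            continue                      # no title link, nothing to report
        href = link["href"]
        match = re.search(r"tt\d+", href)
        imdbid = match.group(0) if match else None
        img = item.find("img")
        thumb = None
        if img is not None:
            # Prefer the lazily loaded poster URL, fall back to src.
            thumb = img.get("loadlate") or img.get("src")
        results.append((imdbid, href, link.get_text(strip=True), thumb))
    return results

movies = parse_watchlist(html)
for imdbid, url, title, thumb in movies:
    print(",".join([imdbid or "", url, title, thumb or ""]))
```

Each tuple can then be handed to imdbpy or to xbmc as plain strings.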