I need help with my new plugin (College Humor + BeautifulSoup)
#1
I really hate BeautifulSoup. Sometimes it works great, but mostly I don't understand why it isn't working.

I need to print every "title, desc, thumb, video" with this code, but it returns just the first ones.

Code:
import urllib2
from BeautifulSoup import BeautifulSoup

CH_ROOT = "http://www.collegehumor.com"
CH_RECENT = "/originals/recent"
CH_VIEWED = "/originals/most-viewed"
CH_LIKED = "/originals/most-liked"
CH_PLAYLIST = "/moogaloop"

def getHTML(url):
        try:
            print 'common :: getHTML :: url = ' + url
            req = urllib2.Request(url)
            response = urllib2.urlopen(req)
            link = response.read()
            response.close()
        except urllib2.HTTPError, e:
            print "HTTP error: %d" % e.code
        except urllib2.URLError, e:
            print "Network error: %s" % e.reason.args[1]
        else:
            return link

html = getHTML(CH_ROOT + CH_RECENT)
soup = BeautifulSoup(html)

for result in soup.findAll("div", id="tab_content_0"):
    title = result.findAll("strong", {"class":"title"})[0].a.string.strip()
    desc = result.findAll("div", {"class":"linked_details"})[0].p.string.strip()
    thumb = result.findAll("img", {"class":"media_thumb"})[0]['src']
    video = CH_ROOT + CH_PLAYLIST + result.findAll("a", {"class":"video_link"})[0]['href']
    print title, desc, thumb, video
#2
There should only be 1 div with the id tab_content_0. That's why you only get 1 result. If you do len(soup.findAll("div", id="tab_content_0")) you should get 1 as a result.

What you're really interested in is the <li class="video"> stuff. So search for that instead!

If you find yourself using findAll(something)[0], just use find(something). It stops searching after the first result, and is therefore faster on large documents.

In BeautifulSoup, classes don't need to be defined in a dict unless you just like to be explicit.
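
The points above can be checked on a tiny document. This is only a sketch: it assumes the newer bs4 package on Python 3 (where findAll is spelled find_all) rather than the BeautifulSoup 3 module used in this thread, and SAMPLE_HTML is a made-up snippet, not the real College Humor page.

```python
# Sketch only: bs4 on an invented HTML sample, not the real page.
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div id="tab_content_0">
  <ul>
    <li class="video"><strong class="title"><a href="/video/1">First</a></strong></li>
    <li class="video"><strong class="title"><a href="/video/2">Second</a></strong></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# An id is unique, so searching by id yields a single container.
containers = soup.find_all("div", id="tab_content_0")
container_count = len(containers)

# find() returns the first match, the same tag as find_all(...)[0].
first_via_find = soup.find("li", "video")
first_via_index = soup.find_all("li", "video")[0]

# A bare string as the second argument is matched against the class,
# so the shortcut and the explicit dict find the same tags.
titles_shortcut = [li.find("strong", "title").a.string.strip()
                   for li in soup.find_all("li", "video")]
titles_explicit = [li.find("strong", {"class": "title"}).a.string.strip()
                   for li in soup.find_all("li", {"class": "video"})]

print(container_count, titles_shortcut)
```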

Code:
from BeautifulSoup import MinimalSoup as BeautifulSoup, SoupStrainer

videoContainers = SoupStrainer('li', 'video')
for liTag in BeautifulSoup(html, parseOnlyThese=videoContainers):
    title = liTag.find('strong', 'title').a.string.strip()
    desc = liTag.p.string.strip()
    thumb = liTag.find('img', 'media_thumb')['src']
    video = liTag.find('a', 'video_link')['href']
    print title, desc, thumb, video
Always read the XBMC online-manual, FAQ and search and search the forum before posting.
For troubleshooting and bug reporting please read how to submit a proper bug report.

If you're interested in writing addons for xbmc, read docs and how-to for plugins and scripts ||| http://code.google.com/p/xbmc-addons/
#3
@queeup
I find BeautifulSoup is just like any other tool: you can use it to hammer nails in, or just as well to hammer your own fingers :-)

I found that starting small helps: do the outer search first and print what you found, then write the "for" loop and display the larger bits you find. That helps you write the code to parse the little bits without having to go back to the DOM tree explorer tool every time.
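
As a rough illustration of that step-by-step approach, here is a sketch that assumes the newer bs4 package on Python 3 instead of the BeautifulSoup 3 module from this thread. The tag and class names are the ones discussed here, but SAMPLE_HTML itself is invented.

```python
# Sketch only: bs4 on a made-up sample; the HTML is invented for illustration.
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div id="tab_content_0">
  <ul>
    <li class="video">
      <a class="video_link" href="/video/1"><img class="media_thumb" src="/t/1.jpg"></a>
      <strong class="title"><a href="/video/1"> First </a></strong>
      <p>About the first video</p>
    </li>
    <li class="video">
      <a class="video_link" href="/video/2"><img class="media_thumb" src="/t/2.jpg"></a>
      <strong class="title"><a href="/video/2"> Second </a></strong>
      <p>About the second video</p>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: do the outer search alone and check that it found something.
outer = soup.find("div", id="tab_content_0")
print("outer found:", outer is not None)

# Step 2: collect the larger pieces and look at how many there are.
items = outer.find_all("li", "video")
print("items found:", len(items))

# Step 3: only now parse the small bits inside each piece.
videos = []
for item in items:
    videos.append({
        "title": item.find("strong", "title").a.string.strip(),
        "desc": item.p.string.strip(),
        "thumb": item.find("img", "media_thumb")["src"],
        "video": item.find("a", "video_link")["href"],
    })
print(videos)
```

Printing after each step shows which search came back empty before any of the inner lookups get a chance to fail.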

@rwparris2
Nice code, thanks for that. Personally I like explicit dicts for parameters: they make maintenance easier and help other people learn and understand what the code is trying to do, although it's true that leaving them out makes the code tidier.
#4
Yeah, my fingers hurt :)

Just one issue left: how can I limit the search results?

@rwparris2's code finds 60 results. I need only the ('li', 'video') tags inside ("div", id="tab_content_0").

tab_content_0 = recent
('li', 'video') = 20
tab_content_1 = most-viewed
('li', 'video') = 20
tab_content_2 = most-liked
('li', 'video') = 20
#5
Then set the strainer to tab_content_0, do a findAll('li', 'video') on that, and loop over the results, something like:

Code:
divContent = BeautifulSoup(html, parseOnlyThese=SoupStrainer('div', 'tab_content_0'))
liTags = divContent.findAll('li', 'video')
for liTag in liTags:
    title = liTag.find('strong', 'title').a.string.strip()
    desc = liTag.p.string.strip()
    thumb = liTag.find('img', 'media_thumb')['src']
    video = liTag.find('a', 'video_link')['href']
    print title, desc, thumb, video
#6
grrrrr am I stupid?

Code:
import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

html = urllib.urlopen('http://www.collegehumor.com/originals/recent')
divContent = BeautifulSoup(html, parseOnlyThese=SoupStrainer('div', 'tab_content_0'))
print divContent
Nothing gets printed :(
#7
Try SoupStrainer('div', { 'id' : 'tab_content_0' })
#8
Dan Dare wrote: Try SoupStrainer('div', { 'id' : 'tab_content_0' })
Right, the no dict shortcut only applies to classes.
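
A small sketch of the difference, assuming the newer bs4 package on Python 3 (where parseOnlyThese is spelled parse_only) rather than the BeautifulSoup 3 module used in this thread, and an invented two-tab sample in place of the real page:

```python
# Sketch only: bs4 on a made-up sample, not the real College Humor markup.
from bs4 import BeautifulSoup, SoupStrainer

SAMPLE_HTML = """
<div id="tab_content_0"><ul><li class="video">recent</li></ul></div>
<div id="tab_content_1"><ul><li class="video">most viewed</li></ul></div>
"""

# A bare string is treated as a class name, and no div here has
# class="tab_content_0", so the strainer lets nothing through.
by_class = BeautifulSoup(SAMPLE_HTML, "html.parser",
                         parse_only=SoupStrainer("div", "tab_content_0"))

# An id has to be named explicitly, here with a dict.
by_id = BeautifulSoup(SAMPLE_HTML, "html.parser",
                      parse_only=SoupStrainer("div", {"id": "tab_content_0"}))

class_matches = len(by_class.find_all("li"))
id_matches = [li.string for li in by_id.find_all("li")]
print(class_matches, id_matches)
```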
#9
Thanks anyway, both of you :) I'm confused at the moment. My brain has stopped and I need a break. Hey Dan, are you sick and tired of my questions? lol...
#10
In my view, asking questions is the best way to learn, but you still have to put in the effort to understand the answers. Not that you don't; that was a general thought :) That's how I learned everything I know, whether by asking people, the Internet, or sometimes even myself :) Ask, understand and try (sometimes lots of trying)...
#11
Hi Guys,

I found your thread on Google and was wondering if you could help me.

I want to grab videos from collegehumor. For this I need to extract the title and image from the collegehumor page.

Code:
public function title()
    {
        $this->parse();

        switch($this->_aData['site_id'])
        {
case 'collegehumor':
                preg_match('/item_title[^>]+>([^<]+)/ms', $this->_aData['html'], $aMatches);
                if (isset($aMatches[1]))
                {
                    $sTitle = trim($aMatches[1]);
                }
                break;

and for the picture:

Code:
public function image($iId)
    {
        switch($this->_aData['site_id'])
        {
case 'collegehumor':
                preg_match('/image_src[^=]+="([^"]+)/ms', $this->_aData['html'], $aMatches);
                if (isset($aMatches[1]))
                {
                    $sImage = trim($aMatches[1]);
                }
                break;

The code above does everything except get me the title and the picture :(
Later I would also like to implement other things, and I would really appreciate it if somebody could help me.

Kind regards,

webmaster
http://www.ourclass.co.uk