queeup · 2009-09-18, 15:29

I really hate BeautifulSoup. Sometimes it works great, but mostly I don't understand why it isn't working.

I need to print every "title, desc, thumb, video" with this code, but it returns just the first ones.

Code:
import urllib2

from BeautifulSoup import BeautifulSoup

CH_ROOT = "http://www.collegehumor.com"
CH_RECENT = "/originals/recent"
CH_VIEWED = "/originals/most-viewed"
CH_LIKED = "/originals/most-liked"
CH_PLAYLIST = "/moogaloop"

def getHTML(url):
    try:
        print 'common :: getHTML :: url = ' + url
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        link = response.read()
        response.close()
    except urllib2.HTTPError, e:
        print "HTTP error: %d" % e.code
    except urllib2.URLError, e:
        print "Network error: %s" % e.reason.args[1]
    else:
        return link

html = getHTML(CH_ROOT + CH_RECENT)
soup = BeautifulSoup(html)

for result in soup.findAll("div", id="tab_content_0"):
    title = result.findAll("strong", {"class":"title"})[0].a.string.strip()
    desc = result.findAll("div", {"class":"linked_details"})[0].p.string.strip()
    thumb = result.findAll("img", {"class":"media_thumb"})[0]['src']
    video = CH_ROOT + CH_PLAYLIST + result.findAll("a", {"class":"video_link"})[0]['href']
    print title, desc, thumb, video

rwparris2 · (This post was last modified: 2009-09-18, 19:12 by rwparris2.)

There should only be 1 div with the id tab_content_0. That's why you only get 1 result. If you do len(soup.findAll("div", id="tab_content_0")) you should get 1 as a result.

What you're really interested in is the <li class="video"> stuff. So search for that instead!

If you find yourself using findAll(something)[0], just use find(something). It stops searching after the first result, and is therefore faster on large documents.

In BeautifulSoup, classes don't need to be defined in a dict unless you just like to be explicit.

Code:
from BeautifulSoup import MinimalSoup as BeautifulSoup, SoupStrainer

videoContainers = SoupStrainer('li', 'video')
for liTag in BeautifulSoup(html, parseOnlyThese = videoContainers):
    title = liTag.find('strong', 'title').a.string.strip()
    desc = liTag.p.string.strip()
    thumb = liTag.find('img', 'media_thumb')['src']
    video = liTag.find('a', 'video_link')['href']
    print title, desc, thumb, video 
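
The points in this post can be tried out on a small inline document. What follows is only a sketch in Python 3 with bs4, where the methods are spelled find_all and find (the thread itself uses Python 2 and BeautifulSoup 3). The SAMPLE markup is made up for illustration and only loosely follows the structure described here; it is not the real CollegeHumor page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real page.
SAMPLE = """
<div id="tab_content_0">
  <ul>
    <li class="video">
      <a class="video_link" href="/video:1"><img class="media_thumb" src="a.jpg"></a>
      <strong class="title"><a href="/video:1"> First </a></strong>
      <p> First description </p>
    </li>
    <li class="video">
      <a class="video_link" href="/video:2"><img class="media_thumb" src="b.jpg"></a>
      <strong class="title"><a href="/video:2"> Second </a></strong>
      <p> Second description </p>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# An id is unique, so a loop over this result runs its body only once.
containers = soup.find_all("div", id="tab_content_0")
print(len(containers))

# find(...) hands back the same tag that find_all(...)[0] would.
first = soup.find("li", "video")
print(first is soup.find_all("li", "video")[0])

# A bare string as the second argument is matched against the class attribute,
# so no dict is needed for class lookups.
videos = []
for li in soup.find_all("li", "video"):
    videos.append((
        li.find("strong", "title").a.string.strip(),
        li.p.string.strip(),
        li.find("img", "media_thumb")["src"],
        li.find("a", "video_link")["href"],
    ))
print(videos)
```

Looping over the li tags instead of the single div is what yields one row per video.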

Dan Dare · 2009-09-19, 12:05

@queeup
I find BeautifulSoup to be just like any other tool - you can use it to hammer nails in, or just as well to hammer your own fingers :-)

I found that starting small helps: do the outer search first and print what you found, then write the "for" loop and display the larger bits you find. That helps you write the code to parse the little bits without having to go back to the DOM tree explorer tool every time.
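
One way to follow that "start small" routine is to print a one-line outline of each level before writing the next search. The helper below is a rough sketch in Python 3 with bs4, not something from BeautifulSoup itself: outline and SAMPLE are hypothetical names, and the markup is invented.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def outline(tag):
    """Return one short label per direct child tag, e.g. 'li.video'."""
    labels = []
    for child in tag.children:
        if not isinstance(child, Tag):
            continue  # skip text nodes such as whitespace
        label = child.name
        if child.get("id"):
            label += "#" + child["id"]
        for cls in child.get("class", []):
            label += "." + cls
        labels.append(label)
    return labels

# Hypothetical markup, not the real page.
SAMPLE = """
<div id="tab_content_0">
  <ul>
    <li class="video"><strong class="title"><a href="/v/1">One</a></strong><p>Desc one</p></li>
    <li class="video"><strong class="title"><a href="/v/2">Two</a></strong><p>Desc two</p></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Step 1: the outer search only. Check it found something before going on.
outer = soup.find("div", id="tab_content_0")
print(outline(soup))

# Step 2: look at the larger pieces inside it.
print(outline(outer))
print(outline(outer.ul))

# Step 3: only now drill into the small bits of a single item.
print(outline(outer.ul.li))
```

Each printed list shows what the next find or find_all call has to target, so a wrong assumption about the structure shows up one level at a time.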

@rwparris2
Nice code, thanks for that. Personally I like the explicit dicts for parameters; they make maintenance easier and help other people learn and understand what the code was trying to do, although it's true that leaving them out makes the code tidier.

queeup · 2009-09-19, 14:54

Yea my fingers hurt :)

Just one issue left. How can I limit the search results?

Because @rwparris2's code finds 60 results. I need only the ('li', 'video') items inside ("div", id="tab_content_0").

tab_content_0 = recent

('li', 'video') = 20

tab_content_1 = most-viewed

('li', 'video') = 20

tab_content_2 = most-liked

('li', 'video') = 20

Dan Dare · 2009-09-19, 15:03

Then set the strainer to tab_content_0, do a findAll('li', 'video') on that, and loop over the results with a for, something like:

Code:
divContent =  BeautifulSoup(html, parseOnlyThese = SoupStrainer('div', 'tab_content_0'))
liTags     = divContent.findAll('li', 'video')
for liTag in liTags:
    title = liTag.find('strong', 'title').a.string.strip()
    desc  = liTag.p.string.strip()
    thumb = liTag.find('img', 'media_thumb')['src']
    video = liTag.find('a', 'video_link')['href']
    print title, desc, thumb, video 

queeup · 2009-09-19, 15:34

grrrrr am I stupid?

Code:
import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

html = urllib.urlopen('http://www.collegehumor.com/originals/recent')
divContent = BeautifulSoup(html, parseOnlyThese=SoupStrainer('div', 'tab_content_0'))
print divContent 

Nothing prints :(

Dan Dare · 2009-09-19, 15:40

Try SoupStrainer('div', { 'id' : 'tab_content_0' })

rwparris2 · 2009-09-19, 15:42

Dan Dare wrote: Try SoupStrainer('div', { 'id' : 'tab_content_0' })

Right, the no-dict shortcut only applies to classes.
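
The difference between the two strainers can be seen on a small made-up document. This is a sketch in Python 3 with bs4, where the keyword is parse_only (the thread's BeautifulSoup 3 calls it parseOnlyThese). SAMPLE and video_links are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup with two tabs, not the real page.
SAMPLE = """
<div id="tab_content_0"><ul>
  <li class="video"><a class="video_link" href="/video:1">recent 1</a></li>
  <li class="video"><a class="video_link" href="/video:2">recent 2</a></li>
</ul></div>
<div id="tab_content_1"><ul>
  <li class="video"><a class="video_link" href="/video:3">most viewed 1</a></li>
</ul></div>
"""

def video_links(html, strainer):
    """Parse only what the strainer lets through, then collect the video hrefs."""
    soup = BeautifulSoup(html, "html.parser", parse_only=strainer)
    return [li.find("a", "video_link")["href"] for li in soup.find_all("li", "video")]

# A bare string means class="tab_content_0"; no div has that class,
# so nothing gets parsed and the result is empty.
by_class = SoupStrainer("div", "tab_content_0")

# An explicit dict targets the id attribute instead.
by_id = SoupStrainer("div", {"id": "tab_content_0"})

print(video_links(SAMPLE, by_class))
print(video_links(SAMPLE, by_id))
```

With the id strainer, only the links from the first tab come back, which is the limit asked for earlier in the thread.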

queeup · 2009-09-19, 15:51

Thanks anyway, both of you :)

I'm confused atm. My brain stopped. I need a break. Hey Dan, are you sick and tired of my questions? lol...

Dan Dare · (This post was last modified: 2009-09-19, 16:03 by Dan Dare.)

In my view, asking questions is the best way to learn, but you still have to put in the effort to understand the answers - not that you don't, that was a generic thought :)

That's how I learned everything I know, whether by asking people, the Internet, or even myself sometimes :)

Ask, understand and try (sometimes lots of trying)...

Ourclass · (This post was last modified: 2011-03-21, 18:25 by Ourclass.)

Hi Guys,

I found your thread on Google and was wondering if you could help me.

I want to grab videos from CollegeHumor. For this I need to extract the title and image from CollegeHumor.

Code:
public function title()
{
    $this->parse();
    switch($this->_aData['site_id'])
    {
        case 'collegehumor':
            preg_match('/item_title[^>]+>([^<]+)/ms', $this->_aData['html'], $aMatches);
            if (isset($aMatches[1]))
            {
                $sTitle = trim($aMatches[1]);
            }
            break;

and for the picture:

Code:
public function image($iId)
{
    switch($this->_aData['site_id'])
    {
        case 'collegehumor':
            preg_match('/image_src[^=]+="([^"]+)/ms', $this->_aData['html'], $aMatches);
            if (isset($aMatches[1]))
            {
                $sImage = trim($aMatches[1]);
            }
            break;

The code above does everything except get me the title and the picture. :(

Later I would also like to implement other things, and I would really appreciate it if somebody could help me.

Kind regards,

webmaster
http://www.ourclass.co.uk
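
One way to narrow this down is to run the two patterns against small strings before blaming the surrounding code. The sketch below does that in Python, to match the rest of the thread, with the patterns copied from the PHP snippets. Both sample strings are invented, so this only shows what kind of markup the patterns expect. Whether the live page still contains item_title and image_src is an assumption that has to be checked against the actual HTML.

```python
import re

# Patterns copied from the PHP snippets; re.M | re.S mirrors the /ms flags.
TITLE_RE = re.compile(r'item_title[^>]+>([^<]+)', re.M | re.S)
IMAGE_RE = re.compile(r'image_src[^=]+="([^"]+)', re.M | re.S)

def first_group(pattern, html):
    """Return the trimmed first capture group, or None when nothing matches."""
    match = pattern.search(html)
    return match.group(1).strip() if match else None

# Hypothetical markup that contains the markers the patterns look for.
WITH_MARKERS = (
    '<h1 class="item_title"> Some Video </h1>'
    '<link rel="image_src" href="http://example.com/thumb.jpg">'
)

# Hypothetical markup where the markers are named differently.
WITHOUT_MARKERS = '<h1 class="video_title">Some Video</h1>'

print(first_group(TITLE_RE, WITH_MARKERS))
print(first_group(IMAGE_RE, WITH_MARKERS))
print(first_group(TITLE_RE, WITHOUT_MARKERS))
print(first_group(IMAGE_RE, WITHOUT_MARKERS))
```

If the patterns match a hand-written sample but not the downloaded page, the markup is the first thing to inspect.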