Kodi Community Forum

I need help for my new plugin (College Humor + BeautifulSoup)
I really hate BeautifulSoup. Sometimes it works great, but mostly I don't understand why it isn't working.

I need to print every "title, desc, thumb, video" set with this code, but it returns only the first one.

import urllib2
from BeautifulSoup import BeautifulSoup

CH_ROOT = "http://www.collegehumor.com"
CH_RECENT = "/originals/recent"
CH_VIEWED = "/originals/most-viewed"
CH_LIKED = "/originals/most-liked"
CH_PLAYLIST = "/moogaloop"

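def getHTML(url):
    """Fetch url and return the page body, or None if the request fails."""
    link = None
    try:
        print 'common :: getHTML :: url = ' + url
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        link = response.read()
    except urllib2.HTTPError, e:
        print "HTTP error: %d" % e.code
    except urllib2.URLError, e:
        # e.reason may be a plain string, so don't index into it
        print "Network error: %s" % e.reason
    return link

# Assumption: the post never showed where html came from;
# the "recent" page is fetched here.
html = getHTML(CH_ROOT + CH_RECENT)
soup = BeautifulSoup(html)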

for result in soup.findAll("div", id="tab_content_0"):
    title = result.findAll("strong", {"class":"title"})[0].a.string.strip()
    desc = result.findAll("div", {"class":"linked_details"})[0].p.string.strip()
    thumb = result.findAll("img", {"class":"media_thumb"})[0]['src']
    video = CH_ROOT + CH_PLAYLIST + result.findAll("a", {"class":"video_link"})[0]['href']
    print title, desc, thumb, video
There should only be 1 div with the id tab_content_0. That's why you only get 1 result. If you do len(soup.findAll("div", id="tab_content_0")) you should get 1 as a result.

What you're really interested in is the <li class="video"> stuff. So search for that instead!
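
To make the difference between the two searches concrete, here is a minimal sketch. It assumes BeautifulSoup 4 (the bs4 package, where findAll is spelled find_all) on Python 3 rather than the BeautifulSoup 3 used in this thread, and the HTML is a made-up, heavily simplified stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup, for illustration only.
SAMPLE_HTML = """
<div id="tab_content_0">
  <ul>
    <li class="video"><strong class="title"><a href="/video/1">First</a></strong></li>
    <li class="video"><strong class="title"><a href="/video/2">Second</a></strong></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# An id is unique, so a loop over this list runs its body only once.
containers = soup.find_all("div", id="tab_content_0")

# The repeated element is the <li class="video">, one per video.
videos = soup.find_all("li", "video")
titles = [li.find("strong", "title").a.string.strip() for li in videos]

print(len(containers), len(videos), titles)
```

Looping over videos instead of containers is what yields one set of fields per video.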

If you find yourself using findAll(something)[0], just use find(something). It stops searching after the first result, and is therefore faster on large documents.
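
A small sketch of that equivalence, again assuming BeautifulSoup 4 (bs4) on Python 3 and invented markup. It also shows one practical difference: when nothing matches, find returns None, while indexing an empty findAll result raises IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
SAMPLE_HTML = (
    '<ul>'
    '<li class="video"><a class="video_link" href="/video/1">one</a></li>'
    '<li class="video"><a class="video_link" href="/video/2">two</a></li>'
    '</ul>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Both spellings give the first matching tag.
href_via_list = soup.find_all("a", "video_link")[0]["href"]
href_via_find = soup.find("a", "video_link")["href"]

# With no match, find() returns None ...
missing = soup.find("a", "no_such_class")

# ... while [0] on the empty result list raises IndexError.
try:
    soup.find_all("a", "no_such_class")[0]
    raised_index_error = False
except IndexError:
    raised_index_error = True

print(href_via_list, href_via_find, missing, raised_index_error)
```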

In BeautifulSoup, classes don't need to be defined in a dict unless you just like to be explicit.

Code:
from BeautifulSoup import MinimalSoup as BeautifulSoup, SoupStrainer

# Parse only the <li class="video"> tags and skip the rest of the page
videoContainers = SoupStrainer('li', 'video')
for liTag in BeautifulSoup(html, parseOnlyThese=videoContainers):
    title = liTag.find('strong', 'title').a.string.strip()
    desc = liTag.p.string.strip()
    thumb = liTag.find('img', 'media_thumb')['src']
    video = liTag.find('a', 'video_link')['href']
I've found BeautifulSoup to be just like any other tool - you can use it to hammer nails in, or just as well to hammer your own fingers :-)

I found that starting small helps: do the outer search first and print what you found, then write the "for" loop and display the larger pieces it finds. That helps you write the code to parse the smaller pieces without having to go back to the DOM tree explorer tool every time.

Nice code, thanks for that. Personally I like the explicit dicts for parameters: they make maintenance easier and help other people learn and understand what the code was trying to do, although it's true that leaving them out makes the code tidier.
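
For comparison, the two spellings side by side. This is a sketch assuming BeautifulSoup 4 (bs4) on Python 3 and invented markup. For class lookups the bare string and the explicit dict are interchangeable; for any other attribute, such as id, a bare string is still read as a class name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<div id="tab_content_0">
  <ul>
    <li class="video">one</li>
    <li class="video">two</li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Class lookups: the bare string and the explicit dict find the same tags.
short_form = soup.find_all("li", "video")
dict_form = soup.find_all("li", {"class": "video"})

# id lookups need the dict (or the id= keyword). A bare string is
# treated as a class name, so it finds nothing here.
by_id_dict = soup.find("div", {"id": "tab_content_0"})
by_id_keyword = soup.find("div", id="tab_content_0")
by_bare_string = soup.find("div", "tab_content_0")

print(short_form == dict_form, by_id_dict is not None, by_bare_string)
```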
Yeah, my fingers hurt :)

Just one issue left. How can I limit the search results?

@rwparris2's code finds 60 results. I need only the ('li', 'video') items inside ("div", id="tab_content_0").

tab_content_0 = recent
('li', 'video') = 20
tab_content_1 = most-viewed
('li', 'video') = 20
tab_content_2 = most-liked
('li', 'video') = 20
Then set the strainer to tab_content_0, do a findAll('li', 'video') on that, and loop over the results with a for, something like:

Code:
divContent = BeautifulSoup(html, parseOnlyThese=SoupStrainer('div', 'tab_content_0'))
liTags = divContent.findAll('li', 'video')
for liTag in liTags:
    title = liTag.find('strong', 'title').a.string.strip()
    desc = liTag.p.string.strip()
    thumb = liTag.find('img', 'media_thumb')['src']
    video = liTag.find('a', 'video_link')['href']
grrrrr am I stupid?

Code:
import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

divContent = BeautifulSoup(html, parseOnlyThese=SoupStrainer('div', 'tab_content_0'))

Nothing prints :(
Try SoupStrainer('div', { 'id' : 'tab_content_0' })
Dan Dare wrote: Try SoupStrainer('div', { 'id' : 'tab_content_0' })
Right, the no-dict shortcut only applies to classes.
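
Putting the fix together, here is a minimal sketch of straining by id and then collecting the videos inside that one div. It assumes BeautifulSoup 4 (bs4) on Python 3, where parseOnlyThese is spelled parse_only and findAll is spelled find_all. The markup is a made-up stand-in with two tabs instead of the real three.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical, simplified stand-in for the tabbed page.
SAMPLE_HTML = """
<div id="tab_content_0"><ul>
  <li class="video"><a class="video_link" href="/video/1">recent 1</a></li>
  <li class="video"><a class="video_link" href="/video/2">recent 2</a></li>
</ul></div>
<div id="tab_content_1"><ul>
  <li class="video"><a class="video_link" href="/video/3">viewed 1</a></li>
</ul></div>
"""

# The id goes in a dict; a bare string would be matched against class.
only_recent = SoupStrainer("div", {"id": "tab_content_0"})
div_content = BeautifulSoup(SAMPLE_HTML, "html.parser", parse_only=only_recent)

# Only the videos inside tab_content_0 were parsed.
recent_links = [
    li.find("a", "video_link")["href"]
    for li in div_content.find_all("li", "video")
]

# Without the strainer, every tab's videos are found.
all_videos = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("li", "video")

print(recent_links, len(all_videos))
```

If the whole page has to be parsed anyway, calling find("div", id="tab_content_0") on the full soup and then find_all on the result limits the search the same way.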
Thanks anyway, both of you :) I'm confused at the moment. My brain has stopped. I need a break. Hey Dan, are you sick and tired of my questions? lol...
In my view, asking questions is the best way to learn, but you still have to put in the effort to understand the answers (not that you don't, that was a general thought) :) That's how I learned everything I know, whether by asking other people, the Internet, or sometimes even myself :) Ask, understand and try (sometimes lots of trying)...
Hi Guys,

I found your thread on Google and was wondering if you could help me.

I want to grab videos from CollegeHumor. For this I need to extract the title and the image from the CollegeHumor page.

public function title()

case 'collegehumor':
                preg_match('/item_title[^>]+>([^<]+)/ms', $this->_aData['html'], $aMatches);
                if (isset($aMatches[1]))
                    $sTitle = trim($aMatches[1]);

and for the picture:

public function image($iId)
case 'collegehumor':
                preg_match('/image_src[^=]+="([^"]+)/ms', $this->_aData['html'], $aMatches);
                if (isset($aMatches[1]))
                    $sImage = trim($aMatches[1]);

The code above does everything except get me the title and the picture :(
Later I would also like to implement other things, and I would really appreciate it if somebody could help me.

Kind regards,
