BeautifulSoup parsing help
#1
Hi,
My HTML has the structure below:
Code:
<div id="messages">                
                    <table class="list" cellspacing="0">
                    <tbody>                    
                    <tr>
                        <td class="message"><a href="#PCC-021713-V2">Shifting from Hollow to Hallowed</a></td>
                        <td class="series">Shift</td>
                        <td class="date">February 17, 2013</td>
                    </tr>                    
                    <tr>
                        <td class="message"><a href="#PCC-021013-V2">Jesus at the Center of it All</a></td>
                        <td class="series"></td>
                        <td class="date">February 10, 2013</td>
                    </tr>

What would be the right way to extract each message, series and date?
I plan to store these in some kind of Python data structure.
#2
Soup is pretty heavy for something like that. Why not just a regex?
Code:
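import re

# html is assumed to hold the page source as a string.
# re.DOTALL lets "." match the newlines between the <td> cells;
# (.*?) yields an empty string when the series cell is empty.
pattern = re.compile(
    r'<td class="message"><.+?>(.+?)</a>.+?<td class="series">'
    r'(.*?)</td>.+?<td class="date">(.+?)</td>',
    re.DOTALL)
for match in pattern.finditer(html):
    message, series, date = match.groups()
    print(message)
    print(series)
    print(date)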
#3
Thanks for your help.
I did give regex a try, but it looks like I'm no good with it.
I went ahead with BeautifulSoup because I wanted to learn it.

I ended up using this:
Code:
# soup is the parsed BeautifulSoup document; each column goes into its own list
AllMessages, AllSeries, AllDates = [], [], []
for eachMessage in soup.findAll("td", {"class":"message"}):
    AllMessages.append(eachMessage.text)
for eachSeries in soup.findAll("td", {"class":"series"}):
    AllSeries.append(eachSeries.text)
for eachDate in soup.findAll("td", {"class":"date"}):
    AllDates.append(eachDate.text)
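
One thing to watch with three separate lists is that they only line up by position, so a row with a missing cell would shift everything after it. Here is a minimal sketch of a row-by-row alternative that keeps each row's fields together in one dict. It assumes the bs4 package (BeautifulSoup 4, where findAll is also available as find_all) and the built-in "html.parser" backend. The names cell_text and records are made up for illustration, and the sample markup is the snippet from the first post with its closing tags added.

```python
from bs4 import BeautifulSoup

# Sample markup from the first post, with the closing tags added (assumption).
html = """
<div id="messages">
  <table class="list" cellspacing="0">
    <tbody>
      <tr>
        <td class="message"><a href="#PCC-021713-V2">Shifting from Hollow to Hallowed</a></td>
        <td class="series">Shift</td>
        <td class="date">February 17, 2013</td>
      </tr>
      <tr>
        <td class="message"><a href="#PCC-021013-V2">Jesus at the Center of it All</a></td>
        <td class="series"></td>
        <td class="date">February 10, 2013</td>
      </tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_text(row, css_class):
    """Return the stripped text of the <td> with this class, or "" if it is absent."""
    cell = row.find("td", {"class": css_class})
    return cell.get_text().strip() if cell is not None else ""


# One dict per table row, so message, series and date always stay together.
records = []
for row in soup.find("div", {"id": "messages"}).find_all("tr"):
    records.append({
        "message": cell_text(row, "message"),
        "series": cell_text(row, "series"),
        "date": cell_text(row, "date"),
    })

for record in records:
    print(record)
```

An empty series cell simply comes back as an empty string, so the second row still produces a complete record.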