2009-04-29, 17:22
Hello,
I'd like to develop a script that scrapes and then displays French TV listings from http://television.telerama.fr/tele/grille.php.
So far I have completed these steps:
- grab the page
- parse it
- store the needed data
Here is my partial code (I only started 2 days ago ...)
Part of the grabbed page:
Quote:
<div id="leprogramme"> <!-- ####################################################################chaine -->
<div id="item_192" class="sortable">
<div class="chaine" alt="TF 1" title="TF 1">
<div class="logo logo_ch_192" title="TF 1" alt="TF 1">
<a href="javascript:void(0);" onclick="return retirerChaine('192');" class="pointer" ><img src="http://icon.telerama.fr/iconsv2/grille_croix.gif" alt="" border="0" /></a>
</div><!-- logo -->
<div class="sep-15">.</div>
<div class="programme" id="ch_programme_192" alt="" title="">
<div class="emission genre-vide jeunesse pointer" style="width:563px;left:0px;z-index: 1;" id="emission_12711016" onclick="return afficherEmission('12711016', '192');" alt="TFou - Mercredi 22 avril de 06h30 à 11h05" title="TFou - Mercredi 22 avril de 06h30 à 11h05">
<div class="conteneur">
<span class="genre">Jeunesse</span><br />
<span class="titre">
<a href="/tele/emission.php?id=12711016" onclick="return false;" class="annulahref">TFou</a>
</span><br />
<span style="resume">Au sommaire : - Tweenies - Charlie et Lola - Ni hao, ...</span>
<div id="data_12711016" style="display:none;">{"Id_Diffusion":"12711016","Id_Emission":"12550526","Id_Chaine":"192","Date_Debut":"2009-04-22 06:30:00","Date_Fin":"2009-04-22 11:05:00","Titre":"TFou","Sous_Titre":"","ShowViewFr":"25041803","note_T":"0","Id_Rubrique":"8403","Rubrique_Libelle":"Magazine jeunesse","Rubrique_Niveau":"3","Type":"Jeunesse","Chaine_Nom":"TF 1","Logo":"192.gif","Id_Hierarchie":"030302","DureeEnSecondes":"10500","resume_court":"Au sommaire : - Tweenies - Charlie et Lola - Ni hao, ...","resume_long":"Au sommaire : - Tweenies - Charlie et Lola - Ni hao, Kai-Lan - Chuggington - La Maison de Mickey - Le Petit Dinosaure - Casper, l'école de la peur - Spiez, nouvelle génération - Totally Spies - Bob l'éponge - Monster Buster Club - Power Rangers - Les Fées ...","dateheurechaine":"Mercredi 22 avril de 06h30 à 11h05 sur TF 1","intervenant":"","Url_Fiche":""}</div>
</div>
</div><!-- emission-->
<div class="sep" style="left:563px;"> </div>
<div class="emission genre-vide serie pointer" style="width:223px;left:565px;z-index: 2;" id="emission_12711017" onclick="return afficherEmission('12711017', '192');" alt="7 à la maison - Mercredi 22 avril de 11h05 à 11h55" title="7 à la maison - Mercredi 22 avril de 11h05 à 11h55">
<div class="conteneur">
<span class="genre">Série</span><br />
<span class="titre">
<a href="/tele/emission.php?id=12711017" onclick="return false;" class="annulahref">7 à la maison</a>
</span><br />
<span style="resume">Les jumeaux, Sam et David, acceptent de dévoiler des ...</span>
<div id="data_12711017" style="display:none;">{"Id_Diffusion":"12711017","Id_Emission":"6585632","Id_Chaine":"192","Date_Debut":"2009-04-22 11:05:00","Date_Fin":"2009-04-22 11:55:00","Titre":"7 à la maison","Sous_Titre":"Petits secrets de famille","ShowViewFr":"1241984","note_T":"0","Id_Rubrique":"8534","Rubrique_Libelle":"Série sentimentale","Rubrique_Niveau":"3","Type":"Série","Chaine_Nom":"TF 1","Logo":"192.gif","Id_Hierarchie":"060902","DureeEnSecondes":"3000","resume_court":"Les jumeaux, Sam et David, acceptent de dévoiler des ...","resume_long":"Les jumeaux, Sam et David, acceptent de dévoiler des secrets en contrepartie de lait et de gâteaux. Rapidement, la situation s'envenime...","dateheurechaine":"Mercredi 22 avril de 11h05 à 11h55 sur TF 1","intervenant":"<strong>Réalisateur : </strong> Harry Harris<br><strong>Acteur : </strong> Stephen Collins (Eric Camden), Catherine Hicks (Annie Camden) ...<br><br>","Url_Fiche":""}</div>
</div>
</div><!-- emission-->
Code:
#!/usr/bin/env python
# -*- coding: cp1252 -*-
#############################################################################
import httplib
import urllib
import sys
import re
import csv
from BeautifulSoup import BeautifulSoup
from BeautifulSoup import NavigableString
#############################################################################
# Assumption: the connection setup was missing from the post; httplib.HTTP
# (Python 2) is the class that provides getreply() and getfile().
hote = 'television.telerama.fr'
chemin_a_parser = '/tele/grille.php'
conn = httplib.HTTP(hote)
conn.putrequest('GET', chemin_a_parser)
conn.putheader('Accept', 'text/html')
conn.putheader('Accept', 'text/plain')
conn.endheaders()
## Read the response
errcode, errmsg, headers = conn.getreply()
## ToDo : Add a check on errors
myPage = conn.getfile()
myPageBuffer = myPage.read()
mySoup = BeautifulSoup(myPageBuffer)
la_liste = []  # holds the cleaned data strings
for resultats in mySoup.findAll('div'):
    machaine = resultats.string
    taillechaine = len(str(machaine))
    if taillechaine > 30:  # the data strings I want are longer than 30 characters
        trim_left = str(machaine)[1:]
        trim_result = trim_left[:-1]  # strip the leading and trailing character
        la_liste.append(trim_result)  # store the expected data in a list
# test
split_datas = str(la_liste[-1]).split(',')
print "split_datas: \n" + str(split_datas[18])
I try to parse each data string, but my problem is that the data is badly formatted (CSV-like): each field is enclosed in double quotes and the fields are separated by a ',', but the delimiter can also appear inside a field.
e.g. [HTML]"blah,blahblah","none here","here yes, grrr","here:not","last"[/HTML]
Is there an easy way to solve this?
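For what it's worth, here is a minimal sketch of two possible ways around the embedded commas. It is written for Python 3, whereas the script above is Python 2 (the `csv` module exists there too, and `json` from Python 2.6 on). The first part runs the example line through `csv.reader`, which respects the double quotes. The second part assumes that the hidden `data_...` strings in the quoted page are JSON objects (they look like it, but that is a guess from the sample) and parses a shortened copy of one with `json.loads`. The names `ligne`, `champs`, `donnees_brutes` and `emission` are made up for the illustration.

```python
import csv
import json

# The example line from the post: commas appear inside quoted fields.
ligne = '"blah,blahblah","none here","here yes, grrr","here:not","last"'

# csv.reader honours the double quotes, so embedded commas stay in the field.
champs = next(csv.reader([ligne]))
print(champs)

# Shortened copy of one hidden <div id="data_..."> payload from the quoted page.
# Assumption: the real payloads are valid JSON, as this sample suggests.
donnees_brutes = (
    '{"Id_Diffusion":"12711016","Titre":"TFou",'
    '"Date_Debut":"2009-04-22 06:30:00",'
    '"resume_court":"Au sommaire : - Tweenies - Charlie et Lola - Ni hao, ..."}'
)

# Keep the surrounding braces: json.loads needs the complete object.
emission = json.loads(donnees_brutes)
print(emission["Titre"], emission["Date_Debut"])
```

If the JSON assumption holds, the leading and trailing characters should not be trimmed off, and each field can then be read by name instead of by position.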
Next steps:
- regenerate the HTML in a timeline style
- add the ability to choose channels (with cookies)
- add the option to view programs for "this evening", "tomorrow", "next week", "Thursday" ...
- maybe add RSS scrolling
- ...
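As a rough idea for the "this evening" / "tomorrow" views in the list above, the sketch below (Python 3) filters program records by a time window using the `Date_Debut` and `Date_Fin` fields seen in the quoted page. It assumes each program has already been turned into a dict with those keys. The function names and the 20:00 to midnight definition of "evening" are hypothetical choices for the illustration, not something the site defines.

```python
from datetime import date, datetime, time, timedelta

FORMAT_DATE = "%Y-%m-%d %H:%M:%S"  # format of Date_Debut / Date_Fin in the page

def emissions_dans_fenetre(emissions, debut, fin):
    """Return the programs that overlap the [debut, fin) window."""
    resultat = []
    for e in emissions:
        e_debut = datetime.strptime(e["Date_Debut"], FORMAT_DATE)
        e_fin = datetime.strptime(e["Date_Fin"], FORMAT_DATE)
        if e_debut < fin and e_fin > debut:
            resultat.append(e)
    return resultat

def fenetre_soiree(jour):
    """Hypothetical definition of 'this evening': 20:00 to midnight."""
    debut = datetime.combine(jour, time(20, 0))
    fin = datetime.combine(jour + timedelta(days=1), time(0, 0))
    return debut, fin

# Two records reduced to the fields needed here, taken from the quoted page.
emissions = [
    {"Titre": "TFou",
     "Date_Debut": "2009-04-22 06:30:00", "Date_Fin": "2009-04-22 11:05:00"},
    {"Titre": "7 à la maison",
     "Date_Debut": "2009-04-22 11:05:00", "Date_Fin": "2009-04-22 11:55:00"},
]

# Programs on air between 11:30 and 12:00 on 22 April 2009.
debut, fin = datetime(2009, 4, 22, 11, 30), datetime(2009, 4, 22, 12, 0)
titres = [e["Titre"] for e in emissions_dans_fenetre(emissions, debut, fin)]
print(titres)

# Neither sample program runs in the evening window of that day.
soiree = emissions_dans_fenetre(emissions, *fenetre_soiree(date(2009, 4, 22)))
print(soiree)
```

"Tomorrow" or "Thursday" would then only need a different pair of window bounds passed to the same filter.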
Thanks a lot for your help!