How to pull info from a website *urgent*
#1
How do I pull info from a website? Like, how do the people who do the iFilm and launch scripts do it? I tried to look at their script but it didn't work. I just want to write a news script that shows the latest news headlines, but how do I pull the information from the website?

Please reply quick as I'm going out in about 2 hours :kickass:
Reply
#2
If you look in the iFilm script there is a function called gethtmldata. You pass in a URL and get the HTML from the page back. You then have to parse the page to get the data you want out of it.
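
If you don't have the iFilm script to hand, here is a rough sketch of what a function like that might look like. It is written for Python 3's urllib.request, which is an assumption (the scripts on here use the older urllib2), and the name fetch_html and the User-Agent value are made up for illustration.

```python
import urllib.error
import urllib.request


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Return the body of `url` as text, or "" if the request fails."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            raw = response.read()
    except urllib.error.URLError as error:  # HTTPError is a subclass of URLError
        print("error retrieving data: %s" % error)
        return ""
    # Pages are not always UTF-8; undecodable bytes are replaced, not fatal.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # A data: URL keeps this demo offline; swap in a real http:// address.
    print(fetch_html("data:text/html,<title>hello</title>"))
```

What comes back is just one long string of HTML, so the parsing step is still up to you.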
Reply
#3
Ah yes, I see it, but how do I tell it which data I need?
Reply
#4
Well, look at the webpage. You want to get the headlines, don't you? Look for the headline text in the HTML, and then you will have to come up with some way of getting that text out of the page. Regular expressions are the best way. Obviously the text is always going to change, but the HTML tags around it are not.
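
To give a rough idea of the "tags stay the same, text changes" trick, here is a small sketch. The HTML fragment is invented for the example (the real page will have different markup), and find_headlines is just a name picked for illustration.

```python
import re

# Hypothetical page fragment; a real page will have different markup.
SAMPLE_HTML = """
<div class="story"><a href="/1.stm"><b>First headline</b></a></div>
<div class="story"><a href="/2.stm"><b>Second headline</b></a></div>
"""

# The tags are fixed and the text between them changes, so we capture
# whatever sits between the tags. The non-greedy (.+?) stops at the
# first closing tag instead of running on to the last one on the page.
HEADLINE_RE = re.compile(r'<a href="(.+?)"><b>(.+?)</b></a>', re.DOTALL)


def find_headlines(html):
    """Return a list of (link, text) tuples found in `html`."""
    return HEADLINE_RE.findall(html)


if __name__ == "__main__":
    for link, text in find_headlines(SAMPLE_HTML):
        print(text, "->", link)
```

If the site changes its layout the pattern will silently stop matching, so it is worth checking that the list is not empty.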
Reply
#5
IMO, using SGMLParser is both the best and easiest way. I put this link on the forum before. It gives examples of how you can parse HTML, and it is also useful for other Python stuff.
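
A rough sketch of the parser approach follows. One caveat: sgmllib (where SGMLParser lives) only ships with Python 2, so this uses html.parser.HTMLParser from Python 3 instead, which works in the same spirit: you subclass it and it calls your methods for each tag it meets. The class name LinkTextParser and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) for every <a href=...> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> we are currently inside, if any
        self._text = []    # text pieces seen since that <a> opened

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = LinkTextParser()
parser.feed('<p><a href="/1.stm"><b>First headline</b></a> and '
            '<a href="/2.stm">Second headline</a></p>')
parser.close()
print(parser.links)
```

Compared with a regular expression, this does not care about extra tags or whitespace inside the link.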

greetz,
piscator
Reply
#6
I'm really sorry, I only started learning Python yesterday. Could you give me an example :idea: and I will bow down to you :kickass:
Reply
#7
For which way? What site are you trying to do? If you show me what you want then I can show you how.
Reply
#8
http://news.bbc.co.uk/1/hi/england/default.stm That's the site I am trying to get the news from.
Reply
#9
For that site there is also an RSS version:
http://news.bbc.co.uk/rss/newsonline_uk_...rss091.xml

What you can do is get an RSS reader from the scripts section and give it that URL, and you should see the news without having to write the script yourself.
Reply
#10
No, I wanted to make it so that when you click on an item in the list box it will display the full story :thumbsup: But I think I'd be right in saying that I could use that as a source?
Reply
#11
Well, I have not used the RSS readers, but that's exactly what they do. If you are going to write something for that site, then still use the RSS XML. It gives you the headline and a link to the main story, and it's in a known format.
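
To show what "a known format" buys you, here is a rough sketch that reads the items out of a feed and looks up the link for whichever headline was picked. The feed text is a made-up sample in the RSS 0.91 shape rather than the real BBC feed, parse_items is an invented name, and it uses Python 3's xml.etree.ElementTree, which is an assumption about what you have available.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the RSS 0.91 shape; the real feed has more fields.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <description>Summary of the first story.</description>
      <link>http://example.com/1.stm</link>
    </item>
    <item>
      <title>Second headline</title>
      <description>Summary of the second story.</description>
      <link>http://example.com/2.stm</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item> with its title, description and link."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


if __name__ == "__main__":
    items = parse_items(SAMPLE_RSS)
    selected = 1  # pretend the user clicked the second row of the list
    print(items[selected]["title"], "->", items[selected]["link"])
```

Because the list of items is in the same order as the rows in the list box, the selected row number is all you need to find the link to the full story.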
Reply
#12
So how would I pull the headlines from that site and display them in a list?

Sorry to keep bugging you
:help:
Reply
#13
OK, use the following code and see if you can understand it. By rights I should use an XML parser, but I have not used one, so I am using regular expressions instead.
You're going to have to do the rest yourself!
:lol:


import urllib, urllib2, re
import xml.dom.minidom
import xbmc, xbmcgui
from string import split, replace, find

# xbmcgui.Emulating only exists when running under the PC emulator.
try: emulating = xbmcgui.Emulating
except: emulating = False

action_move_left = 1
action_move_right = 2
action_move_up = 3
action_move_down = 4
action_page_up = 5
action_page_down = 6
action_select_item = 7
action_highlight_item = 8
action_parent_dir = 9
action_previous_menu = 10
action_show_info = 11
action_pause = 12
action_stop = 13
action_next_item = 14
action_prev_item = 15

# Headers sent with every request.
txheaders = {
    'user-agent': 'mozilla 1.0',
    'accept-language': 'en-us',
}

txdata = None

# One RSS item: headline text, summary and link to the full story.
class cheadline:
    def __init__(self, title, description, link):
        self.title = title
        self.description = description
        self.link = link


class bbc(xbmcgui.Window):
    def __init__(self):
        if emulating: xbmcgui.Window.__init__(self)

        # Scale coordinates from a 720x480 layout to the real screen size.
        self.scalex = ( float(self.getWidth()) / float(720) )
        self.scaley = ( float(self.getHeight()) / float(480) )

        self.background = xbmcgui.ControlImage(0,0, int(720*self.scalex),int(480*self.scaley), 'q:\\scripts\\background.png')
        self.headlinelist = xbmcgui.ControlList(int(50*self.scalex), int(50*self.scaley), int(450*self.scalex), int(350*self.scaley))

        self.addControl(self.headlinelist)

        self.headlinelist.setVisible(1)
        self.setFocus(self.headlinelist)

        self.headlines = self.getheadlines('http://news.bbc.co.uk/rss/newsonline_uk_...ss091.xml')

        # One list row per headline, in the same order as self.headlines.
        for c in self.headlines:
            self.headlinelist.addItem(c.title)

    # Called when a control is activated; the row number indexes self.headlines.
    def oncontrol(self, control):
        if control == self.headlinelist:
            desc = self.headlines[self.headlinelist.getSelectedPosition()].description
            link = self.headlines[self.headlinelist.getSelectedPosition()].link
            print (desc)
            print (link)

    def getheadlines(self,url):
        rss_data = self.gethtmldata(url)

        #<item>
        #<title>carlisle reels from flood chaos</title>
        #<description>flooding in carlisle leaves schools and roads closed and the city's court and hospital running skeleton services.</description>
        #<link>http://news.bbc.co.uk/go/click/rss/0.91/...stm</link>
        #</item>
        out = []
        headlinesre = re.compile('<item>.+?<title>(.+?)</title>.+?<description>(.+?)</description>.+?<link>(.+?)</link>.+?</item>',re.MULTILINE + re.DOTALL)
        headlines = headlinesre.findall(rss_data)
        for headline in headlines:
            print(headline[0])
            print(headline[1])
            print(headline[2])
            out.append(cheadline(headline[0],headline[1],headline[2]))

        return out

    def gethtmldata(self,url):
        req = urllib2.Request(url, txdata, txheaders)

        try:
            response = urllib2.urlopen(req)
        except urllib2.HTTPError, error:
            # str() is needed: an exception cannot be joined to a string directly.
            print ("error retrieving data : " + str(error))
            if not emulating:
                error_dialog = xbmcgui.Dialog()
                error_dialog.ok("net error", str(error))
            # Return an empty string so the regex in getheadlines finds no items.
            return ''

        html = response.read()
        response.close()
        return html

    # Replace HTML entities with plain characters.
    # The entity names here are a best guess: they were decoded when this
    # post was displayed, so the originals are not visible.
    def correctchars(self,txt):
        data = txt
        data = replace(data,'&#39;','\'')
        data = replace(data,'&apos;','\'')
        data = replace(data,'&quot;','"')
        data = replace(data,'&amp;','&')
        data = replace(data,'\n','')
        return data


w = bbc()
w.doModal()
del w
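
For comparison, here is a rough sketch of the XML parser route I mentioned, in case the regular expression gets awkward. It is plain Python 3 with xml.dom.minidom and a made-up sample feed, so treat it as an assumption about how it could be done rather than something tested on the Xbox. Headline, get_headlines and _text are names invented for the example.

```python
from xml.dom import minidom

# Made-up sample in the RSS 0.91 shape, not the real BBC feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="0.91"><channel>
<title>Example feed</title>
<item>
<title>First headline</title>
<description>Summary of the first story.</description>
<link>http://example.com/1.stm</link>
</item>
<item>
<title>Second headline</title>
<description>Summary of the second story.</description>
<link>http://example.com/2.stm</link>
</item>
</channel></rss>"""


class Headline:
    """Holds the three pieces of one RSS item as attributes."""

    def __init__(self, title, description, link):
        self.title = title
        self.description = description
        self.link = link


def _text(item, tag):
    """Return the text inside the first <tag> under `item`, or ""."""
    nodes = item.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data.strip()


def get_headlines(rss_text):
    """Build one Headline object per <item> element in the feed."""
    dom = minidom.parseString(rss_text)
    return [Headline(_text(item, "title"),
                     _text(item, "description"),
                     _text(item, "link"))
            for item in dom.getElementsByTagName("item")]


if __name__ == "__main__":
    # Each c is one Headline object, so c.title is the value stored on it
    # by Headline.__init__ when the list was built.
    for c in get_headlines(SAMPLE_RSS):
        print(c.title)
```

The loop at the bottom has the same shape as the one that fills the list in the script: the list holds objects, the loop variable is one object at a time, and .title reads the attribute set in __init__.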
Reply
#14
Sorry, all the formatting got messed up, let me try again.
Reply
#15
for c in self.headlines:
    self.headlinelist.addItem(c.title)


Where did the item c.title come from? That's the only part I don't understand now. But thanks to flash for all his hard work, because I can post this image now:

:fixed: :fixed: :fixed:
Reply
