Need help in code
#1
Hi,

I need help with some Python code.

Below is the actual URL:

http://www.filmicity.net/forumdisplay.php?f=33

Whenever I try to scrape it, the output comes out as:

Code:
http://www.filmicity.net/forumdisplay.php?"s=****************amp"f=33

I am not sure where the extra "s=..." part comes from.

So I thought of running the URL through an htmlencode function like the one below:

Code:
def htmlencode(text):
    """Use HTML entities to encode special characters in the given text."""
    text = text.replace('?', '?')
    return text


But I later found out that the value after "?s=" keeps changing, for example:

Code:
http://www.filmicity.net/forumdisplay.php?"s=********amp"f=33

So I just wanted to know how to get rid of this randomly changing value in htmlencode. Is there any wildcard concept in htmlencode, or any other way I can do it?

Please let me know.

Thanks
Reply
#2
Actually, when I paste the result containing values like s=*****, it does not show up in the message above.

It actually comes right after the question mark:

http://www.filmicity.net/forumdisplay.php? "s=*************amp;

Hope you can make it out, as I don't see it after typing it in my previous message.
Reply
#3
Not sure what you are after. Please post the code you are using for that parsing.
Reply
#4
Here is the code:

Code:
import re

# link: HTML of the page fetched earlier; addDir: defined elsewhere in the plugin
match = re.compile('<a href="(.+?)" class="forum-link"><strong>(.+?)</strong></a>').findall(link)
#print match
#print url
for url, name in match:
    #url='http://www.filmicity.net/'+url
    url = htmlencode('http://www.filmicity.net/' + url)
    print(url)
    print(name)
    addDir(name, url, 2, '')

Instead of getting the actual URL, like below:

hxxp://www.filmicity.net/forumdisplay.php?f=33

it gives me:
hxxp://www.filmicity.net/forumdisplay.php?s=1323234&amp;f=33

Please let me know.
Thanks
Reply
#5
Here is the full code

http://pastebin.com/f499102b8

Thanks
Reply
#6
I'm no Python coder, but since the forum you are trying to scrape is a vBulletin board, the 's=xxxxxxx' you are getting is more than likely a session variable generated by the board (usually because the user does not have cookies enabled).

More specifically, it is a random MD5 hash that uniquely identifies your particular session.

Hope that helps.
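
If missing cookies really are the cause, one thing to try is fetching the page through a cookie-aware opener, so the board can keep the session in a cookie instead of in the URL. A minimal sketch, assuming Python 3's standard library (on Python 2 the equivalent modules are urllib2 and cookielib). make_cookie_opener is a made-up helper name, and whether this particular board then stops adding the s= value is an assumption that has not been tested here:

```python
import urllib.request
from http.cookiejar import CookieJar


def make_cookie_opener():
    """Build a urllib opener that stores and resends cookies (hypothetical helper)."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()
# Assumed usage (not run here because it needs network access):
# page = opener.open('http://www.filmicity.net/forumdisplay.php?f=33').read()
```

Every request made through the same opener reuses the same cookie jar, so the session cookie set by the first response is sent back on later requests.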
Reply
#7
Perhaps the ?s=XXXXX is a session variable...

Replace your line:
Code:
url=htmlencode('http://www.filmicity.net/'+url)
with the following:
Code:
url = url.replace('&amp;', '&')
url = url[:url.index('?s')] + url[url.index('&f'):]
url = url.replace('&f=', '?f=')
That should give you correct URLs.
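
A slightly more defensive variant of the same idea, in case some links come back without the s= part (url.index raises ValueError when the substring is missing): decode the HTML entities, then drop the s parameter by name. This is only a sketch, assuming Python 3's html and urllib.parse modules; strip_session_id is a made-up name, not something from the plugin:

```python
from html import unescape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param='s'):
    """Decode HTML entities in url and remove the query field named param."""
    url = unescape(url)  # turns '&amp;' back into '&'
    parts = urlsplit(url)
    # Keep every query field except the session one, preserving order.
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_session_id('http://www.filmicity.net/forumdisplay.php?s=1323234&amp;f=33'))
```

URLs that carry no s= field pass through unchanged, so the same call can be used on every link the regex finds.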
Reply
#8
Thank you very much.

That worked.
Reply
#9
Glad to be of service ;)
Reply
#10
One more question: is it possible to log in to this site, i.e. http://www.filmicity.net/, via XBMC and then watch movies? They require users to be logged in to watch movies (signing up is free).

Please let me know.

Thanks
Reply