[Release] Parsedom and other functions

TobiasTheCommie Offline
Skilled Python Coder
Posts: 618
Joined: Apr 2008
Reputation: 6
Post: #76
If you do
result = parseDOM(data, "p")

then the second p should be in result[1].
newatv2user Offline
Senior Member
Posts: 295
Joined: May 2011
Reputation: 27
Post: #77
That is exactly what I think I am doing.

Quote:print item
Plot = common.parseDOM(item, "p")
print 'ParseDOM returned: ' + str(len(Plot))

But I am not getting the desired result:
Quote:08:36:34 T:828 NOTICE: <div class="post-left"><a href="http://documentarystorm.com/last-chance-to-see/" title="Last Chance to See"><img src="http://documentarystorm.com/files/2012/01/last-chance-to-see1.jpg" alt="Last Chance to See (documentary)" height="150" width="150" /></a></div><div class="post-right"><h3><a href="http://documentarystorm.com/last-chance-to-see/" rel="bookmark" title="Stream this documentary: Last Chance to See">Last Chance to See</a></h3><p class="post-meta">Jan 29th, 2012 // <a href="http://documentarystorm.com/category/nature-biology/animals-nature-biology/" title="View all posts in Animals" rel="category tag">Animals</a>, <a href="http://documentarystorm.com/category/nature-biology/" title="View all posts in Nature" rel="category tag">Nature</a> // <a href="http://documentarystorm.com/last-chance-to-see/#comments" title="Comment on Last Chance to See">2 Comments »</a></p><p>Stephen Fry and zoologist Mark Carwardine head to the ends of the earth in search of animals on the edge of extinction.</p><div class="gdsrcacheloader gdsrclsmall" id="gdsrc_asr.7827.0.1.1327816953.48.1.20.6.4.0"><strong>GD Star Rating</strong><br /><em>a WordPress rating system</em></div></div><div class="clearfix"></div>
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] parseDOM : 'start: 'p' - {} - False - <type 'str'>'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] parseDOM : 'no list found, making one on just the element name'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] parseDOM : 'Getting element content for 1 matches '
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] _getDOMContent : 'match: <p class="post-meta">'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] _getDOMContent : 'start: 441, len: 21, end: 887'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] _getDOMContent : 'done html length: 425'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] parseDOM : 'Done'
08:36:34 T:828 NOTICE: ParseDOM returned: 1

I think my problem in post #74 is similar. If <li> elements with and without attributes are mixed, it causes problems.

Or maybe I have a corrupted copy of parsedom. How do I check or reinstall?

Thanks.
TobiasTheCommie Offline
Skilled Python Coder
Posts: 618
Joined: Apr 2008
Reputation: 6
Post: #78
newatv2user Wrote:That is exactly what I think I am doing.

But I am not getting the desired result:

I think my problem in post #74 is similar. If <li> elements with and without attributes are mixed, it causes problems.

Or maybe I have a corrupted copy of parsedom. How do I check or reinstall?

Thanks.

Confirmed and fixed in trunk.

Workaround is to do .replace("<p>", "<p />") on the input before passing it to parseDOM.
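
As a rough illustration of the workaround above, the replacement can be applied to the page source before it is handed to the parser. The helper name `apply_p_workaround` is hypothetical and not part of parsedom, and the thread does not explain why the rewrite helps, so this only shows the string transformation itself:

```python
def apply_p_workaround(html):
    """Rewrite attribute-less <p> tags as suggested in the post above.

    Hypothetical helper, not part of parsedom. Tags that carry attributes,
    such as <p class="post-meta">, are left unchanged because they do not
    contain the exact substring "<p>".
    """
    return html.replace("<p>", "<p />")


sample = '<p class="post-meta">Jan 29th, 2012</p><p>Stephen Fry</p>'
patched = apply_p_workaround(sample)
# The patched string would then be passed on, e.g. common.parseDOM(patched, "p")
print(patched)
```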
newatv2user Offline
Senior Member
Posts: 295
Joined: May 2011
Reputation: 27
Post: #79
More problems.

Portion of HTML I'm using: http://pastebin.com/C1imeTMG

My code:
Quote:suckerfishDOM = common.parseDOM(contents, "ul", attrs = { "id": "suckerfishnav"})[0]
catDOM = common.parseDOM(suckerfishDOM, "li", attrs = { "class": "cat-item cat-item-[0-9]{1,}"})
print 'Debug Info - catDOM length: ' + str(len(catDOM))
for dCat in catDOM:
    print 'looping through catDOM'
    print 'Debug Info: ' + dCat
    if dCat is None or dCat == '':
        continue

Resulting portion of log: http://pastebin.com/zAbk89n9

In summary, only the first match in catDOM is non empty. All the rest are empty. Am I doing it correctly?

Thanks.
TobiasTheCommie Offline
Skilled Python Coder
Posts: 618
Joined: Apr 2008
Reputation: 6
Post: #80
newatv2user Wrote:More problems.

Portion of HTML I'm using: http://pastebin.com/C1imeTMG

My code:


Resulting portion of log: http://pastebin.com/zAbk89n9

In summary, only the first match in catDOM is non empty. All the rest are empty. Am I doing it correctly?

Thanks.

Hm, annoying.

I will integration-test this and get back to you.

ETA:
This works for me (albeit in trunk code, but it should work for you as well, I hope).
Code:
ret = common.parseDOM(self.readTestInput("documentarystorm2.html", False), "ul", attrs = { "id": "suckerfishnav"})
print repr(ret)

ret2 = common.parseDOM(ret, "ul", attrs = { "class": "children"})
print "2: " + repr(ret2[0])

for ret in ret2:
    ret3 = common.parseDOM(ret, "li", attrs = { "class": "cat-item cat-item-[0-9]{1,}"})
    print "3: " + repr(ret3)
(This post was last modified: 2012-02-05 13:07 by TobiasTheCommie.)
newatv2user Offline
Senior Member
Posts: 295
Joined: May 2011
Reputation: 27
Post: #81
Thanks.
_Pierre_ Offline
Junior Member
Posts: 33
Joined: Nov 2009
Reputation: 0
Post: #82
I saw that you updated parseDOM to version 0.9.2, so I thought I'd try it out.
I still have problems with fetchPage when posting data:

Code:
T:3232  NOTICE: [SoundCloud] fetchPage : 'called for : 'https://soundcloud.com/connect/login''
20:44:39 T:3232  NOTICE: [SoundCloud] fetchPage : 'Posting data: username=*******&redirect_uri=plugin%3A%2F%2Fplugin.audio.soundcloud%2Foauth_callback&response_type=token&client_id=hijuflqxoOqzLdtr6W4NA&scope=non-expiring&password=******&display=popup'
20:44:39 T:3232  NOTICE: [SoundCloud] fetchPage : 'Added refering url: http://soundcloud.com'
20:44:39 T:3232  NOTICE: [SoundCloud] fetchPage : 'connecting to server...'
20:44:39 T:3232  NOTICE: [SoundCloud] fetchPage : 'URLError : <urlopen error unknown url type: plugin>'

It's driving me crazy, grr.
takoi Offline
Team-Kodi Member
Posts: 840
Joined: Oct 2009
Reputation: 12
Location: Norway
Post: #83
Would be nice if you could remove the xbmc and xbmcgui dependencies from functions that don't use them. Annoyed by having to write xbmc = None and xbmcgui = None in the interpreter, every time I try to use this outside xbmc.
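
One common way to make such a dependency optional is to guard the import and fall back when it is missing. This is only a sketch of the general pattern, not how parsedom is actually structured; the `log` helper and the "[parsedom]" prefix are made up for illustration:

```python
try:
    import xbmc  # only importable when running inside XBMC
except ImportError:
    xbmc = None  # running in a plain Python interpreter


def log(message):
    """Log through XBMC when it is available, otherwise print to stdout.

    Hypothetical helper; the prefix is illustrative only.
    """
    line = "[parsedom] %s" % message
    if xbmc is not None:
        xbmc.log(line)
    else:
        print(line)
    return line


log("running outside XBMC works too")
```

With a guard like this, functions that never touch xbmc or xbmcgui could be imported and used from a normal interpreter without defining the names by hand.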
takoi Offline
Team-Kodi Member
Posts: 840
Joined: Oct 2009
Reputation: 12
Location: Norway
Post: #84
Found a bug in the ret value parsing. Think it has to do with tab characters:

Code:
html = '<div id="player" class="loading tv " \r \tdata-media="http://nordond25a-f.akamaihd.net/z/no/open/db/db70c9ca4be6c56b4813f550d822b27e77116bd9/db70c9ca4be6c56b4813f550d822b27e77116bd9_,141,316,563,1266,2250,.mp4.csmil/manifest.f4m" \r \tdata-timezoneoffset="2" \r \tdata-startingbitrateindex="3"\r \tdata-streamingerrormessageurl="/streamingerror"\r \tdata-outoflivebuffermessageurl="/outoflivebuffer"\r \t\t\t\t data-subtitlesurl = "/programsubtitles/koid21008710"\r \t\t\t data-IsRatedR = "False"\r >\r\n\t<!--googleoff: all-->\r\n\t\t\t<div id="nrkFlashContainer">\r\n\t\t\t\t<div class="msg-board">\r\n\t\t\t\t\t\r\n\t<img width="960" \r \t\t class=""\r \t\t alt="" \r \t\t src="http://gfx.nrk.no/iiUIuSEgJNUZ5ESHnHRXHgpqjzVQx3q0AqWf4v5n3sEQ" />\r\n\t<div class="msg no-js-msg">\r\n\t\t<h2><strong class="heading">Ooops, Javascript mangler!</strong></h2>\r\n\t\t<p>\r\n\t\t\tVi kan ikke se at du har aktivert Javascript p\xc3\xa5 din PC, dette m\xc3\xa5 v\xc3\xa6re aktivert for at v\xc3\xa5r videoavspiller skal fungere.<br />\r\n\t\t\t<a href="http://www.nrk.no/some/support/page" target="_blank">Les mer<span class="offscreen"> om hvorfor vi krever javascript</span></a> p\xc3\xa5 v\xc3\xa5re hjelpesider.\r\n\t\t</p>\r\n\t</div>\r\n\r\n\t\t\t\t\t<div class="msg no-flash-msg">\r\n\t\t\t\t\t\t    <h2><strong class="heading">Ooops, vi har problemer med \xc3\xa5 laste Flash for avspilling!</strong></h2>\r\n\t\t\t\t\t\t    <p>\r\n\t\t\t\t\t\t\t\t<a href="http://get.adobe.com/flashplayer" target="_blank">Klikk her for \xc3\xa5 installere Flash p\xc3\xa5 din maskin.</a><br /><br />\r\n\t\t\t\t\t\t\t\tVirker det fortsatt ikke?<br/>\r\n\t\t\t\t\t\t\t    <a href="/hjelp/1.7916314">Les mer<span class="offscreen"> om flash og hvorfor vi krever det</span></a> p\xc3\xa5 v\xc3\xa5re hjelpesider.\r\n\t\t\t\t\t\t    </p>\r\n\t\t\t\t\t</div>\r\n\t\t\t\t</div>\r\n\t\t\t</div>\r\n\t<!--googleon: all-->\r\n</div>\r\n\r\n\r\n\r\n\r\n\t<section id="programMetaData" class="container 
tight">\r\n\t\t<aside id="episode2" class="span-5 clearfix">\t\t\r\n\r\n\t<img width="300" \r \t\t class="episode-image"\r \t\t alt="Verda vi skaper" \r \t\t src="http://gfx.nrk.no/iiUIuSEgJNUZ5ESHnHRXHgeDOYjDhHYN0qWf4v5n3sEQ" />\r\n\t\t\t<!--googleoff: snippet-->\r\n\t\t\t<ul class="infolist clearfix">\r\n\t\t\t\r\n\t\t\t\t\t<li><mark class="age-restriction"><span>A</span></mark> Tillatt for alle</li>\r\n\r\n\t\t\t\t<li><strong>Tilgjengelig til:</strong> \r\n<time datetime="2012-06-16T16:25:00+02:00">16.06.2012</time>\r\n\t\t\t\t</li>\r\n\t\t\t</ul>\r\n\t\t\r\n\t\t\t\r\n\t\t\t<ul class="sharethis clearfix">\r\n\t\t\t\t<li><a href="http://twitter.com/home?status=\r \t\t\t\t\t\t\t\tSe+%27Verda+vi+skaper%27+p%c3%a5+NRK+TV+http%3a%2f%2ftv.nrk.no%2​fserie%2fverda-vi-skaper%2fkoid21008710%2fsesong-1%2fepisode-6"\r \t\t\t\t\t   target="_blank" title="Del/tips på Twitter"><img src="http://psfil.nrk.no/content/images/tweet.png?1.1.4533.14084a" alt="Del/tips på Twitter" /></a></li>\r\n\t\t\t\t<li><a href="http://www.facebook.com/sharer.php?u=http://tv.nrk.no/serie/verda-vi-skaper/koid21008710/sesong-1/episode-6"\r \t\t\t\t\t   target="_blank" title="Del/tips på Facebook"><img src="http://psfil.nrk.no/content/images/facebook.png?1.1.4533.14084a" alt="Del/tips på Facebook" /></a></li>\r\n\t\t\t</ul>\r\n\t\t\t<!--googleon: snippet-->\r\n\t\t</aside>\r\n\r\n\t\t<article id="episode" class="span-10 last">\r\n\t\t\t<hgroup>\r\n\t\t\r\n\t\t\t\t\t<h2><a href="http://tv.nrk.no/serie/verda-vi-skaper">Verda vi skaper</a>  \r\n\t\t\t\t\t</h2>\r\n\t\t\t\t<h1>\r\n\t\t\t\t\tVerda vi skaper \r\n\t\t\t\t\t\t<span class="small">6:8</span> \t\t\r\n\t\t\t\t</h1>\t\t\r\n\t\t\t</hgroup>\r\n\t \r\n\t\t\t<section id="taglist" class="stack-links">\r\n\t\t\t\t<strong>Emner:</strong>\r\n\t\t\t\r\n<a href="/kategori/dokumentar-og-fakta" title="Vis flere programmer i kategori &quot;Dokumentar og fakta&quot;">Dokumentar og fakta</a>, <a class="thin" 
href="/sok?m=tv&amp;q=Kenya&amp;filter=rettigheter&amp;side=1" title="Vis flere programmer tagget med &quot;Kenya&quot;">Kenya</a>, <a class="thin" href="/sok?m=tv&amp;q=Slettelandet&amp;filter=rettigheter&amp;side=1" title="Vis flere programmer tagget med &quot;Slettelandet&quot;">Slettelandet</a>, <a class="thin" href="/sok?m=tv&amp;q=rovdyr&amp;filter=rettigheter&amp;side=1" title="Vis flere programmer tagget med &quot;rovdyr&quot;">rovdyr</a>, <a class="thin" href="/sok?m=tv&amp;q=urbefolkning&amp;filter=rettigheter&amp;side=1" title="Vis flere programmer tagget med &quot;urbefolkning&quot;">urbefolkning</a>, <a class="thin" href="/sok?m=tv&amp;q=tilpasning&amp;filter=rettigheter&amp;side=1" title="Vis flere programmer tagget med &quot;tilpasning&quot;">tilpasning</a>, <a class="thin" href="/sok?m=tv&amp;q=kultur&amp;filter=rettigheter&amp;side=1" title="Vis flere programmer tagget med &quot;kultur&quot;">kultur</a>\r\n\t\t\t</section>\r\n\t\t\r\n\t\t\r\n\t\t\t<div class="tab">\r\n\t\t\t\t<ul class="tab-nav line-sep clearfix">\r\n\t\t\t\t\t<li class="active"><h2><a href="#information">Programinformasjon</a></h2></li>\r\n\t\t\r\n\t\t\t\t\t\t<li><a href="/programreview/koid21008710" id="reviewLink" rel="nofollow">Omtale</a>\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t</li>\r\n\t\t\t\t\t\t<li><a href="/programsubtitles/koid21008710/html" id="subtitlesLink" rel="nofollow">Teksting</a></li>\r\n\t\t\t\t</ul>\r\n\t\t\t\t<div class="tab-panels">\r\n\t\t\t\t\t<section id="information" class="tab-panel">\r\n\t\t\t\t\t\t<div class="mod toggle closed">\r\n\t\t\t\t\t\t\t<p>\r\n\t\t\t\t\t\t\t\tBr. naturserie. På slettelandet veks gras som gir mat til dyr og menneske. Men nokre gonger er kampen for føda farleg. Dorobo-folket i Kenya må jage vekk svoltne løver for å skaffe levebrød. Mennesket og dyra lever tett saman på stepper over heile kloden. Norsk kommentar: Ola Bøe. 
(Human Planet: Grasslands) (6:8)\r\n\t\t\t\t\t\t\t\t<a href="#" class="control hide-when-open" title="Vis mer om Verda vi skaper">Vis mer</a>\r\n\t\t\t\t\t\t\t</p>\r\n\t\t\t\t\t\t\t<div class="details">\r\n\t\t\t\t\t\t\t\t<dl class="infolist">\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t<dt>Tilgjengelig i:</dt> <dd>Norge</dd>\r\n\t\t\t\t\t\t\t\t\t\t<dt>Første gang sendt:</dt> <dd>    <strong></strong> 08.06.2012 20:05</dd>\r\n\t\t\t\t\t\t\t\t\t\t<dt>Siste gang sendt:</dt> <dd>    <strong></strong> 08.06.2012 20:05</dd>\r\n\t\t\t\t\t\t\t\t\t\t<dt>Planlagt sendt:</dt> <dd>    <strong></strong> 09.06.2012 16:25</dd>\r\n\t\t\t\t\t\t\t\t</dl>\r\n\t\t\t\t\t\t\t\t<dl class="infolist">\r\n\t\t\t\t\t\t\t\t\t\t<dt>Serietittel:</dt> <dd>Verda vi skaper</dd>\r\n\r\n\t\t\t\t\t\t\t\t\t\t<dt>Episodetittel:</dt><dd>Verda vi skaper 6:8</dd>\r\n\r\n\t\t\t\t\t\t\t\t\t\t<dt>Orginal episodetittel:</dt> <dd>Human Planet</dd>\r\n\t\t\t\t\t\t\t\t\t\t<dt>Varighet:</dt> <dd>48 minutter</dd>\r\n\t\t\t\t\t\t\t\t</dl>\r\n\t\t\t\t\t\t\t\t\r\n\r\n\r\n\t\t\t\t\t\t\t\t\t<h3>Seriebeskrivelse:</h3><p>Britisk dokumentarserie</p>\r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t<a href="#" class="control" title="Vis mindre om Verda vi skaper">Vis mindre</a>\r\n\t\t\t\t\t\t\t</div>\r\n\t\t\t\t\t\t</div>\r\n\t\t\t\t\t</section>\r\n\t\t\t\t</div>\r\n\t\t\t</div>\r\n\t\t</article>\r\n\r\n\t</section>'

This returns an empty list:
parseDOM(html, 'div', {'id':'player'}, ret='data-media')

However, parseDOM(html, 'div', {'id':'player'}, ret='\tdata-media') works, although \t should not be considered part of the attribute name.
takoi Offline
Team-Kodi Member
Posts: 840
Joined: Oct 2009
Reputation: 12
Location: Norway
Post: #85
More examples of 'ret' not working correctly:

Code:
>>> html ='<div id="player" class="loading tv " \r \tdata-media="http://nordond2b-f.akamaihd.net/z/no/open/1e/1ee465d30cdea83ac036714a0d4e7c7ff7a1095d/1ee465d30cdea83ac036714a0d4e7c7ff7a1095d_,141,316,563,1266,2250,.mp4.csmil/manifest.f4m" \r \tdata-timezoneoffset="2" \r \tdata-startingbitrateindex="3"\r \tdata-streamingerrormessageurl="/streamingerror"\r \tdata-outoflivebuffermessageurl="/outoflivebuffer"\r \t\t\t\t data-subtitlesurl = "/programsubtitles/mkds61000910"\r \t\t\t data-IsRatedR = "False"\r >dsgdsfsdf</div>'

>>> parseDOM(html, 'div', {'id':'player'}, ret='\tdata-outoflivebuffermessageurl')
['/outoflivebuffer"\r \t\t\t\t data-subtitlesurl = "/programsubtitles/mkds61000910"\r \t\t\t data-IsRatedR = "False']

>>> parseDOM(html, 'div', {'id':'player'}, ret='data-subtitlesurl')
[]
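
For comparison, here is a minimal sketch of attribute extraction that tolerates the whitespace seen in these examples. It is not parsedom's implementation; `get_attribute` is a hypothetical helper, the URL is a placeholder, and it assumes double-quoted values:

```python
import re


def get_attribute(tag, name):
    """Return the value of attribute `name` in an opening tag, or None.

    Hypothetical helper. \\s before the name accepts a space, tab, CR or LF,
    \\s*=\\s* tolerates padding around "=", and [^"]* stops at the first
    closing quote so the match cannot run into the next attribute.
    """
    pattern = r'\s' + re.escape(name) + r'\s*=\s*"([^"]*)"'
    match = re.search(pattern, tag)
    return match.group(1) if match else None


tag = ('<div id="player" class="loading tv " \r \tdata-media="http://example.invalid/manifest.f4m" '
       '\r \tdata-outoflivebuffermessageurl="/outoflivebuffer"'
       '\r \t\t\t\t data-subtitlesurl = "/programsubtitles/mkds61000910"'
       '\r \t\t\t data-IsRatedR = "False"\r >')

print(get_attribute(tag, "data-media"))
print(get_attribute(tag, "data-outoflivebuffermessageurl"))
print(get_attribute(tag, "data-subtitlesurl"))
```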
newatv2user Offline
Senior Member
Posts: 295
Joined: May 2011
Reputation: 27
Post: #86
Is the debug feature not available with parsedom anymore? I am not able to get it to give me any debug message.
(This post was last modified: 2012-07-04 07:05 by newatv2user.)
stacked Offline
Skilled Python Coder
Posts: 802
Joined: Jun 2007
Reputation: 18
Post: #87
I've been using buggalo to track errors and I've noticed two common issues with the fetchPage function.

1. Here it looks like the connection times out at line 399 of CommonFunctions.py, and there is no exception handler for the socket timeout.

Log:
Code:
Type    <class 'socket.timeout'>
Message    timed out
Stacktrace     File "/home/xbmc/.xbmc/addons/plugin.video.revision3/default.py", line 335, in <module>
    build_main_directory(url)
File "/home/xbmc/.xbmc/addons/plugin.video.revision3/default.py", line 51, in build_main_directory
    html = common.fetchPage({"link": url})['content']
File "/home/xbmc/.xbmc/addons/script.module.parsedom/lib/CommonFunctions.py", line 399, in fetchPage
    ret_obj["content"] = con.read()
File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
File "/usr/lib/python2.7/httplib.py", line 647, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
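
A handler for this case might look roughly like the sketch below. Only the `ret_obj["content"] = con.read()` line comes from the traceback; the function name, the stub connection class, and the `status` key are assumptions made for illustration:

```python
import socket


class TimingOutConnection(object):
    """Stand-in for the response object; read() always times out."""

    def read(self):
        raise socket.timeout("timed out")


def read_content(con):
    """Read the response body, turning a socket timeout into a status code.

    Hypothetical helper. "content" is always present in the result, so
    callers can index it without risking a KeyError.
    """
    ret_obj = {"status": 200, "content": ""}
    try:
        ret_obj["content"] = con.read()
    except socket.timeout:
        ret_obj["status"] = 500  # assumed convention for "request failed"
    return ret_obj


result = read_content(TimingOutConnection())
print(result)
```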


2. For some odd reason, fetchPage returns without a 'content' key. I believe this happens when the condition at line 398 of CommonFunctions.py evaluates to false.

Log:
Code:
Type    <type 'exceptions.KeyError'>
Message    'content'
Stacktrace     File "/storage/sdcard0/Android/data/org.xbmc.xbmc/files/.xbmc/addons/plugin.video.revision3/default.py", line 339, in <module>
    get_video(url, name, plot, studio, episode, thumb, date)
File "/storage/sdcard0/Android/data/org.xbmc.xbmc/files/.xbmc/addons/plugin.video.revision3/default.py", line 231, in get_video
    result = common.fetchPage({"link": url})['content']


Sorry, I don't have the full debug logs. Thanks again for this script.
takoi Offline
Team-Kodi Member
Posts: 840
Joined: Oct 2009
Reputation: 12
Location: Norway
Post: #88
What did you expect to happen? When it times out it times out. If you have a way to recover, then catch it..
(This post was last modified: 2012-08-23 11:08 by takoi.)
stacked Offline
Skilled Python Coder
Posts: 802
Joined: Jun 2007
Reputation: 18
Post: #89
(2012-08-23 11:08)takoi Wrote:  What did you expect to happen? When it times out it times out. If you have a way to recover, then catch it..

I did create a way to recover. I was just trying to point out the problem so it can be corrected within the fetchPage function.

Anyway, here is what I did. I created a function that wraps fetchPage. If fetchPage raises an error, the function attempts to load the page again. If it still fails after 3 retries, buggalo will catch the error.
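
The actual wrapper is not shown in the thread; a sketch of the idea could look like this. The function name, the choice of three attempts in total, and the stub used for the demonstration are all assumptions:

```python
import socket


def fetch_content_with_retries(fetch_page, url, attempts=3):
    """Call fetch_page up to `attempts` times and return the page content.

    Hypothetical wrapper (attempts must be at least 1). Both failure modes
    reported above are retried: a socket timeout and a result without a
    "content" key. After the last failure the error is re-raised so an
    outer handler (buggalo, in this thread) can report it.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch_page({"link": url})["content"]
        except (socket.timeout, KeyError) as error:
            last_error = error
    raise last_error


# Stub standing in for common.fetchPage: fails twice, then succeeds.
calls = []


def flaky_fetch_page(params):
    calls.append(params["link"])
    if len(calls) < 3:
        raise socket.timeout("timed out")
    return {"content": "<html></html>"}


content = fetch_content_with_retries(flaky_fetch_page, "http://example.invalid/")
print(content)
```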
(This post was last modified: 2012-08-23 22:44 by stacked.)
mrstealth Offline
Junior Member
Posts: 4
Joined: Sep 2012
Reputation: 0
Post: #90
Thank you for the great script. It helps speed up add-on development, and I use it in all my XBMC add-ons.

But since today all my plugins are broken and I get the following error:

Code:
response = common.fetchPage({"link": url})
File ".../Library/Application Support/XBMC/addons/script.module.parsedom/lib/CommonFunctions.py", line 410, in fetchPage
ret_obj["content"] = inputdata.decode("utf-8")
File "/Applications/XBMC.app/Contents/Frameworks/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 275-276: invalid data

I investigated the issue and found that the default encoding is set to 'utf-8' in version 1.2.0, which causes a crash for non-UTF-8 pages (Fetchpage should decode binary to utf-8).

Is it possible to provide some configuration option like: common.encoding = 'cp1251'?
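
Such an option could be sketched as follows. This is not the module's actual code: `decode_page` and its `encoding` parameter are hypothetical, and falling back to replacement characters is just one possible policy for undecodable bytes:

```python
def decode_page(inputdata, encoding="utf-8"):
    """Decode fetched bytes with a caller-chosen encoding.

    Hypothetical helper. If strict decoding fails, undecodable bytes are
    replaced with U+FFFD instead of raising UnicodeDecodeError.
    """
    try:
        return inputdata.decode(encoding)
    except UnicodeDecodeError:
        return inputdata.decode(encoding, "replace")


# Bytes of a cp1251-encoded page fragment (a Russian word, written with escapes).
raw = u"\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")

as_cp1251 = decode_page(raw, "cp1251")  # decodes cleanly
as_utf8 = decode_page(raw)              # no crash, but contains U+FFFD
```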

Many thanks in advance for your help. I hope this will be fixed in the next add-on version.
(This post was last modified: 2012-09-19 18:24 by mrstealth.)