[Release] Parsedom and other functions

- takoi - 2012-01-05

Are you pushing to the addon repo anytime soon? I have some addons depending on the changes, so I'm just waiting for a new version...

- TobiasTheCommie - 2012-01-05

The dependencies are all ready to go. But because the interfaces aren't backward compatible, we can't push before BlipTV and YouTube (and possibly Vimeo) are in better states than they are now.

- takoi - 2012-01-08

Shouldn't be hard to make it backwards compatible. Just add the old module CommonFunctions and let the old 'constructor' return the module, like so:

Code:
def CommonFunctions():
    import common
    return common
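
A minimal sketch of how such a shim would behave from the caller's side. The `common` module here is a stand-in built on the spot, and the toy `replaceHTMLCodes` is an assumption for illustration only, not the add-on's real implementation:

```python
import sys
import types

# Stand-in for the new-style module (assumption: the real add-on ships its own).
common = types.ModuleType("common")
common.replaceHTMLCodes = lambda txt: txt.replace("&amp;", "&")  # toy version
sys.modules["common"] = common


def CommonFunctions():
    # Old-style "constructor": returns the module itself instead of an instance.
    import common
    return common


# Old calling code keeps working unchanged.
legacy = CommonFunctions()
print(legacy.replaceHTMLCodes("Tom &amp; Jerry"))  # prints: Tom & Jerry
```

Because the old entry point just returns the module, code written against the instantiated interface keeps calling the same names.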

..anything else changed?

- TobiasTheCommie - 2012-01-08

Yes, it would be very easy to add backwards compatibility. But I'm not going to before the final version.

- BlueCop - 2012-01-09

Many thanks for all the tools. I am anxious to try your speedy regexes later!

- TobiasTheCommie - 2012-01-27

Pull requests have been sent for 0.9.1.

The updated version should land in the Eden repository soonish.

- bossanova808 - 2012-01-27

Care to detail the main changes?

- TobiasTheCommie - 2012-01-27

Changelog:
- Make it easier to include the dependency. (It no longer needs to be instantiated)
- Changed _fetchPage to fetchPage.
- Added POST (and more) support to fetchPage. See http://wiki.xbmc.org/index.php?title=Add-on:Parsedom_for_xbmc_plugins#fetchPage.28self.2C_dict.29 for details.
- Changed _replaceHtmlCodes to replaceHTMLCodes.
- Make replaceHTMLCodes match many more HTML codes
- Cleanup, stability.

- newatv2user - 2012-01-30

Can Parsedom parse HTML formatted like below? Or does it break with {CR}{LF}, i.e. a newline?

Code:
<div
id="header"><div
class="sitedesc">

My parsedom-using code has been returning empty results for the past couple of days. So I'm guessing the site changed their code, or parsedom changed. The div and class are still there in the HTML, but parsedom does not seem to recognize them.

The log has been returning something like this.

Quote:20:08:27 T:3108 NOTICE: [TopDoc - 0.0.1] parseDOM : 'start: 'div' - {'class': 'wrapexcerpt'} - False - <type 'str'>'
20:08:27 T:3108 NOTICE: [TopDoc - 0.0.1] parseDOM : 'Getting element content for 0 matches '
20:08:27 T:3108 NOTICE: [TopDoc - 0.0.1] parseDOM : 'Done'

- TobiasTheCommie - 2012-01-30

newatv2user Wrote:Can Parsedom parse html formatted like below? Or does it break with {CR}{LF} i.e newline.

Code:
<div
id="header"><div
class="sitedesc">

My parsedom using code has been returning empty in the past couple of days. So I'm guessing the site changed their code, or parsedom changed. The div, class are still there in the html but parsedom does not seem to recognize it.

The log has been returning something like this.

Currently the code does:
item = item.replace("\n", "")

But there is a better implementation I am going to test before the next version.

I'll need more code and debug output to be able to help. But you could try doing .replace("\n", "") or maybe .replace("\r\n", "") on the string before you give it to parseDOM to remove those pesky linebreaks yourself.

- newatv2user - 2012-01-31

The URL I'm trying to parse is:
http://topdocumentaryfilms.com/all/

My code is this:

Quote:itemsDOM = common.parseDOM(contents, "div", attrs = { "class": "wrapexcerpt"}, ret=False)

I swear it was working a couple of days back. It's not working anymore. I tried your suggestion with replace, but still no go. Any hint on how I could fix this would be great. Thanks.

- TobiasTheCommie - 2012-01-31

newatv2user Wrote:The URL I'm trying to parse is:
http://topdocumentaryfilms.com/all/

My code is this:

I swear it was working couple of days back. It's not working anymore. I tried your suggestion with replace, but still no go. Any hint on how I could fix this would be great. Thanks.

OK, I've downloaded the page and added two (so far) integration tests on it that fail.

I'll see what I can figure out (on a fix and a workaround).

ETA: .replace("\n", " ") should do the trick. I'm doing that in the parseDOM for the next version. You can do it beforehand so you don't have to wait.
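
A small sketch of why the replacement character matters for markup like the snippet posted earlier. The regular expression is a hypothetical stand-in for a tag matcher, not parseDOM's actual pattern:

```python
import re

# Same shape as the markup from the earlier post: newlines inside the tags.
html = '<div\nid="header"><div\nclass="sitedesc">'

# Hypothetical matcher that expects whitespace between tag name and attribute.
tag_with_class = re.compile(r'<div\s+class="sitedesc"')

stripped = html.replace("\n", "")  # glues the tag name to the attribute
spaced = html.replace("\n", " ")   # keeps them separated

print(stripped)  # <divid="header"><divclass="sitedesc">
print(spaced)    # <div id="header"><div class="sitedesc">
print(bool(tag_with_class.search(stripped)))  # False
print(bool(tag_with_class.search(spaced)))    # True
```

Deleting the newline outright turns `<div` plus `class=` into `<divclass=`, which no longer looks like a div at all, while a space preserves the tag boundary.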

- newatv2user - 2012-01-31

Awesome. That worked. I did replace("\n", "") before which didn't work. The space did the trick.

Thanks a lot.

- newatv2user - 2012-01-31

More problems on the same page.

Part of debug log:

Quote:21:05:21 T:1036 NOTICE: <ul style="background:#efefef;"><li style="padding:5px;font-size:13px;"><strong>Recommended Documentaries</strong></li></ul><ul><li><a href="http://topdocumentaryfilms.com/planet-earth-the-complete-bbc-series/">Planet Earth: The Complete BBC Series</a></li><li><a href="http://topdocumentaryfilms.com/cosmos/">Cosmos: A Personal Voyage (Carl Sagan)</a></li><li><a href="http://topdocumentaryfilms.com/philosophy-guide-to-happiness/">Philosophy – Guide to Happiness</a></li><li><a href="http://topdocumentaryfilms.com/through-the-wormhole/">Through the Wormhole</a></li><li><a href="http://topdocumentaryfilms.com/the-lost-world-of-lake-vostok/">The Lost World of Lake Vostok</a></li><li><a href="http://topdocumentaryfilms.com/story-of-science/">The Story of Science: Power, Proof and Passion</a></li><li><a href="http://topdocumentaryfilms.com/james-burke-connections/">James Burke: Connections</a></li><li><a href="http://topdocumentaryfilms.com/genius-charles-darwin/">The Genius of Charles Darwin</a></li><li>Universe: <a href="http://topdocumentaryfilms.com/universe-season-1/">Season 1</a>, <a href="http://topdocumentaryfilms.com/universe-season-2/">Season 2</a>, <a href="http://topdocumentaryfilms.com/universe-season-3/">Season 3</a>, <a href="http://topdocumentaryfilms.com/universe-season-4/">Season 4</a>, <a href="http://topdocumentaryfilms.com/universe-season-5/">Season 5</a></li><li><a href="http://topdocumentaryfilms.com/why-i-am-no-longer-a-christian/">Why I Am No Longer a Christian</a></li></ul>
21:05:21 T:1036 NOTICE: [TopDoc - 0.0.1] parseDOM : 'start: 'li' - {} - False - <type 'str'>'
21:05:21 T:1036 NOTICE: [TopDoc - 0.0.1] parseDOM : 'no list found, making one on just the element name'
21:05:21 T:1036 NOTICE: [TopDoc - 0.0.1] parseDOM : 'Getting element content for 1 matches '

This is the code i'm using:

Quote:print item
recDOM2 = common.parseDOM(item, "li")

It used to find all "li", now it won't. Am I doing something wrong here?

- newatv2user - 2012-02-04

Hi Tobi

I am still having some problems with parsedom.

Quote:<div class="post-right"><h3><a href="http://documentarystorm.com/last-chance-to-see/" rel="bookmark" title="Stream this documentary: Last Chance to See">Last Chance to See</a></h3><p class="post-meta">Jan 29th, 2012 // <a href="http://documentarystorm.com/category/nature-biology/animals-nature-biology/" title="View all posts in Animals" rel="category tag">Animals</a>, <a href="http://documentarystorm.com/category/nature-biology/" title="View all posts in Nature" rel="category tag">Nature</a> // <a href="http://documentarystorm.com/last-chance-to-see/#comments" title="Comment on Last Chance to See">2 Comments »</a></p><p>Stephen Fry and zoologist Mark Carwardine head to the ends of the earth in search of animals on the edge of extinction.</p><div style="display: none">VN:RO [1.9.13_1145]</div><div class="ratingblock "><div class="ratingheader "></div><div class="ratingstarsinline "><div id="article_rater_7827" class="ratepost gdsr-pumpkin gdsr-size-20"><div class="starsbar gdsr-size-20"><div class="gdouter gdheight"><div id="gdr_vote_a7827" style="width: 118.181818182px;" class="gdinner gdheight"></div></div></div></div></div></div></div>

On the above HTML, how do I get the second <p>, the one that contains the description? It is only returning the first <p>, the one with a class.

Thanks.
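
One possible workaround, sketched with Python's standard `re` module rather than parseDOM itself (how parseDOM should be called for this case is not settled here). The HTML is an abridged copy of the snippet above, and the variable names are made up for illustration. The idea is to match only `<p>` tags that carry no attributes:

```python
import re

# Abridged from the quoted markup; only the parts relevant to the question.
html = (
    '<div class="post-right"><h3><a href="http://documentarystorm.com/last-chance-to-see/">'
    'Last Chance to See</a></h3>'
    '<p class="post-meta">Jan 29th, 2012</p>'
    '<p>Stephen Fry and zoologist Mark Carwardine head to the ends of the earth '
    'in search of animals on the edge of extinction.</p></div>'
)

# A literal "<p>" skips '<p class="post-meta">' because that tag has an attribute.
bare_paragraphs = re.findall(r"<p>(.*?)</p>", html, re.DOTALL)
description = bare_paragraphs[0] if bare_paragraphs else ""
print(description)
```

This only helps when the wanted paragraph is the one without attributes; if the site adds a class to it later, the pattern would need to change.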