Kodi Community Forum

Full Version: [Release] Parsedom and other functions
Yep, looks like this broke my addon as well. What happened to 'we do unit testing' blah blah? Seems like a big regression really... even if you extend it to do utf8, surely the default behaviour should remain what it was??



Sorry that sounds rude, didn't mean it to be - parsedom is really nifty and made things very easy to write, but this is a big-ish change without forewarning...
I adjusted my addon to cope with the new unicode strings and it's all working well again
How did you fix it? I am getting errors too.

"UnicodeDecodeError: 'ascii' codec can't decode byte.........."
For me it wasn't really errors in fetchpage itself, it was my crappy code - this was the first Python I ever wrote and I was calling member functions of str directly (e.g. str.strip(mystr)), rather than on the string object. All the str functions exist for unicode as well, so I just called them on the string object instead, i.e. mystr.strip().
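
To illustrate the difference, here is a minimal sketch (not the addon's actual code; the helper names are made up for the example). Calling the method through the str type only works when the value really is a str, while calling it on the object works for whichever string type comes back. On Python 2 the "other" type would be unicode; the sketch picks bytes on Python 3 so it runs on either.

```python
def strip_via_type(value):
    # Unbound call through the str type: raises TypeError unless value is a str.
    return str.strip(value)


def strip_via_object(value):
    # Bound call: each string type brings its own strip().
    return value.strip()


# Pick "the other string type": unicode on Python 2, bytes on Python 3.
other = u"  hello  " if str is bytes else b"  hello  "

try:
    strip_via_type(other)
    type_call_failed = False
except TypeError:
    type_call_failed = True

stripped = strip_via_object(other)
```

The second form is the one that keeps working when a library starts returning a different string type.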

Honestly, I looked in there and was embarrassed at what I saw!! It was quite a while ago now, but geez some ugly code is in there. Still, it works - indeed if I'd just written it properly to start, it would have worked even with the parsedom changes, so my bad really.

In your case, maybe pinch the fetchpage function from the older parsedom as an ugly workaround to keep going??

(At a guess on your issue: I would say you're getting strings back with embedded unicode in them. There's a well-known bug for this in Python that requires a hack to get around. I came across this in XSqueeze and solved it with a function):

Code:
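def unquoteUni(text, charset='utf-8'):
    # charset stands in for the original self.charset, which is undefined
    # in a plain function; the 'utf-8' default is an assumption.
    try:
        # Python 3: unquote handles the encoding itself
        import urllib.parse
        return urllib.parse.unquote(text, encoding=charset)
    except ImportError:
        # Python 2 fallback: decode the %XX escapes by hand
        _hexdig = '0123456789ABCDEFabcdef'
        _hextochr = dict((a+b, chr(int(a+b,16))) for a in _hexdig for b in _hexdig)
        if isinstance(text, unicode):
            text = text.encode('utf-8')
        res = text.split('%')
        for i in xrange(1, len(res)):
            item = res[i]
            try:
                res[i] = _hextochr[item[:2]] + item[2:]
            except KeyError:
                # not a valid hex escape, keep the literal '%'
                res[i] = '%' + item
            except UnicodeDecodeError:
                res[i] = unichr(int(item[:2], 16)) + item[2:]
        return "".join(res)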

giftie helped me find (or wrote? dude is awesome) that function, which deals with unicode in strings. This is for the case where you have a str type that actually has unicode in it, like 'The message is \xe8\x91\xa3' or similar. Those characters are encoded unicode bytes (even though the type is str) and fall outside the ascii range, hence the error you get above. The normal unquote does not work...
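
A small sketch of that situation, assuming the bytes are UTF-8 (which the example string above happens to be; a real page may use another encoding). Decoding as ascii fails on the first byte above 0x7f, while decoding with the right codec gives proper text:

```python
raw = b'The message is \xe8\x91\xa3'  # UTF-8 bytes, the Python 2 "str" case

try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False  # bytes above 0x7f are not ASCII

# Decode with the encoding the bytes are actually in (assumed UTF-8 here).
text = raw.decode('utf-8')
```

On Python 2 the failing ascii decode is what happens implicitly whenever such a str gets mixed with a unicode object.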

Sorry, but I don't understand. Here are more details on what I'm facing.

This is the portion of my code:

Code:
show_page = 'http://www.vice.com/shows'
try:
    soup = get_remote_data(show_page)
except HTTPError:
    return ''

story_list = common.parseDOM(soup, "ul", attrs={"class": "story_list.*?"})

And this is the error I'm getting:
Code:
10:50:14 T:5592 ERROR: Error Type: <type 'exceptions.UnicodeDecodeError'>
10:50:14 T:5592 ERROR: Error Contents: 'ascii' codec can't decode byte 0xd0 in position 8508: ordinal not in range(128)
10:50:14 T:5592 ERROR: Traceback (most recent call last):
File "C:\Documents and Settings\xxxxxx\Application Data\XBMC\addons\plugin.video.vice2\default.py", line 95, in <module>
Main()
File "C:\Documents and Settings\xxxxxx\Application Data\XBMC\addons\plugin.video.vice2\default.py", line 62, in __init__
for show in cache.cacheFunction(vice.get_episodes):
File "C:\Documents and Settings\xxxxxx\Application Data\XBMC\addons\script.common.plugin.cache\lib\StorageServer.py", line 541, in cacheFunction
ret_val = funct(*args)
File "C:\Documents and Settings\xxxxxx\Application Data\XBMC\addons\plugin.video.vice2\resources\lib\vice.py", line 147, in get_episodes
story_list = common.parseDOM(soup, "ul", attrs={"class": "story_list.*?"})
File "C:\Documents and Settings\xxxxxx\Application Data\XBMC\addons\script.module.parsedom\lib\CommonFunctions.py", line 267, in parseDOM
temp = _getDOMContent(item, name, match, ret).strip()
File "C:\Documents and Settings\xxxxxx\Application Data\XBMC\addons\script.module.parsedom\lib\CommonFunctions.py", line 144, in _getDOMContent
end = html.find(endstr, start)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 8508: ordinal not in range(128)

So do I use unquoteUni before I try parsedom? Thanks for the help.
Try fetching the page with fetchpage from parsedom for a start I'd say...
Code:
08:18:26 T:3860  NOTICE: CommonFunctions-1.2.0
08:18:28 T:3860   ERROR: Traceback (most recent call last):
08:18:28 T:3860   ERROR:   File "C:\Program Files (x86)\XBMC\portable_data\addons\plugin.video.revision3\default.py", line 382, in <module>
08:18:28 T:3860   ERROR:     build_sub_directory(url, name)
08:18:28 T:3860   ERROR:   File "C:\Program Files (x86)\XBMC\portable_data\addons\plugin.video.revision3\default.py", line 139, in build_sub_directory
08:18:28 T:3860   ERROR:     blah2 = common.fetchPage({"link": blah})['status']
08:18:28 T:3860   ERROR:   File "C:\Program Files (x86)\XBMC\portable_data\addons\script.module.parsedom\lib\CommonFunctions.py", line 410, in fetchPage
08:18:28 T:3860   ERROR:     ret_obj["content"] = inputdata.decode("utf-8")
08:18:28 T:3860   ERROR:   File "C:\Program Files (x86)\XBMC\system\python\Lib\encodings\utf_8.py", line 16, in decode
08:18:28 T:3860   ERROR:     return codecs.utf_8_decode(input, errors, True)
08:18:28 T:3860   ERROR: UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

I'm also having the UnicodeDecodeError problem. It happens when checking the status on this url. The error goes away if you edit line 410 in fetchPage and remove decode("utf-8"). I'm not sure if that would be the permanent solution.

Code:
line 410: ret_obj["content"] = inputdata.decode("utf-8")

to

line 410: ret_obj["content"] = inputdata
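
A possible middle ground between the two versions of line 410, sketched under the assumption that the response body may not be text at all (a first byte of 0xff is never valid UTF-8 and could be, for example, an image or a UTF-16 byte order mark). decode_or_raw is a hypothetical helper, not part of CommonFunctions.py:

```python
def decode_or_raw(inputdata, encoding="utf-8"):
    """Return decoded text when possible, else the untouched bytes."""
    try:
        return inputdata.decode(encoding)
    except UnicodeDecodeError:
        # Not valid text in this encoding; hand the raw bytes back.
        return inputdata


page = decode_or_raw(b"caf\xc3\xa9")          # valid UTF-8, comes back as text
binary = decode_or_raw(b"\xff\xd8\xff\xe0")   # invalid UTF-8, comes back as bytes
```

The caller then has to cope with getting either type back, so this only moves the problem; it does keep a status check from crashing on non-text content.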

Maybe try using the unquoteUni I posted above as an alternative to decode??

You'll probably have to ask TobiasTheCommie for help with stuff inside of parseDOM itself.
Well, the easiest fix for me was removing 1.2 and installing 1.1. Unfortunately, even with automatic updates disabled, it just keeps updating to the borked 1.2. Wtf? Is there any way to disable update for this?

Update:
Installed 1.1 and changed the version in addon.xml to 1.2. Hope that works.
(2012-09-26, 04:28)newatv2user Wrote: [ -> ]Well, the easiest fix for me was removing 1.2 and installing 1.1. Unfortunately, even with automatic updates disabled, it just keeps updating to the borked 1.2. Wtf? Is there any way to disable update for this?

Update:
Installed 1.1 and changed the version in addon.xml to 1.2. Hope that works.

I just put the script.module.parsedom-1.1.0 version in my xbmc repository and raised the version to 1.2.1 in addon.xml.

Code:
<addon id='script.module.parsedom' version='1.2.1' name='Parsedom for xbmc plugins' provider-name='TheCollective'>

But this is a workaround and I hope the issue will be fixed in the next parsedom version.
I don't understand the 1.3.0 update ( View the diff ).

The unquote-ing causes the below statement to produce an index error in getParameters.

Code:
params = common.getParameters("?path=/root/favorites&login=true&name=Tom+%26+Jerry")

I found a workaround of just using quote_plus. Is this the intended use?

Code:
params = common.getParameters(urllib.quote_plus("?path=/root/favorites&login=true&name=Tom+%26+Jerry"))
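
One way the index error could arise (an assumption about the cause, not a reading of the 1.3.0 diff): if the whole string is unquoted before it is split, %26 turns into a literal &, so "Jerry" becomes a fragment with no "=" in it. Splitting first and unquoting each value afterwards avoids that. parse_params is a hypothetical stand-in, not parsedom's getParameters:

```python
try:
    from urllib.parse import unquote_plus  # Python 3
except ImportError:
    from urllib import unquote_plus  # Python 2


def parse_params(parameter_string):
    """Split on '&' and '=' first, then unquote each value."""
    params = {}
    query = parameter_string[parameter_string.find("?") + 1:]
    for pair in query.split("&"):
        if not pair:
            continue
        # partition never raises, even when '=' is missing
        name, _, value = pair.partition("=")
        params[name] = unquote_plus(value)
    return params


url = "?path=/root/favorites&login=true&name=Tom+%26+Jerry"
params = parse_params(url)

# Unquoting the whole string first leaves a stray fragment without '='.
fragments = unquote_plus(url).split("&")
```

Here params["name"] is "Tom & Jerry", whereas the last entry of fragments is " Jerry", which is the piece that would trip up a split('=')[1].
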
The commit is just crazy IMHO. The issues others have had with unicode characters are probably due to unquote_plus's inability to handle them properly.
I've been using the function below for ages without any issues whatsoever...
Code:
# FROM plugin.video.youtube.beta  -- converts the request url passed on by xbmc to our plugin into a dict
def get_parameters(parameterString):
    commands = {}
    splitCommands = parameterString[parameterString.find('?')+1:].split('&')
    for command in splitCommands:
        if (len(command) > 0):
            splitCommand = command.split('=')
            name = splitCommand[0]
            value = splitCommand[1]
            commands[name] = value
    return commands
Quote:Version 1.4.0
- Version 1.3 was too aggressive on frodo and not needed in eden, so we're doing a rollback on eden and fix on frodo

Great. Everything seems to be working again in eden, but I'm still having the same issue with frodo. I think frodo also needs a rollback.

(2012-11-23, 21:50)Popeye Wrote: [ -> ]The commit is just crazy IMHO. The issues others have had with unicode characters are probably due to unquote_plus's inability to handle them properly.
I've been using the function below for ages without any issues whatsoever...
Code:
# FROM plugin.video.youtube.beta  -- converts the request url passed on by xbmc to our plugin into a dict
def get_parameters(parameterString):
    commands = {}
    splitCommands = parameterString[parameterString.find('?')+1:].split('&')
    for command in splitCommands:
        if (len(command) > 0):
            splitCommand = command.split('=')
            name = splitCommand[0]
            value = splitCommand[1]
            commands[name] = value
    return commands

Thanks, that's what I was planning on using.
What's the deal with frodo? To me it seems as if xbmc frodo URL-encodes the whole plugin:// URI. If so, this is crucial information for all addon devs and must be shared asap...
Any fix for the Unicode error yet? If not I'm just gonna go back to 1.1.

Code:
File "C:\Documents and Settings\***\Application Data\XBMC\addons\script.module.parsedom\lib\CommonFunctions.py", line 278, in parseDOM
temp = _getDOMContent(item, name, match, ret).strip()
File "C:\Documents and Settings\***\Application Data\XBMC\addons\script.module.parsedom\lib\CommonFunctions.py", line 144, in _getDOMContent
end = html.find(endstr, start)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 15193: ordinal not in range(128)