Very strange UnicodeEncodeError
#1
Hello,

First of all, this is my first post on this forum, so hello Kodi community!

I have been banging my head on this for hours now, so I give up and kindly ask for your help...
I thought I understood the tricky handling of byte and unicode strings in Python, but now I am completely lost.

Here is the code that is driving me nuts:
python:

# -*- coding: utf-8 -*-
if __name__ == "__main__":
    s = u"string with àccénts"
    print(u"Type of s = {}".format(type(s)))
    print("s with repr = {}".format(repr(s)))
    print("s with encode = {}".format(s.encode("utf-8")))
    print(u"s = {}".format(s))

When I run this code with Python 2.7.10 on macOS I get this:
Code:
Type of s = <type 'unicode'>
s with repr = u'string with \xe0cc\xe9nts'
s with encode = string with àccénts
s = string with àccénts

This is the behavior I expected: as you can see, "s" is a unicode string, so as far as I understand there are 2 safe ways to print it using "format()":
  1. Convert the string containing the literal to unicode with "u": print(u"s = {}".format(s))
  2. Use repr(): print("s with repr = {}".format(repr(s)))
And indeed both options work here.

But if I run the same code in my add-on I get this:
Code:

DEBUG: Type of s = <type 'unicode'>
DEBUG: s with repr = u'string with \xe0cc\xe9nts'
DEBUG: s with encode = string with àccénts
ERROR: EXCEPTION Thrown (PythonToCppException) : -->Python callback/script returned the following error<--
- NOTE: IGNORING THIS CAN LEAD TO MEMORY LEAKS!
Error Type: <type 'exceptions.UnicodeEncodeError'>
Error Contents: 'ascii' codec can't encode character u'\xe0' in position 16: ordinal not in range(128)
Traceback (most recent call last):
File "<my_plugin_path>/main.py", line 39, in <module>
print(u"s = {}".format(s))
File "<string>", line 7, in write
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 16: ordinal not in range(128)
-->End of Python script error report<--

I just don't understand why I get this UnicodeEncodeError exception here.

I am using Kodi Leia v18.6 on macOS (so Python 2 is used) and it uses xbmc.python v2.26.0.
Also, please note that the behavior is the same with xbmc.log() instead of print. I used print here just to shorten the code.

Do you have any idea what is wrong here?
Thank you in advance for your help!
#2
I've encountered this issue with log (print just sends to log iirc). I'm not sure what the reason is, but I've found that I need to decode the string before passing to the formatter.

Code:
print(u"s = {}".format(s.decode('utf-8')))
Arctic Fuse - Alpha now available. Support me on Ko-fi.
#3
**Rant mode on**
Those who decided that in Python 2.0 str and unicode types should be compatible should get a special hell.
**Rant mode off**

There's nothing strange about this error. You are trying to format a binary string (produced by .encode()) into a unicode one, and since Python 2 does a silent conversion under the hood using ASCII encoding, you get this error because your accented characters do not fit into the ASCII code table.

Long story short: unicode is not str with additional characters. Those are different types: str stores a sequence of bytes, that is, the minimal unit of str is a byte (8 bits), while unicode stores text and the minimal unit of unicode is a Unicode code point (a character of some language or a modification command). You should not mix those types together despite the fact that Python 2 allows you to. Fortunately, in Python 3 those types are distinct and incompatible.

The rule of thumb is not to mix str and unicode in the same context (string concatenation or formatting) and to store all user-facing text as unicode. Use the .decode() method to convert str (a sequence of bytes) to text, if necessary, or the .encode() method to convert unicode to bytes, if you really need to. BTW, bytes is a valid type in Python 2.7 and means the same as str.
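To make the silent conversion visible, here is a minimal sketch that does the same steps explicitly. The variable names are only illustrative. It should run on both Python 2.7 and Python 3: encoding to UTF-8 round-trips cleanly, while encoding to ASCII raises the same UnicodeEncodeError that appears in the log above.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the names below are made up for this example.
text = u"string with \xe0cc\xe9nts"       # unicode text (code points)
message = u"s = {}".format(text)           # unicode + unicode: no conversion

# Explicit conversion between text and bytes is lossless with UTF-8.
raw = message.encode("utf-8")              # a byte string
assert raw.decode("utf-8") == message      # round trip restores the text

# This is what Python 2 attempts silently when unicode text reaches
# a place that only accepts str: an encode with the ASCII codec.
try:
    message.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc                      # the first accented character fails

print(ascii_error)
```

The error position reported here is the index of the first accented character in the formatted message, which is why the traceback points at the whole print line rather than at the literal.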
#4
Thanks both of you for answering!
(2020-04-17, 04:29)jurialmunkey Wrote: I've encountered this issue with log (print just sends to log iirc). I'm not sure what the reason is, but I've found that I need to decode the string before passing to the formatter.

Code:
print(u"s = {}".format(s.decode('utf-8')))

Are you sure you meant "decode" and not "encode"? Because I tried the following code:

python:

print(u"s = {}".format(s.decode('utf-8')))
print("s = {}".format(s.decode('utf-8')))

and none of these options worked...
 
(2020-04-17, 09:49)Roman_V_M Wrote: There's nothing strange about this error. You are trying to format a binary string (produced by .encode()) into a unicode one, and since Python 2 does a silent conversion under the hood using ASCII encoding, you get this error because your accented characters do not fit into the ASCII code table.
Actually, I am trying to print a unicode string (called s) combined with another unicode string using format(). Maybe my explanation was not clear enough: I listed several commands to show what works and what does not, but that may have been confusing.
To simplify here is the issue:
python:

# -*- coding: utf-8 -*-
if __name__ == "__main__":
    s = u"string with àccénts"
    print(u"Type of s = {}".format(type(s)))
    print(u"s = {}".format(s))

This code works when executed directly with python 2.7.10:
Code:
Type of s = <type 'unicode'>
s = string with àccénts

But when called in my Kodi add-on I get this:
Code:
DEBUG: Type of s = <type 'unicode'>
ERROR: EXCEPTION Thrown (PythonToCppException) : -->Python callback/script returned the following error<--
- NOTE: IGNORING THIS CAN LEAD TO MEMORY LEAKS!
Error Type: <type 'exceptions.UnicodeEncodeError'>
Error Contents: 'ascii' codec can't encode character u'\xe0' in position 16: ordinal not in range(128)
Traceback (most recent call last):
File "/path/to/my/plugin/main.py", line 29, in <module>
print(u"s = {}".format(s))
File "<string>", line 7, in write
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 16: ordinal not in range(128)
-->End of Python script error report<--

s is a unicode string because I used the "u" prefix and it is confirmed by the print with type(s). So I don't understand why the last print generates an error whereas it does not with the "classic" python interpreter...
#5
(2020-04-17, 15:37)ThomB Wrote: s is a unicode string because I used the "u" prefix and it is confirmed by the print with type(s). So I don't understand why the last print generates an error whereas it does not with the "classic" python interpreter...

Sorry, I missed the fact that you are trying to use print in Kodi. This is a slightly different story. From a technical POV, print in Kodi uses xbmc.log() under the hood, and xbmc.log() accepts only encoded byte strings, that is, the str type. If you pass a unicode string to it, Python 2 tries to encode it implicitly using the ASCII code table and fails for the reasons explained above. As far as supported types are concerned, the Kodi Python API is a bit messy, at least with Python 2. The situation is much better with Python 3 in Kodi "Matrix".

You can check my Kodistubs project that includes the information about what types Kodi Python API functions and methods expect and return: https://github.com/romanvm/Kodistubs
For example, xbmc.log(): https://github.com/romanvm/Kodistubs/blo...c.py#L1817
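
To see the failure outside Kodi, here is a rough simulation. fake_xbmc_log and LogWriter are hypothetical stand-ins, not Kodi's actual code. They only assume what is described above: print hands its text to a writer object (the traceback shows a write call), the logger behind it wants byte strings, and unicode gets converted with the ASCII codec.

```python
# Hypothetical stand-ins for illustration; this is not Kodi's implementation.
logged = []


def fake_xbmc_log(msg):
    """Pretend logger that, like xbmc.log() on Python 2, wants bytes."""
    if not isinstance(msg, bytes):
        # Mimic Python 2's implicit unicode -> str conversion (ASCII codec).
        msg = msg.encode("ascii")
    logged.append(msg)


class LogWriter(object):
    """File-like object that forwards everything written to the logger."""

    def write(self, data):
        fake_xbmc_log(data)


writer = LogWriter()
message = u"s = {}".format(u"string with \xe0cc\xe9nts")

# Already-encoded bytes pass straight through.
writer.write(message.encode("utf-8"))

# Unicode text triggers the ASCII encode and fails on the accents.
try:
    writer.write(message)
    failed = False
except UnicodeEncodeError:
    failed = True
```

Encoding to UTF-8 before the text reaches the logger is the whole fix on Python 2. On Python 3 the str type is already text, so the encode step would not be needed there.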
#6
(2020-04-17, 15:37)ThomB Wrote: Are you sure you meant "decode" and not "encode"? Because I tried the following code:

Ah sorry, yes you're right - I meant encode to a byte string: print(u"s = {}".format(s).encode('utf-8')). I was getting things back to front because I didn't have my laptop in front of me to check.

Here is what I do to get things working in both Leia and Matrix.
I use a special log method that encodes to bytes to make the logging Py2/3 compatible. Then I run any string formatting through a decoding helper to make sure I get the proper decoding before passing to the log method (that's where I was misremembering the decode part).
python:

def kodi_log(logvalue):
    logvalue = u'{0}{1}'.format(_addonlogname, logvalue)
    if sys.version_info.major != 3:
        logvalue = logvalue.encode('utf-8', 'ignore')
    xbmc.log(logvalue, level=xbmc.LOGNOTICE)

def try_decode_string(string, encoding='utf-8'):
    try:
        return string.decode(encoding)
    except Exception:
        return string

kodi_log(u"s = {}".format(try_decode_string(s)))

It's ugly I guess but it works for most scenarios and largely avoids the cross version encoding difficulties.
#7
(2020-04-17, 16:29)Roman_V_M Wrote: From a technical POV, print in Kodi uses xbmc.log() under the hood, and xbmc.log() accepts only encoded byte strings, that is, the str type. If you pass a unicode string to it, Python 2 tries to encode it implicitly using the ASCII code table and fails for the reasons explained above. As far as supported types are concerned, the Kodi Python API is a bit messy, at least with Python 2. The situation is much better with Python 3 in Kodi "Matrix".

You can check my Kodistubs project that includes the information about what types Kodi Python API functions and methods expect and return: https://github.com/romanvm/Kodistubs
For example, xbmc.log(): https://github.com/romanvm/Kodistubs/blo...c.py#L1817 

That explains it all. Thank you very much, it is much clearer now!
I was already using Kodistubs but I didn't think of looking at the type expected by xbmc.log() in the docstring. I will check this more carefully next time.
 
(2020-04-18, 00:37)jurialmunkey Wrote: Here is what I do to get things working in both Leia and Matrix.
I use a special log method that encodes to bytes to make the logging Py2/3 compatible. Then I run any string formatting through a decoding helper to make sure I get the proper decoding before passing to the log method (that's where I was misremembering the decode part).
python:

def kodi_log(logvalue):
    logvalue = u'{0}{1}'.format(_addonlogname, logvalue)
    if sys.version_info.major != 3:
        logvalue = logvalue.encode('utf-8', 'ignore')
    xbmc.log(logvalue, level=xbmc.LOGNOTICE)

def try_decode_string(string, encoding='utf-8'):
    try:
        return string.decode(encoding)
    except Exception:
        return string

kodi_log(u"s = {}".format(try_decode_string(s)))

It's ugly I guess but it works for most scenarios and largely avoids the cross version encoding difficulties. 

Thank you for the snippet! It may be ugly, but it seems to work, and that's all that matters until we switch to Python 3 for good with Kodi Matrix.

I already had a wrapper for the Kodi log functions so I will implement a similar solution. This topic can now be closed.
#8
(2020-04-18, 00:37)jurialmunkey Wrote: It's ugly I guess but it works for most scenarios and largely avoids the cross version encoding difficulties.

Line #2 is a disaster waiting to happen if you pass a UTF-8 encoded binary string. And your entire code can be rewritten like this:

python:

import sys
import xbmc


def kodi_log(message):
    if isinstance(message, bytes):
        message = message.decode('utf-8')
    log_message = u'[addon.name] {}'.format(message)
    if sys.version_info < (3, 0):
        log_message = log_message.encode('utf-8')
    xbmc.log(log_message)

Of course, it may still fail if you pass a binary string that is not encoded in UTF-8 but in most cases it will work.
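
If byte strings in other encodings are a real possibility, one option is to decode with an error handler instead of letting the call raise. This is only a sketch and not part of the function above; to_text is a hypothetical helper name, and replacing bad bytes is a trade-off (the log line survives, but the damaged characters are lost).

```python
def to_text(value, encoding="utf-8"):
    """Return value as text; undecodable bytes become U+FFFD."""
    if isinstance(value, bytes):
        # "replace" substitutes the replacement character for bad bytes
        # instead of raising UnicodeDecodeError.
        return value.decode(encoding, "replace")
    return value


accented = u"\xe0cc\xe9nts"

good = to_text(accented.encode("utf-8"))     # valid UTF-8: decoded intact
bad = to_text(accented.encode("latin-1"))    # not valid UTF-8: no exception
same = to_text(accented)                     # already text: returned as-is
```

A helper like this could stand in for the isinstance/decode step at the top of kodi_log, assuming that losing a few characters in a log message is acceptable.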
#9
(2020-04-18, 22:48)Roman_V_M Wrote:
(2020-04-18, 00:37)jurialmunkey Wrote: It's ugly I guess but it works for most scenarios and largely avoids the cross version encoding difficulties.

Line #2 is a disaster waiting to happen if you pass a UTF-8 encoded binary string. And your entire code can be rewritten like this:

python:

import sys
import xbmc


def kodi_log(message):
    if isinstance(message, bytes):
        message = message.decode('utf-8')
    log_message = u'[addon.name] {}'.format(message)
    if sys.version_info <= (3, 0):
        log_message = log_message.encode('utf-8')
    xbmc.log(log_message)

Of course, it may still fail if you pass a binary string that is not encoded in UTF-8 but in most cases it will work.
Thanks for this code snippet!

I should have prefaced that I only ever send Unicode to the log. Most of the time I've converted to Unicode earlier anyway to manage py 2/3 compatibility and if I haven't then I run the string through the decode helper first. Your approach is better though because it avoids the need for that extra try decode step.
#10
(2020-04-19, 00:53)jurialmunkey Wrote: Thanks for this code snippet!

Note that I have fixed it. Of course, version comparison should be sys.version_info < (3, 0)
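
For anyone reading later, the check works because sys.version_info compares like a tuple, element by element. A small sketch (the PY2 name is just a common convention, not something from the snippet above):

```python
import sys

# sys.version_info compares like a tuple, e.g. (2, 7, 10, ...) < (3, 0).
PY2 = sys.version_info < (3, 0)

# Checking the major version alone gives the same answer.
also_py2 = sys.version_info[0] == 2

# Plain tuples follow the same element-by-element ordering.
leia_like = (2, 7, 10) < (3, 0)      # a Python 2.7 interpreter
matrix_like = (3, 8, 0) < (3, 0)     # a Python 3.8 interpreter
```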
#11
As a lurker, thanks for the discussion.  I've been modding a bunch of orphaned scripts for py3 and I think I about have the whole 2 vs 3 unicode thing sorted at this point (I know, famous last words).

scott s.