2022-03-28, 16:01
@fbacher thanks for the detailed post. Are you still working on this?
I don't know about Kodi's utf-8 hack for Turkish exactly, but I do notice Kodi changing reasonable locale and filesystem encoding settings. In my addon I was able to download files with non-ASCII names (from yle, not youtube) and display them correctly within Kodi. I will give a quick explanation of my approach here in becomes useful.
Overview:
- Extract filename (and download link) into python string (unicode code points).
- Set download directory, using `xbmcvfs.translatePath()` if required.
- Append filename (utf-8 encoded) to complete download path.
- Download file
- Open filename (utf-8 encoded) and write downloaded data as binary data.
- Pass filename (utf-8 encoded) and path to Kodi as list item.
Obviously this code is incomplete, and requires setting relevant `addon_id` and `_handle` and having a function to handle "local_file_playback" parameter to play the video.
In summary:
- Changes Kodi makes to Python locale are ignored by explicitly using utf-8 for filenames.
- Changes Kodi makes to Python filesystem encoding are ignored by writing the file as binary data. i.e. video file is not a text format so encoding is not an issue.
- Filenames and paths are passed to Kodi functions are utf-8 encoded bytes.
Extra notes:
- Python os functions accept either bytes or unicode strings: return type matches input type.
- Everything in the OS is bytes (eg. filenames), so use bytes as input to os functions.
- Not all bytes have valid unicode representations when decoded from utf-8.
- Since the underlying bytes can possibly be invalid, error handling is required..
The surrogateescape error mode represents the invalid utf-8 bytes as reserved unicode code points.
It is the only error mode in python that provides lossless handling of the invalid utf-8 bytes.
e.g. b"\xff" == b"\xff".decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")
- Many Kodi functions can accept either python unicode strings or utf-8 bytestrings, but they don't accept python unicode strings containing reserved code points (i.e. unprintable utf-8 decoded using surrogateescape error handler.)
I don't know about Kodi's utf-8 hack for Turkish exactly, but I do notice Kodi changing reasonable locale and filesystem encoding settings. In my addon I was able to download files with non-ASCII names (from yle, not youtube) and display them correctly within Kodi. I will give a quick explanation of my approach here in becomes useful.
Overview:
- Extract filename (and download link) into python string (unicode code points).
- Set download directory, using `xbmcvfs.translatePath()` if required.
- Append filename (utf-8 encoded) to complete download path.
- Download file
- Open filename (utf-8 encoded) and write downloaded data as binary data.
- Pass filename (utf-8 encoded) and path to Kodi as list item.
python:
from urllib.parse import urlencode
filename = "non-ääsci"
url = "example.xyz/video/12345"
# Download path
target_dir = xbmcvfs.translatePath("special://temp").encode("utf-8", "surrogateescape")
filepath = os.path.join(target_dir, filename.encode("utf-8", "surrogateescape"))
# Download remote file.
res = requests.get(url, stream=True)
# Open filename (utf-8 encoded) and write downloaded data as binary data.
with open(filepath, "ab") as file:
for chunk in res.iter_content(chunk_size=1024**2):
file.write(chunk)
# Create list item entry for downloaded file.
list_item = xbmcgui.ListItem(label=filename.encode("utf-8", "surrogateescape"))
# From memory it is best to give unicode to urlencode
attrs = {"local_file_playback": filepath.decode("utf-8", "surrogateescape")}
param_string = urlencode(attrs, encoding="utf-8", errors="surrogateescape")
callback_url = f"plugin://{addon_id}/?{param_string}"
listing = (callback_url, list_item, False)
Obviously this code is incomplete, and requires setting relevant `addon_id` and `_handle` and having a function to handle "local_file_playback" parameter to play the video.
In summary:
- Changes Kodi makes to Python locale are ignored by explicitly using utf-8 for filenames.
- Changes Kodi makes to Python filesystem encoding are ignored by writing the file as binary data. i.e. video file is not a text format so encoding is not an issue.
- Filenames and paths are passed to Kodi functions are utf-8 encoded bytes.
Extra notes:
- Python os functions accept either bytes or unicode strings: return type matches input type.
- Everything in the OS is bytes (eg. filenames), so use bytes as input to os functions.
- Not all bytes have valid unicode representations when decoded from utf-8.
- Since the underlying bytes can possibly be invalid, error handling is required..
The surrogateescape error mode represents the invalid utf-8 bytes as reserved unicode code points.
It is the only error mode in python that provides lossless handling of the invalid utf-8 bytes.
e.g. b"\xff" == b"\xff".decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")
- Many Kodi functions can accept either python unicode strings or utf-8 bytestrings, but they don't accept python unicode strings containing reserved code points (i.e. unprintable utf-8 decoded using surrogateescape error handler.)