Release [Module] youtube-dl - multi-site playable URL resolver
The "Turkic I" problem is deceptively simple but creates big problems. It is not the only problem, but it shows up in a pretty dramatic fashion. Here is a quick rundown, but just search for "Turkish I unicode problem" and you will get more than you wanted to know (but we should all read about it and similar problems, not to make everyone an expert, but to get an appreciation of the wild world of printed language).

Anyway, Turkic has 4 versions of the letter I: a dotted and a dotless one, each with an upper and a lower case version. The history is that the script was changed in the early 20th century by the great founder of modern Turkey, and he wanted the changes put in quickly. At the time both Arabic and Roman style writing were in use, which created much mayhem (at least in his eyes). Anyway, due to the speed with which the changes were enacted, not everything got completely thought through.

The big problem comes up when you want to lower case (or upper case) strings for a map index. I forget the fine details, but I think that if your locale is Turkic, then the lower case dotted i becomes the upper case dotted İ, and similarly the dotless ı becomes the dotless I. But in most other languages that use a Latin-like alphabet they do not transform this way: the lower case dotted i becomes the upper case dotless I. Further, when creating indexes you don't want the index value to depend on your locale. What you really want is a case fold, which ignores the less important stuff about a character (case, accents, etc.) and pays attention to the major stuff.
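
To make the mapping concrete, here is a small Python 3 sketch. Python's built-in str methods do not look at the locale, so they show the default Unicode behaviour. The turkic_lower and turkic_upper helpers are hypothetical names of my own (not from any library) that apply the Turkish pairing by hand, just to show the difference.

```python
# Sketch only: turkic_lower/turkic_upper are hypothetical helpers, not a library API.

DOTTED_UPPER = "\u0130"   # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_LOWER = "\u0131"  # ı  LATIN SMALL LETTER DOTLESS I


def turkic_lower(text):
    """Lower-case using the Turkish pairing: I -> ı and İ -> i."""
    return text.replace("I", DOTLESS_LOWER).replace(DOTTED_UPPER, "i").lower()


def turkic_upper(text):
    """Upper-case using the Turkish pairing: i -> İ and ı -> I."""
    return text.replace("i", DOTTED_UPPER).replace(DOTLESS_LOWER, "I").upper()


# Default (locale-independent) Python behaviour:
default_lower = "I".lower()            # plain "i"
default_upper = "i".upper()            # plain "I"
dotted_lowered = DOTTED_UPPER.lower()  # "i" followed by U+0307 COMBINING DOT ABOVE

# Turkish rules applied by hand:
turkish_title_lower = turkic_lower("TITLE")  # "tıtle", with a dotless ı
turkish_title_upper = turkic_upper("title")  # "TİTLE", with a dotted İ
```

The point for indexes: "TITLE" lower-cased under the default rules and under the Turkish rules gives two different keys, so a key built one way will not be found when looked up the other way.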

Enough of that. There are plenty of other pitfalls, a number of which are more important for correct text processing than for breaking programs.

A number of Unicode libraries ignore the "Turkic" and other problems. The ICU library (a bit of a gorilla, not trivial to learn, and slower) does handle a bunch of these problems (including much localization, regular expressions, etc.). Python 3.10 still has the Turkic I problem: its casefold method fails the Turkic test. I confirmed with Python support that it is broken, and they did not offer a timeline to fix it. I'm more worried about the similar problems that we don't know about. One can fairly easily get around a few exceptions, but it could turn into a nightmare. What happens when you have to upper case a German sharp-s? It turns into two upper case S characters, and case folding it turns it into two lower case s characters.
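
A quick way to see both cases for yourself in Python 3. This is only a sketch using the built-in str methods; the variable names are mine.

```python
# Python 3 built-in str methods only; no locale is involved.
sharp_s = "\u00df"  # ß  LATIN SMALL LETTER SHARP S

upper_result = sharp_s.upper()     # two characters: "SS"
fold_result = sharp_s.casefold()   # two characters: "ss"
lower_result = sharp_s.lower()     # unchanged: "ß"

# casefold() makes these two spellings compare equal, lower() does not.
same_when_folded = "Stra\u00dfe".casefold() == "STRASSE".casefold()
same_when_lowered = "Stra\u00dfe".lower() == "STRASSE".lower()

# The Turkic case: casefold() has no Turkic mode, so the dotted capital İ
# folds to "i" + U+0307 (combining dot above), not to a plain "i".
dotted_fold = "\u0130".casefold()
```

So a string can change length when you change its case, which is one more reason not to assume character-for-character mappings.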

THE KODI PATCH:

The patch made was simple and broad. Such drastic action was needed to fix a high-priority bug from Debian that was causing Kodi to be completely and utterly unusable when you switched to Turkic. The changes made were:

- The Language and Character set were forced to ASCII at startup
- The Character set was relaxed to utf-8 at run-time
- The Locale was changed, except for those with the Turkic-i problem

As a side-effect, Python got impacted in several unanticipated ways:
- The default encoding for filenames was ASCII
- I forget if xbmc.getLanguage is broken as well.

It is not easy to properly fix these Python problems, since it is the low-level default encoding that is frozen once Kodi starts the interpreter. Without a Kodi change we can't get the default filename encoding switched back to UTF-8 (maybe I should consider this as an interim suggestion).
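
If you want to check what your own addon actually sees, something like the following should do. It is not part of the Kodi patch, just a minimal diagnostic sketch using the standard sys and locale modules; encoding_report is a made-up name.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python settled on at interpreter start-up."""
    return {
        "filesystem": sys.getfilesystemencoding(),        # used for file names
        "default": sys.getdefaultencoding(),              # str <-> bytes default
        "preferred": locale.getpreferredencoding(False),  # default for text files
        "locale": locale.setlocale(locale.LC_CTYPE),      # query only, no change
    }


if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("{}: {}".format(name, value))
```

Inside Kodi you would send the values to the log instead of printing them. If the filesystem entry is not utf-8, that is the frozen setting described above.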

To address most of the issues in my own code (as you discovered) I have to specify UTF-8 for both filename and content:

  import io
  with io.open(path.encode('utf-8'), mode='rt', newline=None,
               encoding='utf-8') as cacheFile:
      ...  # work with cacheFile here
 
The problem gets worse when you depend upon other plugins or, in the case of youtube-dl or yt-dlp, third-party libs. It was when I was frantically trying to replace youtube-dl with yt-dlp that it really hit me. I put in a number of workarounds and was planning on asking yt-dlp to fix some problems (they have code which sets the file and filename encoding, but it is not 100% and normally not needed). I'm also sure they were hitting the problems with to-lower (or similar). I just saw an article that mentions some issue with JSON, hopefully not a big one.
 
I'm hoping that operating systems like (old?) Windows, which support upper and lower case filenames but internally fold them to one case, don't present a nightmare. Probably not, since something must already be in place to handle it now...
 
Anyway, I want to help out with the fundamental issues in Kodi before I do much more work on several addons, since they are greatly impacted by this.
 
Have fun!