Reversed punctuations in RTL language
#1
Hi

There is a problem with Hebrew subtitles in the latest Matrix nightly (KodiSetup-20200612-3c5b7694-master-x64). The punctuation is reversed.
I am not sure which version introduced this issue; an older HDR build I tried previously works fine, and it also worked fine in Leia.
The Language setting is on Hebrew (Windows). I also tried Hebrew (ISO) and it made no difference.

Here is a screenshot for comparison. Incorrect version on top.

Image
#2
Sadly, this problem has now found its way into the released Matrix version.
Practically all published Hebrew subtitle files will now display wrong for users who used to enjoy them up until now.
The root reason is that the convention for years has been codepage 1255 (Windows Hebrew), with "LTR assumption". (yes, LTR, not RTL).
Prior releases seemed to recognize this, and display the subtitle punctuation as expected. (period on the left of the sentence, etc.)
Perhaps in retrospect this was not a great practice, but this is the de-facto standard.
I seem to remember older Kodi releases even had two "encoding" options: "Windows Hebrew (RTL)" and "Windows Hebrew (LTR)".
Maybe I am remembering wrong, and this was another player altogether, but it is still a good idea to have this, to keep compatibility.
#3
There is an open issue on GitHub for this: https://github.com/xbmc/xbmc/issues/22398 (found via https://github.com/xbmc/xbmc/issues?q=hebrew)
#4
We did a massive rework of text rendering in Kodi, which now uses robust open source libraries such as libass and harfbuzz. In doing that I tried to test as much as I could, though I am a native English speaker. I now see that subtitle specs don't provide for setting a "base direction" for text, which strict compliance with internet/Unicode bidi handling expects. So subtitle editors have developed the de-facto workaround of placing beginning/ending punctuation reversed within subtitle text lines (the proper fix would have been to place the Unicode RTL codepoint in the text instead, but it is what it is).
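
To illustrate the "RTL codepoint" idea, here is a minimal Python sketch, not Kodi code. It assumes the codepoint meant is U+200F RIGHT-TO-LEFT MARK, and `mark_rtl_line` is a hypothetical helper name. Whether a given renderer honours the mark depends on how it applies the bidi algorithm.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK (assumed to be the codepoint meant above)


def has_rtl_letters(text):
    """True if any character has a strong right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def mark_rtl_line(line):
    """Hypothetical helper: wrap an RTL line in RLM marks.

    The leading mark gives first-strong detection an RTL character to
    find; the trailing mark puts final punctuation between two RTL
    characters so it resolves as RTL as well.
    """
    if not has_rtl_letters(line) or line.startswith(RLM):
        return line  # leave LTR-only or already-marked lines alone
    return RLM + line + RLM


SHALOM = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew word, written with escapes
marked = mark_rtl_line(SHALOM + ".")
print([hex(ord(ch)) for ch in marked])
```

Lines without any RTL letters pass through unchanged, so mixed-language files would not be affected by this sketch.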

I was away from home for a month and now just getting back so looking at the code is on my "to do" list for this.

Something I'm not sure about: I assume that with the existing workaround, text is going to default to being left-justified. I guess if it is set to be centered it won't matter, but I would think that if justified it would render ragged-right, while my understanding is that in RTL scripts it's expected to be ragged-left.

scott s.
.
#5
Any idea when we can download a fixed version (Win 64)? A nightly, maybe?
Would like to mention that v20 RTL works great with UTF-8 that comes from all the streaming services (embedded subs).
Hoping that the fix is only regarding .srt SubRip 1255 format and won't break the UTF-8.
#6
The problem with fixing this is that the subtitles in question are not properly authored, partly due to limitations of the srt text subtitle format spec and partly due to the design of VSfilter, which has been used as a performance baseline because of its widespread use.

As you know, cp 1255 was intended to encode Hebrew in logical order. So proper display of text (visual order) requires "reversing" the logical order (compared to LTR encodings). Unicode-based encodings such as UTF-8 do not have an encoding order; instead each Unicode codepoint is assigned a direction and the bidi algorithm determines how the text is displayed visually. The problem is that cp-1255 pre-dates Unicode. If text in cp-1255 is decoded to Unicode codepoints and the bidi algorithm is then applied, leading and trailing "neutral" codepoints will not be handled as intended. There can also be problems with "directional" symbols like ()[] <>. Apparently this is the issue with VSfilter, and fan subbers have worked around it by reversing the punctuation in cp-1255 encoded subs. Unfortunately this practice seems to have "leaked" into UTF-8 encoded subs as well.
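
A small Python sketch of what the decoder sees, using only the standard `cp1255` codec and `unicodedata`. The sample lines and the helper names are made up for illustration; this is not how Kodi or libass do it.

```python
import unicodedata

SHALOM = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew word, written with escapes


def decode_cp1255(raw):
    """Decode cp1255 bytes to Unicode; the stored (logical) order is kept."""
    return raw.decode("cp1255")


def bidi_classes(text):
    """Return the Unicode bidi class of each character, in stored order."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Per spec: punctuation stored at the logical end of the sentence.
spec_line = (SHALOM + "!").encode("cp1255")
# Fansub convention: punctuation stored first.
fansub_line = ("!" + SHALOM).encode("cp1255")

print(bidi_classes(decode_cp1255(spec_line)))    # ['R', 'R', 'R', 'R', 'ON']
print(bidi_classes(decode_cp1255(fansub_line)))  # ['ON', 'R', 'R', 'R', 'R']
```

The letters are strong RTL ("R") but the "!" is neutral ("ON"), so its position depends on the base direction the renderer assumes. With an LTR base, a trailing neutral ends up to the right of the Hebrew run, which is presumably why the convention stores it first; with an RTL base, that same stored-first "!" lands on the right, which matches the reversed punctuation reported in this thread.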

So the problem for Kodi (and really all subtitle rendering programs)  is to determine if a subtitle text should be handled per spec or the "convention" used.  Keep in mind, subrip isn't the only text subtitle format.  ASSA, SAMI, EIA 608/708, EBU Teletext, and WebVTT also have to be considered.  Just blindly treating all punctuation as following fansub convention is not a viable solution.  IMO moving to WebVTT, being managed by W3C, is the best way forward as WebVTT is designed to be compliant with WWW internationalization "best practices".
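
To make the "per spec or per convention" question concrete, here is a rough Python sketch of one possible per-file heuristic. It is purely an assumption for illustration: `count_punct_styles` and the punctuation set are hypothetical, and nothing here reflects what Kodi actually does or plans to do.

```python
import unicodedata

SENTENCE_PUNCT = ".,!?:;"  # assumed set; real subtitles use more symbols


def is_rtl_letter(ch):
    """True for characters with a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def count_punct_styles(lines):
    """Hypothetical heuristic: count RTL lines by where punctuation is stored.

    Returns (leading, trailing): lines whose stored text starts with
    punctuation directly followed by an RTL letter (fansub convention)
    versus lines that end with punctuation directly after an RTL letter
    (logical order, per spec).
    """
    leading = trailing = 0
    for line in lines:
        s = line.strip()
        if len(s) < 2 or not any(is_rtl_letter(c) for c in s):
            continue  # skip empty, one-character and LTR-only lines
        if s[0] in SENTENCE_PUNCT and is_rtl_letter(s[1]):
            leading += 1
        elif s[-1] in SENTENCE_PUNCT and is_rtl_letter(s[-2]):
            trailing += 1
    return leading, trailing


SHALOM = "\u05e9\u05dc\u05d5\u05dd"
print(count_punct_styles(["." + SHALOM, "!" + SHALOM, "Hello."]))
print(count_punct_styles([SHALOM + ".", SHALOM + "?"]))
```

A file where the first count dominates probably follows the fansub convention. A guess like this can be wrong for individual lines, which is the reason blind treatment is not viable.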

I'm assuming the same issues exist in other RTL languages (Arabic, Farsi) but that also needs to be reviewed.

scott s.
.