Bug: Kodi 17 RC3 (and earlier) messes up characters while installing from .zip
#16
OK, so it's "correct", but can it unzip files zipped by a windows system?

update:

I copied the zip file created by Windows 7 over to an Android tablet and viewed it using the system app "File Manager". This app provides an option to set the text encoding, and by default it selects UTF-8. In that encoding, the byte stored in the zip for the character U+00B0 ( ° ) is apparently invalid UTF-8, so the Android app shows it as U+FFFD ( � ).

So I switched the encoding to ISO-8859-1. Doing this I get the same results as seen in Kodi, namely it is displayed as U+00F8 ( ø ). I don't know how to look at a zip file using a hex editor to see what actually is set as the filename, but it appears that Windows does do something different from Android when it zips filenames.

Since Windows 7 itself and 7-Zip both unzip the filename correctly, I assume the problem is introduced by some Windows VB or VC function or Win32 API that is invoked during zip/unzip.

update2

Played around with some zip files in a hex editor. I see the filenames are uncompressed in these zips so I could look at what is happening. Results are strange/inconsistent (IMHO).

1. It appears the archiver built into Windows (File) Explorer can only handle filenames with characters in the 0x00-0xFF range. However, Windows doesn't encode these filenames using CP-1252, ISO-8859-1 or any other 8-bit encoding that I could identify. 7-Zip produces identical filenames in the zip file.

2. Filenames with characters U+0100 and above in the Unicode BMP cause Windows (File) Explorer to refuse to zip the files; instead it raises a popup stating that the filename characters couldn't be zipped. 7-Zip does zip these files and seems to use UTF-8 encoding, but this is only done correctly for characters U+0100 and above. Characters in the range 0x80-0xFF are not properly encoded into UTF-8.
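
For anyone who would rather not use a hex editor, the stored filename bytes can be read straight from the first local file header. This is only a rough Python sketch: the helper name first_entry_raw_name is made up for illustration, and the test archive is written by Python's zipfile module, which (unlike the Windows 7 tools discussed here) stores non-ASCII names as UTF-8 and sets flag bit 11.

```python
import io
import struct
import zipfile

# Local file header layout from the ZIP format (30 bytes, little-endian):
# signature, version, flags, method, mod time, mod date, CRC-32,
# compressed size, uncompressed size, filename length, extra field length.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")


def first_entry_raw_name(data: bytes):
    """Return (flag_bits, raw filename bytes) of the first local header.

    Hypothetical helper for illustration; it only looks at offset 0.
    """
    fields = LOCAL_HEADER.unpack_from(data, 0)
    signature, flags, name_len = fields[0], fields[2], fields[9]
    if signature != 0x04034B50:
        raise ValueError("data does not start with a local file header")
    start = LOCAL_HEADER.size
    return flags, data[start:start + name_len]


# Build a small test archive in memory with a degree sign in the name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("temp\u00b0.txt", "x")

flags, raw = first_entry_raw_name(buf.getvalue())
print(hex(flags), raw.hex(" "))
```

Running the same helper on the bytes of a zip made by Windows 7 should show whichever single byte that tool stored for the degree sign, with bit 11 (0x800) clear.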

scott s.
.
#17
well, it's not related to windows as such. 8-bit encodings are a horrible thing from the past. the same numbers mean different characters depending on the code page used. that's why unicode was invented in the first place.

according to the wikipedia page, windows compressed folder got unicode support in win8. the craptop i got my hands on has win7 so i cannot check if that's correct.

after more thorough inspection and reading the fine print, i realize that the whole of CP437 is supported, not just the lower 7 bits. so the files are not out-of-spec.
i have amended the PR with hard-coded CP437 for non-utf8 files. this makes the files created by these tools behave, at the cost of possibly breaking out-of-spec files created before 2007 (out of necessity - the file format was not specified). a fair compromise i think, and not really a problem.
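
The rule described above (UTF-8 when the entry says so, CP437 otherwise) can be sketched in a few lines of Python. This is not the code from the PR, just an illustration of the idea; the function name decode_zip_name and the constant UTF8_FLAG are made up for this example.

```python
UTF8_FLAG = 0x0800  # general purpose flag bit 11: name is UTF-8


def decode_zip_name(raw: bytes, flag_bits: int) -> str:
    """Decode a stored zip filename (hypothetical helper).

    Entries with bit 11 set are decoded as UTF-8; everything else is
    assumed to be CP437, as discussed in this thread.
    """
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode("cp437")


# The byte 0xF8 without the flag comes out as the degree sign.
print(decode_zip_name(b"temp\xf8.txt", 0))
# The UTF-8 form with the flag set gives the same name.
print(decode_zip_name("temp\u00b0.txt".encode("utf-8"), UTF8_FLAG))
```

The trade-off is the one already mentioned: an old archive whose names were written in some other 8-bit code page will be decoded as CP437 anyway.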
#18
if you hexedit you will see that it's stored as 248 / 0xF8. this was the cluestick that made me realize what was going on (see https://en.wikipedia.org/wiki/Code_page_437)
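
A quick way to see why that one byte explains everything reported in this thread is to decode 0xF8 three ways. A minimal Python sketch, using only the standard codecs (the variable names are just for illustration):

```python
raw = bytes([0xF8])  # the byte stored in the zip for the degree sign

as_cp437 = raw.decode("cp437")                     # what the zip tool meant
as_latin1 = raw.decode("iso-8859-1")               # what Kodi displayed
as_utf8 = raw.decode("utf-8", errors="replace")    # what the Android app showed

for label, text in (("CP437", as_cp437),
                    ("ISO-8859-1", as_latin1),
                    ("UTF-8", as_utf8)):
    print(f"{label:<11} U+{ord(text):04X} {text}")
```

CP437 gives U+00B0, ISO-8859-1 gives U+00F8, and strict UTF-8 rejects the byte, so a lenient decoder substitutes U+FFFD.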
#19
(2017-01-24, 00:16) ironic_monkey wrote: if you hexedit you will see that it's stored as 248 / 0xF8. this was the cluestick that made me realize what was going on (see https://en.wikipedia.org/wiki/Code_page_437)

OK I see now that CP-437 seems to work. (see my update 2 in prior post.)

The thing with 7-Zip is that it uses UTF-8 when the filename can't be 8-bit encoded (i.e., contains characters U+0100 and up), but still uses the Windows 8-bit encoding for other filenames. But I guess you can test for an illegal UTF-8 byte string and fall back to the CP-437 mapping.
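
The "test for illegal UTF-8 and fall back" idea could look roughly like this in Python. It is only a sketch of the heuristic, not what Kodi does, and the function name guess_zip_name is invented for the example.

```python
def guess_zip_name(raw: bytes) -> str:
    """Try strict UTF-8 first, fall back to CP437 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Python's cp437 codec maps every byte value, so this cannot fail.
        return raw.decode("cp437")


print(guess_zip_name(b"temp\xf8.txt"))               # invalid UTF-8, so CP437
print(guess_zip_name("\u0101.txt".encode("utf-8")))  # valid UTF-8
```

One caveat with guessing: some CP437 byte sequences also happen to be valid UTF-8, and those would be decoded as UTF-8 even if CP437 was intended. A per-entry flag, where present, avoids that ambiguity.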

Thanks for leading me through this. I'm not the OP, but this caught my eye, and once I could reproduce it, it had me scratching my head.

scott s.
.
#20
the flag is per-entry in the zip file so we should be covered assuming those chars between 0x7F and 0xFF are always cp437.
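
To illustrate the per-entry point, here is a small Python sketch that lists bit 11 for each entry of an archive. The archive is built in memory by Python's zipfile module, which sets the flag only for names that are not plain ASCII; the entry names and the utf8_flags dictionary are made up for the example.

```python
import io
import zipfile

UTF8_FLAG = 0x0800  # general purpose flag bit 11

# Build a test archive with one ASCII name and one non-ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", "a")
    zf.writestr("temp\u00b0.txt", "b")

# Read it back and record the UTF-8 flag of every entry.
with zipfile.ZipFile(buf) as zf:
    utf8_flags = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in zf.infolist()
    }

for name, is_utf8 in utf8_flags.items():
    print(name, "UTF-8 flag set" if is_utf8 else "no UTF-8 flag")
```

So a single archive can mix flagged and unflagged names, and each one has to be decoded according to its own flag.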

it sounds like you are hinged just like me. too damn curious for your own good (e.g. causing lack of sleep) at times ;)
#21
I see now that on a per-file basis it works as long as you assume CP-437 for non-UTF-8 files. I looked at a payware program, WinRAR, and it can be set to read zip files with various encodings when UTF-8 isn't specified (I assume it looks at flag bit 11), but there is no apparent way to force it to use UTF-8 when archiving.

scott s.
.
#22
First of all, thanks for looking into it, guys!!

I would like to ask whether there is currently a solution (a "packer" other than 7-Zip, or another OS?) that would allow Kodi to unzip correctly.

(I wasn't able to follow your discussion 100%, maybe due to my limited English or hex skills.)

Regards.
#23
Any news?

I tried using Unicode filenames in the zip file, but Kodi doesn't accept such zips and quits installing with an error message (no log entries found).