Cyrillic filenames in Python scraper
#1
Hi!
I am writing a Python scraper for Russian films, and it is almost complete. The last problem: when I get the movie filename from Kodi, the movie title arrives in the wrong encoding, something like this:
(Pdb) print(sys.argv[2])
?action=find&pathSettings=%7b%7d&title=%d0%9a%d0%be%d0%bd%d1%82%d0%b0%d0%ba%d1%82&year=2012
(Pdb) print(sys.argv[2].encode())
b'?action=find&pathSettings=%7b%7d&title=%d0%9a%d0%be%d0%bd%d1%82%d0%b0%d0%ba%d1%82&year=2012' 
As a result, I can't pass the correct search name to my scraper.
Can anyone help me solve this?
#2
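Everything was solved this way:

def decode_str(s: str) -> str:
    # Strip the '%25' prefixes, then decode the remaining hex digits as UTF-8.
    # This assumes the input consists solely of '%25XX' escapes.
    return bytes.fromhex(s.replace('%25', '')).decode(encoding='utf-8')

Topic closed.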
#3
@alanhk

I've moved you to another forum. Hopefully a developer will spot your post and comment.
Maybe @Roman_V_M or @pkscout might know
#4
Sorry, no help here.  Python string decoding still feels like something of a black art to me.
#5
The topic starter said that they resolved the problem with the code above, which is somewhat weird IMO. This is simple URL encoding (percent-encoding), and the standard library already decodes it:
>>> import urllib.parse
>>> urllib.parse.unquote('%d0%9a%d0%be%d0%bd%d1%82%d0%b0%d0%ba%d1%82')
'Контакт'
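
For the full argument string from the first post, a minimal sketch along the same lines could use `urllib.parse.parse_qsl`, which splits the query and percent-decodes every value as UTF-8 by default. The helper name `parse_plugin_args` is made up for illustration, and the input is the exact string shown in post #1, not something read from a live Kodi call:

```python
from urllib.parse import parse_qsl


def parse_plugin_args(query: str) -> dict:
    """Turn a '?key=value&...' string into a dict of decoded text values."""
    # Drop the leading '?' and let parse_qsl do the percent-decoding (UTF-8 by default).
    return dict(parse_qsl(query.lstrip('?')))


# The string printed by sys.argv[2] in the first post.
raw_args = ('?action=find&pathSettings=%7b%7d'
            '&title=%d0%9a%d0%be%d0%bd%d1%82%d0%b0%d0%ba%d1%82&year=2012')
params = parse_plugin_args(raw_args)
print(params['title'])  # Контакт
```

In a scraper, `sys.argv[2]` would presumably be passed in place of `raw_args`.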

@pkscout  There is nothing complex about encoding/decoding once you understand the difference between abstract text (unicode in Python 2 and str in Python 3) and its binary representation (str in Python 2 and bytes in Python 3). A text encoding is just a set of rules for converting text to and from its binary representation. An encode error happens when the text encoding you are using cannot represent a specific character as a sequence of bytes, and a decode error happens when the encoding cannot interpret some sequence of bytes. For example, ASCII (a fixed-length, 1-byte encoding) includes only English letters, digits, and some punctuation and control symbols. On the other hand, UTF-8 (a variable-length encoding using 1 to 4 bytes per character) can represent every character described in the Unicode standard.
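
To make the encode/decode error cases concrete, here is a small Python 3 sketch. The variable names are only illustrative; it uses the title from this thread to show one text, its UTF-8 bytes, an encode error with ASCII, and a decode error on a truncated byte sequence:

```python
text = 'Контакт'             # abstract text: 7 characters
data = text.encode('utf-8')  # binary representation: 2 bytes per Cyrillic letter

print(len(text), len(data))  # 7 14

# Encode error: ASCII has no byte sequence for Cyrillic characters.
try:
    text.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Decode error: a lone lead byte (b'\xd0') is not a complete UTF-8 sequence.
try:
    data[:1].decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding the complete byte sequence round-trips to the original text.
round_trip = data.decode('utf-8')
```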

rant mode on
When the Python core developers introduced the unicode type in Python 2.0 for storing abstract text, they IMO committed a cardinal sin by allowing the unicode and str types to be used in the same context through implicit decoding from ASCII, thus breaking Python's strong typing paradigm. Fortunately, this was finally fixed in Python 3.
rant mode off
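
As a rough illustration of the Python 3 behavior mentioned in the rant (a sketch with illustrative variable names only): mixing str and bytes raises a TypeError instead of decoding implicitly, so the conversion has to be spelled out.

```python
title = 'Контакт'           # str: abstract text
raw = title.encode('utf-8')  # bytes: its binary representation

# Python 3 refuses to mix the two types; nothing is decoded behind your back.
try:
    title + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion must be explicit, with the encoding named.
combined = title + raw.decode('utf-8')
```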

BTW, text encodings were used long before computers were invented. For example, Morse code was a variable-length binary encoding for transmitting text over telegraph wires.
#6
(2021-10-14, 10:33)Roman_V_M Wrote: The topic starter said that he resolved his/her problem with the mentioned code that is somewhat weird IMO. This is a simple URL encoding: [...]
Thank you very much !!!