2013-12-12, 17:06
Updated 24.03.2014
There are some changes to scraper processing in the upcoming Gotham release.
I'll summarize the changes in this thread.
1. The scraper XML file is always converted to UTF-8 before parsing. The actual scraper charset is read from the XML declaration.
2. XBMC detects the charset of downloaded data (HTML, XML...) and converts the data to UTF-8 before passing it to the scraper.
3. All scraper-generated XMLs are processed as UTF-8.
4. All regexps can now be in UTF-8 and can use Unicode Properties.
For 1, use a correct XML declaration, for example:
Code:
<?xml version="1.0" encoding="UTF-8"?>
or XBMC will spend some CPU time trying to detect the correct XML charset. Note: you can now save the scraper XML in any encoding, as long as the encoding is supported by XBMC (actually, by libiconv).
2 means that you don't need any workarounds or hacks to correctly process national (non-US-ASCII) characters.
For 4, a new expression attribute "utf8" was introduced, which can be "yes", "no" or "auto" ("auto" is the default value).
In UTF-8 mode all PCRE Unicode Properties are supported (see http://vcs.pcre.org/viewvc/code/tags/pcr...ml?view=co). If regexp matching is done in UTF-8 mode, then the regexp pattern and the text to be matched are checked for valid UTF-8 before matching. If an invalid UTF-8 sequence is found, matching is aborted with an error.
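To picture the validity check that UTF-8 mode performs on the pattern and the text before matching, here is a minimal Python sketch. It is only an illustration and not XBMC's actual code; the function name check_utf8 and the error message are made up for this example.

```python
def check_utf8(pattern: bytes, text: bytes) -> None:
    """Raise ValueError if the pattern or the text is not valid UTF-8.

    Hypothetical helper: it mimics "matching is aborted with an error",
    it is not taken from the XBMC sources.
    """
    for name, data in (("pattern", pattern), ("text", text)):
        try:
            data.decode("utf-8")  # strict decoding rejects invalid sequences
        except UnicodeDecodeError as err:
            raise ValueError(f"invalid UTF-8 in {name} at byte {err.start}") from err
```

A truncated sequence such as b"\xd0" in either argument makes the helper raise, which corresponds to the aborted match.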
Example of use of the "utf8" attribute:
Code:
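<!-- ASCII-only pattern: UTF-8 mode is not needed -->
<RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
    <expression utf8="no">Director: (.*),</expression>
</RegExp>
<!-- Cyrillic pattern: force UTF-8 mode -->
<RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
    <expression utf8="yes">Режиссёр: (.*),</expression>
</RegExp>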
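Why the mode matters can be shown with a small analogy in Python. This uses Python's re module rather than PCRE, and the sample page text is made up, so treat it as a sketch of the idea only: in character-oriented matching "." consumes the whole letter "ё", while in byte-oriented matching it consumes just one of its two UTF-8 bytes.

```python
import re

# Made-up sample of downloaded data, already converted to UTF-8.
PAGE = "Режиссёр: Иванов,".encode("utf-8")

# Character-oriented matching (analogous to utf8="yes"):
# "." matches the whole letter "ё".
as_text = re.search("Режисс.р: (.*),", PAGE.decode("utf-8"))

# Byte-oriented matching (analogous to utf8="no"):
# "." matches only the first of the two bytes of "ё", so the pattern fails.
as_bytes = re.search("Режисс.р: (.*),".encode("utf-8"), PAGE)

print(as_text.group(1))  # Иванов
print(as_bytes)          # None
```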
In "auto" mode the regexp pattern is checked for non-US-ASCII characters, Unicode Properties or character codes greater than 255 (like "\x{2000}"), and if any are found, UTF-8 mode is enabled.
In non-UTF-8 mode everything is processed as ASCII strings, and Unicode Properties are not available.
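The "auto" check can be approximated with the following Python sketch. It is a rough guess at the logic based only on the description above, not XBMC's implementation; the function name needs_utf8_mode is made up, and it assumes that "Unicode Properties" means the \p and \P escapes.

```python
def needs_utf8_mode(pattern: str) -> bool:
    """Guess whether a pattern would switch "auto" to UTF-8 mode (illustrative only)."""
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ord(ch) > 127:
            return True  # non-US-ASCII character in the pattern
        if ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt in "pP":
                return True  # Unicode property escape such as \p{L}
            if nxt == "x" and i + 2 < n and pattern[i + 2] == "{":
                end = pattern.find("}", i + 3)
                if end != -1:
                    try:
                        if int(pattern[i + 3:end], 16) > 255:
                            return True  # character code above 255
                    except ValueError:
                        pass  # not a hex number, ignore it
            i += 2  # skip the escaped character, so "\\p" is not a property
            continue
        i += 1
    return False
```

With this sketch, "Director: (.*)," stays in non-UTF-8 mode, while "Режиссёр: (.*),", "\p{Cyrillic}+" and "\x{2000}" all enable UTF-8 mode.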