Kodi Community Forum

What's new in Gotham for Scrapers
Updated 24.03.2014
There are some changes to scraper processing in the upcoming Gotham release.
I'll summarize the changes in this thread.

  1. The scraper XML file is always converted to UTF-8 before parsing. The actual scraper charset is read from the XML declaration.
  2. XBMC detects the charset of downloaded data (HTML, XML...) and converts the data to UTF-8 before passing it to the scraper.
  3. All scraper-generated XMLs are processed as UTF-8.
  4. All regexps can now be written in UTF-8 and can use Unicode Properties.
For 1, make sure that you have a proper XML header, like
Code:
<?xml version="1.0" encoding="UTF-8"?>
or XBMC will spend some CPU time trying to detect the correct XML charset. Note: you can now save the scraper XML in any encoding, as long as that encoding is supported by XBMC (actually, by libiconv).
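To illustrate the idea behind point 1, here is a minimal Python sketch of reading the charset from the XML declaration and re-encoding to UTF-8. It is not XBMC's code: the function name `scraper_xml_to_utf8` and the `fallback` parameter are made up for the example, and it only handles ASCII-compatible encodings (the declaration must be readable as plain bytes).

```python
import re

# Hypothetical helper for illustration only; XBMC's real logic is in C++ and uses libiconv.
# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
_DECL_RE = re.compile(rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def scraper_xml_to_utf8(raw: bytes, fallback: str = "utf-8") -> bytes:
    """Re-encode an XML document to UTF-8 using the charset named in its declaration."""
    match = _DECL_RE.match(raw)
    # Assumption: with no declared encoding we simply use the fallback (no guessing).
    declared = match.group(1).decode("ascii") if match else fallback
    text = raw.decode(declared)
    # Rewrite the declaration so that it agrees with the new encoding.
    text = re.sub(
        r'(<\?xml[^>]*?encoding\s*=\s*["\'])[^"\']+',
        r'\g<1>UTF-8',
        text,
        count=1,
    )
    return text.encode("utf-8")
```

For example, a scraper saved as windows-1251 with a matching declaration comes out as UTF-8 with `encoding="UTF-8"` in its header.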

Point 2 means that you no longer need any workarounds or hacks to correctly process national (non-US-ASCII) characters.

For 4, a new expression attribute "utf8" was introduced, which can be "yes", "no" or "auto" ("auto" is the default value).
Example of use:
Code:
<RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
    <expression utf8="no">Director: (.*),</expression>
</RegExp>
<RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
    <expression utf8="yes">Режиссёр: (.*),</expression>
</RegExp>
In UTF-8 mode all PCRE Unicode Properties are supported (see http://vcs.pcre.org/viewvc/code/tags/pcr...ml?view=co). If regexp matching is done in UTF-8 mode, then both the regexp pattern and the text to match are checked for valid UTF-8 before matching. (If an invalid UTF-8 sequence is found, matching is aborted with an error.)
In "auto" mode the regexp pattern is checked for non-US-ASCII characters, Unicode Properties or character codes greater than 255 (like "\x{2000}"), and if any are found, UTF-8 mode is enabled.
Outside UTF-8 mode everything is processed as ASCII strings, and Unicode Properties are not available.
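The "auto" check described above can be sketched roughly like this in Python. This is only an approximation for illustration, not the real XBMC implementation: the name `needs_utf8_mode` is made up, and I assume here that "Unicode Properties" means the `\p{..}` and `\P{..}` escapes.

```python
import re

# Assumption: Unicode Properties are spotted by their \p / \P escapes.
_UNICODE_PROP_RE = re.compile(r'\\[pP]')
# Character codes written as \x{HEX}
_HEX_ESCAPE_RE = re.compile(r'\\x\{([0-9A-Fa-f]+)\}')


def needs_utf8_mode(pattern: str) -> bool:
    """Return True if a regexp pattern would switch "auto" to UTF-8 mode."""
    # 1. Any non-US-ASCII character in the pattern itself.
    if any(ord(ch) > 127 for ch in pattern):
        return True
    # Drop escaped backslashes so that "\\p" (a literal backslash, then "p")
    # is not mistaken for a property escape.
    stripped = pattern.replace('\\\\', '')
    # 2. Unicode Properties.
    if _UNICODE_PROP_RE.search(stripped):
        return True
    # 3. Character codes greater than 255, such as \x{2000}.
    return any(int(code, 16) > 255 for code in _HEX_ESCAPE_RE.findall(stripped))
```

With this sketch, `Director: (.*),` stays in non-UTF-8 mode, while `Режиссёр: (.*),`, `\p{Lu}` and `\x{2000}` all enable UTF-8 mode.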
Forgot to say:
If you have any questions, feel free to ask them here.
As for the issue with the music library and scrapers: currently I have issues with artist names, where wrong names are listed on many artists. (What I've noticed is that if I have an artist "Duke Ellington" with 5 albums, and the last album listed is by Duke Ellington and Ella Fitzgerald, the next artist listed will be labelled "Duke Ellington and Ella Fitzgerald" regardless of who it is. Is this an issue others are seeing, and will it need to be addressed in the scraper tool or within the XBMC v13 build?)

Looking at the image below, you'll notice that the artist "Al Jarreau" has an album with track(s) featuring Kathleen Battle. The next artist to the right is Alex Bugnon, but he's labelled as "Al Jarreau/Kathleen Battle". I'm seeing this with quite a few artists.

Image
(2013-12-16, 23:36)hoopsdavis Wrote: As for the issue with the music library and scrapers: currently I have issues with artist names, where wrong names are listed on many artists.
It's not related to the scraper processing changes in Gotham.

If you have a problem with your scraper, open a separate thread in this forum, or use XBMC General Help and Support.
Hi @Karlson2k, I have a problem with the FilmAffinity scraper.

I think that points 1 and 2 are OK in the scraper (the first line is the same). I think the problem is the third step:

In the log I see that we have to search in iso-8859-1 or there are no matches, so I think that the XML parsed by XBMC is in iso-8859-1. How can we convert it to UTF-8?

Thanks
MaDDoGo,
As I mentioned in the first post, all XMLs are converted to UTF-8 before parsing. The XML can be in any encoding as long as the correct encoding is given in the XML header.
Even if the encoding is incorrect or missing, XBMC will try to find a suitable encoding, but the result is not guaranteed to be correct and costs time/resources.

Try the latest nightly; it contains some additional webserver charset detection logic. If that doesn't help, check trac for similar problems.
Hi,

I tried the next nightly and all the problems are gone. Sorry for the inconvenience, but I thought that the problems came from the scraper.

Thank you
Updated first post.
Changes: all scraper generated XMLs are always processed as UTF-8.
(2014-03-24, 15:30)Karlson2k Wrote: Updated first post.
Changes: all scraper generated XMLs are always processed as UTF-8.

Is there some reason this can't be extended to nfo files? In my testing (music videos), at least UTF-16LE encoding is ignored.

sacott s.
Starting from Gotham alpha 10, all XMLs (including .nfo) should be automatically converted to UTF-8 before processing.
Scraper-generated XMLs are forcibly processed as UTF-8.
Do you mean that your nfo files weren't converted? If your nfo file is valid XML (with the proper charset in its declaration), then it must be loaded correctly. Even if your file has no charset in its declaration but starts with "<?xml", XBMC should detect and use UTF-16 encoding.
Send me your nfo file and I'll try to reproduce it.
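To show what kind of detection is meant here, below is a small Python sketch based on the byte patterns that "<?xml" produces in UTF-16 (the same approach the XML specification describes for encoding autodetection). It is not the actual XBMC code, and the function name `sniff_xml_encoding` is invented for the example.

```python
import codecs
from typing import Optional


def sniff_xml_encoding(raw: bytes) -> Optional[str]:
    """Guess a Unicode encoding from the first bytes of an XML file.

    Returns a Python codec name, or None if the start of the file gives no hint.
    Simplification: UTF-32 is not considered in this sketch.
    """
    # A byte order mark, if present, settles the question.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM and picks the byte order
    # No BOM: "<?" encoded as UTF-16 has a recognizable zero-byte pattern.
    if raw.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if raw.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    return None
```

So a UTF-16LE nfo file without a BOM and without a declared encoding would still be recognized from its first four bytes, and `raw.decode(sniff_xml_encoding(raw))` gives back the text.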
(2014-03-24, 15:30)Karlson2k Wrote: Updated first post.
Changes: all scraper generated XMLs are always processed as UTF-8.
Unfortunately not. Using the FilmAffinity scraper on Windows 8.1 64 with XBMC Gotham v13.2beta2, a warning comes up:
Code:
CScraperUrl::Get: Can't find precise charset for HTML "http://www.filmaffinity.com/es/film109378.html", using "CP1252" as fallback
Is there any way the scraper or XBMC can be forced to process the scraped content as UTF-8?
Hi there, I'm not sure I'm in the right part of the forum to be asking this, but...
I'm fine with the scraping part. What I would like to know is this: when I scrape my 10 hard drives (each separated out for a different subject),
I want XBMC or Kodi to store the scraped covers on an SD card instead of on my unit, because they take up a lot of space.

I know you guys are busy and you have been doing an amazing job but i just thought i would ask on this.
Can't you simply make the thumbnails dir a link?
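A rough Python sketch of that idea, assuming your platform and filesystem support symbolic links: move the directory to the other storage and leave a link behind in its place. The helper name `relocate_directory` and any paths you pass in are hypothetical; check where your own installation keeps its thumbnails, and close XBMC/Kodi before trying anything like this.

```python
import os
import shutil


def relocate_directory(current: str, target: str) -> None:
    """Move `current` to `target` and leave a symlink at `current` pointing to it."""
    if os.path.islink(current):
        raise ValueError(f"{current} is already a link")
    if os.path.exists(target):
        # shutil.move would nest the directory inside an existing target.
        raise ValueError(f"{target} already exists")
    shutil.move(current, target)
    # On Windows, creating symlinks may require extra privileges.
    os.symlink(target, current, target_is_directory=True)
```

After this, anything that reads or writes the old location ends up on the new storage.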