What's new in Gotham for Scrapers

Karlson2k Offline
Team-Kodi Developer
Posts: 333
Joined: Oct 2010
Reputation: 9
Location: Moscow, Russia
Post: #1
Updated 24.03.2014
There are some changes to scraper processing in the upcoming Gotham release.
I'll summarize the changes in this thread.

  1. The scraper XML file is always converted to UTF-8 before parsing. The actual scraper charset is read from the XML declaration.
  2. XBMC detects the charset of downloaded data (HTML, XML...) and converts the data to UTF-8 before passing it to the scraper.
  3. All scraper-generated XMLs are processed as UTF-8.
  4. All regexps can now be in UTF-8 and can use Unicode Properties.
For 1, make sure that you have a proper XML header, like
Code:
<?xml version="1.0" encoding="UTF-8"?>
or XBMC will spend some CPU time trying to detect the correct XML charset. Note: you can now save the scraper XML in any encoding, as long as the encoding is supported by XBMC (actually, by libiconv).
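
To illustrate point 1, here is a minimal Python sketch of reading the encoding from the XML declaration and converting the file to UTF-8. It is not XBMC's actual code (XBMC relies on libiconv); the name xml_to_utf8 and the fallback parameter are made up for this example, and it only handles ASCII-compatible encodings, where the declaration itself can be read byte by byte.

```python
import re

# Hypothetical helper, not XBMC's real logic: find encoding="..." in the
# XML declaration at the very start of the file.
_DECL_RE = re.compile(rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def xml_to_utf8(raw: bytes, fallback: str = "utf-8") -> bytes:
    """Decode raw XML bytes using the declared encoding and return UTF-8 bytes.

    If no encoding is declared, the (assumed) fallback is used. An unknown
    encoding name raises LookupError, and bytes that do not match the
    encoding raise UnicodeDecodeError.
    """
    match = _DECL_RE.match(raw)
    declared = match.group(1).decode("ascii") if match else fallback
    text = raw.decode(declared)
    # The declaration text itself is left untouched; only the bytes change.
    return text.encode("utf-8")


source = '<?xml version="1.0" encoding="ISO-8859-1"?><title>Amélie</title>'
converted = xml_to_utf8(source.encode("latin-1"))
```

Here `converted` holds the same text as `source`, but as UTF-8 bytes. Without a declaration the sketch simply assumes the fallback, which is the cheap alternative to the charset guessing mentioned above.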

2 means that you don't need any workarounds/hacks to correctly process national (non-US-ASCII) characters.

For 4, a new expression attribute "utf8" was introduced, which can be "yes", "no" or "auto" ("auto" is the default value).
Example of use:
Code:
<RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
    <expression utf8="no">Director: (.*),</expression>
</RegExp>
<RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
    <expression utf8="yes">Режиссёр: (.*),</expression>
</RegExp>
In UTF-8 mode all PCRE Unicode Properties are supported (see http://vcs.pcre.org/viewvc/code/tags/pcr...ml?view=co). If regexp matching is done in UTF-8 mode, then the regexp pattern and the text to match are checked for valid UTF-8 before matching. (If an invalid UTF-8 sequence is found, matching is aborted with an error.)
In "auto" mode the regexp pattern is checked for non-US-ASCII characters, Unicode Properties or character codes greater than 255 (like "\x{2000}"), and if any are found, UTF-8 mode is enabled.
In non-UTF-8 mode everything is processed as ASCII strings, and Unicode Properties are not available.
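
The "auto" rule above can be approximated in a few lines of Python. This is only a sketch of the described behaviour, not the real XBMC implementation; needs_utf8_mode is a hypothetical name, and the property check is deliberately simplistic (for instance, it does not account for an escaped backslash in front of "p").

```python
import re

# Unicode property escapes such as \p{L}, \P{Lu}, \p{^Greek} or \pL.
_UNICODE_PROPERTY = re.compile(r"\\[pP](?:\{\^?\w+\}|[A-Za-z])")
# Character codes written as \x{...}.
_HEX_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def needs_utf8_mode(pattern: str) -> bool:
    """Return True if "auto" would switch this pattern to UTF-8 mode."""
    # 1. Any character outside US-ASCII in the pattern text.
    if any(ord(ch) > 127 for ch in pattern):
        return True
    # 2. Any Unicode property escape.
    if _UNICODE_PROPERTY.search(pattern):
        return True
    # 3. Any \x{...} character code greater than 255.
    return any(int(m.group(1), 16) > 255 for m in _HEX_ESCAPE.finditer(pattern))
```

With this sketch, the "Director: (.*)," pattern from the example stays in non-UTF-8 mode, while "Режиссёр: (.*),", a pattern using \p{L}, or one containing \x{2000} would enable UTF-8 mode.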
(This post was last modified: 2014-03-24 18:14 by Karlson2k.)
Karlson2k Offline
Post: #2
Forgot to say:
If you have any questions, feel free to ask them here.
hoopsdavis Offline
Fan
Posts: 590
Joined: Nov 2012
Reputation: 6
Location: Woodbridge, Va
Post: #3
As far as the issue with the music library and scrapers goes: currently I have issues with artist names; wrong names are listed for many artists. (What I've noticed is that if I have an artist "Duke Ellington" with 5 albums, and the last album listed is by Duke Ellington and Ella Fitzgerald, the next artist listed will be labelled Duke Ellington and Ella Fitzgerald regardless of who it is. Is this an issue others are seeing, and will it need to be addressed in the scraper tool or within the XBMC v13 build?)

Looking at the image below, you'll notice that the artist "Al Jarreau" has an album with track(s) featuring Kathleen Battle. The next artist to the right is Alex Bugnon, but he's labelled as "Al Jarreau/Kathleen Battle". I'm seeing this with quite a few artists.

[Image: 11qsl6q.png]

(This post was last modified: 2013-12-16 23:49 by hoopsdavis.)
Karlson2k Offline
Post: #4
(2013-12-16 23:36)hoopsdavis Wrote:  As far as the issue with the music library and scrapers goes: currently I have issues with artist names; wrong names are listed for many artists.
It's not related to the scraper processing changes in Gotham.

If you have a problem with your scraper, open a separate thread in this forum, or use XBMC General Help and Support.
(This post was last modified: 2013-12-17 09:29 by Karlson2k.)
MaDDoGo Offline
Senior Member
Posts: 248
Joined: Sep 2009
Reputation: 1
Location: Sabadell (Barcelona)
Post: #5
Hi @Karlson2k, I have a problem with the filmaffinity scraper.

I think that points 1 and 2 are OK in the scraper (the first line is the same). I think the problem is the third step:

In the log I see that we have to search in iso-8859-1 or there are no matches, so I think that the XML parsed by XBMC is in iso-8859-1. How can we convert it to UTF-8?

Thanks

Karlson2k Offline
Post: #6
MaDDoGo,
As I mentioned in the first post, all XMLs are converted to UTF-8 before parsing. The XML can be in any encoding as long as the correct encoding is given in the XML header.
Even if the encoding is incorrect or missing, XBMC will try to find a suitable encoding, but the result is not guaranteed to be correct and the detection costs time/resources.

Try the latest nightly; it contains some additional webserver charset detection logic. If that doesn't help, check trac for similar problems.
(This post was last modified: 2014-01-11 19:08 by Karlson2k.)
MaDDoGo Offline
Post: #7
Hi,

I tried the next nightly and all the problems are gone. Sorry for the inconvenience, but I thought the problems came from the scraper.

Thank you

Karlson2k Offline
Post: #8
Updated first post.
Changes: all scraper generated XMLs are always processed as UTF-8.
scott967 Offline
Posting Freak
Posts: 2,890
Joined: Jul 2012
Reputation: 91
Post: #9
(2014-03-24 15:30)Karlson2k Wrote:  Updated first post.
Changes: all scraper generated XMLs are always processed as UTF-8.

Is there some reason this can't be extended to nfo files? At least UTF-16LE encoding is ignored in my testing (music videos).

sacott s.
Karlson2k Offline
Post: #10
Starting from Gotham alpha 10, all XMLs (including .nfo) should be automatically converted to UTF-8 before processing.
Scraper-generated XMLs are forcibly processed as UTF-8.
Do you mean that your nfo files weren't converted? If your nfo file is valid XML (with the proper charset in the declaration) then it must be loaded correctly. Even if your file doesn't have a "charset" in the declaration but starts with "<?xml", XBMC should detect and use the UTF-16 encoding.
Send me your nfo file and I'll try to reproduce.
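
For reference, UTF-16 can be recognised in a file that starts with "<?xml" in the way the XML 1.0 specification suggests (Appendix F): look for a byte order mark, or look at how the first characters "<?" are laid out in bytes. Below is a minimal Python sketch of that idea. It is an assumption about the general approach, not XBMC's code; guess_xml_encoding is a hypothetical name, and UTF-32 is ignored to keep it short.

```python
import codecs


def guess_xml_encoding(raw: bytes) -> str:
    """Guess a Python codec name for XML bytes that declare no encoding."""
    # Byte order marks take priority over everything else.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"  # this codec reads the BOM and picks the byte order
    # No BOM: in UTF-16 every ASCII character of "<?" is padded with a zero byte.
    if raw.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if raw.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    # Otherwise assume UTF-8 (an assumption of this sketch).
    return "utf-8"


nfo_text = '<?xml version="1.0"?><musicvideo><title>Test</title></musicvideo>'
nfo_bytes = nfo_text.encode("utf-16-le")  # UTF-16LE file without a BOM
decoded = nfo_bytes.decode(guess_xml_encoding(nfo_bytes))
```

In this example `decoded` equals `nfo_text` again, even though the bytes carry neither a BOM nor an encoding in the declaration.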
pancheto Offline
Junior Member
Posts: 38
Joined: Nov 2011
Reputation: 0
Location: Santiago de Compostela
Post: #11
(2014-03-24 15:30)Karlson2k Wrote:  Updated first post.
Changes: all scraper generated XMLs are always processed as UTF-8.
Unfortunately not. Using the FilmAffinity scraper on Windows 8.1 64-bit with XBMC Gotham v13.2beta2, a warning comes out:
Code:
CScraperUrl::Get: Can't find precise charset for HTML "http://www.filmaffinity.com/es/film109378.html", using "CP1252" as fallback
Is there any way the scraper or XBMC can be forced to process the scraped content as UTF-8?
joemcisaac Offline
Junior Member
Posts: 2
Joined: Apr 2012
Reputation: 0
Post: #12
Hi there, I'm not sure I have the right part of the forum to be asking this, but...
I'm good with the scraping part, but what I would like to know is this: when I scrape my 10 hard drives, all separated out by subject, I want to tell XBMC or Kodi to save the covers to an SD card instead of my unit, because they take up a lot of space.

I know you guys are busy and you have been doing an amazing job, but I just thought I would ask about this.
(This post was last modified: 2015-04-01 00:59 by joemcisaac.)
ironic_monkey Offline
Posting Freak
Posts: 1,494
Joined: Nov 2013
Reputation: 71
Post: #13
Can't you simply make the thumbnails dir a link?