themoviedb.org scraper - German Umlaute
#1
Hi there,

(I'm using XBMC Eden on a Mac mini running OS X 10.7. The libraries are on local disks formatted as exFAT.)
I ran into several problems with German umlauts (ö, ü, ä) in movie titles.

For example:
For 'Der Herr der Ringe - Die Gefährten-2001' (Lord of the Rings) it handles the 'ä' quite well, but for 'Der Herr der Ringe - Die zwei Türme-2002' it fails at the 'ü'.

After reading the log file I found the problem. The scraper tried to find the movie with this URL:
http://api.themoviedb.org/3/search/movie...20zwei%20tu%cc%88rme+2002&language=de
(ü = u%cc%88)

But it should use this syntax:
http://api.themoviedb.org/3/search/movie...20zwei%20t%c3%bcrme+2002&language=de
(ü = %c3%bc)
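
To see what the two spellings actually contain, here is a minimal Python sketch (not XBMC code; the helper name `describe` is made up for illustration). It percent-decodes both fragments as UTF-8 and names each code point:

```python
import unicodedata
from urllib.parse import unquote


def describe(fragment):
    """Percent-decode a URL fragment as UTF-8 and name each code point."""
    text = unquote(fragment, encoding="utf-8")
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text]


# The two spellings of "tü" taken from the URLs above.
decomposed = describe("tu%cc%88")
precomposed = describe("t%c3%bc")

for line in decomposed:
    print(line)
print("---")
for line in precomposed:
    print(line)
```

The first fragment decodes to three code points (t, u, and a combining diaeresis), the second to two (t and a precomposed ü).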

I tried this with different charsets (default, Western Europe (Win), Western Europe (ISO)).

Below are some excerpts copied from the log. The whole debug log can be found here.

Code:
22:10:37 T:2953850880   DEBUG: VideoInfoScanner: Scanning dir '/Volumes/exLibrary/Deutsche Filme/Der Herr der Ringe - Die Gefährten-2001/' as not in the database
22:10:37 T:2953850880   DEBUG: VideoInfoScanner: No NFO file found. Using title search for '/Volumes/exLibrary/Deutsche Filme/Der Herr der Ringe - Die Gefährten-2001/Der Herr der Ringe - Die Gefährten-2001.mkv'
22:10:37 T:2953850880   DEBUG: FindMovie: Searching for 'Der Herr der Ringe - Die Gefährten' using The MovieDB scraper (path: '/Applications/XBMC.app/Contents/Resources/XBMC/addons/metadata.themoviedb.org', content: 'movies', version: '3.0.9')
22:10:37 T:2953850880   DEBUG: scraper: CreateSearchUrl returned <url>http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=der%20herr%20der%20ringe%20%2d%20die%20gefa%cc%88hrten+2001&language=de</url>
22:10:37 T:2953850880   DEBUG: FileCurl::Open(0xb01025e4) http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=der%20herr%20der%20ringe%20%2d%20die%20gefa%cc%88hrten+2001&language=de
22:10:37 T:2953850880   DEBUG: scraper: GetSearchResults returned <results></results>
22:10:37 T:2953850880   DEBUG: FindMovie: Searching for 'Der Herr der Ringe   Die Gefährten' using The MovieDB scraper (path: '/Applications/XBMC.app/Contents/Resources/XBMC/addons/metadata.themoviedb.org', content: 'movies', version: '3.0.9')
22:10:37 T:2953850880   DEBUG: scraper: CreateSearchUrl returned <url>http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=der%20herr%20der%20ringe%20%20%20die%20gefa%cc%88hrten+2001&language=de</url>
22:10:37 T:2953850880   DEBUG: FileCurl::Open(0xb01025e4) http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=der%20herr%20der%20ringe%20%20%20die%20gefa%cc%88hrten+2001&language=de
22:10:38 T:2953850880   DEBUG: scraper: GetSearchResults returned <results><entity><title>Der Herr der Ringe - Die Gefährten</title><id>120</id><year>2001</year><url cache="tmdb-de-120.json">http://api.themoviedb.org/3/movie/120?api_key=57983e31fb435df4df77afb854740ea9&language=de</url></entity><entity><title>The Lord of the Rings: The Fellowship of the Ring</title><id>120</id><year>2001</year><url cache="tmdb-de-120.json">http://api.themoviedb.org/3/movie/120?api_key=57983e31fb435df4df77afb854740ea9&language=de</url></entity></results>
22:10:38 T:2953850880   DEBUG: GetVideoDetails: Reading movie 'http://api.themoviedb.org/3/movie/120?api_key=57983e31fb435df4df77afb854740ea9&language=de' using The MovieDB scraper (file: '/Applications/XBMC.app/Contents/Resources/XBMC/addons/metadata.themoviedb.org', content: 'movies', version: '3.0.9')
22:10:38 T:2953850880   DEBUG: FileCurl::Open(0xb0102458) http://api.themoviedb.org/3/movie/120?api_key=57983e31fb435df4df77afb854740ea9&language=de
22:10:38 T:2953850880   DEBUG: scraper: GetDetails returned <details><id>tt0120737</id><title>Der Herr der Ringe - Die Gefährten</title><originaltitle>The Lord of the Rings: The Fellowship of the Ring</originaltitle><year>2001</year><runtime>178</runtime><url function="ParseFallbackTMDBTagline" cache="tmdb-en-120.json">http://api.themoviedb.org/3/movie/120?api_key=57983e31fb435df4df77afb854740ea9&language=en</url><studio>New Line Cinema</studio><country>New Zealand</country><chain function="GetIMDBRatingById">tt0120737</chain><chain function="GetTMDBDirectorsByIdChain">120</chain><chain function="GetTMDBWitersByIdChain">120</chain><chain function="GetTMDBCertificationsByIdChain">120</chain><chain function="GetTMDBSetByIdChain">120</chain><chain function="GetTMDBPlotByIdChain">120</chain><chain function="GetTMDBCastByIdChain">120</chain><chain function="GetTMDBGenresByIdChain">120</chain><chain function="GetTMDBThumbsByIdChain">120</chain><chain function="GetTMDBFanartByIdChain">120</chain><chain function="GetTMDBTrailerByIdChain">120</chain></details>
22:10:38 T:2953850880   DEBUG: FileCurl::Open(0xb0102458) http://api.themoviedb.org/3/movie/120?api_key=57983e31fb435df4df77afb854740ea9&language=en
22:10:38 T:2953850880   DEBUG: scraper: ParseFallbackTMDBTagline returned <details><tagline>One ring to rule them all</tagline></details>
22:10:38 T:2953850880   DEBUG: scraper: GetIMDBRatingById returned <details><url cache="tt0120737-main.html" function="ParseIMDBRating">http://akas.imdb.com/title/tt0120737/</url></details>
22:10:38 T:2953850880   DEBUG: FileCurl::Open(0xb0102458) http://akas.imdb.com/title/tt0120737/


22:12:21 T:2953850880   DEBUG: VideoInfoScanner: Scanning dir '/Volumes/exLibrary/Deutsche Filme/Der Herr der Ringe - Die zwei Türme-2002/' as not in the database
22:12:21 T:2953850880   DEBUG: VideoInfoScanner: No NFO file found. Using title search for '/Volumes/exLibrary/Deutsche Filme/Der Herr der Ringe - Die zwei Türme-2002/Der Herr der Ringe - Die zwei Türme-2002.mkv'
22:12:21 T:2953850880   DEBUG: FindMovie: Searching for 'Der Herr der Ringe - Die zwei Türme' using The MovieDB scraper (path: '/Applications/XBMC.app/Contents/Resources/XBMC/addons/metadata.themoviedb.org', content: 'movies', version: '3.0.9')
22:12:21 T:2953850880   DEBUG: scraper: CreateSearchUrl returned <url>http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=der%20herr%20der%20ringe%20%2d%20die%20zwei%20tu%cc%88rme+2002&language=de</url>
22:12:21 T:2953850880   DEBUG: FileCurl::Open(0xb0102814) http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=der%20herr%20der%20ringe%20%2d%20die%20zwei%20tu%cc%88rme+2002&language=de
22:12:22 T:2953850880   DEBUG: scraper: GetSearchResults returned <results></results>
22:12:22 T:2953850880   DEBUG: FindMovie: Searching for 'Der Herr der Ringe   Die zwei Türme' using The MovieDB scraper (path: '/Applications/XBMC.app/Contents/Resources/XBMC/addons/metadata.themoviedb.org', content: 'movies', version: '3.0.9')
22:12:22 T:2953850880   DEBUG: scraper: CreateSearchUrl returned <url>http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=der%20herr%20der%20ringe%20%20%20die%20zwei%20tu%cc%88rme+2002&language=de</url>
22:12:22 T:2953850880   DEBUG: FileCurl::Open(0xb0102814) http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=der%20herr%20der%20ringe%20%20%20die%20zwei%20tu%cc%88rme+2002&language=de
22:12:22 T:2953850880   DEBUG: scraper: GetSearchResults returned <results></results>
22:12:22 T:2953850880   DEBUG: VideoInfoScanner: No (new) information was found in dir /Volumes/exLibrary/Deutsche Filme/Der Herr der Ringe - Die zwei Türme-2002/

edit: same problem under RC1
#2
Hmz, works for me. Tested on Linux.

EDIT: The 'ü' in your post looks fishy. At least it did when I used it to create the dummy file. I had to replace it with a "sane" char.
Always read the online manual (wiki), FAQ (wiki) and search the forum before posting.
Do not PM or e-mail Team-Kodi members directly asking for support. Read/follow the forum rules (wiki).
Please read the pages on troubleshooting (wiki) and bug reporting (wiki) before reporting issues.
#3
Thanks for your hint.

I found this:

OS X stored u0xCC88 (the letter u followed by the bytes 0xCC 0x88) instead of 0xC3BC. Note that the u is not a typo. OS X stores the ü differently. The encoding we used is the Unicode code point U+00FC, which is ü (0xC3 0xBC in UTF-8). OS X instead stores the u and the two little dots (the combining diaeresis, U+0308) as separate characters, taking up 3 bytes instead of 2.

This is called normalization. Unicode defines several normalization forms that dictate how such character combinations are stored; the decomposed spelling is form NFD and the precomposed one is form NFC. So even though they are different byte sequences and different code points, they are still considered canonically equivalent.
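
As a rough illustration of why normalization matters for the search URL, the following Python sketch composes a title to NFC before percent-encoding it. This is not how the XBMC scraper builds its URLs; `build_query` is a hypothetical helper, and the lowercasing only imitates the queries seen in the log.

```python
import unicodedata
from urllib.parse import quote


def build_query(title):
    """Return a percent-encoded query for a title, composed to NFC first."""
    composed = unicodedata.normalize("NFC", title)
    return quote(composed.lower(), safe="")


nfd_title = "Die zwei Tu\u0308rme"  # decomposed: u + combining diaeresis
nfc_title = "Die zwei T\u00fcrme"   # precomposed: single code point U+00FC

print(quote(nfd_title.lower(), safe=""))  # encoding without normalization
print(build_query(nfd_title))             # encoding after NFC
print(build_query(nfd_title) == build_query(nfc_title))
```

Without the normalization step the decomposed title is encoded with `%CC%88`; with it, both spellings end up as the same `%C3%BC` query.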
#4
XBMC already takes that into account on OSX builds - see CharsetConverter.h/cpp. What exactly is your filesystem returning (via opendir or similar)?
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


#5
opendir returns the 3-byte version (u0xCC88).
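
One way to check this outside XBMC is a small Python sketch like the one below. It is only a diagnostic idea, not the check I actually ran; `find_decomposed` is a made-up name, and it takes a plain list of names so it can be tried without touching the disk.

```python
import unicodedata


def find_decomposed(names):
    """Return (name, UTF-8 bytes as hex) for every name that is not in NFC."""
    report = []
    for name in names:
        # A name that changes under NFC composition contains decomposed characters.
        if unicodedata.normalize("NFC", name) != name:
            report.append((name, name.encode("utf-8").hex()))
    return report


sample = ["T\u00fcrme", "Tu\u0308rme", "Ringe"]
result = find_decomposed(sample)
for name, raw in result:
    print(name, raw)
```

On a real library the list could come from `os.listdir('/Volumes/exLibrary/Deutsche Filme')`; any name that shows up in the report contains the `cc88` sequence or another decomposed character.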

After the scraper fails, I can browse to the movie folder and open the "Movie Information" dialog, which shows an editable movie name. I don't have to change the name; I can just hit enter and it finds the right movie. After testing some movie names, I figured out that the name in that dialog is encoded as u0xCC88 too. That means the charset converter works well in this dialog, because it generates the right API URL.

So what is the difference between the manually entered movie name and the automated one (both are encoded as u0xCC88)?
#6
You can play the files fine, right, and they list fine in the XBMC UI? If so, the charset converter is doing its job for display purposes at least.

The debug logs must differ between movie info lookup and scraper lookups - mind fetching that out?
#7
This is the URL from the 'automated' scraper:
http://api.themoviedb.org/3/search/movie...wilden%20hu%cc%88hner+2006&language=de

And this is the one the dialog generates:
http://api.themoviedb.org/3/search/movie...wilden%20h%c3%bchner%202006&language=de
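
For comparison, a short Python sketch (an illustration only, with the hypothetical helper `canonical`) decodes the visible tail of both queries. It assumes the server treats `+` as a space, as form-encoded queries usually do, so the remaining difference is the encoding of the ü:

```python
import unicodedata
from urllib.parse import unquote_plus


def canonical(fragment):
    """Decode a query fragment ('+' as space) and compose it to NFC."""
    return unicodedata.normalize("NFC", unquote_plus(fragment, encoding="utf-8"))


scanner_query = "wilden%20hu%cc%88hner+2006"  # tail of the automated scan URL
dialog_query = "wilden%20h%c3%bchner%202006"  # tail of the dialog URL

# Raw comparison of the decoded strings, then comparison after NFC composition.
print(unquote_plus(scanner_query) == unquote_plus(dialog_query))
print(canonical(scanner_query) == canonical(dialog_query))
```

The decoded strings differ as raw code point sequences but match once both are composed to NFC.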

Any clue?

Debug log from the "Scan for new content":

edit: Yes, I can play the files and they are listed correctly in the UI.
#8
Any news?
#9
unclemaxx Wrote: Any news?

NASA found oxygen on one of Saturn's moons. The L.A. Clippers beat the Houston Rockets. Rock guitarist Ronnie Montrose died today at the age of 64. Skies will be sunny today with no chance of rain.
#10
(2012-03-05, 16:00)unclemaxx Wrote: Any news?
unclemaxx, did you solve this issue?