themoviedb.org scraper unable to scrape movies with non-ASCII chars in file name
#1
In the latest nightlies (for at least a week now) the themoviedb scraper isn't able to scrape any movie that has non-ASCII chars in the file/folder name. This wasn't an issue before.

Last week it was causing a "No connection" pop-up (as when the scraper can't connect to the server) and the library update was paused. This week the library update no longer pauses, but for those special files no results are found.

UPDATE: The issue seems to be Windows-specific, since Linux clients can update the library from the same Samba share correctly.

Reading other threads and trac issues (I haven't found anything relevant for this particular scraper), I think it might be related to the upgrade of the PCRE lib and the requirement to specify correct encoding information in XML files or web responses.

Here is a debug log of an XBMC run trying to update the library for a bunch of movies with non-ASCII chars in the folder/file name.

http://xbmclogs.com/show.php?id=96038

Apparently someone somewhere is doing a lower-case conversion of the file name, and the conversion is incorrectly changing the UTF-8 bytes into other, invalid ones.
Taking one example from that log:

Code:
16:45:56 T:1788   DEBUG: ADDON::CScraper::FindMovie: Searching for 'Astérix y Obélix Al servicio de su majestad' using The Movie Database scraper (path: 'C:\Users\jurrabi\AppData\Roaming\XBMC\addons\metadata.themoviedb.org', content: 'movies', version: '3.7.4')
16:45:56 T:1788   DEBUG: scraper: CreateSearchUrl returned <url>http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&amp;query=ast%e3%a9rix%20y%20ob%e3%a9lix%20al%20servicio%20de%20su%20majestad&amp;year=2012&amp;language=es</url>
16:45:56 T:1788   DEBUG: CurlFile::Open(08D0C5A8) http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=ast%e3%a9rix%20y%20ob%e3%a9lix%20al%20servicio%20de%20su%20majestad&year=2012&language=es
16:45:57 T:1788   DEBUG: scraper: GetSearchResults returned <results></results>
You can see that the "Astérix" part of the file name (the Spanish adapted title) has been changed to lower case in "query=ast%e3%a9rix".

But \xe3\xa9 is not the correct code for "é"; it should be \xc3\xa9.

You can see how the byte \xc3 (character Ã in the Latin-1 charset) has been incorrectly converted to \xe3 (ã in Latin-1), causing the lack of results. If I manually change the API call to:
Code:
http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=ast%C3%A9rix%20y%20ob%C3%A9lix%20al%20servicio%20de%20su%20majestad&year=2012&language=es
I get the correct result:
Code:
{"page":1,"results":[{"adult":false,"backdrop_path":"/cvbqQyF4KIWQUViD2UwDCFe4hvq.jpg","id":99770,"original_title":"Astérix & Obélix - Au service de sa Majesté","release_date":"2012-10-17","poster_path":"/3tQslKo8oNG6mcDz4pF6YyjhDGq.jpg","popularity":3.608053125,"title":"Astérix y Obélix: Al servicio de su majestad","vote_average":5.9,"vote_count":20}],"total_pages":1,"total_results":1}
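
The corruption described above can be reproduced outside XBMC. The following is only a minimal Python sketch, assuming the bug is a byte-wise lower-casing that treats the UTF-8 bytes as Latin-1 characters; the helper name `bytewise_lower_latin1` is hypothetical and is not taken from the XBMC source.

```python
from urllib.parse import quote


def bytewise_lower_latin1(utf8_bytes: bytes) -> bytes:
    """Lower-case each byte as if it were a Latin-1 character.

    Hypothetical helper: it mimics the suspected faulty conversion,
    it is not the actual XBMC code.
    """
    return utf8_bytes.decode("latin-1").lower().encode("latin-1")


title = "Astérix"
utf8 = title.encode("utf-8")           # b'Ast\xc3\xa9rix'
mangled = bytewise_lower_latin1(utf8)  # 0xC3 ('Ã') is turned into 0xE3 ('ã')

# quote() emits upper-case hex digits; lower-case them to match the log.
broken_query = quote(mangled).lower()
good_query = quote(utf8)

print(broken_query)  # ast%e3%a9rix
print(good_query)    # Ast%C3%A9rix
```

The first value matches the query fragment in the debug log, and the second matches the manually corrected call.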

I'm going to try to find where the issue lies.
And since a call with upper case A like:
Code:
http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=Ast%C3%A9rix%20y%20ob%C3%A9lix%20al%20servicio%20de%20su%20majestad&year=2012&language=es
works fine, I guess I'll try commenting out the part that incorrectly converts to lowercase.
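
For comparison, lower-casing is harmless when it is done on decoded text rather than on raw bytes. This is a hedged Python sketch of that idea, not the fix that went into XBMC; `lowercase_query` is a hypothetical name and the input is assumed to be valid UTF-8.

```python
from urllib.parse import quote


def lowercase_query(raw_name: bytes) -> str:
    """Decode UTF-8 first, lower-case the text, then percent-encode it.

    Hypothetical helper for illustration; assumes raw_name is valid UTF-8.
    """
    text = raw_name.decode("utf-8")
    return quote(text.lower())


query = lowercase_query("Astérix y Obélix".encode("utf-8"))
print(query)  # ast%C3%A9rix%20y%20ob%C3%A9lix
```

Here the "é" survives as %C3%A9 because the case conversion operates on characters, not on the individual bytes of the encoding.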
Always read the XBMC Online Manual,Frequently Asked Questions and search the forum before posting.
For troubleshooting and bug reporting use -> Log file.

#2
It seems it's not in the scraper code, since CreateSearchUrl only uses \1 entry param ("query=\1&amp;amp;") without modification.

My scraper knowledge isn't too fresh, though, so I might be wrong.

The fact that the same scraper works fine in Linux is also a clue ;)

#3
See http://trac.xbmc.org/ticket/14749 and add a Debug Log.

The lowercase thing has already been fixed in master though.


#4
I'll try next nightly then and report back.

Thanks for the feedback.

#5
No nightlies after 8 Dec... will check when possible.
