Scraping in other language(s?) is broken
#1
Hi,

Thanks for the great software.

I am trying to scrape some series or movies using Russian titles for search.

This used to work perfectly in earlier versions, before the settings were moved into a separate dialog (I think the last version it worked in for me was 2.5.5).

Recent versions are more user-friendly in that they let you select the language in the scrape dialog rather than in the settings, but they only work for English titles.

From log files:
Language: English Chars: ASCII
Code:
2015-04-09 23:10:38,791 DEBUG [SwingWorker-pool-3-thread-10] o.t.s.thetvdb.TheTvDbMetadataProvider:98 - search() SearchQuery; Type: TV_SHOW; COUNTRY:US;QUERY:dd;LANGUAGE:en;
2015-04-09 23:10:38,792 DEBUG [SwingWorker-pool-3-thread-10] org.tinymediamanager.scraper.util.Url:223 - getting http://thetvdb.com/api/GetSeries.php?seriesname=dd&language=en
2015-04-09 23:10:39,140 DEBUG [SwingWorker-pool-3-thread-10] o.t.scraper.util.Similarity:168 - Similarity Score: [dd][DD Hokuto no Ken]=[0.2]
2015-04-09 23:10:39,140 DEBUG [SwingWorker-pool-3-thread-10] o.t.scraper.util.Similarity:168 - Similarity Score: [dd][DD Hokuto no Ken]

Language: Russian Chars: Cyrillic
Code:
2015-04-09 23:21:03,225 DEBUG [SwingWorker-pool-3-thread-6] org.tinymediamanager.scraper.util.Url:223 - getting http://thetvdb.com/api/GetSeries.php?seriesname=+&language=ru
2015-04-09 23:21:04,180 DEBUG [SwingWorker-pool-3-thread-6] o.t.scraper.util.Similarity:168 - Similarity Score: [ ][Сверхъестественное: Аниме]=[0.0]
2015-04-09 23:21:04,180 DEBUG [SwingWorker-pool-3-thread-6] o.t.scraper.util.Similarity:168 - Similarity Score: [ ][Сверхъестественное  Аниме]=[0.0]
2015-04-09 23:21:04,180 DEBUG [SwingWorker-pool-3-thread-6] o.t.scraper.util.Similarity:168 - Similarity Score: [ ][Зоопарк в обувной коробке]=[0.0]
2015-04-09 23:21:04,181 DEBUG [SwingWorker-pool-3-thread-6] o.t.scraper.util.Similarity:168 - Similarity Score: [ ][Зоопарк в обувной коробке]=[0.0]
..........loads of random results, some with no titles etc...................

Language: Russian Chars: ASCII
Code:
2015-04-09 23:19:30,768 DEBUG [SwingWorker-pool-3-thread-3] o.t.s.thetvdb.TheTvDbMetadataProvider:98 - search() SearchQuery; Type: TV_SHOW; COUNTRY:US;QUERY:dd;LANGUAGE:ru;
2015-04-09 23:19:30,768 DEBUG [SwingWorker-pool-3-thread-3] org.tinymediamanager.scraper.util.Url:223 - getting http://thetvdb.com/api/GetSeries.php?seriesname=dd&language=ru
2015-04-09 23:19:31,153 DEBUG [SwingWorker-pool-3-thread-3] o.t.scraper.util.Similarity:168 - Similarity Score: [dd][DD Hokuto no Ken]=[0.2]
2015-04-09 23:19:31,154 DEBUG [SwingWorker-pool-3-thread-3] o.t.scraper.util.Similarity:168 - Similarity Score: [dd][DD Hokuto no Ken]=[0.2]

So my guess from the logs is that non-ASCII characters are not URL-encoded correctly for some reason: the URL contains a "+" sign instead of the search string, which would explain the random results.

Hopefully this is enough info and it can be fixed. Currently I have to use an older version for non-English scraping.
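
To illustrate what a correctly encoded query would look like, here is a minimal Java sketch using the standard `java.net.URLEncoder`. The class name `QueryEncodingDemo` and the helper `encodeQuery` are made up for this example and are not taken from the TMM source. It shows that a Cyrillic title becomes a run of percent-escapes, while a query consisting of a single space becomes exactly the "+" seen in the log.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncodingDemo {

    // Hypothetical helper (not TMM code): encodes a search term for use
    // as the value of the seriesname query parameter.
    static String encodeQuery(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes spell the Russian word "Зоопарк" (zoo).
        String cyrillic = "\u0417\u043E\u043E\u043F\u0430\u0440\u043A";

        // A Cyrillic query is percent-encoded as UTF-8 bytes.
        System.out.println(encodeQuery(cyrillic));

        // A query that is only a space is encoded as "+".
        System.out.println(encodeQuery(" "));
    }
}
```

If the search string had already been reduced to a single space before encoding, the request would end up as `seriesname=+`, which matches the Russian/Cyrillic log above.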

Thanks!


P.S. I just tried manually opening the strange URL from the log, and the results match the random ones I get in the TMM UI, so this is definitely what is being passed to the TVDB API.
http://thetvdb.com/api/GetSeries.php?ser...anguage=ru

This is what it should be, and it returns correct results:
http://thetvdb.com/api/GetSeries.php?ser...anguage=ru
#2
Ah, I see.
This comes from our intention to clean the search string, since the API behaves (behaved?) kind of weirdly on some special characters.
The cleaning might be a bit too aggressive.
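
As a rough illustration of how such cleaning can go wrong, here is a Java sketch. The actual cleaning rule in TMM is not shown in this thread, so both methods and the class name `SearchCleanDemo` are assumptions made for the example. The first method treats everything outside ASCII letters and digits as a special character, the second keeps letters and digits from any script.

```java
public class SearchCleanDemo {

    // Assumed over-aggressive rule: every run of characters outside
    // ASCII letters and digits is collapsed into one space.
    static String cleanAsciiOnly(String s) {
        return s.replaceAll("[^A-Za-z0-9]+", " ");
    }

    // Unicode-aware alternative: \p{L} and \p{N} match letters and
    // digits in any script, so Cyrillic titles survive the cleaning.
    static String cleanUnicodeAware(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
    }

    public static void main(String[] args) {
        // Unicode escapes spell the Russian word "Зоопарк" (zoo).
        String title = "\u0417\u043E\u043E\u043F\u0430\u0440\u043A";

        // The whole title is one non-ASCII run, so only a space is left.
        System.out.println("[" + cleanAsciiOnly(title) + "]");

        // The Unicode-aware rule leaves the title untouched.
        System.out.println("[" + cleanUnicodeAware(title) + "]");
    }
}
```

With the ASCII-only rule a purely Cyrillic title collapses to a single space, which is consistent with the `[ ]` query shown in the Similarity log lines of the original report.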

Thanks for reporting...
#3
(2015-04-09, 23:15) myron Wrote: Ah, I see.
This comes from our intention to clean the search string, since the API behaves (behaved?) kind of weirdly on some special characters.
The cleaning might be a bit too aggressive.

Thanks for reporting...

Great, thanks. Let me know if you want me to test a build or something.
