Scraper and code page conversion
#1
Not sure if this is the right forum for this post.

There are four movie scrapers for Chinese users in trunk: Mtime, M1905, Getlib and 7176 (thanks to taxigps). Generally they work well, but in some cases 7176 and Getlib return the Chinese information as garbled text ("messy code").

Routine usage and findings:
Mtime: always works fine, but does not support IMDb IDs, and its pictures are watermarked
M1905: seldom used, since it does not support multi-word searches
7176: good, but sometimes messy code
Getlib: sometimes messy code

I just got some time and ran some tests with those scrapers.

Platform: Windows 7 Simplified Chinese, 32-bit, with XBMC svn30252. The built-in Arial font was replaced with a Chinese font (wqy-microHei) to support Chinese display. wqy-microHei is an open source font, available here: http://sourceforge.net/projects/wqy/file...z/download

Testing keywords (movies): iron man, feng sheng, 风声
Actually, feng sheng and 风声 are the same movie; feng sheng is the pinyin of 风声. See the wiki for information about pinyin: http://en.wikipedia.org/wiki/Pinyin

Code page information of scrapers and movie websites:
Mtime:
mtime.com - utf-8; scraper: xml encoding - utf-8, search string encoding - gb2312, search result encoding - iso-8859-1

M1905:
m1905.com - utf-8; scraper: xml encoding - utf-8, search string encoding - utf-8, search result encoding - iso-8859-1

7176:
7176.com - gb2312; scraper: xml encoding - gb2312, search string encoding - gb2312, search result encoding - gb2312

Getlib:
getlib.com - gb2312; scraper: xml encoding - gb2312, search string encoding - gb2312, search result encoding - gb2312
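
To make the code page differences above concrete, here is a minimal Python sketch (not XBMC code) of what happens to the test title 风声 under the encodings listed. The variable names are made up for illustration, and the sketch only assumes Python's standard `utf-8`, `gb2312` and `iso-8859-1` codecs; it says nothing about how XBMC itself performs the conversion.

```python
from urllib.parse import quote

title = "风声"

# The same title yields different bytes depending on the code page.
utf8_bytes = title.encode("utf-8")    # 3 bytes per character for this title
gb_bytes = title.encode("gb2312")     # 2 bytes per character for this title

# Percent-encoded form, as it might appear in a search URL.
utf8_query = quote(title, encoding="utf-8")
gb_query = quote(title, encoding="gb2312")

# Decoding with the wrong code page: ISO-8859-1 accepts any byte, so it
# never fails; it silently produces unrelated Latin characters.
garbled = utf8_bytes.decode("iso-8859-1")

# Decoding these GB2312 bytes as UTF-8 is rejected outright.
try:
    gb_bytes.decode("utf-8")
    gb_as_utf8_ok = True
except UnicodeDecodeError:
    gb_as_utf8_ok = False

print(utf8_query, gb_query, repr(garbled), gb_as_utf8_ok)
```

The point of the sketch is that a wrong "result encoding" can fail in two different ways: quietly (garbled but displayable characters) or loudly (a conversion error).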

Results
Mtime has no messy code problem.

M1905: the Chinese movie information was in messy code with the keyword "iron man" (Iron Man 2)

Image

Debug log
http://pastebin.com/JRf7upg4


7176:
With the movie feng sheng, XBMC showed the Chinese as messy code in the title selection interface, while the final movie information was shown correctly.
With the movie 风声, XBMC displayed all Chinese correctly.

Title selection - feng sheng, Messy code
Image

Final movie information - feng sheng. Chinese displayed correctly
Image

Log - feng sheng
http://pastebin.com/xB8pxDVE

Log - 风声 (just for comparison. Since it is the same movie as feng sheng, the scraper captured the same URL)
http://pastebin.com/NYNfZThU

Getlib:
Movie 风声 and feng sheng:
Both title selection interfaces displayed correctly.

Final movie information - title in messy code / no introduction displayed (although, according to the log, a lot of information was scraped)
风声:
Image

feng sheng
Image

Debug log:
风声
http://pastebin.com/eqU7iy1B

feng sheng
http://pastebin.com/vHTsyYD3

Some "ERROR: convert_checked failed" lines can be found in those debug logs. I presume this indicates a code page conversion problem. Although the code page was specified in the scraper xml, XBMC does not seem to follow these parameters when displaying the scraped information.

I changed the Chinese font for XBMC to msyh.ttf (from Microsoft, built into Windows 7) and got the same results.

Is it a bug?

Rex

PS. I tried the scrapers in the scraper editor and found that if the website is encoded in utf-8, the editor displays the Chinese characters correctly, and if it is GB2312, they show as messy code.
#2
convert_checked failed indicates that whatever character set your data is in, it's not in the format that XBMC is expecting, e.g. you tell us it's utf8 when it's not. As a guess, this is why the Iron Man lookup gives a bad plot - from the debug log it looks a bit dodgy with all that "whitespace" there.

What would be useful would be for you to verify that the data that XBMC dumps into the debug log is in the encoding that XBMC is expecting.
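
One rough way to run that check outside XBMC is a strict decode, sketched below in Python. The helper names `decodes_as` and `plausible_encodings` are made up for this example, and the sample bytes stand in for a chunk copied out of the debug log. Note that iso-8859-1 is deliberately left out of the candidates, since it accepts every byte sequence and so proves nothing.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def plausible_encodings(data: bytes, candidates=("utf-8", "gb2312")):
    """List the candidate encodings under which data decodes cleanly."""
    return [enc for enc in candidates if decodes_as(data, enc)]


# Stand-ins for bytes taken from a log: the same title in two encodings.
sample_gb = "风声".encode("gb2312")
sample_utf8 = "风声".encode("utf-8")

print(plausible_encodings(sample_gb))
print(plausible_encodings(sample_utf8))
```

A clean decode only shows the bytes are valid in that encoding, not that it is the intended one, so the decoded text still needs a look by someone who can read it.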

Cheers,
Jonathan
#3
Thanks for your prompt reply.

Quote:What would be useful would be for you to verify that the data that XBMC dumps into the debug log is in the encoding that XBMC is expecting.

What is the encoding that XBMC expects? The one specified in the scraper xml? If yes, this would be very strange.

For example, the Mtime scraper works fine with XBMC. In the Mtime scraper the xml encoding is utf-8, the search string encoding is gb2312 and the search result encoding is iso-8859-1, but the movie information XBMC dumped into the debug log was actually encoded in UTF-8. (The information is unreadable in the log as posted, since the log is shown in ANSI encoding. Copy it into Notepad++, change the encoding to UTF-8, and the correct Chinese characters show up.) XBMC can also display those correctly.

In the Iron Man log, the dumped information was also in UTF-8, but XBMC failed to display it.

With the Getlib scraper and "feng sheng", the dumped movie information in the log was in GB2312 (as specified in the xml), but XBMC displayed nothing.
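
For anyone who wants to repeat the log check without Notepad++, the same reinterpretation can be sketched in Python. This is only an illustration: `reinterpret` is a made-up helper, and latin-1 is used as a stand-in for the "ANSI" view because it maps every byte to a character and back without loss. The real ANSI code page on a Simplified Chinese Windows is different, and the round trip is not guaranteed to be lossless there.

```python
def reinterpret(text_seen: str, shown_as: str, actual: str) -> str:
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text_seen.encode(shown_as).decode(actual)


original = "风声"

# Simulate how the dumped bytes look when viewed with the wrong encoding.
seen_utf8 = original.encode("utf-8").decode("latin-1")
seen_gb = original.encode("gb2312").decode("latin-1")

# Reinterpret each with the encoding the bytes were really written in.
recovered_utf8 = reinterpret(seen_utf8, "latin-1", "utf-8")
recovered_gb = reinterpret(seen_gb, "latin-1", "gb2312")

print(recovered_utf8, recovered_gb)
```

If the text comes back readable with one encoding and fails or stays garbled with the other, that tells you which encoding the dumped data was really in.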

Thanks

Rex