2010-05-21, 06:27
Not sure if this is the right forum for this post.
There are four movie scrapers for Chinese users in trunk: Mtime, M1905, Getlib and 7176 (thanks to taxigps). Generally they work well, but in some cases 7176 and Getlib return the Chinese information as "messy code" (garbled characters).
Routine usage and findings:
Mtime: always works fine, but does not support IMDb IDs, and its pictures are watermarked
M1905: seldom used, since it does not support multi-word searches
7176: good, but sometimes messy code
Getlib: sometimes messy code
I finally had some time to test these scrapers.
Platform: Windows 7 Simplified Chinese, 32-bit, with XBMC svn30252. The built-in Arial font was replaced with a Chinese font (wqy-microHei) to support Chinese display. wqy-microHei is an open source font, available here: http://sourceforge.net/projects/wqy/file...z/download
Testing keywords (movies): iron man, feng sheng, 风声
feng sheng and 风声 are actually the same movie; feng sheng is the pinyin of 风声. See Wikipedia for information about pinyin: http://en.wikipedia.org/wiki/Pinyin
Code page information of scrapers and movie websites:
Mtime:
mtime.com - utf-8; scraper: xml encoding - utf-8, search string encoding - gb2312, search result encoding - iso-8859-1
M1905:
m1905.com - utf-8; scraper: xml encoding - utf-8, search string encoding - utf-8, search result encoding - iso-8859-1
7176:
7176.com - gb2312; scraper: xml encoding - gb2312, search string encoding - gb2312, search result encoding - gb2312
Getlib:
getlib.com - gb2312; scraper: xml encoding - gb2312, search string encoding - gb2312, search result encoding - gb2312
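To illustrate why the search string encoding matters: the same Chinese keyword is percent-encoded into different bytes depending on the charset. This is only a minimal Python sketch, not taken from the scrapers themselves; the helper name encode_search_term is made up for illustration.

```python
from urllib.parse import quote


def encode_search_term(term: str, charset: str) -> str:
    """Percent-encode a search keyword using the given charset (hypothetical helper)."""
    return quote(term, encoding=charset)


keyword = "风声"
as_gb2312 = encode_search_term(keyword, "gb2312")  # 2 bytes per character
as_utf8 = encode_search_term(keyword, "utf-8")     # 3 bytes per character

print(as_gb2312)
print(as_utf8)
# A pure ASCII (pinyin) keyword encodes identically in both charsets.
print(encode_search_term("feng sheng", "gb2312"))
```

Because a pinyin keyword is plain ASCII, its encoded form does not depend on which search string encoding the scraper declares; only the Chinese keyword does.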
Results
Mtime has no messy code problem.
M1905: movie information (Chinese) in messy code with keyword "iron man" (Iron Man 2)
Debug log
http://pastebin.com/JRf7upg4
7176:
With the movie feng sheng, XBMC showed the Chinese as messy code in the title selection interface, while the final movie information was displayed correctly.
With the movie 风声, XBMC displayed all Chinese correctly.
Title selection - feng sheng, Messy code
Final movie information - feng sheng. Chinese displayed correctly
Log - feng sheng
http://pastebin.com/xB8pxDVE
Log - 风声 (just for comparison. Since it is the same movie as feng sheng, the scraper captured the same URL)
http://pastebin.com/NYNfZThU
Getlib
Movie 风声 and feng sheng:
Both title selection interfaces displayed correctly.
Final movie information - title in messy code / no introduction displayed (although the log shows that a lot of information was actually scraped)
风声:
feng sheng
Debug log:
风声
http://pastebin.com/eqU7iy1B
feng sheng
http://pastebin.com/vHTsyYD3
Some "ERROR: convert_checked failed" lines can be found in those debug logs. I presume they indicate a code page conversion problem. Although the code page is specified in the scraper xml, XBMC does not seem to follow these parameters when displaying the scraped information.
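I have not looked at how XBMC performs the conversion internally, so the following is only a Python sketch of what a "checked" conversion generally does: decode strictly with the declared charset and report failure instead of silently producing garbage. The function name is borrowed from the log message; the body is my own assumption, not XBMC code.

```python
from typing import Optional


def convert_checked(raw: bytes, source_charset: str) -> Optional[str]:
    """Strictly decode raw bytes; return None if they are not valid in source_charset.

    Hypothetical stand-in for a checked conversion, not XBMC's implementation.
    """
    try:
        return raw.decode(source_charset)  # strict mode: raises on invalid bytes
    except (UnicodeDecodeError, LookupError):
        return None


gb2312_bytes = "风声".encode("gb2312")

# Decoding with the charset the site really uses succeeds.
print(convert_checked(gb2312_bytes, "gb2312"))
# Treating the same bytes as utf-8 fails the check.
print(convert_checked(gb2312_bytes, "utf-8"))
```

If something like this is what happens, the error would appear whenever the charset actually used for conversion differs from the one the website delivers.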
I changed the Chinese font for XBMC to msyh.ttf (from M$, built into Windows 7), with the same results.
Is it a bug?
Rex
PS. I tried the scrapers in the scraper editor and found that if the website is encoded in utf-8, the editor displays the Chinese characters correctly; if it is encoded in GB2312, it shows messy code.
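What I see in the editor is consistent with GB2312 bytes being interpreted with a different charset. A small Python sketch of that effect, assuming (not confirmed) that this is what happens; decode_as is a made-up helper for illustration.

```python
TITLE = "风声"


def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes with the given charset, replacing undecodable bytes with U+FFFD."""
    return raw.decode(charset, errors="replace")


gb2312_bytes = TITLE.encode("gb2312")
utf8_bytes = TITLE.encode("utf-8")

# Correct charset: the title comes back intact.
print(decode_as(gb2312_bytes, "gb2312"))
print(decode_as(utf8_bytes, "utf-8"))

# Wrong charset: the same bytes turn into messy code.
print(decode_as(gb2312_bytes, "iso-8859-1"))  # accented Latin characters
print(decode_as(gb2312_bytes, "utf-8"))       # replacement characters
```

The font does not matter here: the characters are already wrong before they are drawn, which would match the fact that switching fonts gave the same results.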