2014-11-12, 04:26
(2014-11-12, 02:44)DoctorD Wrote: As for the Tokyo247 data showing the name as "Bookmark", I actually noticed this same problem when I was writing the code to get it working! I don't really know Japanese, but from what I can gather, this is happening because the name for this particular person is written entirely in hiragana: "しおり". In English letters, the correct name is Shiori. For whatever reason, Google Translate gives "Bookmark" as the translation for しおり, and that's why the scraper shows this. I don't know why Google Translate gives such a bad translation (or maybe it's actually a good translation? I don't know, I don't speak Japanese), but it does!
A possible fix I could write: for name elements, instead of going straight to Google Translate, the scraper checks whether every character is hiragana and, if so, does a phonetic transliteration into English letters. For example, "し" becomes "Shi". I might be able to do the same when the name also contains katakana, as I think it follows a similar one-to-one mapping. Do you think that will work?
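For illustration only, the kana check described above could look roughly like the Python sketch below. This is not the scraper's actual code: `KANA_TO_ROMAJI`, `to_hiragana`, and `romanize_name` are made-up names, and the table only covers the basic kana (no small ゃ/ゅ/ょ combinations, no small っ, no long-vowel mark), so anything outside it is handed back to the normal translation path.

```python
# Hepburn-style readings for the basic hiragana (no digraphs, no small tsu).
_ROWS = {
    "あいうえお": "a i u e o",
    "かきくけこ": "ka ki ku ke ko",
    "さしすせそ": "sa shi su se so",
    "たちつてと": "ta chi tsu te to",
    "なにぬねの": "na ni nu ne no",
    "はひふへほ": "ha hi fu he ho",
    "まみむめも": "ma mi mu me mo",
    "やゆよ": "ya yu yo",
    "らりるれろ": "ra ri ru re ro",
    "わをん": "wa o n",
    "がぎぐげご": "ga gi gu ge go",
    "ざじずぜぞ": "za ji zu ze zo",
    "だぢづでど": "da ji zu de do",
    "ばびぶべぼ": "ba bi bu be bo",
    "ぱぴぷぺぽ": "pa pi pu pe po",
}
KANA_TO_ROMAJI = {
    kana: roma
    for kanas, romas in _ROWS.items()
    for kana, roma in zip(kanas, romas.split())
}


def to_hiragana(text):
    """Map katakana (U+30A1..U+30F6) onto hiragana; leave other characters alone."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def romanize_name(name):
    """Return a romanized name, or None if any character is not in the table."""
    kana = to_hiragana(name)
    if not kana or any(ch not in KANA_TO_ROMAJI for ch in kana):
        return None  # caller falls back to the normal translation path
    return "".join(KANA_TO_ROMAJI[ch] for ch in kana).capitalize()
```

With this sketch, `romanize_name("しおり")` gives "Shiori", and a name containing kanji returns `None` so it can still go through the translator.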
Ha, I didn't even realize that "しおり" could mean "bookmark." I thought it was some kind of programming error, like the name of a function or library. I understand the problem, but I don't know Japanese either, so I can't offer my two cents on hiragana or katakana.
The fix you propose for scanning names sounds a bit complicated and may add time to the scraping process. But if you can figure it out, it will, as I understand it, preserve not just the name but also any other specific word that shouldn't be translated. Another idea: if the scraper sees the actress name in Japanese as well as in English, maybe you can have it cross-check the title against the actress name so the name won't get translated. I don't know. It's just an idea.
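To make the cross-check idea a bit more concrete, here is a rough Python sketch. It assumes the scraper already has a mapping from the Japanese spelling of each name to its English spelling; `protect_names` and `known_names` are hypothetical names of mine, not anything from the actual scraper.

```python
def protect_names(title, known_names):
    """Swap known Japanese names for their English spellings before translation.

    known_names maps the Japanese spelling to the English one,
    e.g. {"しおり": "Shiori"}.
    """
    # Longest names first, so a short name never clobbers part of a longer one.
    for jp in sorted(known_names, key=len, reverse=True):
        title = title.replace(jp, known_names[jp])
    return title
```

The title would then go to the translator with the name already in English letters, so there is nothing left in it to turn into "Bookmark".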
Thanks!
Edit: Just saw your update now. Impressively FAST! So you figured out the hiragana/katakana thing. Awesome!! I'll test it with some files and let you know if I find any bugs.