blacklist Wrote:For the bio information, I think themoviedb or imdb are going to be the best bet. I think themoviedb is friendlier to scrape, since they don't tend to mess with the layout of their site like imdb does. Unfortunately, their site has fairly limited information in comparison.
http://www.themoviedb.org/person/1813 is Anne Hathaway at themoviedb.
http://www.imdb.com/name/nm0004266/ is Anne at imdb.
Now, unless I'm wrong (I might be), the database already records the themoviedb ID? If so, that makes the scrape pretty trivial.
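A rough sketch of what "trivial" could look like, assuming the local database really does hand back the themoviedb person ID. `person_page_url` is a hypothetical helper, and the URL pattern is just the one from the Anne Hathaway link above, so it may not hold if the site changes.

```python
# URL pattern taken from the example link in this post (an assumption, not an official API).
TMDB_PERSON_BASE = "http://www.themoviedb.org/person/"


def person_page_url(tmdb_id):
    """Build the themoviedb person page URL from a stored ID (hypothetical helper)."""
    tmdb_id = int(tmdb_id)  # accept "1813" as well as 1813
    if tmdb_id <= 0:
        # Assumption: stored IDs are positive integers.
        raise ValueError("expected a positive themoviedb ID")
    return f"{TMDB_PERSON_BASE}{tmdb_id}"


print(person_page_url(1813))
```

The page behind that URL would still need to be fetched and parsed for the bio text; this only covers the lookup step.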
For the image, my first choice would be images.google.com, I think:
http://images.google.com/images?hl=en&gbv=2&tbs=isch:1&&sa=X&ei=tPkETbzkK8T7lweM8uDiCQ&ved=0CC8QBSgA&q=James+franco&spell=1#q=James+franco
Returns a pretty good selection of images of James Franco, and it is fairly easy to place image size constraints on the call.
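For what it's worth, here is a small sketch of building that search URL from an actor name. It only reuses the `hl`, `tbs` and `q` parameters visible in the link above; `google_image_search_url` is a made-up name, and the size-constraint parameter is left out on purpose because it isn't spelled out here.

```python
from urllib.parse import urlencode

# Base URL and parameter names copied from the example link in this post.
GOOGLE_IMAGES_BASE = "http://images.google.com/images"


def google_image_search_url(name):
    """Return an image-search URL for the given actor name (hypothetical helper)."""
    params = {"hl": "en", "tbs": "isch:1", "q": name}
    # urlencode escapes spaces as '+' and other reserved characters as %XX.
    return GOOGLE_IMAGES_BASE + "?" + urlencode(params)


print(google_image_search_url("James franco"))
```

The session-looking parameters in the original link (`ei`, `ved`, `sa`) are dropped here on the assumption that they are not needed for the search itself.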
Alternatively, there are a bunch of fan sites and celebrity image archives that might be used, but they are more likely to return 0 results. (Frankly, they are geared toward female celebrities for the most part.)
For instance, (possibly nsfw)
http://www.skins.be/Anne-Hathaway/
Returns a pretty thorough page devoted to the actress, and
http://www.skins.be/feeds/en/anne-hathaway.xml
Returns an RSS feed of wallpapers dedicated to the actress.
Now, these certainly wouldn't work for all actors, but then we could possibly fall back to Google?
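The fallback idea could be sketched roughly like this. Everything here is an assumption for illustration: `feed_url` guesses the feed slug rule from the single Anne Hathaway example, and `find_images` takes whatever fetch functions you give it rather than talking to any real site.

```python
import re


def feed_url(name):
    """Guess the fan-site feed URL for a name (slug rule inferred from one example)."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"http://www.skins.be/feeds/en/{slug}.xml"


def find_images(name, sources):
    """Try each (label, fetch) pair in order; return the first non-empty result.

    `fetch` is any callable taking a name and returning a list of image URLs.
    """
    for label, fetch in sources:
        try:
            urls = fetch(name)
        except Exception:
            # Sketch only: treat any failure (404, parse error, ...) as "no results".
            urls = []
        if urls:
            return label, urls
    return None, []


# Stub fetchers standing in for the real scrapers.
def fan_site(name):
    return []  # pretend the fan site has nothing for this actor


def google(name):
    return ["http://example.invalid/" + name.replace(" ", "_") + ".jpg"]


print(feed_url("Anne Hathaway"))
print(find_images("James Franco", [("fan site", fan_site), ("google", google)]))
```

Keeping the sources as an ordered list means another archive can be slotted in ahead of Google without touching the fallback logic.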
Wouldn't it be easier to petition TMDB folks to add more image fields to their actor db? Would require more user-input than just scraping google images for stuff that already exists, but the added bonus is that the results would be more uniformly useful and straightforward to scrape....
I know thetvdb folks have been threatening to upgrade for a long time without much external progress, but don't know about TMDB.... In that case, does it make sense to try to set up a new site for this? Seems like it wouldn't be terribly involved:
* Import all actor info from TMDB, and use their ID as primary key for lookups
* Add fields for fanarts and other ratios of images, and return multiples for each when they exist....
Certainly not a trivial task, but much more useful in the long term than any clever workarounds.
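To make the two bullet points a bit more concrete, here is one possible shape for the data, sketched with Python's built-in `sqlite3`. The table and column names (`actor`, `actor_image`, `kind`) are invented for this example, and the image URLs are placeholders; nothing here reflects how TMDB or any existing site actually stores things.

```python
import sqlite3

# Hypothetical schema: the TMDB ID is the primary key, and each actor can have
# any number of images per kind (e.g. "fanart", "thumb").
SCHEMA = """
CREATE TABLE actor (
    tmdb_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE actor_image (
    id      INTEGER PRIMARY KEY,
    tmdb_id INTEGER NOT NULL REFERENCES actor(tmdb_id),
    kind    TEXT NOT NULL,
    url     TEXT NOT NULL
);
"""


def images_for(conn, tmdb_id, kind):
    """Return every stored URL of the given kind for one actor, oldest first."""
    rows = conn.execute(
        "SELECT url FROM actor_image WHERE tmdb_id = ? AND kind = ? ORDER BY id",
        (tmdb_id, kind),
    )
    return [url for (url,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO actor VALUES (?, ?)", (1813, "Anne Hathaway"))
conn.executemany(
    "INSERT INTO actor_image (tmdb_id, kind, url) VALUES (?, ?, ?)",
    [
        (1813, "fanart", "http://example.invalid/fanart1.jpg"),
        (1813, "fanart", "http://example.invalid/fanart2.jpg"),
        (1813, "thumb", "http://example.invalid/thumb1.jpg"),
    ],
)
print(images_for(conn, 1813, "fanart"))
```

A lookup by TMDB ID then returns multiples when they exist and an empty list when they don't, which is easier for a scraper to handle than a missing page.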
Kode has put together a great framework for something just like this for the TV Logos:
http://fanart.tv
It's freshly redone and moved to that URL, but he's said adding support for other "types" of info (logos, season thumbs, [presumably] actors) will be trivial once everything's ironed out....
I'm sure if we all pitched in, we could have a perfect solution in a reasonable time.