Improved allmusic.com scraper (plus a few questions)
#1
Hey gang,

Longtime user, (almost) first-time poster. ;) Anyhoo, I've been searching for the best music scraper for a while now, and haven't really been happy with any of them. allmusic.com has fantastic artist information/reviews but awful photos (and the scraper never really seemed to get all the proper info), Discogs has decent photos but really limited artist information, and last.fm has mediocre everything. So I started mucking around with the existing allmusic scraper (from r22528), fixed a few problems with it, and improved it a bit (for my own needs, anyhow). Here's what I've done:

- The artist information (aside from the bio) wasn't being parsed properly; the ParseAMGArtist function was being passed the value "test" instead of the actual URL. Fixed.
- The album information wasn't being parsed properly either; the ParseAMGAlbum function was being passed the value "placeholder" instead of the actual URL. Fixed.
- The caching was glitchy. The cache file should be unique to each artist, but it was set to use the same cache file for every artist, so subsequent lookups would often return duplicate/incorrect information from the previous artist. Fixed.
- The scraper was set to only get thumbs from htbackdrops (same with all the other music scrapers now?), which is fine and may be preferable to some, but htbackdrops barely has any thumbnails for the artists in my library. I noticed that Discogs generally has decent photos for its artists, and its coverage is quite extensive, so I've changed the scraper to also check Discogs for thumbs (though it will still use the thumb from htbackdrops as the primary if one is available).
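
For anyone curious what the caching and thumb changes amount to, here's a rough Python sketch of the idea (it is not the scraper's actual XML). The names `cache_filename` and `merge_thumbs`, and the hash-based naming scheme, are assumptions made up for illustration:

```python
import hashlib
from typing import List


def cache_filename(prefix: str, artist_url: str) -> str:
    """Build a cache file name keyed on the artist URL (hypothetical scheme)."""
    digest = hashlib.sha1(artist_url.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}.html"


def merge_thumbs(primary: List[str], fallback: List[str]) -> List[str]:
    """Return primary thumbs first, then fallback thumbs, without duplicates."""
    merged: List[str] = []
    seen = set()
    for url in primary + fallback:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged


if __name__ == "__main__":
    # Two different artists get two different cache files.
    print(cache_filename("amg-artist", "http://example.com/artist/1"))
    print(cache_filename("amg-artist", "http://example.com/artist/2"))
    # htbackdrops results stay first; Discogs results are appended after them.
    print(merge_thumbs(["http://example.com/ht.jpg"], ["http://example.com/dc.jpg"]))
```

With something along these lines, two artists won't in practice end up sharing a cache file, and a Discogs thumb only shows up after whatever htbackdrops returned.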

With these changes, pretty much every artist and album in my library has a proper thumb, as well as full bio/reviews/discography/etc. Definitely a HUGE improvement.

I've posted it here: EDIT: my changes are now in the latest SVN builds. Just download that instead!

Hopefully it helps someone else as well. :) Maybe someone can merge these fixes into SVN.

1st question for the scraper gurus: is it possible to nest URL fetches or functions? I couldn't figure out how to fetch a URL, parse it, and then use the resulting string to fetch another URL and parse that.

2nd question: Is there a way to ensure a variable/buffer is URL encoded properly for use in a GET string?

Thanks!
#2
1) you can chain as deep as you want - just see how we call e.g. ParseAMGArtist and copy that.
2) currently no, but i've been pondering adding a urlencode feature - seems you need it as well, so it will come soonish.
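
The chaining idea in 1) can be sketched outside the scraper XML as follows. This is only an illustration in Python, assuming a hypothetical `chain_fetch` helper and made-up example.com pages; it is not how ParseAMGArtist is actually wired up:

```python
import re
from typing import Callable, Optional


def chain_fetch(
    start_url: str,
    fetch: Callable[[str], str],
    link_pattern: str,
    detail_pattern: str,
) -> Optional[str]:
    """Fetch a page, pull a URL out of it, fetch that URL, and parse the result."""
    first_page = fetch(start_url)
    link = re.search(link_pattern, first_page)
    if link is None:
        return None  # nothing to follow
    second_page = fetch(link.group(1))
    detail = re.search(detail_pattern, second_page)
    return detail.group(1) if detail else None


# Stand-in for real HTTP requests so the sketch runs offline (made-up pages).
FAKE_PAGES = {
    "http://example.com/search": '<a href="http://example.com/artist/42">Artist</a>',
    "http://example.com/artist/42": '<h1>Artist</h1><p class="bio">Some bio text</p>',
}

if __name__ == "__main__":
    bio = chain_fetch(
        "http://example.com/search",
        FAKE_PAGES.__getitem__,
        r'href="([^"]+)"',
        r'<p class="bio">([^<]+)</p>',
    )
    print(bio)
```

For real pages, `fetch` could be swapped for something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")`; the point is just that the output of one parse step becomes the input URL of the next.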
#3
spiff, do you think this new and improved scraper could be included in future releases?

Asking just to understand whether it's worth adding it manually, or whether with a bit of patience I'd get it anyway through regular updates.

Cheers!
#4
if a trac ticket is posted it will get considered; if not, it will be ignored
#5
Much MUCH improved! Thank you for your work on this, talisto!

Now I need to rescrape 26,000 files....

Thanks again!
#6
spiff, does the trac need to be posted by the author?
#7
preferably yes, but no problem making an exception if the author doesn't pop up.
#8
blacklist wrote: Much MUCH improved! Thank you for your work on this, talisto!

Glad to hear it's working for you! :)

I didn't submit this as a patch to trac because I assume there's a reason why all the music scrapers have been switched to use htbackdrops exclusively for thumbnails, and my inclusion of Discogs is more of a personal preference than a fix. However, the other changes I've made are clearly bugfixes, so perhaps I should submit a trac ticket with only those changes, so that at least those get fixed promptly.

I've never used trac before, though, so I'd better read up on the "HOW-TO submit a patch" guidelines first! ;)
#9
reason is: major screwup on my part. must have been drunk ;)

the thumbs, however, you are dead wrong on. we still happily parse amg thumbs in GetAMGArtist and those are added first - so they get the priority.

as for including discogs, just make it a setting and default it to false and everybody's happy
edit2: just remember, i haven't gotten around to adding scraper settings to music scrapers yet. will do asap
#10
spiff wrote: as for including discogs, just make it a setting and default it to false and everybody's happy

looks like you guys don't like discogs. why is that? just wondering...
#11
huh?

we have a separate discogs scraper. he wants to grab discogs thumbs in the allmusic scraper, which means much longer scraping times.
#12
spiff wrote: the thumbs, however, you are dead wrong on. we still happily parse amg thumbs in GetAMGArtist and those are added first - so they get the priority.

Ugh, well, I made that assumption based on the fact that both the last.fm and freebase scrapers are only pulling thumbs from htbackdrops as well. But I just realized they both have errors in them too: in both scripts the thumb results are being output to buffer 7, when they should be output to buffer 5. I'll see if I can put a trac ticket in for those too.

spiff wrote: edit2: just remember, i haven't gotten around to adding scraper settings to music scrapers yet. will do asap

That'd be a great addition!!

Edit: I just realized the freebase scraper was fixed a couple of days ago. I've submitted patches for the last.fm scraper, as well as the allmusic scraper (but without my Discogs additions, just the fixes).
#13
stokedfish wrote: looks like you guys don't like discogs. why is that? just wondering...

Discogs is a great resource for discography info (obviously, hence the name), but is pretty lacking when it comes to artist bios and other info. Take, for example, Ben Harper, who is a pretty popular artist: http://www.discogs.com/artist/Ben+Harper

Their entire "bio" for Ben Harper consists of his name and his birth date. Then compare that to allmusic.com's: http://www.allmusic.com/cg/amg.dll?p=amg...qe5ld0e~T1

Their bio is a comprehensive 6-paragraph history of the artist, along with his birthdate/location, years active as an artist, genres, styles, instruments played, etc. Unfortunately, Allmusic's photos are limited to 200px wide, which looks pretty awful in XBMC. That's why I modified the Allmusic script to use Discogs' photos instead, as Discogs generally has photos 500px wide or more.
#14
delivered on my first promise (urlencoding of buffers);

Code:
<RegExp ....>
  <expression encode="1">...</expression>
</RegExp>
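
Exactly which encoding rules `encode="1"` applies isn't spelled out here, so the following is only a general illustration of what URL-encoding a value for a GET string means, written in Python with the standard library. The helper names and the `q` parameter are assumptions for the example, not part of the scraper format:

```python
from urllib.parse import quote, urlencode


def encode_value(value: str) -> str:
    """Percent-encode a single value, including '/' and '&'."""
    return quote(value, safe="")


def build_search_url(base_url: str, artist: str) -> str:
    """Append the artist name as an encoded `q` parameter (hypothetical name)."""
    return f"{base_url}?{urlencode({'q': artist})}"


if __name__ == "__main__":
    print(encode_value("AC/DC & Friends"))  # AC%2FDC%20%26%20Friends
    print(build_search_url("http://example.com/search", "Ben Harper"))
```

Without the encoding step, characters like `&`, `/` or spaces in an artist name would be read as part of the URL syntax rather than as part of the search value.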
#15
thanks for the patches!
