The logic and future of Music scrapers? - Printable Version
+--- Thread: The logic and future of Music scrapers? (/showthread.php?tid=306218)
The logic and future of Music scrapers? - ronie - 2017-02-07 17:45
i'm currently working on python based scrapers for artists and albums.
since i'm pretty clueless on the scraping process, i'm hoping to get some feedback from more experienced devs.
so far, i came across a few things i don't quite understand:
1) if the 'prefer online info' setting is disabled, we pass the artistname to the artist scraper.
if the setting is enabled, we pass the artist mbid to the scraper.
why don't we always pass the mbid (if available) regardless of this setting?
2) if the album scraper returns no results, we completely skip the artist scraper. why?
3) if the 'prefer online info' setting is enabled, and 'show song and album artists' is enabled:
this causes the same artist to be listed twice in your library if the artistname in your tags does not 100% match the artistname the scraper returns.
for instance "The B-52's" vs. "The B-52s":
3.1) i have all songs of an album tagged with artist "The B-52's"
3.2) we start the album scanner and it returns the mbid for this artist
3.3) we pass this mbid to the artist scraper and it returns info for "The B-52s" and kodi adds it to the db.
3.4) kodi now scans all songs for 'additional' artists. it finds "The B-52's" and checks if it's already in the db... nope
3.5) we pass "The B-52's" to the artist scraper and it returns info for whatever closest match it can find and kodi adds this artist to the db
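The sequence in 3.1 to 3.5 can be reproduced with a toy model. This is only a sketch: `lookup_mbid`, `scan`, the list used as a library and the `FAKE_MBID` placeholder are all hypothetical and are not Kodi code. The `match_on_mbid` switch shows one possible alternative check (comparing ids instead of names), which is an assumption and not something Kodi is known to do.

```python
# Toy model only: these helpers are hypothetical stand-ins, not Kodi code.
FAKE_MBID = "mbid-0001"  # placeholder, not a real MusicBrainz id

def lookup_mbid(name):
    # Pretend the scraper resolves both spellings to the same artist.
    return FAKE_MBID if name in ("The B-52's", "The B-52s") else None

def scan(tagged_name, mbid_from_album, match_on_mbid):
    library = []
    # 3.2/3.3: the album scan yields the mbid; the artist is stored under the scraped name
    library.append({"name": "The B-52s", "mbid": mbid_from_album})
    # 3.4: the tagged song artist is checked against the library
    if match_on_mbid:
        tagged_mbid = lookup_mbid(tagged_name)
        known = any(row["mbid"] == tagged_mbid for row in library)
    else:
        known = any(row["name"] == tagged_name for row in library)
    # 3.5: an unknown artist gets scraped and added as a new row
    if not known:
        library.append({"name": tagged_name, "mbid": lookup_mbid(tagged_name)})
    return library
```

With name matching, `scan("The B-52's", FAKE_MBID, False)` ends up with two rows for one artist; with id matching the same call leaves a single row.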
RE: The logic of Music scrapers? - ironic_monkey - 2017-02-07 22:52
Sorry, I cannot answer any of these questions as both concern things added/changed after 'my' days. Neither seems to make much sense to me, that much i can say.
RE: The logic of Music scrapers? - ronie - 2017-02-08 00:14
when reading http://forum.kodi.tv/showthread.php?tid=184514&pid=1612266#pid1612266
i get the impression 'prefer online info' was meant to be a temporary setting for testing musicbrainz support when it was introduced
(2014-01-27 17:57)night199uk Wrote: Essentially the option is pretty redundant but it should allow us to bring the feature in more smoothly.
now i'm kind of curious what magic code we have in kodi to accomplish this:
(2014-01-27 17:57)night199uk Wrote: once you've MusicBrainz tagged your files once, your library and metadata is built dynamically from up-to-date MusicBrainz data, and later KEPT up-to-date and refreshed automatically so you always have good tags.
RE: The logic of Music scrapers? - jjd-uk - 2017-02-08 11:23
(2017-02-08 00:14)ronie Wrote: when reading http://forum.kodi.tv/showthread.php?tid=184514&pid=1612266#pid1612266
Jonathan purposely kept it in his rework of what night199uk did as a use/do not use MBIDs setting, as incorrect MBIDs or matches between MBIDs were causing lots of issues, so it could be used to keep the old behaviour where MBIDs were not used at all. There's a pretty decent explanation somewhere if I can find it.
 Actually the explanation I remember may have been from that thread you linked to.
RE: The logic of Music scrapers? - DaveBlake - 2017-02-09 11:07
Ah just spotted this thread, thanks @ronie for starting the conversation. I would like more insight into music scraping too, because what I see seems to need changing, but I don't like to do that without some understanding of why it is like it is. But I do have a good understanding now of the music db and tag processing, so that should help!
(2017-02-07 17:45)ronie Wrote: 1) if the 'prefer online info' setting is disabled, we pass the artistname to the artist scraper.
I thought that we did use mbids when we had them. Not doing so is a mistake for sure. The other effect of this setting is to overwrite the data derived from music file tags, or not. That is the only thing I think it should do.
Quote: 2) if the album scraper returns no results, we completely skip the artist scraper. why?
That must be in the automated scraping, because albums and artists can be scraped separately via "query info for all". I would guess that the thinking is that if the album isn't known then the artists are unlikely to be known either. There is some sense to that. Is it that successful album scraping could return the artist mbids?
(2017-02-08 00:14)ronie Wrote: when reading http://forum.kodi.tv/showthread.php?tid=184514&pid=1612266#pid1612266
Interesting idea. As best I can tell, whatever ambitions @night199uk had for Musicbrainz use replacing the need for music file tagging, they never came to fruition. It needed albums and artists to be uniquely identifiable by name alone, and far too often they just aren't. So possibly the temp nature of this setting was to eventually always have the online info overwrite.
Quote: now i'm kind of curious what magic code we have in kodi to accomplish this:
For Musicbrainz tagged music the idea would have been to have "Prefer online info", "Fetch info on update", and "update lib on start up" all enabled. Then, by magic, any changes in the cloud to artist biogs, dates etc., or album reviews etc. would be propagated into your music lib every time you turn it on.
But there are reasons this does not work in practice, let alone that many users are control freaks who don't actually want their precious music lib details shifting underneath them! They are thrilled the first time it appears, but don't like it to change afterwards.
The first big hurdle was that Kodi made a mess with the default tagging that came out of Picard for multiple artists on songs or albums, and this led to many wanting to turn off mbid tags. Correct mbids are essential for the above magic. Krypton is greatly improved in this respect, but I still would not advise fetching online data by default; the user needs to look at the lib their tagging has created first. There is an overhang too of upgraded databases that have not rescanned the tags (we encourage a rescan but the user can cancel it). The mbid tag mess will haunt Kodi a while yet, and we need to manage that as best we can.
Another issue is server traffic: we hammer Musicbrainz and TADB, and end up with server timeouts. We really need to be more efficient with scraping and try to avoid scraping the same info over and over again. If you have a big music collection then just scanning the hash table to identify changed music files takes a long time, let alone fetching online info etc. Even as an asynchronous process it can be a problem for smaller processors and "respectful" users that see the progress bar and don't like to switch off midway. "What the hell is Kodi doing?" they ask.
I believe that kind of sync of artist and album info with the "wisdom of crowds" data needs to be an elective rather than a default automatic process e.g. user can do so when and if they want, but it does not happen automatically.
One thing I would like to start doing is storing the scraped mbid for albums and artists that did not have them in tags. We need to flag them as scraped rather than from tags because they could be wrong, but it would be more efficient to use the value once we have it. We also could do with offering the user the disambiguation data e.g. 23 artists called "Eclipse", which do they mean? At least when doing a manual scrape of a single artist the user has some chance of picking the correct one from the list they are shown.
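A minimal sketch of the "store it, but flag it as scraped" idea. The `ArtistRecord` class, its `mbid_scraped` field and the `apply_scraped_mbid` helper are hypothetical names for illustration only; they do not reflect the actual Kodi music db schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ArtistRecord:
    # Hypothetical record; the real db layout is not described in this thread.
    name: str
    mbid: Optional[str] = None
    mbid_scraped: bool = False  # True when the id came from a scraper, not from tags

def apply_scraped_mbid(record, scraped_mbid):
    """Keep a tag-derived mbid; otherwise store the scraped one and flag it."""
    if record.mbid is None:
        record.mbid = scraped_mbid
        record.mbid_scraped = True
    return record
```

Keeping the flag separate from the id means a later rescan of the tags could still overwrite a scraped value without losing track of where each id came from.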
RE: The logic of Music scrapers? - ironic_monkey - 2017-02-09 11:20
the cure for the hammering is batch processing. this can only be achieved if we reboot the scraper system completely; drop the xml scrapers, then refine the python APIs to allow batching (i am assuming the servers support it). i did this in the compatible way currently to aid the transition process (and minimize the impact in other parts of the code). at some point the pain needs to be absorbed, but i think it's wise to shake out the bugs from the generic stuff using the current implementation before taking this further.
it would also be very useful if i wasn't the one doing this, you guys maintain this now so you should have a firm understanding of what's going on. that doesn't mean i won't help though!
RE: The logic of Music scrapers? - DaveBlake - 2017-02-09 16:02
(2017-02-09 11:20)ironic_monkey Wrote: it would also be very useful if i wasn't the one doing this, you guys maintain this now so you should have a firm understanding of what's going on. that doesn't mean i won't help though!
Unfortunately my understanding fades as things move further away from the db, but I'm willing to gain that "firm understanding" with a bit of guidance! I have clear ideas on what we want to scrape and when, but I'm very ignorant on how we actually go and get that data. My current knowledge is pitiful, so where to start?
Sure batch processing could help the server load, as it is I believe requests are limited to 1 per second (or something) for similar reasons. But there are other inefficiencies we could fix e.g. where music isn't tagged with mbids Kodi asks Musicbrainz to identify artist named "XXX" every rescrape and is given a mbid it uses on TADB etc. but forgets. Before v17 we were also scraping artist info for artists that the user never saw (because they only listed album artists not all the song artists as well). Better export/import facilities could also reduce the need for re-fetching online data after dropping and re-adding a music source. Not only pace our traffic, but reduce what Kodi demands to just those things it needs.
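Two of the points above, remembering the mbid for a name and pacing the requests, can be sketched together. This is an illustration only: `ThrottledLookup` and the `fetch` callable are hypothetical, and the one second default simply mirrors the "1 per second (or something)" figure mentioned above.

```python
import time

class ThrottledLookup:
    """Cache name -> mbid and space out uncached calls by `min_interval` seconds."""

    def __init__(self, fetch, min_interval=1.0):
        self.fetch = fetch              # hypothetical callable doing the online lookup
        self.min_interval = min_interval
        self.cache = {}
        self.last_call = None

    def mbid_for(self, name):
        # A remembered answer costs no request at all.
        if name in self.cache:
            return self.cache[name]
        # Otherwise wait until the minimum interval since the last request has passed.
        if self.last_call is not None:
            wait = self.min_interval - (time.monotonic() - self.last_call)
            if wait > 0:
                time.sleep(wait)
        self.last_call = time.monotonic()
        self.cache[name] = self.fetch(name)
        return self.cache[name]
```

In a real library the cache would have to be persisted (for example alongside the artist row) for the saving to survive a rescrape.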
Do Python scrapers rely on the JSON API at all to load the fetched data? I ask because there are some issues with JSON setting some of the music data. I really could do with having it spelled out broadly what the change from xml to Python means.
RE: The logic of Music scrapers? - ironic_monkey - 2017-02-09 16:15
the xml scrapers were basically an api which did
1) grab some url
2) pass it through the regular expressions defined in the scraper xml to extract wanted data and return it in a xml-based format.
this process is repeated for each entity (album, artist, movie, tv show, episode, musicvideo). this was the correct tool back in the day when i created this (~ 2006), as repeated execution of python scripts was way too heavy for the humble xbox hardware, and dealing with this stuff in c++ is too tedious. the big problem it always suffered from is that it leads to many requests for the backends - one for each entity.
the point of moving it to python is to make this less restricted, in the sense that you can more easily use multiple data sources for the lookups (it was possible in the xml scrapers as well, but a bit tedious - i "designed" this stuff as we went, and never did a proper reboot on it) as well as deal with more involved APIs and file formats. you can do whatever you want on the python side, be it json or whatever api is offered by whatever backend is available, you can combine as many data sources as you want or whatever. you have the full power of python at hand.
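The two steps can be imitated in a few lines of python. This is a sketch of the idea only, not the real xml scraper engine: the page text is canned instead of fetched, and the `RULES` expressions and the `details` output element are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Canned text standing in for step 1 ("grab some url"); no network access here.
PAGE = "<h1>Cosmic Thing</h1><span class='year'>1989</span>"

# Step 2: expressions of the kind a scraper xml would define (invented for this example).
RULES = {
    "title": r"<h1>(.*?)</h1>",
    "year": r"<span class='year'>(\d{4})</span>",
}

def scrape(page, rules):
    """Run each expression over the page and return the matches as an xml string."""
    root = ET.Element("details")
    for tag, pattern in rules.items():
        match = re.search(pattern, page)
        if match:
            ET.SubElement(root, tag).text = match.group(1)
    return ET.tostring(root, encoding="unicode")
```

Every entity needs its own fetch before `scrape` can run, which is where the one-request-per-entity cost comes from.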
i stuck to the old API as i said - so it's still processed item by item. but if we break this (and thus throw out the xml scrapers), we can process multiple things in one go - e.g. all episodes for a show, all albums for an artist, a batch of artists or whatever.
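The difference between the two styles, sketched under assumptions: `fetch_one` and `fetch_many` are hypothetical callables standing in for backend requests, and whether a given backend offers a batch endpoint at all is not established in this thread.

```python
def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def scrape_item_by_item(items, fetch_one):
    # Current style: one backend request per entity.
    return [fetch_one(item) for item in items]

def scrape_in_batches(items, fetch_many, size):
    # Batched style: one backend request per chunk of entities.
    results = []
    for chunk in batches(items, size):
        results.extend(fetch_many(chunk))
    return results
```

For five items and a batch size of two, the batched version makes three requests where the item-by-item version makes five.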
RE: The logic of Music scrapers? - scott967 - 2017-02-12 05:57
PMFJI, but some of the "one-off" scraper authors (python add-ons artwork downloader, cdArt manager, artist slideshow) probably have some experience / opinions on how scrapers could be handled in python.
RE: The logic of Music scrapers? - ronie - 2017-02-13 03:04
thanx all for the comments!
let me try to chip in with the knowledge i've gathered over the past week when working on the music scrapers...
1) batch scraping
we currently use 4 data sources in the music scraper:
all of them (except allmusic) support musicbrainz id's.
the basic process: the scraper passes a mbid to the server and the server returns the metadata for this mbid.
2) hammering servers
to scrape a single artist + album, we make:
- 4 calls to musicbrainz
- 3 calls to theaudiodb
- 3 calls to allmusic
- 2 calls to fanarttv
so that's 12 lookups in total, guess that's a lot :-)
the process (in detail) is as follows:
this is for music that does not have musicbrainz tags.
even though your music does not have those tags, i would say this will result in pretty accurate results.
the combination of artistname + albumname is unique enough to return the correct info as far as i know.
and since the album scraper returns the artist mbid, you can get accurate results from the artist scraper as well
in case your music IS tagged with musicbrainz id's, the artist and album scraper could skip the initial call to musicbrainz.
that is the only optimization i can think of.
the biggest problem is the 1 call per second limit of the musicbrainz server. this makes scraping utterly slow.
if there wasn't a limit, scraping would be reasonably fast, even though we have to make 12 queries in total.
solution: run our own musicbrainz mirror.
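A back-of-the-envelope sketch of what the numbers above mean for a whole library. The per-source call counts and the 1 call per second limit are taken from this post; the assumption that every artist + album pair costs the full 12 calls (nothing shared or cached) is a simplification for illustration.

```python
# Calls needed to scrape a single artist + album, as listed above.
CALLS = {"musicbrainz": 4, "theaudiodb": 3, "allmusic": 3, "fanarttv": 2}

def total_calls(pairs):
    """Total lookups for `pairs` artist + album pairs, assuming no reuse."""
    return pairs * sum(CALLS.values())

def min_musicbrainz_seconds(pairs, calls_per_second=1.0):
    """Lower bound on time spent waiting on the musicbrainz rate limit alone."""
    return pairs * CALLS["musicbrainz"] / calls_per_second
```

For 1000 pairs that is 12000 lookups, and at least 4000 seconds (roughly 67 minutes) spent on the musicbrainz limit alone, before any of the other three sources are counted.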
as for theaudiodb, i've read (not experienced it myself) they run out of bandwidth at the end of the month
and are basically offline from that point on till the next month starts.
solution: kodi should chip in to keep the metadata sites, we use and depend upon, alive.
3) bugs
i came across another one and added it to the first post... perhaps it's a known bug, i don't know.
for the issues i've reported so far, i have identified the code parts and added them to the first post as well.
i would really like to get bug #2 fixed (not scraping artist info, if album lookup fails).
i've patched the code locally and didn't experience any crash-boom-bangs if we simply proceed with an artist lookup in that particular case.
4) where to go next
for automatic scraping, perhaps we can get rid of the 'first provide a list of results' then 'get details for the first item' procedure?
ideally kodi would pass all the info it has to the album scraper (artistname, artist mbid, albumname, album mbid)
and leave it up to the album scraper to return the correct metadata right away.
same for the artist scraper, if we have them, pass both the artistname and artist mbid and let the artist scraper figure it out.
for manual scraping, we need to keep things as is of course.
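A sketch of what "pass everything and let the scraper figure it out" could look like on the scraper side. The function name, the parameter names and the returned method labels are all hypothetical; this is not the existing scraper API.

```python
def choose_lookup(artistname=None, artist_mbid=None, albumname=None, album_mbid=None):
    """Return (method, key) for the most specific lookup the given info allows."""
    if album_mbid:
        # An album id identifies the release directly, no search step needed.
        return ("album_by_mbid", album_mbid)
    if artist_mbid and albumname:
        # The artist is known exactly, only the title needs matching.
        return ("album_by_artist_mbid_and_title", (artist_mbid, albumname))
    if artistname and albumname:
        # Fall back to a name based search, as for untagged music.
        return ("album_by_names", (artistname, albumname))
    return ("no_lookup", None)
```

The manual path would still need the list of results, so a function like this would only replace the automatic "take the first item" step.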
RE: The logic of Music scrapers? - ronie - 2017-02-13 03:12
@cpt.spiff, at your convenience, could you please have a look at my python scrapers
and check why they produce a decent amount of 'invalid handle' warnings in the log?
the scraper.py file of both scrapers is where the magic happens:
i'm clueless when it comes to plugins :-)
RE: The logic of Music scrapers? - ronie - 2017-02-13 03:28
last questions for tonight.. i promise
about the info the album scraper needs to return:
- 'album.release_date'
do we use this in kodi? it's not stored in the db afaik.
in case it's needed, in what format (localization) does the scraper need to return the date?
- 'album.release_type'
am i correct in understanding we basically always need to return 'album' here?
we're an album scraper.. we only scrape albums..
- 'album.type'
i think this is a subtype to the 'release_type' value?
musicbrainz has the concept of 'primary type' (album) and 'secondary type' (live / soundtrack / compilation / remix)
- 'album.userrating'
pretty sure it's not up to the scraper to provide this info
- 'album.compilation'
is a compilation a 'various artists' album or a 'greatest hits album' by a single artist?
if it's the first, i guess it's up to kodi to determine this based on artist/albumartist tags?
btw. there is no metadata site that provides 'artist.instruments'
perhaps it's worth dropping support for it?
RE: The logic and future of Music scrapers? - DaveBlake - 2017-02-13 12:44
Forgive my total ignorance, but can I just confirm that what we use to fetch album/artist data from online sources is different to what we use to fetch album/artist data from NFO files?
Assuming we are only talking about from online sources...
(2017-02-13 03:28)ronie Wrote: - 'album.release_date'
Not yet, but I am looking to use it in v18, so keep it please.
As a string YYYY-MM-DD, YYYY-MM or YYYY, release dates are often just year or year and month.
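A small sketch of a check for those three forms. The function name and the choice to return a tuple are assumptions for illustration; the range checks are deliberately loose (they do not validate days per month).

```python
import re

# Year, optionally followed by -MM, optionally followed by -DD.
_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")

def parse_release_date(value):
    """Return (year, month, day) with None for missing parts, or None if invalid."""
    match = _DATE.match(value.strip())
    if not match:
        return None
    year, month, day = (int(part) if part else None for part in match.groups())
    if month is not None and not 1 <= month <= 12:
        return None
    if day is not None and not 1 <= day <= 31:
        return None
    return (year, month, day)
```

A scraper could run its value through a check like this before returning it, so that localized forms such as "27/06/1989" never reach the library.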
Quote: - 'album.release_type'
Yes always "album". I don't think we should be trying to scrape it at all. It is an internal flag to separate albums from the fake album entry we make in the db for singles. One fake album entry, release_type = "single", for all the singles by an artist.
Quote: - 'album.type'
No, not a subtype. Really it is an open text value, set to whatever the user thinks the "album type" is, or more often left blank, and available in playlist rules etc. It is initially populated from music file tags TXXX:MUSICBRAINZ ALBUM TYPE (ID3) or RELEASETYPE (Vorbis), if present. Sadly the name of the tag in Vorbis means it is easily confused with our internal flag.
When scraped from Musicbrainz both 'primary type' and 'secondary type' values should be combined e.g. "album / soundtrack", or "EP / Live / remix".
Not sure if other sources provide album type at all, but if they do then put it in here.
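Combining the two values as described is a one-liner; a sketch, with a hypothetical function name and the " / " separator taken from the examples above:

```python
def combine_album_type(primary, secondary=()):
    """Join the primary type and any secondary types into one 'album.type' string."""
    parts = [part for part in [primary, *secondary] if part]
    return " / ".join(parts)
```

For example, `combine_album_type("EP", ["Live", "remix"])` gives "EP / Live / remix", and a release with no secondary types just gives the primary type.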
Quote: - 'album.userrating'
Yes, just rating and votes are from online sources, not userrating.
Unless we start supporting some kind of cloud backup of user values (I think Zag had some idea like that), in which case I guess a scraper would need to fetch them.
Quote: - 'album.compilation'
A compilation can be either (to borrow from MB)
a) a collection of recordings from various old sources (not necessarily released) combined together. For example a "best of", retrospective or rarities type release.
b) a various artists song collection, usually based on a general theme ("Songs for Lovers"), a particular time period ("Hits of 1998"), or some other kind of grouping ("Songs From the Movies", the "Café del Mar" series, etc).
Really an album is a compilation if the user says it is: some people would exclude anthologies, and not all compilations have blank or "various artists" as the album artist tag (some users give the collection name as album artist e.g. "Sunday Express Freebie"). There is a COMPILATION tag, (from iTunes but escaped into common use) and if that is set for the songs then the album is a compilation.
But like some other tags, we scrape the value and if "Prefer online info" is enabled it overrides what the tags said.
Quote: btw. there is no metadata site that provides 'artist.instruments'
Sure, assuming we are just talking online sources. Would still want the facility to scrape this from artist.NFO
RE: The logic and future of Music scrapers? - ironic_monkey - 2017-02-13 15:35
Quote: Forgive my total ignorance, but can I just confirm that what we use to fetch album/artist data from online sources is different to what we use to fetch album/artist data from NFO files?
there are two kinds of nfo files;
1) full nfo files. these hold full information, in the xml format usually returned by the xml scrapers. code wise, this shortcuts the entire scraper process, the xml file is just read and we continue in the code as-if the data came from a scraper.
2) 'url' nfo files. these hold a url to some backend. it was sort-of the manual mbrainz before mbrainz support was added. this is used to shortcut the first two steps in ronie's overview, that is, no search is performed, and thus there is no 'first entry' in a list to guess as match. code wise, this means that we enter the code at the 'grab info stage', skipping the 'search' stage.
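A simplified sketch of telling the two kinds apart. This is not how kodi's nfo handling is implemented; `classify_nfo` is a hypothetical helper that treats anything parsing as xml as a full nfo and otherwise looks for the first url in the text.

```python
import re
import xml.etree.ElementTree as ET

def classify_nfo(text):
    """Return ("full", root_tag), ("url", url) or ("unknown", None)."""
    try:
        # A full nfo is well-formed xml; the scraper stages are skipped entirely.
        root = ET.fromstring(text.strip())
        return ("full", root.tag)
    except ET.ParseError:
        pass
    # A url nfo only carries a link; processing would resume at the 'grab info' stage.
    match = re.search(r"https?://\S+", text)
    if match:
        return ("url", match.group(0))
    return ("unknown", None)
```

The example.org address used in any test of this helper is a placeholder, not a real backend url.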
RE: The logic and future of Music scrapers? - DaveBlake - 2017-02-13 16:01
@ronie thanks for sharing the knowledge you gathered. A few points back (as a heavy music user).
(2017-02-13 03:04)ronie Wrote: even though your music does not have those [Musicbrainz ID] tags, i would say this will result in pretty accurate results.
The uniqueness of artistname + albumname really does depend on what music you have in your collection. Classical music, where there are multiple album artists (composer, conductor, orchestra) and many albums with the same title e.g. "Symphony No. 5", is particularly bad. Also use of artist and albumartist tags varies - composer in album artist or just as (song) artist etc. The scraper happily returns completely the wrong album, and if "Prefer online info" is enabled it will wipe out the accurate artist names derived from tags.
Also Musicbrainz has multiple releases of the same album (artist and album title combination), some with different release dates and bonus tracks etc. Scraping by artistname + albumname just takes the first, which may be inaccurate.
With automated scraping of music without mbid tags we just do our best, but it is fallible and can mess up a perfectly accurate library. Scraping album first (with artistname + albumname so better chance of accuracy) to get artist mbid is a good thing, but just because the scraper found something does not mean it is right.
Quote: 2) hammering servers
What data can the Musicbrainz API return? Do we get more than just mbid?
I think part of the original design was that re-scraping could update info that had changed e.g. someone in the community had added or edited data. So if we are getting anything else from MB then we still need to scrape even when we have a mbid.
Quote: solution: run our own musicbrainz mirror.
+1
But I think we can do some other things to reduce how often we scrape too. Even running mirrors we will soon have the problem that the existing servers do.
We can also stop fetching the track lists for an album, users just don't care, all they want is the songs they have, although I doubt that will help much.
Quote: 3) bugs
I'd like to check that out too with my less scraper friendly music. And think about what happens when we have mbid for artists but not the albums etc. Something is niggling at the back of my mind over a reason why #2 is like it is.
BTW I'm up for working on any of the code on the music side of things. Not a clue with the Python though.
Quote: 4) where to go next
Not so sure about this, but will give it some thought. I wonder if we need to check the artists returned by the album scrape against those we already have (from tags) before we scrape those artists, or I guess the scraper could do that.
There is also the (re-)scraping of filtered list of artists or albums from "Query Info for all". Inaccurate returns mean that I do scrape artists but don't even try albums for some kinds of music.
A new thing I would like to see on the scraper side is return of disambiguation data.
When manual scraping of a single item (from refresh on info dialog) without mbids is inconclusive we get a list of possibles but no disambiguation data to help spot which is the right one. Can we get that back from Musicbrainz? Olympia mentioned that the scraper used to do it, but stopped for some reason.
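A sketch of how disambiguation data could be folded into the list shown to the user. It assumes each candidate is a dict shaped like an artist entry from a Musicbrainz JSON search response, with optional `disambiguation`, `country` and `type` keys; the function name and the sample values used with it are made up for illustration.

```python
def label_candidates(candidates):
    """Build one display label per candidate, appending whatever extra hints exist."""
    labels = []
    for artist in candidates:
        extras = [artist.get(key) for key in ("disambiguation", "country", "type")]
        extras = [extra for extra in extras if extra]
        label = artist["name"]
        if extras:
            label += " (" + ", ".join(extras) + ")"
        labels.append(label)
    return labels
```

Two artists both named "Eclipse" would then show up as, say, "Eclipse (rock band, SE, Group)" and "Eclipse (DJ)" instead of two identical lines.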
Also with artist discography we need to start saving the mbids if they are (or could be) returned. Currently we do a mess match by name when the info dialog happens, makes no sense to me at all. Although the discography for classical music is totally useless - care to guess how many albums Beethoven has on Musicbrainz?
Hope at least some of this is helpful.