The logic and future of Music scrapers?
#46
I think that PR12120 makes the core changes needed to address points #1 & #2 of the first post.

A reminder:

(2017-02-07, 17:45)ronie Wrote: 1) if the 'prefer online info' setting is disabled, we pass the artistname to the artist scraper.
if the setting is enabled, we pass the artist mbid to the scraper.

why don't we always pass the mbid (if available) regardless of this setting?

2) if the album scraper returns no results, we completely skip the artist scraper. why?

3) if the 'prefer online info' setting is enabled, and 'show song and album artists' is enabled:
this causes the same artist being listed twice in your library if the artistname in your tags does not 100% match the artistname the scraper returns.

for instance "The B-52's" vs. "The B-52s":
3.1) i have all songs of an album tagged with artist "The B-52's"
3.2) we start the album scanner and it returns the mbid for this artist
3.3) we pass this mbid to the artist scraper and it returns info for "The B-52s" and kodi adds it to the db.
3.4) kodi now scans all songs for 'additional' artists. it finds "The B-52's" and checks if it's already in the db... nope
3.5) we pass "The B-52's" to the artist scraper and it returns info for whatever closest match it can find and kodi adds this artist to the db

ref: https://github.com/xbmc/xbmc/blob/99c25f...#L843-L883

#1. With the PR scraped album and artist mbids are stored. Hence having scraped an album, the mbids for the artists returned with the results are added to the db (if we didn't have them) and available for scraping the album artists, or for other addons to use to fetch art etc.

Of course the artists returned by the scraper can differ from those in the library. This could happen with the user has tagged with mbids but then edited (they do that), but even with album lookup by album and artist name this can happen too, especially with classical music [My example is Dvorak Symphony No. 7]. In that case when "perfer online nfo" is disabled, we can still store any returned mbids where the names match.

#2. I think the "why" was to avoid repeatedly trying to get artist data that Musicbrainz didn't have. For example say I have 3 albums by artist1 none of them in MB database, on scanning it would check the albums and go no further. Otherwise it would have requested artist for evey album and every song. Anyway with the PR it does scrape artists even if the album fails, but uses a list to ensure each attempt is made only once per run.

#3 I'm not so sure I have resolved this and need to do more testing. But I'm not sure I repeat it either. Assuming that the music has no mbid in tags, I just get The B52's replaced with whaever the scraper finds e.g. "The 'B' Girls". It mis-identifies wildly, not spotting "The B52's" is "The B52s"
Reply
#47
Having a tedious time testing as getting lots of 503 server timeout errors.

This has made me wonder if, with refresh (from the info dialog) at least, on a failed lookup by name we should separate the returns between
a) Musicbrainz server accessed but item not found
and
b) The scraping process failed for technical reasons - 503 errors mostly

At the moment both scenarious return INFO_NOT_FOUND and let the user re-enter the names. While this is worthwhile for a), it is misleading for b).

In the b) case it would be better to tell the user that the timeout happened and try again later. I went of to check the log to see what happened, but the normal user isn't going to do that. They just get pissed that Kodi can't find images and info etc.

It also makes me wonder about #2 in previous post. Say no mbids and you get a 503 on album lookup, then the subsequent artist lookup will be name only. Even if successfull it is less accurate than the album lookup, and more likely to get the wrong artist. Perhaps another reason the original approach did not attempt artist if the album failed?

Again a split return makes more sense to me. If album 503s then don't attempt artist, if album is just unknown to Musicbarinz, then do lookup artist on just name, it is the best we can do.

Thoughts anyone?
Reply
#48
i don't think album being found was used as a sanity check. there is a scoring that uses a weighted fuzzy match on artist, album and year, and a threshold for this (0.9 or something, do not recall) for an album to be auto-chosen. artist is more wonky as all you had was the name. always was.

it makes perfect sense to distinguish error codes, just wasn't a thing back when the code was written (it was true or false).
Reply
#49
Thanks for input Spiff.

So it may not have been a sanity check but maybe should be one now? If we save artist mbids returned by a successfull album lookup and can use them to subsequently lookup artist, then perhaps it is better for the auto scraping (fetch on library update) to skip the artist until we get the album and it can do a better job? I feel that the wrong artist info is far worse than none yet, for mssing stuff go try again (manualy) later. But that is just me maybe, OCD about controlling my lib contents?

Or even always retry immediately on 503 error? Picard hits the same server overload issue as Kodi, their devs are thinking about doing this see https://tickets.metabrainz.org/browse/PICARD-807

I like the idea of distingishing error codes, but not so sure about implementation. How generic does it need to be? CCurlFile::FillBuffer raises the "HTTP returned error 503" error, but getting it flagged back to the scraper level is not so obvious (scared to mess with CCurlFile).
Reply
#50
i agree, with mbid it's better to delay so we can have more info. and i totally share your ocd, before the days of spotify i had nfo files for everything Wink

it will be slightly tricky to propagate the error code across the interfaces. you will not have access to the file instance you need at scanner level since it will be buried in several layers. i think the least invasive approach is hooking up some callback interface in CCurlFile and then flagging the enabling of this through e.g. a protocol option. only naughty bit with this design is that the callback class would likely have to sit in some global instance.
Reply
#51
Much to my surprize, the 503 error already reaches the CScraper::GetArtistDetails and GetAlbumDetails in fcurl.m_httpresponse, see
https://github.com/xbmc/xbmc/blob/master....cpp#L1360
That is the error I want to capture and treat differently the most.

But where to add any retries? My initial feeling is somewhere in CMusicInfoScanner rather than in CScraper, and that would mean the scraper indicating that 503 was the reason for failure. At the moment it is just a bool CMusicInfoScraper::Succeeded. Have GetArtistDetails and GetAlbumDetails set a property when failure happens that gives httpresponse?

Any thoughts Spiff?

BTW I like the comment by CScraper::RunNoThrow "don't use in new code; errors should be handled appropriately"
https://github.com/xbmc/xbmc/blob/master...r.cpp#L316
Reply
#52
yes, it needs to be in MusicInfoScanner. probably somewhere in :Tonguerocess().
Reply
#53
Many balls in the air....

Looking at #3 of the OP, the questions from http://forum.kodi.tv/showthread.php?tid=...pid2590360, and chatting with @night199uk on Slack I think I have an idea about a way forwards.

Reminder:
(2017-02-07, 17:45)ronie Wrote: 3) if the 'prefer online info' setting is enabled, and 'show song and album artists' is enabled:
this causes the same artist being listed twice in your library if the artistname in your tags does not 100% match the artistname the scraper returns.

for instance "The B-52's" vs. "The B-52s":
3.1) i have all songs of an album tagged with artist "The B-52's"
3.2) we start the album scanner and it returns the mbid for this artist
3.3) we pass this mbid to the artist scraper and it returns info for "The B-52s" and kodi adds it to the db.
3.4) kodi now scans all songs for 'additional' artists. it finds "The B-52's" and checks if it's already in the db... nope
3.5) we pass "The B-52's" to the artist scraper and it returns info for whatever closest match it can find and kodi adds this artist to the db

ref: https://github.com/xbmc/xbmc/blob/99c25f...#L843-L883

The current merge of album artist(s) when "prefer online info" is enabled is flawed. The artist data derived from tags is only partialy overwritten, and with storing/using scraped MBIDs it gets even more messy. The original idea was for everything in the library to have MBIDs from tags, and then use scraping with "prefer online info" to fetch the lastest info from MB including if the artist credits had been changed. To do that correctly the merge process needs extending to cover song artists too.

But I don't think we can sensibly do that for items where we lookup by name only and thus scrape the MBID, with possible inaccuracies and partial results due to 503 timeouts etc.. Sure we can fetch info like dates, bio, styles etc. and artwork, but overwriting the artist credits is too big an impact.

A "my tagging is a mess, build me a library" facility will need some kind of acoustic fingerprint approach, lookup on names alone isn't it. Meanwhile people can use Picard, and if that can't tag their music accurately with MBID then Kodi isn't going to be able to identify the MBIDs by scraping.

So the solution, is either a rework of "prefer online info" into 2 settings:
a) one to overwrite artist or album data (but not the album artist credits) derived from tags with that scraped (from online or NFO);

b)the other to indicate that music tagged with MBID can have the artist credits for both albums and songs updated, including the artist names, updated based on the MBID tags, ignoring what other tag values e.g. artist, albumartist etc. may say.

EDIT: or when "prefer online info" is enabled AND we have MBIDs from tags, then it causes the artists credits to be updated.

May try to add this, along with 503 retries, to PR12120
Reply
#54
(2017-02-14, 03:53)ronie Wrote:
(2017-02-13, 16:01)DaveBlake Wrote: We can also stop fetching the track lists for an album, users just don't care, all they want is the songs they have, although I doubt that will help much.

correct, i haven't implemented it, as the recent forum discussion on this subject seems to indicate users prefer to see a list of the actual tracks they have in their library instead of a scraped tracklist.

@ronie after chatting with @night199uk I have a slightly different take on the reasons for scraping album tracks. I now believe that we should still fetch these if the album has a MBID derived from tags (not name lookup).

They are not for display on the album info dialog, users just want to see the tracks they actually have, so not for storing either (the albuminfosong table can go).

The purpose is for music files that are fully tagged with mbids, combined with "Prefer online info" enabled, that scraping can update even the song titles, and album and song artist credits with those in the Musicbrainz database. It means that users could update their collection with the latest crowd wisdom from MB without having to go though the stage of refreshing the tags in Picard.

It was not intended to unscramble poorly tagged music using name lookup, that soon gets messy. And there were flaws in the implementation which meant that things could get overwritten when they shouldn't. I stopped some of that in v16, but now I know what it was trying to do.

So could tracks still be implemented in the Python scraper after all please. Smile
Also could they have strMusicbrainzTrackID fetched and set (so the merge process can check it is merging the correct song)
Reply
#55
@spiff are you about?

I think you agreed on the sense of varying scarper behaviour when the response from Muscbrainz was "Server Unavailable". Not being able to get data because of server issues is not the same thing as there not being any data for the album or artist. When we get a 503 some of the things we may want to do are:

a) retry a bit later
b) if album falis with 503 then skip trying the artist (inaccurately), but if album genuinely is not found then do try to scrape the artist.
c) Let the user know when refresh from info dialog has a server issue opposed to album or artist not found and prompting for entry of names.

I have this working for the xml scraper. But how would you suggest the Python scraper API is changed to allow a similar thing with that?
Reply
#56
there are two ways to go about it.

1) do nothing and leave it to the python code (probably suboptimal)
or
2) return an error code in a similar way, attached as a property to an item to stay within current api.
Reply
#57
Thanks @spiff, I am looking into getting the Python scraper to return an error code, getting far more into Python than I intended. I'm sure I will need rescuing at some point @ronie

Another general design question about album scraping:

From tags we store musicbrainz release ids. Different album releases of the same release group can have different record labels, extra tracks, release dates, artwork etc., and (in theory at least) users can have different releases of the same release group in their music library as different albums (same title and artist but different mbid).

But TADB, and Fanart.tv use Musicbrainz release group id for their lookups. Given album title and artist name(s), or musicbrainz release id, the xml scraper goes off and scrapes from MB first getting release group id that is then used to lookup data elsewhere.

Storing scraped release ids can save repeated lookup on title and artist at MB if we don't have the release id from tags. PR:12120 implements that. However, to say get extra art or rescrape from TADB, it needs release group id. So do we start holding that in the db as well? We could even capture it from tags (if music has been tagged with Picard then it will be there waiting to be scanned). It would mean that we could scrape MB less (and so drop traffic to their server).

At the moment the reasons to scrape from MB are:
  • to fetch ids we don't have
  • to fetch record label and release date
  • optionally to fetch rating (not sure how loved/used MB is as a source of ratings compared to TADB)
  • to fetch track listings, that are pretty useless compared to having the songs you have on the album info dialog
  • with "Prefer Online Info" enabled and fully mbid tagged music, to update with the latest edits of artist names and track titles (if these have changed in some way). Using this to unscamble the artists and songs of badly tagged music does not work, and was not the original intension.

So I propose we store both the release id and release group id for albums, and only bother MB to get them if not already there, and make then available to skins/addons/JSON so any additional looksups elsewhere don't need to hit MB.

Make sense, or has looking at scraping warped my brain?
Reply
#58
make complete sense to me. one more item in the db doesn't kill us and it if limits network traffic all parties are happier.
Reply
#59
Sometimes I think I am going backwards with this....

Testing has finally made me realise that Kodi is not internally throttling the requests in makes to Musicbrainz after all, so album scraping in particular often hits the IP address specific rate limit of 1 per sec and gets server 503 errors.

Hitting the limiter is very frustrating, it took me 15 attempts to scrape 850 albums succesfully.

Sure there is a sleep(2000) in CMusicInfoScanner that initially appears to throttle the requests to MB, and the comments in the code suggest that it is enough, but actually the album scraper makes additional requests by another route without any wait. The scraper XML can define "chains" of requests that are executed without returning to that level, hence any limiting needs to be applied in CScraper so that every request to MB, and only MB, can be throttled.

Does putting specific code for the Musicbrainz url into CScraper upset anyone? The Python scarpers can be responsible for their own throttling, but the xml scraper needs something in core code. Or are we just going to say, tough live with hitting the limiter until the xml scraper dies?

Not liking having to parse the url looking for Musicbrainz, so could a tag be added to the XML so albumuniversal.xml can indicate in the regexp that the url requires a limiter. But maybe I am approaching this all wrong?

Thoughts on this welcome Smile
Reply
#60
PR to fix the Musicbrainz throttling for the old xml scapers here https://github.com/xbmc/xbmc/pull/12402
Went for the simple approach.
Reply

Logout Mark Read Team Forum Stats Members Help
The logic and future of Music scrapers?0