Music Album Info - Correctly Merging Gathered Data
#1
Starting a new music thread to discuss album information, in particular how externally scraped (or NFO) data is merged with the data already in the music library as scanned from tags in music files.

The current implementation does not do what people expect or want; see the first page of the thread "Music Album Info - Should displaying it delete tag data".
There is also "BUG - Album information search endless loop", about the long-standing, horrible dialog loop you can get stuck in when the album isn't found, especially if you don't have a keyboard. There is a PR to fix this bug, but it is not complete yet; I hope Evilhamster will be able to get back to it.

I feel we need to improve the way that scraping this additional data is initiated. On the one hand, most users would just want it to happen automatically, and it doesn't. On the other, control of which scraper is used and when it scrapes is awkward for those who do want to specify and control it. This issue is covered by "Follow up to PR8069 - Adding music to the library", so I don't want to include it here.

As a reminder the basic functionality is to have a scraper gather information about albums from external sources, additional to that data gathered from the tags on the song files, store it in the library and then display it. The additional information can also be gathered locally from NFO files instead of online, giving the user control. There is also a setting regarding overriding the tag data with scraped data that should determine data priority.

That is all well and good, but there are design issues around merging data that need consideration.

The album scraper (or NFO) gathers some data that may also have been gleaned from the tags.
a) If we are going to override the tag sourced data with this then it needs to propagate to all the data tables e.g. album_genre not just the genre string of album.
b) Equally if we do not want to override, just fill in the blanks, then data must not be changed by mistake e.g. artists getting deleted and replaced or added.
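
To make a) and b) concrete, here is a minimal sketch of the two merge policies. It is not the actual CAlbum::MergeScrapedAlbum code: `AlbumData`, `MergeField` and `MergeScraped` are hypothetical names invented for illustration, and the vectors only stand in for the rows of link tables such as album_genre.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for an album record (not Kodi's CAlbum).
struct AlbumData {
  std::string title;
  std::vector<std::string> genres;   // stands in for album_genre rows
  std::vector<std::string> artists;  // stands in for album artist rows
  std::string review;                // typically only available from a scraper/NFO
};

// Copy src over dst when overriding, or when dst has no value yet.
// An empty scraped value never wipes out existing data.
template <typename T>
void MergeField(T& dst, const T& src, bool overrideTags) {
  if (src.empty())
    return;
  if (overrideTags || dst.empty())
    dst = src;
}

// Returns the tag-derived record merged with the scraped one.
// overrideTags == true  : policy a), scraped values replace tag values,
//                         including the whole genre and artist lists.
// overrideTags == false : policy b), only blanks are filled in.
AlbumData MergeScraped(AlbumData tagged, const AlbumData& scraped, bool overrideTags) {
  MergeField(tagged.title, scraped.title, overrideTags);
  MergeField(tagged.genres, scraped.genres, overrideTags);
  MergeField(tagged.artists, scraped.artists, overrideTags);
  MergeField(tagged.review, scraped.review, overrideTags);
  return tagged;
}
```

The point of the sketch is that the list-valued fields go through the same rule as the plain strings, so an override replaces the genre list as a whole, and a fill-in-the-blanks merge never touches artists that came from the tags.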

Do we agree? If so, I will attempt to change CAlbum::MergeScrapedAlbum to do that, because the current implementation has bugs.

Now the fiddly details, and less obvious merge decisions.

1) Say we have an album but no Musicbrainz IDs for albums or artists from the tagged music files. Scraping finds the album and returns the MB IDs too. Do we store the MB IDs even if we have said do not override tags? They are useful to uniquely identify tracks, albums and artists, and it would be adding info since they were missing from the tags. So I think we should, or is someone going to object to having MB IDs appear in their library?

2) What if scraping misidentifies the album or, if offered a choice, we select the wrong one? This may not happen much with popular music, but with classical it is quite common: either the scraper only finds one match, or the list of possibles only shows composer name, album title and year, and not conductor and orchestra, making it hard to spot the right one if it is there (showing the album cover would make the choice simpler, but that is a scraper issue). Too late, you see the info dialog showing the wrong data :(

Not so bad, I guess, if b) is working, but really could do with an "undo" or "revert to tag data" button.
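
One way a "revert to tag data" button could work is to keep a snapshot of the tag-derived values from just before scraped data is applied. This is only a sketch under that assumption; `AlbumFields` and `AlbumEntry` are hypothetical names, not existing Kodi classes, and a real version would have to persist the snapshot in the library.

```cpp
#include <string>
#include <utility>

// Hypothetical simplified album fields.
struct AlbumFields {
  std::string title;
  std::string artist;
  std::string genre;
};

// Holds the current library values plus a snapshot of the tag-derived
// values taken the first time scraped data is applied.
class AlbumEntry {
public:
  explicit AlbumEntry(AlbumFields fromTags) : m_current(std::move(fromTags)) {}

  void ApplyScraped(const AlbumFields& scraped) {
    if (!m_hasSnapshot) {
      m_tagSnapshot = m_current;  // remember the tag-only state once
      m_hasSnapshot = true;
    }
    m_current = scraped;
  }

  // Restores the tag-derived values; returns false if nothing was scraped.
  bool RevertToTagData() {
    if (!m_hasSnapshot)
      return false;
    m_current = m_tagSnapshot;
    m_hasSnapshot = false;
    return true;
  }

  const AlbumFields& Current() const { return m_current; }

private:
  AlbumFields m_current;
  AlbumFields m_tagSnapshot;
  bool m_hasSnapshot = false;
};
```

Scraping a second time keeps the original snapshot, so picking the wrong album twice in a row can still be undone back to what the tags said.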

Thoughts please!
#2
Well, there is a problem with MBIDs. Very often there are many MBIDs for a "release", and then there is a separate MBID for the "release group". The scraper logic seems to want to use the release ID, not the release group. If you query by album title, there is no way to select a release ID (assuming there is more than one).

A second question is this note in the wiki on album.nfo files:

Note: Where indicated, tags can accept an optional "clear" boolean (true/false) attribute which clears corresponding scraped data and all values of similar tags up to that point.

I have never had to use this "clear" attribute. But AFAIK, if there is an album.nfo, it is used instead of the assigned scraper.

I only use nfo or scrapers to add data to embedded tags, never change it. I also use album.nfo to avoid scraper failure, even if I don't add anything to the tags.

scott s.
#3
(2015-09-29, 13:18)DaveBlake Wrote: The album scraper (or NFO) gathers some data that may also have been gleaned from the tags.
a) If we are going to override the tag sourced data with this then it needs to propagate to all the data tables e.g. album_genre not just the genre string of album.
b) Equally if we do not want to override, just fill in the blanks, then data must not be changed by mistake e.g. artists getting deleted and replaced or added.

I think so, yes. Personally I would always want to fill in the blanks.

Quote:1) Say we have an album but no Musicbrainz IDs for albums or artists from the tagged music files. Scraping finds the album and returns the MB IDs too. Do we store the MB IDs even if we have said do not override tags? They are useful to uniquely identify tracks, albums and artists, and it would be adding info since they were missing from the tags. So I think we should, or is someone going to object to having MB IDs appear in their library?

Yes, MBIDs should be the key lookup ID to help with any metadata requested in the future.

Quote:2) What if scraping misidentifies the album or, if offered a choice, we select the wrong one? This may not happen much with popular music, but with classical it is quite common: either the scraper only finds one match, or the list of possibles only shows composer name, album title and year, and not conductor and orchestra, making it hard to spot the right one if it is there (showing the album cover would make the choice simpler, but that is a scraper issue). Too late, you see the info dialog showing the wrong data :(

Info >> Search album details. So the user fixes this themselves. That's what I currently do to correct missing metadata.
#4
(2015-09-29, 13:18)DaveBlake Wrote: The current implementation does not do what people expect or want; see the first page of the thread "Music Album Info - Should displaying it delete tag data".
There is also "BUG - Album information search endless loop", about the long-standing, horrible dialog loop you can get stuck in when the album isn't found, especially if you don't have a keyboard. There is a PR to fix this bug, but it is not complete yet; I hope Evilhamster will be able to get back to it.

It can be merged as is, but I'm considering whether I should also add a counter so that you only get the "change album information" prompt once when searching.

(2015-09-29, 13:18)DaveBlake Wrote: As a reminder the basic functionality is to have a scraper gather information about albums from external sources, additional to that data gathered from the tags on the song files, store it in the library and then display it. The additional information can also be gathered locally from NFO files instead of online, giving the user control. There is also a setting regarding overriding the tag data with scraped data that should determine data priority.

That is all well and good, but there are design issues around merging data that need consideration.

The album scraper (or NFO) gathers some data that may also have been gleaned from the tags.
a) If we are going to override the tag sourced data with this then it needs to propagate to all the data tables e.g. album_genre not just the genre string of album.
b) Equally if we do not want to override, just fill in the blanks, then data must not be changed by mistake e.g. artists getting deleted and replaced or added.

a) Totally agree
b) This makes sense to me. There might be some corner cases we need to take care of, but as a generic approach it makes sense. If this is done, the rescan process would need to change, since it currently scans based on the old information and not the information in the tags (to my knowledge). So we would need to add some code to revert the information in the library to the tagged information before scanning (and maybe add a GUI option to do the same).

(2015-09-29, 13:18)DaveBlake Wrote: Now the fiddly details, and less obvious merge decisions.

1) Say we have an album but no Musicbrainz IDs for albums or artists from the tagged music files. Scraping finds the album and returns the MB IDs too. Do we store the MB IDs even if we have said do not override tags? They are useful to uniquely identify tracks, albums and artists, and it would be adding info since they were missing from the tags. So I think we should, or is someone going to object to having MB IDs appear in their library?

2) What if scraping misidentifies the album or, if offered a choice, we select the wrong one? This may not happen much with popular music, but with classical it is quite common: either the scraper only finds one match, or the list of possibles only shows composer name, album title and year, and not conductor and orchestra, making it hard to spot the right one if it is there (showing the album cover would make the choice simpler, but that is a scraper issue). Too late, you see the info dialog showing the wrong data :(

Not so bad, I guess, if b) is working, but really could do with an "undo" or "revert to tag data" button.

Thoughts please!

1) I think it makes sense to save the MusicBrainz ID if a MusicBrainz scraper was used.
2) Make sure it has the correct MusicBrainz IDs and this should be a non-issue :) But adding undo or revert to tag data would make sense.

As fixes for the current solution, these steps sound good and would improve the situation.
CUtil::AlbumRelevance

One more thing that might be good to do is to rework the code in CMusicInfoScanner::DownloadAlbumInfo, where we set a relevance value for hits to decide whether they are accepted or not. The problem is that fstrcmp is used to compare the album name and artist name strings (the information from Kodi is a joined string), which has several problems in cases like:
* Artists that use a name like "lastname, firstname" while the other source uses "firstname lastname"
* Artists whose names are written in different locales on the two sides (for example Ленинград or Leningrad; a classical example would be Александр Порфирьевич Бородин or Alexander Porfyrevich Borodin)
* Artists where a different alias was used in the artist field.
* Different information in the artist field (for classical this could be whether or not the orchestra and composer are included in the artist name).

I would guess that this causes some problems for classical music as well.
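
As a rough illustration of the first case only, a comparison on normalised name tokens would treat "lastname, firstname" and "firstname lastname" as the same artist before any fuzzy matching is applied. This is a sketch, not existing Kodi code: `NameTokens` and `SameNameIgnoringOrder` are hypothetical helpers, and they do nothing for transliterated names or aliases, which would need data from the scraper (for example alias lists) rather than string tricks.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Lower-cases ASCII letters, turns ASCII punctuation into spaces and
// returns the whitespace-separated tokens in sorted order.
// Bytes >= 0x80 (UTF-8 multi-byte sequences) are passed through untouched.
std::vector<std::string> NameTokens(const std::string& name) {
  std::string cleaned;
  for (unsigned char c : name) {
    if (c < 0x80 && std::ispunct(c))
      cleaned += ' ';
    else if (c < 0x80)
      cleaned += static_cast<char>(std::tolower(c));
    else
      cleaned += static_cast<char>(c);
  }

  std::istringstream in(cleaned);
  std::vector<std::string> tokens;
  for (std::string tok; in >> tok;)
    tokens.push_back(tok);
  std::sort(tokens.begin(), tokens.end());
  return tokens;
}

// True when both names consist of the same words, regardless of word
// order, ASCII case and ASCII punctuation.
bool SameNameIgnoringOrder(const std::string& a, const std::string& b) {
  return NameTokens(a) == NameTokens(b);
}
```

A check like this could run before the fuzzy comparison, so that an exact match on reordered words is accepted without depending on the relevance threshold.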

It would be interesting to look into Heimdall and see how much work would be required to get something like that working for the music side of things.