Python based music scrapers design for v19
#1
Starting this thread to capture some of the design discussion that has been happening on team Slack relating to 15665 (PR), the new Python based music scrapers - it is possibly easier to follow here, and gives a wider audience a chance to contribute. However, this is a design thread; please report new scraper testing results here 342025 (thread), and be prepared to have your posts split off for clarity if necessary.

There is also an historic thread about music scraping here 306218 (thread), but much of what was discussed then has been superseded.
#2
Artist Scraper
Data Fetching Strategy

The overall approach taken with the artist scraper is:
If we don't have an mbid
        try to find the artist by name at Musicbrainz to get an ID
        failing that, fall back to Discogs (getting a Discogs id?)
Get all details of the artist from all possible sources - Musicbrainz, Discogs, TADB, fanart.tv and allmusic.
Merge the results, least accurate first.
Finally apply user preferences for the source of each data value.
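As a rough illustration of the merge step, the flow above might look like this in Python. This is a sketch only, not the scraper's actual code; the source ordering and the helper name are assumptions:

```python
# Illustrative sketch of the "merge least accurate first" step above.
# The source order below is an assumption; the real scraper defines
# its own ranking.
SOURCES = ['allmusic', 'discogs', 'tadb', 'fanarttv', 'musicbrainz']

def merge_results(per_source):
    """Merge one dict of scraped fields per source; later (more
    accurate) sources overwrite earlier ones, and empty values never
    overwrite existing data."""
    merged = {}
    for source in SOURCES:
        data = per_source.get(source, {})
        merged.update({k: v for k, v in data.items() if v})
    return merged
```

User preferences for individual fields would then be applied on top of this merged dict, as the final step in the list.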

@ronie a few questions on that strategy
  1. What is used to look up artist details on the TADB, fanart.tv and allmusic sites - MBID, name, some other info from Musicbrainz, Discogs id (or some other info from Discogs)?
  2. How do we ensure the results from all the different sites are for the same artist? Name alone is not enough.
  3. How do the preferred source sites for each value get applied after merging the results? I just don't see it.

Optional Sites
We also discussed making it optional not only to prefer certain source sites, but also to pick the sites accessed. Have "look everywhere" as the default, but let those with more specialist collections, or maybe those that just want, say, art quickly, have the option to adjust where (other than Musicbrainz falling back to Discogs) data is retrieved from.

I can see the sense in requesting all data from all remote sites - how is the user meant to know which sites have data for that artist? However I was concerned that it meant extra network traffic, extra load on source sites and slower scraping of unwanted data. @ronie pointed out that none of that is true: the XML scraper also makes requests to all sites, and the only slowness is caused by the Musicbrainz throttling (1 request per second), so fetching all the data we can while we are requesting makes sense.

However I think the user does need a way to avoid certain data and/or certain sites. For example, the artist style, mood and genre data can be an utter mess - you may not want it at all, or only from the one site that has better quality values (in your subjective view) - and the discographies returned for classical composers can be pretty useless too.

Skipping requests to unwanted sites could also reduce network traffic and load on those sites, and maybe even make scraping a little quicker across thousands of artists.

Merging Scraped Artist Data with What We Already Have
Of course a totally different approach would be to let the scraper return everything every time, and place control over how the new scraped data is merged with the previously scraped data into core or some Kodi settings (rather than addon settings). Would that be a good thing for consistency, or just confusing for users to have those settings somewhere separate from the scraper?

The current approach to merging artist data is that, with the exception of mbid and name (which as identifiers are treated differently), the new values replace the old values. If a new value is empty then the field gets cleared; sometimes users will refresh with fetching of some fields turned off to clean out garbage they accidentally scraped before - why did I get genres, yuck! But to achieve that you need a scraper option not to fetch some data values - perhaps put "none" in the preferred source list?
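The replace-and-clear behaviour described here could be sketched like this (the field names and helper are hypothetical; the real merge happens in core, not in Python):

```python
# Sketch of the core merge behaviour: scraped values replace stored
# ones, and an empty scraped value clears the field. mbid and name are
# identifiers, so they are skipped here rather than blindly replaced.
IDENTIFIERS = {'mbid', 'name'}

def merge_into_library(existing, scraped):
    updated = dict(existing)
    for field, value in scraped.items():
        if field in IDENTIFIERS:
            continue            # identifiers are handled separately
        updated[field] = value  # an empty value clears the old one
    return updated
```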
#3
@Karellen asked
Quote:With the new python scrapers, is it possible to separate scraping artwork and metadata?
I know that the facility to refresh just possible art without picking up any other remote data is something users have asked for. @ronie it would be good if somewhere between scraper and core we could achieve that. Just to clarify: the "Prefer online info" and "Fetch additional info on update" settings have nothing to do with art.

The scraper fetches a list of possible art of a variety of types from remote sites. In the past only certain types of art have been filtered from the request results and returned to core (to be added to the db). I think that we should start keeping the full list of what is available. This is not the same as setting the actual art: core uses the stored list of possible art to fill in any gaps in thumbs/fanart that have not already been filled by local art (it also finds local art as part of tag scanning, not scraping). So if fanart.tv starts returning artist "mugshots" they just get included in the list of possible art; later, if the user has a skin that shows mugshots, Kodi can pick the first art with aspect="mugshot" to fill that gap and set it as art (which then gets fetched and cached). So art types should not be a scraper option, nor should the scraper limit what art types are saved in the list of possible art.
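The gap-filling step described here might be sketched like so (the data shapes are assumptions; in Kodi this logic lives in core, not the scraper):

```python
# Sketch of gap filling: given the stored list of possible art (each
# entry with an 'aspect' and a 'url'), set any art type the artist does
# not already have to the first candidate of that aspect. Existing art
# (e.g. local art found during tag scanning) is never overwritten.
def fill_art_gaps(current_art, possible_art):
    art = dict(current_art)
    for candidate in possible_art:
        aspect = candidate['aspect']
        if aspect not in art:   # only fill gaps
            art[aspect] = candidate['url']
    return art
```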

Obviously we could have an option or options in the scraper that limit it to just fetching art. Only asking the sites that return art could perhaps make scraping faster? But the way scraped artist data is merged in core would also need to know to only set possible art URLs, and not wipe the rest of the existing metadata. The process that fills the gaps - setting the art to the first value of that type in the list if the artist does not already have that type of art - still needs to run.

At the same time users would also like to be able to pick up local art that they have added since the artist was first scanned into the library. Fetching local art is a job for core, but it has a logical tie-in with fetching art generally. It also has a cross-over with the job performed by addons like Artist Beef - @rmrector I think you may have a contribution to make to this thread.

So achieving this facility needs a meet in the middle approach.
#4
I also had the misfortune of my internet cutting out in the middle of a find artist. I tried the cancel button it was seductively offering me, and of course Kodi just hung. Stuck somewhere in the Python? Somewhere in the scraper process? I don't know, but it would be nice to solve that kind of thing too.
#5
(2019-03-15, 20:49)DaveBlake Wrote: Artist Scraper
Data Fetching Strategy


@ronie a few questions on that strategy
  1. What is used to look up artist details on the TADB, fanart.tv and allmusic sites - MBID, name, some other info from Musicbrainz, Discogs id (or some other info from Discogs)?
  2. How do we ensure the results from all the different sites are for the same artist? Name alone is not enough.
  3. How do the preferred source sites for each value get applied after merging the results? I just don't see it.

1) currently:
  • TADB - musicbrainz id
  • fanart.tv - musicbrainz id
  • allmusic - artist name
  • discogs - artist name

2) as discussed on the PR, i'll change allmusic/discogs to use the urls provided by musicbrainz for 100% accurate results.

3) at the end of compile_results() we call user_prefs()
this will overwrite certain parts of the compiled results with the data from your preferred metadata site.
https://gitlab.com/ronie/metadata.integr...y#L270-296
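In outline, the preference pass works something like this (hypothetical data shapes and helper name; see the linked user_prefs() for the real implementation):

```python
# Sketch: after compiling results, overwrite each field that has a
# preferred site configured with that site's value, when it returned one.
def apply_user_prefs(compiled, per_source, prefs):
    result = dict(compiled)
    for field, site in prefs.items():
        value = per_source.get(site, {}).get(field)
        if value:               # keep the compiled value otherwise
            result[field] = value
    return result
```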


(2019-03-15, 20:49)DaveBlake Wrote: Optional Sites
However I think the user does need a way to avoid certain data and/or certain sites. For example, the artist style, mood and genre data can be an utter mess - you may not want it at all, or only from the one site that has better quality values (in your subjective view) - and the discographies returned for classical composers can be pretty useless too.

well, we fetch at least 20 different items (genre, biography, style, born, fanart, etc...) from up to 5 different sites.
so that would need at least 20 settings to configure your preferred site for each of those items.

then another 20 settings to configure a fallback (like the universal scraper has) in case the preferred site does not return anything.
and i'm sure some would prefer a 2nd or 3rd fallback as well.

for me this falls in the category 'hey, i don't like that third pixel on the left, please add a setting so i can disable it'.
of course, i'm willing to add a setting where it makes sense, but at the same time i try to keep it simple, stupid.


(2019-03-15, 20:49)DaveBlake Wrote: Merging Scraped Artist Data with what already have
Of course a totally different approach would be to let the scraper return everything every time, and place control over how the new scraped data is merged with the previously scraped data into core or some Kodi settings (rather than addon settings). Would that be a good thing for consistency, or just confusing for users to have those settings somewhere separate from the scraper?

The current approach to merging artist data is that, with the exception of mbid and name (which as identifiers are treated differently), the new values replace the old values. If a new value is empty then the field gets cleared; sometimes users will refresh with fetching of some fields turned off to clean out garbage they accidentally scraped before - why did I get genres, yuck! But to achieve that you need a scraper option not to fetch some data values - perhaps put "none" in the preferred source list?
 
if adding a 'None' option would make sense (it doesn't to me) i can add it.
as for the rest, we already have a system in place for utter fine-tuning, it's called .nfo files ;-)
Do not PM or e-mail Team-Kodi members directly asking for support.
Always read the Forum rules, Kodi online-manual, FAQ, Help and Search the forum before posting.
#6
(2019-03-15, 21:19)DaveBlake Wrote: I know that the facility to refresh just possible art without picking up any other remote data is something users have asked for. @ronie it would be good if somewhere between scraper and core we could achieve that.
hmm.. i guess it can be accomplished without the need to make any changes to the scraper.
add a 'scrape artwork' option somewhere in kodi, and have kodi ignore all data the scraper returns except for artwork?

(2019-03-15, 21:19)DaveBlake Wrote: So if fanart.tv starts returning artist "mugshots" they just get included in the list of possible art; later, if the user has a skin that shows mugshots, Kodi can pick the first art with aspect="mugshot" to fill that gap and set it as art (which then gets fetched and cached). So art types should not be a scraper option, nor should the scraper limit what art types are saved in the list of possible art.
if a metadata site starts returning artist "mugshots" they can not be 'magically' scraped by the scraper. the code to scrape 'mugshots' needs to be added to the scraper.

(2019-03-15, 21:19)DaveBlake Wrote: Obviously we could have an option or options in the scraper that limit it to just fetching art. Only asking the sites that return art could make scraping faster perhaps?
all sites we scrape, except musicbrainz, also return artwork. but in case we don't have a musicbrainz id, we need to query them as well.
#7
(2019-03-15, 21:39)DaveBlake Wrote: I also had the misfortune of my internet cutting out in the middle of a find artist. I tried the cancel button it was seductively offering me, and of course Kodi just hung.  Stuck somewhere in the Python? Somewhere in the scraper process? I don't know, but it would be nice to solve that kind of thing too
 Yes, the scraper is likely waiting for a response from one of the metadata sites, i can add some code to make it time-out a bit faster.
(by default, the time for a connection to time out is defined by the OS)
#8
So for art in nfo files, is <thumb aspect="some_arttype"> the "official" way to specify art?  Is "thumb" or "thumbnail" as an arttype in scraping dead (I guess only used for the auto-generated video framegrab, so N/A for music)?

scott s.
.
maintainer of skin  Aeon MQ5 mods for post-Gotham Kodi releases:
Krypton
Leia
#9
(2019-03-16, 02:47)scott967 Wrote: So for art in nfo files, is <thumb aspect="some_arttype"> the "official" way to specify art?
yup, that's what kodi uses when you export your library to .nfo files.

(2019-03-16, 02:47)scott967 Wrote: Is "thumb" or "thumbnail" as an arttype in scraping dead (I guess only used for the auto-generated video framegrab, so N/A for music)?
dunno for videos, but in music aspect="thumb" is used for both artist and album thumbnails.
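For illustration, a fragment of what an exported artist.nfo might contain, using the aspect attribute discussed above (the artist name and URLs are placeholders, not real values):

```xml
<artist>
    <name>Example Artist</name>
    <thumb aspect="thumb">https://example.com/artist-thumb.jpg</thumb>
    <thumb aspect="fanart">https://example.com/artist-fanart.jpg</thumb>
</artist>
```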
#10
(2019-03-15, 21:39)DaveBlake Wrote: I also had the misfortune of my internet cutting out in the middle of a find artist. I tried the cancel button it was seductively offering me, and of course Kodi just hung. Stuck somewhere in the Python? Somewhere in the scraper process? I don't know, but it would be nice to solve that kind of thing too
(2019-03-16, 02:19)ronie Wrote:  Yes, the scraper is likely waiting for a response from one of the metadata sites, i can add some code to make it time-out a bit faster.
(by default, the time for a connection to time out is defined by the OS) 
That isn't the issue - my unstable internet connection is providing a good opportunity to test it!

I can see the server timeout in the logs:
2019-03-16 11:46:20.165 T:53860 WARNING: CCurlFile::FillBuffer - Reconnect, (re)try 1
2019-03-16 11:46:50.164 T:53860 ERROR: CCurlFile::FillBuffer - Failed: Timeout was reached(28)
2019-03-16 11:46:50.164 T:53860 WARNING: CCurlFile::FillBuffer - Reconnect, (re)try 2
2019-03-16 11:47:20.162 T:53860 ERROR: CCurlFile::FillBuffer - Failed: Timeout was reached(28)

then it just sits with the "Searching Artist" dialog displayed (I waited over an hour while I was connectionless). Click cancel and it hangs saying "Searching Artist: Cancelling..." in the dialog title. In debug I can see it is stuck in CPluginDirectory::WaitOnScriptResult waiting for the directory fetch to complete, end, or be cancelled.

The addon script isn't cancelling, and I'm not sure how to get it to do so when the progress dialog put up by the core scraper code is cancelled. This will be a problem for all Python scrapers. Any ideas @ronie?
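One generic pattern that could help on the addon side, assuming the hang is in a blocking wait: break long waits into short polls and check for cancellation between them. In a Kodi addon the abort check would come from something like xbmc.Monitor().abortRequested(); the sketch below uses plain callables so it stands alone, and the helper name is made up:

```python
# Sketch: poll for completion in small steps so a cancel/abort flag is
# noticed quickly, instead of blocking until the OS gives up.
import time

def wait_cancellable(done, aborted, poll=0.1, timeout=30.0):
    """Return True once done() is true; False on abort or timeout."""
    waited = 0.0
    while waited < timeout:
        if done():
            return True
        if aborted():
            return False
        time.sleep(poll)
        waited += poll
    return False
```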
#11
As you all know, music tagging is a deeply personal thing, especially when it comes to things like genre. Even @DaveBlake, whom I perceive to be a person who values logic over emotion, commented in another thread something to the effect that musicbrainz genre choices would drive him crazy.

That said, the music and artist scraper settings offer a wide variety of sources-to-tags matching. This makes for great freedom and the possibility of individualization with the given automated processing, but it also requires a lot of research and in-depth knowledge of the offered services and their quirks for optimal results.

Therefore I would like to suggest a kind of preview feature, available right in the settings, that could work like this:

a) The user submits an album of their choice and can immediately see the results - and the differences - when cycling through the various service providers.
b) Alternatively, the scraper shows the differences in settings for one or more predefined, well-known albums of a certain genre.

What I would like to see achieved is that end users immediately see that some services tag, e.g., genre as just Rock-Pop while other services offer much more fine-grained results. The same would be true for other tags: the detail of the biography, the list of album releases, and so on.

An example preview would be the way to go, if at all programmatically possible.
#12
Wonderful, thanks for doing these!

Scrapers will need to manage their own cache to avoid repeated requests to the same resource hitting the web service unnecessarily. Searches in particular will be repeated often enough, but we should be nice and cache all requests to web services. 
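A minimal in-memory version of such a cache might look like this (a real scraper would persist it to disk in the addon profile; the class and parameter names here are made up):

```python
# Sketch of a request cache: memoize responses by URL with an expiry so
# repeated lookups don't hit the web service again. fetch_func stands in
# for the actual HTTP call.
import time

class RequestCache:
    def __init__(self, ttl=7 * 24 * 3600):  # keep entries for a week
        self.ttl = ttl
        self.entries = {}                   # url -> (timestamp, response)

    def get(self, url, fetch_func):
        now = time.time()
        hit = self.entries.get(url)
        if hit and now - hit[0] < self.ttl:
            return hit[1]                   # fresh hit: no web request
        response = fetch_func(url)
        self.entries[url] = (now, response)
        return response
```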

I would also request an option to enable/disable web services separately, to give users some control over the network requests. Maybe don't disable the initial search from MusicBrainz, but note that in the settings as a text line.
 
(2019-03-16, 02:47)scott967 Wrote: Is "thumb" or "thumbnail" as an arttype in scraping dead (I guess only used for the auto-generated video framegrab, so N/A for music)?

For the video library, yes, these are thumbnails/frames/framegrabs/stills for movies, episodes, musicvideos, and loose video files, but they don't have to be auto-generated.
#13
For reference I thought I would tabulate what artwork the various remote sites provide, and the aspect/type that these new scrapers set in the results.

Art Type | Musicbrainz (CoverArchive) | Discogs | Fanarttv | TADB | allmusic
Artist
fanart | | | artistbackground | strArtistFanart |
thumb | | images | artistthumb | strArtistThumb | artist-image
banner | | | musicbanner | strArtistBanner |
clearart | | | | strClearart |
clearlogo | | | musiclogo, hdmusiclogo | strArtistLogo |
landscape | | | | strWideThumb |
Album
thumb | front | images | albumcover | strAlbumThumb | media-gallery-image
back | back | | | strAlbumThumbBack |
discart | medium | | cdart | strAlbumCDart |
spine | spine | | | | strAlbumSpine
3dcase | | | | strAlbum3DCase |
3dflat | | | | strAlbum3DFlat |
3dface | | | | strAlbum3DFace |
- | | | musiclabel | |

Perhaps @ronie you can check I have that right. Is there anywhere that design documents for an addon can go? It could help later with general maintenance or support if you aren't about.

It shows interesting things, like: if only fetching artist art then there is no need to make a request to allmusic; if not fetching art then there is no need to request anything from fanart.tv. But also, why not save the record label logo (as aspect=recordlogo) from fanart.tv too?
#14
@DaveBlake 
I think there's a mistake in the table, strAlbumSpine belongs to TADB, not to allmusic
If I have helped you or increased your knowledge, click the 'thumbs up' button to give thanks :) (People with less than 20 posts won't see the "thumbs up" button)
#15
Yep, strAlbumSpine is TADB. Also we have the record label logo strLabelLogo if needed, although it's not exposed on the API yet.

Also please don't use strAlbumThumbHQ as that's only for special users at the moment. 4K 2160p covers are nice but not sustainable in terms of bandwidth for everyone. strAlbumThumb is 700x700 which is enough for most users and speeds up scraping a lot compared to fanart.tv