The logic and future of Music scrapers?

DaveBlake Offline
Team-Kodi Member
Posts: 2,209
Joined: Jun 2015
Reputation: 61
Location: South West England
Post: #16
Thanks @spiff

So currently both the full NFO and the online scraper produce similar xml, which gets loaded into CAlbum or CArtist (using a load method) and then merged into the existing db record depending on settings. But the Python based scrapers will instead populate CArtist or CAlbum directly, I assume, rather than switch to xml and convert back.
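To make the "merged into the existing db record depending on settings" step concrete, here is a minimal Python sketch. It is not Kodi's actual merge code: the `Artist` class, its fields, and the `prefer_online` flag are hypothetical stand-ins, and the rule shown (fill empty fields, overwrite only when online info is preferred) is an assumption for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Artist:
    # Hypothetical stand-in for a library record; not Kodi's CArtist.
    name: str = ""
    mbid: str = ""
    born: str = ""
    genres: list = field(default_factory=list)


def merge_scraped(existing, scraped, prefer_online=False):
    """Return a new Artist combining the existing record with scraped data.

    Assumed rule: a scraped value fills an empty field, and replaces a
    non-empty one only when prefer_online is True.
    """
    merged = Artist()
    for f in fields(Artist):
        old = getattr(existing, f.name)
        new = getattr(scraped, f.name)
        value = new if new and (prefer_online or not old) else old
        # Copy lists so the merged record does not share them with its inputs.
        setattr(merged, f.name, list(value) if isinstance(value, list) else value)
    return merged


from_tags = Artist(name="Example Artist", born="1970")
from_scraper = Artist(name="Example Artist", mbid="hypothetical-mbid", born="1971")

print(merge_scraped(from_tags, from_scraper))
print(merge_scraped(from_tags, from_scraper, prefer_online=True))
```

With the default setting the tag-derived `born` value is kept and only the missing `mbid` is filled in; with `prefer_online=True` the scraped `born` value wins.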

Is it OK if what we can scrape from NFO (have xml tags we interpret) and what we can scrape from online sources diverges?

Another question: where can I look to see what data the online sources could return. Do they have some kind of documented API?
ironic_monkey Offline
Posting Freak
Posts: 1,423
Joined: Nov 2013
Reputation: 66
Post: #17
who needs documentation when there is code ;)

https://github.com/xbmc/xbmc/blob/master...m.cpp#L354
https://github.com/xbmc/xbmc/blob/master...st.cpp#L55

the python scrapers kinda do what you say, except, we do a silly hop through xml for certain things to keep compatibility.
see the various instances of https://github.com/xbmc/xbmc/blob/master...r.cpp#L702

the idea is to replace the property usage by phate89's work from https://github.com/xbmc/xbmc/pull/11418

for music this will also most likely mean you want to kill off some of the helper containers, CAlbum and CArtist and use CFileItem and its musictag directly.
ironic_monkey Offline
Posting Freak
Posts: 1,423
Joined: Nov 2013
Reputation: 66
Post: #18
as for the divergence, there was no such thing previously as it was the same xml format. it needs to be considered more carefully with the python based scrapers, but ideally there should be no divergence.
ironic_monkey Offline
Posting Freak
Posts: 1,423
Joined: Nov 2013
Reputation: 66
Post: #19
@ronie admittedly a very left-handed test, but i got no errors looking up a couple of albums /artists i had available in the office. how to reproduce (i can engineer tags, no problem..)
(This post was last modified: 2017-02-13 17:17 by ironic_monkey.)
DaveBlake Offline
Team-Kodi Member
Posts: 2,209
Joined: Jun 2015
Reputation: 61
Location: South West England
Post: #20
@spiff that is kodi music code, I know about that. What I don't know is whether Musicbrainz can return, for example, artist genre, or instruments, or shoe size. What data do these online sources actually have, as opposed to what we have historically fetched and stored?

Quote:for music this will also most likely mean you want to kill off some of the helper containers, CAlbum and CArtist and use CFileItem and its musictag directly.
That sounds like a move in the wrong direction to me. CAlbum and CArtist have a structure that relates to the music library and have meaning, while CFileItem and its musictag is an ugly bucket.

Quote: but ideally there should be no divergence.
I see it differently. The scraper should fetch all the info of interest that is available from the source, with some sources providing different data to others. But we should be able to use NFO (or export/import to xml format) to load any info that we want to store about artists and albums, whether online sources provide that kind of data or not.

For example artist "gender" - male, female, group - is something I like to use to select music. It could easily be added to the db and used in smart playlists etc., and as an xml tag that can be loaded from NFO. Do any of the online sources give artist gender? Musicbrainz has it on their site so they might, I have no idea about the others. But even if none of the online sources could return that I would still consider adding it.
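As a rough illustration of loading such a value from an NFO, here is a small Python sketch. The `<gender>` tag and the `load_artist_nfo` helper are hypothetical: this only shows that an extra xml tag can be read whether or not any online source supplies it, and it is not how Kodi itself parses NFO files.

```python
import xml.etree.ElementTree as ET

# Hypothetical artist NFO; the <gender> tag is an assumption for illustration.
SAMPLE_NFO = """<artist>
    <name>Example Artist</name>
    <gender>Female</gender>
</artist>"""


def load_artist_nfo(xml_text):
    """Return a dict of the tags we understand; anything else is ignored."""
    root = ET.fromstring(xml_text)
    known_tags = ("name", "gender")
    artist = {}
    for tag in known_tags:
        value = root.findtext(tag)
        # Skip tags that are missing or contain only whitespace.
        if value is not None and value.strip():
            artist[tag] = value.strip()
    return artist


print(load_artist_nfo(SAMPLE_NFO))
```

An NFO without the tag simply yields a record with no "gender" key, so older files would keep loading unchanged.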

I don't think we should limit NFO to just what online sources support. Equally I would like to know what data we could be fetching from those sources, but aren't.

That make more sense?
jjd-uk Offline
Team-Kodi Member
Posts: 6,191
Joined: Oct 2011
Reputation: 152
Post: #21
Was discussing online services with Dave and turned up these API docs that seem to answer at least some of his questions.

MusicBrainz https://wiki.musicbrainz.org/Development...n_2/Search

AudioDB http://www.theaudiodb.com/forum/viewtopic.php?t=7

Fanart.tv http://docs.fanarttv.apiary.io/#reference/music

Allmusic http://developer.rovicorp.com/docs/?wiki...2.1/search (this uses the Allmusic DB from what I gather)
jjd-uk Offline
Team-Kodi Member
Posts: 6,191
Joined: Oct 2011
Reputation: 152
Post: #22
Also while reading this https://wiki.musicbrainz.org/XML_Web_Ser...e_Limiting I saw

Quote:For "python-musicbrainz/0.7.3": we allow through (on average) 50 requests per second, and decline the rest (though recently this has not been hit).

Don't know if this might be of any use in development https://github.com/alastair/python-musicbrainzngs
DaveBlake Offline
Team-Kodi Member
Posts: 2,209
Joined: Jun 2015
Reputation: 61
Location: South West England
Post: #23
Not sure when that Musicbrainz quote dates from, but there were major issues last year, and again around 17.0 release time: even 1 request per sec was getting timeouts (and TADB was offline completely).

But thanks for the references Jeff.
ronie Offline
Team-Kodi Member
Posts: 13,164
Joined: Jan 2009
Reputation: 391
Post: #24
@DaveBlake thanx for the answers on my tagging questions. i'll update the scraper accordingly.


(2017-02-13 16:01)DaveBlake Wrote:  What data can the Musicbrainz API return? Do we get more than just mbid?
I think part of the original design was that re-scraping could update info that had changed e.g. someone in the community had added or edited data. So if we are getting anything else from MB then we still need to scrape even when we have a mbid.
the info we get from musicbrainz:

for artists:
- born / died / formed / disbanded / years active
- discography

for albums:
- genre
- year
- release date
- rating / votes
- release_type / type



(2017-02-13 16:01)DaveBlake Wrote:  We can also stop fetching the track lists for an album, users just don't care, all they want is the songs they have, although I doubt that will help much.

correct, i haven't implemented it, as the recent forum discussion on this subject seems to indicate users prefer to see a list of the actual tracks they have in their library instead of a scraped tracklist.

Quote:3) bugs
i came across another one and added it to the first post... perhaps it's not a bug, i don't know.

for the issues i've reported so far, i have identified the code parts and added them to the first post as well.

i would really like to get bug #2 fixed (not scraping artist info, if album lookup fails).
i've patched the code locally and didn't experience any crash-boom-bangs if we simply proceed with an artist lookup in that particular case.

(2017-02-13 16:01)DaveBlake Wrote:  I'd like to check that out too with my less scraper friendly music. And think about what happens when we have mbid for artists but not the albums etc. Something is niggling at the back of my mind over a reason why #2 is like it is.

great, i'll prepare a PR and let you run your music collection through it.

(2017-02-13 16:01)DaveBlake Wrote:  BTW I'm up for working on any of the code on the music side of things. Not a clue with the Python though.

+1 and much appreciated ;-)
you can let me handle the python side of things.


(2017-02-13 16:01)DaveBlake Wrote:  A new thing I would like to see on the scraper side is return of disambiguation data.

When manual scraping of a single item (from refresh on info dialog) without mbids is inconclusive we get a list of possibles but no disambiguation data to help spot which is the right one. Can we get that back from Musicbrainz? Olympia mentioned that the scraper used to do it, but stopped for some reason.

yes musicbrainz supports that for searches by artistname.
our current implementation is that the scraper needs to return 'artist genre' / 'artist born' for searches,
but sadly musicbrainz does not provide either of those when searching for artists.
so yeah, disambiguation data would be the way to go.

for albumname searches, musicbrainz returns artist disambiguation data, primary/secondary release type and tags (mostly genre related tags)
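
To show how disambiguation data could help pick between same-named artists, here is a minimal Python sketch. The sample dict is hand-written in the general shape of a MusicBrainz artist search result (JSON); the IDs and comments are invented, and `build_choices` is a hypothetical helper, not part of any existing scraper.

```python
# Hand-written sample in the shape of an artist search response;
# the IDs and disambiguation comments are invented for illustration.
SAMPLE_RESPONSE = {
    "artists": [
        {"id": "mbid-2", "name": "Example Band", "score": 90,
         "disambiguation": "UK folk duo"},
        {"id": "mbid-1", "name": "Example Band", "score": 100,
         "disambiguation": "US rock band"},
        {"id": "mbid-3", "name": "Example Band", "score": 70},
    ]
}


def build_choices(response):
    """Turn search results into labelled choices, best score first."""
    choices = []
    for artist in response.get("artists", []):
        label = artist["name"]
        comment = artist.get("disambiguation")
        if comment:
            # Append the comment so identical names can be told apart.
            label = "{} ({})".format(label, comment)
        choices.append({
            "mbid": artist["id"],
            "label": label,
            "score": int(artist.get("score", 0)),
        })
    return sorted(choices, key=lambda c: c["score"], reverse=True)


for choice in build_choices(SAMPLE_RESPONSE):
    print(choice["label"])
```

Entries without a disambiguation comment fall back to the bare name, so the list is never worse than what the dialog shows today.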

Do not PM or e-mail Team-Kodi members directly asking for support.
Always read the Forum rules, Kodi online-manual, FAQ and Search the forum before posting.
For troubleshooting and bug reporting please make sure you read this first.
(This post was last modified: 2017-02-14 05:24 by ronie.)
ronie Offline
Team-Kodi Member
Posts: 13,164
Joined: Jan 2009
Reputation: 391
Post: #25
(2017-02-13 16:17)DaveBlake Wrote:  Another question: where can I look to see what data the online sources could return. Do they have some kind of documented API?

see the links from jjd-uk, but if you have specific questions, let me know

(2017-02-13 19:17)jjd-uk Wrote:  Was discussing online services with Dave and turned up these API docs that seem to answer at least some of his question.

MusicBrainz https://wiki.musicbrainz.org/Development...n_2/Search

AudioDB http://www.theaudiodb.com/forum/viewtopic.php?t=7

Fanart.tv http://docs.fanarttv.apiary.io/#reference/music

Allmusic http://developer.rovicorp.com/docs/?wiki...2.1/search (this uses the Allmusic DB from what I gather)

sadly, we can't use the rovicorp api, it's a paid one.
instead we scrape the allmusic pages.

btw, we are not allowed to scrape the artist biography and album reviews from allmusic, as they are copyrighted (they have contacted us about it in the past).
last.fm and discogs.com also previously let us know we are not allowed to scrape them.

in case discogs.com is deemed useful, their data is public domain and they provide data dumps. so setting up our own mirror is a possibility.

ronie Offline
Team-Kodi Member
Posts: 13,164
Joined: Jan 2009
Reputation: 391
Post: #26
(2017-02-13 19:24)jjd-uk Wrote:  Also while reading this https://wiki.musicbrainz.org/XML_Web_Ser...e_Limiting I saw

Quote:For "python-musicbrainz/0.7.3": we allow through (on average) 50 requests per second, and decline the rest (though recently this has not been hit).

Don't know if this might be of any use in development https://github.com/alastair/python-musicbrainzngs

we don't use that particular user-agent.
the scraper addon sets a, what they call, 'meaningful User-Agent'.
they don't apply throttling based on user-agent in this case.

we'll still hit the 1 call per second based on ip address though.

(2017-02-13 20:01)DaveBlake Wrote:  Not sure when that Musicbrainz quote dates from, but there were major issues last year, and again around 17.0 release time, even 1 request per sec was getting timeouts (as well as TADB being off line completely).

if musicbrainz is under heavy load, they throttle regardless of what time-out value you use i think:
Quote:We allow through 300 requests each second (on average), and decline (http 503) the rest.
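
A client that wants to stay inside the 1 call per second limit and cope with 503 responses could do something along these lines. This is only a sketch: `Throttle` and `fetch_with_retry` are hypothetical names, `fetch` stands for whatever function performs the real request and returns a (status, body) pair, and the retry count and backoff factor are arbitrary assumptions.

```python
import time


class Throttle:
    """Allow at most one call per `interval` seconds."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least `interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_with_retry(fetch, throttle, retries=3, backoff=2.0, sleep=time.sleep):
    """Call fetch() through the throttle, retrying on HTTP 503.

    The delay between retries starts at the throttle interval and is
    multiplied by `backoff` after each declined request.
    """
    delay = throttle.interval
    for attempt in range(retries + 1):
        throttle.wait()
        status, body = fetch()
        if status != 503:
            return status, body
        if attempt < retries:
            sleep(delay)
            delay *= backoff
    return status, body
```

Note that a throttle like this only limits one caller. If another addon on the same ip address is also querying musicbrainz, the calls from both still add up on the server side.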


while testing things i found out there is a very popular skin helper addon in our repo that is also making a lot of calls to musicbrainz
when you are scraping your music collection.
if you have this addon installed (it's a dependency of many skins) you're very likely to end up with many items that failed to scrape (due to throttling),
as the addon is basically doubling up the number of calls to musicbrainz.

ronie Offline
Team-Kodi Member
Posts: 13,164
Joined: Jan 2009
Reputation: 391
Post: #27
(2017-02-13 17:14)ironic_monkey Wrote:  @ronie admittedly a very left-handed test, but i got no errors looking up a couple of albums /artists i had available in the office. how to reproduce (i can engineer tags, no problem..)

i wish i could tell you, i haven't been able to find a pattern. it's completely random.

i tested it by scraping the same album a couple of times and each time the results would differ (deleting the musicdb and restarting kodi in between).

first time, the artist scraper (getdetails call) leads to an invalid handle warning:
http://paste.ubuntu.com/23992706/

second time, the album scraper (getdetails call) leads to an invalid handle warning:
http://paste.ubuntu.com/23992707/

third time, no warnings.
http://paste.ubuntu.com/23992715/


linux / current master / fresh install / 'fetch additional information during updates' enabled, rest of the settings is default.

ironic_monkey Offline
Posting Freak
Posts: 1,423
Joined: Nov 2013
Reputation: 66
Post: #28
now that was likely the important bit - the background fetches. i did manual...
ronie Offline
Team-Kodi Member
Posts: 13,164
Joined: Jan 2009
Reputation: 391
Post: #29
(2017-02-13 15:35)ironic_monkey Wrote:  2) 'url' nfo files. these hold an url to some backend. it was sort-of the manual mbrainz before mbrainz support was added. this is used to shortcut the two first steps in ronie's overview, that is, no search is performed, and thus there is no 'first entry' in a list to guess as match. code wise, this means that we enter the code at the 'grab info stage', skipping the 'search' stage.

oh, this one is also leading to a crash (similar to the one we had before):
http://paste.ubuntu.com/23998050/

needs some love here i think:
https://github.com/xbmc/xbmc/blob/master...r.cpp#L430

DaveBlake Offline
Team-Kodi Member
Posts: 2,209
Joined: Jun 2015
Reputation: 61
Location: South West England
Post: #30
Lots of points to keep up with!

Info from Musicbrainz
Obviously if we are getting more than just mbid from there we can't drop the Musicbrainz lookup for those albums and artists where we already have the mbid from music file tags. But if the scraper settings are such that all we want from Musicbrainz is an ID then it would be sensible to optimise that out and go straight to the other sources using that ID, as you suggested @ronie
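
The optimisation described above boils down to a small decision, sketched here in Python. The function name and its parameters are hypothetical; in particular, how the scraper settings would be turned into a set of fields wanted from Musicbrainz is an assumption.

```python
def needs_musicbrainz_lookup(mbid_from_tags, fields_wanted_from_musicbrainz):
    """Decide whether the Musicbrainz request is needed for one item.

    mbid_from_tags: the ID read from the music file tags, or "" / None.
    fields_wanted_from_musicbrainz: set of field names the scraper
    settings ask Musicbrainz for, beyond the ID itself.
    """
    if not mbid_from_tags:
        # No ID in the tags: we have to search Musicbrainz to find one.
        return True
    # ID already known: only call Musicbrainz if we want its other data.
    return len(fields_wanted_from_musicbrainz) > 0


# Tagged album, settings want nothing but the ID: go straight to other sources.
print(needs_musicbrainz_lookup("hypothetical-mbid", set()))
# Tagged album, but we still want year and rating from Musicbrainz.
print(needs_musicbrainz_lookup("hypothetical-mbid", {"year", "rating"}))
# Untagged album: a search is always needed.
print(needs_musicbrainz_lookup("", set()))
```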

Fetching artist gender and sortname would be useful for some v18 improvements I have in progress, along with disambiguation comment.

As you comment they don't return genre, but to be honest genre is so subjective and also needs to be granular at different levels depending on your collection. For example if you have only a handful of heavy metal in your lib "heavy metal" is enough, but if you have 1000 artists or albums then you probably want to break that into "Alt metal", "Funk metal", "Doom metal" etc. Also artist genre is not fully integrated into the db, just added as a text string, not linked to the genre table, and not used in filtering for playlist rules.

Not sure what to do with the "tags" they return for albumname searches, they aren't always genres, more of a hashtag comment. I am looking at a facility for users to give artists and albums custom properties, these would fit there.

Quote:while testing things i found out there is a very popular skin helper addon in our repo that is also making a lot of calls to musicbrainz
when you are scraping your music collection.
if you have this addon installed (it's a dependency of many skins) you're very likely to end up with many items that failed to scrape (due to throttling), as the addon is basically doubling up the number of calls to musicbrainz.
Oh dear! What is it fetching? Does it do it even when we have mbid from tags? Could we look at optimising that too?

Bugs/Questions from 1st Post
Quote:#1.
If the 'prefer online info' setting is disabled, we pass the artistname to the artist scraper.
If the setting is enabled, we pass the artist mbid to the scraper.
Why don't we always pass the mbid (if available) regardless of this setting?
Not sure why you pointed at the code that you did, and think you may have misread? The code is about what we do with the data we have scraped, (correctly) managing what data derived from tags gets overwritten.
When calling the scraper here https://github.com/xbmc/xbmc/blob/99c25f....cpp#L1097
and https://github.com/xbmc/xbmc/blob/99c25f....cpp#L1319
we use the mbid from tags when we have it.

Can #1 go?

#2. Will check out your PR.

#3. There are other variations on the point you make here. Scraping can make a mess, even delete artists but leave their names in the song artist desc. I have been looking at reworking this in relation to also storing the mbids that we scrape. They need to be flagged as from online, not embedded tags, because we also need a mechanism to replace them if inaccurate (the assumption is always that embedded mbid tags are correct). I will pick this up.