discogs.com music scraper - development help and bug reports wanted!
#16
Hi,

I just found the Discogs scraper and ran it on my library.
I tried to add ANV (artist name variations) so it would get more hits when looking for artists. This seems to work quite well.
Here is the code I've modified:
Code:
<GetArtistSearchResults dest="8">
    <RegExp input="$$5" output="&lt;results&gt;\1&lt;/results&gt;" dest="8">
        <!-- artist name variation -->
        <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;url&gt;http://www.discogs.com\1&lt;/url&gt;&lt;/entity&gt;" dest="5+">
            <expression repeat="yes" clear="yes">&lt;a class=&quot;rollover_link&quot; href=&quot;(/artist[^&quot;]*anv=[^&quot;]*)&quot;&gt;(.+)&lt;/a&gt;</expression>
        </RegExp>
        <!-- exact match -->
        <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;url&gt;http://www.discogs.com\1&lt;/url&gt;&lt;/entity&gt;" dest="5+">
            <expression repeat="yes" clear="no">&lt;a class=&quot;rollover_link&quot; href=&quot;(/artist[^&quot;]*)&quot;&gt;&lt;span style=&quot;font-size:11pt;&quot;&gt;&lt;em&gt;([^&lt;]*)&lt;</expression>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</GetArtistSearchResults>

Then I found that the scraper doesn't find artists whose names start with "The ", e.g. "The Art of Noise" is not found because Discogs expects it to be "Art of Noise, The".
I tried to add handling for this, but couldn't get it to work. Maybe somebody can help.

Code:
<CreateArtistSearchUrl dest="3">
    <RegExp input="$$2" output="http://www.discogs.com/search?type=artists&amp;q=&quot;\1&quot;&amp;btn=Search" dest="3">
        <RegExp input="$$2" output="\1,%20The" dest="2">
            <RegExp input="$$1" output="\1" dest="2">
                <expression noclean="1"/>
            </RegExp>
            <expression noclean="1" clear="no" repeat="no" trim="1">[Tt]he[ ](.+)</expression>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</CreateArtistSearchUrl>

The problem seems to be the blank after "The". I tried:
"[Tt]he (.+)"
"[Tt]he\s(.+)"
"[Tt]he[ ](.+)"
but none of them matched.

When I change it to:
"[Tt]he(.+)" it matches, but of course \1 then has a leading blank and the resulting string is " Art of Noise, The".

Any ideas?

Bernd
#17
You are passed a URL-encoded version of the name, i.e. the whitespace is a + (or is it %20? I can't recall).

Updates as diffs on trac, please.
#18
%20 worked!
Thanks for the hint.
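
For anyone following along, here is a rough sketch of the change. It assumes the rest of CreateArtistSearchUrl stays exactly as posted above and only the expression is touched, so it may differ from whatever ends up in the final patch:

```xml
<CreateArtistSearchUrl dest="3">
    <RegExp input="$$2" output="http://www.discogs.com/search?type=artists&amp;q=&quot;\1&quot;&amp;btn=Search" dest="3">
        <RegExp input="$$2" output="\1,%20The" dest="2">
            <RegExp input="$$1" output="\1" dest="2">
                <expression noclean="1"/>
            </RegExp>
            <!-- the artist name arrives URL-encoded, so match %20 instead of a literal blank -->
            <expression noclean="1" clear="no" repeat="no" trim="1">[Tt]he%20(.+)</expression>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</CreateArtistSearchUrl>
```

Note that the expression is not anchored, so it could in principle also match a "the%20" in the middle of a name. Prefixing it with ^ might be a sensible refinement, but that is an untested assumption on my part.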

Now I need to find out how to create a diff so I can post it on trac.

Bernd

PS: It would be nice if the debug log contained the inputs and outputs of the scraper. This would make debugging easier.
#19
I added ticket #6316 and attached the patch.

Bernd
#20
Hi there,

There is an artist on Discogs whose name contains a bug: the last character is lowercase, but it has to be capitalized, just like on all the releases.

http://www.discogs.com/artist/APh

Can you/we fix this bug?

Reply to [email protected] please, please, please.
#21
I'm sorry, I posted a bug that is already fixed; I didn't notice that I was using an old version.
#22
I'm very interested in this scraper, since my music collection consists mainly of electronic music on vinyl, which is quite well categorized on Discogs and nowhere else.
Unfortunately, I can't find discogs.xml anymore in current svn checkouts (using 28276 atm).
Does this scraper still exist? Is there enough interest to maintain it?

Thanks in advance!

[edit] I just found this: http://www.discogs.com/help/forums/topic/198561 Would this appear to be the issue? Any chance of improvement?
#23
No, Discogs started blocking us, so we removed it.
