[WIP] AniDB.net Anime Video Scraper
#1
I can't, for the life of me, figure out what's wrong with my scraper, though my initial guess probably has to do with my not properly using the "<url>" tag. With that in mind, I don't suppose I might be able to ask for a smidgen of help. I've been working on it with GVim and ScraperXML Editor, and the following comes up:
  • I've tested manually downloaded versions of the search results AND the individual result's info page. In both cases, ScraperXML Editor gives me a thumbs up; the results are correctly parsed
  • I can't test in ScraperXML from scratch (e.g. from a search query, that is), AFAIK. I have version 1.5
  • The old Scraper.exe program from way back when doesn't work with gzipped HTML. Is there an option to test it with downloaded html files?
Any help would be substantially appreciated. My script (modified from the animenfo.com script):

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper name="AniDB.net" date="2009-11-15" content="tvshows" framework="1.0" thumb="anidb.jpg" language="">
  <NfoUrl dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression></expression>
    </RegExp>
  </NfoUrl>
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl?show=animelist&amp;adb.search=\1&amp;do.search=search&lt;/url&gt;" dest="3">
      <expression noclean="1"></expression>
    </RegExp>
  </CreateSearchUrl>
  <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes" noclean="1">&lt;a href="(animedb.pl\?show=anime&amp;amp;aid=([0-9]*))"&gt;([^&lt;]*)&lt;/a&gt;</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetSearchResults>
  <GetDetails dest="3">
    <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
      <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="8">
        <expression repeat="yes">&lt;th class="field"&gt;Main Title&lt;/th&gt;.....&lt;td class="value"&gt;(.[^\n]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="8+">
        <expression noclean="1" trim="1">&lt;th class="field"&gt;Year&lt;/th&gt;.[^&gt;]*&gt;([^&lt;]*)|$</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;thumb&gt;&lt;url spoof=&quot;http://anidb.net&quot;&gt;http://animenfo.com/\1&lt;/url&gt;&lt;/thumb&gt;" dest="8+">
        <expression>&lt;div class="image".[^"]*"(http.[^"]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
        <expression>animevotes&amp;amp;aid=[0-9]*"&gt;(.[^&lt;]*)</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetDetails>
</scraper>
#2
As far as I know, the only available attributes for the url tag are:

1. spoof="foo" (the referer page)
2. post="yes" (tells the HTTP API to use the POST method)

and, only used when calling a custom function (in GetDetails, GetSettings, or custom functions):
3. function="CustomFunctionName"

I could be wrong on this, considering I don't fully understand the HTTP handling in XBMC, but I don't think there is a gzip="yes" option for url (I think XBMC automatically detects gzipped sites and decompresses them).
ScraperXML Open Source Web Scraper Library compatible with XBMC XML Scrapers


I Suck, and if you act now by sending only $19.95 and a self addressed stamped envelop, so can you!

#3
There is indeed a gzip="yes" parameter to enable gzipped content. It is useful in cases where gzip is not set explicitly in the HTTP headers; when it is set in the headers, curl handles it automagically.
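
To make the two cases concrete, here is a minimal sketch in plain Python. It is not XBMC's curl-based code: the function name `decode_body` and its `force_gzip` flag are invented for illustration, with `force_gzip` standing in for what gzip="yes" asks for when the server sends compressed data without saying so.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def decode_body(body: bytes, content_encoding: str = "", force_gzip: bool = False) -> bytes:
    """Decompress a response body if the header, or the caller, says it is gzipped.

    content_encoding: value of the Content-Encoding header ("" if absent).
    force_gzip: hypothetical stand-in for gzip="yes" on the <url> tag.
    """
    header_says_gzip = content_encoding.strip().lower() == "gzip"
    # Only decompress when asked to AND the data really looks like gzip.
    if (header_says_gzip or force_gzip) and body.startswith(GZIP_MAGIC):
        return gzip.decompress(body)
    return body


page = b"<html>search results</html>"
compressed = gzip.compress(page)

print(decode_body(compressed, content_encoding="gzip") == page)  # header present
print(decode_body(compressed, force_gzip=True) == page)          # header missing, forced
print(decode_body(compressed) == page)                           # header missing, not forced
```

The third call is the situation the parameter exists for: without the header and without the flag, the caller gets the compressed bytes back untouched.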
#4
Is there any other clear mistake in my code, then? I can't figure out what's wrong, and AFAIK there are no testing programs out there that can help (the only two I know about have the above-listed problems that prevent me from performing a complete search test).

Also, is there a way to specify "ignored" portions of my search query? At least for my specific use, I want it to ignore anything in the folder name inside parentheses.
#5
spiff Wrote: There is indeed a gzip="yes" parameter to enable gzipped content. It is useful in cases where gzip is not set explicitly in the HTTP headers; when it is set in the headers, curl handles it automagically.

Good to know. I guess that's something I need to code for.
#6
I didn't know my version of the editor was so far behind. (caught the 3.5 link in your sig)
#7
Okay, now it almost works! This gets the thumbnail, year, title, and rating.

Here are the two questions I have for any other scraper writers:

1. How do I add episode titles?
2. How do I add "excluded" terms for the search query? I place format info (codec, audio tracks) in parentheses in the folder name, so at least for the version of the scraper I keep personally, I'd like to remove all of that from the text that goes to the search query.
2.a. I tried adding the expression:
Code:
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl\?show=animelist&amp;adb.search=\1&lt;/url&gt;" dest="3">
        <expression>(.[^)(]+)</expression>
    </RegExp>
</CreateSearchUrl>
2.b. Thanks to ackana on Freenode's Regex, I have a couple of nice regexes, NEITHER of which works in XBMC (and BOTH of which work in ScraperXML Editor):
([^)(]+)\([a-zA-Z0-9]+\)+
([^)(]+(?=(?:\([a-zA-Z0-9]+\))+))

Edit note : Okay, scrap.exe works (as far as getting a useful search URL, obviously the g-zip's a no-go) with:
([^)(]+)
What am I doing wrong?! Is there ANY way to figure out WTF XBMC is doing with this expression?! >_<
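
For anyone who wants to check the three expressions outside XBMC, here is a minimal sketch in Python. Python's `re` module is not the regex library XBMC uses, so this only shows that the patterns themselves are sound, which matches the ScraperXML Editor result; it says nothing about why XBMC rejects them. The folder name and the helper `search_term` are made up for the example.

```python
import re

folder = "3x3 Eyes (h264)(AAC)"  # hypothetical folder name with format tags

patterns = [
    r"([^)(]+)",                              # everything up to the first parenthesis
    r"([^)(]+)\([a-zA-Z0-9]+\)+",             # same, but requires a "(tag)" to follow
    r"([^)(]+(?=(?:\([a-zA-Z0-9]+\))+))",     # same, using a lookahead for the tags
]


def search_term(pattern: str, name: str) -> str:
    """Return capture group 1 with surrounding spaces removed, or the whole name if nothing matches."""
    match = re.search(pattern, name)
    return match.group(1).strip() if match else name


for pattern in patterns:
    print(repr(search_term(pattern, folder)))
```

All three give the same search term for this folder name. The second and third only match when at least one parenthesised tag is present, so a plain name falls through to the "return the whole name" branch.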

Code:
<?xml version="1.0" encoding="utf-8"?><scraper framework="1" date="2009-11-15" name="AniDB.net" content="tvshows" thumb="anidb.jpg" language="en">
    <NfoUrl dest="3">
        <RegExp input="$$1" output="\1" dest="3">
            <expression></expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl\?show=animelist&amp;adb.search=\1&lt;/url&gt;" dest="3">
            <expression>([^)(]+)</expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <!--     Multiple Results  -->
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" noclean="1">&lt;a href="(animedb.pl\?show=anime&amp;amp;aid=([0-9]*))"&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
            
            <!--     Only one Result  -->
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;\2&lt;/url&gt;&lt;/entity&gt;" dest="5+">
                <expression repeat="no" noclean="1">&lt;th class=&quot;field&quot;&gt;Main Title&lt;/th&gt;.....&lt;td class=&quot;value&quot;&gt;(.[^\n]*)....&lt;a class=&quot;shortlink&quot; href=&quot;(http.[^&quot;]*)</expression>
            </RegExp>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="8">
                <expression repeat="yes">&lt;th class="field"&gt;Main Title&lt;/th&gt;.....&lt;td class="value"&gt;(.[^\n]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="8+">
                <expression trim="1" noclean="1">&lt;th class="field"&gt;Year&lt;/th&gt;.[^&gt;]*&gt;([^&lt;]*)|$</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="8+">
                <expression>&lt;div class="image".[^"]*"(http.[^"]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
                <expression>animevotes&amp;amp;aid=[0-9]*"&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
                <expression>class=&quot;desc&quot;&gt;(.*)&lt;/div&gt;</expression>
            </RegExp>

            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
<!-- Created with ScraperXml Editor -->
</scraper>

EDIT : Solved how to deal with a single search result that defaults to the info page. Just add a regex to "getsearchresults" that looks specifically for the info page. Not sure what to do if the info page doesn't have a link to itself like it does on AniDB, tho.
#8
$#pages+1 holds the url to the page that is scraped for this very reason.

cleaning filenames has nothing to do with the scraper. see <cleanstrings> or something like that in advancedsettings.xml.

No idea why those expressions don't work, and it's way too January 1st in my head at the moment.
#9
Hi,
I'm trying to improve that AniDB scraper a bit and was wondering whether the regex info on the wiki is still correct.
I'm on Linux using XBMC 9.11, and it looks like lazy regex quantifiers do work, although the wiki states that they don't.
Can I assume they now work and are platform independent, or should I stick to painful laziness-free regexes?
And do you have any hints on the current regex version running in XBMC, and its limitations, to clear things up for my quick shot at this scraper?

Thanks
#10
MukiDA, props for your effort. I've been dreaming of a link between AniDB and XBMC for a while.

Out of ~43 shows (on my NAS), theTVdb has found all but 1.
I begrudgingly added it a few days back, but this makes adding new stuff a MASSIVE effort,
and now I have another one that's missing from theTVdb.

Once you're ready to beta, can I give it a shot?
(I'm afraid that's as helpful as I can be in this situation.)

#11
Thanks for the help, Spiff, that helped out substantially.

zonsky, feel free to beta test the scraper as-is (no episodes yet, hence why I said pre-pre-alpha, but at the moment the thumb, title, and plot summary work just fine). I would, of course, suggest testing it on only a few folders at a time (e.g. don't just point it at the anime root folder).

Here's the latest version. I'm working on episode support, and once again, I have no idea what I'm doing wrong (seems to be a running theme).

Code:
<?xml version="1.0" encoding="utf-8"?><scraper framework="1" date="2009-11-15" name="AniDB.net" content="tvshows" thumb="anidb.jpg" language="en">
    <NfoUrl dest="3">
        <RegExp input="$$1" output="\1" dest="3">
            <expression></expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl\?show=animelist&amp;adb.search=\1&lt;/url&gt;" dest="3">
            <expression></expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
            <!--     Multiple Results  -->
            <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" noclean="1">&lt;a href="(animedb.pl\?show=anime&amp;amp;aid=([0-9]*))"&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
            
            <!--     Only one Result  -->
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;\2&lt;/url&gt;&lt;/entity&gt;" dest="5+">
                <expression repeat="no" noclean="1">&lt;th class=&quot;field&quot;&gt;Main Title&lt;/th&gt;.....&lt;td class=&quot;value&quot;&gt;(.[^\n]*)....&lt;a class=&quot;shortlink&quot; href=&quot;(http.[^&quot;]*)</expression>
            </RegExp>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="8">
                <expression repeat="yes">&lt;th class="field"&gt;Main Title&lt;/th&gt;.....&lt;td class="value"&gt;(.[^\n]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="8+">
                <expression trim="1" noclean="1">&lt;th class="field"&gt;Year&lt;/th&gt;.[^&gt;]*&gt;([^&lt;]*)|$</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="8+">
                <expression>&lt;div class="image".[^"]*"(http.[^"]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
                <expression>animevotes&amp;amp;aid=[0-9]*"&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
                <expression>class=&quot;desc&quot;&gt;(.*)&lt;/div&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;episode&gt;\1&lt;/episode&gt;" dest="8+">
                <expression repeat="no">&lt;td class=&quot;epno lastep&quot;&gt;([0-9]+)&lt;/td&gt;</expression>
            </RegExp>

            <expression noclean="1"></expression>
        </RegExp>
        <RegExp input="$$10" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;" dest="3+">
            <RegExp input="$$1" output="&lt;episode&gt;&lt;title&gt;\2&lt;/title&gt;&lt;epnum&gt;\1&lt;/epnum&gt;&lt;/episode&gt;" dest="10">
                <expression repeat="yes">&lt;td class=&quot;id eid&quot;&gt;&lt;a href.[^&gt;]*&gt;([0-9]+).*?label.[^&gt;]*&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
</scraper>

From what I can gather (I am SO adding this to the scraper Wiki once I understand it) from glancing at the tvdb scraper source, the episode format is part of the "GetDetails" (are these "sections" arbitrary?) section. The format seems to be as follows:
Code:
<episodeguide>
    <episode>
        <title>Title of this ep</title>
        <enum>XX</enum>
    </episode>
</episodeguide>

And it seems to be after <details>. At the moment, I think I'm doing this right, but once again, I can only check my regex with ScraperXML. Is there any way to see XBMC's output on a scrape run? Is there a scraper log flag I need to toggle?
#12
debug logging logs the entire process. and that is not the format, it is

Code:
<episodeguide>
  <episode>
    <title>.</title>
    <url>..</url>
    <season>...</season>
    <epnum>...</epnum>
    <id>..</id>
    <airdate>...</airdate>
  </episode>
</episodeguide>
where season, epnum and url are mandatory, and title is a big plus.
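
A quick way to check a generated episode guide against that rule, sketched in Python with the standard library. This is only a checking aid, not anything XBMC runs; the function name `missing_fields` and the sample guide are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Per the post above: season, epnum and url are mandatory for each episode.
MANDATORY = ("season", "epnum", "url")


def missing_fields(guide_xml: str) -> list:
    """Return (episode position, tag) pairs for every mandatory tag that is absent or empty."""
    root = ET.fromstring(guide_xml)
    problems = []
    for index, episode in enumerate(root.iter("episode"), start=1):
        for tag in MANDATORY:
            node = episode.find(tag)
            if node is None or not (node.text or "").strip():
                problems.append((index, tag))
    return problems


# Made-up sample: the second episode lacks <season> and <url>.
sample = """<episodeguide>
  <episode><title>First</title><url>http://example.invalid/ep1</url><season>1</season><epnum>1</epnum></episode>
  <episode><title>Second</title><epnum>2</epnum></episode>
</episodeguide>"""

print(missing_fields(sample))
```

An empty list means every episode carries the three mandatory tags; a missing title is deliberately not reported, since it is only "a big plus".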
#13
Thanks again, Spiff. I inadvertently caught a bug I'd almost ignored: the plot regex was wrong and was grabbing WAY too much junk after the actual plot. So now, as far as I can tell, the info I get for the episode guide is just fine, but XBMC keeps telling me I have "0" episodes in my test show. Just to make sure I wasn't doing anything wrong, I renamed one of the files to "3x3 Eyes S01E01.mkv" (I know nobody judges here, but for the sake of clarity I own 3x3 Eyes on DVD and can produce a photo of me holding the disc on request).

Here's my episode guide from the debug log (I added whitespace myself; anyone know an automated way to do this in vim?). It starts immediately before </details>. Does the <url> in it have to be a full episode info file?

Code:
<episodeguide>
    <episode>
        <url>http://anidb.net/</url>
        <season>1</season>
        <title>Transmigration</title>
        <epnum>1</epnum>
    </episode>
    <episode>
        <url>http://anidb.net/</url>
        <season>1</season>
        <title>Yakumo</title>
        <epnum>2</epnum>
    </episode>
    <episode>
        <url>http://anidb.net/</url>
        <season>1</season>
        <title>Sacrifice</title>
        <epnum>3</epnum>
    </episode>
    <episode>
        <url>http://anidb.net/</url>
        <season>1</season>
        <title>Straying</title>
        <epnum>4</epnum>
    </episode>
</episodeguide>

Here's the full scraped info so far on 3x3 eyes (straight from the log, unedited):
Code:
<details><title>3x3 Eyes</title><title>3x3 Eyes</title><year>25.07.1991 till 19.03.1992</year><thumb>http://img7.anidb.net/pics/anime/22311.jpg</thumb><rating>6.85</rating><plot>Pai is the last of the Sanjiyan -- a magical race of 3-eyed creatures, and she comes in search of Tokyo high-school student Yakumo with news of his father's death and hopes of becoming human. After a fatal accident, Pai is forced to absorb Yakumo's soul to keep him from dying, making him an undead creature bound to her. Their journey to make Pai human becomes complicated with dark forces seeking to stop them, especially when Pai's crueler nature emerges...
                                                    </plot><episode>4</episode><episodeguide><episode><url>http://anidb.net/</url><season>1</season><title>Transmigration</title><epnum>1</epnum></episode><episode><url>http://anidb.net/</url><season>1</season><title>Yakumo</title><epnum>2</epnum></episode><episode><url>http://anidb.net/</url><season>1</season><title>Sacrifice</title><epnum>3</epnum></episode><episode><url>http://anidb.net/</url><season>1</season><title>Straying</title><epnum>4</epnum></episode></episodeguide></details>
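
On the whitespace question: outside of vim, single-line XML copied from the debug log can be re-indented with a few lines of Python. This is a minimal sketch using the standard `xml.dom.minidom` module; the helper name `pretty` is made up, and the input is a shortened version of the episode guide above.

```python
from xml.dom import minidom


def pretty(xml_text: str, indent: str = "    ") -> str:
    """Re-indent a single-line XML string, one element per line."""
    return minidom.parseString(xml_text).toprettyxml(indent=indent)


raw = (
    "<episodeguide><episode><url>http://anidb.net/</url><season>1</season>"
    "<title>Transmigration</title><epnum>1</epnum></episode></episodeguide>"
)

print(pretty(raw))
```

It works best on output that is still on one line; input that already contains line breaks between tags tends to come out with extra blank lines.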


Of course, here's the newest version (should I just attach instead?)

Code:
<?xml version="1.0" encoding="utf-8"?><scraper framework="1" date="2009-11-15" name="AniDB.net" content="tvshows" thumb="anidb.jpg" language="en">
    <NfoUrl dest="3">
        <RegExp input="$$1" output="\1" dest="3">
            <expression></expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl\?show=animelist&amp;adb.search=\1&lt;/url&gt;" dest="3">
            <expression></expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
            <!--     Multiple Results  -->
            <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" noclean="1">&lt;a href="(animedb.pl\?show=anime&amp;amp;aid=([0-9]*))"&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
            
            <!--     Only one Result  -->
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;\2&lt;/url&gt;&lt;/entity&gt;" dest="5+">
                <expression repeat="no" noclean="1">&lt;th class=&quot;field&quot;&gt;Main Title&lt;/th&gt;.....&lt;td class=&quot;value&quot;&gt;(.[^\n]*)....&lt;a class=&quot;shortlink&quot; href=&quot;(http.[^&quot;]*)</expression>
            </RegExp>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="8">
                <expression repeat="yes">&lt;th class="field"&gt;Main Title&lt;/th&gt;.....&lt;td class="value"&gt;(.[^\n]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="8+">
                <expression trim="1" noclean="1">&lt;th class="field"&gt;Year&lt;/th&gt;.[^&gt;]*&gt;([^&lt;]*)|$</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="8+">
                <expression>&lt;div class="image".[^"]*"(http.[^"]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
                <expression>animevotes&amp;amp;aid=[0-9]*"&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
                <expression>class=&quot;desc&quot;&gt;(.[^&lt;]+)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;episode&gt;\1&lt;/episode&gt;" dest="8+">
                <expression repeat="no">&lt;td class=&quot;epno lastep&quot;&gt;([0-9]+)&lt;/td&gt;</expression>
            </RegExp>

            <expression noclean="1"></expression>
        </RegExp>
        <RegExp input="$$10" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;" dest="3+">
            <RegExp input="$$1" output="&lt;episode&gt;&lt;url&gt;http://anidb.net/&lt;/url&gt;&lt;season&gt;1&lt;/season&gt;&lt;title&gt;\2&lt;/title&gt;&lt;epnum&gt;\1&lt;/epnum&gt;&lt;/episode&gt;" dest="10">
                <expression repeat="yes">&lt;td class=&quot;id eid&quot;&gt;&lt;a href.[^&gt;]*&gt;([0-9]+).*?label.[^&gt;]*&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
</scraper>
#14
@MukiDA

I've been trying to make a scraper for AniDB too, based on your first draft.

I think you can safely use lazy quantifiers, as they're widely used in all the other scrapers, although the documentation says they are "not working". So, for example, your plot expression could look like this:
Code:
<expression trim="1">class=&quot;desc&quot;&gt;\s*(.*?)\s*&lt;/div</expression>
It makes sure no whitespace is left around the plot text. It doesn't make much difference with your code, but it's just an example; laziness is very useful when the parsing gets complicated.
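
To show the difference between the greedy plot expression from earlier in the thread and the lazy one, here is a minimal sketch in Python. Two assumptions: Python's `re` stands in for XBMC's regex engine, and `re.DOTALL` is used so that `.` also matches newlines. The HTML snippet is invented and much simpler than a real AniDB page.

```python
import re

# Invented snippet: a description div followed by an unrelated div.
html = (
    '<div class="desc">\n   Pai is the last of the Sanjiyan.\n   </div>'
    '<div class="other">unrelated</div>'
)

# Greedy: (.*) runs to the LAST </div> in the input.
greedy = re.search(r'class="desc">(.*)</div>', html, re.DOTALL).group(1)

# Lazy: (.*?) stops at the FIRST </div, and \s* trims the surrounding whitespace.
lazy = re.search(r'class="desc">\s*(.*?)\s*</div', html, re.DOTALL).group(1)

print(repr(greedy))
print(repr(lazy))
```

The greedy version drags the following div along with the plot, which is the "way too much junk" effect described above; the lazy version returns just the sentence.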

Anyway, I made a few other modifications, but there's one thing that bothers me right now: the year entry in GetDetails. For some reason it is always shown as 65535 (0xFFFF), no matter what I put in the <year></year> field. Any idea what's going on?

I'll also try adding some fanart from theTVdb, as I've seen there's quite a lot there for anime.
#15
You can scrape from more than one site? o_0

... oh wait, I guess the spec does have quite a bit of room for that.

Okay, here's another shot at finding the episode data. For some odd reason, GetEpisodeDetails never shows up in the log. Of course, I don't have a clear understanding of exactly how that function works. Hopefully someone can point out my folly. Once again, thanks to everyone here for the help, especially spiff, who seems to be the ultimate one-man orchestra on scraping.

Code:
<?xml version="1.0" encoding="utf-8"?><scraper framework="1" date="2009-11-15" name="AniDB.net" content="tvshows" thumb="anidb.jpg" language="en">
    <NfoUrl dest="3">
        <RegExp input="$$1" output="\1" dest="3">
            <expression></expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl\?show=animelist&amp;adb.search=\1&lt;/url&gt;" dest="3">
            <expression></expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
            <!--     Multiple Results  -->
            <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" noclean="1">&lt;a href="(animedb.pl\?show=anime&amp;amp;aid=([0-9]*))"&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
            
            <!--     Only one Result  -->
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;\2&lt;/url&gt;&lt;/entity&gt;" dest="5+">
                <expression repeat="no" noclean="1">&lt;th class=&quot;field&quot;&gt;Main Title&lt;/th&gt;.....&lt;td class=&quot;value&quot;&gt;(.[^\n]*)....&lt;a class=&quot;shortlink&quot; href=&quot;(http.[^&quot;]*)</expression>
            </RegExp>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$8" output="&lt;details&gt;\1" dest="3">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="8">
                <expression repeat="yes">&lt;th class="field"&gt;Main Title&lt;/th&gt;.....&lt;td class="value"&gt;(.[^\n]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="8+">
                <expression trim="1" noclean="1">&lt;th class="field"&gt;Year&lt;/th&gt;.[^&gt;]*&gt;([^&lt;]*)|$</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="8+">
                <expression>&lt;div class="image".[^"]*"(http.[^"]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
                <expression>animevotes&amp;amp;aid=[0-9]*"&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
                <expression>class=&quot;desc&quot;&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;episode&gt;\1&lt;/episode&gt;" dest="8+">
                <expression repeat="no">&lt;td class=&quot;epno lastep&quot;&gt;([0-9]+)&lt;/td&gt;</expression>
            </RegExp>

            <expression noclean="1"></expression>
        </RegExp>
        <RegExp input="$$10" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;&lt;/details&gt;" dest="3+">
            <RegExp input="$$1" output="&lt;episode&gt;&lt;url gzip=&quot;yes&quot;&gt;http://animedb.net/animedb.pl\?show=ep\&amp;\1&lt;/url&gt;&lt;season&gt;1&lt;/season&gt;&lt;title&gt;\3&lt;/title&gt;&lt;epnum&gt;\2&lt;/epnum&gt;&lt;/episode&gt;" dest="10">
                <expression repeat="yes">&lt;td class=&quot;id eid&quot;&gt;&lt;a href=&quot;animedb.pl\?show=ep\&amp;amp\;(.[^&quot;]*)&quot;&gt;([0-9]+).*?label.[^&gt;]*&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
    <GetEpisodeDetails dest="3">
        <RegExp input="$$7" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot; standalone=&quot;yes&quot;?&gt;&lt;details&gt;&lt;title&gt;\1&lt;/title&gt;&lt;season&gt;1&lt;/season&gt;&lt;title&gt;\3&lt;/title&gt;&lt;/details&gt;" dest="3">
            <RegExp input="$$1" output="\1" dest="7">
                <expression noclean="1">Main Title&lt;/th&gt;.[^&lt;]*&lt;td class=&quot;value&quot;&gt;(.[^\(]*)\(</expression>
            </RegExp>
        </RegExp>
    </GetEpisodeDetails>
</scraper>