[WIP] AniDB.net Anime Video Scraper
#16
Very cool. i created a temp source & added Excel Saga (Official Title)
aka Heppoko Jikken Animation Excel Saga (Main Title)

main and official titles don't differ often, but here is another example:
> official = Koukaku Kidoutai S.A.C. 2nd GIG
> main = Ghost in the Shell: Stand Alone Complex 2nd GIG
could you please grab the main title, or make it an option? :)

also, beyond fanart from thetvdb,
please consider the wide banner too

more importantly - we need eps :D
my files are enumerated (for the tvdb which does have this one).

hope this will help ... (BTW. i have no tvshow.nfo in this dir)

Code:
02:48:30 T:2997828464 M:1976143872   ERROR: CVideoInfoScanner::OnProcessSeriesFolder: Asked to lookup episode /mnt/newHD/anime2/Excel.Saga/Excel_Saga_s01e01.mkv online, but we have no episode guide. Check your tvshow.nfo and make sure the <episodeguide> tag is in place.
#17
@zosky
afaik the episode guide does not work with that version yet, it is misplaced in the data.

i managed to correct that and fetch the episode list and data but for some reason i can't get it displayed anywhere in my xbmc.

on top of that when i switch to library mode the Anime folder i use for my test is empty, although the files are present and that anime folder has the correct info, so i assume the episodes were not recognized ?

That's probably a side effect of the info "Episodes: 0 (0 watched - 0 unwatched)" i get no matter what i do..

Could someone please explain to me how the scraper engine is supposed to handle the episode list ? or point me to up-to-date documentation on the subject.
I have mimicked the tvrage and tvdb scrapers but can't manage to get anything to work here with anidb, although <details>, <episodes> and GetEpisodeDetails <details> are populated.

thx

- edit -
i'm leaving the original post so anyone can see the problem i had.
reading zosky's post i figured out what was wrong in my setup: the mkv files i have were not "enumerated" by xbmc. They all contain the show title and an episode number, but xbmc was not able to identify them clearly because of other (group) info in the filename.
Once i renamed the files and refreshed the anime folder data they got enumerated, and i now have all the files and their respective data present..

Now i'll be able to finish my first draft, i still have to understand how caching works because i'm fetching unnecessary pages.
#18
@spiff (and others)

i'm struggling with a bug in the scraper, need some quick help.

i managed to understand how you fetch additional pages and external pages so i added some fanart scraping but i have the following problem :

in order to get the fanart on thetvdb i need to perform a search on their engine, i don't have any id to pass to the site from anidb. I use a function as follows :
Code:
<url gzip="yes" function="GetFanart">http://www.thetvdb.com/index.php?seriesname=Michiko to Hatchin&fieldlocation=1&language=7&genre=Animation&year=&order=fanartcount+desc&searching=Search&tab=advancedsearch</url>

using the anime name i get from the current anidb page being parsed. I don't know if the anime name used in the selection dialog is still somewhere but it probably makes no difference..
and so xbmc returns the following error :
Code:
ERROR: InternalGetDetails: Unable to parse web site [http://www.thetvdb.com/index.php?seriesname=Michiko to Hatchin&fieldlocation=1&language=7&genre=Animation&year=&order=fanartcount+desc&searching=Search&tab=advancedsearch]

it's probably a URL-encoding error caused by the spaces in the anime name.

Is there a way i can replace the spaces (with "+") or do some urlencoding on the pattern match i'll use for the <url> ?
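
For reference, this is what the encoding being asked about looks like outside the scraper engine. It is only a Python sketch, not scraper XML: `build_search_url` is a hypothetical helper, and the query parameters are simply copied from the URL quoted above.

```python
from urllib.parse import quote_plus

# Query-string tail copied from the thetvdb search URL quoted in this post.
TVDB_SEARCH_TAIL = (
    "&fieldlocation=1&language=7&genre=Animation&year="
    "&order=fanartcount+desc&searching=Search&tab=advancedsearch"
)


def build_search_url(series_name: str) -> str:
    """Hypothetical helper: build the search URL with the title form-encoded."""
    # quote_plus turns spaces into "+" and percent-encodes reserved characters.
    return (
        "http://www.thetvdb.com/index.php?seriesname="
        + quote_plus(series_name)
        + TVDB_SEARCH_TAIL
    )


print(build_search_url("Michiko to Hatchin"))
```

The point is only that the title has to be encoded before it is placed in the URL; how to get the same result from nested <RegExp> blocks is a separate question.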


Then could you point me to the current xml format for thumb and fanart. And there are some posters and banners on that website, can we use them for something in xbmc skins ?

thx
#19
@eldon

The site you're trying to parse isn't g-zipped (I just tried wget-ing it and it cats just fine), so you can remove the gzip="yes" part. In addition, you might need to escape the question mark (e.g. \? instead of ?) and possibly the ampersands (\&).

@zosky

I don't think my per-episode code works AT ALL yet. GetEpisodeDetails doesn't return ANYTHING (it doesn't even get called) in the XBMC debug log. I'm doing something horribly wrong there and I don't yet know what.

The TVDB.com scraping is a pretty brilliant idea, and I'd love to use it as a realtime fallback. (e.g. scrape it if AniDB is missing info, and maybe even add its poster thumbs)

As a side note, and I think the lack of this is due more to a lack of momentum than anything else, but Anime could really use its own section, or some way to add custom sections to the Library. If it required adding a few more image text "blocks" for existing skins, so be it. Right now the closest thing we can do is sort by genre, and it feels unusably cumbersome if you have a frag-ton of anime and someone in your abode that's not into it is looking for a random TV show to watch; it puts Ghost in the Shell next to both Firefly 'n Dexter's Lab (as it would be categorized under "Animation", and "Science Fiction"). I guess it's not a remotely easy fix (as its Library section would need to be capable of dealing with both Movies and TV shows simultaneously) but it'd be nice.
#20
okay i managed to get fanart working so i'll post my first draft which is quite complete and just misses some cast & crew info.

Code:
<?xml version="1.0" encoding="utf-8"?><scraper framework="1" date="2009-11-15" name="AniDB.net" content="tvshows" thumb="anidb.png" language="en">
    <GetSettings dest="3">
        <RegExp input="$$5" output="&lt;settings&gt;\1&lt;/settings&gt;" dest="3">
            <RegExp input="$$1" output="&lt;setting label=&quot;Enable fanart from thetvdb.org&quot; type=&quot;bool&quot; id=&quot;fanart&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5">
                <expression/>
            </RegExp>
        </RegExp>
    </GetSettings>
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl?type.web=1&amp;type.unknown=1&amp;type.tvspecial=1&amp;type.tvseries=1&amp;type.ova=1&amp;type.other=1&amp;type.musicvideo=1&amp;type.movie=1&amp;show=animelist&amp;orderby.name=0.1&amp;noalias=1&amp;do.update=update&amp;adb.search=\1&lt;/url&gt;" dest="3">
            <expression>([^\)\(]+)</expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <!--     Multiple Results  -->
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3 - \4&lt;/title&gt;&lt;year&gt;\5&lt;/year&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/\1&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" noclean="1">&lt;a href=&quot;(animedb.pl\?show=anime&amp;amp;aid=([0-9]*))&quot;&gt;([^&lt;]*)&lt;/a&gt;.*?&lt;td class=&quot;type[^&gt;]+&gt;([^&lt;]+)&lt;/td&gt;.*?airdate.*?([0-9]{4})?&lt;/td&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>            
            <!--     Only one Result  -->
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1 - \3&lt;/title&gt;&lt;year&gt;\4&lt;/year&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/\2&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="no" noclean="1">Main Title&lt;/th&gt;.*?&gt;([^\r\n\t]+).*?href=&quot;http://anidb.net/([^&quot;]*).*?Type&lt;/th&gt;[^&gt;]+&gt;([^,&lt;]*).*?Year&lt;/th&gt;.*?([0-9]{4})?(?: till|&lt;/)</expression>
            </RegExp>
            <expression clear="yes" noclean="1"/>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="8">
                <expression trim="1">Main Title&lt;/th&gt;.*?&gt;([^\r\n\(]+)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="8+">
                <expression>Year&lt;/th&gt;.*?([0-9]{4})(?: till|&lt;/)</expression>
            </RegExp>
            <!--<div class="image"> \n <img src="http://img7.anidb.net/pics/anime/13614.jpg" alt="Michiko to Hatchin" />-->
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="8+">
                <expression>&lt;div class=&quot;image&quot;.*?(http[^&quot;]*)</expression>
            </RegExp>
            <!--<a href="animedb.pl?show=animevotes&amp;aid=5779">7.74</a>-->
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
                <expression>animevotes&amp;amp;aid=[0-9]*&quot;&gt;([^&lt;]*)</expression>
            </RegExp>
            <!-- <a href="animedb.pl?show=lexicon&amp;vtype=cat&amp;relid=4" title="search for other anime with this category">Action</a>,-->
            <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
                <expression repeat="yes">animedb.pl\?show=lexicon&amp;amp;vtype=cat&amp;amp;relid=[0-9]+[^&gt;]*?&gt;([^&lt;]+)&lt;/a</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;studio&gt;\1&lt;/studio&gt;" dest="8+">
                <expression>Animation Work[^&gt;]*&gt;([^&lt;]+)&lt;/a</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;premiered&gt;\1&lt;/premiered&gt;" dest="8+">
                <expression>Year&lt;/th&gt;.*?([0-9]{4})(?: till|&lt;/)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
                <expression trim="1">class=&quot;desc&quot;&gt;\s*(.*?)\s*&lt;/div</expression>
            </RegExp>
            <!--<table id="characterlist" class="characterlist"> .. </table>-->
            <RegExp input="$$6" output="&lt;actor&gt;&lt;thumb&gt;&lt;/thumb&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\1&lt;/role&gt;&lt;/actor&gt;" dest="8+">
                <RegExp input="$$1" output="\1" dest="6">
                    <expression noclean="1">&lt;table id=&quot;characterlist&quot; class=&quot;characterlist&quot;&gt;(.*?)&lt;/table&gt;</expression>
                </RegExp>    
                <expression repeat="yes">animedb\.pl\?show=character&amp;amp;charid=[0-9]+&quot;&gt;([^&lt;]+)&lt;/a.*?animedb\.pl\?show=creator&amp;amp;creatorid=[0-9]+&quot;&gt;([^&lt;]+)&lt;/a</expression>
            </RegExp>
            <RegExp input="$$3" output="&lt;url function=&quot;GetFanart&quot;&gt;http://www.thetvdb.com/index.php?seriesname=$$3&amp;fieldlocation=1&amp;language=7&amp;genre=Animation&amp;year=&amp;order=fanartcount+desc&amp;searching=Search&amp;tab=advancedsearch&lt;/url&gt;" dest="8+">
                <RegExp input="$$1" output="\1" dest="7">
                    <expression trim="1">Main Title&lt;/th&gt;.*?&gt;([^\r\n\(]+)</expression>
                </RegExp>
                <RegExp input="$$7" output="\1+" dest="3">
                    <expression repeat="yes" trim="1">([^\s]+)</expression>
                </RegExp>
                <RegExp input="$$3" output="\1" dest="3">
                    <expression>([^\s]+)\+</expression>
                </RegExp>
                <expression noclean="1"/>
            </RegExp>            
            <!-- <input type="hidden" name="aid" value="5779" /> OR use cache ? -->
            <RegExp input="$$1" output="&lt;episodeguide&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl\?show=anime&amp;aid=\1&lt;/url&gt;&lt;/episodeguide&gt;" dest="8+">
                <expression>&lt;input type=&quot;hidden&quot; name=&quot;aid&quot; value=&quot;([0-9]+)&quot; /&gt;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>
    <GetFanart dest="5">
        <RegExp input="$$1" output="&lt;details&gt;&lt;url gzip=&quot;yes&quot; function=&quot;GetFanartData&quot;&gt;http://www.thetvdb.com/index.php?tab=series&amp;id=\1&amp;lid=\2&lt;/url&gt;&lt;/details&gt;" dest="5">
            <expression>&lt;a href=&quot;/index\.php\?tab=series&amp;amp;id=([0-9]+)&amp;amp;lid=([0-9]+)&quot;.*?[1-9]+&lt;/td&gt;&lt;/tr&gt;</expression>
        </RegExp>
    </GetFanart>
    <GetFanartData dest="5">
        <RegExp input="$$8" output="&lt;details&gt;&lt;fanart&gt;\1&lt;/fanart&gt;&lt;/details&gt;" dest="5">
            <RegExp input="$$1" output="&lt;thumb preview=&quot;http://www.thetvdb.com\1&quot;&gt;http://www.thetvdb.com/\2&lt;/thumb&gt;" dest="8">
                <expression repeat="yes">&lt;img src=&quot;(/banners/_cache/fanart/original/[^&quot;]+)&quot;.*?&lt;a href=&quot;(banners/fanart/original/[^&quot;]+)&quot;</expression>
            </RegExp>    
            <expression noclean="1"/>
        </RegExp>
    </GetFanartData>
    <GetEpisodeList dest="3">

        <RegExp input="$$8" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;" dest="3">
            <RegExp input="$$1" output="&lt;episode&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl?show=ep&amp;eid=\1&lt;/url&gt;&lt;season&gt;1&lt;/season&gt;&lt;title&gt;\3&lt;/title&gt;&lt;epnum&gt;\2&lt;/epnum&gt;&lt;/episode&gt;" dest="8+">
                <expression repeat="yes">id=&quot;eid_([0-9]+)&quot;.*?eid=[0-9]+&quot;&gt;([0-9]+)&lt;/a.*?label[^&gt;]*&gt;([^&lt;]+)</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetEpisodeList>
    <GetEpisodeDetails dest="3">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="5">
                <expression>Main Title&lt;/th&gt;.*?&gt;([^\r\n\(]+)</expression>
            </RegExp>                        
            <RegExp input="$$1" output="&lt;plot&gt;&lt;/plot&gt;" dest="5+">
                <expression/>
            </RegExp>
            <!--class="rating ep mid">7.74 <span-->    
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="5+">
                <expression>class=&quot;rating[^&gt;]*&gt;([0-9\.]+)</expression>
            </RegExp>    
            <!--
                <th class="field">Air/Release Date</th>
                <td class="value">16.10.2008</td>
            -->
            <RegExp input="$$1" output="&lt;aired&gt;\1&lt;/aired&gt;" dest="5+">
                <expression>Air/Release.*?&gt;([0-9\.]+)&lt;/td</expression>
            </RegExp>                    
            <expression noclean="1"/>
        </RegExp>        
    </GetEpisodeDetails>
</scraper>

the scraper settings dialog doesn't show up with the fanart option (on/off), i don't know why, something's missing somewhere.

cast & crew info is fetched from the anime main page so no thumbnail for them at the moment, and no crew at all is parsed yet so only cast is there.

I'll add the posters and banners from thetvdb and see what happens..

i've left some of the raw html, to be parsed, above most of the expressions so you can see how i parse it and modify it if you find something unnecessary. I had to remove big chunks because it was exceeding the max post size :\

the GetFanart function call code is quite dirty: it is only there to substitute spaces with "+" to urlencode the search text. Let me know if there's a more natural way to do that task.

I also took a few minutes making a scraper icon for anidb :
[Image: anidb scraper icon]

let me know if it works with your anime library, there are probably quite a few bugs here and there as i only tested it on two anime (Shigurui and Michiko to Hatchin), plus Cowboy Bebop for a complete cast page.
#21
just realised tvdb has an xml api so i'm making changes to use it...

Doing that i noticed there are additional covers per season. i'm only using single-season anime for my testing, and it looks like thumbs with a season attribute, whatever it is set to, are not listed in the info "get thumb" dialog. Should thumb season attributes be used or avoided, and how do you set the anime season property ?

i'm wondering if the principle of seasons really applies to anime anyway. From what i know most anime have only one season, and when it's a returning series either the title changes or the episode numbers keep growing (ex: ghost in the shell sac, sac 2nd gig or naruto ep 1-220).
Can any anime expert comment on that to let me know if i should dig deeper into it ?

As a side note, more of a feature request, i was wondering if there was any use for thumbs of the characters ?
They are usually available on anidb and they would look quite nice on the cast info dialog.
They could simply be substituted for the actors' thumbs, maybe according to a skin parameter like "Use characters thumbs".

Finally, i should point out it would be nice if xbmc was a bit more relaxed regarding episode filenames: if a filename has the title in it and a number possibly corresponding to an episode number (1 2 3 or 01 02 03..) it should be found. You can't always easily rename files, for example when downloading torrents you usually can't rename them as you'd like, so the strict use of s01e01 seems a bit harsh.
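
To illustrate the kind of relaxed filename matching asked for in this post, here is a rough Python sketch. It is a heuristic under stated assumptions, not how xbmc actually matches files: `guess_episode` is a hypothetical helper, and the rules (drop bracketed group tags, prefer an sXXeYY marker, otherwise take the last standalone 1-3 digit number) are just one plausible choice.

```python
import re
from typing import Optional

# Bracketed tags such as "[SomeGroup]", "[1A2B3C4D]" or "(1280x720)".
TAG_RE = re.compile(r"\[[^\]]*\]|\([^)]*\)")
# Strict form, e.g. "s01e05".
STRICT_RE = re.compile(r"s(\d{1,2})e(\d{1,3})", re.IGNORECASE)
# Loose form: a 1-3 digit number not glued to letters or other digits.
LOOSE_RE = re.compile(r"(?<![0-9A-Za-z])(\d{1,3})(?![0-9A-Za-z])")


def guess_episode(filename: str) -> Optional[int]:
    """Hypothetical helper: guess an episode number from a file name."""
    stem = filename.rsplit(".", 1)[0]   # drop the extension
    stem = TAG_RE.sub(" ", stem)        # drop group / CRC / resolution tags
    strict = STRICT_RE.search(stem)
    if strict:
        return int(strict.group(2))
    loose = LOOSE_RE.findall(stem)
    # Assume the last standalone number is the episode number.
    return int(loose[-1]) if loose else None


print(guess_episode("[Group] Excel Saga - 05 [1A2B3C4D].mkv"))
```

A heuristic like this will misfire on titles that contain standalone numbers of their own, which may be part of why a strict pattern is the default.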
#22
OK. As has already been mentioned in this very old thread, AniDB asks that you do not scrape their pages.

Again, to be clear, please do not scrape AniDB pages.

I appreciate that the xbmc framework provides convenient methods for doing so. I also understand (given an entire forum dedicated to scraper development) that it is a very much accepted method for gathering data. Just the same, please do not scrape AniDB data.

AniDB does offer two public APIs for retrieving data, the complexity of which varies depending on the richness of the data you wish to retrieve.

For your purposes here, the HTTP XML API probably provides what you want, including complete episode lists. To help find the correct aid for a given anime title, a complete dump of all anime titles, across languages is also available. This is updated daily.

If you're anxious to provide something especially in-depth, such as the MediaPortal plugin linked in the previous thread, the UDP API has been around for years.

Yes, I know that neither of these solutions is as convenient as using the scraping framework. However, they do both offer solutions that will not suddenly break one day, when AniDB makes internal changes. And there's a lot to be said for that.

Thank you.

- Ommina [AniDB]
#23
hurrah for you adding http xml api! the udp api is evil and why i abandoned your site many moons ago (plus the fact that nobody responded to my http/xml api inquiries). with that in place all i can promise is that a scraper not using your framework will never hit svn.
#24
well okay i guess that settles it then..

i can certainly use the xml http api but i don't know about client side caching. Scraper caching seems to last only for the duration of a single update process.
On the wiki page, it looks like the studio and cast are missing from the api description, but that's not so bad, although xbmc can use cast names to cross reference xbmc library entries for a specific person which is quite nice.

Maybe someone can elaborate on another "time to live" based cache for the scrapers ?
But i fear that's only something that could be done through scripts or plugins.

It would at least most certainly be required to download the anime database in order to perform searches for aids.. or we could go through google of course as a second-choice method.

Although cache purge doesn't seem to be handled by the scraper itself, at least from what i saw in ScraperUrl.cpp, maybe the purge could honor some "time to live" so the files are not removed right after the scraping is done.

We could do something like <url cache="file.xml" ttl="24">http://..</url>, ttl being set in hours, any devs around ?
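
To make the ttl idea above concrete, here is a minimal sketch in Python rather than in xbmc's C++. Nothing here is xbmc code: `cached_fetch` and its `fetch` callback are hypothetical names, and the only rule illustrated is "reuse the cache file while it is younger than the ttl, otherwise download again".

```python
import os
import time
from typing import Callable


def cached_fetch(url: str, cache_path: str, ttl_hours: float,
                 fetch: Callable[[str], bytes]) -> bytes:
    """Hypothetical helper: return cached bytes while the cache file is
    younger than ttl_hours, otherwise call fetch(url) and rewrite the file."""
    if os.path.exists(cache_path):
        age_seconds = time.time() - os.path.getmtime(cache_path)
        if age_seconds < ttl_hours * 3600:
            with open(cache_path, "rb") as fh:
                return fh.read()
    # Cache missing or expired: download and store a fresh copy.
    data = fetch(url)
    with open(cache_path, "wb") as fh:
        fh.write(data)
    return data
```

With ttl="24" as in the example tag, a daily dump would be downloaded at most once a day per client, and purging would only need to remove files older than their ttl.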


@spiff
make sure you read about the ban rules for the http xml api, it would definitely require some more advanced caching mechanism on xbmc's side.
And if you want me to do the xml scraper i'm fine with it, i'm already halfway anyway.
#25
yeah, i saw that. i will try to cough up a better system for a more persistent scraper cache. let me worry about that bit, you just make sure you use cache files in your scraper wherever appropriate.
#26
http://forum.xbmc.org/member.php?find=la...er&t=50055 delivered on my part
#27
delivered what, sorry don't quite understand what the link points to..

Anyways, i've got a bit of a problem, how can i get the current title being searched (passed by the search dialog from the filename or manual input) when i'm in GetSearchResults, or in any other part of the scraper for that matter ?


i'm also having some trouble parsing the large anidb anime titles xml file, i'm looking for a way to iterate through <anime></anime> blocks to look for the search pattern but i can't find a way to do that.

the anidb anime titles data is as follows :

Code:
    <anime aid="18">
        <title type="short" xml:lang="x-unk">ICT</title>
        <title type="official" xml:lang="en">Irresponsible Captain Tylor</title>
        <title type="main" xml:lang="x-jat">Musekinin Kanchou Tylor</title>
        <title type="official" xml:lang="fr">Tylor - The irresponsible captain</title>
        <title type="syn" xml:lang="ru">Безответственный капитан Тайлор</title>
        <title type="official" xml:lang="ja">無責任艦長タイラー</title>
    </anime>
    <anime aid="19">
        <title type="official" xml:lang="ja">りぜるまいん</title>
        <title type="syn" xml:lang="th">ริเซ็ลไมน์</title>
        <title type="short" xml:lang="x-unk">rizel</title>
        <title type="syn" xml:lang="x-unk">Rizelmain</title>
        <title type="main" xml:lang="x-jat">Rizelmine</title>
        <title type="syn" xml:lang="x-unk">Rizerumain</title>
    </anime>
    <anime aid="20">
        <title type="main" xml:lang="x-jat">Tokyo Underground</title>
        <title type="official" xml:lang="fr">Tokyo Underground</title>
        <title type="short" xml:lang="x-unk">TU</title>
        <title type="official" xml:lang="ja">東京アンダーグラウンド</title>
    </anime>

i've got a working regex but it takes ages to run on that large file, i'd like to improve that by running a simpler regex through each single block.

What should be the correct <RegExp><expression> structure to perform such a task ?

thx
#28
i have completed the scraper but am still not able to use the anidb titles database, as explained above.

i'm currently using google to avoid scraping any anidb.net pages but if someone can help me build a correct <RegExp><expression> structure, it would make the whole scraper anidb compliant.

Or let me know if it can't be done and i'll post the scraper in its current state.

thx
#29
i don't understand what you are trying to extract from those blocks.
#30
i'm simply trying to find the anime id based on the title xbmc found on the filename or directory name.

the match should be based on those "main" title lines :
Code:
<title type="main" xml:lang="x-jat">Musekinin Kanchou Tylor</title>

when you make a regex trying to scan the whole database file (2.2MB) it takes approx 30 sec to 1 min to find a match, probably a lot more if the entry is located farther as i didn't use "repeat" for my test.

so i'd like to break the database into <anime></anime> blocks and try a title match on each of them. I do know how to assign a matching block to a buffer and then use that buffer for further parsing but i haven't found a way to repeat that on all the blocks.

But frankly i must say that even with such method i'm not really sure it'll be much faster and if not that would make the use of anidb titles database quite useless, or maybe only optional for those with serious muscle under their xbmc hood.
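
For comparison, outside the scraper engine the block-by-block lookup described above is usually done with a streaming XML parser instead of a regex over the whole file. The Python sketch below is not something the scraper XML can run: `find_aid` is a hypothetical helper, the `<animetitles>` root element is an assumption, and the sample entries are taken from the data quoted earlier in the thread.

```python
import io
import xml.etree.ElementTree as ET
from typing import Optional


def find_aid(xml_source, wanted_title: str) -> Optional[str]:
    """Hypothetical helper: stream <anime> blocks one at a time and return
    the aid of the first block whose "main" title matches (case-insensitive)."""
    wanted = wanted_title.strip().lower()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "anime":
            continue
        for title in elem.findall("title"):
            text = (title.text or "").strip().lower()
            if title.get("type") == "main" and text == wanted:
                return elem.get("aid")
        elem.clear()  # drop the block we just checked to keep memory flat
    return None


# Root element name is an assumption; entries come from the sample above.
SAMPLE = """<animetitles>
  <anime aid="18">
    <title type="official" xml:lang="en">Irresponsible Captain Tylor</title>
    <title type="main" xml:lang="x-jat">Musekinin Kanchou Tylor</title>
  </anime>
  <anime aid="19">
    <title type="main" xml:lang="x-jat">Rizelmine</title>
  </anime>
</animetitles>"""

print(find_aid(io.BytesIO(SAMPLE.encode("utf-8")), "rizelmine"))
```

Whether the regex-based engine can be made to walk the blocks at a similar cost is exactly the open question here; the sketch only shows that each block needs to be visited once.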

Google anidb title search works fine but anidb indexed page titles don't always have the actual anime title in them, only "Anidb.net".