[WIP] AniDB.net Anime Video Scraper
#31
you're telling me this xml api doesn't even have a search?

the uhrm, for lack of another non-offensive term, choices of the anidb guys (slight offense only ommina) truly amaze me.
#32
well as far as i could tell the http api only provides access to limited per anime data, provided you already know the anidb ID.

As for the search, they provide a daily copy of the anime titles in their database (around 2.2MB in its current state) and leave the client side to deal with it.

I understand from your answer that google will be the best choice for this scraper to identify the anime. Using the database could be left as an option, and/or further identification could be done by cross-referencing the id found on google with the titles stored in the database for that id.

i'll try to see if parsing the database is fast enough when you have the anime id and then maybe use it to double check google's result.
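
A minimal sketch of that double check, assuming the titles file has one `anime` element per id with `title` children. The element names, the `aid` attribute and the sample data below are assumptions made up for illustration, not taken from the real dump:

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the daily titles dump (hypothetical ids and titles).
SAMPLE = """<animetitles>
  <anime aid="101"><title>Example Show</title><title>Ekusanpuru</title></anime>
  <anime aid="102"><title>Naruto</title></anime>
</animetitles>"""


def titles_for(root, aid):
    """Return every title stored for the given anime id (empty list if unknown)."""
    for anime in root.iter("anime"):
        if anime.get("aid") == str(aid):
            return [t.text or "" for t in anime.iter("title")]
    return []


def confirms(root, aid, query):
    """True if the query appears in any title of the id the web search returned."""
    q = query.lower()
    return any(q in t.lower() for t in titles_for(root, aid))


root = ET.fromstring(SAMPLE)
print(confirms(root, 102, "naruto"))  # id found by the search, checked locally
print(confirms(root, 101, "naruto"))  # a wrong id fails the check
```

If the check fails, the scraper could fall back to a plain title search over the same file.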

Besides that the scraper looks fine: anidb api data is used, and missing media or episode info is fetched from thetvdb api.

I still have to see how i can access anime separately from the other tv shows in xbmc; i guess there are some shortcuts you can use.

I'll post the scraper over the weekend, thx again for your answer.
#33
while running through 2.2mb shouldn't really take a minute with a tight expression, it still won't be nice. at least the scraper will operate nicely for those it can find with google, else there are always the url nfo's.
#34
you can't be serious... regexp to parse xml... and all that using python?! *facepalm* every time i see people abusing regexp for everything and their dog i die a little bit inside...

there are various very fast xml frameworks for py. so why oh why don't you use one of them?!

cheap example:


Code:
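from lxml import etree

searchterm = "naruto"
result = []
# animetitles.xml is the locally stored copy of the daily titles dump
tree = etree.parse("animetitles.xml")

# each child of the root is one anime; each of its children is one title
for anime in tree.getroot():
    for title in anime:
        # title.text is None for an empty element, so guard before lower()
        if searchterm in (title.text or "").lower():
            result.append(anime.get("aid"))
            break  # one matching title per anime is enough

print(result)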

that's not even an attempt at being efficient or fast, but takes only a few ms for the 2mb file.

also the point is to grab the xml file and process it completely locally (and then reget it every odd blue moon). which has speed advantages and doesn't cause additional load on anidb's server. aside from that you yourself can decide on how you want to search.
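
a rough sketch of the "reget it only now and then" part, assuming the local copy is called `animetitles.xml` and that one day is an acceptable age (the interval is an assumption, pick your own). the download itself is left out on purpose:

```python
import os
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed refresh interval, not an anidb rule


def needs_refresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """True if the local copy is missing or older than max_age seconds."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) > max_age


if needs_refresh("animetitles.xml"):
    print("local copy missing or stale: fetch it once, then search locally")
else:
    print("local copy is fresh: search it without touching the server")
```

every search in between then runs against the file on disk only.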

considering anidb is run completely free of cost, ads or money donations and is not backed by a multi-million corporation it should be quite obvious that we have restricted facilities. so don't expect us to provide supreme additional services which the majority of our users will never ever use.
#35
uhm. i don't want/intend to pick a fight. but who gave you the idea python is involved? it is not. we use a custom scraper framework based on regexps+xml.
i agree that using regexp to parse xml is stupid in general, but i made the choice since it allows handling html/xml/whatever text format in one unified framework, and .py's were just too slow for the job on the xbox.

from my point of view you would save bandwidth by offering a search. i have to do a helluvalot of searches to download 2mb from your site. this is obviously not valid if cpu is your concern.

of course i do not expect anything from a free service based on volunteer work. i just find the choice of offering data, yet making no search available, hence forcing every client to reimplement the wheel, peculiar. to be a bit cheeky, i would like to point out that both tmdb and thetvdb fall into exactly the same category (free, voluntary) and both survive just fine having a search in their api :P

in either case, your services are inadequate/unfit for use by our system, which is unfortunate for those of our users that care (i am not included). they will have to live with the not-always-functioning google search.
#36
well no matter how the client side handles things, anidb provides it with 95% of the data required to get something out of it and i guess that's good enough.

If google fails, the xml titles database can be used as a failsafe, moreover you can tweak google search by using "manual" inputs in xbmc.

If it can be of any help i'm not sure some of the data present in the current http api is really necessary, such as full tags and categories descriptions when the anime cast is missing.

Anyways i'll make sure the scraper can handle both google and the local database.

Anidb is a truly great website and i can only encourage the devs and admins to keep up the good work.
#37
How is this going? Any more progress yet or has the lack of features with the API put a stop to this one? TVDB seems unable to scrape my Anime (episodes) so I'd definitely be interested in this.
#38
hi sorry for the late update, i couldn't find the time to make more tests with the anime database so i'm still only using google at the moment.

If i can't find time to update it with additional anidb database processing, i'll post it in its current state this weekend. It's working without the db but relying only on google can't be considered safe in the long term.
#39
Have you found any way of separating Anime from regular TV Shows?

I tried Anime in my library using thetvdb scraper, it wasn't too bad but I will be sure to give your scraper a try. It did get a bit annoying though, having them so mixed, so I just switched to using library for TV and Movies and file view for Anime.
#40
mashles Wrote:Have you found any way of separating Anime from regular TV Shows?

I tried Anime in my library using thetvdb scraper, it wasn't too bad but I will be sure to give your scraper a try. It did get a bit annoying though, having them so mixed, so I just switched to using library for TV and Movies and file view for Anime.

I was also looking for a way to do this as I too find it annoying when the two are mixed. Currently I use the "genre" filter in the Aeon skin (you can get to it by hitting the Down directional key on your remote while TV Shows is highlighted) to show just tv-shows tagged with "Anime" or "Animation". I think a "proper" solution would maybe require some change of code at xbmc's level in order to create a new category for anime (i.e. Movies, TV Shows, Anime) which would have its own associated scrapers? (Or I may just be talking out of my arse).

Edit: Just noticed your reply eldon, looking forward to it, I can understand that the google search method is not ideal but frankly TVDB is failing so severely at recognising any episodes of my Anime that I'm willing to take the risk.
#41
Searching anidb.net from Google seems to always return exactly one match (and sometimes the wrong match, like for "Initial D" or "Honey and Clover II"). Google seems to be much better at searching Anime News Network though (I think maybe because Anime News Network itself uses Google for its search?). Maybe that would be a better choice?
#42
Hi,

as i won't have much time to perfect it, i'm posting my current revision of the anidb scraper.

Features :
- uses anidb and thetvdb http api, no scraping of web pages.
- can use google to search for an anime instead of anidb database file.
- can fetch posters and fan art from thetvdb.com
- can fetch episode data from thetvdb.com

Option :
- Use Google Search (default yes) : overrides the animetitles.xml database search.
- Fetch Fan Art from thetvdb.com (default yes)
- Fetch episode info from thetvdb.com (default no)

Notes:
- due to anidb http api restrictions it is recommended to use google search if you don't have a very recent (01/2010) svn build (cache with time to live aka "<scraper .. cachePersistence="hh:mm".."), otherwise you could get banned from the anidb api.
- some data is passed from one scraper section to another using the actual download urls. It's not very clean but i couldn't find another way to do it, and some data (title, id) is required for parsing url results and accessing cached files.
- i managed to build a fancy regex to browse the animetitles.xml database quickly, so it should be very fast now and will work quite well. Once the cache time to live is present in the stable release the search should default to the database file instead of google.
- the scraper uses an api ident i created on anidb; if you want to use one from the xbmc team you should replace "client=xbmcscrap" with your id in the various urls.
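
To illustrate the regex idea in the third note: this is not the expression used in the scraper (that one lives in the XBMC scraper xml), only a Python sketch of the same approach. The sample markup, attribute names and ids are made up for the example:

```python
import re

# Made-up excerpt in the shape of a titles dump (hypothetical ids and titles).
SAMPLE = '''<anime aid="101">
<title type="main">Example Show</title>
<title type="official">Example Show: The Animation</title>
</anime>
<anime aid="102">
<title type="main">Another Title</title>
</anime>'''


def find_aids(xml_text, term):
    """Return the ids of all <anime> blocks with a title containing term."""
    pattern = re.compile(
        r'<anime aid="(\d+)">'      # capture the id
        r'(?:(?!</anime>).)*?'      # never run past the end of this block
        r'<title[^>]*>[^<]*' + re.escape(term) + r'[^<]*</title>',
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.findall(xml_text)


print(find_aids(SAMPLE, "example"))
print(find_aids(SAMPLE, "title"))
```

The `(?!</anime>)` guard is what keeps a hit in one entry from being credited to the id of an earlier entry.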

Todo : although anidb is in english, thetvdb episode info has various languages so you could choose to override the default english language to fetch episodes data in your desired language.


here's the clean version :
http://pastebin.com/f6d30dc0c

here's a version with additional parsed data examples, not really a commented version but you can see what's done with what :
http://pastebin.com/f566f3847

(on pastebin, don't copy/paste, use the "download" link above the code displayed)

let me know if it works for you or if you have spotted some bugs, i may have a few moments to correct them but i can't make any promises.

The anidb scraper icon i made should still be present in a previous post in this thread.
#43
Looks like eldon really has no time to update his scraper, so i started to make some small tweaks myself. Here is the result; it's a bit more tweaked than i expected at the start ;)
First i must apologize for the wall of text i produced :D and for my english, hope it's somehow understandable.

Features:
Because this scraper is based on eldon's version, you should read his post above first; almost all of the information applies here too and i will describe (almost) only the changed behaviour.

1/ Searching for anime
- Google search - almost the same as in eldon's version, and i don't intend to tweak it more because i don't use it
- Anidb.xml search - my personal favorite. Because i use AniDB utilities to rename and add anime to MyList i have an almost 100% hit rate (using the main anime name as directory); it only misses when i mess up the directory naming.
-- commented out the original download of anidb.xml because it never worked for me (it downloaded only ~0.5MB of the 2.3MB file - don't know if it's some curl setting or an anidb.net limitation), so i only use a cached file which i download manually from anidb.net (current version included at the end of the post). Personally i set the file timestamp of anidb.xml 2 months forward so it doesn't expire so soon in the cache, and when it finally expires, the XBMC scraper error message warns me that i should update.
-- added some name filtering because <cleanstrings> from advancedsettings.xml doesn't work correctly

2/ Processing anime details
- added fallback to the temporary rating in case the permanent one isn't present (in case the series isn't finished)
- added filtering out of some generic (Asia, Japan, Game, Novel) or annoying (Sudden Girlfriend Appearance, Boing) anime genres
- added filtering of the plot summary (unfortunately there is no pattern to how plot summaries are written on anidb.net, so it will never be 100% :-S)
-- removing '*' from the start
-- removing empty lines (no need to have them in the cramped space for the plot summary in XBMC) - sometimes doesn't work for some reason (TODO)
-- removing duplicate whitespace characters (spaces, tabs, etc)
-- removing http links to other anidb.net articles (e.g. http://anidb.net/ch4051 [Chika] --> Chika)
-- removing the source of the summary information (like [Source: ANN] or (taken from Animenfo))
- added filtering down to only a single studio (possible to turn off in settings). I'm using a skin (Alaska) which shows studio logos as part of the media flags, so i need only one studio in the database to make it work. Additionally this feature prefers studios for which i have a logo (posted my anime studio logos collection here)
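
The plot summary cleanup steps can be sketched in Python like this. The scraper itself does this with regexps in the XBMC scraper xml, so the function name and the exact patterns below are only assumptions for illustration:

```python
import re


def clean_plot(text):
    """Apply the cleanup steps listed above to a raw plot summary."""
    # "http://anidb.net/ch4051 [Chika]" -> "Chika"
    text = re.sub(r"https?://anidb\.net/\S+\s+\[([^\]]+)\]", r"\1", text)
    # drop source notes such as "[Source: ANN]" or "(taken from Animenfo)"
    text = re.sub(r"\[Source:[^\]]*\]|\(taken from[^)]*\)", "", text,
                  flags=re.IGNORECASE)
    lines = []
    for line in text.splitlines():
        line = re.sub(r"^\s*\*\s*", "", line)     # leading '*'
        line = re.sub(r"\s+", " ", line).strip()  # duplicate whitespace
        if line:                                  # skip empty lines
            lines.append(line)
    return "\n".join(lines)


raw = ("* Based on the manga.\n\n\n"
       "http://anidb.net/ch4051 [Chika]   meets\ta friend. [Source: ANN]")
print(clean_plot(raw))
```

Links are rewritten before the source notes are dropped, so the bracketed name after a link is kept and only real source tags are removed.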

3/ How Fanart & Posters are looked up (generally the thetvdb.com lookup; it's used for extra episode details later too)
- first thetvdb.com is searched for the following names (in this order):
-- anime main name
-- anime x-jat synonym name
-- anime english official name
-- anime english synonym name
- if no match is found the scraper tries to find the anime's prequel, primarily by the "Prequel" link, secondarily (if no prequel link exists) by the "Parent Story" link (the link type can be changed in settings)
- if a prequel is found the scraper returns to the first step of this cycle (searching thetvdb.com for the prequel's names)
- it ends when a match is found on thetvdb.com or there is no further prequel (at the series "root" anime)

Why this complicated recursive algorithm? It's because of the different approaches anidb.net and thetvdb.com take to anime series. Generally anime are not grouped into series; if there is a sequel to some anime it has a different name (like Hayate no Gotoku! and Hayate no Gotoku!! ;)). Anidb.net follows this style and has a unique entry for each anime. Thetvdb.com, on the other hand, follows the more western style of tv series with seasons, so sequels are added as new seasons of the first anime in the row. The result is that the scraper basically needs to find that first anime in the row, because it holds the fanart, thumbs and banners for all seasons. You will indeed get the same pictures for all sequels/seasons, but that is how thetvdb.com works; you can choose the right one from XBMC because you will get a list of all of them.
Of course sometimes this process fails horribly and i have some ideas how to improve it (meaning rewrite it completely), but right now the results are satisfying and i don't have much time to spare.... :)
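
The lookup cycle from section 3/ can be sketched in Python as below. It is written as a loop instead of recursion, the two dictionaries are made-up stand-ins for anidb.net and thetvdb.com, and the `seen` guard against circular links is my own addition, not something the scraper is known to do:

```python
# Hypothetical in-memory stand-ins for the two sites.
ANIDB = {
    # aid: (names in lookup order: main, x-jat synonym, english official,
    #       english synonym; then the prequel aid or None)
    2: (["Show!!", "Show Season 2"], 1),
    1: (["Show!", "Show"], None),
    3: (["Unlisted OVA"], None),
}
TVDB_SERIES = {"Show": 9001}  # series name -> made-up thetvdb series id


def find_tvdb_series(aid):
    """Walk prequel links until a name matches; return (series_id, hops)."""
    hops = 0
    seen = set()
    while aid is not None and aid not in seen:
        seen.add(aid)
        names, prequel = ANIDB[aid]
        for name in names:
            if name in TVDB_SERIES:
                return TVDB_SERIES[name], hops
        aid = prequel  # nothing matched: retry with the prequel's names
        hops += 1
    return None, hops  # reached the root anime without any match


print(find_tvdb_series(2))
print(find_tvdb_series(3))
```

The number of hops is returned alongside the match because it says how far the starting anime is from the entry that was found.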

4/ Processing anime episodes details (from anidb.net)
- added filling runtime from anime details - anidb.net has no per-episode runtime, but most episodes of one anime have the same runtime
- added filling director from anime details - right now it fills in the first director the scraper finds; filling the episode-specific director is in the TODO
- the rest is the same as in eldon's original scraper

5/ Processing anime episodes extra details (from thetvdb.com)
- because anidb.net doesn't contain a plot summary for every episode, eldon added a thetvdb.com episode lookup to his scraper to fill in the plot summary and episode thumb. I only tweaked the search for a specific episode because it sometimes misfired on multi-season anime.
-- what is new here is the way the anime is searched on thetvdb.com: it uses exactly the same recursive algorithm as in section 3/ -> the advantage is that the scraper will end up searching the same thetvdb.com entry as for fanart; the disadvantage is that it can be pretty slow, because the whole recursive lookup is done again for each episode ;). Luckily all results from the fanart lookup are cached, so everything will be done "locally".
-- unfortunately, unlike the fanart lookup, where the scraper doesn't need to care about the season (because pictures aren't divided by season), for episodes the scraper must find the right season. Basically the scraper counts +1 for each prequel lookup, e.g. if the anime is found on thetvdb.com it's considered season 1, if the anime's prequel is found it's considered season 2, if the prequel of the prequel is found it's considered season 3 ..... Sometimes it works fine, sometimes not (mostly in cases where the relations graph is too complicated or some OVA/OAVs are mixed in); live with it.
-- as stated above OVA/OAVs are a real PITA because they aren't treated consistently on thetvdb.com; sometimes they are entered as a regular season, sometimes as specials. So there is a setting where extra details can be turned off for OVA/OAVs.
-- added presets for season and episode offset for the extra episode details lookup (see settings). This is a workaround for the issues above (mismatched seasons and OVA/OAVs mapped to specials). When the season preset is used the scraper will use this "hardcoded" value instead of the computed one. Additionally, for specials it is possible to set an episode offset in case more than one OVA/OAV is listed in specials.
-- unfortunately i'm still not able to teach regular expressions and the scraper engine how to foretell :(, so it requires some checking of extra detail matches and/or manual work with the preset settings. If you don't want to be bothered with manual work simply turn extra details off in settings; you will lose the episode plot summary (if filled in on thetvdb.com at all), but at least for me it's no big deal because i check it only rarely. Thumbs will be generated from your video files by XBMC (if enabled).
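
The season counting and the presets from section 5/ boil down to something like this Python sketch. The function and parameter names are hypothetical, and the number of prequel lookups is assumed to be known already:

```python
def season_for(prequel_hops, use_preset=False, preset_season=1):
    """Season on thetvdb.com: preset if enabled, else 1 + prequel lookups."""
    if use_preset:
        return preset_season
    return prequel_hops + 1


def tvdb_episode(anidb_episode, season, episode_offset=0):
    """Episode number to ask thetvdb.com for; offset applies to specials only."""
    if season == 0:  # season 0 = specials on thetvdb.com
        return anidb_episode + episode_offset
    return anidb_episode


# found directly -> season 1; found via prequel of prequel -> season 3
print(season_for(0), season_for(2))
# an OVA forced into specials with an offset of 2
season = season_for(0, use_preset=True, preset_season=0)
print(season, tvdb_episode(1, season, episode_offset=2))
```

Keeping the preset as a plain override is what makes it usable as a workaround when the computed season is wrong.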


Settings:
- Use Google Search ... disabled by default, allows using Google as the search engine instead of searching in anidb.xml
- Enable anidb.net prequel lookup ... enabled by default, self-explanatory, see section 3/
- Alternative anidb.net prequel link type ... "Parent Story" as default, allows selecting an alternative prequel link type from "Parent Story", "Alternative Setting" and "Side Story"
- Enable only single Animation studio return ... enabled by default, self-explanatory, see section 2/
- Enable thetvdb.org fanart/posters ... enabled by default, self-explanatory
- Enable thetvdb.org banners ... disabled by default, banners are wide posters and i added this setting because i don't use them in XBMC
- Enable thetvdb.org extra episode details ... enabled by default, self-explanatory, see section 5/
- Enable thetvdb.org extra episode details for OVA/OAVs ... disabled by default, tells the scraper whether it should look up extra details on thetvdb.com for OVA/OAV episodes, see section 5/
- Enable presets for thetvdb.org extra episode details ... disabled by default, enables the following two settings
- Preset season number ... 1 as default, see description in section 5/
- Preset episode number offset ... +0 as default, see description in section 5/, enabled only when Preset season number is set to 0 (specials on thetvdb.com)


TODO:
- cleanup ... i started by modifying eldon's original scraper, so i'm still passing function parameters via the actual download urls, which isn't clean. I should use clearbuffers="no" instead.
- some tweaks in plot summary cleaning
- some tweaks to multiple directors handling
- add the possibility to look up a thetvdb.com episode by name; the current solution based on episode number lookup doesn't work correctly with a single (long running) anime divided into multiple seasons on thetvdb.com (e.g. One Piece)


Scraper (use Download link in upper right corner):
http://pastebin.com/749rzV3R

Current anidb.xml (Created: Tue Apr 6 02:00:18 2010 (5773 anime, 31702 titles)):
http://www.megaupload.com/?d=0TSMZMX1


Let me know if you find some bugs.
#44
Great update bambi73, I have tested it a little bit and it works amazing for me with the tests that I have done using the google matching. I tried to get the Anidb.xml search working but it keeps deleting my anidb.xml file whenever I go to search.

I am using a Linux install and from what I can tell it needs to go into the .xbmc\temp\scrapers directory in the home folder. That is where the uncommented one downloaded it (only got .5MB also) when I tested that out. I am not sure if its a Linux thing or one of my settings but it seems to clear out the temp scraper folder on every search. Setting the timestamp on the anidb.xml 2 months ahead didn't seem to help.

I also wanted to ask what you are using in your advancedsettings to match to your renamed files. Right now I am renaming my files (using AniDBs AOM) with s01e in front of the episode number because <regexp>[/\._ \-]([0-9]+)</regexp> was giving me wrong matches or too many of them.

Thank you for the updated scraper and any help.
#45
GrEn Wrote:Great update bambi73, I have tested it a little bit and it works amazing for me with the tests that I have done using the google matching. I tried to get the Anidb.xml search working but it keeps deleting my anidb.xml file whenever I go to search.
Thanks :)

GrEn Wrote:I am using a Linux install and from what I can tell it needs to go into the .xbmc\temp\scrapers directory in the home folder. That is where the uncommented one downloaded it (only got .5MB also) when I tested that out. I am not sure if its a Linux thing or one of my settings but it seems to clear out the temp scraper folder on every search. Setting the timestamp on the anidb.xml 2 months ahead didn't seem to help.
What version/revision are you using? It's important because the cachePersistence parameter was added in relatively recent builds (01/2010 according to eldon's post above); if you have an older build your scraper cache dir will be deleted before each scrape attempt (don't try to enable the anidb.xml download in this case because you will easily get banned from anidb).

GrEn Wrote:I also wanted to ask what you are using in your advancedsettings to match to your renamed files. Right now I am renaming my files (using AniDBs AOM) with s01e in front of the episode number because <regexp>[/\._ \-]([0-9]+)</regexp> was giving me wrong matches or too many of them.
I'm using naming pattern %ann E%enr - %epn in WebAOE (result is for example Angel Beats E01 - Departure.mkv) and <regexp>(?i)[/\\].*? E(\d{2,3})(never happened)?([^/\\]*)</regexp> regexp in XBMC which forces season 1 for every match.
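
To see what the two expressions pick up, here they are run through Python's `re` module. XBMC uses its own regexp engine and its own rules for mapping groups to season and episode, so this only shows which text gets captured; the file paths are made-up examples:

```python
import re

# The two patterns quoted in this exchange, unchanged.
STRICT = re.compile(r"(?i)[/\\].*? E(\d{2,3})(never happened)?([^/\\]*)")
LOOSE = re.compile(r"[/\._ \-]([0-9]+)")

path = "/anime/Angel Beats/Angel Beats E01 - Departure.mkv"
m = STRICT.search(path)
print(m.group(1))  # the digits right after " E"
print(m.group(3))  # the rest of the file name

# The loose pattern grabs any number after a separator, including
# numbers that belong to the show name rather than the episode.
print(LOOSE.findall("/tv/Show 2 - 05.mkv"))
```

Requiring the literal " E" in front of the digits is what keeps numbers in the title from being read as episode numbers.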