scraper in development

I've started working on a scraper. Even though it's a wiki, it has a fairly static structure that can be scraped most of the time. I've got the basics down (extracting the title/year/narrator and the episode titles/plots, if they exist). I'm running into an issue, though: I'm unsure how the documentaries should be organized, since they sit somewhere between tvshows and movies.

They are like tvshows in that *some* of them come as multi-part series. They are like movies in that a good number are single-part only. What I'm not sure of is how they need to be organized so that, when scanned into XBMC, the single-parters are recognized as single-part documentaries (movies, essentially) and the multi-parters are recognized as a show with episodes. Do I need to separate them into different source directories with different scrapers: a "movie" scraper for the single-part documentaries and a "tvshow" scraper for the multi-part ones? It would be essentially the same scraper, so that seems like a bad hack.

The next problem relates to advancedsettings.xml. When creating the regexp for <tvshowmatching>, it seems it needs to detect both the season and the episode from each name. The main problem is that documentaries don't have a season; for simplicity's sake my scraper currently outputs all episodes as part of season 1. Documentaries are usually named Name.XXofYY.EpisodeTitle.quality.ripgroup.avi. How can I recognize 2of3 or 5of9 as season 1 episode 2, or season 1 episode 5, based on those file names? It's escaping me because I need to capture a 1 for the season, but there is not reliably a "1" present in the names.
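To make the matching problem concrete, here is a minimal Python sketch of the logic, run outside XBMC. It is not a <tvshowmatching> expression and says nothing about what XBMC itself accepts. The pattern, the `parse_part` helper and the `DEFAULT_SEASON` constant are all hypothetical names introduced for illustration; the only thing taken from the post is the Name.XXofYY.EpisodeTitle.quality.ripgroup.avi convention. The point is that only the episode number is captured, and the season is supplied as a constant rather than matched.

```python
import re

# Hypothetical pattern for names like Name.2of3.EpisodeTitle.quality.ripgroup.avi.
# Only the part number (XX) and the part count (YY) are captured; no season is
# present in the name, so none is matched.
PART_RE = re.compile(
    r"^(?P<name>.+?)\.(?P<part>\d{1,2})of(?P<total>\d{1,2})\.(?P<rest>.+)\.(?P<ext>[^.]+)$",
    re.IGNORECASE,
)

# Assumption from the post: every multi-part documentary is filed under season 1.
DEFAULT_SEASON = 1


def parse_part(filename):
    """Return name/season/episode/total for an XXofYY file name, or None."""
    match = PART_RE.match(filename)
    if match is None:
        # No XXofYY marker: treat the file as a single-part documentary.
        return None
    return {
        "name": match.group("name"),
        "season": DEFAULT_SEASON,
        "episode": int(match.group("part")),
        "total": int(match.group("total")),
    }
```

A file name without an XXofYY marker returns None here, which is one possible way to tell single-part documentaries from multi-part ones.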

Basically, it seems documentaries don't fit in very well as either movies or tv shows; any pointers on getting this done would be much appreciated. Or should I be submitting a trac ticket for a third type of video file, so that there are movies, tvshows, and the new documentaries? I'm not particularly excited to submit a trac ticket for this, though, because I imagine it could be a few months before anything solid happens (if ever) with respect to a new type of video file.

Another thing I forgot to mention regarding filenames: documentaries cataloged at docuwiki use the following naming convention for extras:

Name.Extras.1of2.title.avi (or sometimes Name.Extras.1.title.avi)
Name.Extras.2of2.title.avi (or sometimes Name.Extras.2.title.avi)

This also creates complications in matching if I try to use 1of3 as season 1 ep 1, because the extras get seen as being part of season 1 as well. I'm starting to wonder: is there any way I can reference episodes in the scraper by title instead of by episode number? Every documentary I've looked at on docuwiki follows that same naming format (the filename is available on docuwiki), always including the title of the episode. In most cases docuwiki doesn't even show episode numbers with the episode descriptions, just the title of the episode and then the plot, followed by the next episode.
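A rough Python sketch of both ideas, again run outside XBMC and not tied to any real scraper API: spot the extras by their ".Extras." marker so they can be kept out of season 1, and look up an episode by finding its scraped title inside the file name. `EXTRA_RE`, `normalize`, `is_extra` and `match_by_title` are hypothetical helpers, and the episode titles in the usage note are placeholders rather than data pulled from docuwiki.

```python
import re

# Hypothetical pattern for Name.Extras.1of2.title.avi and Name.Extras.1.title.avi.
EXTRA_RE = re.compile(
    r"^(?P<name>.+?)\.Extras\.(?P<part>\d{1,2})(?:of(?P<total>\d{1,2}))?\.(?P<rest>.+)\.[^.]+$",
    re.IGNORECASE,
)


def normalize(text):
    """Lower-case the text and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_extra(filename):
    """True if the file name follows the Extras naming convention."""
    return EXTRA_RE.match(filename) is not None


def match_by_title(filename, episode_titles):
    """Return the 1-based position of the scraped title found in the file name.

    When several titles occur in the name, the longest one wins.
    Returns None when no title is found.
    """
    haystack = " " + normalize(filename) + " "
    best = None  # (position, length of matched title)
    for position, title in enumerate(episode_titles, start=1):
        needle = normalize(title)
        if needle and f" {needle} " in haystack:
            if best is None or len(needle) > best[1]:
                best = (position, len(needle))
    return best[0] if best else None
```

For example, with placeholder titles `["Beginnings", "Conquest", "Dynasty"]`, a file called `A.History.of.Britain.2of3.Conquest.XviD.GRP.avi` would resolve to position 2 without relying on the 2of3 marker at all. Whole-word matching is a design choice to avoid short titles matching inside longer words; it would still need care when a title also appears in the series name.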

For an example page, look at A History of Britain or Battleplan. Unfortunately they aren't exactly the same: Battleplan uses episode numbers in the episode names, and A History of Britain (along with most of the site) does not.


sorry for the late reply this somehow went past my attention.

to support this properly we need to define some semantics and a new content type for documentaries. from what i can see we need added logic for
1) support 'movie' and 'tvshow' documentaries. this should be quite easy to add, it would just be some flags in the returned xml from GetSearchResults.
2) some additional rules for extras-naming. i guess this should be done in a general way to support movies and tvshows as well.
3) possibly support for scraping episodes by filenames - this one we want to avoid if possible (but certainly doable if we deem it necessary).

but in any case, we will have to return to this after atlantis. nag me then :)
I'm very interested in this and had thought of making my own. Any update on this?
2008... guess this never happened?
I specifically registered to try and add a bit of weight to the request for a docuwiki scraper as my collection grows and grows but it's a pain in the behind to catalogue them...