2009-12-20, 07:10
I can't, for the life of me, figure out what's wrong with my scraper, though my guess is that I'm not using the "<url>" tag properly. With that in mind, could I ask for a smidgen of help? I've been working on it with GVim and ScraperXML Editor, and here's what I've found so far:
- I've tested manually downloaded copies of both the search results page and an individual result's info page. In both cases, ScraperXML Editor gives me a thumbs up; the results are parsed correctly.
- AFAIK, I can't test in ScraperXML Editor from scratch (i.e. starting from a search query). I have version 1.5.
- The old Scraper.exe program from way back when doesn't work with gzipped HTML. Is there an option to test it with downloaded HTML files?
Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper name="AniDB.net" date="2009-11-15" content="tvshows" framework="1.0" thumb="anidb.jpg" language="">
<NfoUrl dest="3">
<RegExp input="$$1" output="\1" dest="3">
<expression></expression>
</RegExp>
</NfoUrl>
<CreateSearchUrl dest="3">
<RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl?show=animelist&amp;adb.search=\1&amp;do.search=search&lt;/url&gt;" dest="3">
<expression noclean="1"></expression>
</RegExp>
</CreateSearchUrl>
<GetSearchResults dest="8">
<RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
<RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
<expression repeat="yes" noclean="1">&lt;a href="(animedb.pl\?show=anime&amp;amp;aid=([0-9]*))"&gt;([^&lt;]*)&lt;/a&gt;</expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
</GetSearchResults>
<GetDetails dest="3">
<RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
<RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="8">
<expression repeat="yes">&lt;th class="field"&gt;Main Title&lt;/th&gt;.....&lt;td class="value"&gt;(.[^\n]*)</expression>
</RegExp>
<RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="8+">
<expression noclean="1" trim="1">&lt;th class="field"&gt;Year&lt;/th&gt;.[^&gt;]*&gt;([^&lt;]*)|$</expression>
</RegExp>
<RegExp input="$$1" output="&lt;thumb&gt;&lt;url spoof=&quot;http://anidb.net&quot;&gt;http://animenfo.com/\1&lt;/url&gt;&lt;/thumb&gt;" dest="8+">
<expression>&lt;div class="image".[^"]*"(http.[^"]*)</expression>
</RegExp>
<RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
<expression>animevotes&amp;amp;aid=[0-9]*"&gt;(.[^&lt;]*)</expression>
</RegExp>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
</GetDetails>
</scraper>
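For anyone who wants to reproduce the offline check without Scraper.exe, here is a minimal Python sketch (not part of the scraper, and not how XBMC itself runs it). It gunzips a saved page if needed and then applies the search-result expression from GetSearchResults, building (title, url) pairs the same way the `<entity>` output template does. The function names and the sample HTML are made up for illustration; the sample is not real AniDB output.

```python
import gzip
import re

# Search-result expression copied from GetSearchResults, as it would match the
# raw page source (so "&amp;" is literal text in the HTML).
RESULT_RE = re.compile(
    r'<a href="(animedb.pl\?show=anime&amp;aid=([0-9]*))">([^<]*)</a>'
)


def decode_page(raw):
    """Return the page as text, gunzipping first if the bytes start with the gzip magic number."""
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8", errors="replace")


def parse_results(html):
    """Return (title, url) pairs: \\3 becomes the title, \\1 is appended to the site root."""
    return [
        (m.group(3), "http://anidb.net/" + m.group(1))
        for m in RESULT_RE.finditer(html)
    ]


# Hypothetical sample row, only here to exercise the pattern.
SAMPLE = b'<td><a href="animedb.pl?show=anime&amp;aid=123">Example Show</a></td>'

if __name__ == "__main__":
    # The same page, once as plain bytes and once gzipped, should parse identically.
    for raw in (SAMPLE, gzip.compress(SAMPLE)):
        print(parse_results(decode_page(raw)))
```

To try it on a real download, read the saved file in binary mode (`open(path, "rb").read()`) and pass the bytes to `decode_page`. If the list comes back empty for a page that ScraperXML Editor parses fine, the saved file and the page fetched at runtime probably differ.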