2010-02-06, 23:50
spiff Wrote: no reason to escape the /'es.
Yeah, I noticed I did that. I use rubular (rubular.com - great tool btw) to test my regexes, and it requires escaping the slashes; I guess I forgot to remove some. Anyhoo, removing them made no difference. I got it working though, to some degree.
I ended up doing this:
Code:
<?xml version="1.0" encoding="utf-8"?><scraper framework="1,1" date="2010-02-04" name="Lovefilm.se" content="movies" thumb="lovefilm.png" language="sv">
<CreateSearchUrl dest="4">
<RegExp input="$$1" output="<url>http://www.google.com/search?q=intitle:\1+site:lovefilm.se/film&num=100</url>" dest="4">
<expression></expression>
</RegExp>
</CreateSearchUrl>
<GetSearchResults dest="6">
<RegExp input="$$5" output="<results>\1</results>" dest="6">
<RegExp input="$$1" output="<entity><title>\2</title><url>\1</url></entity>" dest="5">
<expression repeat="yes" clear="yes">a href="(http://www.lovefilm.se/film/[^\.]*.do)".[^>]*>(.[^H]*) Hyr</expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
</GetSearchResults>
<GetDetails dest="8">
<RegExp input="$$7" output="<details>\1</details>" dest="8">
<RegExp input="$$1" output="<title>\1</title><year>\2</year>" dest="7">
<expression trim="1"><h1>.[^>]*>(.[^<]*)</span>.[^0-9]*([0-9]+)\)</expression>
</RegExp>
<RegExp input="$$1" output="<rating>\1</rating><votes>\2</votes>" dest="7+">
<expression trim="1">\(([0-9],[0-9])\) \(([0-9]+) r.ster\)</p></expression>
</RegExp>
<RegExp input="$$1" output="<originaltitle>\1</originaltitle>" dest="7+">
<expression trim="1">Originaltitel:</div>.[^<]*<div class="mainInfoRowRight">.[^<]*<strong>(.[^<]*)</strong></expression>
</RegExp>
<RegExp input="$$1" output="<director>\1</director>" dest="7+">
<expression trim="1">REGISSÖR</li>.[^<]*<ul>.[^<]*<li>.[^>]*>(.[^<]*)</a></li></expression>
</RegExp>
<RegExp input="$$1" output="<plot>\1</plot>" dest="7+">
<expression trim="1"><div id="description">(.[^<]*)</div></expression>
</RegExp>
<RegExp input="$$1" output="<genre>\1</genre>" dest="7+">
<expression cs="true" trim="1"><li class="header">GENRE</li>.[^<]*<a href="/category/.[^>]*>(.[^<]*)</a></li></expression>
</RegExp>
<RegExp input="$$1" output="<runtime>\1</runtime>" dest="7+">
<expression trim="1"><span>.[^ ]*DVD.[^S]*Speltid:.[^<]*<strong>(.[^\.]*)\.</strong></expression>
</RegExp>
<RegExp input="$$1" output="<thumb>\1</thumb>" dest="7+">
<expression trim="1"><img src="(http://static.lovefilm.se/img/cover/movie/huge/.[^"]*)"</expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
</GetDetails>
<!--Created with ScraperXml Editor, Author: filigran-->
</scraper>
In the page source there are four tabs before the plot text starts, and those show up as squares in the plot. I thought trim="1" would take care of that, but I guess it only removes spaces? Or am I using it wrong?
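For reference, here is a minimal sketch of one possible workaround, written in Python rather than in the scraper itself. The page fragment is made up, and it assumes the scraper's regex engine treats `\s` the way Python's `re` module does (matching tabs, newlines and spaces), which I haven't verified. The idea is to consume the leading whitespace outside the capture group, so it doesn't depend on what trim does:

```python
import re

# Hypothetical page fragment: four tabs precede the plot text, as described above.
html = '<div id="description">\t\t\t\tEn film om något.</div>'

# Same shape as the plot expression in the scraper: the leading "." and "[^<]*"
# both accept tabs, so the tabs end up inside the captured text.
original = re.compile(r'<div id="description">(.[^<]*)</div>')

# Variant: \s* swallows leading whitespace (tabs, newlines, spaces)
# before the capture group starts.
adjusted = re.compile(r'<div id="description">\s*([^<]*)</div>')

plot_original = original.search(html).group(1)
plot_adjusted = adjusted.search(html).group(1)

print(repr(plot_original))  # still starts with the four tabs
print(repr(plot_adjusted))  # plot text without the leading tabs
```

If this carries over, the plot expression in the scraper would only need `\s*` added in front of the capture group. Trailing whitespace before `</div>` would still be captured by this variant.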
Another issue is that, for some reason, it doesn't find all the results I get when searching manually with Google using the same URL. But I'm done trying to fix this. I'll just use the filmdelta.se scraper for now; it works well enough for me.
If you know why it's doing that and how to fix it, I'd be glad to hear what's causing it though. For future reference.
Thanks for your help!