Lovefilm.se (Swedish) scraper - search uses javascript, can that be bypassed?
#11
spiff Wrote:no reason to escape the /'es.

Yeah, I noticed I did that. I use Rubular (rubular.com, a great tool btw) to test my regexes, and it requires escaping the slashes; I guess I forgot to remove some. Anyhoo, removing them made no difference. I got it working though, to some degree.
I ended up doing this:
Code:
<?xml version="1.0" encoding="utf-8"?><scraper framework="1,1" date="2010-02-04" name="Lovefilm.se" content="movies" thumb="lovefilm.png" language="sv">
    <CreateSearchUrl dest="4">
        <RegExp input="$$1" output="&lt;url&gt;http://www.google.com/search?q=intitle:\1+site:lovefilm.se/film&amp;num=100&lt;/url&gt;" dest="4">
            <expression></expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="6">
        <RegExp input="$$5" output="&lt;results&gt;\1&lt;/results&gt;" dest="6">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;url&gt;\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" clear="yes">a href="(http://www.lovefilm.se/film/[^\.]*\.do)".[^&gt;]*&gt;(.[^H]*) Hyr</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="8">
        <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="8">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;&lt;year&gt;\2&lt;/year&gt;" dest="7">
                <expression trim="1">&lt;h1&gt;.[^&gt;]*&gt;(.[^&lt;]*)&lt;/span&gt;.[^0-9]*([0-9]+)\)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;&lt;votes&gt;\2&lt;/votes&gt;" dest="7+">
                <expression trim="1">\(([0-9],[0-9])\) \(([0-9]+) r.ster\)&lt;/p&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;originaltitle&gt;\1&lt;/originaltitle&gt;" dest="7+">
                <expression trim="1">Originaltitel:&lt;/div&gt;.[^&lt;]*&lt;div class="mainInfoRowRight"&gt;.[^&lt;]*&lt;strong&gt;(.[^&lt;]*)&lt;/strong&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;director&gt;\1&lt;/director&gt;" dest="7+">
                <expression trim="1">REGISSÖR&lt;/li&gt;.[^&lt;]*&lt;ul&gt;.[^&lt;]*&lt;li&gt;.[^&gt;]*&gt;(.[^&lt;]*)&lt;/a&gt;&lt;/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="7+">
                <expression trim="1">&lt;div id="description"&gt;(.[^&lt;]*)&lt;/div&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="7+">
                <expression cs="true" trim="1">&lt;li class="header"&gt;GENRE&lt;/li&gt;.[^&lt;]*&lt;a href="/category/.[^&gt;]*&gt;(.[^&lt;]*)&lt;/a&gt;&lt;/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="7+">
                <expression trim="1">&lt;span&gt;.[^ ]*DVD.[^S]*Speltid:.[^&lt;]*&lt;strong&gt;(.[^\.]*)\.&lt;/strong&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="7+">
                <expression trim="1">&lt;img src="(http://static.lovefilm.se/img/cover/movie/huge/.[^"]*)"</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
<!--Created with ScraperXml Editor, Author: filigran-->
</scraper>
With that, I finally got it working, to some degree. I think it was the lazy .*? matching it didn't like; once I replaced that with negated character classes like [^<]*, it started working. It finds results and gets the info I want, but the plot is messed up:
Image
In the page source there are four tabs before the plot text starts, and those end up as squares. I thought trim would take care of that, but I guess it only removes spaces? Or am I using it wrong?
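
One untested idea, for reference: keep the leading whitespace out of the capture group instead of relying on trim. The sketch below is the same plot RegExp as in the scraper above with a \s* added before the group. It assumes the scraper's regex engine understands \s (whitespace including tabs and newlines), which I haven't verified, and any trailing whitespace before the closing div would still end up in the capture.

```xml
<!-- Sketch only: assumes \s is supported by the scraper's regex engine -->
<RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="7+">
    <!-- \s* consumes the tabs/newlines after the opening div, so they never reach \1 -->
    <expression trim="1">&lt;div id="description"&gt;\s*(.[^&lt;]*)&lt;/div&gt;</expression>
</RegExp>
```

If \s turns out not to be supported, the same idea should work with an explicit character class in its place, as long as the engine accepts tab and newline characters inside brackets.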

Another issue is that, for some reason, it doesn't find all the results that I get when searching manually with Google using the same URL. But I'm done trying to fix this. I'll just use the filmdelta.se scraper for now; it works well enough for me.

If you know why it's doing that and how to fix it, I'd be glad to hear what's causing it though. For future reference. :)

Thanks for your help!


Messages In This Thread
[No subject] - by spiff - 2010-01-20, 09:16
[No subject] - by The_Ghost16 - 2010-01-20, 11:06
[No subject] - by filigran - 2010-01-20, 12:43
[No subject] - by filigran - 2010-02-02, 23:06
[No subject] - by mkortstiege - 2010-02-03, 01:15
[No subject] - by filigran - 2010-02-03, 15:21
[No subject] - by spiff - 2010-02-03, 15:30
[No subject] - by filigran - 2010-02-05, 02:12
[No subject] - by spiff - 2010-02-05, 11:18
[No subject] - by filigran - 2010-02-06, 23:50
[No subject] - by jojje - 2010-04-20, 22:28