2012-02-04, 15:37
Hi,
I have some general questions about how scraper functions work. It would be nice if someone could help me.
I tried to analyse the cinefacts scraper to understand scraper functions. Here is the interesting part of it:
Code:
<NfoUrl dest="3">
  <RegExp input="$$1" output="&lt;url&gt;http://www.cinefacts.de/kino/\1/\2/filmdetails.html&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;" dest="3">
    <expression clear="yes" noclean="1">(cinefacts.de/kino/)([0-9]*)/(.[^\/]*)/filmdetails.html</expression>
  </RegExp>
  <RegExp input="$$1" output="&lt;details&gt;&lt;url cache=&quot;tt\2&quot; function=&quot;GetByIMDBId&quot;&gt;http://www.imdb.com/title/tt\2/externalreviews&lt;/url&gt;&lt;id&gt;tt\2&lt;/id&gt;&lt;/details&gt;" dest="3+">
    <expression>(imdb.com/title/tt)([0-9]*)</expression>
  </RegExp>
  <RegExp input="$$1" output="&lt;details&gt;&lt;url cache=&quot;tt\2&quot; function=&quot;GetByIMDBId&quot;&gt;http://www.imdb.com/title/tt\2/externalreviews&lt;/url&gt;&lt;id&gt;tt\2&lt;/id&gt;&lt;/details&gt;" dest="3+">
    <expression>(imdb.com/)Title\?([0-9]+)</expression>
  </RegExp>
</NfoUrl>
<GetByIMDBId dest="3">
  <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetCinefactsDetailsLink&quot;&gt;http://www.cinefacts.de/kino/\1&lt;/url&gt;&lt;id&gt;$$2&lt;/id&gt;&lt;/details&gt;" dest="3+">
    <expression noclean="1">&lt;a href="http://www.cinefacts.de/kino/([^"]*)"</expression>
  </RegExp>
</GetByIMDBId>
<GetCinefactsDetailsLink dest="3">
  <RegExp input="$$1" output="&lt;url&gt;http://www.cinefacts.de\1&lt;/url&gt;&lt;id&gt;$$2&lt;/id&gt;" dest="3+">
    <expression>&lt;a href="([^"]*)"&gt;Filmdetails&lt;/a&gt;</expression>
  </RegExp>
</GetCinefactsDetailsLink>
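So the 3rd RegExp inside NfoUrl would output something like this:
Code:
<details><url cache="tt\2" function="GetByIMDBId">http://www.imdb.com/title/tt\2/externalreviews</url><id>tt\2</id></details>
So the GetByIMDBId function is called, which itself calls the GetCinefactsDetailsLink function.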
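To make the substitution step concrete, here is a rough Python sketch of how I understand a single RegExp: the expression is matched against the input and the captured groups are filled into the output template. This is not XBMC code; the helper name apply_regexp and the sample NFO line are made up for illustration, and I am assuming the backreferences behave like ordinary regex group references.

```python
import re

# Output template and expression taken from the 3rd RegExp in NfoUrl.
OUTPUT = (r'<details><url cache="tt\2" function="GetByIMDBId">'
          r'http://www.imdb.com/title/tt\2/externalreviews</url>'
          r'<id>tt\2</id></details>')
EXPRESSION = r'(imdb.com/)Title\?([0-9]+)'


def apply_regexp(expression, output, text):
    """Fill the captured groups into the output template ('' if no match)."""
    match = re.search(expression, text)
    return match.expand(output) if match else ''


# Hypothetical NFO content, only for illustration.
nfo = 'http://www.imdb.com/Title?0133093'
result = apply_regexp(EXPRESSION, OUTPUT, nfo)
print(result)
```

With that sample line, every \2 in the template is replaced by the captured number, so the url element points at the externalreviews page for that ID and still carries the function="GetByIMDBId" attribute.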
My first question is: how is the output of a function inserted into the "parent output"?
Example (GetByIMDBId calls GetCinefactsDetailsLink):
I could imagine that the url tag in GetByIMDBId is replaced with the output of GetCinefactsDetailsLink.
Then the output would be something like:
Quote:
<details><url>http://www.cinefacts.de\1</url><id>$$2</id><id>$$2</id></details>
But what happens with the output of GetByIMDBId? The 3rd RegExp in NfoUrl outputs a <details> tag, and GetByIMDBId outputs a <details> tag, too.
So what would the output look like? "<details><details><url>..."?
Or must the <url function=""> tag reside inside another element (here, <details>) that is then replaced with the output of the function call?
In that case, NfoUrl would output something like "<url></url><id></id><url></url><id></id><url></url><id></id>".
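To spell out the two readings I have in mind, here is a small Python sketch. It only illustrates the two guesses from my question and says nothing about what XBMC really does. The function names and the sample URLs and IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical outputs with sample values already filled in.
parent = ('<details><url function="GetCinefactsDetailsLink">'
          'http://www.cinefacts.de/kino/1234</url>'
          '<id>tt0133093</id></details>')
child = ('<url>http://www.cinefacts.de/kino/1234/example/filmdetails.html</url>'
         '<id>tt0133093</id>')


def parse_fragment(fragment):
    """Parse a fragment that may have several top-level elements."""
    return list(ET.fromstring('<root>' + fragment + '</root>'))


def replace_url_tag(parent_xml, child_fragment):
    """Reading A: only the <url function="..."> element is swapped out."""
    root = ET.fromstring(parent_xml)
    for index, element in enumerate(list(root)):
        if element.tag == 'url' and element.get('function'):
            root.remove(element)
            for offset, new in enumerate(parse_fragment(child_fragment)):
                root.insert(index + offset, new)
            break
    return ET.tostring(root, encoding='unicode')


def replace_enclosing_element(parent_xml, child_fragment):
    """Reading B: the element around the <url function="..."> is swapped out."""
    root = ET.fromstring(parent_xml)
    has_call = any(e.tag == 'url' and e.get('function') for e in root)
    return child_fragment if has_call else parent_xml


reading_a = replace_url_tag(parent, child)
reading_b = replace_enclosing_element(parent, child)
print(reading_a)
print(reading_b)
```

Reading A keeps the surrounding <details> and ends up with two <id> elements, as in my quote above. Reading B drops the <details> wrapper and leaves only the <url> and <id> pair.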
Best Regards,
MKay