HOW-TO write Media Info Scrapers - Scraper creation for dummies
#1
This is the first "chapter" (the only one so far) of a "course" on scraper creation that I am writing as I learn how to make scrapers. My intention is to add it to the wiki when it is finished. Please give me your opinion about it:

Scraper creation for dummies

Chapter one

First, some very important reference information. There is no need to read it right now, but keep the URLs on hand...

Introduction to scraper creation: http://wiki.xbmc.org/?title=How_To_Write...o_Scrapers
Reference to scraper structure: http://wiki.xbmc.org/?title=Scraper.xml
Tool to test scrapers: http://wiki.xbmc.org/?title=Scrap (Download NOW both files, scrap.exe & libcurl.dll)
Some info about regular expressions: http://wiki.xbmc.org/?title=Regular_Expr...9_Tutorial
More info on regular expressions from wikipedia: http://en.wikipedia.org/wiki/Regex

I. How a scraper works

In a nutshell:

1) If there is movie.nfo, use it (section NfoUrl) and then go to step 5
2) Otherwise, generate a search URL from the file's name (section CreateSearchUrl) and get the results page
3) With the results, generate a listing (section GetSearchResults) that has, for each "candidate" movie, a user-friendly name and one (or more) associated URLs
4) Show the listing to the user, who chooses a movie and thereby selects the associated URL(s)
5) Get the URL's content and extract from it (section GetDetails) the appropriate data for the movie to store in videodb

Each of those four sections is built from RegExp entries with this structure:

Code:
<RegExp input="INPUT" output="OUTPUT" dest="DEST">
    <expression>EXPRESSION</expression>
</RegExp>

INPUT is usually the content of a buffer (in a moment we will see what that is)
OUTPUT is a string that is built up by the RegExp
DEST is the name of the buffer where OUTPUT will be stored
EXPRESSION is a regular expression that is applied to INPUT to extract pieces of information as "fields". If EXPRESSION is empty, a field "1" is created automatically and contains the whole INPUT

Here a "buffer" is just a memory section that is used for communication between each section and the rest of XBMC. The buffers are numbered; this chapter uses buffers 1 to 9. To express the *content* of a buffer you use "$$n", where n is the number of the buffer.

The fields are extracted from the input by enclosing patterns in EXPRESSION between "(" and ")". They are numbered sequentially: the first one is \1, the second \2, and so on up to a maximum of 9.

A very easy example:

Code:
<RegExp input="$$1" output="\1" dest="3">
         <expression></expression>
      </RegExp>

- As input the content of buffer 1 is used
- The output will be stored in buffer 3
- As the expression is empty, all the input ($$1) will be stored in field \1
- As the output is simply \1, all its content will be used for the output, that is, $$1

So, the end result will be that the content of buffer 1 is stored in buffer 3

If you do not know anything about regular expressions, this is the moment to make a quick study of their principles using the references above.

Another example. This time we use a string as input and a very simple regular expression to select part of it:

Code:
<RegExp input="Movie: The Dark Knight" output="The title is \1" dest="3">
         <expression>Movie: (.*)</expression>
      </RegExp>

There, when we apply the expression to the input, the selected pattern (.*) becomes field 1; in this case it gets assigned "The Dark Knight". The output will therefore be "The title is The Dark Knight" and will be stored in buffer 3.
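To experiment with these two examples outside XBMC, here is a minimal sketch in Python. It only imitates the behaviour described above; it is not how XBMC implements RegExp entries, and the names apply_regexp and buffers are made up for the illustration.

```python
import re

def apply_regexp(input_text, expression, output):
    """Imitate one <RegExp>: run expression on input_text, fill \\1..\\9 in output."""
    if expression == "":
        # An empty expression puts the whole input into field 1.
        fields = [input_text]
    else:
        match = re.search(expression, input_text)
        if match is None:
            return None  # no match, nothing is produced
        fields = list(match.groups())

    def fill(reference):
        # Replace \n by field n; unknown or unmatched fields become empty.
        index = int(reference.group(1)) - 1
        return (fields[index] or "") if index < len(fields) else ""

    return re.sub(r"\\([1-9])", fill, output)

# Hypothetical buffer table: buffer number -> content.
buffers = {1: "Movie: The Dark Knight"}
buffers[3] = apply_regexp(buffers[1], r"Movie: (.*)", r"The title is \1")
print(buffers[3])

# The "very easy example": empty expression, output \1 copies the input.
copied = apply_regexp(buffers[1], "", r"\1")
print(copied)
```

The second call shows why an empty expression with output \1 is simply a copy from one buffer to another.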

II. The most important sections in a scraper

Now, let's have a look at the 3 "important" sections: CreateSearchUrl, GetSearchResults and GetDetails. First, there is some basic information about them that we need to know.

CreateSearchUrl must generate (into buffer 3) the URL that will be used to get the listing of possible movies. To do that, you need the name of the file selected to be scraped, which XBMC stores in buffer 1.

GetSearchResults must generate (in buffer 8) the listing of movies (in user-ready form) and their associated URLs. The result of downloading the content of the URL generated by CreateSearchUrl is stored by XBMC in buffer 5. The listing must have this structure:
Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<results>
   <entity>
      <title></title>
      <url></url>
   </entity>
   <entity>
      <title></title>
      <url></url>
   </entity>
</results>

Each <entity> must have a <title> (the text that will be shown to the user) and at least one <url>, although there can be up to 9. You can generate as many <entity> elements as you need; together they become the listing shown to the user to choose from.
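If it helps to see the shape of that listing produced by a program, this is a small Python sketch that builds a <results> document from a list of candidates. The function name build_results and the candidate list are assumptions made for the example; a real scraper produces this text with RegExp entries, not with Python.

```python
import xml.etree.ElementTree as ET

def build_results(candidates):
    """Build the <results> listing from (title, [urls]) pairs."""
    results = ET.Element("results")
    for title, urls in candidates:
        # Each entity needs at least one URL and at most nine.
        if not 1 <= len(urls) <= 9:
            raise ValueError("an entity needs between 1 and 9 URLs")
        entity = ET.SubElement(results, "entity")
        ET.SubElement(entity, "title").text = title
        for url in urls:
            ET.SubElement(entity, "url").text = url
    return ET.tostring(results, encoding="unicode")

listing = build_results([("Dummy", ["http://www.nada.com"])])
print(listing)
```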

Once the user has selected a movie, the associated URL(s) will be downloaded.

Last, GetDetails must generate (in buffer 3) the listing of detailed information about the movie in the correct format, using for that the content of the URL(s) selected from GetSearchResults. The first one will be in $$1, the second in $$2 and so on.

The structure that the listing must have is this:
Code:
<details>
    <title></title>
    <year></year>
    <director></director>
    <top250></top250>
    <mpaa></mpaa>
    <tagline></tagline>
    <runtime></runtime>
    <thumb></thumb>
    <credits></credits>
    <rating></rating>
    <votes></votes>
    <genre></genre>
    <actor>
        <name></name>
        <role></role>
    </actor>
    <outline></outline>
    <plot></plot>
</details>

Notes:
- Some fields can be missing (?)
- <thumb> contains the URL of the image to be downloaded later
- <genre>, <credits>, <director> and <actor> can be repeated as many times as needed

Some important details to remember:

1) When you need to use some special characters in the regular expression, do not forget to "escape" them:
\ -> \\
( -> \(
. -> \.
etc.

2) Since the scraper itself is an XML file, the characters that have a meaning in XML cannot be used directly and must be replaced by their aliases:
& -> &amp;
< -> &lt;
> -> &gt;
" -> &quot;
' -> &apos;

3) If you use non-ASCII characters in your XML to be used in the output (umlauts, ñ, etc), they must be coded as iso-8859-1
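Points 1 and 2 above are two separate layers of escaping: first the regular expression, then the XML. A rough Python sketch of both steps follows, using only the standard library; the helper name xml_attribute_safe is invented for the example, and the exact set of characters that re.escape escapes depends on the Python version.

```python
import re
from xml.sax.saxutils import escape

def xml_attribute_safe(text):
    """Replace the five XML special characters by their aliases."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})

# Step 1: escape the literal part so that "." and "?" lose their regex meaning.
literal = "../art/ver.php?art="
pattern = "<a href='" + re.escape(literal) + "([0-9]*)'"

# Step 2: make the finished pattern safe to paste into the scraper XML.
encoded = xml_attribute_safe(pattern)
print(encoded)

# The pattern still matches the original HTML text.
found = re.search(pattern, "<a href='../art/ver.php?art=29405'>")
print(found.group(1))
```

Doing the two steps in this order (regex first, XML second) avoids escaping the "&" of an alias a second time.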

III. Our first working scraper

Now, with all that information, let's create our first scraper. Just create a "dummy.xml" file with this content, study it a little, it should be fairly easy to understand with what we already know:

Code:
<scraper name="dummy" content="movies" thumb="imdb.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <NfoUrl dest="3">
        <RegExp input="$$1" output="\1" dest="3">
            <expression></expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl dest="3">
          <RegExp input="$$1" output="&lt;url&gt;http://www.nada.com&lt;/url&gt;" dest="3">
             <expression></expression>
          </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp input="$$1" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;&lt;entity&gt;&lt;title&gt;Dummy&lt;/title&gt;&lt;url&gt;http://www.nada.com&lt;/url&gt;&lt;/entity&gt;&lt;/results&gt;" dest="8">
            <expression></expression>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$1" output="&lt;details&gt;&lt;title&gt;The Dummy Movie&lt;/title&gt;&lt;year&gt;2008&lt;/year&gt;&lt;director&gt;Dummy Dumb&lt;/director&gt;&lt;top250&gt;&lt;/top250&gt;&lt;mpaa&gt;&lt;/mpaa&gt;&lt;tagline&gt;Some dumb dummies&lt;/tagline&gt;&lt;runtime&gt;&lt;/runtime&gt;&lt;thumb&gt;&lt;/thumb&gt;&lt;credits&gt;Dummy Dumb&lt;/credits&gt;&lt;rating&gt;&lt;/rating&gt;&lt;votes&gt;&lt;/votes&gt;&lt;genre&gt;&lt;/genre&gt;&lt;actor&gt;&lt;name&gt;Dummy Dumb&lt;/name&gt;&lt;role&gt;The dumb dummy&lt;/role&gt;&lt;/actor&gt;&lt;outline&gt;&lt;/outline&gt;&lt;plot&gt;Some dummies doing dumb things&lt;/plot&gt;&lt;/details&gt;" dest="3">
            <expression></expression>
        </RegExp>
    </GetDetails>
</scraper>

A really stupid scraper with no meaningful use whatsoever: whatever movie is fed to it, it will always generate the same (fake) data. It will also download information from http://www.nada.com and not use it at all. Nevertheless, we have our first working scraper, congratulations!

To test it in Windows, put the files scrap.exe and libcurl.dll (referenced at the beginning of this lesson) and the dummy.xml file in any directory, and then execute, for example:

Code:
scrap dummy.xml "Hello, world"

It should execute without errors and show you each step and its output.

You can also try it in a "real" XBMC, just copy dummy.xml to XBMC\system\scrapers\video, restart XBMC, choose any directory from your sources that contains a video file not incorporated into the library, "set content" of the directory to use "dummy" as scraper and finally select "movie info" over the video file. All our fake data will be incorporated into the video database.
#2
Great stuff! Can you please add this to the XBMC Online Manual (wiki) as well?

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
#3
- you have 20 buffers to play with
- yes you can skip contents of tags (or the whole tags)
- you can have any encoding as long as you tag the scraper using it (remember, the scraper is a xml file so it obeys the <?xml tag), or the returned xml.
#4
Gamester17 Wrote:Great stuff! Can you please add this to the XBMC Online Manual (wiki) as well?

Yes, that is the idea. I posted it here to get comments (like those from spiff, very good information), correct bugs, etc., but I plan to add it to the wiki soon.

spiff Wrote:- you have 20 buffers to play with

Ok, so you use their content using $$1 ... $$20, I suppose.

spiff Wrote:- yes you can skip contents of tags (or the whole tags)

Just like I thought (and I tested with the second scraper). I will modify the dummy.xml to simplify it prior to posting in the wiki.

spiff Wrote:- you can have any encoding as long as you tag the scraper using it (remember, the scraper is a xml file so it obeys the <?xml tag), or the returned xml.

Oh, didn't know that you can change the encoding! I was having problems with the culturalia V2 scraper that is the center of the second chapter... I should investigate this a little more (also, something not related to this, I am having encoding problems with the nfo files created by XBMC-DB tool, it is worth exploring).

I have the second chapter almost ready; in it I rewrite the culturalia scraper step by step. The site is really easy to manage but at the same time allows some interesting treatments, and I am learning a lot. It is not ready yet for two reasons:

- I do not know who did the original culturalia scraper and I would like to include some attribution (also I hope he/she doesn't mind me using his/her work)

- I am not entirely satisfied with the regular expressions used in some fields, they work on all examples I tried but are far from "elegant", and since this will go to the wiki for everyone to see...

My culturalia scraper, like the original, cannot obtain thumbnails since culturalia's web server blocks them. I could probably hack the connections (that may be for chapter four), but instead I plan to implement scraping them from IMDB (that was my original intention when I decided to learn the craft of scraper making). That will be, if I am capable enough, the content of chapter three.
#5
Here is the first version of chapter 2. It is not ready yet; I have some doubts (how does clear="yes" work when dest is something like "8+"?). Also, although the scraper works OK with scrap.exe, when I tried it for real in XBMC it behaved differently: it did not scrape the "genres" field correctly... Somehow scrap.exe and XBMC do not interpret the scraper in the same way.

Chapter two

Now that we know how to create a skeleton scraper, let's re-create a real one. I've chosen a fairly simple one, the one used to scrape the Spanish site culturalia.es (in fact the URL is www.culturalianet.com). First of all, we must know how the site we intend to write the scraper for works.
(Who did the original culturalia scraper? I would like to include a "thank you" here.)

Open http://www.culturalianet.com. To perform a search, type "la noche es nuestra" (the Spanish title of "We Own the Night") in the "buscar:" box at the top of the page. When you press the "Buscar" ("Search") button, the URL opened is:

Code:
http://www.culturalianet.com/bus/resu.php?texto=la+noche+es+nuestra&donde=1

so, very easy, our search URL will be "http://www.culturalianet.com/bus/resu.php?texto=" + (text to search) + "&donde=1"

For example:
Code:
<RegExp input="$$1" output="http://www.culturalianet.com/bus/resu.php?texto=\1&amp;donde=1" dest="3">
    <expression></expression>
</RegExp>

So far, so good; in field 1 goes the input (the name of the movie, already stripped by XBMC of the file extension and some common words like "divx", "ac3" and so on), and to generate the output we just write \1 at the point we need.
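To check the shape of the search URL by hand, this is a minimal Python sketch that builds the same URL from a movie name. The function name create_search_url is an assumption for the example, and the encoding of spaces as "+" is done here by quote_plus, which is just one way to obtain the form seen in the browser.

```python
from urllib.parse import quote_plus

SEARCH_PREFIX = "http://www.culturalianet.com/bus/resu.php?texto="
SEARCH_SUFFIX = "&donde=1"

def create_search_url(movie_name):
    """Prefix + encoded search text + suffix, as in the RegExp output."""
    return SEARCH_PREFIX + quote_plus(movie_name) + SEARCH_SUFFIX

url = create_search_url("la noche es nuestra")
print(url)
```

Note that in the scraper XML the "&" of the suffix has to be written as &amp;, while in the final URL it is a plain "&".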

Now we must understand how the results page is formatted; for that, the "View selection source" function of Firefox is very useful. Just select the end of the listing's header and some of the first entries, then choose "View selection source". This is what I get:

Code:
Se han encontrado 249 artículos. Se muestran del 1 al 25. <a href="resu.php?donde=1&amp;texto=la%20noche%20es%20nuestra&amp;muestro=1">26</a> <a href="resu.php?donde=1&amp;texto=la%20noche%20es%20nuestra&amp;muestro=2">51</a> <a href="resu.php?donde=1&amp;texto=la%20noche%20es%20nuestra&amp;muestro=3">76</a> <a href="resu.php?donde=1&amp;texto=la%20noche%20es%20nuestra&amp;muestro=4">101</a> <a href="resu.php?donde=1&amp;texto=la%20noche%20es%20nuestra&amp;muestro=5">126</a> <a href="resu.php?donde=1&amp;texto=la%20noche%20es%20nuestra&amp;muestro=6">151</a> <a href="resu.php?donde=1&amp;texto=la%20noche%20es%20nuestra&amp;muestro=7">176</a> <a href="resu.php?donde=1&amp;texto=la%20noche%20es%20nuestra&amp;muestro=8">201</a> <a href="resu.php?donde=1&amp;texto=la%20noche%20es%20nuestra&amp;muestro=9">226</a> </td></tr><tr><td><b><a href="../art/ver.php?art=29405" target="_top">Noche es nuestra, La.</a></b></td></tr>
<tr><td colspan="2"><i>We Own the Night</i>. De James Gray (2007)</td></tr><tr><td><b><a href="../art/ver.php?art=23798" target="_top">10 + 2: La noche mágica.</a></b></td></tr>
<tr><td colspan="2"><i>10   2: La noche mágica</i>. De Miquel Pujol Lozano (2000)</td>

See? For each entry we simply need to select the title, maybe some extra information, and then the URL, and repeat that for all the entries in the listing. Fortunately, XBMC offers us some help that we haven't seen yet: the <expression> part of a RegExp can have attributes. In this case, to repeat the application of <expression> to the input as many times as there is data for it, we simply add 'repeat="yes"' as an attribute:

Code:
<expression repeat="yes">

Now let's write the expression. We will extract culturalianet's ID of the article about the movie, the Spanish title, the original title, the name of the director and the year of the movie.

The ID we get from:
Code:
<a href="../art/ver.php?art=29405" target="_top">
is just a string of digits; to select it as a field we surround it with parentheses:
Code:
<a href='../art/ver.php\?art=([0-9]*)' target='_top'>

After that, there is the Spanish title, ending in a dot and followed by </a>, so we select as our second field a string of any length (it must have at least one character) that does not contain "<":
Code:
(.[^<]*)\.</a>

Then there is some formatting and, surrounded by <i> and </i>, the original title (again a string of one or more characters). We jump over the formatting with .[^\(]*<i> (any text without a "(" up to the <i> tag) and select our third field:
Code:
.[^\(]*<i>(.[^<]*)

Then there is </i> and the literal "De " followed by the director's name, our fourth field, which runs up to the year of the movie that appears surrounded by parentheses:
Code:
</i>\. De (.[^\(]*)

and our fifth and last field is the movie year, which follows a space and sits between the characters "(" and ")" (not included in the field):
Code:
 \(([0-9]*)\)

All put together, and exchanging "<" for "&lt;" etc., this is our <expression>:
Code:
<expression repeat="yes">&lt;a href='../art/ver.php\?art=([0-9]*)' target='_top'&gt;(.[^&lt;]*)\.&lt;/a&gt;.[^\(]*&lt;i&gt;(.[^&lt;]*)&lt;/i&gt;\. De (.[^\(]*) \(([0-9]*)\)</expression>

There, the fields will be:
\1 ID of the movie's article in culturalianet.com
\2 Spanish title
\3 Original title
\4 Director's name
\5 Movie's year of first exhibition

Each of our <entity> elements will have a <title> in the form:
'Noche es nuestra, la' (We own the night) de James Gray (2007)
or, with our actual fields:
'\2' (\3) de \4 (\5)

Also there will be a <url> generated by:
Code:
http://www.culturalianet.com/art/ver.php?art=\1

Like we did with our dummy scraper, we add all the necessary headings and this is the result:

Code:
<GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
        <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;'\2' (\3) de \4 (\5)&lt;/title&gt;&lt;url&gt;http://www.culturalianet.com/art/ver.php?art=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
            <expression repeat="yes">&lt;a href='../art/ver.php\?art=([0-9]*)' target='_top'&gt;(.[^&lt;]*)\.&lt;/a&gt;.[^\(]*&lt;i&gt;(.[^&lt;]*)&lt;/i&gt;\. De (.[^\(]*) \(([0-9]*)\)</expression>
        </RegExp>
        <expression noclean="1" />
    </RegExp>
</GetSearchResults>

There are a few things there we have not seen yet. For starters, there are two nested regexps; they are evaluated from the inner ones to the outer ones. Also, there is an attribute for <expression> we haven't seen yet, 'noclean="1"'. By default, XBMC strips all HTML formatting from the fields, but here we do not want that, so we add the attribute to indicate that field 1 must not be cleaned before it is used.

Also, and this is standard XML, you can shorten empty XML elements like
<expression></expression>
by writing instead:
<expression/>

So, how does XBMC execute this? It goes to the inner regexp and, using input="$$1" (the content of our search URL), applies the expression to it and generates our fields:
Code:
<expression repeat="yes"><a href='../art/ver.php\?art=([0-9]*)' target='_top'>(.[^<]*)\.</a>.[^\(]*<i>(.[^<]*)</i>\. De (.[^\(]*) \(([0-9]*)\)</expression>
In the previous line, for clarity, I'm using < instead of &lt;

that generates this output to buffer 5:
Code:
<entity><title>'\2' (\3) de \4 (\5)</title><url>http://www.culturalianet.com/art/ver.php?art=\1</url></entity>

This is repeated as long as there is an <expression> match in the input, generating one <entity> per match, and all of them go to $$5.

Then the outer regexp gets executed. It uses as input the $$5 that has just been generated; it does not modify anything (an empty <expression> means all input goes to \1), but remember to use the noclean attribute to keep the necessary formatting. It simply takes all the <entity> elements generated and inserts them into the correct XML structure:
Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>

All output goes to buffer 8.
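The inner "repeat" step followed by the outer wrapping step can be tried out with this Python sketch. It is only an imitation of the behaviour described above, not XBMC's code. The sample page is a shortened, hand-made copy of the two results shown earlier, written with single quotes around the attributes because that is what the expression expects.

```python
import re

# The expression of the final scraper, with "<" written out instead of &lt;
EXPRESSION = (
    r"<a href='../art/ver.php\?art=([0-9]*)' target='_top'>(.[^<]*)\.</a>"
    r".[^\(]*<i>(.[^<]*)</i>\. De (.[^\(]*) \(([0-9]*)\)"
)
# Output template of the inner regexp: one <entity> per match.
ENTITY = (
    r"<entity><title>\2 (\3) de \4 (\5)</title>"
    r"<url>http://www.culturalianet.com/art/ver.php?art=\1</url></entity>"
)

page = (
    "<tr><td><b><a href='../art/ver.php?art=29405' target='_top'>"
    "Noche es nuestra, La.</a></b></td></tr>\n"
    "<tr><td colspan='2'><i>We Own the Night</i>. De James Gray (2007)</td></tr>"
    "<tr><td><b><a href='../art/ver.php?art=23798' target='_top'>"
    "10 + 2: La noche mágica.</a></b></td></tr>\n"
    "<tr><td colspan='2'><i>10   2: La noche mágica</i>. "
    "De Miquel Pujol Lozano (2000)</td>"
)

# Inner step, like repeat="yes": one <entity> for every match in the page.
entities = "".join(m.expand(ENTITY) for m in re.finditer(EXPRESSION, page))

# Outer step: wrap everything that was collected in <results>.
results = "<results>" + entities + "</results>"
print(results)
```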
#6
Now, XBMC will show the user the list of movies and one will be selected. The associated URL, the article page of the movie, will be downloaded and fed to buffer 1, and that is what we need to parse to extract the information we need.

Go now to a movie article, like http://www.culturalianet.com/art/ver.php?art=29405, and look at the underlying HTML code. Very much like we did when parsing the search results page, we must detect the patterns in the page that allow us to select the correct fields and then use them to build our <details> XML structure. Some parts are fairly straightforward, like title, duration, plot or year; this expression extracts the Spanish title, the year and the original title into fields 1, 2 and 3 respectively:
Code:
'titulo2'>(.[^\<]*)\. \(([0-9]*)\)</font></u><br><br><i>(.[^<]*)</i>

and the output for that (we add an <originaltitle> element that is neither needed nor used by XBMC right now; maybe future versions will use it):
Code:
<title>\1 (\3)</title><originaltitle>\3</originaltitle><year>\2</year>

Some data is much more difficult to obtain: there can be one or more actors, writers and directors, and in the page the structure differs depending on whether culturalia has a page for the specific artist or not (there is either just the name, or the name becomes a link). The "actors" block is surrounded by "Actores:" and "Productor:"; we simply extract that block into, for example, $$7:
Code:
<RegExp input="$$1" output="\1" dest="7">
    <expression noclean="1">Actores:([^:]*)Productor:</expression>
</RegExp>

Then we parse the $$7 buffer, in it the name of each actor will be anything between > and < that is at least one character long:
Code:
<expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>

This will be our actors' output:
Code:
<actor><name>\1</name><role></role></actor>

The full regexp for the actors (remember that it is evaluated from the inner regexp to the outer one):
Code:
<RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;&lt;/role&gt;&lt;/actor&gt;" dest="8">
    <RegExp input="$$1" output="\1" dest="7">
        <expression noclean="1">Actores:([^:]*)Productor:</expression>
    </RegExp>
    <expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>
</RegExp>
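The same two-step idea (first cut out the block, then repeat over it) looks like this in a Python sketch. The HTML fragment is invented for the example, since the real markup of the article page is not reproduced here; it has one actor as a link and one as plain text, the two cases mentioned above.

```python
import re

# Hypothetical fragment of an article page (not copied from the real site).
page = (
    "<b>Actores:</b><br><a href='a1'>Joaquin Phoenix</a><br>"
    "Mark Wahlberg<br><b>Productor:</b>"
)

# Inner regexp: keep only what lies between "Actores:" and "Productor:".
block = re.search(r"Actores:([^:]*)Productor:", page).group(1)

# Outer regexp with repeat: every text of one or more characters between > and <.
names = [m.group(1) for m in re.finditer(r">(.[^<>]*)<", block)]

actors = "".join(
    "<actor><name>%s</name><role></role></actor>" % name for name in names
)
print(actors)
```

With other markup between the names (commas or spaces between tags, for instance) the second expression would also pick up those separators, so it is worth testing against the real page.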


So, to get the end result, we simply put one after another all the regexp that generate our <details> listing. When regexps are one after another (not nested) they simply execute in order.

Something we haven't used before: when we want the output to be appended to an existing buffer instead of overwriting it, we simply write, for example, dest="7+"

We generate all the items into the "8" buffer and use the "7" buffer as temporary in each regexp.
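The difference between dest="8" and dest="8+" can be pictured with this small Python sketch. The function store and the buffers dictionary are invented for the illustration and only mimic the overwrite and append behaviour described above.

```python
def store(buffers, dest, text):
    """Write text to a buffer: "8" overwrites it, "8+" appends to it."""
    if dest.endswith("+"):
        number = int(dest[:-1])
        buffers[number] = buffers.get(number, "") + text
    else:
        buffers[int(dest)] = text

buffers = {}
store(buffers, "8", "<title>La noche es nuestra</title>")   # first item: overwrite
store(buffers, "8+", "<director>James Gray</director>")     # next items: append
store(buffers, "8+", "<year>2007</year>")
print(buffers[8])
```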

For the final version of the scraper, we use some different attributes for <expression>, with this meaning:

repeat="yes" -> will repeat the expression as long as there are matches

noclean="1" -> will NOT strip HTML tags and special characters from field 1. The field can be 1 ... 9. By default, all fields are "cleaned"

trim="1" -> trims white space from field 1. The field can be 1 ... 9

clear="yes" -> if there is no match for the expression, dest will be cleared. By default, dest will keep its previous value
NOTE: what happens when using clear="yes" if dest is "8+"?

So, without further ado, this is the whole scraper. The extraction of the different fields is similar to the "actors" field we saw. This is just one of many possible ways of getting the info and probably not the best one; it is slightly different from the original culturalia.xml scraper. One additional comment: to avoid trouble with some special characters ("é" in "Género", for example) that can get different encodings depending on your text editor and can be difficult to type, I'm using a dot instead, since an unescaped dot means "any character" in regular expressions.


Code:
<scraper name="Culturalia.es V2" content="movies" thumb="culturalia.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <NfoUrl dest="3">
        <RegExp input="$$1" output="http://www.culturalianet.com/art/ver.php?art=\1" dest="3">
            <expression noclean="1">art/ver\.php\?art=([0-9]*)</expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="http://www.culturalianet.com/bus/resu.php?texto=\1&amp;donde=1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2 (\3) de \4 (\5)&lt;/title&gt;&lt;url&gt;http://www.culturalianet.com/art/ver.php?art=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">&lt;a href=&apos;../art/ver.php\?art=([0-9]*)&apos; target=&apos;_top&apos;&gt;(.[^&lt;]*)\.&lt;/a&gt;.[^\(]*&lt;i&gt;(.[^&lt;]*)&lt;/i&gt;\. De (.[^\(]*) \(([0-9]*)\)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <!-- Titles, year !-->
            <RegExp input="$$1" output="&lt;title&gt;\1 (\3)&lt;/title&gt;&lt;originaltitle&gt;\3&lt;/originaltitle&gt;&lt;year&gt;\2&lt;/year&gt;" dest="8">
                <expression trim="1" noclean="1">&apos;titulo2&apos;&gt;(.[^\&lt;]*)\. \(([0-9]*)\)&lt;/font&gt;&lt;/u&gt;&lt;br&gt;&lt;br&gt;&lt;i&gt;(.[^&lt;]*)&lt;/i&gt;</expression>
            </RegExp>
            <!-- Director's names !-->
            <RegExp input="$$7" output="&lt;director&gt;\1&lt;/director&gt;" dest="8+">
                <RegExp input="$$1" output="\1" dest="7">
                    <expression noclean="1">Director:([^:]*)Actores:</expression>
                </RegExp>
                <expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>
            </RegExp>
            <!-- Runtime !-->
            <RegExp input="$$7" output="&lt;runtime&gt;\1 minutos&lt;/runtime&gt;" dest="8+">
                <RegExp input="$$1" output="\1&lt;" dest="7">
                    <expression noclean="1" clear="yes">Duraci.n:(.*)minutos</expression>
                </RegExp>
                <expression noclean="1" trim="1">&gt;(.[^&lt;&gt;]*)&lt;</expression>
            </RegExp>
            <!-- Thumbnail !-->
            <RegExp input="$$1" output="&lt;thumb&gt;http://www.culturalianet.com/imatges/articulos/\1-1.jpg&lt;/thumb&gt;" dest="8+">
                <expression>imatges/articulos/([0-9]*)-</expression>
            </RegExp>
            <!-- Credits !-->
            <RegExp input="$$7" output="&lt;credits&gt;\1&lt;/credits&gt;" dest="8+">
                <RegExp input="$$1" output="\1" dest="7">
                    <expression noclean="1">Gui.n:([^:]*)Fotograf.a:</expression>
                </RegExp>
                <expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>
            </RegExp>
            <!-- Genres !-->
            <RegExp input="$$7" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
                <RegExp input="$$9" output="\1" dest="7">
                    <RegExp input="$$1" output="\1/" dest="9">
                        <expression>G.nero:([^:]*)Nacionalidad:</expression>
                    </RegExp>
                    <expression>&gt;(.[^&lt;&gt;]*)&lt;</expression>
                </RegExp>
                <expression repeat="yes" trim="1">(.[^/]*)/</expression>
            </RegExp>
            <!-- Actors !-->
            <RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;&lt;/role&gt;&lt;/actor&gt;" dest="8+">
                <RegExp input="$$1" output="\1" dest="7">
                    <expression noclean="1">Actores:([^:]*)Productor:</expression>
                </RegExp>
                <expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>
            </RegExp>
            <!-- Plot !-->
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
                <expression>Sinopsis:&lt;/b&gt;&lt;br&gt;([^=]*)&lt;br&gt;&lt;br&gt;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>
</scraper>
#7
The first two chapters (with some corrections) are now in the wiki. Unfortunately, wrapping the code in <xml> to format it correctly makes the articles very wide, and that affects not only the code but also the text, which makes everything very unpleasant to read... Does anybody know how to fix that?
#8
This is the link to the article: HOW-TO Write Media Info Scrapers (the complete dummies guide)
#9
Thank you very much pko66 (and spiff of course) but...
pko66 Wrote:This is the link to the article: HOW-TO Write Media Info Scrapers (the complete dummies guide)
I wonder if these three articles could be merged into a single article (or at least into two)?
http://wiki.xbmc.org/?title=HOW-TO_Write...s_guide%29
http://wiki.xbmc.org/?title=HOW-TO_Write...duction%29
http://wiki.xbmc.org/?title=Scraper.xml

Can at least the first two listed above be merged into one single article called "HOW-TO write Media Info Scrapers"?

#10
Gamester17 Wrote:Thank you very much pko66 (and spiff of course) but... I wonder if these three articles can not be merged into only one article (or at least only two articles?)?
http://wiki.xbmc.org/?title=HOW-TO_Write...s_guide%29
http://wiki.xbmc.org/?title=HOW-TO_Write...duction%29
http://wiki.xbmc.org/?title=Scraper.xml

Can at least the first two listed above be merged into one single article called "HOW-TO write Media Info Scrapers"?

I think the last one (scrapers) is better kept as a separate article, since it is more a reference than a how-to; I intend to add some more info to it once I learn enough.

The other how-to, the one I've renamed as "introduction", can very well be merged (in fact, it already is a little, since it was one of my main sources of information). My idea is first to finish the "course" (probably two more chapters), then incorporate all the info from "introduction" that is missing, and then simply erase it.

BTW, is there something that can be done about the horizontal scroll bar? It makes the article pretty unreadable.
#11
pko66 Wrote: BTW, is there something that can be done to the horizontal scroll bar? it makes the article pretty unreadable
Send sho a PM, he is our resident wiki guru.
#12
I've been on vacation for a few days and was unable to do anything on the guide... But now I'm back and I hope that this weekend I will have the time to make some progress on it. Now that the culturalia scraper is finished (I didn't post it to be included in XBMC yet... I will do so soon), I'm planning to expand it to search IMDB, and I hope I can make the implementation open enough that it can be included in other scrapers (like moviemeter.nl, which has been requested recently).

Then I plan to do a scraper for "generic" movies (to scrape, for example, home movies) and an addon for any scraper to simulate file mode in library mode (sacrificing the genre tags); both would work in current XBMC, but a more "elegant" solution would need some (I hope easy) modifications to the XBMC code, so that is an after-Atlantis thing.

Maybe I'm being a little too optimistic about my time and knowledge :-D but I hope not; I think I can do that in a 2 - 3 week timeframe.
#13
My search string needs to be modified before it is applied. If I have spaces in the buffer, how would I modify the string to change all spaces to a '+'?
#14
is this a search string fed from the application? if so; which. they should be url encoded before passed to the scraper.

in any case, just run a regular expression to replace..
#15
Okay, I've figured out how to use a regex to replace the spaces now... For the record, when I run the scraper with scrap.exe the URL returns 'Blah+blah+blah', and when I use XBMC the log reports that it is scraping for 'Blah%20blah%20blah'.
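For anyone comparing the two forms, this short Python sketch shows both encodings of the same name and a regex replacement that turns either one into the '+' form. It only illustrates the string handling; which form a given site accepts is something to test against that site.

```python
import re
from urllib.parse import quote, quote_plus, unquote_plus

name = "Blah blah blah"

plus_form = quote_plus(name)    # spaces become "+"
percent_form = quote(name)      # spaces become "%20"
print(plus_form, percent_form)

# Both forms decode back to the same text.
same_text = unquote_plus(plus_form) == unquote_plus(percent_form) == name

# A regex replacement that accepts a raw space or "%20" and writes "+".
converted = re.sub(r"%20| ", "+", percent_form)
print(converted)
```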

Just one small observation: even the guide for dummies on writing scrapers is a bit highbrow... One has to read through it about 30 times while trying it out to figure it out completely. For instance, one thing that's not FULLY addressed is exactly which 'special characters' need to be escaped; that much I still haven't figured out, but I guess it's because I still don't completely understand XML...

The only reason I'm getting it right now is because I wrote a little VB program to output my regular expressions as XML-encoded strings.