pko66 · 2008-08-27, 13:41

This is the first "chapter" (the only one so far) of a "course" on scraper creation that I'm writing as I learn how to make them. My intention is to add it to the wiki when it is finished. Please give me your opinion about it:

Scraper creation for dummies

Chapter one

First, some very important reference information. There is no need to read it right now, but keep the URLs on hand...

Introduction to scraper creation: http://wiki.xbmc.org/?title=How_To_Write...o_Scrapers
Reference to scraper structure: http://wiki.xbmc.org/?title=Scraper.xml
Tool to test scrapers: http://wiki.xbmc.org/?title=Scrap (download both files now: scrap.exe and libcurl.dll)
Some info about regular expressions: http://wiki.xbmc.org/?title=Regular_Expr...9_Tutorial
More info on regular expressions from Wikipedia: http://en.wikipedia.org/wiki/Regex

I. How a scraper works

In a nutshell:

1) If there is a movie.nfo, use it (section NfoUrl) and then go to step 5
2) Otherwise, generate a search URL from the file's name (section CreateSearchUrl) and get the results page
3) From the results, generate a listing (section GetSearchResults) that has, for each "candidate" movie, a user-friendly name and one (or more) associated URLs
4) Show the listing to the user so that they can choose a movie, and select its associated URL(s)
5) Get the content of the URL(s) and extract from it (section GetDetails) the appropriate data for the movie, to be stored in videodb

Each of those four sections is built from RegExp entries that have this structure:

Code:
<RegExp input="INPUT" output="OUTPUT" dest="DEST">
    <expression>EXPRESSION</expression>
</RegExp>

INPUT is usually the content of a buffer (we will see what that is in a moment)
OUTPUT is a string that is built by the RegExp
DEST is the name of the buffer where OUTPUT will be stored
EXPRESSION is a regular expression that manipulates INPUT to extract information from it as "fields". If EXPRESSION is empty, a field "1" containing the whole INPUT is created automatically

Here a "buffer" is just a piece of memory used for communication between each section and the rest of XBMC. There are nine buffers, named 1 to 9. To refer to the *content* of a buffer you write "$$n", where n is the number of the buffer.

The fields are extracted from the input by EXPRESSION: each pattern enclosed in "(" and ")" becomes a field, and the fields are numbered sequentially. The first one is \1, the second \2, and so on up to a maximum of 9.

A very easy example:

Code:
<RegExp input="$$1" output="\1" dest="3">
    <expression></expression>
</RegExp>

- The content of buffer 1 is used as input
- The output will be stored in buffer 3
- As the expression is empty, all the input ($$1) will be stored in field \1
- As the output is simply \1, all of its content will be used as output, that is, $$1

So the end result is that the content of buffer 1 is copied into buffer 3

If you do not know anything about regular expressions, this is the moment to make a quick study of their basics using the references above.

Another example: this time we use a literal string as input and a very simple regular expression to select part of it

Code:
<RegExp input="Movie: The Dark Knight" output="The title is \1" dest="3">
    <expression>Movie: (.*)</expression>
</RegExp>

Here, when we apply the expression to the input, the selected pattern (.*) becomes field 1, which in this case gets assigned "The Dark Knight". The output will therefore be "The title is The Dark Knight", and it will be stored in buffer 3.
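As a further sketch of how fields are numbered, the same kind of entry can select two patterns at once. The input string and the output wording below are made up for illustration. The parentheses that belong to the text itself are written as \( and \) so that they are not taken as pattern selectors:

```xml
<!-- Illustrative only: a literal input string holding a title and a year -->
<RegExp input="The Dark Knight (2008)" output="Title: \1, year: \2" dest="3">
    <!-- (.*) becomes field \1, ([0-9]+) becomes field \2; \( and \) match the literal parentheses -->
    <expression>(.*) \(([0-9]+)\)</expression>
</RegExp>
```

Following the rules described so far, field 1 should get "The Dark Knight" and field 2 should get "2008", so buffer 3 would end up holding "Title: The Dark Knight, year: 2008".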

II. The most important sections in a scraper

Now, let's have a look at the 3 "important" sections: CreateSearchUrl, GetSearchResults and GetDetails. First, there is some basic information about them that we need to know.

CreateSearchUrl must generate (into buffer 3) the URL that will be used to get the listing of possible movies. To do that, you need the name of the file selected to be scraped, which XBMC stores in buffer 1.
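A minimal sketch of such a section, assuming a search page that takes the name as a "q" parameter. The site www.example.com and the parameter name are placeholders, not a real movie database. The URL is wrapped in <url> tags, and since it lives inside an XML attribute those tags are written as &lt;url&gt; and &lt;/url&gt;:

```xml
<CreateSearchUrl dest="3">
    <!-- www.example.com and its "q" parameter are hypothetical placeholders -->
    <RegExp input="$$1" output="&lt;url&gt;http://www.example.com/search?q=\1&lt;/url&gt;" dest="3">
        <!-- empty expression: field \1 receives the whole input, i.e. the name held in buffer 1 -->
        <expression></expression>
    </RegExp>
</CreateSearchUrl>
```

The only moving part is \1: whatever name XBMC placed in buffer 1 is pasted into the URL, and the result goes to buffer 3.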

GetSearchResults must generate (in buffer 8) the listing of movies (in user-ready form) and their associated URLs. The result of downloading the content of the URL generated by CreateSearchUrl is stored by XBMC in buffer 5. The listing must have this structure:

Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<results>
   <entity>
      <title></title>
      <url></url>
   </entity>
   <entity>
      <title></title>
      <url></url>
   </entity>
</results>

Each <entity> must have a <title> (the text that will be shown to the user) and at least one <url>, although there can be up to 9. You can generate as many <entity> elements as you need; together they become the listing shown to the user to choose from.
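For illustration, a filled-in listing could look like the sketch below. The titles and the www.example.com URLs are invented placeholders. The second entity carries two <url> elements to show the "more than one URL" case:

```xml
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<!-- Titles and URLs are invented placeholders -->
<results>
   <entity>
      <title>The Dark Knight (2008)</title>
      <url>http://www.example.com/movie?id=1</url>
   </entity>
   <entity>
      <title>Another candidate movie</title>
      <url>http://www.example.com/movie?id=2</url>
      <!-- a literal & inside XML text has to be written as &amp; -->
      <url>http://www.example.com/movie?id=2&amp;page=cast</url>
   </entity>
</results>
```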

Once the user has selected a movie, the associated URL(s) will be downloaded.

Finally, GetDetails must generate (in buffer 3) the detailed information about the movie in the correct format, using the content downloaded from the URL(s) selected in GetSearchResults. The content of the first URL will be in $$1, that of the second in $$2, and so on.

The structure that the listing must have is this:

Code:
<details>
    <title></title>
    <year></year>
    <director></director>
    <top250></top250>
    <mpaa></mpaa>
    <tagline></tagline>
    <runtime></runtime>
    <thumb></thumb>
    <credits></credits>
    <rating></rating>
    <votes></votes>
    <genre></genre>
    <actor>
        <name></name>
        <role></role>
    </actor>
    <outline></outline>
    <plot></plot>
</details>

Notes:
- Some fields can be missing (?)
- <thumb> contains the URL of the image to be downloaded later
- <genre>, <credits>, <director> and <actor> can be repeated as many times as needed
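As a rough sketch of how GetDetails might fill in this structure from a downloaded page, the entry below pulls only the title. It assumes that the page marks the title as <h1>Some Title</h1>, which is a made-up layout, and it relies on the other fields being optional, which the notes above leave as an open question:

```xml
<GetDetails dest="3">
    <!-- Assumption: the downloaded page holds the title as <h1>Some Title</h1> -->
    <RegExp input="$$1" output="&lt;details&gt;&lt;title&gt;\1&lt;/title&gt;&lt;/details&gt;" dest="3">
        <!-- field \1 is whatever text sits between the h1 tags -->
        <expression>&lt;h1&gt;([^&lt;]*)&lt;/h1&gt;</expression>
    </RegExp>
</GetDetails>
```

Since both the output and the expression sit inside an XML file, every < and > in them is written as &lt; and &gt;.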

Some important details to remember:

1) When you need to use special characters literally in the regular expression, do not forget to "escape" them:
\ -> \\
( -> \(
. -> \.
etc.

2) Since the scraper itself is an XML file, the characters that have a meaning in XML cannot be used directly; their entities must be used instead:
& -> &amp;
< -> &lt;
> -> &gt;
" -> &quot;
' -> &apos;

3) If you use non-ASCII characters (umlauts, ñ, etc.) in XML that will end up in the output, they must be encoded as iso-8859-1
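Points 1 and 2 often apply at the same time. The sketch below uses a made-up HTML link as literal input, so both kinds of escaping show up in a single entry:

```xml
<!-- Illustrative only: the input is a made-up HTML link, written with XML entities -->
<RegExp input="&lt;a href=&quot;movie.php?id=123&quot;&gt;The Dark Knight&lt;/a&gt;" output="\2 has id \1" dest="3">
    <!-- \. and \? match a literal dot and question mark; &lt; &quot; &gt; stand for < " > -->
    <expression>&lt;a href=&quot;movie\.php\?id=([0-9]+)&quot;&gt;([^&lt;]*)&lt;/a&gt;</expression>
</RegExp>
```

Field 1 should capture "123" and field 2 "The Dark Knight", so buffer 3 would end up holding "The Dark Knight has id 123".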

III. Our first working scraper

Now, with all that information, let's create our first scraper. Just create a "dummy.xml" file with the content below and study it a little. It should be fairly easy to understand with what we already know:

Code:
<scraper name="dummy" content="movies" thumb="imdb.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <!-- Copies its input (buffer 1) to buffer 3 unchanged -->
    <NfoUrl dest="3">
        <RegExp input="$$1" output="\1" dest="3">
            <expression></expression>
        </RegExp>
    </NfoUrl>
    <!-- Ignores the file name and always returns the same search URL -->
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="&lt;url&gt;http://www.nada.com&lt;/url&gt;" dest="3">
            <expression></expression>
        </RegExp>
    </CreateSearchUrl>
    <!-- Always returns a listing with a single fixed entity -->
    <GetSearchResults dest="8">
        <RegExp input="$$1" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;&lt;entity&gt;&lt;title&gt;Dummy&lt;/title&gt;&lt;url&gt;http://www.nada.com&lt;/url&gt;&lt;/entity&gt;&lt;/results&gt;" dest="8">
            <expression></expression>
        </RegExp>
    </GetSearchResults>
    <!-- Always returns the same fake movie details -->
    <GetDetails dest="3">
        <RegExp input="$$1" output="&lt;details&gt;&lt;title&gt;The Dummy Movie&lt;/title&gt;&lt;year&gt;2008&lt;/year&gt;&lt;director&gt;Dummy Dumb&lt;/director&gt;&lt;top250&gt;&lt;/top250&gt;&lt;mpaa&gt;&lt;/mpaa&gt;&lt;tagline&gt;Some dumb dummies&lt;/tagline&gt;&lt;runtime&gt;&lt;/runtime&gt;&lt;thumb&gt;&lt;/thumb&gt;&lt;credits&gt;Dummy Dumb&lt;/credits&gt;&lt;rating&gt;&lt;/rating&gt;&lt;votes&gt;&lt;/votes&gt;&lt;genre&gt;&lt;/genre&gt;&lt;actor&gt;&lt;name&gt;Dummy Dumb&lt;/name&gt;&lt;role&gt;The dumb dummy&lt;/role&gt;&lt;/actor&gt;&lt;outline&gt;&lt;/outline&gt;&lt;plot&gt;Some dummies doing dumb things&lt;/plot&gt;&lt;/details&gt;" dest="3">
            <expression></expression>
        </RegExp>
    </GetDetails>
</scraper>

A really stupid scraper with no meaningful use whatsoever: whatever movie it is fed, it always generates the same (fake) data. It also downloads content from http://www.nada.com and does not use it at all. Nevertheless, we have our first working scraper, congratulations!

To test it in Windows, put scrap.exe and libcurl.dll (referenced at the beginning of this lesson) in any directory together with dummy.xml, and then run, for example:

Code:
scrap dummy.xml "Hello, world"

It should execute without errors and show you each step and its output.

You can also try it in a "real" XBMC. Copy dummy.xml to XBMC\system\scrapers\video and restart XBMC. Choose any directory from your sources that contains a video file not yet in the library, use "set content" on the directory to select "dummy" as the scraper, and finally select "movie info" on the video file. All our fake data will be added to the video database.