
fekker · (This post was last modified: 2009-09-23, 07:00 by fekker.)

giving scraper creation a go

site: getvideoartwork.com

the only way i can see to parse the data is to pick the url to grab by the first char in the title, while removing "the " from it (if it's there).

(?:the )?(.)(.*)

\1 is the first char
\2 is the rest of the name
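
For reference, here is a minimal Python sketch of that split, outside the scraper XML. The pattern is the one above; the function name `split_title` and the case-insensitive flag are assumptions for illustration, not something the scraper engine provides.

```python
import re

# Pattern from the post: an optional leading "the ", then the first
# character (group 1) and the rest of the title (group 2).
# IGNORECASE is an assumption so that "The " is stripped as well.
TITLE_RE = re.compile(r"(?:the )?(.)(.*)", re.IGNORECASE)


def split_title(title):
    """Return (first_char, rest) with a leading 'the ' removed."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        # Only happens for an empty title: group 1 needs one character.
        raise ValueError("empty title")
    return match.group(1), match.group(2)


first, rest = split_title("The Dark Knight")
# first is "D", rest is "ark Knight"
```

A title such as "Them" keeps its "T", because the optional group only matches "the" followed by a space.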

how do i form a search url based off \1?

i.e. if \1 is [0-9] the results to grab are on one page, if [aA] they're on another
[0-9] http://getvideoartwork.com/index.php?act..._itemId=38
[aA] http://getvideoartwork.com/index.php?act..._itemId=39
[fF] http://getvideoartwork.com/index.php?act...temId=2174

it's not really a search, just pulling the page that lists all the movies starting with a number (or char)
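
One way to picture the idea is a plain lookup table keyed on the first character. This is only a sketch in Python, not scraper XML: the album ids are copied from the fixed list further down in this post, while `ALBUM_IDS`, `listing_url`, and the "#" key for digits are names made up for the example.

```python
BASE_URL = "http://getvideoartwork.com/index.php?action=gallery&g2_itemId={}"

# Album ids taken from the fixed list in this post.
# "#" is a made-up key standing for any title that starts with 0-9.
ALBUM_IDS = {
    "#": 38, "a": 39, "b": 40, "c": 80, "d": 82, "e": 84, "f": 2174,
    "g": 88, "h": 663, "i": 92, "j": 125, "k": 127, "l": 129, "m": 131,
    "n": 133, "o": 147, "p": 149, "q": 151, "r": 153, "s": 155, "t": 157,
    "u": 159, "v": 161, "w": 163, "x": 165, "y": 167, "z": 169,
}


def listing_url(first_char):
    """Map the first character of a title to its listing page URL."""
    is_digit = first_char.isascii() and first_char.isdigit()
    key = "#" if is_digit else first_char.lower()
    if key not in ALBUM_IDS:
        raise KeyError(f"no listing page known for {first_char!r}")
    return BASE_URL.format(ALBUM_IDS[key])


url = listing_url("D")
# url ends with "g2_itemId=82"
```

The table is hard-coded, so it would go stale if the site renumbered its albums; parsing the index page for the letter links avoids that at the cost of one extra request.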

another question, directly related: if i want it to pull more than 1 url, how does that work?
i.e.
[dD] has 2 pages of data
http://getvideoartwork.com/index.php?act...&g2_page=1
http://getvideoartwork.com/index.php?act...&g2_page=2
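
As a rough illustration of the multi-page case, the sketch below just builds one URL per page. It assumes the elided links above are the letter's listing URL with `&g2_page=N` appended, and that the page count is already known (it could perhaps be read from the pager on the first page). `page_urls` is a hypothetical helper, and this says nothing about how the scraper engine itself queues several URLs.

```python
def page_urls(album_url, page_count):
    """Return the URLs for pages 1..page_count of one listing."""
    return [f"{album_url}&g2_page={n}" for n in range(1, page_count + 1)]


# The [dD] listing from this post, assumed to span two pages.
d_album = "http://getvideoartwork.com/index.php?action=gallery&g2_itemId=82"
urls = page_urls(d_album, 2)
# urls[0] ends with "&g2_page=1", urls[1] ends with "&g2_page=2"
```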

here's sorta the pseudo logic

Code:
getvideoartwork.com

Strip out "the"

match first char and search specific page based on first char

(?:the )?(.)(.*)

\1 is first char (not the and not space)

\1\2 is the title

dynamic list (parse this page for each letter)

http://getvideoartwork.com/index.php?action=gallery&g2_itemId=27

regex repeat

fixed list

[0-9] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=38

[aA] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=39

[bB] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=40

[cC] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=80

[dD] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=82

[eE] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=84

[fF] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=2174

[gG] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=88

[hH] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=663

[iI] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=92

[jJ] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=125

[kK] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=127

[lL] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=129

[mM] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=131

[nN] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=133

[oO] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=147

[pP] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=149

[qQ] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=151

[rR] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=153

[sS] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=155

[tT] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=157

[uU] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=159

[vV] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=161

[wW] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=163

[xX] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=165

[yY] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=167

[zZ] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=169

from that page get the info in the table rows

get all <tr> .*? </tr>

<td>(.*?)</td>

\1 is the blocks of data to process

(regex repeat)

then process those

ID: get id from g2_itemId: <a href="index.php?action=gallery&amp;g2_itemId=3570">

<a href="index.php.action=gallery&amp;g2_itemId=(\d{4,7})">

\1 is the id

TITLE:

<p class="giTitle">([^<]*)</p>

\1 is the title on the page

Title: giTitle <p> (paragraph) tag

<p class="giTitle">.*?</p> 

or

<p class="giTitle">[^<]*</p> 

regex repeat 

form the image link as 

http://getvideoartwork.com/index.php?action=gallery&g2_itemId=\1&g2_imageViewsIndex=1

i.e. http://getvideoartwork.com/index.php?action=gallery&g2_itemId=3989&g2_imageViewsIndex=1

Done with initial searching, that's the results list

=======================

(page for reference) http://getvideoartwork.com/index.php?action=gallery&g2_itemId=3989&g2_imageViewsIndex=1

take those results

and get the link (includes some serial number thing)

<div id="gsImageView" class="gbBlock">

<img src="gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=3989&amp;g2_serialNumber=1" alt="Dark Knight v2.jpg" height="1500" width="1000">

</div>

<img src="(gallery/main.php.g2_view=core.DownloadItem&amp;g2_itemId=\d{4,7}&amp;g2_serialNumber=\d{1,9})"

prepend http://getvideoartwork.com/ to img src for url of image

url = http://getvideoartwork.com/\1
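
To sanity-check the regexes in the pseudo logic above, here is a small Python sketch that runs them against sample markup. The patterns follow the ones in the post, with `?` and `.` escaped; the sample listing HTML is invented for the test (only the image block is copied from the post), and `parse_listing` and `image_url` are hypothetical helper names.

```python
import html
import re

SITE = "http://getvideoartwork.com/"

ITEM_RE = re.compile(
    r'<a href="index\.php\?action=gallery&amp;g2_itemId=(\d{4,7})">')
TITLE_RE = re.compile(r'<p class="giTitle">([^<]*)</p>')
IMG_RE = re.compile(
    r'<img src="(gallery/main\.php\?g2_view=core\.DownloadItem'
    r'&amp;g2_itemId=\d{4,7}&amp;g2_serialNumber=\d{1,9})"')


def parse_listing(page):
    """Return (item_id, title, view_url) for each <td> that has both."""
    results = []
    for cell in re.findall(r"<td>(.*?)</td>", page, re.DOTALL):
        item = ITEM_RE.search(cell)
        title = TITLE_RE.search(cell)
        if item and title:
            view_url = (f"{SITE}index.php?action=gallery"
                        f"&g2_itemId={item.group(1)}&g2_imageViewsIndex=1")
            results.append((item.group(1), title.group(1).strip(), view_url))
    return results


def image_url(view_page):
    """Return the full image URL from an image view page, or None."""
    match = IMG_RE.search(view_page)
    if match is None:
        return None
    # The src attribute is HTML-escaped (&amp;), so unescape it first.
    return SITE + html.unescape(match.group(1))


# Invented listing markup, shaped to fit the patterns above.
sample_listing = (
    '<tr><td><a href="index.php?action=gallery&amp;g2_itemId=3989">'
    '<img src="thumb.jpg"></a><p class="giTitle">Dark Knight v2</p></td></tr>'
)
# Image block as quoted in the post.
sample_view = (
    '<div id="gsImageView" class="gbBlock">'
    '<img src="gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=3989'
    '&amp;g2_serialNumber=1" alt="Dark Knight v2.jpg" height="1500" width="1000">'
    '</div>'
)

results = parse_listing(sample_listing)
full_image = image_url(sample_view)
```

If the real listing cells carry attributes on the `<td>` tag, the `<td>(.*?)</td>` pattern would need loosening to something like `<td[^>]*>`.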

**spiff** · 2009-09-23, 10:43

uhrr, why not use the search on the site?

fekker · (This post was last modified: 2009-09-23, 19:22 by fekker.)

spiff Wrote: uhrr, why not use the search on the site?

I couldn't figure out how to send just a url with the correct options. i'm not sure if it's the session ids on the site that are causing the issue or just me being dumb about how to do it.

i ran a packet capture on a search form submit and tried to pass the same values, but i must be doing it wrong as it just takes me to the forums on the site

as for the stripping out "the " part, the results are sent back as "Dark Knight, The" for some images and "The Dark Knight" for others, so it's just for more results.
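
A small sketch of how both title forms could be reduced to the same key before comparing against the page results. `normalize_title` is a hypothetical helper in Python, and lowercasing is an extra assumption to make the comparison case-insensitive.

```python
import re


def normalize_title(title):
    """Lowercase a title and drop a leading 'the ' or a trailing ', the'."""
    key = title.strip().lower()
    key = re.sub(r"^the\s+", "", key)    # "the dark knight" -> "dark knight"
    key = re.sub(r",\s*the$", "", key)   # "dark knight, the" -> "dark knight"
    return key


same = normalize_title("Dark Knight, The") == normalize_title("The Dark Knight")
# same is True: both reduce to "dark knight"
```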

**spiff** · 2009-09-24, 10:00

well the only way to do this would be to grab the letter, then do some lut with the page url's. still, i'd spend my money on figuring out the form.

fekker · 2009-09-24, 18:44

spiff Wrote: well the only way to do this would be to grab the letter, then do some lut with the page url's. still, i'd spend my money on figuring out the form.

good point, i'll dig into how that works further.

thanks

-fekker