scraper question - how to setup url based on first char that's not "the"
#1
giving scraper creation a go

site: getvideoartwork.com

the only way i can see to parse the data is to pick the url to grab by the first char in the title, while removing "the " from it (if it's there).

(?:the )?(.)(.*)

\1 is the first char
\2 is the rest of the name
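
a minimal sketch of that split in Python, just to check the regex does what i expect. the name `split_title` is made up for illustration, and i'm assuming case-insensitive matching and allowing any whitespace after "the", which the bare pattern above doesn't do on its own.

```python
import re

# Optional leading "the" plus whitespace (any case), then the first
# character, then the rest of the title.
TITLE_RE = re.compile(r"^(?:the\s+)?(.)(.*)$", re.IGNORECASE)


def split_title(title):
    """Return (first_char, rest), ignoring a leading "the "."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        raise ValueError("empty title")
    return match.group(1), match.group(2)


print(split_title("The Dark Knight"))  # ('D', 'ark Knight')
print(split_title("Them"))             # ('T', 'hem')
```

requiring whitespace after "the" keeps a title like "Them" from losing its first three letters.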

how do i form a search url based on \1?

i.e. if \1 is [0-9] the results to grab are on one page, if it's [aA] they're on another:
[0-9] http://getvideoartwork.com/index.php?act..._itemId=38
[aA] http://getvideoartwork.com/index.php?act..._itemId=39
[fF] http://getvideoartwork.com/index.php?act...temId=2174

it's not really a search, just pulling the page that lists all the movies starting with a number (or char)
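
one way to picture the first-char to url mapping is a plain lookup table. this is only a sketch: `LETTER_IDS` and `page_url` are hypothetical names, and the item ids are copied from the fixed list further down in this post, so they're only as good as that list.

```python
BASE = "http://getvideoartwork.com/index.php?action=gallery&g2_itemId="

# Item ids taken from the fixed list in this post; "#" stands for the
# single page that holds every title starting with a digit.
LETTER_IDS = {
    "#": 38, "a": 39, "b": 40, "c": 80, "d": 82, "e": 84, "f": 2174,
    "g": 88, "h": 663, "i": 92, "j": 125, "k": 127, "l": 129, "m": 131,
    "n": 133, "o": 147, "p": 149, "q": 151, "r": 153, "s": 155, "t": 157,
    "u": 159, "v": 161, "w": 163, "x": 165, "y": 167, "z": 169,
}


def page_url(first_char):
    """Return the listing url for a first character, or None if unknown."""
    is_digit = first_char.isascii() and first_char.isdigit()
    key = "#" if is_digit else first_char.lower()
    item_id = LETTER_IDS.get(key)
    if item_id is None:
        return None
    return BASE + str(item_id)


print(page_url("D"))
```

anything that isn't a digit or a-z (punctuation, accented letters) returns None here, since the list has no page for it.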

another question, directly related: if i want it to pull more than 1 url, how does that work?
i.e.
[dD] has 2 pages of data
http://getvideoartwork.com/index.php?act...&g2_page=1
http://getvideoartwork.com/index.php?act...&g2_page=2
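
outside of the scraper format, the multi-page case could be sketched like this in Python. big assumptions: that the elided part of the urls above is just the normal letter page url, that `g2_page` is simply one more query parameter, and that the number of pages is already known. `with_page` and `page_urls` are hypothetical helpers.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return url with g2_page set to page, replacing any existing value."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "g2_page"]
    query.append(("g2_page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(url, page_count):
    """Return one url per page, numbered from 1 to page_count."""
    return [with_page(url, n) for n in range(1, page_count + 1)]


# Assumed base url for [dD], taken from the fixed list in this post.
D_PAGE = "http://getvideoartwork.com/index.php?action=gallery&g2_itemId=82"
for link in page_urls(D_PAGE, 2):
    print(link)
```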

here's sorta the pseudo logic
Code:
getvideoartwork.com

Strip out "the"
match first char and search specific page based on first char
(?:the )?(.)(.*)
\1 is first char (not the and not space)
\1\2 is the title


dynamic list (parse this page for each letter)
http://getvideoartwork.com/index.php?action=gallery&g2_itemId=27
regex repeat


fixed list
[0-9] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=38
[aA] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=39
[bB] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=40
[cC] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=80
[dD] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=82
[eE] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=84
[fF] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=2174
[gG] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=88
[hH] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=663
[iI] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=92
[jJ] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=125
[kK] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=127
[lL] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=129
[mM] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=131
[nN] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=133
[oO] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=147
[pP] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=149
[qQ] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=151
[rR] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=153
[sS] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=155
[tT] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=157
[uU] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=159
[vV] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=161
[wW] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=163
[xX] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=165
[yY] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=167
[zZ] http://getvideoartwork.com/index.php?action=gallery&g2_itemId=169


from that page get the info in each table row

get all <tr> .*? </tr>
<td>(.*?)</td>
\1 is the blocks of data to process
(regex repeat)

then process those
ID: get id from g2_itemId: <a href="index.php?action=gallery&amp;g2_itemId=3570">
<a href="index\.php\?action=gallery&amp;g2_itemId=(\d{4,7})">
\1 is the id

TITLE:
<p class="giTitle">([^<]*)</p>
\1 is the title on the page

Title: giTitle <p> (paragraph) tag
<p class="giTitle">.*?</p>
or
<p class="giTitle">[^<]*</p>
regex repeat


form the image link as
http://getvideoartwork.com/index.php?action=gallery&g2_itemId=\1&g2_imageViewsIndex=1

i.e. http://getvideoartwork.com/index.php?action=gallery&g2_itemId=3989&g2_imageViewsIndex=1

Done with initial searching, that's the results list
=======================
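
the results-list steps above can be sketched in Python like this. the sample html is made up around the two fragments quoted above (the real page layout may differ, e.g. the <td> tags may carry attributes), and `parse_listing` is a hypothetical name.

```python
import re

CELL_RE = re.compile(r"<td>(.*?)</td>", re.DOTALL)
ID_RE = re.compile(
    r'<a href="index\.php\?action=gallery&amp;g2_itemId=(\d{4,7})">'
)
TITLE_RE = re.compile(r'<p class="giTitle">([^<]*)</p>')
IMAGE_PAGE = (
    "http://getvideoartwork.com/index.php"
    "?action=gallery&g2_itemId={item_id}&g2_imageViewsIndex=1"
)


def parse_listing(html):
    """Return one dict (id, title, url) per table cell that has both."""
    results = []
    for cell in CELL_RE.findall(html):
        id_match = ID_RE.search(cell)
        title_match = TITLE_RE.search(cell)
        if id_match and title_match:
            item_id = id_match.group(1)
            results.append({
                "id": item_id,
                "title": title_match.group(1).strip(),
                "url": IMAGE_PAGE.format(item_id=item_id),
            })
    return results


# Made-up sample cell, only for trying the regexes.
SAMPLE = """
<tr>
<td><a href="index.php?action=gallery&amp;g2_itemId=3989">thumb</a>
<p class="giTitle">Dark Knight, The</p></td>
</tr>
"""
print(parse_listing(SAMPLE))
```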

(page for reference) http://getvideoartwork.com/index.php?action=gallery&g2_itemId=3989&g2_imageViewsIndex=1

take those results
and get the link (includes some serial number thing)
<div id="gsImageView" class="gbBlock">
<img src="gallery/main.php?g2_view=core.DownloadItem&amp;g2_itemId=3989&amp;g2_serialNumber=1" alt="Dark Knight v2.jpg" height="1500" width="1000">
</div>

<img src="(gallery/main\.php\?g2_view=core\.DownloadItem&amp;g2_itemId=\d{4,7}&amp;g2_serialNumber=\d{1,9})"

append http://getvideoartwork.com/ to img src for url of image
url = http://getvideoartwork.com/\1
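
a sketch of that last step in Python, run against the <img> tag quoted above. `image_url` is a hypothetical name. one thing to watch: the src in the page source has `&amp;` in it, so it needs unescaping before it's usable as a url.

```python
import re
from html import unescape

IMG_RE = re.compile(
    r'<img src="(gallery/main\.php\?g2_view=core\.DownloadItem'
    r'&amp;g2_itemId=\d{4,7}&amp;g2_serialNumber=\d{1,9})"'
)


def image_url(html):
    """Return the full image url from an image page, or None if absent."""
    match = IMG_RE.search(html)
    if match is None:
        return None
    # Turn &amp; back into & and prefix the site root.
    return "http://getvideoartwork.com/" + unescape(match.group(1))


SAMPLE = (
    '<div id="gsImageView" class="gbBlock">\n'
    '<img src="gallery/main.php?g2_view=core.DownloadItem&amp;'
    'g2_itemId=3989&amp;g2_serialNumber=1" alt="Dark Knight v2.jpg" '
    'height="1500" width="1000">\n'
    '</div>'
)
print(image_url(SAMPLE))
```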
#2
uhrr, why not use the search on the site?
#3
spiff Wrote: uhrr, why not use the search on the site?

I couldn't figure out how to send just a url with the correct options. i'm not sure if it's the session ids on the site that are causing the issue or just me being dumb about how to do it.

i ran a packet capture on a search form submit and tried to pass the same values, but i must be doing it wrong as it just takes me to the forums on the site

as for stripping out the "the " part, the results are sent back as "Dark Knight, The" for some images and "The Dark Knight" for others, so matching without it just catches more results.
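
to show what i mean, here's a small Python sketch that reduces both spellings to the same key before comparing. `normalize` is a hypothetical helper, and lowercasing everything is my own assumption about how loose the match should be.

```python
import re


def normalize(title):
    """Drop a leading "The " or trailing ", The" and lowercase the rest."""
    text = title.strip()
    text = re.sub(r",\s*the$", "", text, flags=re.IGNORECASE)
    text = re.sub(r"^the\s+", "", text, flags=re.IGNORECASE)
    return text.lower()


print(normalize("Dark Knight, The"))  # dark knight
print(normalize("The Dark Knight"))   # dark knight
```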
#4
well the only way to do this would be to grab the letter, then do a lookup table (lut) with the page urls. still, i'd spend my money on figuring out the form.
#5
spiff Wrote: well the only way to do this would be to grab the letter, then do a lookup table (lut) with the page urls. still, i'd spend my money on figuring out the form.

good point, i'll dig into how that works further.

thanks

-fekker