Hi, I'm working on a scraper, continuing the work of 'esd'.
So far quite a few things work, but I've still got a few little problems.

- The plot outline field is loading what should be the plot; the site doesn't really have a plot outline section.

- With some movies it does load the plot outline (plot), but then it doesn't load the actors, even though they are available on the site and load fine with other movies.

I'll be pretty much happy when I get those to work. I also haven't got the MPAA rating to work, but I don't really care that much about that one atm.

Any help would be greatly appreciated.

Here is the current XML. Scraper
OK, too bad no one has offered any help so far.

I didn't work on this for a few days, but today I picked it up again.
There still seems to be a problem somewhere.

I got the plot to appear in the plot section (as it should, of course).

I'm mainly testing this scraper with 2 movies, "The Classic" and "Peppermint Candy", since those 2 movies have all their info filled in on the site.

But even though they both have all the info, with Peppermint Candy the plot gets downloaded but the actors don't,
and with The Classic I get the actors but no plot.

I went through the script quite a few times but can't seem to find what causes this.

Any help, ideas, suggestions etc. would be great, even if it's just a guess.
You can check my previous post for the script, thanks.

OK, again a small update.

Right now everything except the plot works.
It would be great if someone else could fix this, since I'm pretty much lost.
Here's what I've got right now, but I'm not really sure if I'll continue it, since no one has offered any help and I'm stuck myself.

Hopefully it can be useful to people:
I'd suggest

1) putting the plot-grabbing regexp inside the outermost one. Stuffing things into buffer 5 after it has already been transferred into the final buffer (3) is your main issue.

2) using this version instead:
<RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5+">
<expression trim="1">Synopsis&lt;/td&gt;&lt;/table&gt;&lt;div[^&gt;]*&gt;&lt;table[^&gt;]*&gt;&lt;td[^&gt;]*&gt;&lt;img[^&gt;]*&gt;(.*)&lt;/td&gt;&lt;/table&gt;&lt;/div&gt;&lt;p&gt;</expression>
</RegExp>

Note that I grab the synopsis and not the introduction.

Oh, and please do not PM me asking for help. I read the forums quite frequently and will respond when I have the time / if I feel like it. I get a lot of these and they bump my grump factor.
Thanks Spiff, but I don't understand point 1.

I've put in your version of the plot regexp, and it did the trick with 1 or 2 movies.
The rest still don't get the plot.

If it's not too much to ask, could you perhaps make the required changes to the scraper version posted above?



Notice that I moved the plot regexp inside the
<RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
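
For illustration, a minimal sketch of that nesting. It assumes the usual layout where nested RegExp elements run first and append to buffer 5, and the outer one then wraps buffer 5 into buffer 3. The GetDetails wrapper and the empty outer expression are assumptions about the rest of the scraper, which isn't reproduced here.

```xml
<GetDetails dest="3">
  <!-- Outer RegExp: evaluated after its children; wraps whatever is in buffer 5 -->
  <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
    <!-- Plot RegExp nested inside, so it appends to buffer 5 BEFORE the outer one reads it -->
    <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5+">
      <expression trim="1">Synopsis&lt;/td&gt;&lt;/table&gt;&lt;div[^&gt;]*&gt;&lt;table[^&gt;]*&gt;&lt;td[^&gt;]*&gt;&lt;img[^&gt;]*&gt;(.*)&lt;/td&gt;&lt;/table&gt;&lt;/div&gt;&lt;p&gt;</expression>
    </RegExp>
    <!-- Assumed: an empty expression that passes the whole of buffer 5 through as \1 -->
    <expression noclean="1"></expression>
  </RegExp>
</GetDetails>
```

If the plot RegExp sits outside (after) the outer one, buffer 5 has already been copied into buffer 3 by the time the plot is added, so the plot never reaches the result.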
Thanks Spiff, but I'm sorry to say that the problem isn't fixed :(
Still only a few (3 or 4) movies get their plot; the rest don't, even though for quite a few of them the info is available on the website.

For example, the movie "April Snow".
I never said I fixed your expressions - I fixed what you had in there.

Anyway, the problem is obvious. Some pages have a synopsis, some have a synopsis and an introduction, and some have only an introduction.

Solution: add another expression after the synopsis one that grabs the introduction block. If the page has a synopsis, the first <plot> tag will be used. If it doesn't, we'll use the introduction text.
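
Roughly like this, as a sketch only. The second expression is a placeholder: it assumes the introduction block uses the same markup as the synopsis block with a different label, which needs checking against the actual page source. Both elements go inside the outer RegExp that reads $$5.

```xml
<!-- 1) Synopsis: if the page has one, this plot tag is appended first and is the one used -->
<RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5+">
  <expression trim="1">Synopsis&lt;/td&gt;&lt;/table&gt;&lt;div[^&gt;]*&gt;&lt;table[^&gt;]*&gt;&lt;td[^&gt;]*&gt;&lt;img[^&gt;]*&gt;(.*)&lt;/td&gt;&lt;/table&gt;&lt;/div&gt;&lt;p&gt;</expression>
</RegExp>
<!-- 2) Introduction fallback: HYPOTHETICAL pattern, assumed to mirror the synopsis markup -->
<RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5+">
  <expression trim="1">Introduction&lt;/td&gt;&lt;/table&gt;&lt;div[^&gt;]*&gt;&lt;table[^&gt;]*&gt;&lt;td[^&gt;]*&gt;&lt;img[^&gt;]*&gt;(.*)&lt;/td&gt;&lt;/table&gt;&lt;/div&gt;&lt;p&gt;</expression>
</RegExp>
```

On a page with both blocks this produces two plot tags and the first one (synopsis) wins; on a page with only an introduction, the second is the only one present.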
Great, it works! :D

Thanks for the help Spiff, couldn't have finished it without you of course ;)
Cool - now tidy it up (for starters, you should include the year in the returned search results, as the IMDb scraper does). When you think it's ready, submit a patch at sf. Cheers
Just submitted it, even though the scraper isn't 100% functional.
It's still good enough to be used, and others can improve it if they feel the need.

Sorry I didn't tidy it up like you said, Spiff, but my lack of 'code vision' is kinda keeping me from doing that.
But like I said, someone else might do it.

