Filmweb scraper
#16
You know what, I don't know why, but a regexp on a text link has never worked for me, and I don't have time for tests.

Filmweb only masks the id number.
Your link
http://frantic.filmweb.pl/

is the same as the one with the id
http://www.filmweb.pl/Film?id=1107

or this one
http://www.filmweb.pl/Film,id=1107

Sorry, but I don't plan to update the scraper. I hope nfo files will start to contain the id.
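
To illustrate the point about the id, here is a minimal Python sketch that pulls the numeric id out of either id-based URL form quoted above. The helper name `film_id` and the pattern are assumptions made for this example only; they are not part of the scraper.

```python
import re

# Hypothetical helper for illustration: matches both "Film?id=N" and "Film,id=N".
FILM_ID = re.compile(r"filmweb\.pl/Film[?,]id=(\d+)")

def film_id(url):
    """Return the numeric Filmweb id found in url, or None if there is none."""
    match = FILM_ID.search(url)
    return int(match.group(1)) if match else None

print(film_id("http://www.filmweb.pl/Film?id=1107"))
print(film_id("http://www.filmweb.pl/Film,id=1107"))
print(film_id("http://frantic.filmweb.pl/"))
```

The title-style link carries no id at all, which is why it cannot be handled by the same pattern.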

smuto
#17
I hope I can use my native language in this topic.
#18
Help!!

Quite simple: output XML in the format

<actor>
<thumb>...</thumb>
<name>something</name>
<role>somethingelse</role>
</actor>

But I don't have the thumb URL in the cast list, so I tried a "url function".

The first attempt, without luck, but I like this idea (maybe this should work in the library via "Set Actor Thumb"):

<actor>
<thumb><url function="ActorLink">...</url></thumb>
<name>something</name>
<role>somethingelse</role>
</actor>

The second attempt, also without luck:
<actor>
<name>something</name>
<role>somethingelse</role>
</actor>
<url function="ActorLink">somethinglink</url>

with function="ActorLink" returning:
<actor>
<name>something</name>
<thumb>...</thumb>
</actor>

I don't know, but maybe I need a matching numerator:

actor$1 -> function="ActorLink$1"
actor$2 -> function="ActorLink$2"

My WIP:
filmweb.xml

smuto
#19
Don't add the actors at that point.

1) Make sure none of the functions you call clear the buffers.
2) Make sure not to destroy the buffer which holds the id when it enters GetDetails (# of htmls + 1).
3) Grab the URL and chain once per actor.
4) Use the id to grab the role from the filmography list.

That should do it, no?
#20
I made a lite version of the scraper for tests:
filmweb_only_actor_test.xml

Output from scrap.exe for "Goodbye Bafana":

details.xml

ActorLink.xml

Why does ActorLink.xml contain only one (the last) entry, when scrap visits all the URLs from details?
#21
Well, it looks like you don't repeat the thumb expression anyway.
#22
scrap will only show you the last outputted XML.

In XBMC the actors will be pushed to a list for each returned XML.
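
A rough Python model of that difference, purely for illustration: every returned XML chunk contributes its <actor> entries to one growing list, whereas looking only at the last chunk shows a single entry. The function name `collect_actors`, the <details> wrapper, and the sample data are assumptions for this sketch; this is not XBMC's actual code.

```python
import xml.etree.ElementTree as ET

def collect_actors(returned_xmls):
    """Accumulate <actor> entries from every returned XML chunk (illustrative model)."""
    actors = []
    for chunk in returned_xmls:
        root = ET.fromstring(chunk)
        for actor in root.iter("actor"):
            actors.append({
                "name": actor.findtext("name"),
                "role": actor.findtext("role"),
                "thumb": actor.findtext("thumb"),
            })
    return actors

# Made-up sample chunks, one per chained call.
chunks = [
    "<details><actor><name>something</name><role>somethingelse</role></actor></details>",
    "<details><actor><name>another</name><thumb>http://example.invalid/a.jpg</thumb></actor></details>",
]

print(len(collect_actors(chunks)))      # all chunks together
print(len(collect_actors(chunks[-1:]))) # only the last chunk, as a single output file would show
```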
#23
Thanks a lot.

filmweb.xml with actor thumbs - 100% working

But it becomes extremely slow: to collect the thumb URLs, the scraper sometimes visits more than 20 pages.

So if someone wants to use it, just grab it from here.

One more question.

I edited the TheTVDB.com scraper to match Polish strings first:
tvdb-pl.xml
I tried to set the encoding to ISO-8859-2 in the scraper, but without success.

The GUI charset in langinfo.xml:
<charsets>
<gui unicodefont="false">CP1250</gui>
<subtitle>CP1250</subtitle>
</charsets>

The Polish XBMC language strings are in "utf-8".
Polish subtitles are mostly in CP1250.

When I change the GUI charset to
<gui>ISO-8859-2</gui>
the tvdb-pl scraper works perfectly.

What is the GUI charset for?
smuto
#24
If the returned XML is not UTF-8, it is assumed to be in the GUI charset and is converted from that to UTF-8.
Is this the best behaviour? Not sure.

As for the scraper being slow, there is not much we can do about that as long as the site is organized as it is...
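
A minimal Python sketch of that fallback, to show why the GUI charset matters for a page served in ISO-8859-2. The function name `to_utf8` and its default are assumptions for this example; it models the described behaviour and is not XBMC's implementation.

```python
def to_utf8(raw, gui_charset="cp1250"):
    """Return raw as UTF-8 bytes; if it is not valid UTF-8, assume gui_charset."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        return raw.decode(gui_charset).encode("utf-8")

# A Polish word as an ISO-8859-2 page would send it.
page_bytes = "mąż".encode("iso8859_2")

print(to_utf8(page_bytes, "iso8859_2").decode("utf-8"))  # GUI charset matches the page
print(to_utf8(page_bytes, "cp1250").decode("utf-8"))     # GUI charset differs from the page
```

In this model, a CP1250 GUI charset misreads the ISO-8859-2 bytes for some Polish letters, which is consistent with the scraper only working once the GUI charset was switched to ISO-8859-2.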
#25
Update: added scraper settings.

One week for tests before we add this to SVN.

filmweb.xml - with settings



I have a problem with the encoding of labels in the scraper file.
Image
Is there a way to add the "Automatically grab actor thumbs" setting to the scraper settings window?

smuto
#26
Just add a setting to the XML: label="Auto Grab Actor Thumbs" id="autograb" type="bool" default="false"

Duplicate your ActorLink and give one copy conditional="autograb" (with thumb),

and give the other copy of ActorLink conditional="!autograb", but do not output <thumb></thumb>.

EDIT: at line 104, pos 239, change &nbsp to &amp;nbsp;
#27
I don't think it fits as a scraper setting. You see, if you do it in the scraper, it means you won't return the URLs at all. The global setting is whether or not to actually grab the thumbs, not whether or not to grab the URLs. It's a small but important difference: if you disable it at scraper level, you cannot grab them manually either. Hence dual settings make sense to me.
#28
I'm trying to update the <NfoUrl>.
Can someone help me?

For now I only use the link with the id:
http://www.filmweb.pl/Film?id=999999

I'm trying to add the link with the movie title to <NfoUrl>:
http://movie.title.filmweb.pl/

This is my WIP:
    <NfoUrl dest="3">
        <RegExp input="$$1" output="http://www.filmweb.pl/Film?id=\1" dest="3">
            <expression noclean="1">Film.id=([0-9]*)</expression>
        </RegExp>
        <RegExp input="$$1" output="http://\1.filmweb.pl" dest="3+">
            <expression noclean="1">http://([^\/]+).filmweb.pl</expression>
        </RegExp>
    </NfoUrl>

But the movie-title regexp matches both URLs.
How can I force the scraper to use the id if it's present?
#29
Easiest solution (I don't have time to analyze the regexps):

Output XML, i.e. <url>theurl</url>

The first url block will take priority.
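
The same "first match wins" idea can be sketched in Python. This is only an analogy to the ordering of url blocks, not scraper XML; the function name `nfo_url` is hypothetical, and the two patterns are tightened variants of the ones in the WIP above.

```python
import re

# Patterns modelled on the WIP above; names are assumptions for this sketch.
ID_URL = re.compile(r"Film.id=([0-9]+)")
TITLE_URL = re.compile(r"http://([^/]+)\.filmweb\.pl")

def nfo_url(nfo_text):
    """Prefer the id-based URL; fall back to the title-based one."""
    match = ID_URL.search(nfo_text)
    if match:
        return "http://www.filmweb.pl/Film?id=" + match.group(1)
    match = TITLE_URL.search(nfo_text)
    if match:
        return "http://" + match.group(1) + ".filmweb.pl"
    return None

print(nfo_url("http://www.filmweb.pl/Film?id=999999"))
print(nfo_url("http://movie.title.filmweb.pl/"))
```

The title pattern on its own also matches the id link (it captures "www"), so the order of the checks is what keeps the id form in front.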
#30
Thanks a lot, it's working.

Added as a patch to SVN.
