Help needed with a row problem
#1
My question is: how do I write a regex for the whitespace in the HTML code?

I am trying to write a scraper for "www.beyazperde.mynet.com".

The "create search" and "get search results" parts are OK.
In the "get details" section, the director name is spread over 3 rows:
Code:
<!-- YONETMEN -->
                      <br><span class="itembaslik">Yönetmen : </span>
<a href="/kisi/27214" class=turunculine_11_px>[b]Ben Verbong[/b]</a>
If I use
Code:
<a href="/kisi/([0-9]*)" class=turunculine_11_px>(.[^<]*)</a>
it returns several matches (because there are several similar lines in the HTML).

I tried to begin with
Code:
<\!\-\- YONETMEN \-\->
This matches "<!-- YONETMEN -->",
but when I add the rest of the code
Code:
<\!\-\- YONETMEN \-\-> <br><span class=..........

it matches nothing, because the code with <b> is on a lower row in the HTML.
The "<br>" is on a lower row too, so I think I have to use something like "&nbsp;" for the gap.
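Editor's note: the gap between the rows is a newline plus indentation, which a single literal space in the pattern cannot match. Below is a minimal sketch of that effect in Python's `re` module, used here only for illustration; the scraper's own regex engine may differ in details. The fragment is the one quoted above with the [b] markers left out, and the patterns are written un-escaped rather than in scraper XML syntax.

```python
import re

# HTML fragment from the post above: the comment, the <br> and the link
# sit on separate rows with indentation in between.
html = (
    '<!-- YONETMEN -->\n'
    '                      <br><span class="itembaslik">Yönetmen : </span>\n'
    '<a href="/kisi/27214" class=turunculine_11_px>Ben Verbong</a>'
)

# A single literal space cannot bridge the newline plus indentation.
literal = re.search(r'<!-- YONETMEN --> <br>', html)

# \s* matches any run of whitespace, including line breaks.
flexible = re.search(
    r'<!-- YONETMEN -->\s*<br><span class="itembaslik">[^<]*</span>\s*'
    r'<a href="/kisi/([0-9]+)" class=turunculine_11_px>([^<]+)</a>',
    html,
)

print(literal)            # None
print(flexible.group(2))  # Ben Verbong
```

In scraper XML the same idea would mean putting `\s*` wherever the HTML breaks onto a new row, with `<`, `>` and `"` escaped as in the file below.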


Here is my XML. Would you help me with this puzzle?

Code:
<?xml version="1.0" encoding="iso-8859-9"?>
<scraper framework="10" date="2010-04-16" name="beyazperde" content="movies" thumb="logo_beyazperde.JPG" language="tr">
    <NfoUrl dest="3">
        <RegExp input="$$1" output="&lt;url&gt;http://beyazperde.mynet.com/hizliarama.asp?keyword=\1&lt;/url&gt;" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl SearchStringEncoding="iso-8859-9" dest="3">
        <RegExp input="$$1" output="http://beyazperde.mynet.com/hizliarama.asp?keyword=\1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;url&gt;http://beyazperde.mynet.com/film/\1/arama/\2&lt;/url&gt;&lt;id&gt;\2&lt;/id&gt;&lt;/entity&gt; \n" dest="5">
                <expression repeat="yes" encode="1,2,3,4">&lt;a href=&quot;http://beyazperde.mynet.com/film/([0-9]*)/arama/(.[^&lt;]*)&quot; class=&quot;turuncucizgisiz_11_px&quot;&gt;&lt;b&gt;(.[^&lt;]*)&lt;/b&gt; \(([0-9]*)\)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <!--Title-->
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="5">
                <expression trim="1" noclean="1">h1 class=&quot;baslik_filmadi31&quot;&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <!--Year Film-->
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="5+">
                <expression>class=turunculine_11_px&gt;([0-9]*)&lt;/a&gt;, </expression>
            </RegExp>
            <!--Director-->
            <RegExp input="$$1" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
                <expression>&lt;a href=&quot;/kisi/([0-9]*)&quot; class=turunculine_11_px&gt;(.[^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>
            <!--Runtime Film-->
            <RegExp input="$$1" output="&lt;runtime&gt;\1\2\3&lt;/runtime&gt;" dest="5+">
                <expression>&lt;a class=item href=&quot;\/arama.asp\?kat=vizyon&amp;keyword=([0-9]*).([0-9]*).([0-9]*)</expression>
            </RegExp>
            <!--Genre Film-->
            <RegExp input="$$6" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="5+">
                <RegExp input="$$1" output="\2" dest="6">
                    <expression> &lt;a href=&quot;\/arama.asp\?kat=tur&amp;keyword=([0-9]*)&quot; class=turunculine_11_px&gt;(.[^&lt;]*)</expression>
                </RegExp>
                <RegExp input="$$1" output="\2" dest="6">
                    <expression>href=&quot;\/arama.asp\?kat=alttur&amp;keyword=([0-9]*)&quot; class=turunculine_11_px&gt;(.[^&lt;]*)</expression>
                </RegExp>
                <expression repeat="yes"/>
            </RegExp>
            <!--Thumbnail-->
            <RegExp input="$$1" output="&lt;thumb&gt;http://beyazperde.mynet.com/images/film/\1-\2.jpg&lt;/thumb&gt;" dest="5+">
                <expression noclean="1">src=&quot;\/images\/film\/([0-9]*)-(.[^&lt;]*)([0-9]*).jpg&quot; width=&quot;150&quot; height=&quot;200&quot;&gt;</expression>
            </RegExp>
            <!--Actors-->
            <RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;/actor&gt;" dest="5+">
                <expression repeat="yes">&lt;a href=&quot;\/kisi/([0-9]*)&quot; class=turunculine_11_px style=&quot;line-height:15px;&quot;&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>
</scraper>
#2
it's a bit hard to grasp what you're aiming at, in particular without a link to an example html.

a general tip would be to split your expression in two parts.
first grab the relevant part of the html into a temporary buffer. then loop over that to grab the actors or whatever YONETMEN's are ;)
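Editor's note: the two-part approach described above can be sketched in Python's `re` module (an illustration only, not the scraper engine). The function name `extract_directors` and the extra "Someone Else" link are made up for the example; the other HTML rows follow the fragment posted later in the thread. The closing `</a>` is kept inside the captured group so that the second pattern can still find it in the buffer.

```python
import re

def extract_directors(html):
    """Two-step extraction: isolate the director block, then scan only that block."""
    # Step 1: copy everything from the marker comment through the first </a>
    # into a temporary buffer. DOTALL lets '.' cross line breaks.
    block = re.search(r'<!-- YONETMEN -->(.*?</a>)', html, re.DOTALL)
    if block is None:
        return []
    buffer = block.group(1)
    # Step 2: loop over the buffer only, so similar links elsewhere on the
    # page cannot match.
    return re.findall(r'<a[^>]*>([^<]*)</a>', buffer)

sample = (
    '<a class=item href="/arama.asp?kat=vizyon&keyword=22.01.2010">22&nbsp;Ocak&nbsp;2010</a>\n'
    '                      <!-- YONETMEN -->\n'
    '                      <br><span class="itembaslik">Yönetmen : </span>\n'
    '<a href="/kisi/3916" class=turunculine_11_px>Uğur Yücel</a>\n'
    # Hypothetical extra link, standing in for the similar lines elsewhere on the page.
    '<a href="/kisi/1" class=turunculine_11_px>Someone Else</a>\n'
)

print(extract_directors(sample))  # ['Uğur Yücel']
```

Running the second pattern on the whole page instead of the buffer would also return the date link and the extra link, which is the "several matches" problem from the first post.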
#3
spiff Wrote:it's a bit hard to grasp what you're aiming at, in particular without a link to an example html.

a general tip would be to split your expression in two parts.
first grab the relevant part of the html into a temporary buffer. then loop over that to grab the actors or whatever YONETMEN's are ;)

Here is part of the HTML:

Code:
<!-- GÖSTERİM TARİHİ -->
                                            <span class="itembaslik"><br>
                      Gösterim Tarihi :</span>
                      <a class=item href="/arama.asp?kat=vizyon&keyword=22.01.2010">22&nbsp;Ocak&nbsp;2010</a>
                        
                      <!-- YONETMEN -->
                      <br><span class="itembaslik">Yönetmen : </span>
<a href="/kisi/3916" class=turunculine_11_px>Uğur Yücel</a>

All pages are laid out like this:
1. row: the title comment, "<!-- GÖSTERİM TARİHİ -->" or "<!-- YONETMEN -->" (release date, director)
2. row: ....just code
3. row: "<a class=...........=22.01.2010">22&nbsp;Ocak&nbsp;2010</a>" (this is the data I need)
But I have to use the 1st row as the filter.
I'm not good at this job. I try to read, understand and make something. Could you help prepare a regex for me and a lot of Turkish people?
#4
first; are there multiple such blocks or just one?

if it's just one this should do (note i write in xml syntax, not editor syntax - you need to unescape the html/xml chars);

Code:
<RegExp input="$$1" output="\1" dest="4">
  <expression noclean="1">YONETMEN --&gt;(.*?&lt;/a&gt;)</expression>
</RegExp>
<RegExp input="$$4" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
  <expression>&lt;a[^&gt;]*&gt;([^&lt;]*)&lt;/a&gt;</expression>
</RegExp>

to grab release year (we don't support full dates) we do much the same;
Code:
<RegExp input="$$1" output="\1" dest="4">
  <expression noclean="1">GÖSTERİM TARİHİ --&gt;(.*?&lt;/a&gt;)</expression>
</RegExp>
<RegExp input="$$4" output="&lt;year&gt;\1&lt;/year&gt;" dest="5+">
  <expression>&lt;a[^&gt;]*&gt;.*?([0-9]+)&lt;/a&gt;</expression>
</RegExp>
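Editor's note: the year pattern relies on the closing `</a>` to pick the last run of digits in the link text. A small Python `re` sketch of that behaviour follows (illustration only; the scraper engine may differ). It assumes the buffer holds the date link from the HTML posted above, including its closing `</a>`.

```python
import re

# Assumed buffer content: the release date link from the HTML in post #3.
buffer = (
    '<a class=item href="/arama.asp?kat=vizyon&keyword=22.01.2010">'
    '22&nbsp;Ocak&nbsp;2010</a>'
)

# With </a> right after the digits, only the last number before the closing
# tag can satisfy the pattern, so the lazy .*? has to grow up to the year.
year = re.search(r'<a[^>]*>.*?([0-9]+)</a>', buffer).group(1)

# Without the closing tag the lazy .*? stays empty and the first digits win.
first = re.search(r'<a[^>]*>.*?([0-9]+)', buffer).group(1)

print(year)   # 2010
print(first)  # 22
```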
#5
Thanks for the quick reply, I'll try it now.
#6
Thanks for the quick reply and explanation.
It works... I also learned how to split a RegExp, and the (.*?) command to select all text up to a known piece of code... thanks for everything.
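Editor's note: the reason `(.*?)` stops at the first known piece of code is that it is lazy, while plain `(.*)` is greedy and runs to the last one. A short Python `re` sketch of the difference, on a made-up single-line fragment (the second link and its name are hypothetical):

```python
import re

# Made-up fragment with two links after the marker comment.
html = (
    '<!-- YONETMEN --><a href="/kisi/3916">Uğur Yücel</a> '
    '<a href="/kisi/1">Another Name</a>'
)

# Lazy: stops at the first </a> after the marker.
lazy = re.search(r'YONETMEN -->(.*?)</a>', html).group(1)

# Greedy: runs on to the last </a> it can find.
greedy = re.search(r'YONETMEN -->(.*)</a>', html).group(1)

print(lazy)    # <a href="/kisi/3916">Uğur Yücel
print(greedy)  # <a href="/kisi/3916">Uğur Yücel</a> <a href="/kisi/1">Another Name
```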

- How can I use 2 movie names for the search, first the English one and then the Turkish one?
[solved] - How can I remove the empty space in front of the grabbed info (plot)? (just click trim)


I love this game (XBMC)