
muttley:bd · (This post was last modified: 2011-03-06, 15:31 by muttley:bd.)

I would like to use the Google Search AJAX API (deprecated, but very fast) to search for films on mymovies.it, with a URL similar to this one:
https://ajax.googleapis.com/ajax/service...o%20Aliens

The server's answer is a JSON structure with many fields, and I have a problem with the URLs it returns:

"unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\u003d744"
"url":"http://www.mymovies.it/dizionario/recensione.asp%3Fid%3D744"

I think the first field (unescapedUrl) is a UTF-8 encoded string, while the "url" field uses ASCII character encoding.
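For reference, the two fields seem to use two different escaping schemes rather than two character encodings: `\u003d` is a JSON Unicode escape for `=`, while `%3F` and `%3D` are URL percent-encodings of `?` and `=`. A minimal Python sketch, run outside the scraper and assuming the response fragment looks exactly like the two lines quoted above, shows that both decode to the same URL:

```python
import json
from urllib.parse import unquote

# The two fields as they appear in the raw JSON text quoted in the post.
# "\\u003d" in this Python literal is a backslash followed by "u003d".
raw = (
    '{"unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\\u003d744",'
    '"url":"http://www.mymovies.it/dizionario/recensione.asp%3Fid%3D744"}'
)

result = json.loads(raw)                # the JSON parser turns \u003d into "="
from_unescaped = result["unescapedUrl"]
from_url = unquote(result["url"])       # percent-decoding turns %3F into "?" and %3D into "="

print(from_unescaped)
print(from_url)
```

This only illustrates what the escapes mean; it does not show how an XBMC scraper itself would decode them.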

Well, when I insert it into the XML result structure:

Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><results><entity><title>\2 \3</title><url>URL or UNESCAPEDURL</url><id>\1</id></entity></results>

XBMC can't open the URL in its encoded state. I tried "fixchars" and SearchStringEncoding="UTF-8", with no result.

Another question: how can I merge the results of two searches?
I tried appending two url tags to the buffer in CreateSearchUrl, but only one is resolved.
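To make the goal concrete, here is a rough Python sketch of what merging two result sets could look like outside the scraper. The function name `merge_results` and the sample entries are hypothetical and only serve as illustration; this is not how XBMC handles it internally.

```python
def merge_results(*result_lists):
    """Merge several result lists, keeping the first copy of each distinct entry."""
    seen = set()
    merged = []
    for results in result_lists:
        for entry in results:
            # Two entries count as the same result when title, id and url all match.
            key = (entry["title"], entry["id"], entry["url"])
            if key not in seen:
                seen.add(key)
                merged.append(entry)
    return merged


# Hypothetical sample data standing in for the results of two searches.
first = [
    {"title": "Film A", "id": "1", "url": "http://example.com/film?id=1"},
    {"title": "Film B", "id": "2", "url": "http://example.com/film?id=2"},
]
second = [
    {"title": "Film B", "id": "2", "url": "http://example.com/film?id=2"},
    {"title": "Film C", "id": "3", "url": "http://example.com/film?id=3"},
]

merged = merge_results(first, second)
print([entry["title"] for entry in merged])
```

The order of the first list is preserved, and entries from the second list are appended only when they are new.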

I hope my poor English is clear enough.

thanks!

KoTiX · (This post was last modified: 2011-03-08, 00:21 by KoTiX.)

I'd like to help you here.

If the results are identical (same title, id and url), XBMC will automatically keep only one of them.
The URL problem is not really an encoding problem, just a matter of which part of the URL you capture:
From this url:
http://www.mymovies.it/dizionario/recens...3Fid%3D744
In your regular expression you should take only this part:
http://www.mymovies.it/dizionario/recensione.asp%3Fid%3
then replace the "D" with "=" and append the id "744" at the end.

The same for the unescaped url
"unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\u003d744"
consider only "unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id
and add the rest "=744"

I hope this helps.
Ciao

muttley:bd · 2011-03-08, 11:29

Ciao kotix,

I don't fully understand your answer:

Quote:If the results are identical (same title, id and url), XBMC will automatically keep only one of them.

Are you talking about merging the search URLs?

Quote:The URL problem is not really an encoding problem, just a matter of which part of the URL you capture:
From this url:
http://www.mymovies.it/dizionario/re...asp%3Fid%3D744
In your regular expression you should take only this part:
http://www.mymovies.it/dizionario/recensione.asp%3Fid%3
then replace the "D" with "=" and append the id "744" at the end.

The same for the unescaped url
"unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\u003d744"
consider only "unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id
and add the rest "=744"

If I understand correctly, I have to "convert" the URL:

http://www.mymovies.it/dizionario/re...asp%3Fid%3D744 => http://www.mymovies.it/dizionario/re...asp%3Fid%3=744

But isn't %3D the = character (its ASCII code in hex)?
Do I have to replace the whole %3D with =?

The same goes for \u003d, but in UTF-8 encoding.

I know I can do a "manual" substitution for the characters =, ? and &, but the problem appears again in the title and film description with strange characters (', ", etc.).

thanks!!

Ciao!

KoTiX · (This post was last modified: 2011-03-09, 00:50 by KoTiX.)

Sorry, I don't know what kind of encoding that is, but this is how I would do it, considering only the unescapedUrl results:

Code:
        <!-- Capture 1: film id, capture 2: title, capture 3: description -->
        <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2 \3&lt;/title&gt;&lt;url&gt;http://www.mymovies.it/dizionario/recensione.asp?id=\1&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;&lt;/entity&gt;" dest="7">
            <expression repeat="yes">http://www\.mymovies\.it/dizionario/recensione\.asp\?[^\d]*003d([^\&quot;]*)[^\|]*\| MYmovies&quot;,&quot;titleNoFormatting&quot;:&quot;([^\|]*) \| MYmovies&quot;,&quot;content&quot;:&quot;([^\&quot;]*)&quot;}</expression>
        </RegExp>

Capture 1 is the Id
Capture 2 is the title
Capture 3 is the description, but the text there is a bit messy and never follows the same pattern.
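The id capture can be tried out in isolation. The following Python sketch is only an approximation of the scraper expression above, applied to the single unescapedUrl line quoted earlier in the thread; the simplified pattern and the variable names are assumptions made for illustration.

```python
import re

# Raw JSON text as returned by the API: a literal backslash followed by "u003d".
raw = '"unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\\u003d744"'

# Escape the dots and the question mark, match the literal "\u003d",
# then capture the digits that follow as the film id.
pattern = re.compile(r'recensione\.asp\?id\\u003d(\d+)')

match = pattern.search(raw)
movie_id = match.group(1)

# Rebuild a clean URL with a real "=" in front of the id.
url = "http://www.mymovies.it/dizionario/recensione.asp?id=" + movie_id
print(movie_id, url)
```

Capturing only the digits sidesteps the escape entirely, because the `=` is written back by hand when the URL is rebuilt.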

I hope I understood what you need.

Ciaooo.

muttley:bd · 2011-03-09, 10:42

Hmm... look at your results in the description/content (capture 3).

The whole Google search result is encoded, and you can see:

Code:
"content":"Un film di Rob Letterman, Conrad Vernon con Reese Witherspoon, Seth Rogen, Hugh   Laurie, Will Arnett. Ultimatum al cinema bidimensionale e contatto con \u003cb\u003e...\u003c/b\u003e"

or

Code:
"titleNoFormatting":"Highlander - L\u0026#39;ultimo immortale (1986) | MYmovies"

These are only a few examples: any "strange" character in the title or in the content gets replaced with its UTF-8 encoding.

And I can't write a regex replacement for every encoded character!
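Outside the scraper, a general-purpose decoder handles all of these at once, so no per-character replacement is needed. The Python sketch below uses the title quoted above and a shortened version of the content string; it assumes the text is valid JSON and that the remaining `&#39;`-style sequences are HTML entities. Whether XBMC's scraper engine offers an equivalent is not shown here.

```python
import html
import json
import re

# Title as quoted in the post; content shortened from the quoted example.
# A raw string keeps the \uXXXX escapes exactly as the API returns them.
raw = (
    r'{"titleNoFormatting":"Highlander - L\u0026#39;ultimo immortale (1986) | MYmovies",'
    r'"content":"contatto con \u003cb\u003e...\u003c/b\u003e"}'
)

data = json.loads(raw)                            # \u0026 -> "&", \u003c -> "<", \u003e -> ">"

title = html.unescape(data["titleNoFormatting"])  # "&#39;" -> "'"
content = re.sub(r"<[^>]+>", "", data["content"]) # drop the <b>...</b> markup
content = html.unescape(content)

print(title)
print(content)
```

Decoding happens in two steps because the data is escaped twice: once as JSON and once as HTML.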

I think XBMC must have some setting for the character encoding, just as I can put this in the header of an XML file:

Code:
<?xml version="1.0" encoding="UTF-8"?>

or

Code:
<?xml version="1.0" encoding="ISO-8859-1"?>

booohhhhh

Ciao!!

KoTiX · 2011-03-09, 18:25

OK, now I understand what you mean. I know that in the expression you can specify
<expression encode="1">whatever</expression>
but I can't remember exactly what it does. Maybe that is what you need to clean things up.
Ciauz.

muttley:bd · 2011-03-09, 18:38

Quote:<expression encode="1">whatever</expression>
but I can't remember exactly what it does. Maybe that is what you need to clean things up.

It does exactly the opposite.

It encodes (transforms) simple URL characters, from:

Code:
www.domain.it/path/query?id=what i search

to

Code:
www.domain.it/path/query?id=what%20i%20search
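The same transformation can be reproduced in Python with `urllib.parse.quote`. This is only an analogy for what the post describes, not the scraper's own implementation:

```python
from urllib.parse import quote

query = "what i search"

# quote() percent-encodes characters that are not URL-safe; a space becomes %20.
encoded = "www.domain.it/path/query?id=" + quote(query)
print(encoded)
```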

Well, if there is no solution for merging search results and decoding UTF-8 strings, the scraper will just be a bit slower, and not by much.

Anyway, thank you KoTiX!