"Decode" UTF-8 search result (google ajax api) and make union o two search url
#1
I would like to use the deprecated but very fast Google Search AJAX API to search for films on mymovies.it, with a URL similar to this:
https://ajax.googleapis.com/ajax/service...o%20Aliens

The server's answer is a JSON structure with many fields, and I have a problem with the URLs it returns:
  1. "unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\u003d744"
  2. "url":"http://www.mymovies.it/dizionario/recensione.asp%3Fid%3D744"

I think the first field (unescapedUrl) is a UTF-8 encoded string, and the "url" field uses ASCII character encoding.
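
For reference, the two fields appear to differ in escaping scheme rather than in character encoding: `\u003d` is a JSON string escape for "=", while `%3F` and `%3D` are URL percent-encodings of "?" and "=". A minimal Python sketch (standard library only, run outside XBMC) showing that both fields decode to the same URL. The JSON fragment is assembled from the two fields quoted above; it is not the full API response.

```python
import json
from urllib.parse import unquote

# JSON fragment assembled from the two fields quoted above (not the full response).
raw = (
    '{"unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\\u003d744",'
    '"url":"http://www.mymovies.it/dizionario/recensione.asp%3Fid%3D744"}'
)

result = json.loads(raw)               # the JSON parser resolves \u003d to "="
from_unescaped = result["unescapedUrl"]
from_url = unquote(result["url"])      # percent-decoding resolves %3F and %3D

print(from_unescaped)
print(from_url)
```

Both prints should show the same plain URL, which suggests either field is usable once the matching decoding step is applied.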

When I insert either of them into the XML result structure:
Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><results><entity><title>\2 \3</title><url>URL or UNESCAPEDURL</url><id>\1</id></entity></results>

XBMC can't open the URL in its "encoded state". I tried "fixchars" and SearchStringEncoding="UTF-8", with no result.

Another question: how can I merge the results of two searches?
I tried appending two url tags to the buffer in CreateSearchUrl, but only one is resolved.

I hope my poor English is clear enough.

Thanks!
#2
I'd like to help you here.
If the results are the same (same title, id and url), XBMC will automatically keep only one of them.
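
To illustrate the idea of collapsing identical results, here is a small Python sketch that merges two result lists and keeps one entry per (title, id, url) combination. This is only an illustration of the behaviour described above, not XBMC's actual implementation; `merge_results` and the sample entries (titles and the second id) are hypothetical.

```python
def merge_results(first, second):
    """Merge two result lists, keeping the first occurrence of each (title, id, url)."""
    seen = set()
    merged = []
    for entry in first + second:
        key = (entry["title"], entry["id"], entry["url"])
        if key not in seen:
            seen.add(key)
            merged.append(entry)
    return merged

# Hypothetical sample data: the same film returned by both searches, plus one extra.
base = "http://www.mymovies.it/dizionario/recensione.asp?id="
search_a = [{"title": "Film A", "id": "744", "url": base + "744"}]
search_b = [
    {"title": "Film A", "id": "744", "url": base + "744"},
    {"title": "Film B", "id": "1", "url": base + "1"},
]

merged = merge_results(search_a, search_b)
print([entry["id"] for entry in merged])
```
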
The url problem is not really an encoding problem but just a matter of what part of the url you consider:
From this url:
http://www.mymovies.it/dizionario/recens...3Fid%3D744
In your regular expression you should take only this part:
http://www.mymovies.it/dizionario/recensione.asp%3Fid%3
Replace the "D" with "=" and append the id "744".

The same for the unescaped url
"unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\u003d744"
consider only "unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id
and add the rest "=744"

I hope this helps.
Ciao
XBMC Italian translator, Movieplayer.it scrapers developer and the old "The Orbs" skin creator.
#3
Ciao kotix,

I don't fully understand your answer:

Quote:If the results are the same (same title, id and url) xbmc will automatically get only the one result.

Are you talking about merging the search URLs?

Quote:The url problem is not really an encoding problem but just a matter of what part of the url you consider:
From this url:
http://www.mymovies.it/dizionario/re...asp%3Fid%3D744
In your regular expression you should take only this part:
http://www.mymovies.it/dizionario/recensione.asp%3Fid%3
replace the "D" with a "=" and add at the end the id "744"

The same for the unescaped url
"unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\u003d744"
consider only "unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id
and add the rest "=744"

If I understand correctly, I have to "convert" the URL:

http://www.mymovies.it/dizionario/re...asp%3Fid%3D744 => http://www.mymovies.it/dizionario/re...asp%3Fid%3=744

But isn't %3D the "=" character (its ASCII code in hex)?
Do I have to replace the whole %3D with "="?

The same goes for \u003d, but in UTF-8 encoding.

I know I can do a "manual" substitution for the characters =, ? and &, but the problem appears again in the title and film description with special characters (', ", etc.).

thanks!!

Ciao!
#4
Sorry, I don't know what kind of encoding that is, but this is how I would do it, considering only the unescapedUrl results:

Code:
        <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2 \3&lt;/title&gt;&lt;url&gt;http://www.mymovies.it/dizionario/recensione.asp?id=\1&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;&lt;/entity&gt;" dest="7">
            <expression repeat="yes">http://www.mymovies.it/dizionario/recensione\.asp\?[^\d]*003d([^\&quot;]*)[^\|]*\| MYmovies&quot;,&quot;titleNoFormatting&quot;:&quot;([^\|]*) \| MYmovies&quot;,&quot;content&quot;:&quot;([^\&quot;]*)&quot;}</expression>
        </RegExp>


Capture 1 is the Id
Capture 2 is the title
Capture 3 is the description, but the match is a little dirty there because that part is never the same.
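
The same capture-the-id idea can be tried outside XBMC. A minimal Python sketch, assuming the raw response text still contains the literal `\u003d` sequence; the pattern here is a simplified stand-in written for this example, not the scraper expression above.

```python
import re

# Raw response text before any decoding; \u003d is still a literal 6-character sequence.
raw = r'"unescapedUrl":"http://www.mymovies.it/dizionario/recensione.asp?id\u003d744"'

# Match the literal backslash-u003d sequence and capture the digits that follow it.
pattern = re.compile(r'recensione\.asp\?id\\u003d(\d+)')

match = pattern.search(raw)
movie_id = match.group(1)

# Rebuild a clean URL from the captured id, adding the "=" back by hand.
url = "http://www.mymovies.it/dizionario/recensione.asp?id=" + movie_id
print(movie_id, url)
```

Because the escape sequence is consumed by the pattern and not captured, the "=" has to be written explicitly when the URL is rebuilt.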

I hope I understood what you need.
Ciaooo.
#5
Mhhh... look at your results in description/content/capture 3...

The whole Google search result is encoded, as you can see:
Code:
"content":"Un film di Rob Letterman, Conrad Vernon con Reese Witherspoon, Seth Rogen, Hugh   Laurie, Will Arnett. Ultimatum al cinema bidimensionale e contatto con \u003cb\u003e...\u003c/b\u003e"
or
Code:
"titleNoFormatting":"Highlander - L\u0026#39;ultimo immortale (1986) | MYmovies"

But these are only some examples: any "strange" character in the title or in the content is replaced with its UTF-8 encoding.

And I can't write a regex replace for every encoded character!
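
Outside XBMC, these strings can be cleaned up without one regex per character, because they seem to be plain JSON escapes wrapped around HTML entities and tags. A hedged Python sketch using only the standard library; the input is assembled from the two samples above (the content is shortened to its last sentence), and whether XBMC's scraper engine offers an equivalent step is not something this sketch answers.

```python
import html
import json
import re

# Assembled from the two samples quoted above; content shortened to its last sentence.
raw = (
    '{"titleNoFormatting":"Highlander - L\\u0026#39;ultimo immortale (1986) | MYmovies",'
    '"content":"Ultimatum al cinema bidimensionale e contatto con \\u003cb\\u003e...\\u003c/b\\u003e"}'
)

data = json.loads(raw)                            # \u0026 -> "&", \u003c -> "<", \u003e -> ">"
title = html.unescape(data["titleNoFormatting"])  # "&#39;" -> "'"
content = re.sub(r"<[^>]+>", "", html.unescape(data["content"]))  # drop the <b> tags

print(title)
print(content)
```

The order matters: the JSON escapes have to be resolved first, since `\u0026#39;` only becomes the entity `&#39;` after that step.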

I think XBMC must have some setting for the character encoding, just as I can put this at the head of an XML file:
Code:
<?xml version="1.0" encoding="UTF-8"?>
or
Code:
<?xml version="1.0" encoding="ISO-8859-1"?>

Booohhhhh

Ciao!!
#6
OK, I understand now what you mean, and I know that in the expression you can specify
<expression encode="1">whatever</expression>
but I can't remember exactly what it does. Maybe that is what you need to clean things up.
Ciauz.
#7
Quote:<expression encode="1">whatever</expression>
but i can't remember exactly what it does, maybe that is what you need to clean the things up.

Exactly the opposite.

encode transforms plain URL characters from:

Code:
www.domain.it/path/query?id=what i search

to

Code:
www.domain.it/path/query?id=what%20i%20search
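
The direction described here (plain text to percent-encoded) and the reverse one the scraper would need can be compared in a short Python sketch. It only illustrates the two directions with the standard library; it makes no claim about how XBMC implements `encode`.

```python
from urllib.parse import quote, unquote

plain = "www.domain.it/path/query?id=what i search"

# Encoding: spaces become %20; "/", "?" and "=" are kept as-is via the safe list.
encoded = quote(plain, safe="/?=")

# Decoding is the reverse step, which is what the search results would need.
decoded = unquote(encoded)

print(encoded)
print(decoded)
```
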

Well, if there is no solution for "search result union" and "UTF-8 string decoding", the scraper will just be slower, and not by much.

Anyway, thank you kotix!