
guranbanan · 2014-06-14, 04:23

Hi!

It's past 4 AM in Sweden, where I live, so I really need to go to sleep, but before I do, I'll post a quick question.

I'm in the middle of making my own scraper in XBMC and a new (one of many) problem just occurred to me:

The genre tags sometimes consist of non-ASCII characters such as å, ä & ö. Unfortunately they are HTML-escaped. So, for example, the word 'räksmörgås' becomes 'r&auml;ksm&ouml;rg&aring;s'.

Is it possible to do a simple string replace of some sort inside the xml code?

Thanks, and good night!
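
For readers following along outside XBMC, here is a minimal Python sketch of the transformation being asked for: turning named HTML entities back into the characters they stand for. It is only an illustration of the desired result, not how the scraper engine does it; the variable names are made up for the example.

```python
import html

# The escaped form as it appears in the scraped page source.
escaped = "r&auml;ksm&ouml;rg&aring;s"

# html.unescape replaces named (and numeric) HTML entities
# with the corresponding Unicode characters.
decoded = html.unescape(escaped)

print(decoded)
```

Inside a scraper XML file the equivalent has to be done by the scraper engine itself, which is what the replies below address.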

guranbanan · 2014-06-14, 12:28

Just woke up and had an idea of how one might approach this. Haven't got the time to test right now but I do have the time to just post it and see if someone might have a better idea...

I guess you could nest multiple RegExes and make something like this:

Code:
<RegExp input="$$20" output="\1ä" dest="19">
     <RegExp input="$$1" output="\1å" dest="20">
          <expression repeat="yes">([a-zA-Z]+)&amp;aring;</expression>
     </RegExp>
     <expression repeat="yes">([a-zA-Z]+)&amp;auml;</expression>
</RegExp>

etc...

But maybe there exists a better approach...?

**mkortstiege** · 2014-06-14, 12:51

You might want to use something like "<expression repeat="yes" fixchars="1">..." on your expressions.
That should do the trick. Please let me know if it works for you.

guranbanan · 2014-06-14, 13:03

It appears to be exactly what I want! For some reason, though, it interprets "Ä" as "Ã,," :S but I guess it's an encoding error somewhere. The whole site is encoded in latin-1 and I have also set my scraper XML encoding to that, but maybe XBMC prefers UTF-8?
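
A symptom like "Ã,," is what typically shows up when the UTF-8 bytes of "Ä" are read back as a single-byte encoding. The Python sketch below reproduces that effect. It is an assumption about what is happening here (that the decoded entity is emitted as UTF-8 while the surrounding document is declared as latin-1), not something confirmed in this thread.

```python
original = "Ä"

# "Ä" (U+00C4) is two bytes in UTF-8: 0xC3 0x84.
utf8_bytes = original.encode("utf-8")

# Reading those two bytes as latin-1 gives "Ã" plus an
# invisible control character (U+0084).
as_latin1 = utf8_bytes.decode("latin-1")

# Reading them as Windows-1252 gives "Ã" plus a low double
# quotation mark, which looks like two commas.
as_cp1252 = utf8_bytes.decode("cp1252")

print(repr(as_latin1))
print(as_cp1252)
```

If the output here matches what shows up in the XBMC library, the problem is a mismatch between the declared encoding and the bytes actually produced, rather than a fault in the entity conversion itself.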

**mkortstiege** · (This post was last modified: 2014-06-14, 13:19 by mkortstiege.)

I guess you want ISO-8859-1 ?

EDIT: nvm, misread it. XBMC always prefers UTF-8.

guranbanan · (This post was last modified: 2014-06-14, 13:30 by guranbanan.)

Yes, and I have the RegExp output in the <GetSearchResults> tag set to encoding ISO-8859-1 and it works for all characters except the HTML entity characters, so it shouldn't really be an encoding issue, should it?

Code:
<RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">

BTW, off topic, but I just tried a (.*?) pattern and it seemed to work. I guess laziness IS supported by the regex engine after all (hooray!).

[EDIT] Oh, sorry, missed your latest reply. I actually tested changing the main encoding to UTF-8 but it didn't help...

guranbanan · 2014-06-14, 14:37

I managed to solve it!

Specifying SearchStringEncoding="iso-8859-1" in the <CreateSearchString> tag did the trick. From what I've read, it works together with the fixchars option.

Thanks!
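
As a rough illustration of why a search-string encoding setting matters for a latin-1 site, the Python sketch below percent-encodes the same search term under both encodings. This only shows the general difference between the two encodings in a URL; how XBMC applies SearchStringEncoding internally, and how it interacts with fixchars, is not shown in this thread and is not modeled here.

```python
from urllib.parse import quote

term = "räksmörgås"

# Percent-encode the term as a latin-1 site would expect it:
# one byte per accented character.
latin1_query = quote(term, encoding="iso-8859-1")

# The same term percent-encoded as UTF-8: two bytes per
# accented character.
utf8_query = quote(term, encoding="utf-8")

print(latin1_query)  # r%E4ksm%F6rg%E5s
print(utf8_query)    # r%C3%A4ksm%C3%B6rg%C3%A5s
```

A site that expects the first form will generally not recognise the second, which is why the scraper and the site need to agree on one encoding.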