Unescaping HTML entities
#1
Hi!

It's past 4 AM in Sweden, where I live, so I really need to go to sleep, but before I do, I'll post a quick question.

I'm in the middle of making my own scraper in XBMC and a new (one of many) problem just occurred to me:

The genre tags sometimes consist of non-ASCII characters such as å, ä and ö. Unfortunately they are HTML-escaped. So, for example, the word 'räksmörgås' becomes 'r&auml;ksm&ouml;rg&aring;s'.

Is it possible to do a simple string replace of some sort inside the xml code?

Thanks, and good night!
#2
Just woke up and had an idea of how one might approach this. Haven't got the time to test right now but I do have the time to just post it and see if someone might have a better idea...

I guess you could nest multiple RegExes and make something like this:

Code:
<RegExp input="$$20" output="\1ä" dest="19">
     <RegExp input="$$1" output="\1å" dest="20">
          <expression repeat="yes">([a-zA-Z]+)&amp;aring;</expression>
     </RegExp>
     <expression repeat="yes">([a-zA-Z]+)&amp;auml;</expression>
</RegExp>
etc...

But maybe there exists a better approach...?
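
For comparison, here is a rough sketch of the same idea (swap each entity for its character) outside the scraper XML, in Python. It is only an illustration and not something the XBMC scraper engine runs; `ENTITY_MAP` and `unescape_entities` are made-up names, and the table covers just the Swedish letters mentioned in this thread. The standard library's `html.unescape` handles the full entity list.

```python
import html
import re

# Hypothetical lookup table: only the Swedish letters discussed in this thread.
ENTITY_MAP = {
    "aring": "å", "auml": "ä", "ouml": "ö",
    "Aring": "Å", "Auml": "Ä", "Ouml": "Ö",
}

def unescape_entities(text):
    """Replace known named entities; leave unknown ones untouched."""
    def replace(match):
        # group(1) is the entity name without '&' and ';'
        return ENTITY_MAP.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z]+);", replace, text)

escaped = "r&auml;ksm&ouml;rg&aring;s"
print(unescape_entities(escaped))  # räksmörgås
print(html.unescape(escaped))      # räksmörgås
```

A single pass with a lookup table avoids nesting one replacement per entity, which is the part that gets unwieldy in the nested RegExp version.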
#3
You might want to use something like "<expression repeat="yes" fixchars="1">..." on your expressions.
That should do the trick. Please let me know if it works for you.
Always read the online manual (wiki), FAQ (wiki) and search the forum before posting.
Do not PM or e-mail Team-Kodi members directly asking for support. Read/follow the forum rules (wiki).
Please read the pages on troubleshooting (wiki) and bug reporting (wiki) before reporting issues.
#4
It appears to be exactly what I want! For some reason though, it interprets "&Auml;" as "Ã,," :S but I guess it's an encoding error somewhere... The whole site is encoded in latin-1 and I have also set my scraper XML encoding to that, but maybe XBMC prefers UTF-8?
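
Output of this kind is typical when UTF-8 bytes are displayed as if they were a single-byte Latin encoding. Below is a minimal Python sketch of that effect. Whether this is exactly what happens inside XBMC is an assumption, not something confirmed in this thread.

```python
text = "Ä"

# Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252,
# which matches ISO-8859-1 except in the 0x80-0x9F range.
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0x84
garbled = utf8_bytes.decode("cp1252")  # two characters instead of one

print(garbled)

# Reversing the mistake recovers the original character.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

The garbled result is an "Ã" followed by a low double quotation mark, which is close to what is reported above.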
#5
I guess you want ISO-8859-1?

EDIT: nvm, misread it. XBMC always prefers UTF-8.
#6
Yes, and I have the RegExp output in the <GetSearchResults> tag set to encoding ISO-8859-1 and it works for all characters except the HTML entity characters, so it shouldn't really be an encoding issue, should it?

Code:
<RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">

BTW, off topic, but I just tried a (.*?) pattern and it seemed to work... I guess laziness IS supported by the regex engine after all (hooray!).

[EDIT] Oh, sorry, missed your latest reply. I actually tested changing the main encoding to UTF-8 but it didn't help...
#7
I managed to solve it!

Specifying SearchStringEncoding="iso-8859-1" in the <CreateSearchString> tag did the trick. From what I've read, it works together with the fixchars option.
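
As a rough illustration of why the search string encoding can matter, the same query percent-encodes to different bytes depending on the encoding chosen. This is a Python sketch only. How XBMC applies `SearchStringEncoding` internally is an assumption here.

```python
from urllib.parse import quote

query = "räksmörgås"

# Percent-encode the same query under two different byte encodings.
as_latin1 = quote(query, encoding="iso-8859-1")
as_utf8 = quote(query, encoding="utf-8")

print(as_latin1)  # r%E4ksm%F6rg%E5s
print(as_utf8)    # r%C3%A4ksm%C3%B6rg%C3%A5s
```

A site that expects latin-1 would not recognize the UTF-8 form of the query, and vice versa.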

Thanks!