iPhone XML parsing Norwegian characters æ ø å - ios

I've had this problem for a long time but I've been implementing this ugly hack on the backend to get around it.
Now I've decided to act as a real developer and deal with it.
My problem is that when parsing an XML feed with any of the Norwegian characters æ, ø or å in the title node, all the letters appearing before these special characters are omitted.
So if the word is "Bålhuset" it only displays "ålhuset". The funny thing is that æ, ø and å characters AFTER the initial problem character are included.
So if I put for example "ÅBålhuset", I will get "Bålhuset". So it seems that only the first occurrence of one of these special characters causes a problem.
Any help would be immensely appreciated!
-Chris

When you create the XML, try wrapping the text in CDATA tags, like this:
<title><![CDATA[Transport "Bålhuset"Classic World's]]></title>
XML with those characters can be rejected as invalid unless they are contained within a CDATA section; this typically happens when the bytes in the feed do not match the encoding the document declares.
Otherwise you need to use the character reference for each special character. If you want to represent ö, for example, you need to type &#246;.
You can validate your XML as you create it, which makes a bug like this easy to find and fix.
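One way to run such a check outside the app is a quick well-formedness test. The sketch below uses Python's standard library rather than NSXMLParser, and the feed fragments are hypothetical; it only illustrates that a title containing å parses with or without CDATA when the bytes match the declared encoding, and fails when they do not.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragments; the titles mirror the examples in this thread.
PLAIN = '<?xml version="1.0" encoding="UTF-8"?><item><title>Bålhuset</title></item>'
CDATA = ('<?xml version="1.0" encoding="UTF-8"?>'
         '<item><title><![CDATA[Transport "Bålhuset"Classic World\'s]]></title></item>')


def title_of(document: str) -> str:
    """Parse the document from UTF-8 bytes and return the text of <title>."""
    root = ET.fromstring(document.encode("utf-8"))
    return root.findtext("title")


def is_well_formed(data: bytes) -> bool:
    """Return True if the parser accepts the raw bytes as XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print(title_of(PLAIN))
print(title_of(CDATA))
# Latin-1 bytes under a UTF-8 declaration: the å byte is not valid UTF-8.
print(is_well_formed(PLAIN.encode("latin-1")))
```

This is a well-formedness check only, not schema validation, and it says nothing about how NSXMLParser delivers the text to its delegate.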

What did it for me was getting the data as JSON and using the native JSON methods: no dropped characters and no other sporadic behaviour.
To me that suggests there is an issue with NSXMLParser that makes it choke on international characters (only on the first occurrence, mind you) even though everything is in order with the encoding etc.

Related

CGI::unescape can't handle unescaping "wymiana+teflon%F3w"?

I am working on data imported from a legacy database into SQLite for development. The legacy database has a lot of URL-encoded strings with Polish characters. I can get most of these strings readable by using
CGI::unescape_html( CGI::unescape "string" )
except for one case (the only one I have noticed so far; there may be more, as I have not done any testing yet): the letter "ó". For instance, using unescapeHTML on the string "wymiana+teflon%F3w" throws an invalid byte sequence exception.
The question now is whether my string is properly escaped, since other Polish characters use "&#nnn;" sequences, as in "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak", which seems to follow the standard for numeric character references. BTW, this string is properly unescaped into
"bład zapisu - powinno być brak"
But, on the other hand, there are also strings with similar character encoding, e.g. "odpowietrzanie+weza%5C", which is properly handled by CGI::unescapeHTML. However, %5C represents a backslash rather than a letter, and its code point is below 256. Can that be the reason? I tried to research this but haven't found any explanation. I also updated my Ruby to 2.1.0, as CGI::Util has changed in the new version, but still no luck.
ó is 0xF3 in ISO-8859-2 (and ISO-8859-1), but '\xF3' is not a valid UTF-8 string; that ó should be %C3%B3 in the URL if you're expecting UTF-8. Someone somewhere probably used the deprecated escape JavaScript function to encode the string instead of the modern encodeURIComponent; you can see the difference with a simple test in your browser's JavaScript console:
> escape('ó')
"%F3"
> encodeURIComponent('ó')
"%C3%B3"
There's the %F3 you're seeing and the %C3%B3 that you want to see. One thing that should work is to fix the encoding by hand:
irb> CGI::unescape('wymiana+teflon%F3w').force_encoding('ISO-8859-2').encode('UTF-8')
=> "wymiana teflonów"
This assumes that you know what should be ISO-8859-2 and what should be UTF-8. You might have a mix of UTF-8 and ISO-8859-2 (or -1, -3, ..., Windows CP-1258, ...) in your data; unfortunately, there's no reliable way to tell the difference, as the encodings overlap, and there's no way to be sure what result makes sense without eye-balling it and knowing the various languages involved.
Probably the best you can do is:
1. Send everything through your CGI::unescape_html(CGI::unescape(...)) converter.
2. Wrap that in an exception handler to trap the inevitable problems.
3. Stash the problem strings off to the side somewhere.
4. Try the ISO-8859-2 to UTF-8 conversion on the strings from (3) and eye-ball them to see if they make sense.
5. Repeat with other common encodings until there's nothing left that you care about.
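The steps above can be sketched roughly as follows. This is a Python illustration rather than the Ruby from the question; `decode_legacy` and `FALLBACK_ENCODINGS` are hypothetical names, and the choice of ISO-8859-2 as the only fallback is an assumption based on the data being Polish.

```python
import html
from urllib.parse import unquote_to_bytes

# Assumption: Latin-2 is the most likely legacy encoding for Polish data.
FALLBACK_ENCODINGS = ("iso-8859-2",)


def decode_legacy(escaped: str):
    """Return (text, encoding_used); encoding_used is None if nothing fit."""
    # '+' means space in form-style escaping; then undo the %XX escapes.
    raw = unquote_to_bytes(escaped.replace("+", " "))
    for encoding in ("utf-8", *FALLBACK_ENCODINGS):
        try:
            # Decode the bytes first, then resolve &#nnn; style references.
            return html.unescape(raw.decode(encoding)), encoding
        except UnicodeDecodeError:
            continue
    return None, None


samples = [
    "wymiana+teflon%F3w",
    "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak",
]
# Anything that needed a fallback is set aside for eye-balling.
needs_review = []
for sample in samples:
    text, encoding = decode_legacy(sample)
    if encoding != "utf-8":
        needs_review.append((sample, text, encoding))
print(needs_review)
```

Single-byte encodings such as ISO-8859-2 accept almost any byte sequence, so a successful fallback decode only means the bytes were mappable, not that the result is right; the manual review in step 4 is still needed.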
Note that I'm using ISO-8859-2 instead of the more common ISO-8859-1 as Latin-2 is for Eastern European languages (such as Polish) whereas Latin-1 is for Western European languages. They overlap on ó but there is no ł in Latin-1. With tasks like this you usually try the encodings that are probably there first, then fall back on other common encodings, then fall back to whatever other encodings you can think of, and then fall back on hard liquor.
Good luck, modernizing legacy data is not the funnest job in the world.
I've chosen another way to solve my problem, simply substituting all occurrences of '%F3' with '%26%23xF3%3B' before unescaping. BTW, capital letter Ó also needs similar substitution. The actual code I used:
def unescape_ó(s)
  # Swap the raw Latin-2 escapes for Ó/ó with URL-encoded numeric references (&#xD3; / &#xF3;)
  s.gsub(/%D3|%F3/, '%D3' => '%26%23xD3%3B', '%F3' => '%26%23xF3%3B')
end
With this approach I don't have to handle the invalid byte sequence exception, as a properly escaped string is passed to CGI::unescapeHTML.

Is it necessary to use &mdash; and &ndash; in XHTML or HTML5?

It seems that it is best to use the &amp; escape, instead of simply typing the ampersand (&).
However, should we be using X/HTML character entity references for dashes and other common typographical characters when writing blog posts on CMSs like WordPress or hard-coding websites by hand?
For example:
&ndash; is an en dash (–)
&mdash; is an em dash (—)
What is the risk if we do not?
Why is the hyphen (-) never written as &#45; but simply typed directly from the keyboard in HTML? (Assuming that it is a hyphen, and not a minus sign.)
The W3C released an official response about when to use and when not to use character escapes; the link is at the end of this answer. As they are also the group that is in charge of the HTML specification, I think it's best to follow their advice.
From the section "When to Use Escapes"
Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup. These are part of the language for all documents based on XML and for HTML.
< (&lt;)
> (&gt;)
& (&amp;)
They also mention using characters that might not be supported in the current encoding.
From the section "When Not to Use Escapes"
It is almost always preferable to use an encoding that allows you to represent characters in their normal form, rather than using character entity references or NCRs.
Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.
http://www.w3.org/International/questions/qa-escapes
Those entities are there to help you, the author, with characters not usually typable on your average keyboard. (The em dash, —, is one example; © is another.)
You only need to escape those characters that have meaning in (X)HTML: <, > and &.
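As a small illustration of that rule, here is a sketch using Python's standard `html` module (the sample sentence is made up). Only the three syntax characters are escaped; the dash is left as a literal character, which is fine as long as the page is served in an encoding such as UTF-8 that can represent it.

```python
import html

# Made-up sample text; \u2014 is U+2014 EM DASH typed as a literal character.
text = 'AT&T says 1 < 2 \u2014 and "quotes" stay readable'

# quote=False escapes only &, < and >, which is enough for element content.
escaped = html.escape(text, quote=False)
print(escaped)

# Going the other way, named references resolve to the same characters.
print(html.unescape("&ndash; and &mdash;"))
```

Inside attribute values the quote characters also need escaping, which is what the default `quote=True` does.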

Grails UrlEncoding non latin characters like åäö

I have some link resources with non-Latin characters like åäö.
These are usually user-uploaded files.
The problem is that I have not been successful in encoding them.
Using filename.encodeAsURL does not seem to encode them the right way.
For example, the character ö is turned into o%CC%88.
Typing the same thing in Firefox and copying the contents gives %C3%B6.
What is the difference between these encodings, and which should I use to get the correct encoding?
Both encodings are correct. You are actually seeing the encoding of two different strings.
The key here is noticing the o at the beginning of the string:
o%CC%88 is the letter o followed by the Unicode character COMBINING DIAERESIS (U+0308), which combines with the previous character when rendered.
%C3%B6 is the Unicode character LATIN SMALL LETTER O WITH DIAERESIS (U+00F6).
What you are seeing is that in the first case, the string entered is something like these two characters: o ¨, which are actually rendered as ö.
In the second case, it's the actual character ö.
My guess is you are seeing the difference between two different inputs.
Update based on below discussion: If you are dynamically processing Unicode characters, and you do not have control over the input methods, you can try to normalize the Unicode, using java.text.Normalizer (Java 1.6 or newer).
Normalizing attempts to ensure that all characters are consistently represented, so that accented characters are always represented by a combined character or always by the character+combining mark.
Rough example:
String.metaClass.normalizeUnicode = {
    return java.text.Normalizer.normalize(delegate, java.text.Normalizer.Form.NFC)
}
input = input.normalizeUnicode()
There are four forms of normalization. I picked the one that seems to be best for your case based on the description of how they work, but you may prefer to try the other ones and see what works most consistently.
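To make the difference between the two inputs concrete, here is a sketch in Python rather than Groovy; `encode_filename` is a hypothetical helper, not part of Grails. It percent-encodes both spellings of ö and shows that NFC normalization collapses them to the same result.

```python
import unicodedata
from urllib.parse import quote

decomposed = "o\u0308"   # 'o' followed by COMBINING DIAERESIS
precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS

print(quote(decomposed))   # o%CC%88
print(quote(precomposed))  # %C3%B6


def encode_filename(name: str) -> str:
    """Normalize to NFC first so both spellings encode identically."""
    return quote(unicodedata.normalize("NFC", name), safe="")


print(encode_filename("bo\u0308o.pdf"))  # b%C3%B6o.pdf
```

NFC is used here to match the Groovy example; the other three forms (NFD, NFKC, NFKD) are available through the same `unicodedata.normalize` call.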
All that being said, if you are trying to represent Unicode characters in a URL, and they are not being loaded and processed by the code directly, it's probably best to avoid using non-Latin characters altogether. Not only does this have the benefit of consistency, it also gives significantly shorter and more legible URLs. boo.pdf is a lot easier to read than bo%CC%88o.pdf.

&#146; is getting converted as "\u0092" by nokogiri in ruby on rails

I have an HTML page containing the following line, which includes some HTML entities like "&#146;".
require 'nokogiri'
# I am not pasting the whole HTML page here, just the line with the issue
html_file = "<html>....<body><p>they&#146;re originally intended to describe the spread of of viral diseases, but they&#146;re nice analogies for how web/SN apps grow.<p> ...</body></html>"
doc = Nokogiri::HTML(html_file)
body = doc.xpath('//body')
body_content = body[0].inner_html
puts body_content
Result:
These terms come from the fields of medicine and biology they\u0092re originally intended to describe the spread of of viral diseases, but they\u0092re nice analogies for how web/SN apps grow.
I want to leave these entities as they are instead of having them changed to Unicode.
Am I missing anything?
Thanks
they&#146;re
is wrong and should be avoided. If you want to use a close-single-quote there, to reproduce the typographical practice of rendering apostrophes as a slanted quote, then the correct character is U+2019 RIGHT SINGLE QUOTATION MARK, which can be written as &#8217; or &rsquo;. Or, if you're using UTF-8, just included verbatim as ’.
&#146; should refer to character U+0092, a little-used and pointless control character that typically renders as blank or a missing-glyph box. And indeed in XML, it does.
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
The problem is that Nokogiri doesn't know about this quirk, and takes character reference 146 at its word, ending up with the character 146 (\u0092) that you don't really want. I think Nokogiri is using libxml2 to parse HTML, so ultimately the proper fix would be to libxml2's htmlParseCharRef function, to substitute characters 128–159.
In the meantime you could perhaps try ‘fixing up’ character references manually with crude string substitution like &#146; -> &#8217; before parsing. It's a bit wrong, but at least in HTML the only other place you can have the markup sequence &#146; without it being a character reference would be in a comment, so hopefully it wouldn't matter if you changed the content there accidentally too.
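A generalized version of that crude substitution might look like the sketch below. It is written in Python rather than Ruby, `fix_c1_references` is a hypothetical helper, and it only handles decimal references (not hexadecimal ones such as &#x92;); it rewrites references in the 128 to 159 range to the code point that cp1252 assigns to that byte.

```python
import re

DECIMAL_REFERENCE = re.compile(r"&#(\d+);")


def fix_c1_references(markup: str) -> str:
    """Rewrite &#128;..&#159; to the Unicode code point cp1252 maps them to."""
    def replace(match: re.Match) -> str:
        code = int(match.group(1))
        if not 128 <= code <= 159:
            return match.group(0)
        try:
            character = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes in this range are unassigned in cp1252; leave them.
            return match.group(0)
        return f"&#{ord(character)};"

    return DECIMAL_REFERENCE.sub(replace, markup)


print(fix_c1_references("<p>they&#146;re nice analogies</p>"))
```

Run over the markup before handing it to the parser, this turns they&#146;re into they&#8217;re while leaving references outside the range untouched.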
Have you tried changing
&#146;
into
&#8217;
I think the parser handles the ampersand first, then concatenates it with "#146", and then parses the two together. That is only a guess, though; this was really meant to be a comment rather than an answer.
I got the idea from focos's answer to another post, and the Unicode value from a character reference table.

Why do you need to encode URLs?

Why do you need to encode urls? Is there a good reason why you have to change every space in the GET data to %20?
Because some characters have special meanings.
For instance, in a query string, the ampersand (&) is used as a separator between key-value pairs. If you were to put an ampersand into one of those values, it would look like the separator between the end of a value and the beginning of the next key. So for special characters like this, we use percent encoding so that we can be sure that the data is unambiguously encoded.
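A minimal sketch of that ambiguity, using Python's standard library; the parameter names `q` and `page` are made up for the example.

```python
from urllib.parse import parse_qs, urlencode

params = {"q": "cats & dogs", "page": "2"}

# Naive concatenation: the ampersand inside the value looks like a separator.
naive = "&".join(f"{key}={value}" for key, value in params.items())
print(naive)            # q=cats & dogs&page=2
print(parse_qs(naive))  # the value of q is cut off at the ampersand

# Percent-encoding the value removes the ambiguity.
encoded = urlencode(params)
print(encoded)            # q=cats+%26+dogs&page=2
print(parse_qs(encoded))  # {'q': ['cats & dogs'], 'page': ['2']}
```

`urlencode` writes spaces as `+`, the form-style convention for query strings; `%20` is the equivalent general-purpose escape.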
From RFC 2396, section 2.4.3:
The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.
Originally, older browsers could get confused by the spaces (not really an issue anymore).
Now, if someone copies the URL to send as a link, the space can break the hyperlink, e.g.:
Hey! Check out this derping cat playing a piano!
http://www.mysite.com/?video=funny cat plays piano.
See how the link breaks?
Now look at this:
http://www.mysite.com/?video=funny%20cat%20plays%20piano.
Let's break down your question.
Why do you need to encode URL?
A URL is composed of only a limited number of characters and those are digits(0-9), letters(A-Z, a-z), and a few special characters("-", ".", "_", "~").
So does it mean that we cannot use any other character?
The answer to this question is "yes", at least not directly. But wait a minute, there is a workaround: URL encoding, also called percent-encoding. If you want to transmit any character that is not one of those mentioned above (digits, letters, and the listed special characters), you need to encode it. That is why a space is encoded as "%20".
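For illustration, here is how that rule plays out with Python's `urllib.parse.quote` (the sample strings are made up). Passing `safe=""` asks it to encode everything except the unreserved characters listed above.

```python
from urllib.parse import quote, unquote

# Unreserved characters pass through untouched.
print(quote("AZaz09-._~", safe=""))             # AZaz09-._~

# Everything else becomes %XX, one escape per UTF-8 byte.
print(quote("funny cat plays piano", safe=""))  # funny%20cat%20plays%20piano
print(quote("a&b=c", safe=""))                  # a%26b%3Dc
print(quote("ö", safe=""))                      # %C3%B6

# Decoding restores the original text.
print(unquote("funny%20cat%20plays%20piano"))   # funny cat plays piano
```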
Is this all there is to URL encoding? No, there is a lot more to it, but I'm not going to turn this into a long, boring technical answer. If you want to know more, you can read about it here: https://www.urlencoder.io/learn/ (credit goes to that writer).
Well, you do so in order that every browser knows how the string that makes up the URL is encoded. Converting the space to %20, etc. makes that URL/URI portable. The original text could be Latin-1, or it could be Unicode; it needs to be normalized to something that is understood universally. Take a look at RFC 3986: https://www.rfc-editor.org/rfc/rfc3986#section-2.1

Resources