I'm currently struggling with percent escaping special characters on iOS, for instance "é" when contained in a query parameter value.
I'm using AFNetworking, but the issue isn't specific to it.
The "é" character should be percent escaped to "%E9", yet the result is "%C3%A9". The reason is that "é" is represented as those 2 bytes in UTF-8.
The actual percent escaping method is the well known one and I'm passing UTF-8 as the string encoding. The string itself is @"é".
static NSString * AFPercentEscapedQueryStringPairMemberFromStringWithEncoding(NSString *string, NSStringEncoding encoding)
{
    static NSString * const kAFCharactersToBeEscaped = @":/?&=;+!@#$()~";
    static NSString * const kAFCharactersToLeaveUnescaped = @"[].";
    return (__bridge_transfer NSString *)CFURLCreateStringByAddingPercentEscapes(kCFAllocatorDefault, (__bridge CFStringRef)string, (__bridge CFStringRef)kAFCharactersToLeaveUnescaped, (__bridge CFStringRef)kAFCharactersToBeEscaped, CFStringConvertNSStringEncodingToEncoding(encoding));
}
I had hoped passing in UTF-16 as the string encoding would solve it, but it doesn't. The result in this case is "%FF%FE%E9%00". It contains "%E9", but I must be missing something obvious.
Somehow I can't get my head around it.
Any pointers would be awesome.
RFC 3986 explains that, unless the characters you're encoding fall into the unreserved US-ASCII range, the convention is to convert each character to its byte values (in this case, UTF-8-encoded) and use those values as the basis for the percent encoding.
The behavior you're seeing is correct.
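To see the same convention outside of iOS, here is a minimal JavaScript sketch. It relies only on the standard encodeURIComponent and decodeURIComponent functions, which always work in UTF-8; the helper names encodeQueryValue and tryDecode are made up for illustration.

```javascript
// Hypothetical helper: percent-encode a query value the way RFC 3986 suggests,
// i.e. via the UTF-8 bytes of each character that needs escaping.
function encodeQueryValue(value) {
  return encodeURIComponent(value);
}

// Hypothetical helper: decode a percent-encoded value, or return null when the
// escaped bytes are not valid UTF-8 (decodeURIComponent throws a URIError then).
function tryDecode(escaped) {
  try {
    return decodeURIComponent(escaped);
  } catch (e) {
    return null;
  }
}

console.log(encodeQueryValue("\u00e9")); // %C3%A9
console.log(tryDecode("%C3%A9"));        // é
console.log(tryDecode("%E9"));           // null, a lone E9 byte is not valid UTF-8
```

The last line hints at why "%E9" is not what a UTF-8 based escaper produces: E9 on its own is the ISO Latin 1 byte for "é", not a complete UTF-8 sequence.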
The disparity between the encoded values given for UTF-8 vs. UTF-16 is due to a couple of factors.
Encoding Differences
First, there's the difference in the way that the respective encodings are actually defined. UTF-16 uses two bytes for every character in the Basic Multilingual Plane (four bytes, as a surrogate pair, for anything beyond it), and essentially concatenates the higher order byte with the lower order byte to define the code. (The ordering of these bytes depends on whether the text is encoded as Little Endian or Big Endian.) UTF-8, on the other hand, uses a variable number of bytes, depending on where in the Unicode code space the character sits. UTF-8 signals how many bytes it is going to use through the bits that are set in the first byte itself.
So if we look at C3 A9, that translates into the following bits:
1100 0011 1010 1001
Looking at RFC 2279, we see that the leading run of '1' bits, ended by a terminating '0', denotes how many bytes will be used: in this case, 2. Stripping off the initial 110 metadata, we're left with 00011 from the first byte, which represents the leftmost bits of the actual value.
For the next byte (1010 1001), again from the RFC we see that, for every subsequent byte, 10 will be "prefix" metadata for the actual value. Stripping that off, we're left with 101001.
Concatenating the actual value bits, we end up with 00011 101001, which is 233 in base-10, or E9 in base-16.
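The bit manipulation above can be sketched in a few lines of JavaScript. The function name decodeTwoByteUtf8 is an assumption for illustration, and it only handles the two-byte case discussed here.

```javascript
// Hypothetical helper: decode one two-byte UTF-8 sequence into its code point.
function decodeTwoByteUtf8(lead, trail) {
  // The lead byte must look like 110xxxxx and the trail byte like 10xxxxxx.
  if ((lead & 0xe0) !== 0xc0 || (trail & 0xc0) !== 0x80) {
    throw new RangeError("not a two-byte UTF-8 sequence");
  }
  // Keep the 5 payload bits of the lead byte and the 6 payload bits of the trail byte.
  return ((lead & 0x1f) << 6) | (trail & 0x3f);
}

const codePoint = decodeTwoByteUtf8(0xc3, 0xa9);
console.log(codePoint);                       // 233
console.log(codePoint.toString(16));          // e9
console.log(String.fromCodePoint(codePoint)); // é
```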
Encoding Identification
The other thing to consider, specifically for the UTF-16 value (%FF%FE%E9%00), comes from the original RFC, which mentions that the encoded value itself carries no explicit indication of the encoding used. So in this case, iOS is "cheating" by giving you an indication of which encoding is used. FF FE is a well-known byte order mark used in UTF-16 encoded files to denote that the data is UTF-16 in little-endian byte order. As for E9 00, as mentioned, UTF-16 uses at least two bytes per character. In this case, since all of the data fits in 1 byte, the other byte is simply zero.
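As a rough illustration of where %FF%FE%E9%00 comes from, the sketch below builds the little-endian UTF-16 bytes by hand, with a byte order mark in front, and percent-encodes every byte. Both helper names are hypothetical, and unlike a real escaper this one does not leave unreserved ASCII bytes alone.

```javascript
// Hypothetical helper: UTF-16 little-endian bytes of a string, preceded by the FF FE byte order mark.
function utf16LeWithBom(text) {
  const bytes = [0xff, 0xfe];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i); // one 16-bit code unit
    bytes.push(unit & 0xff, unit >> 8); // low byte first, then high byte
  }
  return bytes;
}

// Hypothetical helper: write every byte as a %XX triplet.
function percentEncodeBytes(bytes) {
  return bytes
    .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log(percentEncodeBytes(utf16LeWithBom("\u00e9"))); // %FF%FE%E9%00
```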
I'm having issues working with iOS Swift 2.0 to perform an XOR on a [UInt8] and convert the XORd result to a String. I'm having to interface with a crude server that wants to do simple XOR encryption with a predefined array of UInt8 values and return that result as a String.
Using iOS Swift 2.0 Playground, create the following array:
let xorResult : [UInt8] = [24, 48, 160, 212] // XORd result
let result = NSString(bytes: xorResult, length: xorResult.count, encoding: NSUTF8StringEncoding)
The result is always nil. If you remove the 160 and 212 values from the array, NSString is not nil. If I switch to NSUTF16StringEncoding then I do not receive nil, however, the server does not support UTF16. I have tried converting the values to a hex string, then converting the hex string to NSData, then try to convert that to NSUTF8StringEncoding but still nil until I remove the 160 and 212. I know this algorithm works in Java, however in Java we're using a combination of char and StringBuilder and everything is happy. Is there a way around this in iOS Swift?
To store an arbitrary chunk of binary data as a string, you need
a string encoding which maps each single byte (0 ... 255) to some
character. UTF-8 does not have this property, as for example 160 (0xA0)
is a continuation byte of a multi-byte UTF-8 sequence and not valid on its own.
The simplest encoding with this property is the ISO Latin 1 aka
ISO 8859-1, which is the
ISO/IEC 8859-1
encoding when supplemented with the C0 and C1 control codes.
It maps the Unicode code points U+0000 .. U+00FF
to the bytes 0x00 .. 0xFF (compare 8859-1.TXT).
This encoding is available for
(NS)String as NSISOLatin1StringEncoding.
Please note: The result of converting an arbitrary binary chunk to
a (NS)String with NSISOLatin1StringEncoding will contain embedded
NUL and control characters. Some functions behave unexpectedly
when used with such a string. For example, NSLog() terminates the
output at the first embedded NUL character. This conversion
is meant to solve OP's concrete problem (creating a QR-code which
is recognized by a 3rd party application). It is not meant as
a universal mechanism to convert arbitrary data to a string which may
be printed or presented in any way to the user.
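The same idea can be sketched in JavaScript, where String.fromCharCode maps each byte value 0 ... 255 to the code point with the same number, which is exactly the ISO Latin 1 mapping. The helper names are assumptions for illustration, and isValidUtf8 is only a rough check built on decodeURIComponent.

```javascript
const xorResult = [24, 48, 160, 212]; // the XORd bytes from the question

// Hypothetical helper: map each byte to the code point U+0000 .. U+00FF with the same value.
function bytesToLatin1(bytes) {
  return String.fromCharCode(...bytes);
}

// Hypothetical helper: the reverse mapping, valid only for strings produced by bytesToLatin1.
function latin1ToBytes(text) {
  return Array.from(text, (ch) => ch.charCodeAt(0));
}

// Hypothetical helper: decodeURIComponent throws when the escaped bytes are not valid UTF-8.
function isValidUtf8(bytes) {
  const escaped = bytes.map((b) => "%" + b.toString(16).padStart(2, "0")).join("");
  try {
    decodeURIComponent(escaped);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isValidUtf8(xorResult));                   // false, 160 cannot stand on its own
console.log(isValidUtf8([24, 48]));                    // true
console.log(latin1ToBytes(bytesToLatin1(xorResult)));  // [ 24, 48, 160, 212 ]
```

The round trip through the Latin 1 mapping is lossless, which is the property the server-side XOR scheme needs.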
If we percent encode the char "€", we will have %E2%82%AC as result. Ok!
My problem:
a = %61
I already know it.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
If yes, will browsers and servers understand the result as the char "a"?
If we percent encode the char "€", we will have %E2%82%AC as result.
€ is Unicode codepoint U+20AC EURO SIGN. The byte sequence 0xE2 0x82 0xAC is how U+20AC is encoded in UTF-8. %E2%82%AC is the URL encoding of those bytes.
a = %61
I already know it.
For ASCII character a, aka Unicode codepoint U+0061 LATIN SMALL LETTER A, that is correct. It is encoded as byte 0x61 in UTF-8 (and most other charsets), and thus can be encoded as %61 in URLs.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
Yes. Any character can be encoded using percent encoding in a URL. Simply encode the character in the appropriate charset, and then percent-encode the resulting bytes. However, most unreserved ASCII characters do not require such encoding, so just use them as-is.
If yes, will browsers and servers understand the result as the char "a"?
In URLs and URL-like content encodings (like application/x-www-form-urlencoded), yes.
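A quick JavaScript sketch of this: the built-in functions leave "a" alone when encoding but still accept %61 when decoding. The percentEncodeAll helper is hypothetical and only exists to force the escaped form of every character, using UTF-8 as the charset.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte, including unreserved ASCII characters.
function percentEncodeAll(text) {
  // encodeURIComponent already escapes what it must; escape the single characters it left alone.
  return encodeURIComponent(text).replace(/%[0-9A-F]{2}|[\s\S]/g, (m) =>
    m.length === 3 ? m : "%" + m.charCodeAt(0).toString(16).toUpperCase().padStart(2, "0")
  );
}

console.log(encodeURIComponent("\u20ac")); // %E2%82%AC
console.log(encodeURIComponent("a"));      // a
console.log(percentEncodeAll("a"));        // %61
console.log(decodeURIComponent("%61"));    // a
```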
I am making a program in Delphi 7, that is supposed to encode a unicode string into html entity string.
For example, "ABCģķī" would result in "ABC&#291;&#311;&#299;"
Now 2 basic things:
Delphi 7 is non-Unicode, so I can't just write unicode chars directly in code to encode them.
Codepages consist of 256 entries, each holding a character specific to that codepage, except the first 128, which are the same for all codepages.
So - How do I get a value of a char, that is in 1-255 range?
I tried Ord(Integer), but it also returns values way past 255. Basically, everything is fine (A returns 65 and so on) until my string reaches non-Latin unicode.
Is there any other method for returning char value? Any help appreciated
I suggest you avoid codepages like the plague.
There are two approaches for Unicode that I'd consider: WideString, and UTF-8.
WideStrings have the advantage of being 'native' to Windows, which helps if you need to use Windows API calls. Disadvantages are storage space, and that they (like UTF-8) can require multiple WideChars to encode the full Unicode space.
UTF-8 is generally preferable. Like WideStrings, this is a multi-byte encoding, so a particular Unicode 'code point' may need several bytes in the string to encode it. This is only an issue if you're doing lots of character-by-character processing on your strings.
@DavidHeffernan comments (correctly) that WideStrings may be more compact in certain cases. However, I'd recommend UTF-16 only if you are absolutely sure that your encoded text will really be more compact (don't forget markup!), and this compactness is highly important to you.
In HTML 4, numeric character references are relative to the charset used by the HTML. Whether that charset is specified in the HTML itself via a <meta> tag, or out-of-band via an HTTP/MIME Content-Type header or other means, it does not matter. As such, "ABC&#291;&#311;&#299;" would be an accurate representation of "ABCģķī" only if the HTML were using UTF-16. If the HTML were using UTF-8, the correct representation would be either "ABC&#196;&#163;&#196;&#183;&#196;&#171;" or "ABC&#xC4;&#xA3;&#xC4;&#xB7;&#xC4;&#xAB;" instead. Most other charsets do not support those particular Unicode characters.
In HTML 5, numeric character references contain original Unicode codepoint values regardless of the charset used by the HTML. As such, "ABCģķī" would be represented as either "ABC&#291;&#311;&#299;" or "ABC&#x123;&#x137;&#x12B;".
So, to answer your question, the first thing you have to do is decide whether you need to use HTML 4 or HTML 5 semantics for numeric character references. Then, you need to assign your Unicode data to a WideString (which is the only Unicode string type that Delphi 7 natively supports), which uses UTF-16, then:
If you need HTML 4:
A. if the HTML charset is not UTF-16, then use WideCharToMultiByte() (or equivalent) to convert the WideString to that charset, then loop through the resulting values outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. if the HTML charset is UTF-16, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
If you need HTML 5:
A. if the WideString does not contain any surrogate pairs, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. otherwise, convert the WideString to UTF-32 using WideStringToUCS4String(), then loop through the resulting values outputting unreserved codepoints as-is and character references for reserved codepoints, using IntToStr() for decimal notation or IntToHex() for hex notation.
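The HTML 5 variant of these steps might look roughly like the JavaScript sketch below. It is not Delphi, and it makes assumptions that are not in the steps above: the name toNumericCharRefs is made up, every code point above 127 is treated as needing a reference, and markup characters such as & and < are not escaped. Iterating with for...of walks whole code points, so surrogate pairs are handled without a separate UTF-32 conversion.

```javascript
// Hypothetical helper: replace every non-ASCII code point with a numeric character reference.
function toNumericCharRefs(text, hex = false) {
  let out = "";
  for (const ch of text) { // for...of yields whole code points, not UTF-16 code units
    const cp = ch.codePointAt(0);
    if (cp < 128) {
      out += ch; // plain ASCII is emitted as-is (assumption of this sketch)
    } else if (hex) {
      out += "&#x" + cp.toString(16).toUpperCase() + ";";
    } else {
      out += "&#" + cp + ";";
    }
  }
  return out;
}

console.log(toNumericCharRefs("ABC\u0123\u0137\u012b"));       // ABC&#291;&#311;&#299;
console.log(toNumericCharRefs("ABC\u0123\u0137\u012b", true)); // ABC&#x123;&#x137;&#x12B;
console.log(toNumericCharRefs("\u{1F600}"));                   // &#128512;
```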
In case I understood the OP correctly, I'll just leave this here.
function Entitties(const S: WideString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    // Anything above #255 does not fit in an AnsiChar, so emit a numeric character reference.
    if Word(S[I]) > Word(High(AnsiChar)) then
      Result := Result + '&#' + IntToStr(Word(S[I])) + ';'  // IntToStr comes from SysUtils
    else
      Result := Result + S[I];
  end;
end;
I am now working on an iOS app that handles Unicode characters, but there seems to be a problem with translating a Unicode hex value (and int value too) to a character.
For example, I want to get character 'đ' which has Unicode value of c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean word) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of two above pieces of code are the same.
I can't understand what is problem here, please help!
The short answer
To specify đ, you can specify it in the following ways (untested):
@"đ"
@"\u0111"
@"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last 2 lines use C string literals instead of the Objective-C string object literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
Unicode escape sequences (Universal character names in C99)
According to this blog1:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows the forms \unnnn and \Unnnnnnnn, where nnnn or nnnnnnnn is a "short identifier as defined by ISO/IEC 10646"; roughly, this means the hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confusing the code point U+0111 with the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give a different byte representation of the character on disk, but since UTF-8, UTF-16 and UTF-32 are all encodings for the Unicode character set, the code point for the same character is the same in all 3 encodings.
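The distinction can be checked quickly in JavaScript (not Objective-C, so treat it as an illustration only). The helper utf8Hex is a made-up name, and it leans on encodeURIComponent, so it only works for characters that function actually escapes.

```javascript
// Hypothetical helper: the UTF-8 bytes of a non-ASCII character, as lowercase hex.
function utf8Hex(ch) {
  return encodeURIComponent(ch).split("%").filter(Boolean).join(" ").toLowerCase();
}

const dStroke = "\u0111"; // the escape sequence takes the code point, not the UTF-8 bytes

console.log(dStroke.codePointAt(0).toString(16)); // 111
console.log(utf8Hex(dStroke));                    // c4 91

// Treating the UTF-8 bytes c4 91 as a code point gives a different character,
// U+C491, which is where the unexpected Hangul syllable in the question comes from.
console.log(String.fromCodePoint(0xc491) === dStroke); // false
```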
Footnote
1: I think the blog is correct, but if anyone can find official documentation from Apple on this point, it would be better.
I am interested in knowing why '%20' is used as a space in URLs, particularly why %20 was used and why we even need it in the first place.
It's called percent encoding. Some characters can't be in a URI (for example #, as it denotes the URL fragment), so they are represented with characters that can be (# becomes %23)
Here's an excerpt from that same article:
When a character from the reserved set (a "reserved character") has
special meaning (a "reserved purpose") in a certain context, and a URI
scheme says that it is necessary to use that character for some other
purpose, then the character must be percent-encoded.
Percent-encoding a reserved character involves converting the
character to its corresponding byte value in ASCII and then
representing that value as a pair of hexadecimal digits. The digits,
preceded by a percent sign ("%") which is used as an escape character,
are then used in the URI in place of the reserved character. (For a
non-ASCII character, it is typically converted to its byte sequence in
UTF-8, and then each byte value is represented as above.)
The space character's character code is 32:
> ' '.charCodeAt(0)
32
Which is 20 in base-16:
> ' '.charCodeAt(0).toString(16)
"20"
Tack a percent sign in front of it and you get %20.
Because URLs have strict syntactic rules: / is a special path separator character, spaces are not allowed in a URL, and all characters have to come from a certain subset of ASCII. To embed arbitrary characters in URLs regardless of these restrictions, bytes can be percent encoded. The byte 0x20 represents a space in the ASCII encoding (and most other encodings), hence %20 is the URL-encoded version of it.
It uses percent encoding. You can see the Percent Encoding part of the RFC for Uniform Resource Identifier (URI): Generic Syntax
A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the
allowed set or is being used as a delimiter of, or within, the
component. A percent-encoded octet is encoded as a character
triplet, consisting of the percent character "%" followed by the two
hexadecimal digits representing that octet's numeric value. For
example, "%20" is the percent-encoding for the binary octet
"00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
character (SP).