Using ASCII control codes in componentsSeparatedByString - iOS

I need to separate a string on an ASCII control character (in particular, the Unit Separator, US, 0x1F).
How can I achieve this with NSString's componentsSeparatedByString: when it expects a Unicode string and I'm providing ASCII (UTF-8)?

Just use the literals:
[@"XXX\x1fYYY" componentsSeparatedByString:@"\x1f"]
or, better:
[@"XXX\x1fYYY" componentsSeparatedByCharactersInSet:[NSCharacterSet controlCharacterSet]]
The ASCII control characters 00-1F are mapped to the same Unicode code points.
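For concreteness, here is a minimal, self-contained sketch of the first approach (the record contents are made up for illustration):

// Hypothetical record using the ASCII Unit Separator (US, 0x1F) as the field delimiter.
// Caution: \x consumes every following hex digit, so make sure the character after
// the escape is not 0-9/a-f, or split adjacent literals: @"one\x1F" @"beta".
NSString *record = @"one\x1Ftwo\x1Fthree";
NSArray *fields = [record componentsSeparatedByString:@"\x1F"];
NSLog(@"%@", fields); // ( one, two, three )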

Related

iOS Localization: Unicode character escape sequences of the form '\uxxxx' do not work

We have a key-value pair in the Localization.string file.
"spanish-key" = "Espa\u00f1ol";
When we fetch it and assign it to a label, the app displays it as "Espau00f1ol".
This does not work:
self.label1.text = NSLocalizedString(@"spanish-key", nil);
This works and shows the required format:
self.label1.text = @"Espa\u00f1ol";
What could be the problem when we use
NSLocalizedString(@"spanish-key", nil)?
If we set \U instead of \u, then it works.
"spanish-key" = "Espa\U00f1ol";
When to use "\Uxxxx" and "\uxxxx"?
NSString literals and strings-files use different escaping rules.
NSString literals use the same escape sequences as "normal" C-strings, in particular
the "universal character names" defined in the C99 standard:
\unnnn - the character whose four-digit short identifier is nnnn
\Unnnnnnnn - the character whose eight-digit short identifier is nnnnnnnn
Example:
NSString *string = @"Espa\u00F1ol - \U0001F600"; // Español - 😀
Strings-files, on the other hand, use \Unnnn to denote a UTF-16 character,
and "UTF-16 surrogate pairs" for characters > U+FFFF:
"spanish-key" = "Espa\U00f1ol - \Ud83d\Ude00";
(This is the escaping used in "old style property lists", which you can see when printing
the description of an NSDictionary.)
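As a side note, the surrogate pair \Ud83d\Ude00 for U+1F600 can be derived mechanically from the code point; a small sketch of the standard UTF-16 arithmetic:

// Derive the UTF-16 surrogate pair for U+1F600 (😀).
uint32_t codePoint = 0x1F600;
uint32_t v = codePoint - 0x10000;    // 0x0F600
unichar high = 0xD800 + (v >> 10);   // 0xD83D
unichar low  = 0xDC00 + (v & 0x3FF); // 0xDE00
unichar pair[2] = { high, low };
NSString *emoji = [NSString stringWithCharacters:pair length:2];
NSLog(@"%@", emoji); // 😀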
This (hopefully) answers your question
When to use "\Uxxxx" and "\uxxxx"?
But: as also noted by @gnasher729 in his answer, there is no need to use Unicode
escape sequences at all. You can simply insert the Unicode characters themselves,
both in NSString literals and in strings-files:
NSString *string = @"Español - 😀";
"spanish-key" = "Español - 😀";
Just write the string in proper Unicode in Localization.string.
"spanish-key" = "Español";

Showing wrong character for a Unicode value in iOS

I am working on an iOS app that handles Unicode characters, but there seems to be some problem translating a Unicode hex value (and its int value) to a character.
For example, I want to get the character 'đ', which has a Unicode value of c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean character) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of two above pieces of code are the same.
I can't understand what the problem is here. Please help!
The short answer
To get đ, you can write it in any of the following ways (untested):
@"đ"
@"\u0111"
@"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last two lines use C string literals instead of the Objective-C string literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
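As a quick sanity check (mirroring the list above; untested here), all of these spellings should produce the same one-character string:

// All of these should yield the identical one-character string "đ" (U+0111).
NSString *a = @"\u0111";
NSString *b = @"\U00000111";
NSString *c = [NSString stringWithUTF8String:"\u0111"];
NSString *d = [NSString stringWithUTF8String:"\xc4\x91"];
NSLog(@"%d", [a isEqualToString:b] && [b isEqualToString:c] && [c isEqualToString:d]); // 1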
Unicode escape sequences (Universal character names in C99)
According to this blog [1]:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows that \unnnn and \Unnnnnnnn denote the character whose "short identifier as defined by ISO/IEC 10646" is nnnn or nnnnnnnn; roughly speaking, the hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confusing the code point U+0111 with its UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give a different byte representation of the character on disk, but since UTF-8, UTF-16 and UTF-32 are all encodings for the Unicode character set, the code point for a given character is the same in all three.
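To see the distinction concretely, encode the same one-character string three ways; the code point stays U+0111 while the bytes differ (a sketch; the UTF-16/UTF-32 output starts with a byte order mark, and its byte order depends on the platform):

NSString *str = @"\u0111"; // đ, code point U+0111 in every Unicode encoding
NSData *utf8  = [str dataUsingEncoding:NSUTF8StringEncoding];  // c4 91
NSData *utf16 = [str dataUsingEncoding:NSUTF16StringEncoding]; // BOM + 11 01 (little-endian)
NSData *utf32 = [str dataUsingEncoding:NSUTF32StringEncoding]; // BOM + 11 01 00 00 (little-endian)
NSLog(@"%@ %@ %@", utf8, utf16, utf32);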
Footnote
[1]: I think the blog is correct, but if anyone can find official documentation from Apple on this point, it would be better.

Percent escaping special characters like é on iOS

I'm currently struggling with percent-escaping special characters on iOS, for instance "é" when it is contained in a query parameter value.
I'm using AFNetworking, but the issue isn't specific to it.
The "é" character should be percent-escaped as "%E9", yet the result is "%C3%A9". The reason is that "é" is represented by those two bytes in UTF-8.
The actual percent-escaping method is the well-known one, and I'm passing UTF-8 as the string encoding. The string itself is @"é".
static NSString * AFPercentEscapedQueryStringPairMemberFromStringWithEncoding(NSString *string, NSStringEncoding encoding)
{
static NSString * const kAFCharactersToBeEscaped = @":/?&=;+!@#$()~";
static NSString * const kAFCharactersToLeaveUnescaped = #"[].";
return (__bridge_transfer NSString *)CFURLCreateStringByAddingPercentEscapes(kCFAllocatorDefault,
    (__bridge CFStringRef)string,
    (__bridge CFStringRef)kAFCharactersToLeaveUnescaped,
    (__bridge CFStringRef)kAFCharactersToBeEscaped,
    CFStringConvertNSStringEncodingToEncoding(encoding));
}
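For reference, the call that produces the behavior described above would look like this (a reconstruction; the question doesn't show the call site):

NSString *escaped = AFPercentEscapedQueryStringPairMemberFromStringWithEncoding(@"é", NSUTF8StringEncoding);
NSLog(@"%@", escaped); // %C3%A9: the two UTF-8 bytes of U+00E9, each percent-escaped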
I had hoped that passing UTF-16 as the string encoding would solve it, but it doesn't. In that case the result is "%FF%FE%E9%00", which contains "%E9", but I must be missing something obvious.
Somehow I can't get my head around it.
Any pointers would be awesome.
RFC 3986 explains that, unless the characters you're encoding fall into the unreserved US-ASCII range, the convention is to convert the character to its byte value (here, its UTF-8 encoding) and use those bytes as the basis of the percent encoding.
The behavior you're seeing is correct.
The disparity between the encoded values given for UTF-8 vs. UTF-16 is due to a couple of factors.
Encoding Differences
First, there's the difference in the way the respective encodings are defined. UTF-16 uses two bytes per code unit, so two bytes for every character in the Basic Multilingual Plane, and essentially concatenates the higher-order byte with the lower-order byte to form the code. (The ordering of these bytes depends on whether the code is encoded as little-endian or big-endian.) UTF-8, on the other hand, uses a dynamic number of bytes, depending on where in the Unicode range the character falls. UTF-8 signals how many bytes it is going to use through the bits set in the first byte itself.
So if we look at C3 A9, that translates into the following bits:
1100 0011 1010 1001
Looking at RFC 2279, we see that the leading run of '1's followed by a terminating '0' denotes how many bytes will be used; in this case, 2. Stripping off the initial 110 metadata, we're left with 00011 from the first byte: those are the leftmost bits of the actual value.
For the next byte (1010 1001), the RFC tells us that every subsequent byte carries a 10 "prefix" as metadata before the actual value. Stripping that off, we're left with 101001.
Concatenating the actual value bits, we end up with 00011 101001, which is 233 in base-10, or E9 in base-16.
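The same decoding can be done in a few lines; a sketch covering only the two-byte UTF-8 case described here:

// Decode a two-byte UTF-8 sequence by hand: C3 A9 -> U+00E9.
unsigned char b1 = 0xC3, b2 = 0xA9;
unsigned int codePoint = ((b1 & 0x1F) << 6)  // keep the 5 payload bits of byte 1
                       | (b2 & 0x3F);        // keep the 6 payload bits of byte 2
NSLog(@"U+%04X (%u)", codePoint, codePoint); // U+00E9 (233)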
Encoding Identification
The other thing to consider, specifically with the UTF-16 value (%FF%FE%E9%00), is that the original RFC notes there is no explicit indication, within the encoded value itself, of which encoding was used. So in this case iOS is "cheating" by giving you a hint about the encoding: FF FE is the well-known byte order mark used in UTF-16-encoded files, here marking the data as little-endian UTF-16. As for E9 00: as mentioned, UTF-16 uses two bytes per character here; since all of this character's data fits in one byte, the other byte is simply null.
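You can reproduce that output by passing UTF-16 to the same helper shown in the question (again a reconstruction of the call site):

NSString *escaped16 = AFPercentEscapedQueryStringPairMemberFromStringWithEncoding(@"é", NSUTF16StringEncoding);
NSLog(@"%@", escaped16); // %FF%FE%E9%00: the BOM (FF FE) followed by U+00E9 in UTF-16LE (E9 00)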

Writing a unicode character with NSString

I'm using the symbol font Symbolicons instead of images in a new project. However, it seems that any code point needing more than four hex digits can't be set using NSString.
Example:
self.saveDealButton.titleLabel.font = [UIFont fontWithName:@"SS Symbolicons" size:31.0f];
[self.saveDealButton setTitle:@"\u1F4E5" forState:UIControlStateNormal];
Will not work, however:
self.shareButton.titleLabel.font = [UIFont fontWithName:@"SS Symbolicons" size:31.0f];
[self.shareButton setTitle:@"\uF601" forState:UIControlStateNormal];
Works fine. How can I get NSString to recognize the extra bit?
For characters in the Supplementary Multilingual Plane, as in your example, use an uppercase U in the escape sequence, followed by eight hex digits. So it should be written as \U0001F4E5.
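Both spellings below should produce the same string; the runtime variant is handy when the code point is only known dynamically (a sketch using the asker's U+1F4E5, whose UTF-16 surrogate pair is D83D DCE5):

// Compile time: eight-digit \U escape for a Supplementary Multilingual Plane character.
NSString *viaEscape = @"\U0001F4E5";

// Run time: the equivalent UTF-16 surrogate pair.
unichar pair[2] = { 0xD83D, 0xDCE5 };
NSString *viaPair = [NSString stringWithCharacters:pair length:2];

NSLog(@"%d", [viaEscape isEqualToString:viaPair]); // 1: same code point either way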
In iOS, the \unnnn escape denotes a 16-bit representation of a Unicode character, with nnnn between 0000 and ffff in hexadecimal notation.
In your example, \uF601 represents one character, and you can add another character by appending another escape sequence: \uF601\uF602, and so on.
It seems to me that you misunderstood the escape syntax.

Why do string constants use wide characters even when formed entirely from 8 bit characters?

I just posted a question about Unicode character constants, where $HIGHCHARUNICODE appeared to be the reason.
Now with the default $HIGHCHARUNICODE OFF (Delphi XE2), why is this:
const
AllLowByteValues =#$00#$01#$02#$03#$04#$05#$06#$07#$08#$09#$0a#$0b#$0c#$0d#$0e#$0f;
AllHighByteValues=#$D0#$D1#$D2#$D3#$D4#$D5#$D6#$D7#$D8#$D9#$Da#$Db#$Dc#$Dd#$De#$Df;
==> Sizeof(AllLowByteValues[1]) = 2
==> Sizeof(AllHighByteValues[1]) = 2
If "All hexadecimal #$xx 2-digit literals are parsed as AnsiChar" for #$80 ... #$FF, then why is AllHighByteValues a unicode String and not an ANSIString?
That's because string constants are PChar and so made up of UTF-16 elements.
From the documentation:
String constants are assignment-compatible with the PChar and PWideChar types, which represent pointers to null-terminated arrays of Char and WideChar values.
You are not taking into account that string and character literals are context-sensitive in D2009+. If a literal is used in an Ansi context, it is stored as Ansi; if it is used in a Unicode context, it is stored as Unicode.
HIGHCHARUNICODE only applies to 3-digit numeric character literals between #128 and #255 and to 2-digit hex character literals between #$80 and #$FF. Those particular values are ambiguous between Ansi and Unicode, so HIGHCHARUNICODE is used to resolve the ambiguity. HIGHCHARUNICODE does not apply to other kinds of literals, including string literals.
If you pass a string or character literal to SizeOf(), there is no Ansi/Unicode context in the source code for the compiler to use, so it will use a Unicode context, except in the specific case where HIGHCHARUNICODE applies, in which case an Ansi context is used if HIGHCHARUNICODE is OFF. That is what you are seeing.
