I'm using the symbol font Symbolicons instead of images in a new project. However, it seems that any code point longer than four hex digits can't be set using NSString.
Example:
self.saveDealButton.titleLabel.font = [UIFont fontWithName:@"SS Symbolicons" size:31.0f];
[self.saveDealButton setTitle:@"\u1F4E5" forState:UIControlStateNormal];
Will not work, however:
self.shareButton.titleLabel.font = [UIFont fontWithName:@"SS Symbolicons" size:31.0f];
[self.shareButton setTitle:@"\uF601" forState:UIControlStateNormal];
Works fine. How can I get NSString to recognize the extra bit?
For characters in the Supplementary Multilingual Plane, as in your example, use an uppercase U in the escape sequence, followed by eight hex digits. So it should be written as \U0001F4E5.
In Objective-C, the \unnnn escape denotes a 16-bit code unit, with nnnn between 0000 and ffff in hexadecimal notation.
In your example, \uF601 represents one character, and you could add another character by appending another sequence: \uF601\uF602, and so on.
It seems to me that you misunderstood the escape syntax.
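Applied to the code from the question (button and font names as in the question), the fix would look something like this:
// U+1F4E5 lies outside the Basic Multilingual Plane, so it needs
// the eight-digit \U escape rather than the four-digit \u form.
self.saveDealButton.titleLabel.font = [UIFont fontWithName:@"SS Symbolicons" size:31.0f];
[self.saveDealButton setTitle:@"\U0001F4E5" forState:UIControlStateNormal];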
There are some Unicode arrangements that I want to use in my app. I am having trouble properly escaping them for use.
For instance this Unicode sequence: 🅰
If I escape it using an online tool I get: \ud83c\udd70
But of course this is an invalid sequence per the compiler:
var str = NSString.stringWithUTF8String("\ud83c\udd70")
Also if I do this:
var str = NSString.stringWithUTF8String("\ud83c")
I get an error "Invalid Unicode Scalar"
I'm trying to use these Unicode "fonts":
http://www.panix.com/~eli/unicode/convert.cgi?text=abcdefghijklmnopqrstuvwxyz
If I view the source of this website I see sequences like this:
𝕒
I'm struggling to wrap my head around what the "proper" way is to work with and escape Unicode, and I simply need to figure out a way to get these working on iOS.
Any thoughts?
\ud83c\udd70 is a UTF-16 surrogate pair which encodes the unicode character 🅰 (U+1F170). Swift string literals do not use UTF-16, so that escape sequence doesn't make sense. However, since 1F170 has five digits you can't use a \uXXXX escape sequence (which only accepts four hexadecimal digits). Instead, use a \UXXXXXXXX sequence (note the capital U), which accepts eight:
var str = "\U0001F170" // returns "🅰"
You can also just paste the character itself into your string:
var str = "🅰" // returns "🅰"
Swift is an early beta and is broken in many ways. This issue is a Swift bug.
let ringAboveA: String = "\u0041\u030A" is Å and is accepted
let negativeSquaredA: String = "\uD83C\uDD70" is 🅰 and produces an error
Both are two-unit UTF-16 sequences that are accepted by Objective-C. The difference is that the composed character 🅰 is in plane 1, so its two units form a surrogate pair rather than a base character plus a combining mark.
Note: to get the UTF-32 code point, either use the OS X Character Viewer or a code snippet:
NSLog(#"utf32: %#", [#"🅰" dataUsingEncoding:NSUTF32BigEndianStringEncoding]);
utf32: <0001f170>
To get the Character Viewer in the menu bar, go to "System Preferences", "Keyboard", "Keyboard" tab and select the checkbox "Show Keyboard & Character Viewers in menu bar". The "Character Viewer" item will be in the menu bar just to the left of the date.
After entering the character, right-click (or control-click) the character in Favorites to copy the search results.
Copied information:
🅰
NEGATIVE SQUARED LATIN CAPITAL LETTER A
Unicode: U+1F170 (U+D83C U+DD70), UTF-8: F0 9F 85 B0
Better yet: add the Unicode category to the list on the left and select it.
We have a key-value pair in the Localization.string file.
"spanish-key" = "Espa\u00f1ol";
When we fetch it and assign it to a label, the app displays it as "Espau00f1ol".
This doesn't work:
self.label1.text = NSLocalizedString(@"spanish-key", nil);
This works and shows it in the required format:
self.label1.text = @"Espa\u00f1ol";
What could be the problem here when we use
NSLocalizedString(@"spanish-key", nil)?
If we set \U instead of \u, then it works.
"spanish-key" = "Espa\U00f1ol";
When to use "\Uxxxx" and "\uxxxx"?
NSString literals and strings-files use different escaping rules.
NSString literals use the same escape sequences as "normal" C-strings, in particular
the "universal character names" defined in the C99 standard:
\unnnn - the character whose four-digit short identifier is nnnn
\Unnnnnnnn - the character whose eight-digit short identifier is nnnnnnnn
Example:
NSString *string = #"Espa\u00F1ol - \U0001F600"; // Español - 😀
Strings-files, on the other hand, use \Unnnn to denote a UTF-16 character,
and "UTF-16 surrogate pairs" for characters > U+FFFF:
"spanish-key" = "Espa\U00f1ol - \Ud83d\Ude00";
(This is the escaping used in "old style property lists", which you can see when printing
the description of an NSDictionary.)
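As a quick illustration, here is a minimal sketch of that behavior (untested; it uses NSString's propertyList method to parse old-style plist text, and the doubled backslashes keep the \U escapes literal inside the literal):
// Old-style plist text containing strings-file style \U escapes.
NSString *plist = @"{ \"spanish-key\" = \"Espa\\U00f1ol - \\Ud83d\\Ude00\"; }";
NSDictionary *dict = [plist propertyList];
NSLog(@"%@", dict[@"spanish-key"]); // Español - 😀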
This (hopefully) answers your question
When to use "\Uxxxx" and "\uxxxx"?
But: As also noted by @gnasher729 in his answer, there is no need to use Unicode
escape sequences at all. You can simply insert the Unicode characters themselves,
both in NSString literals and in strings-files:
NSString *string = #"Español - 😀";
"spanish-key" = "Español - 😀";
Just write the string in proper Unicode in Localization.string.
"spanish-key" = "Español";
I am now working on an iOS app that handles Unicode characters, but it seems there is some problem with translating a Unicode hex value (and its int value, too) to a character.
For example, I want to get the character 'đ', which has a Unicode value of c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean word) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of two above pieces of code are the same.
I can't understand what the problem is here, please help!
The short answer
To specify đ, you can specify it in the following ways (untested):
#"đ"
#"\u0111"
#"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last two lines use C string literals instead of the Objective-C string object literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
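To convince yourself that the last two forms agree, here is a quick sketch (assumes Foundation):
// Build đ once from its UTF-8 bytes and once from its code point.
NSString *fromBytes = [NSString stringWithUTF8String:"\xc4\x91"]; // UTF-8 bytes c4 91
NSString *fromEscape = @"\u0111"; // code point U+0111
NSLog(@"%d", [fromBytes isEqualToString:fromEscape]); // prints 1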
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
Unicode escape sequences (Universal character names in C99)
According to this blog [1]:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows that the escape is \unnnn or \Unnnnnnnn, where nnnn or nnnnnnnn is a "short identifier as defined by ISO/IEC 10646", which roughly means the hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confusing the code point U+0111 with the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give different byte representations of the character on disk, but since UTF-8, UTF-16, and UTF-32 are all encodings for the Unicode character set, the code point for the same character is the same in all three encodings.
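A small sketch makes this concrete: the code point stays U+0111, while the bytes differ per encoding (assumes Foundation; the explicit big-endian encodings avoid a byte-order mark):
NSString *dStroke = @"\u0111"; // đ, code point U+0111
NSLog(@"%@", [dStroke dataUsingEncoding:NSUTF8StringEncoding]); // <c491>
NSLog(@"%@", [dStroke dataUsingEncoding:NSUTF16BigEndianStringEncoding]); // <0111>
NSLog(@"%@", [dStroke dataUsingEncoding:NSUTF32BigEndianStringEncoding]); // <00000111>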
Footnote
[1]: I think the blog is correct, but if anyone can find official documentation from Apple on this point, that would be better.
I have a need to separate a string based on an ASCII control character (in particular, a unit separator, 0x1F).
How can I achieve this with NSString componentsSeparatedByString when it expects a unicode string and I'm providing ASCII (UTF-8)?
Just use the literals:
[#"XXX\x1fYYY" componentsSeparatedByString:#"\x1f"]
or, better:
[#"XXX\x1fYYY" componentsSeparatedByCharactersInSet:[NSCharacterSet controlCharacterSet]]
The ASCII control characters 00-1F are mapped to the same Unicode code points.
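For example, splitting a record with two separators (made-up field values):
NSString *record = @"XXX\x1fYYY\x1fZZZ";
NSArray *fields = [record componentsSeparatedByString:@"\x1f"];
NSLog(@"%@", fields); // XXX, YYY, ZZZ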
I have several small place marks such as 'א,א' 'א,ב'. If we use the comma as the center point, I need at most 2 characters before the comma, and up to the next space after the comma.
I have (.-,.-)%s but it's not doing what I need. Any idea?
Also, as you can see, these are not Latin letters, so using %l will not work.
There are a couple of issues here. First, a minor one: .-, will match as little as possible before the comma, that is, zero characters. You should anchor the beginning of the matched string.
The more complicated issue is that you use Hebrew letters. The problem is that Lua has no concept of multi-byte characters.
If you use an 8-bit encoding such as Windows-1255 or ISO-8859-8, then you can probably simply match against a character class [ת-א]. If you have a properly set Hebrew locale, %l should work fine for you.
If you use UTF-8 or any other encoding that uses multi-byte characters, then you must construct a regex that has all of the Hebrew alphabet escaped as sequences of octets. The aleph is U+05D0, which in UTF-8 is represented as 0xD7 0x90. The tav is U+05EA, which is encoded as 0xD7 0xAA.
In Lua you can escape any 8-bit character with a backslash plus its decimal code. All the Hebrew characters encoded in UTF-8 have the same first byte, 0xD7, that is "\215". The second byte can be anything from "\144" to "\170". Thus, the regex that will match a single Hebrew letter is "\215[\144-\170]". Put that in your original regex where you had single dots that match any character; see the sketch below.
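As a minimal sketch (the sample bytes spell out "א,ב"; anchoring and the "at most 2 before, up to the space after" quantifiers still need to be adapted to your data):
-- heb matches one UTF-8-encoded Hebrew letter: lead byte 0xD7 ("\215"),
-- second byte between "\144" and "\170".
local heb = "\215[\144-\170]"
local s = "\215\144,\215\145 rest" -- the bytes of "א,ב rest"
print(s:match(heb .. "," .. heb)) -- prints א,ב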
Of course, the above reasoning must be modified for encodings other than UTF-8. Right-to-left writing direction in Hebrew is another thing to keep in mind.