Escape Unicode Characters for iOS

There are some Unicode characters that I want to use in my app, and I am having trouble properly escaping them for use.
For instance, this Unicode sequence: 🅰
If I escape it using an online tool, I get: \ud83c\udd70
But of course this is an invalid sequence per the compiler:
var str = NSString.stringWithUTF8String("\ud83c\udd70")
Also if I do this:
var str = NSString.stringWithUTF8String("\ud83c")
I get an error "Invalid Unicode Scalar"
I'm trying to use these Unicode "fonts":
http://www.panix.com/~eli/unicode/convert.cgi?text=abcdefghijklmnopqrstuvwxyz
If I view the source of this website I see sequences like this:
&#x1D552
I'm struggling to wrap my head around the "proper" way to work with and escape Unicode, and I simply need to figure out a way to get these characters working on iOS.
Any thoughts?

\ud83c\udd70 is a UTF-16 surrogate pair which encodes the Unicode character 🅰 (U+1F170). Swift string literals do not use UTF-16, so that escape sequence doesn't make sense. Since 1F170 has five hex digits, you can't use a \uXXXX escape sequence (which accepts only four hexadecimal digits); instead, use a \UXXXXXXXX sequence (note the capital U), which accepts eight:
var str = "\U0001F170" // returns "🅰"
You can also just paste the character itself into your string:
var str = "🅰" // returns "🅰"
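For what it's worth, in released versions of Swift the escape sequence became \u{...} with one to eight hex digits, and you can also build the string from the scalar value directly. A minimal sketch, assuming a current (non-beta) Swift toolchain rather than the beta syntax shown above:

let fromEscape = "\u{1F170}"                      // "🅰"
let fromScalar = String(UnicodeScalar(0x1F170)!)  // "🅰"; the UnicodeScalar initializer is failable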

Swift is in an early beta and is broken in many ways; this issue is a Swift bug.
let ringAboveA: String = "\u0041\u030A" is Å and is accepted
let negativeSquaredA: String = "\uD83C\uDD70" is 🅰 and produces an error
Both escape sequences are accepted by Objective-C. The difference is that the composed character 🅰 is in Plane 1 (outside the Basic Multilingual Plane), so in UTF-16 it has to be written as a surrogate pair.
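To see that the surrogate pair is purely a UTF-16 encoding detail, here is a small Swift sketch (assuming a current, non-beta Swift version) that assembles 🅰 from its two UTF-16 code units:

let units: [UInt16] = [0xD83C, 0xDD70]           // UTF-16 surrogate pair for U+1F170
let a = String(decoding: units, as: UTF16.self)  // "🅰"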
Note: to get the UTF32 code point either use the OSX Character Viewer or a code snippet:
NSLog(@"utf32: %@", [@"🅰" dataUsingEncoding:NSUTF32BigEndianStringEncoding]);
utf32: <0001f170>
To get the Character Viewer into the menu bar, go to "System Preferences" (in the Apple menu) > "Keyboard" > "Keyboard" tab and select the checkbox "Show Keyboard & Character Viewers in menu bar". The "Character Viewer" item will appear in the menu bar just to the left of the date.
After entering the character, right-click (or control-click) it in Favorites to copy the search results.
Copied information:
🅰
NEGATIVE SQUARED LATIN CAPITAL LETTER A
Unicode: U+1F170 (U+D83C U+DD70), UTF-8: F0 9F 85 B0
Better yet: add Unicode to the list on the left and select it.
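If you would rather get the code point programmatically than through the Character Viewer, a short Swift sketch (any recent Swift version with Foundation) can print it from the string's Unicode scalars:

import Foundation

let codePoints = "🅰".unicodeScalars.map { String(format: "U+%04X", $0.value) }
print(codePoints)  // ["U+1F170"]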

Related

Validating RGB String using regex in Swift

I've been trying to figure out the best way to validate a user entry, which is a string of comma-separated RGB values. It should only allow strings with no whitespace, in formats such as these (1,12,123; 225,225,2; 32,42,241...).
I've never used regex before, but I'm guessing it would be the best solution? I've been playing around on RegexPal and have gotten this string working:
(#([\da-f]{3}){1,2}(\d{1,3}%?,\s?){3}(1|0?\.\d+)\)|\d{1,3}%?(,\s?\d{1,3}%?){2})
However, not having much luck using it in Swift. I get the error "Invalid escape sequence in literal".
Would appreciate any help with using that regex in Swift, or if there's a better regex string/solution to validating the entry. Thanks!
You can use a raw string literal in Swift (a hash sign before the first double quote and after the last double quote) to avoid having to manually add a backslash before any special character. Regarding the regex you are using, it would allow the user to enter values above the 255 limit.
The regex below, adapted from this post, limits the values to 0-255 and would allow the user to enter one or more RGB values followed by ";" or "; ":
#"^\((((([1]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])),){2}(([1]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))(;|; )?){1,}\)$"#
extension StringProtocol {
    var isValidRGB: Bool {
        range(of: #"^\((((([1]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])),){2}(([1]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))(;|; )?){1,}\)$"#,
              options: .regularExpression) != nil
    }
}
"(200,55,1)".isValidRGB // true
"(10,99,255; 0,0,10)".isValidRGB // true
"(2,2,2;)".isValidRGB // true
"(2,2,2;2)".isValidRGB // false
"(2,2,2;2,2)".isValidRGB // false
"(2,2,254;0,0,0)".isValidRGB // true
"(2,2,256;0,0,0)".isValidRGB // false
Add the Swift code where you define the RegEx to your question.
The other poster likely has identified the problem. (@manzarhaq, you should really post your reply as an answer so the OP can accept it.)
The backslash is a special character in Swift strings. It tells the compiler that the next character is special. If you want a literal backslash, you need two backslashes in a row. So your regex string might look like this:
let regExString = "(#([\\da-f]{3}){1,2}(\\d{1,3}%?,\\s?){3}(1|0?\\.\\d+)\\)|\\d{1,3}%?(,\\s?\\d{1,3}%?){2})"
Note that using backslashes this way is common to most languages that derive, even loosely, from C. Swift does have some C in its ancestry.
In many C-like languages, \n is a newline character, \t is a tab character, \f is a form-feed, \" is a quotation mark, and \\ is a literal backslash.
(I don't think the \f form feed character is defined in Swift. That harks back to the days of ASCII driven serial printers.)
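As a quick sanity check that the doubled-backslash version still compiles and matches, you could run it through range(of:options:) with the .regularExpression option. A minimal sketch; note that range(of:) looks for the pattern anywhere in the string, so anchor it with ^...$ if the whole entry must match:

import Foundation

let regExString = "(#([\\da-f]{3}){1,2}(\\d{1,3}%?,\\s?){3}(1|0?\\.\\d+)\\)|\\d{1,3}%?(,\\s?\\d{1,3}%?){2})"
"225,225,2".range(of: regExString, options: .regularExpression) != nil  // true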

How to remove ANSI codes from a string?

I am working on string manipulation using Lua and having trouble with the following problem.
Using this as an example of the original data I am given -
"[0;1;36m(Web): You say, "Text here."[0;37m"
I want to keep the string intact except for removing the ANSI codes.
I have been pointed toward using gsub with Lua pattern matching, but I cannot seem to get the pattern correct. I am also unsure how to reference exactly the escape character that is sent.
text:gsub("[\27\[([\d\;]+)m]", "")
or
text:gsub("%x%[[%d+;+]m", "")
If successful, all I want to be left with, using the above example, would be:
(Web): You say, "Text here."
Your string example is missing the escape character, ASCII 27.
Here's one way:
s = '\x1b[0;1;36m(Web): You say, "Text here."\x1b[0;37m'
s = s:gsub('\x1b%[%d+;%d+;%d+;%d+;%d+m','')
:gsub('\x1b%[%d+;%d+;%d+;%d+m','')
:gsub('\x1b%[%d+;%d+;%d+m','')
:gsub('\x1b%[%d+;%d+m','')
:gsub('\x1b%[%d+m','')
print(s)
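Since the rest of this page is iOS-centric: for comparison only, a rough Swift sketch of the same idea, where the pattern \u{1B}\\[[0-9;]*m is my assumption of what counts as an ANSI colour code (it is not taken from the question):

import Foundation

let raw = "\u{1B}[0;1;36m(Web): You say, \"Text here.\"\u{1B}[0;37m"
let stripped = raw.replacingOccurrences(of: "\u{1B}\\[[0-9;]*m",
                                        with: "",
                                        options: .regularExpression)
print(stripped)  // (Web): You say, "Text here."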

Convert Unicode escape sequence into its corresponding character

I'm receiving a string from the server and it has the special characters encoded. Here's an example:
"El usuario o las contrase\U0000fffda no son v\U0000fffdlidos"
The first one should be an "ñ" and the second one an "á".
I know it's not complicated but I can't find the answer. How can I get the string with the special characters correctly formatted?
Unicode U+FFFD (in your string, displayed as UTF-32 \U0000fffd) is "�", the replacement character. It is often substituted into strings when a system encounters unrecognized characters.
This character really shouldn't appear in string data, since its purpose is to indicate an error in displaying or interpreting the string. Since your server is sending you that character for both ñ and á, there is no way to retrieve the correct characters.
How are you "receiving" this string? It could be that you are accessing the server incorrectly so it isn't sending you an unmodified string.
Unicode escapes for those characters should look like this:
@"tilde-n is \u00f1, and accented-a is \u00e1"
But it's not clear that what you're getting from the server makes any sense. The Objective-C literal must have a lowercase leading "u" followed only by valid hex digits (0-9 and a-f). I don't see a transformation that changes the literals you have into the ones you expect.
Once the characters are formatted properly, the built-in classes will just work, for example, assigning the string to a label's text property will show the user a nice glyph.
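For reference, once the escapes are well formed they decode as expected. A tiny Swift sketch (the Objective-C equivalents written with @"...\u00f1..." behave the same way):

let fixed = "El usuario o las contrase\u{00F1}a no son v\u{00E1}lidos"
print(fixed)  // El usuario o las contraseña no son válidos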

Where should my brackets be in relation to the text for Arabic languages?

Our application automatically modifies the layout of Arabic text when it is followed by a bracket and I was wondering whether this was the correct behaviour or not?
The application shows items in the following format:
[ID of structure](version)
So version 1.5 of the English structure "stackoverflow" would be displayed as:
stackoverflow(1.5)
Note: the brackets need to be displayed. There is no space between the ID and the first bracket. The brackets simply encompass the version. The brackets could have been any character but it's far too late to switch to a different character now!
This works fine for left to right languages, but for Arabic languages the structures appear in the form:
ستاكوفيرفلوو(1.0)
I am not an Arabic speaker and I need to know if this is actually correct. Is the Arabic format the equivalent of the English format or has something gone horribly wrong?
The text in Arabic should be shown like:
ستاكوفيرفلوو(1.0)‏
I added the HTML entity of the RLM / Right-to-Left Mark (U+200F, which is invisible) in order to fix the text. You should do so if your application doesn't support bidi natively. You can add the RLM in these ways:
HTML Entity (decimal): &#8207;
HTML Entity (hex): &#x200F;
HTML Entity (named): &rlm;
How to type in Microsoft Windows: Alt+200F
UTF-8 (hex) 0xE2 0x80 0x8F (e2808f)
UTF-8 (binary) 11100010:10000000:10001111
UTF-16 (hex) 0x200F (200f)
UTF-16 (decimal) 8,207
UTF-32 (hex) 0x0000200F (200f)
UTF-32 (decimal) 8,207
C/C++/Java source code "\u200F"
Python source code u"\u200F"
(note: the right transliteration of StackOverflow is ستاك-أوفرفلو)
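If the string is being composed in code on iOS, the mark can simply be appended. A minimal Swift sketch; structureID and version are hypothetical names standing in for your own values:

let structureID = "ستاكوفيرفلوو"  // hypothetical Arabic structure ID
let version = "1.0"
let display = structureID + "(" + version + ")\u{200F}"  // trailing RLM keeps the brackets on the expected side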

Showing wrong character for a Unicode value in iOS

I am now working on an iOS app that handles Unicode characters, but it seems there is some problem with translating a Unicode hex value (and int value too) into a character.
For example, I want to get the character 'đ', which has the Unicode value c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓑' (a Korean word) instead.
I also used:
int c = 50321; // 50321 is the int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of two above pieces of code are the same.
I can't understand what the problem is here, please help!
The short answer
To specify đ, you can write it in the following ways (untested):
@"đ"
@"\u0111"
@"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last two lines use C string literals instead of the Objective-C string object literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into the proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
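The same distinction can be demonstrated in Swift. A short sketch (assuming Foundation is available) that builds "đ" from its UTF-8 bytes c4 91:

import Foundation

let bytes: [UInt8] = [0xC4, 0x91]               // UTF-8 encoding of U+0111
let dj = String(bytes: bytes, encoding: .utf8)  // Optional("đ")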
Unicode escape sequences (Universal character names in C99)
According to this blog1:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows that \unnnn or \Unnnnnnnn, where nnnn or nnnnnnnn is a "short identifier as defined by ISO/IEC 10646" (roughly, the hexadecimal code point), are universal character names. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confused between the code point U+0111 and the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give different byte representations of the character on disk, but since UTF-8, UTF-16, and UTF-32 are all encodings of the Unicode character set, the code point for the same character is identical across all three.
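A quick way to see this is to inspect the same character through its Unicode scalars and through two encodings. A small Swift sketch:

let dj = "đ"
dj.unicodeScalars.map { $0.value }  // [273], i.e. 0x111, the code point, independent of encoding
Array(dj.utf8)                      // [196, 145], i.e. 0xC4 0x91, the UTF-8 bytes
Array(dj.utf16)                     // [273], i.e. 0x0111, a single UTF-16 code unit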
Footnote
1: I think the blog is correct, but if anyone can find official documentation from Apple on this point, it would be better.
