In Objective-C...
If I have a character like "∆", how can I get its Unicode value and then determine whether it is in a certain range of values?
For example, how can I tell whether a certain character is in the Unicode range U+1F300 to U+1F6FF?
NSString uses UTF-16 to store codepoints internally, so those in the range you're looking for (U+1F300 to U+1F6FF) will be stored as a surrogate pair (four bytes). Despite its name, characterAtIndex: (and unichar) doesn't know about codepoints and will give you the single 16-bit code unit it sees at the index you pass (the 55357 you're seeing is the lead surrogate of the codepoint in UTF-16).
To examine the raw codepoints, you'll want to convert the string/characters into UTF-32 (which encodes them directly). To do this, you have a few options:
Get the UTF-16 code units that make up the codepoint, and use either the standard surrogate-pair algorithm (sketched after this list) or CFStringGetLongCharacterForSurrogatePair to convert the surrogate pair to UTF-32.
Use either dataUsingEncoding: or getBytes:maxLength:usedLength:encoding:options:range:remainingRange: to convert the NSString to UTF-32, and interpret the raw bytes as a uint32_t.
Use a library like ICU.
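For illustration, here is a minimal sketch of the first option's arithmetic, written in plain C with the surrogate values hard-coded; in an actual app they would come from characterAtIndex: at index i and i + 1.

```c
#include <stdio.h>
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into a UTF-32 codepoint and check
   whether it falls in the range U+1F300..U+1F6FF. */
int main(void) {
    uint16_t lead  = 0xD83C;  /* lead surrogate of U+1F300 (example value)  */
    uint16_t trail = 0xDF00;  /* trail surrogate of U+1F300 (example value) */

    uint32_t codepoint = 0x10000
                       + ((uint32_t)(lead - 0xD800) << 10)
                       + (uint32_t)(trail - 0xDC00);

    if (codepoint >= 0x1F300 && codepoint <= 0x1F6FF)
        printf("U+%04X is in range\n", (unsigned)codepoint);
    else
        printf("U+%04X is out of range\n", (unsigned)codepoint);
    return 0;
}
```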
I am reading the documentation on indexing into Delphi strings, linked below:
http://docwiki.embarcadero.com/RADStudio/Tokyo/en/String_Types_(Delphi)
One statement said:
You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i, an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains characters in the Basic Multilingual Plane (BMP), all characters are 2 bytes, so indexing the string gets characters. However, if some characters are not in the BMP, an indexed element may be a surrogate pair - not an entire character.
If I understand correctly, S[i] indexes the i-th byte of the string. If S is a UnicodeString, then S[1] is the first byte, S[2] is the second byte of the first character, S[3] is the first byte of the second character, and so on. If that is the case, how do I index the characters instead of the bytes inside a string? I need to index characters, not bytes.
In Delphi, S[i] is a Char, i.e. a WideChar. But this is not a Unicode "character"; it is a UTF-16 code unit of 16 bits (2 bytes). In the previous century, i.e. until 1996, Unicode was 16-bit, but that is no longer the case! Please read the Unicode FAQ carefully.
You may need several WideChar values to make up a whole Unicode codepoint, which is more or less what we usually call a "character". And even that may not be enough if combining diacritics are used.
UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.
Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16.
(see the UTF-16 FAQ)
For proper decoding of Unicode codepoints in Delphi, see Detecting and Retrieving codepoints and surrogates from a Delphi String (link provided by @LURD in the comments).
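To make the "several WideChar per codepoint" point concrete, here is a rough sketch, written in plain C over an array of 16-bit code units with hard-coded sample data, of the kind of loop needed to walk a UTF-16 string codepoint by codepoint rather than element by element.

```c
#include <stdio.h>
#include <stdint.h>

/* Walk a UTF-16 sequence codepoint by codepoint: when a lead surrogate
   (0xD800..0xDBFF) is seen, consume the trailing surrogate as well.
   The sample data is "A" followed by U+1F4DB. */
int main(void) {
    uint16_t s[] = { 0x0041, 0xD83D, 0xDCDB };
    size_t len = sizeof s / sizeof s[0];

    for (size_t i = 0; i < len; ) {
        uint32_t cp;
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len) {
            cp = 0x10000 + ((uint32_t)(s[i] - 0xD800) << 10)
                         + (uint32_t)(s[i + 1] - 0xDC00);
            i += 2;  /* a surrogate pair occupies two elements */
        } else {
            cp = s[i];
            i += 1;  /* a BMP character occupies one element */
        }
        printf("U+%04X\n", (unsigned)cp);
    }
    return 0;
}
```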
I have a code point like U+1F4DB in a string and I would like to convert it to a Unicode character (📛). How can I do that?
You can't convert that to a single Char. That code point is outside the range that can be represented in UTF-16 as a single 16-bit element. Instead it is represented by a surrogate pair, two Char elements. In Delphi it would be expressed as the string
#$D83D#$DCDB
as can be discerned from this page: http://www.fileformat.info/info/unicode/char/1f4db/index.htm
In practice I think it would be simpler to paste the character into your source code, inside a string literal, and let the source file be saved encoded as UTF-8. The IDE will prompt to do so automatically. That is, represent it as
'📛'
In the comments you make it clear that you wish to parse arbitrary text of the form U+xxxx. Extract the numeric part and convert it to an integer, then pass it through TCharHelper.ConvertFromUtf32.
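ConvertFromUtf32 does the UTF-16 encoding for you; purely as an illustration of what those steps amount to, here is a sketch in plain C that parses the U+xxxx text and performs the equivalent encoding by hand (the input string is a hard-coded example).

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse text of the form "U+1F4DB" into an integer, then encode it as
   UTF-16: one code unit for BMP values, a surrogate pair above U+FFFF. */
int main(void) {
    const char *text = "U+1F4DB";  /* example input */
    uint32_t cp = (uint32_t)strtoul(text + 2, NULL, 16);

    uint16_t units[2];
    int count;
    if (cp <= 0xFFFF) {
        units[0] = (uint16_t)cp;
        count = 1;
    } else {
        uint32_t v = cp - 0x10000;
        units[0] = (uint16_t)(0xD800 + (v >> 10));    /* lead surrogate  */
        units[1] = (uint16_t)(0xDC00 + (v & 0x3FF));  /* trail surrogate */
        count = 2;
    }

    for (int i = 0; i < count; i++)
        printf("$%04X ", (unsigned)units[i]);  /* prints $D83D $DCDB here */
    printf("\n");
    return 0;
}
```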
I read something about strings here:
http://www.lua.org/pil/2.4.html
Lua is eight-bit clean and so strings may contain characters with any numeric value, including embedded zeros.
What does "eight-bit clean" mean?
Why can it contain characters with any numeric value? (Unlike basic C strings.)
There are two common ways to store strings:
Characters and Terminator
Length and Characters
When you use #1, you need to "sacrifice" one character to serve as the terminator; when you use #2, you do not have such a limitation.
C uses the first method of storing strings. It uses character zero to serve as the terminator; the other 255 characters can be used to represent characters of the string.
Lua uses the second method of storing strings. All 256 possible character values, including zeros, can be used in Lua strings. For example, you can construct a three-character string from the characters 'A', 0, 'B', and Lua will treat it as a three-character string. You can construct the same string in C, but its string-processing libraries will treat it as a single-character string: strlen would return 1, puts would write the character A and stop, and so on.
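The difference is easy to see from the C side. Here is a small sketch using the same 'A', 0, 'B' bytes:

```c
#include <stdio.h>
#include <string.h>

/* The three bytes 'A', 0, 'B' form a three-character string in Lua,
   but C's string functions stop at the embedded zero. */
int main(void) {
    char s[] = { 'A', 0, 'B' };
    printf("strlen: %zu\n", strlen(s));  /* prints 1: stops at the zero */
    puts(s);                             /* prints just "A"             */
    fwrite(s, 1, sizeof s, stdout);      /* writes all three bytes      */
    printf("\n");
    return 0;
}
```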
The Lua string type is a counted sequence of bytes. A byte can hold any value between 0 and 255.
The string type is used for character strings. You are right that few character set encodings allow any byte value or sequence of byte values. Code page 437 is one that does; it maps 256 characters to 256 values, one byte per character. Windows-1252 does not; it maps 251 characters to 251 values, one byte per character. UTF-8 maps 1,112,064 characters to sequences of one to four bytes, where some byte values are not used and some sequences of values are not used.
The Lua string library does have functions that treat bytes as characters. Their behavior is influenced by the implementation's libraries, which typically use the C runtime along with its locale features.
There are specialized libraries for Lua to explicitly handle various character set encodings.
"I" is represented as 21321 when printed as an Integer.
The data is coming from a device into a Delphi DLL and being passed to me to write out. However, it does not sit well with Delphi's Ansi string conversions.
I just need to know possible character encodings this may be, so I can begin to identify how to convert it properly.
The number 21321 is 5349 in hexadecimal, and interpreted as 8-bit values, 53 and 49 are the ASCII codes for the Latin letters “S” and “I.” So my guess is that the data is actually “SI” in ASCII or some compatible encoding.
It is difficult to imagine any encoding where “I” would be 5349 hexadecimal, so this is about something other than just an unknown encoding.
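A quick way to check that reasoning in plain C (the value is hard-coded):

```c
#include <stdio.h>

/* 21321 = 0x5349; taken as two 8-bit values, 0x53 and 0x49 are the
   ASCII codes for 'S' and 'I'. */
int main(void) {
    unsigned value = 21321;
    printf("hex: %X\n", value);                              /* 5349 */
    printf("%c%c\n", (char)(value >> 8), (char)(value & 0xFF)); /* SI */
    return 0;
}
```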
I have to translate some memory contents into a string, using ASCII encoding. For example:
0x6A636162
But I am not sure how to break that up, to be translated into ASCII. I think it has something to do with how many bits are in a char/digit, but I am not sure how to go about doing so (and of course, I would like to know more of the reasoning behind it, not just "how to do it").
ASCII uses 7 bits to encode a character (http://en.wikipedia.org/wiki/ASCII). However, it's common to encode characters using 8 bits instead (note that technically this isn't ASCII). Thus, you'd need to split your data into 8-bit chunks and match each chunk against an ASCII table.
If you're using a specific programming language, it may have a way to translate an ASCII code to a character. For instance, Ruby has the .chr method, Python has the chr() built-in function, and in C you can printf("%c", number).
Note that each nibble (4 bits) can be represented as one hexadecimal digit, so for the sample string you show, each 8-bit "chunk" would be:
0x6A
0x63
0x61
0x62
and the string reads "jcab" :)
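In plain C, the same splitting can be done with shifts and masks, taking the chunks from the most significant byte down (how the bytes are actually ordered in memory depends on the machine's endianness):

```c
#include <stdio.h>
#include <stdint.h>

/* Split the 32-bit value into four 8-bit chunks, most significant byte
   first, and print each one as an ASCII character. */
int main(void) {
    uint32_t value = 0x6A636162;
    for (int shift = 24; shift >= 0; shift -= 8)
        printf("%c", (char)((value >> shift) & 0xFF));  /* j, c, a, b */
    printf("\n");
    return 0;
}
```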