How can I convert a 4-byte string into a Unicode emoji? - delphi

A web service I use from my Delphi 10.3 application returns a string to me consisting of these four bytes: F0 9F 99 82. I expect a slightly smiling emoji. This site shows this byte sequence as the UTF-8 representation of that emoji. So I guess I have a UTF-8 representation in my string, but not an actual Unicode string? How do I convert my string into the actual Unicode representation, to show it, for example, in a TMemo?

The character 🙂 has the Unicode code point U+1F642. How text is displayed is defined by an encoding, i.e. how a sequence of bytes has to be interpreted:
in UTF-8 one character can consist of 8, 16, 24 or 32 bits (1 to 4 bytes); this one is $F0 $9F $99 $82.
in UTF-16 one character can consist of 16 or 32 bits (2 or 4 bytes = 1 or 2 Words); this one is $D83D $DE42 (a surrogate pair).
in UTF-32 one character always consists of 32 bits (4 bytes = 1 Cardinal or DWord) and always equals the code point, which is $1F642.
In Delphi, you can use:
TEncoding.UTF8.GetString() for UTF-8
(or TEncoding.Unicode.GetString() if you had UTF-16LE,
and TEncoding.BigEndianUnicode.GetString() if you had UTF-16BE).
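For the four bytes from the question, a minimal sketch could look like the following (assuming the payload ends up in a TBytes; the form, the ShowEmoji method and Memo1 are hypothetical names):

// in the unit's uses clause: System.SysUtils (TEncoding, TBytes, BytesOf)
procedure TForm1.ShowEmoji;
var
  Raw: TBytes;
  S: string;
begin
  // The four UTF-8 bytes returned by the web service.
  Raw := TBytes.Create($F0, $9F, $99, $82);
  // Decode the UTF-8 bytes into Delphi's native UTF-16 string type.
  S := TEncoding.UTF8.GetString(Raw); // S now holds U+1F642 as the surrogate pair $D83D $DE42
  Memo1.Lines.Add(S); // the memo displays 🙂
end;

If the web service hands you the bytes inside a RawByteString or AnsiString instead, BytesOf() can copy them into a TBytes first.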
Keep in mind that 🙂 is just a character like every letter, symbol and whitespace character in this text: it can be selected (e.g. with Ctrl+A) and copied to the clipboard (e.g. with Ctrl+C). No special care is needed.

Related

Concatenate two Base64 encoded strings

I want to decode two Base64 encoded strings and combine them to make one 128-bit string. I am able to decode the Base64 encoded strings. Can someone guide me on how to combine these two decoded strings?
This is the code I used for decoding the two encoded strings.
NSData *decodedData_contentKey = [[NSData alloc] initWithBase64EncodedString:str_content options:0];
NSString *decodedString_contentKey = [[NSString alloc] initWithData:decodedData_contentKey encoding:NSUTF8StringEncoding];
NSLog(@"%@", decodedString_contentKey);
Thanks.
Base 64 is a fixed-size encoding of octets/bytes into characters/text: every 6 bits of input are represented as one printable ASCII character. Hence the name: 2^6 = 64, i.e. it uses an alphabet of 64 characters to encode the binary data (plus a padding character, '=', that does not contain encoded bits).
UTF-8 - used in your sample code - on the other hand is a character encoding. It is used to encode characters into octets, so it works in the other direction. What you are actually doing is decoding characters back from the bytes. UTF-8 does not use 128-bit values, nor is it fixed-size; multiple bytes may be used to represent one character. It will likely fail when it comes across an octet or octets that do not combine into a valid character encoding.
There is no such thing as base 128 encoding. Please think of what you are trying to accomplish and ask a new question that we can decode, if you get stuck.
GUESSED ANSWER:
Base 64 encoding outputs 64 bits (8 characters) of ASCII text for every 6 bytes of input. Therefore, if you want 128 bits (16 characters) of encoded output, you simply have to input 12 bytes. As the Base 64 encoding restarts at each 4-character boundary (4 characters encode 4 * 6 = 24 bits, i.e. exactly 3 bytes of input), you can simply concatenate the two Base 64 strings without decoding them, as long as each one encodes a whole number of 3-byte groups (i.e. has no '=' padding).
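If either string does carry '=' padding, the safe route is to decode both and join the raw bytes. A sketch of that decode-then-concatenate approach (in Delphi, the language of the main question on this page, using System.NetEncoding rather than the asker's Objective-C):

uses
  System.SysUtils, System.NetEncoding;

// Decode two Base64 strings and join their raw bytes (e.g. 8 + 8 = 16 bytes).
function CombineBase64(const A, B: string): TBytes;
var
  BytesA, BytesB: TBytes;
begin
  BytesA := TNetEncoding.Base64.DecodeStringToBytes(A);
  BytesB := TNetEncoding.Base64.DecodeStringToBytes(B);
  SetLength(Result, Length(BytesA) + Length(BytesB));
  if Length(BytesA) > 0 then
    Move(BytesA[0], Result[0], Length(BytesA));
  if Length(BytesB) > 0 then
    Move(BytesB[0], Result[Length(BytesA)], Length(BytesB));
end;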

Showing wrong character for a Unicode value in iOS

I am now working with an iOS app that handles Unicode characters, but it seems there is some problem with translating a Unicode hex value (and its int value too) into a character.
For example, I want to get character 'đ' which has Unicode value of c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean word) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of two above pieces of code are the same.
I can't understand what the problem is here, please help!
The short answer
To specify đ, you can specify it in the following ways (untested):
@"đ"
@"\u0111"
@"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last two lines use C string literals instead of the Objective-C string object literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, and then convert the bytes from UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond the Basic Multilingual Plane (Plane 0) in Unicode.
Unicode escape sequences (Universal character names in C99)
According to this blog1:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows that \unnnn or \Unnnnnnnn, where nnnn or nnnnnnnn are the "short identifier as defined by ISO/IEC 10646", roughly means the hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confused between the code point U+0111 and the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give different byte representations of the character on disk, but since UTF-8, UTF-16 and UTF-32 are all encodings for the Unicode character set, the code point for the same character is the same in all three encodings.
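To make the distinction concrete, here is a small console sketch (in Delphi, the language of the main question on this page, purely as an illustration) showing the same code point U+0111 under two encodings:

program CodePointVsBytes;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  Utf8Bytes, Utf16Bytes: TBytes;
begin
  // #$0111 is the character đ (code point U+0111); Delphi strings store it as UTF-16.
  Utf8Bytes := TEncoding.UTF8.GetBytes(#$0111);     // $C4 $91
  Utf16Bytes := TEncoding.Unicode.GetBytes(#$0111); // $11 $01 (UTF-16LE)
  Writeln(Format('UTF-8:  %.2x %.2x', [Utf8Bytes[0], Utf8Bytes[1]]));   // C4 91
  Writeln(Format('UTF-16: %.2x %.2x', [Utf16Bytes[0], Utf16Bytes[1]])); // 11 01
end.

The code point stays U+0111 in both cases; only the byte representation differs per encoding.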
Footnote
1: I think the blog is correct, but if anyone can find official documentation from Apple on this point, it would be better.

Reading a text file as bytes (byte by byte) using delphi 2010

I would like to read a UTF-8 text file byte by byte and get the ordinal (ASCII) value of each byte in the file. Can this be done? If so, what is the best method?
My goal is then to replace 2-byte combinations that I find with one byte (these are set conditions that I have prepared).
For example, if I find a 197 followed by a 158 (decimal values), I will replace them with a single byte, 17.
I don't want to use the standard Delphi I/O operations:
AssignFile
ReSet
ReWrite(OutFile);
ReadLn
WriteLn
CloseFile
Is there a better method? Can this be done using TStream (Reader & Writer)?
Here is an example test I am using. I know there is a character (code point 350, two bytes) starting at column 84. When viewed in a hex editor, the character consists of 197 + 158, so I am trying to find the 197 using my Delphi code and can't seem to find it:
FS1 := TFileStream.Create(ParamStr1, fmOpenRead);
try
  FS1.Seek(0, soBeginning);
  FS1.Position := FS1.Position + 84;
  FS1.Read(B, SizeOf(B));
  if Ord(B) = 197 then
    ShowMessage('True')
  else
    ShowMessage('False');
finally
  FS1.Free;
end;
You can use TFileStream to read all the data from the file into, for instance, an array of bytes, and then scan it for the UTF-8 sequence.
Also please note that a UTF-8 sequence can contain more than 2 bytes.
And in Delphi there is a function, Utf8ToUnicode, which will convert UTF-8 data into a usable Unicode string.
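As a sketch of that approach (the routine and file names are illustrative, not from the question), reading the whole file into a TBytes and replacing every 197 158 pair with the single byte 17 might look like this:

uses
  SysUtils, IOUtils; // TBytes, TFile (available since Delphi 2010)

procedure ReplacePair(const InFile, OutFile: string);
var
  Src, Dst: TBytes;
  i, j: Integer;
begin
  Src := TFile.ReadAllBytes(InFile);
  SetLength(Dst, Length(Src)); // the output is never longer than the input
  i := 0;
  j := 0;
  while i < Length(Src) do
  begin
    if (i < Length(Src) - 1) and (Src[i] = 197) and (Src[i + 1] = 158) then
    begin
      Dst[j] := 17; // replace the two-byte UTF-8 sequence with the single byte 17
      Inc(i, 2);
    end
    else
    begin
      Dst[j] := Src[i];
      Inc(i);
    end;
    Inc(j);
  end;
  SetLength(Dst, j);
  TFile.WriteAllBytes(OutFile, Dst);
end;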
My understanding is that you want to convert a text file from UTF-8 to ASCII. That's quite simple:
StringList.LoadFromFile(UTF8FileName, TEncoding.UTF8);
StringList.SaveToFile(ASCIIFileName, TEncoding.ASCII);
The runtime library comes with all sorts of functionality to convert between different text encodings. Surely you don't want to attempt to replicate this functionality yourself?
I trust you realise that this conversion is liable to lose data. Characters with ordinal greater than 127 cannot be represented in ASCII. In fact every code point that requires more than 1 octet in UTF-8 cannot be represented in ASCII.
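Wrapped into a routine, that conversion might look like this sketch (the procedure name and the file-name parameters are placeholders):

uses
  SysUtils, Classes;

procedure ConvertUtf8FileToAscii(const Utf8FileName, AsciiFileName: string);
var
  StringList: TStringList;
begin
  StringList := TStringList.Create;
  try
    StringList.LoadFromFile(Utf8FileName, TEncoding.UTF8);
    StringList.SaveToFile(AsciiFileName, TEncoding.ASCII); // lossy for code points above 127
  finally
    StringList.Free;
  end;
end;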
You asked the same question 5 hours later in another topic, the answer to which better addresses your specific question:
Replacing a unicode character in UTF-8 file using delphi 2010

Percent escaping special characters like é on iOS

I'm currently struggling with percent escaping special characters on iOS, for instance "é" when contained in a query parameter value.
I'm using AFNetworking, but the issue isn't specific to it.
The "é" character should be percent escaped to "%E9", yet the result is "%C3%A9". The reason is that "é" is represented as those 2 bytes in UTF-8.
The actual percent-escaping method is the well-known one, and I'm passing UTF-8 as the string encoding. The string itself is @"é".
static NSString * AFPercentEscapedQueryStringPairMemberFromStringWithEncoding(NSString *string, NSStringEncoding encoding)
{
static NSString * const kAFCharactersToBeEscaped = @":/?&=;+!@#$()~";
static NSString * const kAFCharactersToLeaveUnescaped = @"[].";
return (__bridge_transfer NSString *)CFURLCreateStringByAddingPercentEscapes(kCFAllocatorDefault, (__bridge CFStringRef)string, (__bridge CFStringRef)kAFCharactersToLeaveUnescaped, (__bridge CFStringRef)kAFCharactersToBeEscaped, CFStringConvertNSStringEncodingToEncoding(encoding));
}
I had hoped passing in UTF-16 string encoding would solve it, but it doesn't. The result is "%FF%FE%E9%00" in this case; it contains "%E9", but I must be missing something obvious.
Somehow I can't get my head around it.
Any pointers would be awesome.
RFC 3986 explains that, unless the characters you're encoding fall into the unreserved US-ASCII range, the convention is to convert the character to its (in this case, UTF-8-encoded) byte value, and use that value as the basis for the percent encoding.
The behavior you're seeing is correct.
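For instance, percent-encoding every UTF-8 byte by hand, as a sketch in Delphi (the language of the main question on this page) just to illustrate the convention, reproduces the %C3%A9 the asker sees; a real encoder would of course leave unreserved ASCII characters alone:

uses
  SysUtils;

// Percent-encode every byte of the UTF-8 representation of S.
// (A real encoder would leave unreserved ASCII characters as they are.)
function PercentEncodeUtf8(const S: string): string;
var
  B: Byte;
begin
  Result := '';
  for B in TEncoding.UTF8.GetBytes(S) do
    Result := Result + '%' + IntToHex(B, 2);
end;

// PercentEncodeUtf8('é') returns '%C3%A9'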
The disparity between the encoded values given for UTF-8 vs. UTF-16 is due to a couple of factors.
Encoding Differences
First, there's the difference in the way the respective encodings are actually defined. UTF-16 always uses two bytes per code unit (characters outside the Basic Multilingual Plane need two code units), and essentially concatenates the higher-order byte with the lower-order byte to define the code. (The ordering of these bytes depends on whether the code is encoded as little endian or big endian.) UTF-8, on the other hand, uses a variable number of bytes, depending on where in the Unicode range the character lies. The way UTF-8 signals how many bytes it is going to use is through the bits that are set in the first byte itself.
So if we look at C3 A9, that translates into the following bits:
1100 0011 1010 1001
Looking at RFC 2279, we see that the beginning set of '1's with a terminating '0' denotes how many bytes will be used--in this case, 2. Stripping off the initial 110 metadata, we're left with 00011 from the first byte: that represents the leftmost bits of the actual value.
For the next byte (1010 1001), again from the RFC we see that, for every subsequent byte, 10 will be "prefix" metadata for the actual value. Stripping that off, we're left with 101001.
Concatenating the actual value bits, we end up with 00011 101001, which is 233 in base-10, or E9 in base-16.
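The same arithmetic can be written out in a few lines (a sketch in Delphi, matching the rest of this page; the byte values are the C3 A9 pair discussed above):

program Utf8BitDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  B1, B2: Byte;
  CodePoint: Integer;
begin
  B1 := $C3; // 1100 0011: '110' prefix -> two-byte sequence, payload bits 00011
  B2 := $A9; // 1010 1001: '10' prefix  -> continuation byte, payload bits 101001
  CodePoint := ((B1 and $1F) shl 6) or (B2 and $3F);
  Writeln(CodePoint);              // 233
  Writeln(IntToHex(CodePoint, 4)); // 00E9, i.e. U+00E9 ('é')
end.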
Encoding Identification
The other thing to consider, specifically for the UTF-16 value (%FF%FE%E9%00), comes from the original RFC, which mentions that there is no explicit definition of the encoding used within the encoded value itself. So in this case, iOS is "cheating", giving you an indication of which encoding is used: FF FE is the well-known byte order mark used in UTF-16-encoded files, here denoting little-endian UTF-16. As for E9 00, as mentioned, UTF-16 always uses two bytes per code unit; in this case, since all of the character's data fits in 1 byte, the other byte is simply null.

UTF-16BE to UTF-16LE, and back

I have a BlackBerry project that I'm working on, and I need to convert byte arrays of strings encoded as UTF-16LE (little endian) into byte arrays of strings in the UTF-16BE (big endian) encoding, and vice versa. A server I'm connecting to sends the BlackBerry device byte arrays of strings in the UTF-16LE encoding; however, the device doesn't natively support UTF-16LE. When I try to decode the byte arrays back into strings, the strings are illegible. The device does, however, support UTF-16BE. I also need to reverse this process, i.e. convert a byte array of a string in the UTF-16BE encoding into what the server is expecting (UTF-16LE). Thanks.
I cannot do this on the device:
String test = "test";
byte[] testBytes = test.getBytes("UTF-16LE");// throws UnsupportedEncodingException
I can do this:
String test = "test";
byte[] testBytes = test.getBytes("UTF-16BE");//works
UTF-16 uses two bytes per code unit, with some Unicode code points encoded using one code unit and other code points using two code units (called a surrogate pair).
To convert between UTF-16LE and UTF-16BE, simply loop through the bytes, swapping the order of the 2 bytes of each code unit. The order of the surrogate code units does not change between LE and BE. In other words, simply swap bytes 0 and 1 with each other, swap bytes 2 and 3 with each other, and so on.
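A sketch of that loop, written in Delphi for consistency with the rest of this page (the same index arithmetic carries straight over to a Java byte[]):

uses
  SysUtils;

// Swap every 2-byte code unit in place: UTF-16LE <-> UTF-16BE.
// Assumes Length(Data) is even, i.e. whole code units.
procedure SwapUtf16ByteOrder(var Data: TBytes);
var
  i: Integer;
  Tmp: Byte;
begin
  i := 0;
  while i < Length(Data) - 1 do
  begin
    Tmp := Data[i];
    Data[i] := Data[i + 1];
    Data[i + 1] := Tmp;
    Inc(i, 2);
  end;
end;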
