Index character instead of byte in the Delphi string - delphi

I am reading the document on index to Delphi string, as below:
http://docwiki.embarcadero.com/RADStudio/Tokyo/en/String_Types_(Delphi)
One statement said:
You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i, an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains characters in the Basic Multilingual Plane (BMP), all characters are 2 bytes, so indexing the string gets characters. However, if some characters are not in the BMP, an indexed element may be a surrogate pair - not an entire character.
If I understand correctly, S[i] is index to the i-th byte of the string. If S is a UnicodeString, then S[1] is the first byte, S[2] is the 2nd byte of the first character, S[3] is the first byte of the second character, etc. If that is the case, then how do I index the character instead of the byte inside a string? I need to index characters, not bytes.

In Delphi, S[i] is a char aka widechar. But this is not an Unicode "character", it is an UTF-16 encoded value in 16 bits (2 bytes). In previous century, i.e. until 1996, Unicode was 16-bit, but it is not the case any more! Please read carrefully the Unicode FAQ.
You may need several widechar to have a whole Unicode codepoint = more or less what we usually call "character". And even this may be wrong, if diacritics are used.
UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.
Originally, Unicode was designed as a pure 16-bit encoding, aimed at
representing all modern scripts. (Ancient scripts were to be
represented with private-use characters.)
Over time, and especially
after the addition of over 14,500 composite characters for
compatibility with legacy sets, it became clear that 16-bits were not
sufficient for the user community. Out of this arose UTF-16.
see UTF-16 FAQ
For proper decoding of Unicode codepoints in Delphi, see Detecting and Retrieving codepoints and surrogates from a Delphi String (link by #LURD in comments)

Related

Delphi Unicode String Length in Bytes

I'm working on porting some Delphi 7 code to XE4, so, unicode is the subject here.
I have a method where a string gets written to a TMemoryStream, so according to this embarcadero article, I should multiply the length of the string (in characters) times the size of the Char type to get the length in bytes that is needed for the length (in bytes) parameter to WriteBuffer.
so before:
rawHtml : string; //AnsiString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml);
after:
rawHtml : string; //UnicodeString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml)* SizeOf(Char));
My understanding of Delphi's UnicodeString type is that it's UTF-16 internally. But my general understanding of Unicode is that not all unicode characters can be represented even in 2 bytes, that some corner case foreign characters will take 4 bytes. Another of embarcadero's articles seems to confirm that my suspicions, "In fact, it isn’t even always true that one Char is equal to two bytes!"
So...that leaves me wondering whether Length(rawHtml)* SizeOf(Char) is really going to be robust enough to be consistently accurate, or whether there's a better way to determine the size of the string that will be more accurate?
Delphi's UnicodeString is encoded with UTF-16. UTF-16 is a variable length encoding, just like UTF-8. In other words, a single Unicode code point may require multiple character elements to encode it. As a point of interest, the only fixed length Unicode encoding is UTF-32. The UTF-16 encoding uses 16 bit character elements, hence the name.
In a Unicode Delphi, Char is an alias for WideChar which is a UTF-16 character element. And string is an alias for UnicodeString, which is an array of WideChar elements. The Length() function returns the number of elements in the array.
So, SizeOf(Char) is always 2 for UnicodeString. Some Unicode code points are encoded with multiple character elements, or Chars. But Length() returns the number of characters elements and not the number of code points. The character elements all have the same size. So
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml)* SizeOf(Char));
is correct.
My understanding of Delphi's UnicodeString type is that it's UTF-16
internally.
You are correct about UTF-16 encoding of Delphi's UnicodeString. This means what one 16-bit character is wide enough to represent all code points from the Basic Multilingual Plane as exactly one Char element of string array.
But my general understanding of Unicode is that not all
unicode characters can be represented even in 2 bytes, that some
corner case foreign characters will take 4 bytes.
However, you've got a little misconception here. Length function does not perform any deep inspection of characters and simply returns number of 16-bit WideChar elements, without taking into account any surrogates within your string. This means what if you assign a single character from any of Supplementary Planes to the UnicodeString, Length will return 2.
program Egyptian;
{$APPTYPE CONSOLE}
var
S: UnicodeString;
begin
S := #$1304E; // single char
Writeln(Length(S));
Readln;
end.
Conclusion: byte size of string data is always fixed and equals Length(S) * SizeOf(Char), no matter if S contains any variable-length characters.
Others have explained how UnicodeString is encoded and how to calculate its byte length. I just want to mention that the RTL already has such a function - SysUtils.ByteLength():
memorystream1.WriteBuffer(PChar(rawHtml)^, ByteLength(rawHtml));
What you are doing is correct (with the sizeof(Char)).
What you refer to is that not one character refers to one code point (due to surrogate pairs for example). But the USC2 encoded (NOT UTF-16) characters in the string take up exactly the amount of bytes with Length( Str ) * sizeof( Char ).
Note that the Unicode encoding used in Delphi is the same as all Windows API call expect in the ....W variants.

Why Lua's string can contain characters with any numeric value?

I read something aboue string there:
http://www.lua.org/pil/2.4.html
Lua is eight-bit clean and so strings may contain characters with any numeric value, including embedded zeros.
What is that eight-bit clean means?
Why it can contain characters with any numeric value ? (different with basic c strings)
There are two common ways to store strings:
Characters and Terminator
Length and Characters
When you use #1, you need to "sacrifice" one character to serve as the terminator; when you use #2, you do not have such limitation.
C uses the first method of storing strings. It uses character zero to serve as the terminator; the other 255 characters can be used to represent characters of the string.
Lua uses the second method of storing strings. All 256 possible character values, including zeros, can be used in Lua strings. For example, you can construct a three-character string from characters 'A', 0, 'B', and Lua will treat it as a three character string. You can construct the same string in C, but its string-processing libraries will treat it as a single-character string: strlen would return 1, puts will write character A and stop, and so on.
The Lua string type is a counted sequence of bytes. A byte can hold any value between 0 and 255.
The string type is used for character strings. You are right, few character set encodings allow any byte value or sequence of byte values. Code page 437 is one that does; It maps 256 characters to 256 values, one byte per character. Windows-1252 does not; It maps 251 characters to 251 values, one byte per character. UTF-8 maps 1,112,064 characters to sequences of one to four bytes, where some values of bytes are not used and some sequences of values are not used.
The Lua string library does have functions that treats bytes as characters. Their behavior is influenced by the implementation's libraries, which typically uses the C runtime along with its locale features.
There are specialized libraries for Lua to explicitly handle various character set encodings.

What Character Encoding is this? [Example Character to Int value provided]

I is represented as 21321 when printed as an Integer.
The data is coming from a device into a Delphi DLL and being passed to me to write out. However, it does not sit well with Delphi's Ansi string conversions.
I just need to know possible character encodings this may be, so I can begin to identify how to convert it properly.
The number 21321 is 5349 in hexadecimal, and interpreted a 8-bit values, 53 and 49 are the ASCII codes for the Latin letters “S” and “I.” So my guess is that the data is actually “SI” in ASCII or some compatible encoding.
It is difficult to imagine any encoding where “I” would be 5349 hexadecimal, so this is about something else than just an unknown encoding.

Translating memory contents into a string via ASCII encoding

I have to translate some memory contents into a string, using ASCII encoding. For example:
0x6A636162
But I am not sure how to break that up, to be translated into ASCII. I think it has something to do with how many bits are in a char/digit, but I am not sure how to go about doing so (and of course, I would like to know more of the reasoning behind it, not just "how to do it").
ASCII uses 7 bits to encode a character (http://en.wikipedia.org/wiki/ASCII). However, it's common to encode characters using 8 bits instead (note that technically this isn't ASCII). Thus, you'd need to split your data into 8-bit chunks and match that to an ASCII table.
If you're using a specific programming language, it may have a way to translate an ASCII code to a character. For instance, Ruby has the .chr method, Python has the chr() built-in function, and in C you can printf("%c", number).
Note that each nibble (4 bits) can be represented as one hexadecimal digit, so for the sample string you show, each 8-bit "chunk" would be:
0x6A
0x63
0x61
0X62
the string reads "jcab" :)

How can I convert unicode characters to ascii codes in delphi 7?

Yes we're talking about ASCII codes. My appologies I'm not the Delphi dev here.
For Delphi 7, I'd get the free Unicode Library by Mike Lischke who is the author of Virtual Treeview.
The libary includes a lot of conversion functions to go to and from Unicode, so you can use the ones that make most sense in your application.
Or you can upgrade to Delphi 2009 which has built-in encoding routines, and its own library of conversion functions.
Let's get a few things straight. Character set (charset) and character encodings are two related but different concepts. A character set is an abstract list of characters with some sort of integer character code associated. Then there are character encodings, which is basically an algorithm that describes how the characters are represented in bytes.
ASCII acts as both the character set and encoding. It uses 7 bits to express 128 characters (94 printable). Unicode on the other hand is a character set, expressing 1,114,112 code points. There are several encodings to represent Unicode strings but most notable ones are UTF-8, UTF-16, UTF-16LE, and UTF-32. In other words, a single Unicode character can be represented in different ways depending on the encodings.
How can I convert unicode characters to ascii codes in delphi 7?
I think the question could be interpreted in two ways.
I have a Unicode string in some encoding that only includes ASCII printable characters. How can I convert the string into a byte array of ASCII encoding?
I have a Unicode string in some encoding that also includes non-ASCII printable characters such as Chinese characters. How can I encode the string into a ASCII encoding without losing information, and later decode it back to the original Unicode string?
If you mean the first, you can load the Unicode string into WideString like Osman is saying and do
var
original: WideString;
s: AnsiString;
begin
s := AnsiString(original);
If you mean the second, you would need a generic encoding algorithm like Base64 encoding. You can use DCPBase64.pas included in David Barton's DCPcrypt v2 Beta 3.
It depends what your definition of conversion is. If you want to map the 127 lowest characters to the Unicode equivalent, you can use an explicit cast. But this creates garbage if the string contains higher characters.
If you want mappings like ë -> e and û -> u, you can write your own code. But be aware that there are always characters that can't be converted.
"ASCII" is the name of a specific mapping of characters to numbers, but some people say "ASCII code" when they don't really mean ASCII at all; they just want the numeric value of a character, whatever mapping is in effect at the time. Does that description apply to you?
If so, then you can use the Ord standard function to get the Unicode code-point value of whatever Unicode character you have.
var
wc: WideChar;
ws: WideString;
x: Word;
x := Ord(wc);
x := Ord(ws[1]);
If you really meant ASCII, though, then you'll have to be more specific about what sort of conversion you have in mind.
As an example, the letter A is represented in unicode as U+0041 and in ansi as just 41. So converting that would be pretty simple, but you must find out how the unicode character is encoded. The most common are UTF-16 and UTF-8. UTF 16, is basically two bytes per character, but even that is an oversimplification, as a character may have more bytes. UTF-8 sounds as if it means 1 byte per character but can be 2 or 3. To further complicate matters, UTF-16 can be little endian or big endian. (U+0041 or U+4100).
Where your question makes no sense is if you wanted to for example convert the arabic letter ain U+0639 to ansi on an English locale. You can't.
See related questions on converting from Unicode to ASCII:
How to convert UTF-8 to US-Ascii in Java
How to convert a Unicode character to its ASCII equivalent
How do I convert a file’s format from Unicode to ASCII using Python?
In general, character set of hundreds thousands entries cannot be converted to character set of 127 entries without some loss of information or encoding scheme.
You can use the function in http://swissdelphicenter.ch/en/showcode.php?id=1692
It converts Unicode string to Ansi string using specified code page. If you want convert using default system codepage (defined in regional options as non-unicode codepage) you can do it simply like following:
var
ws: widestring;
s: string;
begin
s:=string(ws)

Resources