Quoting the Delphi XE8 help:
For single-byte and multibyte strings, Length returns the number of bytes used by the string. Example for UTF-8:
Writeln(Length(Utf8String('1¢'))); // displays 3
For Unicode (WideString) strings, Length returns the number of bytes divided by two.
This raises important questions:
Why is the difference in handling there at all?
Why doesn't Length() do what it is expected to do and return just the length of the parameter (as in, the count of elements) instead of giving the size in bytes in some cases?
Why does it state that it divides the result by 2 for Unicode (UTF-16) strings? AFAIK a UTF-16 character can be 4 bytes at most, and thus this will give incorrect results.
Length returns the number of elements when considering the string as an array.
For strings with 8-bit element types (ANSI, UTF-8), Length gives you the number of bytes, because the number of bytes is the same as the number of elements.
For strings with 16-bit elements (UTF-16), Length is half the number of bytes, because each element is 2 bytes wide.
Your string '1¢' has two code points, but the second code point requires two bytes to encode it in UTF-8. Hence Length(Utf8String('1¢')) evaluates to three.
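To see the element counts side by side, here is a minimal sketch (assuming a Unicode-aware Delphi, where string is UnicodeString):

program LengthDemo;
{$APPTYPE CONSOLE}
var
  U8: UTF8String;
  U16: string; // UnicodeString
begin
  U8 := UTF8String('1¢');
  U16 := '1¢';
  Writeln(Length(U8));  // 3: '1' is one byte, '¢' (U+00A2) takes two bytes in UTF-8
  Writeln(Length(U16)); // 2: each of the two code points fits in one UTF-16 element
end.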
You mention SizeOf in the question title. Passing a string variable to SizeOf will always return the size of a pointer, since a string variable is, under the hood, just a pointer.
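A small sketch of that point (the exact value of SizeOf depends on the target platform's pointer size):

program SizeOfDemo;
{$APPTYPE CONSOLE}
var
  S: string;
begin
  S := 'hello';
  Writeln(SizeOf(S)); // 4 on 32-bit targets, 8 on 64-bit: the size of a pointer
  Writeln(Length(S)); // 5: the element count
end.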
To your specific questions:
Why is the difference in handling there at all?
There is only a difference if you think of Length as relating to bytes. But that's the wrong way to think about it. Length always returns an element count, and when viewed that way, the behaviour is uniform across all string types, and indeed across all array types.
Why doesn't Length() do what it is expected to do and return just the length of the parameter (as in, the count of elements) instead of giving the size in bytes in some cases?
It does always return the element count. It just so happens that when the element size is a single byte, the element count and the byte count are the same. In fact, the documentation that you refer to also contains the following just above the excerpt that you provided: "Returns the number of characters in a string or of elements in an array." That is the key text. The excerpt that you included is meant as an illustration of the implications of that sentence.
Why does it state that it divides the result by 2 for Unicode (UTF-16) strings? AFAIK a UTF-16 character can be 4 bytes at most, and thus this will give incorrect results.
UTF-16 character elements are always 16 bits wide. However, some Unicode code points require two character elements to encode. These pairs of character elements are known as surrogate pairs.
You are hoping, I think, that Length will return the number of code points in a string. But it doesn't. It returns the number of character elements. And for variable length encodings, the number of code points is not necessarily the same as the number of character elements. If your string was encoded as UTF-32 then the number of code points would be the same as the number of character elements since UTF-32 is a constant sized encoding.
A quick way to count the code points is to scan through the string checking for surrogates: each half of a surrogate pair contributes one to a tally, and every other character element contributes two, so the tally always ends up at twice the code point count. In a modern Delphi this can be written directly, since the System.Character unit gives Char an IsSurrogate helper:

function CountCodePoints(const S: string): Integer;
var
  C: Char;
begin
  Result := 0;
  for C in S do
    if C.IsSurrogate then
      Inc(Result)     // half of a surrogate pair: two halves make one code point
    else
      Inc(Result, 2); // a standalone element is a whole code point
  Result := Result div 2;
end;
Another point to make is that the code point count is not the same as the visible character count. Some code points are combining characters and are combined with their neighbouring code points to form a single visible character or glyph.
Finally, if all you are hoping to do is find the byte size of the string payload, use this expression:
Length(S) * SizeOf(S[1])
This expression works for all types of string.
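For example (a quick sketch, assuming a Unicode Delphi):

program PayloadBytes;
{$APPTYPE CONSOLE}
var
  A: AnsiString;
  U8: UTF8String;
  W: string;
begin
  A := 'abc';
  U8 := UTF8String('1¢');
  W := '1¢';
  Writeln(Length(A) * SizeOf(A[1]));   // 3: one-byte elements
  Writeln(Length(U8) * SizeOf(U8[1])); // 3: one-byte elements
  Writeln(Length(W) * SizeOf(W[1]));   // 4: two-byte elements
end.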
Be very careful about the function System.SysUtils.ByteLength. On the face of it this seems to be just what you want. However, that function returns the byte length of a UTF-16 encoded string. So if you pass it an AnsiString, say, then the value returned by ByteLength is twice the number of bytes of the AnsiString.
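A sketch of the pitfall (ByteLength takes a UnicodeString parameter, so an AnsiString argument is converted before being measured):

program ByteLengthPitfall;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
var
  A: AnsiString;
begin
  A := 'abc';                        // 3 bytes of payload
  Writeln(ByteLength(A));            // 6: the AnsiString is converted to UTF-16 first
  Writeln(Length(A) * SizeOf(A[1])); // 3: the actual byte size of the payload
end.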
Related
I am reading the documentation on indexing Delphi strings, linked below:
http://docwiki.embarcadero.com/RADStudio/Tokyo/en/String_Types_(Delphi)
One statement says:
You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i, an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains characters in the Basic Multilingual Plane (BMP), all characters are 2 bytes, so indexing the string gets characters. However, if some characters are not in the BMP, an indexed element may be a surrogate pair - not an entire character.
If I understand correctly, S[i] is index to the i-th byte of the string. If S is a UnicodeString, then S[1] is the first byte, S[2] is the 2nd byte of the first character, S[3] is the first byte of the second character, etc. If that is the case, then how do I index the character instead of the byte inside a string? I need to index characters, not bytes.
In Delphi, S[i] is a Char, a.k.a. WideChar. But this is not a Unicode "character"; it is a UTF-16 encoded value in 16 bits (2 bytes). In the previous century, i.e. until 1996, Unicode was 16-bit, but that is not the case any more! Please read the Unicode FAQ carefully.
You may need several WideChar elements to hold a whole Unicode code point, which is more or less what we usually call a "character". And even this may be wrong if diacritics are used.
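A short sketch of what indexing actually returns for a code point outside the BMP (U+1F600, which UTF-16 encodes as the surrogate pair $D83D/$DE00; the #$1F600 literal syntax assumes a recent Delphi):

program IndexingDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
var
  S: string;
begin
  S := #$1F600;                    // one code point, two Char elements
  Writeln(Length(S));              // 2
  Writeln(IntToHex(Ord(S[1]), 4)); // D83D: the high surrogate
  Writeln(IntToHex(Ord(S[2]), 4)); // DE00: the low surrogate
end.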
UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.
Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16.
see the UTF-16 FAQ
For proper decoding of Unicode code points in Delphi, see Detecting and Retrieving codepoints and surrogates from a Delphi String (link by @LURD in the comments)
I have a Delphi 7 application where I deal with ANSI strings and I need to count their number of characters (as opposed to the number of bytes). I always know the Charset (and thus the code page) associated with the string.
So, knowing the Charset (code page), I'm currently using MultiByteToWideChar to get the number of characters. It's useful when the Charset is one of the Chinese, Korean, or Japanese charsets where most of the characters are 2 bytes in length and simply using the Length function won't give me what I want.
However, it still counts composite characters as two characters, and I need them counted as one. Now, some composite characters have precomposed versions in Unicode; those would be counted correctly as one character, since MB_PRECOMPOSED is used by default. But many characters simply don't exist in precomposed form (for example, characters in Hebrew, Arabic, Thai, etc.), and those are counted as two.
So the question really is: How to count composite characters as single characters? I don't mind converting the ANSI strings to Wide strings to count the number of characters, I'm already doing it with MultiByteToWideChar anyway.
You can count the Unicode code points like this:
function CodePointCount(P: PWideChar): Integer;
var
  Count: Integer;
begin
  Count := 0;
  while Word(P^) <> 0 do
  begin
    if (Word(P^) >= $D800) and (Word(P^) <= $DFFF) then
      // part of a surrogate pair
      Inc(Count)
    else
      Inc(Count, 2);
    Inc(P);
  end;
  Result := Count div 2;
end;
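A hypothetical usage check (U+12345 encodes as the surrogate pair $D808/$DF45, so four WideChar elements hold three code points):

var
  W: WideString;
begin
  W := WideString('ab') + WideChar($D808) + WideChar($DF45);
  Writeln(CodePointCount(PWideChar(W))); // 3: 'a', 'b', and one supplementary-plane code point
end;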
This covers the issue that you did not mention: namely, that UTF-16 is a variable-width encoding.
However, this will not tell you the number of glyphs represented by a UTF-16 string. That's because some code points represent combining characters. These combining characters combine with their neighbours to form a single equivalent character. So, multiple code-points, single glyph. More information can be found here: http://en.wikipedia.org/wiki/Unicode_equivalence
This is the harder issue. To solve it your code needs to fully understand the meaning of each Unicode code point. Is it a combining character? How does it combine? Really you need a dedicated Unicode library. For instance ICU.
The other suggestion I have for you is to give up using ANSI code pages. If you really care about internationalisation then you need to use Unicode.
I'm working on porting some Delphi 7 code to XE4, so Unicode is the subject here.
I have a method where a string gets written to a TMemoryStream, so according to this Embarcadero article, I should multiply the length of the string (in characters) by the size of the Char type to get the length in bytes to pass as the count parameter to WriteBuffer.
so before:
rawHtml : string; //AnsiString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml));
after:
rawHtml : string; //UnicodeString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml)* SizeOf(Char));
My understanding of Delphi's UnicodeString type is that it's UTF-16 internally. But my general understanding of Unicode is that not all Unicode characters can be represented even in 2 bytes, and that some corner-case foreign characters will take 4 bytes. Another of Embarcadero's articles seems to confirm my suspicions: "In fact, it isn’t even always true that one Char is equal to two bytes!"
So...that leaves me wondering whether Length(rawHtml)* SizeOf(Char) is really going to be robust enough to be consistently accurate, or whether there's a better way to determine the size of the string that will be more accurate?
Delphi's UnicodeString is encoded with UTF-16. UTF-16 is a variable length encoding, just like UTF-8. In other words, a single Unicode code point may require multiple character elements to encode it. As a point of interest, the only fixed length Unicode encoding is UTF-32. The UTF-16 encoding uses 16 bit character elements, hence the name.
In a Unicode Delphi, Char is an alias for WideChar which is a UTF-16 character element. And string is an alias for UnicodeString, which is an array of WideChar elements. The Length() function returns the number of elements in the array.
So, SizeOf(Char) is always 2 for UnicodeString. Some Unicode code points are encoded with multiple character elements, or Chars. But Length() returns the number of character elements, not the number of code points. The character elements all have the same size. So
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml)* SizeOf(Char));
is correct.
My understanding of Delphi's UnicodeString type is that it's UTF-16 internally.
You are correct about the UTF-16 encoding of Delphi's UnicodeString. This means that one 16-bit Char element is wide enough to represent every code point from the Basic Multilingual Plane as exactly one element of the string array.
But my general understanding of Unicode is that not all Unicode characters can be represented even in 2 bytes, and that some corner-case foreign characters will take 4 bytes.
However, you've got a little misconception here. The Length function does not perform any deep inspection of characters; it simply returns the number of 16-bit WideChar elements, without taking any surrogates within your string into account. This means that if you assign a single character from any of the Supplementary Planes to a UnicodeString, Length will return 2.
program Egyptian;
{$APPTYPE CONSOLE}
var
  S: UnicodeString;
begin
  S := #$1304E;       // a single code point from the Supplementary Planes
  Writeln(Length(S)); // prints 2: the code point occupies a surrogate pair
  Readln;
end.
Conclusion: the byte size of the string data is always fixed and equals Length(S) * SizeOf(Char), no matter whether S contains any characters encoded as surrogate pairs.
Others have explained how UnicodeString is encoded and how to calculate its byte length. I just want to mention that the RTL already has such a function - SysUtils.ByteLength():
memorystream1.WriteBuffer(PChar(rawHtml)^, ByteLength(rawHtml));
What you are doing is correct (with the SizeOf(Char)).
What you refer to is that one Char does not always correspond to one Unicode code point (due to surrogate pairs, for example). But since every element of Delphi's UTF-16 encoded string is exactly one 16-bit Char, the characters in the string take up exactly Length(Str) * SizeOf(Char) bytes.
Note that the Unicode encoding used in Delphi is the same one that all Windows API calls expect in their ...W variants.
I read something about strings here:
http://www.lua.org/pil/2.4.html
Lua is eight-bit clean and so strings may contain characters with any numeric value, including embedded zeros.
What does "eight-bit clean" mean?
Why can strings contain characters with any numeric value (unlike basic C strings)?
There are two common ways to store strings:
1. Characters and terminator
2. Length and characters
When you use #1, you need to "sacrifice" one character to serve as the terminator; when you use #2, you do not have such limitation.
C uses the first method of storing strings. It uses character zero to serve as the terminator; the other 255 characters can be used to represent characters of the string.
Lua uses the second method of storing strings. All 256 possible character values, including zeros, can be used in Lua strings. For example, you can construct a three-character string from characters 'A', 0, 'B', and Lua will treat it as a three character string. You can construct the same string in C, but its string-processing libraries will treat it as a single-character string: strlen would return 1, puts will write character A and stop, and so on.
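Incidentally, Delphi strings use the second method too, so the difference is easy to demonstrate in this page's main language (a sketch for a modern Delphi; StrLen is the C-style count from System.AnsiStrings):

program CountedVsTerminated;
{$APPTYPE CONSOLE}
uses
  System.AnsiStrings;
var
  S: AnsiString;
begin
  S := 'A'#0'B';                 // three bytes, the middle one zero
  Writeln(Length(S));            // 3: the stored length counts every byte
  Writeln(StrLen(PAnsiChar(S))); // 1: C-style scanning stops at the embedded zero
end.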
The Lua string type is a counted sequence of bytes. A byte can hold any value between 0 and 255.
The string type is used for character strings. You are right: few character set encodings allow any byte value or any sequence of byte values. Code page 437 is one that does; it maps 256 characters to 256 values, one byte per character. Windows-1252 does not; it maps 251 characters to 251 values, one byte per character. UTF-8 maps 1,112,064 characters to sequences of one to four bytes, where some byte values are not used and some sequences of values are not used.
The Lua string library does have functions that treat bytes as characters. Their behavior is influenced by the implementation's libraries, which typically use the C runtime along with its locale features.
There are specialized libraries for Lua to explicitly handle various character set encodings.
How do you determine how many bits per character are required for a fixed-length code in a string using Huffman? I had an idea that you count the number of different characters in a string and then represent that number in binary, so that will be the fixed length, but it doesn't work. For example, in the string "letty lotto likes lots of lolly" there are 10 different characters, excluding the quotes (since 10 = 1010, 4 bits, I thought all the characters could be represented using 4 bits), but the frequency of "f" is 1 and it is encoded as 11111 (5 bits), not 4.
Let's say you have a string with 50 "A"s, 35 "B"s and 15 "C"s.
With a fixed-length encoding, you could represent each character in that string using 2 bits. There are 100 total characters, so when using this method, the compressed string would be 200 bits long.
Alternatively, you could use a variable-length encoding scheme. If you allow the characters to have a variable number of bits, you could represent "A" with 1 bit ("0"), "B" with 2 bits ("10") and "C" with 2 bits ("11"). With this method, the compressed string is 150 bits long, because the most common pieces of information in the string take fewer bits to represent.
Huffman coding specifically refers to a method of building a variable-length encoding scheme, using the number of occurrences of each character to do so.
The fixed-length algorithm you're describing is entirely separate from Huffman coding. If your goal is to compress text using a fixed-length code, then your method of figuring out how many bits to represent each character with will work.
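As a concrete sketch of that method (a minimal Delphi version; strictly, the width needed is Ceil(Log2(N)) for N distinct characters, i.e. the bits in the largest index N-1 rather than in N itself, which only differs when N is an exact power of two):

program FixedLengthBits;
{$APPTYPE CONSOLE}
uses
  System.Math;

function FixedCodeBits(const S: AnsiString): Integer;
var
  Seen: set of AnsiChar;
  C: AnsiChar;
  N: Integer;
begin
  Seen := [];
  N := 0;
  for C in S do
    if not (C in Seen) then
    begin
      Include(Seen, C);
      Inc(N);
    end;
  // Ceil(Log2(N)) bits are enough to give each of N characters its own code
  Result := Max(1, Ceil(Log2(N)));
end;

begin
  Writeln(FixedCodeBits('letty lotto likes lots of lolly')); // 10 distinct characters -> 4
end.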