BytesOf() and wchar_t arrays - c++builder

I found that there is an API called System.UnicodeString.BytesOf that returns the contents of a UnicodeString as a byte array.
However, I do not see what the benefit of using this function is.
Instead, we can use a wchar_t array like:
wchar_t szBuf[100];
wcscpy(szBuf, str.c_str());
What is the advantage of the BytesOf function compared to using a wchar_t array?

BytesOf() converts a string to a byte array. The overloaded version that takes a UnicodeString as input converts the UnicodeString data to the OS's default ANSI charset and then copies the resulting data into the array (in other words, BytesOf(UnicodeString) is just a wrapper for TEncoding::Default->GetBytes(UnicodeString)). A wchar_t array filled with wcscpy(), by contrast, holds the original UTF-16 data without any conversion.

Why is Uint8List compatible with List<int> in Dart?

I am a Dart newbie.
Something strange I noticed while learning Dart is that Uint8List seems to be compatible with List<int>.
For example, the IOSink.read() method accepts data of type List<int> as an argument. But it also seems to accept data of type Uint8List as argument directly.
What kind of mechanism is this? It doesn't really convert every byte in the Uint8List to int, does it? That would be very wasteful in terms of efficiency and memory usage.
The Uint8List interface implements List<int>.
That means that it has an implementation of every member of List<int> with a signature that is compatible with List<int>.
It also means that Uint8List is a subtype of List<int> and a Uint8List instance can be used anywhere a List<int> instance is allowed or required.
Making Uint8List implement List<int> was easy, since a Uint8List is a list of (limited) integers, and because Dart only has one integer type, int, there is no problem distinguishing between a "byte" and an integer.
Any integer you read out of a Uint8List will be in the range 0..255.
Any integer you write into a Uint8List will be truncated to its low 8 bits before being stored. Storing the integer 257 into a Uint8List means actually storing the byte with value 1.
The read method will likely just use plain List methods for storing integers into the buffer. If that buffer happens to be a Uint8List, those integers are truncated and take up only a single byte. If not, it just stores integers (which happen to be in the range 0..255) into a List<int> as normal.
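The truncation described above can be sketched outside Dart as well. The following is a rough C++ analogy, not Dart's actual implementation: store_byte and load_byte are hypothetical helper names, and a std::vector<std::uint8_t> stands in for the Uint8List.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: stores an arbitrary integer into a byte buffer,
// keeping only its low 8 bits (similar in spirit to writing into a Uint8List).
void store_byte(std::vector<std::uint8_t>& buffer, std::size_t index, long long value) {
    buffer[index] = static_cast<std::uint8_t>(value & 0xFF);
}

// Hypothetical helper: reading widens the stored byte back to a
// general-purpose integer, which is always in the range 0..255.
int load_byte(const std::vector<std::uint8_t>& buffer, std::size_t index) {
    return buffer[index];
}
```

Storing 257 with store_byte and reading it back with load_byte yields 1, while each element still occupies a single byte in memory.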

Passing constants to TIniFile.ReadString

Do I have to use the L prefix each time I pass a constant to ReadString?
s = MyIni->ReadString (L"ü", L"SomeEntry", "");
The Embarcadero example doesn't say so, but they also don't use non-ASCII characters in their example.
In C++Builder 2009 and later, the entire RTL is based on System::UnicodeString rather than System::AnsiString. Using the L prefix tells the compiler to create a wide string literal (based on wchar_t) instead of a narrow string literal (based on char).
While you don't HAVE to use the L prefix, you SHOULD use it, because it incurs less overhead at runtime. On Windows, constructing a UnicodeString from a wchar_t string is just a simple memory copy, whereas constructing it from a char string performs a data conversion (using the System::DefaultSystemCodePage variable as the codepage for the conversion). That conversion MAY be lossy for non-ASCII characters, depending on the encoding of the narrow string, which is subject to the charset that you save your source file in, as well as the charset used by the compiler when it parses the source file. So there is no guarantee that what you write in a narrow string literal in code is what you will actually get at runtime. Using a wide string literal avoids that ambiguity.
Note that UnicodeString is UTF-16 encoded on all platforms, but wchar_t is used for UTF-16 only on Windows, where wchar_t is a 16-bit data type. On other platforms, where wchar_t is usually a 32-bit data type used for UTF-32, char16_t is used instead. As such, if you need to write portable code, use the RTL's _D() macro instead of using the L prefix directly, eg:
s = MyIni->ReadString(_D("ü"), _D("SomeEntry"), _D(""));
_D() will map a string/character literal to the correct data type (wchar_t or char16_t, depending on the platform you are compiling for). So, when using string/character literals with the RTL, VCL, and FMX libraries, you should get in the habit of always using _D().
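To make the idea behind _D() concrete, here is a minimal sketch in standard C++ of how such a macro could select the literal prefix per platform. MY_D, utf16_char, utf16_string, and code_units are hypothetical names for illustration only; the real _D() is defined in the C++Builder headers and may be implemented differently.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the RTL's _D() macro (an assumption, not the
// actual RTL definition).
#if defined(_WIN32)
  using utf16_char = wchar_t;   // wchar_t is 16-bit on Windows
  #define MY_D(x) L##x
#else
  using utf16_char = char16_t;  // wchar_t is usually 32-bit elsewhere
  #define MY_D(x) u##x
#endif

// Either way, the selected type is expected to be a 16-bit code unit.
static_assert(sizeof(utf16_char) == 2, "expected a 16-bit code unit type");

using utf16_string = std::basic_string<utf16_char>;

// Returns the number of 16-bit code units in a string built from MY_D().
inline std::size_t code_units(const utf16_string& s) {
    return s.size();
}
```

With this sketch, MY_D("\u00FC") produces a one-unit UTF-16 literal on both kinds of platform, without any runtime codepage conversion.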

Delphi Unicode String Length in Bytes

I'm working on porting some Delphi 7 code to XE4, so Unicode is the subject here.
I have a method where a string gets written to a TMemoryStream, so according to this embarcadero article, I should multiply the length of the string (in characters) times the size of the Char type to get the length in bytes that is needed for the length (in bytes) parameter to WriteBuffer.
so before:
rawHtml : string; //AnsiString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml));
after:
rawHtml : string; //UnicodeString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml)* SizeOf(Char));
My understanding of Delphi's UnicodeString type is that it's UTF-16 internally. But my general understanding of Unicode is that not all Unicode characters can be represented even in 2 bytes, and that some corner-case foreign characters will take 4 bytes. Another of Embarcadero's articles seems to confirm my suspicions: "In fact, it isn’t even always true that one Char is equal to two bytes!"
So...that leaves me wondering whether Length(rawHtml)* SizeOf(Char) is really going to be robust enough to be consistently accurate, or whether there's a better way to determine the size of the string that will be more accurate?
Delphi's UnicodeString is encoded with UTF-16. UTF-16 is a variable length encoding, just like UTF-8. In other words, a single Unicode code point may require multiple character elements to encode it. As a point of interest, the only fixed length Unicode encoding is UTF-32. The UTF-16 encoding uses 16 bit character elements, hence the name.
In a Unicode Delphi, Char is an alias for WideChar which is a UTF-16 character element. And string is an alias for UnicodeString, which is an array of WideChar elements. The Length() function returns the number of elements in the array.
So, SizeOf(Char) is always 2 for UnicodeString. Some Unicode code points are encoded with multiple character elements, or Chars. But Length() returns the number of character elements and not the number of code points. The character elements all have the same size. So
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml)* SizeOf(Char));
is correct.
"My understanding of Delphi's UnicodeString type is that it's UTF-16 internally."
You are correct about the UTF-16 encoding of Delphi's UnicodeString. This means that one 16-bit character is wide enough to represent any code point from the Basic Multilingual Plane as exactly one Char element of the string array.
"But my general understanding of Unicode is that not all unicode characters can be represented even in 2 bytes, that some corner case foreign characters will take 4 bytes."
However, you have a small misconception here. The Length function does not perform any deep inspection of characters; it simply returns the number of 16-bit WideChar elements, without taking into account any surrogates within your string. This means that if you assign a single character from any of the Supplementary Planes to a UnicodeString, Length will return 2.
program Egyptian;
{$APPTYPE CONSOLE}
var
  S: UnicodeString;
begin
  S := #$1304E; // single char
  Writeln(Length(S));
  Readln;
end.
Conclusion: the byte size of the string data always equals Length(S) * SizeOf(Char), regardless of whether S contains any variable-length characters.
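The same distinction between code units, code points, and bytes can be sketched in standard C++, using std::u16string as a stand-in for a UTF-16 string. The function names byte_length and code_point_count are hypothetical, and the code point counter assumes well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Number of bytes occupied by the UTF-16 data (terminator not included).
// This is the analogue of Length(S) * SizeOf(Char).
std::size_t byte_length(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}

// Number of Unicode code points: count every code unit except low
// (trailing) surrogates, so each surrogate pair is counted once.
// Assumes the input is well-formed UTF-16.
std::size_t code_point_count(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

For the single supplementary-plane character U+1304E, the string holds 2 code units and 4 bytes but only 1 code point, so the byte count needed for a stream write follows from the code unit count alone.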
Others have explained how UnicodeString is encoded and how to calculate its byte length. I just want to mention that the RTL already has such a function - SysUtils.ByteLength():
memorystream1.WriteBuffer(PChar(rawHtml)^, ByteLength(rawHtml));
What you are doing is correct (with the SizeOf(Char)).
What you are referring to is that one Char does not always correspond to one code point (because of surrogate pairs, for example). But the 16-bit UTF-16 code units in the string take up exactly Length(Str) * SizeOf(Char) bytes.
Note that the Unicode encoding used in Delphi is the same one that Windows API calls expect in their ...W variants.

Can BitConverter be used to reliably extract multi-byte values from an IL byte stream (as returned by MethodBody.GetILAsByteArray)?

I am working on some code that parses IL byte arrays as returned by MethodBody.GetILAsByteArray.
Let's say I want to read a metadata token or a 32-bit integer constant from such an IL byte stream. At first I thought using BitConverter.ToInt32(byteArray, offset) would make this easy. However, I'm now worried that this won't work on big-endian machines.
As far as I know, IL always uses little-endian encoding for multi-byte values:
"All argument numbers are encoded least-significant-byte-at-smallest-address (a pattern commonly termed 'little-endian')." — The Common Language Infrastructure Annotated Standard, Partition III, ch. 1.2 (p. 482).
Since BitConverter's conversion methods honour the computer architecture's endianness (which can be discovered through BitConverter.IsLittleEndian), I conclude that BitConverter should not be used to extract multi-byte values from an IL byte stream, because this would give wrong results on big-endian machines.
Is this conclusion correct?
If yes: Is there any way to tell BitConverter which endianness to use for conversions, or is there any other class in the BCL that offers this functionality, or do I have to write my own conversion code?
If no: Where am I wrong? What is the proper way of extracting e.g. an Int32 operand value from an IL byte array?
When the bytes are little-endian, check the machine's endianness before passing them to BitConverter:
// 'array' holds only the operand's bytes, in little-endian order.
// Are we on a big-endian machine?
if (!BitConverter.IsLittleEndian)
{
    // Then flip the bytes
    Array.Reverse(array);
}
int val = BitConverter.ToInt32(...);
However, you mention an IL stream. As far as I know, the bytecode is laid out like this:
(OPCODE:(1|2):little) (VARIABLES:x:little)
So I would read a byte, check its opcode, then read the appropriate number of operand bytes and flip that array if necessary using the code above. Can I ask what you are doing?
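An alternative that avoids checking the host's endianness at all is to assemble the value from its bytes with shifts. The sketch below shows that technique in C++ rather than C#; read_int32_le is a hypothetical helper name, and the buffer is assumed to hold the operand in little-endian order as the CLI standard quoted above describes.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: reads a 32-bit little-endian integer starting at
// 'offset'. Shifts operate on values, not on memory layout, so the result
// is the same on little-endian and big-endian hosts.
std::int32_t read_int32_le(const std::vector<std::uint8_t>& bytes, std::size_t offset) {
    if (offset > bytes.size() || bytes.size() - offset < 4) {
        throw std::out_of_range("need 4 bytes at offset");
    }
    const std::uint32_t value =
          static_cast<std::uint32_t>(bytes[offset])
        | (static_cast<std::uint32_t>(bytes[offset + 1]) << 8)
        | (static_cast<std::uint32_t>(bytes[offset + 2]) << 16)
        | (static_cast<std::uint32_t>(bytes[offset + 3]) << 24);
    // Reinterpret as signed; two's complement is assumed here
    // (guaranteed from C++20 onward).
    return static_cast<std::int32_t>(value);
}
```

Because no byte array is reversed in place, the same buffer can be read repeatedly at different offsets without copying the operand out first.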

Is there a quick and dirty way to Cast PansiChar to Pchar in Delphi 2009

I have a very large number of apps to convert to Delphi 2009, and there are a number of external interfaces that return PAnsiChars. Does anyone have a quick and simple way to cast these back to PChars? There is a lot on string to PAnsiChar, but not much I can find on the other way around.
Delphi 2009 has added a new string type called RawByteString. It is defined as:
type
  RawByteString = type AnsiString($ffff);
If you need to save binary data coming in as PAnsiChar, you can use this. You should be able to use RawByteString the way you used AnsiString previously.
However, the recommended long term solution is still to convert your programs to Unicode.
There is no way to "cast" a PAnsiChar to a PChar. PChar is Unicode in Delphi 2009. Ansi data cannot simply be cast to Unicode, and vice versa. You have to perform an actual data conversion. If you have a PAnsiChar pointer to some data and want to put the data into a Unicode string, assign the PAnsiChar data to an AnsiString first, and then assign the AnsiString to the Unicode string as needed. Likewise, if you need to pass a Unicode string to a PAnsiChar, you have to assign the data to an AnsiString first. There are articles on Embarcadero's and TeamB's blog sites that talk about Delphi 2009 migration issues.
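To illustrate why a pointer cast is not enough, here is a minimal C++ sketch of an actual narrow-to-UTF-16 conversion. It rests on a loud assumption: the narrow data is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point. Real ANSI codepages such as Windows-1252 differ in the 0x80..0x9F range and need a proper conversion routine. latin1_to_utf16 is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper. Assumption: 'narrow' is ISO-8859-1 (Latin-1), so
// each byte widens to one 16-bit code unit with the same numeric value.
// Other ANSI codepages require a table-driven or OS-provided conversion.
std::u16string latin1_to_utf16(const std::string& narrow) {
    std::u16string wide;
    wide.reserve(narrow.size());
    for (unsigned char byte : narrow) {
        wide.push_back(static_cast<char16_t>(byte));
    }
    return wide;
}
```

Each 1-byte input character becomes a 2-byte code unit, so the converted data has a different size and layout from the original; reinterpreting the narrow pointer as a wide one would read pairs of unrelated bytes as single characters.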
