Delphi Unicode String Type Stored Directly at its Address (or "Unicode ShortString")

I want a string type that is Unicode and that stores the string directly at the address of the variable, as is the case with the (ANSI-only) ShortString type.
I mean, if I declare S: ShortString and let S := 'My String', then, at @S, I will find the length of the string (as one byte, so the string cannot contain more than 255 characters) followed by the ANSI-encoded string itself.
What I would like is a Unicode variant of this. That is, I want a string type such that, at @S, I will find an unsigned 32-bit integer (or even a single byte would be enough, actually) containing the length of the string in bytes (or in characters, which is half the number of bytes), followed by the Unicode representation of the string. I have tried WideString, UnicodeString, and RawByteString, but they all appear to store only an address at @S, with the actual string somewhere else (I guess this has to do with reference counting and such). Update: The most important reason for this is probably that it would be very problematic if sizeof(string) were variable.
I suspect that there is no built-in type to use, and that I have to come up with my own way of storing text the way I want (which actually is fun). Am I right?
Update
I will, among other things, need to use these strings in packed records. I also need to manually read/write these strings to files/the heap. I could live with fixed-size strings, say <= 128 characters, and I could redesign the problem so it will work with null-terminated strings. But PChar will not work: sizeof(PChar) is just the size of a pointer - it's merely an address.
The approach I eventually settled for was to use a static array of bytes. I will post my implementation as a solution later today.

You're right. There is no exact analogue to ShortString that holds Unicode characters. There are lots of things that come close, including WideString, UnicodeString, and arrays of WideChar, but if you're not willing to revisit the way you intend to use the data type (making byte-for-byte copies in memory and in files while still using them in all the contexts a string would be allowed), then none of Delphi's built-in types will work for you.
WideString fails because you insist that the string's length must exist at the address of the string variable, but WideString is a reference type; the only thing at its address is another address. Its length happens to be at the address held by the variable, minus four. That's subject to change, though, because all operations on that type are supposed to go through the API.
UnicodeString fails for that same reason, as well as because it's a reference-counted type; making a byte-for-byte copy of one breaks the reference counting, so you'll get memory leaks, invalid-pointer-operation exceptions, or more subtle heap corruption.
An array of WideChar can be copied without problems, but it doesn't keep track of its effective length, and it also doesn't act like a string very often. You can assign string literals to it and it will act like you called StrLCopy, but you can't assign string variables to it.
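For illustration, a minimal sketch of that asymmetry (the variable names are mine, not from the answer):

var
  Buf: array[0..31] of WideChar;
  S: string;
begin
  Buf := 'A literal';   // allowed: behaves like an implicit StrLCopy of the literal
  S := Buf;             // allowed: the character array converts to a string
  // Buf := S;          // not allowed: a string variable cannot be assigned to the array
end;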
You could define a record that has a field for the length and another field for a character array. That would resolve the length issue, but it would still have all the rest of the shortcomings of an undecorated array.
If I were you, I'd simply use a built-in string type. Then I'd write functions to help transfer it between files, blocks of memory, and native variables. It's not that hard; probably much easier than trying to get operator overloading to work just right with a custom record type. Consider how much code you will write to load and store your data versus how much code you're going to write that uses your data structure like an ordinary string. You're going to write the data-persistence code once, but for the rest of the project's lifetime, you're going to be using those strings, and you're going to want them to look and act just like real strings. So use real strings. "Suffer" the inconvenience of manually producing the on-disk format you want, and gain the advantage of being able to use all the existing string library functions.

PChar should work like this, right? AFAIK, it's an array of chars stored right where you put it. Zero-terminated; I'm not sure how that works with Unicode chars.

You actually have this, in some ways, with the new Unicode strings.
As a pointer, s points to s[1], and the 4 bytes just before it contain the length.
But why not simply use Length(s)?
And for direct reading of the length from memory:
procedure TForm9.Button1Click(Sender: TObject);
var
  s: string;
begin
  s := 'hlkk ljhk jhto';
  {$POINTERMATH ON}
  Assert(Length(s) = (PInteger(s) - 1)^);
  // if you don't want POINTERMATH, replace by PInteger(Cardinal(s) - SizeOf(Integer))^
  ShowMessage(IntToStr(Length(s)));
end;

There's no Unicode version of ShortString. If you want to store Unicode data inline inside an object instead of as a reference type, you can allocate a buffer:
var
  buffer: array[0..255] of WideChar;
This has two disadvantages: first, the size is fixed, and second, the compiler doesn't recognize it as a string type.
The main problem here is #1: The fixed size. If you're going to declare an array inside of a larger object or record, the compiler needs to know how large it is in order to calculate the size of the object or record itself. For ShortString this wasn't a big problem, since they could only go up to 256 bytes (1/4 of a K) total, which isn't all that much. But if you want to use long strings that are addressed by a 32-bit integer, that makes the max size 4 GB. You can't put that inside of an object!
This, not the reference counting, is why long strings are implemented as reference types, whose inline size is always a constant sizeof(pointer). Then the compiler can put the string data inside a dynamic array and resize it to fit the current needs.
Why do you need to put something like this into a packed record? If I were to guess, I'd say this probably has something to do with serialization. If so, you're better off using a TStream and a normal Unicode string, writing an integer (size) to the stream, and then the contents of the string. That turns out to be a lot more flexible than trying to stuff everything into a packed record.
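As a rough sketch of that approach (the helper names and the simple length-prefix layout are my own; this assumes Delphi 2009+, where string is UTF-16, and Classes in the uses clause for TStream):

procedure WriteStringToStream(Stream: TStream; const S: string);
var
  Len: Integer;
begin
  Len := Length(S);
  Stream.WriteBuffer(Len, SizeOf(Len));             // the size first...
  if Len > 0 then
    Stream.WriteBuffer(S[1], Len * SizeOf(Char));   // ...then the UTF-16 character data
end;

function ReadStringFromStream(Stream: TStream): string;
var
  Len: Integer;
begin
  Stream.ReadBuffer(Len, SizeOf(Len));
  SetLength(Result, Len);
  if Len > 0 then
    Stream.ReadBuffer(Result[1], Len * SizeOf(Char));
end;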

The solution I eventually settled for is this (real-world sample - the string is, of course, the third member called "Ident"):
TASStructMemHeader = packed record
  TotalSize: cardinal;
  MemType: TASStructMemType;
  Ident: packed array[0..63] of WideChar;
  DataSize: cardinal;
  procedure SetIdent(const AIdent: string);
  function ReadIdent: string;
end;
where
function TASStructMemHeader.ReadIdent: string;
begin
  result := WideCharLenToString(PWideChar(@(Ident[0])), length(Ident));
end;

procedure TASStructMemHeader.SetIdent(const AIdent: string);
var
  i: Integer;
begin
  if length(AIdent) > 63 then
    raise Exception.Create('Too long structure identifier.');
  FillChar(Ident[0], length(Ident) * sizeof(WideChar), 0);
  Move(AIdent[1], Ident[0], length(AIdent) * sizeof(WideChar));
end;
But then I realized that the compiler really can interpret array[0..63] of WideChar as a string, so I could simply write
var
  MyStr: string;
begin
  Ident := 'This is a sample string.';
  MyStr := Ident;
end;
Hence, after all, the answer given by Mason Wheeler above is actually the answer.

Related

Displaying the result of mmioRead

After locating the data chunk using mmioDescend, how am I supposed to read the sample data and display it, for example in a memo, in Delphi 7?
I have followed the steps: open the file, locate the RIFF chunk, locate the fmt chunk, locate the data chunk.
if (mmioDescend(HMMIO, @ckiData, @ckiRIFF, MMIO_FINDCHUNK) = MMSYSERR_NOERROR) then
  SetLength(buf, ckiData.cksize);
mmioRead(HMMIO, PAnsiChar(buf), ckiData.cksize);
I use mmioRead too, but I don't know how to display the data. Can anyone give an example of how to use mmioRead and then display the result?
Well, I'd probably read into a buffer that was declared using a more appropriate type.
For example, suppose your data are 16 bit integers, Smallint in Delphi. Then declare a dynamic array of Smallint.
var
  buf: array of Smallint;
Then allocate enough space for the data:
Assert(ckiData.cksize mod SizeOf(buf[0])=0);
SetLength(buf, ckiData.cksize div SizeOf(buf[0]));
And then read the buffer:
mmioRead(HMMIO, PAnsiChar(buf), ckiData.cksize);
Now you can access the elements as Smallint values.
If you have different element types, then you can adjust your array declaration. If you don't know until runtime what the element type is, you may be better off with array of Byte and then using pointer arithmetic and casting to access the actual content.
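A minimal sketch of that byte-buffer approach, assuming the data later turn out to be 16-bit samples (the cast target would change for other element types; the variable names are mine):

var
  raw: array of Byte;
  sample: Smallint;
begin
  SetLength(raw, ckiData.cksize);
  mmioRead(HMMIO, PAnsiChar(raw), ckiData.cksize);
  // interpret the bytes by casting a pointer into the buffer
  sample := PSmallint(@raw[0])^;   // first 16-bit sample
  sample := PSmallint(@raw[2])^;   // second sample, two bytes further on
end;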
I'd say that the design of the interface to mmioRead is a little weak. The buffer isn't really a string; it's probably best considered as a byte array. But perhaps because C does not have separate byte and character types, the function is declared as taking a pointer to a char array. Really, the Delphi translation would be better off exposing a pointer to byte or, even better in my view, a plain untyped Pointer.
I assumed that you were struggling with interpreting the output of mmioRead since that was the code that you included in the question. But, according to now deleted comments, your question is a GUI question.
You want to add content to a memo. Do it like this:
Memo1.Clear;
for i := low(buf) to high(buf) do
  Memo1.Lines.Add(IntToStr(buf[i]));
If you want to convert to floating point then, still assuming 16 bit signed data, do this:
Memo1.Clear;
for i := low(buf) to high(buf) do
  Memo1.Lines.Add(FormatFloat('0.00000', buf[i] / 32768.0)); // show 5 decimal places

Why use string[1] rather than string when using ReadBuffer

I have a record like this:
TEmf_SrectchDIBits = packed record
  rEMF_STRETCHDI_BITS: TEMRStretchDIBits;
  rBitmapInfo: TBitmapInfo;
  ImageSource: string;
end;
---
---
RecordData: TEmf_SrectchDIBits;
If I read data into it using a TStream like this, an exception occurs:
SetLength(RecordData.ImageSource, pRecordSize);
EMFStream.ReadBuffer(RecordData.ImageSource, pRecordSize);
But if I use the code below, it works normally:
SetLength(RecordData.ImageSource, pRecordSize);
EMFStream.ReadBuffer(RecordData.ImageSource[1], pRecordSize);
So what is the difference between using String and String[1]?
The difference is a detail related to the signature of the .ReadBuffer method.
The signature is:
procedure ReadBuffer(var Buffer; Count: Longint);
As you can see, the Buffer parameter does not have a type. In this case, you're saying that you want access to the underlying variable.
However, a string has two parts: the variable itself, which holds a pointer, and the string data, which that pointer points to.
So, if ReadBuffer were given just the string variable, it would have 4 bytes to store data into - the string variable itself - and that would not work out too well, since the string variable is supposed to hold a pointer, not arbitrary binary data. If ReadBuffer wrote more than 4 bytes, it would overwrite something else in memory, a potentially disastrous thing to do.
By passing the [1] character to the var parameter, you're giving ReadBuffer access to the data that the string variable points to, which is what you want - you want to change the string's content, after all.
Also, make sure you've set up the length of the string variable to be big enough to hold whatever you're reading into it.
Also, a final note, one that I cannot verify: in older Delphi versions, a string variable contained 1-byte characters. In newer versions they're two bytes each, due to Unicode, so that code might not work as expected in newer versions of Delphi. You would probably be better off using a byte array or heap memory instead.
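A sketch of the byte-array alternative mentioned above, assuming ImageSource were changed to TBytes (declared in SysUtils in recent Delphi versions), which sidesteps the character-size issue entirely:

var
  Buf: TBytes;
begin
  SetLength(Buf, pRecordSize);
  if pRecordSize > 0 then
    EMFStream.ReadBuffer(Buf[0], pRecordSize);   // Buf[0] exposes the raw data, just as ImageSource[1] did
end;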
String types are actually implemented as pointers to something we could call a "string descriptor block". Basically, you have a level of indirection.
That block contains some string control data (reference count, length, and in later versions character set info as well) at negative offsets, and the string characters at positive ones. A string variable is a pointer to the descriptor block (and if you print SizeOf(stringvar) you get 4); when you work on strings, the compiler knows where to find the string data and how to handle it. But when using an untyped parameter (var Buffer;), the compiler does not know that; it will simply access the memory at "Buffer" - and with a string variable, that's the pointer to the string block, not the actual string characters. Using string[1], you pass the location of the first character's data.

TArray<Byte> VS TBytes VS PByteArray

Those 3 types are very similar...
TArray<Byte> is the generic version of TBytes.
Both can be cast to PByteArray and used as a buffer for calls to the Windows API (with the same restrictions as casting string to PChar).
What I would like to know: is this behavior "by design" or "by implementation"? More specifically, could it break in a future release?
//Edit
As stated below...
What I really want to know is: is it as safe to typecast TBytes (or TArray<Byte>) to PByteArray as it is to typecast String to PChar, as far as forward compatibility is concerned? (Or maybe AnsiString to PAnsiChar is a better example ^_^)
Simply put, an array of bytes is an array of bytes, and as long as the definitions of a byte and an array don't change, this won't change either. You're safe to use it that way, as long as you make sure to respect the array bounds, since casting it out of Delphi's array types nullifies your bounds checking.
EDIT: I think I see what you're asking a bit better now.
No, you shouldn't cast a dynamic array reference to a C-style array pointer. You can get away with it with strings because the compiler helps you out a little.
What you can do, though, is cast a pointer to element 0 of the dynamic array to a C-style array pointer. That will work, and won't change.
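For example, a sketch of that cast (TBytes from SysUtils; the variable names are mine):

var
  Bytes: TBytes;
  P: PByteArray;
begin
  SetLength(Bytes, 16);
  P := PByteArray(@Bytes[0]);   // pointer to element 0, cast to the C-style array pointer
  P[0] := $FF;                  // writes Bytes[0]
end;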
Two of those types are similar (identical in fact). The third is not.
TArray<Byte> is declared as "array of Byte", as is TBytes. You missed a further very relevant type, however: TByteArray (the type referenced by PByteArray).
Being a pointer to TByteArray, PByteArray is strictly speaking a pointer to a static byte array, not a dynamic array (which the other byte array types all are). It is typed this way to allow references to offsets from the base pointer using an integer index. Note that this indexing is limited to 2^15 elements (0..32767). For arbitrary byte offsets (> 32767) from some base pointer, a PByteArray is no good:
var
  b: Byte;
  ab: TArray<Byte>;
  pba: PByteArray;
begin
  SetLength(ab, 100000);
  pba := @ab; // << No cast necessary - the compiler knows (magic!)
  b := pba[62767]; // << COMPILE ERROR!
end;
i.e. casting an array of Byte or a TArray<Byte> to a PByteArray is potentially going to lead to problems where the array has more than 32K elements (and the pointer is passed to some code which attempts to access all of the elements). Casting to an untyped pointer avoids this, of course (as long as the "recipient" of the pointer then handles access to the memory referenced by the pointer appropriately).
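One way around the 32K indexing limit, sketched on the assumption of Delphi 2009+ where $POINTERMATH allows indexing typed pointers:

var
  ab: TArray<Byte>;
  pb: PByte;
  b: Byte;
begin
  SetLength(ab, 100000);
  pb := @ab[0];
  {$POINTERMATH ON}
  b := pb[62767];   // fine: PByte indexing under $POINTERMATH is not limited to 0..32767
  {$POINTERMATH OFF}
end;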
BUT, none of this is likely to change in the future; it is merely a consequence of the implementation details that have long applied in this area. The introduction of a syntactically sugared generic type declaration is a kipper rouge (a red herring).

Delphi; performance of passing const strings versus passing var strings

Quick one; am I right in thinking that passing a string to a method as a CONST involves more overhead than passing it as a VAR? The compiler will have Delphi make a copy of the string and then pass the copy if the string parameter is declared as a CONST, right?
The reason for the question is a bit tedious; we have a legacy Delphi 5 utility whose days are truly numbered (the replacement is under development). It does a large amount of string processing, frequently passing 1-2 KB strings between various functions and procedures. Throughout the code, the 'correct' convention of using CONST or VAR to pass parameters (depending on the job in hand) has been adhered to. We're just looking for a few quick wins that might shave a few microseconds off the execution time, to tide us over until the new version is ready. We thought of changing the memory manager from the default Delphi 5 one to FastMM, and we also wondered whether it was worth altering the way the strings are passed around - because the code works fine with the strings passed as const, we don't see a problem in changing those declarations to var, since the code within those methods isn't going to change the strings.
But would it really make any difference in real terms? (The program really just does a large amount of processing on these 1kb+ish strings; several hundred strings a minute at peak times). In the re-write these strings are being held in objects/class variables, so they're not really being copied/passed around in the same way at all, but in the legacy code it's very much 'old school' pascal.
Naturally we'll profile an overall run of the program to see what difference we've made, but there's no point in even trying this if we're categorically wrong about how string passing works in the first place!
No, there shouldn't be any performance difference between using const or var in your case. In both cases a pointer to the string is passed as the parameter. If the parameter is const the compiler simply disallows any modifications to it. Note that this does not preclude modifications to the string if you get tricky:
procedure TForm1.Button1Click(Sender: TObject);
var
  s: string;
begin
  s := 'foo bar baz';
  UniqueString(s);
  SetConstCaption(s);
  Caption := s;
end;

procedure TForm1.SetConstCaption(const AValue: string);
var
  P: PChar;
begin
  P := PChar(AValue);
  P[3] := '?';
  Caption := AValue;
end;
This will actually change the local string variable in the calling method, proof that only a pointer to it is passed.
But definitely use FastMM4, it should have a much bigger performance impact.
const for parameters in Delphi essentially means "I'm not going to mutate this, and I also don't care if this is passed by value or by reference - whichever is most efficient is fine by me". The bolded part is important, because it is actually observable. Consider this code:
type
  TFoo = record
    x: integer;
    //dummy: array[1..10] of integer;
  end;

procedure Foo(var x1: TFoo; const x2: TFoo);
begin
  WriteLn(x1.x);
  WriteLn(x2.x);
  Inc(x1.x);
  WriteLn;
  WriteLn(x1.x);
  WriteLn(x2.x);
end;

var
  x: TFoo;
begin
  Foo(x, x);
  ReadLn;
end.
The trick here is that we pass the same variable both as var and as const, so that our function can mutate it via one argument and see whether this affects the other. If you try it with the code above, you'll see that incrementing x1.x inside Foo doesn't change x2.x, so x2 was passed by value. But try uncommenting the array declaration in TFoo, so that its size becomes larger, and run it again - and you'll see how x2.x now aliases x1.x, so we have pass-by-reference for x2!
To sum it up, const is usually the most efficient way to pass a parameter of any type, but you should not make any assumptions about whether you have a copy of the value that was passed by the caller, or a reference to some (potentially mutated by other code that you may call) location.
This is really a comment, but a long one so bear with me.
About 'so called' string passing by value
Delphi always passes string and AnsiString (WideString and ShortString excluded) by reference, as a pointer.
So strings are never passed by value.
This can easily be tested by passing 100 MB strings around.
As long as you don't change them inside the body of the called routine, string passing takes O(1) time (and with a small constant at that).
However, when passing a string without a var or const clause, Delphi does three things:
1. It increases the reference count of the string.
2. It puts an implicit try/finally block around the procedure, so the reference count of the string parameter gets decreased again when the method exits.
3. When the string gets changed (and only then), Delphi makes a copy of the string, decreases the reference count of the passed string, and uses the copy in the rest of the routine.
In doing this, it fakes a pass by value.
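A small sketch of the difference (hypothetical procedure names; the comments restate the behaviour described above):

procedure TakesByValue(S: string);
begin
  // on entry: the string's reference count is incremented,
  // and an implicit try/finally guards the matching decrement on exit
  S := S + '!';   // only now is a private copy of the data made (copy-on-write)
end;

procedure TakesConst(const S: string);
begin
  // on entry: no reference count change, no implicit try/finally
  // S := S + '!';   // would not compile: const parameters cannot be assigned to
end;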
About passing by reference (pointer)
When the string is passed as a const or var parameter, Delphi also passes a reference (pointer); however:
1. The reference count of the string does not increase (a tiny, tiny speed increase).
2. No implicit try/finally is put around the routine, because it is not needed. This is part 1 of why const/var string parameters execute faster.
3. When the string is changed inside the routine, no copy is made; the actual string is changed. (For const parameters the compiler prohibits string alterations.) This is part 2 of why var/const string parameters work faster.
If, however, you need to create a local var to assign the string to, Delphi increments the string's reference count and places an implicit try/finally block, eliminating most of the speed gain of a const string parameter.
Hope this sheds some light on the issue.
Disclaimer: Most of this info comes from here, here and here
The compiler won't make a copy of the string when using const, AFAIK. Using const saves you the overhead of incrementing/decrementing the reference counter for the string that you use.
You will get a bigger performance boost by upgrading the memory manager to FastMM and, because you do a lot with strings, by considering the FastCode library as well.
Const is already the most efficient way of passing parameters to a function. It avoids creating a copy (default, by value) or even passing a pointer (var, by reference).
It is particularly true for strings and was indeed the way to go when computing power was limited and not to be wasted (hence the "old school" label).
IMO, const should have been the default convention, being up to the programmer to change it when really needed to by value or by var. That would have been more in line with the overall safety of Pascal (as in limiting the opportunity of shooting oneself in the foot).
My 2¢...

Delphi 2009 + Unicode + Char-size

I just got Delphi 2009 and have previously read some articles about modifications that might be necessary because of the switch to Unicode strings.
Mostly, it is mentioned that sizeof(char) is not guaranteed to be 1 anymore.
But why would this be interesting regarding string manipulation?
For example, if I use AnsiString := 'Test' and do the same with a string (which is Unicode now), then I get Length() = 4, which is correct in both cases.
Without having tested it, I'm sure all the other string manipulation functions behave the same way and decide internally whether the argument is a Unicode string or anything else.
Why would the actual size of a char be of interest for me if I do string manipulations?
(Of course if I use strings as strings and not to store any other data)
Thanks for any help!
Holger
With Unicode, SizeOf and Length are no longer interchangeable: the length of a string in characters is less than the total size of its characters in bytes. As long as you don't assume SizeOf(Char) = 1 or SizeOf(SomeString[x]) = 1 (both are FALSE now), or try to interchange bytes with chars, you shouldn't have any trouble. Anywhere you are doing something creative, stuffing bytes into Chars or Strings, you will need to use AnsiString.
(SizeOf(SomeString) is still 4 no matter the length, since a string variable is essentially a pointer with some compiler magic.)
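A minimal sketch of those different measures (assuming 32-bit Delphi 2009):

var
  S: string;
begin
  S := 'Test';
  Assert(SizeOf(S) = 4);                  // the variable itself: just a reference
  Assert(Length(S) = 4);                  // the character count
  Assert(Length(S) * SizeOf(Char) = 8);   // the byte count of the character data
end;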
People often implicitly convert from characters to bytes in old Delphi code without really thinking about it. For example, when writing to a stream. When you write a string to a stream, you have to specify the number of bytes you write, but people often pass the character count instead. See this post from Chris Bensen for another example.
Another way people often make this implicit conversion in older code is by using a "string" to store binary data. In this case, they actually want bytes, but the data type expects characters. D2009 has a better type for this.
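A sketch of the stream pitfall described above (the helper name is mine; TStream is from Classes):

procedure SaveStringBytes(Stream: TStream; const S: string);
begin
  if S <> '' then
    // correct: the byte count is Length(S) * SizeOf(Char), i.e. 2 * Length(S) in D2009
    Stream.WriteBuffer(S[1], Length(S) * SizeOf(Char));
  // passing just Length(S), as pre-Unicode code often did, would write only half the data
end;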
I haven't tried Delphi 2009, but I am using FPC, which is also slowly switching to Unicode. I'm 95% sure that everything below also holds for Delphi 2009.
In FPC (when supporting Unicode), functions like Length take the codepage into consideration, so they return the length of the string as a 'human' would see it. If there are, for example, two Chinese characters, which each take two bytes of memory in Unicode, Length will return 2, since there are two characters in the string. But the string will take 4 bytes of memory (plus the memory for the reference count and the trailing #0, but that aside).
What you cannot do any more is this:
var
  p: pchar;
  i: integer;
  s: string;
begin
  p := @s[1];
  for i := 0 to length(s) - 1 do
  begin
    write(p^);
    inc(p);
  end;
end;
Because this code will - in the two-Chinese-character example - write the wrong two characters, namely the two bytes which are part of the first 'real' character.
In short: Length() no longer returns the number of bytes allocated for the string, but the number of characters. (Before the switch to Unicode, those two values were equal to each other.)
The actual size of a character shouldn't matter, unless you are doing the manipulation at the byte level.
(Of course if I use strings as strings and not to store any other data)
That's the key point: YOU don't use strings for other purposes, but some people do. They use strings just like arrays, so they (and that includes me) would need to check all such uses to make sure nothing is broken...
Let's not forget that there are times when this conversion is not really desired - say, for storing a GUID in a record, for instance. A GUID can only contain hexadecimal characters plus the '-' and brackets, so making them take up twice the space can have quite an impact on existing code. Sure, the simple solution is to change them to AnsiString and deal with the compiler warnings if you do any string manipulation on them.
It can be an issue if you make Windows API calls. Or if you have legacy code that does inc or dec of str[0] to change its length.
