Why use string[1] rather than string while using readbuffer - delphi

I am having a record like this
TEmf_SrectchDIBits = packed record
rEMF_STRETCHDI_BITS: TEMRStretchDIBits;
rBitmapInfo: TBitmapInfo;
ImageSource: string;
end;
---
---
RecordData: TEmf_SrectchDIBits;
If i am reading data into it by using TStream like this an exception is occuring
SetLength(RecordData.ImageSource, pRecordSize);
EMFStream.ReadBuffer(RecordData.ImageSource,pRecordSize)
But if i use below code, it was working normally
SetLength(RecordData.ImageSource, pRecordSize);
EMFStream.ReadBuffer(RecordData.ImageSource[1], pRecordSize);
So what is the difference between using String and String[1]

The difference is a detail related to the signature of the .ReadBuffer method.
The signature is:
procedure ReadBuffer(var Buffer; Count: Longint);
As you can see, the Buffer parameter does not have a type. In this case, you're saying that you want access to the underlying variable.
However, a string is two parts, a pointer (the content of the variable) and the string (the variable points to this).
So, if ReadBuffer were given just the string variable, it would have 4 bytes to store data into, the string variable, and that would not work out too well since the string variable is supposed to hold a pointer, not just any random binary data. If ReadBuffer wrote more than 4 bytes, it would overwrite something else in memory with new data, a potentially disastrous action to do.
By passing the [1] character to a var parameter, you're giving ReadBuffer access to the data that the string variable points to, which is what you want. You want to change the string content after all.
Also, make sure you've set up the length of the string variable to be big enough to hold whatever you're reading into it.
Also, final note, one that I cannot verify. In older Delphi versions, a string variable contained 1-byte characters. In newer, I think they're two, due to unicode, so that code might not work as expected in newer versions of Delphi. You probably would like to use a byte array or heap memory instead.

String types are implemented actually as pointers to something we could call a "string descriptor block". Basically, you have a level of indirection.
That block contains some string control data (reference count, length, and in later versions character set info as well) at negative offsets, and the string characters at positive ones. A string variable is a pointer to the decription block (and if you print SizeOf(stringvar) you get 4), when you work on strings the compiler knows where to find the string data and handle them. But when using an untyped parameter (var Buffer;), the compiler does not know that, it will simply access the memory at "Buffer", but with a string variable that's the pointer to the string block, not the actual string characters. Using string[1] you pass the location of the first character data.

Related

Delphi Unicode String Type Stored Directly at its Address (or "Unicode ShortString")

I want a string type that is Unicode and that stores the string directly at the adress of the variable, as is the case of the (Ansi-only) ShortString type.
I mean, if I declare a S: ShortString and let S := 'My String', then, at #S, I will find the length of the string (as one byte, so the string cannot contain more than 255 characters) followed by the ANSI-encoded string itself.
What I would like is a Unicode variant of this. That is, I want a string type such that, at #S, I will find a unsigned 32-bit integer (or a single byte would be enough, actually) containing the length of the string in bytes (or in characters, which is half the number of bytes) followed by the Unicode representation of the string. I have tried WideString, UnicodeString, and RawByteString, but they all appear only to store an adress at #S, and the actual string somewhere else (I guess this has do do with reference counting and such). Update: The most important reason for this is probably that it would be very problematic if sizeof(string) were variable.
I suspect that there is no built-in type to use, and that I have to come up with my own way of storing text the way I want (which actually is fun). Am I right?
Update
I will, among other things, need to use these strings in packed records. I also need manually to read/write these strings to files/the heap. I could live with fixed-size strings, such as <= 128 characters, and I could redesign the problem so it will work with null-terminated strings. But PChar will not work, for sizeof(PChar) = 1 - it's merely an address.
The approach I eventually settled for was to use a static array of bytes. I will post my implementation as a solution later today.
You're right. There is no exact analogue to ShortString that holds Unicode characters. There are lots of things that come close, including WideString, UnicodeString, and arrays of WideChar, but if you're not willing to revisit the way you intend to use the data type (make byte-for-byte copies in memory and in files while still being using them in all the contexts a string could be allowed), then none of Delphi's built-in types will work for you.
WideString fails because you insist that the string's length must exist at the address of the string variable, but WideString is a reference type; the only thing at its address is another address. Its length happens to be at the address held by the variable, minus four. That's subject to change, though, because all operations on that type are supposed to go through the API.
UnicodeString fails for that same reason, as well as because it's a reference-counted type; making a byte-for-byte copy of one breaks the reference counting, so you'll get memory leaks, invalid-pointer-operation exceptions, or more subtle heap corruption.
An array of WideChar can be copied without problems, but it doesn't keep track of its effective length, and it also doesn't act like a string very often. You can assign string literals to it and it will act like you called StrLCopy, but you can't assign string variables to it.
You could define a record that has a field for the length and another field for a character array. That would resolve the length issue, but it would still have all the rest of the shortcomings of an undecorated array.
If I were you, I'd simply use a built-in string type. Then I'd write functions to help transfer it between files, blocks of memory, and native variables. It's not that hard; probably much easier than trying to get operator overloading to work just right with a custom record type. Consider how much code you will write to load and store your data versus how much code you're going to write that uses your data structure like an ordinary string. You're going to write the data-persistence code once, but for the rest of the project's lifetime, you're going to be using those strings, and you're going to want them to look and act just like real strings. So use real strings. "Suffer" the inconvenience of manually producing the on-disk format you want, and gain the advantage of being able to use all the existing string library functions.
PChar should work like this, right? AFAIK, it's an array of chars stored right where you put it. Zero terminated, not sure how that works with Unicode Chars.
You actually have this in some way with the new unicode strings.
s as a pointer points to s[1] and the 4 bytes on the left contains the length.
But why not simply use Length(s)?
And for direct reading of the length from memory:
procedure TForm9.Button1Click(Sender: TObject);
var
s: string;
begin
s := 'hlkk ljhk jhto';
{$POINTERMATH ON}
Assert(Length(s) = (PInteger(s)-1)^);
//if you don't want POINTERMATH, replace by PInteger(Cardinal(s)-SizeOf(Integer))^
showmessage(IntToStr(length(s)));
end;
There's no Unicode version of ShortString. If you want to store unicode data inline inside an object instead of as a reference type, you can allocate a buffer:
var
buffer = array[0..255] of WideChar;
This has two disadvantages. 1, the size is fixed, and 2, the compiler doesn't recognize it as a string type.
The main problem here is #1: The fixed size. If you're going to declare an array inside of a larger object or record, the compiler needs to know how large it is in order to calculate the size of the object or record itself. For ShortString this wasn't a big problem, since they could only go up to 256 bytes (1/4 of a K) total, which isn't all that much. But if you want to use long strings that are addressed by a 32-bit integer, that makes the max size 4 GB. You can't put that inside of an object!
This, not the reference counting, is why long strings are implemented as reference types, whose inline size is always a constant sizeof(pointer). Then the compiler can put the string data inside a dynamic array and resize it to fit the current needs.
Why do you need to put something like this into a packed array? If I were to guess, I'd say this probably has something to do with serialization. If so, you're better off using a TStream and a normal Unicode string, and writing an integer (size) to the stream, and then the contents of the string. That turns out to be a lot more flexible than trying to stuff everything into a packed array.
The solution I eventually settled for is this (real-world sample - the string is, of course, the third member called "Ident"):
TASStructMemHeader = packed record
TotalSize: cardinal;
MemType: TASStructMemType;
Ident: packed array[0..63] of WideChar;
DataSize: cardinal;
procedure SetIdent(const AIdent: string);
function ReadIdent: string;
end;
where
function TASStructMemHeader.ReadIdent: string;
begin
result := WideCharLenToString(PWideChar(#(Ident[0])), length(Ident));
end;
procedure TASStructMemHeader.SetIdent(const AIdent: string);
var
i: Integer;
begin
if length(AIdent) > 63 then
raise Exception.Create('Too long structure identifier.');
FillChar(Ident[0], length(Ident) * sizeof(WideChar), 0);
Move(AIdent[1], Ident[0], length(AIdent) * sizeof(WideChar));
end;
But then I realized that the compiler really can interpret array[0..63] of WideChar as a string, so I could simply write
var
MyStr: string;
Ident := 'This is a sample string.';
MyStr := Ident;
Hence, after all, the answer given by Mason Wheeler above is actually the answer.

TArray<Byte> VS TBytes VS PByteArray

Those 3 types are very similar...
TArray is the generic version of TBytes.
Both can be casted to PByteArray and used as buffer for calls to Windows API. (with the same restrictions as string to Pchar).
What I would like to know: Is this behavior "by design" or "By Implementation". Or more specifically, could it break in future release?
//Edit
As stated lower...
What I really want to know is: Is this as safe to typecast TBytes(or TArray) to PByteArray as it is to typecast String to PChar as far as forward compatibility is concerned. (Or maybe AnsiString to PAnsiChar is a better exemple ^_^)
Simply put, an array of bytes is an array of bytes, and as long as the definitions of a byte and an array don't change, this won't change either. You're safe to use it that way, as long as you make sure to respect the array bounds, since casting it out of Delphi's array types nullifies your bounds checking.
EDIT: I think I see what you're asking a bit better now.
No, you shouldn't cast a dynamic array reference to a C-style array pointer. You can get away with it with strings because the compiler helps you out a little.
What you can do, though, is cast a pointer to element 0 of the dynamic array to a C-style array pointer. That will work, and won't change.
Two of those types are similar (identical in fact). The third is not.
TArray is declared as "Array of Byte", as is TBytes. You missed a further very relevant type however, TByteArray (the type referenced by PByteArray).
Being a pointer to TByteArray, PByteArray is strictly speaking a pointer to a static byte array, not a dynamic array (which the other byte array types all are). It is typed in this way in order to allow reference to offsets from that base pointer using an integer index. And note that this indexing is limited to 2^15 elements (0..32767). For arbitrary byte offsets (> 32767) from some base pointer, a PByteArray is no good:
var
b: Byte;
ab: TArray<Byte>;
pba: PByteArray;
begin
SetLength(ab, 100000);
pba := #ab; // << No cast necessary - the compiler knows (magic!)
b := pba[62767]; // << COMPILE ERROR!
end;
i.e. casting an Array of Byte or a TArray to a PByteArray is potentially going to lead to problems where the array has > 32K elements (and the pointer is passed to some code which attempts to access all elements). Casting to an untyped pointer avoids this of course (as long as the "recipient" of the pointer then handles access to the memory reference by the pointer appropriately).
BUT, none of this is likely to change in the future, it is merely a consequence of the implementation details that have long since applied in this area. The introduction of a syntactically sugared generic type declaration is a kipper rouge.

Why does Delphi warn when assigning ShortString to string?

I'm converting some legacy code to Delphi 2010.
There are a fair number of old ShortStrings, like string[25]
Why does the assignment below:
type
S: String;
ShortS: String[25];
...
S := ShortS;
cause the compiler to generate this warning:
W1057 Implicit string cast from 'ShortString' to 'string'.
There's no data loss that is occurring here. In what circumstances would this warning be helpful information to me?
Thanks!
Tomw
It's because your code is implicitly converting a single-byte character string to a UnicodeString. It's warning you in case you might have overlooked it, since that can cause problems if you do it by mistake.
To make it go away, use an explicit conversion:
S := string(ShortS);
The ShortString type has not changed. It continues to be, in effect, an array of AnsiChar.
By assigning it to a string type, you are taking what is a group of AnsiChars (one byte) and putting it into a group of WideChars (two bytes). The compiler can do that just fine, and is smart enough not to lose data, but the warning is there to let you know that such a conversion has taken place.
The warning is very important because you may lose data. The conversion is done using the current Windows 8-bit character set, and some character sets do not define all values between 0 and 255, or are multi-byte character sets, and thus cannot convert all byte values.
The data loss can occur on a standard computer in a country with specific standard character sets, or on a computer in USA that has been set up for a different locale, because the user communicates a lot with people in other languages.
For instance, if the local code page is 932, the byte values 129 and 130 will both convert to the same value in the Unicode string.
In addition to this, the conversion involves a Windows API call, which is an expensive operation. If you do a lot of these, it can slow down your application.
It's safe ( as long as you're using the ShortString for its intended purpose: to hold a string of characters and not a collection of bytes, some of which may be 0 ), but may have performance implications if you do it a lot. As far as I know, Delphi has to allocate memory for the new unicode string, extract the characters from the ShortString into a null-terminated string (that's why it's important that it's a properly-formed string) and then call something like the Windows API MultiByteToWideChar() function. Not rocket science, but not a trivial operation either.
ShortStrings don't have a code page associated with them, AnsiStrings do (since D2009).
The conversion from ShortString to UnicodeString can only be done on the assumption that ShortStrings are encoded in the default ANSI encoding which is not a safe assumption.
I don't really know Delphi, but if I remember correctly, the Shortstrings are essentially a sequence of characters on the stack, whereas a regular string (AnsiString) is actually a reference to a location on the heap. This may have different implications.
Here's a good article on the different string types:
http://www.codexterity.com/delphistrings.htm
I think there might also be a difference in terms of encoding but I'm not 100% sure.

Are Delphi strings immutable?

As far as I know, strings are immutable in Delphi. I kind of understand that means if you do:
string1 := 'Hello';
string1 := string1 + " World";
first string is destroyed and you get a reference to a new string "Hello World".
But what happens if you have the same string in different places around your code?
I have a string hash assigned for identifying several variables, so for example a "change" is identified by a hash value of the properties of that change. That way it's easy for me to check to "changes" for equality.
Now, each hash is computed separately (not all the properties are taken into account so that to separate instances can be equal even if they differ on some values).
The question is, how does Delphi handles those strings? If I compute to separate hashes to the same 10 byte length string, what do I get? Two memory blocks of 10 bytes or two references to the same memory block?
Clarification: A change is composed by some properties read from the database and is generated by an individual thread. The TChange class has a GetHash method that computes a hash based on some of the values (but not all) resulting on a string. Now, other threads receive the Change and have to compare it to previously processed changes so that they don't process the same (logical) change. Hence the hash and, as they have separate instances, two different strings are computed. I'm trying to determine if it'd be a real improvement to change from string to something like a 128 bit hash or it'll be just wasting my time.
Edit: Version of Delphi is Delphi 7.0
Delphi strings are copy on write. If you modify a string (without using pointer tricks or similar techniques to fool the compiler), no other references to the same string will be affected.
Delphi strings are not interned. If you create the same string from two separate sections of code, they will not share the same backing store - the same data will be stored twice.
Delphi strings are not immutable (try: string1[2] := 'a') but they are reference-counted and copy-on-write.
The consequences for your hashes are not clear, you'll have to detail how they are stored etc.
But a hash should only depend on the contents of a string, not on how it is stored. That makes the whole question mute. Unless you can explain it better.
As others have said, Delphi strings are not generally immutable. Here are a few references on strings in Delphi.
http://blog.marcocantu.com/blog/delphi_super_duper_strings.html
http://conferences.codegear.com/he/article/32120
http://www.codexterity.com/delphistrings.htm
The Delphi version may be important to know. The good old Delphi BCL handles strings as copy-on-write, which basically means that a new instance is created when something in the string is changed. So yes, they are more or less immutable.

Delphi 2009 + Unicode + Char-size

I just got Delphi 2009 and have previously read some articles about modifications that might be necessary because of the switch to Unicode strings.
Mostly, it is mentioned that sizeof(char) is not guaranteed to be 1 anymore.
But why would this be interesting regarding string manipulation?
For example, if I use an AnsiString:='Test' and do the same with a String (which is unicode now), then I get Length() = 4 which is correct for both cases.
Without having tested it, I'm sure all other string manipulation functions behave the same way and decide internally if the argument is a unicode string or anything else.
Why would the actual size of a char be of interest for me if I do string manipulations?
(Of course if I use strings as strings and not to store any other data)
Thanks for any help!
Holger
With Unicode SizeOf(SomeChar) <> Length(SomeChar). Essentially the length of a string is less then the sum of the size of its chars. As long as you don't assume SizeOf(Char) = 1, or SizeOf(SomeString[x]) = 1 (since both are FALSE now) or try to interchange bytes with chars, then you shouldn't have any trouble. Any place you are doing something creative stuffing Bytes into Chars or Strings, then you will need to use AnsiString.
(SizeOf(SomeString) is still 4 no matter the length since it is essentially a pointer with some compiler magic.)
People often implicitly convert from characters to bytes in old Delphi code without really thinking about it. For example, when writing to a stream. When you write a string to a stream, you have to specify the number of bytes you write, but people often pass the character count instead. See this post from Chris Bensen for another example.
Another way people often make this implicit conversion and older code is by using a "string" to store binary data. In this case, they actually want bytes, but the data type expects characters. D2009 has a better type for this.
I didn't try Delphi 2009, but are using fpc which is also switching to unicode slowly. I'm 95% sure that everything below also holds for Delphi 2009
In fpc (when supporting unicode) it will be so that functions like 'length' take the codepage into consideration. Thus it will return the length of the string as a 'human' would see it. If there are - for example - two chinese characters, that both take two bytes of memory in unicode, length will return 2, since there are two characters in the string. But the string will take 4 bytes of memory. (+the memory for the reference count and the leading #0, but that aside)
What you can not do anymore is this:
var p : pchar;
begin
p := s[1];
for i := 0 to length(string)-1 do
begin
write(p);
inc(p);
end;
end;
Because this code will - in the two chinese-character example - write the wrong two characters. Namely the two bytes which are part of the first 'real' character.
In short: Length() doesn't return the amount of bytes allocated for the string anymore, but the amount of characters. (Before the switch to unicode, those two values were equal to eachother)
The actual size of a character shouldn't matter, unless you are doing the manipulation at the byte level.
(Of course if I use strings as strings and not to store any other data)
That's the key point, YOU don't use strings for other purposes, but some people do. They use strings just like arrays, so they (and that's including me) would need to check all such uses to make sure nothing is broken...
Lets not forget that there are times when this conversion is not really desired. Say for storing a GUID in a record for instance. The guid can only contain hexadecimal characters plus the - and brackets...making them take up twice the space can make quite an impact on existing code. Sure the simple solution is to change them to AnsiString, and deal with the compiler warnings if you do any string manipulation on them.
It can be an issue if you make Windows API calls. Or if you have legacy code that does inc or dec of str[0] to change its length.

Resources