Why does Delphi warn when assigning ShortString to string? - delphi

I'm converting some legacy code to Delphi 2010.
There are a fair number of old ShortStrings, like string[25]
Why does the assignment below:
var
S: String;
ShortS: String[25];
...
S := ShortS;
cause the compiler to generate this warning:
W1057 Implicit string cast from 'ShortString' to 'string'.
There's no data loss that is occurring here. In what circumstances would this warning be helpful information to me?
Thanks!
Tomw

It's because your code is implicitly converting a single-byte character string to a UnicodeString. It's warning you in case you might have overlooked it, since that can cause problems if you do it by mistake.
To make it go away, use an explicit conversion:
S := string(ShortS);

The ShortString type has not changed. It continues to be, in effect, an array of AnsiChar.
By assigning it to a string type, you are taking what is a group of AnsiChars (one byte) and putting it into a group of WideChars (two bytes). The compiler can do that just fine, and is smart enough not to lose data, but the warning is there to let you know that such a conversion has taken place.
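To see the widening concretely, here is a minimal console sketch. It assumes Delphi 2009 or later (where string is UnicodeString); the program name and the WidenShort helper are made up for illustration.

```pascal
program ShortToUnicodeDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: spells out the AnsiChar -> WideChar widening.
// The explicit cast is what silences W1057.
function WidenShort(const Value: ShortString): string;
begin
  Result := string(Value);
end;

var
  ShortS: string[25];
  S: string;
begin
  ShortS := 'Hello';
  S := WidenShort(ShortS);
  Writeln('SizeOf(AnsiChar) = ', SizeOf(AnsiChar)); // element size of a ShortString
  Writeln('SizeOf(Char)     = ', SizeOf(Char));     // element size of string
  Writeln('Length(ShortS)   = ', Length(ShortS));   // character count is unchanged
  Writeln('Length(S)        = ', Length(S));
  Writeln('Bytes in S       = ', Length(S) * SizeOf(Char));
end.
```

The character count stays the same on both sides; only the storage per character changes.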

The warning is very important because you may lose data. The conversion is done using the current Windows 8-bit character set, and some character sets do not define all values between 0 and 255, or are multi-byte character sets, and thus cannot convert all byte values.
The data loss can occur on a standard computer in a country with specific standard character sets, or on a computer in USA that has been set up for a different locale, because the user communicates a lot with people in other languages.
For instance, if the local code page is 932, the byte values 129 and 130 will both convert to the same value in the Unicode string.
In addition to this, the conversion involves a Windows API call, which is an expensive operation. If you do a lot of these, it can slow down your application.

It's safe (as long as you're using the ShortString for its intended purpose: to hold a string of characters and not a collection of bytes, some of which may be 0), but may have performance implications if you do it a lot. As far as I know, Delphi has to allocate memory for the new Unicode string, extract the characters from the ShortString into a null-terminated string (that's why it's important that it's a properly-formed string) and then call something like the Windows API MultiByteToWideChar() function. Not rocket science, but not a trivial operation either.

ShortStrings don't have a code page associated with them; AnsiStrings do (since D2009).
The conversion from ShortString to UnicodeString can only be done on the assumption that ShortStrings are encoded in the default ANSI encoding, which is not a safe assumption.
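If you know which code page the bytes were written in, you can make that choice explicit instead of relying on the system default. The following is only a sketch, assuming Delphi 2009 or later and that code pages 1252 and 1251 are installed; ShortToUnicode is a hypothetical helper name.

```pascal
program ShortStringCodePage;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: interprets the bytes of a ShortString using an
// explicitly chosen code page instead of the system default.
function ShortToUnicode(const Value: ShortString; CodePage: Integer): string;
var
  Bytes: TBytes;
  Enc: TEncoding;
begin
  if Length(Value) = 0 then
    Exit('');
  // Copy the payload bytes, skipping the length byte at index 0.
  SetLength(Bytes, Length(Value));
  Move(Value[1], Bytes[0], Length(Value));
  // GetEncoding returns a new instance that the caller must free.
  Enc := TEncoding.GetEncoding(CodePage);
  try
    Result := Enc.GetString(Bytes);
  finally
    Enc.Free;
  end;
end;

var
  Legacy: ShortString;
  U: string;
begin
  Legacy := 'caf?';
  Legacy[4] := AnsiChar($E9); // raw byte E9; its meaning depends on the code page
  U := ShortToUnicode(Legacy, 1252);
  Writeln(IntToHex(Ord(U[4]), 4)); // 00E9 (Latin small e with acute)
  U := ShortToUnicode(Legacy, 1251);
  Writeln(IntToHex(Ord(U[4]), 4)); // 0439 (Cyrillic small short i)
end.
```

The same byte yields two different Unicode characters, which is why the implicit conversion has to guess.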

I don't really know Delphi, but if I remember correctly, ShortStrings are essentially a sequence of characters stored inline (on the stack for local variables), whereas a regular string (AnsiString, or UnicodeString since Delphi 2009) is actually a reference to a location on the heap. This may have different implications.
Here's a good article on the different string types:
http://www.codexterity.com/delphistrings.htm
I think there might also be a difference in terms of encoding but I'm not 100% sure.

Related

Operand mismatch converting from D6 to RS10

I took a break from porting code, and now I'm spending some more time on it again.
Problem is, I guess I'm still stuck backwards in my head (everything works fine on D6 :D).
Can anyone tell me why this simple code is not working?
if NewSig <> NewCompressionSignature then
E2015 Operator not applicable to this operand type
Here are the definitions of the above:
NewCompressionSignature: TCompressionSignature = 'DRM$IG01';
NewSig: array[0..SizeOf(NewCompressionSignature)-1] of Char;
I'm just guessing here because the type of TCompressionSignature is not given, but I can reproduce E2015 if TCompressionSignature is declared as some kind of ShortString like
type
TCompressionSignature = String[8];
As you might know, Delphi currently uses Unicode as its standard internal string encoding. For backward compatibility reasons, the type ShortString and other short string types (like String[8]) were left unchanged. These strings have the same encoding as AnsiString and are composed of plain old 1-byte characters (AnsiChar).
NewSig, on the other hand, is composed of two-byte Unicode characters and cannot be compared directly with a ShortString.
One solution of your problem would be to declare:
NewSig: array[0..SizeOf(NewCompressionSignature)-1] of AnsiChar;
Another solution would be a cast to string:
if NewSig <> String(NewCompressionSignature) then ...
But I would prefer to change the array declaration if possible.
Please review the documentation on short strings and on Unicode, especially if you're doing I/O operations, to ensure your input and output are read and written with the correct code page.
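Put together, the AnsiChar variant could look like the sketch below. The declaration of TCompressionSignature is still a guess (the question does not show it), and TSigBuffer, FillBuffer and IsNewSignature are hypothetical names used only for this illustration.

```pascal
program SigCompare;
{$APPTYPE CONSOLE}

type
  TCompressionSignature = string[8]; // assumed; not shown in the question
  // One extra element (the size of the length byte) leaves room for a #0.
  TSigBuffer = array[0..SizeOf(TCompressionSignature) - 1] of AnsiChar;

const
  NewCompressionSignature: TCompressionSignature = 'DRM$IG01';

// Hypothetical helper: fills the buffer the way a file read might.
procedure FillBuffer(var Buffer: TSigBuffer; const Source: ShortString);
var
  Count: Integer;
begin
  FillChar(Buffer, SizeOf(Buffer), 0);
  Count := Length(Source);
  if Count > High(Buffer) then
    Count := High(Buffer); // keep the final #0 in place
  Move(Source[1], Buffer, Count);
end;

// Both operands are single-byte strings, so the comparison compiles.
function IsNewSignature(const Buffer: TSigBuffer): Boolean;
begin
  Result := Buffer = NewCompressionSignature;
end;

var
  NewSig: TSigBuffer;
begin
  FillBuffer(NewSig, 'DRM$IG01');
  if IsNewSignature(NewSig) then
    Writeln('Signature OK')
  else
    Writeln('Signature mismatch');
end.
```

Keeping the buffer as AnsiChar also keeps its size in bytes the same as it was in D6, which matters if it is read from a file.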

Cannot get expected result for Spring4D cryptography examples

The Spring4D library has cryptography classes, however I cannot get them to work as expected. I'm probably using them incorrectly, however lack of any examples makes it difficult.
For example on the website https://quickhash.com/hash-sha256-online, I can hash the word "test" to generate the following hash:
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
Using the Spring4D library, the following code produces a different hash:
CreateSHA256.ComputeHash('test').ToString;
results in:
9EFEA1AEAC9EDA04A892885A65FDAE0E6D9BE8C9FC96DA76D31B929262E12B1D
Upper/lower case aside, it is a different hash altogether. I know I must be doing something wrong, but again there are no examples of use, so I'm stuck on how to do this.
Hashing algorithms operate on binary data, typically represented using byte arrays.
Unfortunately, both of the resources you have used offer the ability to hash text. In order to hash text, you first need to convert from text to binary. To do so requires a choice of encoding. And neither method makes it clear what that choice is.
When I use this Delphi code:
LowerCase(CreateSHA256.ComputeHash(TEncoding.UTF8.GetBytes('test')).ToString)
I get the same hash as appears in your question.
I urge you never to attempt to encrypt/hash text and instead regard these operations as operating on binary. Always use an explicit encoding and then encrypt/hash the array of bytes that the encoding produced.
I've picked the UTF-8 encoding here, because it is a full Unicode encoding, and tends to be efficient in terms of space. However, I don't think your online encoder uses UTF-8. In fact I've no idea what encoding it uses, it is unclear on the matter. This is of course the same old issue of text being different from binary.
In my opinion it is a design flaw of the Delphi library that you use that it allows you to hash text without an explicit choice of encoding. If this library must offer a function that hashes text, then it should require the caller to supply an extra TEncoding parameter.
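To illustrate why the encoding choice changes the hash, this small sketch dumps the bytes that two encodings produce for the same text. It uses only TEncoding from SysUtils (Delphi 2009 or later); BytesToHex is a hypothetical helper, and no hashing library is involved.

```pascal
program EncodingBytes;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: renders a byte array as lowercase hex.
function BytesToHex(const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + LowerCase(IntToHex(B, 2));
end;

begin
  // Same text, two encodings, two different byte sequences,
  // and therefore two different hashes.
  Writeln('UTF-8    : ', BytesToHex(TEncoding.UTF8.GetBytes('test')));    // 74657374
  Writeln('UTF-16LE : ', BytesToHex(TEncoding.Unicode.GetBytes('test'))); // 7400650073007400
end.
```

Whatever hash function you then apply sees 4 bytes in the first case and 8 in the second.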
There is no conversion going on internally so it hashes the UnicodeString which is at least 2 bytes per character.
If you want the same result as on the page you have to use UTF8Encode or directly pass as AnsiString.
However I tried some strings that contained different unicode characters and the page returned a different result. So I am not quite sure how they treat the strings there. I guess it's a codepage thing.
Edit: If you use this page http://www.xorbin.com/tools/sha256-hash-calculator it generates the same hash as TSHA256 with UTF8Encode.
Which type of string are you using: AnsiString or UnicodeString? Delphi 2009 and newer use UnicodeString by default.
Why is the string type important? All hashing algorithms operate on raw bytes, so it matters whether each character of your string is stored in one byte of memory (AnsiString) or in multiple bytes of memory (UnicodeString).

Delphi String / Array of Strings

I have an old program which was written in Delphi 1 (or 2, I'm not sure) and I want to build a 64-bit version of it (I use Delphi XE2). Now the problem is that in the source code there are on the one hand strings and on the other arrays of strings (I guess to limit the string length).
Now there are a lot of errors while compiling because of incompatible types.
Above all there are procedures which should handle both types.
Is there an easy way to solve this problem (without changing every variable)?
Short answer
Search and replace ": string" with ": AnsiString".
Make sure you use Length(AString) and SetLength(AString, NewLength) instead of manipulating AString[0].
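As a rough illustration of the second point, the sketch below puts the legacy length-byte idiom next to its SetLength replacement. TruncateShort and TruncateLong are hypothetical helper names.

```pascal
program LengthByteDemo;
{$APPTYPE CONSOLE}

// Legacy idiom: truncate by writing the length byte directly.
// Only valid for ShortString, which stores its length at index 0,
// and only safe when NewLen does not exceed the current length.
function TruncateShort(const Value: ShortString; NewLen: Byte): ShortString;
begin
  Result := Value;
  Result[0] := AnsiChar(NewLen);
end;

// Portable replacement: SetLength works for every string type.
function TruncateLong(const Value: AnsiString; NewLen: Integer): AnsiString;
begin
  Result := Value;
  SetLength(Result, NewLen);
end;

begin
  Writeln(TruncateShort('abcdef', 3)); // abc
  Writeln(TruncateLong('abcdef', 3));  // abc
end.
```

Long strings have no element at index 0, so the first form simply has no equivalent for them.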
Long answer
Delphi 1 has only one type of string: the old-school ShortString, which has a declared maximum length of at most 255 chars.
It looks and feels like an array of char, but it has a leading length byte.
var
ShortString: string[100];
In Delphi 2, long strings (aka AnsiString) were introduced; these replace the ShortString. They do not have a fixed length, but are allocated dynamically instead and automatically grow and shrink as needed.
They are automatically created and destroyed.
var
Longstring: string; //AnsiString, can have any length up to 2GB.
In Delphi 2009 Unicode was introduced.
This changes the long string, because now each char no longer takes up 1 byte, but takes 2 bytes(*).
Additionally, you can specify a character set for an AnsiString, whereas the new Unicode long string uses UTF-16.
What you need to do depends on your needs:
If you just want the old code to work as before and you don't care about supporting all the multilingual stuff Unicode supports, you will need to replace all your string keywords with AnsiString (for all strings that are longstrings).
If you have Delphi 1 code, you can rename the string to ShortString.
I would recommend that you refactor the code to always use longstrings (read: AnsiString) though.
Delphi will automatically convert the UnicodeStrings returned by functions into AnsiStrings and vice versa; however, this may cause loss of data if your users enter symbols in an edit box that your AnsiString cannot store.
Also all that translation takes a bit of time (I doubt you will notice this though).
In Delphi 1 up to Delphi 2007 this problem did not exist, because controls did not allow Unicode characters to be entered.
(*) gross oversimplification
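The data-loss risk mentioned above can be probed with a simple round trip. This is only a sketch, assuming Delphi 2009 or later; SurvivesAnsi is a hypothetical helper, and the result for non-ASCII text depends on the ANSI code page of the machine.

```pascal
program LossyRoundTrip;
{$APPTYPE CONSOLE}

// Hypothetical helper: True when Value survives a trip through AnsiString
// using the system's default ANSI code page.
function SurvivesAnsi(const Value: string): Boolean;
begin
  Result := string(AnsiString(Value)) = Value;
end;

begin
  // Plain ASCII is representable in the usual Windows ANSI code pages.
  Writeln(SurvivesAnsi('plain ASCII'));
  // Two Chinese characters: the result depends on the ANSI code page;
  // on a Western European (1252) system they cannot be stored.
  Writeln(SurvivesAnsi(#$4E2D#$6587 + ' text'));
end.
```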

Delphi XE - should I use String or AnsiString?

I finally upgraded to Delphi XE. I have a library of units where I use strings to store plain ANSI characters (chars between A and U). I am 101% sure that I will never ever use UNICODE characters in those places.
I want to convert all other libraries to Unicode, but for this specific library I think it will be better to stick with ANSI. The advantage is the memory requirement, as in some cases I load very large TXT files (containing ONLY Ansi characters). The disadvantage might be that I have to do lots and lots of typecasts when I make those libraries interact with normal (Unicode) libraries.
Are there any general guidelines on when it is good to convert to Unicode and when to stick with Ansi?
The problem with general guidelines is that something like this can be very specific to a person's situation. Your example here is one of those.
However, for people Googling and arriving here, some general guidelines are:
Yes, convert to Unicode. Don't try to keep an old app fully using AnsiStrings. The reason is that the whole VCL is Unicode, and you shouldn't try to mix the two, because you will convert every time you assign a Unicode string to an ANSI string, and that is a lossy conversion. Trying to keep the old way because it's less work (or some similar reason) will cause you pain; just embrace the new string type, convert, and go with it.
Instead of randomly mixing the two, explicitly perform any conversions you need to, once - for example, if you're loading data from an old version of your program you know it will be ANSI, so read it into a Unicode string there, and that's it. Ever after, it will be Unicode.
You should not need to change the type of your string variables - string pre-D2009 is ANSI, and in D2009 and later is Unicode. Instead, follow compiler warnings and watch which string methods you use - some still take an AnsiString parameter and I find it all confusing. The compiler will tell you.
If you use strings to hold bytes (in other words, using them as an array of bytes because a character was a byte) switch to TBytes.
You may encounter specific problems for things like encryption (strings are no longer byte/characters, so 'character' for 'character' you may get different output); reading text files (use the stream classes and TEncoding); and, frankly, miscellaneous stuff. Search here on SO, most things have been asked before.
Commenters, please add more suggestions... I mostly use C++Builder, not Delphi, and there are probably quite a few specific things for Delphi I don't know about.
Now for your specific question: should you convert this library?
If:
The values between A and U are truly only ever in this range, and
These values represent characters (A really is A, not byte value 65 - if so, use TBytes), and
You load large text files and memory is a problem
then not converting to Unicode, and instead switching your strings to AnsiStrings, makes sense.
Be aware that:
There is an overhead every time you convert from ANSI to Unicode
You could use UTF8String, which is a specific type of AnsiString that will not be lossy when converted, and will still store most text (Roman characters) in a single byte
Changing all the instances of string to AnsiString could be a bit of work, and you will need to check all the methods called with them to see if too many implicit conversions are being performed (for performance), etc
You may need to change the outer layer of your library to use Unicode so that conversion code or ANSI/Unicode compiler warnings are not visible to users of your library
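If you convert to Unicode, sets of characters (can't remember the syntax, maybe if 'S' in MySet?) won't work as before: a set can only hold single-byte characters, so the compiler emits warning W1050 and suggests CharInSet instead. From your description of characters A to U, I could guess you would like to use this syntax.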
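For the set-membership case, a Unicode-friendly version can be written with CharInSet from SysUtils (available since Delphi 2009). The sketch below is one way to do it; OnlyAToU is a hypothetical helper name.

```pascal
program CharInSetDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: True when every character is in the range A..U.
function OnlyAToU(const Value: string): Boolean;
var
  C: Char;
begin
  Result := True;
  for C in Value do
    if not CharInSet(C, ['A'..'U']) then
      Exit(False);
end;

begin
  Writeln(OnlyAToU('BADGE')); // TRUE
  Writeln(OnlyAToU('ZEBRA')); // FALSE ('Z' is outside A..U)
end.
```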
My recommendation? Personally, the only reason I would do this from the information you've given is the memory use, and possibly performance depending on what you're doing with this huge amount of A..Us. If that truly is significant, it's both the driver and the constraint, and you should convert to ANSI.
You should be able to wrap up the conversion at the interface between this unit and its clients. Use AnsiString internally and string everywhere else and you should be fine.
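A minimal sketch of that boundary idea, with made-up routine names (CountA, CountAInText): the internal routine only ever sees AnsiString, and the one explicit conversion lives in the wrapper that clients call.

```pascal
program AnsiBoundary;
{$APPTYPE CONSOLE}

// Internal routine: works on single-byte text only.
function CountA(const Data: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(Data) do
    if Data[I] = 'A' then
      Inc(Result);
end;

// Public wrapper: the only place where string -> AnsiString conversion
// happens, written as an explicit cast so no warning leaks to callers.
function CountAInText(const Value: string): Integer;
begin
  Result := CountA(AnsiString(Value));
end;

begin
  Writeln(CountAInText('BANANA')); // 3
end.
```

In a real library the wrapper would sit in the interface section of the unit and the AnsiString routine in the implementation section.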
In general, only use AnsiString if it is important that the Chars are single bytes; otherwise the use of string ensures future compatibility with Unicode.
You need to check all libraries anyway, because all Windows API functions in Delphi XE are replaced by their Unicode analogues, etc. If you will never use UNICODE you need to use Delphi 7.
Use AnsiString explicitly everywhere in this unit and then you'll get compiler warnings (which you should never ignore) for String to AnsiString conversions if you happen to access the routines incorrectly.
Alternately, perhaps preferably depending on your situation, simply convert everything to UTF8.
Stick with Ansi strings ONLY if you do not have the time to convert the code properly. The use of Ansi strings is really only for backward compatibility - to my knowledge C# does not have an equivalent to Ansi strings. Otherwise use the standard Unicode strings. If you have a look on my website I have a whole strings routines unit (about 5,000 LOC) that works with both Delphi 2007 (non-Unicode) and XE (Unicode) with only "string" interfaces and contains almost all of the conversion issues you might face.

Delphi 2009 + Unicode + Char-size

I just got Delphi 2009 and have previously read some articles about modifications that might be necessary because of the switch to Unicode strings.
Mostly, it is mentioned that sizeof(char) is not guaranteed to be 1 anymore.
But why would this be interesting regarding string manipulation?
For example, if I use an AnsiString:='Test' and do the same with a String (which is unicode now), then I get Length() = 4 which is correct for both cases.
Without having tested it, I'm sure all other string manipulation functions behave the same way and decide internally if the argument is a unicode string or anything else.
Why would the actual size of a char be of interest for me if I do string manipulations?
(Of course if I use strings as strings and not to store any other data)
Thanks for any help!
Holger
With Unicode, the byte size of a string is no longer equal to Length(SomeString): essentially the length of a string is less than the sum of the sizes of its chars. As long as you don't assume SizeOf(Char) = 1, or SizeOf(SomeString[x]) = 1 (since both are FALSE now), or try to interchange bytes with chars, you shouldn't have any trouble. Any place where you are doing something creative, stuffing Bytes into Chars or Strings, you will need to use AnsiString.
(SizeOf(SomeString) is still 4 no matter the length since it is essentially a pointer with some compiler magic.)
People often implicitly convert from characters to bytes in old Delphi code without really thinking about it. For example, when writing to a stream. When you write a string to a stream, you have to specify the number of bytes you write, but people often pass the character count instead. See this post from Chris Bensen for another example.
Another way people often make this implicit conversion in older code is by using a "string" to store binary data. In this case, they actually want bytes, but the data type expects characters. D2009 has a better type for this: TBytes.
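The stream case can be sketched as follows, assuming Delphi 2009 or later. WriteText is a hypothetical helper; the point is that the count passed to WriteBuffer is in bytes, so it has to be scaled by SizeOf(Char).

```pascal
program StreamBytes;
{$APPTYPE CONSOLE}

uses
  Classes;

// Hypothetical helper: writes the character data of S and returns the
// number of bytes that ended up in the stream.
function WriteText(Stream: TStream; const S: string): Int64;
begin
  if S <> '' then
    Stream.WriteBuffer(PChar(S)^, Length(S) * SizeOf(Char)); // bytes, not characters
  Result := Stream.Size;
end;

var
  Stream: TMemoryStream;
begin
  Stream := TMemoryStream.Create;
  try
    Writeln(WriteText(Stream, 'Test')); // 8 when SizeOf(Char) = 2
  finally
    Stream.Free;
  end;
end.
```

Old code that passes Length(S) alone would write only the first half of the string here.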
I didn't try Delphi 2009, but I am using FPC, which is also slowly switching to Unicode. I'm 95% sure that everything below also holds for Delphi 2009.
In FPC (when supporting Unicode), functions like 'length' will take the code page into consideration. Thus it will return the length of the string as a 'human' would see it. If there are, for example, two Chinese characters that each take two bytes of memory in Unicode, length will return 2, since there are two characters in the string. But the string will take 4 bytes of memory (plus the memory for the reference count and the terminating #0, but that aside).
What you can not do anymore is this:
var
  p: PChar;
  i: Integer;
begin
  p := @s[1]; // s is the string being printed, declared elsewhere
  for i := 0 to Length(s) - 1 do
  begin
    Write(p^);
    Inc(p);
  end;
end;
Because this code will, in the example with two Chinese characters, write the wrong two characters: namely the two bytes that are part of the first 'real' character.
In short: Length() doesn't return the number of bytes allocated for the string anymore, but the number of characters. (Before the switch to Unicode, those two values were equal to each other.)
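For Delphi 2009 and later specifically, the two-character example can be checked with the sketch below. ByteSize is a hypothetical helper. Note that in Delphi, Length counts Char elements (UTF-16 code units), so a character outside the Basic Multilingual Plane counts as two.

```pascal
program LengthVsBytes;
{$APPTYPE CONSOLE}

// Hypothetical helper: size in bytes of the character data of S.
function ByteSize(const S: string): Integer;
begin
  Result := Length(S) * SizeOf(Char);
end;

var
  S: string;
begin
  S := #$4E2D#$6587; // two Chinese characters, written as UTF-16 code units
  Writeln('Length   = ', Length(S));   // 2
  Writeln('ByteSize = ', ByteSize(S)); // 4 when SizeOf(Char) = 2
end.
```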
The actual size of a character shouldn't matter, unless you are doing the manipulation at the byte level.
(Of course if I use strings as strings and not to store any other data)
That's the key point, YOU don't use strings for other purposes, but some people do. They use strings just like arrays, so they (and that's including me) would need to check all such uses to make sure nothing is broken...
Let's not forget that there are times when this conversion is not really desired. Say for storing a GUID in a record, for instance. The GUID can only contain hexadecimal characters plus the - and brackets... making them take up twice the space can make quite an impact on existing code. Sure, the simple solution is to change them to AnsiString, and deal with the compiler warnings if you do any string manipulation on them.
It can be an issue if you make Windows API calls. Or if you have legacy code that does inc or dec of str[0] to change its length.
