Handling of UTF-16 string data type in Delphi XE (best practice)

Handling of UTF-16 string data type in Delphi XE (best practice) - delphi

So much information about Unicode but hard for for me to get a conclusion.
I'm working on an multi-language Delphi XE5 application and now I face this problem with this unicode characters. Honestly I don't want to understand the magic behind, I just want to see them work in my application.
Before it was simple. In general use String data type. Now I've read about WideString, UnicodeString, AnsiString and the fact that String in XE5 is compliant with UTF-16
I've tested with WideString and the lating characters like (șțăîâ) are working, but it's still not clear if WideString is the best one or not. Should I use UnicodeString or else?
So, If I should make a multi-language application that support all languages, in the end, what kind of data type should I use? Is it any possibility to maintain String type and get the same results like WideString?
Remark: I use inside my application FireDac components, but this should not matter.

In modern Delphi "string" is a shortcut to "UnicodeString" real data type. Use it unless have to forced to use other types.
WideString is but Delphi pseudonym for Microsoft OLE BSTR type and lacks reference counting. That disables copy-on-write optimizations and makes those string work slower than UnicodeString in general (the very data buffer is copied time and again instead of just passing new pointer to it). Unless you need exactly those features - better use usual strings. However for i18n both those types work enough.

Related

Cannot get expected result for Spring4D cryptography examples

The Spring4D library has cryptography classes, however I cannot get them to work as expected. I'm probably using them incorrectly, however lack of any examples makes it difficult.
For example on the website https://quickhash.com/hash-sha256-online, I can hash the word "test" to generate the following hash:
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
Using the Spring4D library, the following code produces a different hash:
CreateSHA256.ComputeHash('test').ToString;
results in:
9EFEA1AEAC9EDA04A892885A65FDAE0E6D9BE8C9FC96DA76D31B929262E12B1D
Upper/lower case aside, it is a different hash altogether. I know must be doing something wrong, but again there's no examples of use so I'm stuck on how to do this.

Hashing algorithms operate on binary data, typically represented using byte arrays.
Unfortunately, both of the resources you have used offer the ability to hash text. In order to hash text, you first need to convert from text to binary. To do so requires a choice of encoding. And neither method makes it clear what that choice is.
When I use this Delphi code:
LowerCase(CreateSHA256.ComputeHash(TEncoding.UTF8.GetBytes('test')).ToString)
I get the same hash as appears in your question.
I urge you never to attempt to encrypt/hash text and instead regard these operations as operating on binary. Always use an explicit encoding and then encrypt/hash the array of bytes that the encoding produced.
I've picked the UTF-8 encoding here, because it is a full Unicode encoding, and tends to be efficient in terms of space. However, I don't think your online encoder uses UTF-8. In fact I've no idea what encoding it uses, it is unclear on the matter. This is of course the same old issue of text being different from binary.
In my opinion it is a design flaw of the Delphi library that you use that it allows you to hash text without an explicit choice of encoding. If this library must offer a function that hashes text, then it should require the caller to supply an extra TEncoding parameter.

There is no conversion going on internally so it hashes the UnicodeString which is at least 2 bytes per character.
If you want the same result as on the page you have to use UTF8Encode or directly pass as AnsiString.
However I tried some strings that contained different unicode characters and the page returned a different result. So I am not quite sure how they treat the strings there. I guess it's a codepage thing.
Edit: If you use this page http://www.xorbin.com/tools/sha256-hash-calculator it generates the same hash as TSHA256 with UTF8Encode.

Which type of string are you using? Do you use AnsiString or WideString (Unicode string). Delphi 2009 and Newer are using WideString by default.
Why is string type inportant? All hasging algorithm operates on raw bytes data so it is omportant if each character of your string is stored in one Byte of memory (AnsiString) or multiple Bytes of memory (WideString).

How to use AnsiString to store binary data?

I have a simple question.
I want to use AnsiString as a container for binary data. I mostly load such data from TMemoryStream or TFileStream and I save it back from AnsiString after some processing. Works fine, haven't found a problem with that.
But from what I've seen using it like that sparcles debates to use Sysutils::TBytes instead. Why? Sysutils::TBytes has much fewer useful methods which I can use to manipulate data stored inside for example AnsiString. It is clearly half-finished container, compared to AnsiString.
Is the only problem I should care about conversion to regular string or is there something else why I should really use the less-than-adequate TBytes instead? I do not make conversions of AnsiString to other string types - that is what is quoted as a possible problem elsewhere.
An example of how I load data:
AnsiString data;
boost::scoped_ptr<TFileStream> fs(new TFileStream(FileName, fmOpenRead | fmShareDenyWrite));
data.SetLength(fs->Size);
fs->Read(data.c_str(), fs->Size);
An example how I save data:
// fs wants void * so I have to use data.data() instead of data.c_str() here
fs->Write(data.data(), data.Length());
So it should be safe to store binary data correct?

I want to use AnsiString as a container for binary data.
One word - DON'T! It will bite you someday. Use a more appropriate container, such as TBytes, TMemoryStream, std::vector<byte>, etc.
Works fine, haven't found a problem with that.
Consider yourself lucky. From C++Builder 2009 onwards, AnsiString is codepage-aware, and it WILL cause data conversions if you are not VERY careful when passing AnsiString around. Sooner or later, you are likely to slip up and it will risk corrupting your binary data.
But from what I've seen using it like that sparcles debates to use Sysutils::TBytes instead. Why?
Because it is an actual raw binary container meant specifically for raw bytes.
Sysutils::TBytes has much fewer useful methods which I can use to manipulate data stored inside for example AnsiString.
You should not be manipulating binary data as text to begin with. And since you are using things like Boost and STL, you should consider using their binary containers instead. They have more functions available.
That being said, XE7 does introduce some new functions for manipulating Delphi-style dynamic arrays (like TBytes) including inserts, deletes, and concatenations:
String-Like Operations Supported on Dynamic Arrays
It does not look like those new functions made it into C++Builder's DynamicArray class (which TBytes is a typedef of), though.
It is clearly half-finished container, compared to AnsiString.
AnsiString is a container of text characters. Period. Always has been, always will be. People ABUSE it by taking advantage of the fact that sizeof(char)==sizeof(byte). That worked up to a point, but it has become dangerous in recent years to continue abusing it.
Is the only problem I should care about conversion to regular string or is there something else why I should really use the less-than-adequate TBytes instead?
That, and the fact that Embarcadero has been phasing out AnsiString since 2009. 8bit strings are disabled in the mobile compilers, it is only a matter of time before the desktop compilers follow suit.
Why are you wanting to manipulate raw bytes as strings to begin with? Can you provide an example of something you can do with AnsiString that you cannot do with TBytes?
So it should be safe to store binary data correct?
In your specific example, yes (and yes, you can use c_str() instead of data() when calling fs->Write()).

Delphi XE - should I use String or AnsiString?

I finally upgraded to Delphi XE. I have a library of units where I use strings to store plain ANSI characters (chars between A and U). I am 101% sure that I will never ever use UNICODE characters in those places.
I want to convert all other libraries to Unicode, but for this specific library I think it will be better to stick with ANSI. The advantage is the memory requirement as in some cases I load very large TXT files (containing ONLY Ansi characters). The disadvantage might be that I have to do lots and lots of typecasts when I make those libraries to interact with normal (unicode) libraries.
There are some general guidelines to show when is good to convert to Unicode and when to stick with Ansi?

The problem with general guidelines is that something like this can be very specific to a person's situation. Your example here is one of those.
However, for people Googling and arriving here, some general guidelines are:
Yes, convert to Unicode. Don't try to keep an old app fully using AnsiStrings. The reason is that the whole VCL is Unicode, and you shouldn't try to mix the two, because you will convert every time you assign a Unicode string to an ANSI string, and that is a lossy conversion. Trying to keep the old way because it's less work (or some similar reason) will cause you pain; just embrace the new string type, convert, and go with it.
Instead of randomly mixing the two, explicitly perform any conversions you need to, once - for example, if you're loading data from an old version of your program you know it will be ANSI, so read it into a Unicode string there, and that's it. Ever after, it will be Unicode.
You should not need to change the type of your string variables - string pre-D2009 is ANSI, and in D2009 and alter is Unicode. Instead, follow compiler warnings and watch which string methods you use - some still take an AnsiString parameter and I find it all confusing. The compiler will tell you.
If you use strings to hold bytes (in other words, using them as an array of bytes because a character was a byte) switch to TBytes.
You may encounter specific problems for things like encryption (strings are no longer byte/characters, so 'character' for 'character' you may get different output); reading text files (use the stream classes and TEncoding); and, frankly, miscellaneous stuff. Search here on SO, most things have been asked before.
Commenters, please add more suggestions... I mostly use C++Builder, not Delphi, and there are probably quite a few specific things for Delphi I don't know about.
Now for your specific question: should you convert this library?
If:
The values between A and U are truly only ever in this range, and
These values represent characters (A really is A, not byte value 65 - if so, use TBytes), and
You load large text files and memory is a problem
then not converting to Unicode, and instead switching your strings to AnsiStrings, makes sense.
Be aware that:
There is an overhead every time you convert from ANSI to Unicode
You could use UTF8String, which is a specific type of AnsiString that will not be lossy when converted, and will still store most text (Roman characters) in a single byte
Changing all the instances of string to AnsiString could be a bit of work, and you will need to check all the methods called with them to see if too many implicit conversions are being performed (for performance), etc
You may need to change the outer layer of your library to use Unicode so that conversion code or ANSI/Unicode compiler warnings are not visible to users of your library
If you convert to Unicode, sets of characters (can't remember the syntax, maybe if 'S' in MySet?) won't work. From your description of characters A to U, I could guess you would like to use this syntax.
My recommendation? Personally, the only reason I would do this from the information you've given is the memory use, and possibly performance depending on what you're doing with this huge amount of A..Us. If that truly is significant, it's both the driver and the constraint, and you should convert to ANSI.

You should be able to wrap up the conversion at the interface between this unit and its clients. Use AnsiString internally and string everywhere else and you should be fine.

In general only use AnsiString if it is important that the Chars are single bytes, Otherwise the use of string ensures future compatibility with Unicode.

You need to check all libraries anyway because all Windows API functions in Delhpi XE replaced by their unicode-analogues, etc. If you will never use UNICODE you need to use Delphi 7.

Use AnsiString explicitly everywhere in this unit and then you'll get compiler warning errors (which you should never ignore) for String to AnsiString conversion errors if you happen to access the routines incorrectly.
Alternately, perhaps preferably depending on your situation, simply convert everything to UTF8.

Stick with Ansi strings ONLY if you do not have the time to convert the code properly. The use of Ansi strings is really only for backward compatibility - to my knowledge C# does not have an equiavalent to Ansi strings. Otherwise use the standard Unicode strings. If you have a look on my web-site I have a whole strings routines unit (about 5,000 LOC) that works with both Delphi 2007 (non-Uniocde) and XE (Unicode) with only "string" interfaces and contains almost all of the conversion issues you might face.

How do the new string types work in Delphi 2009/2010?

I have to convert a large legacy application to Delphi 2009 which uses strings, AnsiStrings, WideStrings and UTF8 data all over the place and I have a hard time to understand how the new string types work and how they should be used.
The application fully supported Unicode using TntUnicodeControls and there are 3rd party DLLs which require strings in specific encodings, mostly UTF8 and UTF16, making the conversion task not as trivial as one would suspect.
I especially have problems with the C DLL calls and choosing the right type.
I also get the impression that there are many implicit string conversions happening, because one of the DLL seems to always receive UTF-8 encoded strings, no matter how the Delphi string is encoded.
Can someone please provide a short overview about the new Delphi 2009 string types UnicodeString and RawByteString, perhaps some usage hints and possible pitfalls when converting a pre 2009 application?

See Delphi and Unicode, a white paper written by Marco Cantù and I guess
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), written by Joel.
One pitfall is that the default Win32 API call has been mapped to use the W (wide string) version instead of the A (ANSI) version, for example ShellExecuteA If your code is doing tricky pointer code assuming internal layout of AnsiString, it will break. A fallback is to substitute PChar with PAnsiChar, Char with AnsiChar, string with AnsiString, and append A at the end of Win32 API call for that portion of code. After the code actually compiles and runs normally, you could refactor your code to use string (UnicodeString).

Watch my CodeRage 4 talk on "Using Unicode and Other Encodings in your Programs" this friday, or wait until the replay of it is available online.
I'm going to cover some encodings and explain about the string format.
The slides will be available shortly (I'll try to get them online today) and contain a lot of references to stuff you should read on the internet (but I must admit I forgot the link to Joel on Unicode that eed3si9n posted).
Will edit this answer today with the uploads and the links.
Edit:
If you have a small sample where you can show that your C/C++ DLL receives the strings UTF8 encoded, but thought they should be encoded otherwise, please post it (mail me; almost anything at the pluimers dot com gets to me, especially if you use my first name before the at sign).
Session materials can be downloaded now, including the "Using Unicode and Other Encodings in your Programs" session.
These are links from that session:
Read these:
Marco Cantu, Whitepaper “Delphi and Unicode”
Marco Cantu, Presentation “Delphi and Unicode”
Nick Hodges, Whitepaper “Delphi in a Unicode World”
Relevant on-line help topics:
What's New in Delphi and C++Builder 2009
String Types: Base: ShortString, AnsiString, WideString, UnicodeString
String Types: Unicode (including internal memory layouts of the string types)
String Types: Enabling for Unicode
String Types: RawByteString (AnsiString with CodePage $ffff)
String Types: UTF8String (AnsiString with CodePage 65001)
String <-> PChar conversions: PChar fundamentals
String <-> PChar conversions: Returning a PChar Local Variable
String <-> PChar conversions: Passing a Local Variable as a PChar
Hope this gets you going. If not, mail me and I'll try to extend the answer here.

Note that it does not only hit real string code. It also hits code where PCHAR is used to trawl through buffers, or interface with APIs.
E.g. initialization code of headers that load the DLL dynamically (getprocedureaddress/loadlibray)

It seems almost all my problems come from the automatic conversion on assignments to UTF8String.
I already had old code using UTF8String just to help me think which type of string a variable should contain.
When starting to port my application, I replaced AnsiString with UTF8String for the same reason, but the code depended on UTF8String being just an alias to (classic) AnsiString
Now with the automatic conversion that assumption is no longer true, which created many problems.
Be careful if you use UTF8String when porting from pre-2009 Delphi code!

Another thing to watch out for when passing string between dlls built with different versions of Delphi or C++ Builder is that, starting with 2009, the StrRec part of AnsiStringBase gained two extra fields; codePage and elemSize. They are 2 bytes each (short ints), so the size of StrRec is now 12 bytes instead of 8. This can cause invalid pointer exception problems with memory allocation and destruction, even when the data part of the string seems to transfer ok.

Is there a quick and dirty way to Cast PansiChar to Pchar in Delphi 2009

I have a very large number of app to convert to Delphi 2009 and there are a number of external interfaces that return pAnsiChars. Does anyone have a quick and simple way to cast these back to pChars? There is a lot on string to pAnsiChar, but much I can find on the other way around.

Delphi 2009 has added a new string type called RawByteString. It is defined as:
type
RawByteString = type AnsiString($ffff);
If you need to save binary data coming in as PAnsiString, you can use this. You should be able to use the RawByteString the way you used AnsiString previously.
However, the recommended long term solution is still to convert your programs to Unicode.

There is no way to "cast" a PAnsiChar to a PChar. PChar is Unicode in Delphi 2009. Ansi data cannot be simply casted to Unicode, and vice versa. You have to perform an actual data conversion. If you have a PAnsiChar pointer to some data, and want to put the data into a Unicode string, then assign the PAnsiChar data to an AnsiString first, and then assign the AnsiString to the Unicode string as needed. Likewise, if you need to pass a Unicode string to a PAnsiChar, you have to assign the data to an AnsiString first. There are articles on Embarcadero's and TeamB's blog sites that take about Delphi 2009 migration issues.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart