What determines if a variable of type UnicodeString represents a Unicode string or an ANSI string? - delphi

I'm experienced with Delphi but new to Unicode.
The embedded Delphi XE2 documentation about UnicodeString (System.UnicodeString) says:
"Delphi utilizes several string types. UnicodeString can contain both Unicode and ANSI strings.
Support for this type includes the following features:
Strings as large as available memory.
Efficient use of memory through shared references.
Routines and operators that evaluate strings based on the current locale.
Despite its name, UnicodeString can represent both ANSI character set strings and Unicode strings. "
I don't understand what is meant by the word "can." ("It can contain both Unicode and ANSI." ... "Despite its name, UnicodeString can represent both ANSI character set strings and Unicode strings.")
My question: what determines if a variable of type UnicodeString represents a Unicode string or an ANSI string?

The documentation is outdated. UnicodeString in XE2 can only contain Unicode data.
In CB2009 and D2009, when UnicodeString was first introduced, there were cases, mostly in C++<->Delphi interactions, where the RTL allowed Ansi data to be stored in a UnicodeString and Unicode data to be stored in an AnsiString to help users migrate legacy Ansi code to Unicode. UnicodeString and AnsiString do have a unified internal structure, and the Delphi compiler had a {$STRINGCHECKS} directive that would detect any discrepancies and perform silent data conversions when needed. Although it did work, it also had subtle side effects if you were not careful with it.
By the time XE was released, Embarcadero figured users had had enough time to migrate, so the {$STRINGCHECKS} directive and the supporting RTL functionality were removed. UnicodeString and AnsiString still have a unified internal structure, so it is technically possible to store Ansi data in a UnicodeString and Unicode data in an AnsiString, but you would have to manipulate memory directly to do it. The compiler/RTL will not do it in "normal" code, and no longer performs silent conversions when discrepancies exist, so data corruption and/or crashes can occur if you are not careful.

Related

What is the difference between WideChar and AnsiChar?

I'm upgrading some ancient (from 2003) Delphi code to Delphi Architect XE and I'm running into a few problems. I am getting a number of errors where there are incompatible types. These errors don't happen in Delphi 6 so I must assume that this is because things have been upgraded.
I honestly don't know what the difference between PAnsiChar and PWideChar is, but Delphi sure knows the difference and won't let me compile. If I knew what the differences were maybe I could figure out which to use or how to fix this.
The short version: prior to Delphi 2009, the native string type in Delphi was ANSI: each char in every string was represented as an 8-bit char. Starting with Delphi 2009, Delphi's strings became Unicode, using the UTF-16 encoding: the basic Char now uses 16 bits of data (2 bytes), and you probably don't need to know much about the Unicode code points that are represented as two consecutive 16-bit chars (surrogate pairs).
The 8-bit chars are called "Ansi Chars". A PAnsiChar is a pointer to 8-bit chars.
The 16-bit chars are called "Wide Chars". A PWideChar is a pointer to 16-bit chars.
Delphi knows the difference and is right not to let you mix the two!
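To make the size difference concrete, here is a minimal console sketch. It only uses SizeOf and Length; the program name is arbitrary, and the value printed for Char depends on which Delphi version compiles it.

```pascal
program CharSizes;
{$APPTYPE CONSOLE}

var
  A: AnsiString;
  W: WideString;
begin
  // Element sizes are fixed by the type, not by the string contents.
  WriteLn('SizeOf(AnsiChar) = ', SizeOf(AnsiChar)); // 1
  WriteLn('SizeOf(WideChar) = ', SizeOf(WideChar)); // 2
  // 1 before Delphi 2009, 2 from Delphi 2009 on
  WriteLn('SizeOf(Char)     = ', SizeOf(Char));

  A := 'abc';
  W := 'abc';
  // Same text, same Length, different storage size.
  WriteLn(Length(A) * SizeOf(AnsiChar), ' bytes as AnsiString'); // 3
  WriteLn(Length(W) * SizeOf(WideChar), ' bytes as WideString'); // 6
end.
```

A PAnsiChar walks such data one byte at a time and a PWideChar two bytes at a time, which is why the compiler refuses to treat one as the other.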
More info
Here's a popular link on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
You can find some more information on migrating Delphi to Unicode here: New White Paper: Delphi Unicode Migration for Mere Mortals
You may also search SO for "Delphi Unicode migration".
A couple of years ago, the default character type in Delphi was changed from AnsiChar (a single-byte variable representing an ANSI character) to WideChar (a two-byte variable representing a UTF-16 code unit). The Char type is now an alias for WideChar instead of AnsiChar, the string type is now an alias for UnicodeString (a UTF-16 Unicode version of Delphi's traditional string type) instead of AnsiString, and the PChar type is now an alias for PWideChar instead of PAnsiChar.
The compiler can take care of a lot of the conversions itself, but there are a few issues:
If you're using string-pointer types, such as PChar, you need to make sure your pointer is pointing to the right type of data, and the compiler can't always verify this.
If you're passing strings to var parameters, the variable type needs to be exactly the same. This can be more complicated now that you've got two string types to deal with.
If you're using string as a convenient byte-array buffer for holding arbitrary data instead of a variable that holds text, that won't work as a UnicodeString. Make sure those are declared as RawByteString as a workaround.
Anyplace you're dealing with string byte lengths, for example when reading or writing to/from a TStream, make sure your code isn't assuming that a char is one byte long.
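The last point can be sketched as follows. This is an illustrative example only: ByteLen and WriteText are hypothetical helper names, not RTL routines. The idea is to compute byte counts from SizeOf(Char) rather than assuming one byte per char.

```pascal
program StreamBytes;
{$APPTYPE CONSOLE}

uses
  Classes;

// Number of bytes occupied by the character data of S.
function ByteLen(const S: string): Integer;
begin
  Result := Length(S) * SizeOf(Char); // bytes, not characters
end;

// Writes the raw character data of S to Stream.
procedure WriteText(Stream: TStream; const S: string);
begin
  if S <> '' then
    Stream.WriteBuffer(S[1], ByteLen(S));
end;

var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    WriteText(MS, 'abc');
    // 3 bytes with an Ansi string type, 6 bytes in Delphi 2009 and later
    WriteLn(MS.Size, ' bytes written');
  finally
    MS.Free;
  end;
end.
```

Code that passed Length(S) directly as the byte count would silently write only half of the data once string became UnicodeString.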
Take a look at Delphi Unicode Migration for Mere Mortals for some more tricks and advice on how to get this to work. It's not as hard as it sounds, but it's not trivial either. Good luck!

ReadLn working with WideString (utf-8 files)

I use delphi 7.
I need to read a UTF-8 file line by line; each line contains a word and its weight (a number).
So I need to read each line in turn, split it at a separator (tab char) and save the parts in memory.
So,
1) is there a library for working with UTF-8 files in Delphi (third-party maybe)?
2) will the standard functions work correctly with WideString? I use PosEx. If they won't, can you also give a link to a third-party library for working with WideStrings?
If it is really UTF-8 that you are dealing with, then you should not need anything special as far as reading and processing them. You should be able to treat them as pchar or even as a normal Delphi 7 string. If you try to show the contents in some kind of message box, then you may need to do some conversions. For example, I don't believe the Delphi 7 message box method would display UTF-8 strings correctly if the string contained any byte values over 127 (0x7f). For something like that, you would need to convert to UTF-16 and call the Windows API MessageBoxW or something similar. Otherwise, though, UTF-8 strings can be treated in many situations the same as single byte ANSI strings.
I don't think UTF-8 is typically referred to as "widestring". I might be wrong, but I think that typically means UTF-16.
If your file is encoded as UTF-8, and the characters you're looking for are ASCII, then there's no need to use WideString at all. ASCII is a subset of UTF-8, and any ASCII character is guaranteed not to interfere with the special encoding used for other characters in UTF-8. The number characters 0 through 9 and the tab character are all ASCII.
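A rough sketch of that approach, assuming each line has the form word, tab, integer weight. TabPos and SplitLine are hypothetical helper names. The word is kept as raw UTF-8 bytes in an AnsiString and only the ASCII parts are interpreted.

```pascal
program SplitUtf8Line;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Returns the 1-based index of the first tab byte in Line, or 0 if none.
// Tab (#9) is ASCII, so it never occurs inside a multi-byte UTF-8 sequence.
function TabPos(const Line: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(Line) do
    if Line[I] = #9 then
    begin
      Result := I;
      Exit;
    end;
end;

// Splits 'word<TAB>weight'; Term keeps its raw UTF-8 bytes.
function SplitLine(const Line: AnsiString; out Term: AnsiString;
  out Weight: Integer): Boolean;
var
  P: Integer;
begin
  P := TabPos(Line);
  Result := (P > 0) and
    TryStrToInt(string(Copy(Line, P + 1, MaxInt)), Weight);
  if Result then
    Term := Copy(Line, 1, P - 1);
end;

var
  Line, Term: AnsiString;
  Weight: Integer;
begin
  Line := 'caf';
  // Append the UTF-8 bytes C3 A9 (U+00E9), a tab, and the weight.
  Line := Line + AnsiChar($C3) + AnsiChar($A9) + AnsiChar(9) + '42';
  if SplitLine(Line, Term, Weight) then
    WriteLn('word bytes: ', Length(Term), ', weight: ', Weight);
end.
```

Note that Length(Term) counts bytes here, not letters: the four-letter word occupies five bytes in UTF-8.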
The JCL comes with various functions and classes for dealing with Unicode, if you find you really need to use them.
If most of your input is UTF-8, it might be worthwhile to change your codepage on startup from the "default" to utf8 (codepage 65001). This will make all ansistring->widestring conversions effectively become a lossless utf-8->utf-16.
With D7, you will need a set of so called "unicode" components, components that base themselves on the winapi -W functions. Delphi's own components only do this with the watershed D2009 release that switches the default string type to UTF-16.
If you want to heavily invest in Unicode support, upgrading might be a smart thing to do
WideString is a UTF-16 implementation (a COM BSTR-compatible one), so it can't store UTF-8 strings; if you assign an 8-bit string to it, the string will be converted to UTF-16. But unless you explicitly use the proper conversion function, Delphi will interpret the 8-bit string using the current codepage.
A UTF-8 string can be stored in a Delphi AnsiString (the default string type in Delphi 7), but the string manipulation functions are designed for ANSI codepages, not UTF-8. The difference is that UTF-8 is a multi-byte encoding: apart from the first 128 (ASCII) characters, more than one byte is needed to encode a given "character", while many ANSI codepages (especially those for European languages) need only one byte per character and can therefore encode at most 256 "characters" (whereas UTF-8 can encode the whole Unicode set).
If you're just looking for the tab character AFAIK you could use simply an AnsiString, but you have to ensure that any byte above $80 you may need to look for is not part of a multibyte sequence. If you have more complex processing needs, it may be easier to find libraries working on UTF-16 strings than UTF-8. As Rob Kennedy said, JCL is a good starting point as a free library implementing UTF string manipulation.
You could simply read the file as-is into a normal TStringList via its LoadFrom...() methods, then loop through the list as needed. If loading the entire file into memory at one time is not an option, then you can open the file using a TFileStream and then use the TStreamReader.ReadLine() method to read the stream line-by-line.
If you need to decode a given UTF-8 sequence to UTF-16 for processing, then I would suggest using the Win32 API MultiByteToWideChar() function directly, only because the RTL's UTF8Decode() function has a broken UTF-8 implementation in older Delphi versions (not sure about D7, but it definitely does in D6).
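A minimal sketch of calling MultiByteToWideChar() directly, using the usual two-call pattern (query the size, then convert). Utf8ToWideString is a hypothetical helper name, and the code page constant is declared locally in case the Windows unit of an older Delphi does not define CP_UTF8.

```pascal
program Utf8ToWide;
{$APPTYPE CONSOLE}

uses
  Windows;

const
  CodePageUtf8 = 65001; // same value as the Win32 CP_UTF8 constant

// Decodes raw UTF-8 bytes to UTF-16; returns '' on failure.
function Utf8ToWideString(const S: AnsiString): WideString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // First call: ask how many UTF-16 code units are needed.
  Len := MultiByteToWideChar(CodePageUtf8, 0, PAnsiChar(S), Length(S), nil, 0);
  if Len <= 0 then
    Exit;
  SetLength(Result, Len);
  // Second call: convert into the allocated buffer.
  MultiByteToWideChar(CodePageUtf8, 0, PAnsiChar(S), Length(S),
    PWideChar(Result), Len);
end;

var
  U: AnsiString;
  W: WideString;
begin
  U := 'caf';
  U := U + AnsiChar($C3) + AnsiChar($A9); // UTF-8 bytes of U+00E9
  W := Utf8ToWideString(U);
  WriteLn(Length(U), ' UTF-8 bytes -> ', Length(W), ' UTF-16 code units');
end.
```

Passing the explicit byte length (instead of -1) keeps the terminating null out of the result, so Length(W) reflects only the decoded text.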
The nice thing about either loading approach is that they are both encoding-aware in D2009 and later, which means that if you ever upgrade, you can make a couple of very small code changes to tell the RTL that the data is UTF-8, and it will decode it to UTF-16 for you automatically, and then the rest of your processing code can remain the same (assuming you are not doing anything that is Ansi-specific).

Delphi WideString and Delphi 2009+

I am writing a class that will save wide strings to a binary file. I'm using Delphi 2005 for this but the app will later be ported to Delphi 2010. I'm feeling very unsure here, can someone confirm that:
A Delphi 2005 WideString is exactly the same type as a Delphi 2010 String
A Delphi 2005 WideString char as well as a Delphi 2010 String char is guaranteed to always be 2 bytes in size.
With all the Unicode formats out there I don't want to be hit with one of the chars in my string suddenly being 3 bytes wide or something like that.
Edit: Found this: "I indeed said UnicodeString, not WideString. WideString still exists, and is unchanged. WideString is allocated by the Windows memory manager, and should be used for interacting with COM objects. WideString maps directly to the BSTR type in COM." at http://www.micro-isv.asia/2008/08/get-ready-for-delphi-2009-and-unicode/
Now I'm even more confused. So a Delphi 2010 WideString is not the same as a Delphi 2005 WideString? Should I use UnicodeString instead?
Edit 2: There's no UnicodeString type in Delphi 2005. FML.
For your first question: WideString is not exactly the same type as D2010's string. WideString is the same COM BSTR type that it has always been. It's managed by Windows, with no reference counting, so it makes a copy of the whole BSTR every time you pass it somewhere.
UnicodeString, which is the default string type in D2009 and on, is basically a UTF-16 version of the AnsiString we all know and love. It's got a reference count and is managed by the Delphi compiler.
For the second, the default char type is now WideChar, which are the same chars that have always been used in WideString. It's a UTF-16 encoding, 2 bytes per char. If you save WideString data to a file, you can load it into a UnicodeString without trouble. The difference between the two types has to do with memory management, not the data format.
As others mentioned, string (actually UnicodeString) data type in Delphi 2009 and above is not equivalent to WideString data type in previous versions, but the data content format is the same. Both of them save the string in UTF-16. So if you save a text using WideString in earlier versions of Delphi, you should be able to read it correctly using string data type in the recent versions of Delphi (2009 and above).
You should note that the performance of UnicodeString is far superior to that of WideString. So if you are going to use the same source code in both Delphi 2005 and Delphi 2010, I suggest you use a string type alias with conditional compilation in your code, so that you can have the best of both worlds:
type
  {$IFDEF Unicode}
  MyStringType = UnicodeString;
  {$ELSE}
  MyStringType = WideString;
  {$ENDIF}
Now you can use MyStringType as your string type in your source code. If the compiler is Unicode (Delphi 2009 and above), then your string type would be an alias of UnicodeString type which is introduced in Delphi 2009 to hold Unicode strings. If the compiler is not unicode (e.g. Delphi 2005) then your string type would be an alias for the old WideString data type. And since they both are UTF-16, data saved by any of the versions should be read by the other one correctly.
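As an illustration of why the on-disk data is interchangeable, here is a sketch that writes a length-prefixed UTF-16 string to a stream and reads it back. WriteWide, ReadWide and RoundTrip are hypothetical names, and the 4-byte length prefix is an assumption made for this example, not a format from the question.

```pascal
program SaveWide;
{$APPTYPE CONSOLE}

uses
  Classes;

type
{$IFDEF UNICODE}
  MyStringType = UnicodeString;
{$ELSE}
  MyStringType = WideString;
{$ENDIF}

// Writes a 4-byte count of UTF-16 code units, then the raw data.
procedure WriteWide(Stream: TStream; const S: MyStringType);
var
  Len: Integer;
begin
  Len := Length(S);
  Stream.WriteBuffer(Len, SizeOf(Len));
  if Len > 0 then
    Stream.WriteBuffer(S[1], Len * SizeOf(WideChar)); // 2 bytes per unit
end;

// Reads back what WriteWide stored.
function ReadWide(Stream: TStream): MyStringType;
var
  Len: Integer;
begin
  Stream.ReadBuffer(Len, SizeOf(Len));
  SetLength(Result, Len);
  if Len > 0 then
    Stream.ReadBuffer(Result[1], Len * SizeOf(WideChar));
end;

// Writes S to a memory stream and reads it back.
function RoundTrip(const S: MyStringType): MyStringType;
var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    WriteWide(MS, S);
    MS.Position := 0;
    Result := ReadWide(MS);
  finally
    MS.Free;
  end;
end;

begin
  if RoundTrip('hello') = 'hello' then
    WriteLn('round trip ok');
end.
```

Because the byte count is computed from SizeOf(WideChar) rather than SizeOf(Char), the same routines produce the same bytes whether MyStringType resolves to WideString or UnicodeString.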
"A Delphi 2005 WideString is exactly the same type as a Delphi 2010 String"
That is not true. For example, a Delphi 2010 string has a hidden internal codepage field, but that probably does not matter for you.
"A Delphi 2005 WideString char as well as a Delphi 2010 String char is guaranteed to always be 2 bytes in size."
That is true. In Delphi 2010 SizeOf(Char) = 2 (Char = WideChar).
There cannot be different codepage for unicode strings - codepage field was introduced to create a common binary format for both Ansi strings (that need codepage field) and Unicode string (that don't need it).
If you save WideString data to stream in Delphi 2005 and load the same data to string in Delphi 2010 all should work OK.
WideString = BSTR and that is not changed between Delphi 2005 and 2010
UnicodeString = WideString in Delphi 2005 (if UnicodeString type exists in Delphi 2005 - I don't know)
UnicodeString = string in Delphi 2009 and above.
@Marco - Ansi and Unicode strings in Delphi 2009+ have a common binary format (12-byte header).
UnicodeString codepage CP_UTF16 = 1200;
The rule is simple:
If you want to work with unicode strings inside your module only - use UnicodeString type (*).
If you want to communicate with COM or with other cross-module purposes - use WideString type.
You see, WideString is a special type, since it's not a native Delphi type. It is an alias/wrapper for BSTR, a system string type intended for use with COM or cross-module communication. Being Unicode is just a side effect.
On the other hand, AnsiString and UnicodeString - are native Delphi types, which have no analog in other languages. String is just an alias for either AnsiString or UnicodeString.
So, if you need to pass string to some other code - use WideString, otherwise - use either AnsiString or UnicodeString. Simple.
P.S.
(*) For old Delphi - just place
{$IFNDEF Unicode}
type
  UnicodeString = WideString;
{$ENDIF}
somewhere in your code. This fix will allow you to write the same code for any Delphi version.
While a D2010 char is always and exactly 2 bytes, the same character folding and combining issues are present in UTF-16 as in UTF-8. You don't see this with narrow strings because they're codepage based, but with Unicode strings it's possible (and in some situations common) to have meaningful but non-visible characters. Examples include the byte order mark (BOM) at the start of a Unicode file or stream, left-to-right/right-to-left indicator characters, and a huge range of combining accents. This mostly affects questions of "how many pixels wide will this string be on the screen" and "how many letters are in this string" (as distinct from "how many chars are in this string"), but it also means that you can't randomly chop characters out of a string and assume the result is printable. Operations like "remove the last letter from this word" become non-trivial and depend on the language in use.
The question about "one of the chars in my string suddenly being 3 bytes long" reflects a little confusion about how UTF works. In UTF-8 a single code point takes one to four bytes, so a three-byte character is perfectly normal there; on top of that, one printable character may be built from several code points, say a letter plus two combining accents. In UTF-16 a code point is always 2 or 4 bytes and in UTF-32 always 4 bytes, so you will never get a 3-byte character, but a printable character represented using three code points takes at least 6 bytes in UTF-16 and 12 bytes in UTF-32. Which brings us to normalisation (or not).
But provided you are only dealing with the strings as whole things, it's all very simple - you just take the string, write it to a file, then read it back in. You don't have to worry about the fine print of string display and manipulation, that's all handled by the operating system and libraries. Strings.LoadFromFile(name) and Listbox.Items.Add(string) work exactly the same in D2010 as in D2007, the unicode stuff is all transparent to you as a programmer.
I am writing a class that will save wide strings to a binary file.
When you write the class in D2005 you will be using Widestring
When you migrate to D2010 Widestring will still be valid and work properly.
Widestring in D2005 is the same as WideString in D2010.
The fact that String = UnicodeString in D2010 need not be a concern, since the compiler converts between UnicodeString and WideString automatically.
Your save routine taking (AString: String) needs only one conversion line on entering the procedure:
procedure SaveAStringToBIN_File(AString: String);
var
  wkstr: WideString;
begin
  {$IFDEF Unicode}
  wkstr := AString;             // UnicodeString -> WideString, both UTF-16
  {$ELSE}
  wkstr := UTF8Decode(AString); // assumes AString holds UTF-8 data
  {$ENDIF}
  ...
  // the rest is the same: save the WideString to a file stream,
  // writing the length (Word) of the string, then the data
end;

How do the new string types work in Delphi 2009/2010?

I have to convert a large legacy application to Delphi 2009 which uses strings, AnsiStrings, WideStrings and UTF8 data all over the place and I have a hard time to understand how the new string types work and how they should be used.
The application fully supported Unicode using TntUnicodeControls and there are 3rd party DLLs which require strings in specific encodings, mostly UTF8 and UTF16, making the conversion task not as trivial as one would suspect.
I especially have problems with the C DLL calls and choosing the right type.
I also get the impression that there are many implicit string conversions happening, because one of the DLL seems to always receive UTF-8 encoded strings, no matter how the Delphi string is encoded.
Can someone please provide a short overview about the new Delphi 2009 string types UnicodeString and RawByteString, perhaps some usage hints and possible pitfalls when converting a pre 2009 application?
See Delphi and Unicode, a white paper written by Marco Cantù, and I guess The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), written by Joel.
One pitfall is that the default Win32 API calls are now mapped to the W (wide string) versions instead of the A (ANSI) versions; for example, ShellExecute now maps to ShellExecuteW rather than ShellExecuteA. If your code is doing tricky pointer work that assumes the internal layout of AnsiString, it will break. A fallback is to substitute PChar with PAnsiChar, Char with AnsiChar, string with AnsiString, and append A to the end of the Win32 API call names for that portion of code. After the code actually compiles and runs normally, you can refactor it to use string (UnicodeString).
Watch my CodeRage 4 talk on "Using Unicode and Other Encodings in your Programs" this friday, or wait until the replay of it is available online.
I'm going to cover some encodings and explain about the string format.
The slides will be available shortly (I'll try to get them online today) and contain a lot of references to stuff you should read on the internet (but I must admit I forgot the link to Joel on Unicode that eed3si9n posted).
Will edit this answer today with the uploads and the links.
Edit:
If you have a small sample where you can show that your C/C++ DLL receives the strings UTF8 encoded, but thought they should be encoded otherwise, please post it (mail me; almost anything at the pluimers dot com gets to me, especially if you use my first name before the at sign).
Session materials can be downloaded now, including the "Using Unicode and Other Encodings in your Programs" session.
These are links from that session:
Read these:
Marco Cantu, Whitepaper “Delphi and Unicode”
Marco Cantu, Presentation “Delphi and Unicode”
Nick Hodges, Whitepaper “Delphi in a Unicode World”
Relevant on-line help topics:
What's New in Delphi and C++Builder 2009
String Types: Base: ShortString, AnsiString, WideString, UnicodeString
String Types: Unicode (including internal memory layouts of the string types)
String Types: Enabling for Unicode
String Types: RawByteString (AnsiString with CodePage $ffff)
String Types: UTF8String (AnsiString with CodePage 65001)
String <-> PChar conversions: PChar fundamentals
String <-> PChar conversions: Returning a PChar Local Variable
String <-> PChar conversions: Passing a Local Variable as a PChar
Hope this gets you going. If not, mail me and I'll try to extend the answer here.
Note that it does not only hit real string code. It also hits code where PChar is used to trawl through buffers or to interface with APIs.
E.g. the initialization code of headers that load a DLL dynamically (GetProcAddress/LoadLibrary).
It seems almost all my problems come from the automatic conversion on assignments to UTF8String.
I already had old code using UTF8String just to help me think which type of string a variable should contain.
When starting to port my application, I replaced AnsiString with UTF8String for the same reason, but the code depended on UTF8String being just an alias to (classic) AnsiString
Now with the automatic conversion that assumption is no longer true, which created many problems.
Be careful if you use UTF8String when porting from pre-2009 Delphi code!
Another thing to watch out for when passing strings between DLLs built with different versions of Delphi or C++Builder is that, starting with 2009, the StrRec part of AnsiStringBase gained two extra fields: codePage and elemSize. They are 2 bytes each (unsigned shorts in C++, Word in Delphi), so the size of StrRec is now 12 bytes instead of 8. This can cause invalid pointer exceptions during memory allocation and destruction, even when the data part of the string seems to transfer OK.

Is there a quick and dirty way to Cast PansiChar to Pchar in Delphi 2009

I have a very large number of apps to convert to Delphi 2009, and there are a number of external interfaces that return PAnsiChars. Does anyone have a quick and simple way to cast these back to PChars? There is a lot on string to PAnsiChar, but not much I can find on the other way around.
Delphi 2009 has added a new string type called RawByteString. It is defined as:
type
  RawByteString = type AnsiString($ffff);
If you need to save binary data coming in as PAnsiChar, you can use this. You should be able to use RawByteString the way you used AnsiString previously.
However, the recommended long term solution is still to convert your programs to Unicode.
There is no way to "cast" a PAnsiChar to a PChar. PChar is Unicode in Delphi 2009. Ansi data cannot simply be cast to Unicode, or vice versa. You have to perform an actual data conversion. If you have a PAnsiChar pointer to some data and want to put the data into a Unicode string, then assign the PAnsiChar data to an AnsiString first, and then assign the AnsiString to the Unicode string as needed. Likewise, if you need to pass a Unicode string to a PAnsiChar, you have to assign the data to an AnsiString first. There are articles on Embarcadero's and TeamB's blog sites that talk about Delphi 2009 migration issues.
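The two-step conversion described above might look like this. GetLegacyName is a hypothetical stand-in for an external routine that returns Ansi data; everything else is plain assignment and typecasting between string types.

```pascal
program AnsiToUnicode;
{$APPTYPE CONSOLE}

// Hypothetical stand-in for an external interface returning Ansi data.
function GetLegacyName: PAnsiChar;
begin
  Result := 'legacy';
end;

var
  A, Back: AnsiString;
  S: string;
begin
  A := GetLegacyName;     // copies the bytes into an AnsiString
  S := string(A);         // real conversion: Ansi -> UTF-16 in Delphi 2009+
  // 12 bytes in Delphi 2009 and later, 6 bytes in older versions
  WriteLn(S, ' (', Length(S) * SizeOf(Char), ' bytes)');

  // Going the other way also needs an AnsiString in between. Characters
  // that do not exist in the Ansi code page are lost in this direction.
  Back := AnsiString(S);
  WriteLn(PAnsiChar(Back)); // pointer stays valid only while Back is alive
end.
```

The explicit string(...) and AnsiString(...) casts are optional for the compiler but silence the implicit-conversion warnings and make each conversion point visible in the source.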
