I am writing a class that will save wide strings to a binary file. I'm using Delphi 2005 for this but the app will later be ported to Delphi 2010. I'm feeling very unsure here, can someone confirm that:
A Delphi 2005 WideString is exactly the same type as a Delphi 2010 String
A Delphi 2005 WideString char as well as a Delphi 2010 String char is guaranteed to always be 2 bytes in size.
With all the Unicode formats out there I don't want to be hit with one of the chars in my string suddenly being 3 bytes wide or something like that.
Edit: Found this: "I indeed said UnicodeString, not WideString. WideString still exists, and is unchanged. WideString is allocated by the Windows memory manager, and should be used for interacting with COM objects. WideString maps directly to the BSTR type in COM." at http://www.micro-isv.asia/2008/08/get-ready-for-delphi-2009-and-unicode/
Now I'm even more confused. So a Delphi 2010 WideString is not the same as a Delphi 2005 WideString? Should I use UnicodeString instead?
Edit 2: There's no UnicodeString type in Delphi 2005. FML.
For your first question: WideString is not exactly the same type as D2010's string. WideString is the same COM BSTR type that it has always been. It's managed by Windows, with no reference counting, so it makes a copy of the whole BSTR every time you pass it somewhere.
UnicodeString, which is the default string type in D2009 and on, is basically a UTF-16 version of the AnsiString we all know and love. It's got a reference count and is managed by the Delphi compiler.
For the second, the default Char type is now WideChar, the same character type that WideString has always used. It's UTF-16: 2 bytes per Char (per code unit). If you save WideString data to a file, you can load it into a UnicodeString without trouble. The difference between the two types has to do with memory management, not the data format.
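A minimal sketch of that point, assuming a Delphi version where both types are available (D2009 and later):

var
  W: WideString;     // COM BSTR, managed by Windows
  U: UnicodeString;  // native, reference-counted Delphi string
begin
  W := 'Hëllo';      // UTF-16 code units in a BSTR
  U := W;            // same UTF-16 data, copied into a UnicodeString
  Assert(Length(U) = Length(W));  // same number of 2-byte elements
  Assert(U[1] = W[1]);            // identical code units
end;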
As others mentioned, string (actually UnicodeString) data type in Delphi 2009 and above is not equivalent to WideString data type in previous versions, but the data content format is the same. Both of them save the string in UTF-16. So if you save a text using WideString in earlier versions of Delphi, you should be able to read it correctly using string data type in the recent versions of Delphi (2009 and above).
You should take note that the performance of UnicodeString is far superior to that of WideString. So if you are going to use the same source code in both Delphi 2005 and Delphi 2010, I suggest you use a string type alias with conditional compilation in your code, so that you can have the best of both worlds:
type
{$IFDEF Unicode}
  MyStringType = UnicodeString;
{$ELSE}
  MyStringType = WideString;
{$ENDIF}
Now you can use MyStringType as your string type in your source code. If the compiler is Unicode-enabled (Delphi 2009 and above), your string type will be an alias of the UnicodeString type introduced in Delphi 2009 to hold Unicode strings. If the compiler is not Unicode-enabled (e.g. Delphi 2005), your string type will be an alias for the old WideString type. And since both are UTF-16, data saved by either version can be read correctly by the other.
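For example, a stream-writing routine built on that alias might look like this; a minimal sketch only, where SaveToStream and the 32-bit length prefix are my own choices, not from the question:

uses
  Classes;

procedure SaveToStream(Stream: TStream; const S: MyStringType);
var
  Len: Integer;
begin
  Len := Length(S);
  Stream.WriteBuffer(Len, SizeOf(Len));
  if Len > 0 then
    Stream.WriteBuffer(S[1], Len * SizeOf(WideChar)); // 2 bytes per element in both types
end;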
A Delphi 2005 WideString is exactly the same type as a Delphi 2010 String
That is not true. For example, a Delphi 2010 string has a hidden internal code-page field, but that probably does not matter for you.
A Delphi 2005 WideString char as well as a Delphi 2010 String char is guaranteed to always be 2 bytes in size.
That is true. In Delphi 2010 SizeOf(Char) = 2 (Char = WideChar).
There cannot be a different code page for Unicode strings; the code-page field was introduced to create a common binary format for both Ansi strings (which need a code page) and Unicode strings (which don't).
If you save WideString data to stream in Delphi 2005 and load the same data to string in Delphi 2010 all should work OK.
WideString = BSTR and that is not changed between Delphi 2005 and 2010
UnicodeString = WideString in Delphi 2005 (if UnicodeString type exists in Delphi 2005 - I don't know)
UnicodeString = string in Delphi 2009 and above.
@Marco: Ansi and Unicode strings in Delphi 2009+ share a common binary format (a 12-byte header).
The UnicodeString code page is CP_UTF16 = 1200.
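If you want to poke at that hidden header yourself, the D2009+ RTL has StringCodePage and StringElementSize helpers; this is only an illustrative sketch, and you should verify that the UnicodeString overloads of these helpers exist in your exact version:

var
  S: string; // UnicodeString in Delphi 2009+
begin
  S := 'test';
  Writeln(StringCodePage(S));    // expected: 1200 (CP_UTF16)
  Writeln(StringElementSize(S)); // expected: 2
  Writeln(SizeOf(Char));         // 2
end;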
The rule is simple:
If you want to work with unicode strings inside your module only - use UnicodeString type (*).
If you want to communicate with COM or with other cross-module purposes - use WideString type.
You see, WideString is a special type, since it's not a native Delphi type. It is an alias/wrapper for BSTR, a system string type intended for use with COM or cross-module communication. Being Unicode is just a side effect.
On the other hand, AnsiString and UnicodeString are native Delphi types, which have no analog in other languages. string is just an alias for either AnsiString or UnicodeString.
So, if you need to pass a string to some other module, use WideString; otherwise use either AnsiString or UnicodeString. Simple.
P.S.
(*) For old Delphi - just place
{$IFNDEF Unicode}
type
  UnicodeString = WideString;
{$ENDIF}
somewhere in your code. This fix will allow you to write the same code for any Delphi version.
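As a rough illustration of the rule (the routine names below are made up for the example):

// Internal work: native, reference-counted string type
function BuildCaption(const Prefix: UnicodeString): UnicodeString;
begin
  Result := Prefix + ' - MyApp';
end;

// Boundary crossing (COM, or a routine exported from a DLL): WideString/BSTR
function GetCaption: WideString; stdcall;
begin
  Result := BuildCaption('Document1'); // implicit UnicodeString -> WideString conversion
end;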
While a D2010 Char is always and exactly 2 bytes, the same character folding and combining issues are present in UTF-16 as in UTF-8. You don't see this with narrow strings because they're codepage based, but with Unicode strings it's possible (and in some situations common) to have characters that affect rendering but are not themselves visible. Examples include the byte order mark (BOM) at the start of a Unicode file or stream, left-to-right/right-to-left indicator characters, and a huge range of combining accents. This mostly affects questions of "how many pixels wide will this string be on the screen" and "how many letters are in this string" (as distinct from "how many chars are in this string"), but it also means that you can't randomly chop characters out of a string and assume they're printable. Operations like "remove the last letter from this word" become non-trivial and depend on the language in use.
The question about "one of the chars in my string suddenly being 3 bytes long" reflects a little confusion about how UTF works. It's possible (and valid) for several code points in a UTF-8 string to represent one printable character, say a letter plus two combining accents, with each code point being valid on its own. You will not get a single character in UTF-16 or UTF-32 being 3 bytes long, but such a combined character might be 6 bytes (or 12 bytes) long if it's represented using three code points in UTF-16 (or UTF-32). Which brings us to normalisation (or not).
But provided you are only dealing with the strings as whole things, it's all very simple: you just take the string, write it to a file, then read it back in. You don't have to worry about the fine print of string display and manipulation; that's all handled by the operating system and libraries. Strings.LoadFromFile(name) and Listbox.Items.Add(string) work exactly the same in D2010 as in D2007; the Unicode handling is all transparent to you as a programmer.
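For instance, a round trip like the sketch below reads the same at the source level in D2007 and D2010 (the file name and text are arbitrary; in D2009+ you can optionally pass a TEncoding to SaveToFile if you need to force a particular file encoding):

uses
  Classes;

procedure RoundTrip(const FileName: string);
var
  L: TStringList;
begin
  L := TStringList.Create;
  try
    L.Add('Grüße');           // whatever text you have
    L.SaveToFile(FileName);   // encoding details handled by the RTL
    L.LoadFromFile(FileName); // reads it back the same way in D2007 and D2010
  finally
    L.Free;
  end;
end;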
I am writing a class that will save wide strings to a binary file.
When you write the class in D2005 you will be using WideString.
When you migrate to D2010, WideString will still be valid and work properly.
WideString in D2005 is the same as WideString in D2010.
The fact that string is a UTF-16 type (UnicodeString) in D2010 need not concern you, since the compiler handles the conversions between it and WideString easily.
Your save routine, taking (AString: string), needs only one line on entering the proc:
procedure SaveAStringToBIN_File(AString: string);
var
  wkstr: WideString;
begin
  {$IFDEF Unicode}
  wkstr := AString;
  {$ELSE}
  wkstr := UTF8Decode(AString);
  {$ENDIF}
  // ... the rest is the same: save the WideString to a file stream,
  // writing the length (Word) of the string, then the data
end;
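Filling in the elided part, a complete version might look like the sketch below. The extra FileName parameter and the use of TFileStream are my additions for illustration; the length-prefix layout follows the description above.

uses
  Classes;

procedure SaveAStringToBIN_File(const AString: string; const FileName: string);
var
  wkstr: WideString;
  fs: TFileStream;
  len: Word;
begin
  {$IFDEF Unicode}
  wkstr := AString;               // D2009+: string is already UTF-16
  {$ELSE}
  wkstr := UTF8Decode(AString);   // pre-2009: assumes the AnsiString holds UTF-8
  {$ENDIF}
  fs := TFileStream.Create(FileName, fmCreate);
  try
    len := Length(wkstr);
    fs.WriteBuffer(len, SizeOf(len));                      // length prefix (Word) ...
    if len > 0 then
      fs.WriteBuffer(wkstr[1], len * SizeOf(WideChar));    // ... then the UTF-16 data
  finally
    fs.Free;
  end;
end;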
I am migrating a Delphi 7 application to Delphi XE4. In Delphi 7, some variables are declared like this:
var abc : string[80];
While migrating this code, I am changing the above declaration to:
var abc : string;
As per my understanding, string[80] is an AnsiString and string is Unicode. So, is this the right way to do it?
I am following the below link from stackoverflow:
Convert Char into AnsiChar or WideChar (Delphi)
Indeed you are correct:
string[N] declarations are ShortString types.
They hold a maximum of 255 characters (depending on the N), and the encoding is undetermined (i.e. up to you).
string is a regular string, which was single-byte (now called AnsiString) up until Delphi 2007, and multi-byte (now called UnicodeString) as of Delphi 2009.
Until Delphi 2007, the encoding is undetermined. As of Delphi 2009, both AnsiString and UnicodeString can have an encoding.
More background information can be found in these two Delphi documentation topics:
ShortString
AnsiString
UnicodeString
Delphi String Types
Unicode in RAD Studio
Answering your question about how you should replace ShortString:
It totally depends on how you used your ShortString in Delphi 7. There are multiple ways to go:
string
array of Byte
AnsiString
That all depends on the kind of data you are storing, so that is the first thing you need to find out.
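For example, if the Delphi 7 code used string[80] as a fixed-size field in a record written to disk, the three options might look like this (a sketch with made-up record names, not a recommendation for your specific data):

type
  // Option 1: keep the exact on-disk layout (81 bytes: length byte + data)
  TLegacyRec = record
    Name: string[80];          // still a ShortString in XE4
  end;

  // Option 2: fixed-size raw bytes, encoding handled by your own code
  TRawRec = record
    Name: array[0..79] of Byte;
  end;

  // Option 3: in-memory use only, no fixed layout: plain (Unicode) string
  TModernRec = record
    Name: string;
  end;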
I'm experienced with Delphi but new to Unicode.
The embedded Delphi XE2 documentation about UnicodeString (System.UnicodeString) says:
"Delphi utilizes several string types. UnicodeString can contain both Unicode and ANSI strings.
Support for this type includes the following features:
Strings as large as available memory.
Efficient use of memory through shared references.
Routines and operators that evaluate strings based on the current locale.
Despite its name, UnicodeString can represent both ANSI character set strings and Unicode strings. "
I don't understand what is meant by the word "can." ("It can contain both Unicode and ANSI." ... "Despite its name, UnicodeString can represent both ANSI character set strings and Unicode strings.")
My question: what determines if a variable of type UnicodeString represents a Unicode string or an ANSI string?
The documentation is outdated. UnicodeString in XE2 can only contain Unicode data.
In CB2009 and D2009, when UnicodeString was first introduced, there were cases, mostly in C++<->Delphi interactions, where the RTL allowed Ansi data to be stored in a UnicodeString and Unicode data to be stored in an AnsiString to help users migrate legacy Ansi code to Unicode. UnicodeString and AnsiString do have a unified internal structure, and the Delphi compiler had a {$STRINGCHECKS} directive that would detect any discrepancies and perform silent data conversions when needed. Although it did work, it also had subtle side effects if you were not careful with it.
By the time XE was released, Embarcadero figured users had had enough time to migrate, so the {$STRINGCHECKS} directive and the supporting RTL functionality were removed. UnicodeString and AnsiString still have a unified internal structure, so it is technically possible to store Ansi data in a UnicodeString and Unicode data in an AnsiString, but you would have to manipulate memory directly to do it; the compiler/RTL will not do it in "normal" code and no longer performs silent conversions when discrepancies exist, so data corruption and/or crashes can occur if you are not careful.
Is there any function in freepascal to show the Unicode symbol by its code (e.g. U+1D15E)? Unfortunately Chr() works only with ANSI symbols (with codes less than 127).
I want to use symbols from a custom symbolic font, and it is very inconvenient to put them into the source code directly (they are shown in Lazarus as ? or something else because they are absent from system fonts).
Take a look at this page. I assume that Free Pascal uses either UTF-16, in which case the symbol becomes a surrogate pair of two WideChars (see table), or UTF-8, in which case it becomes a sequence of byte values (see table again).
UTF-8:
const
HalfNoteString = UTF8String(#$F0#$9D#$85#$9E);
UTF-16:
const
HalfNoteString = UnicodeString(#$D834#$DD5E);
The names of the string types may differ, as I don't know FreePascal very well. Perhaps AnsiString and WideString.
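If you would rather compute the UTF-16 form for an arbitrary code point than hard-code the surrogate pair, a helper along these lines should work (my own sketch, same arithmetic as the tables on that page; adjust the type names if your FPC version uses different ones):

// Builds the UTF-16 encoding of a Unicode code point.
function CodePointToUTF16(CP: Cardinal): UnicodeString;
begin
  if CP < $10000 then
    Result := WideChar(CP)
  else
  begin
    Dec(CP, $10000);
    SetLength(Result, 2);
    Result[1] := WideChar($D800 or (CP shr 10));   // high surrogate
    Result[2] := WideChar($DC00 or (CP and $3FF)); // low surrogate
  end;
end;

// Usage: HalfNoteString := CodePointToUTF16($1D15E);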
I have never used Free Pascal, but if I were you, I'd try
var
  s: char;
begin
  s := char($222b); // Just cast a word
or, if the compiler is really stubborn,
var
  s: char;
begin
  PWord(@s)^ := $222b; // Forcibly write a word
Current Unicode status of FPC, to the best of my knowledge:
The codepage of literals can be set with $codepage http://www.freepascal.org/docs-html/prog/progsu81.html
FPC 2.4.x+ does have UnicodeString (it is more or less the Kylix WideString), but with only basic routine support (Pos and Copy, not routines like Format), and the string "record" lacks the codepage field.
Lazarus widgets expect UTF-8 in normal AnsiStrings (D7..D2007-style AnsiStrings without codepage data), and programmers must manually insert conversions where necessary. So on Windows the widgets ARE mostly using Unicode (-W) calls, but take AnsiStrings with UTF-8 in them.
FPC doesn't follow the UTF-8-in-AnsiString scheme, so for some string-accepting routines in SysUtils there are special routines in Lazarus that assume UTF-8 and call the -W variants.
The FPC AnsiString uses the system default 1-byte encoding: ANSI on Windows, UTF-8 on most other platforms.
Trunk (2.7.1) provides support for the new D2009+ AnsiString (with codepages).
There has been no discussion yet about how to deal with the default string type (e.g. will string be UTF8String on *nix and UnicodeString on Windows, or UnicodeString or UTF8String everywhere?).
Other UnicodeString-related enhancements (like encoding parameters to TStringList.SaveToFile) are not implemented. Likewise for the pseudo-objects (like TCharacter, which is AFAIK mostly static).
Update: 2.7.1 has a variable-encoding AnsiString type, and Lazarus has been fixed to keep working. Nothing really takes advantage of it yet though; e.g. most of the RTL still uses -A calls, and the prototypes of SysUtils and System procedures that take strings haven't been changed to RawByteString yet.
I assume the problem is to convert from UCS4 encoding (which is actually the Unicode code point number) to UTF-16.
In Delphi, you can use the UCS4StringToUnicodeString function.
Warning: Be careful with UCS4String type. It is actually a zero-terminated dynamic array, not a string (that means it is zero-based).
var
  S1: UCS4String;
  S: string;
begin
  SetLength(S1, 2);
  S1[0] := UCS4Char($1D15E);
  S1[1] := UCS4Char(0);
  S := UCS4StringToUnicodeString(S1);
  ShowMessage(Format('%d, %x, %x', [Length(S), Ord(S[1]), Ord(S[2])]));
end;
I have a case here: I am going to migrate to Delphi XE from Delphi 7, and to my surprise many components will have problems due to AnsiStrings; in Delphi XE the strings look like Japanese/Chinese characters. The unit I use is a PCSC connector that seems to be discontinued/abandoned by the original developer.
Basically, what I want is an easy way to read the strings again with as little modification to the original code as possible.
Also, any good tutorials out there on how to make AnsiString-based components ready for Delphi 2009 and newer would help me too.
@Plastkort, Delphi >= 2009 is perfectly capable of reading and handling AnsiString. You only get the meaningless characters if you somehow hard-cast ANSI data to Unicode, possibly by hard-casting a pointer to PChar.
If I had to convert someone else's code to Unicode, I'd start by searching for PChar, Char and string, specifically looking at places where other types are hard-cast to those types. That's because those types changed meaning: in non-Unicode Delphi, Char was 1 byte; now it's 2 bytes.
The conversion itself isn't necessarily difficult; you just need to understand the problem you're facing and have a good understanding of the code you're converting. And it's a lot of work, especially when dealing with code that did "smart things with strings".
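A typical example of the kind of spot that search turns up (hypothetical code, just to show the pattern):

var
  Buf: array[0..255] of AnsiChar;  // filled with Ansi/ASCII data by the external DLL
  A: AnsiString;
  S: string;
begin
  // ... external API fills Buf ...
  // Pre-2009 this 'worked' because PChar was PAnsiChar:
  S := PChar(@Buf);      // in D2009+ this misreads the bytes as UTF-16 -> garbage characters
  // Unicode-safe version: convert explicitly via AnsiString:
  A := PAnsiChar(@Buf);  // null-terminated Ansi bytes -> AnsiString
  S := A;                // AnsiString -> UnicodeString conversion
end;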
The big change (in Delphi 2009, I think) is to the aliased string types: string, Char, PChar, etc., which prior to 2009 were ANSI types, are now all wide (UTF-16) types.
so
Type   | Delphi 6, 7 | Delphi 2009, XE
-------+-------------+----------------
string | AnsiString  | UnicodeString
char   | AnsiChar    | WideChar
pchar  | PAnsiChar   | PWideChar
The simplest way to migrate from Ansi Delphi to Unicode Delphi is a global search and replace for the aliased string types, replacing them with the explicit 8-bit ANSI equivalents (i.e. replace all string with AnsiString, PChar with PAnsiChar, etc.).
That should get you 90% of the way.
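In practice that means declarations like these (illustrative only):

// Delphi 7 original:
var
  S: string;
  C: Char;
  P: PChar;

// After the global replace, for Delphi 2009+ (keeps the old single-byte behaviour):
var
  S: AnsiString;
  C: AnsiChar;
  P: PAnsiChar;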
Update
After reading the comments to my answer and the article referenced by @O.D,
I think the advice to bite the bullet and go with Unicode is the wiser option.
I have a very large number of apps to convert to Delphi 2009, and there are a number of external interfaces that return PAnsiChars. Does anyone have a quick and simple way to cast these back to PChars? There is a lot on string to PAnsiChar, but not much I can find on the other way around.
Delphi 2009 has added a new string type called RawByteString. It is defined as:
type
RawByteString = type AnsiString($ffff);
If you need to save binary data coming in through a PAnsiChar, you can use this. You should be able to use RawByteString the way you used AnsiString previously.
However, the recommended long term solution is still to convert your programs to Unicode.
There is no way to "cast" a PAnsiChar to a PChar. PChar is Unicode in Delphi 2009. Ansi data cannot simply be cast to Unicode, and vice versa; you have to perform an actual data conversion. If you have a PAnsiChar pointer to some data and want to put the data into a Unicode string, then assign the PAnsiChar data to an AnsiString first, and then assign the AnsiString to the Unicode string as needed. Likewise, if you need to pass a Unicode string to a PAnsiChar, you have to assign the data to an AnsiString first. There are articles on Embarcadero's and TeamB's blog sites that talk about Delphi 2009 migration issues.
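A minimal sketch of that two-step conversion (hypothetical routine name; it assumes the PAnsiChar data is in the system Ansi code page):

procedure HandleExternalData(P: PAnsiChar);
var
  A: AnsiString;
  U: string; // UnicodeString in Delphi 2009+
begin
  A := P;    // copies the null-terminated Ansi bytes into an AnsiString
  U := A;    // converts Ansi -> UTF-16 (the compiler may warn about the implicit conversion)
  // ... use U wherever a string/PChar is expected ...
end;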