Passing constants to TIniFile.ReadString - c++builder

Do I have to use L each time I pass a cosntant to ReadString?
s = MyIni->ReadString (L"ü", L"SomeEntry", "");
The Embarcadero example doesn't say so, but they also don't use non-ASCII characters in their example.

In C++Builder 2009 and later, the entire RTL is based on System::UnicodeString rather than System::AnsiString. Using the L prefix tells the compiler to create a wide string literal (based on wchar_t) instead of a narrow string literal (based on char).
While you don't HAVE to use the prefix L, you SHOULD use it, because it invokes less overhead at runtime. On Windows, constructing a UnicodeString from a wchar_t string is just a simple memory copy, whereas constructing it from a char string performs a data conversion (using the System::DefaultSystemCodePage variable as the codepage to use for the conversion). That conversion MAY be lossy for non-ASCII characters, depending on the encoding of the narrow string, which is subject to the charset that you save your source file in, as well as the charset used by the compiler when it parses the source file. So there is no guarantee that what you write in code in a narrow string literal is what you will actually get at runtime. Using a wide string literal avoids that ambiguity.
Note that UnicodeString is UTF-16 encoded on all platforms, but wchar_t is used for UTF-16 only on Windows, where wchar_t is a 16-bit data type. On other platforms, where wchar_t is usually a 32-bit data type used for UTF-32, char16_t is used instead. As such, if you need to write portable code, use the RTL's _D() macro instead of using the L prefix directly, eg:
s = MyIni->ReadString(_D("ü"), _D("SomeEntry"), _D(""));
_D() will map a string/character literal to the correct data type (wchar_t or char16_t, depending on the platform you are compiling for). So, when using string/character literals with the RTL, VCL, and FMX libraries, you should get in the habit of always using _D().

Related

What determines if a variable of type UnicodeString represents a Unicode string or an ANSI string?

I'm experienced with Delphi but new to Unicode.
The embedded Delphi XE2 documentation about UnicodeString (System.UnicodeString) says:
"Delphi utilizes several string types. UnicodeString can contain both Unicode and ANSI strings.
Support for this type includes the following features:
Strings as large as available memory.
Efficient use of memory through shared references.
Routines and operators that evaluate strings based on the current locale.
Despite its name, UnicodeString can represent both ANSI character set strings and Unicode strings. "
I don't understand what is meant by the word "can." ("It can contain both Unicode and ANSI." ... "Despite its name, UnicodeString can represent both ANSI character set strings and Unicode strings.")
My question: what determines if a variable of type UnicodeString represents a Unicode string or an ANSI string?
The documentation is outdated. UnicodeString in XE2 can only contain Unicode data.
In CB2009 and D2009, when UnicodeString was first introduced, there were cases, mostly in C++<->Delphi interactions, where the RTL allowed Ansi data to be stored in a UnicodeString and Unicode data to be stored in an AnsiString to help users migrate legacy Ansi code to Unicode. UnicodeString and AnsiString do have a unified internal structure, and the Delphi compiler had a {$STRINGCHECKS} directive that would detect any discrepancies and perform silent data conversions when needed. Although it did work, it also had subtle side effects if you were not careful with it.
By the time XE was released, Embarcadero figured users had had enough time to migrate, so the {$STRINGCHECKS} directive and supporting RTL functionality was removed. UnicodeString and AnsiString still have a unified internal structure, so it is technically possible to store Ansi data in a UnicodeString and Unicode in an AnsiString, but you would have to directly manipulate memory to do it manually, the compiler/RTL will not do it in "normal" code, and will not perform silent conversions anymore when discrepancies exist, so data corruption and/or crashes can occur if you are not careful.

Delphi String / Array of Strings

I have an old programm which was programmed in Delphi 1 (or 2, I'm not sure) and I want to build a 64-bit version of it (I use the Delphi XE2). Now the problem is that in the source code there are on the one hand strings and on the other arrays of strings (I guess to limit the string length).
Now there are a lot of errors while compiling because of incompatible types.
Above all there are procedures which should handle both types.
Is there an easy way to solve this problem (without changing every variable)?
Short answer
Search and replace : string => : ansistring
make sure you use length(astring) and setLength(astring) instead of manipulating string[0].
Long answer
Delphi 1 has only one type of string.
The old-skool ShortString that has a maximum length of 255 chars and a declared maximum length.
It looks and feels like an array of char, but it has a leading length byte.
var
ShortString: string[100];
In Delphi 2 longstrings (aka AnsiString) were introduced, these replace the shortstring. They do not have a fixed length, but are allocated dynamically instead and automatically grow and shrink as needed.
They are automatically created and destroyed.
var
Longstring: string; //AnsiString, can have any length up to 2GB.
In Delphi 2009 Unicode was introduced.
This changes the longstring because now each char no langer takes up 1 byte, but takes 2 bytes(*).
Additionally you can specify a character set to an AnsiString, whereas the new Unicode longstring uses UTF-16.
What you need to do depends on your needs:
If you just want the old code to work as before and you don't care about supporting all the multilingual stuff Unicode supports, you will need to replace all your string keywords with AnsiString (for all strings that are longstrings).
If you have Delphi 1 code, you can rename the string to ShortString.
I would recommend that you refactor the code to always use longstrings (read: AnsiString) though.
Delphi will automatically translate the UnicodeStrings that all return values of functions (Unicode string) are translated into AnsiStrings and visa versa, however this may include loss of data if your users enter symbols in a editbox that your AnsiString cannot store.
Also all that translation takes a bit of time (I doubt you will notice this though).
In Delphi 1 up to Delphi 2007 this problem did not exist, because controls did not allow Unicode characters to be entered.
(*) gross oversimplification

What is the difference between WideChar and AnsiChar?

I'm upgrading some ancient (from 2003) Delphi code to Delphi Architect XE and I'm running into a few problems. I am getting a number of errors where there are incompatible types. These errors don't happen in Delphi 6 so I must assume that this is because things have been upgraded.
I honestly don't know what the difference between PAnsiChar and PWideChar is, but Delphi sure knows the difference and won't let me compile. If I knew what the differences were maybe I could figure out which to use or how to fix this.
The short: prior to Delphi 2009 the native string type in Delphi used to be ANSI CHAR: Each char in every string was represented as an 8 bit char. Starting with Delphi 2009 Delphi's strings became UNICODE, using the UTF-16 notation: Now the basic Char uses 16 bits of data (2 bytes), and you probably don't need to know much about the Unicode code points that are represented as two consecutive 16 bits chars.
The 8 bit chars are called "Ansi Chars". An PAnsiChar is a pointer to 8 bit chars.
The 16 bit chars are called "Wide Chars". An PWideChar is a pointer to 16 bit chars.
Delphi knows the difference and does well if it doesn't allow you to mix the two!
More info
Here's a popular link on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
You can find some more information on migrating Delphi to Unicode here: New White Paper: Delphi Unicode Migration for Mere Mortals
You may also search SO for "Delphi Unicode migration".
A couple years ago, the default character type in Delphi was changed from AnsiChar (single-byte variable representing an ANSI character) to WideChar (two-byte variable representing a UTF16 character.) The char type is now an alias to WideChar instead of AnsiChar, the string type is now an alias to UnicodeString (a UTF-16 Unicode version of Delphi's traditional string type) instead of AnsiString, and the PChar type is now an alias to PWideChar instead of PAnsiChar.
The compiler can take care of a lot of the conversions itself, but there are a few issues:
If you're using string-pointer types, such as PChar, you need to make sure your pointer is pointing to the right type of data, and the compiler can't always verify this.
If you're passing strings to var parameters, the variable type needs to be exactly the same. This can be more complicated now that you've got two string types to deal with.
If you're using string as a convenient byte-array buffer for holding arbitrary data instead of a variable that holds text, that won't work as a UnicodeString. Make sure those are declared as RawByteString as a workaround.
Anyplace you're dealing with string byte lengths, for example when reading or writing to/from a TStream, make sure your code isn't assuming that a char is one byte long.
Take a look at Delphi Unicode Migration for Mere Mortals for some more tricks and advice on how to get this to work. It's not as hard as it sounds, but it's not trivial either. Good luck!

ReadLn working with WideString (utf-8 files)

I use delphi 7.
I need to read a utf-8 file line by line, each line contain a word and its weight (a number)
So I need to read every next line, then divide a line by a separator (tab char) and save this in memory.
So,
1) is there a library to work with utf-8 files in Delphi (3-rd party maybe)
2) will functions operate ok with widestring? I use PosEx. So, if they won't, can you also give a link to 3-rd party library to work with widestrings?
If it is really UTF-8 that you are dealing with, then you should not need anything special as far as reading and processing them. You should be able to treat them as pchar or even as a normal Delphi 7 string. If you try to show the contents in some kind of message box, then you may need to do some conversions. For example, I don't believe the Delphi 7 message box method would display UTF-8 strings correctly if the string contained any byte values over 127 (0x7f). For something like that, you would need to convert to UTF-16 and call the Windows API MessageBoxW or something similar. Otherwise, though, UTF-8 strings can be treated in many situations the same as single byte ANSI strings.
I don't think UTF-8 is typically referred to as "widestring". I might be wrong, but I think that typically means UTF-16.
If your file is encoded as UTF-8, and the characters you're looking for are ASCII, then there's no need to use WideString at all. ASCII is a subset of UTF-8, and any ASCII character is guaranteed not to interfere with the special encoding used for other characters in UTF-8. The number characters 0 through 9 and the tab character are all ASCII.
The JCL comes with various functions and classes for dealing with Unicode, if you find you really need to use them.
If most of your input is UTF-8, it might be worthwhile to change your codepage on startup from the "default" to utf8 (codepage 65001). This will make all ansistring->widestring conversions effectively become a lossless utf-8->utf-16.
With D7, you will need a set of so called "unicode" components, components that base themselves on the winapi -W functions. Delphi's own components only do this with the watershed D2009 release that switches the default string type to UTF-16.
If you want to heavily invest in Unicode support, upgrading might be a smart thing to do
WideString is an UTF-16 implementation (a COM BSTR compatible one), it can't store UTF-8 strings, if you assign an 8 bit string it will be converted to UTF-16. But unless you use explicitly the proper conversion function, Delphi will interpret the 8 bit string using the current codepage.
An UTF-8 string can be stored in a Delphi AnsiString (the default string type in Delphi 7), but string manipulation functions are designed for ANSI codepages, not UTF-8. The difference is that UTF-8 is a multi byte character set. But the first 127 ANSI characters, more than one byte is needed to encode a given "character", while many ANSI codepages (especially those for European languages) only require one byte, encoding only 255 "characters" (while UTF-8 can encode the whole Unicode set).
If you're just looking for the tab character AFAIK you could use simply an AnsiString, but you have to ensure that any byte above $80 you may need to look for is not part of a multibyte sequence. If you have more complex processing needs, it may be easier to find libraries working on UTF-16 strings than UTF-8. As Rob Kennedy said, JCL is a good starting point as a free library implementing UTF string manipulation.
You could simply read the file as-is into a normal TStringList via its LoadFrom...() methods, then loop through the list as needed. If loading the entire file into memory at one time is not an option, then you can open the file using a TFileStream and then use the TStreamReader.ReadLine() method to read the stream line-by-line.
If you need to decode a given UTF-8 sequence to UTF-16 for processing, then I would suggest using the Win32 API MultiByteToWideChar() function directly, only because the RTL's UTF8Decode() function has a broken UTF-8 implementation in older Delphi versions (not sure about D7, but it definately does in D6).
The nice thing about either loading approach is that they are both encoding-aware in D2009 and later, which means that if you ever upgrade, you can make a couple of very small code changes to tell the RTL that the data is UTF-8, and it will decode it to UTF-16 for you automatically, and then the rest of your processing code can remain the same (assuming you are not doing anything that is Ansi-specific).

Is there some advantage in use resourcestring instead of a const string?

Would you tell me if there is some advantage (less sotorage space, increase speed, etc) in using:
resourcestring
MsgErrInvalidInputRange = 'Invalid Message Here!';
instead of
const
MsgErrInvalidInputRange : String = 'Invalid Message Here!';
The const option will be faster than resourcestring, because the later will call the Windows API to get the resource text.
You can make it faster by using some caching mechanism. This is what we do in our Enhanced Delphi RTL.
And it's a good idea to first load the resourcestring into a string, if you'll have to access many times to a resourcestring content.
The main point of resourcestring is to allow i18n (internationalization) of your program.
You've got the Translation Manager with some editions of the Delphi IDE. But it relies on external DLL.
You can use the gettext system, coming from the Linux world, from http://dxgettext.po.dk which relies on external .po files.
We included our own i18n mechanism in our framework, which translates and caches the resourcestring text, and relies on external .txt files (you can use UTF-8 or Unicode text files, from Delphi 6 up to XE). The caching make it quite as fast as the const usage. See http://synopse.info/fossil/finfo?name=SQLite3/SQLite3i18n.pas
There are other open source or commercial solutions around.
About size storage, resourcestring are stored as UC2 buffers. So resourcestring will use more memory than string up to Delphi 2009. Since Delphi 2009, all string are unicodestring i.e. UCS2, so you won't have much more storage size. In all cases, storage size of text is not the bigger size parameter for an application (bitmaps and code size have a much bigger effect to the final exe).
Resource strings are stored as STRINGTABLE entries in your exe resource, consts are stored as part of the fixed data segment. Since they're part of the resource section you can extract them and the DFMs, translate them, and store them in a resource module (data-only DLL). When a Delphi app starts, it looks for that DLL and will use the strings from it instead of the ones included in your EXE to load translations.
The Embarcadero docwiki covers using the Translation Manager, but a lot of other Delphi translation tools use resource strings too.
As others have mentioned, resourcestring strings will be included in a separate resource within your exe, and as such have advantages when you need to cater for multiple languages in the UI of your app.
As some have mentioned as well, const strings are included in the data section of your app.
Up to D2007
In Delphi versions up to D2007, const strings were stored as Ansi strings, requiring a single byte per character, whereas resource strings would be stored in UTF-16: the windows default encoding (though perhaps not for Win9x). IIRC D2007 and prior versions didn't support UTF-8 encoded unit files. So any strings coded in your sources would have to be supported by the ANSI code pages, and as such probably didn't go beyond the Unicode Basic Multilingual Plane. Which means that only the UCS-2 part of UTF-16 would be used and all strings could be stored in two bytes per character.
In short: up to D2007 const strings take a single byte per character, resource strings take two bytes per character.
D2009 and up
Delphi was unicode enabled in version D2009. Since then things are a little different. Resourcestring strings are still stored as UTF-16. No other option here as they are "managed" by Windows.
Consts strings however are a completely different story. Since D2009 Delphi stores multiple versions of each const string in your exe. Each version in a different encoding. Const can be stored as Ansi strings, UTF-8 strings and UTF-16 strings.
Which of these encodings is stored depends on the use of the const. By default UTf-16 will be used, as that is the default internal Delphi encoding. Assign the same const to a "normal" (UTF-16) string, as well as to an AnsiString variable, and the const will be stored in the exe both UTF-16 and Ansi encoded...
De-duping
By the looks of it (experimenting with D5 and D2009), Delphi "de-dupes" const strings, whereas it doesn't do this for resourcestring strings.
With resourcestring, the compiler places those strings as a stringtable resource in the executable, allowing anyone (say your translation team) to edit them with a resource editor without needing to recompile the application, or have access to the source code.
There's also a third options that is:
const
MsgErrInvalidInputRange = 'Invalid Message Here!';
The latter shoud be the more performant one because tell the compiler to not allocate space in the data segment, it could put the string in the code segment. Also remember that what coould be done with typed constants depends on the $WRITEABLECONST directive, although I do not know what the compiler exactly when it is on or off.

Resources