Wrong Unicode conversion: how to store accented characters in Delphi 2010 source code and handle character sets?

We are upgrading our project from Delphi 2006 to Delphi 2010. The old code was:
InputText: string;
InputText := SomeTEditComponent.Text;
...
for i := 1 to length(InputText) do
if InputText[i] in ['0'..'9', 'a'..'z', 'Ř' { and more special characters } ] then ...
The trouble is with accented letters: the comparison fails.
I tried switching the source file encoding from ANSI to UTF-8 and to UCS-2 LE, but without luck. Only a cast to AnsiChar works:
if CharInSet(AnsiChar(InputText[i]), ['0'..'9', 'a'..'z', 'Ř']) then
The funny thing is how Delphi handles these letters. Try this in Evaluate while debugging:
Ord('Ř') = Ord('Ø')
(yes, Delphi says True, on Windows 7 Czech)
The question is: how can I store and compare simple strings without forcing them to AnsiString? If this does not work, why should we use Unicode at all?
Thanks to all for the replies.
Right now we are using a simple CharInSet(AnsiChar(... in some parts.

The declaration of CharInSet is
function CharInSet(C: AnsiChar; const CharSet: TSysCharSet): Boolean; overload; inline;
function CharInSet(C: WideChar; const CharSet: TSysCharSet): Boolean; overload; inline;
while TSysCharSet is
TSysCharSet = set of AnsiChar;
Thus CharInSet can only compare against a set of AnsiChar. That is why your accented character is converted to AnsiChar.
There is no equivalent set of WideChar, as sets are limited to 256 elements. You have to implement some other means of checking the character.
Something like
const
specials: string = 'Ř';
if CharInSet(InputText[i], ['0'..'9', 'a'..'z']) or (Pos(InputText[I], specials) > 0) then
might be worth a try. You can add more characters to specials as needed.
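The idea above can be wrapped in a small function. This is only a sketch: the name IsAllowedChar and the contents of Specials are assumptions for illustration, and the special letters are written as 4-digit code points so the check does not depend on the source file encoding.

```pascal
program AllowedCharDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Hypothetical list of extra allowed letters (U+0158 and U+0159, the
  // upper- and lowercase R with caron). Extend as needed.
  Specials: string = #$0158#$0159;

// True for '0'..'9', 'a'..'z' and for any character listed in Specials.
function IsAllowedChar(C: Char): Boolean;
begin
  Result := CharInSet(C, ['0'..'9', 'a'..'z']) or (Pos(C, Specials) > 0);
end;

begin
  Writeln(IsAllowedChar('a'));     // plain ASCII letter: accepted
  Writeln(IsAllowedChar(#$0158));  // listed in Specials: accepted
  Writeln(IsAllowedChar(#$00D8));  // O with stroke: rejected
end.
```

The ASCII part still uses the fast set test; only characters outside it fall through to the linear Pos search.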

Don't rely on the encoding of your Delphi source code files.
It may get mangled by any non-Unicode tool that touches your source files (or even by buggy Unicode-aware tools).
The best way is to specify your characters as 4-digit hexadecimal Unicode code points.
const
MyEuroSign = #$20AC;
See also my blog posting about this.
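As a small illustration of that advice applied to the letters from the question, here is a sketch with the two characters declared as code point constants. The constant names are made up for this example. As far as I know, the 4-digit hexadecimal form is always parsed as a WideChar, whereas a 2-digit form such as #$D8 may be interpreted through the ANSI code page unless {$HIGHCHARUNICODE ON} is set.

```pascal
program CodePointConstDemo;
{$APPTYPE CONSOLE}

const
  // Hypothetical names; the values are the Unicode code points.
  CapitalRCaron  = #$0158; // R with caron
  CapitalOStroke = #$00D8; // O with stroke

var
  C: Char;
begin
  C := CapitalRCaron;
  Writeln(Ord(C));              // 344, i.e. $0158
  Writeln(C = CapitalOStroke);  // the two letters no longer compare equal
end.
```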

As mentioned by Uwe Raabe, the problem with Unicode chars is that there are a lot of them. If Delphi allowed you to create a "set of Char", it would be 8 KB in size! A "set of AnsiChar" is only 32 bytes in size, which is pretty manageable.
I'd like to offer some alternatives. The first is a sort of drop-in replacement for the CharInSet function, one that uses an array of Char to do the tests. Its only merit is that it can be called immediately from almost anywhere, but its benefits stop there. I'd avoid this if I could:
function UnicodeCharInSet(UniChr: Char; const CharArray: array of Char): Boolean;
var
  i: Integer;
begin
  // Linear scan: returns True as soon as a match is found.
  for i := 0 to High(CharArray) do
    if CharArray[i] = UniChr then
    begin
      Result := True;
      Exit;
    end;
  Result := False;
end;
The trouble with this function is that it doesn't handle the x in ['a'..'z'] syntax, and it's slow! The alternatives are faster, but aren't as close to a drop-in replacement as one might want. The first set of alternatives to investigate are the string functions from Microsoft. Among them are IsCharAlpha and IsCharAlphaNumeric, which might fix lots of issues. The problem with those is that all "alpha" chars are treated the same: you might end up accepting valid alpha chars from languages other than English or Czech. Alternatively, you can use the TCharacter class from Embarcadero. Its implementation is all in the Character.pas unit, and it looks effective; I have no idea how effective Microsoft's implementation is.
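For illustration, a minimal sketch of the TCharacter route, assuming the Delphi 2009/2010 Character unit (System.Character in later versions, where the same checks are also available as Char helper methods). The function name CountAlphaNumeric is made up for this example. Like the Windows functions, it accepts letters of any script.

```pascal
program CharacterUnitDemo;
{$APPTYPE CONSOLE}

uses
  Character; // assumption: Delphi 2009/2010 unit name

// Counts the characters of S that are letters or digits in any script.
function CountAlphaNumeric(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if TCharacter.IsLetterOrDigit(S[I]) then
      Inc(Result);
end;

begin
  // 'a', '1' and U+0158 count; the space and the hyphen do not.
  Writeln(CountAlphaNumeric('a1' + #$0158 + ' -'));
end.
```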
Another alternative is to write your own functions, using a "case" statement to get things to work. Here's an example:
function UnicodeCharIs(UniChr: Char): Boolean;
begin
  case UniChr of
    'ă': Result := True;
    'ş': Result := False;
    'Ă': Result := True;
    'Ş': Result := False;
  else
    Result := False;
  end;
end;
I inspected the assembler generated for this function. While Delphi has to implement a series of "if" conditions for this, it does so very effectively, way better than a hand-written series of IF statements would be. But it could still use a lot of improvement.
For tests that are used a lot, you might want to look for some bit-mask based implementation.
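One possible shape for such a bit-mask implementation is sketched below. It is not taken from any library: the type and routine names are assumptions, and the table simply spends the 8 KB mentioned earlier (one bit per UTF-16 code unit) to get a constant-time membership test.

```pascal
program CharBitmapDemo;
{$APPTYPE CONSOLE}

type
  // One bit per UTF-16 code unit: 65536 bits = 2048 * 32 bits = 8 KB.
  TCharBitmap = array[0..2047] of Cardinal;

// Sets the bit that corresponds to C.
procedure IncludeChar(var Map: TCharBitmap; C: Char);
begin
  Map[Ord(C) shr 5] := Map[Ord(C) shr 5] or (Cardinal(1) shl (Ord(C) and 31));
end;

// Tests the bit that corresponds to C.
function ContainsChar(const Map: TCharBitmap; C: Char): Boolean;
begin
  Result := (Map[Ord(C) shr 5] and (Cardinal(1) shl (Ord(C) and 31))) <> 0;
end;

var
  Allowed: TCharBitmap;
  C: Char;
begin
  FillChar(Allowed, SizeOf(Allowed), 0);
  for C := '0' to '9' do
    IncludeChar(Allowed, C);
  for C := 'a' to 'z' do
    IncludeChar(Allowed, C);
  IncludeChar(Allowed, #$0158);            // R with caron
  Writeln(ContainsChar(Allowed, #$0158));  // member
  Writeln(ContainsChar(Allowed, #$00D8));  // not a member
end.
```

The table is built once and then each test is a shift, a mask and a compare, which is the main attraction for hot loops.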

You should either use IFs instead of IN or find a WideCharSet implementation. This might help if you have a lot of sets: http://code.google.com/p/delphilhlplib/source/browse/trunk/Library/src/Extensions/DeHL.WideCharSet.pas.

You have stumbled onto a case where an idiom from pre-Unicode Pascal should not be translated directly into the most visually similar idiom in Unicode-era Pascal.
First, let's deal with Unicode string literals. If you can be sure that nobody will ever use your source code with a tool that could mess up your encodings, then you could use Unicode literals. Personally, I would not like to see Unicode code points in string literals in any of my code, for various reasons. The strongest is that my code may need to be reviewed for internationalization at some point, and having literals from your local language peppered through your code is even more of a problem when that language goes beyond the simple ASCII/ANSI code page symbols. Your source code will be more readable if your accented characters, and even non-accented character literals, are declared the way Jeroen says to declare them: in the const section, away from the place in the code where you use them.
Consider the case where you use the same string literal thirty-three times throughout your code. Why should it be repeated instead of being a constant? And even when it is used only once, isn't the code more readable if you declare a sane constant name?
So, first you should declare constants like he shows.
Second, the CharInSet function is best reserved for the use it was intended for, which is code where you must continue to use "set of AnsiChar" types. Sets of characters are no longer a recommended approach in Delphi 2009/2010; arrays of literal Unicode characters, declared in your const section, are more readable and more up to date.
I suggest you use the JCL StrContainsChars function and avoid character sets, since the language does not allow you to declare an inline set of Unicode characters at all. Instead use this, and be sure to comment it:
implementation
uses
JclStrings;
const
myChar1 = #$2001;
myChar2 = #$2002;
myChar3 = #$2003;
myMatchList1 : Array[0..2] of Char = (myChar1,myChar2,myChar3);
function Match(s:String):Boolean;
begin
result := StrContainsChars( s, myMatchList1,false);
end;
String and character literals peppered through your code, especially character or numeric literals, are called "magic values" and are to be avoided.
P.S. Your debug assertion shows that Ord('Ř') quietly downcasts the Unicode character to a byte-sized AnsiChar in the debugger. This behaviour is unexpected and should probably be logged in QC.

Related

Delphi - check if a Unicode character occurs in a set of characters?

This code worked fine in Delphi 7 (before Delphi had Unicode support):
Value := edit1.Text[1];
if Value in ['м', 'ж'] then ...
'м', 'ж' are Cyrillic symbols.
But this construction doesn't work with Unicode characters.
I tried a lot of things, but none of them work.
I also tried changing the value type to "Char" and "AnsiChar".
Doesn't work:
const
MySet : set of WideChar = [WideChar('м'), WideChar('ж')];
begin
Value := edit1.Text[1];
if Value in MySet then ...
Doesn't work:
if AnsiChar(Value) in ['м', 'ж'] then ...
Doesn't work:
if CharInSet(Value, ['м', 'ж']) then ...
But this works fine:
if (Value = 'м') or (Value = 'ж') then ...
Is there a way to check a Unicode character against a set in modern versions of Delphi?
Or should we check each character individually?
My Delphi version is 10.4 Update 2 Community Edition.
A Delphi set type can only handle a maximum of 256 values, so it cannot be used for handling Unicode characters. For handling Unicode, the System.Character unit provides various methods and helpers.
For this particular case, there is an IsInArray() character helper you can use. Instead of declaring a set of characters, you will need to declare an array of characters:
var
ch: Char;
a: array of Char;
s: string;
begin
a := ['м', 'ж'];
s := 'abcж';
for ch in s do
if ch.IsInArray(a) then ...
end;
Note: Delphi XE7 introduced additional language support for initializing and working with dynamic arrays, and square brackets can also be used for simpler array initialization. In the context of above example, ['м', 'ж'] is not a set, but an array of wide characters.
check if a Unicode character occurs in a set of characters?
Do you mean a Delphi set?
In general, it is impossible to have a set of X where the base type X has more than 256 possible distinct values. So set of Byte is fine, but set of Word isn't possible. Since there are 256 * 256 distinct wide character values, it is therefore impossible to have a set of wide characters. (If this were indeed possible, a variable of such a set type would be 8 kB in size. That would be an unusually large variable.)
Since there is no such thing as "Delphi set of Unicode characters", the question "How to see if a character belongs to a Delphi set of Unicode characters" doesn't make sense.
Or do you simply mean a mathematical set?
If so, of course this is possible, but you cannot use a Delphi set to represent the mathematical set of characters. Instead, you need to use some other data type. One possibility is a simple array, if you don't mind its O(n) characteristics.
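If the O(n) scan does matter, one option is to keep the array sorted by code point and use a binary search. The sketch below is only an illustration: the function name and the Letters constant are assumptions, and the two Cyrillic letters from the question are written as code points (U+0436 and U+043C) in ascending order.

```pascal
program SortedCharsDemo;
{$APPTYPE CONSOLE}

const
  // Must be sorted in ascending code point order for the search to work.
  Letters: array[0..1] of Char = (#$0436, #$043C);

// Binary search for C in an ascending array of characters.
function InSortedChars(C: Char; const Sorted: array of Char): Boolean;
var
  Lo, Hi, Mid: Integer;
begin
  Result := False;
  Lo := 0;
  Hi := High(Sorted);
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;
    if Sorted[Mid] = C then
    begin
      Result := True;
      Exit;
    end
    else if Sorted[Mid] < C then
      Lo := Mid + 1
    else
      Hi := Mid - 1;
  end;
end;

begin
  Writeln(InSortedChars(#$043C, Letters)); // present
  Writeln(InSortedChars('a', Letters));    // absent
end.
```

For a handful of characters the plain linear scan is likely just as fast; the sorted variant only pays off for larger lists.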

Error because of quote char after converting file to string with Delphi XE?

I get an incorrect result when converting a file to a string in Delphi XE. There are several ' characters that make the result incorrect. I've used UnicodeFileToWideString and FileToString from http://www.delphidabbler.com/codesnip and my own code:
function LoadFile(const FileName: TFileName): ansistring;
begin
with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
begin
try
SetLength(Result, Size);
Read(Pointer(Result)^, Size);
// ReadBuffer(Result[1], Size);
except
Result := '';
Free;
end;
Free;
end;
end;
The results from Delphi XE and Delphi 6 differ. The result from D6 is correct; I've compared it with the output of a hex editor.
Your output is being produced in the style of the Delphi debugger, which displays string variables using Delphi's own string-literal format. Whatever function you're using to produce that output from your own program has actually been fixed for Delphi XE. It's really your Delphi 6 output that's incorrect.
Delphi string literals consist of a series of printable characters between apostrophes and a series of non-printable characters designated by number signs and the numeric values of each character. To represent an apostrophe, write two of them next to each other. The printable and non-printable series of characters can be written right next to each other; there's no need to concatenate them with the + operator.
Here's an excerpt from the output you say is correct:
#$12'O)=ù'dlû'#6't
There are four lone apostrophes in that string, so each one either opens or closes a series of printable characters. We don't necessarily know which is which when we start reading the string at the left because the #, $, 1, and 2 characters are all printable on their own. But if they represent printable characters, then the O, ), =, and ù characters are in the non-printable region, and that can't be. Therefore, the first apostrophe above opens a printable series, and the #$12 part represents the character at code 18 (12 in hexadecimal). After the ù is another apostrophe. Since the previous one opened a printable string, this one must close it. But the next character after that is d, which is not #, and therefore cannot be the start of a non-printable character code. Therefore, this string from your Delphi 6 code is malformed.
The correct version of that excerpt is this:
#$12'O)=ù''dlû'#6't
Now there are three lone apostrophes and one set of doubled apostrophes. The problematic apostrophe from the previous string has been doubled, indicating that it is a literal apostrophe instead of a printable-string-closing one. The printable series continues with dlû. Then it's closed to insert character No. 6, and then opened again for t. The apostrophe that opens the entire string, at the beginning of the file, is implicit.
You haven't indicated what code you're using to produce the output you've shown, but that's where the problem was. It's not there anymore, and the code that loads the file is correct, so the only place that needs your debugging attention is any code that depended on the old, incorrect format. You'd still do well to replace your code with that of Robmil since it does better at handling (or not handling) exceptions and empty files.
Actually, looking at the real data, your problem is that the file stores binary data, not string data, so interpreting this as a string is not valid at all. The only reason it works at all in Delphi 6 is that non-Unicode Delphi allows you to treat binary data and strings the same way. You cannot do this in Unicode Delphi, nor should you.
The solution to get the actual text from within the file is to read the file as binary data, and then copy any values from this binary data, one byte at a time, to a string if it is a "valid" Ansi character (printable).
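A minimal sketch of that approach follows. It assumes "printable" means the ASCII range 32..126, which is narrower than a full ANSI code page; the function names are made up for this example, and the file name is taken from the command line.

```pascal
program PrintableBytesDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Reads the whole file as raw bytes, with no string interpretation.
function LoadBytes(const FileName: string): TBytes;
var
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, Stream.Size);
    if Length(Result) > 0 then
      Stream.ReadBuffer(Result[0], Length(Result));
  finally
    Stream.Free;
  end;
end;

// Copies only the bytes in the printable ASCII range 32..126 (assumption).
function PrintableText(const Data: TBytes): AnsiString;
var
  I, Count: Integer;
begin
  SetLength(Result, Length(Data));
  Count := 0;
  for I := 0 to High(Data) do
    if Data[I] in [32..126] then
    begin
      Inc(Count);
      Result[Count] := AnsiChar(Data[I]);
    end;
  SetLength(Result, Count);
end;

begin
  if ParamCount = 1 then
    Writeln(PrintableText(LoadBytes(ParamStr(1))));
end.
```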
I would suggest this code:
function LoadFile(const FileName: TFileName): AnsiString;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
    try
      SetLength(Result, Size);
      if Size > 0 then
        Read(Result[1], Size);
    finally
      Free;
    end;
end;

Delphi 2010 Blockread seems to get different data than previous version

I recompiled my old MP3 ID3 tag reader under D2010 and it no longer finds the tags.
The code is fairly simple, but it doesn't work.
The debugger shows lots of zeros and then Chinese characters in the results!
var dat:file of char;
id3:array [0..TAGLEN] of Char; //is 0..127 for ID3 v1
begin
vValid:=True;
if FileExists(vFilename) then begin
assignfile(dat,vFilename);
If (FileGetAttr(vFilename)>32) or (FileGetAttr(vFilename)=1) then
Filemode:= 0
Else
Filemode:= 2;
reset(dat);
seek(dat,FileSize(dat)-128);
blockread(dat,id3,128);
closefile(dat);
vMP3tag:=copy(id3, 0, 3);
if vMP3Tag='TAG' then begin
vTitle:=strip(copy(id3, 4, 30),' ');
vArtist:=strip(copy(id3, 34, 30), ' ');
I heard something about Unicode and PAnsiChar, but I still don't really understand what these do :)
Thanks for looking.
Try this:
var dat:file of AnsiChar;
id3:array [0..TAGLEN] of AnsiChar; //is 0..127 for ID3 v1
That is, of course, if your file is ANSI-based rather than Unicode-based. I have no idea what might be in an ID3 tag of an MP3 file.
If you want to understand the difference, this white paper explained it all to me. Basically, a Unicode character takes more memory than an ANSI character (in Delphi a WideChar is 2 bytes, an AnsiChar is 1), but Unicode can represent characters such as Chinese and Japanese, which a single ANSI code page cannot. Just read the white paper, then it'll all be clear.
In short, AnsiChar and AnsiString are what Char and string used to be in Delphi before D2009. In those days your application wasn't Unicode-compatible (you couldn't type Chinese characters by default).
As of D2009, string changed from AnsiString to UnicodeString, and Char from AnsiChar to WideChar. That means your application is Unicode by default, but old code that expects strings to be ANSI needs to be adapted to reflect that change.
Your code said Char, meaning AnsiChar to pre-D2009 compilers, but WideChar to D2009+ compilers. In other words, the new compilers read your code differently.
I hope that explains it a bit.
Oh!
It seems AnsiChar instead of Char is the way to go in D2010.
Ansi-char-them-all!
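For comparison, here is a sketch of the same read done with a stream and a record of AnsiChar fields, so the layout does not depend on what Char means to the compiler. The record and function names are assumptions for this example; the field sizes follow the usual ID3v1 layout (3 + 30 + 30 + 30 + 4 + 30 + 1 = 128 bytes), and text fields are assumed to be in the system ANSI code page.

```pascal
program Id3v1Demo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // ID3v1 layout: 128 bytes of single-byte fields at the end of the file.
  TId3v1Tag = packed record
    Header: array[0..2] of AnsiChar;   // 'TAG'
    Title: array[0..29] of AnsiChar;
    Artist: array[0..29] of AnsiChar;
    Album: array[0..29] of AnsiChar;
    Year: array[0..3] of AnsiChar;
    Comment: array[0..29] of AnsiChar;
    Genre: Byte;
  end;

// Reads the last 128 bytes of Stream; True if they start with 'TAG'.
function ReadId3v1(Stream: TStream; out Tag: TId3v1Tag): Boolean;
begin
  Result := False;
  if Stream.Size < SizeOf(Tag) then
    Exit;
  Stream.Position := Stream.Size - SizeOf(Tag);
  Stream.ReadBuffer(Tag, SizeOf(Tag));
  Result := (Tag.Header[0] = 'T') and (Tag.Header[1] = 'A') and
    (Tag.Header[2] = 'G');
end;

// Converts a fixed-size ANSI field to a Unicode string and trims padding.
function FieldToString(const Field: array of AnsiChar): string;
var
  Raw: AnsiString;
begin
  SetString(Raw, PAnsiChar(@Field[0]), Length(Field));
  Result := Trim(string(Raw));
end;

var
  Stream: TFileStream;
  Tag: TId3v1Tag;
begin
  if ParamCount <> 1 then
    Exit;
  Stream := TFileStream.Create(ParamStr(1), fmOpenRead or fmShareDenyWrite);
  try
    if ReadId3v1(Stream, Tag) then
      Writeln(FieldToString(Tag.Title), ' / ', FieldToString(Tag.Artist));
  finally
    Stream.Free;
  end;
end.
```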

Casting Delphi 2009/2010 string literals to PAnsiChar

So the question is whether or not string literals (or const strings) in Delphi 2009/2010 can be directly cast as PAnsiChar's or do they need an additional cast to AnsiString first for this to work?
The background is that I am calling functions in a legacy DLL with a C interface that has some functions that require C-style char pointers. In the past (before Delphi 2009) code like the following worked like a charm (where the param to the C DLL function is a LPCSTR):
either:
LegacyFunction(PChar('Fred'));
or
const
FRED = 'Fred';
...
LegacyFunction(PChar(FRED));
So in changing to Delphi 2009 (and now in 2010), I changed the call to this:
LegacyFunction(PAnsiChar('Fred'));
or
const
FRED = 'Fred';
...
LegacyFunction(PAnsiChar(FRED));
This seems to work and I get the correct results from the function call. However there is some definite instability in the app that seems to be occurring mostly the second or third time through the code that calls the legacy functions (that was not present before the move to the 2009 version of the IDE). In investigating this, I realized that the native string literal (and const string) in Delphi 2009/2010 is a Unicode string so my cast was possibly in error. Examples here and elsewhere seem to indicate this call should look more like this:
LegacyFunction(PAnsiChar(AnsiString('Fred')))
What confuses me is that with the code above in the second examples, casting the string literal directly to a PAnsiChar does not generate any compiler warnings. If instead of a string literal, I was casting a string var, I would get a suspicious cast warning (and the string would be mangled). This (and the fact that the string is usable in the DLL) leads me to believe the compiler is doing some magic to correctly interpret the string literal as the intended string type. Is this what is happening or is the double cast (first to AnsiString, then to PAnsiChar) really necessary and the lack of it in my code the reason for the hard to track down instability? And does the same answer hold true for const strings as well?
For type-inferred constants (only initializable from literals) the compiler changes the actual text at compile-time, rather than at runtime. That means it knows whether or not the conversion loses data, so it doesn't need to warn you if it doesn't.
To 'visualize' Barry Kelly's and Mason Wheeler's words:
const
FRED = 'Fred';
var
p: PAnsiChar;
w: PWideChar;
begin
w := PWideChar(Fred);
p := PAnsiChar(Fred);
In ASM:
Unit7.pas.32: w := PWideChar(Fred);
00462146 BFA4214600 mov edi,$004621a4
// no conversion, just a pointer to constant/"-1 RefCounted" UnicodeString
Unit7.pas.33: p := PAnsiChar(Fred);
0046214B BEB0214600 mov esi,$004621b0
// no conversion, just a pointer to constant/"-1 RefCounted" AnsiString
As you can see, in both cases, PWideChar/PChar(FRED) and PAnsiChar(FRED), there is no conversion: the Delphi compiler emits two constant strings, one AnsiString and one UnicodeString.
Constants, including string literals, are untyped by default, and the compiler will fit them into whatever format works in the context you're using them in. As long as there are no non-ANSI characters in your string literal, the compiler won't have any trouble generating the string as ANSI instead of Unicode in this situation.
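The contrast with a string variable can be sketched as follows. LegacyLength is a hypothetical stand-in for the C function taking an LPCSTR, not part of any real DLL. The literal can be cast directly, while a variable goes through an explicit AnsiString that stays alive for the duration of the call.

```pascal
program PAnsiCharDemo;
{$APPTYPE CONSOLE}

// Hypothetical stand-in for a legacy C function that takes an LPCSTR:
// counts bytes up to the terminating #0.
function LegacyLength(P: PAnsiChar): Integer;
begin
  Result := 0;
  while P[Result] <> #0 do
    Inc(Result);
end;

var
  Name: string;
  AnsiName: AnsiString;
begin
  // Literal: the compiler emits an ANSI copy of the constant.
  Writeln(LegacyLength(PAnsiChar('Fred')));     // 4

  // Variable: convert explicitly and keep the AnsiString in a variable
  // so the buffer remains valid while the callee uses the pointer.
  Name := 'Fred';
  AnsiName := AnsiString(Name);
  Writeln(LegacyLength(PAnsiChar(AnsiName)));   // 4
end.
```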
As Mason Wheeler points out all is fine as long as you don't have non-ANSI characters in your string const. If you have things like:
const FRED = 'Frédérick';
I'm pretty sure Delphi 2009/2010 will either issue charset hints (and apply a string conversion automatically - thus the hint) or fail at comparing ('Frédérick' is different in ISO-8859-1 than UTF-16).
If you can have "special" characters in your consts you will need to call string conversion.
Here are some basic examples with TStringList:
TStringList.SaveToFile(DestFilename, TEncoding.GetEncoding(28591)); //ISO-8859-1 (Latin1)
TStringList.SaveToFile(DestFilename, TEncoding.UTF8);

Delphi 2009 RawByteString vagaries

Suppose that for some perverse reason you want to display the raw byte contents of a UTF8String.
var
utf8Str : UTF8String;
begin
utf8Str := '€ąćęłńóśźż';
end;
(1) This doesn't work; it displays the readable form:
memo1.Lines.Add( RawByteString( utf8Str ));
// output: '€ąćęłńóśźż'
(2) This, however, does "work" - note the concatenation:
memo1.Lines.Add( 'x' + RawByteString( utf8Str ));
// output: 'x€ąćęłńóśźż'
I understand (1), though the compiler's forced coercion to UnicodeString seems to prevent ever displaying a RawByteString var as-is. However, why does the behavior change in (2)?
(3) Stranger still - let's reverse the concatenation:
memo1.Lines.Add( RawByteString( utf8Str ) + 'x' );
// output: '€ąćęłńóśźżx'
I've been reading up on the newfangled string types in Delphi and thought I understood how they work, but this is a puzzle.
RawByteString only exists to minimize the number of overloads required for functions that work with various flavours of AnsiStrings with different codepage affinities.
In general, don't declare variables of type RawByteString. Don't typecast values to that type. Don't do concatenations on variables of that type. About the only things you can do are:
Declaring a parameter of this type (the original intent)
Indexing on such a parameter
Searching in such a parameter
Intelligent operations that check the actual code page of the string, using the StringCodePage function.
For example, you'll note that the StringCodePage function itself uses RawByteString as its argument type. This way, it will work with any AnsiString, rather than doing a codepage translation before passing it as an argument.
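A minimal sketch of that intended use, with a made-up function name: the RawByteString parameter lets the callee see whatever code page the caller's string carries, without a conversion at the call site.

```pascal
program CodePageDemo;
{$APPTYPE CONSOLE}

// A RawByteString parameter accepts any AnsiString flavour as-is, so the
// callee can inspect the code page the caller's string actually carries.
function CodePageOf(const S: RawByteString): Word;
begin
  Result := StringCodePage(S);
end;

var
  Source: string;
  U: UTF8String;
begin
  Source := 'x' + #$20AC;    // 'x' followed by the euro sign
  U := UTF8String(Source);   // runtime conversion to UTF-8
  Writeln(CodePageOf(U));    // 65001 (UTF-8)
  Writeln(Length(U));        // 4: one byte for 'x', three for the euro sign
end.
```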
For your case, things like concatenations are largely undefined. The behaviour changed between RTM and Update 2, but when the RTL string concatenation functions receive multiple strings with different code pages, there's no easy way for it to figure out what code page should be used for the final string. That's just one reason why you shouldn't concatenate them like you do here.
You cannot add a string to a TMemo "as is". You always need to do some kind of conversion to Unicode, because that's all TMemo knows about in Delphi 2009.
If you want to pretend that your UTF8String uses code page 1252, do this:
var
utf8Str : UTF8String;
Raw: RawByteString;
begin
utf8Str := '€ąćęłńóśźż';
Raw := utf8Str;
SetCodePage(Raw, 1252, False);
Memo.Lines.Add(Raw);
end;
For more details, see my article Using RawByteString Effectively
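If the goal is only to inspect the bytes, another option is to format them as hexadecimal instead of reinterpreting the code page. This is a sketch with a made-up function name, not something taken from the answers above.

```pascal
program RawBytesDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Formats every byte of S as two hex digits, separated by spaces.
function BytesToHex(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 2);
  end;
end;

var
  Source: string;
  U: UTF8String;
begin
  Source := #$20AC;          // the euro sign
  U := UTF8String(Source);   // runtime conversion to UTF-8
  Writeln(BytesToHex(U));    // E2 82 AC
end.
```

The resulting string is plain ASCII, so it can be added to a TMemo without any code page questions.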

Resources