Convert high-ANSI chars to their ASCII equivalents (é -> e) - Delphi

Is there a routine available in Delphi 2007 to convert the characters in the high range of the ANSI table (>127) to their equivalent ones in pure ASCII (<=127), according to a locale (code page)?
I know some chars cannot translate well, but most can, especially those in the 192-255 range:
À → A
à → a
Ë → E
ë → e
Ç → C
ç → c
– (en dash) → - (hyphen; that one can be trickier)
— (em dash) → - (hyphen)

WideCharToMultiByte does best-fit mapping for any characters that aren't supported by the specified character set, including stripping diacritics. You can do exactly what you want by using that and passing 20127 (US-ASCII) as the codepage.
function BestFit(const AInput: AnsiString): AnsiString;
const
  CodePage = 20127; // 20127 = US-ASCII
var
  WS: WideString;
begin
  WS := WideString(AInput);
  SetLength(Result, WideCharToMultiByte(CodePage, 0, PWideChar(WS),
    Length(WS), nil, 0, nil, nil));
  WideCharToMultiByte(CodePage, 0, PWideChar(WS), Length(WS),
    PAnsiChar(Result), Length(Result), nil, nil);
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  ShowMessage(BestFit('aÀàËëÇç–—€¢Š'));
end;
Calling that with your examples produces the results you're looking for, including the em-dash-to-hyphen case, which I don't think is handled by Jeroen's suggestion to convert to Normalization Form D. If you did want to take that approach, Michael Kaplan has a blog post that explicitly discusses stripping diacritics (rather than normalization in general), but it uses C# and an API that was introduced in Vista. You can get something similar using the FoldString API (available in any WinNT release).
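For reference, here's a rough sketch of that FoldString approach (my own illustration, not code from the original answer): MAP_COMPOSITE decomposes a precomposed character such as 'É' into 'E' plus a combining accent, after which the combining marks (U+0300..U+036F) can be dropped. Note that, unlike the best-fit conversion above, this only strips diacritics; it won't map the em dash to a hyphen.
uses
  Windows;

function StripDiacritics(const WS: WideString): WideString;
var
  Decomposed: WideString;
  I, Len: Integer;
begin
  // Ask for the required length, then decompose into base chars + combining marks
  Len := FoldStringW(MAP_COMPOSITE, PWideChar(WS), Length(WS), nil, 0);
  SetLength(Decomposed, Len);
  FoldStringW(MAP_COMPOSITE, PWideChar(WS), Length(WS), PWideChar(Decomposed), Len);
  // Keep everything except the combining diacritical marks block
  Result := '';
  for I := 1 to Length(Decomposed) do
    if (Ord(Decomposed[I]) < $0300) or (Ord(Decomposed[I]) > $036F) then
      Result := Result + Decomposed[I];
end;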
Of course if you're only doing this for one character set, and you want to avoid the overhead from converting to and from a WideString, Padu is correct that a simple for loop and a lookup table would be just as effective.

Just to extend Craig's answer for Delphi 2009:
If you use Delphi 2009 or newer, you can use more readable code with the same result:
function OStripAccents(const aStr: String): String;
type
  USASCIIString = type AnsiString(20127); // 20127 = US-ASCII
begin
  Result := String(USASCIIString(aStr));
end;
Unfortunately, this code works only on MS Windows. On Mac, the accents are not replaced by best-fit characters but by question marks.
Evidently, Delphi internally uses WideCharToMultiByte on Windows, whereas iconv is used on Mac (see LocaleCharsFromUnicode in System.pas).
The question is whether this different behaviour across operating systems should be considered a bug and reported to Quality Central.

I believe your best bet is creating a lookup table.
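A minimal sketch of that idea (my own illustration, assuming a Latin-1-style single-byte code page; the table only covers the 192..255 range, with '?' marking slots deliberately left unmapped):
function StripAccentsAnsi(const S: AnsiString): AnsiString;
const
  // Replacements for ordinal values 192..255 in Latin-1
  Table: AnsiString =
    'AAAAAA?CEEEEIIII?NOOOOO?OUUUUY??aaaaaa?ceeeeiiii?nooooo?ouuuuy?y';
var
  I: Integer;
begin
  Result := S;
  for I := 1 to Length(Result) do
    if Ord(Result[I]) >= 192 then
      Result[I] := Table[Ord(Result[I]) - 191]; // Table is 1-based
end;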

What you are looking for is normalization.
Michael Kaplan wrote a nice blog article about normalization.
It does not immediately solve your problem, but points you in the right direction.
--jeroen

Related

Delphi - check if a Unicode character occurs in a set of characters?

This code worked well with Delphi 7 (before Delphi got Unicode support):
Value := edit1.Text[1];
if Value in ['м', 'ж'] then ...
('м' and 'ж' are Cyrillic characters.)
But this construction doesn't work with Unicode characters.
I have tried a lot of things, but they don't work.
I also tried changing the value types to Char and AnsiChar.
Doesn't work:
const
  MySet: set of WideChar = [WideChar('м'), WideChar('ж')];
begin
  Value := edit1.Text[1];
  if Value in MySet then ...
Doesn't work:
if AnsiChar(Value) in ['м', 'ж'] then ...
Doesn't work:
if CharInSet(Value, ['м', 'ж']) then ...
But this works good:
if (Value = 'м') or (Value = 'ж') then ...
Is there a way to check a Unicode character against a set in the modern versions of Delphi?
Or should we check each character individually?
My Delphi version is 10.4 update 2 Community Edition
A Delphi set type can only handle a maximum of 256 values, so it cannot be used for handling Unicode characters. For handling Unicode, the System.Character unit provides various methods and helpers.
For this particular case, there is an IsInArray() character helper you can use. Instead of declaring a set of characters, you will need to declare an array of characters:
var
  ch: Char;
  a: array of Char;
  s: string;
begin
  a := ['м', 'ж'];
  s := 'abcж';
  for ch in s do
    if ch.IsInArray(a) then ...
end;
Note: Delphi XE7 introduced additional language support for initializing and working with dynamic arrays, where square brackets can also be used for simpler array initialization. In the context of the above example, ['м', 'ж'] is not a set but an array of wide characters.
check if a Unicode character occurs in a set of characters?
Do you mean a Delphi set?
In general, it is impossible to have a set of X where the base type X has more than 256 possible distinct values. So set of Byte is fine, but set of Word isn't possible. Since there are 256 * 256 distinct wide character values, it is therefore impossible to have a set of wide characters. (If this were indeed possible, a variable of such a set type would be 8 kB in size. That would be an unusually large variable.)
Since there is no such thing as "Delphi set of Unicode characters", the question "How to see if a character belongs to a Delphi set of Unicode characters" doesn't make sense.
Or do you simply mean a mathematical set?
If so, of course this is possible, but you cannot use a Delphi set to represent the mathematical set of characters. Instead, you need to use some other data type. One possibility is a simple array, if you don't mind its O(n) characteristics.
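As a lightweight illustration of that last point (my sketch, not from the original answer), a plain string constant can serve as the "set", with Pos doing the O(n) membership test:
const
  CyrillicMarkers = 'мж';

function IsMarker(C: Char): Boolean;
begin
  Result := Pos(C, CyrillicMarkers) > 0; // O(n), fine for small "sets"
end;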

Extra spaces with the implicit string-to-untyped-buffer conversion in the TFileStream.WriteBuffer method

Haven't needed to post here for a while, but I have a problem implementing file streams.
When writing a string to a file stream, the resulting text file has extra spaces inserted between each character.
So when running this method:
function TDBImportStructures.SaveIVDataToFile(const AMeasurementType: integer;
  IVDataRecordList: TIV; ExportFileName, LogFileName: String;
  var ProgressInfo: TProgressInfo): Boolean; // AM
var
  TempString: UnicodeString;
  ExportLogfile, OutputFile: TFileStream;
begin
  ExportLogfile := TFileStream.Create(LogFileName, fmCreate);
  TempString :=
    'FileUploadTimestamp, Filename, MeasurementTimestamp, SerialNumber, DeviceID, PVInstallID,'
    + #13#10;
  ExportLogfile.WriteBuffer(TempString[1], Length(TempString) * SizeOf(Char));
  ExportLogfile.Free;

  OutputFile := TFileStream.Create(ExportFileName, fmCreate);
  TempString :=
    'measurementdatetime,closestfiveseconddatetime,closesttenminutedatetime,deviceid,'
    + 'measuredmoduletemperature,moduletemperature,isc,voc,ff,impp,vmpp,iscslope,vocslope,'
    + 'pvinstallid,numivpoints,errorcode' + #13#10;
  OutputFile.WriteBuffer(TempString[1], Length(TempString) * SizeOf(Char));
  OutputFile.Free;
end;
(which is a stripped-down test method, writing headers only), the resulting CSV file for the OutputFile reads
'm e a s u r e d m o d u l e t e m p e r a t u r e', etcetera, when viewed in WordPad, but not in Excel, Notepad, etc.
I'm guessing it's the SizeOf(Char) expression which is wrong in a Unicode context, but I'm not sure what the correct thing to put here would be.
The ExportLogfile seems to work OK, but not the OutputFile.
From what I've read elsewhere it is the writing in unicode which is the problem & not WordPad, see http://social.msdn.microsoft.com/Forums/en-US/7e040fd1-f399-4fb1-b700-9e7cc6117cc4/unicode-to-files-and-console-vs-notepad-wordpad-word-etc?forum=vcgeneral
Any suggestions folks?
many thanks, Brian
You are writing 16-bit UTF-16 encoded characters and then viewing the text as if it were ANSI-encoded. This mismatch explains the behaviour. In fact you don't have extra spaces; those are zero bytes, interpreted as null characters.
You need to decide which encoding you wish to use. Which programs will read the file? Which text encoding are they expecting? Few programs that read csv files understand UTF-16.
A quick fix would be to switch to using AnsiString, which would result in 8-bit text but would not support international text. Do you need to support international text? Then perhaps you need UTF-8. Again, you could perform a quick fix using UTF8String, but I think you should look deeper.
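For what it's worth, the UTF8String quick fix might look like this (a sketch, assuming the consumers of the file accept UTF-8):
var
  Utf8: UTF8String;
...
Utf8 := UTF8String(TempString); // converts the UTF-16 string to UTF-8
OutputFile.WriteBuffer(Utf8[1], Length(Utf8)); // SizeOf(AnsiChar) = 1, so no multiplier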
It's odd that you handle the text to binary conversion. It would be much simpler to use TStringList, calling Add to add lines, and then specify an encoding when saving the file.
List.Add(...);
List.Add(...);
// etc.
List.SaveToFile(FileName, TEncoding.UTF8);
A perhaps more elegant approach would be to use the TStreamWriter class. Supply an output stream (or filename) and an encoding when creating the object. And then call Write or WriteLine to add text.
Writer := TStreamWriter.Create(FileName, TEncoding.UTF8);
try
  Writer.WriteLine(...);
  // etc.
finally
  Writer.Free;
end;
I've assumed UTF-8 here but you can easily specify a different encoding.
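For example, if the consuming program only understands legacy 8-bit text, the same code with a different encoding (my illustration; the second parameter is the Append flag):
Writer := TStreamWriter.Create(FileName, False, TEncoding.ANSI);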

TStringList splitting bugs

Recently I've been informed by a reputable SO user that TStringList has splitting bugs which cause it to fail when parsing CSV data. I haven't been informed about the nature of these bugs, and a search on the internet, including Quality Central, did not produce any results, so I'm asking: what are these TStringList splitting bugs?
Please note, I'm not interested in unfounded opinion based answers.
What I know:
Not much... One is that these bugs show up rarely with test data, but not so rarely in the real world.
The other is, as stated, that they prevent proper parsing of CSV. Since it seems difficult to reproduce the bugs with test data, I am (probably) seeking help from those who have tried using a string list as a CSV parser in production code.
Irrelevant problems:
I obtained the information in a question tagged 'Delphi-XE', so parsing failures due to the "space character is considered a delimiter" feature do not apply; the introduction of the StrictDelimiter property in Delphi 2006 resolved that. I, myself, am using Delphi 2007.
Also, since the string list can only hold strings, it is only responsible for splitting fields. Any conversion difficulty involving field values (for instance dates or floating-point numbers) arising from locale differences etc. is not in scope.
Basic rules:
There's no standard specification for CSV. But there are basic rules inferred from various specifications.
Below is a demonstration of how TStringList handles these. Rules and example strings are from Wikipedia. Brackets ([ ]) are superimposed around strings so that the test code can show leading or trailing spaces (where relevant).
Spaces are considered part of a field and should not be ignored.
Test string: [1997, Ford , E350]
Items: [1997] [ Ford ] [ E350]
Fields with embedded commas must be enclosed within double-quote characters.
Test string: [1997,Ford,E350,"Super, luxurious truck"]
Items: [1997] [Ford] [E350] [Super, luxurious truck]
Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.
Test string: [1997,Ford,E350,"Super, ""luxurious"" truck"]
Items: [1997] [Ford] [E350] [Super, "luxurious" truck]
Fields with embedded line breaks must be enclosed within double-quote characters.
Test string: [1997,Ford,E350,"Go get one now
they are going fast"]
Items: [1997] [Ford] [E350] [Go get one now
they are going fast]
In CSV implementations that trim leading or trailing spaces, fields with such spaces must be enclosed within double-quote characters.
Test string: [1997,Ford,E350," Super luxurious truck "]
Items: [1997] [Ford] [E350] [ Super luxurious truck ]
Fields may always be enclosed within double-quote characters, whether necessary or not.
Test string: ["1997","Ford","E350"]
Items: [1997] [Ford] [E350]
Testing code:
var
  SL: TStringList;
  rule: string;

  function GetItemsText: string;
  var
    i: Integer;
  begin
    Result := ''; // Result of a nested function is not guaranteed to be initialized
    for i := 0 to SL.Count - 1 do
      Result := Result + '[' + SL[i] + '] ';
  end;

  procedure Test(TestStr: string);
  begin
    SL.DelimitedText := TestStr;
    Writeln(rule + sLineBreak, 'Test string: [', TestStr + ']' + sLineBreak,
      'Items: ' + GetItemsText + sLineBreak);
  end;

begin
  SL := TStringList.Create;
  SL.Delimiter := ',';          // default, but ";" is used with some locales
  SL.QuoteChar := '"';          // default
  SL.StrictDelimiter := True;   // required: strings are separated *only* by Delimiter

  rule := 'Spaces are considered part of a field and should not be ignored.';
  Test('1997, Ford , E350');

  rule := 'Fields with embedded commas must be enclosed within double-quote characters.';
  Test('1997,Ford,E350,"Super, luxurious truck"');

  rule := 'Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.';
  Test('1997,Ford,E350,"Super, ""luxurious"" truck"');

  rule := 'Fields with embedded line breaks must be enclosed within double-quote characters.';
  Test('1997,Ford,E350,"Go get one now'#13#10'they are going fast"');

  rule := 'In CSV implementations that trim leading or trailing spaces, fields with such spaces must be enclosed within double-quote characters.';
  Test('1997,Ford,E350," Super luxurious truck "');

  rule := 'Fields may always be enclosed within double-quote characters, whether necessary or not.';
  Test('"1997","Ford","E350"');

  SL.Free;
end;
If you've read it all, the question (once more :)) is: what are "TStringList splitting bugs"?
Not much... One is that, these bugs show up rarely with test data, but not so rarely in real world.
All it takes is one case. Test data is not random data; one user with one failure case should submit the data and voilà, we've got a test case. If no one can provide test data, maybe there's no bug/failure?
There's no standard specification for CSV.
That one sure helps with the confusion. Without a standard specification, how do you prove something is wrong? If this is left to one's own intuition, you might get into all kinds of trouble. Here's some from my own happy interaction with government-issued software: my application was supposed to export data in CSV format, and the government application was supposed to import it. Here's what got us into a lot of trouble several years in a row:
How do you represent empty data? Since there's no CSV standard, one year my friendly gov decided anything goes, including nothing (two consecutive commas). Next they decided only consecutive commas are OK, that is, Field,"",Field is not valid, should be Field,,Field. Had a lot of fun explaining to my customers that the gov app changed validation rules from one week to the next...
Do you export ZERO integer data? This was probably an even bigger abuse, but my "gov app" decided to validate that also. At one time it was mandatory to include the 0, then it was mandatory NOT to include it. That is, at one time Field,0,Field was valid; next, Field,,Field was the only valid way...
And here's another test case where (my) intuition failed:
1997, Ford, E350, "Super, luxurious truck"
Please note the space between , and "Super, and the very lucky comma that follows "Super. The parser employed by TStrings only sees the quote char if it immediately follows the delimiter. That string is parsed as:
[1997]
[ Ford]
[ E350]
[ "Super]
[ luxurious truck"]
Intuitively I'd expect:
[1997]
[ Ford]
[ E350]
[Super luxurious truck]
But guess what, Excel does it the same way Delphi does it...
Conclusion
TStrings.CommaText is fairly good and nicely implemented; at least the Delphi 2010 version I looked at is quite effective (it avoids multiple string allocations and uses a PChar to "walk" the parsed string) and works about the same as Excel's parser does.
In the real world you'll need to exchange data with other software, written using other libraries (or no libraries at all), where people might have misinterpreted some of the (missing?) rules of CSV. You'll have to adapt, and it'll probably not be a case of right-or-wrong but a case of "my clients need to import this crap". If that happens, you'll have to write your own parser, one that adapts to the requirements of the 3rd-party app you're dealing with. Until that happens, you can safely use TStrings. And when it does happen, it might not be TStrings' fault!
I'm going to go out on a limb and say that the most common failure case is the embedded linebreak. I know most of the CSV parsing I do ignores that. I'll use 2 TStringLists, 1 for the file I'm parsing, the other for the current line. So I'll end up with code similar to the following:
procedure Foo;
var
  CSVFile, ALine: TStringList;
  s: string;
begin
  CSVFile := TStringList.Create;
  ALine := TStringList.Create;
  try
    ALine.StrictDelimiter := True;
    CSVFile.LoadFromFile('C:\Path\To\File.csv');
    for s in CSVFile do
    begin
      ALine.CommaText := s;
      DoSomethingInteresting(ALine);
    end;
  finally
    ALine.Free;
    CSVFile.Free;
  end;
end;
Of course, since I'm not taking care to make sure that each line is "complete", I can potentially run into cases where the input contains a quoted linebreak in a field and I miss it.
Until I run into real world data where it's an issue, I'm not going to bother fixing it. :-P
Another example... this TStringList.CommaText bug exists in Delphi 2009.
procedure TForm1.Button1Click(Sender: TObject);
var
  list: TStringList;
begin
  list := TStringList.Create;
  try
    list.CommaText := '"a""';
    Assert(list.Count = 1);
    Assert(list[0] = 'a');
    Assert(list.CommaText = 'a'); // FAILS -- actual value is "a""
  finally
    FreeAndNil(list);
  end;
end;
The TStringList.CommaText setter and related methods corrupt the memory of the string that holds the 'a' item (its null-terminator character is overwritten by a '"').
Have you already tried using TArray<String> with Split?
var
  text: String;
  arr: TArray<String>;
begin
  text := '1997,Ford,E350';
  arr := text.Split([',']);
So arr would be:
arr[0] = 1997;
arr[1] = Ford;
arr[2] = E350;
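One caveat worth adding (my note, not from the original answer): the plain Split call above doesn't honour quoted fields. The string helper also has quote-aware overloads, so something like this should keep a field such as "Super, luxurious truck" together:
arr := text.Split([','], '"'); // '"' is treated as the quote character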

Delphi 2010 BlockRead seems to read different data than previous versions

I recompiled my old MP3 ID3 tag reader under D2010 and it seems it won't find the tags anymore.
The code is fairly simple, but it doesn't work.
The debugger shows lots of zeros and then Chinese characters in the results!
var
  dat: file of Char;
  id3: array [0..TAGLEN] of Char; // is 0..127 for ID3 v1
begin
  vValid := True;
  if FileExists(vFilename) then begin
    AssignFile(dat, vFilename);
    if (FileGetAttr(vFilename) > 32) or (FileGetAttr(vFilename) = 1) then
      FileMode := 0
    else
      FileMode := 2;
    Reset(dat);
    Seek(dat, FileSize(dat) - 128);
    BlockRead(dat, id3, 128);
    CloseFile(dat);
    vMP3Tag := Copy(id3, 0, 3);
    if vMP3Tag = 'TAG' then begin
      vTitle := Strip(Copy(id3, 4, 30), ' ');
      vArtist := Strip(Copy(id3, 34, 30), ' ');
I've heard something about Unicode and PAnsiChar, but I still don't understand much of what these do anyway :)
thanks for looking
Try this:
var
  dat: file of AnsiChar;
  id3: array [0..TAGLEN] of AnsiChar; // is 0..127 for ID3 v1
That is, of course, if your file is ANSI-based instead of Unicode-based. I have no idea what might be in an ID3 tag of an MP3 file.
If you want to understand the difference, this white paper explained it all to me. Basically, Unicode uses more memory per character (in Delphi, two bytes instead of one, and up to four for surrogate pairs), but it supports characters like Chinese and Japanese, which ANSI doesn't. Just read the white paper; then it'll all be clear.
In short, AnsiChar and AnsiString are what Char and string used to be in Delphi before D2009. In those days your application wasn't Unicode-compatible (you couldn't type Chinese characters by default).
As of D2009, the definition of string changed from AnsiString to UnicodeString, and Char from AnsiChar to WideChar. That means your application is Unicode by default, but old code expecting strings to be ANSI-encoded needs to be adapted to reflect that change.
Your code said Char, meaning AnsiChar to pre-D2009 compilers but WideChar to D2009+ compilers. In other words, the new compilers read your code differently.
I hope that explains it a bit.
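A quick way to see the change for yourself (my illustration):
// Pre-D2009 compilers report 1/1; D2009 and later report 2/1
ShowMessage(Format('SizeOf(Char) = %d, SizeOf(AnsiChar) = %d',
  [SizeOf(Char), SizeOf(AnsiChar)]));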
Oh!
It seems like AnsiChar instead of Char is the way to go in D2010.
Ansi-char-them-all!

Wrong Unicode conversion: how to store accented characters in Delphi 2010 source code and handle character sets?

We are upgrading our project from Delphi 2006 to Delphi 2010. Old code was:
InputText: string;
...
InputText := SomeTEditComponent.Text;
...
for i := 1 to Length(InputText) do
  if InputText[i] in ['0'..'9', 'a'..'z', 'Ř' { and more special characters }] then ...
The trouble is with accented letters: the comparison fails.
I tried switching the source file encoding from ANSI to UTF-8 and UCS-2 LE, but without luck. Only a cast to AnsiChar works:
if CharInSet(AnsiChar(InputText[i]), ['0'..'9', 'a'..'z', 'Ř']) then
It's funny how Delphi treats those letters; try this in Evaluate during debugging:
Ord('Ř') = Ord('Ø')
(yes, Delphi says True, on Windows 7 Czech)
The question is: how can I store and compare simple strings without forcing them to AnsiString? Because if this doesn't work, why should we use Unicode?
Thanks all for the replies.
Right now we are using simple CharInSet(AnsiChar(... in some parts.
The declaration of CharInSet is
function CharInSet(C: AnsiChar; const CharSet: TSysCharSet): Boolean; overload; inline;
function CharInSet(C: WideChar; const CharSet: TSysCharSet): Boolean; overload; inline;
while TSysCharSet is
TSysCharSet = set of AnsiChar;
Thus CharInSet can only compare against a set of AnsiChar; that is why your accented character is converted to AnsiChar.
There is no equivalent set of WideChar, as sets are limited to 256 elements. You have to implement some other means to check the character.
Something like
const
  specials: string = 'Ř';

if CharInSet(InputText[i], ['0'..'9', 'a'..'z']) or (Pos(InputText[i], specials) > 0) then
might be worth a try. You can add more characters to specials as needed.
Don't rely on the encoding of your Delphi source code files.
It might be mangled when using any non-Unicode tool to work on your text files (or even a buggy Unicode-aware tool).
The best way is to specify your characters as hexadecimal Unicode code points:
const
  MyEuroSign = #$20AC;
See also my blog posting about this.
As mentioned by Uwe Raabe, the problem with Unicode chars is that there are a lot of them, so sets of them get pretty large. If Delphi allowed you to create a "set of Char", it would be 8 kB in size! A "set of AnsiChar" is only 32 bytes, which is quite manageable.
I'd like to offer some alternatives. First is a sort of drop-in replacement for the CharInSet function, one that uses an array of Char to do the tests. Its only merit is that it can be called immediately from almost anywhere, but its benefits stop there. I'd avoid this if I could:
function UnicodeCharInSet(UniChr: Char; CharArray: array of Char): Boolean;
var
  i: Integer;
begin
  for i := 0 to High(CharArray) do
    if CharArray[i] = UniChr then
    begin
      Result := True;
      Exit;
    end;
  Result := False;
end;
The trouble with this function is that it doesn't handle the x in ['a'..'z'] syntax and it's slow! The alternatives are faster but aren't as close to a drop-in replacement as one might want. The first set of alternatives to investigate are the string functions from Microsoft, amongst them IsCharAlpha and IsCharAlphaNumeric; they might fix a lot of issues. The problem with those is that all "alpha" chars are treated the same: you might end up accepting valid alpha chars from non-English, non-Czech languages. Alternatively, you can use the TCharacter class from Embarcadero; the implementation is all in the Character.pas unit, and it looks effective (I have no idea how effective Microsoft's implementation is).
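For instance, the TCharacter route might look like this (my illustration; the Character unit ships with Delphi 2009 and later):
uses
  Character;
...
if TCharacter.IsLetterOrDigit(InputText[i]) then
  ... // True for '0'..'9', 'a'..'z' and accented letters such as 'Ř'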
Another alternative is to write your own functions, using a "case" statement to get things to work. Here's an example:
function UnicodeCharIs(UniChr: Char): Boolean;
begin
  case UniChr of
    'ă': Result := True;
    'ş': Result := False;
    'Ă': Result := True;
    'Ş': Result := False;
  else
    Result := False;
  end;
end;
I inspected the assembler generated for this function. While Delphi has to implement a series of "if" conditions for this, it does so very effectively, way better than writing the series of IF statements yourself. But it could still use a lot of improvement.
For tests that are used a lot, you might want to look into a bitmask-based implementation.
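A rough sketch of what such a bitmask could look like (my own illustration; one bit per BMP code point costs 65536 / 8 = 8 kB, but the test itself is just a shift and a mask):
type
  TWideCharMask = array[0..8191] of Byte; // 65536 bits, one per BMP code point

procedure IncludeChar(var Mask: TWideCharMask; C: Char);
begin
  Mask[Ord(C) shr 3] := Mask[Ord(C) shr 3] or (1 shl (Ord(C) and 7));
end;

function CharInMask(const Mask: TWideCharMask; C: Char): Boolean;
begin
  Result := (Mask[Ord(C) shr 3] and (1 shl (Ord(C) and 7))) <> 0;
end;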
You should either use IFs instead of IN or find a WideCharSet implementation. This might help if you have a lot of sets: http://code.google.com/p/delphilhlplib/source/browse/trunk/Library/src/Extensions/DeHL.WideCharSet.pas.
You have stumbled onto a case where an idiom from pre-Unicode Pascal should not be translated directly into the most visually similar idiom of Unicode-era Pascal.
First, let's deal with Unicode string literals. If you could always be sure that nobody will ever use your source code with a tool that could mess up your encodings, then you could use Unicode literals. Personally, I would not like to see Unicode code points in string literals in any of my code, for various reasons, the strongest being that my code may need to be reviewed for internationalization at some point; having literals from your local language peppered through your code is even more of a problem when you use a language other than those that fit the plain ASCII/ANSI code-page symbols. Your source code will be more readable if you assume that your accented (and even non-accented) character literals are better declared, as Jeroen suggests, in a const section, away from the place in the code where you actually use them.
Consider the case where you use the same string literal thirty-three times throughout your code. Why should it be repeated instead of declared as a constant? And even when it is used only once, isn't the code more readable if you declare a sane constant name?
So, first you should declare constants like he shows.
Second, the CharInSet function is deprecated for all uses other than the one it was intended for, which is where you must continue to use "set of AnsiChar" types. It is no longer a recommended approach in Delphi 2009/2010; using arrays of literal Unicode characters in your constant section is more readable and more up to date.
I suggest you use the JCL StrContainsChars function and avoid character sets, since you cannot declare an inline SET of Unicode characters at all; the language does not allow it. Instead use this, and be sure to comment it:
implementation

uses
  JclStrings;

const
  myChar1 = #$2001;
  myChar2 = #$2002;
  myChar3 = #$2003;
  myMatchList1: array[0..2] of Char = (myChar1, myChar2, myChar3);

function Match(s: String): Boolean;
begin
  Result := StrContainsChars(s, myMatchList1, False);
end;
Having string and character literals peppered through your code is bad; such bare character or numeric literals are called "magic values" and are to be avoided.
P.S. Your debug assertion shows that Ord('Ř') quietly downcasts the Unicode character to an AnsiChar-sized byte in the debugger. This behaviour is unexpected and should probably be logged in QC.
