Unicode text file output differs between XE2 and Delphi 2009? - delphi

When I try the code below, the output in XE2 differs from the output in D2009.
procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TextFile;
  myByte: Byte;
begin
  AssignFile(Outfile, 'test_chinese.txt');
  Rewrite(Outfile);
  //This is the UTF-8 BOM
  for myByte in TEncoding.UTF8.GetPreamble do
    Write(Outfile, AnsiChar(myByte));
  Writeln(Outfile, UTF8String('总结'));
  Writeln(Outfile, '°C');
  CloseFile(Outfile);
end;
Compiling with XE2 on a Windows 8 PC gives in WordPad
??
C
txt hex code: EF BB BF 3F 3F 0D 0A B0 43 0D 0A
Compiling with D2009 on a Windows XP PC gives in WordPad
总结
°C
txt hex code: EF BB BF E6 80 BB E7 BB 93 0D 0A B0 43 0D 0A
My question is why the output differs, and how can I save Chinese characters to a text file using the old text file I/O?
Thanks!

From XE2 onwards, AssignFile() has an optional CodePage parameter that sets the codepage of the output file:
function AssignFile(var F: File; FileName: String; [CodePage: Word]): Integer; overload;
Write() and Writeln() both have overloads that support UnicodeString and WideChar inputs.
So, you can create a file that has its codepage set to CP_UTF8, and then Write/ln() will automatically convert Unicode strings to UTF-8 when writing them to the file.
The downside is that you will not be able to write the UTF-8 BOM using AnsiChar values anymore, because the individual bytes will get converted to UTF-8 and thus not be written correctly. You can get around that by writing the BOM as a single Unicode character (which is what it really is: U+FEFF) instead of as individual bytes.
This works in XE2:
procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TextFile;
begin
  AssignFile(Outfile, 'test_chinese.txt', CP_UTF8);
  Rewrite(Outfile);
  //This is the UTF-8 BOM
  Write(Outfile, #$FEFF);
  Writeln(Outfile, '总结');
  Writeln(Outfile, '°C');
  CloseFile(Outfile);
end;
With that said, if you want something that is more compatible and reliable between D2009 and XE2, use TStreamWriter instead:
procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TStreamWriter;
begin
  Outfile := TStreamWriter.Create('test_chinese.txt', False, TEncoding.UTF8);
  try
    Outfile.WriteLine('总结');
    Outfile.WriteLine('°C');
  finally
    Outfile.Free;
  end;
end;
Or do the file I/O manually:
procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TFileStream;

  procedure WriteBytes(const B: TBytes);
  begin
    if Length(B) > 0 then
      Outfile.WriteBuffer(B[0], Length(B));
  end;

  procedure WriteStr(const S: UTF8String);
  begin
    if S <> '' then
      Outfile.WriteBuffer(S[1], Length(S));
  end;

  procedure WriteLine(const S: UTF8String);
  begin
    WriteStr(S);
    WriteStr(sLineBreak);
  end;

begin
  Outfile := TFileStream.Create('test_chinese.txt', fmCreate);
  try
    WriteBytes(TEncoding.UTF8.GetPreamble);
    WriteLine('总结');
    WriteLine('°C');
  finally
    Outfile.Free;
  end;
end;

You really shouldn't use the old text I/O anymore.
Anyway, you can use TEncoding to get the UTF-8 TBytes like this:
procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TextFile;
  Bytes: TBytes;
  myByte: Byte;
begin
  AssignFile(Outfile, 'test_chinese.txt');
  Rewrite(Outfile);
  //This is the UTF-8 BOM
  for myByte in TEncoding.UTF8.GetPreamble do
    Write(Outfile, AnsiChar(myByte));
  Bytes := TEncoding.UTF8.GetBytes('总结');
  for myByte in Bytes do begin
    Write(Outfile, AnsiChar(myByte));
  end;
  Writeln(Outfile, '°C');
  CloseFile(Outfile);
end;
I'm not sure if there is an easier way to write TBytes to a TextFile; maybe somebody else has a better idea.
Edit:
For a pure binary file (the File type instead of TextFile) you can use BlockWrite.
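As a minimal sketch of that BlockWrite approach (the helper name WriteUtf8File is made up for illustration, and it assumes SysUtils is in the uses clause for TBytes and TEncoding):
procedure WriteUtf8File(const FileName: string; const Data: TBytes);
var
  F: File;            // untyped binary file
  Preamble: TBytes;
begin
  AssignFile(F, FileName);
  Rewrite(F, 1);      // record size of 1 byte
  try
    Preamble := TEncoding.UTF8.GetPreamble;
    if Length(Preamble) > 0 then
      BlockWrite(F, Preamble[0], Length(Preamble)); // write the BOM
    if Length(Data) > 0 then
      BlockWrite(F, Data[0], Length(Data));         // write the UTF-8 payload
  finally
    CloseFile(F);
  end;
end;
It could then be called with something like WriteUtf8File('test_chinese.txt', TEncoding.UTF8.GetBytes('总结' + sLineBreak + '°C'));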

There are a couple of tell-tale signs that show what went wrong when dealing with Unicode. In your case you're seeing "?" in the resulting output file: you get question marks when you try to convert something from Unicode to a code page and the target code page can't represent the requested characters.
Looking at the hex dump it's obvious (counting line terminators) that the question marks are the result of saving the two Chinese characters to the file. The two characters got converted to exactly two question marks. This tells you that Writeln() decided to lend a helping hand and converted the text from UTF-8 (a Unicode representation) to your local code page. The Delphi team probably decided to do this since the old I/O routines are not supposed to be Unicode compatible; since you're writing a UTF-8 string using the old I/O routines, they're helping you by converting it to your code page. You might not welcome that helping hand, but it doesn't mean it was wrong to do so: it's undocumented territory.
Since you now know why that's happening, you know what to do to stop it. Let Writeln() know you're sending something that doesn't need converting. You'll discover that's not particularly easy, since Delphi XE2 apparently "helps you out" whatever you do. For example, a cast like this doesn't just change the string type; it converts to AnsiString, going through the code-page conversion routine that gets you question marks:
AnsiString(UTF8String('Whatever Unicode'));
Because of this, and if you need one-liner solutions, you could try a conversion routine, something like this:
function FakeConvert(const InStr: UTF8String): AnsiString;
var
  N: Integer;
begin
  N := Length(InStr);
  SetLength(Result, N);
  if N > 0 then
    Move(InStr[1], Result[1], N);
end;
You'll then be able to do:
Writeln(Outfile,FakeConvert('总结'));
And it'll do what you expect (I did actually try it before posting!)
Of course the only TRUE answer to this question is, since you upgraded all the way to Delphi XE2:
Stop using deprecated I/O routines and move to TStream-based I/O.

Related

Can Delphi 6 convert UTF-8 Portuguese to WideString?

I am using Delphi 6.
I want to decode a Portuguese UTF-8 encoded string to a WideString, but I found that it isn't decoding correctly.
The original text is "ANÁLISE8". After using UTF8Decode(), the result is "ANALISE8"; the accent on top of the "A" disappears.
Here is the code:
var
  f : textfile;
  s : UTF8String;
  w, test : WideString;
begin
  while not eof(f) do
  begin
    readln(f, s);
    w := UTF8Decode(s);
How can I decode the Portuguese UTF-8 string to WideString correctly?
Note that the implementation of UTF8Decode() in Delphi 6 is incomplete. Specifically, it does not support encoded 4-byte sequences, which are needed to handle Unicode codepoints above U+FFFF. Which means UTF8Decode() can only decode Unicode codepoints in the UCS-2 range, not the full Unicode repertoire. Thus making UTF8Decode() basically useless in Delphi 6 (and all the way up to Delphi 2007 - it was finally fixed in Delphi 2009).
Try using the Win32 MultiByteToWideChar() function instead, eg:
uses
  ..., Windows;

function MyUTF8Decode(const s: UTF8String): WideString;
var
  Len: Integer;
begin
  Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), nil, 0);
  SetLength(Result, Len);
  if Len > 0 then
    MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), PWideChar(Result), Len);
end;
var
  f : textfile;
  s : UTF8String;
  w, test : WideString;
begin
  while not eof(f) do
  begin
    readln(f, s);
    w := MyUTF8Decode(s);
That being said, your ANÁLISE8 string falls within the UCS-2 range, so I tested UTF8Decode() in Delphi 6 and it decoded the UTF-8 encoded form of ANÁLISE8 just fine. I would conclude that either:
your UTF8String variable DOES NOT contain the UTF-8 encoded form of ANÁLISE8 to begin with (byte sequence 41 4E C3 81 4C 49 53 45 38), but instead contains the ASCII string ANALISE8 (byte sequence 41 4E 41 4C 49 53 45 38), which would decode as-is since ASCII is a subset of UTF-8. Double check your file and the output of Readln(); a quick hex dump such as the sketch below can help.
your WideString contains ANÁLISE8 correctly as expected, but the way you are outputting/debugging it (which you did not show) is converting it to ANSI, losing the Á during the conversion.
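To check the first possibility, here is a small debugging sketch (BytesAsHex is a made-up helper name; it only uses Ord and IntToHex, so it should work in Delphi 6):
uses
  SysUtils;

// Dumps the raw bytes of the string Readln() returned, so you can see
// whether the file really contains C3 81 for the Á or just a plain 41.
function BytesAsHex(const S: AnsiString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[I]), 2) + ' ';
end;
Calling ShowMessage(BytesAsHex(s)) right after readln(f, s) will show which of the two cases you are in.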

Delphi Lockbox Hashing

I need to hash a string, preferably as SHA512, although it could be SHA256, SHA1, MD5 or CRC32.
I have downloaded Lockbox 3, put a TCryptographicLibrary and a THash component on a form, set the Hash property to SHA-512 and used the following code to produce a test result:
procedure TForm1.Button1Click(Sender: TObject);
begin
  Hash1.HashString('myhashtest');
  Edit1.Text := Stream_To_AnsiString(Hash1.HashOutputValue);
end;
To best illustrate the problem, I have gone to an online hash calculator: the MD5 hash of 'myhashtest' is ff91e22313f0a41b46719e7ee6f99451, but setting the hash property in my test program to MD5 results in ÿ‘â#ð¤Fqž~æù”Q, which is clearly wrong. I have tried the same test using other Hash properties, including the SHA-512 which I want, and they all return rubbish.
Where am I going wrong?
THash.HashOutputValue is a stream of the raw hashed bytes. It appears that Stream_To_AnsiString() merely copies those raw bytes as-is into an AnsiString; it does not encode the bytes in any way. What you are looking for is the hex-encoded version of the raw bytes instead. I do know that LockBox has a Stream_To_Base64() function (as shown in this example), but I do not know if it has a Stream_To_Hex() type of function. If it does not, you can easily create your own, eg:
function Stream_To_Hex(Stream: TStream): AnsiString;
var
  NumBytes, I: Integer;
  B: Byte;
begin
  NumBytes := Stream.Size - Stream.Position;
  SetLength(Result, NumBytes * 2);
  for I := 0 to NumBytes-1 do
  begin
    Stream.ReadBuffer(B, 1);
    BinToHex(PAnsiChar(@B), PAnsiChar(@Result[(I*2)+1]), 1);
  end;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  Hash1.HashString('myhashtest');
  Edit1.Text := Stream_To_Hex(Hash1.HashOutputValue);
end;
Many cryptographic functions 'silently' (i.e. without stating so in the docs) output and require Base64- or hex-encoded strings (and also often AnsiStrings). This is because encrypted text can contain any data, and as soon as you start treating that as 'strings', string handling functions can easily choke on that (e.g. null-terminated strings containing a null). By Base64/hex encoding the cryptotext you make sure it will be plain old ASCII characters that even old code can read/write.
If you dig around a little in the cryptocode or its method parameters you usually can determine that, and convert your strings accordingly.
I figured out where Stream_To_Hex is: it is inside uTPLB_StreamUtils (.pas or .hpp), depending on whether you are using C++Builder or Delphi.
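If your copy of LockBox 3 does declare it there, a usage sketch might look like this (the unit name comes from the comment above, and the signature is assumed to match the hand-rolled Stream_To_Hex; verify both against your installation):
uses
  uTPLB_StreamUtils; // assumed to declare Stream_To_Hex, per the comment above

procedure TForm1.Button1Click(Sender: TObject);
begin
  Hash1.HashString('myhashtest');
  // assumes the library function, like the hand-rolled one above, takes a
  // TStream and returns the hex-encoded digest
  Edit1.Text := Stream_To_Hex(Hash1.HashOutputValue);
end;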

How to convert AnsiChar to UnicodeChar with specific CodePage?

I'm generating texture atlases for rendering Unicode texts in my app. Source texts are stored in ANSI codepages (1250, 1251, 1254, 1257, etc). I want to be able to generate all the symbols from each ANSI codepage.
Here is the outline of the code I would expect to have:
for I := 0 to 255 do
begin
  anChar := AnsiChar(I); //obtain AnsiChar
  //Apply codepage without converting the chars
  //<<--- this part does not work, showing:
  //"E2033 Types of actual and formal var parameters must be identical"
  SetCodePage(anChar, aCodepages[K], False);
  //Assign AnsiChar to UnicodeChar (automatic conversion)
  uniChar := anChar;
  //Here we get Unicode character index
  uniCode := Ord(uniChar);
end;
The code above does not work (E2033) and I'm not sure it is a proper solution at all. Perhaps there's a much shorter version.
What is the proper way of converting AnsiChar into Unicode with specific codepage in mind?
I would do it like this:
function AnsiCharToWideChar(ac: AnsiChar; CodePage: UINT): WideChar;
begin
  if MultiByteToWideChar(CodePage, 0, @ac, 1, @Result, 1) <> 1 then
    RaiseLastOSError;
end;
I think you should avoid using strings for what is in essence a character operation. If you know up front which code pages you need to support then you can hard code the conversions into a lookup table expressed as an array constant.
Note that all the characters that are defined in the ANSI code pages map to Unicode characters from the Basic Multilingual Plane and so are represented by a single UTF-16 character. Hence the size assumptions of the code above.
However, the assumption that you are making, and that this answer perpetuates, is that a single byte represents a character in an ANSI character set. That's a valid assumption for many character sets, for example the single byte western character sets like 1252. But there are character sets like 932 (Japanese), 949 (Korean), etc. that are double byte character sets. Your entire approach breaks down for those code pages. My guess is that you only wish to support single byte character sets.
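If you ever need to detect that situation on Windows, a small sketch (illustrative only, not part of the approach above) could use the IsDBCSLeadByteEx API:
uses
  Windows;

// True when the byte is only the lead byte of a double-byte character in
// the given code page (e.g. 932, 936, 949, 950); such a byte cannot be
// converted to a WideChar on its own.
function IsLeadByteInCodePage(b: Byte; CodePage: UINT): Boolean;
begin
  Result := IsDBCSLeadByteEx(CodePage, b);
end;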
If you are writing cross-platform code then you can replace MultiByteToWideChar with UnicodeFromLocaleChars.
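For example, here is a sketch of the same helper built on UnicodeFromLocaleChars (declared in the System unit); the error handling shown is illustrative only:
uses
  SysUtils; // for EConvertError

function AnsiCharToWideCharRtl(ac: AnsiChar; CodePage: Cardinal): WideChar;
begin
  // UnicodeFromLocaleChars returns the number of WideChars written
  if UnicodeFromLocaleChars(CodePage, 0, @ac, 1, @Result, 1) <> 1 then
    raise EConvertError.CreateFmt('Cannot convert byte %d in codepage %d',
      [Ord(ac), CodePage]);
end;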
You can also do it in one step for all characters. Here is an example for codepage 1250:
var
  encoding: TEncoding;
  bytes: TBytes;
  unicode: TArray<Word>;
  I: Integer;
  S: string;
begin
  SetLength(bytes, 256);
  for I := 0 to 255 do
    bytes[I] := I;
  SetLength(unicode, 256);
  encoding := TEncoding.GetEncoding(1250); // change codepage as needed
  try
    S := encoding.GetString(bytes);
    for I := 0 to 255 do
      unicode[I] := Word(S[I+1]); // as long as strings are 1-based
  finally
    encoding.Free;
  end;
end;
Here is the code I have found to be working well:
var
  I: Byte;
  anChar: AnsiString;
  Tmp: RawByteString;
  uniChar: Char;
  uniCode: Word;
begin
  for I := 0 to 255 do
  begin
    anChar := AnsiChar(I);
    Tmp := anChar;
    SetCodePage(Tmp, aCodepages[K], False);
    uniChar := UnicodeString(Tmp)[1];
    uniCode := Word(uniChar);
    <...snip...>
  end;

Delphi search and Replace code not working

I am trying to get this code to work. It's a standard search and replace function.
I get no errors at all but nothing changes in the text file for some reason.
Here is the full code:
procedure FileReplaceString(const FileName, SearchString, ReplaceString: string);
var
  fs: TFileStream;
  S: string;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    SetLength(S, fs.Size);
    fs.ReadBuffer(S[1], fs.Size);
  finally
    fs.Free;
  end;
  S := StringReplace(S, SearchString, ReplaceString, [rfReplaceAll, rfIgnoreCase]);
  fs := TFileStream.Create(FileName, fmCreate);
  try
    fs.WriteBuffer(S[1], Length(S));
  finally
    fs.Free;
  end;
end;
procedure TForm1.Button1Click(Sender: TObject);
var
  Path, FullPath: string;
begin
  Path := ExtractFilePath(Application.ExeName);
  FullPath := Path + 'test.txt';
  FileReplaceString(FullPath, 'changethis', 'withthis');
end;
The reason is that S, searchstring, and replacestring are Unicode strings (so, e.g., "Test" is 54 00 65 00 73 00 74 00) while the text file probably is a UTF-8 or ANSI file (so, e.g., "Test" is 54 65 73 74).
This means that the value stored in S will be highly corrupted (you take the bytes of a UTF-8 text and interpret them as the bytes of a Unicode text)! In the Test example, you will get 敔瑳?? where the last two characters are random (why?).
To test this hypothesis, simply declare S as AnsiString instead, then it should work.
Of course, if you need Unicode support, you need to do some UTF-8 encoding/decoding. The simplest solution to your problem would be to use TStringList; then you get everything you need for free.
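For illustration, a minimal sketch of the TStringList route (assuming the whole file fits comfortably in memory; pass a TEncoding to LoadFromFile/SaveToFile if you need to force a particular encoding):
uses
  SysUtils, Classes;

procedure FileReplaceStringSL(const FileName, SearchString, ReplaceString: string);
var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    SL.LoadFromFile(FileName); // decodes based on the BOM / default encoding
    SL.Text := StringReplace(SL.Text, SearchString, ReplaceString,
      [rfReplaceAll, rfIgnoreCase]);
    SL.SaveToFile(FileName);   // writes the text back out
  finally
    SL.Free;
  end;
end;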

How to convert a Chinese string to hex in Delphi 2010 and achieve the same result as Delphi 2007 MBCS

This code converts successfully in Delphi 2007.
For example:
I have the Chinese text 短刀; in Delphi 2007 it converts to B5 CC B5 C6, but in Delphi 2010 it converts to 77 ED 52 00.
function StringToHex(str: string): string;
var
  i: integer;
  s: string;
begin
  s := '';
  for i := 1 to length(str) do begin
    s := s + inttohex(Integer(str[i]), 2);
  end;
  result := s;
end;
But in Delphi 2010 it's wrong.
How can I edit it so that it works in Delphi 2010?
First, in Delphi 2007, String=AnsiString, and in Delphi 2010, String=UnicodeString. That is enough explanation for you to understand, if you know what AnsiString (char is 8 bits) and UnicodeString (char is 16 bits) means.
Even though you are calling "IntToHex(x,2)", each Delphi 2010 character when converted to an integer will be in the range from 0 to 65535, which means that the IntToHex call is returning between 2 and 4 hex digits, which makes it hard for you to read the results without confusion.
A minimal Unicode-aware fix is to change to IntToHex(x,4) for Unicode versions of Delphi, and maybe put a space in there so you can at least see where the codepoints separate. Four digits like 0000 is enough hex digits for a single Unicode character represented as hex; two digits is not enough.
Why are the values different though? That's a good question. Let me try to make it clearer; I believe you are seeing a consequence of using Delphi 2007 and its ANSI+MBCS support (which is codepage reliant) versus Delphi 2010, which uses Unicode strings. You should not be surprised that MBCS values differ from Unicode codepoints.
Also you should know that it takes two hex digits to show a byte, and four hex digits to show a Unicode character, which is 16 bits in size.
If you really want to see the Hex of the UTF8 string, then in Delphi 2010 you must create a UTF8 string first. If you really want MBCS, then say so. The whole world is Unicode now, I suggest you let MBCS go.
Fixed code for Unicode string character codepoints (4 hex digits, 16 bits):
A UnicodeString=String aware version (Delphi 2009,2010,XE):
function StringToHex16(str: string): string;
var
  i: integer;
  s: string;
begin
  s := '';
  for i := 1 to length(str) do begin
    s := s + inttohex(Integer(str[i]), 4);
  end;
  result := s;
end;
UTF8 version for Delphi 2009,2010,XE:
function StringToHexUtf8(str: string): string;
var
  i: integer;
  s: string;
  u: RawByteString;
begin
  u := Utf8String(str);
  s := '';
  for i := 1 to length(u) do begin
    s := s + inttohex(Integer(u[i]), 2);
  end;
  result := s;
end;
And finally, since probably what you want is to reproduce exactly Delphi 2007's behaviour, here is an explicit example using MBCS functions:
function StringToHexMbcs(str: string; cp: Integer): string;
var
  sz, i: integer;
  s: string;
  u: RawByteString;
  flags: Integer;
begin
  // use cp 936 or 950 for simplified or traditional chinese mbcs.
  flags := WC_COMPOSITECHECK or WC_DISCARDNS or WC_SEPCHARS or WC_DEFAULTCHAR;
  sz := Windows.WideCharToMultiByte(cp, flags, @str[1], -1, nil, 0, nil, nil); // get length.
  SetLength(u, sz + 1);
  Windows.WideCharToMultiByte(cp, flags, @str[1], Length(str), @u[1], sz - 1, nil, nil);
  s := '';
  for i := 1 to sz do begin
    s := s + inttohex(Integer(u[i]), 2);
  end;
  result := s;
end;
For future reference though, Delphi 2007 is not the gold standard of what is "right". You have to make some effort to understand the difference between MBCS and Unicode.
To obtain the same result in D2010 as in D2007, simply change the function parameter from (Unicode)String to AnsiString. Any string value you pass in, regardless of type, will be converted by the RTL into its MBCS equivalent based on the system default codepage, the same conversion AnsiString has always used in past versions and continues to use.
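As a sketch of that suggestion (renamed here to avoid clashing with the Unicode-aware versions above; the output matches D2007 only if the system default ANSI codepage is the same Chinese codepage, e.g. 936):
function StringToHexAnsi(str: AnsiString): string;
var
  i: integer;
  s: string;
begin
  // the UnicodeString passed by the caller is converted to the system
  // default ANSI codepage (MBCS) before this loop runs
  s := '';
  for i := 1 to length(str) do
    s := s + inttohex(Integer(str[i]), 2);
  result := s;
end;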
