I am using Delphi 6.
I want to decode a Portuguese UTF-8 encoded string to a WideString, but I found that it isn't decoding correctly.
The original text is "ANÁLISE8". After using UTF8Decode(), the result is "ANALISE8". The symbol on top of the "A" disappears.
Here is the code:
var
f : textfile;
s : UTF8String;
w, test : WideString;
begin
while not eof(f) do
begin
readln(f,s);
w := UTF8Decode(s);
How can I decode the Portuguese UTF-8 string to WideString correctly?
Note that the implementation of UTF8Decode() in Delphi 6 is incomplete. Specifically, it does not support encoded 4-byte sequences, which are needed to handle Unicode codepoints above U+FFFF. That means UTF8Decode() can only decode Unicode codepoints in the UCS-2 range, not the full Unicode repertoire, which makes UTF8Decode() basically useless in Delphi 6 (and all the way up to Delphi 2007; it was finally fixed in Delphi 2009).
Try using the Win32 MultiByteToWideChar() function instead, eg:
uses
..., Windows;
function MyUTF8Decode(const s: UTF8String): WideString;
var
Len: Integer;
begin
Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), nil, 0);
SetLength(Result, Len);
if Len > 0 then
MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), PWideChar(Result), Len);
end;
var
f : textfile;
s : UTF8String;
w, test : WideString;
begin
while not eof(f) do
begin
readln(f,s);
w := MyUTF8Decode(s);
That being said, your ANÁLISE8 string falls within the UCS-2 range, so I tested UTF8Decode() in Delphi 6 and it decoded the UTF-8 encoded form of ANÁLISE8 just fine. I would conclude that either:
your UTF8String variable DOES NOT contain the UTF-8 encoded form of ANÁLISE8 to begin with (byte sequence 41 4E C3 81 4C 49 53 45 38), but contains the ASCII string ANALISE8 instead (byte sequence 41 4E 41 4C 49 53 45 38), which would decode as-is since ASCII is a subset of UTF-8. Double check your file, and the output of Readln().
your WideString contains ANÁLISE8 correctly as expected, but the way you are outputting/debugging it (which you did not show) is converting it to ANSI, losing the Á during the conversion.
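To tell these two cases apart, it can help to dump the raw bytes of the string returned by Readln() before decoding it. The following is only a minimal sketch: BytesToHex is a hypothetical helper name (not an RTL function), and it relies on nothing but SysUtils.IntToHex.

```delphi
uses
  SysUtils;

// Hypothetical helper (not part of the RTL): returns the bytes of S
// as space-separated two-digit hex values, without decoding anything.
function BytesToHex(const S: UTF8String): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[i]), 2);
  end;
end;
```

Calling BytesToHex(s) right after readln(f,s) should show C3 81 in the third and fourth positions if the file really holds the UTF-8 form of ANÁLISE8. A lone 41 there would mean the accent was already missing in the input.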
My program reads from a device via a serial port and returns this string. 'IC'#$0088#$0080'Ô'#$0080#$0080
I need to get the 5 hex values and convert to binary. #$0088 = 10001000, #$0080 = 10000000, Ô = 11010100.
I can convert the 80 & 88, but am having difficulty extracting them from the whole string. The Ô (xD4) I can neither extract nor convert. An extended character like the Ô could be at any or all locations.
The read methods in my serial component are:
function Read(var Buffer; Count: Integer): Integer;
function ReadStr(var Str: string; Count: Integer): Integer;
function ReadAsync(var Buffer; Count: Integer; var AsyncPtr: PAsync): Integer;
function ReadStrAsync(var Str: Ansistring; Count: Integer; var AsyncPtr: PAsync): Integer;
Can you give me an example of reading binary?
It looks like the real problem is that you are treating binary data as though it were UTF-16 encoded text.
Whatever is feeding you this data is not feeding you UTF-16 encoded text. What the device is really feeding you is a byte array. Treat it as such rather than as text. Then you can pick out the five values you are interested in by index.
So, declare an array of bytes:
var
Data: TArray<Byte>; // dynamic array
or
var
Data: TBytes; // shorthand for the same
or
var
Data: array [0..N-1] of Byte; // fixed length array
And then read into those arrays. To pick out values, use Data[i].
Note that I am using a significant amount of guesswork here, based on the question and your comments. Don't take my word for it. My guessing could be wrong. Consult the specification of the communication protocol for the device. And learn carefully the difference between text and binary.
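As a rough illustration of working with the data as bytes, the sketch below turns a single byte into its binary representation. ByteToBin is a hypothetical helper name, and the usage comment assumes a component instance called ComPort and a 7-byte frame, both of which are guesses based on the question rather than anything taken from the device protocol.

```delphi
// Hypothetical helper: renders one byte as an 8-character binary string,
// most significant bit first.
function ByteToBin(B: Byte): string;
var
  i: Integer;
begin
  SetLength(Result, 8);
  for i := 0 to 7 do
    if (B shr i) and 1 = 1 then
      Result[8 - i] := '1'
    else
      Result[8 - i] := '0';
end;

// Usage sketch. ComPort is an assumed instance of the serial component;
// Read(var Buffer; Count) is the method listed in the question, and the
// frame length of 7 bytes is a guess based on the sample string.
//
//   var
//     Data: array[0..6] of Byte;
//     k: Integer;
//   begin
//     if ComPort.Read(Data, SizeOf(Data)) = SizeOf(Data) then
//       for k := 2 to 6 do
//         Memo1.Lines.Add(ByteToBin(Data[k]));
//   end;
```

Because the buffer is an array of Byte, no code page conversion is involved, so $88, $80 and $D4 keep their bit patterns.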
As I wrote earlier in the comments, the problem with the message in your question is that it consists partly of non-ASCII characters. The ASCII range is $00 to $7F and maps to the same characters as Unicode U+0000 to U+007F, so no conversion takes place there (except for the added leading zero byte). AnsiChars in the range $80 to $FF, on the other hand, are subject to conversion according to the code page in use, in order to keep the same glyph in both. For example, AnsiChar $80 (Euro sign in CP1252) is converted to Unicode U+20AC, so the bit pattern of the lower byte no longer matches.
Ref: https://msdn.microsoft.com/en-us/library/cc195054.aspx
The following code shows the result of two tests, using Char vs. AnsiChar:
procedure TMainForm.Button2Click(Sender: TObject);
const
Buffer: array[0..7] of AnsiChar = ('I','C', #$88, #$80, #$D4, #$80, #$80, ';');
// Buffer: array[0..7] of Char = ('I','C', #$88, #$80, #$D4, #$80, #$80, ';');
BinChars: array[0..1] of Char = ('0','1');
var
i, k: integer;
c: AnsiChar;
// c: Char;
s: string;
begin
for k := 2 to 6 do
begin
c := Buffer[k];
SetLength(s, 8);
for i := 0 to 7 do
s[8-i] := BinChars[(ord(c) shr i) and 1];
Memo1.Lines.Add(format('Character %d in binary format: %s',[k, s]));
end;
end;
Using Char (UTF-16 WideChar)
AnsiChar #$88 is converted to U+02C6
AnsiChar #$80 is converted to U+20AC
AnsiChar #$D4 is converted to U+00D4 !
Lower byte gives
Character 2 in binary format: 11000110
Character 3 in binary format: 10101100
Character 4 in binary format: 11010100
Character 5 in binary format: 10101100
Character 6 in binary format: 10101100
Using AnsiChar
Character 2 in binary format: 10001000
Character 3 in binary format: 10000000
Character 4 in binary format: 11010100
Character 5 in binary format: 10000000
Character 6 in binary format: 10000000
Unfortunately a conversion from Unicode to Ansi (even if originally converted from Ansi to Unicode) is lossy and will fail.
I really don't see any easy solution with the information available.
I am testing migration from Delphi 5 to XE. Being unfamiliar with UnicodeString, before asking my question I would like to present its background.
Delphi XE string-oriented functions: Copy, Delete and Insert have a parameter Index telling where the operation should start. Index may have any integer value starting from 1 and finishing at the length of the string to which the function is applied.
Since a string can contain multi-element characters, an operation can start at an element (for example a surrogate) that belongs to a multi-element sequence encoding a single Unicode code point.
So, starting from a sensible string and applying one of these functions, we can obtain a nonsensical result.
The phenomenon can be illustrated with the cases below, which apply the function Copy to strings representing the same array of named code points (i.e. meaningful signs):
($61, $13000, $63)
It's the concatenation of 'a', EGYPTIAN_HIEROGLYPH_A001 and 'c'; it looks like: a𓀀c
Case 1. Copy of AnsiString (element = byte)
We start with the above mentioned UnicodeString #$61#$13000#$63 and we convert it to UTF-8 encoded AnsiString s0.
Then we test the function
copy (s0, index, 1)
for all possible values of index; there are 6 of them since s0 is 6 bytes long.
procedure Copy_Utf8Test;
type TAnsiStringUtf8 = type AnsiString (CP_UTF8);
var ss : string;
s0,s1 : TAnsiStringUtf8;
ii : integer;
begin
ss := #$61#$13000#$63; //mem dump of ss: $61 $00 $0C $D8 $00 $DC $63 $00
s0 := ss; //mem dump of s0: $61 $F0 $93 $80 $80 $63
ii := length(s0); //sets ii=6 (bytes)
s1 := copy(s0,1,1); //'a'
s1 := copy(s0,2,1); //#$F0 F means "start of 4-byte series"; no corresponding named code-point
s1 := copy(s0,3,1); //#$93 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,4,1); //#$80 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,5,1); //#$80 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,6,1); //'c'
end;
The first and last results are sensible within UTF-8 codepage, while the other 4 are not.
Case 2. Copy of UnicodeString (element = word)
We start with the same UnicodeString s0 := #$61#$13000#$63.
Then we test the function
copy (s0, index, 1)
for all possible values of index; there are 4 of them since s0 is 4 words long.
procedure Copy_Utf16Test;
var s0,s1 : string;
ii : integer;
begin
s0 := #$61#$13000#$63; //mem dump of s0: $61 $00 $0C $D8 $00 $DC $63 $00
ii := length(s0); //sets ii=4 (words)
s1 := copy(s0,1,1); //'a'
s1 := copy(s0,2,1); //#$D80C surrogate pair member; no corresponding named code-point
s1 := copy(s0,3,1); //#$DC00 surrogate pair member; no corresponding named code-point
s1 := copy(s0,4,1); //'c'
end;
The first and last results are sensible within codepage CP_UNICODE (1200), while the other 2 are not.
Conclusion.
The string-oriented functions Copy, Delete and Insert work perfectly on a string considered as a mere array of bytes or words. But they are not helpful if the string is seen as what it essentially is, i.e. a representation of an array of named code points.
Both cases above deal with strings that represent the same array of 3 named code points. They are considered representations (encodings) of the same text composed of 3 meaningful signs (to avoid abuse of the term "characters").
One may want to be able to extract (copy) any of those meaningful signs regardless of whether a particular text representation (encoding) is a mono- or multi-element one.
I've spent quite some time looking around for a satisfactory equivalent of the Copy that I was used to in Delphi 5.
Question.
Do such equivalents exist, or do I have to write them myself?
What you have described is how Copy(), Delete(), and Insert() have ALWAYS worked, even for AnsiString. The functions operate on elements (ie codeunits in Unicode terminology), and always have.
AnsiString is a string of 8bit AnsiChar elements, which can be encoded in any 8bit ANSI/MBCS format, including UTF-8.
UnicodeString (and WideString) is a string of 16bit WideChar elements, which are encoded in UTF-16.
The functions HAVE NEVER taken encoding into account. Not for MBCS AnsiString. Not for UTF-16 UnicodeString. Indexes are absolute element indexes from the beginning of the string.
If you need encoding-aware Copy/Delete/Insert functions that operate on logical codepoint boundaries, where each codepoint may be 1+ elements in the string, then you have to write your own functions, or find third-party functions that do what you need. There are no MBCS/UTF-aware manipulation functions in the RTL.
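For UTF-16 strings, such a function can be sketched by stepping over surrogate pairs while counting. The code below is only an illustration under stated assumptions: CopyCodePoints is a hypothetical name, the input is assumed to be well-formed UTF-16, and combining sequences are still counted as separate code points.

```delphi
// Hypothetical helper: copies Count code points from S, starting at the
// Index-th code point (1-based). Assumes S is well-formed UTF-16.
function CopyCodePoints(const S: UnicodeString; Index, Count: Integer): UnicodeString;
var
  i, StartPos, n: Integer;

  // Moves i past one code point: two elements for a surrogate pair,
  // otherwise one element.
  procedure Advance;
  begin
    if (Ord(S[i]) >= $D800) and (Ord(S[i]) <= $DBFF) and (i < Length(S)) then
      Inc(i, 2)
    else
      Inc(i);
  end;

begin
  Result := '';
  if (Index < 1) or (Count < 1) then
    Exit;
  i := 1;
  n := 1;
  // Skip the first Index-1 code points.
  while (i <= Length(S)) and (n < Index) do
  begin
    Advance;
    Inc(n);
  end;
  StartPos := i;
  n := 0;
  // Step over up to Count code points to find the end of the range.
  while (i <= Length(S)) and (n < Count) do
  begin
    Advance;
    Inc(n);
  end;
  Result := Copy(S, StartPos, i - StartPos);
end;
```

With the string from the question, CopyCodePoints(s0, 2, 1) would return both surrogates of the hieroglyph together instead of one half of the pair. Delete and Insert equivalents could reuse the same stepping logic to translate a code point index into an element index.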
You should parse the Unicode string yourself. Fortunately, the Unicode encodings are designed to make parsing easy. Here is an example of how to parse a UTF-8 string:
program Project9;
{$APPTYPE CONSOLE}
uses
SysUtils;
function GetFirstCodepointSize(const S: UTF8String): Integer;
var
B: Byte;
begin
B:= Byte(S[1]);
if (B and $80 = 0 ) then
Result:= 1
else if (B and $E0 = $C0) then
Result:= 2
else if (B and $F0 = $E0) then
Result:= 3
else if (B and $F8 = $F0) then
Result:= 4
else
Result:= -1; // invalid code
end;
var
S: string;
begin
S:= #$61#$13000#$63;
Writeln(GetFirstCodepointSize(S));
S:= #$13000#$63;
Writeln(GetFirstCodepointSize(S));
S:= #$63;
Writeln(GetFirstCodepointSize(S));
Readln;
end.
I'm generating texture atlases for rendering Unicode texts in my app. Source texts are stored in ANSI codepages (1250, 1251, 1254, 1257, etc). I want to be able to generate all the symbols from each ANSI codepage.
Here is the outline of the code I would expect to have:
for I := 0 to 255 do
begin
anChar := AnsiChar(I); //obtain AnsiChar
//Apply codepage without converting the chars
//<<--- this part does not work, showing:
//"E2033 Types of actual and formal var parameters must be identical"
SetCodePage(anChar, aCodepages[K], False);
//Assign AnsiChar to UnicodeChar (automatic conversion)
uniChar := anChar;
//Here we get Unicode character index
uniCode := Ord(uniChar);
end;
The code above does not work (E2033) and I'm not sure it is a proper solution at all. Perhaps there's a much shorter version.
What is the proper way of converting AnsiChar into Unicode with specific codepage in mind?
I would do it like this:
function AnsiCharToWideChar(ac: AnsiChar; CodePage: UINT): WideChar;
begin
if MultiByteToWideChar(CodePage, 0, @ac, 1, @Result, 1) <> 1 then
RaiseLastOSError;
end;
I think you should avoid using strings for what is in essence a character operation. If you know up front which code pages you need to support then you can hard code the conversions into a lookup table expressed as an array constant.
Note that all the characters that are defined in the ANSI code pages map to Unicode characters from the Basic Multilingual Plane and so are represented by a single UTF-16 character. Hence the size assumptions of the code above.
However, the assumption that you are making, and that this answer persists, is that a single byte represents a character in an ANSI character set. That's a valid assumption for many character sets, for example the single byte western character sets like 1252. But there are character sets like 932 (Japanese), 949 (Korean) etc. that are double byte character sets. Your entire approach breaks down for those code pages. My guess is that you only wish to support single byte character sets.
If you are writing cross-platform code then you can replace MultiByteToWideChar with UnicodeFromLocaleChars.
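If the list of code pages is not fixed up front, the single-byte assumption discussed above can be checked at runtime. The following is a small sketch using the Win32 GetCPInfo() function; IsSingleByteCodePage is a hypothetical helper name, and the result depends on which code pages are installed on the system.

```delphi
uses
  Windows;

// Hypothetical helper: True if the code page is installed and every
// character in it occupies exactly one byte.
function IsSingleByteCodePage(CodePage: UINT): Boolean;
var
  Info: TCPInfo;
begin
  Result := GetCPInfo(CodePage, Info) and (Info.MaxCharSize = 1);
end;
```

A texture atlas generator could call this for each entry in its code page list and skip, or handle separately, any code page for which it returns False.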
You can also do it in one step for all characters. Here is an example for codepage 1250:
var
encoding: TEncoding;
bytes: TBytes;
unicode: TArray<Word>;
I: Integer;
S: string;
begin
SetLength(bytes, 256);
for I := 0 to 255 do
bytes[I] := I;
SetLength(unicode, 256);
encoding := TEncoding.GetEncoding(1250); // change codepage as needed
try
S := encoding.GetString(bytes);
for I := 0 to 255 do
unicode[I] := Word(S[I+1]); // as long as strings are 1-based
finally
encoding.Free;
end;
end;
Here is the code I have found to be working well:
var
I: Byte;
anChar: AnsiString;
Tmp: RawByteString;
uniChar: Char;
uniCode: Word;
begin
for I := 0 to 255 do
begin
anChar := AnsiChar(I);
Tmp := anChar;
SetCodePage(Tmp, aCodepages[K], False);
uniChar := UnicodeString(Tmp)[1];
uniCode := Word(uniChar);
<...snip...>
end;
When I try the code below, the output seems to differ between XE2 and D2009.
procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
myByte: Byte;
begin
assignfile(Outfile,'test_chinese.txt');
Rewrite(Outfile);
for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
//This is the UTF-8 BOM
Writeln(Outfile,utf8string('总结'));
Writeln(Outfile,'°C');
Closefile(Outfile);
end;
Compiling with XE2 on a Windows 8 PC gives in WordPad
??
C
txt hex code: EF BB BF 3F 3F 0D 0A B0 43 0D 0A
Compiling with D2009 on a Windows XP PC gives in Wordpad
总结
°C
txt hex code: EF BB BF E6 80 BB E7 BB 93 0D 0A B0 43 0D 0A
My question is why it differs, and how can I save Chinese characters to a text file using the old text file I/O?
Thanks!
In XE2 onwards, AssignFile() has an optional CodePage parameter that sets the codepage of the output file:
function AssignFile(var F: File; FileName: String; [CodePage: Word]): Integer; overload;
Write() and Writeln() both have overloads that support UnicodeString and WideChar inputs.
So, you can create a file that has its codepage set to CP_UTF8, and then Write/ln() will automatically convert Unicode strings to UTF-8 when writing them to the file.
The downside is that you will not be able to write the UTF-8 BOM using AnsiChar values anymore, because the individual bytes will get converted to UTF-8 and thus not be written correctly. You can get around that by writing the BOM as a single Unicode character (which is what it really is: U+FEFF) instead of as individual bytes.
This works in XE2:
procedure TForm1.Button1Click(Sender: TObject);
var
Outfile: TextFile;
begin
AssignFile(Outfile, 'test_chinese.txt', CP_UTF8);
Rewrite(Outfile);
//This is the UTF-8 BOM
Write(Outfile, #$FEFF);
Writeln(Outfile, '总结');
Writeln(Outfile, '°C');
CloseFile(Outfile);
end;
With that said, if you want something that is more compatible and reliable between D2009 and XE2, use TStreamWriter instead:
procedure TForm1.Button1Click(Sender: TObject);
var
Outfile: TStreamWriter;
begin
Outfile := TStreamWriter.Create('test_chinese.txt', False, TEncoding.UTF8);
try
Outfile.WriteLine('总结');
Outfile.WriteLine('°C');
finally
Outfile.Free;
end;
end;
Or do the file I/O manually:
procedure TForm1.Button1Click(Sender: TObject);
var
Outfile: TFileStream;
BOM: TBytes;
procedure WriteBytes(const B: TBytes);
begin
if Length(B) > 0 then Outfile.WriteBuffer(B[0], Length(B));
end;
procedure WriteStr(const S: UTF8String);
begin
if S <> '' then Outfile.WriteBuffer(S[1], Length(S));
end;
procedure WriteLine(const S: UTF8String);
begin
WriteStr(S);
WriteStr(sLineBreak);
end;
begin
Outfile := TFileStream.Create('test_chinese.txt', fmCreate);
try
WriteBytes(TEncoding.UTF8.GetPreamble);
WriteLine('总结');
WriteLine('°C');
finally
Outfile.Free;
end;
end;
You really shouldn't use the old text I/O anymore.
Anyway, you can use TEncoding to get the UTF-8 TBytes like this:
procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
Bytes: TBytes;
myByte: Byte;
begin
assignfile(Outfile,'test_chinese.txt');
Rewrite(Outfile);
for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
//This is the UTF-8 BOM
Bytes := TEncoding.UTF8.GetBytes('总结');
for myByte in Bytes do begin
Write(Outfile, AnsiChar(myByte));
end;
Writeln(Outfile,'°C');
Closefile(Outfile);
end;
I'm not sure if there is an easier way to write TBytes to a Textfile, maybe somebody else has a better idea.
Edit:
For a pure binary file (File instead of TextFile type) you can use BlockWrite.
There are a couple of tell-tale signs that can tell you what went wrong when dealing with Unicode. In your case you're seeing "?" in the resulting output file: you get question marks when you try to convert something from Unicode to a code page and the target code page can't represent the requested characters.
Looking at the hex dump it's obvious (counting line terminators) that the question marks are the result of saving the two Chinese characters to the file. The two chars got converted to exactly two question marks. This tells you that Writeln() decided to give you a helping hand and converted the text from UTF-8 (a Unicode representation) to your local code page. The Delphi team probably decided to do this since the old I/O routines are not supposed to be Unicode compatible; since you're writing a UTF-8 string using the old I/O routines, they're helping you by converting it to your code page. You might not welcome that helping hand, but it doesn't mean it was wrong to do so: it's undocumented territory.
Since you now know why that's happening, you know what to do to stop it: let Writeln() know you're sending something that doesn't need converting. You'll discover that's not particularly easy, since Delphi XE2 apparently "helps you out" whatever you do. For example, stuff like this doesn't just change the string type, it converts to AnsiString, going through the code-page conversion routine that gets you question marks:
AnsiString(UTF8String('Whatever Unicode'));
Because of this, and if you need one-liner solutions, you could try a conversion routine, something like this:
function FakeConvert(const InStr: UTF8String): AnsiString;
var N: Integer;
begin
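N := Length(InStr);
SetLength(Result, N);
if N > 0 then // guard: indexing [1] on an empty string is invalid
Move(InStr[1], Result[1], N); // copy the raw bytes, no code-page conversion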
end;
You'll then be able to do:
Writeln(Outfile,FakeConvert('总结'));
And it'll do what you expect (I did actually try it before posting!)
Of course the only TRUE answer to this question is, since you upgraded all the way to Delphi XE2:
Stop using the deprecated I/O routines and move to TStream-based I/O.
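To check in advance whether a string would turn into question marks in a given ANSI code page, one option is to ask the Win32 WideCharToMultiByte() function whether it had to fall back to the default character. This is a sketch under stated assumptions: CanRepresentIn is a hypothetical helper name, the check is only meaningful for ANSI/MBCS code pages (the lpUsedDefaultChar argument is not allowed with CP_UTF8), and characters replaced by a best-fit substitute are not reported.

```delphi
uses
  Windows;

// Hypothetical helper: True if S converts to CodePage without the
// default character ('?') being substituted for anything.
function CanRepresentIn(const S: UnicodeString; CodePage: UINT): Boolean;
var
  Len: Integer;
  Buf: RawByteString;
  UsedDefault: BOOL;
begin
  Result := True;
  if S = '' then
    Exit;
  // First call: ask for the required buffer size in bytes.
  Len := WideCharToMultiByte(CodePage, 0, PWideChar(S), Length(S),
    nil, 0, nil, nil);
  if Len <= 0 then
  begin
    Result := False;
    Exit;
  end;
  SetLength(Buf, Len);
  UsedDefault := False;
  // Second call: convert and record whether a default char was needed.
  Len := WideCharToMultiByte(CodePage, 0, PWideChar(S), Length(S),
    PAnsiChar(Buf), Len, nil, @UsedDefault);
  Result := (Len > 0) and not UsedDefault;
end;
```

For the two Chinese characters in the question, a call such as CanRepresentIn('总结', 1252) would be expected to return False, which matches the two question marks seen in the hex dump.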
This code converts correctly in Delphi 2007, but not in Delphi 2010.
For example:
I have the Chinese text 短刀. In Delphi 2007 it converts to B5 CC B5 C6, but in Delphi 2010 it converts to 77 ED 52 00.
function StringToHex(str: string): string;
var
i:integer;
s:string;
begin
s:='';
for i:=1 to length(str) do begin
s:=s+inttohex(Integer(str[i]),2);
end;
result:=s;
end;
In Delphi 2010, however, the result is wrong.
How can I change it so that it works in Delphi 2010?
First, in Delphi 2007, String=AnsiString, and in Delphi 2010, String=UnicodeString. That is enough explanation for you to understand, if you know what AnsiString (char is 8 bits) and UnicodeString (char is 16 bits) means.
Even though you are calling "IntToHex(x,2)", each Delphi 2010 character when converted to an integer will be in the range from 0 to 65535, which means that the IntToHex call is returning between 2 and 4 hex digits, which makes it hard for you to read the results without confusion.
A minimal Unicode-aware fix is to change to IntToHex(x,4) for Unicode versions of Delphi, and maybe put a space in there so you can at least see where the codepoints separate. Four digits, like 0000, are enough hex digits for a single Unicode character represented as hex. Two digits are not enough.
Why are the values different though? That's a good question. Let me try to make it clearer: I believe you are seeing a consequence of using Delphi 2007 and its ANSI+MBCS support (which is codepage reliant) versus Delphi 2010, which uses Unicode strings. You should not be surprised that MBCS values differ from Unicode codepoints.
Also you should know that it takes two hex digits to show a byte, and four hex digits to show a Unicode character, which is 16 bits in size.
If you really want to see the Hex of the UTF8 string, then in Delphi 2010 you must create a UTF8 string first. If you really want MBCS, then say so. The whole world is Unicode now, I suggest you let MBCS go.
Fixed code for Unicode strings character codepoints (4 hex digits, 16 bit):
A UnicodeString=String aware version (Delphi 2009,2010,XE):
function StringToHex16(str: string): string;
var
i:integer;
s:string;
begin
s:='';
for i:=1 to length(str) do begin
s:=s+inttohex(Integer(str[i]),4);
end;
result:=s;
end;
UTF8 version for Delphi 2009,2010,XE:
function StringToHexUtf8(str: string): string;
var
i:integer;
s:string;
u:RawByteString;
begin
u := Utf8String(str);
s:='';
for i:=1 to length(u) do begin
s:=s+inttohex(Integer(u[i]),2);
end;
result:=s;
end;
And finally, since probably what you want is to reproduce exactly Delphi 2007's behaviour, here is an explicit example using MBCS functions:
function StringToHexMbcs(str: string;cp:Integer): string;
var
sz,i:integer;
s:string;
u:RawByteString;
flags:Integer;
begin
// use cp 936 or 950 for simplified or traditional chinese mbcs.
flags := WC_COMPOSITECHECK or WC_DISCARDNS or WC_SEPCHARS or WC_DEFAULTCHAR;
sz := Windows.WideCharToMultiByte( cp, flags, PWideChar(str),Length(str),nil,0,nil,nil); // get length in bytes.
SetLength(u,sz);
Windows.WideCharToMultiByte( cp, flags, PWideChar(str),Length(str),PAnsiChar(u),sz, nil,nil);
s:='';
for i:=1 to sz do begin
s:=s+inttohex(Integer(u[i]),2);
end;
result:=s;
end;
For future reference though, Delphi 2007 is not the gold standard of what is "right". You have to make some effort to understand the difference between MBCS and Unicode.
To obtain the same result in D2010 as in D2007, simply change the function parameter from (Unicode)String to AnsiString. Any string value you pass in, regardless of type, will be converted by the RTL into its MBCS equivalent based on the system default codepage, the same codepage AnsiString has always used in past versions and continues to use.
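Applied to the function from the question, that change could look like the sketch below. Only the parameter type differs from the original in substance. The resulting bytes depend on the system default codepage of the machine running the code, so the Chinese example only reproduces the D2007 output on a system configured the same way.

```delphi
uses
  SysUtils;

// Same function as in the question, but taking an AnsiString so that the
// RTL converts the input to the system default codepage before hexing.
function StringToHex(const str: AnsiString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(str) do
    Result := Result + IntToHex(Ord(str[i]), 2);
end;
```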