How to convert AnsiChar to UnicodeChar with specific CodePage? - delphi

I'm generating texture atlases for rendering Unicode texts in my app. Source texts are stored in ANSI codepages (1250, 1251, 1254, 1257, etc). I want to be able to generate all the symbols from each ANSI codepage.
Here is the outline of the code I would expect to have:
for I := 0 to 255 do
begin
anChar := AnsiChar(I); //obtain AnsiChar
//Apply codepage without converting the chars
//<<--- this part does not work, showing:
//"E2033 Types of actual and formal var parameters must be identical"
SetCodePage(anChar, aCodepages[K], False);
//Assign AnsiChar to UnicodeChar (automatic conversion)
uniChar := anChar;
//Here we get Unicode character index
uniCode := Ord(uniChar);
end;
The code above does not work (E2033), and I'm not sure it is a proper solution at all. Perhaps there's a much shorter version.
What is the proper way of converting AnsiChar into Unicode with specific codepage in mind?

I would do it like this:
function AnsiCharToWideChar(ac: AnsiChar; CodePage: UINT): WideChar;
begin
if MultiByteToWideChar(CodePage, 0, @ac, 1, @Result, 1) <> 1 then
RaiseLastOSError;
end;
I think you should avoid using strings for what is in essence a character operation. If you know up front which code pages you need to support then you can hard code the conversions into a lookup table expressed as an array constant.
Note that all the characters that are defined in the ANSI code pages map to Unicode characters from the Basic Multilingual Plane and so are represented by a single UTF-16 character. Hence the size assumptions of the code above.
However, the assumption that you are making, and that this answer persists, is that a single byte represents a character in an ANSI character set. That's a valid assumption for many character sets, for example the single byte western character sets like 1252. But there are character sets like 932 (Japanese), 949 (Korean) etc. that are double byte character sets. Your entire approach breaks down for those code pages. My guess is that you only wish to support single byte character sets.
If you are writing cross-platform code then you can replace MultiByteToWideChar with UnicodeFromLocaleChars.
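The lookup-table idea mentioned in this answer could be sketched roughly as follows. Instead of a hand-written array constant, this variation fills the table once at run time with a single MultiByteToWideChar call. The names TCodePageTable and BuildCodePageTable are made up for illustration, and the sketch assumes a single-byte code page that is installed on the machine.

```delphi
program CodePageTableDemo;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

type
  // One UTF-16 code unit per byte value (hypothetical type name)
  TCodePageTable = array[Byte] of WideChar;

// Hypothetical helper: converts all 256 byte values of a single-byte
// code page in one call and stores the results in Table.
procedure BuildCodePageTable(CodePage: UINT; out Table: TCodePageTable);
var
  Src: array[Byte] of AnsiChar;
  I, Count: Integer;
begin
  for I := 0 to 255 do
    Src[I] := AnsiChar(I);
  Count := MultiByteToWideChar(CodePage, 0, @Src[0], 256, @Table[0], 256);
  if Count = 0 then
    RaiseLastOSError;
  // A double-byte code page consumes some bytes in pairs and yields fewer units
  if Count <> 256 then
    raise Exception.CreateFmt('Code page %d is not single-byte', [CodePage]);
end;

var
  Table: TCodePageTable;
begin
  BuildCodePageTable(1250, Table);
  // Print the Unicode code point that byte $E6 maps to in code page 1250
  Writeln(Format('$E6 -> U+%.4x', [Ord(Table[$E6])]));
end.
```

Bytes that the code page leaves undefined are still mapped to something by the API, so the table may contain entries you do not want in a texture atlas.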

You can also do it in one step for all characters. Here is an example for codepage 1250:
var
encoding: TEncoding;
bytes: TBytes;
unicode: TArray<Word>;
I: Integer;
S: string;
begin
SetLength(bytes, 256);
for I := 0 to 255 do
bytes[I] := I;
SetLength(unicode, 256);
encoding := TEncoding.GetEncoding(1250); // change codepage as needed
try
S := encoding.GetString(bytes);
for I := 0 to 255 do
unicode[I] := Word(S[I+1]); // as long as strings are 1-based
finally
encoding.Free;
end;
end;

Here is the code I have found to be working well:
var
I: Byte;
anChar: AnsiString;
Tmp: RawByteString;
uniChar: Char;
uniCode: Word;
begin
for I := 0 to 255 do
begin
anChar := AnsiChar(I);
Tmp := anChar;
SetCodePage(Tmp, aCodepages[K], False);
uniChar := UnicodeString(Tmp)[1];
uniCode := Word(uniChar);
<...snip...>
end;

Related

With Delphi 6/7, how can I convert an AnsiString in a different CharSet, to hex String UTF-8?

I need to draw a barcode (QR) with Delphi 6/7. The program can run in various windows locales, and the data is from an input box.
On this input box, the user can choose a charset, and input his own language. This works fine. The input data is only ever from the same codepage. Example configurations could be:
Windows is on Western Europe, Codepage 1252 for ANSI text
Input is done in Shift-JIS ANSI charset
I need to get the Shift-JIS across to the barcode. The most robust way is to use hex encoding.
So my question is: how do I go from Shift-JIS to a hex String in UTF-8 encoding, if the codepage is not the same as the Windows locale?
As example: I have the string 能ラ. This needs to be converted to E883BDE383A9 as per UTF-8. I have tried this but the result is different and meaningless:
String2Hex(UTF8Encode(ftext))
Unfortunately I can't just have an inputbox for WideStrings. But if I can find a way to convert the ANSI text to a WideString, the barcode module can work with Unicode Strings as well.
If it's relevant: I am using the TEC-IT TBarcode DLL.
Creating and accessing a Unicode text control
This is easier than you may think. I did this back when Windows 2000 was brand new and convenient components like the Tnt Delphi Unicode Controls were not available. It helps to know how to create a Windows GUI program without Delphi's VCL, creating everything manually; if you don't, this also serves as an introduction to it.
First add a property to your form, so we can later access the new control easily:
type
TForm1= class(TForm)
...
private
hEdit: THandle; // Our new Unicode control
end;
Now just create it at your favorite event - I chose FormCreate:
// Creating a child control, type "edit"
self.hEdit:= CreateWindowW( PWideChar(WideString('edit')), PWideChar(WideString('myinput')), WS_CHILD or WS_VISIBLE, 10, 10, 200, 25, Handle, 0, HINSTANCE, nil );
if self.hEdit= 0 then begin // Failed. Get error code so we know why it failed.
//GetLastError();
exit;
end;
// Add a sunken 3D edge (well, historically speaking)
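SetLastError( 0 ); // SetWindowLong returns the previous value, which may legitimately be 0
if (SetWindowLong( self.hEdit, GWL_EXSTYLE, WS_EX_CLIENTEDGE )= 0) and (GetLastError()<> 0) then begin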
//GetLastError();
exit;
end;
// Applying new extended style: the control's frame has changed
if not SetWindowPos( self.hEdit, 0, 0, 0, 0, 0, SWP_FRAMECHANGED or SWP_NOMOVE or SWP_NOZORDER or SWP_NOSIZE ) then begin
//GetLastError();
exit;
end;
// The system's default font is no help, let's use this form's font (hopefully Tahoma)
SendMessage( self.hEdit, WM_SETFONT, self.Font.Handle, 1 );
At some point you want to get the edit's content. Again: how is this done without Delphi's VCL but instead directly with the WinAPI? This time I used a button's Click event:
var
sText: WideString;
iLen, iError: Integer;
begin
// How many CHARACTERS to copy?
iLen:= GetWindowTextLengthW( self.hEdit );
if iLen= 0 then iError:= GetLastError() else iError:= 0; // Could be empty, could be an error
if iError<> 0 then begin
exit;
end;
Inc( iLen ); // For a potential trailing #0
SetLength( sText, iLen ); // Reserve space
if GetWindowTextW( self.hEdit, @sText[1], iLen )= 0 then begin // Copy text
//GetLastError();
exit;
end;
// Demonstrate that non-ANSI text was copied out of a non-ANSI control
MessageBoxW( Handle, PWideChar(sText), nil, 0 );
end;
There are detail issues, like not being able to reach this new control via Tab, but we're already basically re-inventing Delphi's VCL, so those are details to take care about at other times.
Converting codepages
The WinAPI deals either in codepages (Strings) or in UTF-16 LE (WideStrings). For historical reasons (UCS-2 and later) UTF-16 LE fits everything, so this is always the implied target to achieve when coming from codepages:
// Converting an ANSI charset (String) to UTF-16 LE (Widestring)
function StringToWideString( s: AnsiString; iSrcCodePage: DWord ): WideString;
var
iLenDest, iLenSrc: Integer;
begin
iLenSrc:= Length( s );
iLenDest:= MultiByteToWideChar( iSrcCodePage, 0, PAnsiChar(s), iLenSrc, nil, 0 ); // How many CHARACTERS are needed?
SetLength( result, iLenDest );
if iLenDest> 0 then begin // Otherwise we get the error ERROR_INVALID_PARAMETER
if MultiByteToWideChar( iSrcCodePage, 0, PAnsiChar(s), iLenSrc, PWideChar(result), iLenDest )= 0 then begin
//GetLastError();
result:= '';
end;
end;
end;
The source codepage is up to you: maybe
1252 for "Windows-1252" = ANSI Latin 1 Multilingual (Western Europe)
932 for "Shift-JIS X-0208" = IBM-PC Japan MIX (DOS/V) (DBCS) (897 + 301)
28595 for "ISO 8859-5" = Cyrillic
65001 for "UTF-8"
However, if you want to convert from one codepage to another, and both source and target shall not be UTF-16 LE, then you must go forth and back:
Convert from ANSI to WIDE
Convert from WIDE to a different ANSI
// Converting UTF-16 LE (Widestring) to an ANSI charset (String, hopefully you want 65001=UTF-8)
function WideStringToString( s: WideString; iDestCodePage: DWord= CP_UTF8 ): AnsiString;
var
iLenDest, iLenSrc: Integer;
begin
iLenSrc:= Length( s );
iLenDest:= WideCharToMultiByte( iDestCodePage, 0, PWideChar(s), iLenSrc, nil, 0, nil, nil );
SetLength( result, iLenDest );
if iLenDest> 0 then begin // Otherwise we get the error ERROR_INVALID_PARAMETER
if WideCharToMultiByte( iDestCodePage, 0, PWideChar(s), iLenSrc, PAnsiChar(result), iLenDest, nil, nil )= 0 then begin
//GetLastError();
result:= '';
end;
end;
end;
Not every Windows installation supports every codepage, so conversion attempts may fail. It would be more robust to aim for a Unicode program right away, as that is what every Windows installation definitely supports (unless you still deal with Windows 95, Windows 98 or Windows ME).
Combining everything
Now you got everything you need to put it together:
you can have a Unicode text control to directly get it in UTF-16 LE
you can use an ANSI text control to then convert the input to UTF-16 LE
you can convert from UTF-16 LE (WIDE) to UTF-8 (ANSI)
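Put together, the path from an ANSI string in some codepage to a UTF-8 hex string might look like the sketch below. AnsiToUtf8Hex is a hypothetical helper name, not part of any library; it calls the two WinAPI functions directly so that it stands alone, and it assumes the source codepage is installed.

```delphi
program AnsiToUtf8HexDemo;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

// Hypothetical helper: ANSI (SrcCodePage) -> UTF-16 LE -> UTF-8 -> hex digits
function AnsiToUtf8Hex(const S: AnsiString; SrcCodePage: UINT): string;
var
  Wide: WideString;
  Utf8: AnsiString;
  N, I: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // Step 1: codepage -> UTF-16 LE (first call only asks for the length)
  N := MultiByteToWideChar(SrcCodePage, 0, PAnsiChar(S), Length(S), nil, 0);
  if N = 0 then
    RaiseLastOSError;
  SetLength(Wide, N);
  if MultiByteToWideChar(SrcCodePage, 0, PAnsiChar(S), Length(S), PWideChar(Wide), N) = 0 then
    RaiseLastOSError;
  // Step 2: UTF-16 LE -> UTF-8
  N := WideCharToMultiByte(CP_UTF8, 0, PWideChar(Wide), Length(Wide), nil, 0, nil, nil);
  if N = 0 then
    RaiseLastOSError;
  SetLength(Utf8, N);
  if WideCharToMultiByte(CP_UTF8, 0, PWideChar(Wide), Length(Wide), PAnsiChar(Utf8), N, nil, nil) = 0 then
    RaiseLastOSError;
  // Step 3: two hex digits per UTF-8 byte
  for I := 1 to Length(Utf8) do
    Result := Result + IntToHex(Ord(Utf8[I]), 2);
end;

begin
  // Plain ASCII letters pass through both conversions unchanged
  Writeln(AnsiToUtf8Hex('ABC', 932)); // prints 414243
end.
```

If the input holds the Shift-JIS bytes of the string from the question and 932 is passed as the source codepage, the result should be the E883BDE383A9 the question asks for.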
Size
UTF-8 is mostly the best choice, but size wise UTF-16 may need fewer bytes in total when your target audience is Asian: in UTF-8 both 能 and ラ need 3 bytes each, but in UTF-16 both only need 2 bytes each. Since the data ends up in a QR barcode, size is an important factor, I guess.
Likewise, don't waste space by turning binary data (8 bits per byte) into hex text (which carries 4 bits per character, while each character itself needs 1 byte = 8 bits again). Have a look at Base64, which encodes 6 bits into every byte. It's a concept that you have encountered countless times in your life already, because it's used for email attachments.

How pass word number to widestring?

First of all I am sorry that I cannot better to describe my problem.
What I have is Word number 65025 which is 0xFE01 or
11111110 00000001 in binary. And I want to pass the value to wstr Word => 11111110 00000001.
I found that using typecast does not work.
And one more question here. If I want to add another number like 10000 => 0x2710, how do I do it? So in the result the widestring should refer to values 0xFE01 0x2710.
And then, how to retrieve the same numbers from widestring to word back?
var wstr: Widestring;
wo: Word;
begin
wo := 65025;
wstr := Widestring(wo);
wo := 10000;
wstr := wstr + Widestring(wo);
end
Edit:
I'm giving another, simpler example of what I want... If I have the word value 49, which is the ASCII code of the character '1', then I want wstr to be '1', which is 00110001 in binary terms. I want to copy the bits from the word number to the string.
It looks like you want to interpret a word as a UTF-16 code unit. In Unicode Delphi you would use the Chr() function. But I suspect you use an ANSI Delphi. In which case cast to WideChar with WideChar(wo).
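As a minimal sketch of that cast, using the two values from the question: each Word becomes one WideChar element of the string, and Ord reads it back.

```delphi
program WordsInWideString;
{$APPTYPE CONSOLE}

var
  wstr: WideString;
  wo: Word;
begin
  wo := 65025;                  // $FE01
  wstr := WideChar(wo);         // string now holds one 16-bit element
  wo := 10000;                  // $2710
  wstr := wstr + WideChar(wo);  // append a second element

  // Reading the numbers back: Ord returns the 16-bit value of each element
  Writeln(Ord(wstr[1]));        // prints 65025
  Writeln(Ord(wstr[2]));        // prints 10000
end.
```

Note that values in the range $D800..$DFFF are surrogates in UTF-16, so a string built this way is only safe as a container of numbers, not necessarily as displayable text.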
You are casting a Word to a WideString. In Delphi, casting usually doesn't convert, so you are simply re-interpreting the value 65025 as a pointer (a WideString is a pointer). But 65025 is not a valid pointer value.
You will have to explicitly convert the Word to a WideString, e.g. with a function like this (untested, but should work):
function WordToBinary(W: Word): WideString;
var
I: Integer;
begin
Result := '0000000000000000';
for I := 0 to 15 do // process bits 0..15
begin
if Odd(W) then
Result[16 - I] := '1';
W := W shr 1;
end;
end;
Now you can do something like:
wo := 65025;
wstr := WordToBinary(wo);
wo := 10000;
wstr := wstr + ' ' + WordToBinary(wo);
For the reverse, you will have to write a function that converts from a WideString to a Word. I'll leave that exercise to you.
Again, you can't cast. You will have to explicitly convert. Both ways.
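For reference, the reverse direction could be sketched along these lines. BinaryToWord is a hypothetical name chosen to mirror WordToBinary above, and the sketch assumes the input consists of exactly 16 characters, each '0' or '1'.

```delphi
program BinaryToWordDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical counterpart to WordToBinary: parses 16 binary digits,
// most significant bit first, into a Word.
function BinaryToWord(const S: WideString): Word;
var
  I: Integer;
begin
  if Length(S) <> 16 then
    raise Exception.Create('Expected exactly 16 binary digits');
  Result := 0;
  for I := 1 to 16 do
  begin
    Result := Result shl 1;       // make room for the next bit
    if S[I] = '1' then
      Result := Result or 1
    else if S[I] <> '0' then
      raise Exception.Create('Only 0 and 1 are allowed');
  end;
end;

begin
  Writeln(BinaryToWord('1111111000000001')); // prints 65025
end.
```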

Get a hex substring & convert to binary Delphi XE2

My program reads from a device via a serial port and returns this string. 'IC'#$0088#$0080'Ô'#$0080#$0080
I need to get the 5 hex values and convert to binary. #$0088 = 10001000, #$0080 = 10000000, Ô = 11010100.
I can convert the 80 & 88, but am having difficulty extracting them from the whole string. The Ô (xD4) I can neither extract nor convert. An extended character like the Ô could be at any or all locations.
The read methods in my serial component are:
function Read(var Buffer; Count: Integer): Integer;
function ReadStr(var Str: string; Count: Integer): Integer;
function ReadAsync(var Buffer; Count: Integer; var AsyncPtr: PAsync): Integer;
function ReadStrAsync(var Str: Ansistring; Count: Integer; var AsyncPtr: PAsync): Integer;
Can you give me an example of reading binary?
It looks like the real problem is that you are treating binary data as though it were UTF-16 encoded text.
Whatever is feeding you this data, is not feeding you UTF-16 encoded text. What the device is really feeding you is a byte array. Treat it as such rather than as text. Then you can pick out the five values you are interested in by index.
So, declare an array of bytes:
var
Data: TArray<Byte>; // dynamic array
or
var
Data: TBytes; // shorthand for the same
or
var
Data: array [0..N-1] of Byte; // fixed length array
And then read into those arrays. To pick out values, use Data[i].
Note that I am using a significant amount of guesswork here, based on the question and your comments. Don't take my word for it. My guessing could be wrong. Consult the specification of the communication protocol for the device. And learn carefully the difference between text and binary.
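To illustrate the byte-array approach, here is a small sketch that formats bytes 2 to 6 of a buffer as binary. The constant array merely stands in for what the device sent, and ByteToBinary is a hypothetical helper name.

```delphi
program BytesToBinaryDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: formats one byte as eight '0'/'1' characters,
// most significant bit first.
function ByteToBinary(B: Byte): string;
var
  I: Integer;
begin
  Result := '00000000';
  for I := 0 to 7 do
    if (B shr I) and 1 = 1 then
      Result[8 - I] := '1';
end;

const
  // Stand-in for the raw reply: 'I', 'C', then the five data bytes
  Data: array[0..6] of Byte = ($49, $43, $88, $80, $D4, $80, $80);
var
  I: Integer;
begin
  for I := 2 to 6 do
    Writeln(ByteToBinary(Data[I]));
  // prints 10001000, 10000000, 11010100, 10000000, 10000000 (one per line)
end.
```

With the serial component from the question, the buffer would presumably be filled through the untyped overload, something like Read(Data[0], Length(Data)) on a TBytes variable, instead of ReadStr.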
As I wrote earlier in the comments, the problem with the message in your question is that it consists partly of non-ASCII characters. The ASCII range runs from $00 to $7F and has the same characters as Unicode U+0000 to U+007F, so no conversion takes place there (apart from the added leading zero byte). AnsiChars in the range $80 to $FF, on the other hand, are converted according to the code page in use, in order to keep the same glyph in both encodings. For example, AnsiChar $80 (the Euro sign in CP1252) is converted to Unicode U+20AC. The bit pattern of the lower byte no longer matches.
Ref: https://msdn.microsoft.com/en-us/library/cc195054.aspx
Following code shows the result of two tests, Using Char vs. AnsiChar
procedure TMainForm.Button2Click(Sender: TObject);
const
Buffer: array[0..7] of AnsiChar = ('I','C', #$88, #$80, #$D4, #$80, #$80, ';');
// Buffer: array[0..7] of Char = ('I','C', #$88, #$80, #$D4, #$80, #$80, ';');
BinChars: array[0..1] of Char = ('0','1');
var
i, k: integer;
c: AnsiChar;
// c: Char;
s: string;
begin
for k := 2 to 6 do
begin
c := Buffer[k];
SetLength(s, 8);
for i := 0 to 7 do
s[8-i] := BinChars[(ord(c) shr i) and 1];
Memo1.Lines.Add(format('Character %d in binary format: %s',[k, s]));
end;
end;
Using Char (UTF-16 WideChar)
AnsiChar #$88 is converted to U+02C6
AnsiChar #$80 is converted to U+20AC
AnsiChar #$D4 is converted to U+00D4 !
Lower byte gives
Character 2 in binary format: 11000110
Character 3 in binary format: 10101100
Character 4 in binary format: 11010100
Character 5 in binary format: 10101100
Character 6 in binary format: 10101100
Using AnsiChar
Character 2 in binary format: 10001000
Character 3 in binary format: 10000000
Character 4 in binary format: 11010100
Character 5 in binary format: 10000000
Character 6 in binary format: 10000000
Unfortunately a conversion from Unicode to Ansi (even if originally converted from Ansi to Unicode) is lossy and will fail.
I really don't see any easy solution with the information available.

How manipulate substrings, and not subarrays, of UnicodeString?

I am testing migration from Delphi 5 to XE. Being unfamiliar with UnicodeString, before asking my question I would like to present its background.
Delphi XE string-oriented functions: Copy, Delete and Insert have a parameter Index telling where the operation should start. Index may have any integer value starting from 1 and finishing at the length of the string to which the function is applied.
Since the string can have multi-element characters, function operation can start at an element (surrogate) belonging to a multi-element series encoding a single unicode named code-point.
Then, having a sensible string and using one of the functions, we can obtain non sensible result.
The phenomenon can be illustrated with the below cases using the function Copy with respect to strings representing the same array of named codepoints (i.e. meaningful signs)
($61, $13000, $63)
It is the concatenation of 'a', EGYPTIAN_HIEROGLYPH_A001 and 'c'.
Case 1. Copy of AnsiString (element = byte)
We start with the above mentioned UnicodeString #$61#$13000#$63 and we convert it to UTF-8 encoded AnsiString s0.
Then we test the function
copy (s0, index, 1)
for all possible values of index; there are 6 of them since s0 is 6 bytes long.
procedure Copy_Utf8Test;
type TAnsiStringUtf8 = type AnsiString (CP_UTF8);
var ss : string;
s0,s1 : TAnsiStringUtf8;
ii : integer;
begin
ss := #$61#$13000#$63; //mem dump of ss: $61 $00 $0C $D8 $00 $DC $63 $00
s0 := ss; //mem dump of s0: $61 $F0 $93 $80 $80 $63
ii := length(s0); //sets ii=6 (bytes)
s1 := copy(s0,1,1); //'a'
s1 := copy(s0,2,1); //#$F0 F means "start of 4-byte series"; no corresponding named code-point
s1 := copy(s0,3,1); //#$93 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,4,1); //#$80 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,5,1); //#$80 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,6,1); //'c'
end;
The first and last results are sensible within UTF-8 codepage, while the other 4 are not.
Case 2. Copy of UnicodeString (element = word)
We start with the same UnicodeString s0 := #$61#$13000#$63.
Then we test the function
copy (s0, index, 1)
for all possible values of index; there are 4 of them since s0 is 4 words long.
procedure Copy_Utf16Test;
var s0,s1 : string;
ii : integer;
begin
s0 := #$61#$13000#$63; //mem dump of s0: $61 $00 $0C $D8 $00 $DC $63 $00
ii := length(s0); //sets ii=4 (words)
s1 := copy(s0,1,1); //'a'
s1 := copy(s0,2,1); //#$D80C surrogate pair member; no corresponding named code-point
s1 := copy(s0,3,1); //#$DC00 surrogate pair member; no corresponding named code-point
s1 := copy(s0,4,1); //'c'
end;
The first and last results are sensible within codepage CP_UNICODE (1200), while the other 2 are not.
Conclusion.
The string-oriented functions: Copy, Delete and Insert perfectly operate on string considered as a mere array of bytes or words. But they are not helpful if string is seen as that what it essentially is, i.e. representation of array of named code-points.
Both above two cases deal with strings which represent the same array of 3 named code-points. They are considered as representations (encodings) of the same text composed of 3 meaningful signs (to avoid abuse of the term "characters").
One may want to be able to extract (copy) any of those meaningful signs regardless whether a particular text representation (encoding) is mono- or multi-element one.
I've spent quite a time looking around for a satisfactory equivalent of Copy that I used to in Delphi 5.
Question.
Do such equivalents exist, or do I have to write them myself?
What you have described is how Copy(), Delete(), and Insert() have ALWAYS worked, even for AnsiString. The functions operate on elements (ie codeunits in Unicode terminology), and always have.
AnsiString is a string of 8bit AnsiChar elements, which can be encoded in any 8bit ANSI/MBCS format, including UTF-8.
UnicodeString (and WideString) is a string of 16bit WideChar elements, which are encoded in UTF-16.
The functions HAVE NEVER taken encoding into account. Not for MBCS AnsiString. Not for UTF-16 UnicodeString. Indexes are absolute element indexes from the beginning of the string.
If you need encoding-aware Copy/Delete/Insert functions that operate on logical codepoint boundaries, where each codepoint may be 1+ elements in the string, then you have to write your own functions, or find third-party functions that do what you need. There are no MBCS/UTF-aware mutator functions in the RTL.
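A codepoint-aware Copy for UTF-16 strings could be sketched as below. CodePointCopy is a hypothetical name; Index and Count are counted in codepoints rather than elements, Index is assumed to be 1 or greater, and unpaired surrogates are simply treated as one codepoint each. Combining sequences are not considered.

```delphi
program CodePointCopyDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: like Copy, but Index and Count are in codepoints.
function CodePointCopy(const S: UnicodeString; Index, Count: Integer): UnicodeString;
var
  I, CP, StartPos: Integer;
begin
  Result := '';
  I := 1;        // current element index
  CP := 1;       // current codepoint index
  StartPos := 0; // element index where the requested range begins
  while (I <= Length(S)) and (CP < Index + Count) do
  begin
    if CP = Index then
      StartPos := I;
    // A high surrogate followed by a low surrogate forms one codepoint
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(CP);
  end;
  if StartPos > 0 then
    Result := Copy(S, StartPos, I - StartPos);
end;

var
  S: UnicodeString;
begin
  S := 'a'#$D80C#$DC00'c';                 // 'a', U+13000, 'c'
  Writeln(Length(CodePointCopy(S, 2, 1))); // prints 2: the whole surrogate pair
  Writeln(CodePointCopy(S, 3, 1));         // prints c
end.
```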
You should parse the Unicode string yourself. Fortunately, the Unicode encodings are designed to make parsing easy. Here is an example of how to parse a UTF-8 string:
program Project9;
{$APPTYPE CONSOLE}
uses
SysUtils;
function GetFirstCodepointSize(const S: UTF8String): Integer;
var
B: Byte;
begin
B:= Byte(S[1]);
if (B and $80 = 0 ) then
Result:= 1
else if (B and $E0 = $C0) then
Result:= 2
else if (B and $F0 = $E0) then
Result:= 3
else if (B and $F8 = $F0) then
Result:= 4
else
Result:= -1; // invalid code
end;
var
S: string;
begin
S:= #$61#$13000#$63;
Writeln(GetFirstCodepointSize(S));
S:= #$13000#$63;
Writeln(GetFirstCodepointSize(S));
S:= #$63;
Writeln(GetFirstCodepointSize(S));
Readln;
end.

CharInSet accepting Unicode NULL character

I'm reading some data from memory, and this area of memory is in Unicode. So to make one ansi string I need something like this:
while CharInSet(Chr(Ord(Buff[aux])), ['0'..'9', #0]) do
begin
Target:= Target + Chr(Ord(Buff[aux]));
inc(aux);
end;
Where Buff is array of Bytes and Target is string. I just want keep getting Buff and adding in Target while it's 0..9, but when it finds NULL memory char (00), it just stops. How can I keep adding data in Target until first letter or non-numeric character?? The #0 has no effect.
I would not even bother with CharInSet() since you are dealing with bytes and not characters:
var
b: Byte;
while aux < Length(Buff) do
begin
b := Buff[aux];
if ((b >= Ord('0')) and (b <= Ord('9'))) or (b = 0) then
begin
Target := Target + Char(Buff[aux]);
Inc(aux);
end else
Break;
end;
If your data is Unicode, then I am assuming that the encoding is UTF-16. In which case you cannot process it byte by byte. A character unit is 2 bytes wide. Put the data into a Delphi string first, and then parse it:
var
str: string;
....
SetString(str, PChar(Buff), Length(Buff) div SizeOf(Char));
Do it this way and your loop can look like this:
for i := 1 to Length(str) do
if not CharInSet(str[i], ['0'..'9']) then
begin
SetLength(str, i-1);
break;
end;
I believe that your confusion was caused by processing byte by byte. With UTF-16 encoded text, ASCII characters are encoded as a pair of bytes, the most significant of which is zero. I suspect that explains what you were trying to achieve with your CharInSet call.
If you want to cater for other digit characters then you can use the Character unit and test with TCharacter.IsDigit().
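A minimal sketch of that variant, assuming a Delphi version whose Character unit provides TCharacter (it is deprecated in later releases). LeadingDigits is a hypothetical helper name.

```delphi
program LeadingDigitsDemo;
{$APPTYPE CONSOLE}
uses
  Character;

// Hypothetical helper: returns the leading run of digit characters in S.
function LeadingDigits(const S: string): string;
var
  I: Integer;
begin
  I := 1;
  while (I <= Length(S)) and TCharacter.IsDigit(S[I]) do
    Inc(I);
  Result := Copy(S, 1, I - 1);
end;

begin
  Writeln(LeadingDigits('12345abc')); // prints 12345
end.
```

Unlike the ['0'..'9'] set test, IsDigit also accepts decimal digits from other scripts, which may or may not be what you want for this data.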
