Convert char pos of UnicodeString to byte pos in a utf8 string

I use Scintilla and set its encoding to UTF-8 (as far as I understand, this is the only way to make it compatible with Unicode characters). With this setup, every text position Scintilla reports or accepts is a byte position.
The problem is that I use UnicodeString in the rest of my program, so when I need to select a particular range in the Scintilla editor, I have to convert a char position in the UnicodeString to a byte position in the corresponding UTF-8 string. How can I do that easily? Thanks.
PS: when I found ByteToCharIndex I thought it was what I needed. However, according to its documentation and my own testing, it only works if the system uses a multi-byte character set (MBCS).

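You have to parse the UTF-8 string yourself, following the UTF-8 encoding rules: a byte whose top two bits are 10 is a continuation byte, and every other byte starts a new character. I have written a quick UTF-8 analog of ByteToCharIndex and tested it on a Cyrillic string: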
function UTF8PosToCharIndex(const S: UTF8String; Index: Integer): Integer;
var
I: Integer;
P: PAnsiChar;
begin
Result:= 0;
if (Index <= 0) or (Index > Length(S)) then Exit;
I:= 1;
P:= PAnsiChar(S);
while I <= Index do begin
if Ord(P^) and $C0 <> $80 then Inc(Result);
Inc(I);
Inc(P);
end;
end;
const TestStr: UTF8String = 'abФЫВА';
procedure TForm1.Button2Click(Sender: TObject);
begin
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 1))); // a = 1
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 2))); // b = 2
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 3))); // Ф = 3
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 5))); // Ы = 4
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 7))); // В = 5
end;
The reverse function is no problem too:
function CharIndexToUTF8Pos(const S: UTF8String; Index: Integer): Integer;
var
P: PAnsiChar;
begin
Result:= 0;
P:= PAnsiChar(S);
while (Result < Length(S)) and (Index > 0) do begin
Inc(Result);
if Ord(P^) and $C0 <> $80 then Dec(Index);
Inc(P);
end;
if Index <> 0 then Result:= 0; // char index not found
end;

With great respect, I wrote a function based on Serg's code and am posting it here as a separate answer in the hope that it helps others too. Serg's answer remains the accepted one.
{Return the index (1-based) of the first byte of the character (unicode point)
specified by aCharIdx (1-based) in aUtf8Str.
Code is amended by Edwin Yip based on code written by SO member Serg (https://stackoverflow.com/users/246408/serg)
ref 1: https://stackoverflow.com/a/10388131/133516
ref 2: http://sergworks.wordpress.com/2012/05/01/parsing-utf8-strings/
}
function CharPosToUTF8BytePos(const aUtf8Str: UTF8String; const aCharIdx:
Integer): Integer;
var
p: PAnsiChar;
charCount: Integer;
begin
p:= PAnsiChar(aUtf8Str);
Result:= 0;
charCount:= 0;
while (Result < Length(aUtf8Str)) do
begin
if IsUTF8LeadChar(p^) then
Inc(charCount);
if charCount = aCharIdx then
Exit(Result + 1);
Inc(p);
Inc(Result);
end;
end;

Both UTF-8 and UTF-16 (which is what UnicodeString uses) are variable-length encodings. A given Unicode codepoint is encoded in UTF-8 using between 1 and 4 single-byte codeunits, and in UTF-16 using either 1 or 2 two-byte codeunits, depending on the codepoint's numeric value. The only way to translate a position in a UTF-16 string into a position in an equivalent UTF-8 string is to decode the UTF-16 codeunits preceding the position back to their original Unicode codepoint values and then re-encode them to UTF-8 codeunits.
It sounds like you would be better off rewriting the code that interacts with Scintilla to use UTF8String instead of UnicodeString, so that you no longer have to translate between UTF-8 and UTF-16 at that layer. When interacting with the rest of your code, you can convert between UTF8String and UnicodeString as needed.
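If the conversion is still needed, the decode-and-re-encode step described above can be reduced to adding up how many UTF-8 bytes each preceding codepoint would occupy. The following is a minimal sketch, assuming Delphi 2009 or later. Utf16IndexToUtf8BytePos is a hypothetical name, not an RTL function, and the sketch assumes that an unpaired surrogate takes 3 bytes, which may not match what a particular converter produces.

```pascal
// Hypothetical helper, not part of the RTL. Maps a 1-based UTF-16 code unit
// index in S to the 1-based index of the matching byte in the UTF-8 form of S.
// CharIndex = Length(S) + 1 is allowed (the position just past the end).
// Returns 0 if CharIndex is out of range. An index that points at the low
// half of a surrogate pair yields the position just after that pair.
function Utf16IndexToUtf8BytePos(const S: UnicodeString; CharIndex: Integer): Integer;
var
  I: Integer;
  W: Word;
begin
  Result := 0;
  if (CharIndex < 1) or (CharIndex > Length(S) + 1) then Exit;
  Result := 1;
  I := 1;
  while I < CharIndex do
  begin
    W := Ord(S[I]);
    if W < $80 then
      Inc(Result)                 // U+0000..U+007F: 1 byte
    else if W < $800 then
      Inc(Result, 2)              // U+0080..U+07FF: 2 bytes
    else if (W >= $D800) and (W <= $DBFF) and (I < Length(S)) and
            (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
    begin
      Inc(Result, 4);             // surrogate pair: one codepoint, 4 bytes
      Inc(I);                     // skip the low surrogate
    end
    else
      Inc(Result, 3);             // rest of the BMP (assumption: also unpaired surrogates)
    Inc(I);
  end;
end;
```

Scintilla positions are zero-based, so the caller would typically subtract 1 from the result before passing it on.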

Related

How to convert widestring to string of unicode bytes?

When I create a file in Notepad containing, for example, the string 1d and save it as a Unicode file, I get a 6-byte file containing the bytes #255#254#49#0#100#0.
OK. Now I need a Delphi 6 function that takes, for example, the WideString 1d as input and returns a string containing #255#254#49#0#100#0 (and vice versa).
How?
Thanks.
D

It is easier to read bytes if you use hex. #255#254#49#0#100#0 is represented in hex as
FF FE 31 00 64 00
Where
FF FE is the UTF-16LE BOM, which identifies the following bytes as being encoded as UTF-16 using values in Little Endian.
31 00 is the ASCII character '1'
64 00 is the ASCII character 'd'.
To create a WideString containing these bytes is very easy:
var
W: WideString;
S: String;
begin
S := '1d';
W := WideChar($FEFF) + S;
end;
When an AnsiString (which is Delphi 6's default string type) is assigned to a WideString, the RTL automatically converts the AnsiString data from 8-bit to UTF-16LE using the local machine's default Ansi charset for the conversion.
Going the other way is just as easy:
var
W: WideString;
S: String;
begin
W := WideChar($FEFF) + '1d';
S := Copy(W, 2, MaxInt);
end;
When you assign a WideString to an AnsiString, the RTL automatically converts the WideString data from UTF-16LE to 8-bit using the default Ansi charset.
If the default Ansi charset is not suitable for your needs (say the 8-bit data needs to be encoded in a different charset), you will have to use the Win32 API MultiByteToWideChar() and WideCharToMultiByte() functions directly (or 3rd party library with equivalent functionality) so you can specify the desired charset/codepage as needed.
Now then, Delphi 6 does not offer any useful helpers to read Unicode files (Delphi 2009 and later do), so you will have to do it yourself manually, for example:
function ReadUnicodeFile(const FileName: string): WideString;
const
cBOM_UTF8: array[0..2] of Byte = ($EF, $BB, $BF);
cBOM_UTF16BE: array[0..1] of Byte = ($FE, $FF);
cBOM_UTF16LE: array[0..1] of Byte = ($FF, $FE);
cBOM_UTF32BE: array[0..3] of Byte = ($00, $00, $FE, $FF);
cBOM_UTF32LE: array[0..3] of Byte = ($FF, $FE, $00, $00);
var
FS: TFileStream;
BOM: array[0..3] of Byte;
NumRead: Integer;
U8: UTF8String;
U32: UCS4String;
I: Integer;
begin
Result := '';
FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
try
NumRead := FS.Read(BOM, 4);
// UTF-8
if (NumRead >= 3) and CompareMem(@BOM, @cBOM_UTF8, 3) then
begin
if NumRead > 3 then
FS.Seek(-(NumRead-3), soCurrent);
SetLength(U8, FS.Size - FS.Position);
if Length(U8) > 0 then
begin
FS.ReadBuffer(PAnsiChar(U8)^, Length(U8));
Result := UTF8Decode(U8);
end;
end
// the UTF-16LE and UTF-32LE BOMs are ambiguous! Check for UTF-32 first...
// UTF-32
else if (NumRead = 4) and (CompareMem(@BOM, @cBOM_UTF32LE, 4) or CompareMem(@BOM, @cBOM_UTF32BE, 4)) then
begin
// UCS4String is not a true string type, it is a dynamic array, so
// it must include room for a null terminator...
SetLength(U32, ((FS.Size - FS.Position) div SizeOf(UCS4Char)) + 1);
if Length(U32) > 1 then
begin
FS.ReadBuffer(PUCS4Chars(U32)^, (Length(U32) - 1) * SizeOf(UCS4Char));
if CompareMem(@BOM, @cBOM_UTF32BE, 4) then
begin
for I := Low(U32) to High(U32) do
begin
U32[I] := ((U32[I] and $000000FF) shl 24) or
((U32[I] and $0000FF00) shl 8) or
((U32[I] and $00FF0000) shr 8) or
((U32[I] and $FF000000) shr 24);
end;
end;
U32[High(U32)] := 0;
// Note: UCS4StringToWidestring() does not actually support UTF-16,
// only UCS-2! If you need to handle UTF-16 surrogates, you will
// have to convert from UTF-32 to UTF-16 manually, there is no RTL
// or Win32 function that will do it for you...
Result := UCS4StringToWidestring(U32);
end;
end
// UTF-16
else if (NumRead >= 2) and (CompareMem(@BOM, @cBOM_UTF16LE, 2) or CompareMem(@BOM, @cBOM_UTF16BE, 2)) then
begin
if NumRead > 2 then
FS.Seek(-(NumRead-2), soCurrent);
SetLength(Result, (FS.Size - FS.Position) div SizeOf(WideChar));
if Length(Result) > 0 then
begin
FS.ReadBuffer(PWideChar(Result)^, Length(Result) * SizeOf(WideChar));
if CompareMem(@BOM, @cBOM_UTF16BE, 2) then
begin
for I := 1 to Length(Result) do
begin
Result[I] := WideChar(
((Word(Result[I]) and $00FF) shl 8) or
((Word(Result[I]) and $FF00) shr 8)
);
end;
end;
end;
end
// something else, assuming UTF-8
else
begin
if NumRead > 0 then
FS.Seek(-NumRead, soCurrent);
SetLength(U8, FS.Size - FS.Position);
if Length(U8) > 0 then
begin
FS.ReadBuffer(PAnsiChar(U8)^, Length(U8));
Result := UTF8Decode(U8);
end;
end;
finally
FS.Free;
end;
end;
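The comment in the function above says that the UTF-32 to UTF-16 conversion has to be done manually when surrogates matter. A minimal sketch of such a conversion follows. UCS4ToUTF16 is a hypothetical helper name, and replacing invalid codepoints with U+FFFD is an assumption of this sketch rather than something the answer prescribes.

```pascal
// Hypothetical helper: converts UTF-32 codepoints to UTF-16, emitting
// surrogate pairs for codepoints above U+FFFF. Invalid values (above
// U+10FFFF, or inside the surrogate range) become U+FFFD.
function UCS4ToUTF16(const Src: array of UCS4Char): WideString;
var
  I, N: Integer;
  CP: Cardinal;
begin
  // First pass: count the UTF-16 code units that will be written.
  N := 0;
  for I := Low(Src) to High(Src) do
    if (Src[I] > $FFFF) and (Src[I] <= $10FFFF) then
      Inc(N, 2)
    else
      Inc(N);
  SetLength(Result, N);
  // Second pass: encode each codepoint.
  N := 0;
  for I := Low(Src) to High(Src) do
  begin
    CP := Src[I];
    if (CP > $10FFFF) or ((CP >= $D800) and (CP <= $DFFF)) then
      CP := $FFFD;
    if CP > $FFFF then
    begin
      Dec(CP, $10000);
      Inc(N);
      Result[N] := WideChar($D800 + (CP shr 10));    // high surrogate
      Inc(N);
      Result[N] := WideChar($DC00 + (CP and $3FF));  // low surrogate
    end
    else
    begin
      Inc(N);
      Result[N] := WideChar(CP);
    end;
  end;
end;
```

A trailing null terminator in the input, such as the one a UCS4String carries, is converted like any other codepoint, so the caller would need to trim it from the result.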
Update: if you want to store UTF-16LE encoded bytes inside of an AnsiString variable (why?), then you can Move() the raw bytes of a WideString's character data into the memory block of an AnsiString: eg:
function WideStringAsAnsi(const AValue: WideString): AnsiString;
begin
SetLength(Result, Length(AValue) * SizeOf(WideChar));
Move(PWideChar(AValue)^, PAnsiChar(Result)^, Length(Result));
end;
var
W: WideString;
S: AnsiString;
begin
W := WideChar($FEFF) + '1d';
S := WideStringAsAnsi(W);
end;
I would not suggest misusing AnsiString like this, though. If you need bytes, operate on bytes, eg:
type
TBytes = array of Byte;
function WideStringAsBytes(const AValue: WideString): TBytes;
begin
SetLength(Result, Length(AValue) * SizeOf(WideChar));
Move(PWideChar(AValue)^, PByte(Result)^, Length(Result));
end;
var
W: WideString;
B: TBytes;
begin
W := WideChar($FEFF) + '1d';
B := WideStringAsBytes(W);
end;

A WideString is already a string of Unicode bytes. Specifically, in UTF16-LE encoding.
The two extra bytes you see in the Unicode file saved by Notepad are called a BOM - Byte Order Mark. This is a special character in Unicode that is used to indicate the order of bytes in the data that follows, to ensure that the string is decoded correctly.
Adding a BOM to a string (which is what you are asking for) is simply a matter of pre-fixing the string with that special BOM character. The BOM character is U+FEFF (that is the Unicode notation for the hex representation of a 'character').
So, the function you need is very simple:
function WideStringWithBOM(aString: WideString): WideString;
const
BOM = WideChar($FEFF);
begin
result := BOM + aString;
end;
However, although the function is very simple, this possibly isn't the end of the matter.
The string that is returned from this function will include the BOM and as far as any Delphi code is concerned that BOM will be treated as part of the string.
Typically you would only add a BOM to a string when passing that string to some external recipient (via a file or web service response for example) if there is no other mechanism for indicating the encoding you have used.
Likewise, when reading strings from some received data which may be Unicode you should check the first two bytes:
If you find #255#254 ($FFFE) then you know that the bytes in the U+FEFF BOM have been switched (U+FFFE is not a valid Unicode character). i.e. the string that follows is UTF16-LE. Therefore, for a Delphi WideString you can discard those first two bytes and load the remaining bytes directly in to a suitable WideString variable.
If you find #254#255 then the bytes in the U+FEFF BOM have not been switched around. i.e. you know that the string that follows is UTF16-BE. In that case you again need to discard the first two bytes but when loading the remaining bytes into the WideString you must switch each pair of bytes around to convert from the UTF16-BE bytes to the UTF16-LE encoding of a WideString.
If the first 2 bytes are neither #255#254 nor #254#255 then you are either dealing with UTF16-LE without a BOM or possibly some other encoding entirely.
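The checks in the list above could be sketched as follows. TBomKind, DetectUTF16BOM and UTF16BytesToWideString are hypothetical names introduced for illustration, and treating data without a BOM as UTF-16LE is an assumption of this sketch.

```pascal
type
  TBomKind = (bkNone, bkUTF16LE, bkUTF16BE);

// Hypothetical helper: inspects the first two bytes of a buffer.
function DetectUTF16BOM(const Data: array of Byte): TBomKind;
begin
  Result := bkNone;
  if Length(Data) < 2 then Exit;
  if (Data[0] = $FF) and (Data[1] = $FE) then
    Result := bkUTF16LE
  else if (Data[0] = $FE) and (Data[1] = $FF) then
    Result := bkUTF16BE;
end;

// Hypothetical helper: drops the BOM (if any) and builds a WideString,
// swapping each byte pair when the data is big-endian.
// Assumption: data without a BOM is treated as UTF-16LE.
function UTF16BytesToWideString(const Data: array of Byte): WideString;
var
  Kind: TBomKind;
  I, Start, N: Integer;
begin
  Kind := DetectUTF16BOM(Data);
  if Kind = bkNone then Start := 0 else Start := 2;
  N := (Length(Data) - Start) div 2;
  SetLength(Result, N);
  for I := 0 to N - 1 do
    if Kind = bkUTF16BE then
      Result[I + 1] := WideChar((Data[Start + 2 * I] shl 8) or Data[Start + 2 * I + 1])
    else
      Result[I + 1] := WideChar(Data[Start + 2 * I] or (Data[Start + 2 * I + 1] shl 8));
end;
```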
Good luck. :)

Searching for Unicode chars from a raw byte array - Free Pascal\Lazarus or Delphi

I don't want to bore people with the explanation of why and how, so I'll just jump right in.
I have an array of bytes containing raw byte data. The array is 1000 bytes long. I want to go through it and extract only the UTF-16 Unicode characters that might resemble a filename, but I don't know exactly where in those 1000 bytes the characters appear.
I have read the Lazarus Unicode page and another linked article, but I am still somewhat unsure about the syntactical approach to my problem. I understand that a Unicode char can be up to 4 bytes in size but is commonly two (a letter and a space).
I have used UTF8Encode(WideCharLenToString(@MyArray, SomeIntValue)) with success in other areas where I KNOW certain Unicode chars exist (see an earlier thread of mine, which is now solved). But now I need to "hunt" for them within the array, for a different reason, e.g. "Look at the first 16 bytes. Are they Unicode? If not, look at the next 16. Are they Unicode? If so, convert them to a string and display them".
Can anyone help me?

Without knowing the actual layout of the bytes, or the formatting of the filename (does it have a drive letter and path, does it use UNC paths, or is it just a file name by itself?), hunting for the boundaries of the filename string is going to be difficult.
If you can assume that the filename always begins with a drive letter and path, then you can loop through the array one byte at a time until you decode a six-byte UTF-16 sequence that consists of a character between 'a'-'z' or 'A'-'Z' followed by ':' and '\' characters. If you find that, keep decoding UTF-16 sequences until you encounter a decoded null character or a binary value that is not a valid UTF-16 sequence, eg:
var
Buffer: array[0..1000-1] of Byte;
I: Integer;
PCh: PWord;
Hi, Lo: Word;
Ch: Cardinal;
PStart: PWideChar;
Len: Integer;
FileName: WideString;
begin
...
I := 0;
while I <= (SizeOf(Buffer)-6) do
begin
PCh := PWord(@Buffer[I]);
if not (((PCh^ >= Ord('a')) and (PCh^ <= Ord('z'))) or ((PCh^ >= Ord('A')) and (PCh^ <= Ord('Z')))) then
begin
Inc(I);
Continue;
end;
Inc(PCh);
if PCh^ <> Ord(':') then
begin
Inc(I);
Continue;
end;
Inc(PCh);
if PCh^ <> Ord('\') then
begin
Inc(I);
Continue;
end;
PStart := PWideChar(@Buffer[I]);
// the drive letter, ':' and '\' have already been matched
Len := 3;
Inc(I, 6);
Inc(PCh);
while I <= (SizeOf(Buffer)-2) do
begin
if (PCh^ < $D800) or (PCh^ > $DFFF) then
begin
Ch := Cardinal(PCh^);
Inc(I, 2);
Inc(PCh); // advance to the next code unit
if Ch = 0 then Break;
Inc(Len);
end else
begin
if PCh^ > $DBFF then Break;
// make sure the low surrogate still fits inside the buffer
if (I+4) > SizeOf(Buffer) then Break;
Hi := PCh^;
Inc(PCh);
if (PCh^ < $DC00) or (PCh^ > $DFFF) then Break;
Lo := PCh^;
Inc(PCh); // advance past the low surrogate
Ch := ((Cardinal(Hi) - $D800) * $400) + (Cardinal(Lo) - $DC00) + $10000;
if Ch > $10FFFF then Break;
Inc(I, 4);
Inc(Len, 2);
end;
end;
SetString(FileName, PStart, Len);
if Len > 0 then
begin
... use FileName as needed...
end;
end;
...
end;

UTF-16 codepoints are either 2 bytes or 4 bytes long. It's not a letter and a space; in isolation, most 16-bit words are valid UTF-16 characters. (Codepoints with values between D800 and DBFF need to be followed by a value in the range DC00-DFFF to make one complete Unicode character.) If you're just looking for valid UTF-16, it's unlikely you'll make much headway. You'll need to look for specific patterns found in filenames, like .ext (which would be encoded in UTF-16 as either \00.\00e\00x\00t or .\00e\00x\00t\00, depending on whether it's big-endian or little-endian).
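A pattern search of that kind might look like the sketch below, which covers only the little-endian case and only ASCII patterns. FindUTF16LEText is a hypothetical helper name, not an RTL or FCL routine.

```pascal
// Hypothetical helper: returns the 0-based offset of the first occurrence of
// the ASCII text Pattern, encoded as UTF-16LE, inside Data, or -1 if absent.
// Every byte offset is tried, because the text is not necessarily aligned.
function FindUTF16LEText(const Data: array of Byte; const Pattern: AnsiString): Integer;
var
  I, J: Integer;
  Match: Boolean;
begin
  Result := -1;
  if Pattern = '' then Exit;
  for I := 0 to Length(Data) - 2 * Length(Pattern) do
  begin
    Match := True;
    for J := 1 to Length(Pattern) do
      // low byte holds the ASCII code, high byte must be zero
      if (Data[I + 2 * (J - 1)] <> Ord(Pattern[J])) or
         (Data[I + 2 * (J - 1) + 1] <> 0) then
      begin
        Match := False;
        Break;
      end;
    if Match then
    begin
      Result := I;
      Exit;
    end;
  end;
end;
```

Once an extension such as .ext is located, the code can walk backwards and forwards from that offset to find the boundaries of the candidate filename.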

Hex view of a file

I am using Delphi 2009.
I want to view the contents of a file (in hexadecimal) inside a memo.
I'm using this code :
var
Buffer:String;
begin
Buffer := '';
AssignFile(sF,Source); //Assign file
Reset(sF);
repeat
Readln(sF,Buffer); //Load every line to a string.
TempChar:=StrToHex(Buffer); //Convert to Hex using the function
...
until EOF(sF);
end;
function StrToHex(AStr: string): string;
var
I ,Len: Integer;
s: chr (0)..255;
//s:byte;
//s: char;
begin
len:=length(AStr);
Result:='';
for i:=1 to len do
begin
s:=AStr[i];
//The problem is here. Ord(s) is giving false values (251 instead of 255)
//And in general the output differs from a professional hex editor.
Result:=Result +' '+IntToHex(Ord(s),2)+'('+IntToStr(Ord(s))+')';
end;
Delete(Result,1,1);
end;
When I declare variable "s" as char (I know that char goes up to 255) I get hex values up to 65535!
When I declare variable "s" as byte or chr (0)..255, it outputs different hex values compared to any hexadecimal editor!
Why is that? How can I see the correct values?
Check images for the differences.
1st image: Professional Hex Editor.
2nd image: Function output to Memo.
Thank you.

Your Delphi 2009 is unicode-enabled, so Char is actually WideChar and that's a 2 byte, 16 bit unsigned value, that can have values from 0 to 65535.
You could change all your Char declarations to AnsiChar and all your String declarations to AnsiString, but that's not the way to do it. You should drop Pascal I/O in favor of modern stream-based I/O, use a TFileStream, and don't treat binary data as Char.
Console demo:
program Project26;
{$APPTYPE CONSOLE}
uses SysUtils, Classes;
var F: TFileStream;
Buff: array[0..15] of Byte;
CountRead: Integer;
HexText: array[0..31] of Char;
begin
F := TFileStream.Create('C:\Temp\test', fmOpenRead or fmShareDenyWrite);
try
CountRead := F.Read(Buff, SizeOf(Buff));
while CountRead <> 0 do
begin
BinToHex(Buff, HexText, CountRead);
WriteLn(HexText); // You could add this to the Memo
CountRead := F.Read(Buff, SizeOf(Buff));
end;
finally F.Free;
end;
end.

In Delphi 2009, a Char is the same thing as a WideChar, that is, a Unicode character. A wide character occupies two bytes. You want to use AnsiChar. Prior to Delphi 2009 (that is, prior to Unicode Delphi), Char was the same thing as AnsiChar.
Also, you shouldn't use ReadLn. You are treating the file as a text file with text-file line endings! This is a general file! It might not have any text-file line endings at all!
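A minimal sketch of the byte-oriented approach, assuming the file contents have already been read into bytes (for example with a TFileStream): each Byte is formatted directly, so no Char or string conversion can alter the values. BytesToHexLine is a hypothetical helper name.

```pascal
uses SysUtils;

// Hypothetical helper: formats raw bytes as space-separated two-digit hex.
// Working on Byte values sidesteps the Char/WideChar issue entirely.
function BytesToHexLine(const Data: array of Byte): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(Data) do
  begin
    if I > 0 then
      Result := Result + ' ';
    Result := Result + IntToHex(Data[I], 2);
  end;
end;
```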

For an easier to read output, and looking better too, you might want to use this simple hex dump formatter.
The HexDump procedure dumps an area of memory into a TStrings in lines of two chunks of 8 bytes in hex, and 16 ascii chars
example
406563686F206F66 660D0A6966206578 @echo off..if ex
69737420257E7331 5C6E756C20280D0A ist %~s1\nul (..
0D0A290D0A                        ..)..
Here is the code for the dump format function
function HexB (b: Byte): String;
const HexChar: Array[0..15] of Char = '0123456789ABCDEF';
begin
result:= HexChar[b shr 4]+HexChar[b and $0f];
end;
procedure HexDump(var data; size: Integer; s: TStrings);
const
sepHex=' ';
sepAsc=' ';
nonAsc='.';
var
i : Integer;
hexDat, ascDat : String;
buff : Array[0..1] of Byte Absolute data;
begin
hexDat:='';
ascDat:='';
for i:=0 to size-1 do
begin
hexDat:=hexDat+HexB(buff[i]);
if ((buff[i]>31) and (buff[i]<>255)) then
ascDat:=ascDat+Char(buff[i])
else
ascDat:=ascDat+nonAsc;
if (((i+1) mod 16)<>0) and (((i+1) mod 8)=0) then
hexDat:=hexDat+sepHex;
if ((i+1) mod 16)=0 then
begin
s.Add(hexdat+sepAsc+ascdat);
hexdat:='';
ascdat:='';
end;
end;
if (size mod 16)<>0 then
begin
if (size mod 16)<8 then
hexDat:=hexDat+StringOfChar(' ',(8-(size mod 8))*2)
+sepHex+StringOfChar(' ',16)
else
hexDat:=hexDat+StringOfChar(' ',(16-(size mod 16))*2);
s.Add(hexDat + sepAsc + ascDat);
end;
end;
And here is a complete code example for dumping the contents of a file into a Memo field.
procedure TForm1.Button1Click(Sender: TObject);
var
FStream: TFileStream;
buff: array[0..$fff] of Byte;
nRead: Integer;
begin
FStream := TFileStream.Create(edit1.text, fmOpenRead or fmShareDenyWrite);
try
repeat
nRead := FStream.Read(Buff, SizeOf(Buff));
if nRead<>0 then
hexdump(buff,nRead,memo1.lines);
until nRead=0;
finally
FStream.Free;
end;
end;

string is UnicodeString in Delphi 2009. If you want to use single-byte strings use AnsiString or RawByteString.
See String types.

Unicode string and TStringStream

Delphi 2009 and above use Unicode strings as their default string type. To my understanding, a Unicode char is actually a 16-bit value, i.e. 2 bytes (note: I understand a char can take 3 or 4 bytes, but let's consider the most usual case). However, I found that TStringStream is not very reliable for manipulating these strings. For example, the TStringStream.Size property returns the length of the string, while I think it should return the byte count of the contained string. Okay, you can adjust for that on your own, but the thing that really confused me is that TStringStream does not read from or write to a buffer reliably.
Please check the following code (it's a DUnit test and it always fails). Please let me know where the problem is (I was using D2010 when testing the code).
procedure TestTCPackage.TestStringStream;
const
cCount = 10;
cOrdMaxChar = Ord(High(Char));
var
B: Pointer;
SW, SR: TStringStream;
T: string;
i, j, k : Integer;
vStrings: array [0..cCount-1] of string;
begin
RandSeed := GetTickCount;
for i := 0 to cCount - 1 do
begin
j := Random(100) + 1;
SetLength(vStrings[i], j);
for k := 1 to j do
// fill string with random char (but no #0)
vStrings[i][k] := Char(Random(cOrdMaxChar-1) + 1);
end;
for i := 0 to cCount - 1 do
begin
SW := TStringStream.Create(vStrings[i]);
try
GetMem(B, SW.Size * SizeOf(Char));
try
SW.Read(B^, SW.Size * SizeOf(Char));
SR := TStringStream.Create;
try
SR.Write(B^, SW.Size * SizeOf(Char));
SR.Position := 0;
// check the string in the TStringStream with original value
Check(SR.DataString = vStrings[i]);
finally
SR.Free;
end;
finally
FreeMem(B);
end;
finally
SW.Free;
end;
end;
end;
Note: I already tried to use an instance of TMemoryStream as intermediary from reading/writing the buffer and use CopyFrom of the TStringStream to read the content of that TMemoryStream with same failing effect.

Unicode strings aren't for data storage; use TBytes for that. TStringStream uses its associated encoding (the Encoding property) for encoding strings passed in with WriteString, and decoding strings read out with ReadString or the DataString property.
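A minimal sketch of the TBytes approach, assuming Delphi 2009 or later: the string is turned into bytes with an explicitly chosen encoding and decoded with the same one, so the byte count and the round trip are predictable. The two function names are hypothetical wrappers around TEncoding.Unicode (UTF-16LE).

```pascal
uses SysUtils;

// Hypothetical wrapper: encodes S as UTF-16LE bytes (2 bytes per code unit).
function StringToUTF16Bytes(const S: string): TBytes;
begin
  Result := TEncoding.Unicode.GetBytes(S);
end;

// Hypothetical wrapper: decodes UTF-16LE bytes back into a string.
function UTF16BytesToString(const B: TBytes): string;
begin
  Result := TEncoding.Unicode.GetString(B);
end;
```

The resulting TBytes can be written to any TStream with WriteBuffer, independently of the stream's own notion of encoding.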

After reading a related post (thanks to Serg, who provided the answer to that question) and Barry Kelly's answer, I have found the problem. TStringStream actually uses the ANSI (AnsiString) encoding by default. So even if your default string type is Unicode, it won't use a Unicode encoding unless you specifically tell it to. Personally I think that's strange. Maybe it is meant to make converting old code easier.
So you have to explicitly set the encoding of the TStringStream to TEncoding.Unicode to manipulate Unicode strings properly.
Here is my modified code, which passes the DUnit test:
procedure TestTCPackage.TestStringStream;
const
cCount = 10;
cOrdMaxChar = Ord(High(Char));
var
B: Pointer;
SW, SR: TStringStream;
i, j, k : Integer;
vStrings: array [0..cCount-1] of string;
begin
RandSeed := GetTickCount;
for i := 0 to cCount - 1 do
begin
j := Random(100) + 1;
SetLength(vStrings[i], j);
for k := 1 to j do
// fill string with random char (but no #0)
vStrings[i][k] := Char(Random(cOrdMaxChar-1) + 1);
end;
for i := 0 to cCount - 1 do
begin
SW := TStringStream.Create(vStrings[i], TEncoding.Unicode);
try
GetMem(B, SW.Size);
try
SW.ReadBuffer(B^, SW.Size);
SR := TStringStream.Create('', TEncoding.Unicode);
try
SR.WriteBuffer(B^, SW.Size);
SR.Position := 0;
// check the string in the TStringStream with original value
Check(SR.DataString = vStrings[i]);
finally
SR.Free;
end;
finally
FreeMem(B);
end;
finally
SW.Free;
end;
end;
end;
Last note: Unicode does bite! :D

Delphi: Encoding Strings as Python do

I want to encode strings the way Python does.
The Python code is this:
def EncodeToUTF(inputstr):
    uns = inputstr.decode('iso-8859-2')
    utfs = uns.encode('utf-8')
    return utfs
This is very simple.
But in Delphi I don't understand how to do the encoding, in particular how to force the right character set first (no matter which computer the code runs on).
I tried this test code to see the conversion:
procedure TForm1.Button1Click(Sender: TObject);
var
w : WideString;
buf : array[0..2048] of WideChar;
i : integer;
lc : Cardinal;
begin
lc := GetThreadLocale;
Caption := IntToStr(lc);
StringToWideChar(Edit1.Text, buf, SizeOF(buf));
w := buf;
lc := MakeLCID(
MakeLangID( LANG_ENGLISH, SUBLANG_ENGLISH_US),
0);
Win32Check(SetThreadLocale(lc));
Edit2.Text := WideCharToString(PWideChar(w));
Caption := IntToStr(AnsiCompareText(Edit1.Text, Edit2.Text));
end;
The input is "árvíztűrő tükörfúrógép", the Hungarian accent tester phrase.
The local lc is 1038 (hun), the new lc is 1033.
But this gives a result of 0 every time (same strings), and the accents are the same: I don't lose ŐŰ, which are not in the English language.
What am I doing wrong? How do I do the same thing Python does?
Thanks for any help, links, etc.:
dd

Windows uses codepage 28592 for ISO-8859-2. If you have a buffer containing ISO-8859-2 encoded bytes, then you have to decode the bytes to UTF-16 first, and then encode the result to UTF-8. Depending on which version of Delphi you are using, you can either:
1) on pre-D2009, use MultiByteToWideChar() and WideCharToMultiByte():
function EncodeToUTF(const inputstr: AnsiString): UTF8String;
var
ret: Integer;
uns: WideString;
begin
Result := '';
if inputstr = '' then Exit;
ret := MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), nil, 0);
if ret < 1 then Exit;
SetLength(uns, ret);
MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), PWideChar(uns), Length(uns));
ret := WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), nil, 0, nil, nil);
if ret < 1 then Exit;
SetLength(Result, ret);
WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), PAnsiChar(Result), Length(Result), nil, nil);
end;
2a) on D2009+, use SysUtils.TEncoding.Convert():
function EncodeToUTF(const inputstr: RawByteString): UTF8String;
var
enc: TEncoding;
buf: TBytes;
begin
Result := '';
if inputstr = '' then Exit;
enc := TEncoding.GetEncoding(28592);
try
buf := TEncoding.Convert(enc, TEncoding.UTF8, BytesOf(inputstr));
if Length(buf) > 0 then
SetString(Result, PAnsiChar(@buf[0]), Length(buf));
finally
enc.Free;
end;
end;
2b) on D2009+, alternatively define a new string typedef, put your data into it, and assign it to a UTF8String variable. No manual encoding/decoding needed, the RTL will handle everything for you:
type
Latin2String = type AnsiString(28592);
var
inputstr: Latin2String;
outputstr: UTF8String;
begin
// put the ISO-8859-2 encoded bytes into inputstr, then...
outputstr := inputstr;
end;

If you're using Delphi 2009 or newer every input from the default VCL controls will be UTF-16, so no need to do any conversions on your input.
If you're using Delphi 2007 or older (as it seems), you are at the mercy of Windows, because the VCL is ANSI and Windows has a fixed codepage that determines which characters can be used in, e.g., a TEdit.
You can change the system-wide default ANSI codepage in the Control Panel, but that requires a reboot each time you do.
In Delphi 2007 you have some chance of using the TNT Unicode controls or some similar solution to get the text from the UI to your code.
In Delphi 2009 and newer there are also plenty of Unicode and character set handling routines in the RTL.
The conversion between character sets can be done with SysUtils.TEncoding:
http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_TEncoding.html

The Python code in your question returns a string in UTF-8 encoding. To do this with pre-2009 Delphi versions you can use code similar to:
procedure TForm1.Button1Click(Sender: TObject);
var
Src, Dest: string;
Len: integer;
buf : array[0..2048] of WideChar;
begin
Src := Edit1.Text;
Len := MultiByteToWideChar(CP_ACP, 0, PChar(Src), Length(Src), @buf[0], 2048);
buf[Len] := #0;
SetLength(Dest, 2048);
SetLength(Dest, WideCharToMultiByte(CP_UTF8, 0, @buf[0], Len, PChar(Dest),
2048, nil, nil));
Edit2.Text := Dest;
end;
Note that this doesn't change the current thread locale, it simply passes the correct code page parameters to the API.

There are encoding tools in Open XML library. There is cUnicodeCodecsWin32 unit with functions like: EncodingToUTF16().
My code that converts between ISO Latin2 and UTF-8 looks like:
s2 := EncodingToUTF16('ISO-8859-2', s);
s2utf8 := UTF16ToEncoding('UTF-8', s2);
