Delphi: Encoding Strings as Python do

Delphi: Encoding Strings as Python do - delphi

I want to encode strings as Python do.
Python code is this:
def EncodeToUTF(inputstr):
uns = inputstr.decode('iso-8859-2')
utfs = uns.encode('utf-8')
return utfs
This is very simple.
But in Delphi I don't understand, how to encode, to force first the good character set (no matter, which computer we have).
I tried this test code to see the convertion:
procedure TForm1.Button1Click(Sender: TObject);
var
w : WideString;
buf : array[0..2048] of WideChar;
i : integer;
lc : Cardinal;
begin
lc := GetThreadLocale;
Caption := IntToStr(lc);
StringToWideChar(Edit1.Text, buf, SizeOF(buf));
w := buf;
lc := MakeLCID(
MakeLangID( LANG_ENGLISH, SUBLANG_ENGLISH_US),
0);
Win32Check(SetThreadLocale(lc));
Edit2.Text := WideCharToString(PWideChar(w));
Caption := IntToStr(AnsiCompareText(Edit1.Text, Edit2.Text));
end;
The input is: "árvíztűrő tükörfúrógép", the hungarian accent tester phrase.
The local lc is 1038 (hun), the new lc is 1033.
But this everytime makes 0 result (same strings), and the accents are same, I don't lost ŐŰ which is not in english lang.
What I do wrong? How to I do same thing as Python do?
Thanks for every help, link, etc:
dd

Windows uses codepage 28592 for ISO-8859-2. If you have a buffer containing ISO-8859-2 encoded bytes, then you have to decode the bytes to UTF-16 first, and then encode the result to UTF-8. Depending on which version of Delphi you are using, you can either:
1) on pre-D2009, use MultiByteToWideChar() and WideCharToMultiByte():
function EncodeToUTF(const inputstr: AnsiString): UTF8String;
var
ret: Integer;
uns: WideString;
begin
Result := '';
if inputstr = '' then Exit;
ret := MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), nil, 0);
if ret < 1 then Exit;
SetLength(uns, ret);
MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), PWideChar(uns), Length(uns));
ret := WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), nil, 0, nil, nil);
if ret < 1 then Exit;
SetLength(Result, ret);
WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), PAnsiChar(Result), Length(Result), nil, nil);
end;
2a) on D2009+, use SysUtils.TEncoding.Convert():
function EncodeToUTF(const inputstr: RawByteString): UTF8String;
var
enc: TEncoding;
buf: TBytes;
begin
Result := '';
if inputstr = '' then Exit;
enc := TEncoding.GetEncoding(28592);
try
buf := TEncoding.Convert(enc, TEncoding.UTF8, BytesOf(inputstr));
if Length(buf) > 0 then
SetString(Result, PAnsiChar(#buf[0]), Length(buf));
finally
enc.Free;
end;
end;
2b) on D2009+, alternatively define a new string typedef, put your data into it, and assign it to a UTF8String variable. No manual encoding/decoding needed, the RTL will handle everything for you:
type
Latin2String = type AnsiString(28592);
var
inputstr: Latin2String;
outputstr: UTF8String;
begin
// put the ISO-8859-2 encoded bytes into inputstr, then...
outputstr := inputstr;
end;

If you're using Delphi 2009 or newer every input from the default VCL controls will be UTF-16, so no need to do any conversions on your input.
If you're using Delphi 2007 or older (as it seems) you are at mercy of Windows, because the VCL is ANSI and Windows has a fixed Codepage that determines which characters can be used in i.e. a TEdit.
You can change the system-wide default ANSI CP in the control panel though, but that requires a reboot each time you do.
In Delphi 2007 you have some chance to use TNTUnicode controls or some similar solution to get the Text from the UI to your code.
In Delphi 2009 and newer there are also plenty of Unicode and character set handling routines in the RTL.
The conversion between character sets can be done with SysUtils.TEncoding:
http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_TEncoding.html

The Python code in your question returns a string in UTF-8 encoding. To do this with pre-2009 Delphi versions you can use code similar to:
procedure TForm1.Button1Click(Sender: TObject);
var
Src, Dest: string;
Len: integer;
buf : array[0..2048] of WideChar;
begin
Src := Edit1.Text;
Len := MultiByteToWideChar(CP_ACP, 0, PChar(Src), Length(Src), #buf[0], 2048);
buf[Len] := #0;
SetLength(Dest, 2048);
SetLength(Dest, WideCharToMultiByte(CP_UTF8, 0, #buf[0], Len, PChar(Dest),
2048, nil, nil));
Edit2.Text := Dest;
end;
Note that this doesn't change the current thread locale, it simply passes the correct code page parameters to the API.

There are encoding tools in Open XML library. There is cUnicodeCodecsWin32 unit with functions like: EncodingToUTF16().
My code that converts between ISO Latin2 and UTF-8 looks like:
s2 := EncodingToUTF16('ISO-8859-2', s);
s2utf8 := UTF16ToEncoding('UTF-8', s2);

Related

Is there a way to get just the ANSI characters from a string? Utf8decode fails when string contains emojis

First I get a TMemoryStream from an HTTP request, which contains the body of the response.
Then I load it in a TStringList and save the text in a widestring (also tried with ansistring).
The problem is that I need to convert the string because the users language is spanish, so vowels with accent marks are very common and I need to store the info.
lServerResponse := TStringList.Create;
lServerResponse.LoadFromStream(lResponseMemoryStream);
lStringResponse := lServerResponse.Text;
lDecodedResponse := Utf8Decode(lStringResponse );
If the response (a part of it) is "Hólá Múndó", lStringResponse value will be "HÃ³lÃ¡ MÃºndÃ³", and lDecodedResponse will be "Hólá Múndó".
But if the user adds any emoji (lStringResponse value will be "HÃ³lÃ¡ MÃºndÃ³ ðŸ˜€" if the emoji is 😀) Utf8Decode fails and returns an empty string.
Is there a way to get just the ANSI characters from a string (or MemoryStream)?, or removing whatever Utf8Decode can't convert?
Thanks for your time.

TMemoryStream is just raw bytes. There is no reason to loading that stream into a TStringList just to extract a (Wide|Ansi)String from it. You can assign the bytes directly to an AnsiString/UTF8String using SetString() instead, eg:
var
lStringResponse: UTF8String;
lDecodedResponse: WideString;
begin
SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
lDecodedResponse := UTF8Decode(lStringResponse);
end;
Just make sure the HTTP content really is encoded as UTF-8, or else this approach will not work.
That being said - UTF8Decode() (and UTF8Encode()) in Delphi 7 DO NOT support Unicode codepoints above U+FFFF, which means they DO NOT support Emojis at all. That was fixed in Delphi 2009.
To work around that issue in earlier versions, you can use the Win32 API MultiByteToWideChar() function instead, eg:
uses
..., Windows;
function My_UTF8Decode(const S: UTF8String): WideString;
var
WLen: Integer;
begin
WLen := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), nil, 0);
if WLen > 0 then
begin
SetLength(Result, WLen);
MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), PWideChar(Result), WLen);
end else
Result := '';
end;
var
lStringResponse: UTF8String;
lDecodedResponse: WideString;
begin
SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
lDecodedResponse := My_UTF8Decode(lStringResponse);
end;
Alternatively:
uses
..., Windows;
function My_UTF8Decode(const S: PAnsiChar; const SLen: Integer): WideString;
var
WLen: Integer;
begin
WLen := MultiByteToWideChar(CP_UTF8, 0, S, SLen, nil, 0);
if WLen > 0 then
begin
SetLength(Result, WLen);
MultiByteToWideChar(CP_UTF8, 0, S, SLen, PWideChar(Result), WLen);
end else
Result := '';
end;
var
lDecodedResponse: WideString;
begin
lDecodedResponse := My_UTF8Decode(PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
end;
Or, use a 3rd party Unicode conversion library, like ICU or libiconv, which handle this for you.

What's the correct way to assign PAnsiChar to a (unicode-) string?

I have got a DLL function that returns a pointer to ANSI text (PAnsiChar). I want to assign this to a (unicode-) string (This is Delphi XE2.). The following compiles but I get a warning
"W1057 Implicit String cast from 'AnsiChar' to 'string'":
function TProj4.pj_strerrno(_ErrorCode: Integer): string;
var
Err: PAnsiChar;
begin
Err := Fpj_strerrno(_ErrorCode);
Result := Err;
end;
EDIT: The text in question is an error message in English, so there are unlikely to be any conversion problems here.
I am now tempted to just explicitly typecast Err to string like this ...
Result := String(Err);
.. to get rid of the warning. Could this go wrong? Should I rather use a temporary AnsiString variable instead?
var
s: AnsiString;
[...]
s := Err;
Result := String(s);
If yes, why?
Or should I make it explicit, that the code first converts a PAnsiChar to AnsiString and then the AnsiString to a String?
Result := String(AnsiString(Err));
And of course I could make it a function:
function PAnsicharToString(_a: PAnsiChar): string;
begin
// one of the above conversion codes goes here
end;
All these options compile, but will they work? And what's the best practice here?
Bonus points: The code should ideally compile and work with Delphi 2007 and newer versions as well.

If the text is encoded in the users current locale then I'd say it is simplest to write:
var
p: PAnsiChar;
str: string;
....
str := string(p);
Otherwise if you wish to convert from a specific code page to a Unicode string then you would use UnicodeFromLocaleChars.

I think the general solution is assigning c char pointer to RawByteString, then set its codepage corresponding to c null-terminated string encoding.
var
bys :TBytes;
rbstr :RawByteString;
ustr :string;
pastr :PAnsiChar;
begin
SetLength(bys,5);
bys[0] := $ca;
bys[1] := $e9;
bys[2] := $d2;
bys[3] := $b5;
bys[4] := 0;
pastr := #bys[0]; // just simulate char* returned by c api
rbstr := pastr; // assign PAnsiChar to RawByteString
// assume text encoded as codepage 936
// Note here: set 3rd param to false!
SetCodePage(rbstr,936,false);
ustr := string(rbstr);
ShowMessage(ustr);
end;
And the other cross-platform solution is (vcl,fmx,fmx with mobile platform)
function CString2TBytes(ptr :{$IFDEF NEXTGEN} MarshaledAString {$ELSE} PAnsiChar {$ENDIF}) :TBytes;
var
pby :PByte;
len :Integer;
begin
pby := PByte(ptr);
while pby^<>0 do Inc(pby);
len := pby - ptr;
SetLength(Result,len);
if len>0 then Move(ptr^,Result[0],len);
end;
procedure TForm5.Button1Click(Sender: TObject);
var
bys, cbys: TBytes;
ustr: string;
// PAnsiChar is undefined in mobile platform
// remap param foo(outSting:PAnsiString) => foo(outString:MarshaledAString)
ptr: {$IFDEF NEXTGEN} MarshaledAString {$ELSE} PAnsiChar {$ENDIF}; //
encoding : TEncoding;
begin
SetLength(bys, 5);
bys[0] := $CA;
bys[1] := $E9;
bys[2] := $D2;
bys[3] := $B5;
bys[4] := 0;
ptr := #bys[0]; // just simulate char* returned by c api
cbys := CString2TBytes(ptr);
// assume text encoded as codepage 936
encoding := TEncoding.GetEncoding(936);
try
ustr := encoding.GetString(cbys);
ShowMessage(ustr);
finally
encoding.Free;
end;
end;

Decode UTF-8 encoded Cyrillic with Delphi 2007

I am working in Delphi 2007 (no Unicode support) and I am retrieving XML and JSON data from the Google Analytics API. Below is some UTF-8 encoded data that I get for a URL referral path:
ga:referralPath=/add/%D0%9F%D0%B8%D0%B6%D0%B0%D0%BC
When I decode it using this decoder it properly generates this:
ga:referralPath=/add/Пижам
Is there a function I can use in Delphi 2007 which will perform this decoding?
UPDATE
This data is corresponds to a URL. Ultimately what I want to do is to store this in a SqlServer database (out of the box - no settings modified regarding character sets). And then be able to produce/create an html pages with a working link to this page (note: I am only dealing with the url referral path in this example - obviously to make a valid url link a source would be needed).

D2007 supports Unicode, just not to the extent that D2009+ does. Unicode in D2007 is handled using WideString and the few RTL support functions that do exist.
The URL contains percent-encoded UTF-8 byte octets. Simply convert those sequences into their binary representation and then use UTF8Decode() to decode the UTF-8 data to a WideString. For example:
function HexToBits(C: Char): Byte;
begin
case C of
'0'..'9': Result := Byte(Ord(C) - Ord('0'));
'a'..'f': Result := Byte(10 + (Ord(C) - Ord('a')));
'A'..'F': Result := Byte(10 + (Ord(C) - Ord('A')));
else
raise Exception.Create('Invalid encoding detected');
end;
end;
var
sURL: String;
sWork: UTF8String;
C: Char;
B: Byte;
wDecoded: WideString;
I: Integer;
begin
sURL := 'ga:referralPath=/add/%D0%9F%D0%B8%D0%B6%D0%B0%D0%BC';
sWork := sURL;
I := 1;
while I <= Length(sWork) do
begin
if sWork[I] = '%' then
begin
if (I+2) > Length(sWork) then
raise Exception.Create('Incomplete encoding detected');
sWork[I] := Char((HexToBits(sWork[I+1]) shl 4) or HexToBits(sWork[I+2]));
Delete(sWork, I+1, 2);
end;
Inc(I);
end;
wDecoded := UTF8Decode(sWork);
...
end;

You can use the following code, which uses Windows API :
function Utf8ToStr(const Source : string) : string;
var
i, len : integer;
TmpBuf : array of byte;
begin
SetLength(Result, 0);
i := MultiByteToWideChar(CP_UTF8, 0, #Source[1], Length(Source), nil, 0);
if i = 0 then Exit;
SetLength(TmpBuf, i * SizeOf(WCHAR));
Len := MultiByteToWideChar(CP_UTF8, 0, #Source[1], Length(Source), #TmpBuf[0], i);
if Len = 0 then Exit;
i := WideCharToMultiByte(CP_ACP, 0, #TmpBuf[0], Len, nil, 0, nil, nil);
if i = 0 then Exit;
SetLength(Result, i);
i := WideCharToMultiByte(CP_ACP, 0, #TmpBuf[0], Len, #Result[1], i, nil, nil);
SetLength(Result, i);
end;

Hex view of a file

I am using Delphi 2009.
I want to view the contents of a file (in hexadecimal) inside a memo.
I'm using this code :
var
Buffer:String;
begin
Buffer := '';
AssignFile(sF,Source); //Assign file
Reset(sF);
repeat
Readln(sF,Buffer); //Load every line to a string.
TempChar:=StrToHex(Buffer); //Convert to Hex using the function
...
until EOF(sF);
end;
function StrToHex(AStr: string): string;
var
I ,Len: Integer;
s: chr (0)..255;
//s:byte;
//s: char;
begin
len:=length(AStr);
Result:='';
for i:=1 to len do
begin
s:=AStr[i];
//The problem is here. Ord(s) is giving false values (251 instead of 255)
//And in general the output differs from a professional hex editor.
Result:=Result +' '+IntToHex(Ord(s),2)+'('+IntToStr(Ord(s))+')';
end;
Delete(Result,1,1);
end;
When I declare variable "s" as char (i know that char goes up to 255) I get results hex values up to 65535!
When i declare variable "s" as byte or chr (0)..255, it outputs different hex values, comparing to any Hexadecimal Editor!
Why is that? How can I see the correct values?
Check images for the differences.
1st image: Professional Hex Editor.
2nd image: Function output to Memo.
Thank you.

Your Delphi 2009 is unicode-enabled, so Char is actually WideChar and that's a 2 byte, 16 bit unsigned value, that can have values from 0 to 65535.
You could change all your Char declarations to AnsiChar and all your String declarations to AnsiString, but that's not the way to do it. You should drop Pascal I/O in favor of modern stream-based I/O, use a TFileStream, and don't treat binary data as Char.
Console demo:
program Project26;
{$APPTYPE CONSOLE}
uses SysUtils, Classes;
var F: TFileStream;
Buff: array[0..15] of Byte;
CountRead: Integer;
HexText: array[0..31] of Char;
begin
F := TFileStream.Create('C:\Temp\test', fmOpenRead or fmShareDenyWrite);
try
CountRead := F.Read(Buff, SizeOf(Buff));
while CountRead <> 0 do
begin
BinToHex(Buff, HexText, CountRead);
WriteLn(HexText); // You could add this to the Memo
CountRead := F.Read(Buff, SizeOf(Buff));
end;
finally F.Free;
end;
end.

In Delphi 2009, a Char is the same thing as a WideChar, that is, a Unicode character. A wide character occupies two bytes. You want to use AnsiChar. Prior to Delphi 2009 (that is, prior to Unicode Delphi), Char was the same thing as AnsiChar.
Also, you shouldn't use ReadLn. You are treating the file as a text file with text-file line endings! This is a general file! It might not have any text-file line endings at all!

For an easier to read output, and looking better too, you might want to use this simple hex dump formatter.
The HexDump procedure dumps an area of memory into a TStrings in lines of two chunks of 8 bytes in hex, and 16 ascii chars
example
406563686F206F66 660D0A6966206578 #echo off..if ex
69737420257E7331 5C6E756C20280D0A ist %~s1\nul (..
0D0A290D0A ..)..
Here is the code for the dump format function
function HexB (b: Byte): String;
const HexChar: Array[0..15] of Char = '0123456789ABCDEF';
begin
result:= HexChar[b shr 4]+HexChar[b and $0f];
end;
procedure HexDump(var data; size: Integer; s: TStrings);
const
sepHex=' ';
sepAsc=' ';
nonAsc='.';
var
i : Integer;
hexDat, ascDat : String;
buff : Array[0..1] of Byte Absolute data;
begin
hexDat:='';
ascDat:='';
for i:=0 to size-1 do
begin
hexDat:=hexDat+HexB(buff[i]);
if ((buff[i]>31) and (buff[i]<>255)) then
ascDat:=ascDat+Char(buff[i])
else
ascDat:=ascDat+nonAsc;
if (((i+1) mod 16)<>0) and (((i+1) mod 8)=0) then
hexDat:=hexDat+sepHex;
if ((i+1) mod 16)=0 then
begin
s.Add(hexdat+sepAsc+ascdat);
hexdat:='';
ascdat:='';
end;
end;
if (size mod 16)<>0 then
begin
if (size mod 16)<8 then
hexDat:=hexDat+StringOfChar(' ',(8-(size mod 8))*2)
+sepHex+StringOfChar(' ',16)
else
hexDat:=hexDat+StringOfChar(' ',(16-(size mod 16))*2);
s.Add(hexDat + sepAsc + ascDat);
end;
end;
And here is a complete code example for dumping the contents of a file into a Memo field.
procedure TForm1.Button1Click(Sender: TObject);
var
FStream: TFileStream;
buff: array[0..$fff] of Byte;
nRead: Integer;
begin
FStream := TFileStream.Create(edit1.text, fmOpenRead or fmShareDenyWrite);
try
repeat
nRead := FStream.Read(Buff, SizeOf(Buff));
if nRead<>0 then
hexdump(buff,nRead,memo1.lines);
until nRead=0;
finally
F.Free;
end;
end;

string is UnicodeString in Delphi 2009. If you want to use single-byte strings use AnsiString or RawByteString.
See String types.

Unicode string and TStringStream

Delphi 2009 and above uses unicode strings for their default string type. To my understanding unicode char is actually 16 bit value or 2 bytes (note: I understand there is possibility of 3 or 4 bytes char, but let's consider the most usual case). However I found that TStringStream is not very reliable to manipulating this strings. For example, TStringStream.Size property returns the length of the string, while I think it should return the byte count of the contained string. Okay, you can adjust it on your own, but the thing that really confused me the most is: TStringStream does not read from or write to a buffer reliably.
Please check the following code (it's a DUnit test and always fail). Please let me know where the problem is (I was using D2010 when testing the code).
procedure TestTCPackage.TestStringStream;
const
cCount = 10;
cOrdMaxChar = Ord(High(Char));
var
B: Pointer;
SW, SR: TStringStream;
T: string;
i, j, k : Integer;
vStrings: array [0..cCount-1] of string;
begin
RandSeed := GetTickCount;
for i := 0 to cCount - 1 do
begin
j := Random(100) + 1;
SetLength(vStrings[i], j);
for k := 1 to j do
// fill string with random char (but no #0)
vStrings[i][k] := Char(Random(cOrdMaxChar-1) + 1);
end;
for i := 0 to cCount - 1 do
begin
SW := TStringStream.Create(vStrings[i]);
try
GetMem(B, SW.Size * SizeOf(Char));
try
SW.Read(B^, SW.Size * SizeOf(Char));
SR := TStringStream.Create;
try
SR.Write(B^, SW.Size * SizeOf(Char));
SR.Position := 0;
// check the string in the TStringStream with original value
Check(SR.DataString = vStrings[i]);
finally
SR.Free;
end;
finally
FreeMem(B);
end;
finally
SW.Free;
end;
end;
end;
Note: I already tried to use an instance of TMemoryStream as intermediary from reading/writing the buffer and use CopyFrom of the TStringStream to read the content of that TMemoryStream with same failing effect.

Unicode strings aren't for data storage; use TBytes for that. TStringStream uses its associated encoding (the Encoding property) for encoding strings passed in with WriteString, and decoding strings read out with ReadString or the DataString property.

After reading this post (and thanks to Serg who provided the answer to that question) and Barry Kelly's answer, I have found the problem. TStringStream is actually using ASCII/ansistring encoding by default. So even if your default string type is unicode, unless you spesifically tell it to, it won't use unicode encoding. Personally I think it's strange. Maybe for making it easier to convert old codes.
So you have to specifically set the encoding of the TStringStream to TEncoding.Unicode to manipulate unicode string properly.
Here is my modified code which passes DUnit test is:
procedure TestTCPackage.TestStringStream;
const
cCount = 10;
cOrdMaxChar = Ord(High(Char));
var
B: Pointer;
SW, SR: TStringStream;
i, j, k : Integer;
vStrings: array [0..cCount-1] of string;
begin
RandSeed := GetTickCount;
for i := 0 to cCount - 1 do
begin
j := Random(100) + 1;
SetLength(vStrings[i], j);
for k := 1 to j do
// fill string with random char (but no #0)
vStrings[i][k] := Char(Random(cOrdMaxChar-1) + 1);
end;
for i := 0 to cCount - 1 do
begin
SW := TStringStream.Create(vStrings[i], ***TEncoding.Unicode***);
try
GetMem(B, SW.Size);
try
SW.ReadBuffer(B^, SW.Size);
SR := TStringStream.Create('', ***TEncoding.Unicode***);
try
SR.WriteBuffer(B^, SW.Size);
SR.Position := 0;
// check the string in the TStringStream with original value
Check(SR.DataString = vStrings[i]);
finally
SR.Free;
end;
finally
FreeMem(B);
end;
finally
SW.Free;
end;
end;
end;
Last note: Unicode does bite! :D

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Delphi: Encoding Strings as Python do - delphi

There are encoding tools in Open XML library. There is cUnicodeCodecsWin32 unit with functions like: EncodingToUTF16(). My code that converts between ISO Latin2 and UTF-8 looks like: s2 := EncodingToUTF16('ISO-8859-2', s); s2utf8 := UTF16ToEncoding('UTF-8', s2);

Related

Is there a way to get just the ANSI characters from a string? Utf8decode fails when string contains emojis

What's the correct way to assign PAnsiChar to a (unicode-) string?

Decode UTF-8 encoded Cyrillic with Delphi 2007

Hex view of a file

Unicode string and TStringStream

Categories

Resources