How to use an arbitrary string encoding? - delphi

I'm trying to get some code working against an API published by a Chinese company. I have a spec and some sample code (in Java), enough to understand most of what's going on, but I ran across one thing I don't know how to do.
String ecodeform = "GBK";
String sm = new String(Hex.encodeHex("Insert message here".getBytes(ecodeform))); //test message
It's creating a string from the char array result of the hex representation of the original string, encoded in GBK format (the standard Chinese character encoding, equivalent to ASCII for English text). I can work out how to do most of that in Delphi, but I don't know how to encode a string to GBK, which is specifically required by this API.
In SysUtils, there's a TEncoding class that comes with a few built-in encodings, such as UTF8, UTF16, and "Default" (the system's default code page), but I don't know how to set up a TEncoding for an arbitrary encoding such as GBK.
Does anyone know how to set this up?

You can use the TEncoding.GetEncoding() method to get a TEncoding object for a specific codepage/charset, eg:
var
  Enc: TEncoding;
  Bytes: TBytes;
begin
  Enc := TEncoding.GetEncoding(936); // or TEncoding.GetEncoding('gb2312')
  Bytes := Enc.GetBytes('Insert message here');
  Enc.Free; // GetEncoding() returns a new instance that the caller must free
  // encode Bytes to hex string as needed...
end;

TEncoding has a GetEncoding method for that. Give it the encoding name or number, and it will return a TEncoding instance.
For GBK, the number I think you want is 936. See Microsoft's list of code pages for more.
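To mirror the whole Java line (GBK bytes, then a hex string), the two steps can be combined. This is only a sketch: `BytesToHex` and `EncodeToHex` are hypothetical helper names, the unit scope name `System.SysUtils` assumes Delphi XE2 or later, and the output is lower-cased on the assumption that the API expects the same lowercase digits that Java's `Hex.encodeHex` produces by default.

```delphi
program GbkHexDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: two hex digits per byte, no separators.
function BytesToHex(const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + IntToHex(B, 2);
end;

// Hypothetical helper: encode S in the given code page, then hex-encode the bytes.
function EncodeToHex(const S: string; CodePage: Integer): string;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(CodePage);
  try
    // IntToHex yields upper-case digits; lower-case them to match Hex.encodeHex.
    Result := LowerCase(BytesToHex(Enc.GetBytes(S)));
  finally
    Enc.Free; // the instance returned by GetEncoding() belongs to the caller
  end;
end;

begin
  Writeln(EncodeToHex('Insert message here', 936)); // 936 = GBK
end.
```

For plain English text the GBK bytes are the same as the ASCII bytes, so `EncodeToHex('Hi', 936)` gives `4869`.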


How to correctly encode a string to UTF-8 in Delphi 10?

I am trying to replace some wildcards in a html code to send it via mailing.
Problem is when I try to replace the string with wildcard 'España$country$' with the string 'España', the result would be 'EspañaEspa?a'. I had the same problem before in Delphi 7 and I solved it by using the function 'UTF8Encode('España')' but it does not work on Delphi 10.
I have tried with 'España', 'UTF8Encode('España')' and 'AnsiToUTF8('España')'. I also tried to change the function StringReplace with ReplaceStr and ReplaceText, with same result.
var
  htmlText: TStringList;
begin
  htmlText := TStringList.Create;
  htmlText.Text := StringReplace(htmlText.Text, '$country$', UTF8Encode('España'), [rfReplaceAll]);
end;
This "stringreplace" along with "utf8encode" works well in Delphi7, showing 'España', but not in delphi 10, where you can read 'Espa?a' in the anotherpath.html.
The Delphi 7 string type, and consequently TStrings, did not support Unicode, which is why you needed to use UTF8Encode.
Since Delphi 2009, Unicode is supported: string maps to UnicodeString, and TStrings is a collection of such strings. Note that UnicodeString is internally encoded as UTF-16, although that's not a detail that you need to be concerned with here.
Since you are now using a Delphi that supports Unicode, your code can be much simpler. You can now write it like this:
htmlText.Text := StringReplace(htmlText.Text, '$country$', 'España', [rfReplaceAll]);
Note that if you wish the file to be encoded as UTF-8 when you save it you need to specify that when you save it. Like this:
htmlText.SaveToFile('anotherpath.html', TEncoding.UTF8);
And you may also need to specify the encoding when loading the file in case it does not include a UTF-8 BOM:
htmlText.LoadFromFile('path.html', TEncoding.UTF8);
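Put together, the load / replace / save steps might look like the sketch below. `ReplacePlaceholder` is a hypothetical helper name, the file names are the ones from the question, and the unit scope names assume Delphi XE2 or later. The source file itself has to be saved in an encoding the compiler reads correctly (for example UTF-8 with a BOM) so that the 'España' literal survives compilation.

```delphi
program ReplaceCountryDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Classes;

// Hypothetical helper: load a UTF-8 file, replace a placeholder, save as UTF-8.
procedure ReplacePlaceholder(const InFile, OutFile, Placeholder, Value: string);
var
  Html: TStringList;
begin
  Html := TStringList.Create;
  try
    // Name the encoding explicitly so a file without a BOM is not read as ANSI.
    Html.LoadFromFile(InFile, TEncoding.UTF8);
    // Plain Unicode strings on both sides; no UTF8Encode is needed here.
    Html.Text := StringReplace(Html.Text, Placeholder, Value, [rfReplaceAll]);
    Html.SaveToFile(OutFile, TEncoding.UTF8);
  finally
    Html.Free;
  end;
end;

begin
  // Expects path.html to exist in the current directory.
  ReplacePlaceholder('path.html', 'anotherpath.html', '$country$', 'España');
end.
```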

Problems with unicode text

I use Delphi XE3 and I have a small problem that I don't know how to fix.
The problem is with the letter "è", which appears inside a file path: "C:\lène.mp4".
I save this path into a TStringList, and when I save the TStringList to a file, the path is shown fine inside the txt file.
But when I load it back using a TStringList, the letter is shown as "Ã¨" (whether I show it in a memo or put it in a variable), so the path becomes invalid.
Adding the path (string) directly to the TStringList and then passing it to the path variable works fine,
but loading it from the file and passing it to the path variable doesn't work (I get "Ã¨" instead of "è").
Normally I will work with a lot of Unicode strings, but for now I'm struggling with this one letter.
this will not work ..
var
  resp: WideString;
  xfiles: TStringList;
begin
  xfiles := TStringList.Create;
  xfiles.LoadFromFile('C:\Demo6-out.txt'); // this file contains only "C:\lène.mp4"
  resp := xfiles.Strings[0];
  // if I save xfiles to a file, the path string is saved fine
  xfiles.Free;
end;
but like this it work ..
var
  resp: WideString;
  xfiles: TStringList;
begin
  xfiles := TStringList.Create;
  xfiles.Add('C:\lène.mp4'); // add the path directly instead of loading it from the file
  resp := xfiles.Strings[0];
  xfiles.Free;
end;
i'm really confused
First, you should be using UnicodeString instead of WideString. UnicodeString was introduced in Delphi 2009, and is much more efficient than WideString. The RTL uses UnicodeString (almost) everywhere it previously used AnsiString prior to 2009.
Second, something else introduced in Delphi 2009 is SysUtils.TEncoding, which is used for Byte<->Character conversions. Several existing RTL classes, including TStrings/TStringList, were updated to support TEncoding when converting bytes to/from strings.
What happens when you load a file into TStringList is that an internal TEncoding object is assigned to help convert the file's raw bytes to UnicodeString values. Which implementation of TEncoding it uses depends on the character encoding that LoadFromFile() thinks the file is using, if not explicitly stated (LoadFromFile() has an optional AEncoding parameter). If the file has a UTF BOM, a matching TEncoding is used, whether that be TEncoding.UTF8 or TEncoding.(BigEndian)Unicode. If no BOM is present, and the AEncoding parameter is not used, then TEncoding.Default is used, which represents the OS's default charset locale (and thus provides backwards compatibility with existing pre-2009 code).
When saving a TStringList to file, if the list was previously loaded from a file then the same TEncoding used for loading is used for saving, otherwise TEncoding.Default is used (again, for backwards compatibility), unless overwritten by the optional AEncoding parameter of SaveToFile().
In your first example, the input file is most likely encoded in UTF-8 without a BOM, so LoadFromFile() uses TEncoding.Default to interpret the file's bytes. Ã¨ is the result of the UTF-8 encoded form of è (bytes 0xC3 0xA8) being misinterpreted as Windows-1252 instead of UTF-8. So you have to load the file like this instead:
xfiles.LoadFromFile('C:\Demo6-out.txt', TEncoding.UTF8);
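The misinterpretation is easy to reproduce without any file. The sketch below decodes the same two bytes once as Windows-1252 and once as UTF-8; `DecodeWith` is a hypothetical helper, and code page 1252 is hard-coded as an assumption about what TEncoding.Default resolves to on the affected machine.

```delphi
program MojibakeDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: decode raw bytes using the given Windows code page.
function DecodeWith(CodePage: Integer; const Bytes: TBytes): string;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(CodePage);
  try
    Result := Enc.GetString(Bytes);
  finally
    Enc.Free;
  end;
end;

var
  Bytes: TBytes;
begin
  Bytes := TBytes.Create($C3, $A8); // UTF-8 encoding of U+00E8 (è)

  // Read as Windows-1252, each byte becomes its own character: U+00C3, U+00A8.
  if DecodeWith(1252, Bytes) = #$00C3#$00A8 then
    Writeln('Windows-1252: two characters (U+00C3 U+00A8)');

  // Read as UTF-8, the two bytes form the single character U+00E8.
  if DecodeWith(65001, Bytes) = #$00E8 then
    Writeln('UTF-8: one character (U+00E8)');
end.
```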
In your second example, you are not loading a file or saving a file. You are simply assigning a string literal (which is Unicode-aware in D2009+) to a UnicodeString variable (inside of the TStringList) and then assigning that to a WideString variable (WideString and UnicodeString use the same UTF-16 character encoding, they just use different memory management). So there are no data conversions being performed.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

How to avoid wrong characters reading UTF-8 emails with Indy 10.6 and Delphi 7

I am reading email with Indy and Delphi 7.
I have only problems with UTF-8 encoded emails, which translate to wrong characters in the customers computers.
I have read a lot about this problem, but I have not found a solution, except decoding the raw email by myself.
I wonder if there is a way to get correct emails when the sender encodes them in UTF-8.
The UTF-8 string is received as if it were an ANSI string. You have to decode it.
You have to receive the message text in a UTF8String (which is just an alias for AnsiString, i.e. String, in Delphi 7) and then convert it from UTF-8 to AnsiString or (preferably) WideString. You can use the UTF8Decode() or Utf8ToAnsi() function to decode the email body.
If you use the UTF8Decode() function, you will still need WideString-aware controls to display the received message.
If you use the Utf8ToAnsi() function, the result might not contain characters that are not part of the user's local codepage.
So you will use something like:
var
  ustrEmailBody: UTF8String;
  wstrDecoded: WideString;
begin
  // ustrEmailBody now contains the email body
  wstrDecoded := UTF8Decode(ustrEmailBody);
  SomeUnicodeAwareMemo.Text := wstrDecoded;
end;
or:
var
  ustrEmailBody: UTF8String;
  astrDecoded: AnsiString;
begin
  // ustrEmailBody now contains the email body
  astrDecoded := Utf8ToAnsi(ustrEmailBody);
  SomeMemo.Text := astrDecoded; // the memo might display '?' in place of unknown characters
end;
For further information see the documentation of the UTF8Decode() and Utf8ToAnsi() functions in the Delphi help.

Delphi XE and ZLib Problems

I'm in Delphi XE and I'm having some problems with ZLib routines...
I'm trying to compress some strings (and encode it to send it via a SOAP webservice -not really important-)
The string returned by ZDecompressString differs from the one passed to ZCompressString.
uses ZLib;
// compressing string
// ZCompressString('1234567890', zcMax);
// compressed string ='xÚ3426153·°4'
// Uncompressing the result of ZCompressString, don't return the same:
// ZDecompressString('xÚ3426153·°4');
// uncompressed string = '123456789'
if '1234567890' <> ZDecompressString(ZCompressString('1234567890', zcMax)) then
ShowMessage('Compression/Decompression fails');
Uses ZLib;
// compressing string
// ZCompressString('12345678901234567890', zcMax)
// compressed string ='xÚ3426153·°40„³'
// Uncompressing the result of ZCompressString, don't return the same:
// ZDecompressString('xÚ3426153·°40„³')
// uncompressed string = '12345678901'
if '12345678901234567890' <> ZDecompressString(ZCompressString('12345678901234567890', zcMax)) then
ShowMessage('Compression/Decompression fails');
the functions used are from some other posts about compressing and deCompressing
function TForm1.ZCompressString(aText: string; aCompressionLevel: TZCompressionLevel): string;
var
  strInput, strOutput: TStringStream;
  Zipper: TZCompressionStream;
begin
  Result := '';
  strInput := TStringStream.Create(aText);
  strOutput := TStringStream.Create;
  Zipper := TZCompressionStream.Create(strOutput, aCompressionLevel);
  try
    Zipper.CopyFrom(strInput, strInput.Size);
  finally
    Zipper.Free; // freeing the compressor flushes the remaining data to strOutput
  end;
  Result := strOutput.DataString;
  strInput.Free;
  strOutput.Free;
end;
function TForm1.ZDecompressString(aText: string): string;
var
  strInput, strOutput: TStringStream;
  Unzipper: TZDecompressionStream;
begin
  Result := '';
  strInput := TStringStream.Create(aText);
  strOutput := TStringStream.Create;
  Unzipper := TZDecompressionStream.Create(strInput);
  try
    strOutput.CopyFrom(Unzipper, Unzipper.Size);
  finally
    Unzipper.Free;
  end;
  Result := strOutput.DataString;
  strInput.Free;
  strOutput.Free;
end;
Where did I go wrong?
Has anyone else had the same problem?
ZLib, like all compression codes I know, is a binary compression algorithm. It knows nothing of string encodings. You need to supply it with byte streams to compress. And when you decompress, you are given back byte streams.
But you are working with strings, and so need to convert between encoded text and byte streams. The TStringStream class is doing that work in your code. You supply the string stream instance a text encoding when you create it.
Only your code does not supply an encoding. And so the default local ANSI encoding is used. And here's the first problem. That is not a full Unicode encoding. As soon as you use characters outside your local ANSI codepage the chain breaks down.
Solve that problem by supplying an encoding when you create string stream instances. Pass the encoding to the TStringStream constructor. A sound choice is TEncoding.UTF8. Pass this when creating strInput in the compressor, and strOutput in the decompressor.
Now the next and bigger problem that you face is that your compressed data may not be a meaningful string in any encoding. You might make your existing code sort of work if you switch to using AnsiString instead of string. But it's a rather brittle solution.
Fundamentally you are making the mistake of treating binary data as text. Once you compress you have binary data. My recommendation is that you don't attempt to interpret the compressed binary as text. Leave it as binary. Compress to a TBytesStream. And decompress from a TBytesStream. So the compressor function returns TBytes and the decompressor receives that same TBytes.
If, for some reason, you must compress to a string, then you must encode the compressed binary. Do that using base64. The EncdDecd unit can do that for you.
This flow for the compressor looks like this: string -> UTF-8 bytes -> compressed bytes -> base64 string. Obviously you reverse the arrows to decompress.
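The flow above can be sketched as follows. Several things here are assumptions rather than part of the original answer: the function names `CompressToBase64` and `DecompressFromBase64` are hypothetical, the base64 step uses `TNetEncoding.Base64` from System.NetEncoding (available in Delphi XE7 and later) in place of the EncdDecd unit mentioned above, and the unit scope names assume XE2 or later. The default compression level is used to keep the constructor call portable across versions.

```delphi
program ZLibBase64Demo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Classes, System.ZLib, System.NetEncoding;

// string -> UTF-8 bytes -> compressed bytes -> base64 string
function CompressToBase64(const Text: string): string;
var
  Input: TBytes;
  Output: TBytesStream;
  Zipper: TZCompressionStream;
begin
  Input := TEncoding.UTF8.GetBytes(Text);
  Output := TBytesStream.Create;
  try
    Zipper := TZCompressionStream.Create(Output);
    try
      if Length(Input) > 0 then
        Zipper.WriteBuffer(Input[0], Length(Input));
    finally
      Zipper.Free; // flushes the remaining compressed data into Output
    end;
    // Output.Bytes can be longer than the data written, so trim it to Size.
    Result := TNetEncoding.Base64.EncodeBytesToString(
      Copy(Output.Bytes, 0, Integer(Output.Size)));
  finally
    Output.Free;
  end;
end;

// base64 string -> compressed bytes -> UTF-8 bytes -> string
function DecompressFromBase64(const Encoded: string): string;
var
  Input, Output: TBytesStream;
  Unzipper: TZDecompressionStream;
  Buffer: array[0..4095] of Byte;
  Count: Integer;
begin
  Input := TBytesStream.Create(TNetEncoding.Base64.DecodeStringToBytes(Encoded));
  Output := TBytesStream.Create;
  try
    Unzipper := TZDecompressionStream.Create(Input);
    try
      // Read until the decompressor reports no more data.
      repeat
        Count := Unzipper.Read(Buffer, SizeOf(Buffer));
        if Count > 0 then
          Output.WriteBuffer(Buffer, Count);
      until Count = 0;
    finally
      Unzipper.Free;
    end;
    Result := TEncoding.UTF8.GetString(Output.Bytes, 0, Integer(Output.Size));
  finally
    Output.Free;
    Input.Free;
  end;
end;

begin
  if DecompressFromBase64(CompressToBase64('1234567890')) = '1234567890' then
    Writeln('Round trip OK');
end.
```

The compressed bytes never pass through a string type; only the base64 text does, which is what makes the result safe to carry in a SOAP message.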

Is a PChar UTF-8 coded?

I'm writing a tool, which use a C-DLL. The functions of the C-DLL expect a char*, which is in UTF-8 Format.
My question: Can I pass a PChar or do I have to use UTF8Encode(string)?
Consider a string variable named s. On an ANSI Delphi PChar(s) is ANSI encoded. On a Unicode Delphi it is UTF-16 encoded.
Therefore, either way, you need to convert s to UTF-8 encoding. And then you can use PAnsiChar(...) to get a pointer to a null terminated C string.
So, the code you need looks like this:
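A minimal sketch of that conversion, assuming a Unicode Delphi (2009 or later, with XE2+ unit scope names). `CountBytes` is a hypothetical local stand-in for the C function so that the example runs without the DLL; a real import would be declared with `external` and the calling convention the DLL uses.

```delphi
program Utf8PCharDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical stand-in for a C function taking a UTF-8 "char*":
// it counts the bytes before the terminating null.
function CountBytes(Text: PAnsiChar): Integer;
begin
  Result := 0;
  while Text[Result] <> #0 do
    Inc(Result);
end;

var
  S: string;
  Utf8: UTF8String;
begin
  S := 'l'#$00E8'ne'; // "lène", written with an escape to avoid source-encoding issues

  // Convert to UTF-8 first; the UTF8String variable keeps the bytes alive
  // for as long as the pointer is in use.
  Utf8 := UTF8Encode(S);

  // PAnsiChar(...) yields a pointer to null-terminated data, even for ''.
  Writeln(CountBytes(PAnsiChar(Utf8))); // 5: 'è' takes two bytes in UTF-8
end.
```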
Please edit the question and add the tag with your target Delphi version.
Pass it as PAnsiChar; PChar is a joker and may mean different data types. When you work with DLL-like API, you ignore compiler safety net and that means you should make your own. And that means you should use real types, not jokers, the types that would not change no matter which compiler settings and version would be active.
But before passing the pointer, you should ensure that the source data is actually encoded in UTF-8.
var
  data: string;
  buffer: UTF8String;
  buffer_ptr: PAnsiChar;
begin
  buffer := data + #0;
  // transcoding to UTF8 from whatever charset it was, transparently done by Delphi RTL
  // last zero to ensure that even for empty string you would have valid pointer below
  buffer_ptr := Pointer(@buffer[1]); // making sure there can be no codepage bound to the datatype
end;
