TIdHTTP character encoding of POST response - delphi

Take the following situation:
procedure Test;
var
  Response: String;
begin
  Response := IdHttp.Post(MyUrL, AStream);
  DoSomethingWith(Response);
end;
Now the webserver returns me data in UTF-8.
Suppose it returns me some UTF-8 XML containing the character é.
If I use the variable Response, it does not contain this character but its UTF-8 byte sequence (#C3#A9), so Indy did no decoding?
Now I know how to solve this problem:
procedure Test;
var
  Response: String;
begin
  Response := UTF8ToString(IdHttp.Post(MyUrL, AStream));
  DoSomethingWith(Response);
end;
One caveat with this solution: Delphi raises warning W1058 (Implicit string cast with potential data loss from 'string' to 'RawByteString').
My question: is this the correct way to deal with this problem, or can I instruct TIdHTTP to do the conversion to UnicodeString for me?

If you are using an up-to-date version of Indy 10, then the overloaded version of TIdHTTP.Post() that returns a String does decode the data to Unicode, however the actual charset used for the decoding depends on what media type the HTTP Content-Type response header specifies:
1. if the media type is either application/xml, application/xml-external-parsed-entity, application/xml-dtd, or is not a text/... type but does end with +xml, then the charset specified in the encoding attribute of the XML's prolog is used. If no charset is specified, UTF-8 is used.
2. otherwise, if the Content-Type response header specifies a charset, then it is used.
3. otherwise, if the media type is a text/... type, then:
   a. if the media type is text/xml, text/xml-external-parsed-entity, or ends with +xml, then us-ascii is used.
   b. otherwise ISO-8859-1 is used.
4. otherwise, Indy's default encoding (ASCII by default) is used.
Without seeing the actual HTTP Content-Type header, it is hard to know which condition your situation falls into. It sounds like it is falling into either #2 or #3b, which would account for the UTF-8 byte values being returned as-is, if ISO-8859-1 or similar charset is being used.
UTF8ToString() expects a UTF-8 encoded RawByteString as input, but you are passing it a UTF-16 encoded UnicodeString instead. The RTL will perform a UTF16->Ansi conversion in that situation, using a default Ansi charset for the conversion. That is why you get the compiler warning, because such a conversion can lose data.
XML is really a binary data format, subject to charset encodings. An XML parser needs to know what the XML's encoding is, and be able to parse the raw encoded bytes accordingly. That is why XML has an explicit encoding attribute right in the XML prolog. However, when TIdHTTP downloads XML as a String, although it does automatically decode it to Unicode, it does not yet update the XML's prolog accordingly.
The real solution is to not download XML as a String in the first place. Download it as a TStream instead (TMemoryStream is a better choice than TStringStream) so your XML parser has access to the original bytes, the original charset declaration, etc. You can pass the TStream to the TXMLDocument.LoadFromStream() method, for instance.
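A minimal sketch of that stream-based approach, assuming XE2-style unit names and the default DOM vendor. PostAndParseXml is a hypothetical helper name, not part of Indy or the RTL, and error handling is left out:

```delphi
uses
  System.Classes, Xml.XMLIntf, Xml.XMLDoc, IdHTTP;

// Hypothetical helper: posts ARequest and hands the untouched response
// bytes to the XML parser, which reads the encoding from the XML prolog.
function PostAndParseXml(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): IXMLDocument;
var
  Response: TMemoryStream;
begin
  Response := TMemoryStream.Create;
  try
    // This overload writes the raw response body to the stream; Indy does
    // no charset decoding here.
    AHttp.Post(AUrl, ARequest, Response);
    Response.Position := 0;

    // Created without an owner, so the document is reference counted
    // through the interface.
    Result := TXMLDocument.Create(nil);
    Result.LoadFromStream(Response);
  finally
    Response.Free;
  end;
end;
```

With the MSXML vendor in a console application or worker thread, COM has to be initialized (CoInitialize) before the document is created.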

You can do this:
var
  sstream: TStringStream;
begin
  sstream := TStringStream.Create('', TEncoding.UTF8);
  try
    IdHttp.Post(MyUrL, AStream, sstream);
    DoSomethingWith(sstream.DataString);
  finally
    sstream.Free;
  end;
end;

Related

Receiving Unicode strings with Indy 10

I am using the latest Delphi 10.4.2 with Indy 10.
In a REST server, JSON commands are received and handled. It works fine except for Unicode.
A simple JSON like this:
{"driverNote": "Test"}
is shown correctly.
If I now change to Unicode Russian characters:
{"driverNote": "Статья"}
I'm not sure where I should begin to track this down. I expect ARequestInfo.FormParams to have the same value in the debugger as the s variable.
If I debug Indy itself, FormParams are set in this code:
if LRequestInfo.PostStream <> nil then
begin
  // decoding percent-encoded octets and applying the CharSet is handled by
  // DecodeAndSetParams() further below...
  EnsureEncoding(LEncoding, enc8Bit);
  LRequestInfo.FormParams :=
    ReadStringFromStream(LRequestInfo.PostStream,
                         -1,
                         LEncoding
                         {$IFDEF STRING_IS_ANSI}, LEncoding{$ENDIF});
  DoneWithPostStream(AContext, LRequestInfo); // don't need the PostStream anymore
end;
It uses enc8Bit, but my string has 16-bit characters.
Is this handled incorrectly in Indy?
The code snippet you quoted from IdCustomHTTPServer.pas is not what is in Indy's GitHub repo.
In the official code, TIdHTTPServer does not decode the PostStream to FormParams unless the ContentType is 'application/x-www-form-urlencoded':
if LRequestInfo.PostStream <> nil then begin
  if TextIsSame(LContentType, ContentTypeFormUrlencoded) then
  begin
    // decoding percent-encoded octets and applying the CharSet is handled by DecodeAndSetParams() further below...
    EnsureEncoding(LEncoding, enc8Bit);
    LRequestInfo.FormParams := ReadStringFromStream(LRequestInfo.PostStream, -1, LEncoding{$IFDEF STRING_IS_ANSI}, LEncoding{$ENDIF});
    DoneWithPostStream(AContext, LRequestInfo); // don't need the PostStream anymore
  end;
end;
That ContentType check was added way back in 2010, so I don't know why it is not present in your version.
In your example, the ContentType is 'application/json', so the raw JSON should be in the PostStream and the FormParams should be blank.
That being said, in your version of Indy, TIdHTTPServer is simply reading the raw bytes from the PostStream and zero-extending each byte to a 16-bit character in the FormParams. To recover the original bytes, simply truncate each Char to an 8-bit Byte. For instance, you can use Indy's ToBytes() function in the IdGlobal unit, specifying enc8Bit/IndyTextEncoding_8Bit as the byte encoding.
JSON is most commonly transmitted as UTF-8 (and that is the case in your example), so when you have access to the raw bytes, in any version, make sure you parse the JSON bytes as UTF-8.
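The recovery step described above can be sketched roughly as follows. RecoverUtf8Text is a hypothetical helper name. The sketch assumes an Indy 10 version that provides the IndyTextEncoding_8Bit and IndyTextEncoding_UTF8 functions in IdGlobal, and a Unicode Delphi compiler:

```delphi
uses
  IdGlobal;

// Hypothetical helper, not part of Indy: undoes the zero-extension that
// turned each raw byte into one 16-bit Char, then decodes the recovered
// bytes as UTF-8.
function RecoverUtf8Text(const AZeroExtended: string): string;
var
  Raw: TIdBytes;
begin
  // The 8-bit encoding maps each Char in #0..#255 back to a single byte.
  Raw := ToBytes(AZeroExtended, IndyTextEncoding_8Bit);
  // Interpret those bytes with the charset the client actually used.
  Result := BytesToString(Raw, IndyTextEncoding_UTF8);
end;
```

In the older Indy version this would be called as RecoverUtf8Text(ARequestInfo.FormParams). In a current Indy version, where the JSON stays in the PostStream, something like ReadStringFromStream(ARequestInfo.PostStream, -1, IndyTextEncoding_UTF8) should give the same text directly, provided the stream position is at the start.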

Problems with unicode text

I use Delphi XE3 and I have a small problem that I don't know how to fix.
The problem is with the letter "è", which appears inside a file path: "C:\lène.mp4".
I save this path into a TStringList, and when I save the TStringList to a file, the path is shown fine inside the txt file.
But when I load it back using a TStringList, it is shown as "Ã¨" (whether I display it in a memo or put it into a variable), so it becomes an invalid path.
Adding the path (string) directly to the TStringList and then passing it to the path variable works fine,
but loading it from the file and passing it to the path variable doesn't work (I get "Ã¨" instead of "è").
Normally I will be working with a lot of Unicode strings, but for now I'm struggling with that one letter.
This does not work:
var
  resp: widestring;
  xfiles: tstringlist;
begin
  xfiles := tstringlist.Create;
  try
    xfiles.LoadFromFile('C:\Demo6-out.txt'); // this file contains only "C:\lène.mp4"
    resp := (xfiles.Strings[0]);
    // if I save xfiles to a file, the path string is saved fine
  finally
    xfiles.Free;
  end;
end;
But like this it works:
var
  resp: widestring;
  xfiles: tstringlist;
begin
  xfiles := tstringlist.Create;
  try
    xfiles.Add('C:\lène.mp4');
    resp := (xfiles.Strings[0]);
  finally
    xfiles.Free;
  end;
end;
I'm really confused.
First, you should be using UnicodeString instead of WideString. UnicodeString was introduced in Delphi 2009, and is much more efficient than WideString. The RTL uses UnicodeString (almost) everywhere it previously used AnsiString prior to 2009.
Second, something else introduced in Delphi 2009 is SysUtils.TEncoding, which is used for Byte<->Character conversions. Several existing RTL classes, including TStrings/TStringList, were updated to support TEncoding when converting bytes to/from strings.
What happens when you load a file into TStringList is that an internal TEncoding object is assigned to help convert the file's raw bytes to UnicodeString values. Which implementation of TEncoding it uses depends on the character encoding that LoadFromFile() thinks the file is using, if not explicitly stated (LoadFromFile() has an optional AEncoding parameter). If the file has a UTF BOM, a matching TEncoding is used, whether that be TEncoding.UTF8 or TEncoding.(BigEndian)Unicode. If no BOM is present, and the AEncoding parameter is not used, then TEncoding.Default is used, which represents the OS's default charset locale (and thus provides backwards compatibility with existing pre-2009 code).
When saving a TStringList to a file, if the list was previously loaded from a file then the same TEncoding used for loading is used for saving, otherwise TEncoding.Default is used (again, for backwards compatibility), unless overridden by the optional AEncoding parameter of SaveToFile().
In your first example, the input file is most likely encoded in UTF-8 without a BOM. So LoadFromFile() would use TEncoding.Default to interpret the file's bytes. Ã¨ is the result of the UTF-8 encoded form of è (byte octets 0xC3 0xA8) being misinterpreted as Windows-1252 instead of UTF-8. So, you would have to load the file like this instead:
xfiles.LoadFromFile('C:\Demo6-out.txt', TEncoding.UTF8);
In your second example, you are not loading a file or saving a file. You are simply assigning a string literal (which is Unicode-aware in D2009+) to a UnicodeString variable (inside of the TStringList) and then assigning that to a WideString variable (WideString and UnicodeString use the same UTF-16 character encoding, they just use different memory management). So there are no data conversions being performed.
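The misinterpretation described for the first example can be reproduced in memory, without any file. This is only an illustrative sketch: DecodeWithCodePage is a hypothetical helper, and it assumes code page 1252 is available on the machine (normally the case on Windows):

```delphi
program MojibakeDemo;
{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

uses
  System.SysUtils;

// Hypothetical helper: decodes Bytes using the given Windows code page.
function DecodeWithCodePage(const Bytes: TBytes; CodePage: Integer): string;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(CodePage);
  try
    Result := Enc.GetString(Bytes);
  finally
    Enc.Free; // encodings obtained from GetEncoding must be freed
  end;
end;

var
  Original: string;
  Utf8Bytes: TBytes;
begin
  Original := #$00E8; // the letter "è"
  Utf8Bytes := TEncoding.UTF8.GetBytes(Original); // two bytes: $C3 $A8

  // Wrong charset: each byte becomes its own character (U+00C3, U+00A8).
  Assert(DecodeWithCodePage(Utf8Bytes, 1252) = #$00C3#$00A8);

  // Right charset: the two bytes decode back to the single letter.
  Assert(TEncoding.UTF8.GetString(Utf8Bytes) = Original);
end.
```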
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

How to use an arbitrary string encoding?

I'm trying to get some code working against an API published by a Chinese company. I have a spec and some sample code (in Java), enough to understand most of what's going on, but I ran across one thing I don't know how to do.
String ecodeform = "GBK";
String sm = new String(Hex.encodeHex("Insert message here".getBytes(ecodeform))); //test message
It's creating a string from the char array result of the hex representation of the original string, encoded in GBK format (the standard Chinese character encoding, equivalent to ASCII for English text). I can work out how to do most of that in Delphi, but I don't know how to encode a string to GBK, which is specifically required by this API.
In SysUtils, there's a TEncoding class that comes with a few built-in encodings, such as UTF8, UTF16, and "Default" (the system's default code page), but I don't know how to set up a TEncoding for an arbitrary encoding such as GBK.
Does anyone know how to set this up?
You can use the TEncoding.GetEncoding() method to get a TEncoding object for a specific codepage/charset, eg:
var
  Enc: TEncoding;
  Bytes: TBytes;
begin
  Enc := TEncoding.GetEncoding(936); // or TEncoding.GetEncoding('gb2312')
  try
    Bytes := Enc.GetBytes('Insert message here');
  finally
    Enc.Free;
  end;
  // encode Bytes to hex string as needed...
end;
TEncoding has a GetEncoding method for that. Give it the encoding name or number, and it will return a TEncoding instance.
For GBK, the number I think you want is 936. See Microsoft's list of code pages for more.
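For the remaining hex step, one possible sketch is shown below. BytesToHexLower and EncodeToHex are hypothetical helper names, and the lower-case output is an assumption about what the Java Hex.encodeHex call in the question produces, so check it against the API's spec:

```delphi
program GbkHexDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: two lower-case hex digits per byte.
function BytesToHexLower(const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + LowerCase(IntToHex(B, 2));
end;

// Hypothetical helper: encodes Text in the given code page, then hex-encodes it.
function EncodeToHex(const Text: string; CodePage: Integer): string;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(CodePage);
  try
    Result := BytesToHexLower(Enc.GetBytes(Text));
  finally
    Enc.Free;
  end;
end;

begin
  // 936 is the code page number suggested above for GBK.
  Writeln(EncodeToHex('Insert message here', 936));
end.
```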

What is the correct encoding for Indy Tidhttp for posting XML Files?

I notice I have invalid characters for XML files in an application that uses the Indy client (I currently use the default parameters for IdHttp).
Here is my code:
ts := TStringList.Create;
try
  ts.Add('XML=' + AXMLDoc.XML.Text);
  HTTPString := IdHTTPClient.Post('http://' + FHost + ':' + IntToStr(FPort) + FHttpRoot, ts);
finally
  ts.Free;
end;
My XML file is UTF-8 encoded.
What do I have to do to get the correct encoding on my server (I also use Indy for the server)?
UTF-8 is the default charset that TIdHTTP uses for submitting a TStringList object. The real issue is that XML should not be submitted using a TStringList to begin with, even with a proper charset. The reason is because the TIdHTTP.Post(TStrings) method implements the application/x-www-form-urlencoded content type, and thus url-encodes the TStringList content, which can break XML if the receiver is not expecting that. So unless the receiver is actually expecting a real application/x-www-form-urlencoded encoded request, XML should be transmitted using the TIdHTTP.Post(TStream) method instead so the raw XML bytes are preserved as-is.
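A rough sketch of the stream-based alternative, assuming the receiver can be changed to read the raw request body instead of an XML= form field. MakeUtf8Body and PostXmlAsUtf8 are hypothetical helper names, and the 'application/xml' content type is an assumption about what the server expects:

```delphi
uses
  System.Classes, System.SysUtils, IdHTTP;

// Hypothetical helper: holds the XML text as UTF-8 bytes (no BOM is written).
function MakeUtf8Body(const AXml: string): TStream;
begin
  Result := TStringStream.Create(AXml, TEncoding.UTF8);
end;

// Hypothetical helper: posts the XML as-is, with no url-encoding applied.
function PostXmlAsUtf8(AHttp: TIdHTTP; const AUrl, AXml: string): string;
var
  Body: TStream;
begin
  Body := MakeUtf8Body(AXml);
  try
    // Tell the receiver what the bytes are; adjust to what it expects.
    AHttp.Request.ContentType := 'application/xml';
    AHttp.Request.CharSet := 'utf-8';
    Result := AHttp.Post(AUrl, Body);
  finally
    Body.Free;
  end;
end;
```

The encoding named in the XML prolog should agree with the bytes actually sent, so a document that declares a different encoding would need its prolog adjusted before posting this way.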

Is a PChar UTF-8 coded?

I'm writing a tool which uses a C DLL. The functions of the C DLL expect a char*, which is in UTF-8 format.
My question: Can I pass a PChar or do I have to use UTF8Encode(string)?
Consider a string variable named s. On an ANSI Delphi PChar(s) is ANSI encoded. On a Unicode Delphi it is UTF-16 encoded.
Therefore, either way, you need to convert s to UTF-8 encoding. And then you can use PAnsiChar(...) to get a pointer to a null terminated C string.
So, the code you need looks like this:
PAnsiChar(UTF8Encode(s))
Please edit the question and add the tag with your target Delphi version.
Pass it as PAnsiChar; PChar is a joker and may mean different data types. When you work with a DLL-like API, you give up the compiler's safety net, and that means you should make your own. That means you should use real types, not jokers: types that will not change no matter which compiler settings and version are active.
But before passing the pointer, you should ensure that the source data is actually encoded in UTF-8.
var
  data: string;
  buffer: UTF8String;
  buffer_ptr: PAnsiChar;
begin
  buffer := data + #0;
  // transcoding to UTF8 from whatever charset it was, transparently done by Delphi RTL
  // last zero to ensure that even for empty string you would have valid pointer below
  buffer_ptr := Pointer(@buffer[1]); // making sure there can be no codepage bound to the datatype
  C_DLL_CALL(buffer_ptr);
end;
