Receiving Unicode strings with Indy 10 - delphi

I am using the latest Delphi 10.4.2 with Indy 10.
In a REST server, JSON commands are received and handled. It works fine except for Unicode.
A simple JSON like this:
{"driverNote": "Test"}
is shown correctly
If I now change to Unicode Russian characters:
{"driverNote": "Статья"}
I am not sure where to begin tracking this down. I expect ARequestInfo.FormParams to show the same value in the debugger as the s variable.
If I debug Indy itself, FormParams are set in this code:
if LRequestInfo.PostStream <> nil then
begin
  // decoding percent-encoded octets and applying the CharSet is handled by
  // DecodeAndSetParams() further below...
  EnsureEncoding(LEncoding, enc8Bit);
  LRequestInfo.FormParams :=
    ReadStringFromStream(LRequestInfo.PostStream,
                         -1,
                         LEncoding
                         {$IFDEF STRING_IS_ANSI}, LEncoding{$ENDIF});
  DoneWithPostStream(AContext, LRequestInfo); // don't need the PostStream anymore
end;
It uses enc8Bit, but my string has 16-bit characters.
Is this handled incorrectly in Indy?

The code snippet you quoted from IdCustomHTTPServer.pas is not what is in Indy's GitHub repo.
In the official code, TIdHTTPServer does not decode the PostStream to FormParams unless the ContentType is 'application/x-www-form-urlencoded':
if LRequestInfo.PostStream <> nil then begin
  if TextIsSame(LContentType, ContentTypeFormUrlencoded) then
  begin
    // decoding percent-encoded octets and applying the CharSet is handled by DecodeAndSetParams() further below...
    EnsureEncoding(LEncoding, enc8Bit);
    LRequestInfo.FormParams := ReadStringFromStream(LRequestInfo.PostStream, -1, LEncoding{$IFDEF STRING_IS_ANSI}, LEncoding{$ENDIF});
    DoneWithPostStream(AContext, LRequestInfo); // don't need the PostStream anymore
  end;
end;
That ContentType check was added way back in 2010, so I don't know why it is not present in your version.
In your example, the ContentType is 'application/json', so the raw JSON should be in the PostStream and the FormParams should be blank.
That being said, in your version of Indy, TIdHTTPServer is simply reading the raw bytes from the PostStream and zero-extending each byte to a 16-bit character in the FormParams. To recover the original bytes, simply truncate each Char to an 8-bit Byte. For instance, you can use Indy's ToBytes() function in the IdGlobal unit, specifying enc8Bit/IndyTextEncoding_8Bit as the byte encoding.
JSON is most commonly transmitted as UTF-8 (and that is the case in your example), so when you have access to the raw bytes, in any version, make sure you parse the JSON bytes as UTF-8.
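The recovery step described above can be sketched roughly as follows. This is only an illustration for a Unicode compiler: RecoverUtf8Text is a hypothetical helper name, and the sample string simulates a FormParams value in which every Char holds one raw UTF-8 byte.

```pascal
program RecoverJsonText;

{$APPTYPE CONSOLE}

uses
  IdGlobal; // Indy 10: ToBytes(), BytesToString(), IndyTextEncoding_*()

// Hypothetical helper (not part of Indy). AZeroExtended is assumed to hold
// one Char per original byte, as in the affected FormParams value.
function RecoverUtf8Text(const AZeroExtended: string): string;
var
  LBytes: TIdBytes;
begin
  // Truncate each 16-bit Char back to the 8-bit byte it was created from.
  LBytes := ToBytes(AZeroExtended, IndyTextEncoding_8Bit);
  // Decode the recovered bytes as UTF-8.
  Result := BytesToString(LBytes, IndyTextEncoding_UTF8);
end;

var
  LMangled: string;
begin
  // The UTF-8 bytes D0 A1 D1 82 (the first two letters of the Russian
  // sample), each stored in its own Char. The 4-digit literals keep the
  // compiler from treating them as Ansi characters.
  LMangled := #$00D0#$00A1#$00D1#$0082;
  // Compare against the expected code points U+0421 U+0442.
  Writeln(RecoverUtf8Text(LMangled) = #$0421#$0442);
end.
```

With the official Indy code, where the JSON stays in the PostStream, the same idea applies without the truncation step: the stream bytes are decoded as UTF-8 directly.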

Related

TIdTCPClient.IOHandler.Write(TStream) cannot send Big5?

Through TCPClient.IOHandler.Write(StmMsg);, the message is delivered to the frontend. English text is fine, but Big5 text cannot be delivered. Why?
(StmMsg: TStringStream, the program has added... TCPClient.IOHandler.DefStringEncoding:= IndyTextEncoding_UTF8;)
The following is the code:
if not TCPClient.Connected then
  TCPClient.Connect;
deviceToken := '6aa5bfcfe731ab29b260fab38a43f1e1abac0de3d6e8e0bc5f4b89c422938e8f';
MensajeEnviar := edtMensaje.Text;
strMessage := Get_Msg(deviceToken, Get_PayLoad(MensajeEnviar, 1, 'default'));
StmMsg := TStringStream.Create(strMessage);
StmMsg.Seek(0, soBeginning);
TCPClient.IOHandler.Write(StmMsg);
Big5 is not a language. It is a byte encoding used for Chinese.
The TIdIOHandler.DefStringEncoding property applies only to string operations, not to stream operations. The TIdIOHandler.Write(TStream) method writes the content of a stream as-is. So, it is your responsibility to make sure the contents of the stream are encoded properly beforehand.
However, the TStringStream constructor you are calling uses TEncoding.Default for the stream's byte encoding. On Windows [1], TEncoding.Default represents the default ANSI charset of the user that is running your program. An ANSI charset will not work for Chinese text, and will lose data.
[1]: On non-Windows platforms, TEncoding.Default uses UTF-8 instead.
You need to use TEncoding.UTF8 instead for the stream's byte encoding, eg:
StmMsg := TStringStream.Create(strMessage, TEncoding.UTF8);
Alternatively, you can remove the stream altogether and just use the TIdIOHandler.Write(String) method instead, which will then use the TIdIOHandler.DefStringEncoding property, eg:
TCPClient.IOHandler.Write(strMessage);

Delphi7 Base64 encode UTF8 XML

I'm still using Delphi 7 (I know) and I need to encode a UTF-8 XML document in Base64 format.
I create the XML using IXMLDocument, which supports UTF-8 (that is, if I save to a file).
Since I'm using Indy10 to HTTP Post the XML request, I tried using TIdEncoderMIME to Base64 encode the XML. But some UTF8 chars are not encoded well.
Try1:
XMLText := XML.XML.Text;
EncodedXML := TIdEncoderMIME.EncodeBytes(ToBytes(XMLText));
In the above case most probably some UTF8 information/characters are already lost when the XML is saved to a string.
Try2:
XMLStream := TMemoryStream.Create;
XML.SaveToStream(XMLStream);
EncodedXML := TIdEncoderMIME.EncodeStream(XMLStream);
//or
EncodedXML := TIdEncoderMIME.EncodeStream(XMLStream, XMLStream.Size);
Both of the above give back EncodedXML = '' (empty string).
What am I doing wrong?
Try using the TIdEncoderMIME.EncodeString() method instead. It has an AByteEncoding parameter that you can use to specify the desired byte encoding that Indy should encode the string characters as, such as UTF-8, before it then base64 encodes the resulting bytes:
XMLText := XML.XML.Text;
EncodedXML := TIdEncoderMIME.EncodeString(XMLText, IndyTextEncoding_UTF8);
Also note that in Delphi 2007 and earlier, where string is AnsiString, there is also an optional ASrcEncoding parameter that you can use to specify the encoding of the AnsiString (for instance, if it is already UTF-8), so that it can be decoded to Unicode properly before then being encoded to the specified byte encoding (or, in the case where the two encodings are the same, the AnsiString can be base64 encoded as-is):
XMLText := XML.XML.Text;
EncodedXML := TIdEncoderMIME.EncodeString(XMLText, IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);
You are getting data loss when using EncodeBytes() because you are using ToBytes() without specifying any encoding parameters for it. ToBytes() has similar AByteEncoding and ASrcEncoding parameters.
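If you prefer to keep EncodeBytes(), a minimal sketch of passing the byte encoding to ToBytes() could look like this. Base64OfUtf8 is a hypothetical helper name, and the code as written targets a Unicode compiler; the comment shows the extra parameter for Ansi compilers such as Delphi 7.

```pascal
program Base64Utf8Demo;

{$APPTYPE CONSOLE}

uses
  IdGlobal,    // ToBytes(), IndyTextEncoding_UTF8
  IdCoderMIME; // TIdEncoderMIME

// Hypothetical helper (not part of Indy).
function Base64OfUtf8(const AText: string): string;
begin
  // Convert the characters to UTF-8 bytes first, then base64 encode them.
  // On Ansi compilers (Delphi 2007 and earlier), where AText already holds
  // UTF-8 bytes, also pass the source encoding:
  //   ToBytes(AText, IndyTextEncoding_UTF8, IndyTextEncoding_UTF8)
  Result := TIdEncoderMIME.EncodeBytes(ToBytes(AText, IndyTextEncoding_UTF8));
end;

begin
  // U+00E9 is two bytes in UTF-8 (C3 A9), so the base64 text covers
  // two bytes rather than one.
  Writeln(Base64OfUtf8(#$00E9));
end.
```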
In the case where you tried to encode a TMemoryStream, you simply forgot to reset the stream's Position back to 0 after calling SaveToStream(), so there was nothing for EncodeStream() to encode. That is why it returned a blank base64 string:
XMLStream := TMemoryStream.Create;
try
XML.SaveToStream(XMLStream);
XMLStream.Position := 0; // <-- add this
EncodedXML := TIdEncoderMIME.EncodeStream(XMLStream);
finally
XMLStream.Free;
end;

Delphi tidhttp encoding special characters

I have upgraded an app from D2007 to XE6. It posts data to a webserver.
I cannot work out what encoding will send the left and right quote characters correctly (code snippet below). I have tried every option I can find, but they get encoded as ? when sent (as far as I can see in WireShark).
D2007 had no problem, but XE6 is all about Unicode, and I am not sure if the problem is encoding or codepages or what.
Params := TIdMultipartFormDataStream.Create;
params.AddFormField('TEST', 'Test ‘n’ Try', 'utf8').ContentTransfer := '8bit';
IdHTTP1.Request.ContentType := 'text/plain';
IdHTTP1.Request.Charset := 'utf-8';
IdHTTP1.Post('http://test.com.au/TestEncoding.php', Params, Stream);
When calling params.AddFormField(), you are setting the charset to 'utf8', which is not a valid charset name. The official charset name is 'utf-8' instead:
params.AddFormField('TEST', 'Test ‘n’ Try', 'utf-8').ContentTransfer := '8bit';
When compiling for Unicode, an invalid charset ends up using Indy's built-in 8bit encoder, which encodes Unicode codepoints > U+00FF as byte 0x3F ('?'). The quote characters you are using, ‘ and ’, are codepoints U+2018 and U+2019, respectively.
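To see the difference at the byte level, a small console sketch along these lines can help. It assumes a Unicode compiler, and DumpBytes is a hypothetical helper written only for this illustration.

```pascal
program QuoteBytes;

{$APPTYPE CONSOLE}

uses
  SysUtils, // IntToHex()
  IdGlobal; // ToBytes(), IndyTextEncoding_8Bit, IndyTextEncoding_UTF8

// Hypothetical helper: prints the bytes as two-digit hex values.
procedure DumpBytes(const ALabel: string; const ABytes: TIdBytes);
var
  I: Integer;
begin
  Write(ALabel, ':');
  for I := 0 to Length(ABytes) - 1 do
    Write(' ', IntToHex(ABytes[I], 2));
  Writeln;
end;

var
  LQuote: string;
begin
  LQuote := #$2018; // left single quotation mark
  // The 8bit encoder cannot represent U+2018; per the explanation above,
  // it substitutes byte 3F ('?').
  DumpBytes('8bit', ToBytes(LQuote, IndyTextEncoding_8Bit));
  // UTF-8 represents U+2018 as the three bytes E2 80 98.
  DumpBytes('utf-8', ToBytes(LQuote, IndyTextEncoding_UTF8));
end.
```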
The reason you do not encounter this issue in D2007 is because the TIdFormDataField.Charset property is ignored for encoding purposes when compiling for Ansi. The TIdFormDataField.FieldValue property is an AnsiString, and its raw bytes get transmitted as-is, so you are required to ensure it is encoded properly before adding it to TIdMultipartFormDataStream, eg:
params.AddFormField('TEST', UTF8Encode('Test ‘n’ Try'), 'utf-8').ContentTransfer := '8bit';
On a side note, you do not need to set the Request.ContentType or Request.Charset properties when posting a TIdMultipartFormDataStream (and especially since 'text/plain' is an invalid content type for a MIME post anyway). This version of Post() will set those properties for you:
Params := TIdMultipartFormDataStream.Create;
params.AddFormField(...);
IdHTTP1.Post('http://test.com.au/TestEncoding.php', Params, Stream);

TIdHTTP character encoding of POST response

Take the following situation:
procedure Test;
var
Response : String;
begin
Response := IdHttp.Post(MyUrL, AStream);
DoSomethingWith(Response);
end;
Now the webserver returns me data in UTF-8.
Suppose it returns me some UTF-8 XML containing the character é.
If I use the variable Response, it does not contain this character but its UTF-8 variant (#C3#A9), so Indy did no decoding?
Now I know how to solve this problem:
procedure Test;
var
Response : String;
begin
Response := UTF8ToString(IdHttp.Post(MyUrL, AStream));
DoSomethingWith(Response);
end;
One caveat with this solution: Delphi raises warning W1058 (Implicit string cast with potential data loss from 'string' to 'RawByteString').
My question: is this the correct way to deal with this problem, or can I instruct TIdHTTP to do the conversion to UnicodeString for me?
If you are using an up-to-date version of Indy 10, then the overloaded version of TIdHTTP.Post() that returns a String does decode the data to Unicode, however the actual charset used for the decoding depends on what media type the HTTP Content-Type response header specifies:
1. If the media type is either application/xml, application/xml-external-parsed-entity, application/xml-dtd, or is not a text/... type but does end with +xml, then the charset specified in the encoding attribute of the XML's prolog is used. If no charset is specified, UTF-8 is used.
2. Otherwise, if the Content-Type response header specifies a charset, then it is used.
3. Otherwise, if the media type is a text/... type, then:
   a. if the media type is text/xml, text/xml-external-parsed-entity, or ends with +xml, then us-ascii is used.
   b. otherwise, ISO-8859-1 is used.
4. Otherwise, Indy's default encoding (ASCII by default) is used.
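The rules above can be restated as a small decision function. This is only an illustrative sketch and not Indy's actual implementation: PickCharset and its parameters are hypothetical, the media type is assumed to be passed without parameters, and an empty result stands for "use Indy's default encoding".

```pascal
program CharsetRules;

{$APPTYPE CONSOLE}

uses
  SysUtils, // SameText()
  StrUtils; // StartsText(), EndsText()

// Hypothetical helper (not Indy code).
//   AMediaType        - e.g. 'text/html', without any parameters
//   AHeaderCharset    - charset from the Content-Type header, '' if absent
//   AXmlPrologCharset - encoding from the XML prolog, '' if absent
function PickCharset(const AMediaType, AHeaderCharset,
  AXmlPrologCharset: string): string;
var
  LIsText, LIsPlusXml: Boolean;
begin
  LIsText := StartsText('text/', AMediaType);
  LIsPlusXml := EndsText('+xml', AMediaType);

  if SameText(AMediaType, 'application/xml') or
     SameText(AMediaType, 'application/xml-external-parsed-entity') or
     SameText(AMediaType, 'application/xml-dtd') or
     (LIsPlusXml and not LIsText) then
  begin
    // Rule 1: the XML prolog decides, with UTF-8 as the fallback.
    if AXmlPrologCharset <> '' then
      Result := AXmlPrologCharset
    else
      Result := 'utf-8';
  end
  else if AHeaderCharset <> '' then
    // Rule 2: an explicit charset in the Content-Type header is used.
    Result := AHeaderCharset
  else if LIsText then
  begin
    // Rule 3: text/... types without an explicit charset.
    if SameText(AMediaType, 'text/xml') or
       SameText(AMediaType, 'text/xml-external-parsed-entity') or
       LIsPlusXml then
      Result := 'us-ascii'   // rule 3a
    else
      Result := 'iso-8859-1'; // rule 3b
  end
  else
    // Rule 4: the caller falls back to Indy's default encoding.
    Result := '';
end;

begin
  Writeln(PickCharset('text/html', '', ''));      // rule 3b
  Writeln(PickCharset('text/html', 'utf-8', '')); // rule 2
  Writeln(PickCharset('application/xml', '', '')); // rule 1 fallback
end.
```

A UTF-8 body served as text/html with no charset attribute lands in rule 3b, which is how é can come back as the two characters #C3#A9.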
Without seeing the actual HTTP Content-Type header, it is hard to know which condition your situation falls into. It sounds like it is falling into either #2 or #3b, which would account for the UTF-8 byte values being returned as-is, if ISO-8859-1 or similar charset is being used.
UTF8ToString() expects a UTF-8 encoded RawByteString as input, but you are passing it a UTF-16 encoded UnicodeString instead. The RTL will perform a UTF16->Ansi conversion in that situation, using a default Ansi charset for the conversion. That is why you get the compiler warning, because such a conversion can lose data.
XML is really a binary data format, subject to charset encodings. An XML parser needs to know what the XML's encoding is, and be able to parse the raw encoded bytes accordingly. That is why XML has an explicit encoding attribute right in the XML prolog. However, when TIdHTTP downloads XML as a String, although it does automatically decode it to Unicode, it does not yet update the XML's prolog accordingly.
The real solution is to not download XML as a String in the first place. Download it as a TStream instead (TMemoryStream is a better choice than TStringStream) so your XML parser has access to the original bytes, the original charset declaration, etc. You can pass the TStream to the TXMLDocument.LoadFromStream() method, for instance.
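The stream-based approach described above might look roughly like this. It is a sketch only: PostAndLoadXml is a hypothetical helper, the caller is assumed to supply a configured TIdHTTP and an IXMLDocument (for example from NewXMLDocument), and in XE2 and later the XML unit is named Xml.XMLIntf.

```pascal
uses
  Classes, // TStream, TMemoryStream
  IdHTTP,  // TIdHTTP
  XMLIntf; // IXMLDocument (Xml.XMLIntf in XE2 and later)

// Hypothetical helper: posts ARequest and hands the raw response bytes to
// the XML parser, so the parser can honor the prolog's encoding itself.
procedure PostAndLoadXml(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream; const ADoc: IXMLDocument);
var
  LResponse: TMemoryStream;
begin
  LResponse := TMemoryStream.Create;
  try
    // Receive the response body as bytes, with no charset decoding.
    AHttp.Post(AUrl, ARequest, LResponse);
    // Rewind before handing the stream to the parser.
    LResponse.Position := 0;
    ADoc.LoadFromStream(LResponse);
  finally
    LResponse.Free;
  end;
end;
```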
You can do this:
var
sstream: TStringStream;
begin
sstream := TStringStream.Create('', TEncoding.UTF8);
try
IdHttp.Post(MyUrL, AStream, sstream);
DoSomethingWith(sstream.DataString);
finally
sstream.Free;
end;
end;

Can TidHttpServer (Delphi XE2) handle urlencoded characters?

I have a TidHttpServer listening to port 8844 with the following code:
procedure TForm1.IdHTTPServer1CommandGet(AContext: TIdContext;
ARequestInfo: TIdHTTPRequestInfo; AResponseInfo: TIdHTTPResponseInfo);
begin
if ARequestInfo.Document <> '/favicon.ico' then
begin
Memo1.Text := ARequestInfo.Params.Text;
end;
end;
This is compiled with Delphi XE2. When I browse to
http://localhost:8844/document?Value=%F6 <-- %F6 is the encoded value for ö
...I get the result:
value=?
If I compile the application using Delphi 2007, I get the following result:
value=ö
Is this a bug in Indy or something that I have missed?
In XE2, strings are Unicode. When TIdHTTPServer decodes the ARequestInfo.Document in D2009 and later, it requires percent-encoded data to decode into UTF-8 encoded data, which is then decoded into the final Unicode string. There is currently no option to change that (I have submitted a feature request to our issue trackers for it). %F6 does not represent a valid UTF-8 octet, which is why you end up with '?'. In UTF-8, the 'ö' character would be UTF-8 encoded as $C3 $B6 and thus percent-encoded as %C3%B6, not %F6.
In D2007, strings are Ansi. When TIdHTTPServer decodes the ARequestInfo.Document in D2007 and earlier, it provides the decoded data as-is, thus %F6 would decode into $F6 and be stored as #246. That value is then interpreted by the RTL using whatever the local machine's default Ansi codepage is, so it would represent the 'ö' character only for Ansi codepages that define it that way (Windows-1252 and ISO-8859-1 do, but ISO-8859-5 does not, for example).
I would suggest you change your server logic to use UTF-8 encoded URLs in both Delphi versions. In D2007, you can use the RTL's UTF8Decode() function to decode a UTF-8 encoded AnsiString into a WideString, which you can then assign to another AnsiString to convert the data into the Ansi value you were originally expecting. In D2009+, that is handled automatically for you.
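For the D2007 side, the conversion described above could be sketched like this. Utf8ParamToAnsi is a hypothetical helper name, and the code is meant only for Ansi compilers (Delphi 2007 and earlier), where the parameter value is assumed to hold the raw UTF-8 bytes.

```pascal
// Delphi 2007 and earlier only (string = AnsiString).
// Hypothetical helper: converts a UTF-8 encoded parameter value into the
// machine's default Ansi codepage.
function Utf8ParamToAnsi(const AUtf8Value: AnsiString): AnsiString;
var
  LWide: WideString;
begin
  // UTF-8 bytes -> UTF-16
  LWide := UTF8Decode(AUtf8Value);
  // UTF-16 -> default Ansi codepage (characters outside it are lost)
  Result := LWide;
end;
```

In the event handler this might be used as Memo1.Text := Utf8ParamToAnsi(ARequestInfo.Params.Values['Value']); for a request such as ?Value=%C3%B6.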
On a side note, accessing a UI component directly in the OnCommandGet event is not thread-safe. You have to synchronize with the main thread in order to access the UI safely.
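One possible way to do that in XE2 is TThread.Queue() with an anonymous method, sketched below for the handler from the question. This is just one option (TThread.Synchronize() or Indy's own TIdSync/TIdNotify classes are alternatives), and it assumes the form outlives any queued calls.

```pascal
// Requires Classes (System.Classes) in the uses clause for TThread.
procedure TForm1.IdHTTPServer1CommandGet(AContext: TIdContext;
  ARequestInfo: TIdHTTPRequestInfo; AResponseInfo: TIdHTTPResponseInfo);
var
  LText: string;
begin
  if ARequestInfo.Document <> '/favicon.ico' then
  begin
    // Copy the data while still in the server's worker thread.
    LText := ARequestInfo.Params.Text;
    // Hand the UI update over to the main thread without blocking.
    TThread.Queue(nil,
      procedure
      begin
        Memo1.Text := LText;
      end);
  end;
end;
```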
