Encoding problem while processing a multipart request on Indy HTTP server - delphi

I have a web server based on TIdHTTPServer, built in Delphi Sydney. From a webpage I'm receiving the following multipart/form-data post stream:
-----------------------------16857441221270830881532229640
Content-Disposition: form-data; name="d"
83AAAFUaVVs4Q07z
-----------------------------16857441221270830881532229640
Content-Disposition: form-data; name="dir"
Upload
-----------------------------16857441221270830881532229640
Content-Disposition: form-data; name="file_name"; filename="česká tečka.png"
Content-Type: image/png
PNG_DATA
-----------------------------16857441221270830881532229640--
The problem is that the text parts are not received correctly. I read the question "Indy MIME decoding of Multipart/Form-Data Requests returns trailing CR/LF" and changed the transfer encoding to 8bit, which helps to receive the file correctly, but the received values are still wrong (dir should be Upload and the filename should be česká tečka.png):
d=83AAAFUaVVs4Q07z
dir=UploadW
??esk?? te??ka.png 75
To demonstrate the issue I simplified my code to a console app (please note that the MIME.txt file contains the same data as the post stream above):
program MIMEMultiPartTest;
{$APPTYPE CONSOLE}
{$R *.res}
uses
System.Classes, System.SysUtils,
IdGlobal, IdCoder, IdMessage, IdMessageCoder, IdGlobalProtocols, IdCoderMIME, IdMessageCoderMIME,
IdCoderQuotedPrintable, IdCoderBinHex4;
procedure ProcessAttachmentPart(var Decoder: TIdMessageDecoder; var MsgEnd: Boolean);
var
MS: TMemoryStream;
Name: string;
Value: string;
NewDecoder: TIdMessageDecoder;
begin
MS := TMemoryStream.Create;
try
// http://stackoverflow.com/questions/27257577/indy-mime-decoding-of-multipart-form-data-requests-returns-trailing-cr-lf
TIdMessageDecoderMIME(Decoder).Headers.Values['Content-Transfer-Encoding'] := '8bit';
TIdMessageDecoderMIME(Decoder).BodyEncoded := False;
NewDecoder := Decoder.ReadBody(MS, MsgEnd);
MS.Position := 0; // necessary?
if Decoder.Filename <> EmptyStr then // it is an attachment
begin
try
Writeln(Decoder.Filename + ' ' + IntToStr(MS.Size));
except
FreeAndNil(NewDecoder);
Writeln('Error processing MIME');
end;
end
else // it is a parameter
begin
Name := ExtractHeaderSubItem(Decoder.Headers.Text, 'name', QuoteHTTP);
if Name <> EmptyStr then
begin
Value := string(PAnsiChar(MS.Memory));
try
Writeln(Name + '=' + Value);
except
FreeAndNil(NewDecoder);
Writeln('Error processing MIME');
end;
end;
end;
Decoder.Free;
Decoder := NewDecoder;
finally
MS.Free;
end;
end;
function ProcessMultiPart(const ContentType: string; Stream: TStream): Boolean;
var
Boundary: string;
BoundaryStart: string;
BoundaryEnd: string;
Decoder: TIdMessageDecoder;
Line: string;
BoundaryFound: Boolean;
IsStartBoundary: Boolean;
MsgEnd: Boolean;
begin
Result := False;
Boundary := ExtractHeaderSubItem('multipart/form-data; boundary=---------------------------16857441221270830881532229640', 'boundary', QuoteHTTP);
if Boundary <> EmptyStr then
begin
BoundaryStart := '--' + Boundary;
BoundaryEnd := BoundaryStart + '--';
Decoder := TIdMessageDecoderMIME.Create(nil);
try
TIdMessageDecoderMIME(Decoder).MIMEBoundary := Boundary;
Decoder.SourceStream := Stream;
Decoder.FreeSourceStream := False;
BoundaryFound := False;
IsStartBoundary := False;
repeat
Line := ReadLnFromStream(Stream, -1, True);
if Line = BoundaryStart then
begin
BoundaryFound := True;
IsStartBoundary := True;
end
else
begin
if Line = BoundaryEnd then
BoundaryFound := True;
end;
until BoundaryFound;
if BoundaryFound and IsStartBoundary then
begin
MsgEnd := False;
repeat
TIdMessageDecoderMIME(Decoder).MIMEBoundary := Boundary;
Decoder.SourceStream := Stream;
Decoder.FreeSourceStream := False;
Decoder.ReadHeader;
case Decoder.PartType of
mcptText,
mcptAttachment:
begin
ProcessAttachmentPart(Decoder, MsgEnd);
end;
mcptIgnore:
begin
Decoder.Free;
Decoder := TIdMessageDecoderMIME.Create(nil);
end;
mcptEOF:
begin
Decoder.Free;
MsgEnd := True;
end;
end;
until (Decoder = nil) or MsgEnd;
Result := True;
end
finally
Decoder.Free;
end;
end;
end;
var
Stream: TMemoryStream;
begin
Stream := TMemoryStream.Create;
try
Stream.LoadFromFile('MIME.txt');
ProcessMultiPart('multipart/form-data; boundary=---------------------------16857441221270830881532229640', Stream);
finally
Stream.Free;
end;
Readln;
end.
Could someone help me find what is wrong with my code? Thank you.

Your call to ExtractHeaderSubItem() in ProcessMultiPart() is wrong, it needs to pass in the ContentType string parameter, not a hard-coded string literal.
Your call to ExtractHeaderSubItem() in ProcessAttachmentPart() is also wrong, it needs to pass in only the content of just the Content-Disposition header, not the entire Headers.Text. ExtractHeaderSubItem() is designed to only operate on 1 header at a time.
Regarding the dir MIME part, the reason the body data ends up as 'UploadW' instead of 'Upload' is because you are not taking MS.Size into account when assigning MS.Memory to your Value string. The TMemoryStream data is NOT null-terminated! So, you will need to use SetString() instead of the := operator, eg:
var
Value: AnsiString;
...
SetString(Value, PAnsiChar(MS.Memory), MS.Size);
Regarding the Decoder.FileName, that value is not affected by the Content-Transfer-Encoding header at all. MIME headers simply do not allow unencoded Unicode characters. Currently, Indy's MIME decoder supports RFC 2047-style encodings for Unicode characters in headers, per RFC 7578 Section 5.1.3, but your stream data is not using that format. It looks like your data is using raw UTF-8 octets [1] (which 5.1.3 also mentions as a possible encoding, but which the decoder does not currently look for). So, you may have to manually extract and decode the original filename yourself as needed.
If you know the filename will always be encoded as UTF-8, you could try setting Indy's global IdGlobal.GIdDefaultTextEncoding variable to encUTF8 (it defaults to encASCII), and then the Decoder.FileName should be accurate. But, that is a global setting, so it may have unwanted side effects elsewhere in Indy, depending on context and data.
So, I would suggest setting GIdDefaultTextEncoding to enc8Bit instead, so that unwanted side effects are minimized, and the Decoder.FileName will contain the original raw bytes as-is (just extended to 16-bit chars). That way, you can recover the original filename bytes by simply passing the Decoder.FileName as-is to IndyTextEncoding_8Bit.GetBytes(), and then decode them as needed (such as with IndyTextEncoding_UTF8.GetString(), after validating the bytes are valid UTF-8).
[1]: However, ÄŤeská teÄŤka.png is not the correct UTF-8 form of česká tečka.png. It looks like that data may have been double-encoded, ie česká tečka.png was UTF-8 encoded, and then the resulting bytes were UTF-8 encoded again.
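The enc8Bit recovery step described above can be sketched roughly as follows. This is a minimal console sketch, assuming Indy's IdGlobal unit is on the search path and that GIdDefaultTextEncoding was set to enc8Bit before decoding, so that each Char of the name carries exactly one raw octet. RecoverUtf8FileName is a hypothetical helper name, and the UTF-8 validation step mentioned above is deliberately left out.

```delphi
program RecoverFileNameDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, IdGlobal;

// Hypothetical helper (not part of Indy): ARawName is assumed to hold one raw
// octet per Char, as produced when headers are decoded with enc8Bit.
function RecoverUtf8FileName(const ARawName: string): string;
var
  Bytes: TIdBytes;
begin
  // Narrow each 16-bit Char back to the original octet...
  Bytes := IndyTextEncoding_8Bit.GetBytes(ARawName);
  // ...then decode the octets as UTF-8. No validation is done in this sketch.
  Result := IndyTextEncoding_UTF8.GetString(Bytes);
end;

var
  Raw: string;
begin
  // Raw octets C4 8D (U+010D) and C3 A1 (U+00E1), one octet per Char.
  // Four-digit literals are used so the compiler treats them as WideChar.
  Raw := #$00C4#$008D'esk'#$00C3#$00A1'.png';
  // Compares the recovered name with the expected Unicode text.
  Writeln(RecoverUtf8FileName(Raw) = #$010D'esk'#$00E1'.png');
end.
```

In the question's code, the same helper would be applied to Decoder.FileName rather than to a literal.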

Nowadays the filename parameter should only be added as a fallback, while filename* should be added to state clearly which text encoding the filename uses. Otherwise each client can only guess, and may guess wrong.
RFC 5987 §3.2 defines the format of that filename* parameter:
charset ' [ language ] ' value-chars
...whereas:
charset can be UTF-8 or ISO-8859-1 or any MIME-charset
...and the language is optional.
RFC 6266 §4.3 defines that filename* should be used and comes up with examples in §5:
Content-Disposition: attachment; filename="EURO rates"; filename*=utf-8''%e2%82%ac%20rates
Do you spot the asterisk *? Do you spot the text encoding utf-8? Do you spot the two apostrophes '', designating no further specified language (see RFC 5646 § 2.1)? And then come the octets according to the specified text encoding: either percent-encoded, or (if allowed) in plain ASCII.
Other examples:
Content-Disposition: attachment; filename="green.jpg"; filename*=UTF-8''%e3%82%b0%e3%83%aa%e3%83%bc%e3%83%b3.jpg
will present "green.jpg" on older web browsers and "グリーン.jpg" on compliant web browsers.
Content-Disposition: attachment; filename="Gruesse.txt"; filename*=ISO-8859-1''Gr%fc%dfe.txt
will present "Gruesse.txt" on older web browsers and "Grüße.txt" on compliant web browsers.
Content-Disposition: attachment; filename="Hello.png"; filename*=Shift_JIS'en-US'Howdy.png; filename*=EUC-KR'de'Hallo.png
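will present "Hello.png" on older web browsers, and "Howdy.png" on compliant web browsers where the preferred language is set to American English, and "Hallo.png" on compliant ones with a preferred language of German (Deutsch). Note that the different text encodings do not require percent encoding as long as the octets are within the allowed range (and Latin letters are, along with the dot). Be aware, though, that RFC 6266 §4.1 treats a header with multiple instances of the same parameter name as invalid, so this last example illustrates the language tag rather than a form to send as-is.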
In my experience nobody cares about this nice feature: everybody just shoves UTF-8 into filename, which still violates the standard, no matter how many clients silently support it. See also: "How to encode the filename parameter of Content-Disposition header in HTTP?" and "PHP: RFC-2231 How to encode UTF-8 String as Content-Disposition filename".
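To illustrate the filename* format described above, here is a small console sketch that builds an ext-value with the UTF-8 charset and an empty language tag, percent-encoding every octet that is not in the RFC 5987 attr-char set. EncodeExtValue is a hypothetical helper name (it is not part of the RTL or Indy), and it uses only the RTL.

```delphi
program FileNameStarDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: returns UTF-8''<value>, where octets outside the
// RFC 5987 attr-char set are percent-encoded.
function EncodeExtValue(const AValue: string): string;
const
  AttrChars = ['A'..'Z', 'a'..'z', '0'..'9',
               '!', '#', '$', '&', '+', '-', '.', '^', '_', '`', '|', '~'];
var
  Bytes: TBytes;
  B: Byte;
begin
  // charset, then two apostrophes enclosing the (empty) language tag
  Result := 'UTF-8''''';
  Bytes := TEncoding.UTF8.GetBytes(AValue);
  for B in Bytes do
    if AnsiChar(B) in AttrChars then
      Result := Result + Char(B)              // allowed octet, emitted as-is
    else
      Result := Result + '%' + IntToHex(B, 2); // everything else as %XX
end;

begin
  // #$20AC is the Euro sign; the plain filename stays as the ASCII fallback.
  Writeln('Content-Disposition: attachment; filename="EURO rates"; filename*=' +
    EncodeExtValue(#$20AC' rates'));
  // filename*=UTF-8''%E2%82%AC%20rates
end.
```

The hex digits come out in upper case here, while the RFC example uses lower case. Both forms are valid percent-encoding.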

Related

Indy Http Server delivers Javascript with Syntax errors

I am trying to use Indy to serve Javascript (deploying a Swagger UI to render API documentation).
procedure TfmMain.SendJavaScriptFileResponse(AResponseInfo: TIdHTTPResponseInfo; AFileName: String);
begin
AResponseInfo.ContentType := 'application/javascript';
AResponseInfo.CharSet := 'utf-8';
var LFileContents := TStringList.Create;
try
LFileContents.LoadFromFile(AFileName);
AResponseInfo.ContentText := LFileContents.Text;
finally
LFileContents.Free;
end;
end;
When the browser receives the Javascript and attempts to run it, I get a syntax error:
Uncaught SyntaxError: illegal character U+20AC
The response headers received from the Indy IdHTTPServer look like this:
HTTP/1.1 200 OK
Connection: close
Content-Encoding: utf-8
Content-Type: application/javascript; charset=utf-8
Content-Length: 1063786
Date: Sun, 05 Feb 2023 20:45:56 GMT
However, when I serve the exact same Javascript files via my hosted website, the Javascript runs fine in the browser with no errors.
Is there a setting or character set I need to use when sending Javascript files using the Indy HTTP server?
You are loading the Javascript from a file into a string, and then you are sending that string to the client. That requires 2 data conversions at runtime - from the file's encoding to UTF-16 in memory, and from UTF-16 to the specified AResponseInfo.Charset on the data transmission to the client. Either one of those conversions can fail if you are not careful.
In memory, a string in Delphi 2009+ is always UTF-16 encoded, but you are not specifying the file's encoding when loading the file into the TStringList. So, if the file uses an encoding other than ASCII (say, UTF-8), does not have a BOM, and contains any non-ASCII characters (say, the Euro sign €), then TStringList WILL NOT decode the file into UTF-16 correctly. In which case, you MUST specify the file's actual encoding, eg:
procedure TfmMain.SendJavaScriptFileResponse(
AResponseInfo: TIdHTTPResponseInfo;
const AFileName: String);
begin
AResponseInfo.ContentType := 'application/javascript';
AResponseInfo.CharSet := 'utf-8';
var LFileContents := TStringList.Create;
try
LFileContents.LoadFromFile(AFileName, TEncoding.UTF8); // <-- HERE
AResponseInfo.ContentText := LFileContents.Text;
finally
LFileContents.Free;
end;
end;
Another option is to send the actual file itself, without having to load and decode it into memory first, eg:
procedure TfmMain.SendJavaScriptFileResponse(
AContext: TIdContext;
AResponseInfo: TIdHTTPResponseInfo;
const AFileName: String);
begin
AResponseInfo.ContentType := 'application/javascript';
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ServeFile(AContext, AFileName);
end;
Either way, utf-8 is not a valid value for the HTTP Content-Encoding header. Indy does not assign any value to that header by default, so you must be assigning it manually. Don't do that in this case.

Delphi & Indy & utf8

I have a problem accessing websites that use the UTF-8 charset. For example, when I try to access this URL (link text: "Click for example"), the UTF-8 characters are not decoded correctly.
This is my access routine:
var
Web : TIdHTTP;
Sito : String;
hIOHand : TIdSSLIOHandlerSocketOpenSSL;
begin
Url := TIdURI.URLEncode(Url);
try
Web := TIdHTTP.Create(nil);
hIOHand := TIdSSLIOHandlerSocketOpenSSL.Create(nil);
hIOHand.DefStringEncoding := IndyTextEncoding_UTF8;
hIOHand.SSLOptions.SSLVersions := [sslvTLSv1,sslvTLSv1_1,sslvTLSv1_2,sslvSSLv2,sslvSSLv3,sslvSSLv23];
Web.IOHandler := hIOHand;
Web.Request.CharSet := 'utf-8';
Web.Request.UserAgent := INET_USERAGENT; //Custom user agent string
Web.RedirectMaximum := INET_REDIRECT_MAX; //Maximum redirects
Web.HandleRedirects := INET_REDIRECT_MAX <> 0; //Handle redirects
Web.ReadTimeOut := INET_TIMEOUT_SECS * 1000; //Read timeout msec
try
Sito := Web.Get(Url);
Web.Disconnect;
except
on e : exception do
Sito := 'ERR: ' +Url+#32+e.Message;
end;
finally
Web.Free;
hIOHand.Free;
end;
I have tried every solution I could find, but the Sito variable always contains wrong characters. For example, the correct value of "name" is
"name": "Aire d'adhésion du Parc national du Mercantour",
but after the Get call I have
"name": "Aire d'adhÃ©sion du Parc national du Mercantour",
Do you have any idea where my error is?
Thank you all!
In Delphi 2009+, which includes XE6, string is a UTF-16 encoded UnicodeString.
You are using the overloaded version of TIdHTTP.Get() that returns a string. It decodes the sent text to UTF-16 using whatever charset is reported by the response. If the text is not decoding properly, it likely means the response is not reporting a correct charset. If the wrong charset is used, the text will not decode properly.
The URL in question is, in fact, sending a response Content-Type header that is set to application/json without specifying a charset at all. The default charset for application/json is UTF-8, but Indy does not know that, so it ends up using its own internal default instead, which is not UTF-8. That is why the text is not decoding properly when non-ASCII characters are present.
In which case, if you KNOW the charset will always be UTF-8, you have a few workarounds to choose from:
you can set Indy's default charset to UTF-8 by setting the global GIdDefaultTextEncoding variable in the IdGlobal unit:
GIdDefaultTextEncoding := encUTF8;
you can use the TIdHTTP.OnHeadersAvailable event to change the TIdHTTP.Response.Charset property to 'utf-8' if it is blank or incorrect.
Web.OnHeadersAvailable := CheckResponseCharset;
...
procedure TMyClass.CheckResponseCharset(Sender: TObject; AHeaders: TIdHeaderList; var VContinue: Boolean);
var
Response: TIdHTTPResponse;
begin
Response := TIdHTTP(Sender).Response;
if IsHeaderMediaType(Response.ContentType, 'application/json') and (Response.Charset = '') then
Response.Charset := 'utf-8';
VContinue := True;
end;
you can use the other overloaded version of TIdHTTP.Get() that fills an output TStream instead of returning a string. Using a TMemoryStream or TStringStream, you can decode the raw bytes yourself using UTF-8:
MStrm := TMemoryStream.Create;
try
Web.Get(Url, MStrm);
MStrm.Position := 0;
Sito := ReadStringFromStream(MStrm, -1, IndyTextEncoding_UTF8);
finally
MStrm.Free;
end;
Or, with a TStringStream:
SStrm := TStringStream.Create('', TEncoding.UTF8);
try
Web.Get(Url, SStrm);
Sito := SStrm.DataString;
finally
SStrm.Free;
end;

Authorization failure TIdHTTP over HTTPS when password is russian

I try to test my webservice with the TIdHTTP (Indy 10.6.0 and Delphi XE5) by this code:
GIdDefaultTextEncoding := encUTF8;
HTTP.IOHandler.DefStringEncoding := IndyTextEncoding_UTF8;
Http.Request.UserName := AUser;
Http.Request.Password := APass;
Http.Request.Accept := 'text/javascript';
Http.Request.ContentType := 'application/json';
Http.Request.ContentEncoding := 'utf-8';
Http.Request.URL := 'https://sameService';
Http.MaxAuthRetries := 1;
Http.Request.BasicAuthentication := True;
TIdSSLIOHandlerSocketOpenSSL(HTTP.IOHandler).SSLOptions.Method := sslvSSLv3;
HTTP.HandleRedirects := True;
"AUser" and "APass" are in UTF-8. When "APass" contains Russian characters, I can't log in.
With "HTTP Analyzer" I see:
...
Authorization: Basic cDh1c2VyOj8/Pz8/PzEyMw==
Decoding it from Base64 (base64decode.org) gives:
p8user:??????123
Why does DefStringEncoding not work?
TIdHTTP's authentication system has no concept of TIdIOHandler or its DefStringEncoding property.
Internally, TIdBasicAuthentication uses TIdEncoderMIME.Encode(), but without specifying any encoding. TIdEncoder.Encode() defaults to 8bit encoding, and thus is not affected by GIdDefaultTextEncoding.
If you need to send a UTF-8 encoded password with BASIC authentication, you will have to encode the UTF-8 data manually and store the resulting octets into a string, then the 8bit encoder can process the octets as-is, eg:
Http.Request.Password := BytesToStringRaw(IndyTextEncoding_UTF8.GetBytes(APass));
On the other hand, Indy's DIGEST authentication, for instance, uses TIdHashMessageDigest5.HashStringAsHex(), and TIdHash.HashString() does not default to any specific encoding, it depends on GIdDefaultTextEncoding.
So, you have to be careful about how you encode passwords, based on which authentications you use. To account for the discrepancy, what you could try is to not encode TIdHTTP.Request.Password itself, but to encode the password inside the TIdHTTP.OnAuthorization event when BASIC authentication is being used, eg:
Http.Request.Password := APass;
...
procedure TMyForm.HttpAuthorization(Sender: TObject;
Authentication: TIdAuthentication; var Handled: Boolean);
begin
if Authentication is TIdBasicAuthentication then
begin
Authentication.Password := BytesToStringRaw(IndyTextEncoding_UTF8.GetBytes(TheDesiredPasswordHere));
Handled := True;
end;
end;
UPDATE:
"Internally, TIdBasicAuthentication uses TIdEncoderMIME.Encode(), but without specifying any encoding."
That last part is no longer true. TIdBasicAuthentication was updated in 2016 to now pass an encoding to TIdEncoderMIME.Encode(). When an HTTP server asks for BASIC authentication, TIdBasicAuthentication now checks if the server's WWW-Authenticate header includes one of the following attributes: charset, accept-charset, encoding, or enc (in that order). If one is found, the specified charset is passed to Encode(), otherwise ISO-8859-1 is used (there is a TODO in the code to use UTF-8 if the username or password contain any characters that do not exist in ISO-8859-1).
If you want to ensure that UTF-8 is used in BASIC authentication, you are better off setting Request.BasicAuthentication to False and using the Request.CustomHeaders to supply your own Authorization header, eg:
Http.Request.BasicAuthentication := False;
Http.Request.CustomHeaders.Values['Authorization'] := 'Basic ' + TIdEncoderMIME.EncodeString(AUser + ':' + APass, IndyTextEncoding_UTF8);
Alternatively, you might be able to just get away with updating the protected TIdBasicAuthentication.FCharset member inside of the TIdHTTP.OnAuthorization event (which is fired after the server's WWW-Authenticate header has been parsed), eg:
Http.Request.Password := APass;
...
type
TIdBasicAuthenticationAccess = class(TIdBasicAuthentication)
end;
procedure TMyForm.HttpAuthorization(Sender: TObject;
Authentication: TIdAuthentication; var Handled: Boolean);
begin
if Authentication is TIdBasicAuthentication then
begin
TIdBasicAuthenticationAccess(Authentication).FCharset := 'utf-8';
Authentication.Password := TheDesiredPasswordHere;
Handled := True;
end;
end;
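For reference, the custom Authorization header approach shown above comes down to Base64-encoding the UTF-8 octets of user:password. The following console sketch does the same thing with the RTL only, assuming a Delphi version that ships System.NetEncoding (XE7 or later, so not the XE5 from the question). BuildBasicAuthHeader is a hypothetical helper name and the credentials are placeholders.

```delphi
program BasicAuthHeaderDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.NetEncoding;

// Hypothetical helper: builds the value of an Authorization header for BASIC
// authentication, encoding the credentials as UTF-8 before Base64.
function BuildBasicAuthHeader(const AUser, APass: string): string;
var
  Credentials: TBytes;
begin
  // Convert to UTF-8 octets first, so non-ASCII characters are not
  // replaced by '?' by a lossy 7-bit or 8-bit conversion.
  Credentials := TEncoding.UTF8.GetBytes(AUser + ':' + APass);
  Result := 'Basic ' + TNetEncoding.Base64.EncodeBytesToString(Credentials);
end;

begin
  // Placeholder credentials; the password is six Cyrillic letters plus '123'.
  Writeln('Authorization: ' +
    BuildBasicAuthHeader('p8user', #$043F#$0430#$0440#$043E#$043B#$044C'123'));
end.
```

Note that TNetEncoding.Base64 inserts line breaks into long output by default, so very long credentials would need the breaks removed before being placed in a header.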

How to turn an Indy UTF-8 response into a native Delphi (Unicode)String?

Using Indy's TIdHTTP I obtain a response that has Content-Type: text/html; charset=UTF-8 and store it in a TStringStream. If I then use ResponseStream.ReadString(ResponseStream.Size), the resulting String is not shown correctly. I bet this is due to the fact that Windows uses UTF-16.
I tried a few things with TEncoding.UTF8 and TEncoding.Convert that only messed up the result even more (started to look Chinese).
Here's the current code:
var
LHTTP: TIdHTTP;
LResponseStream: TStringStream;
LResponse: String;
begin
LResponseStream := TStringStream.Create();
try
LHTTP := TIdHTTP.Create(nil);
try
LHTTP.Get('url', LResponseStream); // Returns 'hęllo'
finally
LHTTP.Free;
end;
LResponseStream.Position := 0;
LResponse := LResponseStream.ReadString(LResponseStream.Size);
ShowMessage(LResponse); // Make me pretty
finally
LResponseStream.Free;
end;
end;
What should I change to get a regular Delphi String...?
TIdHTTP has an overloaded version of Get() that returns a String. It will decode the UTF-8 into UTF-16 for you:
LResponse := LHTTP.Get('url');
If the content you are trying to download is encoded as UTF-8, you can simply make TStringStream decode its data as UTF-8 by passing the encoding to its constructor:
LResponseStream := TStringStream.Create('', TEncoding.UTF8);
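To see what passing TEncoding.UTF8 changes, the following RTL-only console sketch writes the raw UTF-8 octets of the example text into such a stream and reads them back through DataString. The byte values stand in for what TIdHTTP.Get() would write into the stream. No network call is made.

```delphi
program Utf8StreamDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Classes;

var
  Raw: TBytes;
  Stream: TStringStream;
begin
  // UTF-8 octets of 'hęllo': U+0119 is encoded as C4 99.
  Raw := TBytes.Create($68, $C4, $99, $6C, $6C, $6F);
  // The encoding given here is the one DataString uses to decode the bytes.
  Stream := TStringStream.Create('', TEncoding.UTF8);
  try
    Stream.WriteBuffer(Raw[0], Length(Raw));
    // Compares the decoded UTF-16 string with the expected text.
    Writeln(Stream.DataString = 'h'#$0119'llo');
  finally
    Stream.Free;
  end;
end.
```

Without the encoding argument, TStringStream falls back to the default encoding, which is why the original code produced garbled text for non-ASCII characters.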

Delphi. Indy & cyrillic letters

I've been writing some function that downloads source code of specified web page by URL:
function GetWebPage(const url: string): tStringList;
var
idHttp: TidHttp;
begin
Result := tStringList.Create;
idHttp := TidHttp.Create(nil);
// set params
idHttp.Request.UserAgent := 'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)';
idHttp.Request.AcceptLanguage := 'ru en';
idHttp.Response.KeepAlive := True;
idHttp.HandleRedirects := True;
idHttp.ConnectTimeout := 5000;
idHttp.ReadTimeout := 5000;
try
try
Result.values['responce'] := idHttp.Get(url);
except
Result.values['responce'] := '';
end;
finally
Result.values['code'] := IntToStr(idHttp.ResponseCode);
FreeAndNil(idHttp);
end;
end;
It works perfectly with English URL addresses, but when I specify a URL like президент.рф, inside Indy that URL is transformed to ?????????.?? (as seen in HTTP Analyzer).
I've found this solution for my problem:
idHttp.IOHandler.DefStringEncoding := TEncoding.Ansi;
// also tried - TEncoding.Unicode, TEncoding.UTF8
But it does not work: when I call my function, I get an error.
So, how can I make this function work with Cyrillic addresses?
Thank you.
URLs can only contain ASCII characters in them. You need to pre-format the URL to encode non-ASCII characters before then passing it to TIdHTTP. You can use the TIdURI.URLEncode() method for that purpose, eg:
Result.values['responce'] := idHttp.Get(TIdURI.URLEncode(url));
GetWebPage('http://президент.рф');
UTF-8 is commonly used for URL encodings, so it is the default encoding used by TIdURI, but not all servers use UTF-8, so if you need to use a different encoding then TIdURI.URLEncode() has an optional AByteEncoding parameter for that purpose.
With that said, international resources are better serviced using IRIs instead of URLs, but Indy does not natively support IRIs yet (that will be implemented in Indy 11).

Resources