Delphi TIdHTTP POST does not encode plus sign - delphi

I have a TIdHTTP component on a form, and I am sending an http POST request to a cloud-based server. Everything works brilliantly, except for 1 field: a text string with a plus sign, e.g. 'hello world+dog', is getting saved as 'hello world dog'.
Researching this problem, I realise that a '+' in a URL is regarded as a space, so one has to encode it. This is where I'm stumped; it looks like the rest of the POST request is encoded by the TIdHTTP component, except for the '+'.
Looking at the request through Fiddler, it's coming through as 'hello%20world+dog'. If I manually encode the '+' (hello world%2Bdog), the result is 'hello%20world%252Bdog'.
I really don't know what I'm doing here, so if someone could point me in the right direction it would be most appreciated.
Other information:
I am using Delphi 2010. The component doesn't have any special settings; I presume I might need to set something? The Content-Type header that comes through in Fiddler is 'application/x-www-form-urlencoded'.
Then, the Delphi code:
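Request := 'hello world+dog';
URL := 'http://............./ExecuteWithErrors';
TSL := TStringList.Create;
try
  // Each entry is one name=value form field
  TSL.Add('query=' + Request);
  IdHTTP1.ConnectTimeout := 5000;
  IdHTTP1.ReadTimeout := 5000;
  Reply := IdHTTP1.Post(URL, TSL);
finally
  TSL.Free; // release the list even if Post raises
end;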

You are using an outdated version of Indy and need to upgrade.
TIdHTTP's webform data encoder was changed several times in late 2010. Your version appears to predate all of those changes.
In your version, TIdHTTP uses TIdURI.ParamsEncode() internally to encode the form data, where a space character is encoded as %20 and a + character is left un-encoded, thus:
hello%20world+dog
In October 2010, the encoder was updated to encode a space character as & before calling TIdURI.ParamsEncode(), thus:
hello&world+dog
In early December 2010, the encoder was updated to encode a space character as + instead, thus:
hello+world+dog
In late December 2010, the encoder was completely re-written to follow W3C's HTML specifications for application/x-www-form-urlencoded. A space character is encoded as + and a + character is encoded as %2B, thus:
hello+world%2Bdog
In all cases, the above logic is applied only if the hoForceEncodeParams flag is enabled in the TIdHTTP.HTTPOptions property (which it is by default). If upgrading is not an option, you will have to disable the hoForceEncodeParams flag and manually encode the TStringList content yourself:
Request:='hello+world%2Bdog';
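The difference between the encodings above can be reproduced outside Delphi. The following is a minimal Python sketch, not Indy's actual code: it uses the standard urllib.parse functions to imitate the old behaviour (space as %20, + untouched), the form-urlencoded behaviour (space as +, + as %2B), and the double encoding that produced %252B in Fiddler. The variable names are only illustrative.

```python
from urllib.parse import quote, quote_plus, unquote_plus

value = 'hello world+dog'

# Roughly what the old encoder sent: space -> %20, '+' left alone.
old_style = quote(value, safe='+')

# What application/x-www-form-urlencoded calls for: space -> '+', '+' -> %2B.
form_style = quote_plus(value)

# A server decoding form data turns every literal '+' back into a space.
decoded_old = unquote_plus(old_style)
decoded_form = unquote_plus(form_style)

# Encoding by hand and then letting the library encode again escapes
# the '%' itself, which gives the %252B seen in the question.
double_encoded = quote('hello world%2Bdog', safe='')

print(old_style)       # hello%20world+dog
print(form_style)      # hello+world%2Bdog
print(decoded_old)     # hello world dog
print(decoded_form)    # hello world+dog
print(double_encoded)  # hello%20world%252Bdog
```

Only the form-style value survives the round trip, which matches the server storing 'hello world dog' for the old-style request.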

Related

NSString: dealing with UTF8-based API

Which character set is the default for NSString when I get typed content from a UITextField?
I developed an app which sends such NSStrings to a UTF-8-based REST API. At the backend, there is a utf8-based MySQL database with utf8-based varchar fields.
My POST request sends string data from the iOS app to the server, and with a GET request I receive those strings from the REST API.
Within the app, everything is printed fine. Special UTF-8 characters like ÄÖÜ are shown correctly after sending them to the server and receiving them back.
But when I open the MySQL console on the REST API's server and run a SELECT on this data, broken characters are visible.
What could be the root cause? Which character set does Apple use for an NSString?
It sounds like a server issue. Check that the MySQL version you are using supports full UTF-8; older versions do not. See "How to support full Unicode in MySQL database":
MySQL’s utf8 encoding is different from proper UTF-8 encoding. It doesn’t offer full Unicode support.
MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
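To make the utf8 versus utf8mb4 distinction concrete: MySQL's legacy utf8 stores at most three bytes per character, while real UTF-8 needs four bytes for characters outside the Basic Multilingual Plane. The Python sketch below only counts UTF-8 bytes per character; the helper names are hypothetical and nothing here talks to MySQL.

```python
def max_utf8_bytes(text: str) -> int:
    """Largest number of UTF-8 bytes needed by any single character."""
    return max((len(ch.encode('utf-8')) for ch in text), default=0)


def fits_legacy_mysql_utf8(text: str) -> bool:
    """True if every character fits the 3-byte limit of MySQL's old utf8."""
    return max_utf8_bytes(text) <= 3


umlauts = '\u00c4\u00d6\u00dc'   # ÄÖÜ, two bytes each in UTF-8
emoji = '\U0001F600'             # a character outside the BMP, four bytes

print(max_utf8_bytes(umlauts), fits_legacy_mysql_utf8(umlauts))  # 2 True
print(max_utf8_bytes(emoji), fits_legacy_mysql_utf8(emoji))      # 4 False
```

By this check the umlauts from the question fit in either character set, so the 3-byte limit matters mainly for characters such as emoji.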
NSString has an internal representation that is essentially opaque.
The UITextField method text returns an NSString.
When you want data from a string to send to a server, use - (NSData *)dataUsingEncoding:(NSStringEncoding)encoding and specify the encoding, such as NSUTF8StringEncoding.
NSData *textFieldUTF8Data = [textFieldInstance.text dataUsingEncoding: NSUTF8StringEncoding];
If, by "mysql console", you are referring to the DOS-like command window in Windows, then you need the following:
The command "chcp" controls the "code page". chcp 65001 provides UTF-8, but it needs a special charset installed, too.
To set the font in the console window: right-click on the title of the window → Properties → Font → pick Lucida Console.
Also, tell the console client that your bytes are UTF-8 by running SET NAMES utf8mb4.
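The symptom described in the question (correct data in the app, broken characters in the console) is what you get when correctly stored UTF-8 bytes are displayed through a different code page. A small Python sketch of that effect follows; cp850 is only an assumed example of a Western European console code page, and the actual one depends on the machine.

```python
text = '\u00c4\u00d6\u00dc'      # ÄÖÜ as typed in the app

stored = text.encode('utf-8')    # the bytes a UTF-8 column would hold

# What a console using a non-UTF-8 code page would display for those bytes
# (cp850 is an assumption for illustration).
shown = stored.decode('cp850')

# The bytes themselves are intact: reinterpreting them as UTF-8 recovers the text.
recovered = shown.encode('cp850').decode('utf-8')

print(len(text), len(shown))     # 3 6
print(recovered == text)         # True
```

Each umlaut shows up as two unrelated characters, yet no data is lost, which is consistent with the app reading the values back correctly.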

Indy message with Unicode Subject

I need to create a TIdMessage with a Unicode subject (e.g. "本語 - test").
I have tried setting it using
Msg.Subject := UTF8Encode(subject);
where subject is a WideString containing the text above
but when I look at the encoded subject (by saving the Message to file) it looks like this:
Subject: =?UTF-8?Q?=C3=A6=C5=93=C2=AC=C3=A8=C2=AA=C5=BE?= - test
instead of
Subject: =?UTF-8?Q?=E6=0C=AC=E8=AA=9E?= - test
and Outlook displays it as "æœ¬èªž - test"
Any pointers as to where I am going wrong?
Delphi 2006 (pre-unicode), Indy 10 (fairly recent from source)
In pre-Unicode versions of Delphi, where everything is based on AnsiString, the value you assign to the TIdMessage.Subject property (and any other AnsiString property of TIdMessage, for that matter) MUST be encoded using the OS default character encoding. You are encoding it to UTF-8 instead, which will not work. This is because TIdMessage will first decode the Subject value to Unicode using the OS default encoding, then MIME-encode the Unicode data using the encoding parameters provided by the TIdMessage.OnInitializeISO event, or defaults if no event handler is assigned (in this case, those parameters are CharSet=UTF-8 and HeaderEncoding=QuotedPrintable). TIdMessage has no mechanism to allow you to specify the encoding used for any AnsiString data you assign to it. So the only possibility to send a value of '本語 - test' with the Subject property is to assign your source WideString as-is to the property and let the RTL convert the data to AnsiString using the OS default encoding:
Msg.Subject := subject;
However, if the OS default encoding does not support the Unicode characters being used, data will be lost. There is no avoiding that in this scenario.
The alternative is to set the Subject property to a blank string and then use the TIdMessage.ExtraHeaders property instead so that you can provide your own header value that will be put into the email as-is. Using this approach, you can call Indy's EncodeHeader() function directly. In pre-Unicode versions of Delphi, it has an optional ASrcEncoding parameter that defaults to the OS default encoding (TIdMessage does not currently provide a value for that parameter when encoding headers):
uses
..., IdCoderHeader;
Msg.Subject := '';
Msg.ExtraHeaders.Values['Subject'] := EncodeHeader(UTF8Encode(subject), '', 'Q', 'UTF-8', IndyTextEncoding_UTF8);
This way, EncodeHeader() will be able to avoid a redundant conversion because it can detect that the source and target character encodings are both UTF-8, and thus just MIME-encode the source UTF-8 data as-is. Worst case, even if it did not detect that the character encodings were the same, it would simply decode the source data to Unicode using UTF-8 and then re-encode it back to UTF-8. Those are lossless conversions, so no data is lost.
And FYI, the correct encoding for the Unicode characters you have shown would be:
Subject: =?UTF-8?Q?=E6=9C=AC=E8=AA=9E?= - test
and not
Subject: =?UTF-8?Q?=E6=0C=AC=E8=AA=9E?= - test
as you have shown. Notice that the second encoded octet is 9C instead of 0C.
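Both header values from the question can be reproduced in a few lines of Python. This is only a sketch of the byte-level effect, not Indy's code: q_encode_bytes is a hypothetical, simplified helper that hex-escapes every byte, and cp1252 is assumed to be the OS default encoding that TIdMessage decodes with.

```python
def q_encode_bytes(raw: bytes) -> str:
    """Hex-escape every byte as =XX (a simplified 'Q' encoded-word payload)."""
    return ''.join('=%02X' % b for b in raw)


subject = '\u672c\u8a9e'  # 本語

# Correct: UTF-8 bytes of the text, Q-encoded once.
correct = q_encode_bytes(subject.encode('utf-8'))

# What happened in the question: the UTF-8 bytes were treated as if they
# were text in the OS default encoding (assumed cp1252), then converted
# to UTF-8 a second time before Q-encoding.
misread = subject.encode('utf-8').decode('cp1252')
double = q_encode_bytes(misread.encode('utf-8'))

print(correct)  # =E6=9C=AC=E8=AA=9E
print(double)   # =C3=A6=C5=93=C2=AC=C3=A8=C2=AA=C5=BE
```

The second value matches the header saved by the asker, which supports the explanation that the UTF-8 bytes were decoded with the OS default encoding and then encoded again.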

Patched Delphi library for unicode support in TPageProducer callbacks?

I've been using Delphi 2009 with the Indy library (10) that ships and have been upgrading a legacy application that makes heavy use of the TPageProducer. The legacy app was originally written for Delphi 5 / Indy 8.
I'm using the OnHTMLTag property of TPageProducer to specify a function that will handle the HTML transparent tags in my source. My problem was that if I put unicode (Simplified Chinese) characters in the TPageProducer.HTMLDoc property, when the OnHTMLTag callback was called, the TagParams argument contains ?? instead of the expected Chinese characters.
I traced this down to around line 2053 of HTTPApp.pas where we separate out the key / value pairs of the transparent tag:
procedure ExtractHeaderFields(Separators, WhiteSpace: TSysCharSet; Content: PChar;
Strings: TStrings; Decode: Boolean; StripQuotes: Boolean = False);
...
if Decode then
Strings.Add(string(HTTPDecode(AnsiString(DoStripQuotes(ExtractedField)))))
else
Strings.Add(DoStripQuotes(ExtractedField));
...
Everything is fine until we cast the string to an AnsiString and pass it to HTTPDecode, at which point my Strings list contains ?? as does my final TagParams and webpage.
Should there be a version of HTTPDecode that works with Strings instead of AnsiStrings? If so, where might I find this?
For now, I've just disabled the decode routine when I parse my tokens for the TPageProducer, but it isn't a nice fix, and I would prefer to have a version of this that works with wide characters (if that is even possible).
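The ?? in TagParams is the usual result of forcing Unicode text through a single-byte ANSI encoding that has no mapping for those characters. A rough Python sketch of the same lossy conversion follows; cp1252 stands in for the system ANSI code page as an assumption, and the tag parameter is made up for illustration.

```python
# A hypothetical tag parameter containing two Chinese characters.
tag_params = 'title=\u4e2d\u6587'

# Analogue of the AnsiString(...) cast: characters with no mapping in the
# ANSI code page (cp1252 assumed here) are replaced with '?'.
ansi_bytes = tag_params.encode('cp1252', errors='replace')

# Analogue of casting back to string(...): the original text is gone.
round_trip = ansi_bytes.decode('cp1252')

print(round_trip)  # title=??
```

Once the characters have been replaced, no later conversion can bring them back, so any fix has to avoid the ANSI step altogether.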

Indy is altering the binary data in my URL

I want to send some binary data over via GET using the Indy components.
So, I have a URL like www.awebsite.com/index.php?data=xxx where xxx is the binary data encoded using the ParamsEncode function. After encoding, the binary data is converted to something like bB7%18%11z\ so my URL is something like:
www.awebsite.com/bB7%18%11z\
I have seen that if my URL contains the backslash char (see the last char in the URL), it is replaced with the slash char (/) in TIdURI.NormalizePath, so my binary data is corrupted. What am I doing wrong?
Backslashes aren't allowed in URLs, and to avoid confusion between Windows and *nix systems, all backslashes are replaced by slashes to attempt to keep things working. See http://www.faqs.org/rfcs/rfc1738.html section 5, HTTP, httpurl
You could try with replacing backslashes with %5C yourself.
That said, you should either try with MIME encoding, or try to get a hang of POST requests.
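As an illustration of the %5C suggestion, here is a short Python sketch that percent-encodes the sample bytes from the question so that no raw backslash is left for a URL normalizer to rewrite. It uses the standard urllib.parse module rather than Indy, and the payload and host are simply the example values from the question.

```python
from urllib.parse import quote, unquote_to_bytes

# The sample binary payload from the question, ending in a backslash.
payload = b'bB7\x18\x11z\\'

# Escape everything that is not an unreserved character, including '\'.
encoded = quote(payload, safe='')

url = 'http://www.awebsite.com/index.php?data=' + encoded

print(encoded)  # bB7%18%11z%5C

# The receiving side can recover the exact bytes.
assert unquote_to_bytes(encoded) == payload
```

With the backslash sent as %5C, the path normalization no longer has anything to replace.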
You're using an old version of Indy. Backslashes are included in the UnsafeChars list that Indy uses now. Remy changed it in July 2010 with revision 4272 in the Tiburon branch:
r4272 | Indy-RemyLebeau | 2010-07-07 03:12:23 -0500 (Wed, 07 Jul 2010) | 1 line
Internal logic changes for TIdURI, and moved some sharable logic into IdGlobalProtocols.pas for later use in TIdHTTP.
It was merged into the trunk with the rest of Indy 10.5.7 with revision 4394, in September 2010.

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
for example this: סלקום
gets gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such operation?
Does libiconv support such operation?
I appreciate any help.
Thanks
Edit: Ok, so what you’re seeing is UTF-8 data being decoded as Windows-1255 (so the numeric character references were a red herring). Here’s a demonstration in Python:
>>> u = ''.join(map(chr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print(s.decode('cp1255', 'replace'))
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a header tag like <meta charset=utf-8> to indicate to the browser what its encoding should be. A document served by a web server contains an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)
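The "read the charset from Content-Type, then decode" step can be sketched as follows. This is Python rather than C, and it is only an outline of the idea: the function names and the utf-8 fallback are assumptions, and in a libcurl program the header value would come from the transfer while the conversion would go through iconv.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = 'utf-8') -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset(default)


def decode_body(body: bytes, content_type: str) -> str:
    """Decode response bytes using the charset the server declared."""
    charset = charset_from_content_type(content_type)
    return body.decode(charset, errors='replace')


# The five Hebrew letters from the question, built from their code points.
word = ''.join(map(chr, [1505, 1500, 1511, 1493, 1501]))
body = word.encode('utf-8')

decoded = decode_body(body, 'text/html; charset=UTF-8')
print(decoded == word)  # True
```

Decoding with the declared charset first, and only then converting to whatever the display needs, avoids interpreting UTF-8 bytes as a single-byte code page.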
