Patched Delphi library for unicode support in TPageProducer callbacks? - delphi

I've been using Delphi 2009 with the Indy 10 library that ships with it, and have been upgrading a legacy application that makes heavy use of TPageProducer. The legacy app was originally written for Delphi 5 / Indy 8.
I'm using the OnHTMLTag property of TPageProducer to specify a function that handles the HTML-transparent tags in my source. My problem is that if I put Unicode (Simplified Chinese) characters in the TPageProducer.HTMLDoc property, then when the OnHTMLTag callback is called, the TagParams argument contains ?? instead of the expected Chinese characters.
I traced this down to around line 2053 of HTTPApp.pas where we separate out the key / value pairs of the transparent tag:
procedure ExtractHeaderFields(Separators, WhiteSpace: TSysCharSet; Content: PChar;
  Strings: TStrings; Decode: Boolean; StripQuotes: Boolean = False);
...
  if Decode then
    Strings.Add(string(HTTPDecode(AnsiString(DoStripQuotes(ExtractedField)))))
  else
    Strings.Add(DoStripQuotes(ExtractedField));
...
Everything is fine until we cast the string to an AnsiString and pass it to HTTPDecode, at which point my Strings list contains ??, as do my final TagParams and web page.
Should there be a version of HTTPDecode that works with String instead of AnsiString? If so, where might I find it?
For now, I've just disabled the decode routine when I parse my tokens for the TPageProducer, but it isn't a nice fix, and I would prefer a version that works with wide characters (if that is even possible).
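One possible shape for a Unicode-aware decoder is sketched below. It is not part of HTTPApp.pas: HTTPDecodeUnicode and HexDigit are hypothetical names, and the sketch assumes that any percent-encoded bytes in the tag parameters are UTF-8. Like HTTPDecode, it treats '+' as a space. The idea is to convert the UnicodeString to UTF-8 first, so no character is lost the way it is in the AnsiString cast, then decode %XX sequences at the byte level and convert back.

```delphi
program HttpDecodeUnicodeSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: value of one hex digit, or -1 if C is not a hex digit.
function HexDigit(C: AnsiChar): Integer;
begin
  case C of
    '0'..'9': Result := Ord(C) - Ord('0');
    'A'..'F': Result := Ord(C) - Ord('A') + 10;
    'a'..'f': Result := Ord(C) - Ord('a') + 10;
  else
    Result := -1;
  end;
end;

// Hypothetical helper, not an HTTPApp routine. Assumes %XX sequences
// describe UTF-8 bytes.
function HTTPDecodeUnicode(const S: string): string;
var
  Src, Dst: UTF8String;
  I, Len, Hi, Lo: Integer;
begin
  Src := UTF8Encode(S);         // literal non-ASCII characters become UTF-8 bytes
  SetLength(Dst, Length(Src));  // decoding never produces more bytes than the input
  Len := 0;
  I := 1;
  while I <= Length(Src) do
  begin
    Inc(Len);
    Hi := -1;
    Lo := -1;
    if (Src[I] = '%') and (I + 2 <= Length(Src)) then
    begin
      Hi := HexDigit(Src[I + 1]);
      Lo := HexDigit(Src[I + 2]);
    end;
    if (Hi >= 0) and (Lo >= 0) then
    begin
      Dst[Len] := AnsiChar(Hi * 16 + Lo);  // %XX becomes one raw byte
      Inc(I, 3);
    end
    else
    begin
      if Src[I] = '+' then
        Dst[Len] := ' '                    // '+' means space, as in HTTPDecode
      else
        Dst[Len] := Src[I];                // everything else is copied unchanged
      Inc(I);
    end;
  end;
  SetLength(Dst, Len);
  Result := UTF8ToString(Dst);
end;

begin
  // #$672C#$8A9E are two CJK characters; their UTF-8 bytes are E6 9C AC E8 AA 9E.
  if HTTPDecodeUnicode('%E6%9C%AC%E8%AA%9E') = #$672C#$8A9E then
    Writeln('percent-encoded input decoded');
  if HTTPDecodeUnicode(#$672C#$8A9E + '+a') = #$672C#$8A9E + ' a' then
    Writeln('literal characters preserved');
end.
```

Using such a helper would still mean patching or bypassing the ExtractHeaderFields call shown above, since that routine is what performs the AnsiString cast.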

Related

Delphi TIdHTTP POST does not encode plus sign

I have a TIdHTTP component on a form, and I am sending an http POST request to a cloud-based server. Everything works brilliantly, except for 1 field: a text string with a plus sign, e.g. 'hello world+dog', is getting saved as 'hello world dog'.
Researching this problem, I realise that a '+' in a URL is regarded as a space, so one has to encode it. This is where I'm stumped; it looks like the rest of the POST request is encoded by the TIdHTTP component, except for the '+'.
Looking at the request through Fiddler, it's coming through as 'hello%20world+dog'. If I manually encode the '+' (hello world%2Bdog), the result is 'hello%20world%252Bdog'.
I really don't know what I'm doing here, so if someone could point me in the right direction it would be most appreciated.
Other information:
I am using Delphi 2010. The component doesn't have any special settings, I presume I might need to set something? The header content-type that comes through in Fiddler is 'application/x-www-form-urlencoded'.
Then, the Delphi code:
Request := 'hello world+dog';
URL := 'http://............./ExecuteWithErrors';
TSL := TStringList.Create;
try
  TSL.Add('query=' + Request);
  IdHTTP1.ConnectTimeout := 5000;
  IdHTTP1.ReadTimeout := 5000;
  Reply := IdHTTP1.Post(URL, TSL);
finally
  // Release the parameter list even if Post raises an exception
  TSL.Free;
end;
You are using an outdated version of Indy and need to upgrade.
TIdHTTP's webform data encoder was changed several times in late 2010. Your version appears to predate all of those changes.
In your version, TIdHTTP uses TIdURI.ParamsEncode() internally to encode the form data, where a space character is encoded as %20 and a + character is left un-encoded, thus:
hello%20world+dog
In October 2010, the encoder was updated to encode a space character as & before calling TIdURI.ParamsEncode(), thus:
hello&world+dog
In early December 2010, the encoder was updated to encode a space character as + instead, thus:
hello+world+dog
In late December 2010, the encoder was completely re-written to follow W3C's HTML specifications for application/x-www-form-urlencoded. A space character is encoded as + and a + character is encoded as %2B, thus:
hello+world%2Bdog
In all cases, the above logic is applied only if the hoForceEncodeParams flag is enabled in the TIdHTTP.HTTPOptions property (which it is by default). If upgrading is not an option, you will have to disable the hoForceEncodeParams flag and manually encode the TStringList content yourself:
Request:='hello+world%2Bdog';
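For the manual route, a minimal sketch of the final encoding rule described above (space becomes +, + becomes %2B) might look like this. FormUrlEncode is a hypothetical helper, not an Indy routine, and the sketch assumes the server expects UTF-8 for any non-ASCII text.

```delphi
program FormEncodeSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not part of Indy): encodes one form value as
// application/x-www-form-urlencoded. Assumes UTF-8 for non-ASCII text.
function FormUrlEncode(const Value: string): string;
var
  Bytes: TBytes;
  I: Integer;
begin
  Result := '';
  Bytes := TEncoding.UTF8.GetBytes(Value);
  for I := 0 to High(Bytes) do
    case Bytes[I] of
      Ord('A')..Ord('Z'), Ord('a')..Ord('z'), Ord('0')..Ord('9'),
      Ord('*'), Ord('-'), Ord('.'), Ord('_'):
        Result := Result + Char(Bytes[I]);               // safe characters pass through
      Ord(' '):
        Result := Result + '+';                          // space becomes +
    else
      Result := Result + '%' + IntToHex(Bytes[I], 2);    // everything else becomes %XX
    end;
end;

begin
  Writeln('query=' + FormUrlEncode('hello world+dog'));  // prints query=hello+world%2Bdog
end.
```

Only the value is passed through the helper, so the '=' separating name and value stays intact. The encoded string would then be added to the TStringList after removing hoForceEncodeParams from IdHTTP1.HTTPOptions, so that TIdHTTP does not encode it a second time.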

Indy message with Unicode Subject

I need to create a TIdMessage with a Unicode subject (e.g. "本語 - test").
I have tried setting it using
Msg.Subject := UTF8Encode(subject);
where subject is a WideString containing the text above
but when I look at the encoded subject (by saving the Message to file) it looks like this:
Subject: =?UTF-8?Q?=C3=A6=C5=93=C2=AC=C3=A8=C2=AA=C5=BE?= - test
instead of
Subject: =?UTF-8?Q?=E6=0C=AC=E8=AA=9E?= - test
and Outlook displays it as "æœ¬èªž - test"
Any pointers as to where I am going wrong?
Delphi 2006 (pre-unicode), Indy 10 (fairly recent from source)
In pre-Unicode versions of Delphi, where everything is based on AnsiString, the value you assign to the TIdMessage.Subject property (and any other AnsiString property of TIdMessage, for that matter) MUST be encoded using the OS default character encoding. You are encoding it to UTF-8 instead, which will not work. This is because TIdMessage will first decode the Subject value to Unicode using the OS default encoding, then MIME-encode the Unicode data using the encoding parameters provided by the TIdMessage.OnInitializeISO event, or defaults if no event handler is assigned (in this case, those parameters are CharSet=UTF-8 and HeaderEncoding=QuotedPrintable). TIdMessage has no mechanism to allow you to specify the encoding used for any AnsiString data you assign to it. So the only possibility to send a value of '本語 - test' with the Subject property is to assign your source WideString as-is to the property and let the RTL convert the data to AnsiString using the OS default encoding:
Msg.Subject := subject;
However, if the OS default encoding does not support the Unicode characters being used, data will be lost. There is no avoiding that in this scenario.
The alternative is to set the Subject property to a blank string and then use the TIdMessage.ExtraHeaders property instead so that you can provide your own header value that will be put into the email as-is. Using this approach, you can call Indy's EncodeHeader() function directly. In pre-Unicode versions of Delphi, it has an optional ASrcEncoding parameter that defaults to the OS default encoding (TIdMessage does not currently provide a value for that parameter when encoding headers):
uses
..., IdCoderHeader;
Msg.Subject := '';
Msg.ExtraHeaders.Values['Subject'] := EncodeHeader(UTF8Encode(subject), '', 'Q', 'UTF-8', IndyTextEncoding_UTF8);
This way, EncodeHeader() can avoid a redundant conversion because it can detect that the source and target character encodings are both UTF-8, and thus just MIME-encode the source UTF-8 data as-is. In the worst case, even if it did not detect that the character encodings were the same, it would simply decode the source data to Unicode using UTF-8 and then re-encode it back to UTF-8. Those are lossless conversions, so no data is lost.
And FYI, the correct encoding for the Unicode characters you have shown would be:
Subject: =?UTF-8?Q?=E6=9C=AC=E8=AA=9E?= - test
Not
Subject: =?UTF-8?Q?=E6=0C=AC=E8=AA=9E?= - test
As you have shown. Notice the second encoded octet is 9C instead of 0C.
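The double encoding described above can be reproduced at the byte level. The sketch below is only an illustration: it uses TEncoding, so it needs Delphi 2009 or later rather than the Delphi 2006 from the question, it assumes the OS default code page is Windows-1252, and ToQP is a hypothetical formatting helper, not an Indy function.

```delphi
program SubjectMojibakeSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: formats bytes in the =XX style of quoted-printable.
function ToQP(const Bytes: TBytes): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(Bytes) do
    Result := Result + '=' + IntToHex(Bytes[I], 2);
end;

var
  Subject, Misread: string;
  Utf8Bytes: TBytes;
  Ansi1252: TEncoding;
begin
  Subject := #$672C#$8A9E;  // the two CJK characters from the question
  Utf8Bytes := TEncoding.UTF8.GetBytes(Subject);
  Writeln('correct:        ', ToQP(Utf8Bytes));

  // Assumption: the OS default code page is Windows-1252.
  Ansi1252 := TEncoding.GetEncoding(1252);
  try
    // This mirrors what happens to a UTF8Encode()d Subject:
    // the UTF-8 bytes are read as if they were ANSI text.
    Misread := Ansi1252.GetString(Utf8Bytes);
  finally
    Ansi1252.Free;
  end;
  Writeln('double-encoded: ', ToQP(TEncoding.UTF8.GetBytes(Misread)));
end.
```

Under that assumption the first line shows =E6=9C=AC=E8=AA=9E and the second shows =C3=A6=C5=93=C2=AC=C3=A8=C2=AA=C5=BE, which is the byte sequence from the saved message in the question.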

Strange character before pound symbol in Titanium Studio

In Titanium Studio, I am storing a one character value in an SQLite database (which uses UTF-8 encoding). When I store a pound symbol (£), it stores fine, but when I read it back, I get ¬£ instead. Strangely enough, the string length still reports to be 1, in spite of two characters being visible. The main problem is that this character forms part of a filename that gets sent to a Windows Server. So, while in Titanium, despite the extra character, everything works, when the filename gets sent to Windows, we get another strange character. I tried converting the character using Ti.Buffer, but when I decode, I still get the same characters back.
var tipo_v = '';
var buf = Ti.createBuffer({length: 1024});
var l = Ti.Codec.encodeString({
    source: Vtipo_visita,
    dest: buf
});
buf.length = l;
tipo_v = Ti.Codec.decodeString({
    source: buf,
    charset: Ti.Codec.CHARSET_ASCII
});
The variable Vtipo_visita has the ¬£ value. After the call to decodeString(), tipo_v has the value √Ǭ.
I also tried using CHARSET_ISO_LATIN_1, but it didn't make any difference. How can I get this character to display correctly without the extra character in front?
As a final note, I found that simply doing
String.fromCharCode(163)
outputs the two characters in the Debugger, instead of just one. Thanks for any suggestions.

LoadFromFile with Unicode data

My input file (f) contains some Unicode (Swedish) text that isn't being read correctly.
Neither of these approaches works, although they give different results:
LoadFromFile(f);
or
LoadFromFile(f,TEncoding.GetEncoding(GetOEMCP));
I'm using Delphi XE.
How can I LoadFromFile some Unicode data, and how do I subsequently SaveToFile? Thanks
In order to load a Unicode text file you need to know its encoding. If the file has a Byte Order Mark (BOM), then you can simply call LoadFromFile(FileName) and the RTL will use the BOM to determine the encoding.
If the file does not have a BOM then you need to explicitly specify the encoding, e.g.
LoadFromFile(FileName, TEncoding.UTF8);
LoadFromFile(FileName, TEncoding.Unicode);//UTF-16 LE
LoadFromFile(FileName, TEncoding.BigEndianUnicode);//UTF-16 BE
For some reason, unknown to me, there is no built in support for UTF-32, but if you had such a file then it would be easy enough to add a TEncoding instance to handle that.
I assume that you mean 'UTF-8' when you say 'Unicode'.
If you know that the file is UTF-8, then do
LoadFromFile(f, TEncoding.UTF8);
To save:
SaveToFile(f, TEncoding.UTF8);
(The GetOEMCP WinAPI function returns the legacy OEM code page, which is meant for old 8-bit character sets, not for Unicode files.)
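Putting the load and save calls together, a minimal round trip could look like the sketch below. The file name and the sample text are made up for illustration, and the sketch assumes the file really is UTF-8.

```delphi
program LoadSaveUtf8Sketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  FileName = 'swedish.txt';  // hypothetical file name

var
  Lines: TStringList;
  Original: string;
begin
  // A Swedish word with three non-ASCII letters (U+00E4, U+00F6, U+00E5)
  Original := 'r' + Char($E4) + 'ksm' + Char($F6) + 'rg' + Char($E5) + 's';
  Lines := TStringList.Create;
  try
    Lines.Add(Original);
    Lines.SaveToFile(FileName, TEncoding.UTF8);    // write the text as UTF-8
    Lines.Clear;
    Lines.LoadFromFile(FileName, TEncoding.UTF8);  // read it back with the same encoding
    if Lines[0] = Original then
      Writeln('round trip preserved the text');
  finally
    Lines.Free;
  end;
end.
```

Passing the same TEncoding to both calls is the point of the sketch: the text survives only if the encoding used for reading matches the one the file was written with.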

KaZip for C++Builder2009/Delphi

I have downloaded and installed KaZip 2.0 in C++Builder 2009 (with only minor changes: I changed type String to AnsiString). I wrote:
KAZip1->FileName = "test.zip";
KAZip1->CreateZip("test.zip");
KAZip1->Active = true;
KAZip1->Entries->AddFile("pack\\text.txt","xxx.txt");
KAZip1->Active = false;
KAZip1->Close();
It now creates a test.zip containing xxx.txt (59 bytes original, 21 bytes packed). I can open the archive in WinRAR successfully, but when I try to open xxx.txt, WinRAR says the file is corrupt. :(
What is wrong? Can somebody help me?
Extracting does not work either. Is that because the file is corrupt?
KAZip1->FileName = "test.zip";
KAZip1->Active = true;
KAZip1->Entries->ExtractToFile("xxx.txt","zzz.txt");
KAZip1->Active = false;
KAZip1->Close();
"with little minor changes => only set type String to AnsiString"
Use RawByteString instead of AnsiString.
I have no idea how KaZip 2.0 is implemented, but in general, to make a Delphi/C++ library that was designed without Unicode support in mind work properly, you need to do two things:
Replace all Char with AnsiChar and all String with AnsiString.
Replace all Win API calls with their Ansi variants, i.e. replace AWin32Function with AWin32FunctionA.
In Delphi < 2009, Char = AnsiChar, String = AnsiString, AWin32Function = AWin32FunctionA, but in Delphi >= 2009, by default, Char = WideChar, String = UnicodeString, AWin32Function = AWin32FunctionW.
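The practical consequence of that change shows up wherever code computes byte counts for buffers or streams. The sketch below is a generic illustration and says nothing about how KaZip itself is written; the file name is simply borrowed from the question.

```delphi
program CharSizeSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  U: string;      // UnicodeString in Delphi 2009+, AnsiString before
  A: AnsiString;  // one byte per element in every version
begin
  U := 'xxx.txt';
  A := AnsiString(U);
  Writeln('SizeOf(Char)     = ', SizeOf(Char));
  Writeln('SizeOf(AnsiChar) = ', SizeOf(AnsiChar));
  // Code that writes Length(S) bytes assumes one byte per character.
  // The actual byte count is Length * element size:
  Writeln('bytes in string     = ', Length(U) * SizeOf(Char));
  Writeln('bytes in AnsiString = ', Length(A) * SizeOf(AnsiChar));
end.
```

Legacy code that passes Length(S) as a byte count to a stream or buffer routine is correct for AnsiString but writes only half of a UnicodeString, which is one way an archive can end up corrupt after a port.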
WinRAR could be simply failing to recognize the header. Try opening it in Windows or some other zip programs.
"with little minor changes => only set type String to AnsiString"
That doesn't always work. It may compile, but that doesn't mean it will work correctly in D2009 or CB2009. You need to show the places where you convert String to AnsiString, especially the code that deals with buffers, streams, and I/O.
It's not surprising that your code is wrong; KaZip has no documentation.
Proper code is:
//Create a new empty zip file
KAZip1->CreateZip("test.zip");
//Open our newly created zip file so we can add files to it
KAZip1->Open("test.zip");
//Compress text.txt into xxx.txt
KAZip1->Entries->AddFile("pack\\text.txt","xxx.txt");
//Close the file stream
KAZip1->Close();

Resources