LoadFromFile with Unicode data - delphi

My input file(f) has some Unicode (Swedish) that isn't being read correctly.
Neither of these approaches works, although they give different results:
LoadFromFile(f);
or
LoadFromFile(f,TEncoding.GetEncoding(GetOEMCP));
I'm using Delphi XE.
How can I LoadFromFile some Unicode data, and how do I subsequently SaveToFile? Thanks

In order to load a Unicode text file you need to know its encoding. If the file has a Byte Order Mark (BOM), then you can simply call LoadFromFile(FileName) and the RTL will use the BOM to determine the encoding.
If the file does not have a BOM then you need to explicitly specify the encoding, e.g.
LoadFromFile(FileName, TEncoding.UTF8);
LoadFromFile(FileName, TEncoding.Unicode);//UTF-16 LE
LoadFromFile(FileName, TEncoding.BigEndianUnicode);//UTF-16 BE
For some reason unknown to me, there is no built-in support for UTF-32, but if you had such a file it would be easy enough to add a TEncoding instance to handle it.
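As a language-neutral illustration of the BOM check described above, here is a minimal Python sketch. It is not the Delphi RTL's implementation; the names `sniff_bom` and `decode_text` and the UTF-8 fallback are assumptions made for this example, while the byte sequences are the standard Unicode BOMs. UTF-32 is tested before UTF-16 because the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.

```python
import codecs

# Standard BOM byte sequences, longest first so UTF-32 LE is not
# mistaken for UTF-16 LE (both start with FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_text(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the BOM if present; otherwise use the caller-supplied fallback."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

Without a BOM the function cannot know the encoding, which is why the caller has to supply one, just as with the two-argument LoadFromFile.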

I assume that you mean 'UTF-8' when you say 'Unicode'.
If you know that the file is UTF-8, then do
LoadFromFile(f, TEncoding.UTF8);
To save:
SaveToFile(f, TEncoding.UTF8);
(The GetOEMCP WinAPI function returns the legacy OEM code page, an old 8-bit character set, not a Unicode encoding.)
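To see why the GetOEMCP attempt produces wrong characters, the following Python sketch decodes UTF-8 bytes with an OEM code page. Code page 850 is an assumption here (it is a common OEM code page on Western European Windows installations); the asker's actual code page is not stated, and the sample text is made up.

```python
swedish = "Smörgåsbord på älven"

# What a UTF-8 file without a BOM contains on disk.
raw = swedish.encode("utf-8")

# Reading those bytes as an OEM code page (cp850 assumed) turns every
# two-byte UTF-8 sequence into two unrelated characters.
wrong = raw.decode("cp850")

# Reading them with the encoding they were written in restores the text.
right = raw.decode("utf-8")

print(wrong)
print(right)
```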

Related

Using Umlaut or special characters in ibm-doors from batch

We have a link module that looks something like this:
const string lMod = "/project/_admin/somethingÜ" // Umlaut
We later use lMod like this to loop through the outlinks:
for a in obj->lMod do {}
But this only works when executing directly from DOORS, not from a batch script: for some reason the umlaut is not recognized there, so the body of the loop never runs. Replacing lMod with "*" works and also shows the objects linked to by lMod.
We are already using UTF-8 encoding for the file:
pragma encoding, "UTF-8"
Any solutions are welcome.
Encode the file as UTF-8 in Notepad++ by going to Encoding > Convert to UTF-8. (Make sure it's not already set to UTF-8 before you do it).

Indy message with Unicode Subject

I need to create an IdMessage with a Unicode subject (e.g. "本語 - test").
I have tried setting it using
Msg.Subject := UTF8Encode(subject);
where subject is a WideString containing the text above
but when I look at the encoded subject (by saving the Message to file) it looks like this:
Subject: =?UTF-8?Q?=C3=A6=C5=93=C2=AC=C3=A8=C2=AA=C5=BE?= - test
instead of
Subject: =?UTF-8?Q?=E6=0C=AC=E8=AA=9E?= - test
and Outlook displays it as "æœ¬èªž - test"
Any pointers as to where I am going wrong?
Delphi 2006 (pre-unicode), Indy 10 (fairly recent from source)
In pre-Unicode versions of Delphi, where everything is based on AnsiString, the value you assign to the TIdMessage.Subject property (and any other AnsiString property of TIdMessage, for that matter) MUST be encoded using the OS default character encoding. You are encoding it to UTF-8 instead, which will not work. This is because TIdMessage first decodes the Subject value to Unicode using the OS default encoding, then MIME-encodes the Unicode data using the encoding parameters provided by the TIdMessage.OnInitializeISO event, or defaults if no event handler is assigned (in this case, those parameters are CharSet=UTF-8 and HeaderEncoding=QuotedPrintable).
TIdMessage has no mechanism for specifying the encoding of the AnsiString data you assign to it. So the only way to send a value of '本語 - test' through the Subject property is to assign your source WideString as-is to the property and let the RTL convert the data to AnsiString using the OS default encoding:
Msg.Subject := subject;
However, if the OS default encoding does not support the Unicode characters being used, data will be lost. There is no avoiding that in this scenario.
The alternative is to set the Subject property to a blank string and then use the TIdMessage.ExtraHeaders property instead so that you can provide your own header value that will be put into the email as-is. Using this approach, you can call Indy's EncodeHeader() function directly. In pre-Unicode versions of Delphi, it has an optional ASrcEncoding parameter that defaults to the OS default encoding (TIdMessage does not currently provide a value for that parameter when encoding headers):
uses
..., IdCoderHeader;
Msg.Subject := '';
Msg.ExtraHeaders.Values['Subject'] := EncodeHeader(UTF8Encode(subject), '', 'Q', 'UTF-8', IndyTextEncoding_UTF8);
This way, EncodeHeader() will be able to avoid a redundant conversion because it can detect that the source and target character encodings are both UTF-8, and thus just MIME-encode the source UTF-8 data as-is. Worst case, even if it did not detect that the character encodings were the same, it would simply decode the source data to Unicode using UTF-8 and then re-encode it back to UTF-8. Those are lossless conversions, so no data is lost.
And FYI, the correct encoding for the Unicode characters you have shown would be:
Subject: =?UTF-8?Q?=E6=9C=AC=E8=AA=9E?= - test
not
Subject: =?UTF-8?Q?=E6=0C=AC=E8=AA=9E?= - test
as you have shown. Notice that the second encoded octet is 9C, not 0C.
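The malformed header from the question can be reproduced outside Delphi, which makes the double conversion visible. The Python sketch below is illustrative only: it assumes the OS default code page was Windows-1252 (the question does not say), and `q_encode` is a hypothetical helper that hex-escapes every byte rather than a full RFC 2047 encoder.

```python
def q_encode(data: bytes) -> str:
    """Hex-escape every byte as =XX (simplified; real Q-encoding leaves most ASCII as-is)."""
    return "".join(f"={b:02X}" for b in data)

subject = "本語"

# Correct path: Unicode -> UTF-8 -> Q-encoding.
good = q_encode(subject.encode("utf-8"))

# Faulty path: the UTF-8 bytes sit in an AnsiString, get decoded as if they
# were Windows-1252 (assumed code page), and are then encoded to UTF-8 again.
misread = subject.encode("utf-8").decode("cp1252")
bad = q_encode(misread.encode("utf-8"))

print(good)  # =E6=9C=AC=E8=AA=9E
print(bad)   # =C3=A6=C5=93=C2=AC=C3=A8=C2=AA=C5=BE
```

The second line matches the header the asker saw, which is consistent with the explanation that the UTF-8 bytes were treated as ANSI text and encoded a second time.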

iconv C API: charset conversion from/to local encoding

I am using the iconv C API and I want iconv to detect the local encoding of the computer. Is that possible? Apparently it is because when I look in the source code, I find in the file iconv_open1.h that if the fromcode or tocode variables are empty strings ("") then the local encoding is used using the locale_charset() function call.
Someone also told me that in order to convert the locale encoding to unicode, all I needed was to use iconv_open ("UTF-8", "")
Unfortunately, I find no mention of this in the documentation.
And when I convert some iso-8859-1 text to the locale encoding (which is utf-8 on my machine), then during conversion I get errno=EILSEQ (illegal sequence). I checked and iconv_open returned no error.
If instead of the empty string in iconv_open I specify "utf-8", then I get no error. Obviously iconv failed to detect my current charset.
edit: I checked with a simple C program that calls puts(nl_langinfo(CODESET)) and I get ANSI_X3.4-1968 (which is ASCII). Apparently, I have a problem with charset detection.
edit: this should be related to Why is nl_langinfo(CODESET) different from locale charmap?
additional information: my program is written in Ada, and I bind at link-time to C functions. Apparently, the locale setting is not initialized the same way in the Ada runtime and C runtime.
I'll take the same answer as in Why is nl_langinfo(CODESET) different from locale charmap?
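A C program starts in the "C" locale and only picks up the locale from the environment once it asks for it, so you first need to call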
setlocale(LC_ALL, "");
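The effect of that call can also be observed from Python, whose locale module wraps the same C functions. This is only a sketch: it assumes a POSIX system (nl_langinfo is not available on Windows), the helper names are made up, and the strings returned depend on the C library and the environment.

```python
import locale

def current_codeset() -> str:
    """Return the charset name reported for the active LC_CTYPE locale (POSIX only)."""
    return locale.nl_langinfo(locale.CODESET)

def codeset_in_c_locale() -> str:
    # Python adjusts LC_CTYPE at startup, so switch to "C" explicitly to
    # mimic a C program that has not called setlocale yet.
    locale.setlocale(locale.LC_ALL, "C")
    return current_codeset()

def codeset_from_environment() -> str:
    # Equivalent of setlocale(LC_ALL, ""): adopt LANG / LC_* from the environment.
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not installed; "C" stays active.
        pass
    return current_codeset()

if __name__ == "__main__":
    print("C locale:   ", codeset_in_c_locale())
    print("environment:", codeset_from_environment())
```

On a glibc system with a UTF-8 locale configured, the first line would be expected to show the ASCII name from the question and the second the environment's charset, but the exact strings vary by platform.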

Patched Delphi library for unicode support in TPageProducer callbacks?

I've been using Delphi 2009 with the Indy library (10) that ships and have been upgrading a legacy application that makes heavy use of the TPageProducer. The legacy app was originally written for Delphi 5 / Indy 8.
I'm using the OnHTMLTag property of TPageProducer to specify a function that will handle the HTML transparent tags in my source. My problem was that if I put unicode (Simplified Chinese) characters in the TPageProducer.HTMLDoc property, when the OnHTMLTag callback was called, the TagParams argument contains ?? instead of the expected Chinese characters.
I traced this down to around line 2053 of HTTPApp.pas where we separate out the key / value pairs of the transparent tag:
procedure ExtractHeaderFields(Separators, WhiteSpace: TSysCharSet; Content: PChar;
  Strings: TStrings; Decode: Boolean; StripQuotes: Boolean = False);
...
    if Decode then
      Strings.Add(string(HTTPDecode(AnsiString(DoStripQuotes(ExtractedField)))))
    else
      Strings.Add(DoStripQuotes(ExtractedField));
...
Everything is fine until we cast the string to an AnsiString and pass it to HTTPDecode, at which point my Strings list contains ?? as does my final TagParams and webpage.
Should there be a version of HTTPDecode that works with Strings instead of AnsiStrings? If so, where might I find this?
For now, I've just disabled the decode routine when I parse my tokens for the TPageProducer, but it isn't a nice fix and I would prefer to have a version of this that works with wide characters (if that is even possible).
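The ?? symptom is what a lossy Unicode-to-ANSI conversion typically looks like. As a hedged illustration in Python (not the HTTPApp code), the sketch below imitates the AnsiString cast with Windows-1252 as an assumed ANSI code page, and contrasts it with percent-decoding done directly on a Unicode string. `ansi_cast` is a hypothetical helper, and `unquote` is Python's own decoder, shown only to indicate what a Unicode-aware decode would return if the tag parameters were percent-encoded UTF-8.

```python
from urllib.parse import unquote

def ansi_cast(text: str, codepage: str = "cp1252") -> str:
    """Imitate a Unicode -> ANSI -> Unicode round trip; unmappable characters become '?'."""
    return text.encode(codepage, errors="replace").decode(codepage)

param = "中文"

# The cast destroys every character the code page cannot represent.
print(ansi_cast(param))  # ??

# Percent-decoding that stays in Unicode (UTF-8 assumed) keeps the text intact.
print(unquote("%E4%B8%AD%E6%96%87"))  # 中文
```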

KaZip for C++Builder2009/Delphi

I have downloaded and installed KaZip2.0 on C++Builder2009 (with little minor changes => only set type String to AnsiString). I wrote:
KAZip1->FileName = "test.zip";
KAZip1->CreateZip("test.zip");
KAZip1->Active = true;
KAZip1->Entries->AddFile("pack\\text.txt","xxx.txt");
KAZip1->Active = false;
KAZip1->Close();
This creates a test.zip containing xxx.txt (59 bytes original, 21 bytes packed). The archive opens in WinRAR, but when I try to open xxx.txt, WinRAR says the file is corrupt. :(
What is wrong? Can somebody help me?
Extracting does not work either, presumably because the file is corrupt:
KAZip1->FileName = "test.zip";
KAZip1->Active = true;
KAZip1->Entries->ExtractToFile("xxx.txt","zzz.txt");
KAZip1->Active = false;
KAZip1->Close();
"with little minor changes => only set type String to AnsiString"
Use RawByteString instead of AnsiString.
I have no idea how KaZip2.0 is implemented, but in general, to make a Delphi/C++ library that was designed without Unicode support work properly, you need to do two things:
Replace all Char with AnsiChar and all string with AnsiString.
Replace all Win API calls with their Ansi variant, i.e. replace AWin32Function with AWin32FunctionA.
In Delphi < 2009, Char = AnsiChar, String = AnsiString, AWin32Function = AWin32FunctionA, but in Delphi >= 2009, by default, Char = WideChar, String = UnicodeString, AWin32Function = AWin32FunctionW.
WinRAR could be simply failing to recognize the header. Try opening it in Windows or some other zip programs.
"with little minor changes => only set type String to AnsiString"
That doesn't always work. It may compile, but that doesn't mean it will work correctly in D2009 or CB2009. You need to show the places where you convert Strings to AnsiStrings, especially the code that deals with buffers, streams and I/O.
It's not surprising that your code is wrong; KaZip has no documentation.
Proper code is:
//Create a new empty zip file
KAZip1->CreateZip("test.zip");
//Open our newly created zip file so we can add files to it
KAZip1->Open("test.zip");
//Compress text.txt into xxx.txt
KAZip1->Entries->AddFile("pack\\text.txt","xxx.txt");
//Close the file stream
KAZip1->Close();
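One way to tell whether the archive or the viewer is at fault is to verify it with an independent zip implementation. The sketch below uses Python's standard zipfile module, not KaZip; the helper names are hypothetical, and the in-memory archive only stands in for a real test.zip.

```python
import io
import zipfile

def build_archive(name_in_zip: str, payload: bytes) -> bytes:
    """Create a zip in memory holding one deflate-compressed entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name_in_zip, payload)
    return buffer.getvalue()

def first_corrupt_entry(zip_bytes: bytes):
    """Return the name of the first entry whose CRC check fails, or None if all pass."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.testzip()
```

For a file on disk, opening it with `zipfile.ZipFile("test.zip")` and calling `testzip()` performs the same check. If it reports xxx.txt, the compressed data written by the ported library is likely damaged rather than merely unreadable by WinRAR.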
