I need to create a IdMessage with Unicode subject (eg "本語 - test")
I have tried setting it using
Msg.Subject := UTF8Encode(subject);
where subject is a WideString containing the text above
but when I look at the encoded subject (by saving the Message to file) it looks like this:
Subject: =?UTF-8?Q?=C3=A6=C5=93=C2=AC=C3=A8=C2=AA=C5=BE?= - test
instead of
Subject: =?UTF-8?Q?=E6=0C=AC=E8=AA=9E?= - test
and Outlook displays it as "本語 - test"
Any pointers as to where I am going wrong?
Delphi 2006 (pre-unicode), Indy 10 (fairly recent from source)
In pre-Unicode versions of Delphi, where everything is based on AnsiString, the value you assign to the TIdMessage.Subject property (and any other AnsiString property of TIdMessage, for that matter) MUST be encoded using the OS default character encoding. You are encoding it to UTF-8 instead, which will not work. This is because TIdMessage will first decode the Subject value to Unicode using the OS default encoding, then MIME-encode the Unicode data using the encoding parameters provided by the TIdMessage.OnInitializeISO event, or defaults if no event handler is assigned (in this case, those parameters are CharSet=UTF-8 and HeaderEncoding=QuotedPrintable). TIdMessage has no mechanism to allow you to specify the encoding used for any AnsiString data you assign to it. So the only possibility to send a value of '本語 - test' with the Subject property is to assign your source WideString as-is to the property and let the RTL convert the data to AnsiString using the OS default encoding:
Msg.Subject := subject;
However, if the OS does not support the Unicode characters being used, there will be data lost. There is no avoiding that in this scenario.
The alternative is to set the Subject property to a blank string and then use the TIdMessage.ExtraHeaders property instead so that you can provide your own header value that will be put into the email as-is. Using this approach, you can call Indy's EncodeHeader() function directly. In pre-Unicode versions of Delphi, it has an optional ASrcEncoding parameter that defaults to the OS default encoding (TIdMessage does not currently provide a value for that parameter when encoding headers):
uses
..., IdCoderHeader;
Msg.Subject := '';
Msg.ExtraHeaders.Values['Subject'] := EncodeHeader(UTF8Encode(subject), '', 'Q', 'UTF-8', IndyTextEncoding_UTF8);
This way, EncodeHeader() will be able to avoid a redundant conversion because it can detect that the source and target character encodings are both UTF-8, and thus just MIME-encode the source UTF-8 data as-is. Worse case, even if it did not detect the character encodings were the same, it would simply decode the source data to Unicode using UTF-8 and then re-encode it back to UTF-8. Those are lossless conversions, so no data is lost.
And FYI, the correct encoding for the Unicode characters you have shown would be:
Subject: =?UTF-8?Q?=E6=9C=AC=E8=AA=9E?= - test
Not
Subject: =?UTF-8?Q?=E6=0C=AC=E8=AA=9E?= - test
As you have shown. Notice the second encoded octet is 9C instead of 0C.
Related
I've got a string variable containing a text that I need to encode and write to a file, in UTF-16LE code page.
Currently the following code generates a UTF-8 file and I don't see any option in the statement OPEN DATASET to generate the file in UTF-16LE.
REPORT zmyprogram.
DATA(filename) = `/tmp/myfile`.
OPEN DATASET filename IN TEXT MODE ENCODING DEFAULT FOR OUTPUT.
TRANSFER 'HELLO WORLD' TO filename.
CLOSE DATASET filename.
I guess one solution is to first encode the string in memory, then write the encoded bytes to the file.
Generally speaking, how to encode a string of characters into a given code page, in memory?
In the first part, I explain how to encode a string of characters into a given code page (all is done in memory), and in the second part, I explain specifically how to write files to the application server in a given code page.
General way (all in memory)
If a string of characters (type STRING) has to be encoded, the result has to be stored in a string of bytes, which corresponds to the built-in data type XSTRING.
There are several possibilities which depend on the ABAP version:
Since 7.53, use the class CL_ABAP_CONV_CODEPAGE:
DATA(xstring) = cl_abap_conv_codepage=>create_out( codepage = `UTF-16LE` )->convert( source = `ABCDE` ).
Since 7.02, use the class CL_ABAP_CODEPAGE:
DATA xstring TYPE xstring.
xstring = cl_abap_codepage=>convert_to( source = `ABCDE` codepage = `UTF-16LE` ).
Before 7.02, use the class CL_ABAP_CONV_OUT_CE (documentation provided with the class):
First, instantiate the conversion object, use a SAP code page number instead of the ISO name (list of values shown hereafter):
DATA: conv TYPE REF TO CL_ABAP_CONV_OUT_CE, xstring TYPE xstring.
conv = CL_ABAP_CONV_OUT_CE=>CREATE( encoding = '4103' ). "4103 = utf-16le
Then encode the string and retrieve the bytes encoded:
conv->RESET( ).
conv->WRITE( data = `ABCDE` ).
xstring = conv->GET_BUFFER( ).
Eventually, instead of using RESET, WRITE and GET_BUFFER, the method CONVERT was added in 6.40 and retroported :
conv->CONVERT( EXPORTING data = `ABCDE` IMPORTING buffer = xstring ).
With the class CL_ABAP_CONV_OUT_CE, you need to use the number of the SAP Code Page, not the ISO name. Here are the most common SAP code pages and their equivalent ISO names:
1100: ISO-8859-1
1101: US-ASCII
1160: Windows-1252 ("ANSI")
1401: ISO-8859-2
4102: UTF-16BE
4103: UTF-16LE
4104: UTF-32BE
4105: UTF-32LE
4110: UTF-8
Etc. (the possible values are defined in the table TCP00A, in lines with column CPATTRKIND = 'H').
Writing a file on the application server in a given code page
In ABAP, OPEN DATASET can directly specify the target code page, most code pages are supported including UTF-8, but not other UTF (code pages 41xx) which can be done only by the solution explained in 2.3 below (by first encoding in memory).
2.1) IN TEXT MODE ENCODING ...
Possible ENCODING values:
UTF-8: in this mode, it's possible to add the Byte Order Mark if needed, via the option WITH BYTE-ORDER MARK.
DEFAULT: will be UTF-8 in a SAP "Unicode" system (that you can check via the menu System > Status > Unicode System Yes/No), NON-UNICODE otherwise.
NON-UNICODE: will depend on the current ABAP linguistic environment; for language English, it's the character encoding iso-8859-1, for language Polish, it's the character encoding iso-8859-2, etc. (the equivalences are shown in table TCP0C.)
Example in ABAP version 7.52 to write to UTF-8 with the byte order mark:
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_utf_8`.
OPEN DATASET filename IN TEXT MODE ENCODING UTF-8 WITH BYTE-ORDER MARK FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
Example in ABAP version 7.52 to write to iso-8859-2 (Polish language here):
REPORT zmyprogram.
SET LOCALE LANGUAGE 'L'. " Polish
DATA(filename) = `/tmp/dataset_nonunicode_pl`.
OPEN DATASET filename IN TEXT MODE ENCODING NON-UNICODE FOR OUTPUT.
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.2) IN LEGACY TEXT MODE CODE PAGE ...
Use any code page number except code pages 41xx (i.e. UTF-8 and other UTF; see workaround in 2.3 below).
Example in ABAP version 7.52 to write to iso-8859-2 (code page 1401) :
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_iso_8859_2`.
OPEN DATASET filename IN LEGACY TEXT MODE CODE PAGE '1401' FOR OUTPUT. " iso-8859-2
TRY.
TRANSFER `Witaj świecie` TO filename.
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.3) UTF = general way + IN BINARY MODE
Example in ABAP version 7.52:
REPORT zmyprogram.
TRY.
DATA(xstring) = cl_abap_codepage=>convert_to( source = `Witaj świecie` codepage = `UTF-16LE` ).
CATCH cx_sy_conversion_codepage INTO DATA(lx).
" Character not supported in language code page
BREAK-POINT.
ENDTRY.
DATA(filename) = `/tmp/dataset_utf_16le`.
OPEN DATASET filename IN BINARY MODE FOR OUTPUT.
TRANSFER xstring TO filename.
CLOSE DATASET filename.
I have a TIdHTTP component on a form, and I am sending an http POST request to a cloud-based server. Everything works brilliantly, except for 1 field: a text string with a plus sign, e.g. 'hello world+dog', is getting saved as 'hello world dog'.
Researching this problem, I realise that a '+' in a URL is regarded as a space, so one has to encode it. This is where I'm stumped; it looks like the rest of the POST request is encoded by the TIdHTTP component, except for the '+'.
Looking at the request through Fiddler, it's coming through as 'hello%20world+dog'. If I manually encode the '+' (hello world%2Bdog), the result is 'hello%20world%252Bdog'.
I really don't know what I'm doing here, so if someone could point me in the right direction it would be most appreciated.
Other information:
I am using Delphi 2010. The component doesn't have any special settings, I presume I might need to set something? The header content-type that comes through in Fiddler is 'application/x-www-form-urlencoded'.
Then, the Delphi code:
Request:='hello world+dog';
URL :='http://............./ExecuteWithErrors';
TSL:=TStringList.Create;
TSL.Add('query='+Request);
Try
begin
IdHTTP1.ConnectTimeout:=5000;
IdHTTP1.ReadTimeout :=5000;
Reply:=IdHTTP1.Post(URL,TSL);
You are using an outdated version of Indy and need to upgrade.
TIdHTTP's webform data encoder was changed several times in late 2010. Your version appears to predate all of those changes.
In your version, TIdHTTP uses TIdURI.ParamsEncode() internally to encode the form data, where a space character is encoded as %20 and a + character is left un-encoded, thus:
hello%20world+dog
In October 2010, the encoder was updated to encode a space character as & before calling TIdURI.ParamsEncode(), thus:
hello&world+dog
In early December 2010, the encoder was updated to encode a space character as + instead, thus:
hello+world+dog
In late December 2010, the encoder was completely re-written to follow W3C's HTML specifications for application/x-www-form-urlencoded. A space character is encoded as + and a + character is encoded as %2B, thus:
hello+world%2Bdog
In all cases, the above logic is applied only if the hoForceEncodeParams flag is enabled in the TIdHTTP.HTTPOptions property (which it is by default). If upgrading is not an option, you will have to disable the hoForceEncodeParams flag and manually encode the TStringList content yourself:
Request:='hello+world%2Bdog';
I'm writing a multilanguage WebBroker application (for Apache 2.2) which display unicode encoded data from an oracle database. The same data with the same data-aware components (Devart's ODAC) in a test program written in Delphi XE7, are displaying correctly.
My problem occured in the WebBroker where I'm facing with a strange behavior when I'm using a PageProducer to prepare the content for the response
When I'm using the follow code in the action :
Response.ContentType := 'text/html; charset=UTF-8';
PageProducer1.HTMLFile:= htmltemplate
Response.Content :=
PageProducer1.Content+
'Label 1 ='+Label1fromDB+
' Label 2='+Label2fromDB+
'</body></html>';
Response.SendResponse;
the result in the webbrowser is that all the non-latin chars that was not inserted by the PageProducer but already existed in the htmltemplate file (which has declared as utf-8 with ) replaced by other not correct chars, the text that was inserted by the PageProducer and was retrived from the database also not displayed correctly but the extra labels, Label1fromDB and Label2fromDB which are added to the content response as showing in the code above are displayed correctly having also the same chars which are inside the htmltemplate.
Now, when I omit the declaration
Response.ContentType := 'text/html; charset=UTF-8'
the content of the htmltemplate are diplaying correctly but all the other texts , the text filled by PageProducer and the two labels Label1fromDB and Label2fromDB not correctly
Can you please help me identify where the problem is and what have I do in order to be able to serve unicode multilanguage content via web broker?
You are having issues with different Unicode encodings.
Both PageProducer.Content and Response.Content are UTF16 encoded strings. Your htmltemplate is encoded in UTF8 and is not correctly interpreted by PageProducer.
Most likely cause is that htmltemplate file does not have UTF8 BOM at beginning and PageProducer will interpret that file encoding as default ANSI. If that is so, adding UTF8 BOM should solve your problem.
My input file(f) has some Unicode (Swedish) that isn't being read correctly.
Neither of these approaches works, although they give different results:
LoadFromFile(f);
or
LoadFromFile(f,TEncoding.GetEncoding(GetOEMCP));
I'm using Delphi XE
How can I LoadFromFile some Unicode data....also how do I subsequently SaveToFile? Thanks
In order to load a Unicode text file you need to know its encoding. If the file has a Byte Order Mark (BOM), then you can simply call LoadFromFile(FileName) and the RTL will use the BOM to determine the encoding.
If the file does not have a BOM then you need to explicitly specify the encoding, e.g.
LoadFromFile(FileName, TEncoding.UTF8);
LoadFromFile(FileName, TEncoding.Unicode);//UTF-16 LE
LoadFromFile(FileName, TEncoding.BigEndianUnicode);//UTF-16 BE
For some reason, unknown to me, there is no built in support for UTF-32, but if you had such a file then it would be easy enough to add a TEncoding instance to handle that.
I assume that you mean 'UTF-8' when you say 'Unicode'.
If you know that the file is UTF-8, then do
LoadFromFile(f, TEncoding.UTF8).
To save:
SaveToFile(f, TEncoding.UTF8);
(The GetOEMCP WinAPI function is for old 255-character character sets.)
I've been using Delphi 2009 with the Indy library (10) that ships and have been upgrading a legacy application that makes heavy use of the TPageProducer. The legacy app was originally written for Delphi 5 / Indy 8.
I'm using the OnHTMLTag property of TPageProducer to specify a function that will handle the HTML transparent tags in my source. My problem was that if I put unicode (Simplified Chinese) characters in the TPageProducer.HTMLDoc property, when the OnHTMLTag callback was called, the TagParams argument contains ?? instead of the expected Chinese characters.
I traced this down to around line 2053 of HTTPApp.pas where we separate out the key / value pairs of the transparent tag:
procedure ExtractHeaderFields(Separators, WhiteSpace: TSysCharSet; Content: PChar;
Strings: TStrings; Decode: Boolean; StripQuotes: Boolean = False);
...
if Decode then
Strings.Add(string(HTTPDecode(AnsiString(DoStripQuotes(ExtractedField)))))
else
Strings.Add(DoStripQuotes(ExtractedField));
...
Everything is fine until we cast the string to an AnsiString and pass it to HTTPDecode, at which point my Strings list contains ?? as does my final TagParams and webpage.
Should there be a version of HTTPDecode that works with Strings instead of AnsiStrings? If so, where might I find this?
For now, I've just disabled the decode routine when I parse my tokens for the TPageProducer, but it isn't a nice fix and would prefer to have a version of this that works with wide characters (if that is even possible).