I can't understand what encoding approach Thunderbird uses when searching on an IMAP server with the IMAP SEARCH CHARSET command.
I tried to search for the Russian word "привет", and it was mapped to "?@825B", i.e.
A001 SEARCH CHARSET ISO-8859-1 BODY "?@825B"
How did that happen? I'm sure this is exactly what was sent, as I captured it with a sniffer, and the Dovecot server still correctly found the mail containing the word "привет". But ISO-8859-1 has no Russian glyphs at all! So how was it converted?
For example, "привет" (written as Unicode characters) gives "??????" for ISO-8859-1 encoding on my machine or here http://www.motobit.com/util/charset-codepage-conversion.asp
The way that Thunderbird gets this value is by downcasting each (16-bit) Unicode character, i.e. each UTF-16 code unit, to a byte.
For example, in C# (which uses UTF-16 internally for its char and string types), this would get the result you are seeing:
const string text = "привет";
var buffer = new char[text.Length];
for (int i = 0; i < text.Length; i++)
    buffer[i] = (char) ((byte) text[i]); // truncate each UTF-16 code unit to its low byte
var result = new string (buffer); // "?@825B"
How Thunderbird handles surrogate pairs is anyone's guess based on what is known from the question. It might treat the surrogate pair as two separate characters (as my code above would), or it might combine them into a 32-bit Unicode character and downcast that to a byte.
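To make the suspected mapping concrete, here is a minimal Python sketch (my own illustration of the downcast, not Thunderbird's actual code) that truncates each UTF-16 code unit to its low byte; for "привет" it produces exactly the "?@825B" seen on the wire:

# Truncate each UTF-16 code unit to its low byte (all of "привет" is in the BMP,
# so each character is a single code unit).
text = "привет"
downcast = "".join(chr(ord(ch) & 0xFF) for ch in text)
print(downcast)  # ?@825B

# A non-BMP character becomes a surrogate pair in UTF-16; treating the pair as
# two separate units (as above) would yield the two low bytes of the surrogates:
units = "😀".encode("utf-16-be")   # b'\xd8\x3d\xde\x00' (U+D83D, U+DE00)
print(hex(units[1]), hex(units[3]))  # 0x3d 0x0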
My Flutter app retrieves information via a REST interface that can contain extended ASCII characters, e.g. e-acute (0xE9). How can I convert this into UTF-8 (e.g. 0xC3 0xA9) so that it displays correctly?
0xE9 corresponds to e-acute (é) in the ISO-8859/Latin 1 encoding. (It's one of many possible encodings for "extended ASCII", although personally I associate the term "extended ASCII" with code page 437.)
You can decode it to a Dart String (which internally stores UTF-16) using Latin1Codec. If you really want UTF-8, you can encode that String to UTF-8 afterward with Utf8Codec.
import 'dart:convert';

void main() {
  var s = latin1.decode([0xE9]);
  print(s); // Prints: é

  var utf8Bytes = utf8.encode(s);
  print(utf8Bytes); // Prints: [195, 169]
}
I was getting confused because sometimes the data contained extended ASCII characters and sometimes UTF-8. When I tried doing a UTF-8 decode, it baulked at the extended ASCII.
I fixed it by trying the UTF-8 decode first and catching the error it raises on extended ASCII, then falling back to the Latin-1 decode, which seems to handle it OK.
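For anyone hitting the same mixed-input problem, this is the general shape of that fallback (a Python sketch of the same idea; the original fix was in Dart):

def decode_bytes(data: bytes) -> str:
    # Try UTF-8 first; fall back to Latin-1, which maps every byte 0x00-0xFF
    # to a code point and therefore cannot fail.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")

print(decode_bytes(b"\xc3\xa9"))  # é (valid UTF-8)
print(decode_bytes(b"\xe9"))      # é (invalid UTF-8, decoded as Latin-1)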
I am making a program in Delphi 7, that is supposed to encode a unicode string into html entity string.
For example, "ABCģķī" would result in "ABCģķī"
Now 2 basic things:
Delphi 7 is non-Unicode, so I can't just write unicode chars directly in code to encode them.
Codepages consist of 256 entries, each holding a character specific to that codepage, except the first 128, which are the same for all codepages.
So: how do I get the value of a char such that it is in the 1-255 range?
I tried Ord() (which returns an Integer), but it also returns values way past 255. Basically, everything is fine (A returns 65 and so on) until my string reaches non-Latin Unicode.
Is there any other method for returning a char's value? Any help appreciated.
I suggest you avoid codepages like the plague.
There are two approaches for Unicode that I'd consider: WideString, and UTF-8.
WideStrings have the advantage of being 'native' to Windows, which helps if you need to use Windows API calls. Disadvantages are storage space, and that they (like UTF-8) can require multiple WideChars to encode the full Unicode space.
UTF-8 is generally preferable. Like WideStrings, this is a multi-byte encoding, so a particular unicode 'code point' may need several bytes in the string to encode it. This is only an issue if you're doing lots of character-by-character processing on your strings.
@DavidHeffernan comments (correctly) that WideStrings may be more compact in certain cases. However, I'd recommend UTF-16 only if you are absolutely sure that your encoded text will really be more compact (don't forget markup!) and this compactness is highly important to you.
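As a rough illustration of that trade-off (a hypothetical markup-heavy payload; Python used only to count the bytes):

text = "<p>привет</p>"
print(len(text.encode("utf-8")))      # 19 bytes: ASCII markup costs 1 byte/char, Cyrillic 2
print(len(text.encode("utf-16-le")))  # 26 bytes: every BMP character costs 2 bytes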
In HTML 4, numeric character references are relative to the charset used by the HTML. Whether that charset is specified in the HTML itself via a <meta> tag, or out-of-band via an HTTP/MIME Content-Type header or other means, does not matter. As such, "ABC&#291;&#311;&#299;" would be an accurate representation of "ABCģķī" only if the HTML were using UTF-16. If the HTML were using UTF-8, the correct representation would be either "ABC&#196;&#163;&#196;&#183;&#196;&#171;" (decimal) or "ABC&#xC4;&#xA3;&#xC4;&#xB7;&#xC4;&#xAB;" (hex) instead. Most other charsets do not support those particular Unicode characters.
In HTML 5, numeric character references contain the original Unicode codepoint values regardless of the charset used by the HTML. As such, "ABCģķī" would be represented as either "ABC&#291;&#311;&#299;" (decimal) or "ABC&#x123;&#x137;&#x12B;" (hex).
So, to answer your question, the first thing you have to do is decide whether you need HTML 4 or HTML 5 semantics for numeric character references. Then assign your Unicode data to a WideString (the only Unicode string type that Delphi 7 natively supports), which uses UTF-16. Then:
If you need HTML 4:
A. if the HTML charset is not UTF-16, then use WideCharToMultiByte() (or equivalent) to convert the WideString to that charset, then loop through the resulting values outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. if the HTML charset is UTF-16, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
If you need HTML 5:
A. if the WideString does not contain any surrogate pairs, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. otherwise, convert the WideString to UTF-32 using WideStringToUCS4String(), then loop through the resulting values outputting unreserved codepoints as-is and character references for reserved codepoints, using IntToStr() for decimal notation or IntToHex() for hex notation.
In case I understood the OP correctly, I'll just leave this here.
function Entities(const S: WideString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if Word(S[I]) > Word(High(AnsiChar)) then
      Result := Result + '&#' + IntToStr(Word(S[I])) + ';'
    else
      Result := Result + S[I];
  end;
end;
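For comparison, here is a minimal sketch of the HTML 5 branch in Python (names are my own; Python iterates strings by code point, which corresponds to the UTF-32 loop described above, so surrogate pairs are handled automatically, unlike the BMP-only Delphi function above):

def entities(s: str) -> str:
    out = []
    for ch in s:                           # one iteration per code point
        if ord(ch) > 0xFF:                 # same threshold as the Delphi code
            out.append("&#%d;" % ord(ch))  # decimal; use "&#x%X;" for hex
        else:
            out.append(ch)
    return "".join(out)

print(entities("ABCģķī"))  # ABC&#291;&#311;&#299;
print(entities("AB😀"))    # AB&#128512; -- one reference for a non-BMP code point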
I am now working on an iOS app that handles Unicode characters, but there seems to be a problem translating a Unicode hex value (and int value, too) into a character.
For example, I want to get the character 'đ', which has the Unicode value c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean character) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of the two pieces of code above are the same.
I can't understand what the problem is here; please help!
The short answer
To specify đ, you can specify it in the following ways (untested):
#"đ"
#"\u0111"
#"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last two lines use C string literals instead of the Objective-C string object literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
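The distinction can be verified quickly (a Python one-off, purely illustrative; the question itself is about Objective-C):

# The escape \u0111 and the UTF-8 bytes C4 91 denote the same character:
assert "\u0111" == b"\xc4\x91".decode("utf-8") == "đ"
print(hex(ord("đ")))  # 0x111 -- the code point, not 0xc491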
Unicode escape sequences (Universal character names in C99)
According to this blog [1]:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows that \unnnn or \Unnnnnnnn denotes a character whose "short identifier as defined by ISO/IEC 10646" is nnnn or nnnnnnnn; roughly speaking, that means its hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confusing the code point U+0111 with the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly:
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give different byte representations of the character on disk, but since UTF-8, UTF-16 and UTF-32 are all encodings for the Unicode character set, the code point for the same character is the same across all three encodings.
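A quick demonstration of that point (Python, for illustration): the code point stays the same while the byte representation changes with the encoding.

ch = "đ"
print(hex(ord(ch)))            # 0x111 -- the code point, shared by all encodings
print(ch.encode("utf-8"))      # b'\xc4\x91'
print(ch.encode("utf-16-be"))  # b'\x01\x11'
print(ch.encode("utf-32-be"))  # b'\x00\x00\x01\x11'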
Footnote
[1]: I think the blog is correct, but if anyone can find official documentation from Apple on this point, that would be better.
I am trying to encode the 'subject' field of an email, written in Hebrew, into Base64 so that the subject can be read correctly in all clients. At the moment I am using the Windows-1255 encoding, which works in some clients but not all, so I want to use UTF-8 with Base64.
My reading on the subject (no pun intended) shows that the text has to be in the form
=?<charset>?<encoding>?<encoded text>?=
eg
=?windows-1255?Q?=E0=E1?=
I have taken encoded subject lines from letters which were sent to me in Hebrew with UTF-8B encoding and decoded them successfully on this website, www.webatic.com/run/convert/base64.php. I have also used this website to encode simple letters and have noticed that the return encoding is not the same as the result which I get from a Delphi algorithm.
So I am looking for an algorithm which successfully encodes letters such as aleph (ord=224), bet (ord=225), etc. According to the website, the string composed of the two letters aleph and bet returns the code 15DXkQ==, but the basic Delphi algorithm returns Ue4, and the TIdEncoderQuotedPrintable component returns =E0=E1 (which is the quoted-printable form of the Windows-1255/ISO-8859-8 bytes).
Edit (after several comments):
I asked a friend to send me an email from her Mac computer, which unsurprisingly uses UTF-8 encoding (as opposed to Windows-1255). The subject was one letter, aleph, ord 224. The encoded subject appeared in the email's header as follows
=?UTF-8?B?15A=?=
This can be separated into three parts: the 'prefix' (=?UTF-8?B?), which means that UTF-8 with Base64 encoding is being used; the 'payload' (15A=), which the web site quoted above translates correctly as the letter aleph; and the suffix (?=).
I need an algorithm to translate an arbitrary string of letters, most of which will be in Hebrew (and thus with ord >= 224) into base64/utf-8; a correct solution is one that decodes correctly on the web site quoted.
I'm sorry to have wasted all your time. I spent several hours again on the subject today and discovered that the base64 code which I was using has a huge bug.
The steps necessary to send a base64 encoded UTF-8 subject line are:
Convert 'normal' text (i.e. the local ANSI code page) to UTF-8 via the AnsiToUTF8 function
Encode this into base64
Create a string from the prefix '=?UTF-8?B?', the result from stage 2, and the suffix '?='
Send!
Here is the complete code for creating and sending the email (obviously simplified)
with IdSMTP1 do
begin
  Host := ....;
  Username := ....;
  Password := ....;
end;

with email do
begin
  From.Address := ....;
  Recipients.EMailAddresses := ....;
  CCList.Add.Address := ....;
  Subject := '=?UTF-8?B?' + Encode64(AnsiToUTF8(Edit1.Text)) + '?=';
  Body.Text := ....;
end;

try
  IdSMTP1.Connect(1000);
  IdSMTP1.Send(email);
finally
  if IdSMTP1.Connected then
    IdSMTP1.Disconnect;
end;
Using the code on this page (which is the same as this page), the 'codes64' string begins with the digits, then the capital letters, then the lower-case letters, and then the punctuation. But this page shows that the capital letters should come first, followed by the lower-case letters, then the digits, and finally the punctuation.
Once I had made this correction, the strings began to be encoded 'correctly' - I could read them properly in my email client, which I am taking to be the definition of 'correct'.
It would be interesting to read whether anybody else has had problems with the base64 encoding code which I found.
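As a cross-check, a known-good Base64 implementation (Python's, used here purely for verification) reproduces both values discussed above: the 15DXkQ== payload for aleph-bet and the full encoded-word form.

import base64

subject = "אב"  # aleph, bet
payload = base64.b64encode(subject.encode("utf-8")).decode("ascii")
print(payload)                        # 15DXkQ==
print("=?UTF-8?B?" + payload + "?=")  # =?UTF-8?B?15DXkQ==?=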
You do not need to encode the Subject property manually at all. TIdMessage encodes it automatically for you. Simply assign the Edit1.Text value as-is to the Subject and let TIdMessage encode it as needed.
If you want to customize how TIdMessage encodes headers, use the TIdMessage.OnInitializeISO event to provide the desired charset and encoding values. In Delphi 2009+, it defaults to UTF-8 and Base64. In earlier versions, TIdMessage reads the RTL's current OS language and chooses some default values for known languages. However, Hebrew is not one of them, and so ISO-8859-1 and QuotedPrintable would end up being used. You can override those values, eg:
email.Subject := Edit1.Text;
procedure TForm1.emailInitializeISO(var VHeaderEncoding: Char; var VCharSet: string);
begin
  VHeaderEncoding := 'B'; // 'B' = Base64, 'Q' = quoted-printable
  VCharSet := 'UTF-8';
end;
I want to escape a Unicode word for use in a URL to make an HTTP request, for example to convert "محمود" to "%D9%85%D8%AD%D9%85%D9%88%D8%AF". I noticed that each character is converted to two hex escapes.
Thanks a lot
Convert to UTF-8, then url-encode chars not in [[:alnum:]].
Url-encoding is where a character is converted into %<HIGHNIBBLE><LOWNIBBLE> form, where HIGHNIBBLE = (ch >> 4) & 0x0F and LOWNIBBLE = (ch & 0x0F).
Look into RFC 1738 § 2.2 for more details.
Because it looks like you're using Java, you'll have to work with byte[] instead of String or char[].
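A minimal sketch of that recipe (in Python for illustration; in Java you would apply the same logic to the byte[] from String.getBytes("UTF-8")):

def url_encode(text: str) -> str:
    out = []
    for b in text.encode("utf-8"):       # step 1: convert to UTF-8 bytes
        if b < 0x80 and chr(b).isalnum():
            out.append(chr(b))           # ASCII alphanumerics pass through
        else:
            out.append("%%%02X" % b)     # %<HIGHNIBBLE><LOWNIBBLE>
    return "".join(out)

print(url_encode("محمود"))  # %D9%85%D8%AD%D9%85%D9%88%D8%AF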