As part of transitioning my Thunderbird extension to Thunderbird 60, I need to switch from using nsIScriptableUnicodeConverter (If you don't know Mozilla, never mind what that is) to the more popular, and multiple-browser-supported, TextDecoder and TextEncoder. The thing is, their behavior is not what I would expect.
Specifically, suppose I have the string str containing "ùìåí," (without the quotes of course). Now, when I run:
undecoded_str = new TextEncoder("windows-1252").encode(str);
I expect to be getting the sequence
F9, EC, E5, ED, 2C
(the 1-octet windows-1252 value for each of the 5 characters). But what I actually get is:
C3, B9, C3, AC, C3, A5, C3, AD, 2C
which seems to be the UTF-8 encoding of the string. Why is this happening?
Annoyingly, many browsers have simply dropped support for character encodings other than UTF-8 in TextEncoder (though, as the note below says, not in TextDecoder):
Note: Firefox, Chrome and Opera used to have support for encoding types other than utf-8 (such as utf-16, iso-8859-2, koi8, cp1261, and gbk). As of Firefox 48 (ticket), Chrome 54 (ticket) and Opera 41, no other encoding types are available other than utf-8, in order to match the spec. In all cases, passing in an encoding type to the constructor will be ignored and a utf-8 TextEncoder will be created (the TextDecoder still allows for other decoding types).
Damn it!
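For what it's worth, the two byte sequences above are easy to cross-check outside the browser. Here is a minimal sketch in Python (not Thunderbird code, purely an illustration of the two encodings):
# Cross-check of the byte sequences quoted above.
s = "ùìåí,"
print(s.encode("windows-1252").hex(","))  # f9,ec,e5,ed,2c -- what I expected
print(s.encode("utf-8").hex(","))         # c3,b9,c3,ac,c3,a5,c3,ad,2c -- what TextEncoder returns
On the browser side, the quoted note means TextDecoder("windows-1252") still works for decoding, but to produce windows-1252 bytes you have to build the mapping yourself or pull in a third-party encoding library.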
Related
When I use UTF8ToAnsi on this string, the result is empty. Any idea why that might be?
msgid "2. Broughton, PMG. ^iJournal of Automatic Chemistry.^n ^lVol 6. No 2. (April – June 1984) pp 94-95."
This demonstrates the problem:
procedure TForm1.FormShow(Sender: TObject);
begin
  Memo1.Lines.Text :=
    '<<' +
    UTF8ToANSI('msgid "2. Broughton, PMG. ^iJournal of Automatic Chemistry.^n^lVol 6. No 2. (April – June 1984) pp 94-95."') +
    '>>';
end;
which produces
"<<>>"
Your code fails because what you pass is not UTF-8 encoded. What you pass to this function is actually ANSI encoded. When Utf8Decode receives that text, it attempts to decode it, and when it encounters malformed bytes, bytes that are not valid UTF-8, it bails out and returns the empty string.
The problem character is the dash in April – June 1984, which is an en dash. In ANSI that is encoded as #150. When you attempt to interpret that as UTF-8, #150 is not a single-byte encoding of a character, and it is also invalid as the first byte of a multi-byte sequence. Hence the failure.
To solve your actual problem, you'll need to work out why you have data that is not UTF-8 in a place where you expect UTF-8.
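The failure mode is easy to reproduce. A minimal sketch in Python rather than Delphi, purely to show the bytes involved:
# The en dash U+2013 is byte 150 ($96) in the ANSI (windows-1252) code page.
dash = "\u2013"                     # the dash from "April – June"
ansi = dash.encode("windows-1252")  # b'\x96', i.e. #150
# 0x96 is 0b10010110: a UTF-8 continuation byte, never valid as a first byte,
# so a strict UTF-8 decoder rejects the input outright.
try:
    ansi.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x96 in position 0: invalid start byte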
Utf8ToAnsi returns an empty string if the input isn't valid UTF-8 (such as having an incomplete multibyte character or a malformed trailing byte). You can debug your program to discover what your string really contains. You evidently have a problem in the way you obtain your input string. Perhaps you're misinterpreting UTF-8, or perhaps you never really had UTF-8 in the first place.
The dash that you use between April – June is not valid UTF-8 on its own, so it cannot be decoded. This is not immediately visible, but the symbol used here is not a normal hyphen-minus; it is a different character.
This might be a silly question but... here it goes!
I wrote my own MIME parser in native C++. It's a nightmare with the encodings! It was stable for the last 3 months or so but recently I noticed this Subject: header.
Subject: =?UTF-8?B?T2ZpY2luYSBkZSBJbmZvcm1hY2nDs24sIEluaWNpYXRpdmFzIHkgUmVjbGFt?===?UTF-8?B?YWNpb25lcw==?=
which should decode to this:
Subject: Oficina de Información, Iniciativas y Reclamaciones
The problem is that there is one extra = (equals sign) in there binding the two encoded elements, and I can't figure out why (and why two elements? I don't understand why they are separated). In theory the format should be =?charset?encoding?encoded_string?=, but I found another subject that starts with two =:
==?UTF-8?B?blahblahlblah?=
How should I handle the extra =?
I could replace ==? with =? before doing anything (which is what I do, and it works)... but I'm wondering whether there's any spec covering this, so that I don't hack my way into proper functionality.
PS: How much I hate these relic protocols! All text communications should be UTF-8 and XML :)
In MIME headers, encoded words are used (RFC 2047, Section 2).
... (why 2?)
To overcome the 75-character limit on a single encoded word, which exists because of the 76-character limit on header lines containing encoded words (or to use two different encodings, for example Chinese and Polish, in one header).
RFC 2047:
An 'encoded-word' may not be more than 75 characters long,
including 'charset', 'encoding', 'encoded-text', and delimiters.
If it is desirable to encode more text than will fit in an
'encoded-word' of 75 characters, multiple 'encoded-word's
(separated by CRLF SPACE) may be used.
Here's the example from RFC2047 (note there is no '=' in between):
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
Your subject should be decoded as:
"Oficina de Información, Iniciativas y Reclam=aciones"
It is called the "Soft Line Break" and it is a holdover from the SMTP protocol.
Quoting page 20 of RFC 2045:
(Soft Line Breaks) The Quoted-Printable encoding
REQUIRES that encoded lines be no more than 76
characters long. If longer lines are to be encoded
with the Quoted-Printable encoding, "soft" line breaks
must be used. An equal sign as the last character on an
encoded line indicates such a non-significant ("soft")
line break in the encoded text.
And also Wikipedia on Quoted-printable:
A soft line break consists of an "=" at the end of an encoded line,
and does not appear as a line break in the decoded text.
That answer is incorrect, however: soft line breaks apply only to the Quoted-Printable Content-Transfer-Encoding, which can be used in the MIME body; they do not occur inside encoded words in headers.
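To see what a soft line break actually is (in a quoted-printable body, not in a header), here is a minimal sketch using Python's standard quopri module:
import quopri
# A quoted-printable body line ending in "=" is a soft line break:
# the "=" and the line break vanish on decoding, rejoining the long line.
encoded = b"This line was too long and had to be wra=\r\npped."
print(quopri.decodestring(encoded))
# b'This line was too long and had to be wrapped.'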
From what I can see in the MIME RFCs, double equals signs are not valid input (for encoding), but keep in mind that you could interpret the first equals sign as what it is and then use what follows for decoding. Seriously, though, those extra equals signs look like artifacts, maybe from an incorrect encoder.
I have an MVC Razor view
@{
    ViewBag.Title = "Index";
    var c = (char)146;
    var c2 = (short)'’';
}
<h2>@c --- @c2 --’-- ‘Why Oh Why’ & </h2>
@String.Format("hi {0} there", (char)146)
Characters stored in varchar fields in my database are not rendering in the browser. The example above demonstrates how character 146 doesn't show up. How do I make them render?
[EDIT]
When I do the following, character 146 gets converted to Unicode 8217; but if 146 is rendered directly in the browser, it fails:
public ActionResult Index()
{
    using (var context = new DataContext())
    {
        var uuuuuggghhh = (from r in context.Projects
                           where r.bizId == "D11C6FD5-D084-43F0-A1EB-76FEED24A28F"
                           select r).FirstOrDefault();
        if (uuuuuggghhh != null)
        {
            var ca = uuuuuggghhh.projectSummaryTxt.ToCharArray();
            ViewData.Model = ca[72]; // this is the character in question
            return View();
        }
    }
    return View();
}
@Html.Raw(((char)146).ToString())
or
@Html.Raw(String.Format("hi {0} there", (char)146))
both appear to work. I was testing this in Chrome and kept getting blank data; after viewing in Firefox I can confirm the representation was printing (however, 146 doesn't appear to be a readable character).
This is confirmed with a readable character '¶' below:
@Html.Raw(((char)182).ToString())
Not sure why you would want this though. But best of luck!
You do not want to use character 146. Character 146 is U+0092 PRIVATE USE TWO, an obscure and useless control character that typically renders as invisible, or a missing-glyph box/question mark.
If you want the character ’: that is U+2019 RIGHT SINGLE QUOTATION MARK, which may be written directly or using the character references &rsquo; or &#8217;.
146 is the byte number of the encoding of U+2019 into the Windows Western code page (cp1252), but it is not the Unicode character number. The bottom 256 Unicode characters are ordered the same as the bytes in the ISO-8859-1 encoding; ISO-8859-1 is similar to cp1252 but not the same.
Bytes 128–159 in cp1252 encode various typographical niceties like smart quotes, whereas bytes 128–159 in ISO-8859-1 (and hence characters 128–159 in Unicode) are seldom-used control characters. For web applications, you usually want to filter out the control characters (0–31 and 128–159 amongst a few others) as they come in, so they never get as far as the database.
If you are getting character 146 out of your database where you expect to have a smart quote, then you have corrupt data and you need to fix it up before continuing, or possibly you are reading the database using the wrong encoding (quite how this works depends what database you're talking to).
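The split personality of byte 146 is easy to demonstrate. A minimal sketch in Python (the same mapping applies in any cp1252-aware environment):
# Byte 146 ($92) means different things in different encodings.
b = bytes([146])
print(b.decode("windows-1252"))   # '’' U+2019 RIGHT SINGLE QUOTATION MARK
print(repr(b.decode("latin-1")))  # '\x92' U+0092, an invisible control character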
Now here's the trap. If you write &#146; as a character reference, the browser actually displays the smart quote U+2019 ’, and, confusingly, not the useless control character that actually owns that code point!
This is an old browser quirk: character references in the range &#128; to &#159; are converted to the character that maps to that byte in cp1252, instead of the real Unicode character with that number.
This was arguably a bug, but the earliest browsers did it back before they grokked Unicode properly, and everyone else was forced to follow suit to avoid breaking pages. HTML5 now documents and sanctions this. (Though not in the XHTML serialisation; browsers in XHTML parsing mode won't do this because it's against the basic rules of XML.)
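Python's html module implements the HTML5 rules, so the sanctioned quirk can be observed outside a browser. A small sketch:
import html
# Per HTML5, numeric character references in the 0x80-0x9F range are
# remapped through cp1252 instead of being taken literally.
s = html.unescape("&#146;")
print(s, hex(ord(s)))  # ’ 0x2019 (not U+0092)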
We finally agreed that the data was corrupt. We have asked the users who can't see this character rendered to fix the source data.
I'm having this problem: if I have
mychr = ' ';
where the 'space' in mychr is character #255 (typed manually as Alt+255), and I write
myord = ord (mychr)
then myord gets the value 160, not 255. Of course, the same problem occurs with the character Alt+254, etc.
How can I solve this problem? I have tested this in Delphi XE in console mode.
Note: if I use
mychar = #255;
then the ord() function returns the value correctly.
I think the problem is that the Windows Alt+Num shortcuts insert characters according to the local code page, whereas a modern Delphi uses Unicode characters, and these differ (unless the value is 127 or less, I think). The solution is to enter values like #255 explicitly in code. In addition, it is a very bad habit to include 'invisible' special characters in code, because you cannot tell which character it is without copying it into an external tool, and you have to trust the text encoding of the .pas file. It is much better to use constants like #255. Even better, do
const
MY_PRECIOUS_VALUE = #255;
and use this constant every time you need it.
Update
According to the English Wikipedia article on Alt code:
If the number typed has a leading 0 (zero), the character set used is the Windows code page that matches the current input locale. For most systems using the Latin alphabet, this is Windows-1252. For a complete list, see code page. If the number does not have a leading 0 (zero), DOS compatibility is invoked. The character set used is the DOS code page for the current input locale. For systems using English, this is code page 437. For most other systems using the Latin alphabet, this is code page 850. For a complete list, see code page.
So, if you really, really want to continue entering Alt keycodes, you'd better type Alt and 0255 with the leading zero.
If you type Alt+255, the DOS code page is used; in DOS code pages 437 and 850 (one of which you probably use), #255 is NBSP (non-breaking space). In Unicode, NBSP is $A0 (160). That explains why you obtain Ord 160.
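That chain is easy to verify. A minimal sketch in Python, whose cp437 codec models the same DOS code page (an illustration, not Delphi behaviour):
# Alt+255 (no leading zero) enters byte 255 from the DOS code page.
ch = bytes([255]).decode("cp437")  # cp850 gives the same result for this byte
print(repr(ch), ord(ch))           # '\xa0' 160 -- NBSP, exactly the Ord seen in Delphi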
AFAIK, console mode uses the OEM character set. And under Delphi XE, you're not in the ANSI world but in the UCS-2 / Unicode world.
var
  MyChar: char;
  MyWideChar: WideChar;
  MyAnsiChar: AnsiChar;
begin
  MyChar := #255;
  MyWideChar := #255;
  MyAnsiChar := #255;
end;
The first two variables are the same, i.e. a character with Unicode code 255 = $00FF, since in Delphi XE, char = WideChar. For the first Unicode Page, see this article.
But MyAnsiChar is what will be displayed on the console, after conversion from the current code page into the OEM console code page.
In the Unicode chart, this $00FF is a small y with diaeresis:
U+00FF ÿ Latin Small Letter Y with diaeresis
Under the console, you'll use the OEM character set, i.e. code page 437. So in your case $FF is NOT that character, but a special code
FF NBSP Non Breaking SPace
which is converted into U+00A0 when converted back to Unicode:
U+00A0 NBSP Non Breaking SPace
It is very likely that you are in the Windows-1252 code page, so normally the Delphi XE AnsiString will map #255 to a small y with diaeresis:
FF ÿ Latin Small Letter Y with diaeresis
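The two code pages simply disagree about byte $FF, as a short sketch shows (Python as a stand-in for the Delphi/Windows conversions; illustrative only):
# Byte $FF in the Windows ANSI code page vs. the OEM console code page.
print(bytes([0xFF]).decode("cp1252"))  # 'ÿ' U+00FF Latin small y with diaeresis
print("ÿ".encode("cp437"))             # b'\x98' -- the OEM console byte for ÿ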
You can use low-level Windows functions, e.g. CharToOemBuff, to perform the conversion to or from OEM, or use an OEM AnsiString type:
type
  TOemString = type AnsiString(437);
In all cases, the console is not the best way of entering accented text under modern Windows and a Unicode Delphi XE.
Using the InputQuery function, for example, should be safer, since it returns a Unicode string variable. ;)
I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and the dash character and extract the two numbers.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but a Unicode en dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
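The byte arithmetic is easy to check. A one-line sketch in Python (illustrative; these are the same bytes as in the Erlang list above):
# U+2013 EN DASH: code point 0x2013 encodes to three bytes in UTF-8.
b = "\u2013".encode("utf-8")
print(b.hex(" "), list(b))  # e2 80 93 [226, 128, 147]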
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.