Occasionally I get errors like this one:
An invalid character was found in the mail header: ''
which didn't make any sense. Upon investigation, it turns out there is some invisible character in the value.
I know which user this is, so I select them from the DB:
select email from user where email = 'their#address.com'
the user's email appears as their#address.com, but copying it into a text editor shows a weird leading character:
So why does the SQL equality operator match when it isn't the same string? Because it's some invisible character?
If I save just that leading character to a text file as Unicode and open it in a hex editor, I see this:
FF FE 0E 20
Update: the offending bytes are:
E2 80 8E
What is this craziness, and how did it get there?
How can I prevent this in the future, and how can I clean my database (as there are a few of these)?
These are the relevant headers from when the user was created:
Content-Type: application/x-www-form-urlencoded
Accept: application/json, text/javascript, */*; q=0.01
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Thanks
The bytes FF FE are U+FEFF BYTE ORDER MARK in the UTF-16LE encoding, and 0E 20 are U+200E LEFT-TO-RIGHT MARK in the same encoding (the bytes E2 80 8E from the update are the UTF-8 encoding of that same U+200E). At the start of a file, they are harmless, at least if the content is in a left-to-right writing system, like the Latin alphabet.
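The byte interpretation can be verified directly; a quick Python sketch, using the byte values from the question:

```python
# The four bytes seen in the hex editor, decoded as UTF-16.
# Python consumes the little-endian BOM (FF FE) automatically.
data = bytes([0xFF, 0xFE, 0x0E, 0x20])
print(repr(data.decode("utf-16")))            # '\u200e' (LEFT-TO-RIGHT MARK)

# The three bytes from the update are the same character in UTF-8.
print(repr(b"\xe2\x80\x8e".decode("utf-8")))  # '\u200e'
```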
I cannot make a guess on their origin, especially since I didn’t quite get what file is being discussed and how it was created (from a form post? from the database? some other way? how?).
FFFE is a Unicode Byte Order Mark (BOM).
Edit:
0E20, read as big-endian UTF-16, would be THAI CHARACTER PHO SAMPHAO; read as little-endian (which the FF FE BOM indicates), it is U+200E LEFT-TO-RIGHT MARK. No idea where that could come from.
As part of transitioning my Thunderbird extension to Thunderbird 60, I need to switch from using nsIScriptableUnicodeConverter (If you don't know Mozilla, never mind what that is) to the more popular, and multiple-browser-supported, TextDecoder and TextEncoder. The thing is, their behavior is not what I would expect.
Specifically, suppose I have the string str containing "ùìåí," (without the quotes of course). Now, when I run:
undecoded_str = new TextEncoder("windows-1252").encode(str);
I expect to be getting the sequence
F9, EC, E5, ED, 2C
(the 1-octet windows-1252 value for each of the 5 characters). But what I actually get is:
C3, B9, C3, AC, C3, A5, C3, AD, 2C
which seems to be the UTF-8 encoding of the string. Why is this happening?
Annoyingly, many browsers have simply dropped support for multiple character-set encodings in TextEncoder (and TextDecoder):
Note: Firefox, Chrome and Opera used to have support for encoding types other than utf-8 (such as utf-16, iso-8859-2, koi8, cp1261, and gbk). As of Firefox 48 (ticket), Chrome 54 (ticket) and Opera 41, no other encoding types are available other than utf-8, in order to match the spec. In all cases, passing in an encoding type to the constructor will be ignored and a utf-8 TextEncoder will be created (the TextDecoder still allows for other decoding types).
Damn it!
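The two byte sequences from the question can be reproduced outside the browser; here is a Python sketch of the two encodings (Python stands in for illustration only, it is not the TextEncoder API itself):

```python
s = "ùìåí,"

# What the questioner expected: one byte per character in windows-1252.
print(s.encode("cp1252").hex(" "))  # f9 ec e5 ed 2c

# What TextEncoder actually produces: UTF-8, regardless of the label
# passed to the constructor.
print(s.encode("utf-8").hex(" "))   # c3 b9 c3 ac c3 a5 c3 ad 2c
```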
It seems that Indy's GMTToLocalDateTime() does not ignore comments when decoding a date:
TDateTime dtDate1 = GMTToLocalDateTime("12 Mar 2015 14:03:21 -0000");
TDateTime dtDate2 = GMTToLocalDateTime("Thu, 12 Mar 2015 14:03:20 +0000 (GMT)");
TDateTime dtDate3 = GMTToLocalDateTime("Thu, 12 Mar 2015 14:03:20 +0000 (envelope-from <aaa#bbb.ccc>)");
TDateTime dtDate4 = GMTToLocalDateTime("Thu, 12 Mar 2015 14:03:20 +0000 (aaa#bbb.ccc)");
UnicodeString Dt1 = DateTimeToStr(dtDate1);
UnicodeString Dt2 = DateTimeToStr(dtDate2);
UnicodeString Dt3 = DateTimeToStr(dtDate3);
UnicodeString Dt4 = DateTimeToStr(dtDate4);
The first two are decoded correctly; the last two are not.
The part in parentheses is supposed to be ignored, because it is just a comment, but it seems that it is not.
Is this a bug in Indy?
Also - is there a bug-tracker for Indy (as it appears forums are down)?
GMTToLocalDateTime() (more specifically, RawStrInternetToDateTime()) is not meant to accept or look for embedded comments. Comments do not belong in the input and must be stripped off beforehand. Embedded comments are a feature of email, but are to be ignored when processing data (see RFC 822 Section 3.4.3).
In this situation, the comments were not stripped by the caller, and the presence of the '.' character in the comments of the last 2 examples was throwing off RawStrInternetToDateTime() when it checks for the presence of a timestamp and whether it uses ':' or '.' as a delimiter between the hour/minutes/seconds.
Indy as a whole is not designed to even recognize, let alone handle, embedded comments in headers. However, in this situation, I have made a small tweak to RawStrInternetToDateTime() so comments will not confuse the timestamp parsing anymore (though it is really the caller's responsibility to strip comments before parsing).
And yes, there are bug trackers for Indy:
http://code.google.com/p/indyproject
(though Google Code is shutting down, so this one will go away eventually).
http://indy.codeplex.com
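Since the answer says stripping comments is the caller's responsibility, here is a minimal sketch of that pre-processing step (in Python for illustration; Indy itself is Delphi/C++ Builder, so this only shows the idea, not Indy code):

```python
import re
from email.utils import parsedate_to_datetime

def strip_rfc822_comments(value: str) -> str:
    """Remove (possibly nested) RFC 822 parenthesized comments.
    Note: a full parser would also skip parentheses inside quoted
    strings; this sketch ignores that case."""
    prev = None
    while prev != value:  # repeat so nested comments collapse
        prev = value
        value = re.sub(r"\([^()]*\)", "", value)
    return value.strip()

hdr = "Thu, 12 Mar 2015 14:03:20 +0000 (envelope-from <aaa#bbb.ccc>)"
print(parsedate_to_datetime(strip_rfc822_comments(hdr)))
# 2015-03-12 14:03:20+00:00
```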
This might be a silly question but... here it goes!
I wrote my own MIME parser in native C++. It's a nightmare with the encodings! It had been stable for the last 3 months or so, but recently I noticed this Subject: header:
Subject: =?UTF-8?B?T2ZpY2luYSBkZSBJbmZvcm1hY2nDs24sIEluaWNpYXRpdmFzIHkgUmVjbGFt?===?UTF-8?B?YWNpb25lcw==?=
which should decode to this:
Subject: Oficina de Información, Iniciativas y Reclamaciones
The problem is there is one extra = (equals sign) in there binding the two encoded words, which I can't figure out (and why two encoded words at all?). In theory the format should be =?charset?encoding?encoded_string?=, but I found another subject that starts with two = signs:
==?UTF-8?B?blahblahlblah?=
How should I handle the extra =?
I could replace ==? with =? (which I am) before doing anything (and it works)... but I'm wondering if there's any kind of spec regarding this so I don't hack my way into proper functionality.
PS: How much I hate these relic protocols! All text communications should be UTF-8 and XML :)
In MIME headers, encoded words are used (RFC 2047, Section 2).
... (why 2?)
To overcome the 75-character limit on a single encoded word, which exists because of the 78-character line-length limit (or to use two different encodings, for example Chinese and Polish, in one header).
RFC 2047:
An 'encoded-word' may not be more than 75 characters long,
including 'charset', 'encoding', 'encoded-text', and delimiters.
If it is desirable to encode more text than will fit in an
'encoded-word' of 75 characters, multiple 'encoded-word's
(separated by CRLF SPACE) may be used.
Here's the example from RFC 2047 (note there is no '=' in between):
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
Your subject should be decoded as:
"Oficina de Información, Iniciativas y Reclam=aciones"
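For reference, this is how adjacent encoded words are meant to join up: whitespace between them is discarded, so the two fragments concatenate seamlessly. A Python sketch using the standard library on the RFC 2047 example quoted above:

```python
from email.header import decode_header

subject = ("=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?= "
           "=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=")

# decode_header returns one (raw_bytes, charset) pair per encoded word;
# the whitespace between adjacent encoded words is dropped per the RFC.
parts = decode_header(subject)
text = "".join(raw.decode(charset) for raw, charset in parts)
print(text)  # If you can read this you understand the example.
```

The stray "==" in the problematic subject has no such sanctioned meaning, which supports treating it as an encoder artifact.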
mraq's answer is incorrect: soft line breaks apply only to the Quoted-Printable Content-Transfer-Encoding, which can be used in a MIME body.
It is called the "Soft Line Break" and it is the heritage of the SMTP protocol.
Quoting page 20 of RFC2045
(Soft Line Breaks) The Quoted-Printable encoding
REQUIRES that encoded lines be no more than 76
characters long. If longer lines are to be encoded
with the Quoted-Printable encoding, "soft" line breaks
must be used. An equal sign as the last character on an
encoded line indicates such a non-significant ("soft")
line break in the encoded text.
And also Wikipedia on Quoted-printable
A soft line break consists of an "=" at the end of an encoded line,
and does not appear as a line break in the decoded text.
From what I can see in the MIME RFCs, double equal signs are not valid input (for encoding), but keep in mind you could interpret the first equal sign as what it is and then use what follows for decoding. But seriously, those extra equal signs look like artifacts, maybe from a buggy encoder.
I'm browsing the web for an answer but cannot find one. I have an HTML form (method=GET) and submit in a text field the text helloΩ (hello with the Greek letter Omega appended).
The URL in the browser encodes it as:
mytext=hello%26%23937%3B
Without the greek letter Omega appended, I get (as expected):
mytext=hello
So how does the Greek letter Omega get percent-encoded into:
%26%23937%3B
Thanks
This happens when your web server declares an encoding that doesn't support the character. For example, ISO-8859-1, the default encoding for many web servers, doesn't support it.
What you are seeing is the HTML character reference &#937; (Omega), percent-encoded. Because '&', '#', and the digits are all ASCII characters, this is the only way to avoid losing information when the browser thinks the server only supports ISO-8859-1.
To fix this, declare UTF-8 in your http header:
Content-Type: text/html; charset=utf-8
This isn't even consistent behavior between browsers: IE encodes it as hello%D9, where D9 is Ù in CP1252/ISO-8859-1.
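The round trip can be illustrated with Python's urllib (a sketch; it reproduces only the percent-encoding layer, as the entity substitution itself is done by the browser before the URL is built):

```python
from urllib.parse import quote, unquote

# The browser first turns Ω into the character reference &#937;,
# then percent-encodes the reference's ASCII characters.
reference = "&#937;"
print("mytext=hello" + quote(reference))  # mytext=hello%26%23937%3B

# Decoding one layer gives back the entity, not the Omega itself.
print(unquote("%26%23937%3B"))            # &#937;
```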
I have an MVC Razor view
@{
    ViewBag.Title = "Index";
    var c = (char)146;
    var c2 = (short)'’';
}
<h2>@c --- @c2 --’-- ‘Why Oh Why’ & </h2>
@String.Format("hi {0} there", (char)146)
Characters stored in varchar fields in my database are not rendering in the browser.
This example demonstrates how character 146 doesn't show up.
How do I make them render?
[EDIT]
When I do this, character 146 gets converted to Unicode 8217, but if 146 is rendered directly in the browser it fails:
public ActionResult Index()
{
    using (var context = new DataContext())
    {
        var uuuuuggghhh = (from r in context.Projects
                           where r.bizId == "D11C6FD5-D084-43F0-A1EB-76FEED24A28F"
                           select r).FirstOrDefault();
        if (uuuuuggghhh != null)
        {
            var ca = uuuuuggghhh.projectSummaryTxt.ToCharArray();
            ViewData.Model = ca[72]; // this is the character in question
            return View();
        }
    }
    return View();
}
@Html.Raw(((char)146).ToString())
or
@Html.Raw(String.Format("hi {0} there", (char)146))
Both appear to work. I was testing this in Chrome and kept getting blank data; after viewing in Firefox I can confirm the representation was printing (however, 146 isn't a readable character).
This is confirmed with the readable character '¶' below:
@Html.Raw(((char)182).ToString())
Not sure why you would want this, though. But best of luck!
You do not want to use character 146. Character 146 is U+0092 PRIVATE USE TWO, an obscure and useless control character that typically renders as invisible, or a missing-glyph box/question mark.
If you want the character ’: that is U+2019 RIGHT SINGLE QUOTATION MARK, which may be written directly or using the character references &#8217; or &rsquo;.
146 is the byte number of the encoding of U+2019 into the Windows Western code page (cp1252), but it is not the Unicode character number. The bottom 256 Unicode characters are ordered the same as the bytes in the ISO-8859-1 encoding; ISO-8859-1 is similar to cp1252 but not the same.
Bytes 128–159 in cp1252 encode various typographical niceties like smart quotes, whereas bytes 128–159 in ISO-8859-1 (and hence characters 128–159 in Unicode) are seldom-used control characters. For web applications, you usually want to filter out the control characters (0–31 and 128–159 amongst a few others) as they come in, so they never get as far as the database.
If you are getting character 146 out of your database where you expect to have a smart quote, then you have corrupt data and you need to fix it up before continuing, or possibly you are reading the database using the wrong encoding (quite how this works depends what database you're talking to).
Now here's the trap. If you write &#146; as a character reference, the browser actually displays the smart quote U+2019 ’, and, confusingly, not the useless control character that actually owns that code point!
This is an old browser quirk: character references in the range &#128; to &#159; are converted to the character that maps to that number in cp1252, instead of the real character with that number.
This was arguably a bug, but the earliest browsers did it back before they grokked Unicode properly, and everyone else was forced to follow suit to avoid breaking pages. HTML5 now documents and sanctions this. (Though not in the XHTML serialisation; browsers in XHTML parsing mode won't do this because it's against the basic rules of XML.)
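The cp1252-versus-Unicode distinction described above can be checked directly; a Python sketch:

```python
# Byte 146 (0x92) decoded with the Windows Western code page gives the
# smart quote, while ISO-8859-1 gives the useless control character.
b = bytes([146])
print(repr(b.decode("cp1252")))   # '\u2019'  (RIGHT SINGLE QUOTATION MARK)
print(repr(b.decode("latin-1")))  # '\x92'    (U+0092, a control character)

# Repairing corrupt data that contains U+0092 where a smart quote was
# meant: re-encode as latin-1, then decode as cp1252.
corrupt = "hi \x92 there"
fixed = corrupt.encode("latin-1").decode("cp1252")
print(fixed)  # hi ’ there
```

The same latin-1/cp1252 round trip is a common fix-up for data that was read from the database with the wrong encoding, as the answer suggests may have happened here.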
We finally agreed that the data was corrupt; we have asked the users who can't see this character rendered to fix the source data.