Go - failsafe charsets from emails - character-encoding

I have a bunch of emails that I decided to process in Go.
Go parses everything (headers, multipart) very well.
How do I convert the text of all emails to UTF-8?
I read the encoding name from the Content-Type field and parse it with mime.ParseMediaType.
I believe some emails may have bugs in their encodings,
e.g. a wrong encoding declared or multiple encodings in a single body.
So if a single character is wrong but 99% of the text is readable, I still want to be able to read it.
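A minimal sketch of the two steps described above, using only the standard library: read the charset parameter with mime.ParseMediaType, then keep the readable part of a body that is supposed to be UTF-8. The helper names charsetOf and tolerantUTF8 are made up for illustration, and bodies declared in any other charset would still need a real decoder before this cleanup step.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetOf returns the lowercased charset parameter of a Content-Type
// header value, or "" when the header is malformed or names no charset.
func charsetOf(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

// tolerantUTF8 keeps every valid UTF-8 sequence in body and replaces each
// run of invalid bytes with U+FFFD, so one bad byte does not lose the text.
func tolerantUTF8(body []byte) string {
	return strings.ToValidUTF8(string(body), "\uFFFD")
}

func main() {
	fmt.Println(charsetOf(`text/plain; charset="UTF-8"`))
	fmt.Printf("%q\n", tolerantUTF8([]byte("caf\xc3\xa9 \xff ok")))
}
```

Replacing bad bytes rather than failing is what makes this "failsafe": the 99% of the text that is valid survives unchanged.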
PS
There are libs in Go for working with charsets: https://godoc.org/code.google.com/p/go.text/encoding
and a set of iconv wrappers like https://github.com/djimenez/iconv-go
I think the first lacks encodings, and it does not give a decoder by encoding name. I am not sure that I know all the synonyms of encodings,
e.g. UTF-8 and utf8 are the same encoding. Windows-1251 and CP-1251 are also the same.
The second is an iconv wrapper. Go is a secure language, and that is why I wish to do this in Go: there are no buffer overflows. But iconv is written in C and is less secure. I do
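The synonym problem mentioned above (UTF-8 vs utf8, Windows-1251 vs CP-1251) can be sketched as a normalization step before any decoder lookup. The alias table and the name canonicalCharset below are hypothetical and deliberately tiny; a real table would need many more entries.

```go
package main

import (
	"fmt"
	"strings"
)

// aliases is a small, hypothetical table from a normalized spelling
// (lowercase, no '-', '_' or spaces) to one canonical charset name.
var aliases = map[string]string{
	"utf8":        "utf-8",
	"cp1251":      "windows-1251",
	"windows1251": "windows-1251",
	"latin1":      "iso-8859-1",
	"iso88591":    "iso-8859-1",
}

// canonicalCharset normalizes name and looks it up in aliases. Unknown
// names come back lowercased and trimmed, with ok == false.
func canonicalCharset(name string) (canonical string, ok bool) {
	trimmed := strings.ToLower(strings.TrimSpace(name))
	key := strings.NewReplacer("-", "", "_", "", " ", "").Replace(trimmed)
	if c, found := aliases[key]; found {
		return c, true
	}
	return trimmed, false
}

func main() {
	for _, n := range []string{"UTF-8", "utf8", "Windows-1251", "CP-1251", "koi8-r"} {
		c, ok := canonicalCharset(n)
		fmt.Println(n, "->", c, ok)
	}
}
```

Returning ok == false for unknown names lets the caller decide whether to fall back to a default or skip the message.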

Related

How to decode unexpected strings from users?

I've published an app, and I find some of the comments to be like this: РекамедÑ
I have googled a lot, but I cannot decode it so that the comment is no longer shown this way. This is how it is stored in the database; it may be Cyrillic, but I could not decode it that way either. Any clue on how to make sense of this kind of comment?
These appear to be doubly encoded HTML entities. So for example, & was turned into &amp; and that was then again turned into &amp;amp;
When decoding the data twice using this online tool (there are many others) the result is
РекамедÑ
That could be Unicode data, e.g. UTF-8 in a non-western character set like Cyrillic or Arabic, that
was misinterpreted as single-byte input
was garbled by a misguided "sanitation" method, possibly a call or two to PHP's htmlentities() (which incidentally assumes the single-byte ISO-8859-1 encoding by default in older versions, so a call to this function could be the whole source of the problem).
The fix will likely need to be on server side.
If you are using PHP, see UTF-8 all the way through for a handy guide.
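The "decode twice" step described above can be sketched in Go with html.UnescapeString from the standard library. The helper unescapeRepeatedly and its pass limit are assumptions made for illustration, not part of the original answer.

```go
package main

import (
	"fmt"
	"html"
)

// unescapeRepeatedly applies html.UnescapeString until the text stops
// changing or maxPasses passes have run.
func unescapeRepeatedly(s string, maxPasses int) string {
	for i := 0; i < maxPasses; i++ {
		next := html.UnescapeString(s)
		if next == s {
			break
		}
		s = next
	}
	return s
}

func main() {
	// "&amp;#1056;" is "&#1056;" escaped once more; two passes recover the letter.
	fmt.Println(unescapeRepeatedly("&amp;#1056;&amp;#1077;", 2))
}
```

This only repairs data that is already stored wrongly. Text that legitimately contains a literal "&amp;" would be over-decoded, which is one more reason to fix the double encoding at its source.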

Handling UTF-8 Character with Latin1 db encoding

I keep getting an exception that ActiveRecord::StatementInvalid: PG::UntranslatableCharacter: ERROR: character with byte sequence 0xe2 0x80 0x99 in encoding "UTF8" has no equivalent in encoding "LATIN1". I did some checking and it looks like it is the backtick or apostrophe. What is the best way to handle this? Just strip out the character or convert the whole db to UTF-8? If it is converting to UTF-8 how can I do that permanently as it always seems to revert if you do it in the shell?
I don't understand what you mean by "revert, if done in the shell", but: you seem to have an application where some parts (at least the database) use the LATIN1 encoding and one part (your Rails app) uses UTF-8. IMO, it is best to have everything in Unicode, but to what extent a conversion makes sense cannot be said in general. For example, if your database is also processed by other tools, and those expect Latin1, a conversion is not sensible.
In any case, you need to define a clear border between where you use which encoding, and handle conversion at this border. This applies not only to the database, but also, for example, to the HTML pages you are generating (hopefully UTF-8), to files uploaded by users and processed by your application, and so on.
If you convert to an encoding in which certain characters cannot be represented, as is the case here, you have only three choices:
Reject the data (it must have been generated somewhere, perhaps as user input in a web form).
Simply remove the offending characters.
Replace the offending characters with a placeholder (for instance, a question mark).
None of these options is very pleasant, but if converting your database to UTF-8 is no option, you should deal with this problem at the point where the problem string is generated, and not when it is written into the database.
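The three choices (reject, remove, replace) can be sketched in Go as follows. This is an illustration only: the policy type and the toLatin1 function are hypothetical names, and the conversion relies on Latin-1 (ISO-8859-1) covering exactly the code points U+0000 to U+00FF.

```go
package main

import (
	"errors"
	"fmt"
)

// policy says what to do with a character that Latin-1 cannot represent.
type policy int

const (
	reject  policy = iota // fail the whole conversion
	strip                 // silently drop the character
	replace               // substitute a question mark
)

var errUnrepresentable = errors.New("character has no Latin-1 equivalent")

// toLatin1 converts a UTF-8 string to Latin-1 bytes under the given policy.
// Invalid UTF-8 input decodes as U+FFFD and is treated as unrepresentable.
func toLatin1(s string, p policy) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r <= 0xFF {
			out = append(out, byte(r))
			continue
		}
		switch p {
		case reject:
			return nil, fmt.Errorf("%w: %q", errUnrepresentable, r)
		case strip:
			// drop the character
		case replace:
			out = append(out, '?')
		}
	}
	return out, nil
}

func main() {
	// U+2019 is the character behind the bytes 0xe2 0x80 0x99 in the error message.
	for _, p := range []policy{reject, strip, replace} {
		b, err := toLatin1("it\u2019s", p)
		fmt.Printf("%q %v\n", b, err)
	}
}
```

Running this at the point where the string is generated, as suggested above, keeps the decision about lossy conversion out of the database layer.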

Which character encoding is used by email client to encode Japanese characters?

I'm analyzing the character sets used in MIME in order to combine multiple character sets.
For that I wrote a sample email:
This is sample test email 精巣日本 dsdsadsadsads
which automatically gets converted into:
This is sample test email &#31934;&#24035;&#26085;&#26412; dsdsadsadsads
I want to know which character set encoding is used to encode these characters.
Is it possible to use that character set encoding in C?
Email client: Postfix webmail
The purpose of MIME is to allow for support of arbitrary content types and encodings. As long as the content is adequately tagged in the MIME headers, you can use any encoding you see fit. There is no single right encoding for your use case; though in this day and age, the simplest solution by far is to use Unicode for everything.
In MIME terms, you'd use something like Content-Type: text/plain; charset="utf-8" and then encode the body text correspondingly. If you need the email to be 7-bit safe, you might use a quoted-printable or base64 content-transfer encoding on top, but any modern MIME library should take care of this detail for you.
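As a rough illustration of the headers and the 7-bit-safe body described above, here is a sketch in Go using the standard mime and mime/quotedprintable packages. The helper names encodeQP and decodeQP are made up for this example; a real mail library would normally do this for you.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime"
	"mime/quotedprintable"
)

// encodeQP returns text in quoted-printable form, which is 7-bit safe.
func encodeQP(text string) (string, error) {
	var buf bytes.Buffer
	w := quotedprintable.NewWriter(&buf)
	if _, err := w.Write([]byte(text)); err != nil {
		return "", err
	}
	if err := w.Close(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// decodeQP reverses encodeQP.
func decodeQP(encoded string) (string, error) {
	b, err := io.ReadAll(quotedprintable.NewReader(bytes.NewBufferString(encoded)))
	return string(b), err
}

func main() {
	encoded, err := encodeQP("This is sample test email 精巣日本")
	if err != nil {
		panic(err)
	}
	contentType := mime.FormatMediaType("text/plain", map[string]string{"charset": "utf-8"})
	fmt.Println("Content-Type: " + contentType)
	fmt.Println("Content-Transfer-Encoding: quoted-printable")
	fmt.Println()
	fmt.Println(encoded)
}
```

The body stays UTF-8 throughout; quoted-printable only changes how those bytes travel over a 7-bit channel.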
The HTML entities you observed in your experiment are not suitable for plain-text emails, though they are a viable alternative for pure-HTML email. (If your webmail client used them in plaintext emails, it is buggy; it will only work if the sender and recipient both have the same bug.)
Traditionally, Japanese email messages would use one of the legacy Japanese encodings, like Shift_JIS or ISO-2022-JP. These have reasonable support for English, but generalize poorly to properly multilingual text (though ISO-2022 does somehow support it). With Unicode, by contrast, mixing Japanese with e.g. Farsi, Uzbek, and Turkish is straightforward and undramatic.
Using UTF-8 from C is easy and basically transparent. See e.g. http://utf8everywhere.org/ for some starting points.

What is the usefulness of mb_http_output() given that the output encoding is typically fixed by other means?

All over the Internet, including on Stack Overflow, it is suggested to use mb_http_input('utf-8') to make PHP work in the UTF-8 encoding. For example, see PHP/MySQL encoding problems. � instead of certain characters. On the other hand, the PHP manual says that we cannot fix the input encoding within the PHP script and that mb_http_input is only a way to query what it is, not a way to set it. See http://www.php.net/manual/en/mbstring.http.php and http://php.net/manual/en/function.mb-http-input.php . OK, this was just a clarification of the context before the question.
It seems to me that there are a lot of redundant commands in Apache + PHP + HTML to control the conversion from the input encoding to the internal encoding and finally to the output encoding. I don't understand the usefulness of this. For example, if the original input encoding from some external HTTP client is EUC-JP and I set the internal encoding to UTF-8, then PHP would have to make the conversion. Am I right? If I am right, why would I set an input encoding in php.ini (instead of just passing the original one through), given that it would immediately be converted to the UTF-8 internal encoding anyway?
A similar question holds for the output. In all my HTML files, I use a meta tag with charset=utf-8, so the output HTTP encoding is fixed. Moreover, in php.ini, I can set the default_charset that will appear in the HTTP header to utf-8. Why would I bother to use mb_http_output('utf-8') when the final output encoding is already fixed?
To sum up, can someone give me a practical, concrete example where mb_http_output('utf-8') is clearly necessary and cannot be replaced by the more usual commands that editors such as Dreamweaver often insert by default?
These two options are just about the worst idea the PHP designers ever had, and they had plenty of bad ideas when it comes to encodings.
To convert strings to a specific encoding, one has to know what encoding one is converting from. Incoming data is often in an undeclared encoding; the server just receives some binary data, it doesn't know what encoding it represents. You should declare what encoding you expect the browser to send by setting the accept-charset attribute on forms; doing that is no guarantee that the browser will do so and it doesn't make PHP know what encoding to expect though.
The same goes for output; PHP strings are just byte arrays, they do not have an associated encoding. I have no idea how PHP thinks it knows how to convert arbitrary strings to a specific encoding upon input or output.
You should handle this manually, and it's really easy to do anyway: declare to clients what encoding you expect, check whether input is in the correct encoding using mb_check_encoding (not _detect encoding or some such, just check), reject invalid input, take care to keep everything in the same encoding within the whole application flow. I.e., ideally you have no conversion whatsoever in your app.
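The "check, don't detect" idea translates to other languages as well. As a sketch in Go (the answer itself is about PHP), the counterpart of checking with mb_check_encoding and rejecting bad input could look like this, with acceptUTF8 being a hypothetical helper name:

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

var errNotUTF8 = errors.New("input is not valid UTF-8")

// acceptUTF8 returns input as a string when it is valid UTF-8 and an
// error otherwise. It never guesses at another encoding.
func acceptUTF8(input []byte) (string, error) {
	if !utf8.Valid(input) {
		return "", errNotUTF8
	}
	return string(input), nil
}

func main() {
	fmt.Println(acceptUTF8([]byte("日本")))     // valid UTF-8
	fmt.Println(acceptUTF8([]byte{'c', 0xE9})) // 0xE9 alone is Latin-1, not UTF-8
}
```

Rejecting at the boundary means the rest of the application can assume one encoding and needs no conversion at all.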
If you do need to convert at any point, make it a Unicode sandwich: convert input from the expected encoding to UTF-8 or another Unicode encoding on input, convert it back to desired output encoding upon output. Whenever you need to convert, make sure you know what you're converting from. You cannot magically "make all strings UTF-8" with one declaration.

Character Encoding and the ’ Issue

Even today, one frequently sees character encoding problems. Take for example this recent job post:
(Note: This is an example, not a spam job post... :-)
I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.
My two-part question:
What causes this particular, common encoding issue?
As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
What causes this particular, common encoding issue?
This will occur when the conversion between characters and bytes has taken place using the wrong charset. Computers handle data as bytes, but to present the data to humans in a sensible manner, it has to be converted to characters (strings). This conversion takes place based on a charset, of which there are many different ones.
In the particular ’ example, this is what the Unicode character 'RIGHT SINGLE QUOTATION MARK' (U+2019) ’ looks like when its UTF-8 bytes are read as CP1252. In UTF-8, that character consists of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 codepage layout, you'll see that those bytes represent exactly the characters â, € and ™.
This can be caused by the website not having read in the original source properly (it should have used UTF-8 for this), or by it displaying a UTF-8 page with the wrong charset=CP1252 attribute in the Content-Type response header (or the attribute is missing; on Windows machines the default charset of CP1252 would then be used).
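The byte-level explanation above can be reproduced in a few lines of Go. The sketch below is standard library only and embeds the CP1252 assignments for bytes 0x80 to 0x9F; the function names decodeCP1252, cp1252Byte and repairMojibake are invented for illustration. It first produces the garbled text by reading UTF-8 bytes as CP1252, then reverses the mistake.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// cp1252High maps bytes 0x80..0x9F to the characters CP1252 assigns them.
// A zero entry marks a byte that CP1252 leaves undefined.
var cp1252High = [32]rune{
	0x20AC, 0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
	0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017D, 0,
	0, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
	0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0, 0x017E, 0x0178,
}

// decodeCP1252 interprets raw bytes as CP1252 text.
// Undefined bytes become U+FFFD.
func decodeCP1252(raw []byte) string {
	runes := make([]rune, 0, len(raw))
	for _, b := range raw {
		r := rune(b)
		if b >= 0x80 && b <= 0x9F {
			r = cp1252High[b-0x80]
			if r == 0 {
				r = utf8.RuneError
			}
		}
		runes = append(runes, r)
	}
	return string(runes)
}

// cp1252Byte returns the CP1252 byte for r, if one exists.
func cp1252Byte(r rune) (byte, bool) {
	if r < 0x80 || (r >= 0xA0 && r <= 0xFF) {
		return byte(r), true
	}
	for i, c := range cp1252High {
		if c != 0 && c == r {
			return byte(0x80 + i), true
		}
	}
	return 0, false
}

// repairMojibake re-encodes s as CP1252 bytes and reads them as UTF-8.
// ok is false when s has characters outside CP1252 or the bytes are not
// valid UTF-8; s is then returned unchanged.
func repairMojibake(s string) (fixed string, ok bool) {
	raw := make([]byte, 0, len(s))
	for _, r := range s {
		b, found := cp1252Byte(r)
		if !found {
			return s, false
		}
		raw = append(raw, b)
	}
	if !utf8.Valid(raw) {
		return s, false
	}
	return string(raw), true
}

func main() {
	garbled := decodeCP1252([]byte("\u2019")) // bytes E2 80 99 read as CP1252
	fixed, ok := repairMojibake(garbled)
	fmt.Println(garbled, fixed, ok)
}
```

Such a repair is a last resort for data that is already damaged; it cannot recover bytes that were dropped or replaced along the way.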
As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, a URL, a network socket, etc.) using a known and predefined charset. Then, ensure that you consistently store, write and send them using a Unicode charset, preferably UTF-8.
If you're familiar with Java (your question history confirms this), you may find this article useful.

Resources