Which character encoding is used by email clients to encode Japanese characters? - character-encoding

I'm analyzing the character sets used in MIME to combine multiple character sets.
For that I wrote a sample email:
This is sample test email 精巣日本 dsdsadsadsads
which automatically gets converted into:
This is sample test email &#31934;&#24035;&#26085;&#26412; dsdsadsadsads
I want to know which character set encoding is used to encode these characters.
Is it possible to use that character set encoding in C?
Email client: Postfix webmail

The purpose of MIME is to allow for support of arbitrary content types and encodings. As long as the content is adequately tagged in the MIME headers, you can use any encoding you see fit. There is no single right encoding for your use case; though in this day and age, the simplest solution by far is to use Unicode for everything.
In MIME terms, you'd use something like Content-Type: text/plain; charset="utf-8" and then correspondingly encode the body text. If you need the email to be 7-bit safe, you might use a quoted-printable or base64 Content-Transfer-Encoding on top, but any modern MIME library should take care of this detail for you.
The HTML entities you observed in your experiment are not suitable for plain-text emails, though they are a viable alternative for pure-HTML email. (If your webmail client used them in plaintext emails, it is buggy; it will only work if the sender and recipient both have the same bug.)
Traditionally, Japanese email messages would use one of the legacy Japanese encodings, like Shift_JIS or ISO-2022-JP. These have reasonable support for English, but generalize poorly to properly multilingual text (though ISO-2022 does somehow support it). With Unicode, by contrast, mixing Japanese with e.g. Farsi, Uzbek, and Turkish is straightforward and undramatic.
Using UTF-8 from C is easy and basically transparent. See e.g. http://utf8everywhere.org/ for some starting points.
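By way of illustration, here is a minimal sketch in Go (C works just as well; the MIME structure is identical, and Go's standard mime and mime/quotedprintable packages keep the example short). The header and body text are the sample from the question:

package main

import (
	"fmt"
	"mime"
	"mime/quotedprintable"
	"os"
)

func main() {
	// Non-ASCII header values need RFC 2047 encoded words;
	// mime.QEncoding produces the "=?utf-8?q?...?=" form.
	subject := "This is sample test email 精巣日本"
	fmt.Printf("Subject: %s\r\n", mime.QEncoding.Encode("utf-8", subject))
	fmt.Print("MIME-Version: 1.0\r\n")
	fmt.Print("Content-Type: text/plain; charset=\"utf-8\"\r\n")
	fmt.Print("Content-Transfer-Encoding: quoted-printable\r\n\r\n")

	// The body is UTF-8 made 7-bit safe by quoted-printable:
	// ASCII passes through, 精 becomes =E7=B2=BE, and so on.
	w := quotedprintable.NewWriter(os.Stdout)
	w.Write([]byte("This is sample test email 精巣日本 dsdsadsadsads\n"))
	w.Close()
}

Decoding on the receiving side is symmetric (mime.WordDecoder and quotedprintable.NewReader).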

Related

How to decode unexpected strings from users?

I've published an app, and I find that some of the comments look like this: РекамедÑ
I have googled a lot and cannot decode it so that the comment will not be shown this way. This is the way it is stored in the database; it could be Cyrillic, but I could not decode it as such either. Any clue on how to understand this kind of comment?
These appear to be doubly encoded HTML entities. So for example, & was turned to &amp;, and that was then again turned to &amp;amp;.
When decoding the data twice using this online tool (there are many others) the result is
РекамедÑ
That could be Unicode data, e.g. UTF-8 of a non-Western script like Cyrillic or Arabic, that
was misinterpreted as single-byte input, or
was garbled by a misguided "sanitization" method, possibly a call or two to PHP's htmlentities() (which incidentally assumes the single-byte ISO-8859-1 encoding by default in older versions, so a call to this function alone could be the whole source of the problem).
The fix will likely need to be on server side.
If you are using PHP, see UTF-8 all the way through for a handy guide.
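To make the repair concrete, here is a minimal Go sketch of both steps, assuming the failure mode described above: a (hypothetical) stored value that was entity-encoded twice, whose decoded form is UTF-8 that had been misread as Windows-1252. The unmangle helper is a made-up name for the sketch.

package main

import (
	"fmt"
	"html"
	"unicode/utf8"

	"golang.org/x/text/encoding/charmap"
)

// unmangle reverses the classic "UTF-8 bytes decoded as Windows-1252"
// mojibake: map each rune back to the single CP1252 byte it came from,
// then reinterpret the resulting byte sequence as UTF-8.
func unmangle(garbled string) string {
	raw, err := charmap.Windows1252.NewEncoder().String(garbled)
	if err != nil || !utf8.ValidString(raw) {
		return garbled // not that kind of mojibake; leave it alone
	}
	return raw
}

func main() {
	// Hypothetical stored value, entity-encoded twice on its way in.
	stored := "&amp;#208;&amp;#160;&amp;#208;&amp;#181;"

	once := html.UnescapeString(stored) // "&#208;&#160;&#208;&#181;"
	twice := html.UnescapeString(once)  // Ð, NBSP, Ð, µ: UTF-8 Cyrillic misread as CP1252

	fmt.Println(unmangle(twice)) // "Ре": the original Cyrillic
}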

Cross Platform Url Encoding for Query Strings

There are multiple classes and functions in different programming languages for encoding and decoding strings to be URL friendly. For example,
in Java:
URLEncoder.encode(String, String)
in PHP:
urlencode ( string $str )
and so on.
My question is: if I URL-encode a string in Java, can I expect the URL decoders in other languages to decode it back to the same original string?
I'm creating a service that needs to encode some Base64 value in a query string, and I have no idea whom I am serving.
Please consider that the only option I have here seems to be the query string. I can't use XML, JSON, or HTTP headers, since I need this to be in a URL to be redirected.
I looked around and there were some questions exactly like this, but none of them had a proper answer.
I would appreciate any insight or solutions.
EDIT:
For example, the PHP manual has this description:
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the RFC 3986 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
That sounds like it does not follow the RFC.
It sounds like URL encoders can use various algorithms in different programming languages,
so one should look up the encoding scheme for every function. For example, one of them could be
application/x-www-form-urlencoded
Looking into Java's URLEncoder:
Translates a string into application/x-www-form-urlencoded format using a specific encoding scheme. This method uses the supplied encoding scheme to obtain the bytes for unsafe characters.
Also looking into PHP's urlencode:
that is the same way as in application/x-www-form-urlencoded media type
So if you are looking for cross-platform URL encoding, you should tell your users what format your encoder uses.
This way, they can find the appropriate decoder, or otherwise implement their own.
After some investigation, application/x-www-form-urlencoded sounds like the most popular choice.
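To see how the two conventions diverge on exactly the characters a Base64 value contains, here is a small Go sketch (the payload bytes are arbitrary):

package main

import (
	"encoding/base64"
	"fmt"
	"net/url"
)

func main() {
	// A Base64 value that contains '+', '/', and '=':
	v := base64.StdEncoding.EncodeToString([]byte{0xfb, 0xff})
	fmt.Println(v) // +/8=

	// application/x-www-form-urlencoded style (java.net.URLEncoder,
	// PHP urlencode): every reserved character is escaped.
	fmt.Println(url.QueryEscape(v)) // %2B%2F8%3D

	// RFC 3986 style (PHP rawurlencode): '+' passes through, so a
	// form-urlencoded decoder on the other side would turn it into
	// a space and corrupt the value.
	fmt.Println(url.PathEscape(v)) // +%2F8=

	// The base64url alphabet avoids '+', '/', and '=' entirely,
	// sidestepping the mismatch altogether.
	fmt.Println(base64.RawURLEncoding.EncodeToString([]byte{0xfb, 0xff})) // -_8
}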

Go - failsafe charsets from emails

I have a bunch of emails that I decided to process in Go.
Go parses everything (headers, multipart) very well.
How do I convert all email text to UTF-8?
I read the encoding name from the Content-Type field and parse it with mime.ParseMediaType.
I believe some emails may have bugs in their encodings,
e.g. a wrong encoding, or multiple encodings in a single body.
So if there is a single wrong character but 99% of the text is readable, I wish to be able to read it.
PS
There are libs in Go to work with charsets: https://godoc.org/code.google.com/p/go.text/encoding
and a set of iconv wrappers like https://github.com/djimenez/iconv-go
I think the first lacks some encodings, and while it gives a decoder by encoding name, I am not sure that I know all the synonyms of the encodings.
E.g. UTF-8 and utf8 are the same encoding; Windows-1251 and CP-1251 are also the same.
The second is an iconv wrapper. Go is a secure language, and that is why I wish to do this in Go: there is no buffer overflow. But iconv is written in C and is less secure. I do
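One possible approach in pure Go: golang.org/x/net/html/charset resolves charset labels including the common aliases (utf8/UTF-8, cp1251/windows-1251, ...) and wraps the body in a UTF-8 reader; the underlying decoders replace undecodable bytes with U+FFFD instead of failing, which matches the "99% readable" requirement. A minimal sketch (decodeBody is a made-up name):

package main

import (
	"fmt"
	"io"
	"strings"

	"golang.org/x/net/html/charset"
)

// decodeBody converts a body labelled with the given charset name to
// UTF-8. charset.NewReaderLabel knows the usual aliases, and bytes
// that are invalid in the declared encoding come out as U+FFFD
// rather than aborting the whole message.
func decodeBody(label string, body io.Reader) (string, error) {
	r, err := charset.NewReaderLabel(label, body)
	if err != nil {
		return "", err // unknown charset label
	}
	b, err := io.ReadAll(r)
	return string(b), err
}

func main() {
	// "Привет" in windows-1251, announced under an alias:
	s, _ := decodeBody("cp1251", strings.NewReader("\xcf\xf0\xe8\xe2\xe5\xf2"))
	fmt.Println(s) // Привет
}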

Can urls have UTF-8 characters?

I was curious whether I should encode URLs with ASCII or UTF-8. I was under the impression that URLs cannot have non-ASCII characters, but someone told me they can have UTF-8, and I searched around and couldn't quite find which one is true. Does anyone know?
There are two parts to this, but they both amount to "yes".
With IDNA, it is possible to register domain names using the full Unicode repertoire (with a few minor twists to prevent ambiguities and abuse).
The path part is not strictly regulated, but it's possible to encode arbitrary strings in the path. The browser could opt to display a human-readable rendering rather than an encoded path. However, this requires heuristics, as there is no way to specify the character set and encoding of the path.
So, http://xn--msic-0ra.example/mot%C3%B6rhead is a (fictional example, not entirely correct) computer-readable encoded URL which could be displayed to the user as http://müsic.example/motörhead. The domain name is encoded as xn--msic-0ra.example in something called Punycode, and the path contains the label "motörhead" encoded as UTF-8 and URL encoded (the Unicode code point U+00F6 is represented with the two bytes 0xC3 0xB6 in UTF-8).
The path could also be mot%F6rhead which is the same label in Latin-1. In this case, deducing a reasonable human-readable representation would be much harder, but perhaps the context of the surrounding characters could offer enough hints for a good guess.
In isolation, %F6 could be pretty much anything, and %C3%B6 could be e.g. UTF-16.
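Both halves are easy to exercise from code; here is a small Go sketch using golang.org/x/net/idna for the host and net/url for the path (the host and path values are the fictional ones from above):

package main

import (
	"fmt"
	"net/url"

	"golang.org/x/net/idna"
)

func main() {
	// Host: Unicode labels travel as Punycode ("xn--...").
	host, err := idna.ToASCII("müsic.example")
	fmt.Println(host, err) // xn--msic-0ra.example <nil>

	back, _ := idna.ToUnicode(host)
	fmt.Println(back) // müsic.example

	// Path: arbitrary strings are UTF-8 encoded, then percent-escaped.
	fmt.Println(url.PathEscape("motörhead")) // mot%C3%B6rhead
}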

Character Encoding and the ’ Issue

Even today, one sees character encoding problems with significant frequency. Take for example this recent job post:
(Note: This is an example, not a spam job post... :-)
I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.
My two-part question:
What causes this particular, common encoding issue?
As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
What causes this particular, common encoding issue?
This will occur when the conversion between characters and bytes has taken place using the wrong charset. Computers handle data as bytes, but to represent the data in a sensible manner to humans, it has to be converted to characters (strings). This conversion takes place based on a charset, of which there are many different ones.
In the particular ’ example, this is a typical CP1252 representation of the Unicode character RIGHT SINGLE QUOTATION MARK (U+2019) ’ that was written as UTF-8 but read back as CP1252. In UTF-8, that character consists of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 codepage layout, you'll see that those bytes represent exactly the characters â, € and ™.
This can be caused by the website not having read in the original source properly (it should have used UTF-8 for this), or by it displaying a UTF-8 page with a wrong charset=CP1252 attribute in the Content-Type response header (or with the attribute missing, in which case the default charset, CP1252 on Windows machines, would be used).
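The byte-level claim can be reproduced in a few lines; here is a Go sketch that decodes the UTF-8 bytes of U+2019 as if they were CP1252 (golang.org/x/text/encoding/charmap supplies the codepage):

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// The UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK:
	b := []byte("\u2019") // 0xE2 0x80 0x99

	// Decode those bytes as if they were CP1252:
	wrong, _ := charmap.Windows1252.NewDecoder().Bytes(b)
	fmt.Println(string(wrong)) // â€™ (0xE2 = â, 0x80 = €, 0x99 = ™)
}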
As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, a URL, a network socket, etc.) using a known and predefined charset. Then ensure that you're consistently storing, writing and sending them using a Unicode charset, preferably UTF-8.
If you're familiar with Java (your question history confirms this), you may find this article useful.
