I've been working with a Japanese company that chooses to encode our files in EUC-JP.
I've been curious for quite a while now and have tried asking my superiors why EUC-JP over Shift-JIS or UTF-8, but I only get answers like "it's convention" or similar.
Do you know why the initial coders might have chosen EUC-JP over other character encodings?
Unlike Shift-JIS, EUC-JP is ASCII-safe: any byte whose eighth bit is zero is ASCII. It was also historically popular on Unix variants. Either of these things could have been an important factor a long time ago, before UTF-8 was generally adopted. Check the Wikipedia article for more details.
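To make the ASCII-safety point concrete, here is a small illustration in Elixir (the byte values are written out by hand; 表 is the classic example): in Shift-JIS, 表 encodes as 0x95 0x5C, and 0x5C is the ASCII backslash, so byte-oriented tools that don't track multibyte state can misparse the text. In EUC-JP the same character is 0xC9 0xBD, and every non-ASCII byte has the eighth bit set.

sjis  = <<0x95, 0x5C>>    # 表 in Shift-JIS; the second byte collides with ASCII "\"
eucjp = <<0xC9, 0xBD>>    # 表 in EUC-JP; both bytes are in the high range

# In EUC-JP any byte below 0x80 really is ASCII; in Shift-JIS it may not be.
Enum.any?(:binary.bin_to_list(sjis),  &(&1 < 0x80))   #=> true
Enum.any?(:binary.bin_to_list(eucjp), &(&1 < 0x80))   #=> false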
Why does String.to_atom hardcode the encoding option to :utf8?
https://github.com/elixir-lang/elixir/blob/d6bb3342b7ea8b921b3d4b69f65064c4158c99d7/lib/elixir/lib/string.ex#L1927
def to_atom(string) do
  :erlang.binary_to_atom(string, :utf8)
end
The available encoding options for Erlang's binary_to_atom are:
latin1 | unicode | utf8
http://erlang.org/documentation/doc-8.0-rc1/erts-8.0/doc/html/erlang.html#binary_to_atom-2
TL;DR
Because the Erlang universe is finally settling on UTF-8 everywhere.
Discussion
latin1 is going away (and its repertoire is covered by Unicode anyway), unicode is an old alias for utf8, and that leaves us with just one universally applicable option: utf8. This is important since UTF-8 atoms (and strings) are the way forward within Erlang and also within Elixir.
If you are dealing with old data in non-UTF-8 encodings, then convert it before your call to binary_to_atom/2.
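For example, a minimal sketch in Elixir, assuming the incoming binary is known to be Latin-1: :unicode.characters_to_binary/3 does the conversion, and the result can then be handed to binary_to_atom/2 (or String.to_atom/1).

latin1 = <<?J, ?o, ?s, 0xE9>>                               # "José" encoded as Latin-1
utf8   = :unicode.characters_to_binary(latin1, :latin1, :utf8)
:erlang.binary_to_atom(utf8, :utf8)                         #=> :José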
This also falls in line with the newer string and unicode module changes in Erlang's standard library -- which can finally settle on UTF-8 as a generally accepted standard after decades of uncertainty (because encodings are hard and there was not much agreement about this when Erlang was invented).
A word on coding practice
I work in Japan handling mostly business data, some of it quite old, and some of it in really crazy encodings. I tend to code mostly in Erlang (I prefer tiny languages). When some of the older string handling functions and the unicode module were written, strings fell into two categories:
A list of code points in ASCII (which was implicitly extended to encompass latin1 quite a bit of the time because, well, European languages were a common use case and CJK was a wild mess back then)
Some waking nightmare of dragonfire and frost zombies (because there was zero agreement about anything else and a gazillion radically incomplete, half-baked, technically inaccurate "standards")
Times have changed. Now we know that strings are nearly always going to be in UTF-8, and everything in the Unixverse has finally settled on this, which has had the pleasant effect of having (pretty much) every other meaningful system settle on it as well (if not internally, then through robust detection libraries that can pick between UTF-16 and UTF-8).
In the cases where you actually do have non-UTF-8 data, you know this to be the case and should convert your data before sending it to a universal function such as binary_to_atom/2. I actually think we should next shift to including a binary_to_atom/1 and phase out binary_to_atom/2 entirely -- which is what has already happened with list_to_atom/1 as of Erlang/OTP 20 (yay!).
So how does that affect your code?
When you start dealing in ancient encodings the complexity of your code suddenly explodes and that needs to be contained right away lest it infect your entire codebase with insanity. The best way to do this is to keep the crazy outside of your business system proper and do conversions out at the edges. Whenever we deal in old data that comes in crazy encodings we already know and are prepared for that -- so we convert to UTF-8 explicitly right up front, so there isn't anything left to encounter later on deeper in the system.
You might think, "Why don't they just detect the encoding of every string?" Alas, there is no proper way to detect string encodings. It is just not possible with a high degree of confidence. It is also quickly becoming an obsolete task, as the vast majority of data generated today is UTF-8 (or UTF-16, but it is very rare to encounter this over the wire).
There is a public project called Moby containing several word lists. Some files contain European alphabet symbols and were created in pre-Unicode times. The readme, dated 1993, reads:
"Foreign words commonly used in English usually include their
diacritical marks, for example, the acute accent e is denoted by ASCII
142."
Wikipedia says that the last ASCII symbol has number 127.
For example, this file: http://www.gutenberg.org/files/3203/files/mobypos.txt contains symbols that I couldn't read in any of various Latin encodings. (There are plenty of such symbols at the very end of the section of words beginning with B, just before the letter C.)
Could someone please advise what encoding should be used for reading this file, or how it can be converted to some readable modern encoding?
A little research suggests that the encoding for this page is Mac OS Roman, which has é at position 142. Viewing the page you linked and changing the encoding (in Chrome, View → Encoding → Western (Macintosh)) seems to display all the words correctly (the page is incorrectly served as ISO-8859-1).
How you deal with this depends on the language / tools you are using. Here’s an example of how you could convert into UTF-8 with Ruby:
require 'open-uri'

# Fetch the raw bytes, tell Ruby they are Mac OS Roman, then transcode to UTF-8.
# (On Ruby 3+, use URI.open; bare open no longer accepts URLs.)
s = URI.open('http://www.gutenberg.org/files/3203/files/mobypos.txt').read
s.force_encoding('macroman')
s.encode!('utf-8')
You are right in that ASCII only goes up to position 127 (it’s a 7-bit encoding), but there are a large number of 8 bit encodings that are supersets of ASCII and people sometimes refer to those as “Extended ASCII”. It appears that whoever wrote the readme you refer to didn’t know about the variety of encodings and thought the one he happened to be using at the time was universal.
There isn’t a general solution to problems like this, as there is no guaranteed way to determine the encoding of some text from the text itself. In this case I just used Wikipedia to look through a few until I found one that matched. Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a good place to start reading about character sets and encodings if you want to learn more.
Even today, one sees character encoding problems with significant frequency. Take for example this recent job post:
(Note: This is an example, not a spam job post... :-)
I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.
My two-part question:
What causes this particular, common encoding issue?
As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
What causes this particular, common encoding issue?
This will occur when the conversion between characters and bytes has taken place using the wrong charset. Computers handle data as bytes, but to represent the data in a sensible manner to humans, it has to be converted to characters (strings). This conversion takes place based on a charset, of which there are many different ones.
In the particular ’ example, this is what you get when the Unicode character RIGHT SINGLE QUOTATION MARK (U+2019), ’, is written as UTF-8 but read as CP1252. In UTF-8, that character consists of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 code page layout, you'll see that those bytes represent exactly the characters â, € and ™.
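You can watch that happen from an Elixir/Erlang shell (a sketch; the CP1252 mappings are written out as comments, since the VM has no built-in CP1252 codec):

# "’" (U+2019) really is those three bytes in UTF-8:
<<0xE2, 0x80, 0x99>> = "’"

# Looked up in the CP1252 table: 0xE2 → "â" (U+00E2), 0x80 → "€" (U+20AC),
# 0x99 → "™" (U+2122). Rendering those three code points reproduces the garbage:
IO.puts(<<0xE2::utf8, 0x20AC::utf8, 0x2122::utf8>>)         # prints â€™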
This can be caused by the website not having read in the original source properly (it should have been read as UTF-8), or by it serving a UTF-8 page with a wrong charset=CP1252 attribute in the Content-Type response header (or with the attribute missing, in which case clients on Windows machines would typically default to CP1252).
As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, a URL, a network socket, etc.) using a known and predefined charset. Then ensure that you consistently store, write and send them using a Unicode charset, preferably UTF-8.
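As a sketch of that (in Elixir, with a hypothetical file name, and assuming the input is known to be Latin-1): read the raw bytes, convert once at the boundary, and keep everything downstream in UTF-8.

# "legacy.txt" is a placeholder; the point is that its encoding is known up front.
raw  = File.read!("legacy.txt")
utf8 = :unicode.characters_to_binary(raw, :latin1, :utf8)
File.write!("legacy-utf8.txt", utf8)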
If you're familiar with Java (your question history confirms this), you may find this article useful.
I am trying to convert UTF-8 string into UCS-2 string.
I need to get string like "\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875".
I have googled for about a month now, but still there is no reference about converting UTF-8 to UCS-2.
Please someone help me.
Thx in advance.
EDIT: okay, maybe my explanation was not good enough. Here is what I am trying to do.
I live in Korea, and I am trying to send an SMS message using CTMessageCenter. I tried to send Simplified Chinese characters through my app, and I get ???? instead of the proper characters. So I tried UTF-8, UTF-16, BE and LE as well, but they all return ??. Finally I found out that SMS uses UCS-2 and EUC-KR encoding in Korea. Weird, isn't it?
Anyway I tried to send string like \u4E3B\u9875 and it worked.
So I need to convert string into UCS-2 encoding first and get the string literal from those strings.
Wikipedia:
The older UCS-2 (2-byte Universal Character Set) is a similar
character encoding that was superseded by UTF-16 in version 2.0 of the
Unicode standard in July 1996. It produces a fixed-length format
by simply using the code point as the 16-bit code unit and produces
exactly the same result as UTF-16 for 96.9% of all the code points in
the range 0-0xFFFF, including all characters that had been assigned a
value at that time.
IBM:
Since the UCS-2 standard is limited to 65,535 characters, and the data
processing industry needs over 94,000 characters, the UCS-2 standard
is in the process of being superseded by the Unicode UTF-16 standard.
However, because UTF-16 is a superset of the existing UCS-2 standard,
you can develop your applications using the system's existing UCS-2
support as long as your applications treat the UCS-2 as if it were
UTF-16.
unicode.org:
UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2
are identical for purposes of data exchange. Both are 16-bit, and have
exactly the same code unit representation.
So, using the "UTF8toUnicode" transformation in most language libraries will produce UTF-16, which is essentially UCS-2. And simply extracting the 16-bit characters from an Objective-C string will accomplish the same thing.
In other words, the solution has been staring you in the face all along.
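In Elixir/Erlang terms the conversion looks like the sketch below (the same idea maps onto NSString, which already exposes UTF-16 code units): for BMP text the UTF-16BE bytes are exactly the UCS-2 bytes, and the \uXXXX literal form can be built straight from the code points.

text = "主页"                                               # U+4E3B, U+9875
:unicode.characters_to_binary(text, :utf8, {:utf16, :big})
#=> <<0x4E, 0x3B, 0x98, 0x75>>   (identical to UCS-2 for BMP characters)

# Building the "\uXXXX" escape form from the code points:
Enum.map_join(String.to_charlist(text), fn cp ->
  "\\u" <> String.pad_leading(Integer.to_string(cp, 16), 4, "0")
end)
#=> "\\u4E3B\\u9875"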
UCS-2 is not a valid Unicode encoding. UTF-8 is.
It is therefore impossible to convert UTF-8 into UCS-2 — and indeed, also the reverse.
UCS-2 is dead, ancient history. Let it rot in peace.
I know the web is mostly standardizing towards UTF-8 lately and I was just wondering if there was any place where using UTF-8 would be a bad thing. I've heard the argument that UTF-8, 16, etc may use more space but in the end it has been negligible.
Also, what about in Windows programs, Linux shell and things of that nature -- can you safely use UTF-8 there?
If UTF-32 is available, prefer that over the other versions for processing.
If your platform supports UTF-32/UCS-4 Unicode natively, then the "compressed" versions UTF-8 and UTF-16 may be slower, because they use varying numbers of bytes for each character (character sequences), which makes it impossible to do a direct lookup into a string by index, while UTF-32 uses a flat 32 bits for each character, speeding up some string operations a lot.
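As a sketch of the indexing point (Elixir again, via Erlang's :unicode module): in a UTF-32 binary the Nth code point always starts at byte 4*N, so it can be picked out with constant-offset binary matching, something UTF-8 cannot offer without scanning.

s     = "naïve string"
utf32 = :unicode.characters_to_binary(s, :utf8, {:utf32, :big})

# Code point at index 2 ("ï"): skip 2 * 4 bytes, read 32 bits. No scanning needed.
<<_skip::binary-size(8), cp::32, _rest::binary>> = utf32
cp == ?ï                                                    #=> true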
Of course, if you are programming in a very restricted environment like, say, embedded systems, and can be certain there will only ever be ASCII or ISO 8859-x characters around, then you can choose those charsets for efficiency and speed. But in general, stick with the Unicode Transformation Formats.
When you need to write a program (performing string manipulations) that needs to be very, very fast and you are sure that you won't need exotic characters, maybe UTF-8 is not the best idea. In every other situation, UTF-8 should be the standard.
UTF-8 works well with almost all recent software, even on Windows.
It is well known that UTF-8 works best for file storage and network transport, but people debate whether UTF-16/32 are better for processing. One major argument is that UTF-16 is still variable-length, and even UTF-32 still does not give you one code unit per user-perceived character, so how are they better than UTF-8? My opinion is that UTF-16 is a very good compromise.
First, characters outside the BMP, which need a surrogate pair (two code units) in UTF-16, are extremely rarely used. The Chinese characters (and some other Asian characters) in that range are basically dead ones; ordinary people won't use them at all, except for experts digitizing ancient books. So UTF-32 will be a waste most of the time. Don't worry too much about those characters, as they won't make your software look bad if you don't handle them properly, as long as your software is not for those special users.
Second, often we need the string memory allocation to be related to character count, e.g. a database string column for 10 characters (assuming we store the Unicode string in normalized form), which will be 20 bytes for UTF-16. In most cases it will work just like that, except in extreme cases it will hold only 5-8 characters. But for UTF-8, the common byte length of one character is 1-3 for Western languages and 3-4 for Asian languages, which means we need 10-40 bytes even for the common cases. More data, more processing.
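To put rough numbers on that, here is a small comparison sketch (Elixir, arbitrary sample strings):

latin = "encyclopedia"      # 12 Latin characters
han   = "百科全书"           # 4 Han characters
for s <- [latin, han] do
  utf16 = :unicode.characters_to_binary(s, :utf8, {:utf16, :big})
  {String.length(s), byte_size(s), byte_size(utf16)}
end
#=> [{12, 12, 24}, {4, 12, 8}]   # {characters, UTF-8 bytes, UTF-16 bytes}

The Latin string doubles in size under UTF-16, while the Han string shrinks from 12 UTF-8 bytes to 8, which is exactly the trade-off described above.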