I was wondering: is there an article that discusses the rules for the character encodings UTF-8 and ISO-8859-1?
Can someone also point me to the rules of other character encodings as well?
Read this: http://www.joelonsoftware.com/articles/Unicode.html - it will clear up any questions you have about Unicode, encodings, etc.
Edit: By the way, I'm not so clear on what you mean by "rules", but this article should clear up any questions you have about what UTF-8 and ISO-8859-1 are.
UTF-8 on Wikipedia is a good place to start.
I've been working with a Japanese company that chooses to encode our files with EUC-JP.
I've been curious for quite a while now and have tried asking superiors why EUC-JP over Shift-JIS or UTF-8, but I only get answers like "it's convention" or such.
Do you know why the initial coders might have chosen EUC-JP over other character encoding?
Unlike Shift-JIS, EUC-JP is ASCII-safe: any byte whose eighth (high) bit is zero is ASCII. It was also historically popular on Unix variants. Either of these could have been an important factor a long time ago, before UTF-8 was generally adopted. Check the Wikipedia article for more details.
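The difference is easy to see with concrete bytes. Taking the commonly cited example of 表 (U+8868), whose Shift-JIS encoding is 0x95 0x5C while its EUC-JP encoding is 0xC9 0xBD, a naive byte-level scan for ASCII '\' (0x5C) misfires on Shift-JIS text but never on EUC-JP text (a sketch; the byte values are the ones usually quoted for this character):

```go
package main

import "fmt"

// hasByte reports whether b contains the given byte value,
// the way a naive ASCII-oriented scanner would look for it.
func hasByte(b []byte, target byte) bool {
	for _, c := range b {
		if c == target {
			return true
		}
	}
	return false
}

func main() {
	// 表 (U+8868) encoded two ways:
	sjis := []byte{0x95, 0x5C}  // Shift-JIS: the trail byte 0x5C is ASCII '\'
	eucjp := []byte{0xC9, 0xBD} // EUC-JP: every non-ASCII byte has its high bit set

	fmt.Println(hasByte(sjis, '\\'))  // true  - naive '\' scans can corrupt Shift-JIS
	fmt.Println(hasByte(eucjp, '\\')) // false - EUC-JP survives such scans
}
```

This is exactly why tools that process bytes without decoding (path handling, shell quoting, grep-style searches) historically broke on Shift-JIS but not on EUC-JP.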
I've published an app, and I find some of the comments to be like this: РекамедÑ
I have googled a lot and I cannot decode it so that the comment is not shown this way. This is the way it is stored in the database; it could be Cyrillic, but I could not decode it either. Any clue on how to make sense of this kind of comment?
These appear to be doubly encoded HTML entities. So, for example, & was turned into &amp;, and that was then again turned into &amp;amp;.
When decoding the data twice using this online tool (there are many others) the result is
РекамедÑ
That could be Unicode data, e.g. UTF-8 text in a non-Western script like Cyrillic or Arabic, that either
was misinterpreted as single-byte input, or
was garbled by a misguided "sanitization" step, possibly a call or two to PHP's htmlentities() (which incidentally assumes the single-byte ISO-8859-1 encoding by default in older versions, so a call to this function could be the whole source of the problem).
The fix will likely need to be on server side.
If you are using PHP, see UTF-8 all the way through for a handy guide.
I have a bunch of emails that I decided to process in Go.
Go parses everything (headers, multipart) very well.
How do I convert all email text to UTF-8?
I read the encoding name from the Content-Type header and parse it with mime.ParseMediaType.
I believe some emails may have bugs in their encodings,
e.g. a wrong declared encoding, or multiple encodings in a single body.
So even if there is a single wrong character while 99% of the text is readable, I wish to be able to read it.
PS
There are libs in go to work with charset. https://godoc.org/code.google.com/p/go.text/encoding
and a set of iconv wrappers like https://github.com/djimenez/iconv-go
I think the first lacks some encodings, and it only gives you a decoder by exact encoding name; I am not sure that I know all the synonyms of each encoding.
e.g. UTF-8 and utf8 are the same encoding; Windows-1251 and CP-1251 are the same as well.
The second is an iconv wrapper. Go is a memory-safe language, with no buffer overflows, and that is why I wish to do this in Go. But iconv is written in C and is less safe. I do
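The alias problem can be handled without iconv. The successor of the go.text package, golang.org/x/text/encoding/htmlindex, resolves encoding aliases by name via htmlindex.Get. As a stdlib-only sketch of the header-parsing side, here is a hypothetical helper that pulls the charset out of a Content-Type value and normalizes a few synonyms (the alias table is illustrative, not exhaustive):

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// aliases maps a few common charset synonyms to one canonical name.
// Illustrative only; golang.org/x/text/encoding/htmlindex does this
// properly for the full registry of names.
var aliases = map[string]string{
	"utf8":    "utf-8",
	"cp1251":  "windows-1251",
	"cp-1251": "windows-1251",
}

// charsetOf extracts the charset parameter from a Content-Type
// header value and normalizes its spelling.
func charsetOf(contentType string) (string, error) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", err
	}
	cs := strings.ToLower(strings.TrimSpace(params["charset"]))
	if cs == "" {
		return "utf-8", nil // assumed default when none is declared
	}
	if canonical, ok := aliases[cs]; ok {
		return canonical, nil
	}
	return cs, nil
}

func main() {
	cs, _ := charsetOf(`text/plain; charset="CP1251"`)
	fmt.Println(cs) // windows-1251
}
```

The canonical name can then be fed to a decoder lookup; decoders from the x/text packages also replace (rather than reject) invalid bytes, which matches the "one bad character in 99% readable text" requirement.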
I am trying to convert a UTF-8 string into a UCS-2 string.
I need to get string like "\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875".
I have googled for about a month now, but still I have found no reference on converting UTF-8 to UCS-2.
Please, someone help me.
Thanks in advance.
EDIT: okay, maybe my explanation was not good enough. Here is what I am trying to do.
I live in Korea, and I am trying to send an SMS message using CTMessageCenter. I tried to send simplified Chinese characters through my app, and I got ???? instead of the proper characters. So I tried UTF-8, UTF-16, BE and LE as well, but they all returned ??. Finally I found out that SMS in Korea uses the UCS-2 and EUC-KR encodings. Weird, isn't it?
Anyway, I tried to send a string like \u4E3B\u9875 and it worked.
So I need to convert my string into UCS-2 first and get the escaped string literal from it.
Wikipedia:
The older UCS-2 (2-byte Universal Character Set) is a similar
character encoding that was superseded by UTF-16 in version 2.0 of the
Unicode standard in July 1996. It produces a fixed-length format
by simply using the code point as the 16-bit code unit and produces
exactly the same result as UTF-16 for 96.9% of all the code points in
the range 0-0xFFFF, including all characters that had been assigned a
value at that time.
IBM:
Since the UCS-2 standard is limited to 65,535 characters, and the data
processing industry needs over 94,000 characters, the UCS-2 standard
is in the process of being superseded by the Unicode UTF-16 standard.
However, because UTF-16 is a superset of the existing UCS-2 standard,
you can develop your applications using the system's existing UCS-2
support as long as your applications treat the UCS-2 as if it were
UTF-16.
unicode.org:
UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2
are identical for purposes of data exchange. Both are 16-bit, and have
exactly the same code unit representation.
So, using the "UTF8toUnicode" transformation in most language libraries will produce UTF-16, which is essentially UCS-2. And simply extracting the 16-bit characters from an Objective-C string will accomplish the same thing.
In other words, the solution has been staring you in the face all along.
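As a sketch of that idea in Go (the same transformation applies to the 16-bit units you get from NSString in Objective-C), the standard library's unicode/utf16 package yields the UTF-16 code units, which for BMP-only text like the question's Chinese characters are exactly the UCS-2 values; the helper name is mine:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf16"
)

// toUCS2Escapes converts a UTF-8 Go string into the \uXXXX escape
// form of its UTF-16 code units. For BMP-only text, these code
// units are identical to UCS-2.
func toUCS2Escapes(s string) string {
	var b strings.Builder
	for _, u := range utf16.Encode([]rune(s)) {
		fmt.Fprintf(&b, "\\u%04X", u)
	}
	return b.String()
}

func main() {
	fmt.Println(toUCS2Escapes("主页")) // \u4E3B\u9875
}
```

Characters outside the BMP would produce surrogate pairs here, which is UTF-16 rather than strict UCS-2, but SMS UCS-2 payloads are generally treated as UTF-16 in practice anyway.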
UCS-2 is not a valid Unicode encoding. UTF-8 is.
It is therefore impossible to convert UTF-8 into UCS-2 — and indeed, also the reverse.
UCS-2 is dead, ancient history. Let it rot in peace.
This stems from a question I had about nvarchar and varchar.
According to MSDN, varchar is:
...non-Unicode character data...
I've looked around for a clear definition of "non-unicode" but haven't had any luck. Is this the same thing as ASCII? If so, is there a reason that they don't just say ASCII?
No, it is not the same thing, and that's the reason why they didn't just say ASCII. There are many encodings out there that are neither Unicode nor ASCII, like Windows-1251, also known as CP1251 (Cyrillic).
No. It's not the same. LATIN1 is an example of a charset that's not UNICODE and is not ASCII either. Here is a list of charsets.