I am using JavaMail and trying to process emails that have Content-Type: text/html; charset="8Bit". Can I assume this means us-ascii, or is there a better equivalent?
US-ASCII is the lowest common denominator of all character sets, and it is a seven-bit character set. "8Bit" is not a character set at all! The character encoding actually used should have been specified, e.g. iso-8859-1 or utf-8.
Please see: https://www.w3schools.com/html/html_charset.asp
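If you have to cope with such a header in code, one defensive approach is to try the declared name and fall back to a sensible default when it is not a registered charset. A minimal sketch in plain java.nio (the resolve helper and the choice of ISO-8859-1 as fallback are my own assumptions, not anything JavaMail prescribes; with JavaMail you would typically get the name from ContentType#getParameter("charset")):

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetFallback {

    // Resolve a charset name taken from a Content-Type header.
    // Bogus values such as "8Bit" are not registered charsets,
    // so fall back to a configurable default instead of failing.
    static Charset resolve(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // "8Bit" lands here: it looks like a Content-Transfer-Encoding
            // value, not a charset, so the fallback is the best we can do.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // ISO-8859-1 is a safer fallback than US-ASCII: it maps every byte
        // to some character, so nothing is thrown away if the guess is wrong.
        Charset cs = resolve("8Bit", StandardCharsets.ISO_8859_1);
        System.out.println(cs); // prints ISO-8859-1
    }
}
```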
Related
I'm writing my own BASE64 encoder/decoder for some constrained environments.
And I found that the documentation for Base64.Encoder#encodeToString says that it uses ISO-8859-1 to construct a String from the encoded bytes.
I presume that the ISO-8859-1 charset also covers the entire Base64 alphabet.
Is there any possible reason not to use US-ASCII?
I suspect it's more efficient: converting from ISO-8859-1 back to text is just a matter of promoting each byte straight to a char, whereas for ASCII you'd need to check that the byte is valid ASCII. The result for base64 will always be the same, of course.
(That's only a guess, but an educated one. You could always run benchmarks if you want to validate it...)
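If you want to check the "same result" part rather than take it on faith, a quick test with the standard java.util.Base64 API does it; the only fact it relies on is that the Base64 alphabet is pure ASCII, on which ISO-8859-1 and US-ASCII agree:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class Base64CharsetCheck {
    public static void main(String[] args) {
        byte[] input = "any binary payload".getBytes(StandardCharsets.UTF_8);

        // encode() gives the raw Base64 bytes; encodeToString() is documented
        // to build its String from those bytes using ISO-8859-1.
        byte[] encoded = Base64.getEncoder().encode(input);

        String viaIso   = new String(encoded, StandardCharsets.ISO_8859_1);
        String viaAscii = new String(encoded, StandardCharsets.US_ASCII);

        // The Base64 alphabet is pure ASCII, so both decodings agree.
        System.out.println(viaIso.equals(viaAscii));                                  // true
        System.out.println(viaIso.equals(Base64.getEncoder().encodeToString(input))); // true

        // Round-trip back to the original bytes, to show nothing was lost.
        System.out.println(Arrays.equals(input, Base64.getDecoder().decode(viaIso))); // true
    }
}
```

The performance difference, if any, would still need a real benchmark (JMH or similar) to show up.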
When someone types a URL in a browser to access a page, which charset is used for that URL? Is there a standard? Can I assume that UTF-8 is used everywhere? Which characters are accepted?
URLs may contain only a subset of ASCII; all URLs are valid ASCII.
International domain names must be Punycode encoded. Non-ASCII characters in the path or query parts must be encoded, with Percent-encoding being the generally agreed-upon standard.
Percent-encoding only takes the raw bytes and encodes each byte as %xx. There's no generally followed standard on what encoding should be used to determine a byte representation. As such, it's basically impossible to assume any particular character set being used in the percent-encoded representation. If you're creating those links, then you're in full control over the used charset before percent-encoding; if you're not, you're mostly out of luck. Though you will most likely encounter UTF-8, this is not guaranteed.
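Here is a small illustration of that charset dependence in Java. java.net.URLEncoder strictly speaking implements application/x-www-form-urlencoded rather than generic URI escaping, but the point about the chosen charset carries over unchanged:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String value = "café";

        // The percent-encoded form depends entirely on the charset chosen
        // before encoding; the URL itself carries no record of that choice.
        System.out.println(URLEncoder.encode(value, "UTF-8"));      // caf%C3%A9
        System.out.println(URLEncoder.encode(value, "ISO-8859-1")); // caf%E9

        // A server receiving "caf%E9" cannot tell from the URL alone whether
        // %E9 was meant as ISO-8859-1, windows-1252, or something else.
    }
}
```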
Suppose a URL is encoded in a multi-byte character set where one of the bytes in a multi-byte sequence falls between 0 and 127, i.e. is an otherwise valid 7-bit ASCII character.
Example: the Japanese Shift_JIS character set, where the character カ would be escaped as %83%4a. Now %4a is also the ASCII character J, so I could instead write %83J.
Would that be OK by whatever standard(s) apply?
I'm not asking because I want to send URLs like this (although the latter form saves a couple of bytes), but because I need to know whether I should accept them on the server side, i.e. whether doing so is standards-compliant, and also whether I can expect other servers to handle them the same way.
I'm basing my answer on RFC 2396, as that is what's being used by HTTP 1.1.
According to Section 2.1, there are two separate steps, the latter being optional:
URI character sequence -> octet sequence
octet sequence -> original character sequence
Since escaping operates at the octet level, %4a and the literal character J both denote the same octet 0x4A, so the charset interpretation in the second step sees identical octets either way.
So the answer is: yes, it's OK.
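A sketch of that two-step model in Java, just to make the octet stage explicit. The toOctets helper is hypothetical (it assumes well-formed %xx escapes and no '+' handling), but it shows that %83%4a and %83J carry the same octets and therefore decode to the same character:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class TwoStepPercentDecode {

    // Step 1 of RFC 2396 section 2.1: URI characters -> octets.
    static byte[] toOctets(String uriText) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < uriText.length(); i++) {
            char c = uriText.charAt(i);
            if (c == '%') {
                // A %xx escape stands for one octet.
                out.write(Integer.parseInt(uriText.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                // A literal URI character stands for its own (ASCII) octet.
                out.write((byte) c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");

        // Step 2 (optional per the RFC): octets -> original characters.
        String a = new String(toOctets("%83%4a"), shiftJis);
        String b = new String(toOctets("%83J"), shiftJis);

        System.out.println(a);           // カ
        System.out.println(b);           // カ
        System.out.println(a.equals(b)); // true: both forms carry the octets 83 4A
    }
}
```

Note that real-world decoders vary: java.net.URLDecoder, for instance, decodes each run of %xx escapes separately from literal characters, so it may not reassemble a mixed %83J sequence into one Shift_JIS character.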
I have a file. I don't know how it was processed. It's probably a double encoding. I've found this link about double encoding that solved almost my problem:
http://www.spamusers.com/encoding.htm
It lists all the double-encoding substitutions to apply, like:
À àÁ
 Â
Unfortunately I still have other weird characters like:
ú
ç
ö
Do you have an idea of how to clean up these weird characters? For the ones I know, I've made a bash script and simply replaced them, but I don't know how to recognize the others. I'm running on Linux, so any magic commands would be welcome.
The “double encodings substitutions” page that you link to seems to contain mappings meant to fix character data that has been doubly UTF-8 encoded. Thus, the proper fixing routine would be to reverse such mappings and see if the result makes sense.
For example, if you take A with grave accent, À, U+00C0, and UTF-8 encode it, you get the bytes C3 80. If these are then mistakenly interpreted as single-byte characters according to windows-1252, for example, you get U+00C3 and U+20AC (the letter Ã and the euro sign €, since 0x80 is the euro sign in windows-1252). If these are then UTF-8 encoded, you get C3 83 for the former and E2 82 AC for the latter. If those bytes are in turn interpreted according to windows-1252, you get the garbled sequence that the page lists for À.
But you don't actually have that garbled sequence, do you? You have some digital data, bytes, which merely display that way if interpreted according to windows-1252. That would be a wrong interpretation.
You should first read the data as UTF-8 and decode it to characters, checking that all code points are below hexadecimal 100 (if not, there's yet another error involved somewhere); then re-encode those characters as single bytes (windows-1252 again) and UTF-8 decode the result once more.
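A minimal sketch of that procedure in Java, assuming the wrong intermediate charset was windows-1252 (it could just as well be ISO-8859-1; you would have to try and see which result makes sense). The undoOneLayer helper is my own naming:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UndoDoubleUtf8 {

    // Reverse one layer of mojibake: re-encode the characters with the
    // single-byte charset they were wrongly decoded with, then decode
    // the resulting bytes as UTF-8.
    static String undoOneLayer(String garbled, Charset wrongCharset) {
        return new String(garbled.getBytes(wrongCharset), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");

        // Simulate the damage: UTF-8 encode "ú", misread the bytes as
        // windows-1252, and repeat the whole thing a second time.
        String original = "ú";
        String once  = new String(original.getBytes(StandardCharsets.UTF_8), windows1252);
        String twice = new String(once.getBytes(StandardCharsets.UTF_8), windows1252);
        System.out.println(twice); // the kind of garbage seen in the file

        // Undo the layers in reverse order. Checking that every code point is
        // below 0x100 before the second pass guards against data that was not
        // actually double-encoded. (new String(..., UTF_8) silently turns
        // malformed input into U+FFFD; a CharsetDecoder set to REPORT would
        // be stricter.)
        String step1 = undoOneLayer(twice, windows1252);
        boolean allSingleByte = step1.codePoints().allMatch(cp -> cp < 0x100);
        String repaired = allSingleByte ? undoOneLayer(step1, windows1252) : step1;

        System.out.println(repaired); // ú
    }
}
```

On Linux, iconv can do the same one-layer reversal from the command line, e.g. `iconv -f UTF-8 -t WINDOWS-1252 broken.txt`, applied as many times as the file was double-encoded.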
I am trying to convert UTF-8 string into UCS-2 string.
I need to get string like "\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875".
I have been googling for about a month now, but I still can't find a reference on converting UTF-8 to UCS-2.
Please someone help me.
Thx in advance.
EDIT: okay, maybe my explanation was not good enough. Here is what I am trying to do.
I live in Korea, and I am trying to send an SMS message using CTMessageCenter. I tried to send simplified Chinese characters through my app and got ???? instead of the proper characters. So I tried UTF-8, UTF-16, BE and LE as well, but they all came back as ??. Finally I found out that SMS uses UCS-2 and EUC-KR encoding in Korea. Weird, isn't it?
Anyway I tried to send string like \u4E3B\u9875 and it worked.
So I need to convert string into UCS-2 encoding first and get the string literal from those strings.
Wikipedia:
The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.
IBM:
Since the UCS-2 standard is limited to 65,535 characters, and the data processing industry needs over 94,000 characters, the UCS-2 standard is in the process of being superseded by the Unicode UTF-16 standard. However, because UTF-16 is a superset of the existing UCS-2 standard, you can develop your applications using the system's existing UCS-2 support as long as your applications treat the UCS-2 as if it were UTF-16.
unicode.org:
UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code unit representation.
So, using the "UTF8toUnicode" transformation in most language libraries will produce UTF-16, which is essentially UCS-2. And simply extracting the 16-bit characters from an Objective-C string will accomplish the same thing.
In other words, the solution has been staring you in the face all along.
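A small sketch of what that looks like in Java; the input text is the one from the question, and the \uXXXX formatting at the end just produces the literal the asker wanted:

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToUcs2Escapes {
    public static void main(String[] args) {
        // Pretend these UTF-8 bytes arrived from elsewhere.
        byte[] utf8 = "－－我的上网主页".getBytes(StandardCharsets.UTF_8);

        // Decoding UTF-8 yields a String whose chars are UTF-16 code units;
        // for BMP characters that is exactly the UCS-2 value.
        String text = new String(utf8, StandardCharsets.UTF_8);

        StringBuilder escaped = new StringBuilder();
        for (char c : text.toCharArray()) {
            escaped.append(String.format("\\u%04X", (int) c));
        }
        System.out.println(escaped);
        // \uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875
    }
}
```

On the Objective-C side, -[NSString characterAtIndex:] hands back the same 16-bit code units (unichar values), so the equivalent loop is just as short there.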
UCS-2 is not a valid Unicode encoding. UTF-8 is.
It is therefore impossible to convert UTF-8 into UCS-2 — and indeed, also the reverse.
UCS-2 is dead, ancient history. Let it rot in peace.