Storing utf-8 characters in Oracle with default NLS_CHARACTERSET setting - character-encoding

So for some legacy reasons we have an Oracle Database with NLS_CHARACTERSET = WE8MSWIN1252. Now we want to store CJK characters into a VARCHAR2 field. Is this possible, and if so, what do I have to do to implement this?
PS: as the product has already been released, changing the NLS_CHARACTERSET is out of the question.
EDIT
So far we have come up with an idea like this:
For each CJK character, we break it into its UTF-8 byte representation and store the byte sequence in the database. When reading it back, we reassemble the bytes into the CJK character. For example, the Chinese character 中 is 0xe4, 0xb8, 0xad in UTF-8, so we store those 3 bytes instead.
However, this approach does not seem to work correctly. If we store the Chinese character 华, whose bytes are 0xe5, 0x8d, 0x8e, it becomes 0xe5, 0xbf, 0x8e in the database.
We are using Java, and we have no idea whether that has anything to do with the result.

Not correctly, no. If the character set of the database is Windows-1252, the only characters you can properly store in a VARCHAR2 column are those that exist in the Windows-1252 character set.
If the NLS_NCHAR_CHARACTERSET is Unicode (generally AL16UTF16), you could create an NVARCHAR2 column and store CJK characters in that new column. Your application may need code changes in order to support NVARCHAR2 columns.
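As a rough illustration, a JDBC sketch for the NVARCHAR2 approach might look like the following. The table, column, connection URL, and credentials are all hypothetical; the key point is using the N-string binds so the driver sends the data as national-character data rather than converting it to the WE8MSWIN1252 database character set.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NVarchar2Example {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger")) {

            // Assumes a table created as:
            //   CREATE TABLE messages (id NUMBER, body NVARCHAR2(100))
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO messages (id, body) VALUES (?, ?)")) {
                ps.setInt(1, 1);
                // setNString binds the value as NCHAR data, so it is stored
                // using the national character set (e.g. AL16UTF16) instead of
                // being converted to the database character set.
                ps.setNString(2, "中华");
                ps.executeUpdate();
            }

            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT body FROM messages WHERE id = ?")) {
                ps.setInt(1, 1);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getNString(1)); // should print 中华
                    }
                }
            }
        }
    }
}
```

Depending on the driver version and configuration, you may also need to set the oracle.jdbc.defaultNChar connection property so that string binds default to national-character data.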

Related

What is the relationship between unicode/utf-8/utf-16 and my local encode GBK?

I've noticed that my text file from Windows (Chinese version) turns into garbled text when ported to Ubuntu.
After more research, I learned that the default encoding on the Chinese version of Windows is GBK, while on Ubuntu it is UTF-8, and that iconv can translate between encodings, for example from GBK to UTF-8:
iconv -f gbk -t utf-8 input.txt > output.txt
But I am still confused by the relationship between these encodings. What are they? What are the similarities and differences between them?
First, it is not about the OS, but about the program you are using to read the file.
With a bare .txt file, the program has to guess the encoding, which is not always possible but often works. In an HTML file, the encoding is given as metadata, so browsers don't need to guess.
Second, do you know ASCII? Do you see how it represents symbols via numbers? If not, that is the first thing you should learn.
Next, do you see the difference between Unicode and UTF-XXX? It must be clear to you that Unicode is just a map of numbers (code points) to glyphs (symbols, including Chinese characters, ASCII characters, Egyptian characters, etc.)
UTF-XXX on the other hand says, given a string of bytes, which Unicode numbers (code points) do they represent. Therefore, UTF-8 and UTF-16 are different efficient ways to represent Unicode.
As you may imagine, unlike ASCII, both UTF and GBK must allow more than one byte per character, since there are far more than 256 of them.
In GBK all characters are encoded as 1 or 2 bytes.
Since GBK is specialized for Chinese, it uses fewer bytes on average than UTF-XXX to represent a given Chinese text, and more for other languages.
In UTF-8 and 16, the number of bytes per glyph is variable, so you have to look at how many bytes are used for the Chinese code points.
In Unicode, Chinese glyphs occupy several ranges, the main one being CJK Unified Ideographs (U+4E00 to U+9FFF). You then have to look at how efficiently UTF-8 and UTF-16 represent those ranges.
According to the Wikipedia articles on UTF-8 and UTF-16, the main range for Chinese glyphs, U+4E00 to U+9FFF, is represented in UTF-8 as 3 bytes per character, while in UTF-16 it is 2 bytes. Therefore, if you are going to store a lot of Chinese text, UTF-16 may be more compact. You also have to look at the other ranges to see how many bytes per character are used.
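A quick way to check this yourself (a minimal sketch; the sample string is arbitrary, and it assumes the JRE ships the GBK charset, which standard JDKs do) is to encode the same Chinese text with each charset and compare the byte counts:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String text = "中华人民共和国"; // 7 Chinese characters, all in U+4E00..U+9FFF

        // UTF-8 needs 3 bytes per character in this range.
        System.out.println("UTF-8 : " + text.getBytes(StandardCharsets.UTF_8).length + " bytes");    // 21
        // UTF-16 (big-endian, no BOM) needs 2 bytes per character here.
        System.out.println("UTF-16: " + text.getBytes(StandardCharsets.UTF_16BE).length + " bytes"); // 14
        // GBK also needs 2 bytes per character for common Chinese characters.
        System.out.println("GBK   : " + text.getBytes(Charset.forName("GBK")).length + " bytes");    // 14
    }
}
```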
For portability, the best choice is UTF, since UTF can represent almost any possible character set, so it is more likely that viewers will have been programmed to decode it correctly. The size gain of GBK is not that large.

What 8-bit encodings use the C1 range for characters? (x80—x9F or 128—159)

Wikipedia has a listing of the x80—x9F "C1" range under Latin 1 Supplement for Unicode. This range is also reserved in the ISO-8859-1 codepage.
I'm looking at a file of strings, all of which are within the 7-bit ASCII range except for a few instances of \x96 where it looks like a dash would be, such as the middle of a street address.
I don't know if other characters in the C1 range might eventually show up in the data, so I'd like to know if there's a correct way to read the file. Are there any 8-bit encodings which use x80 through x9F for character data instead of terminal control characters?
There is a large number (potentially an infinite number) of 8-bit encodings that assign graphic characters to some or all bytes in the range 0x80 to 0x9F. Several encodings defined by Microsoft have U+2013 EN DASH “–” at byte position 0x96, and this character could conceivably appear in a street address, especially between numbers.
On the other hand, e.g. MacRoman has the letter “ñ” at position 0x96, and it could well appear within a street name in Spanish, for example.
For a rational analysis of the situation, you should inspect the data as a whole, possibly using a filter that finds all bytes outside the ASCII range 0x00 to 0x7F, look at the contexts in which those bytes appear, and try to find technical information about the origin of the data.
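As a quick illustration (a sketch only; the two candidate charsets are assumptions about where the data might have come from, using the charset names found in typical JDKs), you can decode the suspect byte with each encoding and see which interpretation makes sense in context:

```java
import java.nio.charset.Charset;

public class InspectByte {
    public static void main(String[] args) {
        byte[] suspect = { (byte) 0x96 };

        // Candidate 8-bit encodings that assign a graphic character to 0x96.
        String[] candidates = { "windows-1252", "x-MacRoman" };

        for (String name : candidates) {
            String decoded = new String(suspect, Charset.forName(name));
            System.out.printf("%-12s -> %s (U+%04X)%n",
                    name, decoded, (int) decoded.charAt(0));
        }
        // windows-1252 -> – (U+2013, EN DASH)
        // x-MacRoman   -> ñ (U+00F1, LATIN SMALL LETTER N WITH TILDE)
    }
}
```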
It's an en dash, which is slightly different from a hyphen (0x2D).
http://www.ascii-code.com/

Which NLS_LENGTH_SEMANTICS for WE8MSWIN1252 Character Set

We have a database where the character set is set to WE8MSWIN1252 which I understand is a single byte character set.
We created a schema and its tables by running a script with the following:
ALTER SYSTEM SET NLS_LENGTH_SEMANTICS=CHAR
Could we possibly lose data since we are using VARCHAR2 columns with character semantics while the underlying character set is single byte?
If you are using a single-byte character set like Windows-1252, it is irrelevant whether you are using character or byte semantics. Each character occupies exactly one byte so it doesn't matter whether a column is declared VARCHAR2(10 CHAR) or VARCHAR2(10 BYTE). In either case, up to 10 bytes of storage for up to 10 characters will be allocated.
Since you gain no benefit from changing the NLS_LENGTH_SEMANTICS setting, you ought to keep the setting at the default (BYTE) since that is less likely to cause issues with other scripts that you might need to run (such as those from Oracle).
Excellent question. Multi-byte characters take up however many bytes they require, which can use more storage than you expect. If you store a character that needs 4 bytes in a VARCHAR2(4) column, you have used all 4 bytes; if you store it in a VARCHAR2(4 CHAR) column, you have only used 1 of the 4 characters. Many languages and special characters require multiple bytes per character, so it's best to know your data and define your database columns accordingly. Oracle does NOT recommend changing NLS_LENGTH_SEMANTICS to CHAR because it affects every new column defined as CHAR or VARCHAR2, possibly including your catalog tables when you do an in-place upgrade. You can see why this is probably not a good idea. Other Oracle toolsets and interfaces may present issues as well.
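If specific columns do need character semantics, one option (a sketch only; the table name and connection details are hypothetical) is to declare the semantics explicitly per column rather than changing the system-wide default:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LengthSemanticsExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger");
             Statement stmt = conn.createStatement()) {

            // Explicit per-column semantics: no dependence on the
            // session or system NLS_LENGTH_SEMANTICS setting.
            stmt.execute("CREATE TABLE customers ("
                    + " id        NUMBER,"
                    + " zip_code  VARCHAR2(10 BYTE),"   // sized in bytes
                    + " full_name VARCHAR2(100 CHAR)"   // sized in characters
                    + ")");
        }
    }
}
```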

What is binary character set?

I'm wondering what a binary character set is and how it differs from, let's say, ISO/IEC 8859-1, aka the Latin-1 character set.
There's a page in the MySQL documentation about The _bin and binary Collations.
Nonbinary strings (as stored in the CHAR, VARCHAR, and TEXT data types) have a character set and collation. A given character set can have several collations, each of which defines a particular sorting and comparison order for the characters in the set. One of these is the binary collation for the character set, indicated by a _bin suffix in the collation name. For example, latin1 and utf8 have binary collations named latin1_bin and utf8_bin.
Binary strings (as stored in the BINARY, VARBINARY, and BLOB data types) have no character set or collation in the sense that nonbinary strings do. (Applied to a binary string, the CHARSET() and COLLATION() functions both return a value of binary.) Binary strings are sequences of bytes and the numeric values of those bytes determine sort order.
And so on. Maybe that makes more sense? If not, I'd recommend looking further in the documentation for descriptions of these things. If it's a concept, it should be explained there. It usually is :)
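To illustrate the general idea outside MySQL (a Java analogy, not the MySQL implementation itself): a binary comparison orders strings by the numeric values of their code units, while a collation applies language-aware rules.

```java
import java.text.Collator;
import java.util.Locale;

public class BinaryVsCollation {
    public static void main(String[] args) {
        // "Binary"-style comparison: numeric value of each code unit.
        // 'B' is 0x42 and 'a' is 0x61, so "B" sorts before "a".
        System.out.println("compareTo: " + "a".compareTo("B")); // positive: "a" > "B"

        // Collation-aware comparison: English rules, where 'a' sorts
        // before 'B' regardless of case.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        System.out.println("collator : " + collator.compare("a", "B")); // negative: "a" < "B"
    }
}
```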

What's the proper technical term for "high ascii" characters?

What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.
Often they're called diacritics, accented letters, sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.
What is a correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?
"Non-ASCII characters"
ASCII character codes above 127 are not defined. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, and others chose other characters.
Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional Western alphabets but also Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese, and Korean, as well as many other languages, both modern and ancient.
There are several ways to encode Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same in both ASCII and UTF-8.
That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or another Unicode encoding), or they can be a custom implementation by a hardware or software supplier.
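A small check of that compatibility (a minimal sketch; the sample strings are arbitrary):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiUtf8Subset {
    public static void main(String[] args) {
        String plain = "Hello, world!";   // ASCII-only text
        String accented = "naïve café";   // contains characters above 127

        // For ASCII-only text, the US-ASCII and UTF-8 byte sequences are identical.
        System.out.println(Arrays.equals(
                plain.getBytes(StandardCharsets.US_ASCII),
                plain.getBytes(StandardCharsets.UTF_8)));    // true

        // For text with characters above 127, they differ: US-ASCII cannot
        // represent them (they become '?'), while UTF-8 uses multi-byte sequences.
        System.out.println(Arrays.equals(
                accented.getBytes(StandardCharsets.US_ASCII),
                accented.getBytes(StandardCharsets.UTF_8))); // false
    }
}
```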
You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.
A bit sequence that doesn't represent an ASCII character is not definitively a Unicode character.
Depending on the character encoding you're using, it could be either:
an invalid bit sequence
a Unicode character
an ISO-8859-x character
a Microsoft 1252 character
a character in some other character encoding
a bug, binary data, etc
The one definition that would fit all of these situations is:
Not an ASCII character
To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".
Unicode is one possible set of Extended ASCII characters, and is quite, quite large.
UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.
These words are taken from an online resource (cool website, though) because I found them useful and appropriate for an answer.
At first, ASCII included only capital letters and numbers, but in 1967 lowercase letters and some control characters were added, forming what is known as US-ASCII, i.e. the characters 0 through 127.
This set of only 128 characters was published as a standard in 1967, containing everything needed to write in the English language.
In 1981, IBM developed an 8-bit extension of the ASCII code called "code page 437". In this version, some obsolete control characters were replaced with graphic characters, and 128 characters were added, with new symbols, signs, graphics, and Latin letters, plus all the punctuation signs and characters needed to write texts in other languages, such as Spanish.
In this way, the extended ASCII characters ranging from 128 to 255 were added.
IBM included support for this code page in the hardware of its model 5150, known as the "IBM-PC" and considered the first personal computer.
The operating system of this model, MS-DOS, also used this extended ASCII code.
Non-ASCII Unicode characters.
If you say "High ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the Code Pages that defined particular characters represented by particular values. Any multibyte (> 255 decimal value) is not ASCII.
