What is the relationship between Unicode/UTF-8/UTF-16 and my local encoding, GBK?

I've noticed that a text file created on Windows (Chinese version) turns garbled when moved to Ubuntu.
After more research, I learned that the default encoding on the Chinese version of Windows is GBK, while on Ubuntu it is UTF-8, and that iconv can convert between encodings, for example from GBK to UTF-8:
iconv -f gbk -t utf-8 input.txt > output.txt
But I am still confused about the relationship between these encodings. What are they? What are the similarities and differences between them?

First, it is not about the OS but about the program you are using to read the file.
For a bare .txt file, the program has to guess the encoding, which is not always possible but often works. In HTML, the encoding can be declared as metadata, so browsers don't need to guess.
Second, do you know ASCII? Do you see how it represents symbols as numbers? If not, this is the first thing you should learn.
Next, do you see the difference between Unicode and UTF-XXX? It must be clear to you that Unicode is just a map from numbers (code points) to characters (including Chinese characters, ASCII characters, Egyptian hieroglyphs, etc.).
UTF-XXX, on the other hand, says which Unicode code points a given string of bytes represents. UTF-8 and UTF-16 are therefore two different ways to store Unicode as bytes.
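To make the code point versus encoding distinction concrete, here is a minimal Python sketch. The sample character is my own choice for illustration; the point is only that one code point turns into different byte sequences depending on the encoding.

```python
# One character, one Unicode code point, two different byte sequences.
ch = "\u4e2d"  # the Chinese character for "middle", chosen only as an example

code_point = ord(ch)                   # the Unicode number: 0x4E2D
utf8_bytes = ch.encode("utf-8")        # 3 bytes in UTF-8
utf16_bytes = ch.encode("utf-16-be")   # 2 bytes in big-endian UTF-16, no BOM

print(hex(code_point), utf8_bytes, utf16_bytes)

# Decoding with the matching encoding gives back the same code point.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == ch
```

The code point never changes; only the bytes used to store it do.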
As you may imagine, unlike ASCII, both the UTF encodings and GBK must allow more than one byte per character, since there are far more than 256 characters.
In GBK, every character is encoded as 1 or 2 bytes.
Since GBK is specialized for Chinese, it uses fewer bytes on average than UTF-8 to represent a given Chinese text (about the same as UTF-16), but it cannot represent many other scripts at all.
In UTF-8 and UTF-16, the number of bytes per character is variable, so you have to look at how many bytes are used for the Chinese code points.
In Unicode, Chinese characters are spread over several ranges, so you have to look at how efficiently UTF-8 and UTF-16 represent each of them.
According to the Wikipedia articles on UTF-8 and UTF-16, the first and most common range for Chinese characters, U+4E00 to U+9FFF, takes 3 bytes per character in UTF-8 and 2 bytes in UTF-16. Therefore, if you are going to store lots of Chinese text, UTF-16 might be more compact. You also have to look into the other ranges to see how many bytes per character are used.
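A quick way to check the byte counts for a given text is to encode it and measure. This is only a sketch: the sample string is an assumption of mine, and "utf-16-le" is used instead of "utf-16" so that Python does not prepend a 2-byte byte order mark.

```python
# Compare how many bytes the same Chinese text needs in each encoding.
text = "\u4e2d\u6587\u7f16\u7801"  # four common characters, all in U+4E00..U+9FFF

sizes = {name: len(text.encode(name)) for name in ("gbk", "utf-8", "utf-16-le")}
print(sizes)  # {'gbk': 8, 'utf-8': 12, 'utf-16-le': 8}
```

For text that mixes Chinese with lots of ASCII (markup, digits, spaces), the picture changes, since ASCII takes 1 byte in GBK and UTF-8 but 2 bytes in UTF-16.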
For portability, the best choice is UTF, since UTF can represent almost any possible character set, so it is more likely that viewers will have been programmed to decode it correctly. The size gain of GBK is not that large.

Related

What ASCII character uses up the most memory?

I've been thinking about ASCII and memory lately and couldn't find a solid answer to this question.
When a script is compiled, do ASCII characters use up different amounts of memory? And if so, which ASCII character uses up the most memory?
ASCII is a fixed-width character encoding with each character represented by 7 bits. So, to answer your question, all ASCII characters take the same amount of memory regardless of the implementation.
Because of the way our processor architectures are designed, we typically store each ASCII character in a single byte (aligned memory access is a lot faster than having to do bitwise operations; see tripleee's comment). This means that any ASCII character will typically take up one byte of space on common computing platforms.
In contrast to this are the variable-width encodings such as UTF-8. For future readers who come across this page, it might be worth noting that the ASCII characters 0 through 127 are represented with the same binary as they are in UTF-8. This was done to maintain backwards compatibility. Therefore, in the context of UTF-8, the ASCII characters 0 through 127 take up less space than any other UTF-8 characters.
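As a rough illustration of the variable width, the sketch below measures the UTF-8 length of a few sample characters. The samples are my own picks, one from each length class.

```python
# UTF-8 length per character: ASCII stays at 1 byte, everything else needs more.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, e-acute, a CJK character, an emoji

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# For ASCII characters the ASCII bytes and the UTF-8 bytes are identical.
assert "A".encode("ascii") == "A".encode("utf-8") == b"\x41"
```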
Further I haven't heard of a mainstream compiler/interpreter that compresses strings stored with ASCII characters. This would impose a runtime performance hit that many would find unacceptable. Such a space optimization would therefore be left to the user to perform.
The ASCII wikipedia page has a good summary of the ASCII character set.
﷽ is probably the most space-consuming character. I'm not sure about the encoding, but it is a huge single character. It is called "Basmala" and it means "In the name of Allah, the Most Gracious, the Most Merciful."
According to a Reddit user who has now deleted their account: “It's an Arabic ligature commonly used in Urdu. It was added so someone using an Urdu keyboard can type it easier.”
I love to use this in Discord raids, because imagine 2000 Basmala characters, vs 2000 regular characters. It fills their server up a LOT. Glad I could help.

Is ASCII code, as a matter of fact, 7-bit or 8-bit?

My teacher told me ASCII is an 8-bit character coding scheme. But it is defined only for 0-127 codes which means it can be fitted into 7 bits. So can't it be argued that ASCII is actually a 7-bit code?
And what do we actually mean when we say ASCII is an 8-bit code?
ASCII was indeed originally conceived as a 7-bit code. This was done well before 8-bit bytes became ubiquitous, and even into the 1990s you could find software that assumed it could use the 8th bit of each byte of text for its own purposes ("not 8-bit clean"). Nowadays people think of it as an 8-bit coding in which bytes 0x80 through 0xFF have no defined meaning, but that's a retcon.
There are dozens of text encodings that make use of the 8th bit; they can be classified as ASCII-compatible or not, and fixed- or variable-width. ASCII-compatible means that regardless of context, single bytes with values from 0x00 through 0x7F encode the same characters that they would in ASCII. You don't want to have anything to do with a non-ASCII-compatible text encoding if you can possibly avoid it; naive programs expecting ASCII tend to misinterpret them in catastrophic, often security-breaking fashion. They are so deprecated nowadays that (for instance) HTML5 forbids their use on the public Web, with the unfortunate exception of UTF-16. I'm not going to talk about them any more.
A fixed-width encoding means what it sounds like: all characters are encoded using the same number of bytes. To be ASCII-compatible, a fixed-width encoding must encode all its characters using only one byte, so it can have no more than 256 characters. The most common such encoding nowadays is Windows-1252, an extension of ISO 8859-1.
There's only one variable-width ASCII-compatible encoding worth knowing about nowadays, but it's very important: UTF-8, which packs all of Unicode into an ASCII-compatible encoding. You really want to be using this if you can manage it.
As a final note, "ASCII" nowadays takes its practical definition from Unicode, not its original standard (ANSI X3.4-1968), because historically there were several dozen variations on the ASCII 128-character repertoire -- for instance, some of the punctuation might be replaced with accented letters to facilitate the transmission of French text. All of those variations are obsolete, and when people say "ASCII" they mean that the bytes with value 0x00 through 0x7F encode Unicode codepoints U+0000 through U+007F. This will probably only matter to you if you ever find yourself writing a technical standard.
If you're interested in the history of ASCII and the encodings that preceded it, start with the paper "The Evolution of Character Codes, 1874-1968" (samizdat copy at http://falsedoor.com/doc/ascii_evolution-of-character-codes.pdf) and then chase its references (many of which are not available online and may be hard to find even with access to a university library, I regret to say).
On Linux man ascii says:
ASCII is the American Standard Code for Information Interchange. It is a 7-bit code.
The original ASCII table is encoded on 7 bits, and therefore it has 128 characters.
Nowadays, most readers/editors use an "extended" ASCII table (such as ISO 8859-1 or Windows-1252), which is encoded on 8 bits and has up to 256 characters (including Á, Ä, Œ, é, è and other characters useful for European languages, as well as mathematical glyphs and other symbols).
While UTF-8 uses the same encoding as the basic ASCII table (meaning 0x41 is A in both), it does not share the same encoding for characters above 0x7F, such as those in the "Latin-1 Supplement" block. This sometimes causes weird characters to appear in words like à la carte or piñata.
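The "weird characters" effect can be reproduced on purpose. The sketch below assumes the text was saved as UTF-8 and then read by a program that wrongly assumes ISO 8859-1 (called "latin-1" in Python); the sample word comes from the answer above.

```python
# Encode as UTF-8, then decode with the wrong single-byte encoding.
word = "pi\u00f1ata"                 # "pinata" with n-tilde (U+00F1)

utf8_bytes = word.encode("utf-8")    # the n-tilde becomes two bytes: C3 B1
misread = utf8_bytes.decode("latin-1")

print(misread)                       # the two bytes show up as two separate characters

# Reading with the right encoding recovers the original word.
assert utf8_bytes.decode("utf-8") == word
```

Note that the ASCII letters survive the mistake untouched; only the non-ASCII character is mangled.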
ASCII encoding is 7-bit, but in practice, characters encoded in ASCII are not stored in groups of 7 bits. Instead, each ASCII character is stored in a byte, with the MSB usually set to 0 (yes, it's wasted in ASCII).
You can verify this by typing a string of ASCII characters into a text editor, setting the encoding to ASCII, and viewing the binary/hex.
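The same check can be done without a hex editor. This is a small Python sketch with an arbitrary sample string; `bytes.hex` with a separator argument needs Python 3.8 or later.

```python
# Look at the bytes of an ASCII string and confirm the top bit is always 0.
text = "Hello"
raw = text.encode("ascii")

hex_view = raw.hex(" ")                       # '48 65 6c 6c 6f'
bit_view = [format(b, "08b") for b in raw]    # each byte as 8 binary digits
msb_all_zero = all((b & 0x80) == 0 for b in raw)

print(hex_view)
print(bit_view)
print(msb_all_zero)
```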
Aside: the use of (strictly) ASCII encoding is now uncommon, in favor of UTF-8 (which does not waste the MSB mentioned above - in fact, an MSB of 1 indicates the code point is encoded with more than 1 byte).
The original ASCII code provided 128 different characters numbered 0 to 127. ASCII and 7-bit are synonymous. Since the 8-bit byte is the common storage element, ASCII leaves room for 128 additional characters, which various extensions use for other languages and symbols.
But the 7-bit code was originally defined before 8-bit bytes became standard. ASCII stands for American Standard Code for Information Interchange.
Early Internet mail systems supported only 7-bit ASCII codes.
To send programs and multimedia files, which use all 8 bits of each byte, over such systems, the data must first be turned into a 7-bit format using encoding methods such as MIME, uuencoding, and BinHex. This means that the 8-bit data is converted to 7-bit characters, which adds extra bytes.
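To see the size overhead of squeezing 8-bit data through a 7-bit channel, here is a sketch using Base64, the transfer encoding that MIME commonly uses. The input bytes are made up for the example.

```python
import base64

# 128 bytes that all have the high bit set, so none of them is valid 7-bit ASCII.
binary = bytes(range(128, 256))

encoded = base64.b64encode(binary)

# Every encoded byte is plain ASCII, at the cost of roughly one third more bytes.
print(len(binary), len(encoded))            # 128 172
print(all(b < 0x80 for b in encoded))       # True

# The conversion is lossless.
assert base64.b64decode(encoded) == binary
```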
When we call ASCII a 7-bit code, the left-most bit of the byte is left at 0, so the remaining 7 bits can hold values from 0 up to 127.
If the byte is read as a signed 8-bit integer (range -128 to 127), that left-most bit is the sign bit, so every ASCII code is a non-negative value. The values 128 to 255 only appear when the byte is read as unsigned, and they are not part of ASCII.

How to create UTF-16 animation in Twitter?

I use a UTF-16 character picker to create ASCII art in a textbox in HTML, and UTF-16 characters are supported and visible "as is". Now I need to process such ASCII art and save it into an Array as UTF-16 characters, then process it with JavaScript as Strings to build ASCII art animations for Twitter like this:
You don't have to be sorry.
Twitter accepts UTF-16 as ASCIIart
For UTF-16 definition go to Wikipedia
http://en.wikipedia.org/wiki/UTF-16
UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064 numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.
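The "one or two 16-bit code units" part of that definition can be checked with a short Python sketch. The helper name `utf16_code_units` is hypothetical, made up for this example.

```python
def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for one character."""
    # "utf-16-le" avoids the byte order mark that plain "utf-16" would add.
    return len(ch.encode("utf-16-le")) // 2

print(utf16_code_units("A"))            # 1
print(utf16_code_units("\u2588"))       # 1  (a block-drawing character)
print(utf16_code_units("\U0001F600"))   # 2  (outside the BMP: a surrogate pair)

# The 1,112,064 figure is the code space minus the 2,048 surrogate values.
assert 0x110000 - 0x800 == 1_112_064
```

So a picker that assumes exactly 2 bytes per character only covers the characters that fit in a single code unit.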
I already built a 2-byte Unicode (UTF-16) picker and can generate UTF-16 input for Twitter.
==
re:
Removed the link as it's pointing to a Twitter account which doesn't show the mentioned content anymore (w/o scrolling). May appear like spam then. – david Nov 20 at 4:09
That way it may take a much longer time to get the right answer.
UTF-16 is a character encoding. Twitter only accepts UTF-8 as input. You can convert UTF-16 to UTF-8 without any data loss, so just do that and then send it to Twitter.
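The conversion itself is a decode followed by an encode. Here is a minimal Python sketch with made-up sample art; how the result is then sent to Twitter is outside its scope.

```python
# Some block-drawing "ASCII art" stored as UTF-16 (Python adds a byte order mark).
art = "\u2588\u2592\u2591 hi \u2591\u2592\u2588"
utf16_data = art.encode("utf-16")

# Step 1: decode the UTF-16 bytes back into text (code points).
text = utf16_data.decode("utf-16")

# Step 2: encode the same text as UTF-8.
utf8_data = text.encode("utf-8")

# Nothing is lost on the way: both byte strings describe the same characters.
assert utf8_data.decode("utf-8") == art
print(len(utf16_data), len(utf8_data))
```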

How do low-level character encodings work?

Let's say I have a text file called sometext.txt.
It has a line, "Sic semper tyrannis", which is (correct me if I'm wrong):
83 105 99 32 115 101 109 112 101 114 32 116 121
114 97 110 110 105 115
(in decimal ASCII)
When I read this line from the file using standard library file I/O routines, I don't perform any character encoding work (or do I?).
The question is:
Which software component actually converts 0s and 1s into characters (i.e. contains the algorithm for converting 0s and 1s into characters)? Is it an OS component? Which one?
It's all a bunch of 1's and 0's.
An ASCII "A" is just the letter displayed when the value (01000001b, or 0x41 or 65 dec) is "encountered" (depend on context, naturally). There is no "conversion"; it's just a different view of the same thing defined by an accepted mapping.
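The "different view of the same thing" idea can be shown in a few lines of Python. This is just an illustration using the line from the question.

```python
# The same value written four ways: character, decimal, hex, binary.
value = ord("A")
assert value == 65 == 0x41 == 0b01000001

# The line from the question, viewed as numbers and then as text again.
line = "Sic semper tyrannis"
codes = list(line.encode("ascii"))
print(codes)

# Going back needs no computation beyond the agreed mapping.
assert bytes(codes).decode("ascii") == line
```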
Unicode (and other multi-byte) character sets often use different encodings; in UTF-8 (a Unicode encoding), for instance, a single Unicode character can be mapped as 1, 2, 3 or 4 bytes depending upon the character. Unicode encoding conversion often takes place in the IO libraries that come as part of a language or runtime; however, a Unicode-aware operating system also needs to understand a Unicode encoding itself (in system calls) so the line can be blurred.
UTF-8 has the nice property that all normal ASCII characters map to a single byte which makes it the most compatible Unicode encoding with traditional ASCII.
First, I recommend that you read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
When I read this line from the file using standard library file I/O routines, I don't perform any character encoding work (or do I?).
That depends heavily on which standard library you mean.
In C, when you write:
FILE* f = fopen("filename.txt", "w");
fputs("Sic semper tyrannis", f);
No encoding conversion is performed; the chars in the string are just written to the file as-is (except for line breaks). (Encoding is relevant when you're editing the source file.)
But in Python 3.x, when you write:
f = open('filename.txt', 'w', encoding='UTF-8')
f.write('Sic semper tyrannis')
The write function converts from the internal Unicode representation of the Python str type to the UTF-8 encoding used on disk.
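One way to watch that conversion happen is to write text through a text-mode file and then read the raw bytes back in binary mode. This is a sketch: the file name and the sample text are assumptions, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

text = "caf\u00e9"  # 4 characters, the last one is not ASCII

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Text mode: Python encodes the str to UTF-8 while writing.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: no decoding, we get the bytes exactly as stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(len(text), len(raw))  # 4 5
print(raw)                  # b'caf\xc3\xa9'
```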
The question is: Which software component actually converts 0s and 1s into characters (i.e. contains the algorithm for converting 0s and 1s into characters)? Is it an OS component? Which one?
The decoding function (like MultiByteToWideChar or bytes.decode) for the appropriate character encoding converts the bytes into Unicode code points, which are integers that uniquely identify characters. A font converts code points to glyphs, the images of the characters that appear on screen or paper.
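As a small illustration of the first half of that pipeline (bytes to code points), here is a Python sketch with made-up input bytes. The font step, code points to glyphs, happens in the rendering layer and is not shown.

```python
# Raw bytes as they might come from a file or socket.
raw = b"caf\xc3\xa9"

# Decoding turns bytes into a string of Unicode code points.
text = raw.decode("utf-8")
code_points = [ord(ch) for ch in text]

print(code_points)                          # [99, 97, 102, 233]
print([f"U+{cp:04X}" for cp in code_points])

# Five bytes became four code points: C3 A9 is a single character in UTF-8.
assert len(raw) == 5 and len(code_points) == 4
```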
Which software component actually converts 0s and 1s into characters(i.e. contains algorithm for converting 0s and 1s into characters)?
This depends on what language you're using. For example, Python has character encoding functions:
>>> f = open( ...., 'rb')
>>> data = f.read()
>>> data.decode('utf-8')
'café'
Here, Python has converted a sequence of bytes into a Unicode string. The exact component is typically a library or program in userspace, but some compilers need knowledge of character encodings.
Underneath, it's all a sequence of bytes, which are 1s and 0s. However, given a sequence of bytes, which characters do these represent? ASCII is one such "character encoding", and tells us how to encode or decode A-Z, a-z, and a few more. There are many others, notably UTF-8 (an encoding of Unicode). In the end, if you're dealing with text, you need to know what character encoding it is encoded with.
Like DrStrangeLove says, it's 1's & 0's all the way to your display screen and beyond - the 'A' character is an array of pixels whose color/brightness is defined by bits in the display driver. Turning that pixel array into an understandable character needs a bioElectroChemical video camera connected to 10^11 threshold logic gates running an adaptive, massively-parallel OS and apps that no-one understands, especially after a few beers
Not exactly sure what you're asking. The 0's and 1's from the file are blocked up into the bytes that can represent ASCII codes by the disk driver - it will only read/write blocks of eight bits. The ASCII code bytes are rendered into displayable bitmaps by the display driver using the chosen font.
Rgds,
Martin
It has nothing (well, not so much) to do with 0s and 1s. Most character encodings work with entire bytes of 8 bits. Each of the numbers you wrote represents a single byte. In ASCII, every character is a single byte. Besides that, ASCII is a subset of ANSI and UTF-8, making it compatible with the most widely used character sets. ASCII covers only the first half of the byte range: characters up to 127.
For ANSI you need some encoding. ANSI specifies the characters in the upper half of the byte range. In UTF-8, these single-byte ANSI characters don't exist. Instead, each of these upper 128 byte values represents part of a character, and a whole character is made of 2 to 4 such bytes. The exception is the 128 ASCII characters, which are still the same old single-byte characters. I think this is mainly done because if UTF-8 weren't compatible with ASCII, there is no way Americans would have adopted it. ;-)
But yes, the OS does have various functions to work with character encodings. Where they are depends on the OS and platform, but if I read your question right, you're not really looking for some specific API. Your question cannot be answered that concretely. There are numerous ways to work with characters, and there is a major difference between working with the actual character data and writing it to the screen (the difference between a character and a font).

What's the proper technical term for "high ascii" characters?

What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.
Often they're called diacritics, accented letters, sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.
What is the correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?
"Non-ASCII characters"
ASCII character codes above 127 are not defined. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, and others chose other characters.
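A quick way to see this vendor divergence is to decode the same byte under two different 8-bit code pages. The sketch below uses Python's built-in "cp437" (the original IBM PC code page) and "cp1252" (Windows Western European) codecs; the byte value is an arbitrary pick.

```python
# One byte above 127, interpreted under two different vendor code pages.
byte = bytes([0x82])

as_cp437 = byte.decode("cp437")     # e with acute accent on the IBM PC code page
as_cp1252 = byte.decode("cp1252")   # a low quotation mark on Windows-1252

print(repr(as_cp437), repr(as_cp1252))

# Strict ASCII has no meaning for this byte at all.
try:
    byte.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False
print(is_ascii)  # False
```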
Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional western alphabets, but Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other languages both modern and ancient.
There are several encodings of Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same in both ASCII and UTF-8.
That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or another Unicode encoding) or they can be a custom implementation by a hardware or software supplier.
You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.
A bit sequence that doesn't represent an ASCII character is not definitively a Unicode character.
Depending on the character encoding you're using, it could be either:
an invalid bit sequence
a Unicode character
an ISO-8859-x character
a Microsoft 1252 character
a character in some other character encoding
a bug, binary data, etc
The one definition that would fit all of these situations is:
Not an ASCII character
To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
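The cases listed above can be told apart only partially from the bytes alone. Here is a rough Python sketch of that idea; the function name `describe` and its labels are hypothetical, and it only distinguishes ASCII, valid UTF-8, and everything else.

```python
def describe(data: bytes) -> str:
    """Give a coarse label for a byte sequence (illustrative only)."""
    if all(b < 0x80 for b in data):
        return "ASCII"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Could be ISO-8859-x, Windows-1252, another encoding, or binary data.
        return "non-ASCII, not valid UTF-8"
    return "non-ASCII, valid UTF-8"

print(describe(b"plain text"))               # ASCII
print(describe("\u00e9".encode("utf-8")))    # non-ASCII, valid UTF-8
print(describe(b"\xe9"))                     # non-ASCII, not valid UTF-8
```

The last case shows why "non-ASCII" is the safe umbrella term: the byte 0xE9 may be a perfectly good character in a single-byte encoding, but nothing in the byte itself says which one.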
"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".
Unicode is one possible set of Extended ASCII characters, and is quite, quite large.
UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.
These words are taken from an online resource (a cool website, though) because I found them useful and appropriate for writing an answer.
At first it only included capital letters and numbers, but in 1967 lowercase letters and some control characters were added, forming what is known as US-ASCII, i.e. the characters 0 through 127.
This set of only 128 characters was published in 1967 as a standard, containing all you need to write in the English language.
In 1981, IBM developed an 8-bit extension of the ASCII code, called "code page 437". In this version, some obsolete control characters were replaced with graphic characters. In addition, 128 characters were added, with new symbols, signs, graphics, and Latin letters, including the punctuation signs and characters needed to write texts in other languages, such as Spanish.
In this way, the ASCII characters ranging from 128 to 255 were added.
IBM included support for this code page in the hardware of its model 5150, known as the "IBM PC", considered the first personal computer.
The operating system of this model, MS-DOS, also used this extended ASCII code.
Non-ASCII Unicode characters.
If you say "High ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the Code Pages that defined particular characters represented by particular values. Any multibyte (> 255 decimal value) is not ASCII.

Resources