How to interpret Unicode encodings - character-encoding

I just finished reading the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.
I'd really appreciate clarification on this part of the article.
OK, so say we have a string: Hello which, in Unicode, corresponds to these five code points:
U+0048 U+0065 U+006C U+006C U+006F...That’s where encodings come in.
The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes
00 48 00 65 00 6C 00 6C 00 6F
Right? Not so fast! Couldn’t it also be:
48 00 65 00 6C 00 6C 00 6F 00 ?
Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
My questions:
Why could the two zeros at the beginning of 0048 be moved to the end?
What are FE FF and FF FE, what's the difference between them, and how were they used? (Yes, I tried googling those terms, but I'm still confused.)
Why did he then say "Phew. Not every Unicode string in the wild has a byte order mark at the beginning."?
Also, I'd appreciate any recommended resources to learn more about this stuff.

Summary: the 0xFEFF (byte-order mark) character is used to solve the endianness problem for some character encodings. However, most of today's character encodings are not prone to the endianness problem, and thus the byte-order mark is not really relevant for today.
Why could the two zeros at the beginning of 0048 be moved to the end?
If two bytes are used for all characters, then each character is saved in a 2-byte data structure in the memory of the computer. Bytes (groups of 8 bits) are the basic addressable units in most computer memories, and each byte has its own address. On systems that use the big-endian format, the character 0x0048 would be saved in two 1-byte memory cells in the following way:
  n    n+1
+----+----+
| 00 | 48 |
+----+----+
Here, n and n+1 are the addresses of the memory cells. Thus, on big-endian systems, the most significant byte is stored in the lowest memory address of the data structure.
On a little-endian system, on the other hand, the character 0x0048 would be stored in the following way:
  n    n+1
+----+----+
| 48 | 00 |
+----+----+
Thus, on little-endian systems, the least significant byte is stored in the lowest memory address of the data structure.
So, if a big-endian system sends you the character 0x0048 (for example, over the network), it sends you the byte sequence 00 48. On the other hand, if a little-endian system sends you the character 0x0048, it sends you the byte sequence 48 00.
So, if you receive a byte sequence like 00 48, and you know that it represents a 16-bit character, you need to know whether the sender was a big-endian or little-endian system. In the first case, 00 48 would mean the character 0x0048, in the second case, 00 48 would mean the totally different character 0x4800.
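To make this concrete, here is a minimal Python sketch (assuming Python 3.8+ for bytes.hex with a separator) that packs the 16-bit value 0x0048 with an explicit byte order and then unpacks the byte sequence 00 48 under both assumptions:

# Sketch: the same 16-bit code unit 0x0048 becomes two different byte
# sequences depending on byte order.
import struct

code_unit = 0x0048                      # 'H' as a 16-bit value

big    = struct.pack('>H', code_unit)   # big-endian:    00 48
little = struct.pack('<H', code_unit)   # little-endian: 48 00
print(big.hex(' '), '|', little.hex(' '))

# Reading the bytes 00 48 under the wrong assumption gives a different character:
print(hex(struct.unpack('>H', b'\x00\x48')[0]))  # 0x48   -> U+0048 'H'
print(hex(struct.unpack('<H', b'\x00\x48')[0]))  # 0x4800 -> a completely different code point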
This is where the FE FF sequence comes in.
What are FE FF and FF FE, what's the difference between them, and how were they used?
U+FEFF is the Unicode byte-order mark (BOM), and in our example of a 2-byte encoding, this would be the 16-bit character 0xFEFF.
The convention is that all systems (big-endian and little-endian) save the character 0xFEFF as the first character of any text stream. Now, on a big-endian system, this character is represented as the byte sequence FE FF (assume memory addresses increasing from left to right), whereas on a little-endian system, it is represented as FF FE.
Now, if you read a text stream that has been created following this convention, you know that the first character must be 0xFEFF. So, if the first two bytes of the text stream are FE FF, you know that this text stream has been created by a big-endian system. On the other hand, if the first two bytes are FF FE, you know that the text stream has been created by a little-endian system. In either case, you can now correctly interpret all the 2-byte characters of the stream.
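As a small sketch in Python, the codecs module exposes the BOM constants, so you can watch this convention in action:

# Sketch of the BOM convention using Python's built-in codecs module.
import codecs

text = 'Hello'

# The generic 'utf-16' codec writes a BOM and uses the machine's native byte order:
data = text.encode('utf-16')
print(data[:2].hex(' '))          # 'ff fe' on little-endian machines, 'fe ff' on big-endian

# A reader inspects the first two bytes to pick the right decoder:
if data.startswith(codecs.BOM_UTF16_BE):      # b'\xfe\xff'
    decoded = data[2:].decode('utf-16-be')
elif data.startswith(codecs.BOM_UTF16_LE):    # b'\xff\xfe'
    decoded = data[2:].decode('utf-16-le')
else:
    decoded = data.decode('utf-16-le')        # no BOM: we have to guess
print(decoded)                                # 'Hello'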
Why did he then say "Phew. Not every Unicode string in the wild has a byte order mark at the beginning."?
Placing the byte-order mark (BOM) character 0xFEFF at the beginning of each text stream is just a convention, and not all systems may follow it. So, if the BOM is missing, you have the problem of not knowing whether to interpret the 2-byte characters as big-endian or little-endian.
Also, I'd appreciate any recommended resources to learn more about this stuff.
https://en.wikipedia.org/wiki/Endianness
https://en.wikipedia.org/wiki/Unicode
https://en.wikibooks.org/wiki/Unicode/Character_reference
https://en.wikipedia.org/wiki/Byte_order_mark
https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
Notes:
Today, the most widely used Unicode-compatible encoding is UTF-8. UTF-8 has been designed to avoid the endianness problem; thus, the entire byte-order mark 0xFEFF business is not relevant for UTF-8.
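For illustration, a short Python sketch of why UTF-8 sidesteps the problem: the encoded bytes are identical on every machine, and the optional UTF-8 signature only marks the encoding, not a byte order.

# Sketch: UTF-8 output is a plain byte stream with no byte-order variants.
import codecs

print('Hello'.encode('utf-8').hex(' '))      # 48 65 6c 6c 6f on every platform

# A UTF-8 "BOM" (really just the encoded U+FEFF) exists, but it only marks
# the encoding, not a byte order:
print(codecs.BOM_UTF8.hex(' '))              # ef bb bf
print('Hello'.encode('utf-8-sig').hex(' '))  # ef bb bf 48 65 6c 6c 6f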
The byte-order mark is, however, relevant to the other Unicode-compatible encodings UTF-16 and UTF-32, which are prone to the endianness problem. If you browse through the list of available encodings, for example in the settings of a text editor or terminal, you will see that there are big-endian and little-endian versions of UTF-16 and UTF-32, typically called UTF-16BE and UTF-16LE, or UTF-32BE and UTF-32LE, respectively. However, UTF-16 and UTF-32 are rarely used as on-disk or interchange formats in practice.
Other popular encodings used today include those from the ISO 8859 series, such as ISO 8859-1 (and the derived Windows-1252), known as Latin-1, as well as plain ASCII. However, all of these are single-byte encodings: each character is encoded to 1 byte and saved in a 1-byte data structure. Thus, the endianness problem doesn't apply, and the byte-order mark story is not relevant for these cases either.
All in all, the endianness problem for character encodings that you were struggling to understand is thus mostly of historical interest and not particularly relevant in today's world anymore.

This is all to do with the internal storage of data in the computer's memory - in this example (00 48), some computers will store the most significant byte first and the least significant byte second (known as big-endian), and others will store the least significant byte first (little-endian). So, depending on your computer, when you read the bytes out of memory you'll get either the 00 first or the 48 first, and you need to know which way round it's going to be to interpret the bytes correctly. For a more in-depth introduction to the topic, see Endianness on Wikipedia (https://en.wikipedia.org/wiki/Endianness).
These days, most compilers and interpreters will take care of this low-level stuff for you, so you will rarely (if ever) need to worry about it.

Related

x86 32bit Assembly Parser | logical problem

I'm currently working on an obfuscator for assembled x86 code (working with the raw bytes).
To do that, I first need to build a simple parser to "understand" the bytes.
I'm using a database that I created for myself, mostly with the website https://defuse.ca/online-x86-assembler.htm
Now my question:
Some bytes can be interpreted in two ways, for example (intel syntax):
1. f3 00 00 repz add BYTE PTR [eax],al
2. f3 repz
My idea was to loop through the bytes and handle every instruction on its own,
but when I reach the byte 0xf3 I have two ways of interpreting it.
I know there are working x86 disassemblers out there, so how do they know which case it is?
Prefixes, including the repz prefix, are not meaningful without the subsequent instruction. The subsequent instruction may incorporate the prefix (repz nop is pause), change its meaning (repz acts as xrelease if used before certain interlocked instructions), or the prefix may simply be invalid.
The decoding is always unambiguous; otherwise the CPU could not execute instructions. It can be ambiguous only if you don't know the exact byte offset at which to begin decoding (since x86 uses variable-length instructions).
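To illustrate the point, here is a heavily simplified Python sketch; the opcode handling is hypothetical and covers only the bytes from the example, so treat it as a sketch of the prefix-handling idea rather than a real disassembler:

# Sketch: a prefix byte such as 0xF3 is never decoded on its own; it is
# collected and attached to the instruction that follows it.
PREFIXES = {0xF3: 'repz', 0xF2: 'repnz', 0x66: 'opsize', 0x67: 'addrsize'}

def decode_one(code, pos):
    """Return (text, next_pos) for the instruction starting at byte offset pos."""
    prefixes = []
    while code[pos] in PREFIXES:              # 1. consume any prefix bytes first
        prefixes.append(PREFIXES[code[pos]])
        pos += 1
    opcode = code[pos]                        # 2. only then decode the opcode
    if opcode == 0x00:                        # 00 /r -> add r/m8, r8 (ModRM byte follows)
        mnemonic = 'add BYTE PTR [eax], al'   # simplified: ModRM 00 assumed, no SIB/disp
        pos += 2
    elif opcode == 0x90:
        if 'repz' in prefixes:                # F3 90 is the 'pause' instruction
            prefixes.remove('repz')
            mnemonic = 'pause'
        else:
            mnemonic = 'nop'
        pos += 1
    else:
        raise NotImplementedError(hex(opcode))
    text = ' '.join(p for p in prefixes if p in ('repz', 'repnz')) + ' ' + mnemonic
    return text.strip(), pos

print(decode_one(bytes([0xF3, 0x00, 0x00]), 0))  # ('repz add BYTE PTR [eax], al', 3)
print(decode_one(bytes([0xF3, 0x90]), 0))        # ('pause', 2)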

Is ASCII code in matter of fact 7 bit or 8 bit?

My teacher told me ASCII is an 8-bit character coding scheme. But it is defined only for codes 0-127, which fit into 7 bits. So can't it be argued that ASCII is actually a 7-bit code?
And what do we mean when we say ASCII is an 8-bit code?
ASCII was indeed originally conceived as a 7-bit code. This was done well before 8-bit bytes became ubiquitous, and even into the 1990s you could find software that assumed it could use the 8th bit of each byte of text for its own purposes ("not 8-bit clean"). Nowadays people think of it as an 8-bit coding in which bytes 0x80 through 0xFF have no defined meaning, but that's a retcon.
There are dozens of text encodings that make use of the 8th bit; they can be classified as ASCII-compatible or not, and fixed- or variable-width. ASCII-compatible means that regardless of context, single bytes with values from 0x00 through 0x7F encode the same characters that they would in ASCII. You don't want to have anything to do with a non-ASCII-compatible text encoding if you can possibly avoid it; naive programs expecting ASCII tend to misinterpret them in catastrophic, often security-breaking fashion. They are so deprecated nowadays that (for instance) HTML5 forbids their use on the public Web, with the unfortunate exception of UTF-16. I'm not going to talk about them any more.
A fixed-width encoding means what it sounds like: all characters are encoded using the same number of bytes. To be ASCII-compatible, a fixed-width encoding must encode all its characters using only one byte, so it can have no more than 256 characters. The most common such encoding nowadays is Windows-1252, an extension of ISO 8859-1.
There's only one variable-width ASCII-compatible encoding worth knowing about nowadays, but it's very important: UTF-8, which packs all of Unicode into an ASCII-compatible encoding. You really want to be using this if you can manage it.
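As a quick sketch of what ASCII-compatibility means in practice (Python, using the codec names 'ascii', 'utf-8', and 'cp1252'):

# Sketch: ASCII-compatible encodings produce identical bytes for pure-ASCII text.
text = 'Hello, world!'
assert text.encode('ascii') == text.encode('utf-8') == text.encode('cp1252')

# Outside the ASCII range they differ: Windows-1252 is fixed-width (1 byte),
# UTF-8 is variable-width (2 bytes here).
print('é'.encode('cp1252').hex())   # e9
print('é'.encode('utf-8').hex())    # c3a9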
As a final note, "ASCII" nowadays takes its practical definition from Unicode, not its original standard (ANSI X3.4-1968), because historically there were several dozen variations on the ASCII 127-character repertoire -- for instance, some of the punctuation might be replaced with accented letters to facilitate the transmission of French text. All of those variations are obsolete, and when people say "ASCII" they mean that the bytes with value 0x00 through 0x7F encode Unicode codepoints U+0000 through U+007F. This will probably only matter to you if you ever find yourself writing a technical standard.
If you're interested in the history of ASCII and the encodings that preceded it, start with the paper "The Evolution of Character Codes, 1874-1968" (samizdat copy at http://falsedoor.com/doc/ascii_evolution-of-character-codes.pdf) and then chase its references (many of which are not available online and may be hard to find even with access to a university library, I regret to say).
On Linux man ascii says:
ASCII is the American Standard Code for Information Interchange. It is a 7-bit code.
The original ASCII table is encoded on 7 bits, and therefore it has 128 characters.
Nowadays, most readers/editors use an "extended" ASCII table (typically ISO 8859-1 or Windows-1252), which is encoded on 8 bits and provides 256 characters (including Á, Ä, Œ, é, è and other characters useful for European languages, as well as mathematical glyphs and other symbols).
While UTF-8 uses the same encoding as the basic ASCII table (meaning 0x41 is A in both codes), it does not use the same byte values for characters beyond ASCII, such as those in the Latin-1 Supplement block. This sometimes causes weird characters to appear in words like à la carte or piñata.
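A short sketch of that effect, decoding UTF-8 bytes with the wrong (Latin-1) decoder:

# Sketch of the "weird characters" effect: UTF-8 bytes mistakenly read as Latin-1.
s = 'piñata'
utf8_bytes = s.encode('utf-8')          # ñ (U+00F1) becomes the two bytes c3 b1
print(utf8_bytes.decode('latin-1'))     # 'piÃ±ata' <- each byte read as its own character
print(utf8_bytes.decode('utf-8'))       # 'piñata'  <- correct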
ASCII encoding is 7-bit, but in practice, characters encoded in ASCII are not stored in groups of 7 bits. Instead, each ASCII character is stored in a byte, with the MSB usually set to 0 (yes, that bit is wasted in ASCII).
You can verify this by typing a string in the ASCII character set in a text editor, setting the encoding to ASCII, and viewing the binary/hex representation.
Aside: the use of (strictly) ASCII encoding is now uncommon, in favor of UTF-8 (which does not waste the MSB mentioned above - in fact, an MSB of 1 indicates the code point is encoded with more than 1 byte).
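As a rough sketch of what you would see (using Python's bit formatting instead of a hex viewer):

# Sketch: an ASCII character occupies one byte with the most significant bit 0,
# while every byte of a multi-byte UTF-8 sequence has its MSB set to 1.
print(format(ord('A'), '08b'))                          # 01000001
print([format(b, '08b') for b in 'é'.encode('utf-8')])  # ['11000011', '10101001']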
The original ASCII code provided 128 different characters numbered 0 to 127. ASCII and 7-bit are synonymous. Since the 8-bit byte is the common storage element, ASCII leaves room for 128 additional characters which are used for foreign languages and other symbols.
But the 7-bit code was originally created before 8-bit bytes became standard. ASCII stands for American Standard Code for Information Interchange.
Early Internet mail systems supported only 7-bit ASCII codes.
To send programs and multimedia files over such systems, their 8-bit bytes first had to be turned into a 7-bit format using coding methods such as MIME, uuencoding and BinHex. Converting 8-bit data to 7-bit characters adds extra bytes to encode them.
When we call ASCII a 7-bit code, we mean that the code points 0 to 127 cover the full range representable with 7 bits. The eighth bit of the byte is simply not needed; it is left as 0 (or was historically used for parity), not as a sign bit.

Discover the character encoding from byte

I have a string where I know that the degree symbol (°) is represented by the byte 63 (3F).
Each character is represented by a single byte.
How can I find out which character encoding was used?
Almost all 8-bit encodings in modern times coincide with ASCII in the ASCII range, so the byte 3F hexadecimal is the question mark "?". As Sebtm's comment suggests, this might result from a character-level data error. E.g., some software that is limited to ASCII could turn all other bytes into "?" – not a good practice, but possible.
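A small sketch of such a lossy conversion (Python's 'replace' error handler substitutes '?', byte 0x3F, for anything outside ASCII):

# Sketch of how a degree sign can end up as byte 3F: lossy conversion to ASCII.
data = '20°C'.encode('ascii', errors='replace')
print(data)           # b'20?C'
print(data.hex(' '))  # 32 30 3f 43 -> the degree sign became 3F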
If it were a non-ASCII byte, you could use the page http://www.eki.ee/letter/chardata.cgi?search=degree+sign to make a guess.

Translating memory contents into a string via ASCII encoding

I have to translate some memory contents into a string, using ASCII encoding. For example:
0x6A636162
But I am not sure how to break that up to translate it into ASCII. I think it has something to do with how many bits are in a char/digit, but I am not sure how to go about doing so (and of course, I would like to know more of the reasoning behind it, not just "how to do it").
ASCII uses 7 bits to encode a character (http://en.wikipedia.org/wiki/ASCII). However, it's common to encode characters using 8 bits instead (note that technically this isn't ASCII). Thus, you'd need to split your data into 8-bit chunks and match that to an ASCII table.
If you're using a specific programming language, it may have a way to translate an ASCII code to a character. For instance, Ruby has the .chr method, Python has the chr() built-in function, and in C you can printf("%c", number).
Note that each nibble (4 bits) can be represented as one hexadecimal digit, so for the sample string you show, each 8-bit "chunk" would be:
0x6A
0x63
0x61
0x62
the string reads "jcab" :)
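A minimal Python sketch of the same conversion (assuming the value was dumped in big-endian order; a little-endian dump would need byteorder='little'):

# Sketch: splitting the 32-bit value into bytes and decoding them as ASCII.
value = 0x6A636162
data = value.to_bytes(4, byteorder='big')   # 6A 63 61 62
print(data.hex(' '))                        # 6a 63 61 62
print(data.decode('ascii'))                 # jcab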

How do low-level character encodings work?

Let's say I have a text file called sometext.txt.
It has a line - "Sic semper tyrannis" - which is (correct me if I'm wrong):
83 105 99 32 115 101 109 112 101 114 32 116 121
114 97 110 110 105 115
(in decimal ASCII)
When I read this line from the file using standard library file I/O routines, I don't perform any character encoding work... (or do I?)
The question is:
Which software component actually converts 0s and 1s into characters (i.e., which one contains the algorithm for converting 0s and 1s into characters)? Is it an OS component? Which one?
It's all a bunch of 1's and 0's.
An ASCII "A" is just the letter displayed when the value (01000001b, or 0x41, or 65 decimal) is "encountered" (depending on context, naturally). There is no "conversion"; it's just a different view of the same thing defined by an accepted mapping.
Unicode (and other multi-byte) character sets often use different encodings; in UTF-8 (a Unicode encoding), for instance, a single Unicode character can be mapped as 1, 2, 3 or 4 bytes depending upon the character. Unicode encoding conversion often takes place in the IO libraries that come as part of a language or runtime; however, a Unicode-aware operating system also needs to understand a Unicode encoding itself (in system calls) so the line can be blurred.
UTF-8 has the nice property that all normal ASCII characters map to a single byte, which makes it the Unicode encoding most compatible with traditional ASCII.
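A quick sketch of that variable-width behaviour in Python:

# Sketch: UTF-8 maps a code point to 1-4 bytes depending on its value.
for ch in ('A', 'é', '€', '😀'):
    print(ch, len(ch.encode('utf-8')), ch.encode('utf-8').hex(' '))
# A 1 41
# é 2 c3 a9
# € 3 e2 82 ac
# 😀 4 f0 9f 98 80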
First, I recommend that you read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
When I read this line from the file using standard library file I/O routines, I don't perform any character encoding work... (or do I?)
That depends heavily on which standard library you mean.
In C, when you write:
FILE* f = fopen("filename.txt", "w");
fputs("Sic semper tyrannis", f);
No encoding conversion is performed; the chars in the string are just written to the file as-is (except for line breaks). (Encoding is relevant when you're editing the source file.)
But in Python 3.x, when you write:
f = open('filename.txt', 'w', encoding='UTF-8')
f.write('Sic semper tyrannis')
The write function performs an internal conversion from Python's internal str representation to the UTF-8 encoding used on disk.
The question is: Which software component actually converts 0s and 1s into characters (i.e., which one contains the algorithm for converting 0s and 1s into characters)? Is it an OS component? Which one?
The decoding function (like MultiByteToWideChar or bytes.decode) for the appropriate character encoding converts the bytes into Unicode code points, which are integers that uniquely identify characters. A font converts code points to glyphs, the images of the characters that appear on screen or paper.
Which software component actually converts 0s and 1s into characters (i.e., which one contains the algorithm for converting 0s and 1s into characters)?
This depends on what language you're using. For example, Python has character encoding functions:
>>> f = open( ...., 'rb')
>>> data = f.read()
>>> data.decode('utf-8')
u'café'
Here, Python has converted a sequence of bytes into a Unicode string. The exact component is typically a library or program in userspace, but some compilers need knowledge of character encodings.
Underneath, it's all a sequence of bytes, which are 1s and 0s. However, given a sequence of bytes, which characters do they represent? ASCII is one such "character encoding", and it tells us how to encode or decode A-Z, a-z, and a few more. There are many others, notably UTF-8 (an encoding of Unicode). In the end, if you're dealing with text, you need to know which character encoding it was encoded with.
Like DrStrangeLove says, it's 1's & 0's all the way to your display screen and beyond - the 'A' character is an array of pixels whose color/brightness is defined by bits in the display driver. Turning that pixel array into an understandable character needs a bioElectroChemical video camera connected to 10^11 threshold logic gates running an adaptive, massively-parallel OS and apps that no-one understands, especially after a few beers.
Not exactly sure what you're asking. The 0's and 1's from the file are blocked up into the bytes that can represent ASCII codes by the disk driver - it will only read/write blocks of eight bits. The ASCII code bytes are rendered into displayable bitmaps by the display driver using the chosen font.
It has nothing (well, not that much) to do with 0s and 1s. Most character encodings work with entire bytes of 8 bits. Each of the numbers you wrote represents a single byte. In ASCII, every character is a single byte. Besides that, ASCII is a subset of ANSI and of UTF-8, making it compatible with the most widely used character sets. ASCII covers only the first half of the byte range, characters up to 127.
For ANSI you need to know which encoding (code page) is in use; ANSI defines the characters in the upper half of the byte range. In UTF-8, those single-byte ANSI characters don't exist. Instead, bytes in the upper half represent parts of multi-byte characters: a whole non-ASCII character is made of 2 to 4 bytes. The 128 ASCII characters are still the same old single-byte characters. I think this was done mainly because if UTF-8 weren't compatible with ASCII, there is no way Americans would have adopted it. ;-)
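A small sketch of that difference for a byte in the upper half (0xE9, which is 'é' in Windows-1252 but only a lead byte in UTF-8):

# Sketch: an upper-half byte is a whole character in an "ANSI" code page,
# but only part of a multi-byte sequence in UTF-8.
b = bytes([0xE9])
print(b.decode('cp1252'))          # 'é' -> a complete character in Windows-1252
try:
    b.decode('utf-8')              # in UTF-8, 0xE9 alone is just the start
except UnicodeDecodeError as e:    # of a 3-byte sequence, so decoding fails
    print('invalid UTF-8:', e.reason)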
But yes, the OS does have various functions to work with character encodings. Where they live depends on the OS and platform, but if I read your question right, you're not really looking for a specific API. Your question can't be answered that concretely. There are numerous ways to work with characters, and there is a major difference between working with the actual character data and writing it to the screen (the difference between a character and a font).
