What is a code point and code space? - character-encoding

I was reading the Wikipedia article on code points, but I'm not sure I understand it correctly.
For example, the character encoding scheme ASCII comprises 128 code
points in the range 0hex to 7Fhex
So is 0hex a code point?
Also, I could not find anything on code space.
PS. If it's a duplicate please post a link in the comments and I'll remove the question.

A code point is a numerical value that identifies a single element/character in a specific coded character set. The sentence you quote means that ASCII has 128 possible symbols (only a part of those are printable characters), and each of them has an associated numerical value by which it can be identified/addressed: the code point.
For an alternative wording, check out this post by Joel and this summary by Oracle, which also introduces the concept of a code unit :)
To give you a real-world example of what code points are, consider the Unicode character SNOWMAN (☃): its code point (written in the Unicode syntax U+<code point in hex>) is U+2603.
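To see this concretely, here is a minimal Python sketch (ord and chr convert between a character and its code point):

>>> hex(ord("☃"))   # SNOWMAN -> its code point
'0x2603'
>>> chr(0x2603)      # code point -> character
'☃'
>>> hex(ord("A"))    # ASCII characters keep the same code points in Unicode
'0x41'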

The concepts are slightly more abstract than the traditional, pre-Unicode concepts.
Traditionally, a "code space" was more or less synonymous with "character range". A 7-bit encoding would have a code space from 0 through 127, an 8-bit encoding 0 through 255, a 16-bit encoding 0 through 65535. Unicode has a code space from 0 through 0x10FFFF, though parts of the code space are unpopulated.
Traditionally, a "code point" was more or less synonymous with "character code". Unicode abstracts away from the single "character code" mapping to emphasize that there is a more-complex relationship between a set of glyphs and a set of character codes, and that some code points (such as joining modifiers) do not encode individual glyphs as such. Superficially, U+0020 is still the same character as ASCII SPACE 0x20, but Unicode has a much richer set of well-defined attributes and relationships.
Unicode had to coin new terms for these concepts so as not to overload the traditional terms with extended meanings. A "code space" is a unique, well-defined concept, which is not exactly the same thing as an (implicitly contiguous, possibly fully populated) character range. A "code point" is a unique, well-defined concept, which is not exactly the same thing as a "character code" (which isn't even entirely well-defined in the first place; it has multiple ambiguous interpretations).
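For example, Python's chr() enforces the Unicode code space directly, so the boundary is easy to demonstrate:

>>> chr(0x10FFFF)    # the last code point in the Unicode code space
'\U0010ffff'
>>> chr(0x110000)    # one past the end: outside the code space
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)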

Related

Code which actually uses the "\a", "\f", and "\v" escape sequences?

If it weren't for the obligatory listings of escape sequences in textbooks, I'd be unaware of the ASCII characters denoted by the escape sequences "\a", "\f", and "\v" (denoting "bell", "form feed", and "vertical tab".) Or at least as unaware as I am about the esoteric ASCII control characters like "File Separator", "Data Link Escape" and "End of Medium" that fill up much of the first 32 slots in the ASCII character table.
Are there programs and systems which actually make heavy use of the more uncommon escape sequences given by the C language and its descendants? For example, if the new crop of programming languages were to drop the "\v" notation, would anyone care? Besides the obvious use in printing, is "\f" used? Beyond making a terminal beep, what about "\a"? Are these routinely supported?
You are conflating the escape sequence, which is used by the compiler, with the character, which is used by the program (and perhaps passed on to other programs).
Your question, though relevant, is a bit late. Java has already dropped '\v'. Funny, I never noticed. But not C# or JavaScript; never noticed that, either.
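For what it's worth, the characters themselves exist independently of any escape syntax; Python, which supports the same escapes, makes this easy to check:

>>> ord("\a"), ord("\f"), ord("\v")   # BEL, FF, VT as code points
(7, 12, 11)
>>> "\a" == "\x07"                     # the escape is just parser-level sugar
True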

Are code pages and code charts the same thing?

Based on what I have gathered so far from reading information available online:
character set is a bunch of characters that we want to use (like an interface)
character encoding is a method of encoding some character set (like an implementation)
What is the relationship between code charts and code pages and how do they fit into the overall context? I am not sure if these two terms are synonyms or if they are referring to distinct concepts.
Do code charts/code pages define character sets through large tables and also provide a method of encoding, making them a part of character encoding? Or, do they only define character sets and leave encoding implementation to another aspect? Additionally, is a locale simply a type of code chart/code page or is it a separate concept altogether?
In the majority of cases, character sets and character encodings are one and the same. For example, ISO-8859-1 defines the character set for Western Europe AND the encoding, using an 8-bit scheme.
See the specification for ISO-8859-1: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf, which includes the encoding implementation.
Unicode, on the other hand, separates the encoding from the character definitions, albeit within a bunch of related documents. In Unicode, just about all current and a good deal of historic characters, symbols and modifiers are mapped to a "code point" in the range 0 to 0x10FFFF. The encodings UTF-32, UTF-16 and UTF-8 are then documented separately, to define how a Unicode code point is encoded as bytes.
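A minimal Python sketch makes that separation visible: one code point, three different byte sequences depending on the encoding form:

>>> s = "\u00b0"             # DEGREE SIGN, code point U+00B0
>>> s.encode("utf-8")        # 2 bytes in UTF-8
b'\xc2\xb0'
>>> s.encode("utf-16-be")    # 2 bytes in UTF-16
b'\x00\xb0'
>>> s.encode("utf-32-be")    # 4 bytes in UTF-32
b'\x00\x00\x00\xb0'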

What ASCII character uses up the most memory?

I've been thinking about ASCII and memory lately and couldn't find a solid answer to this question.
When a script is compiled, do ASCII characters use up different amounts of memory? And if so: which ASCII character uses up the most memory?
ASCII is a fixed-width character encoding, with each character represented by 7 bits. So, to answer your question: the different ASCII characters all take the same amount of memory, regardless of the implementation.
Because of the way our processor architectures are designed, we typically store an ASCII character in a single byte (the reason being that aligned memory access is a lot faster than having to do bitwise operations; see tripleee's comment). This means that on common computing platforms any ASCII character typically takes up one byte of space.
In contrast to this are variable-width encodings such as UTF-8. For future readers who come across this page, it is worth noting that the ASCII characters 0 through 127 are represented by the same bytes in UTF-8 as in ASCII. This was done to maintain backwards compatibility. Therefore, in the context of UTF-8 encoding, the ASCII characters 0 through 127 take up less space than other UTF-8 characters.
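This is easy to verify in Python (a minimal sketch):

>>> "A".encode("ascii") == "A".encode("utf-8")   # same single byte
True
>>> len("°".encode("utf-8"))                      # U+00B0: 2 bytes
2
>>> len("☃".encode("utf-8"))                      # U+2603: 3 bytes
3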
Further, I haven't heard of a mainstream compiler/interpreter that compresses strings stored as ASCII characters. This would impose a runtime performance hit that many would find unacceptable, so such a space optimization is left to the user to perform.
The ASCII Wikipedia page has a good summary of the ASCII character set.
﷽ (U+FDFD, ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM) is probably the most space-consuming character. I'm not sure about the encoding, but it is a huge single character. It is called the "Basmala" and it means "In the name of Allah, the Most Gracious, the Most Merciful."
According to a Reddit user who has now deleted their account: “It's an Arabic ligature commonly used in Urdu. It was added so someone using an Urdu keyboard can type it easier.”
I love to use this in Discord raids, because imagine 2000 Basmala characters, vs 2000 regular characters. It fills their server up a LOT. Glad I could help.

What strategies are there for escaping character entities?

We are doing Natural Language Processing on a range of English language documents (mainly scientific) and run into problems in carrying non-ANSI characters through the various components. The documents may be "ASCII", UNICODE, PDF, or HTML. We cannot predict at this stage what tools will be in our chain or whether they will allow character encodings other than ANSI. Even ISO-Latin characters expressed in UNICODE will give problems (e.g. displaying incorrectly in browsers). We are likely to encounter a range of symbols including mathematical and Greek. We would like to "flatten" these into a text string which will survive multistep processing (including XML and regex tools) and then possibly reconstitute it in the last step (although it is the semantics rather than the typography we are concerned with so this is a minor concern).
I appreciate that there is no absolute answer - any escaping can clash in some cases - but I am looking for something along the lines of XML's <![CDATA[ ... ]]>, which will survive most non-recursive XML operations. Characters such as [ are bad, as they are common in regexes. So I'm wondering if there is a generally adopted approach, rather than inventing our own.
A typical example is the "degrees" symbol:
HTML Entity (decimal): &#176;
HTML Entity (hex): &#xB0;
HTML Entity (named): &deg;
How to type in Microsoft Windows: Alt +00B0, Alt 0176, or Alt 248
UTF-8 (hex): 0xC2 0xB0 (c2b0)
UTF-8 (binary): 11000010:10110000
UTF-16 (hex): 0x00B0 (00b0)
UTF-16 (decimal): 176
UTF-32 (hex): 0x000000B0 (00b0)
UTF-32 (decimal): 176
C/C++/Java source code: "\u00B0"
Python source code: u"\u00B0"
We are also likely to encounter TeX
$10\,^{\circ}{\rm C}$
or
\degree
so backslashes, curlies and dollars are a poor idea.
We could for example use markup like:
__deg__
__#176__
and this will probably work but I'd appreciate advice from those who have similar problems.
Update: I accept @MichaelB's insistence that we use UTF-8 throughout. I am worried that some of our tools may not conform, and if so I'll revisit this. Note that my original question is not well worded - read his answer and the link in it.
Get someone to do this who really understands character encodings. It looks like you don't, because you're not using the terminology correctly. Alternatively, read this.
Do not brew up your own escape scheme - it will cause you more problems than it will solve. Instead, normalize the various source encodings to UTF-8 (which is really just one such escape scheme, except efficient and standardized) and handle character encodings correctly. Perhaps use UTF-7 if you're really that scared of high bits.
In this day and age, not handling character encodings correctly is not acceptable. If a tool doesn't handle them correctly, abandon it - it is most likely very bad quality code in many other ways as well, and not worth the hassle of using.
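As a minimal sketch of that advice (the helper and the Latin-1 example are mine; you still have to know, or detect, each source's encoding):

import unicodedata

def to_utf8(raw, source_encoding):
    """Decode bytes from a known source encoding and re-encode as UTF-8."""
    text = raw.decode(source_encoding)         # bytes -> str (code points)
    text = unicodedata.normalize("NFC", text)  # optional: canonical composed form
    return text.encode("utf-8")                # str -> UTF-8 bytes

print(to_utf8(b"20\xb0C", "latin-1"))          # b'20\xc2\xb0C' (degree sign)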
Maybe I don't understand the problem correctly, but I would create a unique escape marker which is unlikely to be touched, and then use it to enclose the entity encoded as a base32 string.
Additionally, you can transmit the unique markers and their number along the chain through a separate channel, and check their presence and number at the end.
For example, something like:
the value of the temperature was 18 cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee
Your marker is a UUID, and the entity is &deg encoded in base32. You then pass along the marker cd48d8c50d7f40aeb6a164181b17feee. It cannot be corrupted (if it gets corrupted, your filters will probably corrupt anything made of letters and numbers anyway, but at least you can exclude corrupted markers because they are fixed length), and you can always recover the content by looking between the two markers.
Of course, if you have UUIDs in your documents, this could represent a problem, but since you are not transmitting them as authorized markers along the lateral channel, they won't be recognized as such (and in any case, what's in between won't validate as a base32 string anyway).
If you need to search for them, you can keep the UUID subdivision and use a proper regexp to spot the occurrences. Example:
>>> re.search(r"(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})(.*?)(\1)", s)
<_sre.SRE_Match object at 0x1003d31f8>
>>> _.groups()
('6d378205-1265-44e4-80b8-a47d1ceaad51', ' EZSGKZY= ', '6d378205-1265-44e4-80b8-a47d1ceaad51')
>>>
If you really need a specific "token" to test for, you can use a uuid1 with a fixed, well-defined node value:
>>> uuid.uuid1(node=0x1234567890)
UUID('bdcce554-e95d-11de-bd0f-001234567890')
>>> uuid.uuid1(node=0x1234567890)
UUID('c4c57a91-e95d-11de-90ca-001234567890')
>>>
You can use anything you prefer as the node; the UUID will still be unique, and you can test for its presence (although you can get false positives).
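For completeness, here is a minimal sketch of the whole round trip in Python, under the same assumptions (the helper names are mine, not from the answer; base32 comes from the standard base64 module):

import base64, re, uuid

marker = uuid.uuid4().hex   # one fixed marker per document, e.g. 'cd48d8c5...'

def escape(entity):
    """Wrap the base32-encoded entity between two copies of the marker."""
    encoded = base64.b32encode(entity.encode("ascii")).decode("ascii")
    return f"{marker} {encoded} {marker}"

def unescape(text):
    """Recover the original entities from between marker pairs."""
    pattern = rf"{marker} (\S+) {marker}"
    return re.sub(pattern, lambda m: base64.b32decode(m.group(1)).decode("ascii"), text)

s = "the value of the temperature was 18 " + escape("&deg")
print(s)            # ...was 18 <marker> EZSGKZY= <marker>
print(unescape(s))  # the value of the temperature was 18 &deg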

What's the proper technical term for "high ascii" characters?

What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.
Often they're called diacritics, accented letters, sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.
What is a correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?
"Non-ASCII characters"
ASCII character codes above 127 are not defined. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, others chose other characters.
Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional Western alphabets, but Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other languages, both modern and ancient.
There are several encodings of Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same for both ASCII and UTF-8.
That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or another Unicode encoding), or they can be from a custom character set defined by a hardware or software supplier.
You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.
A bit sequence that doesn't represent an ASCII character is not necessarily a Unicode character.
Depending on the character encoding you're using, it could be either:
an invalid bit sequence
a Unicode character
an ISO-8859-x character
a Microsoft 1252 character
a character in some other character encoding
a bug, binary data, etc
The one definition that would fit all of these situations is:
Not an ASCII character
To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
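A small Python sketch of that ambiguity: the same byte can be a control character, a printable character, or not a character at all, depending on the assumed encoding:

>>> b"\x80".decode("latin-1")    # ISO-8859-1: control character U+0080
'\x80'
>>> b"\x80".decode("cp1252")     # Windows-1252: the euro sign
'€'
>>> b"\x80".decode("utf-8")      # UTF-8: not a valid sequence at all
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte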
"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".
Unicode is one possible set of Extended ASCII characters, and is quite, quite large.
UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.
These words are taken from an online resource (a cool website, though) because I found them useful and appropriate for an answer.
At first ASCII included only capital letters and numbers, but in 1967 lowercase letters and some control characters were added, forming what is known as US-ASCII, i.e. the characters 0 through 127.
This set of only 128 characters was published as a standard in 1967, containing all you need to write in the English language.
In 1981, IBM developed an 8-bit extension of the ASCII code, called "code page 437". In this version some obsolete control characters were replaced with graphic characters, and 128 further characters were added, with new symbols, signs, graphics and Latin letters: all the punctuation signs and characters needed to write texts in other languages, such as Spanish.
In this way the characters ranging from 128 to 255 were added.
IBM included support for this code page in the hardware of its model 5150, known as the "IBM-PC", considered the first personal computer.
The operating system of this model, "MS-DOS", also used this extended ASCII code.
Non-ASCII Unicode characters.
If you say "high ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters came later and gave rise to the code pages that defined particular characters represented by particular values. Any multi-byte value (above 255 decimal) is not ASCII at all.
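Whatever term you settle on, testing for such characters is simple; in Python, for instance, str.isascii() (3.7+) checks exactly the 0-127 range:

>>> "plain text".isascii()
True
>>> "café".isascii()       # é is U+00E9, outside 0-127
False
>>> [c for c in "café" if not c.isascii()]
['é']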
