Why do some characters have Escape character before them? - character-encoding

Some special characters like |, ~, ^, {, } and several others have an escape character before them. Have a look at the screenshot, or check it yourself at http://messente.com/documentation/sms-length-calculator.
I want to know why these characters have escape characters before them, and how/why they are different from other (special) characters.

See the documentation of the GSM 03.38 encoding for the details.
“Why” questions are always difficult to answer precisely, but my guess is that the goal is to be able to encode the characters deemed most common with 7 bits, while other, less frequent characters will require 14 bits.
There are only 1120 bits per SMS, so saving space is desirable. With this 7-bit encoding you can fit 160 characters into a “normal” text message, instead of the 140 you would get with 8 bits per character.
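To make the arithmetic concrete, here is a rough Python sketch (mine, not the calculator's code) that counts 7-bit code units the way GSM 03.38 does, assuming the usual extension-table characters are the ones needing the escape:

    # Characters from the GSM 03.38 extension table are sent as an escape
    # septet (0x1B) plus the character code, i.e. 14 bits instead of 7.
    GSM_EXTENSION = set("^{}\\[]~|€")

    def gsm_septets(text: str) -> int:
        """Count 7-bit code units, assuming every character is representable."""
        return sum(2 if ch in GSM_EXTENSION else 1 for ch in text)

    msg = "Price range: {100~200}"
    units = gsm_septets(msg)
    # 1120 bits per SMS / 7 bits per unit = 160 characters if nothing is escaped.
    print(units, "of 160 septets used")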

Related

How to best "compress" a 32 character hexadecimal UUID with URL-safe characters?

I was hoping to shorten the UUIDs for database entries in order to make them URL-safe and shareable as permalinks. While it's possible for us to keep a separate cached table of "pointers" with fewer characters, I was wondering if there is a better way?
The best way I can think of off the top of my head is to base64 encode them instead of hexadecimal encode them. That shortens them from 32 characters to around 22. But I wanted to try to get this below 14 characters if possible. :/
I'm going to take a stab at some quick math here, so please correct me if I'm wrong. A UUID is, at its most basic level, a 128-bit value (ref). That means there are 2^128 possibilities.
According to RFC 3986,
Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.
So among friends let's say that's 66 unreserved ASCII characters we can use in the URL (26+26+10+4).
Solving the equation 66^x = 2^128, x is about 21.18, meaning that, as you said with your base64 idea, a minimum of 22 unreserved ASCII characters is needed to URL-encode the UUID (at this time), and fewer characters cannot work 100% of the time.
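For what it's worth, here is a quick Python sketch of that 22-character idea, using the standard uuid and base64 modules (the helper function names are mine, just for illustration):

    import base64
    import uuid

    def shorten(u: uuid.UUID) -> str:
        # 16 raw bytes -> 24 base64url chars, of which the trailing "==" is padding.
        return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")

    def restore(s: str) -> uuid.UUID:
        # Put the stripped padding back before decoding.
        return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))

    u = uuid.uuid4()
    short = shorten(u)
    print(len(short), short)   # 22 characters, all URL-safe ("-" and "_" are unreserved)
    assert restore(short) == u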
Having said that, on the surface (visually, in browsers) it might be possible to use Unicode characters to represent larger chunks of the value (e.g. example.com/uuid/☂☎♞ʤ☯...), but the underlying URL would be much longer than the 32-character hex UUID, since the characters allowed in URLs are limited by the RFC. However, this would admittedly be crazy and would need some neat algorithms to encode the UUID nicely.

What ASCII character uses up the most memory?

I've been thinking about ASCII and memory lately and couldn't find a solid answer to this question.
When a script is compiled, do ASCII characters use up different amounts of memory? And if so: which ASCII character uses up the most memory?
ASCII is a fixed-width character encoding, with each character represented by 7 bits. So, to answer your question: the different ASCII characters all take the same amount of memory, regardless of the implementation.
Because of the way our processor architectures are designed, we typically store an ASCII character in a single byte (the reason being that aligned memory access is a lot faster than having to do bitwise operations; see tripleee's comment). This means that on common computing platforms any ASCII character will typically take up one byte of space.
In contrast to this are variable-width encodings such as UTF-8. For future readers who come across this page, it might be worth noting that the ASCII characters 0 through 127 are represented by the same bytes in UTF-8. This was done to maintain backwards compatibility. Therefore, in UTF-8 encoded text, the ASCII characters 0 through 127 take up less space than other characters.
Further I haven't heard of a mainstream compiler/interpreter that compresses strings stored with ASCII characters. This would impose a runtime performance hit that many would find unacceptable. Such a space optimization would therefore be left to the user to perform.
The ASCII wikipedia page has a good summary of the ASCII character set.
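To make the byte counts concrete, here is a small Python check (the sample characters are arbitrary):

    # Every ASCII character is one byte in UTF-8; everything else needs 2-4 bytes.
    for ch in ["A", "~", "é", "中", "﷽", "😀"]:
        print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s) in UTF-8")
    # 'A' and '~' -> 1 byte; 'é' -> 2; '中' and '﷽' -> 3; '😀' -> 4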
﷽ is probably the most space-consuming character. I'm not sure about the encoding, but it is a huge single character. It is called "Basmala" and it means "In the name of Allah, the Most Gracious, the Most Merciful."
According to a Reddit user who has now deleted their account: “It's an Arabic ligature commonly used in Urdu. It was added so someone using an Urdu keyboard can type it easier.”
I love to use this in Discord raids, because imagine 2000 Basmala characters, vs 2000 regular characters. It fills their server up a LOT. Glad I could help.

What is the relationship between unicode/utf-8/utf-16 and my local encode GBK?

I've noticed that my text files from Windows (Chinese version) turn into garbled text when moved to Ubuntu.
After more research, I learned that the default encoding on the Chinese version of Windows is GBK, while Ubuntu uses UTF-8, and that iconv can translate between encodings, for example from GBK to UTF-8:
iconv -f gbk -t utf-8 input.txt > output.txt
But I am still confused by the relationship between these encodings. What are they? What are the similarities and differences between them?
First, it is not about the OS, but about the program you are using to read the file.
With a bare .txt file, the program has to guess the encoding, which is not always possible but often works. In an HTML file, the encoding is given as metadata, so browsers don't need to guess.
Second, do you know ASCII? Do you see how it represents symbols via numbers? If not this is the first thing you should learn now.
Next, do you see the difference between Unicode and UTF-XXX? It must be clear to you that Unicode is just a map of numbers (code points) to glyphs (symbols, including Chinese characters, ASCII characters, Egyptian characters, etc.)
UTF-XXX on the other hand says, given a string of bytes, which Unicode numbers (code points) do they represent. Therefore, UTF-8 and UTF-16 are different efficient ways to represent Unicode.
As you may imagine, unlike ASCII, both UTF and GBK must allow more than one byte per character, since there are many more than 256 characters to represent.
In GBK all characters are encoded as 1 or 2 bytes.
Since GBK is specialized for Chinese, it uses fewer bytes on average than UTF-XXX to represent a given Chinese text, and more for other languages.
In UTF-8 and 16, the number of bytes per glyph is variable, so you have to look at how many bytes are used for the Chinese code points.
In Unicode, Chinese glyphs occupy a handful of ranges (the CJK Unified Ideographs blocks). You then have to look at how efficiently UTF-8 and UTF-16 represent those ranges.
According to the Wikipedia articles on UTF-8 and UTF-16, the first and most common range for Chinese glyphs, 4E00-9FFF, is represented in UTF-8 with 3 bytes per character, while in UTF-16 it takes 2 bytes per character. Therefore, if you are going to use lots of Chinese, UTF-16 might be more efficient. You also have to look at the other ranges to see how many bytes per character they need.
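If you want to verify the byte counts yourself, here is a short Python sketch (the sample text is arbitrary):

    # Compare how many bytes the same short Chinese text needs in each encoding.
    text = "汉字编码"  # four characters from the U+4E00-U+9FFF range
    for enc in ["gbk", "utf-8", "utf-16-le"]:
        print(enc, len(text.encode(enc)), "bytes")
    # gbk: 8 bytes (2 per character), utf-8: 12 bytes (3 per character),
    # utf-16-le: 8 bytes (2 per character, no surrogate pairs needed here)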
For portability, the best choice is a UTF encoding, since Unicode can represent almost any character, so it is more likely that viewers will have been programmed to decode it correctly. The size gain of GBK is not that large.

Weird encoding replacing

I have a file. I don't know how it was processed. It's probably a double encoding. I've found this link about double encoding that solved most of my problem:
http://www.spamusers.com/encoding.htm
It lists all the double-encoding substitutions to make, like:
À → À
Unfortunately I still have other weird characters like:
ú
ç
ö
Do you have an idea on how to clean these weird characters? For the ones I know I've just made a bash script and I've just replaced them. But I don't know how to recognize the others. I'm running on linux so if you have some magic commands I would like that.
The “double encodings substitutions” page that you link to seems to contain mappings meant to fix character data that has been doubly UTF-8 encoded. Thus, the proper fixing routine would be to reverse such mappings and see if the result makes sense.
For example, if you take A with grave accent, À, U+00C0, and UTF-8 encode it, you get the bytes C3 80. If these are then mistakenly understood as single-byte encodings according to windows-1252, for example, you get the characters U+00C3 U+20AC (the letter Ã and the euro sign €). If these are then UTF-8 encoded, you get C3 83 for the former and E2 82 AC for the latter. If these bytes are in turn interpreted according to windows-1252, you get À as on the page.
But you don’t actually have “À”, do you? You have some digital data, bytes, which display that way if interpreted according to windows-1252. But that would be a wrong interpretation.
You should first read the data as UTF-8 and decode it to characters, checking that every character is one that windows-1252 can encode as a single byte (if not, there’s yet another error involved somewhere); then encode those characters back to bytes as windows-1252 and UTF-8 decode again.
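Here is a Python sketch of both the damage and the repair, assuming windows-1252 was the wrong single-byte interpretation (the ftfy library automates this kind of mojibake repair if you would rather not hand-roll it):

    # Reproduce the damage: UTF-8 bytes misread as windows-1252, then re-encoded.
    original = "À"
    damaged = original.encode("utf-8").decode("cp1252").encode("utf-8").decode("cp1252")
    print(damaged)   # the five-character garbage listed on the page

    # Reverse it: turn the visible characters back into bytes as windows-1252,
    # then decode those bytes as UTF-8 -- once per round of damage.
    repaired = damaged
    for _ in range(2):
        repaired = repaired.encode("cp1252").decode("utf-8")
    print(repaired == original)   # True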

What's the proper technical term for "high ascii" characters?

What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.
Often they're called diacritics or accented letters, and sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.
What is a correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?
"Non-ASCII characters"
ASCII character codes above 127 are not defined. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, others chose other characters.
Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional western alphabets, but Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other language both modern and ancient.
There are several encodings of Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same in both ASCII and UTF-8.
That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or another Unicode encoding), or they can be a custom implementation by a hardware or software supplier.
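A quick Python check of that subset relationship (the byte values are just an example):

    text = "plain ASCII text"
    assert text.encode("ascii") == text.encode("utf-8")   # identical bytes

    data = bytes([0xC3, 0xA9])
    print(data.decode("utf-8"))   # 'é' -- a valid UTF-8 sequence
    # data.decode("ascii") would raise UnicodeDecodeError: byte values above 127
    # simply have no meaning in ASCII itself.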
You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.
A bit sequence that doesn't represent an ASCII character is not definitively a Unicode character.
Depending on the character encoding you're using, it could be any of the following:
an invalid bit sequence
a Unicode character
an ISO-8859-x character
a Microsoft 1252 character
a character in some other character encoding
a bug, binary data, etc
The one definition that would fit all of these situations is:
Not an ASCII character
To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
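To see how the same non-ASCII bytes can land in several of those buckets, here is a small Python sketch (the byte values are arbitrary examples):

    data = b"\xd0\x9f"
    for enc in ["utf-8", "iso-8859-1", "cp1252"]:
        print(enc, "->", data.decode(enc, errors="replace"))
    # utf-8 sees one Cyrillic letter 'П'; the single-byte encodings see two
    # unrelated characters instead.

    # And some byte sequences are simply invalid in a given encoding:
    print(b"\xc3\x28".decode("utf-8", errors="replace"))   # 0x28 is not a valid continuation byte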
"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".
Unicode is one possible set of Extended ASCII characters, and is quite, quite large.
UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.
The following is taken from an online resource (cool website, by the way) because I found it useful and appropriate for an answer.
At first ASCII included only capital letters and numbers, but in 1967 lowercase letters and some control characters were added, forming what is known as US-ASCII, i.e. the characters 0 through 127.
This set of only 128 characters was published as a standard in 1967, containing all you need to write in the English language.
In 1981, IBM developed an 8-bit extension of the ASCII code, called "code page 437", in which some obsolete control characters were replaced with graphic characters. 128 characters were also added, with new symbols, signs, graphics and Latin letters, all the punctuation signs and characters needed to write text in other languages, such as Spanish.
In this way the extended ASCII characters, ranging from 128 to 255, were added.
IBM included support for this code page in the hardware of its model 5150, known as the "IBM PC", considered the first personal computer.
The operating system of this model, MS-DOS, also used this extended ASCII code.
Non-ASCII Unicode characters.
If you say "High ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the code pages that defined particular characters represented by particular values. Anything above 255 decimal (i.e. requiring more than one byte) is not ASCII at all.
