Huffman code: fixed bit length per character

How do you determine how many bits per character are required for a fixed-length code for a string when using Huffman coding? My idea was to count the number of distinct characters in the string and then write that count in binary, so the number of bits in that representation would be the fixed length, but it doesn't work. For example, in the string "letty lotto likes lots of lolly" there are 10 different characters (excluding the quotes). Since 10 = 1010 (4 bits), I thought all the characters could be represented using 4 bits, yet the frequency of 'f' is 1 and it is encoded as 11111 (5 bits), not 4.

Let's say you have a string with 50 "A"s, 35 "B"s and 15 "C"s.
With a fixed-length encoding, you could represent each character in that string using 2 bits. There are 100 total characters, so when using this method, the compressed string would be 200 bits long.
Alternatively, you could use a variable-length encoding scheme. If you allow the characters to have a variable number of bits, you could represent "A" with 1 bit ("0"), "B" with 2 bits ("10") and "C" with 2 bits ("11"). With this method, the compressed string is 150 bits long, because the most common pieces of information in the string take fewer bits to represent.
Huffman coding specifically refers to a method of building a variable-length encoding scheme, using the number of occurrences of each character to do so.
The fixed-length algorithm you're describing is entirely separate from Huffman coding. If your goal is to compress text using a fixed-length code, then your method of working out how many bits to use per character will work.
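To make the comparison concrete, here is a minimal Python sketch (my own illustration, not part of the answer) that reproduces both totals for the 50/35/15 example. The heapq-based loop only tracks code lengths, not the actual bit patterns; names like freqs and code_len are mine.

import heapq
from math import ceil, log2

freqs = {'A': 50, 'B': 35, 'C': 15}

# Fixed-length: every character gets ceil(log2(number of distinct characters)) bits.
fixed_bits = ceil(log2(len(freqs)))
print('fixed-length total:', fixed_bits * sum(freqs.values()))   # 2 * 100 = 200 bits

# Huffman: repeatedly merge the two least frequent subtrees; each merge adds
# one bit to the code length of every character inside the merged subtrees.
heap = [(f, [c]) for c, f in freqs.items()]
heapq.heapify(heap)
code_len = {c: 0 for c in freqs}
while len(heap) > 1:
    f1, s1 = heapq.heappop(heap)
    f2, s2 = heapq.heappop(heap)
    for c in s1 + s2:
        code_len[c] += 1
    heapq.heappush(heap, (f1 + f2, s1 + s2))

print('huffman total:', sum(freqs[c] * code_len[c] for c in freqs))  # 150 bits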

Related

How many Bits represent ONE character and How many Bits represent One Byte in ASCII?

I know it's simple but I still don't know it. Some people say that there are 7 bits to represent a character, while others say 8. So can anyone tell me which one is right? If it is 8 bits per character, then how many bits represent a byte? And if it's 7, how many bits represent a character and how many bits represent one byte?
US-ASCII is indeed 7 bits per character. The highest code has value 127, which represents the DEL control character. Any character set that has codes with higher values is not US-ASCII (but may be an extension of it, such as Unicode).
Most microprocessors work with bytes (=smallest addressable unit of storage) of eight bits. If you want to use US-ASCII with these microprocessors, you have two options:
Use 7 bytes (of 8 bits each) to store 8 characters (of 7 bits each), even though that makes programs very complicated.
Use 1 byte (of 8 bits) to store 1 character (of 7 bits), even though you'll waste space.
The need for simple programs outweighs the need for efficient memory use in this case. That's why you usually use one 8-bit unit (an octet, for short) to store a character, even though each character is encoded in only 7-bit units. You just set the extra bit to zero (or, as was done in some cases, use the extra bit for error detection).
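As a small illustration of that last point, here is a hypothetical Python helper (my own sketch, not from the answer) that stores a 7-bit ASCII code in an 8-bit byte, using the spare top bit as an even-parity bit instead of leaving it zero.

def with_parity(code7):
    assert 0 <= code7 < 128
    parity = bin(code7).count('1') & 1   # 1 if the 7-bit code has an odd number of 1-bits
    return code7 | (parity << 7)         # set bit 7 so the whole byte has even parity

print(hex(with_parity(ord('A'))))  # 'A' = 0x41 has two 1-bits, so the parity bit stays 0 -> 0x41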
I know this is an old question, but for the sake of future readers: you can determine how many bytes are in a given string (or string value) via the following (C# .NET):
Encoding.ASCII.GetByteCount("SomeString");
Remember to use the proper encoding when you are attempting to count the number of bytes since it is different with each encoding:
An ASCII character in 8-bit ASCII encoding is 8 bits (1 byte), though it can fit in 7 bits.
An ISO-8859-1 character in ISO-8859-1 encoding is 8 bits (1 byte).
A Unicode character in UTF-8 encoding is between 8 bits (1 byte) and 32 bits (4 bytes).
A Unicode character in UTF-16 encoding is between 16 bits (2 bytes) and 32 bits (4 bytes), though most of the common characters take 16 bits. This is the encoding used by Windows internally.
A Unicode character in UTF-32 encoding is always 32 bits (4 bytes).
An ASCII character in UTF-8 is 8 bits (1 byte), and in UTF-16 it is 16 bits.
The additional (non-ASCII) characters in ISO-8859-1 (0xA0-0xFF) would take 16 bits in UTF-8 and UTF-16.
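The same sizes can be checked in Python (a sketch of my own, serving the same purpose as the C# GetByteCount call above): encode one character in several encodings and count the bytes.

for ch in ('A', '¢', '€', '😀'):
    for enc in ('latin-1', 'utf-8', 'utf-16-le', 'utf-32-le'):
        try:
            print(ch, enc, len(ch.encode(enc)), 'byte(s)')
        except UnicodeEncodeError:
            print(ch, enc, 'not representable')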

Length() vs Sizeof() on Unicode strings

Quoting the Delphi XE8 help:
For single-byte and multibyte strings, Length returns the number of bytes used by the string. Example for UTF-8:
Writeln(Length(Utf8String('1¢'))); // displays 3
For Unicode (WideString) strings, Length returns the number of bytes divided by two.
This raises important questions:
Why is there a difference in handling at all?
Why doesn't Length() do what it's expected to do and return just the length of the parameter (as in, the count of elements) instead of giving the size in bytes in some cases?
Why does it state it divides the result by 2 for Unicode (UTF-16) strings? AFAIK UTF-16 is 4-byte at most, and thus this will give incorrect results.
Length returns the number of elements when considering the string as an array.
For strings with 8-bit element types (ANSI, UTF-8), Length gives you the number of bytes, since the number of bytes is the same as the number of elements.
For strings with 16-bit elements (UTF-16), Length is half the number of bytes, because each element is 2 bytes wide.
Your string '1¢' has two code points, but the second code point requires two bytes to encode it in UTF-8. Hence Length(Utf8String('1¢')) evaluates to three.
You mention SizeOf in the question title. Passing a string variable to SizeOf will always return the size of a pointer, since a string variable is, under the hood, just a pointer.
To your specific questions:
Why is there a difference in handling at all?
There is only a difference if you think of Length as relating to bytes. But that's the wrong way to think about it. Length always returns an element count, and when viewed that way, its behaviour is uniform across all string types, and indeed across all array types.
Why doesn't Length() do what it's expected to do and return just the length of the parameter (as in, the count of elements) instead of giving the size in bytes in some cases?
It does always return the element count. It just so happens that when the element size is a single byte, then the element count and the byte count happen to be the same. In fact the documentation that you refer to also contains the following just above the excerpt that you provided: Returns the number of characters in a string or of elements in an array. That is the key text. The excerpt that you included is meant as an illustration of the implications of this italicised text.
Why does it state it divides the result by 2 for Unicode (UTF-16) strings? AFAIK UTF-16 is 4-byte at most, and thus this will give incorrect results.
UTF-16 character elements are always 16 bits wide. However, some Unicode code points require two character elements to encode. These pairs of character elements are known as surrogate pairs.
You are hoping, I think, that Length will return the number of code points in a string. But it doesn't. It returns the number of character elements. And for variable length encodings, the number of code points is not necessarily the same as the number of character elements. If your string was encoded as UTF-32 then the number of code points would be the same as the number of character elements since UTF-32 is a constant sized encoding.
A quick way to count the code points is to scan through the string checking for surrogate pairs. When you encounter a surrogate pair, count one code point. Otherwise, when you encounter a character element that is not part of a surrogate pair, count one code point. In pseudo-code:
N := 0;
for C in S do
  if C.IsSurrogate then
    inc(N)      // each half of a surrogate pair adds 1, so a full pair adds 2
  else
    inc(N, 2);  // an ordinary (non-surrogate) element adds 2
CodePointCount := N div 2;  // every code point has now contributed exactly 2
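For comparison, here is a rough Python sketch (my own, not from the answer) that applies the same idea to a UTF-16 byte string: count the 16-bit elements, and count code points by skipping the trailing (low) half of each surrogate pair.

import struct

s = 'a\U0001F600'                            # 'a' plus one code point outside the BMP
data = s.encode('utf-16-le')                 # a sequence of 16-bit elements
units = struct.unpack('<%dH' % (len(data) // 2), data)
code_points = sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)  # skip low surrogates
print(len(units), code_points)               # 3 elements, 2 code points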
Another point to make is that the code point count is not the same as the visible character count. Some code points are combining characters and are combined with their neighbouring code points to form a single visible character or glyph.
Finally, if all you are hoping to do is find the byte size of the string payload, use this expression:
Length(S) * SizeOf(S[1])
This expression works for all types of string.
Be very careful about the function System.SysUtils.ByteLength. On the face of it this seems to be just what you want. However, that function returns the byte length of a UTF-16 encoded string. So if you pass it an AnsiString, say, then the value returned by ByteLength is twice the number of bytes of the AnsiString.

Why can Lua's strings contain characters with any numeric value?

I read something about strings here:
http://www.lua.org/pil/2.4.html
Lua is eight-bit clean and so strings may contain characters with any numeric value, including embedded zeros.
What does "eight-bit clean" mean?
Why can it contain characters with any numeric value? (unlike basic C strings)
There are two common ways to store strings:
Characters and Terminator
Length and Characters
When you use #1, you need to "sacrifice" one character to serve as the terminator; when you use #2, you do not have such limitation.
C uses the first method of storing strings. It uses character zero to serve as the terminator; the other 255 characters can be used to represent characters of the string.
Lua uses the second method of storing strings. All 256 possible character values, including zeros, can be used in Lua strings. For example, you can construct a three-character string from characters 'A', 0, 'B', and Lua will treat it as a three character string. You can construct the same string in C, but its string-processing libraries will treat it as a single-character string: strlen would return 1, puts will write character A and stop, and so on.
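To see the difference concretely, here is a small Python sketch (my own illustration of the same point, not from the answer); it assumes a Unix-like system where ctypes can reach the C library's strlen.

import ctypes

s = b'A\x00B'                      # the three-character string 'A', 0, 'B'
print(len(s))                      # 3: the length is stored, the zero byte is just data
libc = ctypes.CDLL(None)           # assumes a Unix-like libc is loadable this way
print(libc.strlen(s))              # 1: strlen stops at the first NUL byte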
The Lua string type is a counted sequence of bytes. A byte can hold any value between 0 and 255.
The string type is used for character strings. You are right: few character-set encodings allow any byte value or sequence of byte values. Code page 437 is one that does; it maps 256 characters to 256 values, one byte per character. Windows-1252 does not; it maps 251 characters to 251 values, one byte per character. UTF-8 maps 1,112,064 characters to sequences of one to four bytes, where some byte values are not used and some sequences of values are not used.
The Lua string library does have functions that treat bytes as characters. Their behavior is influenced by the implementation's libraries, which typically use the C runtime along with its locale features.
There are specialized libraries for Lua to explicitly handle various character set encodings.

Discover the character encoding from byte

I have a string where I know that the degree symbol (°) is represented by the byte 63 (3F).
Each character is represented by a single byte.
How can I find the character encoding used ?
Almost all 8-bit encodings in modern times coincide with ASCII in the ASCII range, so byte 3F hexadecimal is the question mark “?”. As Sebtm’s comment suggests, this might result from character-level data error. E.g., some software that is limited to ASCII could turn all other bytes to “?” – not a good practice, but possible.
If it were a non-ASCII byte, you could use the page http://www.eki.ee/letter/chardata.cgi?search=degree+sign to make a guess.
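A quick Python check (my own sketch, not from the answer) shows why 0x3F points to data loss rather than to any particular encoding: it is "?" in all the common ASCII-compatible 8-bit encodings, while the real degree sign is encoded differently in each.

for enc in ('ascii', 'latin-1', 'cp1252', 'cp437', 'utf-8'):
    print(enc, bytes([0x3F]).decode(enc))             # '?' in every one of these
print('°'.encode('latin-1'), '°'.encode('cp437'), '°'.encode('utf-8'))
# b'\xb0' b'\xf8' b'\xc2\xb0' -- none of these is 0x3F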

How do low-level character encodings work?

Let's say I have a text file called sometext.txt.
It has a line, "Sic semper tyrannis", which is (correct me if I'm wrong):
83 105 99 32 115 101 109 112 101 114 32 116 121
114 97 110 110 105 115
(in decimal ASCII)
When I read this line from the file using standard library file I/O routines, I don't perform any character encoding work (or do I?).
The question is:
Which software component actually converts 0s and 1s into characters (i.e. contains the algorithm for converting 0s and 1s into characters)? Is it an OS component? Which one?
It's all a bunch of 1's and 0's.
An ASCII "A" is just the letter displayed when the value (01000001b, or 0x41 or 65 dec) is "encountered" (depend on context, naturally). There is no "conversion"; it's just a different view of the same thing defined by an accepted mapping.
Unicode (and other multi-byte) character sets often use different encodings; in UTF-8 (a Unicode encoding), for instance, a single Unicode character can be mapped as 1, 2, 3 or 4 bytes depending upon the character. Unicode encoding conversion often takes place in the IO libraries that come as part of a language or runtime; however, a Unicode-aware operating system also needs to understand a Unicode encoding itself (in system calls) so the line can be blurred.
UTF-8 has the nice property that all normal ASCII characters map to a single byte which makes it the most compatible Unicode encoding with traditional ASCII.
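A tiny Python illustration of that property (not part of the answer):

text = 'Sic semper tyrannis'
print(text.encode('ascii') == text.encode('utf-8'))   # True: ASCII text is byte-for-byte identical in UTF-8
print(list('é'.encode('utf-8')))                      # [195, 169]: one non-ASCII character becomes two bytes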
First, I recommend that you read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
When I read this line from the file using standard library file I/O routines, I don't perform any character encoding work (or do I?).
That depends heavily on which standard library you mean.
In C, when you write:
FILE* f = fopen("filename.txt", "w");
fputs("Sic semper tyrannis", f);
No encoding conversion is performed; the chars in the string are just written to the file as-is (except for line breaks). (Encoding is relevant when you're editing the source file.)
But in Python 3.x, when you write:
f = open('filename.txt', 'w', encoding='UTF-8')
f.write('Sic semper tyrannis')
The write function performs an internal conversion from the UTF-16/32 encoding of the Python str types to the UTF-8 encoding used on disk.
The question is: Which software component actually converts 0s and 1s into characters (i.e. contains the algorithm for converting 0s and 1s into characters)? Is it an OS component? Which one?
The decoding function (like MultiByteToWideChar or bytes.decode) for the appropriate character encoding converts the bytes into Unicode code points, which are integers that uniquely identify characters. A font converts code points to glyphs, the images of the characters that appear on screen or paper.
Which software component actually converts 0s and 1s into characters (i.e. contains the algorithm for converting 0s and 1s into characters)?
This depends on what language you're using. For example, Python has character encoding functions:
>>> f = open( ...., 'rb')
>>> data = f.read()
>>> data.decode('utf-8')
u'café'
Here, Python has converted a sequence of bytes into a Unicode string. The exact component is typically a library or program in userspace, but some compilers need knowledge of character encodings.
Underneath, it's all a sequence of bytes, which are 1s and 0s. However, given a sequence of bytes, which characters do they represent? ASCII is one such "character encoding", and tells us how to encode or decode A-Z, a-z, and a few more. There are many others, notably UTF-8 (an encoding of Unicode). In the end, if you're dealing with text, you need to know which character encoding it is encoded with.
Like DrStrangeLove says, it's 1's & 0's all the way to your display screen and beyond - the 'A' character is an array of pixels whose color/brightness is defined by bits in the display driver. Turning that pixel array into an understandable character needs a bioElectroChemical video camera connected to 10^11 threshold logic gates running an adaptive, massively-parallel OS and apps that no-one understands, especially after a few beers
Not exactly sure what you're asking. The 0's and 1's from the file are blocked up into the bytes that can represent ASCII codes by the disk driver - it will only read/write blocks of eight bits. The ASCII code bytes are rendered into displayable bitmaps by the display driver using the chosen font.
Rgds,
Martin
It has nothing (well, not much) to do with 0s and 1s. Most character encodings work with entire bytes of 8 bits. Each of the numbers you wrote represents a single byte. In ASCII, every character is a single byte. Besides that, ASCII is a subset of ANSI and UTF-8, making it compatible with the most widely used character sets. ASCII covers only the first half of the byte range: characters up to 127.
For ANSI you need to know which code page is in use; it specifies the characters in the upper half of the byte range. In UTF-8, those ANSI characters don't exist as single bytes. Instead, byte values in that upper half form part of a multi-byte character, and a whole character takes 2 to 4 bytes, except for the 128 ASCII characters, which are still the same old single-byte characters. I think this was mainly done because if UTF-8 weren't compatible with ASCII, there is no way Americans would have adopted it. ;-)
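A short Python sketch (my own, assuming Windows-1252 as the "ANSI" code page) makes those byte layouts visible:

print(list('é'.encode('cp1252')))                     # [233]: one byte in the upper half of the range
print(list('é'.encode('utf-8')))                      # [195, 169]: a two-byte UTF-8 sequence
print(list('e'.encode('cp1252')), list('e'.encode('utf-8')))   # [101] [101]: ASCII stays a single byte in both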
But yes, the OS does have various functions to work with character encodings. Where they live depends on the OS and platform, but if I read your question right, you're not really looking for some specific API, and your question cannot be answered that concretely. There are numerous ways to work with characters, and there is a major difference between working with the actual character data and drawing it on the screen (the difference between a character and a font).

Resources