Huffman Code with equal symbol frequencies - huffman-code

Starting with these frequencies:
A:7 F:6 H:1 M:2 N:4 U:5
at a later step I have 5 6 7 7, where one of the 7's is the "A". Which 7 branch I pick to be a 0 or a 1 is arbitrary.
So how do I get uniquely decodable code word?

You need to send the code to the receiver, not the frequencies. You can arbitrarily assign 0's and 1's to all of the branches, and then send the codes for each symbol before the coded symbols themselves. There are many possible Huffman codes from the same set of frequencies.
More commonly only the code lengths in bits for each symbol are sent. In this case those are A:2 F:2 H:4 M:4 N:3 U:2. Then a canonical code is used on both ends that depends only on the lengths. In this case, starting with 0's, the canonical code would be:
A: 00
F: 01
U: 10
N: 110
H: 1110
M: 1111
where codes of equal length are assigned to the symbols in lexicographical order. Note that the Huffman tree that was built is not needed. All that is needed is the number of bits for each symbol.

Related

bin2dec for 16 bit signed binary values (in google sheets)

In google sheets, I'm trying to convert a 16-bit signed binary number to its decimal equivalent, but the built in function that does that only takes up to 10 bits. Other solutions to the problem that I've seen don't preserve the signedness.
So far I've tried:
bin2dec on the leftmost 8 bits * 2^8 + bin2dec on the rightmost 8 bits
hex2dec on the result of bin2dec on the leftmost 8 bits concatenated with bin2dec on the rightmost 8 bits
I've also seen a suggestion that multiplies each bit by its power of 2, eliminating bin2dec altogether.
Any suggestions?
You will need to use a custom function
function binary2decimal(bin) {
return parseInt(bin, 2);
}
Let's assume that your binary number is in cell A2.
First, set the formatting as follows: Format > Number > Plain text.
Then place the following formula in, say, B2:
=ArrayFormula(SUM(SPLIT(REGEXREPLACE(SUBSTITUTE(A2&"","-",""),"(\d)","$1|"),"|")*(2^SEQUENCE(1,LEN(SUBSTITUTE(A2&"","-","")),LEN(SUBSTITUTE(A2&"","-",""))-1,-1))*IF(LEFT(A2)="-",-1,1)))
This formula will process any length binary number, positive or negative, from 1 bit to 16 bits (and, in fact, to a length of 45 or 46 bits).
What this formula does is SPLIT the binary number (without the negative sign if it exists) into its separate bits, one per column; multiply each of those by 2 raised to the power of each element of an equal-sized degressive SEQUENCE that runs from a high of the LEN (i.e., number) of bits down to zero; and finally apply the negative sign conditionally IF one exists.
If you need to process a range where every value is a positive or negative binary number with exactly 16 bits, you can do so. Suppose that your 16-bit binary numbers are in the range A2:A. First, be sure to select all of Column A and set the formatting to "Plain text" as described above. Then place the following array formula into, say, B2 (being sure that B2:B is empty first):
=ArrayFormula(MMULT(SPLIT(REGEXREPLACE(SUBSTITUTE(FILTER(A2:A,A2:A<>"")&"","-",""),"(\d)","$1|"),"|")*(2^SEQUENCE(1,16,15,-1)),SEQUENCE(16,1,1,0))*IF(LEFT(FILTER(A2:A,A2:A<>""))="-",-1,1))

double rounding

I this is code that rounds 62 to 61 and shows it in the output. Why it decides to round and how to get 62 in the output?
var d: double;
i: integer;
begin
d:=0.62;
i:= trunc(d*100);
Showmessage( inttostr(i) );
end;
This boils down to the fact that 0.62 is not exactly representable in a binary floating point data type. The closest representable double value to 0.62 is:
0.61999 99999 99999 99555 91079 01499 37383 83054 73327 63671 875
When you multiply this value by 100, the resulting value is slightly less that 62. What happens next depends on how the intermediate value d*100 is treated. In your program, under the 32 bit Windows compiler with default settings, the intermediate value is held in an 80 bit extended register. And the closest 80 bit extended precision value is:
61.99999 99999 99999 55591 07901 49937 38383 05473 32763 67187 5
Since the value is less than 62, Trunc returns 61 since Trunc rounds towards zero.
If you stored d*100 in a double value, then you'd see a different result.
d := 0.62;
d := d*100;
i := Trunc(d);
Writeln(i);
This program outputs 62 rather than 61. That's because although d*100 to extended 80 bit precision is less than 62, the closest double precision value to that 80 bit value is in fact 62.
Similarly, if you compile your original program with the 64 bit compiler, then arithmetic is performed in the SSE unit which has no 80 bit registers. And so there is no 80 bit intermediate value and your program outputs 62.
Or, going back to the 32 bit compiler, you can arrange that intermediate values are stored to 64 bit precision on the FPU and also achieve an output of 62. Call Set8087CW($1232) to achieve that.
As you can see, binary floating point arithmetic can sometimes be surprising.
If you use Round rather than Trunc then the value returned will be the closest integer, rather than rounding towards zero as Trunc does.
But perhaps a better solution would be to use a decimal data type rather than a binary data type. If you do that then you can represent 0.62 exactly and thereby avoid all such problems. Delphi's built in decimal real valued data type is Currency.
Use round instead of trunc.
round will round towards the nearest integer, and 62.00 is very close to 62, so there is no problem. trunc will round to the nearest integer towards zero, and 62.00 is very close to 61.9999999 so numeric 'fuzz' might very well cause the issue you describe.

What does the ampersand do in a print statement? [duplicate]

I was reading through some code examples and came across a & on Oracle's website on their Bitwise and Bit Shift Operators page. In my opinion it didn't do too well of a job explaining the bitwise &. I understand that it does a operation directly to the bit, but I am just not sure what kind of operation, and I am wondering what that operation is. Here is a sample program I got off of Oracle's website: http://docs.oracle.com/javase/tutorial/displayCode.html?code=http://docs.oracle.com/javase/tutorial/java/nutsandbolts/examples/BitDemo.java
An integer is represented as a sequence of bits in memory. For interaction with humans, the computer has to display it as decimal digits, but all the calculations are carried out as binary. 123 in decimal is stored as 1111011 in memory.
The & operator is a bitwise "And". The result is the bits that are turned on in both numbers. 1001 & 1100 = 1000, since only the first bit is turned on in both.
The | operator is a bitwise "Or". The result is the bits that are turned on in either of the numbers. 1001 | 1100 = 1101, since only the second bit from the right is zero in both.
There are also the ^ and ~ operators, that are bitwise "Xor" and bitwise "Not", respectively. Finally there are the <<, >> and >>> shift operators.
Under the hood, 123 is stored as either 01111011 00000000 00000000 00000000 or 00000000 00000000 00000000 01111011 depending on the system. Using the bitwise operators, which representation is used does not matter, since both representations are treated as the logical number 00000000000000000000000001111011. Stripping away leading zeros leaves 1111011.
It's a binary AND operator. It performs an AND operation that is a part of Boolean Logic which is commonly used on binary numbers in computing.
For example:
0 & 0 = 0
0 & 1 = 0
1 & 0 = 0
1 & 1 = 1
You can also perform this on multiple-bit numbers:
01 & 00 = 00
11 & 00 = 00
11 & 01 = 01
1111 & 0101 = 0101
11111111 & 01101101 = 01101101
...
If you look at two numbers represented in binary, a bitwise & creates a third number that has a 1 in each place that both numbers have a 1. (Everywhere else there are zeros).
Example:
0b10011011 &
0b10100010 =
0b10000010
Note that ones only appear in a place when both arguments have a one in that place.
Bitwise ands are useful when each bit of a number stores a specific piece of information.
You can also use them to delete/extract certain sections of numbers by using masks.
If you expand the two variables according to their hex code, these are:
bitmask : 0000 0000 0000 1111
val: 0010 0010 0010 0010
Now, a simple bitwise AND operation results in the number 0000 0000 0000 0010, which in decimal units is 2. I'm assuming you know about the fundamental Boolean operations and number systems, though.
Its a logical operation on the input values. To understand convert the values into the binary form and where bot bits in position n have a 1 the result has a 1. At the end convert back.
For example with those example values:
0x2222 = 10001000100010
0x000F = 00000000001111
result = 00000000000010 => 0x0002 or just 2
Knowing how Bitwise AND works is not enough. Important part of learning is how we can apply what we have learned. Here is a use case for applying Bitwise AND.
Example:
Adding any even number in binary with 1's binary will result in zeros. Because all the even number has it's last bit(reading left to right) 0 and the only bit 1 has is 1 at the end.
If you were to ask write a function which takes an argument as a number and returns true for even number without using addition, multiplication, division, subtraction, modulo and you cannot convert number to string.
This function is a perfect use case for using Bitwise AND. As I have explained earlier. You ask show me the code? Here is the java code.
/**
* <p> Helper function </p>
* #param number
* #return 0 for even otherwise 1
*/
private int isEven(int number){
return (number & 1);
}
is doing the logical and digit by digit so for example 4 & 1 became
10 & 01 = 1x0,0x1 = 00 = 0
n & 1 is used for checking even numbers since if a number is even the oeration it will aways be 0
import.java.io.*;
import.java.util.*;
public class Test {
public static void main(String[] args) {
int rmv,rmv1;
//this R.M.VIVEK complete bitwise program for java
Scanner vivek=new Scanner();
System.out.println("ENTER THE X value");
rmv = vivek.nextInt();
System.out.println("ENTER THE y value");
rmv1 = vivek.nextInt();
System.out.println("AND table based\t(&)rmv=%d,vivek=%d=%d\n",rmv,rmv1,rmv&rmv1);//11=1,10=0
System.out.println("OR table based\t(&)rmv=%d,vivek=%d=%d\n",rmv,rmv1,rmv|rmv1);//10=1,00=0
System.out.println("xOR table based\t(&)rmv=%d,vivek=%d=%d\n",rmv,rmv1,rmv^rmv1);
System.out.println("LEFT SWITH based to %d>>4=%d\n",rmv<<4);
System.out.println("RIGTH SWITH based to %d>>2=%d\n",rmv>>2);
for(int v=1;v<=10;v++)
System.out.println("LIFT SWITH based to (-NAGATIVE VALUE) -1<<%d=%p\n",i,-1<<1+i);
}
}

COBOL Packed data type: Type = P5

This may be a very basic question for COBOL experts. But I till date had nothing to do with COBOL. We are processing some files based on character position. The files are being sent to us from mainframe machines and we have a layout file for that that says somethings like this.
POSITION : LENGTH : TYPE : DESCRIPTION
----------:--------:------:-------------------------------
61-70 : 10 : P5 : FIELD-1 9(13)V(05)
71-80 : 10 : P5 : Field-2 9(13)V(05)
81-81 : 1 : A/N : FLAG
82-84 : 3 : N : NUMBER OF DAYS 9(3)
I understand that the type A/N means it is alpha-numeric. N means numeric and P means Packed data type. What i dont understand is what P5 means. What is the significance of 5 that comes next to P?
What is the significance of 5 that comes next to P?
I'm not sure. Five 16 bit words, maybe.
Your packed fields are 10 bytes and holding 19 characters (18 digits plus the sign). The decimal point is implied.
If the sign byte (the rightmost byte) is anything other than hexadecimal F, update your question.
If you could update your question with five hexadecimal strings representing five of the numbers, that would be great.
Right now, I'm guessing that it's an ordinary packed decimal field.
P - packed decimal (i.e. Cobol Comp-3) a 18 digit packed decimal would occupy 10 bytes which agrees with the lengths give
5 - number of digits after the decimal point (at a guess).
Field definition in cobol is probably
03 FIELD-1 pic s9(13)V(05) comp-3.
in packed decimal, the sign is held in the last nyble (4 bits) and each nyble (4 bits) holds one decimal digit.
i.e.
121 is represented as x'121c'
while
-121 is represented as x'121d'
If you are using java and can get the cobol copybook, there are packages that can read the file using the cobol copybook.
I would bet it means 5 decimal places.

variations in huffman encoding codewords

I'm trying to solve some huffman coding problems, but I always get different values for the codewords (values not lengths).
for example, if the codeword of character 'c' was 100, in my solution it is 101.
Here is an example:
Character Frequency codeword my solution
A 22 00 10
B 12 100 010
C 24 01 11
D 6 1010 0110
E 27 11 00
F 9 1011 0111
Both solutions have the same length for codewords, and there is no codeword that is prefix of another codeword.
Does this make my solution valid ? or it has to be only 2 solutions, the optimal one and flipping the bits of the optimal one ?
There are 96 possible ways to assign the 0's and 1's to that set of lengths, and all would be perfectly valid, optimal, prefix codes. You have shown two of them.
There exist conventions to define "canonical" Huffman codes which resolve the ambiguity. The value of defining canonical codes is in the transmission of the code from the compressor to the decompressor. As long as both sides know and agree on how to unambiguously assign the 0's and 1's, then only the code length for each symbol needs to be transmitted -- not the codes themselves.
The deflate format starts with zero for the shortest code, and increments up. Within each code length, the codes are ordered by the symbol values, i.e. sorting by symbol. So for your code that canonical Huffman code would be:
A - 00
C - 01
E - 10
B - 110
D - 1110
F - 1111
So there the two bit codes are assigned in the symbol order A, C, E, and similarly, the four bit codes are assigned in the order D, F. Shorter codes are assigned before longer codes.
There is a different and interesting ambiguity that arises in finding the code lengths. Depending on the order of combination of equal frequency nodes, i.e. when you have a choice of more than two lowest frequency nodes, you can actually end up with different sets of code lengths that are exactly equally optimal. Even though the code lengths are different, when you multiply the lengths by the frequencies and add them up, you get exactly the same number of bits for the two different codes.
There again, the different codes are all optimal and equally valid. There are ways to resolve that ambiguity as well at the time the nodes to combine are chosen, where the benefit can be minimizing the depth of the tree. That can reduce the table size for table-driven Huffman decoding.
For example, consider the frequencies A: 2, B: 2, C: 1, D: 1. You first combine C and D to get 2. Then you have A, B, and C+D all with frequency 2. Now you can choose to combine either A and B, or C+D with A or B. This gives two different sets of bit lengths. If you combine A and B, you get lengths: A-2, B-2, C-2, and D-2. If you combine C+D with B, you get A-1, B-2, C-3, D-3. Both are optimal codes, since 2x2 + 2x2 + 1x2 + 1x2 = 2x1 + 2x2 + 1x3 + 1x3 = 12, so both codes use 12 bits to represent those symbols that many times.
The problem is, that there is no problem.
You huffman tree is valid, it also gives the exactly same results after encoding and decoding. Just think if you would build a huffman tree by hand, there are always more ways to combine items with equal (or least difference) value. E.g. if you have A B C (everyone frequency 1), you can at first combine A and B, and the result with C, or at first B and C, and the result with a.
You see, there are more correct ways.
Edit: Even with only one possible way to combine the items by frequency, you can get different results because you can assign 1 for the left or for the right branch, so you would get different (correct) results.

Resources