variations in huffman encoding codewords - huffman-code

I'm trying to solve some Huffman coding problems, but I always get different values for the codewords (the values, not the lengths).
For example, where the given codeword for character 'c' is 100, my solution has 101.
Here is an example:
Character  Frequency  codeword  my solution
A          22         00        10
B          12         100       010
C          24         01        11
D          6          1010      0110
E          27         11        00
F          9          1011      0111
Both solutions have the same length for codewords, and there is no codeword that is prefix of another codeword.
Does this make my solution valid? Or are there only two valid solutions, the optimal one and the one obtained by flipping its bits?

There are 96 possible ways to assign the 0's and 1's to that set of lengths, and all would be perfectly valid, optimal, prefix codes. You have shown two of them.
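The count of 96 can be checked by brute force. The following is only a sketch, not part of the original answer: `count_prefix_codes` is a hypothetical helper that tries every bit string of the required length for each symbol in turn and keeps only the assignments where no codeword is a prefix of another.

```python
from itertools import product

def count_prefix_codes(lengths):
    """Count the ways to give each symbol a codeword of its required length
    so that no codeword is a prefix of another."""
    def extend(chosen, remaining):
        if not remaining:
            return 1
        total = 0
        for bits in product("01", repeat=remaining[0]):
            word = "".join(bits)
            # Reject the word if it clashes with any codeword chosen so far.
            if all(not word.startswith(c) and not c.startswith(word) for c in chosen):
                total += extend(chosen + [word], remaining[1:])
        return total
    return extend([], list(lengths))

# Code lengths for A..F from the question: 2, 3, 2, 4, 2, 4
print(count_prefix_codes([2, 3, 2, 4, 2, 4]))
```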
There exist conventions to define "canonical" Huffman codes which resolve the ambiguity. The value of defining canonical codes is in the transmission of the code from the compressor to the decompressor. As long as both sides know and agree on how to unambiguously assign the 0's and 1's, then only the code length for each symbol needs to be transmitted -- not the codes themselves.
The deflate format starts with zero for the shortest code, and increments up. Within each code length, the codes are ordered by the symbol values, i.e. sorting by symbol. So for your code that canonical Huffman code would be:
A - 00
C - 01
E - 10
B - 110
D - 1110
F - 1111
So there the two bit codes are assigned in the symbol order A, C, E, and similarly, the four bit codes are assigned in the order D, F. Shorter codes are assigned before longer codes.
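As a rough sketch of that assignment rule (the function name `canonical_codes` and the dict-based input are assumptions made for illustration, not taken from the deflate specification), the codes can be generated from the lengths alone, assuming every length is at least 1:

```python
def canonical_codes(lengths):
    """Map each symbol to its canonical codeword, given {symbol: bit length}."""
    codes = {}
    code = 0
    prev_len = 0
    # Shorter codes first; within one length, symbols in sorted order.
    for sym, n in sorted(lengths.items(), key=lambda item: (item[1], item[0])):
        code <<= (n - prev_len)          # append zeros when the length grows
        codes[sym] = format(code, "0{}b".format(n))
        code += 1                        # next code of this length
        prev_len = n
    return codes

lengths = {"A": 2, "B": 3, "C": 2, "D": 4, "E": 2, "F": 4}
for sym, word in canonical_codes(lengths).items():
    print(sym, "-", word)
```

Both sides only need to agree on this procedure and exchange the lengths.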
There is a different and interesting ambiguity that arises in finding the code lengths. Depending on the order of combination of equal frequency nodes, i.e. when you have a choice of more than two lowest frequency nodes, you can actually end up with different sets of code lengths that are exactly equally optimal. Even though the code lengths are different, when you multiply the lengths by the frequencies and add them up, you get exactly the same number of bits for the two different codes.
There again, the different codes are all optimal and equally valid. There are ways to resolve that ambiguity as well at the time the nodes to combine are chosen, where the benefit can be minimizing the depth of the tree. That can reduce the table size for table-driven Huffman decoding.
For example, consider the frequencies A: 2, B: 2, C: 1, D: 1. You first combine C and D to get 2. Then you have A, B, and C+D all with frequency 2. Now you can choose to combine either A and B, or C+D with A or B. This gives two different sets of bit lengths. If you combine A and B, you get lengths: A-2, B-2, C-2, and D-2. If you combine C+D with B, you get A-1, B-2, C-3, D-3. Both are optimal codes, since 2x2 + 2x2 + 1x2 + 1x2 = 2x1 + 2x2 + 1x3 + 1x3 = 12, so both codes use 12 bits to represent those symbols that many times.
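A small sketch can reproduce both outcomes. The tie-breaking rule here (prefer shallow or prefer deep subtrees among equal frequencies) and the name `huffman_lengths` are assumptions chosen for illustration. With the "deep" rule this particular sketch happens to give the 1-bit code to B rather than A, which is the same kind of tie.

```python
import heapq

def huffman_lengths(freqs, prefer_shallow=True):
    """Return {symbol: code length}; ties between equal frequencies are
    broken by subtree depth (shallow first or deep first)."""
    heap = [(f, 0, order, {sym: 0})
            for order, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    order = len(heap)                    # unique counter, so dicts are never compared
    while len(heap) > 1:
        f1, _, _, d1 = heapq.heappop(heap)
        f2, _, _, d2 = heapq.heappop(heap)
        merged = {s: n + 1 for s, n in {**d1, **d2}.items()}
        depth = max(merged.values())
        key = depth if prefer_shallow else -depth
        heapq.heappush(heap, (f1 + f2, key, order, merged))
        order += 1
    return heap[0][3]

freqs = {"A": 2, "B": 2, "C": 1, "D": 1}
for shallow in (True, False):
    lengths = huffman_lengths(freqs, prefer_shallow=shallow)
    total = sum(freqs[s] * lengths[s] for s in freqs)
    print(sorted(lengths.items()), "total bits:", total)
```

Both runs give a total of 12 bits, but the shallow variant keeps the maximum code length at 2.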

The problem is that there is no problem.
Your Huffman tree is valid, and it gives exactly the same results after encoding and decoding. If you build a Huffman tree by hand, there is often more than one way to combine items with equal (or nearly equal) frequency. E.g. if you have A, B, and C (each with frequency 1), you can first combine A and B and then the result with C, or first combine B and C and then the result with A.
So there is more than one correct tree.
Edit: Even with only one possible way to combine the items by frequency, you can get different results, because you can assign 1 to either the left or the right branch. All of those results are correct.

Related

bin2dec for 16 bit signed binary values (in google sheets)

In google sheets, I'm trying to convert a 16-bit signed binary number to its decimal equivalent, but the built in function that does that only takes up to 10 bits. Other solutions to the problem that I've seen don't preserve the signedness.
So far I've tried:
bin2dec on the leftmost 8 bits * 2^8 + bin2dec on the rightmost 8 bits
hex2dec on the result of bin2dec on the leftmost 8 bits concatenated with bin2dec on the rightmost 8 bits
I've also seen a suggestion that multiplies each bit by its power of 2, eliminating bin2dec altogether.
Any suggestions?
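You will need to use a custom function:
function binary2decimal(bin) {
  // Parses the text as base 2. A leading "-" is honored, but a 16-bit
  // two's-complement pattern such as 1111111111111111 is read as unsigned (65535).
  return parseInt(bin, 2);
}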
Let's assume that your binary number is in cell A2.
First, set the formatting as follows: Format > Number > Plain text.
Then place the following formula in, say, B2:
=ArrayFormula(SUM(SPLIT(REGEXREPLACE(SUBSTITUTE(A2&"","-",""),"(\d)","$1|"),"|")*(2^SEQUENCE(1,LEN(SUBSTITUTE(A2&"","-","")),LEN(SUBSTITUTE(A2&"","-",""))-1,-1))*IF(LEFT(A2)="-",-1,1)))
This formula will process any length binary number, positive or negative, from 1 bit to 16 bits (and, in fact, to a length of 45 or 46 bits).
What this formula does is SPLIT the binary number (without the negative sign if it exists) into its separate bits, one per column; multiply each of those by 2 raised to the power of the matching element of an equal-sized descending SEQUENCE that runs from a high of one less than the LEN (i.e., number) of bits down to zero; SUM the products; and finally apply the negative sign conditionally IF one exists.
If you need to process a range where every value is a positive or negative binary number with exactly 16 bits, you can do so. Suppose that your 16-bit binary numbers are in the range A2:A. First, be sure to select all of Column A and set the formatting to "Plain text" as described above. Then place the following array formula into, say, B2 (being sure that B2:B is empty first):
=ArrayFormula(MMULT(SPLIT(REGEXREPLACE(SUBSTITUTE(FILTER(A2:A,A2:A<>"")&"","-",""),"(\d)","$1|"),"|")*(2^SEQUENCE(1,16,15,-1)),SEQUENCE(16,1,1,0))*IF(LEFT(FILTER(A2:A,A2:A<>""))="-",-1,1))
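For comparison, here is a sketch of the same positional-weight sum outside of Sheets. The function names are made up for illustration. The first function follows the formulas above (a leading "-" marks a negative number). The second is an assumption about what "16-bit signed" might mean instead, namely a two's-complement bit pattern, which the formulas above do not handle.

```python
def signed_binary_to_int(text):
    """Sign-prefixed binary text, e.g. "-1011", to an integer."""
    digits = text.lstrip("-")
    weights = range(len(digits) - 1, -1, -1)      # LEN-1 down to 0
    magnitude = sum(int(bit) * 2 ** w for bit, w in zip(digits, weights))
    return -magnitude if text.startswith("-") else magnitude

def twos_complement_to_int(bits, width=16):
    """Interpret a bit string as a two's-complement value of the given width."""
    value = int(bits, 2)
    # If the top bit is set, the pattern represents value - 2^width.
    return value - (1 << width) if value >= (1 << (width - 1)) else value

print(signed_binary_to_int("-1011"))
print(twos_complement_to_int("1111111111111111"))
```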

Prime factorization of integers with Maxima

I want to use Maxima to get the prime factorization of a random positive integer, e.g. 12=2^2*3^1.
What I have tried so far:
a:random(20);
aa:abs(a)+1;
fa:ifactors(aa);
ka:length(fa);
ta:1;
pfza: for i:1 while i<=ka do ta:ta*(fa[i][1])^(fa[i][2]);
ta;
This will be implemented in STACK for Moodle as part of a online exercise for students, so the exact implementation will be a little bit different from this, but I broke it down to these 7 lines.
I generate a random number a, make sure that it is a positive integer by using aa=|a|+1, and then use the ifactors command to get the prime factors of aa. ka is the number of pairwise distinct prime factors, which I then use for the while loop in pfza. When I run this code, everything works except that ta gets simplified: I don't get ta as a product of primes with exponents, but just ta=aa.
I then tried to turn off the simplifier, manually simplifying everything else that I need:
simp:false$
a:random(20);
aa:ev(abs(a)+1,simp);
fa:ifactors(aa);
ka:ev(length(fa),simp);
ta:1;
pfza: for i:1 while i<=ka do ta:ta*(fa[i][1])^(fa[i][2]);
ta;
This however does not compile; I assume the problem is somewhere in the line for pfza, but I don't know why.
Any input on how to fix this? Or another method of getting the factorizing in a non-simplified form?
(1) The for-loop fails because adding 1 to i requires 1 + 1 to be simplified to 2, but simplification is disabled. Here's a way to make the loop work without requiring arithmetic.
(%i10) for f in fa do ta:ta*(f[1]^f[2]);
(%o10) done
(%i11) ta;
            2   2   1
(%o11) ((1 2 ) 2 ) 3
Hmm, that's strange, again because of the lack of simplification. How about this:
(%i12) apply ("*", map (lambda ([f], f[1]^f[2]), fa));
        2  1
(%o12) 2  3
In general I think it's better to avoid explicit indexing anyway.
(2) But maybe you don't need that at all. factor returns an unsimplified expression of the kind you are trying to construct.
(%i13) simp:true;
(%o13) true
(%i14) factor(12);
        2
(%o14) 2  3
I think it's conceptually inconsistent for factor to return an unsimplified expression, but anyway it seems to work here.
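Outside of Maxima, the same idea (keep the factorization as a list of prime/exponent pairs and only format it, never multiply it out) can be sketched as follows. This is plain trial division, not what ifactors does internally, and the function names are made up for illustration.

```python
def prime_factorization(n):
    """Return [(prime, exponent), ...] for an integer n >= 2 by trial division."""
    factors = []
    p = 2
    while p * p <= n:
        if n % p == 0:
            e = 0
            while n % p == 0:
                n //= p
                e += 1
            factors.append((p, e))
        p += 1
    if n > 1:                      # whatever is left is a prime factor
        factors.append((n, 1))
    return factors

def format_factorization(n):
    """Render the factorization as text, e.g. 12 -> "2^2*3^1"."""
    return "*".join("{}^{}".format(p, e) for p, e in prime_factorization(n))

print(format_factorization(12))
```

For n = 1 the list is empty, so that case would need its own handling in an exercise.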

Huffman Code with equal symbol frequencies

Starting with these frequencies:
A:7 F:6 H:1 M:2 N:4 U:5
at a later step I have 5 6 7 7, where one of the 7's is the "A". Which 7 branch I pick to be a 0 or a 1 is arbitrary.
So how do I get a uniquely decodable codeword?
You need to send the code to the receiver, not the frequencies. You can arbitrarily assign 0's and 1's to all of the branches, and then send the codes for each symbol before the coded symbols themselves. There are many possible Huffman codes from the same set of frequencies.
More commonly only the code lengths in bits for each symbol are sent. In this case those are A:2 F:2 H:4 M:4 N:3 U:2. Then a canonical code is used on both ends that depends only on the lengths. In this case, starting with 0's, the canonical code would be:
A: 00
F: 01
U: 10
N: 110
H: 1110
M: 1111
where codes of equal length are assigned to the symbols in lexicographical order. Note that the Huffman tree that was built is not needed. All that is needed is the number of bits for each symbol.
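To illustrate that the receiver needs only the lengths, here is a sketch of a decoder that rebuilds the canonical table and then reads the bit stream. The function name and the use of a text string of "0"/"1" characters are simplifications assumed for the example.

```python
def decode_with_lengths(bits, lengths):
    """Decode a string of '0'/'1' using only {symbol: code length}."""
    # Rebuild the canonical code: shorter codes first, then symbol order.
    table = {}
    code, prev = 0, 0
    for sym, n in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= n - prev
        table[format(code, "0{}b".format(n))] = sym
        code, prev = code + 1, n
    # Because the code is prefix-free, the first match is always the right one.
    out, word = [], ""
    for b in bits:
        word += b
        if word in table:
            out.append(table[word])
            word = ""
    if word:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

lengths = {"A": 2, "F": 2, "H": 4, "M": 4, "N": 3, "U": 2}
print(decode_with_lengths("1110100101111100110", lengths))
```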

Can a SHA-1 hash be all-zeroes?

Is there any input for which SHA-1 will compute a hex value of forty zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely: for any given input the chance is one in 2^160, or roughly 6.84 x 10^-47 percent.
Also, because SHA1 is cryptographically strong, it would also be computationally infeasible (at least with current computer technology -- all bets are off for emergent technologies such as quantum computing) to find out what data would result in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel, be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: If an input ever does hash to that value, your code will be permanently broken.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
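A minimal sketch of both ideas, using Python's standard hashlib. The names `ZERO_SHA1`, `EMPTY_SHA1` and `checked_sha1` are made up for illustration, and whether the empty string can be excluded from your input is an assumption you would have to verify yourself.

```python
import hashlib

ZERO_SHA1 = "0" * 40                           # the all-zero sentinel from the question
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()     # alternative sentinel: hash of the empty input

def checked_sha1(data):
    """Hash data, refusing to return a digest that collides with the sentinel."""
    digest = hashlib.sha1(data).hexdigest()
    if digest == ZERO_SHA1:
        # Astronomically unlikely, but this is the check that must survive into production.
        raise RuntimeError("input hashed to the sentinel value")
    return digest

print(EMPTY_SHA1)
print(checked_sha1(b"abc"))
```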
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0xEFCDAB89 & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
The sum (a leftrotate 5) + f + e + k comes to 0x9fb498b3, which gives us w[0] = 0 - 0x9fb498b3 = 0x604b674d (mod 2^32), etc. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at least 32 other input values.
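The arithmetic for the first step can be checked with a few lines of 32-bit integer math. This sketch only covers the very first iteration, with a..e set to the standard SHA-1 initial values. It computes the sum (a leftrotate 5) + f + e + k and the w[0] that would cancel it.

```python
MASK = 0xFFFFFFFF

def rotl32(x, n):
    """Circular left rotation of a 32-bit value."""
    return ((x << n) | (x >> (32 - n))) & MASK

h0, h1, h2, h3, h4 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0
k = 0x5A827999

f = (h1 & h2) | (~h1 & h3 & MASK)             # (b and c) or ((not b) and d)
partial = (rotl32(h0, 5) + f + h4 + k) & MASK  # everything in temp except w[0]
w0 = (-partial) & MASK                         # the w[0] that makes temp zero

print(hex(f), hex(partial), hex(w0))
```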
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
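To make that condition concrete, here is a sketch for the simplest case: a message that fits in a single block, so that h0..h4 are still the standard initial values when the final addition happens (that single-block restriction is an assumption of the example). It computes the values a..e that the rounds would have to end on.

```python
MASK = 0xFFFFFFFF
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)

def required_final_state(h):
    """Values a..e that would make h_i + x_i wrap to zero mod 2^32."""
    return tuple((-x) & MASK for x in h)

needed = required_final_state(IV)
print([hex(x) for x in needed])

# Adding them back to the chaining values gives an all-zero digest.
print(all((h + x) & MASK == 0 for h, x in zip(IV, needed)))
```

Nothing here says whether any input actually drives the rounds to that state; it only shows that the target state is well defined.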
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People don't seem to realize just how improbable it is (if all humanity focused all of its current resources on finding a zero hash by bruteforcing, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall just because of an insanely low probability.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.

Constrained Sequence to Index Mapping

I'm puzzling over how to map a set of sequences to consecutive integers.
All the sequences follow this rule:
A_0 = 1
A_n >= 1
A_n <= max(A_0 .. A_n-1) + 1
I'm looking for a solution that will be able to, given such a sequence, compute an integer for doing a lookup into a table and, given an index into the table, generate the sequence.
Example: for length 3, there are 5 valid sequences. A fast function for doing the following map (preferably in both directions) would be a good solution
1,1,1 0
1,1,2 1
1,2,1 2
1,2,2 3
1,2,3 4
The point of the exercise is to get a packed table with a 1-1 mapping between valid sequences and cells.
The size of the set is bounded only by the number of unique sequences possible.
I don't know now what the length of the sequence will be, but it will be a small (<12) constant known in advance.
I'll get to this sooner or later, but thought I'd throw it out for the community to have "fun" with in the meantime.
these are different valid sequences
1,1,2,3,2,1,4
1,1,2,3,1,2,4
1,2,3,4,5,6,7
1,1,1,1,2,3,2
these are not
1,2,2,4
2,
1,1,2,3,5
Related to this
There is a natural sequence indexing, but it is not so easy to calculate.
Let's look at A_n for n>0, since A_0 = 1.
Indexing is done in 2 steps.
Part 1:
Group sequences by the places where A_n = max(A_0 .. A_n-1) + 1. Call these places steps.
The steps hold consecutive numbers (2, 3, 4, 5, ...).
At a non-step place k we can put any number from 1 up to one more than the number of steps with index less than k (i.e. up to the running maximum).
Each group can be represented as a binary string where 1 is a step and 0 a non-step. E.g. 001001010 means the group 112aa3b4c, with a<=2, b<=3, c<=4. Because groups are indexed with a binary number, there is a natural indexing of groups, from 0 to 2^length - 1. Let's call the value of a group's binary representation the group order.
Part 2:
Index sequences inside a group. Since the group defines the step positions, only the numbers at non-step positions are variable, and they vary within defined ranges. With that it is easy to index a sequence inside its group, using the lexicographical order of the variable places.
It is easy to calculate the number of sequences in one group. It is a number of the form 1^i_1 * 2^i_2 * 3^i_3 * ..., where i_m is the number of non-step places whose upper limit is m.
Combining:
This gives a 2-part key: <Steps, Group>, which then needs to be mapped to the integers. To do that, we have to find how many sequences are in groups that have order less than some value. For that, let's first find how many sequences are in groups of a given length. That can be computed by passing through all groups and summing the number of sequences, or similarly with a recurrence. Let T(l, n) be the number of sequences of length l (A_0 is omitted) where the maximal value of the first element can be n+1. Then the following holds:
T(l,n) = n*T(l-1,n) + T(l-1,n+1)
T(1,n) = n + 1
Because l + n <= sequence length + 1 there are ~sequence_length^2/2 T(l,n) values, which can be easily calculated.
Next is to calculate number of sequences in groups of order less or equal than given value. That can be done with summing of T(l,n) values. E.g. number of sequences in groups with order <= 1001010 binary, is equal to
T(7,1) + # for 1000000
2^2 * T(4,2) + # for 001000
2^2 * 3 * T(2,3) # for 010
Optimizations:
This will give a mapping but the direct implementation for combining the key parts is >O(1) at best. On the other hand, the Steps portion of the key is small and by computing the range of Groups for each Steps value, a lookup table can reduce this to O(1).
I'm not 100% sure about the formula above, but it should be something like it.
With these remarks and recurrence it is possible to make functions sequence -> index and index -> sequence. But not so trivial :-)
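As a sketch of such a pair of functions: the code below does not implement the group-order scheme described above. It uses a simpler plain lexicographic ranking that reuses the recurrence T(l,n) = n*T(l-1,n) + T(l-1,n+1), with the base case T(0,n) = 1 (an assumption chosen so that length 3 yields 5 sequences). The names `count`, `rank` and `unrank` are made up for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count(remaining, current_max):
    """Number of ways to fill `remaining` more places when the running maximum is current_max."""
    if remaining == 0:
        return 1
    return (current_max * count(remaining - 1, current_max)
            + count(remaining - 1, current_max + 1))

def rank(seq):
    """Index of a valid sequence among all valid sequences of its length (lexicographic)."""
    index, current_max = 0, 1
    for i in range(1, len(seq)):
        remaining = len(seq) - 1 - i
        # Every smaller value at this place is <= current_max, so each
        # contributes a block of count(remaining, current_max) sequences.
        index += (seq[i] - 1) * count(remaining, current_max)
        current_max = max(current_max, seq[i])
    return index

def unrank(index, length):
    """Inverse of rank: rebuild the sequence of the given length from its index."""
    seq, current_max = [1], 1
    for i in range(1, length):
        block = count(length - 1 - i, current_max)
        value = min(index // block, current_max) + 1
        index -= (value - 1) * block
        seq.append(value)
        current_max = max(current_max, value)
    return seq

for i in range(count(2, 1)):
    print(unrank(i, 3), i)
```

With the table of counts cached, both directions take time linear in the sequence length, and the index range is exactly 0 .. count(length - 1, 1) - 1, so the table is packed.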
I think a hash without sorting should be the thing.
As A_0 is always 1, maybe we can think of the sequence as a number in base 12 and use its base-10 value as the key for lookup. (Still not sure about this.)
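This is a Python function that can do the job for you, assuming you have these values stored in a file and you pass the lines to the function:
def valid_lines(lines):
    """Yield each line that parses to a valid sequence, as a list of ints."""
    for line in lines:
        try:
            seq = [int(x) for x in line.strip().split(",")]
        except ValueError:
            continue  # skip blank or malformed lines such as "2,"
        # A_0 must be 1, and each later A_n must lie in 1 .. max(A_0 .. A_n-1) + 1
        if seq[0] == 1 and all(1 <= seq[n] <= max(seq[:n]) + 1 for n in range(1, len(seq))):
            yield seq

with open('/tmp/numbers.txt') as f:
    for valid_line in valid_lines(f):
        print(valid_line)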
Given the sequence, I would sort it, then use the hash of the sorted sequence as the index of the table.
