Decode Huffman encoding in HTTP/2

I'm trying to understand how to do Huffman decoding of an HTTP/2 header. Most of the docs I see talk about having a binary tree for the frequency table, but for HTTP/2 it's just a static lookup table. I've got the encoding part working fine, but the decoding is confusing me as I don't know how to tell how many bits I'm supposed to take each time.

A Huffman code is a prefix-free code. This means that no encoded symbol is a prefix of any other encoded symbol.
If there is a symbol represented by the bit string 00111, then there can't be one represented by 001110 or 001111 or 0011100110011 - nothing else will start with 00111. If you've read the bit string 00111, you don't need a marker to tell you that you're at the end of a symbol. If you can produce an output symbol from the bits you've read so far, you must produce that output symbol and start reading the next one.
When you've read 0011, you won't be able to output anything because 0011 is not a symbol. It can't be, because it's a prefix of 00111.
Huffman codes always assign some meaning to every possible bit string (except a string that ends in the middle of a codeword). The fact that 0011 has no meaning by itself means that there must be at least 2 symbol codes starting with 0011. At least one will start with 00110 and at least one will start with 00111.
To decode an input bit stream, you start from a state representing no input, then read a bit, and move to the state representing the bits you've read so far. For example, if you're in state 00 and you read a 1, you move to state 001. When you reach a state that corresponds to a symbol, you output that symbol and move back to the initial state before reading the next bit.
(Note that detecting the end of stream is outside the scope of Huffman coding. The containing protocol must tell you how to know when you're at the end of the bit stream.)
Since every state has exactly 2 possible successor states (corresponding to a 0 bit and a 1 bit), the state transitions form a binary tree. At every non-leaf node you're reading a bit to decide whether to go to the left child or the right child, and at every leaf node you've finished decoding a symbol, you output that symbol, and then go back to the root.
You can build the tree from a list of symbols and their encodings, and then traverse the tree to decode the input. Writing the code to build the tree will probably be the thing that gives you the experience to truly understand the Huffman code.
When you have an input list like
A 00
B 010
C 011
D 100
E 1010
F 1011
G 1100
H 1101
I 1110
J 1111
your tree structure should satisfy
root->symbol is null
root->left->symbol is null
root->left->left->symbol = A
root->left->right->left->symbol = B
root->left->right->right->symbol = C
...
(In this pseudocode, every node has 3 attributes, but in a real language, you will probably find a more efficient representation. Every node needs to have either a symbol or a pair of pointers/references to left and right child nodes.)
Since you have a static code list, you only need to build the tree once and you can reuse it for the lifetime of the program.
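To make that concrete, here is a minimal sketch in C, using the example table above (truncated to the first few symbols; identifiers are my own, and a real HTTP/2 decoder would pull bits out of bytes rather than a string of '0'/'1' characters, but the traversal is identical):

#include <stdio.h>
#include <stdlib.h>

/* One node: either a leaf carrying a symbol, or an internal node
   with two children. symbol is only meaningful at a leaf. */
typedef struct Node {
    struct Node *child[2];   /* child[0] for bit 0, child[1] for bit 1 */
    int symbol;              /* -1 for internal nodes */
} Node;

static Node *new_node(void) {
    Node *n = calloc(1, sizeof *n);
    n->symbol = -1;
    return n;
}

/* Insert one (symbol, code) pair, where code is a string like "010". */
static void insert(Node *root, int symbol, const char *code) {
    Node *n = root;
    for (; *code; code++) {
        int bit = *code - '0';
        if (!n->child[bit])
            n->child[bit] = new_node();
        n = n->child[bit];
    }
    n->symbol = symbol;
}

/* Decode a string of '0'/'1' characters by walking the tree:
   emit a symbol at each leaf, then restart at the root. */
static void decode(const Node *root, const char *bits) {
    const Node *n = root;
    for (; *bits; bits++) {
        n = n->child[*bits - '0'];
        if (n->symbol != -1) {
            putchar(n->symbol);
            n = root;
        }
    }
}

int main(void) {
    Node *root = new_node();
    insert(root, 'A', "00");
    insert(root, 'B', "010");
    insert(root, 'C', "011");
    insert(root, 'D', "100");
    insert(root, 'E', "1010");
    decode(root, "000101010");   /* 00|010|1010: prints ABE */
    putchar('\n');
    return 0;
}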

Related

Store NFA into data structure

I am provided with an NFA and I need to use a data structure to store it (I cannot use a recursive descent parser). Once the NFA is stored in a data structure, I am given a string and must check whether it is valid according to the given NFA.
Can someone please suggest a data structure for storing an NFA? Also, if there are any open-source C language examples, that would help a lot.
An NFA is just a set of triples State x Input -> State. It's usually convenient to represent a State with a small integer in a consecutive range starting at 0 (or some other defined starting point). Input symbols can also be mapped onto small integers, either directly (ASCII code, if all the transitions are ASCII characters) or by keeping an inventory while you read the machine. Making a list of triples is highly inefficient and a hash table is overkill; a plausible intermediate is a two-dimensional array. Remember that the machine is nondeterministic, so a given [state, input symbol] pair might map to a set of next states.
You can determinize the NFA into a DFA using the Subset Construction. That simplifies the data structure but it can also blow up exponentially in size.
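For illustration, here is a small sketch in C of the two-dimensional-array idea (the machine, state count, and symbol mapping below are made up for the example): each delta[state][symbol] entry is a bitmask of possible next states, so nondeterminism is just having more than one bit set, and simulation carries a whole set of current states forward.

#include <stdio.h>

#define NSTATES 3   /* states 0..2, inputs 'a'/'b' mapped to 0/1 */

static unsigned delta[NSTATES][2];   /* bitmask of next states */

static int accepts(const char *input, unsigned start_set, unsigned accept_set) {
    unsigned current = start_set;    /* set of states we might be in */
    for (; *input; input++) {
        int c = *input - 'a';        /* map symbol to column index */
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (current & (1u << s))
                next |= delta[s][c]; /* union of all possible moves */
        current = next;
    }
    return (current & accept_set) != 0;  /* reached any accepting state? */
}

int main(void) {
    /* Example NFA accepting strings over {a,b} that end in "ab":
       0 --a--> {0,1},  0 --b--> {0},  1 --b--> {2}. */
    delta[0][0] = (1u << 0) | (1u << 1);
    delta[0][1] = (1u << 0);
    delta[1][1] = (1u << 2);
    printf("%d\n", accepts("abab", 1u << 0, 1u << 2));  /* prints 1 */
    printf("%d\n", accepts("abba", 1u << 0, 1u << 2));  /* prints 0 */
    return 0;
}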

Checksum calculation in Visual Basic

I want to communicate with a medical analyzer and send some messages to it. But the analyzer requests a control character or checksum at the end of the message to validate it.
You'll excuse my limited knowledge of English, but according to the manual, here is how to calculate that checksum:
The control character is the exclusion logic sum (exclusive OR), in character units, from the start of the text to the end of the text and consists of a 1 byte binary. Moreover, since this control character becomes the value of 00~7F by hexadecimal, please process not to mistake for the control code used in transmission.
So can you please tell me how to get this control character based on that information? I did not understand well what is written, because of my limited English.
I'm using Visual Basic for programming.
Thanks
The manual isn't in very good English either. My understanding:
The message you will send to the device can be represented as a byte array. You want to XOR together every single byte, which will leave you with a one byte checksum, to be added to the end of the message. (byte 1 XOR byte 2 XOR byte 3 XOR ....)
Everything after "Moreover" appears to say "If you are reading from the device, the final character is a checksum, not a command, so do not treat it as a command, even if it looks like one.". You are writing to the device, so you can ignore that part.
To XOR (bear in mind I don't know VB):
Have a checksum variable, set to 0. Loop over each byte of the message, XOR the checksum variable with that byte, and store the result back in the checksum.
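A sketch in C for concreteness (the message here is an arbitrary example; the same loop translates almost line for line into Visual Basic):

#include <stdio.h>

/* XOR every byte of the message together; the result is the one-byte
   control character. XOR of 7-bit ASCII bytes always stays in 00..7F,
   which matches what the manual says. */
static unsigned char checksum(const unsigned char *msg, size_t len) {
    unsigned char sum = 0;
    for (size_t i = 0; i < len; i++)
        sum ^= msg[i];
    return sum;
}

int main(void) {
    const unsigned char msg[] = "TEST";
    printf("%02X\n", checksum(msg, sizeof msg - 1));  /* prints 16 */
    return 0;
}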

Huffman code for a single character?

Let's say I have a massive string of just a single character, say x. I need to use Huffman encoding.
A Huffman encoding is a full binary tree. So how does one create a Huffman code for just a single character, when we don't need two leaves at all?
jbr's answer is fine; this is just a longer version of it.
Huffman is meant to produce a minimal-length sequence of bits that contains all the information in the original sequence of symbols, assuming that the decoder already knows the set of symbols. If there's only one symbol, the input data contains no information except its length.
In Huffman-based data formats, length is usually encoded separately, not as part of the Huffman-encoded bit sequence itself. The decoder of a single-symbol Huffman code therefore has all the information it needs to reconstruct the input without reading anything from the Huffman-encoded bit sequence. It is logical, then, that the Huffman encoder's output should be 0 bits long.
If you don't have a length encoded separately, then you must have a symbol to represent End Of Sequence so the decoder knows when to stop reading. Then your Huffman tree will have 2 nodes and you won't run into this special case.
If you only have one symbol, then you only need 1 bit per symbol. So you really don't have to do anything except count the number of bits and translate each into your symbol.
You could simply add an edge case in your code.
For example:
check whether there is only one character in your hash table, which yields only the root of the tree without any leaves. In this case, you could assign a code to this root node in your encoding function, such as 0.
The decoding function needs to handle this edge case too.
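As a minimal sketch of that edge case in C (assuming, as discussed above, that the length is transmitted separately): with a one-symbol alphabet given the code 0, decoding just emits the symbol once per bit read.

#include <stdio.h>

/* Edge case: the alphabet has exactly one symbol. Encoding gives the
   lone root node the 1-bit code "0"; decoding maps every bit read
   back to that symbol. */
static void decode_single(char symbol, size_t nbits) {
    for (size_t i = 0; i < nbits; i++)
        putchar(symbol);   /* each '0' bit stands for the symbol */
}

int main(void) {
    decode_single('x', 5);   /* prints xxxxx */
    putchar('\n');
    return 0;
}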

Checking input grammar and deciding a result

Say I have a string "abacabacabadcdcdcd" and I want to apply a simple set of rules:
abaca->a
dcd->d
From left to right, such that the string ends up being "abad". This output will be used to make a decision. After the rules are applied, if the output string does not match a preset string such as "abad", the original string is discarded; e.g., every string should distill down to "abad", and is rejected if it doesn't.
I have this hard-coded right now as regex, but there are many instances of these small rule sets. I am looking for something that will take a set of simple rules and compile (or just a function?) into something I can feed the string to and retrieve a result. The rule sets are independent of each other.
The input is tightly controlled, and the rules in use will be simple. Speed is the most important aspect.
I've looked at Bison and ANTLR, but I don't think I need anything nearly that powerful...
What am I looking for?
Edit: Should mention that the strings are made up of a couple letters. Usually 5, i.e. "abcde". There are no spaces, etc. Just letters.
If it is going to go fast, you can start out with a map that contains your rules as key-value pairs of strings. You can then compile this map into a sort of state machine: a tree with char keys, where the associated value is either a replacement string or another tree.
You then go char by char through your string. Look up the current char in the tree. If you find another tree, look up the next character in that tree, etc.
At some point, either:
the lookup will fail, and then you know that the string you've seen so far is not the prefix of any rule. You can skip the current character and continue with the next.
or you get a replacement string. In that case, you can replace the matched characters (from the first one you looked up through the current one, inclusive) with the replacement string.
The only difficulty is if the replacement can itself be part of a pattern to replace. Example:
ab -> e
cd -> b
The input:
acd -> ab (by rule 2)
ab -> e (by rule 1) ????
Now the question is whether you want to reconsider ab, giving e.
If so, you must start over from the beginning after each replacement. In addition, it will be hard to tell whether the replacement process ever terminates, unless every rule's right-hand side is shorter than its left-hand side; in that case, a finite string must shrink, so it is reduced in a finite amount of time.
But if we don't need to reconsider, the algorithm above will go straight through the string.
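Here is a sketch of that approach in C (assuming lowercase input and a small rule set; the pool and buffer sizes and the identifiers are arbitrary illustrative choices). Note that the asker's own example does need reconsideration (each replaced a or d can feed a new match), so the pass is simply repeated until the string stops changing.

#include <stdio.h>
#include <string.h>

/* A tiny trie over 'a'..'z'; a node with a non-NULL replacement marks
   the end of a rule's left-hand side. */
typedef struct Trie {
    struct Trie *next[26];
    const char *replacement;
} Trie;

static Trie pool[256];
static int npool = 1;               /* pool[0] is the root */

static void add_rule(const char *lhs, const char *rhs) {
    Trie *t = &pool[0];
    for (; *lhs; lhs++) {
        int c = *lhs - 'a';
        if (!t->next[c])
            t->next[c] = &pool[npool++];
        t = t->next[c];
    }
    t->replacement = rhs;
}

/* One left-to-right pass; returns 1 if any replacement was made. */
static int rewrite_once(const char *s, char *out) {
    int changed = 0;
    while (*s) {
        const Trie *t = &pool[0];
        const char *p = s, *hit_end = NULL, *hit_rhs = NULL;
        while (*p && (t = t->next[*p - 'a']) != NULL) {
            p++;
            if (t->replacement) { hit_rhs = t->replacement; hit_end = p; }
        }
        if (hit_rhs) {              /* longest rule matching at s */
            out += sprintf(out, "%s", hit_rhs);
            s = hit_end;
            changed = 1;
        } else {
            *out++ = *s++;          /* no rule starts here */
        }
    }
    *out = '\0';
    return changed;
}

int main(void) {
    static char a[1024] = "abacabacabadcdcdcd", b[1024];
    char *cur = a, *nxt = b;
    add_rule("abaca", "a");
    add_rule("dcd", "d");
    /* Replacements can expose new matches ("a" feeds "abaca"), so
       repeat the pass until the string stops changing. */
    while (rewrite_once(cur, nxt)) {
        char *tmp = cur; cur = nxt; nxt = tmp;
    }
    printf("%s\n", cur);            /* prints abad */
    return 0;
}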

Is The Effectiveness Of Huffman Coding Limited?

My problem is that I have 100,000+ different elements and, as I understand it, Huffman works by assigning the most common element the code 0, the next 10, the next 110, then 1110, 11110 and so on. My question is, if the code for the nth element is n bits long, then surely once I have passed the 32nd term it is more space efficient to just send 32-bit data types as they are, such as ints for example? Have I missed something in the methodology?
Many thanks for any help you can offer. My current implementation works by doing
code = (code << 1) + 2;
to generate each new code (which seems to be correct!), but the only way I could encode over 100,000 elements would be to have an int[] in a makeshift new data type, where to access a value we would read from the int array as one continuous long symbol... that's not as space efficient as just transporting a 32-bit int? Or is it more a case of Huffman's usefulness lying in its prefix codes, and being able to determine each unique value in a continuous bit stream unambiguously?
Thanks
Your understanding is a bit off - take a look at http://en.wikipedia.org/wiki/Huffman_coding. And you have to pack the encoded bits into machine words in order to get compression - Huffman encoded data can best be thought of as a bit-stream.
You seem to understand the principle of prefix codes.
Could you tell us a little more about these 100,000+ different elements you mention?
The fastest prefix codes -- universal codes -- do, in fact, involve a series of bit sequences that can be pre-generated without regard to the actual symbol frequencies. Compression programs that use these codes, as you mentioned, associate the most-frequent input symbol with the shortest bit sequence, the next-most-frequent input symbol with the next-shortest bit sequence, and so on.
What you describe is one particular kind of prefix code: unary coding.
Another popular variant of the unary coding system assigns elements in order of frequency to the fixed codes
"1", "01", "001", "0001", "00001", "000001", etc.
Some compression programs use another popular prefix code: Elias gamma coding.
Elias gamma coding assigns elements, in order of frequency, to the fixed set of codewords
1
010
011
00100
00101
00110
00111
0001000
0001001
0001010
0001011
0001100
0001101
0001110
0001111
000010000
000010001
000010010
...
The 32nd Elias gamma codeword is 11 bits long, roughly a third the length of the 32nd unary codeword.
The 100,000th Elias gamma codeword will be about 33 bits long.
If you look carefully, you can see that each Elias gamma codeword can be split into 2 parts -- the first part is more or less the unary code you are familiar with. That unary code tells the decoder how many more bits follow afterward in the rest of that particular Elias gamma codeword.
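A small sketch in C that reproduces the table above, to make that structure explicit:

#include <stdio.h>

/* Print the Elias gamma codeword for n (n >= 1): floor(log2 n) zero
   bits, then the binary representation of n itself. The run of
   leading zeros tells the decoder how many bits follow. */
static void gamma_code(unsigned n) {
    int len = 0;
    for (unsigned m = n; m > 1; m >>= 1)
        len++;                       /* len = floor(log2 n) */
    for (int i = 0; i < len; i++)
        putchar('0');
    for (int i = len; i >= 0; i--)   /* n in binary, len+1 bits */
        putchar((n >> i) & 1 ? '1' : '0');
    putchar('\n');
}

int main(void) {
    for (unsigned n = 1; n <= 8; n++)
        gamma_code(n);   /* prints 1, 010, 011, 00100, ..., 0001000 */
    return 0;
}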
There are many other kinds of prefix codes.
Many people (confusingly) refer to all prefix codes as "Huffman codes".
When compressing some particular data file, some prefix codes do better at compression than others.
How do you decide which one to use?
Which prefix code is the best for some particular data file?
The Huffman algorithm -- if you neglect the overhead of the Huffman frequency table -- chooses exactly the best prefix code for each data file.
There is no singular "the" Huffman code that can be pre-generated without regard to the actual symbol frequencies.
The prefix code chosen by the Huffman algorithm is usually different for different files.
The Huffman algorithm doesn't compress very well when we really do have 100,000+ unique elements --
the overhead of the Huffman frequency table becomes so large that we can often find some other "suboptimal" prefix code that actually gives better net compression.
Or perhaps some entirely different data compression algorithm might work even better in your application.
The "Huffword" implementation seems to work with around 32,000 or so unique elements,
but the overwhelming majority of Huffman code implementations I've seen work with around 257 unique elements (the 256 possible byte values, and the end-of-text indicator).
You might consider somehow storing your data on a disk in some raw "uncompressed" format.
(With 100,000+ unique elements, you will inevitably end up storing many of those elements in 3 or more bytes).
Those 257-value implementations of Huffman compression will be able to compress that file;
they re-interpret the bytes of that file as 256 different symbols.
My question is, if the code for the nth element is n bits long, then surely once I have passed the 32nd term it is more space efficient to just send 32-bit data types as they are, such as ints for example? Have I missed something in the methodology?
One of the more counter-intuitive features of prefix codes is that some symbols (the rare ones) are "compressed" into much longer bit sequences. If you actually have 2^8 unique symbols (all possible 8-bit numbers), it is not possible to gain any compression by forcing the compressor to use prefix codes limited to 8 bits or less. By allowing the compressor to expand rare values -- to use more than 8 bits to store a rare symbol that we know can be stored in 8 bits -- we free up the compressor to use fewer than 8 bits to store the more-frequent symbols.
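A made-up worked example of that trade-off: with 256 symbols, if one symbol makes up 90% of the input, a prefix code can give it a 1-bit codeword ("0") and give each of the other 255 symbols a 9-bit codeword (all starting with "1"). The average length is then 0.9 x 1 + 0.1 x 9 = 1.8 bits per symbol, versus 8 bits uncompressed, even though 255 of the 256 symbols got longer.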
related:
Maximum number of different numbers, Huffman Compression
