Decoding Huffman file from canonical form

I am writing a Huffman-coded file in which I store the code lengths of the canonical codes in the file header. During decoding, I am able to regenerate the canonical codes and store them in a std::map<std::uint8_t, std::vector<bool>>. The actual data is read into a single std::vector<bool>. Before anyone suggests that I use std::bitset, let me clarify that Huffman codes have variable bit lengths, which is why I am using std::vector<bool>. So, given that I have my symbols and their corresponding canonical codes, how do I decode my file? I don't know where to go from here, and I couldn't find anything suitable when searching.

You do not need to create the codes or the tree in order to decode canonical codes. All you need is the list of symbols in order and the count of symbols in each code length. By "in order", I mean sorted by code length from shortest to longest, and within each code length, sorted by the symbol value.
Since the canonical codes within a code length are sequential binary integers, you can simply do integer comparisons to see if the bits you have fall within that code range, and if it is, an integer subtraction to determine which symbol it is.
Below is code from puff.c (with minor changes) to show explicitly how this is done. bits(s, 1) returns the next bit from the stream. (This assumes that there is always a next bit.) h->count[len] is the number of symbols that are coded by length len codes, where len is in 0..MAXBITS. If you add up h->count[1], h->count[2], ..., h->count[MAXBITS], that is the total number of symbols coded, and is the length of the h->symbol[] array. The first h->count[1] symbols in h->symbol[] have length 1. The next h->count[2] symbols in h->symbol[] have length 2. And so on.
The values in the h->count[] array, if correct, are constrained to not oversubscribe the possible number of codes that can be coded in len bits. It can be further constrained to represent a complete code, i.e. there is no bit sequence that remains undefined, in which case decode() cannot return an error (-1). For a code to be complete and not oversubscribed, the sum of h->count[len] << (MAXBITS - len) over all len must equal 1 << MAXBITS.
Simple example: if we are coding e with one bit, t with two bits, and a and o with three bits, then h->count[] is {0, 1, 1, 2} (the first value, h->count[0], is not used), and h->symbol[] is {'e','t','a','o'}. Then the code for e is 0, the code for t is 10, a is 110, and o is 111.
#define MAXBITS 15              /* maximum bits in a code */

struct huffman {
    short *count;               /* number of symbols of each length */
    short *symbol;              /* canonically ordered symbols */
};

int decode(struct state *s, const struct huffman *h)
{
    int len;            /* current number of bits in code */
    int code;           /* len bits being decoded */
    int first;          /* first code of length len */
    int count;          /* number of codes of length len */
    int index;          /* index of first code of length len in symbol table */

    code = first = index = 0;
    for (len = 1; len <= MAXBITS; len++) {
        code |= bits(s, 1);             /* get next bit */
        count = h->count[len];
        if (code - count < first)       /* if length len, return symbol */
            return h->symbol[index + (code - first)];
        index += count;                 /* else update for next length */
        first += count;
        first <<= 1;
        code <<= 1;
    }
    return -1;                          /* ran out of codes */
}
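For readers who want to experiment with the same idea outside C, here is a minimal Python sketch of the technique described above. It is not part of puff.c: the names build_tables and is_complete, and the use of a plain iterator of 0/1 integers as the bit source, are assumptions made for illustration. Like the C version, decode assumes there is always a next bit.

```python
MAXBITS = 15  # maximum bits in a code, as in the C excerpt


def build_tables(lengths):
    """Build (count, symbol) from a {symbol: code_length} mapping.

    count[n] is the number of symbols whose code has n bits; symbol lists
    the symbols sorted by code length, then by symbol value.
    """
    count = [0] * (MAXBITS + 1)
    for n in lengths.values():
        count[n] += 1
    symbol = sorted(lengths, key=lambda s: (lengths[s], s))
    return count, symbol


def is_complete(count):
    """True if the code is complete and not oversubscribed."""
    total = sum(count[n] << (MAXBITS - n) for n in range(1, MAXBITS + 1))
    return total == 1 << MAXBITS


def decode(bit_iter, count, symbol):
    """Decode one symbol from an iterator of 0/1 bits; None if no code matches."""
    code = first = index = 0
    for n in range(1, MAXBITS + 1):
        code |= next(bit_iter)             # get next bit
        if code - count[n] < first:        # code has length n
            return symbol[index + (code - first)]
        index += count[n]                  # else update for next length
        first = (first + count[n]) << 1
        code <<= 1
    return None                            # ran out of codes


# The example from the text: e=0, t=10, a=110, o=111
lengths = {"e": 1, "t": 2, "a": 3, "o": 3}
count, symbol = build_tables(lengths)
bits = iter([0, 1, 0, 1, 1, 0, 1, 1, 1])
decoded = [decode(bits, count, symbol) for _ in range(4)]
```

With the example lengths, symbol comes out as e, t, a, o and the nine bits decode back to those four symbols in order.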

Your map contains the relevant information, but it maps symbols to codes.
Yet, the data you are trying to decode comprises codes.
Thus your map can't be used to efficiently find the symbol for a code that was read, since the lookup method expects a symbol as the key. Searching for a code and retrieving the corresponding symbol would be a linear search.
Instead you should reconstruct the Huffman tree you constructed for the compression step.
The frequency values of the inner nodes are irrelevant here, but you will need the leaf nodes at the correct positions.
You can create the tree on the fly as you read your file header. Create an empty tree initially. For each symbol to code mapping you read, create the corresponding nodes in the tree.
E.g. if the symbol 'D' has been mapped to the code 101, then make sure there is a right child node at the root, which has a left child node, which has a right child node, which contains the symbol 'D', creating the nodes if they were missing.
Using that tree you can then decode the stream as follows (pseudo-code, assuming taking a right child corresponds to adding a 1 to the code):
// use a node variable to remember the position in the tree while reading bits
node n = tree.root
while (stream not fully read) {
    read next bit into boolean b
    if (b == true) {
        n = n.rightChild
    } else {
        n = n.leftChild
    }
    // check whether we are in a leaf node now
    if (n.leftChild == null && n.rightChild == null) {
        // n is a leaf node, thus we have read a complete code
        // add the corresponding symbol to the decoded output
        decoded.add(n.getSymbol())
        // reset the search
        n = tree.root
    }
}
Note that inverting your map to get the lookup into the correct direction will still result in suboptimal performance (compared to binary tree traversal) since it can't exploit the restriction to a smaller search space as the traversal does.
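The tree approach can be sketched in Python as follows. This is only an illustration under assumptions: codes are given as strings of "0"/"1" characters, nodes are plain dictionaries, and the names build_tree and decode_bits are hypothetical.

```python
def build_tree(codes):
    """Build a binary tree from a {symbol: bit_string} mapping.

    Each node is a dict keyed by "0" (left) and "1" (right); a leaf
    additionally stores its symbol under the key "symbol".
    """
    root = {}
    for sym, code in codes.items():
        node = root
        for bit in code:
            node = node.setdefault(bit, {})  # create missing nodes on the way
        node["symbol"] = sym
    return root


def decode_bits(bits, root):
    """Walk the tree bit by bit, emitting a symbol at every leaf."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]              # KeyError here means an invalid code
        if "symbol" in node:          # reached a leaf: one complete code read
            out.append(node["symbol"])
            node = root               # reset the search
    return out


codes = {"e": "0", "t": "10", "a": "110", "o": "111"}
tree = build_tree(codes)
decoded = decode_bits("010110111", tree)
```

Each input bit causes exactly one step down the tree, so decoding is linear in the number of bits read.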

Time Complexity Difference between Two Parsing Implementations Using Global Variables and Return Values

I'm trying to solve the following problem:
A string containing only lower-case letters can be encoded into NUM[encoded string] format. For example, aaa can be encoded into 3[a]. Given an encoded string, find its original string according to the following grammar.
S -> {E}
E -> NUM[S] | STR # NUM[S] means encoded, while STR means not.
NUM -> 1 | 2 | ... | 9
STR -> {LETTER}
LETTER -> a | b | ... | z
Note: in the above grammar {} represents "concatenate 0 or more times".
For example, given the encoded string 3[a2[c]], the result (original string) is accaccacc.
I think this can be parsed by recursive descent parsing, and there are two ways to implement it:
Method I: Have each parsing method return the result string directly.
Method II: Use a global variable, and each parsing method can just append characters to it.
I'm wondering if the two methods share the same time complexity. Suppose the result string is of length t. Then for method II, I think its time complexity should be O(t) because we read and write every character in the result string exactly once. For method I, however, my intuition was that it could be slower because the same substring can be copied multiple times, depending on the depth of recursions. But I'm not able to figure out the exact time complexity to justify my intuition. Can anyone give a hint?

My first suggestion is that your parser should produce an abstract syntax tree rather than directly interpret the string, no matter whether you choose to write a recursive descent parser, a state-based parser or use a parser generator. This greatly enhances maintainability and allows you to perform validation, analyses, and transformations much more easily.
Method I
If I understand you correctly, in Method I you have functions for each grammar construct that return an immutable string, which are then recursively repeated and concatenated. For example, for the top-level concatenation rule
S ::= E*
you would have an interpretation function that looks like this:
string interpretS(NodeS sNode) {
    string result = "";
    for (int i = 0; i < sNode.Expressions.Length; i++) {
        result = result + interpretE(sNode.Expressions[i]);
    }
    return result;
}
... and similarly for the other rules. It is easy to see that the time complexity of Method I is O(n²) where n is the length of the output. (NB: It makes sense to measure the time complexity in terms of the output rather than the input, since the output length is exponential in the length of the input, and so any interpretation method must have time complexity at least exponential in the input, which is not very interesting.) For example, interpreting the input abcdef requires concatenating a and b, then concatenating the result with c, then concatenating that result with d etc., resulting in 1+2+3+4+5 steps. (See here for a more detailed discussion why repeated string concatenation with immutable strings has quadratic complexity.)
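The 1+2+3+4+5 count can be reproduced with a small Python cost model. This is a sketch, not a benchmark: it simply counts how many characters of the existing result would have to be recopied on each concatenation of immutable strings, and the function name concat_cost is made up for the example. (Actual runtimes may differ, since some implementations optimize repeated concatenation.)

```python
def concat_cost(pieces):
    """Concatenate pieces one by one, counting recopied characters.

    With immutable strings, `result + piece` builds a new string, so the
    whole existing result is copied again on every step.
    """
    result = ""
    copied = 0
    for piece in pieces:
        copied += len(result)   # the old result is copied into the new string
        result = result + piece
    return result, copied


result, copied = concat_cost("abcdef")   # six one-letter pieces
```

For n one-letter pieces the count is 0 + 1 + ... + (n - 1) = n(n - 1)/2, which is where the quadratic bound comes from.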
Method II
I interpret your description of Method II like this: instead of returning individual strings which have to be combined, you keep a reference to a mutable structure representing a string that supports appending. This could be a data structure like StringBuilder in Java or .NET, or just a dynamic-length list of characters. The important bit is that appending a string of length b to a string of length a can be done in O(b) (rather than O(a+b)).
Note that for this to work, you don't need a global variable! A cleaner solution would just pass the reference to the resulting structure through (this pattern is called accumulator parameter). So now we would have functions like these:
void interpretS2(NodeS sNode, StringBuilder accumulator) {
    for (int i = 0; i < sNode.Expressions.Length; i++) {
        interpretE2(sNode.Expressions[i], accumulator);
    }
}

void interpretE2(NodeE eNode, StringBuilder accumulator) {
    if (eNode is NodeNum numNode) {
        for (int i = 0; i < numNode.Repetitions; i++) {
            interpretS2(numNode.Expression, accumulator);
        }
    }
    else if (eNode is NodeStr strNode) {
        for (int i = 0; i < strNode.Letters.Length; i++) {
            interpretLetter2(strNode.Letters[i], accumulator);
        }
    }
}

void interpretLetter2(NodeLetter letterNode, StringBuilder accumulator) {
    accumulator.Append(letterNode.Letter);
}
...
As you stated correctly, here the time complexity is O(n), since at each step exactly one character of the output is appended to the accumulator, and no strings are ever copied (only at the very end, when the mutable structure is converted into the output string).
So, at least for this grammar, Method II is clearly preferable.
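As a rough illustration of the accumulator-parameter pattern, here is a compact Python sketch that parses and expands the encoded string in one pass, without building an AST first. The names expand, parse_s and parse_e are hypothetical, and one detail differs from the functions above: instead of interpreting a bracketed body NUM times, it parses the body once and appends the already produced segment NUM - 1 more times, which still writes each output character once.

```python
def expand(encoded):
    """Expand e.g. "3[a2[c]]" into "accaccacc" using a shared accumulator."""
    out = []                          # mutable accumulator of characters
    pos = parse_s(encoded, 0, out)
    if pos != len(encoded):
        raise ValueError("unexpected %r at position %d" % (encoded[pos], pos))
    return "".join(out)               # single copy at the very end


def parse_s(text, pos, out):
    """S -> {E}: parse expressions until ']' or end of input."""
    while pos < len(text) and text[pos] != "]":
        pos = parse_e(text, pos, out)
    return pos


def parse_e(text, pos, out):
    """E -> NUM[S] | letter; appends to out and returns the new position."""
    ch = text[pos]
    if "1" <= ch <= "9":                       # NUM[S]
        if pos + 1 >= len(text) or text[pos + 1] != "[":
            raise ValueError("expected '[' after digit at position %d" % pos)
        start = len(out)
        pos = parse_s(text, pos + 2, out)      # body is appended once
        if pos >= len(text) or text[pos] != "]":
            raise ValueError("missing ']'")
        segment = out[start:]
        for _ in range(int(ch) - 1):           # append the remaining repeats
            out.extend(segment)
        return pos + 1
    if "a" <= ch <= "z":                       # a single letter of STR
        out.append(ch)
        return pos + 1
    raise ValueError("unexpected %r at position %d" % (ch, pos))


result = expand("3[a2[c]]")
```

No function returns a string to its caller, so nothing is recopied on the way back up the recursion.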
Update based on comment
Of course, my interpretation of Method I above is exceedingly naive. A more realistic implementation of the interpretS function would internally use a StringBuilder to concatenate the results from the subexpressions, resulting in linear complexity for the example given above, abcdef.
However, this wouldn't change the worst case complexity O(n²): consider the example
1[a1[b1[c1[d1[e1[f]]]]]]
Even the less naive version of Method I would first append f to e (1 step), then append ef to d (+ 2 steps), then append def to c (+ 3 steps) and so on, amounting to 1+2+3+4+5 steps in total.
The fundamental reason for the quadratic time complexity of Method I is that the results from the subexpressions are copied to create the new subresult to be returned.

Time Complexity is estimated by counting the number of elementary operations performed by an algorithm, supposing that each elementary operation takes a fixed amount of time to perform, see here. Of interest is, however, only how fast this number of operations increases, when the size of the input data set increases.
In your case, the size of the input data means the length of the string to be parsed.
I assume by your 1st method you mean that when a NUM is encountered, its argument is processed by the parser completely NUM times. In your example, when „3“ is read from the input string, „a2[c]“ is processed completely 3 times. Processing here means to traverse the syntax tree down to a leaf, and append the leaf value, here the „c“, to the output string.
I also assume by your 2nd method you mean that when a NUM is encountered, its argument is only evaluated once and all intermediate results are stored and re-used. In your example, when „3“ is read from the input string and stored, „a“ is read from the input string and stored, „2[c]“ is processed, i.e. „2“ is read from the input string and stored, and finally „c“ is processed. „c“ is due to the stored „2“ combined to „cc“, and due to the stored „a“ combined to „acc“. This is due to the stored „3“ combined then to „accaccacc“, and „accaccacc“ is output.
The question now is, what is the elementary operation that is relevant to the time complexity? My feeling is that in the 1st case, the stack operations during traversal of the syntax tree are important, while in the 2nd case, string copying operations are important.
Strictly speaking, one can thus not compare the time complexities of both algorithms.
If you are, however, interested in run times instead of time complexities, my guess is that the stack operations take more time than string copying, and that then method 2 is preferable.

Writing Uint16List via IOSink.add, what's the result?

Trying to write audio samples to a file.
I have a list of 16-bit ints:
Uint16List _samples = new Uint16List(0);
I add elements to this list as samples come in.
Then I can write to an IOSink like so:
IOSink _ios = ...
List<int> _toWrite = [];
_toWrite.addAll(_samples);
_ios.add(_toWrite);
or
_ios.add(_samples);
just works, with no type issues, despite the signature of add taking List<int> and not Uint16List.
As I read, in Dart the 'int' type is 64 bit.
Are both writes above identical? Do they produce packed 16-bit ints in this file?

A Uint16List is-a List<int>. It's a list of integers which truncates writes to 16-bits, and always reads out 16-bit integers, but it is a list of integers.
If you copy those integers to a plain growable List<int>, it will contain the same integer values.
So, doing ios.add(_samples) will do the same as ios.add(_toWrite), and most likely neither does what you want.
The IOSink's add method expects a list of bytes. So, it will take a list of integers and assume that they are bytes. That means that it will only use the low 8 bits of each integer, which will likely sound awful if you try to play that back as a 16-bit audio sample.
If you want to store all 16 bits, you need to figure out how to store each 16-bit value in two bytes. The easy choice is to just assume that the platform byte order is fine, and do ios.add(_samples.buffer.asUint8List(_samples.offsetInBytes, _samples.lengthInBytes)). This will make a view of the 16-bit data as twice as many bytes, then write those bytes.
The endianness of those bytes (is the high byte first or last) depends on the platform, so if you want to be safe, you can convert the bytes to a fixed byte order first:
if (Endian.host == Endian.little) {
  // Host order is already little-endian: write a byte view of the samples.
  ios.add(
      _samples.buffer.asUint8List(_samples.offsetInBytes, _samples.lengthInBytes));
} else {
  // Otherwise copy each sample into a buffer with explicit little-endian order.
  var byteData = ByteData(_samples.length * 2);
  for (int i = 0; i < _samples.length; i++) {
    byteData.setUint16(i * 2, _samples[i], Endian.little);
  }
  var littleEndianData = byteData.buffer.asUint8List(0, _samples.length * 2);
  ios.add(littleEndianData);
}
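To see the difference in the resulting bytes, here is a small Python illustration of the two layouts (this is not Dart code; the sample values are made up, and the explicit & 0xFF mask only models the "low 8 bits" behaviour described above).

```python
import struct

samples = [0x0102, 0xFFFE, 0x8000]   # three hypothetical 16-bit samples

# What you get if each value is treated as a single byte: only the low
# 8 bits of every sample survive.
truncated = bytes(v & 0xFF for v in samples)

# Two bytes per sample in a fixed little-endian order ("<" = little-endian,
# "H" = unsigned 16-bit).
packed = struct.pack("<%dH" % len(samples), *samples)

# Reading the packed bytes back recovers the original samples.
restored = list(struct.unpack("<%dH" % len(samples), packed))
```

The truncated form has one byte per sample and loses the high byte; the packed form has twice as many bytes and round-trips exactly.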

Dafny: types with constraints

I am trying some things in Dafny. I want to code a simple datastructure that holds an uncompressed image in memory:
datatype image' = image(width: int, height: int, data: array<byte>)
newtype byte = b: int | 0 <= b <= 255
Actually using it:
method Main() {
var dat := [1,2,3];
var im := image(1, 3, dat);
}
datatype image' = image(width: int, height: int, data: array<byte>)
newtype byte = b: int | 0 <= b <= 255
leads Dafny to complain:
stdin.dfy(3,24): Error: incorrect type of datatype constructor argument (found seq, expected array)
1 resolution/type errors detected in stdin.dfy
I might also want to demand that the byte array is not null, and the size of the byte array is equal to width * height * 3 (to store three bytes representing the RGB value of that pixel).
What way should I enforce this? I looked into newtype, which lets you put some constraints on variables with a certain type, but this works only for numeric types.

Dafny supports both immutable sequences (which are like mathematical sequences of elements) and mutable arrays (which are, like in C and Java, pointers to elements). The error you're getting is telling you that you're calling the image constructor with a seq<byte> value where an array<byte> value is expected.
You can fix the problem by replacing your definition of dat with:
var dat := new byte[3];
dat[0], dat[1], dat[2] := 1, 2, 3;
However, the more typical thing, if you're using a datatype (which is immutable), would be to use a sequence. So, you probably want to instead change your definition of image to:
datatype image = image(width: int, height: int, data: seq<byte>)
Btw, note that Dafny allows you to name a type and one of its constructors the same, so there's no reason to name one of them with a prime (unless you want to, of course).
Another matter of style is to use a half-open interval in your definition of byte:
newtype byte = b: int | 0 <= b < 256
Since half-open intervals are prevalent in computer science, Dafny's syntax favors them. For example, for a sequence s, the expression s[52..57] denotes a subsequence of s of length 5 (that is, 57 minus 52) starting in s at index 52. One more thing, you can also leave out the type int of b if you want, since Dafny will infer it:
newtype byte = b | 0 <= b < 256
You also asked about the possibility of adding a type constraint, so that the sequence in your datatype will always be of length 3. As you discovered, you cannot do this with a newtype, because newtype (at least for now) only works with numeric types. You can (almost) use a subset type, however. This would be done as follows:
type triple = s: seq<byte> | |s| == 3
(In this example, the first vertical bar is like the one in the newtype declaration and says "such that", whereas the next two denote the length operator on sequences.) The trouble with this declaration is that types must be nonempty and Dafny isn't convinced that there are any values that satisfy the constraint of triple. Well, Dafny is not trying very hard. The plan is to add a witness clause to the type (and newtype) declaration, so that a programmer can show Dafny a value that belongs to the triple type. However, this support is waiting for some implementation changes that will allow customized initial values, so you cannot use this constraint at this time.
Not that you want it here, but Dafny would let you give a weaker constraint that admits the empty sequence:
type triple = s: seq<byte> | |s| <= 3
So, instead, if you want to state that an image value has a data component of length 3, introduce a predicate:
predicate GoodImage(img: image)
{
  |img.data| == 3
}
and use this predicate in specifications like pre- and postconditions.
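For comparison only, here is a loose runtime analogue in Python. It is a different technique from Dafny's static verification: the check runs when called, nothing is proved. The Image class and good_image function are hypothetical, and the constraint used is the one from the question (width * height * 3 bytes) rather than the fixed length 3 above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)          # frozen: instances are immutable, like a datatype
class Image:
    width: int
    height: int
    data: bytes                  # bytes is immutable and every element is in 0..255


def good_image(img):
    """Runtime counterpart of a GoodImage predicate: three bytes per pixel."""
    return (
        img.width >= 0
        and img.height >= 0
        and len(img.data) == img.width * img.height * 3
    )


ok = good_image(Image(1, 1, bytes([1, 2, 3])))        # 1 pixel, 3 bytes
too_short = good_image(Image(1, 3, bytes([1, 2, 3])))  # 3 pixels need 9 bytes
```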
Program safely,
Rustan

In Lua Torch, the product of two zero matrices has nan entries

I have encountered a strange behavior of the torch.mm function in Lua/Torch. Here is a simple program that demonstrates the problem.
iteration = 0;
a = torch.Tensor(2, 2);
b = torch.Tensor(2, 2);
prod = torch.Tensor(2,2);
a:zero();
b:zero();
repeat
  prod = torch.mm(a,b);
  ent = prod[{2,1}];
  iteration = iteration + 1;
until ent ~= ent
print ("error at iteration " .. iteration);
print (prod);
The program consists of one loop, in which the program multiplies two zero 2x2 matrices and tests if entry ent of the product matrix is equal to nan. It seems that the program should run forever since the product should always be equal to 0, and hence ent should be 0. However, the program prints:
error at iteration 548
0.000000 0.000000
nan nan
[torch.DoubleTensor of size 2x2]
Why is this happening?
Update:
The problem disappears if I replace prod = torch.mm(a,b) with torch.mm(prod,a,b), which suggests that something is wrong with the memory allocation.
My version of Torch was compiled without BLAS & LAPACK libraries. After I recompiled torch with OpenBLAS, the problem disappeared. However, I am still interested in its cause.

The part of code that auto-generates the Lua wrapper for torch.mm can be found here.
When you write prod = torch.mm(a,b) within your loop it corresponds to the following C code behind the scenes (generated by this wrapper thanks to cwrap):
/* this is the tensor that will hold the results */
arg1 = THDoubleTensor_new();
THDoubleTensor_resize2d(arg1, arg5->size[0], arg6->size[1]);
arg3 = arg1;
/* .... */
luaT_pushudata(L, arg1, "torch.DoubleTensor");
/* effective matrix multiplication operation that will fill arg1 */
THDoubleTensor_addmm(arg1,arg2,arg3,arg4,arg5,arg6);
So:
a new result tensor is created and resized with the proper dimensions,
but this new tensor is NOT initialized, i.e. there is no calloc or explicit fill here so it points to junk memory and could contain NaN-s,
this tensor is pushed on the stack so as to be available on the Lua side as the return value.
The last point means that this returned tensor is different from the initial prod one (i.e. within the loop, prod shadows the initial value).
On the other hand calling torch.mm(prod,a,b) does use your initial prod tensor to store the results (behind the scenes there is no need to create a dedicated tensor in that case). Since in your code snippet you do not initialize / fill it with given values it could also contain junk.
In both cases the core operation is a gemm multiplication of the form C = beta * C + alpha * A * B, with beta=0 and alpha=1. The naive implementation looks like this:
real *a_ = a;
for(i = 0; i < m; i++)
{
  real *b_ = b;
  for(j = 0; j < n; j++)
  {
    real sum = 0;
    for(l = 0; l < k; l++)
      sum += a_[l*lda]*b_[l];
    b_ += ldb;
    /*
     * WARNING: beta*c[j*ldc+i] could give NaN even if beta=0
     * if the other operand c[j*ldc+i] is NaN!
     */
    c[j*ldc+i] = beta*c[j*ldc+i]+alpha*sum;
  }
  a_++;
}
Comments are mine.
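The warning in the comment follows from IEEE 754 arithmetic, where 0 * NaN is NaN. A tiny Python sketch makes this concrete; the function names are made up, and the "safe" variant is only one possible guard for illustration, not the fix that was applied in Torch.

```python
import math


def gemm_entry(beta, c_old, alpha, dot):
    """One output entry as in the naive loop: beta*C + alpha*(A*B)."""
    return beta * c_old + alpha * dot


def gemm_entry_safe(beta, c_old, alpha, dot):
    """Variant that never reads the old value of C when beta is zero."""
    if beta == 0:
        return alpha * dot
    return beta * c_old + alpha * dot


junk = float("nan")                            # stands in for uninitialized memory
naive = gemm_entry(0.0, junk, 1.0, 0.0)        # 0.0 * nan is nan, so the entry is nan
safe = gemm_entry_safe(0.0, junk, 1.0, 0.0)    # old value ignored, entry is 0.0
```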
So:
with torch.mm(a,b): at each iteration, a new result tensor is created without being initialized (it could contain NaN-s). So every iteration presents a risk of returning NaN-s (see above warning),
with torch.mm(prod,a,b): there is the same risk since you did not initialize the prod tensor. BUT: this risk only exists at the first iteration of the repeat / until loop, since right after that prod is filled with 0-s and re-used for the subsequent iterations.
So this is why you do not observe a problem here (it is less frequent).
In case 1: this should be improved at the Torch level, i.e. make sure the wrapper initializes the output (e.g. with THDoubleTensor_fill(arg1, 0);).
In case 2: you should initialize prod initially and use the torch.mm(prod,a,b) construct to avoid any NaN problem.
--
EDIT: this problem is now fixed (see this pull request).

Why is "no code allowed to be all ones" in libjpeg's Huffman decoding?

I'm trying to satisfy myself that METEOSAT images I'm getting from their FTP server are actually valid images. My doubt arises because all the tools I've used so far complain about "Bogus Huffman table definition" - yet when I simply comment out that error message, the image appears quite plausible (a greyscale segment of the Earth's disc).
From https://github.com/libjpeg-turbo/libjpeg-turbo/blob/jpeg-8d/jdhuff.c#L379:
while (huffsize[p]) {
  while (((int) huffsize[p]) == si) {
    huffcode[p++] = code;
    code++;
  }
  /* code is now 1 more than the last code used for codelength si; but
   * it must still fit in si bits, since no code is allowed to be all ones.
   */
  if (((INT32) code) >= (((INT32) 1) << si))
    ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
  code <<= 1;
  si++;
}
If I simply comment out the check, or add a check for huffsize[p] to be nonzero (as in the containing loop's controlling expression), then djpeg manages to convert the image to a BMP which I can view with few problems.
Why does the comment claim that all-ones codes are not allowed?

It claims that because they are not allowed. That doesn't mean that there can't be images out there that don't comply with the standard.
The reason they are not allowed is this (from the standard):
Making entropy-coded segments an integer number of bytes is performed
as follows: for Huffman coding, 1-bits are used, if necessary, to pad
the end of the compressed data to complete the final byte of a
segment.
If the all 1's code was allowed, then you could end up with an ambiguity in the last byte of compressed data where the padded 1's could be another coded symbol.
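The check in the quoted loop can be mirrored in a few lines of Python to see which tables it rejects. This is a sketch modelled on the excerpt, not libjpeg code: assign_codes is a hypothetical name, and the input is assumed to be the list of code lengths in nondecreasing order (without the terminating zero the C array uses).

```python
def assign_codes(sizes):
    """Assign canonical codes to lengths; reject tables with an all-ones code.

    Returns a list of (length, code) pairs in the order of `sizes`.
    """
    codes = []
    code = 0
    p = 0
    si = sizes[0] if sizes else 0
    while p < len(sizes):
        while p < len(sizes) and sizes[p] == si:
            codes.append((si, code))
            p += 1
            code += 1
        # code is now one more than the last code of length si; if it no
        # longer fits in si bits, the last code assigned was all ones.
        if code >= (1 << si):
            raise ValueError("bogus Huffman table at code length %d" % si)
        code <<= 1
        si += 1
    return codes


valid = assign_codes([1, 2, 3])     # codes 0, 10, 110; 111 stays unused
```

A table with lengths 1, 2, 3, 3 would assign 111 to its last symbol, so it is rejected. With such a table, a final byte padded with 1-bits could not be told apart from one that ends in that symbol.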
