Convention for Huffman Coding

Is there a convention for generating a Huffman encoding for a certain alphabet? It seems like the resulting encoding depends both on whether you assign '0' to the left child or to the right child, and on how you decide which node goes into the left subtree.
Wikipedia says that:
As a common convention, bit '0' represents following the left child
and bit '1' represents following the right child.
That answers the first source of variation. However, I couldn't find any convention for the second: I would assume something like putting the node with the lower probability on the left, but several example Huffman trees online don't do this.
So is there a convention for the assignment of nodes to left and right, or is it up to the implementation?
I apologize if this is a duplicate, but I wasn't able to find an answer.

Yes, in fact there is. Not so much a convention for interoperability, but rather for encoding efficiency. It's called Canonical Huffman, where the codes are assigned in numerical order from the shortest codes to the longest codes, and within a single code length, they are assigned in a lexicographical order on the symbols. This permits transmitting only the length of the code for each symbol, as opposed to the entire tree structure.
Generally what is done is to use the Huffman algorithm tree only to determine the number of bits for each symbol. The tree is then discarded. Bit values are never assigned to the branches. The codes are then built directly from the lengths, using the ordering above.
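To make that concrete, here is a small sketch (in JavaScript; the names such as canonicalCodes are mine, for illustration only) of building canonical codes from the per-symbol code lengths: sort by (length, symbol), hand out consecutive code values, and shift left by one bit each time the length grows.
// Build canonical Huffman codes from per-symbol code lengths.
// `lengths` maps symbol -> code length (as produced by the Huffman tree).
function canonicalCodes(lengths) {
  // Sort symbols by code length, then lexicographically by symbol.
  const symbols = Object.keys(lengths)
    .sort((a, b) => lengths[a] - lengths[b] || (a < b ? -1 : 1))

  const codes = {}
  let code = 0
  let prevLen = 0
  for (const s of symbols) {
    code <<= (lengths[s] - prevLen) // left-shift when moving to a longer code length
    codes[s] = code.toString(2).padStart(lengths[s], "0")
    code++
    prevLen = lengths[s]
  }
  return codes
}

// Example: lengths {a: 1, b: 2, c: 2} yield codes a=0, b=10, c=11.
console.log(canonicalCodes({ a: 1, b: 2, c: 2 }))
Transmitting just the lengths (e.g. a:1, b:2, c:2) is enough for a decoder to rebuild exactly the same table.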

Check it out
class Nodo {
  // v = symbol, f = frequency, l/r = left and right children
  constructor(v = null, f = null, l = null, r = null) {
    this.v = v
    this.f = f
    this.l = l
    this.r = r
  }
}

function EnCrypt(text) {
  // Build the list of leaf nodes with their frequencies
  // (counted by comparison rather than RegExp, so special characters are safe)
  let lista = []
  for (let i = 0; i < text.length; i++) {
    if (!lista.find(e => e.v === text[i])) {
      const f = text.split("").filter(ch => ch === text[i]).length
      lista.push(new Nodo(text[i], f))
    }
  }
  lista.sort((a, b) => a.f - b.f) // order from smallest to largest frequency

  // Build the tree: repeatedly merge the two lowest-frequency nodes.
  // Needs at least two distinct symbols in the input.
  function createNew() {
    const nodos = lista.splice(0, 2)
    lista.push(new Nodo(null, nodos[0].f + nodos[1].f, nodos[0], nodos[1]))
    lista.sort((a, b) => a.f - b.f) // keep the list ordered so we always merge the two smallest
    if (lista.length === 1) return lista
    createNew()
  }
  createNew()

  const Arbol = lista[0]

  // Recursively traverse the tree: left adds "0", right adds "1".
  // A leaf yields [symbol, code]; joining with ";" turns the whole result
  // into a string like "a,0;b,10;c,11" (so "," and ";" cannot be symbols).
  function Codigo(nodo, c = "") {
    if (!nodo.l && !nodo.r) return [nodo.v, c]
    return Codigo(nodo.l, c + "0") + ";" + Codigo(nodo.r, c + "1")
  }

  const codigo = Codigo(Arbol).split(";") // ["a,0", "b,10", ...]

  // Encode the text by replacing each symbol with its code
  let finish = ""
  text.split("").forEach(t => {
    codigo.forEach(e => {
      if (e.split(",")[0] === t) finish += e.split(",")[1]
    })
  })

  return {
    cod: finish,  // the encoded bit string
    dicc: codigo  // the symbol/code dictionary
  }
}

function DeCrypt(key, res = "") {
  const { cod, dicc } = key
  let temp = ""
  // Read the bits one at a time; whenever the accumulated bits match a code
  // in the dictionary, emit its symbol and start collecting again.
  for (const bit of cod) {
    temp += bit
    const hit = dicc.find(d => d.split(",")[1] === temp)
    if (hit) {
      res += hit.split(",")[0]
      temp = ""
    }
  }
  return res
}

function Huffman() {
  const text = document.querySelector("#newValue").value
  const comp = EnCrypt(text)
  document.querySelector(".res").innerHTML = JSON.stringify(comp, null, 4)
}

function HuffmanDecode() {
  const key = JSON.parse(document.querySelector("#huffman").value)
  const comp = DeCrypt(key)
  document.querySelector(".res2").innerHTML = comp
}
<h1></h1>
<input type="text"
placeholder="set value (min 2 chars)" id="newValue">
<button onclick="Huffman()">Go</button>
<div class="res" style="white-space: pre-wrap;"></div>
<input type="text" placeholder="paste the result from above" id="huffman"><button onclick="HuffmanDecode()">decode</button>
<div class="res2" style="white-space: pre-wrap;"></div>

Related

What is the difference between makelist() and create_list() in MAXIMA?

I have seen that there are two similar functions to create lists in maxima: create_list() and makelist(). In both cases, the arguments can be
(<an expression>, <a variable>, <the initial value>, <the final value>, < the step>) or
(<an expression>, <a variable>, <a list of values for the variable>).
What is the difference between these two functions? I have tried a couple of examples and they seem to work in the same way:
makelist(i^i,i,1,3); -> [1,4,27]
create_list(i^i,i,1,3); -> [1,4,27]
makelist(i^i,i,[1,2,3]); -> [1,4,27]
create_list(i^i,i,[1,2,3]); -> [1,4,27]
If you wish, you can create your own function, with its own syntax, in maxima.
For example, there is no operator ".." but this makes it happen.
infix("..",80,80,expr,expr,expr);
You can then define the semantics ...
here I just call a function named range.
(a..b):= range(a,b)
This doesn't provide for all the embroidery that you might like.
I think a superior technique for syntax and semantics is to enhance the "for" loop as in this example:
for i:1 thru 5 do collect i;
which returns [1,2,3,4,5]
All the varied mechanisms for "for", including step size, limit, iterating through sets, etc. can then be included in computing a list explicitly comprising a range. The code for this is about 7 lines of lisp, inserted into the source code for "parse-$do".
I also allow
for i in [a,b] summing f(i);
which returns f(b)+f(a).
This enhancement is redundant for the (few) people who are comfortable with map, cons, lambda, apply, append ... in Maxima.
The code, which can be read into any Maxima session, is here:
https://people.eecs.berkeley.edu/~fateman/lisp/doparsesum.lisp

Time Complexity Difference between Two Parsing Implementations Using Global Variables and Return Values

I'm trying to solve the following problem:
A string containing only lower-case letters can be encoded into NUM[encoded string] format. For example, aaa can be encoded into 3[a]. Given an encoded string, find its original string according to the following grammar.
S -> {E}
E -> NUM[S] | STR # NUM[S] means encoded, while STR means not.
NUM -> 1 | 2 | ... | 9
STR -> {LETTER}
LETTER -> a | b | ... | z
Note: in the above grammar {} represents "concatenate 0 or more times".
For example, given the encoded string 3[a2[c]], the result (original string) is accaccacc.
I think this can be parsed by recursive descent parsing, and there are two ways to implement it:
Method I: Let the parsing method return the result string directly.
Method II: Use a global variable, and each parsing method can just append characters to it.
I'm wondering if the two methods share the same time complexity. Suppose the result string is of length t. Then for method II, I think its time complexity should be O(t) because we read and write every character in the result string exactly once. For method I, however, my intuition was that it could be slower because the same substring can be copied multiple times, depending on the depth of recursions. But I'm not able to figure out the exact time complexity to justify my intuition. Can anyone give a hint?
My first suggestion is that your parser should produce an abstract syntax tree rather than directly interpret the string, no matter whether you write a recursive descent parser, a state-based parser, or use a parser generator. This greatly enhances maintainability and allows you to perform validation, analyses, and transformations much more easily.
Method I
If I understand you correctly, in Method I you have functions for each grammar construct that return an immutable string, which are then recursively repeated and concatenated. For example, for the top-level concatenation rule
S ::= E*
you would have an interpretation function that looks like this:
string interpretS(NodeS sNode) {
    string result = "";
    for (int i = 0; i < sNode.Expressions.Length; i++) {
        result = result + interpretE(sNode.Expressions[i]);
    }
    return result;
}
... and similarly for the other rules. It is easy to see that the time complexity of Method I is O(n²) where n is the length of the output. (NB: It makes sense to measure the time complexity in terms of the output rather than the input, since the output length is exponential in the length of the input, and so any interpretation method must have time complexity at least exponential in the input, which is not very interesting.) For example, interpreting the input abcdef requires concatenating a and b, then concatenating the result with c, then concatenating that result with d etc., resulting in 1+2+3+4+5 steps. (See here for a more detailed discussion why repeated string concatenation with immutable strings has quadratic complexity.)
Method II
I interpret your description of Method II like this: instead of returning individual strings which have to be combined, you keep a reference to a mutable structure representing a string that supports appending. This could be a data structure like StringBuilder in Java or .NET, or just a dynamic-length list of characters. The important bit is that appending a string of length b to a string of length a can be done in O(b) (rather than O(a+b)).
Note that for this to work, you don't need a global variable! A cleaner solution would just pass a reference to the resulting structure through (this pattern is called an accumulator parameter). So now we would have functions like these:
void interpretS2(NodeS sNode, StringBuilder accumulator) {
    for (int i = 0; i < sNode.Expressions.Length; i++) {
        interpretE2(sNode.Expressions[i], accumulator);
    }
}

void interpretE2(NodeE eNode, StringBuilder accumulator) {
    if (eNode is NodeNum numNode) {
        for (int i = 0; i < numNode.Repetitions; i++) {
            interpretS2(numNode.Expression, accumulator);
        }
    }
    else if (eNode is NodeStr strNode) {
        for (int i = 0; i < strNode.Letters.Length; i++) {
            interpretLetter2(strNode.Letters[i], accumulator);
        }
    }
}

void interpretLetter2(NodeLetter letterNode, StringBuilder accumulator) {
    accumulator.Append(letterNode.Letter);
}
...
As you stated correctly, here the time complexity is O(n), since at each step exactly one character of the output is appended to the accumulator, and no strings are ever copied (only at the very end, when the mutable structure is converted into the output string).
So, at least for this grammar, Method II is clearly preferable.
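To make the comparison concrete, here is a rough, runnable sketch of the Method II idea in JavaScript (the function names, and the shortcut of copying the already-decoded body instead of re-walking an AST, are mine and not from the question):
// Direct recursive-descent decoder for the NUM[...] grammar, using a single
// output accumulator (the Method II idea), without building an AST first.
function decode(input) {
  let pos = 0
  const out = []                        // accumulator: characters are only appended

  function parseS() {                   // S -> {E}
    while (pos < input.length && input[pos] !== ']') parseE()
  }

  function parseE() {                   // E -> NUM[S] | STR
    if (input[pos] >= '1' && input[pos] <= '9') {
      const num = Number(input[pos]); pos++       // NUM
      pos++                                       // '['
      const start = out.length
      parseS()                                    // decode the body once into the accumulator
      const body = out.slice(start)
      for (let i = 1; i < num; i++) {             // then repeat it num-1 more times
        for (const ch of body) out.push(ch)
      }
      pos++                                       // ']'
    } else {
      out.push(input[pos]); pos++                 // a single LETTER
    }
  }

  parseS()
  return out.join("")
}

console.log(decode("3[a2[c]]")) // "accaccacc"
Each output character is appended to the accumulator exactly once, so the total work is O(t) in the output length t.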
Update based on comment
Of course, my interpretation of Method I above is exceedingly naive. A more realistic implementation of the interpretS function would internally use a StringBuilder to concatenate the results from the subexpressions, resulting in linear complexity for the example given above, abcdef.
However, this wouldn't change the worst case complexity O(n²): consider the example
1[a1[b1[c1[d1[e1[f]]]]]]
Even the less naive version of Method I would first append f to e (1 step), then append ef to d (+ 2 steps), then append def to c (+ 3 steps) and so on, amounting to 1+2+3+4+5 steps in total.
The fundamental reason for the quadratic time complexity of Method I is that the results from the subexpressions are copied to create the new subresult to be returned.
Time complexity is estimated by counting the number of elementary operations performed by an algorithm, assuming that each elementary operation takes a fixed amount of time to perform (see here). What matters, however, is only how fast this number of operations grows as the size of the input data set increases.
In your case, the size of the input data means the length of the string to be parsed.
I assume that by your 1st method you mean that when a NUM is encountered, its argument is processed by the parser completely NUM times. In your example, when "3" is read from the input string, "a2[c]" is processed completely 3 times. Processing here means traversing the syntax tree down to a leaf and appending the leaf value, here the "c", to the output string.
I also assume that by your 2nd method you mean that when a NUM is encountered, its argument is only evaluated once and all intermediate results are stored and re-used. In your example, when "3" is read from the input string, it is stored; "a" is read and stored; then "2[c]" is processed, i.e. "2" is read and stored, and finally "c" is processed. Because of the stored "2", "c" is expanded to "cc"; because of the stored "a", this becomes "acc"; because of the stored "3", this is then expanded to "accaccacc", which is output.
The question now is: what is the elementary operation that is relevant to the time complexity? My feeling is that in the 1st case the stack operations during traversal of the syntax tree are important, while in the 2nd case the string copying operations are important.
Strictly speaking, one thus cannot compare the time complexities of the two algorithms.
If you are, however, interested in run times instead of time complexities, my guess is that the stack operations take more time than the string copying, and that method 2 is then preferable.

customising the parse return value, retaining unnamed terminals

Consider the grammar:
TOP ⩴ 'x' Y 'z'
Y ⩴ 'y'
Here's how to get the exact value ["TOP","x",["Y","y"],"z"] with various parsers (not written manually, but generated from the grammar):
xyz__Parse-Eyapp.eyp
%strict
%tree
%%
start:
    TOP { shift; use JSON::MaybeXS qw(encode_json); print encode_json $_[0] };
TOP:
    'x' Y 'z' { shift; ['TOP', (scalar @_) ? @_ : undef] };
Y:
    'y' { shift; ['Y', (scalar @_) ? @_ : undef] };
%%
xyz__Regexp-Grammars.pl
use 5.028;
use strictures;
use Regexp::Grammars;
use JSON::MaybeXS qw(encode_json);
print encode_json $/{TOP} if (do { local $/; readline; }) =~ qr{
    <nocontext:>
    <TOP>
    <rule: TOP>
        <[anon=(x)]> <[anon=Y]> <[anon=(z)]>
        <MATCH=(?{['TOP', $MATCH{anon} ? $MATCH{anon}->@* : undef]})>
    <rule: Y>
        <[anon=(y)]>
        <MATCH=(?{['Y', $MATCH{anon} ? $MATCH{anon}->@* : undef]})>
}msx;
Code elided for the next two parsers. With Pegex, the functionality is achieved by inheriting from Pegex::Receiver. With Marpa-R2, the customisation of the return value is quite limited, but nested arrays are possible out of the box with a configuration option.
I have demonstrated that the desired customisation is possible, although it's not always easy or straight-forward. These pieces of code attached to the rules are run as the tree is assembled.
The parse method returns nothing but nested Match objects, which are unwieldy. They do not retain the unnamed terminals! (Just to be clear about what I'm talking about: these are the two pieces of data on the RHS of the TOP rule whose values are 'x' and 'z'.) Apparently only data originating from named declarators is added to the tree.
Assigning to the match variable (analogous to how it works in Regexp-Grammars) seems to have no effect. Since the terminals do not make it into the match variable, actions don't help, either.
In summary, here's the grammar and ordinary parse value:
grammar {rule TOP { x <Y> z }; rule Y { y };}.parse('x y z')
How do you get the value ["TOP","x",["Y","y"],"z"] from it? You are not allowed to change the shape of the rules, because that would potentially spoil user-attached semantics; otherwise anything else is fair game. I still think the key to the solution is the match variable, but I can't see how.
Not a full answer, but the Match.chunks method gives you a view of the input string, tokenized into captured and non-captured parts.
It does not, however, give you the ability to distinguish between non-capturing literals in the regex and implicitly matched whitespace.
You could circumvent that by adding positional captures, and use Match.caps
my $m = grammar {rule TOP { (x) <Y> (z) }; rule Y { (y) }}.parse('x y z');
sub transform(Pair $p) {
    given $p.key {
        when Int { $p.value.Str }
        when Str { ($p.key, $p.value.caps.map(&transform)).flat }
    }
}
say $m.caps.map(&transform);
This produces
(x (Y y) z)
so pretty much what you wanted, except that the top-level TOP is missing (which you'll likely only get in there if you hard-code it).
Note that this doesn't cover all edge cases; for example when a capture is quantified, $p.value is an Array, not a match object, so you'll need another level of .map in there, but the general idea should be clear.

OpenCV partition() underlying algorithm

Does anybody know what algorithm is used here?
I want to implement this function myself to do grouping of detection windows.
Thank you.
If you look at the OpenCV source code for the partition function, you will see the following comments:
// This function splits the input sequence or set into one or more equivalence classes and
// returns the vector of labels - 0-based class indexes for each element.
// predicate(a,b) returns true if the two sequence elements certainly belong to the same class.
//
// The algorithm is described in "Introduction to Algorithms"
// by Cormen, Leiserson and Rivest, the chapter "Data structures for disjoint sets"
template<typename _Tp, class _EqPredicate> int partition( const vector<_Tp>& _vec, vector<int>& labels, _EqPredicate predicate=_EqPredicate())
{
// ... etc.
}
This gives you both the source code, and the reference for the algorithm.
So, that's Chapter 21 in this book.
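If you want to reimplement the grouping yourself, the essence is a union-find (disjoint-set) structure driven by the pairwise predicate. Here is a rough sketch of that idea (in JavaScript, with names like partitionByPredicate made up for illustration); it is not the actual OpenCV code:
// Group items into equivalence classes: items i and j end up in the same
// class whenever samePredicate(items[i], items[j]) is true, directly or
// transitively. Returns a 0-based class label for each item.
function partitionByPredicate(items, samePredicate) {
  const parent = items.map((_, i) => i)  // each item starts in its own set

  function find(i) {                     // find the root, with path halving
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]]
      i = parent[i]
    }
    return i
  }

  // Union every pair that the predicate says belongs together.
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      if (samePredicate(items[i], items[j])) {
        parent[find(i)] = find(j)
      }
    }
  }

  // Relabel the roots as consecutive 0-based class indexes.
  const labels = new Map()
  return items.map((_, i) => {
    const root = find(i)
    if (!labels.has(root)) labels.set(root, labels.size)
    return labels.get(root)
  })
}

// Example: group numbers that are within 1 of each other.
console.log(partitionByPredicate([1, 2, 10, 11, 3], (a, b) => Math.abs(a - b) <= 1))
// -> [0, 0, 1, 1, 0]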

Linked-list representation of disjoint sets - omission in Intro to Algorithms text?

Having had success with my last CLRS question, here's another:
In Introduction to Algorithms, Second Edition, pp. 501-502, a linked-list representation of disjoint sets is described, wherein for each list member the following three fields are maintained:
set member
pointer to next object
pointer back to first object (the set representative).
Although linked lists could be implemented by using only a single "Link" object type, the textbook shows an auxiliary "Linked List" object that contains a pointer to the "head" link and the "tail" link. Having a pointer to the "tail" facilitates the Union(x, y) operation, so that one need not traverse all of the links in a larger set x in order to start appending the links of the smaller set y to it.
However, to obtain a reference to the tail link, it would seem that each link object needs to maintain a fourth field: a reference to the Linked List auxiliary object itself. In that case, why not drop the Linked List object entirely and use that fourth field to point directly to the tail?
Would you consider this an omission in the text?
I just opened the text and the textbook description seems fine to me.
From what I understand the data-structure is something like:
struct Set {
    LinkedListObject *head;
    LinkedListObject *tail;
};

struct LinkedListObject {
    Value set_member;
    Set *representative;
    LinkedListObject *next;
};
The book I have (second edition) does not talk of any "auxiliary" linked-list structure. Can you post the relevant paragraph?
Doing a Union would be something like:
// No error checks.
Set * Union(Set *x, Set *y) {
    x->tail->next = y->head;
    x->tail = y->tail;
    LinkedListObject *tmp = y->head;
    while (tmp) {
        tmp->representative = x;
        tmp = tmp->next;
    }
    return x;
}
why not drop the Linked List object entirely and use that fourth field to point directly to the tail?
An insight can be taken from path compression. There, all elements are supposed to point directly to the head (the representative) of the list; if one doesn't, the find-set operation fixes it by updating p[x] and returning the new value. You are proposing something similar for the tail: it only works if there is an operation that keeps those tail pointers up to date, and only then could you rely on them.
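For reference, the find-set with path compression alluded to above (this is the parent-pointer/forest representation from CLRS, not the linked-list one; the JavaScript and the plain parent array are just for illustration) looks roughly like this:
// FIND-SET with path compression (CLRS, disjoint-set forests).
// parent[x] is the parent of x; a root satisfies parent[x] === x.
function findSet(parent, x) {
  if (parent[x] !== x) {
    // Recursively find the representative and repoint x straight at it,
    // so later finds on x (and on everything along its path) are cheap.
    parent[x] = findSet(parent, parent[x])
  }
  return parent[x]
}

// Example: a chain 0 <- 1 <- 2 <- 3; after findSet, 2 and 3 point directly to 0.
const parent = [0, 0, 1, 2]
console.log(findSet(parent, 3)) // 0
console.log(parent)             // [0, 0, 0, 0]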
