What is the most efficient(*) way of building a canonical Huffman tree?

Assume A is an array where A[0] holds the frequency of the 0-th letter of the alphabet.
What is the most efficient(*) way of calculating the code lengths? (*) I'm not sure, but I guess efficiency can be measured in terms of memory usage or steps required.
All I'm interested in is the array L, where L[0] contains the code length (number of bits) of the 0-th letter of the alphabet, the codes coming from the canonical Huffman tree built from the frequency array A.

If the frequencies form a monotonic sequence, i.e. A[0]<=A[1]<=...<=A[n-1] or A[0]>=A[1]>=...>=A[n-1], then you can generate optimal code lengths in O(n) time and O(1) additional space. The algorithm requires only two simple passes over the array and is very fast. A full description is given in [1].
If your frequencies aren't sorted, you first need to sort them and then apply the above algorithm. In that case the time complexity is O(n log n), and an auxiliary array of n integers is needed to store the sorted order, so the space complexity is O(n).
[1]: Alistair Moffat and Jyrki Katajainen, "In-Place Calculation of Minimum-Redundancy Codes", available online: http://www.diku.dk/~jyrki/Paper/WADS95.pdf
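For the sorted case, a rough Python rendering of the paper's three in-place phases (my own transcription, not the authors' code) might look like the sketch below. It assumes A is already sorted in non-decreasing order and overwrites A with the code lengths; the single-symbol convention is my own choice.

    def code_lengths_in_place(A):
        # A must be sorted in non-decreasing order of frequency.
        # On return, A[i] holds the code length of the i-th (sorted) symbol.
        n = len(A)
        if n == 0:
            return A
        if n == 1:
            A[0] = 1                     # convention: a lone symbol gets a 1-bit code
            return A

        # Phase 1: build the Huffman tree "in place"; entries become either
        # internal-node weights or parent pointers as the merge proceeds.
        A[0] += A[1]
        root, leaf = 0, 2
        for nxt in range(1, n - 1):
            # pick the first of the two smallest available items
            if leaf >= n or A[root] < A[leaf]:
                A[nxt] = A[root]
                A[root] = nxt
                root += 1
            else:
                A[nxt] = A[leaf]
                leaf += 1
            # pick the second item
            if leaf >= n or (root < nxt and A[root] < A[leaf]):
                A[nxt] += A[root]
                A[root] = nxt
                root += 1
            else:
                A[nxt] += A[leaf]
                leaf += 1

        # Phase 2: convert parent pointers into internal-node depths.
        A[n - 2] = 0
        for nxt in range(n - 3, -1, -1):
            A[nxt] = A[A[nxt]] + 1

        # Phase 3: convert internal-node depths into leaf code lengths.
        avail, used, depth = 1, 0, 0
        root, nxt = n - 2, n - 1
        while avail > 0:
            while root >= 0 and A[root] == depth:
                used += 1
                root -= 1
            while avail > used:
                A[nxt] = depth
                nxt -= 1
                avail -= 1
            avail, used, depth = 2 * used, 0, depth + 1
        return A

    print(code_lengths_in_place([1, 1, 2, 3, 5]))   # -> [4, 4, 3, 2, 1]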

Related

Clustering "access-time" data sequences

I have many sequences of data looking like this:
s1 = t11, t12, ..., t1m_1
s2 = t21, t22, ..., t2m_2
...
si = ti1, ti2, ..., tim_i
s_i denotes the i-th sequence, and t_ij is the j-th time at which sequence s_i was accessed.
Each sequence has a different length (m_1 may not equal m_2),
and each sequence's data means that sequence s_i was accessed at times t_i1, t_i2, ..., t_im_i.
My goal is to cluster the similar access-time sequences.
I'm not sure whether I can translate this problem to a time-series problem.
As I understand it, in time-series data each sequence holds a value at each point in time (like stock data), whereas my sequences' values are the times at which the sequence was accessed.
Even if it can be translated into a time-series problem, there is another issue: the access times are very sparse (a sequence may be accessed at 1s, 1000s, 2000s), so a dense time-series representation would be very large, and I don't think I could run a clustering algorithm like DTW on it - its time complexity would be too high.
As you pointed out, DTW would be quite slow, since comparing just the first two series already takes on the order of m_1 * m_2 operations.
To avoid this, and to more easily compare your sequences, you might somehow hammer them into the same format (thereby also losing information).
Here are some ideas:
1. Differentiate to obtain times-between-accesses, and build histograms with fixed bins across all data.
2. Count the number of accesses during each minute of the week (and divide by the number of times that minute-of-week appears in each series). Adapt to the timescales of interest.
3. Count the "number of accesses up until now". So, instead of having data points only when an access was made ("sparse"), you'd get a data point for every timestamp ("dense") showing the number of accesses up to the current one.
#3 would be similar to an "integral image" in computer vision. After this, new summarization techniques open up, like moving averages, or even direct comparison (if the recordings happen in parallel).
In order to pick a more useful representation, you need to think about what is meaningful in your application.
After you get a uniform-length representation, you can use cheaper similarity measures. A typical one is cosine similarity (but be sure to normalize first).
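As a purely illustrative sketch of idea 1 plus the cosine-similarity step (the timestamps and bin edges below are made up; you would pick edges that match your timescales):

    import numpy as np

    def gap_histogram(access_times, bin_edges):
        # Idea 1: differentiate to get times-between-accesses, then histogram
        # with bin edges shared across all sequences so the vectors line up.
        gaps = np.diff(np.sort(np.asarray(access_times, dtype=float)))
        hist, _ = np.histogram(gaps, bins=bin_edges)
        return hist.astype(float)

    def cosine_similarity(u, v):
        # Normalize first, then take the dot product.
        u = u / (np.linalg.norm(u) or 1.0)
        v = v / (np.linalg.norm(v) or 1.0)
        return float(np.dot(u, v))

    # Made-up access times (in seconds) for two sequences of different lengths.
    s1 = [1, 3, 10, 1000, 1010, 2000]
    s2 = [2, 5, 900, 905, 1900]

    # Fixed, shared bin edges (seconds), chosen arbitrarily here.
    edges = [0, 1, 10, 60, 600, 3600, 10**9]

    h1, h2 = gap_histogram(s1, edges), gap_histogram(s2, edges)
    print(cosine_similarity(h1, h2))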

LCS (Longest Common Subsequence) - get best K solutions

The LCS problem gets two strings and returns their longest common subsequence.
For example:
LCS on the strings elephant and eat is 3, as the whole string eat is a subsequence of elephant - indices 0,5,7 or 2,5,7
Another example:
LCS on the strings elephant and olives is 2, as their longest common subsequence is le
The question is whether there is an algorithm that not only returns one optimal solution, but can return the K best solutions.
There is an algorithm to return all the optimal solutions (I think this is what you asked).
As in Wikipedia:
Using the dynamic programming algorithm for two strings, the table is constructed, then backtracked from the end to the beginning recursively, with the added computation that if either of (i, j-1) or (i-1, j) could be the point preceding the current one, then both paths are explored. This leads to exponential computation in the worst case.
There can be an exponential number of these optimal sequences in the worst case!
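Here is a small Python sketch of that table-plus-backtracking approach; it returns the set of all distinct LCSs and, as noted, can blow up exponentially in the worst case. Extending it to the K best solutions of any length would need a different formulation, which is not shown here.

    def all_lcs(a, b):
        """Return the set of all distinct longest common subsequences of a and b."""
        n, m = len(a), len(b)
        # Standard LCS DP table: dp[i][j] = LCS length of a[:i] and b[:j].
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

        def backtrack(i, j):
            if i == 0 or j == 0:
                return {""}
            if a[i - 1] == b[j - 1]:
                return {s + a[i - 1] for s in backtrack(i - 1, j - 1)}
            results = set()
            if dp[i - 1][j] >= dp[i][j - 1]:   # dropping a[i-1] still yields an optimal LCS
                results |= backtrack(i - 1, j)
            if dp[i][j - 1] >= dp[i - 1][j]:   # dropping b[j-1] still yields an optimal LCS
                results |= backtrack(i, j - 1)
            return results

        return backtrack(n, m)

    print(all_lcs("elephant", "eat"))     # {'eat'}
    print(all_lcs("elephant", "olives"))  # {'le'}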

Why is modulus operator slow?

Paraphrasing from the book "Programming Pearls" (about the C language on older machines, since the book is from the late '90s):
Integer arithmetic operations (+, -, *) can take around 10 nanoseconds, whereas the % operator takes up to 100 nanoseconds.
Why is there that much of a difference?
How does the modulus operator work internally?
Is it the same as division (/) in terms of time?
The modulus/modulo operation is usually understood as the integer equivalent of the remainder operation - a side effect or counterpart to division.
Except for some degenerate cases (where the divisor is a power of the operating base - i.e. a power of 2 for most number formats) this is just as expensive as integer division!
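A quick illustration of that degenerate case (my own toy example): when the divisor is a power of two, the remainder of a non-negative integer is just a bit mask, which is why compilers turn such a modulus into an AND when the divisor is a constant.

    # x % 2**k keeps the low k bits of x, so it reduces to a single AND.
    for x in (0, 5, 12, 1023, 98765):
        assert x % 8 == x & 7        # 8 == 2**3, mask 0b111
        assert x % 64 == x & 63      # 64 == 2**6, mask 0b111111
    print("power-of-two modulus is just masking the low bits")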
So the question is really, why is integer division so expensive?
I don't have the time or expertise to analyze this mathematically, so I'm going to appeal to grade school maths:
Consider the number of lines of working out in the notebook (not including the inputs) required for:
Equality (Boolean operations): essentially none - in computer "big O" terms this is known as O(1)
Addition: two, working right to left, one line for the output and one line for the carry. This is an O(N) operation
Long multiplication: n*(n+1) + 2: two lines for each of the digit products (one for the total, one for the carry) plus a final total and carry. So O(N^2), but with a fixed N (32 or 64), and it can be pipelined in silicon to less than that
Long division: unknown, depends upon the argument size - it's a recursive descent and some instances descend faster than others (1,000,000 / 500,000 requires fewer lines than 1,000 / 7). Also, each step is essentially a series of multiplications to isolate the closest factors (although multiple algorithms exist). Feels like an O(N^3) with variable N
So in simple terms, this should give you a feel for why division and hence modulo is slower: computers still have to do long division in the same stepwise fashion that you did in grade school.
If this makes no sense to you, you may have been brought up on school math a little more modern than mine (30+ years ago).
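To make the digit-counting argument concrete, here is a toy base-10 model (my own sketch, counting "lines in the notebook" rather than CPU cycles): addition touches each column once, while long division does repeated compare-and-subtract work for every quotient digit.

    def addition_steps(a, b):
        steps, carry = 0, 0
        while a or b or carry:                   # one line of working per column
            carry, _ = divmod((a % 10) + (b % 10) + carry, 10)
            a, b, steps = a // 10, b // 10, steps + 1
        return steps

    def long_division_steps(dividend, divisor):
        steps, remainder = 0, 0
        for digit in str(dividend):              # one column per digit of the dividend
            remainder = remainder * 10 + int(digit)
            while remainder >= divisor:          # repeated subtraction finds the quotient digit
                remainder -= divisor
                steps += 1
            steps += 1                           # one line to bring down the next digit
        return steps

    print(addition_steps(1_000_000, 7), long_division_steps(1_000_000, 7))
    print(addition_steps(1_000_000, 500_000), long_division_steps(1_000_000, 500_000))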
The Order/Big O notation used above as O(something) expresses the complexity of a computation in terms of the size of its inputs, and expresses a fact about its execution time. http://en.m.wikipedia.org/wiki/Big_O_notation
O(1) executes in constant (but possibly large) time. O(N) takes as much time as the size of its data - so if the data is 32 bits, it takes 32 times the O(1) time of a single step to calculate all of its N steps, and O(N^2) takes N times N (N squared) times the per-step time (or possibly N times M*N for some constant M). Etc.
In the above working I have used O(N) rather than O(N^2) for addition, since the 32 or 64 bits of the first input are calculated in parallel by the CPU. In a hypothetical 1-bit machine a 32-bit addition operation would be O(32^2) and change. The same order reduction applies to the other operations too.

Why a good choice of modulus is "a prime not too close to an exact power of 2"

To generate a hash function, map a key k into one of m slots by taking the remainder of k divided by m. That is, the hash function is
h(k) = k mod m.
I have read in several places that a good choice of m will be:
A prime - I understand that we want to remove common factors, hence a prime number is chosen.
Not too close to an exact power of 2 - why is that?
From Introduction to Algorithms:
When using the division method we avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p then h(k) is just the p lowest-order bits of k. Unless it is known that all low-order p-bit patterns are equally likely, it is better to make the hash function depend on all bits of the key.
For example, if you choose m = 2^3 = 8 (so p = 3), the hashed keys depend only on the lowest 3 (= p) bits of k, which is bad: when you hash, you want as much of the key's data as possible to influence the result, so that you get a good distribution.
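A tiny, made-up demonstration: keys that differ only in their high-order bits all collide when m is a power of two, but spread out when m is a prime.

    # 20 keys whose low 8 bits are identical (they all end in ...00000101 in binary).
    keys = [k * 256 + 5 for k in range(20)]

    print(sorted({k % 8 for k in keys}))   # m = 2**3: every key lands in bucket 5
    print(sorted({k % 7 for k in keys}))   # m = 7 (prime): the high bits spread keys across buckets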

Why is scikit-learn's random forest using so much memory?

I'm using scikit's Random Forest implementation:
sklearn.ensemble.RandomForestClassifier(n_estimators=100,
max_features="auto",
max_depth=10)
After calling rf.fit(...), the process's memory usage increases by 80 MB, i.e. 0.8 MB per tree (I also tried many other settings with similar results; I used top and psutil to monitor the memory usage).
A binary tree of depth 10 should have, at most, 2^11-1 = 2047 elements, which can all be stored in one dense array, allowing the programmer to find parents and children of any given element easily.
Each element needs an index of the feature used in the split and the cut-off, or 6-16 bytes, depending on how economical the programmer is. This translates into 0.01-0.03MB per tree in my case.
Why is scikit's implementation using 20-60x as much memory to store a tree of a random forest?
Each decision (non-leaf) node stores the integer indices of its left and right children (2 x 8 bytes), the index of the feature used for the split (8 bytes), the float value of the threshold for the decision feature (8 bytes), and the decrease in impurity (8 bytes). Furthermore, leaf nodes store the constant target value predicted by the leaf.
You can have a look at the Cython class definition in the source code for the details.
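If you want to see where the bytes actually go, you can sum up the per-node arrays of one fitted tree. The attribute names below are those exposed by scikit-learn's Tree object in current versions, and the dataset is just a synthetic stand-in for yours.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0).fit(X, y)

    tree = rf.estimators_[0].tree_       # the low-level Tree of the first estimator
    arrays = {
        "children_left": tree.children_left,
        "children_right": tree.children_right,
        "feature": tree.feature,
        "threshold": tree.threshold,
        "impurity": tree.impurity,
        "n_node_samples": tree.n_node_samples,
        "weighted_n_node_samples": tree.weighted_n_node_samples,
        "value": tree.value,             # per-node class counts: node_count x n_outputs x n_classes
    }
    print("nodes in this tree:", tree.node_count)
    for name, arr in arrays.items():
        print(f"{name:>24}: {arr.nbytes / 1024:.1f} KiB")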
