How to fix the Huffman tree

I am using Huffman coding to compress radar data. The data arrives at 30 fps, each frame is divided into 9×64 chunks, and each chunk is compressed on its own.
I do not want to transmit the Huffman tree along with the compressed data for decoding. Is there any way the tree can be fixed in advance?
Thank you!

You can simply take a large, representative sample of your data, generate a Huffman code for it, and ... that's it. Just use that code on both sides.
If you want to get fancier, you can check whether your data clusters statistically and generate a handful of Huffman codes, one per cluster. Then send a few bits at the front of each chunk to select which Huffman code to use.
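A minimal sketch of the first suggestion in Python. Using a *canonical* code means the two sides only ever need to agree on the code lengths, which you can hard-code after training on representative data; the training symbols and frequencies below are hypothetical stand-ins for your radar values.

```python
import heapq

def huffman_code_lengths(freqs):
    """Code length per symbol, from a symbol -> frequency map (Huffman)."""
    heap = [(f, i, (s,)) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tie = len(heap)  # unique tie-breaker keeps the build deterministic
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:        # every merge adds one bit to members
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, tie, syms1 + syms2))
        tie += 1
    return lengths

def canonical_code(lengths):
    """Canonical codewords: fully determined by the lengths alone."""
    code, next_code, prev_len = {}, 0, 0
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        next_code <<= ln - prev_len
        code[sym] = format(next_code, '0{}b'.format(ln))
        next_code += 1
        prev_len = ln
    return code

# Train once, offline, on representative radar frames (hypothetical counts),
# then compile FIXED_CODE into both encoder and decoder:
training_freqs = {0: 500, 1: 120, 2: 60, 3: 20}
FIXED_CODE = canonical_code(huffman_code_lengths(training_freqs))
```

Since the canonical construction derives the codewords from the lengths in a fixed way, "fixing the tree" reduces to fixing one short length table on both sides.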

Related

How to do K-Means clustering of raw PDF data

I want to cluster PDF documents based on their structure, not only their text content.
The main problem with the text-only approach is that it loses information such as whether a document has a PDF form structure, is just a plain document, or contains pictures.
For our further processing this information is most important.
My main goal is to be able to classify a document mainly by its structure, not only by its text content.
The documents to classify are stored in a SQL database as byte[] (varbinary), so my idea is to use this raw data for classification, without prior text conversion.
If I look at the hex output of these data, I can see repeating structures which seem to correspond to the different document classes I want to separate.
You can see some similar byte patterns as a first impression in my attached screenshot.
So my idea is to train a K-Means model with e.g. a hex output string.
In the next step I would try to find the best number of clusters with the elbow method, which should be around 350-500.
The size of the PDF data varies between 20 kB and 5 MB, mostly around 150 kB. To train the model I have more than 30,000 documents.
When I research this, the results are sparse. I only found this article, which makes me unsure about the best way to solve my task.
https://www.ibm.com/support/pages/clustering-binary-data-k-means-should-be-avoided
My questions are:
Is K-Means the best algorithm for my goal?
What method would you recommend?
How to normalize or transform the data for the best results?
As Ian said in the comments, using the raw data seems to be a bad idea.
With further research I found that the best solution is to first read the structure of the PDF file, e.g. with an approach like this:
https://github.com/Uzi-Granot/PdfFileAnaylyzer
I normalized and clustered the data based on this information, which gave me good results.
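For reference, a small sketch of the kind of structural feature extraction this implies. The marker list here is a hypothetical choice (a real parser like PdfFileAnaylyzer gives much richer structure), but it shows the normalize-then-cluster idea:

```python
# Hypothetical set of PDF name objects / keywords worth counting; the
# right choice depends on which document classes need separating.
MARKERS = [b"/AcroForm", b"/XObject", b"/Image", b"/Font", b"stream"]

def structural_features(raw):
    """Marker counts per kilobyte of raw PDF bytes, as a feature vector.

    Normalizing by file size keeps a 5 MB PDF comparable to a 20 kB one;
    the resulting fixed-length vectors can then be fed to K-Means
    (e.g. scikit-learn's KMeans) instead of the raw bytes.
    """
    kb = max(len(raw), 1) / 1024.0
    return [raw.count(m) / kb for m in MARKERS]
```

The point is that K-Means then runs on short numeric vectors with a meaningful Euclidean distance, rather than on arbitrary-length binary blobs, which is exactly the problem the IBM article warns about.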

At what rate should I sample to make a dependent data stream independent?

I am an undergraduate student volunteering on a computer vision research project. As part of the project, I wish to make a dependent data stream (a stream in which the value of each sample depends on the previous sample) independent. For this, I need to determine an interval at which I must sample the stream so that no two consecutive samples are dependent.
For instance, maybe at a jump factor of 10, that is, sampling after every 10 data points in the stream, the resultant reduced data stream is independent.
My question is how can we determine this scalar jump factor for effective sampling such that the new data stream has independent data points?
From my research, I have been unable to find any statistical test that could be helpful.
Thanks in advance.
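One common heuristic, assuming the dependence shows up as serial correlation, is to estimate the sample autocorrelation and pick the smallest lag at which it becomes negligible. This is a sketch, not a full independence test (it only detects linear dependence), and the 0.05 threshold is an arbitrary choice:

```python
def autocorr(x, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

def jump_factor(x, threshold=0.05, max_lag=100):
    """Smallest lag at which |autocorrelation| drops below the threshold."""
    for lag in range(1, min(max_lag, len(x) - 1)):
        if abs(autocorr(x, lag)) < threshold:
            return lag
    return max_lag
```

For stronger guarantees you could replace the threshold with a formal test of zero autocorrelation at each lag (e.g. a Ljung-Box-style test), but the thinning idea is the same: subsample at the returned lag.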

Huffman vs. adaptive Huffman

I know that adaptive Huffman performs better than the static Huffman algorithm, but I can't figure out why.
With static Huffman, when you build a tree and encode your text, you must send the frequency of each letter along with the encoded text. When decoding, you rebuild the tree the same way it was built during encoding, and then decode the message.
But with adaptive Huffman, when you build a tree and encode text, I guess you must send the built Huffman tree with the message? I may be wrong, but it seems easier to send a table of letter frequencies than a whole tree.
Where am I wrong?
No, you don't send the code. An adaptive Huffman code is adjusted incrementally using the data already received. That process is replicated on the receiving end.
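A toy illustration of that replication. This is deliberately simplified: it rebuilds a static Huffman code from the running counts before every symbol instead of updating a tree incrementally as the real FGK/Vitter algorithms do, but the synchronization principle (both sides derive the same code from the data already seen, so nothing extra is transmitted) is the same. It assumes an agreed alphabet of at least two symbols:

```python
import heapq

def build_code(freqs):
    """Plain Huffman code from the current frequency counts."""
    heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)  # unique tie-breaker: encoder and decoder merge identically
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(msg, alphabet):
    freqs = {s: 1 for s in alphabet}      # both sides start identically
    out = []
    for sym in msg:
        out.append(build_code(freqs)[sym])
        freqs[sym] += 1                   # update only AFTER emitting
    return ''.join(out)

def decode(bits, n, alphabet):
    freqs = {s: 1 for s in alphabet}      # mirror of the encoder's state
    out, pos = [], 0
    for _ in range(n):
        rev = {w: s for s, w in build_code(freqs).items()}
        w = ''
        while w not in rev:               # prefix-free, so this is unambiguous
            w += bits[pos]
            pos += 1
        out.append(rev[w])
        freqs[rev[w]] += 1
    return ''.join(out)
```

Because the decoder updates its counts with each decoded symbol exactly as the encoder did, the two stay in lockstep and no frequency table or tree ever crosses the wire.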

Running Weka over a large ARFF dataset file

I have an ARFF file that contains 700 entries, each with 42,000+ features, for an NLP-related project. Right now the file is in dense format, but its size can be reduced substantially if a sparse representation is used.
I am running on a Core 2 Duo machine with 2 GB RAM, and I am getting an out-of-memory exception despite increasing the heap limit to 1536 MB.
Will it be of any advantage to convert the ARFF file to a sparse representation, or do I need to run my code on a much more powerful machine?
Depending on the algorithm's internal data structure and how the data can be processed (incrementally or all in memory), it will need more memory or less. So the memory you need depends on the algorithm.
A sparse representation is convenient for you because it is compact but, as far as I know, the algorithm will need the same amount of memory to build the model from the same dataset; the input format should be transparent to the algorithm.
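For scale, sparse ARFF lists only the non-zero attributes of each row, which matters a lot with 42,000+ mostly-zero NLP features. A sketch of converting one dense data row (a hypothetical helper, not part of Weka):

```python
def dense_to_sparse_line(values):
    """One dense ARFF data row -> sparse ARFF notation.

    Sparse ARFF lists 0-based "index value" pairs inside braces; every
    attribute omitted from the braces is implicitly 0 (not missing).
    """
    pairs = ['%d %s' % (i, v) for i, v in enumerate(values) if v != '0']
    return '{' + ', '.join(pairs) + '}'
```

Note this mainly shrinks the file. Weka does have a SparseInstance class, so sparse data can also take less heap once loaded, but whether the classifier itself benefits still depends on the algorithm, as the answer says.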

Chunked HDF5 DataSet and slabsize

We are evaluating the performance of HDF5 with chunked datasets.
In particular, we are trying to figure out whether it is possible to read across different contiguous chunks and how performance is affected by doing so.
E.g. we have a dataset with a chunk size of 10 and 100 values, and want to read values 23 to 48. Will there be a great loss of performance?
Many thanks!
I don't know how to answer your question specifically, but I suggest using a chunk size of 1024 (or any higher power of two). I don't know the internals of HDF5, but from my knowledge of filesystems, and from a rough benchmark we did, 1024 was just right.
