How to write a hashmap to a file in a memory efficient format?

I am writing a Huffman coding/decoding algorithm and I am running into the problem that storing the Huffman tree takes up way too much room. Currently, I am converting the tree into a hash map of the form hashMap<Character(s), Huffman Code> and then storing that hash map. The issue is that, while the string compresses well, the Huffman tree data stored in the hash map adds so much overhead that the result actually ends up bigger than the original. Currently I am just naively writing [data, value] pairs to the file, but I imagine there must be some sort of trickier way to do that. Any ideas?

You do not need the tree in order to encode. All you need is the bit lengths for each symbol and a way to order the symbols. See Canonical Huffman Code.
In fact, all you need is the symbols that are coded ordered by bit length, and within bit length sorted by symbol, and then the number of codes of each length. With just those two things you can encode.
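As a rough illustration of the canonical approach described above, here is a minimal Python sketch. The function names (`canonical_codes`, `header`, `lengths_from_header`) are made up for this example, and it assumes every symbol has a bit length of at least 1. It shows that the codes can be rebuilt from nothing more than the count of codes per length plus the symbols in sorted order.

```python
from collections import Counter

def canonical_codes(lengths):
    """Assign canonical Huffman codes from a {symbol: bit_length} mapping."""
    # Order by bit length, and within a bit length by symbol.
    ordered = sorted(lengths, key=lambda s: (lengths[s], s))
    codes = {}
    code = 0
    prev_len = 0
    for sym in ordered:
        n = lengths[sym]
        code <<= (n - prev_len)  # moving to a longer length appends zero bits
        codes[sym] = format(code, "0{}b".format(n))
        code += 1
        prev_len = n
    return codes

def header(lengths):
    """Compact description: number of codes per length, then the ordered symbols."""
    ordered = sorted(lengths, key=lambda s: (lengths[s], s))
    counts = Counter(lengths.values())
    max_len = max(counts)
    return [counts.get(n, 0) for n in range(1, max_len + 1)], ordered

def lengths_from_header(counts, ordered):
    """Rebuild the {symbol: bit_length} mapping from the header alone."""
    lengths = {}
    symbols = iter(ordered)
    for n, c in enumerate(counts, start=1):
        for _ in range(c):
            lengths[next(symbols)] = n
    return lengths
```

Only the two lists returned by `header` would need to be written to the file; the decoder calls `lengths_from_header` and then `canonical_codes` to recover the same code table.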

Related

How is a frequency table stored in Huffman coding?

So I'm looking into Huffman coding, and it's a pretty simple algorithm to understand, except I was curious about one thing. Given that "a Huffman tree that omits unused symbols produces the most optimal code lengths", I was curious whether the frequency table of a Huffman tree counts towards the total length of the encoded message? I suppose this question in itself boils down to how the frequency table is stored. Is it part of the encoded message, or is it saved as a separate file?

Yes, unless the two sides agree on a pre-determined code book, the frequency table (or equivalent information sufficient to construct the decoding tree on the receiving end) must be included in the message.
Google Canonical Huffman code for a clever way to cut down on the size of this information.

How to create, update and read a radix tree that won't fit into memory?

I’m interested in using a radix tree (or Patricia trie) to store a hash/dict/array of strings -> values. However, I'm finding that I have too many strings to fit into memory.
I found an article by Algolia about how they solved this problem with their search index and they talk about doing what I’m trying to do: flushing a radix tree to disk as each branch is constructed and only reading back the branches you need.
However, they don’t mention how they do this. The only way I can think of storing a radix tree is either as a full (serialized) object or as a hash/array as a simple Key/Value store.
For example, using a key/value store
SET smile: [...values...]
SET smiled: [...values...]
SET smiles: [...values...]
SET smiling: [...values...]
Then doing a prefix scan to pull out keys/values that MATCH smil*. However, this kind of loses the space-saving benefits of a radix tree plus it would require reconstructing at least part of the radix tree on load.

Why not hash the strings in a pre-processing step, store the mapping out-of-core and build the trie on the hashes instead? This should significantly reduce the memory load since only the hashes are left to consider.

Running the huffman encoding algorithm on an empty file

This may be a stupid question, but compressing an empty file doesn't make any sense, right? The Huffman encoding algorithm wouldn't work on an empty file because it relies on there being at least 2 nodes in the priority queue. If we run the algorithm on an empty file, the only node we would get is the one corresponding to EOF.

Ya, that's right, it doesn't make much sense to run the Huffman encoding on it. Depending on the details of the implementation, it may not crash.
But why would you try to compress an empty file?

You need to somehow encode, at the start of the compressed data, which symbols correspond to which Huffman codes. It is in that representation that the number of symbols would be indicated. If there is only one symbol, which has to be EOF per your description, then the Huffman code is implied to be zero bits long: with only one possible symbol, you need zero bits to represent it.

Huffman code for a single character?

Let's say I have a massive string of just a single character, say x. I need to use Huffman encoding.
A Huffman tree is a full binary tree. So how does one create a Huffman code for just a single character, when we don't need two leaves at all?

jbr's answer is fine; this is just a longer version of it.
Huffman is meant to produce a minimal-length sequence of bits that contains all the information in the original sequence of symbols, assuming that the decoder already knows the set of symbols. If there's only one symbol, the input data contains no information except its length.
In Huffman-based data formats, length is usually encoded separately, not as part of the Huffman-encoded bit sequence itself. The decoder of a single-symbol Huffman code therefore has all the information it needs to reconstruct the input without needing to read anything from the Huffman-encoded bit sequence. It is logical, then, that the Huffman encoder's output should be 0 bits long.
If you don't have a length encoded separately, then you must have a symbol to represent End Of Sequence so the decoder knows when to stop reading. Then your Huffman tree will have two leaves and you won't run into this special case.
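To make the zero-bit case concrete, here is a small Python sketch, assuming the symbol count is stored separately from the bit sequence. The function names and the choice of a `{symbol: bit string}` dictionary for the code table are illustrative only.

```python
def encode_with_count(text, codes):
    """Return (symbol_count, bitstring); codes maps symbol -> bit string."""
    # With one distinct symbol its code is the empty string: zero bits each.
    bits = "".join(codes[ch] for ch in text)
    return len(text), bits

def decode_with_count(count, bits, codes):
    """Rebuild the text from the stored count and the bit string."""
    if len(codes) == 1:
        # Nothing to read: the count alone determines the output.
        (symbol,) = codes
        return symbol * count
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free codes: first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

For a string of four x characters and the table `{"x": ""}`, the encoder emits the count 4 and an empty bit string, and the decoder reproduces the input from the count alone.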

If you only have one symbol, then you only need 1 bit per symbol. So you really don't have to do anything except count the number of bits and translate each into your symbol.

You could simply add an edge case in your code.
For example:
Check whether there is only one character in your hash table; in that case the tree consists of only the root, without any leaves. You could then assign a code such as 0 to this root node.
Your encoding function should handle this edge case too.
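A minimal sketch of that edge case in Python, assuming a heap-based code builder (the function name `huffman_codes` and the dictionary-merging approach are choices made for this example, not something from the answers above):

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Map each symbol of text to a bit string; single-symbol input gets "0"."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Edge case: the tree is just a root, so give it a one-bit code.
        (symbol,) = freq
        return {symbol: "0"}
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), {s: ""}) for s, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code beneath them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]
```

With this variant a run of a single character costs one bit per symbol, so the decoder only has to count bits, at the price of a slightly larger output than the zero-bit scheme.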

Delphi TStringList wrapper to implement on-the-fly compression

I have an application for storing many strings in a TStringList. The strings will be largely similar to one another and it occurs to me that one could compress them on the fly - i.e. store a given string in terms of a mixture of unique text fragments plus references to previously stored fragments. StringLists such as lists of fully-qualified path and filenames should be able to be compressed greatly.
Does anyone know of a TStringList descendant that implements this - i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that a TStringList.SaveToFile produces a compressed file?
While you could implement this by uncompressing the entire stringlist before each access and re-compressing it afterwards, it would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.
TIA
Ross

I don't think there's any freely available implementation around for this (not that I know of anyway, although I've written at least 3 similar constructs in commercial code), so you'd have to roll your own.
The remark Marcelo made about adding items in order is very relevant, as I suppose you'll probably want to compress the data at addition time - having quick access to entries already similar to the one being added, gives a much better performance than having to look up a 'best fit entry' (needed for similarity-compression) over the entire set.
Another thing you might want to read up about, are 'ropes' - a conceptually different type than strings, which I already suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word) you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope', representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.
If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', which is a simple form of compression - just start out each string with an index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not let these references chain too far back, or access speed will suffer greatly. In one simple implementation, I did a mod 16 on the index to establish the entry at which prefix reduction started, which gave me on average about 40% memory savings (this number is completely data-dependent of course).
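The prefix-reduction idea can be sketched in Python as follows. This is a simplified variant, not the implementation described above: it assumes each entry refers only to the entry immediately before it, and that every 16th entry is stored in full so a lookup never replays more than 16 entries. The names `compress`, `get` and `RESTART` are invented for the example.

```python
import os

RESTART = 16  # assumed interval: every 16th entry is stored in full

def compress(strings):
    """Store each string as (shared_prefix_length, suffix) against its predecessor."""
    out = []
    prev = ""
    for i, s in enumerate(strings):
        if i % RESTART == 0:
            shared = 0  # restart point: no dependency on earlier entries
        else:
            shared = len(os.path.commonprefix([prev, s]))
        out.append((shared, s[shared:]))
        prev = s
    return out

def get(entries, index):
    """Rebuild entry `index` by replaying from the nearest restart point."""
    start = index - index % RESTART
    current = ""
    for shared, suffix in entries[start:index + 1]:
        current = current[:shared] + suffix
    return current
```

For example, `compress(["/usr/lib/a", "/usr/lib/b"])` keeps the first string whole and stores the second as `(9, "b")`. Savings depend heavily on how similar neighbouring strings are, which is why insertion order matters.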

You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.
EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.

I suppose you expect general flexibility from the list (including the delete operation). In that case I don't know of any out-of-the-box solution, but I'd suggest one of two approaches:
Split your strings into words, keep a separate growing dictionary to reference the words, and store a list of indexes internally.
Implement something related to the zlib stream available in Delphi, but operating on blocks that contain, for example, 10-100 strings. In this case you still have to decompress/recompress the complete block, but the "price" you pay is lower.
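The block-based approach might look roughly like this in Python, using the standard `zlib` module. This is only a sketch under several assumptions: the class name and `BLOCK_SIZE` are made up, it supports appending and reading only (no delete or in-place update), and it assumes the strings never contain a NUL character, which is used here as the separator.

```python
import zlib

BLOCK_SIZE = 64  # assumed number of strings per block (within the 10-100 range)

class BlockCompressedList:
    """Append-only list of strings stored as zlib-compressed blocks."""

    def __init__(self):
        self._blocks = []  # compressed, full blocks
        self._tail = []    # uncompressed strings not yet forming a full block

    def append(self, s):
        self._tail.append(s)
        if len(self._tail) == BLOCK_SIZE:
            raw = "\0".join(self._tail).encode("utf-8")
            self._blocks.append(zlib.compress(raw))
            self._tail = []

    def __len__(self):
        return len(self._blocks) * BLOCK_SIZE + len(self._tail)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        block, offset = divmod(index, BLOCK_SIZE)
        if block == len(self._blocks):
            return self._tail[offset]
        # Only the one block holding the requested item is decompressed.
        raw = zlib.decompress(self._blocks[block]).decode("utf-8")
        return raw.split("\0")[offset]
```

A random read costs one block decompression rather than a pass over the whole list; a smaller block size makes reads cheaper but usually compresses less well.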

I don't think you really want to compress TStrings items in memory, because it is terribly inefficient. I suggest you look at the TStream implementations in the Zlib unit. Just wrap a regular stream in TDecompressionStream on load and TCompressionStream on save (you can even emit a gzip header there).
Hint: you will want to override LoadFromStream/SaveToStream instead of LoadFromFile/SaveToFile
