What is the data structure for this? - prefix-tree

I was given a set (so no duplicates) of binary strings of arbitrary length and number, and need to find out whether any string is a prefix of another string. For a small set of short strings it's easy: just build a binary tree by reading in each string, and whenever I find a prefix match, I'm done. But with lots of long strings, this method won't be efficient. Just wondering what would be the right data structure and algorithm for this one. Huffman tree? Tries (radix tree)? Or anything else? Thanks.

I would go with a trie. Insert all the strings, marking the last node of each string with a flag. Then, for each string, walk along its path and check whether any node on the path before its own last node has its flag set. If yes, then the string ending at that node is a prefix of the string you're analyzing.
Assuming n = number of strings and k = average length, inserting and analyzing both take O(kn) in total.
A radix tree (a compressed trie whose edges can hold more than a single character) might be more efficient, but is not as easy to implement.
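A minimal sketch of the trie approach in Python, assuming the strings contain only the characters '0' and '1'. The function name has_prefix_pair and the "$" end marker are made up for illustration; nested dicts stand in for trie nodes.

```python
def has_prefix_pair(strings):
    """Return True if some string in `strings` is a proper prefix of another."""
    END = "$"  # hypothetical end-of-string flag; safe because inputs use only '0'/'1'
    root = {}

    # Pass 1: insert every string and flag the node where it ends.
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True

    # Pass 2: walk each string again; a flag seen before the string's
    # own last node means a shorter string ends on this path.
    for s in strings:
        node = root
        for ch in s:
            if END in node:
                return True
            node = node[ch]
    return False
```

For example, has_prefix_pair({"01", "011"}) is True because "01" is a prefix of "011", while has_prefix_pair({"00", "01", "10"}) is False. Both passes touch each character once, which matches the O(kn) estimate.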

Related

How to write a hashmap to a file in a memory efficient format?

I am writing a Huffman coding/decoding algorithm and I am running into the problem that storing the Huffman tree takes up way too much room. Currently, I am converting the tree into a hash map as such -> hashMap<Character(s),Huffman Code> and then storing that hash map. The issue is that, while the string itself compresses well, the Huffman tree data stored in the hash map adds so much overhead that the result actually ends up bigger than the original. Currently I am just naively writing [data, value] pairs to the file, but I imagine there must be some trickier way to do that. Any ideas?
You do not need the tree in order to encode. All you need is the bit lengths for each symbol and a way to order the symbols. See Canonical Huffman Code.
In fact, all you need is the symbols that are coded ordered by bit length, and within bit length sorted by symbol, and then the number of codes of each length. With just those two things you can encode.
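A rough sketch of how the codes can be rebuilt from bit lengths alone, following the usual canonical Huffman assignment. The function name canonical_codes and the dict-based input are assumptions for illustration, and every length is assumed to be at least 1.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from a {symbol: bit_length} mapping."""
    codes = {}
    code = 0
    prev_len = 0
    # Order symbols by bit length, then by symbol within the same length.
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        # Moving to a longer code: append zero bits to the running value.
        code <<= length - prev_len
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes
```

With lengths {"a": 1, "b": 2, "c": 3, "d": 3} this yields a = 0, b = 10, c = 110, d = 111. So the file only needs to store the lengths (or the sorted symbols plus the count of codes per length), not the codes themselves.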

Z3: Fixed-size array of strings

I would like to know how I can use type String and declare an array like
String status[3] = {"init", "phase1", "phase2"}.
I am trying to write an algorithm which has N processes and each process can be in the initial phase, phase 1 or phase 2.
Z3 doesn't have string data-types. For the scenario you hint at, it seems unnecessary to represent the names of processes by strings. You might get away with simply creating separate variables for each process.

which is faster to find a random string: random line order or sorted?

We want to find a random string, e.g.: "ASDF555". We have a very BIG file with unique lines containing this string. Which one is faster (in time, with an easy grep command) to find the mentioned string? If the "BIG file" is:
sorted
or random?
Of course, the ASDF555 could be anything!
We think it's faster to have the lines in random order, since the string could be random too. But we cannot prove this idea.
grep does not "know" your file is sorted, so it needs to go over it line by line - so the fact it's sorted is inconsequential. To rephrase - the fact that a file is sorted cannot harm your search speed - you can also go over a file line by line until you find the desired string.
However, if the file is indeed sorted, you may implement a better searching algorithm (e.g., binary searching) instead of using grep.
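A small sketch of the binary-search idea in Python. It assumes the lines are already loaded into a list, that you are looking for a whole line and not a substring inside a line, and that the file was sorted in an order matching Python's string comparison (for plain ASCII, what LC_ALL=C sort produces). The name contains_line is made up.

```python
import bisect

def contains_line(sorted_lines, target):
    """Binary-search a sorted list of lines for an exact match."""
    i = bisect.bisect_left(sorted_lines, target)
    return i < len(sorted_lines) and sorted_lines[i] == target

lines = sorted(["QWER111", "ASDF555", "ZXCV999"])
found = contains_line(lines, "ASDF555")
```

This takes O(log n) comparisons instead of a full scan. Note that it does not help when the string can appear anywhere inside a line; in that case every line still has to be read.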

Huffman code for a single character?

Let's say I have a massive string consisting of just a single character, say x. I need to use Huffman encoding.
A Huffman encoding is a full binary tree. So how does one create a Huffman code for just a single character, when we don't need two leaves at all?
jbr's answer is fine; this is just a longer version of it.
Huffman is meant to produce a minimal-length sequence of bits that contains all the information in the original sequence of symbols, assuming that the decoder already knows the set of symbols. If there's only one symbol, the input data contains no information except its length.
In Huffman-based data formats, length is usually encoded separately, not as part of the Huffman-encoded bit sequence itself. The decoder of a single-symbol Huffman code therefore has all the information it needs to reconstruct the input without needing to read anything from the Huffman-encoded bit sequence. It is logical, then, that the Huffman encoder's output should be 0 bits long.
If you don't have a length encoded separately, then you must have a symbol to represent End Of Sequence so the decoder knows when to stop reading. Then your Huffman tree will have 2 leaves and you won't run into this special case.
If you only have one symbol, then you only need 1 bit per symbol. So you really don't have to do anything except count the number of bits and translate each into your symbol.
You could simply add an edge case in your code.
For example:
Check whether there is only one character in your hash table, in which case building the tree returns only the root, without any leaves. In that case, you could assign a code such as 0 to this root node in your encoding function.
In the encoding function, you should refer to this edge case too.
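One way the edge case might look, as a sketch and not the definitive way to do it. The function huffman_codes and its heap-of-dicts layout are assumptions for illustration; the single-symbol branch is the part that matters.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a {symbol: bit string} table, with a special case for one symbol."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Edge case: the tree is only a root, so give the lone symbol the code "0".
        (sym,) = freq
        return {sym: "0"}

    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code below them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]
```

Here huffman_codes("xxxx") gives {"x": "0"}, so the input costs 1 bit per symbol. If the length is stored separately, as described earlier, the encoder could emit nothing at all for this case.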

Checking input grammar and deciding a result

Say I have a string "abacabacabadcdcdcd" and I want to apply a simple set of rules:
abaca->a
dcd->d
From left to right, so that the string ends up being "abad". This output will be used to make a decision. After the rules are applied, if the output string does not match a preset string such as "abad", the original string is discarded. E.g., every string should distill down to "abad"; kick it if it doesn't.
I have this hard-coded right now as regex, but there are many instances of these small rule sets. I am looking for something that will take a set of simple rules and compile (or just a function?) into something I can feed the string to and retrieve a result. The rule sets are independent of each other.
The input is tightly controlled, and the rules in use will be simple. Speed is the most important aspect.
I've looked at Bison and ANTLR, but I don't think I need anything nearly that powerful...
What am I looking for?
Edit: Should mention that the strings are made up of a couple letters. Usually 5, i.e. "abcde". There are no spaces, etc. Just letters.
If it has to be fast, you can start out with a map that contains your rules as key-value pairs of strings. You can then compile this map into a sort of state machine: a tree with char keys, where the associated value is either a replacement string or another tree.
You then go char by char through your string. Look up the current char in the tree. If you find another tree, look up the next character in that tree, etc.
At some point, either:
the lookup will fail, and then you know that the string you've seen so far is not the prefix of any rule. You can skip the current character and continue with the next.
or you get a replacement string. In that case, you can replace the characters between the current char and the last one you looked up inclusive by the replacement string.
The only difficulty is if the replacement can itself be part of a pattern to replace. Example:
ab -> e
cd -> b
The input:
acd -> ab (by rule 2)
ab -> e (by rule 1) ????
Now the question is if you want to reconsider ab to give e?
If this is so, you must start over from the beginning after each replacement. In addition, it will be hard to tell whether the replacement ever ends, except if all the rules you have are such that the right hand side is shorter than the left hand side. For, in that case, a finite string will get reduced in a finite amount of time.
But if we don't need to reconsider, the algorithm above will go straight through the string.
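A rough sketch of the whole idea in Python, with nested dicts as the tree. All names (compile_rules, rewrite_once, rewrite_until_stable) are made up, and the None key used to hold a replacement is just one possible convention. rewrite_once is the straight pass without reconsidering; rewrite_until_stable repeats whole passes until nothing changes, which is one simple way to reconsider replacements and only terminates under the shorter-right-hand-side condition above.

```python
def compile_rules(rules):
    """Compile {pattern: replacement} into a nested dict keyed by character."""
    root = {}
    for pattern, replacement in rules.items():
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = replacement  # None key marks "a rule ends here"
    return root

def rewrite_once(trie, text):
    """One left-to-right pass; replacements are not looked at again."""
    out = []
    i = 0
    while i < len(text):
        node, j, match = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                match = (j, node[None])  # first rule that completes wins
                break
        if match:
            end, replacement = match
            out.append(replacement)
            i = end
        else:
            out.append(text[i])  # no rule starts here: keep the char, move on
            i += 1
    return "".join(out)

def rewrite_until_stable(trie, text):
    """Repeat passes until a pass changes nothing."""
    while True:
        new_text = rewrite_once(trie, text)
        if new_text == text:
            return text
        text = new_text

trie = compile_rules({"abaca": "a", "dcd": "d"})
result = rewrite_until_stable(trie, "abacabacabadcdcdcd")
```

For the question's input a single pass gives "abacabadcd", and repeating until stable gives "abad". With the rules ab -> e and cd -> b, one pass turns "acd" into "ab", and the repeated version continues to "e".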
