LCS (Longest Common Subsequence) - get best K solutions

The LCS problem gets two strings and returns their longest common subsequence.
For example:
The LCS of the strings elephant and eat has length 3, as the whole string eat is a subsequence of elephant (0-based indices 0,5,7 or 2,5,7).
Another example:
The LCS of the strings elephant and olives has length 2, as their longest common subsequence is le.
The question is whether there is an algorithm that not only returns the optimal solution, but can also return the K best solutions.

There is an algorithm to return all the optimal solutions (I think this is what you asked).
As in Wikipedia:
Using the dynamic programming algorithm for two strings, the table is constructed, then backtracked from the end to the beginning recursively, with the added computation that if either of (i, j-1) or (i-1, j) could be the point preceding the current one, then both paths are explored. This leads to exponential computation in the worst case.
There can be an exponential number of these optimal sequences in the worst case!
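The backtracking described in the quote can be sketched roughly as follows. This is only a minimal illustration, assuming we want the set of distinct optimal strings; the function name all_lcs is made up for this example, and the result can still be exponentially large.

```python
from functools import lru_cache


def all_lcs(x, y):
    """Return the set of all distinct longest common subsequences of x and y."""
    m, n = len(x), len(y)
    # table[i][j] = length of the LCS of x[:i] and y[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    @lru_cache(maxsize=None)
    def backtrack(i, j):
        if i == 0 or j == 0:
            return frozenset({""})
        if x[i - 1] == y[j - 1]:
            # Matching characters: every optimal string ends with this character
            return frozenset(s + x[i - 1] for s in backtrack(i - 1, j - 1))
        results = set()
        # Explore every predecessor cell that preserves the optimal length
        if table[i - 1][j] == table[i][j]:
            results |= backtrack(i - 1, j)
        if table[i][j - 1] == table[i][j]:
            results |= backtrack(i, j - 1)
        return frozenset(results)

    return set(backtrack(m, n))
```

For the examples in the question, all_lcs("elephant", "eat") contains only "eat". Note that this enumerates the optimal solutions only, not the K best solutions of differing lengths.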

Sum of all the bits in a Bit Vector of Z3

Given a bit vector in Z3, how can I sum up the individual bits of this vector?
E.g.,
a = BitVecVal(3, 2)
sum_all_bit(a) = 2
Are there any pre-implemented APIs/functions that support this? Thank you!
It isn't part of the bit-vector operations.
You can create an expression as follows:
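from functools import reduce
from z3 import *

def sub(b):
    n = b.size()
    # Extract each bit as a 1-bit vector (bit 0 is the least significant)
    bits = [Extract(i, i, b) for i in range(n)]
    # Widen every bit to n bits (assumes n >= 2) so the sum cannot overflow
    bvs = [Concat(BitVecVal(0, n - 1), bit) for bit in bits]
    # Add the widened bits together
    nb = reduce(lambda x, y: x + y, bvs)
    return nb

print(sub(BitVecVal(4, 7)))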
Of course, floor(log2(n)) + 1 bits for the result will suffice if you prefer.
The page:
https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
has various algorithms for counting bits, which can be translated to Z3/Python with relative ease, I suppose.
My favorite is: https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
which has the nice property that it loops as many times as there are set bits in the input. (But you shouldn't extrapolate from that to any meaningful complexity metric, as you do arithmetic in each loop, which might be costly. The same is true for all these algorithms.)
Having said that, if your input is fully symbolic, you can't really beat the simple iterative algorithm, as you can't short-cut the iteration count. The methods above might work faster if the input has concrete bits.
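For concrete (non-symbolic) integers, Kernighan's method from the page linked above can be sketched in plain Python. This is only an illustration on ordinary Python ints, not Z3 code; the function name popcount_kernighan is made up for this example.

```python
def popcount_kernighan(v):
    """Count the set bits of a non-negative integer."""
    count = 0
    while v:
        v &= v - 1  # clears the lowest set bit
        count += 1
    return count
```

The loop body runs once per set bit, so popcount_kernighan(3) takes two iterations and popcount_kernighan(4) takes one.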
So you're computing the Hamming Weight of a bit vector. Based on a previous question I had, one of the developers had this answer. Based on that original answer, this is how I do it today:
from math import ceil, log2
from z3 import Extract, Sum, ZeroExt
def HW(bvec):
    return Sum([ZeroExt(int(ceil(log2(bvec.size()))), Extract(i, i, bvec)) for i in range(bvec.size())])

How to concatenate word vectors to form sentence vector

I have read in some papers (Tomas Mikolov...) that a better way of forming the vector for a sentence is to concatenate the word vectors.
But since my mathematics is weak, I am still not sure about the details.
For example, suppose that the dimension of a word vector is m and that a sentence has n words.
What is the correct result of the concatenation operation?
Is it a row vector of size 1 x m*n, or a matrix of size m x n?
There are at least three common ways to combine embedding vectors: (a) summing, (b) summing and averaging, or (c) concatenating. So in your case, with concatenation, that would give you a 1 x m*n vector, where n is the number of words in the sentence. In the other cases, the vector length stays m. See gensim.models.doc2vec.Doc2Vec, dm_concat and dm_mean - it allows you to use any of those three options [1,2].
[1] http://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.LabeledLineSentence
[2] https://github.com/piskvorky/gensim/blob/develop/gensim/models/doc2vec.py
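The three options can be illustrated with plain Python lists. This is just a sketch to show the resulting lengths; the function name combine and the toy vectors are made up for this example and are not part of gensim.

```python
def combine(word_vectors, mode):
    """Combine n word vectors (each a list of m floats) into one sentence vector."""
    if mode == "concat":
        # Lay the vectors end to end: the result has length m * n
        return [x for vec in word_vectors for x in vec]
    # Element-wise sum: the result has length m
    summed = [sum(column) for column in zip(*word_vectors)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [s / len(word_vectors) for s in summed]
    raise ValueError("unknown mode: " + mode)


# Toy sentence with n = 3 words and m = 2 dimensions per word
words = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

Here combine(words, "concat") has m*n = 6 entries, while combine(words, "sum") and combine(words, "mean") both keep m = 2 entries.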

Show that L and its complement cannot both be finite

I'm trying to complete these exercises for my automata theory class. The book I have explains this material really badly. I'm kind of lost on how to start, as I'm not sure what I should be looking at.
Let L be any language on a non-empty alphabet. Show that L and the complement of L cannot both be finite.
I know that the complement of L (I'll write L# for the complement of L) is L# = E^* - L, but I don't know where to go from there.
Let a be a letter of your alphabet. Assume for the sake of contradiction that both L and its complement L# are finite. Then their union, L+L#, is finite. But L+L# is the set of all words E^*, so it contains a^n for every natural n, i.e. infinitely many words, a contradiction.
This is as much about infinite sets as it is about automata and languages: you cannot split an infinite set into a finite number of finite sets.

Why is a good choice of mod "a prime not too close to an exact power of 2"?

To build a hash function, map a key k into one of m slots by taking the remainder of k divided by m. That is, the hash function is
h(k) = k mod m.
I have read at several places that a good choice of m will be
A prime - I understand that we want to remove common factors, hence a prime number is chosen
not too close to an exact power of 2 - why is that?
From Introduction to Algorithms:
When using the division method we avoid certain values of m. For example, m should not be a power of 2, since if m=2^p then h(k) is just the p lowest-order bits of k. Unless it is known that all low-order p-bit patterns are equally likely, it is better to make the hash function depend on all bits of the key.
For example, if I choose m = 2^3 = 8, which means p = 3, the hashed keys depend only on the lowest 3 (p) bits of k. That is bad because when you hash you want to use as much of the key as possible to get a good distribution.
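A small experiment makes the low-order-bits effect visible. This is only a sketch: the keys are made up so that they all share the same low three bits, and 13 is picked as an example of a prime that is not a power of 2.

```python
def h(k, m):
    """Division-method hash: the remainder of k divided by m."""
    return k % m


# 16 hypothetical keys that all share the low three bits 0b101
keys = [(i << 3) | 0b101 for i in range(16)]

# With m = 8 only the low 3 bits matter, so every key lands in slot 5
slots_power_of_two = {h(k, 8) for k in keys}

# With m = 13 the higher bits also influence the slot
slots_prime = {h(k, 13) for k in keys}
```

With m = 8 all sixteen keys collide in a single slot, whereas with m = 13 they are spread over all 13 slots.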

What is the most efficient(*) way of building a canonical huffman tree?

Assume A is an array where A[0] holds the frequency of 0-th letter of the alphabet.
What is the most efficient(*) way of calculating code lengths? Not sure, but I guess efficiency can be in terms of memory usage or steps required.
All I'm interested in is the array L where L[0] contains the code length (number of bits) of the 0-th letter of the alphabet, where the code comes from the canonical Huffman tree built from the frequency array A.
If the frequencies form a monotonic sequence, i.e. A[0]<=A[1]<=...<=A[n-1] or A[0]>=A[1]>=...>=A[n-1], then you can generate optimal code lengths in O(n) time and O(1) additional space. This algorithm requires only 2 simple passes over the array and is very fast. A full description is given in [1].
If your frequencies aren't sorted, you first need to sort them and then apply the above algorithm. In this case the time complexity is O(n log n), and an auxiliary array of n integers is needed to store the sorted order, so the space complexity is O(n).
[1] In-Place Calculation of Minimum-Redundancy Codes by Alistair Moffat and Jyrki Katajainen, available online: http://www.diku.dk/~jyrki/Paper/WADS95.pdf
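For comparison, here is a minimal sketch of computing code lengths with the textbook heap-based Huffman construction. Note that this is NOT the in-place algorithm from [1]: it uses extra memory and is not tuned for speed; it only illustrates what the array L should contain. The function name huffman_code_lengths is made up for this example.

```python
import heapq


def huffman_code_lengths(freqs):
    """Return L where L[i] is the Huffman code length of symbol i."""
    n = len(freqs)
    if n == 0:
        return []
    if n == 1:
        return [1]  # a lone symbol still needs one bit
    # Heap entries: (weight, tie_breaker, symbols contained in the subtree)
    heap = [(w, i, [i]) for i, w in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * n
    counter = n  # unique tie-breaker so the lists are never compared
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes all of their symbols one level deeper
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths
```

For the frequencies [5, 9, 12, 13, 16, 45] this gives the lengths [4, 4, 3, 3, 3, 1]. A canonical code can then be assigned from the lengths alone.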
