I'm currently trying to implement the Viterbi algorithm in Python, more specifically the version presented in an online course.
As it stands, the algorithm is presented this way:
given a sentence with K tokens, we have to generate K tags.
We assume that tag(-1) = tag(-2) = '*'; then, for k going from 1 to K,
we set the tag for each token as follows:
tag(word_k) = argmax( p(k-1, tag_{k-2}, tag_{k-1}) * e(word_k, tag_k) * q(tag_k, tag_{k-1}, tag_{k-1}) )
From my understanding this is straightforward: the p parameters are already calculated at each step (we move forward from k = 1, and we already know p0), and the max over the e and q parameters can be found in a single pass through the tag set (since we can't come up with 2 different tags, we basically have to find the tag T for which the q * e product is maximal, and return that). This saves a lot of time, since we end up close to linear time in big-O terms, instead of the exponential complexity we would get by iterating through all possible word/tag combinations.
Am I getting the core of the algorithm right, or am I missing something?
Thanks in advance
since we can't come up with 2 different tags, we basically have to
find the tag T for which the q * e product is maximal, and return that
Yeah, sounds about right. q is the trigram (transition) probability and e is the emission probability. As you said, e is unchanged between the different paths at each stage, so the max depends only on the other two.
Each tag sequence should start with two asterisks at positions -2 and -1, so the first assumption is correct:

tag(-2) = tag(-1) = '*'

If we take p(k, u, v) to be the maximum probability of a tag sequence whose last two tags at position k are u and v, then, based on what we just said about the beginning asterisks, the base case would be

p(0, '*', '*') = 1.

You had two errors in the general case though. The emission probability is a conditional, e(word_k | tag_k), not e(word_k, tag_k). Also, in the trigram q, tag_{k-1} is repeated two times, so the formula as given is incorrect. The correct recursion is:

p(k, u, v) = max over w of ( p(k-1, w, u) * q(v | w, u) * e(word_k | v) )
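To make the corrected recursion concrete, here is a minimal sketch in Python (my own illustration, not code from the course; q and e are assumed to be plain dictionaries keyed as in the formulas above, and the STOP transition at the end of the sentence is ignored):

def viterbi(words, tags, q, e):
    """Trigram-HMM Viterbi; q[(v, w, u)] = P(v | w, u), e[(x, v)] = P(x | v)."""
    K = len(words)
    pi = {(0, '*', '*'): 1.0}            # base case: p(0, '*', '*') = 1
    bp = {}                              # backpointers
    def S(k):                            # tags allowed at position k
        return ['*'] if k <= 0 else tags
    for k in range(1, K + 1):
        for u in S(k - 1):
            for v in S(k):
                best_w, best_p = None, -1.0
                for w in S(k - 2):
                    p = (pi[(k - 1, w, u)]
                         * q.get((v, w, u), 0.0)
                         * e.get((words[k - 1], v), 0.0))
                    if p > best_p:
                        best_w, best_p = w, p
                pi[(k, u, v)] = best_p
                bp[(k, u, v)] = best_w
    # pick the best final tag pair and follow the backpointers
    u, v = max(((a, b) for a in S(K - 1) for b in S(K)),
               key=lambda ab: pi[(K, ab[0], ab[1])])
    seq = [u, v]
    for k in range(K, 2, -1):
        seq.insert(0, bp[(k, seq[0], seq[1])])
    return seq[-K:]                      # drop the '*' padding for one-word sentences

Note the inner loop over w runs once per tag, which is where the questioner's "single pass through the tag set" observation shows up.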
I am trying to optimize with Z3py an instance of the Set Covering Problem (SCP41) by minimization.
What I have found so far is the following:
(1) I know that Z3 supports optimization (https://rise4fun.com/Z3/tutorial/optimization). Many times I get to the optimum in SCP41 and other instances; a few times I do not.
(2) I understand that if I use the Z3py API without the optimization module, I have to do the typical sequential search described by Leonardo de Moura in (Minimum and maximum values of integer variable). It never gives me results.
My approach
(3) I have tried to improve the sequential-search approach by implementing a binary search similar to the one Philippe explains in (Does Z3 have support for optimization problems). When I run my algorithm, it just waits and I never get a result.
I understand that the binary search should be faster; shouldn't it work in this case? I also know that the SCP41 instance is quite big, that many constraints are generated, and that the problem becomes extremely combinatorial. This is my full code (Code large instance), and this is my binary search:
def min(F, Z, M, LB, UB, C):
    i = 0
    s = Solver()
    s.add(F[0])
    s.add(F[1])
    s.add(F[2])
    s.add(F[3])
    s.add(F[4])
    s.add(F[5])
    r = s.check()
    if r == sat:
        UB = s.model()[Z]
    while int(str(LB)) <= int(str(UB)):
        C = (int(str(LB)) + int(str(UB))) // 2
        s.push()
        s.add(Z > LB, Z <= C)
        r = s.check()
        if r == sat:
            UB = Z
            return s.model()
        elif r == unsat:
            LB = C
        s.pop()
        i = i + 1
        if i > M:
            raise Z3Exception("maximum not found, maximum number of iterations was reached")
    return unsat
And this is another instance (Code short instance) that I used in initial tests; it worked well in every case.
What is incorrect: my binary search, or some concept of Z3 that I have not applied correctly?
regards,
Alex
I don't think your problem has to do with minimization itself. If you put a print r after r = s.check() in your program, you will see that z3 simply struggles to return a result. So your loop doesn't even execute once.
It's impossible to read through your program since it's really large! But I see a ton of things of the form:
Or(X250 == 0, X500 == 1)
This suggests your variables X250, X500, etc. (and there are a ton of them) are actually booleans, not integers. If that is indeed true, you should absolutely stick to booleans. Solving integer constraints is significantly harder than solving pure boolean constraints, and when you use integers to model booleans like this, the underlying solver ends up exploring a search space far larger than necessary.
If this is indeed the case, i.e., if you're using Int values to model booleans, I'd strongly recommend remodeling your problem to get rid of the Int values and just use booleans. If you come up with a "small" instance of the problem, we can help with the modeling.
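For example, the constraint above could be remodeled over booleans like this (a minimal sketch; the names are just taken from the single constraint shown):

from z3 import *

# Int-based modeling, as the program appears to do now:
X250, X500 = Ints('X250 X500')
int_version = Or(X250 == 0, X500 == 1)

# Bool-based modeling of the same constraint:
# X == 1 becomes x, and X == 0 becomes Not(x)
x250, x500 = Bools('x250 x500')
bool_version = Or(Not(x250), x500)

With the boolean version, the solver can treat the whole problem as SAT instead of dragging in integer arithmetic.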
If you truly do need Int values (which might very well be the case), then I'd say your problem is simply too difficult for an SMT solver to deal with efficiently. You might be better off using some other system that is tuned for such optimization problems.
Recently, I have been implementing an algorithm from a paper that I will be using in my master's work, but I have come across some problems regarding the time it takes to perform some operations.
Before I get into details, I just want to add that my data set comprises roughly 4 million data points.
I have two lists of tuples that I get from a framework (annoy) that calculates the cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two such lists with the same names (first value of the tuple) but two different cosine similarities. What I have to do is sum the cosines from both lists, then sort the result and get the top-N highest cosine values.
My issue is that it takes too long. My actual code for this implementation is as follows:
def topN(self, user, session):
    upref = self.m2vTN.get_user_preference(user)
    spref = self.sm2vTN.get_user_preference(session)
    # list of tuples 1
    most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
    # list of tuples 2
    most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
    # concat both lists and add into a dict
    d = defaultdict(int)
    for l, v in (most_ss + most_su):
        d[l] += v
    # convert the dict into a list, and then sort it
    _list = list(d.items())
    _list.sort(key=lambda x: x[1], reverse=True)
    return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads, but I'm not sure that would help. Getting the lists is not the problem here; the concatenation and sorting is.
Thanks! English is not my native language, so sorry for any misspelling.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine-similarity with all vectors in the model, the annoy indexer isn't helping any. (Its purpose is to get a small subset of nearest-neighbors much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.)
Further, if you're going to combine all of the distances from two such calculations, there's no need for the sorting that most_similar() usually does; it just makes combining the values more complex later. For the gensim vector models, you can supply a False-ish topn value to get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of the distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
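For instance, the end of topN() could then become something like this (a sketch under the above assumptions; combined_dists and index2entity are as described above, and self.N is the cutoff from the question):

# assumes: import numpy as np, and combined_dists as computed above
best_idx = np.argsort(combined_dists)[::-1][:self.N]   # highest combined similarity first
return [self.m2v.index2entity[i] for i in best_idx]

A single argsort over one numpy array avoids building the intermediate dict and sorting millions of Python tuples, which is where the original code spends its time.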
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)
I'm pretty new to Z3, but I think my problem could be solved with it.
I have two variables, A and B, and two patterns like this:
pattern_1: 1010x11x
pattern_2: x0x01111
where 1 and 0 are the bits one and zero, and x (don't care) could be the bit 0 or 1.
I would like to use Z3Py to check whether A with pattern_1 and B with pattern_2 can be true at the same time.
In this case, if A = 10101111 and B = 10101111, then A and B could be true at the same time.
Can anyone help me with this? Is it possible to solve this with Z3Py?
revised answer after clarification
Here's one way you could represent those constraints. There is an operation called Extract that can be applied to bit-vector terms. It is defined as follows:
def Extract(high, low, a):
"""Create a Z3 bit-vector extraction expression."""
where high is the high bit to be extracted, low is the low bit to be extracted, and a is the bitvector. This function represents the bits of a between high and low, inclusive.
Using the Extract function you can constrain each bit of whatever term you want to check so that it matches the pattern. For example, if the seventh bit of D must be a 1, then you can write s.add(Extract(7, 7, D) == 1). Repeat this for each bit in a pattern that isn't an x.
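Putting this together for the two patterns in the question, a sketch could look like the following (assuming the leftmost pattern character is the most significant bit, bit 7, and reading "true at the same time" as A and B taking the same value):

from z3 import *

A = BitVec('A', 8)
B = BitVec('B', 8)
s = Solver()

# pattern_1: 1010x11x  (bit 7 is the leftmost character; x bits stay unconstrained)
for bit, val in [(7, 1), (6, 0), (5, 1), (4, 0), (2, 1), (1, 1)]:
    s.add(Extract(bit, bit, A) == val)

# pattern_2: x0x01111
for bit, val in [(6, 0), (4, 0), (3, 1), (2, 1), (1, 1), (0, 1)]:
    s.add(Extract(bit, bit, B) == val)

s.add(A == B)          # both patterns must be realized by the same value
if s.check() == sat:
    print(s.model())   # e.g. A = B = 10101111 in binary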
I'm using the Levenshtein distance algorithm to filter through some text in order to determine the best matching result for the purpose of text field auto-completion (and top 5 best results).
Currently, I have an array of strings, and apply the algorithm to each one in an attempt to determine how close of a match it is to the text which was typed by the user. The problem is that I'm not too sure how to interpret the values outputted by the algorithm to effectively rank the results as expected.
For example: (Text typed = "nvmb")
Result: "game" ; levenshtein distance = 3 (best match)
Result: "number the stars" ; levenshtein distance = 13 (second best match)
This technically makes sense; the second result needs many more 'edits' because of its length. The problem is that the second result is logically and visually a much closer match than the first one. It's almost as if I should ignore any characters beyond the length of the typed text.
Any ideas on how I could achieve this?
Levenshtein distance by itself is good for correcting a query, not for auto-completion.
I can propose an alternative solution:
First, store your strings in a prefix tree instead of an array, so you will not need to analyze all of them.
Second, given the user input, enumerate the strings within a fixed distance of it and suggest completions for each.
Your example: Text typed = "nvmb"
Distance 0: no completions.
Enumerate strings at distance 1.
Only "numb" will have some completions.
Another example: Text typed = "gamb"
For distance 0 you have only one completion, "gambling"; make it the first suggestion, then continue to get 4 more.
For distance 1 you will get "game" and some completions for it.
Of course, this approach sometimes gives more than 5 results, but you can order them by another criterion that does not depend on the current query.
I think this is more efficient because you can typically limit the distance to at most two, i.e., check on the order of 1000*n prefixes, where n is the length of the input; this is usually far less than the number of stored strings.
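Here is a rough sketch of this idea in Python (my own illustration, not the poster's code; the trie is a plain dict of dicts, and edits1() enumerates the strings at distance at most 1, along the lines of Norvig's spelling corrector):

END = object()   # marker key: "a stored string ends here"

def insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def completions(trie, prefix):
    """Yield every stored string that starts with prefix."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return
        node = node[ch]
    stack = [(node, prefix)]
    while stack:
        node, word = stack.pop()
        if END in node:
            yield word
        for ch, child in node.items():
            if ch is not END:
                stack.append((child, word + ch))

def edits1(word, alphabet='abcdefghijklmnopqrstuvwxyz'):
    """word itself plus every string at Levenshtein distance 1 from it."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return {word} | set(deletes) | set(replaces) | set(inserts)

# build a small trie and suggest completions for everything near the query
trie = {}
for s in ["number the stars", "numb", "game", "gambling"]:
    insert(trie, s)
suggestions = [c for cand in sorted(edits1("nvmb")) for c in completions(trie, cand)]

For "nvmb", the distance-1 variant "numb" is the one that hits the trie, yielding "numb" and "number the stars" as suggestions.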
The Levenshtein distance corresponds to the number of single-character insertions, deletions and substitutions in an optimal global pairwise alignment of two sequences if the gap and mismatch costs are all 1.
The Needleman-Wunsch DP algorithm will find such an alignment, in addition to its score (it's essentially the same DP algorithm as the one used to calculate the Levenshtein distance, but with the option to weight gaps, and mismatches between any given pair of characters, arbitrarily). But there are more general models of alignment that allow reduced penalties for gaps at the start or the end (and reduced penalties for contiguous blocks of gaps, which may also be useful here, although it doesn't directly answer the question). At one extreme, you have local alignment, where you pay no penalty at all for gaps at the ends; this is computed by the Smith-Waterman DP algorithm.

I think what you want here is in between: you want to penalise gaps at the start of both the query and test strings, and gaps in the test string at the end, but not gaps in the query string at the end. That way, trailing mismatches cost nothing, and the costs will look like:
Query: nvmb
Costs: 0100000000000000 = 1 in total
Against: number the stars
Query: nvmb
Costs: 1101 = 3 in total
Against: game
Query: number the stars
Costs: 0100111111111111 = 13 in total
Against: nvmb
Query: ber star
Costs: 1110001111100000 = 8 in total
Against: number the stars
Query: some numbor
Costs: 111110000100000000000 = 6 in total
Against: number the stars
(In fact you might want to give trailing mismatches a small nonzero penalty, so that an exact match is always preferred to a prefix-only match.)
The Algorithm
Suppose the query string A has length n, and the string B that you are testing against has length m. Let d[i][j] be the DP table value at (i, j) -- that is, the cost of an optimal alignment of the length-i prefix of A with the length-j prefix of B. If you go with a zero penalty for trailing mismatches, you only need to modify the NW algorithm in a very simple way: instead of calculating and returning the DP table value d[n][m], you just need to calculate the table as before, and find the minimum of any d[n][j], for 0 <= j <= m. This corresponds to the best match of the query string against any prefix of the test string.
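In code, the change to the standard Levenshtein DP is just the final line: take the minimum over the last row instead of returning d[n][m]. A minimal sketch (prefix_edit_distance is a name I made up; the small trailing penalty suggested above is left out):

def prefix_edit_distance(query, test):
    """Levenshtein distance, except unmatched trailing characters of test are free."""
    n, m = len(query), len(test)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i               # leading gaps in the test string are penalised
    for j in range(m + 1):
        d[0][j] = j               # leading gaps in the query are penalised too
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if query[i - 1] == test[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # gap in the test string
                          d[i][j - 1] + 1,         # gap in the query
                          d[i - 1][j - 1] + cost)  # match or substitution
    return min(d[n][j] for j in range(m + 1))      # best match against any prefix

# prefix_edit_distance("nvmb", "number the stars") == 1
# prefix_edit_distance("nvmb", "game") == 3

This reproduces the costs from the examples above: "number the stars" now scores 1 against "nvmb", while "game" still scores 3.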
I am new to SMT solvers. I would like to know how I could encode a simple TSP instance with 4 to 6 nodes. I am confused about how to set up my constraints using the Z3Py API. Any kind of hint or help would be highly appreciated.
You can phrase your TSP problem as an ILP problem. The question is now how to encode the TSP as an ILP. There are two well known answers: Miller–Tucker–Zemlin and Dantzig–Fulkerson–Johnson.
The basic idea is as follows: say we have n cities. Let us denote by d_{ij} the distance between cities i and j, and by x_{ij} the boolean value (0 or 1) indicating whether the tour contains the edge from i to j. Then finding the smallest tour means
minimize sum_{i,j} x_{ij} d_{ij}
such that the x_{ij} describe a cycle. With the two conditions below we get one or more cycles:
sum_{j} x_{ij} = 1 for all i (exactly one outgoing edge per city)
sum_{i} x_{ij} = 1 for all j (exactly one ingoing edge per city)
Now we have to exclude the case where the solution comprises multiple cycles. We can add this exponential number of Dantzig–Fulkerson–Johnson conditions:
sum_{i in S} sum_{j in S} x_{ij} < |S| for all nonempty proper subsets S of {1, ..., n}
Note that if our solution contains two cycles, then for S the vertex set of one of those cycles, the x_{ij}-sum over S equals |S|, violating the condition. On the other hand, if there is only one cycle, the x_{ij}-sum never reaches |S| for any proper subset: e.g., if you remove one vertex from {1, ..., n}, the number of edges remaining within S is n-2, while |S| = n-1.
Of course, an exponential number of constraints is not what we want, so we look for a more clever way to exclude the subcycle cases. And here is where Miller–Tucker–Zemlin jumps in.
A different approach would be to simply ignore the subcycle problem, compute a solution and check whether the solution comprises subcycles. If it does, exclude the solution by adding it as a lazy constraint and repeat until you get a single-cycle solution. The keyword here is lazy constraint.
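Since the question is about Z3Py specifically, here is a small sketch of the Miller–Tucker–Zemlin encoding using Z3's Optimize engine (the 4-city distance matrix is made up purely for illustration):

from z3 import *

dist = [[0, 2, 9, 10],
        [1, 0, 6, 4],
        [15, 7, 0, 8],
        [6, 3, 12, 0]]          # made-up 4-city distance matrix
n = len(dist)

x = [[Bool('x_%d_%d' % (i, j)) for j in range(n)] for i in range(n)]
u = [Int('u_%d' % i) for i in range(n)]   # MTZ position variables

opt = Optimize()
for i in range(n):
    opt.add(Sum([If(x[i][j], 1, 0) for j in range(n) if j != i]) == 1)  # one outgoing edge
    opt.add(Sum([If(x[j][i], 1, 0) for j in range(n) if j != i]) == 1)  # one ingoing edge

# MTZ subtour elimination: u_i is the position of city i in the tour (city 0 is the start)
for i in range(1, n):
    opt.add(u[i] >= 1, u[i] <= n - 1)
    for j in range(1, n):
        if i != j:
            opt.add(u[i] - u[j] + n * If(x[i][j], 1, 0) <= n - 1)

opt.minimize(Sum([If(x[i][j], dist[i][j], 0)
                  for i in range(n) for j in range(n) if i != j]))
if opt.check() == sat:
    m = opt.model()
    print([(i, j) for i in range(n) for j in range(n)
           if i != j and is_true(m.evaluate(x[i][j]))])

The MTZ constraint forces u_j >= u_i + 1 whenever the edge (i, j) is taken, so any cycle avoiding city 0 would need strictly increasing positions around a loop, which is impossible; that is what rules out subtours with only polynomially many constraints.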
There is a nice sample that may be useful for you:
http://z3.codeplex.com/SourceControl/changeset/view/1235b3ea24d9#examples/python/hamiltonian/hamiltonian.py