How to solve basic HMM problems with hmmlearn - machine-learning

There are three fundamental HMM problems:
Problem 1 (Likelihood): Given an HMM λ = (A,B) and an observation
sequence O, determine the likelihood P(O|λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ =
(A,B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of
states in the HMM, learn the HMM parameters A and B.
I'm interested in Problems 1 and 3. In general, the first problem can be solved with the Forward algorithm and the third problem can be solved with the Baum-Welch algorithm. Am I right that I should use hmmlearn's score(X, lengths) and fit(X, lengths) methods to solve the first and third problems, respectively? (The documentation does not say that score uses the Forward algorithm.)
And I have some more questions about the score method. Why does score calculate the log probability? And why, if I pass several sequences to score, does it return the sum of the log probabilities instead of a probability for each sequence?
My original task was the following: I have 1 million short sentences of equal size (10 words each). I want to train an HMM model on that data and, for test data (again 10-word sentences), predict the probability of each sentence under the model. Based on that probability I will decide whether a phrase is usual or unusual.
And maybe there are better Python libraries for solving these problems?

If you are fitting the model on a single sequence, you should use score(X) and fit(X) to solve the first and third problems, respectively (since lengths=None is the default, you do not need to pass it explicitly). When working with multiple sequences, pass the list of their lengths as the lengths parameter; see the documentation.
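For illustration, a minimal sketch, assuming your words are already encoded as integers and using hmmlearn's discrete-emission model (called MultinomialHMM in older releases, CategoricalHMM in newer ones):

import numpy as np
from hmmlearn import hmm

# hmmlearn expects all training sequences concatenated into one 2-D array
# plus a list of the individual sequence lengths.
sentences = [[0, 2, 1, 3], [1, 1, 2, 0, 3]]        # toy data, words as integer codes
X = np.concatenate(sentences).reshape(-1, 1)
lengths = [len(s) for s in sentences]

# Problem 3 (learning): parameters are estimated by fit() via Baum-Welch.
model = hmm.MultinomialHMM(n_components=4, n_iter=100)
model.fit(X, lengths)

# Problem 1 (likelihood): score() returns the total log probability of all sequences.
total_logprob = model.score(X, lengths)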
The score method calculates the log probability for numerical stability. Multiplying a lot of numbers can result in numerical overflow or underflow: the product may become too large to represent in floating point, or too small to distinguish from zero. The solution is to add their logarithms instead.
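To see why, a tiny numerical illustration:

import numpy as np

p = np.full(1000, 0.01)      # a thousand per-step probabilities of 0.01
print(np.prod(p))            # 0.0 -- the product underflows to zero
print(np.log(p).sum())       # about -4605.2 -- the same quantity in log space, no underflow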
The score method returns the sum of the log probabilities of all sequences, well, because that is how it is implemented. A feature request for per-sequence scores was submitted a month ago, though, so maybe it will appear soon: https://github.com/hmmlearn/hmmlearn/issues/272 Or you can simply score each sequence separately.
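For example, a rough per-sequence scoring loop (reusing the model and imports from the sketch above; test_sentences is a hypothetical list of integer-encoded test sequences):

# One log probability per test sentence; unusually low values flag "unusual" phrases.
per_sentence_logprob = [
    model.score(np.asarray(sentence).reshape(-1, 1))
    for sentence in test_sentences
]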
The hmmlearn library is a good Python library for hidden Markov models. I tried a different library, ghmm, and kept getting weird numerical underflows. It turned out that its implementation of the Baum-Welch algorithm for Gaussian HMMs was numerically unstable: it used LU decomposition instead of Cholesky decomposition to compute the inverse of the covariance matrix, which sometimes causes the covariance matrix to stop being positive semidefinite. The hmmlearn library uses the Cholesky decomposition. I switched to hmmlearn and my program started working fine, so I recommend it.

Related

What is used to train a self-attention mechanism?

I've been trying to understand self-attention, but nothing I found explains the concept well at a high level.
Let's say we use self-attention in an NLP task, so our input is a sentence.
Then self-attention can be used to measure how "important" each word in the sentence is for every other word.
The problem is that I do not understand how that "importance" is measured. Important for what?
What exactly is the goal vector the weights in the self-attention algorithm are trained against?
Connecting language with underlying meaning is called grounding. A sentence like "The ball is on the table" corresponds to an image that can be reproduced with multimodal learning. Multimodal means that different kinds of words are available, for example events, action words, subjects and so on. A self-attention mechanism maps input vectors to output vectors, with a neural network between them. The output vector of the neural network refers to the grounded situation.
Let us take a short example. We need a 300x200 pixel image, a sentence in natural language, and a parser. The parser works in both directions: it can convert text to an image, meaning the sentence "The ball is on the table" gets converted into the 300x200 image, but it is also possible to parse a given image and extract the natural-language sentence back. Self-attention learning is a bootstrapping technique to learn and use this grounded relationship: to verify existing language models, to learn new ones, and to predict future system states.
This question is old now but I came across it so I figured I should update others as my own understanding has increased.
Attention simply refers to some operation that takes the output and combines it with some other information. Typically this just happens by taking the dot product of the output with some other vector so it can "attend" to it in some way.
Self-attention combines the output with other parts of the input (hence the "self" part). Again, the combination usually occurs via a dot product between the vectors.
Finally how is attention (or self-attention) trained?
Let's call Z our output, W our weight matrix and X our input (we'll use # as matrix multiplication symbol).
Z = X^T # W^T # X
In NLP we will compare Z to whatever we want the resulting output to be. In machine translation, for example, it is the sentence in the other language. We can compare the two with the average cross-entropy loss over each predicted word. Finally we can update W with backpropagation.
How do we see what is important? We can look at the magnitudes of Z to see which words were most "attended" to after the attention.
This is a slightly simplified example as it only has one weight matrix and typically the inputs are embedded but I think it still highlights some of the necessary details concerning attention.
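As a rough illustration of that simplified formulation, here is a small NumPy sketch (the shapes, variable names, and the softmax at the end are my own assumptions; each column of X is one word embedding):

import numpy as np

d, n = 4, 3                       # embedding size, number of words
X = np.random.randn(d, n)         # input embeddings, one column per word
W = np.random.randn(d, d)         # learnable weight matrix

Z = X.T @ W.T @ X                 # Z = X^T # W^T # X, an n x n matrix of attention scores

# A row-wise softmax turns the raw scores into weights: row i shows how much
# word i "attends" to every other word.
weights = np.exp(Z - Z.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(weights)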
Here is a useful resource with visualizations for more information about attention.
Here is another resource with visualizations for more about attention in transformers specifically self-attention.

Is my method to detect overfitting in matrix factorization correct?

I am using matrix factorization as a recommender-system algorithm based on user click-behavior records. I tried two matrix factorization methods:
The first one is the basic SVD, whose prediction is just the product of the user factor vector u and the item factor vector i: r = u * i
The second one I used is the SVD with bias component.
r = u * i + b_u + b_i
where b_u and b_i represent the preference biases of users and items.
One of the models has very low performance while the other one is reasonable. I really do not understand why one performs so much worse, and I suspect that it is overfitting.
I googled methods to detect overfitting and found that the learning curve is a good way. However, its x-axis is the size of the training set and its y-axis is the accuracy. This confuses me: how can I change the size of the training set? By picking some of the records out of the data set?
Another problem is, I tried to plot the iteration-loss curve (The loss is the ). And it seems the curve is normal:
But I am not sure whether this method is correct, because the metrics I use are precision and recall. Should I plot an iteration-precision curve instead? Or does this curve already tell me that my model is correct?
Can anybody please tell me whether I am going in the right direction? Thank you so much. :)
I will answer in reverse:
So you are trying two different models: one that uses straight matrix factorization, r = u * i, and another that adds the biases, r = u * i + b_u + b_i.
You mentioned you are doing matrix factorization for a recommender system that looks at users' clicks. So my question here is: is this an implicit-ratings case or an explicit one? If it is about clicks, I believe it is an implicit-ratings problem.
This is the first important thing you need to be very aware of, whether your problem is about explicit or implicit ratings, because there are some differences in the way they are used and implemented.
If you check here:
http://yifanhu.net/PUB/cf.pdf
Implicit ratings are treated in such a way that the number of times someone clicked on or bought a given item, for example, is used to infer a confidence level. If you check the error function, you can see that the confidence levels are used almost as a weight factor. So the whole idea is that in this scenario the biases have no meaning.
In the case of explicit ratings, where one has ratings as a score, for example from 1 to 5, one can calculate those biases for users and products (averages of these bounded scores) and introduce them in the rating formula. They make sense in this scenario.
The whole point is that, depending on whether you are in one scenario or the other, you can use the biases or not.
On the other hand, your question is about overfitting. For that you can plot training errors against test errors; depending on the size of your data you can keep a holdout test set, and if the errors differ a lot then you are overfitting.
Another thing is that matrix factorization models usually include regularization terms (see the article linked above) to avoid overfitting.
So I think in your case you have a different problem from the one I mentioned before.
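To make the train/holdout comparison concrete, a rough sketch (hypothetical variable names; the prediction formula is the biased one from the question, assuming the factors and biases have already been learned):

import numpy as np

def predict(U, V, b_u, b_i, user, item):
    # r_hat = u . i + b_u + b_i  (biased matrix factorization prediction)
    return U[user] @ V[item] + b_u[user] + b_i[item]

def rmse(triples, U, V, b_u, b_i):
    # triples: list of (user, item, observed_score)
    errors = [r - predict(U, V, b_u, b_i, u, i) for u, i, r in triples]
    return np.sqrt(np.mean(np.square(errors)))

# A training error far below the holdout error is the usual sign of overfitting:
# train_rmse = rmse(train_triples, U, V, b_u, b_i)
# test_rmse = rmse(test_triples, U, V, b_u, b_i)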

Machine learning multi-classification: Why use 'one-hot' encoding instead of a number

I'm currently working on a classification problem with TensorFlow, and I'm new to the world of machine learning, but there is something I don't get.
I have successfully tried to train models that output the y tensor like this:
y = [0,0,1,0]
But I can't understand the principle behind it...
Why not just train the same model to output classes such as y = 3 or y = 4?
This seems much more flexible, because I can imagine having a multi-classification problem with 2 million possible classes, and it would be much more efficient to output a number between 0 and 2,000,000 than to output a tensor of 2,000,000 items for every result.
What am I missing?
Ideally, you could train your model to classify input instances and produce a single output. Something like
y=1 means input=dog, y=2 means input=airplane. An approach like that, however, brings a lot of problems:
How do I interpret the output y=1.5?
Why am I trying to regress a number, as if I were working with continuous data, when in reality I am working with discrete data?
In fact, what you are doing is treating a multi-class classification problem like a regression problem.
This is conceptually wrong (unless you're doing binary classification, in which case a positive and a negative output are all you need).
To avoid these (and other) issues, we use a final layer with one neuron per class and associate a high activation with the correct class.
The one-hot encoding represents the fact that you want to force your network to have a single high-activation output when a certain input is present.
Thus, every input=dog will have [1, 0, 0] as output, and so on.
In this way, you're correctly treating a discrete classification problem and producing a discrete, well-interpretable output (in fact you'll always extract the output neuron with the highest activation using tf.argmax; even though your network hasn't learned to produce a perfect one-hot encoding, you'll still be able to extract the most likely output without doubt).
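For instance, a small sketch of the round trip between integer class ids, one-hot targets, and the argmax read-out (illustrative values only):

import tensorflow as tf

labels = tf.constant([2, 0, 3])               # integer class ids
targets = tf.one_hot(labels, depth=4)         # one-hot training targets, shape (3, 4)

# Imperfect network outputs: one activation per class for each input.
logits = tf.constant([[0.1, 0.2, 3.0, 0.4],
                      [2.5, 0.1, 0.3, 0.2],
                      [0.0, 0.1, 0.2, 1.9]])

# tf.argmax recovers the most likely class even without a perfect one-hot output.
predicted = tf.argmax(logits, axis=1)         # -> [2, 0, 3]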
The answer is in how that final tensor, or single value, is calculated. In an NN, your y=3 would be built by a weighted sum over the values of the previous layer.
Trying to train towards single values would then imply a linear relationship between the category IDs where none exists: for the true value y=4, the output y=3 would be considered better than y=1 even though the categories are random and might be 1: dogs, 3: cars, 4: cats.
Neural networks use gradient descent to optimize a loss function. In turn, this loss function needs to be differentiable.
A discrete output would be (indeed is) a perfectly valid and valuable output for a classification network. Problem is, we don't know how to optimize this net efficiently.
Instead, we rely on a continuous loss function. This loss function is usually based on something that is more or less related to the probability of each label -- and for this, you need a network output that has one value per label.
Typically, the output that you describe is then deduced from this soft, continuous output by taking the argmax of these pseudo-probabilities.
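A brief sketch of that pipeline (illustrative values; the continuous loss here is the standard softmax cross-entropy):

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, 0.1]])       # one continuous value per label
labels = tf.constant([[1.0, 0.0, 0.0]])       # one-hot target

# Differentiable, probability-based loss used during training...
loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

# ...while the discrete prediction is deduced afterwards with argmax.
prediction = tf.argmax(logits, axis=1)        # -> [0]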

Is there hope for using Lasagne's Adam implementation for Probabilistic Matrix Factorization?

I am implementing Probabilistic Matrix Factorization models in Theano and would like to make use of the Adam gradient descent rules.
My goal is to have code that is as uncluttered as possible, which means that I do not want to explicitly keep track of the 'm' and 'v' quantities from Adam's algorithm.
It would seem that this is possible, especially after seeing how Lasagne's Adam is implemented: it hides the 'm' and 'v' quantities inside the update rules of theano.function.
This works when the negative log-likelihood is formed of terms that each handle a different quantity. But in Probabilistic Matrix Factorization each term contains a dot product of one latent user vector and one latent item vector. This way, if I make an instance of Lasagne's Adam for each term, I will have multiple 'm' and 'v' quantities for the same latent vector, which is not how Adam is supposed to work.
I also posted on Lasagne's group, actually twice, where there is some more detail and some examples.
I thought about two possible implementations:
Each existing rating (which means each existing term in the global NLL objective function) gets its own Adam, updated via a dedicated call to theano.function. Unfortunately this leads to an incorrect use of Adam, as the same latent vector will be associated with different 'm' and 'v' quantities used by the Adam algorithm, which is not how Adam is supposed to work.
Calling Adam on the entire NLL objective, which will make the update mechanism like plain full-batch gradient descent instead of SGD, with all the known disadvantages (high computation time, getting stuck in local minima, etc.).
My questions are:
Is there maybe something that I did not understand correctly about how Lasagne's Adam works?
Would option number 2 actually be like SGD, in the sense that every update to a latent vector influences another update (in the same Adam call) that makes use of that updated vector?
Do you have any other suggestions on how to implement it?
Any idea on how to solve this issue and avoid manually keeping replicated vectors and matrices for the 'v' and 'm' quantities?
It looks like in the paper they are suggesting that you optimise the entire function at once using gradient descent:
We can then perform gradient descent in Y, V, and W to minimize the objective function given by Eq. 10.
So I would say your option 2 is the correct way to implement what they have done.
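A rough sketch of option 2 with Lasagne (hypothetical names, a plain Gaussian-likelihood PMF objective, and a single full-batch step per call); because lasagne.updates.adam is called once on the whole NLL, there is exactly one 'm' and 'v' pair per shared parameter:

import numpy as np
import theano
import theano.tensor as T
import lasagne

floatX = theano.config.floatX
n_users, n_items, k = 100, 50, 10

# Toy observed-rating matrix and a mask marking which entries are observed.
R = theano.shared(np.random.rand(n_users, n_items).astype(floatX))
mask = theano.shared((np.random.rand(n_users, n_items) > 0.5).astype(floatX))

# Latent user and item factors as shared variables.
U = theano.shared((0.1 * np.random.randn(n_users, k)).astype(floatX))
V = theano.shared((0.1 * np.random.randn(n_items, k)).astype(floatX))

pred = T.dot(U, V.T)
# Squared-error part of the PMF NLL over observed entries, plus simple L2 priors.
nll = T.sum(mask * (R - pred) ** 2) + 0.01 * (T.sum(U ** 2) + T.sum(V ** 2))

# One Adam call on the entire objective: a single 'm' and 'v' pair per parameter.
updates = lasagne.updates.adam(nll, [U, V], learning_rate=0.01)
train_step = theano.function([], nll, updates=updates)

for epoch in range(100):
    loss = train_step()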
There aren't many complexities or non-linearities in there (besides the sigmoid), so you probably aren't that likely to run into the typical optimisation problems associated with neural networks that necessitate something like Adam. So as long as it all fits in memory, I guess this approach will work.
If it doesn't fit in memory, perhaps there's some way you could devise a minibatch version of the loss to optimise over. Would also be interested to see if you could add one or more neural networks to replace some of these conditional probabilities. But that's getting a bit off-topic...

Hidden Markov Model: Is it possible that the accuracy decreases as the number of states increases?

I constructed a couple of Hidden Markov Models using the Baum-Welch algorithm for an increasing number of states. I noticed that the validation score goes down once there are more than 8 states. So I wondered whether it is possible that the accuracy of a Hidden Markov Model decreases as the number of states increases, due to some kind of overfitting?
Thanks in advance!
For the sake of clarity, I propose here a very simplified illustration of the phenomenon.
Say you train your HMM with the data sequence (A-B-A-B).
Say you use a 2-state HMM.
Naturally, state 1 will optimize itself to represent A and state 2 will represent B (or the opposite).
Then, you have a new sequence (A-B-A-B). You want to know the likelihood this sequence has with respect to your HMM.
A Viterbi algorithm will find that the most probable state sequence is (1-2-1-2), and the forward procedure (at the heart of Baum-Welch) will give this sequence a high likelihood, as the sequence of states and the "values" of the new sequence (if working with continuous data) clearly match your training sequence.
Say now that you train a 3-state HMM on the same training sequence (A-B-A-B). The initial clustering of your data will most probably assign the first two states of the HMM to the representation of symbol A and the last one to symbol B (or the opposite once again).
So now, the query sequence (A-B-A-B) can be represented as the state sequence (1-3-1-3) or (2-3-2-3) or (1-3-2-3) or (2-3-1-3) !
This means that for this 3-state HMM, two identical sequences (A-B-A-B) can have low similarity under the HMM, because the probability mass is spread over several state paths. That is the reason why, for any HMM and any data set, performance will decrease beyond a certain number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, or the Minimum Message Length criterion, or, if you just want a rough idea, k-means clustering combined with the percentage of variance explained. The first three criteria are interesting because they include a penalty term that grows with the number of parameters of the model.
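For instance, a rough sketch of BIC-based selection with hmmlearn (the parameter count below is a simplified one for a diagonal-covariance Gaussian HMM; adapt it to your model):

import numpy as np
from hmmlearn import hmm

def bic(model, X, lengths):
    # BIC = -2 * logL + n_params * log(n_samples); lower is better.
    log_likelihood = model.score(X, lengths)
    s, f = model.n_components, X.shape[1]
    # transitions + start probabilities + means + diagonal covariances (rough count)
    n_params = s * (s - 1) + (s - 1) + 2 * s * f
    return -2.0 * log_likelihood + n_params * np.log(X.shape[0])

# X, lengths: concatenated training sequences and their lengths
# scores = {n: bic(hmm.GaussianHMM(n_components=n, covariance_type="diag",
#                                  n_iter=100).fit(X, lengths), X, lengths)
#           for n in range(2, 13)}
# best_n = min(scores, key=scores.get)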
Hope it helps! :)
