Hidden Markov Model with new, unseen observations - machine-learning

I am trying to use a hidden Markov model, but my observations are triplets of continuous values (temperature, humidity, and something else). This means I do not know the exact number of possible observations, since they are not discrete, so I cannot define the size of my emission matrix. Discretizing the values is not an option: with the step size each variable would need, I end up with millions of possible observation combinations. So, can this problem be solved with an HMM? Essentially, can the size of the emission matrix change every time I get a new observation?

I think you have misunderstood the concept: with continuous observations there is no emission matrix, only a transition probability matrix, and that stays constant. Your problem with 3 continuous random variables is actually easier than, say, speech recognition, which works with 39 continuous MFCC variables under the assumption that they are independently (but not identically) normally distributed; you can make the same assumption for your 3 variables. So if you insist on an HMM, do not try to resize an emission matrix; model the emissions with continuous densities instead, and your problem can still be solved.
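For illustration, a minimal sketch of what that looks like in practice, assuming the hmmlearn library and placeholder data (the number of hidden states is an arbitrary choice):

```python
# Sketch: HMM with continuous (Gaussian) emissions instead of an emission matrix.
# Assumes the hmmlearn library; 4 hidden states is an arbitrary choice.
import numpy as np
from hmmlearn.hmm import GaussianHMM

# X: one observation per row -> (temperature, humidity, third_variable)
X = np.random.rand(1000, 3)          # placeholder for your real sensor data
lengths = [500, 500]                 # lengths of the individual sequences in X

# covariance_type="diag" encodes the "independent normals" assumption
model = GaussianHMM(n_components=4, covariance_type="diag", n_iter=100)
model.fit(X, lengths)

states = model.predict(X)            # most likely hidden state per observation
```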

One approach is to give a new, unseen observation an equal probability of being emitted by all states, or to assign it a probability according to a PDF if you happen to know one. This at least solves your immediate problem. Later on, when the state is observed (I assume you are trying to predict states), you may want to reassign the real probabilities to the new observation.
A second approach (the one I like better) is to cluster your observations with a clustering method. Your observations then become the clusters rather than the raw real-time data: once you capture a data point, you assign it to the corresponding cluster and give the HMM the cluster number as the observation. No more "unseen" observations to worry about.
Or you may have to resort to a continuous hidden Markov model instead of a discrete one, but that option comes with a lot of caveats.
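A rough sketch of the clustering approach, assuming scikit-learn for k-means and hmmlearn for the discrete HMM (the class is CategoricalHMM in recent hmmlearn releases, MultinomialHMM in older ones); the cluster and state counts are arbitrary:

```python
# Sketch: discretize continuous triplets with k-means, then feed cluster ids
# to a discrete HMM. Assumes scikit-learn and hmmlearn; counts are arbitrary.
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn.hmm import CategoricalHMM   # MultinomialHMM in older hmmlearn

X = np.random.rand(1000, 3)               # placeholder: (temperature, humidity, other)

kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(X)
symbols = kmeans.labels_.reshape(-1, 1)   # one discrete symbol per observation

hmm = CategoricalHMM(n_components=4, n_iter=100)
hmm.fit(symbols)                          # discrete emissions: 4 states x 32 symbols

# A new, unseen triplet maps to its nearest cluster, so it is never "unseen":
new_symbol = kmeans.predict(np.array([[21.5, 0.43, 7.1]]))
```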

Related

Classifying pattern in time series

I am dealing with a repeating pattern in time series data. My goal is to classify every pattern as 1, and anything that does not follow the pattern as 0. The pattern repeats itself between every two peaks as shown below in the image.
The patterns are not necessarily fixed in sample size, but they stay within an approximate sample size, say 500 samples ±10%. The heights of the peaks can change. The random signal (I call it random, but basically it means not following the pattern shape) can also change in value.
The data is from a sensor. Patterns are when the device is working smoothly. If the device is malfunctioning, then I will not see the patterns and will get something similar to the class 0 I have shown in the image.
What I have done so far is build a logistic regression model. Here are my steps for data preparation:
1. Grab the data between every two consecutive peaks, resample it to a fixed size of 100 samples, and scale it to [0, 1]. This is class 1.
2. Repeat step 1 on the data between valleys and call it class 0.
3. Generate some noise and repeat step 1 on chunks of 500 samples to build extra class 0 data.
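For reference, step 1 looks roughly like this (a sketch assuming SciPy; signal and the peak indices are placeholders for my real data):

```python
# Sketch of step 1: cut the data between two consecutive peaks, resample to
# 100 samples, and scale to [0, 1]. `signal` and `peaks` are placeholders.
import numpy as np
from scipy.signal import resample

def make_example(signal, start_peak, end_peak, n_samples=100):
    segment = signal[start_peak:end_peak]
    segment = resample(segment, n_samples)             # fixed-length representation
    lo, hi = segment.min(), segment.max()
    return (segment - lo) / (hi - lo + 1e-12)           # scale to [0, 1]

# usage: class-1 examples from consecutive peak indices
# X1 = np.array([make_example(signal, p0, p1) for p0, p1 in zip(peaks[:-1], peaks[1:])])
```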
The bottom figure shows my predictions on the test dataset. Prediction on the noise chunk is not great, and I am worried that on real data I may get even more false positives. Any idea how I can improve my predictions? Is there a better approach when there is no class 0 data available?
I have seen a similar question here. My understanding of hidden Markov models is limited, but I believe they are used to predict future data. My goal is to classify a sliding window of 500 samples throughout my data.
I have some proposals that you could try out.
First, recurrent neural networks (e.g. LSTMs) are often used in this field. But I have also heard of people working with tree-based methods like LightGBM (I think Aileen Nielsen uses this approach).
So if you don't want to dive into neural networks, which is probably not necessary because your signals seem relatively easy to distinguish, you can give LightGBM (or another tree ensemble method) a chance.
If you know the maximum length of a positive sample, you can define the length of your sliding sample window, which becomes your input vector (each sample in the window becomes one input feature). I would then add an extra attribute with the number of samples since the last peak occurred (outside/before the sample window). You can then decide by how many steps you let the window slide over the data; this also depends on the memory you have available.
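A minimal sketch of that sliding-window feature construction, assuming the lightgbm package; window length, step size and the placeholder arrays are arbitrary:

```python
# Sketch: sliding-window features plus a "samples since last peak" attribute,
# fed to a LightGBM classifier. Assumes the lightgbm package; sizes are arbitrary.
import numpy as np
from lightgbm import LGBMClassifier

def sliding_windows(signal, last_peak_idx, labels, window=500, step=50):
    X, y = [], []
    for start in range(0, len(signal) - window, step):
        feats = list(signal[start:start + window])        # one feature per sample
        feats.append(start - last_peak_idx[start])         # samples since last peak
        X.append(feats)
        y.append(labels[start + window - 1])                # label at the window end
    return np.array(X), np.array(y)

# X_train, y_train = sliding_windows(signal, last_peak_idx, labels)
# clf = LGBMClassifier(n_estimators=200).fit(X_train, y_train)
```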
But it might be wise to skip some of the windows around a change between positive and negative, because those states might not be classifiable unambiguously.
In case memory becomes an issue, neural networks could be the better choice, because they do not need all training data available at once for training, so you can generate your input data in batches. With tree-based methods this possibility does not exist, or only in a very limited way.
I'm not sure what you are trying to achieve.
If you want to characterize what is a peak and what is not, which is an after-the-fact classification, then you can use a simple rule to define peaks, such as signal(t) - average(signal, t-N to t) > T, with T a threshold and N the number of data points to look back over.
This qualifies what is a peak (class 1) and what is not (class 0), and hence classifies the patterns.
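A sketch of that rule with a rolling mean (assuming pandas; N and T are values you would tune for your data):

```python
# Sketch of the rule signal(t) - average(signal, t-N..t) > T using a rolling mean.
# Assumes pandas; N and T must be tuned.
import pandas as pd

def label_peaks(signal, N=50, T=1.0):
    s = pd.Series(signal)
    baseline = s.rolling(window=N, min_periods=1).mean()   # average over the last N points
    return (s - baseline > T).astype(int)                   # 1 = peak, 0 = not a peak
```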
If your goal is to predict that a peak is going to happen a few time units before it occurs (at time t), using, say, data from t-n1 to t-n2 as features, then logistic regression is not necessarily the best choice.
To find the right model, start by visualizing the features you have from t-n1 to t-n2 for every peak(t) and see whether there is any pattern you can find. It can be anything:
was there a peak in the n3 days before t?
is there a trend?
was there an outlier (possibly after transforming your data, e.g. exponentially)?
In order to compare these patterns, think about normalizing them so that the n2-n1 data points range from 0 to 1, for example.
If you find a pattern visually then you will know what kind of model is likely to work, on which features.
If you don't, then it's likely that the white noise you added will do just as well, and you might not find a good prediction model.
However, your bottom graph is not so bad; you have only 2 major false positives out of more than 15 predictions. That suggests better feature engineering could close the gap.

How specific should a Support Vector Machine Model be?

The whole point of using an SVM is that the algorithm can decide whether an input is true or false, and so on.
I am trying to use an SVM for predictive maintenance to predict how likely a system is to overheat.
For my example, the range is 0-102°C and if the temperature reaches 80°C or above it's classed as a failure.
My inputs are arrays of 30 doubles (the last 30 readings).
I am making some sample inputs to train the SVM, and I was wondering if it is good practice to pass in very specific data to train it, e.g. arrays of 80°C, 81°C, ..., 102°C so that the model automatically associates these values with failure. You could also pass an array of 30 x 79°C and label it as a pass.
This seems like a complete way of doing it, although if you input arrays like that, would it not be the same as hardcoding a switch statement that triggers when the temperature reads 80-102°C?
Would it be a good idea to pass in these "hardcoded" style arrays or should I stick to more random inputs?
If there is a finite set of possibilities, I would really recommend Naïve Bayes, as that method would fit this problem perfectly. If you are forced to use an SVM, it would be rather difficult. The main idea with an SVM is classification, and the number of scenarios does not really matter; the input is seldom discrete anyway, so there are usually infinitely many scenarios. However, a normally implemented SVM only gives you a class label, so unless you define 100 classes (one for 1%, another for 2%, and so on), it won't really solve the problem of predicting how likely a failure is.
The conclusion is that this could work, but it would not be considered best practice. You can imagine your 30-dimensional vector space divided into 100 small subspaces; each data point, a 30x1 vector, is a point in that space, and the probability is decided by which of the 100 subsets it falls in. However, having 100 classes with data that is not very clean, or is insufficient, will lead to very poorly performing models.
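If you do go the SVM route, here is a minimal sketch of the basic setup (assuming scikit-learn; the synthetic training windows below are only placeholders, not a recommendation to hardcode thresholds):

```python
# Sketch: binary SVM on windows of 30 temperature readings (fail vs. pass).
# Assumes scikit-learn; the synthetic training data is only a placeholder.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_fail = rng.uniform(80, 102, size=(200, 30))    # windows that reached the failure range
X_pass = rng.uniform(20, 79, size=(200, 30))     # windows that stayed below 80°C
X = np.vstack([X_fail, X_pass])
y = np.array([1] * 200 + [0] * 200)

clf = SVC(kernel="rbf", probability=True).fit(X, y)
# probability=True gives a failure probability estimate rather than just a label
print(clf.predict_proba([[75.0] * 30])[0, 1])
```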
Cheers :)

Neural Network - Should I Remove All Derived / Calculated Variables?

I'm using a neural network to control the movement of a character in a game. I currently have a huge number of dimensions, and in the interest of trimming them to improve storage and code manageability, I'm considering removing all derived variables, i.e. any variable that can be calculated from data already sent into the network.
An example of this is the relationship between a) position, b) velocity, and c) acceleration along a path. Currently I send the last 50 data points of all three to the NN to help it decide its next movement. However, I wonder if system control / error could be minimized just as easily by sending only position. Theoretically, the neural network should be able to derive the velocity and acceleration at a point in time entirely on its own, given the position history.
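To make that concrete, velocity and acceleration are just finite differences of the position history (a NumPy sketch with placeholder data):

```python
# Sketch: velocity and acceleration recovered from a position history by
# finite differences (what the network would have to learn implicitly).
import numpy as np

position = np.cumsum(np.random.randn(50))   # placeholder for 50 position samples
velocity = np.diff(position)                 # first difference (per time step)
acceleration = np.diff(velocity)             # second difference
```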
Generally, is dimension reduction in this capacity recommended? Why or why not?
I know the usual recommendation in this scenario is just to test it and see what happens, but there are so many variables here that it would take days to test, so I was hoping to hear about anyone's experience with this type of situation and what they surmise the general rule to be.
Bonus question: would this assessment/decision be different for a neural network (intent on mapping functions to data) as opposed to a random forest (which seems to use more of a nearest-neighbour approach)?
Thanks!!
Implement PCA to reduce the number of features. The reduced features will have unusual units, like [position·velocity·acceleration], but if you do PCA correctly you can retain a feature set that holds 99% of the variance of the original set.
Then use the new feature set in your NN.
Reducing dimensions is recommended to speed up algorithms because, as you observed, there is a lot of similarity between your features.
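A minimal sketch of that, assuming scikit-learn (passing a float to n_components keeps enough components to explain that fraction of the variance):

```python
# Sketch: PCA keeping enough components to retain 99% of the variance.
# Assumes scikit-learn; X is a placeholder feature matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.randn(1000, 150)                      # placeholder: 150 raw features
X_scaled = StandardScaler().fit_transform(X)        # PCA is scale-sensitive
pca = PCA(n_components=0.99)                        # keep 99% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape[1], "components retained")
```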

HMM application in speech recognition

This is my first time posting a question here, so if my approach is not standard I apologize. I understand there are lots of questions out there on this, and I have read tons of theses, questions, articles and tutorials, yet I still seem to have a problem, and it's always best to ask. I am creating a speech recognition application using phoneme-level processing (not isolated words) with continuous HMMs based on Gaussian mixture models, involving the Baum-Welch, forward-backward and Viterbi algorithms.
I have implemented a good feature extraction and pre-processing stage (MFCC); the feature vectors consist of the MFCCs plus delta and acceleration coefficients, and that part works pretty well. However, when it comes to the HMMs, I seem to either have a major misunderstanding about how HMMs are supposed to help recognize speech, or I am missing a small point. I have tried so hard that at this point I can't really tell what's right and what's wrong.
First off, I recorded around 50 words, 6 utterances each, ran them through a compatibility and conversion program I wrote myself, and extracted the features so that they can be used for Baum-Welch.
Please tell me where I am making a mistake in this procedure. I will also mention a few doubts I have about it, so that you can help me understand the whole subject better.
Here are the steps in my application concerning anything related to training.
Steps for the initial parameters of each HMM model:
1 - assign all observations from each training sample of each model to their corresponding discrete state (in other words, decide which feature vector belongs to which alphabetic state).
2 - use k-means to find the initial continuous emission parameters; clustering is done over all observations of each state, with a cluster count of 6 (equal to the number of mixtures in the probability density function). The parameters are the sample means, sample covariances and mixture weights of each cluster.
3 - create the initial state and transition probability matrices for each model and training sample individually (a left-right structure is used here): transition probability 0 to previous states and 1 for up to one state ahead, and an initial probability of 1 for the first state and 0 for the others.
4 - calculate the Gaussian-mixture-based probability density function for each state and its corresponding cluster, over all the vectors in all the training samples of each model.
5 - calculate the initial emission parameters using the PDFs and the mixture weights of the clusters.
6 - calculate the gamma variables using the initial parameters (transitions, emissions, initials) in forward-backward together with the initial PDFs, using the continuous formula for gamma (gamma = the probability of being in a certain state at a certain time, for any of the mixtures).
7 - estimate new state initials.
8 - estimate new state transitions.
9 - estimate new sample means.
10 - estimate new sample covariances.
11 - estimate new PDFs.
12 - estimate new emissions using the new PDFs.
Repeat steps 6 to 12 with the newly estimated values on each iteration, use Viterbi to get an overview of how the estimation is going, and when the probability stops changing, stop and save.
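For reference, the pipeline above corresponds roughly to the following sketch using an off-the-shelf GMM-HMM (assuming the hmmlearn library rather than my own code; the left-right topology is set through the start and transition matrices, and the k-means-style initialization of means, covariances and weights is left to the library). n_mix=6 mirrors the 6 mixtures above; the number of states is arbitrary:

```python
# Sketch: one left-right GMM-HMM per word, trained on its MFCC(+delta) features.
# Assumes hmmlearn; n_mix=6 mirrors the question, n_states is arbitrary.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_word_model(feature_list, n_states=5, n_mix=6):
    # feature_list: list of (T_i, 39) arrays, one per utterance of this word
    X = np.vstack(feature_list)
    lengths = [len(f) for f in feature_list]

    # left-right topology: start in state 0, move only to self or the next state
    startprob = np.zeros(n_states); startprob[0] = 1.0
    transmat = np.zeros((n_states, n_states))
    for i in range(n_states):
        transmat[i, i] = 0.5
        transmat[i, min(i + 1, n_states - 1)] += 0.5

    model = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag",
                   n_iter=20, init_params="mcw",   # init means/covars/weights only
                   params="stmcw")                 # but re-estimate everything
    model.startprob_ = startprob
    model.transmat_ = transmat
    return model.fit(X, lengths)
```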
Now my issues:
First, I don't know whether the entire procedure I have followed is correct, or whether there is a better way to approach this. All I know is that convergence is pretty fast: after 4-5 iterations the values are already not changing anymore. However, assuming I am on the right track:
It's not practical for me to sit down and pre-assign each feature vector to its state at step 1, and I don't think that's a standard procedure either. I don't even know whether I have to do it at all; from all my reading it was the best method I could find to get rapid convergence.
Second, say Baum-Welch has done a great job of re-estimating and finding local maxima. What raises my doubt about my implementation is: how are those parameters later going to help me recognize speech? I assume the estimated parameters are used in Viterbi to find the optimal state sequence for every spoken utterance. If so, then the emission parameters are not known for new input, because if you look closely you will see that the final emission parameters in my algorithm tie each alphabetic state of each model to the specific signals observed for that model. No emission parameters can be found if a signal does not exactly match the ones used in re-estimation, so it obviously won't work, and any attempt to match signals up just to find emissions would make the whole HMM lose its purpose.
Again, I might have a wrong idea about almost everything here. I would appreciate it if you could help me understand what I am doing wrong; if ANYTHING is wrong, please let me know. Thank you.
You're attempting to determine the most likely set of phonemes that would have generated the sounds you're observing. You're not attempting to work out emission parameters for each new signal; you're working out the most likely set of inputs that would have produced it.
Also, your input corpus is quite small, so it's unsurprising that it converges so quickly. If you're doing this while affiliated with a university, see if they have access to one of the larger speech corpora commonly used to train this kind of algorithm.

Determining the number of hidden states in a Hidden Markov Model

I am learning about Hidden Markov Models for classifying motion in a sequence of t image frames.
Assume I have m dimensions of features from each frame. I then cluster each frame's feature vector into a symbol (the observable symbol), and I create k different HMM models for the k classes.
How do I then determine the number of hidden states for each model to optimise prediction?
Btw, is my approach correct? If I misunderstood how to use it, please correct me:)
Thanks :)
"is my approach already correct?"
Your current approach is correct. I have done the same thing some weeks ago and had asked the same questions. I have built a gesture recognition tool.
You say you have k classes you want to recognize, so yes, you will train k HMM. For each HMM you run Forward algorithm and receive P(HMM|observation) for each hidden markov model (alternatively Viterbi decoding is also possible). Then you take the one with the highest probability.
It's also correct to treat the m-dimensional feature vector as a single observation symbol. Depending on what your vector looks like, you might want to use a continuous hidden Markov model or a discrete one. Discrete models are often easier to work with and easier to train on little training data, so if your feature vector space is continuous, you might want to discretize it (e.g. through uniform bins) to make all values discrete.
The question with discretization is: how many observation classes will you have?
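As a rough sketch of that recipe (assuming scikit-learn for the discretization and hmmlearn for the HMMs; cluster and state counts are arbitrary), each class gets its own HMM and a new sequence is assigned to the model with the highest log-likelihood:

```python
# Sketch: discretize frame features with k-means, train one HMM per class,
# then classify a new sequence by the highest log-likelihood (forward algorithm).
# Assumes scikit-learn and hmmlearn; cluster/state counts are arbitrary.
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn.hmm import CategoricalHMM   # MultinomialHMM in older hmmlearn

def train_models(sequences_per_class, kmeans, n_states=4):
    models = {}
    for label, seqs in sequences_per_class.items():      # seqs: list of (T_i, m) arrays
        symbols = [kmeans.predict(s).reshape(-1, 1) for s in seqs]
        X, lengths = np.vstack(symbols), [len(s) for s in symbols]
        models[label] = CategoricalHMM(n_components=n_states, n_iter=50).fit(X, lengths)
    return models

def classify(models, kmeans, sequence):
    symbols = kmeans.predict(sequence).reshape(-1, 1)
    return max(models, key=lambda label: models[label].score(symbols))
```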
"how to determine the number of hidden state for each model to get optimal prediction?"
However, I cannot fully answer your actual question about the number of hidden states. From what I have been taught in other areas, it seems like it's a lot of benchmarking and testing. E.g. in speech recongition we use 3 HMM states for each phonem (human sound), because sounds sound different at the beginning, in the middle and at the end. And then each different phonem gets one triple. But that was of course engineering.
In my own application I have thought like this: I wanted to define gestures and associate them with directions. Like open_firefox = [UP, RIGHT]. So I decided to use four hidden states for all four directions.
I guess finding out the best number of states is a lot about engineering and trying out different things.
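One concrete way of "trying out different things" is to sweep the number of states and compare a penalized likelihood such as BIC. A sketch, assuming hmmlearn with Gaussian emissions and an approximate free-parameter count:

```python
# Sketch: pick the number of hidden states by sweeping and comparing BIC.
# Assumes hmmlearn with Gaussian emissions; the free-parameter count is approximate.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def select_n_states(X, lengths, candidates=(2, 3, 4, 5, 6, 8)):
    best_n, best_bic = None, np.inf
    n_features = X.shape[1]
    for n in candidates:
        model = GaussianHMM(n_components=n, covariance_type="diag", n_iter=50).fit(X, lengths)
        logL = model.score(X, lengths)
        # approx. free parameters: start probs + transitions + means + diagonal covariances
        n_params = (n - 1) + n * (n - 1) + 2 * n * n_features
        bic = n_params * np.log(len(X)) - 2 * logL
        if bic < best_bic:
            best_n, best_bic = n, bic
    return best_n
```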
