Determinig the number of hidden states in a Hidden Markov Model - machine-learning

I am learning about Hidden Markov Models for classifying motion in a sequence of t image frames.
Assume I have m dimensions of feature from each frame. Then I cluster it into a symbol (for observable symbol). And I create k different HMM model for k class.
Then, how do I determine the number of hidden states for each model to optimise prediction ?
Btw, is my approach correct? If I misunderstood how to use it, please correct me:)
Thanks :)

"is my approach already correct?"
Your current approach is correct. I have done the same thing some weeks ago and had asked the same questions. I have built a gesture recognition tool.
You say you have k classes you want to recognize, so yes, you will train k HMM. For each HMM you run Forward algorithm and receive P(HMM|observation) for each hidden markov model (alternatively Viterbi decoding is also possible). Then you take the one with the highest probability.
It's also correct to see the m-dimensional feature vector as a single observation symbol. Depending on what your vector looks like, you might want to use a continuous hidden markov model or a discrete hidden markov model. Working with discrete ones is often easier and easier to train with little training data. So in case your feature vector space is continuous, you might want to consider discretization to make all values discrete (e.g. through uniform classes).
The question about discreteness is: How many classes of observations will you have?
"how to determine the number of hidden state for each model to get optimal prediction?"
However, I cannot fully answer your actual question about the number of hidden states. From what I have been taught in other areas, it seems like it's a lot of benchmarking and testing. E.g. in speech recongition we use 3 HMM states for each phonem (human sound), because sounds sound different at the beginning, in the middle and at the end. And then each different phonem gets one triple. But that was of course engineering.
In my own application I have thought like this: I wanted to define gestures and associate them with directions. Like open_firefox = [UP, RIGHT]. So I decided to use four hidden states for all four directions.
I guess finding out the best number of states is a lot about engineering and trying out different things.

Related

Hidden Markov Model with new, unseen observations

I am trying to use a hidden markov model, but I have the problem that my observations are some triplets of continuous values (temperature, humidity, sth else). This means that I do not know the exact number of my possible observations, as they are not discrete. This creates the problem that I can not define the size of my emission matrix. Considering discrete values is not an option because using the necessary step at each variable, I get some millions of possible observation combinations. So, can this problem be solved with HMM? Essentialy, can the size of the emission matrix change every time that I get a new observation?
I guess you have misunderstood the concept, there is no emission matrix, only transition probability matrix. and it is constant. Concerning your problem with 3 unknown continuous rv. is easier comparing to speech recognition, for example with 39 MFCC continuous rv. but in speech there is the assumption that 39 rv (yours only 3) distributes normal independent, not identical. So if you insist on HMM, then do not change the emission matrix. you're problem still can be solved instead.
One approach is to give the new unseen observation an equal probability of been emitted by all the states, or assign them a probability according a PDF if you happen to know it. This at least will solve your immediate problem. Later on, when the state is observed (I assume you are trying to predict states), you may want to reassign the real probabilities to the new observation.
A second approach (the one I like better) is to cluster your observations employing a clustering method. This way, your observations would be the clusters not the real time data. Once you capture your data you assign it to the corresponding cluster and give the HMM the cluster number as an observation. No more "unseen" observations to worry about.
Or you may have to resort to a Continuous Hidden Markov model instead of a discrete one. But this one comes with a lot of caveats.

Supervised learning linear regression

I am confused about how linear regression works in supervised learning. Now I want to generate a evaluation function for a board game using linear regression, so I need both the input data and output data. Input data is my board condition, and I need the corresponding value for this condition, right? But how can I get this expected value? Do I need to write an evaluation function first by myself? But I thought I need to generate an evluation function by using linear regression, so I'm a little confused about this.
It's supervised-learning after all, meaning: you will need input and output.
Now the question is: how to obtain these? And this is not trivial!
Candidates are:
historical-data (e.g. online-play history)
some form or self-play / reinforcement-learning (more complex)
But then a new problem arises: which output is available and what kind of input will you use.
If there would be some a-priori implemented AI, you could just take the scores of this one. But with historical-data for example you only got -1,0,1 (A wins, draw, B wins) which makes learning harder (and this touches the Credit Assignment problem: there might be one play which made someone lose; it's hard to understand which of 30 moves lead to the result of 1). This is also related to the input. Take chess for example and take a random position from some online game: there is the possibility that this position is unique over 10 million games (or at least not happening often) which conflicts with the expected performance of your approach. I assumed here, that the input is the full board-position. This changes for other inputs, e.g. chess-material, where the input is just a histogram of pieces (3 of these, 2 of these). Now there are much less unique inputs and learning will be easier.
Long story short: it's a complex task with a lot of different approaches and most of this is somewhat bound by your exact task! A linear evaluation-function is not super-uncommon in reinforcement-learning approaches. You might want to read some literature on these (this function is a core-component: e.g. table-lookup vs. linear-regression vs. neural-network to approximate the value- or policy-function).
I might add, that your task indicates the self-learning approach to AI, which is very hard and it's a topic which somewhat gained additional (there was success before: see Backgammon AI) popularity in the last years. But all of these approaches are highly complex and a good understanding of RL and the mathematical-basics like Markov-Decision-Processes are important then.
For more classic hand-made evaluation-function based AIs, a lot of people used an additional regressor for tuning / weighting already implemented components. Some overview at chessprogramming wiki. (the chess-material example from above might be a good one: assumption is: more pieces better than less; but it's hard to give them values)

How specific should a Support Vector Machine Model be?

The whole point of using an SVM is that the algorithm will be able to decide whether an input is true or false etc etc.
I am trying to use an SVM for predictive maintenance to predict how likely a system is to overheat.
For my example, the range is 0-102°C and if the temperature reaches 80°C or above it's classed as a failure.
My inputs are arrays of 30 doubles(the last 30 readings).
I am making some sample inputs to train the SVM and I was wondering if it is good practice to pass in very specific data to train it - eg passing in arrays 80°C, 81°C ... 102°C so that the model will automatically associate these values with failure. You could do an array of 30 x 79°C as well and set that to pass.
This seems like a complete way of doing it, although if you input arrays like that - would it not be the same as hardcoding a switch statement to trigger when the temperature reads 80->102°C.
Would it be a good idea to pass in these "hardcoded" style arrays or should I stick to more random inputs?
If there is a finite set of possibilities I would really recommend using Naïve Bayes, as that method would fit this problem perfectly. However if you are forced to use an SVM, I would say that would be rather difficult. For starters the main idea with an SVM is to use it for classification, and the amount of scenarios does not really matter. The input is however seldom discrete, so I guess there usually are infinite scenarios. However, an SVM implemented normally would only give you a classification, unless you have 100 classes one for 1% another one for 2%, this wouldn't really solve problem.
The conclusion is that this could work, but it would not be considered "best practice". You can imagine your 30 dimensional vector space divided into 100 small sub spaces, and each datapoint, a 30x1 vector is a point in that vectorspace so that the probability is decided by which of the 100 subsets its in. However, having a 100 classes and not very clean or insufficient data, will lead to very bad, hard performing models.
Cheers :)

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The dependent variable or predictors are all the phrases used in these papers - (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with a very high accuracy (more than 90%), while removing this phrase results in accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly you ask whether some feature should be removed because it is a good predictor (it makes your classifier works better). So the answer is short and simple - do not remove it in fact, the whole concept is to find exactly such features.
The only reason to remove such feature would be that this phenomena only occurs in the training set, and not in real data. But in such case you have wrong data - which does not represnt the underlying data density and you should gather better data or "clean" the current one so it has analogous characteristics as the "real ones".
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued ! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.

Machine learning how to use the Facebook interest of a users to give a decision

I'm trying to figure out a way I could represent a Facebook user as a vector. I decided to go with stacking the different attributes/parameters of the user into one big vector (i.e. age is a vector of size 100, where 100 is the maximum age you can have, if you are lets say 50, the first 50 values of the vector would be 1 just like a thermometer). I just can't figure out a way to represent the Facebook interests as a vector too, they are a collection of words and the space that represents all the words is huge, I can't go for a model like a bag of words or something similar. Does anyone know how I should proceed? I'm still new to this, any reference would be highly appreciated.
In the case of a desire to down vote this question just let me know what is wrong about it so that I could improve the wording and context.
Thanks
The "right" approach depends on what your learning algorithm is and what the decision problem is.
It would often be better, though, to represent age as a single numeric feature rather than 100 indicator features. That way learning algorithms don't have to learn the relationship between those hundred features (it's baked-in), and the problem has 99 fewer dimensions, which'll make everything better.
To model the interests, you might want to start with an extremely high-dimensional bag of words model and then use one of various options to reduce the dimensionality:
a general dimensionality-reduction technique like PCA or smarter nonlinear ones, including Kernel PCA or various nonlinear approaches: see wikipedia's overview of dimensionality reduction and of specifically nonlinear techniques
pass it through a topic model and use the learned topic weights as your features; examples include LSA, LDA, HDP and many more

Resources