Support Vector Machine using previous predictions - machine-learning

My SVM takes a list of values as input and predicts a label for each entry. The list is processed from front to back.
Now my question is whether it is possible to take previous predictions into account, because each label may be awarded only once.
If a label has been awarded, it may not be awarded a second time.

That sounds like a sequence labeling problem. You can use a CRF (conditional random field) for that.
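If a full CRF is more than you need, a simpler heuristic (not a CRF, and not globally optimal) is to keep the SVM as it is and mask out labels that were already awarded while walking the list front to back. A minimal sketch, assuming you can get one score per label for each entry (for an SVM that could be the per-class decision value); the function name and the toy scores are made up for illustration:

```python
def assign_unique_labels(scores, labels):
    """Greedily pick, for each entry in order, the best label not used yet.

    scores[i][j] is the classifier's score for giving labels[j] to entry i.
    """
    used = set()
    result = []
    for row in scores:
        # Only labels that have not been awarded yet are candidates.
        candidates = [(s, j) for j, s in enumerate(row) if j not in used]
        if not candidates:
            result.append(None)  # more entries than labels
            continue
        _, best = max(candidates)
        used.add(best)
        result.append(labels[best])
    return result


labels = ["A", "B", "C"]
scores = [
    [0.9, 0.1, 0.0],  # entry 0 prefers A
    [0.8, 0.7, 0.1],  # entry 1 also prefers A, but A is taken
    [0.2, 0.3, 0.1],  # entry 2 prefers B, but B is taken
]
predicted = assign_unique_labels(scores, labels)
print(predicted)
```

Because the choice is greedy, an early mistake can block the right label for a later entry, which is the kind of problem a proper sequence model tries to avoid.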

Related

How to build separate classifiers for each label in the dataset?

I have a list of columns and each column is to be labelled by a label from another list of labels.
E.g., two columns named ALT_ID and MTRC_NM are matched with the labels Alternate ID and Metric Name respectively.
This fuzzy string matching has been taken care of. Problem is, I want to incorporate a learning model in this.
Essentially, after the matched results are displayed, the user curates the matches as CORRECT or INCORRECT. Based on this feedback and other features of the column (like minimum value, maximum value), I want to train a classifier such that the learning model will eventually stop making the incorrect matches in the future.
Note: In the first run, only the name of the column is used to produce the first set of results. After this, I want to use other features (like minimum value) to train the model.
Problem is, there can be 10,000 terms (or labels), maybe even more and the user just marks these as CORRECT or INCORRECT. For incorrect classifications, the user does not tell us what the correct classification should be.
I believe one solution could be to make separate classifiers for each label and based on the Correct/Incorrect feedback for a particular classification, we can use these feature vectors to train a classifier for this classification. So in the future, if the fuzzy string matching nominates Metric Name as the classification for some column, we can let the "Metric Name" classifier decide if it is correct or incorrect.
I don't know how to make separate classifiers for each label. I also don't know if this approach is feasible. Any other solution to this problem will also help.
You do not want to create a separate model for each label, as training more than 10,000 models isn't really feasible. Two possibilities that come to mind are:
Create a supervised learning model with one label as input and probability of each of 10 000 labels as output which only uses correct examples for predictions.
Create a reinforcement learning model with the same input, but with an output that maximises a reward function defined as +1 for each positive prediction and -1 for each negative prediction. This model will also try to maximise the number of correct predictions, but it will be able to learn from incorrect predictions at the same time, i.e. predict a score of -1 for an incorrect pair (x, y).
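One way to read the second idea is as a single shared scorer over (column, label) pairs that is nudged towards +1 on CORRECT feedback and towards -1 on INCORRECT feedback. The sketch below is only that reward-regression part, not a full reinforcement learning setup. The class name, the three pair features (bias, name similarity, type compatibility) and the feedback values are all hypothetical:

```python
class PairScorer:
    """One linear model shared by all labels; it scores a (column, label) pair."""

    def __init__(self, n_features, learning_rate=0.1):
        self.lr = learning_rate
        self.w = [0.0] * n_features

    def score(self, features):
        return sum(w * f for w, f in zip(self.w, features))

    def update(self, features, reward):
        """reward is +1 for CORRECT feedback and -1 for INCORRECT feedback."""
        error = reward - self.score(features)
        for i, f in enumerate(features):
            self.w[i] += self.lr * error * f


# Hypothetical pair features: [bias, name similarity, type compatibility]
feedback = [
    ([1.0, 0.9, 1.0], +1),  # user marked this match CORRECT
    ([1.0, 0.8, 0.0], -1),  # user marked this match INCORRECT
]

scorer = PairScorer(n_features=3)
for _ in range(200):
    for features, reward in feedback:
        scorer.update(features, reward)

print(scorer.score([1.0, 0.9, 1.0]), scorer.score([1.0, 0.8, 0.0]))
```

Since the weights are shared, the number of labels does not change the model size; a candidate match proposed by the fuzzy matcher would be accepted only if its score is positive.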

Hidden Markov Model with new, unseen observations

I am trying to use a hidden Markov model, but I have the problem that my observations are triplets of continuous values (temperature, humidity, something else). This means that I do not know the exact number of possible observations, as they are not discrete. This creates the problem that I cannot define the size of my emission matrix. Discretizing the values is not an option because, with the step size needed for each variable, I get millions of possible observation combinations. So, can this problem be solved with an HMM? Essentially, can the size of the emission matrix change every time I get a new observation?
I think you have misunderstood the concept: with continuous observations there is no emission matrix, only emission densities, and the transition probability matrix is constant. Your problem with 3 continuous random variables is easier than speech recognition, for example, which uses 39 continuous MFCC features. In speech the assumption is that the 39 variables (3 in your case) are normally distributed and independent, but not identically distributed. So if you want to use an HMM, you do not need an emission matrix that changes size; your problem can still be solved.
One approach is to give the new unseen observation an equal probability of being emitted by all the states, or assign it a probability according to a PDF if you happen to know it. This will at least solve your immediate problem. Later on, when the state is observed (I assume you are trying to predict states), you may want to reassign the real probabilities to the new observation.
A second approach (the one I like better) is to cluster your observations employing a clustering method. This way, your observations would be the clusters not the real time data. Once you capture your data you assign it to the corresponding cluster and give the HMM the cluster number as an observation. No more "unseen" observations to worry about.
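To make the clustering idea concrete, here is a minimal sketch of the assignment step. It assumes the centroids were already found by some clustering method such as k-means; the centroid and reading values are made up, and in practice you would probably scale the three variables first so that one of them does not dominate the distance:

```python
import math


def nearest_cluster(observation, centroids):
    """Return the index of the centroid closest to the observation."""
    distances = [math.dist(observation, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical centroids for (temperature, humidity, third variable)
centroids = [(15.0, 40.0, 1.0), (25.0, 60.0, 2.0), (35.0, 80.0, 3.0)]
readings = [(16.2, 42.0, 1.1), (33.0, 79.5, 2.9), (24.0, 58.0, 2.2)]

# The discrete HMM sees cluster indices instead of raw triplets.
symbols = [nearest_cluster(r, centroids) for r in readings]
print(symbols)
```

The emission matrix then has a fixed size of (number of states) x (number of clusters), no matter what values arrive later.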
Or you may have to resort to a Continuous Hidden Markov model instead of a discrete one. But this one comes with a lot of caveats.
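For the continuous variant, the emission matrix is replaced by one density per state. A minimal sketch, assuming diagonal-covariance Gaussian emissions (my assumption, not something the question specifies) and made-up parameters for two states; it evaluates the sequence likelihood with the forward algorithm:

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def emission_density(obs, means, variances):
    """Density of one observation vector under independent Gaussians."""
    density = 1.0
    for x, m, v in zip(obs, means, variances):
        density *= gaussian_pdf(x, m, v)
    return density


def forward_likelihood(observations, start, trans, state_means, state_vars):
    """Forward algorithm: likelihood of the whole sequence under the HMM."""
    n = len(start)
    alpha = [
        start[s] * emission_density(observations[0], state_means[s], state_vars[s])
        for s in range(n)
    ]
    for obs in observations[1:]:
        alpha = [
            emission_density(obs, state_means[s], state_vars[s])
            * sum(alpha[p] * trans[p][s] for p in range(n))
            for s in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state model over (temperature, humidity, third variable)
start = [0.5, 0.5]
trans = [[0.8, 0.2], [0.2, 0.8]]
state_means = [(10.0, 30.0, 1.0), (30.0, 70.0, 2.0)]
state_vars = [(25.0, 100.0, 1.0), (25.0, 100.0, 1.0)]

typical = forward_likelihood([(11.0, 32.0, 1.0), (29.0, 68.0, 2.1)],
                             start, trans, state_means, state_vars)
atypical = forward_likelihood([(20.0, 50.0, 5.0), (20.0, 50.0, 5.0)],
                              start, trans, state_means, state_vars)
print(typical, atypical)
```

Any new triplet gets a density from every state, so there is nothing to resize. For long sequences you would work in log space or rescale alpha at each step to avoid underflow.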

Last time step's state vs all time steps' state of RNN/LSTM/GRU

Based on my understanding so far, after training an RNN/LSTM model for a sequence classification task I can make predictions in the following two ways:
Take the last state and make a prediction using a softmax layer
Take all time steps' states, make a prediction at each time step, sum the predictions and take the maximum
In general, is there any reason to choose one over the other? Or is this application dependent? Also, if I decide to use the second strategy, should I use a different softmax layer for each time step or one softmax layer for all time steps?
I have never seen any network that implements the second approach. The most obvious reason is that all states except for the last one haven't seen the whole sequence.
Take, for example, review sentiment classification. A review can start with a few positive aspects, followed by a "but" and a list of drawbacks. All RNN cells before the "but" are going to be biased and their state won't reflect the true label. Does it matter how many of them output the positive class and how confident they are? The last cell's output would be a better predictor anyway, so I don't see a reason to take the previous ones into account.
If the sequential aspect of the data is not important in a particular problem, then an RNN doesn't seem like a good approach in general. Otherwise, you are better off using the last state.
There is, however, one exception: sequence-to-sequence models with an attention mechanism (see for instance this question). But that case is different, because the decoder predicts a new token at each step, so it can benefit from looking at earlier states. Besides, it uses the final hidden state as well.
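To illustrate the review example, here is a small sketch of both readout strategies applied to per-step logits. The logits are invented to mimic a review that starts positive and ends negative, and a single shared softmax is used for all time steps (that is my choice for the sketch, not a recommendation):

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_from_last(step_logits):
    """Strategy 1: classify from the last time step only."""
    probs = softmax(step_logits[-1])
    return probs.index(max(probs))


def predict_from_sum(step_logits):
    """Strategy 2: sum the per-step probabilities, then take the maximum."""
    summed = [0.0] * len(step_logits[0])
    for logits in step_logits:
        for c, p in enumerate(softmax(logits)):
            summed[c] += p
    return summed.index(max(summed))


# Classes: 0 = negative, 1 = positive. Made-up logits for four time steps:
# three steps before the "but" lean positive, the final step is clearly negative.
step_logits = [[0.0, 2.0], [0.0, 2.5], [0.5, 1.0], [3.0, 0.0]]

print(predict_from_last(step_logits), predict_from_sum(step_logits))
```

With these numbers the last-step readout says negative while the summed readout says positive, because the early steps outvote the one step that has seen the whole review.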

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea is to use a neural network with a softmax output. Softmax gives you a probability for every class: when the network is very confident about a class, it gives that class a high probability and the others low probabilities, but when it is unsure, the probabilities will be close together and none of them will be very high. So you could define a threshold: if the probability of every class is less than 70%, predict "other".
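The thresholding step itself is tiny. A minimal sketch, assuming you already have the softmax probabilities from some model; the function name, the label list and the example probabilities are made up, and 0.7 is just the value suggested above:

```python
def classify_with_other(probabilities, labels, threshold=0.7):
    """Return the most probable label, or "other" if no class is confident enough."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return "other"
    return labels[best]


labels = ["finance", "tech", "sports", "health", "travel"]

confident = classify_with_other([0.85, 0.05, 0.04, 0.03, 0.03], labels)
unsure = classify_with_other([0.30, 0.25, 0.20, 0.15, 0.10], labels)
print(confident, unsure)
```

One caveat: a softmax network can still be confidently wrong on text that looks nothing like the training data, so the threshold is worth tuning on some held-out "other" examples if you can collect a few.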
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag-of-words approach based on significant indicators for, e.g., Tech and Finance. You could try to identify such indicators by analyzing a given website's tags and articles, or just browse the web for such indicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vector X has n dimensions, where n is the number of indicators. For example, Xi then holds the count of occurrences of the word "asset" and Xi+k the count of the phrase "big data" in the current article.
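Building that count vector could look roughly like this. It is only a sketch: the indicator list and the sample sentence are made up, and matching is done on whole words or phrases, so "assets" would not count as "asset" unless you add stemming:

```python
import re


def indicator_counts(text, indicators):
    """Count whole-word occurrences of each indicator term in the text."""
    lowered = text.lower()
    counts = []
    for term in indicators:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts.append(len(re.findall(pattern, lowered)))
    return counts


# Hypothetical indicator terms; each position in x corresponds to one term.
indicators = ["asset", "dividend", "big data", "cloud"]
article = "Big data tools help asset managers; one asset class now lives in the cloud."

x = indicator_counts(article, indicators)
print(x)
```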
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match zero or more categories, train a model which returns a probability score per label/class (such as a neural net, as Luis Leal suggested). You could then rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
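For the zero-or-more case the threshold is applied to each class independently, so the result can be empty. A minimal sketch with made-up scores (the function name and the value of t are only for illustration):

```python
def matching_labels(scores, t):
    """Return every label whose score is above the threshold t (possibly none)."""
    return [label for label, score in scores.items() if score > t]


both = matching_labels({"finance": 0.81, "tech": 0.64, "sports": 0.05}, t=0.5)
none = matching_labels({"finance": 0.20, "tech": 0.31, "sports": 0.12}, t=0.5)
print(both, none)
```

An empty list is then your "no category" answer. Note that this needs scores that are independent per class; softmax outputs sum to 1, so two classes can never both be above 0.5.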
Try this NBayes implementation.
For identifying the "Other" category, don't worry too much. Just train on your required categories with data that clearly identifies them, and introduce a threshold in the classifier.
If the value for no label crosses the threshold, the classifier assigns the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classifier4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training on non-matches.
http://classifier4j.sourceforge.net/usage.html

Prediction Algorithm for Basketball Stats

I'm working on a project where I need to predict future stats based on past stats of basketball players. I would like to be able to predict next season's statistics based on the statistics of the past three seasons (if there are three previous seasons to choose from). Does anyone have a suggestion for a good prediction algorithm I could use? The data is continuous and there can be anywhere between 5-14 dimensions (age, minutes, points, etc.)
Thanks!
Note: I'd really like to use the program Weka to do this.
Out of the box, random forest would likely give you a strong baseline, so I would start with this.
You can also try linear regression, which is a simple yet relatively effective method, but depending on the data it might require a bit more tweaking (for example, transforming some of the input and/or output variables).
Gradient boosting regression is another strong predictor, but typically also needs more tweaking to work well.
All of these algorithms have Weka implementations.
There obviously isn't one correct answer, but for anyone looking to do something similar, I'll better describe my problem and the solution that I've found. I created a csv file where each row is a different season, and each column contains a different attribute. For each attribute that I would like to predict, I have the stats for the current season and then another column for the stats for the previous season. The first (rookie) season will have 0 for all 'previous season' columns. With this data set, I loaded it into Weka and used a Multilayer Perceptron with the test-option set to Cross-Validation. I set the number of folds to somewhere between 80-90% of the number of seasons available.
Finally, to predict the next season's statistics, you add one more row to the end and input the last-season values with "?" in the columns that you would like to predict. If anyone would like a deeper example, I'd be glad to provide one.
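In case it helps, the file layout described above can be generated with a few lines of Python before loading it into Weka. This is a sketch under my own assumptions: the column names, the prev_ prefix and the numbers are invented, and only one previous season is used:

```python
import csv
import io


def build_rows(seasons, targets, known_next):
    """Add 'previous season' columns and append the row to be predicted.

    seasons: list of dicts, oldest first. targets: columns to predict.
    known_next: values already known for next season (for example age).
    """
    rows = []
    previous = None
    for season in seasons:
        row = dict(season)
        for name in targets:
            # The rookie season has no previous season, so use 0.
            row["prev_" + name] = previous[name] if previous is not None else 0
        rows.append(row)
        previous = season
    future = dict(known_next)
    for name in targets:
        future[name] = "?"  # value to be predicted
        future["prev_" + name] = previous[name] if previous is not None else 0
    rows.append(future)
    return rows


# Made-up stats for one player
seasons = [
    {"age": 22, "points": 8.1, "minutes": 20.5},
    {"age": 23, "points": 12.4, "minutes": 28.0},
]
rows = build_rows(seasons, targets=["points", "minutes"], known_next={"age": 24})

columns = ["age", "points", "minutes", "prev_points", "prev_minutes"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Extending it to three previous seasons would mean keeping the last three rows around instead of just one.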
I also think that if you truly want an accurate prediction, you have to look at player movement. If a player moves to a team with a losing record, do they get more minutes and a larger role, which would inflate their stats? Or do they move to a winning team for a lesser role, where their stats could decrease?
