I'm using the caret package in R for multiclass classification. I have 3 classes and I use the train function for training. Here is the code for it:
# Note: allowParallel is an argument of trainControl(), not of train()
trained.model.rf <- train(x = dataset.train[, -ncol(dataset.train)],
                          y = dataset.train[, ncol(dataset.train)],
                          method = "rf",
                          trControl = trainControl(method = "cv", number = 10,
                                                   allowParallel = TRUE),
                          tuneLength = 6)
model.rf <- trained.model.rf$finalModel
result.rf <- predict(model.rf, dataset.test, type="response")
In dataset.train I have all three classes together.
How can I tell whether this is a one-versus-one or a one-versus-all approach?
Edit:
After a second read I realized you might just be asking what caret is doing, not which approach you should pick. Sadly I can't answer that, and I have to add that caret's documentation is awful (they could learn something from scikit-learn)!
If there is no specific reason to care, I would not worry much in your case (small number of classes + random forest; with SVMs or with many classes, though, it would be interesting to see what's used).
/Edit
There is not much difference in performance between the two when the underlying classifiers work well (see the referenced paper).
One-vs-All is usually the default in most libraries I have tried.
But there is a possible trade-off depending on the underlying classifier and the data set:
Let's call the number of classes N and the number of samples in your data set M.
One vs. All
Will train N classifiers, each on the whole data set.
Consequences:
It performs a linear number of classifier trainings, which scales well with the number of classes.
That's probably why it's often the default, as it still works well with 100 classes or more.
Each classifier learns on the whole data set, which can be a problem if the underlying classifier's complexity is bounded by sample size.
Popular example: SVMs are between O(M^2) and O(M^3) in the number of samples (depending on kernel and kernel cache; ignoring SGD-based approaches).
SVMs can therefore be troublesome to train on huge data sets (compare with OvO below).
One vs. One
Will train N*(N-1)/2 classifiers ("N choose 2"), each on a partial data set.
Consequences:
It performs a quadratic number of classifier trainings (in the number of classes), which scales badly as the number of classes grows.
If your data set is balanced, each classifier works on only 2M/N samples (only the samples of the two selected classes are used).
This can help compared to OvA if the classifier's complexity is dominated by sample size (as mentioned above).
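In scikit-learn (used here purely for illustration; the data set and base classifier are made up), both decompositions are explicit wrappers, so you can count the trained binary models and check the N versus N(N-1)/2 scaling directly:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy 4-class problem; LinearSVC stands in for any binary base classifier.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)

ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y)

print(len(ovr.estimators_))  # N        = 4 binary classifiers, each on all 300 samples
print(len(ovo.estimators_))  # N(N-1)/2 = 6 binary classifiers, each on ~150 samples
```

Each OvO estimator sees only the samples of its two classes, which is the sample-size advantage described above.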
In your case you have a small set of classes. If your library supports both approaches, I would try OvO first. But this is, as explained, dependent on your classifier and class statistics.
While the paper referenced above says OvA should not be worse than OvO, I can imagine the latter providing more safety if your setup is somewhat imperfect (a badly performing classifier, ...).
Related
I'm working with an extremely unbalanced and heterogeneous multiclass (K = 16) database for research, with a small N ~= 250. For some labels the database has a sufficient number of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database, for a number of reasons.
As a first approach I divided my database into training (80%) and test (20%) sets in a stratified way, and applied several classification algorithms on top of that. I repeated this procedure over 500 stratified train/test splits (each stratified sampling takes individuals randomly within each stratum), hoping to select an algorithm (model) that performs acceptably.
Because of my database, the performance on the test set varies greatly depending on which specific examples end up in the training set. I'm dealing with runs that reach accuracy as high (for my application) as 82% and runs as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure what the standard procedure (if there is any) is for selecting the best-performing model. My rationale is that the best-scoring model may generalize better because the specific examples selected for its training set are richer, so the test set is better classified. However, I'm fully aware of the possibility that the test set is composed of "simpler" cases that are easier to classify, or that the training set comprises all the hard-to-classify cases.
Is there any standard procedure for selecting the best-performing model, considering that the distribution of examples across my train/test sets causes the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually select the best-performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted label on the test set of one of the best cases:
Confusion matrix of the predicted label on the test set of one of the worst cases:
They both use the same algorithm and parameters.
Good Accuracy =/= Good Model
I first want to point out that good accuracy on your test set does not necessarily mean a good model in general! In your case this mainly has to do with your extremely skewed distribution of samples.
Especially when doing a stratified split with one class dominantly represented, you will likely get good results simply by predicting that one class over and over again.
A good way to see whether this is happening is to look at a confusion matrix of your predictions.
If there is one class that absorbs the predictions of the other classes as well, that is an indicator of a bad model. I would argue that in your case it will generally be very hard to find a good model unless you actively balance your classes more during training.
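As a concrete toy illustration of the point above (the numbers are invented, not from the question): a degenerate "always predict the majority class" model scores high accuracy on skewed data, while its confusion matrix immediately gives it away.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Heavily skewed toy labels: 90 / 6 / 4 samples per class.
y_true = np.array([0] * 90 + [1] * 6 + [2] * 4)
y_pred = np.zeros_like(y_true)  # always predict the majority class 0

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks good on paper
print(confusion_matrix(y_true, y_pred))
# [[90  0  0]
#  [ 6  0  0]
#  [ 4  0  0]]  <- every minority sample collapses into class 0
```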
Use the power of Ensembles
Another idea is indeed to ensemble multiple models (in your case, the models resulting from different splits), since an ensemble is generally assumed to generalize better.
Even if you sacrifice some accuracy on paper, I would bet that the confusion matrix of an ensemble is likely to look much better than that of a single "high accuracy" model. Especially if you disregard the models that perform extremely poorly (making sure, again, that the "poor" performance comes from actually bad performance and not just an unlucky split), I can see very good generalization.
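A rough sketch of that idea, assuming scikit-learn (the data set, base model, and split counts are illustrative stand-ins for the 500 stratified splits described in the question): train one model per stratified split, then combine them by majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# One model per stratified split, mirroring the per-split procedure above.
models = [DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr])
          for tr, _ in splitter.split(X, y)]

def vote(X_new):
    """Majority vote over the per-split models."""
    preds = np.stack([m.predict(X_new) for m in models])  # (n_models, n_samples)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)

print(vote(X[:5]))  # ensemble predictions for the first few samples
```

The vote smooths out the split-to-split variance that makes any single model's score hard to trust.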
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you divide your data into k equally large sets and always train on k-1 of them while evaluating on the remaining one. You then not only get a feeling for whether your split was reasonable (k-fold CV implementations, like the one from sklearn, usually report the results for the individual splits), but you also get an overall score: the average over all folds.
Note that 5-fold CV equals splitting into five 20% sets, so essentially what you are doing now, plus the "shuffling part".
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.
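A minimal scikit-learn sketch of this (data set and model are illustrative): the per-fold scores expose how sensitive the result is to the split, and their mean is the overall estimate.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Stratified folds preserve class proportions -- important with imbalance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)         # one accuracy per fold; the spread reveals split sensitivity
print(scores.mean())  # overall cross-validated estimate
```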
I am building an intent recognition system using multiclass classification with an SVM.
Currently I only have a small number of classes, limited by the training data. However, in the future I may get data with new classes. I can, of course, put all the data together and retrain the model, but that is time-consuming and inefficient.
My current idea is to do one-against-one classification from the beginning; then, when a new class comes in, I can just train it against each of the existing classes and get n new classifiers. I am wondering whether there are better methods for this. Thanks!
The most efficient approach would be to focus on one-class classifiers; then you just need to add one new model to the ensemble. To compare:
Let us assume that we have K classes and you get 1 new class with P new points from it; your whole data set consists of N points (for simplicity, equally distributed among classes); your training algorithm's complexity is f(N), and if your classifier supports incremental learning, its complexity is g(P, N).
OVO (one vs. one) - to get exact results you need to train K new classifiers, each with about P + N/K data points (the new class paired with one existing class), thus leading to O(K f(P + N/K)); there is no way to use incremental training.
OVA (one vs. all) - to get exact results you retrain all classifiers; done in batch fashion this costs O(K f(N+P)), worse than the above. However, if you can train in incremental fashion, you just need O(K g(P, N)), which might be better (depending on the classifier).
One-class ensemble - it might seem a bit weird, but Naive Bayes, for example, can be seen as such an approach: you have a generative model for each class-conditional distribution, so the model for each class is independent of the remaining ones. Thus the complexity of adding a class is just O(f(P)).
This list is obviously not exhaustive, but it should give you a general idea of what to analyze.
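A toy sketch of the one-class/generative idea (illustrative code, not a library API): each class gets an independent Gaussian density, so adding class K+1 only costs fitting its own model, and nothing already trained is touched.

```python
import numpy as np

class PerClassGaussian:
    """One independent diagonal-Gaussian density per class (toy model)."""

    def __init__(self):
        self.models = {}  # label -> (mean, variance)

    def add_class(self, label, X):
        # O(f(P)): only the new class's own P points are used.
        self.models[label] = (X.mean(axis=0), X.var(axis=0) + 1e-9)

    def predict(self, X):
        labels = list(self.models)
        # Log-density of each point under each class model, summed over dims.
        ll = np.stack([
            -0.5 * (np.log(2 * np.pi * v) + (X - m) ** 2 / v).sum(axis=1)
            for m, v in (self.models[l] for l in labels)])
        return np.array(labels)[ll.argmax(axis=0)]

rng = np.random.default_rng(0)
clf = PerClassGaussian()
clf.add_class(0, rng.normal(0, 1, (50, 2)))
clf.add_class(1, rng.normal(5, 1, (50, 2)))
clf.add_class(2, rng.normal(-5, 1, (50, 2)))  # new class: one fit, no retraining

print(clf.predict(np.array([[0.1, -0.2], [5.2, 4.8], [-5.0, -4.9]])))
```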
In order to improve the accuracy of an AdaBoost classifier (for image classification), I am using genetic programming to derive new statistical measures. Every time a new feature is generated, I evaluate its fitness by training an AdaBoost classifier and testing its performance. But I want to know whether that procedure is correct; I mean, the use of a single feature to train a learning model.
You can build a model on one feature. I assume that by "one feature" you mean simply one real number (otherwise it would be completely "traditional" usage). This means, however, that you are building a classifier in a one-dimensional space, where many classifiers are overkill (it is a really simple problem). More importantly, checking whether you can correctly classify objects using one particular dimension does not tell you whether it is a good or bad feature once you use a combination of them. In particular it may be the case that:
Many features "discover" the same phenomenon in the data, so each of them separately can yield good results, but once combined they won't be any better than each alone (as they simply capture the same information).
Features may be useless until used in combination. Some phenomena can only be described in a multi-dimensional space, and if you analyze only one-dimensional data you will never discover their true value. As a simple example, consider the four points (0,0), (0,1), (1,0), (1,1), where (0,0) and (1,1) belong to one class and the rest to the other. Looking at each dimension separately, the best possible accuracy is 0.5 (as you always have points of two different classes at exactly the same coordinates, 0 and 1). Once combined, you can easily separate them, as it is the XOR problem.
To sum up, it is OK to build a classifier in a one-dimensional space, but:
Such a problem can be solved without "heavy machinery".
The results should not be used as a basis for feature selection (or, to be more strict, doing so can be very deceptive).
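The XOR example above can be made concrete in a few lines (scikit-learn used for illustration): each coordinate alone is useless, but together the points are trivially separable.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labelling: (0,0) and (1,1) share a class

for d in (0, 1):  # one dimension at a time
    acc = DecisionTreeClassifier().fit(X[:, [d]], y).score(X[:, [d]], y)
    print(f"dim {d} alone: accuracy {acc}")  # 0.5 -- both classes sit at x=0 and x=1

acc = DecisionTreeClassifier().fit(X, y).score(X, y)
print(f"both dims:   accuracy {acc}")  # 1.0 -- perfectly separable in 2-D
```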
I'm trying to perform leave-one-out cross-validation for modelling a particular problem using a back-propagation neural network. I have 8 features in my training data and 20 instances. I'm trying to make the NN learn a function to build a prediction model. Now, the problem is that the error rate in prediction is quite high. My guess is that the number of training instances is small compared to the number of features under consideration. Is this conclusion correct? Is there an optimal feature-to-instance ratio?
(This topic is often phrased in the ML literature as the acceptable size or shape of the data set, given that a data set is often described as an m x n matrix in which m is the number of rows (data points) and n is the number of columns (features); obviously m >> n is preferred.)
In any event, I am not aware of a general rule for an acceptable range of features to observations; there are probably a couple of reasons for this:
such a ratio would depend strongly on the quality of the data (signal-to-noise ratio); and
the number of features is just one element of model complexity (e.g., interactions among the features), and model complexity is the strongest determinant of the number of data instances (data points) required.
So there are two sets of approaches to this problem, which, far from being opposed, can both be applied to the same model:
reduce the number of features; or
use a statistical technique to leverage the data that you do have
A couple of suggestions, one for each of the two paths above:
Eliminate "non-important" features, i.e., those features that don't contribute to the variability in your response variable. Principal Component Analysis (PCA) is a fast and reliable way to do this, though there are a number of other techniques generally subsumed under the rubric "dimension reduction."
Use bootstrap methods instead of cross-validation. The difference in methodology seems slight, but the (often substantial) improvement in reducing prediction error is well documented for multi-layer perceptrons (neural networks) (see, e.g., Efron, B. and Tibshirani, R.J., "Improvements on cross-validation: The .632+ bootstrap method", Journal of the American Statistical Association, 92, 548-560, 1997). If you are not familiar with bootstrap methods for splitting training and testing data, the general technique is similar to cross-validation, except that instead of taking subsets of the entire data set you take subsamples (with replacement). Section 7.11 of Elements is a good introduction to bootstrap methods.
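A rough sketch of the bootstrap idea (the data set and model here are generic stand-ins, not your 20-instance NN problem): train on a resample drawn with replacement and evaluate on the points left out of that resample, then average over rounds.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)

errs = []
for _ in range(20):                           # 20 bootstrap rounds
    idx = rng.integers(0, n, size=n)          # sample n points WITH replacement
    oob = np.setdiff1d(np.arange(n), idx)     # ~36.8% left out ("out of bag")
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    errs.append(1 - model.score(X[oob], y[oob]))

print(np.mean(errs))  # bootstrap estimate of the prediction error
```

The .632/.632+ estimators in the cited paper refine this plain out-of-bag average.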
The best single source on this general topic that i have found is Chapter 7 Model Assessment and Selection from the excellent treatise Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. This book is available free to download from the book's homepage.
I'm implementing a one-versus-rest classifier to discriminate between neural data corresponding (1) to moving a computer cursor up and (2) to moving it in any of the other seven cardinal directions or no movement. I'm using an SVM classifier with an RBF kernel (created by LIBSVM), and I did a grid search to find the best possible gamma and cost parameters. I have tried using training data with 338 elements from each of the two classes (undersampling my large "rest" class), and I have used 338 elements from my first class and 7218 from my second one with a weighted SVM.
I have also used feature selection to bring the number of features I'm using down from 130 to 10. I tried using the ten "best" features and the ten "worst" features when training my classifier. I have also used the entire feature set.
Unfortunately, my results are not very good, and moreover, I cannot find an explanation why. I tested with 37759 data points, where 1687 of them came from the "one" (i.e. "up") class and the remaining 36072 came from the "rest" class. In all cases, my classifier is 95% accurate, BUT all the correctly predicted values fall into the "rest" class (i.e. all my data points are predicted as "rest", and all the incorrectly predicted values fall in the "one"/"up" class). When I tried testing with the 338 data points from each class that I used for training, the number of support vectors was 666, ten fewer than the number of data points. In that case the accuracy is only 71%, which is unusual since my training and testing data are exactly the same.
Do you have any idea what could be going wrong? If you have any suggestions, please let me know.
Thanks!
The test data set being the same as the training data means your training accuracy was 71%. There is nothing wrong with that as such; the data was possibly just not well separable by the kernel you used.
However, one point of concern is the number of support vectors: such a high count suggests probable overfitting.
Not sure if this amounts to an answer - it would probably be hard to give one without actually seeing the data - but here are some ideas regarding the issue you describe:
In general, an SVM tries to find a hyperplane that best separates your classes. However, since you have opted for one-versus-rest classification, you have no choice but to mix all negative cases together (your 'rest' class). This might make the 'best' separation much less suited to your problem. I'm guessing this might be a major issue here.
To verify whether that's the case, I suggest using only one other cardinal direction as the negative set and seeing whether that improves results. If it does, you can train 7 classifiers, one for each direction. Another option might be to use the multiclass option of LIBSVM, or a tool like SVMlight, which is able to classify one against many.
One caveat of most SVM implementations is their inability to handle big differences in size between the positive and negative sets, even with weighting. In my experience, weighting factors of over 4-5 are problematic in many cases. On the other hand, since the variety on your negative side is large, taking equal sizes might also be less than optimal. Thus, I'd suggest using something like your 338 positive examples and around 1000-1200 random negative examples, with weighting.
A little off your question: I would also consider other types of classifiers. To start with, I'd suggest thinking about k-NN.
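A sketch of that suggestion, with scikit-learn's SVC standing in for raw LIBSVM and synthetic data standing in for the neural recordings (the counts 338 and 1000 mirror the numbers discussed above; everything else is illustrative): moderately undersample the negatives, then let class weighting absorb the remaining ~1:3 imbalance.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, (338, 10))        # stand-in for the "up" class
X_neg_all = rng.normal(-1.0, 1.0, (7218, 10))  # stand-in for the full "rest" class

# Undersample the negatives to ~1000 instead of using all 7218 or only 338.
neg_idx = rng.choice(len(X_neg_all), size=1000, replace=False)
X = np.vstack([X_pos, X_neg_all[neg_idx]])
y = np.array([1] * 338 + [0] * 1000)

# class_weight="balanced" compensates for the remaining imbalance.
clf = SVC(kernel="rbf", gamma="scale", class_weight="balanced").fit(X, y)
print((clf.predict(X_pos) == 1).mean())  # recall on the minority "up" class
```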
Hope it helps :)