Alternatives for SVMs - machine-learning

In the disadvantages of SVM they say
If the number of features is much greater than the number of samples,
the method is likely to give poor performances.
What is a good alternative in such a case?

You can take a look at the LIBSVM guide for beginners; section C.1 gives you the answer, with an example of exactly what you have asked:
C.1 Number of instances << number of features. Many microarray data in
bioinformatics are of this type. We consider the Leukemia data from
the LIBSVM data sets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/
datasets). The training and testing sets have 38 and 34 instances,
respectively. The number of features is 7,129, much larger than the
number of instances. We merge the two files and compare the cross
validation accuracy of using the RBF and the linear kernels:
RBF kernel with parameter selection
$ cat leu leu.t > leu.combined
$ python grid.py leu.combined
...
8.0 3.0517578125e-05 97.2222
(Best C=8.0, γ=0.000030518 with five-fold cross-validation rate=97.2222%)
Linear kernel with parameter selection
$ python grid.py -log2c -1,2,1 -log2g 1,1,1 -t 0 leu.combined
...
0.5 2.0 98.6111
(Best C=0.5 with five-fold cross-validation rate=98.6111%)
Though grid.py was designed for the RBF kernel, the
above way checks various C using the linear kernel (-log2g 1,1,1 sets
a dummy γ).
The cross-validation accuracy of using the linear kernel
is comparable to that of using the RBF kernel. Apparently, when the
number of features is very large, one may not need to map the data.
In addition to LIBSVM, the LIBLINEAR software mentioned below is also
effective for data in this case.
So, as you can see, you can use an SVM with a linear kernel and obtain good results.
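To make that concrete, here is a minimal sketch (not from the guide) using scikit-learn, whose LinearSVC is backed by LIBLINEAR; the synthetic data is just a stand-in for the Leukemia set:

# Linear SVM on a small-sample, high-dimensional dataset (sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in: 72 instances, 7129 features, as in the Leukemia data.
X, y = make_classification(n_samples=72, n_features=7129,
                           n_informative=50, random_state=0)

# LIBLINEAR-backed linear SVM; C is the main hyperparameter to tune.
clf = LinearSVC(C=0.5, max_iter=10000)
print("5-fold CV accuracy: %.4f" % cross_val_score(clf, X, y, cv=5).mean())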

Related

Suggested unsupervised feature selection / extraction method for 2 class classification?

I've got a set of F features, e.g. Lab color space and entropy. By concatenating all features together, I obtain a feature vector of dimension d (between 12 and 50, depending on which features are selected).
I usually get between 1000 and 5000 new samples, denoted x. A Gaussian Mixture Model is then trained with these vectors, but I don't know which class the samples are from. What I do know is that there are only 2 classes. Based on the GMM prediction I get the probability of a feature vector belonging to class 1 or class 2.
My question now is: how do I obtain the best subset of features, for instance only entropy and normalized RGB, that will give me the best classification accuracy? I guess this is achieved if the class separability is increased due to the feature subset selection.
Maybe I can utilize Fisher's linear discriminant analysis, since I already have the mean and covariance matrices obtained from the GMM? But wouldn't I have to calculate the score for each combination of features then?
It would be nice to get some feedback on whether this is an unrewarding approach and I'm on the wrong track, and/or any other suggestions.
One way of finding "informative" features is to use the features that maximise the log-likelihood. You could do this with cross-validation.
https://www.cs.cmu.edu/~kdeng/thesis/feature.pdf
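A rough sketch of that idea (my own, not from the thesis), assuming scikit-learn's GaussianMixture; the placeholder data and the exhaustive 3-feature search are things you would adapt:

# Score a feature subset by the held-out log-likelihood of a 2-component GMM.
from itertools import combinations
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def subset_score(X, cols, n_splits=5):
    # Mean per-sample held-out log-likelihood over the folds.
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        gmm = GaussianMixture(n_components=2, random_state=0)
        gmm.fit(X[train_idx][:, cols])
        scores.append(gmm.score(X[test_idx][:, cols]))
    return np.mean(scores)

X = np.random.randn(2000, 12)                         # placeholder for your d-dimensional data
candidate_subsets = list(combinations(range(12), 3))  # exhaustive 3-feature subsets
best = max(candidate_subsets, key=lambda cols: subset_score(X, cols))
print("best 3-feature subset:", best)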
Another idea might be to use another unsupervised algorithm that automatically selects features, such as a clustering forest:
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
In that case the clustering algorithm will automatically split the data based on information gain.
Fisher LDA will not select features but will project your original data onto a lower-dimensional subspace. If you are looking into subspace methods, another interesting approach might be spectral clustering, which also operates in a subspace, or unsupervised neural networks such as autoencoders.

In SVM, can a support vector not be a training sample?

For all SVM versions, like C-SVM, nu-SVM, soft-margin SVM, etc., can a support vector not be a training sample?
No, it can't. A support vector is always a sample from the training set.
This is a good thing, because it means SVMs are oblivious to the internal structure of their samples and their support vectors. Only the kernel function, which is separate from the SVM proper, has to know about the structure of samples. While most kernels operate on vectors of numbers, there exist kernels that operate on strings, trees, graphs, you name it.
(Note that linear support vector machines can be trained without taking support vectors into account. I.e., when you train a linear model under hinge loss with appropriate regularization using an algorithm such as SGD, you get a model that is equivalent to an SVM with a linear kernel, but where the support vectors are implicit.)
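A short sketch of that equivalence (assuming scikit-learn; the dataset is a synthetic placeholder):

# A hinge-loss linear model trained with SGD vs. an explicit linear-kernel SVM.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

svm = SVC(kernel='linear').fit(X, y)          # stores support vectors explicitly
sgd = SGDClassifier(loss='hinge', alpha=1e-4,
                    max_iter=1000).fit(X, y)  # no support vectors stored

print("explicit support vectors:", len(svm.support_vectors_))
print("prediction agreement: %.3f" % (svm.predict(X) == sgd.predict(X)).mean())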

Is there a way to use SVM to build a model that contains a pre-defined percentage of the training data?

Say, I wish to use LIBSVM to build a model that contains 70% of the training data. Is that possible?
There are no techniques that allow you to specify in advance exactly how many support vectors the model will contain. A possible exception is the fixed-size formulation of least-squares SVM, which allows you to specify the size of the kernel in advance (and for LS-SVM every training instance is a SV).
Note that a fraction of 70% support vectors is very high for a typical SVM, so I can't see an immediate reason why you would want this.
You can, however, specify a minimum fraction of support vectors using formulations like the nu-SVM by Schölkopf et al.:
B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector
algorithms. Neural Computation, 12:1207-1245, 2000.
In this formulation you will have at least a fraction nu of support vectors (with 0 < nu <= 1). nu-SVMs are implemented in LIBSVM, for example (use -s 1 for nu-SVC or -s 4 for nu-SVR). For more information, you may refer to this pdf (page 4) or google nu-SVM to find various papers about it.
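As a quick illustration (my sketch, not from the paper): scikit-learn's NuSVC wraps LIBSVM's nu-SVC, and nu acts as a lower bound on the support-vector fraction:

# Ask for at least ~70% of the training set to end up as support vectors.
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = NuSVC(nu=0.7).fit(X, y)
print("support-vector fraction: %.2f" % (len(clf.support_) / float(len(X))))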
If you look in the tools folder of libsvm you will find a python script called subset.py. You can use it to randomly select a subset of your data for training.
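If you would rather stay in Python without subset.py, a hedged equivalent sketch with scikit-learn (X and y stand for your full dataset):

# Randomly keep 70% of the data for training.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 10)            # placeholder data
y = np.random.randint(0, 2, 1000)

X_sub, X_rest, y_sub, y_rest = train_test_split(X, y, train_size=0.7,
                                                random_state=0)
print(X_sub.shape)                        # (700, 10)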

unigrams & bigrams (tf-idf) less accurate than just unigrams (tf-idf)?

This is a question about linear regression with n-grams, using tf-idf (term frequency - inverse document frequency). To do this, I am using numpy sparse matrices and sklearn for linear regression.
I have 53 cases and over 6000 features when using unigrams. The predictions are based on cross-validation using LeaveOneOut.
When I create a tf-idf sparse matrix of only unigram scores, I get slightly better predictions than when I create a tf-idf sparse matrix of unigram+bigram scores. The more columns I add to the matrix (columns for trigram, quadgram, quintgrams, etc.), the less accurate the regression prediction.
Is this common? How is this possible? I would have thought that the more features, the better.
It's not common for bigrams to perform worse than unigrams, but there are situations where it may happen. In particular, adding extra features may lead to overfitting. Tf-idf is unlikely to alleviate this, as longer n-grams will be rarer, leading to higher idf values.
I'm not sure what kind of variable you're trying to predict, and I've never done regression on text, but here's some comparable results from literature to get you thinking:
In random text generation with small (but non-trivial) training sets, 7-grams tend to reconstruct the input text almost verbatim, i.e. cause complete overfit, while trigrams are more likely to generate "new" but still somewhat grammatical/recognizable text (see Jurafsky & Martin; can't remember which chapter and I don't have my copy handy).
In classification-style NLP tasks performed with kernel machines, quadratic kernels tend to fare better than cubic ones because the latter often overfit on the training set. Note that unigram+bigram features can be thought of as a subset of the quadratic kernel's feature space, and {1,2,3}-grams of that of the cubic kernel.
Exactly what is happening depends on your training set; it might simply be too small.
As larsmans said, adding more variables / features makes it easier for the model to overfit and hence lose test accuracy. In the master branch of scikit-learn there is now a min_df parameter to cut off any feature with fewer than that number of occurrences. Hence min_df=2 to min_df=5 might help you get rid of spurious bigrams.
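For illustration, a sketch using current scikit-learn API names (TfidfVectorizer; docs is a placeholder corpus):

# Unigrams + bigrams, dropping any n-gram that occurs in fewer than 2 documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]           # placeholder corpus

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())          # only n-grams shared by both documents survive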
Alternatively you can use L1 or L1 + L2 penalized linear regression (or classification) using either the following classes:
sklearn.linear_model.Lasso (regression)
sklearn.linear_model.ElasticNet (regression)
sklearn.linear_model.SGDRegressor (regression) with penalty == 'elastic_net' or 'l1'
sklearn.linear_model.SGDClassifier (classification) with penalty == 'elastic_net' or 'l1'
This will make it possible to ignore spurious features and lead to a sparse model with many zero weights for noisy features. Grid Searching the regularization parameters will be very important though.
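A rough sketch of that pipeline with placeholder data (any of the classes above would slot in the same way; the alpha grid is an assumption to tune for your data):

# L1-penalized regression with a grid search over the regularization strength.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, LeaveOneOut

X = np.random.randn(53, 6000)               # placeholder: 53 cases x 6000 tf-idf features
y = np.random.randn(53)

search = GridSearchCV(Lasso(max_iter=10000),
                      param_grid={'alpha': np.logspace(-3, 1, 9)},
                      cv=LeaveOneOut(),
                      scoring='neg_mean_squared_error')
search.fit(X, y)
print("best alpha:", search.best_params_['alpha'])
print("non-zero weights:", int(np.sum(search.best_estimator_.coef_ != 0)))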
You can also try univariate feature selection, such as is done in the text classification example of scikit-learn (check the SelectKBest and chi2 utilities).
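For example (a sketch; note that chi2 needs class labels and non-negative features, so for a continuous regression target you would use f_regression instead):

# Keep the 500 features most associated with the class label.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
X = rng.rand(53, 6000)                      # placeholder non-negative tf-idf matrix
y = rng.randint(0, 2, 53)                   # class labels

X_reduced = SelectKBest(chi2, k=500).fit_transform(X, y)
print(X_reduced.shape)                      # (53, 500)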

What is the OpenCV svm type parameter

The OpenCV SVM implementation takes a parameter labeled "SVM type", which must be set in the CvSVMParams structure used to train the SVM. All the explanation I can find is:
// SVM type
enum { C_SVC=100, NU_SVC=101, ONE_CLASS=102, EPS_SVR=103, NU_SVR=104 };
Anyone know what these different values represent?
They are different formulations of SVM. At the heart of an SVM is a mathematical optimization problem, and this problem can be stated in different ways.
C-SVM uses C as the trade-off parameter between the size of the margin and the number of training points that are misclassified. C is just a number; the useful range depends on the dataset, and it can go from very small (like 10^-5) to very large (like 10^5), depending on your data.
nu-SVM uses nu instead of C. nu is roughly the fraction of training points that will end up as support vectors. The more support vectors, the wider your margin and the more training points that will be misclassified. nu typically ranges from 0.1 to 0.8 - at 0.1 roughly 10% of training points will be support vectors, at 0.8 more like 80%. I say roughly because it's only correlated that way - it's not exact.
epsilon-SVR and nu-SVR use the SVM machinery for regression. Instead of doing binary classification by finding a maximum-margin hyperplane, the same concept is used to find a hypertube that best fits the data, in order to predict future values. They differ in how they are parameterized (just as nu-SVM and C-SVM differ).
One-class SVM is for novelty detection. Rather than doing binary classification or predicting a value, you give the SVM a training set and it attempts to train a model that wraps around that set, so that a future instance can be classified as inside the class or outside it (novel, or an outlier).
In general:
Classification SVM Type 1 (also known as C-SVM classification)
Classification SVM Type 2 (also known as nu-SVM classification)
Regression SVM Type 1 (also known as epsilon-SVM regression)
Regression SVM Type 2 (also known as nu-SVM regression)
Details can be found in the OpenCV documentation page for SVM.
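The question refers to the old C API's CvSVMParams, but the same type constants exist in the modern cv2.ml Python bindings; a minimal sketch with placeholder data:

# Selecting the SVM type in OpenCV's cv2.ml API.
import cv2
import numpy as np

samples = np.random.randn(100, 2).astype(np.float32)
labels = np.random.randint(0, 2, (100, 1)).astype(np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)     # or SVM_NU_SVC, SVM_ONE_CLASS, SVM_EPS_SVR, SVM_NU_SVR
svm.setKernel(cv2.ml.SVM_RBF)
svm.setC(1.0)                     # the C trade-off parameter for C-SVC
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)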
