training time and overfitting with gamma and C in libsvm - machine-learning

I am currently using libsvm to train a support vector machine classifier with a Gaussian (RBF) kernel. On its website it provides a Python script, grid.py, for selecting the best C and gamma.
I wonder: how do training time and overfitting/underfitting change with gamma and C?
Is it correct that:
as C increases from 0 towards +infinity, the trained model goes from underfitting to overfitting, and the training time increases?
as gamma increases from almost 0 towards +infinity, the trained model goes from underfitting to overfitting, and the training time increases?
In grid.py, the default search order is C from small to big BUT gamma from big to small. Is this meant to run from short training time to long, and from underfitting to overfitting, so that we can perhaps save time in selecting the values of C and gamma?
Thanks and regards!

Good question for which I don't have a sure answer, because I myself would like to know. But in response to the question:
So we can perhaps save time in selecting the values of C and gamma?
... I find that, with libsvm, there are definitely "right" values for C and gamma, and they are highly problem-dependent. So regardless of the order in which gamma is searched, many candidate values for gamma must be tested. Ultimately, I don't know any shortcut around this time-consuming (depending on your problem) but necessary parameter search.
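For what it's worth, the exhaustive search that grid.py performs can be sketched with scikit-learn's GridSearchCV; the toy dataset and the exact grid bounds below are just illustrative (grid.py itself drives the libsvm command-line tools):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in data; in practice you would load your own problem.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Exponentially spaced grids, in the spirit of grid.py's defaults.
param_grid = {
    "C":     [2.0**k for k in range(-5, 16, 2)],
    "gamma": [2.0**k for k in range(-15, 4, 2)],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```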

Related

Techniques to improve the accuracy of SVM classifier

I am trying to build a classifier to predict breast cancer using the UCI dataset, with support vector machines. Despite my most sincere efforts to improve the accuracy of the classifier, I cannot get beyond 97.062%. I've tried the following:
1. Finding the optimal C and gamma using grid search.
2. Finding the most discriminative features using the F-score.
Can someone suggest techniques to improve the accuracy? I am aiming for at least 99%.
1. The data are already normalized to the range [0, 10]. Will normalizing to [0, 1] help?
2. Is there some other method to find the best C and gamma?
For SVM it's important to have the same scaling for all features. Normally this is done by scaling the values in each feature (column) so that the mean is 0 and the variance is 1. Another way is to scale each feature so that, for example, the min is 0 and the max is 1. However, there isn't any real difference between [0, 1] and [0, 10]: both will show the same performance, since a uniform rescaling of the features is simply absorbed by the choice of gamma during the grid search.
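As a concrete illustration of the two scalings just described, a minimal scikit-learn sketch (the tiny matrix is made-up data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up data: 3 samples, 2 features on very different scales.
X = np.array([[0.0, 10.0],
              [5.0,  2.0],
              [10.0, 6.0]])

X_std = StandardScaler().fit_transform(X)  # each column: mean 0, variance 1
X_01  = MinMaxScaler().fit_transform(X)    # each column: min 0, max 1
print(X_std, X_01, sep="\n")
```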
If you insist on using SVM for classification, another way that may result in improvement is ensembling multiple SVMs. If you are using Python, you can try BaggingClassifier from sklearn.ensemble.
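A minimal sketch of that suggestion, assuming scikit-learn and its bundled copy of the breast cancer data; the ensemble size and subsample fraction are illustrative, not tuned values:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Bag 10 RBF SVMs, each trained on a random 80% subsample of the rows.
ensemble = make_pipeline(
    StandardScaler(),
    BaggingClassifier(SVC(kernel="rbf"), n_estimators=10,
                      max_samples=0.8, random_state=0),
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```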
Also notice that you can't expect perfect performance from a real training set. I think 97% is very good performance; it is likely that you would overfit the data if you pushed higher than this.
Some thoughts came to my mind when reading your question and the argument you put forward about the author claiming to have achieved an accuracy of 99.51%.
My first thought was OVERFITTING. I may be wrong, because it might depend on the dataset, but overfitting is the first thing I would suspect. Now my questions:
1. Has the author stated in his article whether the dataset was split into a training and a testing set?
2. Is this accuracy of 99.51% achieved on the training set or the testing set?
On the training set you can hit 99.51% when your model is overfitting.
Generally, in that case, the performance of the SVM classifier on unseen data is poor.
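One quick check is to compare training accuracy against held-out accuracy. A sketch with scikit-learn's copy of the UCI breast cancer data, using deliberately extreme parameters to force the effect:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Deliberately extreme parameters on unscaled data to force overfitting.
clf = SVC(kernel="rbf", C=1000, gamma=1.0).fit(X_tr, y_tr)
print("train accuracy:", clf.score(X_tr, y_tr))  # can be ~100%
print("test  accuracy:", clf.score(X_te, y_te))  # much lower on unseen data
```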

How to choose C and gamma AFTER grid search using libSVM (RBF kernel) for best possible generalisation?

I am aware of the abundance of questions asking about choosing the 'best' C and gamma values for SVM (RBF kernel). The standard answer is a grid search; however, my question starts after the results of the grid search. Let me explain:
I have a data set of 10 subjects on which I perform leave-one-subject-out cross-validation, meaning I perform a grid search for each left-out subject. In order not to optimise on this training data, I do not want to choose the best C and gamma parameters by taking the mean accuracy over all 10 models and searching for the maximum. Considering one model within the cross-validation, I could perform another cross-validation only on the training data within this model (not involving the left-out validation subject), but you can imagine the computational effort, and I do not have enough time for this at the moment.
Since the grid search for each of the 10 models resulted in a wide range of good C and gamma parameters (differences in accuracy of only 2-4%, see Figure 1), I thought about a different way.
I defined a region within the grid which only contains the accuracies that are within 2% of the maximum accuracy of that grid. All other accuracy values, differing by more than 2%, are set to zero (see Figure 2). I do this for every model and build the intersection of the regions of all models. This results in a much smaller region of C and gamma values that would produce accuracies within 2% of the maximum accuracy for each model. However, the range is still rather big. So I thought about choosing the C-gamma pair with the lowest C, as this would mean that I am furthest away from overfitting and closest to good generalisation. Can I argue like that?
How would I generally choose a C and gamma within this region of C-gamma pairs, which all proved to be reliable settings for my classifier in all 10 models?
Should I focus on minimising the C parameter? Or should I focus on minimising both the C AND the gamma parameter?
I found a related answer here (Are high values for c or gamma problematic when using an RBF kernel SVM?) that says a combination of high C AND high gamma would mean overfitting. I understand that the value of gamma changes the width of the Gaussian curve around data points, but I still can't get my head around what it practically means within a data set.
The post brought me to another idea. Could I use the number of SVs relative to the number of data points as a criterion to choose between all the C-gamma pairs? A low ratio (number of SVs / number of data points) would mean a better generalisation? I am willing to lose accuracy, as it shouldn't affect the outcome I am interested in, if I get a better generalisation in return (at least from a theoretical point of view).
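For concreteness, the region-intersection idea might look like this in numpy; the accuracy grids below are random stand-ins for my real grid-search results:

```python
import numpy as np

def near_optimal_mask(acc, tol=0.02):
    """Cells of one model's accuracy grid within `tol` of its own maximum."""
    return acc >= acc.max() - tol

# Stand-in for the 10 per-subject accuracy grids over the same C x gamma axes.
rng = np.random.default_rng(0)
grids = [0.7 + 0.3 * rng.random((8, 8)) for _ in range(10)]

# Keep only the C/gamma cells that are near-optimal for *every* model.
joint = np.logical_and.reduce([near_optimal_mask(g) for g in grids])
c_idx, g_idx = np.nonzero(joint)   # surviving C/gamma index pairs
print(list(zip(c_idx, g_idx)))
```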
Since the linear kernel is a special case of the RBF kernel, there is a method that tunes C with a linear SVM first and then performs a bilinear tuning of the C-gamma pair afterwards, to save time:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.880&rep=rep1&type=pdf
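A simplified reading of that idea (not the paper's exact procedure) might look like this in scikit-learn: tune C cheaply with a linear kernel first, then refine C and gamma for the RBF kernel around that value. The data and grid bounds are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stage 1: tune C cheaply with a linear kernel.
coarse = GridSearchCV(SVC(kernel="linear"),
                      {"C": [2.0**k for k in range(-5, 16, 2)]}, cv=5)
coarse.fit(X, y)
C0 = coarse.best_params_["C"]

# Stage 2: refine C and gamma for the RBF kernel around the linear optimum.
fine = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [C0 / 2, C0, C0 * 2],
                     "gamma": [2.0**k for k in range(-15, 4, 2)]}, cv=5)
fine.fit(X, y)
print(fine.best_params_)
```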

Are high values for c or gamma problematic when using an RBF kernel SVM?

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide by Hsu et al. and iterated over several values for both C and gamma. The parameters which worked best for classifying known terms (test and training material differ, of course) are rather high, C = 2^10 and gamma = 2^3.
So far the high parameters seem to work OK, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, but those evaluations are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluations turn out positive? Do I perhaps need another kernel type?
Thank you very much!
In general you have to perform cross-validation to answer whether the parameters are all right or whether they lead to overfitting.
From the "intuition" perspective - it seems like highly overfitted model. High value of gamma means that your Gaussians are very narrow (condensed around each poinT) which combined with high C value will result in memorizing most of the training set. If you check out the number of support vectors I would not be surprised if it would be the 50% of your whole data. Other possible explanation is that you did not scale your data. Most ML methods, especially SVM, requires data to be properly preprocessed. This means in particular, that you should normalize (standarize) the input data so it is more or less contained in the unit sphere.
RBF seems like a reasonable choice, so I would keep using it. A high value of gamma is not necessarily a bad thing; it depends on the scale where your data lives. Likewise, while a high C value can lead to overfitting, it is also affected by the scale, so in some cases it might be just fine.
If you think that your dataset is a good representation of the whole data, then you can use cross-validation to test your parameters and have some peace of mind.
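Both checks, cross-validating the chosen parameters and inspecting the support-vector fraction, can be sketched as follows; the data is a synthetic stand-in, and the C and gamma values are the ones from the question:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; substitute your own term-extraction features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

clf = SVC(kernel="rbf", C=2**10, gamma=2**3)  # the question's parameters
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
# A support-vector fraction near 1.0 suggests the model is memorizing.
print("SV fraction:", clf.support_.size / len(X))
```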

Best classifier for a big number of attributes

I have a dataset built from 940 attributes and 450 instances, and I'm trying to find the best classifier to get the best results.
I have used every classifier that WEKA suggests (such as J48, CostSensitive, combinations of several classifiers, etc.).
The best solution I have found is a J48 tree with an accuracy of 91.7778%,
and the confusion matrix is:
    a   b   <-- classified as
  394  27 |  a = NON_C
   10  19 |  b = C
I want to get better results in the confusion matrix: at least 90% accuracy for both the TN and TP rates.
Is there something I can do to improve this (such as classifiers with long running times that scan all options)? Any other ideas I didn't think of?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is good to think about the problem first:
- Find and work only with relevant features (attributes); otherwise the task can be noisy. Relevant features = features that have a high correlation with the class (NON_C, C).
- Your dataset is biased, i.e. the number of NON_C instances is much higher than C. Sometimes it can be helpful to train your algorithm on equal portions of positive and negative (in your case NON_C and C) examples, and cross-validate it on the natural (real) proportions. (See the sketch after this list.)
- The size of your training data is small in comparison with the number of features. Maybe increasing the number of instances would help.
...
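A sketch of the first two points, assuming scikit-learn; note that class_weight="balanced" stands in here for manually resampling equal class portions, and the data below is synthetic with the question's shape:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in with the question's shape: 450 instances, 940 attributes,
# roughly 93% of the instances in the majority (NON_C-like) class.
X, y = make_classification(n_samples=450, n_features=940,
                           weights=[0.93], random_state=0)

clf = make_pipeline(
    SelectKBest(f_classif, k=50),                # keep 50 most class-correlated features
    SVC(kernel="rbf", class_weight="balanced"),  # upweight the rare class
)
print(cross_val_score(clf, X, y, cv=5).mean())
```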
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severely imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm.
Second, you have more features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower-dimensional PCA space, say one containing 90% of the variance. This will remove much of the noise in the training data.
Fourth, be sure you are training and testing on different portions of your data. From your description it sounds like you are training and evaluating on the same data, which is a big no-no.
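A sketch of the third and fourth points, again assuming scikit-learn and synthetic data with the question's dimensions; the 90% variance threshold is the one mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 450 instances, 940 attributes.
X, y = make_classification(n_samples=450, n_features=940, random_state=0)

# Hold out a test set so training and evaluation never share data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pca = PCA(n_components=0.90)   # keep components explaining 90% of the variance
clf = SVC(kernel="rbf")
clf.fit(pca.fit_transform(X_tr), y_tr)
print("held-out accuracy:", clf.score(pca.transform(X_te), y_te))
```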

Issues in Convergence of Sequential minimal optimization for SVM

I have been working on support vector machines for about 2 months now. I have coded the SVM myself, and for the optimization problem of the SVM I have used Sequential Minimal Optimization (SMO) by Dr. John Platt.
Right now I am at the phase where I am going to grid search to find the optimal C value for my dataset. (Please find details of my project application and dataset here: SVM Classification - minimum number of input sets for each class.)
I have successfully checked my custom implemented SVM's accuracy for C values ranging from 2^0 to 2^6. But now I am having some issues regarding the convergence of the SMO for C > 128. For example, when I try to find the alpha values for C = 128, it takes a long time before the SMO actually converges and successfully returns the alpha values.
The time taken for the SMO to converge is about 5 hours for C = 100. This is huge, I think (SMO is supposed to be fast), even though I'm getting good accuracy.
I am stuck right now because I cannot test the accuracy for higher values of C.
I am actually displaying the number of alphas changed in every pass of SMO, and I keep getting 10, 13, 8... alphas changing continuously. The KKT conditions assure convergence, so what is this weird behavior?
Please note that my implementation is working fine for C <= 100, with good accuracy, though the execution time is long.
Please give me your input on this issue.
Thank you and cheers.
For most SVM implementations, training time can increase dramatically with larger values of C. To get a sense of how training time in a reasonably good implementation of SMO scales with C, take a look at the log-scale line for libSVM in the graph below.
[Figure: SVM training time vs. C, from Sentelle et al.'s "A Fast Revised Simplex Method for SVM Training" (http://dmcer.net/StackOverflowImages/svm_scaling.png)]
You probably have two easy ways and one not so easy way to make things faster.
Let's start with the easy stuff. First, you could try loosening your convergence criterion. A strict criterion like epsilon = 0.001 will take much longer to train, while typically resulting in a model that is no better than a looser criterion like epsilon = 0.01. Second, you should try to profile your code to see if there are any obvious bottlenecks.
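To make the first easy fix concrete: scikit-learn's libsvm-backed SVC exposes the stopping tolerance as tol (default 1e-3). A rough timing sketch on synthetic data (timings will vary by machine):

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Compare a strict and a loose SMO stopping tolerance at the same C.
for tol in (1e-3, 1e-2):
    t0 = time.perf_counter()
    acc = SVC(kernel="rbf", C=128, tol=tol).fit(X, y).score(X, y)
    print(f"tol={tol}: {time.perf_counter() - t0:.2f}s, train acc={acc:.3f}")
```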
The not-so-easy fix would be to switch to a different optimization algorithm (e.g., SVM-RSQP from Sentelle et al.'s paper above). But if you have a working implementation of SMO, you should probably only do that as a last resort.
If you want complete convergence, it takes a very long time, especially if C is large. You can consider using a looser stopping criterion and giving a maximum number of iterations (the default in LibSVM is 1,000,000); pushing the iteration count beyond that multiplies the time, and the extra loss reduction is not worth the cost. The result may then not fully satisfy the KKT conditions (some support vectors end up inside the band and some non-support vectors outside it), but the error is small and acceptable. In my opinion, if higher accuracy is required, it is better to use another quadratic programming algorithm instead of the SMO algorithm.
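A sketch of that iteration cap using scikit-learn's SVC, whose max_iter argument limits libsvm's SMO loop (-1, the default, means no limit); the cap value below is arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cap libsvm's SMO iteration count. If the cap is hit, sklearn emits a
# ConvergenceWarning and the returned alphas only approximately satisfy
# the KKT conditions, which is the trade-off described above.
clf = SVC(kernel="rbf", C=1024, max_iter=10_000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```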
