Techniques to improve the accuracy of SVM classifier - machine-learning

I am trying to build a classifier to predict breast cancer using the UCI dataset. I am using support vector machines. Despite my most sincere efforts to improve upon the accuracy of the classifier, I cannot get beyond 97.062%. I've tried the following:
1. Finding the most optimal C and gamma using grid search.
2. Finding the most discriminative feature using F-score.
Can someone suggest me techniques to improve upon the accuracy? I am aiming at at least 99%.
1.Data are already normalized to the ranger of [0,10]. Will normalizing it to [0,1] help?
2. Some other method to find the best C and gamma?

For SVM, it's important to have the same scaling for all features and normally it is done through scaling the values in each (column) feature such that the mean is 0 and variance is 1. Another way is to scale it such that the min and max are for example 0 and 1. However, there isn't any difference between [0, 1] and [0, 10]. Both will show the same performance.
If you insist on using SVM for classification, another way that may result in improvement is ensembling multiple SVM. In case you are using Python, you can try BaggingClassifier from sklearn.ensemble.
Also notice that you can't expect to get any performance from a real set of training data. I think 97% is a very good performance. It is possible that you overfit the data if you go higher than this.

some thoughts that have come to my mind when reading your question and the arguments you putting forward with this author claiming to have achieved acc=99.51%.
My first thought was OVERFITTING. I can be wrong, because it might depend on the dataset - But the first thought will be overfitting. Now my questions;
1- Has the author in his article stated whether the dataset was split into training and testing set?
2- Is this acc = 99.51% achieved with the training set or the testing one?
With the training set you can hit this acc = 99.51% when your model is overfitting.
Generally, in this case the performance of the SVM classifier on unknown dataset is poor.

Related

Why does different batch-sizes give different accuracy in Keras?

I was using Keras' CNN to classify MNIST dataset. I found that using different batch-sizes gave different accuracies. Why is it so?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although, the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the Mini-batch gradient descent effect during training process. You can find good explanation Here that I mention some notes from that link here:
Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the
cost of noise in the training process.
Large values give a learning
process that converges slowly with accurate estimates of the error
gradient.
and also one important note from that link is :
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a
given computational cost, across a wide range of experiments. In all
cases the best results have been obtained with batch sizes m = 32 or
smaller
Which is the result of this paper.
EDIT
I should mention two more points Here:
because of the inherent randomness in machine learning algorithms concept, generally you should not expect machine learning algorithms (like Deep learning algorithms) to have same results on different runs. You can find more details Here.
On the other hand both of your results are too close and somehow they are equal. So in your case we can say that the batch size has no effect on your network results based on the reported results.
This is not connected to Keras. The batch size, together with the learning rate, are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD), which entirely affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them towards the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimation, which helps regularize the model and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
I want to add two points:
1) When use special treatments, it is possible to achieve similar performance for a very large batch size while speeding-up the training process tremendously. For example,
Accurate, Large Minibatch SGD:Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest you to over-read these numbers. Because the difference is so subtle that it could be caused by noise. I bet if you try models saved on a different epoch, you will see a different result.

Using Convolution Neural network as Binary classifiers

Given any image I want my classifier to tell if it is Sunflower or not. How can I go about creating the second class ? Keeping the set of all possible images - {Sunflower} in the second class is an overkill. Is there any research in this direction ? Currently my classifier uses a neural network in the final layer. I have based it upon the following tutorial :
https://github.com/torch/tutorials/tree/master/2_supervised
I am taking images with 254x254 as the input.
Would SVM help in the final layer ? Also I am open to using any other classifier/features that might help me in this.
The standard approach in ML is that:
1) Build model
2) Try to train on some data with positive\negative examples (start with 50\50 of pos\neg in training set)
3) Validate it on test set (again, try 50\50 of pos\neg examples in test set)
If results not fine:
a) Try different model?
b) Get more data
For case #b, when deciding which additional data you need the rule of thumb which works for me nicely would be:
1) If classifier gives lots of false positive (tells that this is a sunflower when it is actually not a sunflower at all) - get more negative examples
2) If classifier gives lots of false negative (tells that this is not a sunflower when it is actually a sunflower) - get more positive examples
Generally, start with some reasonable amount of data, check the results, if results on train set or test set are bad - get more data. Stop getting more data when you get the optimal results.
And another thing you need to consider, is if your results with current data and current classifier are not good you need to understand if the problem is high bias (well, bad results on train set and test set) or if it is a high variance problem (nice results on train set but bad results on test set). If you have high bias problem - more data or more powerful classifier will definitely help. If you have a high variance problem - more powerful classifier is not needed and you need to thing about the generalization - introduce regularization, remove couple of layers from your ANN maybe. Also possible way of fighting high variance is geting much, MUCH more data.
So to sum up, you need to use iterative approach and try to increase the amount of data step by step, until you get good results. There is no magic stick classifier and there is no simple answer on how much data you should use.
It is a good idea to use CNN as the feature extractor, peel off the original fully connected layer that was used for classification and add a new classifier. This is also known as the transfer learning technique that has being widely used in the Deep Learning research community. For your problem, using the one-class SVM as the added classifier is a good choice.
Specifically,
a good CNN feature extractor can be trained on a large dataset, e.g. ImageNet,
the one-class SVM can then be trained using your 'sunflower' dataset.
The essential part of solving your problem is the implementation of the one-class SVM, which is also known as anomaly detection or novelty detection. You may refer http://scikit-learn.org/stable/modules/outlier_detection.html for some insights about the method.

How to continue to train SVM based on the previous model

We all know that the objective function of SVM is iteratively trained. In order to continue training, at least we can store all the variables used in the iterations if we want to continue on the same training dataset.
While, if we want to train on a slightly different dataset, what should we do to make full use of the previously trained model? Or does this kind of thought make sense? I think it is quite reasonable if we train a K-means model. But I am not sure if it still makes sense for the SVM problem.
There are some literature on this topic:
alpha-seeding, in which the training data is divided into chunks. After you train a SVM on the ith chunk, you take those and use them to train your SVM with the (i+1)th chunk.
Incremental SVM serves as an online learning in which you update a classifier with new examples rather than retrain the entire data set.
SVM heavy package with online SVM training as well.
What you are describing is what an online learning algorithm does and unfortunately the classic definition for SVM is done in a batch fashion.
However, there are several solvers for SVM that produces quasy optimal hypothesis to the underneath optimization problem in an online learning way. In particular my favourite is Pegasos-SVM which can find a good near optimal solution in linear time:
http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf
In general this doesn't make sense. SVM training is an optimization process with regard to every training set vector. Each training vector has an associated coefficient, which as a result is either 0 (irrelevant) or > 0 (support vector). Adding another training vector imposes another, different, optimization problem.
The only way to reuse information from previous training I can think of is to choose support vectors from the previous training and add them to the new training set. I'm not sure, but this probably will negatively affect generalization - VC dimension of an SVM is related to the number of support vectors, so adding previous support vectors to the new dataset is likely to increase the support vector count.
Apparently, there are more possibilities, as noted in lennon310's answer.

big number of attributes best classifiers

I have dataset which is built from 940 attributes and 450 instance and I'm trying to find the best classifier to get the best results.
I have used every classifier that WEKA suggest (such as J48, costSensitive, combinatin of several classifiers, etc..)
The best solution I have found is J48 tree with accuracy of 91.7778 %
and the confusion matrix is:
394 27 | a = NON_C
10 19 | b = C
I want to get better reuslts in the confution matrix for TN and TP at least 90% accuracy for each.
Is there something that I can do to improve this (such as long time run classifiers which scans all options? other idea I didn't think about?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is a good to think about the problem:
to find and work only with relevant features(attributes), otherwise
the task can be noisy. Relevant features = features that have high
correlation with class (NON_C,C).
your dataset is biased, i.e. number of NON_C is much higher than C.
Sometimes it can be helpful to train your algorithm on the same portion of positive and negative (in your case NON_C and C) examples. And cross-validate it on natural (real) portions
size of your training data is small in comparison with the number of
features. Maybe increasing number of instances would help ...
...
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severly imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm
Second, you have a larger number of features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower dimension PCA space, say containing 90 % of the variance. This will remove much of the noise in the training data.
Fourth, be sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no no.

SGD model "overconfidence"

I'm working on binary classification problem using Apache Mahout. The algorithm I use is OnlineLogisticRegression and the model which I currently have strongly tends to produce predictions which are either 1 or 0 without any middle values.
Please suggest a way to tune or tweak the algorithm to make it produce more intermediate values in predictions.
Thanks in advance!
What is the test error rate of the classifier? If it's near zero then being confident is a feature, not a bug.
If the test error rate is high (or at least not low), then the classifier might be overfitting the training set: measure the difference between of the training error and the test error. In that case, increasing regularization as rrenaud suggested might help.
If your classifier is not overfitting, then there might be an issue with the probability calibration. Logistic Regression models (e.g. using the logit link function) should yield good enough probability calibrations (if the problem is approximately linearly separable and the label not too noisy). You can check the calibration of the probabilities with a plot as explained in this paper. If this is really a calibration issue, then implementing a custom calibration based on Platt scaling or isotonic regression might help fix the issue.
From reading the Mahout AbstractOnlineLogisticRegression docs, it looks like you can control the regularization parameter lambda. Increasing lambda should mean your weights are closer to 0, and hence your predictions are more hedged.

Resources