I am currently doing English alphabet classification using the SVM classifier in OpenCV.
I have the following questions:
How does classification depend on the length of the feature vector?
(What happens if the feature length increases? My current feature length is 125.)
Does the time taken for prediction depend on the amount of data used for training?
Why do we need to normalize the feature vector? (Will this improve prediction accuracy and the time required to predict the class?)
How do I determine the best method for normalizing the feature vector?
1) The length of the feature vector does not matter per se; what matters is the predictive quality of the features.
2) No, it does not depend on the number of samples, but it does depend on the number of features (prediction is generally very fast).
3) Normalization is required if the features are in very different ranges of values.
4) There are basically two options: standardization (subtract the mean, divide by the standard deviation) and scaling (xmax -> +1, xmin -> -1 or 0). You could try both and see which one works better.
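To make point 4 concrete, here is a minimal sketch of both options for a single feature column, in plain Python. The function names and the sample values are made up for illustration; they are not part of OpenCV or any other library.

```python
from statistics import mean, pstdev

def standardize(values):
    """Shift a feature to mean 0 and scale it to standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - mu) / sigma for v in values]

def min_max_scale(values, lo=-1.0, hi=1.0):
    """Linearly map min(values) to lo and max(values) to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

feature = [2.0, 4.0, 6.0, 8.0]        # illustrative values for one feature
z_scores = standardize(feature)        # mean 0, standard deviation 1
scaled = min_max_scale(feature)        # smallest value -> -1, largest -> +1
```

Whichever variant you pick, you would train once with each and compare accuracy on held-out data, as suggested above.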
When talking about classification, the data consists of feature vectors, each with a number of features. In image processing there are also features, which are mapped to classification feature vectors. So your "feature length" is actually the number of features, i.e. the feature vector size.
1) The number of features matters. In principle, more features allow better classification but can also lead to overtraining. To avoid the latter, you can add more samples (more feature vectors).
2) Yes, as the prediction time depends on the number of support vectors and the size of the support vectors. But as prediction is very fast, this is not an issue unless you have real-time requirements.
3) While SVM, as a maximum-margin classifier, is quite robust against different feature value ranges, a feature with a bigger value range would have more weight than one with a smaller range. This especially applies to the penalty calculation if the classes are not completely separable.
4) As SVM is quite robust against different value ranges (compared to cluster-oriented algorithms), this is not the biggest issue. Typically the absolute min/max are scaled to -1/+1. If you know the expected range of your data, you could scale that range instead, so that measurement errors in your data would not influence the scaling. A fixed range is also preferable when adding training data in an iterative process.
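To illustrate point 2, here is a rough sketch of how a kernel SVM evaluates one test vector. It assumes an RBF kernel and uses made-up support vectors and coefficients; it is not how OpenCV implements prediction internally, only the textbook decision function written out as a loop.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance). Costs O(d)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Sum of coef_i * K(sv_i, x) plus bias.

    dual_coefs are assumed to already hold alpha_i * y_i.
    One kernel evaluation per support vector, so the cost grows with
    (number of support vectors) * (feature vector size).
    """
    total = bias
    for sv, coef in zip(support_vectors, dual_coefs):
        total += coef * rbf_kernel(sv, x, gamma)
    return total

# Hypothetical two-support-vector model in two dimensions
svs = [[0.0, 0.0], [1.0, 1.0]]
coefs = [-1.0, 1.0]
value = decision_value([1.0, 1.0], svs, coefs, bias=0.0, gamma=0.5)
predicted_class = 1 if value > 0 else -1
```

The training set size only matters indirectly: more training data usually means more support vectors, and that is what makes the loop longer.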
Related
I have a classification problem and my current feature vector does not seem to hold enough information.
My training set has 10k entries and I am using an SVM as the classifier (scikit-learn).
What is the maximum reasonable feature vector size (how many dimensions)?
(Training and evaluation are done on a laptop CPU.)
100? 1k? 10k? 100k? 1M?
The question is not how many features there should be for a certain number of cases (i.e. entries), but rather the opposite:
It’s not who has the best algorithm that wins. It’s who has the most data. (Banko and Brill, 2001)
In 2001, Banko and Brill compared 4 different algorithms while increasing the training set size into the millions, and came up with the above-quoted conclusion.
Moreover, Prof. Andrew Ng clearly covered this topic, and I’m quoting here:
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much
So as a rule of thumb, the number of data cases should be greater than the number of features in your dataset, bearing in mind that all features should be as informative as possible (i.e. the features are not highly collinear, or redundant).
I have read in more than one place, including somewhere in the scikit-learn documentation, that the number of inputs (i.e. samples) should be at least the square of the number of features (i.e. n_samples > n_features ** 2).
Nevertheless, for SVM in particular, the number of features n vs. the number of entries m is an important factor in choosing which kernel to try first. As a second rule of thumb for SVM in particular (also according to Prof. Andrew Ng):
If the number of features is much greater than the number of entries (e.g. n is up to 10K and m is up to 1K) --> use SVM without a kernel (i.e. "linear kernel") or use Logistic Regression.
If the number of features is small and the number of entries is intermediate (e.g. n is up to 1K and m is up to 10K) --> use SVM with a Gaussian kernel.
If the number of features is small and the number of entries is much larger (e.g. n is up to 1K and m > 50K) --> create/add more features, then use SVM without a kernel or use Logistic Regression.
I use logistic regression. It is a supervised method and needs computed feature values in both the training and test data. There are six features. Although the functions that produce these features' values are different and each has a maximum possible value of 1, four of the features (in both training and test data) have very low values: they range between 0 and 0.1 and never exceed 0.1, let alone reach 1. Thus these features' values are very close to each other. The other features are distributed normally (they range between 0 and 0.9). So the difference between these two kinds of features is large, and I think this causes trouble in the learning process for logistic regression. Am I right? Do these features need any transformation or normalization? Any help would be highly appreciated.
In short: you should normalize your features prior to training. Typically this means that each feature is either mapped to some range (like [0,1]) or whitened (mean 0 and std 1).
Why is it important? In order to make "small" features important, LR will need very high weights in those dimensions. However, you will probably use regularized LR (typically L2 regularized). In that case it will be very hard to assign high values to these weights, as the regularization penalty will push the model towards more evenly distributed weights instead. Thus, use normalization. However, if you fit LR without any regularization, then there is no point in scaling (up to numerical errors), as LR does not depend on the choice of scaling: the weights simply rescale and the predictions should be exactly the same.
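A small arithmetic sketch of that argument, with made-up numbers: if a feature lives in [0, 0.1] instead of [0, 1], its weight has to be 10 times larger to contribute the same amount to the decision, and the L2 penalty on that weight becomes 100 times larger. The `l2_penalty` helper and all values here are illustrative only.

```python
def l2_penalty(weights, lam=1.0):
    """L2 regularization term: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

scale = 0.1                    # the "small" feature is the same signal, shrunk 10x
w_full = 2.0                   # weight that works for the feature in [0, 1]
w_small = w_full / scale       # weight needed for the shrunk feature

x_full = 0.5                   # an example feature value in [0, 1]
x_small = x_full * scale       # the same observation on the small scale

contribution_full = w_full * x_full      # what the model adds for this feature
contribution_small = w_small * x_small   # identical contribution, larger weight

penalty_ratio = l2_penalty([w_small]) / l2_penalty([w_full])  # about 100
```

Without regularization the optimizer is free to pick `w_small`, which is why scaling does not change the unregularized fit.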
When using SVMlight or LIBSVM to classify phrases as positive or negative (sentiment analysis), is there a way to determine which words most influenced the algorithm's decision? For example, finding that the word "good" helped classify a phrase as positive, etc.
If you use the linear kernel then yes - simply compute the weights vector:
w = SUM_i y_i alpha_i sv_i
Where:
sv - support vector
alpha - coefficient found with SVMlight
y - corresponding class (+1 or -1)
(in some implementations alpha's are already multiplied by y_i and so they are positive/negative)
Once you have w, which is of dimensions 1 x d where d is your data dimension (number of words in the bag of words/tfidf representation) simply select the dimensions with high absolute value (no matter positive or negative) in order to find the most important features (words).
If you use some kernel (like RBF) then the answer is no: there is no direct method of extracting the most important features, as the classification process is performed in a completely different way.
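For the linear case, the formula w = SUM_i y_i alpha_i sv_i can be sketched in plain Python as follows. The vocabulary, support vectors and alphas are toy values invented for the example; reading them out of an actual SVMlight or LIBSVM model file is not shown.

```python
def linear_svm_weights(support_vectors, alphas, labels):
    """Compute w = sum_i y_i * alpha_i * sv_i for a linear-kernel SVM."""
    d = len(support_vectors[0])
    w = [0.0] * d
    for sv, alpha, y in zip(support_vectors, alphas, labels):
        for j in range(d):
            w[j] += y * alpha * sv[j]
    return w

def top_features(w, vocabulary, k=3):
    """Return the k (word, weight) pairs with the largest absolute weight."""
    ranked = sorted(zip(vocabulary, w), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]

# Toy bag-of-words model: one positive and one negative support vector
vocab = ["good", "bad", "movie"]
svs = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
alphas = [0.5, 0.5]
labels = [+1, -1]

w = linear_svm_weights(svs, alphas, labels)   # [0.5, -0.5, 0.0]
most_influential = top_features(w, vocab, k=2)
```

The sign of each weight tells you which class the word pushes towards; "movie" appears in both support vectors and cancels out. If your implementation stores alphas already multiplied by y_i, pass labels of all +1.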
As @lejlot mentioned, with a linear kernel in SVM, one of the feature ranking strategies is based on the absolute values of the weights in the model. Another simple and effective strategy is based on the F-score. It considers each feature separately and therefore cannot reveal mutual information between features. You can also determine how important a feature is by removing that feature and observing the classification performance.
You can see this article for more details on feature ranking.
With other kernels in SVM, feature ranking is not that straightforward, yet it is still feasible. You can construct an orthogonal set of basis vectors in the kernel space and calculate the weights by kernel relief. Then the implicit feature ranking can be done based on the absolute values of the weights. Finally, the data is projected into the learned subspace.
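As a rough sketch of the F-score strategy mentioned above: the answer does not spell out the formula, so this assumes the definition commonly used for SVM feature selection (squared distance of each class mean from the overall mean, divided by the sum of the per-class sample variances). The function name and data are illustrative.

```python
from statistics import mean, variance

def f_score(pos_values, neg_values):
    """F-score of one feature given its values in the positive and negative class.

    Assumed definition: ((mean_pos - mean_all)^2 + (mean_neg - mean_all)^2)
    divided by (sample variance of pos + sample variance of neg).
    Each class needs at least two values for the sample variance.
    """
    overall = mean(pos_values + neg_values)
    numerator = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    denominator = variance(pos_values) + variance(neg_values)
    if denominator == 0:
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator

# A feature that separates the classes well, and one that does not
separating = f_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
useless = f_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Computing this per feature and sorting gives a ranking that ignores interactions between features, which is the limitation noted above.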
If I have 200 features, and if each feature can have a value ranging from 0 to infinity, should I scale the feature values to be in the range [0-1] before I go ahead and train a LibSVM on top of it?
Now, suppose I did scale the values, and after training the model if I get one vector with its values or the features as input, how do I scale these values of the input test vector before classifying it?
Thanks
Abhishek S
You should store the ranges of your feature values used for training. Then, when you extract a feature value from an unknown instance, use that feature's stored range for scaling.
Use the formula (here for the range [-1.0 , 1.0]):
double scaled_val = -1.0 + (1.0 - -1.0) * (extracted_val - vmin)/(vmax-vmin);
The Guide provided at libsvm website explains the scaling well:
"2.2 Scaling
Scaling before applying SVM is very important. Part 2 of Sarle's Neural Networks FAQ Sarle (1997) explains the importance of this and most of considerations also apply to SVM. The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [-1; +1] or [0; 1].
Of course we have to use the same method to scale both training and testing data."
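A minimal sketch of "store the ranges, then reuse them", applying the same formula as above to whole rows. `fit_ranges` and `scale_row` are hypothetical helper names, not LIBSVM functions (LIBSVM ships its own svm-scale tool for this), and the data is made up.

```python
def fit_ranges(train_rows):
    """Record (vmin, vmax) of every feature, using the training data only."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]

def scale_row(row, ranges, lo=-1.0, hi=1.0):
    """Scale one feature vector with the ranges stored from training."""
    scaled = []
    for value, (vmin, vmax) in zip(row, ranges):
        if vmax == vmin:
            scaled.append((lo + hi) / 2.0)  # constant feature: use the midpoint
        else:
            scaled.append(lo + (hi - lo) * (value - vmin) / (vmax - vmin))
    return scaled

train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
ranges = fit_ranges(train)                       # computed once, saved with the model
train_scaled = [scale_row(r, ranges) for r in train]

test_vector = [20.0, 10.0]                       # unseen instance
test_scaled = scale_row(test_vector, ranges)     # same ranges, not recomputed
```

Note that a test value outside the training range ends up outside [-1, 1] (the first feature here becomes 3.0). That is expected; the ranges must not be refitted on test data.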
If you've got infinite feature values, you're not going to be able to use LIBSVM anyway.
More practically, scaling is generally useful so the kernel doesn't have to deal with large numbers, so I would say go for it and scale. It's not a requirement, though.
And as Anony-Mousse implied in the comments, please try running experiments with and without scaling so you can see the difference.
Now, suppose I did scale the values, and after training the model if I get one vector with its values or the features as input, how do I scale these values of the input test vector before classifying it?
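You don't need to scale again if the vector already went through the pre-training step (i.e. data processing) together with the training data. A new, unscaled input vector has to be scaled with the same ranges that were used for training.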
I was recently playing around with the well known movie review dataset used in binary sentiment analysis. It consists of 1,000 positive and 1,000 negative reviews. While exploring various feature-encodings with unigram features, I noticed that all previous research publications normalize the vectors by their Euclidean norm in order to scale them to unit-length.
In my experiments using Liblinear, however, I found that such length-normalization decreases the classification accuracy significantly. I studied the vectors, and I think this is the reason: the dimension of the vector space is, say, 10,000. As a result, the Euclidean norm of the vectors is very high compared to the individual projections. Therefore, after normalization, all the vectors get very small numbers on each axis (i.e., the projection on an axis).
This surprised me, because all publications in this field claim that they perform cosine normalization, whereas I found that NOT normalizing yields better classification.
Thus my question: is there any specific disadvantage if we don't perform cosine normalization for SVM feature vectors? (Basically, I am seeking a mathematical explanation for this need for normalization).
After perusing the manual of LibSVM, I realize why the normalization was yielding much lower accuracy when compared to not normalizing. They recommend scaling the data to a [0,1] or [-1,1] interval. This is something I had not done. Scaling up will resolve the issue of having too many data points very close to zero, while retaining the advantages of length-normalization.
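To see why length normalization produces such small numbers, here is a toy sketch. It assumes a dense 10,000-dimensional vector of ones, which is an exaggeration (real bag-of-words vectors are sparse), and the helper name is made up.

```python
import math

def l2_normalize(vec):
    """Divide a vector by its Euclidean norm so that it has unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        return list(vec)  # leave an all-zero vector unchanged
    return [v / norm for v in vec]

dense = [1.0] * 10_000            # assumed worst case: every dimension active
unit = l2_normalize(dense)

component = unit[0]               # 1 / sqrt(10000) = 0.01
length = math.sqrt(sum(v * v for v in unit))
```

Every coordinate shrinks by a factor of 100 while the vector length becomes 1, so rescaling each feature to [0,1] or [-1,1] afterwards spreads the values out again.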