If I have 200 features, and if each feature can have a value ranging from 0 to infinity, should I scale the feature values to be in the range [0-1] before I go ahead and train a LibSVM on top of it?
Now, suppose I did scale the values and trained the model. If I then get one vector with its feature values as input, how do I scale the values of this input test vector before classifying it?
Thanks
Abhishek S
You should store the ranges of the feature values used for training. Then, when you extract a feature value from an unknown instance, use that stored range for scaling.
Use this formula (here for the range [-1.0, 1.0]):
double scaled_val = -1.0 + (1.0 - -1.0) * (extracted_val - vmin)/(vmax-vmin);
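A minimal sketch of that idea in C++ (the data layout and function names here are just for illustration): compute and store the per-feature min/max on the training data, then reuse those stored ranges to scale every later test vector.

#include <algorithm>
#include <cstddef>
#include <vector>

// Per-feature minimum and maximum observed on the TRAINING data.
struct FeatureRange { double vmin; double vmax; };

// Compute the range of every feature from the training vectors.
std::vector<FeatureRange> compute_ranges(const std::vector<std::vector<double>>& train) {
    std::vector<FeatureRange> ranges(train[0].size());
    for (std::size_t j = 0; j < ranges.size(); ++j) {
        ranges[j].vmin = ranges[j].vmax = train[0][j];
        for (const auto& row : train) {
            ranges[j].vmin = std::min(ranges[j].vmin, row[j]);
            ranges[j].vmax = std::max(ranges[j].vmax, row[j]);
        }
    }
    return ranges;
}

// Scale one vector (training or test) into [-1, 1] using the stored training ranges.
std::vector<double> scale_vector(const std::vector<double>& x,
                                 const std::vector<FeatureRange>& ranges) {
    std::vector<double> scaled(x.size());
    for (std::size_t j = 0; j < x.size(); ++j) {
        double span = ranges[j].vmax - ranges[j].vmin;
        scaled[j] = (span == 0.0) ? 0.0
                                  : -1.0 + 2.0 * (x[j] - ranges[j].vmin) / span;
    }
    return scaled;
}

A test value that falls outside the training range will simply end up slightly outside [-1, 1]; that is normal and usually harmless.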
The Guide provided at libsvm website explains the scaling well:
"2.2 Scaling
Scaling before applying SVM is very important. Part 2 of Sarle's Neural Networks
FAQ Sarle (1997) explains the importance of this and most of considerations also apply
to SVM. The main advantage of scaling is to avoid attributes in greater numeric
ranges dominating those in smaller numeric ranges. Another advantage is to avoid
numerical difficulties during the calculation. Because kernel values usually depend on
the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel,
large attribute values might cause numerical problems. We recommend linearly
scaling each attribute to the range [-1, +1] or [0, 1].
Of course we have to use the same method to scale both training and testing
data."
If you've got infinite feature values, you're not going to be able to use LIBSVM anyway.
More practically, scaling is generally useful so the kernel doesn't have to deal with large numbers, so I would say go for it and scale. It's not a requirement, though.
And as Anony-Mousse implied in the comments, please try running experiments with and without scaling so you can see the difference.
Now, suppose I did scale the values and trained the model. If I then get one vector with its feature values as input, how do I scale the values of this input test vector before classifying it?
You don't need to work out a new scaling for the test vector. You already determined the scaling in the pre-training step (i.e. data preprocessing); just apply those same scaling parameters to the test vector.
Related
How do I know whether feature scaling is required in linear regression, multilinear regression, and polynomial regression? In some places I read that feature scaling is not required because the coefficients take care of it, and in other places I read that it is required, so what's the actual answer?
Both statements are correct but incomplete.
If you are using a simple linear model such as y = w1 * x1 + w2 * x2, then feature scaling is not required, as the coefficients w1 and w2 will be learned or adapted accordingly.
But if you modify the above expression with a regularization term, or define constraints over the variables, then without feature scaling the coefficients will be biased towards the features with larger magnitudes.
In conclusion: feature scaling becomes important as soon as we modify the simple linear-model objective. It is also good practice to normalize the features before applying any algorithm.
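A short sketch of why, in the same w1/w2 notation (assuming an L2 penalty; the same argument applies to L1): suppose the second feature is re-expressed in different units, x2' = c * x2. Without regularization the fit is unchanged, because the model simply learns w2' = w2 / c. With a ridge term the objective becomes

sum_i (y_i - w1 * x1_i - w2' * x2'_i)^2 + lambda * (w1^2 + w2'^2)

and substituting w2' = w2 / c turns the penalty into lambda * (w1^2 + w2^2 / c^2), so how strongly the second feature is penalized depends on the arbitrary units or magnitude it happens to be measured in. Scaling the features first removes that dependence.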
Suppose we have two features, weight and price. The "Weight" values cannot be meaningfully compared with the "Price" values. Without scaling, the algorithm in effect assumes that because the "Weight" values are larger than the "Price" values, "Weight" is more important than "Price."
Feature scaling is required when the data columns have large variation in their ranges. Getting the min, max, and mean of the data in each column is a great way to check this.
Plotting the data is a good next step; it makes the ranges of the different dimensions easy to see.
I am performing logistic regression and have a doubt.
I have categorical (0, 1) as well as continuous variables in my data set.
Now, do I need to scale my continuous variables between 0 and 1?
Because a few of my continuous variables have values up to 10k.
Does it make sense to keep such continuous values along with categorical variables while performing the logistic regression?
Theoretically it is not necessary. But your resulting system will probably have very small coefficients for the inputs with a large range. This can be a problem if you want to use numbers with reduced precision (for example 16 bit) for your model.
I am not sure why you are asking whether you should use the continuous values in your model. If there is any possibility that they are correlated with the result, keep them. Ignore them only if you are sure they are uncorrelated.
For simple linear/logistic regression (without regularization): no need to scale variables.
For linear/logistic regression with regularization: you need to perform scaling.
For linear/logistic regression without regularization, you need to scale features only if you'd like to interpret or compare the weights after fitting; otherwise features with larger values will tend to get smaller weights than the others, which makes the weights misleading to compare.
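A minimal sketch of that preprocessing for mixed data in C++ (the column layout and names are assumptions for illustration): standardize only the continuous columns, using statistics computed on the training rows, and leave the 0/1 dummy columns untouched.

#include <cmath>
#include <cstddef>
#include <vector>

// Mean and standard deviation of one column, computed on the TRAINING rows only.
struct ColumnStats { double mean; double stddev; };

ColumnStats column_stats(const std::vector<std::vector<double>>& train, std::size_t col) {
    double sum = 0.0, sum_sq = 0.0;
    for (const auto& row : train) { sum += row[col]; sum_sq += row[col] * row[col]; }
    double mean = sum / train.size();
    double var  = sum_sq / train.size() - mean * mean;
    return { mean, std::sqrt(var > 0.0 ? var : 0.0) };
}

// Standardize the listed continuous columns in place; the 0/1 dummy columns are not
// listed and therefore stay exactly as they are.
void standardize(std::vector<std::vector<double>>& data,
                 const std::vector<std::size_t>& continuous_cols,
                 const std::vector<ColumnStats>& stats) {
    for (auto& row : data)
        for (std::size_t k = 0; k < continuous_cols.size(); ++k)
            if (stats[k].stddev > 0.0)
                row[continuous_cols[k]] = (row[continuous_cols[k]] - stats[k].mean) / stats[k].stddev;
}

Compute column_stats on the training rows only, then call standardize on both the training and the test rows with those same statistics, so a column with values up to 10k ends up on roughly the same footing as the 0/1 dummies.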
You can scale by location and by variance, and there are many options. My advice is to consider scaling if your variables vary a lot, both across variables and within each variable. You can try the following (everything below is a vector or matrix; X = (x_1, ..., x_n) denotes the values of one variable):
Scaling by range: X_scaled = X / R, where R is the range of the variable, basically R = max(X) - min(X).
Scaling by location (centering) and variance (scaling): X_scaled = (X - xbar) / s, where xbar and s are the sample mean and sample standard deviation of X, respectively.
The latter provides centering as well, so make sure you select the proper formula for your data. There is no rule of thumb here; intuition and inference are the key. You can also try different combinations of scale and location measures.
Currently I am doing English alphabet classification using an SVM classifier in OpenCV.
I have the following doubts about this:
How does the length of the feature vector affect the classification?
(What will happen if the feature length increases? My current feature length is 125.)
Does the time taken for prediction depend on the amount of data used for training?
Why do we need normalization of the feature vector (will this improve the prediction accuracy and the time required to predict the class)?
How do I determine the best method for normalizing the feature vector?
1) The length of the feature vector does not matter per se; what matters is the predictive quality of the features.
2) No, it does not depend on the number of samples, but it does depend on the number of features (prediction is generally very fast).
3) Normalization is required if the features are in very different ranges of values.
4) There are basically standardization (mean, stdev) and scaling (xmax -> +1, xmin -> -1 or 0); you could try both and see which one is better.
When talking about classification, the data consists of feature vectors with a number of features. In image processing there are also features which are mapped to classification feature vectors, so your "feature length" is actually the number of features, or feature vector size.
1) The number of features matters. In principle more features allow better classification, but they also lead to overtraining. To avoid the latter you can add more samples (more feature vectors).
2) Yes, as the prediction time depends on the number of support vectors and the size of the support vectors. But as prediction is very fast, this is not an issue unless you have real-time requirements.
3) While the SVM, as a maximum-margin classifier, is quite robust against different feature value ranges, a feature with a bigger value range would have more weight than one with a smaller range. This especially applies to the penalty calculation if the classes are not completely separable.
4) As the SVM is quite robust against different value ranges (compared to cluster-oriented algorithms), this is not the biggest issue. Typically the absolute min/max are scaled to -1/+1. If you know the expected range of your data, you could scale to that fixed range instead, so that measurement errors in your data do not influence the scaling (see the sketch below). A fixed range is also preferable when adding training data in an iterative process.
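A minimal sketch of point 4 in C++, scaling against a fixed expected range instead of the observed min/max (the range bounds are placeholders you would pick from domain knowledge):

#include <algorithm>

// Scale a raw value into [-1, 1] using a fixed expected range [lo, hi].
// Because the range is fixed, outliers or later training data do not change the scaling.
double scale_fixed(double raw, double lo, double hi) {
    double clipped = std::min(std::max(raw, lo), hi);  // clamp measurement errors to the expected range
    return -1.0 + 2.0 * (clipped - lo) / (hi - lo);
}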
When using SVMlight or LIBSVM in order to classify phrases as positive or negative (Sentiment Analysis), is there a way to determine which are the most influential words that affected the algorithm's decision? For example, finding that the word "good" helped determine a phrase as positive, etc.
If you use the linear kernel then yes: simply compute the weight vector:
w = SUM_i y_i alpha_i sv_i
Where:
sv - support vector
alpha - coefficient found with SVMlight
y - corresponding class (+1 or -1)
(in some implementations alpha's are already multiplied by y_i and so they are positive/negative)
Once you have w, which is of dimension 1 x d, where d is your data dimension (the number of words in the bag-of-words/tf-idf representation), simply select the dimensions with high absolute value (no matter whether positive or negative) in order to find the most important features (words).
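A minimal sketch of that computation in C++ (dense arrays and hypothetical names; with SVMlight or LIBSVM you would read the alpha_i * y_i coefficients and the support vectors out of the trained model):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// w = SUM_i (alpha_i * y_i) * sv_i for a linear kernel.
// sv[i] is the i-th support vector, coef[i] the corresponding alpha_i * y_i.
std::vector<double> weight_vector(const std::vector<std::vector<double>>& sv,
                                  const std::vector<double>& coef) {
    std::vector<double> w(sv[0].size(), 0.0);
    for (std::size_t i = 0; i < sv.size(); ++i)
        for (std::size_t j = 0; j < w.size(); ++j)
            w[j] += coef[i] * sv[i][j];
    return w;
}

// Indices of the dimensions (words), sorted by |w_j|, most influential first.
std::vector<std::size_t> rank_features(const std::vector<double>& w) {
    std::vector<std::size_t> order(w.size());
    for (std::size_t j = 0; j < order.size(); ++j) order[j] = j;
    std::sort(order.begin(), order.end(),
              [&w](std::size_t a, std::size_t b) { return std::fabs(w[a]) > std::fabs(w[b]); });
    return order;
}

Mapping the top indices back through your bag-of-words vocabulary gives the most influential words; the sign of w_j tells you whether a word pushes towards the positive or the negative class.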
If you use a non-linear kernel (like RBF) then the answer is no; there is no direct method of extracting the most important features, as the classification process is performed in a completely different way.
As @lejlot mentioned, with a linear kernel in SVM, one of the feature ranking strategies is based on the absolute values of the weights in the model. Another simple and effective strategy is based on the F-score. It considers each feature separately and therefore cannot reveal mutual information between features. You can also determine how important a feature is by removing that feature and observing the classification performance.
You can see this article for more details on feature ranking.
With other kernels in SVM, the feature ranking is not that straightforward, yet still feasible. You can construct an orthogonal set of basis vectors in the kernel space and calculate the weights by kernel relief. The implicit feature ranking can then be done based on the absolute values of the weights. Finally, the data is projected into the learned subspace.
So I read a paper that said that processing your dataset correctly can increase LibSVM classification accuracy dramatically...I'm using the Weka implementation and would like some help making sure my dataset is optimal.
Here are my (example) attributes:
Power Numeric (real numbers, range is from 0 to 1.5132, 9000+ unique values)
Voltage Numeric (similar to Power)
Light Numeric (0 and 1 are the only 2 possible values)
Day Numeric (1 through 20 are the possible values, equal number of each value)
Range Nominal {1,2,3,4,5} <----these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
Should I normalize and/or standardize the Power and Voltage data values?
Should I use a Discretization filter on anything?
Should I be binning the Power/Voltage values into a lot smaller number of bins?
Should I make the Light value Binary instead of numeric?
Should I normalize the Day values? Does it even make sense to do that?
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Please advise on these questions and anything else you think I might have missed...
Thanks in advance!!
Normalization is very important, as it influences the concept of distance which is used by SVM. The two main approaches to normalization are:
Scale each input dimension to the same interval, for example [0, 1]. This is the most common approach by far. It is necessary to prevent some input dimensions from completely dominating others. This is what the LIBSVM authors recommend in their beginner's guide (see Appendix B for examples).
Scale each instance to a given length. This is common in text mining / computer vision.
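A minimal sketch of the second approach in C++, rescaling each instance (row) to unit Euclidean length:

#include <cmath>
#include <vector>

// Rescale one feature vector so that its Euclidean (L2) norm is 1.
void normalize_instance(std::vector<double>& x) {
    double norm = 0.0;
    for (double v : x) norm += v * v;
    norm = std::sqrt(norm);
    if (norm > 0.0)
        for (double& v : x) v /= norm;
}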
As to handling types of inputs:
Continuous: no work needed, SVM works on these implicitly.
Ordinal: treat as continuous variables. For example cold, lukewarm, hot could be modeled as 1, 2, 3 without implicitly defining an unnatural structure.
Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions (see the sketch below). This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2 and 3 implies that a dog and a bird are more similar than a cat and a bird, which is nonsense.
Normalization must be done after substituting inputs where necessary.
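A minimal sketch of the one-hot substitution in C++ (how the nominal level is stored as an index is an assumption for illustration):

#include <vector>

// Replace a nominal value, given as a level index in [0, n_levels),
// with n_levels binary dimensions of which exactly one is 1.
std::vector<double> one_hot(int level, int n_levels) {
    std::vector<double> dims(n_levels, 0.0);
    dims[level] = 1.0;
    return dims;
}

The resulting dummy columns are appended to the other inputs, and only then is the whole vector scaled to a common interval.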
To answer your questions:
Should I normalize and/or standardize the Power and Voltage data
values?
Yes, scale all (final) input dimensions to the same interval (including the dummies!).
Should I use a Discretization filter on anything?
No.
Should I be binning the Power/Voltage values into a lot smaller number of
bins?
No. Treat them as continuous variables (e.g. one input each).
Should I make the Light value Binary instead of numeric?
No, SVM has no concept of binary variables and treats everything as numeric. So converting it will just lead to an extra type-cast internally.
Should I normalize the Day values? Does it even make sense to do
that?
If you want to use 1 input dimension, you must normalize it just like all others.
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Nominal to binary, using one-hot encoding.