Scaling of continuous variables in logistic regression

I am performing logistic regression and have a question.
I have categorical (0/1) as well as continuous variables in my data set.
Do I need to scale my continuous variables to between 0 and 1? A few of my continuous variables have values up to 10k.
Does it make sense to keep such continuous values alongside the categorical variables while performing the logistic regression?

Theoretically it is not necessary, but your resulting model will probably have very small coefficients for the inputs with a large range. This can be a problem if you want to use numbers with reduced precision (for example 16 bit) for your model.
I am not sure why you are asking whether you should use the continuous values in your model at all. If there is any possibility that they are correlated with the outcome, keep them. Only if you are sure they are uncorrelated should you drop them.

For simple linear/logistic regression (without regularization): no need to scale variables.
For linear/logistic regression with regularization: you need to perform scaling.
For linear/logistic regression without regularization, you only need to scale features if you want to interpret or compare the weights after fitting; otherwise, features with larger values will tend to end up with smaller weights than the others (see the sketch below).
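A small sketch of the difference, assuming scikit-learn and a synthetic dataset (the 10k-scale feature is contrived for illustration): with the default L2 penalty, the regularizer acts on the raw coefficient sizes, so a feature measured in the thousands gets a tiny, hard-to-compare weight unless you standardize first.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X[:, 0] *= 10_000  # pretend one continuous feature ranges up to ~10k

unscaled = LogisticRegression(max_iter=5000).fit(X, y)           # L2 penalty by default
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

print(unscaled.coef_)                                  # tiny weight on the large-scale feature
print(scaled.named_steps["logisticregression"].coef_)  # weights on a comparable scale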

You can scale by variance and by location; there are many options. My advice is to consider scaling if your variables vary a lot both between and within. You can try the following.
Everything below refers to a vector, so by X I mean X = (x_1, ..., x_n); thus everything I write is either a vector or a matrix.
Scaling by range:
X_scaled = X / R
where R is the range of the variable, basically max(X) - min(X).
Scaling by location (centering) and variance (scaling):
X_scaled = (X - xbar) / s
where xbar and s are the sample mean and sample standard deviation of X, respectively.
The latter provides centering as well, so make sure that you select the proper formula for your data. There is no rule of thumb here; intuition and inference are key. You can also try different combinations of scale and location measures.
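A quick numpy sketch of the two options above (the sample values are made up):

import numpy as np

X = np.array([120.0, 4500.0, 980.0, 10000.0, 35.0])

R = X.max() - X.min()            # range of the variable
X_by_range = X / R               # scaling by range

xbar = X.mean()                  # sample mean
s = X.std(ddof=1)                # sample standard deviation
X_standardized = (X - xbar) / s  # centering and scaling

print(X_by_range)
print(X_standardized)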

Related

Weighted features in machine learning

I am a beginner in machine learning. So any help or suggestion would be of great help.
I have read that manually putting weights on features before predicting is a bad idea, but what if a few features need to be weighted?
In a classification problem, suppose it is common knowledge that age is the most influential feature. How do I give weight to this feature? I was thinking of normalizing it to a variance of 1.5 or 2 (with the other features at variance 1), so that this feature carries more weight. Is this fundamentally wrong? If so, is there another method?
Does it affect classification and regression problems differently?
If we are talking specifically about random forests (as you tagged), then you can use the Weighted Subspace Random Forest algorithm (the wsrf package in R). The algorithm determines a weight for each variable and then uses these weights during model building.
The informativeness of a variable with respect to the class is measured by an information gain ratio. The measure is used as the probability of that variable being selected for inclusion in the variable subspace when splitting a specific node during the tree building process. Therefore, variables with higher values by the measure are more likely to be chosen as candidates during variable selection and a stronger tree can be built.
Generally, if a feature is more important than the others and the model is dense enough, then with enough training samples your model will give it more importance automatically: backpropagation computes a partial derivative for each connection, so the optimizer learns on its own which features deserve larger weights. If, instead of normalizing such a feature, you scale it up to a larger range, you may end up overstating its importance.
In practice a neural network works best if the inputs are centered and white, meaning that their covariance is diagonal and their mean is the zero vector. This improves optimization of the neural net, since the hidden activation functions do not saturate as quickly and thus do not give you near-zero gradients early on in learning.
If you scale just one feature up, it may or may not have the desired effect, but it raises the chance of saturated activations and near-zero gradients, so it is best avoided.
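A small numpy sketch of what "centered and white" means in practice, on made-up correlated data: subtract the mean, then rotate and rescale with the eigenvectors/eigenvalues of the sample covariance so the result has zero mean and (approximately) identity covariance.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 5.0]])
X = rng.normal(size=(500, 3)) @ A          # correlated inputs on unequal scales

X_centered = X - X.mean(axis=0)            # mean becomes the zero vector
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition of the covariance
X_white = X_centered @ eigvecs / np.sqrt(eigvals)  # decorrelate, unit variance per direction

print(np.round(np.cov(X_white, rowvar=False), 3))  # approximately the identity matrix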

How to select features for clustering?

I have time-series data, which I have aggregated into 3 weeks and transposed into features.
Now I have features: A_week1, B_week1, C_week1, A_week2, B_week2, C_week2, and so on.
Some of the features are discrete, some continuous.
I am thinking of applying K-Means or DBSCAN.
How should I approach the feature selection in such situation?
Should I normalise the features? Should I introduce some new ones, that would somehow link periods together?
Since K-means and DBSCAN are unsupervised learning algorithms, feature selection for them usually comes down to a grid search: evaluate candidate feature sets and parameters with internal measures such as the Davies–Bouldin index or the silhouette coefficient, among others. If you are using Python, you can run an exhaustive grid search with the scikit-learn library.
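As a rough sketch of that idea, assuming scikit-learn and a toy blob dataset: loop over candidate values of k for K-means and keep the one with the best silhouette coefficient (the same loop could be extended to candidate feature subsets).

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)      # put features on a comparable scale first

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # internal measure, no true labels needed

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)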
Formalize your problem, don't just hack some code.
K-means minimizes the sum of squares. If the features have different scales, they get different influence on the optimization. Therefore, you need to carefully choose the weights (scaling factors) of each variable to balance their importance the way you want (and note that a 2x scaling factor does not make the variable twice as important).
For DBSCAN, the distance is only a binary decision: close enough, or not. If you use the GDBSCAN version, this is easier to understand than with distances. But with mixed variables, I would suggest using the maximum norm. Two objects are then close if they differ in each variable by at most "eps". You can set eps=1 and scale your variables such that 1 is a "too big" difference. For example, for discrete variables you may want to tolerate one or two discrete steps, but not three (see the sketch after the formula below).
Logically, it's easy to see that the maximum-distance threshold decomposes into a conjunction of one-variable clauses:
maxdistance(x,y) <= eps
<=>
forall_i |x_i-y_i| <= eps
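A small sketch of that setup, assuming scikit-learn's DBSCAN with the Chebyshev (maximum-norm) metric; the data and the per-variable tolerances (500 units of the continuous variable, 2 discrete steps) are made-up choices for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 2000, size=200),   # continuous variable on a large scale
    rng.integers(0, 5, size=200),    # discrete variable (steps of 1)
])

X_scaled = X / np.array([500.0, 2.0])  # 500 units, resp. 2 discrete steps, count as "too big"

labels = DBSCAN(eps=1.0, metric="chebyshev", min_samples=5).fit_predict(X_scaled)
print(np.unique(labels))               # cluster ids, with -1 marking noise points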

How do I use principal component analysis in supervised machine learning classification problems?

I have been working through the concepts of principal component analysis in R.
I am comfortable with applying PCA to a (say, labeled) dataset and ultimately extracting out the most interesting first few principal components as numeric variables from my matrix.
The ultimate question is, in a sense, now what? Most of the reading I've come across on PCA immediately halts after the computations are done, especially with regards to machine learning. Pardon my hyperbole, but I feel as if everyone agrees that the technique is useful, but nobody wants to actually use it after they do it.
More specifically, here's my real question:
I understand that principal components are linear combinations of the variables you started with. So, how does this transformed data play a role in supervised machine learning? How could someone ever use PCA to reduce the dimensionality of a dataset and THEN use these components with a supervised learner, say, an SVM?
I'm absolutely confused about what happens to our labels. Once we are in eigenspace, great. But I don't see any way to continue to move forward with machine learning if this transformation blows apart our concept of classification (unless there's some linear combination of "Yes" or "No" I haven't come across!)
Please step in and set me straight if you have the time and wherewithal. Thanks in advance.
Old question, but I don't think it's been satisfactorily answered (and I just landed here myself through Google). I found myself in your same shoes and had to hunt down the answer myself.
The goal of PCA is to represent your data X in an orthonormal basis W; the coordinates of your data in this new basis are Z, as expressed below:
Z = X W
Because of orthonormality, we can invert W simply by transposing it and write:
X = Z W^T
Now to reduce dimensionality, let's pick some number of components k < p. Assuming our basis vectors in W are ordered from largest to smallest (i.e., the eigenvector corresponding to the largest eigenvalue comes first, etc.), this amounts to simply keeping the first k columns of W, call them W_k:
Z = X W_k
Now we have a k-dimensional representation of our training data X. Next you run some supervised classifier using the new features in Z.
The key is to realize that W_k is in some sense a canonical transformation from our space of p features down to a space of k features (or at least the best transformation we could find using our training data). Thus, we can hit our test data with the same W_k transformation, resulting in a k-dimensional set of test features:
Z_test = X_test W_k
We can now use the same classifier trained on the k-dimensional representation of our training data to make predictions on the k-dimensional representation of our test data:
y_pred = classifier(Z_test)
The point of going through this whole procedure is that you may have thousands of features, but (1) not all of them carry a meaningful signal and (2) your supervised learning method may be far too complex to train on the full feature set (either it would take too long or your computer wouldn't have enough memory to process the calculations). PCA allows you to dramatically reduce the number of features it takes to represent your data without eliminating the aspects of your data that truly add value.
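A minimal sketch of this workflow, assuming scikit-learn and its bundled digits dataset purely for illustration: PCA learns W from the training features only, the same projection is applied to the test features, and the labels are never transformed.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), PCA(n_components=20), SVC())
model.fit(X_train, y_train)          # W is learned from the training data only
print(model.score(X_test, y_test))   # the same W projects the test data before the SVM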
After you have used PCA on a portion of your data to compute the transformation matrix, you apply that matrix to each of your data points before submitting them to your classifier.
This is useful when the intrinsic dimensionality of your data is much smaller than the number of original features, and the gain in performance you get during classification is worth the loss in accuracy and the cost of running PCA. Also, keep in mind the limitations of PCA:
In performing a linear transformation, you implicitly assume that all components are expressed in equivalent units.
Beyond variance, PCA is blind to the structure of your data. It may very well happen that the data splits along low-variance dimensions; in that case, the classifier won't be able to learn from the transformed data.

Most appropriate normalization / transformation method for skewed features?

I am trying to pre-process biological data to train a neural network and despite an extensive search and repetitive presentation of the various normalization methods I am none the wiser as to which method should be used when. In particular I have a number of input variables which are positively skewed and have been trying to establish whether there is a normalisation method that is most appropriate.
I was also worried about whether the nature of these inputs would affect the performance of the network, and as such have experimented with data transformations (the log transformation in particular). However, some inputs have many zeros as well as small decimal values, and they seem to be strongly affected by a log(x + 1) transform (or with any added constant from 1 down to 0.0000001, for that matter), with the resulting distribution failing to approach normal: it either remains skewed or becomes bimodal with a sharp peak at the minimum value.
Is any of this relevant to neural networks? I.e., should I be using specific feature transformation / normalization methods to account for the skewed data, or should I just ignore it, pick a normalization method, and push ahead?
Any advice on the matter would be greatly appreciated!
Thanks!
As the features in your input vector are of a different nature, you should use a different normalization method for each feature; the network performs better when every input is fed uniformly scaled data.
Since you wrote that some of the data is skewed, I suppose you can run some algorithm to "normalize" it. If applying a logarithm does not work, other functions and methods, such as rank transforms, can be tried (see the sketch after this answer).
If the small decimal values occur only in a specific feature, then normalize that feature separately so that it is transformed into your working range: either [0, 1] or [-1, +1], I suppose.
If some inputs have many zeros, consider removing them from the main neural network and creating an additional neural network which operates on the vectors with non-zero features. Alternatively, you may try running Principal Component Analysis (for example, via an autoassociative memory network with structure N-M-N, M < N) to reduce the input dimension and thereby eliminate the zeroed components (they will still be taken into account, folded into the new combined inputs). As a bonus, the new M inputs will be automatically normalized. Then you can pass the new vectors to your actual worker neural network.
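A brief sketch of per-feature treatment for a skewed, zero-heavy column, assuming numpy and scikit-learn; log(x + 1) is one option, and a rank/quantile transform is an alternative when the log leaves the shape bimodal:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=2.0, size=(1000, 1))
skewed[rng.random((1000, 1)) < 0.3] = 0.0                  # many exact zeros

logged = np.log1p(skewed)                                  # log(x + 1) transform
ranked = QuantileTransformer(n_quantiles=200,
                             output_distribution="normal").fit_transform(skewed)

for name, col in [("raw", skewed), ("log1p", logged), ("quantile", ranked)]:
    print(name, round(float(col.mean()), 3), round(float(col.std()), 3))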
This is an interesting question. Normalization is meant to keep features' values in one scale to facilitate the optimization process.
I would suggest the following:
1- Check whether you need to normalize your data at all. If, for example, the means of the variables or features are already on the same scale of values, you may proceed without normalization; MSVMpack uses a normalization check of this kind in its SVM implementation. Even if you do need to normalize, you are still advised to also run the models on the unnormalized data for comparison.
2- If you know the actual maximum and minimum values of a feature, use them to normalize that feature. I think this kind of normalization preserves the skewness of the values.
3- Try decimal scaling normalization on other features where applicable.
Finally, you are still advised to apply different normalization techniques and compare the MSE for every technique, including the z-score, though it may not handle the skewness of your data well.
I hope I have answered your question and given some support.

Is scaling of feature values in LibSVM necessary?

If I have 200 features, and if each feature can have a value ranging from 0 to infinity, should I scale the feature values to be in the range [0-1] before I go ahead and train a LibSVM on top of it?
Now, suppose I did scale the values, and after training the model if I get one vector with its values or the features as input, how do I scale these values of the input test vector before classifying it?
Thanks
Abhishek S
You should store the ranges of the feature values used for training. Then, when you extract a feature value from an unknown instance, use that particular range for scaling.
Use the formula (here for the range [-1.0 , 1.0]):
double scaled_val = -1.0 + (1.0 - -1.0) * (extracted_val - vmin)/(vmax-vmin);
The guide provided on the LIBSVM website explains scaling well:
"2.2 Scaling
Scaling before applying SVM is very important. Part 2 of Sarle's Neural Networks FAQ Sarle (1997) explains the importance of this and most of considerations also apply to SVM. The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [-1, +1] or [0, 1].
Of course we have to use the same method to scale both training and testing data."
If you've got infinite feature values, you're not going to be able to use LIBSVM anyway.
More practically, scaling is generally useful so the kernel doesn't have to deal with large numbers, so I would say go for it and scale. It's not a requirement, though.
And as Anony-Mousse implied in the comments, please try running experiments with and without scaling so you can see the difference.
Now, suppose I did scale the values, and after training the model if I get one vector with its values or the features as input, how do I scale these values of the input test vector before classifying it?
You don't need to derive new scaling parameters. Apply the same scaling you computed in the pre-training step (i.e., during data preprocessing) to the test vector, as in the sketch below.
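A minimal sketch of that workflow, assuming scikit-learn's MinMaxScaler as the scaling step (the tiny dataset is made up): fit the scaler on the training data once, and reuse the stored per-feature ranges on every new vector before classifying.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_train = np.array([[0.0, 10.0], [5.0, 10000.0], [2.0, 300.0]])
y_train = [0, 1, 0]

scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)  # stores vmin/vmax per feature
clf = SVC().fit(scaler.transform(X_train), y_train)

x_new = np.array([[3.0, 750.0]])
print(clf.predict(scaler.transform(x_new)))                # reuse the stored ranges, no refit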
