What is the feature selection principle for sequential backward selection?

For example, when we do feature selection for features A, B, and C with the sequential backward selection algorithm, the performance of {A, B, C} is 0.7, the performance of {A, B} is 0.9, and the performance of {A} is 0.6. Which feature subset will be chosen?

A sequential (or stepwise) backward (or forward) feature selection algorithm is usually greedy: features are removed (respectively added) as long as doing so improves, or at least preserves, the model score.
Note that the score is not the raw performance but some other metric (AIC, BIC, or adjusted R², for example, in the case of regression).
In your example, if the stopping criterion is that a step must increase performance by at least 0.1, then the feature subset {A, B} would be chosen, because the next step (removing B to get {A}) drops performance by 0.3.
The Wikipedia page on stepwise regression gives details for the regression case.
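A minimal sketch of that greedy loop, assuming a hypothetical score() function that just returns the toy numbers from the question (subsets not listed there default to 0.0) and an improvement threshold of 0.1:
```python
# Greedy sequential backward selection sketch. score() is a hypothetical
# stand-in for a real (cross-validated) metric; the numbers come from the
# question above, and unlisted subsets default to 0.0 just for this toy.
TOY_SCORES = {("A", "B", "C"): 0.7, ("A", "B"): 0.9, ("A",): 0.6}

def score(subset):
    return TOY_SCORES.get(tuple(subset), 0.0)

def backward_selection(features, min_improvement=0.1):
    current = list(features)
    current_score = score(current)
    while len(current) > 1:
        # Try dropping each remaining feature and keep the best candidate.
        candidates = [[f for f in current if f != drop] for drop in current]
        best = max(candidates, key=score)
        if score(best) - current_score < min_improvement:
            break  # no removal improves the score enough: stop
        current, current_score = best, score(best)
    return current, current_score

print(backward_selection(["A", "B", "C"]))  # (['A', 'B'], 0.9)
```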

Related

Does L1 or L2 regularization give the most sparse weights for the same loss function and optimizer?

If I consider a dataset, which regularization technique (L1 regularization or L2 regularization) will produce the sparsest weights for the same loss function and the same optimizer?
By definition, L1 regularization (lasso) forces some weights to zero, thus leading to sparser solutions; according to the Wikipedia entry on regularization:
It can be shown that the L1 norm induces sparsity
See also the L1 and L2 Regularization Methods post at Towards Data Science:
The key difference between these techniques is that Lasso shrinks the less important feature’s coefficient to zero thus, removing some feature altogether. So, this works well for feature selection in case we have a huge number of features.
For more details, see the following threads on Cross Validated:
Sparsity in Lasso and advantage over ridge
Why does the Lasso provide Variable Selection?
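As a quick illustration (a sketch on synthetic data, with an arbitrary alpha for both models), you can count the exactly-zero coefficients each one produces in scikit-learn:
```python
# Compare how many coefficients L1 (Lasso) and L2 (Ridge) drive exactly
# to zero on the same synthetic regression problem. alpha=1.0 is arbitrary.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("zero weights with Lasso:", np.sum(lasso.coef_ == 0))  # typically many
print("zero weights with Ridge:", np.sum(ridge.coef_ == 0))  # typically none
```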

Feature Selection or PCA?

I have the following Azure Machine Learning question:
You need to identify which columns are more predictive by using a
statistical method. Which module should you use?
A. Filter Based Feature Selection
B. Principal Component Analysis
I chose A, but the answer is B. Can someone explain why it is B?
PCA gives the optimal approximation of a random vector (in N-dimensional space) by a linear combination of M (M < N) vectors. Notice that we obtain these vectors by calculating the M eigenvectors with the largest eigenvalues. Thus these vectors (features) can be (and usually are) combinations of the original features.
Filter Based Feature Selection is choosing the best features as they are (not combining them in any way) based on various scores and criteria.
So, as you can see, PCA results in better features since it creates a new set of features, while FBFS merely finds the best subset of the existing ones.
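To make the difference concrete, here is a small scikit-learn sketch (not the Azure module itself; the dataset and k=5 are arbitrary choices): PCA outputs new features that mix all of the original columns, while a filter method such as SelectKBest keeps a subset of the original columns unchanged.
```python
# PCA builds new features as linear combinations of the originals;
# SelectKBest (a filter method) keeps k of the original columns as-is.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)   # 569 samples, 30 features

X_pca = PCA(n_components=5).fit_transform(X)     # 5 mixtures of all 30 columns
selector = SelectKBest(f_classif, k=5).fit(X, y)
X_filtered = selector.transform(X)               # 5 of the original columns

print(X_pca.shape, X_filtered.shape)             # (569, 5) (569, 5)
print(selector.get_support(indices=True))        # which original columns survived
```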
hope that helps ;)

Which machine learning algorithm to use for high dimensional matching?

Let's say I can describe a person in 1,000 different ways, so I have 1,000 features for a given person.
PROBLEM: How can I run a machine learning algorithm to determine the best possible match, or the closest/most similar person, given the 1,000 features?
I have attempted k-means, but it appears to be more suited to 2 features than to high dimensions.
You are basically after some kind of k-nearest neighbors (k-NN) algorithm.
Since your data is high-dimensional, you should explore the following:
Dimensionality reduction - You may have 1,000 features, but some of them are probably more useful than others, so it would be a wise move to apply some kind of dimensionality reduction. The easiest point to start with would be Principal Component Analysis (PCA), keeping enough components to preserve ~90% of the data (namely, use enough eigenvectors so that their matching eigenvalues account for 90% of the energy). I would assume you'll see a significant reduction from this.
Accelerated k-nearest neighbors - There are many methods out there to accelerate the k-NN search in the high-dimensional case. The k-d tree algorithm would be a good start; see the sketch after this list.
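A rough sketch of both points, assuming random stand-in data for the 1,000-feature people (the 90% variance target and the neighbor count are arbitrary):
```python
# Reduce the 1,000 features with PCA (keeping ~90% of the variance),
# then answer nearest-neighbor queries with a k-d tree.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic people with hidden low-dimensional structure, purely illustrative.
latent = rng.normal(size=(2000, 30))
people = latent @ rng.normal(size=(30, 1000)) + 0.1 * rng.normal(size=(2000, 1000))

pca = PCA(n_components=0.90, svd_solver="full")   # keep ~90% of the variance
reduced = pca.fit_transform(people)
print("components kept:", pca.n_components_)

nn = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(reduced)
distances, indices = nn.kneighbors(reduced[:1])   # closest matches to person 0
print(indices[0])                                 # person 0 itself plus 4 nearest people
```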
Distance metrics
You can try to apply a distance metric (e.g. cosine similarity) directly.
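For example (again with random stand-in data), scikit-learn's cosine_similarity can rank everyone against a query person directly:
```python
# Rank all people by cosine similarity to person 0; no training involved.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)
people = rng.normal(size=(2000, 1000))

sims = cosine_similarity(people[:1], people)[0]  # similarity of person 0 to everyone
best_match = np.argsort(sims)[-2]                # index -1 is person 0 itself
print(best_match, sims[best_match])
```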
Supervised
If you know how similar the people are, you can try the following:
Neural networks, Approach #1
Input: 2x the person feature vector (hence 2000 features)
Output: 1 float (similarity of the two people)
Scalability: Linear with the number of people
See neuralnetworksanddeeplearning.com for a nice introduction and Keras for a simple framework
Neural networks, Approach #2
A more advanced approach is called metric learning.
Input: the person feature vector (hence 1,000 features)
Output: k floats (you choose k, but it should be lower than 1000)
For training, you give the network the first person and store the result, then the second person and store the result, apply a distance metric of your choice (e.g. Euclidean distance) to the two results, and then backpropagate the error.
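For concreteness, here is a very small Keras sketch of Approach #1 (the simpler of the two); everything in it (layer sizes, the loss, the dummy training pairs) is a placeholder assumption. Approach #2 would instead share a single embedding network between the two people and train on a distance between their embeddings.
```python
# Approach #1 sketch: feed the concatenated pair (2 x 1,000 features)
# and regress a similarity score in [0, 1]. All sizes here are arbitrary.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(2000,)),              # person A's 1,000 features + person B's
    layers.Dense(256, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # predicted similarity of the pair
])
model.compile(optimizer="adam", loss="mse")

# Dummy pairs and labels, only to show the expected shapes.
pairs = np.random.rand(500, 2000).astype("float32")
similarity = np.random.rand(500, 1).astype("float32")
model.fit(pairs, similarity, epochs=2, batch_size=32, verbose=0)
```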

Perform Chi-2 feature selection on TF and TF*IDF vectors

I'm experimenting with Chi-2 feature selection for some text classification tasks.
I understand that the Chi-2 test checks the dependency between two categorical variables, so if we perform Chi-2 feature selection for a binary text classification problem with a binary BOW vector representation, each Chi-2 test on each (feature, class) pair would be a very straightforward Chi-2 test with 1 degree of freedom.
Quoting from the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2,
This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.
It seems to me that we can also perform Chi-2 feature selection on a TF (word count) vector representation. My first question is: how does sklearn discretize the integer-valued features into categorical ones?
My second question is similar to the first. From the demo code here: http://scikit-learn.sourceforge.net/dev/auto_examples/document_classification_20newsgroups.html
It seems to me that we can also perform Chi-2 feature selection on a TF*IDF vector representation. How does sklearn perform Chi-2 feature selection on real-valued features?
Thank you in advance for your kind advice!
The χ² feature selection code builds a contingency table from its inputs X (feature values) and y (class labels). Each entry (i, j) corresponds to some feature i and some class j, and holds the sum of the i-th feature's values across all samples belonging to class j. It then computes the χ² test statistic against expected frequencies arising from the empirical distribution over classes (just their relative frequencies in y) and a uniform distribution over feature values.
This works when the feature values are frequencies (of terms, for example) because the sum will be the total frequency of a feature (term) in that class. There's no discretization going on.
It also works quite well in practice when the values are tf-idf values, since those are just weighted/scaled frequencies.
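For reference, a short sketch of both cases with scikit-learn (the 20 newsgroups subset and k=100 are arbitrary choices; the data is downloaded on first use):
```python
# Chi-squared selection on raw term counts and on tf-idf weights.
# As described above, no discretization is involved.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
counts = CountVectorizer().fit_transform(data.data)   # term counts
tfidf = TfidfTransformer().fit_transform(counts)      # weighted/scaled counts

best_counts = SelectKBest(chi2, k=100).fit_transform(counts, data.target)
best_tfidf = SelectKBest(chi2, k=100).fit_transform(tfidf, data.target)
print(best_counts.shape, best_tfidf.shape)
```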

importance of PCA or SVD in machine learning

All this time (especially during the Netflix contest), I keep coming across blogs (or leaderboard forums) where they mention how applying a simple SVD step to the data helped them reduce sparsity in the data or, in general, improved the performance of the algorithm at hand.
I have been trying to work this out for a long time, but I am not able to see why this is so.
In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling techniques like log transformation and mean normalization.
But how does something like SVD help?
So let's say I have a huge matrix of users rating movies, and on this matrix I implement some version of a recommendation system (say, collaborative filtering):
1) Without SVD
2) With SVD
How does it help?
SVD is not used to normalize the data, but to get rid of redundant data, that is, for dimensionality reduction. For example, if you have two variables, one is a humidity index and another is the probability of rain, then their correlation is so high that the second one does not contribute any additional information useful for a classification or regression task. The singular values in SVD help you determine which variables are the most informative and which ones you can do without.
The way it works is simple. You perform SVD over your training data (call it matrix A), to obtain U, S and V*. Then set to zero all values of S less than a certain arbitrary threshold (e.g. 0.1), call this new matrix S'. Then obtain A' = US'V* and use A' as your new training data. Some of your features are now set to zero and can be removed, sometimes without any performance penalty (depending on your data and the threshold chosen). This is called k-truncated SVD.
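A small numpy sketch of that recipe, using a synthetic nearly-rank-5 matrix and an arbitrary threshold of 10% of the largest singular value:
```python
# Zero out the small singular values of A and rebuild A' = U S' V*.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: essentially rank 5, plus a little noise.
A = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20)) \
    + 0.01 * rng.normal(size=(100, 20))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_trunc = np.where(s < 0.1 * s.max(), 0.0, s)   # drop the weak directions
A_approx = U @ np.diag(s_trunc) @ Vt

print("rank of A: ", np.linalg.matrix_rank(A))         # 20 (noise makes it full rank)
print("rank of A':", np.linalg.matrix_rank(A_approx))  # 5
```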
SVD doesn't help you with sparsity though, only helps you when features are redundant. Two features can be both sparse and informative (relevant) for a prediction task, so you can't remove either one.
Using SVD, you go from n features to k features, where each one will be a linear combination of the original n. It's a dimensionality reduction step, just like feature selection is. When redundant features are present, though, a feature selection algorithm may lead to better classification performance than SVD depending on your data set (for example, maximum entropy feature selection). Weka comes with a bunch of them.
See: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition
https://stats.stackexchange.com/questions/33142/what-happens-when-you-apply-svd-to-a-collaborative-filtering-problem-what-is-th
The Singular Value Decomposition is often used to approximate a matrix X by a low rank matrix X_lr:
Compute the SVD X = U D V^T.
Form the matrix D' by keeping the k largest singular values and setting the others to zero.
Form the matrix X_lr by X_lr = U D' V^T.
The matrix X_lr is then the best approximation of rank k of the matrix X, for the Frobenius norm (the equivalent of the l2-norm for matrices). It is computationally efficient to use this representation, because if your matrix X is n by n and k << n, you can store its low rank approximation with only (2n + 1)k coefficients (by storing U, D' and V).
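In numpy terms (a sketch with arbitrary n = 1000 and k = 20), keeping only the k leading singular vectors and values gives exactly that storage saving:
```python
# Store only the k leading columns of U, the k largest singular values,
# and the k leading rows of V^T: (2n + 1) * k numbers instead of n * n.
import numpy as np

n, k = 1000, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(n, n))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s is sorted in decreasing order
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

X_lr = U_k @ np.diag(s_k) @ Vt_k                   # best rank-k approximation (Frobenius)
print("full storage:    ", X.size)                 # n * n = 1,000,000
print("low-rank storage:", U_k.size + s_k.size + Vt_k.size)  # (2n + 1) * k = 40,020
```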
This was often used in matrix completion problems (such as collaborative filtering) because the true matrix of user ratings is assumed to be low rank (or well approximated by a low rank matrix). So, you wish to recover the true matrix by computing the best low rank approximation of your data matrix. However, there are now better ways to recover low rank matrices from noisy and missing observations, namely nuclear norm minimization. See for example the paper The power of convex relaxation: Near-optimal matrix completion by E. Candes and T. Tao.
(Note: the algorithms derived from this technique also store the SVD of the estimated matrix, but it is computed differently).
PCA or SVD, when used for dimensionality reduction, reduce the number of inputs. This, besides saving computational cost of learning and/or predicting, can sometimes produce more robust models that are not optimal in statistical sense, but have better performance in noisy conditions.
Mathematically, simpler models have less variance, i.e. they are less prone to overfitting. Underfitting, of course, can be a problem too. This is known as the bias-variance dilemma. Or, as Einstein put it in plain words: things should be made as simple as possible, but not simpler.
