What is the motivation for cross validation on feature selection preprocessing? - machine-learning

I have seen several articles and examples of feature selection (wrapper & embedded methods) where the sample data is split into train and test sets.
I understand why we need cross-validation (splitting the data into train & test sets) for building models and testing their scores (the actual predictions of the proposed algorithm).
But I can't understand what the motivation is to do so for feature selection.
There is no ground truth telling us which features we need to choose, so how can it improve the process of feature selection?
What is the benefit?

Most feature selection methods, such as wrapper methods, require comparing a model's performance under different feature combinations.
Cross-validation provides a more robust means of comparing performance when different feature subsets are used, and therefore a more robust feature selection process. For example, if K-fold cross-validation is used, the comparison is based on the average of the errors across the different folds of data, so the subset selected is the one expected to give the smallest generalization error.
Also, the optimal hyper-parameters are not necessarily the same for different feature combinations. Cross-validation helps with tuning and therefore allows a fairer comparison.
This is also an informative resource on this topic.
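To make this concrete, here is a minimal sketch (my own illustration, not from any of the articles mentioned) of wrapper-style selection where candidate feature subsets are compared by their average K-fold score; the data set, estimator and subset sizes are placeholders:

```python
# Minimal sketch of wrapper-style feature selection driven by K-fold CV.
# The data set, estimator and candidate subset sizes are placeholders.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X = X[:, :6]  # keep the example small: only consider the first 6 features

best_score, best_subset = -np.inf, None
for k in (2, 3):
    for subset in combinations(range(X.shape[1]), k):
        # The average accuracy over 5 folds is a more robust basis for
        # comparison than the score from a single train/test split.
        score = cross_val_score(
            LogisticRegression(max_iter=5000), X[:, list(subset)], y, cv=5
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best subset:", best_subset, "CV accuracy: %.3f" % best_score)
```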

Related

Usage of nested cross validation for different regressors

I'm working on an assignment in which I have to compare two regressors (random forest and SVR) that I implement with scikit-learn. I want to evaluate both regressors, and after a lot of googling I came across nested cross-validation, where you use the inner loop to tune the hyperparameters and the outer loop to validate on the k folds of the training set. I would like to use the inner loop to tune both my regressors and the outer loop to validate both, so I will have the same test and train folds for both regressors.
Is this a proper way to compare two ML algorithms with each other? Are there better ways to compare two algorithms with each other, especially regressors?
I found some entries in blogs, but I could not find any scientific paper stating that this is a good technique for comparing two algorithms, which would be important to me. If there are links to current papers, I would be glad if you could post them, too.
Thanks for the help in advance!
EDIT
I have a very small amount of data (approx. 200 samples) with a large number of features (approx. 250 after feature selection, otherwise about 4500), so I decided to use cross-validation. My dependent variable is a continuous value from 0 to 1.
The problem is a recommender problem, so it makes no sense to test for accuracy in this case. As this is only an assignment, I can only evaluate the ML algorithms with statistical methods rather than asking users for their opinion or measuring the purchases they make.
I think it depends on what you want to compare. If you just want to compare different models with regard to predictive power (classifiers and regressors alike), nested cross-validation is usually a good choice, since it lets you find the best set of hyperparameters while not reporting overly optimistic metrics: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
However, sometimes it seems to be overkill: https://arxiv.org/abs/1809.09446
Also, depending on how the ML algorithms behave, which datasets you are talking about, their characteristics, etc., your "comparison" might need to take a lot of other things into consideration beyond predictive power. If you give some more details, we will be able to help more.
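For what it's worth, here is a rough sketch of the nested setup you describe, with scikit-learn; the data set and parameter grids are placeholders, not recommendations:

```python
# Sketch of nested CV for comparing two regressors on the same outer folds.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

inner = KFold(n_splits=3, shuffle=True, random_state=0)  # hyper-parameter tuning
outer = KFold(n_splits=5, shuffle=True, random_state=0)  # performance estimate

candidates = {
    "random forest": GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": [100, 300], "max_depth": [None, 5]},
        cv=inner,
    ),
    "svr": GridSearchCV(SVR(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=inner),
}

# Using the same outer splits for both models keeps the comparison fair:
# each model is tuned on the inner folds and scored on identical outer folds.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=outer, scoring="r2")
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```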

Why is dropping predictors based on low correlation between predictors and the dependent variable before performing cross validation incorrect?

Say I have predictors X1, X2, ..., Xn and a dependent variable Y.
I check correlation between the predictors and Y and drop predictors that have a low correlation with Y. Now I use cross validation between Y and the remaining predictors to train a logistic regression model.
What is wrong with this method?
There are many possible problems with doing so, which would make for a very lengthy answer - I'll just point out the two that I think are most important, whose "buzzwords" you can use to look up anything that is still unclear:
Dropping features based on their feature-to-target correlation is essentially a form of feature filtering. It is important to understand that feature filtering does not necessarily improve predictive performance. Think, e.g., of two features in an AND or OR configuration with the target variable, which only together allow correct prediction of the target variable. The correlation of each of those features with the target will be low, but dropping them might very well decrease your predictive performance. Besides feature filters there are feature wrappers, with which you essentially use a subset of features with a model and evaluate the predictive performance of the model. So, in contrast to feature filters, which look only at the features and the target, feature wrappers look at the actual model performance. BTW: if you end up using feature filters based on correlation, you might want to discard not only features with low feature-target correlation but also features with high inter-feature correlation (because such features just don't contain much new information).
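To illustrate the filtering pitfall in its most extreme form (an XOR-style configuration rather than AND/OR, and with made-up data), each feature on its own is essentially uncorrelated with the target, yet together they predict it perfectly:

```python
# Two features that are individually uninformative but jointly decisive:
# a correlation filter would drop both and destroy predictive performance.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(1000, 2))
y = X[:, 0] ^ X[:, 1]  # target depends on both features jointly (XOR)

print("corr(X1, y): %.3f" % np.corrcoef(X[:, 0], y)[0, 1])
print("corr(X2, y): %.3f" % np.corrcoef(X[:, 1], y)[0, 1])
print("CV accuracy using both features: %.3f"
      % cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean())
```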
If you want to tune your feature selection (e.g. the amount of information/variance you want to preserve in your data, the number of features you want to keep, the amount of correlation you allow, etc.) and you do this outside of your cross-validation and resampling approach, you will likely end up with overly optimistic error estimates for your final model. This is because, by not including those steps in the CV process, you end up selecting one "best" configuration whose performance was not estimated correctly (i.e. independently), and which might therefore have been good just by chance. So, if you want to estimate the error correctly, you should include your feature selection in the CV process too.
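Here is a minimal sketch of that second point: the (filter-style) feature selection is placed inside the cross-validation via a pipeline, so it is refit on every training fold. The data set and the choice of SelectKBest with a univariate F-test stand in for whatever filter you actually use:

```python
# Feature filtering inside the CV loop, so the reported score is not
# optimistically biased by selection on the full data set.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

model = Pipeline([
    # The filter only ever sees the training part of each fold.
    ("filter", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=5000)),
])

print("CV accuracy: %.3f" % cross_val_score(model, X, y, cv=5).mean())
```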

Why KNN implementation in weka runs faster?

1) As we know, KNN performs no computation in the training phase and instead defers all computation to classification, which is why we call it a lazy learner. It should take more time in classification than in training; however, I found almost the opposite with Weka, where KNN takes more time in training than in testing.
Why and how does KNN in Weka perform much faster in classification, when in general it should be slower?
Does it also lead to computation mistakes?
2) When we say that feature weighting in KNN may improve performance for high-dimensional data, what do we mean by that? Do we mean feature selection, i.e. selecting the features with high InformationGain?
Answer to question 1
My guess is that the Weka implementation uses some kind of data structure to perform (approximate) nearest-neighbor queries efficiently.
With such data structures, queries can be performed much more efficiently than in the naive way.
Examples of such data structures are the KD-tree and the SR-tree.
In the training phase the data structure has to be built, so training will take more time than classification.
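This is not Weka's code, but a small scikit-learn sketch that shows the same effect: with a KD-tree, the "training" call pays the cost of building the tree, while classification becomes much cheaper than brute-force search. The data and sizes are arbitrary:

```python
# Comparing brute-force KNN with a KD-tree: the tree shifts work from
# query time (classification) to build time (training).
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(50_000, 8)
y = rng.randint(0, 2, 50_000)
X_query = rng.rand(1_000, 8)

for algorithm in ("brute", "kd_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    t0 = time.time()
    knn.fit(X, y)         # "training": builds the tree (or just stores the data)
    t1 = time.time()
    knn.predict(X_query)  # "classification": runs the neighbor queries
    t2 = time.time()
    print(f"{algorithm}: fit {t1 - t0:.3f}s, predict {t2 - t1:.3f}s")
```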
Answer to question 2
(I'm not sure if you refer to predictive performance or performance as in speed-up. Since both are relevant, I will address them both in my answer.)
Using higher weights for the most relevant features and lower weights for the less relevant features may improve the predictive performance.
Another way to improve predictive performance is to perform feature selection. Using Mutual Information or some other kind of univariate association (like Pearson correlation for continuous variables) is the simplest and easiest way to perform feature selection. Note that reducing the number of variables can offer a significant speed-up in terms of computational time.
Of course, you can do both, that is, perform feature selection first and then use weights on the remaining features. You could, for example, use the mutual information to weight the remaining features. In the case of text classification, you could also use TF-IDF to weight your features.
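A rough sketch of that selection-plus-weighting idea using mutual information (the data set and the number of kept features are placeholders, and for an honest error estimate the mutual information should itself be computed inside the CV, as discussed in other answers here):

```python
# Select the most informative features by mutual information, then scale
# each kept feature by its MI before running KNN.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put features on a comparable scale

mi = mutual_info_classif(X, y, random_state=0)
keep = np.argsort(mi)[-10:]            # keep the 10 most informative features
X_weighted = X[:, keep] * mi[keep]     # weight each kept feature by its MI

# Note: computing MI on the full data before CV leaks information; it is
# done here only to keep the illustration short.
score = cross_val_score(KNeighborsClassifier(), X_weighted, y, cv=5).mean()
print("CV accuracy with MI-selected, MI-weighted features: %.3f" % score)
```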

Optimal Feature-to-Instance Ratio in Back Propagation Neural Network

I'm trying to perform leave-one-out cross-validation to model a particular problem using a back-propagation neural network. I have 8 features in my training data and 20 instances. I'm trying to make the NN learn a function to build a prediction model. Now, the problem is that the error rate is quite high in the prediction. My guess is that the number of training instances is small compared to the number of features under consideration. Is this conclusion correct? Is there an optimal feature-to-instance ratio?
(This topic is often phrased in the ML literature as the acceptable size or shape of the data set, given that a data set is often described as an m x n matrix in which m is the number of rows (data points) and n is the number of columns (features); obviously m >> n is preferred.)
In any event, I am not aware of a general rule for an acceptable ratio of features to observations; there are probably a couple of reasons for this:
such a ratio would depend strongly on the quality of the data (signal-to-noise ratio); and
the number of features is just one element of model complexity (e.g., interactions among the features), and model complexity is the strongest determinant of the number of data instances (data points) required.
So there are two sets of approaches to this problem--and because they attack it from opposite directions, both can be applied to the same model:
reduce the number of features; or
use a statistical technique to leverage the data that you do have
A couple of suggestions, one for each of the two paths above:
Eliminate "non-important" features--i.e, those features that don't contribute to the variability in your response variable. Principal Component Analysis (PCA) is fast and reliable way to do this, though there are a number of other techniques which are generally subsumed under the rubric "dimension reduction."
Use Bootstrap methods instead of cross-validation. The difference in methodology seems slight, but the (often substantial) improvement in reducing prediction error is well documented for multi-layer perceptrons (neural networks) (see e.g. Efron, B. and Tibshirani, R.J. (1997), "Improvements on Cross-Validation: The .632+ Bootstrap Method", Journal of the American Statistical Association, 92(438), 548-560). If you are not familiar with Bootstrap methods for splitting training and testing data, the general technique is similar to cross-validation except that, instead of taking disjoint subsets of the entire data set, you repeatedly resample it with replacement. Section 7.11 of Elements is a good introduction to Bootstrap methods.
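And a sketch of the second suggestion: a plain out-of-bag bootstrap error estimate for a small network (this is not the .632+ estimator from the paper, just the basic resampling idea, with random placeholder data):

```python
# Bootstrap error estimate: resample the rows with replacement, train on
# the resample, and evaluate on the rows left out ("out-of-bag").
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(20, 8)   # placeholder features (20 instances, 8 features)
y = rng.rand(20)      # placeholder continuous target

errors = []
for _ in range(50):
    idx = rng.randint(0, len(X), len(X))        # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(len(X)), idx)  # rows not drawn this round
    if len(oob) == 0:
        continue
    model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=1000, random_state=0)
    model.fit(X[idx], y[idx])
    errors.append(np.mean((model.predict(X[oob]) - y[oob]) ** 2))

print("bootstrap out-of-bag MSE estimate: %.4f" % np.mean(errors))
```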
The best single source on this general topic that I have found is Chapter 7, Model Assessment and Selection, of the excellent treatise Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. The book is available to download free of charge from its homepage.

Cross validation with Training/Testing sets

Is it possible to do the evaluation with cross-validation and training/testing sets combined? I understand cross-validation vs. holdout evaluation, but I am confused about how we would combine them.
Both cross-validation and holdout evaluation are widely used for estimating the accuracy (or some other measure of performance) of a model. Typically, if you have the luxury of a large amount of data available, you might use holdout evaluation, but if you are a bit more restricted, you might use cross-validation.
But they can also be used for other purposes - in particular, model selection and optimization - and one might commonly want to do these things as well as estimating the model's accuracy.
For example, you might wish to carry out feature selection for your model (choose among several versions of the model, each of which has been built with a different subset of variables), and then evaluate the final chosen model. For the final evaluation, you might reserve a test set for holdout validation; but in order to choose the best subset of variables, you might compare the accuracies of the models built on each subset, as estimated by cross-validation on the training set.
Other aspects of models could also be optimized using this mixed approach, such as a complexity parameter of a neural network or the ridge parameter in ridge regression.
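A short sketch of this mixed approach (the data set and candidate feature subsets are placeholders): cross-validation on the training set picks among the candidate versions of the model, and the held-out test set is used once for the final evaluation.

```python
# Model selection by CV on the training set, final evaluation on a holdout set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate "versions" of the model: different column subsets.
subsets = {"first 5": slice(0, 5), "first 15": slice(0, 15), "all": slice(None)}

best = max(
    subsets,
    key=lambda name: cross_val_score(
        LogisticRegression(max_iter=5000), X_train[:, subsets[name]], y_train, cv=5
    ).mean(),
)

# Refit the chosen version on the full training set; evaluate once on the test set.
final = LogisticRegression(max_iter=5000).fit(X_train[:, subsets[best]], y_train)
print("chosen subset:", best,
      "holdout accuracy: %.3f" % final.score(X_test[:, subsets[best]], y_test))
```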
Hope that helps!
