When do you use gridsearchcv vs. k-fold in sklearn? - machine-learning

When would you use GridSearchCV vs. k-fold? Does GridSearchCV automatically perform k-fold cross-validation via the cv parameter?
Example of a GridSearchCV call:
GridSearchCV(svc_gc, param_grid=parameter_grid, cv=10)

Yes, GridSearchCV performs k-fold cross-validation, specified by the cv parameter.
If the cv parameter is an integer, it represents the number of folds for k-fold cross-validation.
You might want to have a look at the reference as well: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
I hope this helps :)

Yes, GridSearchCV does perform k-fold cross-validation, where the number of folds is specified by its cv parameter. If it is not specified, it applies 5-fold cross-validation by default.
Essentially they serve different purposes. Or, better put, GridSearchCV can be seen as an extension of a plain k-fold, and it is the way to go when you want to perform a hyperparameter search over a predefined grid of parameters.
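As a quick illustrative sketch (the estimator and parameter grid here are made up, not taken from the question above): passing an integer cv runs a (stratified) k-fold split internally, and passing a splitter object explicitly is equivalent.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
parameter_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}
# cv=10: each parameter combination is evaluated with 10-fold CV
# (stratified folds, since SVC is a classifier)
grid_int = GridSearchCV(SVC(), param_grid=parameter_grid, cv=10)
# Equivalent: pass the splitter object explicitly instead of an integer
grid_kfold = GridSearchCV(SVC(), param_grid=parameter_grid, cv=StratifiedKFold(n_splits=10))
grid_kfold.fit(X, y)
print(grid_kfold.best_params_, grid_kfold.best_score_)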

Related

Difference between RidgeCV() and GridSearchCV()

RidgeCV() also searches over a set of hyperparameters; given a similar kernel and similar parameters in GridSearchCV(), would there be any difference in the results of the two?
RidgeCV implements cross validation for ridge regression specifically, while with GridSearchCV you can optimize parameters for any estimator, including ridge regression.
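A rough sketch of that comparison (synthetic data and an illustrative alpha grid): searching the same alphas with RidgeCV and with GridSearchCV over Ridge should pick comparable values, though note RidgeCV defaults to an efficient leave-one-out scheme unless cv is set explicitly.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import GridSearchCV
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
alphas = np.logspace(-3, 3, 13)
# RidgeCV: built-in CV over alphas, specific to ridge regression
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
# GridSearchCV: the same search, but usable with any estimator
grid = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5).fit(X, y)
print(ridge_cv.alpha_, grid.best_params_["alpha"])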

learning_curve() with holdout set

Is there any way to plot the learning curves for the training and validation data when the validation data already exists as a holdout set?
sklearn's learning_curve() does not offer such a parameter, as far as I can tell.
You can set the parameter cv as a PredefinedSplit (or just a length-1 iterator of train/test indices). See the description of cv in the documentation.
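A sketch of that suggestion, assuming X_train/X_val and y_train/y_val are your existing training and holdout sets: PredefinedSplit marks holdout rows with fold 0 and training rows with -1, so the training rows are never used for testing.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve, PredefinedSplit, train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
# Stack train and holdout; -1 means "always in the training portion", 0 means "test fold 0"
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)])
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_all, y_all, cv=PredefinedSplit(test_fold))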

No score method for MeanShift estimator - scikit-learn

I was trying to use GridSearchCV to iterate over different values of bandwidth for the MeanShift algorithm and it shows this error; do any of you know how I can fix this? Thanks a lot!
# Using GridSearchCV for algorithm tuning
from sklearn.cluster import MeanShift
from sklearn.model_selection import GridSearchCV
meanshift = MeanShift()
param_grid = {"bandwidth": range(48, 69)}  # candidate MeanShift bandwidths
mean_grid = GridSearchCV(estimator=meanshift, param_grid=param_grid, scoring=None)
mean_grid.fit(X)
And this is the error I get:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, min_bin_freq=1,
n_jobs=1, seeds=None) does not.
GridSearch doesn't work well with unsupervised methods.
The concept of grid search is to choose those parameters that have the best score when predicting on held out data. But since most clustering algorithms cannot predict on unseen data, this does not work.
It's not that straightforward to choose "optimal" parameters in unsupervised learning. That is why there isn't an easy automation like gridsearch available.
It's because the MeanShift algorithm does not have a score method. In this case you have to specify scoring in GridSearchCV; the scikit-learn documentation has a complete list of the predefined scoring values.
From the documentation of GridSearchCV:
Parameters:
estimator : estimator object.
This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.
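A hedged sketch of the second suggestion, passing an explicit scoring callable so GridSearchCV no longer needs a score method. The silhouette-based scorer and the synthetic data are illustrative choices (the bandwidth grid has to match the scale of your own data), and the caveats above about tuning unsupervised models still apply.
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV
def silhouette_scorer(estimator, X, y=None):
    labels = estimator.predict(X)
    if len(set(labels)) < 2:
        return -1.0  # silhouette is undefined with a single cluster
    return silhouette_score(X, labels)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=0)
param_grid = {"bandwidth": [1, 2, 3, 5, 8]}  # use range(48, 69) if that matches your data's scale
mean_grid = GridSearchCV(MeanShift(), param_grid=param_grid, scoring=silhouette_scorer, cv=3)
mean_grid.fit(X)
print(mean_grid.best_params_)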

How to use over-sampled data in cross validation?

I have an imbalanced dataset. I am using SMOTE (Synthetic Minority Oversampling Technique) to perform oversampling. When performing the binary classification, I use 10-fold cross-validation on this oversampled dataset.
However, I recently came across this paper, Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models, which mentions that it is incorrect to use the oversampled dataset during cross-validation as it leads to over-optimistic performance estimates.
I want to verify the correct approach/procedure for using over-sampled data in cross-validation.
To avoid overoptimistic performance estimates from cross-validation in Weka when using a supervised filter, use FilteredClassifier (in the meta category) and configure it with the filter (e.g. SMOTE) and classifier (e.g. Naive Bayes) that you want to use.
For each cross-validation fold Weka will use only that fold's training data to parameterise the filter.
When you do this with SMOTE you won't see a difference in the number of instances in the Weka results window, but what's happening is that Weka is building the model on the SMOTE-applied dataset, but showing the output of evaluating it on the unfiltered training set - which makes sense in terms of understanding the real performance. Try changing the SMOTE filter settings (e.g. the -P setting, which controls how many additional minority-class instances are generated as a percentage of the number in the dataset) and you should see the performance changing, showing you that the filter is actually doing something.
The use of FilteredClassifier is illustrated in this video and these slides from the More Data Mining with Weka online course. In this example the filtering operation is supervised discretisation, not SMOTE, but the same principle applies to any supervised filter.
If you have further questions about the SMOTE technique I suggest asking them on Cross Validated and/or the Weka mailing list.
The correct approach is to first split the data into the k folds and then apply the sampling only to the training data of each fold, leaving the validation data as is.
If you want to achieve this in Python, there is a library for that: https://pypi.org/project/k-fold-imblearn/
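A sketch of the fold-wise approach using the imbalanced-learn package (an alternative to the library linked above, if you already use imblearn): putting SMOTE inside an imblearn Pipeline means cross_val_score resamples only each training fold and leaves every validation fold untouched.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
pipeline = Pipeline([
    ("smote", SMOTE(random_state=0)),                # applied only when fitting on a training fold
    ("clf", RandomForestClassifier(random_state=0)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")
print(scores.mean())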

Scikit-learn: scoring in GridSearchCV

It seems that GridSearchCV of scikit-learn collects the scores of its (inner) cross-validation folds and then averages across the scores of all folds. I was wondering about the rationale behind this. At first glance, it would seem more flexible to instead collect the predictions of its cross-validation folds and then apply the chosen scoring metric to the predictions of all folds.
The reason I stumbled upon this is that I use GridSearchCV on an imbalanced data set with cv=LeaveOneOut() and scoring='balanced_accuracy' (scikit-learn v0.20.dev0). It doesn't make sense to apply a scoring metric such as balanced accuracy (or recall) to each left-out sample. Rather, I would want to collect all predictions first and then apply my scoring metric once to all predictions. Or does this involve an error in reasoning?
Update: I solved it by creating a custom grid search class based on GridSearchCV with the difference that predictions are first collected from all inner folds and the scoring metric is applied once.
GridSearchCV uses the scoring to decide which hyperparameter values to set in the model.
If you want to estimate the performance of the "optimal" hyperparameters, you need to do an additional step of cross validation.
See http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
EDIT to get closer to answering the actual question:
For me it seems reasonable to collect predictions for each fold and then score them all, if you want to use LeaveOneOut and balanced_accuracy. I guess you need to make your own grid searcher to do that. You could use model_selection.ParameterGrid and model_selection.KFold for that.
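A minimal sketch of that idea (pooled_grid_search is a made-up helper, and X/y are assumed to be NumPy arrays): collect the out-of-fold predictions for every candidate, then apply balanced accuracy once per candidate.
from sklearn.base import clone
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneOut, ParameterGrid
def pooled_grid_search(estimator, param_grid, X, y, cv=None):
    cv = cv or LeaveOneOut()
    results = []
    for params in ParameterGrid(param_grid):
        y_true, y_pred = [], []
        for train_idx, test_idx in cv.split(X, y):
            model = clone(estimator).set_params(**params)
            model.fit(X[train_idx], y[train_idx])
            y_pred.extend(model.predict(X[test_idx]))
            y_true.extend(y[test_idx])
        # one score per candidate, computed on all pooled predictions
        results.append((params, balanced_accuracy_score(y_true, y_pred)))
    return max(results, key=lambda r: r[1])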
