How to compare baseline and GridSearchCV results fair? - machine-learning

I am a bit confusing with comparing best GridSearchCV model and baseline.
For example, we have classification problem.
As a baseline, we'll fit a model with default settings (let it be logistic regression):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))
So, the baseline gives us accuracy using the whole train sample.
Next, GridSearchCV:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
X_val, X_test_val,y_val,y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
best_model = GridSearchCV(LogisticRegression(parameters,scoring='accuracy' ,cv=cv))
best_model.fit(X_val, y_val)
print(best_model.best_score_)
Here, we have accuracy based on validation sample.
My questions are:
Are those accuracy scores comparable? Generally, is it fair to compare GridSearchCV and model without any cross validation?
For the baseline, isn't it better to use Validation sample too (instead of the whole Train sample)?

No, they aren't comparable.
Your baseline model used X_train to fit the model. Then you're using the fitted model to score the X_train sample. This is like cheating because the model is going to already perform the best since you're evaluating it based on data that it has already seen.
The grid searched model is at a disadvantage because:
It's working with less data since you have split the X_train sample.
Compound that with the fact that it's getting trained with even less data due to the 5 folds (it's training with only 4/5 of X_val per fold).
So your score for the grid search is going to be worse than your baseline.
Now you might ask, "so what's the point of best_model.best_score_? Well, that score is used to compare all the models used when searching for the optimal hyperparameters in your search space, but in no way should be used to compare against a model that was trained outside of the grid search context.
So how should one go about conducting a fair comparison?
Split your training data for both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit your models using X_train.
# fit baseline
baseline.fit(X_train, y_train)
# fit using grid search
best_model.fit(X_train, y_train)
Evaluate models against X_test.
# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))
# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))

Related

sklearn cross valid / cross predict

I understand that cross_val_predict / cross_val trains n out-of-folds models and then aggragate them to produce the final prediction. This is done on the train phase. Now, I want to use the fitted models to predict the test data. I can use for loop to collect predictions on the test data and aggregate them but first I want to ask if there is a build-in sklearn method for this?
from sklearn.model_selection import cross_val_predict, train_test_split
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
lasso = linear_model.Lasso()
y_train_hat = cross_val_predict(lasso, X_train, y_train, cv=3)
y_test_hat = do_somthing(lasso, X_test)```
Thanks
The 3 models from your cross_val_predict are not saved anywhere, so you can't make predictions with them. You can use instead cross_validate with return_estimator=True. You'll still be left with three models that you'll have to manually use to make and aggregate predictions. (You could in principle put those models into an ensemble model like VotingClassifier, but at least for now there is no prefit argument to prevent refitting your estimators. There some discussion in Issue 7382 and links from there.)

Implementation of Cross-validation

I am confused since many individuals have their own approach to apply the cross-validation. For instance, some apply it on the whole dataset and some apply it on the training set.
My question is whether the below code is appropriate to implement cross-validation and make predictions from such model while having Cross-validation being applied?
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
model= GradientBoostingClassifier(n_estimators= 10,max_depth = 10, random_state = 0)#sepcifying the model
cv = KFold(n_splits=5, shuffle=True)
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
#X -the whole dataset
#y - the whole dataset but target attributes only
y_pred = cross_val_predict(model, X, y, cv=cv)
scores = cross_val_score(model, X, y, cv=cv)
You need to have a test set to evaluate performance on completely unseen data even for cross validation. Performance tuning should not be done on this test set to avoid data leakage.
Split data into two segments train and test. There are various CV methods such as K-Fold, Stratified K-Fold etc. Visualization and further reading material here,
https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
In K-Fold CV training data is split into K sets. Then for each fold, K-1 of the fold is trained and the remaining one is used for performance evaluation.
The image and further detail about cross validation, train/validation/test split etc. can be found here.
https://scikit-learn.org/stable/modules/cross_validation.html
Visualization of K-Fold cross validation for 3 classes,

Python Time SeriesSplit

I have the following question:
I have a timer series. I have done my preprocessing, and now I have x, which contains multiple features and y, which contains my output. I have split it into the train, test: x_train, x_test, y_train, y_test
I now want to do a regression and a gridsearch.
Since I have a time series, I cant do the k-fold cross-validation. So I wanted to use the TimeSeriesSplit.
But what exactly am I splitting? I thought I would split the training set into train and test/validate to train my model, validate/select my hyperparameter and then forecast using the test. Is this correct?
And how do I choose n_splits?
I have now the following code:
pipe=Pipeline....
pipe.fit(x_train, y_train)
tss=TimeSeriesSplit(n_splits=5)
for train_index, test_index in tss(train):
print('train:', train_index, 'test:', test_index
clf=GridSearchCV(pipe, param_grid, cv=tss)
clf.fit(x_train, y_train)
according to sklearn documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
"Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate."
The way to go, if you want to validate a time series model is nested cross validation some info about it is in the link bellow:
https://mlfromscratch.com/nested-cross-validation-python-code/

Difference between cross_val_score and another way of calculating accuracy

I tried to calculate the accuracy and was puzzled by the fact that cross_val_score gives a rather low result, than by comparing the predicted results with the correct.
First way of counting, that gives
[0.8033333333333333, 0.7908333333333334, 0.8033333333333333, 0.7925,0.8066666666666666]
kf = KFold(shuffle=True, n_splits=5)
scores = []
for train_index, test_index in kf.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
scores.append(np.sum(y_pred == y_test) / len(y_test))
Second way gives array([0.46166667, 0.53583333, 0.40916667, 0.44666667, 0.3775 ]):
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
cross_val_score(model, X, y, cv = 5, scoring='accuracy')
What's my mistake?
cross_val_score will use a StratifiedKFold cv iterator when not specified otherwise. A StratifiedKFold will keep the ratio of classes balanced the same way in train and test split. For more explanation, see my other answer here:-
https://stackoverflow.com/a/48314533/3374996
On the other hand, in your first approach you are using KFold which will not keep the balance of classes. In addition you are doing shuffling of data in that.
So in each fold, there is data difference in your two approaches and hence different results.
The low score in cross_val_score is probably because of the fact that you are providing the complete data to it, instead of breaking it into test and training set. This generally leads to leakage of information which results in your model giving incorrect predictions. See this post for more explanation.
References
Learn the right way to validate models

How to predict on a single data sample when preprocssing is needed

When I read scikit learn example, a typical machine learning flow is prepocessing --> learning --> predicting. As the code snippet shown below:
steps = [('scalar', StandardScalar()),
('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
Here, both training and testing dataset are scaled before fitting into the classifier. But in my task, I am going to predict on a single data sample. After training my model, I will get data from a streaming line. So each time, a single new data is received, I need to use the classifier to predict on it, and preceed my task with the predicted value.
So with only one example available each time, how to preprocess it before predicting? Scaling on this single example seems make no sense. How should I deal with such issue?
just as you train your classifier and use the generated model to predict the individual records, preprocessing step generates a preprocessing model as well. Let's say your input is Xi and you fitted the preprocessing and classifier models(scaler and clf respectively) already:
Xi_new=scaler.transform(Xi)
print(clf.predict(Xi_new))

Resources