Can we apply cross-validation twice on the same dataset? - machine-learning

First, we split the dataset using the stratify parameter:
train_test_split(np.array(X), y, train_size=TRAIN_SIZE, stratify=y, random_state=42)
and then apply KFold cross-validation:
kfold = KFold(n_splits=num_folds, shuffle=True)
fold_no = 1
for train, test in kfold.split(inputs, targets):
    ...
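Put together, the intended workflow would look roughly like the sketch below (assuming X, y, TRAIN_SIZE and num_folds are defined elsewhere and the data are NumPy arrays; inputs and targets above refer to the training portion produced by the split):
# Sketch of the described setup: a stratified hold-out split followed by
# K-fold cross-validation on the training portion only.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), y, train_size=TRAIN_SIZE, stratify=y, random_state=42
)
y_train = np.array(y_train)  # ensure positional indexing works below

kfold = KFold(n_splits=num_folds, shuffle=True)
for fold_no, (train_idx, val_idx) in enumerate(kfold.split(X_train, y_train), start=1):
    X_fold_train, X_fold_val = X_train[train_idx], X_train[val_idx]
    y_fold_train, y_fold_val = y_train[train_idx], y_train[val_idx]
    # fit and evaluate a model on this fold here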
Does this code have any drawbacks?

Related

Where should categorical encoding be done in a k-fold CV procedure?

I want to apply a cross-validation method to my machine learning models. In these models, I want feature selection and a grid search to be applied as well. Imagine that I want to estimate the performance of a K-Nearest-Neighbor classifier by applying a feature selection technique based on an F-score (ANOVA) that chooses the 10 most relevant features. The code would be as follows:
import sys

import numpy as np
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, RepeatedKFold, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 10-times 10-fold cross validation
n_repeats = 10
rkf = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)
# Data standardization
scaler = StandardScaler()
# Variable to contain error measures and counter for the splits
error_knn = []
split = 0
for train_index, test_index in rkf.split(X, y):
    # Print a dot for each train / test partition
    sys.stdout.write('.')
    sys.stdout.flush()
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Standardize the data
    scaler.fit(X_train, y_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    ###- In order to select the best number of neighbors -###
    # Pipeline for training the classifier from previous notebooks
    pipeline = Pipeline([('knn', KNeighborsClassifier())])
    N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
    param_grid = {'knn__n_neighbors': N_neighbors}
    # Evaluate the performance in a 5-fold cross-validation
    skfold = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=split)
    # n_jobs=-1 to use all processors
    gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1, param_grid=param_grid,
                          scoring=make_scorer(accuracy_score))
    result = gridcv.fit(X_train, y_train)
    ###- Results -###
    # Mean accuracy and standard deviation
    accuracies = gridcv.cv_results_['mean_test_score']
    std_accuracies = gridcv.cv_results_['std_test_score']
    # Best value for the number of neighbors
    # Define KNeighbors Classifier with that best value
    # Method fit(X, y) to fit each model according to training data
    best_Nneighbors = N_neighbors[np.argmax(accuracies)]
    knn = KNeighborsClassifier(n_neighbors=best_Nneighbors)
    knn.fit(X_train, y_train)
    # Error for the prediction
    error_knn.append(1.0 - np.mean(knn.predict(X_test) == y_test))
    split += 1
However, my columns are categorical (except the binary label) and I need to apply categorical encoding. I cannot remove these columns because they are essential.
Where would you perform this encoding, and how would you solve the problem of categorical encoding of unseen labels in each fold?
Categorical encoding should be performed as the first step, precisely to avoid the problem you mentioned regarding unseen labels in each fold.
Additionally, your current implementation suffers from data leakage.
You're performing feature scaling on the full X_train dataset before performing your inner cross-validation.
This can be solved by including the StandardScaler in the pipeline used for your GridSearchCV:
...
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
###- In order to select the best number of neighbors -###
# Pipeline for training the classifier from previous notebooks
pipeline = Pipeline(
    [('scaler', scaler), ('knn', KNeighborsClassifier())]
)
N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
param_grid = {'knn__n_neighbors': N_neighbors}
...
Another couple of tips:
GridSearchCV has a best_estimator_ attribute that can be used to extract the estimator with the best set of hyperparameters found.
When using GridSearchCV with refit=True (the default), you can use the object directly to perform predictions, e.g. gridcv.predict(X_test).
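For instance, inside your outer loop the last block could be reduced to something like this (a sketch reusing the names from your code, not a drop-in replacement):
# Sketch reusing names from the code above (gridcv, X_train, X_test, y_test, error_knn).
result = gridcv.fit(X_train, y_train)

# The estimator refitted on all of X_train with the best hyperparameters found
best_knn = gridcv.best_estimator_

# With refit=True (the default), the grid-search object predicts directly
# with that refitted estimator
y_pred = gridcv.predict(X_test)
error_knn.append(1.0 - np.mean(y_pred == y_test))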
EDIT: Perhaps I was too general regarding when to perform categorical encoding. Your approach should depend on your problem/dataset.
If you know beforehand how many categorical features exist and you want to train your inner CV classifiers with this knowledge, you should perform categorical encoding as the first step.
If at training time you do not know how many categorical features you are going to see, or you want to train your CV classifiers without knowledge of the full range of categorical features, you should perform categorical encoding within each fold.
With the former, your classifiers will all be trained on the same feature space; that is not guaranteed with the latter.
If using the latter, the above pipeline can be extended to incorporate categorical encoding:
pipeline = Pipeline(
    [
        ('enc', OneHotEncoder()),
        ('scaler', StandardScaler(with_mean=False)),
        ('knn', KNeighborsClassifier()),
    ],
)
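If unseen categories inside a validation fold are still a concern, one option (a suggestion of mine, not something from your original code) is to let the encoder ignore them via handle_unknown='ignore'. A sketch:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
# instead of raising an error (my suggestion, not part of the original code).
pipeline = Pipeline([
    ('enc', OneHotEncoder(handle_unknown='ignore')),
    ('scaler', StandardScaler(with_mean=False)),  # with_mean=False keeps the sparse output sparse
    ('knn', KNeighborsClassifier()),
])

# This pipeline plugs into the existing grid search unchanged, e.g.
# gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1, param_grid=param_grid,
#                       scoring=make_scorer(accuracy_score))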
I suggest you read the Encoding categorical features section of scikit-learn's User Guide carefully.

How to compare baseline and GridSearchCV results fairly?

I am a bit confused about how to compare the best GridSearchCV model with a baseline.
For example, suppose we have a classification problem.
As a baseline, we'll fit a model with default settings (let it be logistic regression):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))
So, the baseline gives us accuracy using the whole train sample.
Next, GridSearchCV:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, train_test_split
X_val, X_test_val,y_val,y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)
best_model.fit(X_val, y_val)
print(best_model.best_score_)
Here, we have accuracy based on validation sample.
My questions are:
Are those accuracy scores comparable? Generally, is it fair to compare GridSearchCV with a model trained without any cross-validation?
For the baseline, isn't it better to use the validation sample too (instead of the whole train sample)?
No, they aren't comparable.
Your baseline model used X_train to fit the model, and then you score the fitted model on that same X_train sample. This is like cheating: the model will look artificially good because it is being evaluated on data it has already seen.
The grid searched model is at a disadvantage because:
It's working with less data since you have split the X_train sample.
Compound that with the fact that it's getting trained with even less data due to the 5 folds (it's training with only 4/5 of X_val per fold).
So your score for the grid search is going to be worse than your baseline.
Now you might ask, "so what's the point of best_model.best_score_?" Well, that score is used to compare all the models tried when searching for the optimal hyperparameters in your search space, but it should in no way be used to compare against a model that was trained outside of the grid-search context.
So how should one go about conducting a fair comparison?
Split your training data for both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit your models using X_train.
# fit baseline
baseline.fit(X_train, y_train)
# fit using grid search
best_model.fit(X_train, y_train)
Evaluate models against X_test.
# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))
# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))
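Putting the three steps together, a minimal sketch could look like this (the parameter grid here is only an illustrative assumption; substitute your own):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# One shared split for both models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline with default settings
baseline = LogisticRegression()
baseline.fit(X_train, y_train)

# Grid search with an illustrative parameter grid (replace with your own)
parameters = {'C': [0.01, 0.1, 1, 10]}
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)
best_model.fit(X_train, y_train)

# Both models are evaluated on the same held-out X_test
print('baseline accuracy:   ', accuracy_score(y_test, baseline.predict(X_test)))
print('grid search accuracy:', accuracy_score(y_test, best_model.predict(X_test)))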

Model validation with separate dataset

Apologies, quite new to sklearn. I'm trying to validate a model using an external dataset for binary classification of text strings. I've trained the model but want to use it against another dataset of a different size for prediction rather than include the data in the initial dataset split. Is this even possible?
Initial split
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))
Xprod = vectorizer.fit_transform(prod_good)
X = vectorizer.fit_transform(total_requests)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=21)
Test the model
linear_svm=LinearSVC(C=1)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
score_test = metrics.accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
New prediction
newpred = linear_svm.predict(Xprod)
...
Error:
ValueError: X has 4553 features per sample; expecting 24422
Think I'm misunderstanding some basic concepts here
The function fit_transform performs a fit and then a transform. So this line fits your vectorizer and then transforms total_requests into X:
X = vectorizer.fit_transform(total_requests)
Since your vectorizer must be fitted only once (so that the same feature space is produced every time you use it), you only need transform to compute Xprod:
Xprod = vectorizer.transform(prod_good)
Also, Xprod must be computed after the vectorizer has been fitted, so compute Xprod after X.
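In other words, the corrected ordering would look roughly like this (a sketch reusing total_requests and prod_good from your code):
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))

# Fit the vectorizer once, on the training corpus, and transform it
X = vectorizer.fit_transform(total_requests)

# Reuse the already-fitted vectorizer on the external dataset so that
# Xprod lives in the same feature space as X
Xprod = vectorizer.transform(prod_good)

# X is then split and used to train the model exactly as before, and
# linear_svm.predict(Xprod) no longer raises the feature-count error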

How to Build a Decision tree Regressor model

I am learning ML and was doing a simple hands-on exercise, described as follows:
//
Split boston.data into two sets named x_train and x_test. Also, split boston.target into two sets y_train and y_test.
Build a Decision tree Regressor model from the x_train set, with default parameters.
//
I did following code for this:
from sklearn import datasets, model_selection, tree
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = model_selection.train_test_split(boston.data,boston.target, random_state=30)
dt = tree.DecisionTreeRegressor()
dt_reg = dt.fit(x_train)
When I am doing above, it's giving:
TypeError: fit() missing 1 required positional argument: 'y'
Can I fit a model using only the training features?
What should I pass here as 'y'?
As the error states, the fit() method takes two parameters for a regression problem, the predictors and the outcome:
dt_reg = dt.fit(x_train, y_train)
Supervised learning models such as the regression tree you are using require a set of observations composed of features (each row of x_train can be understood as a vector containing the features for one observation) and a target outcome (each element of the vector y_train).
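For completeness, a minimal sketch of the whole exercise could look like this (note that load_boston is only available in older scikit-learn versions; it was removed in 1.2):
from sklearn import datasets, model_selection, tree

# load_boston only exists in older scikit-learn versions (removed in 1.2)
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    boston.data, boston.target, random_state=30
)

dt = tree.DecisionTreeRegressor()
dt_reg = dt.fit(x_train, y_train)  # features and target passed together

# R^2 score on the held-out set
print(dt_reg.score(x_test, y_test))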

How to apply cross validation on data?

I want to evaluate an ML model using the average cross-validation score.
I am splitting the data into a train and a test set.
But I don't know whether I should use the train or the test data to evaluate the model with the cross-validation score.
Here is a part of my code:
train, test = train_test_split(basic_df, test_size=0.3, random_state=42)
# Separate the labels from the features and convert features & labels to numpy arrays
x_train=train.drop('successful',axis=1)
y_train=train['successful']
x_test=test.drop('successful',axis=1)
y_test=test['successful']
model = RandomForestClassifier()
model_random = RandomizedSearchCV(estimator = model, param_distributions = random_grid, n_iter = 100, cv = 5, verbose=2, random_state=42, n_jobs = -1)
model_random.fit(x_train, y_train)
print('Accuracy score: ', model_random.score(x_test,y_test))
print('Average Cross-Val-Score: ', np.mean(cross_val_score(model_random, x_train, y_train, cv=5))) # 5-Fold Cross validation
Y_predicted = model_random.predict(x_test.values)
print('f1_score (macro): ', f1_score(y_test, Y_predicted, average='macro'))
The main question is on the following code line:
print('Average Cross-Val-Score: ', np.mean(cross_val_score(model_random, x_train, y_train, cv=5))) # 5-Fold Cross validation
Is it right or should I use the test set there like this:
print('Average Cross-Val-Score: ', np.mean(cross_val_score(model_random, x_test, y_test, cv=5))) # 5-Fold Cross validation
You don't have to fit again to know your model's performance on the training data; you can get it with the following command:
import pandas as pd
pd.DataFrame(model_random.cv_results_)
Look at the mean_test_score column. Remember, this is the performance on the held-out fold of the cross-validation. It will give you an idea of how well the model performed for each hyperparameter combination tried by RandomizedSearchCV. The best hyperparameter combination and the corresponding model can be extracted using
model_random.best_params_
model_random.best_estimator_
Coming to your actual test data, people usually don't use cross-validation there.
Just make a prediction there, as you already do in this part. In the background, it uses model_random.best_estimator_ to do the prediction.
Y_predicted = model_random.predict(x_test.values)
print('f1_score (macro): ', f1_score(y_test, Y_predicted, average='macro'))
Look at this documentation for more explanation.
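Putting this together, a sketch of the suggested flow (reusing the names from your code and dropping the extra cross_val_score call):
import pandas as pd
from sklearn.metrics import f1_score

# The hyperparameter search already performs 5-fold CV on the training data
model_random.fit(x_train, y_train)

# Cross-validated performance for every hyperparameter combination tried
cv_results = pd.DataFrame(model_random.cv_results_)
print(cv_results[['params', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())
print('Best params: ', model_random.best_params_)

# A single, final evaluation on the untouched test set
Y_predicted = model_random.predict(x_test)
print('f1_score (macro): ', f1_score(y_test, Y_predicted, average='macro'))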
