Source of randomness in GridSearchCV - machine-learning

I was using GridSearchCV to find the best value for the number of neurons in my neural network model.
I am trying to understand why GridSearchCV produces different results (mean_test_score) on each run, even though I have tried several methods to address this.
I believe I am running on CPU (not GPU), judging from the timestamps.
I have set shuffle=False in KFold
cv=KFold(n_splits=3, shuffle=False)
The neural network model (KerasRegressor) is defined in a separate cell that was already run before I run GridSearchCV. When I re-run only the GridSearchCV cell (without re-running the KerasRegressor cell), I still get a different result on every run, so any source of randomness inside KerasRegressor should not matter here (I also call tf.random.set_seed(seed) before defining the model).
I tried including the following code block to set the global seed
import numpy as np
import tensorflow as tf
from numpy.random import seed
seed(1)
tf.random.set_seed(1)  # pass an integer seed here, not the numpy seed function itself
I also included random_state as a key in the param_grid dictionary, but I still get different results.
param_grid = {
    'hidden_size1': [64],
    'hidden_size2': [64],
    'hidden_size3': [64],
    'random_state': [2]
}
The code I am using is
def make_model(optimizer="adam", hidden_size1=32, hidden_size2=32, hidden_size3=32, random_state=2):
    model = Sequential()
    model.add(Dense(hidden_size1, activation=activation, input_shape=(X_train.shape[1],)))
    model.add(Dense(hidden_size2, activation=activation))
    model.add(Dense(hidden_size3, activation=activation))
    model.add(Dense(y_train.shape[1], activation='linear'))
    model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
clf = KerasRegressor(make_model, verbose=1)
param_grid = {
    'hidden_size1': [64],
    'hidden_size2': [64],
    'hidden_size3': [64],
    'random_state': [2]
}
and in a separate code block I have
cv=KFold(n_splits=3, shuffle=False)
grid = GridSearchCV(clf, param_grid=param_grid, cv=cv, return_train_score=True)
grid.fit(X_train, y_train)
I keep re-running this cell, but every time I get a different result.
My question is whether there is another source of randomness in GridSearchCV even when shuffling is disabled in KFold (setting random_state to an integer is incompatible with shuffle=False, so I simply disabled shuffling).

As the commenter said, you are running GridSearchCV on a neural network whose weights are different every time, because they are randomly initialized. The solution is to control the weight initialization yourself, for example by seeding the initializers inside the model-building function.
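As a minimal sketch of that idea (assuming TensorFlow 2.7+ for tf.keras.utils.set_random_seed; the layer sizes and seed values are only illustrative), you could seed each layer's initializer explicitly so every rebuild of the model starts from the same weights:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import GlorotUniform

# Seed Python, NumPy and TensorFlow in one call (available in TF >= 2.7).
tf.keras.utils.set_random_seed(7)

def make_model(hidden_size1=32, n_features=10, n_outputs=1, seed=7):
    # Each layer gets an explicitly seeded initializer, so the starting
    # weights are identical every time make_model() is called.
    model = Sequential([
        Dense(hidden_size1, activation='relu',
              kernel_initializer=GlorotUniform(seed=seed),
              input_shape=(n_features,)),
        Dense(n_outputs, activation='linear',
              kernel_initializer=GlorotUniform(seed=seed + 1)),
    ])
    model.compile(loss='mse', optimizer='adam')
    return model

Because GridSearchCV clones and rebuilds the estimator for every fit, seeding inside the build function (rather than once in a separate cell) is what makes repeated runs comparable; training itself (e.g. GPU ops or shuffled batches) can still introduce some nondeterminism.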

Related

Accuracy score in K-nearest Neighbour Classifier not matching with GridSearchCV

I'm learning Machine Learning and I'm facing a mismatch I can't explain.
I have a grid to compute the best model, according to the accuracy returned by GridSearchCV.
model=sklearn.neighbors.KNeighborsClassifier()
n_neighbors=[3, 4, 5, 6, 7, 8, 9]
weights=['uniform','distance']
algorithm=['auto','ball_tree','kd_tree','brute']
leaf_size=[20,30,40,50]
p=[1]
param_grid = dict(n_neighbors=n_neighbors, weights=weights, algorithm=algorithm, leaf_size=leaf_size, p=p)
grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=param_grid, cv = 5, n_jobs=1)
SGDgrid = grid.fit(data1, targetd_simp['VALUES'])
print("SGD Classifier: ")
print("Best: ")
print(SGDgrid.best_score_)
value=SGDgrid.best_score_
print("params:")
print(SGDgrid.best_params_)
print("Best estimator:")
print(SGDgrid.best_estimator_)
y_pred_train=SGDgrid.best_estimator_.predict(data1)
print(sklearn.metrics.confusion_matrix(targetd_simp['VALUES'],y_pred_train))
print(sklearn.metrics.accuracy_score(targetd_simp['VALUES'],y_pred_train))
The results I get are the following:
SGD Classifier:
Best:
0.38694539229180525
params:
{'algorithm': 'auto', 'leaf_size': 20, 'n_neighbors': 8, 'p': 1, 'weights': 'distance'}
Best estimator:
KNeighborsClassifier(leaf_size=20, n_neighbors=8, p=1, weights='distance')
[[4962 0 0]
[ 0 4802 0]
[ 0 0 4853]]
1.0
Probably this model is highly overfitted. I still need to check that, but it's not the point of the question here.
So, basically, if I understand correctly, GridSearchCV is finding a best accuracy score of 0.3869 (quite poor) during cross-validation, but the final confusion matrix is perfect, as is the accuracy computed from it. It doesn't make much sense to me. How can a model that is, in theory, bad perform so well?
I also added scoring = 'accuracy' in GridSearchCV to be sure that the returned value is actually accuracy, and it returns exactly the same value.
What am I missing here?
The behavior you are describing is rather normal and to be expected. You should know that GridSearchCV has a parameter refit which is by default set to true. It triggers the following:
Refit an estimator using the best found parameters on the whole dataset.
This means that the estimator returned by best_estimator_ has been refit on your whole dataset (data1 in your case). It is therefore data that the estimator has already seen during training and, expectedly, performs especially well on it. You can easily reproduce this with the following example:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': [3, 4, 5]})
search.fit(X_train, y_train)
print(search.best_score_)
>>> 0.8533333333333333
print(accuracy_score(y_train, search.predict(X_train)))
>>> 0.9066666666666666
While this is not as impressive as in your case, it is still a clear result. During cross-validation, the model is validated against one fold that was not used for training the model, and thus, against data the model has not seen before. In the second case, however, the model already saw all data during training and it is to be expected that the model will perform better on them.
To get a better feeling of the true model performance, you should use a holdout set with data the model has not seen before:
print(accuracy_score(y_test, search.predict(X_test)))
>>> 0.76
As you can see, the model performs considerably worse on this data and shows us that the former metrics were all a bit too optimistic. The model did in fact not generalize that well.
In conclusion, your result is not surprising and has an easy explanation. The high discrepancy in scores is impressive but still follows the same logic and is actually just a clear indicator of overfitting.

Why is Scikit-learn RFECV returning very different features for the training dataset?

I have been experimenting with RFECV on the Boston dataset.
My understanding, thus far, is that to prevent data-leakage, it is important to perform activities such as this, only on the training data and not the whole dataset.
I performed RFECV on just the training data, and it indicated that 13 of the 14 features are optimal. However, I then ran the same process on the whole dataset, and this time around, it indicated that only 6 of the features are optimal - which seems more likely.
To illustrate:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
### CONSTANTS
TARGET_COLUMN = 'Price'
TEST_SIZE = 0.1
RANDOM_STATE = 0
### LOAD THE DATA AND ASSIGN TO X and y
data_dict = load_boston()
data = data_dict.data
features = list(data_dict.feature_names)
target = data_dict.target
df = pd.DataFrame(data=data, columns=features)
df[TARGET_COLUMN] = target
X = df[features]
y = df[TARGET_COLUMN]
### PERFORM TRAIN TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
                                                    random_state=RANDOM_STATE)
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
X_input = X_train
y_input = y_train
## All the data
# X_input = X
# y_input = y
### IMPLEMENT RFECV AND GET RESULTS
rfecv = RFECV(estimator=LinearRegression(), step=1, scoring='neg_mean_squared_error')
rfecv.fit(X_input, y_input)
rfecv.transform(X_input)
print(f'Optimal number of features: {rfecv.n_features_}')
imp_feats = X.drop(X.columns[np.where(rfecv.support_ == False)[0]], axis=1)
print('Important features:', list(imp_feats.columns))
Running the above will result in:
Optimal number of features: 13
Important features: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
Now, if I change the code so that RFECV fits on all the data:
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
# X_input = X_train # NOW COMMENTED OUT
# y_input = y_train # NOW COMMENTED OUT
## All the data
X_input = X # NOW UN-COMMENTED
y_input = y # NOW UN-COMMENTED
and run it, I get the following result:
Optimal number of features: 6
Important features: ['CHAS', 'NOX', 'RM', 'DIS', 'PTRATIO', 'LSTAT']
I don't understand why the results are so markedly different (and seemingly more accurate) for the whole dataset as opposed to just the training set.
I have tried making the training set close to the size of the whole data, by making the test_size extremely small (via my TEST_SIZE constant), but I still get this seemingly unlikely difference.
What am I missing?
It certainly seems like unexpected behavior, especially when, as you say, you can reduce the test size to 10% or even 5% and still find a similar disparity, which seems very counter-intuitive. The key to understanding what's going on here is to realize that for this particular dataset the values in each column are not randomly distributed across the rows (for example, try running X['CRIM'].plot()). The train_test_split function you're using to split the data has a parameter shuffle which defaults to True. So if you look at the X_train dataset you'll see that the index is jumbled up, whereas in X it is sequential. This means that when the cross-validation is performed under the hood by the RFECV class, it gets a biased subset of data in each split of X, but a more representative/random subset of data in each split of X_train. If you pass shuffle=False to train_test_split you'll see that the two results are much closer (or alternatively, and probably better, try shuffling the index of X).
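As a rough sketch of those two checks (reusing X, y, TEST_SIZE and RANDOM_STATE from the question; the exact calls are just one way to do it):

from sklearn.model_selection import train_test_split

# Check 1: split without shuffling, so X_train keeps the same ordered row
# structure as X -- the two RFECV results should now be much closer.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, shuffle=False)

# Check 2 (probably the better option): shuffle the full dataset once, so the
# cross-validation inside RFECV sees representative subsets even on all of X.
X_shuffled = X.sample(frac=1, random_state=RANDOM_STATE)
y_shuffled = y.loc[X_shuffled.index]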

Fitting method in scikit-learn's Pipeline class

Let's say I have this Python code:
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.decomposition import PCA
from sklearn import neighbors
selector = VarianceThreshold()
scaler = StandardScaler()
ros = RandomOverSampler()
pca = PCA()
clf = neighbors.KNeighborsClassifier(n_jobs=-1)
pipe = Pipeline(steps=[('selector', selector), ('scaler', scaler), ('sampler', ros), ('pca', pca), ('kNN', clf)])
pipe.fit(X_train,y_train)
preds = pipe.predict(X_test)
This imports four transformers and an estimator from scikit-learn, fits them on the data, and finally predicts. If I understand correctly, the fit method applies the four transformers to the data and the predict method makes the final estimation (in our case using kNN). My question is this: for the scaler as well as PCA, the transformations learned on the training data must also be applied to the test data. But we don't pass the test data to fit, so the test data won't be altered. How does that make sense? Is there something I am missing?
The model only learns its parameters from the training data; it assumes the test data follows similar patterns and transforms it accordingly. You cannot have test data that is completely different from the training data and expect good predictions, hence the same fitted PCA and scaler are also applied to the test dataset (the pipeline calls their transform methods, not fit, when you call predict). If you fit a separate scaler on a smaller test dataset, the results might come out completely different from what the model was originally trained on.
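To make the mechanics concrete, here is roughly what the pipeline does under the hood, as a simplified sketch that leaves out the selector and the sampler (the RandomOverSampler step from imblearn is only applied during fit, never at prediction time; X_train, y_train and X_test are assumed to exist as in the question):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

scaler = StandardScaler()
pca = PCA()
knn = KNeighborsClassifier()

# pipe.fit(): learn the transformation parameters on the training data only
X_train_scaled = scaler.fit_transform(X_train)
X_train_pca = pca.fit_transform(X_train_scaled)
knn.fit(X_train_pca, y_train)

# pipe.predict(): reuse the already-fitted transformers on the test data
X_test_scaled = scaler.transform(X_test)   # no fitting here
X_test_pca = pca.transform(X_test_scaled)  # no fitting here
preds = knn.predict(X_test_pca)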

Build a Random Forest regressor with Cross Validation from scratch

I know this is a very classical question which has probably been answered many times in this forum; however, I could not find any clear answer which explains this from scratch.
Firstly, imagine that my dataset called my_data has 4 variables such as
my_data = variable1, variable2, variable3, target_variable
So, let's come to my problem. I'll explain all my steps and ask for your help where I've been stuck:
# STEP1 : split my_data into [predictors] and [targets]
predictors = my_data[[
    'variable1',
    'variable2',
    'variable3'
]]
targets = my_data.target_variable
# STEP2 : import the required libraries
from sklearn import cross_validation
from sklearn.ensemble import RandomForestRegressor
# STEP3 : define a simple Random Forest model's attributes
model = RandomForestClassifier(n_estimators=100)
#STEP4 : Simple K-Fold cross validation. 3 folds.
cv = cross_validation.KFold(len(my_data), n_folds=3, random_state=30)
# STEP 5
At this step, I want to fit my model based on the training dataset, and then
use that model on test dataset and predict test targets. I also want to calculate the required statistics such as MSE, r2 etc. for understanding the performance of my model.
I'd appreciate it if someone could help me with some basic code for Step 5.
First off, you are using the deprecated cross_validation module of the scikit-learn library. The new module is named model_selection, so I am using that in this answer.
Second, you are importing RandomForestRegressor but defining a RandomForestClassifier in the code. I am using RandomForestRegressor here, because the metrics you want (MSE, R2, etc.) are only defined for regression problems, not classification.
There are multiple ways to do what you want. I assume that since you are trying to use the KFold cross-validation here, you want to use the left-out data of each fold as test fold. To accomplish this, we can do:
predictors = my_data[[
    'variable1',
    'variable2',
    'variable3'
]]
targets = my_data.target_variable

from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

model = RandomForestRegressor(n_estimators=100)
cv = model_selection.KFold(n_splits=3)

for train_index, test_index in cv.split(predictors):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = predictors.iloc[train_index], predictors.iloc[test_index]
    y_train, y_test = targets.iloc[train_index], targets.iloc[test_index]

    # For training, fit() is used
    model.fit(X_train, y_train)

    # Default metric is R2 for regression, which can be accessed by score()
    print(model.score(X_test, y_test))

    # For other metrics, we need the predictions of the model
    y_pred = model.predict(X_test)
    print(metrics.mean_squared_error(y_test, y_pred))
    print(metrics.r2_score(y_test, y_pred))
For all of this, the documentation is your best friend, and the scikit-learn documentation is one of the best I have ever seen. The following links may help you learn more:
http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance
http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
http://scikit-learn.org/stable/user_guide.html
Note that the loop must iterate over the splits produced by the cv object defined above, i.e. cv.split(predictors); iterating over any other, undefined name (such as kf) would raise a NameError.
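If you only need the fold scores and not the explicit loop, a more compact alternative (a sketch, assuming the same predictors and targets as above) is scikit-learn's cross_validate helper:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

model = RandomForestRegressor(n_estimators=100)
cv = KFold(n_splits=3)

# Compute several regression metrics on each held-out fold in one call.
results = cross_validate(model, predictors, targets, cv=cv,
                         scoring=('r2', 'neg_mean_squared_error'))
print(results['test_r2'])
print(results['test_neg_mean_squared_error'])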

SciKit Learn feature selection and cross validation using RFECV

I am still very new to machine learning and trying to figure things out myself. I am using SciKit learn and have a data set of tweets with around 20,000 features (n_features=20,000). So far I achieved a precision, recall and f1 score of around 79%. I would like to use RFECV for feature selection and improve the performance of my model. I have read the SciKit learn documentation but am still a bit confused on how to use RFECV.
This is the code I have so far:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)
# Create the RFECV object
nb = MultinomialNB(alpha=0.5)
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=nb, step=1, cv=2, scoring='accuracy')
rfecv.fit(X_tfidf, y_train)
X_rfecv=rfecv.transform(X_tfidf)
print("Optimal number of features : %d" % rfecv.n_features_)
# train classifier
clf = MultinomialNB(alpha=0.5).fit(X_rfecv, y_train)
# test clf on test data
X_test_CV = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_CV)
X_test_rfecv = rfecv.transform(X_test_tfidf)
y_predicted = clf.predict(X_test_rfecv)
#print the mean accuracy on the given test data and labels
print ("Classifier score is: %s " % rfecv.score(X_test_rfecv,y_test))
Three questions:
1) Is this the correct way to use cross validation and RFECV? I am especially interested to know if I am running any risk of overfitting.
2) The accuracy of my model before and after I implemented RFECV with the above code are almost the same (around 78-79%), which puzzles me. I would expect performance to improve by using RFECV. Anything I might have missed here or could do differently to improve the performance of my model?
3) What other feature selection methods would you recommend I try? I have tried RFE and SelectKBest so far, but neither has given me any improvement in model accuracy.
To answer your questions:
There is cross-validation built into the RFECV feature selection (hence the name), so you don't really need additional cross-validation for this single step. However, since I understand you are running several tests, it's good to have an overall cross-validation to ensure you're not overfitting to a specific train-test split. I'd like to mention 2 points here:
I doubt the code behaves exactly like you think it does ;).
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
Here we first go through the loop, that has 5 iterations (n_iter parameter in StratifiedShuffleSplit). Then we go out of the loop and we just run all your code with the last values of train_index, test_index. So this is equivalent to a single train-test split where you probably meant to have 5. You should move your code back into the loop if you want it to run like a 'proper' cross validation.
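Concretely, a restructured version might look roughly like this (a sketch that keeps the variable names from your code but leaves out the RFECV step for brevity; the averaged accuracy at the end is just illustrative):

scores = []
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Feature extraction is fitted on the training fold only
    count_vect = CountVectorizer(stop_words='english', min_df=3,
                                 max_df=0.90, ngram_range=(1, 3))
    tfidf = TfidfTransformer()
    X_tfidf = tfidf.fit_transform(count_vect.fit_transform(docs_train))

    # Fit the classifier on this fold's training data
    clf = MultinomialNB(alpha=0.5).fit(X_tfidf, y_train)

    # Transform (not fit) the test fold and evaluate
    X_test_tfidf = tfidf.transform(count_vect.transform(docs_test))
    scores.append(clf.score(X_test_tfidf, y_test))

print(sum(scores) / len(scores))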
You are worried about overfitting: indeed, when 'looking for the best method', there is a risk that we pick the method that works best... only on the small sample we're testing it on.
Here the best practice is to have a first train-test split, then to perform cross-validation only using the train set. The test set can be used 'sparingly' when you think you found something, to make sure the scores you get are consistent and you're not overfitting.
It may look like you're throwing away 30% of your data (your test set), but it's absolutely worth it.
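In code, that workflow is roughly the following sketch (the 30% hold-out fraction is just an example, and pipeline refers to an object like the one defined further down in this answer):

from sklearn.model_selection import train_test_split, cross_val_score

# One initial split; the hold-out set is kept aside until the very end.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# All experiments (feature selection, hyperparameters, ...) are cross-validated
# on the development set only.
scores = cross_val_score(pipeline, X_dev, y_dev, cv=5, scoring='accuracy')
print(scores.mean())

# Once a final configuration is chosen, fit on the development set and score
# the hold-out set once.
pipeline.fit(X_dev, y_dev)
print(pipeline.score(X_holdout, y_holdout))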
It can be puzzling to see that feature selection does not have that big an impact. To introspect a bit more, you could look into the evolution of the score with the number of selected features (see the example from the docs).
That being said, I don't think this is the right use case for RFE. Basically with your code you are eliminating features one by one, which probably takes a long time to run and does not make so much sense when you have 20000 features.
Other feature selection methods: here you mention SelectKBest but you don't tell us which method you use to score your features! SelectKBest will pick the K best features according to a score function. I'm guessing you were using the default which is ok, but it's better to have an idea of what the default does ;).
I would try SelectPercentile with chi2 as a score function. SelectPercentile is probably a bit more convenient than SelectKBest because if your dataset grows a percentage probably makes more sense than a hardcoded number of features.
Another example from the docs that does just that (and more).
Additional remarks:
You could use a TfidfVectorizer instead of a CountVectorizer followed by a TfidfTransformer. This is strictly equivalent.
You could use a pipeline object to pack the different steps of your classifier into a single object you can run cross validation on (I encourage you to read the docs, it's pretty useful).
from sklearn.feature_selection import chi2, SelectPercentile
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

pipeline = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1, 3))),
    ("selector", SelectPercentile(score_func=chi2, percentile=70)),
    ('NB', MultinomialNB(alpha=0.5))
])
Then you'd be able to run cross validation on the pipeline object to find the best combination of alpha and percentile, which is much harder to do with separate estimators.
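For instance, a sketch of such a search (the parameter values here are only examples) could look like:

from sklearn.model_selection import GridSearchCV

# The step names from the pipeline above become parameter prefixes.
param_grid = {
    'selector__percentile': [50, 70, 90],
    'NB__alpha': [0.1, 0.5, 1.0],
}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring='accuracy')
search.fit(docs_train, y_train)
print(search.best_params_, search.best_score_)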
Hope this helps, happy learning ;).
