Which X and y should I fit in GridSearchCV: the ones I used to train the model, or the ones I used to test it?
I found out that using the wrong one resulted in a bad accuracy score for the model.
Always perform GridSearchCV on your train set, i.e. on X_train and y_train.
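For example, a minimal sketch (the estimator, param_grid and 80/20 split here are placeholders, not taken from the question):

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression

# split first, then search only on the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"C": [0.01, 0.1, 1, 10]}  # hypothetical grid
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)            # the grid search sees only the train split

print(search.best_params_)
print(search.score(X_test, y_test))     # the test split is used once, at the very end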
I have a problem with xgboost predictions.
I have trained an xgboost model for my regression problem in Python, but when the max_depth parameter is set to something other than its default value, some predictions change if I predict again with the same model.
So far I have tried changing basic parameters like learning_rate, reg_lambda and so on, but only max_depth causes randomness in the predictions for the same data.
I'm taking a course in statistical learning / ML and am currently doing a project that includes a classification task, and I have some newbie questions regarding the random_state parameter. The accuracy of my model changes heavily depending on the random_state. I'm currently working with logistic regression (sklearn.linear_model.LogisticRegression()). I try to tune the hyperparameters using the GridSearchCV method.
The problem:
I get different prediction accuracy, depending on which random_state I'm using.
What I have tried:
I have tried to set the random_state as a global state (using np.random.seed(randomState) and setting randomState to an integer at the top of the script). I then split the data using
train_test_split(X, y, test_size=0.2, random_state=randomState)
with the same (global) integer randomState. Next, I perform GridSearchCV to tune the hyperparameters: I specify a param_grid, run GridSearchCV over it, find the best estimator and choose it as my model. Then I use the model for prediction and print a classification report of the results. I take the average over 10 runs by changing randomState.
Example: I do this procedure with randomState=1 and find the best model from GridSearchCV: model_1. I get an accuracy of 84%. If I change to randomState = 2, ..., 10 and still use model_1, the average accuracy becomes 80.5%.
I do the same procedure with randomState=42 and find the best model from GridSearchCV: model_42. I get an accuracy of 77%. If I change to randomState = 41, 40, 39, ..., 32 and still use model_42, the average accuracy becomes 78.7%.
I'm very confused why the accuracy varies so much depending on random_state.
Changing random_state gives you different accuracies. The random state controls how the dataset is randomly split into train and test sets, rather than splitting it by ascending index. Depending on the split, the test set may contain points that are unlike anything in the train set, which can lead to poor accuracy. The best way to deal with this is cross-validation: randomly split the data into train and test folds, fit the model, and repeat this n times, where n is the number of splits (usually n = 5). Then take the mean of all accuracies and report that as the final result. Instead of changing random_state every time, you can perform a cross-validation split.
You can find a reference in the link below:
https://machinelearningmastery.com/k-fold-cross-validation/
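For example, a minimal cross-validation sketch with scikit-learn (the logistic-regression estimator and cv=5 are assumptions, not from the question):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# each of the 5 folds serves as the test set once; the mean is the final estimate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())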
I'm using a random forest for binary classification, with test size 0.3 and 5-fold CV. For both train and test, precision and recall are over 99%. Am I over-fitting?
If you have done a 70/30 train/test split, run 5-fold CV only on the train set, and after that got 99% precision and recall on the TEST set, you have covered all the steps.
What you may still want to validate is the class distribution in your train and test splits:
take the mean of y_train and y_test and check that you get comparable numbers.
Also check that samples in the train and test sets are different, and possibly try to run the model on some new real-world samples.
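For example, a quick check, assuming a binary 0/1 target (the stratify option shown here is an extra suggestion that keeps the class proportions similar in both splits):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# for a 0/1 target, the mean is the fraction of positives in each split
print(y_train.mean(), y_test.mean())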
"Train/test split does have its dangers — what if the split we make isn’t random? What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women or only people at a certain age? (imagine a file ordered by one of these). This will result in overfitting, even though we’re trying to avoid it! This is where cross validation comes in." The above is most of the blogs mentioned about which I don't understand that. I think the disadvantages is not overfitting but underfitting. When we split the data , assume State A and B become the training dataset and try to predict the State C which is completely different than the training data that will lead to underfitting. Can someone fill me in why most of the blogs state 'test-split' lead to overfitting.
It would be more correct to talk about selection bias, which is what your question describes.
Selection bias is not really tied to overfitting, but to fitting a biased set, so the model will be unable to generalize/predict correctly.
In other words, whether "fitting" or "overfitting" applies to a biased train set, the result is still wrong.
Arguing over the "over" prefix is just semantics; either way, the fit is biased.
Imagine you have no selection bias. In that case, when you overfit even a healthy set, by definition of overfitting you will still make the model biased towards your train set.
Here, your starting train set is already biased, so any fitting, even "correct" fitting, will be biased, just like it happens with overfitting.
In fact, the train/test split does have some randomness. See below with scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
Here, to build some initial intuition, you can change the random_state value to different integers and train the model multiple times to see whether you get comparable test accuracies in each run. If the dataset is small (on the order of hundreds of samples), the test accuracies may differ significantly. But when you have a larger dataset (on the order of tens of thousands), the test accuracies become more or less similar, as the train set will include at least some examples from every part of the data.
Of course, cross-validation is performed to minimize the effect of overfitting and to make the results more generalizable. But with very large datasets it can be expensive to do cross-validation.
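A rough sketch of that intuition check (the logistic-regression model and the X, y placeholders are assumptions, not from the question):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# refit on splits made with different seeds and watch how much the test accuracy moves
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
    print(seed, acc)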
The "train_test_split" function will not necessarily be biased if you do it only once on a data set. What I mean is that by selecting a value for "random_state" feature of the function, you can make different groups of train and test data sets.
Imagine you have a data set, and after applying the train_test_split and training your model, you get low accuracy score on your test data.
If you alter the random_state value and retrain your model, you will get a different accuracy score on your data set.
Consequently, you can essentially be tempted to find the best value for random_state feature to train your model in a way that will have best accuracy. Well, guess what?, you have just introduced bias to your model. So you have found a train set which could train your model in such way that would work the best on the test set.
However, when we use something such as KFold cross Validation, we break down the data set into five or ten (depending on size) groups of train and test data set. Every time we train the model, we can see a different score. The average of all the scores will probably be something more realistic for the model, when trained on the whole data set. It would look like something like this:
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# 5 shuffled folds with a fixed seed so the split is reproducible
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
R_2 = []
for train_index, test_index in kfold.split(X):
    # positional indexing keeps the fold indices aligned with the rows
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = LinearRegression().fit(X_train, y_train)
    r2 = metrics.r2_score(y_test, model.predict(X_test))
    R_2.append(r2)
R_2mean = np.mean(R_2)
I am using a pre-trained GoogLeNet and have fine-tuned it on my dataset to classify 11 classes. The validation dataset gives a "loss3/top1" accuracy of 86.5%, but when I evaluate the performance on my evaluation dataset I get 77% accuracy. Whatever changes I made in train_val.prototxt, I made the same changes in deploy.prototxt. Is this difference between validation and evaluation accuracy normal, or did I do something wrong?
Any suggestions?
To get a fair estimate of your trained model on the validation dataset, you need to set test_iter and the validation batch_size in a meaningful way.
test_iter should be set to:
Val_data / test_batch_size
where Val_data is the size of your validation dataset and test_batch_size is the batch_size value set for the TEST (validation) phase.
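For example, a small back-of-the-envelope calculation (the numbers are made up; substitute your own validation set size and TEST-phase batch_size):

val_data = 2200        # number of images in the validation set (example value)
test_batch_size = 50   # batch_size of the TEST phase in train_val.prototxt (example value)

test_iter = val_data // test_batch_size
print(test_iter)       # 44 -> one test pass covers the whole validation set exactly once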