When I try to combine RandomizedSearchCV with early stopping to reduce overfitting, I get this error:
py:372: FitFailedWarning:
300 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
The code I am trying is like this:
params_dist = {'min_child_weight': [0.1, 1, 5, 10, 50],
               'colsample_bytree': np.arange(0.5, 1.0, 0.1),
               'gamma': [0.5, 1, 1.5, 2, 5],
               'subsample': np.arange(0.5, 1.0, 0.1),
               'max_depth': range(3, 21, 3),
               'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1],
               'n_estimators': [50, 100, 250, 500, 750, 1000],
               'reg_alpha': [0.0001, 0.001, 0.1, 1],
               'reg_lambda': [0.0001, 0.001, 0.1, 1]}
model_with_earlyStopping = xgb.XGBClassifier(objective='binary:logistic',
                                             eval_metric="error",
                                             early_stopping_rounds=13,
                                             seed=42)
random_search = model_selection.RandomizedSearchCV(model_with_earlyStopping,
                                                   param_distributions=params_dist,
                                                   scoring='roc_auc',
                                                   n_jobs=-1,
                                                   verbose=0,
                                                   cv=3,
                                                   random_state=1001,
                                                   n_iter=100)
The code worked fine without early stopping; however, I am looking for a way to combine these two methods. Can anyone help me fix it?
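For what it's worth, this FitFailedWarning means every single fit raised an exception, and setting error_score='raise' on the search will surface the actual one. With recent xgboost versions, an estimator constructed with early_stopping_rounds requires an eval_set at fit time, which the search does not supply on its own. Here is a minimal sketch of one workaround, assuming hypothetical full arrays X and y; it relies on the fact that keyword arguments passed to RandomizedSearchCV.fit are forwarded to the estimator's fit:
from sklearn.model_selection import train_test_split
# Hold out a fixed validation set for early stopping; the CV splits are
# then drawn from the training portion only.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)
# Extra fit keyword arguments are forwarded to XGBClassifier.fit, so every
# candidate stops early against the same eval_set.
random_search.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])
Note the caveat: the same validation set is reused across all CV folds, so treat this as a pragmatic workaround rather than a clean design.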
I am trying to run this code on my dataset, but it's taking too long (the code has been running since yesterday, and my dataset is not that large).
# A parameter grid for XGBoost
params = {
'min_child_weight': [1, 5, 10],
'gamma': [0.5, 1, 1.5, 2, 5],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'max_depth': [3, 4, 5]
}
xgb_classifier = XGBClassifier(learning_rate=0.02, n_estimators=600,
                               objective='binary:logistic', silent=True, nthread=1)
folds = 3
param_comb = 5
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1001)
random_search = RandomizedSearchCV(xgb_classifier, param_distributions=params,
                                   n_iter=param_comb, scoring='roc_auc', n_jobs=4,
                                   cv=skf.split(trainx, trainy), verbose=3,
                                   random_state=1001)
random_search.fit(trainx, trainy)
I am new to the field. Any ideas?
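For scale, this search trains folds * param_comb = 15 models of 600 trees each, and nthread=1 limits every individual fit to a single thread (n_jobs=4 only parallelizes across fits). One way to check whether "too long" is expected is to time a single fit first; a rough sketch using the variables from the question:
import time
start = time.time()
xgb_classifier.fit(trainx, trainy)  # one model, no search
one_fit = time.time() - start
# The full search costs roughly folds * param_comb fits of similar size.
print("single fit: %.1fs, search estimate: %.0fs" % (one_fit, one_fit * 3 * 5))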
I am playing around with XGBClassifier and tuning it with GridSearchCV. I first created the variable xgbc:
xgbc = xgb.XGBClassifier()
I didn't use any parameters because I wanted to see the default model's performance. This gave me accuracy_score = 85.65%, recall_score = 77.91% and roc_auc_score = 84.21%, using the following lines of code:
print("Accuracy: ", accuracy_score(y_test, xgbc.predict(X_test)))
print("Recall: ", recall_score(y_test, xgbc.predict(X_test)))
print("ROC_AUC: ", roc_auc_score(y_test, xgbc.predict(X_test)))
Next, I used GridSearchCV to try to tune the parameters, like this:
Setting up the parameter dictionary:
xgbc_params = {'max_depth': [5, 6, 7],                      # 6
               'learning_rate': [0.25, 0.300000012, 0.35],  # 0.300000012
               'gamma': [0, 0.001, 0.1],                    # 0
               'reg_lambda': [0.8, 0.95, 1],                # 1
               'scale_pos_weight': [0, 1, 2],               # 1
               'n_estimators': [95, 100, 105]}              # 100
(The numbers after the # are the default values, which gave me the above scores.)
And now I run GridSearchCV like this:
xgbc_grid = GridSearchCV(xgbc, param_grid=xgbc_params, scoring=make_scorer(accuracy_score), cv=10, n_jobs=-1)
Next, fit this to the training data:
xgbc_grid.fit(X_train, y_train, verbose=1, early_stopping_rounds=10, eval_metric='aucpr', eval_set=[(X_test, y_test)])
Finally, run the metrics again:
print("Best Reg estimators: ", xgbc_grid.best_params_)
print("Accuracy: ", accuracy_score(y_test, xgbc_grid.predict(X_test)))
print("Recall: ", recall_score(y_test, xgbc_grid.predict(X_test)))
print("ROC_AUC: ", roc_auc_score(y_test, xgbc_grid.predict(X_test)))
Now, the scores change: accuracy_score = 0.8340807174887892, recall_score = 0.7325581395348837 and roc_auc_score = 0.8420896282464777. Also, here is the best_params_ result:
Best Reg estimators: {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 5, 'n_estimators': 95, 'reg_lambda': 0.8, 'scale_pos_weight': 1}
Here is my problem:
The parameter values that GridSearchCV returns through xgbc_grid.best_params_ are not the optimal ones for accuracy, since the accuracy score decreases. Can you please help me figure out why this is happening?
In the parameter dictionary above, I have provided the default values. If I restrict each parameter to only its single default value, e.g. 'max_depth': [6], then I get the 85% accuracy. However, as soon as I add other values, like 'max_depth': [5, 6, 7], GridSearchCV returns parameters that do not give the highest accuracy score. Full details below:
Base Reg estimators (acc = 85%): {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 5, 'n_estimators': 95, 'reg_lambda': 0.8, 'scale_pos_weight': 1}
Best Reg estimators (acc = 83%): {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 6, 'n_estimators': 100, 'reg_lambda': 1, 'scale_pos_weight': 1}
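One general point about GridSearchCV that may explain this (my reading, not a verdict on this specific run): best_params_ maximizes the mean cross-validated accuracy on the training folds, not the accuracy on X_test, so the two can legitimately disagree. Passing (X_test, y_test) as eval_set during tuning also lets the test set influence training. A minimal sketch for comparing the CV score with the held-out score:
import pandas as pd
# Mean CV accuracy per candidate, best first.
cv = pd.DataFrame(xgbc_grid.cv_results_)
print(cv[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False).head())
# best_score_ is the CV accuracy of best_params_; compare it with the
# test-set accuracy instead of expecting the two to match.
print("CV accuracy of best_params_:", xgbc_grid.best_score_)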
This is an example of a larger dataset I have. Imagine I have a dataframe with several columns, each of which has missing values (NaN) in some rows.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
df = pd.DataFrame({'a': [0.3, 0.2, 0.5, 0.1, 0.4, 0.5, np.nan, np.nan, np.nan, 0.6, 0.3, 0.5],
                   'b': [4, 3, 5, np.nan, np.nan, np.nan, 5, 6, 5, 8, 7, 4],
                   'c': [20, 25, 35, 30, 10, 18, 16, 22, 26, np.nan, np.nan, np.nan]})
I would like to predict these missing values using RandomForestRegressor, with the other columns as features: whenever a sample has a NaN, I want to use the values of the other two columns to predict the missing one. I can usually do this for a single feature, but I would like an automated way to do it for every column.
Thank you.
You can use IterativeImputer from sklearn and pass RandomForestRegressor as its estimator parameter:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df = pd.DataFrame({'a': [0.3, 0.2, 0.5, 0.1, 0.4, 0.5, np.nan, np.nan, np.nan, 0.6, 0.3, 0.5],
                   'b': [4, 3, 5, np.nan, np.nan, np.nan, 5, 6, 5, 8, 7, 4],
                   'c': [20, 25, 35, 30, 10, 18, 16, 22, 26, np.nan, np.nan, np.nan]})
imp_mean = IterativeImputer(estimator=RandomForestRegressor(), random_state=0)
imp_mean.fit(df)
display(pd.DataFrame(imp_mean.transform(df)))
This returns the following dataframe, in which the NaN values have been imputed:
0 1 2
0 0.300 4.00 20.00
1 0.200 3.00 25.00
2 0.500 5.00 35.00
3 0.100 3.69 30.00
4 0.400 5.53 10.00
5 0.500 5.78 18.00
6 0.389 5.00 16.00
7 0.455 6.00 22.00
8 0.463 5.00 26.00
9 0.600 8.00 21.02
10 0.300 7.00 16.92
11 0.500 4.00 29.98
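A small usage note (not part of the original answer): transform returns a NumPy array, so the column names are lost; you can restore them like this:
df_imputed = pd.DataFrame(imp_mean.transform(df), columns=df.columns)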
I got my best accuracy on some tabular data using the parameters below. If I add a new feature, which parameters should I change to get a good result again? Or at least, which ones should I re-tune first?
xgb = XGBClassifier(
    bagging_fraction=0.8,    # LightGBM-style name; XGBoost's equivalent is subsample
    boosting='gbdt',         # LightGBM-style name; not an XGBoost parameter
    colsample_bytree=0.7,
    feature_fraction=0.9,    # LightGBM-style name; XGBoost's equivalent is colsample_bytree
    learning_rate=0.05,
    max_bin=32,
    max_depth=10,
    min_child_weight=11,
    missing=-999,
    n_estimators=400,
    nthread=4,
    num_leaves=80,           # LightGBM-style name; XGBoost's equivalent is max_leaves
    objective='multiclass',  # LightGBM-style objective; XGBoost uses e.g. 'multi:softprob'
    predictor='gpu_predictor',
    seed=1337,
    silent=1,
    subsample=0.8,
    tree_method='gpu_hist',
    verbose=True
)
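A general heuristic, offered as a suggestion rather than a rule: the parameters most directly tied to the feature count are the feature-sampling ones (colsample_bytree here), so they are natural candidates to re-tune first after adding a feature, before revisiting depth and leaf settings. A minimal sketch, assuming X_new and y hold the data with the added column (hypothetical names):
from sklearn.model_selection import GridSearchCV
# Re-tune only the feature-sampling rate around its current value first;
# widen the search to depth/leaf parameters if the score shifts a lot.
grid = GridSearchCV(xgb, {'colsample_bytree': [0.6, 0.7, 0.8]}, cv=3)
grid.fit(X_new, y)
print(grid.best_params_)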
I made an XGBoost classifier in Python and tried GridSearch to find optimal parameters, like this:
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
When running the search, I get an error like this:
[Errno 28] No space left on device
I am using a fairly large dataset, where
X.shape = (38932, 1002)
Y.shape= (38932,)
What is the issue? How can I solve it? Is this because the dataset is too large for my machine? If so, what can I do to perform GridSearch on this dataset?
The error indicates that the shared memory is running out. Increasing the number of k-folds and/or adjusting the number of threads (n_jobs) will likely resolve this issue.
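Another workaround worth trying (my suggestion, not part of the original answer) is to point joblib's temporary folder at a partition with more free space before fitting:
import os
# Joblib honors this environment variable for its memmapped temp files;
# the default location (often /dev/shm) can be small. The path is only
# an example; any disk with enough free space will do.
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'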
Here is a working example using xgboost:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
clf = xgb.XGBClassifier()
parameters = {
    'n_estimators': [100, 250, 500],
    'max_depth': [6, 9, 12],
    'subsample': [0.9, 1.0],
    'colsample_bytree': [0.9, 1.0],
}
bsn = datasets.load_iris()
X, Y = bsn.data, bsn.target
grid = GridSearchCV(clf, parameters, n_jobs=4,
                    scoring="neg_log_loss", cv=3)
grid.fit(X, Y)
print("Best: %f using %s" % (grid.best_score_, grid.best_params_))
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
params = grid.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
The output is:
Best: -0.121569 using {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 100, 'subsample': 1.0}
-0.126334 (0.080193) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 100, 'subsample': 0.9}
-0.121569 (0.081561) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 100, 'subsample': 1.0}
-0.139359 (0.075462) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 250, 'subsample': 0.9}
-0.131887 (0.076174) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 250, 'subsample': 1.0}
-0.148302 (0.074890) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 500, 'subsample': 0.9}
-0.135973 (0.076167) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 500, 'subsample': 1.0}
-0.126334 (0.080193) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 100, 'subsample': 0.9}
-0.121569 (0.081561) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 100, 'subsample': 1.0}
-0.139359 (0.075462) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 250, 'subsample': 0.9}
-0.131887 (0.076174) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 250, 'subsample': 1.0}
-0.148302 (0.074890) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 500, 'subsample': 0.9}
-0.135973 (0.076167) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 500, 'subsample': 1.0}
-0.126334 (0.080193) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 100, 'subsample': 0.9}
-0.121569 (0.081561) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 100, 'subsample': 1.0}
-0.139359 (0.075462) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 250, 'subsample': 0.9}
-0.131887 (0.076174) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 250, 'subsample': 1.0}
-0.148302 (0.074890) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 500, 'subsample': 0.9}
-0.135973 (0.076167) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 500, 'subsample': 1.0}
-0.132745 (0.080433) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 100, 'subsample': 0.9}
-0.127030 (0.077692) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 100, 'subsample': 1.0}
-0.146143 (0.077623) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 250, 'subsample': 0.9}
-0.140400 (0.074645) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 250, 'subsample': 1.0}
-0.153624 (0.077594) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 500, 'subsample': 0.9}
-0.143833 (0.073645) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 500, 'subsample': 1.0}
-0.132745 (0.080433) with: {'colsample_bytree': 1.0, 'max_depth': 9, ...