GridSearchCV on XGBoost model gives error

I built an XGBoost classifier in Python and tried GridSearchCV to find optimal parameters, like this:
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
When running the search I get an error like this:
[Errno 28] No space left on device
I am using a fairly large dataset, where
X.shape = (38932, 1002)
Y.shape = (38932,)
What is the issue, and how can I solve it?
Is it because the dataset is too large for my machine? If so, what can I do to perform GridSearchCV on this dataset?

The error indicates that shared memory is running out. It's likely that adjusting the number of folds and/or the number of parallel workers (n_jobs) will resolve the issue.
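For scale: a dense float64 array of shape (38932, 1002) is about 38932 × 1002 × 8 bytes ≈ 312 MB, and with n_jobs=-1 scikit-learn's joblib backend dumps working copies of large arrays into a temporary folder (often the small /dev/shm on Linux) so workers can memmap them. A minimal workaround sketch, assuming the error comes from that temp folder rather than your actual data disk:

import os

# Assumption: joblib's memmapping temp folder (often /dev/shm) is what fills
# up. JOBLIB_TEMP_FOLDER is joblib's documented override for that location;
# '/tmp' is only an example path with (hopefully) more free space.
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'

# Alternatively, lower the parallelism so less data is materialized at once,
# e.g. GridSearchCV(..., n_jobs=2) instead of n_jobs=-1.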
Here is a working example using xgboost:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn import datasets

clf = xgb.XGBClassifier()
parameters = {
    'n_estimators': [100, 250, 500],
    'max_depth': [6, 9, 12],
    'subsample': [0.9, 1.0],
    'colsample_bytree': [0.9, 1.0],
}
bsn = datasets.load_iris()
X, Y = bsn.data, bsn.target

grid = GridSearchCV(clf, parameters,
                    n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)
grid.fit(X, Y)

print("Best: %f using %s" % (grid.best_score_, grid.best_params_))
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
params = grid.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
The output is:
Best: -0.121569 using {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 100, 'subsample': 1.0}
-0.126334 (0.080193) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 100, 'subsample': 0.9}
-0.121569 (0.081561) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 100, 'subsample': 1.0}
-0.139359 (0.075462) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 250, 'subsample': 0.9}
-0.131887 (0.076174) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 250, 'subsample': 1.0}
-0.148302 (0.074890) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 500, 'subsample': 0.9}
-0.135973 (0.076167) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 500, 'subsample': 1.0}
-0.126334 (0.080193) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 100, 'subsample': 0.9}
-0.121569 (0.081561) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 100, 'subsample': 1.0}
-0.139359 (0.075462) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 250, 'subsample': 0.9}
-0.131887 (0.076174) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 250, 'subsample': 1.0}
-0.148302 (0.074890) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 500, 'subsample': 0.9}
-0.135973 (0.076167) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 500, 'subsample': 1.0}
-0.126334 (0.080193) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 100, 'subsample': 0.9}
-0.121569 (0.081561) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 100, 'subsample': 1.0}
-0.139359 (0.075462) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 250, 'subsample': 0.9}
-0.131887 (0.076174) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 250, 'subsample': 1.0}
-0.148302 (0.074890) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 500, 'subsample': 0.9}
-0.135973 (0.076167) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 500, 'subsample': 1.0}
-0.132745 (0.080433) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 100, 'subsample': 0.9}
-0.127030 (0.077692) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 100, 'subsample': 1.0}
-0.146143 (0.077623) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 250, 'subsample': 0.9}
-0.140400 (0.074645) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 250, 'subsample': 1.0}
-0.153624 (0.077594) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 500, 'subsample': 0.9}
-0.143833 (0.073645) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 500, 'subsample': 1.0}
-0.132745 (0.080433) with: {'colsample_bytree': 1.0, 'max_depth': 9, ...

Related

Early stopping in XGBoost classification and RandomizedSearchCV

When I try to combine RandomizedSearchCV with early stopping to reduce overfitting, I get this error:
py:372: FitFailedWarning:
300 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
The code I am trying is like this:
import numpy as np
import xgboost as xgb
from sklearn import model_selection

params_dist = {'min_child_weight': [0.1, 1, 5, 10, 50],
               'colsample_bytree': np.arange(0.5, 1.0, 0.1),
               'gamma': [0.5, 1, 1.5, 2, 5],
               'subsample': np.arange(0.5, 1.0, 0.1),
               'max_depth': range(3, 21, 3),
               'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1],
               'n_estimators': [50, 100, 250, 500, 750, 1000],
               'reg_alpha': [0.0001, 0.001, 0.1, 1],
               'reg_lambda': [0.0001, 0.001, 0.1, 1]}

model_with_earlyStopping = xgb.XGBClassifier(objective='binary:logistic',
                                             eval_metric="error",
                                             early_stopping_rounds=13,
                                             seed=42)
random_search = model_selection.RandomizedSearchCV(model_with_earlyStopping,
                                                   param_distributions=params_dist,
                                                   scoring='roc_auc',
                                                   n_jobs=-1,
                                                   verbose=0,
                                                   cv=3,
                                                   random_state=1001,
                                                   n_iter=100)
The code works fine without early stopping; however, I am looking for a way to combine the two methods.
Can anyone help me fix it?
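The fits likely fail because xgboost's early stopping needs an eval_set at fit time, which RandomizedSearchCV does not supply on its own. One common workaround, sketched below on synthetic data (the dataset, the small parameter grid, and xgboost ≥ 1.6 with early_stopping_rounds as a constructor argument are all assumptions), is to pass a fixed validation set through the search's fit call, since scikit-learn forwards extra fit keyword arguments to the underlying estimator:

import xgboost as xgb
from sklearn import datasets, model_selection

# Synthetic stand-in for the asker's data (an assumption for illustration).
X, y = datasets.make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(objective='binary:logistic',
                          eval_metric='error',
                          early_stopping_rounds=13,
                          random_state=42)

search = model_selection.RandomizedSearchCV(
    model,
    param_distributions={'max_depth': range(3, 21, 3),
                         'learning_rate': [0.01, 0.1, 0.3]},
    scoring='roc_auc', cv=3, n_iter=5, random_state=1001)

# Extra keyword arguments to fit() are forwarded to XGBClassifier.fit(),
# so every cross-validation fit early-stops against the same held-out set.
search.fit(X_train, y_train, eval_set=[(X_val, y_val)])
print(search.best_params_)

Note that the validation set here is shared across all CV folds rather than being fold-specific, which is a pragmatic compromise rather than a statistically clean design.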

RandomizedSearchCV with xgboost classifier is taking too long

I am trying to run this code on my dataset. However, it's taking too long (the code has been running since yesterday, and my dataset is not very large).
# A parameter grid for XGBoost
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}
xgb_classifier = XGBClassifier(learning_rate=0.02, n_estimators=600,
                               objective='binary:logistic', silent=True, nthread=1)
folds = 3
param_comb = 5
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1001)
random_search = RandomizedSearchCV(xgb_classifier, param_distributions=params,
                                   n_iter=param_comb, scoring='roc_auc', n_jobs=4,
                                   cv=skf.split(trainx, trainy),
                                   verbose=3, random_state=1001)
random_search.fit(trainx, trainy)
I am new to the field. Any ideas?
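For a sense of the cost (my arithmetic, not from the post): the search above trains n_iter × n_splits full models, each with 600 boosting rounds, and nthread=1 restricts every individual model to a single core:

# Rough cost of the search as configured above.
n_iter, n_splits, n_estimators = 5, 3, 600
total_fits = n_iter * n_splits            # 15 cross-validated model fits
total_trees = total_fits * n_estimators   # 9000 boosted trees in total
print(total_fits, total_trees)            # -> 15 9000

A common first step is to reduce n_estimators while tuning, or to balance nthread against n_jobs so that all cores are used without oversubscribing them.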

Guidance needed - GridSearchCV returns parameters that decrease the accuracy of the XGBoost model

I am playing around with XGBClassifier and tuning it with GridSearchCV. I first created the variable xgbc:
xgbc = xgb.XGBClassifier()
I didn't use any parameters, as I wanted to see the default model performance. This gave me accuracy_score = 85.65%, recall_score = 77.91% and roc_auc_score = 84.21%, using the following lines of code:
print("Accuracy: ", accuracy_score(y_test, xgbc.predict(X_test)))
print("Recall: ", recall_score(y_test, xgbc.predict(X_test)))
print("ROC_AUC: ", roc_auc_score(y_test, xgbc.predict(X_test)))
Next, I used GridSearchCV to try to tune the parameters, like this:
Setting up the parameter dictionary:
xgbc_params = {'max_depth': [5, 6, 7],                      # 6
               'learning_rate': [0.25, 0.300000012, 0.35],  # 0.300000012
               'gamma': [0, 0.001, 0.1],                    # 0
               'reg_lambda': [0.8, 0.95, 1],                # 1
               'scale_pos_weight': [0, 1, 2],               # 1
               'n_estimators': [95, 100, 105]}              # 100
(The numbers after the # are the default values, which gave me the above scores.)
And now run the GridSearchCV like this:
xgbc_grid = GridSearchCV(xgbc, param_grid=xgbc_params, scoring=make_scorer(accuracy_score), cv=10, n_jobs=-1)
Next, fit this to the training data:
xgbc_grid.fit(X_train, y_train, verbose=1, early_stopping_rounds=10, eval_metric='aucpr', eval_set=[(X_test, y_test)])
Finally, run the metrics again:
print("Best Reg estimators: ", xgbc_grid.best_params_)
print("Accuracy: ", accuracy_score(y_test, xgbc_grid.predict(X_test)))
print("Recall: ", recall_score(y_test, xgbc_grid.predict(X_test)))
print("ROC_AUC: ", roc_auc_score(y_test, xgbc_grid.predict(X_test)))
Now, the scores change: accuracy_score = 0.8340807174887892, recall_score = 0.7325581395348837 and roc_auc_score = 0.8420896282464777. Also, here is the best_params_ result:
Best Reg estimators: {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 5, 'n_estimators': 95, 'reg_lambda': 0.8, 'scale_pos_weight': 1}
Here is my problem:
The parameter values that GridSearchCV returns through xgbc_grid.best_params_ are not optimal for accuracy, as the accuracy score decreases. Can you please help me figure out why this is happening?
In the parameter dictionary above, I have provided the default values. If I restrict the parameters to only these single values, e.g. 'max_depth': [6], then I get the 85% accuracy. However, as soon as I add other values, like 'max_depth': [5, 6, 7], GridSearchCV returns parameters that do not give the highest accuracy score. Full details below:
Base Reg estimators (acc = 85%): {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 5, 'n_estimators': 95, 'reg_lambda': 0.8, 'scale_pos_weight': 1}
Best Reg estimators (acc = 83%): {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 6, 'n_estimators': 100, 'reg_lambda': 1, 'scale_pos_weight': 1}
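For context (my note, not part of the original post): GridSearchCV selects best_params_ by the mean cross-validated score on the training folds, not by the score on a held-out test set, so the two can legitimately disagree. A self-contained sketch on synthetic data (the dataset and the reduced grid are assumptions) showing the two numbers being estimated on different splits:

import xgboost as xgb
from sklearn import datasets, model_selection
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the asker's data (an assumption for illustration).
X, y = datasets.make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0)

grid = model_selection.GridSearchCV(xgb.XGBClassifier(),
                                    param_grid={'max_depth': [5, 6, 7]},
                                    scoring='accuracy', cv=10)
grid.fit(X_train, y_train)

# best_params_ maximizes the first number, not the second.
print("Mean CV accuracy of best params :", grid.best_score_)
print("Test-set accuracy of best params:",
      accuracy_score(y_test, grid.predict(X_test)))

A gap of a point or two between the two estimates is ordinary sampling variation, not a bug.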

If I add a new feature, will the XGBoost parameters have to change?

I got the best accuracy for some tabular data with the parameters below. If I add a new feature, which parameters should I change to get the best new result? Or at least, which ones should I look at first?
xgb = XGBClassifier(
    bagging_fraction=0.8,    # LightGBM-style name; XGBoost's equivalent is subsample
    boosting='gbdt',         # LightGBM-style name; XGBoost's equivalent is booster
    colsample_bytree=0.7,
    feature_fraction=0.9,    # LightGBM-style name; XGBoost's equivalent is colsample_bytree
    learning_rate=0.05,
    max_bin=32,
    max_depth=10,
    min_child_weight=11,
    missing=-999,
    n_estimators=400,
    nthread=4,
    num_leaves=80,           # LightGBM-style name; XGBoost's equivalent is max_leaves
    objective='multiclass',  # LightGBM-style name; XGBoost uses e.g. 'multi:softprob'
    predictor='gpu_predictor',
    seed=1337,
    silent=1,
    subsample=0.8,
    tree_method='gpu_hist',
    verbose=True
)
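If the question is which knobs are most directly tied to the feature set, the per-tree column-sampling and tree-capacity parameters are the usual candidates to re-tune after adding a feature. A minimal sketch on synthetic data (the dataset, the grid values, and the fixed settings are all assumptions, not advice from the original post):

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Synthetic stand-in: 20 original features plus 1 new one (an assumption).
X, y = make_classification(n_samples=500, n_features=21, n_informative=5,
                           n_classes=3, random_state=1337)

# Re-tune the knobs most directly coupled to the feature set: per-tree
# column sampling and tree capacity. Everything else is held fixed.
grid = GridSearchCV(
    xgb.XGBClassifier(n_estimators=400, learning_rate=0.05, random_state=1337),
    param_grid={'colsample_bytree': [0.5, 0.7, 0.9],
                'max_depth': [6, 10, 14]},
    cv=3)
grid.fit(X, y)
print(grid.best_params_)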

jQuery UI nonlinear slider

I need a slider to let the user select a weight.
$( "#slider" ).slider({
value: 15,
slide: function( event, ui ) {
$('#weight').text(ui.value);
}
});
But the values should be nonlinear: that means 'normal' behaviour for values from 10 to 50 (increasing in steps of 1).
Above that, the values should increase in steps of 10. For lower values it should be more precise: from 3 to 10 in steps of 0.5, and below 3 in steps of 0.1.
My attempt is to use my own array for the data:
myData = [ 0.4, 0.45, 0.5, 0.55, 0.65, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 60, 70 ];
$( "#slider" ).slider({
    min: 0,
    max: myData.length - 1,
    step: 1,
    slide: function( event, ui ) {
        $('#weight').text(ui.value);
    },
    create: function() {
        $(this).slider('values', 0, 0);
        $(this).slider('values', 1, myData.length - 1);
    }
});
But this doesn't work. Is there a smarter solution?
A working approach is to let the slider run over the array indices and map each index back through myData for display:
myData = [ 0.4, 0.45, 0.5, 0.55, 0.65, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 60, 70 ];
$( "#slider" ).slider({
    min: 0,
    max: myData.length - 1,
    step: 1,
    slide: function( event, ui ) {
        $('#weight').text(myData[ui.value]);
    }
});
