XGBoost trained model makes random predictions for the same data - machine-learning

I have a problem with XGBoost predictions.
I trained an XGBoost model for a regression problem in Python, but when the max_depth parameter is set to something other than its default value, some of the predictions change when the same model predicts the same data again.
So far I have tried changing basic parameters such as the learning rate and reg_lambda, but only max_depth causes this randomness in the predictions for the same data.
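A minimal sketch (toy data and parameter values are illustrative) to check whether repeated predictions from one fitted model really differ, with the seed pinned so that training itself is reproducible:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

# Toy regression data standing in for the real dataset.
X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

# Fix random_state so that training is reproducible.
model = xgb.XGBRegressor(max_depth=8, learning_rate=0.1, n_estimators=200, random_state=42)
model.fit(X, y)

# Predict twice with the same fitted model and compare.
pred_a = model.predict(X)
pred_b = model.predict(X)
print("identical predictions:", np.allclose(pred_a, pred_b))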

Related

Accuracy and prediction Classifiers

I have trained an LSTM and a decision tree on my data set (a text classification task), using k-fold cross-validation with k=10.
Decision tree accuracy: 61%
LSTM accuracy: 90%
Yet when I predict on totally unseen data, the decision tree performs better than the LSTM.
Why does this happen? If the LSTM's accuracy is higher, why does the decision tree perform better on unseen data?
Your LSTM model may have higher accuracy than the decision tree during training, but the fact that it doesn't generalize well to unseen data indicates that the LSTM is overfitting to the training data. Try adjusting the train-validation split and the batch size to see if that improves your models.
The validation loss during training would indicate which model is better. You can also try a random forest (an ensemble of decision trees), which often gives better results than a single decision tree alone.
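For the random-forest suggestion, a minimal sketch (the generated X and y are placeholders for the vectorized text features) using scikit-learn with the same 10-fold setup:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the vectorized text features.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# A random forest averages many decision trees, which usually generalizes
# better than a single tree.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))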

Multi-label classification Keras metrics

Which metric is better for multi-label classification in Keras: accuracy or categorical_accuracy? Obviously, the last activation function is sigmoid and the loss function is binary_crossentropy in this case.
I would not use accuracy for classification tasks with unbalanced classes.
Especially in multi-label tasks, most of your labels are probably False; that is, each data point carries only a small set of labels compared to the cardinality of all possible labels.
For that reason accuracy is not a good metric: if your model predicts all False (sigmoid activation output < 0.5), it will still measure a very high accuracy.
I would analyze either the AUC or recall/precision at each epoch.
Alternatively, a multi-label task can be seen as a ranking task (as in recommender systems), and you could evaluate precision@k or recall@k, where k covers the top predicted labels, as in the short sketch after the metrics link below.
If your Keras back-end is TensorFlow, check out the full list of supported metrics here: https://www.tensorflow.org/api_docs/python/tf/keras/metrics.
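A rough sketch of precision@k for a multi-label problem (the y_true and y_score arrays below are illustrative, with shape (n_samples, n_labels)):

import numpy as np

def precision_at_k(y_true, y_score, k=5):
    # Fraction of the top-k predicted labels per sample that are actually true.
    top_k = np.argsort(-y_score, axis=1)[:, :k]   # indices of the k highest scores
    hits = np.take_along_axis(y_true, top_k, axis=1)
    return hits.mean()

# Tiny example: 2 samples, 4 labels.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0]])
y_score = np.array([[0.9, 0.2, 0.1, 0.8],
                    [0.3, 0.7, 0.6, 0.1]])
print(precision_at_k(y_true, y_score, k=2))  # (2/2 + 1/2) / 2 = 0.75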
Actually, there is no metric named accuracy in Keras. When you set metrics=['accuracy'] in Keras, the correct accuracy metric is inferred automatically based on the loss function used. As a result, since you have used binary_crossentropy as the loss function, binary_accuracy will be chosen as the metric.
Now, you should definitely choose binary_accuracy over categorical_accuracy in a multi-label classification task, since the classes are independent of each other and the prediction for each class should be considered independently of the predictions for the other classes.
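A minimal compile sketch for that multi-label setup (layer sizes and the number of labels are placeholders), making the metric explicit instead of relying on the inference from 'accuracy':

import tensorflow as tf

n_labels = 10  # placeholder for the actual number of labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(100,)),
    # One sigmoid unit per label: each label is predicted independently.
    tf.keras.layers.Dense(n_labels, activation="sigmoid"),
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    # Explicit metrics avoid relying on Keras inferring them from the loss.
    metrics=[tf.keras.metrics.BinaryAccuracy(), tf.keras.metrics.AUC()],
)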

Decision_function for XGBoost in SKLearn wrapper

I get different results for model.predict_proba(X)[:,0] compared to model.decision_function(X) for a regular gradient boosting decision tree classifier in sklearn, so I know the two are not the same.
I want the scores of the model, to plot ROC curves etc. How can I get the decision function for an XGBoost classifier using the sklearn wrapper? And why is predict_proba different from the scores?
In general, I would not expect sklearn's GradientBoostingClassifier and xgboost.XGBClassifier to agree, as they use very different implementations. But there are also conceptual differences between the quantities you are trying to compare:
And why is predict_proba different from scores?
Probabilities (the output of model.predict_proba(X)) are obtained from the scores (the output of model.decision_function(X)) by applying the loss/objective function; see here for the call to the loss function and here for the actual transformation.
I want the scores of the model. To plot ROC curves etc. How can I get the decision function for XGBoost classifier using the SKLearn wrapper?
For the ROC curve you will want to use xgbmodel.predict_proba(X)[:,1], i.e. the second column, which corresponds to class 1.
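A short sketch of the ROC curve from those probabilities (xgbmodel, X_test and y_test are assumed to exist already, i.e. a fitted XGBClassifier and a held-out test split):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive class (second column of predict_proba).
y_score = xgbmodel.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC:", roc_auc_score(y_test, y_score))

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()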

Suggest using Kfold splits or validation_split kwarg in Keras Training?

In many examples, I see train/cross-validation dataset splits being performed using KFold, StratifiedKFold, or another pre-built dataset splitter. Keras models have a built-in validation_split kwarg that can be used for training.
model.fit(self, x, y, batch_size=32, nb_epoch=10, verbose=1, callbacks=[], validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None)
(https://keras.io/models/model/)
validation_split: float between 0 and 1: fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
I am new to the field and tools, so my intuition about what the different splitters offer is limited. Mainly, though, I can't find any information on how Keras' validation_split works. Can someone explain it to me, and say when a separate splitting method is preferable? The built-in kwarg seems to me like the cleanest and easiest way to split off a validation set, without having to restructure your training loops much.
The difference between the two is quite subtle and they can be used in conjunction.
KFold and similar functions in scikit-learn randomly split your data into k folds. You can then train a model while holding out a single fold each time and test on the held-out fold.
validation_split takes a fraction of your data non-randomly. According to the Keras documentation, it takes the fraction from the end of your data; e.g. 0.1 holds out the final 10% of rows in the input matrix. The purpose of the validation split is to let you assess how the model performs on the training set and on a held-out set at every epoch during training. If the model keeps improving on the training set but not on the validation set, that is a clear sign of potential overfitting.
You could theoretically use KFold cross-validation to build a model while also using validation_split to monitor the performance at each fold; at each fold you would generate a new validation_split from that fold's training data, as sketched below.
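A rough sketch of that combination (build_model is a placeholder for whatever function returns your compiled Keras model, and X, y stand in for the full dataset as NumPy arrays):

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, test_idx in kf.split(X):
    model = build_model()  # placeholder: returns a freshly compiled Keras model
    # validation_split carves a monitoring set out of this fold's training data.
    model.fit(X[train_idx], y[train_idx],
              epochs=10, batch_size=32, validation_split=0.1, verbose=0)
    loss, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    fold_scores.append(acc)

print("mean fold accuracy:", np.mean(fold_scores))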

Why does overfitting still happen after cross validation?

It is a binary photo classification problem; I extracted the features using AlexNet. The evaluation metric is log-loss. There are 25,000 records in the training set, 12,500 "1"s and 12,500 "0"s, so the data set is balanced.
I trained an XGBoost model. After tuning the parameters using cross-validation, the training log-loss is 0.078 and the validation log-loss is 0.09. But when I make predictions on the test set, the log-loss is 2.1. It seems that overfitting is still pretty serious.
Why is that? Do I have to tune the parameters further, or try another pre-trained model?
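For reference, a hedged sketch of how the three log-loss figures would be computed (model and the train/validation/test splits are assumed to already exist):

from sklearn.metrics import log_loss

# Compare log-loss on each split to quantify the generalization gap.
splits = {"train": (X_train, y_train),
          "validation": (X_val, y_val),
          "test": (X_test, y_test)}
for name, (X_part, y_part) in splits.items():
    proba = model.predict_proba(X_part)
    print(name, "log-loss:", log_loss(y_part, proba))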
