I have trained a neural network and an XGBoost model for the same problem, and now I am confused about how I should stack them. Should I just pass the output of the neural network as an input to the XGBoost model, or should I take a weighted combination of their results separately? Which would be better?
This question cannot be answered definitively. I would suggest trying both possibilities and choosing the one that works best.
Using the output of one model as input to the other model
I guess you know what you have to do to use the output of the NN as input to XGBoost. You should just take some time to think about how you handle the test and train data (see below). Use the "probabilities" rather than the binary labels for that. Of course, you could also try it the other way around, so that the NN gets the output of the XGBoost model as an additional input.
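A minimal sketch of this first option, assuming a binary problem where X_train, y_train and X_test already exist (placeholder names); MLPClassifier stands in for the neural network here, since any model exposing predict_proba would work the same way:

```python
# Minimal sketch: the NN's predicted probability becomes one extra feature for XGBoost.
# MLPClassifier stands in for the neural network; any model with predict_proba works.
import numpy as np
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

nn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
nn.fit(X_train, y_train)  # X_train, y_train, X_test are assumed to exist already

# Use the probability of the positive class, not the hard 0/1 label.
train_proba = nn.predict_proba(X_train)[:, 1]
test_proba = nn.predict_proba(X_test)[:, 1]

# Note: these training-set probabilities are in-sample; see the
# train/test section below for a safer out-of-fold variant.
xgb = XGBClassifier()
xgb.fit(np.column_stack([X_train, train_proba]), y_train)
preds = xgb.predict(np.column_stack([X_test, test_proba]))
```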
Using a VotingClassifier
The other possibility is a VotingClassifier with soft voting. You can use VotingClassifier(voting='soft') for that (to be precise, sklearn.ensemble.VotingClassifier). You could also play around with the weights here.
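A minimal sketch of the soft-voting variant, again with MLPClassifier as a stand-in for the NN (a Keras model would need an sklearn-compatible wrapper) and the same placeholder X_train/y_train/X_test:

```python
# Minimal soft-voting sketch: both models must expose predict_proba.
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

ensemble = VotingClassifier(
    estimators=[
        ("nn", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),
        ("xgb", XGBClassifier()),
    ],
    voting="soft",    # average the predicted probabilities instead of the hard votes
    weights=[1, 1],   # play around with these if one model should count more
)
ensemble.fit(X_train, y_train)   # X_train, y_train, X_test are placeholders
proba = ensemble.predict_proba(X_test)
```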
Difference
The big difference is that with the first approach the XGBoost model can learn in which areas the NN is weak and in which it is strong, while with the VotingClassifier the outputs of both models are weighted the same way for all samples. The voting approach relies on the assumption that the models output a "probability" not too close to 0 or 1 when they are not confident about the prediction for a specific input record, and this assumption is not always true.
Handling of the Train/Test Data
In both cases, you need to think about how you handle the train/test data. The data should ideally be split the same way for both models; otherwise you might introduce some kind of data leakage problem.
For the VotingClassifier this is no problem, because it can be used like a regular sklearn model class. For the first method (the output of model 1 is one feature of model 2), you should make sure you do the train/test split (or the cross-validation) with exactly the same records. If you don't, you run the risk of validating the output of your second model on a record that was in the training set of model 1 (except for the additional feature, of course). This clearly could cause a data leakage problem, resulting in a score that looks better than how the model would actually perform on unseen production data.
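One hedged way to implement that for the stacking variant is to generate the model-1 feature for the training rows out-of-fold, for example with cross_val_predict; the sketch below again assumes placeholder X_train/X_test/y_train/y_test and the stand-in models from above:

```python
# Sketch of a leakage-free split for the stacking variant: the feature fed to
# XGBoost for the training rows comes from out-of-fold predictions, so model 2
# is never evaluated on rows that model 1 saw during its own training.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

nn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)

# Out-of-fold probabilities for the training rows.
oof_proba = cross_val_predict(nn, X_train, y_train, cv=5, method="predict_proba")[:, 1]

# Refit on the full training set only to produce the test-set feature.
nn.fit(X_train, y_train)
test_proba = nn.predict_proba(X_test)[:, 1]

xgb = XGBClassifier()
xgb.fit(np.column_stack([X_train, oof_proba]), y_train)
print(xgb.score(np.column_stack([X_test, test_proba]), y_test))
```

For what it's worth, sklearn's StackingClassifier follows essentially this out-of-fold pattern internally, so it is another option worth trying.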
Related
I have a binary classification problem I'm trying to tackle in Keras. To start, I was following the usual MNIST example, using softmax as the activation function in my output layer.
However, in my problem, the two classes are highly unbalanced (one appears ~10 times more often than the other). What's even more critical, they are not symmetric in the way they may be mistaken for each other.
Mistaking an A for a B is way less severe than mistaking a B for an A. Just like a caveman trying to classify animals into pets and predators: mistaking a pet for a predator is no big deal, but the other way round will be lethal.
So my question is: how would I model something like this with Keras?
Thanks a lot.
A non-exhaustive list of things you could do:
Generate a balanced data set using data augmentations. If the data are images, you can add image augmentations in a custom data generator that will output balanced amounts of data from each class per batch and save the results to a new data set. If the data are tabular, you can use a library like imbalanced-learn to perform over/under sampling.
As @Daniel said, you can use class_weights during training (in the fit method) so that mistakes on the important class are penalized more. See this tutorial: Classification on imbalanced data. The same idea can be implemented with a custom loss function, with or without class_weights during training. (A short sketch of both options follows this list.)
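A short sketch of both options, assuming a compiled Keras model called model and tabular arrays X_train/y_train (all placeholder names):

```python
# Sketch of both options above; model, X_train and y_train are placeholder names.
from imblearn.over_sampling import RandomOverSampler

# Option 1: rebalance a tabular training set by oversampling the minority class.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# Option 2: keep the data as-is and penalise mistakes on the rare class more heavily.
# Here class 1 is assumed to be the rare class whose misclassification is costly.
model.fit(X_train, y_train, epochs=10, class_weight={0: 1.0, 1: 10.0})
```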
If I am doing a multi-classification problem, is there a way to essentially make a class an "unsure" class? For example, if my model doesn't have a very strong prediction, it should default to this class. Like when you take a test: some tests penalize you for wrong answers, some don't. I want to write a custom loss function that doesn't penalize my model for guessing the neutral class, but does penalize it if it makes a prediction that is wrong. Is there a way to do what I am trying to do?
For classifiers using a one-hot encoded softmax output layer, the outputs can be interpreted as a probability that the input falls into each of the categories. e.g. if your model has outputs (cat, dog, frog), then an output of (0.6, 0.2, 0.2) means the input has (according to the classifier) a 60% chance of being a cat and a 20% chance for each of being a dog or frog.
In this case, when the model is uncertain it can (and will) have an output where no one class is particularly likely, e.g. (0.33, 0.33, 0.33). There's no need to add a separate 'Other' category.
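If you still want an explicit "unsure" bucket, one hedged option is to add it after prediction by thresholding the softmax confidence rather than training it as a class; model, X_test and the 0.5 cut-off below are placeholders:

```python
# Sketch: map low-confidence softmax outputs to an "unsure" label after prediction.
import numpy as np

proba = model.predict(X_test)        # softmax outputs, shape (n_samples, n_classes)
confidence = proba.max(axis=1)
predicted = proba.argmax(axis=1)

UNSURE = -1                          # arbitrary marker for the "unsure" bucket
threshold = 0.5                      # tune this cut-off on a validation set
labels = np.where(confidence >= threshold, predicted, UNSURE)
```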
Separately from this, it might be difficult to train an "unsure" category unless you have specific input examples that you want the model to classify as "unsure".
I encountered the very same problem.
I tried using a neutral class, but the neural net would either put nothing in it or everything in it, depending on which option reduced the loss.
After some searching, it looks like what we are trying to achieve is "neural network uncertainty estimation". One way to achieve that is to run your image 100 times through your neural net with random dropout active and see how often it lands on the same class.
This blog post explains it well: https://www.inovex.de/blog/uncertainty-quantification-deep-learning/
This video also: https://medium.com/deeplearningmadeeasy/how-to-add-uncertainty-to-your-neural-network-afb5f855e66a
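For reference, here is a minimal sketch of that idea (Monte-Carlo dropout) in Keras; model and x_batch are placeholders, and calling the model with training=True keeps the dropout layers active at prediction time:

```python
# Monte-Carlo dropout sketch: push the same input through the network many times
# with dropout still active and look at how much the predictions disagree.
import numpy as np

def mc_dropout_predict(model, x, n_samples=100):
    # training=True keeps the Dropout layers switched on at inference time
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)  # mean prediction and its spread

mean_proba, uncertainty = mc_dropout_predict(model, x_batch)  # model, x_batch are placeholders
```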
I will let you know and publish here if I have some results with that.
I have built two ML models with the following roc_auc_score values:
Model 1
Training score - 95%
Test score - 74%
Model 2
Training score - 78%
Test score - 74%
It is highly likely that model 1 is overfitting, but the test score is the same in both cases. So which of the two is the better-performing one?
I assume this is a hypothetical question where all other conditions are equal. In this case, I would argue with Occam's razor and declare the simpler model (probably model 2) the winner.
In practice, other factors might be important too. For example, have you extensively tuned hyperparameters to get to Model 2 and thus overfit to the test data?
Without any further information, I would agree that your first model does appear to be overfit. Other than that, both models conceptually have "learned" about the behavior of the underlying real world training data with a similar level of accuracy, as given by the identical test scores.
But because the first model is overfit, it has possibly also incorporated noise from the training data. This additional information won't help the model and might actually hurt when making new predictions.
So, I would lean towards using the second model, if I had to choose one of the two.
In general, it is hard to give a concrete answer without insight into the use case, the problem to be solved, and the model and training strategy you have chosen.
However, perhaps a differentiation between errors might help:
Bayes Error: The theoretically lowest possible error a classifier might reach.
Human Error: The classification error exhibited by a human solving the task.
Avoidable Bias: The difference between the human/Bayes error and the error exhibited by your model evaluated on the training set.
Avoidable Variance: The difference between the test error and the training error.
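Applied to the numbers in the question (treating error loosely as 1 − ROC AUC, which is a simplification): model 1 has a training error of about 5% and a test error of about 26%, so its train/test gap (avoidable variance) is about 21 points; model 2 has a training error of about 22% and the same 26% test error, so its gap is only about 4 points. The avoidable bias cannot be judged from these numbers alone, because no estimate of the Bayes/human error is given.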
So in your case, it seems at first sight that model 1 is overfitting compared to model 2, since it has a much higher variance (a larger gap between training and test error). That does not mean model 1 is worse; it depends. I would advise you to:
Take a closer look at your available data: what is the distribution of the data? How does it differ from the data the model will see once it is deployed?
Further apply training techniques to model 1 to see if you can reduce the test error: data augmentation (where relevant to the task), weight regularization, dropout, etc.
If you have already done this extensively, then I would analyze the performance/computation cost of both models (which one is faster/lighter) and, as @saibot suggested, go with the simpler one (the one that consumes fewer resources) (Occam's razor).
Remember, the goal is not necessarily to get your test error equal to the training error. It is actually to get your test error as close as possible to the Bayes error.
I have a 6-dimensional training dataset where there is a perfect numeric attribute which separates all the training examples this way: if TIME < 200, the example belongs to class1; if TIME >= 200, the example belongs to class2. J48 creates a tree with only one level and this attribute as the only node.
However, the test dataset does not follow this hypothesis and all the examples are misclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not, as the dataset is that simple, but as far as I understand the definition of overfitting, it implies a high fit to the training data, and this is what I have. Any help?
Usually a great training score and a bad test score means overfitting. But this assumes the data are IID, and you are clearly violating that assumption: your training data is completely different from the test data (there is a clear rule in the training data which has no meaning for the test data). In other words, your train/test split is incorrect, or your whole problem does not follow the basic assumptions of where to use statistical ML. Of course, we often fit models without valid assumptions about the data. In your case, the most natural approach is to drop the feature that violates the assumption the most: the one used to construct the node. This kind of "expert decision" should be made before building any classifier. You have to think about what is different in the test scenario compared to the training one and remove anything that shows this difference; otherwise you have a heavy skew in your data collection, and statistical methods will fail.
Yes, it is an overfit. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different than any other. It has the answer embedded within it while your test set doesn't. Any learning algorithm will likely find the correlation to the answer and use it and, just like the J48 algorithm, will regard the other variables as noise. The software equivalent of Clever Hans.
You can overcome this by either removing the variable or by training on a set drawn randomly from the entire available set. However, since you know that there is a subset with an embedded major hint, you should remove the hint.
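As a hedged illustration of "removing the variable" (the question is about Weka's J48, so this Python sketch uses DecisionTreeClassifier only as a rough stand-in; train_df, test_df and the "class" column are placeholders, and "TIME" is the attribute from the question):

```python
# Hedged sketch of "remove the variable": drop the leaking TIME attribute before
# training. DecisionTreeClassifier is a rough Python stand-in for Weka's J48;
# train_df, test_df and the "class" column are placeholder names.
from sklearn.tree import DecisionTreeClassifier

X_train_clean = train_df.drop(columns=["TIME", "class"])
X_test_clean = test_df.drop(columns=["TIME", "class"])

clf = DecisionTreeClassifier()
clf.fit(X_train_clean, train_df["class"])
print(clf.score(X_test_clean, test_df["class"]))
```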
You're lucky. At times these hints can be quite subtle which you won't discover until you start applying the model to future data.
I've been studying neural networks for a bit and recently learned about the dropout training algorithm. There are excellent papers out there to understand how it works, including the ones from the authors.
So I built a neural network with dropout training (it was fairly easy), but I'm a bit confused about how to perform model selection. From what I understand, it looks like dropout is a method to be used when training the final model obtained through model selection.
As for the test part, papers always talk about using the complete network with halved weights, but they do not mention how to use it in the training/validation part (at least the ones I read).
I was thinking about using the network without dropout for the model selection part. Say that makes me find that the net performs well with N neurons. Then, for the final training (the one I use to train the network for the test part) I use 2N neurons with dropout probability p=0.5. That assures me to have exactly N neurons active on average, thus using the network at the right capacity most of the time.
Is this a correct approach?
By the way, I'm aware of the fact that dropout might not be the best choice with small datasets. The project I'm working on has academic purposes, so it's not really needed that I use the best model for the data, as long as I stick with machine learning good practices.
First of all, model selection and the training of a particular model are completely different issues. For model selection, you usually need data that is completely independent of both the training set used to build the model and the test set used to estimate its performance. So if you're doing, for example, a cross-validation, you would need an inner cross-validation (to train the candidate models and select among them) and an outer cross-validation (to estimate the performance of the whole procedure).
To see why, consider the following thought experiment (shamelessly stolen from this paper). You have a model that makes a completely random prediction. It has a number of parameters that you can set, but have no effect. If you're trying different parameter settings long enough, you'll eventually get a model that has a better performance than all the others simply because you're sampling from a random distribution. If you're using the same data for all of these models, this is the model you will choose. If you have a separate test set, it will quickly tell you that there is no real effect because the performance of this parameter setting that achieves good results during the model-building phase is not better on the separate set.
Now, back to neural networks with dropout. You didn't refer to any particular paper; I'm assuming you mean Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". I'm not an expert on the subject, but the method seems to me similar to what's used in random forests or bagging: mitigating the flaws an individual learner may exhibit by applying it repeatedly in slightly different contexts. If I understood the method correctly, essentially what you end up with is an average over several possible models, very similar to random forests.
This is a way to make an individual model better, but not for model selection. The dropout is a way of adjusting the learned weights for a single neural network model.
To do model selection on this, you would need to train and test neural networks with different parameters and then evaluate those on completely different sets of data, as described in the paper I've referenced above.
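A minimal sketch of that nested setup with sklearn, in case it helps; the MLPClassifier, parameter grid and X/y below are placeholders, with the inner loop doing the selection and the outer loop estimating performance:

```python
# Nested cross-validation sketch: the inner loop picks hyperparameters,
# the outer loop estimates how well that whole selection procedure generalises.
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neural_network import MLPClassifier

param_grid = {"hidden_layer_sizes": [(50,), (100,)], "alpha": [1e-4, 1e-3]}
inner = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)  # model selection
outer_scores = cross_val_score(inner, X, y, cv=5)                    # performance estimate
```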