Which ML model is better? - machine-learning

I have built two ML models with the following roc_auc_score
Model 1
Training score - 95%
Test score - 74%
Model 2
Training score - 78%
Test score - 74%
It is highly likely that model 1 is overfitting, but the test score is the same in both cases. So which of the two is the better-performing model?

I assume this is a hypothetical question where all other conditions are equal. In that case, I would argue with Occam's razor and declare the simpler model (probably model 2) the winner.
In practice, other factors might be important too. For example, have you extensively tuned hyperparameters to get to Model 2 and thus overfit to the test data?

Without any further information, I would agree that your first model does appear to be overfit. Other than that, both models have conceptually "learned" about the behavior of the real-world data underlying the training set with a similar level of accuracy, as shown by the identical test scores.
But because the first model is overfit, it has likely also incorporated noise from the training data. That extra information won't help the model and might actually hurt when it makes new predictions.
So, I would lean towards using the second model, if I had to choose one of the two.
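For what it's worth, the gap the question describes can be made explicit in code. A minimal sketch, assuming scikit-learn-style classifiers with `predict_proba` and an already held-out test split (the function and variable names are purely illustrative):

```python
from sklearn.metrics import roc_auc_score

def auc_gap(model, X_train, y_train, X_test, y_test):
    """Return train AUC, test AUC and the train-test gap for one fitted model."""
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return train_auc, test_auc, train_auc - test_auc

# A large gap (like 0.95 vs 0.74) signals overfitting; when the test AUCs are
# identical, the model with the smaller gap is usually the safer choice.
```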

In general it is hard to give a concrete answer without insight into the use case, the problem to be solved, and the model and training strategy you have chosen.
However, perhaps a differentiation between errors might help:
Bayes Error: the theoretically lowest possible error any classifier can reach on the task.
Human Error: the classification error exhibited by a human solving the task.
Avoidable Bias: the difference between the human/Bayes error and the error your model exhibits on the training set.
Avoidable Variance: the difference between the test error and the training error.
So in your case, it seems at first sight that model 1 is overfitting compared to model 2, since it has a much higher avoidable variance. That alone does not tell you which model is better; it depends. I would advise you to:
Take a closer look at your available data: what is its distribution? How does it differ from the upcoming data the model will actually be applied to?
Apply further training techniques to model 1 to see if you can reduce the test error: data augmentation (where it suits the task), weight regularization, dropout, etc.
If you have already done this extensively, then analyze the performance/computation cost of both models (which one is faster/lighter) and, as #saibot suggested, go with the simpler one (the one that consumes fewer resources) (Occam's razor).
Remember, the goal is not necessarily to get your test error equal to the training error; it is to get your test error as close as possible to the Bayes error.
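To make that decomposition concrete with the numbers from the question: treating 1 − score as a rough error rate and assuming a purely hypothetical human/Bayes error of 3% (that value is not given in the question), the split looks like this:

```python
# Rough illustration only: 1 - score is treated as an error rate and the
# human/Bayes error of 0.03 is an assumption, not a number from the question.
bayes_error = 0.03

for name, train_score, test_score in [("model 1", 0.95, 0.74),
                                      ("model 2", 0.78, 0.74)]:
    train_error = 1 - train_score
    test_error = 1 - test_score
    avoidable_bias = train_error - bayes_error      # model 1: ~0.02, model 2: ~0.19
    avoidable_variance = test_error - train_error   # model 1: ~0.21, model 2: ~0.04
    print(name, round(avoidable_bias, 2), round(avoidable_variance, 2))
```

So model 1 has little avoidable bias but a lot of avoidable variance, while model 2 is the other way around, which is exactly the trade-off the answer describes.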

Related

Is there any specific data behavior that is responsible for overfitting and underfitting?

Since I'm new to data science, I just want to know whether there is any specific data behavior that is responsible for overfitting and/or underfitting. For example, if we are dealing with linear regression, we are supposed to get the best-fit line through gradient descent. How can we then get overfitting or underfitting? I know what overfitting and underfitting are, but the problem is how either is possible when gradient descent has already been applied to find the best-fit line. I hope my question is clear.
Thanks and regards.
A small number of samples in the data can be a major reason for model over-fitting. Even if your model is simple, little variation in the data samples can make the model learn to perform well for "only" those samples, and it may not generalize well.
We can detect over-fitting in a linear model by looking at the number of features as well as the training and test errors.
If the model over-fits:
1. Many features are used for training relative to the amount of data provided.
2. The training error is much lower than the test error.
If the model under-fits:
1. Too little information is provided for training, i.e. too few features are used.
2. Both the training and the test error are high, and close to each other.
Using gradient descent is a good option, but the fitted model may still over-fit and fail on real-life data.
Hope this helps.
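To see how this can happen even when the model is fit with plain gradient descent, here is a small synthetic sketch (all data and numbers are invented for illustration): a linear model with far more features than samples, trained with scikit-learn's SGDRegressor and essentially no regularization, typically drives the training error far below the test error.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

# Tiny training set: 15 samples, 50 features, only the first one is informative.
def make_data(n):
    X = rng.normal(size=(n, 50))
    y = 3 * X[:, 0] + rng.normal(scale=1.0, size=n)
    return X, y

X_train, y_train = make_data(15)
X_test, y_test = make_data(500)

# Plain linear regression fitted by (stochastic) gradient descent,
# with essentially no regularization.
sgd = SGDRegressor(alpha=1e-12, max_iter=20000, tol=None, random_state=0)
sgd.fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, sgd.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, sgd.predict(X_test)))
# With far more features than samples, the gradient-descent fit makes the
# training error much smaller than the test error: overfitting, even though
# both the model and the optimizer are "simple".
```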

Is the validation accuracy always higher than testing accuracy in deep learning?

I met a professor who told me that generally speaking, validation accuracy is always higher than testing accuracy.
He claimed that the testing dataset is used only to test the final model. Although the validation dataset is used only to tweak hyperparameters and only the training data is shown to the model, the model developer can still carefully pick the best model according to the validation accuracies over many training runs.
However, access to the testing data is generally limited. For example, in some competitions one evaluation of a submitted test result per day is quite common. This way, we cannot cherry-pick the model that achieves the best accuracy on both the validation and testing datasets. Therefore, our best model on the validation data is usually not the best one on the testing data. Yet this speaker still believes the claim, even for datasets where the ground truth (GT) of the testing set is released.
I know that the data distributions of the validation and testing datasets are generally designed to be similar. However, this is not guaranteed. For example, in a general-purpose object detection dataset, the "difficulty" of the same class of objects in the validation and testing datasets might differ. To be more specific, assume the detection target is persons; we all know that small, occluded or truncated persons are harder to detect. However, it is practically difficult to control the distribution of size, occlusion and truncation level across the validation and testing datasets accordingly.
Therefore, it is possible for the testing accuracy to be higher than the validation accuracy when the GT of both datasets is available.
No. There is no strong clue indicating which one will be higher unless a sampling bias is identified or introduced.
Consider an extreme case in which your model heavily overfits the training set and the data distributions $p_{train}$, $p_{val}$ and $p_{test}$ relate as below.
$$p_{train} = p_{val} \neq p_{test}$$
In this case, the validation accuracy will be significantly higher than the testing accuracy; with the roles reversed, the opposite holds.
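A toy simulation of this point (everything here is synthetic and only meant to illustrate): when the test distribution differs from the validation distribution, the test accuracy can land on either side of the validation accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

def sample(n, shift=0.0):
    """Two Gaussian classes; a positive `shift` moves them closer (harder task)."""
    y = rng.randint(0, 2, size=n)
    centers = (y * 2.0 - 1.0) * (1.0 - shift)   # class centers at +/-(1 - shift)
    X = (centers + rng.normal(scale=1.0, size=n)).reshape(-1, 1)
    return X, y

X_train, y_train = sample(1000)
X_val, y_val = sample(1000)                           # same distribution as training
X_test_hard, y_test_hard = sample(1000, shift=0.6)    # harder test distribution
X_test_easy, y_test_easy = sample(1000, shift=-0.6)   # easier test distribution

clf = LogisticRegression().fit(X_train, y_train)
print("val accuracy:      ", clf.score(X_val, y_val))
print("hard test accuracy:", clf.score(X_test_hard, y_test_hard))  # lower than val
print("easy test accuracy:", clf.score(X_test_easy, y_test_easy))  # higher than val
```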

Model selection for classification with random train/test sets

I'm working with an extremely unbalanced and heterogeneous multiclass {K = 16} database for research, with a small N ~= 250. For some labels the database has a sufficient number of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database, for a number of reasons.
As a first approach I divided my database into training (80%) and test (20%) sets in a stratified way, and applied several classification algorithms on top of that. I repeated this procedure over 500 stratified train/test splits (each stratified sampling draws individuals randomly within each stratum), hoping to select an algorithm (model) that performs acceptably.
Because of the nature of my database, the performance on the test set varies greatly depending on which examples end up in the training set. I get runs with accuracy as high as 82% (high for my application) and runs as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure what the standard procedure is (if there is any) for selecting the best-performing model. My rationale is that the highest-scoring model may generalize better because the specific examples selected for its training set are richer, so that the test set is better classified. However, I'm fully aware of the possibility that the test set is composed of "simpler" cases that are easier to classify, or that the training set comprises all the hard-to-classify cases.
Is there any standard procedure for selecting the best-performing model given that the distribution of examples across my train/test splits causes the results to vary so much? Am I making a conceptual mistake somewhere? Do practitioners usually just select the best-performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as some models obviously generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted labels on the test set for one of the best cases: [confusion matrix image]
Confusion matrix of the predicted labels on the test set for one of the worst cases: [confusion matrix image]
Both cases use the same algorithm and parameters.
Good Accuracy =/= Good Model
I first want to point out that a good accuracy on your test set need not equal a good model in general! In your case this mainly has to do with your extremely skewed distribution of samples.
Especially when doing a stratified split with one class dominantly represented, you will likely get good results simply by predicting this one class over and over again.
A good way to see whether this is happening is to look at the confusion matrix of your predictions.
If there is one class that most of the other classes get confused with, that is an indicator of a bad model. I would argue that in your case it will generally be very hard to find a good model unless you actively try to balance your classes more during training.
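As a made-up illustration of what such a confusion matrix can reveal (the labels are invented; class 0 is the dominant one):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels from one test split (3 classes).
y_true = np.array([0] * 40 + [1] * 6 + [2] * 4)
y_pred = np.array([0] * 40 + [0] * 5 + [1] * 1 + [0] * 3 + [2] * 1)

print(confusion_matrix(y_true, y_pred))
# [[40  0  0]
#  [ 5  1  0]
#  [ 3  0  1]]
# The first column absorbs most samples of classes 1 and 2: the classifier is
# essentially predicting the dominant class, despite ~84% overall accuracy.
```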
Use the power of Ensembles
Another idea is indeed to use an ensemble of multiple models (in your case resulting from the different splits), since an ensemble can be expected to generalize better.
Even if you might sacrifice a lot of accuracy on paper, I would bet that the confusion matrix of an ensemble will likely look much better than that of a single "high accuracy" model. Especially if you disregard the models that perform extremely poorly (making sure, again, that the "poor" performance comes from an actually bad model and not just an unlucky split), I can see this generalizing very well.
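One possible way to build such an ensemble, sketched with invented stand-in data (your actual models and splits would go in its place):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced multiclass data standing in for the real database.
X, y = make_classification(n_samples=250, n_classes=4, n_informative=6,
                           weights=[0.7, 0.15, 0.1, 0.05], random_state=0)

# Train one model per stratified split and keep them all.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
models = [DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
          for train_idx, _ in splitter.split(X, y)]

def ensemble_predict(X_new):
    """Majority vote over the models trained on different splits."""
    votes = np.stack([m.predict(X_new) for m in models])   # (n_models, n_samples)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)

print(ensemble_predict(X[:5]), y[:5])
```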
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you divide your data into k equally large sets and then always train on k−1 of them while evaluating on the remaining one. You then not only get a feeling for whether your split was reasonable (k-fold CV implementations, like the one from sklearn, usually give you the results for all the different splits), but you also get an overall score that is the average over all folds.
Note that 5-fold CV amounts to a split into five 20% sets, so essentially what you are doing now, plus the "shuffling" part.
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.
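A minimal sketch of stratified k-fold CV with scikit-learn, again on invented stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy stand-in data; substitute your own X, y.
X, y = make_classification(n_samples=250, n_classes=4, n_informative=6,
                           weights=[0.7, 0.15, 0.1, 0.05], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

print(scores)                      # one accuracy per fold: shows split dependence
print(scores.mean(), scores.std()) # overall score and its spread
```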

Is a decision tree with a perfect attribute considered overfit?

I have a 6-dimensional training dataset in which there is a perfect numeric attribute that separates all the training examples this way: if TIME < 200 the example belongs to class1; if TIME >= 200 the example belongs to class2. J48 creates a tree with only one level and this attribute as the only node.
However, the test dataset does not follow this hypothesis and all of its examples are misclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not, as the dataset is that simple; but as far as I understand the definition of overfitting, it implies a very close fit to the training data, and that is what I have. Any help?
Usually a great training score and a bad test score means overfitting. But this assumes the data are IID, and you are clearly violating this assumption: your training data are completely different from the test data (there is a clear rule in the training data which has no meaning for the test data). In other words, your train/test split is incorrect, or your whole problem does not follow the basic assumptions under which statistical ML is applicable. Of course, we often fit models without valid assumptions about the data; in your case the most natural approach is to drop the feature that violates the assumption the most, i.e. the one used to construct the node. This kind of "expert decision" should be made prior to building any classifier: you have to think about what is different in the test scenario compared to the training one and remove the things that show this difference. Otherwise you have a heavy skew in your data collection, and statistical methods will fail.
Yes, it is an overfit. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different from any other: it has the answer embedded within it, while your test set doesn't. Any learning algorithm will likely find the correlation with the answer and use it, and, just like the J48 algorithm, will regard the other variables as noise. The software equivalent of Clever Hans.
You can overcome this either by removing the variable or by training on a set drawn randomly from the entire available data. However, since you know that there is a subset with an embedded major hint, you should remove the hint.
You're lucky. At times these hints can be quite subtle, and you won't discover them until you start applying the model to future data.
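A small sketch of the "remove the hint" advice, using invented data and scikit-learn's DecisionTreeClassifier as a stand-in for Weka's J48:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical stand-in for the dataset: TIME perfectly splits the training
# labels (as in the question), f1 carries a weak but genuine signal, f2 is noise.
n = 100
label = np.r_[np.zeros(n // 2, dtype=int), np.ones(n // 2, dtype=int)]
train = pd.DataFrame({
    "TIME": np.where(label == 0, rng.uniform(0, 200, n), rng.uniform(200, 400, n)),
    "f1": label + rng.normal(scale=1.5, size=n),
    "f2": rng.normal(size=n),
    "label": label,
})

# Remove the leaky attribute before fitting, as the answers suggest.
X = train.drop(columns=["TIME", "label"])
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, train["label"])
```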

What is the right way to measure if a machine learning model has overfit?

I understand the intuitive meaning of overfitting and underfitting. Now, given a particular machine learning model trained on the training data, how can you tell whether the training overfitted or underfitted the data? Is there a quantitative way to measure these factors?
Can we look at the error and say if it has overfit or underfit?
I believe the easiest approach is to have two sets of data: training data and validation data. You train the model on the training data for as long as its fitness on the training data stays close to its fitness on the validation data. When the model's fitness keeps improving on the training data but not on the validation data, you're overfitting.
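A rough sketch of that idea (synthetic data; SGDClassifier is just a convenient model that can be trained incrementally, one epoch at a time):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for "training data" and "validation data".
X, y = make_classification(n_samples=2000, n_features=50, n_informative=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SGDClassifier(random_state=0)
classes = np.unique(y)
for epoch in range(1, 51):
    clf.partial_fit(X_tr, y_tr, classes=classes)   # one pass over the training data
    if epoch % 10 == 0:
        print(epoch, "train:", round(clf.score(X_tr, y_tr), 3),
              "val:", round(clf.score(X_val, y_val), 3))
# Once the training score keeps improving while the validation score stalls or
# drops, the model has started to overfit; that divergence is the signal.
```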
The usual way, I think, is known as cross-validation. The idea is to split the training set into several pieces, known as folds, then pick one at a time for evaluation and train on the remaining ones.
It does not, of course, measure the actual overfitting or underfitting, but if you can vary the complexity of the model, e.g. by changing the regularization term, you can find the optimal point. This is as far as one can go with just training and testing, I think.
You don't look at the error on the training data, but on the validation data only.
A common way of testing is to try different model complexities and see how the validation error changes with complexity. These curves usually have a typical shape: in the beginning the error improves quickly, then there is a plateau (where the model is good), and then the error starts getting worse again, not because the model is less able to fit, but because of overfitting. You want to be at the low-complexity end of the plateau: the simplest model that provides reasonable generalization.
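One common way to produce such a complexity-versus-error curve is scikit-learn's validation_curve; here is a sketch on synthetic data, sweeping the complexity of an SVM via its C parameter (all numbers are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Sweep model complexity via the SVM's C parameter; small C = simpler model.
Cs = np.logspace(-2, 3, 6)
train_scores, val_scores = validation_curve(SVC(gamma="scale"), X, y,
                                            param_name="C", param_range=Cs, cv=5)

for C, tr, va in zip(Cs, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"C={C:g}  train={tr:.2f}  val={va:.2f}")
# The validation score typically rises, plateaus, then falls as complexity
# grows; pick the simplest model on the plateau.
```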
The existing answers are not strictly speaking wrong, but they are not complete. Yes, you do need a validation set, but an important point is that you should not simply look at the model error on the validation set and try to minimize it: that will lead to overfitting all the same, because you will effectively be fitting to the validation set. The right goal is not minimizing the error on your sets, but making the error independent of which training and validation sets you use. If the error on the validation set is significantly different (it doesn't matter whether it is worse or better), then the model is overfit. And, of course, this should be done in a cross-validation manner, where you train on one random subset and then validate on another.
