Is there any specific data behavior that is responsible for overfitting and underfitting? - machine-learning

I'm new to data science and want to know whether any specific data behavior is responsible for overfitting and/or underfitting. If we are doing linear regression and obtain the best-fit line through gradient descent, how can overfitting or underfitting still occur? I know what overfitting and underfitting are; what I don't understand is how they are possible once gradient descent has already found the best-fit line. I hope my question is clear.
Thanks and regards.

A small number of samples in the data can be a major cause of overfitting. Even if your model is simple, low variance (or variation) in the data samples can make the model learn to perform well on "only" those samples, so it may not generalize well.

We can detect overfitting in a linear model by looking at the number of features and at the training and test errors.
If the model overfits:
1. Many features are used for training relative to the amount of data.
2. The training error is much lower than the test error.
If the model underfits:
1. Too few features are used for training.
2. Both the training error and the test error are high, and they are close to each other.
Gradient descent is a good option for fitting, but it only minimizes the error on the training data, so the resulting model may still overfit and fail on real-life data.
Hope this helps.
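To make the train/test comparison above concrete, here is a minimal sketch. It assumes NumPy is available, and the data (a noisy straight line), the sample sizes, and the polynomial degrees are all made up for illustration. It also swaps gradient descent for `np.polyfit`, which solves the same least-squares problem directly; the point is that any optimizer only finds the best fit to the training samples.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def make_data(n):
    # Hypothetical ground truth: a straight line plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(8)    # deliberately tiny training set
x_test, y_test = make_data(200)    # held-out data from the same distribution

results = {}
for degree in (1, 7):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The degree-7 polynomial has as many coefficients as there are training points, so it can pass through all of them and its training error is essentially zero. Its test error is what shows whether it generalizes.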

Related

Model overfits when you don't have much varied data

I am trying to understand why it is that a model overfits when you have little data to run with.
I get the usual intuitive idea behind it: the model essentially "memorizes" whatever little data (or, to be specific, variation) you've given it.
But is there a more robust reason for this?
Couldn't you for example with a small dataset (or large one) with very little variation, just force it to not overfit by constraining the model or adding some form of regularization?
P.S. I have seen an explanation detailing how failing to include the kind of variance that exists in the population can lead the model to generalize less and less. But is this just a quick way to rationalize it, or is there, as I mentioned above, a way to compensate for this lack of variance in the data?
Yes, you can add regularization, batch normalization, or even dropout to reduce overfitting. A model overfits when you have too little data compared to the number of parameters in the model, such as the weights in a neural network.
You can also update the model on batches rather than on individual samples; that way the model is less likely to overfit the data.
You can also add noise to the data to reduce overfitting.
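As a rough illustration of what regularization does when there are few samples relative to parameters, here is a sketch of L2 (ridge) regression in closed form. It assumes NumPy; the data, the sizes, and the penalty values `lam` are hypothetical choices, not something from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 10, 8  # few samples relative to parameters
X = rng.normal(size=(n_samples, n_features))
# Hypothetical target: only feature 0 matters, the rest is noise.
y = X[:, 0] + rng.normal(scale=0.5, size=n_samples)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

norms = {}
for lam in (0.0, 1.0, 10.0):
    w = ridge_fit(X, y, lam)
    norms[lam] = float(np.linalg.norm(w))
    print(f"lam = {lam:>4}: ||w|| = {norms[lam]:.3f}")
```

A larger penalty shrinks the weights, which limits how much the model can bend to fit noise in the few samples it sees. The right value of `lam` would normally be chosen on validation data.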

Which ML model is better?

I have built two ML models with the following roc_auc_score values:
Model 1
Training score - 95%
Test score - 74%
Model 2
Training score - 78%
Test score - 74%
It is highly likely that Model 1 is overfitting, but the test score is the same in both cases. So which of the two is the better-performing one?
I assume this is a hypothetical question where all other conditions are equal. In this case, I would argue from Occam's razor and declare the simpler model (probably Model 2) the winner.
In practice, other factors might be important too. For example, have you extensively tuned hyperparameters to get to Model 2 and thus overfit to the test data?
Without any further information, I would agree that your first model does appear to be overfit. Other than that, both models conceptually have "learned" about the behavior of the underlying real world training data with a similar level of accuracy, as given by the identical test scores.
But because the first model is overfit, it means that the first model also has possibly incorporated noise from the training data. This additional information won't help the model, and might actually hurt with making new predictions.
So, I would lean towards using the second model, if I had to choose one of the two.
In general, it is hard to give a concrete answer without insight into the use case, the problem to be solved, and the model and training strategy you have chosen.
However, perhaps a distinction between types of error might help:
Bayes Error: The theoretically lowest possible error a classifier can reach.
Human Error: The classification error exhibited by a human solving the task.
Avoidable Bias: The difference between the human/Bayes error and the error of your model evaluated on the training set.
Avoidable Variance: The difference between the test error and the training error.
So in your case, it seems at first sight that Model 1 is overfitting compared to Model 2, since it has a higher variance. That does not necessarily mean Model 2 is better; it depends. I would advise you to:
Take a closer look at your available data: what is the distribution of the data? How does it differ from the data the model is likely to see once it is deployed?
Apply further training techniques to Model 1 to see if you can reduce the test error: data augmentation (appropriate to the task), weight regularization, dropout, etc.
If you have already done this extensively, then I would analyze the performance/computation cost of both models (which one is faster/lighter) and, as @saibot suggested, go with the simpler one (the one that consumes fewer resources), following Occam's razor.
Remember, the goal is not necessarily to get your test error equal to the training error. It is actually to get your test error as close as possible to the Bayes error.
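The bias/variance breakdown above can be worked through with the scores from the question. This is only a sketch: the question reports ROC AUC rather than an error rate, so the gap to the best achievable score is used as a loose proxy for error, and since no human-level or Bayes score is given, the value `BEST_ACHIEVABLE = 100` is a purely hypothetical stand-in.

```python
# Scores from the question, in percentage points of ROC AUC.
models = {
    "Model 1": {"train": 95, "test": 74},
    "Model 2": {"train": 78, "test": 74},
}

# ASSUMPTION: no human-level or Bayes score is given in the question,
# so 100 is a hypothetical stand-in for the best achievable score.
BEST_ACHIEVABLE = 100

gaps = {}
for name, score in models.items():
    avoidable_bias = BEST_ACHIEVABLE - score["train"]  # gap to the baseline
    variance = score["train"] - score["test"]          # train/test gap
    gaps[name] = (avoidable_bias, variance)
    print(f"{name}: avoidable bias = {avoidable_bias} points, "
          f"variance = {variance} points")
```

Under that assumption, Model 1 has a small bias gap (5 points) and a large variance gap (21 points), while Model 2 has the reverse (22 and 4). That suggests variance-reduction techniques are the place to start for Model 1.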

Test accuracy is greater than train accuracy what to do?

I am using a random forest. My test accuracy is 70%, whereas my train accuracy is 34%. What should I do? How can I solve this problem?
Test accuracy should not be higher than train accuracy, since the model is optimized for the latter. Ways in which this behavior might happen:
You did not use the same source dataset for the test set. You should do a proper train/test split in which both parts have the same underlying distribution. Most likely you provided a completely different (and more agreeable) dataset for testing.
An unreasonably high degree of regularization was applied. Even so, there would need to be some element of "the test data distribution is not the same as that of the training data" for the observed behavior to occur.
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting the harder patterns, so you'll need to increase the expressiveness of your model. In the case of random forests, this might mean training the trees to a greater depth.
First you should check the data that is used for training. I think there is some problem with the data, the data may not be properly pre-processed.
Also, in this case, you should try more epochs. Plot the learning curve to analyze when the model is going to converge.
You should check the following:
1. Both training and validation accuracy should increase, and the loss should decrease.
2. If step 1 stops holding after a particular epoch, train your model only until that epoch, because it is overfitting after that.
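The "train only until that epoch" check can be sketched as a small helper that scans a list of per-epoch validation losses. The function name `best_epoch`, the `patience` parameter, and the loss values are all hypothetical, chosen just to illustrate the idea.

```python
def best_epoch(val_losses, patience=3):
    """Return the index of the epoch to stop at.

    Scanning stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_idx = 0
    for idx, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = idx                 # new best validation loss
        elif idx - best_idx >= patience:
            break                          # no improvement for too long
    return best_idx

# Hypothetical validation losses: they improve, then rise as the model overfits.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.49]
print(best_epoch(losses))  # prints 3
```

Note that the late dip to 0.49 is never reached with `patience=3`. A larger patience would find it, at the cost of training longer, so the choice is a trade-off.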

How to overcome overfitting in CNN - standard methods don't work

I've recently been playing around with the Cars dataset from Stanford (http://ai.stanford.edu/~jkrause/cars/car_dataset.html).
From the very beginning I had an overfitting problem, so I decided to:
Add regularization (L2, dropout, batch norm, ...)
Try different architectures (VGG16, VGG19, InceptionV3, DenseNet121, ...)
Try transfer learning using models trained on ImageNet
Use data augmentation
Every step moved me a little bit forward. However, I finished with 50% validation accuracy (having started below 20%) compared to 99% train accuracy.
Do you have any idea what more I can do to get to around 80-90% accuracy?
Hope this can help some people! :)
Things you should try include:
Early stopping, i.e. use a portion of your data to monitor validation loss and stop training if performance does not improve for some epochs.
Check whether you have unbalanced classes; if so, use class weighting so that each class is represented equally in the loss.
Regularization parameter tuning: different l2 coefficients, different dropout values, different regularization constraints (e.g. l1).
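For the class-weighting suggestion, one common heuristic is to weight each class by `n_samples / (n_classes * class_count)`, which is the formula scikit-learn uses for `class_weight="balanced"`. The sketch below computes it in plain Python; the label names and counts are made up.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Hypothetical imbalanced label list: 8 of one class, 2 of the other.
labels = ["sedan"] * 8 + ["coupe"] * 2
weights = balanced_class_weights(labels)
print(weights)  # prints {'sedan': 0.625, 'coupe': 2.5}
```

Rare classes get weights above 1 and frequent ones below 1, so each class contributes about equally to the loss.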
Other general suggestions may be to try and replicate the state of the art models on this particular dataset, see if those perform as they should.
Also make sure to have all implementation details ironed out (e.g. convolution is being performed along width and height, and not along the channels dimension - this is a classic rookie mistake when starting out with Keras, for instance).
It would also help to have some more details on the code that you are using, but for now these suggestions will do.
50% accuracy on a 200-class problem doesn't sound so bad anyway.
Cheers
For those who encounter the same problem: I managed to get 66.11% accuracy, mainly by playing with dropout, learning rate, and learning rate decay.
The best results were achieved with the VGG16 architecture.
The model is on https://github.com/michalgdak/car-recognition

What is the right way to measure if a machine learning model has overfit?

I understand the intuitive meaning of overfitting and underfitting. Now, given a particular machine learning model that is trained upon the training data, how can you tell if the training overfitted or underfitted the data? Is there a quantitative way to measure these factors?
Can we look at the error and say if it has overfit or underfit?
I believe the easiest approach is to have two sets of data: training data and validation data. You train the model on the training data as long as the fitness of the model on the training data is close to its fitness on the validation data. When the model's fitness keeps increasing on the training data but not on the validation data, you're overfitting.
The usual way, I think, is known as cross-validation. The idea is to split the training set into several pieces, known as folds, then pick one at a time for evaluation and train on the remaining ones.
It does not, of course, measure the actual overfitting or underfitting, but if you can vary the complexity of the model, e.g. by changing the regularization term, you can find the optimal point. This is as far as one can go with just training and testing, I think.
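The fold-splitting idea can be sketched in a few lines of plain Python. The helper name `k_fold_indices` and its parameters are hypothetical; in practice a library routine such as scikit-learn's `KFold` would normally be used instead.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

splits = list(k_fold_indices(n_samples=10, k=5))
for train, val in splits:
    print(f"train on {sorted(train)}, validate on {sorted(val)}")
```

Each sample lands in the validation fold exactly once. Averaging the validation error over the folds gives a less split-dependent estimate than a single train/validation split.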
You don't look at the error on the training data, but on the validation data only.
A common way of testing is to try different model complexities and see how the error changes with model complexity. These curves usually have a typical shape. In the beginning, the errors improve quickly. Then there is saturation (where the model is good). Past that point, any further drop in training error comes not from a better model but from overfitting, and the validation error starts to rise. You want to be at the low-complexity end of the plateau: the simplest model that provides reasonable generalization.
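Picking "the low-complexity end of the plateau" can be sketched as choosing the simplest model whose validation error is within some tolerance of the best one. The function name, the `tolerance` value, and the error values below are all hypothetical.

```python
def simplest_good_model(val_errors, tolerance=0.02):
    """Return the index of the least complex model whose validation error
    is within `tolerance` of the best one.

    `val_errors` must be ordered from the simplest model to the most complex.
    """
    threshold = min(val_errors) + tolerance
    for idx, err in enumerate(val_errors):
        if err <= threshold:
            return idx

# Hypothetical validation errors for increasing model complexity.
errors = [0.50, 0.30, 0.21, 0.20, 0.20, 0.25]
print(simplest_good_model(errors))  # prints 2
```

Here the model at index 3 has the lowest error, but the one at index 2 is nearly as good and simpler, so it is preferred. How much tolerance is acceptable depends on the problem.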
The existing answers are not, strictly speaking, wrong, but they are not complete. Yes, you do need a validation set, but an important issue here is that you should not simply look at the model error on the validation set and try to minimize it. That will lead to overfitting all the same, because you will effectively be fitting to the validation set. The right approach is not to minimize the error on your sets, but to make the error independent of which training and validation sets you use. If the error on the validation set is significantly different from the error on the training set (it doesn't matter whether it is worse or better), then the model is overfit. Also, this should certainly be done in a cross-validation fashion, where you train on one random set and then validate on another random set.

Resources