I am trying to understand why a model overfits when you have little data to work with.
I get the typical intuitive explanation, whereby the model essentially "memorizes" whatever little data (or, more specifically, the limited variations) you've given it.
But is there a more robust reason for this?
Couldn't you, for example, take a small dataset (or a large one with very little variation) and simply force the model not to overfit by constraining it or adding some form of regularization?
P.S. I have seen an explanation detailing how failing to introduce the kind of variance that exists within the population can definitely lead the model to generalize less and less well. But is this just a quick way to rationalize it, or is there, as I asked above, a way to compensate for this lack of variance in the data?
Yes, you can add regularization, batch normalization or even dropout to reduce overfitting. A model overfits when you have too little data compared to the number of parameters in the model, such as the weights in a neural network.
You can also fit the model on batches of samples rather than on individual samples; that way your model is less likely to overfit the data.
You can also add noise to the data to reduce overfitting.
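For instance, a minimal sketch of these ideas in a neural-network setting (assuming TensorFlow/Keras; the input width, layer sizes, noise level, dropout rate, and L2 strength are all illustrative, not tuned):

```python
# Minimal sketch: noise injection, L2 weight regularization, batch normalization
# and dropout combined in one small network (assumed TensorFlow/Keras API).
import tensorflow as tf

n_features = 20  # hypothetical input width

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.GaussianNoise(0.1),          # adds noise to inputs during training
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 weight penalty
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.5),                # randomly drops units during training
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# Train with mini-batches, e.g. model.fit(X, y, batch_size=32, ...), rather than
# updating on individual samples.
```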
Normalization, e.g. z-scoring, is a common preprocessing method in machine learning.
I am analyzing a dataset and using ensemble methods like Random Forests or the XGBoost framework.
Now I compare models using
1. non-normalized features
2. z-scored features
Using cross-validation I observe in both cases that with a higher max_depth parameter the training error decreases.
For case 1 the test error also decreases and saturates at a certain MAE.
For the z-scored features, however, the test error does not decrease at all.
In this question: https://datascience.stackexchange.com/questions/16225/would-you-recommend-feature-normalization-when-using-boosting-trees it was discussed that normalization is not necessary for tree-based methods. But the example above shows that it has a severe effect.
So I have a few questions regarding this:
Does this imply that overfitting with ensemble-based methods is possible even when the test error decreases?
Should normalization like z-scoring always be common practice when working with ensemble methods?
Is it possible that normalization methods decrease the model performance?
Thanks!
It is not easy to see what is going on in the absence of any code or data.
Normalisation may or may not be helpful depending on the particular data and how the normalisation step is applied.
Tree-based methods ought to be robust enough to handle the raw data.
In your cross-validation, is your code doing the normalisation separately for each fold?
Doing a single normalisation prior to CV may lead to significant leakage.
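As an illustration of keeping the normalisation inside each fold, here is a minimal sketch (assuming scikit-learn; the synthetic data, the Random Forest settings and the MAE scoring are placeholders for your setup):

```python
# Sketch: a scaler inside a Pipeline is re-fit on each training fold only, so the
# validation fold never leaks into the normalisation (assumed scikit-learn API).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=6, noise=15, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     RandomForestRegressor(max_depth=4, random_state=0))
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
print("cross-validated MAE:", -scores.mean())
```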
With very high values of depth you will have a much more complex model that will fit the training data well but will fail to generalise to new data.
I tend to prefer max depths from 2 to 5.
If I can't get a reasonable model I turn my efforts to feature engineering rather than trying to tweak the hyperparameters too much.
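To see where extra depth stops paying off, one option is a validation curve over max_depth; a sketch, assuming scikit-learn and synthetic stand-in data:

```python
# Sketch: training vs. cross-validated error as a function of tree depth
# (assumed scikit-learn API; the data are synthetic stand-ins).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

X, y = make_regression(n_samples=300, n_features=6, noise=15, random_state=0)
depths = [2, 3, 5, 8, 12, 16]
train_scores, val_scores = validation_curve(
    RandomForestRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="neg_mean_absolute_error")

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Training MAE that keeps improving while validation MAE stalls signals overfitting.
    print(f"max_depth={d}: train MAE={-tr:.2f}, cv MAE={-va:.2f}")
```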
Since I'm new to data science, I just want to know whether there is any specific behavior in the data that is responsible for overfitting and/or underfitting. For example, with linear regression we are supposed to get the best-fit line through gradient descent. How, then, can we end up with overfitting or underfitting? I know what overfitting and underfitting are, but my problem is how they are possible when you have already applied gradient descent to get the best-fit line. I hope my question is clear.
Thanks and regards.
A small number of samples in the data can be a major reason for model overfitting. Even if your model is simple, low variance (or variation) in the data samples can make the model learn to perform well for only those samples, so it may not generalize well.
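To make this concrete, here is a small numpy sketch (the sine data, noise level and polynomial degrees are made up for illustration): each fit is the exact least-squares "best fit" for its capacity, yet with only a dozen samples the high-capacity fit generalizes badly.

```python
# Sketch: the "best fit" on a tiny training set can still generalize poorly
# (numpy only; the data-generating function and degrees are illustrative).
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 12)                              # only 12 training points
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 12)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)             # least-squares fit
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
# The degree-9 fit drives training error near zero while test error blows up;
# degree 1 underfits (both errors high); degree 3 sits in between.
```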
We can detect overfitting in a linear model by looking at the number of features and comparing the training error with the testing error.
If the model overfits:
1. Many features are used to train relative to the amount of data provided.
2. The training error is much lower than the testing error.
If the model underfits:
1. Too few features (or too little data) are used to train, so the model cannot capture the pattern.
2. The training error itself is high and close to the testing error.
Using gradient descent is a good option, but it may still lead to overfitting and fail on real-life data.
Hope this helps.
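As a rough rule of thumb in code for the comparison described above (the baseline here is the error of a trivial predictor, such as always predicting the mean, and the 0.9/1.5 thresholds are arbitrary illustration values, not standard constants):

```python
# Rough heuristic for labelling a fit from its errors; thresholds are illustrative only.
def diagnose_fit(train_error: float, test_error: float, baseline_error: float) -> str:
    if train_error > 0.9 * baseline_error:
        return "underfitting: barely better than a trivial baseline, even on training data"
    if test_error > 1.5 * train_error:
        return "overfitting: fits the training data far better than unseen data"
    return "reasonable fit: training and test errors are close and well below the baseline"

print(diagnose_fit(train_error=0.10, test_error=0.95, baseline_error=1.0))  # overfitting
```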
I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.
You can use the full data for prediction, but it is better to retain the indexes of the train and test data. Here are the pros and cons of it:
Pros:
If you retain the indexes of the rows belonging to the train and test data, you only need to predict once (saving time) to get all the results. You can then calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for the train and test data separately, after subsetting the actual and predicted values using the train and test indexes (see the sketch after this list).
Cons:
If you calculate performance indicators for the entire dataset (without clearly differentiating train and test using the indexes), you will get overly optimistic estimates. This happens because, having been trained on the train data, the model gives good results on that same data, which, depending on the percentage split of train and test, will give illusory good performance-indicator values.
Processing a large amount of data at once may also create a memory bulge, which can result in a crash in all-objects-in-memory languages like R.
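A minimal sketch of the index-keeping idea (assuming scikit-learn; the synthetic data, the linear model and MAE are stand-ins for your own setup):

```python
# Sketch: predict once over all rows, then score train and test subsets separately
# via the retained indexes (assumed scikit-learn API; data and model are stand-ins).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
idx_train, idx_test = train_test_split(np.arange(len(y)), test_size=0.3, random_state=0)

model = LinearRegression().fit(X[idx_train], y[idx_train])
pred = model.predict(X)                          # a single prediction pass over everything

print("train MAE:", mean_absolute_error(y[idx_train], pred[idx_train]))
print("test  MAE:", mean_absolute_error(y[idx_test], pred[idx_test]))
```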
In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are settled then, in general, as long as your data is generated by the same process, the more data you train on the better your predictions will be, so you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. You might be harming your prediction, by including more data, in this case.
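A sketch of the "select with held-out data, then refit on everything" workflow (assuming scikit-learn; the candidate models, their settings and the synthetic data are illustrative):

```python
# Sketch: pick a model family by cross-validation, then fit the winner on all data
# (assumed scikit-learn API; candidates and data are illustrative).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=400, n_features=10, noise=20, random_state=0)
candidates = {"linear regression": LinearRegression(),
              "kernel-SVR": SVR(kernel="rbf", C=10.0)}

best_name = max(candidates,
                key=lambda name: cross_val_score(candidates[name], X, y, cv=5).mean())
final_model = candidates[best_name].fit(X, y)    # the final fit uses every sample
print("selected:", best_name)
```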
Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the model you're using hasn't actually been tested on real data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.
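For reference, a minimal k-fold sketch (assuming scikit-learn; the logistic-regression model and synthetic data are placeholders): every sample is used for evaluation exactly once across the folds.

```python
# Sketch of 5-fold cross-validation: train on four folds, evaluate on the fifth,
# rotate, and average (assumed scikit-learn API; data and model are stand-ins).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))  # held-out fold accuracy

print("mean CV accuracy:", np.mean(fold_scores))
```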
I have a 6-dimensional training dataset where there is a perfect numeric attribute which separates all the training examples this way: if TIME<200 then the example belongs to class1, if TIME>=200 then the example belongs to class2. J48 creates a tree with only one level and this attribute as the only node.
However, the test dataset does not follow this hypothesis and all the examples are misclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not, as the dataset is so simple, but as far as I understood the definition of overfitting, it implies a high degree of fitting to the training data, and that is what I have. Any help?
Usually a great training score and a bad testing score means overfitting. But this assumes the data are IID, and you are clearly violating this assumption: your training data is completely different from the testing data (there is a clear rule for the training data which has no meaning for the testing data). In other words, your train/test split is incorrect, or your whole problem does not follow the basic assumptions of where to use statistical ML. Of course we often fit models without valid assumptions about the data. In your case, the most natural approach is to drop the feature which violates the assumption the most: the one used to construct the node. This kind of "expert decision" should be made prior to building any classifier; you have to think about what is different in the test scenario as compared to the training one and remove things that show this difference. Otherwise you have a heavy skew in your data collection, and statistical methods will fail.
Yes, it is an overfit. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different than any other. It has the answer embedded within it while your test set doesn't. Any learning algorithm will likely find the correlation to the answer and use it and, just like the J48 algorithm, will regard the other variables as noise. The software equivalent of Clever Hans.
You can overcome this by either removing the variable or by training on a set drawn randomly from the entire available set. However, since you know that there is a subset with an embedded major hint, you should remove the hint.
You're lucky. At times these hints can be quite subtle which you won't discover until you start applying the model to future data.
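A sketch of both fixes (using scikit-learn's DecisionTreeClassifier as a rough stand-in for J48/C4.5; `df` is a hypothetical DataFrame holding the six attributes plus a `class` column, and the column names are illustrative):

```python
# Sketch: drop the give-away TIME attribute and evaluate on a randomly drawn test set
# (DecisionTreeClassifier as a rough analogue of J48; `df` is hypothetical).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_without_hint(df: pd.DataFrame) -> float:
    X = df.drop(columns=["class", "TIME"])       # remove the embedded hint
    y = df["class"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    return tree.score(X_te, y_te)                # accuracy on a randomly drawn test set
```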
I understand the intuitive meaning of overfitting and underfitting. Now, given a particular machine learning model that is trained upon the training data, how can you tell if the training overfitted or underfitted the data? Is there a quantitative way to measure these factors?
Can we look at the error and say if it has overfit or underfit?
I believe the easiest approach is to have two sets of data: training data and validation data. You train the model on the training data as long as its fitness on the training data stays close to its fitness on the validation data. When the model's fitness keeps increasing on the training data but not on the validation data, you're overfitting.
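One common way to automate this check is early stopping, which halts training once the validation loss stops improving; a sketch, assuming TensorFlow/Keras and synthetic stand-in data:

```python
# Sketch: stop training when validation loss stops improving while training loss
# keeps falling (assumed TensorFlow/Keras API; data and layer sizes are illustrative).
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)).astype("float32")
y = (X[:, 0] + 0.1 * rng.normal(size=500)).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                        restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=200, callbacks=[stop], verbose=0)
```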
The usual way, I think, is known as cross-validation. The idea is to split the training set into several pieces, known as folds, then pick one at a time for evaluation and train on the remaining ones.
It does not, of course, measure the actual overfitting or underfitting, but if you can vary the complexity of the model, e.g. by changing the regularization term, you can find the optimal point. This is as far as one can go with just training and testing, I think.
You don't look at the error on the training data, but on the validation data only.
A common way of testing is to try different model complexities and see how the error changes with model complexity. These curves usually have a typical shape. In the beginning, the error quickly improves. Then there is a plateau (where the model is good); beyond that, the error starts to get worse again, not because the model fits the training data less well, but because it is overfitting. You want to be on the low-complexity end of the plateau: the simplest model that provides reasonable generalization.
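A sketch of such a complexity sweep (assuming scikit-learn; here complexity is controlled by the Ridge regularization strength `alpha` on polynomial features, which is just one illustrative choice):

```python
# Sketch: sweep model complexity (smaller alpha = weaker regularization = more complex)
# and watch the cross-validated error (assumed scikit-learn API; data are synthetic).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=80, n_features=5, noise=25, random_state=0)
for alpha in (100.0, 10.0, 1.0, 0.1, 0.001):
    model = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=alpha))
    cv_mae = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_absolute_error").mean()
    print(f"alpha={alpha}: cross-validated MAE={cv_mae:.1f}")
# Pick the simplest model (largest alpha) whose validation error sits on the plateau.
```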
The existing answers are not strictly speaking wrong, but they are not complete. Yes, you do need a validation set, but an important point is that you do not simply look at the model error on the validation set and try to minimize it. That leads to overfitting all the same, because you will effectively be fitting to the validation set. The right approach is not to minimize the error on your sets, but to make the error independent of which training and validation sets you use. If the error on the validation set is significantly different from the training error (it doesn't matter whether it is worse or better), then the model is overfit. And, certainly, this should be done in a cross-validation fashion, where you train on one random subset and then validate on another random subset.