I have implemented a logistic regression model to predict high-risk claims. The model gives an accuracy of 97%. Does that indicate that my model is overfitting?
Accuracy on the training set alone is not enough to diagnose overfitting. Are you using a validation set? If not, you should split your data into three parts: a training set, a validation set, and a test set. During training, use the training and validation sets. If you have high accuracy on the training set and low accuracy on the validation set, then your model is overfitting.
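For illustration, here is a minimal sketch of that check using scikit-learn; the synthetic data and variable names are placeholders, not your claims data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic data stands in for the claims dataset.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

    # Hold out a test set first, then split the rest into training and validation.
    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
    # A large gap (high train accuracy, much lower validation accuracy) is the overfitting signal.
    # The test set is only touched once, after all tuning is finished.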
Related
I am new to machine learning. I have built a model that predicts whether a client will subscribe in the following month or not. I got 73.4 on the training set and 72.8 on the test set. Is that okay, or do I have overfitting?
It's ok.
Overfitting happens when the accuracy on the training set is high and the accuracy on the test set is substantially lower (more than a marginal difference).
This is what overfitting looks like.
Train accuracy: 99.4%
Test accuracy: 71.4%
You can, however, increase the accuracy by trying different models and feature engineering.
We call it over-fitting if the accuracy on the training data is abnormally high (say, greater than 95%) and the accuracy on the test data is very low (say, less than 65%).
In your case, the training and testing accuracies are very similar, so there is no over-fitting.
Try adding more test data and check whether the accuracy decreases. You can also try to improve the model by the following (a cross-validation sketch follows this list):
Trying different algorithms
Increasing the size of train data
Trying K-fold cross validation
Hyperparameter tuning
Using Regularization methods
Standardizing feature variables
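As an illustration of the cross-validation point above, a minimal sketch with scikit-learn on synthetic stand-in data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic data stands in for the real dataset.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    model = LogisticRegression(max_iter=1000)

    # 5-fold cross-validation: every record is used for validation exactly once.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print("fold accuracies:", scores.round(3))
    print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
    # A mean cross-validated accuracy far below the training accuracy points to overfitting.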
I have a simple question that has made me doubt my work all of a sudden.
If I only have a training and validation set, am I allowed to monitor val_loss while training, or does that add bias to my training? I want to test my accuracy at the end of training on my validation set, but now I am wondering: if I am monitoring that dataset while training, is that problematic or not?
Short answer: yes, monitoring validation error and using it as a basis for decisions about the specific setup of the algorithm adds bias to your algorithm. To elaborate a bit:
1) You fix the hyperparameters of an ML algorithm and then train it on the training set. The resulting model with that specific hyperparameter setup overfits to the training set, and you use the validation set to estimate the performance you can expect with these hyperparameters on unseen data.
2) But you obviously want to adjust your hyperparameters to get the best performance. You may be doing a grid search or something like it to find the best hyperparameter settings for this specific algorithm using the validation set. As a result, your hyperparameter settings overfit to the validation set. Think of it as some of the information about the validation set still leaking into your model through the hyperparameters.
3) As a result, you must do the following: split the data set into a training set, a validation set, and a test set. Use the training set for training and the validation set to make decisions about specific hyperparameters. When you are done (fully done!) fine-tuning your model, use the test set, which the model has never seen, to get an estimate of the final performance in production.
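A minimal sketch of that three-way protocol, assuming scikit-learn; the model and hyperparameter grid are illustrative, not prescribed by the answer:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

    # Tune on the validation set only.
    best_c, best_val_acc = None, -1.0
    for c in [0.01, 0.1, 1.0, 10.0]:  # illustrative hyperparameter candidates
        model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
        val_acc = accuracy_score(y_val, model.predict(X_val))
        if val_acc > best_val_acc:
            best_c, best_val_acc = c, val_acc

    # Only after tuning is fully done is the test set used, exactly once.
    final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X_trainval, y_trainval)
    print("chosen C:", best_c, "| test accuracy:", round(accuracy_score(y_test, final_model.predict(X_test)), 3))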
In machine learning, an overfitted model fits the training set very well but cannot generalize to new instances. I evaluated my model using cross-validation, and my accuracy drops when I set the maximum number of splits of my decision tree beyond a certain number. Can this be associated with overfitting?
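One way to see whether such a drop is an overfitting effect is to print cross-validated accuracy against tree complexity. A hedged sketch with scikit-learn, using max_depth as a stand-in for the maximum number of splits and synthetic data in place of the real set:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

    # Grow progressively deeper trees and watch the cross-validated accuracy.
    for depth in [2, 4, 8, 16, None]:  # None lets the tree grow without a depth limit
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        acc = cross_val_score(tree, X, y, cv=5).mean()
        print("max_depth =", depth, "-> mean CV accuracy:", round(acc, 3))
    # If accuracy improves at first and then drops as the tree gets more complex,
    # the drop is consistent with the tree starting to overfit.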
It is a binary photo classification problem; I extracted the features using AlexNet. The metric is log-loss. There are 25,000 records in the training set, 12,500 "1"s and 12,500 "0"s, so the data set is balanced.
I trained an XGBoost model. After tuning the parameters using cross-validation, the training log-loss is 0.078 and the validation log-loss is 0.09. But when I make predictions on the test set, the log-loss is 2.1. It seems that over-fitting is still pretty serious.
Why is that? Do I have to tune the parameters further or try another pre-trained model?
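For reference, a sketch of how the three log-loss numbers could be computed and compared, assuming xgboost's scikit-learn wrapper; the small synthetic feature matrix is only a stand-in for the extracted AlexNet features:

    from sklearn.datasets import make_classification
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Balanced synthetic data stands in for the AlexNet features.
    X, y = make_classification(n_samples=5000, n_features=50, weights=[0.5, 0.5], random_state=0)
    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

    model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(X_train, y_train)

    for name, Xs, ys in [("train", X_train, y_train), ("validation", X_val, y_val), ("test", X_test, y_test)]:
        print(name, "log-loss:", round(log_loss(ys, model.predict_proba(Xs)), 3))
    # A test log-loss far above the validation log-loss often points to a mismatch between
    # the test distribution and the training/validation data, or to information about the
    # validation data leaking into the model during tuning.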
I am using a pre-trained GoogLeNet and have fine-tuned it on my dataset for classifying 11 classes. The validation dataset gives a "loss3/top1" accuracy of 86.5%. But when I evaluate the performance on my evaluation dataset, it gives me 77% accuracy. Whatever changes I made in train_val.prototxt, I made the same changes in deploy.prototxt. Is the difference between the validation and evaluation accuracy normal, or did I do something wrong?
Any suggestions?
In order to get a fair estimate of your trained model on the validation dataset, you need to set test_iter and the validation batch size in a meaningful manner.
So, test_iter should be set to:
Val_data / test_batch_size
where Val_data is the size of your validation dataset and test_batch_size is the batch_size value set for the validation phase.
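For example (with purely illustrative numbers), if the validation set has 2,000 images and the validation-phase batch_size is 50, test_iter should be 2000 / 50 = 40, so that one test pass covers the whole validation set exactly once.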