It is a binary photo classification problem, and I extracted the features using AlexNet. The evaluation metric is log-loss. There are 25,000 records in total in the training set, 12,500 "1"s and 12,500 "0"s, so the data set is balanced.
I trained an XGBoost model. After tuning the parameters using cross-validation, the training log-loss is 0.078 and the validation log-loss is 0.09. But when I make predictions on the test set, the log-loss is 2.1. It seems that over-fitting is still pretty serious.
Why is that? Do I have to tune the parameters further, or try another pre-trained model?
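For reference, a minimal sketch of the kind of pipeline described above; the feature and label arrays below are random placeholders standing in for the extracted AlexNet features, and the XGBoost settings are arbitrary:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Placeholders standing in for the precomputed AlexNet features and 0/1 labels.
features = np.random.rand(2000, 256)
labels = np.random.randint(0, 2, size=2000)

# Hold out part of the labelled data so generalization can be checked
# before ever touching the unlabelled test set.
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print("train log-loss:", log_loss(y_train, model.predict_proba(X_train)))
print("val log-loss:  ", log_loss(y_val, model.predict_proba(X_val)))
```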
Why can't a hyper-parameter such as the regularization parameter (a real number) be trained on the training data along with the model parameters? What would go wrong?
This is generally done to prevent overfitting. Model parameters are trained using the training set. Hyper-parameter tuning is done using a validation set that is (ideally) completely independent of the training data. The final performance should be evaluated on a test set. Typical splits are 80/10/10 or 60/20/20.
If you tune your hyper-parameters on the training set, you will very likely vastly overfit and suffer a performance hit on the test set.
Try it out! See the difference in performance on your test set when you do hyper-parameter tuning on the training set, vs on a separate validation set.
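A rough sketch of that experiment, assuming a scikit-learn workflow; the dataset, the SVM model and the grid of C values are placeholders chosen only to make the comparison runnable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
# 60/20/20 split into training / validation / test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

def tune(select_X, select_y):
    """Pick the C that scores best on the given selection set."""
    best_C, best_score = None, -np.inf
    for C in [0.01, 0.1, 1, 10, 100]:
        score = SVC(C=C).fit(X_train, y_train).score(select_X, select_y)
        if score > best_score:
            best_C, best_score = C, score
    return best_C

# Tune once on the training set, once on the validation set, and compare on test.
for name, (sx, sy) in {"training set": (X_train, y_train),
                       "validation set": (X_val, y_val)}.items():
    C = tune(sx, sy)
    test_acc = SVC(C=C).fit(X_train, y_train).score(X_test, y_test)
    print(f"C tuned on {name}: C={C}, test accuracy={test_acc:.3f}")
```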
When an MLP model is trained using a training and a validation dataset, we can find the number of epochs that best fits the model. Once training is done and we know the best number of epochs, would it be fine to retrain the model for that same number of epochs, not only on the training set but on the entire data set, so the model can see more data? Or could a number of epochs that yields a good MLP model in the first approach produce an overfitted one in the second?
There is not a single approach to this. It depends on factors such as the validation strategy (e.g. k-fold cross-validation vs using a single validation set as the test set itself), whether the model learns online or offline, and whether the validation set contains biased or imbalanced data.
You may find the following sources useful; a rough sketch of the retrain-with-the-same-epochs idea follows them:
https://stats.stackexchange.com/questions/11602/training-on-the-full-dataset-after-cross-validation
https://stats.stackexchange.com/questions/402055/fitting-after-training-and-validation
https://stats.stackexchange.com/questions/361494/how-to-correctly-retrain-model-using-all-data-after-cross-validation-with-early
https://www.reddit.com/r/datascience/comments/7xqszr/should_final_model_be_retrained_on_full_dataset/
https://www.quora.com/Should-we-train-neural-networks-on-the-training-set-combined-with-the-validation-set-after-tuning
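For a concrete picture, here is a minimal sketch of the retrain-with-the-same-number-of-epochs idea using scikit-learn's MLPClassifier; the dataset, architecture and epoch budget are illustrative assumptions, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 1: train epoch by epoch on the training set and watch the validation loss.
probe = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)
classes = np.unique(y)
val_losses = []
for epoch in range(100):
    probe.partial_fit(X_train, y_train, classes=classes)  # one pass over the data
    val_losses.append(log_loss(y_val, probe.predict_proba(X_val)))
best_epochs = int(np.argmin(val_losses)) + 1

# Step 2: retrain from scratch on training + validation data for that many epochs.
final = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)
for epoch in range(best_epochs):
    final.partial_fit(X, y, classes=classes)
print("best number of epochs:", best_epochs)
```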
In PyTorch's tutorial on transfer learning for computer vision (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html), the validation accuracy is higher than the training accuracy. Applying the same steps to my own dataset, I see similar results. Why is this the case? Does it have something to do with ResNet-18's architecture?
Assuming there aren't bugs in your code and the train and validation data are in the same domain, there are a couple of reasons why this may occur.
Training loss/acc is computed as the average across an entire training epoch. The network begins the epoch with one set of weights and ends the epoch with a different (hopefully better!) set of weights. During validation you're evaluating everything using only the most recent weights. This means that the comparison between validation and train accuracy is misleading since training accuracy/loss was computed with samples from potentially much worse states of your model. This is usually most noticeable at the start of training or right after the learning rate is adjusted since the network often starts the epoch in a much worse state than it ends. It's also often noticeable when the training data is relatively small (as is the case in your example).
Another difference is the data augmentation used during training but not during validation: the training images are randomly cropped and flipped. While these random augmentations help the network generalize, they aren't performed during validation because they would diminish performance.
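For reference, this is the kind of train-vs-validation transform split being referred to; the crop sizes and normalization constants below follow the linked PyTorch tutorial and are illustrative rather than prescriptive:

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random crop -> harder training samples
    transforms.RandomHorizontalFlip(),   # random flip -> more variability
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

val_transforms = transforms.Compose([
    transforms.Resize(256),              # deterministic resize
    transforms.CenterCrop(224),          # deterministic crop
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```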
If you were really motivated and didn't mind spending the extra computational power you could get a more meaningful comparison by running the training data back through your network at the end of each epoch using the same data transforms used for validation.
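If you want to try that, here is a rough sketch, assuming your training script already defines model, criterion, device, and a loader over the training images built with the validation transforms (called train_eval_loader here):

```python
import torch

@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval()                        # disable dropout, use running BN statistics
    total_loss, total_correct, total = 0.0, 0, 0
    for images, labels in loader:       # loader built with the *validation* transforms
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        total_loss += criterion(outputs, labels).item() * images.size(0)
        total_correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += images.size(0)
    return total_loss / total, total_correct / total

# At the end of each epoch:
# train_loss, train_acc = evaluate(model, train_eval_loader, criterion, device)
# val_loss, val_acc = evaluate(model, val_loader, criterion, device)
```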
The short answer is that the train and validation data come from different distributions, and it's "easier" for the model to predict the target on the validation data than on the training data.
The likely reason for this particular case, as indicated by this answer, is data augmentation during training. This is a way to regularize your model by increasing variability in the training data.
Other architectures use Dropout (or its modifications), which deliberately "hurts" training performance, reducing the potential for overfitting.
Notice that you're using a pretrained model, which already contains some information about how to solve the classification problem. If your domain is not that different from the data it was trained on, you can expect good performance off the shelf.
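As a toy illustration of the Dropout point: in training mode it zeroes activations (hurting apparent training performance), while in eval mode it is a no-op:

```python
import torch
import torch.nn as nn

layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train()
print(layer(x))   # roughly half the entries zeroed, the rest scaled by 1/(1-p) = 2

layer.eval()
print(layer(x))   # identity: all ones
```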
I have a simple question that has made me doubt my work all of a sudden.
If I only have a training and a validation set, am I allowed to monitor val_loss while training, or does that add bias to my training? I want to test my accuracy on the validation set at the end of training, but now I'm wondering: if I monitor that dataset while training, is that problematic or not?
Short answer: yes, monitoring the validation error and using it as a basis for decisions about a specific setup of the algorithm adds bias to your algorithm. To elaborate a bit:
1) You fix the hyperparameters of an ML algorithm and then train it on the training set. The resulting model, with that specific hyperparameter setup, overfits to the training set, and you use the validation set to estimate the performance you can expect with these hyperparameters on unseen data.
2) But you obviously want to adjust your hyperparameters to get the best performance. You may run a grid search (or something like it) to find the best hyperparameter settings for this specific algorithm using the validation set. As a result, your hyperparameter settings overfit to the validation set. Think of it as some information about the validation set still leaking into your model through the hyperparameters.
3) As a result, you must do the following: split the data into a training set, a validation set and a test set. Use the training set for training and the validation set for decisions about specific hyperparameters. When you are done (fully done!) fine-tuning your model, use the test set, which the model has never seen, to get an estimate of its final performance in the wild.
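A minimal sketch of that three-step protocol; the dataset, model and hyper-parameter grid below are arbitrary placeholders, not a recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# 60/20/20 split into training / validation / test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Steps 1) and 2): fit on the training set, choose hyperparameters on the validation set.
best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C)).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Step 3): only when tuning is fully done, touch the test set once.
final_model = make_pipeline(StandardScaler(), LogisticRegression(C=best_C)).fit(X_train, y_train)
print("estimated performance on unseen data:", final_model.score(X_test, y_test))
```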
I want to evaluate the performance of different models such as SVM, random forest, CNN, etc., and I only have one dataset. So I split the dataset into a training set and a testing set, train the different models on the training data, and test them on the testing data.
My question: can I get the real performance of the different models with only one dataset? For example, if I find that the SVM model gets the best result, should I select the SVM as my final classification model?
It's probably a better idea to validate your models on different test samples through cross-validation to avoid biases. Also check your models against different evaluation metrics, depending on your application type; for instance, use recall, accuracy and AUC for each model if it's a classification problem.
Evaluation results can be pretty deceptive and require extensive validation.
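A sketch of that kind of comparison, assuming scikit-learn models; the candidate models, number of folds and metrics are just examples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
models = {
    "SVM": SVC(probability=True, random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validation with several metrics per model.
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=["accuracy", "recall", "roc_auc"])
    print(name,
          "acc=%.3f" % scores["test_accuracy"].mean(),
          "recall=%.3f" % scores["test_recall"].mean(),
          "auc=%.3f" % scores["test_roc_auc"].mean())
```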
You can plot the ROC curve for all the models. The model with the highest AUC will be the best model.
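One possible way to produce that plot, assuming each fitted model exposes predict_proba and a held-out X_test / y_test is available:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(fitted_models, X_test, y_test):
    """Plot one ROC curve per fitted binary classifier and report its AUC."""
    for name, model in fitted_models.items():
        probs = model.predict_proba(X_test)[:, 1]       # probability of class 1
        fpr, tpr, _ = roc_curve(y_test, probs)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```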