I'm working on relation classification with the SemEval2010 Task 8 dataset. The dataset is already split into 8'000 samples for the training and 2'717 for the testing. In order to be as fair as possible, I use only my model at the end to computing its performance (F1-Score).
In order to tune my convolutional neural networks, I keep 6'400 samples for the training and 1'600 for the validation. I train the model and after each epoch (~10' of computation) I compute the F1-Score of my predictions.
I read the paper http://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf and stop training when the last 3 performances were increasing (similar to UP in the paper). In the paper, they return the model corresponding to the best performance seen so far.
My question is : in order to be as accurate as possible, we need the whole 8'000 samples for the training. Is it correct to say we will train until the epoch which had the best performance on the validation set and then do the predictions ? Or should we save the model corresponding to the best performance and "waste" 1'600 samples ?
Related
I have a medical image dataset of ~10K 256x256 images with which I am training a deep neural classifier for disease classification. I have been working with popular CNNs like InceptionV3 and ResNets.
These models have achieved validation set accuracies in the 50-60% range and I noticed that they were overfitting. So to improve the performance, I then tried common strategies like a dropout in the dense layers, smaller learning rates, and L2 regularization. After these modifications showed no reduction in overfitting, I next moved to smaller and simpler architectures with just 2-3 convolution layers + 1 FC classification layer which I thought would mitigate the issue. However, with the simpler models, the learning curves still showed signs of overfitting. Particularly, when training for 100 epochs, the models would have similar train and validation losses for the first 20-30 epochs, but then diverge after that.
I'm not sure what other strategies I can experiment with at this point and I'm worried that trying more experiments aimlessly is inefficient. Should I just accept that the models cannot generalize to this task well?
Additionally, FYI, the dataset is imbalanced, but I have dealt with this using data augmentation and a weighted cross-entropy loss as well but no real difference.
Try to use modern classification approaches like transformers or efficientnets - their accuracy is higher. To compare different modern architectures please use paperswithcode.
Augmentations, regularizations are must-have in training process, doesn't matter if balanced or imbalanced data you have.
You can try to make over- or undersampling of your data to get better results
Try to use warmup and learning rate schedules, this improves the convergence of the model
My neural network trainign in pytorch is getting very wierd.
I am training a known dataset that came splitted into train and validation.
I'm shuffeling the data during training and do data augmentation on the fly.
I have those results:
Train accuracy start at 80% and increases
Train loss decreases and stays stable
Validation accuracy start at 30% but increases slowly
Validation loss increases
I have the following graphs to show:
How can you explain that the validation loss increases and the validation accuracy increases?
How can be such a big difference of accuracy between validation and training sets? 90% and 40%?
Update:
I balanced the data set.
It is binary classification. It now has now 1700 examples from class 1, 1200 examples from class 2. Total 600 for validation and 2300 for training.
I still see similar behavior:
**Can it be becuase I froze the weights in part of the network?
**Can it be becuase the hyperparametrs like lr?
I found the solution:
I had different data augmentation for training set and validation set. Matching them also increased the validation accuracy!
If the training set is very large in comparison to the validation set, you are more likely to overfit and learn the training data, which would make generalizing the model very difficult. I see your training accuracy is at 0.98 and your validation accuracy increases at a very slow rate, which would imply that you have overfit your training data.
Try reducing the number of samples in your training set to improve how well your model generalizes to unseen data.
Let me answer your 2nd question first. High accuracy on training data and low accuracy on val/test data indicates the model might not generalize well to infer real cases. That is what the validation process is all about. You need to finetune or even rebuild your model.
With regard to the first question, val loss might not necessarily correspond to the val accuracy. The model makes the prediction based on its model, and loss function calculates the difference between probablities of matrix and the target if you are using CrossEntropy function.
I'm very new to ML and I'm trying to build a classifier for unbalanced binary class for a real life problem. I've tried various models like Logistic regression, Random Forest, ANN, etc but every time I'm getting very high precision and recall (around 94%) for train data and very poor (around 1%) for test or validation data. I've 53 features and 97094 data points. I tried tweaking hyper-parameters but as far as I understand, with current precision and recall for test and validation data, it will also not help significantly. Can anyone please help me understand what could have gone wrong.
Thank you.
rf = RandomForestClassifier(bootstrap=True, class_weight={1:0.80,0:0.20}, criterion='entropy',
max_depth=2, max_features=4,
min_impurity_decrease=0.01, min_impurity_split=None,
min_weight_fraction_leaf=0.0, n_estimators=10,
n_jobs=-1, oob_score=False, random_state=41, verbose=0,
warm_start=False)
rf.fit(X_train, y_train)
It is difficult to say without seeing your actual data or your code that you are using but your models are probably overfitting to your training dataset or to your majority class.
If your model overfits your training dataset, it learns to memorise your actual training dataset. It does not find any general distinctions to classify your data anymore but it adapts its classification boundaries very closely to the training data. You should consider using less complex models (e.g. limit the number of trees in Random Forest), drop some features (e.g. start using only 3 of 53 features), regularisation or data augmentation.
See here for more techniques against training data overfitting and here for an example of over- and underfitting.
If your model simply overfits to your majority class (e.g. 99% of your data has the same class), then you could try to oversample the minority class during your training.
You're likely overfitting the model due to good training performance but poor test performance, this tells me that your model cannot generalize good enough and should be simplified. Like #mrzo said - you've way too many features, so look into dimensionality reduction algorithms and apply those for your dataset prior to the application of your model. Another good place to start is to run tree classifiers' "feature importance" methods to see what actually matters in a given dataset.
Without looking at your model and dataset - it is just speculation though.
For PyTorch's tutorial on performing transfer learning for computer vision (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html), we can see that there is a higher validation accuracy than training accuracy. Applying the same steps to my own dataset, I see similar results. Why is this the case? Does it have something to do with ResNet 18's architecture?
Assuming there aren't bugs in your code and the train and validation data are in the same domain, then there are a couple reasons why this may occur.
Training loss/acc is computed as the average across an entire training epoch. The network begins the epoch with one set of weights and ends the epoch with a different (hopefully better!) set of weights. During validation you're evaluating everything using only the most recent weights. This means that the comparison between validation and train accuracy is misleading since training accuracy/loss was computed with samples from potentially much worse states of your model. This is usually most noticeable at the start of training or right after the learning rate is adjusted since the network often starts the epoch in a much worse state than it ends. It's also often noticeable when the training data is relatively small (as is the case in your example).
Another difference is the data augmentations used during training that aren't used during validation. During training you randomly crop and flip the training images. While these random augmentations are useful for increasing the ability of your network to generalize they aren't performed during validation because they would diminish performance.
If you were really motivated and didn't mind spending the extra computational power you could get a more meaningful comparison by running the training data back through your network at the end of each epoch using the same data transforms used for validation.
The short answer is that train and validation data are from different distributions, and it's "easier" for model to predict target in validation data then it is for training.
The likely reason for this particular case, as indicated by this answer, is data augmentation during training. This is a way to regularize your model by increasing variability in the training data.
Other architectures can use Dropout (or its modifications), which are deliberately "hurting" training performance, reducing the potential of overfitting.
Notice, that you're using pretrained model, which already contains some information about how to solve classification problem. If your domain is not that different from the data it was trained on, you can expect good performance off-the-shelf.
I'm working on a project with colorectal cancer stage multiclass-classification using Gene Expression Data. My dataset contains 11 Biomarkers. The results from the classification are around 40%. I have tried different models for classification with KNN, SVM, neural network..., and also I have tried algorithms from ensemble machine learning. Has anyone has any idea what can I do with the dataset to improve the results?
To decide what to do next, you will need some metrics:
How well can a team of human experts classify the data?
What is the model accuracy on the training dataset?
What is the model accuracy on the testing dataset?
If the training accuracy is much worse than human experts, you should increase the complexity of the model until the training results approach or exceed human experts. You can do this by increasing the number of input features, choosing a different machine learning model, or increasing the number of layers in the NN. If the training accuracy is poor, you need to improve this first before spending time improving the testing accuracy.
If the training accuracy is good but the testing accuracy is much worse than the training accuracy, you are probably overfitting. Get or create more training data, and use regularization.