I am designing a new network architecture for semantic segmentation. The training loss decreases as the number of training iterations increases. However, when I measure the testing accuracy, I get the figure below.
From 0 to 20,000 iterations the accuracy increases, but after 20,000 iterations the testing accuracy decreases. I suspect this is an overfitting issue.
I tried adding dropout to the network, but the trend of the graph stays similar. Could you suggest possible reasons and how I can solve this? I don't think early stopping is a good solution. Thanks
Be sure to shuffle your training data. You can also start with a higher learning rate (say 0.1) to escape local minima, then decrease it to a very small value to let things settle down. To do this, set the step size to, say, 1000 iterations so that the learning rate is reduced every 1000 iterations.
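As a concrete illustration, here is a minimal sketch of such a step-decay schedule in Keras (the framework, the 0.1 starting rate, and the drop interval are assumptions on my part; adapt them to your setup):

```python
import tensorflow as tf

def step_decay(epoch, lr):
    # Start at 0.1 and divide the learning rate by 10 every `drop_every`
    # epochs so the optimizer can settle into a finer solution.
    initial_lr = 0.1      # assumed starting rate
    drop_factor = 0.1     # multiply by 0.1 at each drop
    drop_every = 10       # hypothetical interval; tune it for your data
    return initial_lr * (drop_factor ** (epoch // drop_every))

lr_callback = tf.keras.callbacks.LearningRateScheduler(step_decay)
# model.fit(x_train, y_train, epochs=50, callbacks=[lr_callback])
```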
I am training a deep model for MRI segmentation. The models I am using are U-Net++ and UNet3+. However, when plotting the validation and training losses of these models over time, I find that they all end with a sudden drop in loss followed by a permanent plateau. Any ideas what could be causing this plateau, or how I could get past it?
Here are the plots for the training and validation loss curves, and the corresponding segmentation performance (dice score) on the validation set. The drop in loss occurs at around epoch 80 and is pretty obvious in the graphs.
In regard to the things I've tried:
Perhaps a local minimum is being found that is hard to escape, so I tried resuming training at epoch 250 with the learning rate increased by a factor of 10, but the plateau stays exactly the same regardless of how many more epochs I train. I also tried resuming with the learning rate reduced by a factor of 10 and 100, with no change either.
Perhaps the model has too many parameters, i.e. the plateau is happening due to overfitting, so I tried training models with fewer parameters. This changed the actual loss value (the Y-axis value) at which the plateau occurs, but the general shape of a sudden drop followed by a plateau remains. I also tried increasing the number of parameters (because it was easy to do), and the same problem is observed.
Due to the high number of parameters it is hard, if not impossible, to reason about the optimization landscape, so any speculations are really just that: speculations.
If you assume that the model got stuck somewhere, i.e. that the gradient is getting very small (it is sometimes worth plotting the distribution of the gradient's entries over time, or at least its magnitude), it can help to artificially force the optimizer to adapt by changing its environment. One popular way to do so is weight decay: for instance, use standard weight decay with SGD, or if you're using Adam, switch to AdamW. Alternatives based on a similar idea are warm restarts.
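For example, assuming a PyTorch training loop (the question doesn't state the framework, so this is only a sketch; `model` stands in for your U-Net++ / UNet3+ instance), switching to AdamW and adding warm restarts could look like this:

```python
import torch

model = torch.nn.Conv2d(1, 1, 3)  # placeholder; use your segmentation model

# AdamW applies decoupled weight decay, which keeps perturbing the weights
# even when the loss gradient is tiny, sometimes nudging training off a plateau.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)

# Warm restarts periodically reset the learning rate to its initial value and
# anneal it again (first cycle 50 epochs, each following cycle twice as long).
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=50, T_mult=2)

# for epoch in range(num_epochs):
#     train_one_epoch(model, optimizer)   # your existing training step
#     scheduler.step()
```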
Finally, it might very well be that you have reached the limits of what your model can achieve. A Dice score in the neighbourhood of 0.9 is already quite good for many of today's segmentation tasks.
I was using a CNN in Keras to classify the MNIST dataset. I found that using different batch sizes gave different accuracies. Why is that?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although the difference is very small, why is there a difference at all?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the effect of mini-batch gradient descent during training. You can find a good explanation here; I quote some notes from that link below:
Batch size is a slider on the learning process. Small values give a learning process that converges quickly at the cost of noise in the training process. Large values give a learning process that converges slowly with accurate estimates of the error gradient.
and one more important note from that link:
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller.
This is the result of that paper.
EDIT
I should mention two more points here:
Because of the inherent randomness in machine learning algorithms, you generally should not expect machine learning algorithms (deep learning algorithms included) to produce identical results on different runs. You can find more details here.
On the other hand, your two results are very close and effectively equal, so in your case we can say that the batch size has no effect on your network's results, based on the numbers reported.
This is not connected to Keras. The batch size, together with the learning rate, are critical hyperparameters for training neural networks with mini-batch stochastic gradient descent (SGD), and they strongly affect the learning dynamics and thus the accuracy, the training speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them in the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimate, which helps regularize the model, and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
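To make the update rule concrete, here is a toy NumPy sketch (plain linear regression rather than your Keras CNN, purely for illustration) of mini-batch SGD, where the gradient is estimated on a random subset of the data at each step:

```python
import numpy as np

# Toy linear regression: w <- w - lr * gradient estimated on a mini-batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.1, 32   # the two coupled hyperparameters discussed above

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size   # noisy gradient estimate
    w -= lr * grad                                 # SGD update

print(np.allclose(w, true_w, atol=0.05))
```

A smaller `batch_size` makes each gradient estimate noisier; a larger one makes it more accurate but each step more expensive, which is exactly the trade-off quoted above.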
I want to add two points:
1) With special treatment, it is possible to achieve similar performance with a very large batch size while speeding up training tremendously. For example:
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest over-interpreting these numbers. The difference is so subtle that it could be caused by noise. I bet that if you try models saved at a different epoch, you will see a different result.
I am training a deep residual network with 10 hidden layers on game data.
Does anyone have an idea why I don't get any overfitting here?
Training and test loss are still decreasing after 100 epochs of training.
https://imgur.com/Tf3DIZL
Just a few pieces of advice:
for deep learning it is recommended to use even a 90/10 or 95/5 train/test split (Andrew Ng)
such a small gap between the curves means that your learning rate is not tuned; try increasing it (and probably the number of epochs, if you implement some kind of 'smart' learning-rate reduction)
it is also a reasonable sanity check to try to overfit a DNN on a small amount of data (10-100 rows) with an enormous number of iterations (see the sketch after this list)
check for data leakage in the data set; analysing the weights inside each layer may help you with this
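As a sketch of the overfit-a-tiny-subset check mentioned above (Keras is an assumption here, and the random arrays are only stand-ins for your game data):

```python
import numpy as np
import tensorflow as tf

# Stand-in data: replace with ~100 rows of your real game data.
x_small = np.random.rand(100, 20).astype("float32")
y_small = np.random.randint(0, 2, size=(100, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])

# A healthy network should memorize this tiny set (near 100% training
# accuracy); if it cannot, something in the pipeline or model is off.
model.fit(x_small, y_small, epochs=2000, verbose=0)
print(model.evaluate(x_small, y_small, verbose=0))
```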
I implemented ResNet for CIFAR-10 following this paper: https://arxiv.org/pdf/1512.03385.pdf
But my accuracy is significantly different from the accuracy reported in the paper:
Mine - 86%
Paper's - 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little too generic, but my opinion is that the network is overfitting to the training data set: as you can see, the training loss is quite low, but after epoch 50 the validation loss is not improving anymore.
I didn't read the paper in depth, so I don't know how they solved the problem, but increasing regularization might help. The following link will point you in the right direction: http://cs231n.github.io/neural-networks-3/
Below I copied the summary of that text (a short random-search sketch follows it):
Summary
To train a Neural Network:
Gradient check your implementation with a small batch of data and be aware of the pitfalls.
As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data.
During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs) to fine (narrower ranges, training for many more epochs).
Form model ensembles for extra performance.
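To illustrate the random-search point from that summary, here is a small Python sketch; `train_and_eval` is a hypothetical placeholder for your own training routine, not part of any library:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_eval(lr, weight_decay, epochs):
    """Placeholder: swap in your real training code and return validation accuracy."""
    return rng.random()  # stands in for a measured validation accuracy

# Coarse stage: sample hyperparameters widely on a log scale,
# training each configuration for only a few epochs.
coarse = []
for _ in range(20):
    lr = 10 ** rng.uniform(-5, -1)   # log-uniform in [1e-5, 1e-1]
    wd = 10 ** rng.uniform(-6, -2)   # log-uniform weight decay
    coarse.append((train_and_eval(lr, wd, epochs=3), lr, wd))

best_acc, best_lr, best_wd = max(coarse)
# Fine stage: narrow the ranges around (best_lr, best_wd) and train longer.
```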
I would argue that the difference in data preprocessing makes the difference in performance. The authors use padding and random crops, which in essence increases the number of training samples and decreases the generalization error. Also, as the previous poster said, you are missing regularization features such as weight decay.
You should take another look at the paper and make sure you implement everything like they did.
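For reference, here is a minimal sketch of the CIFAR-10 augmentation described in the paper (4-pixel zero padding, random 32x32 crop, random horizontal flip), written with tf.data as an assumption since I haven't checked which framework your repository uses:

```python
import tensorflow as tf

def augment(image, label):
    # Pad 4 pixels of zeros on each side, then take a random 32x32 crop
    # of the padded image or its horizontal flip (as in the ResNet paper).
    image = tf.pad(image, [[4, 4], [4, 4], [0, 0]])
    image = tf.image.random_crop(image, size=[32, 32, 3])
    image = tf.image.random_flip_left_right(image)
    return image, label

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(10_000)
            .batch(128))
```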
I am fine-tuning a VGG16 network on a 32-CPU machine using TensorFlow, with sparse cross-entropy loss. I have to classify clothing images into 50 classes. After 2 weeks of training this is how the loss is going down, which I feel is very slow convergence. My batch size is 50. Is this normal, or what do you think is going wrong here? Accuracy is also really bad. And now it has crashed with a bad memory allocation error.
terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_allo
My last line in the log file looks like this -
2016-12-13 08:56:57.162186: step 31525, loss = 232179.64 (1463843.280 sec/batch)
I also tried a Tesla K80 GPU, and after 20 hours of training this is how the loss looks. All parameters are the same. The worrying part is that using the GPU didn't increase the iteration rate, which means each step takes the same time whether on 32 CPUs with 50 threads or on the Tesla K80.
I definitely need some practical advice here.
Another -- and drastically better -- option is not to use VGG16. If you look at Figure 5 in this paper, you'll note that VGG16 does very badly in terms of accuracy vs. FLOPs (floating-point operations). If you need speed, MobileNet or a reduced-size ResNet will do much better. Even Inception-v2 will outperform VGG in accuracy at much lower computational cost.
This will drastically reduce your training time and memory use.
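As a sketch of what swapping the backbone could look like in Keras (the input size, optimizer, and classification head are assumptions; only the 50-class output and sparse labels come from your description):

```python
import tensorflow as tf

# Lighter backbone pretrained on ImageNet; freeze it first, fine-tune later.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(50, activation="softmax"),  # 50 clothing classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",  # sparse labels, as in the question
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```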