I want to train CaffeNet on the MNIST dataset in Caffe. However, I noticed that after 100 iterations the loss just slightly dropped (from 2.66364 to 2.29882).
However, when I use LeNet on MNIST, the loss goes from 2.41197 to 0.22359, after 100 iterations.
Does this happen because CaffeNet has more layers, and therefore needs more training time to converge? Or is it due to something else? I made sure the solver.prototxt of the nets were the same.
While I know 100 iterations is extremely short (as CaffeNet usually trains for ~300-400k iterations), I find it odd that LeNet is able to get a loss so small, so soon.
I am not familiar with architecture of these nets, but in general there are several possible reasons:
1) One of the nets is really much more complicated
2) One of the nets was trained with a bigger learning rate
3) Or maybe it used a training with momentum while other net didn't use it?
4) Also possible that they both use momentum during training but one of them had the bigger momentum coefficient specified
Really, there are tons of possible explanations for that.
Related
I have a medical image dataset of ~10K 256x256 images with which I am training a deep neural classifier for disease classification. I have been working with popular CNNs like InceptionV3 and ResNets.
These models have achieved validation set accuracies in the 50-60% range and I noticed that they were overfitting. So to improve the performance, I then tried common strategies like a dropout in the dense layers, smaller learning rates, and L2 regularization. After these modifications showed no reduction in overfitting, I next moved to smaller and simpler architectures with just 2-3 convolution layers + 1 FC classification layer which I thought would mitigate the issue. However, with the simpler models, the learning curves still showed signs of overfitting. Particularly, when training for 100 epochs, the models would have similar train and validation losses for the first 20-30 epochs, but then diverge after that.
I'm not sure what other strategies I can experiment with at this point and I'm worried that trying more experiments aimlessly is inefficient. Should I just accept that the models cannot generalize to this task well?
Additionally, FYI, the dataset is imbalanced, but I have dealt with this using data augmentation and a weighted cross-entropy loss as well but no real difference.
Try to use modern classification approaches like transformers or efficientnets - their accuracy is higher. To compare different modern architectures please use paperswithcode.
Augmentations, regularizations are must-have in training process, doesn't matter if balanced or imbalanced data you have.
You can try to make over- or undersampling of your data to get better results
Try to use warmup and learning rate schedules, this improves the convergence of the model
I was using Keras' CNN to classify MNIST dataset. I found that using different batch-sizes gave different accuracies. Why is it so?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although, the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the Mini-batch gradient descent effect during training process. You can find good explanation Here that I mention some notes from that link here:
Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the
cost of noise in the training process.
Large values give a learning
process that converges slowly with accurate estimates of the error
gradient.
and also one important note from that link is :
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a
given computational cost, across a wide range of experiments. In all
cases the best results have been obtained with batch sizes m = 32 or
smaller
Which is the result of this paper.
EDIT
I should mention two more points Here:
because of the inherent randomness in machine learning algorithms concept, generally you should not expect machine learning algorithms (like Deep learning algorithms) to have same results on different runs. You can find more details Here.
On the other hand both of your results are too close and somehow they are equal. So in your case we can say that the batch size has no effect on your network results based on the reported results.
This is not connected to Keras. The batch size, together with the learning rate, are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD), which entirely affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them towards the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimation, which helps regularize the model and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
I want to add two points:
1) When use special treatments, it is possible to achieve similar performance for a very large batch size while speeding-up the training process tremendously. For example,
Accurate, Large Minibatch SGD:Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest you to over-read these numbers. Because the difference is so subtle that it could be caused by noise. I bet if you try models saved on a different epoch, you will see a different result.
I am training a deep residual network with 10 hidden layers with game data.
Does anyone have an idea why I don't get any overfitting here?
Training and test loss still decreasing after 100 epochs of training.
https://imgur.com/Tf3DIZL
Just a couple of advice:
for deep learning is recommended to do even 90/10 or 95/5 splitting (Andrew Ng)
this small difference between curves means that your learning_rate is not tuned; try to increase it (and, probably, number of epochs if you will implement some kind of 'smart' lr-reduce)
it is also reasonable for DNN to try to overfit with the small amount of data (10-100 rows) and an enormous number of iterations
check for data leakage in the set: weights analysis inside each layer may help you in this
Generally speaking, is it possible to tell if training a given neural network of depth X on Y training examples for Z epochs is likely to overfit? Or can overfitting only be detected for sure by looking at loss and accuracy graphs of training vs test set?
Concretely I have ~250,000 examples, each of which is a flat image 200x200px. The model is a CNN with about 5 convolution + pooling layers, followed by 2 dense layers with 1024 units each. The model classifies 12 different classes. I've been training it for about 35 hours with ~90% accuracy on training set and ~80% test set.
Generally speaking, is it possible to tell if training a given neural network of depth X on Y training examples for Z epochs is likely to overfit?
Generally speaking, no. Fitting deep learning models is still an almost exclusively empirical art, and the theory behind it is still (very) poor. And although by gaining more and more experience one is more likely to tell beforehand if a model is prone to overfit, the confidence will generally be not high (extreme cases excluded), and the only reliable judge will be the experiment.
Elaborating a little further: if you take the Keras MNIST CNN example and remove the intermediate dense layer(s) (the previous version of the script used to include 2x200 dense layers instead of 1x128 now), thus keeping only conv/pooling layers and the final softmax one, you will end up with ~ 98.8% test accuracy after only 20 epochs, but I am unaware of anyone that could reliably predict this beforehand...
Or can overfitting only be detected for sure by looking at loss and accuracy graphs of training vs test set?
Exactly, this is the only safe way. The telltale signature of overfitting is the divergence of the learning curves (training error still decreasing, while validation or test error heading up). But even if we have diagnosed overfitting, the cause might not be always clear-cut (see a relevant question and answer of mine here).
~90% accuracy on training set and ~80% test set
Again very generally speaking and only in principle, this does not sound bad for a problem with 12 classes. You already seem to know that, if you worry for possible overfitting, it is the curves rather than the values themselves (or the training time) that you have to monitor.
On the more general topic of the poor theory behind deep learning models as related to the subject of model intepretability, you might find this answer of mine useful...
I implement the ResNet for the cifar 10 in accordance with this document https://arxiv.org/pdf/1512.03385.pdf
But my accuracy is significantly different from the accuracy obtained in the document
My - 86%
Pcs daughter - 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little bit too generic, my opinion is that the network is over fitting to the training data set, as you can see the training loss is quite low, but after the epoch 50 the validation loss is not improving anymore.
I didn't read the paper in deep so I don't know how did they solved the problem but increasing regularization might help. The following link will point you in the right direction http://cs231n.github.io/neural-networks-3/
below I copied the summary of the text:
Summary
To train a Neural Network:
Gradient check your implementation with a small batch of data and be aware of the pitfalls.
As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of
the data
During training, monitor the loss, the training/validation accuracy, and if you’re feeling fancier, the magnitude of updates in relation to
parameter values (it should be ~1e-3), and when dealing with ConvNets,
the first-layer weights.
The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or
whenever the validation accuracy tops off.
Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges,
training only for 1-5 epochs), to fine (narrower rangers, training for
many more epochs)
Form model ensembles for extra performance
I would argue that the difference in data pre processing makes the difference in performance. He is using padding and random crops, which in essence increases the amount of training samples and decreases the generalization error. Also as the previous poster said you are missing regularization features, such as the weight decay.
You should take another look at the paper and make sure you implement everything like they did.