In my neural network model for digit classification, the cost decreases from 7 to 1.7 and afterwards starts increasing again. What is a possible reason? I used a learning rate of 0.1 for the first 5000 iterations, 0.03 for the next 5000 iterations, and 0.001 for the next 5000 iterations.
I am getting only 78% accuracy on the training data.
What should I do?
If the loss decreases, reaches a certain value, and then starts increasing again, it is most often due to a learning rate that is too high, which causes the model to diverge. Although the learning rate is lowered after 5000 iterations, the rate used for the first 5000 iterations might be so high that the model has already diverged drastically by then, so that decreasing the learning rate later on may not help it converge.
Could you try reducing the learning rate for the first 5000 iterations, starting from, say, 0.03 and going lower, and see whether the model converges?
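For concreteness, here is a rough sketch of the kind of schedule I mean, in plain Python. The intermediate value of 0.01 is just an illustration, not something taken from your setup:

```python
def learning_rate(iteration):
    """Piecewise-constant schedule starting from a lower initial rate (0.03)."""
    if iteration < 5000:
        return 0.03
    elif iteration < 10000:
        return 0.01   # illustrative intermediate value
    else:
        return 0.001

# Inside the training loop the rate is looked up per iteration, e.g.:
# for it in range(15000):
#     lr = learning_rate(it)
#     weights -= lr * gradient(weights, batch)
```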
When training a net, does it matter if the number of samples in an epoch is not an exact multiple of the batch size? My training code doesn't seem to mind when this is the case, though my loss curve is pretty noisy at the moment (in case that is a related issue).
This would be useful to know, because if it is not an issue it saves messing around with the dataset to make its size a multiple of the batch size. It may also be less wasteful of captured data.
does it matter if the number of samples in the epoch is not an exact multiple of the batch size
No, it does not. Your number of samples can be, say, 1000 and your batch size 400.
You can decide the total number of iterations (where each iteration = sampling a batch and doing one gradient-descent step) based on the overall number of epochs you want to cover. Say you want roughly 5 epochs; then your number of iterations should be at least 5 * 1000 / 400 = 12.5, i.e. 13. So you sample a random batch 13 times to cover roughly 5 epochs.
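As a minimal sketch of that arithmetic (NumPy with dummy data; the numbers mirror the example above):

```python
import numpy as np

n_samples, batch_size, target_epochs = 1000, 400, 5

# Iterations needed to cover roughly `target_epochs` passes over the data.
n_iterations = int(np.ceil(target_epochs * n_samples / batch_size))  # 13

X = np.random.randn(n_samples, 10)  # dummy inputs
for it in range(n_iterations):
    idx = np.random.choice(n_samples, size=batch_size, replace=False)
    batch = X[idx]  # a random batch of 400 samples
    # ... compute the gradient on `batch` and update the weights ...
```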
In the context of Convolutional Neural Networks (CNNs), the batch size is the number of examples fed to the algorithm at a time. This is normally some small power of 2, such as 32, 64, or 128. During training, an optimization algorithm computes the average cost over a batch and then runs backpropagation to update the weights. In a single epoch the algorithm runs $n_{batches} = \frac{n_{examples}}{\text{batch size}}$ times. Generally the algorithm needs to train for several epochs to achieve convergence of the weight values. Every batch is normally sampled randomly from the whole example set.
The idea is this: a mini-batch optimization step w.r.t. (x1, ..., xn) is equivalent to consecutive optimization steps w.r.t. the inputs x1, ..., xn, because the gradient is a linear operator. This means that the mini-batch update equals the sum of its individual updates. An important note here: I assume the NN doesn't apply batch norm or any other layer that adds an explicit variation to the inference model (in that case the math is a bit hairier).
So the batch size can be seen as a purely computational idea that speeds up the optimization through vectorization and parallel computing. Assuming that one can afford arbitrarily long training and the data are properly shuffled, the batch size can be set to any value. But that isn't automatically true for all hyperparameters; for example, a very high learning rate can easily force the optimization to diverge, so don't make the mistake of thinking hyperparameter tuning isn't important in general.
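To make the linearity argument concrete, here is a small NumPy check on a toy linear model of my own (squared loss, no batch norm): the gradient computed over a mini-batch equals the sum of the per-example gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)            # parameters
X = rng.normal(size=(4, 3))       # a "mini-batch" of 4 examples
y = rng.normal(size=4)

def grad(w, X, y):
    """Gradient of the summed squared loss 0.5 * ||X w - y||^2."""
    return X.T @ (X @ w - y)

batch_grad = grad(w, X, y)
per_example_sum = sum(grad(w, X[i:i + 1], y[i:i + 1]) for i in range(len(y)))

# Identical, because the gradient is a linear operator over the loss terms.
assert np.allclose(batch_grad, per_example_sum)
```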
I am using neural networks for classification.
While using different numbers of training epochs/episodes, I noticed that sometimes the misclassification rate increased after a training episode, even though the number of training episodes had increased as well.
I expected the misclassification rate to decrease as the training episodes increased, but that didn't happen at some points; for example, the error decreased from 1000 to 3000 training episodes and then increased after 4000 episodes. So I just want to know whether this is normal and whether it is a sign of the network overfitting the data.
Thanks
Unless the learning rate and momentum are too high, the misclassification rate on the training data should decrease as the number of epochs increases. However, the misclassification rate on validation or test data might start increasing after a number of epochs. In that case, it is a sign of overfitting.
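A self-contained way to watch this happen is to track both error rates per epoch. This toy scikit-learn example (random data, purely illustrative, not your actual network) shows the monitoring pattern:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=1200) > 0).astype(int)
X_train, y_train, X_val, y_val = X[:1000], y[:1000], X[1000:], y[1000:]

# warm_start=True with max_iter=1 trains one more epoch per fit() call.
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1,
                    warm_start=True, random_state=0)

for epoch in range(100):
    clf.fit(X_train, y_train)
    train_err = 1 - clf.score(X_train, y_train)  # should keep decreasing
    val_err = 1 - clf.score(X_val, y_val)
    # If val_err starts rising while train_err keeps falling,
    # that is the overfitting signal described above.
```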
I'm currently training a convolutional network on the CIFAR-10 dataset. Let's say I have 50,000 training images and 5,000 validation images.
To start testing my model, let's say I start with just 10,000 images to get an idea of how successful the model will be.
I train for 40 epochs with a batch size of 128, i.e. in every epoch I run my optimizer (SGD) to minimize the loss 10,000 / 128 ≈ 78 times, each time on a batch of 128 images.
Now, let's say I found a model that achieves 70% accuracy on the validation set. Satisfied, I move on to train on the full training set.
This time, for every epoch I run the optimizer to minimize the loss 5 * 10,000 / 128 ≈ 391 times.
This makes me think that my accuracy at each epoch should be higher than on the limited set of 10,000. Much to my dismay, the accuracy on the limited training set increases much more quickly. At the end of the 40 epochs with the full training set, my accuracy is 30%.
Thinking the data might be corrupt, I performed limited runs on training images 10-20k, 20-30k, 30-40k, and 40-50k. Surprisingly, each of these runs resulted in an accuracy of ~70%, close to the accuracy for images 0-10k.
Thus arise two questions:
Why would validation accuracy go down when the dataset is larger, when I've confirmed that each segment of the data provides decent results on its own?
For a larger training set, would I need to train for more epochs, even though each epoch represents a larger number of training steps (391 vs. 78)?
It turns out my intuition was right, but my code was not.
Basically, I had been running my validation accuracy with training data (data used to train the model), not with validation data (data the model had not yet seen).
After correcting this error, the validation accuracy did indeed improve with the bigger training data set, as expected.
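For anyone hitting the same thing, here is a toy tf.keras stand-in (random data, tiny model, not the actual CIFAR-10 code) showing exactly where the bug was, i.e. evaluating on the training split instead of the held-out one:

```python
import numpy as np
import tensorflow as tf

x_train = np.random.rand(500, 32, 32, 3).astype("float32")
y_train = np.random.randint(0, 10, size=500)
x_val = np.random.rand(100, 32, 32, 3).astype("float32")
y_val = np.random.randint(0, 10, size=100)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=1, verbose=0)

# Bug: "validation" accuracy computed on data the model was trained on.
_, wrong_val_acc = model.evaluate(x_train, y_train, verbose=0)

# Fix: evaluate on held-out data the model has never seen.
_, val_acc = model.evaluate(x_val, y_val, verbose=0)
```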
I am designing a new network architecture for semantic segmentation. The training loss decreases as the number of training iterations increases. However, when I measure the testing accuracy, I get the figure below.
From 0 to 20,000 iterations the accuracy increases, but after 20,000 iterations the testing accuracy decreases. I guess it is an overfitting issue.
I tried adding dropout to the network, but the trend of the graph is similar. Could you suggest the reason and how I can solve it? I don't think early stopping is a good solution. Thanks
Be sure to randomize (shuffle) your training data. You can also start with a higher learning rate (say 0.1) to get out of local minima, then decrease it to a very small value to let things settle down. To do this, set the step size to, say, 1000 iterations so that the learning rate is reduced every 1000 iterations.
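One way to implement that kind of step-wise decay in tf.keras (the exact starting rate, step size, and decay factor here are my own illustrative choices):

```python
import tensorflow as tf

# Start at 0.1 and drop the learning rate by a factor of 10 every 1000 steps;
# staircase=True makes the decay step-wise rather than smooth.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.1,
    staircase=True,
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```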
I am fine-tuning a VGG16 network on a 32-CPU machine using TensorFlow. I used sparse cross-entropy loss. I have to classify clothing images into 50 classes. After 2 weeks of training this is how the loss is going down, which I feel is very slow convergence. My batch size is 50. Is this normal, or what do you think is going wrong here? The accuracy is also really bad. And now it has crashed with a bad memory allocation error:
terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc
The last line in my log file looks like this:
2016-12-13 08:56:57.162186: step 31525, loss = 232179.64 (1463843.280 sec/batch)
I also tried a Tesla K80 GPU, and after 20 hours of training this is what the loss looks like. All parameters are the same. The worrying part is that using the GPU didn't increase the iteration rate, which means each step takes the same time whether on the 32-CPU machine with 50 threads or on the Tesla K80.
I definitely need some practical advice here.
Another, and drastically better, option is to not use VGG16. If you look at Figure 5 in this paper, you'll note that VGG16 does very badly in terms of accuracy vs. FLOPs (floating-point operations). If you need speed, MobileNet or a reduced-size ResNet will do much better. Even Inception-v2 will outperform VGG in accuracy at much lower computational cost.
This will drastically reduce your training time and memory use.
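A rough sketch of how that swap could look in tf.keras, assuming 224x224 inputs and a simple 50-class head (both assumptions of mine, not details from the question):

```python
import tensorflow as tf

# MobileNetV2 backbone from tf.keras.applications instead of VGG16.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the backbone for the first fine-tuning phase

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(50, activation="softmax"),  # 50 clothing classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```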