What suggestions would you give to improve the accuracy of a deep neural network whose accuracy graph exhibits local minima?
What you are observing are fluctuations in the accuracy of your model during training, caused by computing gradients with respect to each mini-batch. These gradients are meant to approximate the gradient with respect to the whole training set, but they are not always accurate, so sometimes you will see your accuracy go down.
Some fluctuations can also be due to your loss function not being perfectly correlated with your accuracy metric.
The term "local minimum" is usually used to describe when a loss function has a local minimum that is different from its global minimum. I would not use it here to describe fluctuations the accuracy plot since it might cause confusion. After all you are trying to maximize accuracy.
In neural nets, regularization (e.g., L2, dropout) is commonly used to reduce overfitting. For example, the plot below shows typical loss vs. epoch, with and without dropout. Solid lines = training, dashed = validation, blue = baseline (no dropout), orange = with dropout. Plot courtesy of the TensorFlow tutorials.
Weight regularization behaves similarly.
Regularization delays the epoch at which validation loss starts to increase, but it apparently does not decrease the minimum value of the validation loss (at least in my models and in the tutorial from which the above plot is taken).
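For reference, here is a minimal Keras-style sketch of a baseline model versus the same model with dropout and L2 weight regularization added; the layer sizes, dropout rate, and L2 coefficient are placeholders, not the tutorial's exact values:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(use_regularization=True):
    # Toggle both dropout and L2 weight regularization
    reg = regularizers.l2(1e-4) if use_regularization else None
    drop = 0.5 if use_regularization else 0.0
    model = tf.keras.Sequential([
        layers.Dense(128, activation="relu",
                     kernel_regularizer=reg, input_shape=(784,)),
        layers.Dropout(drop),
        layers.Dense(64, activation="relu", kernel_regularizer=reg),
        layers.Dropout(drop),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

baseline = build_model(use_regularization=False)    # "blue" curves
regularized = build_model(use_regularization=True)  # "orange" curves
```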
If we use early stopping to halt training when validation loss is at its minimum (to avoid overfitting), and if regularization only delays that minimum-validation-loss point (rather than decreasing the minimum validation loss value), then it seems that regularization does not yield a network with greater generalization but merely slows down training.
How can regularization be used to reduce the minimum validation loss (and thereby improve model generalization) as opposed to just delaying it? If regularization only delays the minimum validation loss and does not reduce it, why use it at all?
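For context, early stopping on validation loss is typically wired up like this in Keras (a sketch; `model`, `x_train`, `y_train`, `x_val`, and `y_val` are assumed to already exist):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss has not improved for `patience` epochs,
# and roll back to the weights from the best epoch.
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=200, callbacks=[early_stop])
```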
Over-generalizing from a single tutorial plot is arguably not a good idea; here is a relevant plot from the original dropout paper:
Clearly, if the only effect of dropout were to delay convergence, it would not be of much use. But of course it does not always work (as your plot clearly suggests), hence it should not be used by default (which is arguably the lesson here)...
I am training a neural network, and at the beginning of training its loss and accuracy on the validation data fluctuate a lot, but towards the end of training they stabilize. I am using reduce-learning-rate-on-plateau for this network. Could it be that the network starts with a high learning rate, and as the learning rate decreases, both accuracy and loss stabilize?
For SGD, the amount of change in the parameters is a multiple of the learning rate and the gradient of the loss with respect to the parameter values:
θ = θ − α ∇θ E[J(θ)]
Every step it takes will be in a sub-optimal direction (i.e. slightly wrong), as the optimiser has usually only seen some of the values. At the start of training you are relatively far from the optimal solution, so the gradient ∇θ E[J(θ)] is large, and therefore each sub-optimal step has a large effect on your loss and accuracy.
Over time, as you (hopefully) get closer to the optimal solution, the gradient becomes smaller, so the steps become smaller, meaning that the effects of being slightly wrong are diminished. Smaller errors on each step make your loss decrease more smoothly and reduce the fluctuations.
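As a concrete sketch of the update rule above, together with the reduce-learning-rate-on-plateau schedule the question mentions (Keras callback assumed; the factor and patience values are placeholders):

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Plain SGD step: theta <- theta - alpha * grad
def sgd_step(theta, grad, alpha):
    return theta - alpha * grad

# Shrink the learning rate once validation loss stops improving;
# smaller steps are one reason the curves stabilize later in training.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                              patience=3, min_lr=1e-6)
# model.fit(..., callbacks=[reduce_lr])
```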
I am using a MaxEnt part-of-speech tagger for POS tagging of a language corpus. I know from theory that increasing the number of training examples should generally improve classification accuracy. However, I am observing that in my case the tagger gives the maximum F-measure if I take 3/4 of the data for training and the rest for testing. If I increase the training data size to 85% or 90% of the whole corpus, the accuracy decreases. Even on reducing the training data size to 50% of the full corpus, the accuracy decreases.
I would like to know the possible reason for this decrease in accuracy with increasing training examples.
I suspect that in the reduced test set you are left with the more extreme samples, while the more typical samples have been moved into your training set, so you have reduced the number of test samples that resemble what your model has seen.
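One way to check whether the effect is just an artifact of a particular split is to evaluate each training-set size over several random splits and look at the mean and spread of the F-measure. A minimal sketch with scikit-learn-style helpers (assuming a generic classifier `clf` and feature/label arrays `X`, `y`):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import f1_score

def score_for_train_fraction(clf, X, y, train_frac, n_repeats=10, seed=0):
    """Average F-measure over several random splits at a given train size."""
    splitter = ShuffleSplit(n_splits=n_repeats, train_size=train_frac,
                            random_state=seed)
    scores = []
    for train_idx, test_idx in splitter.split(X):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], pred, average="macro"))
    return np.mean(scores), np.std(scores)

# for frac in (0.5, 0.75, 0.85, 0.9):
#     print(frac, score_for_train_fraction(clf, X, y, frac))
```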
Description: I am trying to train an AlexNet-like CNN (actually the same architecture, but without groups) from scratch (50,000 images, 1,000 classes and 10x augmentation). Each epoch has 50,000 iterations and the image size is 227x227x3.
There was a smooth cost decline and improvement in the accuracy for the first few epochs, but now I am facing this problem where the cost has settled at ~6 (it started from 13) for a long time. It has been a day and the cost is continuously oscillating in the range 6.02-6.7. The accuracy has also become stagnant.
Now I am not sure what to do and don't have any proper guidance. Is this the problem of vanishing gradients at a local minimum? To avoid it, should I decrease my learning rate? Currently the learning rate is 0.08 with ReLU activation (which helps avoid vanishing gradients), Glorot initialization and a batch size of 96. Before making another change and training again for days, I want to make sure that I am moving in the right direction. What could be the possible reasons?
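If you do decide to lower the learning rate, a cheap first thing to try is a step decay schedule rather than guessing a single new fixed value. A hedged sketch using a Keras callback (the starting value 0.08 is taken from the question; the halving period is a placeholder):

```python
from tensorflow.keras.callbacks import LearningRateScheduler

def step_decay(epoch, lr):
    """Halve the learning rate every 5 epochs, starting from 0.08."""
    return 0.08 * (0.5 ** (epoch // 5))

lr_schedule = LearningRateScheduler(step_decay)
# model.fit(..., callbacks=[lr_schedule])
```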
For convex optimization, such as logistic regression:
For example, I have 100 training samples. In mini-batch gradient descent I set the batch size to 10.
So after 10 mini-batch gradient descent updates, can I get the same result as a single full-batch gradient descent update?
For non-convex optimization, such as a neural network:
I know mini-batch gradient descent can sometimes escape some local optima, but is there any fixed relationship between the two?
When we say batch gradient descent, we mean updating the parameters using all the data. Below is an illustration of batch gradient descent. Note that each iteration of batch gradient descent involves computing the average of the gradients of the loss function over the entire training data set. In the figure, -gamma is the negative of the learning rate.
When the batch size is 1, it is called stochastic gradient descent (SGD).
When you set the batch size to 10 (and I assume the total training data size is >> 10), this method is called mini-batch stochastic GD, which is a compromise between true stochastic GD and batch GD (which uses all the training data at each update). Mini-batch SGD performs better than true stochastic gradient descent because the gradient computed at each step uses more training examples, so we usually see smoother convergence. Below is an illustration of SGD. In this online learning setting, each iteration of the update consists of choosing a random training instance (z_t) from the outside world and updating the parameter w_t.
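To make the distinction concrete, here is a minimal numpy sketch of the three variants for logistic regression on synthetic data: a batch size equal to the dataset size gives batch GD, batch size 10 gives mini-batch SGD, and batch size 1 gives true SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # 100 samples, 5 features
y = (X @ rng.normal(size=5) > 0).astype(float)   # synthetic binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(batch_size, lr=0.1, epochs=50):
    w = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(epochs):
        idx = rng.permutation(n)                 # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # average gradient of the logistic loss over the batch
            grad = X[b].T @ (sigmoid(X[b] @ w) - y[b]) / len(b)
            w -= lr * grad                       # w <- w - gamma * grad
    return w

w_batch = train(batch_size=100)   # batch gradient descent
w_mini  = train(batch_size=10)    # mini-batch SGD
w_sgd   = train(batch_size=1)     # stochastic gradient descent
```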
The two figures I included here are from this paper.
From wiki:
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates α decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem.
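The "appropriate rate" in the quote means a step-size schedule that shrinks, but not too fast: the α_t must sum to infinity while their squares sum to a finite value. A schedule such as α_t = α_0 / (1 + λt) satisfies this; a tiny sketch:

```python
def learning_rate(t, alpha0=0.1, lam=0.01):
    """Decaying step size alpha_t = alpha0 / (1 + lam * t).
    The sum of alpha_t diverges while the sum of alpha_t**2 converges,
    which is the classic condition for almost-sure convergence."""
    return alpha0 / (1.0 + lam * t)

# e.g. inside the SGD loop: w -= learning_rate(t) * grad
```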
Regarding your question:
[convex case] Can I get the same result as a single full-batch gradient descent update?
If the meaning of "same result" is "converging" to the global minimum, then YES. This is approved by L´eon Bottou in his paper. That is either SGD or mini batch SGD converges to a global minimum almost surely. Note when we say almost surely:
It is obvious however that any online learning algorithm can be mislead by a consistent choice of very improbable examples. There is therefore no hope to prove that this algorithm always converges. The best possible result then is the almost sure convergence, that is to say that the algorithm converges towards the solution with probability 1.
For the non-convex case, it is also proved in the same paper (section 5) that stochastic or mini-batch gradient descent converges to a local minimum almost surely.