I trained my network on a dataset and got the following training loss vs. iterations:
As you can see, the loss grows rapidly at some points (marked by the red arrow). I am using the Adam solver with a learning rate of 0.001, momentum of 0.9, and weight decay of 0.0005, without dropout. The network uses BatchNorm, Pooling, and Conv layers. Based on the figure above, could you suggest what my problem is and how to fix it? Thanks, all.
Update: here is a more detailed figure.
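For reference, the configuration I described (Adam, learning rate 0.001, momentum 0.9 taken as beta_1, weight decay 0.0005 expressed as an L2 penalty, no dropout, Conv/BatchNorm/Pooling layers) looks roughly like the sketch below, written against the Keras API purely for illustration; the layer sizes and input shape are placeholders, not my actual network.

```python
# Sketch of the described solver settings in Keras; the tiny
# Conv + BatchNorm + Pooling stack is a placeholder, not the real model.
import tensorflow as tf

l2 = tf.keras.regularizers.l2(5e-4)  # weight decay 0.0005 expressed as an L2 penalty

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           kernel_regularizer=l2, input_shape=(64, 64, 3)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax", kernel_regularizer=l2),
])

# Adam with learning rate 0.001; "momentum 0.9" is taken to be Adam's beta_1
opt = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9)
model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```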
In neural nets, regularization (e.g. L2, dropout) is commonly used to reduce overfitting. For example, the plot below shows typical loss vs. epoch curves, with and without dropout. Solid lines = training, dashed = validation; blue = baseline (no dropout), orange = with dropout. Plot courtesy of the TensorFlow tutorials.
Weight regularization behaves similarly.
Regularization delays the epoch at which the validation loss starts to increase, but it apparently does not decrease the minimum value of the validation loss (at least in my models and in the tutorial from which the plot above is taken).
If we use early stopping to halt training when the validation loss is at its minimum (to avoid overfitting), and if regularization only delays the point of minimum validation loss (rather than decreasing its value), then it seems that regularization does not yield a network that generalizes better; it merely slows down training.
How can regularization be used to reduce the minimum validation loss (to improve model generalization) as opposed to just delaying it? If regularization is only delaying minimum validation loss and not reducing it, then why use it?
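To make the setup concrete, here is a minimal sketch of the kind of training I have in mind (dropout as the regularizer plus early stopping at the minimum validation loss), written against the Keras API; the layer sizes, dropout rate, and patience are placeholders rather than values from my actual models.

```python
# Minimal sketch: dropout regularization combined with early stopping on
# validation loss; all sizes and rates here are illustrative placeholders.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),            # the regularizer under discussion
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when validation loss stops improving and restore the best weights,
# i.e. "early stopping at the minimum validation loss".
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
#                     epochs=100, callbacks=[early_stop])
```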
Over-generalizing from a single tutorial plot is arguably not a good idea; here is a relevant plot from the original dropout paper:
Clearly, if the effect of dropout were merely to delay convergence, it would not be of much use. But of course it does not always work (as your plot suggests), hence it should not be used by default (which is arguably the lesson here)...
I am training a neural network, and at the beginning of training its loss and accuracy on the validation data fluctuate a lot, but towards the end of training they stabilize. I am using reduce-learning-rate-on-plateau for this network. Could it be that the network starts with a high learning rate, and as the learning rate decreases both accuracy and loss stabilize?
For SGD, the change in the parameters is the product of the learning rate and the gradient of the loss with respect to the parameters:
θ = θ − α ∇_θ E[J(θ)]
Every step it takes will be in a slightly sub-optimal direction, since the optimiser has usually only seen a subset of the data. At the start of training you are relatively far from the optimal solution, so the gradient ∇_θ E[J(θ)] is large, and therefore each sub-optimal step has a large effect on your loss and accuracy.
Over time, as you (hopefully) get closer to the optimal solution, the gradient is smaller, so the steps become smaller, meaning that the effects of being slightly wrong are diminished. Smaller errors on each step make your loss decrease more smoothly, which reduces the fluctuations.
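As a toy illustration of the update rule above, here is the same step written out in NumPy for a full-batch quadratic loss; the data and learning rate are made up for the example.

```python
# Toy illustration of theta = theta - alpha * grad on a quadratic loss
# J(theta) = mean((X @ theta - y)^2); the data below is synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
alpha = 0.1  # learning rate

for step in range(200):
    grad = 2 * X.T @ (X @ theta - y) / len(y)  # gradient of the loss w.r.t. theta
    theta = theta - alpha * grad               # the update rule from above

print(theta)  # ends up close to true_theta; steps shrink as the gradient shrinks
```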
I have implemented a neural network with 3 layers: an input layer, a hidden layer with 30 neurons (ReLU activation), and a softmax output layer. I am using the cross-entropy cost function. No outside libraries are being used. This is working on the MNIST dataset, so there are 784 input neurons and 10 output neurons.
I have got about 96% accuracy with hyperbolic tangent as my hidden layer activation.
When I try to switch to the ReLU activation, my activations grow very fast, which causes my weights to grow unbounded as well until everything blows up!
Is this a common problem when using the ReLU activation?
I have tried L2 regularization with minimal success. I end up having to set the learning rate lower by a factor of ten compared to the tanh activation, and I have tried adjusting the weight-decay rate accordingly, yet the best accuracy I have gotten is still about 90%. In the end, the rate of weight decay is still outpaced by the updates to certain weights in the network, which leads to an explosion.
It seems everyone is just replacing their activation functions with ReLU and getting better results, so I keep looking for bugs and validating my implementation.
Is there more that goes into using ReLU as an activation function? Maybe I have a problem in my implementation; can someone validate the accuracy with the same neural-net structure?
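To make the structure concrete, here is a minimal NumPy sketch of the network I described (784 inputs → 30 ReLU hidden units → 10-way softmax with cross-entropy). It trains on synthetic data, so it only demonstrates the shapes and gradients, not my actual MNIST results; the initialization scale and learning rate are arbitrary choices for the sketch.

```python
# Minimal 784-30-10 network: ReLU hidden layer, softmax output,
# cross-entropy loss. Trained on synthetic data purely as a structural sketch.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 256, 784, 30, 10
X = rng.normal(size=(n, d_in))
labels = rng.integers(0, d_out, size=n)
Y = np.eye(d_out)[labels]                        # one-hot targets

# Small-scale initialization (one common choice; the scale is a knob worth tuning)
W1 = rng.normal(size=(d_in, d_hid)) / np.sqrt(d_in)
b1 = np.zeros(d_hid)
W2 = rng.normal(size=(d_hid, d_out)) / np.sqrt(d_hid)
b2 = np.zeros(d_out)

lr = 0.1
for epoch in range(50):
    # forward pass
    h = np.maximum(0.0, X @ W1 + b1)             # ReLU hidden layer
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(Y * np.log(p + 1e-12), axis=1))

    # backward pass (softmax + cross-entropy gives the simple p - Y gradient)
    dlogits = (p - Y) / n
    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = dlogits @ W2.T
    dh[h <= 0] = 0.0                             # ReLU gradient
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0)

    # plain SGD step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", loss)
```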
As you can see, the ReLU function is unbounded for positive values, which allows the activations, and in turn the weights, to grow.
In fact, that is why the hyperbolic tangent and similar functions are used in those cases: to bound the output to a certain range (-1 to 1 or 0 to 1 in most cases).
There is another approach to deal with this phenomenon, called weight decay.
The basic motivation is to get a more generalised model (avoid overfitting) and to make sure the weights don't blow up: you apply a regularization penalty that depends on the weight itself when updating it,
meaning that bigger weights get a bigger penalty.
You can read further about it here.
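In update form, the idea is simply to add a penalty proportional to each weight to its gradient before taking the step. A small NumPy sketch (the variable names and numbers are illustrative):

```python
# Weight decay (L2 regularization) folded into the SGD update: the effective
# gradient is grad + weight_decay * w, so bigger weights get a bigger penalty.
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.01, weight_decay=5e-4):
    """One SGD step with an L2 penalty that pulls each weight towards zero."""
    return w - lr * (grad + weight_decay * w)

w = np.array([0.5, -3.0, 10.0])   # illustrative weights
grad = np.zeros(3)                # zero gradient, to isolate the decay term
print(sgd_step_with_weight_decay(w, grad))  # the largest weight shrinks the most
```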
I am using Caffe and also NVIDIA DIGITS. I want to take AlexNet pretrained on ImageNet and fine-tune it on my medical data. I have nearly 1,000 images; using 80% for training, I generated 40,000 images by data augmentation (cropping and rotation). However, I face severe overfitting. I tried to overcome this by adding multiple dropout layers, and the result changed from:
to:
but my accuracy does not improve.
my network specifications:
AlexNet pre-trained on ImageNet
base learning rate: 0.001
learning rate multiplier: 0.1 for convolution layers and 1 for fully connected layers; Xavier weight initialisation
dropout: 0.5
Now I want to add L2 regularization. I did not find such a layer in Caffe, so maybe I have to implement it myself.
First question: do you have any solution for my problem? (I have tried other things, such as changing the step size, changing the learning rate from 1 down to 10^(-5), where I found 0.001 works best, changing the weight decay, and adding various dropout layers, which helped, as you can see.)
Second question: can you please tell me how I can implement L2 regularization?
You have L2 regularization by default in Caffe.
See this thread for more information.
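Concretely, it is configured in the solver (and optionally per layer) rather than as a separate layer. A sketch of the relevant fields, with illustrative values:

```
# solver.prototxt (excerpt): weight_decay applies the penalty to all
# learnable parameters; regularization_type defaults to "L2".
weight_decay: 0.0005
regularization_type: "L2"   # "L1" is also supported

# train_val.prototxt (excerpt): per-parameter scaling of the learning rate
# and of the decay via lr_mult / decay_mult inside a layer's param blocks.
layer {
  name: "fc8_finetune"      # illustrative layer name
  type: "InnerProduct"
  param { lr_mult: 1 decay_mult: 1 }   # weights
  param { lr_mult: 2 decay_mult: 0 }   # bias (commonly not decayed)
  inner_product_param { num_output: 2 }
}
```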
Description: I am trying to train an AlexNet-like CNN (actually the same architecture but without groups) from scratch (50,000 images, 1,000 classes, and 10x augmentation). Each epoch has 50,000 iterations, and the image size is 227x227x3.
There was a smooth decline in the cost and an improvement in accuracy for the first few epochs, but now I am facing a problem where the cost has settled at ~6 (it started from 13) for a long time; it has been a day, and the cost keeps oscillating in the range 6.02-6.7. The accuracy has also become stagnant.
Now I am not sure what to do, and I don't have any proper guidance. Is this a problem of vanishing gradients, or of being stuck in a local minimum? To avoid it, should I decrease my learning rate? Currently the learning rate is 0.08 with ReLU activation (which helps avoid vanishing gradients), Glorot initialization, and a batch size of 96. Before making another change and training again for days, I want to make sure I am moving in the right direction. What could be the possible reasons?
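If decreasing the learning rate is the next experiment, it may help to schedule the decrease rather than pick a single new value. Below is a small Python sketch of a step decay and a plateau-based drop; all thresholds and factors are placeholders, not values I have validated on this model.

```python
# Two simple learning-rate schedules, shown only as sketches;
# every number below is a placeholder.

def step_decay(base_lr, epoch, drop_every=10, factor=0.1):
    """Multiply the learning rate by `factor` every `drop_every` epochs."""
    return base_lr * (factor ** (epoch // drop_every))

def reduce_on_plateau(current_lr, losses, patience=5, factor=0.5, eps=1e-3):
    """Halve the learning rate if the last `patience` epochs did not improve
    on the best loss seen before them by more than `eps`."""
    if len(losses) <= patience:
        return current_lr
    best_before = min(losses[:-patience])
    if min(losses[-patience:]) > best_before - eps:
        return current_lr * factor
    return current_lr

print(step_decay(0.08, epoch=25))  # 0.08 -> 0.0008 after two drops
print(reduce_on_plateau(0.08, [13, 9, 7, 6.5, 6.2, 6.3, 6.4, 6.3, 6.25, 6.3]))
# no recent improvement over the earlier best -> drops to 0.04
```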