I am using a piece of software that implements an artificial neural network. One of its parameters is called Learning Rate (also known as alpha). The learning rate is controlled by moving a slider back and forth: on one end of the slider is the value 1E-05, on the other end is just 1, and in between are various values such as 9E-05, 0.000045, etc. What I want to know is which of these two learning rates gives the fastest learning: 1E-05 on one end or 1 on the other. Thanks.
The learning rate is not about the speed of training; it is the size of the step taken when using a fairly naive approximation of the objective function (linear for first-order optimizers, quadratic for second-order ones). Consequently, a very small learning rate leads to slow training, but a large learning rate can lead to no training at all. Furthermore, the values in between need not behave monotonically: you can have a training run where a smaller learning rate actually converges faster than a larger one. So even though we might naively say that a larger learning rate means faster training, in general this is not true, and one cannot say which learning rate is the fastest. You can only rely on general heuristics and observations here: start with a large learning rate, and if the results are bad, try reducing it. In terms of actual training-time guarantees, there are none.
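To make "size of the step" concrete, here is a minimal sketch in plain Python (not tied to your particular software) of gradient descent on f(x) = x^2 with different learning rates: a tiny rate barely moves, a moderate rate converges, and a rate that is too large oscillates or diverges.

```python
def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x^2 with plain gradient descent; the gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # take a step of size lr along the negative gradient
    return x

for lr in (1e-5, 0.1, 1.0, 1.1):
    print(f"lr={lr:>7}: x after 50 steps = {gradient_descent(lr):.6f}")
# lr=1e-5 barely moves from the starting point (slow training),
# lr=0.1 converges to ~0, lr=1.0 oscillates forever, lr=1.1 diverges.
```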
Is loss dependent upon learning rate and batch size? For example, if I keep the batch size at 4 and a learning rate of, say, 0.002, then the loss does not converge, but if I change the batch size to 32 while keeping the learning rate the same, I get a converging loss curve. Is this okay?
I would say that the loss is highly dependent on what parameters you use for your training. On the other hand, I would not call it a dependency in terms of a mathematical function but rather a relation.
If your network does not learn you need to tweak the parameters (architecture, learning rate, batch size, etc.).
It is hard to give a more specific answer to your question. Which parameters are okay depends on the problem. However, if it converges and you can validate your solution, I would say that you are fine.
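If you want to see this relation for yourself, one option is simply to train the same model twice with different batch sizes and compare the loss. A minimal sketch, assuming PyTorch; the toy data, model, and learning rate are placeholders for your own setup:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data purely for illustration; substitute your own dataset.
X, y = torch.randn(1024, 20), torch.randint(0, 2, (1024,))
dataset = TensorDataset(X, y)

def train(batch_size, lr=0.002, epochs=5):
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()
    return loss.item()  # last mini-batch loss, just for a rough comparison

for bs in (4, 32):
    print(f"batch_size={bs}: final loss ~ {train(bs):.4f}")
```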
I am new to the DNN field and I am fed up with tuning hyperparameters and other parameters in a DNN, because there are a lot of parameters to tune and it feels like a multivariable analysis without the help of a computer. How can a human move towards the highest accuracy achievable for a task with a DNN, given the huge number of variables inside it? And how will we know what accuracy is possible to achieve with a DNN, or do I have to give up on DNNs? I am lost. Help is appreciated.
Main problems I have:
1. What are the limits of DNNs / when do we have to give up on DNNs?
2. What is the proper way of tuning without missing good parameter values?
Here is the summary I have put together from studying the theory in this field. Corrections are much appreciated if I am wrong or have misunderstood something, and you can add anything I missed. Items are sorted by importance according to my knowledge.
For overfitting:
1. reduce the number of layers
2. reduce the number of nodes per layer
3. add regularizers (L1 / L2 / L1-L2) - have to decide the factors
4. add dropout layers - have to decide the dropout factor
5. reduce the batch size
6. stop training earlier (early stopping)
For underfitting:
1. increase the number of layers
2. increase the number of nodes per layer
3. add different types of layers (Conv, LSTM, ...)
4. add learning rate decay (decide the type and its parameters)
5. reduce the learning rate
Other than that, we can generally adjust:
1. the number of epochs (by watching what happens during training)
2. the learning rate
3. batch normalization - for faster training
4. initialization techniques (zero / random / Xavier / He)
5. different optimization algorithms
Automatic tuning methods:
- GridSearchCV - but for this we have to choose which values we want to try, and it takes a lot of time.
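For concreteness, here is a minimal sketch of how some of these knobs look in code. I am assuming PyTorch here; the layer sizes, the dropout rate of 0.3, the weight decay of 1e-4, the patience of 5, and the toy random data are only placeholders standing in for a real setup.

```python
import torch
from torch import nn

# Toy data; replace with your real dataset.
X, y = torch.randn(512, 100), torch.randint(0, 10, (512,))
X_val, y_val = torch.randn(128, 100), torch.randint(0, 10, (128,))

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # dropout layer: p is the dropout factor you have to decide
    nn.Linear(64, 10),
)
loss_fn = nn.CrossEntropyLoss()
# weight_decay adds an L2 penalty on the weights (one of the regularizers in the list).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Early stopping ("stop training earlier"): track validation loss, stop when it stagnates.
best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()   # full-batch step just to keep the sketch short
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")  # keep the best parameters so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```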
Short Answer: You should experiment a lot!
Long Answer: At first, you may be overwhelmed by having plenty of knobs that you can tweak, but you gradually become experienced. A very quick way to gain some intuition on how you should tune the hyperparameters of your model is trying to replicate what other researchers have published. By replicating the results (and trying to improve the state-of-the-art), you acquire the intuition about deep learning.
I personally follow no particular order in tuning the hyperparameters of the model. Instead, I implement a quick-and-dirty model and try to improve it. For instance, if I see overshoots in validation accuracy, which might indicate that the model is bouncing around the sweet spot, I divide the learning rate by ten and see how it goes. If I see the model begin to overfit, I use early stopping to save the best parameters before it overfits. I also play with dropout rates and weight decay to find the best combination of them, so that the model fits well enough while keeping the regularization effect. And so on.
To correct some of your assumptions: adding different types of layers will not necessarily keep your model from overfitting. Moreover, sometimes (especially when using transfer learning, which is a trend these days), you cannot simply add a convolutional layer to your neural network.
Assuming you are dealing with computer vision tasks, data augmentation is another useful approach to increase the amount of data available for training your model and to improve its performance.
Also, note that batch normalization has a regularization effect as well. Weight decay is another widely used implementation of L2 regularization.
Another interesting technique that can improve the training of neural networks is the One Cycle policy for learning rate and momentum (if applicable). Check this paper out: https://doi.org/10.1109/WACV.2017.58
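If you want to try the one-cycle policy without implementing the schedule yourself, PyTorch provides OneCycleLR. A minimal sketch, where the model, the dummy loss, max_lr, epochs, and steps_per_epoch are placeholders you would replace with your own:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)                      # stand-in for your real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epochs, steps_per_epoch = 10, 100             # placeholders: batches per epoch in your loader
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=epochs, steps_per_epoch=steps_per_epoch
)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        loss = model(torch.randn(32, 10)).sum()   # dummy loss just so the sketch runs
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                      # the one-cycle schedule is stepped per batch
# By default the scheduler also cycles momentum inversely to the learning rate.
```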
I implemented ResNet for CIFAR-10 following this paper: https://arxiv.org/pdf/1512.03385.pdf
But my accuracy is significantly different from the accuracy reported in the paper:
Mine - 86%
Paper's - 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little too generic. My opinion is that the network is overfitting to the training data set: as you can see, the training loss is quite low, but after epoch 50 the validation loss is not improving anymore.
I didn't read the paper in depth, so I don't know how they solved the problem, but increasing regularization might help. The following link will point you in the right direction: http://cs231n.github.io/neural-networks-3/
Below I copied the summary of that text:
Summary
To train a Neural Network:
- Gradient check your implementation with a small batch of data and be aware of the pitfalls.
- As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data.
- During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
- The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
- Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
- Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs) to fine (narrower ranges, training for many more epochs).
- Form model ensembles for extra performance.
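To make the random-search point concrete, here is a rough sketch of a coarse-to-fine random search over learning rate and weight decay. It is plain Python; train_and_evaluate is a hypothetical stand-in for a short training run that returns validation accuracy, and the ranges and trial counts are only illustrative:

```python
import math
import random

def train_and_evaluate(cfg, epochs):
    """Hypothetical stand-in: replace with a real (short) training run that
    returns validation accuracy. The toy score below just lets the sketch run."""
    return -abs(math.log10(cfg["lr"]) + 3)

def sample_config(lr_range, wd_range):
    """Sample learning rate and weight decay log-uniformly from the given ranges."""
    lr = 10 ** random.uniform(math.log10(lr_range[0]), math.log10(lr_range[1]))
    wd = 10 ** random.uniform(math.log10(wd_range[0]), math.log10(wd_range[1]))
    return {"lr": lr, "weight_decay": wd}

# Coarse stage: wide ranges, only a few epochs per trial.
coarse = [sample_config((1e-5, 1e-1), (1e-6, 1e-2)) for _ in range(20)]
results = [(train_and_evaluate(cfg, epochs=3), cfg) for cfg in coarse]
best = max(results, key=lambda r: r[0])[1]

# Fine stage: narrow ranges around the best coarse result, many more epochs.
fine = [sample_config((best["lr"] / 3, best["lr"] * 3),
                      (best["weight_decay"] / 3, best["weight_decay"] * 3))
        for _ in range(10)]
results = [(train_and_evaluate(cfg, epochs=50), cfg) for cfg in fine]
```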
I would argue that the difference in data preprocessing makes the difference in performance. The authors use padding and random crops, which in essence increases the number of training samples and decreases the generalization error. Also, as the previous poster said, you are missing regularization features, such as weight decay.
You should take another look at the paper and make sure you implement everything the way they did.
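For reference, the augmentation described in the paper (pad by 4 pixels, take random 32x32 crops, flip horizontally) together with weight decay looks roughly like this with torchvision and PyTorch. The normalization statistics are commonly used values rather than something taken from the paper, and the ResNet-18 here is only a stand-in for the paper's CIFAR-specific architecture:

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Commonly used CIFAR-10 channel statistics; double-check against your own pipeline.
normalize = transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # pad 4 pixels, then take a random 32x32 crop
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])
test_transform = transforms.Compose([transforms.ToTensor(), normalize])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=train_transform)

# Weight decay (L2 regularization) in the optimizer; 1e-4 with momentum 0.9 matches the
# values reported in the ResNet paper, but treat them as a starting point.
model = torchvision.models.resnet18(num_classes=10)  # stand-in; the paper uses its own CIFAR variant
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
```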
That is to say, if during training you set your learning rate too high and unfortunately ended up in a local minimum where the loss is too high, is it better to retrain with a lower learning rate, or should you continue training the poorly performing model with a higher learning rate, in the hope that the loss will escape the local minimum?
In the strict sense, you don't have to retrain, since you can continue training with a lower learning rate (this is called a learning rate schedule). A very common approach is to lower the learning rate (usually by dividing it by 10) each time the loss stagnates or becomes constant.
Another approach is to use an optimizer that scales the learning rate with the gradient magnitude, so the effective step size naturally decays as you get closer to a minimum. Examples of this are Adam, Adagrad and RMSProp.
In any case, make sure to find the optimal learning rate on a validation set; this will considerably improve performance and make learning faster. This applies to plain SGD as well as to any other optimizer.
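The "divide by 10 when the loss stagnates" schedule is available out of the box in PyTorch as ReduceLROnPlateau. A minimal sketch; the factor and patience values, the toy model, and the dummy validation loss are only illustrative:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)                     # stand-in for your real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Divide the learning rate by 10 whenever the monitored loss has not improved
# for `patience` consecutive epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

for epoch in range(50):
    # ... train for one epoch here, then compute the real validation loss ...
    val_loss = (model(torch.randn(64, 10)) ** 2).mean()  # dummy metric so the sketch runs
    scheduler.step(val_loss)                 # pass the metric the scheduler should watch
```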
I have created an artificial neural network in Java that learns with the backpropagation algorithm, and I have produced the following graph, which shows how changing the learning rate affects the time it takes for the network to train.
It seems to show that the learning is very unstable, considering it either trains correctly very quickly or gets stuck (the backpropagation will stop training at either 1 minute or a specific error threshold). I want to understand why the network is so unpredictable. Is the momentum too high? Do I need an adaptive learning rate? Is this a good example of how local minima affect training: http://www.willamette.edu/~gorr/classes/cs449/momrate.html?
This is the graph I produced:
http://i.stack.imgur.com/ooXqP.png
If you are initializing random weights before each new experiment, you are starting the optimization every time from a new random point in weight space. For a NN this matters a lot, because from different starting points gradient descent will converge to different local optima, of course with a different number of iterations and a different amount of time needed to converge. You need to generate the initialization weights only once and start every experiment with a new learning rate from that same state, not from a new random state.
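Your network is in Java, but the fix is language-agnostic. A minimal sketch of the idea in PyTorch, where the tiny model and the learning-rate values are just placeholders:

```python
import copy
import torch
from torch import nn

def make_model():
    return nn.Sequential(nn.Linear(10, 8), nn.Sigmoid(), nn.Linear(8, 1))

# Generate the initial weights once and keep a copy of them.
model = make_model()
initial_state = copy.deepcopy(model.state_dict())

for lr in (0.01, 0.1, 0.5, 1.0):
    model.load_state_dict(initial_state)   # every run starts from the SAME point in weight space
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    # ... run your training loop here and record the time / iterations to converge ...
```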