I have created an artificial neural network in Java that learns with the backpropagation algorithm. I have produced the graph below, which shows how changing the learning rate affects the time it takes for the network to train.
It seems to show that learning is very unstable: the network either trains correctly very quickly or gets stuck (training stops at either 1 minute or a specific error threshold). I want to understand why the network is so unpredictable. Is the momentum too high? Do I need an adaptive learning rate? Is this a good example of how local minima affect training: http://www.willamette.edu/~gorr/classes/cs449/momrate.html?
This is the graph I produced:
http://i.stack.imgur.com/ooXqP.png
If you are initializing random weights before each new experiment, you are starting the optimization every time from a new random point in weight space. For a neural network this matters a lot: from different starting points, gradient descent converges to different local optima, and of course with a different number of iterations and a different time needed to converge. You should generate the initial weights only once and start every experiment with a new learning rate from that same state, not from a new random state.
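As a rough illustration of the pattern (using plain gradient descent on a toy quadratic loss as a stand-in for your Java network, which I haven't seen):

```java
import java.util.Arrays;
import java.util.Random;

// Toy illustration of the point above: generate one set of initial weights and run
// every learning-rate experiment from that same starting point. The "network" here
// is just gradient descent on a simple quadratic loss, standing in for your Java
// backpropagation code (an assumption, since I haven't seen it).
public class LearningRateSweep {
    public static void main(String[] args) {
        Random rng = new Random(42);               // fix the seed once
        double[] initial = new double[5];
        for (int i = 0; i < initial.length; i++) {
            initial[i] = rng.nextDouble() * 2 - 1; // one random start in [-1, 1]
        }

        for (double lr : new double[]{0.01, 0.05, 0.1, 0.5}) {
            double[] w = initial.clone();          // same starting point for every learning rate
            int iters = 0;
            while (loss(w) > 1e-6 && iters < 100_000) {
                double[] g = gradient(w);
                for (int i = 0; i < w.length; i++) {
                    w[i] -= lr * g[i];             // plain gradient descent step
                }
                iters++;
            }
            System.out.printf("lr=%.2f converged in %d iterations%n", lr, iters);
        }
    }

    // loss(w) = sum of w_i^2, minimum at the origin
    static double loss(double[] w) {
        return Arrays.stream(w).map(x -> x * x).sum();
    }

    // gradient of the loss above: 2 * w
    static double[] gradient(double[] w) {
        double[] g = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            g[i] = 2 * w[i];
        }
        return g;
    }
}
```

With the starting point held fixed, any remaining differences between runs come from the learning rate itself rather than from where the optimization happened to start.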
What is the difference between the two? Both are used to reach the minimum point (lowest loss) of a function.
I understand (I think) that the learning rate is multiplied by the gradient (the slope) to take the gradient descent step, but is that right? Am I missing something?
What is the difference between lr and gradient?
Thanks
Deep learning neural networks are trained using the stochastic gradient descent algorithm.
Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm, referred to as simply backpropagation.
The amount that the weights are updated during training is referred to as the step size or the “learning rate.”
Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.
The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more training epochs given the smaller changes made to the weights each update, whereas larger learning rates result in rapid changes and require fewer training epochs.
A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.
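To make that concrete, here is what a single weight update looks like; the learning rate simply scales the gradient before it is subtracted from each weight (a minimal Java sketch, not code from any particular framework):

```java
// One stochastic gradient descent update: the learning rate scales the gradient
// before it is subtracted from each weight. A small learning rate means a small
// step, a large one means a large step. (Illustrative code, not from any library.)
static void updateWeights(double[] weights, double[] gradients, double learningRate) {
    for (int i = 0; i < weights.length; i++) {
        weights[i] -= learningRate * gradients[i];
    }
}
```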
The challenge of training deep learning neural networks involves carefully selecting the learning rate. It may be the most important hyperparameter for the model.
The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.
— Page 429, Deep Learning, 2016.
For more on what the learning rate is and how it works, see the post:
How to Configure the Learning Rate Hyperparameter When Training Deep Learning Neural Networks
You can also refer to: Understand the Impact of Learning Rate on Neural Network Performance
I want to implement a simple feed-forward neural network to approximate the function y=f(x)=ax^2 where a is some constant and x is the input value.
The NN has one input node, one hidden layer with 1-n nodes, and one output node. For example, I input the value 2.0 -> the NN produces 4.0, and again I input 3.0 -> the NN produces 9.0 or close to it and so on.
If I understand "online training," the training data are fed in one at a time - meaning I input the value 2.0 -> I iterate with gradient descent 100 times, then I pass the value 3.0 and iterate another 100 times.
However, when I try to do this with my experimental/learning NN - I input the value 2.0 -> the error gets very small -> the output is very close to 4.0.
Now if I want to predict for the input 3.0 -> the NN produces 4.36 or something instead of 9.0. So the NN just learns the last training value.
How can I use online-training to get a Neural Network that approximates the desired function for a range [-d, d]? What am I missing?
The reason why I like online-training is that eventually I want to input a time series - and map that series to the desired function. This is beside the point, but in case someone was wondering.
Any advice would be greatly appreciated.
More info - I am activating the hidden layer with the Sigmoid function and the output layer with the linear one.
The reason why I like online-training is that eventually I want to input a time series - and map that series to the desired function.
Recurrent Neural Networks (RNNs) are the state of the art for modeling time series. This is because they can take inputs of arbitrary length, and they can also use internal state to model the changing behavior of the series over time.
Training feedforward neural networks on time series is an older method that will generally not perform as well. They require a fixed-size input, so you must choose a fixed-size sliding time window, and they also don't preserve state, so it is hard to learn a time-varying function.
I can find very little about "online training" of feedforward neural nets with stochastic gradient descent to model non-stationary behavior except for a couple of very vague references. I don't think this provides any benefit besides allowing you to train in real time when you are getting a stream of data one at a time. I don't think it will actually help you model time-dependent behavior.
Most of the older methods I can find in the literature about online learning for neural networks use a hybrid approach with a neural network and some other method that can help capture time dependencies. Again, these should all be inferior to RNNs, not to mention harder to implement in practice.
Furthermore, I don't think you are implementing online training correctly. It should be stochastic gradient descent with a mini-batch size of 1. Therefore, you only run one iteration of gradient descent on each training example per training epoch. Since you are running 100 iterations before moving on to the next training example, you are going too far down the error gradient with respect to that single example, resulting in serious overfitting to a single data point. This is why you get poor results on the next input. I don't think this is a justifiable method of training, nor do I think it will work for time series.
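For what it's worth, here is a self-contained sketch of online training (mini-batch size 1) for the y = x^2 case: one forward/backward pass and one weight update per example, cycling through the whole training set every epoch. The tiny 1-8-1 network with a sigmoid hidden layer and a linear output, and all hyperparameters, are illustrative choices rather than your actual code:

```java
import java.util.Random;

// Self-contained sketch of online training done as described above: one forward/backward
// pass and ONE weight update per example (mini-batch size 1), cycling through the whole
// training set every epoch, instead of 100 updates on a single example. The 1-8-1 network
// (sigmoid hidden layer, linear output) and all hyperparameters are illustrative choices,
// not taken from the question.
public class OnlineTrainingSketch {
    static final int HIDDEN = 8;

    public static void main(String[] args) {
        double lr = 0.01;
        Random rng = new Random(1);

        double[] w1 = new double[HIDDEN], b1 = new double[HIDDEN], w2 = new double[HIDDEN];
        double b2 = 0.0;
        for (int j = 0; j < HIDDEN; j++) {
            w1[j] = rng.nextGaussian();
            w2[j] = rng.nextGaussian();
        }

        // Training inputs covering the whole range of interest, targets y = x^2
        double[] xs = {-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0};

        for (int epoch = 0; epoch < 20_000; epoch++) {
            for (double x : xs) {
                double y = x * x;

                // forward pass
                double[] h = new double[HIDDEN];
                double yHat = b2;
                for (int j = 0; j < HIDDEN; j++) {
                    h[j] = sigmoid(w1[j] * x + b1[j]);
                    yHat += w2[j] * h[j];
                }

                // backward pass for squared error, followed by a single update
                double delta = yHat - y;
                b2 -= lr * delta;
                for (int j = 0; j < HIDDEN; j++) {
                    double dHidden = delta * w2[j] * h[j] * (1 - h[j]); // uses w2 before updating it
                    w2[j] -= lr * delta * h[j];
                    w1[j] -= lr * dHidden * x;
                    b1[j] -= lr * dHidden;
                }
            }
        }

        // The network should now approximate x^2 across the range, not just the last example.
        for (double x : new double[]{-1.7, 0.3, 1.2}) {
            double yHat = b2;
            for (int j = 0; j < HIDDEN; j++) {
                yHat += w2[j] * sigmoid(w1[j] * x + b1[j]);
            }
            System.out.printf("x=%+.1f  prediction=%.3f  target=%.3f%n", x, yHat, x * x);
        }
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }
}
```

The key difference from your description is the loop order: the outer loop is over epochs and the inner loop visits every example once, so no single data point gets 100 consecutive updates.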
You haven't mentioned what your activations are or your loss function is, so I can't comment on whether those are appropriate for the task.
Also, I don't think learning y=ax^2 is a good analogy for time series prediction. It is a static function that always gives the same output for a given input, regardless of the index of the input or the values of previous inputs.
That is to say: if during training you set your learning rate too high and unfortunately got stuck in a local minimum where the loss is too high, is it better to retrain with a lower learning rate, or should you continue from the poorly performing model with a higher learning rate, in the hope that the loss will escape the local minimum?
Strictly speaking, you don't have to retrain, since you can continue training with a lower learning rate (this is called a learning rate schedule). A very common approach is to lower the learning rate (usually by dividing it by 10) each time the loss stagnates or becomes constant.
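A rough sketch of such a schedule in Java (the runEpoch callback stands in for whatever training loop you already have, and the thresholds are arbitrary):

```java
import java.util.function.DoubleUnaryOperator;

// Rough sketch of a "reduce on plateau" schedule. runEpoch is a placeholder: it should
// train for one epoch at the given learning rate on the existing weights and return the
// validation loss. All thresholds below are arbitrary.
public class ReduceOnPlateau {
    static void train(DoubleUnaryOperator runEpoch) {
        double lr = 0.1;
        double best = Double.MAX_VALUE;
        int stagnantEpochs = 0;

        for (int epoch = 0; epoch < 200 && lr > 1e-6; epoch++) {
            double loss = runEpoch.applyAsDouble(lr);
            if (loss < best - 1e-4) {       // still improving meaningfully
                best = loss;
                stagnantEpochs = 0;
            } else if (++stagnantEpochs >= 5) {
                lr /= 10.0;                 // loss has stagnated: divide the learning rate by 10
                stagnantEpochs = 0;
            }
        }
    }
}
```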
Another approach is to use an optimizer that adapts the step size based on the gradient magnitudes it accumulates, so the effective learning rate naturally decays as you get closer to a minimum. Examples of this are Adam, Adagrad and RMSProp.
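For intuition, here is a minimal Adagrad-style update in Java: each weight accumulates its squared gradients, and the effective step shrinks as that sum grows (Adam and RMSProp refine this with moving averages and momentum):

```java
// Minimal Adagrad-style update for intuition: each weight accumulates its squared
// gradients, and the effective step shrinks as that sum grows. (Adam and RMSProp
// refine this idea with moving averages and momentum.) Method names are illustrative.
static void adagradUpdate(double[] weights, double[] gradients,
                          double[] squaredGradSum, double learningRate) {
    double eps = 1e-8;  // avoids division by zero
    for (int i = 0; i < weights.length; i++) {
        squaredGradSum[i] += gradients[i] * gradients[i];
        weights[i] -= learningRate * gradients[i] / (Math.sqrt(squaredGradSum[i]) + eps);
    }
}
```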
In any case, make sure to find the optimal learning rate on a validation set; this will considerably improve performance and make learning faster. This applies to plain SGD as well as to any other optimizer.
I am using a software program of the type that is known as an Artificial Neural Network. One of the parameters of the software is called Learning Rate (also known as alpha). The learning rate setting can be controlled by moving a slider back and forth. On one side of the slider is the value 1E-05, on the other side is just 1. In between are various values such as 9E-05, .000045, etc. What I want to know is which of these two learning rates is the faster one: 1E-05 on one side or 1 on the other. Thanks.
The learning rate is not about the speed of training; it is the size of the step taken when using a fairly naive approximation of the function (linear for first-order optimizers, quadratic for second-order ones). Consequently, a very small learning rate should lead to slow training, but a big learning rate can lead to no training at all. Furthermore, the values in between need not behave monotonically (you can have a training run where a smaller learning rate actually converges faster than a bigger one). So even though we could naively say that a big learning rate means faster training, in general this is not true; moreover, one cannot say which learning rate is the fastest. You can only use some general heuristics and observations here: you can start with a big learning rate and, if the results are bad, try reducing it. But in terms of actual guarantees on training time, there are none.
I want my neural network to be trained on every new data point that it classifies incorrectly. Assuming that I somehow label the data correctly every time the network makes a mistake, how many backpropagation passes do I need to run on this single new instance in order to train my network for that particular case? Is there a better way to train a neural network on real-time scenarios?
It depends on the optimization algorithm you use. The backpropagation by itself calculates only the gradient, which is used by the next iteration of the algorithm.
In the simplest case you can use your own gradient descent implementation and check the behavior of your cost function. If the cost function decreases by less than some threshold epsilon, you can break out of the optimization loop for the current instance. You can also limit the maximum number of iterations.
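As a toy illustration of both stopping conditions (a threshold on the cost decrease and a maximum iteration count), here is gradient descent on the one-dimensional cost f(w) = (w - 3)^2; all the numbers are arbitrary:

```java
// Sketch of the two stopping conditions mentioned above, on a toy 1-D cost
// function f(w) = (w - 3)^2: stop when the cost improves by less than epsilon,
// or when a maximum number of iterations is reached.
public class StoppingConditions {
    public static void main(String[] args) {
        double w = 0.0, lr = 0.1, eps = 1e-8;
        int maxIters = 10_000;

        double prevCost = cost(w);
        for (int i = 0; i < maxIters; i++) {
            w -= lr * gradient(w);               // one gradient descent step
            double c = cost(w);
            if (prevCost - c < eps) {            // cost barely decreased: stop
                System.out.println("Converged after " + (i + 1) + " iterations, w = " + w);
                return;
            }
            prevCost = c;
        }
        System.out.println("Hit the iteration limit, w = " + w);
    }

    static double cost(double w)     { return (w - 3) * (w - 3); }
    static double gradient(double w) { return 2 * (w - 3); }
}
```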
It is also worth considering advanced optimizers such as fminunc in Matlab, which stop by themselves when they reach an optimum.
You may find this post about different termination conditions of gradient descent very useful.
I think that learning from only a single instance at a time is not very efficient; the cost function can behave jerkily. You may consider the batch learning method, where you learn using small batches of new instances. It should give a better, more stable rate of learning.
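A minimal sketch of the batch idea, where the gradient callback is a placeholder for your own backpropagation code:

```java
import java.util.function.BiFunction;

// Sketch of the batch idea above: instead of one update per new instance, collect a
// small batch, average the per-example gradients, and apply a single update. The
// gradient callback is a placeholder for your own backpropagation; each row of the
// batch holds one instance's inputs and target, packed however your code expects.
public class MiniBatchUpdate {
    static void update(double[] weights, double[][] batch,
                       BiFunction<double[], double[], double[]> gradientOfExample,
                       double learningRate) {
        double[] avgGrad = new double[weights.length];
        for (double[] example : batch) {
            double[] g = gradientOfExample.apply(weights, example);
            for (int i = 0; i < weights.length; i++) {
                avgGrad[i] += g[i] / batch.length;            // running average of the gradients
            }
        }
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * avgGrad[i];          // one smoother update per batch
        }
    }
}
```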
In order to illustrate how a network's accuracy depends on the iteration number and on the batch size, I experimented a bit with a neural network used to recognize handwritten digits. I had 4000 examples in the training set and 1000 examples in the validation set. Then I started the learning algorithm with different parameters and measured the resulting accuracy. You can see the result here:
Of course this plot describes only my particular case, but you can get some intuition on what to expect and on how to validate network parameters.