How do we define a bad learning rate in a neural network?

I am trying to come up with a proper definition of a bad learning rate in a neural network, as follows:
A bad learning rate in a neural network is one that is set too low or too high: with a learning rate that is too low, the network takes too long to train, while with a learning rate that is too high, the weights change too quickly and the output may fail to converge.
Any suggestion would be much appreciated.

I believe that an efficient learning rate (alpha) depends on the data. The points that you have mentioned are absolutely correct with regard to inefficient learning rates. So, there is no hard and fast rule for selecting alpha. Let me enumerate the steps I take while deciding on alpha:
You would obviously want a large alpha so that your model learns quickly.
Also note that a large alpha can overshoot the minimum, and hence your hypothesis will not converge.
To take care of this, you can use learning rate decay. This decreases your learning rate as you approach the minimum and slows down learning so that your model doesn't overshoot.
A couple of ways to do this:
Step decay
Exponential decay
Linear decay
You can select any of these and then train your model; a minimal sketch of the three schedules is shown below. Having said that, let me point out that it will still require some trial and error on your side until you get the optimum result.
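For illustration, here is a minimal Python sketch of the three schedules; the initial rate, decay factors, and drop interval are made-up values, not recommendations:

```python
import math

# Minimal sketch of three common learning-rate decay schedules.
# The constants (initial rate, decay factors, drop interval) are illustrative only.

def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

def exponential_decay(epoch, initial_lr=0.1, k=0.05):
    """Multiply the learning rate by exp(-k) every epoch."""
    return initial_lr * math.exp(-k * epoch)

def linear_decay(epoch, initial_lr=0.1, final_lr=0.001, total_epochs=100):
    """Interpolate linearly from initial_lr down to final_lr."""
    t = min(epoch / total_epochs, 1.0)
    return initial_lr + t * (final_lr - initial_lr)

for epoch in (0, 20, 40, 60, 80):
    print(epoch, step_decay(epoch), exponential_decay(epoch), linear_decay(epoch))
```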

Related

TFF: test accuracy fluctuates

I am training a ResNet50 model with TFF and I use accuracy on held-out test data for evaluation, but I see many fluctuations, as shown in the figure below. How can I avoid this fluctuation?
I would say behavior such as this is to be expected for stochastic optimization in general. The inherent variance causes you to oscillate somewhere around a good solution. The magnitude of the variance and the properties of the optimization objective control how much this oscillation shows up in an accuracy metric.
For plain SGD, decreasing the learning rate decreases the variance, but it also slows down convergence.
For optimization methods for federated learning, the story is a bit more complicated, but decreasing the client learning rate or decreasing the number of local steps (while keeping other things the same) can have a similar effect, typically including slowing down convergence. More details can be found in https://arxiv.org/abs/2007.00878, also mentioned in the other answer. Decreasing the client learning rate across rounds could potentially work as well; the details can also differ based on exactly which optimization method you are using.
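As a rough sketch of that last idea, you could decay the client learning rate across rounds. Everything here is a hypothetical placeholder (run_one_round is not a TFF API; the constants are assumed values):

```python
# Hypothetical sketch of decaying the client learning rate across rounds.
# `run_one_round` is a stand-in for whatever your training loop / TFF iterative
# process does; the decay constants are assumed values, not recommendations.

def run_one_round(state, client_lr):
    # Placeholder: broadcast the model, run local client training with
    # `client_lr`, aggregate the updates, and return the new server state.
    return state

state = None            # placeholder server state
client_lr = 0.1         # initial client learning rate (assumed)
decay_per_round = 0.99  # multiplicative decay applied each round (assumed)

for round_num in range(200):
    state = run_one_round(state, client_lr=client_lr)
    client_lr = max(client_lr * decay_per_round, 1e-4)  # floor so training never freezes
```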
How is the test accuracy calculated? How many local epochs are the clients training?
If the global model is tested on a held-out set of examples, it is possible that clients are detrimentally overfitting during local training. As the global model approaches convergence, each client ends up training a model that works well for them individually but may be diverging from the optimal global model (sometimes called client drift, https://arxiv.org/abs/1910.06378). This may occur when a client's local dataset has a distribution very different from the global distribution, and it is more likely when the client learning rates are high (https://arxiv.org/abs/2007.00878).
Decreasing the client learning rate, reducing the number of steps/batches, and other methods that cause the clients to do less "work" per communication round may reduce the fluctuation.
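As a rough sketch of the "less work per round" idea, you could cap the number of local steps each client takes; client_step, the toy weights/dataset, and the 20-step cap below are hypothetical placeholders:

```python
# Hypothetical sketch of capping the local work each client does per round.
# `client_step` and the toy data are placeholders; the 20-step cap is an
# assumed value to tune for your setup.

MAX_LOCAL_STEPS = 20

def client_step(weights, batch, lr):
    # Placeholder for one local minibatch SGD update.
    return weights

def client_update(model_weights, local_dataset, client_lr=0.01):
    weights = list(model_weights)
    for step, batch in enumerate(local_dataset):
        if step >= MAX_LOCAL_STEPS:   # stop early: fewer local steps, less client drift
            break
        weights = client_step(weights, batch, client_lr)
    return weights

# Example: a client holding 100 batches only trains on the first 20 each round.
new_weights = client_update([0.0, 0.0], [None] * 100)
```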

Best Reinforcement Learner Optimizer

I'm running a SAC reinforcement learner for a robotics application with some pretty decent results. One of the reasons I opted for reinforcement learning is for the ability for learning in the field, e.g. to adjust to a mechanical change, such as worn tires or a wheel going a little out of alignment.
My reinforcement learner restores its last saved weights and replay buffer upon startup, so it doesn't need to retrain every time I turn it on. However, one concern I have is with respect to the optimizer.
Optimizers have come a long way since Adam, but everything I read and all the RL code samples I see still seem to use Adam with a fixed learning rate. I'd like to take advantage of some of the advances in optimizers, e.g. one-cycle AdamW. However, a one-cycle optimizer seems inappropriate for a continuous real-world reinforcement learning problem: I imagine it's pretty good for the initial training/calibration, but I expect the low final learning rate would react too slowly to mechanical changes.
One thought I had was to do a one-cycle approach for initial training, and then trigger a smaller one-cycle restart if a change in error indicates that something has changed (perhaps the size of the restart could be based on the size of the change in error).
Has anyone experimented with optimizers other than ADAM for reinforcement learning or have any suggestions for dealing with this sort of problem?
Reinforcement learning is very different from traditional supervised learning because the training data distribution changes as the policy improves. In optimization terms, the objective function can be said to be non-stationary. For this reason, I suspect your intuition is likely correct -- that a "one-cycle" optimizer would perform poorly after a while in your application.
My question is, what is wrong with Adam? Typically, the choice of optimizer is a minor detail for deep reinforcement learning; other factors like the exploration policy, algorithmic hyperparameters, or network architecture tend to have a much greater impact on performance.
Nevertheless, if you really want to try other optimizers, you could experiment with RMSProp, Adadelta, or Nesterov Momentum. However, my guess is that you will see incremental improvements, if any. Perhaps searching for better hyperparameters to use with Adam would be a more effective use of time.
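If you do want to experiment, swapping the optimizer is usually a one-line change. A minimal PyTorch sketch, using a stand-in network (the learning rates are common starting points, not tuned values):

```python
import torch
import torch.nn as nn

# Toy network standing in for the SAC actor or critic.
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))

# Pick one; the learning rates are common starting points, not tuned values.
optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)
# optimizer = torch.optim.RMSprop(net.parameters(), lr=3e-4)
# optimizer = torch.optim.Adadelta(net.parameters())
# optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9, nesterov=True)
```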
EDIT: In my original answer, I made the claim that the choice of a particular optimizer is not primarily important for reinforcement learning speed, and neither is generalization. I want to add some discussion that helps illustrate these points.
Consider how most deep policy gradient methods operate: they sample a trajectory of experience from the environment, estimate returns, and then conduct one or more gradient steps to improve the parameterized policy (e.g. a neural network). This process repeats until convergence (to a locally optimal policy).
Why must we continuously sample new experience from the environment? Because our current data can only provide a reasonable first-order approximation within a small trust region around the policy parameters that were used to collect that data. Hence, whenever we update the policy, we need to sample more data.
A good way to visualize this is to consider an MM algorithm. At each iteration, a surrogate objective is constructed based on the data we have now and then maximized. Each time, we will get closer to the true optimum, but the speed at which we approach it is determined only by the number of surrogates we construct -- not by the specific optimizer we use to maximize each surrogate. Adam might maximize each surrogate in fewer gradient steps than, say, RMSProp does, but this does not affect the learning speed of the agent (with respect to environment samples). It just reduces the number of minibatch updates you need to conduct.
SAC is a little more complicated than this, as it learns Q-values in an off-policy manner and conducts updates using experience replay, but the general idea holds. The best attainable policy is subject to whatever the current data in our replay memory are; regardless of the optimizer we use, we will need to sample roughly the same amount of data from the environment to converge to the optimal policy.
So, how do you make a faster (more sample-efficient) policy gradient method? You need to fundamentally change the RL algorithm itself. For example, PPO almost always learns faster than TRPO, because John Schulman and co-authors found a different and empirically better way to generate policy gradient steps.
Finally, notice that there is no notion of generalization here. We have an objective function that we want to optimize, and once we do optimize it, we have solved the task as well as we can. This is why I suspect that the "Adam-generalizes-worse-than-SGD" issue is actually irrelevant for RL.
My initial testing suggests that the details of the optimizer and its hyperparameters matter, at least for off-policy techniques. I haven't had the chance to experiment much with PPO or on-policy techniques, so I can't speak for those, unfortunately.
To speak to Brett_Daley's thoughtful response a bit: the optimizer is certainly one of the less important characteristics. The means of exploration and the use of a good prioritized replay buffer are certainly critical factors, especially with respect to achieving good initial results. However, my testing seems to show that the optimizer becomes important for the fine-tuning.
The off-policy methods I have been using have been problematic with fine-grained stability. In other words, the RL finds the mostly correct solution, but never really hones in on the perfect solution (or if it does find it briefly, it drifts off). I suspect the optimizer is at least partly to blame.
I did a bit of testing and found that varying the ADAM learning rate has an obvious effect. Too high and both the actor and critic bounce around the minimum and never converge on the optimal policy. In my robotics application this looks like the RL consistently makes sub-optimal decisions, as though there's a bit of random exploration with every action that always misses the mark a little bit.
OTOH, a lower learning rate tends to get stuck in sub-optimal solutions and is unable to adapt to changes (e.g. slower motor response due to low battery).
I haven't yet run any tests of a single-cycle schedule or AdamW for the learning rate, but I did a very basic test with a two-stage learning rate adjustment for both the actor and the critic (starting with a high rate and dropping to a low rate), and the result was a clearly more precise solution: it converged quickly during the high-learning-rate phase and then honed in better with the low learning rate.
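For what it's worth, here is a minimal PyTorch sketch of that kind of two-stage schedule, assuming separate Adam optimizers for the actor and critic; the networks, placeholder losses, rates, and switch point are illustrative stand-ins, not my actual setup:

```python
import torch
import torch.nn as nn

# Stand-in actor and critic networks.
actor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

# Stage 1: high learning rate; stage 2: drop by 10x at update 5000.
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_sched = torch.optim.lr_scheduler.MultiStepLR(actor_opt, milestones=[5000], gamma=0.1)
critic_sched = torch.optim.lr_scheduler.MultiStepLR(critic_opt, milestones=[5000], gamma=0.1)

for update in range(10000):
    # Placeholder losses; in SAC these would come from the replay buffer.
    actor_loss = actor(torch.randn(32, 8)).mean()
    critic_loss = critic(torch.randn(32, 10)).pow(2).mean()

    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_sched.step()   # learning rate drops from 1e-3 to 1e-4 at update 5000
    critic_sched.step()
```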
I imagine AdamW's improved weight decay regularization may give similarly better results, by keeping overfitting to individual training batches from pulling the solution away from the optimum.
Based on the improvement I saw, it's probably worth trying single-cycle methods and AdamW for the actor and critic networks to tune the results. I still have some concerns about how the lower learning rate at the end of the cycle will adapt to changes in the environment, but a simple solution for that may be to monitor the loss and restart the learning rate cycle if it drifts too much. In any case, more testing seems warranted.
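A rough sketch of that monitoring idea, assuming PyTorch's OneCycleLR: finish one cycle, record a baseline loss, and start a smaller cycle if the running loss later drifts upward. The network, the placeholder loss, the 1.5x threshold, and the cycle lengths are all made-up illustrative values:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.AdamW(net.parameters(), lr=1e-4, weight_decay=1e-2)

def new_cycle(max_lr, steps):
    # One-cycle schedule: ramp the learning rate up to max_lr, then anneal it back down.
    return torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=max_lr, total_steps=steps)

sched, steps_left = new_cycle(max_lr=1e-3, steps=2000), 2000  # initial, larger cycle
baseline_loss, running_loss = None, 0.0

for step in range(10000):
    loss = net(torch.randn(32, 8)).pow(2).mean()   # placeholder for the real SAC losses
    opt.zero_grad(); loss.backward(); opt.step()
    running_loss = 0.99 * running_loss + 0.01 * loss.item()

    if steps_left > 0:
        sched.step()
        steps_left -= 1
        if steps_left == 0:
            baseline_loss = running_loss           # loss level at the end of the cycle
    # If the running loss later drifts well above the post-cycle baseline, assume the
    # environment changed and launch a smaller restart cycle (1.5x is an arbitrary threshold).
    elif running_loss > 1.5 * baseline_loss:
        sched, steps_left = new_cycle(max_lr=3e-4, steps=1000), 1000
```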

Effect of learning rate on the neural network

I have a dataset of about 100 numeric values. I have set the learning rate of my neural network to 0.0001. I have successfully trained it on the dataset over 1 million iterations. But my question is: what is the effect of a very low learning rate on a neural network?
A low learning rate mainly implies slow convergence: you're moving down the loss function with smaller steps (the step size is set by the learning rate).
If your function is convex, this is not a problem; you will wait longer, but you'll reach a good solution.
If, as in the case of deep neural networks, your function is not convex, then a low learning rate could lead you to a "good" optimum that's not the best one (getting stuck in a local minimum, without taking steps big enough to jump out of it).
That's why there are adaptive optimization algorithms: algorithms such as Adam and RMSProp maintain a different effective learning rate for each weight in the network (every single learning rate starts from the same value). In this way, the optimization algorithm can work on every single parameter independently, with the aim of finding a better solution (and making the choice of the initial learning rate less critical).
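To make the "different learning rate for each weight" point concrete, here is a minimal NumPy sketch of an Adam-style update: every weight's effective step size is scaled by that weight's own gradient history (the hyperparameters are the usual published defaults; the toy objective is illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: every element of `w` gets its own effective step size."""
    m = beta1 * m + (1 - beta1) * grad            # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2         # running mean of squared gradients
    m_hat = m / (1 - beta1**t)                    # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-weight scaling by sqrt(v_hat)
    return w, m, v

# Toy usage: minimize ||w||^2, whose gradient is 2w.
w = np.array([1.0, -3.0, 0.5])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 1001):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t)
print(w)  # close to zero, even though the initial gradients had very different sizes
```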

Convnet suddenly drops in accuracy

I was trying to train an emotion recognition model on the fer2013 dataset, using the architecture proposed in this paper.
The paper uses a different dataset than mine, so I made some modifications to the stride and filter size.
After a couple of hours of training, accuracy on both the training and test sets suddenly drops.
After that, the accuracy just stays around 0.1-0.2 for both sets and never improves.
Does anybody know about this phenomenon?
In any neural network training, if both accuracies, i.e. training and validation, improve at first and then start decreasing, it is a sign that your network is failing to converge; more precisely, your optimizer has started overshooting.
The most likely reason for this is a learning rate that is too high. Reduce your learning rate and then check your example again. Also, in your linked paper (at least at first glance), I couldn't see the learning rate mentioned. Since your data are different from the paper's, the same learning rate might not work as well.
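If you happen to be using a Keras-style setup (an assumption; the question doesn't say), lowering the learning rate, and optionally cutting it further when the validation loss plateaus, might look like the sketch below; the tiny placeholder model and the specific values are illustrative only:

```python
import tensorflow as tf

# Tiny placeholder model; fer2013 images are 48x48 grayscale with 7 emotion classes.
# This is NOT the paper's architecture, just a stand-in so the snippet runs.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(48, 48, 1)),
    tf.keras.layers.Dense(7, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # start lower than the default 1e-3
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Halve the learning rate whenever the validation loss stops improving.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=[reduce_lr])
```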

Neural network: is it worth changing the learning rate and momentum over time?

Is it worth changing the learning rate after certain conditions are met? And how and why would you do it? For example, the net would start with a high learning rate, and once the squared error is low enough the learning rate would drop for better precision, or the learning rate could increase to jump out of local minima. Wouldn't that cause over-fitting? And what about momentum?
Usually you should start with a high learning rate and a low momentum. Then you decrease the learning rate over time and increase the momentum. The idea is to allow more exploration at the beginning of learning and to force convergence at the end. Usually you should look at the training error to set up your learning schedule: if it gets stuck, i.e. the error does not change, it is time to decrease your learning rate.
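A minimal PyTorch sketch of such a schedule, decreasing the learning rate and increasing the momentum whenever the training error stops improving; the toy network, data, factors, and patience are made-up illustrative values:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.5)  # high lr, low momentum to start

best_error, patience, stalled = float("inf"), 5, 0
x, y = torch.randn(256, 4), torch.randn(256, 1)   # toy data

for epoch in range(200):
    loss = nn.functional.mse_loss(net(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

    if loss.item() < best_error - 1e-4:            # did the training error still move?
        best_error, stalled = loss.item(), 0
    else:
        stalled += 1

    if stalled >= patience:                        # stuck: decay the lr, raise the momentum
        for group in opt.param_groups:
            group["lr"] = max(group["lr"] * 0.5, 1e-4)
            group["momentum"] = min(group["momentum"] + 0.1, 0.9)
        stalled = 0
```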
