This is a more or less general question. In my implementation of the backpropagation algorithm, I start from some "big" learning rate and then decrease it after I see the error starting to grow instead of shrinking.
I can apply this rate decrease either after the error has already grown a bit (StateA), or just before it is about to grow (StateB, a kind of rollback to the previous "successful" state).
So the question is: which is better from a mathematical point of view?
Or do I need to run two parallel tests, i.e. continue learning from StateA and from StateB, both with the reduced learning rate, and compare which one decreases the error faster?
BTW, I haven't tried the approach from the last paragraph; it only popped into my mind while writing this question. In my current implementation I continue learning from StateA with the decreased learning rate, on the assumption that the decrease is small enough that I can still head back toward the global minimum in case I have accidentally ended up in a local one.
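To make the control flow concrete, here is a toy single-weight sketch of the two variants (the model, data, and constants are made up; only the StateA/StateB branching matters):

```python
import numpy as np

# Toy setup: fit y = 2*x with a single weight, starting from a deliberately
# "big" learning rate so the error actually grows at first.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x

w = 0.0
lr = 4.0                       # "big" starting learning rate
prev_error = float("inf")
prev_w = w                     # last "successful" state (for the StateB variant)
use_rollback = False           # False -> StateA, True -> StateB

for epoch in range(50):
    grad = np.mean(2 * (w * x - y) * x)   # d(MSE)/dw
    w -= lr * grad
    error = np.mean((w * x - y) ** 2)
    if error > prev_error:                # error started to grow
        lr *= 0.5                         # shrink the learning rate
        if use_rollback:                  # StateB: restore the last good weights
            w = prev_w
        # StateA: just continue from the current (slightly worse) weights
    else:
        prev_w = w                        # remember the successful state
        prev_error = error
```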
What you describe is one of a family of techniques called Learning Rate Scheduling. Just so you know, there are more than two techniques:
Predetermined piecewise constant learning rate
Performance scheduling (looks like the closest one to yours)
Exponential scheduling
Power scheduling
...
The exact performance of each one depends greatly on the optimizer (SGD, Momentum, NAG, RMSProp, Adam, ...) and on the data manifold (i.e. the training data and objective function). But they have been studied in the context of deep learning problems. For example, I'd recommend this paper by Andrew Senior et al., which compared various techniques on a speech recognition task. The authors concluded that exponential scheduling performed the best. If you're interested in the math behind it, you should definitely take a look at their study.
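For illustration, here is a minimal sketch of two of these schedules in Python; the constants (initial rate, decay, shrink factor, tolerance) are arbitrary placeholders, not recommendations:

```python
import math

def exponential_lr(step, lr0=0.01, decay=1e-4):
    # Exponential scheduling: lr(t) = lr0 * exp(-decay * t)
    return lr0 * math.exp(-decay * step)

def performance_lr(current_lr, val_error, best_error, factor=0.5, tol=1e-4):
    # Performance scheduling (closest to what you describe): shrink the rate
    # whenever the validation error stops improving.
    if val_error > best_error - tol:
        return current_lr * factor
    return current_lr

print(exponential_lr(10_000))   # ~0.0037 after 10,000 steps
```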
I'm running a SAC reinforcement learner for a robotics application with some pretty decent results. One of the reasons I opted for reinforcement learning is the ability to learn in the field, e.g. to adjust to a mechanical change such as worn tires or a wheel going a little out of alignment.
My reinforcement learner restores its last saved weights and replay buffer upon startup, so it doesn't need to retrain every time I turn it on. However, one concern I have is with respect to the optimizer.
Optimizers have come a long way since ADAM, but everything I read and all the RL code samples I see still seem to use ADAM with a fixed learning rate. I'd like to take advantage of some of the advances in optimizers, e.g. one-cycle AdamW. However, a one-cycle optimizer seems inappropriate for a continuous real-world reinforcement learning problem: I imagine it's pretty good for the initial training/calibration, but I expect the low final learning rate would react too slowly to mechanical changes.
One thought I had was to use a one-cycle approach for initial training, and then trigger a smaller one-cycle restart whenever a change in error indicates something has changed (perhaps scaling the size of the restart to the size of the change in error).
Has anyone experimented with optimizers other than ADAM for reinforcement learning or have any suggestions for dealing with this sort of problem?
Reinforcement learning is very different from traditional supervised learning because the training data distribution changes as the policy improves. In optimization terms, the objective function can be said to be non-stationary. For this reason, I suspect your intuition is likely correct -- that a "one-cycle" optimizer would perform poorly after a while in your application.
My question is, what is wrong with Adam? Typically, the choice of optimizer is a minor detail for deep reinforcement learning; other factors like the exploration policy, algorithmic hyperparameters, or network architecture tend to have a much greater impact on performance.
Nevertheless, if you really want to try other optimizers, you could experiment with RMSProp, Adadelta, or Nesterov Momentum. However, my guess is that you will see incremental improvements, if any. Perhaps searching for better hyperparameters to use with Adam would be a more effective use of time.
EDIT: In my original answer, I claimed that the choice of a particular optimizer is not of primary importance for reinforcement learning speed, and neither is its generalization behavior. I want to add some discussion to help illustrate these points.
Consider how most deep policy gradient methods operate: they sample a trajectory of experience from the environment, estimate returns, and then conduct one or more gradient steps to improve the parameterized policy (e.g. a neural network). This process repeats until convergence (to a locally optimal policy).
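As a rough illustration (not any particular library's API), the loop looks something like this, here with a tabular softmax policy on a made-up two-state MDP and a REINFORCE-style update:

```python
import numpy as np

# Toy two-state, two-action MDP; the "policy network" is just a table of logits.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.99
logits = np.zeros((n_states, n_actions))

def step(s, a):
    # Hypothetical dynamics: action 1 in state 1 pays off, everything else is neutral.
    reward = 1.0 if (s == 1 and a == 1) else 0.0
    return rng.integers(n_states), reward

for iteration in range(200):
    # 1) Sample a trajectory with the *current* policy
    s, traj = 0, []
    for t in range(20):
        p = np.exp(logits[s]); p /= p.sum()
        a = rng.choice(n_actions, p=p)
        s_next, r = step(s, a)
        traj.append((s, a, r))
        s = s_next
    # 2) Estimate returns
    returns, G = [], 0.0
    for (_, _, r) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # 3) One gradient step on the policy parameters (REINFORCE-style)
    grad = np.zeros_like(logits)
    for (s_t, a_t, _), G_t in zip(traj, returns):
        p = np.exp(logits[s_t]); p /= p.sum()
        g = -p; g[a_t] += 1.0              # d log pi(a_t|s_t) / d logits[s_t]
        grad[s_t] += G_t * g
    logits += 0.01 * grad
    # ...then repeat: the freshly updated policy needs freshly sampled data.
```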
Why must we continuously sample new experience from the environment? Because our current data can only provide a reasonable first-order approximation within a small trust region around the policy parameters that were used to collect that data. Hence, whenever we update the policy, we need to sample more data.
A good way to visualize this is to consider an MM algorithm. At each iteration, a surrogate objective is constructed based on the data we have now and then maximized. Each time, we will get closer to the true optimum, but the speed at which we approach it is determined only by the number of surrogates we construct -- not by the specific optimizer we use to maximize each surrogate. Adam might maximize each surrogate in fewer gradient steps than, say, RMSProp does, but this does not affect the learning speed of the agent (with respect to environment samples). It just reduces the number of minibatch updates you need to conduct.
SAC is a little more complicated than this, as it learns Q-values in an off-policy manner and conducts updates using experience replay, but the general idea holds. The best attainable policy is subject to whatever the current data in our replay memory are; regardless of the optimizer we use, we will need to sample roughly the same amount of data from the environment to converge to the optimal policy.
So, how do you make a faster (more sample-efficient) policy gradient method? You need to fundamentally change the RL algorithm itself. For example, PPO almost always learns faster than TRPO, because John Schulman and co-authors found a different and empirically better way to generate policy gradient steps.
Finally, notice that there is no notion of generalization here. We have an objective function that we want to optimize, and once we do optimize it, we have solved the task as well as we can. This is why I suspect that the "Adam-generalizes-worse-than-SGD" issue is actually irrelevant for RL.
My initial testing suggests the details of the optimizer and its hyperparameters do matter, at least for off-policy techniques. I haven't had the chance to experiment much with PPO or on-policy techniques, so unfortunately I can't speak for those.
To speak to @Brett_Daley's thoughtful response a bit: the optimizer is certainly one of the less important factors. The means of exploration and the use of a good prioritized replay buffer are certainly critical, especially with respect to achieving good initial results. However, my testing seems to show that the optimizer becomes important for fine-tuning.
The off-policy methods I have been using have been problematic with fine-grained stability. In other words, the RL finds the mostly correct solution but never really homes in on the perfect solution (or, if it does find it briefly, it drifts off). I suspect the optimizer is at least partly to blame.
I did a bit of testing and found that varying the ADAM learning rate has an obvious effect. Too high and both the actor and critic bounce around the minimum and never converge on the optimal policy. In my robotics application this looks like the RL consistently makes sub-optimal decisions, as though there's a bit of random exploration with every action that always misses the mark a little bit.
OTOH, a lower learning rate tends to get stuck in sub-optimal solutions and is unable to adapt to changes (e.g. slower motor response due to low battery).
I haven't yet run any tests of a single-cycle schedule or AdamW, but I did a very basic test with a two-stage learning rate adjustment for both the actor and the critic (starting with a high rate and dropping to a low rate). The result was a clearly more precise solution: it converged quickly during the high-learning-rate stage and then homed in better once the rate dropped.
I imagine AdamW's improved weight-decay regularization may help in a similar way, by reducing the overfitting to individual training batches that contributes to missing the optimal solution.
Based on the improvement I saw, it's probably worth trying single-cycle methods and AdamW for the actor and critic networks for tuning the results. I still have some concerns for how the lower learning rate at the end of the cycle will adapt to changes in the environment, but a simple solution for that may be to monitor the loss and do a restart of the learning rate if it drifts too much. In any case, more testing seems warranted.
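For what it's worth, the two-stage schedule plus loss-monitored restart I'm describing amounts to something like the following sketch; the rates, drop point, and drift threshold are placeholders, not recommendations:

```python
def two_stage_lr(step, high=3e-4, low=3e-5, drop_at=50_000):
    # High rate for coarse convergence, then a low rate for fine-tuning
    # (applied to both the actor and critic optimizers in my tests).
    return high if step < drop_at else low

def maybe_restart(current_lr, recent_loss, baseline_loss, high=3e-4, ratio=2.0):
    # Crude drift detector: if the loss drifts well above its recent baseline,
    # jump the learning rate back up so the agent can re-adapt.
    return high if recent_loss > ratio * baseline_loss else current_lr
```

In PyTorch, the returned rate would be applied by assigning it to `optimizer.param_groups[0]["lr"]` for the actor and critic optimizers before each update.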
I've been studying reinforcement learning, and understand the concepts of value/policy iteration, TD(1)/TD(0)/TD(Lambda), and Q-learning. What I don't understand is why Q-learning can't be used for everything. Why do we need "deep" reinforcement learning as described in DeepMind's DQN paper?
Q-learning is a model-free reinforcement learning method first documented in 1989. It is “model-free” in the sense that the agent does not attempt to model its environment. It arrives at a policy based on a Q-table, which stores an estimate of the reward for taking any action from a given state. When the agent is in state s, it consults the Q-table entries for that state and picks the action with the highest associated reward. In order for the agent to arrive at an optimal policy, it must balance exploring all available actions in all states against exploiting what the Q-table says is the optimal action for a given state. If the agent always picks a random action, it will never arrive at an optimal policy; likewise, if the agent always chooses the action with the highest estimated reward, it may arrive at a sub-optimal policy, since certain state-action pairs may not have been explored thoroughly.
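In code, tabular Q-learning with epsilon-greedy exploration amounts to roughly the following (the environment here is a made-up stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))        # the Q-table

def step(s, a):
    # Hypothetical environment: only one state-action pair is rewarding.
    reward = 1.0 if (s == n_states - 1 and a == 0) else 0.0
    return rng.integers(n_states), reward

s = 0
for t in range(10_000):
    # Explore with probability epsilon, otherwise exploit the Q-table
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Standard Q-learning update toward the bootstrapped target
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next
```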
Given enough time, Q-learning can eventually find an optimal policy π for any finite Markov decision process (MDP). In the example of a simple game of Tic-Tac-Toe, the total number of distinct game states is less than 6,000. That might sound like a high number, but consider a simple video game in OpenAI's Gym known as “Lunar Lander”.
The goal is to use the lander’s thrusters to navigate it to land between the yellow flags, ensuring the lander’s inertia is slowed enough so as not to cause it to crash. The possible actions are: do nothing, use left thruster, use right thruster, use main center thruster. Using the main thruster incurs a small negative reward. Landing without crashing provides a large reward, and landing between the flags also provides a large reward. Crashing provides a large negative reward. The agent experiences state as a combination of the following parameters: the x and y coordinates of the lander, as well as its x and y velocity, rotation, angular velocity, and simple binary values for each leg to determine if it is touching the ground. Consider all the different possible states the agent could encounter from different combinations of these parameters; the state space of this MDP is enormous compared to tic-tac-toe. It would take an inordinate amount of time for the agent to experience enough episodes to reliably pilot the lander. The state space provided by the Lunar Lander environment is too large for traditional Q-learning to effectively solve in a reasonable amount of time, but with some adjustments (in the form of “deep” Q-learning) it is indeed possible for an agent to successfully navigate the environment on a regular basis within a reasonable amount of time.
As detailed in the DeepMind paper you linked to, Deep Q-learning is based on Tesauro's TD-Gammon approach, which approximates the value function from the information received as the agent interacts with the environment. One major difference is that, instead of updating the value function from each experience as it arrives, experiences are stored in a fixed-size memory and processed in sets, or batches: when the memory is full, the oldest experience is removed as the most recent one is pushed in. This use of stored batches is referred to as “experience replay.” It helps the algorithm explore the environment more efficiently because it attempts to prevent feedback loops, and it is also more effective because learning only from pairs of consecutive states can lead to inaccuracies due to how closely related those two states are.
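As a sketch, a standard replay memory looks roughly like this; note that the DQN paper stores individual transitions rather than whole episodes, so this is the transition-level variant:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall out automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # A random minibatch breaks up the correlation between consecutive states
        return random.sample(self.buffer, batch_size)
```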
TL;DR: When the state-action space is so large that regular Q-learning would take too long to converge, deep reinforcement learning may be a viable alternative because of its use of function approximation.
Q-learning uses a Q-table to store Q-values and selects an action for the current state based on the corresponding Q-values.
But this is not always feasible. When we have a large state space, the Q-table becomes very large, each estimated Q-value takes a long time to get updated, and most of them may be updated only a few times, so they remain inaccurate.
To tackle these kinds of problems, we use function approximators to learn general Q-values. Neural networks are good at function approximation, so DQN was proposed to take the state representation and estimate the Q-values. The network learns to predict the Q-values from the low-level features of the state, which helps with generalization.
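As a rough sketch (layer sizes here are placeholders), the idea is simply to replace the Q-table lookup with a network that maps a state vector to one Q-value per action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),       # Q(s, a) for every action at once
        )

    def forward(self, state):
        return self.net(state)

q = QNetwork()
state = torch.randn(1, 8)                   # e.g. the 8 Lunar Lander state features
action = q(state).argmax(dim=1)             # greedy action from the predicted Q-values
```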
When dealing with ill-conditioned neural networks, is the current state of the art to use an adaptive learning rate, some very sophisticated algorithm to deal with the problem, or to eliminate the ill conditioning by preprocessing/scaling of the data?
The problem can be illustrated with the simplest of scenarios: one input and one output where the function to be learned is y=x/1000, so a single weight whose value needs to be 0.001. One data point (0,0). It turns out to matter a great deal, if you are using gradient descent, whether the second data point is (1000,1) or (1,0.001).
Theoretical discussion of the problem, with expanded examples.
Example in TensorFlow
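To make the toy problem concrete, here is a minimal numerical sketch with plain gradient descent; the learning rate is arbitrary, and the point is only that the same rate behaves completely differently for the two datasets because the curvature scales with x²:

```python
def descend(x, y, lr, steps=20):
    # Gradient descent on the single-weight loss (w*x - y)^2, starting from w = 0.
    w = 0.0
    for _ in range(steps):
        grad = 2 * x * (w * x - y)      # d/dw of (w*x - y)^2
        w -= lr * grad
    return w

print(descend(1.0, 0.001, lr=0.1))      # converges toward w = 0.001
print(descend(1000.0, 1.0, lr=0.1))     # diverges: the curvature is 10^6 times larger
```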
Of course, straight gradient descent is not the only available algorithm. Other possibilities are discussed here; however, as that article observes, the alternative algorithms it lists that are good at handling ill-conditioning are not so good when it comes to handling a large number of weights.
Are newer algorithms available? Yes, but they aren't clearly advertised as solutions for this problem and are perhaps intended to solve a different set of problems; swapping Adagrad in place of GradientDescent does prevent overshoot, but it still converges very slowly.
At one time there were efforts to develop heuristics to adaptively tweak the learning rate, but then the learning rate hyperparameter is no longer just a number but a function, which is much harder to get right.
So these days, is the state of the art to use a more sophisticated algorithm to deal with ill-conditioning, or to just preprocess/scale the data to avoid the problem in the first place?
I am trying to write an adaptive controller for a control system, namely a power management system, using Q-learning. I recently implemented a toy RL problem for the cart-pole system and worked through the formulation of the helicopter control problem from Andrew Ng's notes. I appreciate how value function approximation is imperative in such situations. However, both of these popular examples have a very small number of possible discrete actions. I have three questions:
1) What is the correct way to handle such problems if you don't have a small number of discrete actions? The dimensionality of my actions and states seems to have blown up and the learning looks very poor, which brings me to my next question.
2) How do I measure the performance of my agent? Since the reward changes along with the dynamic environment, I can't pin down a performance metric for my continuous RL agent at every time-step. Also, unlike gridworld problems, I can't inspect the Q-value table because of the huge number of state-action pairs, so how do I know my actions are optimal?
3) I have a model for the evolution of the states through time: States = [Y, U], with Y[t+1] = a*Y[t] + b*A, where A is an action.
Choosing the discretization step for the action A also affects how finely I have to discretize my state variable Y. How do I choose my discretization steps?
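(To make the coupling concrete, a tiny sketch with placeholder coefficients: the smallest state change two adjacent actions can produce is roughly b times the action step, so the state grid has to be at least that fine.)

```python
import numpy as np

a, b = 0.9, 0.5                      # placeholder model coefficients
dA = 0.1                             # chosen action step
actions = np.arange(-1.0, 1.0 + dA, dA)

dY = b * dA                          # resolution the state grid needs to resolve
states = np.arange(-5.0, 5.0 + dY, dY)
print(len(actions), len(states))     # how quickly the table blows up
```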
Thanks a lot!
You may use a continuous-action reinforcement learning algorithm and completely avoid the discretization issue. I'd suggest you take a look at CACLA.
As for the performance, you need to measure your agent's accumulated reward during an episode with learning turned off. Since your environment is stochastic, take many measurements and average them.
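A minimal sketch of such an evaluation loop, with `agent` and `env` as placeholders for your own interfaces (the method names and the step() return values are assumptions, not a real API):

```python
def evaluate(agent, env, n_episodes=100):
    # Accumulated reward per episode with learning switched off, averaged over
    # many episodes because the environment is stochastic.
    total = 0.0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = agent.greedy_action(s)   # no exploration, no parameter updates
            s, r, done = env.step(a)
            total += r
    return total / n_episodes
```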
Have a look at policy search algorithms. Basically, they directly learn a parametric policy without an explicit value function, thus avoiding the problem of approximating the Q-function for continuous actions (e.g., no discretization of the action space is needed).
One of the easiest and earliest policy search algorithms is the policy gradient. Have a look here for a quick survey of the topic, and here for a survey of policy search (there are more recent techniques by now, but that's a very good starting point).
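For reference, the basic (REINFORCE-style) estimator these methods build on is roughly

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],$$

where $G_t$ is the return from time $t$. Everything is learned directly on the (possibly continuous) action distribution, so no discretization of the action space is involved.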
For control problems, there is a very simple toy task you can look at: the Linear Quadratic Gaussian regulator (LQG). Here you can find a lecture that includes this example, along with an introduction to policy search and policy gradients.
Regarding your second point, if your environment is dynamic (that is, the reward function or the transition function, or both, change through time), then you need to look at non-stationary policies. That's typically a much more challenging problem in RL.
I'm looking to analyze and compare the following 'signals':
(Edit: better renderings here: oscillations good and here: oscillations bad)
What you see are plots of neuron activations from a type of artificial neural network plotted against time. Each line in the plot is a neuron's activation over time which can have a value between -1 and 1.
In the first plot, the activities are stable and consistent, while the second exemplifies more chaotic activity (for want of a better term) -- some kind of destructive interference seems to occur every so often.
Anyhow, I would like to do some kind of 'clever' analysis, but since signal analysis is really not my strong point, I thought I'd ask for some advice here...
EDIT: Let me clarify a bit. Ultimately, I would like to characterize the data. This could, for example, involve pinpointing correlations between the individual signals contained in each plot. I would also like to measure 'regularity' or data invariance: in the above examples, the upper plot is more regular than the lower plot. I guess I could therefore compute the variance of each signal and take that as a measure, but I was wondering whether some more comprehensive signal-processing technique would be better suited (I'm not sure). In fact, I'm not even sure signal processing is what I really want now that I think about it. Perhaps some kind of wavelet or Fourier analysis...
For those interested, I am working on the computational modelling of worm locomotion.
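As a simple first pass on the variance/correlation idea mentioned above, something like this would do (the activation matrix here is fake data standing in for the real recordings):

```python
import numpy as np

# `activations` is assumed to have shape (n_neurons, n_timesteps), values in [-1, 1].
rng = np.random.default_rng(0)
activations = np.tanh(rng.standard_normal((5, 1000)).cumsum(axis=1) * 0.05)  # fake data

per_neuron_variance = activations.var(axis=1)     # crude per-signal regularity measure
correlation_matrix = np.corrcoef(activations)     # pairwise correlations between neurons

print(per_neuron_variance)
print(correlation_matrix.round(2))
```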
You should consult some good books on nonlinear time series analysis. For instance, a measure of the regularity of your signal could be the Lyapunov spectrum. Another possibility would be entropy. If you are interested in the correlation between signals, you could use transfer entropy or Granger causality, or, for neurons, it would be good to look at some measure of phase synchronization. The Bayesian methods could also be worth trying.
But -- most importantly -- you first need a proper question about what you really want to know. Once you've got that, it is far easier to pick the right tool.
And one final hint. Look for tools outside the engineering community. Their tools are mostly linear, but you are dealing with a highly nonlinear system. Wavelets, FFT and stuff are useful if you don't know anything about your signal and you want to have another perspective on it, but they are not suited for your kind of problem.