How to choose the reward function for the cart-pole inverted pendulum task - robotics

I am new to Python, and to programming in general. For months now I have been working on stabilising the inverted pendulum. I have everything working, but I am struggling to find the right reward function. So far, after research and a lot of trial and error, the best I have come up with is
R=(x_dot**2)+0.001*(x**2)+0.1*(theta**2)
But I don't reach stability, i.e. theta=0 held for long enough.
Does anyone have an idea of the logic behind an ideal reward function?
Thank you.

For just the balancing problem (not the swing-up), even a binary reward is enough. Something like
Always 0, then -1 when the pole falls. Or,
Always 1, then 0 when the pole falls.
Which one to use depends on the algorithm used, the discount factor and the episode horizon. Anyway, the task is easy and both will do their job.
For the swing-up task (harder than just balancing, as the pole starts upside down and you need to swing it up by moving the cart) it is better to have a reward that depends on the state. Usually, a simple cos(theta) is fine. You can also add a penalty on the angular velocity and on the action, in order to prefer smooth, slowly changing trajectories.
You can also add a penalty if the cart goes out of the boundaries of the x coordinate.
A reward including all these terms would look like this
reward = cos(theta) - 0.001*theta_d.^2 - 0.0001*action.^2 - 100*out_of_bound(x)
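In Python, the same reward could be sketched as below. This is only an illustration: the function names, the convention that theta = 0 means upright, and the track limit x_limit = 2.4 are assumptions, not something taken from a particular environment.

```python
import math

def out_of_bound(x, x_limit=2.4):
    """Return 1.0 if the cart is outside [-x_limit, x_limit], else 0.0.
    x_limit is a hypothetical track half-width."""
    return 1.0 if abs(x) > x_limit else 0.0

def swing_up_reward(theta, theta_d, action, x):
    """Reward for the swing-up task; theta = 0 is assumed to be upright."""
    return (math.cos(theta)             # +1 upright, -1 hanging down
            - 0.001 * theta_d ** 2      # penalise fast pole rotation
            - 0.0001 * action ** 2      # penalise large control effort
            - 100.0 * out_of_bound(x))  # heavy penalty for leaving the track
```

The weights are the ones from the formula above; in practice they usually need tuning for the scale of theta_d and action in your simulator.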

I am working on inverted pendulum too.
I found the following reward function which I am trying.
# angle_normalise wraps the angle into the range -pi to pi
costs = angle_normalise(th)**2 + .1*thdot**2 + .001*(action**2)
reward = -costs
But I still have a problem choosing the actions; maybe we can discuss.

Related

How does DQN work in an environment where reward is always -1

Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as a reward (even when goal is achieved), I don't understand how DQN with experience-replay converges, yet I know it does, because I have working code that proves it. By working, I mean that when I train the agent, the agent quickly (within 300-500 episodes) learns how to solve the mountaincar problem. Below is an example from my trained agent.
It is my understanding that ultimately there needs to be a "sparse reward" that is found. Yet as far as I can see from the openAI Gym code, there is never any reward other than -1. It feels more like a "no reward" environment.
What almost answers my question, but in fact does not: when the task is completed quickly, the return (sum of rewards) of the episode is larger. So if the car never finds the flag, the return is -1000. If the car finds the flag quickly the return might be -200. The reason this does not answer my question is because with DQN and experience replay, those returns (-1000, -200) are never present in the experience replay memory. All the memory has are tuples of the form (state, action, reward, next_state), and of course remember that tuples are pulled from memory at random, not episode-by-episode.
Another element of this particular OpenAI Gym environment is that the Done state is returned on either of two occasions: hitting the flag (yay) or timing out after some number of steps (boo). However, the agent treats both the same, accepting the reward of -1. Thus as far as the tuples in memory are concerned, both events look identical from a reward standpoint.
So, I don't see anything in the memory that indicates that the episode was performed well.
And thus, I have no idea why this DQN code is working for MountainCar.
The reason this works is because in Q-learning, your model is trying to estimate the SUM (technically the time-decayed sum) of all future rewards for each possible action. In MountainCar you get a reward of -1 every step until you win, so if you do manage to win, you’ll end up getting less negative reward than usual. For example, your total score after winning might be -160 instead of -200, so your model will start predicting higher Q-values for actions that have historically led to winning the game.
You are right. There is no direct association between the memory (experience replay) and the episode return. The Q-value in DQN predicts each action's expected return at each step. What the model is trained on is the TD error: the difference between its current prediction and the target built from the observed reward plus the estimated value of the next state.
Using -1 for non-goal steps is a trick that pushes RL models towards actions that finish the episode sooner, because the Q-value is an action value. At each step, the model predicts the return of every possible move and the policy (usually greedy or epsilon-greedy) chooses the action with the highest value. Imagine that at some moment going back would take 200 steps to finish the episode while going forward takes only 100. The Q-values would be -200 and -100 respectively (without discount). You might wonder how the model knows the value of each action: it learns it over repeated episodes of trial and error, being trained to minimise the TD error.
With uniform experience replay, all experiences are sampled and deleted uniformly. With prioritized experience replay, experiences with a high estimated error are replayed more often. Usually, the priorities are proportional to the magnitude of the TD error, i.e. the gap between the target Q-value and the current model's predicted Q-value. A larger priority means a more surprising experience, and replaying those more often helps accelerate training.
You can check the idea in Prioritized Experience Replay, Schaul et al., 2016
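As a rough illustration of proportional prioritisation, the sketch below samples replay indices with probability proportional to (|TD error| + eps)^alpha. The function names and the values of alpha and eps are assumptions chosen for the example, and the importance-sampling correction used in the paper is left out.

```python
import random

def priorities(td_errors, alpha=0.6, eps=1e-6):
    """Turn TD errors into sampling priorities (larger error, larger priority)."""
    return [(abs(d) + eps) ** alpha for d in td_errors]

def sample_indices(td_errors, batch_size, rng=random):
    """Sample replay-buffer indices in proportion to their priority."""
    weights = priorities(td_errors)
    return rng.choices(range(len(td_errors)), weights=weights, k=batch_size)

# Two stored transitions: one unsurprising (error 0), one very surprising.
rng = random.Random(0)
batch = sample_indices([0.0, 10.0], 1000, rng)
```

Here almost every sampled index is 1, the transition with the large TD error.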
It might help to look at a reduced problem. Consider:
States:
───┬───┬───┬───┬───┐
...│L2 │L1 │ S │ R1│
───┴───┴───┴───┴───┘
Actions:
left or right
r = -1 for all states
episode terminates upon reaching R1 or after 2 steps
Memory:
(S, left, -1, L1) non-terminal
(S, right, -1, R1) terminal
(L1, left, -1, L2) terminal
(L1, right, -1, S) terminal
The obvious but important thing to note is that although the rewards are all identical, the states and actions are not. This information allows us to reason about the next state given the current one.
Let's look at the targets that we are updating towards (with no discount):
Q(S, left ) --> -1 + max{a}Q(L1, a)
Q(S, right) --> -1
Q(L1, left ) --> -1
Q(L1, right) --> -1
In this contrived example, only transition 1 presents an extra source of instability. But over time, as action value on L1 converges from having sampled enough of transitions 3 and 4, so should the target on transition 1.
At that point, when we encounter transition 1 again, we would have a better estimate Q(S, left) --> -1 + -1.
It is not enough to look only at the rewards when asking how DQN learns from its memory, since it also uses the next observation to determine its current best estimate of the action value at the next step (relevant code), effectively linking everything together and slowly tallying up the rewards. It does so, however, in a much less stable manner than traditional Q-learning.
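The reduced problem is small enough to run. The sketch below replays the four stored transitions in random order with a tabular Q-learning update (no discount); the learning rate, the number of updates and the seed are arbitrary choices for illustration.

```python
import random

# (state, action, reward, next_state, done) as listed in the memory above
memory = [
    ("S", "left", -1.0, "L1", False),
    ("S", "right", -1.0, "R1", True),
    ("L1", "left", -1.0, "L2", True),
    ("L1", "right", -1.0, "S", True),
]
ACTIONS = ("left", "right")

def train(memory, steps=2000, alpha=0.1, gamma=1.0, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in ("S", "L1") for a in ACTIONS}
    for _ in range(steps):
        s, a, r, s_next, done = rng.choice(memory)  # random replay
        # Terminal transitions contribute no future value.
        future = 0.0 if done else max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * future - q[(s, a)])
    return q

q = train(memory)
```

Even though every stored reward is -1, Q(S, right) settles at about -1 and Q(S, left) at about -2, so the greedy action in S is to go right.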
As an exercise, consider extending this further and put the terminal state on R2.
Then we can easily see that now max{a}Q(S, a) = -2; it takes the same amount of time to reach R2 as to simply time out, so it doesn't matter what we do (unless we start closer to R2). However, bump the number of timeout steps up and the agent should head towards R2 again. In other words, the mountain car also works because the timeout is set to a number larger than the number of time steps it takes to reach the goal. This path should eventually propagate its (negative but better) values back to our initial state.
Although this is true for any other environment, they can still at least learn to get closer to the goal before timing out provided they are rewarded for the effort. This is not so given the reward design in the mountain car environment.
In DQN you learn the Q-function, which basically approximates your return. In the memory, you store tuples with (s,a,s',r) and re-train your Q-function on those tuples. If, for a given tuple, you performed well (you reached the flag quickly) then you are going to re-experience it by re-using the tuple for training, because the Q-function is higher for that tuple.
Anyway, usually experience replay works better for any problem, not just for the mountain car.
You say:
those returns (-1000, -200) are never present in the experience replay memory.
What is present in the replay memory is a SARdS tuple with a done flag that tells you that the episode is finished. See the openai gym deepq example:
# Store transition in the replay buffer.
replay_buffer.add(obs, action, rew, new_obs, float(done))
In the final update, if float(done) == 1 then the future Q value is ignored. Hence at the end of the episode, the Q value is 0. If that happens on step 200, then the total return in the episode will be -200. If it happens on step 1000, then the total return will be -1000.
To put it a different way - if completely at random, the episode ended on step 200, an agent that had been performing as badly as a random policy would have a Q value (expected future return) of -800. If the episode then ended, the TD error would be +799 representing the positive surprise of only losing another -1.
I note that the code you linked to does not seem to use the done flag in the replay buffer. Instead it relies on a final state of s_ == None to signify the end of the episode. Take that code out, and the agent won't learn.
if s_ is None:
    t[a] = r
else:
    t[a] = r + GAMMA * numpy.amax(p_[i])
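The same rule written with an explicit done flag might look like the sketch below. The function name and the default gamma are assumptions; the last lines reproduce the +799 surprise from the example above.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target; future value is dropped on terminal steps."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# An agent expecting -800 more reward sees the episode end after one more -1.
predicted_q = -800.0
td_error = td_target(-1.0, [-799.0, -799.0], done=True) - predicted_q
```

Without the terminal branch the target would stay close to the prediction (-1 + 0.99 * -799), and that large positive TD error would never appear.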

What's the best objective function for the CartPole task?

I'm doing policy gradient and I'm trying to figure out what the best objective function is for the task. The task is the open ai CartPole-v0 environment in which the agent receives a reward of 1 for each timestep it survives and a reward of 0 upon termination. I'm trying to figure out which is the best way to model the objective function. I've come up with 3 possible functions:
def total_reward_objective_function(self, episode_data):
    # Undiscounted sum of all rewards in the episode.
    return sum(timestep_data['reward'] for timestep_data in episode_data)

def average_reward_objective_function(self, episode_data):
    return self.total_reward_objective_function(episode_data) / len(episode_data)

def sum_of_discounted_rewards_objective_function(self, episode_data, discount_rate=0.7):
    # The reward at timestep t is weighted by discount_rate ** t.
    return sum(timestep_data['reward'] * pow(discount_rate, timestep)
               for timestep, timestep_data in enumerate(episode_data))
Note that the average reward objective function will always return 1 unless I intervene and modify the reward function to return a negative value upon termination. The reason I'm asking rather than just running a few experiments is that there are errors elsewhere. So if someone could point me towards good practice in this area, I could focus on the more significant mistakes in the algorithm.
You should use the last one (sum of discounted rewards), since the cart-pole problem is an infinite horizon MDP (you want to balance the pole as long as you can). The answer to this question explains why you should use a discount factor in infinite horizon MDPs.
The first one, instead, is just an undiscounted sum of the rewards, which could be used if episodes have a fixed length (for instance, in the case of a robot performing a 10 seconds trajectory). The second one is usually used in finite horizon MDPs, but I am not very familiar with it.
For the cart-pole, a discount factor of 0.9 should work (or, depending on the algorithm used, you can search for scientific papers and see the discount factor used).
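To see what the discount does with the +1-per-step reward, here is a small sketch (the function name is made up for the example). With a discount factor of 0.9 the return is bounded by 1 / (1 - 0.9) = 10, however long the pole stays up.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards through the episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

short_episode = discounted_return([1.0] * 20, 0.9)
long_episode = discounted_return([1.0] * 200, 0.9)
```

Longer episodes still score higher, but the difference shrinks geometrically, which is what keeps the objective finite in the infinite-horizon setting.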
A final note. The reward function you described (+1 at each timestep) is not the only one used in literature. A common one (and I think also the "original" one) gives 0 at each timestep and -1 if the pole falls. Other reward functions are related to the angle between the pole and the cart.

Q Learning Technique for not falling in fires

Please take a look at the picture below:
My objective is for the agent to rotate and move around the environment without falling into the fire holes. My thinking is as follows:
Do for 1000 episodes:
    An episode:
        start to traverse the environment;
        if the agent falls into a hole, go back to the starting position!
I have read somewhere that the goal is the end point of an episode. So if we say the goal is not to fall into the fire, then the opposite of the goal (i.e. ending up in a fire hole) would be the end point of an episode. What would you suggest for setting the goal?
Another question: why should I set up the reward matrix? I have read that Q-learning is model-free! I know that in Q-learning we set up the goal and not the way to achieve it (in contrast to supervised learning).
Lots of research has been directed to reward functions. Crafting a reward function to produce desired behavior can be non-intuitive. As Don Reba commented, simply staying still (as long as you don't begin in a fire state!) is an entirely reasonable approach for avoiding fire. But that's probably not what you want.
One way to spur activity (and not camp in a particular state) is to penalize the agent for each timestep experienced in a non-goal state. In this case, you might assign a -1 reward for each timestep spent in a non-goal state, and a zero reward for the goal state.
Why not a +1 for goal? You might code a solution that works with a +1 reward but consider this: if the goal state is +1, then the agent can compensate for any number of poor, non-optimal choices by simply parking in the goal state until the reward becomes positive.
A goal state of zero forces the agent to find the quickest path to the goal (which I assume is desired). The only way to maximize reward (or minimize negative reward) is to find the goal as quickly as possible.
And the fire? Assign a reward of -100 (or -1,000 or -1,000,000 - whatever suits your aims) for landing in fire. The combination of +0 for goal, -1 for non-goal, and -100 for fire should provide a reward function that yields the desired control policy.
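A minimal sketch of that reward function, assuming a grid whose cells are labelled with hypothetical markers "G" (goal), "F" (fire) and "." (empty):

```python
GOAL, FIRE, EMPTY = "G", "F", "."

def reward(cell):
    """+0 for the goal, -100 for fire, -1 for every other cell."""
    if cell == FIRE:
        return -100.0
    if cell == GOAL:
        return 0.0
    return -1.0

def episode_return(cells_visited):
    """Undiscounted return for the sequence of cells the agent entered."""
    return sum(reward(c) for c in cells_visited)

direct_path = episode_return([EMPTY, EMPTY, GOAL])
wandering_path = episode_return([EMPTY] * 6 + [GOAL])
through_fire = episode_return([EMPTY, FIRE])
```

The direct path scores -2, the wandering one -6 and the one through fire -101, so maximising the return favours reaching the goal quickly while staying out of the fire.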
Footnote: Google "negative bounded Markov Decision Processes (MDPs)" for more information on these reward functions and the policies they can produce.

Quadcopter PID controller

Context
My task is to design and build a velocity PID controller for a micro quadcopter that flies indoors. The room where the quadcopter is flying is equipped with a camera-based high accuracy indoor tracking system that can provide both velocity and position data for the quadcopter. The resulting system should be able to accept a target velocity for each axis (x, y, z) and drive the quadcopter with that velocity.
The control inputs for the quadcopter are roll/pitch/yaw angles and thrust percentage for altitude.
My idea is to implement a PID controller for each axis where the SP is the desired velocity in that direction, the measured value is the velocity provided by the tracking system and the output value is the roll/pitch/yaw angle and respectively thrust percent.
Unfortunately, because this is my first contact with control theory, I am not sure if I am heading in the right direction.
Questions
I understand the basic PID controller principle, but it is still unclear to me how it can convert velocity (m/s) into roll/pitch/yaw (radians) by only summing errors and multiplying by a constant. Yes, velocity and roll/pitch are directly proportional, so does that mean that multiplying by the right constant yields a correct result?
For the case of the vertical velocity controller, if the velocity is set to 0, the quadcopter should actually just maintain its altitude without ascending or descending. How can this be integrated with the PID controller so that the thrust value is not 0 (it should keep hovering, not fall) when the error is actually 0? Should I add a constant term to the output?
Once the system has been implemented, what would be a good approach to tuning the PID gain parameters? Manual trial and error?
The next step in the development of the system is an additional layer of position PID controllers that take a desired position (x,y,z) as the setpoint, use the positions measured by the indoor tracking system, and output x/y/z velocities. Is this a good approach?
The reason for the separation of these PID control layers is that the project is part of a larger framework that promotes re-usability. Would it be better to just use a single layer of PID controllers that directly take position coordinates as setpoints and output roll/pitch/yaw/thrust values?
That's the basic idea. You could theoretically wind up with several constants in the system to convert from one unit to another. In your case, your P, I, D constants will implicitly include the right conversion factors. For example, if in an idealized system you would want to control the angle by feedback with P=0.5 in degrees, but all you have access to is speed, and you know that 5 degrees of tilt gives 1 m/s, then you would program the P controller to multiply the measured speed by 0.1 and the result would be the control input for the angle.
I think the best way to do that is to implement a SISO PID controller not on height (h) but on vertical speed (dh/dt) with thrust percentage as the output, but I am not sure about that.
Unfortunately, if you have no advance knowledge about the parameters, that will be the only way to go. I recommend an environment that won't do much damage if the quadcopter crashes...
That sounds like a good way to go
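For the vertical axis, one possible shape of such a controller is sketched below: a plain PID on vertical speed with a constant bias added to the output, so that zero error still commands hover thrust. The class, the gains and the 50 % hover thrust are assumptions for illustration, not measured values.

```python
class PID:
    """Textbook PID controller with an optional constant output bias."""

    def __init__(self, kp, ki, kd, bias=0.0):
        self.kp, self.ki, self.kd, self.bias = kp, ki, kd, bias
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        self.integral += error * dt
        # No derivative on the first call, since there is no previous error.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return (self.bias + self.kp * error
                + self.ki * self.integral + self.kd * derivative)

# Hypothetical vertical-speed loop: output is thrust in percent.
vz_pid = PID(kp=20.0, ki=5.0, kd=0.0, bias=50.0)
thrust_at_hover = vz_pid.update(setpoint=0.0, measurement=0.0, dt=0.01)
```

The gains also absorb the unit conversion (m/s of error into percent of thrust). Without the bias, the integral term would eventually build up to the hover thrust on its own, only more slowly.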
My blog may be of interest to you: http://oskit.se/quadcopter-control-and-how-to-really-implement-it/
The basic principle is easy to understand. Tuning is much harder. What took me the most time was figuring out what to feed into the PID controller, and small things like the + and - signs on variables. But eventually I solved these problems.
The best way to test a PID controller is to have some way to set your values dynamically, so that you don't have to recompile your firmware every time. I use mavlink in the quadcopter firmware that I work on. With mavlink the copter can easily be configured using any ground control software that supports the mavlink protocol.
Start small: build a working copter using a ready-made controller, then write your own. Then test it, tune it and experiment. You will never truly understand PID controllers unless you try building a copter yourself. My bare metal software development library can make it easier to write your software without having to worry about hardware driver implementation. Link: https://github.com/mkschreder/martink
If you are using Python, you can use the ddcontrol framework.
You can estimate a transfer function from SISO data using the tfest function, and then optimize a PID controller using the pidopt function.

What are some relevant A.I. techniques for programming a flock of entities?

Specifically, I am talking about programming for this contest: http://www.nodewar.com/about
The contest involves you facing other teams with a swarm of spaceships in a two dimensional world. The ships are constrained by a boundary (if they exit they die), and they must continually avoid moons (who pull the ships in with their gravity). The goal is to kill the opposing queen.
I have attempted to program a few, relatively basic techniques, but I feel as though I am missing something fundamental.
For example, I implemented some of the boids behaviours (http://www.red3d.com/cwr/boids/), but they seemed to lack a... goal, so to speak.
Are there any common techniques (or, preferably, combinations of techniques) for this sort of game?
EDIT
I would just like to open this up again with a bounty since I feel like I'm still missing critical pieces of information. The following is my NodeWar code:
boundary_field = (o, position) ->
  distance = (o.game.moon_field - o.lib.vec.len(o.lib.vec.diff(position, o.game.center)))
  return distance
moon_field = (o, position) ->
  return o.lib.vec.len(o.lib.vec.diff(position, o.moons[0].pos))
ai.step = (o) ->
  torque = 0;
  thrust = 0;
  label = null;
  fields = [boundary_field, moon_field]
  # Total the potential fields and determine a target.
  target = [0, 0]
  score = -1000000
  step = 1
  square = 1
  for x in [(-square + o.me.pos[0])..(square + o.me.pos[0])] by step
    for y in [(-square + o.me.pos[1])..(square + o.me.pos[1])] by step
      position = [x, y]
      continue if o.lib.vec.len(position) > o.game.moon_field
      value = (fields.map (f) -> f(o, position)).reduce (t, s) -> t + s
      target = position if value > score
      score = value if value > score
  label = target
  { torque, thrust } = o.lib.targeting.simpleTarget(o.me, target)
  return { torque, thrust, label }
I may have implemented potential fields incorrectly, however, since all the examples I could find are about discrete movements (whereas NodeWar is continuous and not exact).
The main problem is that my A.I. never stays within the game area for more than 10 seconds without flying off-screen or crashing into a moon.
You can easily make the boids algorithm play the game of Nodewar, by just adding additional steering behaviours to your boids and/or modifying the default ones. For example, you would add a steering behaviour to avoid the moon, and a steering behaviour for enemy ships (repulsion or attraction, depending on the position between yours and the enemy ship). The weights of the attraction/repulsion forces should then be tweaked (possibly by genetic algorithms, playing different configurations against each other).
I believe this approach would give you already a relatively strong baseline player, to which you can start adding collaborative strategies etc.
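As a sketch of what adding weighted steering behaviours could mean, the code below combines a repulsion from a moon with an attraction towards an enemy ship. It is written in Python rather than the game's CoffeeScript, and the function names, the inverse-square repulsion and the weights are all assumptions to be tuned.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def _unit(v):
    n = math.hypot(v[0], v[1])
    return (0.0, 0.0) if n == 0 else (v[0] / n, v[1] / n)

def avoid(pos, obstacle):
    """Steer directly away from the obstacle, more strongly when close."""
    away = _sub(pos, obstacle)
    dist = math.hypot(away[0], away[1])
    u = _unit(away)
    scale = 1.0 / max(dist, 1e-6) ** 2
    return (u[0] * scale, u[1] * scale)

def seek(pos, target):
    """Unit-length steering vector pointing at the target."""
    return _unit(_sub(target, pos))

def combined_steering(pos, moon, enemy, w_avoid=2.0, w_seek=1.0):
    """Weighted sum of the individual steering behaviours."""
    a, s = avoid(pos, moon), seek(pos, enemy)
    return (w_avoid * a[0] + w_seek * s[0], w_avoid * a[1] + w_seek * s[1])

steer = combined_steering((0.0, 0.0), (1.0, 0.0), (0.0, 5.0))
```

With the moon one unit to the right and the enemy straight ahead, the result points up and to the left. The weights w_avoid and w_seek are the kind of parameters the answer suggests tweaking, for example with a genetic algorithm.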
You can achieve the best possible flocking behaviour using control theory. We can start by expressing the state of a ship (a vector of positions and velocities) as variable X. We will use x to represent a small deviation from some state X0. For each node, linearize around the state you want the individual ships to be at:
d/dt(X) = f(X - X0)
where X0 is the state you want the ship to be at. f() can be nonlinear (as in the case of your potential field). Close to this point, the deviation x will obey
d/dt(x) = Ax
A is a fixed matrix which you should tweak to achieve stability. Much of the field of control theory tells you how to do this. It's very important to you that A leads to a stable system and that it converges fast. You should now break out MATLAB or GNU Octave. You should also read up on what poles mean in control theory, and Laplace transforms. The poles of the system (eigenvalues of matrix A) will characterize the response of the ship to disturbances of any kind, its stability, and its convergence speed. They will also tell you whether the ship will move towards X0 or oscillate around X0.
There are several ways of choosing A (you probably don't get full control over A because of the game's limited physics) without having to do much analysis yourself. Two such approaches are optimal control (you define the cost of controlling the system vs the cost of deviating from X0) and pole placement (you choose where the poles will go).
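As a toy illustration, assume one axis of a ship behaves like a unit-mass double integrator with feedback thrust u = -kp*x - kd*v (an assumption about the physics, not something taken from the game). Then A = [[0, 1], [-kp, -kd]] and its eigenvalues are the roots of s^2 + kd*s + kp = 0, which a few lines of Python can check:

```python
import cmath

def closed_loop_poles(kp, kd):
    """Eigenvalues of A = [[0, 1], [-kp, -kd]]."""
    disc = cmath.sqrt(kd * kd - 4.0 * kp)
    return ((-kd + disc) / 2.0, (-kd - disc) / 2.0)

def is_stable(poles):
    """Stable when every pole has a negative real part."""
    return all(p.real < 0 for p in poles)

def oscillates(poles):
    """A nonzero imaginary part means the ship rings around X0."""
    return any(abs(p.imag) > 1e-12 for p in poles)

def gains_for_poles(p1, p2):
    """Pole placement for this 2x2 case: match (s - p1)(s - p2)."""
    return (p1 * p2, -(p1 + p2))  # (kp, kd)

kp, kd = gains_for_poles(-1.0, -2.0)
poles = closed_loop_poles(kp, kd)
```

Placing both poles on the negative real axis gives a response that settles without oscillating; gains such as kp=4, kd=2 are still stable but produce complex poles, so the ship overshoots and rings around X0.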
The state space theory also applies to an entire flock of ships, deviating from the configuration desired by every ship at that point in time. If the system is nonlinear, you might not be able to guarantee that all configurations are stable.
Conclusion:
Use control theory to guarantee individual stability, calculating X0 based on the ships around you, and add a potential field to prevent ships from bumping into each other and into moons.
Disclaimer:
There are several ways of expressing a state space system. This is the simplest but you'll need to express it differently when considering the open loop system (split up A into another 'A' which gives you the physical system and some other matrices which give you the control parameters)

Resources