Q-learning: How to include a terminal state in the update rule?

I use Q-learning to determine the optimal path of an agent. I know in advance that my path is composed of exactly 3 states (so after 3 states I reach a terminal state). I would like to know how to include that in the update rule of the Q-function.
What I am doing currently:
for t = 1:Nb_Epochs-1
    % epsilon-greedy action selection
    if rand(1) < Epsilon
        a = randi(size(QIter, 2));                    % an action 'a' is chosen at random
    else
        [Maxi, a] = max(QIter(CurrentState, :, t));   % greedy action
    end
    NextState = FindNextState(CurrentState, a);
    % Q-learning update
    QIter(CurrentState, a, t+1) = (1 - LearningRate)*QIter(CurrentState, a, t) ...
        + LearningRate*(Reward(NextState) + Discount*max(QIter(NextState, :, t)));
    CurrentState = NextState;
end
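For reference, a common convention is to drop the discounted max-over-next-actions term whenever the next state is terminal, so the update target becomes just the immediate reward. Below is a minimal Python sketch of one tabular update under that convention; the array shapes and parameter names are illustrative, not taken from the code above.

import numpy as np

def q_update(Q, s, a, r, s_next, terminal, lr=0.1, gamma=0.9):
    # Q is an (n_states, n_actions) array of action values.
    # If s_next is terminal, there is no future return to bootstrap from.
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - lr) * Q[s, a] + lr * target
    return Q

In the MATLAB loop above, the analogous change would be to omit the Discount*max(...) term on the step that reaches the third (terminal) state.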

Related

Understanding State Groups in Context in Drake

I've created a diagram with an LQR controller, a MultibodyPlant, a SceneGraph, and a PlanarSceneGraphVisualizer.
While trying to run this simulation, I set the random initial conditions using the function: context.SetDiscreteState(randInitState). However, with this, I get the following error:
RuntimeError: Context::SetDiscreteState(): expected exactly 1 discrete state group but there were 2 groups. Use the other signature if you have multiple groups.
And indeed, when I check the number of groups using context.num_discrete_state_groups(), it returns 2. So I then have to specify the group index while setting the state, using the command context.SetDiscreteState(0, randInitState). This works, but I don't exactly know why. I understand that I have to select the correct group to set the state for, but what exactly is a group here? In the cartpole example given here, the context was set using context.SetContinuousState(UprightState() + 0.1 * np.random.randn(4,)) without specifying any group(s).
Are groups only valid for discrete systems? The context documentation talks about groups but doesn't define them.
Is there a place to find the definition of what a group is when setting up a Drake simulation with multiple systems inside a diagram, and how do I check the group index of a system?
We would typically recommend that you use a workflow that sets the context using a subsystem interface. E.g.
plant_context = plant.GetMyMutableContextFromRoot(context)
plant_context.SetContinuousState(...)
Figuring out the discrete index of a state group for a DiagramContext might be possible, but it's certainly not typical.
You might find it helpful to print the context. In pydrake, you can actually just call print(context), and you will see the different elements and where they are coming from.
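If it helps, here is a minimal sketch of that workflow in pydrake, assuming diagram and plant have already been built (e.g. with DiagramBuilder and AddMultibodyPlantSceneGraph). The initial state reuses the cartpole-style expression from the question, so UprightState() and the 4-element state vector are assumptions here, not part of the recommended API.

import numpy as np

# Root context for the whole diagram.
context = diagram.CreateDefaultContext()

# Ask the plant for its own sub-context inside the root context, then set the
# state through that interface instead of guessing discrete group indices.
plant_context = plant.GetMyMutableContextFromRoot(context)
plant_context.SetContinuousState(UprightState() + 0.1 * np.random.randn(4,))

# Printing the root context lists each subsystem's state (and its groups)
# along with which subsystem it belongs to.
print(context)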

How do you set up input ports in custom plants in Drake?

I am currently working to implement a nonlinear system in Drake, so I've been setting up a LeafSystem as my plant. Following this tutorial, I was able to get a passive version of my system functioning properly (no variable control input). But the tutorial doesn't cover defining inputs in these custom systems, and neither do any examples I was able to find. Could someone help me with a few questions I have listed below?
As shown in the code below, I simply guessed at the DeclareVectorInputPort function, but I feel like there should be a third callback similar to CopyStateOut. Should I add a callback that just sets the input port to the scalar phi (my desired voltage input)?
When I call CalcTimeDerivatives, how should I access phi? Should it be passed as another argument or should the aforementioned callback be used to set phi as some callable member of DEAContinuousSys?
At the end of CalcTimeDerivatives, I also feel like there should be a more efficient way to set all derivatives (both velocity and acceleration) to a single vector rather than individually assigning each index of derivatives, but I wasn't able to figure it out. Tips?
from pydrake.systems.framework import BasicVector, LeafSystem

class DEAContinuousSys(LeafSystem):
    def __init__(self):
        LeafSystem.__init__(self)

        self.DeclareContinuousState(2)  # two state variables: lam, lam_d

        self.DeclareVectorOutputPort('Lam', BasicVector(2), self.CopyStateOut)  # two outputs: lam_d, lam_dd

        self.DeclareVectorInputPort('phi', BasicVector(1))  # help: guessed at this one, do I need a third argument?

    # TODO: add input instead of constant phi
    def DoCalcTimeDerivatives(self, context, derivatives):
        Lam = context.get_continuous_state_vector()  # get state
        Lam = Lam.get_value()  # cast as regular array

        Lam_d = DEA.dynamics(Lam, None, phi)  # derive acceleration (no timestep required) TODO: CONNECT VOLTAGE INPUT

        derivatives.get_mutable_vector().SetAtIndex(0, Lam_d[0])  # set velocity
        derivatives.get_mutable_vector().SetAtIndex(1, Lam_d[1])  # set acceleration

    def CopyStateOut(self, context, output):
        Lam = context.get_continuous_state_vector().CopyToVector()
        output.SetFromVector(Lam)

continuous_sys = DEAContinuousSys()
You're off to a good start! Normally input ports are wired into some other system's output port so there is no need for them to have their own "Calc" method. However, you can also set them to fixed values, using port.FixValue(context, value) (the port object is returned by DeclareVectorInputPort()). Here is a link to the C++ documentation. I'm not certain of the Python equivalent but it should be very similar.
Hopefully someone else can point you to a Python example if the C++ documentation is insufficient.
You might take a look at the quadrotor2d example from the underactuated notes:
https://github.com/RussTedrake/underactuated/blob/master/underactuated/quadrotor2d.py#L44
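In case a concrete pattern helps, below is a small self-contained sketch (not the asker's DEA system; the dynamics are placeholders) showing how a declared vector input port can be evaluated inside DoCalcTimeDerivatives, how all derivatives can be set at once with SetFromVector, and how the port can be fixed to a constant when nothing is wired to it.

import numpy as np
from pydrake.systems.framework import BasicVector, LeafSystem

class SketchSys(LeafSystem):
    def __init__(self):
        LeafSystem.__init__(self)
        self.DeclareContinuousState(2)  # state: [x, x_dot]
        # Keep a handle to the port so it can be evaluated (or fixed) later.
        self.phi_port = self.DeclareVectorInputPort('phi', BasicVector(1))

    def DoCalcTimeDerivatives(self, context, derivatives):
        x = context.get_continuous_state_vector().CopyToVector()
        phi = self.phi_port.Eval(context)[0]  # input value at this context
        # Placeholder dynamics: x_ddot = phi; substitute the real model here.
        derivatives.get_mutable_vector().SetFromVector(np.array([x[1], phi]))

sketch_sys = SketchSys()
context = sketch_sys.CreateDefaultContext()
# With nothing wired to the port, fix it to a constant value instead:
sketch_sys.phi_port.FixValue(context, np.array([1.0]))

If the system is placed in a diagram, the port would instead be connected to another system's output port with builder.Connect(...).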

Reinforcement Learning - How does an Agent know which action to pick?

I'm trying to understand Q-Learning
The basic update formula:
Q(s_t, a_t) += α * [ r_{t+1} + γ * max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]
I understand the formula, and what it does, but my question is:
How does the agent know which Q(s_t, a_t) to choose, i.e. which action to take?
I understand that an agent follows some policy π, but how do you create this policy in the first place?
My agents are playing checkers, so I am focusing on model-free algorithms.
All the agent knows is the current state it is in.
I understand that when it performs an action you update the utility, but how does it know to take that action in the first place?
At the moment I have:
Check each move you could make from that state.
Pick whichever move has the highest utility.
Update the utility of the move made.
However, this doesn't really solve much; you still get stuck in local minima/maxima.
So, just to round things off, my main question is:
How, for an agent that knows nothing and is using a model-free algorithm, do you generate an initial policy so it knows which action to take?
That update formula incrementally computes the expected value of each action in every state. A greedy policy always chooses the highest-valued action; this is the best policy once you have already learned the values. The most common policy to use during learning is the ε-greedy policy, which chooses the highest-valued action with probability 1-ε and a random action with probability ε.
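As an illustration of that last point, here is a tiny Python sketch of ε-greedy action selection over a tabular Q-function; the Q table, the state encoding, and the parameter values are assumptions for the example.

import random

def epsilon_greedy_action(Q, state, n_actions, epsilon=0.1):
    # With probability epsilon explore; otherwise exploit the current Q-values.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

Here Q is assumed to be a dict keyed by (state, action) pairs; at the start of learning its values can all be initialized to 0 (or optimistically), and the randomness of ε-greedy is what keeps the agent from settling on whatever looks best initially.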

How do you include categories with 0 responses in SPSS frequency output?

Is there a way to display response options that have 0 responses in SPSS frequency output? The default is for SPSS to omit in the frequency table output any response option that is not selected by at least a single respondent. I looked for a syntax-driven option to no avail. Thank you in advance for any assistance!
It doesn't show because there is not a single case in the data with that attribute, so if we force a row of zeros, realize that we're asking SPSS to report something that isn't actually in the data.
Having said that, you can introduce a fake case with the missing category. E.g. if you have Orange, Apple, and Pear, but no one answered that they like Pear, then add one fake case that says Pear.
Now make a new weight variable that equals 1 for every case, but for the fake Pear case make it very small, like 0.00001. Then go to Data > Weight Cases > Weight cases by and move that new weight variable over. Click OK to apply. SPSS will now treat the real cases with a weight of 1 and the fake case with a weight that is a tiny fraction of a normal case. If you rerun the frequencies, the category with a zero count should show up.
If you have purchased the Custom Tables module you can also do this directly, as far as I can tell from their technical documentation. That module costs 637 to 3630 depending on the license type, so it's probably only worth a try if your institution already has it.
So, I'm a noob with SPSS and (shame on me) I have a cracked version of SPSS 22, but if I understood your question correctly, this is my solution:
double click the Frequency table in Output
right click table, select Table Properties
go to General and then uncheck the Hide empty rows and columns option
Hope this helps someone!
If your SPSS version doesn't have Custom Tables installed and you haven't collected the money for that module yet, then run the following syntax:
*Note: please use variable names up to 8 characters long.
set mxloops 1000. /*in case your list of values is longer than 40
matrix.
get vars /vari= V1 V2 /names= names /miss= omit. /*V1 V2 here is your categorical variable(s)
comp vals= {1,2,3,4,5,99}. /*let this be the list of possible values shared by the variables
comp freq= make(ncol(vals),ncol(vars),0).
loop i= 1 to ncol(vals).
comp freq(i,:)= csum(vars=vals(i)).
end loop.
comp names= {'vals',names}.
print {t(vals),freq} /cnames= names /title 'Frequency'. /*here you are - the frequencies
print {t(vals),freq/nrow(vars)*100} /cnames= names /format f8.2 /title 'Percent'. /*and percents
end matrix.
*If variables have missing values, they are deleted listwise. To include missings, use
get vars /vari= V1 V2 /names= names /miss= -999. /*or other value
*To exclude missings individually from each variable, analyze by separate variables.

Q-Learning: Can you move backwards?

I'm looking over a sample exam and there is a question on Q-learning, which I have included below. In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2)? It appears the Q-value to go back up to A2 would be 0.18, and the Q-value to go right would be 0.09. So why wouldn't the agent go back to A2 instead of going to B3?
[Images: Maze & Q-Table; Solution]
Edit: Also, how come 2,C has a reward value of 2 for action 'right' even though there's a wall there and it's not possible to go right? Do we just assume that's not a possible move and ignore its Q-value?
Edit2: Then in step 6, the Q values for going 'down' and 'right' at state 1,C are equal. At that point does the agent just pick randomly? So then for this question I would just pick the best move since it's possible the agent would pick it?
Edit3: Would it be true to say the agent doesn't return to the state he previously came from? Will an agent ever explore the same state more than once (not including starting a new instance of the maze)?
You seem to assume that you should look at the values of the state in the next time step. This is incorrect. The Q-function answers the question:
If I'm in state x, which action should I take?
In non-deterministic environments you don't even know what the next state will be, so it would be impossible to determine which action to take in your interpretation.
The learning part of Q-learning indeed acts on two subsequent timesteps, but only after they are already known, and they are used to update the values of the Q-function. This has nothing to do with how these samples (state, action, reinforcement, next state) are collected. In this case, samples are collected by the agent interacting with the environment. And in the Q-learning setting, the agent interacts with the environment according to a policy, which here is based on the current values of the Q-function. Conceptually, a policy works in terms of answering the question I quoted above.
In steps 1 and 2, the Q-function is modified only for states 1,A and 2,A. In step 3 the agent is in state 3,A, so that's the only part of the Q-function that's relevant.
In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2)?
In state 3,A the action that has the highest Q-value is "right" (0.2). All other actions have value 0.0.
Also, how come 2,C has a reward value of 2 for action 'right' even though there's a wall there and it's not possible to go right? Do we just assume that's not a possible move and ignore its Q-value?
As I see it, there is no wall to the right of 2,C. Nevertheless, the Q-function is given, and it's irrelevant in this task whether it's possible to reach such a Q-function using Q-learning. And you can always start Q-learning from an arbitrary Q-function anyway.
In Q-learning your only knowledge is the Q-function, so you don't know anything about "walls" and other things - you act according to the Q-function, and that's the whole beauty of this algorithm.
Then in step 6, the Q values for going 'down' and 'right' at state 1,C are equal. At that point does the agent just pick randomly? So then for this question I would just pick the best move since it's possible the agent would pick it?
Again, you should look at the values for the state the agent is currently in, so for 1,B "right" is optimal - it has 0.1 and other actions are 0.0.
To answer the last question, even though it's irrelevant here: yes, if the agent is taking the greedy step and multiple actions seem optimal, it chooses one at random in most common policies.
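For concreteness, here is a tiny Python sketch of that tie-breaking rule (the function name and inputs are illustrative):

import random

def greedy_with_random_tiebreak(q_values):
    # q_values: list of Q-values for the current state, indexed by action.
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return random.choice(best_actions)

With equal values for 'down' and 'right', either action can be returned.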
Would it be true to say the agent doesn't return to the state he previously came from? Will an agent ever explore the same state more than once (not including starting a new instance of the maze)?
No. As I've stated above, the only guideline the agent uses in pure Q-learning is the Q-function. It's not aware that it has been in a particular state before.
