How are eligibility traces with SARSA calculated?

I'm trying to implement eligibility traces (forward-looking), whose pseudocode can be found in the following image.
I'm uncertain what the "For all s, a" means (5th line from the bottom). Where do they get that collection of s, a from?
If it's forward-looking, do you loop forward from the current state to observe s'?
Do you adjust every single e(s, a)?

It's unfortunate that they've reused the variables s and a in two different scopes here, but yes, you adjust all e(s,a) values, e.g.,
for every state s in your state space:
    for every action a in your action space:
        update Q(s,a)
        update e(s,a)
Note what's happening here. Inside that loop, each e(s,a) is decayed (multiplied by a factor less than 1), so its contribution shrinks exponentially over time. But right before you go into that loop, you increment the single e(s,a) corresponding to the state/action pair just visited. So that pair gets "reset" in a way -- its trace jumps back up instead of only shrinking, and on the next iterations its update will continue to be larger than those of all the pairs you haven't recently visited. Every time you visit a state/action pair, you're increasing the weight it contributes to the update of Q for the next few iterations.
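As a minimal sketch of how that sweep looks in code (this assumes tabular SARSA(λ) with accumulating traces, with Q and e stored as Python dicts keyed by (state, action); the variable names are illustrative, not taken from the book's pseudocode):

# One SARSA(lambda) step, after observing (s, a, r, s2, a2).
# Q and e are dicts keyed by (state, action); alpha, gamma, lam are scalars.
delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]
e[(s, a)] += 1.0                           # bump the trace for the pair just visited

for (state, action) in Q:                  # the "For all s, a" sweep
    Q[(state, action)] += alpha * delta * e[(state, action)]
    e[(state, action)] *= gamma * lam      # every trace decays exponentially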


Stuck in understanding the difference between update rules of TD(0) and TD(λ)

I'm studying temporal difference learning from this post. The update rule of TD(0) is clear to me, but in TD(λ) I don't understand how the utility values of all the previous states are updated in a single update.
Here is the diagram given for comparing both updates:
The diagram above is explained as follows:
In TD(λ) the result is propagated back to all the previous states thanks to the eligibility traces.
My question is: how is the information propagated to all the previous states in a single update when we are using the following update rule with eligibility traces?
Here, in a single update, we only update the utility of a single state U_t(s), so how do the utilities of all the previous states get updated?
Edit
As per the answer, it is clear that this update is applied at every single step, and that's the reason why the information is propagated. If this is the case, then it confuses me again, because the only difference between the update rules is the eligibility trace.
So even if the eligibility trace is non-zero for previous states, the value of delta will be zero in the above case (because the rewards and the utility function are initially initialized to 0). Then how is it possible for the previous states to get utility values other than zero in the first update?
Also, in the given Python implementation, the following output is given after a single iteration:
[[ 0. 0.04595 0.1 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
Here only 2 values are updated, instead of all 5 previous states as shown in the figure. What am I missing here?
So even if the eligibility trace is non-zero for previous states, the value of delta will be zero in the above case (because the rewards and the utility function are initially initialized to 0). Then how is it possible for the previous states to get utility values other than zero in the first update?
You are right that, in the first update, all rewards and updates will still be 0 (unless we already manage to reach the goal in a single step, in which case the reward won't be 0).
However, the eligibility traces e_t will continue to "remember" or "memorize" all the states that we have previously visited. So, as soon as we do manage to reach the goal state and get a non-zero reward, the eligibility traces still remember all the states that we went through. Those states still have non-zero entries in the table of eligibility traces, and therefore all get a non-zero update at once as soon as you observe your first reward.
The table of eligibility traces does decay every time step (it is multiplied by gamma * lambda_), so the magnitude of updates to states that were visited a long time ago will be smaller than the magnitude of updates to states that were visited very recently, but those states are not forgotten: they keep non-zero entries (under the assumption that gamma > 0 and lambda_ > 0). This allows the values of all visited states to be updated, not as soon as we reach those states, but as soon as we observe a non-zero reward (or, in epochs after the first, as soon as we reach a state for which we already have an existing non-zero predicted value) after having visited them at some earlier point in time.
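As an illustrative sketch of that mechanism (this is not the code from the post, just a tabular TD(λ) update written out in Python, with U and e as dicts keyed by state):

# One TD(lambda) step for state values, after observing (s, r, s2).
# U and e are dicts keyed by state; alpha, gamma, lambda_ are scalars.
delta = r + gamma * U[s2] - U[s]     # stays 0 until a reward or learned value is seen
e[s] += 1.0                          # mark the state just visited

for state in U:
    U[state] += alpha * delta * e[state]   # every remembered state shares the update
    e[state] *= gamma * lambda_            # traces decay each step but stay non-zero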
Also, in the given Python implementation, the following output is given after a single iteration:
[[ 0. 0.04595 0.1 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
Here only 2 values are updated, instead of all 5 previous states as shown in the figure. What am I missing here?
The first part of their code looks as follows:
for epoch in range(tot_epoch):
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)
So, every new epoch, they start by resetting the environment using the exploring_starts flag. If we look at the implementation of their environment, we see that the usage of this flag means that we always start out with a random initial position.
So, I suspect that, when the code was run to generate that output, the initial position was simply randomly selected to be the position two steps to the left of the goal, rather than the position in the bottom-left corner. If the initial position is randomly selected to already be closer to the goal, the agent only ever visits the two states for which you see non-zero updates, so those will also be the only states with non-zero entries in the table of eligibility traces, and therefore the only states whose values get updated.
If the initial position is indeed the position in the bottom-left corner, a correct implementation of the algorithm would indeed update the values for all the states along that path (assuming no extra tricks are added, like setting entries to 0 if they happen to get "close enough" to 0 due to decaying).
I'd also like to note that there is in fact a mistake in the code on that page: they do not reset all the entries of the table of eligibility traces to 0 when resetting the environment / starting a new epoch. This should be done. If this is not done, the eligibility traces will still memorize states that were visited during previous epochs and also still update all of those, even if they were not visited again in the new epoch. This is incorrect. A correct version of their code should start out like this:
for epoch in range(tot_epoch):
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    trace_matrix = trace_matrix * 0.0  # IMPORTANT, added this
    for step in range(1000):
        ...
You are missing a small but important detail: the update rule is applied to all states, not only the current state. So, in practice, you are updating all the states whose e_t(s) is different from zero.
Edit
The delta is not zero because it is computed for the current state when the episode ends and the agent receives a reward of +1. Therefore, after computing a delta different from zero, you update all the states using that delta and the current eligibility traces.
I don't know why the output of the Python implementation (I haven't checked it carefully) updates only 2 values, but please verify that the eligibility traces for all 5 previous states are different from 0, and if that's not the case, try to understand why. Sometimes you are not interested in maintaining traces below a very small threshold (e.g., 10e-5), because they have a negligible effect on the learning process and maintaining them wastes computational resources.
As can be seen, δ is used to calculate the state's utility. But δ uses the next state's utility, as seen in the article. This means that for TD(0) it will update all the states because, to calculate U_t(s), we need the next state's utility, and so on.
At the same time, the eligibility trace builds on the previous step's eligibility trace.
In the eligibility trace, since gamma is 0.999 (less than 1) and lambda is 0.5, multiplying by gamma and lambda (0.999 * 0.5 ≈ 0.4995) gives the traces a strong decrease at every step, while a weight of +1 is added only to the state visited at the current time t.
This means that the closer a state is to the current state, the more weight it has and thus the bigger its multiplier, and the further away it is, the smaller its multiplier.
In TD(λ) there is therefore an extra decrease applied when the eligibility trace is added to the calculation.
The above leads to the conclusion that the previous values are 0 in the first iteration: all the utilities are 0 at the start, and in TD(λ) the updates to older states are shrunk much more strongly by the eligibility trace. You could say that they are extremely small, too small to calculate or consider.

Is this a correct implementation of Q-Learning for Checkers?

I am trying to understand Q-Learning.
My current algorithm operates as follows:
1. A lookup table is maintained that maps a state to information about its immediate reward and utility for each action available.
2. At each state, check to see if it is contained in the lookup table and initialise it if not (with a default utility of 0).
3. Choose an action to take, where ϵ (with 0 < ϵ < 1) is the probability of taking a random action:
1-ϵ: choose the state-action pair with the highest utility.
ϵ: choose a random move.
ϵ decreases over time.
4. Update the current state's utility based on:
Q(s_t, a_t) += α * [r_{t+1} + γ * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]  (α = learning rate, γ = discount factor)
I am currently playing my agent against a simple heuristic player, who always takes the move that will give it the best immediate reward.
The results are very poor: even after a couple hundred games, the Q-Learning agent is losing a lot more than it is winning, and the change in win rate is almost non-existent, especially after the first couple hundred games.
Am I missing something? I have implemented a couple of agents
(Rote-Learning, TD(0), TD(Lambda), Q-Learning),
but they all seem to be yielding similar, disappointing results.
There are on the order of 10²⁰ different states in checkers, and you need to play a whole game for every update, so it will be a very, very long time until you get meaningful action values this way. Generally, you'd want some way of generalizing across states, such as a simplified state representation or a neural network function approximator, to solve this kind of problem using reinforcement learning.
Also, a couple of caveats:
Ideally, you should update 1 value per game, because the moves in a single game are highly correlated.
You should initialize action values to small random values to avoid large policy changes from small Q updates.
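For reference, here is a bare-bones sketch of the tabular update described in the question, including the small random initialisation suggested above (the helper names, the state encoding, and the reward handling are all illustrative assumptions, not the asker's actual code):

import random

Q = {}  # maps (state, action) -> estimated value

def q_value(state, action):
    # Lazily initialise unseen pairs with small random values (see caveat above).
    if (state, action) not in Q:
        Q[(state, action)] = random.uniform(-0.01, 0.01)
    return Q[(state, action)]

def choose_action(state, actions, epsilon):
    if random.random() < epsilon:                          # explore
        return random.choice(actions)
    return max(actions, key=lambda a: q_value(state, a))   # exploit

def q_update(state, action, reward, next_state, next_actions, alpha, gamma):
    best_next = max((q_value(next_state, a) for a in next_actions), default=0.0)
    td_error = reward + gamma * best_next - q_value(state, action)
    Q[(state, action)] += alpha * td_error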

Is there a more efficient way to find the middle of a singly-linked list? (Any language)

Let's set the context/limitations:
A linked-list consists of Node objects.
Nodes only have a reference to their next node.
A reference to the list is only a reference to the head Node object.
No preprocessing or indexing has been done on the linked-list other than construction (there are no other references to internal nodes or statistics collected, i.e. length).
The last node in the list has a null reference for its next node.
Below is some code for my proposed solution.
Node cursor = head;
Node middle = head;
while (cursor != null) {
    cursor = cursor.next;           // cursor advances two nodes per iteration
    if (cursor != null) {
        cursor = cursor.next;
        middle = middle.next;       // middle advances one node per iteration
    }
}
return middle;
Without changing the linked-list architecture (not switching to a doubly-linked list or storing a length variable), is there a more efficient way to find the middle element of singly-linked list?
Note: When this method finds the middle of an even number of nodes, it always finds the left middle. This is ideal as it gives you access to both, but if a more efficient method will always find the right middle, that's fine, too.
No, there is no more efficient way, given the information you have available to you.
Think about it in terms of transitions from one node to the next. You have to perform N transitions to work out the list length. Then you have to perform N/2 transitions to find the middle.
Whether you do this as a full scan followed by a half scan based on the discovered length, or whether you run the cursor (at twice the speed) and middle (at normal speed) pointers in parallel, is not relevant here; the total number of transitions remains the same.
The only way to make this faster would be to introduce extra information to the data structure, which you've discounted, but, for the sake of completeness, I'll include it here. Examples would be:
making it a doubly-linked list with head and tail pointers, so you could find the middle in N transitions by "squeezing" in from both ends towards the middle. That doubles the storage requirement for pointers, however, so it may not be suitable.
having a skip list with each node pointing to both its "child" and its "grandchild". This would speed up the cursor transitions, resulting in only about N in total (that's N/2 for each of cursor and middle). Like the previous point, this requires an extra pointer per node.
maintaining the length of the list separately so you could find the middle in N/2 transitions.
same as the previous point but caching the middle node for added speed under certain circumstances.
That last point bears some extra examination. Like many optimisations, you can trade space for time and the caching shows one way to do it.
First, maintain the length of the list and a pointer to the middle node. The length is initially zero and the middle pointer is initially set to null.
If you're ever asked for the middle node when the length is zero, just return null. That makes sense because the list is empty.
Otherwise, if you're asked for the middle node and the pointer is null, it must be because you haven't cached the value yet.
In that case, calculate it using the length (N/2 transitions) and then store that pointer for later, before returning it.
As an aside, there's a special case here when adding to the end of the list, something that's common enough to warrant special code.
When adding to the end and the length is going from an even number to an odd number, just set middle to middle->next rather than setting it back to null.
This will save a recalculation and works because you (a) have the next pointers and (b) you can work out how the middle "index" (one-based and selecting the left of a pair as per your original question) changes given the length:
Length   Middle (one-based)
------   ------------------
   0     none
   1       1
   2       1
   3       2
   4       2
   5       3
   :       :
This caching means, provided the list doesn't change (or only changes at the end), the next time you need the middle element, it will be near instantaneous.
If you ever delete a node from the list (or insert somewhere other than the end), set the middle pointer back to null. It will then be recalculated (and re-cached) the next time it's needed.
So, for a minimal extra storage requirement, you can gain quite a bit of speed, especially in situations where the middle element is needed more often than the list is changed.
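A rough sketch of that caching scheme in Python (a hypothetical wrapper class, not taken from any library): it keeps a length, a tail pointer for cheap appends, and a cached middle node that is nudged forward on even-to-odd appends and invalidated on other modifications.

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class CachedMiddleList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.length = 0
        self._middle = None              # cached middle node, or None if stale/empty

    def append(self, value):
        node = Node(value)
        if self.head is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node
        self.length += 1
        # Length went from even to odd: the (left) middle shifts one node right.
        if self._middle is not None and self.length % 2 == 1:
            self._middle = self._middle.next

    def middle(self):
        if self.length == 0:
            return None
        if self._middle is None:         # not cached yet: walk (length - 1) // 2 steps
            node = self.head
            for _ in range((self.length - 1) // 2):
                node = node.next
            self._middle = node
        return self._middle

    def invalidate_middle(self):
        # Call after a deletion or a mid-list insert (and adjust self.length yourself);
        # the middle is recalculated and re-cached on the next middle() call.
        self._middle = None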

How can I generate unique random numbers in quick succession?

I have a function that is called three times in quick succession, and it needs to generate a pseudorandom integer between 1 and 6 on each pass. However, I can't manage to get enough entropy out of the function.
I've tried seeding math.randomseed() with all of the following, but there's never enough variation to affect the outcome.
os.time()
tonumber(tostring(os.time()):reverse():sub(1,6))
socket.gettime() * 1000
I've also tried this snippet, but every time my application runs, it generates the same pattern of numbers in the same order. I need different(ish) numbers every time my application runs.
Any suggestions?
Bah, I needed another zero when multiplying socket.gettime(). Multiplied by 10000, there is sufficient distance between the numbers to give me a good enough seed.
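For comparison, here's the same idea sketched in Python rather than Lua (illustrative only, using the standard library; the scaling factor simply mirrors the 10000 above): scaling a sub-second clock means runs that start only milliseconds apart still end up with clearly different seeds.

import random
import time

# Scale a sub-second clock so nearby start times map to distinct seeds.
seed = int(time.time() * 10000) % (2 ** 32)
random.seed(seed)
print([random.randint(1, 6) for _ in range(3)])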

How does OpenTSDB downsample data

I have a 2 part question regarding downsampling on OpenTSDB.
The first is: does anyone know whether OpenTSDB treats the last end point as inclusive or exclusive when it calculates downsampling, or does it count the end data point twice?
For example, if my time interval is 12:30pm-1:30pm, I get DPs every 5 min starting at 12:29:44pm, and my downsample interval is summing every 10-minute block, does the system take the DPs from 12:30-12:39 and sum them, then 12:40-12:49 and sum them, etc., or does it take the DPs from 12:30-12:40, then from 12:40-12:50, etc.? Yes, I know my data is off by 15 sec, but I don't control that.
I've tried to calculate it by hand, but the data I have isn't helping me. The numbers I'm calculating aren't adding up to the above, nor are they matching what the graph is showing. I don't have access to the system that's pushing numbers into OpenTSDB, so I can't set up dummy data to check.
The second question is: how does downsampling plot its points on the graph given my time range and downsample interval? I set downsample to sum 10-minute blocks. I set my range to be 12:30pm-1:30pm. The graph shows the first point of the downsampled graph at 12:35pm. That makes logical sense. I then changed the range to 12:24pm-1:29pm and expected the first point to start at 12:30, but the first point shown is 12:25pm.
Hopefully someone can answer these questions for me. In the meantime, I'll continue trying to find some data in my system that helps show/prove how downsampling should work.
Thanks in advance for your help.
Downsampling isn't currently working the way you expect, although, since this is a reasonable and commonly made expectation, we are thinking of changing this in a later release of OpenTSDB.
You're assuming that if you ask for a "10 min sum", the data points will be summed up within each "round" (or "aligned") 10 minute block (e.g. 12:30-12:39 then 12:40-12:49 in your example), but that's not what happens. What happens is that the code will start a 10-minute block from whichever data point is the first one it finds. So if the first one is at time 12:29:44, then the code will sum all subsequent data points until 600 seconds later, meaning until 12:39:44.
Within each 600 second block, there may be a varying number of data points. Some blocks may have more data points than others. Some blocks may have unevenly spaced data points, e.g. maybe all the data points are within one second of each other at the beginning of the 600s block. So in order to decide what timestamp will result from the downsampling operation, the code uses the average timestamp of all the data points of the block.
So if all your data points are evenly spaced throughout your 600s block, the average timestamp will fall somewhere in the middle of the block. But if, say, all the data points are within one second of each other at the beginning of the 600s block, then the returned timestamp will reflect that, by virtue of being an average. Just to be clear, the code takes an average of the timestamps regardless of which downsampling function you picked (sum, min, max, average, etc.).
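Here's a small Python sketch of the behaviour described above (an illustration of the described algorithm, not OpenTSDB's actual code): blocks start at the first data point found, each block spans the downsample interval from that point, the chosen aggregator is applied to the values, and the emitted timestamp is the average of the timestamps in the block.

def downsample(points, interval=600, aggregate=sum):
    # points: list of (timestamp_seconds, value) tuples, sorted by timestamp
    results = []
    i = 0
    while i < len(points):
        block_start = points[i][0]          # block begins at the first point found
        block = []
        while i < len(points) and points[i][0] < block_start + interval:
            block.append(points[i])
            i += 1
        timestamps = [t for t, _ in block]
        values = [v for _, v in block]
        avg_ts = sum(timestamps) / len(timestamps)   # emitted time = average timestamp
        results.append((avg_ts, aggregate(values)))
    return results

# With DPs every 5 minutes starting at 12:29:44, the blocks become
# 12:29:44-12:39:44, 12:39:44-12:49:44, and so on.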
If you want to experiment quickly with OpenTSDB without writing to your production system, consider setting up a single-node OpenTSDB instance. It's very easy to do as is shown in the getting started guide.
