Choose closest point to origin with reinforcement learning - machine-learning

I am attempting to use reinforcement learning to choose the closest point to the origin out of a given set of points repeatedly, until a complex (and irrelevant) end condition is reached. (This is a simplification of my main problem.)
A 2D array containing possible points is passed to the reinforcement learning algorithm, which makes a choice as to which point it thinks is the most ideal.
A [1, 10]
B [100, 0]
C [30, 30]
D [5, 7]
E [20, 50]
In this case, D would be the true best choice. (The algorithm should ideally output 3, from the range 0 to 4.)
However, whenever I train the algorithm, it does not seem to learn the underlying concept; instead it learns that choosing, say, C is usually a good choice, and so it always chooses that.
import numpy as np
import rl.core as krl

class FindOriginEnv(krl.Env):

    def observe(self):
        return np.array([
            [np.random.randint(100), np.random.randint(100)] for _ in range(5)
        ])

    def step(self, action):
        observation = self.observe()
        done = np.random.rand() < 0.01  # eventually
        reward = 1 if done else 0
        return observation, reward, done, {}

    # ...
What should I modify about my algorithm such that it will actually learn about the goal it is trying to accomplish?
Observation shape?
Reward function?
Action choices?
Keras code would be appreciated, but is not required; a purely algorithmic explanation would also be extremely helpful.

Sketching out the MDP from your description, there are a few issues:
Your observation function appears to be returning 5 points, so that means a state can be any configuration of 10 integers in [0,99]. That's 100^10 possible states! Your state space needs to be much smaller. As written, observe appears to be generating possible actions, not state observations.
You suggest that you are picking actions from [0, 4], where each action is essentially an index into an array of points available to the agent. This definition of the action space doesn't give the agent enough information to discriminate the way you'd like (smaller-magnitude point is better), because you only act based on the point's index! If you wanted to tweak the formulation a bit to make this work, you would define an action as selecting a 2D point with each dimension in [0, 99]. This would mean you have 100^2 total possible actions, but to maintain the multiple-choice aspect, you would restrict the agent to selecting among a subset at a given step (5 possible actions) based on its current state.
Finally, the reward function that gives zero reward until termination means that you're allowing a large number of possible optimal policies. Essentially, any policy that terminates, regardless of how long the episode took, is optimal! If you want to encourage policies that terminate quickly, you should penalize the agent with a small negative reward at each step.
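A minimal sketch of how those suggestions might look in the questioner's keras-rl skeleton: the observation carries the candidate points, the reward densely favours small-magnitude choices, and every step costs a small penalty. The shaping constants, the index action kept for simplicity, and the stand-in end condition are all assumptions, not the only valid formulation.

import numpy as np
import rl.core as krl

class FindOriginEnv(krl.Env):

    def reset(self):
        self.points = np.random.randint(100, size=(5, 2))
        return np.copy(self.points)

    def step(self, action):
        # action is an index into the 5 candidate points; the chosen point's
        # distance to the origin gives a dense reward, and the -0.01 term
        # penalizes every extra step taken.
        chosen = self.points[action]
        reward = -0.01 - np.linalg.norm(chosen) / 100.0
        done = np.random.rand() < 0.01  # stand-in for the real end condition
        self.points = np.random.randint(100, size=(5, 2))
        return np.copy(self.points), reward, done, {}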

Related

How does DQN work in an environment where reward is always -1

Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as a reward (even when the goal is achieved), I don't understand how DQN with experience replay converges, yet I know it does, because I have working code that proves it. By working, I mean that when I train the agent, it quickly (within 300-500 episodes) learns how to solve the mountain car problem.
It is my understanding that ultimately there needs to be a "sparse reward" that is found. Yet as far as I can see from the OpenAI Gym code, there is never any reward other than -1. It feels more like a "no reward" environment.
What almost answers my question, but in fact does not: when the task is completed quickly, the return (sum of rewards) of the episode is larger. So if the car never finds the flag, the return is -1000. If the car finds the flag quickly the return might be -200. The reason this does not answer my question is because with DQN and experience replay, those returns (-1000, -200) are never present in the experience replay memory. All the memory has are tuples of the form (state, action, reward, next_state), and of course remember that tuples are pulled from memory at random, not episode-by-episode.
Another element of this particular OpenAI Gym environment is that the Done state is returned on either of two occasions: hitting the flag (yay) or timing out after some number of steps (boo). However, the agent treats both the same, accepting the reward of -1. Thus as far as the tuples in memory are concerned, both events look identical from a reward standpoint.
So, I don't see anything in the memory that indicates that the episode was performed well.
And thus, I have no idea why this DQN code is working for MountainCar.
The reason this works is because in Q-learning, your model is trying to estimate the SUM (technically the time-decayed sum) of all future rewards for each possible action. In MountainCar you get a reward of -1 every step until you win, so if you do manage to win, you’ll end up getting less negative reward than usual. For example, your total score after winning might be -160 instead of -200, so your model will start predicting higher Q-values for actions that have historically led to winning the game.
You are right. There is no direct association between the memory (experience replay) and the model's performance in terms of episode reward. The Q-value in DQN is used to predict each action's expected return at each step. The performance measure is the difference between the real reward and the expected reward, i.e. the TD error.
Giving -1 for non-goal steps is a trick that helps RL models choose actions that finish the episode more quickly, because the Q-value is an action value. At each step, the model predicts the value of every possible move, and the policy (usually greedy or epsilon-greedy) chooses the action with the largest value. Imagine that at some moment going backwards would take 200 steps to finish the episode while going forward takes only 100 steps; the Q-values would then be -200 and -100 respectively (without discount). You might wonder how the model learns the value of each action: it does so over repeated episodes and successive trial and error, being trained to minimise the difference between the real and the expected reward, aka the TD error.
In uniformly sampled experience replay, all experiences are drawn with equal probability. In prioritized experience replay, however, you replay experiences with a high estimated error more often. Usually the priorities are proportional to the TD error between the target Q-value and the current model's predicted Q-value. A larger priority means the experience was more surprising, and this helps accelerate training.
You can check the idea in Prioritized Experience Replay, Schaul et al., 2016
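A minimal numpy sketch of that proportional prioritization (the alpha exponent and the small constant added to the error follow the paper; the toy numbers are made up):

import numpy as np

td_errors = np.array([0.1, 2.5, 0.05, 1.0])   # |target - predicted Q| for each stored transition
alpha = 0.6                                    # how strongly to prioritize (0 = uniform sampling)

priorities = (np.abs(td_errors) + 1e-6) ** alpha
probs = priorities / priorities.sum()

# surprising transitions (large TD error) are sampled more often
batch_indices = np.random.choice(len(td_errors), size=2, p=probs)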
It might help to look at a reduced problem. Consider:
States:
───┬───┬───┬───┬───┐
...│L2 │L1 │ S │ R1│
───┴───┴───┴───┴───┘
Actions:
left or right
r = -1 for all states
episode terminates upon reaching R1 or after 2 steps
Memory:
(S, left, -1, L1) non-terminal
(S, right, -1, R1) terminal
(L1, left, -1, L2) terminal
(L1, right, -1, S) terminal
The obvious but important thing to note is that although the rewards are all identical, the states and actions are not. This information allows us to reason about the next state given the current one.
Let's look at the targets that we are updating towards (with no discount):
Q(S,  left ) --> -1 + max_a Q(L1, a)
Q(S,  right) --> -1
Q(L1, left ) --> -1
Q(L1, right) --> -1
In this contrived example, only transition 1 presents an extra source of instability. But over time, as the action value at L1 converges from having sampled transitions 3 and 4 often enough, so should the target for transition 1.
At that point, when we encounter transition 1 again, we would have a better estimate Q(S, left) --> -1 + -1.
It is not enough to only look at the rewards when we are asking how DQN learns from its memory, since it also uses the next observation to determine its current best estimate of the action value at the next step (relevant code), effectively linking everything together and slowly tallying up the rewards, albeit in a less stable manner than traditional Q-learning.
As an exercise, consider extending this further and putting the terminal state at R2.
Then we can easily see that now max_a Q(S, a) = -2; it takes the same number of steps to reach R2 as to simply time out, so it doesn't matter what we do (unless we start closer to R2). However, bump the timeout up and the agent should head towards R2 again, i.e. the mountain car also works because the timeout is set to a number larger than the number of steps it takes to reach the goal. This path should eventually propagate its (negative but better) values back to our initial state.
Although this is true of other environments as well, in those the agent can still at least learn to get closer to the goal before timing out, provided it is rewarded for the effort. That is not the case given the reward design in the mountain car environment.
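A tiny tabular sketch of the corridor example, just replaying the four stored transitions with the targets above (the learning rate and the number of sweeps are arbitrary):

# (state, action, reward, next_state, terminal)
memory = [
    ('S',  'left',  -1, 'L1', False),
    ('S',  'right', -1, 'R1', True),
    ('L1', 'left',  -1, 'L2', True),
    ('L1', 'right', -1, 'S',  True),
]

Q = {(s, a): 0.0 for s, a, _, _, _ in memory}
alpha = 0.5

for _ in range(100):
    for s, a, r, s2, terminal in memory:
        target = r if terminal else r + max(Q.get((s2, b), 0.0) for b in ('left', 'right'))
        Q[(s, a)] += alpha * (target - Q[(s, a)])

# Q[('S', 'left')] converges to -2 and every other entry to -1,
# even though each stored reward is identical (-1).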
In DQN you learn the Q-function, which basically approximates your return. In the memory, you store tuples with (s,a,s',r) and re-train your Q-function on those tuples. If, for a given tuple, you performed well (you reached the flag quickly) then you are going to re-experience it by re-using the tuple for training, because the Q-function is higher for that tuple.
Anyway, usually experience replay works better for any problem, not just for the mountain car.
You say:
those returns (-1000, -200) are never present in the experience replay memory.
What is present in the replay memory is a SARdS tuple with a done flag that tells you that the episode is finished. See the OpenAI gym deepq example:
# Store transition in the replay buffer.
replay_buffer.add(obs, action, rew, new_obs, float(done))
In the final update, if float(done) == 1 then the future Q value is ignored. Hence, at the end of the episode, the bootstrapped future value is 0. If that happens on step 200, then the total return in the episode will be -200. If it happens on step 1000, then the total return will be -1000.
To put it a different way - if completely at random, the episode ended on step 200, an agent that had been performing as badly as a random policy would have a Q value (expected future return) of -800. If the episode then ended, the TD error would be +799 representing the positive surprise of only losing another -1.
I note that the code you linked to does not seem to use the done flag in the replay buffer. Instead it relies on a final state of s_ == None to signify the end of the episode. Take that code out, and the agent won't learn.
if s_ is None:
    t[a] = r
else:
    t[a] = r + GAMMA * numpy.amax(p_[i])
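Equivalently, with the stored done flag (variable names here mirror the snippet above and are otherwise hypothetical), the same target can be written in one line, because the flag simply zeroes the bootstrap term:

t[a] = r + GAMMA * (1.0 - done) * numpy.amax(p_[i])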

Estimating both the category and the magnitude of output using neural networks

Let's say I want to predict which courses a final-year student will take and which grades they will receive in those courses. We have data on previous students' courses and grades for each year (not just the final year) to train with. We also have the grades and courses from previous years for the students whose results we want to estimate. I want to use a recurrent neural network with long short-term memory to solve this problem. (I know this problem can be solved by regression, but I want to use a neural network specifically to see whether it can be properly solved with one.)
The way I want to set up the output (label) space is by having a feature for each of the possible courses a student can take, and having a result between 0 and 1 in each of those entries to describe whether a student will attend the class (if not, the entry for that course would be 0) and, if so, what their mark would be (i.e. if the student attends class A and gets 57%, then the label for class A will contain 0.57).
Am I setting the output space properly?
If yes, what optimization and activation functions should I use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want the network to be given the history of a student and then output one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking the course, 1 for taking the course) and also give the expected grade. The interpretation of the output for a single course would then be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise to rethink it.
Some obviously realistic cases do not fit into this pattern. For example, how would you represent that an (A+) student is unlikely to take a course? Should the network output 0.9999, because (s)he is very likely to get the maximum number of points if (s)he takes the course, OR should it output 0.0001, because the student is very unlikely to take the course?
Instead, you should output two values between [0,1] for each student and each course.
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As loss, I'd propose something like binary cross-entropy on the first value, and simple square error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, square and add for p=2).
A few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
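A minimal Keras sketch of that two-headed output (the dense encoder stands in for whatever history encoder, e.g. the questioner's LSTM, is actually used; layer sizes and names are placeholders):

from tensorflow import keras

n_courses = 20      # hypothetical number of possible courses
history_dim = 64    # hypothetical size of the encoded student history

inputs = keras.Input(shape=(history_dim,))
h = keras.layers.Dense(128, activation="relu")(inputs)
p_take = keras.layers.Dense(n_courses, activation="sigmoid", name="p_take")(h)  # probability of taking each course
grade = keras.layers.Dense(n_courses, activation="sigmoid", name="grade")(h)    # expected relative grade

model = keras.Model(inputs, [p_take, grade])
model.compile(
    optimizer="adam",
    loss={"p_take": "binary_crossentropy", "grade": "mse"},
    loss_weights={"p_take": 1.0, "grade": 1.0},  # p = 1: simply add the two losses
)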
There is more than one way to solve this problem. Andrey's answer gives one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
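A rough Keras sketch of that layout (shapes and layer sizes are placeholders; the softmax is applied per course by reshaping the n*6 logits into groups of 6):

from tensorflow import keras

n = 20              # hypothetical number of possible classes (courses)
num_buckets = 6     # ['A', 'B', 'C', 'D', 'F', 'did_not_take']

inputs = keras.Input(shape=(n, num_buckets))   # one-hot previous result for each course
h = keras.layers.Flatten()(inputs)
h = keras.layers.Dense(2 * n, activation="relu")(h)
h = keras.layers.Dense(2 * n, activation="relu")(h)
logits = keras.layers.Dense(n * num_buckets)(h)
out = keras.layers.Softmax(axis=-1)(keras.layers.Reshape((n, num_buckets))(logits))

model = keras.Model(inputs, out)
# categorical cross-entropy per course, averaged into a single loss value
model.compile(optimizer="adam", loss="categorical_crossentropy")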
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for the input data only, so the two networks take the same input data, but for the grade regression network your last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding to the predicted grade for each class.
If you want to go even further, consider using an architecture that considers the order in which students took previous classes. For example if a student took French I the previous year, it is more likely he/she will take French II this year than if he/she took French Freshman year and did not continue with French after that.

What is the way to understand Proximal Policy Optimization Algorithm in RL?

I know the basics of reinforcement learning, but what terms is it necessary to understand in order to read the arXiv PPO paper?
What is the roadmap to learn and use PPO?
To better understand PPO, it is helpful to look at the main contributions of the paper, which are: (1) the Clipped Surrogate Objective and (2) the use of "multiple epochs of stochastic gradient ascent to perform each policy update".
From the original PPO paper:
We have introduced [PPO], a family of policy optimization methods that use multiple epochs of stochastic gradient ascent to perform each policy update. These methods have the stability and reliability of trust-region [TRPO] methods but are much simpler to implement, requiring only a few lines of code change to a vanilla policy gradient implementation, applicable in more general settings (for example, when using a joint architecture for the policy and value function), and have better overall performance.
1. The Clipped Surrogate Objective
The Clipped Surrogate Objective is a drop-in replacement for the policy gradient objective that is designed to improve training stability by limiting the change you make to your policy at each step.
For vanilla policy gradients (e.g., REINFORCE) --- which you should be familiar with, or familiarize yourself with before you read this --- the objective used to optimize the neural network looks like:
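$$ L^{PG}(\theta) = \hat{\mathbb{E}}_t\big[\, \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \,\big] $$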
This is the standard formula that you would see in the Sutton book, and other resources, where the A-hat could be the discounted return (as in REINFORCE) or the advantage function (as in GAE) for example. By taking a gradient ascent step on this loss with respect to the network parameters, you will incentivize the actions that led to higher reward.
The vanilla policy gradient method uses the log probability of your action (log π(a | s)) to trace the impact of the actions, but you could imagine using another function to do this. Another such function, introduced in this paper, uses the probability of the action under the current policy (π(a|s)), divided by the probability of the action under your previous policy (π_old(a|s)). This looks a bit similar to importance sampling if you are familiar with that:
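$$ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} $$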
This r(θ) will be greater than 1 when the action is more probable for your current policy than it is for your old policy; it will be between 0 and 1 when the action is less probable for your current policy than for your old.
Now to build an objective function with this r(θ), we can simply swap it in for the log π(a|s) term. This is what is done in TRPO:
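$$ L(\theta) = \hat{\mathbb{E}}_t\big[\, r_t(\theta)\, \hat{A}_t \,\big] $$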
But what would happen here if your action is much more probable (like 100x more) for your current policy? r(θ) will tend to be really big and lead to taking big gradient steps that might wreck your policy. To deal with this and other issues, TRPO adds several extra bells and whistles (e.g., KL Divergence constraints) to limit the amount the policy can change and help guarantee that it is monotonically improving.
Instead of adding all these extra bells and whistles, what if we could build these stabilizing properties into the objective function? As you might guess, this is what PPO does. It gains the same performance benefits as TRPO and avoids the complications by optimizing this simple (but kind of funny looking) Clipped Surrogate Objective:
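$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[ \min\big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \Big] $$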
The first term inside the minimization is the same r(θ)Â term we saw in the TRPO objective. The second term is a version where r(θ) is clipped to (1 − ε, 1 + ε). (In the paper they state that a good value for ε is about 0.2, so r can vary between roughly 0.8 and 1.2.) Then, finally, the minimum of these two terms is taken.
Take your time and look at the equation carefully and make sure you know what all the symbols mean, and mathematically what is happening. Looking at the code may also help; here is the relevant section in both the OpenAI baselines and anyrl-py implementations.
Great.
Next, let's see what effect the L clip function creates. Here is a diagram from the paper that plots the value of the clip objective for when the Advantage is positive and negative:
On the left half of the diagram, where (A > 0), this is where the action had an estimated positive effect on the outcome. On the right half of the diagram, where (A < 0), this is where the action had an estimated negative effect on the outcome.
Notice how on the left half, the r-value gets clipped if it gets too high. This will happen if the action became a lot more probable under the current policy than it was for the old policy. When this happens, we do not want to get greedy and step too far (because this is just a local approximation and sample of our policy, so it will not be accurate if we step too far), and so we clip the objective to prevent it from growing. (This will have the effect in the backward pass of blocking the gradient --- the flat line causing the gradient to be 0).
On the right side of the diagram, where the action had an estimated negative effect on the outcome, we see that the clip activates near 0, where the action under the current policy is unlikely. This clipping region will similarly prevent us from updating too much to make the action much less probable after we already just took a big step to make it less probable.
So we see that both of these clipping regions prevent us from getting too greedy and trying to update too much at once and leaving the region where this sample offers a good estimate.
But why are we letting r(θ) grow indefinitely on the far right side of the diagram? This seems odd at first, but what would cause r(θ) to grow really large in this case? Growth of r(θ) in this region would be caused by a gradient step that made our action a lot more probable, but it turned out to make our policy worse. If that is the case, we would want to be able to undo that gradient step. And it just so happens that the L clip function allows this. The function is negative here, so the gradient will tell us to walk in the other direction and make the action less probable by an amount proportional to how much we screwed it up. (Note that there is a similar region on the far left side of the diagram, where the action is good and we accidentally made it less probable.)
These "undo" regions explain why we must include the weird minimization term in the objective function. They correspond to the unclipped r(θ)A having a lower value than the clipped version and getting returned by the minimization. This is because they were steps in the wrong direction (e.g., the action was good but we accidentally made it less probable). If we had not included the min in the objective function, these regions would be flat (gradient = 0) and we would be prevented from fixing mistakes.
Here is a diagram summarizing this:
And that is the gist of it. The Clipped Surrogate Objective is just a drop-in replacement you could use in the vanilla policy gradient. The clipping limits the effective change you can make at each step in order to improve stability, and the minimization allows us to fix our mistakes in case we screwed it up. One thing I didn't discuss is what is meant by PPO objective forming a "lower bound" as discussed in the paper. For more on that, I would suggest this part of a lecture the author gave.
2. Multiple epochs for policy updating
Unlike vanilla policy gradient methods, and because of the Clipped Surrogate Objective function, PPO allows you to run multiple epochs of gradient ascent on your samples without causing destructively large policy updates. This allows you to squeeze more out of your data and reduce sample inefficiency.
PPO runs the policy using N parallel actors each collecting data, and then it samples mini-batches of this data to train for K epochs using the Clipped Surrogate Objective function. See full algorithm below (the approximate param values are: K = 3-15, M = 64-4096, T (horizon) = 128-2048):
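In outline, each iteration (1) runs π_old in N parallel actors for T timesteps each, (2) computes the advantage estimates Â_1, …, Â_T (e.g. with GAE), (3) optimizes the clipped surrogate (together with the value-function loss and an entropy bonus) with minibatch SGD for K epochs over the collected data, using minibatches of size M, and then (4) replaces π_old with the updated π.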
The parallel actors part was popularized by the A3C paper and has become a fairly standard way for collecting data.
The newish part is that they are able to run K epochs of gradient ascent on the trajectory samples. As they state in the paper, it would be nice to run the vanilla policy gradient optimization for multiple passes over the data so that you could learn more from each sample. However, this generally fails in practice for vanilla methods because they take too big of steps on the local samples and this wrecks the policy. PPO, on the other hand, has the built-in mechanism to prevent too much of an update.
For each iteration, after sampling the environment with π_old (line 3) and when we start running the optimization (line 6), our policy π will be exactly equal to π_old. So at first, none of our updates will be clipped and we are guaranteed to learn something from these examples. However, as we update π using multiple epochs, the objective will start hitting the clipping limits, the gradient will go to 0 for those samples, and the training will gradually stop...until we move on to the next iteration and collect new samples.
....
And that's all for now. If you are interested in gaining a better understanding, I would recommend digging more into the original paper, trying to implement it yourself, or diving into the baselines implementation and playing with the code.
[edit: 2019/01/27]: For a better background and for how PPO relates to other RL algorithms, I would also strongly recommend checking out OpenAI's Spinning Up resources and implementations.
PPO, like TRPO, tries to update the policy conservatively, without adversely affecting its performance between policy updates.
To do this, you need a way to measure how much the policy has changed after each update. This measurement is done by looking at the KL divergence between the updated policy and the old policy.
This becomes a constrained optimization problem: we want to change the policy in the direction of maximum performance, subject to the constraint that the KL divergence between the new policy and the old one does not exceed some predefined (or adaptive) threshold.
With TRPO, we enforce the KL constraint during the update and find the step size for this problem (via the Fisher matrix and conjugate gradient). This is somewhat messy to implement.
With PPO, we simplify the problem by turning the KL divergence from a constraint into a penalty term, similar to, for example, an L1 or L2 weight penalty (which prevents weights from growing large). PPO makes a further modification that removes the need to compute the KL divergence altogether, by hard-clipping the policy ratio (the ratio of the updated policy to the old one) to a small range around 1.0, where 1.0 means the new policy is the same as the old one.
PPO is a simple algorithm that falls into the class of policy optimization algorithms (as opposed to value-based methods such as DQN). If you "know" the RL basics (I mean, if you have at least read thoughtfully some of the first chapters of Sutton's book, for example), then a first logical step is to get familiar with policy gradient algorithms. You can read this paper or chapter 13 of the new edition of Sutton's book. Additionally, you may also read this paper on TRPO, which is previous work by PPO's first author (note that it contains numerous notational mistakes). Hope that helps. --Mehdi
I think implementing this for a discrete action space such as CartPole-v1 is easier than for continuous action spaces. For continuous action spaces, this is the most straightforward implementation I found in PyTorch, as you can clearly see how they get mu and std, whereas I could not with more renowned implementations such as OpenAI Baselines and Spinning Up or Stable Baselines.
RL-Adventure PPO
These lines from the link above:
class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_outputs, hidden_size, std=0.0):
        super(ActorCritic, self).__init__()

        self.critic = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)
        )

        self.actor = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_outputs),
        )
        self.log_std = nn.Parameter(torch.ones(1, num_outputs) * std)

        self.apply(init_weights)

    def forward(self, x):
        value = self.critic(x)
        mu = self.actor(x)
        std = self.log_std.exp().expand_as(mu)
        dist = Normal(mu, std)
        return dist, value
and the clipping:
def ppo_update(ppo_epochs, mini_batch_size, states, actions, log_probs, returns, advantages, clip_param=0.2):
    for _ in range(ppo_epochs):
        for state, action, old_log_probs, return_, advantage in ppo_iter(mini_batch_size, states, actions, log_probs, returns, advantages):
            dist, value = model(state)
            entropy = dist.entropy().mean()
            new_log_probs = dist.log_prob(action)

            ratio = (new_log_probs - old_log_probs).exp()
            surr1 = ratio * advantage
            surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantage
I found the link above in the comments on this video on YouTube:
arxiv insights PPO

What's the best objective function for the CartPole task?

I'm doing policy gradient and I'm trying to figure out the best objective function for the task. The task is the OpenAI Gym CartPole-v0 environment, in which the agent receives a reward of 1 for each timestep it survives and a reward of 0 upon termination. I'm trying to figure out the best way to model the objective function. I've come up with 3 possible functions:
def total_reward_objective_function(self, episode_data):
    return sum([timestep_data['reward'] for timestep_data in episode_data])

def average_reward_objective_function(self, episode_data):
    return self.total_reward_objective_function(episode_data) / len(episode_data)

def sum_of_discounted_rewards_objective_function(self, episode_data, discount_rate=0.7):
    return sum([timestep_data['reward'] * pow(discount_rate, timestep)
                for timestep, timestep_data in enumerate(episode_data)])
Note that the average reward objective function will always return 1 unless I intervene and modify the reward function to return a negative value upon termination. The reason I'm asking rather than just running a few experiments is that there are errors elsewhere, so if someone could point me towards good practice in this area, I could focus on the more significant mistakes in the algorithm.
You should use the last one (sum of discounted rewards), since the cart-pole problem is an infinite horizon MDP (you want to balance the pole as long as you can). The answer to this question explains why you should use a discount factor in infinite horizon MDPs.
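In symbols, that objective is the discounted return $$ G = \sum_{t=0}^{\infty} \gamma^{t} r_t, \qquad 0 < \gamma < 1, $$ which stays bounded even if the episode never terminates.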
The first one, instead, is just an undiscounted sum of the rewards, which could be used if episodes have a fixed length (for instance, in the case of a robot performing a 10 seconds trajectory). The second one is usually used in finite horizon MDPs, but I am not very familiar with it.
For the cart-pole, a discount factor of 0.9 should work (or, depending on the algorithm used, you can search for scientific papers and see the discount factor used).
A final note. The reward function you described (+1 at each timestep) is not the only one used in the literature. A common one (and I think also the "original" one) gives 0 at each timestep and -1 if the pole falls. Other reward functions are related to the angle between the pole and the cart.

Learning of Outcome Space Given Noisy Actions and Non-Monotonic Reinforcement

I'm looking to construct or adapt a model preferably based in RL theory that can solve the following problem. Would greatly appreciate any guidance or pointers.
I have a continuous action space, where actions can be chosen from the range 10-100 (inclusive). Each action is associated with a certain reinforcement value, ranging from 0 to 1 (also inclusive) according to a value function. So far, so good. Here's where I start to get in over my head:
Complication 1:
The value function V maps actions to reinforcement according to the distance between a given action x and a target action A. The smaller the distance between the two, the greater the reinforcement (that is, reinforcement is inversely proportional to abs(A - x)). However, the value function is only nonzero for actions close to A (abs(A - x) less than, say, epsilon) and zero elsewhere. So:
V is proportional to 1 / abs(A - x) for abs(A - x) < epsilon, and
V = 0 for abs(A - x) > epsilon.
Complication 2:
I do not know precisely what actions have been taken at each step. I know roughly what they are, such that I know they belong to the range x +/- sigma, but cannot exactly associate a single action value with the reinforcement I receive.
The precise problem I would like to solve is as follows: I have a series of noisy action estimates and exact reinforcement values (e.g. on trial 1 I might have x of ~15-30 and reinforcement of 0; on trial 2 I might have x of ~25-40 and reinforcement of 0; on trial 3, x of ~80-95 and reinforcement of 0.6). I would like to construct a model that represents the estimate of the most likely location of the target action A after each step, probably weighting new information according to some learning rate parameter (since certainty will increase with more samples).
This journal article may be relevant: it addresses delayed rewards and robust learning in the presence of noise and inconsistent rewards.
"Rare neural correlations implement robot conditioning with delayed rewards and disturbances"
Specifically, they trace (remember) which synapses (or actions) had been firing before a reward event and reinforce all of them, where the amount of the reinforcement decays with time between the action and the reward.
An individual reward event will reinforce any synapses that happened to be firing (or actions performed) before the reward, including those irrelevant to it. However, with a suitable learning rate, this should stabilize over a handful of iterations, with only the desired action being consistently rewarded and reinforced.
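A minimal sketch of that "credit everything that might have caused the reward, and let repetition sort it out" idea in the questioner's setting (the discretization of the 10-100 action range, the learning rate, and the immediate-reward assumption are all mine):

import numpy as np

bins = np.arange(10, 101)             # discretized candidate actions
value_estimate = np.zeros(len(bins))  # running estimate of V at each candidate action
lr = 0.2                              # learning rate: how much each trial shifts the estimate

def observe_trial(action_low, action_high, reinforcement):
    # Credit every bin consistent with the noisy action estimate for this trial.
    credited = (bins >= action_low) & (bins <= action_high)
    value_estimate[credited] += lr * (reinforcement - value_estimate[credited])

observe_trial(15, 30, 0.0)
observe_trial(25, 40, 0.0)
observe_trial(80, 95, 0.6)

best_guess = bins[np.argmax(value_estimate)]  # current guess at the target action A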

Resources