Q-Learning: Inaccurate predictions - machine-learning

I recently started getting into Q-Learning. I'm currently writing an agent that should play a simple board game (Othello).
I'm playing against a random opponent. However it is not working properly. Either my agent stays around 50% winrate or gets even worse, the longer I train.
The output of my neural net is an 8x8 matrix with Q-values for each move.
My reward is as follows:
1 for a win
-1 for a loss
an invalid move counts as loss therefore the reward is -1
otherwise I experimented with a direct reward of -0.01 or 0.
I've made some observations I can't explain:
When I consider the prediction for the first move (my agent always starts) the predictions for the invalid moves get really close to -1 really fast. The predictions for the valid moves however seem to rise the longer I play. They even went above 1 (they were above 3 at some point), what shouldn't be possible since I'm updating my Q-values according to the Bellman-equation (with alpha = 1)
where Gamma is a parameter less than 1 (I use 0.99 for the most time). If a game lasts around 30 turns I would expect max/min values of +-0.99^30=+-0.73.
Also the predictions for the starting state seem to be the highest.
Another observation I made, is that the network seems to be too optimistic with its predictions. If I consider the prediction for a move, after which the game will be lost, the predictions for the invalid turns are again close to -1. The valid turn(s) however, often have predictions well above 0 (like 0.1 or even 0.5).
I'm somewhat lost, as I can't explain what could cause my problems, since I already double-checked my reward/target matrices. Any ideas?

i suspect your bellman calculation (specifically, Q(s', a')) does not check for valid moves as the game progresses. that would explain why "predictions for the invalid moves get really close to -1 really fast".

Related

Force a neural network to have 0-sum outputs

I have a pytorch neural net with n-dimensional output which I want to have 0-sum during training (my training data, i.e. the true outputs, have 0 sum). Of course I could just add a line computing the sum s and then subtract s/n from each element of the output. But this way, the network would be driven even less to actually finding outputs with zero sum, as this would get taken care of anyways (I've been getting worse test results with this approach). Also, as the true outputs in the training data have 0 sum, obviously the network converges to having almost 0 sum outputs, but not quite. Hence, I was wondering whether there is a smart way to force the network to have outputs that sum to 0, without just brute-force subtracting the sum in the end (which would corrupt learning outputs to have sum 0)? I.e. some sort of solution directly incorporated in the network? (Probably there isn't, at least I couldn't think of any...)
Your approach with "explicitly substracting the mean" is the correct way. The same way we use softmax to nicely parametrise distributions, and you could complain that "this makes the network not learn about probability even more!", but in fact it does, it simply does so in its own, unnormalised space. Same in your case - by subtracting the mean you make sure that you match the target variable while allowing your network to focus on hard problems, and not waste its compute on having to learn that the sum is zero. If you do anything else your network will literally have to learn to compute the mean somewhere and subtract it. There are some potential corner cases where there might be some deep representational reason for mean to be zero that could be argues for, but these cases are rare enough that chances that this is actually happening "magically" in the network are zero (and if you knew it was happening there would be better ways of targeting it than by zero ensuring).
What happens if you add an explicit loss?
pred = model(input)
original_loss = criterion(pred, target)
# add this loss
zero_sum_loss = pred.mean() ** 2
loss = original_loss + weight * zero_sum_loss
loss.backward()
optim.step()
# ...

How does DQN work in an environment where reward is always -1

Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as a reward (even when goal is achieved), I don't understand how DQN with experience-replay converges, yet I know it does, because I have working code that proves it. By working, I mean that when I train the agent, the agent quickly (within 300-500 episodes) learns how to solve the mountaincar problem. Below is an example from my trained agent.
It is my understanding that ultimately there needs to be a "sparse reward" that is found. Yet as far as I can see from the openAI Gym code, there is never any reward other than -1. It feels more like a "no reward" environment.
What almost answers my question, but in fact does not: when the task is completed quickly, the return (sum of rewards) of the episode is larger. So if the car never finds the flag, the return is -1000. If the car finds the flag quickly the return might be -200. The reason this does not answer my question is because with DQN and experience replay, those returns (-1000, -200) are never present in the experience replay memory. All the memory has are tuples of the form (state, action, reward, next_state), and of course remember that tuples are pulled from memory at random, not episode-by-episode.
Another element of this particular OpenAI Gym environment is that the Done state is returned on either of two occasions: hitting the flag (yay) or timing out after some number of steps (boo). However, the agent treats both the same, accepting the reward of -1. Thus as far as the tuples in memory are concerned, both events look identical from a reward standpoint.
So, I don't see anything in the memory that indicates that the episode was performed well.
And thus, I have no idea why this DQN code is working for MountainCar.
The reason this works is because in Q-learning, your model is trying to estimate the SUM (technically the time-decayed sum) of all future rewards for each possible action. In MountainCar you get a reward of -1 every step until you win, so if you do manage to win, you’ll end up getting less negative reward than usual. For example, your total score after winning might be -160 instead of -200, so your model will start predicting higher Q-values for actions that have historically led to winning the game.
You are right. There is no direct association between memory (experience replay) and the performance of the model in episode reward. The Q-value in DQN is used to predict each action's expected reward in each step. The performance measure of how good your model was is the difference between the real reward and the expected reward (TD-error).
The deployment of -1 for non-goal steps is a trick to help RL models to choose the actions that can finish the episode quicker. Because of the Q-value is an action-value. At each step, the model predicts rewards for every possible move and the policy (usually greedy or epsilon-greedy) choose the action with the most significant value. You can imagine that going back at one moment will result in 200 steps to finish the episode but going forward takes only 100 steps. The Q-value will be -200 (without discount) and -100 respectively. You might wonder how the model knows the value of each action, that is because in the repeated episodes and successive trial-and-error. The model was trained to minimise the difference between real reward and expected reward, aka TD-error.
In a randomly sampled experience replay, all experiences are sampled and deleted uniformly. However, in priority experience replay, you can reuse those experience with high estimated error. Usually, the priorities are proportional to the TD error (real_reward - expected_reward) of expected Q-value and the current model's predicted Q-value. The larger the priority means how surprising the experience, and it helps accelerate the training.
You can check the idea in Priority Experience Replay, Schaul et al., 2016
It might help to look at a reduced problem. Consider:
States:
───┬───┬───┬───┬───┐
...│L2 │L1 │ S │ R1│
───┴───┴───┴───┴───┘
Actions:
left or right
r = -1 for all states
episode terminates upon reaching R1 or after 2 steps
Memory:
(S, left, -1, L1) non-terminal
(S, right, -1, R1) terminal
(L1, left, -1, L2) terminal
(L1, right, -1, S) terminal
The obvious but important thing to note is that although the rewards are all identical, the states and actions are not. This information allows us to reason about the next state given the current one.
Let's look at the targets that we are updating towards (with no discount):
Q(S, left ) --> -1 + max{a}Q(L1, a)
Q(S, right) --> -1
Q(L1, left ) --> -1
Q(L1, right) --> -1
In this contrived example, only transition 1 presents an extra source of instability. But over time, as action value on L1 converges from having sampled enough of transitions 3 and 4, so should the target on transition 1.
At that point, when we encounter transition 1 again, we would have a better estimate Q(S, left) --> -1 + -1.
It is not enough to only look at the rewards when we are asking how DQN learns from its memory since it is also using the next observation to determine its current best estimate of the action value at the next step (relevant code), effectively linking everything together and slowly tallying up the rewards. Albeit it does so in a much more unstable manner unlike traditional Q-learning.
As an exercise, consider extending this further and put the terminal state on R2.
Then we can easily see that now max{a}Q(S, a) = -2; it takes the same amount of time to reach R2 and to simply timeout, so it doesn't matter what we do (unless we start closer to R2). However, bumping the number of timeout steps up and it should head towards R2 again. I.e. the mountain car works also because the timeout is set to a number larger than the time step it takes to reach the goal. This path should eventually propagate its (negative but better) values back to our initial state.
Although this is true for any other environment, they can still at least learn to get closer to the goal before timing out provided they are rewarded for the effort. This is not so given the reward design in the mountain car environment.
In DQN you learn the Q-function, which basically approximates your return. In the memory, you store tuples with (s,a,s',r) and re-train your Q-function on those tuples. If, for a given tuple, you performed well (you reached the flag quickly) then you are going to re-experience it by re-using the tuple for training, because the Q-function is higher for that tuple.
Anyway, usually experience replay works better for any problem, not just for the mountain car.
You say:
those returns (-1000, -200) are never present in the experience replay memory.
What is present in the replay memory is a SARdS tuple with a done flag that tells you that the episode is finished. See the openai gym deepq example:
# Store transition in the replay buffer.
replay_buffer.add(obs, action, rew, new_obs, float(done))
In the final update, if float(done) == 1 then the future Q value is ignored. Hence at the end of the episode, the Q value is 0. If that happens on step 200, then the total return in the episode will be -200. If it happens on step 1000, then the total return will be -1000.
To put it a different way - if completely at random, the episode ended on step 200, an agent that had been performing as badly as a random policy would have a Q value (expected future return) of -800. If the episode then ended, the TD error would be +799 representing the positive surprise of only losing another -1.
I note that the code you linked to does not seem to use the done flag in the replay buffer. Instead it relies on a final state of s_ == None to signify the end of the episode. Take that code out, and the agent won't learn.
if s_ is None:
t[a] = r
else:
t[a] = r + GAMMA * numpy.amax(p_[i])

Anomaly detection with machine learning without labels

I am tracing multiple signals for a certain period of time and associating them with a timestamp like following:
t0 1 10 2 0 1 0 ...
t1 1 10 2 0 1 0 ...
t2 3 0 9 7 1 1 ... // pressed a button to change the mode
t3 3 0 9 7 1 1 ...
t4 3 0 8 7 1 1 ... // pressed button to adjust a certain characterstic like temperature (signal 3)
where t0 is the tamp stamp, 1 is the value for signal 1, 10 the value for signal 2 and so on.
That captured data during that certain period of time should be considered as the normal case. Now significant derivations should be detected from the normal case. With significant derivation I do NOT mean that one signal value just changes to a value that has not been seen during the tracing phase but rather that a lot of values change that have not yet been related to each other. I do not want to hardcode rules since in the future more signals might be added or removed and other "modi" that have other signal values might be implemented.
Can this be achieved via a certain Machine Learning algorithm? If a small derivation occurs I want the algorithm to first see it as a minor change to the training set and if it occurs multiple times in the future it should be "learned". The major goal is to detect the bigger changes / anomalies.
I hope I could explain my problem detailed enough. Thanks in advance.
you could just calculate the nearest neighbor in your feature space and set a threshold how far its allowed to be away from your test point to not be an anomaly.
Lets say you have 100 values in your "certain period of time"
so you use a 100 dimensional feature space with your training data (which doesn't contain anomalies)
If you get a new dataset you want to test, you calculate the (k) nearest neighbor(s) and calculate the (e.g. euclidean) distance in your featurespace.
If that distance is larger than a certain threshold it's an anomaly.
What you have to do in order to optimize is finding a good k and a good threshold. E.g. by Grid-search.
(1) Note that something like this probably only works well if your data has a fixed starting and ending point. Otherwise you would need a huge amount of data and even than it will not perform as good.
(2) Note It should be worth trying to create an own detector for every "mode" you have mentioned in your question.

Neural Network Diverging instead of converging

I have implemented a neural network (using CUDA) with 2 layers. (2 Neurons per layer).
I'm trying to make it learn 2 simple quadratic polynomial functions using backpropagation.
But instead of converging, the it is diverging (the output is becoming infinity)
Here are some more details about what I've tried:
I had set the initial weights to 0, but since it was diverging I have randomized the initial weights
I read that a neural network might diverge if the learning rate is too high so I reduced the learning rate to 0.000001
The two functions I am trying to get it to add are: 3 * i + 7 * j+9 and j*j + i*i + 24 (I am giving the layer i and j as input)
I had implemented it as a single layer previously and that could approximate the polynomial functions better
I am thinking of implementing momentum in this network but I'm not sure it would help it learn
I am using a linear (as in no) activation function
There is oscillation in the beginning but the output starts diverging the moment any of weights become greater than 1
I have checked and rechecked my code but there doesn't seem to be any kind of issue with it.
So here's my question: what is going wrong here?
Any pointer will be appreciated.
If the problem you are trying to solve is of classification type, try 3 layer network (3 is enough accordingly to Kolmogorov) Connections from inputs A and B to hidden node C (C = A*wa + B*wb) represent a line in AB space. That line divides correct and incorrect half-spaces. The connections from hidden layer to ouput, put hidden layer values in correlation with each other giving you the desired output.
Depending on your data, error function may look like a hair comb, so implementing momentum should help. Keeping learning rate at 1 proved optimum for me.
Your training sessions will get stuck in local minima every once in a while, so network training will consist of a few subsequent sessions. If session exceeds max iterations or amplitude is too high, or error is obviously high - the session has failed, start another.
At the beginning of each, reinitialize your weights with random (-0.5 - +0.5) values.
It really helps to chart your error descent. You will get that "Aha!" factor.
The most common reason for a neural network code to diverge is that the coder has forgotten to put the negative sign in the change in weight expression.
another reason could be that there is a problem with the error expression used for calculating the gradients.
if these don't hold, then we need to see the code and answer.

Tic tac toe machine learning - valid moves

I am toying around with Machine learning. Especially Q-Learning where you have a state and actions and give rewards depending on how well the network did.
Now for starters I set myself a simple goal: Train a network so it emits valid moves for tic-tac-toe (vs a random opponent) as actions. My problem is that the network does not learn at all or even gets worse over time.
The first thing I did was getting in touch with torch and a deep q learning module for it: https://github.com/blakeMilner/DeepQLearning .
Then I wrote a simple tic-tac-toe game where a random player competes with the neural net and plugged this into the code from this sample https://github.com/blakeMilner/DeepQLearning/blob/master/test.lua . The output of the network consists of 9 nodes for setting the respective cell.
A move is valid if the network chooses an empty cell (no X or O in it). According to this I give positive reward (if network chooses empty cell) and negative rewards (if network chooses an occupied cell).
The problem is it never seems to learn. I tried lots of variations:
mapping the tic-tac-toe field as 9 inputs (0 = cell empty, 1 = player 1, 2 = player 2) or as 27 inputs (e.g. for an empty cell 0 [empty = 1, player1 = 0, player2 = 0])
vary the hidden nodes count between 10 and 60
tried up to 60k iterations
varying learning rate between 0.001 and 0.1
giving negative rewards for fails or only rewards for success, different reward values
Nothing works :(
Now I have a couple of questions:
Since this is my very first attempt at Q-Learning is there anything I am fundamentally doing wrong?
What parameters are worth changing? The "Brain" thing has a lot: https://github.com/blakeMilner/DeepQLearning/blob/master/deepqlearn.lua#L57 .
What would a good count for the number of hidden nodes be?
Is the simple network structure as defined at https://github.com/blakeMilner/DeepQLearning/blob/master/deepqlearn.lua#L116 too simple for this problem?
Am I just too impatient and have to train much more iterations?
Thank you,
-Matthias
Matthias,
It seems you are using one output node? "The output of the network in the forward step is a number between 1 and 9". If so, then I believe this is the problem. Instead of having one output node I would treat this as a classification problem and have nine output nodes corresponding to each board position. Then take the argmax of these nodes as the predicted move. This is how networks that play the game of Go are setup (there are 361 output nodes each representing an intersection on the board).
Hope this helps!

Resources