Updates in Temporal Difference Learning - machine-learning

I read about Tesauro's TD-Gammon program and would love to implement it for tic tac toe, but almost all of the information is inaccessible to me as a high school student because I don't know the terminology.
The first equation here, http://www.stanford.edu/group/pdplab/pdphandbook/handbookch10.html#x26-1310009.2
gives the "general supervised learning paradigm." It says that the w sub t on the left side of the equation is the parameter vector at time step t. What exactly does "time step" mean? Within the framework of a tic tac toe neural network designed to output the value of a board state, would the time step refer to the number of played pieces in a given game? For example the board represented by the string "xoxoxoxox" would be at time step 9 and the board "xoxoxoxo " would be at time step 8? Or would the time step refer to the amount of time elapsed since training has begun?
Since w sub t is the weight vector for a given time step, does this mean that every time step has its own evaluation function (neural network)? So to evaluate a board state with only one move, you would have to feed into a different NN than you would feed a board state with two moves? I think I am misinterpreting something here because as far as I know Tesauro used only one NN for evaluating all board states (though it is hard to find reliable information about TD-Gammon).
How come the gradient of the output is taken with respect to w and not w sub t?
Thanks in advance for clarifying these ideas. I would appreciate any advice regarding my project or suggestions for accessible reading material.

TD deals with learning within the framework of a Markov Decision Process. That is, you begin in some state st, perform an action at, receive reward rt, and end up in another state — st+1. The initial state is called s0. The subscript is called time.
A TD agent begins not knowing what rewards it will receive for what actions or what other states those actions take it to. It's goal is to learn which actions maximize long-term total reward.
The state could represent the state of the tic-tac-toe board. So, s0 in your case would be a clear board: "---------", s1 might be "-o-------", s2 might be "-o--x----", etc. The action might be the cell index to mark. With this representation, you would have 39=19 683 possible states and 9 possible actions. After learning the game, you would have a table telling you which cell to mark on each possible board.
The same kind of direct representation would not work well for Backgammon, because there are far too many possible states. What TD-Gammon does is approximate states using a neural network. The weights of the network are treated as a state vector, and the reward is always 0, except upon winning.
The tricky part here is that the state the TD-Gammon algorithm learns is the state of the neural network used to evaluate board positions, not the state of the board. So, at t=0 you have not played a single game and the network is in its initial state. Every time you have to make a move, you use the network to choose the best possible one, and then update the network's weights based on whether the move led to victory or not. Before making the move, the network has weights wt, afterwards it has weights wt+1. After playing hundreds of thousands of games, you learn weights that allow the neural network to evaluate board positions quite accurately.
I hope this clears things up.

Related

Is reinforcement learning suitable for predicting bias in dice?

I would like to analyze a problem similar to the following.
Problem:
You will be given N dices.
You will be given a lot of data about each dice (eg surface information, material information, location of the center of gravity … etc).
The features of the dice are randomly generated every game and are fired at the same speed, angle and initial position.
As a result of rolling the dice, you get 1 point if you get 6 and 0 points otherwise.
There are training data of 100000 games. (Dice data and match results)
I would like to learn the rule of selecting only dice whose probability of getting 6 is higher than 1/6.
I apologize for the vague problem statement.
First of all, it is my mistake to assume that "N dice".
The dice may be one by one.
One dice with random characteristics are distributed
When it rolls, it is recorded whether 6 has come out or not.
It was easy to understand if it was made into the problem that "this [characteristics, result] data is 100,000".
If you get something other than 6, you will get -1 points.
If you get 6, you will get +5 points.
Example:
X: vector of a dice data
f: function I want to know
f: X-> [0, 1]
(if result> 0.5, I pick this dice.)
For example, a dice with a 1/5 chance of getting a 6 gets 4 out of 5 times a non-6, so I wondered if it would be better to give an immediate reward.
Is it good to decide the reward by the number of points after 100000 games?
I have read some general reinforcement learning methods, but there is a concept of state transition. However, there is no state transition in this game. (Each game ends in 1 step, and each game is independent.)
I am a student just learning neural networks from scratch. It helps if you give me a hint. Thank you.
by the way,
I think that the result of this learning can be concluded "It is good to choose the dice whose pips farthest to the center of gravity is 6."
Let's first talk about Reinforcement-Learning.
Problem setups, in order of increasing generality:
Multi-Armed Bandit - no state, just actions with unknown rewards
Contextual Bandit - rewards also depend on some context (state)
Reinforcement Learning (MDP) - actions can also influence the next state
Common to all of all three is that you want to maximize the sum of rewards over time, and there is an exploration vs exploitation trade-off. You are not just given a large dataset. If you want to know what the best action is, you have to try it a few times and observe the reward. This may cost you some reward you could have earned otherwise.
Of those three, the Contextual Bandit is the closest match to your setup, although it doesn't quite match to your goals. It would go like this: Given some properties of the dice (the context), select the best dice from a group of possible choices (the action, e.g. the output from your network), such that you get the highest expected reward. At the same time you are also training your network, so you have to pick bad or unknown properties sometimes to explore them.
However, there are two reasons why it doesn't match:
You already have data from several 100000 of games, and seem to be not interested in minimizing the cost of trial and error to acquire more data. You assume this data is representative, so no exploration is required.
You are only interested in prediction. You want classify the dice into "good for rolling 6" vs "bad". This piece of information could be used later to make a decision between different choices if you know the cost for making a wrong decision. If you are just learning f() because you are curious about the property of a dice, then is a pure statistical prediction problem. You don't have to worry about short- or long-term rewards. You don't have to worry about selection or consequences of any actions.
Because of this, you actually only have a supervised learning problem. You could still solve it with reinforcement learning because RL is more general. But your RL algorithm would be wasting a lot of time figuring out that it really cannot influence the next state.
Supervised Learning
Your dice actually behaves like a biased coin, it's a Bernoulli trial with ~1/6 success probability. This is now a standard classification problem: given your features, predict the probability that a dice will lead to a good match result.
It seems that your "match results" can be easily converted in the number of rolls and the number of positive outcomes (rolled a six) with the same dice. If you have a large number of rolls for every dice, you can simply classify this die and use this class (together with the physical properties) as one data point to train your network.
You can do more fancy things if you have fewer rolls but I won't go into that. (If you are interested, have a look at the beta distribution and how the cross-entropy loss works with neural networks.)

deep reinforcement learning parameters and training time for a simple game

I want to learn how deep reinforcement algorithm works and what time it takes to train itself for any given environment.
I came up with a very simple example of environment:
There is a counter which holds an integer between 0 to 100.
counting to 100 is its goal.
there is one parameter direction whose value can be +1 or -1.
it simply show the direction to move.
out neural network takes this direction as input and 2 possible action as output.
Change the direction
Do not change the direction
1st action will simply flip the direction (+1 => -1 or -1 =>+1). 2nd action will keep the direction as it is.
I am using python for backend and javascript for frontend.
It seems to take too much time, and still it is pretty random. i have used 4 layer perceptron. training rate of 0.001 . memory learning with batch of 100. Code is of Udemy tutorial of Artificial Intelligence and is working properly.
My question is, What should be the reward for completion and for each state.? and how much time it is required to train simple example as that.?
In Reinforcement Learning the underlining reward function is what defines the game. Different reward functions lead to different games with different optimal strategies.
In your case there are a few different possibilities:
Give +1 for reaching 100 and only then.
Give +1 for reaching 100 and -0.001 for every time step it is not at 100.
Give +1 for going up -1 for going down.
The third case is way too easy there is no long term planing involved. In the first too cases the agent will only start learning once it accidentally reaches 100 and sees that it is good. But in the first case once it learns to go up it doesn't matter how long it takes to get there. The second is the most interesting where it needs to get there as fast as possible.
There is no right answer for what reward to use, but ultimately the reward you choose defines the game you are playing.
Note: 4 layer perceptron for this problem is Big Time Overkill. One layer should be enough (this problem is very simple). Have you tried the reinforcement learning environments at OpenAI's gym? Highly recommend it, they have all the "classical" reinforcement learning problems.

Suggestion for Neural Network for Table Tennis Robot

based on my last question I built a 3D-Game in which two robot arms are playing table tennis with each other. The robots have six degrees of freedom.
The states are composed by:
x, y and z-Position of the Ball
6 angles of the robot
All values are normalized, so that they take a value between [-1,1].
With 4 consecutive frames, I get a total input of 37 parameters.
Rewards
+0.3 when the player hits the ball
+0.7 when the player wins the match
-0.7 when the player loses the match
Output
Every of the six robot joints can move with a certain speed, so every joint has the possibilities to move in a positive direction, to stay or to move in a negative direction.
This results in 3^6=729 outputs.
With these settings, the neural network should learn the inverse kinematics of the robot and to play table tennis.
My problem is, that my network converges, but seems to get stuck in a local minimum and depending on the configuration, afterward, begins to converge.
I first tried networks with two and three hidden layers with 1000 nodes, and after a few epochs, the network began to converge. I realized that 1000 nodes are way too much and lowered them to 100, with the result, that the network behaves as described, it converges first and then slightly diverges. So I decided to add hidden layers. Currently, I am trying a network with 6 hidden layers, 80 nodes each. The current loss looks like:
So what do you think, experienced machine learning experts? Do you see any problems with my configuration? Which type of network would you choose?
I'm glad for every suggestion.
I had in the past a similar problem. The goal was to learn the inverse kinematics for a robotarm with the neuroevolution framework NEAT. In the figure on the left is the errorplot. At the beginning all works fine, the network improves, but on a certain point the errorvalue remains on the same value and even after 30 Minutes of calculating no change was there. I don't think that your neural network is wrong, or that the number of neurons are wrong. I think, that neural networks are in general not capable of learning the inverse kinematics problem. I also think that the famous paper of deepmind (Playing Atari games with neuralnetwork) is bogus.
But back to the facts. The plot in the OP (loss average) and my plot (population fitness) are showing both an improvment at the beginning and after a certain time a standstillcurve which can't be improved despite the fact, that the cpu is running on 100% for finding a better solution. It is unclear, how long the neuralnetwork have to be optimized until a significant improvement is visable and perhaps even after days or years of constant calculation no better solution will be found. A look into literature shows, that for every middle- or hardproblem that result is normal and until now no better neural networks or better learning algorithm were invented. The underlying problem is called combinatorial explosion and means that there are many milions possible solution for the network-weights and that a computer can only scan a small amount of them. If the problem is really easy like the "xor problem" a learning algorithm like backpropagation or RPropMinus will find a solution. On slightly more difficult problems like navigating in a maze, finding inverse kinematics or peg-in-hole task no current neural network will find a solution.

Tic Tac Toe Neural Network as Evaluation Function

I've been trying to program an AI for tic tac toe using a multilayer perceptron and backpropagation. My idea was to train the neural network to be an accurate evaluation function for board states, but the problem is even after analyzing thousands of games, the network does not output accurate evaluations.
I'm using 27 input neurons; each square on the 3x3 board is associated with three input neurons that receive values of 0 or 1 depending on whether the square has an x, o or is blank. These 27 input neurons send signals to 10 hidden neurons (I chose 10 arbitrarily, but I have tried with 5 and 15 as well).
For training, I've had the program generate a series of games by playing against itself using the current evaluation function to select what are deemed optimal moves for each side. After generating a game, the NN compiles training examples (which comprise a board state and the correct output) by taking the correct output for a given board state to be the value (using the evaluation function) of the board state that follows it in the game sequence. I think this is what Gerald Tesauro did when programming TD-Gammon, but I might have misinterpreted the article. (note: I put the specific mechanism for updating weights at the bottom of this post).
I have tried various values for the learning rate, as well as varying numbers of hidden neurons, but nothing seems to work. Even after hours of "learning," there is no discernible improvement in strategy and the evaluation function is not anywhere close to accurate.
I realize that there are much easier ways to program tic tac toe, but I want to do it with a multilayer perceptron so that I may apply it to connect 4 later on. Is this even possible? I'm starting to think that there is no reliable evaluation function for a tic tac toe board with a reasonable amount of hidden neurons.
I assure you that I am not looking for some quick code to turn in for a homework assignment. I've been working unsuccessfully for a while now and would just like to know what I'm doing wrong. All advice is appreciated.
This is the specific mechanism I used for the NN:
Each of the 27 input neurons receives a 0 or 1, which passes through the differentiable sigmoid function 1/(1+e^(-x)). Each input neuron i sends this output (i.output), multiplied by some weight (i.weights[h]) to each hidden neuron h. The sum of these values is taken as input by the hidden neuron h (h.input), and this input passes through the sigmoid to form the output for each hidden neuron (h.output). I denote the lastInput to be the sum of (h.output * h.weight) across all of the hidden neurons. The outputted value of the board is then sigmoid(lastInput).
I denote the learning rate to be alpha, and err to be the correct output minus to actual output. Also I let dSigmoid(x) equal the derivative of the sigmoid at the point x.
The weight of each hidden neuron h is incremented by the value: (alpha*err*dSigmoid(lastInput)*h.output) and the weight of the signal from a given input neuron i to a given hidden neuron h is incremented by the value: (alpha*err*dSigmoid(lastInput)*h.weight*dSigmoid(h.input)*i.output).
I got these formulas from this lecture on backpropagation: http://www.youtube.com/watch?v=UnWL2w7Fuo8 .
Tic tac toe has 3^9 = 19683 states (actually, some of them aren't legal, but the order of magnitude is right). The ouput function isn't smooth, so I think the best a backpropagation network can do is "rote learning" a look-up table for all these states.
With that in mind, 10 hidden neurons seems very small, and there's no way you can train 20k different look-up-table entries by teaching a few thousand games. For that, the network would have to "extrapolate" from states it has been taught to states it has never seen, and I don't see how it could do that.
You might want to consider more than one hidden layer, as well as upping the size of the hidden layer. For comparison purposes, Fogel and Chellapilla used two layers of 40 and 10 neurons to program up a checkers player, so if you need something more than that, something is probably going terribly wrong.
You might also want to use bias inputs, if you're not already.
Your basic methodology seems sound, although I'm not 100% sure what you mean by this:
After generating a game, the NN compiles training examples (which comprise a board state and the correct output) by taking the correct output for a given board state to be the value (using the evaluation function) of the board state that follows it in the game sequence.
I think you mean that you're using some known-good method (like a minimax game tree) to determine the "correct" answers for the training examples. Can you explain that a little bit? Or, if I'm correct, it seems like there's a subtlety to deal with, in terms of symmetric boards, which might have more than one equally good best response. If you're only treating one of those as correct, that might lead to problems. (Or it might not, I'm not sure.)
Just to throw in another thought have your thought about using reinforcement learning for this task? It would be much easier to implement and much more effective. For example you could use Q learning which is often used for games.
Here you can find an implementation for training a Neural Network in Tik Tak Toe (variable board-size) using self-play. The gradient is back-propagated through the whole game employing a simple gradient-copy trick.

Reinforcement learning of a policy for multiple actors in large state spaces

I have a real-time domain where I need to assign an action to N actors involving moving one of O objects to one of L locations. At each time step, I'm given a reward R, indicating the overall success of all actors.
I have 10 actors, 50 unique objects, and 1000 locations, so for each actor I have to select from 500000 possible actions. Additionally, there are 50 environmental factors I may take into account, such as how close each object is to a wall, or how close it is to an actor. This results in 25000000 potential actions per actor.
Nearly all reinforcement learning algorithms don't seem to be suitable for this domain.
First, they nearly all involve evaluating the expected utility of each action in a given state. My state space is huge, so it would take forever to converge a policy using something as primitive as Q-learning, even if I used function approximation. Even if I could, it would take too long to find the best action out of a million actions in each time step.
Secondly, most algorithms assume a single reward per actor, whereas the reward I'm given might be polluted by the mistakes of one or more actors.
How should I approach this problem? I've found no code for domains like this, and the few academic papers I've found on multi-actor reinforcement learning algorithms don't provide nearly enough detail to reproduce the proposed algorithm.
Clarifying the problem
N=10 actors
O=50 objects
L=1K locations
S=50 features
As I understand it, you have a warehouse with N actors, O objects, L locations, and some walls. The goal is to make sure that each of the O objects ends up in any one of the L locations in the least amount of time. The action space consist of decisions on which actor should be moving which object to which location at any point in time. The state space consists of some 50 X-dimensional environmental factors that include features such as proximity of actors and objects to walls and to each other. So, at first glance, you have XS(OL)N action values, with most action dimensions discrete.
The problem as stated is not a good candidate for reinforcement learning. However, it is unclear what the environmental factors really are and how many of the restrictions are self-imposed. So, let's look at a related, but different problem.
Solving a different problem
We look at a single actor. Say, it knows it's own position in the warehouse, positions of the other 9 actors, positions of the 50 objects, and 1000 locations. It wants to achieve maximum reward, which happens when each of the 50 objects is at one of the 1000 locations.
Suppose, we have a P-dimensional representation of position in the warehouse. Each position could be occupied by the actor in focus, one of the other actors, an object, or a location. The action is to choose an object and a location. Therefore, we have a 4P-dimensional state space and a P2-dimensional action space. In other words, we have a 4PP2-dimensional value function. By futher experimenting with representation, using different-precision encoding for different parameters, and using options 2, it might be possible to bring the problem into the practical realm.
For examples of learning in complicated spatial settings, I would recommend reading the Konidaris papers 1 and 2.
1 Konidaris, G., Osentoski, S. & Thomas, P., 2008. Value function approximation in reinforcement learning using the Fourier basis. Computer Science Department Faculty Publication Series, p.101.
2 Konidaris, G. & Barto, A., 2009. Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining Y. Bengio et al., eds. Advances in Neural Information Processing Systems, 18, pp.1015-1023.

Resources