Tic Tac Toe Neural Network as Evaluation Function

I've been trying to program an AI for tic tac toe using a multilayer perceptron and backpropagation. My idea was to train the neural network to be an accurate evaluation function for board states, but the problem is even after analyzing thousands of games, the network does not output accurate evaluations.
I'm using 27 input neurons; each square on the 3x3 board is associated with three input neurons that receive values of 0 or 1 depending on whether the square has an x, o or is blank. These 27 input neurons send signals to 10 hidden neurons (I chose 10 arbitrarily, but I have tried with 5 and 15 as well).
For training, I've had the program generate a series of games by playing against itself using the current evaluation function to select what are deemed optimal moves for each side. After generating a game, the NN compiles training examples (which comprise a board state and the correct output) by taking the correct output for a given board state to be the value (using the evaluation function) of the board state that follows it in the game sequence. I think this is what Gerald Tesauro did when programming TD-Gammon, but I might have misinterpreted the article. (note: I put the specific mechanism for updating weights at the bottom of this post).
I have tried various values for the learning rate, as well as varying numbers of hidden neurons, but nothing seems to work. Even after hours of "learning," there is no discernible improvement in strategy and the evaluation function is not anywhere close to accurate.
I realize that there are much easier ways to program tic tac toe, but I want to do it with a multilayer perceptron so that I may apply it to connect 4 later on. Is this even possible? I'm starting to think that there is no reliable evaluation function for a tic tac toe board with a reasonable amount of hidden neurons.
I assure you that I am not looking for some quick code to turn in for a homework assignment. I've been working unsuccessfully for a while now and would just like to know what I'm doing wrong. All advice is appreciated.
This is the specific mechanism I used for the NN:
Each of the 27 input neurons receives a 0 or 1, which passes through the differentiable sigmoid function 1/(1+e^(-x)). Each input neuron i sends this output (i.output), multiplied by some weight (i.weights[h]), to each hidden neuron h. The sum of these values is taken as the input to hidden neuron h (h.input), and this input passes through the sigmoid to form the output of each hidden neuron (h.output). I denote by lastInput the sum of (h.output * h.weight) across all of the hidden neurons. The outputted value of the board is then sigmoid(lastInput).
I denote the learning rate by alpha, and err to be the correct output minus the actual output. I also let dSigmoid(x) equal the derivative of the sigmoid at the point x.
The weight of each hidden neuron h is incremented by the value: (alpha*err*dSigmoid(lastInput)*h.output) and the weight of the signal from a given input neuron i to a given hidden neuron h is incremented by the value: (alpha*err*dSigmoid(lastInput)*h.weight*dSigmoid(h.input)*i.output).
I got these formulas from this lecture on backpropagation: http://www.youtube.com/watch?v=UnWL2w7Fuo8 .
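In code, one training step looks roughly like this (a simplified, list-based sketch of the rules above, not my exact program):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def dSigmoid(x):
        s = sigmoid(x)          # derivative of the sigmoid at x
        return s * (1.0 - s)

    def train_on_example(board, in_weights, out_weights, correct, alpha):
        # board: 27 raw inputs (0 or 1); in_weights[i][h]: input i -> hidden h;
        # out_weights[h]: hidden h -> output
        n_hidden = len(out_weights)
        i_out = [sigmoid(x) for x in board]    # inputs pass through the sigmoid too
        h_in = [sum(i_out[i] * in_weights[i][h] for i in range(27))
                for h in range(n_hidden)]
        h_out = [sigmoid(x) for x in h_in]
        lastInput = sum(h_out[h] * out_weights[h] for h in range(n_hidden))
        err = correct - sigmoid(lastInput)     # correct output minus actual output
        for h in range(n_hidden):
            w_old = out_weights[h]             # use the pre-update weight below
            out_weights[h] += alpha * err * dSigmoid(lastInput) * h_out[h]
            for i in range(27):
                in_weights[i][h] += (alpha * err * dSigmoid(lastInput)
                                     * w_old * dSigmoid(h_in[h]) * i_out[i])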

Tic tac toe has 3^9 = 19683 states (actually, some of them aren't legal, but the order of magnitude is right). The output function isn't smooth, so I think the best a backpropagation network can do is "rote learning" a look-up table for all these states.
With that in mind, 10 hidden neurons seems very small, and there's no way you can train 20k different look-up-table entries by teaching a few thousand games. For that, the network would have to "extrapolate" from states it has been taught to states it has never seen, and I don't see how it could do that.

You might want to consider more than one hidden layer, as well as upping the size of the hidden layer. For comparison purposes, Fogel and Chellapilla used two layers of 40 and 10 neurons to program up a checkers player, so if you need something more than that, something is probably going terribly wrong.
You might also want to use bias inputs, if you're not already.
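For concreteness, a forward pass through the kind of architecture I mean might look like this (a numpy sketch; the 40/10 layer sizes just mirror the checkers example above, and the random weight scale is arbitrary):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    # 27 inputs -> 40 hidden -> 10 hidden -> 1 output, each layer with a bias term
    W1, b1 = rng.normal(scale=0.1, size=(40, 27)), np.zeros(40)
    W2, b2 = rng.normal(scale=0.1, size=(10, 40)), np.zeros(10)
    W3, b3 = rng.normal(scale=0.1, size=(1, 10)), np.zeros(1)

    def evaluate(board):               # board: length-27 vector of 0s and 1s
        h1 = sigmoid(W1 @ board + b1)
        h2 = sigmoid(W2 @ h1 + b2)
        return sigmoid(W3 @ h2 + b3)   # evaluation of the position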
Your basic methodology seems sound, although I'm not 100% sure what you mean by this:
After generating a game, the NN compiles training examples (which comprise a board state and the correct output) by taking the correct output for a given board state to be the value (using the evaluation function) of the board state that follows it in the game sequence.
I think you mean that you're using some known-good method (like a minimax game tree) to determine the "correct" answers for the training examples. Can you explain that a little bit? Or, if I'm correct, it seems like there's a subtlety to deal with, in terms of symmetric boards, which might have more than one equally good best response. If you're only treating one of those as correct, that might lead to problems. (Or it might not, I'm not sure.)

Just to throw in another thought: have you thought about using reinforcement learning for this task? It would be much easier to implement and much more effective. For example, you could use Q-learning, which is often used for games.
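A tabular version for tic tac toe fits in a few lines; here is a rough sketch (states as 9-character board strings, actions as cell indices; all names and constants are illustrative):

    import random
    from collections import defaultdict

    Q = defaultdict(float)                 # Q[(state, action)] -> estimated value
    alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration

    def choose_action(state, legal_moves):
        if random.random() < epsilon:      # explore occasionally
            return random.choice(legal_moves)
        return max(legal_moves, key=lambda a: Q[(state, a)])

    def q_update(state, action, reward, next_state, next_legal_moves):
        best_next = max((Q[(next_state, a)] for a in next_legal_moves), default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])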

Here you can find an implementation for training a Neural Network in Tic Tac Toe (variable board size) using self-play. The gradient is back-propagated through the whole game employing a simple gradient-copy trick.

Related

Neural network online training

I want to implement a simple feed-forward neural network to approximate the function y=f(x)=ax^2 where a is some constant and x is the input value.
The NN has one input node, one hidden layer with 1-n nodes, and one output node. For example, I input the value 2.0 -> the NN produces 4.0, and again I input 3.0 -> the NN produces 9.0 or close to it and so on.
If I understand "online-training," the training data is fed one by one - meaning I input the value 2.0 -> I iterate with gradient descent 100 times, and then I pass the value 3.0 and iterate another 100 times.
However, when I try to do this with my experimental/learning NN - I input the value 2.0 -> the error gets very small -> the output is very close to 4.0.
Now if I want to predict for the input 3.0 -> the NN produces 4.36 or something instead of 9.0. So the NN just learns the last training value.
How can I use online-training to get a Neural Network that approximates the desired function for a range [-d, d]? What am I missing?
The reason why I like online-training is that eventually I want to input a time series - and map that series to the desired function. This is besides the point but in case someone was wondering.
Any advice would be greatly appreciated.
More info - I am activating the hidden layer with the Sigmoid function and the output layer with the linear one.
The reason why I like online-training is that eventually I want to input a time series - and map that series to the desired function.
Recurrent Neural Networks (RNNs) are the state of the art for modeling time series. This is because they can take inputs of arbitrary length, and they can also use internal state to model the changing behavior of the series over time.
Training feedforward neural networks on time series is an older method which will generally not perform as well. They require a fixed-size input, so you must choose a fixed-size sliding time window, and they also don't preserve state, so it is hard to learn a time-varying function.
I can find very little about "online training" of feedforward neural nets with stochastic gradient descent to model non-stationary behavior except for a couple of very vague references. I don't think this provides any benefit besides allowing you to train in real time when you are getting a stream of data one at a time. I don't think it will actually help you model time-dependent behavior.
Most of the older methods I can find in the literature about online learning for neural networks use a hybrid approach with a neural network and some other method that can help capture time dependencies. Again, these should all be inferior to RNNs, not to mention harder to implement in practice.
Furthermore, I don't think you are implementing online training correctly. It should be stochastic gradient descent with a mini-batch size of 1. Therefore, you only run one iteration of gradient descent on each training example per training epoch. Since you are running 100 iterations before moving on to the next training example, you are going too far down the error gradient with respect to that single example, resulting in serious overfitting to a single data point. This is why you get poor results on the next input. I don't think this is a justifiable method of training, nor do I think it will work for time series.
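In other words, online training should be a loop like this, with exactly one update per example (gradient_step, model, and the other names here are placeholders for whatever your implementation provides):

    # true online training / SGD with mini-batch size 1:
    # a single gradient step per example, then move on
    for epoch in range(num_epochs):
        for x, y in training_data:
            gradient_step(model, x, y, learning_rate)   # one update, not 100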
You haven't mentioned what your activations are or your loss function is, so I can't comment on whether those are appropriate for the task.
Also, I don't think learning y=ax^2 is a good analogy for time series prediction. It is a static function that always gives the same output for a given input, regardless of the index of the input or the values of previous inputs.

Time Series Prediction using Recurrent Neural Networks

I am using a Bike Sharing dataset to predict the number of rentals in a day, given the input. I will use 2011 data to train and 2012 data to validate. I successfully built a linear regression model, but now I am trying to figure out how to predict time series by using Recurrent Neural Networks.
The data set has 10 attributes (such as month, working day or not, temperature, humidity, windspeed), all numerical; one attribute is the day of the week (Sunday: 0, Monday: 1, etc.).
I assume that one day can and probably will depend on previous days (and I will not need all 10 attributes), so I thought about using an RNN. I don't know much, but I read some material, including this. I am thinking about a structure like this.
I will have 10 input neurons, a hidden layer, and 1 output neuron. I don't know how to decide how many neurons the hidden layer should have.
I guess that I need a matrix to connect input layer to hidden layer, a matrix to connect hidden layer to output layer, and a matrix to connect hidden layers in neighbouring time-steps, t-1 to t, t to t+1. That's total of 3 matrices.
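To make that concrete, here is roughly what I picture in numpy (W_in, W_hh, and W_out are my names for the three matrices):

    import numpy as np

    n_in, n_hidden = 10, 16
    W_in  = np.random.randn(n_hidden, n_in) * 0.1      # input(t)    -> hidden(t)
    W_hh  = np.random.randn(n_hidden, n_hidden) * 0.1  # hidden(t-1) -> hidden(t)
    W_out = np.random.randn(1, n_hidden) * 0.1         # hidden(t)   -> output(t)

    def forward(sequence):             # sequence: list of length-10 input vectors
        h = np.zeros(n_hidden)
        outputs = []
        for x in sequence:
            h = np.tanh(W_in @ x + W_hh @ h)   # mix current input with previous state
            outputs.append(W_out @ h)          # linear output, so real-valued
        return outputs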
In one tutorial the activation function was sigmoid, although I'm not sure exactly; if I use the sigmoid function, I will only get outputs between 0 and 1. What should I use as the activation function? My plan is to repeat this n times:
For each training data:
Forward propagate
Propagate the input to the hidden layer, add it to the propagation from the previous hidden layer to the current hidden layer, and pass this through the activation function.
Propagate the hidden layer to output.
Find error and its derivative, store it in a list
Back propagate
Find current layers and errors from list
Find current hidden layer error
Store weight updates
Update the weights (matrices) by applying the stored updates scaled by the learning rate.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
It seems to be the correct way to do it if you just want to learn the basics. If you want to build a neural network for practical use, this is a very poor approach; as Marcin's comment says, almost everyone who constructs neural nets for practical use does so with packages that provide ready-made neural network implementations. Let me answer your questions one by one...
I don't know how to decide on how many neurons the hidden layer will have.
There is no golden rule for choosing the right architecture for your neural network. There are many empirical rules people have established out of experience, and the right number of neurons is found by trying out various combinations and comparing the output. A good starting point is 3/2 times the number of input plus output neurons, i.e. (10+1)*(3/2), so you could start with 15 or 16 neurons in the hidden layer and then reduce the number based on your output.
What should I use as activation function?
Again, there is no 'right' function. It totally depends on what suits your data. Additionally, there are many candidate functions, such as the hyperbolic tangent, the logistic function, RBFs, etc. A good starting point is the logistic function, but again you will only find the right function through trial and error.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
All activation functions (including the one assigned to the output neuron) will give you an output between 0 and 1, and you will have to use a multiplier to convert it to real values, or use some kind of encoding with multiple output neurons. Coding this manually will be complicated.
Another aspect to consider is your training iterations. Doing it 'n' times doesn't help. You need to find the optimal number of training iterations by trial and error as well, to avoid both under-fitting and over-fitting.
The correct way to do it is to use packages in Python or R, which allow you to train neural nets with a large amount of customization quickly, and to train and test multiple nets with different activation functions (and even different training algorithms) and network architectures without too much hassle. With some amount of trial and error, you will eventually find the net that gives you the desired output.
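For example, in Python something along these lines with Keras would let you swap architectures and activations quickly (purely illustrative; the data here is a random stand-in for your own):

    import numpy as np
    from tensorflow import keras

    # synthetic stand-in data: 10 features -> 1 real-valued target
    X_train, y_train = np.random.rand(500, 10), np.random.rand(500, 1)

    model = keras.Sequential([
        keras.layers.Dense(16, activation="tanh", input_shape=(10,)),
        keras.layers.Dense(1)              # linear output: real-valued predictions
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X_train, y_train, epochs=10, validation_split=0.2)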

Why do I get good accuracy with IRIS dataset with a single hidden node?

I have a minimal example of a neural network with a back-propagation trainer, testing it on the IRIS data set. I started off with 7 hidden nodes and it worked well.
I lowered the number of nodes in the hidden layer to 1 (expecting it to fail), but was surprised to see that the accuracy went up.
I set up the experiment in Azure ML, just to validate that it wasn't my code. Same thing there: 98.3333% accuracy with a single hidden node.
Can anyone explain to me what is happening here?
First, it has been well established that a variety of classification models yield incredibly good results on Iris (Iris is very predictable); see here, for example.
Secondly, we can observe that there are relatively few features in the Iris dataset. Moreover, if you look at the dataset description you can see that two of the features are very highly correlated with the class outcomes.
These correlation values are linear, single-feature correlations, which indicates that one can most likely apply a linear model and observe good results. Neural nets are highly nonlinear; they become more and more complex and capture greater and greater nonlinear feature combinations as the number of hidden nodes and hidden layers is increased.
Taking these facts into account, that (a) there are few features to begin with and (b) there are high linear correlations with the class, everything points to a less complex, linear function as the appropriate predictive model; by using a single hidden node, you are very nearly using a linear model.
It can also be noted that, in the absence of any hidden layer (i.e., just input and output nodes), and when the logistic transfer function is used, this is equivalent to logistic regression.
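To see that equivalence concretely, a small sketch using scikit-learn (for the three-class Iris data this is the softmax generalization of logistic regression):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # a "network" with no hidden layer and a logistic output unit computes
    # sigmoid(W @ x + b); fitting it with cross-entropy is exactly this model
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.score(X, y))   # training accuracy, typically high on Iris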
Just adding to DMlash's very good answer: The Iris data set can even be predicted with a very high accuracy (96%) by using just three simple rules on only one attribute:
If Petal.Width = (0.0976,0.791] then Species = setosa
If Petal.Width = (0.791,1.63] then Species = versicolor
If Petal.Width = (1.63,2.5] then Species = virginica
In general, neural networks are black boxes where you never really know what they are learning, but in this case reverse-engineering should be easy. It is conceivable that it learned something like the above.
The above rules were found by using the OneR package.
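If you want to reproduce that check in Python rather than with the OneR package, a quick sketch with scikit-learn, hard-coding the thresholds quoted above:

    from sklearn.datasets import load_iris

    iris = load_iris()
    petal_width = iris.data[:, 3]          # Petal.Width is the fourth feature
    # apply the three intervals above: setosa / versicolor / virginica = 0 / 1 / 2
    predicted = [0 if w <= 0.791 else 1 if w <= 1.63 else 2 for w in petal_width]
    accuracy = sum(p == t for p, t in zip(predicted, iris.target)) / len(iris.target)
    print(accuracy)                        # should come out around 0.96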

Neural network input with exponential decay

Often, to improve learning rates, inputs to a neural network are preprocessed by scaling and shifting to be between -1 and 1. I'm wondering though if that's a good idea with an input whose graph would be exponentially decaying. For instance, if I had an input with integer values 0 to 100 distributed with the majority of inputs being 0 and smaller values being more common than large values, with 99 being very rare.
It seems that scaling and shifting them wouldn't be ideal, since now the most common value would be -1. How is this type of input best dealt with?
Suppose you're using a sigmoid-style activation function that is symmetric around the origin, such as tanh.
The trick to speed up convergence is to have the mean of the normalized data set be 0 as well. The choice of activation function is important because you're not only learning weights from the input to the first hidden layer, i.e. normalizing the input is not enough: the input to the second hidden layer/output is learned as well, and thus needs to obey the same rule to be effective. In the case of non-input layers this is done by the activation function. The much-cited Efficient Backprop paper by LeCun summarizes these rules and has some nice explanations as well, which you should look up; there are other things, like weight and bias initialization, that one should consider too.
In chapter 4.3 he gives a formula to normalize the inputs so that the mean is close to 0 and the standard deviation is 1. If you need more sources, this is a great FAQ as well.
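In code, that kind of normalization is short, and for a heavily skewed count input like the one described, a log transform beforehand is a common trick (a numpy sketch; the log1p step is my suggestion, not from the paper):

    import numpy as np

    def standardize(X):
        # per-feature shift to mean 0 and scale to standard deviation 1
        return (X - X.mean(axis=0)) / X.std(axis=0)

    # for a skewed 0-100 count feature, compressing with log1p first
    # spreads out the small values before standardizing
    x = np.array([0, 0, 0, 1, 2, 5, 17, 99], dtype=float)
    x_scaled = standardize(np.log1p(x))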
I don't know your application scenario, but if you're using symbolic data and 0-100 is meant to represent percentages, then you could also apply softmax to the input layer to get better input representations. It's also worth noting that some people prefer scaling to [0.1, 0.9] instead of [0, 1].

Updates in Temporal Difference Learning

I read about Tesauro's TD-Gammon program and would love to implement it for tic tac toe, but almost all of the information is inaccessible to me as a high school student because I don't know the terminology.
The first equation here, http://www.stanford.edu/group/pdplab/pdphandbook/handbookch10.html#x26-1310009.2
gives the "general supervised learning paradigm." It says that the w_t on the left side of the equation is the parameter vector at time step t. What exactly does "time step" mean? Within the framework of a tic tac toe neural network designed to output the value of a board state, would the time step refer to the number of played pieces in a given game? For example, the board represented by the string "xoxoxoxox" would be at time step 9 and the board "xoxoxoxo " would be at time step 8? Or would the time step refer to the amount of time elapsed since training began?
Since w_t is the weight vector for a given time step, does this mean that every time step has its own evaluation function (neural network)? So to evaluate a board state with only one move, you would have to feed it into a different NN than you would feed a board state with two moves? I think I am misinterpreting something here because, as far as I know, Tesauro used only one NN for evaluating all board states (though it is hard to find reliable information about TD-Gammon).
How come the gradient of the output is taken with respect to w and not w_t?
Thanks in advance for clarifying these ideas. I would appreciate any advice regarding my project or suggestions for accessible reading material.
TD deals with learning within the framework of a Markov Decision Process. That is, you begin in some state s_t, perform an action a_t, receive a reward r_t, and end up in another state s_{t+1}. The initial state is called s_0. The subscript is called time.
A TD agent begins not knowing what rewards it will receive for which actions, or what other states those actions lead to. Its goal is to learn which actions maximize long-term total reward.
The state could represent the state of the tic-tac-toe board. So, s_0 in your case would be a clear board: "---------", s_1 might be "-o-------", s_2 might be "-o--x----", etc. The action might be the index of the cell to mark. With this representation, you would have 3^9 = 19683 possible states and 9 possible actions. After learning the game, you would have a table telling you which cell to mark on each possible board.
The same kind of direct representation would not work well for Backgammon, because there are far too many possible states. What TD-Gammon does is approximate states using a neural network. The weights of the network are treated as a state vector, and the reward is always 0, except upon winning.
The tricky part here is that the state the TD-Gammon algorithm learns is the state of the neural network used to evaluate board positions, not the state of the board. So, at t=0 you have not played a single game and the network is in its initial state. Every time you have to make a move, you use the network to choose the best possible one, and then update the network's weights based on whether the move led to victory or not. Before making the move the network has weights w_t; afterwards it has weights w_{t+1}. After playing hundreds of thousands of games, you learn weights that allow the neural network to evaluate board positions quite accurately.
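Schematically, one move's update then looks something like this (placeholder-level Python; network, the board states, alpha, and the gradient call are stand-ins, not TD-Gammon's actual code):

    # after the current network has been used to pick a move:
    v_old = network.evaluate(board_before_move)   # prediction under weights w_t
    v_new = network.evaluate(board_after_move)    # value of the successor state
    target = reward if game_over else v_new       # reward is 0 until the game ends
    delta = target - v_old                        # the temporal-difference error
    network.weights += alpha * delta * network.gradient(board_before_move)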
I hope this clears things up.
