Tic tac toe machine learning - valid moves

I am toying around with machine learning, especially Q-Learning, where you have a state and actions and give rewards depending on how well the network did.
Now for starters I set myself a simple goal: Train a network so it emits valid moves for tic-tac-toe (vs a random opponent) as actions. My problem is that the network does not learn at all or even gets worse over time.
The first thing I did was get familiar with Torch and a deep Q-learning module for it: https://github.com/blakeMilner/DeepQLearning .
Then I wrote a simple tic-tac-toe game where a random player competes with the neural net and plugged this into the code from this sample https://github.com/blakeMilner/DeepQLearning/blob/master/test.lua . The output of the network consists of 9 nodes for setting the respective cell.
A move is valid if the network chooses an empty cell (no X or O in it). Accordingly, I give a positive reward (if the network chooses an empty cell) and a negative reward (if it chooses an occupied cell).
The problem is it never seems to learn. I tried lots of variations:
mapping the tic-tac-toe board as 9 inputs (0 = cell empty, 1 = player 1, 2 = player 2) or as 27 inputs, one [empty, player1, player2] triple per cell (e.g. an empty cell becomes [1, 0, 0]); see the encoding sketch below
varying the hidden node count between 10 and 60
training for up to 60k iterations
varying the learning rate between 0.001 and 0.1
giving negative rewards for failures or only rewards for successes, with different reward values
Nothing works :(
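
To make the encoding and reward concrete, here is a minimal Python sketch of the scheme described above (my actual code is Lua/Torch; the function names here are made up for illustration):

def encode_board_27(board):
    # map a 9-cell board (0 = empty, 1 = player 1, 2 = player 2) to 27 one-hot inputs
    inputs = []
    for cell in board:
        one_hot = [0, 0, 0]
        one_hot[cell] = 1           # e.g. an empty cell (0) becomes [1, 0, 0]
        inputs.extend(one_hot)
    return inputs                   # length 27

def move_reward(board, action):
    # +1 for choosing an empty cell, -1 for an occupied one (reward values are arbitrary)
    return 1.0 if board[action] == 0 else -1.0

board = [0, 1, 0,
         2, 0, 0,
         0, 0, 1]
print(encode_board_27(board))   # 27 zeros and ones
print(move_reward(board, 4))    # 1.0, cell 4 is empty
print(move_reward(board, 1))    # -1.0, cell 1 is occupied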
Now I have a couple of questions:
Since this is my very first attempt at Q-Learning, is there anything I am fundamentally doing wrong?
What parameters are worth changing? The "Brain" thing has a lot: https://github.com/blakeMilner/DeepQLearning/blob/master/deepqlearn.lua#L57 .
What would a good count for the number of hidden nodes be?
Is the simple network structure as defined at https://github.com/blakeMilner/DeepQLearning/blob/master/deepqlearn.lua#L116 too simple for this problem?
Am I just too impatient and need to train for many more iterations?
Thank you,
-Matthias

Matthias,
It seems you are using one output node? "The output of the network in the forward step is a number between 1 and 9". If so, then I believe this is the problem. Instead of having one output node, I would treat this as a classification problem and have nine output nodes, one corresponding to each board position. Then take the argmax of these nodes as the predicted move (see the sketch below). This is how networks that play the game of Go are set up (there are 361 output nodes, each representing an intersection on the board).
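For illustration, a minimal sketch of that output layout in Python/NumPy (not the DeepQLearning API; the numbers are made up):

import numpy as np

q_values = np.array([0.1, -0.3, 0.7, 0.0, 0.2, -0.1, 0.4, 0.05, -0.2])  # one score per cell, 9 output nodes
move = int(np.argmax(q_values))  # index 0..8 of the predicted move
print(move)                      # 2 for these example values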
Hope this helps!

Related

Q-Learning: Inaccurate predictions

I recently started getting into Q-Learning. I'm currently writing an agent that should play a simple board game (Othello).
I'm playing against a random opponent. However, it is not working properly: either my agent stays around a 50% win rate or it gets worse the longer I train.
The output of my neural net is an 8x8 matrix with Q-values for each move.
My reward is as follows:
1 for a win
-1 for a loss
an invalid move counts as a loss, so the reward is -1
otherwise I experimented with a direct reward of -0.01 or 0.
I've made some observations I can't explain:
When I consider the prediction for the first move (my agent always starts), the predictions for the invalid moves get really close to -1 really fast. The predictions for the valid moves, however, seem to rise the longer I play. They even went above 1 (they were above 3 at some point), which shouldn't be possible, since I'm updating my Q-values according to the Bellman equation (with alpha = 1):
Q(s, a) = r + gamma * max_a' Q(s', a')
where gamma is a parameter less than 1 (I use 0.99 most of the time). If a game lasts around 30 turns I would expect maximum/minimum values of about +-0.99^30 ≈ +-0.74.
Also the predictions for the starting state seem to be the highest.
Another observation I made is that the network seems to be too optimistic with its predictions. If I consider the prediction for a move after which the game will be lost, the predictions for the invalid moves are again close to -1. The valid move(s), however, often have predictions well above 0 (like 0.1 or even 0.5).
I'm somewhat lost, as I can't explain what could cause my problems, since I already double-checked my reward/target matrices. Any ideas?
I suspect your Bellman calculation (specifically, Q(s', a')) does not check for valid moves as the game progresses. That would explain why "the predictions for the invalid moves get really close to -1 really fast".
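A minimal sketch of what I mean, assuming the Q-values for the next state come as an 8x8 NumPy array and you can compute a boolean mask of legal moves (the names here are illustrative):

import numpy as np

def bellman_target(reward, q_next, legal_mask, gamma=0.99, terminal=False):
    # q_next: 8x8 array of predicted Q-values for the next state s'
    # legal_mask: 8x8 boolean array, True where a move is legal in s'
    if terminal or not legal_mask.any():
        return reward
    return reward + gamma * q_next[legal_mask].max()  # max only over valid moves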

Does the Izhikevich neuron model use weights?

I've been working a bit with neural networks and I'm interested in implementing a spiking neuron model.
I've read a fair number of tutorials, but most of them seem to be about generating pulses, and I haven't found any application of the model to a given input train.
Say, for example, I have the input train:
Input[0] = [0,0,0,1,0,0,1,1]
When it enters the Izhikevich neuron, is the input multiplied by a weight, or does the model only make use of the parameters a, b, c and d?
Izhikevich equations are:
v[n+1] = 0.04*v[n]^2 + 5*v[n] + 140 - u[n] + I
u[n+1] = a*(b*v[n] - u[n])
where v[n] is the membrane potential and u[n] is a general recovery variable.
Are there any texts on implementations of Izhikevich or similar spiking neuron models on a practical problem? I'm trying to understand how information is encoded in these models, but it looks different from what's done with standard second-generation neurons. The only tutorial I've found that deals with a spike train and a set of weights is [1], but I haven't seen the same with Izhikevich.
[1] https://msdn.microsoft.com/en-us/magazine/mt422587.aspx
The plain Izhikevich model by itself does not include weights.
The two equations you mentioned model the membrane potential (v[]) of a point neuron over time. To use weights, you could connect two or more such cells with synapses.
Each synapse could include some sort of spike-detection mechanism on the source (pre-synaptic) cell and a synaptic-current mechanism on the target (post-synaptic) cell side. That synaptic current could then be multiplied by a weight term and become part of the I term (in the 1st equation above) for the target cell.
As a very simple example of a two-cell network: at every time step, you could check whether the pre- cell's v is above (say) 0 mV. If so, inject (say) 0.01 pA * weightPrePost into the post- cell. weightPrePost would range from 0 to 1 and could be modified in response to things like firing rate, or Hebbian-like spike synchrony as in STDP.
With multiple synaptic currents going into a cell, you could devise various schemes for how to sum them. The simplest would be a plain sum; more complicated ones could include things like distance and dendrite diameters (i.e. simulated neuronal morphology).
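As a rough two-cell sketch of this idea in Python (the regular-spiking constants a, b, c, d follow Izhikevich's 2003 paper; the drive current, synaptic current and weight value are purely illustrative):

a, b, c, d = 0.02, 0.2, -65.0, 8.0   # regular-spiking parameters
dt = 0.5                             # ms, Euler step
weight_pre_post = 0.5                # 0..1, could be adapted by an STDP-like rule

def step(v, u, I):
    # one Euler step of the two Izhikevich equations, plus the spike reset
    v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
    u += dt * a * (b * v - u)
    spiked = v >= 30.0
    if spiked:                       # spike: reset v and bump the recovery variable
        v, u = c, u + d
    return v, u, spiked

v_pre, u_pre = c, b * c
v_post, u_post = c, b * c
pre_spiked = False
for t in range(2000):
    I_syn = 10.0 * weight_pre_post if pre_spiked else 0.0   # weighted synaptic current
    v_pre, u_pre, pre_spiked = step(v_pre, u_pre, 10.0)     # constant drive into the pre- cell
    v_post, u_post, _ = step(v_post, u_post, I_syn)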
This chapter is a nice introduction to other ways to model synapses: Modelling Synaptic Transmission.

Training a neural net to predict the winner of a card game

In a made up card game there are 2 players, each of which are dealt 5 cards (standard 52 card deck), after which some arbitrary function decides a winning player. The goal is to predict the outcome of the game, given the 5 cards that each player is holding. The training data could look something like this:
Player A Player B Winner
AsKs5d3h2d JcJd8d7h6s 1
7h5d8s9sTh 2c3c4cAhAs 0
6d6s6h6cQd AsKsQsJsTs 0
Where the 'Player' columns are 5 card hands, and the 'Winner' column is 1 when player A has won, and 0 when player A has lost.
There should be an indifference towards the order of the hands, such that after training, feeding the network mirrored input data like:
Player A Player B
2d3d6h7s9s TsTdJsQc3h
and
Player A Player B
TsTdJsQc3h 2d3d6h7s9s
will always predict opposite outcomes.
It should also be indifferent to the order of the cards within the hands themselves, such that AsKsQsJsTs is the same as JsTsAsKsQs, which is the same as JsQsTsAsKs etc.
What are some reasonable ways to structure a neural net and its training data to tackle such a problem?
You are going to need a network with 104 inputs (2 players * 52 cards).
The first 52 inputs correspond to player A, the next 52 correspond to player B.
Initialize all inputs to 0, then for each card each player has, set the corresponding input to 1.
For the output layer there are usually two options for binary classification. You can have one output neuron, and if the output of this neuron is greater than a certain threshold, player A wins, else player B wins. Or you can have two output neurons and just look at which one produces the highest output. Both generally work fine.
For training data, instead of something like "AsKs5d3h2d", you will need a one-hot encoding, something like "0001000001000000100100000100000000011001000000001001" (pretend there are 104 numbers, 10 of them are 1's and the rest are 0s). And for the output data you just need a 1 or a 0 corresponding to who won (in the case of having one output neuron).
This will make your network invariant to the order of the cards (all possible orders of a given hand will create the same input). And as for swapping player A's and B's hands and getting the opposite result, this is something that should come naturally to any well-trained network.
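A minimal Python sketch of this encoding (the helper names are made up; card strings follow the question's "AsKs5d3h2d" notation):

import numpy as np

RANKS = "23456789TJQKA"
SUITS = "shdc"

def card_index(card):
    # map a two-character card like 'As' or 'Td' to an index 0..51
    return RANKS.index(card[0]) * 4 + SUITS.index(card[1])

def encode_hands(hand_a, hand_b):
    # 104-length 0/1 vector: first 52 entries for player A, next 52 for player B
    x = np.zeros(104)
    for i in range(0, 10, 2):
        x[card_index(hand_a[i:i+2])] = 1.0
        x[52 + card_index(hand_b[i:i+2])] = 1.0
    return x

x = encode_hands("AsKs5d3h2d", "JcJd8d7h6s")
print(int(x.sum()))  # 10, exactly ten inputs are set to 1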
First, you should understand what a neural network (NN) is for before going ahead with this problem. An NN tries to find a complex relationship between input and output (here your input is the five cards each player holds and the output is the predicted class).
In this question the relationship between input and output can easily be formulated, i.e. you could easily write down a set of rules that declares the final winner.
Still, like any other problem, this one can also be dealt with using an NN. First you need to prepare your data.
There are 52 possible cards in total, so use 52 columns in the dataset. Each of these 52 columns can hold one of three categorical values: the card belongs to 'A', to 'B', or to nobody ('C'). The output is the winner.
Now you can train an NN on it.
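A minimal sketch of that 52-column layout in Python (the labels 'A', 'B', 'C' and the helper names are illustrative):

def encode_row(hand_a, hand_b, winner):
    cards_a = {hand_a[i:i+2] for i in range(0, 10, 2)}
    cards_b = {hand_b[i:i+2] for i in range(0, 10, 2)}
    row = {}
    for rank in "23456789TJQKA":
        for suit in "shdc":
            card = rank + suit
            row[card] = "A" if card in cards_a else ("B" if card in cards_b else "C")
    row["winner"] = winner     # the output column
    return row

print(encode_row("AsKs5d3h2d", "JcJd8d7h6s", 1)["As"])  # 'A'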

LSTM network learning

I have attempted to program my own LSTM (long short-term memory) neural network. I would like to verify that the basic functionality is working. I have implemented a backpropagation-through-time (BPTT) algorithm to train a single-cell network.
Should a single cell LSTM network be able to learn a simple sequence, or are more than one cells necessary? The network does not seem to be able to learn a simple sequence such as 1 0 0 0 1 0 0 0 1 0 0 0 1.
I am sending the sequence of 1's and 0's one by one, in order, into the network and feeding it forward. I record each output for the sequence.
After running the whole sequence through the LSTM cell, I feed the mean error signals back into the cell, saving the weight changes internal to the cell in a separate collection. After running all the errors through one by one and calculating the new weights after each error, I average the new weights together to get the final value for each weight in the cell.
Am I doing something wrong? I would very much appreciate any advice.
Thank you so much!
Having only one cell (one hidden unit) is not a good idea, even if you are just testing the correctness of your code. You should try around 50 even for such a simple problem. This paper, http://arxiv.org/pdf/1503.04069.pdf, gives very clear gradient rules for updating the parameters. Having said that, there is no need to implement your own LSTM even if your dataset and/or the problem you are working on is new. Picking an existing library (Theano, mxnet, Torch, etc.) and modifying from there is an easier way, I think, given that it is less error-prone and supports GPU computing, which is essential for training LSTMs within a reasonable amount of time.
I haven't tried 1 hidden unit before, but I am sure 2 or 3 hidden units will work for the sequence 0,1,0,1,0,1. More cells do not necessarily give a better result, and training difficulty also increases with the number of cells.
You said you averaged new weights together to get the new weight. Does that mean you run many training sessions and take the average of the trained weights?
There are many possible reasons why your LSTM did not work, even if you implemented it correctly. The weights are not easy to train by simple gradient descent.
Here are my suggestions for weight optimization (a minimal momentum sketch follows this list):
Use the momentum method for gradient descent.
Add some Gaussian noise to your training set to prevent overfitting.
Use adaptive learning rates for each unit.
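A minimal sketch of the momentum update (the constants and names are illustrative, not taken from any particular library):

import numpy as np

learning_rate = 0.1
momentum = 0.9

def momentum_step(weights, velocity, grad):
    # velocity accumulates past gradients; the weights move along the velocity
    velocity = momentum * velocity - learning_rate * grad
    weights = weights + velocity
    return weights, velocity

weights = np.random.randn(3, 3) * 0.1   # some weight matrix
velocity = np.zeros_like(weights)
grad = np.random.randn(3, 3)            # gradient of the loss w.r.t. the weights
weights, velocity = momentum_step(weights, velocity, grad)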
Maybe you can take a look at Coursera's Neural Networks course offered by the University of Toronto and discuss with people there.
Or you can take a look at other examples on GitHub. For instance :
https://github.com/JANNLab/JANNLab/tree/master/examples/de/jannlab/examples
The best way to test an LSTM implementation (after gradient checking) is to try it out on the toy memory problems described in the original LSTM paper itself.
The best one that I often use is the 'Addition Problem':
We give a sequence of tuples of the form (value, mask). The value is a real-valued scalar between 0 and 1. The mask is a binary value, either 0 or 1.
0.23, 0
0.65, 0
...
0.86, 0
0.13, 1
0.76, 0
...
0.34, 0
0.43, 0
0.12, 1
0.09, 0
..
0.83, 0 -> 0.125
In the entire sequence of such tuples (usually of length 100), only 2 tuples should have the mask set to 1; the rest should have the mask set to 0. The target at the final time step is the average of the two values for which the mask was 1. The outputs at all time steps other than the last one are ignored. The values and the positions of the mask are arbitrarily chosen. Thus, this simple task shows whether your implementation can actually remember things over long periods of time.
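A minimal sketch of how such a sequence could be generated in Python (the function and variable names are my own, not from the paper):

import numpy as np

def make_addition_example(length=100, rng=np.random.default_rng()):
    values = rng.uniform(0.0, 1.0, size=length)
    mask = np.zeros(length)
    i, j = rng.choice(length, size=2, replace=False)   # two arbitrary marked positions
    mask[[i, j]] = 1.0
    target = 0.5 * (values[i] + values[j])             # average of the two marked values
    inputs = np.stack([values, mask], axis=1)          # shape (length, 2): (value, mask) tuples
    return inputs, target

x, y = make_addition_example()
print(x.shape, y)   # (100, 2) and a scalar target for the final time step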

Neural Network Diverging instead of converging

I have implemented a neural network (using CUDA) with 2 layers (2 neurons per layer).
I'm trying to make it learn 2 simple quadratic polynomial functions using backpropagation.
But instead of converging, it is diverging (the output is becoming infinity).
Here are some more details about what I've tried:
I had set the initial weights to 0, but since it was diverging I have randomized the initial weights
I read that a neural network might diverge if the learning rate is too high so I reduced the learning rate to 0.000001
The two functions I am trying to get it to learn are: 3*i + 7*j + 9 and j*j + i*i + 24 (I am giving the layer i and j as input)
I had implemented it as a single layer previously and that could approximate the polynomial functions better
I am thinking of implementing momentum in this network but I'm not sure it would help it learn
I am using a linear (as in no) activation function
There is oscillation in the beginning but the output starts diverging the moment any of weights become greater than 1
I have checked and rechecked my code but there doesn't seem to be any kind of issue with it.
So here's my question: what is going wrong here?
Any pointers will be appreciated.
If the problem you are trying to solve is a classification problem, try a 3-layer network (3 is enough according to Kolmogorov). A connection from inputs A and B to hidden node C (C = A*wa + B*wb) represents a line in AB space; that line divides the correct and incorrect half-spaces. The connections from the hidden layer to the output put the hidden-layer values in correlation with each other, giving you the desired output.
Depending on your data, the error function may look like a hair comb, so implementing momentum should help. Keeping the learning rate at 1 proved optimal for me.
Your training sessions will get stuck in local minima every once in a while, so network training will consist of a few subsequent sessions. If a session exceeds the maximum number of iterations, or the amplitude is too high, or the error is obviously high, the session has failed; start another.
At the beginning of each session, reinitialize your weights with random values in (-0.5, +0.5).
It really helps to chart your error descent. You will get that "Aha!" factor.
The most common reason for neural network code to diverge is that the coder has forgotten to put the negative sign in the weight-update expression (a minimal sketch follows).
Another reason could be that there is a problem with the error expression used for calculating the gradients.
If neither of these holds, we would need to see the code in order to answer.
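A minimal sketch of the point about the sign (names are illustrative):

learning_rate = 0.01

def update_weight(w, dE_dw):
    return w - learning_rate * dE_dw    # note the minus sign: move against the gradient

# a diverging variant would be: w + learning_rate * dE_dw, which makes the error grow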
