Training a neural net to predict the winner of a card game - machine-learning

In a made-up card game there are 2 players, each of whom is dealt 5 cards (from a standard 52-card deck), after which some arbitrary function decides the winning player. The goal is to predict the outcome of the game, given the 5 cards that each player is holding. The training data could look something like this:
Player A     Player B     Winner
AsKs5d3h2d   JcJd8d7h6s   1
7h5d8s9sTh   2c3c4cAhAs   0
6d6s6h6cQd   AsKsQsJsTs   0
Where the 'Player' columns are 5 card hands, and the 'Winner' column is 1 when player A has won, and 0 when player A has lost.
There should be an indifference towards the order of the hands, such that after training, feeding the network mirrored input data like:
Player A     Player B
2d3d6h7s9s   TsTdJsQc3h
and
Player A     Player B
TsTdJsQc3h   2d3d6h7s9s
will always predict opposite outcomes.
It should also be indifferent to the order of the cards within the hands themselves, such that AsKsQsJsTs is the same as JsTsAsKsQs, which is the same as JsQsTsAsKs etc.
What are some reasonable ways to structure a neural net and its training data to tackle such a problem?

You are going to need a network with 104 inputs (2 players × 52 cards in the deck).
The first 52 inputs correspond to player A, the next 52 correspond to player B.
Initialize all inputs to 0, then for each card each player has, set the corresponding input to 1.
For the output layer there are usually two options for binary classification. You can have one output neuron, and if the output of this neuron is greater than a certain threshold, player A wins, else player B wins. Or you can have two output neurons and just look at which one produces the highest output. Both generally work fine.
For training data, instead of something like "AsKs5d3h2d", you will need a binary (multi-hot) encoding, something like "0001000001000000100100000100000000011001000000001001" (pretend there are 104 digits, 10 of them 1s and the rest 0s). For the output data you just need a 1 or a 0 corresponding to who won (in the case of a single output neuron).
This will make your network invariant to the order of the cards (all possible orderings of a given hand create the same input). As for swapping player A's and B's hands and getting the opposite result, a well-trained network should pick this up on its own; you can also encourage it by augmenting the training data with each example's hands swapped and the label flipped.
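For concreteness, here is a minimal sketch of that encoding in Python/NumPy (the card-string parsing and the helper names are my own assumptions, not part of the question):

import numpy as np

RANKS = "23456789TJQKA"
SUITS = "cdhs"

def card_index(card):
    # Map a two-character card like 'As' or 'Td' to an index in 0..51.
    return RANKS.index(card[0]) * 4 + SUITS.index(card[1])

def encode_game(hand_a, hand_b):
    # Turn two 10-character hand strings into a 104-dimensional binary input vector.
    x = np.zeros(104)
    for i in range(0, 10, 2):
        x[card_index(hand_a[i:i+2])] = 1.0        # player A: inputs 0..51
        x[52 + card_index(hand_b[i:i+2])] = 1.0   # player B: inputs 52..103
    return x

# First example row from the question: "AsKs5d3h2d JcJd8d7h6s 1"
x = encode_game("AsKs5d3h2d", "JcJd8d7h6s")
y = 1                       # player A won
assert x.sum() == 10        # exactly ten inputs are set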

First, you should understand what a neural network (NN) is useful for before going ahead with this problem. An NN tries to learn a complex relationship between input and output (here your input is the two five-card hands and the output is the predicted class).
In this question, however, the relationship between input and output can easily be formulated explicitly, i.e. you could write down a set of rules that directly declares the winner.
Still, like any other problem, this one can also be tackled with an NN. First you need to prepare your data.
There are 52 possible cards in total, so use 52 columns in your dataset. Each of these columns holds one of three categorical values: the card belongs to player 'A', to player 'B', or to nobody ('C'). The output column is the winner.
Now you can train an NN on this data.
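A minimal sketch of that 52-column encoding in Python/NumPy (the integer codes 0/1/2 for 'nobody'/'A'/'B' are my own choice and would normally be one-hot encoded before training):

import numpy as np

RANKS, SUITS = "23456789TJQKA", "cdhs"

def card_index(card):
    # 'As' -> 51, 'Td' -> 33, ... (an index in 0..51)
    return RANKS.index(card[0]) * 4 + SUITS.index(card[1])

def encode_row(hand_a, hand_b):
    # 52 categorical columns: 0 = nobody holds the card, 1 = player A, 2 = player B.
    row = np.zeros(52, dtype=int)
    for i in range(0, 10, 2):
        row[card_index(hand_a[i:i+2])] = 1
        row[card_index(hand_b[i:i+2])] = 2
    return row

encode_row("AsKs5d3h2d", "JcJd8d7h6s")   # first example row from the question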

Related

Is a small vocabulary for Neural Nets ok?

I am designing a neural network to try and generate music. The neural network would be a 2 layered LSTM (Long Short Term Memory).
I am hoping to encode the music into a many-hot format for training, i.e. a note is a 1 if it is playing and a 0 if it is not.
Here is an excerpt of what this data would look like:
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000011010100100001010000000000000000000000
There are 88 columns which represent 88 notes, and each row represents a new beat. The output will be at the character level.
I am just wondering since there are only 2 characters in the vocabulary, would the probability of a 0 being next always be higher than the probability of a 1 being next?
I know that for a large vocabulary a large training set is needed, but I only have a small vocabulary. I have 229 files, which correspond to about 50,000 lines of text. Is this enough to prevent the output being all 0s?
Also, would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?
Thanks in advance
A small vocabulary is fine as long as your dataset is not skewed overwhelmingly toward one of the "words".
As to "would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?", each timestep is represented as 88 characters. Each character is a feature of that timestep. Your LSTM should be outputting the next timestep, so you should have 88 nodes. Each node should output the probability of that node being present in that timestep.
Finally since you are building a Char-RNN I would strongly suggest using abc notation to represent your data. A song in ABC notation looks like this:
X:1
T:Speed the Plough
M:4/4
C:Trad.
K:G
|:GABc dedB|dedB dedB|c2ec B2dB|c2A2 A2BA|
GABc dedB|dedB dedB|c2ec B2dB|A2F2 G4:|
|:g2gf gdBd|g2f2 e2d2|c2ec B2dB|c2A2 A2df|
g2gf g2Bd|g2f2 e2d2|c2ec B2dB|A2F2 G4:|
This is perfect for Char-RNNs because it represents every song as a sequence of characters, and you can run conversions from MIDI to ABC and vice versa. All you have to do is train your model to predict the next character in this sequence instead of dealing with 88 output nodes.

Estimating both the category and the magnitude of output using neural networks

Let's say I want to estimate which courses a final-year student will take and which grades they will receive in those courses. We have data on previous students' courses and grades for each year (not just the final year) to train with. We also have the grades and courses from previous years for the students we want to estimate the results for. I want to use a recurrent neural network with long short-term memory to solve this problem. (I know this problem can be solved by regression, but I want to use a neural network specifically, to see whether the problem can be properly solved with one.)
The way I want to set up the output (label) space is to have a feature for each possible course a student can take, with a value between 0 and 1 in each entry: 0 if the student will not attend the course, and otherwise their mark (e.g. if the student attends class A and gets 57%, the label for class A is 0.57).
Am I setting the output space properly?
If yes, what optimization and activation functions I should use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want the network to be given the history of a student and then output one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking it, 1 for taking it) and also give the expected grade? Then the interpretation of the output for a single course would be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise you to rethink it.
Some perfectly realistic cases do not fit into this pattern. For example, how would you represent that an (A+) student is unlikely to take a course? Should the network output 0.9999, because (s)he is very likely to get the maximum number of points if (s)he takes the course, OR should it output 0.0001, because the student is very unlikely to take the course?
Instead, you should output two values between [0,1] for each student and each course.
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As loss, I'd propose something like binary cross-entropy on the first value, and simple square error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, square and add for p=2).
A few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
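A minimal sketch of that combined loss (TensorFlow/Keras assumed; masking the grade error by the participation target is an extra assumption of mine, not stated above):

import tensorflow as tf

def course_loss(y_true, y_pred):
    # Shapes: (batch, n_courses, 2); [..., 0] = participation, [..., 1] = relative grade.
    p_true, g_true = y_true[..., 0], y_true[..., 1]
    p_pred, g_pred = y_pred[..., 0], y_pred[..., 1]

    eps = 1e-7
    p_pred = tf.clip_by_value(p_pred, eps, 1.0 - eps)

    # Binary cross-entropy on the participation value.
    bce = -(p_true * tf.math.log(p_pred) + (1.0 - p_true) * tf.math.log(1.0 - p_pred))

    # Squared error on the grade, counted only for courses the student actually takes.
    se = p_true * tf.square(g_true - g_pred)

    # Combine with p = 1: simply add everything up.
    return tf.reduce_mean(bce + se)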
There is more than one way to solve this problem. Andrey's answer gives one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
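A minimal sketch of that architecture (Keras assumed; n is set arbitrarily, and the hidden width follows the 2n suggestion above):

from tensorflow import keras
from tensorflow.keras import layers

n_courses, n_cats = 40, 6     # n classes, results ['A', 'B', 'C', 'D', 'F', 'did_not_take']

inputs = keras.Input(shape=(n_courses, n_cats))
x = layers.Flatten()(inputs)
x = layers.Dense(2 * n_courses, activation="relu")(x)
x = layers.Dense(2 * n_courses, activation="relu")(x)
x = layers.Dense(n_courses * n_cats)(x)
x = layers.Reshape((n_courses, n_cats))(x)     # one group of 6 outputs per course
outputs = layers.Softmax(axis=-1)(x)           # softmax within each group

model = keras.Model(inputs, outputs)
# Categorical cross-entropy is computed per course and averaged into a single loss value.
model.compile(optimizer="adam", loss="categorical_crossentropy")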
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for the input data only, so the two networks take the same input data, but for the grade-regression network the last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding to the predicted grade for each class.
If you want to go even further, consider using an architecture that considers the order in which students took previous classes. For example if a student took French I the previous year, it is more likely he/she will take French II this year than if he/she took French Freshman year and did not continue with French after that.

How to train a neural network in forward manner and using it in backward manner

I have a neural network with an input layer of 10 nodes, some hidden layers, and an output layer with only 1 node. I put a pattern on the input layer and, after some processing, the output neuron gives a value, which is a number from 1 to 10. After training, this model is able to produce the output given an input pattern.
Now, my question is whether it is possible to compute the inverse model: I provide a number on the output side (i.e. use the output side as input) and get back a pattern on those 10 input neurons (i.e. use the input side as output).
I want to do this because I will first train a network on the difficulty of patterns (the input is the pattern and the output is how difficult the pattern is to understand). Then I want to feed the network a difficulty value so that it creates random patterns of that difficulty.
I hope I understood your problem correctly, so I will summarize it in my own words: You have a given model, and want to determine the input which yields a given output.
Assuming that is correct, there is at least one way I know of to do this approximately. It is very easy to implement but might take a while to compute a value; there are probably better ways, but I am not aware of any (I needed this technique a few weeks ago for a reinforcement-learning problem and did not find anything better). Let's assume your model M maps an input x to an output y = M(x). We now create a new model, call it M⁻¹, which will approximate the inverse of M, i.e. it gives you an input that yields a specific output. To construct M⁻¹, create a new model consisting of one plain Dense layer whose output has the same dimension m as M's input, and connect this layer to the input of M. Next, make all weights of M non-trainable (this is very important!).
Now we are set up to find an inverse value. Suppose you want to find an input corresponding to the output y ("corresponding" means it creates that output, but it is not necessarily unique). Create a constant unit input v = 1 and form the input-output pair (v, y). Then use any optimizer you wish to train the network on this single pair until the error converges to zero. Once that has happened, you can compute the actual input which gives the output y as follows: if the weights of the new input layer are w and its bias is b, the desired input is u = w·1 + b (where 1 denotes the unit input).
You might ask why this equation holds, so let me try to answer it: the model tries to learn the weights of the new input layer so that the unit input produces the given output. Since only the newly added input layer is trainable, only its weights change; therefore each weight in this vector represents the corresponding component of the desired input vector. By using an optimizer to minimize the L² distance between the wanted output and the output of our inverse model M⁻¹, we finally obtain a set of weights that gives a good approximation of the input vector.
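A minimal sketch of this procedure (Keras assumed; the function and layer names are mine):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def invert(model, target_y, input_dim, epochs=2000):
    # Approximately find an input u with model(u) close to target_y.
    model.trainable = False                              # freeze the original model (important!)

    one = keras.Input(shape=(1,))                        # the constant unit input v = 1
    new_in = layers.Dense(input_dim, name="recovered")   # its weights will become u
    inverse = keras.Model(one, model(new_in(one)))
    inverse.compile(optimizer="adam", loss="mse")

    v = np.ones((1, 1))
    y = np.array([[target_y]], dtype="float32")
    inverse.fit(v, y, epochs=epochs, verbose=0)          # train until the error is small

    w, b = inverse.get_layer("recovered").get_weights()
    return w[0] + b                                      # u = w*1 + b

For the question above, invert(trained_model, target_y=7.0, input_dim=10) would give a 10-dimensional pattern whose predicted difficulty is close to 7.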

Association Rule - Non-Binary Items

I have studied association rules and know how to implement the algorithm on the classic basket of goods problem, such as:
Transaction ID   Potatoes   Eggs   Milk
A                1          0      1
B                0          1      1
In this problem each item has a binary identifier. 1 indicates the basket contains the good, 0 indicates it does not.
But what would be the best way to model a basket which can contain many of the same good? E.g., take the below, very unrealistic example.
Transaction ID   Potatoes   Eggs   Milk
A                5          0      178
B                0          35     7
Using binary indicators in this case would obviously lose a lot of information, and I am seeking a model which takes into account not only the presence of items in the basket but also the frequency with which they occur.
What would be a suitable algorithm for this problem?
In my actual data there are over one hundred items and, based on the profile of a user's basket, I would like to calculate the probabilities of the customer consuming the other available items.
An alternative is to use binary indicators but constructing them in a more clever way.
The idea is to set the indicator only when an amount is above some central value, i.e. when it is actually significant. If everyone buys 3 loaves of bread on average, does it make sense to flag someone as a "bread lover" for buying two or three?
The central value can be a plain arithmetic mean, a mean with outliers removed, or the median.
Instead of:
binarize(x) = 0 if x = 0
1 otherwise
you can use
binarize*(x) = 0 if x <= central(X)
1 otherwise
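A minimal sketch of that thresholded binarization in Python/NumPy (the median is used as the central value, and the two-basket matrix is just the toy example above):

import numpy as np

# Rows are transactions, columns are item quantities (Potatoes, Eggs, Milk).
baskets = np.array([[5, 0, 178],
                    [0, 35, 7]])

central = np.median(baskets, axis=0)        # per-item central value over all transactions
binary = (baskets > central).astype(int)    # binarize*(x): 1 only when above the central value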
If you really want probabilities, I think the way to go is to encode your data in a probabilistic way. Bayesian or Markov networks might be a feasible approach. However, without a reasonable structure this becomes computationally extremely expensive; for three item types it still seems feasible.
If you have many more item types, I would try a neural-network autoencoder. If there is some dependency in the data, it will discover it.
For the above example you could use a network with three input, two hidden and three output neurons.
A slightly fancier option is to use 3 fully connected layers with dropout in the middle layer.
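A minimal sketch of such an autoencoder (Keras; three inputs, two hidden neurons and three outputs as suggested, with quantities assumed to be scaled to [0, 1]):

from tensorflow import keras
from tensorflow.keras import layers

autoencoder = keras.Sequential([
    layers.Input(shape=(3,)),               # three item types
    layers.Dense(2, activation="relu"),     # two hidden neurons (the bottleneck)
    layers.Dense(3, activation="sigmoid"),  # reconstruct the three scaled quantities
])
autoencoder.compile(optimizer="adam", loss="mse")
# Train on baskets scaled to [0, 1]; the reconstruction for a given basket profile can
# then be read as a score for how strongly the customer consumes each item.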

Tic tac toe machine learning - valid moves

I am toying around with machine learning, especially Q-learning, where you have states and actions and give rewards depending on how well the network did.
Now for starters I set myself a simple goal: Train a network so it emits valid moves for tic-tac-toe (vs a random opponent) as actions. My problem is that the network does not learn at all or even gets worse over time.
The first thing I did was get familiar with Torch and a deep Q-learning module for it: https://github.com/blakeMilner/DeepQLearning .
Then I wrote a simple tic-tac-toe game where a random player competes with the neural net and plugged this into the code from this sample https://github.com/blakeMilner/DeepQLearning/blob/master/test.lua . The output of the network consists of 9 nodes for setting the respective cell.
A move is valid if the network chooses an empty cell (no X or O in it). Accordingly, I give a positive reward if the network chooses an empty cell and a negative reward if it chooses an occupied cell.
The problem is it never seems to learn. I tried lots of variations:
mapping the tic-tac-toe field as 9 inputs (0 = cell empty, 1 = player 1, 2 = player 2) or as 27 inputs (e.g. for an empty cell 0 [empty = 1, player1 = 0, player2 = 0])
vary the hidden nodes count between 10 and 60
tried up to 60k iterations
varying learning rate between 0.001 and 0.1
giving negative rewards for fails or only rewards for success, different reward values
Nothing works :(
Now I have a couple of questions:
Since this is my very first attempt at Q-Learning is there anything I am fundamentally doing wrong?
What parameters are worth changing? The "Brain" thing has a lot: https://github.com/blakeMilner/DeepQLearning/blob/master/deepqlearn.lua#L57 .
What would a good count for the number of hidden nodes be?
Is the simple network structure as defined at https://github.com/blakeMilner/DeepQLearning/blob/master/deepqlearn.lua#L116 too simple for this problem?
Am I just too impatient and have to train much more iterations?
Thank you,
-Matthias
Matthias,
It seems you are using one output node? "The output of the network in the forward step is a number between 1 and 9". If so, then I believe this is the problem. Instead of having one output node I would treat this as a classification problem and have nine output nodes corresponding to each board position. Then take the argmax of these nodes as the predicted move. This is how networks that play the game of Go are setup (there are 361 output nodes each representing an intersection on the board).
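For concreteness, a minimal sketch of that nine-output setup (written with Keras here rather than the Torch module linked in the question; the 27-input encoding is the one you already tried, and the hidden width is arbitrary):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# 27 inputs: for each of the 9 cells a one-hot triple [empty, player1, player2].
q_net = keras.Sequential([
    layers.Input(shape=(27,)),
    layers.Dense(60, activation="relu"),
    layers.Dense(9),                        # one value (Q-value/logit) per board cell
])

def pick_move(board_27):
    # Choose the cell whose output node has the highest value.
    q = q_net.predict(board_27[None, :], verbose=0)[0]
    return int(np.argmax(q))                # cell index 0..8 of the predicted move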
Hope this helps!
