Training neural network on a sequence of moves - machine-learning

I've been playing with neural networks out of personal curiosity and a wish to learn. So far, it's been a success. I've written the code for one from scratch, with backpropagation for training, and trained it to play tic-tac-toe (it outputs a 3x3 matrix, and the highest value in an open spot is the move played). It's based on this example, except that it's designed to allow multiple hidden layers.
The training data is a randomly generated mid-game situation, and to supervise the learning I hand-wrote an algorithm that ranks the possible moves to use as the "correct" answers. After training it a couple thousand times, it works pretty well at making the best move and can play a whole game pretty well (it favors opening in the center, whereas in a perfect game you would take a corner, but whatever).
Anyways, that's all well and good, but the training required me to create a specific algorithm to rank each move for any given play. That was easy enough because tic-tac-toe is very simple, but it isn't practical in general. My next milestone would be to train it based only on winning or losing the game. However, this requires it to remember a sequence of moves and then, after the game is finished, backpropagate the training not just through the neurons but through that sequence of moves. I have no idea where to start; even being pointed in the right direction would help.
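For readers following along, the move-selection rule described above (play the open spot with the highest output) might look roughly like this. It is only a sketch, not the asker's code: the name `pick_move`, the flat 9-element layout and the use of 0 for an empty square are all assumptions.

```python
def pick_move(outputs, board):
    """Return the index (0-8) of the open square with the highest network output.

    outputs: 9 floats, the flattened 3x3 network output (assumed layout).
    board:   9 ints, 0 = empty, 1 = X, -1 = O (assumed encoding).
    """
    open_squares = [i for i, cell in enumerate(board) if cell == 0]
    if not open_squares:
        raise ValueError("no open squares left")
    # Occupied squares are ignored even if their output value is larger.
    return max(open_squares, key=lambda i: outputs[i])


board = [1, 0, -1,
         0, 1, 0,
         0, 0, -1]
outputs = [0.9, 0.2, 0.8,
           0.1, 0.7, 0.3,
           0.6, 0.4, 0.5]
move = pick_move(outputs, board)
```

Here the largest outputs sit on occupied squares, so the rule falls back to the best open one.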

Related

What is wrong with my approach of using MLP to make a chess engine?

I’m making a chess engine using machine learning, and I’m having trouble debugging it. I need help figuring out what is wrong with my program, and I would appreciate any help.
I did my research and borrowed ideas from multiple successful projects. The idea is to use reinforcement learning to teach a NN to differentiate between strong and weak positions.
I collected 3 million games with Elo over 2000 and used my own method to label them. After studying hundreds of games, I concluded that it’s safe to assume that in the last 10 turns of any game the balance doesn’t change and the winning side has a strong advantage. So I picked positions from the last 10 turns and used two labels: 1 for a white win and 0 for a black win. I didn’t include any positions from drawn games. To avoid bias, I picked equal numbers of positions labeled as wins for each side, and equal numbers of positions with each side to move.
I represented each position as a vector of 773 elements: every piece on every square of the chessboard, together with castling rights and the side to move, is encoded with ones and zeros. My sequential model has an input layer with 773 neurons and an output layer with a single neuron. I used an MLP with three hidden layers of 1546, 500 and 50 units respectively, with 20% dropout on each. The hidden layers use the non-linear ReLU activation function, while the output layer uses a sigmoid. I used the binary cross-entropy loss function and the Adam optimizer with all default parameters except the learning rate, which I set to 0.0001.
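One way to arrive at 773 inputs is 12 piece types x 64 squares (768 bits), plus 4 castling rights, plus 1 bit for the side to move. The sketch below encodes a FEN string that way. The bit ordering, the FEN input and the name `encode_fen` are assumptions for illustration; the asker's actual layout is not shown in the question.

```python
PIECES = "PNBRQKpnbrqk"  # 6 white piece types followed by 6 black ones (assumed order)

START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


def encode_fen(fen):
    """Encode a FEN position as a list of 773 zeros and ones."""
    placement, turn, castling = fen.split()[:3]
    vec = [0] * 773
    square = 0  # runs from a8 to h1, following FEN order
    for ch in placement:
        if ch == "/":
            continue
        if ch.isdigit():
            square += int(ch)  # a digit is a run of empty squares
        else:
            vec[PIECES.index(ch) * 64 + square] = 1
            square += 1
    for i, flag in enumerate("KQkq"):  # bits 768-771: castling rights
        if flag in castling:
            vec[768 + i] = 1
    vec[772] = 1 if turn == "w" else 0  # bit 772: side to move
    return vec


start_vector = encode_fen(START_FEN)
```

Decoding a few vectors back into boards, as described in the debugging steps, is a good sanity check for whichever layout is actually used.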
I used 3 percent of the positions for validation. During the first 10 epochs, validation accuracy gradually went up from 90 to 92%, just one percent behind training accuracy. Further training led to overfitting, with training accuracy going up, and validation accuracy going down.
I tested the trained model on multiple positions by hand and got pretty bad results. Overall, the model can predict which side is winning if that side has more pieces or has pawns close to a promotion square. It also gives the side to move a small advantage (0.1). But overall it doesn’t make much sense. In most cases it heavily favors black (by ~0.3) and doesn’t properly take the setup into account. For instance, it evaluates the starting position as ~0.0001, as if black had an almost 100% chance of winning. Sometimes an irrelevant transformation of a position results in an unpredictable change in the evaluation. A position with one king and one queen on each side is usually viewed as lost for white (0.32), unless the black king is on a certain square, even though that doesn’t really change the balance on the chessboard.
What I did to debug the program:
To make sure I had not made any mistakes, I analyzed, step by step, how each position is recorded. Then I picked a dozen positions from the final numpy array, right before training, and converted them back to analyze them on a regular chessboard.
I used various numbers of positions from the same game (1 and 6) to make sure that using too many similar positions is not the cause of the fast overfitting. By the way, even one position per game in my database resulted in a data set of 3 million positions, which should be sufficient according to some research papers.
To make sure that the positions I use are not too simple, I analyzed them. 1.3 million of them had 36 points in pieces (knights, bishops, rooks, and queens; pawns were not included in the count), 1.4 million had 19 points, and only 0.3 million had fewer.
Some things you could try:
Add unit tests and asserts wherever possible. E.g. if you know that some value is never supposed to get negative, add an assert to check that this condition really holds.
Print shapes of all tensors to check that you have really created the architecture you intended.
Check if your model outperforms some simple baseline model.
You say your model overfits, so maybe simplify it / add regularization?
Check how your model performs on the simplest positions. E.g. can it recognize a checkmate?
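As an illustration of the baseline and assert suggestions above, here is a rough sketch of a material-count baseline. It assumes positions are available as FEN strings and uses the conventional 1/3/3/5/9 piece values; the function names are made up for this example.

```python
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}  # kings count as 0


def material_balance(fen):
    """White material minus black material, in conventional piece points."""
    balance = 0
    for ch in fen.split()[0]:
        value = PIECE_VALUES.get(ch.lower(), 0)  # digits, '/' and kings give 0
        balance += value if ch.isupper() else -value
    return balance


def baseline_predict(fen):
    """Predict 1.0 (white wins), 0.0 (black wins) or 0.5 (no opinion)."""
    balance = material_balance(fen)
    if balance > 0:
        return 1.0
    if balance < 0:
        return 0.0
    return 0.5


def baseline_accuracy(samples):
    """samples: list of (fen, label) pairs, label 1 = white win, 0 = black win."""
    assert samples, "need at least one labelled position"
    correct = 0
    for fen, label in samples:
        assert label in (0, 1), "labels are expected to be 0 or 1"
        if baseline_predict(fen) == label:
            correct += 1
    return correct / len(samples)


samples = [
    ("4k3/8/8/8/8/8/8/4KQ2 w - - 0 1", 1),
    ("3qk3/8/8/8/8/8/8/4K3 b - - 0 1", 0),
    ("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1", 0),
]
accuracy = baseline_accuracy(samples)
```

If the trained network does not clearly beat a rule like this on the validation set, the 92% accuracy probably reflects little more than material counting.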

Is data augmentation really needed in machine learning? [closed]

I am interested in knowing the importance of data augmentation (rotation at various angles, flipping the images) when providing a dataset for a machine learning problem.
Is it really needed? Or will the CNN handle that as well, no matter how the data are transformed?
So I took a classification task with 2 classes to draw some conclusions:
Arrow shapes
Circle shapes
The idea is to train on the shapes in only one orientation (I used arrows pointing right) and test the model on a different orientation (I used arrows pointing downwards) that is never shown during training.
Some of the samples used in Training
Some of the samples used in Testing
This is the entire dataset I am using to create a TensorFlow model.
https://bitbucket.org/akhileshmalviya/samples/src/bab50b85d826?at=master
I am puzzled by the results I got:
(i) Except for a few downward arrows, all the others are predicted correctly as arrows. Does this mean data augmentation is not needed at all?
(ii) Or is this even the right use case for understanding the importance of data augmentation?
Kindly share your thoughts. Any help would be much appreciated!
Data augmentation is a data-dependent process.
In general, you need it when your training data is complex and you have few samples.
A neural network can easily learn to extract simple patterns like arcs or straight lines, and these patterns are enough to classify your data.
In your case data augmentation can barely help: the features the network will learn to extract are simple and very different from each other.
When you instead have to deal with complex structures (cats, dogs, airplanes, ...), you can't rely on simple features like edges, arcs, etc.
Instead, you have to show your network that the instances you're trying to classify have a high variance and that the extracted features can be combined in a lot of different ways for the same subject.
Think about a cat: it can be of any color, the picture can be taken in different light conditions, its whole body can be in any position, the picture could be taken with a certain orientation...
To correctly classify such different instances, the network must learn to extract robust features, which can be learned only after seeing a lot of different inputs.
In your case, instead, simple features can completely discriminate your input, so any sort of data augmentation would help only a little.
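For reference, the kind of rotation augmentation the question is about is cheap to produce. The sketch below uses a tiny hand-made 5x5 bitmap as a stand-in for the arrow images (the real dataset is not reproduced here), and the helper names are hypothetical.

```python
def rotate_cw(image):
    """Rotate a 2D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_with_rotations(image, label):
    """Return the image in all four 90-degree orientations, each with the same label."""
    samples = []
    current = [list(row) for row in image]
    for _ in range(4):
        samples.append((current, label))
        current = rotate_cw(current)
    return samples


ARROW_RIGHT = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
]

arrow_down = rotate_cw(ARROW_RIGHT)  # the orientation used only at test time
augmented = augment_with_rotations(ARROW_RIGHT, "arrow")
```

Training on `augmented` instead of only `ARROW_RIGHT` would expose the network to the downward orientation directly, which is what augmentation buys you when the features are not as easy as they are here.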
The task you are solving can be easily solved without any NN and even without machine learning.
Because the problem is so simple, it does not really matter whether you do data augmentation or not. The need for data augmentation is task specific and depends on many things:
how easy it is to augment the data while preserving the ability to label the class correctly. For images and sounds, which we are used to seeing and hearing, it is not a problem (we know that adding small noise to a sound does not change its meaning, and a rotated lizard is still a lizard). For other things, augmenting while preserving the class/value is hard (for example in Go, randomly adding a stone can change the value of the position dramatically)
whether the augmented data is drawn from the same distribution you care about. Adding random stones in Go does not work, but rotating or flipping the board works and preserves the distribution. But in Racing Kings (a chess variant), for example, it will not help: flipping the position (left <-> right) leaves the evaluation the same, but the flipped position will never occur in a real game, so it is drawn from a different distribution and is useless
how much data you have and how expressive your model is. The more parameters your model has, the bigger the chance of overfitting and the greater your need for data. If you train a linear regression in n dimensions, you will have n + 1 parameters, so you do not really need to augment. Also, if you already have 10 billion data points, augmentation will probably not be helpful.
how expensive the augmentation procedure is. Rotating/scaling an image is very cheap, but other augmentations can be computationally expensive
something else that I forgot.
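To make the Go example from the list above concrete: rotating or mirroring a square board gives up to 8 equivalent positions that keep the same value label. A minimal sketch, with a 3x3 board standing in for a real 19x19 one and all names hypothetical:

```python
def rotate_cw(board):
    """Rotate a square board 90 degrees clockwise."""
    return [list(row) for row in zip(*board[::-1])]


def mirror(board):
    """Mirror a board left <-> right."""
    return [row[::-1] for row in board]


def symmetries(board):
    """Return the distinct images of a board under the 8 rotations/reflections."""
    seen = []
    current = [list(row) for row in board]
    for _ in range(4):
        for candidate in (current, mirror(current)):
            if candidate not in seen:  # symmetric boards yield fewer than 8 variants
                seen.append(candidate)
        current = rotate_cw(current)
    return seen


def augment_position(board, value):
    """Pair every symmetric variant with the unchanged value label."""
    return [(variant, value) for variant in symmetries(board)]


corner_stone = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
two_stones = [[1, 2, 0], [0, 0, 0], [0, 0, 0]]
augmented = augment_position(two_stones, 0.25)
```

This only makes sense when the transformed positions can actually occur in play, which is exactly the distribution caveat made above.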

Regarding the backward pass of a convolution layer in deep learning

I understand how to compute the forward pass in deep learning. Now I want to understand the backward pass. Let's take X(2,2) as an example. The backward pass at position X(2,2) can be computed as in the figure below.
My question is: where is dE/dY (such as dE/dY(1,1), dE/dY(1,2), ...) in the formula? How is it computed in the first iteration?
SHORT ANSWER
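Those terms are in the final expansion at the bottom of the slide; they contribute to the summation for dE/dX(2,2). In your first back-propagation, you start at the end and work backwards (hence the name): at the last layer, dE/dY comes directly from the loss function, which compares the network's output with the ground-truth labels, and every earlier layer receives its dE/dY from the layer after it. So much for computing them. :-)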
LONG ANSWER
I'll keep this in more abstract, natural-language terms. I'm hopeful that the alternate explanation will help you see the big picture as well as sort out the math.
You start the training with assigned weights that may or may not be at all related to the ground truth (labels). You move blindly forward, making predictions at each layer based on naive faith in those weights. The Y(i,j) values are the resulting meta-pixels from that faith.
Then you hit the labels at the end. You work backward, adjusting each weight. Note that, at the last layer, the Y values are compared directly with the ground-truth labels.
At each layer, you mathematically deal with two factors:
How far off was this prediction?
How heavily did this parameter contribute to that prediction?
You adjust the X-to-Y weight by "off * contribution * learning_rate".
When you complete that for layer N, you back up to layer N-1 and repeat.
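To show where the dE/dY terms enter, here is a small sketch of a single-channel "valid" convolution (cross-correlation, stride 1, no padding) and its backward pass with respect to the input. The 3x3 input, the 2x2 kernel and the squared-error loss are assumptions chosen for illustration and are not taken from the slide; indices are 0-based, so the slide's X(2,2) corresponds to `x[1][1]`.

```python
def conv2d_forward(x, w):
    """Y(i,j) = sum over (a,b) of X(i+a, j+b) * W(a,b)."""
    kh, kw = len(w), len(w[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * w[a][b] for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]


def conv2d_backward_input(d_y, w, x_shape):
    """dE/dX(m,n) = sum of dE/dY(i,j) * W(m-i, n-j) over every Y(i,j) that used X(m,n)."""
    d_x = [[0.0] * x_shape[1] for _ in range(x_shape[0])]
    for i in range(len(d_y)):
        for j in range(len(d_y[0])):
            for a in range(len(w)):
                for b in range(len(w[0])):
                    d_x[i + a][j + b] += d_y[i][j] * w[a][b]
    return d_x


def loss(x, w, target):
    """Assumed loss: E = 0.5 * sum of (Y - target)^2."""
    y = conv2d_forward(x, w)
    return 0.5 * sum((y[i][j] - target[i][j]) ** 2
                     for i in range(len(y)) for j in range(len(y[0])))


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, -1]]
target = [[0, 0], [0, 0]]

y = conv2d_forward(x, w)
# With the squared-error loss above, dE/dY is simply Y - target at this (last) layer.
d_y = [[y[i][j] - target[i][j] for j in range(2)] for i in range(2)]
d_x = conv2d_backward_input(d_y, w, (3, 3))
```

In a deeper network, `d_y` for this layer would be the `d_x` computed by the layer that follows it, which is the "work backwards" step described above.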
PROGRESSION
Whether you initialize your weights with fixed or random values (I generally recommend the latter), you'll notice that there's really not much progress in the early iterations. Since this is slow adjustment from guess-work weights, it takes several iterations to get a glimmer of useful learning into the last layers. The first layers are still cluelessly thrashing at this point. The loss function will bounce around close to its initial values for a while. For instance, with GoogLeNet's image recognition, this flailing lasts for about 30 epochs.
Then, finally, you get some valid learning in the latter layers: the patterns stabilize enough that some consistency percolates back to the early layers. At this point, you'll see the loss function drop to a "directed experimentation" level. From there, the progression depends a lot on the paradigm and texture of the problem: some have a sharp drop, then a gradual convergence; others have a more gradual drop, almost an exponential decay to convergence; more complex topologies have additional sharp drops as middle or early phases "get their footing".

How to train SVM for "Euro" coin recognition with OpenCV 3?

My xmas holiday project this year was to build a little Android app, which should be able to detect arbitrary Euro coins in a picture, recognize their value and sum the values up.
My assumptions/requirements for the picture, for good recognition, are
uniform background
picture should be roughly the size of a DIN A4 sheet of paper
coins may not overlap, but may touch each other
number-side of the coins must be up/visible
My initial thought was that, for the later coin-value recognition, it would be best to first detect the actual coins and their regions in the picture. Any recognition would then run only on those regions of the picture where actual coins are found.
So the first step was to find circles. I accomplished this using the following OpenCV 3 pipeline, as suggested in several books and SO postings:
convert to gray
CannyEdge detection
Gauss blurring
HoughCircle detection
filtering out inner/redundant circles
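The last step of that pipeline is plain geometry and does not depend on OpenCV. A possible sketch follows, assuming the detector returns circles as (x, y, r) tuples and that a circle counts as redundant when its centre lies inside a larger circle that was already kept. That criterion and the function name are guesses, since the question does not show the actual filter.

```python
import math


def filter_inner_circles(circles):
    """Keep only circles whose centre does not lie inside a larger kept circle.

    circles: list of (x, y, r) tuples, e.g. the rows returned by a Hough circle detector.
    """
    kept = []
    # Visit the largest circles first so that inner ones are compared against them.
    for x, y, r in sorted(circles, key=lambda c: c[2], reverse=True):
        inside_larger = any(math.hypot(x - kx, y - ky) < kr for kx, ky, kr in kept)
        if not inside_larger:
            kept.append((x, y, r))
    return kept


detected = [(100, 100, 40), (102, 98, 25), (200, 100, 40), (100, 100, 38)]
coins = filter_inner_circles(detected)
touching = filter_inner_circles([(0, 0, 40), (80, 0, 40)])
```

Two coins that merely touch have centres at least one radius apart, so both survive, which matches the requirement that coins may touch but not overlap.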
The detection works rather successfully IMHO; here is a picture of the result:
Coins detected with HoughCircles with blue border
Now on to recognizing every coin found!
I searched for solutions to this problem and came up with
template matching
feature detection
machine learning
Template matching seems very inappropriate for this problem, as the coins can be arbitrarily rotated with respect to a template coin (and the template matching algorithm is not rotation-invariant, so I would have to rotate the coins!).
Also, the pixels of the template coin will never exactly match those of the region of the previously detected coin. So any algorithm computing the similarity will produce only poor results, I think.
Then I looked into feature detection. This seemed more appropriate to me. I detected the features of a template coin and of the candidate-coin picture and drew the matches (a combination of ORB and BRUTEFORCE_HAMMING). Unfortunately, the features of the template coin were also detected in the wrong candidate coins.
See the following picture, where the template or "feature" coin, a 20 cent coin, is on the left. To the right are the candidate coins, of which the left-most is a 20 cent coin. I expected this coin to have the most matches; unfortunately it did not. So again, this does not seem to be a viable way to recognize the value of coins.
Feature-matches drawn between a template coin and candidate coins
So machine learning is the third possible solution. From university I still know about neural networks, how they work, etc. Unfortunately my practical knowledge is rather poor AND I don't know Support Vector Machines (SVM) at all, which is the machine learning method supported by OpenCV.
So my question is actually not source-code related, but rather about how to set up the learning process.
Should I learn on the plain coin images, or should I first extract features and learn on the features? (I think: features)
How many positives and negatives per coin should be given?
Would I also have to learn on rotated coins, or would this rotation be handled "automagically" by the SVM? So would the SVM recognize rotated coins, even if I only trained it on non-rotated coins?
One of my picture requirements above ("DIN A4") limits the size of the coin to a certain size, e.g. 1/12 of the picture height. Should I learn on coins of roughly the same size or of different sizes? I think that different sizes would result in different features, which would not help the learning process. What do you think?
Of course, if you have a different possible solution, this is also welcome!
Any help is appreciated! :-)
Bye & Thanks!
Answering your questions:
1 - Should I learn on the plain coin images, or should I first extract features and learn on the features? (I think: features)
For many object classification tasks it's better to extract the features first and then train a classifier using a learning algorithm (e.g. the features can be HOG and the learning algorithm can be something like SVM or AdaBoost). This is mainly because the features carry more meaningful information than the raw pixel values (they can describe edges, shapes, texture, etc.). However, algorithms like deep learning will extract the useful features as part of the learning procedure.
2 - How many positives and negatives per coin should be given?
You need to answer this question depending on the variation in the classes you want to recognize and the learning algorithm you use. For SVM, if you use HOG features and want to recognize specific numbers on coins, you won't need many.
3 - Would I also have to learn on rotated coins, or would this rotation be handled "automagically" by the SVM? So would the SVM recognize rotated coins, even if I only trained it on non-rotated coins?
Again, it depends on your final choice of features (not on the SVM, which is the learning algorithm). HOG features are not rotation invariant, but there are features like SIFT or SURF which are.
4 - One of my picture requirements above ("DIN A4") limits the size of the coin to a certain size, e.g. 1/12 of the picture height. Should I learn on coins of roughly the same size or of different sizes? I think that different sizes would result in different features, which would not help the learning process. What do you think?
Again, it depends on the algorithm you choose: some of them require a fixed or similar width/height ratio. You can find the specific requirements in the related papers.
If you decide to use SVM, take a look at this; and if you feel comfortable with neural networks, using TensorFlow is a good idea.
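The point in answer 3 about gradient-based features not being rotation invariant can be seen with a toy descriptor. The sketch below is a crude stand-in for HOG (one global histogram of unsigned gradient orientations, with no cells, blocks or normalisation), written only for illustration; it is not OpenCV's HOG implementation.

```python
import math


def orientation_histogram(image, bins=4):
    """Histogram of unsigned gradient orientations, weighted by gradient magnitude."""
    width = 180.0 / bins
    hist = [0.0] * bins
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Bins are centred on 0, 45, 90 and 135 degrees (for bins=4).
            index = int(((angle + width / 2) % 180.0) // width)
            hist[index] += magnitude
    return hist


def rotate_cw(image):
    """Rotate a 2D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


vertical_edge = [[0, 0, 1, 1] for _ in range(4)]
hist_original = orientation_histogram(vertical_edge)
hist_rotated = orientation_histogram(rotate_cw(vertical_edge))
```

The two histograms put their weight in different bins, so a classifier trained on one orientation sees a different feature vector for the rotated coin. That is why either rotated training samples or a rotation-invariant descriptor is needed.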

Weight updating and estimating training example values when playing checkers

I am reading the first chapter of Tom Mitchell's Machine Learning book.
What I want to do is write a program that plays checkers against itself and learns to win in the end. My question is about credit assignment for the non-terminal board positions it encounters. Maybe we can set the value using a linear combination of the position's features with random weights, but how do we then update it with the LMS rule? We don't have any training samples apart from the end states.
I am not sure whether I have stated my question clearly, although I tried to.
I haven't read that specific book, but my approach would be the following. Suppose that White wins. Then, every position White passed through should receive positive credit, while every position Black passed through should receive negative credit. If you iterate this reasoning, whenever you have a set of moves making up a game, you should add some amount of score to all positions from the victor and remove some amount of score from all positions from the loser. You do this for a bunch of computer vs. computer games.
You now have a data set made up of a bunch of checker positions and respective scores. You can now compute features over those positions and train your favorite regressor, such as LMS.
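A rough sketch of those two steps, in case it helps. It assumes each position has already been turned into a small feature tuple from the point of view of the player who reached it (here: own and opponent piece counts scaled to 0..1), and that every position of a game gets the same amount of credit. Both choices, and all function names, are illustrative assumptions.

```python
def label_game(winner_positions, loser_positions, credit=1.0):
    """Give +credit to every position of the winner and -credit to every position of the loser."""
    data = [(features, credit) for features in winner_positions]
    data += [(features, -credit) for features in loser_positions]
    return data


def predict(weights, features):
    """Linear evaluation: w0 + w1*x1 + ... + wn*xn."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def lms_update(weights, features, target, learning_rate):
    """One LMS step: w_i <- w_i + learning_rate * (target - prediction) * x_i."""
    error = target - predict(weights, features)
    weights[0] += learning_rate * error  # bias weight, its input is always 1
    for i, x in enumerate(features):
        weights[i + 1] += learning_rate * error * x
    return weights


def train(dataset, n_features, epochs=1000, learning_rate=0.05):
    """Repeatedly apply the LMS update over the whole data set."""
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for features, target in dataset:
            lms_update(weights, features, target, learning_rate)
    return weights


# Toy game: (own pieces, opponent pieces) as fractions of a full set.
winner = [(1.0, 1.0), (1.0, 0.8), (0.9, 0.6)]
loser = [(1.0, 1.0), (0.8, 1.0), (0.6, 0.9)]
dataset = label_game(winner, loser)
weights = train(dataset, n_features=2)
```

With real games the data set would contain positions from many games, so a position that appears on both the winning and the losing side simply averages out towards zero.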
An improvement of this approach would be to train the regressor, then make some more games where each move is randomly drawn according to the predicted score of that move (i.e. moves which lead to positions with higher scores have higher probability). Then you update those scores and re-train the regressor, etc.
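Drawing a move "according to the predicted score" needs some way to turn scores, which may be negative, into probabilities. One common choice is a softmax, used in the sketch below; that choice and the `temperature` parameter are assumptions on my part, since the answer does not name a specific scheme.

```python
import math
import random


def move_probabilities(scores, temperature=1.0):
    """Softmax over predicted scores: a higher score gives a higher probability."""
    top = max(scores)  # subtracting the maximum keeps exp() numerically safe
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_move(moves, scores, rng=random, temperature=1.0):
    """Draw one move at random, weighted by its softmax probability."""
    probabilities = move_probabilities(scores, temperature)
    return rng.choices(moves, weights=probabilities, k=1)[0]


moves = ["a", "b", "c"]          # hypothetical move identifiers
scores = [0.5, -0.2, 1.0]        # predicted scores of the resulting positions
probabilities = move_probabilities(scores)
chosen = sample_move(moves, scores, rng=random.Random(0))
```

A lower temperature makes play greedier; a higher one explores more, which matters early on when the regressor's scores are still unreliable.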

Resources