I am trying to train a CNN to play a game the way I do. The pixel input should be mapped to an output that says whether or not to press:
q
w
e
r
d
f
right click
left click
hover the mouse over a certain pixel location
or a combination of any of the above simultaneously, sometimes holding down some buttons while pressing others.
What is the best way to go about capturing this training data? I am mostly concerned with what the output vector that I map the pixel input to might look like.
Any advice is really appreciated! Thanks.
Your question seems to relate to both "Reinforcement learning" and CNNs.
This question is very broad.
To get started, I suggest looking at this tutorial on playing Doom given only raw pixels as input.
I have the task of measuring lanes or stripes on widely varying, often noisy and underexposed images using C++. An example of the input image and of what should be measured is below:
An example of what should be measured on images provided
I've tried a couple of approaches using OpenCV so far. The first one basically consisted of the following steps:
Filtering, background subtraction -> adaptiveThreshold -> thinning -> HoughLinesP -> and then filtering and merging of lines.
Please see the illustration below:
The first attempt result
The second approach consisted of searching for the beginnings of short stripes with SURF and then moving left and up along the long lines.
Please see the illustration below; note that SURF was run on the original halftone image:
The second attempt result
The third approach I've tried: doing the Fourier transform for frames - image fragments (a 4-dimensional matrix is obtained), then finding basic patterns using PCA. Got this result below:
The third attempt result
I'm not sure what to do with that PCA output. I have tried selecting lines with adaptiveThreshold on the original image, then training a multilayer perceptron on this threshold and the PCA result so that it would yield a "refined" threshold. I attempted to select parameters that give a cleaned-up threshold for further processing. It works occasionally, but the result is very unstable.
Unfortunately, all the approaches above work only with a few selected "good" images.
I presume that an ML approach would be the way to go. Unfortunately, I have only a few images for training.
With the ML approach, a piece of advice would still be appreciated, at least to get started: it looks like this falls under segmentation tasks. If I follow this route, do I need to select the whole area containing the measured segments and then split it using some other approach? Or is it possible/feasible to detect the measured segments separately in one pass?
I would greatly appreciate any suggestions on moving forward to solving this task.
Some test source images can be found here: github.com/aliakseis/detect-lines/tree/master/images
Please find an update here: https://github.com/aliakseis/detect-lines Any suggestions would be highly appreciated.
I am trying to learn about convolutional neural nets and I was watching this video. https://www.youtube.com/watch?v=FmpDIaiMIeA
I was under the impression that when a filter was convolved around an input, the output at each point represented how closely the feature matched the input.
However, in this video at 6:56 an example is shown where 7/9 pixels match (~78%). The output in the video is 55%, which matches the element-wise product method used but is nowhere near the 78% I expected.
Also, if a filter was looking for a location in an input where each pixel was 0, then using an element-wise product in the conv layer would be of no use. Every pixel would be multiplied by 0, so there would be no way to tell where the pattern occurred.
If anyone can tell me what I am missing that would be great!
Thanks in advance for your help.
First, 7 pixels match and 2 are opposite, so you get (7-2)/9, which is about 55%. It is a measure of how closely the feature matches the input, but it is not simply the number of matches divided by the number of comparisons (for example, in the video the data can be negative).
I'm not sure about your second question. The output of a convolutional layer is a set of feature maps. Each feature map is the result of convolving the previous layer with a specific filter, so each map of the output looks at more than just one location.
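The arithmetic can be checked with a small sketch. The 3x3 values below are made up for illustration (they are not copied from the video); the only assumption is that pixels are coded as +1 and -1, so a match contributes +1 and a mismatch contributes -1.

```python
def match_score(patch, kernel):
    """Average of the element-wise products of two equally sized grids."""
    total = sum(p * k
                for patch_row, kernel_row in zip(patch, kernel)
                for p, k in zip(patch_row, kernel_row))
    count = sum(len(row) for row in kernel)
    return total / count


# Hypothetical diagonal-line filter with +1 / -1 pixels.
kernel = [[ 1, -1, -1],
          [-1,  1, -1],
          [-1, -1,  1]]

# Same pattern with two pixels flipped: 7 matches, 2 opposites.
patch = [[ 1, -1,  1],
         [-1,  1, -1],
         [ 1, -1,  1]]

score = match_score(patch, kernel)  # (7 - 2) / 9, about 0.556
```

With this coding a dark pixel is -1 rather than 0, so it still contributes to the sum instead of being multiplied away.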
Still, almost always you have irrelevant parts of the input.
My xmas holiday project this year was to build a little Android app, which should be able to detect arbitrary Euro coins in a picture, recognize their value and sum the values up.
My assumptions/requirements for the picture for a good recognition are
uniform background
picture should be roughly the size of a DinA4 paper
coins may not overlap, but may touch each other
number-side of the coins must be up/visible
My initial thought was that for the later coin-value recognition it would be best to first detect the actual coins/their regions in the picture. Any recognition would then run only on those regions of the picture where actual coins are found.
So the first step was to find circles. I accomplished this using the following OpenCV 3 pipeline, as suggested in several books and SO postings:
convert to gray
CannyEdge detection
Gauss blurring
HoughCircle detection
filtering out inner/redundant circles
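The last step of this pipeline can be sketched in plain Python. This is a minimal illustration under my own assumptions, not the code actually used: circles are `(x, y, r)` tuples such as those returned by a Hough circle detector, and a circle counts as "inner" when its centre lies inside a larger circle that has already been kept. The function name `drop_inner_circles` is hypothetical.

```python
import math


def drop_inner_circles(circles):
    """Keep a circle only if its centre is not inside a larger kept circle."""
    kept = []
    # Visit the largest circles first so they are the ones that survive.
    for x, y, r in sorted(circles, key=lambda c: c[2], reverse=True):
        if all(math.hypot(x - kx, y - ky) >= kr for kx, ky, kr in kept):
            kept.append((x, y, r))
    return kept


detected = [(100, 100, 50), (105, 98, 20), (200, 100, 50)]
coins = drop_inner_circles(detected)
```

Coins that merely touch are both kept, because the distance between their centres is at least the larger radius.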
The detection works rather well IMHO; here is a picture of the result:
Coins detected with HoughCircles with blue border
Now on to recognizing the value of every coin found!
I searched for solutions to this problem and came up with
template matching
feature detection
machine learning
Template matching seems very inappropriate for this problem, as the coins can be arbitrarily rotated with respect to a template coin (and the template matching algorithm is not rotation-invariant, so I would have to rotate the coins!).
Also, the pixels of the template coin will never exactly match those of the region of the previously detected coin. So any algorithm computing the similarity will produce only poor results, I think.
Then I looked into feature detection. This seemed more appropriate to me. I detected the features of a template coin and of the candidate-coin picture and drew the matches (a combination of ORB and BRUTEFORCE_HAMMING). Unfortunately, the features of the template coin were also detected in the wrong candidate coins.
See the following picture, where the template or "feature" coin, a 20 Cents coin, is on the left. To the right are the candidate coins, of which the left-most is a 20 Cents coin. I expected this coin to have the most matches, but unfortunately it does not. So again, this does not seem to be a viable way to recognize the value of coins.
Feature-matches drawn between a template coin and candidate coins
So machine learning is the third possible solution. From university I still know about neural networks, how they work, etc. Unfortunately, my practical knowledge is rather poor AND I don't know Support Vector Machines (SVM) at all, which is the machine learning method supported by OpenCV.
So my question is actually not source-code related, but more about how to set up the learning process.
Should I learn on the plain coin images or should I first extract features and learn on the features? (I think: features)
How many positives and negatives per coin should be given?
Would I also have to learn on rotated coins, or would this rotation be handled "automagically" by the SVM? So would the SVM recognize rotated coins, even if I only trained it on non-rotated coins?
One of my picture requirements above ("DinA4") limits the size of the coin to a certain size, e.g. 1/12 of the picture height. Should I learn on coins of roughly the same size or of different sizes? I think that different sizes would result in different features, which would not help the learning process. What do you think?
Of course, if you have a different possible solution, this is also welcome!
Any help is appreciated! :-)
Bye & Thanks!
Answering your questions:
1 - Should I learn on the plain coin images or should I first extract features and learn on the features? (I think: features)
For many object classification tasks it's better to extract the features first and then train a classifier using a learning algorithm (e.g. the features can be HOG and the learning algorithm can be something like SVM or AdaBoost). This is mainly because the features carry more meaningful information than the raw pixel values (they can describe edges, shapes, texture, etc.). However, algorithms such as deep learning extract the useful features as part of the learning procedure.
2 - How many positives and negatives per coin should be given?
The answer depends on the variation within the classes you want to recognize and on the learning algorithm you use. For SVM, if you use HOG features and want to recognize specific numbers on coins, you won't need many.
3 - Would I also have to learn on rotated coins, or would this rotation be handled "automagically" by the SVM? So would the SVM recognize rotated coins, even if I only trained it on non-rotated coins?
Again, it depends on the features you end up choosing (not on the SVM, which is only the learning algorithm). HOG features are not rotation invariant, but there are features like SIFT or SURF which are.
4 - One of my picture requirements above ("DinA4") limits the size of the coin to a certain size, e.g. 1/12 of the picture height. Should I learn on coins of roughly the same size or of different sizes? I think that different sizes would result in different features, which would not help the learning process. What do you think?
Again, it depends on the algorithm you choose; some of them require a fixed or similar width/height ratio. You can find the specific requirements in the related papers.
If you decide to use SVM, take a look at this; and if you are comfortable with neural networks, using TensorFlow is a good idea.
I've been playing with neural networks just out of personal curiosity and wanting to learn. So far, it's been a success. I've written one from scratch, with backpropagation for training, and trained it to play tic-tac-toe (it outputs a 3x3 matrix, and the open spot with the highest value is played). It's based on this example, except designed to allow multiple hidden layers.
The training data is a randomly generated mid-game situation, and to supervise the learning I wrote by hand an algorithm for ranking the possible moves to use as the "correct" answers. After training it a couple thousand times, it works pretty well at making the best move and can play a whole game pretty well (it favors opening in the center, whereas in a perfect game you take a corner, but whatever).
Anyway, that's all well and good, but the training required me to create a specific algorithm to rank each move for any given position. That was easy enough because tic-tac-toe is very simple, but it isn't practical in general. My next milestone would be to train it based only on winning or losing the game. However, this requires it to remember a sequence of moves and then, after the game is finished, backpropagate the training not just through the neurons but through that sequence of moves. I just have no idea where to start; even being pointed in the right direction would help.
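One way to picture "training from the result only" is to record every move of a game and, once the game is over, hand each recorded move a target derived from the final outcome, with earlier moves getting a weaker signal. The sketch below is just that idea in its simplest form, borrowed from the discounted returns used in reinforcement learning; the function name and the discount value are assumptions, not an established recipe for this network.

```python
def outcome_targets(num_moves, result, discount=0.9):
    """Spread the final result back over the moves of one finished game.

    result: +1 for a win, -1 for a loss, 0 for a draw.
    Returns one target per move, in the order the moves were played;
    the last move receives the full result, earlier moves a discounted one.
    """
    return [result * discount ** (num_moves - 1 - i) for i in range(num_moves)]


# A won game in which the network made three moves.
targets = outcome_targets(3, 1.0, discount=0.5)  # [0.25, 0.5, 1.0]
```

Each (board, move) pair stored during the game could then be trained towards its target with ordinary backpropagation.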
I'm designing a feed forward neural network learning how to play the game checkers.
For the input, the board has to be given and the output should give the probability of winning versus losing. But what is the ideal transformation of the checkers board to a row of numbers for input? There are 32 possible squares and 5 different possibilities (king or piece of white or black player and free position) on each square. If I provide an input unit for each possible value for each square, it will be 32 * 5. Another option is that:
Free Position: 0 0
Piece of white: 0 0.5 && King Piece of white: 0 1
Piece of black: 0.5 1 && King Piece of black: 1 0
In this case, the input length will be just 64, but I'm not sure which one will give a better result. Could anyone give any insight on this?
In case anyone is still interested in this topic: I suggest encoding the checkers board as a 32-dimensional vector. I recently trained a CNN on an expert checkers database and was able to achieve a surprisingly high level of play with no search, somewhat similar (I suspect) to the supervised learning step that DeepMind used to pretrain AlphaGo. I represented my input as an 8x4 grid, with entries in the set {-3, -1, 0, 1, 3} corresponding to an opposing king, opposing checker, empty square, own checker, and own king, respectively. Thus, rather than encoding the board as a 160-dimensional vector where each dimension corresponds to a location-piece combination, the input space can be reduced to a 32-dimensional vector where each board location has its own dimension and the piece at that location is encoded by one of five real values. This is done without any loss of information.
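To make the two input encodings concrete, here is a small sketch. The piece labels and function names are hypothetical; only the value set {-3, -1, 0, 1, 3}, the 8x4 grid, and the 32 versus 160 dimensions come from the description above.

```python
# Hypothetical piece labels mapped to the values described above.
PIECE_VALUE = {"opp_king": -3, "opp_man": -1, "empty": 0, "own_man": 1, "own_king": 3}
PIECES = list(PIECE_VALUE)


def encode_compact(squares):
    """32 playable squares -> 32 numbers, one per square."""
    return [PIECE_VALUE[piece] for piece in squares]


def encode_one_hot(squares):
    """32 playable squares -> 160 numbers, one per square-piece combination."""
    vector = []
    for piece in squares:
        vector.extend(1 if piece == candidate else 0 for candidate in PIECES)
    return vector


def as_grid(vector32):
    """Reshape the compact vector into the 8x4 grid fed to the CNN."""
    return [vector32[row * 4:(row + 1) * 4] for row in range(8)]


# Opening position: 12 opposing men, 8 empty squares, 12 own men.
start = ["opp_man"] * 12 + ["empty"] * 8 + ["own_man"] * 12
compact = encode_compact(start)
one_hot = encode_one_hot(start)
grid = as_grid(compact)
```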
The more interesting question, at least in my mind, is which output encoding is most conducive to learning. One option is to encode it in the same way as the input. I would advise against this, having found that simplifying the output encoding to a location (of the piece to move) and a direction (along which to move said piece) is much more advantageous for learning. While the reasons for this are likely more subtle, I suspect it is due to the enormous state space of checkers (roughly 5 × 10^20 board positions). Considering that the goal of our predictive model is to accept an input containing an enormous number of possible states and produce one output (i.e., move) from at most 48 possibilities (12 pieces times 4 possible directions, excluding jumps), a top priority in architecting a neural network should be matching the complexity of its input and output space to that of the actual game. With this in mind, I chose to encode the output as a 32 x 4 matrix, with each row representing a board location and each column representing a direction. During training I simply unraveled this into a 128-dimensional, one-hot encoded vector (using argmax of softmax activations). Note that this output encoding allows many invalid moves for a given board (e.g., moves off the board from edges and corners, moves to occupied locations, etc.); we hope that the neural network can learn valid play given a large enough training set. I found that the CNN did a remarkable job of learning valid moves.
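The unravelling of the 32 x 4 output into 128 entries can be sketched as follows. The row-major index convention and the function names are my own assumptions for illustration; the write-up linked below may order things differently.

```python
NUM_SQUARES = 32
NUM_DIRECTIONS = 4  # which number means which direction is an arbitrary choice


def move_to_index(square, direction):
    """Flatten (row, column) of the 32 x 4 matrix into one of 128 positions."""
    return square * NUM_DIRECTIONS + direction


def index_to_move(index):
    """Inverse of move_to_index: recover (square, direction)."""
    return divmod(index, NUM_DIRECTIONS)


def one_hot_move(square, direction):
    """128-dimensional training target with a single 1 at the chosen move."""
    target = [0] * (NUM_SQUARES * NUM_DIRECTIONS)
    target[move_to_index(square, direction)] = 1
    return target


def pick_move(scores):
    """Argmax over 128 network outputs, returned as (square, direction)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return index_to_move(best)


target = one_hot_move(5, 2)
```

Nothing in this encoding prevents the argmax from landing on an illegal move, which is the limitation noted above.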
I’ve written more about this project at http://www.chrislarson.io/checkers-p1.
I've done this sort of thing with Tic-Tac-Toe. There are several ways to represent this. One of the most common for TTT is to have an input and an output that each represent the entire board. In TTT this becomes 9 x hidden x 9, with an input of -1 for X, 0 for empty, and 1 for O. The input to the neural network is then the current state of the board, and the output is the desired move: whichever output neuron has the highest activation is the move.
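As a minimal sketch of the read-out step (the activations here are made-up numbers standing in for a real network's output): pick the highest-scoring output neuron. Restricting the choice to empty squares is my own addition so that the selected move is always legal.

```python
def choose_move(board, outputs):
    """Pick the empty square whose output neuron has the highest activation.

    board:   9 values, -1 for X, 0 for empty, 1 for O
    outputs: 9 activations from the network's output layer
    """
    open_squares = [i for i, value in enumerate(board) if value == 0]
    if not open_squares:
        raise ValueError("board is full")
    return max(open_squares, key=lambda i: outputs[i])


board = [-1, 0, 1,
          0, 0, 0,
          0, 0, 0]
outputs = [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.0, 0.4, 0.5]
move = choose_move(board, outputs)  # square 0 scores highest but is occupied
```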
Propagation training will not work too well here because you will not have a finite training set. Something like Simulated Annealing, PSO, or anything with a score function would be ideal. Pitting the networks against each other for the scoring function would be great.
This worked somewhat well for TTT. I am not sure how it would work for Checkers. Chess would likely destroy it. For Go it would likely be useless.
The problem is that the neural network will learn patterns only at fixed locations. For example, jumping an opponent in the top-left corner would be a totally different situation from jumping someone in the bottom-left corner. These would have to be learned separately.
Perhaps it is better to represent the state of the board in a position-independent way. This would require some thought. For instance, you might communicate which "jump" opportunities exist, which opportunities to move towards a king square exist, etc., and allow the net to learn to prioritize these.
I've tried all the possibilities, and intuitively I would say that the best idea is to separate all the possibilities for each square. Concretely:
0 0 0: free
1 0 0: white piece
0 0 1: black piece
1 1 0: white king
0 1 1: black king
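Written out as code, the table above gives 3 inputs per square and therefore 96 inputs for the 32 playable squares. The labels and the function name are hypothetical; the bit patterns are the ones listed.

```python
# Bit patterns from the table above; the labels are illustrative.
BITS = {
    "free":        (0, 0, 0),
    "white_piece": (1, 0, 0),
    "black_piece": (0, 0, 1),
    "white_king":  (1, 1, 0),
    "black_king":  (0, 1, 1),
}


def encode_board(squares):
    """32 playable squares -> flat list of 96 network inputs."""
    inputs = []
    for piece in squares:
        inputs.extend(BITS[piece])
    return inputs


board = ["white_piece"] * 12 + ["free"] * 8 + ["black_piece"] * 12
inputs = encode_board(board)
```

In this scheme the first bit marks white, the last bit marks black, and the middle bit marks a king.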
It is also possible to add other parameters describing the game situation, such as the number of pieces under threat or the number of possible jumps.
Please see this thesis
Blondie24, page 46, has a description of the input for the neural network.