In each step of doc2vec training, the model takes a word and its neighbors within a certain distance (called the window size). The neighbors are summed, averaged, or concatenated, depending on the configuration.
My question is: what if the window exceeds the boundary of a document, like
this
Then how are the neighbors summed, averaged, or concatenated? Or are they simply discarded?
I am doing some NLP work and most documents in my dataset are quite short. I'd appreciate any ideas.
The pure PV-DBOW mode (dm=0), which trains quickly and often performs very well (especially on short documents), makes use of no sliding window at all. Each per-document vector is just trained to be good at directly predicting the document's words - neighboring words don't make any difference.
Only when you either switch to PV-DM mode (dm=1), or add interleaved skip-gram word-vector training (dm=0, dbow_words=1) is the window relevant. And then, the window is handled the same as in Word2Vec training: if it would go past either end of the text, it's just truncated to not go over the end, perhaps leaving the effective window lop-sided.
So if you have a text "A B C D E", and a window of 2, when predicting the 1st word 'A', only the 'B' and 'C' to the right contribute (because there are zero words to the left). When predicting the 2nd word 'B', the 'A' to the left and the 'C' and 'D' to the right contribute. And so forth.
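The truncation described above can be sketched in a few lines. This is only an illustration of the clipping rule, not gensim's actual code; the function name `context_indices` is made up for this example.

```python
def context_indices(n_words, target, window):
    """Return the positions of the context words for `target`,
    clipping the window at both ends of the text."""
    start = max(0, target - window)
    end = min(n_words, target + window + 1)
    return [i for i in range(start, end) if i != target]


words = ["A", "B", "C", "D", "E"]
for t, word in enumerate(words):
    context = [words[i] for i in context_indices(len(words), t, window=2)]
    print(word, "<-", context)
```

For the first word only the two words to the right are returned, which is the lop-sided window mentioned above.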
An added wrinkle is that, to give nearby words a stronger weighting in a computationally efficient manner, the window used for any one target prediction is of a random size from 1 up to the configured window value. So for window=2, half the time it's really only using a window of 1 on each side, and the other half the time it's using the full window of 2. (For window=5, it's using an effective value of 1 for 20% of the predictions, 2 for 20% of the predictions, 3 for 20% of the predictions, 4 for 20% of the predictions, and 5 for 20% of the predictions.) This effectively gives nearer words more influence, without the full computational cost of including all full-window words every time or any extra partial-weighting calculations.
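To see how the random window size turns into a weighting, here is a small sketch that computes how often a neighbor at a given distance is included. It assumes the effective window is drawn uniformly from 1 to `window`, as described above; `inclusion_probability` is a name invented for this example.

```python
from fractions import Fraction


def inclusion_probability(distance, window):
    """Probability that a neighbor `distance` positions away is used,
    assuming the effective window is uniform on 1..window."""
    if distance < 1 or distance > window:
        return Fraction(0)
    # The neighbor is used whenever the drawn size is >= distance.
    return Fraction(window - distance + 1, window)


for d in range(1, 6):
    print(d, inclusion_probability(d, window=5))
```

With window=5, an adjacent word is always used, while a word five positions away is used in only one prediction out of five.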
I’m making a chess engine using machine learning, and I’m experiencing problems debugging it. I need help figuring out what is wrong with my program, and I would appreciate any help.
I did my research and borrowed ideas from multiple successful projects. The idea is to use reinforcement learning to teach a neural network to differentiate between strong and weak positions.
I collected 3 million games played at Elo over 2000 and used my own method to label them. After studying hundreds of games, I found that it is safe to assume that in the last 10 turns of any game the balance doesn't change and the winning side has a strong advantage. So I picked positions from the last 10 turns and gave them one of two labels: 1 for a white win and 0 for a black win. I didn't include any draw positions. To avoid bias, I picked equal numbers of positions labeled as wins for each side, and equal numbers of positions with each side to move.
Each position is represented by a vector of 773 elements: every piece on every square of the chessboard, together with castling rights and the side to move, is encoded with ones and zeros. My sequential model has an input layer with 773 neurons and an output layer with a single neuron. I used an MLP with three hidden layers of 1546, 500, and 50 units respectively, with 20% dropout on each. The hidden layers use the nonlinear ReLU activation, while the output layer uses a sigmoid. I used the binary cross-entropy loss and the Adam optimizer with all default parameters except the learning rate, which I set to 0.0001.
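For reference, one possible layout of such a 773-element vector is sketched below: 12 piece types times 64 squares, then 4 castling flags, then 1 side-to-move bit. The bit ordering here is an assumption made for illustration; the post does not say which ordering was actually used.

```python
PIECES = "PNBRQKpnbrqk"  # six white piece types, then six black ones


def encode_fen(fen):
    """Encode a FEN string as a list of 773 zeros and ones.
    Layout (assumed): 12 x 64 piece planes, 4 castling bits, 1 turn bit."""
    placement, side, castling = fen.split()[:3]
    vec = [0] * 773
    square = 0  # squares in FEN order: a8..h8, a7..h7, ..., a1..h1
    for ch in placement:
        if ch == "/":
            continue
        if ch.isdigit():
            square += int(ch)  # a digit is a run of empty squares
        else:
            vec[PIECES.index(ch) * 64 + square] = 1
            square += 1
    assert square == 64, "FEN placement must cover exactly 64 squares"
    for i, flag in enumerate("KQkq"):
        if flag in castling:
            vec[768 + i] = 1
    vec[772] = 1 if side == "w" else 0
    return vec


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(sum(encode_fen(start)))  # pieces + castling flags + turn bit
```

A quick sanity check is that the number of ones equals the number of pieces plus the castling rights plus the turn bit.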
I used 3 percent of the positions for validation. During the first 10 epochs, validation accuracy gradually went up from 90 to 92%, just one percent behind training accuracy. Further training led to overfitting, with training accuracy going up, and validation accuracy going down.
I tested the trained model on multiple positions by hand and got pretty bad results. The model can predict which side is winning if that side has more pieces, or pawns close to a promotion square. It also gives the side to move a small advantage (0.1). But overall it doesn't make much sense. In most cases it heavily favors black (by ~0.3) and doesn't properly take the setup into account. For instance, it scores the starting position as ~0.0001, as if black had an almost 100% chance to win. Sometimes an irrelevant transformation of a position results in an unpredictable change in the evaluation. One king and one queen for each side is usually viewed as a lost position for white (0.32), unless the black king is on a certain square, even though that doesn't really change the balance on the board.
What I did to debug the program:
To make sure I have not made any mistakes, I analyzed how each position is recorded, step by step. Then I picked a dozen positions from the final numpy array, right before training, and converted them back to analyze them on a regular chessboard.
I used various numbers of positions from the same game (1 and 6) to make sure that using too many similar positions is not the cause of the fast overfitting. By the way, even one position per game in my database resulted in a data set of 3 million positions, which should be sufficient according to some research papers.
To make sure that the positions I use are not too simple, I analyzed them. 1.3 million of them had 36 points in pieces (knights, bishops, rooks, and queens; pawns were not included in the count), 1.4 million had 19 points, and only 0.3 million had fewer.
Some things you could try:
Add unit tests and asserts wherever possible. E.g. if you know that some value is never supposed to get negative, add an assert to check that this condition really holds.
Print shapes of all tensors to check that you have really created the architecture you intended.
Check if your model outperforms some simple baseline model.
You say your model overfits, so maybe simplify it / add regularization?
Check how your model performs on the simplest positions. E.g. can it recognize a checkmate?
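As an example of the baseline check, here is a sketch of a material-count baseline that reads FEN strings. The piece values, the tie rule, and the sample positions are assumptions made for illustration; if the network cannot clearly beat something like this on the validation set, the problem is more likely in the data or labels than in the architecture.

```python
VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}


def material_balance(fen):
    """White material minus black material, read from a FEN string."""
    balance = 0
    for ch in fen.split()[0]:
        if ch.lower() in VALUES:
            value = VALUES[ch.lower()]
            balance += value if ch.isupper() else -value
    return balance


def baseline_predict(fen):
    """Predict 1 (white wins) if white is ahead in material, else 0.
    Ties go to 0, which is an arbitrary choice."""
    return 1 if material_balance(fen) > 0 else 0


def accuracy(samples):
    """samples: list of (fen, label) pairs with label 1 or 0."""
    hits = sum(baseline_predict(fen) == label for fen, label in samples)
    return hits / len(samples)


samples = [
    ("4k3/8/8/8/8/8/8/4KQ2 w - - 0 1", 1),  # white is a queen up
    ("4kq2/8/8/8/8/8/8/4K3 b - - 0 1", 0),  # black is a queen up
]
print(accuracy(samples))
```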
I read in many papers that background removal as a preprocessing step helps reduce the amount of computation. But why is this the case? My understanding is that the CNN works on a rectangular window no matter how it is filled, with zeros or positive values.
See this for an example.
In the paper you provide, it seems that they do not pass the entire image to the network. Instead, they seem to be selecting smaller patches from the non-white background. This makes sense because it reduces the noise in their data, but it also reduces computational complexity, because of the effect it has on fully connected layers.
Suppose the input image is of size h*w. In your CNN, the image passes through a series of convolutions and max-poolings, and as a result, right before the first fully connected layer, you end up with a feature map of size
sz=m*(h/k)*(w/d)
where m is the number of feature planes, and where k and d depend on the number of layers and on the parameters of each convolution and max-pooling module (e.g. the size of the convolution kernel). Usually, we'll have d==k. Now, assume that you feed this to a fully connected layer to produce a vector of q values. What this layer does is basically a matrix multiplication
A*x
where A is a matrix of size q*sz, and x is just your feature map written as a vector.
Now, assume you pass a patch of size (h/t)*(w/t) to the network. You end up with a feature map of size
sz/(t^2)
Given the size of the images in their datasets, this is a considerable reduction in the number of parameters, since A shrinks by the same factor of t^2. Smaller patches also allow larger batches, and that too can accelerate training (better gradient approximation).
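To put numbers on this, here is a small sketch of the arithmetic with d == k. All the values (image size, m, k, q, t) are made up for illustration and are not taken from the paper.

```python
def fc_weights(h, w, m, k, q):
    """Number of entries in A (size q * sz) for the first fully
    connected layer, with sz = m * (h / k) * (w / k)."""
    sz = m * (h // k) * (w // k)
    return q * sz


t = 4
full = fc_weights(h=1024, w=1024, m=64, k=16, q=512)
patch = fc_weights(h=1024 // t, w=1024 // t, m=64, k=16, q=512)
print(full, patch, full // patch)
```

With t = 4, the matrix for the patch has t^2 = 16 times fewer entries than the one for the full image.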
I hope this helps.
Edit, following @wlnirvana's comment: Yes, patch size is a hyperparameter. In the example I gave, it is set by choosing t. Given the size of the images in the dataset, I'd say something like t>=6 would be realistic. As for how this relates to background removal, to quote the paper (section 3.1):
"To reduce computation time and to focus our analysis on regions of the slide most likely to contain cancer metastasis..."
This means that they select patches only around areas that are not background. This makes sense, since passing a completely white patch to the network would just be a waste of time (in figure 1, you can have so many white/gray/useless patches if you select them randomly, without removing the background). I didn't find any explanation on how patch selection is done in their paper, but I assume something like selecting a number of pixels p_1,...,p_n in the non-background regions, and considering n patches of size (h/t)*(w/t) around each of them would make sense.
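The patch selection guessed at above could look roughly like the following. This is a sketch of that guess and not the paper's method; the background value, the tolerance, and the function names are all assumptions, and the image is a plain list of rows of grayscale values.

```python
import random


def sample_patch_centers(image, n, background=255, tol=10, seed=0):
    """Pick up to n pixel coordinates whose value differs from the
    (assumed) background value by more than tol."""
    candidates = [
        (r, c)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if abs(value - background) > tol
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))


def crop(image, center, size):
    """Crop a size x size patch around center, shifted if needed so it
    stays inside the image (assumes size <= image height and width)."""
    h, w = len(image), len(image[0])
    r0 = min(max(center[0] - size // 2, 0), h - size)
    c0 = min(max(center[1] - size // 2, 0), w - size)
    return [row[c0:c0 + size] for row in image[r0:r0 + size]]


# 8 x 8 white image with a small dark region standing in for tissue.
image = [[255] * 8 for _ in range(8)]
for r in (3, 4):
    for c in (3, 4):
        image[r][c] = 50

centers = sample_patch_centers(image, n=3)
patches = [crop(image, center, size=4) for center in centers]
print(centers)
```

Every sampled center lies on a non-background pixel, so no patch is entirely white.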
The question is conceptual. I basically understand how the MNIST example works: the feedforward net takes an image as input and outputs a predicted label from 0 to 9.
I'm working on a project that ideally will take an image as the input, and for every pixel on that image, I will output a probability of that pixel being a certain label or not.
So my input, for example is of the size 600 * 800 * 3 pixels, and my output would be 600 * 800, where every single entry on my output is a probability.
How can I design the pipeline for that using Convolutional Neural Network? I'm working with Tensorflow. Thanks
Elaboration:
Basically I wanted to label every pixel as either foreground or background (The probability of the pixel being foreground). My intuition is that in convolutional layers, the neurons will be able to pick up information in a patch around that pixel, and finally be able to tell how likely this pixel could be the foreground.
Although it wouldn't be very efficient, a naive method could be to color a window (say, 5px x 5px) of pixels black, record the probabilities for each output class, then slide the window over a bit, then record again. This would be repeated until the window passed over the whole image.
Now we have some interesting information. For each window position, we know the delta of the probability distribution over the labels compared to the probabilities when the classifier received the whole image. That delta corresponds to the amount that that region contributed to the classifier making that decision.
If you want this mapped down to a per-pixel level for visualization purposes, you could use a stride length of 1 pixel when sliding the window and map the probability delta to the centermost pixel of the window.
Note that you don't want to make the window too small, otherwise the deltas will be too small to make a difference. Also, you'll probably want to be a bit smart about how you choose the color of the window so the window itself doesn't appear to be a feature to the classifier.
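A minimal sketch of this sliding-window idea, for a single output class, is shown below. The `classify` argument stands in for the trained network and is assumed to return one probability; the toy classifier and the fill value are placeholders chosen only so the example runs.

```python
def occlusion_map(image, classify, window=3, fill=0.0):
    """For each pixel, mask a window centered on it (stride 1) and
    record how much the classifier's output drops."""
    h, w = len(image), len(image[0])
    base = classify(image)
    half = window // 2
    heat = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            masked = [row[:] for row in image]
            for rr in range(max(0, r - half), min(h, r + half + 1)):
                for cc in range(max(0, c - half), min(w, c + half + 1)):
                    masked[rr][cc] = fill
            heat[r][c] = base - classify(masked)
    return heat


def toy_classify(img):
    """Stand-in for a real model: responds to the brightest pixel."""
    return max(max(row) for row in img)


image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
heat = occlusion_map(image, toy_classify, window=3)
for row in heat:
    print(row)
```

Large values in the map mark the window positions whose masking changes the prediction the most. With a real network this costs one forward pass per pixel, which is why the method is described above as not very efficient.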
Edit in response to your elaboration:
This would still work for what you're trying to do. In fact, it becomes a bit nicer. Instead of keeping all the label probability deltas separate, you would sum them. This would give you a measurement which tells you "how much does this region make the image more like a number" (in other words, the foreground). Also, you wouldn't measure the deltas against the uncovered image, but rather against the vector of probabilities where P(x)=0 for each label.
I'm designing a feed forward neural network learning how to play the game checkers.
For the input, the board has to be given and the output should give the probability of winning versus losing. But what is the ideal transformation of the checkers board to a row of numbers for input? There are 32 possible squares and 5 different possibilities (king or piece of white or black player and free position) on each square. If I provide an input unit for each possible value for each square, it will be 32 * 5. Another option is that:
Free position: 0 0
Piece of white: 0 0.5
King piece of white: 0 1
Piece of black: 0.5 1
King piece of black: 1 0
In this case, the input length will be just 64, but I'm not sure which one will give a better result. Could anyone give any insight on this?
In case anyone is still interested in this topic: I suggest encoding the checkers board with a 32-dimensional vector. I recently trained a CNN on an expert checkers database and was able to achieve a surprisingly high level of play with no search, somewhat similar (I suspect) to the supervised learning step that DeepMind used to pretrain AlphaGo. I represented my input as an 8x4 grid, with entries in the set [-3, -1, 0, 1, 3] corresponding to an opposing king, opposing checker, empty, own checker, and own king, respectively. Thus, rather than encoding the board with a 160-dimensional vector where each dimension corresponds to a location-piece combination, the input space can be reduced to a 32-dimensional vector where each board location is represented by a unique dimension and the piece at that location is encoded by a real number. This is done without any loss of information.
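A small sketch of this encoding follows. The five values are the ones given above; the one-character board notation ('x', 'X', 'o', 'O', '.') is invented here just to have something to encode.

```python
# Assumed notation: x/X = opposing checker/king, o/O = own checker/king.
CODE = {"X": -3, "x": -1, ".": 0, "o": 1, "O": 3}
SYMBOL = {value: key for key, value in CODE.items()}


def encode(board):
    """Map a 32-character board string to a 32-element vector."""
    assert len(board) == 32, "checkers uses 32 playable squares"
    return [CODE[ch] for ch in board]


def decode(vec):
    """Inverse of encode, showing that no information is lost."""
    return "".join(SYMBOL[v] for v in vec)


def as_grid(vec):
    """Reshape the 32-element vector into the 8 x 4 grid."""
    return [vec[r * 4:(r + 1) * 4] for r in range(8)]


start = "x" * 12 + "." * 8 + "o" * 12
for row in as_grid(encode(start)):
    print(row)
```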
The more interesting question, at least in my mind, is which output encoding is most conducive to learning. One option is to encode it in the same way as the input. I would advise against this, having found that simplifying the output encoding to a location (of the piece to move) and a direction (along which to move said piece) is much more advantageous for learning. While the reasons for this are likely more subtle, I suspect it is due to the enormous state space of checkers (something like 5 x 10^20 board positions). Considering that the goal of our predictive model is to accept an input containing an enormous number of possible states and produce one output (i.e., a move) from at most 48 possibilities (12 pieces times 4 possible directions, excluding jumps), a top priority in architecting a neural network should be matching the complexity of its input and output space to that of the actual game. With this in mind, I chose to encode the output as a 32 x 4 matrix, with each row representing a board location and each column representing a direction. During training I simply unraveled this into a 128-dimensional, one-hot encoded vector (using argmax of softmax activations). Note that this output encoding allows many invalid moves for a given board (e.g., moves off the board from edges and corners, moves to occupied locations, etc.); we hope that the neural network can learn valid play given a large enough training set. I found that the CNN did a remarkable job at learning valid moves.
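The unraveling of the 32 x 4 output into 128 entries can be sketched as follows. The row-major ordering and the meaning of each direction index are assumptions for illustration; the post does not specify them.

```python
N_LOCATIONS, N_DIRECTIONS = 32, 4


def move_to_index(location, direction):
    """Flatten (row, column) of the 32 x 4 matrix, row-major."""
    return location * N_DIRECTIONS + direction


def index_to_move(index):
    """Inverse of move_to_index."""
    return divmod(index, N_DIRECTIONS)


def one_hot(location, direction):
    """128-element training target for a single move."""
    target = [0] * (N_LOCATIONS * N_DIRECTIONS)
    target[move_to_index(location, direction)] = 1
    return target


def pick_move(scores):
    """Argmax over 128 network outputs, mapped back to a move."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return index_to_move(best)


target = one_hot(location=9, direction=2)
print(target.index(1), pick_move(target))
```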
I’ve written more about this project at http://www.chrislarson.io/checkers-p1.
I've done this sort of thing with Tic-Tac-Toe. There are several ways to represent this. One of the most common for TTT is have input and output that represent the entire size of the board. In TTT this becomes 9 x hidden x 9. Input of -1 for X, 0 for none, 1 for O. Then the input to the neural network is the current state of the board. The output is the desired move. Whatever output neuron has the highest activation is going to be the move.
Propagation training will not work too well here because you will not have a finite training set. Something like Simulated Annealing, PSO, or anything with a score function would be ideal. Pitting the networks against each other for the scoring function would be great.
This worked somewhat well for TTT. I am not sure how it would work for Checkers. Chess would likely destroy it. For Go it would likely be useless.
The problem is that the neural network will learn patterns only at fixed locations. For example, jumping an opponent in the top-left corner would be a totally different situation from jumping someone in the bottom-left corner. These would have to be learned separately.
Perhaps better is to represent the exact state of the board in a position-independent way. This would require some thought. For instance, you might communicate which "jump" opportunities exist, which opportunities to move toward a king square exist, etc., and allow the net to learn to prioritize these.
I've tried all the possibilities, and intuitively I can say that the best idea is to separate all possibilities for all squares. Concretely:
0 0 0: free
1 0 0: white piece
0 0 1: black piece
1 1 0: white king
0 1 1: black king
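As a sketch, the table above maps a 32-square board to 3 x 32 = 96 inputs. The state names used as dictionary keys are just labels for this example.

```python
BITS = {
    "free": (0, 0, 0),
    "white piece": (1, 0, 0),
    "black piece": (0, 0, 1),
    "white king": (1, 1, 0),
    "black king": (0, 1, 1),
}


def encode_board(squares):
    """Concatenate the 3-bit code of each of the 32 squares."""
    assert len(squares) == 32
    vec = []
    for state in squares:
        vec.extend(BITS[state])
    return vec


start = ["white piece"] * 12 + ["free"] * 8 + ["black piece"] * 12
print(len(encode_board(start)))
```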
It is also possible to enhance other parameters about the situation of the game like the amount of pieces under threat or amount of possibilities to jump.
Please see this thesis
Blondie24 page 46, there is description of input for neural network.
I am reading the first chapter of Tom Mitchell's Machine Learning book.
What I want to do is write a program that plays checkers against itself and learns to win. My question is about credit assignment for the non-terminal board positions it encounters. Maybe we can set the value using a linear combination of the position's features with random weights, but how do we update it with the LMS rule, given that we don't have any training samples apart from the ending states?
I am not sure whether I have stated my question clearly, although I tried to.
I haven't read that specific book, but my approach would be the following. Suppose that White wins. Then, every position White passed through should receive positive credit, while every position Black passed through should receive negative credit. If you iterate this reasoning, whenever you have a set of moves making up a game, you should add some amount of score to all positions from the victor and remove some amount of score from all positions from the loser. You do this for a bunch of computer vs. computer games.
You now have a data set made up of a bunch of checker positions and respective scores. You can now compute features over those positions and train your favorite regressor, such as LMS.
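The two steps above can be sketched as follows: label each position by the game outcome, then fit a linear evaluation with the LMS rule, w_i <- w_i + lr * (target - prediction) * x_i. The features, credit size, learning rate, and epoch count are placeholders for illustration.

```python
def label_positions(positions, winner, credit=1.0):
    """positions: list of (player, features) seen during one game.
    The winner's positions get +credit, the loser's get -credit."""
    return [(f, credit if p == winner else -credit) for p, f in positions]


def predict(weights, features):
    """Linear evaluation; weights[0] is the bias term."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def lms_fit(samples, n_features, lr=0.01, epochs=200):
    """Fit the linear evaluation with the LMS update rule."""
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for features, target in samples:
            error = target - predict(weights, features)
            weights[0] += lr * error
            for i, x in enumerate(features):
                weights[i + 1] += lr * error * x
    return weights


# Toy game: features are (own piece count, opponent piece count).
game = [("white", [3, 1]), ("black", [1, 3])]
samples = label_positions(game, winner="white")
weights = lms_fit(samples, n_features=2)
print(predict(weights, [3, 1]), predict(weights, [1, 3]))
```

After fitting, the position from the winner's side scores above zero and the one from the loser's side scores below zero.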
An improvement of this approach would be to train the regressor, then make some more games where each move is randomly drawn according to the predicted score of that move (i.e. moves which lead to positions with higher scores have higher probability). Then you update those scores and re-train the regressor, etc.
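One way to draw moves according to their predicted scores is sketched below. The softmax with a temperature is an assumption on my part (the scores can be negative, so they need to be turned into probabilities somehow); any monotone mapping would serve the same purpose.

```python
import math
import random


def move_probabilities(scores, temperature=1.0):
    """Softmax over predicted scores of the positions each move leads to."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_move(scores, rng, temperature=1.0):
    """Return the index of a move drawn with those probabilities."""
    probs = move_probabilities(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


rng = random.Random(0)
scores = [0.8, -0.2, 0.1]  # hypothetical predicted scores for 3 moves
print(move_probabilities(scores), sample_move(scores, rng))
```

A lower temperature makes play greedier, while a higher one explores more.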