I am new to deep learning and TensorFlow, and I am trying to train a CNN to localize digits in the Street View House Numbers (SVHN) data set. To this end I have an input set of 32x32 images and, since I want to recognize up to 5 digits, I am using as labels vectors of 20 elements like this:
[top_x_digit1, top_y_digit1, width_digit1, height_digit1, top_x_digit2, etc.]
with 0,0,0,0 when there is no digit.
As far as I understand, after (let's say) 3 layers of convolution and pooling I can add 5 parallel fully connected layers, each aimed at extracting the box features of a different digit (when present, 0,0,0,0 otherwise).
Is my approach correct?
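To make the label layout concrete, here is a minimal sketch in plain Python of how the 20-element vector could be built and then split into one 4-value target per digit. The helper names `make_label` and `split_label` are made up for illustration, and the box values are invented.

```python
MAX_DIGITS = 5       # the question allows up to 5 digits per image
VALUES_PER_BOX = 4   # top_x, top_y, width, height


def make_label(boxes):
    """Flatten up to MAX_DIGITS boxes into a 20-element, zero-padded list."""
    if len(boxes) > MAX_DIGITS:
        raise ValueError("at most %d digits are supported" % MAX_DIGITS)
    label = []
    for top_x, top_y, width, height in boxes:
        label.extend([float(top_x), float(top_y), float(width), float(height)])
    # Missing digits are encoded as 0,0,0,0, as described in the question.
    label.extend([0.0] * (VALUES_PER_BOX * (MAX_DIGITS - len(boxes))))
    return label


def split_label(label):
    """Split a 20-element label into 5 per-digit targets of 4 values each."""
    return [label[i:i + VALUES_PER_BOX]
            for i in range(0, len(label), VALUES_PER_BOX)]


# Hypothetical image containing two digits.
label = make_label([(3, 5, 8, 20), (12, 5, 8, 20)])
heads = split_label(label)
print(len(label), heads[0], heads[2])
```

Each entry of `heads` would be the regression target for one of the 5 parallel fully connected layers.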
I am experimenting with the generalised Dice loss implemented in NiftyNet to segment MRI volumes containing 4 classes (1 background, 3 regions of interest) using the V-Net. I tried to format the labels in 2 ways:
Spatial dimensions only, with 0 being background and 1, 2, 3 being the labels for the regions of interest.
5-dimensional images ([3 spatial], 1, 4) storing a binary volume for each class in the 5th dimension.
An inference from the second case produced a 3D volume where only the class with label '3' was detected, while the loss didn't decrease at all during training for the first case. Am I storing the labels in the correct format?
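For reference, the two label formats carry the same information and can be converted into each other. The sketch below shows this on a flat list of voxel labels in plain Python. It is only an illustration of the two layouts, not NiftyNet code, and the function names are made up.

```python
NUM_CLASSES = 4  # 1 background + 3 regions of interest, as in the question


def to_one_hot(label_map, num_classes=NUM_CLASSES):
    """First format -> second format.

    label_map is a flat list of integer class labels, one per voxel.
    Returns one binary list per class.
    """
    return [[1 if v == c else 0 for v in label_map]
            for c in range(num_classes)]


def to_label_map(one_hot):
    """Second format -> first format: per voxel, pick the class set to 1."""
    n_voxels = len(one_hot[0])
    return [max(range(len(one_hot)), key=lambda c: one_hot[c][i])
            for i in range(n_voxels)]


voxels = [0, 0, 1, 2, 3, 3, 0, 1]   # toy "volume" flattened to 8 voxels
binary_volumes = to_one_hot(voxels)
print(binary_volumes[3])             # binary mask for class 3
print(to_label_map(binary_volumes) == voxels)
```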
I think the first format is the correct one.
You might need to clip the gradients in the code of the segmentation application. Does the loss decrease when you use a standard Dice metric?
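As a sanity check, a plain per-class Dice score can be computed directly from label maps in the first format. This is a minimal sketch of the standard Dice coefficient, 2|A∩B| / (|A| + |B|), not the generalised Dice loss from NiftyNet. The function name and the toy data are invented.

```python
def dice_score(pred, truth, cls):
    """Dice coefficient for one class, given flat lists of integer labels."""
    p = [v == cls for v in pred]
    t = [v == cls for v in truth]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    denom = sum(p) + sum(t)
    # Convention used here: a class absent from both volumes scores 1.0.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


truth = [0, 0, 1, 1, 2, 3, 3, 3]
pred = [0, 0, 1, 2, 2, 3, 3, 0]
scores = [dice_score(pred, truth, c) for c in range(4)]
print(scores)
```

If a score like this improves during training while the generalised loss stays flat, the problem is more likely in the loss or optimisation setup than in the label format.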
I've got a 2D surface where a ship (with constant speed) navigates around the scene to pick up candy. For every candy the ship picks up I increase the fitness. The NN has one output to steer the ship (0 for left and 1 for right, so 0.5 would be straight ahead). There are four inputs in the range [-1 .. 1] that represent two normalized vectors: the ship direction and the direction to the piece of candy.
Is there any way to calculate the minimum number of neurons in the hidden layer? I also tried giving two inputs instead of four, the first was the dot product [-1..1] (where I dotted the ship direction with the direction to the candy) and the second was (0/1) if the candy was to the left/right of the ship. It seems like this approach worked a lot better with fewer neurons in the hidden layer.
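The two-input variant can be sketched as follows. The left/right test uses the sign of the 2D cross product, and it assumes a coordinate system with the y axis pointing up, so that "left" means counter-clockwise from the ship direction. The function names are made up, and the candy is assumed not to sit exactly on the ship.

```python
import math


def normalize(v):
    """Scale a 2D vector to unit length (assumes it is not the zero vector)."""
    length = math.hypot(v[0], v[1])
    return (v[0] / length, v[1] / length)


def steering_inputs(ship_dir, ship_pos, candy_pos):
    """Return (dot, side): dot in [-1, 1], side 0.0 for left and 1.0 for right."""
    d = normalize(ship_dir)
    to_candy = normalize((candy_pos[0] - ship_pos[0],
                          candy_pos[1] - ship_pos[1]))
    dot = d[0] * to_candy[0] + d[1] * to_candy[1]
    # z component of the 2D cross product: positive means counter-clockwise.
    cross = d[0] * to_candy[1] - d[1] * to_candy[0]
    # Assumption: y axis points up. A candy exactly ahead or behind
    # (cross == 0) is arbitrarily reported as "right".
    side = 0.0 if cross > 0 else 1.0
    return dot, side


print(steering_inputs((1, 0), (0, 0), (0, 1)))    # candy to the left
print(steering_inputs((1, 0), (0, 0), (1, -1)))   # candy ahead and to the right
```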
Fewer inputs should imply fewer neurons. This is because the number of input combinations decreases, which makes it easier for the neural network to learn the system. There is no golden rule for calculating the best number of nodes in the hidden layer. However, with 2 inputs I'd say 2 hidden nodes should work fine. It really depends on the degree of nonlinearity in your inputs.
Choosing the number of hidden layers and the number of neurons in each hidden layer has always been a challenge, and the answer varies with the type of problem. That said, a single hidden layer in a feedforward neural network can solve most problems, since it can approximate functions.
Murata defined some rules to use in neural networks to define the number of hidden neurons in a feedforward neural network:
The value should be between the size of the input and output layers.
The value should be 2/3 the size of the input layer plus the size of the output layer.
The value should be less than twice the size of the input layer.
You could try these rules and evaluate their impact on your neural network.
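As a quick illustration, the three rules of thumb can be evaluated for a given network size. The sketch reads the second rule as (2/3 x input size) + output size, which is an assumption about the wording, and the function name is made up.

```python
def hidden_size_rules(n_in, n_out):
    """Evaluate the three rules of thumb for the number of hidden neurons."""
    low, high = sorted((n_in, n_out))
    return {
        # Rule 1: somewhere between the input and output layer sizes.
        "between_input_and_output": (low, high),
        # Rule 2 (assumed reading): 2/3 of the input size plus the output size.
        "two_thirds_input_plus_output": round(2 * n_in / 3 + n_out),
        # Rule 3: strictly less than this value.
        "upper_bound_twice_input": 2 * n_in,
    }


four_inputs = hidden_size_rules(4, 1)   # the four-input ship network
two_inputs = hidden_size_rules(2, 1)    # the two-input variant
print(four_inputs)
print(two_inputs)
```

These are only starting points. The rules can disagree with each other, so the final size still has to be found by experiment.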
I'm trying to create the neural network shown below. It has 3 inputs, 2 outputs, and 2 hidden layers (so 4 layers altogether, or 3 layers of weight matrices). In the first hidden layer there are 4 neurons, and in the second hidden layer there are 3. There is a bias neuron going to the first and second hidden layer, and the output layer.
I have tried using the "create custom neural network" function in MATLAB, but I can't get it to work how I want it to.
This is how I used the function
net1=network(3,3,[1;1;1],[1,1,1;0,0,0;0,0,0],[0,0,0;1,0,0;0,1,0],[0,0,0])
view(net1)
And it gives me the neural network shown below:
As you can see, this isn't what I want. There are only 3 weights in the first layer, 1 in the second, 1 in the output layer, and only one output. How would I fix this?
Thanks!
Just to clarify how I want this network to work:
The user will input 3 numbers into the network.
Each one of the 3 inputs is multiplied by 4 different weights, and then these numbers are sent to the 4 neurons in the first hidden layer.
The bias node acts the same as one of the inputs, but it always has a value of 1. It is multiplied by 4 different weights, and then sent to the 4 neurons in the first hidden layer.
Each neuron in the first hidden layer sums the 4 numbers going into it, and then passes this number through the sigmoid activation function.
The neurons in the first hidden layer then output 4 numbers that are each multiplied by 3 different weights, and sent to the 3 neurons in the second hidden layer.
The bias node going to the second hidden layer works the same as the first bias node.
Each neuron in the second hidden layer sums up the 5 numbers going into it and passes the result through the sigmoid activation function.
The neurons in the second hidden layer then output 3 numbers that are each multiplied by 2 different weights and sent to the 2 output neurons.
The output layer also sums all of its inputs, including its bias input, and then passes this through the sigmoid activation function to get the final two values.
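The steps above can be sketched as a plain forward pass. This is an illustration in Python with random weights, not the MATLAB toolbox API. The bias is handled as an extra input fixed at 1, exactly as described, and all names are made up.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """One row per neuron: n_in weights followed by one bias weight."""
    return [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(layers, inputs):
    """Propagate the inputs through every layer, applying the sigmoid each time."""
    activations = list(inputs)
    for layer in layers:
        with_bias = activations + [1.0]   # the bias node always outputs 1
        activations = [sigmoid(sum(w * a for w, a in zip(neuron, with_bias)))
                       for neuron in layer]
    return activations


rng = random.Random(0)
sizes = [3, 4, 3, 2]   # 3 inputs, hidden layers of 4 and 3, 2 outputs
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
outputs = forward(layers, [0.2, 0.5, 0.9])
n_weights = sum(len(neuron) for layer in layers for neuron in layer)
print(outputs, n_weights)
```

Counting the rows confirms the expected number of weights including biases: (3+1)*4 + (4+1)*3 + (3+1)*2 = 39.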
After some time playing around I've figured out how to do it. The code I needed to use is:
net = newff([0 1; 0 1; 0 1],[4,3 2],{'logsig','logsig','logsig'})
view(net)
This creates the network I was looking for.
I was originally mistaken about the MATLAB representation of neural networks. The green arrows show the path of all of the numbers, not just a single number.
My project is to create software that recognizes certain objects, such as an apple or a coin. I want to use Kinect. My question is: do I need a machine learning algorithm like a Haar classifier to recognize an object, or can Kinect itself do that?
Kinect itself cannot recognize objects. It will give you a dense depth map. You can then use the depth features along with some simple features (in your case, color features or gradient features may do the job). You feed those features to a classifier (an SVM or a Random Forest, for example) to train the system, and then use the trained model to test on new samples.
Regarding Haar features, I think they could do the job but you would need a sufficiently large database of features. It all depends on what you want to detect. In the case of an apple and a coin, just color would suffice.
Refer to this paper to get an idea of how to perform human pose recognition using a Kinect camera. You just have to pay attention to their depth features and their classifiers. Do not apply their approach directly. Your problem is simpler.
Edit: simple gradient orientations histogram
Gradient orientations can give you a coarse idea about the shape of the object (It is not a shape-feature to be specific, better shape features exist, but this one is extremely fast to calculate).
Code snippet:
%calculate gradient
[dx,dy] = gradient(double(img));
A = (atan(dy./(dx+eps))*180)/pi; %eps added to avoid division by zero.
A will contain the orientation of each pixel. Segment your original image according to the depth values. For a segment with similar depth values, calculate a color histogram. Extract the pixel orientations corresponding to that region, call it A_r, and calculate a 9-bin histogram of A_r (you can use more bins; nine bins means each bin covers 180/9 = 20 degrees). Concatenate the color features and the gradient histogram. Do this for a sufficient number of leaves. Then you can give these feature vectors to a classifier for training.
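The histogram step could look roughly like this in Python. It is a sketch that mirrors the MATLAB snippet above (same atan and eps trick), taking the x and y gradients of one region as flat lists. The function name and the choice to normalize the histogram are assumptions.

```python
import math


def orientation_histogram(dx, dy, n_bins=9):
    """Histogram of gradient orientations in (-90, 90) degrees.

    dx, dy: flat lists with the x and y gradients of the pixels in one region.
    """
    eps = 2.0 ** -52              # same role as MATLAB's eps
    bin_width = 180.0 / n_bins    # 20 degrees per bin for 9 bins
    hist = [0] * n_bins
    for gx, gy in zip(dx, dy):
        angle = math.degrees(math.atan(gy / (gx + eps)))
        # Shift to [0, 180) and clamp the top edge into the last bin.
        index = min(int((angle + 90.0) / bin_width), n_bins - 1)
        hist[index] += 1
    total = sum(hist)
    # Normalizing makes regions of different sizes comparable (an assumption).
    return [h / total for h in hist] if total else hist


# Four toy pixels: horizontal, +45 degree, -45 degree and vertical gradients.
hist = orientation_histogram([1, 1, 1, 0], [0, 1, -1, 1])
print(hist)
```

The resulting list can be concatenated with the color histogram of the same region to form the feature vector for the classifier.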
Edit: This is a reply to a comment below.
Regarding MaxDepth parameter in opencv_traincascade
The documentation says, "Maximal depth of a weak tree. A decent choice is 1, that is case of stumps". When you perform binary classification with a stump, the classifier takes the form:
if yourFeatureValue>=learntThresh
class=1;
else
class=0;
end
This type of classifier, which thresholds a single feature value (a scalar), is called a decision stump. There is only one split between the positive and negative class (therefore maxDepth is one). For example, it would work in the following scenario. Imagine you have a 1-D feature:
f=[1 2 3 4 -1 -2 -3 -4]
The first 4 are class 1, the rest are class 0. A decision stump would get 100% accuracy on this data by setting the threshold to zero. Now, imagine a more complicated feature space such as:
f=[1 2 3 4 5 6 7 8 9 10 11 12];
The first 4 and last 4 are class 1, the rest are class 0. Here, you cannot get 100% classification accuracy with a decision stump. You need two thresholds/splits. Therefore, you can construct a tree with depth 2. You will have 2^(2-1)=2 thresholds. For depth=3 you get 4 thresholds, for depth=4 you get 8 thresholds, and so on. Here, I assume a tree with a single node has height 1.
You may feel that more levels will give you more accuracy, but then there is the problem of overfitting (as well as computation, memory storage, etc.). Therefore, you have to set a good value for the depth. I usually set it to 3.
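To check the two toy examples above, here is a small sketch in Python. It brute-forces the best single threshold (trying both polarities) and compares it with a hand-built two-split tree. This is only an illustration of the idea, not what opencv_traincascade does internally, and the function names are made up.

```python
def best_stump_accuracy(features, labels):
    """Best accuracy of a single-threshold classifier, trying both polarities."""
    values = sorted(set(features))
    # Candidate thresholds: below the minimum, and midway between neighbours.
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0
                                      for a, b in zip(values, values[1:])]
    best = 0.0
    for t in thresholds:
        hits = sum((f >= t) == bool(y) for f, y in zip(features, labels))
        # Flipping the predicted classes turns `hits` into `len - hits`.
        best = max(best, max(hits, len(labels) - hits) / len(labels))
    return best


def two_split_predict(f, low, high):
    """Two-threshold tree: split at `high`, then split the lower branch at `low`."""
    if f >= high:
        return 1
    return 1 if f < low else 0


f1 = [1, 2, 3, 4, -1, -2, -3, -4]
y1 = [1, 1, 1, 1, 0, 0, 0, 0]
f2 = list(range(1, 13))
y2 = [1] * 4 + [0] * 4 + [1] * 4

stump_acc_1 = best_stump_accuracy(f1, y1)
stump_acc_2 = best_stump_accuracy(f2, y2)
tree_acc_2 = sum(two_split_predict(f, 4.5, 8.5) == y
                 for f, y in zip(f2, y2)) / len(y2)
print(stump_acc_1, stump_acc_2, tree_acc_2)
```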
I'm designing a feed forward neural network learning how to play the game checkers.
For the input, the board has to be given and the output should give the probability of winning versus losing. But what is the ideal transformation of the checkers board to a row of numbers for input? There are 32 possible squares and 5 different possibilities (king or piece of white or black player and free position) on each square. If I provide an input unit for each possible value for each square, it will be 32 * 5. Another option is that:
Free Position: 0 0
Piece of white: 0 0.5 && King Piece of white: 0 1
Piece of black: 0.5 1 && King Piece of black: 1 0
In this case, the input length will be just 64, but I'm not sure which one will give a better result. Could anyone give any insight on this?
In case anyone is still interested in this topic: I suggest encoding the checkers board with a 32-dimensional vector. I recently trained a CNN on an expert checkers database and was able to achieve a surprisingly high level of play with no search, somewhat similar (I suspect) to the supervised learning step that DeepMind used to pretrain AlphaGo. I represented my input as an 8x4 grid, with entries in the set [-3, -1, 0, 1, 3] corresponding to an opposing king, opposing checker, empty, own checker, and own king, respectively. Thus, rather than encoding the board with a 160-dimensional vector where each dimension corresponds to a location-piece combination, the input space can be reduced to a 32-dimensional vector where each board location is represented by a unique dimension and the piece at that location is encoded by a real number. This is done without any loss of information.
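A minimal sketch of the two input encodings, to show that the compact one loses no information. The one-character piece symbols (`K`, `M`, `.`, `m`, `k`) and the function names are invented for this example.

```python
# Hypothetical symbols: own king, own checker, empty, opposing checker, opposing king.
SYMBOLS = ["k", "m", ".", "M", "K"]
VALUES = [-3, -1, 0, 1, 3]   # the values used in the answer above


def encode_compact(board):
    """32 squares -> 32 numbers, one per board location."""
    return [VALUES[SYMBOLS.index(s)] for s in board]


def encode_one_hot(board):
    """32 squares -> 160 numbers, one per location-piece combination."""
    vector = []
    for s in board:
        vector.extend(1 if s == symbol else 0 for symbol in SYMBOLS)
    return vector


def decode_compact(vector):
    """Recover the board from the compact encoding."""
    return "".join(SYMBOLS[VALUES.index(v)] for v in vector)


# Starting position: 12 own checkers, 8 empty squares, 12 opposing checkers.
start = "M" * 12 + "." * 8 + "m" * 12
compact = encode_compact(start)
one_hot = encode_one_hot(start)
print(len(compact), len(one_hot), decode_compact(compact) == start)
```

For a CNN, the 32 values would then be reshaped into the 8x4 grid mentioned above.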
The more interesting question, at least in my mind, is which output encoding is most conducive to learning. One option is to encode it in the same way as the input. I would advise against this, having found that simplifying the output encoding to a location (of the piece to move) and a direction (along which to move said piece) is much more advantageous for learning. While the reasons for this are likely more subtle, I suspect it is due to the enormous state space of checkers (roughly 5 x 10^20 board positions). Considering that the goal of our predictive model is to accept an input containing an enormous number of possible states and produce one output (i.e., a move) from at most 48 possibilities (12 pieces times 4 possible directions, excluding jumps), a top priority in architecting a neural network should be matching the complexity of its input and output space to that of the actual game. With this in mind, I chose to encode the output as a 32 x 4 matrix, with each row representing a board location and each column representing a direction. During training I simply unraveled this into a 128-dimensional, one-hot encoded vector (using argmax of softmax activations). Note that this output encoding allows many invalid moves for a given board (e.g., moves off the board from edges and corners, moves to occupied locations, etc.). We hope that the neural network can learn valid play given a large enough training set. I found that the CNN did a remarkable job of learning valid moves.
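The output encoding can be sketched as a simple index mapping between the 32 x 4 matrix and the flattened 128-dimensional vector. Row-major unraveling is an assumption here, as are the function names. The numbering of squares and directions is left abstract.

```python
NUM_SQUARES = 32      # rows: board locations
NUM_DIRECTIONS = 4    # columns: move directions


def move_to_index(square, direction):
    """(row, column) of the 32 x 4 matrix -> position in the 128-vector."""
    return square * NUM_DIRECTIONS + direction


def index_to_move(index):
    """Position in the 128-vector -> (square, direction)."""
    return divmod(index, NUM_DIRECTIONS)


def one_hot_move(square, direction):
    """Training target: 128 zeros with a single 1 at the chosen move."""
    target = [0] * (NUM_SQUARES * NUM_DIRECTIONS)
    target[move_to_index(square, direction)] = 1
    return target


def pick_move(scores):
    """Argmax over the 128 network outputs, mapped back to a move."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return index_to_move(best)


target = one_hot_move(11, 1)
scores = [0.0] * 128
scores[45] = 0.9      # pretend the network prefers output 45
print(target.index(1), pick_move(scores))
```

Nothing in this mapping prevents the argmax from landing on an illegal move. As noted above, legality has to be learned from the training data.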
I’ve written more about this project at http://www.chrislarson.io/checkers-p1.
I've done this sort of thing with Tic-Tac-Toe. There are several ways to represent this. One of the most common for TTT is to have input and output layers that each represent the entire board. In TTT this becomes 9 x hidden x 9, with an input of -1 for X, 0 for none, and 1 for O. The input to the neural network is then the current state of the board, and the output is the desired move. Whichever output neuron has the highest activation is the move.
Propagation training will not work too well here because you will not have a finite training set. Something like Simulated Annealing, PSO, or anything with a score function would be ideal. Pitting the networks against each other for the scoring function would be great.
This worked somewhat well for TTT. I am not sure how it would work for Checkers. Chess would likely destroy it. For Go it would likely be useless.
The problem is that the neural network will learn patterns only at fixed locations. For example, jumping an opponent in the top-left corner would be a totally different situation than jumping someone in the bottom-left corner. These would have to be learned separately.
Perhaps it is better to represent the exact state of the board in a position-independent way. This would require some thought. For instance, you might communicate what "jump" opportunities exist, what move-towards-king-square opportunities exist, etc., and allow the net to learn to prioritize these.
I've tried all the possibilities, and intuitively I can say that the best idea is to separate all possibilities for all squares. Concretely:
0 0 0: free
1 0 0: white piece
0 0 1: black piece
1 1 0: white king
0 1 1: black king
It is also possible to add other parameters describing the game situation, such as the number of pieces under threat or the number of possible jumps.
Please see this thesis
In Blondie24, page 46, there is a description of the input for the neural network.