I am trying to train a neural network to control a character's speed in two dimensions: x and y, each between -1 and 1 m/s. Currently I split the range into 0.1 m/s intervals, so I end up with 400 output neurons (20 x values * 20 y values). If I increase the accuracy to 0.01, I end up with 40k output neurons. Is there a way to reduce the number of output neurons?
I assume you are treating the problem as a classification problem. At training time you have input X and output Y. Since you are training the neural network for classification, your expected output looks like:
-1 -0.9 ... 0.3 0.4 0.5 ... 1.0 m/s
Y1 = [0, 0, ..., 1, 0, 0, ..., 0] // speed x component
Y2 = [0, 0, ..., 0, 0, 1, ..., 0] // speed y component
Y = [Y1, Y2]
That is, only one neuron outputs 1 for each speed component (x and y direction); all other neurons output 0. In the example above, the expected output is 0.3 m/s in the x direction and 0.5 m/s in the y direction for this training instance. This formulation is actually probably easier to learn and may give better prediction performance, but as you pointed out, it does not scale.
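For concreteness, here is a minimal NumPy sketch of how such one-hot targets could be built; the 0.1 m/s bins and the helper name are my own illustrative choices, not something from the question:

import numpy as np

bins = np.linspace(-1.0, 1.0, 21)   # -1.0, -0.9, ..., 1.0 in 0.1 m/s steps

def one_hot_speed(v, bins=bins):
    # one-hot encode a single speed component into the nearest 0.1 m/s bin
    y = np.zeros(len(bins))
    y[np.argmin(np.abs(bins - v))] = 1.0
    return y

Y1 = one_hot_speed(0.3)            # x component: 1 at the 0.3 bin, 0 elsewhere
Y2 = one_hot_speed(0.5)            # y component: 1 at the 0.5 bin, 0 elsewhere
Y = np.concatenate([Y1, Y2])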
I think you can also treat the problem as a regression problem. In your network, you have one output neuron for each speed component. Your expected output is just:
Y = [0.3, 0.5] // for the same training instance you have.
To get an output range of -1 to 1, you have different options for the activation function in the output layer. For example, you can use
f(x) = 2 * (Sigmoid(x) - 0.5)
Sigmoid(x) = 1 / (1 + exp(-x))
Since Sigmoid(x) is in (0, 1), 2 * (Sigmoid(x) - 0.5) is in (-1, 1). This change (replacing the many output neurons with just two) greatly decreases the complexity of the model, so you might want to add more neurons in the hidden (middle) layer to avoid underfitting.
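As a hedged sketch of the regression variant (I'm assuming Keras here; the hidden-layer sizes and the 8-dimensional state input are illustrative, not from the question):

import tensorflow as tf

def scaled_sigmoid(x):
    # 2 * (Sigmoid(x) - 0.5), which maps any real input into (-1, 1)
    return 2.0 * (tf.sigmoid(x) - 0.5)

n_features = 8   # size of the character's state vector (illustrative)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(n_features,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2, activation=scaled_sigmoid),   # [vx, vy], each in (-1, 1)
])
model.compile(optimizer='adam', loss='mse')

Note that 2 * (Sigmoid(x) - 0.5) is the same function as tanh(x / 2), so the built-in tanh activation would work just as well.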
I'm trying to implement my own neural network, but I'm not very confident that my math is correct.
I'm doing MNIST digit recognition, so I have a softmaxed output of 10 probabilities. I then compute my output delta as:
delta_output = vector of outputs - one-hot encoded actual label
delta_output is a matrix of dimension 10 x 1.
Then I compute the delta for the weights of the last hidden layer thus:
delta_hidden = weight_hidden.transpose * delta_output * der_hidden_activation(output_hidden)
Assuming there are N nodes in the last hidden layer, weight_hidden is a matrix of dimension of N by 10, delta_output from above is 10x1, and the result of der_hidden_activation(output_hidden) is N x 1.
Now, my first question here is, should the multiplication of delta_output and der_hidden_activation(output_hidden) return a 10 x N matrix using outer product? I think I need to do a Hadamard product of this resulting matrix with the untouched weights to get delta_hidden to be N x 10 still.
Finally I multiply this delta_hidden by my learning rate and subtract it from the original weight of the last hidden layer to get my new weights.
My second and final question here is, did I miss anything?
Thanks in advance.
Assuming there are N nodes in the last hidden layer, weight_hidden is a matrix of dimension of N by 10, delta_output from above is 10x1, and the result of der_hidden_activation(output_hidden) is N x 1.
When going from a layer of N neurons (the hidden layer) to a layer of M neurons (the output layer), under matrix multiplication the weight matrix should have dimensions M x N. So for the backpropagation stage, going from the layer of M neurons (output) back to the layer of N neurons (hidden), use the transpose of the weight matrix, which gives you a matrix of dimension N x M.
Now, my first question here is, should the multiplication of delta_output and der_hidden_activation(output_hidden) return a 10 x N matrix using outer product?
I think I need to do a Hadamard product of this resulting matrix with the untouched weights to get delta_hidden to be N x 10 still.
Yes, you need to use a Hadamard product; however, you can't take that product of delta_output and der_hidden_activation(output_hidden) directly, since these are matrices of different dimensions (10 x 1 and N x 1 respectively). Instead you multiply the transpose of the hidden_weight matrix (N x 10) by delta_output (10 x 1) to get an N x 1 matrix, and then perform the Hadamard product with der_hidden_activation(output_hidden).
If I am translating this correctly...
hidden_weight matrix = W, the 10 x N matrix connecting the N hidden neurons to the 10 output neurons
delta_output = a 10 x 1 column vector
delta_hidden = an N x 1 column vector
der_hidden_activation(output_hidden) = an N x 1 column vector of elementwise activation derivatives
Plugging this into the BP formula:
delta_hidden = (hidden_weight.transpose * delta_output) ⊙ der_hidden_activation(output_hidden)   // ⊙ = Hadamard (elementwise) product
As you can see, you first multiply the transpose of weight_hidden (N x 10) with delta_output (10 x 1) to produce an N x 1 matrix, and then you take the Hadamard product with der_hidden_activation(output_hidden).
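As a minimal NumPy sketch of that step (the value of N, the random values and the sigmoid derivative are illustrative assumptions, not taken from the question):

import numpy as np

N = 30                                   # hidden layer size (illustrative)
weight_hidden = np.random.randn(10, N)   # connects N hidden neurons to 10 outputs
delta_output = np.random.randn(10, 1)    # delta from the softmax output layer
output_hidden = 1 / (1 + np.exp(-np.random.randn(N, 1)))   # hidden activations

def der_hidden_activation(a):
    return a * (1 - a)                   # sigmoid derivative, written in terms of the activation

# (N x 10) @ (10 x 1) -> (N x 1), then Hadamard product with the (N x 1) derivative
delta_hidden = (weight_hidden.T @ delta_output) * der_hidden_activation(output_hidden)
print(delta_hidden.shape)                # (N, 1)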
Finally I multiply this delta_hidden by my learning rate and subtract it from the original weight of the last hidden layer to get my new weights.
You don't multiply delta_hidden by the learning rate directly. The learning rate is applied to the bias delta and to the delta weight matrix.
The delta weight matrix is a matrix of the same dimensions as your (hidden-to-output) weight matrix and is calculated as the outer product of the output delta and the hidden activations:
delta_weight = delta_output * output_hidden.transpose   // (10 x 1) * (1 x N) = (10 x N)
And then you can easily apply the learning rate:
weight_hidden = weight_hidden - learning_rate * delta_weight
bias_output = bias_output - learning_rate * delta_output
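Continuing the NumPy sketch above, a hedged version of that update (bias_output is a name I'm introducing here for the output-layer bias):

import numpy as np

learning_rate = 0.01
bias_output = np.zeros((10, 1))          # output-layer bias (illustrative)

# delta weight matrix: outer product of the output delta with the hidden activations,
# i.e. (10 x 1) @ (1 x N) = (10 x N), the same shape as weight_hidden
delta_weight = delta_output @ output_hidden.T

weight_hidden -= learning_rate * delta_weight
bias_output -= learning_rate * delta_output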
Incidentally, I just answered a similar question on AISE which might help shed some light and goes into more detail about the matrices to use during backpropagation.
I'm learning regularization in neural networks from the deeplearning.ai course. In the dropout regularization lecture, the professor says that if dropout is applied, the calculated activation values will be smaller than when dropout is not applied (at test time). So we need to scale the activations in order to keep the testing phase simpler.
I understood this fact, but I don't understand how scaling is done. Here is a code sample which is used to implement inverted dropout.
keep_prob = 0.8  # 0 <= keep_prob <= 1
l = 3  # this code is only for layer 3
# entries where the generated number is less than 0.8 are KEPT: ~80% stay, ~20% are dropped
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # zero out the dropped activations
# scale a3 back up so the expected value of the output is unchanged
# (this division is what makes it "inverted" dropout)
a3 = a3 / keep_prob
In the above code, why are the activations divided by 0.8, i.e. the probability of keeping a node in a layer (keep_prob)? Any numerical example will help.
I got the answer by myself after spending some time understanding the inverted dropout. Here is the intuition:
We are preserving the neurons in any layer with probability keep_prob. Let's say keep_prob = 0.6. This means shutting down 40% of the neurons in that layer. If the expected output of the layer before shutting down 40% of the neurons was x, then after applying 40% dropout it is reduced by 0.4x, so it becomes x - 0.4x = 0.6x.
To restore the original output (expected value), we need to divide the output by keep_prob (0.6 here): 0.6x / 0.6 = x.
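As a quick numerical check of that intuition (the shapes and seed below are arbitrary):

import numpy as np

np.random.seed(0)
keep_prob = 0.8
a3 = np.random.rand(5, 1000)                   # pretend activations of layer 3, mean ~0.5

mask = np.random.rand(*a3.shape) < keep_prob   # keep ~80% of the units
dropped = a3 * mask                            # mean shrinks to roughly 0.8 * the original
scaled = dropped / keep_prob                   # inverted dropout restores the expected value

print(a3.mean(), dropped.mean(), scaled.mean())
# roughly 0.50, 0.40, 0.50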
Another way of looking at it could be:
TL;DR: Even though due to dropout we have fewer neurons, we want the neurons to contribute the same amount to the output as when we had all the neurons.
With dropout = 0.20, we're "shutting down 20% of the neurons", which is the same as "keeping 80% of the neurons".
Say the expected output of the layer is x. "Keeping 80%" concretely leaves 0.8 * x. Dividing by keep_prob scales it back up to the original value:
x = 0.8 * x # x is 80% of what it used to be
x = x/0.8 # x is scaled back up to its original value
Now, the purpose of the inverting is to ensure that the Z value will not be impacted by the reduction in the activations. (Coursera)
Because dropout scales down a3 (by a factor of keep_prob in expectation), we're inadvertently also scaling down the value of z4 (since z4 = W4 * a3 + b4). To compensate for this scaling, we need to divide a3 by keep_prob to scale it back up. (Stack Overflow)
# keep 80% of the neurons
keep_prob = 0.8
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)
# Scale it back up
a3 = a3 / keep_prob
# this way z4 is not affected
z4 = np.dot(W4, a3) + b4
What happens if you don't scale?
With scaling:
-------------
Cost after iteration 0: 0.6543912405149825
Cost after iteration 10000: 0.061016986574905605
Cost after iteration 20000: 0.060582435798513114
On the train set:
Accuracy: 0.9289099526066351
On the test set:
Accuracy: 0.95
Without scaling:
-------------
Cost after iteration 0: 0.6634619861891963
Cost after iteration 10000: 0.05040089794130624
Cost after iteration 20000: 0.049722351029060516
On the train set:
Accuracy: 0.933649289099526
On the test set:
Accuracy: 0.95
Though this is just a single example with one dataset, I'm not sure if it makes a major difference in shallow neural networks. Perhaps it pertains more to deeper architectures.
I learned from several articles that to compute the gradients for the filters, you just do a convolution with the input volume as the input and the error matrix as the kernel. After that, you just subtract the gradients (multiplied by the learning rate) from the filter weights. I implemented this process but it's not working.
I even tried doing the backpropagation process myself with pen and paper, but the gradients I calculated don't make the filters perform any better. So am I understanding the whole process wrong?
Edit:
I will provide an example of my understanding of the backpropagation in CNNs and the problem with it.
Consider a randomised input matrix for a convolutional layer:
1, 0, 1
0, 0, 1
1, 0, 0
And a randomised weight matrix:
1, 0
0, 1
The output would be (after applying the ReLU activation):
1, 1
0, 0
The target for this layer is a 2x2 matrix filled with zeros. This way, we know the weight matrix should be filled with zeros also.
Error:
-1, -1
0, 0
By applying the process as stated above, the gradients are:
-1, -1
1, 0
So the new weight matrix is:
2, 1
-1, 1
This is not getting anywhere. If I repeat the process, the filter weights just grow to extremely large values. So I must have made a mistake somewhere. What is it that I'm doing wrong?
I'll give you a full example. It's not going to be short, but hopefully you will get it. I'm omitting both the bias and the activation functions for simplicity, but once you get it, it's simple enough to add those too. Remember, backpropagation is essentially the SAME in a CNN as in a simple MLP, but instead of multiplications you have convolutions. So, here's my sample:
Input:
.7 -.3 -.7 .5
.9 -.5 -.2 .9
-.1 .8 -.3 -.5
0 .2 -.1 .6
Kernel:
.1 -.3
-.5 .7
Doing the convolution yields (the result of the 1st convolutional layer, and the input to the 2nd convolutional layer):
.32 .27 -.59
.99 -.52 -.55
-.45 .64 .13
L2 Kernel:
-.5 .1
.3 .9
L2 activation:
.73 .29
.37 -.63
Here you would have a flatten layer and a standard MLP or SVM to do the actual classification. During backpropagation you'll receive a delta, which for fun let's assume is the following:
-.07 .15
-.09 .02
This will always be the same size as your activation before the flatten layer. Now, to calculate the kernel's delta for the current L2 kernel, you convolve L1's activation with the above delta. I'm not writing this down again, but the result will be:
.17 .02
-.05 .13
Updating the kernel is done as L2.Kernel -= LR * ROT180(dL2.K), meaning you first rotate the above 2x2 matrix by 180 degrees and then update the kernel. For our toy example this turns out to be:
-.51 .11
.3 .9
Now, to calculate the delta for the first convolutional layer, recall that in an MLP you had: current_delta * current_weight_matrix. Well, in a conv layer you have pretty much the same thing: you convolve the original kernel (before the update) of the L2 layer with your delta for the current layer. But this convolution is a full convolution. The result turns out to be:
.04 -.08 .02
.02 -.13 .14
-.03 -.08 .01
With this you move to the 1st convolutional layer, and convolve the original input with this 3x3 delta to get the kernel delta:
.16 .03
-.09 .16
And update your L1 kernel the same way as above:
.08 -.29
-.5 .68
Then you can start over from the feed-forward pass. The above calculations were rounded to 2 decimal places, and a learning rate of 0.1 was used for calculating the new kernel values.
TLDR:
You get a delta
You calculate the delta that will be passed to the next (earlier) layer as: FullConvolution(Li.W, delta)
Calculate the kernel delta that is used to update the kernel: Convolution(Li.Input, delta)
Go to the next layer and repeat (see the sketch below).
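Here is a rough NumPy/SciPy sketch of those steps for a single-channel layer. It assumes the forward pass is implemented as cross-correlation (the convention in most deep learning libraries), which absorbs the ROT180 step from the walkthrough above into the convolution convention; the shapes mirror the toy example:

import numpy as np
from scipy.signal import correlate2d, convolve2d

def conv_layer_backward(layer_input, kernel, delta, lr=0.1):
    # assumes the forward pass was: output = correlate2d(layer_input, kernel, 'valid')
    kernel_grad = correlate2d(layer_input, delta, mode='valid')   # gradient for the kernel
    prev_delta = convolve2d(delta, kernel, mode='full')           # delta passed to the previous layer
    new_kernel = kernel - lr * kernel_grad                        # gradient-descent update
    return new_kernel, prev_delta

x = np.random.randn(4, 4)       # 4x4 input, as in the example
k = np.random.randn(2, 2)       # 2x2 kernel
delta = np.random.randn(3, 3)   # 3x3 delta arriving from above
new_k, prev_delta = conv_layer_backward(x, k, delta)
print(new_k.shape, prev_delta.shape)   # (2, 2) (4, 4)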
I don't really follow how they came up with the derivative equation. Could somebody please explain in some detail, or link to somewhere with a sufficient mathematical explanation?
The Laplacian filter looks like:
[ 0,  1, 0
  1, -4, 1
  0,  1, 0 ]
Monsieur Laplace came up with this equation. It is simply the definition of the Laplace operator: the sum of second-order derivatives, ∇²f = ∂²f/∂x² + ∂²f/∂y² (you can also see it as the trace of the Hessian matrix).
The second equation you show is the finite difference approximation to a second derivative. It is the simplest approximation you can make for discrete (sampled) data. The derivative is defined as the slope (equation from Wikipedia):
f'(x) = lim[h→0] (f(x+h) - f(x)) / h
In a discrete grid, the smallest h is 1, so the derivative is f(x+1) - f(x). Because it uses the pixel at x and the one to its right, this derivative introduces a half-pixel shift (i.e. you compute the slope in between these two pixels). To get the 2nd-order derivative, simply compute the derivative of the result of the first derivative:
f'(x) = f(x+1) - f(x)
f'(x+1) = f(x+2) - f(x+1)
f"(x) = f'(x+1) - f'(x)
= f(x+2) - f(x+1) - f(x+1) + f(x)
= f(x+2) - 2*f(x+1) + f(x)
Because each derivative introduces a half-pixel shift, the 2nd-order derivative ends up with a full one-pixel shift. We can therefore shift the output left by one pixel to remove the bias, which leads to the sequence f(x+1) - 2*f(x) + f(x-1).
Computing this 2nd-order derivative is the same as convolving with the filter [1, -2, 1].
Applying this filter, and also its transpose, and adding the results, is equivalent to convolving with the kernel
[ 0, 1, 0       [ 0, 0, 0       [ 0, 1, 0
  1,-4, 1   =     1,-2, 1   +     0,-2, 0
  0, 1, 0 ]       0, 0, 0 ]       0, 1, 0 ]
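A quick NumPy check of this decomposition (just the kernel arithmetic, no image involved):

import numpy as np

d2 = np.array([1, -2, 1])      # the 1-D second-derivative filter from above

horizontal = np.zeros((3, 3))
horizontal[1, :] = d2          # [1,-2,1] placed in the middle row
vertical = np.zeros((3, 3))
vertical[:, 1] = d2            # [1,-2,1] placed in the middle column

print(horizontal + vertical)   # the standard 3x3 Laplacian kernel
# [[ 0.  1.  0.]
#  [ 1. -4.  1.]
#  [ 0.  1.  0.]]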
I'm following the TensorFlow tutorial
Initially x is defined as
x = tf.placeholder(tf.float32, shape=[None, 784])
Later on it reshapes x, and I'm trying to understand why.
To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])
What does -1 mean in the reshaping vector and why is x being reshaped?
1) What does -1 mean in the reshaping vector
From the documentation of reshape:
If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant. In particular, a shape of [-1] flattens into 1-D. At most one component of shape can be -1.
This is a standard feature and is available in NumPy as well. Basically it means: "I don't want to work out this dimension myself, so infer it for me." In your case, 28 * 28 * 1 = 784 accounts for each image, so the -1 is inferred as the batch dimension (the number of images in x).
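A small NumPy illustration of the -1 inference (the batch size of 50 is arbitrary):

import numpy as np

batch = np.zeros((50, 784))              # 50 flattened 28x28 grayscale images
images = batch.reshape(-1, 28, 28, 1)    # -1 is inferred from the total size
print(images.shape)                      # (50, 28, 28, 1)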
2) Why is x being reshaped
They are planning to use convolution for image classification, so they need some spatial information. The current data is flat (one 784-element vector per image), so they transform it to 4 dimensions. I'm not sure the fourth dimension is strictly needed here, because in my opinion they could have used only (x, y, color), or even (x, y) since the images are grayscale. Try modifying their reshape and convolution and you will most probably get similar accuracy.
why 4 dimensions
TensorFlow's convolutional conv2d operation expects a 4-dimensional tensor with dimensions corresponding to batch, height, width and channel:
[batch, in_height, in_width, in_channels]
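As a rough sketch in the TF1 style of that tutorial (the 5x5 filter size and the 32 output channels are illustrative choices of mine), conv2d consumes exactly this 4-D layout:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 784])
x_image = tf.reshape(x, [-1, 28, 28, 1])   # [batch, in_height, in_width, in_channels]

# illustrative filter bank: 5x5 patches, 1 input channel, 32 output channels
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
h_conv1 = tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='SAME')
print(h_conv1.shape)                       # (?, 28, 28, 32)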