Generalization of gradient calculation for multi-channel convolutions - machine-learning

I have been trying to understand how backpropagation for conv nets is implemented at the mathematical level. I came across an article which explains gradient calculation graphically for 2D convolution. The conv layer consists of a 3x3 input, and the filter used is 2x2, which on convolution gives a 2x2 layer that is then fully connected. The gradient for this fully connected layer will be of dimension 2x2.
According to the article:
Gradient of conv layer = convolution between gradient of next layer and weights of this layer
But I cannot generalize this for 3 channel inputs.
Let's say our input layer is of dimension 3x3x3 and we use one filter of dimension 2x2x3; the resulting convolution will again be of dimension 2x2, which will then be treated as a fully connected layer.
Now the gradient for the fully connected layer will be 2x2. So, to calculate the gradient for the conv layer, we again need to compute the convolution between the 2x2 gradient layer and the 2x2x3 weight layer, but their dimensions are incompatible.
So I don't understand how to use this formula for calculating the gradient for 3D convolutions. How can I proceed after this step?
A derivation (or an article) with respect to a 3D input would also be really helpful.
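For what it's worth, one way the 2-D rule extends to multiple channels is to apply it once per channel: the 2x2 output gradient is shared by all input channels, and a full convolution with each 2x2 channel slice of the filter gives one 3x3 slice of the input gradient. A minimal NumPy sketch, assuming the forward pass is the usual cross-correlation (the `full_conv2d` helper and all names are illustrative):

```python
import numpy as np

def full_conv2d(a, k):
    """Full 2-D convolution (kernel flipped); output (Ha+Hk-1, Wa+Wk-1)."""
    k = k[::-1, ::-1]                     # flip the kernel
    Ha, Wa = a.shape
    Hk, Wk = k.shape
    pad = np.zeros((Ha + 2 * (Hk - 1), Wa + 2 * (Wk - 1)))
    pad[Hk - 1:Hk - 1 + Ha, Wk - 1:Wk - 1 + Wa] = a
    out = np.empty((Ha + Hk - 1, Wa + Wk - 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (pad[i:i + Hk, j:j + Wk] * k).sum()
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2, 3))       # one 2x2x3 filter
d_out = rng.standard_normal((2, 2))      # gradient of the 2x2 conv output

# d_out is shared across channels, so the 2-D rule is applied
# separately to each 2x2 channel slice of the filter.
d_input = np.stack(
    [full_conv2d(d_out, W[:, :, c]) for c in range(3)], axis=-1
)
print(d_input.shape)  # (3, 3, 3), matching the 3x3x3 input
```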

Related

How do you calculate the gradient of the bias in a convolutional neural network?

I am having a hard time finding resources online about how to perform backpropagation with the bias in a convolutional neural network. By bias I mean the number added to every number resulting from a convolution.
Here is a picture further explaining
I know how to calculate the gradient for the filter's weights, but I am not sure what to do about the biases. Right now I am just adjusting the bias by the average error for that layer. Is this correct?
It is similar to the bias gradient in standard neural networks, but here we sum over all positions of the convolution output:

∂L/∂b = Σ_{i=1..w} Σ_{j=1..h} ∂L/∂x_{i,j}

where L is the loss function, w and h are the width and height of the conv output, and ∂L/∂x_{i,j} is the gradient of the loss w.r.t. the conv output at position (i, j).
Thus, the gradient of b is computed by summing the gradients of the loss w.r.t. the convolution output over every position (i, j).
Hope this helps.
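As a rough NumPy sketch of that sum (the batch and feature-map dimensions are illustrative additions to the 2-D formula):

```python
import numpy as np

rng = np.random.default_rng(0)
# Gradient of the loss w.r.t. the conv output, for a batch:
# shape (batch, height, width, num_filters)
d_out = rng.standard_normal((8, 4, 4, 16))

# There is one scalar bias per filter, so its gradient sums the
# output gradient over the batch and every spatial position.
d_b = d_out.sum(axis=(0, 1, 2))
print(d_b.shape)  # (16,) -- one gradient value per bias
```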

Implementing backward-convolution in CNN for multi-channel data

I have been trying to gain a deeper understanding of the convolution operation as I am implementing a convolutional neural network. But I am stuck trying to calculate the backward pass, or deconvolution.
Let's say the input is a 3-dimensional RGB image of dimension 3x7x7. The filter has dimension 3x3x3. On convolving with stride set to 2, we get an output of dimension 3x3.
Now here is my problem. I have read that deconvolution is the convolution of the output with the flipped kernel. But on flipping, the kernel will still be of dimension 3x3x3, while the output is of dimension 3x3 and the input was of dimension 3x7x7. So how is the deconvolution calculated?
Here is a nice visualisation of how convolution and deconvolution (transposed convolution) work. The white pieces are simply zeros.
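A minimal NumPy sketch of that picture for the 3x7x7 example above, assuming the usual zero-insertion recipe for a strided transposed convolution (all names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out = rng.standard_normal((3, 3))   # gradient at the 3x3 conv output
W = rng.standard_normal((3, 3, 3))    # 3x3x3 filter, channel-first

# Step 1: undo the stride by dilating -- insert stride-1 = 1 zero
# between the elements of the 3x3 map, giving a 5x5 map.
stride = 2
dilated = np.zeros((5, 5))
dilated[::stride, ::stride] = d_out

# Step 2: pad with kernel_size-1 = 2 zeros on each side (the white
# pieces in the visualisation), then slide the flipped kernel over it.
padded = np.pad(dilated, 2)           # 9x9
flipped = W[:, ::-1, ::-1]
d_input = np.empty((3, 7, 7))
for c in range(3):
    for i in range(7):
        for j in range(7):
            d_input[c, i, j] = (padded[i:i + 3, j:j + 3] * flipped[c]).sum()

print(d_input.shape)  # (3, 7, 7), matching the 3x7x7 input
```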

How does a neural net extract features

I'm new to neural networks. I have followed tutorials on several platforms, but there is one thing that I don't understand.
In a simple multi layer perceptron :
We have the input layer, one hidden layer for this example (with the same number of neurons as the input layer), and an output layer with one unit.
We initialize the weights of the units in the hidden layer randomly, but within a range of small values.
Now, the input layer is fully connected with the hidden layer.
So each unit in the hidden layer is going to receive the same parameters. How are they going to extract features different from each other?
Thanks for explanation!
We initialize the weights of the units in the hidden layer randomly but within a range of small values. Now, the input layer is fully connected with the hidden layer. So each unit in the hidden layer is going to receive the same parameters. How are they going to extract features different from each other?
Actually, each neuron will not have the same value. To get the activations of the hidden layer you use the matrix equation Wx + b. Here W is the weight matrix of shape (Hidden Size, Input Size), x is the input vector of the hidden layer of shape (Input Size), and b is the bias of shape (Hidden Size). This results in an activation of shape (Hidden Size). So while each hidden neuron "sees" the same x vector, it takes the dot product of x with its own random row vector and adds its own random bias, which gives that neuron a different value. The values contained in the W matrix and b vector are what are trained and optimized. Since they have different starting points, the neurons will eventually learn different features through gradient descent.
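A tiny NumPy sketch of that point (the sizes are illustrative): each hidden neuron applies its own row of W and its own bias to the shared input, so the activations differ even before any training.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

# Small random initialisation: every hidden neuron gets its OWN row
# of W and its own bias, even though they all see the same input x.
W = rng.normal(scale=0.01, size=(hidden_size, input_size))
b = rng.normal(scale=0.01, size=hidden_size)
x = np.array([1.0, 2.0, 3.0, 4.0])

activations = W @ x + b
print(activations)  # three different values from the same input
```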

How many layers are in this neural network?

I am trying to make sure I'm using the correct terminology. The below diagram shows the MNIST example
X is a 784-dimensional row vector
W is a 784x10 matrix
b is a 10-dimensional row vector
The output of the linear box is fed into softmax
The output of softmax is fed into the distance function, cross-entropy
How many layers are in this NN? What are the input and hidden layer in that example?
Similarly, how many layers are in this answer? If my understanding is correct, then 3 layers?
Edit
@lejlot Does the below represent a 3-layered NN with 1 hidden layer?
Take a look at this picture:
http://cs231n.github.io/assets/nn1/neural_net.jpeg
In your first picture you have only two layers:
Input layer -> 784 neurons
Output layer -> 10 neurons
Your model is too simple: w directly contains the connections between the input and the output, and b contains the bias terms.
With no hidden layer you are obtaining a linear classifier, because a linear combination of linear combinations is a linear combination again. The hidden layers are what include non linear transformations in your model.
In your second picture you have 3 layers, but you have confused the notation:
The input layer is the vector x where you place an input data.
Then the operation -> w -> +b -> f() -> is the connection between the first layer and the second layer.
The second layer is the vector where you store the result z=f(xw1+b1)
Then softmax(zw2+b2) is the connection between the second and the third layer.
The third layer is the vector y where you store the final result y=softmax(zw2+b2).
Cross entropy is not a layer; it is the cost function used to train your neural network.
EDIT:
One more thing: if you want to obtain a non-linear classifier, you must add a non-linear transformation in every hidden layer. In the example I have described, f() must be a non-linear function (for example sigmoid, softsign, ...):
z=f(xw1+b1)
If you add a non-linear transformation only in the output layer (the softmax function that you have at the end), your output is still a linear classifier.
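The three layers described above can be sketched end to end in NumPy; the sizes (784 inputs, 128 hidden units, 10 classes) and the choice of sigmoid for f() are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 784, 128, 10

x = rng.standard_normal(n_in)                    # layer 1: input vector
w1 = rng.standard_normal((n_in, n_hidden)) * 0.01
b1 = np.zeros(n_hidden)
w2 = rng.standard_normal((n_hidden, n_out)) * 0.01
b2 = np.zeros(n_out)

def f(a):                                        # non-linear f(): sigmoid
    return 1.0 / (1.0 + np.exp(-a))

z = f(x @ w1 + b1)                               # layer 2: hidden vector
logits = z @ w2 + b2
y = np.exp(logits) / np.exp(logits).sum()        # layer 3: softmax output

print(y.shape)  # (10,) -- a probability distribution over the classes
```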
That has 1 hidden layer.
The answer you link to I would call a 2-hidden-layer NN.
Your input-layer is the X-vector.
Your layer Wx+b is the hidden layer, aka. the box in your picture.
The output-layer is the Soft-max.
The cross-entropy is your loss/cost function, and is not a layer at all.

How do unpooling and deconvolution work in a DeConvNet

I have been trying to understand how unpooling and deconvolution works in DeConvNets.
Unpooling
During the unpooling stage, the activations are restored to the locations of the maximum-activation selections, which makes sense. But what about the remaining activations? Do they need to be restored as well, or interpolated in some way, or just filled with zeros in the unpooled map?
Deconvolution
After the convolution section (i.e. convolution layer, ReLU, pooling), it is common to have more than one feature-map output, which is treated as the input channels of the successive (deconv) layers. How can these feature maps be combined in order to achieve an activation map with the same resolution as the original input?
Unpooling
As etoropov wrote, you can read about unpooling in Visualizing and Understanding Convolutional Networks by Zeiler and Fergus:
Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1 (bottom) for an illustration of the procedure.
Deconvolution
Deconvolution works like this:
You add padding around each pixel
You apply a convolution
For example, in the following illustration the original blue image is padded with zeros (white), the gray convolution filter is applied to get the green output.
Source: What are deconvolutional layers?
1. Unpooling
In the original paper on unpooling, remaining activations are zeroed.
2. Deconvolution
A deconvolutional layer is just the transpose of its corresponding conv layer. E.g. if the conv layer's shape is [height, width, previous_layer_fms, next_layer_fms], then the deconv layer will have the shape [height, width, next_layer_fms, previous_layer_fms]. The weights of the conv and deconv layers are shared! (See this paper for instance.)
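In NumPy terms, the tied deconv weights are just an axis swap of the conv weights (the shapes here are illustrative):

```python
import numpy as np

# Conv weights: [height, width, previous_layer_fms, next_layer_fms]
conv_w = np.zeros((3, 3, 64, 128))

# The tied deconv layer swaps the two feature-map axes:
# [height, width, next_layer_fms, previous_layer_fms]
deconv_w = conv_w.transpose(0, 1, 3, 2)
print(deconv_w.shape)  # (3, 3, 128, 64)
```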
