I am trying to make sure I'm using the correct terminology. The below diagram shows the MNIST example
X is a 784-element row vector
W is a 784x10 matrix
b is a 10-element row vector
The output of the linear box is fed into softmax.
The output of softmax is fed into the distance function cross-entropy
How many layers are in this NN? What are the input and hidden layer in that example?
Similarly, how many layers are in the answer below? If my understanding is correct, is it 3 layers?
Edit
@lejlot Does the below represent a 3-layer NN with 1 hidden layer?
Take a look at this picture:
http://cs231n.github.io/assets/nn1/neural_net.jpeg
In your first picture you have only two layers:
Input layer -> 784 neurons
Output layer -> 10 neurons
Your model is too simple: W directly contains the connections between the input and the output, and b contains the bias terms.
With no hidden layer you obtain a linear classifier, because a linear combination of linear combinations is again a linear combination. The hidden layers are what introduce non-linear transformations into your model.
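To make the "linear combination of linear combinations" point concrete, here is a small NumPy sketch (the shapes and values are arbitrary) showing that two stacked linear layers with no activation in between collapse into a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=784)          # input vector
W1 = rng.normal(size=(784, 100))  # first weight matrix
b1 = rng.normal(size=100)
W2 = rng.normal(size=(100, 10))   # second weight matrix
b2 = rng.normal(size=10)

# Two stacked linear layers, no non-linearity in between...
y_stacked = (x @ W1 + b1) @ W2 + b2

# ...are exactly equivalent to one linear layer with merged parameters.
W = W1 @ W2
b = b1 @ W2 + b2
y_single = x @ W + b

print(np.allclose(y_stacked, y_single))  # True
```

This is why only a non-linearity between the layers adds any representational power.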
In your second picture you have 3 layers, but you have confused the notation:
The input layer is the vector x where you place an input data.
Then the operation -> w -> +b -> f() -> is the connection between the first layer and the second layer.
The second layer is the vector where you store the result z = f(xw1 + b1).
Then softmax(zw2 + b2) is the connection between the second and the third layer.
The third layer is the vector y where you store the final result y = softmax(zw2 + b2).
Cross-entropy is not a layer; it is the cost function used to train your neural network.
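A minimal NumPy sketch of the forward pass described above (the hidden size of 100 and the use of tanh as f are arbitrary choices for illustration):

```python
import numpy as np

def softmax(v):
    # numerically stable softmax
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)

x = rng.normal(size=784)          # input layer (e.g. a flattened MNIST image)
w1 = rng.normal(size=(784, 100))  # connection: input -> hidden
b1 = rng.normal(size=100)
w2 = rng.normal(size=(100, 10))   # connection: hidden -> output
b2 = rng.normal(size=10)

z = np.tanh(x @ w1 + b1)          # hidden layer: z = f(x.w1 + b1), f non-linear
y = softmax(z @ w2 + b2)          # output layer: y = softmax(z.w2 + b2)

print(y.shape)                    # 10 class probabilities summing to 1
```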
EDIT:
One more thing: if you want to obtain a non-linear classifier, you must add a non-linear transformation in every hidden layer. In the example I have described, f() should be a non-linear function (for example sigmoid, softsign, ...):
z=f(xw1+b1)
If you add a non-linear transformation only in the output layer (the softmax function that you have at the end), your outputs are still linear classifiers.
That has 1 hidden layer.
The answer you link to I would call a 2-hidden-layer NN.
Your input-layer is the X-vector.
Your layer Wx + b is the hidden layer, i.e. the box in your picture.
The output layer is the softmax.
The cross-entropy is your loss/cost function, and is not a layer at all.
Related
I was going through one of the deep learning lectures from MIT on CNNs. It said that when multiplying weights with pixel values, a non-linear activation function like ReLU can be applied to every pixel. I understand why it should be applied in a simple neural network, since it introduces non-linearity into our input data. But why would I want to apply it to a single pixel? Or am I getting it wrong?
You may have got it a little wrong.
When they say "multiplying weights with pixel values" - they refer to the linear operation of multiplying the filter (weights + bias) with the pixels of the image. If you think about it, each filter in a CNN essentially represents a linear equation.
For example - if we're looking at a 2*2 filter, the filter is essentially computing x1 * w1 + x2 * w2 + x3 * w3 + x4 * w4 + b for every 2*2 patch of the image it goes over. (In the above equation, x1, x2, x3, x4 refer to pixels of the image, while w1, w2, w3, w4 refer to the weights present in the CNN filter.)
Now, hopefully it's fairly clear that the filter is essentially computing a linear equation. To be able to perform a task like, say, image classification, we require some amount of non-linearity. This is achieved, most popularly, by using the ReLU activation function.
So you aren't applying non-linearity to a "pixel" per se; you're still applying it to a linear operation (as in a vanilla neural network), one which consists of pixel values multiplied by the weights present in a filter.
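As a concrete sketch, here is a toy NumPy example (the image size, filter values, and bias are made up) where ReLU is applied to the result of each patch's linear operation, not to the raw pixels:

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.normal(size=(6, 6))  # a toy single-channel "image"
w = rng.normal(size=(2, 2))      # 2x2 filter weights
b = 0.1                          # bias

def relu(v):
    return np.maximum(v, 0.0)

# Slide the filter over every 2x2 patch: each step is the linear
# operation sum(patch * w) + b, and ReLU is applied to that result.
out = np.empty((5, 5))
for i in range(5):
    for j in range(5):
        patch = image[i:i + 2, j:j + 2]
        out[i, j] = relu(np.sum(patch * w) + b)

print(out.shape)  # (5, 5)
```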
Hope this cleared your doubt, feel free to reach out for more help!
I'm new to neural networks. I have followed some tutorials on a lot of platforms, but there is one thing that I don't understand.
In a simple multilayer perceptron:
We have the input layer, a hidden layer for this example (with the same number of neurons as the input layer), and an output layer with one unit.
We initialize the weights of the units in the hidden layer randomly, but in a range of small values.
Now, the input layer is fully connected with the hidden layer.
So each unit in the hidden layer is going to receive the same parameters. How are they going to extract different features from each other?
Thanks for the explanation!
We initialize the weights of the units in the hidden layer randomly, but in a range of small values. Now, the input layer is fully connected with the hidden layer. So each unit in the hidden layer is going to receive the same parameters. How are they going to extract different features from each other?
Actually, each neuron will not have the same value. To get the activations of the hidden layer you use the matrix equation Wx + b. In this case W is the weight matrix of shape (Hidden Size, Input Size), x is the input vector of the hidden layer of shape (Input Size), and b is the bias of shape (Hidden Size). This results in an activation of shape (Hidden Size).
So while each hidden neuron "sees" the same x vector, it takes the dot product of x with its own random row vector and adds its own random bias, which gives that neuron a different value. The values contained in the W matrix and b vector are what are trained and optimized. Since they have different starting points, they will eventually learn different features through gradient descent.
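A tiny NumPy illustration of this (the sizes are arbitrary): every hidden unit receives the same x, but each combines it with its own random row of W and its own bias, so the activations differ from unit to unit:

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 4, 3
x = np.ones(input_size)  # every hidden unit sees the same input vector

# small random weights and biases, one row / one bias per hidden unit
W = rng.normal(scale=0.1, size=(hidden_size, input_size))
b = rng.normal(scale=0.1, size=hidden_size)

activations = W @ x + b  # shape (hidden_size,), one value per unit

# Same input, but each unit's own random parameters give it a
# different activation, so gradient descent can push them toward
# different features.
print(activations)
```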
I have been trying to understand how unpooling and deconvolution works in DeConvNets.
Unpooling
During the unpooling stage, the activations are restored to the locations of the maximum activations, which makes sense, but what about the remaining activations? Do they need to be restored as well, interpolated in some way, or just filled in as zeros in the unpooled map?
Deconvolution
After the convolution section (i.e., convolution layer, ReLU, pooling), it is common to have more than one feature map output, which would be treated as input channels to successive layers (deconv, ...). How could these feature maps be combined in order to achieve an activation map with the same resolution as the original input?
Unpooling
As etoropov wrote, you can read about unpooling in Visualizing and Understanding Convolutional Networks by Zeiler and Fergus:
Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1 (bottom) for an illustration of the procedure.
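A rough NumPy sketch of this switch mechanism (a toy 4x4 input with 2x2 pooling; not the paper's actual code):

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max pooling that also records the argmax ('switch') locations."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k, 2), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            patch = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(patch.argmax(), patch.shape)
            pooled[i, j] = patch[r, c]
            switches[i, j] = (i*k + r, j*k + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Place each value back at its recorded max location; the rest stays zero."""
    out = np.zeros(shape)
    h, w = pooled.shape
    for i in range(h):
        for j in range(w):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches, x.shape)
print(pooled)    # the 2x2 maxima
print(restored)  # maxima back in place, zeros everywhere else
```

Note the non-maximum positions come back as zeros, which answers the question above.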
Deconvolution
Deconvolution works like this:
You add padding around each pixel
You apply a convolution
For example, in the following illustration the original blue image is padded with zeros (white), the gray convolution filter is applied to get the green output.
Source: What are deconvolutional layers?
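A minimal NumPy sketch of those two steps, assuming stride-2 zero-insertion between pixels plus border padding (the kernel and input values are made up; the "convolution" is cross-correlation, as in most deep learning libraries):

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2D convolution (cross-correlation)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def deconv2d(x, k, stride=2):
    """Transposed convolution: insert zeros between pixels, pad the
    border, then run an ordinary convolution."""
    kh, kw = k.shape
    h, w = x.shape
    # insert (stride - 1) zeros between the input pixels
    dilated = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    dilated[::stride, ::stride] = x
    # pad the border so the kernel can cover the edges
    padded = np.pad(dilated, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return conv2d_valid(padded, k)

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
k = np.ones((3, 3))
y = deconv2d(x, k)  # 2x2 input upsampled to 5x5
print(y.shape)      # (5, 5)
```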
1 Unpooling.
In the original paper on unpooling, remaining activations are zeroed.
2 Deconvolution.
A deconvolutional layer is just the transpose of its corresponding conv layer. E.g. if the conv layer's shape is [height, width, previous_layer_fms, next_layer_fms], then the deconv layer will have the shape [height, width, next_layer_fms, previous_layer_fms]. The weights of the conv and deconv layers are shared! (See this paper, for instance.)
Which operation takes place to produce the output from, say, a 9x9 filter, and pass that output as the input to an MLP?
After the last convolutional layer, you have N feature maps, with WxH resolution. This can be seen as a feature vector X of size NxWxH if you concatenate all the values.
This is how you connect it to an MLP: X acts as the input of a linear transformation with number of rows = MLP output size and number of columns = NxWxH.
Example: a simple convnet with 2 convolutional layers (x) for traffic sign recognition gives:
input: 3 channels, width=32, height=32
layer 1: 108 feature maps, width=14, height=14
layer 2: 200 feature maps, width=5, height=5
2-layer classifier with 100 hidden units, and 43 output classes
So to connect it to the final MLP you reshape the outputs of layer 2 into a vector of 200x5x5=5000 elements.
This vector becomes the input for a linear transform of size 100 (rows) x 5000 (columns).
(x) convolution kernel size = 5, spatial pooling size = 2.
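The reshape-and-connect step above can be sketched in NumPy like this (the weight values are random placeholders, not a trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Output of the last conv layer: 200 feature maps of 5x5,
# as in the traffic-sign example above.
feature_maps = rng.normal(size=(200, 5, 5))

# Flatten into a single vector of 200*5*5 = 5000 elements.
x = feature_maps.reshape(-1)               # shape (5000,)

# First MLP layer: linear transform with 100 rows and 5000 columns.
W = rng.normal(scale=0.01, size=(100, 5000))
b = np.zeros(100)
hidden = np.tanh(W @ x + b)                # shape (100,)

# Final classifier layer: 43 traffic-sign classes.
W_out = rng.normal(scale=0.01, size=(43, 100))
logits = W_out @ hidden                    # shape (43,)
print(x.shape, hidden.shape, logits.shape)
```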
If I have a feed-forward multilayer perceptron with sigmoid activation function, which is trained and has known weights, how can I find the equation of the curve that is approximated by the network (the curve that separates between 2 types of data)?
In general, there is no closed form solution for the input points where your NN output is 0.5 (or 0, in case of -1/1 instead of 0/1).
What is usually done for visualization in a low-dimensional input space is gridding up the input space and computing the contours of the NN output. (The contours are a smooth estimate of what the NN response surface looks like.)
In MATLAB, one would do
[X,Y] = meshgrid(linspace(-1,1), linspace(-1,1));
contour(X, Y, f(X,Y))
where f is your trained NN, and assuming [-1,1] x [-1,1] space.
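For completeness, a rough NumPy analogue of the same idea (the weights below are made up to stand in for a trained 2-input sigmoid MLP; with matplotlib you would draw the boundary with plt.contour(xs, ys, z, levels=[0.5])):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Stand-in parameters for a trained 2-input, 2-hidden-unit, 1-output MLP.
W1 = np.array([[ 3.0, -2.0],
               [-1.0,  4.0]])
b1 = np.array([0.5, -0.5])
w2 = np.array([2.0, -3.0])
b2 = 0.1

def f(x, y):
    """Network output, vectorized over a whole grid of (x, y) points."""
    h1 = sigmoid(W1[0, 0] * x + W1[0, 1] * y + b1[0])
    h2 = sigmoid(W1[1, 0] * x + W1[1, 1] * y + b1[1])
    return sigmoid(w2[0] * h1 + w2[1] * h2 + b2)

# Grid up [-1, 1] x [-1, 1] and evaluate the network everywhere.
xs, ys = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(-1, 1, 200))
z = f(xs, ys)

# The decision boundary is (approximately) where the output crosses 0.5.
boundary = np.abs(z - 0.5) < 0.01
print(z.shape, boundary.any())
```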