I have seen many ML tutorials explain a fully connected network by constructing two matrices, a weight matrix and an input (or activation) matrix, and performing a matrix-to-matrix multiplication (matmul) to form the linear equations.
All the examples I have seen place the input as the first argument to matmul and the weight tensor as the second argument. Why is that? Why can't I compute weights times input (assuming the weight matrix is created properly, with a column count equal to the input matrix's row count)?
For an (n×1) input, to get an (n×1) output you can multiply the input by an (n×n) matrix from the left, or by a (1×1) matrix from the right.
If you multiply the input by a scalar (a (1×1) matrix), then there is a single connection from each input neuron to its output neuron. If you multiply it by a matrix, each output cell is a weighted sum of the input neurons; in other words, every input neuron is connected to every output neuron, which is what "fully connected" means.
As long as this logic is preserved, it doesn't matter how you arrange your weight matrices: input-times-weights and weights-times-input differ only by a transpose.
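A minimal NumPy sketch of the two conventions (the sizes n_in and n_out are arbitrary assumptions):

import numpy as np

n_in, n_out = 4, 3
x = np.random.randn(n_in)           # input vector, shape (n_in,)
W = np.random.randn(n_out, n_in)    # weight matrix, shape (n_out, n_in)

y1 = W @ x        # "weights times input"  -> shape (n_out,)
y2 = x @ W.T      # "input times weights" with the transposed matrix -> same numbers
assert np.allclose(y1, y2)

Tutorials tend to write the input first because the input usually arrives as a batch of row vectors of shape (batch, n_in), which is then multiplied by an (n_in, n_out) weight matrix; mathematically either layout works.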
While reading about linear regression in Chapter 2 of the book "The Elements of Statistical Learning", I came across two equations and failed to understand how the 2nd was derived from the 1st.
Background:
How do we fit the linear model to a set of training data? There are many different methods, but by far the most popular is the method of least squares. In this approach, we pick the coefficients β to minimize the residual sum of squares
Equation 1: RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2
RSS(β) is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation. We can write
Equation 2: RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)
where X is an N × p matrix with each row an input vector, and y is an N-vector of the outputs in the training set.
1st equation: RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2
2nd equation: RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)
I got it. The RHS of the 2nd equation is the matrix form: the residual vector y − Xβ is multiplied by its own transpose, and carrying out that matrix multiplication expands it back into the sum of squares of the 1st equation.
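Written out, the step between the two forms is just the inner product of the residual vector with itself (same notation as the book):

\mathrm{RSS}(\beta)
  = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)
  = \sum_{i=1}^{N} \bigl[(\mathbf{y} - \mathbf{X}\beta)_i\bigr]^2
  = \sum_{i=1}^{N} \bigl(y_i - x_i^T \beta\bigr)^2

since the i-th entry of Xβ is x_i^T β, and v^T v = \sum_i v_i^2 for any column vector v.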
I'm new to neural networks. I have followed tutorials on several platforms, but there is one thing that I don't understand.
In a simple multi-layer perceptron:
We have the input layer, one hidden layer for this example (with the same number of neurons as the input layer), and an output layer with one unit.
We initialize the weights of the units in the hidden layer randomly, but within a range of small values.
Now, the input layer is fully connected to the hidden layer.
So each unit in the hidden layer is going to receive the same parameters. How are they going to extract different features from each other?
Thanks for the explanation!
We initialize the weights of the units in the hidden layer randomly, but within a range of small values. Now, the input layer is fully connected to the hidden layer. So each unit in the hidden layer is going to receive the same parameters. How are they going to extract different features from each other?
Actually, each neuron will not have the same value. To get the activations of the hidden layer you use the matrix equation Wx + b. Here W is the weight matrix of shape (Hidden Size, Input Size), x is the input vector of the hidden layer of shape (Input Size), and b is the bias of shape (Hidden Size). This results in an activation of shape (Hidden Size). So while each hidden neuron "sees" the same x vector, it takes the dot product of x with its own random row of W and adds its own random bias, which gives that neuron a different value. The values contained in the W matrix and b vector are what are trained and optimized. Since they have different starting points, they will eventually learn different features through gradient descent.
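A small NumPy sketch of that shape arithmetic (the layer sizes are arbitrary assumptions):

import numpy as np

input_size, hidden_size = 5, 3
x = np.random.randn(input_size)                # the same input seen by every hidden unit
W = np.random.randn(hidden_size, input_size)   # each row is one hidden unit's random weights
b = np.random.randn(hidden_size)               # one random bias per hidden unit

h = W @ x + b                                  # activations, shape (hidden_size,)
# h[i] equals W[i] @ x + b[i]: same x, but a different random row and bias,
# so each hidden unit starts from a different value and can learn a different feature.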
I am trying to make sure I'm using the correct terminology. The diagram below shows the MNIST example.
X is a 784-element row vector
W is a 784x10 matrix
b is a 10-element row vector
The output of the linear box is fed into softmax
The output of softmax is fed into the distance function, cross-entropy
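In code, the pipeline described above looks roughly like this (NumPy sketch; the random values and the assumed true label are only for illustration):

import numpy as np

x = np.random.randn(784)               # X: 784-element row vector (one flattened image)
W = 0.01 * np.random.randn(784, 10)    # W: 784x10 matrix
b = np.zeros(10)                       # b: 10-element vector

logits = x @ W + b                     # the linear box
p = np.exp(logits - logits.max())
p /= p.sum()                           # softmax
label = 3                              # assumed true class, for illustration
loss = -np.log(p[label])               # cross-entropy "distance" to the true class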
How many layers are in this NN? What are the input and hidden layers in this example?
Similarly, how many layers are in this answer? If my understanding is correct, is it 3 layers?
Edit
@lejlot Does the below represent a 3-layer NN with 1 hidden layer?
Take a look at this picture:
http://cs231n.github.io/assets/nn1/neural_net.jpeg
In your first picture you have only two layers:
Input layer -> 784 neurons
Output layer -> 10 neurons
Your model is too simple: W directly contains the connections between the input and the output, and b contains the bias terms.
With no hidden layer you obtain a linear classifier, because a linear combination of linear combinations is again a linear combination. The hidden layers are what introduce non-linear transformations into your model.
In your second picture you have 3 layers, but you have confused the notation:
The input layer is the vector x where you place an input sample.
Then the operation -> w -> +b -> f() -> is the connection between the first layer and the second layer.
The second layer is the vector where you store the result z = f(xw1 + b1).
Then softmax(zw2 + b2) is the connection between the second and the third layer.
The third layer is the vector y where you store the final result y = softmax(zw2 + b2).
Cross-entropy is not a layer; it is the cost function used to train your neural network.
EDIT:
One more thing: if you want to obtain a non-linear classifier, you must add a non-linear transformation in every hidden layer. In the example I described, f() must be a non-linear function (for example sigmoid, softsign, ...):
z = f(xw1 + b1)
If you add a non-linear transformation only in the output layer (the softmax function you have at the end), your outputs are still linear classifiers.
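A minimal NumPy sketch of the three-layer version just described, with a sigmoid as the example non-linear f() (the hidden size of 100 is an arbitrary assumption):

import numpy as np

def sigmoid(a):                        # example of a non-linear f()
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

hidden = 100                           # assumed hidden-layer size
x  = np.random.randn(784)              # first layer: the input vector x
w1 = 0.01 * np.random.randn(784, hidden)
b1 = np.zeros(hidden)
w2 = 0.01 * np.random.randn(hidden, 10)
b2 = np.zeros(10)

z = sigmoid(x @ w1 + b1)               # second layer: z = f(xw1 + b1)
y = softmax(z @ w2 + b2)               # third layer: y = softmax(zw2 + b2)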
That has 1 hidden layer.
The answer you link to, I would call a 2-hidden-layer NN.
Your input layer is the X vector.
Your layer Wx + b is the hidden layer, i.e. the box in your picture.
The output layer is the softmax.
The cross-entropy is your loss/cost function, and is not a layer at all.
I am trying to implement the convolutional neural network by LeCun, and I have a few questions.
1) Do I have to apply the activation function to (max_value * weight_value) in the max-pooling layer?
2) If yes, then when backpropagating the error, since I selected only one value from the 2x2 receptive field, how do I distribute the error to the other 3 values in the receptive field? Should I replicate the same error across the whole 2x2 window?
3) If no, then in backpropagation how do I take the derivative of the output, i.e. x*(1-x)? To find the gradient, we need the derivative of the activated weighted sum, i.e. f'(x).
4) Using the stochastic diagonal Levenberg-Marquardt method, what values should I use for eta and mu in equation (21), page 2319 of the LeCun paper http://enpub.fulton.asu.edu/cseml/summer08/papers/cnn-appendix.pdf
I will be thankful for any explanation or code sample.
Regards
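For reference, in the plain max-pooling variant (no weight and no activation on the pooled value), the error in question 2) is usually routed only to the position that produced the maximum, while the other three positions receive zero gradient. A minimal NumPy sketch of that routing for a single 2x2 window; it does not cover LeCun's weighted subsampling layer or equation (21) of the paper:

import numpy as np

window = np.array([[0.2, 0.9],
                   [0.5, 0.1]])        # one 2x2 receptive field from the forward pass
grad_out = 0.7                         # error arriving at the pooled output

grad_in = np.zeros_like(window)
idx = np.unravel_index(np.argmax(window), window.shape)
grad_in[idx] = grad_out                # only the max position gets the error; others stay 0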
If I have a feed-forward multilayer perceptron with a sigmoid activation function, which is trained and has known weights, how can I find the equation of the curve that the network approximates (the curve that separates the two classes of data)?
In general, there is no closed-form solution for the input points where your NN output is 0.5 (or 0, in the case of -1/1 labels instead of 0/1).
What is usually done for visualization in a low-dimensional input space is to grid up the input space and compute the contours of the NN output. (The contours are a smooth estimate of what the NN response surface looks like.)
In MATLAB, one would do
[X, Y] = meshgrid(linspace(-1,1), linspace(-1,1));   % grid over the input space
contour(X, Y, f(X, Y))                               % contour lines of the NN output over the grid
where f is your trained NN evaluated element-wise over the grid, and assuming a [-1,1] x [-1,1] input space.