I'm trying to implement my own neural network, but I'm not very confident that my math is correct.
I'm doing MNIST digit recognition, so I have a softmaxed output of 10 probabilities. I then compute my delta output as follows:
delta_output = vector of outputs - one-hot encoded actual label
delta_output is a matrix of dimension 10 x 1.
Then I compute the delta for the weights of the last hidden layer thus:
delta_hidden = weight_hidden.transpose * delta_output * der_hidden_activation(output_hidden)
Assuming there are N nodes in the last hidden layer, weight_hidden is a matrix of dimension of N by 10, delta_output from above is 10x1, and the result of der_hidden_activation(output_hidden) is N x 1.
Now, my first question here is, should the multiplication of delta_output and der_hidden_activation(output_hidden) return a 10 x N matrix using the outer product? I think I need to do a Hadamard product of this resulting matrix with the untouched weights to get delta_hidden to be N x 10 still.
Finally I multiply this delta_hidden by my learning rate and subtract it from the original weight of the last hidden layer to get my new weights.
My second and final question here is, did I miss anything?
Thanks in advance.
Assuming there are N nodes in the last hidden layer, weight_hidden is a matrix of dimension of N by 10, delta_output from above is 10x1, and the result of der_hidden_activation(output_hidden) is N x 1.
When going from a layer of N neurons (hidden layer) to a layer of M neurons (output layer), under matrix multiplication the weight matrix dimensions should be M x N. So for the backpropagation stage, going from the layer of M neurons (output) back to the layer of N neurons (hidden), use the transpose of the weight matrix, which gives you a matrix of dimension N x M.
Now, my first question here is, should the multiplication of delta_output and der_hidden_activation(output_hidden) return a 10 x N matrix using the outer product?
I think I need to do a Hadamard product of this resulting matrix with the untouched weights to get delta_hidden to be N x 10 still.
Yes, you need to use the Hadamard product; however, you can't multiply delta_output and der_hidden_activation(output_hidden) directly, since these are matrices of different dimensions (10 x 1 and N x 1 respectively). Instead you multiply the transpose of the hidden_weight matrix (N x 10) by delta_output (10 x 1) to get a matrix of N x 1, and then perform the Hadamard product with der_hidden_activation(output_hidden).
If I am translating this correctly...
hidden_weight matrix = W, of dimension 10 x N (so its transpose is N x 10)
delta_output = d_out, of dimension 10 x 1
delta_hidden = d_hidden, of dimension N x 1
der_hidden_activation(output_hidden) = f'(output_hidden), of dimension N x 1
Plugging this into the BP formula:
d_hidden = (W.transpose * d_out) ∘ f'(output_hidden)
(N x 1) = ((N x 10) * (10 x 1)) ∘ (N x 1)
where * is matrix multiplication and ∘ is the Hadamard product.
As you can see, you need to multiply the transpose of weight_hidden (N x 10) with delta_output (10 x 1) first to produce a matrix (N x 1), and then you take the Hadamard product with der_hidden_activation(output_hidden).
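If it helps to see the shapes concretely, here is a minimal numpy sketch (N = 5 and the random values are purely illustrative):

import numpy as np

N = 5                                     # hypothetical number of hidden nodes
weight_hidden = np.random.randn(10, N)    # 10 x N (output nodes x hidden nodes)
delta_output = np.random.randn(10, 1)     # 10 x 1
der_hidden = np.random.randn(N, 1)        # N x 1, der_hidden_activation(output_hidden)

# (N x 10) @ (10 x 1) -> N x 1, then Hadamard product with the N x 1 derivative
delta_hidden = (weight_hidden.T @ delta_output) * der_hidden
print(delta_hidden.shape)                 # (5, 1)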
Finally I multiply this delta_hidden by my learning rate and subtract it from the original weight of the last hidden layer to get my new weights.
You don't multiply delta_hidden itself by the learning rate. You apply the learning rate to the bias and to a delta weight matrix.
The delta weight matrix is a matrix of the same dimensions as your (hidden) weight matrix and is calculated as the outer product of the delta of the layer the weights feed into with the activations of the layer they come from:
delta_weight = delta_output * output_hidden.transpose
And then you can easily apply the learning rate:
weight_hidden = weight_hidden - learning_rate * delta_weight
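A minimal sketch of that update, with the same (illustrative) shapes as above:

import numpy as np

N = 5                                      # hypothetical number of hidden nodes
weight_hidden = np.random.randn(10, N)     # 10 x N
delta_output = np.random.randn(10, 1)      # 10 x 1
output_hidden = np.random.randn(N, 1)      # N x 1 hidden layer activations
learning_rate = 0.01

delta_weight = delta_output @ output_hidden.T              # (10 x 1) @ (1 x N) -> 10 x N
weight_hidden = weight_hidden - learning_rate * delta_weight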
Incidentally, I just answered a similar question on AISE which might help shed some light; it goes into more detail about the matrices to use during backpropagation.
I'm learning "Harris Corner Detector" algorithm,
and stuck here that why Harris matrix is positive semi-definite.
Since Harris matrix's trace is positive, so I can tell Harris matrix's two eigenvalues are all positive or one positive and one negative.
So, how to derivate Harris matrix is positive semi-definite?
The matrix generated in the computation of the Harris corner detector is the structure tensor (see here on Wikipedia). The structure tensor M is a matrix created by the outer product of the gradient field g with itself:
g = gradient( image );
M = smooth( g * g' );
(where smooth is the local smoothing that is applied).
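As a rough numpy sketch of the same computation (the Gaussian smoothing and the sigma value are just one possible choice for smooth, and image is assumed to be a 2-D grayscale array):

import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor(image, sigma=1.0):
    gy, gx = np.gradient(image.astype(float))   # gradient field g
    # outer product g * g' at every pixel, then local smoothing
    Ixx = gaussian_filter(gx * gx, sigma)
    Ixy = gaussian_filter(gx * gy, sigma)
    Iyy = gaussian_filter(gy * gy, sigma)
    return Ixx, Ixy, Iyy                         # entries of the 2x2 matrix M at each pixel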
Without any smoothing, g * g' would always have one positive eigenvalue and one 0 eigenvalue, by construction. You can see this by writing out the determinant of the resulting matrix, which is always 0, meaning that one of the eigenvalues must be 0 (their product is the determinant). The other one must be positive because the trace is the sum of two squares; since one eigenvalue is 0, the other eigenvalue must be equal to the trace.
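You can check this on a single (hypothetical) gradient vector:

import numpy as np

g = np.array([[2.0], [3.0]])       # one gradient vector
M = g @ g.T                        # g * g', a rank-1 outer product
print(np.linalg.det(M))            # 0.0, so one eigenvalue is 0
print(np.linalg.eigvalsh(M))       # [0., 13.]; the other eigenvalue equals the trace 4 + 9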
The local smoothing adds together several such matrices (weighted addition). Adding together positive semi-definite matrices leads to a positive-semi-definite matrix: if v'*A*v>=0, and v'*B*v>=0, then v'*(A+B)*v>=0.
Let y = Relu(Wx) where W is a 2d matrix representing a linear transformation on x, a vector. Likewise, let m = Zy, where Z is a 2d matrix representing a linear transformation on y. How do I programmatically calculate the gradient of Loss = sum(m^2) with respect to W, where the power means take the element wise power of the resulting vector, and sum means adding all the elements together?
I can work this out slowly and mathematically by taking a hypothetical example, multiplying it all out, then taking the derivative element by element to construct the gradient, but I can't figure out an efficient way to write a program once the neural network has more than one layer.
Say, for just one layer (m = Zy, take gradient wrt Z) I could just say
Loss = sum(m^2)
dLoss/dZ = 2m * y
where * is the outer product of the vectors, and I guess this is kind of like normal calculus and it works. Now for 2 layers + activation (gradient wrt W), if I try to do it like "normal" calculus and apply the chain rule I get:
dLoss/dW = 2m * Z * dRelu * x
where dRelu is the derivative of Relu(Wx) except here I have no idea what * means in this case to make it work.
Is there an easy way to calculate this gradient mathematically without basically multiplying it all out and deriving each separate element in the gradient? I'm really unfamiliar with matrix calculus, so if anyone could also give some mathematical intuition, if my attempt is completely wrong, that would be appreciated.
For the sake of convenience, let's ignore the ReLU for a moment. You have an input space X (of some size [dimX]) mapped to an intermediate space Y (of some size [dimY]), which is mapped to an output space m (of some size [dimM]). You then have W: X → Y, a matrix of shape [dimY, dimX], and Z: Y → m, a matrix of shape [dimM, dimY]. Finally, your loss is simply a function that maps your m space to a scalar value.
Let us walk the way backwards. As you correctly said, you want to compute the derivative of the loss w.r.t W and to do so you need to apply the chain rule all the way back. You then have:
dL/dW = dL/dm * dm/dY * dY/dW
dL/dm is of shape [dimM] (a scalar function with derivatives across dimM dimensions)
dm/dY is of shape [dimM, dimY] (an m-dimensional function with derivatives across dimY dimensions)
dY/dW is of shape [dimY, dimW] = [dimY, dimY, dimX] (a y-dimensional function with derivatives across [dimY, dimX] dimensions)
Edit:
To make the last bit more clear, Y consists of dimY different values, so Y can be treated as dimY constituent functions. We need to apply the gradient operator on each of those mini-functions, all with respect to the basis vectors defined by W. More concretely, if W = [[w11, w12], [w21, w22], [w31, w32]] and x = [x1, x2], then Y = [y1, y2, y3] = [w11x1 + w12x2, w21x1 + w22x2, w31x1 + w32x2]. Then W defines a 6d space (3x2) across which we need to differentiate. We have dY/dW = [dy1/dW, dy2/dW, dy3/dW], and also dy1/dW = [[dy1/dw11, dy1/dw12], [dy1/dw21, dy1/dw22], [dy1/dw31, dy1/dw32]] = [[x1,x2],[0,0],[0,0]], a 3x2 matrix. So dY/dW is a [3,3,2] tensor.
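A small numpy sketch of that example (the numeric values for x1 and x2 are arbitrary placeholders):

import numpy as np

x = np.array([10., 20.])   # placeholder values for x1, x2

# dY/dW for a 3x2 weight matrix: dyi/dwjk = x[k] when i == j, else 0
dY_dW = np.zeros((3, 3, 2))
for i in range(3):
    dY_dW[i, i, :] = x
print(dY_dW.shape)   # (3, 3, 2)
print(dY_dW[0])      # [[10. 20.], [0. 0.], [0. 0.]], i.e. dy1/dW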
As for the multiplication part: the operation here is tensor contraction (essentially matrix multiplication in higher-dimensional spaces). Practically, if you have a high-order tensor A[[a1, a2, a3...], β] (i.e. a+1 dimensions, the last of which is of size β) and a tensor B[β, [b1, b2...]] (i.e. b+1 dimensions, the first of which is β), their tensor contraction is a tensor C[[a1,a2...], [b1,b2...]] (i.e. a+b dimensions, the β dimension contracted), where C is obtained by summing element-wise products across the shared dimension β (refer to https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot).
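For example (with arbitrary sizes), contracting dm/dY with dY/dW over the shared dimY axis in numpy:

import numpy as np

dimM, dimY, dimX = 4, 3, 2                   # arbitrary example sizes
dm_dY = np.random.randn(dimM, dimY)          # shape (dimM, dimY)
dY_dW = np.random.randn(dimY, dimY, dimX)    # shape (dimY, dimY, dimX)

# contract the shared dimY axis (last axis of dm_dY, first axis of dY_dW)
C = np.tensordot(dm_dY, dY_dW, axes=([1], [0]))
print(C.shape)                               # (4, 3, 2) = (dimM, dimY, dimX)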
The resulting tensor contraction then is a matrix of shape [dimY, dimX] which can be used to update your W weights. The ReLU which we ignored earlier can easily be thrown in the mix, since ReLU: 1 → 1 is a scalar function applied element-wise on Y.
To summarize, your code would be:
W_gradient = np.outer((Z.T @ (2 * m)) * (W @ x > 0), x)   # (W @ x > 0) is the element-wise ReLU derivative
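As a quick sanity check of that expression (hypothetical sizes, comparing one entry against a finite difference):

import numpy as np

dimX, dimY, dimM = 3, 4, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((dimY, dimX))
Z = rng.standard_normal((dimM, dimY))
x = rng.standard_normal(dimX)

def loss(W):
    y = np.maximum(W @ x, 0.0)   # ReLU(Wx)
    m = Z @ y
    return np.sum(m ** 2)

m = Z @ np.maximum(W @ x, 0.0)
W_gradient = np.outer((Z.T @ (2 * m)) * (W @ x > 0), x)

eps = 1e-6
W_plus = W.copy()
W_plus[0, 0] += eps
print(W_gradient[0, 0], (loss(W_plus) - loss(W)) / eps)   # these should match closely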
I just implemented several multilayer perceptrons (MLPs) from scratch in C++[1], and I think I know your pain. And believe me, you don't even need any third-party matrix/tensor/automatic differentiation (AD) libraries to do the matrix multiplication or gradient calculation. There are three things you should pay attention to:
There are two kinds of multiplication in the equations: matrix multiplication and element-wise multiplication. You'll mess things up if you denote them both with a single *.
Use concrete examples, especially concrete numbers as dimensions of your data/matrix/vector to build intuition.
The most powerful tool for getting this right is dimension compatibility; never forget to check dimensions.
Suppose you want to do binary classification and the neural network is input -> h1 -> sigmoid -> h2 -> sigmoid -> loss, in which the input layer has 1 sample with 2 features, h1 has 7 neurons, and h2 has 2 neurons. Then:
forward pass:
Z1(1, 7) = X(1, 2) * W1(2, 7)
A1(1, 7) = sigmoid(Z1(1, 7))
Z2(1, 2) = A1(1, 7) * W2(7, 2)
A2(1, 2) = sigmoid(Z2(1, 2))
Loss = 1/2(A2 - label)^2
backward pass:
dA2(1, 2) = dL/dA2 = A2 - label
dZ2(1, 2) = dL/dZ2 = dA2 * dsigmoid(A2_i) -- element wise
dW2(7, 2) = A1(1, 7).T * dZ2(1, 2) -- matrix multiplication
Notice the last equation, the dimension of the gradient of W2 should match W2, which is (7, 2). And the only way to get a (7, 2) matrix is to transpose input A1 and multiply A1 with dZ2, that's dimension compatibility[2].
backward pass continued:
dA1(1, 7) = dZ2(1, 2) * W2.T(2, 7) -- matrix multiplication
dZ1(1, 7) = dA1(1, 7) * dsigmoid(A1_i) -- element wise
dW1(2, 7) = X.T(2, 1) * dZ1(1, 7) -- matrix multiplication
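For reference, here is a minimal numpy version of the exact dimensions above (random data and weights, purely illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(1, 2)          # 1 sample, 2 features
label = np.array([[1.0, 0.0]])     # 1 x 2 label
W1 = np.random.randn(2, 7)
W2 = np.random.randn(7, 2)

# forward pass
Z1 = X @ W1                        # (1, 7)
A1 = sigmoid(Z1)                   # (1, 7)
Z2 = A1 @ W2                       # (1, 2)
A2 = sigmoid(Z2)                   # (1, 2)
loss = 0.5 * np.sum((A2 - label) ** 2)

# backward pass
dA2 = A2 - label                   # (1, 2)
dZ2 = dA2 * A2 * (1 - A2)          # (1, 2), element-wise
dW2 = A1.T @ dZ2                   # (7, 2)
dA1 = dZ2 @ W2.T                   # (1, 7)
dZ1 = dA1 * A1 * (1 - A1)          # (1, 7), element-wise
dW1 = X.T @ dZ1                    # (2, 7)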
[1] The code is here; you can see the hidden layer implementation, the naive matrix implementation, and the references listed there.
[2] I omit the matrix derivation part; it's actually simple but hard to type the equations out. I strongly suggest you read this paper; every tiny detail you need to know about matrix derivatives in DL is listed there.
[3] The example above uses one sample (a vector) as input; you can substitute any batch size for the 1 (turning the vectors into matrices) and the equations still hold.
I have three points: (1,1), (2,3), (3, 3.123). I assume the hypothesis is h(x) = θ1*x + θ2, and I want to do linear regression on the three points. I have two methods to calculate θ:
Method-1: Least Square
import numpy as np
# get an approximate solution using least square
X = np.array([[1,1],[2,1],[3,1]])
y = np.array([1,3,3.123])
theta = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta)
Method-2: Matrix multiplication
We have the following derivation process: from the normal equations X'Xθ = X'y, we get θ = (X'X)^(-1)X'y.
# rank(X)=2, rank(X|y)=3, so there is no exact solution.
print(np.linalg.matrix_rank(X))
print(np.linalg.matrix_rank(np.c_[X, y]))
theta = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
print(theta)
Both method-1 and method-2 give the result [ 1.0615 0.25133333], so it seems that method-2 is equivalent to least squares. But I don't know why; can anyone reveal the underlying principle of their equivalence?
Both approaches are equivalent, because the least squares method is
θ = argmin (Xθ - Y)'(Xθ - Y) = argmin ||Xθ - Y||^2 = argmin ||Xθ - Y||,
which means you try to minimize the length of the vector (Xθ - Y), i.e. the distance between Xθ and Y. X is a constant matrix, so Xθ is a vector from the column space of X. The shortest distance between these two is achieved when Xθ equals the projection of the vector Y onto the column space of X (this is easy to see from a picture). That gives
Ŷ = Xθ = X(X'X)^(-1)X'Y,
where X(X'X)^(-1)X' is the projection matrix onto the column space of X. After some rearranging you can see that this is equivalent to (X'X)θ = X'Y. You can find an exact proof in any linear algebra book.
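You can also verify the equivalence numerically (np.linalg.solve is used here just to avoid forming the explicit inverse; it solves the same normal equations):

import numpy as np

X = np.array([[1, 1], [2, 1], [3, 1]], dtype=float)
y = np.array([1, 3, 3.123])

theta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X)θ = X'y

print(theta_lstsq)    # [1.0615     0.25133333]
print(theta_normal)   # the same values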
I'm following the TensorFlow tutorial
Initially x is defined as
x = tf.placeholder(tf.float32, shape=[None, 784])
Later on it reshapes x, and I'm trying to understand why.
To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])
What does -1 mean in the reshaping vector and why is x being reshaped?
1) What does -1 mean in the reshaping vector
From the documentation of reshape:
If one component of shape is the special value -1, the size of that
dimension is computed so that the total size remains constant. In
particular, a shape of [-1] flattens into 1-D. At most one component
of shape can be -1.
This is a standard feature and is available in numpy as well. Basically it means: I don't want to work out this dimension myself, so infer it for me. In your case the reshaped tensor must keep batch_size * 784 elements, and since 28 * 28 * 1 = 784, the -1 is inferred as the batch size (the number of images).
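The same behaviour in numpy, with a hypothetical batch of 50 flattened images:

import numpy as np

a = np.zeros((50, 784))    # 50 flattened 28x28 images
b = a.reshape(-1, 28, 28, 1)
print(b.shape)             # (50, 28, 28, 1): the -1 was inferred as the batch size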
2) Why is x being reshaped
They are planning to use convolution for image classification, so they need to use some spatial information. The current data is 1-dimensional, so they transform it to 4 dimensions. I was not sure about the point of the fourth dimension, since in my opinion they could have used only (x, y, color), or even (x, y). Try modifying their reshape and convolution and you will most probably get similar accuracy.
why 4 dimensions
TensorFlow's conv2d operation expects a 4-dimensional tensor with dimensions corresponding to batch, height, width, and channel:
[batch, in_height, in_width, in_channels]