Let y = ReLU(Wx), where W is a 2d matrix representing a linear transformation on x, a vector. Likewise, let m = Zy, where Z is a 2d matrix representing a linear transformation on y. How do I programmatically calculate the gradient of Loss = sum(m^2) with respect to W, where the power means taking the element-wise power of the resulting vector, and sum means adding all the elements together?
I can work this out slowly by hand: take a concrete example, multiply it all out, then take the derivative element by element to construct the gradient. But I can't figure out an efficient approach to write a program once the network has more than one layer.
Say, for just one layer (m = Zy, take gradient wrt Z) I could just say
Loss = sum(m^2)
dLoss/dZ = 2m * y
where * is the outer product of the vectors, and I guess this is kind of like normal calculus and it works. Now for 2 layers + activation (gradient wrt W), if I try to do it like "normal" calculus and apply the chain rule I get:
dLoss/dW = 2m * Z * dRelu * x
where dRelu is the derivative of Relu(Wx) except here I have no idea what * means in this case to make it work.
Is there an easy way to calculate this gradient mathematically without basically multiplying it all out and deriving each separate element in the gradient? I'm really unfamiliar with matrix calculus, so if anyone could also give some mathematical intuition, if my attempt is completely wrong, that would be appreciated.
For the sake of convenience, let's ignore the ReLU for a moment. You have an input space X (of some size [dimX]) mapped to an intermediate space Y (of some size [dimY]), mapped to an output space m (of some size [dimM]). You then have W: X → Y, a matrix of shape [dimY, dimX], and Z: Y → m, a matrix of shape [dimM, dimY]. Finally, your loss is simply a function that maps the m space to a scalar value.
Let us walk the way backwards. As you correctly said, you want to compute the derivative of the loss w.r.t. W, and to do so you need to apply the chain rule all the way back. You then have:
dL/dW = dL/dm * dm/dY * dY/dW
dL/dm is of shape [dimM] (a scalar function with derivatives across dimM dimensions)
dm/dY is of shape [dimM, dimY] (an m-dimensional function with derivatives across dimY dimensions)
dY/dW is of shape [dimY, dimW] = [dimY, dimY, dimX] (a y-dimensional function with derivatives across [dimY, dimX] dimensions)
Edit:
To make the last bit more clear, Y consists of dimY different values, so Y can be treated as dimY constituent functions. We need to apply the gradient operator on each of those mini-functions, all with respect to the basis vectors defined by W. More concretely, if W = [[w11, w12], [w21, w22], [w31, w32]] and x = [x1, x2], then Y = [y1, y2, y3] = [w11x1 + w12x2, w21x1 + w22x2, w31x1 + w32x2]. Then W defines a 6d space (3x2) across which we need to differentiate. We have dY/dW = [dy1/dW, dy2/dW, dy3/dW], and also dy1/dW = [[dy1/dw11, dy1/dw12], [dy1/dw21, dy1/dw22], [dy1/dw31, dy1/dw32]] = [[x1,x2],[0,0],[0,0]], a 3x2 matrix. So dY/dW is a [3,3,2] tensor.
As for the multiplication part: the operation here is tensor contraction (essentially matrix multiplication in higher-dimensional spaces). Practically, if you have a high-order tensor A[[a1, a2, a3...], β] (i.e. a+1 dimensions, the last of which is of size β) and a tensor B[β, [b1, b2...]] (i.e. b+1 dimensions, the first of which is β), their tensor contraction is a tensor C[[a1, a2...], [b1, b2...]] (i.e. a+b dimensions, with the β dimension contracted), where C is obtained by summing the element-wise products over the shared dimension β (refer to https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot).
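For concreteness, a small NumPy sketch (dimensions picked arbitrarily, ReLU still ignored) showing that contracting the three derivatives above with np.tensordot yields the expected [dimY, dimX] gradient, and that it matches the compact outer-product form:
import numpy as np

dimX, dimY, dimM = 4, 3, 2
x = np.random.randn(dimX)
W = np.random.randn(dimY, dimX)
Z = np.random.randn(dimM, dimY)

y = W @ x                                   # no ReLU here, as above
m = Z @ y

dL_dm = 2 * m                               # shape [dimM]
dm_dY = Z                                   # shape [dimM, dimY]
dY_dW = np.einsum('ij,k->ijk', np.eye(dimY), x)   # shape [dimY, dimY, dimX]

grad = np.tensordot(np.tensordot(dL_dm, dm_dY, axes=1), dY_dW, axes=1)
print(grad.shape)                           # (3, 4), i.e. [dimY, dimX]
print(np.allclose(grad, np.outer(Z.T @ (2 * m), x)))   # True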
The resulting tensor contraction then is a matrix of shape [dimY, dimX] which can be used to update your W weights. The ReLU which we ignored earlier can easily be thrown into the mix, since ReLU: ℝ → ℝ is a scalar function applied element-wise on Y; its derivative simply enters as an extra element-wise factor.
To summarize, your code would be:
W_gradient = np.outer(np.dot(Z.T, 2 * m) * (np.dot(W, x) > 0), x)
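For a fuller picture, here is a minimal NumPy sketch (dimensions 4 → 3 → 2 chosen arbitrarily) that computes this gradient and checks it against finite differences; that kind of numerical check is always worth doing when you derive backprop by hand:
import numpy as np

def loss(W, Z, x):
    y = np.maximum(W @ x, 0.0)          # ReLU(Wx)
    m = Z @ y
    return np.sum(m ** 2)

def grad_W(W, Z, x):
    h = W @ x
    y = np.maximum(h, 0.0)
    m = Z @ y
    dL_dy = Z.T @ (2 * m)               # back through m = Zy
    dL_dh = dL_dy * (h > 0)             # back through the ReLU (element-wise)
    return np.outer(dL_dh, x)           # back through h = Wx, shape [dimY, dimX]

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
Z = rng.standard_normal((2, 3))
x = rng.standard_normal(4)

analytic = grad_W(W, Z, x)
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp, Z, x) - loss(Wm, Z, x)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))   # should be tiny (~1e-8 to 1e-6)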
I just implemented several multilayer perceptrons (MLPs) from scratch in C++[1], and I think I know your pain. And believe me, you don't even need any third-party matrix/tensor/automatic differentiation (AD) libraries to do the matrix multiplication or gradient calculation. There are three things you should pay attention to:
There are two kinds of multiplication in the equations: matrix multiplication and element-wise multiplication. You'll mess things up if you denote them both with a single *.
Use concrete examples, especially concrete numbers as dimensions of your data/matrix/vector to build intuition.
The most powerful tool for programming this correctly is dimension compatibility: never forget to check dimensions.
Suppose you want to do binary classification and the neural network is input -> h1 -> sigmoid -> h2 -> sigmoid -> loss, in which the input layer has 1 sample with 2 features, h1 has 7 neurons, and h2 has 2 neurons. Then:
forward pass:
Z1(1, 7) = X(1, 2) * W1(2, 7)
A1(1, 7) = sigmoid(Z1(1, 7))
Z2(1, 2) = A1(1, 7) * W2(7, 2)
A2(1, 2) = sigmoid(Z2(1, 2))
Loss = 1/2(A2 - label)^2
backward pass:
dA2(1, 2) = dL/dA2 = A2 - label
dZ2(1, 2) = dL/dZ2 = dA2 * dsigmoid(Z2) -- element wise, where dsigmoid(Z2) = A2 * (1 - A2)
dW2(7, 2) = A1(1, 7).T * dZ2(1, 2) -- matrix multiplication
Notice the last equation: the dimension of the gradient of W2 should match W2, which is (7, 2). And the only way to get a (7, 2) matrix here is to transpose A1 and multiply A1.T with dZ2 -- that's dimension compatibility[2].
backward pass continued:
dA1(1, 7) = dZ2(1, 2) * W2.T(2, 7) -- matrix multiplication
dZ1(1, 7) = dA1(1, 7) * dsigmoid(Z1) -- element wise, where dsigmoid(Z1) = A1 * (1 - A1)
dW1(2, 7) = X.T(2, 1) * dZ1(1, 7) -- matrix multiplication
[1] The code is here, you can see the hidden layer implementation, naive matrix implementation and the references listed there.
[2] I omit the matrix derivation part; it's actually simple but hard to type the equations out. I strongly suggest you read this paper; every tiny detail you should know about matrix derivatives in DL is covered in it.
[3] One sample is used as input in the above example (a vector); you can substitute any batch size for the 1 (making it a matrix), and the equations still hold.
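If it helps to see those shapes executed, here is a minimal NumPy sketch of the same forward and backward pass (not the C++ code from [1]; the random data and label are made up, and variable names follow the equations above):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((1, 2))          # 1 sample, 2 features
label = np.array([[1.0, 0.0]])           # made-up one-hot target
W1 = rng.standard_normal((2, 7))
W2 = rng.standard_normal((7, 2))

# forward pass
Z1 = X @ W1                              # (1, 7)
A1 = sigmoid(Z1)                         # (1, 7)
Z2 = A1 @ W2                             # (1, 2)
A2 = sigmoid(Z2)                         # (1, 2)
loss = 0.5 * np.sum((A2 - label) ** 2)

# backward pass
dA2 = A2 - label                         # (1, 2)
dZ2 = dA2 * A2 * (1 - A2)                # (1, 2), element-wise sigmoid derivative
dW2 = A1.T @ dZ2                         # (7, 2), matches W2
dA1 = dZ2 @ W2.T                         # (1, 7)
dZ1 = dA1 * A1 * (1 - A1)                # (1, 7)
dW1 = X.T @ dZ1                          # (2, 7), matches W1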
For a machine vision project I am trying to search image data for quadratic surfaces (f(x,y) = Ax^2+Bx+Cy^2+Dy+Exy+F). My plan is to iterate through regions of the data, perform a surface fit, look at the error, and see if it's a continuous surface (which would probably indicate a feature in the image).
I was previously able to find quadratic curves (f(x) = Ax^2+Bx+C) in the image data by sampling lines, using the equations on this site:
Link
This worked well and was promising, but it would be much more useful for my task to find 2-D regions that form continuous surfaces.
I see lots of articles indicating that least squares regression scales up to multiple dimensions, but I'm not able to find code for this. Hopefully there is a "closed form" (non-iterative, just compute from your data points) solution, like the one described above for 1-D data. Does anybody know of some source or pseudocode that accomplishes this? Thanks.
(Sorry if my terminology is a bit off.)
I'm not sure what your background is, but if you know some linear algebra you will find linear least squares on Wikipedia useful.
Let's take the following example. Say we have the following image
and we want to know how well this fits to a 2D quadratic function in a least squares sense.
Probably the most straightforward way to solve the problem is to compute the optimal coefficients in a least squares sense, then check the error.
First we need to describe the matrices.
Let X be a matrix containing every x,y coordinate in the image, taking the form
X = [x1 x1^2 y1 y1^2 x1*y1 1;
x2 x2^2 y2 y2^2 x2*y2 1;
...
xN xN^2 yN yN^2 xN*yN 1];
For the example image above, X would be a 100x6 matrix.
Let y be the image intensity values in a vector of the form
y = [img(x1,y1);
img(x2,y2);
...
img(xN,yN)]
In this case y is a 100 element column vector.
We want to minimize the least squares objective function S with respect to the vector of coefficients b
S(b) = |y - X*b|^2
where |.| is the L2 norm and b is the desired coefficients
b = [A;
B;
C;
D;
E;
F]
Taking the vector derivative of S(b) with respect to b, setting to zero, and solving for b leads to the standard least squares solution.
b = inv(X'*X)*X'*y
where inv is the matrix inverse, ' is transpose, and * is matrix multiplication.
MATLAB example.
% Generate an image
% define x,y coordinates for each location in the image
[x,y] = meshgrid(1:10,1:10);
% true coefficients
b_true = [0.1 0.5 0.3 -0.4 0.4 124];
% magnitude of noise
P = 2;
% create image
img = b_true(1).*x + b_true(2).*x.^2 + b_true(3).*y + b_true(4).*y.^2 + b_true(5).*x.*y + b_true(6);
noise = P*randn(10,10);
img = img + noise;
% Begin least squares optimization
% create matrices
X = [x(:) x(:).^2 y(:) y(:).^2 x(:).*y(:) ones(size(x(:)))];
y = img(:);
% estimated coefficients
b = (X.'*X)\(X.')*y
% mean square error (expected to be near P^2)
E = 1/numel(y) * sum((y - X*b).^2)
Output
b =
0.0906
0.5093
0.1245
-0.3733
0.3776
124.5412
E =
3.4699
In your application you would probably want to define some threshold such that when E < threshold you accept the image (or image region) as a quadratic polynomial.
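If you'd rather work in Python, here is roughly the same thing in NumPy (a sketch of the MATLAB example above; np.linalg.lstsq solves the least-squares system and is better conditioned than explicitly forming inv(X'*X)):
import numpy as np

# synthetic 10x10 image from known coefficients, plus noise
yy, xx = np.mgrid[1:11, 1:11].astype(float)
b_true = np.array([0.1, 0.5, 0.3, -0.4, 0.4, 124.0])
img = (b_true[0]*xx + b_true[1]*xx**2 + b_true[2]*yy
       + b_true[3]*yy**2 + b_true[4]*xx*yy + b_true[5])
img += 2.0 * np.random.randn(*img.shape)       # noise magnitude P = 2

# design matrix: one row [x, x^2, y, y^2, x*y, 1] per pixel
x, y = xx.ravel(), yy.ravel()
X = np.column_stack([x, x**2, y, y**2, x*y, np.ones_like(x)])
z = img.ravel()

b, *_ = np.linalg.lstsq(X, z, rcond=None)      # least-squares coefficients
mse = np.mean((z - X @ b) ** 2)                # compare this against your threshold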
I had a particular question regarding the gradient for the softmax used in CS231n. After deriving the softmax function to calculate the gradient for each individual class, the authors divide the gradient by num_examples, even though the gradient is not summed anywhere. What is the logic behind this? Why can't we just use the softmax gradients directly?
A typical objective of neural network learning is to minimise an expected loss over the data distribution, thus:
minimise E_{x,y} L(x,y)
Now, in practice we use an estimate of this quantity, which is given by the sample mean over a training set (xi, yi):
minimise 1/N SUM L(xi, yi)
What is given in the above derivation is d L(xi, yi) / d theta, but since we want to minimise 1/N SUM L(xi, yi), we should compute its gradient, which is:
d 1/N SUM L(xi, yi) / d theta = 1/N SUM d L(xi, yi) / d theta
This is just a property of derivatives (the derivative of a sum is the sum of the derivatives, and so on). Notice that in all the above derivations the author talks about Li, while the actual optimisation is performed over L (note the lack of index i), which is defined as L = 1/N SUM_i Li.
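In code, that 1/N is just the averaging of per-example gradients before the parameter update. A toy NumPy sketch of a softmax classifier gradient (not the CS231n code itself; data and shapes are made up):
import numpy as np

N, D, C = 5, 4, 3                        # examples, features, classes (arbitrary)
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)           # integer class labels
W = rng.standard_normal((D, C))

scores = X @ W
probs = np.exp(scores - scores.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# per-example gradient of Li with respect to the scores of example i
dscores = probs.copy()
dscores[np.arange(N), y] -= 1.0

# gradient of L = (1/N) * SUM_i Li -> average the per-example gradients
dW = X.T @ dscores / N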
I've been doing homework 1 in Andrew Ng's machine learning course, but I'm stuck on my understanding of what he was talking about when vectorizing multivariable gradient descent.
His equation is presented as follows:
theta := theta - alpha*f
f is supposed to be 1/m * sum((h(xi) - yi) * Xi), where i is the index over training examples
now here is where I get confused, I know that h(xi)-y(i) can be rewritten as theta*xi where xi represents a row of feature elements (1xn) and theta represents a column (nx1) producing a scalar which I then subtract from an individual value of y, which I then multiply by Xi where Xi represents a column of 1 features values?
so that would give me mx1 vector? which then has to be subtracted from an nx1 vector?
is it that Xi represents a row of feature values? and if so how can I do this without indexing over all of these rows?
I'm specifically referring to this image:
I'll explain it with the non-vectorized implementation
so that would give me mx1 vector? which then has to be subtracted from
an nx1 vector?
Yes, it will give you an m x 1 vector, but instead of being subtracted from an n x 1 vector, it has to be subtracted from an m x 1 vector too. How?
I know that h(xi)-y(i) can be rewritten as theta*xi where xi
represents a row of feature elements (1xn) and theta represents a
column (nx1) producing a scalar
You have actually answered it yourself: theta * xi produces a scalar, so when you have m samples it gives you an m x 1 vector. If you look carefully at the equation, the scalar result of h(xi) - y(i) is multiplied by a scalar too, namely x0 from sample i (x sup i sub 0), so it gives a scalar result, or an m x 1 vector if you have m samples.
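To make it concrete, a short NumPy sketch of the fully vectorized update (a sketch under the usual assumptions: X is m x n with a leading column of ones, y has length m, and theta has length n):
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """X: (m, n) design matrix (first column all ones), y: (m,) targets."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        errors = X @ theta - y          # h(xi) - yi for all i at once, shape (m,)
        grad = X.T @ errors / m         # shape (n,); this is the vector f in theta := theta - alpha*f
        theta -= alpha * grad
    return theta
Row j of X.T picks out feature j across all samples, so X.T @ errors computes sum_i (h(xi) - yi) * xj(i) for every j at once, with no explicit loop over rows.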
I have been trying to understand the SVM algorithm and I can not fully get the hyperplane equation. The equation is w.x - b = 0.
What I understand (with lots of confusion) is: x is the unknown set of all the vectors that constitute the hyperplane, and w is the normal vector to that hyperplane. We do not know w; we need to find the optimal w from the training set.
Now, we all know that if two vectors are perpendicular to each other then their dot product is zero. So, if w is normal to x, then that means it should be w.x = 0, but why does it say w.x - b = 0, or w.x = b? (Normal and perpendicular are the same thing, right?) In the normal sense, what I understand is that if w.x = b, then w and x are not perpendicular and the angle between them is more or less than 90 degrees.
Another thing is, in most tutorials (even the MIT professor in his lecture) it is said that x is projected onto w, but as far as I know, if I want to take the projection of x onto w then it will be x.w/|w| (without the direction of w), not just w.x. Am I right on this point?
I think I am missing something or misunderstanding something. Can anybody help me with this?
First, to get the terminology straight:
The projection of x onto w is (x.w / |w|²) w, and x.w / |w| is the (scalar) component of x in the direction of w (as w/|w| is a unit vector in the direction of w).
Then, you might be confusing two things:
If x is a vector lying on a hyperplane through the origin, then w.x = 0 is the equation of that hyperplane. Unfortunately, we do not want any of your x to be on the hyperplane.
In the case of SVM, you do not know any vector x on the hyperplane. Instead, you have a training set {(x1,y1), ..., (xN, yN)} from which you want to find the normal vector w of the hyperplane (then you can describe any vector x of this hyperplane knowing that w.x = 0).
So let's inspect the second point, where you have a dataset {(x1,y1), ..., (xN, yN)}
and you want to find the hyperplane equation, i.e. its normal vector w, thanks to some particular vectors (called the support vectors).
There is no reason why any of these xi should be orthogonal to w. Moreover, it's impossible for all the xi to be orthogonal to w: if every xi satisfied w.xi = 0, they would all lie on the hyperplane itself, and the hyperplane could not separate the two classes at all.
But what we want is that w.U >= C if U is a positive example (one side of the hyperplane) and w.U < C if U is a negative example (the other side of the hyperplane).
For such U we take the vectors in the dataset. We expect them to be at a certain distance D (called the gutter in the lecture) from this hyperplane. So we have w.xi >= C + D if yi is positive, and w.xi <= C - D if yi is negative.
Let's put b = -C and D = 1 (without loss of generality). Then w.xi + b >= 1 if yi is positive, and w.xi + b <= -1 if yi is negative.
Multiplying by yi (equal to +1 if xi is positive, -1 otherwise), both cases collapse to yi (w.xi + b) >= 1.
Finally, for the support vectors, i.e. those defining the gutter, we obtain yi (w.xi + b) - 1 = 0.
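A tiny NumPy check of those constraints on a made-up 2-D dataset (w and b are chosen by hand just to illustrate; a real SVM would optimize them):
import numpy as np

w = np.array([1.0, 1.0])                 # hand-picked normal vector
b = -3.0                                 # hand-picked offset; hyperplane is w.x + b = 0

X = np.array([[2.0, 2.0], [4.0, 3.0],    # positive examples
              [1.0, 1.0], [0.0, 2.0]])   # negative examples
y = np.array([1, 1, -1, -1])

margins = y * (X @ w + b)
print(margins)                           # [1. 4. 1. 1] -> all >= 1, constraints satisfied
print(np.isclose(margins, 1))            # the points on the gutter are the support vectors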