I'm starting to learn how regularized multi-class logistic regression classifiers work, but I'm stuck at one of the first steps. The negative log-likelihood function of logistic regression for $m$ classes and its gradients are given by:

$$-\ell(\mathbf{W}) = -\sum_{j=1}^{n} \sum_{k=1}^{m} y_{jk} \log \frac{\exp(\mathbf{w}_{k}^{T} \mathbf{x}_{j})}{\sum_{l=1}^{m} \exp(\mathbf{w}_{l}^{T} \mathbf{x}_{j})}, \qquad \frac{\partial (-\ell)}{\partial \mathbf{w}_{k}} = -\sum_{j=1}^{n} \left( y_{jk} - \frac{\exp(\mathbf{w}_{k}^{T} \mathbf{x}_{j})}{\sum_{l=1}^{m} \exp(\mathbf{w}_{l}^{T} \mathbf{x}_{j})} \right) \mathbf{x}_{j}$$
If I have a feature matrix $\bf X$ of size 800K x 50, what are the dimensions of $\bf W$, $\bf w_{k}$, and $\bf x_{j}$, and what do $n$ and $m$ equal?
I thought that $m=50$, $n=800K$, $\bf W$ is also an 800K x 50 matrix, $\bf w_{k}$ is a column of $\bf W$ of size 800K x 1, and $\bf x_{j}$ is a row of $\bf X$ of size 50 x 1. However, I'm obviously wrong, because I can't take the dot product $\bf w_{k}^{T} \bf x_{j}$ if these vectors have unequal lengths. What part(s) am I misunderstanding?
You have $n$ = 800K samples, each represented by 50 features, so the number of features is 50. The number of classes, $m$, is not given in your question (it is not 50). $\bf W$ is the weight matrix, which can be seen as a 50 x $m$ matrix. $\bf w_{k}$ is a column of $\bf W$, a vector of size 50 x 1. About $\bf x$ you are right: $\bf x_{j}$ is a row of $\bf X$, of size 50 x 1.
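As a sanity check on those dimensions, here is a small NumPy sketch. The class count $m = 10$ and the scaled-down sample count $n = 800$ are made-up stand-ins (the original question never gives $m$, and 800K rows would be slow to simulate):

```python
import numpy as np

n, d, m = 800, 50, 10              # samples, features, classes (n and m are illustrative)
X = np.random.randn(n, d)          # feature matrix, n x 50
W = np.random.randn(d, m)          # weight matrix, 50 x m
logits = X @ W                     # row j holds w_k^T x_j for k = 1..m
# Softmax over classes turns each row of logits into class probabilities
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
```

The dot product $\bf w_{k}^{T} \bf x_{j}$ works because both vectors have length 50 (the feature count), which is exactly the point of the answer above.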
I'm trying to implement my own neural network, but I'm not very confident that my math is correct.
I'm doing MNIST digit recognition, so I have a softmaxed output of 10 probabilities. I then compute my output delta thus:
delta_output = vector of outputs - one-hot encoded actual label
delta_output is a matrix of dimension 10 x 1.
Then I compute the delta for the weights of the last hidden layer thus:
delta_hidden = weight_hidden.transpose * delta_output * der_hidden_activation(output_hidden)
Assuming there are N nodes in the last hidden layer, weight_hidden is a matrix of dimension of N by 10, delta_output from above is 10x1, and the result of der_hidden_activation(output_hidden) is N x 1.
Now, my first question here is: should the multiplication of delta_output and der_hidden_activation(output_hidden) return a 10 x N matrix using the outer product? I think I then need to do a Hadamard product of this resulting matrix with the untouched weights to get delta_hidden to be N x 10 still.
Finally I multiply this delta_hidden by my learning rate and subtract it from the original weight of the last hidden layer to get my new weights.
My second and final question here is, did I miss anything?
Thanks in advance.
Assuming there are N nodes in the last hidden layer, weight_hidden is a matrix of dimension of N by 10, delta_output from above is 10x1, and the result of der_hidden_activation(output_hidden) is N x 1.
When going from a layer of N neurons (hidden layer), to a layer of M neurons (output layer), under matrix multiplication, the weight matrix dimensions should be M x N. So for the back propagation stage, going from layer of M neurons (output), to a layer of N neurons (hidden layer), use the transpose of the weight matrix which will give you a matrix of dimension (N x M).
Now, my first question here is, should the multiplication of delta_output and der_hidden_activation(output_hidden) return a 10 x N matrix using outer product?
I think I need to do a Hadamard product of this resulting matrix with the untouched weights to get delta_hidden to be N x 10 still.
Yes, you need to use a Hadamard product; however, you can't multiply delta_output and der_hidden_activation(output_hidden), since these are matrices of different dimensions (10 x 1 and N x 1 respectively). Instead you multiply the transpose of the hidden_weight matrix (N x 10) by delta_output (10 x 1) to get an N x 1 matrix, and then perform the Hadamard product with der_hidden_activation(output_hidden).
If I am translating this correctly...
hidden_weight matrix = $\mathbf{W}$, of size 10 x N
delta_output = $\boldsymbol{\delta}_{o}$, of size 10 x 1
delta_hidden = $\boldsymbol{\delta}_{h}$, of size N x 1
der_hidden_activation(output_hidden) = $f'(\mathbf{z}_{h})$, of size N x 1
Plugging this into the BP formula:

$$\boldsymbol{\delta}_{h} = \left( \mathbf{W}^{T} \boldsymbol{\delta}_{o} \right) \odot f'(\mathbf{z}_{h})$$
As you can see, you need to multiply the transpose of weight_hidden (giving N x 10) by delta_output (10 x 1) first to produce an N x 1 matrix, and then you take the Hadamard product with der_hidden_activation(output_hidden).
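A minimal NumPy sketch of that computation, following the answer's convention that the weight matrix is M x N (here 10 x N, with N = 16 as an arbitrary hidden-layer size):

```python
import numpy as np

N, M = 16, 10                                # hidden and output layer sizes (N is illustrative)
weight_hidden = np.random.randn(M, N)        # M x N, as in the answer
delta_output = np.random.randn(M, 1)         # 10 x 1
der_act = np.random.randn(N, 1)              # der_hidden_activation(output_hidden), N x 1
# (N x 10) @ (10 x 1) -> N x 1, then elementwise (Hadamard) product with N x 1
delta_hidden = (weight_hidden.T @ delta_output) * der_act
```

Note that `*` on two N x 1 arrays is exactly the Hadamard product; the shapes line up, so no outer product is needed.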
Finally I multiply this delta_hidden by my learning rate and subtract it from the original weight of the last hidden layer to get my new weights.
You don't multiply delta_hidden by the learning rate. You need to apply the learning rate to the bias and to a delta weight matrix.
The delta weight matrix is a matrix of the same dimensions as your (hidden) weight matrix and is calculated as the outer product of delta_output and the hidden-layer activations $\mathbf{a}_{h}$:

$$\Delta \mathbf{W} = \boldsymbol{\delta}_{o} \, \mathbf{a}_{h}^{T}$$

And then you can easily apply the learning rate:

$$\mathbf{W} \leftarrow \mathbf{W} - \alpha \, \Delta \mathbf{W}$$
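In NumPy, the weight update looks like this (a sketch with an illustrative hidden size N = 16 and learning rate 0.1; `activation_hidden` stands for the hidden layer's outputs):

```python
import numpy as np

N, M = 16, 10
learning_rate = 0.1
weight_hidden = np.random.randn(M, N)              # M x N weight matrix
delta_output = np.random.randn(M, 1)               # 10 x 1
activation_hidden = np.random.randn(N, 1)          # hidden-layer outputs, N x 1
# Outer product (10 x 1) @ (1 x N) gives a delta matrix matching the weight shape
delta_weight = delta_output @ activation_hidden.T  # M x N
weight_hidden -= learning_rate * delta_weight      # gradient-descent update
```

The learning rate scales the full delta weight matrix, not delta_hidden itself, which is the point the answer makes.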
Incidentally, I just answered a similar question on AISE which might help shed some light; it goes into more detail about the matrices to use during backpropagation.
I want to implement the Ncuts algorithm for an image of size 1248 x 378 x 1, but the adjacency matrix will be (1248 x 378) x (1248 x 378), which needs about 800 GB of RAM. Even though most of it is zero, it still needs too much memory. I do need this matrix, though, to compute the normalized cut. Is there any way I can find the eigenvalues without actually calculating the whole matrix?
If most of the matrix is zero, then don't use a dense format.
Instead, use a sparse matrix.
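A sketch with SciPy of how this might look: build the affinity matrix in a sparse format and hand it to an iterative eigensolver, which never materializes the dense matrix. The size n = 500 and the random affinities are stand-ins for the real 1248 x 378 pixel graph; a real Ncuts implementation would use the normalized Laplacian and the eigenvector for the second-smallest eigenvalue:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 500  # stand-in for the 1248*378 pixels; the real graph is built the same way
rng = np.random.default_rng(0)
# Sparse affinity matrix: each pixel connects to only a few neighbours,
# so the vast majority of entries are zero and never stored.
A = sp.random(n, n, density=0.01, random_state=rng, format="csr")
A = A + A.T                                   # symmetrise
d = np.asarray(A.sum(axis=1)).ravel()
L = sp.diags(d) - A                           # graph Laplacian, still sparse
# A few smallest eigenpairs via shift-invert; a dense n x n matrix is never formed.
# The tiny diagonal shift keeps the (singular) Laplacian factorizable.
vals, vecs = eigsh(L + 1e-6 * sp.identity(n), k=4, sigma=0, which="LM")
```

For the full image, memory scales with the number of nonzeros (a few neighbours per pixel) instead of with $n^2$.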
I've been doing homework 1 in Andrew Ng's machine learning course, but I'm stuck on understanding what he was talking about when vectorizing multivariable gradient descent.
His equation is presented as follows:
theta := theta - alpha*f
f is supposed to be created by 1/m*sum((h(xi) - yi)*Xi), where i is the index
Now here is where I get confused. I know that h(xi) can be rewritten as theta*xi, where xi represents a row of feature values (1 x n) and theta represents a column (n x 1), producing a scalar which I then subtract from an individual value of y, and which I then multiply by Xi, where Xi represents a column of feature values.
So that would give me an m x 1 vector, which then has to be subtracted from an n x 1 vector?
Or does Xi represent a row of feature values? And if so, how can I do this without indexing over all of these rows?
I'm specifically referring to this image:
I'll explain it with the non-vectorized implementation
so that would give me mx1 vector? which then has to be subtracted from an nx1 vector?
Yes, it will give you an m x 1 vector, but instead of being subtracted from an n x 1 vector, it has to be subtracted from an m x 1 vector too. How?
I know that h(xi)-y(i) can be rewritten as theta*xi where xi represents a row of feature elements (1xn) and theta represents a column (nx1) producing a scalar
You have actually answered it yourself: theta * xi produces a scalar, so when you have m samples it will give you an m x 1 vector. If you look carefully at the equation, the scalar result of h(xi) - y(i) is multiplied by a scalar too, namely x0 from sample i (x sup i sub 0), so it gives a scalar result, or an m x 1 vector if you have m samples.
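The fully vectorized update can be sketched in NumPy like this (the sizes m = 100, n = 3 and the random data are illustrative; the first column of X is the all-ones bias column x0):

```python
import numpy as np

m, n = 100, 3                      # m samples, n features (including the bias column x0)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n - 1))])  # m x n design matrix
y = rng.standard_normal((m, 1))    # m x 1 targets
theta = np.zeros((n, 1))           # n x 1 parameters
alpha = 0.1

# One vectorized update: h(xi) = xi . theta for every sample at once
errors = X @ theta - y             # m x 1 vector of h(xi) - yi, no loop over i
grad = (1 / m) * (X.T @ errors)    # n x 1: row j sums (h(xi) - yi) * xi_j over all i
theta = theta - alpha * grad       # theta := theta - alpha * f
```

The matrix product `X.T @ errors` is what replaces indexing over all the rows: it performs the sum over i for every feature j simultaneously.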
I'm following the TensorFlow tutorial
Initially x is defined as
x = tf.placeholder(tf.float32, shape=[None, 784])
Later on it reshapes x, and I'm trying to understand why.
To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])
What does -1 mean in the reshaping vector and why is x being reshaped?
1) What does -1 mean in the reshaping vector
From the documentation of reshape:
If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant. In particular, a shape of [-1] flattens into 1-D. At most one component of shape can be -1.
This is a standard feature and is available in NumPy as well. Basically it means: "I don't want to work out this dimension myself, so infer it for me." In your case the total size must stay constant: the input has shape [batch, 784] and 28 * 28 * 1 = 784, so the -1 is computed to be the batch size.
2) Why is x being reshaped
They are planning to use convolution for image classification, so they need some spatial information. The current data is 1-dimensional, so they transform it to 4 dimensions. I do not know the point of the fourth dimension, because in my opinion they might have used only (x, y, color), or even (x, y). Try modifying their reshape and convolution, and you will most probably get similar accuracy.
why 4 dimensions
TensorFlow’s convolutional conv2d operation expects a 4-dimensional tensor with dimensions corresponding to batch, height, width, and channels:
[batch, in_height, in_width, in_channels]
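Both points can be seen with NumPy's equivalent reshape (the batch size 32 here is an arbitrary example):

```python
import numpy as np

batch = np.zeros((32, 784))             # 32 flattened 28*28 grayscale images
x_image = batch.reshape(-1, 28, 28, 1)  # -1 is inferred: 32*784 / (28*28*1) = 32
# Result is in the [batch, in_height, in_width, in_channels] layout conv2d expects;
# the trailing 1 is the single grayscale channel.
```

The -1 lets the same reshape work for any batch size, which is why the tutorial uses it with a placeholder whose first dimension is None.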
I have lots of data points (x, y) and I'm trying to use k-NN to predict a future y'.
If y has only two possible values, then I can treat y = +1 or -1.
Each time I have an input x', I find the nearest k elements and multiply their y by the inverse of distance(x, x').
If the sum is greater than 0, then I predict y' = +1; otherwise y' = -1.
However, right now my y has 10 different possible values.
How do I do a similar weighted sum in this situation?
You can keep a separate score for each class that is always positive, with a higher score meaning a stronger association. Then just take the class with the max score.
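A sketch of that idea in NumPy, using 1-D inputs and inverse-distance weights as in the question (the function name and the small epsilon guard against division by zero are my own additions):

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3, eps=1e-12):
    """Weighted k-NN for any number of classes: accumulate a positive
    score per class (weight = inverse distance) and return the argmax."""
    dists = np.abs(train_x - query)          # 1-D features for simplicity
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    scores = {}
    for i in nearest:
        w = 1.0 / (dists[i] + eps)           # inverse-distance weight, always positive
        scores[train_y[i]] = scores.get(train_y[i], 0.0) + w
    return max(scores, key=scores.get)       # class with the highest total score

train_x = np.array([0.0, 1.0, 2.0, 10.0])
train_y = [0, 0, 1, 2]
pred = knn_predict(train_x, train_y, 0.1)    # two close class-0 neighbours dominate
```

With two classes encoded as +1/-1 this reduces to the sign-of-the-sum rule in the question, so it is a strict generalization to 10 (or any number of) classes.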