Do I have to apply an activation function on the max value in max pooling? - opencv

I am trying to implement the convolutional neural network by LeCun. I have a few questions.
1) Do I have to apply the activation function to (max_value * weight_value) in the max-pooling layer?
2) If yes, then when backpropagating the error: since I selected only one value from the 2x2 receptive field, how can I distribute the error to the other 3 values in the receptive field? Should I replicate the one error across the whole 2x2 window?
3) If no, then in backpropagation how can I take the derivative of the output, i.e. x*(1-x)? To find the gradient, we need the derivative of the activated weighted sum, i.e. f'(x).
4) Using the stochastic diagonal Levenberg-Marquardt method, what values should I take for eta and mu in equation (21) on page 2319 of LeCun's paper http://enpub.fulton.asu.edu/cseml/summer08/papers/cnn-appendix.pdf
I will be thankful for any explanation, code sample, etc.
Regards
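Regarding question 2, standard max pooling routes the upstream gradient only to the position that produced the max; the other three entries of the 2x2 window receive zero rather than a copy of the error. Below is a minimal NumPy sketch of that routing (shapes and function names are illustrative, not taken from LeCun's paper):

import numpy as np

def maxpool2x2_forward(x):
    # 2x2 max pooling on an (H, W) feature map with even H and W.
    H, W = x.shape
    windows = (x.reshape(H // 2, 2, W // 2, 2)
                .transpose(0, 2, 1, 3)
                .reshape(H // 2, W // 2, 4))
    argmax = windows.argmax(axis=-1)        # which of the 4 positions held the max
    return windows.max(axis=-1), argmax

def maxpool2x2_backward(grad_out, argmax, x_shape):
    # Route each upstream gradient to the argmax position only;
    # the other three positions of each 2x2 window get zero gradient.
    H, W = x_shape
    grad_windows = np.zeros((H // 2, W // 2, 4), dtype=grad_out.dtype)
    rows, cols = np.indices(argmax.shape)
    grad_windows[rows, cols, argmax] = grad_out
    return (grad_windows.reshape(H // 2, W // 2, 2, 2)
                        .transpose(0, 2, 1, 3)
                        .reshape(H, W))

x = np.arange(16, dtype=float).reshape(4, 4)
out, argmax = maxpool2x2_forward(x)
grad_x = maxpool2x2_backward(np.ones_like(out), argmax, x.shape)
print(out)      # the max of each 2x2 block
print(grad_x)   # 1 at each block's max position, 0 elsewhere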

Related

Unable to understand the transition from one equation to another in linear regression (from Ch. 2 of The Elements of Statistical Learning)

While reading about linear regression in Ch. 2 of the book "The Elements of Statistical Learning", I came across 2 equations and failed to understand how the 2nd was derived from the 1st.
Background:
How do we fit the linear model to a set of training data? There are many different methods, but by far the most popular is the method of least squares. In this approach, we pick the coefficients β to minimize the residual sum of squares

Equation 1: RSS(β) = \sum_{i=1}^{N} (y_i - x_i^T β)^2

RSS(β) is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation. We can write

Equation 2: RSS(β) = (y - Xβ)^T (y - Xβ)

where X is an N × p matrix with each row an input vector, and y is an N-vector of the outputs in the training set.
1st equation: RSS(β) = \sum_{i=1}^{N} (y_i - x_i^T β)^2
2nd equation: RSS(β) = (y - Xβ)^T (y - Xβ)
I got it. The RHS of the 2nd equation is the matrix form: since the i-th entry of Xβ is x_i^T β, the product (y - Xβ)^T (y - Xβ) is a scalar, and writing that matrix multiplication out element-wise gives exactly the sum in the 1st equation (transposing one factor is what turns the product into a sum of squares).
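To make the equivalence concrete, here is a quick NumPy check (with made-up X, y and β) that the matrix form evaluates to the same number as the element-wise sum of squared residuals:

import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 3
X = rng.normal(size=(N, p))       # N x p matrix, each row an input vector x_i^T
y = rng.normal(size=N)            # N-vector of outputs
beta = rng.normal(size=p)         # candidate coefficients

rss_sum = np.sum((y - X @ beta) ** 2)         # 1st equation: sum of squared residuals
rss_mat = (y - X @ beta).T @ (y - X @ beta)   # 2nd equation: matrix (dot-product) form
print(np.isclose(rss_sum, rss_mat))           # True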

Why would I use a Non Linear activation function in CNN convolutional layer?

I was going through one of the deep learning lectures from MIT on CNNs. It said that when multiplying weights with pixel values, a non-linear activation function like ReLU can be applied to every pixel. I understand why it should be applied in a simple neural network, since it introduces non-linearity to our input data. But why would I want to apply it to a single pixel? Or am I getting it wrong?
You may have got it a little wrong.
When they say "multiplying weights with pixel values", they refer to the linear operation of multiplying the filter (weights + bias) with the pixels of the image. If you think about it, each filter in a CNN essentially represents a linear equation.
For example, if we're looking at a 2*2 filter, the filter is essentially computing x1 * w1 + x2 * w2 + x3 * w3 + x4 * w4 + b for every 2*2 patch of the image it goes over. (In the above equation, x1, x2, x3, x4 refer to pixels of the image, while w1, w2, w3, w4 refer to the weights present in the CNN filter.)
Now, hopefully, it's fairly clear that the filter is essentially computing a linear equation. To be able to perform a task like, say, image classification, we require some amount of non-linearity. This is achieved by using, most popularly, the ReLU activation function.
So you aren't applying non-linearity to a "pixel" per se; you're still applying it to the output of a linear operation (like in a vanilla neural network), which consists of pixel values multiplied by the weights present in a filter.
Hope this cleared your doubt, feel free to reach out for more help!
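To see where the non-linearity actually sits, here is a minimal PyTorch sketch (channel counts and sizes are arbitrary): the convolution computes the linear combination of each patch with the filter weights plus a bias, and ReLU is then applied element-wise to that linear output, not to the raw pixels.

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=2)  # 2x2 filters
relu = nn.ReLU()

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
linear_out = conv(x)            # linear combination of each patch with the filter weights, plus bias
activated = relu(linear_out)    # non-linearity applied to the filter output, not to raw pixels
print(linear_out.shape, activated.shape)   # torch.Size([1, 8, 31, 31]) for both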

Bring any PyTorch CUDA tensor into the range [0,1]

Suppose I have a PyTorch CUDA float tensor x of shape [b,c,h,w], taking on any arbitrary value allowed by the float tensor range. I want to normalise it to the range [0,1].
I can think of the following algorithm (but any other will also do).
Step 1: Find the minimum within each batch sample. Call it min; it should have shape [b,1,1,1].
Step 2: Similarly find the maximum and call it max.
Step 3: Use y = (x-min)/max, or alternatively y = (x-min)/(max-min). I don't know which one will be better. y should have the same shape as x.
I am using PyTorch 1.3.1.
Specifically, I am unable to get the desired min using torch.min(). The same goes for max.
I am going to use this to feed a pre-trained VGG for calculating perceptual loss (after the above normalisation, I will additionally bring the tensors to the ImageNet mean and std). For certain reasons I cannot enforce the [0,1] range during the data loading part: the previous works in my area use a very specific normalisation algorithm which has to be kept, but it sometimes does not ensure the [0,1] bound, only something in its vicinity. That is why, at the time of computing the perceptual loss, I have to do this explicit normalisation as a precaution. All out-of-the-box implementations of perceptual loss I am aware of assume the data is in the [0,1] or [-1,1] range and so do not do this transformation.
Thank you very much.
Not the most elegant way, but you can do that using keepdim=True and specifying each of the dimensions:
channel_min = x.min(dim=1, keepdim=True)[0].min(dim=2, keepdim=True)[0].min(dim=3, keepdim=True)[0]  # per-sample min, shape [b, 1, 1, 1]; [0] takes the values from the (values, indices) pair
channel_max = x.max(dim=1, keepdim=True)[0].max(dim=2, keepdim=True)[0].max(dim=3, keepdim=True)[0]  # per-sample max, shape [b, 1, 1, 1]
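Building on that, here is a sketch of the full normalisation. Of the two options in Step 3, (x - min)/(max - min) is the one that actually lands in [0, 1]; the small eps is my own addition to avoid dividing by zero when a sample is constant.

import torch

def normalize_01(x, eps=1e-8):
    # Map each sample of a [b, c, h, w] tensor into [0, 1] using its own min/max.
    x_min = x.min(dim=1, keepdim=True)[0].min(dim=2, keepdim=True)[0].min(dim=3, keepdim=True)[0]
    x_max = x.max(dim=1, keepdim=True)[0].max(dim=2, keepdim=True)[0].max(dim=3, keepdim=True)[0]
    return (x - x_min) / (x_max - x_min + eps)   # broadcasts over c, h, w

x = torch.randn(4, 3, 8, 8)
if torch.cuda.is_available():       # works the same on a CUDA tensor
    x = x.cuda()
y = normalize_01(x)
print(y.min().item(), y.max().item())   # ~0.0 and ~1.0 (slightly below 1.0 because of eps)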

Matmul input and weight matrices order?

I have seen many ML tutorials explain a fully connected network by constructing two matrices, a weight matrix and an input (or activation) matrix, and performing a matrix-matrix multiplication (matmul) to form the linear equations.
All the examples I saw place the input as the first argument to matmul and the weight tensor as the second argument. Why is that? Why can't I compute weights times input (assuming the weight matrix was created properly, with its column count equal to the input matrix's row count)?
For an (n x 1) input, to get an (n x 1) output you have to multiply the input by an (n x n) matrix from the left, or by a (1 x 1) matrix from the right.
If you multiply the input by a scalar (a (1 x 1) matrix), there is a single connection from each input neuron to its output. If you multiply it by a matrix, each output cell gets a weighted sum of the input neurons; in other words, each neuron in the input is connected to each neuron in the output, which is what "fully connected" means.
As long as you preserve this logic and keep the shapes consistent, it doesn't matter how you arrange your weight matrices.
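A small PyTorch check of that point, with made-up sizes: with a batch of inputs stored as rows, the common tutorial layout is output = input @ W.T; writing weights-first works just as well once you transpose the batch. Both arrangements compute the same linear map.

import torch

in_features, out_features, batch = 4, 3, 5
W = torch.randn(out_features, in_features)   # weight matrix, (out x in)
b = torch.randn(out_features)
X = torch.randn(batch, in_features)          # inputs stored as rows

y1 = X @ W.T + b          # input first:   (batch x in) @ (in x out)  -> (batch x out)
y2 = (W @ X.T).T + b      # weights first: (out x in) @ (in x batch), then transpose back
print(torch.allclose(y1, y2))   # True: same linear map, different arrangement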

Do I need to include my scaled outputs in my back-propagation equation (SGD)?

Quick question: when I am backpropagating the loss function to my parameters and I used a scaled output (e.g. tanh(x) * 2), do I need to include the derivative of the scaled output w.r.t. the original output? Thank you!
Before we can backprop the errors, we have to compute the gradient of the loss function with respect to each of the parameters. This computation involves computing the gradients of the outputs first and then applying the chain rule repeatedly. When you do this, the scaling constant stays in the expression: the derivative of tanh(x) * 2 is 2 * (1 - tanh(x)^2). So yes, you have to scale the gradients accordingly.
As an example, you might have seen the following L2-regularized loss, a.k.a. ridge regression:
Loss = 1/2 * ||T - Y||^2 + \lambda * ||w||^2
Here the squared error is scaled down by 1/2 so that, when we compute the gradient, the 1/2 and the 2 coming from the square cancel out. If we had not multiplied by 0.5 in the first place, we would have to scale the gradient of the error term up by 2; otherwise the overall gradient vector (error term plus regularization term) would point in a different direction from the one that actually minimizes the loss.
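As a quick autograd check of that chain-rule point, using the tanh(x) * 2 output from the question: the derivative is 2 * (1 - tanh(x)^2), so the scaling factor does carry through to the gradient.

import torch

x = torch.linspace(-2.0, 2.0, steps=5, requires_grad=True)
y = 2.0 * torch.tanh(x)        # scaled output, as in the question
y.sum().backward()             # autograd applies the chain rule, factor of 2 included

manual = 2.0 * (1.0 - torch.tanh(x.detach()) ** 2)   # chain rule by hand: the scale factor stays
print(torch.allclose(x.grad, manual))                # True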
