Do I need to include my scaled outputs in my back-propagation equation (SGD)? - machine-learning

Quick question: when I am backpropagating the loss function to my parameters and I use a scaled output (e.g. tanh(x) * 2), do I need to include the derivative of the scaled output w.r.t. the original output? Thank you!

Before we can backprop the errors, we have to compute the gradient of the loss function with respect to each of the parameters. This involves computing the gradients of the outputs first and then applying the chain rule repeatedly. When you do this, the scaling constant survives differentiation, e.g. d/dx [2 * tanh(x)] = 2 * (1 - tanh(x)^2). So, yes, you have to scale the errors accordingly.
As an example, you might have observed the following L2-regularized loss, a.k.a. ridge regression:
Loss = (1/2) * ||T - Y||^2 + \lambda * ||w||^2
Here, we scale down the squared error by 1/2 so that, when we compute the gradient, the 1/2 and the 2 coming from the square cancel out. If we had not multiplied by 0.5 in the first place, the gradient of the data term would carry a factor of 2; forgetting that factor would make the data term's gradient the wrong size relative to the regularization term's, so the combined gradient vector would point in some other direction instead of the direction which minimizes the loss.
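As a quick sanity check (a minimal NumPy sketch of my own, not from the original answer), you can verify numerically that the scaling constant must appear in the gradient of the scaled output:
import numpy as np
x = np.linspace(-2.0, 2.0, 5)
# f(x) = 2 * tanh(x); the chain rule gives df/dx = 2 * (1 - tanh(x)^2),
# i.e. the scaling constant 2 stays in the gradient.
analytic = 2.0 * (1.0 - np.tanh(x) ** 2)
eps = 1e-5
numeric = (2.0 * np.tanh(x + eps) - 2.0 * np.tanh(x - eps)) / (2.0 * eps)
print(np.allclose(analytic, numeric))  # True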

Related

Why would I use a Non Linear activation function in CNN convolutional layer?

I was going through one of the deep learning lectures from MIT on CNNs. It said that when multiplying weights with pixel values, a non-linear activation function like ReLU can be applied to every pixel. I understand why it should be applied in a simple neural network, since it introduces non-linearity into our input data. But why would I want to apply it to a single pixel? Or am I getting it wrong?
You may have got it a little wrong.
When they say "multiplying weights with pixel values" - they refer to the linear operation of multiplying the filter (weights + bias) with the pixels of the image. If you think about it, each filter in a CNN essentially represents a linear equation.
For example - if we're looking at a 2*2 filter, the filter is essentially computing x1 * w1 + x2 * w2 + x3 * w3 + x4 * w4 + b for every 2*2 patch of the image it goes over. (In the above equation, x1, x2, x3, x4 refer to pixels of the image, while w1, w2, w3, w4 refer to the weights present in the CNN filter.)
Now, hopefully it's fairly clear that the filter is essentially computing a linear equation. To be able to perform a task like let's say image classification, we require some amount of non-linearity. This is achieved by using, most popularly, the ReLU activation function.
So you aren't applying non-linearity to a "pixel" per se; you're still applying it to the result of a linear operation (as in a vanilla neural network), which consists of pixel values multiplied by the weights present in a filter.
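Here is a minimal PyTorch sketch (my own illustration, with made-up shapes matching the 2*2 filter example above) showing where the non-linearity sits:
import torch
import torch.nn as nn
x = torch.randn(1, 3, 32, 32)                                   # a batch of one 3-channel 32x32 image
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=2)  # 2x2 filters
z = conv(x)                                                     # linear part: weights * pixels + bias
a = torch.relu(z)                                               # non-linearity applied to the linear output
print(z.shape, a.shape)                                         # torch.Size([1, 8, 31, 31]) for both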
Hope this cleared your doubt, feel free to reach out for more help!

Bring any PyTorch cuda tensor in the range [0,1]

Suppose I have a PyTorch CUDA float tensor x of the shape [b,c,h,w] taking on any arbitrary value allowed by the float tensor range. I want to normalise it to the range [0,1].
I think of the following algorithm (but any other will also do).
Step 1: Find the minimum in each batch element. Call it min; it should have shape [b,1,1,1].
Step 2: Similarly find the maximum and call it max.
Step 3: Use y = (x-min)/max, or alternatively y = (x-min)/(max-min); I don't know which one will be better. y should have the same shape as x.
I am using PyTorch 1.3.1.
Specifically, I am unable to get the desired min using torch.min(); the same goes for max.
I am going to use it for feeding to a pre-trained VGG for calculating perceptual loss (after the above normalisation I will additionally bring the values to ImageNet mean and std). For certain reasons I cannot enforce the [0,1] range during the data-loading part: the previous works in my area use a very specific normalisation algorithm which has to be used, but which sometimes does not ensure the [0,1] bound (the values end up somewhere in its vicinity). That is why, when computing the perceptual loss, I have to do this explicit normalisation as a precaution. All out-of-the-box implementations of perceptual loss I am aware of assume the data is in the [0,1] or [-1,1] range and so do not do this transformation.
Thank you very much
Not the most elegant way, but you can do that using keepdim=True and reducing over each of the non-batch dimensions in turn (the [0] selects the values from the (values, indices) tuple that min/max return):
channel_min = x.min(dim=1, keepdim=True)[0].min(dim=2, keepdim=True)[0].min(dim=3, keepdim=True)[0]
channel_max = x.max(dim=1, keepdim=True)[0].max(dim=2, keepdim=True)[0].max(dim=3, keepdim=True)[0]
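From there, the normalisation itself would be (my own completion of the answer; the 1e-8 epsilon is an assumed guard against division by zero when a batch element is constant):
y = (x - channel_min) / (channel_max - channel_min + 1e-8)  # same shape as x; each batch element lands in [0, 1]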

Is my understanding right about "Intensity" of the Lucas-Kanade method of optical flow?

According to 'http://docs.opencv.org/3.1.0/d7/d8b/tutorial_py_lucas_kanade.html', the intensities have gradients: f(x) and f(y) are spatial gradients and f(t) is similarly the gradient along time. My confusion starts here. In the above link, f(x) and f(y) are in derivative form, but we cannot calculate a derivative along x and y because we don't know where the same point goes to; actually, that is what we are trying to find with this method. So I wonder whether, since it says f(t) is the gradient of one point along time, I can think of f(t) as the average gradient of a point collected over several fixed periods, with f(x) and f(y) collected once per period over which that average of f(t) is taken?
For example, f(t) is calculated every 20 ms and its average is computed every 100 ms; then every 100 ms f(x) and f(y) are calculated. Is my understanding right?
If it is wrong, then what is the difference between f(x), f(y) and f(t)?
They assume that the intensity of each pixel does not change over time (the brightness-constancy assumption). Based on this assumption, you can set up the optical flow equation and solve it for U and V, which form the vector of movement.
Let I(x,y) be the intensity value at a pixel position (x,y). Then, as in the description, fx(x,y) is the derivative in the x direction, fy(x,y) the derivative in the y direction, and ft(x,y) the derivative in the temporal direction at the position (x,y).
However, in the discrete world of our program, we can only approximate these derivatives; these approximations are what we call gradients. For this, e.g. the Sobel filters could be used, or, more simply, the difference operator, i.e.
fx(x,y) = I(x,y) - I(x-1,y)
fy(x,y) = I(x,y) - I(x,y-1)
Similarly for the temporal gradient, but now with the image intensity of the previous image A(x,y):
ft(x,y) = I(x,y) - A(x,y)
Ideally, and mostly in practice, A is the frame immediately before I. In theory A could be taken from an earlier frame, but this would make ft less accurate.
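A minimal NumPy sketch of these difference operators (my own illustration; I and A stand for two consecutive grayscale frames as float arrays, with x assumed to run along columns and y along rows):
import numpy as np
I = np.random.rand(64, 64)    # current frame, grayscale float intensities
A = np.random.rand(64, 64)    # previous frame
fx = np.zeros_like(I); fx[:, 1:] = I[:, 1:] - I[:, :-1]   # I(x,y) - I(x-1,y)
fy = np.zeros_like(I); fy[1:, :] = I[1:, :] - I[:-1, :]   # I(x,y) - I(x,y-1)
ft = I - A                                                # I(x,y) - A(x,y)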

Geometric representation of Perceptrons (Artificial neural networks)

I am taking this course on Neural networks in Coursera by Geoffrey Hinton (not current).
I have a very basic doubt on weight spaces.
https://d396qusza40orc.cloudfront.net/neuralnets/lecture_slides%2Flec2.pdf
Page 18.
If I have a weight vector (bias is 0) of [w1=1, w2=2] and training cases {1,2,-1} and {2,1,1},
where I guess {1,2} and {2,1} are the input vectors, how can this be represented geometrically?
I am unable to visualize it. Why does a training case give a plane which divides the weight space into 2? Could somebody explain this in a coordinate system of 3 dimensions?
The following is the text from the ppt:
1. Weight-space has one dimension per weight.
2. A point in the space represents a particular setting of all the weights.
3. Assuming that we have eliminated the threshold, each training case can be represented as a hyperplane through the origin.
My doubt is in the third point above. Kindly help me understand.
It's probably easier to explain if you look deeper into the math. Basically, a single layer of a neural net performs some function on your input vector, transforming it into a different vector space.
You don't want to jump right into thinking of this in 3-dimensions. Start smaller, it's easy to make diagrams in 1-2 dimensions, and nearly impossible to draw anything worthwhile in 3 dimensions (unless you're a brilliant artist), and being able to sketch this stuff out is invaluable.
Let's take the simplest case: you're taking in an input vector of length 2 and you have a weight vector of dimension 2x1, which implies an output vector of length one (effectively a scalar).
In this case it's pretty easy to imagine that you've got something of the form:
input = [x, y]
weight = [a, b]
output = ax + by
If we assume that weight = [1, 3], we can see, and hopefully intuit, that the response of our perceptron will be the plane output = x + 3y, with the behavior being largely unchanged for different values of the weight vector.
It's easy to imagine then, that if you're constraining your output to a binary space, there is a plane, maybe 0.5 units above the response plane, that constitutes your "decision boundary".
As you move into higher dimensions this becomes harder and harder to visualize, but if you imagine that the plane isn't merely a 2-d plane but an n-d plane, or a hyperplane, you can imagine that this same process happens.
Since actually creating the hyperplane requires either the input or the output to be fixed, you can think of giving your perceptron a single training value as creating a "fixed" [x, y] value. This can be used to create a hyperplane. Sadly, this cannot effectively be visualized, as 4-d drawings are not really feasible in a browser.
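A tiny NumPy sketch of the 2-d case above (my own illustration, using the weight = [1, 3] example):
import numpy as np
a, b = 1.0, 3.0                                          # weight = [1, 3]
xs, ys = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
response = a * xs + b * ys                               # the perceptron's response plane
decision = (response > 0.5).astype(int)                  # thresholding carves out the decision boundary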
Hope that clears things up, let me know if you have more questions.
I have encountered this question on SO while preparing a large article on linear combinations (it's in Russian, https://habrahabr.ru/post/324736/). It has a section on the weight space and I would like to share some thoughts from it.
Let's take a simple case of a linearly separable dataset with two classes, red and green.
In the dataspace X, samples are represented by points and the weight coefficients constitute a line. This could be conveyed by the following formula:
w^T * x + b = 0
But we can rewrite it the other way around, making x the vector of coefficients and w the vector of variables:
x^T * w + b = 0
because the dot product is symmetrical. Now it can be visualized in the weight space the other way around: each red and green sample becomes a line, and the weight vector becomes a single (blue) point.
The set of possible weights is then limited to the (magenta) area bounded by those sample lines, and that area corresponds to a family of admissible decision lines back in dataspace X.
Hope it clarifies the dataspace/weightspace correlation a bit. Feel free to ask questions, will be glad to explain in more detail.
The "decision boundary" for a single layer perceptron is a plane (hyper plane)
where n in the image is the weight vector w, in your case w={w1=1,w2=2}=(1,2) and the direction specifies which side is the right side. n is orthogonal (90 degrees) to the plane)
A plane always splits a space into 2 naturally (extend the plane to infinity in each direction)
you can also try to input different value into the perceptron and try to find where the response is zero (only on the decision boundary).
Recommend you read up on linear algebra to understand it better:
https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces
For a perceptron with 1 input & 1 output layer, there can only be 1 LINEAR hyperplane. And since there is no bias, the hyperplane cannot shift along an axis, so it will always pass through the origin. However, if there is a bias, the hyperplanes may no longer share a common point.
I think the reason why a training case can be represented as a hyperplane is because...
Let's say
[j,k] is the weight vector and
[m,n] is the training-input
training-output = jm + kn
Given that a training case in this perspective is fixed and the weights vary, the training-input (m, n) becomes the coefficients and the weights (j, k) become the variables.
Just as in any textbook where z = ax + by is a plane,
training-output = jm + kn is also a plane defined by training-output, m, and n.
The equation of a plane passing through the origin is written in the form:
ax + by + cz = 0
If a=1, b=2, c=3, the equation of the plane is:
x + 2y + 3z = 0
Now, in the weight space, every dimension represents a weight. So if the perceptron has 10 weights, the weight space is 10-dimensional.
Equation of the perceptron:
ax + by + cz <= 0 ==> Class 0
ax + by + cz > 0 ==> Class 1
Here a, b & c are the weights and x, y & z are the input features.
In the weight space, a, b & c are the variables (axes) instead. So for every training example, e.g. (x, y, z) = (2, 3, 4), a hyperplane is formed in the weight space, whose equation would be:
2a + 3b + 4c = 0
and it passes through the origin.
I hope now you understand it.
Consider that we have 2 weights, so w = [w1, w2]. Suppose we have input x = [x1, x2] = [1, 2]. If you use the weights to make a prediction, you have z = w1*x1 + w2*x2 and prediction y = z > 0 ? 1 : 0.
Suppose the label for the input x is 1. Thus we hope y = 1, and so we want z = w1*x1 + w2*x2 > 0. As a vector product, z = (w^T)x, so we want (w^T)x > 0. The geometric interpretation of this expression is that the angle between w and x is less than 90 degrees. For example, w = (1, 2) makes an acute angle with x and would give the correct prediction of 1 in this case. In fact, any vector that lies on the same side of the line w1 + 2*w2 = 0 as x itself would give the correct solution. However, any vector lying on the other side would give the wrong answer.
However, suppose the label is 0. Then the case would just be the reverse.
The above case gives the intuitive understanding and illustrates the 3 points in the lecture slide: the training case x determines the plane, and, depending on the label, the weight vector must lie on one particular side of that plane to give the correct answer.
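To make the weight-space picture concrete, here is a small NumPy sketch (my own illustration, with two made-up candidate weight vectors) that checks which side of the training case's hyperplane each candidate lies on:
import numpy as np
x = np.array([1.0, 2.0])   # training input; its hyperplane in weight space is w^T x = 0
label = 1                  # desired output for this training case
for w in (np.array([1.0, 2.0]), np.array([-3.0, 1.0])):
    z = float(np.dot(w, x))                    # w^T x
    prediction = 1 if z > 0 else 0
    side = "correct" if prediction == label else "wrong"
    print(w, "->", prediction, f"({side} side of the hyperplane)")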

Rescaling after feature scaling, linear regression

Seems like a basic question, but I need to use feature scaling (take each feature value, subtract the mean then divide by the standard deviation) in my implementation of linear regression with gradient descent. After I'm finished, I'd like the weights and regression line rescaled to the original data. I'm only using one feature, plus the y-intercept term. How would I change the weights, after I get them using the scaled data, so that they apply to the original unscaled data?
Suppose your regression is y = W*x + b, with x the scaled data. With the original data x0 it is
y = (W/std) * x0 + (b - W*u/std)
where u and std are the mean value and standard deviation of x0. Yet I don't think you need to transform back; just use the same u and std to scale new test data.
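A quick NumPy sketch (my own check, with made-up data) verifying that the rescaled weights reproduce the same predictions on the unscaled feature:
import numpy as np
x0 = np.array([3.0, 7.0, 11.0])    # original, unscaled feature
u, std = x0.mean(), x0.std()
x = (x0 - u) / std                 # scaled feature the model was trained on
W, b = 2.0, 0.5                    # weights learned on the scaled data (made-up values)
W0, b0 = W / std, b - W * u / std  # rescaled weights for the original data
print(np.allclose(W * x + b, W0 * x0 + b0))  # True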
