How to calculate gradients in ResNet architecture? - machine-learning

Assume that somehow the gradient at each layer is 0.1. In a plain (stacked) network, a layer's gradient can be computed by accumulating gradients with the chain rule, as in the figure above.
In a ResNet, the gradient is also propagated through the skip connection. So how can I obtain the gradient of x in the figure above? Is it 0.1 * 0.1 + 0.1, or just 0.1?

I have added the gradient calculation to the diagram above. The gradient delta_x is the sum of the incoming gradient delta_y and the product of the gradients delta_y and delta_F.
So in your example it should be 0.1 * 0.1 * 0.1 + 0.1.
But note that in the actual calculation of delta_F, delta_y gets multiplied by weight_1, gets passed or blocked depending on whether the ReLU is active, and then gets multiplied by weight_2.
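To make the two paths concrete, here is a minimal Python sketch of backprop through one residual block y = x + F(x). The scalar values are made up, and the branch is assumed to have the form F(x) = weight_2 * relu(weight_1 * x):

x = 1.0
w1, w2 = 0.1, 0.1      # assumed weight_1 and weight_2 inside the branch F
dy = 0.1               # incoming gradient delta_y

# forward pass: y = x + F(x), with F(x) = w2 * relu(w1 * x)
h = w1 * x
a = max(h, 0.0)        # ReLU
y = x + w2 * a         # skip connection adds x back in

# backward pass through the branch: chain rule through w2, the ReLU gate, w1
relu_gate = 1.0 if h > 0 else 0.0
dF = dy * w2 * relu_gate * w1      # 0.1 * 0.1 * 0.1 = 0.001

# backward pass through the skip connection: dy passes through unchanged
dx = dy + dF                       # 0.1 + 0.001 = 0.101
print(dx)

So the skip path contributes the raw 0.1, and the branch contributes the extra 0.1 * 0.1 * 0.1 term.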

Related

Do I need to include my scaled outputs in my back-propagation equation (SGD)?

Quick question: when I am backpropagating the loss to my parameters and I used a scaled output (e.g. tanh(x) * 2), do I need to include the derivative of the scaled output w.r.t. the original output? Thank you!
Before we can backpropagate the errors, we have to compute the gradient of the loss function with respect to each of the parameters. This involves computing the gradients of the outputs first and then applying the chain rule repeatedly. When you do this, the scaling constant remains as is. So, yes, you have to scale the errors accordingly.
As an example, you might have observed the following L2 regularized loss - a.k.a Ridge regression:
Loss = 1/2 * ||T - Y||^2 + lambda * ||w||^2
Here, we are scaling down the squared error so that, when we compute the gradient, the 1/2 and the 2 from differentiating the square cancel out. If we had not multiplied by 1/2 in the first place, we would have to keep the factor of 2 in the gradient; otherwise the overall gradient vector (squared-error term plus regularization term) would point in a different direction from the one that minimizes the loss.
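As a quick sanity check on the tanh(x) * 2 example from the question, here is a small numpy sketch comparing the analytic derivative, with and without the scaling constant, against a finite-difference estimate (the test point 0.7 is arbitrary):

import numpy as np

x = 0.7                                  # arbitrary test point
f = lambda t: 2 * np.tanh(t)             # scaled output from the question

# analytic derivative: the chain rule keeps the factor of 2
dfdx = 2 * (1 - np.tanh(x) ** 2)

# central finite-difference estimate for comparison
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)

print(dfdx, numeric)                     # both ~1.269: they agree
print(1 - np.tanh(x) ** 2)               # dropping the 2 gives ~0.635: wrong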

Generalization of gradient calculation for multi channel convolutions

I have been trying to understand how backpropagation for conv nets is implemented at the mathematical level. I came across this article, which explains the gradient calculation graphically for 2D convolution. The conv layer takes a 3x3 input and uses a filter of dimension 2x2, which on convolution gives a 2x2 output that is then fully connected. The gradient for this fully connected layer will be of dimension 2x2.
According to the article:
Gradient of conv layer = convolution between gradient of next layer
and weights of this layer
But I cannot generalize this for 3 channel inputs.
Let's say our input layer is of dimension 3x3x3 and we use 1 filter of dimension 2x2x3; the resulting convolution will again be of dimension 2x2, which is then treated as a fully connected layer.
Now the gradient for the fully connected layer will be 2x2. So, to calculate the gradient for the conv layer, we again need to compute the convolution between the 2x2 gradient layer and the 2x2x3 weight layer, but they are incompatible.
So I don't understand how to use this formula for calculating the gradient of 3D convolutions. How can I proceed after this step?
A derivation (or an article) with respect to a 3D input would also be really helpful.
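Not from the linked article, but for illustration, here is a minimal numpy sketch (random made-up data) of the usual bookkeeping in the 3D case: each scalar of the 2x2 output gradient scatters back through the whole 2x2x3 filter, so the input gradient comes out with the same 3x3x3 shape as the input, one 3x3 slice per channel:

import numpy as np

H, W, C = 3, 3, 3                 # input: 3x3 with 3 channels
kh, kw = 2, 2                     # filter: 2x2x3, output: 2x2

X = np.random.randn(H, W, C)      # made-up input
K = np.random.randn(kh, kw, C)    # made-up filter

# forward: valid cross-correlation, summed over all channels
Y = np.zeros((H - kh + 1, W - kw + 1))
for i in range(Y.shape[0]):
    for j in range(Y.shape[1]):
        Y[i, j] = np.sum(X[i:i+kh, j:j+kw, :] * K)

dY = np.ones_like(Y)              # stand-in upstream 2x2 gradient

# backward: each dY[i, j] scatters through the full 2x2x3 filter
dX = np.zeros_like(X)
for i in range(Y.shape[0]):
    for j in range(Y.shape[1]):
        dX[i:i+kh, j:j+kw, :] += dY[i, j] * K

print(dX.shape)                   # (3, 3, 3): same shape as the input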

The length of the gradient vector

It's just a simple thing that I need to clarify.
I need a little refresher in mathematics:
For a circle, should the length of the gradient be the radius? Or do we use the gradient only to get the orientation?
I got to this question after reading about gradients in image processing:
I've read this answer and this one about how to get the image gradient, and of course here.
I don't understand whether the magnitude should stand for a number of pixels, or whether it just stands for the strength of the intensity change at a specific point.
The following image is the magnitude of the gradient:
I ran the code and looked at the magnitude values, and the numbers are clearly not in the range of the image width/height.
I'm just waiting for a simple clarification.
Thanks!
Mathematically speaking, the gradient magnitude, or in other words the norm of the gradient vector, represents the derivative (i.e. the slope) of a 2D signal. This is quite clear in the definition given by Wikipedia:
gradient f = (df/dx) . x^ + (df/dy) . y^
Here, f is the 2D signal and x^, y^ (this notation is ugly, so I'll write them ux and uy in the following) are respectively the unit vectors in the horizontal and vertical directions.
In the context of images, the 2D signal (i.e. the image) is discrete instead of being continuous, hence the derivative is approximated by the difference between the intensity of the current pixel and the intensity of the previous pixel, in the considered direction (actually, there are several ways to approximate the derivative, but let's keep it simple). Hence, we can approximate the gradient by the following quantity:
gradient f (u,v) = [ f(u,v)-f(u-1,v) ] . ux + [ f(u,v)-f(u,v-1) ] . uy
In this case, the gradient magnitude is the following:
|| gradient f (u,v) || = square_root { [ f(u,v)-f(u-1,v) ]² + [ f(u,v)-f(u,v-1) ]² }
To summarize, the gradient magnitude is a measure of the local intensity change at a given point; it has little to do with a radius or with the width/height of the image.
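For completeness, here is a small numpy sketch of exactly this approximation (backward differences along each axis, with the first row and column left at zero):

import numpy as np

def gradient_magnitude(f):
    # f: 2D grayscale image; u indexes the first axis, v the second
    f = f.astype(float)
    du = np.zeros_like(f)
    dv = np.zeros_like(f)
    du[1:, :] = f[1:, :] - f[:-1, :]    # f(u,v) - f(u-1,v)
    dv[:, 1:] = f[:, 1:] - f[:, :-1]    # f(u,v) - f(u,v-1)
    return np.sqrt(du ** 2 + dv ** 2)

The returned values scale with the local contrast, not with the image dimensions, which is why the numbers you observed are unrelated to the width/height.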

How can I select the best set of parameters in the Canny edge detection algorithm implemented in OpenCV?

I am working with OpenCV on the Android platform. With the tremendous help from this community and techies, I am able to successfully detect a sheet out of the image.
These are the steps I used.
Imgproc.cvtColor()
Imgproc.Canny()
Imgproc.GaussianBlur()
Imgproc.findContours()
Imgproc.approxPolyDP()
findLargestRectangle()
Find the vertices of the rectangle
Order the vertices from the top-left, going anticlockwise, using a center-of-mass approach
Find the height and width of the rectangle (just to maintain the aspect ratio) and do the warpPerspective transformation.
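For reference, the question uses the Android/Java API, but the same pipeline in Python (OpenCV 4.x, with a made-up input file name) looks roughly like this:

import cv2

img = cv2.imread('sheet.jpg')                        # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blur, 75, 200)                     # thresholds tuned by hand
contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
largest = max(contours, key=cv2.contourArea)
peri = cv2.arcLength(largest, True)
quad = cv2.approxPolyDP(largest, 0.02 * peri, True)  # ideally 4 vertices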
After applying all these steps I can easily get the document, i.e. the largest rectangle, out of an image. But it depends heavily on the difference in intensity between the background and the document sheet. As the Canny edge detector works on the principle of intensity gradients, a difference in intensity is always assumed on the implementation side. That is why Canny takes two threshold parameters into account:
Lower threshold
Higher threshold
So if the intensity gradient of a pixel is greater than the higher threshold, it will be added as an edge pixel in the output image. A pixel will be rejected completely if its intensity gradient value is lower than the lower threshold. And if a pixel has an intensity between the lower and higher threshold, it will only be added as an edge pixel if it is connected to any other pixel having the value larger than the higher threshold.
My main purpose is to use Canny edge detection for the document scanning. So how can I compute these thresholds dynamically so that it can work with the both cases of dark and light background?
I tried a lot by manually adjusting the parameters, but I couldn't find any relationship associated with the scenarios.
You could calculate your thresholds using Otsu’s method.
The (Python) code would look like this:
import cv2

# Otsu picks the high threshold automatically; use half of it as the low one
high_thresh, thresh_im = cv2.threshold(im, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
lowThresh = 0.5 * high_thresh
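The resulting pair can then be passed straight to Canny (assuming im is a single-channel grayscale image):

edges = cv2.Canny(im, lowThresh, high_thresh)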
Use the following snippet which I obtained from this blog:
import cv2
import numpy as np

sigma = 0.33                 # explained below
v = np.median(gray_image)

# ---- apply automatic Canny edge detection using the computed median ----
lower = int(max(0, (1.0 - sigma) * v))
upper = int(min(255, (1.0 + sigma) * v))
edged = cv2.Canny(gray_image, lower, upper)

cv2.imshow('Edges', edged)
cv2.waitKey(0)
So what am I doing here?
I am taking the median value of the grayscale image and setting the lower and upper thresholds around it with sigma = 0.33. The value 0.33 is a common statistical rule of thumb, so it is used here as well.

How to determine the window size of a Gaussian filter

Gaussian smoothing is a common image processing operation; for an introduction to Gaussian filtering, please refer to here. As we can see, one parameter, the standard deviation, determines the shape of the Gaussian function. However, when we perform convolution with a Gaussian filter, another parameter, the window size of the filter, must be determined at the same time. For example, when we use the fspecial function provided by MATLAB, not only the standard deviation but also the window size must be provided. Intuitively, the larger the Gaussian standard deviation, the bigger the Gaussian kernel window should be. However, there seems to be no general rule about how to set the right window size. Any ideas? Thanks!
The size of the mask drives the amount of filtering: a larger mask, i.e. a larger convolution kernel, generally results in a greater degree of smoothing. As a trade-off for greater noise reduction, larger filters also blur fine detail in the image.
That is the general principle. Now, coming to the Gaussian filter, the standard deviation is the main parameter. If you use a 2D filter, you will usually want the weights at the edge of the mask to be close to 0.
In this respect, a good choice is a mask whose half-width is roughly three times the standard deviation. This way almost the whole Gaussian bell is taken into account, and towards the mask's edges the weights tend to zero.
I hope this helps.
Here is a good reference.
1. After discretizing, pixels at a distance greater than 3*sigma have negligible weights. See this.
2. As already pointed out, 6*sigma means 3*sigma in both directions.
3. The size of the convolution matrix used for filtering should therefore be 6*sigma by 6*sigma, because of points 1 and 2 above.
Here is how you can obtain the discrete Gaussian.
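For instance, here is a minimal numpy sketch that samples the Gaussian on an integer grid out to the 3*sigma radius and normalizes the weights to unit sum:

import numpy as np

def gaussian_kernel_1d(sigma):
    radius = int(np.ceil(3 * sigma))            # the 3*sigma rule from above
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()                # normalize to unit sum

print(gaussian_kernel_1d(1.0))                  # 7 taps for sigma = 1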
Finally, the size of the standard deviation (and therefore of the kernel used) depends on how much noise you suspect in the image. Clearly, a larger convolution kernel means that pixels farther away contribute to the new value of the centre pixel, as opposed to a smaller kernel.
Given sigma and the minimal weight epsilon that should still appear in the filter, you can solve for the necessary filter radius x by requiring the 1D Gaussian to stay above epsilon:
1/(sigma * sqrt(2*pi)) * exp(-x^2 / (2*sigma^2)) >= epsilon
which gives
x <= sigma * sqrt(-2 * ln(sqrt(2*pi) * sigma * epsilon))
For example, if sigma = 1 then the Gaussian is greater than epsilon = 0.01 for x <= 2.715, so a filter radius of 3 (width = 2*3 + 1 = 7) is sufficient.
sigma = 0.5, x <= 1.48, use radius 2
sigma = 1, x <= 2.715, use radius 3
sigma = 1.5, x <= 3.84, use radius 4
sigma = 2, x <= 4.89, use radius 5
sigma = 2.5, x <= 5.88, use radius 6
If you reduce/increase epsilon then you will need a larger/smaller radius.
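Putting the formula above into code, here is a small numpy sketch that reproduces the radii listed (epsilon = 0.01):

import numpy as np

def gaussian_radius(sigma, epsilon=0.01):
    # solve 1/(sigma*sqrt(2*pi)) * exp(-x^2/(2*sigma^2)) >= epsilon for x
    x = sigma * np.sqrt(-2.0 * np.log(np.sqrt(2.0 * np.pi) * sigma * epsilon))
    return int(np.ceil(x))

for s in (0.5, 1.0, 1.5, 2.0, 2.5):
    print(s, gaussian_radius(s))                # radii 2, 3, 4, 5, 6 as above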
