Why is dilated convolution computationally efficient?

I have been studying the UNet-inspired architecture ENet and I think I follow the basic concepts. The cornerstone of ENet's efficiency is dilated convolution (among other things). I understand how it preserves spatial resolution and how it is computed, but I can't understand why it is computationally and memory-wise less expensive than e.g. max-pooling.
ENet: https://arxiv.org/pdf/1606.02147.pdf

You simply skip a computational layer by using a dilated convolution layer:
For example, a dilated convolution with
a filter kernel k×k = 3×3, dilation rate r = 2, stride s = 1 and no padding
is comparable to
2x downsampling followed by a 3×3 convolution followed by 2x upsampling
For further reference, look at the amazing paper by Vincent Dumoulin and Francesco Visin:
A guide to convolution arithmetic for deep learning
On the GitHub page of this paper there is also an animation of how dilated convolution works:
https://github.com/vdumoulin/conv_arithmetic
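To make the mechanics concrete, here is a minimal sketch in plain NumPy (the helper name dilated_conv2d is just for illustration) of a "valid", stride-1 dilated convolution; a 3x3 kernel with rate 2 covers a 5x5 window while still using only 9 multiplications per output pixel:

import numpy as np

def dilated_conv2d(x, k, rate=2):
    # naive "valid", stride-1 dilated convolution, for illustration only
    kh, kw = k.shape
    eff_h = (kh - 1) * rate + 1   # extent of the kernel once its taps are spread apart
    eff_w = (kw - 1) * rate + 1
    out = np.zeros((x.shape[0] - eff_h + 1, x.shape[1] - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the input at dilated offsets: still only kh * kw multiplications
            patch = x[i:i + eff_h:rate, j:j + eff_w:rate]
            out[i, j] = np.sum(patch * k)
    return out

x = np.random.rand(8, 8)
k = np.ones((3, 3)) / 9.0
print(dilated_conv2d(x, k, rate=2).shape)   # (4, 4): a 5x5 window covered by 9 weights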

In addition to the accepted answer by @T1Berger: think of a situation where you want to capture larger features across many pixels without down-sampling, which causes loss of information. The traditional way to do this would be to use larger kernels in the convolution layers. These larger kernels are computationally expensive. By using a dilated convolution layer, larger features can be extracted with fewer operations. This is true for frameworks where operations on sparse feature maps are optimized.
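As a rough back-of-the-envelope comparison (plain Python, numbers purely illustrative): a small dilated kernel spans the same window as a much larger dense kernel while keeping the small kernel's weight and multiplication count:

def effective_size(k, rate):
    # a k x k kernel with dilation rate r spans (k - 1) * r + 1 pixels per side
    return (k - 1) * rate + 1

k, rate = 3, 3
print(effective_size(k, rate))                       # 7: same window as a dense 7x7 kernel
print(k * k, "vs", effective_size(k, rate) ** 2)     # 9 weights/ops vs 49 per output pixel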

Related

Is there a fundamental limit on how accurate location information is encoded in CNNs

Each layer in a CNN reduces the size of the input via convolution and max-pooling operations. Convolution is translation equivariant, but max-pooling is translation invariant. Correct me if this is wrong: each time max-pooling is applied, the precision of a feature's location is reduced. So the feature maps of the final conv layer in a very deep CNN will have a large receptive field (w.r.t. the original image), but the location of this feature (in the original image) is not discernible from looking at this feature map alone.
If this is true, how can bounding boxes be so accurate when we do localisation with a deep CNN? I understand how classification works, but making accurate bounding box predictions is confusing me.
Perhaps a toy example will clarify my confusion:
Say we have a dataset of images with dimension 256x256x1, and we want to predict whether a cat is present, and if so, where it is, so our target is something like [sigmoid_cat_present, cat_location].
Our vanilla CNN (let's assume something like VGG) will take in the image and transform it to something like 16x16x256 in the last convolutional layer. Each pixel in this final 16x16 feature map can be influenced by a much larger region in the original image. So if we determine a cat is present, how can [cat_location] be refined to a value more granular than this effective receptive field?
To add to your question: how about pixel-perfect accuracy of segmentation boundaries?
Your intuition regarding down-sampling via max-pooling is correct. Normal CNNs have that limit. However, there have been some improvements recently to overcome it.
The breakthrough for this problem came in 2015-2016 in the form of U-Net and the atrous/dilated convolutions introduced in DeepLab.
Dilated convolutions or atrous convolutions, previously described for wavelet analysis without signal decimation, expand the window size without increasing the number of weights by inserting zero values into the convolution kernels. Dilated convolutions have been shown to decrease blurring in semantic segmentation maps, and are purported to work at least in part by extracting long-range information without the need for pooling.
Using U-Net architectures is another method that seeks to retain high spatial frequency information by directly adding skip connections between early and late layers. In other words, down-sampling followed by up-sampling, with the skip connections carrying the fine detail across.
In TensorFlow, atrous convolutions are implemented with the function:
tf.nn.atrous_conv2d
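A minimal usage sketch, assuming TensorFlow 2.x and NHWC tensors (the shapes here are purely illustrative):

import tensorflow as tf

x = tf.random.normal([1, 64, 64, 3])    # [batch, height, width, channels]
w = tf.random.normal([3, 3, 3, 16])     # [kernel_h, kernel_w, in_channels, out_channels]

# rate=2 spreads the 3x3 taps over a 5x5 window; 'SAME' keeps the spatial size
y = tf.nn.atrous_conv2d(x, w, rate=2, padding='SAME')
print(y.shape)                          # (1, 64, 64, 16)

# the Keras convolution layer exposes the same idea through dilation_rate
layer = tf.keras.layers.Conv2D(16, 3, dilation_rate=2, padding='same')
print(layer(x).shape)                   # (1, 64, 64, 16)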
There are many more methods and this is an ongoing research area.

When to use dilated convolutions?

I do not understand what the use of dilated convolution is and when we should use it. Is it for when we want a larger receptive field while saving memory? Does increasing the dilation rate increase the spacing between the kernel points?
Referring to Multi-Scale Context Aggregation by Dilated Convolutions: yes, you can save some memory while having a larger receptive field. You might want to use dilated convolutions if you want an exponential expansion of the receptive field without loss of resolution or coverage. This allows us to have a larger receptive field at the same computation and memory cost while preserving resolution. Pooling and strided convolutions can also "expand" the receptive field in a sense, but they reduce the data's resolution.
Generally, dilated convolutions have also been shown to perform better, for example in image segmentation with DeepLab and in speech with WaveNet.
Here is a neat visualization of what dilation does.
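To make the "exponential expansion" concrete, here is a small plain-Python sketch of the receptive field of stacked stride-1 3x3 convolutions, comparing constant dilation 1 with dilation rates that double per layer as in the paper above:

def receptive_field(dilation_rates, k=3):
    # each stride-1 layer with kernel size k and dilation r adds (k - 1) * r pixels
    rf = 1
    for r in dilation_rates:
        rf += (k - 1) * r
    return rf

n = 5
print(receptive_field([1] * n))                     # plain 3x3 convs: 11 (linear growth)
print(receptive_field([2 ** i for i in range(n)]))  # rates 1,2,4,8,16: 63 (exponential growth)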

What is the complexity of the Marr-Hildreth (Laplacian of Gaussian) filter?

What is the drawback of the Laplacian of Gaussian filter? Why are we going for the Difference of Gaussians?
There is no drawback in the Laplacian of Gaussian. I use it all the time. The Difference of Gaussians is an approximation to it, but both need the same amount of computation:
LoG: convolution with the second derivative along x of a Gaussian + convolution with the second derivative along y of a Gaussian
DoG: convolution with a Gaussian - convolution with another Gaussian
Each of those convolutions is a separable operation, so both require 4 1D convolutions, and 1 intermediate image to store one of the two results.
Many people implement these operations differently, for example the LoG as a convolution with a Gaussian and then with a discrete Laplace operator. This is, again, an approximation, and could be slightly faster.
There are also separable approximations to the DoG (which thus require only two 1D convolutions), but these are much less isotropic (meaning not invariant to rotations of the image).
Little known fact: as the two sigmas in the Difference of Gaussians approach each other, the approximation becomes more similar to the Laplacian of Gaussian.
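As a small sketch of both constructions with SciPy's ndimage (the sigma values below are arbitrary): the LoG as the sum of two separable second-derivative-of-Gaussian convolutions, and the DoG as the difference of two separable Gaussian convolutions:

import numpy as np
from scipy import ndimage

img = np.random.rand(128, 128)
sigma = 2.0

# LoG: sum of the second-derivative-of-Gaussian convolutions along each axis (each separable)
log = (ndimage.gaussian_filter(img, sigma, order=(2, 0)) +
       ndimage.gaussian_filter(img, sigma, order=(0, 2)))

# DoG: one Gaussian minus another Gaussian (each separable as well)
dog = ndimage.gaussian_filter(img, 1.6 * sigma) - ndimage.gaussian_filter(img, sigma)
# the closer the two sigmas, the closer a suitably scaled DoG gets to the LoG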
EDIT: I have just posted a more elaborate answer over at Signal Processing.

How to choose the window size of CNN in deep learning?

In a Convolutional Neural Network (CNN), a filter is selected for weight sharing. For example, in the following pictures, a 3x3 window with stride (the distance between adjacent neurons) 1 is chosen.
So my question is: How to choose the window size? If I use 4x4 with the stride being 2, how much difference will it cause? Thanks a lot in advance!
There's no definite answer to this: filter size is one of the hyperparameters you generally need to tune. However, there are some useful observations that may help you. It's often preferred to choose smaller filters, but to have a greater number of them.
Example: four 5x5 filters have 100 parameters (ignoring bias), while ten 3x3 filters have 90 parameters. Through the larger number of filters you can still capture a variety of features in the image, but with fewer parameters. More on this here.
Modern CNNs go even further with this idea and choose consecutive 3x1 and 1x3 convolutional layers. This reduces the number of parameters even more, but doesn't affect the performance. See the evolution of inception network.
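A quick parameter count in plain Python (bias ignored, channel counts purely illustrative) covering both the 5x5-vs-3x3 example above and the 3x1/1x3 factorisation:

def conv_params(kh, kw, c_in, c_out):
    # weight count of one convolutional layer, bias ignored
    return kh * kw * c_in * c_out

print(conv_params(5, 5, 1, 4))     # four 5x5 filters on a single channel: 100 parameters
print(conv_params(3, 3, 1, 10))    # ten 3x3 filters on a single channel:   90 parameters

c = 64  # an illustrative channel count
print(conv_params(3, 3, c, c))                            # one 3x3 layer:       36864
print(conv_params(3, 1, c, c) + conv_params(1, 3, c, c))  # 3x1 then 1x3 layers: 24576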
The choice of stride is also important, but it affects the tensor shape after the convolution, hence the whole network. The general rule is to use stride=1 in usual convolutions and preserve the spatial size with padding, and use stride=2 when you want to downsample the image.

Gaussian blur and FFT

I'm trying to make an implementation of Gaussian blur for a school project.
I need to make both a CPU and a GPU implementation to compare performance.
I am not quite sure that I understand how Gaussian blur works. So one of my questions is
whether I have understood it correctly.
Here's what I do now:
I use the equation from wikipedia http://en.wikipedia.org/wiki/Gaussian_blur to calculate
the filter.
For 2D I take the RGB of each pixel in the image and apply the filter to it by
multiplying the RGB of the pixel and the surrounding pixels with the associated filter positions.
These are then summed to become the new pixel's RGB values.
For 1D I apply the filter first horizontally and then vertically, which should give
the same result if I understand things correctly.
Is this result exactly the same result as when the 2d filter is applied?
Another question I have is about how the algorithm can be optimized.
I have read that the Fast Fourier Transform is applicable to Gaussian blur.
But I can't figure out how to relate it.
Can someone give me a hint in the right direction?
Thanks.
Yes, the 2D Gaussian kernel is separable so you can just apply it as two 1D kernels. Note that you can't apply these operations "in place" however - you need at least one temporary buffer to store the result of the first 1D pass.
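Here is a minimal NumPy/SciPy sketch (sigma and kernel radius are arbitrary) showing that a horizontal 1D pass into a temporary buffer followed by a vertical 1D pass reproduces the full 2D convolution:

import numpy as np
from scipy import ndimage

sigma, radius = 2.0, 6
x = np.arange(-radius, radius + 1)
g1d = np.exp(-x**2 / (2 * sigma**2))
g1d /= g1d.sum()                      # normalised 1D Gaussian kernel
g2d = np.outer(g1d, g1d)              # the equivalent 2D kernel

img = np.random.rand(64, 64)

tmp  = ndimage.convolve1d(img, g1d, axis=1, mode='reflect')   # horizontal pass into a buffer
sep  = ndimage.convolve1d(tmp, g1d, axis=0, mode='reflect')   # vertical pass
full = ndimage.convolve(img, g2d, mode='reflect')             # single 2D pass

print(np.allclose(sep, full))         # True, up to floating-point rounding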
FFT-based convolution is a useful optimisation when you have large kernels - this applies to any kind of filter, not just Gaussian. Just how big "large" is depends on your architecture, but you probably don't want to worry about using an FFT-based approach for anything smaller than, say, a 49x49 kernel. The general approach is:
FFT the image
FFT the kernel, padded to the size of the image
multiply the two in the frequency domain (equivalent to convolution in the spatial domain)
IFFT (inverse FFT) the result
Note that if you're applying the same filter to more than one image then you only need to FFT the padded kernel once. You still have at least two FFTs to perform per image though (one forward and one inverse), which is why this technique only becomes a computational win for large-ish kernels.
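A minimal NumPy sketch of those steps (this performs a circular convolution, so boundary handling differs from a spatial-domain blur with reflected edges; the kernel is rolled so its centre sits at the origin):

import numpy as np

def fft_convolve(img, kernel):
    # circular convolution of img with kernel via the FFT, for illustration only
    kh, kw = kernel.shape
    padded = np.zeros_like(img)               # pad the kernel to the size of the image
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))  # centre kernel at origin
    spectrum = np.fft.rfft2(img) * np.fft.rfft2(padded)   # FFT both, multiply in frequency domain
    return np.fft.irfft2(spectrum, s=img.shape)           # inverse FFT back to spatial domain

img = np.random.rand(256, 256)
kernel = np.ones((49, 49)) / 49**2            # a large box kernel, just for illustration
print(fft_convolve(img, kernel).shape)        # (256, 256)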
