CNN padding and striding - machine-learning

In CNNs, if padding is used so that the size of the image doesn't shrink after several convolutional layers, then why do we use strided convolutions? I wonder because strided convolutions also reduce the size of the image.

Because we want to reduce the size of the image. There are several reasons:
Reduce computational and memory requirements.
Aggregate local features into higher-level features.
Subsequent convolutions then have a larger receptive field in the original scale.
Traditionally, pooling (e.g. max-pooling) has been used to reduce the size of the image. Strided convolution is another way to do this, and it is becoming more popular.
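As a minimal sketch in tf.keras (the filter counts and variable names are just for illustration), a 'same'-padded convolution with stride 1 preserves the spatial size, while the same convolution with stride 2 halves it, much like a pooling layer would:

```python
import tensorflow as tf

x = tf.random.normal([1, 32, 32, 3])  # dummy batch: one 32x32 RGB image

# 'same' padding, stride 1: spatial size is preserved
conv_s1 = tf.keras.layers.Conv2D(16, kernel_size=3, strides=1, padding="same")
print(conv_s1(x).shape)  # (1, 32, 32, 16)

# 'same' padding, stride 2: spatial size is halved, as pooling would do
conv_s2 = tf.keras.layers.Conv2D(16, kernel_size=3, strides=2, padding="same")
print(conv_s2(x).shape)  # (1, 16, 16, 16)
```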

Related

When to use dilated convolutions?

I do not understand what dilated convolutions are used for and when we should use them. Is it when we want a larger receptive field while saving memory? And does increasing the dilation size increase the spacing between the kernel points?
Referring to Multi-Scale Context Aggregation by Dilated Convolutions: yes, you can save some memory while having a larger receptive field. You might want to use dilated convolutions if you want an exponential expansion of the receptive field without loss of resolution or coverage. This lets us have a larger receptive field at the same computational and memory cost while preserving resolution. Pooling and strided convolutions can also "expand" the receptive field, but they reduce the data's resolution.
Dilated convolutions have also generally been shown to perform better, for example in image segmentation with DeepLab and in speech with WaveNet.
There are neat visualizations online of what dilation does.
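As a rough illustration in tf.keras (the layer and tensor names are made up for the example), a 3x3 convolution with dilation rate 2 spreads its nine taps over a 5x5 window, so the receptive field grows without adding parameters and without reducing the output resolution:

```python
import tensorflow as tf

x = tf.random.normal([1, 32, 32, 3])  # dummy batch: one 32x32 RGB image

# Standard 3x3 convolution: each output unit sees a 3x3 input window
conv = tf.keras.layers.Conv2D(16, kernel_size=3, padding="same")

# Dilated 3x3 convolution: the same 9 weights cover a 5x5 window
dilated = tf.keras.layers.Conv2D(16, kernel_size=3, dilation_rate=2, padding="same")

print(conv(x).shape)     # (1, 32, 32, 16)
print(dilated(x).shape)  # (1, 32, 32, 16) -- resolution is preserved
```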

Why does the kernel size become larger while the spatial size of the feature map goes down in Inception networks?

In Inception networks such as Inception-v3 and Inception-v4, the kernel sizes are smaller in the lower layers (e.g. 3x3), but in the higher layers the kernel sizes seem to be larger (e.g. 5x5 or 7x7), although they may later be factorized into n x 1 and 1 x n convolutions. As the network goes deeper, the spatial size of the feature map goes down. Is there any relationship between these two things?
P.S. My question is why the kernel sizes in the lower layers seem to be smaller (no more than 3x3), while you can find larger kernel sizes like 7x7 in the higher layers (more precisely, in the middle layers). Is there any relationship between the spatial size of the feature map and the spatial size of the convolution kernels? Take Inception-v3 as an example: when the spatial sizes of the feature maps are larger than 35 in the first few layers of the network, the biggest kernel size is 5x5, but when the spatial size drops to 17, kernel sizes like 7x7 are used.
Any help will be appreciated.
Generally, as you go deeper into the network:
The spatial size of the feature map decreases, to localize the object and to reduce the computational cost.
The number of filters/kernels increases, because the initial layers usually represent generic features while the deeper layers represent more detailed features. Since the initial layers learn only primitive regularities in the data, you do not need a high volume of filters there. However, as you go deeper you should try to look into as much detail as possible, hence the increase in the number of filters/kernels. Increasing the number of filters in deeper layers therefore increases the representational power of the network.
In an Inception module, at each layer, kernels of multiple sizes (1x1, 3x3, 5x5) are applied in parallel and the resulting feature maps are concatenated and passed to the next layer.
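A minimal sketch of such a module in tf.keras (the filter counts are arbitrary, not the ones used in the actual Inception papers): three parallel branches with 1x1, 3x3, and 5x5 kernels, concatenated along the channel axis.

```python
import tensorflow as tf
from tensorflow.keras import layers

def toy_inception_block(x, filters=32):
    # Parallel convolutions of several kernel sizes, concatenated on channels
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    return layers.Concatenate()([b1, b3, b5])  # channels = 3 * filters

inputs = tf.keras.Input(shape=(35, 35, 64))
outputs = toy_inception_block(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 35, 35, 96)
```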

Do convolutional neural networks run faster on binary images?

I am trying some DCNNs to recognize handwritten words (word spotting), where the images are binary, and I am wondering whether the computation time will be faster than when using DCNNs on gray-level or color images.
In addition, how can one equalize the image sizes, given that normalizing the word images will produce words at different scales?
Any suggestions?
The computation time for gray-scale images is certainly faster, but not because of the zeros; it is simply the input tensor size. Color images are [batch, width, height, 3], while gray-scale images are [batch, width, height, 1]. The difference in depth, as well as in spatial size, affects the time spent in the first convolutional layer, which is usually one of the most time-consuming. That is why you should consider resizing the images as well.
You may also want to read about the 1x1 convolution trick to speed up computation. It is usually applied in the middle of the network, when the number of filters becomes significantly large.
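To make the savings concrete, here is a sketch of the 1x1 bottleneck in tf.keras (the channel counts are invented for the example): reducing 256 channels to 64 with a 1x1 convolution before a 3x3 convolution cuts the parameter count substantially.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal([1, 16, 64, 256])  # feature map with 256 channels

direct = layers.Conv2D(256, 3, padding="same")        # 3*3*256*256 weights
bottleneck = tf.keras.Sequential([
    layers.Conv2D(64, 1, padding="same"),             # 1x1: 256 -> 64 channels
    layers.Conv2D(256, 3, padding="same"),            # 3*3*64*256 weights
])

direct(x); bottleneck(x)  # run once so the weights are built
print(direct.count_params())      # 590080
print(bottleneck.count_params())  # 164160 -- roughly 3.6x fewer parameters
```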
As for the second question (if I get it right), you ultimately have to resize the images. If the images contain text in different font sizes, one possible strategy is to resize and pad, or crop and resize. You have to know the font size in each particular image to select the right padding or crop size, so this method may need a fair amount of manual work.
A completely different way would be to ignore these differences and let the network learn OCR despite the font-size discrepancy. It is a viable solution that does not require a lot of manual pre-processing, but it simply needs more training data to avoid overfitting. If you examine the MNIST dataset, you will notice the digits are not always the same size, yet CNNs achieve 99.5% accuracy on it pretty easily.
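A rough sketch of the resize-and-pad idea using TensorFlow's image utilities (the 32x128 target size is just an assumed example for word images): each word is resized while preserving its aspect ratio and then zero-padded to a common shape.

```python
import tensorflow as tf

# Hypothetical word crops of different sizes (height x width x channels)
word_a = tf.random.uniform([40, 180, 1])
word_b = tf.random.uniform([25, 90, 1])

# Resize preserving aspect ratio, then pad to a fixed target size
target_h, target_w = 32, 128
a = tf.image.resize_with_pad(word_a, target_h, target_w)
b = tf.image.resize_with_pad(word_b, target_h, target_w)
print(a.shape, b.shape)  # (32, 128, 1) (32, 128, 1)
```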

Pooling Layer vs. Using Padding in Convolutional Layers

My understanding is that we use padding when we convolve because convolving with filters shrinks the dimensions of the output and loses information from the edges/corners of the input matrix. However, we also use a pooling layer after a number of conv layers in order to downsample our feature maps. Doesn't this seem contradictory? We use padding because we do NOT want to reduce the spatial dimensions, but we later use pooling to reduce the spatial dimensions. Could someone provide some intuition behind these two?
Without loss of generality, assume we are dealing with images as inputs. The reason behind padding is not only to keep the dimensions from shrinking; it is also to ensure that input pixels on the corners and edges of the input are not "disadvantaged" in affecting the output. Without padding, a pixel on the corner of an image overlaps with just one filter region, while a pixel in the middle of the image overlaps with many filter regions. Hence, the pixel in the middle affects more units in the next layer and therefore has a greater impact on the output.
Secondly, you actually do want to shrink the dimensions of your input (remember, deep learning is all about compression, i.e. finding low-dimensional representations of the input that disentangle the factors of variation in your data). The shrinking induced by convolutions with no padding is not ideal: if you have a really deep net, you would quickly end up with very low-dimensional representations that lose most of the relevant information in the data. Instead, you want to shrink your dimensions in a smart way, which is achieved by pooling. In particular, max pooling has been found to work well. This is really an empirical result, i.e. there is not a lot of theory to explain why this is the case. You could imagine that by taking the max over nearby activations, you still retain the information about the presence of a particular feature in this region, while losing information about its exact location. This can be good or bad: good because it buys you translation invariance, and bad because the exact location may be relevant for your problem.
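A short sketch of how the two are typically combined in tf.keras (the filter counts are arbitrary): 'same' padding keeps each convolution from nibbling away at the borders, and pooling then performs the deliberate, controlled downsampling.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),  # 32x32 -> 32x32
    layers.Conv2D(32, 3, padding="same", activation="relu"),  # 32x32 -> 32x32
    layers.MaxPool2D(pool_size=2),                            # 32x32 -> 16x16
    layers.Conv2D(64, 3, padding="same", activation="relu"),  # 16x16 -> 16x16
    layers.MaxPool2D(pool_size=2),                            # 16x16 -> 8x8
])
model.summary()  # 'same' padding preserves size per conv; pooling halves it
```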

How does the size of the patch/kernel impact the result of a convnet?

I am playing around with convolutional neural networks at home with TensorFlow (by the way, I have done the Udacity deep learning course, so I have the theory basics). What impact does the size of the patch have when one runs a convolution? Does that size have to change when the image is bigger or smaller?
One of the exercises I did involved the CIFAR-10 database of images (32x32 px); I used 3x3 convolutions (with a padding of 1) and got decent results.
But let's say I now want to play with images larger than that (say 100x100): should I make my patches bigger? Do I keep them 3x3? Furthermore, what would be the impact of making a patch really big? (Say 50x50.)
Normally I would test this at home directly, but running this on my computer is a bit slow (no NVIDIA GPU!).
So the questions can be summarized as:
Should I increase/decrease the size of my patches when my input images are bigger/smaller?
What is the impact (in terms of performance/overfitting) of increasing/decreasing my patch size?
If you are not using padding, a larger kernel makes the number of neurons in the next layer smaller.
Example: a 1x1 kernel gives the next layer the same number of neurons; an NxN kernel (on an NxN input) gives only one neuron in the next layer.
The impact of a larger kernel:
Computation time can be faster and memory usage smaller, because the output feature map is smaller.
A lot of detail is lost. Imagine an NxN input and an NxN kernel: the next layer gives you only one neuron. Losing a lot of detail can lead to underfitting.
The answer:
It depends on the image. If you need a lot of detail from the image, you do not need to increase your kernel size. If your image is a 1000x1000-pixel large version of an MNIST image, I would increase the kernel size.
A smaller kernel gives you a lot of detail, which can lead to overfitting, while a larger kernel loses a lot of detail, which can lead to underfitting. You should tune your model to find the best size. Sometimes, time and machine specifications should also be considered.
If you are using padding, you can adjust it so that the number of neurons after the convolution stays the same. I cannot say this is better than not using padding, but a larger kernel still loses more detail than a smaller one.
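The size relationship described above can be written out explicitly. A small helper (hypothetical, just for illustration) computes the output width from the input width n, kernel size k, padding p, and stride s:

```python
def conv_output_size(n, k, p=0, s=1):
    """Output width of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

n = 32  # e.g. CIFAR-10 image width
for k in (1, 3, 5, 7, 32):
    no_pad = conv_output_size(n, k)
    padded = conv_output_size(n, k, p=(k - 1) // 2)
    print(f"kernel {k:>2}: no padding -> {no_pad:>2}, padded -> {padded:>2}")
# A 32x32 kernel with no padding gives a single neuron, as in the answer above
```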
It depends more on the size of the objects you want to detect, or in other words, the size of the receptive field you want to have. Nevertheless, choosing the kernel size has always been a challenging decision. That is why the Inception model was created, which uses different kernel sizes (1x1, 3x3, 5x5). The creators of this model also went further and decomposed the convolutional layers into ones with smaller patch sizes while maintaining the same receptive field, in order to speed up training (e.g. 5x5 was decomposed into two 3x3, and 3x3 was decomposed into 3x1 and 1x3), creating different versions of the Inception model.
You can also check the Inception V2 paper for more details https://arxiv.org/abs/1512.00567
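To make the decomposition argument concrete, here is a quick parameter-count comparison in tf.keras (the channel counts are invented for the example): a single 5x5 convolution versus two stacked 3x3 convolutions that cover the same 5x5 receptive field.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal([1, 17, 17, 64])  # dummy feature map

five_by_five = layers.Conv2D(64, 5, padding="same")           # 5*5*64*64 weights
two_three_by_three = tf.keras.Sequential([
    layers.Conv2D(64, 3, padding="same", activation="relu"),  # first 3x3
    layers.Conv2D(64, 3, padding="same"),                     # second 3x3 -> 5x5 receptive field
])

five_by_five(x); two_three_by_three(x)  # run once so the weights are built
print(five_by_five.count_params())        # 102464
print(two_three_by_three.count_params())  # 73856 -- fewer parameters, same coverage
```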
