In ResNet-50's bottleneck block, we use conv1x1(previous block output, 64), conv3x3(64, 64), conv1x1(64, 256), and then we repeat.
Why do we increase the dimensions in the third conv1x1 layer only to reduce them again in the next block?
It's a bottleneck layer, see the answer to this question:
https://ai.stackexchange.com/a/4887
and this question:
https://stats.stackexchange.com/questions/205150/how-do-bottleneck-architectures-work-in-neural-networks
In essence, it's to reduce the feature count: the first 1x1 convolution squeezes the channels down so that the expensive 3x3 convolution runs on far fewer features, and the last 1x1 convolution expands them back so the blocks can be stacked.
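If it helps, here is a rough PyTorch sketch of that pattern (channel sizes match the 256 → 64 → 64 → 256 block you describe; batch norm and the strided/downsampling variant are left out for brevity, so this is not the exact torchvision implementation):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of the 1x1 -> 3x3 -> 1x1 bottleneck pattern (no BN, no downsampling)."""
    def __init__(self, in_channels=256, mid_channels=64, out_channels=256):
        super().__init__()
        # 1x1 conv: squeeze 256 channels down to 64 (cheap)
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        # 3x3 conv: the expensive spatial convolution now runs on only 64 channels
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        # 1x1 conv: expand back to 256 channels so the next block can repeat the pattern
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return self.relu(out + x)  # identity shortcut

x = torch.randn(1, 256, 56, 56)
print(Bottleneck()(x).shape)  # torch.Size([1, 256, 56, 56])
```

The 3x3 convolution dominates the cost, and it only ever sees 64 channels instead of 256; the surrounding 1x1 convolutions are what make that possible.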
I'm using batch normalization with a batch size of 10 for face detection, and I wanted to know whether it is better to remove the batch norm layers or keep them.
And if it is better to remove them, what can I use instead?
This question depends on a few things, the first being the depth of your neural network. Batch normalization is useful for speeding up training when there are a lot of hidden layers. It can decrease the number of epochs it takes to train your model and also helps regularize it. By standardizing the inputs to each layer, you reduce the "moving target" problem, where every layer has to keep adapting to the shifting distribution of its inputs, so your learning algorithm can perform closer to its best.
My advice would be to include batch normalization layers in your code if you have a deep neural network. As a reminder, you should probably include some Dropout in your layers as well.
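For concreteness, here is a minimal sketch of what I mean in PyTorch. The layer widths, dropout rate, and two-class output are made-up numbers; the point is just where the BatchNorm and Dropout layers sit:

```python
import torch.nn as nn

# Hypothetical deep classifier head: BatchNorm after each hidden layer, plus some Dropout.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.BatchNorm1d(256),   # keeps the inputs to the next layer standardized
    nn.ReLU(),
    nn.Dropout(p=0.3),     # extra regularization on top of batch norm
    nn.Linear(256, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(256, 2),     # e.g. face / no-face
)
```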
Let me know if this helps!
Yes, it works for the smaller batch size; it will even work with the smallest possible size you set.
The trick is that the batch size itself also adds to the regularization effect, not only the batch norm.
I will show you a few pictures:
The plots are on the same scale, tracking the batch loss. The left-hand side is a model without the batch norm layer (black), the right-hand side is with the batch norm layer.
Note how the regularization effect is evident even for bs=10.
When we set bs=64, the regularization of the batch loss is very evident. Note the y scale is always [0, 4].
My examination was purely on nn.BatchNorm1d(10, affine=False), i.e. without the learnable parameters gamma and beta (w and b).
This is why it still makes sense to use the BatchNorm layer even when you have a low batch size.
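For reference, the kind of comparison I describe can be reproduced with a toy script like the one below. Only the nn.BatchNorm1d(10, affine=False) part matches my actual test; the data, model, and optimizer here are placeholders:

```python
import torch
import torch.nn as nn

def make_model(use_bn):
    layers = [nn.Linear(20, 10)]
    if use_bn:
        # no learnable gamma/beta, exactly as in the experiment above
        layers.append(nn.BatchNorm1d(10, affine=False))
    layers += [nn.ReLU(), nn.Linear(10, 1)]
    return nn.Sequential(*layers)

def train(model, batch_size, steps=200):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    losses = []
    for _ in range(steps):
        x = torch.randn(batch_size, 20)
        y = x.sum(dim=1, keepdim=True)   # toy regression target
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

for bs in (10, 64):
    for use_bn in (False, True):
        final = train(make_model(use_bn), bs)[-1]
        print(f"bs={bs:2d} batch_norm={use_bn}: final batch loss {final:.3f}")
```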
My understanding is that we use padding when we convolve because convolving with filters shrinks the spatial dimensions of the output, and also loses information from the edges/corners of the input matrix. However, we also use a pooling layer after a number of conv layers in order to downsample our feature maps. Doesn't this seem somewhat contradictory? We use padding because we do NOT want to reduce the spatial dimensions, but we later use pooling to reduce the spatial dimensions. Could someone provide some intuition behind these two?
Without loss of generality, assume we are dealing with images as inputs. The reason for padding is not only to keep the dimensions from shrinking; it is also to ensure that input pixels on the corners and edges of the input are not "disadvantaged" in affecting the output. Without padding, a pixel on the corner of an image overlaps with just one filter region, while a pixel in the middle of the image overlaps with many filter regions. Hence, the pixel in the middle affects more units in the next layer and therefore has a greater impact on the output.

Secondly, you actually do want to shrink the dimensions of your input (remember, deep learning is largely about compression, i.e. finding low-dimensional representations of the input that disentangle the factors of variation in your data). The shrinking induced by convolutions with no padding is not ideal: in a really deep net you would quickly end up with very low-dimensional representations that lose most of the relevant information in the data. Instead you want to shrink your dimensions in a smart way, which is achieved by pooling.

In particular, max pooling has been found to work well. This is really an empirical result, i.e. there isn't a lot of theory to explain why this is the case. You could imagine that by taking the max over nearby activations, you still retain the information about the presence of a particular feature in this region, while losing information about its exact location. This can be good or bad: good because it buys you translation invariance, and bad because the exact location may be relevant for your problem.
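A quick shape check in PyTorch makes both halves of this concrete (the sizes are arbitrary): "same" padding keeps the spatial dimensions through the convolution, while pooling is where we shrink them on purpose.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # a toy 32x32 RGB image

conv_no_pad = nn.Conv2d(3, 16, kernel_size=3)             # no padding: output shrinks
conv_pad    = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # "same" padding: size preserved
pool        = nn.MaxPool2d(kernel_size=2)                 # deliberate downsampling

print(conv_no_pad(x).shape)     # torch.Size([1, 16, 30, 30]) - accidental shrink
print(conv_pad(x).shape)        # torch.Size([1, 16, 32, 32]) - padding preserves size
print(pool(conv_pad(x)).shape)  # torch.Size([1, 16, 16, 16]) - pooling shrinks on purpose
```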
I recently found the "global_pooling" flag in the Pooling layer in Caffe, but I was unable to find anything about it in the documentation here (Layer Catalogue)
nor here (Pooling doxygen doc).
Is there an easy, straightforward explanation of this in comparison to the normal Pooling layer behaviour?
Global pooling reduces the dimensionality from 3D to 1D. Therefore global pooling outputs one response for every feature map. This can be the maximum or the average or whatever other pooling operation you use.
It is often used at the end of the convolutional part of a network to get a shape that works with dense layers, so no flatten has to be applied.
Convolutions can work on any image input size (as long as it is big enough). However, if you have a fully connected layer at the end, this layer needs a fixed input size. Hence the complete network needs a fixed image input size.
However, you can remove the fully connected layer and just work with convolutional layers. You can make a convolutional layer at the end which has the same number of filters as you have classes. But you want one value for each class which indicates the probability of that class. Hence you apply a pooling filter over the complete remaining feature map. This pooling is hence "global" as it always is as big as necessary. In contrast, usual pooling layers have a fixed size (e.g. of 2x2 or 3x3).
This is a general concept. You can also find global pooling in other libraries, e.g. Lasagne. If you want a good reference in literature, I recommend reading Network In Network.
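For illustration, here is the same idea sketched in PyTorch rather than Caffe: a hypothetical classification head with one final conv filter per class, followed by a global average pool (all sizes are made up):

```python
import torch
import torch.nn as nn

num_classes = 10

head = nn.Sequential(
    # final conv produces one feature map per class
    nn.Conv2d(512, num_classes, kernel_size=1),
    # global pooling: collapse whatever spatial size is left down to 1x1
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # -> (batch, num_classes), ready for softmax / cross-entropy
)

for size in (7, 14):                 # works for any remaining feature-map size
    feats = torch.randn(2, 512, size, size)
    print(head(feats).shape)         # torch.Size([2, 10]) in both cases
```

Because the pooling window adapts to whatever spatial size is left, the same head works for different input image sizes.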
We get only one value from the entire feature map when we apply a GP layer, whose kernel size is the h×w of the feature map. GP layers are used to reduce the spatial dimensions of a three-dimensional feature map, but they perform a more extreme type of dimensionality reduction than ordinary pooling: a feature map with dimensions h×w×d is reduced to dimensions 1×1×d. A global average pooling layer reduces each h×w feature map to a single number by simply taking the average of all h×w values.
If you are looking for information regarding flags/parameters of Caffe, it is best to look them up in the comments of '$CAFFE_ROOT/src/caffe/proto/caffe.proto'.
For 'global_pooling' parameter the comment says:
// If global_pooling then it will pool over the size of the bottom by doing
// kernel_h = bottom->height and kernel_w = bottom->width
For more information about Caffe layers, see these help pages.
I'm trying to make an implementation of Gaussian blur for a school project.
I need to make both a CPU and a GPU implementation to compare performance.
I am not quite sure that I understand how Gaussian blur works, so one of my questions is
whether I have understood it correctly.
Here's what I do now:
I use the equation from Wikipedia http://en.wikipedia.org/wiki/Gaussian_blur to calculate
the filter.
For 2D, I take the RGB of each pixel in the image and apply the filter to it by
multiplying the RGB of the pixel and the surrounding pixels with the associated filter positions.
These are then summed to become the new RGB values of the pixel.
For 1D, I apply the filter first horizontally and then vertically, which should give
the same result if I understand things correctly.
Is this result exactly the same as when the 2D filter is applied?
Another question I have is about how the algorithm can be optimized.
I have read that the Fast Fourier Transform is applicable to Gaussian blur.
But I can't figure out how to relate it.
Can someone give me a hint in the right direction?
Thanks.
Yes, the 2D Gaussian kernel is separable so you can just apply it as two 1D kernels. Note that you can't apply these operations "in place" however - you need at least one temporary buffer to store the result of the first 1D pass.
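As a sketch of what those two 1D passes look like in plain Python/NumPy (the sigma and kernel radius are arbitrary choices, and np.convolve's default zero-padding handles the borders):

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """1D Gaussian from the Wikipedia formula, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def gaussian_blur_separable(image, sigma=2.0):
    """Blur an HxWx3 float image with two 1D passes (horizontal, then vertical)."""
    k = gaussian_kernel_1d(sigma, radius=int(3 * sigma))
    tmp = np.empty_like(image)  # temporary buffer: the passes can't be done in place
    out = np.empty_like(image)
    for c in range(image.shape[2]):
        # horizontal pass: convolve each row
        for y in range(image.shape[0]):
            tmp[y, :, c] = np.convolve(image[y, :, c], k, mode='same')
        # vertical pass: convolve each column of the intermediate result
        for x in range(image.shape[1]):
            out[:, x, c] = np.convolve(tmp[:, x, c], k, mode='same')
    return out

blurred = gaussian_blur_separable(np.random.rand(64, 64, 3))
```

Up to how the image borders are handled, this gives the same result as convolving with the full 2D kernel, but with roughly 2k multiplications per pixel instead of k².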
FFT-based convolution is a useful optimisation when you have large kernels - this applies to any kind of filter, not just Gaussian. Just how big "large" is depends on your architecture, but you probably don't want to worry about using an FFT-based approach for anything smaller than, say, a 49x49 kernel. The general approach is:
FFT the image
FFT the kernel, padded to the size of the image
multiply the two in the frequency domain (equivalent to convolution in the spatial domain)
IFFT (inverse FFT) the result
Note that if you're applying the same filter to more than one image then you only need to FFT the padded kernel once. You still have at least two FFTs to perform per image though (one forward and one inverse), which is why this technique only becomes a computational win for large-ish kernels.
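In NumPy those steps look roughly like this for a single-channel image (this sketch uses a circular boundary and centres the kernel at the origin by rolling it; treat it as an illustration rather than production code):

```python
import numpy as np

def fft_convolve(image, kernel):
    """Convolve a 2D image with a 2D kernel via the FFT (circular boundary)."""
    h, w = image.shape
    kh, kw = kernel.shape
    # pad the kernel to the image size, with its centre moved to (0, 0)
    padded = np.zeros((h, w))
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    # FFT both, multiply in the frequency domain, then inverse FFT
    K = np.fft.rfft2(padded)  # reusable across images if the filter doesn't change
    return np.fft.irfft2(np.fft.rfft2(image) * K, s=(h, w))

image = np.random.rand(256, 256)
x = np.arange(-24, 25)
g = np.exp(-x**2 / (2 * 8.0**2))
g /= g.sum()
kernel = np.outer(g, g)  # a 49x49 Gaussian, large enough for the FFT route to pay off
result = fft_convolve(image, kernel)
```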
I have successfully written some CUDA FFT code that does a 2D convolution of an image, as well as some other calculations.
How do I go about figuring out what the largest FFTs I can run are? It seems that a plan for a 2D R2C convolution takes 2x the image size, and another 2x the image size for the C2R. This seems like a lot of overhead!
Also, it seems like most of the benchmarks and such are for relatively small FFTs... why is this? It seems like for large images, I am going to quickly run out of memory. How is this typically handled? Can I perform an FFT convolution on a tile of an image and combine those results, and expect it to be the same as if I had run a 2D FFT on the entire image?
Thanks for answering these questions
CUFFT plans a different algorithm depending on your image size. If the transform can't fit in shared memory and the size is not a power of 2, then CUFFT plans an out-of-place transform, while smaller images with the right sizes will be more amenable to the software.
If you're set on FFTing the whole image and need to see what your GPU can handle, my best answer would be to guess and check with different image sizes, as the CUFFT planning is complicated.
See the documentation: http://developer.download.nvidia.com/compute/cuda/1_1/CUFFT_Library_1.1.pdf
I agree with Mark and say that tiling the image is the way to go for convolution. Since convolution amounts to just computing many independent integrals you can simply decompose the domain into its constituent parts, compute those independently, and stitch them back together. The FFT convolution trick simply reduces the complexity of the integrals you need to compute.
I expect that your GPU code should outperform MATLAB by a large factor in all situations unless you do something weird.
It's not usually practical to run an FFT on an entire image. Not only does it take a lot of memory, but the FFT is most efficient when the width and height are powers of 2 (or at least products of small primes), which places an awkward constraint on your input.
Cutting the image into tiles is perfectly reasonable. The size of the tiles will determine the frequency resolution you're able to achieve. You may want to overlap the tiles as well.
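One standard formalization of that tiling approach is overlap-add convolution: split the image into blocks, FFT-convolve each block with the kernel, and sum the overlapping pieces back together. The result matches convolving the whole image at once, up to floating-point error. If SciPy (1.4 or newer) is available, you can check this directly:

```python
import numpy as np
from scipy.signal import fftconvolve, oaconvolve

image = np.random.rand(512, 512)
x = np.arange(-24, 25)
g = np.exp(-x**2 / (2 * 8.0**2))
g /= g.sum()
kernel = np.outer(g, g)  # 49x49 Gaussian

full  = fftconvolve(image, kernel, mode='same')  # one big FFT over the whole image
tiled = oaconvolve(image, kernel, mode='same')   # overlap-add: many small block FFTs

print(np.allclose(full, tiled))  # True, up to floating-point error
```

Because the block FFTs are only a little larger than the kernel, the FFT working memory stays small no matter how big the image is.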