Backward pass on convolutional layer with 3-channel images

I know that deconvolution is basically convolution of the output with flipped filters, and I have implemented it for 2D data. But I am not able to generalize it to 3D data. For example, consider an input of dimension 3x5x5, a filter of dimension 3x3x3, and a stride of 1. So the output will be of dimension 1x3x3. What I don't understand is how to calculate the deconvolution for this output. The flipped filter will again be of dimension 3x3x3, and the output of the convolution is of dimension 1x3x3, which are incompatible for convolution. So how can we calculate the deconvolution?

Perhaps this post will help you out a bit. You are correct that a filter of the same size cannot be slid over the 1x3x3 output as it stands. To remedy that, the 1x3x3 output is padded (with zeros, mean values, nearest-neighbour values, etc.) until it is large enough. Depth can be handled the same way. In your example, you want a 3x3x3 filter to 'deconvolve' the 1x3x3 output into a 3x5x5 result, so we pad the 1x3x3 out to 5x7x7 (with whichever method you prefer) and apply the filter. There are definite drawbacks to this process, stemming from the fact that you are trying to extrapolate more information from less.
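For the backward pass specifically, a minimal NumPy sketch may make the shapes concrete. It assumes a single 3x3x3 filter, stride 1, no padding in the forward pass, and a forward pass implemented as cross-correlation; the function names are made up for illustration. With zero padding, each input channel's gradient is the padded 7x7 output gradient correlated with the flipped 3x3 slice of the filter for that channel, which gives back a 3x5x5 array.

```python
import numpy as np

def conv_forward(x, w):
    """Valid cross-correlation of x (C, H, W) with one filter w (C, k, k), stride 1."""
    k = w.shape[1]
    out = np.zeros((x.shape[1] - k + 1, x.shape[2] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + k, j:j + k] * w)
    return out

def conv_backward_input(dout, w):
    """Gradient w.r.t. the input, given dout (the gradient w.r.t. the 2D output)."""
    channels, k, _ = w.shape
    padded = np.pad(dout, k - 1)          # zero padding: 3x3 -> 7x7 when k = 3
    h = padded.shape[0] - k + 1
    wd = padded.shape[1] - k + 1
    dx = np.zeros((channels, h, wd))
    for c in range(channels):
        flipped = w[c, ::-1, ::-1]        # flip this channel's slice in both spatial axes
        for i in range(h):
            for j in range(wd):
                dx[c, i, j] = np.sum(padded[i:i + k, j:j + k] * flipped)
    return dx

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))
w = rng.standard_normal((3, 3, 3))
out = conv_forward(x, w)                  # shape (3, 3)
dout = rng.standard_normal(out.shape)     # stand-in for the upstream gradient
dx = conv_backward_input(dout, w)         # shape (3, 5, 5), same as x
```

With more than one filter, the per-filter results would be summed; that case is left out here to keep the sketch short.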


What if the filter window size is an even number in Gaussian filtering?

I know people usually prefer an odd number as the size of a Gaussian filter, since the image is made of discrete pixels and we can then always locate the central pixel.
But what if the size is an even number? This leads to several questions:
How should the Gaussian filter look: should it be symmetric or asymmetric?
What if the size equals 2?
Thank you.
There really is no such choice to be made.
A Gaussian filtering kernel that is shifted will result in a smoothing + a shift of the image. If you want a filter that doesn’t shift the image, the filter must have the origin of the Gaussian at the origin of the kernel, typically the central pixel of an odd-sized kernel.
Once we have established that, using an even-sized filter must lead to an asymmetrical kernel. It is not really desirable to have an asymmetrical smoothing filter (unless we’re talking about adaptive filtering) because the asymmetry introduces a bias.
So, we’re stuck with an odd-sized filter. An even-sized filter will introduce either a bias or a shift of half a pixel.
A 2-pixel kernel cannot be a Gaussian filter because it takes at least 5 samples to represent a Gaussian kernel with sufficient detail for it to present the positive aspects of the Gaussian filter. With fewer samples, the filter will not behave like a Gaussian filter.
For more information about Gaussian filtering, I recommend that you read this blog post that I wrote 10 years ago.
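The half-pixel shift can be checked numerically. The sketch below is only an illustration under stated assumptions: the kernel is sampled symmetrically about the middle of its window, and the kernel origin is taken to be the pixel at index size // 2 (a common convention, but a convention nonetheless). The helper names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    # Sample a Gaussian symmetrically about the middle of the window, then normalise.
    positions = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-positions**2 / (2.0 * sigma**2))
    return kernel / kernel.sum()

def shift_in_pixels(kernel):
    # Centre of mass of the weights, measured from the assumed origin pixel (size // 2).
    centroid = np.sum(np.arange(kernel.size) * kernel)
    return centroid - kernel.size // 2

odd_shift = shift_in_pixels(gaussian_kernel(5, 1.0))    # centred on a pixel
even_shift = shift_in_pixels(gaussian_kernel(4, 1.0))   # centred between two pixels
```

For the 5-sample kernel the centre of mass sits on the origin pixel, while for the 4-sample kernel it sits half a pixel away from it, whichever of the two middle pixels is picked as the origin.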

Pooling Layer vs. Using Padding in Convolutional Layers

My understanding is that we use padding when we convolve because convolving with filters shrinks the output and also loses information from the edges and corners of the input matrix. However, we also use a pooling layer after a number of conv layers in order to downsample our feature maps. Doesn't this seem contradictory? We use padding because we do NOT want to reduce the spatial dimensions, but we later use pooling to reduce the spatial dimensions. Could someone provide some intuition behind these two?
Without loss of generality, assume we are dealing with images as inputs. The reason behind padding is not only to keep the dimensions from shrinking; it is also to ensure that input pixels on the corners and edges of the input are not "disadvantaged" in affecting the output. Without padding, a pixel on the corner of an image overlaps with just one filter region, while a pixel in the middle of the image overlaps with many filter regions. Hence, the pixel in the middle affects more units in the next layer and therefore has a greater impact on the output.
Secondly, you actually do want to shrink the dimensions of your input (remember, deep learning is all about compression, i.e. finding low-dimensional representations of the input that disentangle the factors of variation in your data). The shrinking induced by convolutions with no padding is not ideal, and if you have a really deep net you would quickly end up with very low-dimensional representations that lose most of the relevant information in the data. Instead you want to shrink your dimensions in a smart way, which is achieved by pooling.
In particular, max pooling has been found to work well. This is really an empirical result, i.e. there isn't a lot of theory to explain why this is the case. You could imagine that by taking the max over nearby activations, you still retain the information about the presence of a particular feature in this region, while losing information about its exact location. This can be good or bad: good because it buys you translation invariance, and bad because the exact location may be relevant for your problem.
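Both points can be illustrated with a small NumPy sketch. It assumes a 5x5 input, a 3x3 filter with stride 1, and non-overlapping 2x2 max pooling; these sizes and the function names are chosen only for illustration.

```python
import numpy as np

def coverage(h, w, k, pad=0):
    """Count how many k x k filter windows (stride 1) touch each input pixel."""
    counts = np.zeros((h + 2 * pad, w + 2 * pad), dtype=int)
    for i in range(counts.shape[0] - k + 1):
        for j in range(counts.shape[1] - k + 1):
            counts[i:i + k, j:j + k] += 1
    return counts[pad:pad + h, pad:pad + w]   # drop the padding ring again

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling; odd trailing rows/columns are cropped."""
    h, w = fmap.shape
    cropped = fmap[:h - h % 2, :w - w % 2]
    return cropped.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

no_pad = coverage(5, 5, 3)            # corner pixel is seen by 1 window, centre by 9
same_pad = coverage(5, 5, 3, pad=1)   # corner pixel is now seen by 4 windows

# The same feature at two nearby positions gives the same pooled map.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0
same_after_pooling = np.array_equal(max_pool_2x2(a), max_pool_2x2(b))
```

Padding narrows the gap between corner and centre pixels but does not remove it, and the pooled maps for a and b are identical even though the feature moved by one pixel, which is the loss of exact location mentioned above.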

How to choose the window size of CNN in deep learning?

In a Convolutional Neural Network (CNN), a filter is selected for weight sharing. For example, in the following pictures, a 3x3 window with stride (distance between adjacent neurons) 1 is chosen.
So my question is: How to choose the window size? If I use 4x4 with the stride being 2, how much difference will it cause? Thanks a lot in advance!
There's no definite answer to this: filter size is one of the hyperparameters you generally need to tune. However, there are some useful observations that may help you. It's often preferred to choose smaller filters, but to have a greater number of them.
Example: four 5x5 filters have 100 parameters (ignoring bias), while ten 3x3 filters have 90 parameters. Through the larger number of filters you can still capture the variety of features in the image, but with fewer parameters. More on this here.
Modern CNNs go even further with this idea and choose consecutive 3x1 and 1x3 convolutional layers. This reduces the number of parameters even more, but doesn't affect the performance. See the evolution of the Inception network.
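The parameter counts above are simple arithmetic; a short sketch, assuming a single input channel and no bias terms (the helper name is hypothetical):

```python
def n_params(kernel_h, kernel_w, n_filters=1, in_channels=1):
    # Weights only, bias ignored: one kernel_h x kernel_w window per input channel per filter.
    return kernel_h * kernel_w * in_channels * n_filters

four_5x5 = n_params(5, 5, n_filters=4)          # 100
ten_3x3 = n_params(3, 3, n_filters=10)          # 90
full_3x3 = n_params(3, 3)                       # 9
factorised = n_params(3, 1) + n_params(1, 3)    # 6, for a 3x1 layer followed by a 1x3 layer
```

With more input channels the counts scale by in_channels, but the ratio between the options stays the same.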
The choice of stride is also important, since it affects the tensor shape after the convolution and hence the whole network. The general rule is to use stride=1 in usual convolutions and preserve the spatial size with padding, and to use stride=2 when you want to downsample the image.
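To see how much difference a 4x4 window with stride 2 makes compared with a 3x3 window with stride 1, the usual output-size formula floor((n + 2p - k) / s) + 1 can be evaluated directly. The 32-pixel input width and the padding values below are assumptions chosen for illustration.

```python
def conv_output_size(n, k, stride=1, pad=0):
    # Output width for input width n, kernel width k, given stride and symmetric padding.
    return (n + 2 * pad - k) // stride + 1

same_size = conv_output_size(32, 3, stride=1, pad=1)        # 32: spatial size preserved
halved_4x4 = conv_output_size(32, 4, stride=2, pad=1)       # 16: downsampled by 2
halved_3x3 = conv_output_size(32, 3, stride=2, pad=1)       # 16: downsampled by 2
unpadded_4x4 = conv_output_size(32, 4, stride=2, pad=0)     # 15
```

So the stride, not the window size, is what roughly halves the feature map; the window size mainly changes the parameter count and the receptive field.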

Can Keras deal with input images with different size?

Can the Keras deal with input images with different size? For example, in the fully convolutional neural network, the input images can have any size. However, we need to specify the input shape when we create a network by Keras. Therefore, how can we use Keras to deal with different input size without resizing the input images to the same size? Thanks for any help.
Just change your input shape to shape=(n_channels, None, None),
where n_channels is the number of channels in your input image.
I'm using the Theano backend though, so if you are using TensorFlow you might have to change it to (None, None, n_channels).
You should use:
input_shape=(1, None, None)
None in a shape denotes a variable dimension. Note that not all layers will work with such variable dimensions, since some layers require shape information (such as Flatten).
For example, using Keras's functional API (with channels-first ordering) your input layer would be:
from keras.layers import Input
For an RGB dataset:
inp = Input(shape=(3, None, None))
For a grayscale dataset:
inp = Input(shape=(1, None, None))
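Why a layer such as Flatten needs shape information can be seen without Keras at all. The NumPy sketch below uses two made-up channels-first feature maps of different spatial size; the per-channel average shown next to it is not from the answers above, only one commonly used way to get a fixed-length vector, and the function names are hypothetical.

```python
import numpy as np

def flatten_features(fmap):
    # What a flatten step does: the length depends on the spatial size.
    return fmap.reshape(-1)

def global_average(fmap):
    # Average over the spatial axes of a (channels, height, width) map.
    return fmap.mean(axis=(1, 2))

small = np.ones((8, 10, 10))
large = np.ones((8, 20, 30))

flat_sizes = (flatten_features(small).size, flatten_features(large).size)   # (800, 4800)
pooled_sizes = (global_average(small).size, global_average(large).size)     # (8, 8)
```

A dense layer placed after the flatten step would need a weight matrix with a different number of rows for each image size, which is why the shape has to be known in advance there.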
Implementing arbitrarily sized input arrays with the same computational kernels can pose many challenges: on a GPU, for example, you need to know how large a buffer to reserve and, less critically, how much to unroll your loops. This is the main reason that Keras requires constant input shapes; variable-sized inputs are too painful to deal with.
This more commonly occurs when processing variable-length sequences like sentences in NLP. The common approach is to establish an upper bound on the size (and crop longer sequences), and then pad the sequences with zeros up to this size.
(You could also include masking on zero values to skip computations on the padded areas, except that the convolutional layers in Keras might still not support masked inputs...)
I'm not sure whether the overhead of padding is prohibitive for 3D data structures. If you start getting memory errors, the easiest workaround is to reduce the batch size. Let us know about your experience with applying this trick on images!
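A minimal NumPy sketch of the zero-padding idea applied to images, under a few assumptions: images are channels-first arrays with the same number of channels, the upper bound is simply the largest height and width in the batch (so nothing is cropped), and the function name pad_batch is hypothetical. The boolean mask records which pixels are real data, in case a later step wants to ignore the padded area.

```python
import numpy as np

def pad_batch(images):
    """Zero-pad a list of (C, H, W) arrays to a common size and stack them."""
    channels = images[0].shape[0]
    max_h = max(im.shape[1] for im in images)
    max_w = max(im.shape[2] for im in images)
    batch = np.zeros((len(images), channels, max_h, max_w), dtype=images[0].dtype)
    mask = np.zeros((len(images), max_h, max_w), dtype=bool)
    for n, im in enumerate(images):
        _, h, w = im.shape
        batch[n, :, :h, :w] = im       # image goes in the top-left corner
        mask[n, :h, :w] = True         # True where the pixels are real data
    return batch, mask

images = [np.ones((3, 4, 6)), np.ones((3, 5, 2))]
batch, mask = pad_batch(images)        # batch has shape (2, 3, 5, 6)
```

The memory cost is set by the largest image in each batch, so grouping images of similar size into the same batch would keep the padding overhead down.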
Just use None when specifying the input shape. But I still do not know how to pass different-shaped images into the fit function.

Gaussian blur and FFT

I'm trying to make an implementation of Gaussian blur for a school project.
I need to make both a CPU and a GPU implementation to compare performance.
I am not quite sure that I understand how Gaussian blur works, so one of my questions is whether I have understood it correctly.
Here's what I do now:
I use the equation from Wikipedia to calculate the filter.
For 2D, I take the RGB of each pixel in the image and apply the filter to it by multiplying the RGB of the pixel and of the surrounding pixels with the associated filter position. These are then summed to give the new pixel RGB values.
For 1D, I apply the filter first horizontally and then vertically, which should give the same result if I understand things correctly.
Is this result exactly the same as when the 2D filter is applied?
Another question I have is about how the algorithm can be optimized.
I have read that the Fast Fourier Transform is applicable to Gaussian blur, but I can't figure out how to relate it.
Can someone give me a hint in the right direction?
Yes, the 2D Gaussian kernel is separable so you can just apply it as two 1D kernels. Note that you can't apply these operations "in place" however - you need at least one temporary buffer to store the result of the first 1D pass.
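The equivalence can be checked numerically. Here is a minimal NumPy sketch, assuming a single-channel image (for RGB, each channel would be treated the same way), zero padding at the borders, and an odd kernel size; the function names are made up for illustration.

```python
import numpy as np

def gaussian_1d(radius, sigma):
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()                      # normalise so the weights sum to 1

def blur_2d(image, g):
    """Direct 2D blur with the full kernel outer(g, g) and zero padding."""
    r = g.size // 2
    kernel = np.outer(g, g)
    padded = np.pad(image, r)
    h, w = image.shape
    out = np.zeros((h, w))
    for a in range(g.size):
        for b in range(g.size):
            out += kernel[a, b] * padded[a:a + h, b:b + w]
    return out

def blur_separable(image, g):
    """Two 1D passes; tmp is the temporary buffer holding the horizontal pass."""
    tmp = np.apply_along_axis(np.convolve, 1, image, g, mode="same")
    return np.apply_along_axis(np.convolve, 0, tmp, g, mode="same")

rng = np.random.default_rng(0)
image = rng.random((16, 20))
g = gaussian_1d(radius=3, sigma=1.0)
max_difference = np.max(np.abs(blur_2d(image, g) - blur_separable(image, g)))
```

The two results agree up to floating-point rounding, while the separable version does 2 x 7 multiplications per pixel instead of 7 x 7 for this kernel size.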
FFT-based convolution is a useful optimisation when you have large kernels - this applies to any kind of filter, not just Gaussian. Just how big "large" is depends on your architecture, but you probably don't want to worry about using an FFT-based approach for anything smaller than, say, a 49x49 kernel. The general approach is:
FFT the image
FFT the kernel, padded to the size of the image
multiply the two in the frequency domain (equivalent to convolution in the spatial domain)
IFFT (inverse FFT) the result
Note that if you're applying the same filter to more than one image then you only need to FFT the padded kernel once. You still have at least two FFTs to perform per image though (one forward and one inverse), which is why this technique only becomes a computational win for large-ish kernels.
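The four steps above can be sketched with NumPy's FFT routines. This is only an illustration under stated assumptions: a single-channel image, an odd square kernel, and circular (wrap-around) borders, which is what plain frequency-domain multiplication gives; padding the image before the transform is a common way to avoid the wrap-around. The function name fft_blur is hypothetical.

```python
import numpy as np

def fft_blur(image, kernel):
    h, w = image.shape
    k = kernel.shape[0]
    r = k // 2
    # Pad the kernel to the image size, then roll it so its centre sits at index (0, 0);
    # without the roll the output would be shifted by r pixels.
    padded = np.zeros((h, w))
    padded[:k, :k] = kernel
    padded = np.roll(padded, (-r, -r), axis=(0, 1))
    # Multiply in the frequency domain and transform back.
    spectrum = np.fft.fft2(image) * np.fft.fft2(padded)
    return np.real(np.fft.ifft2(spectrum))

rng = np.random.default_rng(0)
image = rng.random((32, 32))
x = np.arange(-3, 4)
g = np.exp(-x**2 / 2.0)
g /= g.sum()
kernel = np.outer(g, g)                 # 7x7 Gaussian kernel, sigma = 1
blurred = fft_blur(image, kernel)
```

When the same kernel is applied to many images of the same size, np.fft.fft2(padded) could be computed once and reused, which is the saving described above.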
