What is Depth of a convolutional neural network? - machine-learning

I was taking a look at Convolutional Neural Network from CS231n Convolutional Neural Networks for Visual Recognition. In Convolutional Neural Network, the neurons are arranged in 3 dimensions(height, width, depth). I am having trouble with the depth of the CNN. I can't visualize what it is.
In the link they said The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
For example loook at this picture. Sorry if the image is too crappy.
I can grasp the idea that we take a small area off the image, then compare it with the "Filters". So the filters will be collection of small images? Also they said We will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron. So is the receptive field has the same dimension as the filters? Also what will be the depth here? And what do we signify using the depth of a CNN?
So, my question mainly is, if i take an image having dimension of [32*32*3] (Lets say i have 50000 of these images, making the dataset [50000*32*32*3]), what shall i choose as its depth and what would it mean by the depth. Also what will be the dimension of the filters?
Also it will be much helpful if anyone can provide some link that gives some intuition on this.
EDIT:
So in one part of the tutorial(Real-world example part), it says The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96].
Here we see the depth is 96. So is depth something that i choose arbitrarily? or something i compute? Also in the example above(Krizhevsky et al) they had 96 depths. So what does it mean by its 96 depths? Also the tutorial stated Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
So that means the depth will be like this? If so then can i assume Depth = Number of Filters?

In Deep Neural Networks the depth refers to how deep the network is but in this context, the depth is used for visual recognition and it translates to the 3rd dimension of an image.
In this case you have an image, and the size of this input is 32x32x3 which is (width, height, depth). The neural network should be able to learn based on this parameters as depth translates to the different channels of the training images.
UPDATE:
In each layer of your CNN it learns regularities about training images. In the very first layers, the regularities are curves and edges, then when you go deeper along the layers you start learning higher levels of regularities such as colors, shapes, objects etc. This is the basic idea, but there lots of technical details. Before going any further give this a shot : http://www.datarobot.com/blog/a-primer-on-deep-learning/
UPDATE 2:
Have a look at the first figure in the link you provided. It says 'In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).' It means that a ConvNet neuron transforms the input image by arranging its neurons in three dimeonsion.
As an answer to your question, depth corresponds to the different color channels of an image.
Moreover, about the filter depth. The tutorial states this.
Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
Which basically means that a filter is a smaller part of an image that moves around the depth of the image in order to learn the regularities in the image.
UPDATE 3:
For the real world example I just browsed the original paper and it says this : The first convolutional layer filters the 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4 pixels.
In the tutorial it refers the depth as the channel, but in real world you can design whatever dimension you like. After all that is your design
The tutorial aims to give you a glimpse of how ConvNets work in theory, but if I design a ConvNet nobody can stop me proposing one with a different depth.
Does this make any sense?

Depth of CONV layer is number of filters it is using.
Depth of a filter is equal to depth of image it is using as input.
For Example: Let's say you are using an image of 227*227*3.
Now suppose you are using a filter of size of 11*11(spatial size).
This 11*11 square will be slided along whole image to produce a single 2 dimensional array as a response. But in order to do so, it must cover every aspect inside of 11*11 area. Therefore depth of filter will be depth of image = 3.
Now suppose we have 96 such filter each producing different response. This will be depth of Convolutional layer. It is simply number of filters used.

I'm not sure why this is skimped over so heavily. I also had trouble understanding it at first, and very few outside of Andrej Karpathy (thanks d00d) have explained it. Although, in his writeup (http://cs231n.github.io/convolutional-networks/), he calculates the depth of the output volume using a different example than in the animation.
Start by reading the section titled 'Numpy examples'
Here, we go through iteratively.
In this case we have an 11x11x4. (why we start with 4 is kind of peculiar, as it would be easier to grasp with a depth of 3)
Really pay attention to this line:
A depth column (or a fibre) at position (x,y) would be the activations
X[x,y,:].
A depth slice, or equivalently an activation map at depth d
would be the activations X[:,:,d].
V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V is your output volume. The zero'th index v[0] is your column - in this case V[0] = 0 this is the first column in your output volume.
V[1] = 0 this is the first row in your output volume. V[3]= 0 is the depth. This is the first output layer.
Now, here's where people get confused (at least I did). The input depth has absolutely nothing to do with your output depth. The input depth only has control of the filter depth. W in Andrej's example.
Aside: A lot of people wonder why 3 is the standard input depth. For color input images, this will always be 3 for plain ole images.
np.sum(X[:5,:5,:] * W0) + b0 (convolution 1)
Here, we are calculating elementwise between a weight vector W0 which is 5x5x4. 5x5 is an arbitrary choice. 4 is the depth since we need to match our input depth. The weight vector is your filter, kernel, receptive field or whatever obfuscated name people decide to call it down the road.
if you come at this from a non python background, that's maybe why there's more confusion since array slicing notation is non-intuitive. The calculation is a dot product of your first convolution size (5x5x4) of your image with the weight vector. The output is a single scalar value which takes the position of your first filter output matrix. Imagine a 4 x 4 matrix representing the sum product of each of these convolution operations across the entire input. Now stack them for each filter. That shall give you your output volume. In Andrej's writeup, he starts moving along the x axis. The y axis remains the same.
Here's an example of what V[:,:,0] would look like in terms of convolutions. Remember here, the third value of our index is the depth of your output layer
[result of convolution 1, result of convolution 2, ..., ...]
[..., ..., ..., ..., ...]
[..., ..., ..., ..., ...]
[..., ..., ..., result of convolution n]
The animation is best for understanding this, but Andrej decided to swap it with an example that doesn't match the calculation above.
This took me a while. Partly because numpy doesn't index the way Andrej does in his example, at least it didn't I played around with it. Also, there's some assumptions that the sum product operation is clear. That's the key to understand how your output layer is created, what each value represents and what the depth is.
Hopefully that helps!

Since the input volume when we are doing an image classification problem is N x N x 3. At the beginning it is not difficult to imagine what the depth will mean - just the number of channels - Red, Green, Blue. Ok, so the meaning for the first layer is clear. But what about the next ones? Here is how I try to visualize the idea.
On each layer we apply a set of filters which convolve around the input. Lets imagine that currently we are at the first layer and we convolve around a volume V of size N x N x 3. As #Semih Yagcioglu mentioned at the very beginning we are looking for some rough features: curves, edges etc... Lets say we apply N filters of equal size (3x3) with stride 1. Then each of these filters is looking for a different curve or edge while convolving around V. Of course, the filter has the same depth, we want to supply the whole information not just the grayscale representation.
Now, if M filters will look for M different curves or edges. And each of these filters will produce a feature map consisting of scalars (the meaning of the scalar is the filter saying: The probability of having this curve here is X%). When we convolve with the same filter around the Volume we obtain this map of scalars telling us where where exactly we saw the curve.
Then comes feature map stacking. Imagine stacking as the following thing. We have information about where each filter detected a certain curve. Nice, then when we stack them we obtain information about what curves / edges are available at each small part of our input volume. And this is the output of our first convolutional layer.
It is easy to grasp the idea behind non-linearity when taking into account 3. When we apply the ReLU function on some feature map, we say: Remove all negative probabilities for curves or edges at this location. And this certainly makes sense.
Then the input for the next layer will be a Volume $V_1$ carrying info about different curves and edges at different spatial locations (Remember: Each layer Carries info about 1 curve or edge).
This means that the next layer will be able to extract information about more sophisticated shapes by combining these curves and edges. To combine them, again, the filters should have the same depth as the input volume.
From time to time we apply Pooling. The meaning is exactly to shrink the volume. Since when we use strides = 1, we usually look at a pixel (neuron) too many times for the same feature.
Hope this makes sense. Look at the amazing graphs provided by the famous CS231 course to check how exactly the probability for each feature at a certain location is computed.

In simple terms, it can explain as below,
Let's say you have 10 filters where each filter is the size of 5x5x3. What does this mean? the depth of this layer is 10 which is equal to the number of filters. Size of each filter can be defined as we want e.g., 5x5x3 in this case where 3 is the depth of the previous layer. To be precise, depth of each filer in the next layer should be 10 ( nxnx10) where n can be defined as you want like 5 or something else. Hope will make everything clear.

The first thing you need to note is
receptive field of a neuron is 3D
ie If the receptive field is 5x5 the neuron will be connected to 5x5x(input depth) number of points. So whatever be your input depth, one layer of neurons will only develop 1 layer of output.
Now, the next thing to note is
depth of output layer = depth of conv. layer
ie The output volume is independent of the input volume, and it only depends on the number filters(depth). This should be pretty obvious from the previous point.
Note that the number of filters (depth of the cnn layer) is a hyper parameter. You can take it whatever you want, independent of image depth. Each filter has it's own set of weights enabling it to learn a different feature on the same local region covered by the filter.

The depth of the network is the number of layers in the network. In the Krizhevsky paper, the depth is 9 layers (modulo a fencepost issue with how layers are counted?).

If you are referring to the depth of the filter (I came to this question searching for that) then this diagram of LeNet is illustrating
Source http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
How to create such a filter; Well in python like https://github.com/alexcpn/cnn_in_python/blob/main/main.py#L19-L27
Which will give you a list of numpy arrays and length of the list is the depth
Example in the code above,but adding a depth of 3 for color (RGB), the below is the network. The first Convolutional layer is a filter of shape (5,5,3) and depth 6
Input (R,G,B)= [32.32.3] *(5.5.3)*6 == [28.28.6] * (5.5.6)*1 = [24.24.1] * (5.5.1)*16 = [20.20.16] *
FC layer 1 (20, 120, 16) * FC layer 2 (120, 1) * FC layer 3 (20, 10) * Softmax (10,) =(10,1) = Output
In Pytorch
np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})
# Generate a random image
image_size = 32
image_depth = 3
image = np.random.rand(image_size, image_size)
# to mimic RGB channel
image = np.stack([image,image,image], axis=image_depth-1) # 0 to 2
image = np.moveaxis(image, [2, 0], [0, 2])
print("Image Shape=",image.shape)
input_tensor = torch.from_numpy(image)
m = nn.Conv2d(in_channels=3,out_channels=6,kernel_size=5,stride=1)
output = m(input_tensor.float())
print("Output Shape=",output.shape)
Image Shape= (3, 32, 32)
Output Shape= torch.Size([6, 28, 28])

Related

How do I quantify the similarity between spatial patterns

Problem Formulation
Suppose I have several 10000*10000 grids (can be transformed to 10000*10000 grayscale images. I would regard image and grid as the same below), and at each grid-point, there is some value (in my case it's the number of copies of a specific gene expressed at that pixel location, note that the locations are same for every grid). What I want is to quantify the similarity between two 2D spatial point-patterns of this kind (i.e., the spatial expression patterns of two distinct genes), and rank all pairs of genes in a "most similar" to "most dissimilar" manner. Note that it is not the spatial pattern in terms of the absolute value of expression level that I care about, rather, it's the relative pattern that I care about. As a result, I might need to utilize some correlation instead of distance metrics when comparing corresponding pixels.
The easiest method might be directly viewing all pixels together as a vector and calculate some correlation metric between the two vectors. However, this does not take the spatial information into account. Those genes that I am most interested in have spatial patterns, i.e., clustering and autocorrelation effects their expression pattern (though their "cluster" might take a very thin shape rather than sticking together, e.g., genes specific to the skin cells), which means usually the image would have several peak local regions, while expression levels at other pixels would be extremely low (near 0).
Possible Directions
I am not exactly sure if I should (1) consider applying image similarity comparison algorithms from image processing that take local structure similarity into account (e.g., SSIM, SIFT, as outlined in Simple and fast method to compare images for similarity), or (2) consider applying spatial similarity comparison algorithms from spatial statistics in GIS (there are some papers about this, but I am not sure if there are some algorithms dealing with simple point data rather than the normal region data with shape (in a more GIS-sense way, I need to find an algorithm dealing with raster data rather than polygon data)), or (3) consider directly applying statistical methods that deal with discrete 2D distributions, which I think might be a bit crude (seems to disregard the regional clustering/autocorrelation effects, ~ Tobler's First Law of Geography).
For direction (1), I am thinking about a simple method, that is, first find some "peak" regions in the two images respectively and regard their union as ROIs, and then compare those ROIs in the two images specifically in a simple pixel-by-pixel way (regard them together as a vector), but I am not sure if I can replace the distance metrics with correlation metrics, and am a bit worried that many methods of similarity comparison in image processing might not work well when the two images are dissimilar. For direction (2), I think this direction might be more appropriate because this problem is indeed related to spatial statistics, but I do not yet know where to start in GIS. I guess direction (3) is somewhat masked by (2), so I might not consider it here.
Sample
Sample image: (There are some issues w/ my own data, so here I borrowed an image from SpatialLIBD http://research.libd.org/spatialLIBD/reference/sce_image_grid_gene.html)
Let's say the value at each pixel is discretely valued between 0 and 10 (could be scaled to [0,1] if needed). The shapes of tissues in the right and left subfigure are a bit different, but in my case they are exactly the same.
PS: There is one might-be-serious problem regarding spatial statistics though. The expression of certain marker genes of a specific cell type might not be clustered in a bulk, but in the shape of a thin layer or irregularly. For example, if the grid is a section of the brain, then the high-expression peak region for cortex layer-specific genes (e.g., Ctip2 for layer V) might form a thin arc curved layer in the 10000*10000 grid.
UPDATE: I found a method belonging to the (3) direction called "optimal transport" problem that might be useful. Looks like it integrates locality information into the comparison of distribution. Would try to test this way (seems to be the easiest to code among all three directions?) tomorrow.
Any thoughts would be greatly appreciated!
In the absence of any sample image, I am assuming that your problem is similar to texture-pattern recognition.
We can start with Local Binary Patterns (2002), or LBPs for short. Unlike previous (1973) texture features that compute a global representation of texture based on the Gray Level Co-occurrence Matrix, LBPs instead compute a local representation of texture by comparing each pixel with its surrounding neighborhood of pixels. For each pixel in the image, we select a neighborhood of size r (to handle variable neighborhood sizes) surrounding the center pixel. A LBP value is then calculated for this center pixel and stored in the output 2D array with the same width and height as the input image. Then you can calculate a histogram of LBP codes (as final feature vector) and apply machine learning for classifications.
LBP implementations can be found in both the scikit-image and OpenCV but latter's implementation is strictly in the context of face recognition — the underlying LBP extractor is not exposed for raw LBP histogram computation. The scikit-image implementation of LBPs offer more control of the types of LBP histograms you want to generate. Furthermore, the scikit-image implementation also includes variants of LBPs that improve rotation and grayscale invariance.
Some starter code:
from skimage import feature
import numpy as np
from sklearn.svm import LinearSVC
from imutils import paths
import cv2
import os
class LocalBinaryPatterns:
def __init__(self, numPoints, radius):
# store the number of points and radius
self.numPoints = numPoints
self.radius = radius
def describe(self, image, eps=1e-7):
# compute the Local Binary Pattern representation
# of the image, and then use the LBP representation
# to build the histogram of patterns
lbp = feature.local_binary_pattern(image, self.numPoints,
self.radius, method="uniform")
(hist, _) = np.histogram(lbp.ravel(),
bins=np.arange(0, self.numPoints + 3),
range=(0, self.numPoints + 2))
# normalize the histogram
hist = hist.astype("float")
hist /= (hist.sum() + eps)
# return the histogram of Local Binary Patterns
return hist
# initialize the local binary patterns descriptor along with
# the data and label lists
desc = LocalBinaryPatterns(24, 8)
data = []
labels = []
# loop over the training images
for imagePath in paths.list_images(args["training"]):
# load the image, convert it to grayscale, and describe it
image = cv2.imread(imagePath)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
hist = desc.describe(gray)
# extract the label from the image path, then update the
# label and data lists
labels.append(imagePath.split(os.path.sep)[-2])
data.append(hist)
# train a Linear SVM on the data
model = LinearSVC(C=100.0, random_state=42)
model.fit(data, labels)
Once our Linear SVM is trained, we can use it to classify subsequent texture images:
# loop over the testing images
for imagePath in paths.list_images(args["testing"]):
# load the image, convert it to grayscale, describe it,
# and classify it
image = cv2.imread(imagePath)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
hist = desc.describe(gray)
prediction = model.predict(hist.reshape(1, -1))
# display the image and the prediction
cv2.putText(image, prediction[0], (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
1.0, (0, 0, 255), 3)
cv2.imshow("Image", image)
cv2.waitKey(0)
Have a look at this excellent tutorial for more details.
Ravi Kumar (2016) was able to extract more finely textured images by combining LBP with Gabor filters to filter the coefficients of LBP pattern

Understanding convolutional layers shapes

I've been reading about convolutional nets and I've programmed a few models myself. When I see visual diagrams of other models it shows each layer being smaller and deeper than the last ones. Layers have three dimensions like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes but I don't know what the depth is.
TLDR; 256x256x32 refers to the layer's output shape rather than the layer itself.
There are many articles and posts out there explaining how convolution layers work. I'll try to answer your question without going into too many details, just focusing on shapes.
Assuming you are working with 2D convolution layers, your input and output will both be three-dimensional. That is, without considering the batch which would correspond to a 4th axis... Therefore, the shape of a convolution layer input will be (c, h, w) (or (h, w, c) depending on the framework) where c is the number of channels, h is the width of the input and w the width. You can see it as a c-channel hxw image.
The most intuitive example of such input is the input of the first convolution layer of your convolutional neural network: most likely an image of size hxw with c channels for example c=1 for greyscale or c=3 for RGB...
What's important is that for all pixels of that input, the values on each channel gives additional information on that pixel. Having three channels will give each pixel ('pixel' as in position in the 2D input space) a richer content than having a single. Since each pixel will be encoded with three values (three channels) vs. a single one (one channel). This kind of intuition about what channels represent can be extrapolated to a higher number of channels. As we said an input can have c channels.
Now going back to convolution layers, here is a good visualization. Imagine having a 5x5 1-channel input. And a convolution layer consisting of a single 3x3 filter (i.e. kernel_size=3)
input
filter
convolution
output
shape
(1, 5, 5)
(3, 3)
(3,3)
representation
Now keep in mind the dimension of the output will depend on the stride and padding of the convolution layer. Here the shape of the output is the same as the shape of the filter, it does not necessarily have to be! Take an input shape of (1, 5, 5), with the same convolution settings, you would end up with a shape of (4, 4) (which is different from the filter shape (3, 3).
Also, something to note is that if the input had more than one channel: shape (c, h, w), the filter would have to have the same number of channels. Each channel of the input would convolve with each channel of the filter and the results would be averaged into a single 2D feature map. So you would have an intermediate output of (c, 3, 3), which after averaging over the channels, would leave us with (1, 3, 3)=(3, 3). As a result, considering a convolution with a single filter, however many input channels there are, the output will always have a single channel.
From there what you can do is assemble multiple filters on the same layer. This means you define your layer as having k 3x3 filters. So a layer consists k filters. For the computation of the output, the idea is simple: one filter gives a (3, 3) feature map, so k filters will give k (3, 3) feature maps. These maps are then stacked into what will be the channel dimension. Ultimately, you're left with an output shape of... (k, 3, 3).
Let k_h and k_w, be the kernel height and kernel width respectively. And h', w' the height and width of one outputted feature map:
input
layer
output
shape
(c, h, w)
(k, c, k_h, k_w)
(k, h', w')
description
c-channel hxw feature map
k filters of shape (c, k_h, k_w)
k-channel h'xw' feature map
Back to your question:
Layers have 3 dimensions like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes but I don't know what the depth is.
Convolution layers have four dimensions, but one of them is imposed by your input channel count. You can choose the size of your convolution kernel, and the number of filters. This number will determine is the number of channels of the output.
256x256 seems extremely high and you most likely correspond to the output shape of the feature map. On the other hand, 32 would be the number of channels of the output, which... as I tried to explain is the number of filters in that layer. Usually speaking the dimensions represented in visual diagrams for convolution networks correspond to the intermediate output shapes, not the layer shapes.
As an example, take the VGG neural network:
Very Deep Convolutional Networks for Large-Scale Image Recognition
Input shape for VGG is (3, 224, 224), knowing that the result of the first convolution has shape (64, 224, 224) you can determine there is a total of 64 filters in that layer.
As it turns out the kernel size in VGG is 3x3. So, here is a question for you: knowing there is a single bias parameter per filter, how many total parameters are in VGG's first convolution layer?
Sorry for the short answer, but when you have a digital image, you have 2 dimensions and then you often have 3 for the colors. The convolutional filter looks into parts of the picture with lower height/width dimensions and much more depth channels (in your case 32) to get more information. This is then fed into the neural network to learn.
I created the example in PyTorch to demonstrate the output you had:
import torch
import torch.nn as nn
bs=16
x = torch.randn(bs, 3, 256, 256)
c = nn.Conv2d(3,32,kernel_size=5,stride=1,padding=2)
out = c(x)
print(out.shape, out.shape[1])
Out:
torch.Size([16, 32, 256, 256]) 32
It's a real tensor inside. It may help.
You can play with a lot of convolution parameters.

How are the FCN heads convolved over RetinaNet's FPN features?

I've recently read the RetinaNet paper and I have yet to understood one minor detail:
We have the multi-scale feature maps obtained from the FPN (P2,...P7).
Then the two FCN heads (the classifier head and regessor head) are convolving each one of the feature maps.
However, each feature map has different spatial scale, so, how does the classifier head and regressor head maintain fixed output volumes, given all their convolution parameters are fix? (i.e. 3x3 filter with stride 1, etc).
Looking at this line at PyTorch's implementation of RetinaNet, I see the heads just convolve each feature and then all features are stacked somehow (the only common dimension between them is the Channel dimension which is 256, but spatially they are double from each other).
Would love to hear how are they combined, I wasn't able to understand that point.
After the convolution at each pyramid step, you reshape the outputs to be of shape (H*W, out_dim) (with out_dim being num_classes * num_anchors for the class head and 4 * num_anchors for the bbox regressor).
Finally, you can concatenate the resulting tensors along the H*W dimension, which is now possible because all the other dimensions match, and compute losses as you would on a network with a single feature layer.

How does the unpooling and deconvolution work in DeConvNet

I have been trying to understand how unpooling and deconvolution works in DeConvNets.
Unpooling
While during the unpooling stage, the activations are restored back to the locations of maximum activation selections, which makes sense, but what about the remaining activations? Do those remaining activations need to be restored as well or interpolated in some way or just filled as zeros in unpooled map.
Deconvolution
After the convolution section (i.e., Convolution layer, Relu, Pooling ), it is common to have more than one feature map output, which would be treated as input channels to successive layers ( Deconv.. ). How could these feature maps be combined together in order to achieve the activation map with same resolution as original input?
Unpooling
As etoropov wrote, you can read about unpooling in Visualizing and Understanding Convolutional Networks by Zeiler and Ferguson:
Unpooling: In the convnet, the max pooling operation
is non-invertible, however we can obtain an approximate
inverse by recording the locations of the
maxima within each pooling region in a set of switch
variables. In the deconvnet, the unpooling operation
uses these switches to place the reconstructions from
the layer above into appropriate locations, preserving
the structure of the stimulus. See Fig. 1(bottom) for
an illustration of the procedure.
Deconvolution
Deconvolution works like this:
You add padding around each pixel
You apply a convolution
For example, in the following illustration the original blue image is padded with zeros (white), the gray convolution filter is applied to get the green output.
Source: What are deconvolutional layers?
1 Unpooling.
In the original paper on unpooling, remaining activations are zeroed.
2 Deconvolution.
A deconvolutional layer is just the transposed of its corresponding conv layer. E.g. if conv layer's shape is [height, width, previous_layer_fms, next_layer_fms], than the deconv layer will have the shape [height, width, next_layer_fms, previous_layer_fms]. The weights of conv and deconv layers are shared! (see this paper for instance)

How to apply box filter on integral image? (SURF)

Assuming that I have a grayscale (8-bit) image and assume that I have an integral image created from that same image.
Image resolution is 720x576. According to SURF algorithm, each octave is composed of 4 box filters, which are defined by the number of pixels on their side. The
first octave uses filters with 9x9, 15x15, 21x21 and 27x27 pixels. The
second octave uses filters with 15x15, 27x27, 39x39 and 51x51 pixels.The third octave uses filters with 27x27, 51x51, 75x75 and 99x99 pixels. If the image is sufficiently large and I guess 720x576 is big enough (right??!!), a fourth octave is added, 51x51, 99x99, 147x147 and 195x195. These
octaves partially overlap one another to improve the quality of the interpolated results.
// so, we have:
//
// 9x9 15x15 21x21 27x27
// 15x15 27x27 39x39 51x51
// 27x27 51x51 75x75 99x99
// 51x51 99x99 147x147 195x195
The questions are:What are the values in each of these filters? Should I hardcode these values, or should I calculate them? How exactly (numerically) to apply filters to the integral image?
Also, for calculating the Hessian determinant I found two approximations:
det(HessianApprox) = DxxDyy − (0.9Dxy)^2 anddet(HessianApprox) = DxxDyy − (0.81Dxy)^2Which one is correct?
(Dxx, Dyy, and Dxy are Gaussian second order derivatives).
I had to go back to the original paper to find the precise answers to your questions.
Some background first
SURF leverages a common Image Analysis approach for regions-of-interest detection that is called blob detection.
The typical approach for blob detection is a difference of Gaussians.
There are several reasons for this, the first one being to mimic what happens in the visual cortex of the human brains.
The drawback to difference of Gaussians (DoG) is the computation time that is too expensive to be applied to large image areas.
In order to bypass this issue, SURF takes a simple approach. A DoG is simply the computation of two Gaussian averages (or equivalently, apply a Gaussian blur) followed by taking their difference.
A quick-and-dirty approximation (not so dirty for small regions) is to approximate the Gaussian blur by a box blur.
A box blur is the average value of all the images values in a given rectangle. It can be computed efficiently via integral images.
Using integral images
Inside an integral image, each pixel value is the sum of all the pixels that were above it and on its left in the original image.
The top-left pixel value in the integral image is thus 0, and the bottom-rightmost pixel of the integral image has thus the sum of all the original pixels for value.
Then, you just need to remark that the box blur is equal to the sum of all the pixels inside a given rectangle (not originating in the top-lefmost pixel of the image) and apply the following simple geometric reasoning.
If you have a rectangle with corners ABCD (top left, top right, bottom left, bottom right), then the value of the box filter is given by:
boxFilter(ABCD) = A + D - B - C,
where A, B, C, D is a shortcut for IntegralImagePixelAt(A) (B, C, D respectively).
Integral images in SURF
SURF is not using box blurs of sizes 9x9, etc. directly.
What it uses instead is several orders of Gaussian derivatives, or Haar-like features.
Let's take an example. Suppose you are to compute the 9x9 filters output. This corresponds to a given sigma, hence a fixed scale/octave.
The sigma being fixed, you center your 9x9 window on the pixel of interest. Then, you compute the output of the 2nd order Gaussian derivative in each direction (horizontal, vertical, diagonal). The Fig. 1 in the paper gives you an illustration of the vertical and diagonal filters.
The Hessian determinant
There is a factor to take into account the scale differences. Let's believe the paper that the determinant is equal to:
Det = DxxDyy - (0.9 * Dxy)^2.
Finally, the determinant is given by: Det = DxxDyy - 0.81*Dxy^2.
Look at page 17 of this document
http://www.sci.utah.edu/~fletcher/CS7960/slides/Scott.pdf
If you made a code for normal Gaussian 2D convolution, just use the box filter as a Gaussian kernel and the input image will be the same original image not integral image. The results from this method will be same with the one you asked.

Resources