Can CNN reduce input size by a specified ratio - machine-learning

I have a fully convolutional CNN, and I have no problem designing it to produce an output of exactly the same size as the input by using "same" padding and a stride of one.
However, I have an image-translation problem where I need the output to be resized to (W/26, H/15), where (W, H) is the size of the input. (Resizing the image beforehand is problematic and is not an option in our case.)
I understand that by using the formula O = (I - F + 2P)/s + 1, where:
O: output size
I: input size
F: filter size
P: padding
s: stride
I may be able to use some really strange filter size to achieve this. But is there a systematic or organized way to construct such a network to reduce the input size?

Getting that precise output size by playing only with filter size and stride is going to give you some headaches.
My two cents: whenever the size of your output is particularly weird (like (w//26, h//15)), interpolation layers might help you get to that particular size.
For example:
In PyTorch you can use torch.nn.functional.interpolate.
In TensorFlow you can use tf.image.resize.
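As a side note to the formula in the question, here is a minimal Python sketch of the arithmetic. It is not part of the answer above, and the helper names conv_output_size and reduce_by_ratio are made up for illustration. It assumes floor division and checks one simple case: with no padding and a filter size equal to the stride, the output size is the input size integer-divided by that stride.

```python
def conv_output_size(i, f, p, s):
    """O = (I - F + 2P) / s + 1, using floor division for the stride step."""
    return (i - f + 2 * p) // s + 1


def reduce_by_ratio(w, h, ratio_w=26, ratio_h=15):
    """Output size of one layer whose filter size equals its stride, no padding.

    Each output value then covers one non-overlapping ratio_w x ratio_h block.
    """
    return (conv_output_size(w, ratio_w, 0, ratio_w),
            conv_output_size(h, ratio_h, 0, ratio_h))


print(reduce_by_ratio(1300, 750))  # (50, 50)
print(reduce_by_ratio(1310, 760))  # (50, 50), the remainder is dropped
```

Whether such a large filter is a good modelling choice is a separate question; an interpolation layer as suggested above avoids committing to it.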

Related

TensorFlow - What is random_crop doing in Cifar10 example?

In the Cifar10 example in the TensorFlow examples they are distorting the images with a random combination of cropping, flipping, brightening, contrasting, and whitening. This concept makes sense except the cropping seems a little odd to me. The images will need to be the same dimensions for the network and the cropping code looks like this:
height = IMAGE_SIZE
width = IMAGE_SIZE
# Image processing for training the network. Note the many random
# distortions applied to the image.
# Randomly crop a [height, width] section of the image.
distorted_image = tf.random_crop(reshaped_image, [height, width, 3])
Since the height and width are based on the image size, is this actually doing anything?
In the example, IMAGE_SIZE is set to 24. So basically what this code does is select a random offset and extract a 24 x 24 patch. It probably ensures that the offset is chosen so that the patch can be extracted without any wrap-around or other weird boundary condition, or maybe it pads the image (this should be easy to check).
I guess IMAGE_SIZE could be better named PATCH_SIZE or something similar. Note that the original CIFAR-10 input image is 32 x 32.
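To make the idea concrete, here is a small sketch of a random crop in plain Python. It is an illustration of the concept only, not TensorFlow's implementation; the function name and the list-of-rows image representation are assumptions. The offset is drawn so that the patch always lies inside the image.

```python
import random


def random_crop(image, crop_h, crop_w, rng=random):
    """Return a crop_h x crop_w window of image (a list of rows) at a random offset."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    # Limit the offsets so the window never leaves the image.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


# A 32 x 32 "image" whose pixels record their own coordinates.
image = [[(r, c) for c in range(32)] for r in range(32)]
patch = random_crop(image, 24, 24)
print(len(patch), len(patch[0]))  # 24 24
```

With a 32 x 32 input and a 24 x 24 patch, the offset can take 9 values in each direction, so the network sees up to 81 shifted versions of each image.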

1D discrete denoising of image by variational method (the length of smoothing term)

Regarding this 1D discrete denoising via variational calculus, I would like to know how to handle the length of the smoothing term, given that it should be N-1 while the length of the data term is N. Here is the equation:
E=0;
for i=1:n
E+=(u(i)-f(i))^2 + lambda*(u(i+1)-u(i))
E is the cost of the current u in the optimization process
f is the given (noisy) image
u is the output (denoised) image
n is the length of the 1D vector
lambda>=0 is the weight of the smoothness term in the optimization process (described around minute 13 of the video)
Here the lengths of the second term and the first term do not match. How do I resolve this?
More importantly, I would like to use a linear system of equations to solve this problem.
This is nowhere near my cup of tea, but I think you are referring to the fact that
u(i+1)-u(i) accesses the next pixel, so the term only works on a resolution 1 pixel smaller than the original image f.
In graphics and filtering this is usually resolved in 2 ways:
1. Use a default value for pixels outside the image resolution:
- set those pixels to a default or neutral (for the process) color, like black
- use the color of the closest neighbor inside the image resolution
- interpolate the missing pixels (bilinear, bicubic...)
I think the first of these options is not suitable for your denoising technique.
2. Change the resolution of the output image:
Usually after some filtering techniques (via FIR, etc.) the result is 1 pixel smaller than the input, which resolves the missing-data problem. In your case it looks like your resulting image u should be 1 pixel bigger than the input image f while computing the cost function.
So either enlarge it via option #1 and crop back to the original size when the optimization is done,
or virtually crop f by one pixel (just say n'=n-1) before computing the cost function, so that you avoid access violations (and you can also restore it after the optimization...).
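A minimal Python sketch of one way to set this up as a linear system follows. It rests on an assumption that is not in the question: the smoothness term is squared, i.e. the energy is sum_i (u[i]-f[i])^2 + lambda * sum_{i<n-1} (u[i+1]-u[i])^2, so the data term runs over n samples and the smoothness term over n-1 differences. Setting the gradient to zero then gives the tridiagonal system (I + lambda * D^T D) u = f, where D is the (n-1) x n forward-difference matrix. The function names are hypothetical.

```python
def energy(u, f, lam):
    """Data term over n samples plus (assumed squared) smoothness over n-1 differences."""
    data = sum((ui - fi) ** 2 for ui, fi in zip(u, f))
    smooth = sum((u[i + 1] - u[i]) ** 2 for i in range(len(u) - 1))
    return data + lam * smooth


def denoise_1d(f, lam):
    """Solve (I + lam * D^T D) u = f with the Thomas (tridiagonal) algorithm."""
    n = len(f)
    if n == 1:
        return [float(f[0])]
    diag = [1.0 + 2.0 * lam] * n
    diag[0] = diag[-1] = 1.0 + lam   # boundary samples have only one neighbour
    off = -lam                       # sub- and super-diagonal entries
    c = [0.0] * n                    # modified super-diagonal
    d = [0.0] * n                    # modified right-hand side
    c[0] = off / diag[0]
    d[0] = f[0] / diag[0]
    for i in range(1, n):            # forward elimination
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (f[i] - off * d[i - 1]) / denom
    u = [0.0] * n
    u[-1] = d[-1]
    for i in range(n - 2, -1, -1):   # back substitution
        u[i] = d[i] - c[i] * u[i + 1]
    return u


noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
smoothed = denoise_1d(noisy, lam=5.0)
print([round(x, 3) for x in smoothed])
print(energy(smoothed, noisy, 5.0) <= energy(noisy, noisy, 5.0))
```

Because the sum over differences simply stops at n-1, no out-of-range sample is ever accessed and no padding is needed. If the smoothness term is meant to be an absolute value instead, the problem is no longer a single linear system.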

What is Depth of a convolutional neural network?

I was taking a look at Convolutional Neural Network from CS231n Convolutional Neural Networks for Visual Recognition. In Convolutional Neural Network, the neurons are arranged in 3 dimensions(height, width, depth). I am having trouble with the depth of the CNN. I can't visualize what it is.
In the link they said The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
For example, look at this picture. Sorry if the image quality is poor.
I can grasp the idea that we take a small area of the image and then compare it with the "filters". So are the filters a collection of small images? They also said: We will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron. So does the receptive field have the same dimensions as the filters? Also, what will the depth be here? And what do we signify by the depth of a CNN?
So my question mainly is: if I take an image with dimensions [32*32*3] (let's say I have 50000 of these images, making the dataset [50000*32*32*3]), what should I choose as its depth, and what would the depth mean? Also, what will the dimensions of the filters be?
It would also be very helpful if anyone could provide a link that gives some intuition on this.
EDIT:
So in one part of the tutorial(Real-world example part), it says The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96].
Here we see the depth is 96. So is depth something that I choose arbitrarily, or something I compute? Also, in the example above (Krizhevsky et al.) the depth was 96. What does a depth of 96 mean? The tutorial also stated: Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
So does that mean the depth will be like this? If so, can I assume Depth = Number of Filters?
In deep neural networks, depth usually refers to how deep the network is, but in this context depth refers to the 3rd dimension of an image.
In this case you have an image, and the size of this input is 32x32x3, which is (width, height, depth). The neural network should be able to learn from these parameters, as depth corresponds to the different channels of the training images.
UPDATE:
In each layer of your CNN, it learns regularities about the training images. In the very first layers, the regularities are curves and edges; as you go deeper along the layers, you start learning higher-level regularities such as colors, shapes, objects etc. This is the basic idea, but there are lots of technical details. Before going any further, give this a shot: http://www.datarobot.com/blog/a-primer-on-deep-learning/
UPDATE 2:
Have a look at the first figure in the link you provided. It says 'In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).' It means that a ConvNet transforms the input image by arranging its neurons in three dimensions.
As an answer to your question, depth corresponds to the different color channels of an image.
Moreover, about the filter depth, the tutorial states this:
Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
Which basically means that a filter covers only a small patch of the image in width and height but all of its depth, and it slides across the image in order to learn the regularities in it.
UPDATE 3:
For the real-world example, I just browsed the original paper and it says this: The first convolutional layer filters the 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4 pixels.
The tutorial refers to the depth as the channels, but in the real world you can design whatever dimensions you like. After all, that is your design.
The tutorial aims to give you a glimpse of how ConvNets work in theory, but if I design a ConvNet, nobody can stop me from proposing one with a different depth.
Does this make any sense?
The depth of a CONV layer is the number of filters it is using.
The depth of a filter is equal to the depth of the image it is using as input.
For example: let's say you are using an image of 227*227*3.
Now suppose you are using a filter of size 11*11 (spatial size).
This 11*11 square will be slid along the whole image to produce a single 2-dimensional array as a response. But in order to do so, it must cover everything inside the 11*11 area across all channels. Therefore the depth of the filter will be the depth of the image = 3.
Now suppose we have 96 such filters, each producing a different response. This will be the depth of the convolutional layer. It is simply the number of filters used.
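As a quick check of these numbers, here is a small Python sketch. The helper name conv_layer_summary is made up for illustration, and the stride of 4 with no padding is taken from the tutorial passage quoted in the question. It shows that the filter depth follows the input depth while the output depth follows the number of filters.

```python
def conv_layer_summary(in_w, in_h, in_depth, f, num_filters, stride=1, pad=0):
    """Shapes and parameter count of one convolutional layer."""
    out_w = (in_w - f + 2 * pad) // stride + 1
    out_h = (in_h - f + 2 * pad) // stride + 1
    return {
        "filter_shape": (f, f, in_depth),             # depth matches the input
        "output_shape": (out_w, out_h, num_filters),  # depth = number of filters
        "params": f * f * in_depth * num_filters + num_filters,  # weights + biases
    }


summary = conv_layer_summary(227, 227, 3, f=11, num_filters=96, stride=4)
print(summary["filter_shape"])  # (11, 11, 3)
print(summary["output_shape"])  # (55, 55, 96)
print(summary["params"])        # 34944
```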
I'm not sure why this is skimmed over so heavily. I also had trouble understanding it at first, and very few people other than Andrej Karpathy (thanks d00d) have explained it. However, in his writeup (http://cs231n.github.io/convolutional-networks/), he calculates the depth of the output volume using a different example than the one in the animation.
Start by reading the section titled 'Numpy examples'.
Here, we go through it step by step.
In this case we have an 11x11x4 input. (Why we start with 4 is kind of peculiar, as it would be easier to grasp with a depth of 3.)
Really pay attention to this line:
A depth column (or a fibre) at position (x,y) would be the activations
X[x,y,:].
A depth slice, or equivalently an activation map at depth d
would be the activations X[:,:,d].
V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V is your output volume. In V[0,0,0], the zeroth index is your column: here it is 0, the first column in your output volume.
The next index is the row: here it is 0, the first row in your output volume. The last index is the depth: here it is 0, the first output layer.
Now, here's where people get confused (at least I did). The input depth has absolutely nothing to do with your output depth. The input depth only controls the filter depth (W in Andrej's example).
Aside: A lot of people wonder why 3 is the standard input depth. For plain old color images, this will always be 3.
np.sum(X[:5,:5,:] * W0) + b0 (convolution 1)
Here, we are calculating the elementwise product with a weight volume W0, which is 5x5x4. 5x5 is an arbitrary choice. 4 is the depth, since we need to match our input depth. The weight volume is your filter, kernel, receptive field or whatever obfuscated name people decide to call it down the road.
If you come at this from a non-Python background, that may be why there is more confusion, since array slicing notation is non-intuitive. The calculation is a dot product of the first 5x5x4 window of your image with the weights. The output is a single scalar value, which takes the first position in your first filter's output matrix. Imagine a 4 x 4 matrix holding the sum-product of each of these convolution operations across the entire input. Now stack these matrices, one for each filter. That gives you your output volume. In Andrej's writeup, he starts by moving along the x axis while the y axis remains the same.
Here's an example of what V[:,:,0] would look like in terms of convolutions. Remember that the third value of our index is the depth of your output layer.
[result of convolution 1, result of convolution 2, ..., ...]
[..., ..., ..., ...]
[..., ..., ..., ...]
[..., ..., ..., result of convolution n]
The animation is best for understanding this, but Andrej decided to pair it with an example that doesn't match the calculation above.
This took me a while, partly because numpy doesn't index the way Andrej does in his example, at least it didn't when I played around with it. Also, there is an assumption that the sum-product operation is clear. That's the key to understanding how your output layer is created, what each value represents and what the depth is.
Hopefully that helps!
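For readers who find the slicing notation hard to follow, here is a sketch of the same computation written with explicit loops in plain Python (no numpy). The function name conv_forward, the random data and the choice of 2 filters are assumptions for illustration; the 11x11x4 input, the 5x5 filter size and the resulting 4 x 4 output follow the example discussed above, with a stride of 2 as in the tutorial's example.

```python
import random


def conv_forward(X, filters, biases, stride):
    """X[x][y][d] is the input volume; filters[k][i][j][d] are K filters of size F x F x D."""
    in_w, in_h, depth = len(X), len(X[0]), len(X[0][0])
    f = len(filters[0])
    out_w = (in_w - f) // stride + 1
    out_h = (in_h - f) // stride + 1
    # The output depth equals the number of filters, not the input depth.
    V = [[[0.0] * len(filters) for _ in range(out_h)] for _ in range(out_w)]
    for k, (W, b) in enumerate(zip(filters, biases)):
        for x in range(out_w):
            for y in range(out_h):
                total = b
                # Sum-product over the F x F x D window: one scalar per position.
                for i in range(f):
                    for j in range(f):
                        for d in range(depth):
                            total += X[x * stride + i][y * stride + j][d] * W[i][j][d]
                V[x][y][k] = total
    return V


rng = random.Random(0)
X = [[[rng.random() for _ in range(4)] for _ in range(11)] for _ in range(11)]
filters = [[[[rng.random() for _ in range(4)] for _ in range(5)] for _ in range(5)]
           for _ in range(2)]
biases = [0.0, 1.0]
V = conv_forward(X, filters, biases, stride=2)
print(len(V), len(V[0]), len(V[0][0]))  # 4 4 2
```

V[0][0][0] here plays the role of np.sum(X[:5,:5,:] * W0) + b0 in the writeup.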
When we are doing image classification, the input volume is N x N x 3. At the beginning it is not difficult to imagine what the depth means: just the number of channels (red, green, blue). OK, so the meaning for the first layer is clear. But what about the next ones? Here is how I try to visualize the idea.
On each layer we apply a set of filters which convolve around the input. Let's imagine that we are currently at the first layer and we convolve around a volume V of size N x N x 3. As @Semih Yagcioglu mentioned, at the very beginning we are looking for some rough features: curves, edges etc. Let's say we apply M filters of equal size (3x3) with stride 1. Then each of these filters looks for a different curve or edge while convolving around V. Of course, each filter has the same depth as V: we want to supply the whole information, not just the grayscale representation.
Now, M filters will look for M different curves or edges, and each of these filters will produce a feature map consisting of scalars (the meaning of a scalar is the filter saying: the probability of having this curve here is X%). When we convolve the same filter around the volume, we obtain this map of scalars telling us where exactly we saw the curve.
Then comes feature map stacking. Imagine stacking as the following thing. We have information about where each filter detected a certain curve. Nice, then when we stack the maps we obtain information about what curves / edges are available at each small part of our input volume. And this is the output of our first convolutional layer.
It is easy to grasp the idea behind non-linearity when taking this into account. When we apply the ReLU function to some feature map, we say: remove all negative probabilities for curves or edges at this location. And this certainly makes sense.
Then the input for the next layer will be a volume V_1 carrying info about different curves and edges at different spatial locations (remember: each depth slice carries info about 1 curve or edge).
This means that the next layer will be able to extract information about more sophisticated shapes by combining these curves and edges. To combine them, again, the filters should have the same depth as the input volume.
From time to time we apply pooling. Its purpose is exactly to shrink the volume, since with stride 1 we usually look at a pixel (neuron) too many times for the same feature.
Hope this makes sense. Look at the amazing graphs provided by the famous CS231n course to check how exactly the probability for each feature at a certain location is computed.
In simple terms, it can be explained as below.
Let's say you have 10 filters, each of size 5x5x3. What does this mean? The depth of this layer is 10, which is equal to the number of filters. The size of each filter can be defined as we want, e.g. 5x5x3 in this case, where 3 is the depth of the previous layer. To be precise, the depth of each filter in the next layer should be 10 (n x n x 10), where n can be chosen as you want, like 5 or something else. Hope this makes everything clear.
The first thing you need to note is:
the receptive field of a neuron is 3D
i.e. if the receptive field is 5x5, the neuron will be connected to 5x5x(input depth) points. So whatever your input depth is, one layer of neurons will only produce 1 layer of output.
Now, the next thing to note is:
depth of output layer = depth of conv. layer
i.e. the depth of the output volume is independent of the depth of the input volume; it only depends on the number of filters (depth). This should be pretty obvious from the previous point.
Note that the number of filters (the depth of the CNN layer) is a hyperparameter. You can choose whatever you want, independent of the image depth. Each filter has its own set of weights, enabling it to learn a different feature on the same local region covered by the filter.
The depth of the network is the number of layers in the network. In the Krizhevsky paper, the depth is 9 layers (modulo a fencepost issue with how layers are counted?).
If you are referring to the depth of the filter (I came to this question searching for that), then this diagram of LeNet illustrates it.
Source: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
How do you create such a filter? In Python, something like https://github.com/alexcpn/cnn_in_python/blob/main/main.py#L19-L27
which will give you a list of numpy arrays, where the length of the list is the depth.
Taking the example in the code above but adding a depth of 3 for color (RGB), the network is as follows. The first convolutional layer uses filters of shape (5,5,3) and has depth 6:
Input (R,G,B) = [32,32,3] * (5,5,3)*6 = [28,28,6] * (5,5,6)*1 = [24,24,1] * (5,5,1)*16 = [20,20,16] *
FC layer 1 (20, 120, 16) * FC layer 2 (120, 1) * FC layer 3 (20, 10) * Softmax (10,) = (10,1) = Output
In PyTorch:
import numpy as np
import torch
import torch.nn as nn

np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})
# Generate a random single-channel image
image_size = 32
image_depth = 3
image = np.random.rand(image_size, image_size)
# Stack three copies along the last axis to mimic RGB channels -> (32, 32, 3)
image = np.stack([image,image,image], axis=image_depth-1) # 0 to 2
# Move channels first, as Conv2d expects (channels, height, width)
image = np.moveaxis(image, [2, 0], [0, 2])
print("Image Shape=",image.shape)
input_tensor = torch.from_numpy(image)
# 6 filters, each 5x5 spatially and spanning all 3 input channels
m = nn.Conv2d(in_channels=3,out_channels=6,kernel_size=5,stride=1)
output = m(input_tensor.float())
print("Output Shape=",output.shape)
Image Shape= (3, 32, 32)
Output Shape= torch.Size([6, 28, 28])

Using openCV, how do I change the resolution and not the size of the image?

This may be a basic question, but I'm new to OpenCV and I find the documentation to be very poor. I think I would like to use the resize function to get a new image at the same size as the original, but at a lower resolution.
All the documentation I find acts as if resolution and size are the same thing, and I have absolutely no idea what these parameters mean. Different sources seem to show a different version of resize than what I see in the headers:
CV_EXPORTS_W void resize( InputArray src, OutputArray dst,
Size dsize, double fx=0, double fy=0,
int interpolation=INTER_LINEAR );
If I keep dsize the same size as my original, what do fx and fy represent, and how would I get a resolution of, say, 72 dpi?
Let me get one thing straight: when you load your image into memory, you have, to a good approximation, a matrix of numbers with a given number of rows and columns. By definition, dpi is the number of individual dots that can be placed in a line of one inch, and there is no "inch" in memory. How would you define dpi for a matrix stored in memory? It makes no sense to talk about it with respect to memory alone. That is why in OpenCV (and perhaps in any other processing library) the concepts of resolution and size are the same.
Maybe you would like to achieve something like an "artificial" lowering of dpi? Something that "looks like" the image being printed with a lower dpi? In that case, why don't you try resizing the same image down and then back up to achieve this result?
And the cv::resize() function does change the size, either to a given destination size (param dsize) or by scale factors (fx and fy).
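The down-then-up idea can be sketched in plain Python as follows. This is a stand-in for illustration only and does not call OpenCV: the helper names are made up, the image is a list of rows of numbers, downscaling is done by block averaging and upscaling by nearest-neighbour repetition. With OpenCV the two steps would be two resize calls with the interpolation mode of your choice.

```python
def downscale(image, factor):
    """Average non-overlapping factor x factor blocks."""
    h, w = len(image) // factor, len(image[0]) // factor
    return [[sum(image[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / factor ** 2
             for c in range(w)]
            for r in range(h)]


def upscale(image, factor):
    """Nearest-neighbour enlargement: repeat each pixel factor times in both directions."""
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]


def lower_resolution(image, factor):
    """Same pixel dimensions as the input, but with less detail."""
    return upscale(downscale(image, factor), factor)


image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
coarse = lower_resolution(image, 4)
print(len(coarse), len(coarse[0]))  # 8 8
```

The result has the same number of rows and columns as the input, but every 4 x 4 block holds a single value, which is the "lower resolution at the same size" effect described above.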

Optimal sigma for Gaussian filtering of an image?

When applying a Gaussian blur to an image, typically the sigma is a parameter (examples include Matlab and ImageJ).
How does one know what sigma should be? Is there a mathematical way to figure out an optimal sigma? In my case, I have some objects in images that are bright compared to the background, and I need to find them computationally. I am going to apply a Gaussian filter to make the center of these objects even brighter, which hopefully facilitates finding them. How can I determine the optimal sigma for this?
There's no formula to determine it for you; the optimal sigma will depend on image factors - primarily the resolution of the image and the size of your objects in it (in pixels).
Also, note that Gaussian filters aren't actually meant to brighten anything; you might want to look into contrast maximization techniques - sounds like something as simple as histogram stretching could work well for you.
edit: More explanation - sigma basically controls how "fat" your kernel function is going to be; higher sigma values blur over a wider radius. Since you're working with images, bigger sigma also forces you to use a larger kernel matrix to capture enough of the function's energy. For your specific case, you want your kernel to be big enough to cover most of the object (so that it's blurred enough), but not so large that it starts overlapping multiple neighboring objects at a time - so actually, object separation is also a factor along with size.
Since you mentioned MATLAB - you can take a look at various gaussian kernels with different parameters using the fspecial('gaussian', hsize, sigma) function, where hsize is the size of the kernel and sigma is, well, sigma. Try varying the parameters to see how it changes.
I use this convention as a rule of thumb: if k is the size of the kernel, then sigma=(k-1)/6. This is because about 99.7% of the mass of a Gaussian pdf lies within plus or minus 3 sigma of the mean, a span of 6 sigma.
You have to find the sigma that minimizes or maximizes some function G(X,sigma), where X is the set of your observations (in your case, the grayscale values of your image). This function can be anything that maintains the "order" of the intensities of the image. For example, this can be done with the 1st derivative of the image (as G):
fil = fspecial('sobel');
im = imfilter(I,fil);
imagesc(im);
colormap(gray);
This gives you the first derivative of the image. Now you want to find the best sigma by
maximizing G(X,sigma), which means that you try a few sigmas (let's say in increasing order) until you reach the sigma that makes G maximal. This can also be done with the second derivative.
Given that the central value of the kernel equals 1, the size that guarantees that the outermost value is less than a limit (e.g. 1/100) is as follows:
// requires <cmath>; sigma is the standard deviation of the Gaussian
double limit = 1.0 / 100.0;
int size = static_cast<int>(2 * std::ceil(std::sqrt(-2.0 * sigma * sigma * std::log(limit))));
if (size % 2 == 0)
{
size++; // keep the size odd so the kernel has a central element
}
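The two rules of thumb from the answers above (sigma from the kernel size, and kernel size from sigma and a cut-off limit) can be compared with a short Python sketch. The function names are made up for illustration, and the normalized 1D kernel at the end is an addition that only shows how the chosen size and sigma would be used.

```python
import math


def sigma_from_kernel_size(k):
    """Rule of thumb: the kernel spans about 6 sigma."""
    return (k - 1) / 6.0


def kernel_size_from_sigma(sigma, limit=0.01):
    """Smallest odd size whose outermost value is at most limit when the centre is 1."""
    half = math.ceil(math.sqrt(-2.0 * sigma * sigma * math.log(limit)))
    return 2 * half + 1


def gaussian_kernel_1d(size, sigma):
    """Sampled Gaussian, normalized so the weights sum to 1."""
    centre = size // 2
    weights = [math.exp(-((i - centre) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


print(sigma_from_kernel_size(7))    # 1.0
print(kernel_size_from_sigma(1.0))  # 9
kernel = gaussian_kernel_1d(kernel_size_from_sigma(1.0), 1.0)
print([round(w, 4) for w in kernel])
```

For sigma = 1 the two rules give 7 and 9 taps respectively, so they agree only roughly; as noted above, the choice still depends on the size and separation of the objects in the image.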

Resources