What does global pooling do? - image-processing

I recently found the "global_pooling" flag in the Pooling layer in caffe, but was unable to find anything about it in the documentation here (Layer Catalogue)
nor here (Pooling doxygen doc).
Is there a simple, straightforward explanation of this in comparison to the normal pooling layer behaviour?

Global pooling reduces the dimensionality from 3D to 1D: it outputs one response for every feature map. That response can be the maximum, the average, or whatever other pooling operation you use.
It is often used at the end of the convolutional part of a network to get a shape that works with dense layers, so no flatten has to be applied.
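A minimal sketch of this in plain Python (the `global_pool` helper is made up for illustration, it is not a Caffe API): each h×w channel of a feature map is reduced to a single number, so a d-channel map becomes a d-length vector that can feed a dense layer directly, with no flatten.

```python
def global_pool(feature_maps, mode="max"):
    """Reduce each h x w channel of a (d, h, w) feature map to one number."""
    pooled = []
    for channel in feature_maps:              # channel: h x w nested list
        values = [v for row in channel for v in row]
        pooled.append(max(values) if mode == "max"
                      else sum(values) / len(values))
    return pooled

fmap = [
    [[1, 2], [3, 4]],   # channel 0
    [[5, 6], [7, 8]],   # channel 1
]
print(global_pool(fmap, "max"))   # [4, 8]
print(global_pool(fmap, "avg"))   # [2.5, 6.5]
```

Note that the output length depends only on the number of channels, never on h or w, which is why a global-pooled network can accept variable image sizes.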

Convolutions can work on any image input size (as long as it is big enough). However, if you have a fully connected layer at the end, that layer needs a fixed input size, hence the complete network needs a fixed image input size.
However, you can remove the fully connected layer and just work with convolutional layers. You can make the last convolutional layer have the same number of filters as you have classes. But you want one value per class indicating the probability of that class, hence you apply a pooling filter over the complete remaining feature map. This pooling is "global" because it is always as big as the remaining feature map. In contrast, usual pooling layers have a fixed size (e.g. 2x2 or 3x3).
This is a general concept. You can also find global pooling in other libraries, e.g. Lasagne. If you want a good reference in literature, I recommend reading Network In Network.

Global pooling (GP) layers reduce the spatial dimensions of a three-dimensional feature map: the kernel size is the full h×w of the feature map, so this is a more extreme kind of dimensionality reduction, in which a feature map with dimensions h×w×d is reduced to 1×1×d. Global average pooling, for example, reduces each h×w feature map to a single number by simply taking the average of all h·w values.

If you are looking for information regarding flags/parameters of caffe, it is best to look them up in the comments of '$CAFFE_ROOT/src/caffe/proto/caffe.proto'.
For the 'global_pooling' parameter the comment says:
// If global_pooling then it will pool over the size of the bottom by doing
// kernel_h = bottom->height and kernel_w = bottom->width
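In prototxt terms, the flag is set inside `pooling_param`. A minimal sketch (the layer and blob names `global_pool`, `conv_final`, `pool` are made up for illustration):

```
layer {
  name: "global_pool"
  type: "Pooling"
  bottom: "conv_final"
  top: "pool"
  pooling_param {
    pool: AVE            # or MAX
    global_pooling: true # kernel is set to the full bottom height/width
  }
}
```

With `global_pooling: true` you do not specify `kernel_size`; Caffe derives the kernel from the bottom blob's dimensions, as the comment above describes.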
For more information about caffe layers, see these help pages.

Related

Is there a fundamental limit on how accurate location information is encoded in CNNs

Each layer in a CNN reduces the size of the input via convolution and max-pooling operations. Convolution is translation equivariant, but max-pooling is translation invariant. Correct me if this is wrong: each time max-pooling is applied, the precision of a feature's location is reduced. So the feature maps of the final conv layer in a very deep CNN will have a large receptive field (w.r.t. the original image), but the location of a feature (in the original image) is not discernible from looking at this feature map alone.
If this is true, how can the accuracy of bounding boxes when we do localisation be so good with a deep CNN? I understand how classification works, but making accurate bounding box predictions is confusing me.
Perhaps a toy example will clarify my confusion:
Say we have a dataset of images with dimension 256x256x1, and we want to predict whether a cat is present, and if so, where it is, so our target is something like [sigmoid_cat_present, cat_location].
Our vanilla CNN (let's assume something like VGG) will take in the image and transform it to something like 16x16x256 in the last convolutional layer. Each pixel in this final 16x16 feature map can be influenced by a much larger region in the original image. So if we determine a cat is present, how can the [cat_location] be refined to value more granular than this effective receptive field?
To add to your question: how about pixel-perfect accuracy of segmentation boundaries!
Your intuition regarding down-sampling via max-pooling is correct. Normal CNNs have that limit. However, there have been some improvements recently to overcome it.
The breakthrough to this problem came in 2015-6 in the form of U-net and atrous/dilated convolution introduced in DeepLab.
Dilated convolutions or atrous convolutions, previously described for wavelet analysis without signal decimation, expand the window size without increasing the number of weights by inserting zero-values into convolution kernels. Dilated convolutions have been shown to decrease blurring in semantic segmentation maps, and are purported to work at least in part by extracting long-range information without the need for pooling.
Using U-Net architectures is another method that seeks to retain high spatial frequency information by directly adding skip connections between early and late layers. In other words, down-sampling followed by up-sampling, with skip connections between matching resolutions.
In TensorFlow, atrous convolutions are implemented with the function:
tf.nn.atrous_conv2d
There are many more methods and this is an ongoing research area.
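To make the dilation idea concrete, here is a toy 1-D sketch in plain Python (the `dilated_conv1d` helper and the signal/kernel values are made up for illustration): spacing the kernel taps `rate` samples apart widens the receptive field without adding weights.

```python
def dilated_conv1d(signal, kernel, rate):
    """Valid-mode 1-D correlation with dilation `rate` (rate=1 is ordinary)."""
    span = (len(kernel) - 1) * rate + 1          # effective kernel width
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(kernel[k] * signal[start + k * rate]
                       for k in range(len(kernel))))
    return out

x = [1, 2, 3, 4, 5, 6]
k = [1, 0, -1]
print(dilated_conv1d(x, k, rate=1))  # taps 1 apart  -> [-2, -2, -2, -2]
print(dilated_conv1d(x, k, rate=2))  # taps 2 apart  -> [-4, -4]
```

With rate=2 the same three weights compare samples four positions apart, which is exactly the "long-range information without pooling" effect described above.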

Why don't max pooling layers break CNNs performance in solving regression problems?

When I think about a max pooling layer I think about it detecting features that are anywhere in their receptive field, but agnostic as to the location.
It seems this spatial invariance property of max pooling should mean it loses information about the exact location of features in the original image.
How then, can a CNN with several layers of max pooling accurately predict the bounding boxes of objects in an image? A quick Google shows many examples of CNNs with max pooling being recommended for bounding box regression problems.
Thanks for any help.
Because your assumption that it'll lose information about the exact location is wrong. Max pooling does not dilute the location of the maximum pixel; instead, consider it a way of downsizing. Max pooling is just a way to reduce the dimensionality of the problem so that it fits into device memory. A nice side property is that it keeps the strongest activations from your feature map.
In the case of bbox prediction it also reduces the number of proposed regions for bboxes, and a later non-maximum suppression step discards all redundant proposed bbox locations.
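A toy sketch of the "downsizing that keeps the strongest activations" view, in plain Python (the `max_pool2x2` helper is made up for illustration and assumes even height and width):

```python
def max_pool2x2(fmap):
    """2x2 max pooling with stride 2 over an h x w nested list (h, w even)."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

fm = [[1, 3, 2, 0],
      [4, 2, 1, 1],
      [0, 1, 9, 2],
      [2, 3, 4, 5]]
print(max_pool2x2(fm))  # [[4, 2], [3, 9]]
```

Each output cell still records which quadrant of the input fired most strongly, so coarse location survives; only the sub-window offset (here, at most one pixel) is discarded per pooling stage.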

Interlayer scaling or normalisation between hidden layers in ANNs CNNs and MLPs

Would anyone here know if there is any kind of normalisation or scaling between layers in existing neural network architectures?
Scaling inputs is common, and I am familiar with ReLU blow-up. Most models I see indicate a small range of values like -2 to +2, but I don't see how this can be maintained from layer to layer. Irrespective of the activation function, the second layer's output is in the tens, the third layer's is in the hundreds, and the final output is in the tens of thousands. In the worst case the layer returns NaN. A workaround can be scaling or alternating ReLU/sigmoid, but I would like to know if this is common?
Pretty much every network uses batch normalization, which is exactly that. The paper can be found here: (https://arxiv.org/abs/1502.03167). In essence, it normalizes the values to zero mean and unit variance before they are fed into the next layer. Another line of work is self-normalizing linear units (SELU), which in some sense do this automatically without needing any kind of scaling. Paper can be found here: (https://arxiv.org/abs/1706.02515).

Reduce dimensions of model's fully connected layer for image retrieval task

I'm working on an image retrieval task (not involving faces), and one of the things I am trying is to swap out the softmax layer in the CNN model and use the LMNN classifier. For this purpose I fine-tuned the model and then extracted the features at the fully connected layer. I have about 3000 images right now. The fully connected layer gives a 4096-dim vector, so my final matrix is 3000x4096, with about 700 classes (each class has 2+ images). I believe this dimensionality is so large that the LMNN algorithm will take forever (it really did take forever).
How can I reduce the number of dimensions? I tried PCA, but that didn't squeeze down the dimensions much (it only got down to 3000x3000). I am thinking a 256/512/1024-dim vector should help. If I were to add another layer to reduce dimensions, say a new fully connected layer, would I have to fine-tune my network again? Inputs on how to do that would be great!
I am also currently trying to augment my data to get more images per class and increase the size of my dataset.
Thank you.
PCA should let you reduce the data further - you should be able to specify the desired dimensionality - see the wikipedia article.
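A hedged sketch with scikit-learn, assuming its `PCA` estimator is available (shapes are scaled down from 3000x4096 to 300x64 so the example runs quickly). Note that PCA can return at most min(n_samples, n_features) components, which is why with 3000 samples the default run stopped at 3000x3000; passing `n_components` explicitly picks fewer.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the 3000x4096 fully-connected features (random toy data):
X = np.random.default_rng(0).normal(size=(300, 64))

# Request the target dimensionality directly:
X_reduced = PCA(n_components=16).fit_transform(X)
print(X_reduced.shape)  # (300, 16)
```

For the real data you would pass `n_components=256` (or 512/1024) and feed `X_reduced` to LMNN in place of the raw 4096-dim features.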
As well as PCA you can try t-distributed stochastic neighbor embedding (t-SNE). I really enjoyed Wattenberg, et al.'s article - worth a read if you want to get an insight into how it works and some of the pitfalls.
In a neural net the standard way to reduce dimensionality is by adding more, smaller layers, as you suggested. As they can only learn during training, you'll need to re-run your fine-tuning. Ideally you would re-run the entire training process if you make a change to the model structure but if you have enough data it may be OK still.
To add new layers in TensorFlow, you would add a fully connected layer whose input is the output of your 4096-element layer, and whose output size is the desired number of elements. You may repeat this if you want to go down gradually (e.g. 4096 -> 1024 -> 512). You would then perform your training (or fine-tuning) again.
Lastly, I did a quick search and found this paper that claims to support LMNN over large datasets through random sampling. You might be able to use that to save a few headaches: Fast LMNN Algorithm through Random Sampling

Having a neural network output a gaussian distribution rather than one single value?

Let's consider I have a neural network with one single output neuron. To outline the scenario: the network gets an image as input and should find one single object in that image. For simplifying the scenario, it should just output the x-coordinate of the object.
However, since the object can be at various locations, the network's output will certainly have some noise on it. Additionally the image can be a bit blurry and stuff.
Therefore I thought it might be a better idea to have the network output a gaussian distribution of the object's location.
Unfortunately I am struggling to model this idea. How would I design the output? A flattened 100-dimensional vector if the image has a width of 100 pixels, so that the network can fit a gaussian distribution into this vector and I just need to locate the peak to get the approximate object location?
Additionally, I am failing to figure out the cost function and teacher signal. Would the teacher signal be a perfect gaussian distribution centered on the exact x-coordinate of the object?
How would I model the cost function then? Currently I have a softmax cross-entropy, or simply a squared error between the network's output and the real x-coordinate.
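The teacher signal described here can be built directly: discretize the x-axis into one bin per pixel column and place a normalized gaussian at the true coordinate, then train against it with softmax cross-entropy. A plain-Python sketch (the `gaussian_target` helper and the sigma value are made up for illustration):

```python
import math

def gaussian_target(n_bins, mu, sigma):
    """Discrete gaussian over n_bins positions, normalized to sum to 1."""
    raw = [math.exp(-0.5 * ((i - mu) / sigma) ** 2) for i in range(n_bins)]
    total = sum(raw)
    return [r / total for r in raw]

t = gaussian_target(100, mu=42, sigma=3.0)
print(max(range(100), key=lambda i: t[i]))  # 42: peak sits on the true x
print(round(sum(t), 6))                     # 1.0: valid target distribution
```

Because the target sums to 1, it can be used directly as the label distribution in a cross-entropy loss against the network's softmax output, and the spread of the predicted distribution then carries the uncertainty information the question asks about.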
Is there maybe a better way to handle this scenario? Like a better distribution or any other way to have the network not output a single value without any information of the noise and so on?
Sounds like what you really need is a convolutional network.
You could train a network to recognize your target object when it's positioned in the center of the network's receptive field. You can then create a moving window, at each step feeding the portion of the larger image under that window into the net. If you keep track of the outputs of the trained network for each (x,y) position of the window, some locations of the window will produce better matches than others. Once you've covered the whole image, you can pick the position with the maximum network output as the position where the target object is most likely located.
To handle scale and rotation variations, consider creating an image pyramid, or sets of images at different scales and rotations that are versions of the original image. Then sweep over those images to find the target object.
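The moving-window search can be sketched in 1-D plain Python (the `find_object` helper is made up for illustration, and a simple dot-product scorer stands in for the trained network's output):

```python
def find_object(image_row, template):
    """Slide a scorer across the row; return the x with the best match."""
    w = len(template)
    def score(patch):
        # Stand-in for the trained network: dot product with a template.
        return sum(p * t for p, t in zip(patch, template))
    scores = [score(image_row[x:x + w])
              for x in range(len(image_row) - w + 1)]
    return max(range(len(scores)), key=lambda x: scores[x])

row = [0, 0, 1, 3, 1, 0, 0, 0]   # toy "image" with an object around x=2..4
print(find_object(row, [1, 3, 1]))  # 2: window offset with the strongest response
```

In 2-D the same loop runs over (x, y) window positions, and the pyramid extension in the paragraph above simply repeats this search once per scale/rotation of the input.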
