Fully Convolutional Network Receptive Field - machine-learning

There are many questions regarding the calculation of the receptive field.
It is explained very well here on StackOverflow.
However, there are no blogs or tutorials on how to calculate it in a fully convolutional network, i.e. one with residual blocks, feature map concatenation and upsampling layers (like a feature pyramid network).
To my understanding, residual blocks and skip connections do not contribute to the receptive field and can be skipped. Answer from here.
How are upsampling layers handled? For example, if we have an effective receptive field of 900 and an upsampling layer follows, does the receptive field get halved?
Does the receptive field change when concatenated with feature maps from prior layers?
Thanks in advance!

To answer your question piece by piece, let us first start with the definition of the receptive field in this context:
The receptive field of an individual sensory neuron is the particular region of the sensory space (e.g., the body surface, or the visual field) in which a stimulus will modify the firing of that neuron.
As taken from Wikipedia. This means we are looking for all pixels in your input that affect the current output. Logically, if you perform a convolution - say, for example, with a single 3x3 filter kernel - the receptive field of a single output pixel is the corresponding 3x3 region of the input that gets convolved in that specific step.
Visually, in this graphic the underlying darker area marks the receptive field for specific pixels in the output:
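For reference, the per-layer bookkeeping that is commonly used to track this (a standard convention, not something stated in the question) is

r_out = r_in + (k - 1) * j_in
j_out = j_in * s

where r is the receptive field size in input pixels, j is the "jump" (the distance in input pixels between neighbouring output pixels, i.e. the product of all previous strides), k is the kernel size and s is the stride of the current layer. For a single 3x3, stride-1 convolution applied directly to the input this gives r_out = 1 + 2 * 1 = 3, i.e. the 3x3 region mentioned above.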
Now, to answer your first question: residual blocks of course still contribute to the receptive field! Let us denote the residual block as follows:
F(X): residual block
g_i(X): single convolutional block
Then we can write the residual block as F(X) = g_3(g_2(g_1(X))) + X, so in this case we stack 3 convolutions (as an example). Of course, every single one of these convolution layers still alters the receptive field, exactly as explained in the beginning. Simply adding X again does not change the receptive field, of course. But that addition alone does not make a residual block.
Similarly, skip connections do affect the receptive field in the sense that the path which skips layers will almost always have a different (mostly smaller) receptive field. As explained in your linked answer, though, it only makes a difference if the skip connection has the larger receptive field, since the overall receptive field is the maximum (more specifically, the union) of the receptive fields along the different paths through your flow graph.
For the question about upsampling layers, you can guess the answer yourself by asking the following question:
Does the area of the input image get affected by upsampling anywhere within the image?
The answer should be "obviously not". Essentially, you are still looking at the same area of the input, only now at a higher resolution, so several output pixels may in fact look at the same input area. To get back to the GIF above: if you had 4x the number of pixels in the green area, every pixel would still have to look at a particular input region in the blue area that does not change in size. So no, upsampling does not affect this.
For the last question: this is very closely related to the first question. The receptive field covers all the pixels that affect the output, so depending on which feature maps you are concatenating, the concatenation might change it.
Again, the resulting receptive field is the union of the receptive fields of the feature maps you are concatenating. If one is contained in the other (either A is a subset of B or B is a subset of A, where A and B are the receptive fields of the maps being concatenated), then the receptive field does not change. Otherwise, it is the union of A and B.
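Putting the pieces above together, here is a minimal sketch of that bookkeeping in Python; the layer configuration is made up for illustration, and merges (residual add / concatenation) are treated as the union of their paths:

```python
# A minimal sketch, assuming the usual single-path bookkeeping
# (r_out = r_in + (k - 1) * j_in, j_out = j_in * s); all layer
# configurations below are illustrative, not taken from a specific network.

def conv(state, kernel, stride=1):
    rf, jump = state                      # rf: receptive field in input pixels
    return rf + (kernel - 1) * jump, jump * stride

def upsample(state, factor=2):
    rf, jump = state                      # resolution grows, rf in input pixels stays
    return rf, jump / factor

def merge(*states):                       # residual add or concatenation
    # paths merged at the same resolution share the same jump;
    # the receptive field is the union, i.e. the maximum over the paths
    return max(s[0] for s in states), states[0][1]

state = (1, 1)                            # a single input pixel
state = conv(state, 7, stride=2)          # stem:       rf 7,  jump 2
state = conv(state, 3, stride=2)          # downsample: rf 11, jump 4

identity = state                          # residual block: identity path
path = conv(conv(conv(state, 3), 3), 3)   # three 3x3 convs: rf 35
state = merge(path, identity)             # block output:    rf 35

state = upsample(state, 2)                # e.g. an FPN top-down step: rf unchanged
print(state)                              # (35, 2.0)
```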

Related

What is the difference between field of view and receptive field?

Guys, I want to know the difference between field of view and receptive field. I know the two are similar, but what is the difference?
These are essentially different concepts. Field of view is a term from the theory of optics; it means a solid angle with its vertex at the center of the entrance pupil.
On the other hand, receptive field is a term from neural networks (convolutional/transformers), meaning the size of the region in the input that produces a given feature in the output (read more here).

Is there a theoretical explanation/quantification for deeper CNN layers learning more sophisticated features?

I have seen the claim that deeper layers in a CNN learn to recognize more sophisticated features. This usually comes with a picture of earlier filters recognizing lines/simple curves, and later filters recognizing more complicated patterns. It makes intuitive sense: the further you are from the data, the more abstract your understanding of it. Is there a theoretical explanation for this, though?
One approach, which is quite close to the intuitive one, would be to look at the receptive field of each layer:
If the first Convolutional Layer has a kernel size of 3x3, you have a receptive field of size 9 (3x3 input pixels), expressed in one (scalar) value of the feature map.
If your second Convolutional Layer now also has a kernel size of 3x3, you have again a receptive field of size 9. But what is in the receptive field now? It's the 9 values of the feature map from the previous layer. Now each of these values has a receptive field of 9 values from the Input Layer, as said before.
If you now look back from the feature map of the 2nd Conv-Layer to the Input Layer, you will have a wider receptive field compared to looking from the feature map of the 1st Conv-Layer to the Input Layer.
So in the 2nd feature map, you will automatically consider more abstract features, since your receptive field in the Input Layer is larger. You can see this quite well in the graphic below, just think of "Map 1" as Input Layer, "Map 2" as Feature Map of the first Conv-Layer, and "Map 3" as Feature Map of the second Conv-Layer.
You will find more details on the receptive field in the CS231n course on CNNs.
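If you want to check the numbers, here is a tiny sketch using the standard recurrence; the stack of three 3x3, stride-1 Conv-Layers is just an illustrative example:

```python
# A quick sanity check of the growth described above, using the recurrence
# r_out = r_in + (kernel - 1) * jump, where jump is the product of all
# previous strides.
layers = [(3, 1), (3, 1), (3, 1)]      # (kernel_size, stride) per Conv-Layer

rf, jump = 1, 1
for i, (kernel, stride) in enumerate(layers, start=1):
    rf += (kernel - 1) * jump
    jump *= stride
    print(f"after Conv-Layer {i}: {rf}x{rf} receptive field in the Input Layer")
# -> 3x3, then 5x5, then 7x7
```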

Pooling Layer vs. Using Padding in Convolutional Layers

My understanding is that we use padding when we convolve because convolving with filters reduces the dimension of the output by shrinking it, and also loses information from the edges/corners of the input matrix. However, we also use a pooling layer after a number of Conv layers in order to downsample our feature maps. Doesn't this seem sort of contradictory? We use padding because we do NOT want to reduce the spatial dimensions, but we later use pooling to reduce the spatial dimensions. Could someone provide some intuition behind these two?
Without loss of generality, assume we are dealing with images as inputs. The reason behind padding is not only to keep the dimensions from shrinking; it is also to ensure that input pixels on the corners and edges of the input are not "disadvantaged" in affecting the output. Without padding, a pixel in the corner of an image overlaps with just one filter region, while a pixel in the middle of the image overlaps with many filter regions. Hence, the pixel in the middle affects more units in the next layer and therefore has a greater impact on the output.
Secondly, you actually do want to shrink the dimensions of your input (remember, Deep Learning is all about compression, i.e. finding low-dimensional representations of the input that disentangle the factors of variation in your data). The shrinking induced by convolutions with no padding is not ideal, and if you have a really deep net you would quickly end up with very low-dimensional representations that lose most of the relevant information in the data. Instead, you want to shrink your dimensions in a smart way, which is achieved by pooling. In particular, Max Pooling has been found to work well. This is really an empirical result, i.e. there isn't a lot of theory to explain why this is the case. You could imagine that by taking the max over nearby activations, you still retain the information about the presence of a particular feature in this region, while losing information about its exact location. This can be good or bad: good because it buys you translation invariance, and bad because the exact location may be relevant for your problem.
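To make the two effects concrete, here is a minimal shape check; it assumes PyTorch is available and the layer sizes are purely illustrative:

```python
# Padding keeps the convolution from shrinking the feature map,
# while pooling downsamples deliberately.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                            # one 32x32 RGB image

conv_nopad = nn.Conv2d(3, 8, kernel_size=3)              # no padding
conv_pad   = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # "same" padding
pool       = nn.MaxPool2d(kernel_size=2)                 # intentional downsampling

print(conv_nopad(x).shape)       # torch.Size([1, 8, 30, 30]) -> shrinks by k-1
print(conv_pad(x).shape)         # torch.Size([1, 8, 32, 32]) -> size preserved
print(pool(conv_pad(x)).shape)   # torch.Size([1, 8, 16, 16]) -> halved on purpose
```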

How should I set up my input neurons to receive my input

I need to be able to determine if a shape was drawn correctly or incorrectly.
I have sample data for the shape that holds the shape and the order of the pixels (denoted by the color of each pixel);
for example, you can see the downsampled image and the color variation.
I'm having trouble figuring out the network I need to define that will accept this kind of input for training.
Should I convert the downsampled image to a matrix and input it? Let's say my image is 64x64; I would need 64x64 input neurons (and that's if I ignore the color of the pixels, I think). Is that a feasible solution?
If you have any guidance, I could use it :)
I give you an example below.
It is a binarized 4x4 image of the letter c. You can concatenate either the rows or the columns; I am concatenating by columns, as shown in the figure. Then each pixel is mapped to an input neuron (16 input neurons in total). In the output layer, I have 26 outputs, the letters a to z.
Note that, in the figure, I did not connect all nodes from layer i to layer i+1 for simplicity; in practice you should connect them all.
At the output layer, I highlight the node of c to indicate that for this training instance, c is the target label. The expected input and output vectors are listed at the bottom of the figure.
If you want to keep the intensity of color, e.g., R/G/B, then you have to triple the number of inputs. Each single pixel is replaced with three neurons.
Hope this helps. For further reading, I strongly suggest the deep learning tutorial by Andrew Ng here - UFLDL. It is the state of the art for this kind of image recognition problem. In the exercises accompanying the tutorial, you will be intensively trained to preprocess images and to work with a lot of engineering tricks for image processing, together with the interesting deep learning algorithms, end to end.
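As a concrete (hypothetical) version of the encoding described above, here is a minimal NumPy sketch; the 4x4 pattern simply stands in for the figure's binarized letter c:

```python
# Flatten a binarized 4x4 image column by column into 16 input values and
# build a one-hot target over the 26 letters; the pixel pattern is made up.
import numpy as np

image = np.array([[0, 1, 1, 1],
                  [1, 0, 0, 0],
                  [1, 0, 0, 0],
                  [0, 1, 1, 1]])     # stand-in for a binarized 4x4 letter "c"

x = image.flatten(order="F")         # concatenate by columns -> 16 input neurons
y = np.zeros(26)                     # one output per letter a..z
y[ord("c") - ord("a")] = 1           # one-hot target: "c" is index 2

print(x.tolist())
print(int(y.argmax()))               # 2
```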

How do I make a U-matrix?

How exactly is a U-matrix constructed in order to visualise a self-organizing map? More specifically, suppose that I have an output grid of 3x3 nodes (that have already been trained); how do I construct a U-matrix from this? You can e.g. assume that the neurons (and inputs) have dimension 4.
I have found several resources on the web, but they are not clear or they are contradictory. For example, the original paper is full of typos.
A U-matrix is a visual representation of the distances between neurons in the input data space. Namely, you calculate the distance between adjacent neurons, using their trained weight vectors. If your input dimension was 4, then each neuron in the trained map also corresponds to a 4-dimensional vector. Let's say you have a 3x3 hexagonal map.
The U-matrix will be a 5x5 matrix with interpolated elements for each connection between two neurons like this
The {x,y} elements are the distance between neurons x and y, and the values in the {x} elements are the mean of the surrounding values. For example, {4,5} = distance(4,5) and {4} = mean({1,4}, {2,4}, {4,5}, {4,7}). For the calculation of the distance you use the trained 4-dimensional vector of each neuron and the distance formula that you used for the training of the map (usually the Euclidean distance). So, the values of the U-matrix are only numbers (not vectors). Then you can assign a light gray colour to the largest of these values and a dark gray to the smallest, and the other values to corresponding shades of gray. You can use these colours to paint the cells of the U-matrix and have a visual representation of the distances between neurons.
Have also a look at this web article.
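For completeness, here is a minimal sketch of that construction; it uses a rectangular (not hexagonal) 3x3 grid for simplicity and random placeholder weights instead of a trained map:

```python
# Build a 5x5 U-matrix from a 3x3 map with 4-dimensional codebook vectors.
import numpy as np

n = 3
weights = np.random.rand(n, n, 4)        # stand-in for trained neuron vectors
U = np.zeros((2 * n - 1, 2 * n - 1))     # 5x5 U-matrix

# Connection cells {x,y}: distance between adjacent neurons.
for i in range(n):
    for j in range(n):
        if j + 1 < n:                    # right neighbour
            U[2 * i, 2 * j + 1] = np.linalg.norm(weights[i, j] - weights[i, j + 1])
        if i + 1 < n:                    # bottom neighbour
            U[2 * i + 1, 2 * j] = np.linalg.norm(weights[i, j] - weights[i + 1, j])

# Neuron cells {x}: mean of the surrounding connection values.
# (Cells lying between four neurons are left at 0 in this simplified sketch.)
for i in range(n):
    for j in range(n):
        r, c = 2 * i, 2 * j
        neigh = [U[r + dr, c + dc]
                 for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= r + dr < 2 * n - 1 and 0 <= c + dc < 2 * n - 1]
        U[r, c] = np.mean(neigh)

print(U.round(2))                        # map these values to shades of gray
```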
The original paper cited in the question states:
A naive application of Kohonen's algorithm, although preserving the topology of the input data, is not able to show clusters inherent in the input data.
Firstly, that is true; secondly, it reflects a deep misunderstanding of the SOM; and thirdly, it also misunderstands the purpose of calculating the SOM.
Just take the RGB color space as an example: are there 3 colors (RGB), or 6 (RGBCMY), or 8 (+BW), or more? How would you define that independently of the purpose, i.e. inherent in the data itself?
My recommendation would be not to use maximum likelihood estimators of cluster boundaries at all - not even such primitive ones as the U-matrix - because the underlying argument is already flawed. No matter which method you then use to determine the clusters, you would inherit that flaw. More precisely, the determination of cluster boundaries is not interesting at all, and it loses information regarding the true intention of building a SOM. So, why do we build SOMs from data?
Let us start with some basics:
Any SOM is a representative model of a data space, since it reduces the dimensionality of the latter. Because it is a model, it can be used as a diagnostic as well as a predictive tool. Yet neither use is justified by some universal objectivity. Instead, models are deeply dependent on the purpose and on the accepted risk of error associated with it.
Let us assume for a moment that the U-matrix (or something similar) were reasonable, so we determine some clusters on the map. It is not only an issue of how to justify the criterion for it (outside of the purpose itself); it is also problematic because any further calculation destroys some information (it is a model about a model).
The only interesting thing on a SOM is the accuracy itself viz the classification error, not some estimation of it. Thus, the estimation of the model in terms of validation and robustness is the only thing that is interesting.
Any prediction has a purpose and the acceptance of the prediction is a function of the accuracy, which in turn can be expressed by the classification error. Note that the classification error can be determined for 2-class models as well as for multi-class models. If you don't have a purpose, you should not do anything with your data.
Inversely, the concept of "number of clusters" is completely dependent on the criterion of "allowed divergence within clusters", so it masks the most important thing about the structure of the data. It is also dependent on the risk and the risk structure (in terms of type I/II errors) you are willing to take.
So, how could we determine the number of classes on a SOM? If no exterior a-priori reasoning is available, the only feasible way is an a-posteriori check of the goodness of fit. On a given SOM, impose different numbers of classes and measure the deviations in terms of mis-classification cost, then choose (subjectively) the most pleasing one (using some fancy heuristics, like Occam's razor).
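Purely as an illustration of such an a-posteriori check (not a recommendation of a particular method), one could impose different numbers of classes on the trained codebook and compare the resulting cost; here k-means and within-cluster divergence are stand-ins for whatever class structure and cost function match your purpose (the mis-classification cost mentioned above would additionally require labels):

```python
# Impose k classes on the trained codebook for several k and record a cost.
import numpy as np
from sklearn.cluster import KMeans

codebook = np.random.rand(9, 4)          # placeholder for a trained 3x3 map, dim 4

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(codebook)
    print(k, round(km.inertia_, 3))      # within-cluster divergence for k classes
# inspect the costs and choose (subjectively) the k you can justify
```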
Taken together, the U-matrix is pretending objectivity where no objectivity can be. It is a serious misunderstanding of modeling altogether.
IMHO it is one of the greatest advantages of the SOM that all the parameters implied by it are accessible and open for being parameterized. Approaches like the U-matrix destroy just that, by disregarding this transparency and closing it again with opaque statistical reasoning.
