Hi guys. I want to know the difference between field of view and receptive field. I know the two are similar, but what exactly is the difference?
These are essentially different concepts. Field of view is a term from the theory of optics: it denotes a solid angle with its vertex at the center of the entrance pupil.
Receptive field, on the other hand, is a term from neural networks (convolutional networks, transformers): it denotes the size of the region in the input that produces a given feature in the output (read more here).
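To make the optics meaning concrete, here is a tiny sketch (thin-lens approximation, with arbitrarily chosen example values) that computes the angular field of view from sensor width and focal length:

    import math

    def angle_of_view_deg(sensor_width_mm, focal_length_mm):
        """Horizontal angle of view for a thin lens focused at infinity."""
        return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))

    print(angle_of_view_deg(36.0, 50.0))  # ~39.6 degrees for a 50 mm lens on a full-frame sensor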
I have seen the claim that deeper layers in a CNN learn to recognize more sophisticated features. This usually comes with a picture of earlier filters recognizing lines and simple curves, and later filters recognizing more complicated patterns. It makes intuitive sense: the further you are from the data, the more abstract your understanding of it. Is there a theoretical explanation for this, though?
One approach, which is quite close to the intuitive one, would be to look at the receptive field of each layer:
If the first Convolutional Layer has a kernel size of 3x3, you have a receptive field of 9 pixels, expressed in one (scalar) value of the feature map.
If your second Convolutional Layer also has a kernel size of 3x3, you again have a receptive field of 9 values. But what is in that receptive field now? It is the 9 values of the feature map from the previous layer, and each of those values in turn has a receptive field of 9 values in the Input Layer, as said before.
If you now look back from the feature map of the 2nd Conv-Layer to the Input Layer, you will have a wider receptive field compared to looking from the feature map of the 1st Conv-Layer to the Input Layer.
So in the 2nd feature map, you will automatically consider more abstract features, since your receptive field in the Input Layer is larger. You can see this quite well in the graphic below, just think of "Map 1" as Input Layer, "Map 2" as Feature Map of the first Conv-Layer, and "Map 3" as Feature Map of the second Conv-Layer.
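To make this concrete, here is a minimal sketch (assuming 3x3 kernels, stride 1 and no dilation) of the standard recursion for the receptive field, where each layer adds (k - 1) times the current "jump" between neighbouring features:

    def receptive_field(kernels, strides):
        """Receptive field (in input pixels) of a stack of conv layers; assumes no dilation."""
        rf, jump = 1, 1
        for k, s in zip(kernels, strides):
            rf += (k - 1) * jump   # each layer widens the field by (k - 1) * current jump
            jump *= s              # jump = distance in input pixels between neighbouring features
        return rf

    print(receptive_field([3], [1]))        # 3 -> a value in the 1st feature map sees a 3x3 patch
    print(receptive_field([3, 3], [1, 1]))  # 5 -> a value in the 2nd feature map sees a 5x5 patch

This matches the picture: a value in "Map 3" looks at a 5x5 region of "Map 1", while a value in "Map 2" only looks at a 3x3 region.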
You will find more details on the receptive field in the CS231n course on CNNs.
There are many questions regarding the calculation of the receptive field.
It is explained very well here on StackOverflow.
However, there are no blogs or tutorials on how to calculate it for fully convolutional networks, i.e. with residual blocks, feature map concatenation, and upsampling layers (as in a feature pyramid network).
To my understanding, residual blocks and skip connections do not contribute to the receptive field and can be skipped (answer from here).
How are upsampling layers handled? For example, if we have an effective receptive field of 900 and an upsampling layer follows, does the receptive field get halved?
Does the receptive field change when concatenated with feature maps from prior layers?
Thanks in advance!
To answer your question piece by piece, let us first start with the definition of the receptive field in this context:
The receptive field of an individual sensory neuron is the particular region of the sensory space (e.g., the body surface, or the visual field) in which a stimulus will modify the firing of that neuron.
As taken from Wikipedia. This means we are looking for all the pixels in your input that affect the current output. Logically, if you perform a convolution (say, with a single 3x3 filter kernel), the receptive field of a single output pixel is the corresponding 3x3 region of the input that gets convolved in that specific step.
Visually, in this graphic the underlying darker area marks the receptive field for specific pixels in the output:
Now, to answer your first question: residual blocks of course still contribute to the receptive field! Let us denote the residual block as follows:
F(X): residual block
g_i(X): single convolutional block
Then we can write the residual block as F(X) = g_3(g_2(g_1(X))) + X, so in this example we stack 3 convolutions. Every single one of these convolutional layers still alters the receptive field, exactly as explained in the beginning. Simply adding X back in does not change the receptive field, of course, but that addition alone is not what makes it a residual block.
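As a small sanity check, here is a sketch (assuming three 3x3, stride-1 convolutions inside the block, which is my choice for illustration): the convolutional path keeps widening the receptive field, while the added identity branch contributes nothing beyond a single pixel, so the union is just the convolutional path's field:

    def path_receptive_field(kernels):
        """Receptive field of a chain of stride-1, undilated convolutions."""
        rf = 1
        for k in kernels:
            rf += k - 1
        return rf

    conv_path = path_receptive_field([3, 3, 3])  # g_3(g_2(g_1(X))) -> 7x7
    identity  = path_receptive_field([])         # the "+ X" branch  -> 1x1
    block     = max(conv_path, identity)         # union of both paths -> still 7x7
    print(conv_path, identity, block)            # 7 1 7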
Similarly, skip connections do not usually affect the receptive field, because the skipped path will almost always have a different (mostly smaller) receptive field than the main path. As explained in the answer you linked, it only makes a difference if your skip connection has the larger receptive field, since the overall receptive field is the maximum (more specifically, the union) of the regions covered by the different paths through your flow graph.
For the question about upsampling layers, you can guess the answer yourself by asking the following question:
Does the area of the input image that is being looked at change because of an upsampling layer anywhere within the network?
The answer should be "obviously not". Essentially, you are still looking at the same area of the input, only now at a higher resolution, and several output pixels may in fact look at the same input area. To get back to the GIF above: if you had 4x the number of pixels in the green area, every pixel would still look at a particular input region in the blue area whose size does not change. So no, upsampling does not affect the receptive field.
For the last question: this is closely related to the first one. The receptive field covers all the pixels that affect the output, so depending on which feature maps you are concatenating, the concatenation might change it.
Again, the resulting receptive field is the union of the receptive fields of the feature maps you are concatenating. If they are contained in one another (either A subset of B or B subset of A, where A and B are the feature maps to be concatenated), then the receptive field does not change. Otherwise, the receptive field would be A union B.
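Here is a sketch of that union rule for concatenation (the branch depths are chosen arbitrarily for illustration): a shallow branch with one 3x3 conv and a deeper branch with two 3x3 convs, both stride 1 and centred on the same output position:

    def path_receptive_field(kernels):
        rf = 1
        for k in kernels:
            rf += k - 1            # stride 1, no dilation
        return rf

    rf_a = path_receptive_field([3])     # shallow branch -> 3x3
    rf_b = path_receptive_field([3, 3])  # deeper branch  -> 5x5
    rf_concat = max(rf_a, rf_b)          # A is contained in B, so the union is just B -> 5x5
    print(rf_a, rf_b, rf_concat)         # 3 5 5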
As we know, Faster R-CNN has two main parts: one is the region proposal network (RPN), and the other is the Fast R-CNN detector.
My question is: now that the region proposal network (RPN) can output class scores and bounding boxes and is trainable, why do we need the Fast R-CNN part?
Am I right in thinking that the RPN is enough for detection (red circle), and that Fast R-CNN is becoming redundant (blue circle)?
Short answer: no they are not redundant.
The R-CNN article and its variants popularized the use of what we used to call a cascade.
Back then it was fairly common to use several different detectors, often very similar in structure, for detection because of their complementary power.
If the detections are partly orthogonal, this allows false positives to be removed along the way.
Furthermore, by design the two parts of R-CNN have different roles: the first one discriminates objects from the background, and the second one discriminates fine-grained categories of objects from one another (and from the background as well).
But you are right that if there is only 1 class versus the background, one could use only the RPN part for detection. Even in that case, chaining two different classifiers would probably improve the result (or not, see e.g. this article).
PS: I answered because I wanted to, but this question is definitely unsuited for Stack Overflow.
If you just added a class head to the RPN, you would indeed get detections, with scores and class estimates.
However, the second stage is used mainly to obtain more accurate detection boxes.
Faster R-CNN is a two-stage detector, like Fast R-CNN. There, Selective Search was used to generate rough estimates of the object locations, and the second stage then refines them, or rejects them.
Now why is this necessary for the RPN? So why are they only rough estimates?
One reason is the limited receptive field:
The input image is transformed via a CNN into a feature map with limited spatial resolution. For each position on the feature map, the RPN heads estimate if the features at that position correspond to an object and the heads regress the detection box.
The box regression is done based on the final feature map of the CNN. In particular, it may happen that the correct bounding box on the image is larger than the corresponding receptive field of the CNN.
Example: let's say we have an image depicting a person, and the features at one position of the feature map indicate a high probability for a person. Now, if the corresponding receptive field contains only part of the body, the regressor has to estimate a box enclosing the entire person, although it "sees" only that body part.
Therefore, RPN creates a rough estimate of the bounding box. The second stage of Faster RCNN uses all features contained in the predicted bounding box and can correct the estimate.
In the example, the RPN creates a bounding box that is too large but encloses the person (since it cannot see the person's full pose), and the second stage uses all the information in this box to reshape it so that it is tight. This can be done much more accurately, since much more of the object's content is accessible to the network.
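If you want to see the two stages side by side, here is a hedged sketch using torchvision's reference implementation (attribute names such as model.rpn and model.roi_heads reflect current torchvision internals and may differ between versions); it only illustrates that the RPN emits rough, class-agnostic proposals which the RoI heads then refine into class-specific boxes:

    import torch
    import torchvision

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn()  # random weights suffice to show the structure
    model.eval()

    image = [torch.rand(3, 480, 640)]  # dummy input image

    with torch.no_grad():
        # Stage 1: backbone + RPN produce class-agnostic proposal boxes.
        images, _ = model.transform(image)
        features = model.backbone(images.tensors)
        proposals, _ = model.rpn(images, features)
        # Stage 2: RoI heads pool the features inside each proposal and refine it
        # into class-specific boxes and scores.
        detections, _ = model.roi_heads(features, proposals, images.image_sizes)

    print(proposals[0].shape)  # e.g. [1000, 4]: rough boxes from the RPN
    print(detections[0]["boxes"].shape, detections[0]["scores"].shape)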
Faster R-CNN is a two-stage method, in contrast to one-stage methods like YOLO and SSD. The reason Faster R-CNN is accurate is its two-stage architecture, where the RPN is the first stage for proposal generation, and the second classification-and-localisation stage learns more precise results based on the coarse-grained results from the RPN.
So yes, you can use only the RPN, but the performance will not be good enough.
I think the blue circle is completely redundant: just adding a classification layer (which gives a class for each bounding box containing an object) should work just fine, and that is what single-shot detectors do, with compromised accuracy.
According to my understanding, the RPN just performs a binary check of whether there is an object in the bounding box or not, and the final detector part classifies the actual classes, e.g. car, human, phone, etc.
I just read a great post here. I am curious about the "An example with images" section of that post. If the hidden states encode many features of the original picture and get closer to the final result, then applying dimensionality reduction to the hidden states should give a better result than applying it to the raw pixels, I think.
Hence, I tried it on MNIST digits with a neural network of 2 hidden layers of 256 units each, using t-SNE for dimensionality reduction; the result is far from ideal. From left to right, top to bottom, the plots show the raw pixels, the second hidden layer, and the final prediction. Can anyone explain this?
BTW, the accuracy of this model is around 94.x%.
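For reference, here is a minimal sketch of the setup described above (sklearn assumed; the exact architecture and training details of the original experiment are unknown, so treat the choices below as placeholders): train a 2x256 MLP on MNIST, extract the second hidden layer's activations, and run t-SNE on both the raw pixels and the activations:

    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.manifold import TSNE
    from sklearn.neural_network import MLPClassifier

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0
    X_small, y_small = X[:5000], y[:5000]   # subsample so t-SNE stays tractable

    clf = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=30)
    clf.fit(X_small, y_small)

    def hidden_activations(model, data):
        """Forward pass up to the last hidden layer (ReLU is MLPClassifier's default)."""
        a = data
        for W, b in zip(model.coefs_[:-1], model.intercepts_[:-1]):
            a = np.maximum(a @ W + b, 0.0)
        return a

    emb_pixels = TSNE(n_components=2).fit_transform(X_small)
    emb_hidden = TSNE(n_components=2).fit_transform(hidden_activations(clf, X_small))
    # Scatter-plot emb_pixels and emb_hidden coloured by y_small to reproduce the panels.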
You have ten classes, and, as you mentioned, your model is performing well on this dataset, so in this 256-dimensional space the classes are separated well by linear subspaces.
So why don't the t-SNE projections have this property?
One trivial answer that comes to my mind is that projecting a high-dimensional set to two dimensions may lose the linear separation property. Consider the following example: a hill where one class sits at the peak and the second at the lower heights around it. In three dimensions these classes are easily separated by a plane, but one can easily find a two-dimensional projection that does not have this property (e.g. a projection in which you lose the height dimension).
Of course, t-SNE is not such a linear projection, but its main purpose is to preserve the local structure of the data, so a global property like linear separability might easily be lost with this approach.
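Here is a toy version of that hill example (synthetic data, arbitrary numbers chosen for illustration): two classes that a plane separates cleanly in 3D stop being linearly separable once the height axis is projected away:

    import numpy as np

    rng = np.random.default_rng(0)
    r = rng.uniform(0.0, 1.0, size=2000)
    theta = rng.uniform(0.0, 2 * np.pi, size=2000)
    x, y = r * np.cos(theta), r * np.sin(theta)
    z = 1.0 - r                          # a cone-shaped "hill": height falls off with radius
    labels = (z > 0.5).astype(int)       # class 1 sits on the peak, class 0 on the slopes around it

    # In 3D the plane z = 0.5 separates the classes perfectly.
    # In the (x, y) projection, class 1 becomes a disk surrounded by the class-0 ring,
    # so no single straight line can separate them any more.
    print(r[labels == 1].max(), r[labels == 0].min())  # the disk ends where the ring begins (radius ~0.5)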
I'm interested in taking user stroke input (i.e. drawing with an iPad) and classifying it as either text or a drawing (or, I suppose, just non-text), in whatever capacity is reasonably feasible. I'm not expecting a pre-built library for this, I'm just having a hard time finding any papers or algorithmic resources about this.
I don't need to detect what the text is that they're drawing, just whether it's likely text or not.
I would think you will need to generate probabilities for which text character the input is. If the probability of the most likely text character is below some threshold, classify the stroke as a drawing.
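A minimal sketch of that thresholding idea (the character classifier and the threshold value are hypothetical placeholders):

    import numpy as np

    def is_text(char_probs, threshold=0.6):
        """char_probs: softmax scores over character classes for one stroke."""
        return float(np.max(char_probs)) >= threshold

    # A stroke the character model is unsure about gets labelled as a drawing.
    print(is_text(np.array([0.12, 0.09, 0.11, 0.10, 0.58])))  # False
    print(is_text(np.array([0.02, 0.90, 0.03, 0.03, 0.02])))  # True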
This is a possibly useful paper: http://arxiv.org/pdf/1304.0421v1.pdf (if only for its reference list). Also, the first hit on this Google Scholar search looks relevant: http://scholar.google.com/scholar?q=classification+stroke+input+text+or+drawing