In the paper 'Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks',
the authors use skip connections to concatenate the RoI-pooled features of layers conv3, conv4 and conv5. Before the concatenation, they propose to L2-normalize and rescale each feature map extracted from these layers. My question is how to determine the rescaling values for the pooled features, and which Caffe layer can be used to implement this?
You can use existing layers to compute the L2 norm of a feature map. See this thread for an example.
You can use a "Scale" layer to scale each feature map.
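To make the two steps concrete, here is a minimal pure-Python sketch of what they compute; it is not the Caffe configuration itself. It assumes the normalization is taken across channels at a single spatial position and that each channel gets its own scale. The helper names `l2_normalize` and `rescale` and all numeric values are hypothetical. Since a Caffe "Scale" layer can learn its scale factors, in practice only their initial value has to be chosen by hand.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Divide a feature vector by its L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (norm + eps) for v in vector]

def rescale(vector, scales):
    """Multiply each channel by its own scale factor."""
    return [v * s for v, s in zip(vector, scales)]

# Hypothetical pooled feature with 4 channels at one spatial position.
feature = [3.0, 4.0, 0.0, 0.0]
# Hypothetical per-channel scales; in a network these would be learned.
scales = [10.0, 10.0, 10.0, 10.0]

normalized = l2_normalize(feature)      # unit L2 norm: about [0.6, 0.8, 0.0, 0.0]
rescaled = rescale(normalized, scales)  # about [6.0, 8.0, 0.0, 0.0]
print(normalized)
print(rescaled)
```

After normalization every layer's features have the same magnitude, so the scale is the only thing that decides how much each layer contributes after concatenation.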
my CNN network
Above is my config of the network.
I am training a CNN on pictures of size 192*192.
My target is a classification network with 11 classes.
However, the loss and the accuracy on the test dataset appear to be very unstable. I have to run 15+ epochs to get a stable accuracy and loss. The maximum accuracy is only 50%.
What can I do to improve the performance?
I would recommend first looking at widely known models such as VGG-16, LeNet or VGG-19 and checking how their Conv2D and max-pooling layers are arranged.
Start with a very basic model without any batch normalization or Leaky ReLU layers. Keep just the Conv2D and max-pooling layers and train the model for a few epochs.
Next, try other activations, for example switching from ReLU to tanh. Try changing the max pooling to average pooling.
If you are solving a classification problem, use a softmax layer at the end. Also, introduce Dense layer(s) after flattening.
Your dataset should be large, and the targets should be one-hot encoded if you wish to train the softmax output with a categorical cross-entropy loss.
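For reference, one-hot encoding can be sketched in a few lines of plain Python. The 11 classes come from the question; the helper `one_hot` and the example labels are hypothetical, and most frameworks ship their own utility for this.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at the label's index and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in the range [0, num_classes)")
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

NUM_CLASSES = 11     # number of classes in the question
labels = [0, 3, 10]  # hypothetical integer class labels

encoded = [one_hot(label, NUM_CLASSES) for label in labels]
for label, vector in zip(labels, encoded):
    print(label, vector)
```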
I'm trying to create a new CNN model.
First I pass the RGB images (size 224x224) through a ResNet50 network. The output of the ResNet50 is (None, 7, 7, 2048). I now have 2 different ways to reduce this to a (None, 512) vector.
Way 1: Insert an FCL (Dense layer) with 512 neurons followed by a global average pooling layer.
Way 2: Do a global average pooling layer first, and only afterwards apply the FCL with 512 neurons.
Are way 1 and 2 the same? If not, what is the difference?
I found a similar question, How fully connected layer after global average pooling works in Resnet50? , but it doesn't explain what difference it makes to do the global pooling first.
I believe placing global average pooling after the FCL doesn't make sense.
The purpose of global average pooling is to (partly) replace the FCL for the task of dimensionality reduction after the CNN layers while using fewer parameters (thus making overfitting less likely). Some people still place a small FCL after the global average pooling.
A nice explanation can be found here: https://www.quora.com/Why-was-global-average-pooling-used-instead-of-a-fully-connected-layer-in-GoogLeNet-and-how-was-it-different
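To put rough numbers on the parameter savings, here is a small back-of-the-envelope sketch. It compares a 512-unit Dense layer placed after global average pooling with one placed on the flattened 7x7x2048 output. The flattened variant is an assumption made for comparison, not one of the two ways in the question, and `dense_params` is a hypothetical helper.

```python
def dense_params(num_inputs, num_units):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_units + num_units

HEIGHT, WIDTH, CHANNELS = 7, 7, 2048  # ResNet50 output shape from the question
UNITS = 512

# Global average pooling reduces 7x7x2048 to 2048 values before the Dense layer.
gap_then_dense = dense_params(CHANNELS, UNITS)

# Assumed alternative: flatten all 7*7*2048 values and feed them to the Dense layer.
flatten_then_dense = dense_params(HEIGHT * WIDTH * CHANNELS, UNITS)

print(gap_then_dense)      # 1049088
print(flatten_then_dense)  # 51380736
```

Under these assumptions the pooled variant needs 49 times fewer weights, which is the overfitting argument in numbers.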
I am trying to learn about convolutional neural networks, but I am having trouble understanding what happens in the network after the pooling step.
So starting from the left we have our 28x28 matrix representing our picture. We apply three 5x5 filters to it to get three 24x24 feature maps. We then apply max pooling over each 2x2 region of every feature map to get three 12x12 pooled layers. I understand everything up to this step.
But what happens now? The document I am reading says:
"The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the max-pooled
layer to every one of the 10 output neurons. "
The text did not go further into describing what happens beyond that and it left me with a few questions.
How are the three pooled layers mapped to the 10 output neurons? By fully connected, does it mean each neuron in every one of the three layers of the 12x12 pooled layers has a weight connecting it to the output layer? So there are 3x12x12x10 weights linking from the pooled layer to the output layer? Is an activation function still taken at the output neuron?
Pictures and extract taken from this online resource: http://neuralnetworksanddeeplearning.com/chap6.html
Essentially, the fully connected layer provides the main way for the neural network to make a prediction. If you have ten classes, then the fully connected layer consists of ten neurons, each holding the probability that the classified sample belongs to the corresponding class (each neuron represents a class). These probabilities are determined by the hidden layers and the convolution. The output of the pooling layer is simply fed into these ten neurons, which provide the final interface for your network to make the prediction. Here's an example. After pooling, your fully connected layer could display this:
(0.1)
(0.01)
(0.2)
(0.9)
(0.2)
(0.1)
(0.1)
(0.1)
(0.1)
(0.1)
Here each neuron contains a probability that the sample belongs to that class. In this case, if you are classifying images of handwritten digits and each neuron corresponds to a prediction that the image is 1-10, then the prediction would be 4. Hope that helps!
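Picking the prediction from these ten outputs is just an argmax. A minimal sketch, assuming the values are listed in class order and the classes are labelled 1 to 10 as in the example:

```python
# Output values of the ten neurons, in the order listed above.
outputs = [0.1, 0.01, 0.2, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]

# Index of the largest output (0-based).
best_index = max(range(len(outputs)), key=lambda i: outputs[i])

# Assumption: classes are labelled 1..10, so shift the index by one.
predicted_class = best_index + 1
print(predicted_class)  # 4
```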
Yes, you're on the right track. There is a layer with a weight matrix of 4320 entries.
This matrix will typically be arranged as 432x10. This is because these 432 numbers are a fixed-size representation of the input image. At this point, you don't care how you got it (CNN, plain feed-forward or a crazy RNN going pixel-by-pixel); you just want to turn the description into a classification. In most toolkits (e.g. TensorFlow, PyTorch or even plain numpy), you'll need to explicitly reshape the 3x12x12 output of the pooling into a 432-long vector. But that's just a rearrangement; the individual elements do not change.
Additionally, there will usually be a 10-long vector of biases, one for every output element.
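The shapes involved can be checked with a small pure-Python sketch of the forward pass through this layer. The pooled values and the weights are random placeholders (a trained network would have learned weights), and all variable names are hypothetical.

```python
import random

random.seed(0)  # reproducible placeholder values

CHANNELS, HEIGHT, WIDTH, NUM_OUTPUTS = 3, 12, 12, 10

# Placeholder for the three 12x12 pooled feature maps.
pooled = [[[random.random() for _ in range(WIDTH)]
           for _ in range(HEIGHT)]
          for _ in range(CHANNELS)]

# Flattening is only a rearrangement into a 432-long vector.
flat = [value for channel in pooled for row in channel for value in row]

# 432x10 weight matrix and 10 biases (random and zero here, learned in practice).
weights = [[random.uniform(-0.1, 0.1) for _ in range(NUM_OUTPUTS)]
           for _ in range(len(flat))]
biases = [0.0] * NUM_OUTPUTS

# Each output is a weighted sum over all 432 inputs plus its bias.
logits = [sum(flat[i] * weights[i][j] for i in range(len(flat))) + biases[j]
          for j in range(NUM_OUTPUTS)]

num_weights = len(weights) * len(weights[0])
print(len(flat), num_weights, len(logits))  # 432 4320 10
```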
Finally, about the nonlinearity: since this is about classification, you typically want the 10 output units to represent posterior probabilities that the input belongs to a particular class (digit). For this purpose, the softmax function is used: y = exp(o) / sum(exp(o)), where exp(o) stands for element-wise exponentiation. It guarantees that its output will be a proper categorical distribution, with all elements in [0, 1] and summing to 1. There is a nice and detailed discussion of softmax in neural networks in the Deep Learning book (I recommend reading Section 6.2.1 in addition to the softmax sub-subsection itself).
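The softmax formula translates almost directly into code. In this minimal sketch the maximum is subtracted before exponentiating, a common numerical-stability trick that is not part of the formula above and does not change the result; the input values are hypothetical.

```python
import math

def softmax(logits):
    """Turn a list of real-valued scores into a categorical distribution."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(o - shift) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score always receives the largest probability, so taking the argmax of the softmax output gives the same class as taking the argmax of the raw scores.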
Also note that this is not specific to convolutional networks at all: you'll find this block (a fully connected layer followed by softmax) at the end of virtually every classification network. You can also view this block as the actual classifier, while anything in front of it (a shallow CNN in your case) is just trying to prepare nice features.
Caffe supports multiple losses. In the backpropagation stage, some blobs may then receive multiple gradients coming from different losses. How does Caffe deal with the gradients of such a blob?
As far as I know, this may not be a concern when designing networks. But this question really confuses me when I try to write a new layer. Thanks for any ideas!
This is not an issue of Caffe or any other deep-learning tool. This is purely a mathematical question: when you have several losses, each loss has a loss_weight assigned to it, and the overall loss of the net is the weighted sum of all losses. Consequently, the gradients computed for the net are gradients of the weighted sum of the losses: there is no gradient-per-loss that needs to be integrated, but rather a single loss which is a weighted sum of the loss layers.
Caffe usually uses "Split" layer when directing the "top" of a layer into several layers (in your example the output of "conv2" is "Split" to a "bottom" of "auxiliary loss" and "ip1").
Looking at the code of the backward propagation of the "Split" layer, you can see that all the top.diffs are summed into bottom.diff.
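The summation can be checked on a toy scalar example. This is a minimal sketch, not Caffe code: the two loss functions, their weights and the input value are hypothetical, and the finite-difference check is only there to confirm that the summed diffs equal the gradient of the weighted total loss.

```python
def loss_a(x):
    return (x - 1.0) ** 2

def grad_a(x):
    return 2.0 * (x - 1.0)

def loss_b(x):
    return 3.0 * x

def grad_b(x):
    return 3.0

# Hypothetical loss weights, playing the role of loss_weight.
WEIGHT_A, WEIGHT_B = 1.0, 0.5

def total_loss(x):
    """The single overall loss: a weighted sum of both losses."""
    return WEIGHT_A * loss_a(x) + WEIGHT_B * loss_b(x)

x = 2.0  # value of the blob that feeds both losses

# Each branch sends back its own (already weighted) gradient ...
diff_from_a = WEIGHT_A * grad_a(x)  # 2.0
diff_from_b = WEIGHT_B * grad_b(x)  # 1.5

# ... and the split point simply adds them up.
bottom_diff = diff_from_a + diff_from_b  # 3.5

# Finite-difference gradient of the total loss for comparison.
h = 1e-6
numeric = (total_loss(x + h) - total_loss(x - h)) / (2 * h)

print(bottom_diff, numeric)
```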
I have seen several different architectures for convolutional neural network (CNN). I am confused which one is the standard and how do I decide what to use. I am not confused by the number of layers being used or the number of parameters involved; I am confused by the COMPONENTS of the network.
Let's assume:
CL = convolution layer
SL = subsampling layer (pooling)
CM = convolution map
NN = neural network
Softmax = softmax classifier (similar to linear classifier)
Architecture 1
https://www.youtube.com/watch?v=n6hpQwq7Inw
CL,SL,CL,SL,CM,Softmax
Architecture 2 (Do we really need NN at the end again?)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5605630&tag=1
CL,SL,CL, SL, NN, Softmax
Architecture 3
My idea
CL, SL, CL, SL, Softmax
There's no single one-size-fits-all CNN architecture. CNNs are usually designed to efficiently capture features of the input data. It's assumed that these features are hierarchical, i.e. high-level features are made of low-level ones. A CNN is just a fancy feature extraction algorithm; you can put any classifier you want on top of it (NN, Softmax, whatever).
So convolutional layers are used to extract features from the input. Subsampling layers then downscale the feature maps in order to reduce computational complexity and make the representation more shift-invariant.
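As an illustration of the subsampling step, here is a minimal sketch of 2x2 max pooling with stride 2 on a single feature map. The helper `max_pool_2x2` and the input values are hypothetical; max pooling is only one possible subsampling choice (average pooling is another).

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]

# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The output has a quarter of the values, and moving a strong activation by one pixel within its 2x2 block leaves the result unchanged, which is where the (limited) shift invariance comes from.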
A convolution map layer isn't that different from a usual convolutional layer; I'm not sure it's common to make this distinction. Actually, if you want to deal with color information, the input to the first conv. layer would be not a single image but several (3, for example) images, each being a separate feature map.
Which classifier to use on top of the CNN is completely up to you. You can use Logistic Regression, SVM, NN or any other classification (or regression) algorithm.