I'm trying to create a new CNN model.
First I pass the RGB images (size 224x224) through a ResNet50 network. The output of the ResNet50 is (None, 7, 7, 2048). I now have 2 different ways to proceed to reduce this to a (None, 512) vector.
Way 1: Insert an FC layer (Dense layer) with 512 neurons, followed by a global average pooling layer.
Way 2: Apply a global average pooling layer first, and only then the FC layer with 512 neurons.
Are ways 1 and 2 the same? If not, what is the difference?
I found a similar question, How fully connected layer after global average pooling works in Resnet50?, but it doesn't explain the difference made by doing the global pooling first.
I believe placing global average pooling after the FC layer doesn't make sense.
The purpose of global average pooling is to (partly) replace the FC layer for the task of dimensionality reduction after the CNN layers while using fewer parameters (thus making overfitting less probable). Some people still place a small FC layer after the global average pooling.
A nice explanation can be found here: https://www.quora.com/Why-was-global-average-pooling-used-instead-of-a-fully-connected-layer-in-GoogLeNet-and-how-was-it-different
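To make the difference concrete, here is a minimal Keras sketch of both orderings (my assumptions: TensorFlow 2.x, where a Dense layer applied to a 4D tensor acts on the last axis only). One observation: if the Dense layer had a linear activation, the two ways would give identical results, since both the Dense layer and the averaging are linear maps; with a nonlinearity in between, they differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.ResNet50(include_top=False,
                                      input_shape=(224, 224, 3))

# Way 1: Dense first (applied independently at each of the 7x7 positions),
# then average over the spatial grid.
x1 = layers.Dense(512, activation='relu')(base.output)  # (None, 7, 7, 512)
x1 = layers.GlobalAveragePooling2D()(x1)                # (None, 512)
way1 = tf.keras.Model(base.input, x1)

# Way 2: average over the spatial grid first, then Dense.
x2 = layers.GlobalAveragePooling2D()(base.output)       # (None, 2048)
x2 = layers.Dense(512, activation='relu')(x2)           # (None, 512)
way2 = tf.keras.Model(base.input, x2)
```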
I am trying to learn about convolutional neural networks, but I am having trouble understanding what happens in the network after the pooling step.
So starting from the left we have our 28x28 matrix representing our picture. We apply three 5x5 filters to it to get three 24x24 feature maps. We then apply 2x2 max pooling to each feature map to get three 12x12 pooled layers. I understand everything up to this step.
But what happens now? The document I am reading says:
"The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the max-pooled
layer to every one of the 10 output neurons. "
The text did not go further into describing what happens beyond that and it left me with a few questions.
How are the three pooled layers mapped to the 10 output neurons? By fully connected, does it mean that each neuron in every one of the three 12x12 pooled layers has a weight connecting it to each of the output neurons? So there are 3x12x12x10 weights linking the pooled layer to the output layer? Is an activation function still applied at the output neurons?
Pictures and extract taken from this online resource: http://neuralnetworksanddeeplearning.com/chap6.html
Essentially, the fully connected layer provides the main way for the neural network to make a prediction. If you have ten classes, then the fully connected layer consists of ten neurons, each representing one class and outputting the probability that the classified sample belongs to it. These probabilities are determined by the hidden layers and the convolutions. The output of the pooling layer is simply fed into these ten neurons, providing the final interface for your network to make the prediction. Here's an example. After pooling, your fully connected layer could output this:
(0.1, 0.01, 0.2, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1)
where each neuron contains a probability that the sample belongs to that class. In this case, if you are classifying images of handwritten digits and each neuron corresponds to a prediction that the image is 1-10, then the prediction would be 4. Hope that helps!
Yes, you're on the right track. There is a layer with a weight matrix of 4320 entries.
This matrix will typically be arranged as 432x10, because these 432 numbers are a fixed-size representation of the input image. At this point, you don't care about how you got it -- CNN, plain feed-forward, or a crazy RNN going pixel-by-pixel -- you just want to turn the description into a classification. In most toolkits (e.g. TensorFlow, PyTorch, or even plain numpy), you'll need to explicitly reshape the 3x12x12 output of the pooling into a 432-long vector. But that's just a rearrangement; the individual elements do not change.
Additionally, there will usually be a 10-long vector of biases, one for every output element.
Finally, about the nonlinearity: since this is about classification, you typically want the 10 output units to represent posterior probabilities that the input belongs to a particular class (digit). For this purpose, the softmax function is used: y = exp(o) / sum(exp(o)), where exp(o) stands for element-wise exponentiation. It guarantees that the output will be a proper categorical distribution, with all elements in [0, 1] and summing up to 1. There is a nice and detailed discussion of softmax in neural networks in the Deep Learning book (I recommend reading Section 6.2.1 in addition to the softmax sub-subsection itself).
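A minimal plain-numpy sketch of this flatten -> fully-connected -> softmax step, with random weights, just to make the shapes from the example concrete:

```python
import numpy as np

pooled = np.random.rand(3, 12, 12)      # the three 12x12 max-pooled maps
x = pooled.reshape(-1)                  # rearranged into a 432-long vector

W = np.random.randn(432, 10) * 0.01     # the 432x10 weight matrix (4320 entries)
b = np.zeros(10)                        # one bias per output neuron

o = x @ W + b                           # raw scores for the 10 classes
y = np.exp(o - o.max())                 # subtracting the max improves stability
y /= y.sum()                            # softmax: all in [0, 1], summing to 1
```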
Also note that this is not specific to convolutional networks at all; you'll find this fully-connected-layer + softmax block at the end of virtually every classification network. You can also view this block as the actual classifier, while everything in front of it (a shallow CNN in your case) is just trying to prepare nice features.
Caffe supports multiple losses. During the backpropagation stage, some blobs may then receive multiple gradients coming from different losses. How does Caffe handle the gradients of such a blob?
As far as I know, this may not be a concern when designing networks. But this question really confuses me when I try to write a new layer. Thanks for any ideas!
This is not an issue of Caffe or any other deep-learning tool. It is purely a mathematical question: when you have several losses, each loss has a loss_weight assigned to it, and the overall loss of the net is the weighted sum of all losses. Consequently, the gradients computed for the net are gradients of the weighted sum of the losses: there is no per-loss gradient that needs to be integrated, but rather a single loss which is a weighted sum of the loss layers.
Caffe usually uses a "Split" layer when directing the "top" of a layer into several layers (in your example, the output of "conv2" is "Split" into a "bottom" of "auxiliary loss" and "ip1").
Looking at the code of the backward propagation of the "Split" layer, you can see that all the top.diffs are summed into bottom.diff.
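To see the same behaviour outside Caffe, here is a minimal sketch (PyTorch, as an illustration; the variable names are mine) where a shared tensor receives the gradient of the weighted sum of two losses:

```python
import torch

x = torch.randn(4, 8, requires_grad=True)  # a "blob" feeding two losses
loss_main = (x ** 2).mean()                # e.g. the main loss
loss_aux = x.abs().mean()                  # e.g. an auxiliary loss

w_main, w_aux = 1.0, 0.3                   # the loss_weight values
total = w_main * loss_main + w_aux * loss_aux
total.backward()                           # x.grad now holds the single, weighted-sum gradient
```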
Newcomer to machine learning and Stack Overflow.
Recently, I have been trying to create a machine learning algorithm that estimates the direction of a light source based on the reflection of an object.
I know this may be a complicated subject, and that's why, as a first step, I tried to simplify it as much as possible.
I first changed my problem from a regression problem to a classification problem by only taking as output: light source is on the left side of the object, or light source is on the right side of the object.
I am also only making one angle vary for my dataset.
Short version of my question:
Do you think that it is possible to do such a thing with machine learning? (my experience is too limited to really be sure)
If yes, which neural network would be best suited in your opinion? CNN? R-CNN? LSTM? SVM?
What would be the pipeline to complete this task?
I am currently using Unity Engine with a directional light that takes a random X angle in [10,60] or [120,170] and a sphere with metallic reflection to create and label a dataset. Here is an example:
https://imgur.com/a/FxNew Label : 0 (Left side)
https://imgur.com/a/9KFhi Label : 1 (Right side)
For the pre-processing:
Images are resized to 64x64
Transformed from RGB to grayscale format.
For the machine learning, I'm currently using TensorFlow and a convolutional neural network with:
10000 balanced, labeled examples of 64x64 grayscale pictures as input, with 0/1 as label
3 convolutional layers with [16,32,64] filters of size [5,5], ReLU
3 pooling layers with size [2,2] and stride [2,2]
1 dense layer with 1024 hidden neurons and dropout (rate = 0.4), ReLU
1 dense layer with 2 output neurons (1 for each class), softmax
As for the issue: my network is simply not learning. The loss hardly goes down, and the accuracy shows that the results are random, whatever the data, the number of layers, the optimizer, the learning rate, ... My output just averages between the two classes: [0.5, 0.5].
My guess is that the problem is more complicated than I first thought, that my data doesn't give a good hint of what my prediction should be, and that I should rather train a network that detects the reflection dot on the object and then use the orientation between the center of the object and the dot. Am I right?
Another guess is that the convolutional layers don't take position into account, so for the convolution part all the images are the same, since the sphere is always the same, as is the lighting pattern. They will always detect the same thing and won't take into account that the light region has moved. Do you have any advice on which network I could use to resolve this issue?
I'm really looking for advice and warnings on how to tackle this kind of task.
Please remember that I am still pretty new to machine learning and still learning more than my machines hehe...
Thank you.
Do you think that it is possible to do such a thing with machine learning?
Absolutely. And you've correctly chosen a CNN model - it's the model best suited for this task.
My guess is that the problem is more complicated than I first thought, that my data doesn't give a good hint of what my prediction should be, and that I should rather train a network that detects the reflection dot on the object and then use the orientation between the center of the object and the dot. Am I right?
No, CNNs have proven to classify pretty well from raw pixels. The network should figure out by itself what to pay attention to.
Do you have any advice on which network I could use to resolve this issue?
It would be great if you provided your full code. There are so many possible reasons for not learning: image pre-processing bugs, data mislabeling, poor choice of hyperparameters (learning rate, initialization, ...), wrong loss function, etc. There can also simply be bugs.
What I suggest right away, based on the described CNN architecture (a sketch of these tweaks follows the list):
The 5x5 filter size is probably too large, since you don't have that many filters. Try 3x3 and increase the number of filters a bit, e.g. 32 - 64 - 64.
I assume that you use CONV - POOL - CONV - POOL - CONV - POOL, not CONV - CONV - CONV - POOL - POOL - POOL. Just to make sure.
You probably don't need so many neurons in your FC layer. You have just two classes and pretty similar images! Reduce 1024 to, say, 256.
You don't experience any overfitting at the moment, so disable the dropout for now: keep_probability=1.0.
Pay attention to initialization and learning rate. Try different values in log-scale, e.g. learning_rate = 0.1, 0.01, 0.001 and check if learning pattern ever changes.
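As a concrete reference, here is a minimal Keras sketch of the suggested tweaks (the exact layer arrangement is my assumption; the question used TensorFlow directly, but the shapes carry over):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)),
    layers.MaxPooling2D((2, 2), strides=(2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2), strides=(2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2), strides=(2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),   # reduced from 1024, dropout disabled for now
    layers.Dense(2, activation='softmax'),  # one output neuron per class
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # try 0.1, 0.01, 0.001
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```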
Thank you to @Maxim for his answer. It was very helpful and helped me solve my problem as well as refine my network.
He pointed me to the problem: data mislabeling.
I was pretty sure about my data labeling but verified anyway.
The problem was there...
I'll write the answer here so it can maybe help other unaware TensorFlow users:
When you use tf.train.string_input_producer without specifying the argument, the default is shuffle=True, which shuffles your filename queue.
Since I use a .csv file for the labels and a folder of .png files for the images, the labels were read in order from 1 to 10000, whereas the .png files were read randomly.
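A minimal sketch of the fix (TF 1.x queue API; the file pattern is a made-up example):

```python
import tensorflow as tf

filenames = tf.train.match_filenames_once("images/*.png")
# shuffle defaults to True and silently reorders the queue;
# disable it so images stay aligned with the CSV labels:
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
```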
I feel very dumb about it, but that's how you learn hehe.
I understand that batch normalisation helps in faster training by turning the activations towards a unit Gaussian distribution, thus tackling the vanishing gradient problem. Batch norm is applied differently at training time (use the mean/var from each batch) and at test time (use the finalized running mean/var from the training phase).
Instance normalisation, on the other hand, acts as contrast normalisation, as mentioned in this paper https://arxiv.org/abs/1607.08022 . The authors mention that the output stylised images should not depend on the contrast of the input content image, and hence instance normalisation helps.
But then should we not also use instance normalisation for image classification, where the class label should not depend on the contrast of the input image? I have not seen any paper using instance normalisation in place of batch normalisation for classification. What is the reason for that? Also, can and should batch and instance normalisation be used together? I am eager to get an intuitive as well as theoretical understanding of when to use which normalisation.
Definition
Let's begin with the strict definition of both:
Batch normalization: for an input x of shape T x C x H x W (batch, channels, height, width), each channel is normalized jointly over the batch and spatial locations:
y_tijk = (x_tijk - mu_k) / sqrt(sigma_k^2 + eps), where mu_k and sigma_k^2 are computed over all t, i, j for channel k.
Instance normalization: each instance and channel is normalized over the spatial locations only:
y_tijk = (x_tijk - mu_tk) / sqrt(sigma_tk^2 + eps), where mu_tk and sigma_tk^2 are computed over i, j for each pair (t, k).
As you can notice, they are doing the same thing, except for the number of input tensors that are normalized jointly. The batch version normalizes all images across the batch and spatial locations (in the CNN case; in the ordinary fully-connected case it normalizes across the batch only); the instance version normalizes each element of the batch independently, i.e., across spatial locations only.
In other words, where batch norm computes one mean and std dev (thus making the distribution of the whole layer Gaussian), instance norm computes T of them, making each individual image distribution look Gaussian, but not jointly.
A simple analogy: during the data pre-processing step, it's possible to normalize the data on a per-image basis, or to normalize the whole data set.
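A minimal numpy sketch of the two statistics, assuming an NCHW-layout tensor of shape T x C x H x W:

```python
import numpy as np

x = np.random.randn(8, 16, 32, 32)   # (T, C, H, W): batch of 8, 16 channels
eps = 1e-5

# Batch norm: one mean/var per channel, over the batch and spatial dims.
mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, 16, 1, 1)
var_bn = x.var(axis=(0, 2, 3), keepdims=True)
y_bn = (x - mu_bn) / np.sqrt(var_bn + eps)

# Instance norm: one mean/var per (instance, channel), spatial dims only.
mu_in = x.mean(axis=(2, 3), keepdims=True)      # shape (8, 16, 1, 1): T of them per channel
var_in = x.var(axis=(2, 3), keepdims=True)
y_in = (x - mu_in) / np.sqrt(var_in + eps)
```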
Credit: the formulas are from here.
Which normalization is better?
The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them to the FC layer, which shares weights across the batch (the modern way is to use a CONV layer instead of the FC one, but the argument still applies).
This is where the distribution nuances start to matter: the same neuron is going to receive the input from all images. If the variance across the batch is high, the gradient from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. That's why it's fairly possible that per-instance normalization won't improve network convergence at all.
On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on the neighboring instances. As it turns out, this kind of noise may be either good or bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans et al., which names recurrent neural networks and reinforcement learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think the same noise-sensitivity was the main issue in the stylization task that instance norm tried to fight. It would be interesting to check if weight norm performs better for this particular task.
Can you combine batch and instance normalization?
Though it would make a valid neural network, there's no practical use for it. Batch normalization noise is either helping the learning process (in which case it's preferable) or hurting it (in which case it's better to omit it). In both cases, leaving the network with one type of normalization is likely to improve performance.
Great question, and it has already been answered nicely. Just to add: I found this visualisation from Kaiming He's Group Norm paper helpful.
Source: link to article on Medium contrasting the Norms
I wanted to add more information to this question since there are some more recent works in this area. Your intuition
use instance normalisation for image classification where the class label
should not depend on the contrast of the input image
is partly correct. I would say that a pig in broad daylight is still a pig when the image is taken at night or at dawn. However, this does not mean using instance normalization across the network will give you better result. Here are some reasons:
Color distribution still plays a role. It is more likely to be an apple than an orange if it has a lot of red.
At later layers, you can no longer imagine instance normalization acting as contrast normalization. Class-specific details will emerge in deeper layers, and normalizing them per instance will hurt the model's performance greatly.
IBN-Net uses both batch normalization and instance normalization in its model. It only puts instance normalization in the early layers and has achieved improvements in both accuracy and the ability to generalize. The code is open-sourced here.
IN provides visual and appearance invariance, while BN accelerates training and preserves discriminative features.
IN is preferred in shallow layers (the first layers of the CNN) to remove appearance variation, while BN is preferred in deep layers (the last CNN layers) to maintain discrimination.
After reading up on the subject I don't fully understand: Is the 'convolution' in neural networks comparable to a simple downsampling or 'sharpening' function?
Can you break this term down into a simple, understandable image/analogy?
Edit (rephrasing after the 1st answer): Can pooling be understood as downsampling of weight matrices?
Convolutional neural networks are a family of models which have been shown empirically to work great when it comes to image recognition. From this point of view, a CNN is something completely different from downsampling.
But in the framework used in CNN design there is something comparable to a downsampling technique. To fully understand that, you have to understand how a CNN usually works. It is built from a hierarchical number of layers, and at every layer you have a set of trainable kernels whose output has a spatial size very similar to that of the input images.
This might be a serious problem: the output from such a layer might be extremely huge (~ nr_of_kernels * size_of_kernel_output), which could make your computations intractable. This is the reason why certain techniques are used in order to decrease the size of the output:
Stride, pad and kernel size manipulation: by setting these values appropriately, you can decrease the size of the output (on the other hand, you may lose some important information).
Pooling operation: pooling is an operation in which, instead of passing as output all the outputs from all kernels, you pass only specific aggregated statistics about them. It is considered extremely useful and is widely used in CNN design (see the sketch below).
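A minimal numpy sketch of the most common such aggregation, 2x2 max pooling with stride 2 (it halves each spatial dimension):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Downsample an (H, W) feature map by taking the max of each 2x2 block."""
    h, w = fmap.shape
    trimmed = fmap[:h - h % 2, :w - w % 2]           # drop odd borders, if any
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5.  7.] [13. 15.]]
```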
For a detailed description you might visit this tutorial.
Edit: Yes, pooling is a kind of downsampling 😊