I am trying to learn about convolutional neural networks, but i am having trouble understanding what happens to neural networks after the pooling step.
So starting from the left we have our 28x28 matrix representing our picture. We apply a three 5x5 filters to it to get three 24x24 feature maps. We then apply max pooling to each 2x2 square feature map to get three 12x12 pooled layers. I understand everything up to this step.
But what happens now? The document I am reading says:
"The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the max-pooled
layer to every one of the 10 output neurons. "
The text did not go further into describing what happens beyond that and it left me with a few questions.
How are the three pooled layers mapped to the 10 output neurons? By fully connected, does it mean each neuron in every one of the three layers of the 12x12 pooled layers has a weight connecting it to the output layer? So there are 3x12x12x10 weights linking from the pooled layer to the output layer? Is an activation function still taken at the output neuron?
Pictures and extract taken from this online resource: http://neuralnetworksanddeeplearning.com/chap6.html
Essentially, the fully connected layer provides the main way for the neural network to make a prediction. If you have ten classes, then a fully connected layer consists of ten neurons, each with a different probability as to the likelihood of the classified sample belonging to that class (each neuron represents a class). These probabilities are determined by the hidden layers and convolution. The pooling layer is simply outputted into these ten neurons, providing the final interface for your network to make the prediction. Here's an example. After pooling, your fully connected layer could display this:
(0.1)
(0.01)
(0.2)
(0.9)
(0.2)
(0.1)
(0.1)
(0.1)
(0.1)
(0.1)
Where each neuron contains a probability that the sample belongs to that class. In this, case, if you are classifying images of handwritten digits and each neuron corresponds to a prediction that the image is 1-10, then the prediction would be 4. Hope that helps!
Yes, you're on the right track. There is a layer with a weight matrix of 4320 entries.
This matrix will be typically arranged as 432x10. This is because these 432 number are a fixed-size representation of the input image. At this point, you don't care about how you got it -- CNN, plain feed-forward or a crazy RNN going pixel-by-pixel, you just want to turn the description into classifaction. In most toolkits (e.g. TensorFlow, PyTorch or even plain numpy), you'll need to explicitly reshape the 3x12x12 output of the pooling into a 432-long vector. But that's just a rearrangement, the individual elements do not change.
Additionally, there will usually be a 10-long vector of biases, one for every output element.
Finally about the nonlinearity: Since this is about classification, you typically want the output 10 units to represent posterior probabilities that the input belongs to a particular class (digit). For this purpose, the softmax function is used: y = exp(o) / sum(exp(o)), where exp(o) stands for an element-wise exponentiation. It guarantees that it's output will be a proper categorical distribution, all elements in <0; 1> and summing up to 1. There is a nice a detailed discussion of softmax in neural networks in the Deep Learning book (I recommend reading Section 6.2.1 in addition to the softmax Sub-subsection itself.)
Also note that this is not specific to convolutional networks at all, you'll find this block fully connected layer -- softmax at the end of virtually every classification network. You can also view this block as the actual classifier, while anything in front of it (a shallow CNN in your case) is just trying to prepare nice features.
Related
Caffe supports multiple losses. Then for the backpropagation stage, some blobs may have multiple gradients coming from different losses. How does Caffe do with the gradients of this blob?
As far as I know, this may not be a concern when designing networks. But this question really confuse me when I try to write a new layer. Thanks for any idea!
This is not an issue of caffe or any other deep-learning tool. This is purely a mathematical question: When you have several losses, you have loss_weight assigned to each loss and the overall loss of the net is the weighted sum of all losses. Consequently, the gradients computed for the net are gradients of the weighted sum of the losses: there is no gradient-per-loss that needs to be integrated, but rather a single loss which is a weighted sum of loss layers.
Caffe usually uses "Split" layer when directing the "top" of a layer into several layers (in your example the output of "conv2" is "Split" to a "bottom" of "auxiliary loss" and "ip1").
Looking at the code of backward propagation of "Split" layer you can see that the all top.diffs are summed into bottom.diff.
I searched for the reason a lot but I didn't get it clear, May someone explain it in some more detail please?
In theory you do not have to attach a fully connected layer, you could have a full stack of convolutions till the very end, as long as (due to custom sizes/paddings) you end up with the correct number of output neurons (usually number of classes).
So why people usually do not do that? If one goes through the math, it will become visible that each output neuron (thus - prediction wrt. to some class) depends only on the subset of the input dimensions (pixels). This would be something among the lines of a model, which only decides whether an image is an element of class 1 depending on first few "columns" (or, depending on the architecture, rows, or some patch of the image), then whether this is class 2 on a few next columns (maybe overlapping), ..., and finally some class K depending on a few last columns. Usually data does not have this characteristic, you cannot classify image of the cat based on a few first columns and ignoring the rest.
However, if you introduce fully connected layer, you provide your model with ability to mix signals, since every single neuron has a connection to every single one in the next layer, now there is a flow of information between each input dimension (pixel location) and each output class, thus the decision is based truly on the whole image.
So intuitively you can think about these operations in terms of information flow. Convolutions are local operations, pooling are local operations. Fully connected layers are global (they can introduce any kind of dependence). This is also why convolutions work so well in domains like image analysis - due to their local nature they are much easier to train, even though mathematically they are just a subset of what fully connected layers can represent.
note
I am considering here typical use of CNNs, where kernels are small. In general one can even think of MLP as a CNN, where the kernel is of the size of the whole input with specific spacing/padding. However these are just corner cases, which are not really encountered in practise, and not really affecting the reasoning, since then they end up being MLPs. The whole point here is simple - to introduce global relations, if one can do it by using CNNs in a specific manner - then MLPs are not needed. MLPs are just one way of introducing this dependence.
Every fully connected (FC) layer has an equivalent convolutional layer (but not vice versa). Hence it is not necessary to add FC layers. They can always be replaced by convolutional layers (+ reshaping). See details.
Why do we use FC layers then?
Because (1) we are used to it (2) it is simpler. (1) is probably the reason for (2). For example, you would need to adjust the loss fuctions / the shape of the labels / add a reshape add the end if you used a convolutional layer instead of a FC layer.
I found this answer by Anil-Sharma on Quora helpful.
We can divide the whole network (for classification) into two parts:
Feature extraction:
In the conventional classification algorithms, like SVMs, we used to extract features from the data to make the classification work. The convolutional layers are serving the same purpose of feature extraction. CNNs capture better representation of data and hence we don’t need to do feature engineering.
Classification:
After feature extraction we need to classify the data into various classes, this can be done using a fully connected (FC) neural network. In place of fully connected layers, we can also use a conventional classifier like SVM. But we generally end up adding FC layers to make the model end-to-end trainable.
The CNN gives you a representation of the input image. To learn the sample classes, you should use a classifier (such as logistic regression, SVM, etc.) that learns the relationship between the learned features and the sample classes. Fully-connected layer is also a linear classifier such as logistic regression which is used for this reason.
Convolution and pooling layers extract features from image. So this layer doing some "preprocessing" of data. Fully connected layrs perform classification based on this extracted features.
I just started to learn about neural networks and so far my knowledge of machine learning is simply linear and logistic regression. from my understanding of the latter algorithms, is that given multiples of inputs the job of the learning algorithm is to come up with appropriate weights for each input so that eventually I have a polynomial that either describes the data which is the case of linear regression or separates it as in the case of logistic regression.
if I was to represent the same mechanism in neural network, according to my understanding, it would look something like this,
multiple nodes at the input layer and a single node in the output layer. where I can back propagate the error proportionally to each input. so that also eventually I arrive to a polynomial X1W1 + X2W2+....XnWn that describes the data. to me having multiple nodes per layer, aside from the input layer, seems to make the learning process parallel, so that I can arrive to the result faster. it's almost like running multiple learning algorithms each with different starting points to see which one converges faster. and as for the multiple layers I'm at a lose of what mechanism and advantage does it have on the learning outcome.
why do we have multiple layers and multiple nodes per layer in a neural network?
We need at least one hidden layer with a non-linear activation to be able to learn non-linear functions. Usually, one thinks of each layer as an abstraction level. For computer vision, the input layer contains the image and the output layer contains one node for each class. The first hidden layer detects edges, the second hidden layer might detect circles / rectangles, then there come more complex patterns.
There is a theoretical result which says that an MLP with only one hidden layer can fit every function of interest up to an arbitrary low error margin if this hidden layer has enough neurons. However, the number of parameters might be MUCH larger than if you add more layers.
Basically, by adding more hidden layers / more neurons per layer you add more parameters to the model. Hence you allow the model to fit more complex functions. However, up to my knowledge there is no quantitative understanding what adding a single further layer / node exactly makes.
It seems to me that you might want a general introduction into neural networks. I recommend chapter 4.3 and 4.4 of [Tho14a] (my bachelors thesis) as well as [LBH15].
[Tho14a]
M. Thoma, “On-line recognition of handwritten mathematical symbols,”
Karlsruhe, Germany, Nov. 2014. [Online]. Available: https://arxiv.org/abs/1511.09030
[LBH15]
Y. LeCun,
vol. 521,
Y. Bengio,
no. 7553,
and G. Hinton,
pp. 436–444,
“Deep learning,”
Nature,
May 2015. [Online]. Available:
http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html
I am just starting to grasp the idea of Backpropagation and MLP Networks. What I have confusion about is that how is the input vectors "clamped" in the input layer ?
For example lets take a mock IRIS dataset-:
[5.0,4.4,2.7,1.5,0],
[3.0,3.6,1.8,1.7,1],
[2.0,1.2,3.3.4.2,2]
Are these inputs feed in all together in the input layer ? Or are they fed in one by one.
What I mean is that on the first iteration is the first input vector fed like-:
[5.0,4.4,2.7,1.5,1]
and then the error is calculated and then the next input vector is sent ie.
[3.0,3.6,1.8,1.7,2]
OR are they all sent in together as -:
[[A vector of all petal lengths],[A vector of all sepal lengths],etc]
I know different frameworks handle these differently but feel free to comment on how any popular deeplearning framework would do this. I use DeepLearning4J myself.
Thanks.
Input vectors are usually fed into neural networks in batches. How many vectors those batches contain is dependent on the batch size. E.g. a batch size of 128 would mean that you feed 128 input vectors into the network (or less if there aren't that many left) and then update the weights/parameters. The iris tutorial of Deeplearning4J seems to use a batch size of 150: int batchSize = 150; and later DataSetIterator iter = new IrisDataSetIterator(batchSize, numSamples);.
Note that there's also a batch mode for updating the weights of neural networks, which - confusingly - updates the weights only after all input vectors have been fed into the network. The batch mode however is basically never used.
Given a target function f: R^4 -> R^2, can you draw me(give me an example) an Artificial Neural Network , lets say with two layers, and 3 nodes in the hidden layer.
Now, I think I understand how an ANN works when a function is like [0,1]^5 ->[0,1], but I am not quite sure how to do an example from R4 to R2.
I am new to machine learning, and it's a little bit diffult to catch up with all this concepts.
Thanks in advance.
First, you need two neurons in the output layer. Each neuron would correspond to one dimension of your output space.
Neurons in the output layer don't need an activation function that limits their values in the [0,1] interval (e.g. the logistic function). And even if you scale your output space in the interval [0,1], don't use a sigmoid function for activation.
Although your original data is not in [0,1]^4, you should do some preprocessing to scale and shift them to have mean zero and variance 1. You must apply same preprocessing to all your examples (training and test).
This should give you something to build up on.