Is weight initialization different for dense and convolutional layers? - machine-learning

In a dense layer, one should initialize the weights according to some rule of thumb. For example, with ReLU, the weights should come from a normal distribution and be scaled by sqrt(2/n) (i.e., variance 2/n), where n is the number of inputs to the layer (according to Andrew Ng).
Does the same hold for convolutional layers? What is the right way to initialize weights (and biases) in a convolutional layer?

A common initializer for sigmoid-based networks is the Xavier initializer (a.k.a. Glorot initializer), named after Xavier Glorot, one of the authors of the paper "Understanding the difficulty of training deep feedforward neural networks". The formula takes into account not only the number of incoming connections but also the number of outgoing ones. The authors show that with this initialization the distribution of activations stays approximately normal, which helps gradient flow in the backward pass.
For ReLU-based networks, a better initializer is the He initializer from "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" by Kaiming He et al., which proves the same properties for the ReLU activation.
Dense and convolutional layers aren't that different in this case, but it's important to remember that kernel weights are shared across the input image and the batch, so the number of incoming connections depends on several parameters, including kernel size and stride, and might not be easy to calculate by hand.
In TensorFlow, He initialization is implemented in the variance_scaling_initializer() function (which is, in fact, a more general initializer, but by default performs He initialization), while the Xavier initializer is, logically, xavier_initializer().
See also this discussion on CrossValidated.
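For a concrete picture, here is a minimal sketch using the Keras-style initializers bundled with current TensorFlow (the layer sizes are made up); the string names "he_normal" and "glorot_uniform" select He and Xavier initialization respectively, and the same initializers apply to both Dense and Conv2D layers:

```python
import tensorflow as tf

# Minimal sketch: He init for ReLU layers, Glorot/Xavier for sigmoid/tanh layers.
# For Conv2D, the framework derives fan_in from kernel size and input channels.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           kernel_initializer="he_normal",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="sigmoid",
                          kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(10),
])
```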

Related

Question about neural network training: the gradient of a module that is used multiple times in one iteration

When training a neural network, if the same module is used multiple times in one iteration, does the gradient of the module need special processing during backpropagation?
For example: one Deformable Compensation block is used three times in this model, which means the three instances share the same weights.
What will happen when I use loss.backward()?
Will loss.backward() work correctly?
The nice thing about autograd and backward passes is that the underlying framework is not "algorithmic", but rather a mathematical one: it implements the chain rule of derivatives. Therefore, there are no "algorithmic" considerations for "shared weights" or "weighting different layers"; it's pure math. The backward pass provides the derivative of your loss function w.r.t. the weights in a purely mathematical way.
Sharing weights can be done globally (e.g., when training Siamese networks), on a "layer level" (as in your example), but also within a layer. When you think about it, convolutional layers and recurrent layers are a fancy way of locally sharing weights.
Naturally, PyTorch (as well as every other DL framework) can trivially handle these cases.
As long as your "deformable compensation" layer is correctly implemented, PyTorch will take care of the gradients for you, in a mathematically correct manner, thanks to the chain rule.
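To make this concrete, here is a minimal PyTorch sketch (the module and shapes are made up, not your actual Deformable Compensation layer) showing that applying the same module twice in one forward pass simply accumulates its gradient contributions:

```python
import torch
import torch.nn as nn

# The same Linear module ("shared weights") is applied twice in one forward pass.
shared = nn.Linear(4, 4)

x = torch.randn(8, 4)
h = torch.relu(shared(x))   # first use of the shared module
y = shared(h)               # second use, same weights
loss = y.pow(2).mean()
loss.backward()

# shared.weight.grad now holds the sum of the gradients from both applications;
# no special processing is needed.
print(shared.weight.grad.shape)   # torch.Size([4, 4])
```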

How do Convolutional neural networks proceed after the pooling step?

I am trying to learn about convolutional neural networks, but I am having trouble understanding what happens in the network after the pooling step.
So starting from the left we have our 28x28 matrix representing our picture. We apply three 5x5 filters to it to get three 24x24 feature maps. We then apply 2x2 max pooling to each feature map to get three 12x12 pooled layers. I understand everything up to this step.
But what happens now? The document I am reading says:
"The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the max-pooled
layer to every one of the 10 output neurons. "
The text did not go further into describing what happens beyond that and it left me with a few questions.
How are the three pooled layers mapped to the 10 output neurons? By fully connected, does it mean each neuron in every one of the three 12x12 pooled layers has a weight connecting it to each output neuron? So there are 3x12x12x10 weights linking the pooled layer to the output layer? Is an activation function still applied at the output neurons?
Pictures and extract taken from this online resource: http://neuralnetworksanddeeplearning.com/chap6.html
Essentially, the fully connected layer provides the main way for the neural network to make a prediction. If you have ten classes, then the fully connected layer consists of ten neurons, each producing a score that can be read (after a softmax) as the probability that the classified sample belongs to that class (each neuron represents a class). These probabilities are determined by the hidden layers and the convolutions. The pooling layer's output is fed into these ten neurons, providing the final interface for your network to make the prediction. Here's an example. After pooling, your fully connected layer could output this:
(0.1)
(0.01)
(0.2)
(0.9)
(0.2)
(0.1)
(0.1)
(0.1)
(0.1)
(0.1)
Where each neuron contains the probability that the sample belongs to its class. In this case, if you are classifying images of handwritten digits, the prediction would be the fourth class, the one with probability 0.9. Hope that helps!
Yes, you're on the right track. There is a layer with a weight matrix of 4320 entries.
This matrix will typically be arranged as 432x10. This is because these 432 numbers are a fixed-size representation of the input image. At this point, you don't care about how you got it: CNN, plain feed-forward net, or a crazy RNN going pixel by pixel; you just want to turn the description into a classification. In most toolkits (e.g. TensorFlow, PyTorch or even plain numpy), you'll need to explicitly reshape the 3x12x12 output of the pooling into a 432-long vector. But that's just a rearrangement; the individual elements do not change.
Additionally, there will usually be a 10-long vector of biases, one for every output element.
Finally, about the nonlinearity: since this is about classification, you typically want the 10 output units to represent posterior probabilities that the input belongs to a particular class (digit). For this purpose, the softmax function is used: y = exp(o) / sum(exp(o)), where exp(o) stands for element-wise exponentiation. It guarantees that its output will be a proper categorical distribution: all elements in [0, 1] and summing up to 1. There is a nice and detailed discussion of softmax in neural networks in the Deep Learning book (I recommend reading Section 6.2.1 in addition to the softmax sub-subsection itself).
Also note that this is not specific to convolutional networks at all; you'll find this fully connected layer + softmax block at the end of virtually every classification network. You can also view this block as the actual classifier, while anything in front of it (a shallow CNN in your case) is just trying to prepare nice features.
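As an illustration, here is a minimal numpy sketch of exactly this block (the values are random stand-ins, not trained weights): flatten the 3x12x12 pooled maps into a 432-vector, apply the 432x10 weight matrix plus the 10 biases, then softmax:

```python
import numpy as np

pooled = np.random.rand(3, 12, 12)    # stand-in for the three 12x12 pooled maps
W = np.random.randn(432, 10) * 0.01   # 3*12*12*10 = 4320 weights
b = np.zeros(10)                      # one bias per output neuron

x = pooled.reshape(-1)                # just a rearrangement into 432 values
o = x @ W + b                         # raw scores for the 10 classes

# softmax: y = exp(o) / sum(exp(o)), shifted by max(o) for numerical stability
y = np.exp(o - o.max())
y /= y.sum()

print(y.sum())                        # 1.0: a proper categorical distribution
```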

Why do we have normally more than one fully connected layers in the late steps of the CNNs?

As I noticed, many popular convolutional neural network architectures (e.g. AlexNet) use more than one fully connected layer, with almost the same dimensions, to gather the responses to the features detected in the earlier layers.
Why don't we use just one FC layer for that? Why is this hierarchical arrangement of fully connected layers possibly more useful?
Because there are some functions, such as XOR, that can't be modeled by a single layer. In this type of architecture the convolutional layers compute local features and the fully connected output layer(s) then combine these local features to derive the final outputs. So, you can consider the fully connected layers as a semi-independent mapping of features to outputs, and if this is a complex mapping then you may need the expressive power of multiple layers.
Actually, it's no longer popular/normal. Networks from 2015 onward (such as ResNet and Inception-v4) use global average pooling (GAP) as the last layer plus softmax, which gives the same performance with a much smaller model. The last 2 fully connected layers in VGG16 account for about 80% of all the parameters in the network. But to answer your question: it's common to use a 2-layer MLP for classification and to consider the rest of the network to be feature generation. 1 layer would just be logistic regression, with a global minimum and simple properties; 2 layers add useful non-linearity that can still be trained with SGD.
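For comparison, here is a minimal PyTorch sketch (the feature-map shapes are hypothetical) of the two kinds of classification heads being discussed: the VGG/AlexNet-style stack of fully connected layers versus the ResNet-style GAP plus a single linear layer:

```python
import torch.nn as nn

num_classes = 10

# VGG/AlexNet-style head: most of the network's parameters live here.
fc_head = nn.Sequential(
    nn.Flatten(),                  # e.g. 512 x 7 x 7 feature maps -> 25088
    nn.Linear(512 * 7 * 7, 4096),
    nn.ReLU(),
    nn.Linear(4096, num_classes),
)

# ResNet/Inception-style head: global average pooling + one linear layer.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),       # 512 x 7 x 7 -> 512 x 1 x 1
    nn.Flatten(),
    nn.Linear(512, num_classes),   # far fewer parameters
)
```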

What bookkeeping is caffe doing?

Caffe tutorial states:
The net is a set of layers connected in a computation graph – a directed acyclic graph (DAG) to be exact. Caffe does all the bookkeeping for any DAG of layers to ensure correctness of the forward and backward passes.
What is meant by "all the bookkeeping"? I don't understand it.
How is all this bookkeeping done?
Caffe, like many other deep-learning frameworks, trains its models using stochastic gradient descent (SGD), implemented as gradient back-propagation. That is, for a mini-batch of training examples, Caffe feeds the batch through the net (the "forward pass") to compute the loss. Then it propagates the gradient of the loss with respect to the net's parameters back (the "backward pass") to update all the parameters according to the estimated gradient.
By "bookkeeping" the tutorial means, you do not need to worry about estimating the gradients and updating the parameters. Once you are using existing layers (e.g., "Convolution", "ReLU", "Sigmoid" etc.) you only need to define the graph structure (the net's architecture) and supply the training data, and caffe will take care of the rest of the training process: It will forward/backward each mini batch, compute the loss, estimate the gradients and update the parameters - all for you.
Pretty awesome, don't you think? ;)
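For intuition, here is roughly the loop that this bookkeeping amounts to, written as an explicit PyTorch sketch (the model, data and hyperparameters are made up); Caffe runs the equivalent steps for you once the DAG is defined:

```python
import torch

model = torch.nn.Linear(10, 2)                  # stand-in for the net's DAG
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

# one made-up mini-batch; Caffe would iterate over your training data
batches = [(torch.randn(4, 10), torch.randint(0, 2, (4,)))]

for inputs, labels in batches:
    optimizer.zero_grad()
    outputs = model(inputs)          # forward pass through the DAG
    loss = loss_fn(outputs, labels)  # compute the loss
    loss.backward()                  # backward pass: estimate gradients
    optimizer.step()                 # update the parameters
```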

Graphically, how does the non-linear activation function project the input onto the classification space?

I am finding it very hard to visualize how the activation function actually manages to classify non-linearly separable training data sets.
Why does the activation function (e.g. the tanh function) work for non-linear cases? What exactly happens mathematically when the activation function projects the input to the output? What separates training samples of different classes, and how would this look if one had to plot the process graphically?
I've looked at numerous sources, but I just cannot easily grasp what exactly makes the activation function work for classifying training samples in a neural network, and I would like to be able to picture this in my mind.
The mathematical result behind neural networks is the Universal Approximation Theorem. Basically, sigmoidal functions (those which saturate on both ends, like tanh) are smooth, almost piecewise-constant approximators. The more neurons you have, the better your approximation is.
This picture was taken from the article "A visual proof that neural nets can compute any function". Make sure to check that article; it has other examples and interactive applets.
NNs actually, at each level, create new features by distorting the input space. Non-linear functions allow you to change the "curvature" of the target function, so further layers have a chance to make it linearly separable. If there were no non-linear functions, any combination of linear functions would still be linear, so there would be no benefit from having multiple layers. As a graphical example, consider this animation. These pictures were taken from this article; also check out the cool visualization applet there.
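A tiny numpy sketch (random weights) of that last point: without a nonlinearity, two stacked linear layers collapse into one, while adding tanh breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((2, 5))
x = rng.standard_normal(3)

two_linear = W2 @ (W1 @ x)                    # two linear layers, no activation
one_linear = (W2 @ W1) @ x                    # a single merged linear layer
print(np.allclose(two_linear, one_linear))    # True: depth alone adds nothing

# With a nonlinearity in between, no single matrix reproduces the mapping,
# which is what lets deeper layers "bend" the feature space.
with_tanh = W2 @ np.tanh(W1 @ x)
```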
Activation functions have very little to do with classifying non-linearly separable sets of data.
Activation functions are used as a way to normalize signals at every step in your neural network. They typically have an infinite domain and a finite range. Tanh, for example, has a domain of (-∞,∞) and a range of (-1,1). The sigmoid function maps the same domain to (0,1).
You can think of this as a way of enforcing equality across all of your learned features at a given neural layer (a.k.a. feature scaling). Since the input domain is not known beforehand, it's not as simple as regular feature scaling (as in linear regression), and thus activation functions must be used. The effects of the activation function are compensated for when computing errors during back-propagation.
Back-propagation is a process that applies error to the neural network. You can think of this as a positive reward for the neurons that contributed to the correct classification and a negative reward for the neurons that contributed to an incorrect classification. This contribution is often known as the gradient of the neural network. The gradient is, effectively, a multi-variable derivative.
When back-propagating the error, each individual neuron's contribution to the gradient involves the activation function's derivative at that neuron's input value. Sigmoid is a particularly convenient function here because its derivative is extremely cheap to compute: specifically, s'(x) = s(x)(1 - s(x)).
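A quick numpy check of that derivative, comparing the formula s'(x) = s(x)(1 - s(x)) against a finite-difference estimate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(x) * (1.0 - sigmoid(x))              # s'(x) = s(x)(1 - s(x))
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))        # True
```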
Here is an example image (found by google image searching: neural network classification) that demonstrates how a neural network might be superimposed on top of your data set:
I hope that gives you a relatively clear idea of how neural networks might classify non-linearly separable datasets.
