Xavier and he_normal initialization difference - machine-learning

What is the difference between the He normal and Xavier normal initializers in Keras? Both seem to initialize weights based on the variance of the input data. Is there an intuitive explanation for the difference between the two?

See this discussion on Stats.SE:
In summary, the main difference for machine learning practitioners is the following:
He initialization works better for layers with ReLU activation.
Xavier initialization works better for layers with sigmoid activation.

Weight (kernel) Initialization parameters for each type of activation function:
Xavier/Glorot Initialization: None, hyperbolic tangent (tanh), logistic (sigmoid), softmax.
He Initialization: Rectified Linear Unit (ReLU) and variants.
LeCun Initialization: Scaled Exponential Linear Unit (SELU).
Application:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
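A slightly fuller sketch of the same pairing in Keras (the layer sizes below are arbitrary placeholders, not anything prescribed by the sources above):

from tensorflow import keras

# Each activation paired with its usual initializer; sizes are arbitrary.
model = keras.Sequential([
    keras.layers.Dense(64, activation="tanh",
                       kernel_initializer="glorot_normal",  # Xavier/Glorot
                       input_shape=(20,)),
    keras.layers.Dense(64, activation="relu",
                       kernel_initializer="he_normal"),     # He
    keras.layers.Dense(64, activation="selu",
                       kernel_initializer="lecun_normal"),  # LeCun
    keras.layers.Dense(10, activation="softmax",
                       kernel_initializer="glorot_normal"),
])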
Here's a link to the research paper by Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", in case you want to understand the importance and the math behind weight initialization.
http://proceedings.mlr.press/v9/glorot10a.html

Related

Is weight initialization different for dense and convolutional layers?

In a dense layer, one should initialize the weights according to some rule of thumb. For example, with ReLU, the weights should come from a normal distribution with variance 2/n, where n is the number of inputs to the layer (according to Andrew Ng).
Does the same hold for convolutional layers? What is the right way to initialize weights (and biases) in a convolutional layer?
A common initializer for sigmoid-based networks is the Xavier initializer (a.k.a. Glorot initializer), named after Xavier Glorot, one of the authors of the paper "Understanding the difficulty of training deep feedforward neural networks". The formula takes into account not only the number of incoming connections, but also the number of outgoing ones. The authors show that with this initialization the variance of the activations stays roughly constant from layer to layer, which helps gradient flow in the backward pass.
For ReLU-based networks, a better initializer is the He initializer from "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" by Kaiming He et al., which proves the analogous property for the ReLU activation.
Dense and convolutional layers aren't that different in this case, but it's important to remember that kernel weights are shared across the input image and the batch, so the number of incoming connections depends on several parameters, including kernel size and stride, and might not be easy to calculate by hand.
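A rough numpy sketch of that fan-in bookkeeping (the layer shapes below are made-up examples, not taken from the question):

import numpy as np

# Dense layer: fan_in is just the number of input units.
n_in, n_out = 512, 256
he_std     = np.sqrt(2.0 / n_in)            # He:     Var(W) = 2 / fan_in
glorot_std = np.sqrt(2.0 / (n_in + n_out))  # Glorot: Var(W) = 2 / (fan_in + fan_out)
W_dense = np.random.randn(n_in, n_out) * he_std

# Conv layer: each output unit sees kernel_h * kernel_w * in_channels inputs,
# because the kernel is shared across spatial positions (and the batch).
kh, kw, c_in, c_out = 3, 3, 64, 128
fan_in = kh * kw * c_in                     # 3 * 3 * 64 = 576
W_conv = np.random.randn(kh, kw, c_in, c_out) * np.sqrt(2.0 / fan_in)  # He init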
In TensorFlow, He initialization is implemented in the variance_scaling_initializer() function (which is, in fact, a more general initializer, but by default performs He initialization), while the Xavier initializer is, logically, xavier_initializer().
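In the newer tf.keras API (assuming TensorFlow 2.x rather than the older contrib functions named above), the two schemes look roughly like this:

import tensorflow as tf

# VarianceScaling with scale=2.0 and fan_in mode reproduces He initialization;
# GlorotUniform is the Xavier/Glorot scheme.
he_init     = tf.keras.initializers.VarianceScaling(scale=2.0, mode="fan_in",
                                                    distribution="truncated_normal")
glorot_init = tf.keras.initializers.GlorotUniform()

conv = tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu",
                              kernel_initializer=he_init)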
See also this discussion on CrossValidated.

Artificial Neural Network RELU Activation Function and Gradients

I have a question. I watched a really detailed tutorial on implementing an artificial neural network in C++, and now I have more than a basic understanding of how a neural network works and how to actually program and train one.
So in the tutorial a hyperbolic tangent was used for calculating outputs, and obviously its derivative for calculating gradients. However, I wanted to move on to a different function, specifically Leaky ReLU (to avoid dying neurons).
My question is this: it is said that this activation function should be used for the hidden layers only, and that for the output layer a different function should be used (either softmax or a linear regression function). In the tutorial the guy taught the neural network to be an XOR processor. So is this a classification problem or a regression problem?
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
So I implemented the Leaky ReLU function and its derivative, but I don't know whether I should use softmax or a regression function for the output layer.
Also, for recalculating the output gradients I use the Leaky ReLU's derivative (for now), but in this case should I use the softmax's/regression derivative as well?
Thanks in advance.
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
In short, classification is for a discrete target, regression for a continuous target. If it were a floating-point operation, you would have a regression problem. But here the result of XOR is 0 or 1, so it's binary classification (as Sid already suggested). You should use a softmax layer (or a sigmoid function, which works particularly well for 2 classes). Note that the output will be a vector of probabilities, i.e. real-valued, which is used to choose the discrete target class.
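For instance, a minimal Keras sketch of XOR treated as binary classification, with a Leaky ReLU hidden layer and a sigmoid output (the layer size and epoch count are arbitrary choices, not from the tutorial):

import numpy as np
from tensorflow import keras

# XOR truth table as a 2-class problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype="float32")
y = np.array([0, 1, 1, 0], dtype="float32")

model = keras.Sequential([
    keras.layers.Dense(8, input_shape=(2,)),
    keras.layers.LeakyReLU(alpha=0.1),            # Leaky ReLU hidden activation
    keras.layers.Dense(1, activation="sigmoid"),  # sigmoid output for 2 classes
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2000, verbose=0)
print(model.predict(X).round())                   # should approximate [0, 1, 1, 0]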
Also, for recalculating the output gradients I use the Leaky ReLU's derivative (for now), but in this case should I use the softmax's/regression derivative as well?
Correct. For the output layer you'll need a cross-entropy loss function, which corresponds to the softmax layer, and its derivative for the backward pass.
If there are hidden layers that still use Leaky ReLU, you'll also need Leaky ReLU's derivative for those particular layers.
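A small numpy sketch of that output-layer gradient, using the standard softmax-plus-cross-entropy simplification (the numbers are made up):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

z = np.array([1.2, -0.3, 0.5])       # output-layer pre-activations
y = np.array([0.0, 1.0, 0.0])        # one-hot target

p = softmax(z)
loss  = -np.sum(y * np.log(p))       # cross-entropy loss
dL_dz = p - y                        # combined softmax + cross-entropy gradient

def leaky_relu_grad(x, alpha=0.01):
    # derivative used for the hidden layers that keep Leaky ReLU
    return np.where(x > 0, 1.0, alpha)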
Highly recommend this post on backpropagation details.

When should I use linear neural networks and when non-linear?

I am using feed forward, gradient descent backpropagation neural networks.
Currently I have only worked with non-linear networks where tanh is the activation function.
I was wondering.
What kind of tasks would you give to a neural networks with non-linear activation function and what kind of tasks for linear?
I know that networks with a linear activation function are used to solve linear problems.
What are those linear problems?
Any examples?
Thanks!
I'd say never: since a composition of linear functions is still linear, using a neural network with linear activations is just a way to complicate linear regression.
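A quick numpy check of that point (the weight shapes are arbitrary): two stacked linear layers are equivalent to a single linear layer whose weight matrix is the product of the two.

import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(5,))      # an input vector
W1 = rng.normal(size=(4, 5))    # first "linear layer"
W2 = rng.normal(size=(3, 4))    # second "linear layer"

two_layers = W2 @ (W1 @ x)      # network with linear activations
one_layer  = (W2 @ W1) @ x      # single equivalent linear map
print(np.allclose(two_layers, one_layer))   # True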
Whether to choose a linear model or something more complicated is up to you and depends on the data you have; this is (one of the reasons) why it is customary to hold out some data during training and use it to validate the model. Other ways of testing models are residual analysis, hypothesis testing, and so on.

Ada-Delta method doesn't converge when used in Denoising AutoEncoder with MSE loss & ReLU activation?

I just implemented AdaDelta (http://arxiv.org/abs/1212.5701) for my own Deep Neural Network Library.
The paper kind of says that SGD with AdaDelta is not sensitive to hyperparameters and that it always converges to somewhere good (at least, the output reconstruction loss of AdaDelta-SGD is comparable to that of a well-tuned Momentum method).
When I used AdaDelta-SGD as the learning method in a Denoising AutoEncoder, it did converge in some specific settings, but not always.
When I used MSE as the loss function and Sigmoid as the activation function, it converged very quickly, and after 100 epochs the final reconstruction loss was better than with plain SGD, SGD with Momentum, and AdaGrad.
But when I used ReLU as the activation function, it didn't converge; it got stuck oscillating at a high (bad) reconstruction loss, just like when you use plain SGD with a very high learning rate.
The reconstruction loss it got stuck at was about 10 to 20 times higher than the final reconstruction loss obtained with the Momentum method.
I really don't understand why this happened, since the paper says AdaDelta is just good.
Please let me know the reason behind this phenomenon and how I could avoid it.
The activation of a ReLU is unbounded, making its use in autoencoders difficult, since your training vectors likely do not have arbitrarily large and unbounded responses! ReLU simply isn't a good fit for that type of network.
You can force a ReLU into an autoencoder by applying some transformation to the output layer, as is done here. However, they don't discuss the quality of the results as an autoencoder, but only as a pre-training method for classification. So it's not clear that it's a worthwhile endeavor for building an autoencoder either.
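One common workaround, sketched below under the assumption that the inputs are scaled to [0, 1]: keep ReLU in the hidden layers but put a bounded activation (here a sigmoid) on the reconstruction layer, so the outputs match the range of the training vectors.

from tensorflow import keras

input_dim = 784   # e.g. flattened 28x28 images scaled to [0, 1] (an assumption)

autoencoder = keras.Sequential([
    keras.layers.GaussianNoise(0.2, input_shape=(input_dim,)),  # denoising corruption
    keras.layers.Dense(128, activation="relu"),                 # ReLU encoder
    keras.layers.Dense(64,  activation="relu"),                 # bottleneck
    keras.layers.Dense(128, activation="relu"),                 # ReLU decoder
    keras.layers.Dense(input_dim, activation="sigmoid"),        # bounded reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")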

Why isn't the 0-1 loss function used in the perceptron or SVM?

Why isn't the 0-1 loss function (being the most obvious and informative from the standpoint of conceptual binary classification models) used in the perceptron or Support Vector Machine (SVM) algorithms?
In the case of perceptrons, most of the time they are trained using gradient descent (or something similar), and the 0-1 loss function is flat (zero gradient almost everywhere), so it doesn't converge well (not to mention that it's not differentiable at 0).
SVM is based on solving an optimization problem that maximizes the margin between classes, so in this context a convex loss function is preferable, allowing the use of general convex optimization methods. The 0-1 loss function is not convex, so it is not very useful there either. Note that this reflects the current state of the art; if a new method that optimizes non-convex functions efficiently were discovered, that would change.
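A quick numpy illustration of the difference (labels in {-1, +1}, scores made up): the 0-1 loss is piecewise constant, so its gradient with respect to the score is zero almost everywhere, while the hinge loss used by SVMs is a convex surrogate with a usable (sub)gradient.

import numpy as np

y      = np.array([+1, -1, +1, -1])          # true labels in {-1, +1}
scores = np.array([0.8, 0.3, -0.2, -1.5])    # model outputs
margin = y * scores

zero_one_loss = (margin <= 0).astype(float)      # 1 if misclassified, else 0
hinge_loss    = np.maximum(0.0, 1.0 - margin)    # convex surrogate used by SVMs

# (Sub)gradient of the hinge loss w.r.t. the score: -y where the margin is violated.
hinge_grad = np.where(margin < 1, -y, 0.0)
# The 0-1 loss, by contrast, has zero gradient wherever it is differentiable.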
