Use of tanh activation function in input gate of LSTM - machine-learning

While studying LSTMs, I learned that the input gate uses two different activation functions: sigmoid and tanh. I understand the purpose of the sigmoid but not the tanh. A Stack Overflow answer about the use of tanh says that we want its second derivative to sustain for a long time before going to zero, and I don't get why he is talking about the second derivative. He also seems to say (in the 2nd paragraph) that tanh eliminates the vanishing gradient, but all the articles I have read say that Leaky ReLU is what helps eliminate it. So I want to understand the role of tanh in an LSTM. This is not a duplicate question; I just want to understand the previously answered one. Thank you! 🙌

Related

Sigmoid activation for multi-class classification?

I am implementing a simple neural net from scratch, just for practice. I have got it working fine with sigmoid, tanh and ReLU activations for binary classification problems. I am now attempting to use it for multi-class, mutually exclusive problems. Of course, softmax is the best option for this.
Unfortunately, I have had a lot of trouble understanding how to implement softmax, cross-entropy loss and their derivatives in backprop. Even after asking a couple of questions here and on Cross Validated, I can't get any good guidance.
Before I try to go further with implementing softmax, is it possible to somehow use sigmoid for multi-class problems (I am trying to predict 1 of n characters, which are encoded as one-hot vectors)? And if so, which loss function would be best? I have been using the squared error for all binary classifications.
Your question is about the fundamentals of neural networks, so I strongly suggest you start here (Michael Nielsen's book).
It is a Python-oriented book with graphical, textual and mathematical explanations - great for beginners. I am confident you will find it useful for your understanding. Chapters 2 and 3 address your problems.
Addressing your question about sigmoids: it is possible to use them for multi-class predictions, but it is not recommended. Consider the following facts.
Sigmoids are activation functions of the form 1/(1+exp(-z)), where z is the dot product of the previous hidden layer's outputs (or the inputs) with one row of the weight matrix, plus a bias (reminder: z = w_i . x + b, where w_i is the i-th row of the weight matrix). This activation is independent of the other rows of the matrix.
Classification tasks are about categories. Without any prior knowledge (and usually even with it), categories have no order-value interpretation; predicting apple instead of orange is no worse than predicting banana instead of nuts. Therefore, one-hot encoding of the categories usually performs better than predicting a category number with a single activation.
To recap, we want an output layer with as many neurons as there are categories, and sigmoid outputs are independent of each other given the previous layer's values. We would also like to predict the most probable category, which implies that we want the output-layer activations to behave like a probability distribution. But sigmoids are not guaranteed to sum to 1, while the softmax activation is.
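For illustration, here is a minimal NumPy sketch of that last point: independent sigmoid outputs do not form a probability distribution, while a softmax output does. The logits are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])     # hypothetical output-layer logits

sig = sigmoid(z)                    # each unit is squashed independently
soft = softmax(z)                   # units compete for probability mass

print(sig, sig.sum())               # sums to ~1.77 -- not a distribution
print(soft, soft.sum())             # sums to 1.0
```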
Using the L2 loss is also problematic because of the vanishing gradient issue. In short, the derivative of the loss is (sigmoid(z) - y) . sigmoid'(z) (the error times the derivative of the activation), which makes this quantity small, and even smaller when the sigmoid is close to saturation. Choose cross-entropy (log-loss) instead.
EDIT:
Corrected the phrasing about ordering the categories. To clarify, classification is a general term for many tasks related to what we today treat as categorical prediction over finite sets of values. As of today, using softmax, one-hot encoding and cross-entropy in deep models to predict categories in a generic "dog/cat/horse" classifier is very common practice, and it is reasonable to use it when the above holds. However, there are (many) cases where it doesn't apply - for instance, when you need to balance the data, or in tasks such as semantic segmentation where the categories (or their embeddings) can have a meaningful ordering or distance between them. So please choose the tools for your application wisely, understanding what they do mathematically and what their implications are.
What you ask is a very broad question.
As far as I know, when there are only two classes the softmax function reduces to the sigmoid, so yes, they are related. Cross-entropy is probably the best loss function.
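For example, here is a quick NumPy check (with made-up logits) that 2-class softmax is just a sigmoid applied to the difference of the two logits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z1, z2 = 1.7, -0.4                    # arbitrary example logits
p_softmax = softmax(np.array([z1, z2]))[0]
p_sigmoid = sigmoid(z1 - z2)

print(p_softmax, p_sigmoid)           # both ~0.8909
```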
For backpropagation it is not easy to derive the formulas by hand - there are many ways to do it. With CUDA-backed frameworks available, I don't think it is necessary to spend much time on it if you just want to use NNs or CNNs in the future. Maybe trying a framework like TensorFlow or Keras (highly recommended for beginners) will help you.
There are also many other factors, such as the flavour of gradient descent and the hyperparameter settings...
Like I said, the topic is very broad. Why not try the machine learning/deep learning courses on Coursera or the Stanford online course?

Artificial Neural Network RELU Activation Function and Gradients

I have a question. I watched a really detailed tutorial on implementing an artificial neural network in C++. And now I have more than a basic understanding of how a neural network works and how to actually program and train one.
So in the tutorial a hyperbolic tangent was used for calculating outputs, and obviously its derivative for calculating gradients. However, I want to move on to a different function, specifically Leaky ReLU (to avoid dying neurons).
My question is: the tutorial specifies that this activation function should be used for the hidden layers only, and that a different function should be used for the output layer (either softmax or a linear regression function). In the tutorial the network was taught to be an XOR processor. So is this a classification problem or a regression problem?
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
So I implemented the Leaky ReLU function and its derivative, but I don't know whether I should use softmax or a regression function for the output layer.
Also, for recalculating the output gradients I use the Leaky ReLU's derivative (for now), but in this case should I use the softmax/regression derivative as well?
Thanks in advance.
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
In short, classification is for discrete targets, regression is for continuous targets. If the output were a floating-point value, you would have a regression problem; but here the result of XOR is 0 or 1, so it's a binary classification problem (as Sid already suggested). You should use a softmax layer (or a sigmoid, which works particularly well for 2 classes). Note that the output will be a vector of probabilities, i.e. real-valued, which is used to choose the discrete target class.
Also, for recalculating the output gradients I use the Leaky ReLU's derivative (for now), but in this case should I use the softmax/regression derivative as well?
Correct. For the output layer you'll need a cross-entropy loss function, which corresponds to the softmax layer, and its derivative for the backward pass.
If there are hidden layers that still use Leaky ReLU, you'll also need Leaky ReLU's derivative for those particular layers.
Highly recommend this post on backpropagation details.
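If it helps, here is a rough NumPy sketch of that setup applied to XOR - a Leaky ReLU hidden layer, a sigmoid output (fine for 2 classes) and binary cross-entropy. The architecture and hyperparameters are purely illustrative, not taken from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # hidden layer (4 units)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)   # output layer
alpha, lr = 0.01, 0.5                              # leak slope, learning rate

def leaky_relu(z):  return np.where(z > 0, z, alpha * z)
def dleaky_relu(z): return np.where(z > 0, 1.0, alpha)
def sigmoid(z):     return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # forward pass
    z1 = X @ W1 + b1; h = leaky_relu(z1)
    z2 = h @ W2 + b2; p = sigmoid(z2)
    # backward pass: with cross-entropy + sigmoid, dL/dz2 simplifies to (p - y)
    dz2 = (p - y) / len(X)
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dz1 = (dz2 @ W2.T) * dleaky_relu(z1)
    dW1 = X.T @ dz1; db1 = dz1.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p, 3))   # should approach [0, 1, 1, 0]
```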

Can ReLU handle a negative input?

I'm training a neural network on data that comes in as negative & positive values.
Is there any way to feed the data into a ReLU network without converting it all to positive and having a separate input which says if the data is negative or positive?
The problem I see is that a negative input at the input layer means that, unless you have initialised your weights to be negative, the ReLU node is never activated and stays dead forever.
I'm not really 100% sure what you're asking, as there are many activation functions and you can easily code your own. If you don't want to code your own, maybe try some alternatives:
Leaky ReLU
Parametric ReLU
Basically, take a look here
If you really want to use an activation function at the input layer, I would suggest either using another activation function like ELU, or transforming your data to the range [0, 1], for example.
If the ReLU is in some hidden layer, it should become dead only temporarily.
Suppose you have a ReLU in the last hidden layer of a feed-forward network. With the backpropagation algorithm it should be possible for the outputs of the previous hidden layers to change in such a way that, eventually, the input to the ReLU becomes positive again. Then the ReLU would no longer be dead. Chances are that I am missing something here.
Anyway, you should definitely give ELU a try! I have had better results with it than with ReLU.
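For what it's worth, here is a small NumPy sketch of both suggestions (ELU, and rescaling the inputs to [0, 1]); the input values are made up:

```python
import numpy as np

def elu(z, a=1.0):
    # keeps a nonzero gradient and distinct outputs for negative inputs
    return np.where(z > 0, z, a * (np.exp(z) - 1.0))

x = np.array([-3.0, -0.5, 0.0, 2.0])          # mixed-sign input features

print(elu(x))                                  # negative inputs still carry information

x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)                                # now in [0, 1] for a plain ReLU net
```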

simple perceptron model and XOR

Sorry that I keep asking questions here. I will study hard so that I'm ready to answer questions too!
Many papers and articles claim that there is no restriction on the choice of activation function for an MLP.
It seems like it only matters which one fits best for the given conditions.
The articles also say that it is mathematically proven that a simple perceptron cannot solve the XOR problem.
I know that the simple perceptron model uses a step function as its activation function.
But if it basically doesn't matter which activation function you use, then using
f(x) = 1 if |x - a| < b
f(x) = 0 if |x - a| >= b
as the activation function should work on the XOR problem (for a 2-input, 1-output perceptron model with no hidden layer).
I know that using artificial functions like this is not good for a learning model. But if it works anyway, why do the articles say it is proven not to work?
Do the articles mean the simple perceptron model that uses a step function? Or does the activation function of a simple perceptron have to be a step function, unlike in an MLP? Or am I wrong?
In general, the problem is that non-differentiable activation functions (like the one you proposed) cannot be used with back-propagation and other gradient-based techniques. Back-propagation is a convenient way to estimate the correct threshold values (a and b in your example). All the popular activation functions are chosen so that they approximate step-like behaviour while remaining differentiable.
As bgbg mentioned, your activation is non-differentiable. And even if you use a differentiable activation function, which is required for MLPs to compute the gradients and update the weights, a perceptron with no hidden layer is simply fitting a line, which intuitively cannot solve the nonlinear XOR problem.
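To see this concretely, here is an illustrative NumPy sketch: a single sigmoid unit (no hidden layer) trained with gradient descent on XOR simply stays at predicting about 0.5 for every input. The hyperparameters are arbitrary.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

w = np.zeros(2); b = 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    p = sigmoid(X @ w + b)
    grad = p - y                          # cross-entropy + sigmoid gradient
    w -= 0.1 * X.T @ grad / len(X)
    b -= 0.1 * grad.mean()

print(np.round(sigmoid(X @ w + b), 2))    # stuck near [0.5, 0.5, 0.5, 0.5]
```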

Why use tanh for activation function of MLP?

I'm studying the theory of neural networks and have some questions.
In many books and references, the hyperbolic tangent is used as the activation function of the hidden layers.
The books give the fairly simple reason that linear combinations of tanh functions can describe nearly any shape of function to within a given error.
But a question came up.
Is this the real reason why the tanh function is used?
If so, is it the only reason why the tanh function is used?
If so, is tanh the only function that can do that?
If not, what is the real reason?
I'm stuck here, going round in circles... please help me out of this mental(?) trap!
Most of the time tanh converges more quickly than the sigmoid/logistic function and gives better accuracy [1]. However, the rectified linear unit (ReLU), proposed by Hinton's group [2], has been shown to train about six times faster than tanh [3] to reach the same training error. You can refer to [4] to see what benefits ReLU provides.
Based on about two years of machine-learning experience, I want to share some strategies that most papers use, along with my own experience in computer vision.
Normalizing input is very important
Normalizing well leads to better performance and faster convergence. Most of the time we subtract the mean so that the input has zero mean, to prevent the weights from all changing in the same direction, which slows convergence [5]. Recently Google also pointed out this phenomenon, called internal covariate shift, when training deep networks, and proposed batch normalization [6] to normalize each vector to zero mean and unit variance.
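As a sketch, per-feature standardization (zero mean, unit variance) looks like this in NumPy; X is a made-up design matrix of shape (samples, features):

```python
import numpy as np

X = np.array([[200.0, 0.1],
              [220.0, 0.4],
              [180.0, 0.2]])

mean = X.mean(axis=0)
std = X.std(axis=0) + 1e-8           # small epsilon guards against zero variance
X_norm = (X - mean) / std

print(X_norm.mean(axis=0))           # ~[0, 0]
print(X_norm.std(axis=0))            # ~[1, 1]
```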
More data, more accuracy
More training data helps the model generalize over the feature space and prevents overfitting. In computer vision, if the training data is not enough, the most commonly used tricks to enlarge the training set are data augmentation and synthesizing training data.
Choosing a good activation function allows training to be better and more efficient.
The ReLU nonlinearity has worked better and achieved state-of-the-art results in deep learning and MLPs. Moreover, it has benefits such as simple implementation and cheaper computation in back-propagation, which makes it efficient to train deeper networks. However, ReLU gives zero gradient and does not train when the unit is not active (its output is zero). Hence some modified ReLUs have been proposed, e.g. Leaky ReLU and Noisy ReLU; the most popular is PReLU [7], proposed by Microsoft, which generalizes the traditional rectified unit.
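A rough NumPy sketch of the idea behind PReLU: like Leaky ReLU it keeps a small slope for negative inputs, but the slope a is learned during training rather than fixed (here it is just passed in by hand).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def prelu(z, a):
    # Leaky ReLU is the special case where `a` is a fixed constant (e.g. 0.01)
    return np.where(z > 0, z, a * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))            # [0.  0.  0.  1.5]  -- zero gradient for z < 0
print(prelu(z, a=0.25))   # [-0.5  -0.125  0.  1.5]
```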
Others
Choose a large initial learning rate, as long as it does not oscillate or diverge, so as to find a better global minimum.
Shuffle the data.
In truth, both tanh and the logistic function can be used. The idea is that you can map any real number ([-Inf, Inf]) to a number in [-1, 1] or [0, 1] for tanh and the logistic function respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function.
Now, the reason tanh is preferred over the logistic function is that the former is symmetric around 0 while the latter is not. This makes the logistic function more prone to saturating the later layers, which makes training more difficult.
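A small numerical illustration of the symmetry point (NumPy, with made-up zero-mean pre-activations): tanh outputs stay roughly zero-mean, while logistic outputs are pushed toward a mean of about 0.5, which the next layer then has to absorb.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, 10000)       # zero-mean pre-activations (made up)

tanh_out = np.tanh(z)
logistic_out = 1.0 / (1.0 + np.exp(-z))

print(tanh_out.mean())                # ~0.0  (centred)
print(logistic_out.mean())            # ~0.5  (always positive, off-centre)
```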
To add to the existing answers, the preference for symmetry around 0 isn't just a matter of aesthetics. An excellent text by LeCun et al., "Efficient BackProp", shows in great detail why it is a good idea for the input, output and hidden layers to have mean values of 0 and a standard deviation of 1.
Update, in an attempt to appease commenters: based purely on observation rather than the theory covered above, tanh and ReLU activation functions perform better than sigmoid. Sigmoid also seems more prone to local optima, or at least extended 'flat line' issues. For example, try limiting the number of features to force logic into network nodes in XOR, and sigmoid rarely succeeds, whereas tanh and ReLU have more success.
Tanh seems perhaps slower than ReLU for many of the given examples, but produces more natural-looking fits for the data using only linear inputs, as you describe - for example a circle versus a square/hexagon shape.
http://playground.tensorflow.org/ <- this site is a fantastic visualisation of activation functions and other parameters of a neural network. Not a direct answer to your question, but the tool 'provides intuition', as Andrew Ng would say.
Many of the answers here describe why tanh (i.e. (e^(2x) - 1) / (e^(2x) + 1)) is preferable to the sigmoid/logistic function (1 / (1 + e^(-x))), but it should be noted that there is a good reason why these are the two most common alternatives, which is worth understanding: during training of an MLP with the back-propagation algorithm, the algorithm requires the value of the derivative of the activation function at the point of activation of each node in the network. While this could generally be calculated for most plausible activation functions (except those with discontinuities, which are a bit of a problem), doing so often requires expensive computation and/or storing additional data (e.g. the value of the input to the activation function, which is not otherwise required after the output of each node is calculated). Tanh and the logistic function, however, both have very simple and efficient formulas for their derivatives that can be calculated from the output of the function itself: if a node's weighted sum of inputs is v and its output is u, we need du/dv, which can be calculated from u rather than the more traditional v; for tanh it is 1 - u^2 and for the logistic function it is u * (1 - u). This fact makes these two functions more efficient to use in a back-propagation network than most alternatives, so a compelling reason is usually required to deviate from them.
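A quick numerical check of those derivative shortcuts (NumPy; the finite-difference comparison is only for illustration):

```python
import numpy as np

v = np.linspace(-3, 3, 7)
eps = 1e-6

logistic = lambda x: 1.0 / (1.0 + np.exp(-x))

# derivative computed from the output alone
tanh_from_u = 1.0 - np.tanh(v) ** 2
logi_from_u = logistic(v) * (1.0 - logistic(v))

# derivative by central finite differences on the input
tanh_fd = (np.tanh(v + eps) - np.tanh(v - eps)) / (2 * eps)
logi_fd = (logistic(v + eps) - logistic(v - eps)) / (2 * eps)

print(np.max(np.abs(tanh_from_u - tanh_fd)))   # very small (numerical noise)
print(np.max(np.abs(logi_from_u - logi_fd)))   # very small (numerical noise)
```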
In theory I agree with the responses above. In my experience, some problems have a preference for sigmoid rather than tanh, probably due to the nature of those problems (since the effects are non-linear, it is difficult to understand why).
Given a problem, I generally optimize networks using a genetic algorithm. The activation function of each member of the population is chosen at random from a set of possibilities (sigmoid, tanh, linear, ...). For about 30% of classification problems, the best member found by the genetic algorithm has sigmoid as its activation function.
In deep learning, ReLU has become the activation function of choice because the math is much simpler than for sigmoidal activation functions such as tanh or the logistic function, especially when you have many layers. To assign weights using backpropagation, you calculate the gradient of the loss function and apply the chain rule through the hidden layers, which means you need the derivative of the activation functions. ReLU is a ramp function: it has a flat part where the derivative is 0 and a linear part where the derivative is 1, which makes the math really easy. If you use the hyperbolic tangent you can run into the vanishing gradient problem: if x is smaller than about -2 or bigger than about 2, the derivative gets very small and your network may not converge, or you may end up with a dead neuron that no longer fires.
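A small sketch of that gradient comparison (NumPy, with arbitrary sample points): the tanh derivative collapses toward 0 outside roughly [-2, 2], while the ReLU derivative is either 0 or exactly 1.

```python
import numpy as np

x = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])

dtanh = 1.0 - np.tanh(x) ** 2                 # shrinks fast as |x| grows
drelu = np.where(x > 0, 1.0, 0.0)             # flat part: 0, linear part: 1

print(np.round(dtanh, 4))   # [0.0013 0.0707 1.     0.0707 0.0013]
print(drelu)                # [0. 0. 1. 1. 1.]
```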

Resources