Recently I have been studying LSTM networks. I have mostly figured out the forward pass; there are tons of great materials out there. Now I am trying to work out how the backward pass works, and it seems a bit tricky to me. While searching, I came across an article on arXiv, and something in it confused me.
In this article, the author describes the LSTM forward pass and derives the backward pass. However, he uses the tanh activation function for the output gate, and I could not find any references explaining why he does so.
o_{t} = tanh(W_{xo}x_{t} + W_{ho}h_{t−1} + b_{o})
Could someone please explain this to me, or at least point me to some articles to read? I cannot find any. The articles, blogs, and course slides I have seen all use the sigmoid function for the output gate, and the reasoning behind that makes total sense to me. An example: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Here, the output gate is given by:
o_{t} = \sigma(W_{xo}x_{t} + W_{ho}h_{t−1} + b_{o})
In the same article, he uses a sigmoid function to determine the candidate cell-state (internal state) values (sometimes called the "input node", or what some authors call the "input modulation gate", given by g_{t}) used to update the old cell state into the new cell state. Typically, the summed, weighted input is run through a tanh activation function (as in the blog post I linked above), although in the original LSTM paper the activation function is a sigmoid. I totally understand that and can follow the intuition.
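For reference, here is the formulation I keep seeing everywhere else, with sigmoid on all three gates and tanh on the candidate state and the cell output. This is just a minimal numpy sketch with my own variable names, not the notation from either article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step in the 'standard' formulation: sigmoid gates, tanh candidate.

    W, U, b are dicts holding input weights, recurrent weights and biases
    for the input (i), forget (f), output (o) gates and the candidate (g).
    """
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate (sigmoid here)
    g_t = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate cell state
    c_t = f_t * c_prev + i_t * g_t                            # new cell state
    h_t = o_t * np.tanh(c_t)                                  # new hidden state
    return h_t, c_t

# tiny smoke test with random parameters
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_in)) for k in "ifog"}
U = {k: rng.normal(size=(n_hid, n_hid)) for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```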
NOTE: this post is not a duplicate. I am not asking for the intuition behind using the sigmoid or tanh functions.
EDIT: The article above applies the tanh function to the output gate. But if you look at a paper co-authored by Jürgen Schmidhuber ("LSTM: A Search Space Odyssey", https://arxiv.org/abs/1503.04069, see Figure 1), it says that the activation function of the output gate is always sigmoid. I know arXiv papers are not officially published and not peer-reviewed. However, I just want to be sure before I claim that there is something wrong with the article "A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation" by Gang Chen.
Related
While studying LSTMs, I learned that two different activation functions are used around the input gate: sigmoid and tanh. I understand the use of the sigmoid but not the tanh. A Stack Overflow answer about the use of tanh says that we want its second derivative to be sustained for a long time before going to zero, and I don't get why it talks about the second derivative. It also suggests (in the second paragraph) that tanh eliminates the vanishing gradient, yet all the articles I have read say that Leaky ReLU helps eliminate it. So I want to understand the role of tanh in an LSTM. This is not a duplicate question; I just want to understand the previously answered question. Thank you! 🙌
I've been trying to find out what exactly non-linear activation functions do when implemented in a neural network.
I know they modify the output of a neuron, but how and for what purpose?
I know they add non-linearity to otherwise linear neural networks, but for what purpose?
What exactly do they do to the output of each layer? Is it some form of classification?
I want to know what exactly their purpose is within neural networks.
Wikipedia says that "the activation function of a node defines the output of that node given an input or set of inputs." This article states that the activation function checks whether a neuron has "fired" or not. I've looked at a bunch more articles and other questions on Stack Overflow as well, but none of them gave a satisfying answer as to what is occurring.
The main reason for using non-linear activation functions is to be able to learn non-linear target functions, i.e. a non-linear relationship between the inputs and outputs. If a network consists of only linear activation functions, it can only model a linear relationship between the inputs and outputs, which is insufficient for almost all real applications.
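To make that concrete, here is a small scikit-learn sketch (the target function, layer size and other settings are just values I picked for the illustration): the same network fits y = x^2 well once the hidden layer has a non-linear (tanh) activation, but with a linear ("identity") activation it can do no better than a straight line.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# a simple non-linear target: y = x^2
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 1))
y = X[:, 0] ** 2

# same architecture, only the hidden activation differs
linear_net = MLPRegressor(hidden_layer_sizes=(32,), activation="identity",
                          max_iter=5000, random_state=0).fit(X, y)
tanh_net = MLPRegressor(hidden_layer_sizes=(32,), activation="tanh",
                        max_iter=5000, random_state=0).fit(X, y)

print("R^2 with identity (linear) activation:", linear_net.score(X, y))  # typically near 0
print("R^2 with tanh activation:             ", tanh_net.score(X, y))    # typically close to 1
```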
I am by no means an ML expert, so maybe this video can explain it better: https://www.coursera.org/lecture/neural-networks-deep-learning/why-do-you-need-non-linear-activation-functions-OASKH
Hope this helps!
First of all, it helps to have a clear idea of why we use activation functions.
We use activation functions to propagate the output of one layer's nodes to the next layer. Activation functions are scalar-to-scalar functions, and we apply them to the hidden neurons of a neural network to introduce non-linearity into the network's model. Put simply, activation functions are used to introduce non-linearity into the network.
So what is the use of introducing non-linearity? Non-linearity means that the output cannot be reproduced from a linear combination of the inputs. Without a non-linear activation function, a neural network, even one with hundreds of hidden layers, would still behave like a single-layer perceptron: however you sum and stack the layers, the result is only ever a linear function of the input.
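That collapse is easy to verify numerically. Below is a small numpy sketch (the dimensions and random values are arbitrary, chosen only for the demo): two stacked layers with no activation function produce exactly the same output as a single linear layer whose weights are the product of the two.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=3)

# two "layers" with no (i.e. identity) activation
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
two_layer_out = W2 @ (W1 @ x + b1) + b2

# the exact same mapping as one linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer_out = W @ x + b

print(np.allclose(two_layer_out, one_layer_out))  # True: the extra layer added nothing
```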
Anyhow, for a deeper understanding, I suggest you look at this Medium post as well as this video by Andrew Ng himself.
Let me rephrase some important parts of Andrew Ng's video below.
...if you don't have an activation function, then no matter how many layers your neural network has, all it's doing is just computing a linear activation function. So you might as well not have any hidden layers.
...it turns out that if you have a linear activation function here and a sigmoid function here, then this model is no more expressive than standard logistic regression without any hidden layer.
...so unless you throw a non-linear in there, then you're not computing more interesting functions even as you go deeper in the network.
Honestly, I am still learning neural networks, but I have a question about the activation part.
I know the question is general and there is a lot of explanation around the internet, but I still don't understand it clearly.
Why do we need to differentiate the sigmoid function? Why don't we just use it?
It would be good if you could give a clear explanation, thank you.
I've seen many videos on YouTube and read many articles about it, but I still don't get it.
Thanks for your help.
Your question is not entirely clear, but I assume you are asking: "Why don't we just use the Sigmoid function without having to calculate its derivative?".
Your question is also very broad, so my answer is broad and wordy; you will need to read more to understand all the details, and I'll try to provide links for that.
Activation function: as the name suggests, we want to know whether a given node is "on" or "off", and the sigmoid function provides an easy way to squash a continuous variable x into the range (0, 1).
Use cases vary and the sigmoid has particular properties, which is why there are many alternative activation functions, like tanh, ReLU, etc. Read more here: https://en.wikipedia.org/wiki/Sigmoid_function
Differentiate (take the derivative): in most models we want to find the best-fit (beta) parameters feeding into all our activation functions. To do this, we typically minimise a "cost" function that describes how good the model is at predicting the observed data. One way to solve this optimisation problem is gradient descent: each step updates the parameters by following the gradient of the cost function through the multi-dimensional parameter space, and computing that gradient requires the derivative of the activation function. This is important for back-propagation, which uses gradient descent to optimise the network and therefore requires the activation functions you use (in most cases) to be differentiable.
Read more here: https://en.wikipedia.org/wiki/Gradient_descent
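To make this concrete, here is a tiny sketch of gradient descent on a single sigmoid neuron with a squared-error cost (the data, learning rate and iteration count are made up for illustration); the sigmoid's derivative, sigmoid'(z) = sigmoid(z)(1 - sigmoid(z)), appears inside every update via the chain rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # this is why we need the derivative

# toy 1-D data: label is 1 when x > 0 (made up for the example)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x > 0).astype(float)

w, b, lr = 0.0, 0.0, 0.5
for _ in range(500):
    z = w * x + b
    pred = sigmoid(z)
    err = pred - y                                  # squared-error loss: 0.5 * err**2
    grad_w = np.mean(err * sigmoid_deriv(z) * x)    # chain rule uses sigmoid'(z)
    grad_b = np.mean(err * sigmoid_deriv(z))
    w -= lr * grad_w
    b -= lr * grad_b

print("learned weight:", w)   # positive: larger x pushes the prediction towards 1
print("training accuracy:", np.mean((sigmoid(w * x + b) > 0.5) == (y == 1)))
```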
I suggest that if you have a deeper question, you take it to one of the machine learning Stack Exchange sites.
So after you have trained a machine learning algorithm, with your layers, nodes, and weights, how exactly does it go about getting a prediction for an input vector? I am using a multilayer perceptron (neural network).
From what I currently understand, you start with the input vector you want to predict. You send it to the hidden layer(s), where each node takes the sum of the products of each input value and its corresponding weight (found in training), adds the node's bias term, and then runs that sum through the same activation function used in training. You repeat this for each hidden layer, then do the same for the output layer, and each node in the output layer is your prediction(s).
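In code, my understanding of the prediction step looks roughly like the following sketch (plain numpy, with my own made-up names and weights, not tied to any particular library):

```python
import numpy as np

def predict(x, weights, biases, activation=np.tanh):
    """Forward pass of a trained MLP: weighted sum + bias + activation, layer by layer."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        a = activation(W @ a + b)   # same activation that was used during training
    return a                         # output-layer values are the prediction(s)

# toy example: 2 inputs -> 3 hidden nodes -> 1 output, with made-up "trained" weights
weights = [np.array([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]),
           np.array([[1.0, -0.5, 0.25]])]
biases = [np.array([0.1, -0.1, 0.0]), np.array([0.2])]
print(predict([1.0, 2.0], weights, biases))
```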
Is this correct?
I got confused when using OpenCV to do this, because the guide says that when you use the predict function:
If you are using the default cvANN_MLP::SIGMOID_SYM activation function with the default parameter values fparam1=0 and fparam2=0, then the function used is y = 1.7159*tanh(2/3 * x), so the output will range from [-1.7159, 1.7159] instead of [0, 1].
However, the documentation also states that for training, SIGMOID_SYM uses the activation function:
f(x)= beta*(1-e^{-alpha x})/(1+e^{-alpha x} )
Where alpha and beta are user defined variables.
So, I'm not quite sure what this means. Where does the tanh function come into play? Can anyone clear this up please? Thanks for the time!
The documentation where this is found is here:
The reference to the tanh is under the function description for predict.
The reference to the activation function is next to the S-shaped graph in the top part of the page.
Since this is a general question, and not code specific, I did not post any code with it.
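One thing I noticed while writing this up: the two formulas may simply be the same function written in two ways, since algebraically beta*(1 - e^{-alpha x})/(1 + e^{-alpha x}) equals beta*tanh(alpha*x/2). A quick standalone numerical check of that identity (just a sketch, not OpenCV code):

```python
import numpy as np

# claim: beta*(1 - exp(-alpha*x)) / (1 + exp(-alpha*x)) == beta*tanh(alpha*x/2)
alpha, beta = 0.7, 1.3          # arbitrary values just for the check
x = np.linspace(-5, 5, 1001)

sym_sigmoid = beta * (1 - np.exp(-alpha * x)) / (1 + np.exp(-alpha * x))
scaled_tanh = beta * np.tanh(alpha * x / 2)

print(np.allclose(sym_sigmoid, scaled_tanh))   # True: same curve, two spellings
```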
I would suggest that you read about the specific algorithm you are using or plan to use. To be honest, there is no single definitive algorithm for solving a problem, but you can explore what features you have and what you need.
How an algorithm performs prediction depends entirely on the choice of algorithm. A Support Vector Machine (SVM) performs prediction by fitting hyperplanes in the feature space, using some metric such as distance for learning, and the learnt model is then used for prediction. KNN, on the other hand, uses a simple nearest-neighbour measure for prediction.
Please do more work on what exactly you need, and read through the research papers to get a proper understanding. There is no magic involved in prediction, only mathematical formulations.
Sorry that I only keep asking questions here. I will study hard so that I am ready to answer questions too!
Many papers and articles claim that there is no restriction on the choice of activation function for an MLP.
It seems it is only a matter of which one fits best for the given conditions.
The articles also say that it is mathematically proven that a simple perceptron cannot solve the XOR problem.
I know that the simple perceptron model used a step function as its activation function.
But if it basically doesn't matter which activation function we use, then using

f(x) = 1 if |x - a| < b
f(x) = 0 if |x - a| > b

as the activation function works on the XOR problem (for a 2-input, 1-output perceptron model with no hidden layer).
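To check my own claim, here is a quick sketch with example values I chose (w1 = w2 = 1, no bias, a = 1, b = 0.5): the weighted sums for the four XOR inputs are 0, 1, 1, 2, and the band function maps them to 0, 1, 1, 0, which is exactly XOR.

```python
import numpy as np

def band_activation(s, a=1.0, b=0.5):
    # 1 inside the band |s - a| < b, 0 outside (the function from the question)
    return (np.abs(s - a) < b).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 1, 1, 0])          # XOR

w = np.array([1.0, 1.0])                  # example weights, no bias
sums = X @ w                              # -> [0, 1, 1, 2]
outputs = band_activation(sums)           # -> [0, 1, 1, 0]

print(outputs, (outputs == targets).all())  # [0 1 1 0] True
```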
I know that using contrived functions like this is not good for a learning model. But if it works anyway, why do the articles say it is proven not to work?
Do the articles mean by "simple perceptron" a model that uses a step function? Or does the activation function of a simple perceptron have to be a step function, unlike in an MLP? Or am I wrong somewhere?
In general, the problem is that non-differentiable activation functions (like the one you proposed) cannot be used with back-propagation and related techniques. Back-propagation is a convenient way to estimate the correct threshold values (a and b in your example). All the popular activation functions are selected so that they approximate step-like behaviour while remaining differentiable.
As bgbg mentioned, your activation is non-differentiable. If you use a differentiable activation function, which is what an MLP requires in order to compute gradients and update its weights, then a perceptron with no hidden layer is simply fitting a line, and a single line intuitively cannot separate the classes of the non-linear XOR problem.
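To illustrate that last point, here is a small scikit-learn sketch (the choice of LogisticRegression is mine, just as a stand-in for any linear decision boundary): on the four XOR points, no linear classifier can label more than three correctly, so the accuracy never reaches 1.0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                     # XOR

clf = LogisticRegression().fit(X, y)           # a single linear decision boundary
print(clf.score(X, y))                         # at most 0.75: no line separates XOR
```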