I have a question. I watched a really detailed tutorial on implementing an artificial neural network in C++. And now I have more than a basic understanding of how a neural network works and how to actually program and train one.
So in the tutorial a hyperbolic tangent was used for calculating outputs, and obviously its derivative for calculating gradients. However, I wanted to move on to a different function, specifically Leaky ReLU (to avoid dying neurons).
My question is this: what I've read specifies that this activation function should be used for the hidden layers only, and that for the output layer a different function should be used (either softmax or a linear regression function). In the tutorial the guy taught the neural network to be an XOR processor. So is this a classification problem or a regression problem?
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
So I implemented the Leaky ReLU function and its derivative, but I don't know whether I should use a softmax or a regression function for the output layer.
Also, for recalculating the output gradients I use Leaky ReLU's derivative (for now), but in this case should I use the softmax's/regression function's derivative as well?
Thanks in advance.
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
In short, classification is for discrete targets, regression is for continuous targets. If XOR were a floating-point operation, you would have a regression problem. But here the result of XOR is 0 or 1, so it's a binary classification problem (as Sid already suggested). You should use a softmax layer (or a sigmoid function, which works specifically for 2 classes). Note that the output will be a vector of probabilities, i.e. real-valued, which is then used to choose the discrete target class.
Also, for recalculating the output gradients I use Leaky ReLU's derivative (for now), but in this case should I use the softmax's/regression function's derivative as well?
Correct. For the output layer you'll need a cross-entropy loss function, which corresponds to the softmax layer, and its derivative for the backward pass.
If there are hidden layers that still use Leaky ReLU, you'll also need Leaky ReLU's derivative for those particular layers.
Highly recommend this post on backpropagation details.
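If it helps to see the pieces in code, here is a minimal C++ sketch along the lines discussed above (two-class one-hot targets, softmax output, cross-entropy loss). The function names and the 0.01 negative slope are illustrative choices, not anything taken from the tutorial:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Leaky ReLU for the hidden layers: a small slope (here 0.01) for negative
    // inputs keeps the gradient from dying completely.
    double leakyRelu(double x)           { return x > 0.0 ? x : 0.01 * x; }
    double leakyReluDerivative(double x) { return x > 0.0 ? 1.0 : 0.01; }

    // Softmax for the output layer: turns the raw weighted sums into
    // probabilities that sum to 1.
    std::vector<double> softmax(const std::vector<double>& z)
    {
        double maxZ = *std::max_element(z.begin(), z.end()); // for numerical stability
        std::vector<double> p(z.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < z.size(); ++i) { p[i] = std::exp(z[i] - maxZ); sum += p[i]; }
        for (double& v : p) v /= sum;
        return p;
    }

    // With softmax + cross-entropy, the gradient at the output layer simplifies
    // to (probability - target), so the Leaky ReLU derivative is not used there;
    // it is only used for the hidden layers.
    std::vector<double> outputGradient(const std::vector<double>& probs,
                                       const std::vector<double>& target)
    {
        std::vector<double> grad(probs.size());
        for (std::size_t i = 0; i < probs.size(); ++i) grad[i] = probs[i] - target[i];
        return grad;
    }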
I'm asking about the last layer of a U-Net model for semantic segmentation: what should it be and why?
I've found a lot of different architectures; some of them use sigmoid and others use softmax in the last layer.
There's a good foundational article that goes in depth about sigmoid and softmax functions. Here is their summary:
If your model’s output classes are NOT mutually exclusive and you can choose many of them at the same time, use a sigmoid function on the network’s raw outputs.
If your model’s output classes are mutually exclusive and you can only choose one, then use a softmax function on the network’s raw outputs.
The article, however, specifically gives examples of classification tasks. In segmentation tasks, a pixel can only be one class at a time. (For example, in segmenting items on a beach, a pixel can't be both sand AND water.) This is why softmax is often used in segmentation models: the classes are mutually exclusive. In other words, it is a multi-class classification problem.
Sigmoid deals with multi-label classification problems, allowing a pixel to carry several labels at once (a pixel could be both sand and water, both sky and water, even sky+water+sand+sun+etc.), which doesn't make sense for segmentation. The exception, however, is if there's only one class, in other words binary classification (water vs. no water); then you may use sigmoid in segmentation.
Softmax is actually a generalization of a sigmoid function. See this question over on Cross Validated for more info, but this is extra credit.
To finish answering your question, I should briefly speak about loss functions. Depending on your loss function, you may end up preferring sigmoid or softmax (e.g. if your loss function expects raw logits, applying softmax in the last layer is inappropriate).
In summary, using softmax or sigmoid in the last layer depends on the problem you're working on, along with the associated loss function and other intricacies in your pipeline/software. In practice, if you have a multi-class problem, chances are you'll be using softmax. If you have a one-class/binary problem, sigmoid and softmax are both possibilities.
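To make the distinction concrete, here is a tiny illustrative C++ sketch (the logit values are made up). The sigmoid scores are computed independently per class, while the softmax scores are coupled and sum to 1:

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Sigmoid: each class gets an independent probability (multi-label case).
    double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

    // Softmax: the probabilities are coupled and sum to 1 (mutually exclusive classes).
    std::vector<double> softmax(const std::vector<double>& z)
    {
        std::vector<double> p(z.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < z.size(); ++i) { p[i] = std::exp(z[i]); sum += p[i]; }
        for (double& v : p) v /= sum;
        return p;
    }

    int main()
    {
        std::vector<double> logits = {2.0, 1.0, 0.1}; // hypothetical raw outputs for 3 classes
        std::vector<double> p = softmax(logits);
        for (std::size_t i = 0; i < logits.size(); ++i)
            std::printf("class %zu: sigmoid %.3f, softmax %.3f\n", i, sigmoid(logits[i]), p[i]);
        // The sigmoid values do not sum to 1; the softmax values do.
        return 0;
    }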
Why is the log-sigmoid activation function the primary choice for the hidden layer instead of the tanh-sigmoid activation function? Also, if I use Z-score normalization, can I use the sigmoid activation function in the hidden layer?
Ancient history
The use of the sigmoid function was historically physically motivated. The very first neural networks, in the early days, in fact used the step function, which jumps from 0 to 1 at a fixed threshold.
The motivation was that this is how neurons work in the brain, at least to the understanding of that time. At a certain fixed activation energy the neuron "activates", going from inactive (0) to active (1). However, these networks are very hard to train, and the standard paradigm was also physically motivated, e.g. "neurons that are used often, get a stronger connection". This worked for very small networks, but did not scale at all to larger networks.
Gradient descent and the advent of the sigmoid
In the 80s there was a small revolution in neural networks when it was discovered that they can be trained using gradient descent. This allowed networks to scale to much larger sizes, but it also spelled the end of the step activation, since it is not differentiable. However, given the long history of the step activation and its plausible physical motivation, people were hesitant to abandon it fully, and hence approximated it with the sigmoid function, which shares many of its characteristics but is differentiable.
Later on, people started using the tanh function since it is zero centered, which gives somewhat better characteristics in some cases.
The RevoLUtion
Then in 2000, a seminal paper was published in Nature that suggested the use of the ReLU activation function, f(x) = max(0, x).
This was motivated by problems with the earlier activation functions, but most important are its speed and the fact that it does not suffer from the vanishing gradient problem. Since then, basically all top neural network research has used the ReLU activation or slight variations thereof.
The only exception is perhaps recurrent networks, where the output is fed back as input. In those, using an unbounded activation function such as ReLU would quickly lead to an explosion in the activations, so people still use sigmoid and/or tanh in these cases.
I am somewhat confused by the use of the term linear/non-linear when discussing neural networks. Can anyone clarify these 3 points for me:
Each node in a neural net computes a weighted sum of its inputs. This is a linear combination of the inputs, so the value of each node (ignoring activation) is given by some linear function. I hear that neural nets are universal function approximators. Does this mean that, despite containing linear functions within each node, the total network is able to approximate a non-linear function as well? Are there any clear examples of how this works in practice?
An activation function is applied to the output of that node to squash/transform the output for further propagation through the rest of the network. Am I correct in interpreting this output from the activation function as the "strength" of that node?
Activation functions are also referred to as non-linear functions. Where does the term non-linear come from, given that the input to the activation is the result of a linear combination of the inputs to the node? I assume it's referring to the idea that something like the sigmoid function is itself a non-linear function? Why does it matter that the activation is non-linear?
1 Linearity
A neural network is only non-linear if you squash the output signal from the nodes with a non-linear activation function. A complete neural network (with non-linear activation functions) is an arbitrary function approximator.
Bonus: It should be noted that if you are using linear activation functions in multiple consecutive layers, you could just as well prune them down to a single layer, because a composition of linear maps is still a linear map (the collapsed layer's weights would simply be the product of the original weight matrices). A network with multiple layers using linear activation functions would not be able to model more complicated functions than a network with a single layer, as the short example below shows.
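As a quick toy demonstration of that pruning argument (the weights below are arbitrary made-up numbers), two linear layers compute exactly the same mapping as a single layer whose weight matrix is the product of the two:

    #include <cstdio>

    // Two consecutive "layers" with a linear (identity) activation: y = W2 * (W1 * x).
    // The same mapping is a single layer with W = W2 * W1, so without a
    // non-linearity the extra layer adds nothing. Sizes are 2x2 purely for illustration.
    int main()
    {
        double W1[2][2] = {{1.0, 2.0}, {3.0, 4.0}};
        double W2[2][2] = {{0.5, -1.0}, {2.0, 1.5}};
        double x[2]     = {1.0, -2.0};

        // Layer by layer: h = W1 * x, then y = W2 * h
        double h[2], y[2];
        for (int i = 0; i < 2; ++i) h[i] = W1[i][0] * x[0] + W1[i][1] * x[1];
        for (int i = 0; i < 2; ++i) y[i] = W2[i][0] * h[0] + W2[i][1] * h[1];

        // Collapsed: W = W2 * W1, then yc = W * x
        double W[2][2], yc[2];
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                W[i][j] = W2[i][0] * W1[0][j] + W2[i][1] * W1[1][j];
        for (int i = 0; i < 2; ++i) yc[i] = W[i][0] * x[0] + W[i][1] * x[1];

        std::printf("two layers: (%.1f, %.1f)  collapsed: (%.1f, %.1f)\n", y[0], y[1], yc[0], yc[1]);
        return 0;
    }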
2 Activation signal
The squashed output signal can indeed be interpreted as the strength of that signal (biologically speaking), though it might be incorrect to interpret the output strength as an equivalent of confidence as in fuzzy logic.
3 Non-linear activation functions
Yes, you are spot on. The weighted sum of the input signals is a linear combination; the non-linearity comes from your choice of activation function. Remember that a linear function can be drawn as a straight line, whereas sigmoid, tanh, ReLU and so on cannot be drawn as a single straight line.
Why do we need non-linear activation functions?
Most functions and classification tasks are probably best described by non-linear functions. If we decided to use linear activation functions, we would end up with a much coarser approximation of a complex function.
Universal approximators
You can sometimes read in papers that neural networks are universal approximators. This implies that a "perfect" network could be fitted to any model/function you could throw at it, though configuring that perfect network (the number of nodes and layers, among other things) is a non-trivial task.
Read more about the implications at this Wikipedia page.
I am having a very hard time visualizing how the activation function actually manages to classify non-linearly separable training data sets.
Why does the activation function (e.g. the tanh function) work for non-linear cases? What exactly happens mathematically when the activation function projects the input to the output? What separates training samples of different classes, and what would this look like if one had to plot the process graphically?
I've tried looking through numerous sources, but I just cannot easily grasp what exactly makes the activation function work for classifying training samples in a neural network, and I would like to be able to picture this in my mind.
The mathematical result behind neural networks is the Universal Approximation Theorem. Basically, sigmoidal functions (those which saturate on both ends, like tanh) are smooth almost-piecewise-constant approximators. The more neurons you have, the better your approximation is.
This picture was taken from this article: A visual proof that neural nets can compute any function. Make sure to check that article; it has other examples and interactive applets.
NNs actually create new features at each layer by distorting the input space. Non-linear functions allow you to change the "curvature" of the target function, so further layers have a chance to make it linearly separable. If there were no non-linear functions, any combination of linear functions would still be linear, so there would be no benefit from having multiple layers. As a graphical example, consider
this animation
These pictures were taken from this article. Also check out that cool visualization applet.
Activation functions have very little to do with classifying non-linearly separable sets of data.
Activation functions are used as a way to normalize signals at every step in your neural network. They typically have an infinite domain and a finite range. Tanh, for example, has a domain of (-∞,∞) and a range of (-1,1). The sigmoid function maps the same domain to (0,1).
You can think of this as a way of enforcing a common scale across all of your learned features at a given neural layer (a.k.a. feature scaling). Since the input domain is not known beforehand, it's not as simple as regular feature scaling (as for linear regression), and thus activation functions must be used. The effects of the activation function are compensated for when computing errors during back-propagation.
Back-propagation is a process that applies error to the neural network. You can think of this as a positive reward for the neurons that contributed to the correct classification and a negative reward for the neurons that contributed to an incorrect classification. This contribution is often known as the gradient of the neural network. The gradient is, effectively, a multi-variable derivative.
When back-propagating the error, each individual neuron's contribution to the gradient is the activation function's derivative at the input value for that neuron. Sigmoid is a particularly convenient function because its derivative is extremely cheap to compute from its output: specifically, s'(x) = s(x) * (1 - s(x)).
Here is an example image (found by google image searching: neural network classification) that demonstrates how a neural network might be superimposed on top of your data set:
I hope that gives you a relatively clear idea of how neural networks might classify non-linearly separable datasets.
I'm personally studying the theory of neural networks and have some questions.
In many books and references, the hyperbolic tangent function is used as the activation function of the hidden layer.
The books give the really simple reason that linear combinations of tanh functions can describe nearly any shape of function to within a given error.
But a question came up.
Is this the real reason why the tanh function is used?
If so, is it the only reason why the tanh function is used?
If so, is the tanh function the only function that can do that?
If not, what is the real reason?
I'm stuck here and keep thinking... please help me out of this mental(?) trap!
Most of the time tanh converges faster than the sigmoid and logistic functions, and gives better accuracy [1]. However, the rectified linear unit (ReLU), proposed more recently by Hinton [2], has been shown to train six times faster than tanh [3] while reaching the same training error. You can refer to [4] to see what benefits ReLU provides.
Based on about two years of machine learning experience, I want to share some strategies that most papers use, along with my own experience in computer vision.
Normalizing input is very important
Normalizing well can give better performance and faster convergence. Most of the time we subtract the mean value to make the input mean zero, which prevents the weights from all changing in the same direction and thus converging slowly [5]. Recently Google also pointed out this phenomenon, calling it internal covariate shift when training deep networks, and proposed batch normalization [6] to normalize each vector to zero mean and unit variance.
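For reference, here is a minimal sketch of plain per-feature z-score normalization in C++ (the simple preprocessing step, not the batch normalization of [6]):

    #include <cmath>
    #include <vector>

    // Z-score normalization of one input feature: subtract the mean and divide by
    // the standard deviation, so the feature has roughly zero mean and unit variance.
    void zScoreNormalize(std::vector<double>& feature)
    {
        double mean = 0.0;
        for (double v : feature) mean += v;
        mean /= feature.size();

        double variance = 0.0;
        for (double v : feature) variance += (v - mean) * (v - mean);
        double stddev = std::sqrt(variance / feature.size());

        const double eps = 1e-8; // guard against a constant feature
        for (double& v : feature) v = (v - mean) / (stddev + eps);
    }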
More data more accuracy
More training data helps cover the feature space well and prevents overfitting. In computer vision, if the training data is not enough, the most commonly used techniques for enlarging the training dataset are data augmentation and synthesizing training data.
Choosing a good activation function allows training to be faster and more effective.
The ReLU nonlinear activation works better and gives state-of-the-art results in deep learning and MLPs. Moreover, it has some benefits, e.g. it is simple to implement and cheaper to compute in back-propagation, which makes it efficient to train deeper networks. However, a ReLU unit gets zero gradient and does not train when it is not active. Hence some modified ReLUs have been proposed, e.g. Leaky ReLU and Noisy ReLU, and the most popular is PReLU [7], proposed by Microsoft, which generalizes the traditional rectified unit (see the sketch below).
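As a rough sketch of the difference between these variants (my own illustrative code, not taken from any of the cited papers; the fixed 0.01 slope and the 0.25 initial value are just common choices):

    // Plain ReLU: zero output and zero gradient for x <= 0 (the "dying" case).
    double relu(double x)      { return x > 0.0 ? x : 0.0; }
    // Leaky ReLU: a small fixed negative slope keeps some gradient alive.
    double leakyRelu(double x) { return x > 0.0 ? x : 0.01 * x; }

    // PReLU: same shape as Leaky ReLU, but the negative slope `a` is a parameter
    // learned by back-propagation instead of a fixed constant.
    struct PReLU {
        double a = 0.25;                                                // initial slope
        double forward(double x)   const { return x > 0.0 ? x : a * x; }
        double gradInput(double x) const { return x > 0.0 ? 1.0 : a; }  // d(out)/d(x)
        double gradSlope(double x) const { return x > 0.0 ? 0.0 : x; }  // d(out)/d(a)
    };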
Others
Choose a large initial learning rate (as long as training does not oscillate or diverge) so as to find a better global minimum.
Shuffle the data.
In truth both tanh and the logistic function can be used. The idea is that you can map any real number, i.e. (-Inf, Inf), to a number in (-1, 1) or (0, 1) for tanh and the logistic function respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function.
Now, the reason for preferring tanh over the logistic function is that the former is symmetric around 0 while the latter is not. This makes the logistic function more prone to saturating the later layers, which makes training more difficult.
To add to the already existing answers, the preference for symmetry around 0 isn't just a matter of aesthetics. An excellent text by LeCun et al., "Efficient BackProp", shows in great detail why it is a good idea that the inputs, outputs and hidden layers have mean values of 0 and standard deviation of 1.
Update in an attempt to appease commenters: based purely on observation, rather than the theory covered above, tanh and ReLU activation functions are more performant than sigmoid. Sigmoid also seems to be more prone to local optima, or at least to extended 'flat line' issues. For example, try limiting the number of features to force logic into network nodes in XOR, and sigmoid rarely succeeds whereas tanh and ReLU have more success.
Tanh seems maybe slower than ReLU for many of the given examples, but produces more natural-looking fits for the data using only linear inputs, as you describe. For example, a circle vs. a square/hexagon shape.
http://playground.tensorflow.org/ <- this site is a fantastic visualisation of activation functions and other parameters of a neural network. It's not a direct answer to your question, but the tool 'provides intuition', as Andrew Ng would say.
Many of the answers here describe why tanh (i.e. (e^2x - 1) / (e^2x + 1)) is preferable to the sigmoid/logistic function (1 / (1 + e^-x)), but it should be noted that there is a good reason why these are the two most common alternatives: during training of an MLP using the back-propagation algorithm, the algorithm requires the value of the derivative of the activation function at the point of activation of each node in the network. While this could generally be calculated for most plausible activation functions (except those with discontinuities, which are a bit of a problem), doing so often requires expensive computations and/or storing additional data (e.g. the value of the input to the activation function, which is not otherwise required after the output of each node is calculated). Tanh and the logistic function, however, both have very simple and efficient calculations for their derivatives that can be obtained from the output of the function itself; i.e. if the node's weighted sum of inputs is v and its output is u, we need to know du/dv, which can be calculated from u rather than the more traditional v: for tanh it is 1 - u^2 and for the logistic function it is u * (1 - u). This fact makes these two functions more efficient to use in a back-propagation network than most alternatives, so a compelling reason would usually be required to deviate from them.
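In code, the trick looks roughly like this (a small sketch; `upstreamGradient` is just a placeholder name for the error term coming from the layer above):

    #include <cmath>

    // Back-propagation needs du/dv at each node. For tanh and the logistic
    // function this can be computed from the node's output u alone, so the
    // input v does not need to be stored.
    double tanhDerivFromOutput(double u)     { return 1.0 - u * u; }   // u = tanh(v)
    double logisticDerivFromOutput(double u) { return u * (1.0 - u); } // u = 1 / (1 + exp(-v))

    // Illustrative use in the backward pass of a tanh node:
    //   double u     = std::tanh(v);                       // stored during the forward pass
    //   double delta = upstreamGradient * tanhDerivFromOutput(u);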
In theory I am in accord with the above responses. In my experience, some problems have a preference for sigmoid rather than tanh, probably due to the nature of those problems (since there are non-linear effects, it is difficult to understand why).
Given a problem, I generally optimize networks using a genetic algorithm. The activation function of each element of the population is chosen randomly from a set of possibilities (sigmoid, tanh, linear, ...). For about 30% of classification problems, the best element found by the genetic algorithm has sigmoid as its activation function.
In deep learning, ReLU has become the activation function of choice because the math is much simpler than for sigmoid-type activation functions such as tanh or the logistic function, especially if you have many layers. To assign weights using backpropagation, you normally calculate the gradient of the loss function and apply the chain rule for hidden layers, meaning you need the derivative of the activation functions. ReLU is a ramp function: it has a flat part where the derivative is 0 and a linear part where the derivative is 1. This makes the math really easy. If you use the hyperbolic tangent you might run into the vanishing gradient problem, meaning that if x is smaller than -2 or bigger than 2, the derivative gets really small and your network might not converge, or you might end up with a dead neuron that does not fire anymore.
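As a rough numeric illustration of that saturation effect (a toy sketch of my own):

    #include <cmath>
    #include <cstdio>

    double reluDeriv(double x) { return x > 0.0 ? 1.0 : 0.0; }
    double tanhDeriv(double x) { double u = std::tanh(x); return 1.0 - u * u; }

    int main()
    {
        // Outside roughly [-2, 2] the tanh gradient becomes tiny, while the
        // ReLU gradient stays at 1 for any positive input.
        for (double x : {0.5, 2.0, 4.0, 8.0})
            std::printf("x = %4.1f  tanh' = %.6f  relu' = %.1f\n", x, tanhDeriv(x), reluDeriv(x));
        return 0;
    }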