Neural network (non) linearity - machine-learning

I am somewhat confused by the use of the term linear/non-linear when discussing neural networks. Can anyone clarify these 3 points for me:
Each node in a neural net computes a weighted sum of its inputs. This is a linear combination of the inputs, so the value of each node (ignoring activation) is given by some linear function. I hear that neural nets are universal function approximators. Does this mean that, despite containing linear functions within each node, the total network is able to approximate non-linear functions as well? Are there any clear examples of how this works in practice?
An activation function is applied to the output of that node to squash/transform the output for further propagation through the rest of the network. Am I correct in interpreting this output from the activation function as the "strength" of that node?
Activation functions are also referred to as non-linear functions. Where does the term non-linear come from, given that the input to the activation function is the result of a linear combination of the inputs into the node? I assume it's referring to the idea that something like the sigmoid function is itself a non-linear function? Why does it matter that the activation is non-linear?

1 Linearity
A neural network is only non-linear if you squash the output signal from the nodes with a non-linear activation function. A complete neural network (with non-linear activation functions) is an arbitrary function approximator.
Bonus: It should be noted that if you use linear activation functions in multiple consecutive layers, you could just as well prune them down to a single layer, because composing linear maps gives another linear map (the single layer's weights would simply be the product of the stacked layers' weight matrices). Creating a network with multiple layers using linear activation functions would therefore not be able to model more complicated functions than a network with a single layer.
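As a quick illustration of that collapse, here is a minimal NumPy sketch (the weight shapes are arbitrary) showing that two stacked linear layers compute exactly the same function as a single layer whose weight matrix is the product of the two:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with purely linear activations: y = W2 @ (W1 @ x)
W1 = rng.normal(size=(4, 3))   # first layer weights (arbitrary shapes)
W2 = rng.normal(size=(2, 4))   # second layer weights
x = rng.normal(size=(3,))

two_layer_output = W2 @ (W1 @ x)

# The same mapping expressed as a single linear layer with combined weights
W_combined = W2 @ W1
one_layer_output = W_combined @ x

print(np.allclose(two_layer_output, one_layer_output))  # True
```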
2 Activation signal
The squashed output signal can indeed be interpreted as the strength of that node's signal (biologically speaking), though it might be incorrect to interpret the output strength as an equivalent of confidence as in fuzzy logic.
3 Non-linear activation functions
Yes, you are spot on. The input signals combined with their respective weights form a linear combination. The non-linearity comes from your choice of activation function. Remember that a linear function can be drawn as a straight line - sigmoid, tanh, ReLU and so on cannot be drawn as a single straight line.
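A quick way to see the difference is to check additivity, f(a + b) = f(a) + f(b), which any straight line through the origin satisfies but sigmoid, tanh and ReLU do not. A small NumPy sketch (the test points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def is_additive(f, a=1.0, b=-2.0):
    # A linear map through the origin must satisfy f(a + b) == f(a) + f(b)
    return bool(np.isclose(f(a + b), f(a) + f(b)))

linear = lambda x: 3.0 * x  # a straight line through the origin

for name, f in [("linear", linear), ("sigmoid", sigmoid),
                ("tanh", np.tanh), ("relu", relu)]:
    print(name, is_additive(f))
# linear True, sigmoid False, tanh False, relu False
```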
Why do we need non-linear activation functions?
Most functions and classification tasks are probably best described by non-linear functions. If we decided to use linear activation functions, we would end up with a much coarser approximation of a complex function.
Universal approximators
You can sometimes read in papers that neural networks are universal approximators. This means that a "perfect" network could be fitted to any model/function you could throw at it, though configuring that perfect network (number of nodes, number of layers, and so on) is a non-trivial task.
Read more about the implications at the Wikipedia page on the universal approximation theorem.

Related

What do non-linear activation functions do at a fundamental level in neural networks?

I've been trying to find out what exactly non-linear activation functions do when implemented in a neural network.
I know they modify the output of a neuron, but how and for what purpose?
I know they add non-linearity to otherwise linear neural networks, but for what purpose?
What exactly do they do to the output of each layer? Is it some form of classification?
I want to know what exactly their purpose is within neural networks.
Wikipedia says that "the activation function of a node defines the output of that node given an input or set of inputs." This article states that the activation function checks whether a neuron has "fired" or not. I've looked at a bunch more articles and other questions on Stack Overflow as well, but none of them gave a satisfying answer as to what is occurring.
The main reason for using non-linear activation functions is to be able to learn non-linear target functions, i.e. to learn a non-linear relationship between the inputs and outputs. If a network consists of only linear activation functions, it can only model a linear relationship between the inputs and outputs, which is not useful in most applications.
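To make this concrete, here is a small sketch, assuming scikit-learn is available: on XOR, which is not linearly separable, the same network fails with purely linear ("identity") activations but can fit the data once a ReLU hidden layer is used.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR: not linearly separable

for act in ("identity", "relu"):
    clf = MLPClassifier(hidden_layer_sizes=(8,), activation=act,
                        solver="lbfgs", max_iter=2000, random_state=0)
    clf.fit(X, y)
    print(act, clf.score(X, y))

# The "identity" network behaves like a linear model and cannot reach 100%
# accuracy on XOR; the ReLU network usually fits it perfectly.
```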
I am by no means an ML expert, so maybe this video can explain it better: https://www.coursera.org/lecture/neural-networks-deep-learning/why-do-you-need-non-linear-activation-functions-OASKH
Hope this helps!
First of all, it's better to have a clear idea of why we use activation functions.
We use activation functions to propagate the output of one layer's nodes to the next layer. Activation functions are scalar-to-scalar functions, and we use them on the hidden neurons of a neural network to introduce non-linearity into the network's model. So, at a basic level, activation functions are used to introduce non-linearity into the network.
So what is the use of introducing non-linearity? First, non-linearity means that the output cannot be reproduced from a linear combination of the inputs. Therefore, without a non-linear activation function, a neural network, even one with hundreds of hidden layers, would still behave like a single-layer perceptron: however you combine linear functions, the result is still a linear function of the inputs.
Anyhow, for a deeper understanding, I suggest you look at this Medium post as well as this video by Andrew Ng himself.
Let me rephrase some important parts from Andrew Ng's video below.
...if you don't have an activation function, then no matter how many layers your neural network has, all it's doing is just computing a linear activation function. So you might as well not have any hidden layers.
...it turns out that if you have a linear activation function here and a sigmoid function here, then this model is no more expressive than standard logistic regression without any hidden layer.
...so unless you throw a non-linear in there, then you're not computing more interesting functions even as you go deeper in the network.

Artificial Neural Network - why usually use sigmoid activation function in the hidden layer instead of tanh-sigmoid activation function?

Why is the log-sigmoid activation function the primary choice for the hidden layer instead of the tanh-sigmoid activation function? And also, if I use Z-score normalization, could I use the sigmoid activation function in the hidden layer?
Ancient history
The motivation for using the sigmoid function was, historically, physical. The first neural networks, in the very early days, in fact used the step function.
The motivation was that this is how neurons work in the brain, at least to the understanding of that time. At a certain fixed activation energy the neuron "activates", going from inactive (0) to active (1). However, these networks are very hard to train, and the standard paradigm was also physically motivated, e.g. "neurons that are used often, get a stronger connection". This worked for very small networks, but did not scale at all to larger networks.
Gradient descent and the advent of the sigmoid
In the '80s a minor revolution occurred in neural networks when it was discovered that they can be trained using gradient descent. This allowed networks to scale to much larger sizes, but it also spelled the end of the step activation, since it is not differentiable. However, given the long history of the step activation and its plausible physical motivation, people were hesitant to abandon it fully, and hence approximated it with the sigmoid function, which shares many of its characteristics but is smooth and differentiable.
Later on, people started using the tanh function since it is zero centered, which gives somewhat better characteristics in some cases.
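For reference, here is a small NumPy sketch of the activations discussed so far (the sample points are arbitrary): the step function's hard 0/1 jump, the sigmoid as its smooth, differentiable stand-in, and tanh as the zero-centered variant.

```python
import numpy as np

def step(x):
    # The original "all or nothing" activation: not differentiable at 0
    return (x > 0).astype(float)

def sigmoid(x):
    # Smooth approximation of the step, squashing to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 7)
print(step(x))
print(np.round(sigmoid(x), 3))
print(np.round(np.tanh(x), 3))  # zero-centered, squashes to (-1, 1)
```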
The RevoLUtion
Then in 2000 a seminal paper was published in Nature that suggested the use of the ReLU activation function.
This was motivated by problems with the earlier activation functions, but the most important advantages are speed and the fact that ReLU does not suffer from the vanishing gradient problem. Since then, basically all top neural network research has used the ReLU activation or slight variations thereof.
The only exception is perhaps recurrent networks, where the output is fed back as input. In these, using an unbounded activation function such as ReLU would quickly lead to an explosion in the results, and people still use the sigmoid and/or tanh in these cases.
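A toy illustration of that last point (the weight and starting value are made up): feeding a unit's output back through an unbounded ReLU lets the signal grow without limit, whereas tanh keeps it squashed.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Toy recurrence h <- f(w * h) with a feedback weight larger than 1
w, h_relu, h_tanh = 1.5, 1.0, 1.0
for _ in range(20):
    h_relu = relu(w * h_relu)      # unbounded: grows geometrically
    h_tanh = np.tanh(w * h_tanh)   # squashed: stays inside (-1, 1)

print(h_relu)  # roughly 1.5**20, i.e. it explodes
print(h_tanh)  # settles at a bounded value
```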

What should be the activation function for a Neural Network in case of regression?

My question is based on the understanding from https://www.youtube.com/watch?v=oYbVFhK_olY&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=43
In a Neural Network a neuron is activated by a threshold (activation) function, which in the above example is a sigmoid function. For regression problems, do we need an activation function?
regards
Souvik
Yes, a neural network for regression also requires activation functions. If the network is regressing to a continuous output, a sigmoid-like function can be used as the activation function; avoid threshold-like (step) activation functions. Apart from the output node, activation functions are also required across all the intermediate nodes of the hidden layers.
For the hidden layers you should use ReLU, and don't use an activation function for the output layer.
Please check this document: Linear Regression
Let's get some terminology clear:
By "regression" we usually refer to the simplest model, where each of the independent variables has a parameter and we multiply the variable by that parameter. In linear regression we stop there, with a continuous output. In logistic regression we also use an "activation function", usually the sigmoid, to squash the output into the range (0, 1), and then we can use a threshold (like x > 0.5) to decide whether the output is 0 or 1.
So regression is like a neural network with 0 hidden layers.
Now I assume that what you meant is that you want the neural network to have a continuous output, like linear regression, so naturally we don't want to use any "distorting" function on the output layer.
However, we do want to use an activation function on the neurons in the hidden layers. The reason is that we want the network to be able to break the linearity, so that it can do something more interesting than just multiplying the variables by different parameters: it can combine them in different ways. The idea is to allow the network to simulate any function, even non-linear ones.
So which activation should we choose to break the linearity? There are numerous options: sigmoid, tanh, ReLU, leaky ReLU, ELU, and many many more. The most common one today is ReLU, but this can change... it's mostly an empirical question of which ones work best and allow for the fastest learning, and it is up to the neural network architect to decide which function they want to use. As you get to know more of these functions and their pros and cons, you'll be able to try a few for every problem and see what works best for you.
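As a concrete sketch of the advice above, assuming scikit-learn is available: MLPRegressor applies the chosen activation (ReLU here) in the hidden layers and leaves the output layer linear (identity), which is what you want for a continuous regression target.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel()                    # a non-linear continuous target

reg = MLPRegressor(hidden_layer_sizes=(32, 32), activation="relu",
                   max_iter=5000, random_state=0)
reg.fit(X, y)                            # the output layer stays linear

print(reg.predict(np.array([[1.0]])))    # should be roughly sin(1.0) ~ 0.84
```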

Graphically, how does the non-linear activation function project the input onto the classification space?

I am having a very hard time visualizing how the activation function actually manages to classify non-linearly separable training data sets.
Why does the activation function (e.g tanh function) work for non-linear cases? What exactly happens mathematically when the activation function projects the input to output? What separates training samples of different classes, and how does this work if one had to plot this process graphically?
I've looked through numerous sources, but I just cannot easily grasp what exactly makes the activation function work for classifying training samples in a neural network, and I would like to be able to picture this in my mind.
The mathematical result behind neural networks is the Universal Approximation Theorem. Basically, sigmoidal functions (those which saturate on both ends, like tanh) are smooth, almost-piecewise-constant approximators. The more neurons you have, the better your approximation is.
This picture was taken from the article A visual proof that neural nets can compute any function. Make sure to check that article; it has other examples and interactive applets.
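In the same spirit as that visual proof, here is a small NumPy sketch (the target function and bump count are arbitrary choices): pairs of steep sigmoids act like "on/off" steps, their difference is a bump, and a sum of such bumps gives an almost-piecewise-constant approximation that improves as you add more neurons.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

f = lambda x: np.sin(2 * np.pi * x)       # target to approximate on [0, 1]

x = np.linspace(0, 1, 200)
n_bumps = 20
edges = np.linspace(0, 1, n_bumps + 1)
approx = np.zeros_like(x)

for left, right in zip(edges[:-1], edges[1:]):
    height = f(0.5 * (left + right))
    # A steep sigmoid "turns on" at the left edge, another "turns off" at the
    # right edge; their difference is a bump of the desired height
    approx += height * (sigmoid(500 * (x - left)) - sigmoid(500 * (x - right)))

print(np.max(np.abs(approx - f(x))))      # the error shrinks as n_bumps grows
```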
NNs actually create new features at each level by distorting the input space. Non-linear functions allow you to change the "curvature" of the target function, so further layers have a chance to make it linearly separable. If there were no non-linear functions, any combination of linear functions would still be linear, so there would be no benefit from having multiple layers. As a graphical example, consider this animation.
These pictures were taken from this article. Also check out that cool visualization applet.
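A short sketch of this idea, assuming scikit-learn is available: on two concentric circles, a purely linear classifier is stuck near chance level, while a one-hidden-layer MLP with tanh units warps the space enough to separate the classes.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = LogisticRegression().fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh",
                    max_iter=3000, random_state=0).fit(X, y)

print("linear model:", linear.score(X, y))  # around 0.5, i.e. chance level
print("tanh MLP    :", mlp.score(X, y))     # close to 1.0
```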
Activation functions have very little to do with classifying non-linearly separable sets of data.
Activation functions are used as a way to normalize signals at every step in your neural network. They typically have an infinite domain and a finite range. Tanh, for example, has a domain of (-∞,∞) and a range of (-1,1). The sigmoid function maps the same domain to (0,1).
You can think of this as a way of enforcing equality across all of your learned features at a given neural layer (a.k.a. feature scaling). Since the input domain is not known beforehand, it's not as simple as regular feature scaling (as in linear regression), and thus activation functions must be used. The effects of the activation function are compensated for when computing errors during back-propagation.
Back-propagation is a process that applies error to the neural network. You can think of this as a positive reward for the neurons that contributed to the correct classification and a negative reward for the neurons that contributed to an incorrect classification. This contribution is often known as the gradient of the neural network. The gradient is, effectively, a multi-variable derivative.
When back-propagating the error, each individual neuron's contribution to the gradient involves the activation function's derivative at the input value for that neuron. Sigmoid is a particularly interesting function because its derivative is extremely cheap to compute from its output: specifically, s'(x) = s(x) * (1 - s(x)).
Here is an example image (found by google image searching: neural network classification) that demonstrates how a neural network might be superimposed on top of your data set:
I hope that gives you a relatively clear idea of how neural networks might classify non-linearly separable datasets.

Why use tanh for activation function of MLP?

I'm personally studying theories of neural networks and have some questions.
In many books and references, the hyperbolic tangent (tanh) function is used as the activation function of the hidden layer.
The books give a really simple reason: linear combinations of tanh functions can describe nearly any shape of function to within a given error.
But, there came a question.
Is this the real reason why the tanh function is used?
If so, is it the only reason why the tanh function is used?
If so, is tanh the only function that can do that?
If not, what is the real reason?..
I'm stuck here and keep thinking... please help me out of this mental(?...) trap!
Most of the time, tanh converges more quickly than the sigmoid/logistic function and gives better accuracy [1]. However, the rectified linear unit (ReLU), proposed by Hinton [2], has been shown to train about six times faster than tanh [3] while reaching the same training error. You can refer to [4] to see what benefits ReLU provides.
Based on about two years of machine-learning experience, I want to share some strategies that most papers use, along with my own experience in computer vision.
Normalizing input is very important
Normalizing well gives better performance and quicker convergence. Most of the time we subtract the mean value so that the input mean is zero, which prevents the weights from all changing in the same direction and converging slowly [5]. Recently Google also pointed out this phenomenon, called internal covariate shift, when training deep networks, and proposed batch normalization [6] to normalize each vector to have zero mean and unit variance.
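A minimal sketch of the zero-mean / unit-variance ("z-score") input normalization mentioned above (the raw data here is made up): the statistics are computed on the training set and would be reused for any new data.

```python
import numpy as np

# Raw training inputs on a wildly different scale (made-up data)
X_train = np.random.default_rng(0).uniform(0, 100, size=(1000, 3))

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_norm = (X_train - mean) / std           # zero mean, unit variance per feature
print(X_norm.mean(axis=0).round(3))       # approximately [0, 0, 0]
print(X_norm.std(axis=0).round(3))        # approximately [1, 1, 1]
```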
More data more accuracy
More training data helps cover the feature space well and prevents overfitting. In computer vision, if the training data is not enough, the most commonly used techniques for enlarging the training dataset are data augmentation and synthesizing training data.
Choosing a good activation function allows the network to train better and more efficiently.
The ReLU non-linear activation has worked better and produced state-of-the-art results in deep learning and MLPs. Moreover, it has some benefits, e.g. it is simple to implement and cheap to compute in back-propagation, which makes it efficient for training deeper neural networks. However, ReLU has zero gradient and does not train when the unit's activation is zero. Hence some modified ReLUs have been proposed, e.g. Leaky ReLU and Noisy ReLU; the most popular is PReLU [7], proposed by Microsoft, which generalizes the traditional rectified unit.
Others
Choose a large initial learning rate, as long as training does not oscillate or diverge, so as to find a better minimum.
Shuffle the training data.
In truth both tanh and the logistic function can be used. The idea is that you can map any real number in (-∞, ∞) to a number in (-1, 1) or (0, 1) for tanh and the logistic function, respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function.
Now, the preference for tanh over the logistic function is that the former is symmetric about 0 while the latter is not. This makes the logistic function more prone to saturation of the later layers, making training more difficult.
To add to the already existing answer, the preference for symmetry around 0 isn't just a matter of aesthetics. An excellent text by LeCun et al., "Efficient BackProp", shows in great detail why it is a good idea for the input, output and hidden layers to have mean values of 0 and a standard deviation of 1.
Update in an attempt to appease commenters: based purely on observation, rather than the theory covered above, tanh and ReLU activation functions are more performant than sigmoid. Sigmoid also seems to be more prone to local optima, or at least extended 'flat line' issues. For example, try limiting the number of features to force logic into the network nodes on XOR: sigmoid rarely succeeds, whereas tanh and ReLU have more success.
Tanh maybe seems slower than ReLU for many of the given examples, but it produces more natural-looking fits for the data using only linear inputs, as you describe - for example, a circle versus a square/hexagon shape.
http://playground.tensorflow.org/ <- this site is a fantastic visualisation of activation functions and other parameters to neural network. Not a direct answer to your question but the tool 'provides intuition' as Andrew Ng would say.
Many of the answers here describe why tanh, i.e. (e^(2x) - 1) / (e^(2x) + 1), is preferable to the sigmoid/logistic function, 1 / (1 + e^(-x)), but it should be noted that there is a good reason why these are the two most common alternatives, and it is worth understanding: during training of an MLP using the back-propagation algorithm, the algorithm requires the value of the derivative of the activation function at the point of activation of each node in the network. While this could generally be calculated for most plausible activation functions (except those with discontinuities, which are a bit of a problem), doing so often requires expensive computations and/or storing additional data (e.g. the value of the input to the activation function, which is not otherwise required after the output of each node is calculated).
Tanh and the logistic function, however, both have very simple and efficient formulas for their derivatives that can be calculated from the output of the function itself: if the node's weighted sum of inputs is v and its output is u, we need du/dv, which can be calculated from u rather than the more traditional v. For tanh it is 1 - u^2 and for the logistic function it is u * (1 - u). This fact makes these two functions more efficient to use in a back-propagation network than most alternatives, so a compelling reason would usually be required to deviate from them.
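A quick numerical check of those two identities (the sample points are arbitrary): the derivatives computed only from the outputs u match a finite-difference derivative of the original functions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v = np.linspace(-4, 4, 9)                 # pre-activation values
u_sig, u_tanh = sigmoid(v), np.tanh(v)    # outputs u

# Derivatives du/dv computed only from the outputs, as used in backprop
d_sig = u_sig * (1.0 - u_sig)
d_tanh = 1.0 - u_tanh ** 2

# Compare against a central finite-difference derivative
eps = 1e-6
print(np.allclose(d_sig, (sigmoid(v + eps) - sigmoid(v - eps)) / (2 * eps)))   # True
print(np.allclose(d_tanh, (np.tanh(v + eps) - np.tanh(v - eps)) / (2 * eps)))  # True
```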
In theory I agree with the above responses. In my experience, though, some problems have a preference for sigmoid rather than tanh, probably due to the nature of those problems (since there are non-linear effects, it is difficult to understand why).
Given a problem, I generally optimize networks using a genetic algorithm. The activation function of each element of the population is chosen randomly from a set of possibilities (sigmoid, tanh, linear, ...). For about 30% of the classification problems, the best element found by the genetic algorithm has sigmoid as its activation function.
In deep learning the ReLU has become the activation function of choice because the math is much simpler than for sigmoidal activation functions such as tanh or the logistic function, especially if you have many layers. To assign weights using backpropagation, you normally calculate the gradient of the loss function and apply the chain rule for the hidden layers, meaning you need the derivative of the activation functions. ReLU is a ramp function: it has a flat part where the derivative is 0 and a sloped part where the derivative is 1. This makes the math really easy. If you use the hyperbolic tangent you might run into the vanishing gradient problem: if x is smaller than -2 or bigger than 2, the derivative gets really small, and your network might not converge or you might end up with a dead neuron that does not fire anymore.
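To see the contrast numerically, here is a small sketch (the sample points are arbitrary) comparing the gradients: ReLU's derivative is exactly 1 on its active side, while the sigmoid and tanh derivatives shrink rapidly once |x| exceeds about 2.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])

relu_grad = (x > 0).astype(float)               # 0 on the flat part, 1 on the ramp
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))  # at most 0.25, tiny for |x| > 2
tanh_grad = 1.0 - np.tanh(x) ** 2               # tiny for |x| > 2

print(relu_grad)
print(np.round(sigmoid_grad, 4))
print(np.round(tanh_grad, 4))
```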
