Why shouldn't we use multiple activation functions in the same layer?

I'm a newbie when it comes to ML and neural nets; I've been studying mainly online through Coursera videos and a bit of Kaggle/GitHub. All the examples or cases where I've seen neural networks being applied have one thing in common - they use a specific type of activation function in all the nodes pertaining to a specific layer.
From what I understand, each node uses a non-linear activation function to learn about a particular pattern in the data. If that's the case, why not use multiple types of activation functions?
I did find one link, which basically says that it's easier to manage a network if we use just one activation function per layer. Any other benefits?

The purpose of an activation function is to introduce non-linearity into a neural network. See this answer for more insight on why our deep neural networks would not actually be deep without non-linearity.
Activation functions do their job by controlling the outputs of the neurons. Sometimes they apply a simple threshold, as ReLU does, which can be coded as follows:

def relu(x):
    # pass positive inputs through unchanged, clamp everything else to zero
    if x > 0:
        return x
    return 0
At other times they behave in more complicated ways, such as tanh(x) or sigmoid(x). See this answer for more on different sorts of activations.
I would also like to add that I agree with @Joe: an activation function does not learn a particular pattern, it affects the way that a neural network learns multiple patterns. Each activation function has its own kind of effect on the output.
Thus, one benefit of not using multiple activation functions in a single layer is the predictability of their effect. We know what ReLU or Sigmoid does to the output of a convolutional filter, for example. But do we know the effect of their cascaded use? In which order, by the way: does ReLU come first, or is it better to apply Sigmoid first? Does it matter?
If we want to benefit from combining activation functions, all of these questions (and maybe many more) need to be answered with scientific evidence. Tedious experiments and evaluations would have to be done to get meaningful results. Only then would we know what it means to use them together, and after that, maybe a new type of activation function will arise and there will be a new name for it.
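To make the ordering question concrete, here is a tiny throwaway NumPy sketch (not from the original answer) showing that the two cascaded orders do not even compute the same thing:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# For negative inputs, sigmoid(relu(x)) is pinned at 0.5,
# while relu(sigmoid(x)) still varies with x.
print(sigmoid(relu(x)))  # approx [0.5, 0.5, 0.5, 0.622, 0.881]
print(relu(sigmoid(x)))  # approx [0.119, 0.378, 0.5, 0.622, 0.881]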

Related

Question about neural network training: the gradient of the same module used multiple times in one iteration

When training a neural network, if the same module is used multiple times in one iteration, does the gradient of the module need special processing during backpropagation?
for example:
One Deformable Compensation module is used three times in this model, which means the three uses share the same weights.
What will happen when I use loss.backward()?
Will loss.backward() work correctly?
The nice thing about autograd and backward passes is that the underlying framework is not "algorithmic", but rather a mathematical one: it implements the chain rule of derivatives. Therefore, there are no "algorithmic" considerations of "shared weights" or "weighting different layers", it's pure math. The backward pass provides the derivative of your loss function w.r.t. the weights in a purely mathematical way.
Sharing weights can be done globally (e.g., when training Siamese networks), on a "layer level" (as in your example), but also within a layer. When you think about it, convolution layers and recurrent layers are a fancy way of locally sharing weights.
Naturally, pytorch (as well as all other DL frameworks) can trivially handle these cases.
As long as your "deformable compensation" layer is correctly implemented -- pytorch will take care of the gradients for you, in a mathematically correct manner, thanks to the chain rule.
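As a minimal sketch (the module below is just a toy stand-in for the "Deformable Compensation" block, with made-up sizes), applying the very same module several times and then calling loss.backward() simply accumulates the gradient contributions from each use:

import torch
import torch.nn as nn

# Toy stand-in for the shared block; the name and sizes are made up.
shared = nn.Linear(8, 8)

x = torch.randn(4, 8)

# The same module (same weights) is applied three times.
y = shared(shared(shared(x)))
loss = y.pow(2).mean()

loss.backward()

# shared.weight.grad now holds the sum of the gradients from all three
# applications -- the chain rule takes care of the sharing automatically.
print(shared.weight.grad.shape)  # torch.Size([8, 8])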

What do non-linear activation functions do at a fundamental level in neural networks?

I've been trying to find out what exactly non-linear activation functions do when implemented in a neural network.
I know they modify the output of a neuron, but how and for what purpose?
I know they add non-linearity to otherwise linear neural networks, but for what purpose?
What exactly do they do to the output of each layer? Is it some form of classification?
I want to know what exactly their purpose is within neural networks.
Wikipedia says that "the activation function of a node defines the output of that node given an input or set of inputs." This article states that the activation function checks whether a neuron has "fired" or not. I've looked at a bunch more articles and other questions on Stack Overflow as well, but none of them gave a satisfying answer as to what is occurring.
The main reason for using non-linear activation functions is to be able to learn non-linear target functions, i.e. to learn a non-linear relationship between the inputs and outputs. If a network consists of only linear activation functions, it can only model a linear relationship between the inputs and outputs, which is insufficient for almost all applications.
I am by no means an ML expert, so maybe this video can explain it better: https://www.coursera.org/lecture/neural-networks-deep-learning/why-do-you-need-non-linear-activation-functions-OASKH
Hope this helps!
First of all, it's better to have a clear idea of why we use activation functions.
We use activation functions to propagate the output of one layer’s nodes to the next layer. Activation functions are scalar-to-scalar functions, and we use them on the hidden neurons of a neural network to introduce non-linearity into the network’s model. So, at a simpler level, activation functions are used to introduce non-linearity into the network.
So what is the use of introducing non-linearity? First, non-linearity means that the output cannot be reproduced from a linear combination of the inputs. Therefore, without a non-linear activation function, a neural network, even one with hundreds of hidden layers, would still behave like a single-layer perceptron. The reason is that however you combine the layers, the result is still only a linear function of the inputs.
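A quick NumPy sketch (purely illustrative, with arbitrary sizes) of why stacking layers without a non-linearity collapses into a single linear map:

import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between.
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x  # a single layer with weights W2 @ W1

print(np.allclose(two_layers, one_layer))  # True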
Anyhow, for a deeper understanding, I suggest you look at this Medium post as well as this video by Andrew Ng himself.
From Andrew Ng's video, let me rephrase some important parts below.
...if you don't have an activation function, then no matter how many layers your neural network has, all it's doing is just computing a linear activation function. So you might as well not have any hidden layers.
...it turns out that if you have a linear activation function here and a sigmoid function here, then this model is no more expressive than standard logistic regression without any hidden layer.
...so unless you throw a non-linear in there, then you're not computing more interesting functions even as you go deeper in the network.

What should be the activation function for a neural network in the case of regression?

My question is based on the understanding from https://www.youtube.com/watch?v=oYbVFhK_olY&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=43
In a neural network a neuron is activated by a threshold (activation) function, which in the above example is a sigmoid function. For regression problems, do we need an activation function?
Yes, you need activation functions in neural network regression too. If the network is regressing to a continuous output range, a sigmoid-like function can be used as the output activation; avoid threshold-like activation functions. Apart from the output node, you will also need activation functions across all the intermediate nodes of the input and hidden layers.
For the hidden layers you should use ReLU, and don't use an activation function for the output layer.
Please check this document: Linear Regression
Let's get some terminology clear:
By "regression" we usually refer to the simplest model, where each independent variable has a parameter and we multiply the variable by its parameter. In linear regression we stop there, with a continuous output. In logistic regression we also use an "activation function", usually the sigmoid, to squeeze the output into the range (0, 1), and then we can use a threshold (like x > 0.5) to decide whether the output is 0 or 1.
So regression is like a neural network with 0 hidden layers.
Now I assume that what you meant is that you want the neural network to have a continuous output, like linear regression, so naturally we don't want to use any "distorting" function on the output layer.
However, we do want to use an activation function on the neurons in the hidden layers. The reason is that we want the network to be able to break the linearity, so it can do something more interesting than just multiplying the variables by different parameters: it can combine them in different ways. The idea is to allow the network to approximate any function, even non-linear ones.
So which activation should we choose to break the linearity? There are numerous options: sigmoid, tanh, ReLU, leaky ReLU, ELU, and many more. The most common one today is ReLU, but this can change; it's mostly an empirical question of which ones work best and allow the fastest learning, and it is up to the neural network architect to decide which function to use. As you get to know more of these functions and their pros and cons, you'll be able to try a few for every problem and see what works best for you. A sketch of the typical setup follows below.
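For illustration only (the layer sizes here are arbitrary), a typical regression network in PyTorch would use ReLU in the hidden layers and leave the output layer linear:

import torch
import torch.nn as nn

# ReLU breaks the linearity in the hidden layers; the output layer has no
# activation, so predictions can take any continuous value.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # no activation on the regression output
)

x = torch.randn(32, 10)
y_hat = model(x)  # unbounded, continuous predictions
loss = nn.MSELoss()(y_hat, torch.randn(32, 1))
loss.backward()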

extrapolation with recurrent neural network

I wrote a simple recurrent neural network (7 neurons, each one initially connected to all the others) and trained it using a genetic algorithm to learn "complicated", non-linear functions like 1/(1+x^2). As the training set, I used 20 values within the range [-5,5] (I tried to use more than 20, but the results did not change dramatically).
The network can learn this range pretty well, and when given examples of other points within this range, it can predict the value of the function. However, it cannot extrapolate correctly and predict the values of the function outside the range [-5,5]. What are the reasons for that, and what can I do to improve its extrapolation abilities?
Thanks!
Neural networks are not extrapolation methods (no matter whether recurrent or not); this is completely outside their capabilities. They are used to fit a function to the provided data, and they are completely free to build the model outside the subspace populated with training points. So, in a not very strict sense, one should think of them as an interpolation method.
To make things clear, a neural network should be capable of generalizing the function inside the subspace spanned by the training samples, but not outside of it.
A neural network is trained only in the sense of consistency with the training samples, while extrapolation is something completely different. A simple example from "H. Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8" shows how NNs behave in this context.
All of these networks are consistent with the training data, but can do anything outside of this subspace.
You should rather reconsider your problem's formulation; if it can be expressed as a regression or classification problem then you can use a NN, otherwise you should think about some completely different approach.
The only thing which can be done to somehow "correct" what is happening outside the training set is to:
- add artificial training points in the desired subspace (but this simply grows the training set, and again, outside of this new set the network's behaviour is "random"), or
- add strong regularization, which will force the network to create a very simple model; but the model's complexity will not guarantee any extrapolation strength, as two models of exactly the same complexity can have, for example, completely different limits at -/+ infinity.
Combining the above two steps can help build a model which to some extent "extrapolates", but this, as stated before, is not the purpose of a neural network.
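If you want to see this behaviour yourself, here is a small sketch (using scikit-learn's MLPRegressor rather than a recurrent network, purely as an illustration) that fits 1/(1+x^2) on [-5, 5] and then queries points outside that range; the in-range predictions are close, while the out-of-range ones typically are not:

import numpy as np
from sklearn.neural_network import MLPRegressor

x_train = np.linspace(-5, 5, 200).reshape(-1, 1)
y_train = 1.0 / (1.0 + x_train.ravel() ** 2)

net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
net.fit(x_train, y_train)

print(net.predict([[0.0], [4.0]]))    # near the true values 1.0 and ~0.059
print(net.predict([[10.0], [20.0]]))  # usually far from the true ~0.0099 and ~0.0025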
As far as I know this is only possible with networks which do have the echo property. See Echo State Networks on scholarpedia.org.
These networks are designed for arbitrary signal learning and are capable of remembering their behavior.
You can also take a look at this tutorial.
The nature of your post(s) suggests that what you're referring to as "extrapolation" would be more accurately defined as "sequence recognition and reproduction." Training networks to recognize a data sequence, with or without a time series (dt), is pretty much the purpose of a Recurrent Neural Network (RNN).
The target function shown in your post has outputs bounded between 0 and 1, and it is symmetric in x (since only x^2 appears, x effectively behaves like abs(x)). So, first things first, be certain your input layer can easily distinguish between negative and positive inputs (if it must).
Next, the number of neurons is not nearly as important as how they're layered and interconnected. How many of the 7 were used for the sequence inputs? What type of network was used and how was it configured? Network feedback will reveal the ratios, proportions, relationships, etc., and aid in adjusting the network weights to match the sequence. Feedback can also take the form of a forward feed, depending on the type of network used to create the RNN.
Producing an 'observable' network for the bell-shaped function 1/(1+x^2) should be a decent exercise to cut your teeth on RNNs. 'Observable' meaning the network is capable of producing results for any input value(s), even though its training data is (far) smaller than the set of all possible inputs. I can only assume that this was your actual objective, as opposed to "extrapolation."

Activation function for neural network

I need help in figuring out a suitable activation function. I'm training my neural network to detect a piano note, so in this case I can have only one output: either the note is there (1) or the note is not present (0).
Say I introduce a threshold value of 0.5 and say that if the output is greater than 0.5 the desired note is present, and if it's less than 0.5 the note isn't present: what type of activation function can I use? I assume it should be a hard limit, but I'm wondering if sigmoid can also be used.
To exploit their full power, neural networks require continuous, differentiable activation functions. Thresholding is not a good choice for multilayer neural networks. Sigmoid is quite a generic function, which can be applied in most cases. When you are doing binary classification (0/1 values), the most common approach is to define one output neuron and simply choose class 1 iff its output is bigger than a threshold (typically 0.5).
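A minimal NumPy sketch of that approach (the numbers are made up): the network is trained with the smooth sigmoid output, and the 0.5 threshold is only applied afterwards to make the hard present/absent decision:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pretend these are the pre-activation values of the single output neuron
# for three examples.
logits = np.array([-2.1, 0.3, 1.7])

probabilities = sigmoid(logits)      # smooth and differentiable -> trainable
predictions = probabilities > 0.5    # hard 0/1 decision applied afterwards

print(probabilities)  # approx [0.109, 0.574, 0.846]
print(predictions)    # [False  True  True]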
EDIT
As you are working with quite simple data (two input dimensions and two output classes), it seems the best option is to actually abandon neural networks and start with data visualization. 2D data can simply be plotted on the plane (with different colors for different classes). Once you do that, you can investigate how hard it is to separate one class from another. If the data is arranged such that you can simply put a line separating the classes, a linear support vector machine would be a much better choice (as it will guarantee one global optimum). If the data seems really complex, and the decision boundary has to be some curve (or even a set of curves), I would suggest going for an RBF SVM, or at least a regularized form of neural network (so its training is at least fairly repeatable). If you decide on a neural network, the situation is quite similar: if the data is simple to separate on the plane, you can use simple (linear/threshold) activation functions. If it is not linearly separable, use sigmoid or hyperbolic tangent, which will ensure non-linearity in the decision boundary.
UPDATE
Many things have changed over the last two years. In particular (as suggested in the comment by @Ulysee), there is a growing interest in functions that are differentiable "almost everywhere", such as ReLU. These functions have a valid derivative over most of their domain, so the probability that we will ever need the derivative exactly at the problematic point is zero. Consequently, we can still use classical methods and, for the sake of completeness, put a zero derivative if we need to compute ReLU'(0). There are also fully differentiable approximations of ReLU, such as the softplus function.
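For instance (a small illustrative snippet, not part of the original answer), softplus(x) = log(1 + exp(x)) closely tracks ReLU while staying smooth everywhere, and its derivative is exactly the sigmoid:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x):
    # fully differentiable approximation of ReLU
    return np.log1p(np.exp(x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))      # [0. 0. 0. 1. 5.]
print(softplus(x))  # approx [0.0067, 0.313, 0.693, 1.313, 5.0067]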
The Wikipedia article has some useful "soft" continuous threshold functions - see Figure Gjl-t(x).svg.
en.wikipedia.org/wiki/Sigmoid_function.
Following Occam's Razor, the simpler model using one output node is a good starting point for binary classification, where one class label is mapped to the output node when activated, and the other class label for when the output node is not activated.
