Why use softmax only in the output layer and not in hidden layers? - machine-learning

Most examples of neural networks for classification tasks I've seen use a softmax layer as the output activation function. Normally, the hidden units use a sigmoid, tanh, or ReLU function as the activation function. Using the softmax function in the hidden layers would - as far as I know - work out mathematically too.
What are the theoretical justifications for not using the softmax function as hidden layer activation functions?
Are there any publications about this, something to quote?

I haven't found any publications about why using softmax as an activation in a hidden layer is not the best idea (except a Quora question, which you have probably already read), but I will try to explain why it is not the best idea in this case:
1. Variable independence: a lot of regularization and effort goes into keeping your variables independent, uncorrelated and fairly sparse. If you use a softmax layer as a hidden layer, all of its nodes (hidden variables) become linearly dependent, which may cause many problems and poor generalization.
2. Training issues: imagine that, to make your network work better, you have to make some of the activations in your hidden layer a little lower. You then automatically push the rest of them to a higher mean activation level, which might in fact increase the error and harm training.
3. Mathematical issues: by placing constraints on the activations of your model you decrease its expressive power without any logical justification. Forcing the activations to always add up to the same value is not worth it, in my opinion.
4. Batch normalization does it better: one may argue that a constant mean output from a layer may be useful for training. On the other hand, a technique called Batch Normalization has already been shown to work better, whereas setting softmax as the activation function in a hidden layer has been reported to decrease accuracy and learning speed.
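To make points 1 and 2 concrete, here is a minimal NumPy sketch. The pre-activation values are made up for illustration; the only claim is that a softmax layer ties its units together, so lowering one unit raises all the others.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical pre-activations of a 4-unit hidden layer.
pre = np.array([1.0, 0.5, -0.2, 2.0])
before = softmax(pre)

# Lower only the first pre-activation.
pre_lowered = pre.copy()
pre_lowered[0] -= 1.0
after = softmax(pre_lowered)

print(before.sum(), after.sum())  # both sums equal 1 up to rounding
print(after[1:] > before[1:])     # the untouched units all increased
```

With an element-wise activation such as ReLU or tanh, changing one pre-activation would leave the other units' outputs unchanged.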

Actually, Softmax functions are already used deep within neural networks, in certain cases, when dealing with differentiable memory and with attention mechanisms!
Softmax layers can be used within neural networks such as Neural Turing Machines (NTM) and their improved successor, the Differentiable Neural Computer (DNC).
To summarize, those architectures are RNNs/LSTMs that have been modified to contain a differentiable (neural) memory matrix which can be written to and read from across time steps.
Quickly explained, the softmax function here normalizes a fetch from the memory, along with other similar quirks for content-based addressing of the memory. On that topic, I really liked this article, which illustrates the operations in an NTM and other recent RNN architectures with interactive figures.
Moreover, softmax is used in attention mechanisms for, say, machine translation, such as in this paper. There, the softmax normalizes the places over which attention is distributed in order to "softly" retain the maximal place to pay attention to: that is, to also pay a little bit of attention elsewhere in a soft manner. However, this could be considered a mini neural network that deals with attention within the big one, as explained in the paper. Therefore, it is debatable whether softmax is used only at the end of neural networks.
Hope it helps!
Edit - More recently, it's even possible to see Neural Machine Translation (NMT) models where only attention (with softmax) is used, without any RNN or CNN: http://nlp.seas.harvard.edu/2018/04/03/attention.html
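As a rough illustration of how softmax sits in the middle of such a model, here is a minimal NumPy sketch of one attention read. The shapes, the random data and the scaled dot-product score are assumptions made for the example; they are not taken from the NTM or translation papers mentioned above.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    z = np.exp(x - np.max(x))
    return z / z.sum()

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))    # 5 memory slots / source positions, dimension 8
values = rng.normal(size=(5, 8))  # content stored at each slot
query = rng.normal(size=8)        # what the network is currently looking for

# Similarity between the query and every key (scaled dot product assumed here).
scores = keys @ query / np.sqrt(keys.shape[1])

# Softmax turns the scores into soft "where to look" weights.
weights = softmax(scores)

# The read result is a weighted average of the values.
context = weights @ values

print(weights, weights.sum())
```

The weights are all positive and sum to 1, so the largest score dominates while every other slot still receives a little attention.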

Use a softmax activation wherever you want to model a multinomial distribution. This may be (usually) an output layer y, but can also be an intermediate layer, say a multinomial latent variable z. As mentioned in this thread for outputs {o_i}, sum({o_i}) = 1 is a linear dependency, which is intentional at this layer. Additional layers may provide desired sparsity and/or feature independence downstream.
Page 198 of Deep Learning (Goodfellow, Bengio, Courville)
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.
Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.

The softmax function is used for the output layer only (at least in most cases) to ensure that the components of the output vector sum to 1 (for clarity, see the formula of the softmax cost function). Each component can then be read as the probability of occurrence of the corresponding class, and hence the probabilities (the output components) sum to 1.

The softmax function is one of the most important output functions used in deep learning with neural networks (see Understanding Softmax in minute by Uniqtech). It is applied where there are three or more classes of outcomes. The softmax formula raises e to the power of each score and divides the result by the sum of e raised to the power of every score. For example, if I know the logit scores of four classes to be [3.00, 2.0, 1.00, 0.10], the softmax function can be applied as follows to obtain the output probabilities:
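import numpy as np

def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the result.
    x = np.asarray(x, dtype=float)
    z = np.exp(x - np.max(x))
    return z / z.sum()

scores = [3.00, 2.0, 1.00, 0.10]
print(softmax(scores))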
Output: probabilities (p) = 0.642 0.236 0.087 0.035
The sum of all probabilities (p) = 0.642 + 0.236 + 0.087 + 0.035 = 1.00. You can substitute any values you like for the scores above and you will get different values, but the sum of all the values (probabilities) will still be equal to one. That makes sense, because the probabilities of all outcomes must sum to one; softmax thereby turns logit scores into probability scores, which are easier to use for prediction. Finally, the softmax output can help us understand and interpret the multinomial logit model.

Related

Is softmax used when only the most probable class will be used?

I have a deep learning classification problem with 17 classes and I am working in PyTorch. The architecture includes the cross-entropy loss, applied after a linear layer.
I believe that, normally, one computes a softmax activation and interprets it as the probability of the corresponding output classes. But softmax is a monotonic function and it seems that, if I just want the most probable class, I can simply choose the class with the maximum score after the linear layer, leaving the softmax out.
Given that softmax is the default, widely used activation in classification problems, I wonder if I am missing something important here. Can anyone guide me?
Note that I have googled a large number of sites but, as far as I could understand, none answering this basic question (although there was a lot of information that was provided).
Thanks
You are right in that you don't need softmax to predict the most probable class - you can indeed just take the class with the highest score.
However, you need softmax at training time to calculate the loss function (cross-entropy), because it works well only with probability distributions over classes. The softmax transform guarantees that the output of your network does indeed look like a distribution: all scores are positive and they sum to 1. If they weren't positive, you could not calculate cross-entropy, because it involves logarithms. And if the scores didn't sum to one (or any other constant), then the model could minimize the loss by making all the scores infinitely large, without actually learning anything useful.
Moreover, softmax can be useful at prediction time as well, because when you report a probability instead of just a score, you can interpret it as confidence: e.g. the model is 98% sure of its prediction.
In some cases, it is not the most probable class that you are interested in. E.g. if you do credit scoring, then even a low probability of default (e.g. 20%) may be high enough to reject a loan application. In such cases, instead of the most probable class you want to look at the probabilities themselves - and softmax helps to estimate them correctly.
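Both halves of this answer can be checked with a small NumPy sketch. The scores are random and the helper name `cross_entropy_from_logits` is made up for the example; it simply computes -log(softmax(logits)[target]) in a numerically stable way.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    z = np.exp(x - np.max(x))
    return z / z.sum()

def cross_entropy_from_logits(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits)
    return np.log(np.exp(shifted).sum()) - shifted[target]

rng = np.random.default_rng(0)
logits = rng.normal(size=17)  # hypothetical scores for 17 classes
probs = softmax(logits)

# Prediction: softmax is monotonic, so it never changes the winning class.
same_prediction = np.argmax(logits) == np.argmax(probs)

# Training: the loss still needs the normalized distribution.
loss = cross_entropy_from_logits(logits, target=3)

print(same_prediction, loss, -np.log(probs[3]))
```

In PyTorch specifically, `torch.nn.CrossEntropyLoss` takes the raw scores from the linear layer and applies log-softmax internally, so the softmax is still part of training even when no explicit softmax layer appears in the model.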

Sigmoid activation for multi-class classification?

I am implementing a simple neural net from scratch, just for practice. I have got it working fine with sigmoid, tanh and ReLU activations for binary classification problems. I am now attempting to use it for multi-class, mutually exclusive problems. Of course, softmax is the best option for this.
Unfortunately, I have had a lot of trouble understanding how to implement softmax, cross-entropy loss and their derivatives in backprop. Even after asking a couple of questions here and on Cross Validated, I can't get any good guidance.
Before I try to go further with implementing softmax, is it possible to somehow use sigmoid for multi-class problems (I am trying to predict 1 of n characters, which are encoded as one-hot vectors)? And if so, which loss function would be best? I have been using the squared error for all binary classifications.
Your question is about the fundamentals of neural networks, and therefore I strongly suggest you start here (Michael Nielsen's book).
It is a Python-oriented book with graphical, textual and formula-based explanations - great for beginners. I am confident that you will find this book useful for your understanding. Look at chapters 2 and 3 to address your problems.
Addressing your question about sigmoids: it is possible to use them for multi-class predictions, but it is not recommended. Consider the following facts.
Sigmoids are activation functions of the form 1/(1+exp(-z)), where z is the dot product of the previous hidden layer (or the inputs) with a row of the weight matrix, plus a bias (reminder: z = w_i . x + b, where w_i is the i-th row of the weight matrix). This activation is independent of the other rows of the matrix.
Classification tasks are about categories. Without any prior knowledge, and most of the time even with it, categories have no order-value interpretation; predicting apple instead of orange is no worse than predicting banana instead of nuts. Therefore, one-hot encoding of categories usually performs better than predicting a category number using a single activation function.
To recap, we want an output layer with as many neurons as there are categories, and sigmoids are independent of each other given the previous layer's values. We would also like to predict the most probable category, which implies that we want the activations of the output layer to have the meaning of a probability distribution. But sigmoids are not guaranteed to sum to 1, while softmax activations are.
Using an L2 loss function is also problematic because of the vanishing gradient issue. In short, the derivative of the loss is (sigmoid(z)-y) . sigmoid'(z) (error times the derivative), and the sigmoid'(z) factor makes this quantity small, even more so when the sigmoid is close to saturation. You can choose cross entropy instead, or a log-loss.
EDIT:
Corrected phrasing about ordering the categories. To clarify, classification is a general term for many tasks related to what we use today as categorical predictions for definite finite sets of values. As of today, using softmax in deep models to predict these categories in a general "dog/cat/horse" classifier, together with one-hot encoding and cross entropy, is very common practice. It is reasonable to use that if the aforementioned is correct. However, there are (many) cases where it doesn't apply, for instance when trying to balance the data. For some tasks, e.g. semantic segmentation tasks, categories can have a meaningful ordering/distance between them (or their embeddings). So please choose the tools for your applications wisely, understanding what they are doing mathematically and what their implications are.
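Since the question mentions trouble with the derivatives, here is a minimal NumPy sketch comparing the two output-layer gradients discussed above. The pre-activations and target are made up; the softmax plus cross-entropy gradient with respect to z is the well-known p - y, and the finite-difference check is only there to confirm it numerically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([6.0, -1.0, 0.5])  # hypothetical output-layer pre-activations
y = np.array([0.0, 1.0, 0.0])   # one-hot target: class 1 is correct

# Sigmoid outputs are computed independently, so they need not sum to 1.
a = sigmoid(z)
sigmoid_sum = a.sum()

# Squared error 0.5 * sum((a - y)^2) with sigmoid outputs:
grad_mse = (a - y) * a * (1.0 - a)  # error times sigmoid'(z)

# Cross-entropy -sum(y * log(p)) with softmax outputs:
p = softmax(z)
grad_ce = p - y

# Finite-difference check of the softmax + cross-entropy gradient.
def ce_loss(z_vec):
    return -np.sum(y * np.log(softmax(z_vec)))

eps = 1e-6
basis = np.eye(3)
numeric = np.array([
    (ce_loss(z + eps * basis[i]) - ce_loss(z - eps * basis[i])) / (2 * eps)
    for i in range(3)
])

print(sigmoid_sum)
print(grad_mse)
print(grad_ce, numeric)
```

Unit 0 is confidently wrong here (large z, target 0). Its squared-error gradient is tiny because the sigmoid is saturated, while the cross-entropy gradient stays close to 1, which is the vanishing gradient point made above.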
What you ask is a very broad question.
As far as I know, when the number of classes is 2, the softmax function is the same as the sigmoid, so yes, they are related. Cross-entropy may be the best loss function.
For the backpropagation, it is not easy to find the formula... there are many ways. With the help of CUDA, I don't think it is necessary to spend much time on it if you just want to use an NN or CNN in the future. Trying a framework like TensorFlow or Keras (highly recommended for beginners) may help you.
There are also many other factors, like the method of gradient descent, the setting of hyperparameters...
Like I said, the topic is very broad. Why not try the machine learning/deep learning courses on Coursera or the Stanford online course?
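The two-class relationship mentioned above can be checked directly. A minimal sketch with made-up scores: the softmax probability of the first class equals the sigmoid of the difference between the two scores.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.3, -0.4])  # hypothetical scores for two classes
p = softmax(z)
q = sigmoid(z[0] - z[1])   # sigmoid of the score difference

print(p[0], q)             # same value
print(p[1], 1.0 - q)       # same value
```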

RL Activation Functions with Negative Rewards

I have a question regarding appropriate activation functions with environments that have both positive and negative rewards.
In reinforcement learning, our output, I believe, should be the expected reward for all possible actions. Since some options have a negative reward, we would want an output range that includes negative numbers.
This would lead me to believe that the only appropriate activation functions would be either linear or tanh. However, I see the use of ReLU in many RL papers.
So two questions:
If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (i.e. instead of [-1,0,1], [0, 1, 2]) in order for the model to leverage alternative activation functions?
Many RL papers indeed use ReLUs for most layers, but typically not for the final output layer. You mentioned the Human Level Control through Deep Reinforcement Learning paper and the Hindsight Experience Replay paper in one of the comments, but neither of those papers describes an architecture that uses ReLUs for the output layer.
In the Human Level Control through Deep RL paper, page 6 (after references), Section "Methods", the last paragraph of the part on "Model architecture" mentions that the output layer is a fully-connected linear layer (not a ReLU). So, indeed, all hidden layers can only have nonnegative activation levels (since they all use ReLUs), but the output layer can have negative activation levels if there are negative weights between the output layer and the last hidden layer. This is indeed necessary because the outputs it should create can be interpreted as Q-values (which may be negative).
In the Hindsight Experience Replay paper, they do not use DQN (like the paper above), but DDPG. This is an "Actor-Critic" algorithm. The "critic" part of this architecture is also intended to output values which can be negative, similar to the DQN architecture, so this also cannot use a ReLU for the output layer (but it can still use ReLUs everywhere else in the network). In Appendix A of this paper, under "Network architecture", it is also described that the actor output layer uses tanh as its activation function.
To answer your specific questions:
If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
Well, there are also other activations (leaky ReLU, sigmoid, lots of others probably). But a ReLU indeed cannot result in negative outputs.
Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (i.e. instead of [-1,0,1], [0, 1, 2]) in order for the model to leverage alternative activation functions?
Not 100% sure, possibly. It would often be difficult though, if you have no domain knowledge about how big or small rewards (and/or returns) can possibly get. I have a feeling it would typically be easier to simply end with one fully connected linear layer.
If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
No, this is only the case for the activation function of the output layer. For all other layers, it does not matter because you can have negative weights which means neurons with only positive values can still contribute with negative values to the next layer.
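A tiny NumPy sketch of this point, with hand-picked numbers that are purely illustrative: the hidden layer uses ReLU and is therefore non-negative, yet a linear output layer with some negative weights still produces a negative value. Putting a ReLU on the output would clip it.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical pre-activations of a 3-unit hidden layer.
pre_hidden = np.array([0.5, 2.0, -1.0])
hidden = relu(pre_hidden)            # [0.5, 2.0, 0.0], never negative

# Hypothetical output weights for 2 actions; note the negative weight.
W_out = np.array([[1.0, -1.0, 0.5],
                  [0.3,  0.2, 1.0]])
b_out = np.zeros(2)

q_linear = W_out @ hidden + b_out    # linear output layer: [-1.5, 0.55]
q_relu = relu(q_linear)              # ReLU output layer: negative value is lost

print(q_linear, q_relu)
```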

Why use tanh for activation function of MLP?

I'm personally studying the theory of neural networks and have some questions.
In many books and references, the hyperbolic tangent (tanh) function is used as the activation function of the hidden layer.
The books give a really simple reason: linear combinations of tanh functions can describe nearly any shape of function within a given error.
But this raised a question.
Is this the real reason why the tanh function is used?
If so, is it the only reason why the tanh function is used?
If so, is the tanh function the only function that can do that?
If not, what is the real reason?
I am stuck here and keep thinking about it... please help me out of this mental(?...) trap!
Most of the time tanh converges more quickly than the sigmoid (logistic) function and achieves better accuracy [1]. However, the rectified linear unit (ReLU) was more recently proposed by Hinton [2], and ReLU has been shown to train six times faster than tanh [3] to reach the same training error. You can refer to [4] to see what benefits ReLU provides.
Based on about 2 years of machine learning experience, I want to share some strategies that most papers use, along with my own experience in computer vision.
Normalizing input is very important
Normalizing well can give better performance and faster convergence. Most of the time we subtract the mean value so that the input has zero mean; this prevents the weights from all changing in the same direction, which makes convergence slow [5]. Recently Google also pointed out this phenomenon, as internal covariate shift, when training deep networks, and proposed batch normalization [6], which normalizes each vector to have zero mean and unit variance.
More data more accuracy
More training data can cover the feature space better and prevent overfitting. In computer vision, if there is not enough training data, the most commonly used techniques for enlarging the training set are data augmentation and synthesizing training data.
Choosing a good activation function allows training better and efficiently.
The ReLU nonlinear activation has worked better and produced state-of-the-art results in deep learning and MLPs. Moreover, it has some benefits, e.g. it is simple to implement and cheaper to compute in back-propagation, which makes it more efficient to train deeper neural nets. However, a ReLU unit gets zero gradient and does not train while it is inactive (its output is zero). Hence some modified ReLUs have been proposed, e.g. Leaky ReLU and Noisy ReLU, and the most popular method is PReLU [7], proposed by Microsoft, which generalizes the traditional rectified unit.
Others
Choose a large initial learning rate, as long as it does not oscillate or diverge, so as to find a better minimum.
shuffling data
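For the input normalization point above, a minimal NumPy sketch. The data is random and the variable names are made up; the assumption being illustrated is the common practice of computing the statistics on the training set only and reusing them for other data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical features
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Per-feature statistics, computed on the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8   # small constant guards against division by zero

X_train_n = (X_train - mean) / std
X_test_n = (X_test - mean) / std   # reuse the training statistics

print(X_train_n.mean(axis=0), X_train_n.std(axis=0))
```

After this step each training feature has mean close to 0 and standard deviation close to 1.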
In truth both the tanh and logistic functions can be used. The idea is that you can map any real number (in [-Inf, Inf]) to a number in [-1, 1] or [0, 1] for the tanh and the logistic function respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function.
As for the preference for tanh over the logistic function: the former is symmetric about 0 while the latter is not. This makes the latter more prone to saturating the later layers, which makes training more difficult.
To add to the already existing answer, the preference for symmetry around 0 isn't just a matter of aesthetics. An excellent text by LeCun et al., "Efficient BackProp", shows in great detail why it is a good idea for the input, output and hidden layers to have mean values of 0 and a standard deviation of 1.
Update in attempt to appease commenters: based purely on observation, rather than the theory that is covered above, Tanh and ReLU activation functions are more performant than sigmoid. Sigmoid also seems to be more prone to local optima, or at least to extended 'flat line' issues. For example, try limiting the number of features to force logic into network nodes in XOR: sigmoid rarely succeeds, whereas Tanh and ReLU have more success.
Tanh seems maybe slower than ReLU for many of the given examples, but produces more natural looking fits for the data using only linear inputs, as you describe. For example a circle vs a square/hexagon thing.
http://playground.tensorflow.org/ <- this site is a fantastic visualisation of activation functions and other parameters to neural network. Not a direct answer to your question but the tool 'provides intuition' as Andrew Ng would say.
Many of the answers here describe why tanh (i.e. (e^2x - 1) / (e^2x + 1)) is preferable to the sigmoid/logistic function (1 / (1 + e^-x)), but it should be noted that there is a good reason, which should be understood, why these are the two most common alternatives: during training of an MLP using the back propagation algorithm, the algorithm requires the value of the derivative of the activation function at the point of activation of each node in the network. While this could generally be calculated for most plausible activation functions (except those with discontinuities, which is a bit of a problem for those), doing so often requires expensive computations and/or storing additional data (e.g. the value of the input to the activation function, which is not otherwise required after the output of each node is calculated). Tanh and the logistic function, however, both have very simple and efficient calculations for their derivatives that can be computed from the output of the functions; i.e. if the node's weighted sum of inputs is v and its output is u, we need to know du/dv, which can be calculated from u rather than the more traditional v: for tanh it is 1 - u^2 and for the logistic function it is u * (1 - u). This fact makes these two functions more efficient to use in a back propagation network than most alternatives, so a compelling reason would usually be required to deviate from them.
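The two derivative identities can be verified numerically. A minimal NumPy sketch, using the same v (weighted input) and u (output) naming as the answer; the finite-difference comparison is only a sanity check.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

v = np.linspace(-3.0, 3.0, 13)  # weighted sums of inputs

# Derivatives computed from the outputs u only.
u_tanh = np.tanh(v)
d_tanh = 1.0 - u_tanh ** 2

u_log = logistic(v)
d_log = u_log * (1.0 - u_log)

# Central finite differences computed from the inputs v, for comparison.
eps = 1e-6
num_tanh = (np.tanh(v + eps) - np.tanh(v - eps)) / (2 * eps)
num_log = (logistic(v + eps) - logistic(v - eps)) / (2 * eps)

print(np.max(np.abs(d_tanh - num_tanh)))
print(np.max(np.abs(d_log - num_log)))
```

Both differences come out at the level of floating-point noise, so during backpropagation it is enough to keep each node's output u.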
In theory I agree with the above responses. In my experience, some problems show a preference for sigmoid rather than tanh, probably due to the nature of these problems (since there are non-linear effects, it is difficult to understand why).
Given a problem, I generally optimize networks using a genetic algorithm. The activation function of each element of the population is chosen at random from a set of possibilities (sigmoid, tanh, linear, ...). For 30% of classification problems, the best element found by the genetic algorithm has sigmoid as its activation function.
In deep learning the ReLU has become the activation function of choice because the math is much simpler than for sigmoidal activation functions such as tanh or the logistic function, especially if you have many layers. To assign weights using backpropagation, you normally calculate the gradient of the loss function and apply the chain rule for hidden layers, meaning you need the derivative of the activation functions. ReLU is a ramp function with a flat part where the derivative is 0 and a sloped part where the derivative is 1. This makes the math really easy. If you use the hyperbolic tangent you might run into the vanishing gradient problem, meaning that if x is smaller than -2 or bigger than 2, the derivative gets really small and your network might not converge, or you might end up with a dead neuron that does not fire anymore.
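To put rough numbers on this, here is a small NumPy sketch comparing the two derivatives at a few made-up input values.

```python
import numpy as np

x = np.array([-4.0, -2.0, -0.5, 0.5, 2.0, 4.0])

# ReLU: derivative is 0 on the flat part and 1 on the sloped part.
relu_grad = (x > 0).astype(float)

# tanh: derivative 1 - tanh(x)^2 shrinks quickly as |x| grows.
tanh_grad = 1.0 - np.tanh(x) ** 2

print(relu_grad)
print(tanh_grad)
```

At |x| = 2 the tanh derivative is already below 0.08, and the chain rule multiplies such factors together across layers, which is where the vanishing gradient comes from.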

Activation function for neural network

I need help figuring out a suitable activation function. I'm training my neural network to detect a piano note, so in this case I can have only one output: either the note is present (1) or it is not (0).
Say I introduce a threshold value of 0.5 and say that if the output is greater than 0.5 the desired note is present, and if it's less than 0.5 the note isn't present. What type of activation function can I use? I assume it should be hard limit, but I'm wondering whether sigmoid can also be used.
To exploit their full power, neural networks require continuous, differentiable activation functions. Thresholding is not a good choice for multilayer neural networks. Sigmoid is quite a generic function, which can be applied in most cases. When you are doing binary classification (0/1 values), the most common approach is to define one output neuron and simply choose class 1 iff its output is bigger than a threshold (typically 0.5).
EDIT
As you are working with quite simple data (two input dimensions and two output classes), the best option seems to be to abandon neural networks for now and start with data visualization. 2D data can simply be plotted on the plane (with different colors for different classes). Once you do that, you can investigate how hard it is to separate one class from another. If the data is located in such a way that you can simply draw a line separating the classes, a linear support vector machine would be a much better choice (as it guarantees one global optimum). If the data seems really complex, and the decision boundary has to be some curve (or even a set of curves), I would suggest going for an RBF SVM, or at least a regularized form of neural network (so that its training is at least fairly repeatable). If you decide on a neural network, the situation is quite similar: if the data is simple to separate on the plane, you can use simple (linear/threshold) activation functions. If it is not linearly separable, use sigmoid or hyperbolic tangent, which will ensure non-linearity in the decision boundary.
UPDATE
Many things have changed over the last two years. In particular (as suggested in the comment, #Ulysee) there is growing interest in functions that are differentiable "almost everywhere", such as ReLU. These functions have a valid derivative on most of their domain, so the probability that we will ever need to differentiate at one of the remaining points is zero. Consequently, we can still use classical methods and, for the sake of completeness, use a zero derivative if we need to compute ReLU'(0). There are also fully differentiable approximations of ReLU, such as the softplus function.
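A minimal NumPy sketch of the two options in this update. Setting the ReLU derivative at exactly 0 to 0 is the convention described above; softplus is log(1 + exp(x)), and its derivative is the logistic sigmoid.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Convention assumed here: the derivative at exactly 0 is set to 0.
    return (x > 0).astype(float)

def softplus(x):
    # log(1 + exp(x)), computed stably via logaddexp(0, x).
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    # The derivative of softplus is the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
print(relu(x), relu_grad(x))
print(softplus(x), softplus_grad(x))
```

Away from 0 softplus stays close to ReLU, while at 0 it is smooth, with value log(2) and slope 0.5.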
The wikipedia article has some useful "soft" continuous threshold functions - see Figure Gjl-t(x).svg.
en.wikipedia.org/wiki/Sigmoid_function.
Following Occam's Razor, the simpler model using one output node is a good starting point for binary classification, where one class label is mapped to the output node when activated, and the other class label for when the output node is not activated.
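The single output node setup can be sketched in a few lines of NumPy. The pre-activation values are made up; the 0.5 threshold is the one from the question.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.1, 0.3, 4.0])  # hypothetical pre-activations of the output node
p = sigmoid(z)                        # can be read as P(note is present)
note_present = p > 0.5                # decision with the 0.5 threshold

print(p)
print(note_present)
```

Because sigmoid(0) = 0.5, thresholding the sigmoid output at 0.5 gives the same decisions as checking z > 0, but the sigmoid stays differentiable for training.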

Resources