Does the negative sampling method use sigmoid or softmax? - machine-learning

I am currently reading this paper: Word2Vec Explained (https://arxiv.org/pdf/1402.3722.pdf)
and there's something I can't understand.
On page 3, they say that p is defined using softmax:
$p(D=1 \mid w, c, \theta) = \frac{1}{1+e^{-v_c \cdot v_w}}$
but I am confused, because I have seen that formula as the sigmoid function, not the softmax function.
How do you derive that definition from the softmax?

It can be called a small abuse of notation on the author's part, but it's fine. People sometimes use softmax and sigmoid interchangeably. However, in this case it is indeed a sigmoid function, because this is a binary classification problem.
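For completeness, here is a short worked derivation (my own addition, not from the paper) showing why the two-class softmax collapses to the sigmoid. Taking the two scores to be $v_c \cdot v_w$ for the class $D=1$ and $0$ for the class $D=0$:
$p(D=1 \mid w, c, \theta) = \frac{e^{v_c \cdot v_w}}{e^{v_c \cdot v_w} + e^{0}} = \frac{1}{1 + e^{-v_c \cdot v_w}}$
which is exactly the sigmoid $\sigma(v_c \cdot v_w)$ from the paper.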

Related

Last layer of U-Net Semantic Segmentation Softmax or Sigmoid and Why?

I'm asking about the last layer of the U-Net model for semantic segmentation:
what should it be and why?
I've found a lot of different architectures; some of them use sigmoid and others use softmax in the last layer.
There's a good foundational article that goes in depth about sigmoid and softmax functions. Here is their summary:
If your model’s output classes are NOT mutually exclusive and you can choose many of them at the same time, use a sigmoid function on the network’s raw outputs.
If your model’s output classes are mutually exclusive and you can only choose one, then use a softmax function on the network’s raw outputs.
The article, however, specifically gives examples of classification tasks. In segmentation tasks, a pixel can only belong to one class at a time. (For example, in segmenting items on a beach, a pixel can't be both sand AND water.) This is why softmax is often used in segmentation models: the classes are mutually exclusive. In other words, it is a multi-class classification problem.
Sigmoid deals with multi-label classification problems, allowing a pixel to carry several labels at once (a pixel could be both sand and water, both sky and water, even sky+water+sand+sun+etc.), which usually doesn't make sense for segmentation. The exception, however, is if there's only one class, in other words, binary classification (water vs no water). Then you may use sigmoid in segmentation.
Softmax is actually a generalization of the sigmoid function. See this question over on Cross Validated for more info, but this is extra credit.
To finish answering your question, I should briefly mention loss functions. Depending on your loss function, you may prefer sigmoid or softmax. (E.g. if your loss function expects raw logits, applying a softmax in the last layer would be inappropriate.)
In summary, using softmax or sigmoid in the last layer depends on the problem you're working on, along with the associated loss function and other intricacies in your pipeline/software. In practice, if you have a multi-class problem, chances are you'll be using softmax. If you have a one-class/binary problem, either sigmoid or softmax is a possibility.
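To make the distinction concrete, here is a minimal NumPy sketch (my own illustration, not from the article) that applies the two activations to per-pixel class scores; the shapes and class names are invented for the example:
import numpy as np

# Raw per-pixel scores for 3 classes (sand, water, sky) on a 2x2 image:
# shape = (height, width, num_classes)
logits = np.array([[[ 2.0, -1.0,  0.5], [ 0.1,  3.0, -2.0]],
                   [[-0.5,  0.2,  1.5], [ 1.0,  1.0,  1.0]]])

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Mutually exclusive classes: softmax across the class axis,
# so each pixel's probabilities sum to 1.
p_exclusive = softmax(logits)

# Multi-label (or binary) case: an independent sigmoid per class,
# so the probabilities need not sum to 1.
p_multilabel = sigmoid(logits)

print(p_exclusive.sum(axis=-1))  # all ones
print(p_multilabel[0, 0])        # independent per-class probabilities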

Use of tanh activation function in input gate of LSTM

While studying LSTMs, I learned about the use of 2 different activation functions in the input gate: sigmoid and tanh. I understand the use of sigmoid, but not tanh. This stackoverflow answer about the use of tanh says that we want its second derivative to be sustained for a long time before going to zero; I don't get why he is talking about the second derivative. Also, he sort of says that tanh eliminates the vanishing gradient (in the 2nd paragraph), but all the articles I've read say that Leaky ReLU helps eliminate it. Therefore I want to understand the role of tanh in LSTM. This is not a duplicate question; I just want to understand the previously answered question. Thank you!🙌
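For reference, the input-gate update the question is about looks like this in the standard LSTM formulation (a minimal NumPy sketch of my own; the weight names are placeholders and the forget/output gates are omitted):
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
x_t = rng.normal(size=n_in)            # current input
h_prev = rng.normal(size=n_hid)        # previous hidden state
c_prev = rng.normal(size=n_hid)        # previous cell state
W_i, U_i = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
W_c, U_c = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))

# Sigmoid gate: values in (0, 1), decides HOW MUCH of the candidate to write.
i_t = sigmoid(W_i @ x_t + U_i @ h_prev)
# tanh candidate: values in (-1, 1), decides WHAT to write (signed, zero-centred).
c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev)
# New cell state (forget gate omitted here for brevity).
c_t = c_prev + i_t * c_tilde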

Does the sigmoid function cause the slowdown for weights not connected to the output layer when using the cross-entropy function?

I've been reading about error functions for neural nets on my own. http://neuralnetworksanddeeplearning.com/chap3.html explains that using the cross-entropy function avoids learning slowdown (i.e. the network learns faster when the predicted output is far from the target output). The author shows that, for the weights connected to the output layer, the sigmoid prime factor (which causes the slowdown) cancels out.
But what about the weights that are further back? When I derive the gradients (I get the same derivation as when the quadratic error function was used), I find that the sigmoid prime term appears for those weights. Wouldn't that contribute to slowdown? (Maybe I derived it incorrectly?)
Yes, all sigmoid layers except the last one will suffer from slowed-down learning. I think your derivation is correct. In fact, Quadratic Error, Sigmoid + BinaryCrossEntropyLoss, and Softmax + SoftmaxCrossEntropyLoss all share the same form of backpropagation formula at the output, y_i - y (prediction minus target). See the code for the three losses here: L2Loss, BinaryLoss, SoftmaxLoss
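As a quick sanity check of that claim (my own sketch, not the linked loss code), you can verify numerically that the gradient of the binary cross-entropy through a sigmoid, taken with respect to the logit, is exactly prediction minus target:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_of_logit(z, y):
    # binary cross-entropy applied to sigmoid(z)
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (bce_of_logit(z + eps, y) - bce_of_logit(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y          # the "prediction minus target" form
print(numeric, analytic)           # both approximately -0.3318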

Artificial Neural Network RELU Activation Function and Gradients

I have a question. I watched a really detailed tutorial on implementing an artificial neural network in C++, and now I have more than a basic understanding of how a neural network works and how to actually program and train one.
In the tutorial, the hyperbolic tangent was used for calculating outputs, and obviously its derivative for calculating gradients. However, I wanted to move on to a different function, specifically Leaky ReLU (to avoid dying neurons).
My question is this: the tutorial specifies that this activation function should be used for the hidden layers only, and that a different function should be used for the output layer (either softmax or a linear regression function). In the tutorial the guy taught the neural network to be an XOR processor. So is this a classification problem or a regression problem?
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
So I implemented the Leaky ReLU function and its derivative, but I don't know whether I should use softmax or a regression function for the output layer.
Also, for recalculating the output gradients I use the Leaky ReLU's derivative (for now), but in this case should I use the softmax's/regression's derivative as well?
Thanks in advance.
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
In short, classification is for a discrete target, regression is for a continuous target. If it were a floating-point operation, you would have a regression problem. But here the result of XOR is 0 or 1, so it's binary classification (as already suggested by Sid). You should use a softmax layer (or a sigmoid function, which works particularly well for 2 classes). Note that the output will be a vector of probabilities, i.e. real-valued, which is then used to choose the discrete target class.
Also, for recalculating the output gradients I use the Leaky ReLU's derivative (for now), but in this case should I use the softmax's/regression's derivative as well?
Correct. For the output layer you'll need a cross-entropy loss function, which corresponds to the softmax layer, and its derivative for the backward pass.
If there are hidden layers that still use Leaky ReLU, you'll also need Leaky ReLU's derivative accordingly for those particular layers.
I highly recommend this post on backpropagation details.
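Putting the answer together, here is a minimal NumPy sketch (my own illustration, not from the tutorial) of an XOR classifier with a Leaky ReLU hidden layer and a sigmoid output trained with cross-entropy; with only four training points it may occasionally need a different seed or learning rate:
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)

def leaky_relu_grad(z, a=0.01):
    return np.where(z > 0, 1.0, a)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

for step in range(5000):
    # forward pass: Leaky ReLU hidden layer, sigmoid output
    z1 = X @ W1 + b1
    h1 = leaky_relu(z1)
    z2 = h1 @ W2 + b2
    p = sigmoid(z2)                      # probability of class 1

    # backward pass: sigmoid + cross-entropy gives (p - y) at the output logit
    dz2 = (p - y) / len(X)
    dW2, db2 = h1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * leaky_relu_grad(z1)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p, 3))  # should approach [[0], [1], [1], [0]]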

Why use softmax only in the output layer and not in hidden layers?

Most examples of neural networks for classification tasks I've seen use a softmax layer as the output activation function. Normally, the other hidden units use a sigmoid, tanh, or ReLU function as the activation function. Using the softmax function there would, as far as I know, work out mathematically too.
What are the theoretical justifications for not using the softmax function as hidden layer activation functions?
Are there any publications about this, something to quote?
I haven't found any publications about why using softmax as an activation in a hidden layer is not the best idea (except for the Quora question, which you have probably already read), but I will try to explain why it is not the best idea to use it in this case:
1. Variable independence: a lot of regularization and effort is put into keeping your variables independent, uncorrelated, and quite sparse. If you use a softmax layer as a hidden layer, then you keep all your nodes (hidden variables) linearly dependent, which may result in many problems and poor generalization.
2. Training issues: try to imagine that, to make your network work better, you have to make some of the activations from your hidden layer a little bit lower. Then, automatically, you are forcing the rest of them to have a higher mean activation, which might in fact increase the error and harm your training phase (see the short sketch after this list).
3. Mathematical issues: by creating constraints on the activations of your model you decrease its expressive power without any logical explanation. Striving to have all the activations constrained in this way is not worth it, in my opinion.
4. Batch normalization does it better: one may consider the fact that a constant mean output from a layer may be useful for training. But, on the other hand, a technique called Batch Normalization has already been proven to work better, and it has been reported that setting softmax as the activation function in a hidden layer may decrease the accuracy and the speed of learning.
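As a small illustration of point 2 (my own example), lowering one hidden unit's pre-activation necessarily redistributes probability mass onto all the other units, because the softmax outputs are coupled through the shared normalizer:
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

a = np.array([2.0, 1.0, 0.5])
print(softmax(a))      # ≈ [0.63 0.23 0.14]
a[0] -= 1.0            # push one hidden activation down...
print(softmax(a))      # ...and the other two necessarily rise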
Actually, Softmax functions are already used deep within neural networks, in certain cases, when dealing with differentiable memory and with attention mechanisms!
Softmax layers can be used within neural networks such as in Neural Turing Machines (NTM) and an improvement of those which are Differentiable Neural Computer (DNC).
To summarize, those architectures are RNNs/LSTMs which have been modified to contain a differentiable (neural) memory matrix that can be written to and read from across time steps.
Quickly explained, the softmax function here enables a normalization of the memory fetch and other similar tricks for content-based addressing of the memory. On that topic, I really liked this article, which illustrates the operations in an NTM and other recent RNN architectures with interactive figures.
Moreover, softmax is used in attention mechanisms for, say, machine translation, such as in this paper. There, the softmax enables a normalization of the places attention is distributed over, in order to "softly" retain the maximal place to pay attention to: that is, to also pay a little bit of attention elsewhere, in a soft manner. However, this could be considered to be a mini neural network that deals with attention, within the big one, as explained in the paper. Therefore, it can be debated whether or not softmax is only used at the end of neural networks.
Hope it helps!
Edit - More recently, it's even possible to see Neural Machine Translation (NMT) models where only attention (with softmax) is used, without any RNN or CNN: http://nlp.seas.harvard.edu/2018/04/03/attention.html
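To make the attention use concrete, here is a tiny NumPy sketch (my own, not taken from the linked paper or tutorial) of the softmax step in dot-product attention: similarity scores between a query and a set of memory/encoder states are normalized into attention weights that mostly, but not exclusively, select the best match:
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)            # what we are looking for
memory = rng.normal(size=(5, d))      # 5 memory slots / encoder states

scores = memory @ query / np.sqrt(d)  # similarity of the query to each slot
weights = softmax(scores)             # "soft" addressing: non-negative, sums to 1
read = weights @ memory               # weighted read, mostly from the best slot

print(np.round(weights, 3))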
Use a softmax activation wherever you want to model a multinomial distribution. This may be (usually) an output layer y, but can also be an intermediate layer, say a multinomial latent variable z. As mentioned in this thread for outputs {o_i}, sum({o_i}) = 1 is a linear dependency, which is intentional at this layer. Additional layers may provide desired sparsity and/or feature independence downstream.
Page 198 of Deep Learning (Goodfellow, Bengio, Courville):
"Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable."
The softmax function is used for the output layer only (at least in most cases) to ensure that the sum of the components of the output vector is equal to 1 (for clarity, see the formula of the softmax cost function). This also gives the probability of occurrence of each component (class) of the output, and hence the sum of the probabilities (or output components) is equal to 1.
The softmax function is one of the most important output functions used in deep learning within neural networks (see Understanding Softmax in Minutes by Uniqtech). The softmax function is applied where there are three or more classes of outcomes. The softmax formula takes e raised to the score of each class and divides it by the sum of e raised to all the scores. For example, if I know the logit scores of these four classes to be [3.00, 2.0, 1.00, 0.10], then in order to obtain the probability outputs, the softmax function can be applied as follows:
import numpy as np

def softmax(x):
    # subtract the max for numerical stability; the result is unchanged
    z = np.exp(x - np.max(x))
    return z / z.sum()

scores = [3.00, 2.0, 1.00, 0.10]
print(softmax(scores))
Output: probabilities (p) ≈ 0.642, 0.236, 0.087, 0.035
The sum of all the probabilities is 0.642 + 0.236 + 0.087 + 0.035 = 1.00. You can substitute any values you like for the scores above and you will get different values, but the sum of all the values (probabilities) will still be equal to one. That makes sense, because the sum of all probabilities is equal to one; softmax thereby turns logit scores into probability scores, so that we can predict better. Finally, the softmax output can help us understand and interpret a multinomial logit model. If you like these thoughts, please leave your comments below.
