My understanding is that Softmax Regression is a generalization of Logistic Regression that supports multiple classes.
The Softmax Regression model first computes a score for each class, then estimates the probability of each class by applying the softmax function to those scores.
Each class has its own dedicated parameter vector.
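For concreteness, my rough mental model is something like this numpy sketch (the sizes and parameter values are made up just for illustration):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # one dedicated parameter vector per class (3 classes, 4 features)
b = np.zeros(3)               # one bias per class
x = rng.normal(size=4)        # a single input

scores = W @ x + b                             # one score per class
probs = np.exp(scores) / np.exp(scores).sum()  # softmax turns scores into probabilities
print(probs, probs.sum())                      # the probabilities sum to 1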
My question: why can't we use plain Logistic Regression to classify into multiple classes in a much simpler way, e.g. if the predicted probability is 0 to 0.3 then Class A; 0.3 to 0.6 then Class B; 0.6 to 0.9 then Class C; and so on?
Why is a separate coefficient vector per class always needed?
I'm new to ML, so I'm not sure whether this question comes from a gap in my understanding of some fundamental concept.
First up, on terminology: a more established name for this is multinomial logistic regression.
The softmax function is a natural choice for computing probabilities because it corresponds to maximum likelihood estimation (MLE). The cross-entropy loss has a probabilistic interpretation as well: it is the "distance" between two distributions (output and target).
What you suggest is to discriminate between classes in an artificial way: output a single scalar (as if it were a binary distribution) and somehow map it onto a multi-class decision. In theory it is possible and may work, but it has clear drawbacks. For example, it is harder to train.
Suppose the output is 0.2 (i.e. class A) and the ground truth is class B. You'd like to tell the network to shift towards a higher value. Next time the output is 0.7: the network actually learned and moved in the right direction, but you punish it again, because 0.7 now falls in class C. In fact, there are unstable points (0.3 and 0.6 in your example) that the network needs time to learn as critical ones. Two values, 0.2999999 and 0.3000001, are almost indistinguishable to the network, yet they determine whether the result is correct or not.
In general, outputting a full probability distribution is better than direct discrimination, because it carries more information.
I'm working on a binary classification problem, using the logistic regression and support vector machine models imported from sklearn. The two models were fit with the same, imbalanced training data, with class weights adjusted, and they achieved comparable performance. When I used these two pre-trained models to predict on a new dataset, the LR model and the SVM model predicted similar numbers of instances as positive, and the predicted instances have a large overlap.
However, when I looked at the probability scores of the instances classified as positive, the distribution from LR ranges from 0.5 to 1 while the SVM's starts from around 0.1. I called model.predict(prediction_data) to find the instances predicted as each class and model.predict_proba(prediction_data) to get the probability scores of being classified as 0 (negative) and 1 (positive), assuming both use a default threshold of 0.5.
There is no error in my code, and I have no idea why the SVM predicts instances with probability scores < 0.5 as positive. Any thoughts on how to interpret this situation?
That's a known behaviour in sklearn for binary classification problems with SVC(), reported, for instance, in these github issues (here and here). It is also documented in the User Guide, which says:
In addition, the probability estimates may be inconsistent with the scores:
the “argmax” of the scores may not be the argmax of the probabilities; in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.
or directly within libsvm faq, where it is said that
Let's just consider two-class classification here. After probability information is obtained in training, we do not have prob >= 0.5 if and only if decision value >= 0.
All in all, the point is that:
on one side, predictions are based on decision_function values: if the decision value computed on a new instance is positive, the predicted class is the positive class, and vice versa.
on the other side, as stated in one of the github issues, np.argmax(self.predict_proba(X), axis=1) != self.predict(X), which is where the inconsistency comes from. In other words, to always have consistency on binary classification problems you would need a classifier whose predictions are based on the output of predict_proba() (which is, by the way, what you get when using calibrators), like so:
def predict(self, X):
    y_proba = self.predict_proba(X)
    return np.argmax(y_proba, axis=1)
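For instance (just a sketch of that idea, not an official sklearn API), one could subclass SVC so that predict() is derived from predict_proba(); CalibratedClassifierCV gives you similar behaviour out of the box:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

class ProbaConsistentSVC(SVC):
    """SVC whose predict() is derived from predict_proba(), so the two always agree."""
    def predict(self, X):
        y_proba = self.predict_proba(X)
        return self.classes_[np.argmax(y_proba, axis=1)]

X, y = make_classification(random_state=0)
clf = ProbaConsistentSVC(probability=True).fit(X, y)
print(clf.predict(X[:5]))        # labels now always match the argmax of predict_proba
print(clf.predict_proba(X[:5]))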
I'd also suggest this post on the topic.
After training, a classifier can report the probability of a data point belonging to each class.
y_pred = clf.predict_proba(test_point)
Does the classifier predict the class with the maximum probability, or does it treat the probabilities as a distribution and draw a class according to that distribution?
In other words, suppose the output probability is -
C1 - 0.1 C2 - 0.2 C3 - 0.7
Will the output be C3 always or only 70% of the times?
When clf predicts, it doesn't sample from the per-class probabilities. It uses the fully connected layer to get an array of shape [n_items, n_classes], then takes the class with the maximum value for each item (so the output is always C3 in your example).
By the way, during training the classifier uses softmax to get the probability of each class, which is smoother to optimize; you can find documentation about softmax if you are interested in the training process.
How to go from class probability scores to a class is often called the 'decision function', and is often considered separate from the classifier itself. In scikit-learn, many estimators have a default decision function accessible via predict(); for multi-class problems this generally just returns the class with the largest score (the argmax).
However, this may be extended in various ways, depending on needs. For instance, if wrongly predicting one of the classes is very costly, one might weight its probability down (class weighting). Or one can have a decision function that only outputs a class if the confidence is high, and otherwise returns an error or a fallback class.
One can also have multi-label classification, where the output is not a single class but a list of classes, e.g. [0.6, 0.1, 0.7, 0.2] -> (class0, class2). These can use a common threshold or a per-class threshold. This is common in tagging problems.
But in almost all cases the decision function is a deterministic function, not a probabilistic one.
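As a tiny illustration (the probability vectors are made-up numbers), an argmax decision function and a per-class-threshold multi-label one could look like:

import numpy as np

proba = np.array([0.1, 0.2, 0.7])                  # the C1/C2/C3 example above
print("argmax decision:", np.argmax(proba))        # -> 2 (C3), deterministically, every time

scores = np.array([0.6, 0.1, 0.7, 0.2])            # the multi-label example above
threshold = 0.5                                    # assumed common threshold
print("multi-label decision:", np.where(scores >= threshold)[0])  # -> [0 2] (class0, class2)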
When using sklearn and getting probabilities with the predict_proba(x) function for a binary classification [1, 0], the function returns the probability that the instance falls into each class, e.g. [.8, .34].
Is there a community-adopted standard way to reduce this to a single classification confidence that takes all factors into consideration?
Option 1)
Just take the probability for the classification that was predicted (.8 in this example)
Option 2)
Some mathematical formula or function call which takes into consideration all of the different probabilities and returns a single number. Such a confidence measure could account for how close the probabilities of the different classes are, and return a lower confidence if there is not much separation between them.
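For example, something like this made-up margin measure (just an illustration of Option 2, not a standard function):

import numpy as np

def confidence_margin(proba):
    """Hypothetical confidence score: gap between the two largest class probabilities."""
    top_two = np.sort(proba)[-2:]
    return top_two[1] - top_two[0]

print(confidence_margin(np.array([0.8, 0.2])))    # well separated -> 0.6
print(confidence_margin(np.array([0.55, 0.45])))  # close call -> ~0.1 (low confidence)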
There's no standard way of doing it, but what you can do is vary the threshold. What I mean is: if you use predict instead, it outputs a binary label for each instance, and what it's doing internally is using 0.5 as the threshold, i.e. if the probability of class 1 is > 0.5 it classifies the instance as 1, and as 0 if <= 0.5. But this can lead to a bad F1 score in some cases.
So the approach should be to vary the threshold and choose the one which yields the maximum F1 score, or any other metric you want to use as a scoring function. ROC (receiver operating characteristic) curves are meant exactly for this purpose. In fact, that is the motive behind sklearn giving out the class probabilities: to let you choose the best threshold.
A very good example is predicting whether a patient has cancer or not. You have to choose your threshold wisely: if you set it high you may get a lot of false negatives, and if you set it low you may get a lot of false positives. So you choose the threshold according to your needs (in this case it is better to accept more false positives than to miss a cancer).
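A minimal sketch of that threshold search (clf, X_val and y_val are assumed to already exist; it just scans thresholds and keeps the one with the best F1):

import numpy as np
from sklearn.metrics import f1_score

# Assumes a fitted binary classifier `clf` and a validation set X_val, y_val.
proba_pos = clf.predict_proba(X_val)[:, 1]          # probability of the positive class
thresholds = np.linspace(0.05, 0.95, 19)
f1_scores = [f1_score(y_val, (proba_pos >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1_scores))]
print(f"best threshold by F1: {best_t:.2f}")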
Hope it helps!
Most examples of neural networks for classification tasks I've seen use a softmax layer as the output activation function. Normally, the hidden units use a sigmoid, tanh, or ReLU as the activation function. Using the softmax function there would, as far as I know, work out mathematically too.
What are the theoretical justifications for not using the softmax function as hidden layer activation functions?
Are there any publications about this, something to quote?
I haven't found any publications about why using softmax as an activation in a hidden layer is not the best idea (except a Quora question which you have probably already read), but I will try to explain why it is not the best idea in this case:
1. Variable independence: a lot of regularization and effort is put into keeping your variables independent, uncorrelated and quite sparse. If you use a softmax layer as a hidden layer, then you keep all your nodes (hidden variables) linearly dependent, which may result in many problems and poor generalization (see the small demo after this list).
2. Training issues: imagine that, to make your network work better, you have to make some of the activations from your hidden layer a little bit lower. Then, automatically, you are forcing the rest of them to have a higher mean activation, which might in fact increase the error and harm your training phase.
3. Mathematical issues: by creating constraints on the activations of your model, you decrease its expressive power without any logical justification. Striving to have all activations constrained in this way is not worth it, in my opinion.
4. Batch normalization does it better: one might argue that a constant mean output from a layer could be useful for training. But, on the other hand, a technique called Batch Normalization has already been shown to work better for that purpose, whereas it has been reported that using softmax as the activation function in a hidden layer may decrease the accuracy and the speed of learning.
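A small numpy demo of the sum-to-one constraint from point 1 (the hidden pre-activations are made-up numbers):

import numpy as np

hidden = np.array([1.5, -0.3, 0.8, 2.1])             # made-up pre-activations of a hidden layer
softmax_act = np.exp(hidden) / np.exp(hidden).sum()  # softmax used as the hidden activation

print(softmax_act, softmax_act.sum())  # the activations always sum to exactly 1, so they are
                                       # linearly dependent; pushing one down necessarily
                                       # pushes the others up (point 2 above)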
Actually, Softmax functions are already used deep within neural networks, in certain cases, when dealing with differentiable memory and with attention mechanisms!
Softmax layers can be used within neural networks such as Neural Turing Machines (NTM) and an improvement of those, the Differentiable Neural Computer (DNC).
To summarize, those architectures are RNNs/LSTMs which have been modified to contain a differentiable (neural) memory matrix which can be written to and read from across time steps.
Quickly explained, the softmax function here normalizes a fetch from the memory, and enables similar tricks for content-based addressing of the memory. About that, I really liked this article, which illustrates the operations in an NTM and other recent RNN architectures with interactive figures.
Moreover, softmax is used in attention mechanisms, for example in machine translation, such as in this paper. There, the softmax normalizes the places over which attention is distributed in order to "softly" retain the maximal place to pay attention to: that is, also paying a little bit of attention elsewhere, in a soft manner. However, this could be considered a mini neural network that deals with attention, within the big one, as explained in the paper. So it could be debated whether softmax is really used only at the end of neural networks.
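A toy sketch of that "soft" normalization in attention (the scores are made-up query/key similarity values):

import numpy as np

scores = np.array([2.3, 0.1, -1.0, 0.7])            # made-up attention scores over 4 positions
attention = np.exp(scores) / np.exp(scores).sum()   # softmax over where to attend

print(attention)   # ~[0.74 0.08 0.03 0.15]: mostly one place, but a little attention everywhere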
Hope it helps!
Edit - More recently, it's even possible to see Neural Machine Translation (NMT) models where only attention (with softmax) is used, without any RNN nor CNN: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Use a softmax activation wherever you want to model a multinomial distribution. This may be (usually) an output layer y, but can also be an intermediate layer, say a multinomial latent variable z. As mentioned in this thread, for outputs {o_i} the constraint sum_i o_i = 1 is a linear dependency, which is intentional at this layer. Additional layers may provide the desired sparsity and/or feature independence downstream.
Page 198 of Deep Learning (Goodfellow, Bengio, Courville)
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.
Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.
The softmax function is used for the output layer only (at least in most cases), to ensure that the components of the output vector sum to 1 (for clarity, see the formula of the softmax cost function). Each component can then be read as the probability of occurrence of the corresponding class, and hence the sum of the probabilities (or output components) is equal to 1.
The softmax function is one of the most important output functions used in deep learning within neural networks (see Understanding Softmax in minute by Uniqtech). The softmax function is applied where there are three or more classes of outcomes. The formula takes e raised to the score of each class and divides it by the sum of e raised to all of the scores, i.e. softmax(x_i) = exp(x_i) / sum_j exp(x_j). For example, if the logit scores of four classes are [3.00, 2.0, 1.00, 0.10], then to obtain the output probabilities the softmax function can be applied as follows:
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    z = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return z / z.sum()

scores = [3.00, 2.0, 1.00, 0.10]
print(softmax(scores))
Output: probabilities (p) = 0.642 0.236 0.087 0.035
The sum of all probabilities (p) = 0.642 + 0.236 + 0.087 + 0.035 = 1.00. You can substitute any values you like into the above scores and you will get different probabilities, but their sum will still be equal to one. That makes sense, because the sum of all probabilities must equal one; softmax turns logit scores into probability scores, which we can interpret and predict with. Finally, the softmax output can help us understand and interpret the multinomial logit model. If you like these thoughts, please leave your comments below.
I have a binary logistic regression model (0/1), built over binary features. The feature coefficients are usually in the range (-1, 1). After training, can I use the feature coefficients as a proxy for the 'importance' of a feature? If the coefficient is < 0, does that mean the presence of the feature is a negative for the class (i.e., reduces the probability of the output being 1)?
Right; a negative coefficient means that the feature contra-indicates that class. The magnitude is, indeed, the relative importance. -1 and +1 are imperatives: all members of the class do not / do have that feature.
You absolutely can. In fact, this idea of importance or 'blame' is a core concept behind machine learning algorithms. The coefficients change many times during the training process through gradient descent, and how much the weights update is determined by each weight's contribution to the cost.
That is, the more a weight is to blame for a high cost, the more extreme the update will be. Therefore, more extreme values (large positive or large negative) indicate how impactful the respective feature is when the model makes its decision.
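A minimal sketch of reading importances off a fitted model this way (the dataset and feature indices are made up for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_features=5, n_informative=3, random_state=0)
clf = LogisticRegression().fit(X, y)

coefs = clf.coef_[0]                        # one coefficient per feature (binary problem)
order = np.argsort(-np.abs(coefs))          # rank by magnitude as a rough importance proxy
for i in order:
    print(f"feature {i}: coef = {coefs[i]:+.3f}")   # + pushes towards class 1, - towards class 0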