I'm new to machine learning. I'm learning about logistic regression with the one-vs-all (one-vs-rest) method for multiclass classification.
In logistic regression, the hypothesis function tries to estimate the probability of the positive class.
Assume we have 3 classes; then for each class we compute a hypothesis function h(x):
h1(x)=P(y=1|x)
h2(x)=P(y=2|x)
h3(x)=P(y=3|x)
However, the sum of the three probabilities doesn't equal 1.
I "feel" that it should equal 1, and I don't understand why it doesn't.
Can someone explain why?
Your results are correct: the sum of h1(x), h2(x) and h3(x) need not equal 1.
When you perform one-vs-all classification, for each class (e.g., class 1) you have two probabilities, p(y=1|x) and p(y!=1|x), which sum to 1:
p(y=1|x) + p(y!=1|x) = 1.
But since your one-vs-all classifiers are trained independently of each other,
p(y!=1|x) != p(y=2|x) + p(y=3|x) (at least not necessarily).
Maybe, it is easier to understand with an example:
the first classifier says that p(y=1|x) = 0.7 and p(y!=1|x) = 0.3;
the second classifier says that p(y=2|x) = 0.7 and p(y!=2|x) = 0.3;
the third classifier says that p(y=3|x) = 0.7 and p(y!=3|x) = 0.3.
All of them are valid classifiers, but
p(y=1|x) + p(y=2|x) + p(y=3|x) != 1.
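The same point can be checked numerically. The following is a minimal sketch, assuming three independently trained one-vs-all classifiers whose raw scores for one input are the made-up values in `scores` (these numbers are hypothetical, not taken from any real model). It shows that the sigmoid outputs do not sum to 1 unless you normalize them yourself.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores from three independently trained one-vs-all
# classifiers for the same input x (made-up numbers for illustration).
scores = [0.8, 0.5, -0.3]

# Each classifier answers its own yes/no question, so each output is a
# valid probability on its own, but nothing ties the three together.
ovr_probs = [sigmoid(s) for s in scores]
ovr_total = sum(ovr_probs)

# Dividing by the total rescales the outputs so that they sum to 1.
# The predicted class (the argmax) is unchanged by this rescaling.
normalized = [p / ovr_total for p in ovr_probs]

print("one-vs-all outputs:", [round(p, 3) for p in ovr_probs])
print("sum of outputs:", round(ovr_total, 3))
print("normalized:", [round(p, 3) for p in normalized])
```

If you need probabilities that sum to 1 by construction, a softmax (multinomial) model couples the classes during training rather than normalizing afterwards.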
According to the PyTorch docs for nn.BCEWithLogitsLoss, pos_weight is an optional argument that takes the weight of positive examples. I don't fully understand the statement "pos_weight > 1 increases recall and pos_weight < 1 increases precision" on that page. How do you understand this statement?
The binary cross-entropy with logits loss (nn.BCEWithLogitsLoss, equivalent to F.binary_cross_entropy_with_logits) is a sigmoid layer (nn.Sigmoid) followed by a binary cross-entropy loss (nn.BCELoss). The general case assumes you are in a multi-label classification task, i.e. a single input can be labeled with multiple classes. One common sub-case is to have a single class: the binary classification task. Define q as your tensor of predicted logits and p as the ground truth, with values in [0, 1], corresponding to the true probabilities for each class.
The explicit formulation for the binary cross-entropy would be:
z = torch.sigmoid(q)
loss = -(w_p*p*torch.log(z) + (1-p)*torch.log(1-z))
introducing the w_p, the weight associated with the true label for each class. Read this post for more details on the weighting scheme used by the BCELoss.
For a given class:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
If w_p > 1, the loss puts more weight on the positive examples, so the model is pushed toward predicting positive. This tends to reduce false negatives (FN), which increases recall, at the cost of more false positives (FP), which decreases precision. Conversely, if w_p < 1, we decrease the weight on the positive class, which tends to increase false negatives and so decreases recall, while the resulting drop in false positives increases precision.
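To see the effect of the weight in numbers, here is a small sketch that re-implements the per-sample formula above in plain Python rather than calling PyTorch. The function name `weighted_bce` and the two example logits are assumptions made for illustration. It compares the loss of a missed positive with the loss of a false alarm for several values of w_p.

```python
import math

def weighted_bce(logit, target, pos_weight=1.0):
    """Per-sample loss -(w_p * p * log(z) + (1 - p) * log(1 - z)), with z = sigmoid(logit)."""
    z = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * target * math.log(z) + (1.0 - target) * math.log(1.0 - z))

# A missed positive (target 1, negative logit) and a false alarm
# (target 0, positive logit), using made-up logits of equal magnitude.
missed_positive = (-1.0, 1.0)
false_alarm = (1.0, 0.0)

for w_p in (0.5, 1.0, 3.0):
    fn_cost = weighted_bce(*missed_positive, pos_weight=w_p)
    fp_cost = weighted_bce(*false_alarm, pos_weight=w_p)
    print(f"w_p={w_p}: missed-positive loss={fn_cost:.3f}, false-alarm loss={fp_cost:.3f}")
```

Only the missed-positive loss scales with w_p; the false-alarm loss is untouched. With w_p > 1 the optimizer therefore has more to gain from eliminating false negatives than false positives, which is the recall-versus-precision shift described above.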
Say, for example, a dataset contains 60% instances of the "Yes" class and 40% instances of the "No" class.
In this scenario, the precision and recall for the random classifier are:
Precision = 60%
Recall = 50%
Then what will the accuracy of the random classifier be in this scenario?
Some caution is required here, since the very definition of a random classifier is somewhat ambiguous; this is best illustrated in cases of imbalanced data.
For a binary classifier whose predictions are independent of the true class (as is the case for a random classifier), the accuracy is
acc = P(class=0) * P(prediction=0) + P(class=1) * P(prediction=1)
where P stands for probability.
Indeed, if we stick to the intuitive definition of a random binary classifier as giving
P(prediction=0) = P(prediction=1) = 0.5
then the accuracy computed by the above formula is always 0.5, irrespective of the class distribution (i.e. the values of P(class=0) and P(class=1)).
However, in this definition, there is an implicit assumption, i.e. that our classes are balanced, each one consisting of 50% of our dataset.
This assumption (and the corresponding intuition) breaks down in cases of class imbalance: if we have a dataset where, say, 90% of samples are of class 0 (i.e. P(class=0)=0.9), then it doesn't make much sense to use the above definition of a random binary classifier; instead, we should use the percentages of the class distributions themselves as the probabilities of our random classifier, i.e.:
P(prediction=0) = P(class=0) = 0.9
P(prediction=1) = P(class=1) = 0.1
Now, plugging these values into the formula defining the accuracy, we get:
acc = P(class=0) * P(prediction=0) + P(class=1) * P(prediction=1)
= (0.9 * 0.9) + (0.1 * 0.1)
= 0.82
which is nowhere close to the naive value of 0.5...
As I already said, AFAIK there are no clear-cut definitions of a random classifier in the literature. Sometimes the "naive" random classifier (always flip a fair coin) is referred to as a "random guess" classifier, while what I have described is referred to as a "weighted guess" one, but still this is far from being accepted as a standard...
The bottom line here is the following: since the main reason for using a random classifier is as a baseline, it makes sense to do so only in relatively balanced datasets. In your case of a 60-40 balance, the result turns out to be 0.52, which is admittedly not far from the naive one of 0.5; but for highly imbalanced datasets (e.g. 90-10), the usefulness itself of the random classifier as a baseline ceases to exist, since the correct baseline has become "always predict the majority class", which here would give an accuracy of 90%, in contrast to the random classifier accuracy of just 82% (let alone the 50% accuracy of the naive approach)...
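The three baselines discussed above can be compared with a few lines of Python. This is only a sketch of the arithmetic; the function names `random_baseline_accuracy` and `baselines` are made up for illustration, and the formula assumes predictions drawn independently of the true class.

```python
def random_baseline_accuracy(p_class0, p_pred0):
    """Accuracy when the prediction is drawn independently of the true class."""
    p_class1 = 1.0 - p_class0
    p_pred1 = 1.0 - p_pred0
    return p_class0 * p_pred0 + p_class1 * p_pred1

def baselines(p_class0):
    """Accuracies of the three baselines for a given share of class 0."""
    return {
        "random guess": random_baseline_accuracy(p_class0, 0.5),
        "weighted guess": random_baseline_accuracy(p_class0, p_class0),
        "majority class": max(p_class0, 1.0 - p_class0),
    }

# The 60-40 and 90-10 splits used in the text.
for share in (0.6, 0.9):
    print(share, {name: round(acc, 2) for name, acc in baselines(share).items()})
```

For the 60-40 split this gives 0.5, 0.52 and 0.6; for the 90-10 split it gives 0.5, 0.82 and 0.9, matching the figures quoted above.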
As @desertnaut mentioned, if you're after a naïve benchmark for your model, you're better off using "always predict the majority class" as your benchmark, which achieves an accuracy of %of_samples_in_majority_class (never worse than either a random guess or a weighted guess).
In Deepchecks (a package I maintain) we have a check that automatically compares the performance of your model to a simple model (either weighted random, majority class or simple decision tree).
from deepchecks.checks import SimpleModelComparison
from deepchecks import Dataset
SimpleModelComparison().run(Dataset(train_df, label='target'), Dataset(test_df, label='target'), model)
Given a classification problem in Machine Learning the hypothesis is described as below.
hθ(x)=g(θ'x)
z = θ'x
g(z) = 1 / (1+e^−z)
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x)≥0.5→y=1
hθ(x)<0.5→y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z) ≥ 0.5
when z ≥ 0
Remember:
z = 0, e^0 = 1 ⇒ g(z) = 1/2
z → ∞, e^{−∞} → 0 ⇒ g(z) → 1
z → −∞, e^{∞} → ∞ ⇒ g(z) → 0
So if our input to g is θ'x, then that means:
hθ(x) = g(θ'x) ≥ 0.5
when θ'x ≥ 0
From these statements we can now say:
θ'x≥0⇒y=1
θ'x<0⇒y=0
The decision boundary is the line that separates the area where y = 0 from the area where y = 1. It is created by our hypothesis function.
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?
This is basic logistic regression with a threshold. Your theta' * x is just the vector notation for the dot product of your weight vector and your input. You put that into the logistic function, which outputs a value strictly between 0 and 1, and threshold that value at 0.5. If it is equal to or above 0.5, you treat the sample as positive, and as negative otherwise.
The classification algorithm is just that simple. The training is a bit more complicated, and its goal is to find a weight vector theta that classifies all of your labeled data correctly, or at least as much of it as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.
Courtesy of other posters on a different tech forum.
Solving for theta'*x >= 0 and theta'*x<0 gives the decision boundary. The RHS of the inequality ( i.e. 0) comes from the sigmoid function.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.
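As a concrete illustration of the answers above, here is a minimal sketch with a made-up parameter vector `theta` (hypothetical values, chosen so that the boundary is the line x1 + x2 = 3). It checks that thresholding g(theta'x) at 0.5 gives the same label as checking the sign of theta'x.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Return (probability, label); x must start with 1.0 for the intercept term."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z), int(z >= 0)

# Hypothetical parameters: the decision boundary is -3 + x1 + x2 = 0.
theta = [-3.0, 1.0, 1.0]

for x1, x2 in [(1.0, 1.0), (1.5, 1.5), (2.0, 2.0)]:
    prob, label = predict(theta, [1.0, x1, x2])
    # Thresholding the probability at 0.5 agrees with the sign test on theta'x.
    assert label == int(prob >= 0.5)
    print((x1, x2), round(prob, 3), label)
```

The point (1.5, 1.5) lies exactly on the boundary, where theta'x = 0 and g(theta'x) = 0.5; points on either side of the line get different labels.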
I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set. I have even turned the class_weight feature to auto.
I know that in Logistic Regression it should be possible to know what is the threshold value for a particular pair of classes.
Is it possible to know what the threshold value is in each of the One-vs-All classes the LogisticRegression() method designs?
I did not find anything in the documentation page.
Does it by default apply the 0.5 value as threshold for all the classes regardless of the parameter values?
There is a little trick that I use: instead of model.predict(test_data), use model.predict_proba(test_data). Then try a range of thresholds to analyze their effect on the predictions:
from sklearn import metrics

# Probability of the positive class (second column of predict_proba)
pos_proba = model.predict_proba(x_test)[:, 1]
threshold_list = [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,.7,.75,.8,.85,.9,.95,.99]
for threshold in threshold_list:
    print('\n******** For threshold = {} ******'.format(threshold))
    # Label a sample 1 when its positive-class probability exceeds the threshold
    Y_test_pred = (pos_proba > threshold).astype(int)
    test_accuracy = metrics.accuracy_score(Y_test, Y_test_pred)
    print('Our testing accuracy is {}'.format(test_accuracy))
    print(metrics.confusion_matrix(Y_test, Y_test_pred))
Best!
Logistic regression chooses the class that has the highest probability. In the case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same holds for the multiclass setting: again, it chooses the class with the highest probability (see e.g. Ng's lectures, the bottom lines).
Introducing special thresholds only affects the proportion of false positives to false negatives (and thus the precision/recall tradeoff); the threshold is not a parameter of the LR model. See also the similar question.
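A tiny hand-checkable sketch of that tradeoff follows. The labels and predicted probabilities are made up for illustration, and `precision_recall` is a hypothetical helper, not a scikit-learn function. The fitted model (represented here only by its probabilities) stays fixed while the threshold moves.

```python
def precision_recall(y_true, probs, threshold):
    """Precision and recall of the rule 'predict 1 when prob >= threshold'."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up true labels and positive-class probabilities from one fixed model.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.1, 0.2, 0.35, 0.45, 0.4, 0.6, 0.7, 0.9]

for threshold in (0.3, 0.5, 0.65):
    precision, recall = precision_recall(y_true, probs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, lowering the threshold to 0.3 catches every positive but admits more false positives, while raising it to 0.65 removes the false positives and misses one positive.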
Yes, scikit-learn uses a probability threshold of 0.5 for binary classification. I am going to build on some of the answers already posted with two options to check this:
One simple option is to extract the probabilities of each class with model.predict_proba(test_x), along with the class predictions from model.predict(test_x); both calls appear in the code below. Then append the class predictions and their probabilities to your test dataframe as a check.
As another option, one can graphically view precision vs. recall at various thresholds using the following code.
# Predict test_y values and probabilities based on the fitted logistic regression model
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import precision_recall_curve

pred_y = log.predict(test_x)
probs_y = log.predict_proba(test_x)
# probs_y is a 2-D array: probability of being labeled 0 (first column) vs 1 (second column)
# Use the second column, i.e. the probability of being 1
precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:, 1])
pr_auc = metrics.auc(recall, precision)
plt.title("Precision-Recall vs Threshold Chart")
# precision and recall have one more element than thresholds, so drop the last one
plt.plot(thresholds, precision[:-1], "b--", label="Precision")
plt.plot(thresholds, recall[:-1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0, 1])
plt.show()
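We can use a wrapper as follows:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)

def custom_predict(X, threshold):
    # Label a sample 1 when its positive-class probability exceeds the threshold
    probs = model.predict_proba(X)
    return (probs[:, 1] > threshold).astype(int)

new_preds = custom_predict(X=X, threshold=0.4)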
According to what I have understood, linear regression predicts an outcome that can take continuous values, whereas logistic regression predicts an outcome that is discrete. It seems to me that logistic regression is similar to a classification problem. So, why is it called regression?
There is also a related question: What is the difference between linear regression and logistic regression?
There is a strict link between linear regression and logistic regression.
With linear regression you're looking for the k_i parameters:
h = k_0 + Σ_i k_i · X_i = K^T · X (taking X_0 = 1)
With logistic regression you have the same aim, but the equation is:
h = g(K^T · X)
where g is the sigmoid function:
g(w) = 1 / (1 + e^{-w})
So:
h = 1 / (1 + e^{-K^T · X})
and you need to fit K to your data.
Assuming a binary classification problem, the output h is the estimated probability that the example x is a positive match in the classification task:
P(Y = 1) = 1 / (1 + e^{-K^T · X})
When the probability is greater than 0.5, we can predict "a match".
The probability is greater than 0.5 when:
g(w) > 0.5
and this is true when:
w = K^T · X > 0
The hyperplane:
K^T · X = 0
is the decision boundary.
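The step from the condition on g(w) to the condition on w follows from a short rearrangement, sketched here in the notation of this answer (K is the parameter vector and X the input, both as defined above):

```latex
\begin{aligned}
g(w) > \tfrac{1}{2}
  &\iff \frac{1}{1 + e^{-w}} > \frac{1}{2} \\
  &\iff 1 + e^{-w} < 2 \\
  &\iff e^{-w} < 1 \\
  &\iff w > 0, \qquad \text{where } w = K^{T} X .
\end{aligned}
```

Equality holds exactly at w = 0, where g(w) = 0.5, which is why the set of points with K^T X = 0 separates the two predictions.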
In summary:
logistic regression is a generalized linear model using the same basic formula of linear regression but it is regressing for the probability of a categorical outcome.
This is a very abridged version. You can find a simple explanation in these videos (third week of Machine Learning by Andrew Ng).
You can also take a look at http://www.holehouse.org/mlclass/06_Logistic_Regression.html for some notes on the lessons.
As explained earlier, logistic regression is a generalized linear model using the same basic formula of linear regression but it is regressing for the probability of a categorical outcome.
We get a similar type of equation for both linear and logistic regression.
The difference is that linear regression gives continuous values of y for a given x, whereas logistic regression gives continuous values of p(y=1) for a given x, which are later converted to y=0 or y=1 based on a threshold value (0.5).
Logistic regression falls under the category of supervised learning. It measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities using the logistic/sigmoid function.
Logistic regression is a bit similar to linear regression or we can see it as a generalized linear model.
In linear regression we predict output y based on a weighted sum of input variables.
y=c+ x1*w1 + x2*w2 + x3*w3 + .....+ xn*wn
The main purpose of linear regression is to estimate values of c,w1,w2,...,wn and minimize the cost function and predict y.
Logistic regression also does the same thing but with one addition. It passes the result through a special function called the logistic/sigmoid function to produce the output y.
y=logistic(c + x1*w1 + x2*w2 + x3*w3 + ....+ xn*wn)
y = 1 / (1 + e^{-(c + x1*w1 + x2*w2 + x3*w3 + .... + xn*wn)})