I have an SVM model that I trained (SVC(class_weight='balanced')). I use predict_proba() to get probabilities to compute ROC AUC, and predict() to get predictions for f1_score. From the documentation, I'd expect (predict_proba() > 0.5).astype(int) == predict(), but this is not the case. Can anyone help me understand why not? Are my f1_score and ROC AUC scores still valid?
a = svm.predict_proba(vec.transform(X))[:, 1]  # probability of the positive class
b = svm.predict(vec.transform(X))              # hard predictions
print(np.mean(b))                              # fraction predicted positive by predict()
print(np.mean((a > 0.5).astype(int)))          # fraction with predict_proba() > 0.5
which prints:
0.2517116391461941
0.12907772855416835
The problem lies in the fact that there is no such thing as a "probability" of belonging to a given class under an SVM model. It is just not a probabilistic classifier.
What sklearn does is retrospectively fit another, probabilistic model onto the SVM scores (distances from the hyperplane), so there is no direct correspondence between predict_proba() and predict(). Instead, consider using .decision_function(); there, a threshold of 0 corresponds to .predict().
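For example, a minimal sketch of that check, assuming the same fitted svm, vectorizer vec, and data X as in the question, with 0/1 class labels:

import numpy as np

scores = svm.decision_function(vec.transform(X))  # signed distances from the hyperplane
preds = svm.predict(vec.transform(X))             # hard 0/1 predictions

# Thresholding the decision scores at 0 should match predict(), unlike predict_proba() at 0.5.
print(np.array_equal((scores > 0).astype(int), preds))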
Related
I have a question. Is the best score from GridSearchCV, which corresponds to the mean cross-validation score, the right metric to evaluate an algorithm trained with unbalanced data?
GridSearchCV can be used to find appropriate parameter values for your model.
As for the right metric to evaluate an algorithm trained with unbalanced data, you want to look at the area under the precision-recall curve (PR AUC), also reported as 'average precision', or maybe even a cost-sensitive metric (Jason Brownlee has a bunch of blog posts on this topic).
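For instance, here is a rough sketch of pointing GridSearchCV at average precision instead of the default scorer (the synthetic dataset and parameter grid below are purely illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Imbalanced toy data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

grid = GridSearchCV(
    SVC(class_weight='balanced'),
    param_grid={'C': [0.1, 1, 10]},
    scoring='average_precision',  # PR-AUC-style metric, better suited to imbalance than accuracy
    cv=5,
)
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)  # best mean CV average precision and its parameters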
I'm working on a binary classification problem. I fit a logistic regression model and a support vector machine model from sklearn on the same, imbalanced training data, with class weights adjusted, and they achieved comparable performance. When I used these two pre-trained models to predict a new dataset, the LR model and the SVM model predicted a similar number of instances as positives, and the predicted instances largely overlap.
However, when I looked at the probability scores of being classified as positive, the LR distribution runs from 0.5 to 1 while the SVM's starts from around 0.1. I called model.predict(prediction_data) to find out the instances predicted as each class, and model.predict_proba(prediction_data) to get the probability scores of being classified as 0 (neg) and 1 (pos), assuming they all have a default threshold of 0.5.
There is no error in my code and I have no idea why the SVM predicted instances with probability scores < 0.5 as positives as well. Any thoughts on how to interpret this situation?
That's a known fact in sklearn when it comes to binary classification problems with SVC(), which is reported, for instance, in these github issues (here and here). Moreover, it is also reported in the User Guide, where it is said that:
In addition, the probability estimates may be inconsistent with the scores:
the “argmax” of the scores may not be the argmax of the probabilities; in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.
or directly within the libsvm FAQ, where it is said that:
Let's just consider two-class classification here. After probability information is obtained in training, we do not have prob >= 0.5 if and only if decision value >= 0.
All in all, the point is that:
on one side, predictions are based on decision_function values: if the decision value computed on a new instance is positive, the predicted class is the positive class, and vice versa.
on the other side, as stated within one of the github issues, np.argmax(self.predict_proba(X), axis=1) != self.predict(X), which is where the inconsistency comes from. In other words, in order to always have consistency on binary classification problems you would need a classifier whose predictions are based on the output of predict_proba() (which is, by the way, what you get when using calibrators), like so:
def predict(self, X):
    y_proba = self.predict_proba(X)
    return np.argmax(y_proba, axis=1)
I'd also suggest this post on the topic.
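If it helps, here is a hedged sketch of the calibrator route mentioned above: wrapping the SVC in CalibratedClassifierCV makes predict() follow the argmax of predict_proba(), so the two should stay consistent (the synthetic data below is just for illustration):

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
calibrated = CalibratedClassifierCV(SVC(class_weight='balanced'), cv=5)
calibrated.fit(X, y)

# The calibrated model's predictions agree with the argmax of its probabilities.
proba = calibrated.predict_proba(X)
print(np.array_equal(np.argmax(proba, axis=1), calibrated.predict(X)))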
Which evaluation metric should I use for a classification problem? On what factors should I decide?
1. Accuracy
2. F1 Score
3. AUC ROC Score
4. Log Loss
Accuracy is a great metric when you are working with a balanced dataset. It's the number of correct predictions over the total number of predictions.
F1 Score is a great metric when you want to balance the precision and the recall of the predictions; it's also a good choice for imbalanced datasets.
AUC ROC Score measures how well the model ranks positives above negatives across all thresholds (it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative). I really like using this evaluation metric; it works for both balanced and imbalanced datasets.
Log Loss is the logarithmic loss of the prediction, based on the cross-entropy between the predicted probabilities and the true labels. I have never used this metric before.
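For reference, a small sketch of how each of those metrics is computed with sklearn.metrics (the labels and probabilities below are made up purely for illustration):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])               # made-up ground truth
y_prob = np.array([0.2, 0.4, 0.9, 0.6, 0.3, 0.1])   # predicted probability of class 1
y_pred = (y_prob >= 0.5).astype(int)                 # hard labels at the default 0.5 threshold

print('accuracy:', accuracy_score(y_true, y_pred))
print('f1:      ', f1_score(y_true, y_pred))
print('roc auc: ', roc_auc_score(y_true, y_prob))    # needs scores/probabilities, not hard labels
print('log loss:', log_loss(y_true, y_prob))         # likewise computed from probabilities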
I am new to machine learning and I am currently working on a classification problem. I am able to train the model and predict test data sets. I want to know whether there is some way I can get scores along with the predictions. By scores, I mean confidence (proximity) scores attached to each prediction. For example, in the standard age-salary-buy problem (based on age and salary, will the customer buy the product or not), I want to know, in addition to the prediction of whether he will buy it or not, a score out of 100 that he will buy the product.
Currently, I am using the LibSVM algorithm. Is there some algorithm which provides me with the above?
Thanks.
What you are looking for is a measure of support for your decision. In other words, many classifiers base their decision about the class of x over the labels Y on:
cl(x) = arg max_{y \in Y} p(y|x)
where p(y|x) is their internal estimate of "x having label y". Such classifiers include:
neural networks (with sigmoid output)
logistic regression
naive bayes
voting ensembles (such as RF)
...
These estimates can easily be converted to your 0-100 scale, as a probability is on a 0-1 scale (just multiply by 100).
Others, on the other hand (such as SVM), use a measure that is proportional to the probability but unbounded; here you can get this value (often called the decision function), but you cannot convert it to a 0-100 score, as there is no "maximum" value. This is a big drawback, so some modifications have been proposed. In particular, for SVM you have Platt's scaling, which fits a logistic regression on top of the SVM scores so that you get a probability estimate. In libSVM you can set -b to get probability estimates.
From the libsvm website:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
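In scikit-learn the equivalent switch is SVC(probability=True), which fits the same kind of Platt scaling internally; a rough sketch with synthetic data (multiply by 100 for the 0-100 score asked about above):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
clf = SVC(probability=True).fit(X, y)       # enables internal Platt scaling

buy_prob = clf.predict_proba(X[:5])[:, 1]   # estimated probability of the positive class
print((buy_prob * 100).round(1))            # roughly a "score out of 100" for each instance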
I noticed that my f-scores are slightly lower when using SK-learn's LogisticRegression classifier in conjunction with the following one-vs-rest classifier than when using it by itself to do multi-class classification.
class MyOVRClassifier(sklearn.multiclass.OneVsRestClassifier):
    """
    This OVR classifier will always choose at least one label,
    regardless of the probability
    """
    def predict(self, X):
        probs = self.predict_proba(X)[0]
        p_max = max(probs)
        return [tuple([self.classes_[i] for i, p in enumerate(probs) if p == p_max])]
Since the documentation of the logistic regression classifier states it uses a one-vs-all strategy, I'm wondering what factors could account for the difference in performance. My one-vs-rest LR classifier seems to be over-predicting one of the classes more than the LR classifier does on its own.
Just guessing, but probably when "no one votes" you get many tiny floating-point values, and with LR you end up underflowing to zero. So instead of picking the most confident / closest classifier, you end up picking based on tie-breaking among zeros. See an example here of the difference.
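A toy illustration of that tie-breaking point (the numbers below are made up; this just shows what argmax does when every per-class score underflows):

import numpy as np

probs = np.array([0.0, 0.0, 0.0])  # every per-class probability underflowed to zero
print(np.argmax(probs))            # prints 0: argmax breaks the tie by picking the first class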