How to predict a continuous dependent variable that expresses target class probabilities? - machine-learning

My samples can either belong to class 0 or class 1 but for some of my samples I only have a probability available for them to belong to class 1. So far I've discretized my target variable by applying a threshold i.e. all y >= t I assigned to class 1 and I've discarded all samples that have non-zero probability to belong to class 1. Then I fitted a linear SVM to the data using scitkit-learn.
Of cause this way I through away quite a bit of the training data. One idea I had was to omit the discretization and use regression instead but usually it's not a good idea to approach classification by regression as for example it doesn't guarantee predicted values to be in the interval [0,1].
By the way the nature of my features x is similar as for some of them I also only have probabilities for the respective feature to be present. For the error it didn't make a big difference if I discretized my features in the same way I discretized the dependent variable.

You might be able to approximate this using sample weighting - assign a sample to the class which has the highest probability, but weight that sample by the probability of it actually belonging. Many of the scikit-learn estimators allow for this.
Example:
X = [1, 2, 3, 4] -> class 0 with probability .7 would become X = [1, 2, 3, 4] y = [0] with sample weight of .7 . You might also normalize so the sample weights are between 0 and 1 (since your probabilities and sample weights will only be from .5 to 1. in this scheme). You could also incorporate non-linear penalties to "strengthen" the influence of high probability samples.

Related

Which metric to use for imbalanced classification problem?

I am working on a classification problem with very imbalanced classes. I have 3 classes in my dataset : class 0,1 and 2. Class 0 is 11% of the training set, class 1 is 13% and class 2 is 75%.
I used and random forest classifier and got 76% accuracy. But I discovered 93% of this accuracy comes from class 2 (majority class). Here is the Crosstable I got.
The results I would like to have :
fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1
What I found on the internet to solve the problem and what I've tried :
using class_weight='balanced' or customized class_weight ( 1/11% for class 0, 1/13% for class 1, 1/75% for class 2), but it doesn't change anything (the accuracy and crosstable are still the same). Do you have an interpretation/explenation of this ?
as I know accuracy is not the best metric in this context, I used other metrics : precision_macro, precision_weighted, f1_macro and f1_weighted, and I implemented the area under the curve of precision vs recall for each class and use the average as a metric.
Here's my code (feedback welcome) :
from sklearn.preprocessing import label_binarize
def pr_auc_score(y_true, y_pred):
y=label_binarize(y_true, classes=[0, 1, 2])
return average_precision_score(y[:,:],y_pred[:,:])
pr_auc = make_scorer(pr_auc_score, greater_is_better=True,needs_proba=True)
and here's a plot of the precision vs recall curves.
Alas, for all these metrics, the crosstab remains the same... they seem to have no effect
I also tuned the parameters of Boosting algorithms ( XGBoost and AdaBoost) (with accuracy as metric) and again the results are not improved.. I don't understand because boosting algorithms are supposed to handle imbalanced data
Finally, I used another model (BalancedRandomForestClassifier) and the metric I used is accuracy. The results are good as we can see in this crosstab. I am happy to have such results but I notice that, when I change the metric for this model, there is again no change in the results...
So I'm really interested in knowing why using class_weight, changing the metric or using boosting algorithms, don't lead to better results...
As you have figured out, you have encountered the "accuracy paradox";
Say you have a classifier which has an accuracy of 98%, it would be amazing, right? It might be, but if your data consists of 98% class 0 and 2% class 1, you obtain a 98% accuracy by assigning all values to class 0, which indeed is a bad classifier.
So, what should we do? We need a measure which is invariant to the distribution of the data - entering ROC-curves.
ROC-curves are invariant to the distribution of the data, thus are a great tool to visualize classification-performances for a classifier whether or not it is imbalanced. But, they only work for a two-class problem (you can extend it to multiclass by creating a one-vs-rest or one-vs-one ROC-curve).
F-score might a bit more "tricky" to use than the ROC-AUC since it's a trade off between precision and recall and you need to set the beta-variable (which is often a "1" thus the F1 score).
You write: "fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1". Remember, that all algorithms work by either minimizing something or maximizing something - often we minimize a loss function of some sort. For a random forest, lets say we want to minimize the following function L:
L = (w0+w1+w2)/n
where wi is the number of class i being classified as not class i i.e if w0=13 we have missclassified 13 samples from class 0, and n the total number of samples.
It is clear that when class 0 consists of most of the data then an easy way to get a small L is to classify most of the samples as 0. Now, we can overcome this by adding a weight instead to each class e.g
L = (b0*w0+b1*w1+b2*x2)/n
as an example say b0=1, b1=5, b2=10. Now you can see, we cannot just assign most of the data to c0 without being punished by the weights i.e we are way more conservative by assigning samples to class 0, since assigning a class 1 to class 0 gives us 5 times as much loss now as before! This is exactly how the weight in (most) of the classifiers work - they assign a penalty/weight to each class (often proportional to it's ratio i.e if class 0 consists of 80% and class 1 consists of 20% of the data then b0=1 and b1=4) but you can often specify the weight your self; if you find that the classifier still generates to many false negatives of a class then increase the penalty for that class.
Unfortunately "there is no such thing as a free lunch" i.e it's a problem, data and usage specific choice, of what metric to use.
On a side note - "random forest" might actually be bad by design when you don't have much data due to how the splits are calculated (let me know, if you want to know why - it's rather easy to see when using e.g Gini as splitting). Since you have only provided us with the ratio for each class and not the numbers, I cannot tell.

Confused about sklearn’s implementation of OSVM

I have recently started experimenting with OneClassSVM ( using Sklearn ) for unsupervised learning and I followed
this example .
I apologize for the silly questions But I’m a bit confused about two things :
Should I train my svm on both regular example case as well as the outliers , or the training is on regular examples only ?
Which of labels predicted by the OSVM and represent outliers is it 1 or -1
Once again i apologize for those questions but for some reason i cannot find this documented anyware
As this example you reference is about novelty-detection, the docs say:
novelty detection:
The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
Meaning: you should train on regular examples only.
The approach is based on:
Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural computation 13.7 (2001): 1443-1471.
Extract:
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specied value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.
The above docs also say:
Inliers are labeled 1, while outliers are labeled -1.
This can also be seen in your example code, extracted:
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
...
# all regular = inliers (defined above)
y_pred_test = clf.predict(X_test)
...
# -1 = outlier <-> error as assumed to be inlier
n_error_test = y_pred_test[y_pred_test == -1].size

How to adjust Logistic Regression classification threshold value in Scikit-learn? [duplicate]

I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set. I have even turned the class_weight feature to auto.
I know that in Logistic Regression it should be possible to know what is the threshold value for a particular pair of classes.
Is it possible to know what the threshold value is in each of the One-vs-All classes the LogisticRegression() method designs?
I did not find anything in the documentation page.
Does it by default apply the 0.5 value as threshold for all the classes regardless of the parameter values?
There is a little trick that I use, instead of using model.predict(test_data) use model.predict_proba(test_data). Then use a range of values for thresholds to analyze the effects on the prediction;
pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,.7,.75,.8,.85,.9,.95,.99]
for i in threshold_list:
print ('\n******** For i = {} ******'.format(i))
Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x>i else 0)
test_accuracy = metrics.accuracy_score(Y_test.as_matrix().reshape(Y_test.as_matrix().size,1),
Y_test_pred.iloc[:,1].as_matrix().reshape(Y_test_pred.iloc[:,1].as_matrix().size,1))
print('Our testing accuracy is {}'.format(test_accuracy))
print(confusion_matrix(Y_test.as_matrix().reshape(Y_test.as_matrix().size,1),
Y_test_pred.iloc[:,1].as_matrix().reshape(Y_test_pred.iloc[:,1].as_matrix().size,1)))
Best!
Logistic regression chooses the class that has the biggest probability. In case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same stands for the multiclass setting: again, it chooses the class with the biggest probability (see e.g. Ng's lectures, the bottom lines).
Introducing special thresholds only affects in the proportion of false positives/false negatives (and thus in precision/recall tradeoff), but it is not the parameter of the LR model. See also the similar question.
Yes, Sci-Kit learn is using a threshold of P>=0.5 for binary classifications. I am going to build on some of the answers already posted with two options to check this:
One simple option is to extract the probabilities of each classification using the output from model.predict_proba(test_x) segment of the code below along with class predictions (output from model.predict(test_x) segment of code below). Then, append class predictions and their probabilities to your test dataframe as a check.
As another option, one can graphically view precision vs. recall at various thresholds using the following code.
### Predict test_y values and probabilities based on fitted logistic
regression model
pred_y=log.predict(test_x)
probs_y=log.predict_proba(test_x)
# probs_y is a 2-D array of probability of being labeled as 0 (first
column of
array) vs 1 (2nd column in array)
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:,
1])
#retrieve probability of being 1(in second column of probs_y)
pr_auc = metrics.auc(recall, precision)
plt.title("Precision-Recall vs Threshold Chart")
plt.plot(thresholds, precision[: -1], "b--", label="Precision")
plt.plot(thresholds, recall[: -1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0,1])
we can use a wrapper as follows:
model = LogisticRegression()
model.fit(X, y)
def custom_predict(X, threshold):
probs = model.predict_proba(X)
return (probs[:, 1] > threshold).astype(int)
new_preds = custom_predict(X=X, threshold=0.4)

Deep Learning an Imbalanced data set

I have two data sets that looks like this:
DATASET 1
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 12)
DATASET 2
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 8)
I am trying to build a deep feedforward neural net in Tensorflow. I get accuracies in the 90s and AUC scores in the 80s. Of course, the data set is heavily imbalanced so those metrics are useless. My emphasis is on getting a good recall value and I do not want to oversample the Class 1. I have toyed with the complexity of the model to no avail, the best model predicted only 25% of the positive class correctly.
My question is, considering the distribution of these data sets, is it a futile move to build models without getting more data(I can't get more data) or there's a way around getting to work with data that is this much imbalanced.
Thanks!
Question
Can I use tensorflow to learn imbalance classification with a ratio of about 30:1
Answer
Yes, and I have. Specifically Tensorflow provides the ability to feed in a weight matrix. Look at tf.losses.sigmoid_cross_entropy, there is a weights parameter. You can feed in a matrix that matches Y in shape and for each value of Y provide the relative weight that training example should have.
One way to find the correct weights is to start different balances and run your training and then look at your confusion matrix and a run down of precision vs accuracy for each class. Once you get both classes to have about the same precision to accuracy ratio then they are balanced.
Example Implementation
Here is an example implementation that converts a Y into a weight matrix that has performed very well for me
def weightMatrix( matrix , most=0.9 ) :
b = np.maximum( np.minimum( most , matrix.mean(0) ) , 1. - most )
a = 1./( b * 2. )
weights = a * ( matrix + ( 1 - matrix ) * b / ( 1 - b ) )
return weights
The most parameter represents the largest fractional difference to consider. 0.9 equates to .1:.9 = 1:9 , where as .5 equates to 1:1. Values below .5 don't work.
You might be interested to have a look at this question and its answer. Its scope is a priori more restricted than yours, as it addresses specifically weights for classification, but it seems very relevant to your case.
Also, AUC is definitely not irrelevant: it is actually independent of your data imbalance.

How to compute probabilities instead of actual classifications on ML problems

Let's assume that we have a few data points that can be used as the training set. Each row is consisted of 4 say columns (features) that take boolean values. The 5th column expresses the class and it also takes boolean values. Here is an example (they are almost random):
1,1,1,0,1
0,1,1,0,1
1,1,0,0,1
0,0,0,0,0
1,0,0,1,0
0,0,0,0,0
Now, what I want to do is to build a model such that for any given input (new line) the system does not return the class itself (like in the case of a regular classification problem) but instead the probability this particular input belongs to class 0 or class 1. How can I do that? What's more, how can I generate a confidence interval or error rate associated with that computation?
Not all classification algorithms return probabilities, because not all of them have an underlying probabilistic model. For example, a classification tree is just a set of rules that you follow to assign each new input to a particular class.
An example of a classification algorithm that does have an underlying probabilistic model is logistic regression. In this algorithm, the probability that a particular input x is in the class is
prob = 1 / (1 + exp( -theta * x ))
where theta is a vector of coefficients with the same number of dimensions as x. Generally to move from probabilities to classifications, you simply threshold, e.g.
if prob < 0.5
return 0;
else
return 1;
end
Other classification algorithms may have probabilistic interpretations, for example random forests are essentially a voting algorithm with multiple classification trees. If 80% of the trees vote for class 1 and 20% vote for class 2, then you could output an 80% probability of being in class 1. But this is a side effect of how the model works, rather than an explicit underlying probability model.

Resources