How exactly are the thresholds evaluated while plotting roc curves using sklearn - machine-learning

I am currently doing a 3 class classification problem using a random forest. I wanted to do some analysis using ROC curves. Since ROC are usually for binary classifiers, i used OneVSRest with random forest. This gave me 3 binary classifiers and using sklearn roc_curve() i plotted them. I dont understand how the threshold is being taken? From my understanding of roc curves for binary classifier , if we have logistic regression, we can change the threshold between 0 and 1 in order to classify. E.g., alpha = 0.5 : if p > 0.5, set to class 1 else class 2;
for alpha = 0.6 : if p > 0.6 set to class 1 else class 2 etc and so on. This is how the thresholds change (alpha is the threshold here) and thats how we get the roc curve by plotting the TPR vs FPR for different thresholds
Now how does this work in case of a random forest or any tree based algorithm? what is the threshold? and how do we extend this for multi class?
Please do correct me if i made a wrong assumption here
I tried different methods
OneVsRest(RandomForestClassifier())
And got 3 curves
But i dont know how these thresholds are coming and how exactly i need to use this to interpret results - like finding optimal threshold etc

Related

Accuracy in outlier detection

I am having an issue understanding this question while learning about outliers. I have attached an image of the question. Is there anyone help me understanding the question as I am new to Data Mining and unable to crack this question. Resources for expanding my knowledge will be appreciated.
All I know right now is you can check the accuracy of one model for detecting an outlier by comparing the generated results and predicted ones. But in this problem, there is no such actual values which has led me towards the issue. It would be a great favor if anyone can help me out.
Thanks in advanceenter image description here
The purpose of the questions seems to be more related to the ROC curve interpretation than to the task at hand being an outlier prediction problem. It seems that it needs to understand how to compare two algorithms based on the ROC curve, and to conclude that the suitable metric to be used in this case is the AUC score.
Using Python and scikit-learn we can easily plot the two ROC curves like this:
#define three lists with the given data: two sets of scores and their true class
scores1 = [0.44, 0.94, 1.86, 2.15, 0.15, 0.5, 5.4, 3.09, 7.97, 5.21]
scores2 = [0.73, 0.18, 0.76, 1.6, 3.78, 4.45, 0.3, 3.3, 0.44, 9.94]
y = [0,0,0,1,0,0,1,1,0,0]
# calculate fpr, tpr and classification thresholds
from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay
fpr1, tpr1, thresholds1 = roc_curve(y, scores1)
fpr2, tpr2, thresholds2 = roc_curve(y, scores2)
auc1 = roc_auc_score(y, scores1)
auc2 = roc_auc_score(y, scores2)
# get the curve displays using the above metrics
curve1 = RocCurveDisplay(fpr=fpr1, tpr=tpr1, roc_auc=auc1,
estimator_name='Algo1')
curve2 = RocCurveDisplay(fpr=fpr2, tpr=tpr2, roc_auc=auc2,
estimator_name='Algo2')
curve1.plot()
curve2.plot()
Then, from the plots you can interpret based on the values you can see for False Positive Rate on the x-axis vs True Positive Rate on the y-axis and the trade-off they imply. Moreover, you will see that the algorithm 1 having a graph that accounts for larger scores of TPR than those of the algorithm 2, is a better algorithm for this task. Moreover, this can be formalized using the AUC as a metric, which was calculated using "roc_auc_score".
Note that you can also have the plot manually if you calculate FPR and TPR for each of the algorithms using their corresponding classification thresholds.
Hope it helps :)
Regards,
Jehona.

How to adjust Logistic Regression classification threshold value in Scikit-learn? [duplicate]

I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set. I have even turned the class_weight feature to auto.
I know that in Logistic Regression it should be possible to know what is the threshold value for a particular pair of classes.
Is it possible to know what the threshold value is in each of the One-vs-All classes the LogisticRegression() method designs?
I did not find anything in the documentation page.
Does it by default apply the 0.5 value as threshold for all the classes regardless of the parameter values?
There is a little trick that I use, instead of using model.predict(test_data) use model.predict_proba(test_data). Then use a range of values for thresholds to analyze the effects on the prediction;
pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,.7,.75,.8,.85,.9,.95,.99]
for i in threshold_list:
print ('\n******** For i = {} ******'.format(i))
Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x>i else 0)
test_accuracy = metrics.accuracy_score(Y_test.as_matrix().reshape(Y_test.as_matrix().size,1),
Y_test_pred.iloc[:,1].as_matrix().reshape(Y_test_pred.iloc[:,1].as_matrix().size,1))
print('Our testing accuracy is {}'.format(test_accuracy))
print(confusion_matrix(Y_test.as_matrix().reshape(Y_test.as_matrix().size,1),
Y_test_pred.iloc[:,1].as_matrix().reshape(Y_test_pred.iloc[:,1].as_matrix().size,1)))
Best!
Logistic regression chooses the class that has the biggest probability. In case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same stands for the multiclass setting: again, it chooses the class with the biggest probability (see e.g. Ng's lectures, the bottom lines).
Introducing special thresholds only affects in the proportion of false positives/false negatives (and thus in precision/recall tradeoff), but it is not the parameter of the LR model. See also the similar question.
Yes, Sci-Kit learn is using a threshold of P>=0.5 for binary classifications. I am going to build on some of the answers already posted with two options to check this:
One simple option is to extract the probabilities of each classification using the output from model.predict_proba(test_x) segment of the code below along with class predictions (output from model.predict(test_x) segment of code below). Then, append class predictions and their probabilities to your test dataframe as a check.
As another option, one can graphically view precision vs. recall at various thresholds using the following code.
### Predict test_y values and probabilities based on fitted logistic
regression model
pred_y=log.predict(test_x)
probs_y=log.predict_proba(test_x)
# probs_y is a 2-D array of probability of being labeled as 0 (first
column of
array) vs 1 (2nd column in array)
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:,
1])
#retrieve probability of being 1(in second column of probs_y)
pr_auc = metrics.auc(recall, precision)
plt.title("Precision-Recall vs Threshold Chart")
plt.plot(thresholds, precision[: -1], "b--", label="Precision")
plt.plot(thresholds, recall[: -1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0,1])
we can use a wrapper as follows:
model = LogisticRegression()
model.fit(X, y)
def custom_predict(X, threshold):
probs = model.predict_proba(X)
return (probs[:, 1] > threshold).astype(int)
new_preds = custom_predict(X=X, threshold=0.4)

TensorFlow: Implementing a class-wise weighted cross entropy loss?

Assuming after performing median frequency balancing for images used for segmentation, we have these class weights:
class_weights = {0: 0.2595,
1: 0.1826,
2: 4.5640,
3: 0.1417,
4: 0.9051,
5: 0.3826,
6: 9.6446,
7: 1.8418,
8: 0.6823,
9: 6.2478,
10: 7.3614,
11: 0.0}
The idea is to create a weight_mask such that it could be multiplied by the cross entropy output of both classes. To create this weight mask, we can broadcast the values based on the ground_truth labels or the predictions. Some mathematics in my implementation:
Both labels and logits are of shape [batch_size, height, width, num_classes]
The weight mask is of shape [batch_size, height, width, 1]
The weight mask is broadcasted to the num_classes number of channels of the multiplication between the softmax of the logit and the labels to give an output shape of [batch_size, height, width, num_classes]. In this case, num_classes is 12.
Reduce sum for each example in a batch, then perform reduce mean for all examples in one batch to get a single scalar value of loss.
In this case, should we create the weight mask based on the predictions or the ground truth?
If we build it based on the ground_truth, then it means no matter what the predicted pixel labels are, they get penalized based on the actual labels of the class, which doesn't seem to guide the training in a sensible way.
But if we build it based on the predictions, then for whatever logit predictions that are produced, if the predicted label (from taking the argmax of the logit) is dominant, then the logit values for that pixel will all be reduced by a significant amount.
--> Although this means the maximum logit will still be the maximum since all of the logits in the 12 channels will be scaled by the same value, the final softmax probability of the label predicted (which is still the same before and after scaling), will be lower than before scaling (did some simple math to estimate). --> a lower loss is predicted
But the problem is this: If a lower loss is predicted as a result of this weighting, then wouldn't it contradict the idea that predicting dominant labels should give you a greater loss?
The impression I get in total for this method is that:
For the dominant labels, they are penalized and rewarded much lesser.
For the less dominant labels, they are rewarded highly if the predictions are correct, but they're also penalized heavily for a wrong prediction.
So how does this help to tackle the issue of class-balancing? I don't quite get the logic here.
IMPLEMENTATION
Here is my current implementation for calculating the weighted cross entropy loss, although I'm not sure if it is correct.
def weighted_cross_entropy(logits, onehot_labels, class_weights):
if not logits.dtype == tf.float32:
logits = tf.cast(logits, tf.float32)
if not onehot_labels.dtype == tf.float32:
onehot_labels = tf.cast(onehot_labels, tf.float32)
#Obtain the logit label predictions and form a skeleton weight mask with the same shape as it
logit_predictions = tf.argmax(logits, -1)
weight_mask = tf.zeros_like(logit_predictions, dtype=tf.float32)
#Obtain the number of class weights to add to the weight mask
num_classes = logits.get_shape().as_list()[3]
#Form the weight mask mapping for each pixel prediction
for i in xrange(num_classes):
binary_mask = tf.equal(logit_predictions, i) #Get only the positions for class i predicted in the logits prediction
binary_mask = tf.cast(binary_mask, tf.float32) #Convert boolean to ones and zeros
class_mask = tf.multiply(binary_mask, class_weights[i]) #Multiply only the ones in the binary mask with the specific class_weight
weight_mask = tf.add(weight_mask, class_mask) #Add to the weight mask
#Multiply the logits with the scaling based on the weight mask then perform cross entropy
weight_mask = tf.expand_dims(weight_mask, 3) #Expand the fourth dimension to 1 for broadcasting
logits_scaled = tf.multiply(logits, weight_mask)
return tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits_scaled)
Could anyone verify whether my concept of this weighted loss is correct, and whether my implementation is correct? This is my first time getting acquainted with a dataset with imbalanced class, and so I would really appreciate it if anyone could verify this.
TESTING RESULTS: After doing some tests, I found the implementation above results in a greater loss. Is this supposed to be the case? i.e. Would this make the training harder but produce a more accurate model eventually?
SIMILAR THREADS
Note that I have checked a similar thread here: How can I implement a weighted cross entropy loss in tensorflow using sparse_softmax_cross_entropy_with_logits
But it seems that TF only has a sample-wise weighting for loss but not a class-wise one.
Many thanks to all of you.
Here is my own implementation in Keras using the TensorFlow backend:
def class_weighted_pixelwise_crossentropy(target, output):
output = tf.clip_by_value(output, 10e-8, 1.-10e-8)
with open('class_weights.pickle', 'rb') as f:
weight = pickle.load(f)
return -tf.reduce_sum(target * weight * tf.log(output))
where weight is just a standard Python list with the indexes of the weights matched to those of the corresponding class in the one-hot vectors. I store the weights as a pickle file to avoid having to recalculate them. It is an adaptation of the Keras categorical_crossentropy loss function. The first line simply clips the value to make sure we never take the log of 0.
I am unsure why one would calculate the weights using the predictions rather than the ground truth; if you provide further explanation I can update my answer in response.
Edit: Play around with this numpy code to understand how this works. Also review the definition of cross entropy.
import numpy as np
weights = [1,2]
target = np.array([ [[0.0,1.0],[1.0,0.0]],
[[0.0,1.0],[1.0,0.0]]])
output = np.array([ [[0.5,0.5],[0.9,0.1]],
[[0.9,0.1],[0.4,0.6]]])
crossentropy_matrix = -np.sum(target * np.log(output), axis=-1)
crossentropy = -np.sum(target * np.log(output))

Random Forests and ROC Curves in Julia

I'm using the ScikitLearn flavour of the DecisionTree.jl package to create a random forest model for a binary classification problem of one of the RDatasets data sets (see bottom of the DecisionTree.jl home page for what I mean by ScikitLearn flavour). I'm also using the MLBase package for model evaluation.
I have built a random forest model of my data and would like to create a ROC Curve for this model. Reading the documentation available, I do understand what a ROC curve is in theory. I just can't figure out how to create one for a specific model.
From the Wikipedia page the last part of the first sentence that I have marked in bold italics below is the one that is causing my confusion: "In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied." There is more on the threshold value throughout the article but this still confuses me for binary classification problems. What is the threshold value and how do I vary it?
Also, in the MLBase documentation on ROC Curves it says "Compute an ROC instance or an ROC curve (a vector of ROC instances), based on given scores and a threshold thres." But doesn't mention this threshold anywhere else really.
Example code for my project is given below. Basically, I want to create a ROC curve for the random forest but I'm not sure how to or if it's even appropriate.
using DecisionTree
using RDatasets
using MLBase
quakes_data = dataset("datasets", "quakes");
# Add in a binary column as feature column for classification
quakes_data[:MagGT5] = convert(Array{Int32,1}, quakes_data[:Mag] .> 5.0)
# Getting features and labels where label = 1 is mag > 1 and label = 2 is mag <= 5
features = convert(Array, quakes_data[:, [1:3;5]]);
labels = convert(Array, quakes_data[:, 6]);
labels[labels.==0] = 2
# Create a random forest model with the tuning parameters I want
r_f_model = RandomForestClassifier(nsubfeatures = 3, ntrees = 50, partialsampling=0.7, maxdepth = 4)
# Train the model in-place on the dataset (there isn't a fit function without the in-place functionality)
DecisionTree.fit!(r_f_model, features, labels)
# Apply the trained model to the test features data set (here I haven't partitioned into training and test)
r_f_prediction = convert(Array{Int64,1}, DecisionTree.predict(r_f_model, features))
# Applying the model to the training set and looking at model stats
TrainingROC = roc(labels, r_f_prediction) #getting the stats around the model applied to the train set
# p::T # positive in ground-truth
# n::T # negative in ground-truth
# tp::T # correct positive prediction
# tn::T # correct negative prediction
# fp::T # (incorrect) positive prediction when ground-truth is negative
# fn::T # (incorrect) negative prediction when ground-truth is positive
I also read this question and didn't find it helpful really.
The task in binary classification is to give a 0/1 (or true/false, red/blue) label to a new, unlabeled, data-point. Most classification algorithms are designed to output a continuous real value. This value is optimized to be higher for points with known or predicted label 1, and lower for points with known or predicted label 0. To use this value to generate a 0/1 prediction, an additional threshold is used. Points with a value higher than threshold are predicted to be labeled 1 (and for lower than threshold a 0 label is predicted ).
Why is this setup useful? Because, sometimes mispredicting a 0 instead of a 1 is more costly, and then you can set the threshold low, making the algorithm output predict 1s more often.
In an extreme case when predicting 0 instead of a 1 costs nothing for the application, you can set the threshold at infinity, making it always output 0 (which is obviously the best solution, since it incurs no cost).
The threshold trick cannot eliminate errors from the classifier - no classifier in real-world problems is perfect or free from noise. What it can do is change the ratio between the 0-when-really-1 errors and 1-when-really-0 errors for the final classification.
As you increase the threshold, more points are classified with a 0 label. Consider a chart with the fraction of points classified with 0 on the x-axis, and the fraction of points with a 0-when-really-1 error on the y-axis. For each value of the threshold, plot a point for the resulting classifier on this chart. Plotting a point for all thresholds you get a curve. This is (some variant of) the ROC curve, which summarizes the abilities of the classifier. An often used metric for quality of classification is the AUC or area-under-curve of this chart, but in fact, the whole curve can be of interest in applications.
A summary like this appears in many texts on machine learning, which are a google query away.
Hope this clarifies the role of the threshold and its relation to ROC curves.

How to predict a continuous dependent variable that expresses target class probabilities?

My samples can either belong to class 0 or class 1 but for some of my samples I only have a probability available for them to belong to class 1. So far I've discretized my target variable by applying a threshold i.e. all y >= t I assigned to class 1 and I've discarded all samples that have non-zero probability to belong to class 1. Then I fitted a linear SVM to the data using scitkit-learn.
Of cause this way I through away quite a bit of the training data. One idea I had was to omit the discretization and use regression instead but usually it's not a good idea to approach classification by regression as for example it doesn't guarantee predicted values to be in the interval [0,1].
By the way the nature of my features x is similar as for some of them I also only have probabilities for the respective feature to be present. For the error it didn't make a big difference if I discretized my features in the same way I discretized the dependent variable.
You might be able to approximate this using sample weighting - assign a sample to the class which has the highest probability, but weight that sample by the probability of it actually belonging. Many of the scikit-learn estimators allow for this.
Example:
X = [1, 2, 3, 4] -> class 0 with probability .7 would become X = [1, 2, 3, 4] y = [0] with sample weight of .7 . You might also normalize so the sample weights are between 0 and 1 (since your probabilities and sample weights will only be from .5 to 1. in this scheme). You could also incorporate non-linear penalties to "strengthen" the influence of high probability samples.

Resources