Determine accuracy of a model which estimates the probability of one of the classes - machine-learning

I'm modeling an event with two outcomes, 0 (rejection) and 1 (acceptance). I have created a model which estimates the probability that 1 (acceptance) will happen (i.e. the model will calculate that '1' will happen with an 80% chance; in other words, the probability of acceptance is 0.8).
Now I have a large record of trial outcomes together with the estimates from the model (for example: probability of acceptance = 0.8 and actual class acceptance = 1). I would like to quantify or validate how accurate the model is. Is this possible, and if so, how?
Note: I am just predicting the probability of class 1. Let's say the prediction for class 1 is 0.8 and the actual class value is 1. Now I want to find the performance of my model.

You simply need to convert your probability to one of two discrete classes with thresholded rounding, i.e. if p(y=1|x) > 0.5, predict 1; else predict 0. Then all of the standard classification metrics are applicable. The threshold can be chosen by inspecting the ROC curve and/or the precision-recall trade-off, or it can simply be set to 0.5.
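A minimal sketch of this thresholding step with scikit-learn; the probabilities and labels here are made-up toy data standing in for your recorded trials:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# hypothetical model outputs: P(y=1|x) for each trial, plus the recorded outcomes
y_prob = np.array([0.8, 0.3, 0.6, 0.9, 0.2, 0.7])
y_true = np.array([1,   0,   1,   1,   0,   0])

# threshold at 0.5: probability above the threshold -> predict class 1
y_pred = (y_prob > 0.5).astype(int)

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(precision_score(y_true, y_pred))  # of predicted 1s, how many were really 1
print(recall_score(y_true, y_pred))     # of actual 1s, how many were found
```

Once the probabilities are discretized this way, any metric that expects hard labels (accuracy, precision, recall, F1, confusion matrix) applies directly.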

Sort the objects by predicted probability, then compute the AUC of the resulting ROC curve.
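In scikit-learn this sorting and area computation is handled by roc_auc_score, which takes the raw probabilities directly (no thresholding needed); the data below is an illustrative toy example:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.8, 0.3, 0.6, 0.9, 0.2, 0.7]  # model's estimated P(y=1|x)

# AUC = probability that a random positive is ranked above a random negative;
# 1.0 is perfect ranking, 0.5 is random guessing
auc = roc_auc_score(y_true, y_prob)
print(auc)  # 8/9 for this toy data: 8 of the 9 pos/neg pairs are ranked correctly
```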

Related

naive bayes accuracy increasing as increasing in the alpha value

I'm using naive Bayes for text classification and I have 100k records, of which 88k are positive-class records and 12k are negative-class records. I converted sentences to unigrams and bigrams using CountVectorizer, took 50 alpha values in the range [0, 10], and drew the plot.
With Laplace additive smoothing, if I keep increasing the alpha value, accuracy on the cross-validation dataset also keeps increasing. My question is: is this trend expected or not?
If you keep increasing the alpha value, the naive Bayes model will be biased towards the class with more records and becomes a dumb model (underfitting), so choosing a small alpha value is a good idea.
Because you have 88k positive points and 12k negative points, you have an imbalanced data set.
You can add more negative points to balance the data set: clone or replicate your negative points, which is called upsampling. After that, your data set is balanced and you can apply naive Bayes with alpha and it will work properly. Your model is then no longer dumb; earlier your model was dumb, which is why increasing alpha increased your accuracy.
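A minimal sketch of this upsampling step, assuming bag-of-words count features; the data here is random toy data, not the asker's corpus, and the 8:2 ratio just mimics the imbalance:

```python
import numpy as np
from sklearn.utils import resample
from sklearn.naive_bayes import MultinomialNB

# toy bag-of-words count matrix: 8 positive rows, 2 negative rows (imbalanced)
rng = np.random.default_rng(0)
X_pos = rng.integers(0, 5, size=(8, 10))
X_neg = rng.integers(0, 5, size=(2, 10))

# upsample the minority class by sampling its rows with replacement
X_neg_up = resample(X_neg, replace=True, n_samples=8, random_state=0)

X = np.vstack([X_pos, X_neg_up])
y = np.array([1] * 8 + [0] * 8)  # now balanced: 8 positive, 8 negative

# with balanced classes, a small alpha is a sensible starting point
clf = MultinomialNB(alpha=0.1)
clf.fit(X, y)
```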

Computing ROC curve for K-NN classifier

As you probably know, in K-NN the decision is usually taken by "majority vote", not according to some threshold; i.e. there is no parameter to base a ROC curve on.
Note that in the implementation of my K-NN classifier, the votes don't have equal weights. I.e. the weight for each "neighbor" is e^(-d), where d is the distance between the tested sample and the neighbor. This measure gives higher weights for the votes of the nearer neighbors among the K neighbors.
My current decision rule is that if the sum of the scores of the positive neighbors is higher than the sum of the scores of the negative neighbors, then my classifier says POSITIVE; else, it says NEGATIVE.
But - There is no threshold.
Then, I thought about the following idea:
Deciding on the class of the samples which has a higher sum of votes, could be more generally described as using the threshold 0, for the score computed by: (POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES)
So I thought changing my decision rule to be using a threshold on that measure, and plotting a ROC curve basing on thresholds on the values of
(POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES)
Does it sound like a good approach for this task?
Yes, it is more or less what is typically used. If you take a look at scikit-learn, it has weights in k-NN, and it also has predict_proba, which gives you a clear decision threshold. Typically you do not want to threshold a difference, however, but rather a ratio:
votes positive / (votes negative + votes positive) < T
This way, you know that you just have to "move" the threshold from 0 to 1, not over arbitrary values. It also now has a clear interpretation: an internal probability estimate that you consider "sure enough". By default T = 0.5: if the probability is above 50% you classify as positive, but as said before, you can do anything with it.
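A sketch of this in scikit-learn: KNeighborsClassifier accepts a callable for its weights parameter, so the asker's e^(-d) weighting can be passed directly, and predict_proba then returns exactly the weighted-vote ratio above. The dataset is synthetic toy data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# e^(-d) vote weighting, passed as a callable applied to the
# distances of each query point's K nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5, weights=lambda d: np.exp(-d))
knn.fit(X, y)

# predict_proba gives votes_pos / (votes_pos + votes_neg) per sample;
# sweeping a threshold over this score traces the ROC curve
scores = knn.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)
print(roc_auc_score(y, scores))
```

Note this evaluates on the training set only for brevity; for an honest ROC curve, score a held-out test set instead.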

receiver operating characteristic (ROC) on a test set

The following image definitely makes sense to me.
Say you have a few trained binary classifiers A, B (B not much better than random guessing etc. ...) and a test set composed of n test samples to go with all those classifiers. Since Precision and Recall are computed for all n samples, those dots corresponding to classifiers make sense.
Now sometimes people talk about ROC curves, and I understand that precision is expressed as a function of recall, or simply plotted as Precision(Recall).
I don't understand where this variability comes from, since you have a fixed number of test samples. Do you just pick some subsets of the test set and find precision and recall in order to plot them, hence many discrete values (or an interpolated line)?
The ROC curve is well-defined for a binary classifier that expresses its output as a "score." The score can be, for example, the probability of being in the positive class, or it could also be the probability difference (or even the log-odds ratio) between probability distributions over each of the two possible outcomes.
The curve is obtained by setting the decision threshold for this score at different levels and measuring the true-positive and false-positive rates, given that threshold.
There's a good example of this process in Wikipedia's "Receiver Operating Characteristic" page:
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
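The Wikipedia example can be simulated in a few lines. The means (2 and 1 g/dL) come from the quoted text; the standard deviation is not stated there, so 1 g/dL is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# blood protein levels: means from the example above, SD assumed to be 1 g/dL
diseased = rng.normal(loc=2.0, scale=1.0, size=100_000)
healthy = rng.normal(loc=1.0, scale=1.0, size=100_000)

for threshold in (1.0, 1.5, 2.0):
    tpr = (diseased > threshold).mean()  # sensitivity at this cutoff
    fpr = (healthy > threshold).mean()   # false positive rate at this cutoff
    print(f"threshold={threshold}: TPR={tpr:.3f}, FPR={fpr:.3f}")
```

Raising the threshold lowers both TPR and FPR at once, which is exactly the leftward movement along the ROC curve described above.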
If code speaks more clearly to you, here's the code in scikit-learn that computes an ROC curve given a set of predictions for each item in a dataset. The fundamental operation seems to be (direct link):
# sort samples by classifier score, highest first
desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
y_score = y_score[desc_score_indices]
y_true = y_true[desc_score_indices]
# accumulate the true positives with decreasing threshold
tps = y_true.cumsum()
# at position i, (i + 1) samples are predicted positive; those that are
# not true positives must be false positives
fps = 1 + np.arange(len(y_true)) - tps
return fps, tps, y_score
(I've omitted a bunch of code in there that deals with (common) cases of having weighted samples and when the classifier gives near-identical scores to multiple samples.) Basically the true labels are sorted in descending order by the score assigned to them by the classifier, and then their cumulative sum is computed, giving the true positive count as a function of the score threshold; normalizing the counts gives the rates.
And here's an example showing how this gets used: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
The ROC curve just shows how much sensitivity (TPR) you will obtain if you increase the FPR by some amount: the trade-off between TPR and FPR. The variability comes from varying some parameter of the classifier (for the logistic regression case below, it is the threshold value).
For example, logistic regression gives you the probability that an object belongs to the positive class (values in [0, 1]), but it's just a probability, not a class. So in the general case you have to specify a threshold for the probability above which you will classify an object as positive. You can train a logistic regression, obtain from it the probabilities of the positive class for each object in your set, and then vary this threshold parameter with some step from 0 to 1. By thresholding your probabilities with each value, you get class labels for every object and compute TPR and FPR from those labels. Thus you get a (TPR, FPR) pair for every threshold; mark them on a plot and, once you have computed the pairs for all thresholds, draw a line through them.
Also, for linear binary classifiers you can think of this varying process as choosing the distance between the decision line and the positive (or negative, if you prefer) class cluster. If you move the decision line away from the positive class, you will classify more objects as positive (because you increased the positive-class space), and at the same time you increase the FPR (because the negative-class space decreased).
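The threshold-sweeping procedure above can be sketched directly; this uses a synthetic dataset and a plain loop rather than sklearn's roc_curve, to make the mechanics explicit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# sweep the threshold from 0 to 1 with a fixed step, collecting (FPR, TPR) pairs
points = []
for t in np.linspace(0, 1, 101):
    pred = (probs >= t).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    tpr = tp / np.sum(y == 1)  # sensitivity at this threshold
    fpr = fp / np.sum(y == 0)  # false positive rate at this threshold
    points.append((fpr, tpr))
# plotting these points (e.g. with matplotlib) traces the ROC curve
```

At threshold 0 everything is predicted positive, giving the (1, 1) corner of the curve; at threshold 1 almost nothing is, giving the opposite corner.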

How can I make Weka classify the smaller class, with a 2:1 class imbalance?

How can I make Weka classify the smaller class? I have a data set where the positive class is 35% of the data and the negative class is 65%. I want Weka to predict the positive class, but in some cases the resulting model predicts all instances to be the negative class. Regardless, it keeps classifying everything as the negative (larger) class. How can I force it to classify the positive (smaller) class?
One simple solution is to adjust your training set to be more balanced (50% positive, 50% negative) to encourage classification for both cases. I would guess that more of your cases are negative in the problem space, and therefore you would need to find some way to ensure that the negative cases still represent the problem well.
Since the ratio of positive to negative is 1:2, you could also try duplicating the positive cases in the training set to make it 2:2 and see how that goes.
Use stratified sampling (e.g. train on a 50%/50% sample) or class weights/class priors. It would also help greatly if you told us which specific classifier you are using; Weka seems to have at least 50.
Is the penalty for Type I errors equal to the penalty for Type II errors?
This is a special case of the receiver operating characteristic (ROC) curve.
If the penalties are not equal, experiment with the cutoff value and the AUC.
You probably also want to read the sister site Cross Validated for statistics.
Use CostSensitiveClassifier, which is available under the "meta" classifiers.
You will need to change "classifier" to your J48 and (!) change the cost matrix
to be like [(0,1), (2,0)]. This tells J48 that misclassification of a positive instance is twice as costly as misclassification of a negative instance. Of course, you should adjust your cost matrix according to your business values.
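The Weka approach above is Java-based, but the same cost idea can be sketched in scikit-learn via the class_weight parameter; this is an analogue for illustration, not Weka's CostSensitiveClassifier itself, and the dataset is synthetic with the question's 35/65 imbalance:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# 65% negative, 35% positive, mimicking the imbalance in the question
X, y = make_classification(n_samples=1000, weights=[0.65, 0.35], random_state=0)

# class_weight plays the role of the cost matrix: errors on the positive
# class (1) count twice as much as errors on the negative class (0)
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 2}, random_state=0)
tree.fit(X, y)
```

Decision trees here stand in for J48 (Weka's C4.5 implementation); any scikit-learn classifier that accepts class_weight can be used the same way.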

ROC curves cross validation

How to generate a ROC curve for a cross validation?
For a single test I think I should threshold the classification scores of SVM to generate the ROC curve.
But I am unclear about how to generate it for a cross validation?
After a complete round of cross-validation, all observations have been classified once (although by different models) and have been given an estimated probability of belonging to the class of interest, or a similar statistic. These probabilities can be used to generate a ROC curve in exactly the same way as probabilities obtained on an external test set. Just calculate the class-wise error rates as you vary the classification threshold from 0 to 1 and you are all set.
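A sketch of this pooled approach with scikit-learn: cross_val_predict returns one out-of-fold probability per observation, which can then be fed to roc_curve like any test-set score. The dataset is synthetic, standing in for the asker's data:

```python
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)

# after a full CV round, every observation has exactly one probability,
# produced by the model that did NOT see it during training
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
probs = cross_val_predict(SVC(probability=True, random_state=0), X, y,
                          cv=cv, method="predict_proba")[:, 1]

# pooled ROC curve over all out-of-fold predictions
fpr, tpr, _ = roc_curve(y, probs)
print(roc_auc_score(y, probs))
```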
However, typically you would like to perform more than one round of cross-validation, as the performance varies depending on how the folds are divided. It is not obvious to me how to calculate the mean ROC curve of all rounds. I suggest plotting them all and calculating the mean AUC.
As a follow-up to Backlin:
The variation in the results for different runs of k-fold or leave-n-out cross-validation shows the instability of the models. This is valuable information.
Of course you can pool the results and just generate one ROC curve.
But you can also plot the set of curves (see e.g. the R package ROCR), or calculate e.g. the median and IQR at different thresholds and construct a band depicting these variations.
Here's an example: the shaded areas are the interquartile ranges observed over 125 iterations of 8-fold cross validation. The thin black areas contain half of the observed specificity-sensitivity pairs for one particular threshold, with the median marked by x (ignore the + marks).
