In Classification, what is the difference between the test accuracy and the AUC score?

I am working on a classification-based project, and I am evaluating different ML models based on their training accuracy, testing accuracy, confusion matrix, and AUC score. I am now stuck on understanding the difference between the score I get by calculating the accuracy of a model on the test set (X_test) and the AUC score.
If I am correct, both metrics measure how well a model is able to predict the correct class of previously unseen data. I also understand that for both, the higher the number the better, as long as the model is not over-fit or under-fit.
Assuming a ML model is neither over-fit nor under-fit, what is the difference between test accuracy score and the AUC score?
I don't have a background in math and stats, and I pivoted towards data science from a business background. I would therefore appreciate an explanation that a business person can understand.

Both terms quantify the quality of a classification model; however, the accuracy quantifies a single manifestation of the variables, which means it describes a single confusion matrix. The AUC (area under the curve) represents the trade-off between the true-positive rate (tpr) and the false-positive rate (fpr) across multiple confusion matrices, generated for different fpr values of the same classifier.
A confusion matrix is of the form:

                  predicted positive    predicted negative
actual positive          tp                    fn
actual negative          fp                    tn
1) The accuracy is a measure for a single confusion matrix and is defined as:

accuracy = (tp + tn) / (tp + tn + fp + fn)
where tp=true-positives, tn=true-negatives, fp=false-positives and fn=false-negatives (the amount of each).
2) The AUC measures the area under the ROC (receiver operating characteristic) curve, that is, the trade-off curve between the true-positive rate and the false-positive rate. For each choice of the false-positive rate (fpr) threshold, the true-positive rate (tpr) is determined. I.e., for a given classifier an fpr of 0, 0.1, 0.2 and so forth is accepted, and for each fpr its corresponding tpr is evaluated. Therefore, you get a function tpr(fpr) that maps the interval [0,1] onto the same interval, because both rates are defined on those intervals. The area under this curve is called the AUC; it lies between 0 and 1, whereby a random classification is expected to yield an AUC of 0.5.
The AUC, as it is the area under the curve, is defined as:

AUC = \int_0^1 tpr(fpr) d(fpr)
However, in real (and finite) applications, the ROC is a step function and the AUC is determined by a weighted sum of these steps.
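As an illustrative sketch of that step function and its area (toy labels and scores, not from any particular model), using scikit-learn and numpy:

import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Toy ground-truth labels and classifier scores (illustrative only)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# roc_curve returns the corner points of the ROC step function
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# auc() integrates the step function (trapezoidal rule), i.e. the
# weighted sum of the steps mentioned above
print(auc(fpr, tpr))                   # ~0.89
print(roc_auc_score(y_true, y_score))  # identical shortcut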


Getting a low ROC AUC score but a high accuracy

I am using the LogisticRegression class in scikit-learn on a version of the flight delay dataset.
I use pandas to select some columns:
import pandas as pd  # needed for get_dummies() below

df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]
I fill in NaN values with 0:
df = df.fillna({'ARR_DEL15': 0})
Make sure the categorical columns are marked with the 'category' data type:
df["ORIGIN"] = df["ORIGIN"].astype('category')
df["DEST"] = df["DEST"].astype('category')
Then call get_dummies() from pandas:
df = pd.get_dummies(df)
Now I train and test my data set:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

lr = LogisticRegression()
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
train_set_x = train_set.drop('ARR_DEL15', axis=1)
train_set_y = train_set["ARR_DEL15"]
test_set_x = test_set.drop('ARR_DEL15', axis=1)
test_set_y = test_set["ARR_DEL15"]
lr.fit(train_set_x, train_set_y)
Once I call the score method I get around 0.867. However, when I call the roc_auc_score method I get a much lower number, around 0.583:
from sklearn.metrics import roc_auc_score

probabilities = lr.predict_proba(test_set_x)
roc_auc_score(test_set_y, probabilities[:, 1])
Is there any reason why the ROC AUC is much lower than what the score method provides?
To start with, saying that an AUC of 0.583 is "lower" than a score* of 0.867 is exactly like comparing apples with oranges.
[* I assume your score is mean accuracy, but this is not critical for this discussion - it could be anything else in principle]
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) practice is to use it just like any other the-higher-the-better metric, such as accuracy, which naturally leads to puzzles like the one you describe.
The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
The (decision) threshold in binary classification is the value above which we decide to label a sample as 1 (recall that probabilistic classifiers actually return a value p in [0, 1], usually interpreted as a probability - in scikit-learn it is what predict_proba returns).
Now, this threshold, in methods like scikit-learn's predict which return labels (1/0), is set to 0.5 by default, but this is not the only possibility, and it may not even be desirable in some cases (imbalanced data, for example).
The point to take home is that:

- when you ask for score (which under the hood uses predict, i.e. labels and not probabilities), you have also implicitly set this threshold to 0.5
- when you ask for AUC (which, in contrast, uses the probabilities returned by predict_proba), no threshold is involved, and you get (something like) the accuracy averaged across all possible thresholds
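As a minimal sketch of this difference (assuming the lr, test_set_x, and test_set_y objects from the question's code above):

from sklearn.metrics import accuracy_score, roc_auc_score

probabilities = lr.predict_proba(test_set_x)[:, 1]

# score() labels each sample via an implicit 0.5 threshold:
hard_labels = (probabilities >= 0.5).astype(int)
print(accuracy_score(test_set_y, hard_labels))  # ~ lr.score(test_set_x, test_set_y)

# AUC uses the probabilities directly - no threshold involved:
print(roc_auc_score(test_set_y, probabilities))

# Nothing stops you from deploying a different threshold, e.g. 0.3:
print(accuracy_score(test_set_y, (probabilities >= 0.3).astype(int)))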
Given these clarifications, your particular example provides a very interesting case in point:
I get a good-enough accuracy ~ 87% with my model; should I care that, according to an AUC of 0.58, my classifier does only slightly better than mere random guessing?
Provided that the class representation in your data is reasonably balanced, the answer by now should hopefully be obvious: no, you should not care. For all practical cases, what you care about is a classifier deployed with a specific threshold, and what this classifier does in a purely theoretical and abstract situation, when averaged across all possible thresholds, should be of very little interest to a practitioner (it is of interest to a researcher coming up with a new algorithm, but I assume that this is not your case).
(For imbalanced data, the argument changes; accuracy here is practically useless, and you should consider precision, recall, and the confusion matrix instead).
For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:
Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.
[...]
One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system.
Emphasis mine - see also On the dangers of AUC...
I don't know what exactly ARR_DEL15 is, which you use as your label (it is not in the original data). My guess is that it is an imbalanced feature, i.e. there are many more 0's than 1's; in such a case, accuracy as a metric is not meaningful, and you should use precision, recall, and the confusion matrix instead - see also this thread.
Just as an extreme example, if 87% of your labels are 0's, you can have an 87% accuracy "classifier" simply (and naively) by classifying all samples as 0; in such a case, you would also have a low AUC (fairly close to 0.5, as in your case).
For a more general (and much needed, in my opinion) discussion of what exactly AUC is, see my other answer.

Why do we want to maximize AUC in classification problems?

I wonder why our objective is to maximize AUC, when maximizing accuracy would seem to yield the same result.
I would think that, along with the primary goal of maximizing accuracy, the AUC will automatically be large.
I guess we use AUC because it explains how well our method is able to separate the data independently of a threshold.
For some applications, we don't want to have false positives or negatives. And when we use accuracy, we already make an a priori assumption about the best threshold to separate the data, regardless of specificity and sensitivity.
In binary classification, accuracy is a performance metric of a single model for a certain threshold and the AUC (Area under ROC curve) is a performance metric of a series of models for a series of thresholds.
Thanks to this question, I have learnt quite a bit about AUC and accuracy comparisons. I don't think there is a correlation between the two, and I believe this is still an open problem. At the end of this answer, I have added some links that I think would be useful.
One scenario where accuracy fails:
Example Problem
Let's consider a binary classification problem where you evaluate the performance of your model on a data set of 100 samples (98 of class 0 and 2 of class 1).
Take out your sophisticated machine learning model and replace the whole thing with a dumb system that always outputs 0 for whatever input it receives.
What is the accuracy now?
Accuracy = Correct predictions/Total predictions = 98/100 = 0.98
We got a stunning 98% accuracy on the "Always 0" system.
Now you convert your system to a cancer diagnosis system and start predicting (0 - no cancer, 1 - cancer) on a set of patients. Assuming there will be only a few cases that correspond to class 1, you will still achieve a high accuracy.
Despite having a high accuracy, what is the point of the system if it fails to do well on class 1 (identifying patients with cancer)?
This observation suggests that accuracy is not a good evaluation metric for every type of machine learning problem. The above is known as an imbalanced class problem, and there are plenty of practical problems of this nature.
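A quick sketch of the example above in code (synthetic labels, illustrative only):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 98 samples of class 0 and 2 of class 1, as above
y_true = np.array([0] * 98 + [1] * 2)

# The "Always 0" system: constant labels and constant scores
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))        # 0.98 - looks stunning
print(roc_auc_score(y_true, np.zeros(100)))  # 0.5  - no discrimination at all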
As for the comparison of accuracy and AUC, here are some links I think would be useful:
An introduction to ROC analysis
Area under curve of ROC vs. overall accuracy
Why is AUC higher for a classifier that is less accurate than for one that is more accurate?
What does AUC stand for and what is it?
Understanding ROC curve
ROC vs. Accuracy vs. AROC

How to deal with this unbalanced-class skewed data-set?

I have to deal with a class imbalance problem and do a binary classification of the input test data-set, where the majority class-label in the training data-set is 1 (the other class-label is 0).
For example, following is some part of the training data :
93.65034,94.50283,94.6677,94.20174,94.93986,95.21071,1
94.13783,94.61797,94.50526,95.66091,95.99478,95.12608,1
94.0238,93.95445,94.77115,94.65469,95.08566,94.97906,1
94.36343,94.32839,95.33167,95.24738,94.57213,95.05634,1
94.5774,93.92291,94.96261,95.40926,95.97659,95.17691,0
93.76617,94.27253,94.38002,94.28448,94.19957,94.98924,0
where the last column is the class-label, 0 or 1. The actual data-set is very skewed, with roughly a 10:1 ratio of classes: around 700 samples have 0 as their class label, while the remaining 6800 have 1.
The above are only a few of all the samples in the given data-set, but the actual data-set contains about 90% of samples with class-label 1, and the rest with class-label 0, despite the fact that more or less all the samples are very similar.
Which classifier would be best for handling this kind of data-set?
I have already tried logistic-regression as well as svm with class-weight parameter set as "balanced", but got no significant improvement in accuracy.
"but got no significant improvement in accuracy."
Accuracy isn't the way to go (e.g. see the Accuracy paradox). With a 10:1 ratio of classes you can easily get 90% accuracy just by always predicting the majority class (label 1 in your case).
Some good starting points are:

- try a different performance metric, e.g. the F1-score and the Matthews correlation coefficient
- "resample" the dataset: add examples from the under-represented class (over-sampling) or delete instances from the over-represented class (under-sampling; you should have a lot of data)
- a different point of view: anomaly detection is a good try for an imbalanced dataset
- a different algorithm is another possibility, but not a silver bullet. You could start with decision trees (they often perform well on imbalanced datasets)
EDIT (now knowing you're using scikit-learn)
The weights from the class_weight parameter (scikit-learn) are used to train the classifier (so balanced is fine), but accuracy is a poor choice for judging how well it is performing.
The sklearn.metrics module implements several loss, score and utility functions to measure classification performance. Also take a look at How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?.
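For example, a minimal sketch with placeholder labels (substitute your own predictions):

from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef

# Placeholder ground truth and predictions, imbalanced like your data:
# class 1 is the majority, and the model naively always predicts 1
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(confusion_matrix(y_true, y_pred))             # rows = true class, columns = predicted
print(f1_score(y_true, y_pred, pos_label=0))        # 0.0 - F1 on the minority class exposes it
print(matthews_corrcoef(y_true, y_pred))            # 0.0 - likewise robust to the imbalance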
Have you tried plotting the ROC curve and computing the AUC to check your parameters and different thresholds? If not, that should give you a good starting point.

Reason of having high AUC and low accuracy in a balanced dataset

Given a balanced dataset (both classes are the same size), fitting an SVM model to it I get a high AUC value (~0.9) but a low accuracy (~0.5).
I have no idea why this would happen; can anyone explain this case for me?
The ROC curve is biased towards the positive class. The described situation, with a high AUC and low accuracy, can occur when your classifier achieves good performance on the positive class (high AUC), at the cost of a high false negative rate (or a low number of true negatives).
The question of why the training process resulted in a classifier with poor predictive performance is very specific to your problem/data and the classification methods used.
The ROC analysis tells you how well the samples of the positive class can be separated from the other class, while the prediction accuracy hints at the actual performance of your classifier.
About ROC analysis
The general context for ROC analysis is binary classification, where a classifier assigns elements of a set into two groups. The two classes are usually referred to as "positive" and "negative". Here, we assume that the classifier can be reduced to the following functional behavior:
def classifier(observation, t):
    # score_function measures the affinity of the observation
    # to the "positive" class; t is the decision threshold
    if score_function(observation) <= t:
        return "negative"
    else:
        return "positive"
The core of a classifier is the scoring function that converts observations into a numeric value measuring the affinity of the observation to the positive class. Here, the scoring function incorporates the set of rules, the mathematical functions, the weights and parameters, and all the ingenuity that makes a good classifier. For example, in logistic regression classification, one possible choice for the scoring function is the logistic function that estimates the probability p(x) of an observation x belonging to the positive class.
In a final step, the classifier converts the computed score into a binary class assignment by comparing the score against a decision threshold (or prediction cutoff) t.
Given the classifier and a fixed decision threshold t, we can compute actual class predictions y_p for given observations x. To assess the capability of a classifier, the class predictions y_p are compared with the true class labels y_t of a validation dataset. If y_p and y_t match, we refer to them as true positives (TP) or true negatives (TN), depending on the value of y_p and y_t; or as false positives (FP) or false negatives (FN) if y_p and y_t do not match.
We can apply this to the entire validation dataset and count the total number of TPs, TNs, FPs and FNs, as well as the true positive rate (TPR) and false positive rate (FPR), which are defined as follows:
TPR = TP / P = TP / (TP+FN) = number of true positives / number of positives
FPR = FP / N = FP / (FP+TN) = number of false positives / number of negatives
Note that the TPR is often referred to as the sensitivity, and the FPR is equivalent to 1 - specificity.
In comparison, the accuracy is defined as the ratio of all correctly labeled cases and the total number of cases:
accuracy = (TP+TN)/(Total number of cases) = (TP+TN)/(TP+FP+TN+FN)
Given a classifier and a validation dataset, we can evaluate the true positive rate TPR(t) and false positive rate FPR(t) for varying decision thresholds t. And here we are: plotting TPR(t) against FPR(t) yields the receiver operating characteristic (ROC) curve. Sample ROC curves can be plotted in Python, e.g. with roc-utils* (see also the sketch at the end of this answer).
Think of the decision threshold t as a final free parameter that can be tuned at the end of the training process. The ROC analysis offers means to find an optimal cutoff t* (e.g., Youden index, concordance, distance from optimal point).
Furthermore, we can examine with the ROC curve how well the classifier can discriminate between samples from the "positive" and the "negative" class:
Try to understand how the FPR and TPR change for increasing values of t. In the first extreme case (with some very small value for t), all samples are classified as "positive". Hence, there are no true negatives (TN=0), and thus FPR=TPR=1. By increasing t, both FPR and TPR gradually decrease, until we reach the second extreme case, where all samples are classified as negative, and none as positive: TP=FP=0, and thus FPR=TPR=0. In this process, we start in the top right corner of the ROC curve and gradually move to the bottom left.
In the case where the scoring function is able to separate the samples perfectly, leading to a perfect classifier, the ROC curve passes through the optimal point FPR(t)=0 and TPR(t)=1. In the other extreme case, where the distributions of scores coincide for both classes, resulting in a random coin-flipping classifier, the ROC curve travels along the diagonal.
Unfortunately, it is very unlikely that we can find a perfect classifier that reaches the optimal point (0,1) in the ROC curve. But we can try to get as close to it as possible.
The AUC, or the area under the ROC curve, tries to capture this characteristic. It is a measure of how well a classifier can discriminate between the two classes, and it varies between 0 and 1. In the case of a perfect classifier, the AUC is 1. A classifier that assigns a random class label to input data would yield an AUC of 0.5.
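As a minimal sketch of such a ROC analysis with scikit-learn (synthetic scores; the plot is an illustrative stand-in for the original roc-utils figures):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(seed=0)

# Synthetic scoring-function outputs: positives score higher on average
y_true = np.array([0] * 500 + [1] * 500)
y_score = np.concatenate([rng.normal(0.0, 1.0, 500),   # "negative" class
                          rng.normal(1.5, 1.0, 500)])  # "positive" class

# FPR(t) and TPR(t) over all decision thresholds t
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, label="AUC = %.2f" % roc_auc_score(y_true, y_score))
plt.plot([0, 1], [0, 1], "--", label="random classifier (AUC = 0.5)")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
plt.show()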
* Disclaimer: I'm the author of roc-utils
I guess you are misreading the correct class when calculating the ROC curve... That would explain the low accuracy and the high (wrongly calculated) AUC.
It is easy to see that AUC can be misleading when used to compare two classifiers if their ROC curves cross. Classifier A may produce a higher AUC than B, while B performs better for a majority of the thresholds with which you may actually use the classifier. And in fact empirical studies have shown that it is indeed very common for ROC curves of common classifiers to cross. There are also deeper reasons why AUC is incoherent and therefore an inappropriate measure (see references below).
http://sandeeptata.blogspot.com/2015/04/on-dangers-of-auc.html
Another simple explanation for this behaviour is that your model is actually very good - it is just the final threshold used to binarize the predictions that is bad.
I came across this problem with a convolutional neural network on a binary image classification task. Consider, e.g., that you have 4 samples with labels 0, 0, 1, 1. Let's say your model produces continuous predictions for these four samples like so: 0.7, 0.75, 0.9 and 0.95.
We would consider this to be a good model, since high values (> 0.8) predict class 1 and low values (< 0.8) predict class 0; hence, the ROC-AUC would be 1. Note how I used a threshold of 0.8. However, if you use a fixed and badly-chosen threshold for these predictions, say 0.5 (which is what we sometimes force upon our model output), then all 4 sample predictions would be class 1, which leads to an accuracy of 50%.
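You can verify this tiny example directly with scikit-learn:

from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.7, 0.75, 0.9, 0.95]

# Threshold-free: the scores rank the two classes perfectly
print(roc_auc_score(y_true, y_score))                       # 1.0

# Fixed, badly-chosen threshold of 0.5: all samples become class 1
print(accuracy_score(y_true, [s >= 0.5 for s in y_score]))  # 0.5

# The well-chosen threshold of 0.8 recovers perfect accuracy
print(accuracy_score(y_true, [s >= 0.8 for s in y_score]))  # 1.0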
Note that most models optimize not for accuracy, but for some sort of loss function. In my CNN, training for just a few epochs longer solved the problem.
Make sure that you know what you are doing when you transform a continuous model output into a binary prediction. If you do not know what threshold to use for a given ROC curve, have a look at Youden's index or find the threshold value that represents the "most top-left" point in your ROC curve.
If this is happening every single time, maybe your model is not right. Starting with the kernel, you may need to change it and try the model with new sets. Look at the confusion matrix every time and check the TN and TP areas: the model may be inadequate at detecting one of them.

How to get scores along with Machine Learning results?

I am new to machine learning and I am currently working on a classification problem. I am able to train the model and predict on test data sets. I want to know whether there is some way by which I can get scores along with the prediction. By scores, I mean a proximity or confidence score accompanying the prediction. For example, in the standard age-salary-buy classification problem (predicting whether the customer will buy the product based on age and salary), I want to know a score out of 100 that the customer will buy the product, in addition to the prediction of whether they will buy it or not.
Currently, I am using the LibSVM algorithm. Is there some algorithm which provides me with the above data?
Thanks.
What you are looking for is a measure of support for the decision. In other words, many classifiers base their decision about the class of x over labels Y on:
cl(x) = arg max_{y \in Y} p(y|x)
where p(y|x) is their internal estimation of "x having label y". And such classifiers include:
neural networks (with sigmoid output)
logistic regression
naive bayes
voting ensembles (such as RF)
...
These methods can easily be converted to your 0-100 scale, as probability is on a 0-1 scale.
Some, on the other hand, use a measure proportional to probability (such as SVM), but unbounded; here you can get this value (often called the decision function), but you cannot convert it to a 0-100 score (as you do not have a "maximum" value). This is a big drawback, so some modifications have been proposed. In particular, for SVM you have Platt's scaling, which actually fits a logistic regression on top of the SVM, so you get a probability estimate. In libSVM you can set -b to get probability estimates.
From the libSVM website:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
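If you happen to use scikit-learn (whose SVC is built on libSVM), the equivalent switch is probability=True, which fits Platt's scaling during training; a minimal sketch on toy data:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data standing in for the age-salary-buy problem
X, y = make_classification(n_samples=200, random_state=0)

# probability=True is the analogue of libSVM's "-b 1" (Platt's scaling)
clf = SVC(probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])[:, 1]  # calibrated probabilities in [0, 1]
print((proba * 100).round(1))           # the 0-100 score you asked about

# The raw, unbounded SVM decision function is also available:
print(clf.decision_function(X[:5]))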
