Why do we want to maximize AUC in classification problems? - machine-learning

I wonder why our objective is to maximize AUC when maximizing accuracy yields the same result.
I would think that if we pursue the primary goal of maximizing accuracy, AUC will automatically be large as well.

I guess we use AUC because it explains how well our method is able to separate the data independently of a threshold.
For some applications, we don't want any false positives or false negatives. And when we use accuracy, we have already committed a priori to one threshold for separating the data, regardless of specificity and sensitivity.

In binary classification, accuracy is a performance metric of a single model at one particular threshold, whereas the AUC (area under the ROC curve) is a performance metric of a series of models obtained over a series of thresholds.
Thanks to this question, I have learnt quite a bit on AUC and accuracy comparisons. I don't think that there's a correlation between the two and I think this is still an open problem. At the end of this answer, I've added some links like these that I think would be useful.
One scenario where accuracy fails:
Example Problem
Let's consider a binary classification problem where you evaluate the performance of your model on a data set of 100 samples (98 of class 0 and 2 of class 1).
Take out your sophisticated machine learning model and replace the whole thing with a dumb system that always outputs 0, whatever input it receives.
What is the accuracy now?
Accuracy = Correct predictions/Total predictions = 98/100 = 0.98
We got a stunning 98% accuracy on the "Always 0" system.
Now you convert your system to a cancer diagnosis system and start predicting (0 - No cancer, 1 - Cancer) on a set of patients. Assuming there are only a few cases that correspond to class 1, you will still achieve a high accuracy.
Despite having a high accuracy, what is the point of the system if it fails to do well on the class 1 (Identifying patients with cancer)?
This observation suggests that accuracy is not a good evaluation metric for every type of machine learning problem. The above is known as an imbalanced class problem, and there are plenty of practical problems of this nature.
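The arithmetic of the "Always 0" example can be reproduced in a few lines. This is only a minimal sketch in plain Python: the helper name confusion_counts is hypothetical, and the data are the 98/2 toy split described above. It also reports recall on class 1, which is where the failure shows up.

```python
# Toy data from the example: 98 samples of class 0, 2 of class 1.
y_true = [0] * 98 + [1] * 2
# The "Always 0" system ignores its input entirely.
y_pred = [0] * len(y_true)

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating class 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # 98 correct out of 100
recall = tp / (tp + fn)             # share of class-1 cases found: 0 out of 2

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.98
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks excellent while the recall on class 1 is zero, which is exactly the cancer-diagnosis failure described above.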
As for the comparison of accuracy and AUC, here are some links I think would be useful:
An introduction to ROC analysis
Area under curve of ROC vs. overall accuracy
Why is AUC higher for a classifier that is less accurate than for one that is more accurate?
What does AUC stand for and what is it?
Understanding ROC curve
ROC vs. Accuracy vs. AROC

Related

Cross-entropy loss influence over F-score

I'm training an FCN (Fully Convolutional Network) and using "Sigmoid Cross Entropy" as a loss function.
My evaluation metrics are F-measure and MAE.
The Train/Dev Loss w.r.t #iteration graph is something like the below:
Although the Dev loss increases slightly after #iter = 2200, my metrics on the Dev set keep improving up to around #iter = 10000. Is this possible in machine learning at all? If the F-measure has improved, shouldn't the loss have decreased as well? How do you explain it?
Every answer would be appreciated.
Short answer, yes it's possible.
I would explain it by reasoning about the cross-entropy loss and how it differs from the metrics. Loss functions for classification, generally speaking, are used to optimize models and rely on predicted probabilities (e.g. 0.1/0.9), while metrics usually use the predicted labels (0/1).
When the model is highly confident (predicted probability close to 0 or to 1), a single wrong prediction greatly increases the loss but causes only a small decrease in F-measure.
Likewise, in the opposite scenario, a change in a low-confidence prediction (e.g. from 0.49 to 0.51) has a small numerical impact on the loss function but flips the predicted label, and so has a greater impact on the metrics.
Plotting the distribution of your predictions would help to confirm this hypothesis.
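A small numeric illustration of this effect, as a sketch only: the two sets of probabilities below are made-up stand-ins for an earlier and a later checkpoint, and the helper names binary_cross_entropy and f_measure are hypothetical. The later checkpoint gets more labels right, yet a single confidently wrong prediction pushes its loss higher.

```python
import math

def binary_cross_entropy(y_true, probs):
    """Mean sigmoid cross-entropy (log loss) over all samples."""
    total = 0.0
    for t, p in zip(y_true, probs):
        total += -math.log(p) if t == 1 else -math.log(1.0 - p)
    return total / len(y_true)

def f_measure(y_true, probs, threshold=0.5):
    """F1 on hard labels obtained by thresholding the probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# Earlier checkpoint (made-up): hesitant, two positives sit just under 0.5.
probs_early = [0.45, 0.45, 0.55, 0.55, 0.45, 0.45, 0.45, 0.45]
# Later checkpoint (made-up): confident, but one positive is confidently missed.
probs_late = [0.90, 0.90, 0.90, 0.001, 0.10, 0.10, 0.10, 0.10]

loss_early = binary_cross_entropy(y_true, probs_early)
loss_late = binary_cross_entropy(y_true, probs_late)
f1_early = f_measure(y_true, probs_early)
f1_late = f_measure(y_true, probs_late)

print(f"early: loss = {loss_early:.3f}, F1 = {f1_early:.3f}")
print(f"late:  loss = {loss_late:.3f}, F1 = {f1_late:.3f}")
```

Both the loss and the F-measure go up between the two checkpoints, which is the pattern described in the question.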

Percent difference between accuracy of two machine learning models

I have trained two machine learning models. Both have slightly different accuracies.
Model-A Accuracy = 0.78 or 78%
Model-B Accuracy = 0.80 or 80%
Can I infer from the above results that Model-B is 2% better than Model-A?
The answer depends on how you evaluate the models, and on target distribution.
Metric
If the distribution of classes is not balanced, accuracy might not be as useful for describing the generalization error. Use ROC AUC or F1-score instead.
Evaluation process
Cross-validation will give you a more robust estimate of the evaluation metric than hold-out validation. Stratified cross-validation is even better for an unbalanced dataset.
If you're confident in your validation method, then yes, you can interpret the results in the way you described: Model-B is 2 percentage points more accurate than Model-A.
It's still only an estimate, after all. You can use bootstrapping to estimate confidence intervals, choose a significance threshold, and infer whether the difference is statistically significant.
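One way the bootstrapping step could look is sketched below. Everything in it is an assumption made for illustration: the test set size of 500, the per-sample correctness vectors (constructed to give exactly 78% and 80%), and the choice of a paired percentile bootstrap with 2000 resamples.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical per-sample outcomes on one shared test set (1 = correct).
n = 500
correct_a = [1] * 390 + [0] * 110  # Model-A: 78% accuracy
correct_b = [1] * 400 + [0] * 100  # Model-B: 80% accuracy
rng.shuffle(correct_a)
rng.shuffle(correct_b)

observed = (sum(correct_b) - sum(correct_a)) / n

# Paired bootstrap: resample test-set indices with replacement and
# recompute the accuracy difference on each resample.
diffs = []
for _ in range(2000):
    idx = [rng.randrange(n) for _ in range(n)]
    diffs.append(sum(correct_b[i] - correct_a[i] for i in idx) / n)

diffs.sort()
lower = diffs[int(0.025 * len(diffs))]      # 2.5th percentile
upper = diffs[int(0.975 * len(diffs)) - 1]  # 97.5th percentile

print(f"observed difference: {observed:.3f}")
print(f"95% bootstrap interval: [{lower:.3f}, {upper:.3f}]")
```

If the resulting interval contains 0, the 2-point gap is not distinguishable from noise at that level on a test set of this size. With real models you would resample the actual per-sample predictions rather than synthetic vectors.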

Getting a low ROC AUC score but a high accuracy

I am using the LogisticRegression class in scikit-learn on a version of the flight delay dataset.
I use pandas to select some columns:
df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]
I fill in NaN values with 0:
df = df.fillna({'ARR_DEL15': 0})
Make sure the categorical columns are marked with the 'category' data type:
df["ORIGIN"] = df["ORIGIN"].astype('category')
df["DEST"] = df["DEST"].astype('category')
Then call get_dummies() from pandas:
df = pd.get_dummies(df)
Now I train and test my data set:
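from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
lr = LogisticRegression()
# train_test_split returns the training portion first, then the test portion
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)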
train_set_x = train_set.drop('ARR_DEL15', axis=1)
train_set_y = train_set["ARR_DEL15"]
test_set_x = test_set.drop('ARR_DEL15', axis=1)
test_set_y = test_set["ARR_DEL15"]
lr.fit(train_set_x, train_set_y)
Once I call the score method I get around 0.867. However, when I call the roc_auc_score method I get a much lower number of around 0.583:
from sklearn.metrics import roc_auc_score
probabilities = lr.predict_proba(test_set_x)
roc_auc_score(test_set_y, probabilities[:, 1])
Is there any reason why the ROC AUC is much lower than what the score method provides?
To start with, saying that an AUC of 0.583 is "lower" than a score* of 0.867 is exactly like comparing apples with oranges.
[* I assume your score is mean accuracy, but this is not critical for this discussion - it could be anything else in principle]
In my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) practice is to use it just like any other the-higher-the-better metric, such as accuracy, which may naturally lead to puzzles like the one you describe.
The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
The (decision) threshold in binary classification is the value above which we decide to label a sample as 1 (recall that probabilistic classifiers actually return a value p in [0, 1], usually interpreted as a probability - in scikit-learn it is what predict_proba returns).
Now, this threshold, in methods like scikit-learn's predict which return labels (1/0), is set to 0.5 by default, but this is not the only possibility, and it may not even be desirable in some cases (imbalanced data, for example).
The point to take home is that:
when you ask for score (which under the hood uses predict, i.e. labels and not probabilities), you have also implicitly set this threshold to 0.5
when you ask for AUC (which, in contrast, uses probabilities returned with predict_proba), no threshold is involved, and you get (something like) the accuracy averaged across all possible thresholds
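To make the contrast between these two points concrete, here is a minimal sketch in plain Python. The labels and scores are made up (the scores stand in for what predict_proba would return for class 1), and the helper names accuracy_at and roc_auc are hypothetical. The AUC is computed through a standard equivalent formulation of the area under the ROC curve: the share of (positive, negative) pairs in which the positive sample gets the higher score.

```python
# Made-up labels and scores; the scores stand in for predict_proba(...)[:, 1].
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.80]

def accuracy_at(threshold, y_true, scores):
    """Accuracy when samples scoring at or above `threshold` are labelled 1."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Accuracy changes with the chosen threshold...
for threshold in (0.1, 0.33, 0.5):
    print(f"threshold {threshold}: accuracy = "
          f"{accuracy_at(threshold, y_true, scores):.3f}")

# ...while the AUC is computed from the scores alone, with no threshold.
print(f"AUC = {roc_auc(y_true, scores):.3f}")
```

On this toy data the default threshold of 0.5 gives an accuracy of 0.75, a threshold of 0.33 gives 0.875, and the AUC is 14/15 regardless of which threshold is later used for deployment.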
Given these clarifications, your particular example provides a very interesting case in point:
I get a good-enough accuracy ~ 87% with my model; should I care that, according to an AUC of 0.58, my classifier does only slightly better than mere random guessing?
Provided that the class representation in your data is reasonably balanced, the answer by now should hopefully be obvious: no, you should not care. For all practical cases, what you care about is a classifier deployed with a specific threshold, and what this classifier does in a purely theoretical and abstract situation when averaged across all possible thresholds should be of very little interest to a practitioner (it is of interest to a researcher coming up with a new algorithm, but I assume that this is not your case).
(For imbalanced data, the argument changes; accuracy here is practically useless, and you should consider precision, recall, and the confusion matrix instead).
For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:
Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.
[...]
One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system
Emphasis mine - see also On the dangers of AUC...
I don't know what exactly ARR_DEL15 is, which you use as your label (it is not in the original data). My guess is that it is an imbalanced label, i.e. there are many more 0's than 1's; in such a case, accuracy as a metric is not meaningful, and you should use precision, recall, and the confusion matrix instead (see also this thread).
Just as an extreme example, if 87% of your labels are 0's, you can have an 87% accuracy "classifier" simply (and naively) by classifying all samples as 0; in such a case, you would also have a low AUC (fairly close to 0.5, as in your case).
For a more general (and much needed, in my opinion) discussion of what exactly AUC is, see my other answer.

How to deal with this unbalanced-class skewed data-set?

I have to deal with a class imbalance problem and do binary classification of an input test data-set, where the majority class label in the training data-set is 1 (the other class label is 0).
For example, the following is part of the training data:
93.65034,94.50283,94.6677,94.20174,94.93986,95.21071,1
94.13783,94.61797,94.50526,95.66091,95.99478,95.12608,1
94.0238,93.95445,94.77115,94.65469,95.08566,94.97906,1
94.36343,94.32839,95.33167,95.24738,94.57213,95.05634,1
94.5774,93.92291,94.96261,95.40926,95.97659,95.17691,0
93.76617,94.27253,94.38002,94.28448,94.19957,94.98924,0
where the last column is the class label (0 or 1). The actual data-set is very skewed, with roughly a 10:1 ratio of classes: around 700 samples have 0 as their class label, while the remaining 6800 have 1.
The rows above are only a few of the samples in the data-set; about 90% of the samples have class label 1 and the rest have class label 0, even though nearly all the samples look very similar.
Which classifier would be best for handling this kind of data-set?
I have already tried logistic-regression as well as svm with class-weight parameter set as "balanced", but got no significant improvement in accuracy.
"... but got no significant improvement in accuracy."
Accuracy isn't the way to go (e.g. see Accuracy paradox). With a 10:1 ratio of classes you can easily get about 90% accuracy just by always predicting the majority class (class label 1 in your data).
Some good starting points are:
try a different performance metric. E.g. F1-score and Matthews correlation coefficient
"resample" the dataset: add examples from the under-represented class (over-sampling) / delete instances from the over-represented class (under-sampling; you should have a lot of data)
a different point of view: anomaly detection is a good try for an imbalanced dataset
a different algorithm is another possibility but not a silver bullet. Probably you should start with decision trees (they often perform well on imbalanced datasets)
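To illustrate the first suggestion, here is a minimal sketch that scores a naive "always predict 1" baseline on the class counts from the question (6800 samples of class 1, 700 of class 0). The formulas are the standard definitions of the Matthews correlation coefficient and the F1-score; treating the MCC as 0 when its denominator vanishes is a convention assumed here.

```python
import math

# Class counts from the question: 6800 samples of class 1, 700 of class 0.
y_true = [1] * 6800 + [0] * 700
# Naive baseline that always predicts the majority class.
y_pred = [1] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)

# Matthews correlation coefficient; taken as 0 when the denominator vanishes.
denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / denom if denom else 0.0

# F1 for the minority class: label 0 plays the role of "positive" here,
# so tn counts its hits, fn its false alarms and fp its misses.
f1_denominator = 2 * tn + fn + fp
f1_minority = 2 * tn / f1_denominator if f1_denominator else 0.0

print(f"accuracy      = {accuracy:.3f}")
print(f"MCC           = {mcc:.3f}")
print(f"F1 (class 0)  = {f1_minority:.3f}")
```

The baseline reaches roughly 91% accuracy while both the MCC and the minority-class F1 are 0, so these two metrics expose what accuracy hides.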
EDIT (now knowing you're using scikit-learn)
The weights from the class_weight (scikit-learn) parameter are used to train the classifier (so balanced is ok) but accuracy is a poor choice to know how well it's performing.
The sklearn.metrics module implements several loss, score and utility functions to measure classification performance. Also take a look at How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?.
Have you tried plotting a ROC curve and computing the AUC to check your parameters and different thresholds? If not, that should give you a good starting point.

Understanding the meaning of logistic regression coefficients

I have a binary logistic regression model (0/1), built over binary features. The feature coefficients are usually in the range (-1, 1). After training, can I use the feature coefficients as a proxy for the 'importance' of a feature? If the coefficient is < 0, does that mean the presence of the feature is a negative for the class (i.e., reduces the probability of the output being 1)?
Right; a negative coefficient means that the feature contra-indicates that class. The magnitude is, indeed, the relative importance. -1 and +1 are imperatives: all members of the class do not / do have that feature.
You absolutely can. In fact, this idea of importance or 'blame' is the main concept behind machine learning algorithms. The coefficients change many times during the training process through gradient descent. How much each weight is updated is determined by that weight's contribution to the cost.
That is, the more a weight is to blame for a high cost, the more extreme the update will be. Therefore, more extreme values (large positive or large negative) are an indication of how impactful the respective feature is when the model is making its decision.
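As a rough illustration of how a coefficient on a binary feature can be read, the sketch below uses a made-up intercept and two made-up coefficients (feature_a and feature_b are hypothetical names). In logistic regression a coefficient is a change in log-odds, so exp(coefficient) is the odds ratio associated with the feature being present; the coefficients themselves are not restricted to the range (-1, 1).

```python
import math

def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up fitted values, for illustration only.
intercept = -0.5
coefficients = {"feature_a": 0.8, "feature_b": -0.6}

summary = {}
for name, w in coefficients.items():
    summary[name] = {
        # Multiplicative change in the odds of class 1 when the feature is 1.
        "odds_ratio": math.exp(w),
        # Probability of class 1 with all features at 0.
        "p_absent": sigmoid(intercept),
        # Probability of class 1 with only this feature set to 1.
        "p_present": sigmoid(intercept + w),
    }
    print(f"{name}: coefficient = {w:+.1f}, "
          f"odds ratio = {summary[name]['odds_ratio']:.2f}, "
          f"P(y=1) {summary[name]['p_absent']:.2f} -> "
          f"{summary[name]['p_present']:.2f}")
```

A positive coefficient gives an odds ratio above 1 and raises the predicted probability, while a negative one gives an odds ratio below 1 and lowers it. Comparing magnitudes across features is most meaningful when the features share a scale, as binary features do.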
