I'm using XGBoost for binary classification. The standard/default loss function (binary logistic) weighs errors on both the positive and the negative class when measuring performance.
All I care about is precision. I don't mind if it makes only a very small number of positive classifications, as long as it maximises its strike rate of getting those right. So I'd like a loss function/evaluation metric combination that doesn't care about missed opportunities at all (i.e. false negatives or true negatives), and only seeks to maximise true positives (and minimise false positives).
I have a relatively balanced panel.
Is there a straightforward way to do this in xgboost (either through existing hyperparameters, or through a new loss function)? If there is a better loss/objective function (and gradient/hessian), is there a paper or reference for this?
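For context, here is a rough sketch of the kind of custom objective/evaluation pair I have in mind (hedged, not something I've validated: the down-weighting factor w_neg and the 0.5 threshold are placeholders I would tune):

import numpy as np
import xgboost as xgb

def weighted_logistic(w_neg):
    """Logistic loss whose gradient/hessian up-weight the negative class,
    so false positives cost more than false negatives (assumes w_neg > 1)."""
    def objective(preds, dtrain):
        y = dtrain.get_label()
        p = 1.0 / (1.0 + np.exp(-preds))      # raw margins -> probabilities
        w = np.where(y == 1, 1.0, w_neg)      # per-example weights
        grad = w * (p - y)                    # weighted gradient of log loss
        hess = w * p * (1.0 - p)              # weighted hessian of log loss
        return grad, hess
    return objective

def precision_eval(preds, dtrain):
    """Precision at a fixed 0.5 probability threshold (a placeholder; tune it)."""
    y = dtrain.get_label()
    pred_pos = 1.0 / (1.0 + np.exp(-preds)) > 0.5
    tp = np.sum(pred_pos & (y == 1))
    fp = np.sum(pred_pos & (y == 0))
    return "precision", tp / max(tp + fp, 1)

# dtrain / dvalid would be xgboost.DMatrix objects built from the data:
# booster = xgb.train(
#     {"max_depth": 4, "eta": 0.1},
#     dtrain,
#     num_boost_round=200,
#     obj=weighted_logistic(w_neg=5.0),
#     feval=precision_eval,
#     maximize=True,
#     evals=[(dvalid, "valid")],
#     early_stopping_rounds=20,
# )

A simpler built-in lever would be scale_pos_weight: values below 1 down-weight the positive class, which tends to trade recall for precision, but I don't know if that's the recommended approach.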
I have the following precision, recall, F1 and AUC scores for two different models:
Model 1: Precision: 85.11 Recall: 99.04 F1: 91.55 AUC: 69.94
Model 2: Precision: 85.1 Recall: 98.73 F1: 91.41 AUC: 71.69
The main aim of my problem is to predict the positive cases correctly, i.e. reduce the number of False Negative (FN) cases. Should I use the F1 score and choose Model 1, or use AUC and choose Model 2? Thanks
Introduction
As a rule of thumb, every time you want to compare ROC AUC vs F1 Score, think about it as if you are comparing your model performance based on:
[Sensitivity vs (1-Specificity)] VS [Precision vs Recall]
Note that Sensitivity is the same as Recall (they are exactly the same metric).
Now we need to understand, intuitively, what Specificity, Precision and Recall (Sensitivity) are!
Background
Specificity is given by the following formula: Specificity = TN / (TN + FP).
Intuitively speaking, if we have a 100% specific model, that means it did NOT miss any True Negative; in other words, there were NO False Positives (i.e. negative cases falsely labeled as positive). Yet, there is a risk of having a lot of False Negatives!
Precision is given by the following formula: Precision = TP / (TP + FP).
Intuitively speaking, if we have a 100% precise model, that means every case it labels as positive really is positive; in other words, there were NO False Positives. Yet, there is a risk of having a lot of False Negatives (missed positives)!
Recall is given by the following formula: Recall = TP / (TP + FN).
Intuitively speaking, if we have a 100% recall model, that means it did NOT miss any True Positive; in other words, there were NO False Negatives (i.e. positive cases falsely labeled as negative). Yet, there is a risk of having a lot of False Positives!
As you can see, the three concepts are very close to each other!
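To make the three formulas concrete, here is a minimal sketch that computes them from a confusion matrix (assuming 0/1 label arrays; the example data is just illustrative):

import numpy as np

def confusion_metrics(y_true, y_pred):
    """Specificity, precision and recall from 0/1 label arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "specificity": tn / (tn + fp),   # TN / (TN + FP)
        "precision":   tp / (tp + fp),   # TP / (TP + FP)
        "recall":      tp / (tp + fn),   # TP / (TP + FN)  (= sensitivity)
    }

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
print(confusion_metrics(y_true, y_pred))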
As a rule of thumb, if the cost of a False Negative is high, we want to increase the model's sensitivity/recall (which are exactly the same metric)!
For instance, in fraud detection or sick patient detection, we don't want to label/predict a fraudulent transaction (True Positive) as non-fraudulent (False Negative). Also, we don't want to label/predict a contagious sick patient (True Positive) as not sick (False Negative).
This is because the consequences would be worse than those of a False Positive (incorrectly labeling a harmless transaction as fraudulent, or a non-contagious patient as contagious).
On the other hand, if the cost of a False Positive is high, then we want to increase the model's specificity and precision!
For instance, in email spam detection, we don't want to label/predict a non-spam email (True Negative) as spam (False Positive). On the other hand, failing to label a spam email as spam (False Negative) is less costly.
F1 Score
It's given by the following formula: F1 = 2 * (Precision * Recall) / (Precision + Recall), i.e. the harmonic mean of Precision and Recall.
The F1 Score keeps a balance between Precision and Recall. We use it when there is an uneven class distribution, since either precision or recall alone can give a misleading picture!
So we use the F1 Score as a single indicator that combines the Precision and Recall numbers!
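As a tiny worked example (using the Model 1 numbers from the question, rescaled to the 0-1 range):

# F1 is the harmonic mean of precision and recall, so it is only high
# when BOTH are high.
precision, recall = 0.8511, 0.9904          # Model 1 from the question
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))                          # ~0.9155, matching the 91.55 above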
Area Under the Receiver Operating Characteristic curve (AUROC)
It compares Sensitivity vs (1 - Specificity); in other words, it compares the True Positive Rate against the False Positive Rate.
So, the bigger the AUROC, the better the model separates the positive and the negative class!
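A quick sketch of how this is usually computed in practice, using scikit-learn (the score array is a placeholder for a model's predicted probabilities):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])  # model probabilities

# AUROC uses the raw scores, not thresholded labels: it sweeps every threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # sensitivity vs (1 - specificity)
print("points on the ROC curve:", len(thresholds))
print("AUROC:", roc_auc_score(y_true, y_score))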
AUROC vs F1 Score (Conclusion)
In general, the ROC curve is traced out over many different threshold levels, and thus corresponds to many possible F1 values, whereas an F1 score applies to one particular point on the ROC curve (i.e. one particular threshold).
You may think of the F1 score as a measure of precision and recall at a particular threshold value, whereas AUC is the area under the whole ROC curve. For the F1 score to be high, both precision and recall must be high.
Consequently, when you have a data imbalance between positive and negative samples, you should always use F1-score because ROC averages over all possible thresholds!
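One way to see the "one threshold vs all thresholds" point is to sweep the threshold and watch F1 change while AUC stays fixed (reusing the placeholder scores from the sketch above):

import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

print("AUC:", roc_auc_score(y_true, y_score))         # one number for all thresholds
for t in (0.3, 0.5, 0.7):
    y_pred = (y_score >= t).astype(int)                # one point on the ROC curve
    print(f"threshold={t}  F1={f1_score(y_true, y_pred):.3f}")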
Further reading:
Credit Card Fraud: Handling highly imbalanced classes, and why the Receiver Operating Characteristic curve (ROC curve) should not be used and the Precision/Recall curve should be preferred in highly imbalanced situations.
If you look at the definitions, you can see that both AUC and F1-score optimize "something" together with the recall, i.e. the fraction of the truly positive samples that your model labels as positive.
This "something" is:
For the AUC, it's the specificity: the fraction of the truly negative samples that is correctly labeled as negative. You're not looking at what fraction of the samples you labeled positive is actually positive.
For the F1 score, it's the precision: the fraction of the samples labeled positive that is actually positive. Using the F1-score you don't consider how well you do on the truly negative samples (the specificity).
The difference becomes important when you have highly unbalanced or skewed classes, for example when there are many more true negatives than true positives.
Suppose you are looking at data from the general population to find people with a rare disease. There are far more "negative" people than "positive" ones, and trying to optimize how well you are doing on the positive and the negative samples simultaneously, using AUC, is not optimal. You want the set of samples you label positive to include all the actual positives if possible, and you don't want it to be huge due to a high false positive rate. So in this case you use the F1 score.
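As a hedged illustration of this rare-positive case (synthetic data, so the exact numbers will vary, but the ROC AUC typically comes out much higher than the F1 score at the default 0.5 threshold):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# ~2% positives: the kind of skew where AUC can look flattering while F1 does not.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.98, 0.02],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print("F1     :", f1_score(y_te, clf.predict(X_te)))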
Conversely if both classes make up 50% of your dataset, or both make up a sizable fraction, and you care about your performance in identifying each class equally, then you should use the AUC, which optimizes for both classes, positive and negative.
Just adding my 2 cents here:
AUC does an implicit weighting of the samples, which F1 does not.
In my last use case, comparing the effectiveness of drugs on patients, it was easy to learn which drugs are generally strong and which are weak. The big question is whether you can hit the outliers (the few positives for a weak drug, or the few negatives for a strong drug). To answer that, you have to explicitly up-weight the outliers when using F1, which you don't need to do with AUC.
to predict the positive cases correctly
one can rewrite your goal a bit and get: when a case is really positive, you want to classify it as positive too. The probability of that event, p(predicted_label = positive | true_label = positive), is the recall by definition. If you want to maximize this property of your model, you'd choose Model 1.
The title says it all: Should a neural network be able to have a perfect train accuracy? Mine saturates at ~0.9 accuracy and I am wondering if that indicates a problem with my network or the training data.
Training instances: ~4500 sequences with an average length of 10 elements.
Network: Bi-directional vanilla RNN with a softmax layer on top.
Perfect accuracy on training data is usually a sign of a phenomenon called overfitting (https://en.wikipedia.org/wiki/Overfitting) and the model may generalize poorly to unseen data. So, no, probably this alone is not an indication that there is something wrong (you could still be overfitting but it is not possible to tell from the information in your question).
You should check the accuracy of the NN on a validation set (data your network has not seen during training) and judge its generalizability. Usually it's an iterative process where you train many networks with different configurations in parallel and see which one performs best on the validation set. Also see cross validation (https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
Even if you have low measurement noise, a model may still not reach zero training error. This could be for many reasons, including that the model is not flexible enough to capture the true underlying function (which can be a complicated, high-dimensional, non-linear function). You can try increasing the number of hidden layers and nodes, but you have to be careful about the same issues, like overfitting, and only judge based on evaluation through cross validation.
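A minimal sketch of that train-vs-validation comparison with scikit-learn (placeholder data and a small MLP as a stand-in for your bi-directional RNN; for the actual network you would do the equivalent split in your deep-learning framework):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4500, n_features=30, random_state=0)

# return_train_score lets you compare training vs validation accuracy directly:
# a large gap suggests overfitting, while two similarly low scores suggest the
# model is not flexible enough (or the problem is genuinely noisy).
scores = cross_validate(MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                      random_state=0),
                        X, y, cv=5, return_train_score=True)
print("train accuracy:", scores["train_score"].mean())
print("valid accuracy:", scores["test_score"].mean())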
You can definitely get a 100% accuracy on training datasets by increasing model complexity but I would be wary of that.
You cannot expect your model to be better on your test set than on your training set. This means if your training accuracy is lower than the desired accuracy, you have to change something. Most likely you have to increase the number of parameters of your model.
The reasons why you might be OK with not having a perfect training accuracy are (1) the problem of overfitting and (2) training time. The more complex your model is, the more likely overfitting becomes.
You might want to have a look at Structural Risk Minimization:
[Figure illustrating structural risk minimization (source: svms.org)]
I'm trying to tackle a binary classification problem with some custom random forest implementation.
The goal is to predict the likelihood that the item belongs to class A. The evaluation strategy is defined such that false positives (a high likelihood for A while the actual class is B) are punished harder than false negatives (a low likelihood for A while the actual class is A).
How should the standard algorithm be adapted to take advantage of this to get a higher evaluation score?
If you haven't already, try using the R package rfUtilities: https://cran.r-project.org/web/packages/rfUtilities/rfUtilities.pdf
It was designed to deal with class imbalance by predicting the likelihood of occurrence for a single category.
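If your custom implementation mirrors a standard random forest, here is a hedged Python sketch of the same idea using scikit-learn as a stand-in (the 5:1 class weight and the 0.7 cut-off are assumptions you would tune against your evaluation metric):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two common levers when false positives for class A (label 1 here) cost more:
# 1) class_weight: make mistakes on the other class (B, label 0) more expensive, and
# 2) a higher decision threshold before committing to class A.
rf = RandomForestClassifier(n_estimators=300,
                            class_weight={0: 5.0, 1: 1.0},   # punish FPs on A harder
                            random_state=0).fit(X_tr, y_tr)

proba_a = rf.predict_proba(X_te)[:, 1]     # predicted likelihood of class A
pred_a = proba_a >= 0.7                    # conservative threshold for calling A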
I use VL-Feat and LIBLINEAR to handle a 2-category classification problem. The #(-)/#(+) ratio for the training set is 35.01 and the dimension of each feature vector is 3.6e5. I have around 15000 examples.
I have set the weight of the positive examples to 35.01 and of the negative examples to 1 (the default). But what I get is extremely poor performance on the test dataset.
So, to find out the reason, I ran the trained model on the training examples themselves. What I see is that negative examples get slightly higher decision values than positive ones. That is really weird, right? I've checked the input to make sure I did not mislabel the examples, and I've normalized the histogram vectors.
Has anybody met this situation before?
Here are the parameters of the trained model. Parameters like bias, regularizer and dualityGap look strange to me, because they are so small that numerical accuracy could easily be lost.
model.info =
solver: 'sdca'
lambda: 0.0100
biasMultiplier: 1
bias: -1.6573e-14
objective: 1.9439
regularizer: 6.1651e-04
loss: 1.9432
dualObjective: 1.9439
dualLoss: 1.9445
dualityGap: -2.6645e-15
iteration: 43868
epoch: 2
elapsedTime: 228.9374
One thing that could be happening is that LIBSVM treats the label of the first example in the dataset as the positive class, and the other label as the negative class. So it could be that, since you have 35x more negatives than positives, your first example is negative and your classes are being inverted. How to check this? Make sure that the first data point in the training set belongs to the positive class.
I've checked in the FAQ of LIBLINEAR and it seems it happens in LIBLINEAR as well (I'm not as familiar with LIBLINEAR):
http://www.csie.ntu.edu.tw/~cjlin/liblinear/FAQ.html (search for reversed)
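A small sketch of that sanity check on the decision values you already have (plain numpy; the label and score arrays are placeholders for what you feed to, and get back from, LIBLINEAR):

import numpy as np

# Placeholder data: labels in the order they are passed to LIBLINEAR/LIBSVM,
# and the decision values returned for the same examples.
labels = np.array([-1, -1, +1, -1, +1])
scores = np.array([0.8, 0.3, -0.5, 0.6, -0.9])

# LIBLINEAR/LIBSVM take the label of the FIRST training example as the positive
# (+1) class. With 35x more negatives, that first example is almost certainly
# negative, so the sign of the decision values comes back inverted.
if labels[0] != +1:
    scores = -scores          # undo the inversion (or reorder the training file)

predictions = np.where(scores > 0, +1, -1)
print(predictions)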