XGBoost for precision

I'm using XGBoost for binary classification. The standard/default loss function (binary logistic) considers all classifications (both in the positive and negative classes) for performance.
All I care about is precision. I don't mind if it makes a very small number of positive classifications, as long as it maximises its strike rate of getting them right. So I'd like a loss function/evaluation metric combination that doesn't care about missed opportunities at all (i.e. false negatives or true negatives), and only seeks to maximise true positives (and minimise false positives).
I have a relatively balanced panel.
Is there a straightforward way to do this in xgboost (either through existing hyperparameters, or through a new loss function)? If there is a better loss/objective function (and gradient/hessian), is there a paper or reference for this?
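A minimal sketch of the hyperparameter route: keep the default binary:logistic objective, lower scale_pos_weight so false positives cost relatively more, and raise the decision threshold so only high-confidence positives are called. X_train, y_train, X_val, y_val and the parameter values below are placeholders, not a recommendation:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import precision_score

# Placeholder splits: X_train, y_train, X_val, y_val are assumed to exist,
# with labels in {0, 1}.

# scale_pos_weight < 1 down-weights the positive class, so the model becomes
# more conservative about predicting positive (fewer false positives).
clf = XGBClassifier(objective="binary:logistic", scale_pos_weight=0.5,
                    n_estimators=300, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)

# Then push the decision threshold well above 0.5: precision rises as the
# threshold rises, at the cost of making fewer positive calls.
proba = clf.predict_proba(X_val)[:, 1]
for threshold in (0.5, 0.7, 0.9, 0.95):
    pred = (proba >= threshold).astype(int)
    prec = precision_score(y_val, pred, zero_division=0)
    print(f"threshold={threshold:.2f}  positives={pred.sum()}  precision={prec:.3f}")
```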

Related

F1 Score vs ROC AUC

I have the below F1 and AUC scores for 2 different cases
Model 1: Precision: 85.11 Recall: 99.04 F1: 91.55 AUC: 69.94
Model 2: Precision: 85.1 Recall: 98.73 F1: 91.41 AUC: 71.69
The main aim of my problem is to predict the positive cases correctly, i.e. reduce the False Negative cases (FN). Should I use the F1 score and choose Model 1, or use AUC and choose Model 2? Thanks
Introduction
As a rule of thumb, every time you want to compare ROC AUC vs F1 Score, think about it as if you are comparing your model performance based on:
[Sensitivity vs (1-Specificity)] VS [Precision vs Recall]
Note that Sensitivity is the Recall (they are the same exact metric).
Now we need to understand, intuitively, what Specificity, Precision and Recall (Sensitivity) are!
Background
Specificity is given by the following formula: Specificity = TN / (TN + FP)
Intuitively speaking, if we have a 100% specific model, that means it did NOT miss any True Negative; in other words, there were NO False Positives (i.e. negative cases falsely labeled as positive). Yet, there is a risk of having a lot of False Negatives!
Precision is given by the following formula: Precision = TP / (TP + FP)
Intuitively speaking, if we have a 100% precise model, that means every case it labeled as positive really is positive; in other words, there were NO False Positives. Yet, there is a risk of missing many True Positives (False Negatives)!
Recall is given by the following formula: Recall = TP / (TP + FN)
Intuitively speaking, if we have a 100% recall model, that means it did NOT miss any True Positive, in other words, there were NO False Negatives (i.e. a positive result that is falsely labeled as negative). Yet, there is a risk of having a lot of False Positives!
As you can see, the three concepts are very close to each other!
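A quick sketch of how the three quantities come out of a confusion matrix (the toy labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)   # how pure the truly negative side is
precision   = tp / (tp + fp)   # how pure the predicted-positive side is
recall      = tp / (tp + fn)   # how many true positives were caught

print(f"specificity={specificity:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```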
As a rule of thumb, if the cost of a False Negative is high, we want to increase the model's sensitivity and recall (which are exactly the same formula)!
For instance, in fraud detection or sick patient detection, we don't want to label/predict a fraudulent transaction (True Positive) as non-fraudulent (False Negative). Also, we don't want to label/predict a contagious sick patient (True Positive) as not sick (False Negative).
This is because the consequences will be worse than those of a False Positive (incorrectly labeling a harmless transaction as fraudulent, or a non-contagious patient as contagious).
On the other hand, if the cost of a False Positive is high, then we want to increase the model's specificity and precision!
For instance, in email spam detection, we don't want to label/predict a non-spam email (True Negative) as spam (False Positive). On the other hand, failing to label a spam email as spam (False Negative) is less costly.
F1 Score
It's given by the following formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F1 Score keeps a balance between Precision and Recall. We use it when there is an uneven class distribution, where precision or recall alone can give misleading results!
So we use the F1 Score as a single indicator that combines the Precision and Recall numbers!
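As a quick check, the harmonic-mean formula reproduces the F1 reported for Model 1 in the question above:

```python
# Precision and recall for Model 1, taken from the question.
precision, recall = 85.11, 99.04
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))   # 91.55, matching the F1 reported for Model 1
```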
Area Under the Receiver Operating Characteristic curve (AUROC)
It compares Sensitivity vs (1-Specificity); in other words, it compares the True Positive Rate vs the False Positive Rate.
So, the bigger the AUROC, the better the model separates the positive class from the negative class!
AUROC vs F1 Score (Conclusion)
In general, the ROC curve is computed over many different threshold levels, while the F1 score applies to one particular point on the ROC curve (one threshold).
You may think of the F1 score as a measure of precision and recall at a particular threshold value, whereas AUC is the area under the whole ROC curve. For the F score to be high, both precision and recall should be high.
Consequently, when you have a data imbalance between positive and negative samples, you should always use the F1-score, because ROC averages over all possible thresholds!
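A small illustration of that difference, using synthetic imbalanced data and a plain logistic regression as a stand-in classifier (so the exact numbers are not meaningful):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

# Synthetic data with roughly 1% positives.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# ROC AUC is threshold-free (it averages over all thresholds), while F1 is
# evaluated at one fixed threshold (0.5 here), so the two can rank models
# differently under heavy imbalance.
print("ROC AUC:", round(roc_auc_score(y_te, proba), 3))
print("F1     :", round(f1_score(y_te, (proba > 0.5).astype(int)), 3))
```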
Further reading:
Credit Card Fraud: Handling highly imbalanced classes, and why the Receiver Operating Characteristic (ROC) curve should not be used and the Precision/Recall curve should be preferred in highly imbalanced situations
If you look at the definitions, you can see that both AUC and the F1-score optimize "something" together with the fraction of the truly positive samples that is correctly labeled positive (the recall).
This "something" is:
For the AUC, it's the specificity: the fraction of the truly negative samples that is correctly labeled as negative. Here you're not looking at how pure your positively labeled samples are.
For the F1 score, it's the precision: the fraction of the samples labeled positive that is actually positive. With the F1-score you don't consider the purity of the samples labeled as negative (the specificity).
The difference becomes important when you have highly unbalanced or skewed classes, for example when there are many more true negatives than true positives.
Suppose you are looking at data from the general population to find people with a rare disease. There are far more people "negative" than "positive", and trying to optimize how well you are doing on the positive and the negative samples simultaneously, using AUC, is not optimal. You want the set of samples you label positive to include all the true positives if possible, but you don't want it to be huge due to a high false positive rate. So in this case you use the F1 score.
Conversely, if both classes make up 50% of your dataset, or both make up a sizable fraction, and you care equally about your performance in identifying each class, then you should use the AUC, which optimizes for both classes, positive and negative.
just adding my 2 cents here:
AUC does an implicit weighting of the samples, which F1 does not.
In my last use case, comparing the effectiveness of drugs on patients, it's easy to learn which drugs are generally strong and which are weak. The big question is whether you can hit the outliers (the few positives for a weak drug or the few negatives for a strong drug). To answer that, you have to explicitly up-weight the outliers when using F1, which you don't need to do with AUC.
to predict the positive cases correctly
One can rephrase this goal a bit: when a case is really positive, you want to classify it as positive too. The probability of this event, p(predicted_label = positive | true_label = positive), is recall by definition. If you want to maximize this property of your model, you'd choose Model 1.

ResNet: How to achieve the accuracy reported in the paper?

I implemented ResNet for CIFAR-10 following this paper: https://arxiv.org/pdf/1512.03385.pdf
But my accuracy is significantly different from the accuracy reported in the paper:
Mine - 86%
The paper's - 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little bit too generic, but my opinion is that the network is overfitting to the training data set: as you can see, the training loss is quite low, but after epoch 50 the validation loss is not improving anymore.
I didn't read the paper in depth, so I don't know how they solved the problem, but increasing regularization might help. The following link will point you in the right direction: http://cs231n.github.io/neural-networks-3/
Below I copied the summary of the text:
Summary
To train a Neural Network:
- Gradient check your implementation with a small batch of data and be aware of the pitfalls.
- As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data.
- During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
- The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
- Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
- Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs) to fine (narrower ranges, training for many more epochs).
- Form model ensembles for extra performance.
I would argue that the difference in data preprocessing makes the difference in performance. The paper uses padding and random crops, which in essence increases the number of training samples and decreases the generalization error. Also, as the previous poster said, you are missing regularization features such as weight decay.
You should take another look at the paper and make sure you implement everything like they did.
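A rough PyTorch sketch of the preprocessing and regularization both answers point to; the augmentation (4-pixel padding, random 32x32 crops, horizontal flips) follows common CIFAR-10 ResNet practice, and the normalization statistics and hyperparameter values are common defaults, not taken from the paper:

```python
import torch
import torchvision
import torchvision.transforms as T

# Standard CIFAR-10 training-time augmentation: pad by 4 px, random crop back
# to 32x32, random horizontal flip, then normalize with commonly used stats.
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# model = ...  # your ResNet from https://github.com/slavaglaps/ResNet_cifar10
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
#                             weight_decay=5e-4)   # the missing regularization
```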

Machine Learning: Is it better to retrain a model if the loss stagnates at a high value?

Meaning to say: if during training you set your learning rate too high and unfortunately ended up in a local minimum where the loss is too high, is it better to retrain with a lower learning rate, or should you continue from a higher learning rate with the poorly performing model, in the hope that the loss will escape the local minimum?
In the strict sense, you don't have to retrain, as you can continue training with a lower learning rate (this is called a learning rate schedule). A very common approach is to lower the learning rate (usually by dividing by 10) each time the loss stagnates or becomes constant.
Another approach is to use an optimizer that scales the learning rate with the gradient magnitude, so the learning rate naturally decays as you get closer to a minimum. Examples of this are Adam, Adagrad and RMSProp.
In any case, make sure to find the optimal learning rate on a validation set; this will considerably improve performance and make learning faster. This applies both to plain SGD and to any other optimizer.
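For reference, a minimal PyTorch sketch of the divide-by-10-on-plateau schedule described above (the tiny model and random data are placeholders for your own training loop):

```python
import torch
import torch.nn as nn

# Placeholder model and data, just so the loop runs end to end.
model = nn.Linear(10, 1)
x, y = torch.randn(256, 10), torch.randn(256, 1)
loss_fn = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)   # lr *= 0.1 after 5 flat epochs

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())   # keep training with a lower lr; no restart needed

# Alternatively, an adaptive optimizer such as Adam scales its step sizes itself:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```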

Suggestions to improve my normalized accuracy with libsvm

I have a problem when I try to classify my data using libsvm. My training and test data are highly unbalanced. When I do the grid search for the SVM parameters and train my data with weights for the classes, the test accuracy is 96.8113%. But because the test data is unbalanced, all the correctly predicted values are from the negative class, which is larger than the positive class.
I tried a lot of things, from changing the weights to changing the gamma and cost values, but my normalized accuracy (which takes both the positive and negative classes into account) gets lower with each try. Training on 50% positives and 50% negatives with the default grid.py parameters, I get a very low accuracy (18.4234%).
I want to know if the problem is in my description (how I build the feature vectors), in the class imbalance (should I use balanced data in another way?), or whether I should change my classifier.
Better data always helps.
I think that imbalance is part of the problem. But a more significant part of the problem is how you're evaluating your classifier. Evaluating accuracy given the distribution of positives and negatives in your data is pretty much useless. So is training on 50% and 50% and testing on data that is distributed 99% vs 1%.
There are problems in real life that are like the one you're studying (that have a great imbalance of positives to negatives). Let me give you two examples:
Information retrieval: given all documents in a huge collection, return the subset that is relevant to a search term q.
Face detection: given a large image, mark all locations where there are human faces.
Many approaches to these types of systems are classifier-based. To compare two classifiers, a few tools are commonly used: ROC curves, Precision-Recall curves and the F-score. These tools give a more principled way to evaluate when one classifier is working better than another.
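If it helps, here is a rough scikit-learn sketch of that kind of evaluation; the synthetic imbalanced data and parameter values are stand-ins for your actual libsvm setup, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import (f1_score, balanced_accuracy_score,
                             precision_recall_curve, roc_auc_score)

# Synthetic data with roughly 3% positives, as a stand-in for the real features.
X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weights play the role of libsvm's per-class weights.
clf = SVC(C=1.0, gamma="scale", class_weight="balanced", probability=True)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

print("balanced accuracy:", balanced_accuracy_score(y_te, pred))  # a "normalized accuracy"
print("F1:", f1_score(y_te, pred))
print("ROC AUC:", roc_auc_score(y_te, proba))
prec, rec, thr = precision_recall_curve(y_te, proba)               # PR curve points
```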

Which evaluation metric for classifiers? Precision & recall?

I have some labeled data which classifies datasets as positive or negative. Now I have an algorithm that does the same automatically, and I want to compare the results.
I was told to use precision and recall, but I'm not sure whether those are appropriate, because the true negatives don't even appear in the formulas. I'd rather tend to use a general "prediction rate" over both positives and negatives.
What would be a good way to evaluate the algorithm? Thanks!
There is no general "best" method of evaluation; everything depends on what your aim is, as each method captures a different phenomenon:
Accuracy is the simplest measure, well suited for multi-label classification and rather well-balanced data
F1-score captures the precision/recall tradeoff
MCC (Matthews correlation coefficient) is a good measure, well suited for datasets with a large disproportion in class sizes (see the sketch below)
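A tiny sketch (made-up labels) of why accuracy can look fine on imbalanced data while F1 and MCC expose the problem:

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# 95 negatives, 5 positives; the "classifier" predicts negative for everything.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print("accuracy:", accuracy_score(y_true, y_pred))               # 0.95, looks great
print("F1      :", f1_score(y_true, y_pred, zero_division=0))    # 0.0
print("MCC     :", matthews_corrcoef(y_true, y_pred))            # 0.0
```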

Resources