Measuring performance of the classifiers in imbalanced datasets - machine-learning

I am trying to do classification on an imbalanced dataset (2000 data points from the positive class and 98880 data points from the negative class). I use Precision, Recall, F-Score and AUC to report the models' performance, but the way these models behave surprised me. You can see the models' results in the following:
TP:1982, TN:87920, FP:10960, FN:18 | PR:0.153, RE:0.991, F1:0.265, AUC:0.972
TP:22, TN:98877, FP:3, FN:1978 | PR:0.880, RE:0.011, F1:0.022, AUC:0.810
TP:148, TN:98271, FP:609, FN:1852 | PR:0.196, RE:0.074, F1:0.107, AUC:0.700
TP:1611, TN:98847, FP:33, FN:389 | PR:0.980, RE:0.805, F1:0.884, AUC:0.998
As you can see,
In the first model, precision is very low and recall is very high, which leads to a low F-Score but a high AUC.
In the second model, precision is high and recall is low, but the result is similar: high AUC and low F-Score.
In the third model, both precision and recall are very low, which results in a low F-Score, but surprisingly the AUC is still fairly high.
In the fourth model, both precision and recall are high, therefore the F-Score and AUC are both high.
So, can I conclude that for my problem F-Score is a better performance metric than AUC?
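For reference, here is a minimal sketch (not from the question) that recomputes the first model's threshold-based metrics from its confusion matrix; AUC is left out of the calculation because it needs the ranked scores, not just TP/TN/FP/FN:

# Recomputing the first model's threshold-based metrics from its confusion
# matrix. AUC cannot be recovered from TP/TN/FP/FN alone: it is computed from
# the ranked scores (e.g. sklearn.metrics.roc_auc_score(y_true, y_score)),
# which is why it can stay high while the F-Score collapses.
tp, tn, fp, fn = 1982, 87920, 10960, 18

precision = tp / (tp + fp)                                  # ~0.153
recall    = tp / (tp + fn)                                  # ~0.991
f1        = 2 * precision * recall / (precision + recall)   # ~0.265
print(round(precision, 3), round(recall, 3), round(f1, 3))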

Related

Why are the train loss and the validation loss both flat lines?

I am using a Conv-LSTM for training. The input features have been shown to be effective in several papers, and I can use a CNN+FC network to extract features and classify them. I changed the task to regression here, and I can also get the model to converge with Conv+FC. Later, I tried using a Conv-LSTM to take the temporal characteristics of the data into account: specifically, to predict the output at the current moment from multiple historical inputs plus the input at the current moment. The Conv-LSTM code I used: https://github.com/ndrplz/ConvLSTM_pytorch. My loss is L1-Loss and the optimizer is Adam.
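For context, a minimal sketch of the training setup described above (L1 loss and Adam in PyTorch); the model and the tensor shapes are placeholders, not the actual Conv-LSTM or the real I/Q data:

import torch
import torch.nn as nn

# Placeholder model standing in for the Conv-LSTM; shapes are made up.
model = nn.Sequential(nn.Flatten(), nn.Linear(2 * 64, 1))
criterion = nn.L1Loss()                                    # L1 loss, as described
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, as described

x = torch.randn(8, 2, 64)   # dummy batch of I/Q samples
y = torch.randn(8, 1)       # regression targets

for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()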
A loss curve is below:
Example loss values:
Epoch:1/500 AVG Training Loss:16.40108 AVG Valid Loss:22.40100
Best validation loss: 22.400997797648113
Saving best model for epoch 1
Epoch:2/500 AVG Training Loss:16.42522 AVG Valid Loss:22.40100
Epoch:3/500 AVG Training Loss:16.40599 AVG Valid Loss:22.40100
Epoch:4/500 AVG Training Loss:16.40175 AVG Valid Loss:22.40100
Epoch:5/500 AVG Training Loss:16.42198 AVG Valid Loss:22.40101
Epoch:6/500 AVG Training Loss:16.41907 AVG Valid Loss:22.40101
Epoch:7/500 AVG Training Loss:16.42531 AVG Valid Loss:22.40101
My attempts:
I reduced the dataset to only a few samples and verified that the model can overfit them, so the network code should be fine.
I adjusted the learning rate, trying 1e-3, 1e-4, 1e-5 and 1e-6, but the loss curve stays as flat as before, and the loss values barely change.
I replaced the optimizer with SGD, and training shows the same problem as above.
Because my data is wireless (I/Q) data, which is neither a CV nor an NLP input type, I have some questions about deep learning training.
After some testing, I finally found that my initial learning rate was too small. From my previous single-point data training I had assumed that a learning rate of 1e-3 was large enough, so I only ever tuned it downward from 1e-3; in fact 1e-3 was too small, and the network was not learning at all. Once the learning rate was raised to 1e-2, both the train loss and the validation loss started to drop quickly (still with the Adam optimizer). When tuning the learning rate, start from a large value such as 1 and work downward rather than assuming a "safe" small value in advance, for example with a coarse sweep like the sketch below.
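A hedged illustration of that advice (the model, data and step counts are placeholders, not the original network): a coarse sweep from a large learning rate downward could look like this:

import torch
import torch.nn as nn

def try_lr(lr, steps=50):
    # Train a tiny stand-in model for a few steps and report the final loss.
    model = nn.Linear(16, 1)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    crit = nn.L1Loss()
    x, y = torch.randn(256, 16), torch.randn(256, 1)
    for _ in range(steps):
        opt.zero_grad()
        loss = crit(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Sweep from large to small instead of assuming a small value up front.
for lr in [1.0, 1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"lr={lr:g}  final train loss={try_lr(lr):.4f}")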

Performance measure for classification problem with unbalanced dataset

I have an anomaly detection problem with a large imbalance between healthy and anomalous data (i.e. >20,000 healthy data points versus <30 anomalies).
Currently I just use precision, recall and F1 score to measure the performance of my model. I have no good method for setting the threshold parameter, but that is not the problem at the moment.
I want to measure whether the model is able to distinguish between the two classes independently of the threshold. I have read that the ROC-AUC measure can be used when the data is unbalanced (https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-2-86d5649a5428). But with my data I get very high ROC-AUC scores (>0.97), even though the model outputs low scores when an anomaly occurs.
Does anyone know a better performance measure for this task, or should I stick with the ROC-AUC score?
Let me add an example for my problem:
Consider a case with 20448 data points, 26 of which are anomalies. With my model I get the following anomaly scores for these anomalies:
[1.26146367, 1.90735495, 3.08136725, 1.35184909, 2.45533306,
2.27591039, 2.5894709 , 1.8333928 , 2.19098432, 1.64351134,
1.38457746, 1.87627623, 3.06143893, 2.95044859, 1.35565042,
2.26926566, 1.59751463, 3.1462369 , 1.6684134 , 3.02167491,
3.14508974, 1.0376038 , 1.86455995, 1.61870919, 1.35576177,
1.64351134]
If I now count how many data points have a higher anomaly score than, for example, 1.38457746, I get 281 data points. That looks like bad performance from my perspective. But in the end the ROC-AUC score is still 0.976038.
len(np.where(scores > 1.38457746)[0]) # 281
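As one hedged alternative to ROC-AUC, the area under the precision-recall curve (average precision in sklearn) tends to react much more strongly to false positives when positives are this rare. The sketch below simulates the healthy scores, since only the 26 anomaly scores are given above:

# Comparing ROC-AUC with average precision (area under the precision-recall
# curve) on a heavily imbalanced toy version of the problem. The healthy
# scores are simulated; the anomaly scores are rounded from the list above.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
normal_scores = rng.normal(0.5, 0.3, size=20448 - 26)   # hypothetical healthy scores
anomaly_scores = np.array([1.26, 1.91, 3.08, 1.35, 2.46, 2.28, 2.59, 1.83,
                           2.19, 1.64, 1.38, 1.88, 3.06, 2.95, 1.36, 2.27,
                           1.60, 3.15, 1.67, 3.02, 3.15, 1.04, 1.86, 1.62,
                           1.36, 1.64])

scores = np.concatenate([normal_scores, anomaly_scores])
labels = np.concatenate([np.zeros(len(normal_scores)), np.ones(len(anomaly_scores))])

print("ROC-AUC:          ", roc_auc_score(labels, scores))
print("Average precision:", average_precision_score(labels, scores))
# With so few positives, ROC-AUC stays high while average precision is much
# more sensitive to the healthy points that rank above the anomalies.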

`BCEWithLogitsLoss` and training class dataset imbalances in Pytorch

A bit of clarification on PyTorch's BCEWithLogitsLoss: I am using pos_weights = torch.tensor([len_n/(len_n + len_y), len_y/(len_n + len_y)]) to initialize the loss, with [1.0, 0.0] being the negative class and [0.0, 1.0] being the positive class, and len_n, len_y being the number of negative and positive samples respectively.
The reason for using BCEWithLogitsLoss in the first place is precisely that I assume it compensates for the imbalance between the number of positive and negative samples by preventing the network from simply "defaulting" to the most abundant class in the training set. I want to control how strongly the loss prioritizes detecting the less abundant class correctly. In my case, negative training samples outnumber positive samples by a factor of 25 to 1, so it is very important that the network predicts a high fraction of the positive samples correctly, rather than just achieving a high overall accuracy (always defaulting to negative would already give about 96% accuracy if that were all I cared about).
Question: Is my assumption correct that BCEWithLogitsLoss uses the pos_weight parameter to compensate for class imbalance during training? Any insight into how the imbalance is addressed in the loss evaluation?
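For comparison, here is a minimal sketch of the single-logit usage described in the PyTorch documentation, where pos_weight is the ratio of negative to positive counts; the counts are hypothetical, and this differs from the two-element tensor in the question:

# pos_weight multiplies the positive term of the loss; setting it to
# n_negative / n_positive makes errors on the rare positive class count
# roughly as much as errors on the abundant negative class.
import torch
import torch.nn as nn

len_y, len_n = 1000, 25000                   # hypothetical counts (about 25:1)
pos_weight = torch.tensor([len_n / len_y])   # ~25.0

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits  = torch.randn(8, 1)                  # raw model outputs, no sigmoid applied
targets = torch.tensor([[1.], [0.], [0.], [0.], [1.], [0.], [0.], [0.]])
print(criterion(logits, targets))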

The difference between test accuracy (60%) and training accuracy (99.9%) is huge, indicating high variance

What should I do to reduce the variance? I checked for multicollinearity using VIF; the VIF for all parameters was less than 2. AIC and BIC are high. Adjusted R^2 is around 0.45, which is low. The condition number is also high. What should I do?
I am using decision trees and random forests to build the model.

What is the F-measure for each class in WEKA?

When we evaluate a classifier in WEKA, for example a 2-class classifier, it gives us 3 f-measures: f-measure for class 1, for class 2 and the weighted f-measure.
I'm so confused! I thought the F-measure was a balanced measure that shows balanced performance across multiple classes, so what do the F-measures for class 1 and class 2 mean?
The f-score (or f-measure) is calculated based on the precision and recall. The calculation is as follows:
Precision = t_p / (t_p + f_p)
Recall = t_p / (t_p + f_n)
F-score = 2 * Precision * Recall / (Precision + Recall)
Where t_p is the number of true positives, f_p the number of false positives and f_n the number of false negatives. Precision is defined as the fraction of elements correctly classified as positive out of all the elements the algorithm classified as positive, whereas recall is the fraction of elements correctly classified as positive out of all the positive elements.
In the multiclass case, each class i has its own precision and recall, where a "true positive" is an element predicted to be in class i that really is in it, and a "true negative" is an element predicted not to be in class i that indeed is not in it.
Thus, with this per-class definition of precision and recall, each class can have its own F-score by doing the same calculation as in the binary case. This is what Weka is showing you.
The weighted F-score is a weighted average of the classes' F-scores, weighted by the proportion of elements in each class.
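As a hedged illustration (the labels and predictions below are made up), the per-class and weighted F-scores can be reproduced either by hand with the formulas above or with sklearn:

# Per-class and weighted F-measure on a made-up 2-class example.
# By hand: class 0 as positive -> t_p=4, f_p=1, f_n=2; class 1 -> t_p=3, f_p=2, f_n=1.
from sklearn.metrics import precision_recall_fscore_support, f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

p, r, f, support = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
print(f)                                             # one F-score per class
print(f1_score(y_true, y_pred, average="weighted"))  # averaged, weighted by class support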
I am confused too.
I used the same equations for the F-score of each class, based on their precision and recall, but the results are different!
Example: my F-score differs from Weka's calculation.
