Performance measure for classification problem with unbalanced dataset - machine-learning

I have an anomaly detection problem with a big difference between healthy and anomalous data (i.e. >20.000 healthy datapoints against <30 anomalies).
Currently, I use just precision, recall and f1 score to measure the performance of my model. But I have no good method to set the threshold parameter. But that is not the problem at the moment.
I want to measure if the model is able to distinguish between the two classes independent of the threshold. I have read, that the ROC-AUC measure can be used if the data is unbalanced (https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-2-86d5649a5428). But with my data I get very high ROC-AUC scores (>0.97), even if the model outputs low scores if an anomaly occurs.
Maybe someone knows a better performance measure for this task or should I stick with the ROC-AUC score?
I try to add an example for my problem:
We consider a case where we have 20448 data points. We have 26 anomalies in this data. With my model I get the following anomaly scores for this anomalies:
[1.26146367, 1.90735495, 3.08136725, 1.35184909, 2.45533306,
2.27591039, 2.5894709 , 1.8333928 , 2.19098432, 1.64351134,
1.38457746, 1.87627623, 3.06143893, 2.95044859, 1.35565042,
2.26926566, 1.59751463, 3.1462369 , 1.6684134 , 3.02167491,
3.14508974, 1.0376038 , 1.86455995, 1.61870919, 1.35576177,
1.64351134]
If I now output how many data points have a higher anomaly score as, for example 1.38457746, I get 281 data points. That look like a bad performance from my perspective. But at the end the ROC AUC score is still 0.976038.
len(np.where(scores > 1.38457746)[0]) # 281

Related

Which metric to use for imbalanced classification problem?

I am working on a classification problem with very imbalanced classes. I have 3 classes in my dataset : class 0,1 and 2. Class 0 is 11% of the training set, class 1 is 13% and class 2 is 75%.
I used and random forest classifier and got 76% accuracy. But I discovered 93% of this accuracy comes from class 2 (majority class). Here is the Crosstable I got.
The results I would like to have :
fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1
What I found on the internet to solve the problem and what I've tried :
using class_weight='balanced' or customized class_weight ( 1/11% for class 0, 1/13% for class 1, 1/75% for class 2), but it doesn't change anything (the accuracy and crosstable are still the same). Do you have an interpretation/explenation of this ?
as I know accuracy is not the best metric in this context, I used other metrics : precision_macro, precision_weighted, f1_macro and f1_weighted, and I implemented the area under the curve of precision vs recall for each class and use the average as a metric.
Here's my code (feedback welcome) :
from sklearn.preprocessing import label_binarize
def pr_auc_score(y_true, y_pred):
y=label_binarize(y_true, classes=[0, 1, 2])
return average_precision_score(y[:,:],y_pred[:,:])
pr_auc = make_scorer(pr_auc_score, greater_is_better=True,needs_proba=True)
and here's a plot of the precision vs recall curves.
Alas, for all these metrics, the crosstab remains the same... they seem to have no effect
I also tuned the parameters of Boosting algorithms ( XGBoost and AdaBoost) (with accuracy as metric) and again the results are not improved.. I don't understand because boosting algorithms are supposed to handle imbalanced data
Finally, I used another model (BalancedRandomForestClassifier) and the metric I used is accuracy. The results are good as we can see in this crosstab. I am happy to have such results but I notice that, when I change the metric for this model, there is again no change in the results...
So I'm really interested in knowing why using class_weight, changing the metric or using boosting algorithms, don't lead to better results...
As you have figured out, you have encountered the "accuracy paradox";
Say you have a classifier which has an accuracy of 98%, it would be amazing, right? It might be, but if your data consists of 98% class 0 and 2% class 1, you obtain a 98% accuracy by assigning all values to class 0, which indeed is a bad classifier.
So, what should we do? We need a measure which is invariant to the distribution of the data - entering ROC-curves.
ROC-curves are invariant to the distribution of the data, thus are a great tool to visualize classification-performances for a classifier whether or not it is imbalanced. But, they only work for a two-class problem (you can extend it to multiclass by creating a one-vs-rest or one-vs-one ROC-curve).
F-score might a bit more "tricky" to use than the ROC-AUC since it's a trade off between precision and recall and you need to set the beta-variable (which is often a "1" thus the F1 score).
You write: "fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1". Remember, that all algorithms work by either minimizing something or maximizing something - often we minimize a loss function of some sort. For a random forest, lets say we want to minimize the following function L:
L = (w0+w1+w2)/n
where wi is the number of class i being classified as not class i i.e if w0=13 we have missclassified 13 samples from class 0, and n the total number of samples.
It is clear that when class 0 consists of most of the data then an easy way to get a small L is to classify most of the samples as 0. Now, we can overcome this by adding a weight instead to each class e.g
L = (b0*w0+b1*w1+b2*x2)/n
as an example say b0=1, b1=5, b2=10. Now you can see, we cannot just assign most of the data to c0 without being punished by the weights i.e we are way more conservative by assigning samples to class 0, since assigning a class 1 to class 0 gives us 5 times as much loss now as before! This is exactly how the weight in (most) of the classifiers work - they assign a penalty/weight to each class (often proportional to it's ratio i.e if class 0 consists of 80% and class 1 consists of 20% of the data then b0=1 and b1=4) but you can often specify the weight your self; if you find that the classifier still generates to many false negatives of a class then increase the penalty for that class.
Unfortunately "there is no such thing as a free lunch" i.e it's a problem, data and usage specific choice, of what metric to use.
On a side note - "random forest" might actually be bad by design when you don't have much data due to how the splits are calculated (let me know, if you want to know why - it's rather easy to see when using e.g Gini as splitting). Since you have only provided us with the ratio for each class and not the numbers, I cannot tell.

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc and have a task to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept value, they seem to be abnormal and very high (Impractical). For example, if 'temperature' coefficient was found to be +375.456, I could not give a meaning to them saying an increase in one unit in temperature would increase yield by 375.456g. That's impractical in my scenario. However, the prediction accuracy seems right. I would like to know, how to interpret these huge intercept( -5341.27355) and huge beta values shown below. One other important point is that I removed multicolinear columns and also, I am not scaling the variables/normalizing them because I need beta coefficients to have meaning such that I could say, increase in temperature by one unit increases yield by 10g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly correlated, so I would suggest that you have a look at simple non-linear regression techniques, such as Decision Trees or Kernel Ridge Regression. These are however more difficult to interpret.
Going back to your issue, these high weights might well be due to there being some high amount of correlation between the variables, or that you simply don't have very much training data.
If you instead of linear regression use Lasso Regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example on how to do this in scikit-learn, including cross validation of the regularization hyper-parameter:
from sklearn.linear_model LassoCV
# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linear dependent on the features
y = np.sum(np.random.random((1,n_features)) * X, axis=1)
model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X,y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y= target, x= features inputs):
y= x1*b1 +x2*b2 + x3*b3 + x4*b4...+ c
where b1,b2,b3,b4... are your modl.coef_. AS you already realized one of your bigges number is 3.319+02 = 331 and the intercept is also quite big with -5431.
As you already mentioned the coeffiecient variables means how much the target variable changes, if the coeffiecient feature changes with 1 unit and all others features are constant.
so for your interpretation, the higher the absoult coeffienct, the higher the influence of your analysis. But it is important to note that the model is using a lot of high coefficient, that means your model is not depending only of one variable

Is Total Error Mean an adequate performance metric for regression models?

I'm working on a regression model and to evaluate the model performance, my boss thinks that we should use this metric:
Total Absolute Error Mean = mean(y_predicted) / mean(y_true) - 1
Where mean(y_predicted) is the average of all the predictions and mean(y_true) is the average of all the true values.
I have never seen this metric being used in machine learning before and I convinced him to add Mean Absolute Percentage Error as an alternative, yet even though my model is performing better regarding MAPE, some areas underperform when we look at Total Absolute Error Mean.
My gut feeling is that this metric is wrong in displaying the real accuracy, but I can't seem to understand why.
Is Total Absolute Error Mean a valid performance metric? If not, then why? If it is, why would a regression model's accuracy increase in terms of MAPE, but not in terms of Total Absolute Error Mean?
Thank you in advance!
I would kindly suggest to inform your boss that, when one wishes to introduce a new metric, it is on him/her to demonstrate why it is useful on top of the existing ones, not the other way around (i.e. us demonstrating why it is not); BTW, this is exactly the standard procedure when someone really comes up with a new proposed metric in a research paper, like the recent proposal of the Maximal Information Coefficient (MIC).
That said, it is not difficult to demonstrate in practice that this proposed metric is a poor one with some dummy data:
import numpy as np
from sklearn.metrics import mean_squared_error
# your proposed metric:
def taem(y_true, y_pred):
return np.mean(y_true)/np.mean(y_pred)-1
# dummy true data:
y_true = np.array([0,1,2,3,4,5,6])
Now, suppose that we have a really awesome model, which predicts perfectly, i.e. y_pred1 = y_true; in this case both MSE and your proposed TAEM will indeed be 0:
y_pred1 = y_true # PERFECT predictions
mean_squared_error(y_true, y_pred1)
# 0.0
taem(y_true, y_pred1)
# 0.0
So far so good. But let's now consider the output of a really bad model, which predicts high values when it should have predicted low ones, and vice versa; in other words, consider a different set of predictions:
y_pred2 = np.array([6,5,4,3,2,1,0])
which is actually y_pred1 in reverse order. Now, it easy to see that here we will also have a perfect TAEM score:
taem(y_true, y_pred2)
# 0.0
while of course MSE would have warned us that we are very far indeed from perfect predictions:
mean_squared_error(y_true, y_pred2)
# 16.0
Bottom line: Any metric that ignores element-wise differences in favor of only averages suffers from similar limitations, namely taking identical values for any permutation of the predictions, a characteristic which is highly undesirable for a useful performance metric.

Parameter tuning/model selection using resampling

I have been trying to get into more details of resampling methods and implemented them on a small data set of 1000 rows. The data was split into 800 training set and 200 validation set. I used K-fold cross validation and repeated K-fold cross validation to train the KNN using the training set. Based on my understanding I have done some interpretations of the results - however, I have certain doubts about them (see questions below):
Results :
10 Fold Cv
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10 fold with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold, 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
While selecting the parameter, K=9 is the optimal value for highest accuracy. However, I don't understand how to take Kappa into consideration while finally choosing parameter value?
Repeat number has to be increased until we get stabilised result, the accuracy changes when the repeats are increased from 10 to 1000. However,the results are similar for 1000 repeats and 2000 repeats. Will it be right to consider the results of 1000/2000 repeats to be stabilised performance estimate?
Any thumb rule for the repeat number?
Finally,should I train the model on my complete training data (800 rows) now test the accuracy on the validation set ?
Accuracy and Kappa are just different classification performance metrics. In a nutshell, their difference is that Accuracy does not take possible class imbalance into account when calculating the metrics, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R caret you can do so via the train::metric parameter.
You could see a similar effect of slightly different performance results when running e.g. the 10CV with 10 repeats multiple times - you will just get slightly different results for those as well. Something you should look out for is the variance of classification performance over your partitions and repeats. In case you obtain a small variance you can derive that you by training on all your data, you likely obtain a model that will give you similar (hence stable) results on new data. But, in case you obtain a huge variance, you can derive that just by chance (being lucky or unlucky) you might instead obtain a model that either gives you rather good or rather bad performance on new data. BTW: the prediction performance variance is something e.g. R caret::train will give you automatically, hence I'd advice on using it.
See above: look at the variance and increase the repeats until you can e.g. repeat the whole process and obtain a similar average performance and variance of performance.
Yes, CV and resampling methods exist to give you information about how well your model will perform on new data. So, after performing CV and resampling and obtaining this information, you will usually use all your data to train a final model that you use in your e.g. application scenario (this includes both train and test partition!).

Logistic regression training data set true/false ratio

I am working on a classifier, by logistic regression, based on Spark ML.
and I wonder should I train the equal quantity of data for true , false.
I mean
When I want to classify people into male or female,
Is it ok that train a model with 100 male data + 100 female data.
The online people may 40% male and 60% female , but this percent is forcasted based on the past, so it can be change(like 30% female, 70% male)
In this situation.
what female/male percent of data should I train?
is this related with overfitting?
when If I trained a model with 40%female + 60%male, It is useless to classifying a field data composed with 70%female+30%male?
Spark classification sample data has 43 false, 57true.
https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt
what means the true/false ratio of trainig data in logisticregression?
I am really not good at English, but hope you understand me.
It should not matter what ratio you use, as long as it is reasonable.
60:40, 30:70, 50:50, it's okay. Just make sure it's not too lopsided, like 99:1.
If the entire data set is 70:30 female:male, and you want to only use a subset of this dataset, going for a 60:40 female:male ratio will not kill you.
Consider the following example:
Your test data contains 99% males, and 1 % female.
Technically, you can classify all males correctly, ALL females incorrectly, and your algorithm would show an error of 1%. Seems pretty good right? No, because your data is too lopsided.
This low error is not a result of overfitting (high variance), but rather a result of a lopsided data set.
This is an extreme example, but you get the point.

Resources