Micro vs Macro vs Weighted F1 Score [closed]

I have an imbalanced multi-class classification dataset.
I calculated Micro F1, Macro F1 and Weighted F1.
I think Macro is best for judging overall performance on an imbalanced dataset.
But some people say to use Micro if you want to see overall performance, while others say Micro is only worth looking at when the dataset is imbalanced.
Why is Micro used on imbalanced datasets?
When do I use Micro, Macro, and Weighted?
In other words, under what circumstances is each of these averages used?

Firstly, see this answer.
Imbalanced data is always a big problem to deal with. Below is an example of binary classification on imbalanced data: the overall accuracy looks great, but when you look at the individual class scores you can see it is a big failure. For this kind of data I always check the minority class's scores before drawing a conclusion. You can also consider data augmentation; there are good libraries for dealing with imbalanced data in Python.
Finally, the micro average is computed from the pooled individual true positives, false positives and false negatives over all classes, rather than from per-class precision and recall. Because it weights every sample equally instead of every class equally, the micro average reflects the overall accuracy on imbalanced data.
Note:
Here is the explanation from the sklearn documentation:
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
If you look at the macro definition, it says: "This does not take label imbalance into account." So it is better to use micro if you have imbalanced data. Source
# Confusion Matrix:
[[3808    0]
 [ 182    2]]

              precision    recall  f1-score   support

           0       0.95      1.00      0.98      3808
           1       1.00      0.01      0.02       184

    accuracy                           0.95      3992
   macro avg       0.98      0.51      0.50      3992
weighted avg       0.96      0.95      0.93      3992
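To make the three averages concrete, here is a minimal sketch using scikit-learn's f1_score on a made-up imbalanced label vector (the numbers are illustrative, not taken from the report above):

from sklearn.metrics import f1_score, accuracy_score

# Hypothetical ground truth and predictions: class 0 dominates, class 1 is rare
# and almost never predicted correctly.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [0] * 4 + [1]

print("micro   :", f1_score(y_true, y_pred, average="micro"))     # pooled TP/FP/FN over classes
print("macro   :", f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-class F1
print("weighted:", f1_score(y_true, y_pred, average="weighted"))  # per-class F1 weighted by support
print("accuracy:", accuracy_score(y_true, y_pred))                # equals the micro average here

Macro drops sharply because the rare class's F1 is poor, while micro stays close to the optimistic-looking accuracy.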

Related

When do micro- and macro-averages differ a lot? [closed]

I am learning machine learning theory. I have a confusion matrix from a multiclass Logistic Regression prediction.
I have calculated the micro and macro averages (precision & recall).
The values are quite different, and I wonder which factors influence this. Under which conditions do the micro and macro averages differ a lot?
What I noticed is that the prediction accuracy differs between the classes. Is this the reason? What other factors can cause this?
The sample confusion matrix (image not reproduced here):
And my calculated micro-macro-averages:
precision-micro = ~0.7329
recall-micro = ~0.7329
precision-macro = ~0.5910
recall-macro = ~0.6795
The difference between micro and macro averages becomes apparent in imbalanced datasets.
The micro average is a global strategy that basically ignores that there is a distinction between classes. It is calculated by counting the total true positives, false negatives and false positives over all classes.
In classification tasks where the underlying problem is not multilabel, the micro average actually equals the accuracy score. Note that your micro precision and recall are equal; compute the accuracy score and compare, and you will see no difference.
In case of macro average, the precision and recall are calculated for each label separately and reported as their unweighted mean. Depending on how your classifier performs on each class, this can heavily influence the result.
You can also refer to this answer of mine, where it has been addressed in a bit more detail.
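As a rough illustration of the point above (made-up labels, not the asker's data): the micro precision and recall collapse to the accuracy, while the macro averages are pulled down by the poorly predicted minority classes.

from sklearn.metrics import precision_score, recall_score, accuracy_score

# Imbalanced three-class toy problem: the minority classes are mostly misclassified.
y_true = [0] * 80 + [1] * 15 + [2] * 5
y_pred = [0] * 80 + [0] * 10 + [1] * 5 + [0] * 5

print(precision_score(y_true, y_pred, average="micro"))                   # 0.85, same as accuracy
print(recall_score(y_true, y_pred, average="micro"))                      # 0.85, same as accuracy
print(accuracy_score(y_true, y_pred))                                     # 0.85
print(precision_score(y_true, y_pred, average="macro", zero_division=0))  # much lower
print(recall_score(y_true, y_pred, average="macro"))                      # much lower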

How do you calculate accuracy metric for a regression problem with multiple outputs? [closed]

My CNN (Conv1D) in PyTorch has 20 inputs and 6 outputs. A predicted output is considered "accurate" only if all 6 of them match, right? So, unless all my predicted values are accurate to the 8th decimal place, will I ever be able to get decent accuracy?
The standard accuracy metric is used for classification tasks. In order to use accuracy you have to say whether an output is one of the following: true positive (TP), true negative (TN), false positive (FP), or false negative (FN).
These classification metrics can be used to a certain extent in regression tasks, when you can apply these labels (TP, TN, FP, FN) to the outputs, for example via a simple threshold. This heavily depends on the kind of problem you are dealing with and may or may not be possible or useful.
As Andrey said, in general you want to use metrics like the mean absolute error (MAE) or the mean squared error (MSE). These metrics can be hard to interpret, though. I would suggest looking at papers that tackle a similar problem and seeing which metrics they use to evaluate their results and compare themselves to other work.
Accuracy isn't a suitable metric for regression tasks. For regression you should use metrics such as MAE, RMSE, and so on.
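For completeness, here is a minimal sketch of these metrics in PyTorch, with made-up prediction and target tensors of shape (batch, 6); the tolerance value is an arbitrary assumption, not a standard.

import torch

preds = torch.tensor([[0.9, 1.2, 3.1, 0.0, 2.2, 5.1],
                      [1.1, 0.8, 2.9, 0.2, 1.9, 4.8]])
targets = torch.tensor([[1.0, 1.0, 3.0, 0.0, 2.0, 5.0],
                        [1.0, 1.0, 3.0, 0.0, 2.0, 5.0]])

mae = torch.mean(torch.abs(preds - targets))   # mean absolute error
mse = torch.mean((preds - targets) ** 2)       # mean squared error
rmse = torch.sqrt(mse)                         # root mean squared error

# A tolerance-based "accuracy": fraction of samples whose 6 outputs are all
# within some threshold of the targets (the threshold is a modelling choice).
tol = 0.3
within_tol = (torch.abs(preds - targets) <= tol).all(dim=1).float().mean()
print(mae.item(), rmse.item(), within_tol.item())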

Deep learning: parameter selection and result interpretation [closed]

I have a multiclass (7-label) classification problem implemented with an MLP. I am trying to classify 7 cancers based on some data. The overall accuracy is quite low, around 58%. However, some of the cancers are classified with around 90% accuracy for certain parameters. Below, cancer 1, 2, 3, etc. means different types of cancer, for example 1 = breast cancer, 2 = lung cancer, etc. For different parameter settings I get different classification accuracy. For example,
1. hyper parameters
learning_rate = 0.001
training_epochs = 10
batch_size = 100
hidden_size = 256
#overall accuracy 53%, cancer 2 accuracy 91%, cancer 5 accuracy 88%,
#cancer 6 accuracy 89%
2. hyper parameters
learning_rate = 0.01
training_epochs = 30
batch_size = 100
hidden_size = 128
#overall accuracy 56%, cancer 2 accuracy 86%, cancer 5 accuracy 93%,
#cancer 6 accuracy 75%
As you can see, I get totally different results for different parameter settings. Cancers 1, 3, 4 and 7 have very low accuracy, so I excluded them, but cancers 2, 5 and 6 have comparatively better results. However, for cancer 6 the results vary greatly depending on the parameter settings.
An important note: the overall accuracy is not important here; what matters more is whether I can classify 2-3 cancers with more than 90% accuracy. So my question is, how do I interpret the results? How should I present the results in my paper? Which parameter settings should I show/use? Or should I show different parameter settings for different cancer types? Basically, how do I handle this type of situation?
Data Imbalance?
The first question you'll have to ask yourself is: do you have a balanced dataset, or do you have data imbalance? By this I mean, how many instances of each class do you have in your training and test datasets?
Suppose, for example, that 90% of all the instances in your dataset are cancer 2, and the remaining 10% is spread out over the other classes. Then you can very easily get 90% accuracy by implementing a very dumb classifier that simply classifies everything as cancer 2. This is probably not what you want out of your classifier though.
Interpretation of Results
I'd recommend reporting confusion matrices instead of just raw accuracy numbers. This will provide some information about which classes get confused for which other classes by the classifier, which may be interesting (e.g. different types of cancer may be similar to some extent if they often get confused for each other). Especially if you have data imbalance, I'd also recommend reporting other metrics such as Precision and/or Recall, instead of Accuracy.
Which parameter settings to use/show?
This depends on what problem you're really trying to solve. Is correct detection of every class equally important? If so, overall accuracy is probably the most important metric. Are certain classes more important to accurately detect than others? In this case you may want to look into "cost-sensitive classification", where different classification mistakes have different costs. If you simply don't know (don't have this domain knowledge), I'd recommend reporting as many different settings and metrics as you realistically can.
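As a concrete (hypothetical) example of the reporting suggested above, scikit-learn's confusion_matrix and classification_report give per-class numbers instead of a single accuracy figure; the label vectors here are placeholders, not the real cancer data.

from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 2, 2, 3, 5, 5, 6, 6, 7, 2]   # placeholder ground-truth cancer types
y_pred = [2, 2, 2, 5, 5, 5, 6, 2, 7, 2]   # placeholder model predictions

print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5, 6, 7]))
# Per-class precision, recall, F1 and support; far more informative than
# overall accuracy when the classes are imbalanced.
print(classification_report(y_true, y_pred, labels=[1, 2, 3, 4, 5, 6, 7], zero_division=0))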

what is f1-score and what its value indicates? [closed]

There is an evaluation metric in sklearn called the f1-score (an f-beta score also exists).
I know how to use it, but I don't quite understand what it stands for.
What does it indicate when it is big or small?
Putting the formula aside, what should I understand from an f-score value?
The F-score is a simple formula that combines the precision and recall scores. Imagine you want to predict labels for a binary classification task (positive or negative). You have 4 types of predictions:
true positive: correctly assigned as positive.
true negative: correctly assigned as negative.
false positive: wrongly assigned as positive.
false negative: wrongly assigned as negative.
Precision is the proportion of true positives among all positive predictions. A precision of 1 means that you have no false positives, which is good because you never say that an element is positive when it is not.
Recall is the proportion of true positives among all actual positive elements. A recall of 1 means that you have no false negatives, which is good because you never say an element belongs to the opposite class when it actually belongs to your class.
If you want to know whether your predictions are good, you need both measures. You can have a precision of 1 (so when you say it's positive, it actually is) but still have a very low recall (you predicted 3 true positives but missed 15 others). Or you can have a good recall and a bad precision.
This is why you might check the f1-score, or any other type of f-score. If either of these two values decreases dramatically, the f-score does too. But be aware that in many problems we prefer giving more weight to precision or to recall (in web security, it is better to wrongly block some good requests than to let some bad ones through).
The f1-score is one of the most popular performance metrics, and as far as I recall it is the one implemented in sklearn.
In essence, the f1-score is the harmonic mean of precision and recall. Since building a classifier always involves a compromise between recall and precision, it is hard to compare a model with high recall and low precision against one with high precision but low recall. The f1-score is a measure we can use to compare two models.
This is not to say that a model with a higher f1-score is always better; that can depend on your specific case.
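A small worked example of the "precision 1 but low recall" situation described above (3 positives found, 15 missed), computed both by hand and with scikit-learn:

# By hand, from the counts of the confusion-matrix cells.
tp, fp, fn = 3, 0, 15
precision = tp / (tp + fp)                           # 1.0   -> no false positives
recall = tp / (tp + fn)                              # ~0.167 -> many positives missed
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean, ~0.286
print(precision, recall, f1)

# The same value from scikit-learn, given full label vectors.
from sklearn.metrics import f1_score
y_true = [1] * 18 + [0] * 10
y_pred = [1] * 3 + [0] * 15 + [0] * 10
print(f1_score(y_true, y_pred))                      # ~0.286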

How to deal with small AND unbalanced datasets for machine learning classification problems [closed]

I am dealing with a very challenging classification problem with three issues: a small dataset (about 800 samples), an unbalanced dataset (4 classes, where class 1 has 600 samples and classes 2/3/4 have 50 samples each), and missing data in one of the features.
Some things that I have been considering:
Generate synthetic data, for example using SMOTE (Synthetic Minority Over-sampling Technique).
Turn the classification into a binary classification between the minority and majority.
Combine different classifiers, giving more weight to negative samples (in case I turn it into a binary classifier).
Cost-sensitive learning, by applying specific weights in the cost function (similar to the previous item, but using all 4 classes).
I intend to use Naive Bayes, SVM, Random Forests and Neural Networks as classifiers, with 2-fold cross-validation. Later I might move to 5 or 10 folds.
Some characteristics of the features:
5 continuous features, where 3 of them are just different properties based on graph location (min, max and distribution), and some of them have very low variance and repeated values.
2 binary features, one of which has missing data.
Snippet of the data:
Y X1 X2_min X2_max X2_distribution X3 X4 X5
3 6 1 11 3.3058739 0 1 1
3 662 1 11 1.7779095 1 15 1
1 6 1 7 3.060274 0 1 1
3 8 1 6 2.9697127 0 1 1
3 82 1 14 3.0341356 0 1 1
2 39 1 7 4.2189913 0 1 1
4 1 3 14 4.6185904 1 1
I would very much appreciate any second thoughts.
I would recommend either giving more weight to the smaller classes or duplicating their data. One way is to add random noise to the instances of a smaller class while duplicating them; the variance of the noise can be estimated from the variance of the features within each class. A rough sketch follows below.
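A rough NumPy sketch of that noise-based duplication idea, with made-up minority-class rows (the noise scale of 0.1 times the per-feature standard deviation is just an assumption to tune):

import numpy as np

rng = np.random.default_rng(0)
X_min = np.array([[1.0, 11.0, 3.3],
                  [1.0, 6.0, 2.9],
                  [1.0, 7.0, 3.1]])          # toy minority-class samples

n_copies = 5
noise_scale = 0.1 * X_min.std(axis=0)        # noise magnitude estimated per feature
X_aug = np.vstack([X_min + rng.normal(0.0, noise_scale, size=X_min.shape)
                   for _ in range(n_copies)])
print(X_aug.shape)                           # (15, 3): jittered duplicates of the minority class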
A small dataset isn't a problem if the samples are the most representative examples (currently advances are even being made in applying deep learning techniques to small datasets). How can you tell whether your dataset is representative? It requires proper sampling techniques, such as stratified sampling instead of, say, random sampling.
To tackle unbalanced datasets there are various techniques: undersampling (not applicable in your case because of the small dataset), oversampling (can work, but with a risk of overfitting), and cost-sensitive learning (see the Vowpal Wabbit toolkit for an implementation).
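And a minimal sketch of the SMOTE option from the question, using the imbalanced-learn library on synthetic stand-in data (the real features would need their missing values imputed first, since SMOTE's k-nearest-neighbour step cannot handle NaNs):

import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(750, 7))                           # stand-in for the 7 features
y = np.array([1] * 600 + [2] * 50 + [3] * 50 + [4] * 50)

# k_neighbors must stay below the size of the smallest class (50 here).
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))                       # minority classes oversampled to 600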
