I have three classes. Suppose the first class has 30 elements, the second 30, and the third 1000.
An algorithm produced predictions, and the following confusion matrix was obtained (rows are predictions, columns are true labels):
[[  1   0   10]
 [ 29   2   10]
 [  0  28  980]]
From this matrix it can be seen that the third class is classified well, while the other two classes are almost always misclassified.
The result is the following precision and recall:
Precision:
micro: 0.927
macro: 0.371
Recall:
micro: 0.927
macro: 0.360
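For reference, these numbers follow directly from the matrix above (per-class precision uses row sums, per-class recall uses column sums):
micro precision = micro recall = (1 + 2 + 980) / 1060 ≈ 0.927
macro precision = mean(1/11, 2/41, 980/1008) ≈ 0.371
macro recall = mean(1/30, 2/30, 980/1000) = 0.360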
The official documentation and many articles and questions (for example, here) say that it is better to use micro-averaging when classes are imbalanced. Intuitively, though, micro seems to show overly good values here, even though two of the classes are practically never classified correctly.
The micro-precision/recall are not "better" for imbalanced classes.
In fact, if you look at the results, it is clear that macro precision/recall take very small values when you have bad predictions on an imbalanced dataset (poor results on the less well-represented labels).
Micro-precision, however, does take the number of elements per class into account when it is computed.
From sklearn's documentation on micro and macro f1-score, for example (the same holds for precision and recall):
'micro':
Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
So macro actually penalises you when you have poor results in a label which is not well represented.
The micro-average, on the other hand, does not do that, as it computes the metrics globally.
For example, if you have many samples in class 0 and most of those predictions are correct, while class 1 has few samples and many bad predictions, the micro precision/recall can still come out high, whereas a macro metric (precision/recall/f1-score) penalises the poor results on that specific label and yields a small number.
Now it really depends on what you are interested in. If you want good results globally and you do not care about the distribution of labels, a micro metric could be suitable.
However, we usually do care about the results on the less well-represented classes in our datasets, hence the usefulness of a macro metric over a micro metric.
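To make the difference concrete, here is a minimal sketch with scikit-learn; the label arrays are simply rebuilt from the confusion matrix in the question (rows = predictions, columns = true labels):

from sklearn.metrics import precision_score, recall_score

# (predicted, true) -> count, taken from the confusion matrix above
counts = {
    (0, 0): 1,  (0, 2): 10,
    (1, 0): 29, (1, 1): 2,  (1, 2): 10,
    (2, 1): 28, (2, 2): 980,
}
y_pred, y_true = [], []
for (p, t), n in counts.items():
    y_pred += [p] * n
    y_true += [t] * n

print(precision_score(y_true, y_pred, average='micro'))  # ~0.927
print(precision_score(y_true, y_pred, average='macro'))  # ~0.371
print(recall_score(y_true, y_pred, average='micro'))     # ~0.927
print(recall_score(y_true, y_pred, average='macro'))     # ~0.360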
I am learning Machine Learning theory. I have a confusion matrix of a prediction using a Logistic Regression with multiple classes.
Now I have calculated the micro and macro averages (precision & recall).
The values are quite different, and I wonder which factors influence this. Under which conditions do the micro and macro averages differ so much?
What I noticed is that the per-class accuracies of the predictions differ. Is this the reason, or what other factors can cause this?
The sample confusion matrix:
And my calculated micro/macro averages:
precision-micro = ~0.7329
recall-micro = ~0.7329
precision-macro = ~0.5910
recall-macro = ~0.6795
The difference between micro and macro averages becomes apparent in imbalanced datasets.
The micro average is a global strategy that basically ignores that there is a distinction between classes. It is calculated by counting the total true positives, false negatives and false positives over all classes.
In classification tasks where the underlying problem is not a multilabel classification, the micro average actually equals the accuracy score. Notice that your micro precision and recall are equal; compute the accuracy score and compare, and you will see no difference.
In case of macro average, the precision and recall are calculated for each label separately and reported as their unweighted mean. Depending on how your classifier performs on each class, this can heavily influence the result.
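As a small illustration of both points, here is a sketch with made-up labels (any toy single-label multiclass example will do):

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy single-label multiclass predictions, purely for illustration
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print(accuracy_score(y_true, y_pred))                    # 0.7
print(precision_score(y_true, y_pred, average='micro'))  # 0.7, equals accuracy
print(recall_score(y_true, y_pred, average='micro'))     # 0.7, equals accuracy
print(precision_score(y_true, y_pred, average='macro'))  # ~0.667, unweighted mean of per-class precisions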
You can also refer to this answer of mine, where it has been addressed in a bit more detail.
I've recently been working on CNNs and I want to know: what is the function of the temperature in the softmax formula? And why should we use high temperatures to obtain a softer probability distribution?
One reason to use the temperature is to change the output distribution computed by your neural net. It is applied to the logits vector according to this equation:
q_i = exp(z_i / T) / sum_j exp(z_j / T)
where T is the temperature parameter.
You see, what this does is change the final probabilities. You can choose T to be any positive value (the higher T is, the 'softer' the distribution will be; if it is 1, the output distribution will be the same as your normal softmax outputs). What I mean by 'softer' is that the model will basically be less confident about its prediction. As T gets closer to 0, the 'harder' the distribution gets.
a) Sample 'hard' softmax probs : [0.01,0.01,0.98]
b) Sample 'soft' softmax probs : [0.2,0.2,0.6]
'a' is a 'harder' distribution: the model is very confident about its predictions. However, in many cases you don't want your model to do that. For example, if you are using an RNN to generate text, you are basically sampling from the output distribution and choosing the sampled word as your output token (and next input). If your model is extremely confident, it may produce very repetitive and uninteresting text, because most of the probability mass is concentrated in a few tokens and the model keeps selecting the same small set of words over and over again. To give other words a chance of being sampled as well, you can plug in the temperature variable and produce more diverse text.
With regards to why higher temperatures lead to softer distributions, that has to do with the exponential function. The temperature parameter penalizes bigger logits more than smaller ones: because the exponential is an increasing function, dividing a large logit by T shrinks its exponential by a larger percentage than it shrinks the exponential of a small logit.
Here's what I mean,
exp(6) ~ 403
exp(3) ~ 20
Now let's 'penalize' these terms with a temperature of, say, 1.5:
exp(6/1.5) ~ 54
exp(3/1.5) ~ 7.4
You can see that in % terms, the bigger the term is, the more it shrinks when the temperature is used to penalize it. When the bigger logits shrink more than your smaller logits, more probability mass (to be computed by the softmax) will be assigned to the smaller logits.
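A minimal numpy sketch of this (the function name and example logits are made up purely for illustration):

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Divide the logits by T, then apply a numerically stable softmax
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = [1.0, 3.0, 6.0]
print(softmax_with_temperature(logits, T=1.0))   # sharp: most mass on the largest logit
print(softmax_with_temperature(logits, T=1.5))   # softer
print(softmax_with_temperature(logits, T=10.0))  # close to uniform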
I built this ML model in Azure ML studio with 4 features including a date column.
I'm trying to predict whether the price is going to be higher tomorrow than it is today. Higher = 1, not higher = 0.
It is a Two-Class Neural Network (with a Tune Model Hyperparameters module).
When I test it I expect to get an answer between 0 and 1, which I do. The problem comes when I change the feature from 1 to 0 and get almost the same answer.
I thought that if a 1 gives a score probability of 0.6,
then a 0 (with the same features) should give a score of 0.4.
A snapshot of the data (yes I know I need more)
The model is trained/tuned on the "Over5" feature, and I hope to get an answer from the Two-Class Neural Network module in the range between 0 and 1.
The Score module also produces results between 0 and 1. Everything looks to be correct.
I changed the normalization method (after a recommendation from a commenter), but it does not change the output much.
Everything seems to be in order, but my goal is to get a prediction of the likelihood that a day will finish "Over5" and result in a 1.
When I test the model using a "1" in the Over5 column I get a prediction of 0.55... then I tested the model with the same settings, only changing the 1 to a 0, and I still get the same output, 0.55...
I do not understand why this is, since the model is trained/tuned on the Over5 feature. Shouldn't changing the input in that column produce different results?
Outputs of a neural network are not probabilities (generally), so that could be a reason that you're not getting the "1 - P" result you're looking for.
Now, if it's simple logistic regression, you'd get probabilities as output, but I'm assuming what you said is true and you're using a super-simple neural network.
Also, what you may be changing is the bias "feature", which could also lead to the model giving you the same result after training. Honestly, there's too little information in this post to say for certain what's going on. I'd advise you to try normalizing your features and trying again.
EDIT: Do you know if your neural network actually has 2 output nodes, or if it's just one output node? If there are two, then the raw output doesn't matter quite as much as which node had the higher output. If it's just one, I'd look into thresholding it somewhere (like >0.5 means the price will rise, but <=0.5 means the price will fall, or however you want to threshold it.) Some systems used in applications where false positives are more acceptable than false negatives threshold at much lower values, like 0.2.
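As a tiny sketch of the single-output thresholding idea (plain Python, not the Azure ML module itself; the threshold values are just examples):

def classify(score, threshold=0.5):
    # Treat the single raw output as a confidence that the price rises and
    # threshold it; lower thresholds trade false negatives for false positives.
    return 1 if score > threshold else 0

print(classify(0.55))                 # 1 -> predicted "higher tomorrow"
print(classify(0.55, threshold=0.6))  # 0 -> below the stricter threshold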
I have a multiclass (7 labels) classification problem implemented with an MLP. I am trying to classify 7 cancers based on some data. The overall accuracy is quite low, around 58%. However, some of the cancers are classified with around 90% accuracy for different parameters. Below, cancer 1, 2, 3, etc. means different types of cancer; for example, 1 = breast cancer, 2 = lung cancer, etc. For different parameter settings I get different classification accuracy. For example:
1. hyper parameters
learning_rate = 0.001
training_epochs = 10
batch_size = 100
hidden_size = 256
#overall accuracy 53%, cancer 2 accuracy 91%, cancer 5 accuracy 88%,
#cancer 6 accuracy 89%
2. hyper parameters
learning_rate = 0.01
training_epochs = 30
batch_size = 100
hidden_size = 128
#overall accuracy 56%, cancer 2 accuracy 86%, cancer 5 accuracy 93%,
#cancer 6 accuracy 75%
As you can see, for different parameter settings I am getting totally different results. Cancers 1, 3, 4 and 7 have very low accuracy, so I excluded them, but cancers 2, 5 and 6 have comparatively better results. For cancer 6, however, the results vary greatly depending on the parameter settings.
An important note: here the overall accuracy is not important, but being able to classify 2-3 cancers with more than 90% accuracy is. So my question is, how do I interpret the results? In my paper, how should I show the results? Which parameter settings should I show/use? Or should I show different parameter settings for different cancer types? Basically, how should I handle this type of situation?
Data Imbalance?
The first question you'll have to ask yourself is, do you have a balanced dataset, or do you have data imbalance? With this I mean, how many instances of each class do you have in your training and test datasets?
Suppose, for example, that 90% of all the instances in your dataset are cancer 2, and the remaining 10% is spread out over the other classes. Then you can very easily get 90% accuracy by implementing a very dumb classifier that simply classifies everything as cancer 2. This is probably not what you want from your classifier, though.
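A quick way to check that baseline is scikit-learn's DummyClassifier; the 90/10 split below is made up to mirror the example:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy imbalanced labels: 90% of the samples are "cancer 2"
y = np.array([2] * 90 + [1] * 5 + [5] * 5)
X = np.zeros((len(y), 1))  # features are irrelevant for this dumb baseline

dumb = DummyClassifier(strategy='most_frequent').fit(X, y)
print(accuracy_score(y, dumb.predict(X)))  # 0.9, even though classes 1 and 5 are never predicted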
Interpretation of Results
I'd recommend reporting confusion matrices instead of just raw accuracy numbers. This will provide some information about which classes get confused for which other classes by the classifier, which may be interesting (e.g. different types of cancer may be similar to some extent if they often get confused for each other). Especially if you have data imbalance, I'd also recommend reporting other metrics such as Precision and/or Recall, instead of Accuracy.
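For example, scikit-learn gives you both in a couple of lines (the labels below are invented, only to show the output format):

from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true/predicted cancer-type labels
y_true = [1, 2, 2, 2, 3, 3, 5, 5, 6, 6]
y_pred = [2, 2, 2, 2, 2, 3, 5, 6, 6, 6]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, zero_division=0))  # per-class precision/recall/f1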
Which parameter settings to use/show?
This depends on what problem you're really trying to solve. Is correct detection of every class equally important? If so, overall accuracy is probably the most important metric. Are certain classes more important to accurately detect than others? In this case you may want to look into "cost-sensitive classification", where different classification mistakes have different costs. If you simply don't know (don't have this domain knowledge), I'd recommend reporting as many different settings and metrics as you realistically can.
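If you do go the cost-sensitive route, one simple approximation is per-class weighting. A hypothetical sketch with scikit-learn's class_weight (the weights and the choice of LogisticRegression are illustrative, not tuned):

from sklearn.linear_model import LogisticRegression

# Mistakes on the classes you care about (say, cancers 2, 5 and 6) are made
# more expensive during fitting by giving those classes a larger weight.
clf = LogisticRegression(
    class_weight={1: 1, 2: 5, 3: 1, 4: 1, 5: 5, 6: 5, 7: 1},
    max_iter=1000,
)
# clf.fit(X_train, y_train)  # X_train / y_train: your own data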
There is an evaluation metric in sklearn, the f1-score (an f-beta score also exists).
I know how to use it, but I could not quite understand what it stands for.
What does it indicate when it is big or small?
Putting the formula aside, what should I understand from an f-score value?
The F-score is a simple formula that combines the scores of precision and recall. Imagine you want to predict labels for a binary classification task (positive or negative). You have 4 types of predictions:
true positive: correctly assigned as positive.
true negative: correctly assigned as negative.
false positive: wrongly assigned as positive.
false negative: wrongly assigned as negative.
Precision is the proportion of true positives among all positive predictions. A precision of 1 means that you have no false positives, which is good because you never say that an element is positive when it is not.
Recall is the proportion of true positives among all actually positive elements. A recall of 1 means that you have no false negatives, which is good because you never say that an element belongs to the opposite class when it actually belongs to your class.
If you want to know whether your predictions are good, you need both measures. You can have a precision of 1 (so when you say it's positive, it actually is positive) but still have a very low recall (you predicted 3 true positives but missed 15 others). Or you can have a good recall and a bad precision.
This is why you might check the f1-score, but also any other type of f-score. If one of these two values decreases dramatically, the f-score does too. But be aware that in many problems we prefer giving more weight to precision or to recall (in web security, it is better to wrongly block some good requests than to let some bad ones through).
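A small sketch of that weighting with scikit-learn's fbeta_score (toy binary labels, chosen only to show the effect of beta):

from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))        # 0.75
print(recall_score(y_true, y_pred))           # 0.6
print(f1_score(y_true, y_pred))               # ~0.667, harmonic mean of the two
print(fbeta_score(y_true, y_pred, beta=2))    # ~0.625, beta > 1 favours recall
print(fbeta_score(y_true, y_pred, beta=0.5))  # ~0.714, beta < 1 favours precision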
The f1-score is one of the most popular performance metrics. From what I recall this is the metric present in sklearn.
In essence, the f1-score is the harmonic mean of precision and recall. Since creating a classifier always involves a compromise between recall and precision, it is hard to compare a model with high recall and low precision against a model with high precision but low recall. The f1-score is a measure we can use to compare two models.
This is not to say that a model with higher f1 score is always better - this could depend on your specific case.