How to calculate the accuracy of classes from a 7x7 confusion matrix? - machine-learning

So I've got the following results from Naïve Bayes classification on my data set:
However, I am stuck on how to interpret the data. I want to find and compare the accuracy of each class (a-g).
I know accuracy is found using this formula: accuracy = (tp + tn) / (tp + tn + fp + fn).
However, let's take class a. If I take the number of correctly classified instances - 313 - and divide it by the total number of 'a' instances (4953) from row a, this gives ~6.32%. Would this be the accuracy?
EDIT: if we use the column instead of the row, we get 313/1199, which gives ~26.1% and seems a more reasonable number.
EDIT 2: I have done a calculation of the accuracy of a in Excel, using the accuracy formula shown above, which gives me 84%:
This doesn't seem right, as the overall percentage of correctly classified instances is only ~24%.

No -- all you've calculated is tp/(tp+fn), the total correct identifications of class a, divided by the total of actual a examples. This is recall, not accuracy. You need to include the other two figures.
fp is the rest of the a column; tn is all of the other figures in the non-a rows and columns, the 6x6 sub-matrix. This will reduce all 35K+ trials to a 2x2 matrix with labels a and not a, the 2x2 confusion matrix with which you're already familiar.
Yes, you get to repeat that reduction for each of the seven classes. I recommend doing it programmatically, along the lines of the sketch below.
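For example, here is a minimal sketch of that reduction. It assumes the 7x7 matrix is available as a NumPy array cm with rows = actual classes and columns = predicted classes, matching the row/column reading used in the question; the random example matrix is just a stand-in for your real figures.
import numpy as np

def per_class_counts(cm, k):
    """Return (tp, fp, fn, tn) for class index k of a square confusion matrix."""
    cm = np.asarray(cm)
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp      # rest of row k: actual k, predicted something else
    fp = cm[:, k].sum() - tp      # rest of column k: predicted k, actually something else
    tn = cm.sum() - tp - fn - fp  # the 6x6 sub-matrix of everything else
    return tp, fp, fn, tn

cm = np.random.randint(0, 1000, size=(7, 7))   # stand-in for your real 7x7 matrix
for k, label in enumerate("abcdefg"):
    tp, fp, fn, tn = per_class_counts(cm, k)
    print(label, (tp + tn) / (tp + tn + fp + fn))   # per-class accuracy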
RESPONSE TO OP UPDATE
Your accuracy is that high: you have a huge quantity of true negatives, not-a samples that were properly classified as not-a.
Perhaps it doesn't feel right because our experience focuses more on the class in question. There are other statistics that handle that focus.
Recall is tp / (tp+fn) -- of all items actually in class a, what percentage did we properly identify? This is the 6.32% figure.
Precision is tp / (tp + fp) -- of all items identified as class a, what percentage were actually in that class? This is the 26.1% figure you calculated.
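Plugging in the figures quoted above (tp = 313, 4953 actual a's, 1199 predicted a's), the three statistics come apart like this. The grand total of 35,000 is only a placeholder in the ballpark of your "35K+ trials", so the accuracy line is an approximation, not an exact reproduction of your Excel result:
tp = 313
fn = 4953 - tp       # rest of row a (actual a, predicted something else)
fp = 1199 - tp       # rest of column a (predicted a, actually something else)
total = 35000        # placeholder for the real grand total of the matrix
tn = total - tp - fn - fp

recall    = tp / (tp + fn)       # ~0.063
precision = tp / (tp + fp)       # ~0.261
accuracy  = (tp + tn) / total    # ~0.84, dominated by the true negatives
print(recall, precision, accuracy)
With a grand total anywhere in that range, the accuracy lands near the 84% you computed, precisely because tn is so large.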

Related

How to handle weighted average for AUC and selecting the right threshold for building the confusion matrix?

I have a binary classification task, where I fit the model using the XGBClassifier classifier and try to predict '1' and '0' using the test set. In this task I have very unbalanced data: a majority of '0' and a minority of '1' in the training data (and of course the same in the test set). My data looks like this:
        F1    F2    F3    ....  Target
S1      2     4     5     ....  0
S2      2.3   4.3   6.4   ....  1
...     ...   ...   ...   ....  ...
S4000   3     6     7     ....  0
I used the following code to train the model and calculate the roc value:
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

my_cls = XGBClassifier()
X = mydata_train.drop(columns=['target'])
y = mydata_train['target']
x_tst = mydata_test.drop(columns=['target'])
y_tst = mydata_test['target']
my_cls.fit(X, y)
pred = my_cls.predict_proba(x_tst)[:, 1]   # probability of class '1'
auc_score = roc_auc_score(y_tst, pred)
The above code gives me a value for auc_score, but it seems this value is for one class, since I use my_cls.predict_proba(x_tst)[:,1]; if I change it to my_cls.predict_proba(x_tst)[:,0], it gives me another AUC value. My first question is: how can I directly get the weighted average AUC? My second question is: how do I select the right cut point to build the confusion matrix given the unbalanced data? By default the classifier uses 50% as the threshold to build the matrix, but since my data is very unbalanced it seems we need to select the right threshold. I need to count TP and FP, and that's why I need this cut point.
If I use class weights to train the model, does that handle the problem (I mean, can I then use the default 50% cut point)? For example, something like this:
My_clss_weight = len(X) / (2 * np.bincount(y))
Then try to fit the model with this:
my_cls.fit(X, y, class_weight=My_clss_weight)
However, the above call does not work with XGBClassifier and gives me an error. It works with LogisticRegression, but I want to apply it with XGBClassifier! Any ideas on how to handle these issues?
To answer your first question, you can simply pass average='weighted' to the roc_auc_score function.
For example -
roc_auc_score(y_tst, pred, average='weighted')
As for the second half of your question, could you please elaborate a bit? Then I can help you with that.

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar, their words should overlap quite a bit. However, a cosine-theta score of 0.54 does not by itself mean "related" or "not related". I need to end up with a definitive answer, which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I would have to find the optimal parameters to give to such functions, and I do not know whether these functions are satisfactory solutions. I was thinking that, since I have the cosine-theta score, I could also calculate the percentage of overlap between the two articles (e.g. the number of overlapping words divided by the number of words in the article) and maybe some other interesting things. Then, with that data, I could maybe write a function (what type of function I do not know, and that is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# than 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity.
if not denominator:
    return 0.0
else:
    return float(numerator) / denominator
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (i.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high-dimensional, categorical nature of your problem.
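If you do go the labelled route, a minimal sketch of that supervised setup could look like the following. Scikit-learn's SVC is my own choice of classifier here, and the feature values and labels are made up purely for illustration; each row stands for one article pair described by, say, its cosine-theta score and an overlap ratio.
import numpy as np
from sklearn.svm import SVC

# One row per labelled article pair: e.g. [cosine_theta, overlap_ratio]
X = np.array([[0.91, 0.80],
              [0.54, 0.30],
              [0.12, 0.05],
              [0.77, 0.60]])
y = np.array([1, 0, 0, 1])            # 1 = related, 0 = not related (hand labels)

clf = SVC(kernel="rbf").fit(X, y)     # any binary classifier would slot in here
print(clf.predict([[0.60, 0.40]]))    # hard 0/1 decision for a new pair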
Even the optimisation, however, is a subjective measure. For example, suppose you have a model which, out of 100 samples, only makes 1 prediction (giving 99 unknowns). Technically, if that one prediction is correct, it is a model with 100% precision, but very low recall. Generally in machine learning you will find a trade-off between recall and precision.
Some people like to go for metrics which combine the two (the most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision - I would only want to target consumers who are likely to buy my product. If, however, we are looking to test for a deadly disease or markers for bank fraud, then it is feasible for that test to be precise only 10% of the time - as long as its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (either the "related" or the "unrelated" cluster) the point belongs to. Note, however, that here your features feel like they would be incredibly hard to define.
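For that unlabelled route, here is a rough sketch of the two-cluster idea applied to the scores alone. KMeans is just one convenient way to pick the cut-off automatically (a hand-chosen threshold works the same way), and the scores below are invented:
import numpy as np
from sklearn.cluster import KMeans

scores = np.array([0.91, 0.54, 0.12, 0.77, 0.33, 0.08]).reshape(-1, 1)  # invented cosine scores
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
related_cluster = int(np.argmax(km.cluster_centers_))   # cluster with the higher centroid
labels = (km.labels_ == related_cluster).astype(int)    # 1 = "related", 0 = "not related"
print(labels)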

NLP: How to correctly normalise a feature for gender classification?

NOTE: Before I begin, this F-measure is not related to precision and recall; its name and definition are taken from this paper.
I have a feature known as the F-measure, which is used to measure formality in a given text. It is mostly used in gender classification of text which is what I'm working on as a project.
The F-measure is defined as:
F = 0.5 * (noun freq. + adjective freq. + preposition freq. + article freq. - pronoun freq. - verb freq. - adverb freq. - interjection freq. + 100)
where the frequencies are taken from a given text (for example, a blog post).
I would like to normalize this feature for use in a classification task. My first thought was that, since the value of F is bounded by the number of words in the given text (text_length), I could take F and divide it by text_length. Secondly, and finally, since this measure can take on both positive and negative values (as can be inferred from the equation), I then thought of squaring (F/text_length) to get only positive values.
Trying this, I found that the normalised values did not seem right, as I started getting really small values (below 0.10) for all the cases I tested the feature with. I think the reason is that I am squaring the value, which essentially makes it smaller since it's the square of a fraction. However, this is required if I want to guarantee positive values only. I am not sure what else to consider to improve the normalisation so that a nice distribution within [0,1] is produced, and I would like to know if there is some kind of strategy for correctly normalising NLP features.
How should I approach the normalisation of my feature, and what might I be doing wrong?
If you carefully read the article, you'll find that the measure is already normalized:
F will then vary between 0 and 100%
The reason for this is that "frequencies" in the formula are calculated as follows:
The frequencies are here expressed as percentages of the number of words belonging to a particular category with respect to the total number of words in the excerpt.
I.e. you should normalize them by the total number of words (just as you suggested). But afterwards don't forget to multiply each one by 100.
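A small sketch of the whole computation, with made-up category counts standing in for the output of a POS tagger (divide the result by 100 afterwards if you prefer the feature in [0, 1]):
def f_measure(counts, total_words):
    """counts: dict of word-category counts for one text; returns F in [0, 100]."""
    pct = {cat: 100.0 * n / total_words for cat, n in counts.items()}
    return 0.5 * (pct.get("noun", 0) + pct.get("adjective", 0)
                  + pct.get("preposition", 0) + pct.get("article", 0)
                  - pct.get("pronoun", 0) - pct.get("verb", 0)
                  - pct.get("adverb", 0) - pct.get("interjection", 0) + 100)

counts = {"noun": 30, "adjective": 10, "preposition": 12, "article": 8,
          "pronoun": 9, "verb": 18, "adverb": 6, "interjection": 1}   # made-up counts
print(f_measure(counts, total_words=100))   # 63.0, always within [0, 100]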

Understanding Recall and Precision

I am currently learning information retrieval and I am rather stuck on an example of recall and precision.
A searcher uses a search engine to look for information. There are 10 documents on the first screen of results and 10 on the second.
Assume there are known to be 10 relevant documents in the search engine's index.
So there are 20 results altogether, of which 10 are relevant.
Can anyone help me make sense of this?
Thanks
Recall and precision measure the quality of your result. To understand them let's first define the types of results. A document in your returned list can either be
classified correctly
a true positive (TP): a document which is relevant (positive) that was indeed returned (true)
a true negative (TN): a document which is not relevant (negative) that was indeed NOT returned (true)
misclassified
a false positive (FP): a document which is not relevant (negative) but was nonetheless returned (false)
a false negative (FN): a document which is relevant (positive) but was NOT returned (false)
the precision is then:
|TP| / (|TP| + |FP|)
i.e. the fraction of retrieved documents which are indeed relevant
the recall is then:
|TP| / (|TP| + |FN|)
i.e. the fraction of relevant documents which are in your result set
So, in your example 10 out of 20 results are relevant. This gives you a precision of 0.5. If there are no more than these 10 relevant documents, you have got a recall of 1.
(When measuring the performance of an Information Retrieval system it only makes sense to consider both precision and recall. You can trivially get a precision of 100% by returning only documents you are certain are relevant (i.e. no spurious returned instance => no FP), or a recall of 100% by returning every document in the index (i.e. no relevant document missed => no FN).)
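To make that concrete for the numbers in the question, assuming the reading above that 10 of the 20 returned documents are relevant and the index holds 10 relevant documents in total:
returned, relevant_returned, relevant_total = 20, 10, 10

tp = relevant_returned                     # relevant and returned
fp = returned - relevant_returned          # returned but not relevant
fn = relevant_total - relevant_returned    # relevant but never returned

precision = tp / (tp + fp)   # 10 / 20 = 0.5
recall    = tp / (tp + fn)   # 10 / 10 = 1.0
print(precision, recall)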
Well, this is an extension of my answer on recall at: https://stackoverflow.com/a/63120204/6907424. First read about precision here and then go read about recall there. Here I am only explaining precision, using the same example:
ExampleNo    Ground-truth    Model's Prediction
0            Cat             Cat
1            Cat             Dog
2            Cat             Cat
3            Dog             Cat
4            Dog             Dog
For now I am calculating precision for Cat. So Cat is our positive class and the rest of the classes (here only Dog) are the negative classes. Precision means: what percentage of the positive detections were actually positive? So here for Cat there are 3 detections by the model. But are all of them correct? No! Out of them only 2 are correct (in examples 0 and 2) and one is wrong (in example 3). So the percentage of correct detections is 2 out of 3, which is (2 / 3) * 100% = 66.67%.
Now coming to the formulation, here:
TP (True positive): Predicting something positive when it is actually positive. If cat is our positive class, then predicting that something is a cat when it is actually a cat.
FP (False positive): Predicting something as positive but which is not actually positive, i.e, saying something positive "falsely".
Now the number of correct detections of a certain class is the number of TPs of that class. But apart from those, the model also predicted some other examples as positive which were not actually positive, and these are the false positives (FP). So, regardless of correct or wrong, the total number of positive-class detections by the model is TP + FP, and the percentage of correct detections among all detections of that class is TP / (TP + FP), which is the precision for that class.
Like recall, we can also generalize this formula for any number of classes. Just take one class at a time, consider it the positive class and the rest of the classes the negative classes, and repeat the same process for every class to calculate its precision.
You can calculate precision and recall in another way (basically the other way of thinking about the same formulae). Say for Cat, first count how many examples have Cat in both the ground truth and the model's prediction at the same time (i.e., count the number of TPs). Then, if you are calculating precision, divide this count by the number of "Cat"s in the model's prediction; for recall, divide by the number of "Cat"s in the ground truth. This works out the same as the formulae for precision and recall. If you can't see why, think it over for a while and review what TP, FP, TN and FN actually mean.
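That counting view translates directly into code; here is a small sketch for the toy Cat/Dog table above (one pass per class, treating that class as positive):
ground_truth = ["Cat", "Cat", "Cat", "Dog", "Dog"]
predictions  = ["Cat", "Dog", "Cat", "Cat", "Dog"]

for cls in ("Cat", "Dog"):
    tp = sum(g == cls and p == cls for g, p in zip(ground_truth, predictions))
    precision = tp / predictions.count(cls)    # divide by the "Cat"s in the predictions
    recall    = tp / ground_truth.count(cls)   # divide by the "Cat"s in the ground truth
    print(cls, precision, recall)
# Cat: precision = 2/3 (66.67%), recall = 2/3
# Dog: precision = 1/2,          recall = 1/2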
If you have difficulty understanding precision and recall, consider reading this
https://medium.com/seek-product-management/8-out-of-10-brown-cats-6e39a22b65dc

The best way to calculate the best threshold with P. Viola, M. Jones Framework

I'm trying to implement the P. Viola and M. Jones detection framework in C++ (at the beginning, simply a sequence classifier - not the cascaded version). I think I have designed all the required classes and modules (e.g. integral images, Haar features), except one - the most important: the AdaBoost core algorithm.
I have read the P. Viola and M. Jones original paper and many other publications. Unfortunately, I still don't understand how I should find the best threshold for a single weak classifier. I have found only brief references to "weighted median" and "Gaussian distribution" algorithms and many pieces of mathematical formulas...
I have tried to use the OpenCV Train Cascade module sources as a template, but they are so comprehensive that reverse engineering the code is very time-consuming. I also wrote my own simple code to understand the idea of Adaptive Boosting.
The question is: could you explain to me the best way to calculate the best threshold for a single weak classifier?
Below I'm presenting the AdaBoost pseudo code, rewritten from a sample found on Google, but I'm not convinced it's the correct approach. Calculating one weak classifier is very slow (a few hours), and I have doubts especially about the method of calculating the best threshold.
(1) AdaBoost::FindNewWeakClassifier
(2) AdaBoost::CalculateFeatures
(3) AdaBoost::FindBestThreshold
(4) AdaBoost::FindFeatureError
(5) AdaBoost::NormalizeWeights
(6) AdaBoost::FindLowestError
(7) AdaBoost::ClassifyExamples
(8) AdaBoost::UpdateWeights
DESCRIPTION (1)
-Generates all possible arrangements of features in the detection window and puts them into a vector
DO IN LOOP
-Runs main calculating function (2)
END
DESCRIPTION(2)
-Normalizes weights (5)
DO FOR EACH HAAR FEATURE
-Sequentially applies the next feature from the list to all integral images
-Finds the best threshold for each feature (3)
-Finds the error for each feature's best threshold in the current iteration (4)
-Saves the error for each feature's best threshold in the current iteration in an array
-Saves the threshold for each feature in the current iteration in an array
-Saves the threshold sign for each feature in the current iteration in an array
END LOOP
-Finds the classifier index with the lowest error selected by the above loop (6)
-Gets the value of the error from the best feature
-Calculates the value of the best feature on all integral images (7)
-Updates weights (8)
-Adds new, weak classifier to vector
DESCRIPTION (3)
-Calculates an error for each feature threshold on the positive integral images - separately for the "+" and "-" signs (4)
-Returns threshold and sign of the feature with the lowest error
DESCRIPTION(4)
- Returns feature error for all samples, by calculating inequality f(x) * sign < sign * threshold
DESCRIPTION (5)
-Ensures that samples weights are probability distribution
DESCRIPTION (6)
-Finds the classifier with the lowest error
DESCRIPTION (7)
-Calculates the value of the best feature on all integral images
-Counts the number of false positives and false negatives
DESCRIPTION (8)
-Corrects weights, depending on classification results
Thank you for any help
In the original Viola-Jones paper here, section 3.1 Learning Discussion (para 4, to be precise), you will find the procedure for finding the optimal threshold.
I'll sum up the method quickly below.
The optimal threshold for each feature is sample-weight dependent and is therefore recalculated in every iteration of AdaBoost. The best weak classifier's threshold is saved, as mentioned in the pseudo code.
In every round, for each weak classifier, you have to arrange the N training samples according to the feature value. Placing a threshold will separate this sequence into 2 parts. Each part will have either positive or negative samples in the majority, along with a few samples of the other type.
T+ : total sum of positive sample weights
T- : total sum of negative sample weights
S+ : sum of positive sample weights below the threshold
S- : sum of negative sample weights below the threshold
Error for this particular threshold is -
e = MIN((S+) + (T-) - (S-), (S-) + (T+) - (S+))
Why the minimum? Here's an example:
If the samples and threshold is like this -
+ + + + + - - | + + - - - - -
In the first round, if all weights are equal (= w), taking the minimum will give you an error of 4*w instead of 10*w.
You calculate this error for all N possible ways of separating the samples.
The minimum error will give you the range of threshold values. The actual threshold is probably the average of the adjacent feature values (I'm not sure though, do some research on this).
This was the second step in your DO FOR EACH HAAR FEATURE loop.
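For what it's worth, here is a small Python sketch of that search (only a sketch under the assumptions above, not the C++ you are writing; the feature values, labels and weights are invented). It sorts the samples by feature value, keeps running sums S+ and S-, and evaluates e = MIN((S+) + (T-) - (S-), (S-) + (T+) - (S+)) at every split:
import numpy as np

def best_threshold(values, labels, weights):
    """values: feature responses, labels: +1/-1, weights: current AdaBoost weights."""
    order = np.argsort(values)
    values, labels, weights = values[order], labels[order], weights[order]

    t_pos = weights[labels == +1].sum()   # T+ : total positive weight
    t_neg = weights[labels == -1].sum()   # T- : total negative weight

    s_pos = s_neg = 0.0                   # S+ / S- : weight below the current split
    best_error, best_thresh, best_sign = np.inf, None, None
    for i in range(len(values)):
        err_below_neg = s_pos + (t_neg - s_neg)   # treat everything below as negative
        err_below_pos = s_neg + (t_pos - s_pos)   # treat everything below as positive
        err = min(err_below_neg, err_below_pos)
        if err < best_error:
            best_error = err
            best_thresh = values[i]       # or the midpoint with the previous value
            best_sign = +1 if err_below_neg < err_below_pos else -1
        if labels[i] == +1:               # move this sample below the split
            s_pos += weights[i]
        else:
            s_neg += weights[i]
    return best_error, best_thresh, best_sign

vals = np.array([0.2, 0.5, 0.7, 1.1, 1.3, 1.8])   # invented feature values
labs = np.array([-1, -1, +1, -1, +1, +1])         # invented sample labels
w = np.full(len(vals), 1.0 / len(vals))           # uniform starting weights
print(best_threshold(vals, labs, w))              # -> (~0.167, 0.7, 1)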
The cascades given along with OpenCV were created by Rainer Lienhart and I don't know what method he used.
You could closely follow the OpenCV source code for any further improvements to this procedure.
