Approach to interpret the following confusion matrix - machine-learning

I know that from the confusion matrix, we can figure out how good a classifier is in terms of guessing what is right and wrong.
In the case below, I have a sample of the following data:
After running the Random Tree classifier, I get the following results.
Does that mean that out of the build wind float, the classifier was only able to get 53/70 correct?
Or in the case of the build wind non float, the classifier was only able to get 53/76 correct?
Just need some clarity - thanks.

Yes it does. While the columns represent "classified as", the rows indicate the true label.
So for build wind float the confusion matrix can be read as:
From all the samples we have labeled with class a:
53 were classified as a (true positives here)
11 were classified as b
6 were classified as c
...
So you find the correct guesses on the diagonal of the matrix, and for the rest you can see which classes were assigned instead.
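As a small sketch of the reading described above (rows are true labels, columns are "classified as"), the per-class figures can be computed from the matrix directly. Only the first row (53, 11, 6) comes from the answer; the other two rows and the three-class shape are made up for illustration.

```python
# Rows = true label, columns = "classified as".
labels = ["a", "b", "c"]
cm = [
    [53, 11, 6],   # true a: values quoted in the answer
    [10, 50, 16],  # true b: hypothetical values
    [2, 3, 12],    # true c: hypothetical values
]

def recall_per_class(cm):
    """Diagonal entry divided by its row sum: share of each true class found."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def precision_per_class(cm):
    """Diagonal entry divided by its column sum: share of each prediction that was right."""
    return [cm[i][i] / sum(row[i] for row in cm) for i in range(len(cm))]

recall = recall_per_class(cm)
precision = precision_per_class(cm)
for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.3f} precision={p:.3f}")
```

Dividing by the row sum answers "how many of the true a samples were found" (53/70 here), while dividing by the column sum answers "how many of the samples classified as a really were a".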

Related

How to handle weighted average for AUC and selecting the right threshold for building the confusion matrix?

I have a binary classification task, where I fit the model using XGBClassifier and try to predict '1' and '0' on the test set. The training data is very unbalanced, with '0' as the majority class and '1' as the minority class (of course the same holds for the test set). My data looks like this:
       F1    F2    F3    ….   Target
S1     2     4     5     ….   0
S2     2.3   4.3   6.4        1
…      …     …     …     ….   ..
S4000  3     6     7          0
I used the following code to train the model and calculate the roc value:
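from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

my_cls = XGBClassifier()
X = mydata_train.drop(columns=['target'])
y = mydata_train['target']
x_tst = mydata_test.drop(columns=['target'])
y_tst = mydata_test['target']
my_cls.fit(X, y)
pred = my_cls.predict_proba(x_tst)[:, 1]  # predicted probability of class 1
auc_score = roc_auc_score(y_tst, pred)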
The code above gives me a value for auc_score, but that value seems to be for one class only, because it uses my_cls.predict_proba(x_tst)[:,1]. If I change it to my_cls.predict_proba(x_tst)[:,0], I get a different AUC value. My first question: how can I directly get the weighted average AUC? My second question: how do I select the right cut point for building the confusion matrix when the data is unbalanced? By default the classifier uses 50% as the threshold, but since my data is very unbalanced it seems a different threshold is needed. I need to count TP and FP, which is why I need this cut point.
If I train the model with class weights, does that solve the problem (that is, can I then keep the default 50% cut point)? For example, something like this:
My_clss_weight = len(X) / (2 * np.bincount(y))
and then fit the model with:
my_cls.fit(X, y, class_weight=My_clss_weight)
However, this call does not work with XGBClassifier and raises an error. It works with LogisticRegression, but I want to use XGBClassifier. Any ideas on how to handle these issues?
To answer your first question, you can use the average parameter of the roc_auc_score function.
For example:
roc_auc_score(y_tst, pred, average='weighted')
Note that for a binary target this returns the same value as the default, since averaging only applies across multiple classes or labels.
As for the second half of your question, could you elaborate a bit? I can help you with that.
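For the two open points in the question, here is a dependency-free sketch rather than the library's own implementation. It shows that scoring with the class-0 column simply gives 1 minus the AUC of the class-1 column, and it picks a cut point by maximising TPR minus FPR (Youden's J), which is one common heuristic among several. The labels and probabilities are invented; in practice sklearn's roc_curve returns the candidate thresholds, and XGBoost's documented setting for imbalance is the scale_pos_weight constructor parameter rather than a class_weight argument to fit.

```python
def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_at(y_true, scores, threshold):
    """Return (TP, FP, FN, TN) when predicting 1 for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def youden_threshold(y_true, scores):
    """Try every observed score as a cut point and keep the one maximising TPR - FPR."""
    best_thr, best_j = None, -1.0
    for thr in sorted(set(scores)):
        tp, fp, fn, tn = confusion_at(y_true, scores, thr)
        j = tp / (tp + fn) - fp / (fp + tn)
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr

# Invented, unbalanced example: eight negatives and two positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p1 = [0.05, 0.10, 0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35, 0.45]  # P(class 1)
p0 = [1 - p for p in p1]                                           # P(class 0)

print(auc(y_true, p1), auc(y_true, p0))       # the two values sum to 1
print(confusion_at(y_true, p1, 0.5))          # default cut: no positive is found
thr = youden_threshold(y_true, p1)
print(thr, confusion_at(y_true, p1, thr))
```

With scores this low for the minority class, the default 0.5 cut predicts every sample as 0, which is why the cut point has to be chosen from the score distribution instead.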

Confusion matrix subset of classes not working properly

I have searched the internet for an answer to this question, including the suggestions shown while writing the title, but to no avail, so hopefully someone can help!
I am trying to construct a confusion matrix using scikit-learn. This comes after a Keras model.
This is bizarre, because I am having the following problem. For the training and test set of the original data I can construct the confusion matrix as follows (please note this is a multi-label problem, so the data has to be subset for the different labels).
The following works fine:
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1))
and likewise for 6:18 and so on, until all classes have been subset. The resulting confusion matrix reflects the true outcome of the Keras model.
The problem arises when I deploy the model on completely unseen data.
I deploy the model by calling model.predict() and get results as above. However, now I cannot subset the confusion matrices in the same way.
The code cm = confusion_matrix(...) produces a confusion matrix with the wrong dimensions, even when specifying 0:6 and so on.
I therefore used the code from above, modified with the labels argument:
age = [0, 1, 2, 3, 4]
organ = [5, 6, 7, 8]
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1), labels=age)
The FIRST label (1:5) works perfectly. However, the next labels do not! I don't get the right values in the confusion matrices, and the matching is also incorrect for the values that are in there.
To put this into context: there are over 400 samples in the unseen test data.
model.predict shows very high classification and correct scores for most labels.
Calling cm on ytest[:,4:8] etc. does indeed produce a 4x4 matrix; however, there are only about 5 values in it rather than 400, and the values that are in there do not match correctly.
Also, with the labels for age being 012345, subsetting ytest to 0:6 causes the correct confusion matrix to form (I am unsure why the 6 has to be included in the subset; nevertheless, I have tried different combinations with the same issue).
I have searched high and low for this answer, so I would really appreciate some assistance, as it is incredibly frustrating. I will be happy to provide any more code or information!
Many thanks!
This is happening because you are trying to subset the generated confusion matrix, but you actually have to generate a new confusion matrix manually with the specified class labels. If you have classes A, B and C, you will get a 3x3 matrix. If you want a matrix focusing only on class A, the other classes together become the negative class, and the false positive and false negative counts change, so you cannot just take a slice of the initial matrix.
This is how you should actually do it:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def generate_matrix(y_true, predict, class_name):
    """One-vs-rest 2x2 matrix [[TP, FP], [FN, TN]] for class_name."""
    TP, FP, FN, TN = 0, 0, 0, 0
    for i in range(len(y_true)):
        if y_true[i] == class_name:
            if predict[i] == class_name:
                TP += 1
            else:
                FN += 1
        else:
            # True label is another class: an error only if class_name was predicted.
            if predict[i] == class_name:
                FP += 1
            else:
                TN += 1
    return np.array([[TP, FP],
                     [FN, TN]])

# Plot new matrix
matrix = generate_matrix(actual_labels,
                         predicted_labels,
                         class_name='A')
sns.heatmap(matrix, annot=True, fmt='d')
plt.show()
This will generate a confusion matrix for class A.
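The same one-vs-rest counts can also be obtained from an existing multi-class matrix by summing its cells instead of slicing it. The sketch below assumes rows are true labels and columns are predictions, and uses a made-up 3x3 matrix for classes A, B, C; the function name one_vs_rest is hypothetical.

```python
def one_vs_rest(cm, k):
    """Collapse a square confusion matrix into [[TP, FP], [FN, TN]] for class index k.

    Assumes cm[i][j] counts samples with true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                   # true k, predicted something else
    fp = sum(row[k] for row in cm) - tp    # predicted k, true something else
    tn = total - tp - fn - fp              # everything not involving k
    return [[tp, fp], [fn, tn]]

# Hypothetical counts for classes A, B, C.
full = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 2, 7],
]
matrix_a = one_vs_rest(full, 0)
print(matrix_a)
```

Note that the true negatives for A include the B/C confusions (the 1 and the 2 off the diagonal), which is exactly the information a plain slice of the matrix would lose.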

SVM machine learning - How to define the target in the training set?

I am working on a project where I have to implement the SVM machine learning algorithm. I am trying to predict forearm movement intention. I am using an accelerometer (attached to my forearm) to measure the angle change for the x, y and z axes. I have never used machine learning before. The problem is that I do not know exactly how to structure the training set. I know the angle change for each axis, and I know, for example, that if x = 45 degrees, y = 65 degrees and z = 30 degrees, the gesture I performed is flexion. I would like to implement 3 gestures. So the data I have is:
x    y    z    Target
20   60   90   flexion
100  63   23   internal rotation
89   23   74   twist
...
I have a file with around 2000 entries. I know I have to normalize the training set so that the data are scaled. I would like to scale it into the range [0.1, 0.9]. The problem is that I do not know how to represent the target in my training set. Can I just use arbitrary numbers, such as 1 for flexion, 2 for internal rotation and 3 for twist?
Also, once the training is completed, can I make predictions based on the values of x, y, z only, without having to supply the target value? Is my understanding correct?
First of all, I suggest that you not scale or code your data. Leave it in human-readable form. Rather, write front-end routines to perform these tasks, and back-end routines to reverse the process. Also have internal routines that can display the data in the internal forms. Doing these up front will greatly enhance your debugging later on.
Yes, you will likely want to code your classifications as 1, 2, 3. Another possibility is to have a "one-hot" ordered triple: (1,0,0) or (0,1,0) or (0,0,1). However, most SVM algorithms are set up for scalar output. Also, note that the typical treatment for a multi-class algorithm is to run three separate SVM calculations, "one against all". For each class, you take that class as "plus" data and all the others as "minus" data.
Scaling data is important for regression convergence. If you're building your SVM via complete and direct computation of the support vectors, you don't need to scale numbers that are in compatible ranges, such as these. If you're doing it by some sort of iterative approximation, you still won't need it for this data -- but keep it in mind for the future.
Yes, for prediction you supply only the inputs x, y, z, and the model returns the target classification. That's the purpose of supervised learning: summarize experience to classify the future.
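The front-end and back-end routines suggested above could look roughly like the following sketch. The gesture names come from the question; the function names and the 1/2/3 coding are illustrative choices, not a fixed convention of any SVM library.

```python
GESTURES = ["flexion", "internal rotation", "twist"]

# Front end: human-readable label -> integer code (1, 2, 3).
ENCODE = {name: i + 1 for i, name in enumerate(GESTURES)}
# Back end: integer code -> human-readable label.
DECODE = {code: name for name, code in ENCODE.items()}

def encode(labels):
    return [ENCODE[label] for label in labels]

def decode(codes):
    return [DECODE[code] for code in codes]

def one_hot(label):
    """Alternative coding as an ordered triple, e.g. (0, 1, 0)."""
    return tuple(1 if g == label else 0 for g in GESTURES)

def one_vs_all_targets(labels, positive):
    """Targets for one of the three binary problems: +1 for `positive`, -1 otherwise."""
    return [1 if label == positive else -1 for label in labels]

targets = ["flexion", "internal rotation", "twist"]
print(encode(targets))                       # [1, 2, 3]
print(decode(encode(targets)) == targets)    # round trip back to readable form
print(one_hot("internal rotation"))          # (0, 1, 0)
print(one_vs_all_targets(targets, "twist"))  # [-1, -1, 1]
```

Keeping the stored data human-readable and converting only at the boundary means a prediction of 2 can always be printed back as "internal rotation" while debugging.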

Artificial Neural Network for formula classification/calculation

I am trying to create an ANN for calculating/classifying a/any formula.
I initially tried to replicate the Fibonacci sequence, using the inputs:
[1,2] output [3]
[2,3] output [5]
[3,5] output [8]
etc...
The issue I am trying to overcome is how to normalize the data that could be potentially infinite or scale exponentially? I then tried to create an ANN to calculate the slope-intercept formula y = mx+b (2x+2) with inputs
[1] output [4]
[2] output [6]
etc...
Again I do not know how to normalize the data. If I normalize only the training data how would the network be able to calculate or classify with inputs outside of what was used for normalization?
So would it be possible to create an ANN to calculate/classify the formula ((a+2b+c^2+3d-5e) modulo 2), where the formula is unknown, but (some of) the inputs a, b, c, d and e are given, as well as the output? Essentially, classifying whether the calculation's output is odd or even, with inputs ranging from -infinity to +infinity.
Okay, I think I understand what you're trying to do now. Basically, you are going to have a set of inputs representing the coefficients of a function. You want the ANN to tell you whether the function, with those coefficients, will produce an even or an odd output. Let me know if that's wrong. There are a few potential issues here:
First, while it is possible to use a neural network to do addition, it is not generally very efficient. You also need to set your ANN up in a very specific way, either by using a different node type than is usually used, or by setting up complicated recurrent topologies. This would explain your lack of success with the Fibonacci sequence and the line equation.
But there's a more fundamental problem. You might have heard that ANNs are general function approximators. However, in this case, the function that the ANN is learning won't be your formula. When you have an ANN that is learning to output either 0 or 1 in response to a set of inputs, it's actually trying to learn a function for a line (or set of lines, or hyperplane, depending on the topology) that separates all of the inputs for which the output should be 0 from all of the inputs for which the output should be 1. (see the answers to this question for a more thorough explanation, with pictures). So the question, then, is whether or not there is a hyperplane that separates coefficients that will result in an even output from coefficients that will result in an odd output.
I'm inclined to say that the answer to that question is no. If you consider the a coefficient in your example, for instance, you will see that every time you increment or decrement it by 1, the correct output switches. The same is true for the c, d, and e terms. This means that there aren't big clumps of relatively similar inputs that all return the same output.
Why do you need to know whether the output of an unknown function is even or odd? There might be other, more appropriate techniques.
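The parity argument above can be checked directly with a few lines. The formula is the one from the question; the starting values are arbitrary.

```python
def parity(a, b, c, d, e):
    """Output of the question's formula: 0 for even, 1 for odd."""
    return (a + 2 * b + c ** 2 + 3 * d - 5 * e) % 2

base = (3, 7, -2, 10, 4)  # arbitrary starting point

# Increment one input at a time and record whether the output flips.
flips = []
for i in range(len(base)):
    bumped = list(base)
    bumped[i] += 1
    flips.append(parity(*bumped) != parity(*base))

print(dict(zip("abcde", flips)))
```

Every input except b flips the label on a unit step (2b is always even), so neighbouring points alternate between the two classes and no single separating hyperplane can exist.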

The best way to calculate the best threshold with P. Viola, M. Jones Framework

I'm trying to implement the P. Viola and M. Jones detection framework in C++ (to begin with, a simple sequential classifier, not the cascaded version). I think I have designed all the required classes and modules (e.g. integral images, Haar features) except one, the most important: the core AdaBoost algorithm.
I have read the original P. Viola and M. Jones paper and many other publications. Unfortunately, I still don't understand how I should find the best threshold for a single weak classifier. I have found only brief references to "weighted median" and "Gaussian distribution" algorithms, and many fragments of mathematical formulas...
I have tried to use the OpenCV Train Cascade module sources as a template, but they are so extensive that reverse engineering the code is very time-consuming. I also wrote my own simple code to understand the idea of adaptive boosting.
The question is: could you explain the best way to calculate the best threshold for a single weak classifier?
Below is my AdaBoost pseudocode, rewritten from a sample found via Google, but I'm not convinced it's the correct approach. Computing one weak classifier is very slow (a few hours), and I have doubts especially about the method of calculating the best threshold.
(1) AdaBoost::FindNewWeakClassifier
(2) AdaBoost::CalculateFeatures
(3) AdaBoost::FindBestThreshold
(4) AdaBoost::FindFeatureError
(5) AdaBoost::NormalizeWeights
(6) AdaBoost::FindLowestError
(7) AdaBoost::ClassifyExamples
(8) AdaBoost::UpdateWeights
DESCRIPTION (1)
- Generates all possible arrangements of features in the detection window and puts them in a vector
DO IN LOOP
    - Runs the main calculating function (2)
END
DESCRIPTION (2)
- Normalizes weights (5)
DO FOR EACH HAAR FEATURE
    - Applies the next feature from the list to all integral images
    - Finds the best threshold for the feature (3)
    - Finds the error of the feature at its best threshold in the current iteration (4)
    - Saves the error of the feature in an array
    - Saves the threshold of the feature in an array
    - Saves the threshold sign of the feature in an array
END LOOP
- Finds the index of the classifier with the lowest error selected by the above loop (6)
- Gets the error value of the best feature
- Calculates the value of the best feature on all integral images (7)
- Updates weights (8)
- Adds the new weak classifier to the vector
DESCRIPTION (3)
- Calculates an error for each threshold of the feature on the positive integral images, separately for the "+" and "-" sign (4)
- Returns the threshold and sign of the feature with the lowest error
DESCRIPTION (4)
- Returns the feature error over all samples, by evaluating the inequality f(x) * sign < sign * threshold
DESCRIPTION (5)
- Ensures that the sample weights form a probability distribution
DESCRIPTION (6)
- Finds the classifier with the lowest error
DESCRIPTION (7)
- Calculates the value of the best feature on all integral images
- Counts the number of false positives and false negatives
DESCRIPTION (8)
- Corrects weights, depending on the classification results
Thank you for any help
In the original Viola-Jones paper, Section 3.1, Learning Discussion (paragraph 4, to be precise), you will find the procedure for finding the optimal threshold.
I'll sum up the method quickly below.
The optimal threshold for each feature depends on the sample weights and is therefore recalculated in every iteration of AdaBoost. The best weak classifier's threshold is saved, as mentioned in the pseudocode.
In every round, for each weak classifier, you have to sort the N training samples by feature value. Placing a threshold separates this sequence into 2 parts. Each part will have either positive or negative samples in the majority, along with a few samples of the other type.
T+ : total sum of positive sample weights
T- : total sum of negative sample weights
S+ : sum of positive sample weights below the threshold
S- : sum of negative sample weights below the threshold
The error for a particular threshold is:
e = min(S+ + (T- - S-), S- + (T+ - S+))
The first term is the error when everything below the threshold is labelled negative, and the second is the error when everything below it is labelled positive.
Why the minimum? Here's an example. If the samples and the threshold look like this:
+ + + + + - - | + + - - - - -
then in the first round, if all weights are equal (= w), taking the minimum gives an error of 4*w instead of 10*w.
You calculate this error for all N possible ways of separating the samples.
The minimum error will give you the range of threshold values. The actual threshold is probably the average of the adjacent feature values (I'm not sure though, do some research on this).
This was the second step in your DO FOR EACH HAAR FEATURE loop.
The cascades shipped with OpenCV were created by Rainer Lienhart, and I don't know what method he used.
You could closely follow the OpenCV source code for any further improvements on this procedure.
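The threshold search summarised above can be sketched as a single pass over the samples sorted by feature value, keeping running sums for S+ and S-. This is written in Python for brevity rather than in the asker's C++, and it is only a sketch under assumptions: the function names, the +1/-1 label coding, the polarity convention (+1 means "predict positive when the value is below the threshold") and the midpoint rule for the threshold are illustrative choices, not taken from the paper or from OpenCV.

```python
def split_error(s_pos, s_neg, t_pos, t_neg):
    """e = min(S+ + (T- - S-), S- + (T+ - S+)) for one candidate split."""
    return min(s_pos + (t_neg - s_neg), s_neg + (t_pos - s_pos))

def best_threshold(values, labels, weights):
    """Return (error, threshold, polarity) with the lowest weighted error for one feature."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    t_pos = sum(w for w, y in zip(weights, labels) if y == 1)
    t_neg = sum(w for w, y in zip(weights, labels) if y != 1)
    s_pos = s_neg = 0.0                     # weights of samples below the split
    best = (float("inf"), None, None)
    for rank, i in enumerate(order):
        # Only split between distinct feature values.
        if rank == 0 or values[i] != values[order[rank - 1]]:
            err_below_neg = s_pos + (t_neg - s_neg)
            err_below_pos = s_neg + (t_pos - s_pos)
            err = min(err_below_neg, err_below_pos)
            if err < best[0]:
                polarity = 1 if err_below_pos <= err_below_neg else -1
                if rank == 0:
                    thr = values[i]
                else:
                    thr = (values[order[rank - 1]] + values[i]) / 2
                best = (err, thr, polarity)
        if labels[i] == 1:
            s_pos += weights[i]
        else:
            s_neg += weights[i]
    return best

# The example from the answer: + + + + + - - | + + - - - - -  with equal weights w = 1.
labels = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1]
values = list(range(len(labels)))           # already sorted feature values
weights = [1.0] * len(labels)

print(split_error(5.0, 2.0, 7.0, 7.0))      # error at the split drawn in the answer
print(best_threshold(values, labels, weights))
```

Sorting once per feature and scanning makes each feature cost O(N log N) instead of re-evaluating every sample for every candidate threshold, which is likely where hours per weak classifier come from. On this data the scan finds a split with a lower error than the one drawn in the example, since the drawn split only illustrates why the minimum of the two terms is taken.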
