We get statistics of evaluation metrics (like precision, recall, accuracy, etc.) from the confusion matrix.
Is there any way to reverse the process?
In a binary classification problem, I have the values of these evaluation metrics: Precision, Recall, F1-Measure, Accuracy, Specificity, and Balanced Accuracy. I want to reconstruct the confusion matrix from them.
Is there any reverse derivation for getting values for TN, FP, FN, TP?
Since all these metrics are ratios, it's unlikely you will be able to determine the original size of the data if it is not known. All the values will be relative to the size of the data.
p = TP / (TP + FP) => FP = p' * TP (Precision)
r = TP / (FN + TP) => FN = r' * TP (Recall)
s = TN / (TN + FP) => FP = s' * TN (Specificity)
where
p' = (1-p)/p, s' = (1-s)/s, r' = (1-r)/r
Size = TP + FP + FN + TN =>
Size = TP + p' TP + r' TP + 1/s' FP =>
Size = TP + p' TP + r' TP + p'/s' TP =>
Size = (1+p'+r'+p'/s') TP
Once the size is known, this gives TP, and the remaining values follow from the equations above.
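As a minimal sketch of this derivation, assuming the dataset size is known and that precision, recall and specificity all lie strictly between 0 and 1: since TN = FP / s' = (p'/s') * TP, the size equals (1 + p' + r' + p'/s') * TP, which can be solved for TP. The function name confusion_from_metrics and the example counts are made up for illustration.

```python
def confusion_from_metrics(precision, recall, specificity, size):
    """Recover (TP, FP, FN, TN) as real-valued counts from three ratios and the size.

    Assumes 0 < precision, recall, specificity < 1 so no division by zero occurs.
    """
    p_ = (1 - precision) / precision        # FP = p' * TP
    r_ = (1 - recall) / recall              # FN = r' * TP
    s_ = (1 - specificity) / specificity    # FP = s' * TN
    tp = size / (1 + p_ + r_ + p_ / s_)     # Size = (1 + p' + r' + p'/s') * TP
    fp = p_ * tp
    fn = r_ * tp
    tn = fp / s_
    return tp, fp, fn, tn


# Illustrative matrix: TP=40, FP=10, FN=20, TN=30 (size 100)
tp, fp, fn, tn = confusion_from_metrics(40 / 50, 40 / 60, 30 / 40, 100)
print([round(v) for v in (tp, fp, fn, tn)])  # [40, 10, 20, 30]
```

With metrics that were rounded before being reported, the recovered counts will generally not be whole numbers, so rounding them is only an approximation.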
Whilst I realise that this question is over a year old, I had a similar problem yesterday where I needed to know if certain metrics are achievable.
Google searches for this problem only show this question, so I decided to solve it and share it here in case anyone in the future requires a solution.
My solution is too long to post as code here, but I have made a project which will extract all of the possible confusion matrices for a given dataset size against certain metrics.
Here is the link to my repository housing the project:
Sam-Pewton/reverse_engineer_confusion_matrix
The code will extract all possible confusion matrices for given parameters and dump all outputs to a .csv file.
As alluded to by UziGoozie, you need to know the size of each of the classes for this problem; it would be impossible without them. As a bare minimum my project requires the sizes of both classes, the number of decimal places to round your metrics to, and the accuracy you are trying to prove.
Other optional metrics I have included to filter by are:
Sensitivity (Recall/TPR)
Specificity (TNR)
F1 Score
Precision
An explanation of how to use the code is documented in the project's README.
I hope this helps.
I'm having trouble understanding a lecture slide in my school's machine learning course
Why does the expected value of Y equal f(X)? What does that mean?
My understanding is that X and Y are vectors, and f(X) outputs a vector Y where each value y_i in Y corresponds to f(x_i), with x_i being the value in X at index i. But now it's taking the expected value of Y, which is going to be a single value, so how is that equal to f(X)?
X, Y (uppercase) are vectors
x_i,y_i (lowercase with subscript) are scalars at index i in X,Y
There is a lot of confusion here. First let's start with definitions
Definitions
Expectation operator E[.]: Takes a random variable as an input and gives a scalar/vector as an output. Let's say Y is a normally distributed random variable with mean Mu and variance Sigma^{2}, usually stated as:
Y ~ N( Mu , Sigma^{2} ). Then E[Y] = Mu.
Function f(.): Takes a scalar/vector (not a random variable) and gives a scalar/vector. In this context it is an affine function, that is f(X) = a*X + b where a and b are fixed constants.
What's Going On
Now you can view linear regression from two angles.
Stats View
One angle assumes that your response variable Y is a normally distributed random variable because:
Y = a*X + b + epsilon
where
epsilon ~ N( 0 , sigma^2 )
and X follows some other distribution. We don't really care how X is distributed and treat it as given. In that case the conditional distribution is
Y|X ~ N( a*X + b , sigma^2 )
Notice that here a, b and also X are numbers; there is no randomness associated with them.
Maths View
The other view is the math view where I assume that there is a function f(.) that governs the real life process, that if in real life I observe X, then f(X) should be the output. Of course this is not the case and the deviations are assumed to be due to various reasons such as gauge error etc. The claim is that this function is linear:
f(X) = a*X + b
Synthesis
Now how do we combine these? Well, as follows:
E[Y|X] = a*X + b = f(X)
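A quick way to convince yourself of this identity is to simulate it. The sketch below holds X fixed at one value, draws many samples of Y = a*X + b + epsilon with epsilon ~ N(0, sigma^2), and averages them. The constants a = 2, b = 1, sigma = 1 and the function name conditional_mean are arbitrary choices for illustration.

```python
import random


def conditional_mean(x, a=2.0, b=1.0, sigma=1.0, n=100_000, seed=0):
    """Estimate E[Y | X = x] by averaging n draws of Y = a*x + b + epsilon."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(n):
        total += a * x + b + rng.gauss(0.0, sigma)
    return total / n


x = 3.0
estimate = conditional_mean(x)   # sample average of Y at this fixed x
exact = 2.0 * x + 1.0            # f(x) = a*x + b
print(estimate, exact)
```

The average is a single number because x is a single fixed value; repeating this for every x_i in a data vector gives one conditional mean per entry, which is how E[Y|X] can be read as f(X) element by element.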
About your question: first, I would point out that the expectation should be of Y|X, not of Y by itself.
Second, there are tons of possible ontological discussions over what each term here represents in real life. X,Y (uppercase) could be vectors. X,Y (uppercase) could also be random variables. A sample of these random variables might be stored in vectors, and both would be represented with uppercase letters (the best way is to use different fonts for each). In this case, your sample becomes your data. Discussions about the general view of the model and its relevance to real life should be held at the random-variable level. Discussions about how to infer the parameters and how linear regression algorithms work should be held at the matrix and vector level. There could be other discussions where you should care about both.
I hope this rather unorganized answer helps you. In general, if you want to learn such stuff, be sure you know what kind of math objects and operators you are dealing with, what they take as input, and how they relate to real life.
I would like to prove properties of expressions involving matrices and vectors (potentially large size, but size is fixed).
For example I want to prove that the outcome of an expression is a diagonal matrix or a triangular matrix, or it is positive definite, ...
To that end I'd like to encode well-known properties and identities from linear algebra, such as:
||x + y|| <= ||x|| + ||y||
(A * B) * C = A * (B * C)
det(A * B) = det(A) * det(B)
Tr(zA) = z * Tr(A)
(I + AB) ^ (-1) = I - A(I + BA) ^ (-1) * B
...
I have attempted to implement this in Z3. But even for simple properties it returns unknown or times out. I've tried with array theory and quantifiers.
I'd like to know whether this problem can be solved with an SMT solver, or whether SMT solvers are not suited to this kind of problem. Could you give a hint by giving a small example?
You can certainly use Z3 to do this.
I have constructed a small example here, which defines the identity matrix and what it means to be a diagonal matrix, and then proves that the identity matrix is diagonal.
So, it is definitely possible to do this kind of work in Z3, though you may have a better time using a tool built on top of Z3 that has more interactive proving features, such as Dafny or F*.
I started to learn Machine Learning. Now I have tried to play around with tensorflow.
Often I see examples like this:
pred = tf.add(tf.mul(X, W), b)
I also saw such a line in a plain numpy implementation. Why is x*W+b always used instead of W*x+b? Is there an advantage if the matrices are multiplied in this way? I see that it is possible (if X, W and b are transposed), but I do not see an advantage. In math class at school we only ever used Wx+b.
Thank you very much
This is the reason:
By default w is a vector of weights, and in maths a vector is considered a column, not a row.
X is the collection of data: an n x d matrix, where n is the number of samples and d the number of features. (Upper-case X is the n x d matrix; lower-case x is a single sample, a 1 x d matrix.)
To multiply the two so that each weight meets its own feature, you must use X*w+b.
With X*w you multiply every feature by its corresponding weight, and by adding b you add the bias term to every prediction.
If you multiply w * X you multiply a (d x 1) matrix by an (n x d) matrix, which is not defined in general.
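The shape argument can be checked with a small sketch. It uses plain Python lists instead of tensorflow or numpy so that the shape check is explicit; the helper matmul and the sample numbers are made up for illustration.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows, checking that shapes conform."""
    n, k = len(A), len(A[0])
    if len(B) != k:
        raise ValueError(f"shape mismatch: ({n} x {k}) * ({len(B)} x {len(B[0])})")
    m = len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # n = 3 samples, d = 2 features
w = [[0.5], [-1.0]]                        # d x 1 column of weights
b = 0.1                                    # scalar bias

# (n x d) * (d x 1) -> (n x 1): one prediction per sample
pred = [row[0] + b for row in matmul(X, w)]

# (d x 1) * (n x d) does not conform, so the reversed product is rejected
try:
    matmul(w, X)
    reversed_ok = True
except ValueError:
    reversed_ok = False

print(len(pred), reversed_ok)
```

Storing one sample per row is what makes X*w the natural order: the row count n can change from batch to batch while w stays d x 1.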
I'm also confused by this. I guess it may be a matter of dimensions. For an n x m matrix W and an n-dimensional vector x, xW+b can easily be viewed as mapping an n-dimensional feature to an m-dimensional feature, i.e., you can think of W as an n-dimension -> m-dimension operation, whereas Wx+b (x must now be an m-dimensional vector) becomes an m-dimension -> n-dimension operation, which looks less comfortable in my opinion. :D
I'm working on homework for my machine learning course and am having trouble understanding the question on Naive Bayes. The problem I have is a variation of question number 2 on the following page:
https://www.cs.utexas.edu/~mooney/cs343/hw3-old/hw3.html
The numbers I have are slightly different, so I'll replace the numbers from my assignment with the example above. I'm currently attempting to figure out the probability that the first text is physics. To do so, I have something that looks a little like this:
P(physics|c) = P(physics) * P(carbon|physics) * P(atom|physics) * P(life|physics) * P(earth|physics) / [SOMETHING]
P(physics|c) = .35 * .005 * .1 * .001 * .005 / [SOMETHING]
I'm basing this off of an example that I've seen in my notes, but I can't seem to figure out what I'm supposed to divide by. I'll provide the example from the notes as well.
Perhaps I'm going about this in the wrong way, but I'm unsure where the P(X) term that we're dividing by is coming from. How does this relate to the probability that the text is physics? I feel that getting this issue resolved will make the remainder of the assignment simple.
The denominator P(X) is just the sum of P(X|Y)*P(Y) for all possible classes.
Now, it's important to note that in Naive Bayes, you do not have to compute this P(X). You only have to compute P(X|Y)*P(Y) for each class, and then select the class that produced the highest probability.
In your case, I assume you must have several classes. You mentioned physics, but there must be others like chemistry or math.
So you can compute:
P(physics|X) = P(X|physics) * P(physics) / P(X)
P(chemistry|X) = P(X|chemistry) * P(chemistry) / P(X)
P(math|X) = P(X|math) * P(math) / P(X)
P(X) is the sum of P(X|Y)*P(Y) for all classes:
P(X) = P(X|physics)*P(physics) + P(X|chemistry)*P(chemistry) + P(X|math)*P(math)
(By the way, the above statement is exactly analogous to the example in the image that you provided. The equations are a bit complicated there, but if you rearrange them, you will find that P(X) = P(X|positive)*P(positive) + P(X|negative)*P(negative) in that example).
To produce the answer (that is, to determine Y among physics, chemistry, or math), you would select the maximum value among P(physics|X), P(chemistry|X), and P(math|X).
As I mentioned, you do not need to compute P(X) because this term exists in the denominator of all of P(physics|X), P(chemistry|X), and P(math|X). Thus, you only need to determine the max among P(X|physics)*P(physics), P(X|chemistry)*P(chemistry), and P(X|math)*P(math).
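Here is a minimal sketch of that computation. The physics prior and word likelihoods are the ones from the question; the chemistry and math numbers are hypothetical placeholders, since the question does not give them, and the function name posteriors is made up.

```python
def posteriors(priors, likelihoods):
    """Return (normalized posteriors, predicted class) for a Naive Bayes decision."""
    # Numerators P(X|Y) * P(Y), one per class
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    # P(X) is the sum of the numerators over all classes
    evidence = sum(joint.values())
    post = {c: joint[c] / evidence for c in joint}
    # The argmax only needs the numerators, because P(X) is shared by all classes
    best = max(joint, key=joint.get)
    return post, best


priors = {"physics": 0.35, "chemistry": 0.40, "math": 0.25}  # last two are hypothetical
likelihoods = {
    "physics": 0.005 * 0.1 * 0.001 * 0.005,  # P(X|physics), from the question
    "chemistry": 1e-9,                        # hypothetical
    "math": 1e-10,                            # hypothetical
}

post, best = posteriors(priors, likelihoods)
print(best, post)
```

Dividing by P(X) only rescales the three numbers so that they sum to 1; it never changes which class is largest.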
The point is that you don't really need a value for P(X) because it is the same for all classes. So you can ignore it and just compare the numbers before the division step. The highest number gives the predicted class.
It appears in the equation because of Bayes' rule:
P(C1|X) = P(X|C1) * P(C1) / P(X)
Does it help classification if I add linear or non-linear combinations of the existing features? For example, does it help to add the mean or variance, computed from the existing features, as new features? I believe it depends on the classification algorithm: in the case of PCA, the algorithm itself generates new features which are orthogonal to each other and are linear combinations of the input features. But how does it affect decision-tree-based classifiers or others?
Yes, combinations of existing features can give new features and help classification. Moreover, a combination of a feature with itself (e.g. a polynomial in that feature) can serve as additional data during classification.
As an example, consider a logistic regression classifier with the following linear formula at its core:
g(x, y) = 1*x + 2*y
Imagine that you have 2 observations:
x = 6; y = 1
x = 3; y = 2.5
In both cases g() will be equal to 8. If the observations belong to different classes, you have no way to distinguish them. But let's add one more variable (feature) z, which is a combination of the previous 2 features, z = x * y:
g(x, y, z) = 1*x + 2*y + 0.5*z
Now for the same observations we have:
x = 6; y = 1; z = 6 * 1 = 6 ==> g() = 11
x = 3; y = 2.5; z = 3 * 2.5 = 7.5 ==> g() = 11.75
So now we get 2 different values and can distinguish between the 2 observations.
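The arithmetic above can be reproduced with a few lines of Python. The function names are made up for illustration; the coefficients and observations are the ones from the example.

```python
def g_linear(x, y):
    """Original linear score: g(x, y) = 1*x + 2*y."""
    return 1 * x + 2 * y


def g_with_interaction(x, y):
    """Score with the extra interaction feature z = x * y and weight 0.5."""
    z = x * y
    return 1 * x + 2 * y + 0.5 * z


observations = [(6, 1), (3, 2.5)]

before = [g_linear(x, y) for x, y in observations]
after = [g_with_interaction(x, y) for x, y in observations]

print(before)  # [8, 8.0]      -> the two observations collide
print(after)   # [11.0, 11.75] -> the interaction term separates them
```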
Polynomial features (x^2, x^3, y^2, etc.) do not give additional points, but instead change the graph of the function. For example, g(x) = a0 + a1*x is a line, while g(x) = a0 + a1*x + a2*x^2 is a parabola and thus can fit curved data more closely.
In general, it's always better to have more features. Unless you have very predictive features (i.e. they allow for perfect separation of the classes to predict) already, I would always recommend adding more features. In practice, many classification algorithms (and in particular decision tree inducers) select the best features for their purposes anyway.
There are open-source Python libraries that automate feature creation / combination:
We can automate polynomial feature creation with sklearn (PolynomialFeatures).
We can automatically create spline features with sklearn (SplineTransformer).
We can combine features mathematically with Feature-engine. With MathFeatures we combine feature groups, and with RelativeFeatures we combine feature pairs.