How this method or formula for calculating ROC AUC works? - machine-learning

I was trying to calculate the AUC using MySQL for the data in table like below:
y p
1 0.872637
0 0.130633
0 0.098054
...
...
1 0.060190
0 0.110938
I came across the following SQL query which is giving the correct AUC score (I verified using sklearn method).
SELECT (sum(y*r) - 0.5*sum(y)*(sum(y)+1)) / (sum(y) * sum(1-y)) AS auc
FROM (
SELECT y, row_number() OVER (ORDER BY p) r
FROM probs
) t
Using pandas this can be done as follows:
temp = df.sort_values(by="p")
temp['r'] = np.arange(1, len(df)+1, 1)
temp['yr'] = temp['y']*temp['r']
print( (sum(temp.yr) - 0.5*sum(temp.y)*(sum(temp.y)+1)) / (sum(temp.y) * sum(1-temp.y)) )
I did not understand how we are able to calculate AUC using this method. Can somebody please give intuition behind this?
I am already familiar with the trapezoidal method which involves summing the area of small trapezoids under the ROC curve.

Short answer: it is Wilcoxon-Mann-Whitney statistic, see
https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
The page has proof as well.
The bottom part of your formula is identical to the formula in the wiki. The top part is trickier. f in wiki corresponds to p in your data and t_0 and t_1 are indexes in the data frame. Note that we first sort by p, which makes our life easier.
Note that the double sum may be decomposed as
Sum_{t_1 such that y(t_1)=1} #{t_0 such that p(t_0) < p(t_1) and y(t_0)=0}
Here # stands for the total number of such indexes.
For each row index t_1 (such that y(t_1) =1), how many t_0 are such that p(t_0) < p(t_1) and y(t_0)=0? We know that there are exactly t_1 values of p that are less or equal than t_1 because values are sorted. We conclude that
#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = t_1 - #{t_0: t_0 <= t_1 and y(t_0)=1}
Now imagine scrolling down the sorted dataframe. For the first time we meet y=1, #{t_0: t_0 <= t_1 and y(t_0)=1}=1, for the second time we meet y=1, the same quantity is 2, for the third time we meet y=1, the quantity is 3, and so on. Therefore, when we sum the equality over all indexes t_1 when y=1, we get
Sum_{t_1: y(t_1)=1}#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = Sum_{t_1: y(t_1)=1} t_1 - (1 + 2 + 3 + ... + n),
where n is the total number of ones in y column. Now we need to do one more simplification. Note that
Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1)
If y(t_1) is not one, it is zero. Therefore,
Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1) = Sum_{t} t y(t)
Plugging this to our formula and using that
1 + 2+ 3 + ... + n = n(n+1)/2
finished the proof of the formula you found.
P.S. I think that posting this question on math or stats overflow would make more sense.

Related

Numerical roundoff error in bilinear alternation leads to infeasibility

We are trying an alteration optimization strategy to solve Lyapunov problems.
We break down our decision variables into two sets, Set 1 and Set 2.
We were perplexed how it was possible that, after getting a solution to Set 1, and plugging in those solved variables into the optimization over Set 2, the transferred variables would not be feasible.
The constraints that fail are those due to the SOS coefficient matching equality constraints.
Here, we print in each row the constraint that failed, and the value of our Initial Guess. We can see that the Initial Guess is off only a very small amount compared to the constraints.
LinearEqualityConstraint
(2 * Symmetric(97,40) + 2 * Symmetric(96,41)) == 9.50028
[9.50027496]
LinearEqualityConstraint
(2 * Symmetric(97,47) + 2 * Symmetric(96,48)) == 234.465
[234.4647013]
LinearEqualityConstraint
(2 * Symmetric(97,54) + 2 * Symmetric(96,55)) == -234.463
[-234.46336504]
LinearEqualityConstraint
(2 * Symmetric(97,61) + 2 * Symmetric(96,62)) == 12.7962
[12.79618825]
LinearEqualityConstraint
(2 * Symmetric(97,68) + 2 * Symmetric(96,69)) == -12.7964
[-12.79637068]
LinearEqualityConstraint
(2 * Symmetric(97,75) + 2 * Symmetric(96,76)) == -51.4061
[-51.40605828]
LinearEqualityConstraint
(2 * Symmetric(97,81) + 2 * Symmetric(96,82)) == 51.406
[51.40604213]
LinearEqualityConstraint
(2 * Symmetric(97,86) + 2 * Symmetric(96,87)) == 192.794
[192.79430158]
LinearEqualityConstraint
(2 * Symmetric(97,90) + 2 * Symmetric(96,91)) == -141.924
[-141.92366183]
LinearEqualityConstraint
(2 * Symmetric(97,93) + 2 * Symmetric(96,94)) == -37.6674
[-37.66740401]
InitialGuess V_sos and
Our guess for what's happening is:
When you extract the solution from one optimization using result.GetSolution(var), you lose some precision.
Or, when you set the previous solution using prog.SetInitialGuess(np_array) you lose some precision.
What's the solution here? Should we just keep feeding the solution back in even though it says infeasible?
This is a partial cookbook when I debug SOS problem, especially when working with Lyapunov problems:
Choose the right monomial basis
The main idea is to remove the 0-th order monomial 1 from the monomial basis of the sos polynomial. Here is a quick explanation:
The mathematical problem is
Find λ(x)
−Vdot − λ(x) * (ρ − V) is sos
λ(x) is sos
Namely you want to prove that V≤ρ ⇒ Vdot ≤ 0
So first I would suggest to re-write your dynamics to make sure that 0 is the goal state (you can always shift your state).
Second you can see that since x=0 is the equilibrium point, then both V(0) = 0 and Vdot(0) = 0 (Because x=0 is the global minimal of V(x), hence ∂V/∂x=0 at x=0, indicating Vdot(0) = 0), now your sos polynomial p(x) = −Vdot − λ(x) * (ρ − V) must satisfy p(0) = -λ(0) * ρ. But λ(x) >= 0 and ρ > 0, so we know λ(0) = 0.
Lemma
If a sos polynomial s(x) satisfies s(0) = 0, then its monomial basis cannot contain the 0-th order monomial (namely 1).
Proof
Remember that s(x) is a sos polynomial, namely
s(x) = m(x)ᵀQm(x)
where m(x) contains the monomial basis, and Q is a psd matrix. Now let's decompose the monomial basis m(x) into two parts, the 0-th order monomial 1 and the remaining monomials mbar(x). For example, if m(x) = [x1, x2, 1], then mbar(x) = [x1, x2]. We also decompose the psd matrix Q accordingly
s(x) = [mbar(x)]ᵀ [Q11 Q10] [mbar(x)]
[ 1] [Q10 Q00] [ 1]
Since s(0) = Q00 = 0, we also know that Q10 = 0, so now we can use a smaller psd matrix Q11 rather than Q. Equivalently we write s(x) = mbar(x)ᵀ * Q11 * mbar(x), where mbar(x) is the monomial basis that doesn't contain the 0-th order monomial, QED.
So why removing the 0-th order monomial from the monomial basis and use the smaller psd matrix Q11 is a good idea when your sos polynomial s(x) satisfies s(0) = 0? The reason is that if the monomial basis contains 1, then your psd matrix Q has to be on the boundary of the psd cone, namely your SDP problem doesn't have a strict interior. This could leads to violation of Slater's condition, which also breaks the strong duality. One example is that if your s(x) = x², by including the 0'th order monomial, it is written as
x² = [x] [1 0] [x]
[1] [0 0] [1]
And you see that the Gram matrix [[1 0], [0, 0]] is on the boundary of the psd cone (with one eigen value equal to 0). But if you remove 1 from the monomial basis, then its Gram matrix is just Q11=1, strictly in the interior of the psd cone.
In Drake, after removing 1 from the monomial basis, you can create your sos polynomial λ(x) as
lambda_poly, lambda_gram = prog.NewSosPolynomial(monomial_basis)
whee monomial_basis doesn't contain the 0-th order monomial.
Backoff during bilinear alternation
This is a typical problem in bilinear alternation. The issue is that when you solve a conic optimization problem with an objective function, the optimal solution always occurs at the boundary of the cone, namely it is very close to being infeasible. Then when you fix some variables to this solution at the cone boundary, in the next iteration the problem is very likely infeasible due to numerical roundoff error.
A typical solution is that after solving the optimization problem on variable Set 1 with an objective, now "backoff" a little bit by solving a feasibility problem on variable Set 1. This new solution is often strictly feasible (namely it is inside the strict interior of the cone), now pass this strictly feasible solution Set 1 to the next iteration and search for Set 2.
More concretely, suppose at one iteration you solve the following optimization problem
min c'*x
s.t constraint_on_x
and denote the optimal cost as p. Now solve a new feasibility problem
find x
s.t c'*x <= p + epsilon
constraint_on_x
where epsilon can be a small positive number. This new solution will be used in the next iteration to search for a different set of variables.
You can check if your solution is on the boundary of the positive semidefinite cone by checking the Eigen value of your psd matrix. Here is the pseudo-code
for binding : prog.positive_semidefinite_constraints():
psd_sol = result.GetSolution(binding.variables())
psd_sol.reshape((binding.evaluator().matrix_rows(), binding.evaluator().matrix_rows()))
print(f"minimal eigenvalue {np.linalg.eig(psd_sol)[0].min()}")
You should see that before doing this "backoff" some of the minimal eigen value is almost 0. After "backoff" the minimal eigen value gets larger.

Dealing with NaN (missing) values for Logistic Regression- Best practices?

I am working with a data-set of patient information and trying to calculate the Propensity Score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values.
I get errors due to these missing values, as the values of my cost-function and gradient vector become NaN, when I try to perform logistic regression using the following Matlab code (from Andrew Ng's Coursera Machine Learning class) :
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
fminunc(#(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costfunction are working functions I created for overall ease of use.
The calculations can be performed smoothly if I replace all NaN values with 1 or 0. However I am not sure if that is the best way to deal with this issue, and I was also wondering what replacement value I should pick (in general) to get the best results for performing logistic regression with missing data. Are there any benefits/drawbacks to using a particular number (0 or 1 or something else) for replacing the said missing values in my data?
Note: I have also normalized all feature values to be in the range of 0-1.
Any insight on this issue will be highly appreciated. Thank you
As pointed out earlier, this is a generic problem people deal with regardless of the programming platform. It is called "missing data imputation".
Enforcing all missing values to a particular number certainly has drawbacks. Depending on the distribution of your data it can be drastic, for example, setting all missing values to 1 in a binary sparse data having more zeroes than ones.
Fortunately, MATLAB has a function called knnimpute that estimates a missing data point by its closest neighbor.
From my experience, I often found knnimpute useful. However, it may fall short when there are too many missing sites as in your data; the neighbors of a missing site may be incomplete as well, thereby leading to inaccurate estimation. Below, I figured out a walk-around solution to that; it begins with imputing the least incomplete columns, (optionally) imposing a safe predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
% Distance-based nearest neighbor imputation that impose a distance
% cutoff to determine nearest neighbors, i.e., avoids those samples
% that are more distant than the distCutoff argument.
%
% Imputes missing data coded by "NaN" starting from the covarites
% (columns) with the least number of missing data. Then it continues by
% including more (complete) covariates in the calculation of pair-wise
% distances.
%
% option,
% 'median' - Median of the nearest neighboring values
% 'weighted' - Weighted average of the nearest neighboring values
% 'default' - Unweighted average of the nearest neighboring values
%
% distMetric,
% 'euclidean' - Euclidean distance (default)
% 'seuclidean' - Standardized Euclidean distance. Each coordinate
% difference between rows in X is scaled by dividing
% by the corresponding element of the standard
% deviation S=NANSTD(X). To specify another value for
% S, use D=pdist(X,'seuclidean',S).
% 'cityblock' - City Block distance
% 'minkowski' - Minkowski distance. The default exponent is 2. To
% specify a different exponent, use
% D = pdist(X,'minkowski',P), where the exponent P is
% a scalar positive value.
% 'chebychev' - Chebychev distance (maximum coordinate difference)
% 'mahalanobis' - Mahalanobis distance, using the sample covariance
% of X as computed by NANCOV. To compute the distance
% with a different covariance, use
% D = pdist(X,'mahalanobis',C), where the matrix C
% is symmetric and positive definite.
% 'cosine' - One minus the cosine of the included angle
% between observations (treated as vectors)
% 'correlation' - One minus the sample linear correlation between
% observations (treated as sequences of values).
% 'spearman' - One minus the sample Spearman's rank correlation
% between observations (treated as sequences of values).
% 'hamming' - Hamming distance, percentage of coordinates
% that differ
% 'jaccard' - One minus the Jaccard coefficient, the
% percentage of nonzero coordinates that differ
% function - A distance function specified using #, for
% example #DISTFUN.
%
if nargin < 3
option = 'mean';
end
if nargin < 4
distMetric = 'euclidean';
end
nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
[~,leastNans] = min(nanValsPerCov);
noNansCov(leastNans) = true;
first = data(nanVals(:,noNansCov),:);
nanRows = find(nanVals(:,noNansCov)==true); i = 1;
for row = first'
data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
i = i+1;
end
end
nSamples = size(data,1);
if nargin < 2
dataNoNans = data(:,noNansCov);
distances = pdist(dataNoNans);
distCutoff = min(distances);
end
[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
imputeRows = 1:nSamples;
imputeRows = imputeRows(nanVals(:,c));
for r = reshape(imputeRows,1,length(imputeRows))
% Calculate distances
distR = inf(nSamples,1);
%
noNansCov_r = find(isnan(data(r,:))==0);
noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,~isnan(data(r,:)))),1)==0);
%
for i = find(nanVals(:,c)'==false)
distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
end
tmp = min(distR(distR>0));
% Impute the missing data at sample r of covariate c
switch option
case 'weighted'
data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
case 'median'
data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
case 'mean'
data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
end
% The missing data in sample r is imputed. Update the sample
% indices of c which are imputed.
nanVals(r,c) = false;
end
fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
To deal with missing data you can use one of the following three options:
If there are not many instances with missing values, you can just delete the ones with missing values.
If you have many features and it is affordable to lose some information, delete the entire feature with missing values.
The best method is to fill some value (mean, median) in place of missing value. You can calculate the mean of the rest of the training examples for that feature and fill all the missing values with the mean. This works out pretty well as the mean value stays in the distribution of your data.
Note: When you replace the missing values with the mean, calculate the mean only using training set. Also, store that value and use it to change the missing values in the test set also.
If you use 0 or 1 to replace all the missing values then the data may get skewed so it is better to replace the missing values by an average of all the other values.

What would be a good loss function to penalize the magnitude and sign difference

I'm in a situation where I need to train a model to predict a scalar value, and it's important to have the predicted value be in the same direction as the true value, while the squared error being minimum.
What would be a good choice of loss function for that?
For example:
Let's say the predicted value is -1 and the true value is 1. The loss between the two should be a lot greater than the loss between 3 and 1, even though the squared error of (3, 1) and (-1, 1) is equal.
Thanks a lot!
This turned out to be a really interesting question - thanks for asking it! First, remember that you want your loss functions to be defined entirely of differential operations, so that you can back-propagation though it. This means that any old arbitrary logic won't necessarily do. To restate your problem: you want to find a differentiable function of two variables that increases sharply when the two variables take on values of different signs, and more slowly when they share the same sign. Additionally, you want some control over how sharply these values increase, relative to one another. Thus, we want something with two configurable constants. I started constructing a function that met these needs, but then remembered one you can find in any high school geometry text book: the elliptic paraboloid!
The standard formulation doesn't meet the requirement of sign agreement symmetry, so I had to introduce a rotation. The plot above is the result. Note that it increases more sharply when the signs don't agree, and less sharply when they do, and that the input constants controlling this behaviour are configurable. The code below is all that was needed to define and plot the loss function. I don't think I've ever used a geometric form as a loss function before - really neat.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
def elliptic_paraboloid_loss(x, y, c_diff_sign, c_same_sign):
# Compute a rotated elliptic parabaloid.
t = np.pi / 4
x_rot = (x * np.cos(t)) + (y * np.sin(t))
y_rot = (x * -np.sin(t)) + (y * np.cos(t))
z = ((x_rot**2) / c_diff_sign) + ((y_rot**2) / c_same_sign)
return(z)
c_diff_sign = 4
c_same_sign = 2
a = np.arange(-5, 5, 0.1)
b = np.arange(-5, 5, 0.1)
loss_map = np.zeros((len(a), len(b)))
for i, a_i in enumerate(a):
for j, b_j in enumerate(b):
loss_map[i, j] = elliptic_paraboloid_loss(a_i, b_j, c_diff_sign, c_same_sign)
fig = plt.figure()
ax = fig.gca(projection='3d')
X, Y = np.meshgrid(a, b)
surf = ax.plot_surface(X, Y, loss_map, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
plt.show()
From what I understand, your current loss function is something like:
loss = mean_square_error(y, y_pred)
What you could do, is to add one other component to your loss, being this a component that penalizes negative numbers and does nothing with positive numbers. And you can choose a coefficient for how much you want to penalize it. For that, we can use like a negative shaped ReLU. Something like this:
Let's call "Neg_ReLU" to this component. Then, your loss function will be:
loss = mean_squared_error(y, y_pred) + Neg_ReLU(y_pred)
So for example, if your result is -1, then the total error would be:
mean_squared_error(1, -1) + 1
And if your result is 3, then the total error would be:
mean_squared_error(1, -1) + 0
(See in the above function how Neg_ReLU(3) = 0, and Neg_ReLU(-1) = 1.
If you want to penalize more the negative values, then you can add a coefficient:
coeff_negative_value = 2
loss = mean_squared_error(y, y_pred) + coeff_negative_value * Neg_ReLU
Now the negative values are more penalized.
The ReLU negative function we can build it like this:
tf.nn.relu(tf.math.negative(value))
So summarizing, in the end your total loss will be:
coeff = 1
Neg_ReLU = tf.nn.relu(tf.math.negative(y))
total_loss = mean_squared_error(y, y_pred) + coeff * Neg_ReLU

How do I determine the weight to assign to each bucket?

Someone will answer a series of questions and will mark each important (I), very important (V), or extremely important (E). I'll then match their answers with answers given by everyone else, compute the percent of the answers in each bucket that are the same, then combine the percentages to get a final score.
For example, I answer 10 questions, marking 3 as extremely important, 5 as very important, and 2 as important. I then match my answers with someone else's, and they answer the same to 2/3 extremely important questions, 4/5 very important questions, and 2/2 important questions. This results in percentages of 66.66 (extremely important), 80.00 (very important), and 100.00 (important). I then combine these 3 percentages to get a final score, but I first weigh each percentage to reflect the importance of each bucket. So the result would be something like: score = E * 66.66 + V * 80.00 + I * 100.00. The values of E, V, and I (the weights) are what I'm trying to figure out how to calculate.
The following are the constraints present:
1 + X + X^2 = X^3
E >= X * V >= X^2 * I > 0
E + V + I = 1
E + 0.9 * V >= 0.9
0.9 > 0.9 * E + 0.75 * V >= 0.75
E + I < 0.75
When combining the percentages, I could give important a weight of 0.0749, very important a weight of .2501, and extremely important a weight of 0.675, but this seems arbitrary, so I'm wondering how to go about calculating the optimal value for each weight. Also, how do I calculate the optimal weights if I ignore all constraints?
As far as what I mean by optimal: while adhering to the last 4 constraints, I want the weight of each bucket to be the maximum possible value, while having the weights be as far apart as possible (extremely important questions weighted maximally more than very important questions, and very important questions weighted maximally more than important questions).

Confusion on the cost function in video lecture

I am unable to understand the graph of 2nd and 3rd in the below.
What does "x" represent here? In graph 1 the value of x doesn't matter as theta 1 is zero. But in graph 2 and 3 which is the value of "x"?
In graph 2, how did the instructor decide that h(x) value is 2?
x can be any independent variable. A signal from a sensor, a person's age, my weight after an all you can eat buffet, ...
theta0 is the zeroth order, i.e. independent of x.
theta1 is the first order, linear with x.
In graph 2: theta0 is 0, theta1 is 0.5. Thus when x is 2. h = 0 + 0.5*2 = 1

Resources