How to use ROC curve - machine-learning

For logistic regression we usually follow the approach below:
[1] Randomly initialize the parameters (theta), and choose a cutoff/deciding point (we consider points above this cutoff as one class and points below it as the other class)
[2] Predict output values (h) with theta and the chosen input features
[3] Calculate the cost using the predicted (h) and actual results
[4] Calculate the gradient, so that we can use it to minimize the cost
[5] Recalculate theta using the obtained gradient
[6] Repeat steps 2-5 for a few iterations and then plot the cost values (obtained in step 3 of each iteration) against the number of iterations
[7] If the cost values decrease as the number of iterations increases, then our classifier is good; otherwise we have to randomly choose another value of theta and start again
We use the ROC curve to analyse the trade-off between the cutoff point and the true positive as well as true negative rates. My question is: when can we use the ROC curve? Is it after finding the minimizing theta using gradient descent? Please help!!

The ROC curve can be measured for any tunable predictor, no matter how bad. You could measure it straight away after step one. That would get you a very bad curve, of course: simultaneously many FP and FN.
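The whole point of all those iterations is to push the ROC curve toward the corner with few FP and few FN (the top-left corner in the usual plot of true positive rate against false positive rate).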
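To make this concrete, here is a minimal sketch that measures the ROC curve once right after random initialization and once after gradient descent. The toy data, the learning rate of 0.1, the 500 iterations, and the helper names `roc_points` and `auc` are all assumptions made up for illustration, not part of the question. Note that the ROC curve is built by sweeping the cutoff over every possible value, so no single cutoff has to be chosen before measuring it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def roc_points(scores, labels):
    """Sweep the cutoff over all scores; return (fpr, tpr), one point per cutoff."""
    # Start with an infinite cutoff (nothing predicted positive), then lower it.
    cutoffs = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    pos, neg = labels == 1, labels == 0
    tpr = np.array([(scores[pos] >= c).mean() for c in cutoffs])
    fpr = np.array([(scores[neg] >= c).mean() for c in cutoffs])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical toy data: two overlapping 2-D Gaussian blobs plus a bias column.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
X = np.hstack([X, np.ones((200, 1))])
y = np.concatenate([np.zeros(100), np.ones(100)])

theta = rng.normal(scale=0.01, size=3)      # step 1: random initialization
fpr_before, tpr_before = roc_points(sigmoid(X @ theta), y)

for _ in range(500):                        # steps 2-5: gradient descent
    h = sigmoid(X @ theta)
    theta -= 0.1 * X.T @ (h - y) / len(y)

fpr_after, tpr_after = roc_points(sigmoid(X @ theta), y)
print("AUC right after initialization:", auc(fpr_before, tpr_before))
print("AUC after training:", auc(fpr_after, tpr_after))
```

The curve measured right after initialization depends entirely on the random draw, so it can land anywhere; the curve after training is the one you would normally use to pick a cutoff.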


K-means++ clustering Algorithm

The algorithm for K-means++ is:
Take one centroid c(1), chosen uniformly at random from the dataset.
Take a new centroid c(i), choosing an instance x(i) from the dataset with probability
D(x(i))^2 / Sum_{j=1..m} D(x(j))^2, where D(x(i)) is the distance between the instance and the closest centroid already selected.
What is this parameter m used in the summation of the probability?
It might have been helpful to see the original formulation, but the algorithm is quite clear: during the initialization phase, for each point not used as a centroid, calculate the distance between that point and the nearest centroid; this is the distance D(X[i]). Then pick a random point from this set of points with probability weighted by D(X[i])^2.
In your formulation, m seems to be the number of points not yet used as centroids. Summing over all points in the dataset gives the same probabilities, because D is 0 for a point already chosen as a centroid.
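The seeding step described above could be sketched like this. The function name `kmeans_pp_init` and the toy points are assumptions for illustration, and the sketch sums over all m points of the dataset, which is equivalent because already chosen points have D = 0. It also assumes the points are not all identical, otherwise the weights would sum to zero.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X using D^2 weighting."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # First centroid: uniformly at random from the dataset.
    chosen = [int(rng.integers(m))]
    # d2[i] = squared distance from point i to its nearest chosen centroid.
    d2 = np.sum((X - X[chosen[0]]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()               # D(x_i)^2 / sum_j D(x_j)^2
        new = int(rng.choice(m, p=probs))
        chosen.append(new)
        # Update nearest-centroid distances with the newly added centroid.
        d2 = np.minimum(d2, np.sum((X - X[new]) ** 2, axis=1))
    return X[chosen], chosen

# Hypothetical toy data: three well separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [10.0, 10.0], [10.0, 11.0],
                   [20.0, 0.0], [20.0, 1.0]])
centroids, indices = kmeans_pp_init(points, k=3)
print(indices)
print(centroids)
```

Because a point that is already a centroid has weight 0, it can never be drawn a second time.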

Do I need to include my scaled outputs in my back-propagation equation (SGD)?

Quick question, when I am backpropagating the loss function to my parameters and I used a scaled output (ex. tanh(x) * 2), do I need to include the derivative of the scaled output w.r.t the original output? Thank you!
Before we can backprop the errors, we have to compute the gradient of the loss function with respect to each of the parameters. This involves computing the gradients of the outputs first and then applying the chain rule repeatedly. When you do this, the scaling constant stays in the derivative as a multiplicative factor: for y = 2 * tanh(x), dy/dx = 2 * (1 - tanh(x)^2). So, yes, you have to scale the errors accordingly.
As an example, you might have seen the following L2-regularized loss, a.k.a. ridge regression:
Loss = 1/2 * ||T - Y||^2 + \lambda * ||w||^2
Here, we are scaling the squared error by 1/2, so when we compute the gradient the 1/2 and the 2 from the power rule cancel out. If we had not multiplied by 0.5 in the first place, we would have to scale that part of the gradient up by 2. Otherwise the error term would be weighted incorrectly relative to the regularization term, and the update would point in some other direction instead of the direction that decreases the loss fastest.
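A quick way to convince yourself is a finite-difference check. The sketch below assumes a single weight w, a single input x, the output y = 2 * tanh(w * x) and a squared-error loss; all of the names and numbers are made up for illustration. It compares the chain-rule gradient with and without the factor 2 against a numerical gradient.

```python
import numpy as np

SCALE = 2.0  # the output scaling from the question: y = tanh(z) * 2

def output(z):
    return SCALE * np.tanh(z)

def loss(w, x, t):
    return 0.5 * (t - output(w * x)) ** 2

def grad_with_scale(w, x, t):
    z = w * x
    # dL/dw = dL/dy * dy/dz * dz/dw, with dy/dz = SCALE * (1 - tanh(z)^2)
    return -(t - output(z)) * SCALE * (1.0 - np.tanh(z) ** 2) * x

def grad_without_scale(w, x, t):
    z = w * x
    # Same chain, but (incorrectly) dropping the scaling constant.
    return -(t - output(z)) * (1.0 - np.tanh(z) ** 2) * x

w, x, t = 0.3, 1.5, 1.0
eps = 1e-6
numeric = (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)

print("numeric gradient:     ", numeric)
print("with scaling factor:  ", grad_with_scale(w, x, t))
print("without scaling factor:", grad_without_scale(w, x, t))
```

Only the version that keeps the factor 2 agrees with the numerical gradient.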

Perceptron training rule, why multiply by x

I was reading Tom Mitchell's machine learning book and he mentioned the formula for perceptron training rule is
$w_i \leftarrow w_i + \Delta w_i$
where
$\Delta w_i = \eta (t - o) x_i$
$\eta$: training rate
$t$: expected output
$o$: actual output
$x_i$: ith input
This implies that if $x_i$ is very large then so is $\Delta w_i$, but I don't understand the purpose of a large update when $x_i$ is large
on the contrary, I feel like if there is a large $x_i$ then the update should be small since a small fluctuation in $w_i$ will result in a big change in the final output (due to $w_i x_i$)
The adjustments are vector additions and subtractions, which can be thought of as rotating a hyperplane such that class 0 falls on one side and class 1 falls on the other.
Consider a 1xd weight vector $\mathbf{w}$ indicating the weights of the perceptron model. Also, consider a 1xd datapoint $\mathbf{x}$. Then the predicted value of the perceptron model, considering a linear threshold without loss of generality, will be
$y = \mathbf{w} \cdot \mathbf{x}$ -- Eq. 1
Here '.' is a dot product, or $\mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{d} w_i x_i$
The hyperplane of the above equation is $\mathbf{w} \cdot \mathbf{x} = 0$
(Ignoring the iteration indices for the weight updates for simplicity)
Let us consider we have two classes 0 and 1, again without loss of generality: datapoints labelled 0 fall on the side of the hyperplane where Eq. 1 <= 0, and datapoints labelled 1 fall on the other side, where Eq. 1 > 0.
The vector which is normal to this hyperplane is $\mathbf{w}$. The angle between $\mathbf{w}$ and the datapoints with label 0 should be more than 90 degrees, and the angle between $\mathbf{w}$ and the datapoints with label 1 should be less than 90 degrees.
There are three possibilities for $t - o$, the expected output minus the actual output (ignoring the training rate):
$t - o = 0$: implying that this example is classified correctly by the present set of weights. Therefore we do not need any changes for the specific datapoint.
$t - o = 1$: implying that the target was 1, but the present set of weights classified it as 0. Eq. 1 was supposed to be $\mathbf{w} \cdot \mathbf{x} > 0$. Eq. 1 in this case is $\mathbf{w} \cdot \mathbf{x} \le 0$, which indicates that the angle between $\mathbf{w}$ and $\mathbf{x}$ is greater than 90 degrees, when it should have been less. The update rule is $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}$. If you imagine a vector addition in 2d, this will rotate the hyperplane so that the angle between $\mathbf{w}$ and $\mathbf{x}$ is smaller than before, moving it toward less than 90 degrees.
$t - o = -1$: implying that the target was 0, but the present set of weights classified it as 1. Eq. 1 was supposed to be $\mathbf{w} \cdot \mathbf{x} \le 0$. Eq. 1 in this case is $\mathbf{w} \cdot \mathbf{x} > 0$, which indicates that the angle between $\mathbf{w}$ and $\mathbf{x}$ is less than 90 degrees, when it should have been greater. The update rule is $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{x}$. Similarly, this will rotate the hyperplane so that the angle between $\mathbf{w}$ and $\mathbf{x}$ moves toward more than 90 degrees.
This is iterated over and over, and the hyperplane is rotated and adjusted so that the hyperplane's normal makes an angle of less than 90 degrees with the datapoints labelled 1 and greater than 90 degrees with the datapoints labelled 0.
If the magnitude of $\mathbf{x}$ is huge there will be big changes, which can cause problems in the process, and it may take more iterations to converge depending on the magnitude of the initial weights. Therefore it is a good idea to normalise or standardise the datapoints. From this perspective it is easy to visualise what exactly the update rules are doing (consider the bias as a part of the hyperplane Eq. 1). Now extend this to more complicated networks and/or to units with thresholds.
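The update rule discussed above can be tried out with a small sketch. It assumes a threshold at 0, the bias folded in as a constant last input of 1, a training rate of 1, and the logical AND function as hypothetical toy data; the function names are made up for illustration.

```python
import numpy as np

def predict(w, x):
    """Linear threshold unit: output 1 if w . x > 0, else 0."""
    return 1 if np.dot(w, x) > 0 else 0

def train_perceptron(X, targets, eta=1.0, max_epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, targets):
            o = predict(w, x)
            # Perceptron rule: (t - o) is 0, +1 or -1, so this either leaves w
            # alone, adds eta * x to it, or subtracts eta * x from it.
            w += eta * (t - o) * x
            mistakes += int(t != o)
        if mistakes == 0:   # every example classified correctly
            break
    return w

# Hypothetical toy problem: logical AND, with a constant bias input of 1.
X = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
targets = [0, 0, 0, 1]

w = train_perceptron(X, targets)
predictions = [predict(w, x) for x in X]
print("weights:", w)
print("predictions:", predictions)
```

Multiplying every row of X by a large constant makes each correction proportionally larger, which is the effect the question asks about.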
Recommended reading and reference: Neural Networks: A Systematic Introduction by Raul Rojas, Chapter 4

how to calculate theta in univariate linear regression model?

I have hypothesis function h(x) = theta0 + theta1*x.
How can I select theta0 and theta1 value for the linear regression model?
It is unclear from the question whether you would like to do this by hand (with the underlying math), use a program like Excel, or solve it in a language like MATLAB or Python.
To start, here is a website offering a summary of the math involved for a univariate calculation: http://www.statisticshowto.com/probability-and-statistics/regression-analysis/find-a-linear-regression-equation/
Here, there is some discussion of the matrix formulation of the multivariate problem (I know you asked for univariate but some people find the matrix formulation helps them conceptualize the problem): https://onlinecourses.science.psu.edu/stat501/node/382
We should start with a bit of intuition, based on the level of the question. The goal of a linear regression is to find a set of parameters, in your case the thetas, that minimize the distance between the fitted line and the observed data points (usually the square of this distance). You have two "free" parameters in the equation you defined. First, theta0: this is the intercept. The intercept is the value of the response variable (h(x)) when the input variable (x) is 0. Visually, this is the point where the line crosses the y axis. The second parameter is the slope (theta1), which expresses how much the response variable changes when the input changes. If theta1 = 0, h(x) does not change when x changes. If theta1 = 1, h(x) increases and decreases at the same rate as x. If theta1 = -1, h(x) responds in the opposite direction: if x increases, h(x) decreases by the same amount; if x decreases, h(x) increases by the same amount.
For more information, Mathworks provides a fairly comprehensive explanation: https://www.mathworks.com/help/symbolic/mupad_ug/univariate-linear-regression.html
So after getting a handle on what we are doing conceptually, let's take a stab at the math. We'll need to calculate the standard deviation of our two variables, x and h(x) (here h(x) stands for the observed response values). To calculate the standard deviation, we first calculate the mean of each variable (sum up all the x's and then divide by the number of x's; do the same for h(x)). The standard deviation captures how much a variable differs from its mean. For each x, subtract the mean of x and square the difference. Sum these squared differences up and then divide by the number of x's minus 1. Finally, take the square root. This is your standard deviation.
Using this, we can normalize both variables. For x, subtract the mean of x and divide by the standard deviation of x. Do this for h(x) as well. You will now have two lists of normalized numbers.
For each normalized number, multiply the value by its pair (the first normalized x value with its h(x) pair, and so on for all values). Add these products together and divide by N - 1, where N is the number of pairs (the same divisor used for the standard deviation). This gives you the correlation. To get the least squares estimate of theta1, calculate this correlation value times the standard deviation of h(x) divided by the standard deviation of x.
Given all this information, calculating the intercept (theta0) is easy: all we have to do is take the mean of h(x) and subtract the product (multiply!) of our calculated theta1 and the mean of x.
Phew! All taken care of! We have our least squares solution for those two variables. Let me know if you have any questions! One last excellent resource: https://people.duke.edu/~rnau/mathreg.htm
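For anyone who would rather see the recipe above as code, here is a minimal sketch following the same steps (standard deviations, normalization, correlation, then slope and intercept). The function name `fit_line` and the sample data are assumptions for illustration; `y` plays the role of the observed h(x) values.

```python
import numpy as np

def fit_line(x, y):
    """Least squares theta0, theta1 via the correlation-based recipe."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Sample standard deviations (divide by n - 1).
    sx = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))
    sy = np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))
    # Normalize both variables.
    zx = (x - x.mean()) / sx
    zy = (y - y.mean()) / sy
    # Correlation: sum of paired products divided by n - 1.
    r = np.sum(zx * zy) / (n - 1)
    theta1 = r * sy / sx                      # slope
    theta0 = y.mean() - theta1 * x.mean()     # intercept
    return theta0, theta1

# Hypothetical data that is not exactly on a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]
theta0, theta1 = fit_line(xs, ys)
print("theta0 =", theta0, "theta1 =", theta1)

# Cross-check against NumPy's own least squares fit (returns slope, intercept).
slope, intercept = np.polyfit(xs, ys, 1)
print("polyfit:", intercept, slope)
```

The two results should agree up to floating point error, since both compute the same least squares line.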
If you are asking about the hypothesis function in linear regression, then those theta values can be found by an algorithm called gradient descent. It starts from some initial thetas and repeatedly adjusts them in the direction that reduces the cost function, until the cost stops decreasing.
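A minimal sketch of that idea for h(x) = theta0 + theta1*x, assuming a mean squared error cost, a learning rate of 0.05 and 5000 iterations (these values and the toy data are assumptions chosen for illustration; a learning rate that is too large makes the updates diverge):

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, iterations=5000):
    """Minimize the cost (1 / 2n) * sum((theta0 + theta1 * x - y)^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = theta0 + theta1 * x - y        # h(x) - y for every sample
        grad0 = error.sum() / n                # d cost / d theta0
        grad1 = (error * x).sum() / n          # d cost / d theta1
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1

# Hypothetical data generated from y = 1 + 2x.
gd_theta0, gd_theta1 = gradient_descent([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print("theta0 =", gd_theta0, "theta1 =", gd_theta1)
```

For a problem this small the closed-form least squares solution is simpler; gradient descent matters more when there are many features or a cost without a closed-form minimum.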

Handling zero rows/columns in covariance matrix during em-algorithm

I tried to implement GMMs but I have a few problems during the em-algorithm.
Let's say I've got 3D Samples (stat1, stat2, stat3) which I use to train the GMMs.
One of my training sets for one of the GMMs has in nearly every sample a "0" for stat1. During training I get really small Numbers (like "1.4456539880060609E-124") in the first row and column of the covariance matrix which leads in the next iteration of the EM-Algorithm to 0.0 in the first row and column.
I get something like this:
0.0 0.0 0.0
0.0 5.0 6.0
0.0 2.0 1.0
I need the inverse covariance matrix to calculate the density but since one column is zero I can't do this.
I thought about falling back to the old covariance matrix (and mean) or to replace every 0 with a really small number.
Or is there another simple solution to this problem?
Simply put, your data lies in a degenerate subspace of your actual input space, and GMM in its most generic form is not well suited to such a setting. The problem is that the empirical covariance estimator that you use simply fails for such data (as you said, you cannot invert it). What do you usually do? You change the covariance estimator to a constrained/regularized one. Options include:
Constant-based shrinking: instead of using Sigma = Cov(X) you use Sigma = Cov(X) + eps * I, where eps is a predefined small constant and I is the identity matrix. Consequently you never have zero values on the diagonal, and it is easy to prove that for a reasonable epsilon the result is invertible.
Nicely fitted shrinking, like the Oracle Covariance Estimator or the Ledoit-Wolf Covariance Estimator, which find the best epsilon based on the data itself.
Constrain your Gaussians to, for example, the spherical family, thus N(m, sigma I), where sigma = avg_i( cov( X[:, i] ) ) is the mean variance per dimension. This limits you to spherical Gaussians, but it also solves the above issue.
There are many more solutions possible, but all are based on the same thing: change the covariance estimator in such a way that you have a guarantee of invertibility.
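As an illustration of the first option (constant-based shrinking), here is a minimal sketch. The value eps = 1e-6, the function name `regularized_cov` and the toy samples are assumptions; a suitable eps depends on the scale of your data.

```python
import numpy as np

def regularized_cov(X, eps=1e-6):
    """Empirical covariance of the rows of X plus eps on the diagonal."""
    cov = np.cov(X, rowvar=False)
    return cov + eps * np.eye(cov.shape[0])

# Hypothetical 3-D samples where stat1 is always 0, as in the question.
rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 3))
samples[:, 0] = 0.0

plain = np.cov(samples, rowvar=False)       # first row and column are all zero
shrunk = regularized_cov(samples)
inverse = np.linalg.inv(shrunk)             # now well defined

print("rank of plain covariance:", np.linalg.matrix_rank(plain))
print("smallest eigenvalue after shrinking:", np.linalg.eigvalsh(shrunk).min())
```

The plain covariance has rank 2 and cannot be inverted, while the shrunk one has strictly positive eigenvalues. In an EM loop this would be applied to each component's covariance after every M-step.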
