Why is there a division by 2 in the cost function derivation?

In Linear Regression With One Variable | Cost Function (Andrew Ng's video, at 5:17), the cost function is defined as
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
I understand the sum of squared errors, but why is it divided by 2m rather than just m? Where does the extra factor of 1/2 come from?

When you're calculating the cost function, you're trying to get the mean squared deviation (MSD). If you don't divide by m, it's not really a mean; it's just the sum of squared deviations.
As for the half: the result is nothing but half of the MSD, which you could call the half-MSD. When you take the derivative of the cost function (which is used to update the parameters during gradient descent), the exponent 2 cancels with the 1/2 multiplier, so the derivative is cleaner. These or similar techniques are widely used in math "to make the derivations mathematically more convenient".
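To make the cancellation concrete, here is the standard derivative computation (a routine calculus step, not taken from the video):

$\frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Without the 1/2, the right-hand side would carry a factor of 2/m instead; the constant changes nothing about where the minimum is, only the scale of the gradient.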

IIRC it's so that when you take the derivative you don't need to scale it. It doesn't really make a difference, as the scaled loss has the same minimizer; see: https://stats.stackexchange.com/a/313172

Related

Does the Cost Function matter when CODING Logistic Regression

NOTE: when you see (0) in the functions below, it represents theta, not zero.
I've been studying Andrew Ng's Machine Learning Course, and I have the following inquiry:
(Short version: if one looks at all the mathematical expressions/calculations used for both forward AND backward propagation, it appears that we never use the cost function directly, only its derivative. So what is the importance of the cost function and its choice anyway? Is it purely to evaluate our system whenever we feel like it?)
Andrew mentioned that for logistic regression, using the MSE (Mean Squared Error) cost function wouldn't be good, because applying it to our sigmoid function would yield a non-convex cost function with many local optima, so it is best that we use the following logistic cost function:
$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if $y = 1$, and $-\log(1 - h_\theta(x))$ if $y = 0$
which has two graphs (one for y = 0 and one for y = 1), both of which are convex.
My question is the following. Since it is our objective to minimize the cost function (i.e., have its derivative reach 0), which we achieve using gradient descent, updating our weights with the derivative of the cost function, and in both cases (both cost functions) that derivative is the same:
$\frac{\partial J}{\partial \theta} = \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
So how did the different choice of cost function affect our algorithm in any way? In forward propagation, all we need is
$h_\theta(x^{(i)}) = \mathrm{Sigmoid}(\theta^T x^{(i)})$
which can be calculated without ever computing the cost function. Then in backward propagation and when updating the weights, we always use the derivative of the cost function. So when does the cost function itself come into play? Is it just needed when we want an indication of how well our network is doing? (And if so, why not just rely on the derivative for that?)
Forward propagation does not need the cost function in any way, because you are just applying all your learned weights to the corresponding input.
The cost function is generally used to measure how good your algorithm is, by comparing your model's outcome (i.e., the result of applying your current weights to your input) with the true label of the input (in supervised algorithms). The main objective is therefore to minimize the cost, as (in most cases) you want the difference between the prediction and the true label to be as small as possible. In optimization it is very helpful if the function you want to optimize is convex, because this guarantees that any local minimum is also the global minimum.
To minimize the cost function, gradient descent is used to iteratively update your weights to get closer to the minimum. This is done with respect to the learned weights, so that you can update the model's weights to achieve the lowest possible cost. The backpropagation algorithm is what adjusts the weights using the cost function's gradient in the backward pass.
Technically, you are correct: we do not explicitly use the cost function in any of the calculations for forward propagation and back propagation.
You asked 'what is the importance of the cost function and its choice anyway?'. I have two answers:
The cost function is incredibly important because its gradient is what allows us to update our weights. Although we are only actually computing the gradient of the cost function and not the cost function itself, choosing a different cost function would mean we would have a different gradient, thus changing how we update our weights.
The cost function allows us to evaluate our model performance. It is common practice to plot cost vs epoch to understand how the cost decreases over time.
Your question indicated you essentially understood all of this already, but I hoped to clarify it a bit. Thanks!
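To make the division of labor concrete, here is a minimal logistic-regression training loop (my own sketch, not code from the course): the weight update uses only the gradient, while the cost is computed purely to monitor progress.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train(X, y, alpha=0.1, epochs=1000):
        theta = np.zeros(X.shape[1])
        for epoch in range(epochs):
            h = sigmoid(X @ theta)               # forward propagation
            grad = X.T @ (h - y) / len(y)        # derivative of the cost
            theta -= alpha * grad                # update never touches J itself
            if epoch % 100 == 0:                 # cost used only for evaluation
                h_c = np.clip(h, 1e-12, 1 - 1e-12)
                J = -np.mean(y * np.log(h_c) + (1 - y) * np.log(1 - h_c))
                print(f"epoch {epoch}: cost {J:.4f}")
        return theta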

Question about a new type of confidence interval

I came up with the following result, tested on many data sets, but I do not have a formal proof yet:
Theorem: The width L of any confidence interval is asymptotically equal (as n tends to infinity) to a power function of n, namely L=A / n^B where A and B are two positive constants depending on the data set, and n is the sample size.
See here and here for details. The B exponent seems to be very similar to the Hurst exponent in time series, not only in terms of what it represents, but also in the values that it takes: B=1/2 corresponds to perfect data (no auto-correlation or undesirable features) and B=1 corresponds to "bad data" typically with strong auto-correlations.
Note that B=1/2 is what everyone uses nowadays, assuming observations are independently and identically distributed, with an underlying normal distribution. I also devised a method to make the interval width converge to zero faster: $O(1/n)$ rather than $O(1/\sqrt{n})$. This is also described in section 3.3 of my article on re-sampling (here), and my approach in this context seems closely related to what are called second-order accurate intervals (usually achieved with modern versions of bootstrapping; see here.)
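For reference, the textbook iid normal case already has this power-law form: a two-sided interval for the mean has width

$L = 2\, z_{1-\alpha/2}\, \frac{\sigma}{\sqrt{n}} = \frac{A}{n^{1/2}}, \qquad A = 2\, z_{1-\alpha/2}\, \sigma, \quad B = \tfrac{1}{2}$

so the claim is really about how B departs from 1/2 when the iid assumptions fail.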
My question is whether my theorem is original, ground-breaking, and correct, and how someone would prove it (or refute it).
Example of Confidence Interval
Perl code to produce confidence intervals for the correlation
The first problem is: what do you mean by a confidence interval?
Let's say I do nonparametric estimation of a probability density function with a kernel density estimator.
A confidence interval has no meaning in this setting. However, you can compute something like the "speed" of convergence of your kernel density estimator to your target function. Depending on the distance you choose between functions, you can get different speeds of convergence. For example, the best speed with the $L^{\infty}$ distance involves a $\log(n)$ factor.
By the way, you give yourself a counterexample in your first article.
So for me your theorem cannot hold, for two reasons:
It is not precise: you need to specify exactly what you mean by a confidence interval, and what you mean by "depending on the data set" (does it depend on $n$, the number of observations?).
There is a "counterexample", since the asymptotic speed of convergence of estimators can be more complicated than what you describe.

Why is inference in Markov Random Fields hard?

I'm studying Markov Random Fields, and, apparently, inference in MRF is hard / computationally expensive. Specifically, Kevin Murphy's book Machine Learning: A Probabilistic Perspective says the following:
"In the first term, we fix y to its observed values; this is sometimes called the clamped term. In the second term, y is free; this is sometimes called the unclamped term or contrastive term. Note that computing the unclamped term requires inference in the model, and this must be done once per gradient step. This makes training undirected graphical models harder than training directed graphical models."
Why are we performing inference here? I understand that we're summing over all y's, which seems expensive, but I don't see where we're actually estimating any parameters. Wikipedia also talks about inference, but only about calculating the conditional distribution and needing to sum over all non-specified nodes. But that's not what we're doing here, is it?
Alternatively, any have good intuition on why inference in MRF is difficult?
Sources:
Chapter 19 of ML:PP: https://www.cs.ubc.ca/~murphyk/MLbook/pml-print3-ch19.pdf
The specific passage is quoted above.
When training your CRF, you want to estimate your parameters, $\theta$.
In order to do this, you can differentiate your loss function (Equation 19.38) with respect to $\theta$, set it to 0, and solve for $\theta$.
You can't solve that equation for $\theta$ analytically, though. You can, however, minimise Equation 19.38 by gradient descent. Since the loss function is convex, gradient descent is guaranteed to reach the globally optimal solution when it converges.
Equation 19.41 is the gradient you need to compute in order to do gradient descent. The first term is easy (and computationally cheap) to compute, as you are summing over the observed values of y. The second term, however, requires inference: there you are not summing over the observed value of y as in the first term. Instead, you need the expectation of the features under the model, which means summing over all possible configurations of y; that summation is exactly the inference step, and in a general MRF it is exponential in the number of nodes.
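Schematically (my paraphrase of the standard CRF gradient, in the spirit of Equation 19.41, not a verbatim copy), with feature function $\phi$:

$\frac{\partial \ell}{\partial \theta} = \sum_i \phi(x^{(i)}, y^{(i)}) \;-\; \sum_i \mathbb{E}_{p(y \mid x^{(i)}, \theta)}\!\left[ \phi(x^{(i)}, y) \right]$

The first (clamped) sum just reads off observed data; the second (unclamped) expectation ranges over every possible labeling y, which is why each gradient step requires inference.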

max-margin linear separator using libsvm

I have a set of N data points X with +/- labels for which I'd like to calculate the max-margin linear separator (aka classifier, hyperplane) or fail if no such linear separator exist.
I do not want to avoid overfitting in the context of this question, as I handle that elsewhere. So: no slack variables; no cross-validation; no limits on the number of support vectors; just find the max-margin separator or fail.
How do I use libsvm to do so? I believe you can't give C = 0 in C-SVM, and you can't give nu = 1 in nu-SVM.
Related question (which I think didn't provide an answer):
Which of the parameters in LibSVM is the slack variable?
In the case of C-SVM, you should use a linear kernel and a very large C value (or nu = 0.999... for nu-SVM). If you still have slack with this setting, your data is probably not linearly separable.
Quick explanation: the C-SVM objective tries to find the hyperplane with maximum margin and lowest misclassification cost at the same time. The misclassification cost in the C-SVM formulation is defined as the distance from a misclassified point to the correct side of the hyperplane, multiplied by C. If you increase the C value (or the nu value for nu-SVM), every misclassified point becomes too costly, and a hyperplane that separates the data perfectly will be preferred by the optimization.
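As a sketch (using scikit-learn's SVC, which wraps libsvm; the huge-C trick is the standard approximation of a hard margin, not an exact no-slack mode):

    from sklearn.svm import SVC  # scikit-learn's wrapper around libsvm

    def hard_margin_or_fail(X, y, big_c=1e10):
        # A very large C makes any slack prohibitively expensive,
        # approximating the hard-margin (no-slack) separator.
        clf = SVC(kernel="linear", C=big_c)
        clf.fit(X, y)
        if clf.score(X, y) < 1.0:  # training error => slack was needed
            raise ValueError("data does not appear to be linearly separable")
        return clf

If the classifier still misclassifies training points even at this C, that is the "fail" case from the question.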

Can someone explain to me the difference between a cost function and the gradient descent equation in logistic regression?

I'm going through the ML Class on Coursera on Logistic Regression and also the Manning Book Machine Learning in Action. I'm trying to learn by implementing everything in Python.
I'm not able to understand the difference between the cost function and the gradient. There are examples on the net where people compute the cost function, and there are other places where they skip it and just use the gradient descent update $w := w - \alpha \, \nabla_w f(w)$.
What is the difference between the two if any?
Whenever you train a model with your data, you are actually producing some new (predicted) values for a specific feature. However, that specific feature already has real values in the dataset. We know that the closer the predicted values are to the corresponding real values, the better the model.
Now, we use the cost function to measure how close the predicted values are to their corresponding real values.
We should also note that the weights of the trained model are responsible for accurately predicting the new values. Imagine that our model is y = 0.9*X + 0.1; then the predicted value is nothing but (0.9*X + 0.1) for different Xs.
[0.9 and 0.1 in the equation are just arbitrary values for illustration.]
So, taking Y as the real value corresponding to this X, the cost formula measures how close (0.9*X + 0.1) is to Y.
We are responsible for finding the best weights (0.9 and 0.1 here) for our model, the ones that yield the lowest cost (i.e., predicted values closest to the real ones).
Gradient descent is an optimization algorithm (there are others), and its responsibility is to find the minimum cost by trying the model with different weights, or indeed, by updating the weights.
We first run our model with some initial weights; gradient descent then updates the weights, and we compute the cost of the model with those weights over thousands of iterations to find the minimum cost.
One point to note: gradient descent is not minimizing the weights, it is just updating them. What the algorithm seeks is the minimum cost.
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
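To separate the two ideas in code, here is a toy sketch (my own illustration, with made-up numbers): the cost function is a thing you evaluate, while gradient descent is a loop that uses the cost's derivative to move the parameter downhill.

    import numpy as np

    # Toy data for a one-parameter model: predict y = w * x.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.1, 3.9, 6.2])

    def cost(w):
        # The cost function: the quantity we want to minimize.
        return np.sum((w * x - y) ** 2)

    def gradient(w):
        # Derivative of the cost w.r.t. w (direction of steepest increase).
        return np.sum(2 * (w * x - y) * x)

    # Gradient descent: the *method* that minimizes the cost function.
    w, alpha = 0.0, 0.01
    for _ in range(100):
        w -= alpha * gradient(w)
    print(w, cost(w))  # w converges to about 2.04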
It's strange to think about, but there is more than one measure of how "accurately" a line fits a set of data points.
To assess how accurately a line fits the data, we have a "cost" function which can compare predicted vs. actual values and provide a "penalty" for how wrong it is:
penalty = cost_function(predicted, actual)
A naive cost function might just take the difference between the predicted and actual.
More sophisticated functions will square the value, since we'd rather have many small errors than one large error.
Additionally, each point has a different "sensitivity" to moving the line. Some points react very strongly to movement. Others react less strongly.
Often, you can make a tradeoff, moving TOWARD a point that is sensitive and AWAY from a point that is NOT sensitive. In that scenario, you get more than you give up.
The "gradient" is a way of measuring how sensitive each point is to moving the line.
This article does a good job of describing WHY there is more than one measure, and WHY some points are more sensitive than others:
https://towardsdatascience.com/wrapping-your-head-around-gradient-descent-with-pictures-3fbd810235f5?source=friends_link&sk=7117e5de8c66bd4a4c2bb2a87a928773
Let's take the example of a logistic regression model for binary classification. The output (predicted value) of the model for any given input will be offset (deviate) from the actual output (expected value) during training. So the model needs to be trained with minimal error (loss), so that it can perform well with high accuracy.
The function used to find the parameter values (m and c in the case of the linear equation y = mx + c) at which the minimal error (loss) occurs is called the cost function / loss function. "Loss function" refers to the loss for a single row/record of the training sample, while "cost function" refers to the loss over the entire training dataset.
Now, how do we find the parameter values (m and c in our case) at which the minimum loss occurs? By using the gradient descent algorithm and its update equation, which helps us find the point at which the minimum loss occurs; the parameter values at that point are used for the model (say y = 0.5x + 2, where m = 0.5 and c = 2 are the values at which the loss is minimum).
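The loss-vs-cost distinction above maps directly to code; a minimal sketch (my own naming, not from the course):

    def loss(y_pred, y_true):
        # Loss: the error for a single training example (squared error here).
        return (y_pred - y_true) ** 2

    def cost(Y_pred, Y_true):
        # Cost: the loss aggregated over the entire training set.
        return sum(loss(p, t) for p, t in zip(Y_pred, Y_true)) / len(Y_true)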
The cost function is, roughly, the cost at which you are building your model; for a good model that cost should be as low as possible. To minimize the cost function we use the gradient descent method, which gives the coefficient values that achieve the minimum cost.
