I am new to machine learning and statistics and am confused with the cost function & Mean Squared Error (MSE) formulas.
In Machine learning class at stanford - coursera, Cost function formula is mentioned as shown below:
Cost Function formula
And at some other sources, cost function is termed as mean squared error (MSE) and it is given with the formula as shown in picture below.
Mean Squared Error formula
What will be the Cost Function formula & is cost function and MSE different or same. Please let me know why are the formulas are different.
Thanks in Advance
Raj
The cost function is just telling you how bad you're doing. If it's high, that means your prediction is far away from the actual value. if it's zero, it means that you are predicting every single output correctly. In coursera version the sigma term is divided by 2m and in the other version it is divided by m (m is the number of training examples). It doesn't matter in what the cost function is doing, since m is just a constant. Dividing by 2m is just a mathematical convenience. For example, if the sigma term is 100 and m is 10, cost function in coursera version is going to be 5 and in the other version it's going to be 10. Since you're trying to make cost function 0, it doesn't matter what value it returns. You just need a tool to measure how bad you're doing.
Related
NOTE: when you see (0) in the functions it represents Theta not Zero
I've been studying Andrew Ng's Machine Learning Course, and I have the following inquery:
(Short Version: If one were to look at all the mathematical expressions/calculations used for both Forward AND Backward propagation, then it appears to me that we never use the Cost Function directly, but its Derivative , so what is the importance of the cost function and its choice anyway? is it purely to evaluate our system whenever we feel like it?)
Andrew mentioned that for Logistic Regression, using the MSE (Mean Squared Error) Cost function
wouldn't be good, because applying it to our Sigmoid function would yield a non-convex cost function that has a lot of Local Optima, so it is best that we use the following logistic cost function:
Which will have 2 graphs (one for y=0 and one for y=1), both of which are convex.
My question is the following, since it is our objective to minimize the cost function (aka have the Derivative reach 0), which we achieve by using Gradient Descent, updating our weights using the Derivative of the Cost Function, which in both cases (both cost functions) is the same derivative:
dJ = (h0(x(i)) - y(i)) . x(i)
So how did the different choice of cost function in this case effect our algorithm in any way? because in forward propagation, all we need is
h0(x(i)) = Sigmoid(0Tx)
which can be calculated without ever needing to calculate the cost function, then in backward propagation and in updating the weights, we always use the derivative of the cost function, so when does the Cost Function itself come into play? is it just necessary when we want an indication of how well our network is doing? (then why not just depend on the derivative to know that)
The forward propagation does not need the cost function in any way because you just applying all your learned weights to the corresponding input.
The cost function is generally used to measure how good your algorihm is by comparing your models outcome (therefore applying your current weights to your input) with the true label of the input (in supervised algorithms). The main objective is therefore to minimize the cost function error as (in most cases) you want the difference of the prediction and the true label as small as possible. In optimization it is pretty helpful if your function you want to optimize is convex because it guarantees that if you find a local minimum it is at the same time the global minimum.
For minimizing the cost function, gradient descent is used to iteratively update your weights to get closer to the minimum. This is done w.r.t to the learned weights such that you are able to update your weights of the model for achieving the lowest possible costs. The backpropagation algorithm is used to adjust the weights using the cost function in the backward pass.
Technically, you are correct: we do not explicitly use the cost function in any of the calculations for forward propagation and back propagation.
You asked 'what is the importance of the cost function and its choice anyway?'. I have two answers:
The cost function is incredibly important because its gradient is what allows us to update our weights. Although we are only actually computing the gradient of the cost function and not the cost function itself, choosing a different cost function would mean we would have a different gradient, thus changing how we update our weights.
The cost function allows us to evaluate our model performance. It is common practice to plot cost vs epoch to understand how the cost decreases over time.
Your answer indicted you essentially understood all of this already but I hoped to clarify it a bit. Thanks!
I'm interested in learning machine learning but only if I know it's possible to find the max of an unknown function that get some numbers and return a number and for which a max exists.
Very unspecific question and not exclusively related to machine learning and AI.
Let's say you have a "black-box"-function f(x1,x2,x,...,xn) which takes n input variables and returns a scalar value, but you don't know it exactly, all you know is it has a maximum.
What you can do is use so called Evolution Strategy optimization techniques like CMA-ES. You basically sample random points x from a sampling function (e.g. gaussian), evaluate the points, rank them by their evaluation, adapt your sample function (e.g. by maximum likelihood of you k best points) and resample. After doing this over and over the mean of your gaussian will be a maximum (however not necessarily the global one!)
This however has nothing to do with machine learning or AI. In ML/AI you usually have points x and results y and you try to find the function f(x)=y that matches your points.
So I've done a coursera ml course, and now i am looking at scikit-learn logistic regression which is a little bit different. I've been using sigmoid function and a cost function was divided to two separate cases when y=0 and y=1. But scikit learn have one function(I've found out that this is Generalised logistic function) which really don't make sense for me.
http://scikit-learn.org/stable/_images/math/760c999ccbc78b72d2a91186ba55ce37f0d2cf37.png
I am sorry i don't have enough reputation to post image.
So the main concern of the function is the case when y=0, than the cost function always have this log(e^0+1) value, so it does not matter what was the X or w. Can someone explain me that ?
If I am correct the $y_i$ you have in this formula can only assume the values -1 or 1 (as derived in the link given by #jinyu0310). Normally you use this cost function (with regularisation)
(sorry for the equation inserted as image, but cannot use latex here, the image comes from goo.gl/M53u44)
So you always have the two terms that plays a role when yi = 0 or when yi=1. I am trying to find a better explanation on the notation that scikit uses in this formula but so far no luck.
Hope that helps you. Is just another way of writing the cost function with a regularisation factor. Keep also in mind that a constant factor in front of everything will not play a role in the optimisation process. Since you want to find the minimum and are not interested in overall factors that multiply everything.
I understand the derivation of the kernelized perceptron function, but I'm trying to figure out the intuition behind the final formula
f(X) = sum_i (alpha_i*y_i*K(X,x_i))
Where (x_i,y_i) are all the samples in the training data, alpha_i is the number of times we've made a mistake on that sample, and X is the sample we're trying to predict (during training or otherwise). Now, I understand why the kernel function is considered to be a measure of similarity (since it's a dot product in a higher dimensional space), but what I don't get is how this formula comes together.
My original attempt was that we're trying to predict a sample based on how similar it is to the other samples - and multiply it by y_i so that it contributes the correct sign (points that are closer are better indicators of the label than points that are farther). But why should a sample that we've made several mistakes on contribute more?
tl;dr: In a Kernelized perceptron, why should a sample that we've made several mistakes on contribute more to the prediction than ones we haven't made mistakes on?
My original attempt was that we're trying to predict a sample based on how similar it is to the other samples - and multiply it by y_i so that it contributes the correct sign (points that are closer are better indicators of the label than points that are farther).
This is pretty much what's going on. Although the idea is if alpha_i*y_i*K(X,x_i) already is well classified, then you don't need to update it further.
But if the point is misclassified we need to update it. The best way would be in the opposite direction right? that's if the result is negative we should be adding a possitive quantity (y_i). If the result is possitive (and it is missclassified) then we want to sum a negative value (y_i again).
As you can see, y_i already give us the right update direction and hence we use a misclassification counter to give a magnitude to that update.
I'm going through the ML Class on Coursera on Logistic Regression and also the Manning Book Machine Learning in Action. I'm trying to learn by implementing everything in Python.
I'm not able to understand the difference between the cost function and the gradient. There are examples on the net where people compute the cost function and then there are places where they don't and just go with the gradient descent function w :=w - (alpha) * (delta)w * f(w).
What is the difference between the two if any?
Whenever you train a model with your data, you are actually producing some new values (predicted) for a specific feature. However, that specific feature already has some values which are real values in the dataset. We know the closer the predicted values to their corresponding real values, the better the model.
Now, we are using cost function to measure how close the predicted values are to their corresponding real values.
We also should consider that the weights of the trained model are responsible for accurately predicting the new values. Imagine that our model is y = 0.9*X + 0.1, the predicted value is nothing but (0.9*X+0.1) for different Xs.
[0.9 and 0.1 in the equation are just random values to understand.]
So, by considering Y as real value corresponding to this x, the cost formula is coming to measure how close (0.9*X+0.1) is to Y.
We are responsible for finding the better weight (0.9 and 0.1) for our model to come up with a lowest cost (or closer predicted values to real ones).
Gradient descent is an optimization algorithm (we have some other optimization algorithms) and its responsibility is to find the minimum cost value in the process of trying the model with different weights or indeed, updating the weights.
We first run our model with some initial weights and gradient descent updates our weights and find the cost of our model with those weights in thousands of iterations to find the minimum cost.
One point is that gradient descent is not minimizing the weights, it is just updating them. This algorithm is looking for minimum cost.
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
It's strange to think about it, but there is more than one measure for how "accurately" a line fits to data points.
To access how accurately a line fits the data, we have a "cost" function which which can compare predicted vs. actual values and provide a "penalty" for how wrong it is.
penalty = cost_funciton(predicted, actual)
A naive cost function might just take the difference between the predicted and actual.
More sophisticated functions will square the value, since we'd rather have many small errors than one large error.
Additionally, each point has a different "sensitivity" to moving the line. Some points react very strongly to movement. Others react less strongly.
Often, you can make a tradeoff, and move TOWARD a point that is sensitive, and AWAY from a point that is NOT sensitive. In that scenario , you get more than you give up.
The "gradient" is a way of measuring how sensitive each point is to moving the line.
This article does a good job of describing WHY there is more than one measure, and WHY some points are more sensitive than others:
https://towardsdatascience.com/wrapping-your-head-around-gradient-descent-with-pictures-3fbd810235f5?source=friends_link&sk=7117e5de8c66bd4a4c2bb2a87a928773
Let's take an example of logistic regression model for binary classification. Output(Predicted Value) of the model for any given input will be offset(deviation) with respect to the actual output(Expected Value) while training. So, the model needs to be trained with minimal error(loss) so that model can perform well with high accuracy.
The function used to find the parameters(m and c in case of linear equation, y = mx+c) value at which the minimal error(loss) occurs is called Cost Function/Loss Function. Loss function is a term used to find the loss for single row/record of the training sample and Cost function is a term used to find the loss for the entire training dataset.
Now, How do we find the parameter(m and c in our case) values at which the minimum loss occurs? Its by using gradient descent algorithm using the equation, which helps us to find the points at which the minimum loss occurs and the parameters values at this points are considered for model building (let say y = 0.5x + 2) where m=.5 and c=2 are the points at which the loss is minimum.
Cost function is something is like at what cost you are building your model for a good model that cost should be minimum. To find the minimum cost function we use gradient descent method. That give value of coefficients to determine minimum cost function