What is the partial derivative for MAE? I understand that for mean squared error (MSE) the partial derivative with respect to some x1 would be -x1 * (y_pred - y_actual), assuming the following version of MSE is used.
What is the partial derivative with respect to x1 when the loss function is MAE instead of MSE? I've been trying to find this but I haven't had any luck. Would it just be -(y_pred - y_actual) when x1 is greater than 0, and (y_pred - y_actual) when x1 is less than 0? Or is there something else that I'm missing?
Unless you have a single neuron, there is no fixed formula for the partial derivative of the loss function with respect to each weight; it depends entirely on the connections between the neurons in the network. And there isn't just one partial-derivative formula: each weight has its own.
For a small network with, say, 2 or 3 layers, apply the chain rule and the sum rule to find the partial derivative of the loss function by hand; otherwise, the dynamic programming of backpropagation is needed.
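To make the MAE case concrete, here is a minimal NumPy sketch (my own example, assuming a linear model y_pred = w · x + b, which the question does not specify). The derivative of |y_pred - y_actual| with respect to the prediction is sign(y_pred - y_actual) (with a subgradient of 0 at zero), and the chain rule multiplies that sign by the corresponding input, so the gradient with respect to a weight is x1 * sign(y_pred - y_actual).

```python
import numpy as np

# Sketch: gradient of MAE for a linear model y_pred = X @ w + b on a batch.
# d/d(y_pred) |y_pred - y| = sign(y_pred - y)  (subgradient 0 when equal),
# and the chain rule multiplies that sign by the input feature.
def mae_gradients(X, y, w, b):
    y_pred = X @ w + b
    sign = np.sign(y_pred - y)                  # -1, 0, or +1 per sample
    grad_w = (X * sign[:, None]).mean(axis=0)   # average over the batch
    grad_b = sign.mean()
    return grad_w, grad_b

X = np.array([[1.0, 2.0], [3.0, 1.0]])
y = np.array([2.0, 4.0])
print(mae_gradients(X, y, np.zeros(2), 0.0))    # (array([-2. , -1.5]), -1.0)
```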
I have a question about cost functions in Machine Learning and their graphs. For instance, look at the following images. What shapes them, the cost function or the model? I thought it was the cost function, like MSE in the first image. For the second image I have no idea what function has that shape. All this is very confusing to me because in the book "Hands-On Machine Learning... 2nd Edition", page 122, it is written:
Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function...
and
This implies that there are no local minima, just one global minimum.
What I don't understand is why MSE is convex only with a Linear Regression model if it is quadratic. I mean, I believe that function will always have that "bowl" shape because it is quadratic. Or maybe not always, because if that were the case it would be easy to choose MSE for any model and I would always find the global minimum, since the main goal in a machine learning process is to minimize the value of the cost function.
Why is MSE convex only with a Linear Regression model if it is quadratic? I mean, I believe that function will always have that "bowl" shape because it is quadratic.
You're right.
The MSE cost function will always be convex over θ.
It will also always be convex over x if the model, θ = f(x), is linear.
It can, however, be non-convex over x if the model is non-linear.
For example, if the model is θ = x², then
MSE(θ) = (θ′ − θ)² = (θ′ − x²)²
will have two global minima, one at x = √|θ'| and the other at x = -√|θ'|. (Kind of "w" shape rather than "bowl" shape.)
But over the axis of θ, there is only one global minimum at θ = θ'.
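As a small numerical sketch of this (using an arbitrary target value θ′ = 4, which is not from the original answer), evaluating the loss on a grid over x shows the two minima at x = ±2, while over θ the same loss is an ordinary convex parabola:

```python
import numpy as np

# Sketch with an arbitrary target theta_prime = 4: over x the loss
# (theta_prime - x**2)**2 has two global minima (a "w" shape); over theta
# the loss (theta_prime - theta)**2 is a single convex bowl.
theta_prime = 4.0
x = np.linspace(-3.0, 3.0, 601)
loss_over_x = (theta_prime - x**2) ** 2

print(x[loss_over_x < 1e-10])   # roughly [-2.  2.]
```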
Determine the regression line for the below data points:
(x1, y1) = (1, 4), (x2, y2) = (2, 3), (x3, y3) = (3, 9)
i.e. the function h(x) = w + hx that minimizes the squared error loss on this data.
This question just boils down to math.
First, we write our error function, the sum of squared errors of the line y = m*x + b over the data points:
E(m, b) = sum_i (y_i - (m*x_i + b))^2
The derivative of the error function tells us how the error changes as we change the variables. Because there are two variables (m and b), we take a partial derivative with respect to each. When both partial derivatives are equal to zero, we know we have reached a minimum (and because we're taking the derivative of a quadratic, we know there is a single global minimum).
Writing out each term in the sum and setting each partial derivative to zero gives us two linear equations in m and b.
Two variables, two equations means we can solve for both!
In your case we have h=m and w=b
As a double check, Desmos is a great tool https://www.desmos.com/calculator
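As a sketch of the same computation (not part of the original answer), the result of solving those two equations can also be checked numerically with NumPy's least-squares polynomial fit:

```python
import numpy as np

# Check: ordinary least squares on the three given points.
# np.polyfit with degree 1 returns [m, b] for y ≈ m*x + b.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 3.0, 9.0])

m, b = np.polyfit(x, y, 1)
print(m, b)   # m = 2.5, b ≈ 0.3333, i.e. h(x) = 1/3 + 2.5x
```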
Here is an open question:
Suppose I need to predict a student's exam score given some inputs, e.g. hours spent on prep, previous scores, etc. How should I bound the output between 0 and 100? What are the best practices out there?
Thanks!
Edit:
Since the answers are mostly concerned about bounding model output after we have the predictions, is it possible to train the model beforehand such that this bound is implicitly learned by the model?
You would train an Isotonic Regression model: http://scikit-learn.org/stable/modules/generated/sklearn.isotonic.IsotonicRegression.html
Or you could simply clip the predicted values that are out of bounds.
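For the clipping option, a minimal NumPy sketch (the prediction values here are made up):

```python
import numpy as np

# Clip already-computed predictions to the valid score range 0-100.
predictions = np.array([-3.2, 41.7, 87.9, 104.5])
bounded = np.clip(predictions, 0, 100)
print(bounded)   # [  0.   41.7  87.9 100. ]
```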
It is general practice, when training on multi-flavored data, to scale it appropriately between 0 and 1. So, for example, say your test data was:
[input: [10 hrs studying, 100% on last test], output: [95% on this test] ]
then you should first scale both input and output by dividing by the greatest numerical value in each of their elements, or by the greatest possible value:
input = input/input.max
output = output/100
[input: [0.1 , 1], output: [0.95] ]
When you are done training and want to predict test scores, simply multiply the output by 100 and you are done.
BTW what you want to do is well documented on stephenwelch's Neural Network Youtube series.
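A rough sketch of that scale-then-rescale workflow (the numbers and the final y_pred_scaled value are made up for illustration):

```python
import numpy as np

# Divide each input feature by its maximum, divide the target by 100,
# train on the scaled data, then multiply the model output by 100.
X = np.array([[10.0, 100.0],     # hours studied, last test score (%)
              [ 2.0,  60.0]])
y = np.array([95.0, 50.0])       # score on this test (%)

X_scaled = X / X.max(axis=0)     # each column now lies in [0, 1]
y_scaled = y / 100.0

# ... train a model on (X_scaled, y_scaled) ...

y_pred_scaled = 0.8              # whatever the trained model outputs
y_pred = y_pred_scaled * 100.0   # back to the 0-100 scale
print(X_scaled, y_scaled, y_pred)
```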
You can do either Normalisation or Standardisation. Normalisation transforms your values into [0, 1]; Standardisation rescales them to zero mean and unit variance.
I am not sure why you need the range to be 0-100, but if it is really so, you can multiply the normalised values by 100 to get that range after the above transformation.
Normalise: Here each value of your feature column is converted like so:
X_new = (X - X_min) / (X_max - X_min)
where X_min and X_max are min and max values in the feature.
Standardise: Here each value of your feature column is converted like so:
X_new = (X - Mean) / StandardDeviation
where Mean and StandardDeviation are the mean and SD values of your feature.
Check which one gives you better results. If your data has extreme outliers, Standardisation might give better results.
In sklearn, you can use sklearn.preprocessing.MinMaxScaler or sklearn.preprocessing.StandardScaler to do the conversions.
HTH
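A small sketch of both options in scikit-learn (the feature values are made up): MinMaxScaler implements the (X - X_min) / (X_max - X_min) formula and StandardScaler the (X - Mean) / StandardDeviation one.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])     # a single feature column

print(MinMaxScaler().fit_transform(X).ravel())    # values in [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
```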
I am new to ML and I am not sure how to solve this problem.
Could someone tell me how to solve this problem of finding the values in a step-by-step manner?
From a newcomer's viewpoint you can actually just test:
h1=0.5+0.5x
h2=0+0.5x
h3=0.5+0x
h4=1+0.5x
h5=1+x
Then check which of h1..h5 gives exactly the observed values of y (0.5, 1, 2, 0) for the given set of independent variables x (1, 2, 4, 0).
You can answer that by plugging the sample values of x into each of the above equations.
I hope I made it simple enough.
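A small sketch of that check in Python, using the x and y values quoted above:

```python
# Evaluate each candidate h on the data and report the one that fits exactly.
x = [1, 2, 4, 0]
y = [0.5, 1, 2, 0]

candidates = {
    "h1": lambda v: 0.5 + 0.5 * v,
    "h2": lambda v: 0.0 + 0.5 * v,
    "h3": lambda v: 0.5 + 0.0 * v,
    "h4": lambda v: 1.0 + 0.5 * v,
    "h5": lambda v: 1.0 + 1.0 * v,
}

for name, h in candidates.items():
    if all(h(xi) == yi for xi, yi in zip(x, y)):
        print(name, "fits the data exactly")   # prints: h2 fits the data exactly
```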
Here is the catch: it's one of the easiest problems in machine learning.
Just see that we have to create a linear regression model to fit the following data:
STEP 1: UNDERSTANDING THE PROBLEM
As mentioned at the end of the question, the model should fit the data exactly.
We have to find theta0 and theta1 such that, for a given value of x, Htheta(x) = theta0 + theta1*x gives the correct value of y.
STEP 2: FINDING THETA1
From these m examples, take any 2, say (x1, y1) and (x2, y2).
Htheta(x2) - Htheta(x1) = theta1*x2 - theta1*x1 ----- subtracting the two equations eliminates theta0
Htheta(x2) = y2 and Htheta(x1) = y1 (the y corresponding to each x in the data, since the parameters fit the data exactly), so
(y2 - y1)/(x2 - x1) = theta1 ---- factor out theta1 and divide both sides by (x2 - x1)
From this:
theta1 = 0.5
STEP 3: CALCULATING THETA0
Take any random example and put the values of theta1, y and x in this equation
y = theta1*x + theta0
theta0 will come out to be 0
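Here is a short sketch of STEP 2 and STEP 3 in Python, using the same data points:

```python
# STEP 2: slope from any two examples; STEP 3: intercept from one example.
x = [1, 2, 4, 0]
y = [0.5, 1, 2, 0]

theta1 = (y[1] - y[0]) / (x[1] - x[0])   # (1 - 0.5) / (2 - 1) = 0.5
theta0 = y[0] - theta1 * x[0]            # 0.5 - 0.5 * 1 = 0

# Verify that the fit is exact on all four examples.
assert all(theta0 + theta1 * xi == yi for xi, yi in zip(x, y))
print(theta0, theta1)   # 0.0 0.5
```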
My approach would be to view these points by plotting a graph of the x, y values. Since the points lie on a straight line through the origin, calculate tan(theta) using normal trigonometry, which in this case is y/x (since it's mentioned they fit perfectly!), e.g.:
tan(theta) = 0.5/1 = 1/2 (arctan(1/2) ≈ 0.46 rad is the angle of the line)
The parameter theta1 is the slope itself, i.e. 0.5.
Note:- This is not a scalable approach but just some maths fun! Sorry.
In general you would use some non-iterative algorithmic approach (probably based on solving a system of linear equations) or some iterative approach like GD (gradient descent), but it's simpler here, as it's already given that there is a perfect fit.
A perfect fit means: a loss/error of zero.
A loss of zero implies that theta0 (the intercept) needs to be zero, or else sample 4 (the last one) induces a loss.
The overall loss is the sum of the sample losses, and each component is non-negative -> we can't tolerate a loss anywhere.
With theta0 fixed at zero, sample 4 is satisfied by an infinite number of slopes, all producing no loss.
But sample 1 shows that theta1 has to be 0.5 to induce no loss.
Check the others: the fit is perfect.
One assumption I made:
Gradient descent will converge to the optimal solution (which is not always true, even for convex optimization problems; it depends on the learning rate; one might use line searches to prove convergence based on some assumptions about the problem; but all that is irrelevant here).
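For completeness, a small sketch of the gradient-descent route on this data (my own code, with an arbitrarily chosen learning rate); it converges to roughly theta0 = 0 and theta1 = 0.5, matching the reasoning above:

```python
import numpy as np

# Plain gradient descent on the MSE of h(x) = theta0 + theta1*x.
x = np.array([1.0, 2.0, 4.0, 0.0])
y = np.array([0.5, 1.0, 2.0, 0.0])

theta0, theta1, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    err = theta0 + theta1 * x - y
    theta0 -= lr * 2 * err.mean()          # d(MSE)/d(theta0)
    theta1 -= lr * 2 * (err * x).mean()    # d(MSE)/d(theta1)

print(theta0, theta1)   # approximately 0.0 and 0.5
```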
I'm trying to implement a neural network. I'm using backpropagation to compute the gradients. After obtaining the gradients, I multiply them by the learning rate and subtract the result from the corresponding weights (basically trying to apply gradient descent; please tell me if this is wrong).
So the first thing I tried after having backpropagation and gradient descent ready was to train a simple XOR classifier, where the inputs can be (0,0), (1,0), (0,1), (1,1) and the corresponding outputs are 0, 1, 1, 0. My neural network contains 2 input units, 1 output unit and one hidden layer with 3 units. When training it with a learning rate of 3.0 for >100 iterations (I even tried >5000), the cost drops until a specific point where it gets stuck and remains constant. The weights are randomly initialized each time I run the program, but it always gets stuck at the same specific cost. After the training is finished, I tried to run my neural network on any of the above inputs and the output is always 0.5000.

I thought about changing the inputs and outputs so they are (-1,-1), (1,-1), (-1,1), (1,1) and the outputs -1, 1, 1, -1. Now, when trained with the same learning rate, the cost drops continuously no matter the number of iterations, but the results are still wrong and they always tend to be very close to 0. I even tried to train it for an insane number of iterations and the results are the following: [iterations: (20kk), inputs: (1, -1), output: (1.6667e-08)] and also [iterations: (200kk), inputs: (1, -1), output: (1.6667e-09)]; I also tried inputs (1,1) and others, and the output is still very close to 0.

It seems like the output is always mean(min(y), max(y)), no matter in what form I provide the input/output. I can't figure out what I'm doing wrong; can someone please help?
There are so many places where you might be wrong:
check your gradients numerically
you have to use nonlinear hidden units to learn XOR - do you have non-linear activation there?
you need a bias neuron, do you have one?
minor things that should not cause the mentioned problem, but are worth fixing either way:
do you have sigmoidal activation in the output node (as your network is a classifier)?
do you train with a cross-entropy cost (although this is a minor problem)?
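To illustrate the checklist above, here is a minimal NumPy sketch (my own toy code, not the asker's implementation) of a 2-3-1 network with a bias per neuron, sigmoid hidden and output units, and plain gradient descent on MSE. Most random initialisations reach outputs close to 0/1 on XOR, though an unlucky draw can still stall, which is exactly the kind of stuck behaviour described in the question.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 1.0, (2, 3)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(0.0, 1.0, (3, 1)), np.zeros(1)   # hidden -> output
lr = 1.0

for _ in range(20000):
    h = sigmoid(X @ W1 + b1)                 # forward pass
    out = sigmoid(h @ W2 + b2)

    d_out = (out - y) * out * (1 - out)      # error signal at the output
    d_h = (d_out @ W2.T) * h * (1 - h)       # backpropagated to the hidden layer

    W2 -= lr * h.T @ d_out / len(X);  b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);    b1 -= lr * d_h.mean(axis=0)

# Typically close to [0, 1, 1, 0]; re-run with another seed if it stalls.
print(out.round(3).ravel())
```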