How to find the value of theta 0 and theta 1? [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I am new to ML and am not sure how to solve this problem.
Could someone show me, step by step, how to find these values?

From a newcomer's point of view, you can simply test each candidate:
h1=0.5+0.5x
h2=0+0.5x
h3=0.5+0x
h4=1+0.5x
h5=1+x
Then check which of h1..h5 gives exactly the observed values y = (0.5, 1, 2, 0) for the given values of the independent variable x = (1, 2, 4, 0).
You can answer that by plugging the sample values of x into the equations above.
I hope I made it simple enough.
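The check described above can be sketched in a few lines of Python. This is only an illustration: the dictionary `candidates` and the helper `fits_exactly` are names made up for this sketch, and the data are the x and y values quoted in this answer.

```python
# Observed data from the question: x values and the y values they should map to.
xs = [1, 2, 4, 0]
ys = [0.5, 1, 2, 0]

# Candidate hypotheses h(x) = theta0 + theta1 * x, stored as (theta0, theta1).
candidates = {
    "h1": (0.5, 0.5),
    "h2": (0.0, 0.5),
    "h3": (0.5, 0.0),
    "h4": (1.0, 0.5),
    "h5": (1.0, 1.0),
}

def fits_exactly(theta0, theta1):
    """Return True if theta0 + theta1 * x reproduces every observed y."""
    return all(abs(theta0 + theta1 * x - y) < 1e-12 for x, y in zip(xs, ys))

# Keep only the candidates that reproduce all four observations.
perfect = [name for name, (t0, t1) in candidates.items() if fits_exactly(t0, t1)]
print(perfect)
```

Only the candidate with theta0 = 0 and theta1 = 0.5 survives the check.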

This is one of the easiest problems in machine learning.
We have to create a linear regression model that fits the data given in the question.
STEP 1: UNDERSTANDING THE PROBLEM
As stated at the end of the question, the model should fit the data exactly.
We have to find theta0 and theta1 such that, for each given value of x, Htheta(x) gives the correct value of y.
STEP 2: FINDING THETA1
From the m examples, take any two examples (x1, y1) and (x2, y2) with different x values.
Htheta(x2) - Htheta(x1) = theta1*x2 - theta1*x1
(subtracting the two equations eliminates theta0)
Htheta(x2) = y2 and Htheta(x1) = y1
(because the parameters fit the data exactly, each prediction equals the y that corresponds to that x)
(y2 - y1)/(x2 - x1) = theta1
(factoring out theta1 and dividing both sides of the equation by (x2 - x1))
From this:
theta1 = 0.5
STEP 3: CALCULATING THETA0
Take any example and substitute the values of theta1, y and x into this equation:
y = theta1*x + theta0
theta0 comes out to be 0.
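As a rough sketch of steps 2 and 3, assuming the data x = (1, 2, 4, 0) and y = (0.5, 1, 2, 0) quoted elsewhere in this thread, the two-point calculation might look like this (the choice of the first two examples is arbitrary):

```python
xs = [1, 2, 4, 0]
ys = [0.5, 1, 2, 0]

# Step 2: pick any two examples with different x and take the slope.
(x1, y1), (x2, y2) = (xs[0], ys[0]), (xs[1], ys[1])
theta1 = (y2 - y1) / (x2 - x1)

# Step 3: substitute one example into y = theta1 * x + theta0.
theta0 = y1 - theta1 * x1

# Check that the line reproduces every example, not just the two used.
residuals = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
print(theta0, theta1, residuals)
```

The final residual check matters: two points always define a line, so the perfect-fit claim is only confirmed once the remaining examples also give zero residual.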

My approach would be to plot the (x, y) points on a graph. Since they lie on a straight line through the origin, the slope is the tangent of the angle the line makes with the x-axis, which by basic trigonometry is y/x here (since it's mentioned that they fit perfectly!). E.g.:
tan(angle) = 0.5/1, or 1/2
So theta1 = tan(angle) = 0.5. (The angle itself, arctan(1/2), is about 0.46 rad, but it is the slope that the hypothesis needs.)
Note: this is not a scalable approach, just some maths fun! Sorry.

In general you would use some non-iterative approach (probably based on solving a system of linear equations) or some iterative approach like GD (gradient descent), but things are simpler here, as it's already given that there is a perfect fit.
Perfect fit means: a loss/error of zero.
A loss of zero implies that theta0 needs to be zero, or else sample 4 (the last one, with x = 0 and y = 0) induces a loss.
The overall loss is the sum of the per-sample losses, and each component is nonnegative, so we can't tolerate a loss on any single sample.
With theta0 fixed at zero, sample 4 is satisfied by every value of theta1 (infinitely many solutions produce no loss).
But sample 1 shows that theta1 has to be 0.5 to induce no loss.
Check the others: it fits perfectly.
One assumption I made:
Gradient descent will converge to the optimal solution (which is not always true, even for convex optimization problems; it depends on the learning rate; one might use line searches to prove convergence based on some assumptions about the problem; but all that is irrelevant here).
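For comparison, here is a minimal gradient descent sketch on the same four samples. The starting point, the learning rate of 0.05 and the iteration count are assumptions picked by hand for this small problem, not values from the question; a much larger learning rate would make the iteration diverge, which is the caveat mentioned above.

```python
xs = [1, 2, 4, 0]
ys = [0.5, 1, 2, 0]
m = len(xs)

def loss(theta0, theta1):
    """Mean squared error of h(x) = theta0 + theta1 * x on the data."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / m

theta0, theta1 = 1.0, -1.0   # arbitrary starting point (assumption)
learning_rate = 0.05         # hand-picked step size (assumption)

for _ in range(5000):
    # Prediction errors for the current parameters.
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # Partial derivatives of the mean squared error.
    grad0 = 2 * sum(errors) / m
    grad1 = 2 * sum(e * x for e, x in zip(errors, xs)) / m
    # Simultaneous update of both parameters.
    theta0 -= learning_rate * grad0
    theta1 -= learning_rate * grad1

print(round(theta0, 6), round(theta1, 6), loss(theta0, theta1))
```

With these settings the parameters approach the zero-loss values of 0 and 0.5 derived by hand.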

Related

Is the model or cost function that shapes these lines? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
I have a question about cost functions in machine learning and their graphs. For instance, look at the following images. What function shapes them, the cost function or the model? I thought it was the cost function, like MSE in the first image. For the second image, I have no idea what function has that shape. All this is very confusing to me because page 122 of the book "Hands on Machine Learning... 2nd Edition" says:
Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function...
and
This implies that there are no local minima, just one global minimum.
What I don't understand is why MSE is convex only with a Linear Regression model if it is quadratic. I mean, I believe that function will always have that "bowl" shape because it is quadratic. Or maybe not always, because if it were, it would be easy to choose MSE for any model and I would always find the global minimum, since the main goal in a machine learning process is to minimize the value of the cost function.
Why MSE is convex only with Linear Regression model if it is quadratic? I mean, I believe that function always will have that "bowl" shape because it is quadratic.
You're right.
The MSE cost function is always convex over θ.
It is also always convex over x if the model, θ = f(x), is linear.
It can, however, be non-convex over x if the model is non-linear.
For example, if the model is θ = x^2, then
MSE(θ) = (θ' - θ)^2 = (θ' - x^2)^2
has two global minima (for θ' > 0), one at x = √θ' and the other at x = -√θ'. (Kind of a "w" shape rather than a "bowl" shape.)
But along the θ axis, there is only one global minimum, at θ = θ'.
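A small numeric check of this example, assuming a hypothetical target value θ' = 4 and the single-sample squared error used above (the function names are invented for this sketch):

```python
import math

theta_target = 4.0  # hypothetical target value θ' (assumption)

def loss_over_theta(theta):
    """Squared error as a function of the parameter θ: a single bowl."""
    return (theta_target - theta) ** 2

def loss_over_x(x):
    """Same loss with the non-linear model θ = x^2 substituted in."""
    return (theta_target - x ** 2) ** 2

root = math.sqrt(theta_target)

# Both x = +root and x = -root reach zero loss ...
print(loss_over_x(root), loss_over_x(-root))

# ... but their midpoint x = 0 has a higher loss, which a convex function
# would not allow (the midpoint value must not exceed the endpoint average).
print(loss_over_x(0.0))

# Over θ, the only zero-loss point is θ = θ'.
print(loss_over_theta(theta_target))
```

The bump at x = 0 between the two minima is what gives the curve its "w" shape.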

How to identify the normalized feature

I have been trying to solve a problem from a Coursera exam. I am not seeking the solution, but I need the steps and concepts required to solve it.
Can anyone share the concepts and steps to help me find the solution?
UPDATE:
I was expecting a down-vote, and it's not unusual, as it's the easiest thing people can do. I am seeking a direction for solving the problem, as I wasn't able to work out how to solve it after watching the videos on Coursera. I hope someone sensible out there can share a direction and the steps to achieve the mentioned goal.
Mean Normalization
Mean normalization (sometimes loosely called 'standardization') is one of the most popular feature scaling techniques.
Andrew Ng describes it in the 12a slide of lecture 4:
How to resolve the problem
The problem asks you to normalize the first feature of the third example: midterm = 94.
So we just have to evaluate the formula x_std = (x - μ)/σ.
Just for clarity, the notation:
μ (mu) = "avg value of x in training set", in other words: the mean of the x1 column.
σ (sigma) = "range (max-min)", literally σ = max - min (of the x1 column).
So:
μ = ( 89 + 72 + 94 +69 )/4 = 81
σ = ( 94 - 69 ) = 25
x_std = (94 - 81)/25 = 0.52
Result: 0.52
Best regards,
Marco.
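The arithmetic above can be reproduced with a short sketch. The list `midterm` holds the four x1 values used in the calculation, and `mean_normalize` is a helper name invented here; the denominator is the range (max minus min), which is what this answer calls σ.

```python
midterm = [89, 72, 94, 69]   # x1 column, as used in the arithmetic above

mu = sum(midterm) / len(midterm)            # mean of the column
value_range = max(midterm) - min(midterm)   # max - min of the column

def mean_normalize(x):
    """Shift by the column mean and scale by the column range."""
    return (x - mu) / value_range

print(mu, value_range, mean_normalize(94))
```

Applying `mean_normalize` to every entry of the column gives values that average to zero, which is the point of the centering step.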
The first step in solving this question is to identify what the symbol in the question denotes: from the content of the lecture, it refers to the first feature of the third training case, which is the unsquared version of the midterm score in the third row of the table.
Secondly, you need to understand the concept of normalization. We need normalization because the values of some features across the training examples may be much larger than the values of other features, which can give the cost function a badly skewed shape and make it harder for gradient descent to find the minimum. To solve this, we want all features to have nearly the same scale, and the range of each feature to be centered at zero.
In this question, we want every feature to span a range of width 1. To do this, you need to find the max and min values of the feature among all training cases, then squeeze the range of the feature so that it has width 1. The second step is to find the center value of the feature (the average value in this case) and move it to 0.
I think these are pretty much all the hints I can give you; from this point you will be able to calculate the answer to this question by yourself.

Why the hypothesis has to introduce two parameters, namely θ0 and θ1

I was learning Machine Learning from this course on Coursera taught by Andrew Ng. The instructor defines the hypothesis as a linear function of the "input" (x, in my case) like the following:
hθ(x) = θ0 + θ1(x)
In supervised learning, we have some training data, and based on that we try to "deduce" a function that closely maps the inputs to the corresponding outputs. To deduce the function, we introduce the hypothesis as a linear function of the input (x). My question is: why is a function involving two θs chosen? Why can't it be as simple as y(i) = a * x(i), where a is a coefficient? We could then find a "good" value of a for the given examples using an algorithm. This question might look very stupid. I apologize, but I'm just a beginner at machine learning. Please help me understand this.
Thanks!
The a corresponds to θ1. Your proposed linear model is leaving out the intercept, which is θ0.
Consider an output function y equal to the constant 5, or perhaps equal to a constant plus some tiny fraction of x which never exceeds .01. Driving the error function to zero is going to be difficult if your model doesn't have a θ0 that can soak up the D.C. component.
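To make the constant-output case concrete, here is a small sketch that fits both models by least squares to made-up data where y is always 5. The x values, the helper `sse` and the variable names are assumptions for illustration only.

```python
xs = [1.0, 2.0, 3.0, 4.0]   # made-up inputs
ys = [5.0, 5.0, 5.0, 5.0]   # constant output, as in the example above
n = len(xs)

def sse(predict):
    """Sum of squared errors of a prediction function on the data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))

# Model without intercept: y = a * x, least-squares a = sum(x*y) / sum(x*x).
a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Model with intercept: y = theta0 + theta1 * x (ordinary least squares).
x_mean, y_mean = sum(xs) / n, sum(ys) / n
theta1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
theta0 = y_mean - theta1 * x_mean

sse_no_intercept = sse(lambda x: a * x)
sse_with_intercept = sse(lambda x: theta0 + theta1 * x)
print(sse_no_intercept, sse_with_intercept)
```

Even the best possible a leaves a sizeable error, while the model with θ0 fits the constant exactly.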

How to cut down train error for a dense-matrix factorization task?

This problem may seem very different from the usual matrix factorization task that is widely used in recommender systems.
My problem is as follows:
Given a dense matrix M
(approximately 55000*200; it may contain many negative elements; 0.1 < abs(M[i][j]) < 1)
I have to find two matrices A (55000*1400) and B (1400*200) such that:
AB = M
However, we have some knowledge about A. We have another matrix C: if C[i][j] = 0, then A[i][j] must be zero; otherwise (C[i][j] = 1) it can be any value.
In practice, I use machine learning to solve the problem, and my loss function is:
|| (A ∘ C) B - M ||_2, where ∘ is the element-wise product and ||.||_2 is the L2 norm
I have tried Adagrad, momentum, Adadelta and some other optimization methods, but the training error stays high and decreases slowly (learning_rate = 0.1).
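For readers who want to see the masked loss spelled out, here is a toy-sized sketch in plain Python. The matrices are tiny made-up stand-ins for the real 55000*1400 and 1400*200 ones, the helper names are invented, and the L2 norm is assumed to be taken over all entries (the Frobenius norm); none of this is the asker's actual code.

```python
# Tiny stand-ins for the real matrices (shapes and values are made up).
A = [[0.5, -0.3, 0.8],
     [0.2,  0.7, -0.4]]          # 2 x 3
C = [[1, 0, 1],
     [0, 1, 0]]                  # 2 x 3 mask: 0 forces A[i][j] to zero
B = [[0.6, -0.2],
     [0.9,  0.4],
     [-0.5, 0.3]]                # 3 x 2
M = [[-0.1, 0.14],
     [0.63, 0.28]]               # 2 x 2 target

def masked(A, C):
    """Element-wise product of A and C, zeroing entries the mask forbids."""
    return [[a * c for a, c in zip(row_a, row_c)] for row_a, row_c in zip(A, C)]

def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def frobenius_loss(A, B, C, M):
    """Norm of (A masked by C) times B, minus M, taken over all entries."""
    P = matmul(masked(A, C), B)
    return sum((p - m) ** 2 for rp, rm in zip(P, M) for p, m in zip(rp, rm)) ** 0.5

print(frobenius_loss(A, B, C, M))
```

Applying the mask inside the loss means the gradient with respect to any masked-out entry of A is zero, so those entries never contribute to the product.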
UP1:
Well, actually I've got a machine with 32GB of memory, and each epoch takes only 2 minutes. I decompose an element of M only if its corresponding element in C is annotated as 1. In practice, I only decompose M[i][j] when C[i][j] = 1, and after I decompose M[i][j], I compute the gradient for M[i][j] to update A[i : ] and B[ : j] at once. So the batch I use is very small: it contains just one element. I should also mention that C is a pretty sparse matrix. In each row of C, only 2-3 elements are annotated as 1.
After struggling with it for nearly half a month, I finally found the answer: I should update matrix A much more frequently, that is, update its parameters in smaller steps. Originally I updated every element of A only once per epoch, much less often than B. However, after I changed the code so that A is updated as often as B, a surprise happened: it worked pretty well!
Maybe smaller steps help SGD work better? I'm not really convinced of it mathematically.

Wouldn't setting the first derivative of Cost function J to 0 gives the exact Theta values that minimize the cost?

I am currently doing Andrew Ng's ML course. From my calculus knowledge, the first derivative test of a function gives its critical points, if there are any. And considering the convex nature of the Linear/Logistic Regression cost function, it is a given that there will be a global/local optimum. If that is the case, rather than taking the long route of one minuscule baby step at a time to reach the global minimum, why don't we use the first derivative test to get the values of Theta that minimize the cost function J in a single attempt, and have a happy ending?
That being said, I do know that there is an alternative to Gradient Descent called the Normal Equation that does just that in one step, unlike the former.
On second thought, I wonder whether it is mainly because of the multiple unknown variables involved in the equation (which is why the partial derivative comes into play?).
Let's take an example:
Gradient of the simple regression cost function:
∇RSS(w) = ∇[(y - Hw)^T (y - Hw)]
y : output
H : feature matrix
w : weights
RSS: residual sum of squares
Setting this to 0 gives the closed-form solution:
w = (H^T H)^-1 H^T y
Now, assuming there are D features, the time complexity of inverting the D x D matrix H^T H is around O(D^3). If there are a million features, this is computationally infeasible within a reasonable amount of time.
We use gradient descent methods because they give reasonably acceptable solutions in much less time.
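As a small illustration of the closed form, here is a sketch for the one-feature case with an intercept, where H^T H is only 2 x 2 and can be solved by hand. The data are a made-up example, and solving the system with Cramer's rule is a choice made for this sketch; with D features the same system is D x D, which is where the cubic cost comes from.

```python
xs = [1, 2, 4, 0]          # made-up inputs
ys = [0.5, 1, 2, 0]        # made-up outputs
m = len(xs)

# With H = [[1, x] for each x], H^T H is the 2x2 matrix [[m, s_x], [s_x, s_xx]]
# and H^T y is the 2-vector (s_y, s_xy).
s_x = sum(xs)
s_xx = sum(x * x for x in xs)
s_y = sum(ys)
s_xy = sum(x * y for x, y in zip(xs, ys))

# Solve (H^T H) w = H^T y for w = (w0, w1) by Cramer's rule.
det = m * s_xx - s_x * s_x
w0 = (s_y * s_xx - s_x * s_xy) / det
w1 = (m * s_xy - s_x * s_y) / det
print(w0, w1)
```

For two parameters this is a handful of arithmetic operations; the trade-off described above only starts to bite when the number of features, and hence the size of H^T H, becomes large.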

Resources