Denormalizing thetas after a linear regression with gradient descent - machine-learning

I have the following set of data:
km,price
240000,3650
139800,3800
150500,4400
185530,4450
176000,5250
114800,5350
166800,5800
89000,5990
144500,5999
84000,6200
82029,6390
63060,6390
74000,6600
97500,6800
67000,6800
76025,6900
48235,6900
93000,6990
60949,7490
65674,7555
54000,7990
68500,7990
22899,7990
61789,8290
After normalizing them, I'm performing a gradient descent that gives me the following thetas:
θ0 = 0.9362124793084768
θ1 = -0.9953762249792935
I can correctly predict the price if I feed it a normalized mileage and then denormalize the predicted price, i.e.:
Asked price for a mileage of 50000km:
normalized mileage: 0.12483129971764294
normalized price: (mx + c) = 0.8119583714362707
real price: 7417.486843464296
What I'm looking for is to revert my thetas back to their non-normalized values, but I've been unable to, no matter which equation I tried. Is there a way to do so?

As usual, it takes me asking a question on Stack Overflow to manage to solve it by myself moments later...
It was simply a two-variable system of equations to solve, as you can see here (excuse the handwriting): https://ibb.co/178qWcQ.
Here is the Python code that does the computation:
# Raw and normalized mileages of the first two training points
x0, x1 = self.training_set[0][0], self.training_set[1][0]
x0n, x1n = self.normalized_training_set[0][0], self.normalized_training_set[1][0]
# Normalized price predictions for those two mileages
y0n, y1n = self.hypothesis(x0n), self.hypothesis(x1n)
p_diff = self.max_price - self.min_price
# Solve the two-point system for the intercept in raw units
theta0 = (x1 / (x1 - x0)) * (y0n * p_diff + self.min_price - (x0 / x1 * (y1n * p_diff + self.min_price)))
y0 = self.training_set[0][1]
theta1 = (y0 - theta0) / x0
print(theta0, theta1)  # RESULT: 8481.172796984529 -0.020129886654102203
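For what it's worth, if the normalization is plain min-max scaling (which the normalized mileage above is consistent with), the thetas can also be denormalized in closed form, without evaluating the hypothesis at two points. A minimal sketch under that assumption, with the extrema hard-coded from the dataset above:

# Assumes min-max scaling on both axes:
#   x_n = (x - min_km) / (max_km - min_km)
#   y_n = (y - min_price) / (max_price - min_price)
min_km, max_km = 22899, 240000
min_price, max_price = 3650, 8290
theta0_n, theta1_n = 0.9362124793084768, -0.9953762249792935

theta1 = theta1_n * (max_price - min_price) / (max_km - min_km)
theta0 = min_price + (max_price - min_price) * theta0_n - theta1 * min_km
print(theta0, theta1)  # theta0 comes out to ~8481.17, matching the result above

One caveat on the two-point code: it uses the raw first training price (y0) in the slope step; using the denormalized fitted price at x0 instead would make the slope agree exactly with this closed form.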

Related

Cost function for logistic regression: weird/oscillating cost history

Background and my thought process:
I wanted to see if I could use logistic regression to create a hypothesis function that could predict recessions in the US economy by looking at a date and its corresponding leading economic indicators. Leading economic indicators are known to be good predictors of the economy.
To do this, I got data from the OECD on the composite leading (economic) indicators from January 1970 to July 2021, in addition to finding when recessions occurred from 1970 to 2021. The formatted data that I use for training can be found further below.
Knowing that the relationship between a recession and the date/LEI wouldn't be simply linear, I decided to create more parameters for each datapoint so I could fit a polynomial equation to the data. Thus, each datapoint has the following parameters: Date, LEI, LEI^2, LEI^3, LEI^4, and LEI^5.
The Problem:
When I attempt to train my hypothesis function, I get a very strange cost history that seems to indicate that I either did not implement my cost function correctly or that my gradient descent was implemented incorrectly. Below is the image of my cost history:
I have tried implementing the suggestions from this post to fix my cost history, as originally I had the same NaN and Inf issues described in the post. While the suggestions helped me fix the NaN and Inf issues, I couldn't find anything to help fix my cost function once it started oscillating. Some of the other fixes I've tried are adjusting the learning rate, double-checking my cost and gradient descent implementations, and introducing more parameters per datapoint (to see if a higher-degree polynomial equation would help).
My Code
The main file is predictor.m.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Program: Predictor.m
% Author: Hasec Rainn
% Desc: Predictor.m uses logistic regression
% to predict when economic recessions will occur
% in the United States. The data it uses is from the past 50 years.
%
% In particular, it uses dates and their corresponding economic leading
% indicators to learn a non-linear hypothesis function to fit to the data.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
LI_Data = dlmread("leading_indicators_formatted.csv"); %Get LI data
RD_Data = dlmread("recession_dates_formatted.csv"); %Get RD data
%our datapoints of interest: Dates and their corresponding
%leading Indicator values.
%We are going to increase the number of parameters per datapoint to allow
%for a non-linear hypothesis function. Specifically, let the 3rd, 4th
%5th, and 6th columns represent LI^2, LI^3, LI^4, and LI^5 respectively
X = LI_Data; %datapoints of interest (row = 1 datapoint)
X = [X, X(:,2).^2]; %Adding LI^2
X = [X, X(:,2).^3]; %Adding LI^3
X = [X, X(:,2).^4]; %Adding LI^4
X = [X, X(:,2).^5]; %Adding LI^5
%normalize data
X(:,1) = normalize( X(:,1) );
X(:,2) = normalize( X(:,2) );
X(:,3) = normalize( X(:,3) );
X(:,4) = normalize( X(:,4) );
X(:,5) = normalize( X(:,5) );
X(:,6) = normalize( X(:,6) );
%What we want to predict: if a recession happens or doesn't happen
%for a corresponding year
Y = RD_Data(:,2); %row = 1 datapoint
%defining a few useful variables:
nIter = 4000; %how many iterations we want to run gradient descent for
ndp = size(X, 1); %number of data points we have to work with
nPara = size(X,2); %number of parameters per data point
alpha = 1; %set the learning rate to 1
%Defining Theta
Theta = ones(1, nPara); %initialize the weights of Theta to 1
%Make a cost history so we can see if gradient descent is implemented
%correctly
costHist = zeros(nIter, 1);
for i = 1:nIter
  costHist(i, 1) = cost(Theta, Y, X);
  Theta = Theta - (sum((sigmoid(X * Theta') - Y) .* X));
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: Cost
% Author: Hasec Rainn
% Parameters: Theta (vector), Y (vector), X (matrix)
% Desc: Uses Theta, Y, and X to determine the cost of our current
% hypothesis function H_theta(X). Uses manual loop approach to
% avoid errors that arise from log(0).
% Additionally, limits the range of H_Theta to prevent Inf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function expense = cost(Theta, Y, X)
  m = size(X, 1); %number of data points
  hTheta = sigmoid(X*Theta'); %hypothesis function
  %limit the range of hTheta to [10^-50, 0.9999999999999]
  for i = 1:size(hTheta, 1)
    if (hTheta(i) <= 10^(-50))
      hTheta(i) = 10^(-50);
    endif
    if (hTheta(i) >= 0.9999999999999)
      hTheta(i) = 0.9999999999999;
    endif
  endfor
  expense = 0;
  for i = 1:m
    if Y(i) == 1
      expense = expense + -log(hTheta(i));
    endif
    if Y(i) == 0
      expense = expense + -log(1-hTheta(i));
    endif
  endfor
endfunction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: normalize
% Author: Hasec Rainn
% Parameters: vector
% Desc: Takes in an input and normalizes its value(s)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function n = normalize(data)
  dMean = mean(data);
  dStd = std(data);
  n = (data - dMean) ./ dStd;
endfunction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: Sigmoid
% Author: Hasec Rainn
% Parameters: scalar, vector, or matrix
% Desc: Takes an input and forces its value(s) to be between
% 0 and 1. If a matrix or vector, sigmoid is applied to
% each element.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function result = sigmoid(z)
  result = 1 ./ ( 1 + e .^(-z) );
endfunction
The data I used for my learning process can be found here: formatted LI data and recession dates data.
The problem you're running into here is your gradient descent function.
In particular, while you correctly calculate the error portion (i.e., (hTheta - Y), or (sigmoid(X * Theta') - Y)), you do not calculate the derivative of the cost correctly; in Theta = Theta - (sum((sigmoid(X * Theta') - Y) .* X)), the .* X is not correct.
The derivative is equivalent to the error of each datapoint (found in the vector hTheta - Y) multiplied by its corresponding parameter j, for every parameter. For more information, check out this article.
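For comparison, a conventional full-batch logistic-regression update also scales the gradient by the learning rate and by 1/m, which the loop in the question omits. A minimal NumPy sketch under those conventions (the variable names only mirror the Octave code; nothing here is taken from the original files):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.01, n_iter=4000):
    # X: (m, n) feature matrix, Y: (m,) vector of 0/1 labels
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        error = sigmoid(X @ theta) - Y       # per-datapoint error, shape (m,)
        grad = (X.T @ error) / m             # gradient of the mean cross-entropy cost
        theta -= alpha * grad                # step scaled by the learning rate
    return theta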

How to debug if weights keep increasing in a PyTorch program

I'm having some doubts while practicing a PyTorch program.
I have a function like y = m1x1 + m2x2 + c (just two weights to learn here). The expected values of the weights are 16 and -14, and the bias should be 36. But in every epoch the learned weights grow very large. Can anyone help me debug and understand these 20 lines of code, and what is going wrong here?
import torch
x = torch.randint(size = (1,2), high = 10)
w = torch.Tensor([16,-14])
b = 36
#Compute Ground Truth
y = w * x + b
#Find weights by program
epoch = 20
learning_rate = 30
#initialize random
w1 = torch.rand(size= (1,2), requires_grad= True)
b1 = torch.ones(size = [1], requires_grad= True)
for i in range(epoch):
    y1 = w1 * x + b1
    # loss function: sum of squared errors
    loss = torch.sum((y1 - y)**2)
    # find gradient
    loss.backward()
    with torch.no_grad():
        # update parameters
        w1 -= (learning_rate * w1.grad)
        b1 -= (learning_rate * b1.grad)
        w1.grad.zero_()
        b1.grad.zero_()
    print("B ", b1)
    print("W ", w1)
Thanks,
Ganesh
You have a very large learning rate.
This is an illustration from Jeremy Jordan's blog that explains exactly what is going on in your case.
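Concretely: torch.sum over the squared errors already produces large gradients, and multiplying them by a learning rate of 30 makes every step overshoot, so the weights blow up. Here is a minimal sketch of the same loop with a mean-based loss and a small learning rate (the values are illustrative, not tuned; with a single random datapoint the recovered weights are not unique, but the loss now shrinks instead of exploding):

import torch

x = torch.randint(size=(1, 2), high=10).float()
w = torch.Tensor([16, -14])
b = 36
y = w * x + b  # ground truth, as in the question

learning_rate = 0.01  # much smaller than 30
w1 = torch.rand(size=(1, 2), requires_grad=True)
b1 = torch.ones(size=[1], requires_grad=True)
for i in range(500):
    y1 = w1 * x + b1
    loss = torch.mean((y1 - y)**2)  # mean keeps gradient magnitudes modest
    loss.backward()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        b1 -= learning_rate * b1.grad
        w1.grad.zero_()
        b1.grad.zero_()
print("B ", b1)
print("W ", w1)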

Understanding code wrt Logistic Regression using gradient descent

I was following Siraj Raval's videos on logistic regression using gradient descent :
1) Link to longer video :
https://www.youtube.com/watch?v=XdM6ER7zTLk&t=2686s
2) Link to shorter video :
https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2A7mu0bSksCGMJEmeddU_H4D
In the videos he talks about using gradient descent to reduce the error for a set number of iterations, so that the function converges (the slope becomes zero).
He also illustrates the process via code. The following are the two main functions from the code :
def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, array(points), learning_rate)
    return [b, m]
#The above functions are called below:
learning_rate = 0.0001
initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
# code taken from Siraj Raval's github page
Why do the values of b and m continue to update for all the iterations? After a certain number of iterations, the function will converge, when we find the values of b and m that give slope = 0.
So why do we continue iterating after that point and keep updating b and m?
Aren't we losing the 'correct' b and m values this way? How does the learning rate help the convergence process if we continue to update values after converging? And why is there no check for convergence, so how is this actually working?
In practice, you will most likely not reach a slope of exactly 0. Think of your loss function as a bowl. If your learning rate is too high, it is possible to overshoot the lowest point of the bowl. Conversely, if the learning rate is too low, your learning becomes too slow and won't reach the lowest point of the bowl before all iterations are done.
That's why in machine learning, the learning rate is an important hyperparameter to tune.
Actually, once we reach a slope of 0, b_gradient and m_gradient become 0;
thus, for:
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
new_b and new_m will retain the old, correct values, as nothing will be subtracted from them.
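That said, if you want an explicit stopping criterion, a common pattern is to stop once the update falls below a tolerance. A minimal sketch built on the step_gradient function above (the tol value is illustrative, and array is numpy's, as in the original code):

def gradient_descent_with_tolerance(points, b, m, learning_rate, num_iterations, tol=1e-9):
    for i in range(num_iterations):
        new_b, new_m = step_gradient(b, m, array(points), learning_rate)
        # stop early once the update is effectively zero
        if abs(new_b - b) < tol and abs(new_m - m) < tol:
            return [new_b, new_m]
        b, m = new_b, new_m
    return [b, m]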

How do I implement the optimization function in tensorflow?

min Σ_i ( ||x_i - X c_i||^2 + λ ||c_i|| ),
s.t. c_ii = 0,
where X is a matrix of shape d × n and C is of shape n × n; x_i and c_i denote the i-th columns of X and C respectively.
X is known here and based on X we want to find C.
Usually with a loss like that you need to vectorize it, instead of working with columns:
loss = X - tf.matmul(X, C)
loss = tf.reduce_sum(tf.square(loss))
reg_loss = tf.reduce_sum(tf.square(C), 0) # L2 loss for each column
reg_loss = tf.reduce_sum(tf.sqrt(reg_loss))
total_loss = loss + lambd * reg_loss
To implement the zero constraint on the diagonal of C, the best way is to add it to the loss with another constant lambd2:
reg_loss2 = tf.trace(tf.square(C))
total_loss = total_loss + lambd2 * reg_loss2
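For completeness, here is a minimal end-to-end sketch of the combined loss in TF2 eager style (in TF2, tf.trace is tf.linalg.trace; the sizes, penalty weights, learning rate, and step count below are illustrative assumptions, and a small epsilon is added inside the square root to avoid NaN gradients when a column of C is all zeros):

import numpy as np
import tensorflow as tf

d, n = 5, 20
X = tf.constant(np.random.randn(d, n), dtype=tf.float32)
C = tf.Variable(tf.random.normal([n, n], stddev=0.01))
lambd, lambd2 = 0.1, 10.0
opt = tf.keras.optimizers.Adam(0.01)

for step in range(1000):
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(tf.square(X - tf.matmul(X, C)))
        col_norms = tf.sqrt(tf.reduce_sum(tf.square(C), 0) + 1e-12)
        reg_loss = tf.reduce_sum(col_norms)           # L2 penalty per column of C
        diag_penalty = tf.linalg.trace(tf.square(C))  # pushes the diagonal toward 0
        total_loss = loss + lambd * reg_loss + lambd2 * diag_penalty
    grads = tape.gradient(total_loss, [C])
    opt.apply_gradients(zip(grads, [C]))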

Gradient in continuous regression using a neural network

I'm trying to implement a regression NN that has 3 layers (1 input, 1 hidden, and 1 output layer with a continuous result). As a basis I took a classification NN from a coursera.org class, but changed the cost function and gradient calculation to fit a regression problem (and not a classification one):
My nnCostFunction now is:
function [J grad] = nnCostFunctionLinear(nn_params, ...
                                         input_layer_size, ...
                                         hidden_layer_size, ...
                                         num_labels, ...
                                         X, y, lambda)
  Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                   hidden_layer_size, (input_layer_size + 1));
  Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                   num_labels, (hidden_layer_size + 1));
  m = size(X, 1);
  a1 = X;
  a1 = [ones(m, 1) a1];
  a2 = a1 * Theta1';
  a2 = [ones(m, 1) a2];
  a3 = a2 * Theta2';
  Y = y;
  J = 1/(2*m)*sum(sum((a3 - Y).^2))
  th1 = Theta1;
  th1(:,1) = 0; %set bias = 0 in reg. formula
  th2 = Theta2;
  th2(:,1) = 0;
  t1 = th1.^2;
  t2 = th2.^2;
  th = sum(sum(t1)) + sum(sum(t2));
  th = lambda * th / (2*m);
  J = J + th; %regularization
  del_3 = a3 - Y;
  t1 = del_3'*a2;
  Theta2_grad = 2*(t1)/m + lambda*th2/m;
  t1 = del_3 * Theta2;
  del_2 = t1 .* a2;
  del_2 = del_2(:,2:end);
  t1 = del_2'*a1;
  Theta1_grad = 2*(t1)/m + lambda*th1/m;
  grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
Then I use this function in the fmincg algorithm, but in the first iterations fmincg ends its work. I think my gradient is wrong, but I can't find the error.
Can anybody help?
If I understand correctly, your first block of code (shown below) -
  m = size(X, 1);
  a1 = X;
  a1 = [ones(m, 1) a1];
  a2 = a1 * Theta1';
  a2 = [ones(m, 1) a2];
  a3 = a2 * Theta2';
  Y = y;
is to get the output a(3) at the output layer.
Ng's slides about NNs use a different configuration to calculate a(3) than what your code presents:
in the middle/output layer, you are not applying the activation function g, e.g., a sigmoid function.
In terms of the cost function J without regularization terms, Ng's slides give a formula based on the log of the hypothesis, so I don't understand why you can compute it using:
J = 1/(2*m)*sum(sum((a3 - Y).^2))
because you are not including the log function at all.
Mikhail, I've been playing with an NN for continuous regression as well, and had similar issues at some point. The best thing to do here would be to test the gradient computation against a numerical calculation before running the model. If that's not correct, fmincg won't be able to train the model. (By the way, I discourage you from using the numerical gradient for training, as the time involved is much bigger.)
Taking into account that you took this idea from Ng's Coursera class, I'll implement a possible solution for you to try, using the same Octave notation.
% Cost function without regularization. Note the 1/(2*m) scaling,
% which matches the 1/m scaling of the gradients below.
J = 1/(2*m)*sum((a3-Y).^2);
% In case it's needed, the regularization term is added (i.e. for training).
if (reg==true);
  J = J + lambda/2/m*(sum(sum(Theta1(:,2:end).^2))+sum(sum(Theta2(:,2:end).^2)));
endif;
% Derivatives are computed for layers 2 and 3.
d3 = (a3.-Y);
d2 = d3*Theta2(:,2:end);
% Theta grad is computed without regularization.
Theta1_grad = (d2'*a1)./m;
Theta2_grad = (d3'*a2)./m;
% Regularization is added to the grad computation.
Theta1_grad(:,2:end) = Theta1_grad(:,2:end)+(lambda/m).*Theta1(:,2:end);
Theta2_grad(:,2:end) = Theta2_grad(:,2:end)+(lambda/m).*Theta2(:,2:end);
% Unroll gradients.
grad = [Theta1_grad(:) ; Theta2_grad(:)];
Note that, since you have taken out all the sigmoid activations, the derivative calculation is quite simple and results in a simplification of the original code.
Next steps:
1. Check this code to understand if it makes sense for your problem.
2. Use gradient checking to test the gradient calculation (see the sketch below).
3. Finally, use fmincg and check that you get different results.
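For step 2, gradient checking just compares the analytic gradient against a central-difference approximation of the cost. A minimal, framework-agnostic Python sketch (cost_fn, your_grad_fn, and the tolerance are placeholders, not part of the original code):

import numpy as np

def numerical_gradient(cost_fn, theta, eps=1e-4):
    # central-difference approximation of d(cost)/d(theta_i)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[i] += eps
        t_minus[i] -= eps
        grad[i] = (cost_fn(t_plus) - cost_fn(t_minus)) / (2 * eps)
    return grad

# usage: the relative difference should be tiny (e.g. < 1e-7)
# analytic = your_grad_fn(theta)
# numeric = numerical_gradient(your_cost_fn, theta)
# diff = np.linalg.norm(analytic - numeric) / np.linalg.norm(analytic + numeric)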
Try including a sigmoid function to compute the second-layer (hidden-layer) values, and avoid the sigmoid when calculating the target (output) value:
function [J grad] = nnCostFunction1(nnParams, ...
                                    inputLayerSize, ...
                                    hiddenLayerSize, ...
                                    numLabels, ...
                                    X, y, lambda)
  Theta1 = reshape(nnParams(1:hiddenLayerSize * (inputLayerSize + 1)), ...
                   hiddenLayerSize, (inputLayerSize + 1));
  Theta2 = reshape(nnParams((1 + (hiddenLayerSize * (inputLayerSize + 1))):end), ...
                   numLabels, (hiddenLayerSize + 1));
  Theta1Grad = zeros(size(Theta1));
  Theta2Grad = zeros(size(Theta2));
  m = size(X,1);
  a1 = [ones(m, 1) X]';
  z2 = Theta1 * a1;
  a2 = sigmoid(z2);
  a2 = [ones(1, m); a2];
  z3 = Theta2 * a2;
  a3 = z3;
  Y = y';
  r1 = lambda / (2 * m) * sum(sum(Theta1(:, 2:end) .* Theta1(:, 2:end)));
  r2 = lambda / (2 * m) * sum(sum(Theta2(:, 2:end) .* Theta2(:, 2:end)));
  J = 1 / ( 2 * m ) * (a3 - Y) * (a3 - Y)' + r1 + r2;
  delta3 = a3 - Y;
  delta2 = (Theta2' * delta3) .* sigmoidGradient([ones(1, m); z2]);
  delta2 = delta2(2:end, :);
  Theta2Grad = 1 / m * (delta3 * a2');
  Theta2Grad(:, 2:end) = Theta2Grad(:, 2:end) + lambda / m * Theta2(:, 2:end);
  Theta1Grad = 1 / m * (delta2 * a1');
  Theta1Grad(:, 2:end) = Theta1Grad(:, 2:end) + lambda / m * Theta1(:, 2:end);
  grad = [Theta1Grad(:) ; Theta2Grad(:)];
end
Normalize the inputs before passing them to nnCostFunction.
In accordance with the Week 5 Lecture Notes guideline for a Linear System NN, you should make the following changes to the initial code:
Remove num_labels or make it 1 (in reshape() as well)
No need to convert y into a logical matrix
For a2 - replace the sigmoid() function with tanh()
In the d2 calculation - replace sigmoidGradient(z2) with (1-tanh(z2).^2)
Remove the sigmoid from the output layer (a3 = z3)
Replace the cost function in the unregularized portion with a linear one: J = (1/(2*m))*sum((a3-y).^2)
Create predictLinear(): use the predict() function as a basis; replace sigmoid with tanh() for the first-layer hypothesis, remove the second sigmoid for the second-layer hypothesis, remove the line with the max() function, and use the output of the hidden-layer hypothesis as the prediction result
Verify your nnCostFunctionLinear() on the test case from the lecture notes
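Put together, the modified forward pass and unregularized cost might look like the following NumPy sketch of the list above (function names are illustrative, not from the original code):

import numpy as np

def forward_linear_nn(Theta1, Theta2, X):
    # hidden layer uses tanh; output layer stays linear (a3 = z3, no sigmoid)
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])
    a2 = np.tanh(a1 @ Theta1.T)
    a2 = np.hstack([np.ones((m, 1)), a2])
    return a2 @ Theta2.T

def cost_linear(a3, y):
    # matches J = (1/(2*m))*sum((a3-y).^2)
    m = y.shape[0]
    return np.sum((a3 - y)**2) / (2 * m)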
