Gradient descent on linear regression not converging - machine-learning

I have implemented a very simple linear regression with gradient descent algorithm in JavaScript, but after consulting multiple sources and trying several things, I cannot get it to converge.
The data is perfectly linear: the inputs are just the numbers 0 to 30, and the correct outputs to learn are x*3.
This is the logic behind the gradient descent:
train(input, output) {
  const predictedOutput = this.predict(input);
  const delta = output - predictedOutput;
  this.m += this.learningRate * delta * input;
  this.b += this.learningRate * delta;
}

predict(x) {
  return x * this.m + this.b;
}
I took the formulas from different places, including:
Exercises from Udacity's Deep Learning Foundations Nanodegree
Andrew Ng's course on Gradient Descent for Linear Regression (also here)
Stanford's CS229 Lecture Notes
these other PDF slides I found from Carnegie Mellon
I have already tried:
normalizing input and output values to the [-1, 1] range
normalizing input and output values to the [0, 1] range
normalizing input and output values to have mean = 0 and stddev = 1
reducing the learning rate (1e-7 is as low as I went)
having a linear data set with no bias at all (y = x * 3)
having a linear data set with non-zero bias (y = x * 3 + 2)
initializing the weights with random non-zero values between -1 and 1
Still, the weights (this.b and this.m) do not approach any of the data values, and they diverge into infinity.
I'm obviously doing something wrong, but I cannot figure out what it is.
Update: Here's a little bit more context that may help figure out what my problem is exactly:
I'm trying to model a simple approximation to a linear function, with online learning by a linear regression pseudo-neuron. With that, my parameters are:
weights: [this.m, this.b]
inputs: [x, 1]
activation function: identity function z(x) = x
As such, my net will be expressed by y = this.m * x + this.b * 1, simulating the data-driven function that I want to approximate (y = 3 * x).
What I want is for my network to "learn" the parameters this.m = 3 and this.b = 0, but it seems I get stuck at a local minimum.
My error function is the mean-squared error:
error(allInputs, allOutputs) {
  let error = 0;
  for (let i = 0; i < allInputs.length; i++) {
    const x = allInputs[i];
    const y = allOutputs[i];
    const predictedOutput = this.predict(x);
    const delta = y - predictedOutput;
    error += delta * delta;
  }
  return error / allInputs.length;
}
My logic for updating my weights will be (according to the sources I've checked so far) wi -= alpha * dError/dwi
For the sake of simplicity, I'll call my weights this.m and this.b, so we can relate it back to my JavaScript code. I'll also call y^ the predicted value.
From here:
error = y - y^
= y - this.m * x + this.b
dError/dm = -x
dError/db = 1
And so, applying that to the weight correction logic:
this.m += alpha * x
this.b -= alpha * 1
But this doesn't seem correct at all.

I finally found what's wrong, and I'm answering my own question in hopes it will help beginners in this area too.
First, as Sascha said, I had some theoretical misunderstandings. It may be correct that your adjustment includes the input value verbatim, but as he said, it should already be part of the gradient. This all depends on your choice of the error function.
Your error function is the measure of how far off you were from the real value, and that measurement needs to be consistent. I was using mean-squared error as the measurement (as you can see in my error method), but inside the training method I was using the plain signed error (y^ - y) to measure the error. Your gradient depends on the choice of this error function, so choose only one and stick with it.
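To make the link between the error function and the gradient concrete, here is a sketch of the per-sample derivation for the squared error. The symbols are my own shorthand: \hat{y} = m x + b is the prediction and \alpha is the learning rate, standing in for this.m, this.b and this.learningRate from the question.

$$ E = (y - \hat{y})^2, \qquad \hat{y} = m x + b $$
$$ \frac{\partial E}{\partial m} = -2 (y - \hat{y})\, x, \qquad \frac{\partial E}{\partial b} = -2 (y - \hat{y}) $$
$$ m \leftarrow m + 2 \alpha (y - \hat{y})\, x, \qquad b \leftarrow b + 2 \alpha (y - \hat{y}) $$

The input x appears in the update for m only because it falls out of the chain rule, and the constant 2 can be absorbed into the learning rate.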
Second, simplify your assumptions in order to test what's wrong. In this case, I had a very good idea what the function to approximate was (y = x * 3) so I manually set the weights (this.b and this.m) to the right values and I still saw the error diverge. This means that weight initialization was not the problem in this case.
After searching some more, I found that my error was somewhere else: the function that was feeding data into the network was mistakenly passing a hardcoded value of 3 into the predicted output (it was using a wrong index in an array). The oscillation I saw was the network trying to approximate y = 0 * x + 3 (this.b = 3 and this.m = 0), but because of the small learning rate and the error in the error function derivative, this.b wasn't going to get near the right value, which made this.m take wild jumps to adjust to it.
Finally, keep track of the error measurement as your network trains, so you can have some insight into what's going on. This helps a lot to identify a difference between simple overfitting, big learning rates and plain simple mistakes.
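As an illustration of these points, here is a minimal sketch in Python (not the original JavaScript). It assumes full-batch gradient descent instead of the online updates from the question, uses the mean-squared error for both the gradient and the reported error, and records the error while training. The names fit_line and mse, the learning rate and the step count are my own choices for this example.

```python
def mse(m, b, xs, ys):
    """Mean-squared error of the line y = m * x + b on the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys, learning_rate=0.001, steps=20000):
    """Full-batch gradient descent on the mean-squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    history = []
    for step in range(steps):
        # Gradient of the same error function that mse() reports.
        grad_m = sum(-2.0 * (y - (m * x + b)) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(-2.0 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
        if step % 1000 == 0:
            # Keep track of the error so divergence is visible early.
            history.append(mse(m, b, xs, ys))
    return m, b, history


# Same data as in the question: inputs 0..30, outputs 3 * x.
xs = [float(x) for x in range(31)]
ys = [3.0 * x for x in xs]
m, b, history = fit_line(xs, ys)
print(m, b, history[0], history[-1])
```

With unnormalized inputs up to 30 the learning rate has to stay small, otherwise the recorded error grows instead of shrinking, which is exactly the kind of thing the history list makes visible.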

Related

Importance weighted autoencoder doing worse than VAE

I've been implementing VAE and IWAE models on the caltech silhouettes dataset and am having an issue where the VAE outperforms IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both theory and experiments produced here.
I'm hoping someone can find some issue in how I'm implementing that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K,0) # Generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std) # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z) # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1) - 0.5 * z.shape[1] * torch.log(torch.tensor(2 * np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std) # Definitions below
log_p_x = compute_log_probability_bernoulli(decoded,data_k_vec)
if model_type == 'iwae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type == 'vae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1)*1/K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint) # value of loss that gets returned to training function. loss.backward() will get called on this value
Here are the likelihood functions. I had to fuss with the bernoulli LL in order to not get nan during training
def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
    return torch.sum(-0.5 * ((obs-mu) / torch.exp(logstd)) ** 2 - logstd, axis) - 0.5*obs.shape[1]*torch.log(torch.tensor(2*np.pi))

def compute_log_probability_bernoulli(theta, obs, axis=1):  # Add 1e-18 to avoid nan appearances in training
    return torch.sum(obs*torch.log(theta+1e-18) + (1-obs)*torch.log(1-theta+1e-18), axis)
In this code there's a "shortcut" being used in that the row-wise importance weights are being calculated in the model_type=='iwae' case for the K=50 samples in each row, while in the model_type=='vae' case the importance weights are being calculated for the single value left in each row, so that it just ends up calculating a weight of 1. Maybe this is the issue?
Any and all help is huge - I thought that addressing the nan issue would permanently get me out of the weeds but now I have this new problem.
EDIT:
I should add that the training scheme is the same as that in the paper linked above. That is, for each round i = 0, ..., 7, train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7).
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \frac{1}{K} \sum_{k=1}^K \frac{p(x \vert z_k) p(z_k)}{q(z_k;x)}$$
For the IWAE there are K samples originating from each datapoint x, so you want to keep the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but sample z K times for each x.
So it's computationally wasteful to compute the forward pass for data_k_vec = data.repeat_interleave(K,0). You should compute the forward pass once for each original datapoint, then repeat the statistics output by the inference network for sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. And now repeat your datapoints data_k_vec = data.repeat_interleave(K,0), and use the resulting tensor to efficiently evaluate the conditional p(x |z_k) for each importance sample z_k.
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
EDIT: Actually, after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance-weighted estimate of the standard single-sample gradient. So the method in the OP for calculating the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call ws_norm. So I think the main problem is the absence of this stop_gradient operator.
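To see why the stop_gradient matters, here is a small pure-Python sketch for a single datapoint. It uses finite differences in place of autograd, and the function names and the example log-weights are made up for illustration. It compares the gradient of the log-mean-exp objective with the gradient of the weighted-sum surrogate, once with the normalized weights frozen and once without. In PyTorch the frozen variant would presumably correspond to calling .detach() on ws_norm.

```python
import math


def log_mean_exp(log_w):
    """Numerically stable log((1/K) * sum_k exp(log_w[k]))."""
    mx = max(log_w)
    return mx + math.log(sum(math.exp(v - mx) for v in log_w) / len(log_w))


def normalized_weights(log_w):
    """Self-normalized importance weights (a softmax over log_w)."""
    mx = max(log_w)
    ws = [math.exp(v - mx) for v in log_w]
    total = sum(ws)
    return [w / total for w in ws]


def surrogate(log_w, frozen_weights=None):
    """sum_k w_k * log_w[k]; frozen weights emulate a stop_gradient."""
    if frozen_weights is None:
        weights = normalized_weights(log_w)
    else:
        weights = frozen_weights
    return sum(w * v for w, v in zip(weights, log_w))


def finite_diff_grad(fn, point, eps=1e-6):
    """Central finite-difference gradient of fn at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


log_w = [-1.0, -2.5, -0.3, -4.0]  # hypothetical log importance weights
frozen = normalized_weights(log_w)

grad_iwelbo = finite_diff_grad(log_mean_exp, log_w)
grad_stopped = finite_diff_grad(lambda v: surrogate(v, frozen), log_w)
grad_unstopped = finite_diff_grad(surrogate, log_w)
print(grad_iwelbo, grad_stopped, grad_unstopped)
```

The first two gradients agree (both equal the normalized weights), while the third picks up an extra term from differentiating through the weights themselves.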

regression with stochastic gradient descent algorithm

I am studying regression with the book Machine Learning in Action, and I saw source code like the one below:
def stocGradAscent0(dataMatrix, classLabels):
    m, n = np.shape(dataMatrix)
    alpha = 0.01
    weights = np.ones(n)  # initialize to all ones
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i]*weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights
You may guess what the code means, but I didn't understand it. I read the book several times and searched for related material on Wikipedia and Google, but I still don't see where the exponential function comes from when computing the weights that minimize the differences. And why do we get proper weights by applying the exponential function to the sum of X*weights? It looks like a kind of OLS. Anyway, we then get the result like below:
Thanks!
These are just the basics of regression with a sigmoid (logistic regression). In the for loop it computes the prediction and its error:
Z = β₀ + β₁X, where β₁ and X are matrices
hΘ(x) = sigmoid(Z)
i.e. hΘ(x) = 1/(1 + e^-(β₀ + β₁X))
and then updates the weights. Normally it's better to run a high number of iterations, like 1000, rather than a single pass over the m samples in the for loop; m would be small, I guess.
I want to explain more, but I can't explain it better than this dude here.
Happy learning!!
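To show where the exponential comes from in practice, here is a dependency-free sketch of the same update rule. It assumes binary labels in {0, 1} and a leading constant feature of 1.0 as the bias term. The toy data and the names sigmoid and stoc_grad_ascent are invented for this example, and it runs several passes over the data rather than one. The update error * input is the gradient of the Bernoulli log-likelihood when the prediction is sigmoid(weights · input), which is why this is gradient ascent rather than OLS.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def stoc_grad_ascent(rows, labels, alpha=0.1, passes=200):
    """Stochastic gradient ascent on the logistic log-likelihood."""
    weights = [1.0] * len(rows[0])  # initialize to all ones, as in the book
    for _ in range(passes):
        for row, label in zip(rows, labels):
            h = sigmoid(sum(w * v for w, v in zip(weights, row)))
            error = label - h  # gradient of the log-likelihood w.r.t. the score
            weights = [w + alpha * error * v for w, v in zip(weights, row)]
    return weights


# Toy data (hypothetical): label is 1 when x > 0, first feature is the bias.
rows = [[1.0, x] for x in (-3.0, -2.0, -1.0, 1.0, 2.0, 3.0)]
labels = [0, 0, 0, 1, 1, 1]

weights = stoc_grad_ascent(rows, labels)
predictions = [
    1 if sigmoid(sum(w * v for w, v in zip(weights, row))) > 0.5 else 0
    for row in rows
]
print(weights, predictions)
```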

What would be a good loss function to penalize the magnitude and sign difference

I'm in a situation where I need to train a model to predict a scalar value, and it's important for the predicted value to have the same sign as the true value while keeping the squared error minimal.
What would be a good choice of loss function for that?
For example:
Let's say the predicted value is -1 and the true value is 1. The loss between the two should be a lot greater than the loss between 3 and 1, even though the squared error of (3, 1) and (-1, 1) is equal.
Thanks a lot!
This turned out to be a really interesting question - thanks for asking it! First, remember that you want your loss functions to be defined entirely in terms of differentiable operations, so that you can back-propagate through them. This means that any old arbitrary logic won't necessarily do. To restate your problem: you want to find a differentiable function of two variables that increases sharply when the two variables take on values of different signs, and more slowly when they share the same sign. Additionally, you want some control over how sharply these values increase, relative to one another. Thus, we want something with two configurable constants. I started constructing a function that met these needs, but then remembered one you can find in any high school geometry textbook: the elliptic paraboloid!
The standard formulation doesn't meet the requirement of sign agreement symmetry, so I had to introduce a rotation. The plot above is the result. Note that it increases more sharply when the signs don't agree, and less sharply when they do, and that the input constants controlling this behaviour are configurable. The code below is all that was needed to define and plot the loss function. I don't think I've ever used a geometric form as a loss function before - really neat.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib versions
from matplotlib import cm

def elliptic_paraboloid_loss(x, y, c_diff_sign, c_same_sign):
    # Compute a rotated elliptic paraboloid.
    t = np.pi / 4
    x_rot = (x * np.cos(t)) + (y * np.sin(t))
    y_rot = (x * -np.sin(t)) + (y * np.cos(t))
    z = ((x_rot**2) / c_diff_sign) + ((y_rot**2) / c_same_sign)
    return z

c_diff_sign = 4
c_same_sign = 2
a = np.arange(-5, 5, 0.1)
b = np.arange(-5, 5, 0.1)
loss_map = np.zeros((len(a), len(b)))
for i, a_i in enumerate(a):
    for j, b_j in enumerate(b):
        loss_map[i, j] = elliptic_paraboloid_loss(a_i, b_j, c_diff_sign, c_same_sign)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') no longer works on recent Matplotlib
X, Y = np.meshgrid(a, b)
surf = ax.plot_surface(X, Y, loss_map, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
plt.show()
From what I understand, your current loss function is something like:
loss = mean_square_error(y, y_pred)
What you could do is add one more component to your loss: a component that penalizes negative numbers and does nothing with positive numbers. You can choose a coefficient for how much you want to penalize it. For that, we can use something like a negative-shaped ReLU. Something like this:
Let's call this component "Neg_ReLU". Then your loss function will be:
loss = mean_squared_error(y, y_pred) + Neg_ReLU(y_pred)
So for example, if your result is -1, then the total error would be:
mean_squared_error(1, -1) + 1
And if your result is 3, then the total error would be:
mean_squared_error(1, 3) + 0
(See in the above function how Neg_ReLU(3) = 0 and Neg_ReLU(-1) = 1.)
If you want to penalize negative values more, then you can add a coefficient:
coeff_negative_value = 2
loss = mean_squared_error(y, y_pred) + coeff_negative_value * Neg_ReLU(y_pred)
Now the negative values are penalized more.
We can build the negative ReLU function like this:
tf.nn.relu(tf.math.negative(value))
So, summarizing, in the end your total loss will be:
coeff = 1
Neg_ReLU = tf.nn.relu(tf.math.negative(y_pred))
total_loss = mean_squared_error(y, y_pred) + coeff * Neg_ReLU
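Note that the Neg_ReLU term above penalizes negative predictions whatever the sign of the true value. As a possible variation (my own assumption, not something either answer proposes), the penalty can be applied to the product of the true and predicted values, so it is active only when their signs disagree. A minimal scalar sketch in plain Python, with hypothetical names:

```python
def relu(value):
    """Rectified linear unit on a plain float."""
    return max(0.0, value)


def sign_aware_loss(y_true, y_pred, coeff=1.0):
    """Squared error plus a penalty that is active only on sign disagreement.

    -y_true * y_pred is positive exactly when the two values have
    opposite signs, so the ReLU leaves same-sign pairs unpenalized.
    """
    squared_error = (y_true - y_pred) ** 2
    sign_penalty = relu(-y_true * y_pred)
    return squared_error + coeff * sign_penalty


# The example from the question: both pairs have a squared error of 4.
loss_wrong_sign = sign_aware_loss(1.0, -1.0)  # 4 + 1 = 5.0
loss_right_sign = sign_aware_loss(1.0, 3.0)   # 4 + 0 = 4.0
print(loss_wrong_sign, loss_right_sign)
```

In a framework with automatic differentiation the same expression could be written with its ReLU and element-wise multiply, and coeff plays the role of coeff_negative_value above.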

LSTM RNN Backpropagation

Could someone give a clear explanation of backpropagation for LSTM RNNs?
This is the type of structure I am working with. My question is not about what backpropagation is; I understand it is a reverse-order method of calculating the error between the hypothesis and the output, used for adjusting the weights of neural networks. My question is how LSTM backpropagation is different from that of regular neural networks.
I am unsure of how to find the initial error of each gate. Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation? I am unsure how the cell state plays a role in the backprop of LSTMs, if it does at all. I have looked thoroughly for a good source on LSTMs but have yet to find any.
That's a good question. You certainly should take a look at suggested posts for details, but a complete example here would be helpful too.
RNN Backpropagation
I think it makes sense to talk about an ordinary RNN first (because the LSTM diagram is particularly confusing) and understand its backpropagation.
When it comes to backpropagation, the key idea is network unrolling, which is a way to transform the recursion in an RNN into a feed-forward sequence (like in the picture above). Note that the abstract RNN is eternal (can be arbitrarily large), but each particular implementation is limited because memory is limited. As a result, the unrolled network really is a long feed-forward network with a few complications, e.g. the weights in different layers are shared.
Let's take a look at a classic example, char-rnn by Andrej Karpathy. Here each RNN cell produces two outputs, h[t] (the state which is fed into the next cell) and y[t] (the output on this step), by the following formulas, where Wxh, Whh and Why are the shared parameters:
h[t] = tanh(Wxh * x[t] + Whh * h[t-1] + bh)
y[t] = Why * h[t] + by
In the code, it's simply three matrices and two bias vectors:
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias
The forward pass is pretty straightforward; this example uses softmax and cross-entropy loss. Note that each iteration uses the same W* and b* arrays, but the output and hidden state are different:
# forward pass
for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
Now, the backward pass is performed exactly as if it were a feed-forward network, but the gradients of the W* and b* arrays accumulate the gradients from all cells:
for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
Both passes above are done in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it bigger to capture longer dependencies in the input, but you pay for it by storing all outputs and gradients per each cell.
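To check the accumulation idea on something small, here is a sketch of a scalar RNN in plain Python. This is a deliberately simplified stand-in for char-rnn: one hidden unit, no biases, a squared-error loss on the hidden state, and made-up inputs and targets. The backward loop mirrors the one above (add the gradient coming from the next step, go back through tanh, accumulate into the shared weights) and is compared against finite differences.

```python
import math


def forward(w_x, w_h, xs, ys):
    """Unrolled forward pass; returns the loss and all hidden states."""
    hs = [0.0]  # hs[0] is the initial hidden state
    loss = 0.0
    for x, y in zip(xs, ys):
        h = math.tanh(w_x * x + w_h * hs[-1])
        hs.append(h)
        loss += 0.5 * (h - y) ** 2
    return loss, hs


def backward(w_x, w_h, xs, ys, hs):
    """Backpropagation through time with gradients summed over all steps."""
    d_w_x, d_w_h, d_h_next = 0.0, 0.0, 0.0
    for t in reversed(range(len(xs))):
        h, h_prev = hs[t + 1], hs[t]
        d_h = (h - ys[t]) + d_h_next   # local loss gradient + gradient from step t+1
        d_raw = (1.0 - h * h) * d_h    # backprop through tanh
        d_w_x += d_raw * xs[t]         # shared weights accumulate over steps
        d_w_h += d_raw * h_prev
        d_h_next = w_h * d_raw         # pass gradient to the previous step
    return d_w_x, d_w_h


xs = [0.5, -1.0, 0.25, 0.8]   # hypothetical inputs
ys = [0.1, -0.3, 0.2, 0.4]    # hypothetical targets
w_x, w_h = 0.7, -0.4

loss, hs = forward(w_x, w_h, xs, ys)
d_w_x, d_w_h = backward(w_x, w_h, xs, ys, hs)

eps = 1e-6
num_d_w_x = (forward(w_x + eps, w_h, xs, ys)[0] - forward(w_x - eps, w_h, xs, ys)[0]) / (2 * eps)
num_d_w_h = (forward(w_x, w_h + eps, xs, ys)[0] - forward(w_x, w_h - eps, xs, ys)[0]) / (2 * eps)
print(d_w_x, num_d_w_x, d_w_h, num_d_w_h)
```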
What's different in LSTMs
The LSTM picture and formulas look intimidating, but once you have coded a plain vanilla RNN, the implementation of an LSTM is pretty much the same. For example, here is the backward pass:
# Loop over all cells, like before
d_h_next_t = np.zeros((N, H))
d_c_next_t = np.zeros((N, H))
for t in reversed(range(T)):
    d_x_t, d_h_prev_t, d_c_prev_t, d_Wx_t, d_Wh_t, d_b_t = lstm_step_backward(d_h_next_t + d_h[:,t,:], d_c_next_t, cache[t])
    d_c_next_t = d_c_prev_t
    d_h_next_t = d_h_prev_t
    d_x[:,t,:] = d_x_t
    d_h0 = d_h_prev_t
    d_Wx += d_Wx_t
    d_Wh += d_Wh_t
    d_b += d_b_t

# The step in each cell
# Captures all LSTM complexity in few formulas.
def lstm_step_backward(d_next_h, d_next_c, cache):
    """
    Backward pass for a single timestep of an LSTM.
    Inputs:
    - d_next_h: Gradients of next hidden state, of shape (N, H)
    - d_next_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass
    Returns a tuple of:
    - d_x: Gradient of input data, of shape (N, D)
    - d_prev_h: Gradient of previous hidden state, of shape (N, H)
    - d_prev_c: Gradient of previous cell state, of shape (N, H)
    - d_Wx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - d_Wh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - d_b: Gradient of biases, of shape (4H,)
    """
    x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h = cache
    d_z = o * d_next_h
    d_o = z * d_next_h
    d_next_c += (1 - z * z) * d_z
    d_f = d_next_c * prev_c
    d_prev_c = d_next_c * f
    d_i = d_next_c * g
    d_g = d_next_c * i
    d_a_g = (1 - g * g) * d_g
    d_a_o = o * (1 - o) * d_o
    d_a_f = f * (1 - f) * d_f
    d_a_i = i * (1 - i) * d_i
    d_a = np.concatenate((d_a_i, d_a_f, d_a_o, d_a_g), axis=1)
    d_prev_h = d_a.dot(Wh.T)
    d_Wh = prev_h.T.dot(d_a)
    d_x = d_a.dot(Wx.T)
    d_Wx = x.T.dot(d_a)
    d_b = np.sum(d_a, axis=0)
    return d_x, d_prev_h, d_prev_c, d_Wx, d_Wh, d_b
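The forward step that fills the cache is not shown above, so here is a scalar sketch of what it presumably looks like, together with the matching backward formulas. It is plain Python with one unit and one sample; the gate order i, f, o, g follows the concatenation in lstm_step_backward, and the weights, inputs and the toy loss h + 2 * c are invented purely to check the gradients against finite differences.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h_prev, c_prev, w_x, w_h, b):
    """Scalar LSTM step; the four gate pre-activations are ordered i, f, o, g."""
    a = [w_x[k] * x + w_h[k] * h_prev + b[k] for k in range(4)]
    i, f, o = sigmoid(a[0]), sigmoid(a[1]), sigmoid(a[2])
    g = math.tanh(a[3])
    c = f * c_prev + i * g   # new cell state
    z = math.tanh(c)
    h = o * z                # new hidden state
    return h, c, (i, f, o, g, z)


def lstm_step_backward_scalar(d_h, d_c, c_prev, w_h, gates):
    """Same formulas as lstm_step_backward above, for scalars."""
    i, f, o, g, z = gates
    d_o = z * d_h
    d_c_total = d_c + (1.0 - z * z) * o * d_h  # cell state gets gradient via h too
    d_c_prev = d_c_total * f
    d_a = [
        i * (1.0 - i) * d_c_total * g,        # input gate
        f * (1.0 - f) * d_c_total * c_prev,   # forget gate
        o * (1.0 - o) * d_o,                  # output gate
        (1.0 - g * g) * d_c_total * i,        # candidate value
    ]
    d_h_prev = sum(d_a[k] * w_h[k] for k in range(4))
    return d_h_prev, d_c_prev


# Hypothetical parameters and inputs.
w_x, w_h, b = [0.1, -0.2, 0.3, 0.4], [0.5, -0.6, 0.7, -0.8], [0.0, 0.1, -0.1, 0.2]
x, h_prev, c_prev = 0.3, -0.2, 0.5


def toy_loss(h_in, c_in):
    h, c, _ = lstm_step(x, h_in, c_in, w_x, w_h, b)
    return h + 2.0 * c  # so d_h = 1 and d_c = 2


_, _, gates = lstm_step(x, h_prev, c_prev, w_x, w_h, b)
d_h_prev, d_c_prev = lstm_step_backward_scalar(1.0, 2.0, c_prev, w_h, gates)

eps = 1e-6
num_d_h_prev = (toy_loss(h_prev + eps, c_prev) - toy_loss(h_prev - eps, c_prev)) / (2 * eps)
num_d_c_prev = (toy_loss(h_prev, c_prev + eps) - toy_loss(h_prev, c_prev - eps)) / (2 * eps)
print(d_h_prev, num_d_h_prev, d_c_prev, num_d_c_prev)
```

The line computing d_c_total is where the cell state enters backprop: it receives gradient both directly from the next time step and indirectly through the hidden state.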
Summary
Now, back to your questions.
My question is how LSTM backpropagation is different from regular neural networks
There are shared weights in different layers, and a few additional variables (states) that you need to pay attention to. Other than this, there is no difference at all.
Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation?
First up, the loss function is not necessarily L2. In the example above it's a cross-entropy loss, so initial error signal gets its gradient:
# remember that ps is the probability distribution from the forward pass
dy = np.copy(ps[t])
dy[targets[t]] -= 1
Note that it's the same error signal as in an ordinary feed-forward neural network. If you use L2 loss, the signal is indeed the difference between the ground truth and the actual output.
In the case of an LSTM, it's slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient from the loss function, which means that the error signal of each cell gets accumulated. But once again, if you unroll the LSTM, you'll see a direct correspondence with the network wiring.
I think your questions could not be answered in a short response. Nico's simple LSTM has a link to a great paper from Lipton et al.; please read it. His simple Python code sample also helps to answer most of your questions.
If you understand Nico's last sentence
ds = self.state.o * top_diff_h + top_diff_s
in detail, please give me some feedback. At the moment I have one remaining problem with his "Putting all this s and h derivations together".

Perceptron training - delta rule

According to Wikipedia, with the delta rule we adjust the weight by:
dw = alpha * (ti-yi)*g'(hj)xi
where alpha = learning constant, ti = true answer, yi = perceptron's guess, g' = the derivative of the activation function g with respect to the weighted sum of the perceptron's inputs, and xi = input.
The part that I don't understand in this formula is the multiplication by the derivative g'. Let g = sign(x) (the sign of the weighted sum). Then g' is always 0, and dw = 0. However, in code examples I saw on the internet, the writers just omitted the g' and used the formula:
dw = alpha * (ti-yi)*(hj)xi
I will be glad to read a proper explanation!
thank you in advance.
You're correct that if you use a step function for your activation function g, the gradient is always zero (except at 0), so the delta rule (aka gradient descent) just does nothing (dw = 0). This is why a step-function perceptron doesn't work well with gradient descent. :)
For a linear perceptron, you'd have g'(x) = 1, for dw = alpha * (t_i - y_i) * x_i.
You've seen code that uses dw = alpha * (t_i - y_i) * h_j * x_i. We can reverse-engineer what's going on here, because apparently g'(h_j) = h_j, which means remembering our calculus that we must have g(x) = e^x + constant. So apparently the code sample you found uses an exponential activation function.
This must mean that the neuron outputs are constrained to be on (0, infinity) (or I guess (a, infinity) for any finite a, for g(x) = e^x + a). I haven't run into this before, but I see some references online. Logistic or tanh activations are more common for bounded outputs (either classification or regression with known bounds).
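For comparison, here is a sketch of the classic perceptron learning rule, which keeps the step activation and simply drops the g' factor, so the update is dw = alpha * (t - y) * x. This rule is not derived in the answer above; it is the standard rule for step-function perceptrons, and the AND data, function names and hyperparameters below are my own choices for illustration.

```python
def step(value):
    """Step activation: its derivative is zero wherever it is defined."""
    return 1 if value > 0 else 0


def train_perceptron(samples, targets, learning_rate=0.5, epochs=20):
    """Perceptron learning rule: dw = learning_rate * (t - y) * x, no g' factor."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for inputs, target in zip(samples, targets):
            output = step(sum(w * x for w, x in zip(weights, inputs)))
            weights = [
                w + learning_rate * (target - output) * x
                for w, x in zip(weights, inputs)
            ]
    return weights


# Logical AND; the first input of every sample is a constant 1 for the bias.
samples = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
targets = [0, 0, 0, 1]

weights = train_perceptron(samples, targets)
outputs = [step(sum(w * x for w, x in zip(weights, s))) for s in samples]
print(weights, outputs)
```

Weights change only on misclassified samples, so the rule still learns on linearly separable data like this even though the gradient of the step function gives no signal.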
