How can I optimize gradient flow in LSTM with Pytorch?

How can I optimize gradient flow in LSTM with Pytorch? - time-series

I'm working in lstm with time-series data and I've observed a problem in the gradients of my network. I've one layer of 121 lstm cells. For each cell I've one input value and I get one output value. I work with a batch size of 121 values and I define lstm cell with batch_first = True, so my outputs are [batch,timestep,features].
Once I've the outputs (tensor of size [121,121,1]), I calculate the loss using MSELoss() and I backpropagate it. And here appears the problem. Looking the gradients of each cell, I notice that the gradients of first 100 cells (more or less) are null.
In theory, if I'm not wrong, when I backpropagate the error I calculate a gradient for each output, so I have a gradient for each cell. If that is true, I can't understand why in the first cells they are zero.
Does somebody knows what is happening?
Thank you!
PS.: I show you the gradient flow of the last cells:
Update:
As I tried to ask before, I still have a question about LSTM backpropagation. As you can see from the image below, in one cell, apart from the gradients that come from other cells, I think there’s also another gradient form itself.
For example, let’s look at the cell 1. I get the output y1 and I calculate the loss E1. I do the same with other cells. So, when I backpropagate in cell 1, I get dE2/dy2 * dy2/dh1 * dh1/dw1 + ... which are the gradients related to following cells in the network (BPTT) as #kmario23 and #DavidNg explained. And I also have the gradient related to E1 (dE1/dy1 * dy1/dw1). The first gradients can vanish during the flow, but this one not.
So, to sum up, although having a long layer of lstm cells, to my understand I have a gradient related only to each cell, therefore I don’t understand why I have gradients equal to zero. What does it happens with the error related to E1? Why is only bptt calculated?

I have been dealing with these problems several times. And here is my advice:
Use smaller number of timesteps
The hidden output of the previous timestep is passed to the current steps and multiplied by the weights. When you multiply several times, the gradient will explode or vanish exponentially with the number of timesteps.
Let's say:
# it's exploding
1.01^121 = 101979 # imagine how large it is when the weight is not 1.01
# or it's vanishing
0.9^121 = 2.9063214161987074e-06 # ~ 0.0 when we init the weight smaller than 1.0
For less cluttering, I take an example of simple RNNCell - with weights W_ih and W_hh with no bias. And in your case, W_hh is just a single number but the case might generalize for any matrix W_hh. We use the indentity activation as well.
If we unroll the RNN along all the time steps K=3, we get:
h_1 = W_ih * x_0 + W_hh * h_0 (1)
h_2 = W_ih * x_1 + W_hh * h_1 (2)
h_3 = W_ih * x_2 + W_hh * h_2 (3)
Therefore, when we need to update the weights W_hh, we have to accummulate all the gradients in the step (1), (2), (3).
grad(W_hh) = grad(W_hh at step 1) + grad(W_hh at step 2) + grad(W_hh at step 3)
# step 3
grad(W_hh at step3) = d_loss/d(h_3) * d(h_3)/d(W_hh)
grad(W_hh at step3) = d_loss/d(h_3) * h_2
# step 2
grad(W_hh at step2) = d_loss/d(h_2) * d(h_2)/d(W_hh)
grad(W_hh at step2) = d_loss/d(h_3) * d_(h_3)/d(h_2) * d(h_2)/d(W_hh)
grad(W_hh at step2) = d_loss/d(h_3) * d_(h_3)/d(h_2) * h_1
# step 1
grad(W_hh at step1) = d_loss/d(h_1) * d(h_1)/d(W_hh)
grad(W_hh at step1) = d_loss/d(h_3) * d(h_3)/d(h_2) * d(h_2)/d(h_1) * d(h_1)/d(W_hh)
grad(W_hh at step1) = d_loss/d(h_3) * d(h_3)/d(h_2) * d(h_2)/d(h_1) * h_0
# As we also:
d(h_i)/d(h_i-1) = W_hh
# Then:
grad(W_hh at step3) = d_loss/d(h_3) * h_2
grad(W_hh at step2) = d_loss/d(h_3) * W_hh * h_1
grad(W_hh at step1) = d_loss/d(h_3) * W_hh * W_hh * h_0
Let d_loss/d(h_3) = v
# We accumulate all gradients for W_hh
grad(W_hh) = v * h_2 + v * W_hh * h_1 + v * W_hh * W_hh * h_0
# If W_hh is initialized too big >> 1.0, grad(W_hh) explode quickly (-> infinity).
# If W_hh is initialized too small << 1.0, grad(W_hh) vanishes quickly (-> 0), since h_2, h_1 are vanishing after each forward step (exponentially)
Although LSTM cell has different gates (like forget gate reduce irrelevantly lengthy dependency in timestep) to mitigate these problems, it will be affected by long nummber of timesteps. It still a big question for sequential data about how to design network architecture for learning long dependency.
To avoid the problems, just reduce the number of timesteps (seq_len) into subsequences.
bs = 121
seq_len = 121
new_seq_len = seq_len // k # k = 2, 2.5 or anything to experiment
X (of [bs,seq_len, 1]) -> [ X1[bs, new_seq_len, 1], X2[bs, new_seq_len, 1],...]
Then, you pass each small batch Xi into the model, such that the initial hidden is h_(i-1) which is the hidden output of previous batch `X(i-1)
h_i = model(Xi, h_(i-1))
So it will help the model to learn some long dependency as the model of 121 timesteps.

Related

Modifying the loss in ppo in stable-baselines3

I'm trying to implement an addition to the loss function of the ppo algorithm in stable-baselines3. For this I collected additional observations for the states s(t-10) and s(t+1) which I can access in the train-function of the PPO class in ppo.py as part of the rollout_buffer.
I'm using a 3-layer-mlp as my network architecture and need the outputs of the second layer for the triplet (s(t-α), s(t), s(t+1)) to use them to calculate L = max(d(s(t+1) , s(t)) − d(s(t+1) , s(t−α)) + γ, 0), where d is the L2-distance.
Finally I want to add this term to the old loss, so loss = loss + 0.3 * L
This is my implementation starting with the original loss in line 242:
loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss
###############################
net1 = nn.Sequential(*list(self.policy.mlp_extractor.policy_net.children())[:-1])
L_losses = []
a = 0
obs = rollout_data.observations
obs_alpha = rollout_data.observations_alpha
obs_plusone = rollout_data.observations_plusone
inds = rollout_data.inds
for i in inds:
if i > alpha: # only use observations for which L can be calculated
fs_t = net1(obs[a])
fs_talpha = net1(obs_alpha[a])
fs_tone = net1(obs_plusone[a])
L = max(
th.norm(th.subtract(fs_tone, fs_t)) - th.norm(th.subtract(fs_tone, fs_talpha)) + 1.0, 0.0)
L_losses.append(L)
else:
L_losses.append(0)
a += 1
L_loss = th.mean(th.FloatTensor(L_losses))
loss += 0.3 * L_loss
So with net1 I tried to get a clone of the original network with the outputs from the second layer. I am unsure if this is the right way to do this.
I do have some questions about my approach as the resulting performance is slightly worse compared to without the added term although it should be slightly better:
Is my way of getting the outputs of the second layer of the mlp network working?
When loss.backward() is called can the gradient be calculated correctly (with the new term included)?

Perceptron : Weight for each sample data , or one common weight

With Perception learning, I am really confused on initializing and updating weight. If I have a sample data that contains 2 inputs x0 and x1 and I have 80 rows of these 2 inputs, hence 80x2 matrix.
Do I need to initialize weight as a matrix of 80x2 or just 2 values w0 and w1 ? Is final goal of perceptron learning is to find 2 weights w0 and w1 which should fit for all 80 input sample rows ?
I have following code and my errors never get to 0, despite going up to 10,000 iterations.
x=input matrix of 80x2
y=output matrix of 80x1
n = number of iterations
w=[0.1,0.1]
learningRate = 0.1
for i in range(n):
expectedT = y.transpose();
xT = x.transpose()
prediction = np.dot (w,xT)
for i in range (len(x)):
if prediction[i] >= 0:
ypred[i] = 1
else:
ypred[i] = 0
error = expectedT - ypred
# updating the weights
w = np.add(w,learningRate*(np.dot(error,x)))
globalError = globalError + np.square(error)

For each feature you will have one weight. Thus you have two features and two weights. It also helps to introduce a bias which adds another weight. For more information about bias check this Role of Bias in Neural Networks. The weights indeed should learn how to fit the sample data best. Depending on the data this can mean that you will never reach error of 0. For example a single layer perceptron can not learn an XOR gate when using a monotonic activation function. (solving XOR with single layer perceptron).
For your example I would recommend two things. Introducing a bias and stopping the training when the error is below a certain threshold or if error is 0 for example.
I completed your example to learn a logical AND gate:
# AND input and output
x = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0,1,1,1])
n = 1000
w=[0.1,0.1,0.1]
learningRate = 0.01
globalError = 0
def predict(X):
prediction = np.dot(w[0:2],X) + w[2]
ypred = np.zeros(len(y))
for i in range (len(y)):
if prediction[i] >= 0:
ypred[i] = 1
else:
ypred[i] = 0
return ypred
for i in range(n):
expectedT = y.transpose();
xT = x.transpose()
ypred = predict(xT)
error = expectedT - ypred
if sum(error) == 0:
break
# updating the weights
w[0:2] = np.add(w[0:2],learningRate*(np.dot(error,x)))
w[2] += learningRate*sum(error)
globalError = globalError + np.square(error)
After the training the error is 0
print(error)
# [0. 0. 0. 0.]
And the weights are as follows
print(w)
#[0.1, 0.1, -0.00999999999999999]
The perceptron can be used now as AND gate:
predict(x.transpose())
#array([0., 1., 1., 1.])
Hope that helps

Strange Loss function behaviour when training CNN

I'm trying to train my network on MNIST using a self-made CNN (C++).
It gives enough good results when I use a simple model, like:
Convolution (2 feature maps, 5x5) (Tanh) -> MaxPool (2x2) -> Flatten -> Fully-Connected (64) (Tanh) -> Fully-Connected (10) (Sigmoid).
After 4 epochs, it behaves like here 1.
After 16 epochs, it gives ~6,5% error on a test dataset.
But in the case of 4 feature maps in Conv, the MSE value isn't improving, sometimes even increasing 2,5 times 2.
The online training mode is used, with help of Adam optimizer (alpha: 0.01, beta_1: 0.9, beta_2: 0.999, epsilon: 1.0e-8). It is calculated as:
double AdamOptimizer::calc(int t, double& m_t, double& v_t, double g_t)
{
m_t = this->beta_1 * m_t + (1.0 - this->beta_1) * g_t;
v_t = this->beta_2 * v_t + (1.0 - this->beta_2) * (g_t * g_t);
double m_t_aver = m_t / (1.0 - std::pow(this->beta_1, t + 1));
double v_t_aver = v_t / (1.0 - std::pow(this->beta_2, t + 1));
return -(this->alpha * m_t_aver) / (std::sqrt(v_t_aver) + this->epsilon);
}
So, can be this problem caused by lack of some additional learning techniques (dropout, batch-normalization), or wrongly set parameters? Or it is caused by some implementation issues?
P. S. I provide a github link if necessary.

Try to decrease the learning rate.

Use neural network to learn a square wave function

Out of curiosity, I am trying to build a simple fully connected NN using tensorflow to learn a square wave function such as the following one:
Therefore the input is a 1D array of x value (as the horizontal axis), and the output is a binary scalar value. I used tf.nn.sparse_softmax_cross_entropy_with_logits as loss function, and tf.nn.relu as activation. There are 3 hidden layers (100*100*100) and a single input node and output node. The input data are generated to match the above wave shape and therefore the data size is not a problem.
However, the trained model seems to fail completed, predicting for the negative class always.
So I am trying to figure out why this happened. Whether the NN configuration is suboptimal, or it is due to some mathematical flaw in NN beneath the surface (though I think NN should be able to imitate any function).
Thanks.
As per suggestions in the comment section, here is the full code. One thing I noticed saying wrong earlier is, there were actually 2 output nodes (due to 2 output classes):
"""
See if neural net can find piecewise linear correlation in the data
"""
import time
import os
import tensorflow as tf
import numpy as np
def generate_placeholder(batch_size):
x_placeholder = tf.placeholder(tf.float32, shape=(batch_size, 1))
y_placeholder = tf.placeholder(tf.float32, shape=(batch_size))
return x_placeholder, y_placeholder
def feed_placeholder(x, y, x_placeholder, y_placeholder, batch_size, loop):
x_selected = [[None]] * batch_size
y_selected = [None] * batch_size
for i in range(batch_size):
x_selected[i][0] = x[min(loop*batch_size, loop*batch_size % len(x)) + i, 0]
y_selected[i] = y[min(loop*batch_size, loop*batch_size % len(y)) + i]
feed_dict = {x_placeholder: x_selected,
y_placeholder: y_selected}
return feed_dict
def inference(input_x, H1_units, H2_units, H3_units):
with tf.name_scope('H1'):
weights = tf.Variable(tf.truncated_normal([1, H1_units], stddev=1.0/2), name='weights')
biases = tf.Variable(tf.zeros([H1_units]), name='biases')
a1 = tf.nn.relu(tf.matmul(input_x, weights) + biases)
with tf.name_scope('H2'):
weights = tf.Variable(tf.truncated_normal([H1_units, H2_units], stddev=1.0/H1_units), name='weights')
biases = tf.Variable(tf.zeros([H2_units]), name='biases')
a2 = tf.nn.relu(tf.matmul(a1, weights) + biases)
with tf.name_scope('H3'):
weights = tf.Variable(tf.truncated_normal([H2_units, H3_units], stddev=1.0/H2_units), name='weights')
biases = tf.Variable(tf.zeros([H3_units]), name='biases')
a3 = tf.nn.relu(tf.matmul(a2, weights) + biases)
with tf.name_scope('softmax_linear'):
weights = tf.Variable(tf.truncated_normal([H3_units, 2], stddev=1.0/np.sqrt(H3_units)), name='weights')
biases = tf.Variable(tf.zeros([2]), name='biases')
logits = tf.matmul(a3, weights) + biases
return logits
def loss(logits, labels):
labels = tf.to_int32(labels)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits, name='xentropy')
return tf.reduce_mean(cross_entropy, name='xentropy_mean')
def inspect_y(labels):
return tf.reduce_sum(tf.cast(labels, tf.int32))
def training(loss, learning_rate):
tf.summary.scalar('lost', loss)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)
return train_op
def evaluation(logits, labels):
labels = tf.to_int32(labels)
correct = tf.nn.in_top_k(logits, labels, 1)
return tf.reduce_sum(tf.cast(correct, tf.int32))
def run_training(x, y, batch_size):
with tf.Graph().as_default():
x_placeholder, y_placeholder = generate_placeholder(batch_size)
logits = inference(x_placeholder, 100, 100, 100)
Loss = loss(logits, y_placeholder)
y_sum = inspect_y(y_placeholder)
train_op = training(Loss, 0.01)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
max_steps = 10000
for step in range(max_steps):
start_time = time.time()
feed_dict = feed_placeholder(x, y, x_placeholder, y_placeholder, batch_size, step)
_, loss_val = sess.run([train_op, Loss], feed_dict = feed_dict)
duration = time.time() - start_time
if step % 100 == 0:
print('Step {}: loss = {:.2f} {:.3f}sec'.format(step, loss_val, duration))
x_test = np.array(range(1000)) * 0.001
x_test = np.reshape(x_test, (1000, 1))
_ = sess.run(logits, feed_dict={x_placeholder: x_test})
print(min(_[:, 0]), max(_[:, 0]), min(_[:, 1]), max(_[:, 1]))
print(_)
if __name__ == '__main__':
population = 10000
input_x = np.random.rand(population)
input_y = np.copy(input_x)
for bin in range(10):
print(bin, bin/10, 0.5 - 0.5*(-1)**bin)
input_y[input_x >= bin/10] = 0.5 - 0.5*(-1)**bin
batch_size = 1000
input_x = np.reshape(input_x, (population, 1))
run_training(input_x, input_y, batch_size)
Sample output shows that the model always prefer the first class over the second, as shown by min(_[:, 0]) > max(_[:, 1]), i.e. the minimum logit output for the first class is higher than the maximum logit output for the second class, for a sample size of population.
My mistake. The problem occurred in the line:
for i in range(batch_size):
x_selected[i][0] = x[min(loop*batch_size, loop*batch_size % len(x)) + i, 0]
y_selected[i] = y[min(loop*batch_size, loop*batch_size % len(y)) + i]
Python is mutating the whole list of x_selected to the same value. Now this code issue is resolved. The fix is:
x_selected = np.zeros((batch_size, 1))
y_selected = np.zeros((batch_size,))
for i in range(batch_size):
x_selected[i, 0] = x[(loop*batch_size + i) % x.shape[0], 0]
y_selected[i] = y[(loop*batch_size + i) % y.shape[0]]
After this fix, the model is showing more variation. It currently outputs class 0 for x <= 0.5 and class 1 for x > 0.5. But this is still far from ideal.
So after changing the network configuration to 100 nodes * 4 layers, after 1 million training steps (batch size = 100, sample size = 10 million), the model is performing very well showing only errors at the edges when y flips.
Therefore this question is closed.

You essentially try to learn a periodic function and the function is highly non-linear and non-smooth. So it is NOT simple as it looks like. In short, a better representation of the input feature helps.
Suppose your have a period T = 2, f(x) = f(x+2).
For a reduced problem when input/output are integers, your function is then f(x) = 1 if x is odd else -1. In this case, your problem would be reduced to this discussion in which we train a Neural Network to distinguish between odd and even numbers.
I guess the second bullet in that post should help (even for the general case when inputs are float numbers).
Try representing the numbers in binary using a fixed length precision.
In our reduced problem above, it's easy to see that the output is determined iff the least-significant bit is known.
decimal binary -> output
1: 0 0 1 -> 1
2: 0 1 0 -> -1
3: 0 1 1 -> 1
...

I created the model and the structure for the problem of recognizing odd/even numbers in here.
If you abstract the fact that:
decimal binary -> output
1: 0 0 1 -> 1
2: 0 1 0 -> -1
3: 0 1 1 -> 1
Is almost equivalent to:
decimal binary -> output
1: 0 0 1 -> 1
2: 0 1 0 -> 0
3: 0 1 1 -> 1
You may update the code to fit your need.

LSTM Batches vs Timesteps

I've followed the TensorFlow RNN tutorial to create a LSTM model. However, in the process, I've grown confused as to the difference, if any, between 'batches' and 'timesteps', and I'd appreciate help in clarifying this matter.
The tutorial code (see following) essentially creates 'batches' based on a designated number of steps:
with tf.variable_scope("RNN"):
for time_step in range(num_steps):
if time_step > 0: tf.get_variable_scope().reuse_variables()
(cell_output, state) = cell(inputs[:, time_step, :], state)
outputs.append(cell_output)
However, the following appears to do the same:
for epoch in range(5):
print('----- Epoch', epoch, '-----')
total_loss = 0
for i in range(inputs_cnt // BATCH_SIZE):
inputs_batch = train_inputs[i * BATCH_SIZE: (i + 1) * BATCH_SIZE]
orders_batch = train_orders[i * BATCH_SIZE: (i + 1) * BATCH_SIZE]
feed_dict = {story: inputs_batch, order: orders_batch}
logits, xent, loss = sess.run([...], feed_dict=feed_dict)

Assuming you are working with text, BATCH_SIZE would be the number of sentences that you are processing in parallel and num_steps would be the maximum number of words in any sentence. These are different dimensions of your input to the LSTM.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart