I'm trying to implement an addition to the loss function of the ppo algorithm in stable-baselines3. For this I collected additional observations for the states s(t-10) and s(t+1) which I can access in the train-function of the PPO class in ppo.py as part of the rollout_buffer.
I'm using a 3-layer-mlp as my network architecture and need the outputs of the second layer for the triplet (s(t-α), s(t), s(t+1)) to use them to calculate L = max(d(s(t+1) , s(t)) − d(s(t+1) , s(t−α)) + γ, 0), where d is the L2-distance.
Finally I want to add this term to the old loss, so loss = loss + 0.3 * L
This is my implementation starting with the original loss in line 242:
loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss
###############################
net1 = nn.Sequential(*list(self.policy.mlp_extractor.policy_net.children())[:-1])
L_losses = []
a = 0
obs = rollout_data.observations
obs_alpha = rollout_data.observations_alpha
obs_plusone = rollout_data.observations_plusone
inds = rollout_data.inds
for i in inds:
if i > alpha: # only use observations for which L can be calculated
fs_t = net1(obs[a])
fs_talpha = net1(obs_alpha[a])
fs_tone = net1(obs_plusone[a])
L = max(
th.norm(th.subtract(fs_tone, fs_t)) - th.norm(th.subtract(fs_tone, fs_talpha)) + 1.0, 0.0)
L_losses.append(L)
else:
L_losses.append(0)
a += 1
L_loss = th.mean(th.FloatTensor(L_losses))
loss += 0.3 * L_loss
So with net1 I tried to get a clone of the original network with the outputs from the second layer. I am unsure if this is the right way to do this.
I do have some questions about my approach as the resulting performance is slightly worse compared to without the added term although it should be slightly better:
Is my way of getting the outputs of the second layer of the mlp network working?
When loss.backward() is called can the gradient be calculated correctly (with the new term included)?
With Perception learning, I am really confused on initializing and updating weight. If I have a sample data that contains 2 inputs x0 and x1 and I have 80 rows of these 2 inputs, hence 80x2 matrix.
Do I need to initialize weight as a matrix of 80x2 or just 2 values w0 and w1 ? Is final goal of perceptron learning is to find 2 weights w0 and w1 which should fit for all 80 input sample rows ?
I have following code and my errors never get to 0, despite going up to 10,000 iterations.
x=input matrix of 80x2
y=output matrix of 80x1
n = number of iterations
w=[0.1,0.1]
learningRate = 0.1
for i in range(n):
expectedT = y.transpose();
xT = x.transpose()
prediction = np.dot (w,xT)
for i in range (len(x)):
if prediction[i] >= 0:
ypred[i] = 1
else:
ypred[i] = 0
error = expectedT - ypred
# updating the weights
w = np.add(w,learningRate*(np.dot(error,x)))
globalError = globalError + np.square(error)
For each feature you will have one weight. Thus you have two features and two weights. It also helps to introduce a bias which adds another weight. For more information about bias check this Role of Bias in Neural Networks. The weights indeed should learn how to fit the sample data best. Depending on the data this can mean that you will never reach error of 0. For example a single layer perceptron can not learn an XOR gate when using a monotonic activation function. (solving XOR with single layer perceptron).
For your example I would recommend two things. Introducing a bias and stopping the training when the error is below a certain threshold or if error is 0 for example.
I completed your example to learn a logical AND gate:
# AND input and output
x = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0,1,1,1])
n = 1000
w=[0.1,0.1,0.1]
learningRate = 0.01
globalError = 0
def predict(X):
prediction = np.dot(w[0:2],X) + w[2]
ypred = np.zeros(len(y))
for i in range (len(y)):
if prediction[i] >= 0:
ypred[i] = 1
else:
ypred[i] = 0
return ypred
for i in range(n):
expectedT = y.transpose();
xT = x.transpose()
ypred = predict(xT)
error = expectedT - ypred
if sum(error) == 0:
break
# updating the weights
w[0:2] = np.add(w[0:2],learningRate*(np.dot(error,x)))
w[2] += learningRate*sum(error)
globalError = globalError + np.square(error)
After the training the error is 0
print(error)
# [0. 0. 0. 0.]
And the weights are as follows
print(w)
#[0.1, 0.1, -0.00999999999999999]
The perceptron can be used now as AND gate:
predict(x.transpose())
#array([0., 1., 1., 1.])
Hope that helps
I'm trying to train my network on MNIST using a self-made CNN (C++).
It gives enough good results when I use a simple model, like:
Convolution (2 feature maps, 5x5) (Tanh) -> MaxPool (2x2) -> Flatten -> Fully-Connected (64) (Tanh) -> Fully-Connected (10) (Sigmoid).
After 4 epochs, it behaves like here 1.
After 16 epochs, it gives ~6,5% error on a test dataset.
But in the case of 4 feature maps in Conv, the MSE value isn't improving, sometimes even increasing 2,5 times 2.
The online training mode is used, with help of Adam optimizer (alpha: 0.01, beta_1: 0.9, beta_2: 0.999, epsilon: 1.0e-8). It is calculated as:
double AdamOptimizer::calc(int t, double& m_t, double& v_t, double g_t)
{
m_t = this->beta_1 * m_t + (1.0 - this->beta_1) * g_t;
v_t = this->beta_2 * v_t + (1.0 - this->beta_2) * (g_t * g_t);
double m_t_aver = m_t / (1.0 - std::pow(this->beta_1, t + 1));
double v_t_aver = v_t / (1.0 - std::pow(this->beta_2, t + 1));
return -(this->alpha * m_t_aver) / (std::sqrt(v_t_aver) + this->epsilon);
}
So, can be this problem caused by lack of some additional learning techniques (dropout, batch-normalization), or wrongly set parameters? Or it is caused by some implementation issues?
P. S. I provide a github link if necessary.
Try to decrease the learning rate.
Out of curiosity, I am trying to build a simple fully connected NN using tensorflow to learn a square wave function such as the following one:
Therefore the input is a 1D array of x value (as the horizontal axis), and the output is a binary scalar value. I used tf.nn.sparse_softmax_cross_entropy_with_logits as loss function, and tf.nn.relu as activation. There are 3 hidden layers (100*100*100) and a single input node and output node. The input data are generated to match the above wave shape and therefore the data size is not a problem.
However, the trained model seems to fail completed, predicting for the negative class always.
So I am trying to figure out why this happened. Whether the NN configuration is suboptimal, or it is due to some mathematical flaw in NN beneath the surface (though I think NN should be able to imitate any function).
Thanks.
As per suggestions in the comment section, here is the full code. One thing I noticed saying wrong earlier is, there were actually 2 output nodes (due to 2 output classes):
"""
See if neural net can find piecewise linear correlation in the data
"""
import time
import os
import tensorflow as tf
import numpy as np
def generate_placeholder(batch_size):
x_placeholder = tf.placeholder(tf.float32, shape=(batch_size, 1))
y_placeholder = tf.placeholder(tf.float32, shape=(batch_size))
return x_placeholder, y_placeholder
def feed_placeholder(x, y, x_placeholder, y_placeholder, batch_size, loop):
x_selected = [[None]] * batch_size
y_selected = [None] * batch_size
for i in range(batch_size):
x_selected[i][0] = x[min(loop*batch_size, loop*batch_size % len(x)) + i, 0]
y_selected[i] = y[min(loop*batch_size, loop*batch_size % len(y)) + i]
feed_dict = {x_placeholder: x_selected,
y_placeholder: y_selected}
return feed_dict
def inference(input_x, H1_units, H2_units, H3_units):
with tf.name_scope('H1'):
weights = tf.Variable(tf.truncated_normal([1, H1_units], stddev=1.0/2), name='weights')
biases = tf.Variable(tf.zeros([H1_units]), name='biases')
a1 = tf.nn.relu(tf.matmul(input_x, weights) + biases)
with tf.name_scope('H2'):
weights = tf.Variable(tf.truncated_normal([H1_units, H2_units], stddev=1.0/H1_units), name='weights')
biases = tf.Variable(tf.zeros([H2_units]), name='biases')
a2 = tf.nn.relu(tf.matmul(a1, weights) + biases)
with tf.name_scope('H3'):
weights = tf.Variable(tf.truncated_normal([H2_units, H3_units], stddev=1.0/H2_units), name='weights')
biases = tf.Variable(tf.zeros([H3_units]), name='biases')
a3 = tf.nn.relu(tf.matmul(a2, weights) + biases)
with tf.name_scope('softmax_linear'):
weights = tf.Variable(tf.truncated_normal([H3_units, 2], stddev=1.0/np.sqrt(H3_units)), name='weights')
biases = tf.Variable(tf.zeros([2]), name='biases')
logits = tf.matmul(a3, weights) + biases
return logits
def loss(logits, labels):
labels = tf.to_int32(labels)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits, name='xentropy')
return tf.reduce_mean(cross_entropy, name='xentropy_mean')
def inspect_y(labels):
return tf.reduce_sum(tf.cast(labels, tf.int32))
def training(loss, learning_rate):
tf.summary.scalar('lost', loss)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)
return train_op
def evaluation(logits, labels):
labels = tf.to_int32(labels)
correct = tf.nn.in_top_k(logits, labels, 1)
return tf.reduce_sum(tf.cast(correct, tf.int32))
def run_training(x, y, batch_size):
with tf.Graph().as_default():
x_placeholder, y_placeholder = generate_placeholder(batch_size)
logits = inference(x_placeholder, 100, 100, 100)
Loss = loss(logits, y_placeholder)
y_sum = inspect_y(y_placeholder)
train_op = training(Loss, 0.01)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
max_steps = 10000
for step in range(max_steps):
start_time = time.time()
feed_dict = feed_placeholder(x, y, x_placeholder, y_placeholder, batch_size, step)
_, loss_val = sess.run([train_op, Loss], feed_dict = feed_dict)
duration = time.time() - start_time
if step % 100 == 0:
print('Step {}: loss = {:.2f} {:.3f}sec'.format(step, loss_val, duration))
x_test = np.array(range(1000)) * 0.001
x_test = np.reshape(x_test, (1000, 1))
_ = sess.run(logits, feed_dict={x_placeholder: x_test})
print(min(_[:, 0]), max(_[:, 0]), min(_[:, 1]), max(_[:, 1]))
print(_)
if __name__ == '__main__':
population = 10000
input_x = np.random.rand(population)
input_y = np.copy(input_x)
for bin in range(10):
print(bin, bin/10, 0.5 - 0.5*(-1)**bin)
input_y[input_x >= bin/10] = 0.5 - 0.5*(-1)**bin
batch_size = 1000
input_x = np.reshape(input_x, (population, 1))
run_training(input_x, input_y, batch_size)
Sample output shows that the model always prefer the first class over the second, as shown by min(_[:, 0]) > max(_[:, 1]), i.e. the minimum logit output for the first class is higher than the maximum logit output for the second class, for a sample size of population.
My mistake. The problem occurred in the line:
for i in range(batch_size):
x_selected[i][0] = x[min(loop*batch_size, loop*batch_size % len(x)) + i, 0]
y_selected[i] = y[min(loop*batch_size, loop*batch_size % len(y)) + i]
Python is mutating the whole list of x_selected to the same value. Now this code issue is resolved. The fix is:
x_selected = np.zeros((batch_size, 1))
y_selected = np.zeros((batch_size,))
for i in range(batch_size):
x_selected[i, 0] = x[(loop*batch_size + i) % x.shape[0], 0]
y_selected[i] = y[(loop*batch_size + i) % y.shape[0]]
After this fix, the model is showing more variation. It currently outputs class 0 for x <= 0.5 and class 1 for x > 0.5. But this is still far from ideal.
So after changing the network configuration to 100 nodes * 4 layers, after 1 million training steps (batch size = 100, sample size = 10 million), the model is performing very well showing only errors at the edges when y flips.
Therefore this question is closed.
You essentially try to learn a periodic function and the function is highly non-linear and non-smooth. So it is NOT simple as it looks like. In short, a better representation of the input feature helps.
Suppose your have a period T = 2, f(x) = f(x+2).
For a reduced problem when input/output are integers, your function is then f(x) = 1 if x is odd else -1. In this case, your problem would be reduced to this discussion in which we train a Neural Network to distinguish between odd and even numbers.
I guess the second bullet in that post should help (even for the general case when inputs are float numbers).
Try representing the numbers in binary using a fixed length precision.
In our reduced problem above, it's easy to see that the output is determined iff the least-significant bit is known.
decimal binary -> output
1: 0 0 1 -> 1
2: 0 1 0 -> -1
3: 0 1 1 -> 1
...
I created the model and the structure for the problem of recognizing odd/even numbers in here.
If you abstract the fact that:
decimal binary -> output
1: 0 0 1 -> 1
2: 0 1 0 -> -1
3: 0 1 1 -> 1
Is almost equivalent to:
decimal binary -> output
1: 0 0 1 -> 1
2: 0 1 0 -> 0
3: 0 1 1 -> 1
You may update the code to fit your need.
I've followed the TensorFlow RNN tutorial to create a LSTM model. However, in the process, I've grown confused as to the difference, if any, between 'batches' and 'timesteps', and I'd appreciate help in clarifying this matter.
The tutorial code (see following) essentially creates 'batches' based on a designated number of steps:
with tf.variable_scope("RNN"):
for time_step in range(num_steps):
if time_step > 0: tf.get_variable_scope().reuse_variables()
(cell_output, state) = cell(inputs[:, time_step, :], state)
outputs.append(cell_output)
However, the following appears to do the same:
for epoch in range(5):
print('----- Epoch', epoch, '-----')
total_loss = 0
for i in range(inputs_cnt // BATCH_SIZE):
inputs_batch = train_inputs[i * BATCH_SIZE: (i + 1) * BATCH_SIZE]
orders_batch = train_orders[i * BATCH_SIZE: (i + 1) * BATCH_SIZE]
feed_dict = {story: inputs_batch, order: orders_batch}
logits, xent, loss = sess.run([...], feed_dict=feed_dict)
Assuming you are working with text, BATCH_SIZE would be the number of sentences that you are processing in parallel and num_steps would be the maximum number of words in any sentence. These are different dimensions of your input to the LSTM.