Slow training of RL algorithm and low CPU usage - machine-learning

I have a Python 3.6.4 script, which is a simple reinforcement learning script based on a 4-layer MLP (input, hidden-128, hidden-256, output). I can't publish my whole code here because it is too big, so I will post only a small part of it.
def train(model, epochs):
    total = 0
    start = time.time()
    entire_hist = []
    profits = []
    for i in range(epochs):
        loss = 0
        accuracy = 0
        hist = []
        env.reset()
        game_over = False
        input_t = env.observe()
        while not game_over:
            input_tm1 = input_t
            if np.random.random() <= epsilon:
                action = np.random.randint(0, num_actions, size=None)
            else:
                q = model.predict(input_tm1)
                action = np.argmax(q[0])
            input_t, reward, game_over = env.act(action)
            exp_replay.remember([input_tm1, action, reward, input_t], game_over)
            inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)
            loss, accuracy = model.train_on_batch(inputs, targets)
            hist.append([loss, accuracy, env.main])
            print(f'counter: {env.counter}, action taken: {action}, reward: {round(reward, 2)}, main: {round(env.main)}, secondary: {env.secondary}')
            if game_over:
                print('GAME OVER!')
        entire_hist.append(hist)
        profits.append(total)
        print(f'total profit: {env.total_profit}')
        print(f'epoch: {i}, loss: {loss}, accuracy: {accuracy}')
        print('\n')
        print('*'*20)
    end = int(time.time() - start)
    print(f'training time: {end} seconds')
    return entire_hist, total
The problem is that when I run it, CPU usage is only about 20-30% and GPU usage is about 5%. I tried running on different machines and get similar results: the more powerful the CPU, the smaller the percentage of it the script uses.
That would be fine, except that it takes a few days to train such a small network when running it for 1000-5000 epochs. Can someone help me make it train faster and increase the CPU usage?
I tried running on both the CPU and GPU versions of TensorFlow.
My setup:
Latest Keras with the TensorFlow backend
tensorflow-gpu==1.4.0

Actually, it was quite simple: I just wasn't using all of my cores. Since my system was distributing the load from one core across all four, I didn't notice it.
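For reference, here is a minimal sketch of one way to let a TF 1.x / Keras setup use every core; the thread counts below are assumptions derived from the CPU count, not values from my original script:
import multiprocessing
import tensorflow as tf
from keras import backend as K

# Use every available core for both intra-op and inter-op parallelism.
n_cores = multiprocessing.cpu_count()
config = tf.ConfigProto(intra_op_parallelism_threads=n_cores,
                        inter_op_parallelism_threads=n_cores,
                        allow_soft_placement=True)
K.set_session(tf.Session(config=config))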

Related

Pytorch: Calculating running time on GPU and CPU of a for loop

I am really new to PyTorch, and I got really confused the whole day while trying to figure out why my NN runs slower on GPU than on CPU. I do not understand why, when I calculate the running time using time.time(), the time of the whole loop is so different from the sum of every single running time. Here is part of my code. Could anybody help me? Appreciate it!
time_out = 0
time_in = 0
for epoch in tqdm(range(self.n_epoch)):
    running_loss = 0
    running_error = 0
    running_acc = 0
    if self.cuda:
        torch.cuda.synchronize()  # time_out_start
    epst1 = time.time()
    for step, (batch_x, batch_y) in enumerate(self.normal_loader):
        if self.cuda:
            torch.cuda.synchronize()  # time_in_start
        t1 = time.time()
        batch_x, batch_y = batch_x.to(self.device), batch_y.to(self.device)
        b_x = Variable(batch_x)
        b_y = Variable(batch_y)
        pred_y = self.model(b_x)
        #print (pred_y)
        loss = self.criterion(pred_y, b_y)
        error = mae(pred_y.detach().cpu().numpy(), b_y.detach().cpu().numpy())
        acc = r2(b_y.detach().cpu().numpy(), pred_y.detach().cpu().numpy())
        #print (loss)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        running_acc += acc
        running_loss += loss.item()
        running_error += error
        if self.cuda:
            torch.cuda.synchronize()  # time_in_end
        t6 = time.time()
        time_in += t6 - t1
    if self.cuda:
        torch.cuda.synchronize()  # time_out_end
    eped1 = time.time()
    time_out += eped1 - epst1
    print('loop time (out):', time_out)
    print('loop time (in):', time_in)
The result is:
CPU:
EPOCH 10: out: 1.283s, in: 0.695s
EPOCH 50: out: 6.43s, in: 3.288s
EPOCH 100: out: 12.646s, in: 6.386s
GPU:
EPOCH 10: out: 3.92s, in: 1.471s
EPOCH 50: out: 9.35s, in: 3.04s
EPOCH 100: out: 18.418s, in: 5.655s
I understand that transferring data from the CPU to the GPU costs some time, so as the number of epochs goes up, the GPU's calculation time should become less than the CPU's. My questions are:
Why is the time I record outside of the loop so different from the one recorded inside it? Is there any step I missed when recording the running time?
And why does the GPU cost more outside-time even though the inside-time is less than the CPU's?
The network is really simple:
class Model(nn.Module):
    def __init__(self, n_input, n_nodes1, n_nodes2):
        super(Model, self).__init__()
        self.n_input = n_input
        self.n_nodes1 = n_nodes1
        self.n_nodes2 = n_nodes2
        self.l1 = nn.Linear(self.n_input, self.n_nodes1)
        self.l2 = nn.Linear(self.n_nodes1, self.n_nodes2)
        self.l3 = nn.Linear(self.n_nodes2, 1)

    def forward(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        h = self.l3(h2)
        return h
The training data is formed as follows (a regression problem; the inputs x are descriptors and y is the target value):
def load_train_normal(self, x, y, batch_size=100):
    if batch_size:
        self.batch_size = batch_size
    self.x_train_n, self.y_train_n = Variable(torch.from_numpy(x).float()), Variable(torch.from_numpy(y).float())
    #x, y = Variable(torch.from_numpy(x).float()), Variable(torch.from_numpy(y).float())
    self.dataset = Data.TensorDataset(self.x_train_n, self.y_train_n)
    self.normal_loader = Data.DataLoader(
        dataset=self.dataset,
        batch_size=self.batch_size,
        shuffle=True, num_workers=2,)
Why is the time I record outside of the loop so different from the inside one? Is there any step that I missed to record the running time?
self.normal_loader is not just a plain dictionary, vector, or something similarly simple. Iterating over it takes a significant amount of time.
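To see how much of that is pure data-loading overhead, you can time an equivalent DataLoader on its own. This is a standalone sketch with random stand-in data, not the question's dataset:
import time
import numpy as np
import torch
import torch.utils.data as Data

# Build a loader the same way as in the question, with random stand-in data.
x = torch.from_numpy(np.random.rand(10000, 50)).float()
y = torch.from_numpy(np.random.rand(10000, 1)).float()
loader = Data.DataLoader(Data.TensorDataset(x, y),
                         batch_size=100, shuffle=True, num_workers=2)

t0 = time.time()
for batch_x, batch_y in loader:
    pass  # no model work at all, only batching/collation overhead
print('loader-only time per epoch: %.3fs' % (time.time() - t0))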
And why does the GPU cost more outside-time even though the inside-time is less than the CPU time?
torch.cuda.synchronize() is a heavy operation, and here it doesn't even do anything useful, since pred_y.detach().cpu() had already enforced synchronization.
As for how to make it faster: drop the synchronize() calls, they don't do you any good.
Then defer the processing of pred_y until later, much later. You want to have called the model at least 2 or 3 times before you trigger the first download of results. The simpler the model and the smaller the data, the more iterations you have to wait.
Because transfers to and from the GPU don't just "take time", they imply synchronization. Without synchronization, the execution model on the GPU mostly "lags behind": data uploads to the GPU are already asynchronous behind the scenes, and actual execution is only queued behind them. If you don't synchronize, accidentally or explicitly, workloads start to overlap and everything (uploads, execution, CPU work) runs in parallel. Your effective execution time approaches max(upload, download, GPU execution, CPU execution).
If you synchronize, there are no tasks to overlap and no batches to form from same-typed tasks. Upload, execution, download, CPU part: it all happens sequentially, so your execution time ends up as upload + download + GPU execution + CPU execution, plus some additional overhead from breaking batching at the driver level. That is easily 5-10x slower than it should be.
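A rough sketch of what the inner loop could look like with the synchronize() calls dropped and the downloads deferred to the end of the epoch. It reuses names from the question (self.model, self.criterion, self.normal_loader, mae, r2) and is meant as an illustration, not a tested drop-in replacement:
# Same training step as in the question, but without explicit
# torch.cuda.synchronize() calls and with the .cpu() downloads deferred to the
# end of the epoch, so uploads, GPU execution and CPU work can overlap.
preds, labels = [], []
epoch_start = time.time()
for step, (batch_x, batch_y) in enumerate(self.normal_loader):
    b_x = batch_x.to(self.device, non_blocking=True)
    b_y = batch_y.to(self.device, non_blocking=True)

    pred_y = self.model(b_x)
    loss = self.criterion(pred_y, b_y)

    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    # keep results on the GPU for now; detach so the graph can be freed
    preds.append(pred_y.detach())
    labels.append(b_y)

# one download and one metric computation per epoch instead of per batch
preds = torch.cat(preds).cpu().numpy()
labels = torch.cat(labels).cpu().numpy()
print('epoch time: %.3fs, mae: %.4f, r2: %.4f'
      % (time.time() - epoch_start, mae(preds, labels), r2(labels, preds)))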

is binary cross entropy an additive function?

I am trying to train a machine learning model where the loss function is binary cross entropy. Because of GPU limitations I can only do a batch size of 4, and I'm getting lots of spikes in the loss graph. So I'm thinking of back-propagating after some predefined batch size (>4): I would do 10 iterations of batch size 4, store the losses, and after the 10th iteration add the losses and back-propagate. Would that be similar to a batch size of 40?
TL;DR
Is f(a+b) = f(a) + f(b) true for binary cross entropy?
f(a+b) = f(a) + f(b) doesn't seem to be what you're after. This would imply that BCELoss is additive, which it clearly isn't. I think what you really care about is whether, for some index i,
# false
f(x, y) == f(x[:i], y[:i]) + f(x[i:], y[i:])
is true?
The short answer is no, because you're missing some scale factors. What you probably want is the following identity
# true
f(x, y) == (i / b) * f(x[:i], y[:i]) + (1.0 - i / b) * f(x[i:], y[i:])
where b is the total batch size.
This identity is used as motivation for the gradient accumulation method (see below). Also, this identity applies to any objective function which returns an average loss across each batch element, not just BCE.
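As a quick numerical sanity check of that identity, here is a self-contained sketch with made-up shapes (b and i are arbitrary):
import torch
import torch.nn as nn

torch.manual_seed(0)
bce = nn.BCELoss()  # default reduction='mean', i.e. an average over all batch elements

b, i = 8, 3                               # total batch size and an arbitrary split index
x = torch.rand(b, 5)                      # "predictions" already in (0, 1)
y = torch.randint(0, 2, (b, 5)).float()   # 0/1 targets

full = bce(x, y)
split = (i / b) * bce(x[:i], y[:i]) + (1.0 - i / b) * bce(x[i:], y[i:])
print(torch.allclose(full, split))  # True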
Caveat/Pitfall: Keep in mind that batch norm will not behave exactly the same when using this approach since it updates its internal statistics based on batch size during the forward pass.
We can actually do a little better memory-wise than just computing the loss as a sum followed by backpropagation. Instead we can compute the gradient of each component in the equivalent sum individually and allow the gradients to accumulate. To better explain, I'll give some examples of equivalent operations.
Consider the following model
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        num_outputs = 5
        # assume input shape is 10x10
        self.conv_layer = nn.Conv2d(3, 10, 3, 1, 1)
        self.fc_layer = nn.Linear(10*5*5, num_outputs)

    def forward(self, x):
        x = self.conv_layer(x)
        x = F.max_pool2d(x, 2, 2, 0, 1, False, False)
        x = F.relu(x)
        x = self.fc_layer(x.flatten(start_dim=1))
        x = torch.sigmoid(x)  # or omit this and use BCEWithLogitsLoss instead of BCELoss
        return x

# to ensure same results for this example
torch.manual_seed(0)
model = MyModel()

# the examples will work as long as the objective averages across batch elements
objective = nn.BCELoss()

# doesn't matter what type of optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
and let's say our data and targets for a single batch are
torch.manual_seed(1)  # to ensure same results for this example
batch_size = 32
input_data = torch.randn((batch_size, 3, 10, 10))
# targets must match the model's output shape (batch_size, 5); randint's upper bound is exclusive, so use 2 to get 0/1 labels
targets = torch.randint(0, 2, (batch_size, 5)).float()
Full batch
The body of our training loop for an entire batch may look something like this
# entire batch
output = model(input_data)
loss = objective(output, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Weighted sum of loss on sub-batches
We could have computed the same thing as a weighted sum of losses over sub-batches:
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)
# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size

loss = 0
for sub_batch_idx in range(num_sub_batches):
    start_idx = sub_batch_size * sub_batch_idx
    end_idx = start_idx + sub_batch_size
    sub_input = input_data[start_idx:end_idx]
    sub_targets = targets[start_idx:end_idx]
    sub_output = model(sub_input)
    # add loss component for sub_batch
    loss = loss + objective(sub_output, sub_targets) / num_sub_batches

optimizer.zero_grad()
loss.backward()
optimizer.step()

loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Gradient accumulation
The problem with the previous approach is that in order to apply back-propagation, pytorch needs to store intermediate results of layers in memory for every sub-batch. This ends up requiring a relatively large amount of memory and you may still run into memory consumption issues.
To alleviate this problem, instead of computing a single loss and performing back-propagation once, we can perform gradient accumulation. This gives results equivalent to the previous version. The difference is that we perform a backward pass on each component of the loss, only stepping the optimizer once all of them have been backpropagated. This way the computation graph is cleared after each sub-batch, which helps with memory usage. Note that this works because .backward() actually accumulates (adds) the newly computed gradients to the existing .grad member of each model parameter. This is why optimizer.zero_grad() must be called only once, before the loop, and not during or after.
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)
# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size

# Important! zero the gradients before the loop
optimizer.zero_grad()
loss_value = 0.0
for sub_batch_idx in range(num_sub_batches):
    start_idx = sub_batch_size * sub_batch_idx
    end_idx = start_idx + sub_batch_size
    sub_input = input_data[start_idx:end_idx]
    sub_targets = targets[start_idx:end_idx]
    sub_output = model(sub_input)
    # compute loss component for sub_batch
    sub_loss = objective(sub_output, sub_targets) / num_sub_batches
    # accumulate gradients
    sub_loss.backward()
    loss_value += sub_loss.item()
optimizer.step()

print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
I think 10 iterations of batch size 4 is the same as one iteration of batch size 40; only the time taken will be more. Across different training examples, losses are added before backprop, but that doesn't make the function linear: BCELoss has a log component, and hence it is not a linear function. However, what you said is correct: it will be similar to a batch size of 40.

cost becoming NaN after certain iterations

I am trying to do a multiclass classification problem (containing 3 labels) with softmax regression.
This is my first rough implementation with gradient descent and back propagation (without regularization or any advanced optimization algorithm), containing only 1 layer.
Also, when the learning rate is big (>0.003) the cost becomes NaN; on decreasing the learning rate the cost function works fine.
Can anyone explain what I'm doing wrong?
# X is (13,177) dimensional
# y is (3,177) dimensional with label 0/1
m = X.shape[1] # 177
W = np.random.randn(3,X.shape[0])*0.01 # (3,13)
b = 0
cost = 0
alpha = 0.0001 # seems too small to me but for bigger values cost becomes NaN
for i in range(100):
    Z = np.dot(W,X) + b
    t = np.exp(Z)
    add = np.sum(t,axis=0)
    A = t/add
    loss = -np.multiply(y,np.log(A))
    cost += np.sum(loss)/m
    print('cost after iteration',i+1,'is',cost)
    dZ = A-y
    dW = np.dot(dZ,X.T)/m
    db = np.sum(dZ)/m
    W = W - alpha*dW
    b = b - alpha*db
This is what I get :
cost after iteration 1 is 6.661713420377916
cost after iteration 2 is 23.58974203186562
cost after iteration 3 is 52.75811642877174
... (up to 100 iterations) ...
cost after iteration 99 is 1413.555298639879
cost after iteration 100 is 1429.6533630169406
Well, after some time I figured it out.
First of all, the cost was increasing due to this:
cost += np.sum(loss)/m
The plus sign is not needed here, as it adds in all the cost computed in previous iterations, which is not what we want. That kind of accumulation is generally only needed in mini-batch gradient descent, to compute the cost over each epoch.
Secondly, the learning rate was too big for this problem, which is why the cost was overshooting the minimum value and becoming NaN.
I looked at my code and found that my features had very different ranges (one was from -1 to 1 and another was from -5000 to 5000), which was preventing the algorithm from using larger learning rates.
So I applied feature scaling:
var = np.var(X, axis=1, keepdims=True)  # keepdims so the (13,1) result broadcasts over the (13,177) features
X = X/var
Now the learning rate can be much bigger (<=0.001).
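Putting both fixes together, here is a minimal sketch of the corrected loop with random stand-in data; subtracting the column-wise max before the exp is an extra, common safeguard against overflow, not something the original code had:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(13, 177))                    # stand-in features, (13, 177)
y = np.eye(3)[rng.integers(0, 3, size=177)].T     # one-hot labels, (3, 177)

X = X / np.var(X, axis=1, keepdims=True)          # feature scaling

m = X.shape[1]
W = rng.normal(size=(3, X.shape[0])) * 0.01
b = 0.0
alpha = 0.001                                     # a larger rate is usable after scaling

for i in range(100):
    Z = np.dot(W, X) + b
    t = np.exp(Z - Z.max(axis=0))                 # numerically stable softmax (extra safeguard)
    A = t / t.sum(axis=0)
    cost = -np.sum(y * np.log(A)) / m             # recomputed each iteration, no +=
    print('cost after iteration', i + 1, 'is', cost)
    dZ = A - y
    W -= alpha * np.dot(dZ, X.T) / m
    b -= alpha * np.sum(dZ) / m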

Conv nets accuracy not changing while loss decreases

I'm training several CNNs to do image classification in TensorFlow. The training losses decrease normally, but the test accuracy never changes throughout the whole training procedure, and it is very low (0.014) when randomly guessing would give about 0.003 (there are around 300 classes). One thing I've noticed is that only the models I applied batch norm to show this weird behavior. What could possibly be wrong here? The training set has 80,000 samples, in case you suspect overfitting. Below is part of the code for evaluation:
Accuracy function:
correct_prediction = tf.equal(tf.argmax(Model(test_image), 1), tf.argmax(test_image_label, 0))
accuracy = tf.cast(correct_prediction, tf.float32)
test_image is a batch with only one sample in it, while test_image_label is a scalar.
Session:
with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord, start=True)
    print('variables initialized')
    step = 0
    for epoch in range(epochs):
        sess.run(enqueue_train)
        print('epoch: %d' % epoch)
        if epoch % 5 == 0:
            save_path = saver.save(sess, savedir + "/Model")
        for batch in range(num_batch):
            if step % 400 == 0:
                summary_str = cost_summary.eval(feed_dict={phase: True})
                file_writer.add_summary(summary_str, step)
            else:
                sess.run(train_step, feed_dict={phase: True})
            step += 1
        sess.run(train_close)
        sess.run(enqueue_test)
        accuracy_vector = []
        for num in range(len(testnames)):
            accuracy_vector.append(sess.run(accuracy, feed_dict={phase: False}))
        mean_accuracy = sess.run(tf.divide(tf.add_n(accuracy_vector), len(testnames)))
        print("test accuracy %g" % mean_accuracy)
        sess.run(test_close)
    save_path = saver.save(sess, savedir + "/Model_final")
    coord.request_stop()
    coord.join(threads)
    file_writer.close()
The phase placeholder above indicates whether the batch is for training or testing, for the batch norm layers.
Note that I also tried calculating the accuracy on the training set, the set the loss was minimized on, and it gives the same poor accuracy. Please help me, I really appreciate it!

Tensorflow: loss becomes 'NaN'

I was doing CIFAR-10 training on a CPU with TensorFlow. During the first few rounds the loss seemed alright, but after step 10210 the loss varies and ends up becoming NaN.
My network is the CIFAR-10 CNN model from their website. Here are my settings:
image_size = 32
num_channels = 3
num_classes = 10
num_batches_to_run = 50000
batch_size = 128
eval_batch_size = 64
initial_learning_rate = 0.1
learning_rate_decay_factor = 0.1
num_epochs_per_decay = 350.0
moving_average_decay = 0.9999
and the result is shown below.
2017-05-12 21:53:05.125242: step 10210, loss = 4.99 (124.9 examples/sec; 1.025 sec/batch)
2017-05-12 21:53:13.960001: step 10220, loss = 7.55 (139.5 examples/sec; 0.918 sec/batch)
2017-05-12 21:53:23.491228: step 10230, loss = 6.63 (149.5 examples/sec; 0.856 sec/batch)
2017-05-12 21:53:33.355805: step 10240, loss = 8.08 (113.3 examples/sec; 1.129 sec/batch)
2017-05-12 21:53:43.007007: step 10250, loss = 7.18 (126.7 examples/sec; 1.010 sec/batch)
2017-05-12 21:53:52.650118: step 10260, loss = 16.61 (138.0 examples/sec; 0.928 sec/batch)
2017-05-12 21:54:02.537279: step 10270, loss = 9.60 (137.6 examples/sec; 0.930 sec/batch)
2017-05-12 21:54:12.390117: step 10280, loss = 46526.25 (145.5 examples/sec; 0.880 sec/batch)
2017-05-12 21:54:22.060741: step 10290, loss = 133479743509972411931057146822656.00 (130.4 examples/sec; 0.982 sec/batch)
2017-05-12 21:54:31.691058: step 10300, loss = nan (115.8 examples/sec; 1.105 sec/batch)
Any idea about the NaN loss?
This happens a lot in practice when your learning rate is too high. I tend to start at 0.001 and move from there; 0.1 is on the very high side for most datasets, especially if you aren't dividing your loss by your batch size.
You can also clip the gradients. If you are using Keras with the TensorFlow backend, you can do it as follows:
The parameters clipnorm and clipvalue can be used with all optimizers to control gradient clipping:
from keras import optimizers
# All parameter gradients will be clipped to
# a maximum norm of 1.
sgd = optimizers.SGD(lr=0.01, clipnorm=1.)
or
from keras import optimizers
# All parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5.
sgd = optimizers.SGD(lr=0.01, clipvalue=0.5)
Your cross entropy loss might be taking log(0). Just add a small constant inside the log.
(You might also want to look into gradient clipping.)
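For example, in TF 1.x a hand-rolled cross entropy can be guarded like this (a sketch; probs and labels are hypothetical placeholders standing in for the softmax output and the one-hot targets):
import tensorflow as tf

probs = tf.placeholder(tf.float32, [None, 10])    # hypothetical softmax output
labels = tf.placeholder(tf.float32, [None, 10])   # hypothetical one-hot targets

eps = 1e-8  # small constant keeps the argument of log() away from 0
cross_entropy = -tf.reduce_mean(tf.reduce_sum(labels * tf.log(probs + eps), axis=1))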
