Possible incorrect usage of custom eval_metric in MXNet - machine-learning

I am working on a problem that I am trying to solve using MXNet, and I want to use a custom metric in the code.
The code for the same is:
def calculate_sales_from_bucket(bucketArray):
    return numpy.asarray(numpy.power(10, calculate_max_index_from_bucket(bucketArray)))

def calculate_max_index_from_bucket(bucketArray):
    answerArray = []
    for bucketValue in bucketArray:
        index, value = max(enumerate(bucketValue), key=operator.itemgetter(1))
        answerArray.append(index)
    return answerArray

def custom_metric(label, bucketArray):
    return numpy.mean(numpy.power(calculate_sales_from_bucket(label) - calculate_sales_from_bucket(bucketArray), 2))

model.fit(
    train_iter,                                                     # training data
    eval_data=val_iter,                                             # validation data
    batch_end_callback=mx.callback.Speedometer(batch_size, 1000),   # output progress for each 1000 data batches
    num_epoch=10,                                                   # number of data passes for training
    optimizer='adam',
    eval_metric=mx.metric.create(custom_metric),
    optimizer_params=(('learning_rate', 1),)
)
I am getting the output as:
INFO:root:Epoch[0] Validation-custom_metric=38263835679935.953125
INFO:root:Epoch[1] Batch [1000] Speed: 91353.72 samples/sec Train-custom_metric=39460550891.057487
INFO:root:Epoch[1] Batch [2000] Speed: 96233.05 samples/sec Train-custom_metric=9483.127650
INFO:root:Epoch[1] Batch [3000] Speed: 90828.09 samples/sec Train-custom_metric=57538.891485
INFO:root:Epoch[1] Batch [4000] Speed: 93025.54 samples/sec Train-custom_metric=59861.927745
INFO:root:Epoch[1] Train-custom_metric=8351.460495
INFO:root:Epoch[1] Time cost=9.466
INFO:root:Epoch[1] Validation-custom_metric=38268.250469
INFO:root:Epoch[2] Batch [1000] Speed: 94028.96 samples/sec Train-custom_metric=58864.659051
INFO:root:Epoch[2] Batch [2000] Speed: 94562.38 samples/sec Train-custom_metric=9482.873310
INFO:root:Epoch[2] Batch [3000] Speed: 93198.68 samples/sec Train-custom_metric=57538.891485
INFO:root:Epoch[2] Batch [4000] Speed: 93722.89 samples/sec Train-custom_metric=59861.927745
INFO:root:Epoch[2] Train-custom_metric=8351.460495
INFO:root:Epoch[2] Time cost=9.341
INFO:root:Epoch[2] Validation-custom_metric=38268.250469
In this case, even though Train-custom_metric varies from batch to batch, it repeats the same values across epochs, for example for batch 1000 in epoch 1 and epoch 2.
I believe this is an issue, as Train-custom_metric and Validation-custom_metric do not change as the epochs progress.
I am a beginner in MXNet and I might be wrong in this assumption.
Can you confirm whether I am passing eval_metric in the correct way?

Not sure I understand the problem. Your output shows Train-custom_metric giving different values; it just happens to have produced the same result for the last two batches of each epoch. That may just be a quirk of how your model is converging.
One thing to be clear on is that eval_metric is only used to give debug output -- it's not actually used as the loss function during learning:
https://github.com/apache/incubator-mxnet/issues/1915
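For reference, here is a minimal sketch (not the asker's exact metric, and assuming the old Module API, where mx.metric.create accepts a plain Python function) of how a custom metric is wired up. The function receives NumPy arrays (label, pred) and returns a single number that MXNet only reports; it never drives the gradients:
import numpy as np
import mxnet as mx

def bucket_sales_mse(label, pred):
    # argmax recovers the bucket index; 10**index converts it back to a sales value
    label_sales = np.power(10.0, np.argmax(label, axis=1))
    pred_sales = np.power(10.0, np.argmax(pred, axis=1))
    return np.mean((label_sales - pred_sales) ** 2)

metric = mx.metric.create(bucket_sales_mse)   # same call as in the question
# model.fit(..., eval_metric=metric) would then report this value per batch/epoch only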

Related

LSTM regression model flat prediction

This is a time series regression problem with battery capacity as the output and a single input variable, voltage; the relation is non-linear.
The LSTM model's prediction on the test data is always a semi-flat line, probably close to the mean of the output variable in the training data.
This is an example of predicted vs. test set output values, with the following model parameters:
(window size: 10, batch size: 256, LSTM nodes: 16)
[Image: Prediction of the test data]
The data had been normalized and down-sampled to 1 s and later to 3 s; the original sampling rate was 10 Hz.
I suspected the voltage fluctuation was the problem, but sampling at 3 seconds didn't give a noticeable improvement.
Here are the data after being down-sampled to 3 seconds:
[Image: Normalized training data; Y: SOC, X: voltage]
[Image: Normalized test data; Y: SOC, X: voltage]
I've tried many changes to the model and learning parameters, as listed below, but the behavior is still the same.
That's why I think it's not a parameter tuning issue; rather, the model is not learning at all.
LSTM layer: always single, followed by a Dense layer with no options
LSTM nodes: [4, 8, 16, 32]
Epochs: [16, 32, 64, 128]
Window size (input vector depth): [8, 32, 64, 128]
Batch size: [32, 64, 128, 256]
Learning rate: [0.0005, 0.0001, 0.001]
Optimizer: Adam, options: [none, clipnorm=1, clipvalue=0.5]
Model specification Code:
backend.clear_session()
model1 = Sequential()
model1.add(LSTM(16,input_shape=(win_sz, features_cnt) )) # stateless
model1.add(layers.Dense(1))
model1.summary()
Model training and validation code:
n_epochs = 12
iterations = tr_samples_sh_cnt // batch_sz_tr
loss = tf.keras.losses.MeanAbsoluteError()
optimizer = tf.optimizers.Adam(learning_rate=0.001)
loss_history = []

#tf.function
def train_model_on_batch():
    start = epoch * batch_sz_tr
    X_batch = df_feat_tr_3D[start:start+batch_sz_tr, :, :]
    y_batch = df_SOC_tr_2D[start:start+batch_sz_tr, :]
    with tf.GradientTape() as tape:
        current_loss = loss(model1(X_batch), y_batch)
    gradients = tape.gradient(current_loss, model1.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model1.trainable_variables))
    return current_loss

for epoch in range(n_epochs+1):
    for iteration in range(iterations):
        current_loss = train_model_on_batch()
    if epoch % 1 == 0:
        loss_history.append(current_loss.numpy())
        print("{}. \t\tLoss: {}".format(epoch, loss_history[-1]))
print('\nTraining complete.')
P_test = model1.predict(df_feat_test_3D)
After adding a sigmoid activation function to both the LSTM and Dense layers, a very small change was observed, but still far from a reasonable fit.
[Image: Prediction of the test data after adding the activation function]
The problem was the activation function, as @Dr. Snoopy recommended.
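For reference, a minimal sketch of where the activations would go, assuming the same single-LSTM architecture; win_sz and features_cnt are placeholder values standing in for the question's window size and feature count:
from tensorflow.keras import Sequential, layers

win_sz, features_cnt = 10, 1   # placeholders mirroring the question's setup

model1 = Sequential()
model1.add(layers.LSTM(16, activation='sigmoid', input_shape=(win_sz, features_cnt)))
model1.add(layers.Dense(1, activation='sigmoid'))   # bounded output for normalized SOC
model1.summary()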

Pytorch: Calculating running time on GPU and CPU of a for loop

I am really new to PyTorch, and I have been confused all day trying to figure out why my NN runs slower on GPU than on CPU. I don't understand why, when I measure the running time using time.time(), the time of the whole loop is so different from the sum of the individual running times. Here is part of my code. Could anybody help me? Appreciate it!
time_out = 0
time_in = 0
for epoch in tqdm(range(self.n_epoch)):
    running_loss = 0
    running_error = 0
    running_acc = 0
    if self.cuda:
        torch.cuda.synchronize()  # time_out_start
    epst1 = time.time()
    for step, (batch_x, batch_y) in enumerate(self.normal_loader):
        if self.cuda:
            torch.cuda.synchronize()  # time_in_start
        t1 = time.time()
        batch_x, batch_y = batch_x.to(self.device), batch_y.to(self.device)
        b_x = Variable(batch_x)
        b_y = Variable(batch_y)
        pred_y = self.model(b_x)
        #print (pred_y)
        loss = self.criterion(pred_y, b_y)
        error = mae(pred_y.detach().cpu().numpy(), b_y.detach().cpu().numpy())
        acc = r2(b_y.detach().cpu().numpy(), pred_y.detach().cpu().numpy())
        #print (loss)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        running_acc += acc
        running_loss += loss.item()
        running_error += error
        if self.cuda:
            torch.cuda.synchronize()  # time_in_end
        t6 = time.time()
        time_in += t6 - t1
    if self.cuda:
        torch.cuda.synchronize()  # time_out_end
    eped1 = time.time()
    time_out += eped1 - epst1
print('loop time (out):', time_out)
print('loop time (in):', time_in)
The result is:
CPU:
EPOCH 10:  out: 1.283 s   in: 0.695 s
EPOCH 50:  out: 6.43 s    in: 3.288 s
EPOCH 100: out: 12.646 s  in: 6.386 s
GPU:
EPOCH 10:  out: 3.92 s    in: 1.471 s
EPOCH 50:  out: 9.35 s    in: 3.04 s
EPOCH 100: out: 18.418 s  in: 5.655 s
I understand that transferring data from CPU to GPU costs some time, so as the number of epochs grows, the GPU's computation time should become less than the CPU's. My questions are:
Why is the time I record outside the loop so different from the time recorded inside it? Is there a step I missed when recording the running time?
And why does the GPU cost more outside-time even when its inside-time is less than the CPU's?
The network is really simple:
class Model(nn.Module):
    def __init__(self, n_input, n_nodes1, n_nodes2):
        super(Model, self).__init__()
        self.n_input = n_input
        self.n_nodes1 = n_nodes1
        self.n_nodes2 = n_nodes2
        self.l1 = nn.Linear(self.n_input, self.n_nodes1)
        self.l2 = nn.Linear(self.n_nodes1, self.n_nodes2)
        self.l3 = nn.Linear(self.n_nodes2, 1)

    def forward(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        h = self.l3(h2)
        return h
The training data is formed like this (a regression problem; the inputs x are descriptors and y is the target value):
def load_train_normal(self, x, y, batch_size=100):
    if batch_size:
        self.batch_size = batch_size
    self.x_train_n, self.y_train_n = Variable(torch.from_numpy(x).float()), Variable(torch.from_numpy(y).float())
    #x, y = Variable(torch.from_numpy(x).float()), Variable(torch.from_numpy(y).float())
    self.dataset = Data.TensorDataset(self.x_train_n, self.y_train_n)
    self.normal_loader = Data.DataLoader(
        dataset=self.dataset,
        batch_size=self.batch_size,
        shuffle=True,
        num_workers=2,
    )
Why is the time I record outside the loop so different from the time recorded inside it? Is there a step I missed when recording the running time?
self.normal_loader is not just a plain dictionary, vector, or something similarly simple. Iterating over it takes a significant amount of time.
And why does the GPU cost more outside-time even when its inside-time is less than the CPU's?
torch.cuda.synchronize() is a heavy operation, and here it isn't even doing anything useful, because pred_y.detach().cpu() has already forced a synchronization.
As for how to make it faster: drop the synchronize() calls; they don't do you any good.
Then defer the processing of pred_y until later, much later. You want to have called the model at least 2 or 3 times before you trigger the first download of results. The simpler the model and the smaller the data, the more iterations you have to wait.
That's because transfers to and from the GPU don't just "take time"; they imply synchronization. Without synchronization, execution on the GPU mostly "lags behind": data uploads to the GPU are already asynchronous behind the scenes, and the actual execution is only queued behind them. If you don't synchronize, by accident or explicitly, workloads start to overlap and everything (uploads, execution, CPU work) runs in parallel. Your effective execution time approaches max(upload, download, GPU execution, CPU execution).
If you synchronize, there are no tasks to overlap and no batches to form from same-typed tasks. Upload, execution, download, CPU part: it all happens sequentially. Your execution time ends up being upload + download + GPU execution + CPU execution, plus some additional overhead for breaking batching at the driver level. So it's easily 5-10x slower than it should be.
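As an illustration only (a hypothetical train_epoch helper, not the asker's code), deferring the device-to-host transfer could look like this: the per-batch losses stay on the GPU and are only downloaded every log_every steps, so uploads, kernels, and CPU work are free to overlap.
import torch

def train_epoch(model, loader, criterion, optimizer, device, log_every=50):
    pending_losses = []                       # per-batch losses, kept on the GPU
    for step, (batch_x, batch_y) in enumerate(loader):
        batch_x = batch_x.to(device, non_blocking=True)
        batch_y = batch_y.to(device, non_blocking=True)
        pred_y = model(batch_x)
        loss = criterion(pred_y, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        pending_losses.append(loss.detach())  # no .cpu()/.item() here, so no forced sync
        if (step + 1) % log_every == 0:
            # this is the only point that forces a device-to-host transfer
            mean_loss = torch.stack(pending_losses).mean().item()
            print(f"step {step + 1}: mean loss {mean_loss:.4f}")
            pending_losses.clear()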

Slow training of RL algorith and low cpu usage

I have a Python 3.6.4 script, which is a simple reinforcement learning script based on a 4-layer MLP (input, hidden-128, hidden-256, output). I can't publish my whole code here because it is too big, so I will post only a small part of it.
def train(model, epochs):
    total = 0
    start = time.time()
    entire_hist = []
    profits = []
    for i in range(epochs):
        loss = 0
        accuracy = 0
        hist = []
        env.reset()
        game_over = False
        input_t = env.observe()
        while not game_over:
            input_tm1 = input_t
            if np.random.random() <= epsilon:
                action = np.random.randint(0, num_actions, size=None)
            else:
                q = model.predict(input_tm1)
                action = np.argmax(q[0])
            input_t, reward, game_over = env.act(action)
            exp_replay.remember([input_tm1, action, reward, input_t], game_over)
            inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)
            loss, accuracy = model.train_on_batch(inputs, targets)
            hist.append([loss, accuracy, env.main])
            print(f'counter: {env.counter}, action taken: {action}, reward: {round(reward, 2)}, main: {round(env.main)}, secondary: {env.secondary}')
            if game_over:
                print('GAME OVER!')
        entire_hist.append(hist)
        profits.append(total)
        print(f'total profit: {env.total_profit}')
        print(f'epoch: {i}, loss: {loss}, accuracy: {accuracy}')
        print('\n')
        print('*'*20)
    end = int(time.time() - start)
    print(f'training time: {end} seconds')
    return entire_hist, total
The problem is that when I run it, CPU usage is only about 20-30% and GPU usage is about 5%. I tried running it on different machines and got similar results: the more powerful the CPU, the smaller the percentage of it the script uses.
That would be fine, except it takes a few days to train even such a small network when running it for 1000-5000 epochs. Can someone help me make it train faster and increase the CPU usage?
I tried running it on both the CPU and GPU versions of TensorFlow.
My setup:
Latest Keras with the TensorFlow backend
Tensorflow-gpu==1.4.0
Actually, it was quite simple: I was not using all of my cores. Since my system was distributing the load from one core across all 4, I didn't notice that.
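The answer doesn't show the exact change, but as a hedged illustration, one common way to let a TF 1.x / standalone-Keras setup like the one listed above use all CPU cores is to size the session's thread pools explicitly:
import tensorflow as tf
from keras import backend as K

# illustrative thread counts for a 4-core machine
config = tf.ConfigProto(intra_op_parallelism_threads=4,
                        inter_op_parallelism_threads=4,
                        allow_soft_placement=True)
K.set_session(tf.Session(config=config))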

How to get loss function history using tf.contrib.opt.ScipyOptimizerInterface

I need to get the loss history over time so I can plot it in a graph.
Here is the skeleton of my code:
optimizer = tf.contrib.opt.ScipyOptimizerInterface(
    loss, method='L-BFGS-B',
    options={'maxiter': args.max_iterations, 'disp': print_iterations})
optimizer.minimize(sess, loss_callback=append_loss_history)
With append_loss_history definition:
def append_loss_history(**kwargs):
    global step
    if step % 50 == 0:
        loss_history.append(loss.eval())
    step += 1
When I look at the verbose output of ScipyOptimizerInterface, the loss actually decreases over time.
But when I print loss_history, the losses are nearly the same over time.
Referring to the docs:
"Variables subject to optimization are updated in-place AT THE END OF OPTIMIZATION"
https://www.tensorflow.org/api_docs/python/tf/contrib/opt/ScipyOptimizerInterface. Is that the reason the loss appears unchanged?
I think you have the problem down; the variables themselves are not modified until the end of the optimization (they are instead fed into session.run calls), and evaluating a "back channel" Tensor gets the unmodified variables. Instead, use the fetches argument of optimizer.minimize to piggyback on the session.run calls that already have the feeds specified:
import tensorflow as tf

def print_loss(loss_evaled, vector_evaled):
    print(loss_evaled, vector_evaled)

vector = tf.Variable([7., 7.], 'vector')
loss = tf.reduce_sum(tf.square(vector))
optimizer = tf.contrib.opt.ScipyOptimizerInterface(
    loss, method='L-BFGS-B',
    options={'maxiter': 100})

with tf.Session() as session:
    tf.global_variables_initializer().run()
    optimizer.minimize(session,
                       loss_callback=print_loss,
                       fetches=[loss, vector])
    print(vector.eval())
(Modified from the example in the documentation). This prints Tensors with the updated values:
98.0 [ 7. 7.]
79.201 [ 6.29289341 6.29289341]
7.14396e-12 [ -1.88996808e-06 -1.88996808e-06]
[ -1.88996808e-06 -1.88996808e-06]
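Applied back to the asker's skeleton, a hedged sketch (assuming sess, loss, and the optimizer are defined as in the question) would let the callback receive the fetched loss value instead of calling loss.eval():
loss_history = []
step = 0

def append_loss_history(loss_evaled):
    # loss_evaled is the value fetched during the optimizer's own session.run call
    global step
    if step % 50 == 0:
        loss_history.append(loss_evaled)
    step += 1

optimizer.minimize(sess, fetches=[loss], loss_callback=append_loss_history)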

theano - how to have many of the same function

The input is a variable-sized array. I can only process a given example one sample at a time in train_model. I want to accumulate the sum of the objectives for the elements in the batch and then apply regularization and gradient descent.
Currently, this is the training stage where updates are done for each element xi.
for epoch in range(n_epochs):
    minibatch_avg_cost = 0
    for xi in dataset.get_next_xi(batch_size):
        minibatch_avg_cost += train_model(xi)
    print(minibatch_avg_cost)
How can I get results from train_model(xi) for the number of elements in the batch and then do the updates?
Just use all the elements from dataset.get_next_xi(batch_size) as input and create a Theano function that calculates the average cost (instead of only one cost), then do the updates using that average cost. You can see the example code here.
They use a Theano function for the train model like this:
train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
where cost is the average cost over the dataset batch.
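To make that concrete, here is a small self-contained sketch (a hypothetical logistic-regression model in the spirit of the Theano tutorials, not the asker's model) where the cost is the mean over the whole batch and one update is applied per batch:
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')        # one row per sample in the batch
y = T.ivector('y')       # one integer label per sample

W = theano.shared(np.zeros((5, 3), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(3, dtype=theano.config.floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
# negative log-likelihood, averaged over the batch: a single scalar cost
cost = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

g_W, g_b = T.grad(cost, [W, b])
updates = [(W, W - 0.1 * g_W), (b, b - 0.1 * g_b)]

# one call now consumes a whole batch and applies one update from the average cost
train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates)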
