I am using a TensorFlow input pipeline like the one in the CIFAR-10 example model, and I am trying to use tf.cond to switch to validation data. I wrote something like this:
train_data = model.input(istrain=True)
val_data = model.input(istrain=False)

# This selects which stream to use.
select_val = tf.placeholder(dtype=bool, shape=[], name='select_test')
data = tf.cond(
    select_val,
    lambda: val_data,
    lambda: train_data
)
# Here is the model.
loss = ...
train_op = ...
...
with tf.Session():
...
If I delete the cond and just use the training data, the speed is 4000 samples/s; with the code above, it drops to 2300 samples/s. The validation pipeline's capacity is set very small so it should not take much GPU memory, and validation runs only rarely.
I'm not sure what is going wrong; please help me out.
tf.cond is not fully lazy. Any operation that is required by either branch of the cond will run even when the branch that needs it is not the one being executed. So in your case, both model.input(istrain=True) and model.input(istrain=False) are executed every time your data op is run; the result of one of them is simply discarded.
The documentation for cond gives a minimal code example:
Note that the conditional execution applies only to the operations
defined in fn1 and fn2. Consider the following simple program:
z = tf.multiply(a, b)
result = tf.cond(x < y, lambda: tf.add(x, z), lambda: tf.square(y))
If x < y, the tf.add operation will be executed and tf.square
operation will not be executed. Since z is needed for at least one
branch of the cond, the tf.multiply operation is always executed,
unconditionally. Although this behavior is consistent with the
dataflow model of TensorFlow, it has occasionally surprised some users
who expected a lazier semantics.
Also note that this means that if your model.input is pulling some set of data from a larger pool (say, a batch from an entire dataset), then each time the cond runs, data gets pulled from both the validation and the training pipelines, and one set is simply thrown away. In some cases this causes problems more serious than inefficiency: if you are supposed to process a fixed number of epochs, this code does not actually give you that many epochs of training, because batches were pulled and never used.
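Since the conditional execution applies only to the operations defined in fn1 and fn2, one possible workaround (a sketch, assuming model.input builds its read/dequeue ops when it is called) is to create each input pipeline inside its branch function rather than ahead of the cond:
data = tf.cond(
    select_val,
    lambda: model.input(istrain=False),  # validation input ops live inside this branch
    lambda: model.input(istrain=True),   # training input ops live inside this branch
)
Whether this removes the overhead depends on how model.input is implemented: if it starts background queue runners, those threads keep prefetching regardless of which branch runs, in which case it may be simpler to keep two separate data ops and choose which one to fetch in Python at sess.run time.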
Related
This sounds like a silly question, but it is really annoying. During model training, why do I sometimes get a different output or result (more specifically, a different accuracy or validation accuracy) when nothing in the model changes?
Let's say we build a model named A. We train it and get a result, something like acc = 60 and val_acc = 70. OK, fine.
Now train that same model another time (without closing the environment) and this time we get acc = 40 and val_acc = 20. Why? Nothing changes inside the model, no parameter, no hyperparameter, nothing at all. So why does it behave so differently?
You need to set the random seeds of your environment; then the program should be reproducible. An example of how this is done in Keras is shown here (https://machinelearningmastery.com/reproducible-results-neural-networks-keras/), but the same idea applies to every other library as well.
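A minimal sketch of the usual recipe (assuming a Keras/TensorFlow setup; adjust the calls to your library and versions):
import os
import random
import numpy as np
import tensorflow as tf

# Fix every source of randomness before building/compiling the model.
os.environ['PYTHONHASHSEED'] = '0'
random.seed(42)         # Python's built-in RNG
np.random.seed(42)      # NumPy (weight init, shuffling in some libraries)
tf.set_random_seed(42)  # TensorFlow graph-level seed (tf.random.set_seed in TF 2.x)
Note that some GPU operations are nondeterministic even with fixed seeds, so small run-to-run differences can remain.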
I'm working with CIFAR-10 and I use torchvision.datasets to create it. I need the GPU to accelerate the computation, but I can't find a way to put the whole dataset onto the GPU at once. My model needs mini-batches, and dealing with each batch separately is really time-consuming.
I've tried moving each mini-batch to the GPU separately, but it seems really slow.
TL;DR
You won't save time by moving the entire dataset at once.
I don't think you'd necessarily want to do that even if you have the GPU memory to handle the entire dataset (of course, CIFAR10 is tiny by today's standards).
I tried various batch sizes and timed the transfer to GPU as follows:
from time import time
from torch.utils.data import DataLoader
from matplotlib.pyplot import plot

# `dataset` is the torchvision CIFAR-10 dataset from the question.
num_workers = 1  # Set this as needed

def time_gpu_cast(batch_size=1):
    # Time how long it takes to move the whole dataset to the GPU, batch by batch.
    start_time = time()
    for x, y in DataLoader(dataset, batch_size, num_workers=num_workers):
        x.cuda(); y.cuda()
    return time() - start_time

# Try various batch sizes
cast_times = [(2 ** bs, time_gpu_cast(2 ** bs)) for bs in range(15)]
# Try the entire dataset like you want to do
cast_times.append((len(dataset), time_gpu_cast(len(dataset))))
plot(*zip(*cast_times))  # Plot the transfer time against batch size
For num_workers = 1, the resulting plot showed that the single whole-dataset transfer saved no time over reasonably sized batches.
With parallel loading (num_workers = 8), the result was even clearer.
I've got an answer and I'm gonna try it later. It seems promising.
You can write a dataset class where, in the __init__ function, you read the entire dataset, apply all the transformations you need, and convert it to tensor format. Then send this tensor to the GPU (assuming there is enough memory). In __getitem__ you can then simply use the index to retrieve elements of the tensor that is already on the GPU.
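A minimal sketch of that idea (the class name and device string are illustrative, and it assumes the base dataset already applies a ToTensor-style transform):
import torch
from torch.utils.data import Dataset

class GPUDataset(Dataset):
    def __init__(self, base_dataset, device='cuda'):
        # Read and transform everything once, then keep it resident on the GPU.
        xs, ys = zip(*[base_dataset[i] for i in range(len(base_dataset))])
        self.x = torch.stack(xs).to(device)
        self.y = torch.tensor(ys).to(device)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # Indexing returns slices of tensors that already live on the GPU.
        return self.x[idx], self.y[idx]
If you wrap this in a DataLoader, keep num_workers=0 and pin_memory=False, since the data is already on the GPU.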
I want to use tf.metrics.accuracy to track the accuracy of my predictions, but I am unsure of how to use the update_op (acc_update_op below) that the function returns:
accuracy, acc_update_op = tf.metrics.accuracy(labels, predictions)
I was thinking that adding it to tf.GraphKeys.UPDATE_OPS would make sense, but I am not sure how to do this.
tf.metrics.accuracy is one of the many streaming metric operations in TensorFlow (another is tf.metrics.recall). Upon creation, two local variables (total and count) are created to accumulate all incoming results into one final outcome. The first returned value is a tensor for the calculation total / count. The second returned value is an op that updates these variables. Streaming metric functions are useful when evaluating the performance of a classifier over multiple batches of data. A quick example of use:
# building phase
with tf.name_scope("streaming"):
    accuracy, acc_update_op = tf.metrics.accuracy(labels, predictions)
    test_fetches = {
        'accuracy': accuracy,
        'acc_op': acc_update_op
    }

# when testing the classifier:
# clear the metric's counters (local variables) for a fresh evaluation
sess.run(tf.local_variables_initializer())
for _i in range(n_batches_in_test):
    fd = get_test_batch()
    outputs = sess.run(test_fetches, feed_dict=fd)
print("Accuracy:", outputs['accuracy'])
I was thinking that adding it to tf.GraphKeys.UPDATE_OPS would make sense, but I am not sure how to do this.
That would not be a good idea unless you are using the UPDATE_OPS collection only for testing purposes. Usually, the collection already holds control operations for the training phase (such as updating the moving averages of batch normalization) that are not meant to run during validation. It may be best to either keep the metric update ops in a new collection or add them to the fetch dictionary manually.
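Building on the snippet above, a small sketch of the custom-collection option (the collection name 'metric_update_ops' is just an example):
# building phase: register the metric update op in its own collection
tf.add_to_collection('metric_update_ops', acc_update_op)

# testing phase: fetch and run everything in that collection
metric_updates = tf.get_collection('metric_update_ops')
outputs = sess.run({'accuracy': accuracy, 'updates': metric_updates}, feed_dict=fd)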
I want to add an extra operation before running the AdamOptimizer operation on my loss, so as to help the model deal with repetitions in my data. The relevant code snippet looks something like this:
loss = tf.nn.softmax_cross_entropy_with_logits(logits=predLogits, labels=actLabels)
loss = tf.reshape(loss, [batchsize, -1])
repMask = tf.sqrt(tf.cast(tf.abs(tf.subtract(tf.cast(Y, tf.int64), tf.cast(X, tf.int64))), tf.float32))
lossPost = loss - repMask
train_step = tf.train.AdamOptimizer(LR).minimize(lossPost)
So, in other words, instead of minimizing loss, I want AdamOptimizer to minimize its slightly tweaked version, which is lossPost. I then train the model in the usual way:
_ = sess.run([train_step], feed_dict=feed_dict)
I noticed that adding this workaround of minimizing lossPost instead of loss has no impact on the accuracy of the model. The model produces the exact same output with or without this workaround. It seems that it continues to optimize the original, unmodified loss. Why is this the case?
My original approach was to perform this tweak at the softmax_cross_entropy_with_logits step by using weighted_cross_entropy_with_logits instead, but there is an extra complication there, since there is an extra vocabulary dimension (this is a character-level model). So I thought it would be easier to do it afterwards; as long as it happens before the optimization step, it should be doable, right?
In your model it seems like X and Y are constants (that is, they depend only on the data). In this case repMask is also constant, as it is defined by
repMask = tf.sqrt(tf.cast(tf.abs(tf.subtract(tf.cast(Y, tf.int64), tf.cast(X, tf.int64))), tf.float32))
Hence loss and lossPost differ by a constant, and that has no effect on the minimization process (it is like finding the x that minimizes x^2 - 1 versus the x that minimizes x^2 - 5: both are the same x). The optimizer follows the gradient, and the gradient of a constant is zero.
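A minimal sketch illustrating the point (the variable and constant here are just for illustration): shifting a loss by a constant leaves its gradient, and therefore every optimizer step, unchanged.
import tensorflow as tf

w = tf.Variable(3.0)
f = tf.square(w)
g_plain, = tf.gradients(f, [w])          # d/dw (w^2)     = 2w
g_shifted, = tf.gradients(f - 5.0, [w])  # d/dw (w^2 - 5) = 2w as well

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([g_plain, g_shifted]))  # [6.0, 6.0]
If the goal is to down-weight repeated samples, a multiplicative mask (for example loss * repMask) would actually change the gradients, unlike a constant subtraction.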
I'm training a multi-objective neural net in TensorFlow with my own loss function and can't find documentation regarding how batching interacts with that functionality.
For example, here is a snippet of my loss function, which takes the tensor of predictions and makes sure that their absolute values sum to no more than one:
def fitness(predictions, actual):
    absTensor = tf.abs(predictions)
    sumTensor = tf.reduce_sum(absTensor)
    oneTensor = tf.constant(1.0)
    isGTOne = tf.greater(sumTensor, oneTensor)

    def norm(): return predictions / sumTensor
    def unchanged(): return predictions

    predictions = tf.cond(isGTOne, norm, unchanged)
    etc...
But when I'm passing in a batch of estimates I feel like this loss function is normalising the whole set of inputs to sum to 1 at this point, rather than each individual set summing to 1. I.e.
[[.8,.8],[.8,.8]] -> [[.25,.25],[.25,.25]]
rather than the desired
[[.8,.8],[.8,.8]] -> [[.5,.5],[.5,.5]]
Can anybody clarify or put to rest my suspicions? If this is how my function is currently working, how do I change that?
You must specify a reduction axis for reduction ops; otherwise, all axes are reduced and you end up with a single scalar for the whole batch. Since each row of your batch is one set of predictions, you want to sum over the last axis and keep that axis so the result broadcasts back against predictions. So the sumTensor line should look like this:
sumTensor = tf.reduce_sum(absTensor, 1, keep_dims=True)
After you make that change you will run into another problem: sumTensor is no longer a scalar, so it no longer makes sense as the condition for tf.cond (what would it even mean to branch per entry of a batch?). What you really want is tf.select (called tf.where in TensorFlow 1.0 and later), since you don't actually want to branch logic per batch entry. Because its condition must be either the full shape of predictions or a vector with one entry per row, squeeze the kept axis out of the comparison:
isGTOne = tf.squeeze(tf.greater(sumTensor, oneTensor), [1])
norm = predictions / sumTensor
predictions = tf.select(isGTOne, norm, predictions)
But looking at this now, I wouldn't even bother normalizing conditionally. Since you are operating at the granularity of a batch anyway, there is little to gain from skipping the division for entries that already sum to no more than one; the division is cheap and has no side effects. Might as well just do:
def fitness(predictions, actual):
    absTensor = tf.abs(predictions)
    sumTensor = tf.reduce_sum(absTensor, 1, keep_dims=True)
    predictions = predictions / sumTensor
    etc...
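As a quick sanity check (a sketch using the example from the question), the per-row reduction gives the desired output:
import tensorflow as tf

predictions = tf.constant([[.8, .8], [.8, .8]])
sums = tf.reduce_sum(tf.abs(predictions), 1, keep_dims=True)  # [[1.6], [1.6]]
normalized = predictions / sums

with tf.Session() as sess:
    print(sess.run(normalized))  # [[0.5, 0.5], [0.5, 0.5]]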
Hope that helps!