When training an RNN model using the ResetStatesCallback below, I get the following warning message:
/var/venv/DSTL/lib/python3.4/site-packages/keras/callbacks.py:97: UserWarning: Method on_batch_begin() is slow compared to the batch update (0.791834). Check your callbacks.
  % delta_t_median)
from keras.callbacks import Callback

# Reset states every RESET_STATES_LENGTH batches
# RESET_STATES_LENGTH = 8
class ResetStatesCallback(Callback):
    def __init__(self):
        self.counter = 0

    def on_batch_begin(self, batch, logs={}):
        if self.counter % RESET_STATES_LENGTH == 0:
            self.model.reset_states()
        self.counter += 1
Why do I get this message? Should I try to fix it? Does it really slow down my training that much?
See https://github.com/fchollet/keras/issues/5008 for an explanation. It is stated there that
You are running something like saving the model or rendering images after each batch and it is taking longer than the batches themselves.
So it would seem that at runtime Keras has determined that your callback is slower than the batch itself.
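One way to check whether the reset itself is the culprit is to time it inside the callback and compare it against the per-batch time reported in the warning. The sketch below is my own (the class name and the reset_every parameter are not from the question); it only adds timing around reset_states().

import time
from keras.callbacks import Callback

class TimedResetStatesCallback(Callback):
    """Hypothetical variant that reports how long reset_states() takes,
    so it can be compared with the batch time Keras is measuring."""
    def __init__(self, reset_every=8):
        self.counter = 0
        self.reset_every = reset_every

    def on_batch_begin(self, batch, logs=None):
        if self.counter % self.reset_every == 0:
            start = time.time()
            self.model.reset_states()
            print("reset_states() took %.4f s" % (time.time() - start))
        self.counter += 1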
I'm reading over the implementation of the dask-lightgbm estimators (specifically, the _train_part function in dask_lightgbm/core.py), and I'm failing to see how the entirety of the training set gets used to fit the final estimator.
The _train_part function accepts the boolean argument return_model, and in the implementation of the train function (which uses client.submit to call _train_part on each worker), return_model is only true when the worker is the "master_worker" (which itself appears to be a randomly chosen Dask worker). Logically, each worker gets dispatched 1/n of the overall training set, where n is the total number of workers, and then each worker trains its own independent model on its own subset of the training data. The return_model parameter controls whether each worker's model gets returned by _train_part, so _train_part returns None for every worker, and therefore every model, except one.
Code:
def _train_part(params, model_factory, list_of_parts, worker_addresses, return_model,
                local_listen_port=12400, time_out=120, **kwargs):
    network_params = build_network_params(worker_addresses, get_worker().address,
                                          local_listen_port, time_out)
    params.update(network_params)

    # Concatenate many parts into one
    parts = tuple(zip(*list_of_parts))
    data = concat(parts[0])
    label = concat(parts[1])
    weight = concat(parts[2]) if len(parts) == 3 else None

    try:
        model = model_factory(**params)
        model.fit(data, label, sample_weight=weight)
    finally:
        _safe_call(_LIB.LGBM_NetworkFree())

    return model if return_model else None
Is this not equivalent to training a non-distributed version of a lightgbm estimator on a 1/n subsample of the training set? Am I missing something? I feel like I am missing a part where either the workers' independent models get combined into one, or where a single estimator is getting updated with the individual trees learned by separate workers.
Thank you!
Ah, the answer is yes: dask_lightgbm uses all available training samples. Dask's responsibility is only to distribute the data across workers; LightGBM handles all of the distributed learning once its network parameters are set. It's not that each worker trains its own independent model. LightGBM trains a single model, and each worker ends up with a copy of it. For this reason, only the chosen worker returns the fitted estimator, and everyone else returns None.
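As a rough illustration (my own sketch, not the actual build_network_params implementation), this is the kind of configuration each worker hands to LightGBM for socket-based distributed training. machines, local_listen_port, num_machines, and tree_learner are real LightGBM parameters; the addresses are made up.

import lightgbm as lgb

# Hypothetical network setup for a two-worker cluster; in dask_lightgbm these
# values are derived from the Dask worker addresses by build_network_params.
params = {
    "machines": "10.0.0.1:12400,10.0.0.2:12401",  # every worker's host:port
    "local_listen_port": 12400,                   # this worker's own port
    "num_machines": 2,
    "tree_learner": "data",                       # data-parallel distributed learning
}

# Each worker fits on its own shard, but because the network parameters are set,
# the workers cooperate to grow one shared set of trees.
model = lgb.LGBMClassifier(**params)
# model.fit(local_data, local_labels)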
I'm dealing with CIFAR10 and I use torchvision.datasets to create it. I need a GPU to accelerate the computation, but I can't find a way to put the whole dataset onto the GPU at one time. My model needs to use mini-batches, and it is really time-consuming to deal with each batch separately.
I've tried to put each mini-batch onto the GPU separately, but it seems really time-consuming.
TL;DR
You won't save time by moving the entire dataset at once.
I don't think you'd necessarily want to do that even if you have the GPU memory to handle the entire dataset (of course, CIFAR10 is tiny by today's standards).
I tried various batch sizes and timed the transfer to GPU as follows:
from time import time
from torch.utils.data import DataLoader
from matplotlib.pyplot import plot

num_workers = 1  # Set this as needed

def time_gpu_cast(batch_size=1):
    start_time = time()
    # `dataset` is the CIFAR10 dataset from the question
    for x, y in DataLoader(dataset, batch_size, num_workers=num_workers):
        x.cuda()
        y.cuda()
    return time() - start_time

# Try various batch sizes
cast_times = [(2 ** bs, time_gpu_cast(2 ** bs)) for bs in range(15)]

# Try the entire dataset like you want to do
cast_times.append((len(dataset), time_gpu_cast(len(dataset))))

plot(*zip(*cast_times))  # Plot the time taken
For num_workers = 1, this is what I got:
And if we try parallel loading (num_workers = 8), it becomes even clearer:
I've got an answer and I'm gonna try it later. It seems promising.
You can write a dataset class where, in the __init__ method, you read the entire dataset, apply all the transformations you need, and convert everything to tensors. Then send those tensors to the GPU (assuming there is enough memory). Then, in __getitem__, you can simply use the index to retrieve elements of the tensor that is already on the GPU, as in the sketch below.
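A minimal sketch of that idea for CIFAR10 (the class name GPUCifar10 and the ToTensor-only preprocessing are my own choices, not part of the answer above):

import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Dataset

class GPUCifar10(Dataset):
    def __init__(self, root="./data", train=True, device="cuda"):
        base = torchvision.datasets.CIFAR10(root=root, train=train, download=True,
                                            transform=transforms.ToTensor())
        # Stack all images/labels into single tensors and push them to the GPU once.
        self.images = torch.stack([img for img, _ in base]).to(device)
        self.labels = torch.tensor(base.targets, device=device)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Both tensors already live on the GPU, so no per-batch host-to-device copy.
        return self.images[idx], self.labels[idx]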
I'm currently experimenting with the new high-level tf.contrib.learn APIs in TensorFlow by porting over features from the inception-v3 retrain script from this tutorial.
How do I replicate the minibatch sampling at every iteration for both the validation and training inputs, as seen in the original retrain.py?
Currently I am trying tf.train.shuffle_batch in the input_fn function, but I'm not certain that it works.
Here is a code snippet for clarity:
def train_input_fn():
    # Get a batch of input bottleneck values, either calculated fresh every
    # time with distortions applied, or from the cache stored on disk.
    if do_distort_images:
        (train_bottleneck_outputs, train_ground_truths) = get_random_distorted_bottlenecks(
            sess, image_lists, -1, 'training',
            FLAGS.image_dir, distorted_jpeg_data_tensor,
            distorted_image_tensor, resized_image_tensor, bottleneck_tensor)
    else:
        (train_bottleneck_outputs, train_ground_truths, _) = get_random_cached_bottlenecks(
            sess, image_lists, -1, 'training',
            FLAGS.bottleneck_dir, FLAGS.image_dir, jpeg_data_tensor,
            bottleneck_tensor)
    return tf.train.shuffle_batch([tf.constant(train_bottleneck_outputs),
                                   tf.constant(train_ground_truths)],
                                  batch_size=FLAGS.train_batch_size, capacity=1100,
                                  min_after_dequeue=1000, enqueue_many=True, num_threads=2)
I am using TensorFlow input pipelines like the CIFAR-10 model in TensorFlow, and I am trying to use tf.cond to do validation. I wrote something like this:
train_data = model.input(istrain=True)
val_data = model.input(istrain=False)

# This selects which stream to use.
select_val = tf.placeholder(dtype=bool, shape=[], name='select_test')
data = tf.cond(
    select_val,
    lambda: val_data,
    lambda: train_data
)

# Here is the model.
loss = ...
train_op = ...
...

with tf.Session():
    ...
If I delete the cond and just use the training data, the speed is 4000 samples/s; if I use the code above, the speed decreases to 2300 samples/s. The validation pipeline's capacity is set really small so it won't take too much GPU memory, and the frequency of validation is also really low.
I'm not sure what is going wrong; please help me out.
tf.cond is not fully lazy. Any operation that is required by either branch of the cond will be run even if the branch that requires it is not the one being executed. So in your case, both model.input(istrain=True) and model.input(istrain=False) are being executed every time your data op is called; the result of one of them is just ignored.
The documentation for cond gives a minimal code example:
Note that the conditional execution applies only to the operations defined in fn1 and fn2. Consider the following simple program:
z = tf.multiply(a, b)
result = tf.cond(x < y, lambda: tf.add(x, z), lambda: tf.square(y))
If x < y, the tf.add operation will be executed and the tf.square operation will not be executed. Since z is needed for at least one branch of the cond, the tf.multiply operation is always executed, unconditionally. Although this behavior is consistent with the dataflow model of TensorFlow, it has occasionally surprised some users who expected lazier semantics.
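To make that concrete, here is a small TF1-style example of my own (not from the documentation) where a side effect shows that an op defined outside the lambdas runs even when its branch is not taken:

import numpy as np
import tensorflow as tf

calls = {"outside": 0}

def _count():
    # Side effect so we can observe when this op actually runs.
    calls["outside"] += 1
    return np.float32(1.0)

x = tf.placeholder(tf.float32)
z = tf.py_func(_count, [], tf.float32)   # defined outside the cond branches
result = tf.cond(x > 0, lambda: z + 1.0, lambda: tf.constant(0.0))

with tf.Session() as sess:
    sess.run(result, feed_dict={x: -1.0})  # the "false" branch is taken...
    print(calls["outside"])                # ...but _count still ran once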
Also note that this means if your model.input is pulling some set of data from a larger pool (say, a batch from an entire dataset), then each time the cond is run, data gets pulled from both the validation and training pipelines, and one set just gets thrown away. This can cause problems more serious than inefficiency in some cases. For example, if you're processing only a certain number of epochs, then with this code you're not actually processing that number of epochs, because data was being pulled that was never used.
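One common workaround (a sketch of my own, not from this thread; build_model stands in for whatever builds the loss from an input tensor) is to skip tf.cond entirely, build two copies of the graph that share variables, and run only the op you need in each session.run call:

import tensorflow as tf

train_data = model.input(istrain=True)
val_data = model.input(istrain=False)

with tf.variable_scope("net"):
    train_loss = build_model(train_data)        # training tower
train_op = tf.train.AdamOptimizer().minimize(train_loss)

with tf.variable_scope("net", reuse=True):      # same weights, no new variables
    val_loss = build_model(val_data)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.start_queue_runners(sess=sess)
    sess.run(train_op)   # only consumes from the training pipeline
    sess.run(val_loss)   # only consumes from the validation pipeline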
In TensorFlow, how can I save the weights and all other variables of the program after it has finished training? I would like to be able to use the model I trained later on. Thanks in advance.
You can define a saver object like this:
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)
In this case, the saver is configured to keep the five most recent checkpoints and also to keep a checkpoint every hour during training.
The saver can then be called periodically in your main training loop with a call such as the following.
sess = tf.Session()
...

# Save the model every 100 iterations
if step % 100 == 0:
    saver.save(sess, "./model", global_step=step)
In this example the saver writes a checkpoint every 100 training steps using the filename prefix ./model. The optional global_step parameter appends the step value to the checkpoint filenames (e.g. model-100).
The model weights and other values may be restored at a later time for additional training or inference by the following:
ckpt = tf.train.get_checkpoint_state(".")  # directory where the checkpoints were saved
saver.restore(sess, ckpt.model_checkpoint_path)
There are a variety of other useful variants and options. A good place to start learning about them is the TF how-to on variable creation, storage and retrieval here
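For reference, here is a minimal end-to-end sketch of my own tying the pieces above together: create a variable, save a checkpoint, then restore it into a fresh session.

import tensorflow as tf

w = tf.Variable(tf.random_normal([10, 10]), name="w")
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop would go here ...
    saver.save(sess, "./model", global_step=100)     # writes ./model-100.*

with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state(".")
    saver.restore(sess, ckpt.model_checkpoint_path)  # no initializer needed
    print(sess.run(w))                               # restored values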