TensorFlow distributed seq2seq stuck forever

I'm trying to run the distributed seq2seq model in TensorFlow; this is based on the original single-process seq2seq model.
I set up a cluster (1 ps, 3 workers) following the TensorFlow distributed tutorial here.
But all workers get stuck forever and keep printing the same pool-allocator log messages:
start running session
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 7623 get requests, put_count=3649 evicted_count=1000 eviction_rate=0.274048 and unsatisfied allocation rate=0.665617
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
This is the cluster setup in translate.py:
ps_hosts = ["9.91.9.129:2222"]
worker_hosts = ["9.91.9.130:2223", "9.91.9.130:2224", "9.91.9.130:2225"]
#worker_hosts = ["9.91.9.130:2223"]

cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    # Worker server
    is_chief = (FLAGS.task_index == 0)
    gpu_num = FLAGS.task_index

    with tf.Graph().as_default():
        with tf.device(tf.train.replica_device_setter(
                cluster=cluster,
                worker_device="/job:worker/task:%d/gpu:%d" % (FLAGS.task_index, gpu_num))):
I used tf.train.SyncReplicasOptimizer to implement synchronous training.
This is part of my seq2seq_model.py:
# Gradients and SGD update operation for training the model.
params = tf.trainable_variables()
if not forward_only:
    self.gradient_norms = []
    self.updates = []
    opt = tf.train.GradientDescentOptimizer(self.learning_rate)
    opt = tf.train.SyncReplicasOptimizer(
        opt,
        replicas_to_aggregate=num_workers,
        replica_id=task_index,
        total_num_replicas=num_workers)
    for b in xrange(len(buckets)):
        gradients = tf.gradients(self.losses[b], params)
        clipped_gradients, norm = tf.clip_by_global_norm(gradients,
                                                         max_gradient_norm)
        self.gradient_norms.append(norm)
        self.updates.append(opt.apply_gradients(
            zip(clipped_gradients, params), global_step=self.global_step))
    self.init_tokens_op = opt.get_init_tokens_op
    self.chief_queue_runners = [opt.get_chief_queue_runner]
self.saver = tf.train.Saver(tf.all_variables())
This is my complete Python code [here].

It seems the TensorFlow team is not yet ready to properly share the experience of running code on a cluster; so far, comprehensive documentation can be found only in the source code.
As of version 0.11, according to SyncReplicasOptimizer.py, you have to run this after constructing the SyncReplicasOptimizer:
init_token_op = optimizer.get_init_tokens_op()
chief_queue_runner = optimizer.get_chief_queue_runner()
And then run this after your session is constructed with Supervisor:
if is_chief:
    sess.run(init_token_op)
    sv.start_queue_runners(sess, [chief_queue_runner])
With SyncReplicasOptimizerV2, introduced in 0.12, this code might not be sufficient, so please refer to the source code of the version you use.
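Putting the answer together, here is a minimal sketch of the chief-worker wiring for the 0.11-era API (flag and variable names such as train_dir are illustrative, not taken from the question's code). Note the call parentheses on get_init_tokens_op() and get_chief_queue_runner(); the question's snippet stores the unbound methods instead of calling them.

opt = tf.train.GradientDescentOptimizer(learning_rate)
opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=num_workers,
    replica_id=FLAGS.task_index,
    total_num_replicas=num_workers)
train_op = opt.apply_gradients(zip(clipped_gradients, params),
                               global_step=global_step)

# These must be called; they return an op and a QueueRunner respectively.
init_token_op = opt.get_init_tokens_op()
chief_queue_runner = opt.get_chief_queue_runner()

sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir=train_dir,
                         global_step=global_step)
with sv.managed_session(server.target) as sess:
    if is_chief:
        sv.start_queue_runners(sess, [chief_queue_runner])
        sess.run(init_token_op)
    # ... run training steps with sess.run(train_op) ...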

Related

TFF : every client do a pretrain function instead of build_federated_averaging_process

I would like every client to train its model with a function pretrain that I wrote below:
def pretrain(model):
    resnet_output = model.output
    layer1 = tf.keras.layers.GlobalAveragePooling2D()(resnet_output)
    layer2 = tf.keras.layers.Dense(units=zdim*2, activation='relu')(layer1)
    model_output = tf.keras.layers.Dense(units=zdim)(layer2)
    model = tf.keras.Model(model.input, model_output)

    iterations_per_epoch = determine_iterations_per_epoch()
    total_iterations = iterations_per_epoch * num_epochs
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9)

    checkpoint = tf.train.Checkpoint(step=tf.Variable(1), optimizer=optimizer, net=model)
    manager = tf.train.CheckpointManager(checkpoint, pretrain_save_path, max_to_keep=10)

    current_epoch = tf.cast(tf.floor(optimizer.iterations / iterations_per_epoch), tf.int64)

    batch = client_data(0)
    batch = client_data(0).batch(2)
    epoch_loss = []
    for (image1, image2) in batch:
        loss, gradients = train_step(model, image1, image2)
        epoch_loss.append(loss)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        # if tf.reduce_all(tf.equal(epoch, current_epoch+1)):
        print("Loss after epoch {}: {}".format(current_epoch, sum(epoch_loss)/len(epoch_loss)))
        #print("Learning rate: {}".format(learning_rate(optimizer.iterations)))
        epoch_loss = []
        current_epoch += 1

        if current_epoch % 50 == 0:
            save_path = manager.save()
            print("Saved model for epoch {}: {}".format(current_epoch, save_path))

    save_path = manager.save()
    model.save("model.h5")
    model.save_weights("saved_weights.h5")
But, as we know, TFF has a predefined function:
iterative_process = tff.learning.build_federated_averaging_process(...)
So please, how can I proceed? Thanks.
There are a few ways that one could proceed along similar lines.
First, it is important to note that TFF is functional: one can use things like writing to and reading from files to manage state (TF allows this), but it is not part of the interface TFF exposes to users. Anything involving writing to or reading from a file (i.e., manipulating state without passing it through function parameters and results) should at best be considered an implementation detail; it is something TFF does not encourage.
By slightly refactoring your code above, however, I think this kind of application can fit quite nicely in TFF's programming model. We will want to define something like:
@tff.tf_computation
@tf.function
def pretrain_client_model(model, client_dataset):
    # perhaps do dataset processing you want...
    for batch in client_dataset:
        # do model training
        ...
    return model.weights()  # or some tensor structure representing the trained model weights
Once your implementation looks something like this, you will be able to wire it in to a custom iterative process. The canned function you mention (build_federated_averaging_process) really just constructs an instance of tff.templates.IterativeProcess; you are always, however, free to write your own instance of this class.
Several tutorials take us through this process; this one is probably the simplest. For a finished code example of a standalone iterative process implementation, see simple_fedavg.py.
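As a rough illustration of that refactoring (a sketch, not the author's code), the pretraining loop from the question could be rewritten as a pure function of initial weights and a client dataset; the helpers build_pretrain_model and compute_loss are assumed stand-ins for the question's model construction and train_step.

import tensorflow as tf

def pretrain_client_model(initial_weights, client_dataset):
    # build_pretrain_model() is an assumed helper that rebuilds the same
    # Keras architecture (ResNet trunk + the two Dense layers) on every call.
    model = build_pretrain_model()
    model.set_weights(initial_weights)
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

    for image1, image2 in client_dataset:
        with tf.GradientTape() as tape:
            # compute_loss() stands in for whatever train_step computed above.
            loss = compute_loss(model, image1, image2)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Return the trained weights instead of writing them to disk, so the
    # result can flow back through TFF's functional interface.
    return model.get_weights()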

FedProx with TensorFlow Federated

Would anyone know how to implement the FedProx optimisation algorithm with TensorFlow Federated? The only implementation that seems to be available online was developed directly with TensorFlow. A TFF implementation would enable an easier comparison with experiments that utilise FedAvg which the framework supports.
This is the link to the FedProx repo: https://github.com/litian96/FedProx
Link to the paper: https://arxiv.org/abs/1812.06127
At the moment, a FedProx implementation is not available. I agree it would be a valuable algorithm to have.
If you are interested in contributing FedProx, the best place to start would be simple_fedavg, which is a minimal implementation of FedAvg meant as a starting point for extensions -- see the README there for more details.
I think the major change would need to happen in the client_update method, where you would add a proximal term, depending on model_weights and initial_weights, to the loss computed in the forward pass.
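For reference, the local objective FedProx minimizes on client k (from the linked paper) is the usual local loss plus exactly that proximal term:

h_k(w; w_t) = F_k(w) + (mu / 2) * ||w - w_t||^2

where w_t are the model weights broadcast by the server at round t and mu controls how strongly local updates are pulled toward the global model.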
I provide below my implementation of FedProx in TFF. I am not 100% sure that this is the right implementation; I post this answer also to invite discussion of an actual code example.
I tried to follow the suggestions in Jacub Konecny's answer and comment.
Starting from simple_fedavg (in the TFF GitHub repo), I just modified the client_update method, specifically changing the argument used to calculate the gradient with the GradientTape: instead of passing only outputs.loss, the tape calculates the gradient of outputs.loss + proximal_term, where the proximal term is computed beforehand (and on every iteration).
@tf.function
def client_update(model, dataset, server_message, client_optimizer):
    """Performs client local training of `model` on `dataset`.

    Args:
      model: A `tff.learning.Model`.
      dataset: A `tf.data.Dataset`.
      server_message: A `BroadcastMessage` from server.
      client_optimizer: A `tf.keras.optimizers.Optimizer`.

    Returns:
      A `ClientOutput`.
    """

    def difference_model_norm_2_square(global_model, local_model):
        """Calculates the squared l2 norm of a model difference (i.e.
        local_model - global_model).

        Args:
          global_model: the model broadcast by the server
          local_model: the current, in-training model

        Returns:
          The squared norm.
        """
        model_difference = tf.nest.map_structure(lambda a, b: a - b,
                                                 local_model,
                                                 global_model)
        squared_norm = tf.square(tf.linalg.global_norm(model_difference))
        return squared_norm

    model_weights = model.weights
    initial_weights = server_message.model_weights
    tf.nest.map_structure(lambda v, t: v.assign(t), model_weights,
                          initial_weights)

    num_examples = tf.constant(0, dtype=tf.int32)
    loss_sum = tf.constant(0, dtype=tf.float32)
    # Explicit use of `iter` for the dataset is a trick that makes TFF more
    # robust in GPU simulation and slightly more performant in the
    # unconventional usage of a large number of small datasets.
    for batch in iter(dataset):
        with tf.GradientTape() as tape:
            outputs = model.forward_pass(batch)
            # ------ FedProx ------
            mu = tf.constant(0.2, dtype=tf.float32)
            prox_term = (mu / 2) * difference_model_norm_2_square(
                model_weights.trainable, initial_weights.trainable)
            fedprox_loss = outputs.loss + prox_term

        # Let the GradientTape deal with the FedProx loss.
        grads = tape.gradient(fedprox_loss, model_weights.trainable)
        client_optimizer.apply_gradients(zip(grads, model_weights.trainable))

        batch_size = tf.shape(batch['x'])[0]
        num_examples += batch_size
        loss_sum += outputs.loss * tf.cast(batch_size, tf.float32)

    weights_delta = tf.nest.map_structure(lambda a, b: a - b,
                                          model_weights.trainable,
                                          initial_weights.trainable)
    client_weight = tf.cast(num_examples, tf.float32)
    return ClientOutput(weights_delta, client_weight, loss_sum / client_weight)

Execute another model in parallel to a model's forward pass with PyTorch

I am trying to make some changes to the ResNet-18 model in PyTorch so that it invokes another, separately trained auxiliary model, which takes the intermediate output at the end of each ResNet block as input and makes some auxiliary predictions during the inference phase.
I want to be able to do the auxiliary computation after the computation of a block in parallel to the computation of the next ResNet block so as to reduce the end-to-end latency of the entire pipeline execution on GPU.
I have base code that works correctly from a functionality perspective, but the auxiliary model executes serially with the ResNet block computation. I verified this in two ways:
By adding print statements and verifying the order of execution.
By instrumenting the running time of the original ResNet model (say time t1) and the auxiliary model (say time t2). My execution time is currently t1+t2.
The original ResNet block code (this is BasicBlock, since I am experimenting with ResNet-18); the entire code is available here:
class BasicBlock(nn.Module):
    ...
    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out
This is my modification, which works but runs serially:
def forward(self, x):
    if len(x[0]) == self.auxiliary_prediction_size:  # Got an Auxiliary prediction earlier
        return x

    # Do usual block computation
    residual = x
    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)
    out = self.conv2(out)
    out = self.bn2(out)
    if self.downsample is not None:
        residual = self.downsample(x)
    out += residual
    out = self.relu(out)

    # Try to make an auxiliary prediction
    # First flatten the tensor (also assume for now that batch size is 1)
    batchSize = x.shape[0]
    intermediate_output = out.view(batchSize, -1)
    # Place the flattened on GPU
    device = torch.device("cuda:0")
    input = intermediate_output.to(device)
    # Make auxiliary prediction
    auxiliary_input = out.float()
    auxiliary_prediction = self.auxiliary_model(auxiliary_input)
    if auxiliary_prediction meets some condition:
        return auxiliary_prediction
    # If no auxiliary prediction, then return intermediate output
    return out
Understandably, the above code creates a data dependency between the execution of the auxiliary model and the next block, so everything happens serially. The first thing I tried was to check whether breaking this data dependency reduces latency. I did so by letting the auxiliary model execute but not returning auxiliary_prediction when the condition is met (note that this breaks functionality; the experiment was purely to understand the behavior). Essentially, what I did was:
batchSize = x.shape[0]
intermediate_output = out.view(batchSize, -1)
# Place the flattened on GPU
device = torch.device("cuda:0")
input = intermediate_output.to(device)
# Make auxiliary prediction
auxiliary_input = out.float()
auxiliary_prediction = self.auxiliary_model(auxiliary_input)
if auxiliary_prediction meets some condition:
    # Comment out return to break data dependency
    #return auxiliary_prediction
    pass
# If no auxiliary prediction, then return intermediate output
return out
However, this did not work, and upon researching further I stumbled upon CUDA streams at this Stack Overflow link. I tried incorporating the idea of CUDA streams to solve my problem in the following way:
def forward(self, x):
    if len(x[0]) == self.auxiliary_prediction_size:  # Got an Auxiliary prediction earlier
        return x

    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()

    with torch.cuda.Stream(s1):
        # Do usual block computation
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)

    with torch.cuda.Stream(s2):
        # Try to make an auxiliary prediction
        # First flatten the tensor (also assume for now that batch size is 1)
        out_detach = out.detach()  # Detach from backprop flow and from computational graph dependency
        batchSize = x.shape[0]
        intermediate_output = out_detach.view(batchSize, -1)
        # Place the flattened on GPU
        device = torch.device("cuda:0")
        input = intermediate_output.to(device)
        # Make auxiliary prediction
        auxiliary_input = out_detach.float()
        auxiliary_prediction = self.auxiliary_model(auxiliary_input)
        if auxiliary_prediction meets some condition:
            return auxiliary_prediction

    # If no auxiliary prediction, then return intermediate output
    return out
However, the output from the NVIDIA Visual Profiler still indicates that all work is being done on the default stream and is still serialized. Note that I did verify with a small CUDA program that CUDA streams are supported by the CUDA version I am using.
My questions:
Why does breaking the data dependency not cause PyTorch to schedule the computations in parallel? I thought this was the point of the dynamic computation graphs in PyTorch.
Why does using CUDA streams not delegate the computation to non-default streams?
Are there alternative approaches to execute the auxiliary model asynchronously/parallelly to the ResNet block computation?
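For reference, the stream context manager documented by PyTorch is the lowercase torch.cuda.stream() function, which takes a Stream object; the snippet above enters the Stream class itself. A minimal sketch of the documented pattern (assuming a CUDA-capable PyTorch build, and not a verified fix for the latency issue above) looks like this:

import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Work issued inside the context manager is enqueued on the given stream.
with torch.cuda.stream(s1):
    a = torch.randn(1024, 1024, device="cuda")
    b = a @ a

with torch.cuda.stream(s2):
    c = torch.randn(1024, 1024, device="cuda")
    d = c @ c

# Wait for both streams before using the results on the default stream.
torch.cuda.synchronize()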

Input Queue not responding TensorFlow program hanging

I am currently trying to train a neural network. I have an array of file names and their corresponding labels. However I am having issues when trying to train the network.
image_list, label_list = readImageLables()
images = ops.convert_to_tensor(image_list, dtype=dtypes.string)
labels = ops.convert_to_tensor(label_list, dtype=dtypes.int32)

with tf.Session() as sess:
    init_op = tf.initialize_all_variables()
    sess.run(init_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for epoch in range(hm_epochs):
        epoch_loss = 0
        for _ in range(int(7685 / batch_size)):
            print(labels.eval())
            filename_queue = tf.train.slice_input_producer([images, labels], num_epochs=10, shuffle=True)
            image, label = read_images_from_disk(filename_queue)
            print(image.eval())
            epoch_x, epoch_y = tf.train.batch([image, label], batch_size=batch_size)
            print("wait what")
            #imgs, lbls = epoch_x.eval(), epoch_y.eval()
            _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x.eval(), y: epoch_y.eval()})
            epoch_loss += c
        print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss)
At the line where I try to print the image data, the program hangs. Even when this line is removed, the program hangs on the last sess.run call, in which I am feeding this data. I have initialized queue runners, coordinators, etc. However, I have a feeling that the filename_queue is the issue. Is there anything I am missing in the tf.train.slice_input_producer line? Also, is the program actually hanging, or is it just taking a while to load? How much time would it usually take to load an epoch with a batch size of 100 and images of 80 by 70?
It looks like an issue I opened. While feeding data, the input queue runners were hanging. This is because you have to start them.
From the issue, we have:
Quoting: RudrakshTuwani
For anyone else struggling with this, please read the documentation as mentioned by girving. For the lazy ones:
init = tf.global_variables_initializer()
sess.run(init)
threads = tf.train.start_queue_runners()
print(sess.run(name_of_output_tensor))
As well as:
From: girving
You probably need to start queue runners. Please see the documentation at https://www.tensorflow.org/versions/r0.11/how_tos/threading_and_queues/index.html
Hope it helps!
pltrdy
Note that in my case I got confused because the original code was using:
sv = tf.train.Supervisor(logdir=FLAGS.save_path)
with sv.managed_session() as session:
instead of my (and your):
with tf.Session() as session:
The first one actually implicitly starts queue runners.
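Putting the advice together for the question's code, a minimal sketch of the implied ordering with the legacy TF 0.x queue API (readImageLables and read_images_from_disk are the question's own helpers): build the input pipeline once, initialize variables, start the queue runners, then run the training loop.

image_list, label_list = readImageLables()
images = ops.convert_to_tensor(image_list, dtype=dtypes.string)
labels = ops.convert_to_tensor(label_list, dtype=dtypes.int32)

# Build the input pipeline once, outside the training loop.
input_queue = tf.train.slice_input_producer([images, labels], num_epochs=10, shuffle=True)
image, label = read_images_from_disk(input_queue)
batch_x, batch_y = tf.train.batch([image, label], batch_size=batch_size)

with tf.Session() as sess:
    # num_epochs creates a local variable, so initialize local variables too.
    sess.run([tf.initialize_all_variables(), tf.initialize_local_variables()])

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for epoch in range(hm_epochs):
            epoch_loss = 0
            for _ in range(int(7685 / batch_size)):
                imgs, lbls = sess.run([batch_x, batch_y])
                _, c = sess.run([optimizer, cost], feed_dict={x: imgs, y: lbls})
                epoch_loss += c
            print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss)
    finally:
        coord.request_stop()
        coord.join(threads)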

Slightly differing output from Pybrain neural network despite consistent initialisation?

I am working on a feed-forward network in PyBrain. To allow me to compare the effects of varying certain parameters, I have initialised the network weights myself. I have done this under the assumption that if the weights are always the same, then the output should always be the same. Is this assumption incorrect? Below is the code used to set up the network:
n = FeedForwardNetwork()
inLayer = LinearLayer(7, name="in")
hiddenLayer = SigmoidLayer(1, name="hidden")
outLayer = LinearLayer(1, name="out")
n.addInputModule(inLayer)
n.addModule(hiddenLayer)
n.addOutputModule(outLayer)
in_to_hidden = FullConnection(inLayer, hiddenLayer, name="in-to-hidden")
hidden_to_out = FullConnection(hiddenLayer, outLayer, name="hidden-to-out")
n.addConnection(in_to_hidden)
n.addConnection(hidden_to_out)
n.sortModules()
in_to_hidden_params = [
0.27160018, -0.30659429, 0.13443352, 0.4509613,
0.2539234, -0.8756649, 1.25660715
]
hidden_to_out_params = [0.89784474]
net_params = in_to_hidden_params + hidden_to_out_params
n._setParameters(net_params)
trainer = BackpropTrainer(n, ds, learningrate=0.01, momentum=0.8)
UPDATE
It looks like even after seeding the random number generator, reproducibility is still an issue. See the GitHub issue here.
I have done this under the assumption that if the weights are always the same then the output should always be the same
The assumption is correct, but your code does not do that. You are training your weights, so they do not end up being the same. Stochastic training methods often permute training samples, and this permutation leads to different results; in particular, BackpropTrainer does so:
def train(self):
    """Train the associated module for one epoch."""
    assert len(self.ds) > 0, "Dataset cannot be empty."
    self.module.resetDerivatives()
    errors = 0
    ponderation = 0.
    shuffledSequences = []
    for seq in self.ds._provideSequences():
        shuffledSequences.append(seq)
    shuffle(shuffledSequences)
If you want repeatable results, seed your random number generators.
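A minimal sketch of that seeding, assuming PyBrain's shuffling goes through Python's random module (as in the excerpt above) and weight initialisation goes through numpy/scipy:

import random
import numpy as np

# Seed both RNGs before constructing and training the network.
random.seed(42)      # used by shuffle() inside BackpropTrainer.train
np.random.seed(42)   # used by numpy/scipy-based parameter initialisation

n = FeedForwardNetwork()
# ... build, set parameters, and train exactly as above ...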
