Input queue not responding: TensorFlow program hanging

I am currently trying to train a neural network. I have an array of file names and their corresponding labels. However, I am having issues when trying to train the network.
image_list, label_list = readImageLables()
images = ops.convert_to_tensor(image_list, dtype=dtypes.string)
labels = ops.convert_to_tensor(label_list, dtype=dtypes.int32)

with tf.Session() as sess:
    init_op = tf.initialize_all_variables()
    sess.run(init_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for epoch in range(hm_epochs):
        epoch_loss = 0
        for _ in range(int(7685/batch_size)):
            print(labels.eval())
            filename_queue = tf.train.slice_input_producer([images, labels], num_epochs=10, shuffle=True)
            image, label = read_images_from_disk(filename_queue)
            print(image.eval())
            epoch_x, epoch_y = tf.train.batch([image, label], batch_size=batch_size)
            print("wait what")
            #imgs, lbls = epoch_x.eval(), epoch_y.eval()
            _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x.eval(), y: epoch_y.eval()})
            epoch_loss += c
        print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss)
At the line where I am trying to print the image data, the program hangs. Even when this line is removed, the program hangs on the last sess.run call in which I am feeding this data. I have initialized queue runners, coordinators, etc.; however, I have a feeling that the filename_queue is the issue. Is there anything I am missing in the tf.train.slice_input_producer line? Also, is the program hanging, or is it just taking a while to load? How much time would it usually take to load an epoch with a batch size of 100 and images of 80 by 70?

This looks like an issue I opened: the input queue was hanging while feeding data, because the queue runners have to be started.
From the issue, we have:
Quoting: RudrakshTuwani
For anyone else struggling with this, please read the documentation as mentioned by girving. For the lazy ones:
init = tf.global_variables_initializer()
sess.run(init)
threads = tf.train.start_queue_runners()
print(sess.run(name_of_output_tensor))
As well as:
From: girving
You probably need to start queue runners. Please see the documentation at https://www.tensorflow.org/versions/r0.11/how_tos/threading_and_queues/index.html
Hope it helps!
pltrdy
Note that in my case I got confused because the original code was using:
sv = tf.train.Supervisor(logdir=FLAGS.save_path)
with sv.managed_session() as session:
instead of my (and your):
with tf.Session() as session:
The first one actually implicitly starts queue runners.
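To make the difference concrete, here is a minimal sketch of the two patterns (graph construction omitted; output_tensor is a placeholder name for whatever tensor you want to evaluate):
sv = tf.train.Supervisor(logdir=FLAGS.save_path)
with sv.managed_session() as session:
    # managed_session initializes variables and starts the queue runners for you
    print(session.run(output_tensor))

with tf.Session() as session:
    # a plain Session needs explicit initialization and queue-runner startup
    session.run(tf.global_variables_initializer())
    session.run(tf.local_variables_initializer())  # needed when a producer uses num_epochs
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=session, coord=coord)
    print(session.run(output_tensor))
    coord.request_stop()
    coord.join(threads)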

Related

TFF: have every client run a pretrain function instead of build_federated_averaging_process

I would like every client to train its model with the function pretrain that I wrote below:
def pretrain(model):
    resnet_output = model.output
    layer1 = tf.keras.layers.GlobalAveragePooling2D()(resnet_output)
    layer2 = tf.keras.layers.Dense(units=zdim*2, activation='relu')(layer1)
    model_output = tf.keras.layers.Dense(units=zdim)(layer2)
    model = tf.keras.Model(model.input, model_output)

    iterations_per_epoch = determine_iterations_per_epoch()
    total_iterations = iterations_per_epoch*num_epochs
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9)

    checkpoint = tf.train.Checkpoint(step=tf.Variable(1), optimizer=optimizer, net=model)
    manager = tf.train.CheckpointManager(checkpoint, pretrain_save_path, max_to_keep=10)

    current_epoch = tf.cast(tf.floor(optimizer.iterations/iterations_per_epoch), tf.int64)

    batch = client_data(0)
    batch = client_data(0).batch(2)
    epoch_loss = []
    for (image1, image2) in batch:
        loss, gradients = train_step(model, image1, image2)
        epoch_loss.append(loss)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        # if tf.reduce_all(tf.equal(epoch, current_epoch+1)):
        print("Loss after epoch {}: {}".format(current_epoch, sum(epoch_loss)/len(epoch_loss)))
        #print("Learning rate: {}".format(learning_rate(optimizer.iterations)))
        epoch_loss = []
        current_epoch += 1

        if current_epoch % 50 == 0:
            save_path = manager.save()
            print("Saved model for epoch {}: {}".format(current_epoch, save_path))

    save_path = manager.save()
    model.save("model.h5")
    model.save_weights("saved_weights.h5")
But as we know, TFF has a predefined function:
iterative_process = tff.learning.build_federated_averaging_process(...)
So please, how can I proceed? Thanks.
There are a few ways that one could proceed along similar lines.
First, it is important to note that TFF is functional. One can use things like writing to and reading from files to manage state (TF allows this), but that is not part of the interface TFF exposes to users; anything that manipulates state without passing it through function parameters and results should at best be considered an implementation detail, and it is something TFF does not encourage.
By slightly refactoring your code above, however, I think this kind of application can fit quite nicely in TFF's programming model. We will want to define something like:
@tff.tf_computation
@tf.function
def pretrain_client_model(model, client_dataset):
    # perhaps do dataset processing you want...
    for batch in client_dataset:
        # do model training
        ...
    return model.weights()  # or some tensor structure representing the trained model weights
Once your implementation looks something like this, you will be able to wire it in to a custom iterative process. The canned function you mention (build_federated_averaging_process) really just constructs an instance of tff.templates.IterativeProcess; you are always, however, free to write your own instance of this class.
Several tutorials take you through this process, this one probably being the simplest. For a finished code example of a standalone iterative process implementation, see simple_fedavg.py.
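As a rough, non-authoritative sketch of that wiring, following the pattern of the "building your own federated learning algorithm" tutorial (build_keras_model, client_train and ELEMENT_SPEC below are hypothetical placeholders for your model constructor, local training loop and dataset element spec):
import tensorflow as tf
import tensorflow_federated as tff

# Hypothetical placeholders: adapt these to your own model and data.
# ELEMENT_SPEC describes one (image1, image2) pair in a client dataset.
dataset_type = tff.SequenceType(ELEMENT_SPEC)

@tff.tf_computation
def server_init():
    # initial model weights held at the server
    return build_keras_model().trainable_variables

model_weights_type = server_init.type_signature.result

@tff.tf_computation(model_weights_type, dataset_type)
def pretrain_client(server_weights, client_dataset):
    model = build_keras_model()
    # load the broadcast server weights into the local model
    tf.nest.map_structure(lambda v, w: v.assign(w),
                          model.trainable_variables, server_weights)
    client_train(model, client_dataset)  # your local pretraining loop
    return model.trainable_variables

@tff.federated_computation
def initialize_fn():
    return tff.federated_value(server_init(), tff.SERVER)

@tff.federated_computation(tff.FederatedType(model_weights_type, tff.SERVER),
                           tff.FederatedType(dataset_type, tff.CLIENTS))
def next_fn(server_weights, client_datasets):
    broadcast = tff.federated_broadcast(server_weights)
    client_weights = tff.federated_map(pretrain_client, (broadcast, client_datasets))
    # average the locally pretrained weights back at the server
    return tff.federated_mean(client_weights)

iterative_process = tff.templates.IterativeProcess(initialize_fn=initialize_fn,
                                                   next_fn=next_fn)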

PyTorch: Calculating the running time of a for loop on GPU and CPU

I am really new to PyTorch, and I got really confused for a whole day trying to figure out why my network runs slower on the GPU than on the CPU. I do not understand why, when I calculate the running time using time.time(), the time of the whole loop is a lot different from the sum of the individual running times. Here is part of my code. Could anybody help me? Appreciate it!
time_out = 0
time_in = 0
for epoch in tqdm(range(self.n_epoch)):
    running_loss = 0
    running_error = 0
    running_acc = 0
    if self.cuda:
        torch.cuda.synchronize()  # time_out_start
    epst1 = time.time()
    for step, (batch_x, batch_y) in enumerate(self.normal_loader):
        if self.cuda:
            torch.cuda.synchronize()  # time_in_start
        t1 = time.time()
        batch_x, batch_y = batch_x.to(self.device), batch_y.to(self.device)
        b_x = Variable(batch_x)
        b_y = Variable(batch_y)
        pred_y = self.model(b_x)
        #print (pred_y)
        loss = self.criterion(pred_y, b_y)
        error = mae(pred_y.detach().cpu().numpy(), b_y.detach().cpu().numpy())
        acc = r2(b_y.detach().cpu().numpy(), pred_y.detach().cpu().numpy())
        #print (loss)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        running_acc += acc
        running_loss += loss.item()
        running_error += error
        if self.cuda:
            torch.cuda.synchronize()  # time_in_end
        t6 = time.time()
        time_in += t6 - t1
    if self.cuda:
        torch.cuda.synchronize()  # time_out_end
    eped1 = time.time()
    time_out += eped1 - epst1
print('loop time (out)', time_out)
print('loop time (in)', time_in)
The result is:
CPU:
EPOCH 10: out: 1.283s in: 0.695s
EPOCH 50: out: 6.43s in: 3.288s
EPOCH 100: out: 12.646s in: 6.386s
GPU:
EPOCH 10: out: 3.92s in: 1.471s
EPOCH 50: out: 9.35s in: 3.04s
EPOCH 100: out: 18.418s in: 5.655s
I understand that transferring data from the CPU to the GPU costs some time, so as the epochs go up, the GPU's computation time should become less than the CPU's. My questions are:
Why is the time I record outside of the loop so different from the inside one? Is there any step that I missed when recording the running time?
And why does the GPU cost more outside-time even though the inside-time is already less than the CPU time?
The network is really simple:
class Model(nn.Module):
    def __init__(self, n_input, n_nodes1, n_nodes2):
        super(Model, self).__init__()
        self.n_input = n_input
        self.n_nodes1 = n_nodes1
        self.n_nodes2 = n_nodes2
        self.l1 = nn.Linear(self.n_input, self.n_nodes1)
        self.l2 = nn.Linear(self.n_nodes1, self.n_nodes2)
        self.l3 = nn.Linear(self.n_nodes2, 1)

    def forward(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        h = self.l3(h2)
        return h
The training data is formed as follows (a regression problem; the inputs x are descriptors and y is the target value):
def load_train_normal(self, x, y, batch_size=100):
    if batch_size:
        self.batch_size = batch_size
    self.x_train_n, self.y_train_n = Variable(torch.from_numpy(x).float()), Variable(torch.from_numpy(y).float())
    #x, y = Variable(torch.from_numpy(x).float()), Variable(torch.from_numpy(y).float())
    self.dataset = Data.TensorDataset(self.x_train_n, self.y_train_n)
    self.normal_loader = Data.DataLoader(
        dataset=self.dataset,
        batch_size=self.batch_size,
        shuffle=True, num_workers=2,)
Why is the time I record outside of the loop so different from the inside one? Is there any step that I missed when recording the running time?
self.normal_loader is not just a plain dictionary, vector or something as simple as that. Iterating over it takes a significant amount of time.
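If you want to see how much of that outer time is spent just pulling batches out of the DataLoader, a rough sketch (reusing normal_loader from the question, and measuring only the host-side wait for each batch) is:
import time

t_data = 0.0
t_prev = time.time()
for batch_x, batch_y in normal_loader:      # your DataLoader instance
    t_data += time.time() - t_prev          # time spent waiting for this batch
    # ... training step goes here ...
    t_prev = time.time()
print('time spent waiting on the DataLoader:', t_data)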
And why does the GPU cost more outside-time even though the inside-time is already less than the CPU time?
torch.cuda.synchronize() is a heavy operation, and here it does not even do anything useful, since pred_y.detach().cpu() already enforces synchronization.
As for how to make it faster: drop the synchronize() calls, they don't do you any good.
And then defer the processing of pred_y until later. Much later. You want to have called the model at least 2 or 3 times before you trigger the first download of results. The simpler the model and the smaller the data, the more iterations you have to wait.
Because transfers to and from the GPU don't just "take time", they imply synchronization. Without synchronization, the execution model on the GPU mostly "lags behind": data uploads to the GPU are already asynchronous behind the scenes, and actual execution is simply queued behind them. If you don't synchronize, whether by accident or explicitly, workloads start to overlap and stuff (uploads, execution, CPU work) starts running in parallel. Your effective execution time approaches max(upload, download, GPU execution, CPU execution).
If you synchronize, there are no tasks to overlap and no batches to form from same-typed tasks. Upload, execution, download, CPU part: it all happens sequentially. Your execution time ends up being upload + download + GPU execution + CPU execution, with some additional overhead for breaking batching on the driver level on top. So it is easily 5-10x slower than it should be.
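If you do want meaningful per-epoch GPU timings, one common approach (a sketch, reusing model, criterion, optimizer, normal_loader and device from the question's code) is to use CUDA events and synchronize only once, right before reading the timer out:
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
for batch_x, batch_y in normal_loader:
    batch_x = batch_x.to(device, non_blocking=True)
    batch_y = batch_y.to(device, non_blocking=True)
    pred_y = model(batch_x)
    loss = criterion(pred_y, batch_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # keep loss/pred_y on the GPU here; read them out only every N steps
end_evt.record()

torch.cuda.synchronize()  # one sync, right before reading the timers
print('GPU time for the epoch: {:.1f} ms'.format(start_evt.elapsed_time(end_evt)))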

Issue with multilabel classification

I followed this tutorial: https://medium.com/@vijayabhaskar96/multi-label-image-classification-tutorial-with-keras-imagedatagenerator-cd541f8eaf24
and wrote some of my code for multilabel classification. I had it working with one-hot encoding on a small scale but I had to move to option 2 mentioned in the article because I have 6000 classes and therefore one hot was not viable. I managed to train the network and it said 99% accuracy and 83% f1 score. However, when I'm trying to test the network, for every image it's outputting some combination of only 3 labels when there are 6000 possible labels. I wondered if maybe the code to test the model was incorrect. I tried using the code mentioned in the post and it doesn't work:
test_generator.reset()
pred = model.predict_generator(test_generator, steps=STEP_SIZE_TEST, verbose=1);
pred_bool = (pred > 0.5)
unorderable types: list() > float()
I've tried hard to fix this and not figured it out and I can't find any examples online of anyone doing something similar. Does anyone have an idea of how to get this prediction part working using this code block (I had it with another 2 options and was getting that issue printing one or several labels) or why the model might be failing in training with this behavior?
EDIT: for more context on the training issue, here is all the training code:
import json
input_file = open ('class_names_6000.json')
json_array = json.load(input_file)
#print(str(json_array))
args = parser.parse_args()
gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
print('Loading Data...')
df = pd.read_csv('dataset_train.csv')
df["labels"]=df["labels"].apply(lambda x:x.split(","))
datagen=ImageDataGenerator(rescale=1./255.)
test_datagen=ImageDataGenerator(rescale=1./255.)
train_generator = datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100,100))

df = pd.read_csv('dataset_test.csv')
df["labels"] = df["labels"].apply(lambda x: x.split(","))
test_generator = test_datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100,100))

df = pd.read_csv('dataset_validation.csv')
df["labels"] = df["labels"].apply(lambda x: x.split(","))
valid_generator = test_datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100,100))
print('Data Loaded.')
f1_score_callback = ComputeF1()
model = build_model('train', numclasses=len(json_array), model_name = args.model)
ImageFile.LOAD_TRUNCATED_IMAGES = True
Also, an important detail: when training, it says the accuracy is 99% and the f1 score is 84%, with a validation f1 score of 84% as well.

Tensor.eval() gives no output. Just a blinking cursor in IPython notebook

I am trying to read multiple images in TensorFlow, and here is my code, which I have taken from a Stack Overflow post:
sess = tf.InteractiveSession()
filenames = ['/Users/darshak/TensorFlow/1.jpg', '/Users/darshak/TensorFlow/10.jpg']
filename_queue = tf.train.string_input_producer(filenames)
reader = tf.WholeFileReader()
filename, content = reader.read(filename_queue)
image = tf.image.decode_jpeg(content, channels=3)
Now when I run,
image.eval()
I get no output. Just a blinking cursor. How can I see if something is wrong?
You need to start the queue runners. Just add the following after creating the session, and before evaluating image.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
For more, check out Threading and Queues.
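Applied to the snippet from the question, the whole thing would look something like this (same file paths assumed):
import tensorflow as tf

sess = tf.InteractiveSession()

filenames = ['/Users/darshak/TensorFlow/1.jpg', '/Users/darshak/TensorFlow/10.jpg']
filename_queue = tf.train.string_input_producer(filenames)

reader = tf.WholeFileReader()
filename, content = reader.read(filename_queue)
image = tf.image.decode_jpeg(content, channels=3)

coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

print(image.eval())  # now returns a (height, width, 3) uint8 array instead of hanging

coord.request_stop()
coord.join(threads)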

TensorFlow: does tf.train.batch automatically load the next batch when the batch has finished training?

For instance, after I have created my operations, fed the batch data through the operation and run the operation, does tf.train.batch automatically feed in another batch of data to the session?
I ask this because tf.train.batch has an allow_smaller_final_batch argument which makes it possible for the final batch to be loaded with a size smaller than the indicated batch size. Does this mean that, even without a loop, the next batch could be fed automatically? The tutorial code has me rather confused. When I load a single batch, I get literally a single batch of shape [batch_size, height, width, num_channels], but the documentation says it "Creates batches of tensors in tensors." Also, when I read the tutorial code in the tf-slim walkthrough tutorial, where there is a function called load_batch, only 3 tensors are returned: images, images_raw, labels. Where are the 'batches' of data as explained in the documentation?
Thank you for your help.
... does tf.train.batch automatically feed in another batch of data to the session?
No. Nothing happens automatically. You must call sess.run(...) again to load a new batch.
Does this mean even without a loop, the next batch could be automatically fed?
No. tf.train.batch(..) will always load batch_size tensors. If you have, for example, 100 images and batch_size=30, then you will have 3 batches of 30 images each; that is, you can call sess.run(batch) three times before the input queue starts from the beginning (or stops if num_epochs=1). This means that you miss out on 100-3*30=10 samples from training. In case you do not want to miss them, you can use tf.train.batch(..., allow_smaller_final_batch=True), so now you will have three 30-sample batches and one 10-sample batch before the input queue restarts.
Let me also elaborate with a code sample:
queue = tf.train.string_input_producer(filenames,
                                        num_epochs=1)  # only iterate through all samples in the dataset once

reader = tf.TFRecordReader()  # or any reader you need
_, example = reader.read(queue)
image, label = your_conversion_fn(example)

# batch will now load up to 100 image-label-pairs on sess.run(...)
# most tf ops are tuned to work on batches
# this is faster and also gives better results on e.g. gradient calculation
batch = tf.train.batch([image, label], batch_size=100)

with tf.Session() as sess:
    # "boilerplate" code
    sess.run([
        tf.local_variables_initializer(),
        tf.global_variables_initializer(),
    ])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    try:
        # in most cases coord.should_stop() will return True
        # when there are no more samples to read
        # if num_epochs=None then it will run forever
        while not coord.should_stop():
            # will start reading, working data from the input queue
            # and "fetch" the results of the computation graph
            # into raw_images and raw_labels
            raw_images, raw_labels = sess.run(batch)
    finally:
        coord.request_stop()
        coord.join(threads)
You need to call sess.run and pass the batch to it every time you want to load the next batch. See the code below.
img = [0,1,2,3,4,5,6,7,8]
lbl = [0,1,2,3,4,5,6,7,8]
images = tf.convert_to_tensor(img)
labels = tf.convert_to_tensor(lbl)
input_queue = tf.train.slice_input_producer([images, labels])
sliced_img = input_queue[0]
sliced_lbl = input_queue[1]

img_batch, lbl_batch = tf.train.batch([sliced_img, sliced_lbl], batch_size=3)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(0, 3):  # number of batches
        image_batch, label_batch = sess.run([img_batch, lbl_batch])
        print(image_batch, label_batch)

    coord.request_stop()
    coord.join(threads)
The output would be something like this:
[4,1,8] [4,1,8]
[2,3,7] [2,3,7]
[2,6,8] [2,6,8]
I modified the code from https://github.com/tensorflow/models/blob/master/research/slim/slim_walkthrough.ipynb and bodokaiser's answer from the above post. Please note that this is from the evaluation script in https://github.com/tensorflow/models/tree/master/research/slim, eval_image_classifier.py. The most important modification to the eval_image_classifier.py code is adding num_epochs=1 to the DatasetDataProvider line. That way, all the images are accessed exactly once for inference.
provider = slim.dataset_data_provider.DatasetDataProvider(
    dataset,
    shuffle=False,
    common_queue_capacity=2 * FLAGS.batch_size,
    common_queue_min=FLAGS.batch_size, num_epochs=1)
[image, label] = provider.get(['image', 'label'])
images, labels = tf.train.batch(
    [image, label],
    batch_size=FLAGS.batch_size,
    num_threads=FLAGS.num_preprocessing_threads,
    capacity=1 * FLAGS.batch_size)

with tf.Session() as sess:
    sess.run([tf.local_variables_initializer(),
              tf.global_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            np_image, np_label = sess.run([images, labels])
    except:
        coord.request_stop()
    coord.join(threads)
