Tensorflow: different fetches bring different results

I am playing around with Tensorflow and there is one thing I can't seem to understand: the role of the fetches argument of Session.run().
From the doc:
To fetch the outputs of operations, execute the graph with a run()
call on the Session object and pass in the tensors to retrieve.
What I understood is that the parameter is used to read certain values back from the run step. I see, though, that different fetches lead to different results:
...
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(preactivations_output, tf_train_labels))
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
...
train_prediction = tf.nn.softmax(preactivations_output)
...
# This produces a model with 89% accuracy
_, l, predictions = session.run(
    [optimizer, loss, train_prediction], feed_dict=feed_dict)
# This produces a model with 10% accuracy
l, predictions = session.run(
    [loss, train_prediction], feed_dict=feed_dict)
So now I wonder what else the fetches parameter is used for, given that it seems to influence the way the training is run. I didn't really understand it from the documentation...
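For a self-contained illustration of why the two fetch lists behave differently (a minimal sketch, TF 1.x graph API assumed, not taken from the question): Session.run() executes only the ops needed to produce the requested fetches, so including the optimizer op triggers a parameter update while omitting it does not:
import tensorflow as tf  # TF 1.x graph API assumed

x = tf.Variable(1.0)
loss = tf.square(x - 3.0)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Fetching the training op executes it, so the variable is updated.
    sess.run([train_op, loss])
    print(sess.run(x))  # 1.4: one gradient step was applied
    # Fetching only the loss evaluates it without running the update.
    sess.run([loss])
    print(sess.run(x))  # still 1.4: no further update happened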

Related

Tensorflow low train/test accuracy

I restored a pre-trained model in Tensorflow 1.2 to do some testing work. I assumed the model was well-trained since the loss decreased to a very low value (0.0001). However, with either the testing samples or the training samples, the accuracy op gives me a value that is almost 0. Is this because I'm using the wrong accuracy function, or because the model itself is the problem?
Here is the accuracy function, the test_image below is a batch with a single test sample, test_image_label is a single label:
correct_prediction = tf.equal(tf.argmax(GoogleNet(test_image), 1), tf.argmax(test_image_label, 0))
accuracy = tf.cast(correct_prediction, tf.float32)
with tf.Session() as sess:
    accuracy_vector = []
    for num in range(len(testnames)):
        accuracy_vector.append(sess.run(accuracy, feed_dict={keep_prob: 1.0}))
    print(accuracy_vector)
    mean_accuracy = sess.run(tf.divide(tf.add_n(accuracy_vector), len(testnames)))
    print("test accuracy %g" % mean_accuracy)
The model is defined as GoogleNet(data) above; it is a function that returns the logits of the input batch. The training was done like this:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=train_label_batch, logits=GoogleNet(train_batch)))
train_step = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(cost, global_step=global_step)
The train_step is run in every iteration. I think it is worth noting that after restoring the model, I cannot run print(GoogleNet(test_image).eval(feed_dict={keep_prob: 1.0})) in the session, with which I intended to take a look at the output of the model. It returns the error FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Variable_213
[[Node: Variable_213/read = Identity[T=DT_FLOAT, _class=["loc:#Variable_213"], _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_213)]]
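For reference, here is a generic batched-accuracy sketch (the placeholder names and shapes are illustrative, not taken from the question): predictions and one-hot labels are compared along axis 1 and the 0/1 outcomes are averaged over the batch:
import numpy as np
import tensorflow as tf  # TF 1.x graph API assumed

num_classes = 10
logits = tf.placeholder(tf.float32, shape=[None, num_classes])   # model outputs
labels = tf.placeholder(tf.float32, shape=[None, num_classes])   # one-hot labels

correct = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))          # mean over the batch

with tf.Session() as sess:
    logit_batch = np.random.randn(32, num_classes).astype(np.float32)
    label_batch = np.eye(num_classes, dtype=np.float32)[np.random.randint(num_classes, size=32)]
    print(sess.run(accuracy, feed_dict={logits: logit_batch, labels: label_batch}))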

How the generator is trained with the output of the discriminator in Generative Adversarial Networks

Recently I have learned about Generative Adversarial Networks.
For training the Generator, I am somewhat confused about how it learns. Here is an implementation of a GAN:
# train generator
z = Variable(xp.random.uniform(-1, 1, (batchsize, nz), dtype=np.float32))
x = gen(z)
yl = dis(x)
L_gen = F.softmax_cross_entropy(yl, Variable(xp.zeros(batchsize, dtype=np.int32)))
L_dis = F.softmax_cross_entropy(yl, Variable(xp.ones(batchsize, dtype=np.int32)))
# train discriminator
x2 = Variable(cuda.to_gpu(x2))
yl2 = dis(x2)
L_dis += F.softmax_cross_entropy(yl2, Variable(xp.zeros(batchsize, dtype=np.int32)))
#print "forward done"
o_gen.zero_grads()
L_gen.backward()
o_gen.update()
o_dis.zero_grads()
L_dis.backward()
o_dis.update()
So it computes a loss for the Generator as it is mentioned in the paper.
However, it calls the Generator backward function based on the Discriminator output. The discriminator output is just a number (not an array).
But we know that in general, for training a network, we compute a loss function in the last layer (a loss between the last layer's output and the real output) and then we compute the gradients. So for example, if the output is 64*64, then we compare it with a 64*64 image, compute the loss, and do the backpropagation.
However, in the code that I see for Generative Adversarial Networks, they compute a loss for the Generator from the discriminator output (which is just a number) and then call backpropagation for the Generator. The Generator's last layer is, for example, 64*64 pixels, but the discriminator loss is 1*1 (which is different from the usual networks). So I do not understand how this causes the Generator to be learned and trained.
I thought that if we attached the two networks (the Generator and the Discriminator), called backpropagation, and only updated the Generator's parameters, it would make sense and it should work. But what I see in the code is totally different.
So I am asking: how is this possible?
Thanks
You say 'However, it calls the Generator backward function based on the Discriminator output. The discriminator output is just a number (not an array)', but the loss is always a scalar value. When we compute the mean squared error of two images, it is also a scalar value.
L_adversarial = E[log(D(x))] + E[log(1 − D(G(z)))]
x is drawn from the real data distribution
z is sampled from the latent distribution and transformed by the Generator
Coming back to your actual question: the Discriminator network has a sigmoid activation function in the last layer, which means it outputs values in the range [0,1]. The Discriminator tries to maximize this loss by maximizing both terms that are added in the loss function. The maximum value of the first term is 0 and occurs when D(x) is 1, and the maximum value of the second term is also 0 and occurs when 1 − D(G(z)) is 1, which means D(G(z)) is 0. So the Discriminator performs a binary classification by maximizing this loss function: it tries to output 1 when it is fed x (real data) and 0 when it is fed G(z) (generated fake data).
The Generator, on the other hand, tries to minimize this loss; in other words, it tries to fool the Discriminator by generating fake samples that are similar to real samples. Over time both the Generator and the Discriminator get better and better. This is the intuition behind GANs.
The code below is in PyTorch:
bce_loss = nn.BCELoss() # bce_loss = -y*log(y_hat) - (1-y)*log(1-y_hat)  [similar to L_adversarial]
Discriminator = ..... # some network
Generator = ..... # some network
optimizer_generator = ....... # some optimizer for the generator network
optimizer_discriminator = ....... # some optimizer for the discriminator network
z = ...... # some latent data distribution that is transformed by the generator
real = ..... # real data distribution
#####################
# Update Discriminator
#####################
optimizer_discriminator.zero_grad()            # clear gradients accumulated by earlier steps
fake = Generator(z)
fake_prediction = Discriminator(fake.detach()) # detach so this loss does not backprop into the generator
real_prediction = Discriminator(real)
# real samples should be classified as 1, fake samples as 0
discriminator_loss = bce_loss(fake_prediction, torch.zeros(batch_size)) + bce_loss(real_prediction, torch.ones(batch_size))
discriminator_loss.backward()
optimizer_discriminator.step()
#################
# Update Generator
#################
optimizer_generator.zero_grad()
fake = Generator(z)
fake_prediction = Discriminator(fake)
# the generator is rewarded when the discriminator outputs 1 for its fake samples
generator_loss = bce_loss(fake_prediction, torch.ones(batch_size))
generator_loss.backward()
optimizer_generator.step()
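To make the point about scalar losses concrete, here is a small self-contained PyTorch sketch (toy layer sizes, hypothetical networks, not from the answer above) showing that calling backward() on a single scalar still produces a full-sized gradient for every generator parameter, because the chain rule propagates that scalar back through the discriminator and into the generator:
import torch
import torch.nn as nn

# Toy generator and discriminator with hypothetical sizes, just to illustrate gradient flow.
generator = nn.Sequential(nn.Linear(16, 64 * 64), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(64 * 64, 1), nn.Sigmoid())

z = torch.randn(8, 16)                    # batch of latent vectors
fake = generator(z)                       # batch of 64*64 "images" (flattened)
prediction = discriminator(fake)          # batch of single numbers in [0, 1]

bce = nn.BCELoss()
loss = bce(prediction, torch.ones(8, 1))  # a single scalar

loss.backward()
# The scalar loss is propagated back through the discriminator into the
# generator, so every generator parameter now has a full-sized gradient:
for name, p in generator.named_parameters():
    print(name, p.grad.shape)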

Implementing minibatch in input fn

I'm currently experimenting with the new high-level tf.contrib.learn APIs in Tensorflow by porting over features from the inception-v3 retrain script from this tutorial:
How do I replicate the per-iteration minibatch sampling for both the training and validation inputs, as seen in the original retrain.py?
Currently I am trying to use tf.train.shuffle_batch in the input_fn function, but I'm not certain that it works.
Here is a code snippet for clarity:
def train_input_fn():
    # Get a batch of input bottleneck values, either calculated fresh every
    # time with distortions applied, or from the cache stored on disk.
    if do_distort_images:
        (train_bottleneck_outputs, train_ground_truths) = get_random_distorted_bottlenecks(
            sess, image_lists, -1, 'training',
            FLAGS.image_dir, distorted_jpeg_data_tensor,
            distorted_image_tensor, resized_image_tensor, bottleneck_tensor)
    else:
        (train_bottleneck_outputs, train_ground_truths, _) = get_random_cached_bottlenecks(
            sess, image_lists, -1, 'training',
            FLAGS.bottleneck_dir, FLAGS.image_dir, jpeg_data_tensor,
            bottleneck_tensor)
    return tf.train.shuffle_batch([tf.constant(train_bottleneck_outputs),
                                   tf.constant(train_ground_truths)],
                                  batch_size=FLAGS.train_batch_size, capacity=1100,
                                  min_after_dequeue=1000, enqueue_many=True, num_threads=2)
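Not from the original thread, but for comparison, here is a minimal sketch of one common way to get shuffled minibatches in an input_fn when the bottleneck values and labels are already available as NumPy arrays. It relies on tf.estimator.inputs.numpy_input_fn, which is available in later TF 1.x releases; the array names and sizes here are illustrative assumptions:
import numpy as np
import tensorflow as tf  # later TF 1.x release assumed

# Hypothetical precomputed bottleneck values and integer labels as NumPy arrays.
bottlenecks = np.random.rand(1000, 2048).astype(np.float32)
labels = np.random.randint(0, 5, size=1000).astype(np.int32)

# numpy_input_fn shuffles the arrays and emits a fresh minibatch on every training step.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'bottlenecks': bottlenecks},
    y=labels,
    batch_size=100,
    num_epochs=None,   # cycle indefinitely; the training loop decides when to stop
    shuffle=True)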

How to handle gradients when training two sub-graphs simultaneously

The general idea I am trying to realize is a seq2seq-model (taken from the translate.py-example in the models, based on the seq2seq-class). This trains well.
Furthermore I am using the hidden state of the rnn after all the encoding is done, right before decoding starts (I call it the “hidden state at end of encoding”). I feed this hidden state at end of encoding into a further sub-graph which I call “prices” (see below). The training gradients of this sub-graph back-propagate not only through this additional sub-graph, but also back into the encoder part of the rnn (which is what I want and need).
The plan is to add more such sub-graphs to the hidden state at end of encoding, as I want to analyze the input phrases in a variety of ways.
Now during training, when I evaluate and train both sub-graphs (encoder+prices AND encoder+decoder) at the same time, the net does NOT converge. However, if I execute the training in the following way (pseudo-code):
if global_step % 10 == 0:
    execute-the-price-training_code
else:
    execute-the-decoder-training_code
So I am not training both sub-graphs simultaneously. Now it does converge, but the encoder+decoder-part converges MUCH slower than if I ONLY train this part and never train the prices-sub-graph.
My question is: I should be able to train both sub-graphs simultaneously. But probably I have to rescale the gradients flowing back into the hidden state at end of encoding, where we get gradients from both the prices sub-graph AND the decoder sub-graph. How should this rescaling be done? I didn't find any papers describing such an undertaking, but maybe I am searching with the wrong keywords.
Here is the training-part of the code:
This is the (almost original) training-op-preparation:
if not forward_only:
    self.gradient_norms = []
    self.updates = []
    opt = tf.train.AdadeltaOptimizer(self.learning_rate)
    for bucket_id in xrange(len(buckets)):
        tf.scalar_summary("seq2seq loss", self.losses[bucket_id])
        gradients = tf.gradients(self.losses[bucket_id], var_list_seq2seq)
        clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
        self.gradient_norms.append(norm)
        self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_seq2seq), global_step=self.global_step))
Now, additionally, I am running a second sub-graph that takes the hidden state at end of encoding as input:
with tf.name_scope('prices') as scope:
    # First layer
    W_price_first_layer = tf.Variable(tf.random_normal([num_layers*size, self.prices_hidden_layer_size], stddev=0.35), name="W_price_first_layer")
    B_price_first_layer = tf.Variable(tf.zeros([self.prices_hidden_layer_size]), name="B_price_first_layer")
    self.output_price_first_layer = tf.add(tf.matmul(self.hidden_state, W_price_first_layer), B_price_first_layer)
    self.activation_price_first_layer = tf.nn.sigmoid(self.output_price_first_layer)
    #self.activation_price_first_layer = tf.nn.relu(self.output_price_first_layer)

    # Second layer to softmax (price ranges)
    W_price = tf.Variable(tf.random_normal([self.prices_hidden_layer_size, self.prices_bit_size], stddev=0.35), name="W_price")
    W_price_t = tf.transpose(W_price)
    B_price = tf.Variable(tf.zeros([self.prices_bit_size]), name="B_price")
    self.output_price_second_layer = tf.add(tf.matmul(self.activation_price_first_layer, W_price), B_price)
    self.price_prediction = tf.nn.softmax(self.output_price_second_layer)
    self.label_price = tf.placeholder(tf.int32, shape=[self.batch_size], name="price_label")

    # Remember the prices trainables
    var_list_prices = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "prices")
    var_list_all = tf.trainable_variables()

    # Backprop
    self.loss_price = tf.nn.sparse_softmax_cross_entropy_with_logits(self.output_price_second_layer, self.label_price)
    self.loss_price_scalar = tf.reduce_mean(self.loss_price)
    self.optimizer_price = tf.train.AdadeltaOptimizer(self.learning_rate_prices)
    self.training_op_price = self.optimizer_price.minimize(self.loss_price, var_list=var_list_all)
Thx a bunch
I expect that running two optimizers simultaneously will lead to inconsistent gradient updates on the common variables, and this might be causing your training not to converge.
Instead, if you add the scalar loss from each sub-network to the "losses collection" (e.g. via tf.contrib.losses.add_loss() or tf.add_to_collection(tf.GraphKeys.LOSSES, ...)), you can use tf.contrib.losses.get_total_loss() to get a single loss value that can be passed to a single standard TensorFlow tf.train.Optimizer subclass. TensorFlow will derive the appropriate back-prop computation for your split network.
The get_total_loss() method simply computes an unweighted sum of the values that have been added to the losses collection. I'm not familiar with the literature on how or if you should scale these values, but you can use any arbitrary (differentiable) TensorFlow expression to combine the losses and pass the result to a single optimizer.
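As a minimal sketch of the combined-loss idea (the weight alpha and the loss variable names are illustrative assumptions, not from the answer): add the two scalar losses, optionally rescaling one of them, and hand the result to a single optimizer, so the gradients that both sub-graphs send into the shared encoder state are accumulated in one back-prop pass:
import tensorflow as tf  # graph API, matching the question's TF version

# seq2seq_loss and price_loss stand in for the two scalar losses defined above.
alpha = 0.1  # illustrative weight controlling how strongly the prices sub-graph pulls on the encoder
total_loss = seq2seq_loss + alpha * price_loss

optimizer = tf.train.AdadeltaOptimizer(learning_rate)
# A single minimize() call differentiates total_loss with respect to all trainable
# variables, summing the contributions from both sub-graphs on the shared encoder.
train_op = optimizer.minimize(total_loss)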

How do I log or view the cost used in training a TensorFlow neural network with dropout?

How do I see the accuracy and cost for the dropouts actually used in training a TensorFlow neural network with dropout?
As expected, each time I run a summary, e.g. with
train_writer.add_summary(sess.run(merged, feed_dict=foo), step)
or
print(sess.run(accuracy, feed_dict=foo))
if the network includes dropout, and foo feeds a “keep probability” other than 1.0, I will get different values, so that, for example, I get a different loss or accuracy each time — e.g., three immediately successive computations of accuracy with
print(sess.run(accuracy, feed_dict=foo))
print(sess.run(accuracy, feed_dict=foo))
print(sess.run(accuracy, feed_dict=foo))
might give something like
75.808
75.646
75.770
Though these are roughly the same, they are not exactly the same, presumably because each time I evaluate, the network drops out different nodes. A consequence of this must be that I don’t ever see the cost actually encountered in training.
How do I log or view the cost (or other summary values computed using the network) actually used in training a TensorFlow neural network with dropout?
And where is the problem? You should get three different values if you call a stochastic network three times. When you log the loss from within the training step, you are logging the one that was actually used during training. Basically you can just read out the value from your computed graph, like:
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))  # define the op once, outside the loop
for i in range(100):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    # Fetching cross_entropy together with train_step reuses the same dropout mask.
    _, loss_val = sess.run([train_step, cross_entropy],
                           feed_dict={x: batch_xs, y_: batch_ys})
    print('loss = ' + str(loss_val))
which will print the loss that was computed during the training step (it is not computed twice, so the dropout mask is not resampled).
If instead you want to see "what would the accuracy on the training set be if I stopped learning now", you need an eval graph https://www.tensorflow.org/versions/r0.8/tutorials/mnist/tf/index.html#evaluate-the-model , which tells your network that it is time to switch the dropout units from stochastic behaviour to scaling/averaging the results.
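One common way to get such a deterministic evaluation path within the same graph is to make the keep probability a placeholder and feed 1.0 at evaluation time. A minimal sketch in the style of the MNIST tutorials (layer sizes are illustrative):
import tensorflow as tf  # TF 1.x graph API assumed

x = tf.placeholder(tf.float32, [None, 784])
keep_prob = tf.placeholder(tf.float32)               # dropout keep probability

W = tf.Variable(tf.truncated_normal([784, 128], stddev=0.1))
b = tf.Variable(tf.zeros([128]))
hidden = tf.nn.relu(tf.matmul(x, W) + b)
dropped = tf.nn.dropout(hidden, keep_prob)           # stochastic when keep_prob < 1.0
W2 = tf.Variable(tf.truncated_normal([128, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
logits = tf.matmul(dropped, W2) + b2

# Training runs feed keep_prob=0.5, so the logged loss reflects the sampled dropout mask.
# Evaluation runs feed keep_prob=1.0, so repeated evaluations return identical values.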
