Wrong results in TensorFlow due to GPU memory sensitivity

Consider the following simple VGG training code in TensorFlow:
import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.contrib.slim.nets import vgg
# random input for training
images = tf.random_normal(shape=(32, 224, 224, 3))
labels = tf.cast(1000 * tf.random_uniform(shape=(32,)), dtype=tf.int32)
# init graph & cost
logits, _ = vgg.vgg_19(images)
cost = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
cost = tf.reduce_mean(cost, name='cross_entropy_loss')
# create training_op
optimizer = tf.train.MomentumOptimizer(1e-5, 0.9)
train_op = slim.learning.create_train_op(cost, optimizer)
# start training
slim.learning.train(train_op, logdir='logs/bugreport')
On several of my GPUs (Tesla K40 as well as Tesla K80) the loss returns NaN pretty much right at the beginning of training (consistently on every run):
InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values
[[Node: train_op/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan", _device="/job:localhost/replica:0/task:0/gpu:0"](cross_entropy_loss)]]
[[Node: train_op/control_dependency/_337 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_487_train_op/control_dependency", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Interestingly, the situation can be resolved if GPU memory consumption is constrained to some fixed fraction, e.g. by replacing the last line with
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.9)
slim.learning.train(train_op, logdir='logs/bugreport',
                    session_config=tf.ConfigProto(gpu_options=gpu_options))
in which case training runs smoothly. However, increasing the memory fraction to 0.95 brings back the above error. I can reproduce this bug on some other GPUs (again K40 or K80), albeit with different thresholds (on one card the memory fraction cannot go beyond 0.7), but not all GPUs are affected. Even worse, I have another example (evaluating a fixed set of 1000 images on VGG) for which the reported accuracy depends on the batch size if no memory constraint is applied.
The variation between GPUs suggests that GPU memory is to blame, although I never had such problems in Theano. Interestingly, the ECC log of the affected GPUs shows a varying number of memory errors, ranging from zero to 500 bits. At this point I am a bit lost as to how to further analyse and debug this problem. In particular, if the memory of the GPUs is faulty, then I'd need harder proof in order to ask my supplier for a replacement.
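For the ECC angle, a minimal sketch along these lines (assuming nvidia-smi is on the PATH; the output layout varies between driver versions) can dump the driver's error counters as evidence:
import subprocess

def dump_ecc_counters():
    # '-q -d ECC' prints the volatile and aggregate ECC error counts per GPU.
    out = subprocess.check_output(['nvidia-smi', '-q', '-d', 'ECC']).decode()
    for line in out.splitlines():
        # Keep only the lines that identify the device or report error counts.
        if 'Product Name' in line or 'Error' in line:
            print(line.rstrip())

if __name__ == '__main__':
    dump_ecc_counters()
A persistently growing aggregate error count, especially double-bit (uncorrectable) errors, would be much stronger evidence of faulty memory than a one-off NaN.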

Related

CheckNumerics finds NaNs in "dense_1/kernel/read:0" after training MDN for a while

I am training a mixture density network and after a while (57 epochs) I get an error about NaN values from tf.add_check_numerics_ops()
The error message is:
dense_1/kernel/read:0 : Tensor had NaN values
[[Node: CheckNumerics_9 = CheckNumerics[T=DT_FLOAT, message="dense_1/kernel/read:0", _device="/job:localhost/replica:0/task:0/gpu:0"](dense_1/kernel/read, ^CheckNumerics_8)]]
If I check the weights of my dense_1 layer using layer.get_weights(), I can see that none of them are NaN.
When I try sess.run([graph.get_tensor_by_name('dense_1/kernel/read:0')], feed_dict=stuff), I get an array the size of my weights that is just NaNs.
I don't really understand what the read operation is doing, is there some sort of caching that is having issues?
Details of the network:
(I've tried many combinations of these and they all eventually find NaNs although at different epochs.)
3 hidden layers, 32, 16, 32
non-linearity = selu, but I've also tried tanh, relu and elu
gradient clipping
dropout
happens with or without batchnorm
validation error is still improving when I get NaNs
input: 128 dimensions
output: mixture of 3 beta distributions in each of 64 dimensions
occurs with or without adversarial examples
I clip by value to [eps, 1 - eps] with eps = 1e-7
I use the logsumexp trick for numerical stability
most of the relevant code can be found here:
https://gist.github.com/MarvinT/29bbeda2aecee17858e329745881cc7c
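For illustration, the clipping and logsumexp points above typically look something like the sketch below (generic TensorFlow code with hypothetical shapes, not the implementation from the gist):
import tensorflow as tf

eps = 1e-7

def clipped(p):
    # Keep probabilities strictly inside (0, 1) before taking logs,
    # matching the [eps, 1 - eps] clipping described above.
    return tf.clip_by_value(p, eps, 1.0 - eps)

def mixture_log_likelihood(mixture_logits, component_log_probs):
    # mixture_logits: [batch, n_components] unnormalized mixture weights
    # component_log_probs: [batch, n_components] per-component log densities
    log_weights = tf.nn.log_softmax(mixture_logits)
    # reduce_logsumexp combines the components in log space, which avoids
    # exponentiating large negative log densities into underflowing zeros.
    return tf.reduce_logsumexp(log_weights + component_log_probs, axis=-1)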
Caused by this unsolved bug in tensorflow:
https://github.com/tensorflow/tensorflow/issues/2288
I still don't know where the NaN is getting into my gradient though...

tensorflow conv2d memory consumption explain?

output = tf.nn.conv2d(input, weights, strides = [1,3,3,1], padding = 'VALID')
My input has shape 200x225x225x1 and the weights have shape 15x15x1x64. Hence, the output has shape 200x71x71x64, since (225-15)/3 + 1 = 71.
TensorBoard shows that this operation consumes 768MB in total (see pic below). Assuming it accounts for the sizes of the input (38.6MB), the weights (0.06MB) and the output (246.2MB), the total memory consumption should not exceed 300MB. So where does the rest of the memory consumption come from?
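For reference, the questioner's tensor sizes can be reproduced with a quick back-of-the-envelope check (assuming float32, i.e. 4 bytes per element, and 1MB = 2^20 bytes):
import numpy as np

def size_mb(shape, bytes_per_elem=4):
    # Total size of a dense float32 tensor of the given shape, in MB.
    return np.prod(shape) * bytes_per_elem / 2**20

print(size_mb((200, 225, 225, 1)))   # input   ~ 38.6 MB
print(size_mb((15, 15, 1, 64)))      # weights ~ 0.05 MB
print(size_mb((200, 71, 71, 64)))    # output  ~ 246.1 MB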
Although I'm not able to reproduce your graph and values from the information provided, it's possible that you're seeing additional memory usage due to intermediate values materialized during the computation of Conv2D. It's also possible that the instrumentation is incorrect (e.g. reshape operations that do not result in a copy of tensor memory end up duplicating the "memory usage" in the TF node stats instrumentation). Without a reproducible test case it's hard to say more. If you do feel this is a bug in TensorFlow, please raise an issue on GitHub!

Using RNN to recover sine wave from noisy signal

I am involved with an application that needs to estimate the state of a certain system in real time by measuring a set of (non-linearly) dependent parameters. Up until now the application was using an extended Kalman filter, but it was found to be underperforming in certain circumstances, which is likely caused by the fact that the differences between the real system and its model used in the filter are too significant to be modeled as white noise. We cannot use a more precise model for a number of unrelated reasons.
We decided to try recurrent neural networks for the task. Since my experience with neural networks is quite limited, before tackling the real task itself, I decided to practice with a hand crafted problem first. That problem, however, I could not solve, so I'm asking for help here.
Here's what I did: I generated some sine waveforms of varying phase, frequency, amplitude, and offset. Then I distorted the waveforms with some white noise and (unsuccessfully) attempted to train an LSTM network to recover my waveforms from the noisy signal. I expected that the network would eventually learn to fit a sine waveform to the noisy data set.
Here's the source (slightly abridged, but it should work):
#!/usr/bin/env python3
import time
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.layers.wrappers import TimeDistributed
from keras.objectives import mean_absolute_error, cosine_proximity

POINTS_PER_WF = int(1e4)
X_SPACE = np.linspace(0, 100, POINTS_PER_WF)

def make_waveform_with_noise():
    def add_noise(vec):
        stdev = float(np.random.uniform(0.01, 0.2))
        return vec + np.random.normal(0, stdev, size=len(vec))
    # Random phase (sin/cos), frequency, amplitude and offset.
    f = np.random.choice((np.sin, np.cos))
    wf = f(X_SPACE * np.random.normal(scale=5)) *\
        np.random.normal(scale=5) + np.random.normal(scale=50)
    return wf, add_noise(wf)

RESCALING = 1e-3
BATCH_SHAPE = (1, POINTS_PER_WF, 1)

model = Sequential([
    TimeDistributed(Dense(5, activation='tanh'), batch_input_shape=BATCH_SHAPE),
    LSTM(20, activation='tanh', inner_activation='sigmoid', return_sequences=True),
    LSTM(20, activation='tanh', inner_activation='sigmoid', return_sequences=True),
    TimeDistributed(Dense(1, activation='tanh'))
])

def compute_loss(y_true, y_pred):
    # Only the second half of each sequence contributes to the loss.
    skip_first = POINTS_PER_WF // 2
    y_true = y_true[:, skip_first:, :] * RESCALING
    y_pred = y_pred[:, skip_first:, :] * RESCALING
    me = mean_absolute_error(y_true, y_pred)
    cp = cosine_proximity(y_true, y_pred)
    return me + cp

model.summary()
model.compile(optimizer='adam', loss=compute_loss,
              metrics=['mae', 'cosine_proximity'])

NUM_ITERATIONS = 30000
for iteration in range(NUM_ITERATIONS):
    # Train on one freshly generated noisy/clean waveform pair per step.
    wf, noisy_wf = make_waveform_with_noise()
    y = wf.reshape(BATCH_SHAPE) * RESCALING
    x = noisy_wf.reshape(BATCH_SHAPE) * RESCALING
    info = model.train_on_batch(x, y)

model.save_weights('final.hdf5')
The first dense layer is actually useless; I added it only to make sure I could successfully combine LSTM and time-distributed dense layers, since my real application will likely need that setup.
The error function was modified a number of times. Initially I used plain mean squared error, but training was extremely slow and mostly converged to simply copying the noisy input signal to the output. The cosine proximity metric I added later essentially measures the similarity between the shapes of the two functions; it seemed to speed up learning quite a bit. Also note that I apply the loss only to the last half of each sequence; the motivation is that I expected the network would need to see a few periods of the signal before it could correctly identify the parameters of the waveform. However, this modification turned out to have no visible effect on the network's performance.
The latest version of the script uses the Adam optimizer. I also experimented with RMSProp with varying learning rate and decay settings, but found no noticeable difference in the behaviour of the network.
I am using the Theano 0.9 (dev) backend configured to use 64-bit floating point in order to prevent possible numerical stability issues; the epsilon value is set to 1e-14 accordingly.
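For completeness, one way to apply that precision configuration programmatically is the following minimal sketch using the Keras backend API (the same values can also be set in ~/.keras/keras.json):
from keras import backend as K

# Switch to 64-bit floats and a tiny fuzz factor before building the model.
K.set_floatx('float64')
K.set_epsilon(1e-14)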
This is what the output looks like after 15k-30k training steps (performance stops improving from about 15k steps onwards; the first plot is zoomed in for the sake of clarity):
Plot legend:
blue (0) - noisy signal, input of the RNN
green (1) - recovered signal, output of the RNN
red (2) - ground truth
My question is: what am I doing wrong?

Why are my TensorFlow network weights and costs NaN when I use RELU activations?

I can't get TensorFlow RELU activations (neither tf.nn.relu nor tf.nn.relu6) working without NaN values for activations and weights killing my training runs.
I believe I'm following all the right general advice. For example I initialize my weights with
weights = tf.Variable(tf.truncated_normal(w_dims, stddev=0.1))
biases = tf.Variable(tf.constant(0.1 if neuron_fn in [tf.nn.relu, tf.nn.relu6] else 0.0, shape=b_dims))
and use a slow training rate, e.g.,
tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
But any network of appreciable depth results in NaN for the cost and at least some of the weights (at least in their summary histograms). In fact, the cost is often NaN right from the start (before training).
I seem to have these issues even when I use L2 (about 0.001) regularization, and dropout (about 50%).
Is there some parameter or setting that I should adjust to avoid these issues? I'm at a loss as to where to even begin looking, so any suggestions would be appreciated!
Following He et al. (as suggested in lejlot's comment), initializing the weights of the l-th layer to a zero-mean Gaussian distribution with standard deviation sqrt(2 / n_l), where n_l is the flattened length of the input vector, i.e.
stddev=np.sqrt(2 / np.prod(input_tensor.get_shape().as_list()[1:]))
results in weights that generally do not diverge.
If you use a softmax classifier at the top of your network, try to make the initial weights of the layer just below the softmax very small (e.g. std=1e-4). This makes the initial distribution of outputs of the network very soft (high temperature), and helps ensure that the first few steps of your optimization are not too large and numerically unstable.
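In terms of the variable initialization shown earlier in the question, that might look like the following sketch (n_hidden and n_classes are placeholders):
import tensorflow as tf

n_hidden, n_classes = 128, 10  # placeholder layer sizes

# Layer feeding the softmax: deliberately tiny initial weights, zero biases.
softmax_weights = tf.Variable(tf.truncated_normal([n_hidden, n_classes], stddev=1e-4))
softmax_biases = tf.Variable(tf.zeros([n_classes]))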
Have you tried gradient clipping and/or a smaller learning rate?
Basically, you will need to process your gradients before applying them, as follows (from tf docs, mostly):
# Replace this with what follows
# opt = tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
# Create an optimizer.
opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.5)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(cross_entropy_loss, tf.trainable_variables())
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(tf.clip_by_value(gv[0], -5., 5.), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients; run this op to train.
train_op = opt.apply_gradients(capped_grads_and_vars)
Also, the discussion in this question might help.

LS-SVM training : Out of memory

I am trying to train an LS-SVM classifier on a dataset of the following size:
Training dataset: TS = 48000x12 (double)
Groups: G = 48000x1 (double)
The MATLAB training code is:
class = svmtrain(TS,G,'method','LS',...
'kernel_function','rbf','boxconstraint',C,'rbf_sigma',sigma);
Then, I got this error message:
Error using svmtrain (line 516)
Error evaluating kernel function 'rbf_kernel'.
Caused by:
Error using repmat
Out of memory. Type HELP MEMORY for your options.
Note that the machine has 4GB of physical memory, and training works when I decrease the size of the training set. Is there a solution that keeps the same data size, without, of course, adding physical memory?
It seems that the implementation requires computing the whole Gram matrix, which has size N x N (where N is the number of samples); in your case that is 48000^2 = 2,304,000,000 entries. Each entry is represented as a 32-bit float, i.e. at least 4 bytes, which gives 9,216,000,000 bytes, roughly 9GB of data just for the Gram (kernel) matrix.
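A quick check of that arithmetic (assuming 4 bytes per entry):
n_samples = 48000
gram_entries = n_samples ** 2        # 2,304,000,000 kernel entries
gram_bytes = gram_entries * 4        # 9,216,000,000 bytes
print(gram_bytes / 1e9)              # ~9.2 GB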
There are two options:
Find an implementation which, for the RBF kernel, does not compute the kernel (Gram) matrix, but instead uses a callable to compute kernel values on the fly.
You can try some kind of LS-SVM approximation, like Fast Sparse Approximation of Least Squares Support Vector Machines: http://homes.cs.washington.edu/~lfb/software/FSALS-SVM.htm (a rough Python sketch of the general approximation idea is given below).
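As an illustration of the second option, here is a sketch in Python with scikit-learn (not the FSALS-SVM code linked above): a kernel approximation plus a least-squares linear classifier keeps memory linear in the number of samples. The mapping of rbf_sigma and boxconstraint to gamma and alpha below is approximate.
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# Placeholder data matching the shapes in the question (TS: 48000x12, G: 48000x1).
TS = np.random.randn(48000, 12)
G = np.random.randint(0, 2, size=48000)
sigma, C = 1.0, 1.0  # stand-ins for the question's rbf_sigma and boxconstraint

# Nystroem builds an N x m feature map (m << N landmarks), so the full
# N x N Gram matrix is never materialized.
model = make_pipeline(
    Nystroem(kernel='rbf', gamma=1.0 / (2 * sigma ** 2), n_components=500),
    RidgeClassifier(alpha=1.0 / C),  # squared loss on +/-1 labels, LS-SVM-like
)
model.fit(TS, G)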

Resources