I've noticed that NaNs are frequently introduced during training.
Often they seem to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.
Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?
The overarching question here is simply: What is the most common reason for NaNs to occur during training? And secondly, what are some methods for combatting this (and why do they work)?
I came across this phenomenon several times. Here are my observations:
Gradient blow up
Reason: large gradients throw the learning process off-track.
What you should expect: Look at the loss values per iteration in the runtime log. You'll notice that the loss starts to grow significantly from iteration to iteration. Eventually the loss becomes too large to be represented by a floating-point variable and turns into nan.
What can you do: Decrease the base_lr (in the solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
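To make the symptom concrete, here is a minimal sketch on a toy problem, not Caffe code. The function name `run_gradient_descent` and the loss f(w) = w**2 are assumptions chosen purely for illustration. With a step size that is too large the loss grows every iteration, overflows to inf, and then turns into nan, while a step size an order of magnitude smaller converges.

```python
import math

def run_gradient_descent(lr, steps, w0=1.0):
    """Minimize the toy loss f(w) = w**2 and return the loss at every iteration."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(w * w)   # loss at the current weight
        grad = 2.0 * w         # df/dw
        w = w - lr * grad      # plain gradient descent update
    return losses

stable = run_gradient_descent(lr=0.1, steps=1100)
unstable = run_gradient_descent(lr=1.5, steps=1100)

# Index of the first iteration whose loss is no longer a finite number.
first_bad = next(i for i, loss in enumerate(unstable) if not math.isfinite(loss))
print("small lr, final loss:", stable[-1])
print("large lr, first non-finite loss at iteration", first_bad)
print("large lr, final loss:", unstable[-1])
```

The same pattern (growth, then inf, then nan) is what to look for in the per-iteration loss values of the training log.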
Bad learning rate policy and params
Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead. This invalid rate multiplies all updates and thus invalidates all parameters.
What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:
... sgd_solver.cpp:106] Iteration 0, lr = -nan
What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and you forget to define max_iter parameter, you'll end up with lr = nan...
For more information about learning rate in caffe, see this thread.
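As a rough illustration of the "poly" case, the sketch below evaluates the polynomial decay formula base_lr * (1 - iter/max_iter) ^ power in plain Python. It assumes that an undefined max_iter is read as 0, and the helper `ieee_divide` is a hypothetical stand-in for C++ floating-point division (Python would raise an exception instead of returning nan).

```python
import math

def ieee_divide(a, b):
    """Divide like C/C++ floats do: x/0 yields inf or nan instead of raising."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan                  # 0/0 is nan
    return math.copysign(math.inf, a)    # x/0 is +/-inf

def poly_lr(base_lr, iteration, max_iter, power):
    """Polynomial decay: base_lr * (1 - iter/max_iter) ** power."""
    frac = ieee_divide(float(iteration), float(max_iter))
    return base_lr * (1.0 - frac) ** power

ok = poly_lr(base_lr=0.01, iteration=0, max_iter=10000, power=0.5)
bad = poly_lr(base_lr=0.01, iteration=0, max_iter=0, power=0.5)  # max_iter missing
print("lr with max_iter set:  ", ok)
print("lr with max_iter unset:", bad)
```

At iteration 0 the fraction is 0/0, so the learning rate is already nan before the first update.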
Faulty Loss function
Reason: Sometimes the computation of the loss in the loss layers causes nans to appear. For example, feeding an InfogainLoss layer with non-normalized values, using a custom loss layer with bugs, etc.
What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: See if you can reproduce the error, add printout to the loss layer and debug the error.
For example: Once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all - the loss computed produced nans. In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
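The frequency-normalization failure described above boils down to a 0/0 for any label that is absent from the batch. The sketch below is not the original loss; it is a hypothetical `frequency_normalized_loss` that shows one alternative guard (skipping absent classes) rather than the larger-batch workaround used above.

```python
def frequency_normalized_loss(penalties, labels, num_classes):
    """Average per-sample penalties within each class, then across classes.

    Classes that do not occur in the batch are skipped, so the division
    by a zero count (the source of the nan) never happens.
    """
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for penalty, label in zip(penalties, labels):
        sums[label] += penalty
        counts[label] += 1
    per_class = [s / c for s, c in zip(sums, counts) if c > 0]
    return sum(per_class) / len(per_class) if per_class else 0.0

# Class 1 never appears in this batch; an unguarded version would compute 0/0 for it.
loss = frequency_normalized_loss(
    penalties=[0.5, 1.5, 2.0], labels=[0, 0, 2], num_classes=3
)
print(loss)
```

Whether skipping absent classes is acceptable depends on the loss being implemented; it changes the weighting between batches.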
Faulty input
Reason: you have an input with nan in it!
What you should expect: once the learning process "hits" this faulty input - output becomes nan. Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
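Before building a dummy net, it can also be worth scanning the raw data directly. The following is a minimal sketch that assumes each sample can be flattened into a sequence of numbers; the name `find_bad_samples` is made up for this example and reading from lmdb/leveldb/hdf5 is left out.

```python
import math

def find_bad_samples(samples):
    """Return the indices of samples that contain nan or +/-inf."""
    return [
        i for i, sample in enumerate(samples)
        if not all(math.isfinite(v) for v in sample)
    ]

dataset = [
    [0.1, 0.2, 0.3],
    [0.4, float("nan"), 0.6],   # faulty sample
    [0.7, 0.8, float("inf")],   # faulty sample
]
print(find_bad_samples(dataset))
```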
stride larger than kernel size in "Pooling" layer
For some reason, choosing stride > kernel_size for pooling may result in nans. For example:
layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}
results in nans in y.
Instabilities in "BatchNorm"
It was reported that under some settings "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.
Recently, I became aware of debug_info flag: setting debug_info: true in 'solver.prototxt' will make caffe print to log more debug information (including gradient magnitudes and activation values) during training: This information can help in spotting gradient blowups and other problems in the training process.
In my case, not setting the bias in the convolution/deconvolution layers was the cause.
Solution: add the following to the convolution layer parameters.
bias_filler {
  type: "constant"
  value: 0
}
This answer is not about a cause for nans, but rather proposes a way to help debug it.
You can have this python layer:
import caffe
import numpy as np

class checkFiniteLayer(caffe.Layer):
    def setup(self, bottom, top):
        # param_str from the prototxt is used as a prefix in the error messages
        self.prefix = self.param_str

    def reshape(self, bottom, top):
        pass  # in-place layer: nothing to reshape

    def forward(self, bottom, top):
        for i in range(len(bottom)):
            # count the non-finite (nan/inf) entries of this bottom blob
            isbad = np.sum(1 - np.isfinite(bottom[i].data[...]))
            if isbad > 0:
                raise Exception("checkFiniteLayer: %s forward pass bottom %d has %.2f%% non-finite elements" %
                                (self.prefix, i, 100 * float(isbad) / bottom[i].count))

    def backward(self, top, propagate_down, bottom):
        for i in range(len(top)):
            if not propagate_down[i]:
                continue
            # count the non-finite entries of the incoming gradient
            isf = np.sum(1 - np.isfinite(top[i].diff[...]))
            if isf > 0:
                raise Exception("checkFiniteLayer: %s backward pass top %d has %.2f%% non-finite elements" %
                                (self.prefix, i, 100 * float(isf) / top[i].count))
Adding this layer into your train_val.prototxt at certain points you suspect may cause trouble:
layer {
  type: "Python"
  name: "check_loss"
  bottom: "fc2"
  top: "fc2"  # "in-place" layer
  python_param {
    module: "/path/to/python/file/check_finite_layer.py"  # must be in $PYTHONPATH
    layer: "checkFiniteLayer"
    param_str: "prefix-check_loss"  # string for printouts
  }
}
The learning_rate is too high and should be decreased.
The accuracy in my RNN code was nan; selecting a lower value for the learning rate fixed it.
One more solution for anyone stuck like I just was:
I was receiving nan or inf losses on a network I had set up with float16 dtype across the layers and input data. After all else failed, it occurred to me to switch back to float32, and the nan losses were solved!
So bottom line, if you switched dtype to float16, change it back to float32.
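A likely reason float16 is more fragile is its narrow range: the largest finite value is 65504 and very small values round to zero. The sketch below emulates half-precision rounding with the standard library's `struct` format `"e"`; the helper `to_float16` is hypothetical, and mapping out-of-range values to inf is an assumption about how real float16 arithmetic would behave.

```python
import math
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses values beyond the float16 range; treat them as inf here.
        return math.copysign(math.inf, x)

overflowed = to_float16(70000.0)       # beyond the largest float16 value (65504)
underflowed = to_float16(1e-8)         # below the smallest float16 subnormal
nan_result = overflowed - overflowed   # inf - inf

print(overflowed, underflowed, nan_result)
```

An activation of 70000 or a probability of 1e-8 is unremarkable in float32, but in float16 the first becomes inf and the second becomes 0, and follow-up operations such as inf - inf or log(0) then produce nan or inf losses.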
I was trying to build a sparse autoencoder and had several layers in it to induce sparsity. While running my net, I encountered the NaN's. On removing some of the layers (in my case, I actually had to remove 1), I found that the NaN's disappeared. So, I guess too much sparsity may lead to NaN's as well (some 0/0 computations may have been invoked!?)
Related
I am training an unsupervised NN model and for some reason, after exactly one epoch (80 steps), the model stops learning.
Do you have any idea why it might happen and what should I do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problem is min f(x), then in my DNN loss = f(x). I have 64 inputs, 64 outputs, and 3 layers in between:
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network.
With more layers and nodes (roughly doubled: 8 layers of 200 nodes each), I can make a little more progress toward lower error, but again after 100 steps the training error becomes flat!
The symptom is that the training loss stops improving relatively early. Assuming that your problem is learnable at all, there are many possible reasons for this behavior. The following are the most relevant:
Improper preprocessing of input: Neural networks prefer input with
zero mean. E.g., if the input is all positive, it will restrict the
weights to be updated in the same direction, which may not be
desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling to make unit standard deviation in each channel may also be helpful.
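A minimal sketch of that preprocessing, assuming the values of one channel are available as a flat list; the name `standardize` is made up here, and in practice the mean and standard deviation would be computed once on the training set and reused for validation and test data.

```python
import statistics

def standardize(values):
    """Shift values to zero mean and scale them to unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        # A constant channel carries no information; just center it.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

pixels = [0.0, 127.5, 255.0]
scaled = standardize(pixels)
print(scaled)
```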
Generalization ability of the network: The network is not complicated
or deep enough for the task.
This is very easy to check. You can train the network on just a few
images (say, from 3 to 10). The network should be able to overfit the
data and drive the loss to almost 0. If that is not the case, you may
have to add more layers, such as using more than 1 Dense layer.
Another good idea is to use pre-trained weights (see Applications in the Keras documentation). You may adjust the Dense layers at the top to fit your problem.
Improper weight initialization. Improper weight initialization can
prevent the network from converging (https://youtu.be/gYpoJMlgyXA,
the same video as before).
For the ReLU activation, you may want to use He initialization
instead of the default Glorot initialization. I find that this may be
necessary sometimes but not always.
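For reference, the two schemes differ only in the scale of the random weights. The sketch below uses the normal-distribution variants (Glorot: std = sqrt(2 / (fan_in + fan_out)), He: std = sqrt(2 / fan_in)); the function names are assumptions for this example, and in Keras you would pick an initializer through the layer's kernel_initializer argument rather than write this by hand.

```python
import math
import random

def glorot_std(fan_in, fan_out):
    """Standard deviation used by Glorot (Xavier) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation used by He normal initialization (suited to ReLU)."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, seed=0):
    """Draw a fan_in x fan_out weight matrix from N(0, std**2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

weights = init_weights(64, 64, he_std(64))
print("Glorot std:", glorot_std(64, 64))
print("He std:    ", he_std(64))
```

For a square layer, He initialization draws weights that are larger by a factor of sqrt(2), which compensates for ReLU zeroing out roughly half of the activations.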
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, deep-viz-keras. They are very useful to open the blackbox of convolutional networks.
I faced the same problem then I followed the following:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!
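The label conversion described above can be sketched as follows, assuming two classes and plain Python lists; `one_hot_to_indices` is a hypothetical helper name.

```python
def one_hot_to_indices(one_hot_labels):
    """Convert one-hot rows such as [0, 1] into class indices such as 1."""
    return [row.index(max(row)) for row in one_hot_labels]

labels = one_hot_to_indices([[0, 1], [1, 0], [1, 0]])
print(labels)
```

With two classes the resulting vector contains only 0s and 1s, which is the target shape a single-output binary crossentropy setup expects.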
I am developing a model using linear regression to predict age. I know that the age is between 0 and 100 and is always a positive value. I used a 1x1 convolution in the last layer to predict the real value. Do I need to add a ReLU function after the output of the 1x1 convolution to guarantee that the predicted value is positive? Currently, I have not added a ReLU, and some predicted values become negative, like -0.02 or -0.4…
There's no compelling reason to use an activation function for the output layer; typically you just want to use a reasonable/suitable loss function directly with the penultimate layer's output. Specifically, a RELU doesn't solve your problem (or at most only solves 'half' of it) since it can still predict above 100. In this case -predicting a continuous outcome- there's a few standard loss functions like squared error or L1-norm.
If you really want to use an activation function for this final layer and are concerned about always predicting within a bounded interval, you could always try scaling up the sigmoid function (to between 0 and 100). However, there's nothing special about sigmoid here - any bounded function, ex. any CDF of a signed, continuous random variable, could be similarly used. Though for optimization, something easily differentiable is important.
Why not start with something simple like squared-error loss? It's always possible to just 'clamp' out-of-range predictions to within [0-100] (we can give this a fancy name like 'doubly RELU') when you need to actually make predictions (as opposed to during training/testing), but if you're getting lots of such errors, the model might have more fundamental problems.
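The clamping step is a one-liner applied only at prediction time. A minimal sketch, with `clamp_age` as a made-up name:

```python
def clamp_age(prediction, low=0.0, high=100.0):
    """Clamp a raw regression output into the valid age range [low, high]."""
    return max(low, min(high, prediction))

raw_predictions = [-0.4, 37.5, 104.2]
print([clamp_age(p) for p in raw_predictions])
```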
Even for a regression problem, it can be good (for optimisation) to use a sigmoid layer before the output (giving a prediction in the [0:1] range) followed by a denormalization (here if you think maximum age is 100, just multiply by 100)
This tip is explained in this fast.ai course.
I personally think these lessons are excellent.
You should use a sigmoid activation function, and then normalize the target outputs to the [0, 1] range. This solves both issues: the output is positive and it is bounded.
You can easily then denormalize the neural network outputs to get an output in the [0, 100] range.
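A small sketch of the normalize/denormalize round trip, assuming a maximum age of 100; the names `MAX_AGE`, `normalize_target` and `predict_age` are made up for the illustration, and the raw value stands in for the output of the last linear layer.

```python
import math

MAX_AGE = 100.0  # assumed upper bound on the target

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def normalize_target(age):
    """Map an age in [0, MAX_AGE] to the [0, 1] range used during training."""
    return age / MAX_AGE

def predict_age(raw_output):
    """Squash the raw network output with a sigmoid, then scale back to years."""
    return sigmoid(raw_output) * MAX_AGE

print(normalize_target(25.0))   # training target for a 25-year-old
print(predict_age(0.0))         # a raw output of 0 maps to the middle of the range
print(predict_age(-1000.0), predict_age(1000.0))
```

Whatever the raw output is, the prediction can never leave the [0, 100] interval.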
I am training a mixture density network and after a while (57 epochs) I get an error about NaN values from tf.add_check_numerics_ops()
The error message is:
dense_1/kernel/read:0 : Tensor had NaN values
[[Node: CheckNumerics_9 = CheckNumerics[T=DT_FLOAT, message="dense_1/kernel/read:0", _device="/job:localhost/replica:0/task:0/gpu:0"](dense_1/kernel/read, ^CheckNumerics_8)]]
If I check the weights using layer.get_weights() of my dense_1 I can see that they are all not NaN.
When I try a sess.run([graph.get_tensor_by_name('dense_1/kernel/read:0')], feed_dict=stuff) I get an array the size of my weights that is just NaNs.
I don't really understand what the read operation is doing, is there some sort of caching that is having issues?
Details of the network:
(I've tried many combinations of these and they all eventually find NaNs although at different epochs.)
3 hidden layers, 32, 16, 32
non-linearity = selu, but I've also tried tanh, relu and elu
gradient clipping
dropout
happens with or without batchnorm
validation error is still improving when I get NaNs
input: 128 dimensions
output: mixture of 3 beta distributions in each of 64 dimensions
occurs with or without adversarial examples
I use eps=1e-7 to clip by value [eps and 1-eps]
I use the logsumexp trick for numerical stability
most of the relevant code can be found here:
https://gist.github.com/MarvinT/29bbeda2aecee17858e329745881cc7c
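For readers unfamiliar with the logsumexp trick mentioned in the list above, here is a generic standalone sketch (not taken from the linked gist): subtracting the maximum before exponentiating keeps every exponent at or below zero, so the sum cannot overflow.

```python
import math

def logsumexp(values):
    """Compute log(sum(exp(v) for v in values)) without overflow."""
    m = max(values)
    if math.isinf(m):
        return m  # all -inf, or at least one +inf
    return m + math.log(sum(math.exp(v - m) for v in values))

# A naive math.log(math.exp(1000.0) + math.exp(1000.0)) would overflow.
large = logsumexp([1000.0, 1000.0])
small = logsumexp([-1000.0, -1000.0])
print(large, small)
```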
Caused by this unsolved bug in tensorflow:
https://github.com/tensorflow/tensorflow/issues/2288
I still don't know where the NaN is getting into my gradient though...
I ran caffe and got this output:
who can tell me what is the problem?
I will really appreciate!!
It seems like one (or more) of your label values are invalid, see this PR for information:
If you have an invalid ground truth label, "SoftmaxWithLoss" will silently access invalid memory [...] The old check only worked in DEBUG mode and also only worked for CPU.
Make sure your prediction vector length matches the number of labels you try to predict.
From your comments, it seems like you have labels in the range 0..10575, but on the other hand, your classification layer, "fc7", only predicts probabilities for 1000 classes. Thus, the "SoftmaxWithLoss" layer tries to compute the loss for a label l >= 1000 and accesses memory outside the probability array, resulting in a segmentation fault.
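A quick sanity check over the label file catches this before training starts. This is a sketch with a made-up helper name, `find_invalid_labels`, and assumes the labels have already been loaded as integers.

```python
def find_invalid_labels(labels, num_classes):
    """Return (index, label) pairs whose label is outside [0, num_classes)."""
    return [
        (i, label) for i, label in enumerate(labels)
        if not 0 <= label < num_classes
    ]

labels = [0, 999, 1000, 10575, -1]
print(find_invalid_labels(labels, num_classes=1000))
```

Any non-empty result means the output layer is too small for the label set (or the labels are corrupted).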
I have implemented a neural network (using CUDA) with 2 layers. (2 Neurons per layer).
I'm trying to make it learn 2 simple quadratic polynomial functions using backpropagation.
But instead of converging, it is diverging (the output is becoming infinity).
Here are some more details about what I've tried:
I had set the initial weights to 0, but since it was diverging I have randomized the initial weights
I read that a neural network might diverge if the learning rate is too high so I reduced the learning rate to 0.000001
The two functions I am trying to get it to add are: 3 * i + 7 * j+9 and j*j + i*i + 24 (I am giving the layer i and j as input)
I had implemented it as a single layer previously and that could approximate the polynomial functions better
I am thinking of implementing momentum in this network but I'm not sure it would help it learn
I am using a linear (as in no) activation function
There is oscillation in the beginning but the output starts diverging the moment any of weights become greater than 1
I have checked and rechecked my code but there doesn't seem to be any kind of issue with it.
So here's my question: what is going wrong here?
Any pointer will be appreciated.
If the problem you are trying to solve is of the classification type, try a 3-layer network (3 is enough according to Kolmogorov). Connections from inputs A and B to hidden node C (C = A*wa + B*wb) represent a line in AB space. That line divides the correct and incorrect half-spaces. The connections from the hidden layer to the output put the hidden layer values in correlation with each other, giving you the desired output.
Depending on your data, error function may look like a hair comb, so implementing momentum should help. Keeping learning rate at 1 proved optimum for me.
Your training sessions will get stuck in local minima every once in a while, so network training will consist of a few subsequent sessions. If session exceeds max iterations or amplitude is too high, or error is obviously high - the session has failed, start another.
At the beginning of each, reinitialize your weights with random (-0.5 - +0.5) values.
It really helps to chart your error descent. You will get that "Aha!" factor.
The most common reason for neural network code to diverge is that the coder has forgotten to put the negative sign in the weight-change expression.
Another reason could be that there is a problem with the error expression used for calculating the gradients.
If neither of these holds, then we need to see the code to answer.
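The effect of the missing negative sign is easy to reproduce on a toy problem. The sketch below is an illustration only (the function `train` and the quadratic loss (w - target)**2 are assumptions, not the asker's code): with the correct sign the weight approaches the target, with the flipped sign it runs away even at a small learning rate.

```python
def train(sign, lr=0.1, steps=50, w0=0.0, target=3.0):
    """Run gradient steps on the loss (w - target)**2.

    sign=-1.0 is the correct update (move against the gradient);
    sign=+1.0 mimics the forgotten negative sign.
    """
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w = w + sign * lr * grad
    return w

correct = train(sign=-1.0)
flipped = train(sign=+1.0)
print("correct sign:", correct)
print("flipped sign:", flipped)
```

Lowering the learning rate does not rescue the flipped version; it only slows the divergence down, which matches the symptom of divergence persisting at very small learning rates.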