Update: This question is outdated and was asked for a pre-1.0 version of TensorFlow. Do not refer to the answers below or suggest new ones.
I'm using the tf.nn.sigmoid_cross_entropy_with_logits function for the loss and it's going to NaN.
I'm already using gradient clipping; in the one place where tensor division is performed I've added an epsilon to prevent division by zero, and the arguments to all softmax functions have an epsilon added to them as well.
Yet I'm still getting NaNs midway through training.
Are there any known issues with TensorFlow that cause this which I might have missed?
It's quite frustrating because the loss is randomly going to NaN during training and ruining everything.
Also, how could I go about detecting if the training step will result in NaN and maybe skip that example altogether? Any suggestions?
EDIT: The network is a Neural Turing Machine.
EDIT 2: Here's the code for gradient clipping:
optimizer = tf.train.AdamOptimizer(self.lr)
gvs = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -1.0, 1.0), var) if grad != None else (grad, var)
              for grad, var in gvs]
train_step = optimizer.apply_gradients(capped_gvs)
I had to add the if grad != None condition because I was getting an error without it. Could the problem be here?
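To make the second question above concrete, this is roughly what I have in mind for skipping bad batches (just a sketch; sess, loss, train_step and feed stand for my session, loss tensor, training op and feed_dict):
import numpy as np

loss_val = sess.run(loss, feed_dict=feed)    # evaluate the loss first
if np.isfinite(loss_val):
    sess.run(train_step, feed_dict=feed)     # only apply the update if the loss is finite
else:
    print("Skipping this batch: loss is NaN/Inf")
# Note: this runs the forward pass twice, fine for debugging but wasteful in general.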
Potential solution: I've been using tf.contrib.losses.sigmoid_cross_entropy for a while now, and so far the loss hasn't diverged. I will test some more and report back.
Use 1e-4 for the learning rate; that one always seems to work for me with the Adam optimizer. Even if you clip gradients, the loss can still diverge. Another sneaky one is taking a square root: although it is stable for all positive inputs, its gradient diverges as the value approaches zero. Finally, I would check and make sure all inputs to the model are reasonable.
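For the square-root case, the fix is the same epsilon trick (a sketch in TF 1.x style; x here is just an illustrative placeholder):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None])   # stand-in for whatever tensor you take the root of
eps = 1e-8                               # the exact value is a judgment call

# d/dx sqrt(x) = 1 / (2 * sqrt(x)) blows up as x -> 0, so keep the argument away from zero:
safe_sqrt = tf.sqrt(x + eps)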
I know it has been a while since this was asked, but I'd like to add another solution that helped me, on top of clipping. I found that if I increase the batch size, the loss tends not to get as close to 0 and (so far) doesn't end up going to NaN. Hope this helps anyone who finds this!
In my case, the NaN values were a result of NaNs in the training dataset. While I was working on a multiclass classifier, the problem was a DataFrame positional filter on the one-hot-encoded labels.
Fixing the target dataset resolved my issue; a quick check that would have caught it is sketched below. Hope this helps someone else.
Best of luck.
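A sketch of that sanity check (the DataFrame here is a made-up stand-in for the real training data):
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [0.1, np.nan, 0.3], "label": [0, 1, 2]})   # toy example

print(df.isna().sum())               # per-column count of missing values
df = df.dropna()                     # or fix the filter that produced the NaNs
assert not df.isna().any().any()     # fail fast if anything is still missing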
For me, I added an epsilon to the arguments inside a log function.
I no longer see the errors, but I noticed a moderate increase in the model's training accuracy.
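Concretely, something along these lines (a sketch in TF 1.x style; p is just an illustrative placeholder for the tensor being logged):
import tensorflow as tf

p = tf.placeholder(tf.float32, [None])   # e.g. a predicted probability
eps = 1e-8

# log(0) = -inf, which quickly turns into NaN in the gradients, so clamp the argument:
safe_log = tf.log(p + eps)               # tf.math.log in TF 2.x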
Update 1
I updated my lr according to "you want to be 10x back from that point, regardless of slope" and set it to
max_lr=slice(1e-3, 1e-2)
And here is what I got
And the plots
What does this mean?
As you can see in the 2nd graph, the loss was very good starting from 1e-08, but I never set my lr to 1e-08, so why do I see this?
The loss went up and down between 1e-07 and 1e-04, and eventually it soared to almost 0.05 when the lr came back down to around 4e-05. What does this mean? Overfitting? How come the loss looked okay initially, when the learning rate was around the same value (4e-05)?
From the Batches processed / Loss plot, I can see that train_loss and valid_loss tracked each other and looked really good. Does this mean the model was trained well? If it was well trained, why does it shoot up at the end of graph 2?
I have followed the rule for picking the correct lr, so why doesn't it work? May I conclude that lr_find() does not work properly?
Here is my lr_find() plot
Then, according to its graph, I picked the steepest-slope section, 1e-2 to 1e-1, as my lr.
Here is the code:
learn.fit_one_cycle(20, max_lr=slice(1e-2,1e-1))
But here is what I got during training
And here are the plots from learn.recorder:
learn.recorder.plot_lr()
learn.recorder.plot()
learn.recorder.plot_losses()
As you can see, the valid_loss gets worse cyclically. So my conclusion is that the lr_find() method doesn't work properly.
How can I verify it?
If you want to see the entire code, here it is; the only difference is I use to_fp16():
learn = cnn_learner(data, models.resnet50, metrics=error_rate).to_fp16()
https://forums.fast.ai/t/train-loss-and-valid-loss-look-very-good-but-predicting-really-bad/60925
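For reference, the overall workflow looks roughly like this (a sketch; data is the DataBunch built earlier in my code, not shown in this post, and the max_lr values are the ones mentioned above):
from fastai.vision import *   # assuming the usual fastai v1 imports

learn = cnn_learner(data, models.resnet50, metrics=error_rate).to_fp16()

learn.lr_find()
learn.recorder.plot()         # inspect the loss-vs-lr curve

# First attempt: the steepest-slope section from the plot
# learn.fit_one_cycle(20, max_lr=slice(1e-2, 1e-1))

# Update 1: "10x back" from the point where the loss is lowest
learn.fit_one_cycle(20, max_lr=slice(1e-3, 1e-2))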
I'm trying to program a neural network with backpropagation in python.
The output usually converges to 1. To the left of the image there are some delta values. They are very small; should they be larger? Do you know a reason why this convergence could happen?
Sometimes the output goes up in the direction of the target point and then comes back down again.
here is the complete code:
http://pastebin.com/9BiwhWrD the backpropagation code starts at line 146
(The root stuff at line 165 does nothing; I was just trying out some ideas.)
Any ideas of what could be wrong? Have you ever seen a behaviour like this?
Thank you very much.
The reason this happened is that the input data was too large. The sigmoid activation function converges to f(x) = 1 as x -> inf, so I had to normalize the data,
e.g.:
import numpy as np
a = np.array([1, 2, 3, 4, 5], dtype=float)   # float dtype so the in-place division works
a /= a.max()
or prevent generating unnormalized data at all.
Also, the interim value was updated BEFORE the sigmoid was applied, but the derivative of the sigmoid looks like this: y'(x) = y(x) * (1 - y(x)). In my case it was effectively just y'(x) = x * (1 - x), with the raw pre-activation value in place of the sigmoid output.
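A tiny sketch of the correct pairing (the function names here are mine):
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(y):
    # takes the *activated* output y = sigmoid(x), not the raw x
    return y * (1.0 - y)

x = np.array([-2.0, 0.0, 2.0])
y = sigmoid(x)
dy = sigmoid_prime(y)   # correct: y * (1 - y), computed from sigmoid(x)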
There were also errors in how I updated the weights after calculating the deltas. I rewrote the whole loop using a tutorial on neural networks with Python, and then it worked.
It still does not support a bias, but it can do classification. For regression it's not precise enough, but I guess this has to do with the missing bias.
Here is the code:
http://pastebin.com/hRCKe1dK
Someone suggested that I should put my training data into a neural-network framework and see if it works. It didn't, so it was kind of clear that the problem had to do with the data, and that's how I got the idea that it should be between -1 and 1.
In linear regression with 1 variable I can clearly see the prediction line on a plot and check whether it properly fits the training data: I just plot the variable against the output and construct the prediction line from the found values of Theta 0 and Theta 1. So it looks like this:
But how can I check the validity of gradient descent results when it is implemented on multiple variables/features, for example when the number of features is 4 or 5? How do I check that it works correctly and that the found values of all thetas are valid? Do I have to rely only on the cost function plotted against the number of iterations carried out?
Gradient descent converges to a local minimum, meaning that the first derivative (the gradient) should be zero and the second derivative (the Hessian) positive semidefinite. Checking these two quantities will tell you whether the algorithm has converged.
We can think of gradient descent as something solving the problem f'(x) = 0, where f' denotes the gradient of f. To check convergence for this problem, as far as I know, the standard approach is to calculate the discrepancy on each iteration and see whether it converges to 0.
That is, check whether ||f'(x)|| (or its square) converges to 0.
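For example, a plain gradient-descent loop with this stopping test might look like the following (a sketch on a toy quadratic; grad_f stands in for the gradient of your own cost function):
import numpy as np

def grad_f(x):
    return 2.0 * (x - 3.0)            # gradient of f(x) = ||x - 3||^2 (toy example)

x = np.zeros(4)
learning_rate, tol, max_iters = 0.1, 1e-8, 10000

for i in range(max_iters):
    g = grad_f(x)
    if np.linalg.norm(g) < tol:       # ||f'(x)|| has (numerically) converged to 0
        break
    x -= learning_rate * g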
There are some things you can try.
1) Check whether your cost/energy function has stopped improving as the iterations progress. Use something like abs(E_after - E_before) < 0.00001 * E_before, i.e. check whether the relative difference is very low.
2) Check whether your variables have stopped changing. You can adopt a very similar strategy to the one above to check this (see the sketch below).
There is actually no perfect way to fully make sure that your function has converged, but the things mentioned above are what people usually try.
Good luck!
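A sketch of the two checks (the cost function and theta vectors are made-up stand-ins for your own cost and parameter iterates):
import numpy as np

def cost(theta):
    return float(np.sum((theta - 1.0) ** 2))   # toy cost function

theta_old = np.array([0.90, 1.10])
theta_new = np.array([0.99, 1.01])

rel_tol = 1e-5
E_before, E_after = cost(theta_old), cost(theta_new)

cost_converged   = abs(E_after - E_before) < rel_tol * abs(E_before)   # check 1
params_converged = np.linalg.norm(theta_new - theta_old) < rel_tol     # check 2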
I am using the backpropagation algorithm for my model. It works perfectly fine for a simple XOR case and when I tested it on a smaller subset of my actual data.
There are 3 inputs in total and a single output (with classes 0, 1, 2).
I have split the data set into a training set (80%, amounting to approx. 5.5k samples) and the remaining 20% as validation data.
I use trainingRate and momentum for calculating the delta weights.
I have normalized the input as below
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(input_array)
I use 1 hidden layer with sigmoid and linear activation functions for input-hidden and hidden-output respectively.
I train with trainingRate = 0.0005, momentum = 0.6, and Epochs = 100,000. Any higher trainingRate shoots the error up to NaN. Momentum values between 0.5 and 0.9 work fine, and any other value makes the error NaN.
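By trainingRate and momentum I mean roughly the standard delta-weight rule (a sketch; W and grad_W are placeholders for a weight matrix and its backpropagated gradient):
import numpy as np

trainingRate, momentum = 0.0005, 0.6

W = np.random.randn(3, 6) * 0.1      # e.g. 3 inputs -> 6 hidden nodes (illustrative)
grad_W = np.random.randn(3, 6)       # stand-in for the backpropagated gradient
delta_W = np.zeros_like(W)           # persists across updates

# one update step:
delta_W = -trainingRate * grad_W + momentum * delta_W
W += delta_W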
I tried various numbers of nodes in the hidden layer, such as 3, 6, 9, and 10, and the error converged to 4140.327574 in each case. I am not sure how to reduce this. Changing the activation functions doesn't help. I even tried adding another hidden layer with a Gaussian activation function, but I cannot reduce the error whatsoever.
Is it because of outliers? Do I need to clean those values from the training data?
Any suggestion would be of great help, be it about the activation functions, hidden layers, etc. I have been trying to get this working for quite some time and I am sort of stuck now.
Well, I'm having kind of a similar problem and still haven't fixed it, but I can tell you a couple of things I have found. I think the net is overfitting: my error at some point goes down and then starts going up again, and the same happens on the validation set... Is this your case as well?
Check whether you are implementing the "early stopping" algorithm correctly (see the sketch below); most of the time the problem is not the backpropagation, but the error analysis or the validation analysis.
Hope this helps!
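A minimal sketch of the early-stopping loop I mean (train_one_epoch and validate stand in for your own routines):
import random

def train_one_epoch():
    pass                              # stand-in for one pass over the training set

def validate():
    return random.random()            # stand-in for the validation error

max_epochs, patience = 1000, 10
best_val, bad_epochs = float("inf"), 0

for epoch in range(max_epochs):
    train_one_epoch()
    val_error = validate()
    if val_error < best_val:
        best_val, bad_epochs = val_error, 0
        # also keep a copy of the current weights here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                     # validation error stopped improving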
After using OpenCV for boosting, I'm trying to implement my own version of the AdaBoost algorithm (check here, here and the original paper for some references).
While reading all the material, I've come up with some questions regarding the implementation of the algorithm.
1) It is not clear to me how the weights a_t of each weak learner are assigned.
In all the sources I've pointed out the choice is a_t = k * ln( (1-e_t) / e_t ), k being a positive constant and e_t the error rate of the particular weak learner.
On page 7 of this source, it says that this particular value minimizes a certain convex differentiable function, but I really don't understand the passage.
Can anyone please explain it to me?
2) I have some doubts on the procedure of weight update of the training samples.
Clearly it should be done in such a way that they remain a probability distribution. All the references adopt this choice:
D_{t+1}(i) = D_t(i) * exp(-a_t * y_i * h_t(x_i)) / Z_t, where Z_t is a normalization factor chosen so that D_{t+1} is a distribution.
But why is this particular choice of weight update made, multiplicative with the exponential of the error made by the particular weak learner?
Are there any other updates possible? And if yes is there a proof that this update guarantees some kind of optimality of the learning process?
I hope this is the right place to post this question, if not please redirect me!
Thanks in advance for any help you can provide.
1) Your first question:
a_t = k * ln( (1-e_t) / e_t )
Since the error on the training data is bounded by the product of the Z_t(alpha) terms, and Z_t(alpha) is convex w.r.t. alpha, there is only one "global" optimal alpha which minimizes this upper bound on the error. This is the intuition behind how you find the magic alpha.
2) Your 2nd question:
But why is this particular choice of weight update made, multiplicative with the exponential of the error made by the particular weak learner?
To cut it short: choosing alpha as above does indeed improve the accuracy. This is not surprising: you trust more (by giving a larger alpha weight) the learners that work better than the others, and trust less (by giving a smaller alpha) those that work worse. For learners bringing no new knowledge beyond the previous learners, you assign a weight alpha equal to 0.
It is possible to prove that the final boosted hypothesis has training error bounded by
exp(-2 \sum_t (1/2 - \epsilon_t)^2)
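Here is a sketch of one boosting round with this alpha and the weight update above (the labels, predictions and weights are made-up placeholders; labels and predictions are in {-1, +1}):
import numpy as np

y      = np.array([ 1, -1,  1,  1, -1])      # true labels
h_pred = np.array([ 1, -1, -1,  1, -1])      # weak learner's predictions
D      = np.full(5, 1.0 / 5)                 # current sample-weight distribution

e_t = np.sum(D * (h_pred != y))              # weighted error of the weak learner
a_t = 0.5 * np.log((1.0 - e_t) / e_t)        # a_t = k * ln((1 - e_t) / e_t) with k = 1/2

D = D * np.exp(-a_t * y * h_pred)            # D_{t+1}(i) ~ D_t(i) * exp(-a_t * y_i * h_t(x_i))
D /= D.sum()                                 # divide by Z_t so D_{t+1} is a distribution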
3) Your 3rd question:
Are there any other updates possible? And if yes is there a proof that this update guarantees some kind of optimality of the learning process?
This is hard to say. Just remember that the update here improves the accuracy on the training data (at the risk of overfitting), but it is hard to say anything about its generalization.