tensorflow batch normalization leads to high test loss - machine-learning

I've been trying use batch normalization in tensorflow for while with no success.
The training loss converges nicely (better than without the BN), but the test loss remains high throughout the training. I'm using batch size of 1 but the problem still happens with bigger batch size.
what I'm currently doing is:
inputs = tf.layers.batch_normalization(
inputs=inputs, axis=1 if data_format == 'channels_first' else 3,
momentum== 0.997, epsilon=1e-5, center=True,
scale=True, training=is_training, fused=True)
inputs = tf.nn.relu(inputs)
is_training is tf.placeholder that I assign to True during training and False during testing, and for the training op I do this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_op = optimizer.minimize(loss, global_step=batch)
I've tried "tf.contrib.layers.batch_norm" too, and several other implementations of BN that I found online. but nothing works.. I always get the same problem.
I know that the beta and gamma variables are being updated during training.
But I also noticed that tf.get_collection(tf.GraphKeys.MOVING_AVERAGE_VARIABLES) is an empty collection, which is weird.
Have anyone seen and solved this problem before?
I can't think of any more things to try.
Note: I know the problem is with the BN because without it the test loss converges with the training less as expected.

Related

What is the difference between these backward training methods in Pytorch?

I am a 3-month DL freshman who is doing small NLP projects with Pytorch.
Recently I am trying to reappear a GAN network introduced by a paper, using my own text data, to generate some specific kinds of question sentences.
Here is some background... If you have no time or interest about it, just kindly read the following question is OK.
As that paper says, the generator is firstly trained normally with normal question data to make that the output at least looks like a real question. Then by using an auxiliary classifier's result (of classifying the outputs), the generator is trained again to just generate the specific (several unique categories) questions.
However, as the paper do not reveal its code, I have to do the code all myself. I have these three training thoughts, but I do not know their differences, could you kindly tell me about it?
If they have almost the same effect, could you tell me which is more recommended in Pytorch's grammar? Thank you very much!
Suppose the discriminator loss to generator is loss_G_D, the classifier loss to generator is loss_G_C, and loss_G_D and loss_G_C has the same shape, i.e. [batch_size, loss value], then what is the difference?
1.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_C = loss_func2(classifier(generated_data))
loss = loss_G+loss_C
loss.backward()
optimizer.step()
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_D.backward()
loss_G_C = loss_func2(classifier(generated_data))
loss_G_C.backward()
optimizer.step()
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_D.backward()
optimizer.step()
optimizer.zero_grad()
loss_G_C = loss_func2(classifier(generated_data))
loss_G_C.backward()
optimizer.step()
Additional info: I observed that the classifier's classification loss is always very big compared with generator's loss, like -300 vs 3. So maybe the third one is better?
First of all:
loss.backward() backpropagates the error and assigns a gradient for every parameter along the way that has requires_grad=True.
optimizer.step() updates the model parameters using their stored gradients
optimizer.zero_grad() sets the gradients to 0, so that you can backpropagate your loss and update your model parameters for each batch without interfering with other batches.
1 and 2 are quite similar, but if your model uses batch statistics or you have an adaptive optimizer they will probably perform differently. However, for instance, if your model doesn't use batch statistics and you have a plain old SGD optimizer, they will produce the same result, even though 1 would be faster since you do the backprop only once.
3 is a completely different case, since you update your model parameters with loss_G_D.backward() and optimizer.step() before processing and backpropagating loss_G_C.
Given all of these, it's up to you which one to choose depending on your application.

deep neural network model stops learning after one epoch

I am training a unsupervised NN model and for some reason, after exactly one epoch (80 steps), model stops learning.
]
Do you have any idea why it might happen and what should I do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problems is min f(x) ==> loss, now in my DNN loss = f(x). I have 64 input, 64 output, 3 layers in between :
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network.
with more layers and nodes(doubles: 8 layers each 200 nodes), I can get a little more progress toward lower error, but again after 100 steps training error becomes flat!
The symptom is that the training loss stops being improved relatively early. Suppose that your problem is learnable at all, there are many reasons for the for this behavior. Following are most relavant:
Improper preprocessing of input: Neural network prefers input with
zero mean. E.g., if the input is all positive, it will restrict the
weights to be updated in the same direction, which may not be
desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling to make unit standard deviation in each channel may also be helpful.
Generalization ability of the network: The network is not complicated
or deep enough for the task.
This is very easy to check. You can train the network on just a few
images (says from 3 to 10). The network should be able to overfit the
data and drives the loss to almost 0. If it is not the case, you may
have to add more layers such as using more than 1 Dense layer.
Another good idea is to used pre-trained weights (in applications of Keras documentation). You may adjust the Dense layers at the top to fit with your problem.
Improper weight initialization. Improper weight initialization can
prevent the network from converging (https://youtu.be/gYpoJMlgyXA,
the same video as before).
For the ReLU activation, you may want to use He initialization
instead of the default Glorot initialiation. I find that this may be
necessary sometimes but not always.
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, deep-viz-keras. They are very useful to open the blackbox of convolutional networks.
I faced the same problem then I followed the following:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!

Neural Turing Machine Loss Going to NaN

Update: This question is outdated and was asked for a pre 1.0 version of tensorflow. Do not refer to answers or suggest new ones.
I'm using the tf.nn.sigmoid_cross_entropy_with_logits function for the loss and it's going to NaN.
I'm already using gradient clipping, one place where tensor division is performed, I've added an epsilon to prevent division by zero, and the arguments to all softmax functions have an epsilon added to them as well.
Yet, I'm getting NaN's mid-way through training.
Are there any known issues where TensorFlow does this that I have missed?
It's quite frustrating because the loss is randomly going to NaN during training and ruining everything.
Also, how could I go about detecting if the training step will result in NaN and maybe skip that example altogether? Any suggestions?
EDIT: The network is a Neural Turing Machine.
EDIT 2: Here's the code for gradient clipping:
optimizer = tf.train.AdamOptimizer(self.lr)
gvs = optimizer.compute_gradients(loss)
capped_gvs =\
[(tf.clip_by_value(grad, -1.0, 1.0), var) if grad != None else (grad, var) for grad, var in gvs]
train_step = optimizer.apply_gradients(capped_gvs)
I had to add the if grad != None condition because I was getting an error without it. Could the problem be here?
Potential Solution: I'm using tf.contrib.losses.sigmoid_cross_entropy for a while now, and so far the loss hasn't diverged. I will test some more and report back.
Use 1e-4 for the learning rate. That one always seems to work for me with the Adam optimizer. Even if you gradient clip it can still diverge. Also another sneaky one is taking a square root since although it will be stable for all positive inputs its gradient diverges as the value approaches zero. Finally I would check and make sure all inputs to the model are reasonable.
I know it has been a while since this was asked, but I'd like to add another solution that helped me, on top of clipping. I found that, if I increase the batch size, the loss tends to not go close to 0, and doesn't end up (as of yet) going to NaN. Hope this helps anyone that finds this!
In my case, the NaN values were a result of NaN in the training datasets , while I was working on multiclass classifier , the problem was a dataframe positional filter on the [ one hot encoding ] labels.
Resolving the the target dataset resolved my issue - hope this help someone else.
Best of luck.
for me i added epsilon to parameters inside a log function.
i no longer see the errors but i noticed a moderate increase in the model training accuracy.

Interpretation of a learning curve in machine learning

While following the Coursera-Machine Learning class, I wanted to test what I learned on another dataset and plot the learning curve for different algorithms.
I (quite randomly) chose the Online News Popularity Data Set, and tried to apply a linear regression to it.
Note : I'm aware it's probably a bad choice but I wanted to start with linear reg to see later how other models would fit better.
I trained a linear regression and plotted the following learning curve :
This result is particularly surprising for me, so I have questions about it :
Is this curve even remotely possible or is my code necessarily flawed?
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
If it is not, any hint to where I made a mistake?
Here's my code (Octave / Matlab) just in case:
Plot :
lambda = 0;
startPoint = 5000;
stepSize = 500;
[error_train, error_val] = ...
learningCurve([ones(mTrain, 1) X_train], y_train, ...
[ones(size(X_val, 1), 1) X_val], y_val, ...
lambda, startPoint, stepSize);
plot(error_train(:,1),error_train(:,2),error_val(:,1),error_val(:,2))
title('Learning curve for linear regression')
legend('Train', 'Cross Validation')
xlabel('Number of training examples')
ylabel('Error')
Learning curve :
S = ['Reg with '];
for i = startPoint:stepSize:m
temp_X = X(1:i,:);
temp_y = y(1:i);
% Initialize Theta
initial_theta = zeros(size(X, 2), 1);
% Create "short hand" for the cost function to be minimized
costFunction = #(t) linearRegCostFunction(X, y, t, lambda);
% Now, costFunction is a function that takes in only one argument
options = optimset('MaxIter', 50, 'GradObj', 'on');
% Minimize using fmincg
theta = fmincg(costFunction, initial_theta, options);
[J, grad] = linearRegCostFunction(temp_X, temp_y, theta, 0);
error_train = [error_train; [i J]];
[J, grad] = linearRegCostFunction(Xval, yval, theta, 0);
error_val = [error_val; [i J]];
fprintf('%s %6i examples \r', S, i);
fflush(stdout);
end
Edit : if I shuffle the whole dataset before splitting train/validation and doing the learning curve, I have very different results, like the 3 following :
Note : the training set size is always around 24k examples, and validation set around 8k examples.
Is this curve even remotely possible or is my code necessarily flawed?
It's possible, but not very likely. You might be picking the hard to predict instances for the training set and the easy ones for the test set all the time. Make sure you shuffle your data, and use 10 fold cross validation.
Even if you do all this, it is still possible for it to happen, without necessarily indicating a problem in the methodology or the implementation.
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
Let's assume that your data can only be properly fitted by a 3rd degree polynomial, and you're using linear regression. This means that the more data you add, the more obviously it will be that your model is inadequate (higher training error). Now, if you choose few instances for the test set, the error will be smaller, because linear vs 3rd degree might not show a big difference for too few test instances for this particular problem.
For example, if you do some regression on 2D points, and you always pick 2 points for your test set, you will always have 0 error for linear regression. An extreme example, but you get the idea.
How big is your test set?
Also, make sure that your test set remains constant throughout the plotting of the learning curves. Only the train set should increase.
If it is not, any hint to where I made a mistake?
Your test set might not be large enough or your train and test sets might not be properly randomized. You should shuffle the data and use 10 fold cross validation.
You might want to also try to find other research regarding that data set. What results are other people getting?
Regarding the update
That makes a bit more sense, I think. Test error is generally higher now. However, those errors look huge to me. Probably the most important information this gives you is that linear regression is very bad at fitting this data.
Once more, I suggest you do 10 fold cross validation for learning curves. Think of it as averaging all of your current plots into one. Also shuffle the data before running the process.

Backpropogation neural network - error not converging

I am using backpropogation algorithm for my model. It works perfectly fine a simple xor case and when I tested it for a smaller subset of my actual data.
There are 3 inputs in total and a single output(0,1,2)
I have split the data set into training set (80% amounting to approx 5.5k) and the rest 20% as validation data.
I use trainingRate and momentum for calculating the delta weights.
I have normalized the input as below
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(input_array)
I use 1 hidden layer with sigmoid and linear activation functions for input-hidden and hidden-output respectively.
I train with trainingRate = 0.0005, momentum = 0.6, Epochs = 100,000. Any higher trainingRate shoots up the error to Nan. momentum values between 0.5 and 0.9 works fine and any other value makes the error Nan.
I tried various number of nodes in the hidden layer such as 3,6,9,10 and the error converged to 4140.327574 in each case. I am not sure how to reduce this. Changing the activation functions doesn't help. I even tried adding another hidden layer with gaussian activation function but I cannot reduce the error whatsoever.
Is it because of the outliers? Do i need to clean those values from the training data?
Any suggestion would be of great help be it the activation function, hidden layers, etc. I had been trying to get this working for quite some time and I am sort of stuck now.
Well I'm having kind of a similar problem, still haven fixed it, but I can tell you a couple of things I have found. I think the net is overfitting, my error at some point goes down and then starts going up again, also the verification set... is this you case also?
Check if you are implementing well the "early stopping" algorithm, most of the times the problem is not the backpropagation, but the error analysis or the validation analysis.
Hope this helps!

Resources