I am using dice loss for my implementation of a Fully Convolutional Network (FCN) that involves hypernetworks. The model has two inputs and one output, which is a binary segmentation map. The model's weights are updating, but the loss stays constant.
It cannot even overfit on just three training examples.
I have used other loss functions as well, such as dice + binary cross-entropy loss, Jaccard loss, and MSE loss, but the loss stays almost constant.
I have also tried almost every activation function, such as ReLU, LeakyReLU, and Tanh. Moreover, I have to use sigmoid at the output because I need my outputs to be in the range [0, 1].
The learning rate is 0.01. I have tried other learning rates as well: 0.0001, 0.001, 0.1. No matter what loss value the training starts at, it always ends up around this value.
The following shows the gradients for three training examples, and the overall loss:
tensor(0.0010, device='cuda:0')
tensor(0.1377, device='cuda:0')
tensor(0.1582, device='cuda:0')
Epoch 9, Overall loss = 0.9604763123724196, mIOU=0.019766070265581623
tensor(0.0014, device='cuda:0')
tensor(0.0898, device='cuda:0')
tensor(0.0455, device='cuda:0')
Epoch 10, Overall loss = 0.9616242945194244, mIOU=0.01919178702228237
tensor(0.0886, device='cuda:0')
tensor(0.2561, device='cuda:0')
tensor(0.0108, device='cuda:0')
Epoch 11, Overall loss = 0.960331304506822, mIOU=0.01983801422510155
I expect the loss to converge within a few epochs.
What should I do?
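For reference, a standard soft-Dice formulation in PyTorch looks like the sketch below (this is a generic sketch, not necessarily identical to my implementation):

import torch

def soft_dice_loss(pred, target, eps=1e-6):
    # pred: sigmoid outputs in [0, 1]; target: binary masks; both shaped (N, 1, H, W).
    pred = pred.contiguous().view(pred.size(0), -1)
    target = target.contiguous().view(target.size(0), -1)
    intersection = (pred * target).sum(dim=1)
    dice = (2.0 * intersection + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)
    return 1.0 - dice.mean()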
It's not really a question for Stack Overflow. There are a million things which could be wrong, and it's usually not possible to post enough code to let us pinpoint the issue; even if it were, nobody could be bothered to read that much.
That being said, there are some general guidelines which often work for me.
Try reducing the problem. If you replace your network with a single convolutional layer, will it converge? If it does, then apparently something is wrong with your network.
Look at the data as you feed it, as well as the labels (matplotlib plots, etc.). Perhaps you're misaligning input with output (cropping issues, etc.), or your data augmentation is way too strong.
Look for, well..., bugs. Perhaps you're returning torch.sigmoid(x) from your network and then feeding it into torch.nn.functional.binary_cross_entropy_with_logits (effectively applying sigmoid twice, see the snippet below). Maybe your last layer is ReLU and your network just cannot (by construction) output negative values where you would expect them.
Finally, I've personally never had much success training with dice as the primary loss function, so I would definitely try to get it working with cross entropy first, and then move on to dice.
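To make the double-sigmoid point concrete, here is a minimal PyTorch sketch (the single conv layer is a stand-in, not your actual network): feed raw logits to BCEWithLogitsLoss, or apply the sigmoid yourself and use plain BCELoss, but never both.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)      # stand-in network producing logits
x = torch.randn(2, 3, 64, 64)                          # dummy input batch
target = torch.randint(0, 2, (2, 1, 64, 64)).float()   # dummy binary masks

logits = model(x)

# Correct: BCEWithLogitsLoss applies the sigmoid internally, so it expects raw logits.
loss = nn.BCEWithLogitsLoss()(logits, target)

# Also correct: apply the sigmoid yourself and use plain BCELoss.
loss_alt = nn.BCELoss()(torch.sigmoid(logits), target)

# Wrong: passing torch.sigmoid(logits) into BCEWithLogitsLoss squashes the values twice,
# which flattens the gradients and leaves the loss almost constant.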
@Muhammad Hamza Mughal
You have to add at least the code of your forward and train functions for us to pinpoint the issue. @Jatentaki is right, there are so many things that can mess up ML/DL code. I moved to PyTorch from Keras only recently myself, and it took some time to get used to it. But here are the things I'd do:
1) As you're dealing with images, try to pre-process them a bit (rotation, normalization, Gaussian noise, etc.).
2) Zero your optimizer's gradients at the beginning of each batch you fetch, and step the optimizer after you have computed the loss and called loss.backward() (see the sketch at the end of this answer).
3) Add a weight decay term (typically L2) to your optimizer call; for convolutional networks a decay of 5e-4 or 5e-5 is common.
4) Add a learning rate scheduler to your optimizer, to reduce the learning rate if there is no improvement over time.
We can't really write your code for you in an answer; it's up to the practitioner to work out how to fit all of this into their own implementation. Hope this helps.
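That said, purely for orientation, a generic PyTorch training loop covering points 2) to 4) has roughly this shape (the model, data, and hyperparameters below are placeholders, not your actual setup):

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Placeholder model and data; substitute your own network and DataLoader.
model = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid()).to(device)
loader = [(torch.randn(4, 3, 64, 64), torch.rand(4, 1, 64, 64)) for _ in range(3)]

criterion = nn.BCELoss()
# 3) the weight decay (L2) term goes directly into the optimizer call.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
# 4) reduce the learning rate when the loss stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)

for epoch in range(20):
    epoch_loss = 0.0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()            # 2) zero the gradients for each batch
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()                 # 2) step only after loss.backward()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)           # 4) scheduler reacts to the epoch loss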
@MuhammadHamzaMughal, since you are using a sigmoid to generate predictions, have you made sure that the target values in the ground truth / training data / validation data are all in the range [0, 1]?
Normalize the data with min-max normalization so that it is in the [0, 1] range.
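For example, assuming the masks are loaded as NumPy arrays (adapt as needed to your data pipeline), a simple min-max rescaling looks like this:

import numpy as np

def min_max_normalize(arr, eps=1e-8):
    # Linearly rescale values into [0, 1]; eps guards against a constant array.
    arr = arr.astype(np.float32)
    return (arr - arr.min()) / (arr.max() - arr.min() + eps)

mask = np.array([[0, 128, 255], [64, 32, 255]])
print(min_max_normalize(mask))  # all values now lie in [0, 1]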
Related
I'm trying to understand bias and variance better.
I'm wondering if there is a loss function that takes bias and variance into account.
As far as I know, high bias makes a model underfit, and high variance makes it overfit.
If we could take bias and variance into account in the loss, it might look something like bias(x) + variance(x) + some_other_loss(x). My question has two parts:
1) Is there a loss function that takes bias and variance into account?
2) If the losses we normally use already account for bias and variance, how can I measure the bias and the variance separately as scores?
This may be a fundamental mathematical question, I think. If you have any hints, I'd really appreciate them.
Thank you for reading my weird question.
After writing the question, I realized that regularization is one of the ways to reduce variance. That raises a third part: 3) is there then a way to measure the bias as a score?
Thank you again.
Update (Jan 16th, 2022)
I have searched a bit and answered my own question. If I have misunderstood anything, please comment below.
Bias is reflected in the training loss, so we don't need an additional bias loss function.
For variance, however, there is no way to get a score during training, because measuring it would require both the training loss and the loss on unseen data. But once we use the unseen data to compute a training loss, that data is no longer unseen as far as the model is concerned. So, as far as I understand, there is no way to measure variance from the training loss alone.
I hope this helps other people; please comment if you have your own thoughts on it.
As you have clearly stated, high bias means the model is underfitting relative to a good fit, and high variance means it is overfitting relative to a good fit.
Measuring either of them requires you to know the good fit in advance, which happens to be the end goal of training a model. Hence, it is not possible to measure underfitting or overfitting during training itself. However, if you have an idea of a target loss value, you can use an early-stopping callback to stop training around the good fit.
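For example, in Keras an early-stopping callback looks like this (the toy model and random data are purely illustrative):

import numpy as np
import tensorflow as tf

# Toy data and model just to show the callback; substitute your own.
x = np.random.rand(200, 10).astype('float32')
y = np.random.randint(0, 2, size=(200, 1)).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Stop once the validation loss has not improved for 5 epochs,
# and roll back to the best weights seen so far (i.e. stop near the good fit).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])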
I am trying to implement a neural network for classification with 5 hidden layers and softmax cross-entropy in the output layer. The implementation is in Java.
For optimization, I used mini-batch gradient descent (batch size = 100, learning rate = 0.01).
However, after a couple of iterations, the weights become NaN and the predicted values turn out to be the same for every test case.
I am unable to debug the source of this error.
Here is the GitHub link to the code (with the test/training files):
https://github.com/ahana204/NeuralNetworks
In my case, I forgot to normalize the training data (by subtracting the mean). This caused the denominator of my softmax to become 0. Hope this helps.
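On the numerical side, a softmax that subtracts the maximum logit before exponentiating cannot overflow or produce a zero denominator; in Python (rather than your Java code) the idea looks like this:

import numpy as np

def stable_softmax(logits):
    # Softmax is shift-invariant, so subtracting the row-wise max changes nothing
    # mathematically but keeps exp() in a safe numerical range.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

print(stable_softmax(np.array([[1000.0, 1001.0, 1002.0]])))  # no overflow, no NaN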
Assuming the code you implemented is correct, one reason could be a large learning rate. If the learning rate is too large, the weights may not converge and can blow up to very large or very small values, which shows up as NaN. Try lowering the learning rate to see if anything changes.
I've been training a neural network and using Tensorflow.
My cost function is:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
Training my neural network has caused the cross-entropy loss to decrease from ~170k to around 50, a dramatic improvement. Meanwhile, my accuracy has actually gotten slightly worse: from 3% to 2.9%. These tests are made on the training set, so overfitting is out of the question.
I calculate accuracy simply as follows:
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
print('Accuracy:', accuracy.eval({x: batch_x, y: batch_y}))
What could possibly be the cause for this?
Should I use accuracy as the cost function instead, since something is clearly wrong with the cross entropy (softmax) in my case?
I know there is a similar question to this on Stack Overflow, but that question was never answered completely.
I cannot tell the exact reason for the problem without seeing your machine learning problem in more detail.
Please at least provide the type of the problem (binary, multi-class and/or multi-label).
But with such low accuracy and such a large discrepancy between accuracy and loss, I believe it is a bug, not a machine learning issue.
One possible bug might be related to loading the label data y. 3% accuracy is too low for most machine learning problems (except perhaps for image classification); you would get about 3% accuracy by guessing randomly over 33 labels.
Is your problem really a 33-class classification problem? If not, you might have done something wrong when creating batch_y (wrong dimensions, shape mismatch with prediction, ...).
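A quick sanity check along those lines (assuming batch_y from the question is a NumPy array holding one-hot labels) could be:

import numpy as np

print(batch_y.shape)                        # expect (batch_size, num_classes)
print(np.unique(batch_y))                   # one-hot labels should contain only 0 and 1
print(batch_y.sum(axis=1)[:10])             # each one-hot row should sum to exactly 1
print(np.bincount(batch_y.argmax(axis=1)))  # rough class balance across the batch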
I implemented ResNet for CIFAR-10 following this paper: https://arxiv.org/pdf/1512.03385.pdf
But my accuracy differs significantly from the accuracy reported in the paper:
Mine: 86%
The paper's: 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little too generic, but my opinion is that the network is overfitting to the training set: as you can see, the training loss is quite low, but after epoch 50 the validation loss stops improving.
I didn't read the paper in depth, so I don't know how they solved this, but increasing regularization might help. The following link will point you in the right direction: http://cs231n.github.io/neural-networks-3/
Below I have copied the summary from that page:
Summary
To train a Neural Network:
Gradient check your implementation with a small batch of data and be aware of the pitfalls.
As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data.
During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs) to fine (narrower ranges, training for many more epochs).
Form model ensembles for extra performance.
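To make the optimizer and learning-rate-decay items concrete, a minimal PyTorch sketch (placeholder model, illustrative hyperparameters) could be:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model

# One of the two recommended updates: SGD with Nesterov momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# Halve the learning rate after a fixed number of epochs, as suggested above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)
# Call optimizer.step() once per batch and scheduler.step() once per epoch.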
I would argue that the difference in data pre-processing accounts for the difference in performance. The paper uses padding and random crops, which in essence increases the number of training samples and decreases the generalization error. Also, as the previous poster said, you are missing regularization features such as weight decay.
You should take another look at the paper and make sure you implement everything like they did.
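I haven't checked which framework the linked repo uses, but in PyTorch/torchvision the padding + random crop augmentation and the weight decay from the paper would look roughly like this (the ResNet-18 below is just a stand-in model, not the paper's exact CIFAR architecture):

import torch
import torchvision
import torchvision.transforms as T

# CIFAR-10 training-time augmentation as in the paper: pad 4 pixels per side,
# take a random 32x32 crop, and flip horizontally.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),  # common CIFAR-10 stats
])

model = torchvision.models.resnet18(num_classes=10)  # stand-in model

# Weight decay (L2) and momentum go into the optimizer, e.g. the paper's 1e-4 and 0.9.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)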
I want to train CaffeNet on the MNIST dataset in Caffe. However, I noticed that after 100 iterations the loss dropped only slightly (from 2.66364 to 2.29882).
However, when I use LeNet on MNIST, the loss goes from 2.41197 to 0.22359 after 100 iterations.
Does this happen because CaffeNet has more layers and therefore needs more training time to converge? Or is it due to something else? I made sure the solver.prototxt files of the two nets were the same.
While I know 100 iterations is extremely short (CaffeNet usually trains for ~300-400k iterations), I find it odd that LeNet is able to reach such a small loss so soon.
I am not familiar with the architecture of these nets, but in general there are several possible reasons:
1) One of the nets is really much more complicated.
2) One of the nets was trained with a larger learning rate.
3) Maybe one was trained with momentum while the other net wasn't.
4) It is also possible that both use momentum during training, but one of them had a larger momentum coefficient specified.
Really, there are tons of possible explanations for that.
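Since you mention that the solver.prototxt files are the same, one quick way to double-check points 2) to 4) is to parse both solvers with Caffe's Python bindings and compare the relevant fields (the file names below are hypothetical):

from caffe.proto import caffe_pb2
from google.protobuf import text_format

def load_solver(path):
    solver = caffe_pb2.SolverParameter()
    with open(path) as f:
        text_format.Merge(f.read(), solver)
    return solver

# Hypothetical file names; point these at your actual solver files.
for name, path in [('CaffeNet', 'caffenet_solver.prototxt'), ('LeNet', 'lenet_solver.prototxt')]:
    s = load_solver(path)
    # base_lr covers point 2); momentum covers points 3) and 4).
    print(name, 'base_lr =', s.base_lr, 'momentum =', s.momentum)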