I'm training an FCN (Fully Convolutional Network) and using "Sigmoid Cross Entropy" as a loss function.
my measurements are F-measure and MAE.
The Train/Dev Loss w.r.t #iteration graph is something like the below:
Although Dev loss has a slight increase after #Iter=2200, my measurements on Dev set have been improved up to near #iter = 10000. I want to know is it possible in machine learning at all? If F-measure has been improved, should the loss also be decreased? How do you explain it?
Every answer would be appreciated.
Short answer, yes it's possible.
How I would explain it is by reasoning on the Cross-Entropy loss and how it differs from the metrics. Loss Functions for classification, generally speaking, are used to optimize models relying on probabilities (0.1/0.9), while metrics usually use the predicted labels. (0/1)
Assuming having strong confidence (close to 0 or to 1) in a model probability hypothesis, a wrong prediction will greatly increase the loss and have a small decrease in F-measure.
Likewise, in the opposite scenario, a model with low confidence (e.g. 0.49/0.51) would have a small impact on the loss function (from a numerical perspective) and a greater impact on the metrics.
Plotting the distribution of your predictions would help to confirm this hypothesis.
Related
I'm trying to understand the bias and variance more.
I'm wondering if there is a loss function considering bias and variance.
As far as I know, the high bias makes underfit, and the high variance makes overfit.
the image from here
If we can consider the bias and variance in the loss, it could be like this, bias(x) + variance(x) + some_other_loss(x). And my curious point is two-part.
Is there a loss function considering bias and variance?
If the losses we normally have used already considered the bias and variance, how can I measure the bias and variance separately in scores?
This kind of question could be a fundamental mathematical question, I think. If you have any hint for that, I'll really appreciate it.
Thank you for reading my weird question.
After writing the question, I realized that regularization is one of the ways to reduce the variance. Then, 3) is it the way to measure the bias in a score?
Thank you again.
Update at Jan 16th, 2022
I have searched a little bit and answered myself. If there are wrong understandings, please comment below.
Bais is represented by loss value during training, so we don't need an additional bias loss function.
But for the variance, there is no way to score, because if we want to measure it we should get the training loss and unseen data's loss. But once we use the unseen data as a training loss, the unseen data be seen data. So this will are not unseen data anymore in terms of the model. So as far as I understand, there is no way to measure variance for training loss.
I hope other people can be helped and please comment your thinking if you have.
As you have clearly stated that high bias -> model is underfitting in comparison to a good fit, and high variance -> over fitting than a good fit.
Measuring either of them requires you to know the good fit in advance, which happens to be the end goal of training a model. Hence, it is not possible to measure underfitting or over fitting during training itself. However, if you can have an idea of a target amount of loss, you can use an early stopping callback to stop around the good fit.
I wonder why is our objective is to maximize AUC when maximizing accuracy yields the same?
I think that along with the primary goal to maximize accuracy, AUC will automatically be large.
I guess we use AUC because it explains how well our method is able to separate the data independently of a threshold.
For some applications, we don't want to have false positive or negative. And when we use accuracy, we already make an a priori on the best threshold to separate the data regardless of the specificity and sensitivity.
.
In binary classification, accuracy is a performance metric of a single model for a certain threshold and the AUC (Area under ROC curve) is a performance metric of a series of models for a series of thresholds.
Thanks to this question, I have learnt quite a bit on AUC and accuracy comparisons. I don't think that there's a correlation between the two and I think this is still an open problem. At the end of this answer, I've added some links like these that I think would be useful.
One scenario where accuracy fails:
Example Problem
Let's consider a binary classification problem where you evaluate the performance of your model on a data set of 100 samples (98 of class 0 and 2 of class 1).
Take out your sophisticated machine learning model and replace the whole thing with a dumb system that always outputs 0 for whatever the input it receives.
What is the accuracy now?
Accuracy = Correct predictions/Total predictions = 98/100 = 0.98
We got a stunning 98% accuracy on the "Always 0" system.
Now you convert your system to a cancer diagnosis system and start predicting (0 - No cancer, 1 - Cancer) on a set of patients. Assuming there will be a few cases that corresponds to class 1, you will still achieve a high accuracy.
Despite having a high accuracy, what is the point of the system if it fails to do well on the class 1 (Identifying patients with cancer)?
This observation suggests that accuracy is not a good evaluation metric for every type of machine learning problems. The above is known as an imbalanced class problem and there are enough practical problems of this nature.
As for the comparison of accuracy and AUC, here are some links I think would be useful,
An introduction to ROC analysis
Area under curve of ROC vs. overall accuracy
Why is AUC higher for a classifier that is less accurate than for one that is more accurate?
What does AUC stand for and what is it?
Understanding ROC curve
ROC vs. Accuracy vs. AROC
I've been training a neural network and using Tensorflow.
My cost function is:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
Training my neural network has caused the cross entropy loss to decrease from ~170k to around 50, a dramatic improvement. Meanwhile, my accuracy has actually gotten slightly worse: from 3% to 2.9%. These tests are made on the training set so overfitting is not in the question.
I calculate accuracy simply as follows:
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
print('Accuracy:', accuracy.eval({x: batch_x, y: batch_y}))
What could possibly be the cause for this?
Should I use the accuracy as a cost function instead since something is clearly wrong with the cross entropy (softmax) for my case.
I know there is a similar question to this on StackOverflow but the question was never answered completely.
I cannot tell the exact reason for the problem without seeing your machine learning problem.
Please at least provide the type of the problem (binary, multi-class and/or multi-label)
But, with such low accuracy and large difference between accuracy and loss, I believe it is a bug, not a machine learning issue.
One possible bug might be related to loading label data y. 3% accuracy is too low for most machine learning problems (except for the image classification). You would get 3% accuracy if you are guessing randomly over 33 labels.
Is your problem really 33 multi-class classification problem? If not, you might have done something wrong when creating data batch_y (wrong dimension, shape mismatch to prediction, ...).
I implement the ResNet for the cifar 10 in accordance with this document https://arxiv.org/pdf/1512.03385.pdf
But my accuracy is significantly different from the accuracy obtained in the document
My - 86%
Pcs daughter - 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little bit too generic, my opinion is that the network is over fitting to the training data set, as you can see the training loss is quite low, but after the epoch 50 the validation loss is not improving anymore.
I didn't read the paper in deep so I don't know how did they solved the problem but increasing regularization might help. The following link will point you in the right direction http://cs231n.github.io/neural-networks-3/
below I copied the summary of the text:
Summary
To train a Neural Network:
Gradient check your implementation with a small batch of data and be aware of the pitfalls.
As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of
the data
During training, monitor the loss, the training/validation accuracy, and if you’re feeling fancier, the magnitude of updates in relation to
parameter values (it should be ~1e-3), and when dealing with ConvNets,
the first-layer weights.
The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or
whenever the validation accuracy tops off.
Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges,
training only for 1-5 epochs), to fine (narrower rangers, training for
many more epochs)
Form model ensembles for extra performance
I would argue that the difference in data pre processing makes the difference in performance. He is using padding and random crops, which in essence increases the amount of training samples and decreases the generalization error. Also as the previous poster said you are missing regularization features, such as the weight decay.
You should take another look at the paper and make sure you implement everything like they did.
I'm using Pybrain to train a recurrent neural network. However, the average of the weights keeps climbing and after several iterations the train and test accuracy become lower. Now the highest performance on train data is about 55% and on test data is about 50%.
I think maybe the rnn have some training problems because of its high weights. How can I solve it? Thank you in advance.
The usual way to restrict the network parameters is to use a constrained error-functional which somehow penalizes the absolute magnitude of the parameters. Such is done in "weight decay" where you add to your sum-of-squares error the norm of the weights ||w||. Usually this is the Euclidian norm, but sometimes also the 1-norm in which case it is called "Lasso". Note that weight decay is also called ridge regression or Tikhonov regularization.
In PyBrain, according to this page in the documentation, there is available a Lasso-version of weight decay, which can be parametrized by the parameter wDecay.