Is there a loss function considering bias and variance? - machine-learning

I'm trying to understand the bias and variance more.
I'm wondering if there is a loss function considering bias and variance.
As far as I know, high bias leads to underfitting and high variance leads to overfitting.
If we could account for bias and variance in the loss, it might look like bias(x) + variance(x) + some_other_loss(x). My question has two parts.
Is there a loss function considering bias and variance?
If the losses we normally use already account for bias and variance, how can I measure bias and variance separately as scores?
This may be a fairly fundamental mathematical question, I think. If you have any hints, I would really appreciate them.
Thank you for reading my weird question.
After writing the question, I realized that regularization is one way to reduce the variance. That adds a third part: 3) is there a corresponding way to measure the bias as a score?
Thank you again.
Update at Jan 16th, 2022
I have searched a little bit and answered myself. If there are wrong understandings, please comment below.
Bias is reflected in the training loss itself, so we don't need an additional bias loss function.
For variance, however, there seems to be no way to produce a score during training: measuring it would require comparing the training loss with the loss on unseen data, but as soon as we use unseen data to compute a training loss, that data is no longer unseen to the model. So, as far as I understand, there is no way to measure variance from the training loss alone.
I hope this helps other people, and please comment with your own thoughts if you have any.

As you have clearly stated, high bias means the model is underfitting relative to a good fit, and high variance means it is overfitting relative to a good fit.
Measuring either of them requires knowing the good fit in advance, which happens to be the end goal of training the model in the first place. Hence, it is not possible to measure underfitting or overfitting during training itself. However, if you have an idea of a target loss value, you can use an early-stopping callback to stop training around the good fit.
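That said, the decomposition can be approximated empirically after the fact: retrain the same model on many bootstrap resamples of the training set and split the squared test error into a bias and a variance part. A minimal sketch, assuming a synthetic dataset and a decision tree purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative data and model; swap in your own.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_rounds = 100
preds = np.zeros((n_rounds, len(y_test)))
rng = np.random.default_rng(0)

for i in range(n_rounds):
    idx = rng.integers(0, len(X_train), len(X_train))     # bootstrap resample
    model = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
    preds[i] = model.predict(X_test)

mean_pred = preds.mean(axis=0)                 # average prediction over resamples
bias_sq = np.mean((mean_pred - y_test) ** 2)   # squared bias (includes label noise)
variance = np.mean(preds.var(axis=0))          # average prediction variance
print(f"bias^2 (+noise): {bias_sq:.2f}  variance: {variance:.2f}")
```

Note that the held-out set is never used for fitting, which sidesteps the "unseen data becomes seen" concern from the question: the resampling happens only inside the training set.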

Related

Cross-entropy loss influence over F-score

I'm training an FCN (Fully Convolutional Network) and using "Sigmoid Cross Entropy" as a loss function.
My metrics are F-measure and MAE.
The train/dev loss versus iteration graph looks something like the one below (figure not included here):
Although the dev loss increases slightly after iteration 2200, my metrics on the dev set keep improving until around iteration 10000. Is this possible in machine learning at all? If the F-measure has improved, shouldn't the loss also decrease? How do you explain it?
Every answer would be appreciated.
Short answer, yes it's possible.
The way I would explain it is by reasoning about how the cross-entropy loss differs from the metrics. Classification losses, generally speaking, optimize the model using predicted probabilities (e.g. 0.1/0.9), while metrics usually use the predicted labels (0/1).
If the model is highly confident (probabilities close to 0 or 1), a single wrong prediction greatly increases the loss while only slightly decreasing the F-measure.
Conversely, a model with low confidence (e.g. 0.49/0.51) has a small numerical impact on the loss but a larger impact on the metrics.
Plotting the distribution of your predictions would help to confirm this hypothesis.
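To make the confidence argument concrete, here is a small made-up example: two probability vectors that yield the same predicted labels, and therefore the same F1 score, but very different cross-entropy, because one model makes its single mistake with high confidence.

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Model A makes one confident mistake (0.95 for a true negative),
# model B makes the same mistake with low confidence (0.55).
p_a = np.array([0.8, 0.7, 0.9, 0.6, 0.2, 0.3, 0.1, 0.95])
p_b = np.array([0.8, 0.7, 0.9, 0.6, 0.2, 0.3, 0.1, 0.55])

for name, p in [("A (confident error)", p_a), ("B (hesitant error)", p_b)]:
    labels = (p >= 0.5).astype(int)            # metrics see only the labels
    print(name,
          "F1:", round(f1_score(y_true, labels), 3),
          "log loss:", round(log_loss(y_true, p), 3))
```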

Test accuracy is greater than train accuracy what to do?

I am using a random forest. My test accuracy is 70% while my train accuracy is 34%. What should I do? How can I solve this problem?
Test accuracy should normally not be higher than train accuracy, since the model is optimized on the latter. This behavior can happen in a few ways:
You did not draw train and test from the same source dataset. You should do a proper train/test split in which both sets share the same underlying distribution (see the sketch after this list); most likely you evaluated on a completely different, and easier, dataset.
An unreasonably high degree of regularization was applied. Even then, there would still need to be some mismatch between the test and train distributions for the observed behavior to occur.
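For the first point, a minimal sketch of a proper split from a single source dataset (the dataset and model are illustrative stand-ins, not the asker's): stratifying keeps the class proportions, and hence the underlying distribution, the same in both sets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)          # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```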
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting to the harder patterns, so you'll need to increase the expressibility of your model. In the case of random forests, this might mean training the trees to a higher depth.
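If the underfitting explanation applies, the depth suggestion looks roughly like this (synthetic data of my own, purely to show the effect of the max_depth cap):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# A capped depth keeps training accuracy low; fully grown trees close the gap.
for depth in [2, None]:
    clf = RandomForestClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(f"max_depth={depth}: train accuracy = {clf.score(X, y):.3f}")
```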
First, you should check the data used for training. There may be a problem with it, for example improper pre-processing.
Also, in this case, you could try training for more epochs. Plot the learning curve to see when the model converges.
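One common form of learning curve, accuracy versus training-set size, can be plotted with scikit-learn; the dataset and model here are illustrative assumptions, not the asker's:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)           # placeholder dataset
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    cv=5, train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0],
)

plt.plot(sizes, train_scores.mean(axis=1), label="train accuracy")
plt.plot(sizes, val_scores.mean(axis=1), label="validation accuracy")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```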
You should check the following:
Both training and validation accuracy should increase and the losses should decrease.
If that stops holding after some particular epoch, train your model only up to that epoch, because it is over-fitting beyond that point.

ResNet: how to achieve the accuracy reported in the paper?

I implemented a ResNet for CIFAR-10 following this paper: https://arxiv.org/pdf/1512.03385.pdf
But my accuracy differs significantly from the accuracy reported in the paper:
Mine: 86%
Paper's: 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little too generic, but my opinion is that the network is overfitting the training set: as you can see, the training loss is quite low, but after epoch 50 the validation loss stops improving.
I didn't read the paper in depth, so I don't know how they solved this, but increasing regularization might help. The following link will point you in the right direction: http://cs231n.github.io/neural-networks-3/
below I copied the summary of the text:
Summary
To train a Neural Network:
Gradient check your implementation with a small batch of data and be aware of the pitfalls.
As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data.
During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off (see the sketch after this list).
Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs) to fine (narrower ranges, training for many more epochs).
Form model ensembles for extra performance.
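For the learning-rate decay item, a minimal PyTorch sketch (the model, step size, and decay factor are placeholders of my own, not taken from the quoted text):

```python
import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(10, 2)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = StepLR(optimizer, step_size=30, gamma=0.5)  # halve lr every 30 epochs

for epoch in range(90):
    # ... train over all batches here ...
    scheduler.step()                                    # decay at the end of the epoch
```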
I would argue that the difference in data pre-processing explains the difference in performance. The paper uses padding and random crops, which in essence increases the number of training samples and decreases the generalization error. Also, as the previous poster said, you are missing regularization, such as weight decay.
You should take another look at the paper and make sure you implement everything the way they did.
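For reference, the paper's CIFAR-10 augmentation (4-pixel padding, a random 32x32 crop, and a horizontal flip) can be reproduced roughly as below; torchvision and the normalization constants are my own illustrative choices, not the questioner's code.

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),       # pad 4 px on each side, then crop 32x32
    T.RandomHorizontalFlip(),          # random mirror
    T.ToTensor(),
    # Commonly used CIFAR-10 channel means/stds (approximate).
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = CIFAR10(root="./data", train=True, download=True,
                    transform=train_transform)
```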

Why does one not use IOU for training?

When people try to solve the task of semantic segmentation with CNN's they usually use a softmax-crossentropy loss during training (see Fully conv. - Long). But when it comes to comparing the performance of different approaches measures like intersection-over-union are reported.
My question is why don't people train directly on the measure they want to optimize? Seems odd to me to train on some measure during training, but evaluate on another measure for benchmarks.
I can see that the IOU has problems for training samples, where the class is not present (union=0 and intersection=0 => division zero by zero). But when I can ensure that every sample of my ground truth contains all classes, is there another reason for not using this measure?
Check out this paper, where they come up with a way to make the concept of IoU differentiable. I implemented their solution with amazing results!
It is like asking "why do we train classification with log loss instead of accuracy?". The reason is really simple: you cannot directly train on most metrics, because they are not differentiable with respect to your parameters (or at least do not produce a nice error surface). Log loss (softmax cross-entropy) is a valid surrogate for accuracy. Now, you are completely right that it is plain wrong to train with something that is not a valid surrogate of the metric you are interested in, and the linked paper does not do a good job here, since for at least a few of the metrics they consider we could easily show a good surrogate (for weighted accuracy, for example, all you have to do is weight the log loss as well).
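As a concrete illustration of that last remark, class weights inside the cross-entropy make it a surrogate for weighted accuracy; the weights and batch below are made up.

```python
import torch
import torch.nn as nn

class_weights = torch.tensor([1.0, 5.0])        # e.g. up-weight the rarer class
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)                      # dummy batch of 8 examples, 2 classes
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)                # mistakes on class 1 cost 5x more
```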
Here's another way to think about this in a simple manner.
Remember that it is not sufficient to simply evaluate a metric such as accuracy or IoU while solving a relevant image problem. Evaluating the metric must also tell the network in which direction the weights should be nudged, so that it can learn effectively over iterations and epochs.
Providing this direction is what the earlier answers mean when they say the errors must be differentiable. I suppose there is nothing in the IoU metric itself that the network can use to say: "hey, it's not exactly here, but maybe I have to move my bounding box a little to the left!"
Just a rough sketch of an explanation, but I hope it helps.
I always use mean IoU for training a segmentation model, or more exactly -log(mIoU). Plain -mIoU as a loss function will easily trap your optimizer around 0 because of its narrow range (0, 1) and thus its steep surface. Taking the log makes the loss surface gentler and better behaved for training.
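A rough sketch of what such a loss can look like, assuming a binary segmentation setup in PyTorch (my own illustration, not the answerer's code): predicted probabilities stand in for hard labels so the expression stays differentiable, and the negative log of the batch-mean IoU is returned as suggested above.

```python
import torch

def soft_iou_loss(probs: torch.Tensor, targets: torch.Tensor,
                  eps: float = 1e-6) -> torch.Tensor:
    """probs and targets: shape (batch, H, W), values in [0, 1]."""
    intersection = (probs * targets).sum(dim=(1, 2))
    union = (probs + targets - probs * targets).sum(dim=(1, 2))
    iou = (intersection + eps) / (union + eps)       # soft IoU per image
    return -torch.log(iou.mean())                    # -log(mean IoU)

# usage: loss = soft_iou_loss(torch.sigmoid(logits), masks); loss.backward()
```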

Why does adding features to linear regression decrease accuracy?

I am new to ML and am working on a kaggle competition to learn a bit. When I add certain features to my dataset, the accuracy decreases.
Why isn't the feature that adds to the cost simply weighted to zero (i.e. ignored)? Is it because non-linear features can cause a local-minimum solution?
Thanks.
If you're talking about training error for a linear regression model, then adding features will never increase your error unless you have a bug. Like you say, it's a convex problem, and the global solution can never get worse, because you can always set the new weight to zero.
If you're talking about test error however, then overfitting is going to be the big issue with adding features, and is certainly something you would observe.
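A quick synthetic experiment (my own toy data, not the competition data) shows both halves of this: extra irrelevant features keep nudging the training MSE down, while the test MSE gets worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)   # only feature 0 matters
noise = rng.normal(size=(200, 50))                    # 50 irrelevant features

for name, feats in [("5 features", X),
                    ("5 + 50 noise features", np.hstack([X, noise]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(feats, y, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(name,
          "| train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
          "| test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 3))
```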
I can't comment, therefore posting as an answer.
@agilefall: you are not necessarily wrong. If you are measuring accuracy in terms of the correlation between predicted and actual output, then the accuracy can decrease as you add more features; linear regression does not guarantee anything about that.
