Why does adding features to linear regression decrease accuracy?

I am new to ML and am working on a kaggle competition to learn a bit. When I add certain features to my dataset, the accuracy decreases.
Why isn't the feature that adds to the cost just weighted to zero (ignored)? Is it because non-linear features can cause a local-minimum solution?
Thanks.

If you're talking about training error for a linear regression model, then adding features can never increase your error unless you have a bug. As you say, it's a convex problem, and the global solution can never get worse because the optimizer can always just set the new weight to zero.
If you're talking about test error, however, then overfitting is the big issue with adding features, and it is certainly something you would observe.
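To make the convexity point concrete, here is a minimal sketch on synthetic data (not the question's Kaggle dataset): adding pure-noise columns to ordinary least squares does not increase the training MSE, but the test MSE gets worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
noise = rng.normal(size=(200, 30))            # irrelevant extra features

X_tr, X_te, N_tr, N_te, y_tr, y_te = train_test_split(
    X, noise, y, test_size=0.5, random_state=0)

for name, tr, te in [("3 real features", X_tr, X_te),
                     ("+30 noise features",
                      np.hstack([X_tr, N_tr]), np.hstack([X_te, N_te]))]:
    model = LinearRegression().fit(tr, y_tr)
    print(name,
          "| train MSE:", round(mean_squared_error(y_tr, model.predict(tr)), 3),
          "| test MSE:", round(mean_squared_error(y_te, model.predict(te)), 3))
```

Training error stays the same or drops with the extra columns, while test error rises, which is exactly the overfitting effect described above.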

I can't comment, so I'm posting this as an answer.
@agilefall: you are not necessarily wrong. If you are measuring accuracy as the correlation between predicted and actual output, then accuracy can decrease as you add more features; linear regression does not guarantee anything about that.

Related

Is there a loss function considering bias and variance?

I'm trying to understand the bias and variance more.
I'm wondering if there is a loss function considering bias and variance.
As far as I know, high bias causes underfitting and high variance causes overfitting.
(There is an illustrative bias-variance image linked in the original post.)
If we could account for bias and variance in the loss, it might look like bias(x) + variance(x) + some_other_loss(x). My question has two parts:
Is there a loss function considering bias and variance?
If the losses we normally use already account for bias and variance, how can I measure the bias and variance separately as scores?
This may be a fundamental mathematical question, I think. If you have any hints, I'll really appreciate them.
Thank you for reading my weird question.
After writing the question, I realized that regularization is one way to reduce variance. That raises a third question: 3) is there a corresponding way to measure the bias as a score?
Thank you again.
Update at Jan 16th, 2022
I have searched a little bit and answered myself. If there are wrong understandings, please comment below.
Bias is reflected in the training loss, so we don't need an additional bias loss function.
For variance, however, there seems to be no way to score it during training, because measuring it requires both the training loss and the loss on unseen data. Once we use unseen data to compute a training loss, it is no longer unseen from the model's perspective. So, as far as I understand, there is no way to measure variance from the training loss alone.
I hope other people can be helped and please comment your thinking if you have.
As you have stated, high bias means the model is underfitting relative to a good fit, and high variance means it is overfitting relative to a good fit.
Measuring either of them requires knowing the good fit in advance, which happens to be the end goal of training the model. Hence, it is not possible to measure underfitting or overfitting exactly during training itself. However, if you have an idea of a target amount of loss, you can use an early-stopping callback to stop around the good fit.
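As a concrete illustration of that early-stopping idea, here is a hedged sketch using Keras (the framework is an assumption; the question does not name one). The callback name and the target value are made up for the example.

```python
import tensorflow as tf

class StopAtTargetLoss(tf.keras.callbacks.Callback):
    """Stop training once validation loss reaches a target we consider 'good enough'."""
    def __init__(self, target_loss):
        super().__init__()
        self.target_loss = target_loss

    def on_epoch_end(self, epoch, logs=None):
        val_loss = (logs or {}).get("val_loss")
        if val_loss is not None and val_loss <= self.target_loss:
            print(f"Reached target val_loss={self.target_loss} at epoch {epoch + 1}; stopping.")
            self.model.stop_training = True

# Hypothetical usage (model and data are assumed to exist):
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[StopAtTargetLoss(target_loss=0.05)])
```

In practice, the gap between training and validation loss is the usual rough proxy for variance, and the absolute validation loss is the rough proxy for bias.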

Probabilistic classification with Gaussian Bayes Classifier vs Logistic Regression

I have a binary classification problem where I have a few great features that have the power to predict almost 100% of the test data because the problem is relatively simple.
However, the nature of the problem means I cannot afford mistakes, so instead of giving a prediction I am not sure of, I would rather have the output as a probability, set a threshold, and be able to say, "if I am less than 95% sure, I will call this NOT SURE and act accordingly". Saying "I don't know" is better than making a mistake.
So far so good.
For this purpose, I tried a Gaussian Bayes classifier (I have a continuous feature) and Logistic Regression, both of which provide a probability as well as the predicted class.
Coming to my Problem:
GBC has around 99% success rate while Logistic Regression has lower, around 96% success rate. So I naturally would prefer to use GBC.
However, as successful as GBC is, it is also very sure of itself. The probabilities I get are either 1 or very close to 1, such as 0.9999997, which makes things tough for me, because in practice GBC no longer gives me usable probabilities.
Logistic Regression performs worse, but at least gives better, more sensible probabilities.
Because of the nature of my problem, the cost of misclassification grows as a power of 2, so if I misclassify 4 products, I lose 2^4 times more (it's unit-less but gives an idea anyway).
In the end, I would like to classify with higher success than Logistic Regression, but also obtain meaningful probabilities so I can set a threshold and flag the ones I am not sure of.
Any suggestions?
Thanks in advance.
If you have enough data, you can simply recalibrate the probabilities. For example, given the "predicted probability" output of your Gaussian classifier, you can go back through a held-out dataset and, at different predicted values, estimate the empirical probability of the positive class.
Alternatively, you can set up an optimization on your holdout set to determine the best threshold (without actually estimating the probability). Since it's one-dimensional, you don't need anything fancy for the optimization: test, say, 500 different thresholds and pick the one that minimizes the cost associated with misclassifications. A sketch of that sweep follows.
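Here is a minimal sketch of that one-dimensional sweep, assuming a held-out set of predicted probabilities p_holdout and labels y_holdout, plus made-up costs for false positives, false negatives, and abstaining ("NOT SURE"):

```python
import numpy as np

def best_threshold(p_holdout, y_holdout,
                   cost_fp=1.0, cost_fn=1.0, cost_abstain=0.1, n_grid=500):
    """Sweep candidate confidence thresholds on a held-out set and return the
    one with the lowest total cost. Predictions whose confidence falls below
    the threshold are treated as 'NOT SURE' and charged cost_abstain instead
    of a misclassification cost."""
    thresholds = np.linspace(0.5, 1.0, n_grid)
    best_t, best_cost = None, np.inf
    for t in thresholds:
        confident = np.maximum(p_holdout, 1 - p_holdout) >= t
        pred = (p_holdout >= 0.5).astype(int)
        fp = np.sum(confident & (pred == 1) & (y_holdout == 0))
        fn = np.sum(confident & (pred == 0) & (y_holdout == 1))
        abstain = np.sum(~confident)
        cost = cost_fp * fp + cost_fn * fn + cost_abstain * abstain
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

The cost values above are placeholders; plug in whatever your real misclassification and "not sure" costs are.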

Test accuracy is greater than train accuracy, what to do?

I am using a random forest. My test accuracy is 70% while my train accuracy is 34%. What should I do? How can I solve this problem?
Test accuracy should not be higher than train accuracy, since the model is optimized for the latter. Ways in which this behavior might happen:
you did not use the same source dataset for the test set. You should do a proper train/test split in which both parts have the same underlying distribution (see the sketch after this list). Most likely you provided a completely different (and easier) dataset for testing;
an unreasonably high degree of regularization was applied. Even so, there would need to be some element of "the test data distribution is not the same as the train distribution" for the observed behavior to occur.
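Here is a minimal sketch of such a split, with stand-in data from make_classification in place of the asker's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; replace with your own feature matrix X and labels y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# One split of the same dataset, stratified so both sides share the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test  accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

With a split like this, train accuracy for a random forest should be at least as high as test accuracy; if it still is not, the data pipeline is the first suspect.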
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting to the harder patterns, so you'll need to increase the expressibility of your model. In the case of random forests, this might mean training the trees to a higher depth.
First, check the data used for training. There may be a problem with it, for example it may not be properly pre-processed.
Also, in this case, try training for longer. Plot the learning curve to analyze when the model converges (a sketch is below).
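A hedged sketch of that learning-curve idea, using scikit-learn on stand-in data; note that a random forest has no epochs, so the curve here is drawn over increasing training-set sizes instead:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in data

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

plt.plot(sizes, train_scores.mean(axis=1), label="train")
plt.plot(sizes, val_scores.mean(axis=1), label="validation")
plt.xlabel("training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```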
You should check the following:
Both training and validation accuracy should increase and the losses should decrease.
If something goes wrong in step 1 after a particular epoch, then train your model only up to that epoch, because your model is overfitting beyond that point.

Why does one not use IOU for training?

When people try to solve the task of semantic segmentation with CNNs, they usually use a softmax cross-entropy loss during training (see Fully Convolutional Networks, Long et al.). But when it comes to comparing the performance of different approaches, measures like intersection-over-union (IoU) are reported.
My question is: why don't people train directly on the measure they want to optimize? It seems odd to train on one measure but evaluate benchmarks on another.
I can see that IoU has problems for training samples where a class is not present (union = 0 and intersection = 0 gives zero divided by zero). But if I can ensure that every sample of my ground truth contains all classes, is there another reason for not using this measure?
Check out this paper, where they come up with a way to make IoU differentiable. I implemented their solution with amazing results!
It is like asking "why, for classification, do we train on log loss and not on accuracy?". The reason is really simple: you cannot directly train on most metrics, because they are not differentiable with respect to your parameters (or at least do not produce a nice error surface). Log loss (softmax cross-entropy) is a valid surrogate for accuracy. You are also completely right that it is plainly wrong to train on something that is not a valid surrogate of the metric you are interested in, and the linked paper does not do a good job in that respect, since for at least a few of the metrics it considers we can easily construct a good surrogate (for weighted accuracy, all you have to do is weight the log loss as well; see the sketch below).
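As a small illustration of that surrogate argument, weighting the log loss is just a sample_weight away in scikit-learn (the labels, probabilities, and class weights below are made up):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9])   # predicted P(class 1)

class_weight = {0: 1.0, 1: 3.0}                 # e.g. the positive class counts 3x
sample_weight = np.array([class_weight[c] for c in y_true])

print("plain log loss:   ", log_loss(y_true, y_prob))
print("weighted log loss:", log_loss(y_true, y_prob, sample_weight=sample_weight))
```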
Here's another way to think about this in a simple manner.
Remember that it is not sufficient to simply evaluate a metric such as accuracy or IoU while solving a relevant image problem. Evaluating the metric must also tell the network in which direction the weights should be nudged, so that it can learn effectively over iterations and epochs.
Evaluating this direction is what the earlier answers mean when they say the error must be differentiable. There is nothing about the IoU metric that the network can use to say: "hey, it's not exactly here, but maybe I should move my bounding box a little to the left!"
Just a sliver of an explanation, but I hope it helps.
I always use mean IoU for training a segmentation model; more exactly, -log(mIoU). Plain -mIoU as a loss function will easily trap your optimizer around 0 because of its narrow range (0, 1) and the resulting steep surface. Taking the log gives a gentler loss surface that behaves better during training.
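For reference, here is a hedged PyTorch sketch of a -log(soft mIoU) loss. The "soft" IoU formulation below (sums over predicted probabilities in place of hard set operations) is one common choice and an assumption on my part, not the exact code behind this answer.

```python
import torch

def neg_log_soft_miou(probs, targets, eps=1e-6):
    """probs: (N, C, H, W) softmax outputs; targets: (N, C, H, W) one-hot masks.
    Returns -log of the mean per-class soft IoU, which is differentiable."""
    dims = (0, 2, 3)                                       # sum over batch and pixels
    intersection = (probs * targets).sum(dim=dims)
    union = (probs + targets - probs * targets).sum(dim=dims)
    iou_per_class = (intersection + eps) / (union + eps)   # eps avoids 0/0
    return -torch.log(iou_per_class.mean())

# Hypothetical usage:
# probs = torch.softmax(logits, dim=1)
# loss = neg_log_soft_miou(probs, one_hot_targets)
# loss.backward()
```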

SGD model "overconfidence"

I'm working on a binary classification problem using Apache Mahout. The algorithm I use is OnlineLogisticRegression, and the model I currently have strongly tends to produce predictions that are either 1 or 0, without any intermediate values.
Please suggest a way to tune or tweak the algorithm to make it produce more intermediate values in predictions.
Thanks in advance!
What is the test error rate of the classifier? If it's near zero then being confident is a feature, not a bug.
If the test error rate is high (or at least not low), then the classifier might be overfitting the training set: measure the difference between the training error and the test error. In that case, increasing regularization as rrenaud suggested might help.
If your classifier is not overfitting, then there might be an issue with probability calibration. Logistic regression models (i.e. using the logit link function) should yield reasonably well-calibrated probabilities, provided the problem is approximately linearly separable and the labels are not too noisy. You can check the calibration of the probabilities with a plot, as explained in this paper. If it really is a calibration issue, then implementing a custom calibration based on Platt scaling or isotonic regression might help fix it.
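For illustration, here is a hedged scikit-learn sketch of checking calibration and applying Platt scaling; the question uses Mahout, so this only demonstrates the idea, not the original stack (SGDClassifier on synthetic data stands in for the online logistic regression and its dataset):

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)  # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

base = SGDClassifier(loss="log_loss", random_state=0)       # SGD-trained logistic regression
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X_tr, y_tr)  # Platt scaling

# Reliability diagram data: predicted vs. empirical positive rate per bin.
prob_true, prob_pred = calibration_curve(
    y_te, calibrated.predict_proba(X_te)[:, 1], n_bins=10)
print(list(zip(prob_pred.round(2), prob_true.round(2))))    # close pairs => well calibrated
```

Swapping method="sigmoid" for method="isotonic" gives the isotonic-regression variant mentioned above.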
From reading the Mahout AbstractOnlineLogisticRegression docs, it looks like you can control the regularization parameter lambda. Increasing lambda should mean your weights are closer to 0, and hence your predictions are more hedged.
