SGD model "overconfidence" - machine-learning

I'm working on a binary classification problem using Apache Mahout. The algorithm I use is OnlineLogisticRegression, and my current model strongly tends to produce predictions that are either 1 or 0, with no values in between.
Please suggest a way to tune or tweak the algorithm to make it produce more intermediate values in predictions.
Thanks in advance!

What is the test error rate of the classifier? If it's near zero then being confident is a feature, not a bug.
If the test error rate is high (or at least not low), then the classifier might be overfitting the training set: measure the difference between the training error and the test error. In that case, increasing regularization as rrenaud suggested might help.
If your classifier is not overfitting, then there might be an issue with the probability calibration. Logistic Regression models (e.g. using the logit link function) should yield good enough probability calibrations (if the problem is approximately linearly separable and the label not too noisy). You can check the calibration of the probabilities with a plot as explained in this paper. If this is really a calibration issue, then implementing a custom calibration based on Platt scaling or isotonic regression might help fix the issue.
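The calibration check mentioned above can be sketched in a few lines: group the predictions into bins and compare the mean predicted probability in each bin with the observed fraction of positives. This is a minimal illustration on made-up data, not Mahout code; the function name reliability_curve and the toy labels and probabilities are assumptions introduced here.

```python
def reliability_curve(y_true, y_prob, n_bins=5):
    """Bin predictions by predicted probability and compare the mean
    prediction in each bin with the observed fraction of positives."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for _, p in members) / len(members)
        frac_pos = sum(y for y, _ in members) / len(members)
        curve.append((mean_pred, frac_pos, len(members)))
    return curve


# Hypothetical toy data: predictions cluster near 0 and 1, but some are wrong.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.02, 0.05, 0.03, 0.01, 0.97, 0.99, 0.98, 0.96]

for mean_pred, frac_pos, count in reliability_curve(y_true, y_prob):
    print(f"mean predicted={mean_pred:.3f}  observed positives={frac_pos:.2f}  n={count}")
```

For a well-calibrated model the two numbers in each row are close. A large gap in the extreme bins, as in this toy data, is the kind of overconfidence that Platt scaling or isotonic regression tries to correct.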

From reading the Mahout AbstractOnlineLogisticRegression docs, it looks like you can control the regularization parameter lambda. Increasing lambda should mean your weights are closer to 0, and hence your predictions are more hedged.
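To see why a larger lambda gives more hedged predictions, here is a rough sketch with a hand-rolled one-weight logistic regression. It is not Mahout's implementation: the function fit_1d_logistic, the L2 penalty form, the learning rate and the toy data are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_1d_logistic(xs, ys, lam, lr=0.5, steps=2000):
    """Gradient descent on mean log loss + (lam / 2) * w**2 (one weight, no bias)."""
    w = 0.0
    for _ in range(steps):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * (grad + lam * w)
    return w


# Hypothetical, perfectly separable toy data.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_plain = fit_1d_logistic(xs, ys, lam=0.0)  # weight keeps growing
w_reg = fit_1d_logistic(xs, ys, lam=1.0)    # weight is pulled toward 0

print("no regularization:   w =", round(w_plain, 3), " p(x=2) =", round(sigmoid(2 * w_plain), 4))
print("with regularization: w =", round(w_reg, 3), " p(x=2) =", round(sigmoid(2 * w_reg), 4))
```

On separable data the unregularized weight grows without bound, so the sigmoid saturates at 0 or 1. The penalty keeps the weight small, which keeps the predicted probability further from the extremes.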


Training Sparse Autoencoders

My dataset consists of very large vectors. The data points are mostly zeros, with ~3% of the features being 1. Essentially, my data is super sparse. I am attempting to train an autoencoder, but my model is learning only to recreate vectors of all zeros.
Are there any techniques to prevent this? I have tried replacing mean squared error with dice loss but it completely stopped learning. My other thoughts would be to use a loss function that favors guessing 1s correctly rather than zeros. I have also tried using a sigmoid and linear last activation with no clear winner. Any ideas would be awesome.
It seems like you are facing a severe "class imbalance" problem.
Have a look at focal loss. This loss is designed for binary classification with severe class imbalance.
Consider "hard negative mining": that is, propagate gradients only for part of the training examples - the "hard" ones.
See, e.g., Abhinav Shrivastava, Abhinav Gupta and Ross Girshick, Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016).
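As a rough sketch of the focal loss idea for a sparse binary target, the snippet below down-weights easy examples by the factor (1 - p_t) ** gamma. The function name focal_loss, the default gamma and alpha, and the toy vectors are assumptions for illustration; with only ~3% ones, a larger alpha for the positive class may be more appropriate.

```python
import math


def focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t) ** gamma * log(p_t)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)        # keep log() finite
        p_t = p if y == 1 else 1.0 - p         # probability given to the true label
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


# Hypothetical sparse target: a single 1 among many 0s.
target = [0] * 9 + [1]
all_zeros_guess = [0.01] * 10         # the collapse described in the question
better_guess = [0.01] * 9 + [0.9]     # also gets the single 1 right

print("all-zeros guess:", focal_loss(target, all_zeros_guess))
print("better guess:   ", focal_loss(target, better_guess))
```

Almost all of the loss for the all-zeros guess comes from the one missed positive, because the many easy zeros are scaled down to nearly nothing. That is the property that keeps the zeros from dominating the gradient.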

Need help choosing loss function

I have used resnet50 to solve a multi-class classification problem. The model outputs probabilities for each class. Which loss function should I choose for my model?
After choosing binary cross entropy:
After choosing categorical cross entropy:
The above results are for the same model with just different loss functions. This model is supposed to classify images into 26 classes, so categorical cross entropy should work.
Also, in the first case accuracy is about 96% but losses are so high. Why?
edit 2:
Model architecture:
You definitely need to use categorical_crossentropy for a multi-class classification problem. binary_crossentropy will reduce your problem to a binary classification problem in a way that is unclear without looking into it further.
I would say that the reason you are seeing high accuracy in the first (and to some extent the second) case is that you are overfitting. The first dense layer you are adding contains 8 million parameters (!!! to see that, run model.summary()), and you only have 70k images to train it on, over 8 epochs. This architectural choice is very demanding in both computing power and data. You are also using a very basic optimizer (SGD). Try a more powerful one such as Adam.
Finally, I am a bit surprised at your choice to take a 'sigmoid' activation function in the output layer. Why not a more classic 'softmax'?
For a multi-class classification problem you use the categorical_crossentropy loss, because it matches the predicted probability distribution against the ground-truth one.
This is exactly what is used for multi-class classification; it is a misconception to think you can't use this loss.
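A small sketch of what categorical cross entropy computes for one sample, using softmax outputs and a one-hot target. It also shows one possible reading of a high accuracy next to a poor loss: if accuracy is counted per output unit across 26 outputs, an all-zeros prediction already scores 25/26. Whether the 96% in the question was computed this way is an assumption, not something this thread confirms. All function names and numbers here are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs, eps=1e-12):
    """-sum_k t_k * log(p_k) for a single sample."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


def elementwise_accuracy(one_hot, probs, threshold=0.5):
    """Fraction of output units whose thresholded value matches the target."""
    hits = sum(int(p > threshold) == t for t, p in zip(one_hot, probs))
    return hits / len(one_hot)


probs = softmax([2.0, 1.0, 0.0])
print("softmax:", [round(p, 3) for p in probs])
print("categorical cross entropy:", round(categorical_crossentropy([1, 0, 0], probs), 4))

# 26 classes, true class at a hypothetical index 3, prediction of all zeros.
target = [1 if k == 3 else 0 for k in range(26)]
print("elementwise accuracy of an all-zeros output:", elementwise_accuracy(target, [0.0] * 26))
```

The per-unit accuracy of the useless all-zeros output is about 0.96, while its categorical cross entropy would be very large, so the two numbers measure quite different things.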

Why does one not use IOU for training?

When people try to solve the task of semantic segmentation with CNNs, they usually use a softmax-crossentropy loss during training (see Fully conv. - Long). But when it comes to comparing the performance of different approaches, measures like intersection-over-union are reported.
My question is why don't people train directly on the measure they want to optimize? Seems odd to me to train on some measure during training, but evaluate on another measure for benchmarks.
I can see that the IOU has problems for training samples, where the class is not present (union=0 and intersection=0 => division zero by zero). But when I can ensure that every sample of my ground truth contains all classes, is there another reason for not using this measure?
Check out this paper where they come up with a way to make the concept of IoU differentiable. I implemented their solution with amazing results!
It is like asking "why do we train classifiers with log loss and not accuracy?". The reason is really simple: you cannot directly train on most metrics, because they are not differentiable with respect to your parameters (or at least do not produce a nice error surface). Log loss (softmax crossentropy) is a valid surrogate for accuracy. Now, you are completely right that it is plain wrong to train with something that is not a valid surrogate of the metric you are interested in, and the linked paper does not do a good job here, since for at least a few of the metrics they consider we could easily give a good surrogate (for weighted accuracy, for example, all you have to do is weight the log loss as well).
Here's another way to think about this in a simple manner.
Remember that it is not sufficient to simply evaluate a metric such as accuracy or IoU while solving a relevant image problem. Evaluating the metric must also help the network learn in which direction the weights must be nudged towards, so that a network can learn effectively over iterations and epochs.
This direction is what the earlier comments are getting at when they say the error must be differentiable. I suppose there is nothing in the IoU metric that the network can use to say: "hey, it's not exactly here, but maybe I have to move my bounding box a little to the left!"
Just a trickle of an explanation, but hope it helps..
I always use mean IOU for training a segmentation model. More exactly, -log(MIOU). Plain -MIOU as a loss function will easily trap your optimizer around 0 because of its narrow range (0,1) and thus its steep surface. Taking the log makes the loss surface more gradual and better suited to training.
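One common way to get a trainable IoU is to replace the hard set operations with products and sums over predicted probabilities, and then take -log as described above. The sketch below does this for a single class; it is an assumed formulation for illustration and not necessarily the one in the linked paper. The names soft_iou and neg_log_iou_loss and the eps smoothing term are hypothetical.

```python
import math


def soft_iou(y_true, y_prob, eps=1e-7):
    """Soft IoU for one class: intersection and union built from probabilities."""
    inter = sum(t * p for t, p in zip(y_true, y_prob))
    union = sum(t + p - t * p for t, p in zip(y_true, y_prob))
    return (inter + eps) / (union + eps)  # eps avoids 0/0 when the class is absent


def neg_log_iou_loss(y_true, y_prob):
    return -math.log(soft_iou(y_true, y_prob))


# Hypothetical flattened mask of four pixels.
mask = [1, 1, 0, 0]
print("perfect:", neg_log_iou_loss(mask, [1.0, 1.0, 0.0, 0.0]))
print("unsure: ", neg_log_iou_loss(mask, [0.6, 0.6, 0.4, 0.4]))
print("wrong:  ", neg_log_iou_loss(mask, [0.1, 0.1, 0.9, 0.9]))
```

Because every term is a sum or product of probabilities, the loss changes smoothly with the predictions, which is what a gradient-based optimizer needs. A mean IoU version would compute soft_iou per class and average before taking the log.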

Techniques to improve the accuracy of SVM classifier

I am trying to build a classifier to predict breast cancer using the UCI dataset. I am using support vector machines. Despite my most sincere efforts to improve upon the accuracy of the classifier, I cannot get beyond 97.062%. I've tried the following:
1. Finding the optimal C and gamma using grid search.
2. Finding the most discriminative feature using F-score.
Can someone suggest techniques to improve the accuracy? I am aiming for at least 99%.
1. The data are already normalized to the range [0,10]. Will normalizing them to [0,1] help?
2. Some other method to find the best C and gamma?
For SVM, it's important that all features share the same scale. Normally this is done by scaling the values in each feature (column) so that the mean is 0 and the variance is 1. Another way is to scale each feature so that its min and max are, for example, 0 and 1. However, there isn't any real difference between [0, 1] and [0, 10]: once C and gamma are re-tuned, both will show the same performance.
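The zero-mean, unit-variance scaling described here can be sketched without any library as follows. The function name standardize_columns and the toy rows are assumptions for illustration; in practice the means and standard deviations would be computed on the training set only and then reused for the test set.

```python
import math


def standardize_columns(rows):
    """Scale each feature (column) to mean 0 and variance 1."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # constant column: avoid dividing by zero
    scaled = [[(r[j] - means[j]) / stds[j] for j in range(n_features)] for r in rows]
    return scaled, means, stds


# Hypothetical data: the first feature varies, the second is constant.
rows = [[1.0, 10.0], [5.0, 10.0], [9.0, 10.0]]
scaled, means, stds = standardize_columns(rows)
print(scaled)
```

Returning the means and standard deviations alongside the scaled rows makes it easy to apply the same transformation to new data.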
If you insist on using SVM for classification, another way that may result in improvement is ensembling multiple SVMs. In case you are using Python, you can try BaggingClassifier from sklearn.ensemble.
Also note that you can't expect arbitrarily high performance on real-world data. I think 97% is very good performance. It is possible that you will overfit the data if you push higher than this.
Some thoughts that came to mind when reading your question and your argument about the author who claims to have achieved acc = 99.51%.
My first thought was OVERFITTING. I could be wrong, because it might depend on the dataset, but overfitting is the first thing that comes to mind. Now my questions:
1- Has the author in his article stated whether the dataset was split into training and testing set?
2- Is this acc = 99.51% achieved with the training set or the testing one?
With the training set you can hit acc = 99.51% when your model is overfitting.
Generally, in that case the SVM classifier performs poorly on unseen data.

Why does adding features to linear regression decrease accuracy?

I am new to ML and am working on a kaggle competition to learn a bit. When I add certain features to my dataset, the accuracy decreases.
Why isn't the feature that adds to the cost just weighted to zero (ignored)? Is it because non-linear features can cause a local-minimum solution?
If you're talking about training error for a linear regression model, then adding features will never increase your error (and will usually decrease it) unless you have a bug. Like you say, it's a convex problem and the global solution can never be worse, as you can just set the new weight to zero.
If you're talking about test error however, then overfitting is going to be the big issue with adding features, and is certainly something you would observe.
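The training-error half of this argument can be checked with a tiny least-squares fit: adding a second feature can only lower (or keep) the training squared error, because setting its weight to zero recovers the one-feature fit. This is a minimal sketch with made-up numbers and no intercept; the helper names and data are assumptions for illustration.

```python
def sse(rows, ys, weights):
    """Sum of squared errors of a linear model without intercept."""
    return sum((y - sum(w * x for w, x in zip(weights, row))) ** 2
               for row, y in zip(rows, ys))


def fit_one_feature(x1, ys):
    return [sum(a * y for a, y in zip(x1, ys)) / sum(a * a for a in x1)]


def fit_two_features(x1, x2, ys):
    """Solve the 2x2 normal equations with Cramer's rule."""
    a = sum(u * u for u in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2)
    d = sum(u * y for u, y in zip(x1, ys))
    e = sum(v * y for v, y in zip(x2, ys))
    det = a * c - b * b
    return [(d * c - b * e) / det, (a * e - b * d) / det]


# Hypothetical training data: x2 is an irrelevant, noisy feature.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [0.3, -0.7, 0.5, -0.2, 0.1]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]

w_one = fit_one_feature(x1, ys)
w_two = fit_two_features(x1, x2, ys)

sse_one = sse([[u] for u in x1], ys, w_one)
sse_two = sse([[u, v] for u, v in zip(x1, x2)], ys, w_two)
print("training SSE, one feature: ", sse_one)
print("training SSE, two features:", sse_two)
```

The second number is never larger than the first on the training data. Nothing similar is guaranteed on held-out data, where the weight fitted to the noisy feature can hurt.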
I can't comment, so I am posting this as an answer.
@agilefall: you are not necessarily wrong. If you are measuring accuracy in terms of the correlation between predicted output and actual output, then the accuracy can decrease as you add more features. Linear regression does not guarantee anything about that.
