I'm training a simple logistic regression classifier using LIBLINEAR. There are only 3 features, and the label is binary (0/1).
Sample input file:
1 1:355.55660999775586 2:-3.401379785 3:5
1 1:252.43759050148728 2:-3.96044759307 3:9
1 1:294.15085871437088 2:-13.1649273486 3:14
1 1:432.10492221032933 2:-2.72636786196 3:9
0 1:753.80863694081768 2:-12.4841741178 3:14
1 1:376.54927850355756 2:-6.9494008935 3:7
Now, if I use "-s 6", which is "L1-regularized logistic regression", the 10-fold cross-validation accuracy is around 70%, and each iteration finishes within seconds.
But if I use "-s 7", which is "L2-regularized logistic regression (dual)", training exceeds 1,000 iterations and the 10-fold accuracy is only 60%.
Has anybody seen this kind of strange behavior? From my understanding, the only difference between L1 and L2 regularization is whether the penalty term uses abs(x) or pow(x, 2).
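For reference, a minimal way to reproduce this comparison outside the LIBLINEAR command line is through scikit-learn, whose "liblinear" solver is backed by the same library (train.txt below is a placeholder for data in the format shown above):

from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# load_svmlight_file reads the sparse "label index:value" format shown above
X, y = load_svmlight_file("train.txt")

for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0)
    scores = cross_val_score(clf, X, y, cv=10)
    print(penalty, scores.mean())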
Thanks for posting this! I work with LIBLINEAR fairly often and almost always use L2 without thinking about it. This article does a pretty good job of explaining the difference: http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/
Based on that, I'm guessing that you not only have a small number of features but maybe also a small dataset? Have you tried increasing the number of input points?
I don't think it's strange behavior, in my humble opinion. You have to experiment to see which one fits your case better when you have no prior sense of it. Theoretically, L1 regularization gives sparse solutions, a bit like feature selection, while L2 regularization is smoother.
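A tiny sketch of that sparsity point on made-up synthetic data (nothing here comes from the original problem; with only 3 features the effect would be less visible, so this uses 20):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic problem: 20 features, only 5 of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear").fit(X, y)

# L1 drives some coefficients exactly to zero (implicit feature selection);
# L2 only shrinks them toward zero
print("zero coefficients with L1:", int(np.sum(l1.coef_ == 0)))
print("zero coefficients with L2:", int(np.sum(l2.coef_ == 0)))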
I just realized that LIBLINEAR provides two solvers for L2-regularized logistic regression:
0 -- L2-regularized logistic regression (primal)
7 -- L2-regularized logistic regression (dual)
I was using 7, which doesn't converge even after 1000 iterations.
After I switched to 0, it converged very fast and was able to get to ~70% accuracy.
I believe dual vs. primal is mainly a difference in the optimization method, so I think this is probably a numerical/convergence issue.
For more info on dual form vs. primal form:
https://stats.stackexchange.com/questions/29059/logistic-regression-how-to-get-dual-function
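For anyone who wants to poke at this, a rough sketch of the same primal/dual switch through scikit-learn's liblinear wrapper (as far as I can tell, dual=True corresponds to -s 7 and dual=False to -s 0; train.txt is again a placeholder file name). One guess worth testing: the three features in the sample above live on very different scales, and standardizing them may also help the dual solver converge.

from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_svmlight_file("train.txt")

for dual in (False, True):
    clf = make_pipeline(
        StandardScaler(with_mean=False),  # scale features; keep the sparse matrix sparse
        LogisticRegression(penalty="l2", dual=dual, solver="liblinear",
                           max_iter=1000),
    )
    print("dual =", dual, cross_val_score(clf, X, y, cv=10).mean())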
Related
I have used resnet50 to solve a multi-class classification problem. The model outputs probabilities for each class. Which loss function should I choose for my model?
After choosing binary cross entropy:
After choosing categorical cross entropy:
The above results are for the same model with just different loss functions. This model is supposed to classify images into 26 classes, so categorical cross entropy should work.
Also, in the first case the accuracy is about 96% but the losses are very high. Why?
edit 2:
Model architecture:
You definitely need to use categorical_crossentropy for a multi-class classification problem. binary_crossentropy will effectively treat each of the 26 outputs as an independent yes/no decision, which is not what you want here (it also changes how Keras computes the 'accuracy' metric, which is part of why that number looks so high).
I would say that the reason you are seeing high accuracy in the first (and to some extent the second) case is that you are overfitting. The first dense layer you add contains about 8 million parameters (run model.summary() to see this), and you only have 70k images to train it over 8 epochs. This architectural choice is very demanding both in computing power and in data. You are also using a very basic optimizer (SGD); try a more powerful one such as Adam.
Finally, I am a bit surprised by your choice of a 'sigmoid' activation in the output layer. Why not the more standard 'softmax'?
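A minimal sketch of that suggestion, assuming the Keras API (the 2048-dimensional input and the 256-unit hidden layer are placeholders, not the asker's actual architecture):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(2048,)),             # e.g. pooled ResNet50 features
    layers.Dense(256, activation="relu"),
    layers.Dense(26, activation="softmax"),  # softmax over the 26 mutually exclusive classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # targets must be one-hot vectors
              metrics=["accuracy"])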
For a multi-class classification problem you use the categorical_crossentropy loss: what it does is match the ground-truth probability distribution with the one predicted by the model.
This is exactly what is used for multi-class classification; it's a misconception to think you can't use this loss.
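To make that concrete, a tiny worked example of what categorical cross entropy computes for a single sample (the numbers are made up):

import numpy as np

y_true = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot ground truth: the sample belongs to class 2
y_pred = np.array([0.1, 0.1, 0.7, 0.1])  # model's predicted distribution over the classes

# the loss is the negative log-probability assigned to the true class
loss = -np.sum(y_true * np.log(y_pred))  # = -log(0.7), about 0.357
print(loss)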
I have a binary classification problem with a few strong features that can predict almost 100% of the test data, because the problem is relatively simple.
However, the nature of the problem means I can't afford mistakes (let's say), so instead of giving a prediction I am not sure of, I would rather have the output as a probability, set a threshold, and be able to say: "if I am less than 95% sure, I will call this NOT SURE and act accordingly". Saying "I don't know" is better than making a mistake.
So far so good.
For this purpose, I tried a Gaussian Bayes classifier (I have a continuous feature) and Logistic Regression, both of which give me a probability as well as the predicted class.
Coming to my Problem:
GBC has around a 99% success rate while Logistic Regression is lower, around 96%, so I would naturally prefer to use GBC.
However, as successful as GBC is, it is also very sure of itself. The probabilities I get are either 1 or very, very close to 1, such as 0.9999997, which makes things tough for me, because in practice GBC no longer gives me usable probabilities.
Logistic Regression performs worse, but at least gives better, more 'sensible' probabilities.
By the nature of my problem, the cost of misclassification grows as a power of 2: if I misclassify 4 products, I lose 2^4 times as much (it's unit-less, but it gives an idea).
In the end, I would like to classify with a higher success rate than Logistic Regression, but also get meaningful probabilities so I can set a threshold and flag the cases I am not sure of.
Any suggestions?
Thanks in advance.
If you have enough data, you can simply re-tune the probabilities. For example, given the "predicted probability" output of your Gaussian classifier, you can go back through a held-out dataset and, at different prediction values, estimate the empirical probability of the positive class.
Alternatively, you can simply set up an optimization on your holdout set to determine the best threshold (without actually estimating the probability). Since it's one-dimensional, you don't need anything fancy for the optimization: test, say, 500 different thresholds and pick the one that minimizes the cost associated with misclassifications.
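A rough sketch of that threshold sweep, assuming an sklearn-style classifier with predict_proba and a held-out set; the function name, the (X_val, y_val) arrays and the cost numbers are placeholders, not anything taken from the question:

import numpy as np

def best_threshold(clf, X_val, y_val, cost_fp=16.0, cost_fn=16.0, cost_abstain=1.0):
    """Sweep thresholds on a held-out set and return the cheapest one.
    Predictions with probability strictly between (1 - t) and t count as NOT SURE."""
    proba = clf.predict_proba(X_val)[:, 1]
    best_t, best_cost = None, np.inf
    for t in np.linspace(0.5, 0.999, 500):
        pred_pos = proba >= t          # confident positives
        pred_neg = proba <= 1.0 - t    # confident negatives
        abstain = ~(pred_pos | pred_neg)
        cost = (cost_fp * np.sum(pred_pos & (y_val == 0))    # false positives
                + cost_fn * np.sum(pred_neg & (y_val == 1))  # false negatives
                + cost_abstain * np.sum(abstain))            # price of saying NOT SURE
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

If the probabilities themselves need fixing first, scikit-learn's CalibratedClassifierCV (sigmoid or isotonic) implements the "re-tune on held-out data" idea from the first paragraph.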
I wonder why our objective is to maximize AUC, when maximizing accuracy would seem to achieve the same thing.
I think that along with the primary goal of maximizing accuracy, the AUC will automatically be large.
I guess we use AUC because it explains how well our method is able to separate the data independently of a threshold.
For some applications, we particularly want to avoid false positives or false negatives. And when we use accuracy, we already make an a priori choice of the best threshold to separate the data, regardless of specificity and sensitivity.
In binary classification, accuracy is a performance metric of a single model at a single threshold, while the AUC (area under the ROC curve) summarizes performance over the whole range of possible thresholds.
Thanks to this question, I have learnt quite a bit about comparing AUC and accuracy. I don't think there is a direct correlation between the two, and I think this is still somewhat of an open question. At the end of this answer I've added some links that I think will be useful.
One scenario where accuracy fails:
Example Problem
Let's consider a binary classification problem where you evaluate the performance of your model on a data set of 100 samples (98 of class 0 and 2 of class 1).
Take out your sophisticated machine learning model and replace the whole thing with a dumb system that always outputs 0, whatever input it receives.
What is the accuracy now?
Accuracy = Correct predictions/Total predictions = 98/100 = 0.98
We got a stunning 98% accuracy on the "Always 0" system.
Now convert your system into a cancer diagnosis system and start predicting (0 = no cancer, 1 = cancer) on a set of patients. Assuming only a few cases correspond to class 1, you will still achieve high accuracy.
Despite the high accuracy, what is the point of the system if it fails to do well on class 1 (identifying the patients who do have cancer)?
This observation suggests that accuracy is not a good evaluation metric for every type of machine learning problem. The above is known as a class imbalance problem, and there are plenty of practical problems of this nature.
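The same example in a few lines of code (scikit-learn metrics; the 98/2 split mirrors the numbers above):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 98 + [1] * 2)   # 98 samples of class 0, 2 of class 1
y_pred = np.zeros(100, dtype=int)       # the dumb "always 0" system
scores = np.zeros(100)                  # it gives every sample the same score

print(accuracy_score(y_true, y_pred))   # 0.98, looks great
print(roc_auc_score(y_true, scores))    # 0.5, no ability to rank or separate at all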
As for the comparison of accuracy and AUC, here are some links I think will be useful:
An introduction to ROC analysis
Area under curve of ROC vs. overall accuracy
Why is AUC higher for a classifier that is less accurate than for one that is more accurate?
What does AUC stand for and what is it?
Understanding ROC curve
ROC vs. Accuracy vs. AROC
I am trying to build a classifier to predict breast cancer using the UCI dataset. I am using support vector machines. Despite my best efforts to improve the classifier's accuracy, I cannot get beyond 97.062%. I've tried the following:
1. Finding the optimal C and gamma using grid search.
2. Finding the most discriminative features using the F-score.
Can someone suggest techniques to improve the accuracy? I am aiming for at least 99%.
1. The data are already normalized to the range [0, 10]. Will normalizing to [0, 1] help?
2. Is there some other method to find the best C and gamma?
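For what it's worth, here is a sketch of one way to set up point 2 in scikit-learn, with the scaling question from point 1 folded into the search via a pipeline (the parameter grid is just a common log-spaced starting range, not something tuned for this dataset):

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])
param_grid = {
    "svm__C": [0.1, 1, 10, 100, 1000],
    "svm__gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1],
}
search = GridSearchCV(pipe, param_grid,
                      cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
                      scoring="accuracy")
# search.fit(X, y); print(search.best_params_, search.best_score_)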
For SVM it's important to have all features on the same scale. Normally this is done by scaling each feature (column) so that its mean is 0 and its variance is 1; another way is to scale it so that the min and max are, for example, 0 and 1. However, there shouldn't be any real difference between [0, 1] and [0, 10], since you are grid-searching gamma anyway; both should give the same performance.
If you insist on using an SVM for classification, another approach that may give some improvement is ensembling multiple SVMs. If you are using Python, you can try BaggingClassifier from sklearn.ensemble (see the sketch after this answer).
Also note that you can't expect to reach an arbitrarily high accuracy on a real training set. I think 97% is already a very good performance; it is likely you would be overfitting the data if you went much higher than this.
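A rough sketch of that ensembling suggestion (the SVM hyperparameters here are placeholders; also note that in older scikit-learn versions the first argument is called base_estimator rather than estimator):

from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# 25 SVMs, each trained on a random 80% subsample; predictions are combined by voting
bagged_svm = BaggingClassifier(SVC(kernel="rbf", C=10, gamma=0.01),
                               n_estimators=25, max_samples=0.8, random_state=0)
# bagged_svm.fit(X_train, y_train); print(bagged_svm.score(X_test, y_test))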
Some thoughts that came to mind when reading your question and the argument you put forward about the author claiming to have achieved acc = 99.51%.
My first thought was OVERFITTING. I could be wrong, because it may depend on the dataset, but overfitting is the first thing that comes to mind. Now my questions:
1. Did the author state in the article whether the dataset was split into training and testing sets?
2. Was the acc = 99.51% achieved on the training set or the testing set?
On the training set you can hit acc = 99.51% when your model is overfitting.
Generally, in that case the performance of the SVM classifier on unseen data is poor.
I'm working on a binary classification problem using Apache Mahout. The algorithm I use is OnlineLogisticRegression, and the model I currently have strongly tends to produce predictions that are either 1 or 0, without any intermediate values.
Please suggest a way to tune or tweak the algorithm so that it produces more intermediate values in its predictions.
Thanks in advance!
What is the test error rate of the classifier? If it's near zero then being confident is a feature, not a bug.
If the test error rate is high (or at least not low), then the classifier might be overfitting the training set: measure the difference between the training error and the test error. In that case, increasing regularization as rrenaud suggested might help.
If your classifier is not overfitting, then there might be an issue with probability calibration. Logistic regression models (i.e. using the logit link function) should yield reasonably well-calibrated probabilities (if the problem is approximately linearly separable and the labels are not too noisy). You can check the calibration of the probabilities with a plot as explained in this paper. If it really is a calibration issue, then implementing a custom calibration based on Platt scaling or isotonic regression might help fix it.
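A sketch of that calibration check done in scikit-learn rather than Mahout, assuming you can export the held-out labels and predicted probabilities from your model (the arrays below are made-up placeholders):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])        # held-out labels (made up)
y_prob = np.array([0.02, 0.10, 0.98, 0.05, 0.90,
                   0.99, 0.01, 0.97, 0.20, 0.03])         # model probabilities (made up)

# reliability diagram: well-calibrated probabilities hug the diagonal
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle=":", label="perfectly calibrated")
plt.xlabel("mean predicted probability")
plt.ylabel("observed fraction of positives")
plt.legend()
plt.show()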
From reading the Mahout AbstractOnlineLogisticRegression docs, it looks like you can control the regularization parameter lambda. Increasing lambda should mean your weights are closer to 0, and hence your predictions are more hedged.
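To illustrate that intuition outside Mahout, a small scikit-learn sketch on synthetic data (scikit-learn's C is the inverse of the regularization strength, so a small C plays the role of a large lambda):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

for C in (100.0, 1.0, 0.01):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    proba = clf.predict_proba(X)[:, 1]
    # stronger regularization (smaller C) pulls probabilities toward 0.5
    print(f"C={C:>6}: mean |p - 0.5| = {np.mean(np.abs(proba - 0.5)):.3f}")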