Why should we use Lasso over Linear regression for feature selection in machine learning? - machine-learning

When selecting features in machine learning, one can use Lasso regression to identify the least important features by looking for the smallest coefficients, but we can do the same using Linear Regression.
Linear regression:
$Y = x_0 + x_1 b_1 + x_2 b_2 + \dots + x_n b_n$
Here $x_1, x_2, \dots, x_n$ are the coefficients. Using gradient descent we get the best coefficients, and we can remove the features that have the smallest coefficients. If this is possible using Linear Regression, why should one use Lasso Regression?
Am I missing something? Please help.

Lasso is a regularization technique that is used to avoid overfitting when you train your model. When you do not use any regularization, your loss function just tries to minimize the difference between the predicted value and the real value, min |y_pred - y|.
To minimize this loss function, gradient descent changes the coefficients of your model. This step may cause overfitting, because the optimization only wants to minimize the difference between prediction and real value. To solve this issue, regularization techniques add a penalty term to the loss function based on the magnitude of the coefficients. This way, when your model tries to minimize the difference between predicted and real values, it also tries not to increase the coefficients too much.
As you mentioned, you can select features in both ways; however, the Lasso technique also takes care of the overfitting problem.
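To make the difference concrete, here is a minimal sketch assuming scikit-learn and synthetic data (the alpha value is an arbitrary illustrative choice). Lasso's L1 penalty drives the coefficients of uninformative features exactly to zero, whereas plain least squares only makes them small:

```python
# Minimal sketch: compare ordinary least squares and Lasso coefficients on
# synthetic data where only 3 of 10 features are actually informative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)   # alpha controls the strength of the L1 penalty

print("OLS coefficients:  ", np.round(ols.coef_, 2))    # small but non-zero everywhere
print("Lasso coefficients:", np.round(lasso.coef_, 2))  # exactly zero for uninformative features
print("Features kept by Lasso:", np.flatnonzero(lasso.coef_))
```

With plain linear regression you would still have to pick a threshold on the coefficient magnitudes (and those magnitudes depend on feature scaling), while Lasso makes the cut-off part of the optimization itself.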

Related

Should I find regularization parameter or degree first for Support Vector Regression algorithm?

I am working on a problem to predict the revenue generated by a film. I am using sklearn's support vector regression algorithm with a polynomial kernel. I tried to find the degree which gives the best accuracy using the default value of the regularization parameter, but I got error percentages in the 7-digit range. So, I decided to increase the variance by tuning the regularization parameter.
So should I first assume a degree and find the regularization parameter which gives the best result, or vice versa? Or is there something else that I should consider?
Generally, doing a grid search over the degree and the regularization parameter together is common practice. There is some information about this in scikit-learn's documentation here:
https://scikit-learn.org/stable/modules/grid_search.html
This lets you define a dictionary of the kernels you want to try (rbf, poly, etc.) with their respective hyperparameters and the regularization parameter, and then search for the combination that works best.
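For instance, a minimal sketch of such a grid search with scikit-learn (the pipeline and the parameter values are illustrative assumptions, not tuned for the revenue problem):

```python
# Minimal sketch: jointly search over the SVR kernel degree and the
# regularization parameter C (plus epsilon) instead of fixing one at a time.
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(), SVR(kernel="poly"))
param_grid = {
    "svr__degree": [2, 3, 4],          # polynomial degree
    "svr__C": [0.1, 1, 10, 100],       # regularization parameter
    "svr__epsilon": [0.01, 0.1, 1.0],  # width of the epsilon-insensitive tube
}
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_mean_absolute_error", n_jobs=-1)
# search.fit(X_train, y_train)   # X_train, y_train: your film-revenue data (placeholders)
# print(search.best_params_, search.best_score_)
```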

Local and global minima of the cost function in logistic regression

I'm confused about the idea behind the minima in the derivation of the logistic regression formula.
The idea is to make the hypothesis as good as possible (i.e. get the predicted probability of the correct class as close to 1 as possible), which in turn requires minimising the cost function $J(\theta)$ as much as possible.
Now I've been told that for this all to work, the cost function must be convex. My understanding of convexity requires there to be no maxima, and therefore there can only be one minimum, the global minimum. Is this really the case? If it's not, please explain why not. Also, if it's not the case, then that implies the possibility of multiple minima in the cost function, implying multiple sets of parameters yielding higher and higher probabilities. Is this possible? Or can I be certain the returned parameters refer to the global minimum and hence the highest probability / best prediction?
The fact that we use a convex cost function does not guarantee a convex problem.
There is a distinction between a convex cost function and a convex method.
The typical cost functions you encounter (cross entropy, absolute loss, least squares) are designed to be convex.
However, the convexity of the problem depends also on the type of ML algorithm you use.
Linear algorithms (linear regression, logistic regression, etc.) give you convex optimization problems, that is, they will converge to the global optimum. When using neural nets with hidden layers, however, you are no longer guaranteed a convex problem.
Thus, convexity describes your method, not only your cost function!
Logistic regression is a linear classification method, so you get a convex optimization problem each time you use it! However, if the data is not linearly separable, it might not give a solution, and it definitely won't give you a good solution in that case.
Yes, both Logistic Regression and Linear Regression aim to find weights and biases that improve the accuracy of the model (or, say, work well with high probability on test or real-world data). To achieve that, we look for weights and biases with the least deviation (cost) between predictions and real outcomes. So, if we plot the cost function and find its minimum, that achieves the same purpose. Hence we use a model whose cost function has a single global minimum (i.e. the cost function should be convex).
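For reference, the cost $J(\theta)$ both answers refer to is the cross-entropy (log) loss, which is convex in $\theta$ for the standard sigmoid hypothesis, so gradient descent can only converge to a global minimum:

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big],
\qquad h_\theta(x) = \frac{1}{1+e^{-\theta^{\top}x}}.
$$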

What's the relationship between an SVM and hinge loss?

My colleague and I are trying to wrap our heads around the difference between logistic regression and an SVM. Clearly they are optimizing different objective functions. Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? How do the support vectors come into play? What about the slack variables? Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
I will answer one thing at a time.
Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss?
SVM is simply a linear classifier, optimizing hinge loss with L2 regularization.
Or is it more complex than that?
No, it is "just" that; however, there are different ways of looking at this model that lead to complex, interesting conclusions. In particular, this specific choice of loss function leads to extremely efficient kernelization, which is not true for log loss (logistic regression) or mse (linear regression). Furthermore, you can show very important theoretical properties, such as those related to the Vapnik-Chervonenkis dimension, leading to a smaller chance of overfitting.
Intuitively, look at these three common losses, written in terms of the label y ∈ {-1, +1} and the raw prediction p:
hinge: max(0, 1 - py)
log: log(1 + exp(-py))
mse: (p - y)^2
Only the first one has the property that once something is classified correctly with enough margin (py ≥ 1), it incurs zero penalty. All the remaining ones still penalize your linear model even if it classifies samples correctly. Why? Because they are more related to regression than to classification: they want a perfect prediction, not just a correct one.
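A quick numerical check of that claim (plain NumPy; the point y = +1 with raw score p = 2 is an illustrative choice):

```python
# Minimal sketch: evaluate the three losses at a point that is classified
# correctly with margin (y = +1, raw score p = 2). Only the hinge loss is
# exactly zero; log loss and squared error keep penalizing the model.
import numpy as np

y, p = 1.0, 2.0
hinge = max(0.0, 1.0 - p * y)        # max(0, 1 - py)
log   = np.log1p(np.exp(-p * y))     # log(1 + exp(-py))
mse   = (p - y) ** 2                 # (p - y)^2

print(f"hinge = {hinge:.3f}, log = {log:.3f}, mse = {mse:.3f}")
# hinge = 0.000, log = 0.127, mse = 1.000
```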
How do the support vectors come into play?
Support vectors are simply samples placed near the decision boundary (loosely speaking). For the linear case it does not change much, but as most of the power of SVM lies in its kernelization, there the SVs are extremely important. Once you introduce a kernel, the hinge loss lets the SVM solution be obtained efficiently, and the support vectors are the only samples remembered from the training set; the non-linear decision boundary is thus built from a subset of the training data.
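A small illustration of that last point with scikit-learn (dataset and hyperparameters are arbitrary): after fitting a kernel SVM, only the support vectors are stored and used by the decision function.

```python
# Minimal sketch: fit an RBF-kernel SVM and check how many of the training
# samples end up as support vectors (the only samples the model keeps).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print("training samples:", X.shape[0])
print("support vectors :", clf.support_vectors_.shape[0])
```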
What about the slack variables?
This is just another way of writing the hinge loss, more useful when you want to kernelize the solution and show convexity.
Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
You can; however, as SVM is not a probabilistic model, its training might be a bit tricky. Furthermore, the whole strength of SVM comes from its efficiency and global solution, both of which would be lost once you create a deep network. Such models do exist, though; in particular, SVM (with squared hinge loss) is nowadays often the choice for the topmost layer of deep networks, so the whole optimization is actually a deep SVM. Adding more layers in between has nothing to do with SVM or any other cost: those layers are defined completely by their activations. You could, for example, use an RBF activation function, but it has been shown numerous times that this leads to weak models (only too-local features are detected).
To sum up:
there are deep SVMs; this is simply a typical deep neural network with an SVM layer on top.
there is no such thing as putting an SVM layer "in the middle", as the training criterion is only applied to the output of the network.
using "typical" SVM kernels as activation functions is not popular in deep networks due to their locality (as opposed to the very global relu or sigmoid).
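As a hypothetical sketch of the first point (the answer names no framework, so Keras is an assumption here, and the layer sizes and input dimension are placeholders): an ordinary deep network whose top layer is linear and trained with the squared hinge loss plus L2 weight decay, i.e. an SVM sitting on top of learned features.

```python
# Minimal sketch: a "deep SVM" in the sense described above - a standard deep
# network with a linear, L2-regularized output layer and the squared hinge loss.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),  # placeholder input dim
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
])
model.compile(optimizer="adam", loss="squared_hinge")
# model.fit(X_train, y_train)   # labels must be encoded as -1 / +1 for the hinge loss
```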

SGD model "overconfidence"

I'm working on a binary classification problem using Apache Mahout. The algorithm I use is OnlineLogisticRegression, and the model I currently have strongly tends to produce predictions that are either 1 or 0, without any intermediate values.
Please suggest a way to tune or tweak the algorithm to make it produce more intermediate values in predictions.
Thanks in advance!
What is the test error rate of the classifier? If it's near zero then being confident is a feature, not a bug.
If the test error rate is high (or at least not low), then the classifier might be overfitting the training set: measure the difference between the training error and the test error. In that case, increasing regularization as rrenaud suggested might help.
If your classifier is not overfitting, then there might be an issue with the probability calibration. Logistic regression models (e.g. using the logit link function) should yield reasonably well calibrated probabilities (if the problem is approximately linearly separable and the labels are not too noisy). You can check the calibration of the probabilities with a plot as explained in this paper. If this really is a calibration issue, then implementing a custom calibration based on Platt scaling or isotonic regression might help fix it.
From reading the Mahout AbstractOnlineLogisticRegression docs, it looks like you can control the regularization parameter lambda. Increasing lambda should mean your weights are closer to 0, and hence your predictions are more hedged.
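To see the effect that last answer describes, here is a minimal sketch with scikit-learn rather than Mahout (SGDClassifier's alpha plays the role of Mahout's lambda in this analogy; the data and the alpha values are illustrative):

```python
# Minimal sketch: increasing the regularization strength shrinks the weights,
# which pulls the predicted probabilities away from the extremes 0 and 1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for alpha in (1e-6, 1e-3, 1e-1):
    clf = SGDClassifier(loss="log_loss", alpha=alpha,   # loss is called "log" in older scikit-learn
                        random_state=0).fit(X, y)
    proba = clf.predict_proba(X)[:, 1]
    print(f"alpha={alpha:g}  mean |p - 0.5| = {np.abs(proba - 0.5).mean():.3f}")
# Larger alpha -> probabilities closer to 0.5 (more "hedged"), at some cost in fit.
```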

What's the meaning of logistic regression dataset labels?

I've been learning about logistic regression for a few days, and I think the labels of a logistic regression dataset need to be 1 or 0. Is that right?
But when I look at the libSVM library's regression datasets, I see the label values are continuous numbers (e.g. 1.0086, 1.0089, ...). Did I miss something?
Note that the libSVM library can also be used for regression problems.
Thanks so much!
Contrary to its name, logistic regression is a classification algorithm and it outputs class probability conditioned on the data point. Therefore the training set labels need to be either 0 or 1. For the dataset you mentioned, logistic regression is not a suitable algorithm.
SVM is a classification algorithm and it uses input labels of -1 or 1. It is not a probabilistic algorithm and it doesn't output class probabilities. It can also be adapted to regression.
Are you using a 3rd-party library or programming this yourself? Generally, the labels are used as ground truth so you can see how effective your approach was.
For example, if your algorithm is trying to predict what a particular instance is, it might output -1 while the ground-truth label is +1, which means you did not successfully classify that particular instance.
Note that "regression" is a general term. To say someone will perform regression analysis doesn't necessarily tell you what algorithm they will be using, nor everything about the nature of the data sets. All it really tells you is that you have a set of samples with features which you want to use to predict a single outcome value (a model for conditional probability).
One major difference between logistic regression and linear regression is that the former is usually trained on categorical, binary-labeled sample sets; while the latter is trained on real-labeled (ℝ) sample sets.
Any time your labels are real-valued, it means you're probably going to use linear regression or something similar, or else convert those real-valued labels to categorical labels (e.g. via thresholds or bins) if you want to in fact use logistic regression. There is potentially a big difference in the quality and interpretation of your results, though, if you convert from one such problem setup to the other.
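For example, a minimal sketch (plain NumPy; the threshold and the extra label values are illustrative) of turning real-valued labels like those in the libSVM regression dataset into binary labels so that logistic regression can be applied:

```python
# Minimal sketch: binarize real-valued labels with a threshold (here the median)
# so a classifier such as logistic regression can be trained on them.
import numpy as np

y_real = np.array([1.0086, 1.0089, 0.9971, 1.0012, 0.9904])  # illustrative values
threshold = np.median(y_real)
y_binary = (y_real >= threshold).astype(int)   # labels in {0, 1}
print(y_binary)   # [1 1 0 1 0]
```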
See also Regression Analysis.

Resources