What is the difference between LinearRegression and SGDRegressor? - machine-learning

I understand that both the LinearRegression class and the SGDRegressor class from scikit-learn perform linear regression. However, only SGDRegressor uses gradient descent as the optimization algorithm.
Then what is the optimization algorithm used by LinearRegression, and what are the other significant differences between these two classes?

LinearRegression always uses least squares as its loss function.
For SGDRegressor you can specify the loss function, and it uses Stochastic Gradient Descent (SGD) to fit. With SGD you go through the training set one data point at a time and update the parameters according to the gradient of the error.
In simple words: you can train SGDRegressor on a training dataset that does not fit into RAM. You can also update an SGDRegressor model with a new batch of data (via partial_fit) without retraining on the whole dataset.
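A minimal sketch of that incremental training, assuming synthetic data generated chunk by chunk in place of a real file that is too large for memory. The chunk size, number of chunks, and `true_w` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth weights

# Default loss is squared error, i.e. the same objective as least squares.
model = SGDRegressor(random_state=0)

# Pretend each chunk is read from disk; only one chunk is in memory at a time.
for _ in range(200):
    X_chunk = rng.normal(size=(100, 3))
    y_chunk = X_chunk @ true_w + 0.01 * rng.normal(size=100)
    model.partial_fit(X_chunk, y_chunk)  # SGD updates using this chunk only

print(model.coef_)
```

Each call to `partial_fit` continues from the current coefficients, so a new batch can be folded in later without touching the older data.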

To understand the algorithm used by LinearRegression, keep in mind that (in favorable cases) there is an analytical solution, given by a formula, for the coefficients that minimize the sum of squared errors:
theta = (X'X)^(-1) X'Y    (1)
where X' is the transpose of X.
When X'X is not invertible, the inverse can be replaced by the Moore-Penrose pseudo-inverse, computed using singular value decomposition (SVD). Even when X'X is invertible, the SVD-based method is more numerically stable than applying formula (1) directly.
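A small NumPy sketch of both routes, on made-up data. `np.linalg.pinv` computes the pseudo-inverse from an SVD; the explicit `rcond` cutoff is an assumption chosen here so the rank-deficient case is handled predictably.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, 2.0, -3.0])  # hypothetical coefficients
Y = X @ theta_true

# Formula (1): valid only when X'X is invertible.
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ Y

# Pseudo-inverse route (SVD-based).
theta_pinv = np.linalg.pinv(X) @ Y

# Duplicate a column so that X'X becomes singular; formula (1) no longer
# applies, but the pseudo-inverse still gives a least-squares solution
# (the one with the smallest norm).
X_sing = np.hstack([X, X[:, [0]]])
theta_sing = np.linalg.pinv(X_sing, rcond=1e-10) @ Y

print(theta_normal, theta_pinv, theta_sing)
```

In the singular case the weight of the duplicated feature is shared between the two identical columns, and the fitted values are unchanged.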
PS - No LaTeX (MathJaX) in Stackoverflow ???
--
Pierre (from France)

Related

How does Lightgbm (or other boosted trees implementations with 2nd order approximations of the loss) work for L1 losses?

I've been trying to understand how LightGBM handles L1 losses (MAE, MAPE, Huber).
According to this article, the gain during a split should depend only on the first and second derivatives of the loss function. This is due to the fact that Lightgbm uses a second order approximation to the loss function and consequently we can approximate the loss as follows
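For reference, the second-order expansion usually written for boosted trees looks like the sketch below. The notation (g_i, h_i, f_t, lambda, leaf index set I) is the common textbook one and is assumed here, not taken from the linked article.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ \ell\big(y_i, \hat{y}_i^{(t-1)}\big)
  + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Big],
\qquad
g_i = \frac{\partial\, \ell\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 \ell\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}

% Optimal value and score of a leaf holding the samples in I:
w_I^{*} = -\frac{\sum_{i \in I} g_i}{\sum_{i \in I} h_i + \lambda},
\qquad
\text{score}(I) = \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda}
```

Under this form the split gain is a difference of leaf scores, which is where the squared sum of gradients comes from.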
For L1 losses, however, the absolute value of the gradient of the loss is constant and its Hessian is 0. I've also read that, to deal with this, for loss functions with Hessian = 0 we should rather use 1 as the Hessian:
"For these objective function with first_order_gradient is constant, LightGBM has a special treatment for them: (...) it will use the constant gradient for the tree structure learning, but use the residual for the leaf output calculation, with percentile function, e.g. 50% for MAE. This solution is from sklearn, and is proven to work in many benchmarks."
However, even using constant hessian doesn't make sense to me: if for instance when using MAE the gradient is the sign of the error, the squared gradient doesn't give us information. Does it mean that when the gradient is constant, LightGbm does not use the second order approximation, and defaults to traditional gradient boosting?
On the other hand, in the original LightGBM paper, for the GOSS boosting strategy the authors consider the square of the sum of the gradients. I see the same problem as above: if the gradient of the MAE is the sign of the error, how does taking the square of the gradient reflect a gain? Does it mean that GOSS also won't work with loss functions with a constant gradient?
Thanks in advance,
I've asked this in the Lightgbm repo and got this answer:
Before this version, we use the second-order approximation, but its performance actually is not good.
And we switch back to 1) use first-order gradient to find split point; 2) then use the median of residuals for leaf outputs, as shown in the above code.
So it seems LightGBM treats the already implemented L1 losses with traditional first-order gradient boosting. For custom loss functions, it will still try to do the second-order approximation.
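A toy illustration of why the leaf output matters for MAE. This is not LightGBM's code; it only compares, for one hypothetical leaf, a Newton-style leaf value built from the sign gradient and a stand-in constant Hessian of 1 against the median of the residuals.

```python
import numpy as np

# Residuals y - current_prediction for the samples in one hypothetical leaf.
residuals = np.array([-3.0, -1.0, 0.5, 2.0, 40.0])

grad = -np.sign(residuals)       # gradient of |y - f| with respect to f
hess = np.ones_like(residuals)   # stand-in constant Hessian (the true one is 0)

newton_leaf = -grad.sum() / hess.sum()  # second-order style leaf value
median_leaf = np.median(residuals)      # 50th percentile of the residuals

def mae_after(step):
    """Mean absolute error in the leaf after adding `step` to the prediction."""
    return np.abs(residuals - step).mean()

print(newton_leaf, mae_after(newton_leaf))
print(median_leaf, mae_after(median_leaf))
```

The sign gradient only says in which direction to move, so the Newton-style value stays within [-1, 1] whatever the size of the residuals, while the median is the step that minimizes the absolute error inside the leaf.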

What is the exact difference between stochastic gradient descent, mini-batch gradient descent and gradient descent?

I am new to AI. I just learnt GD and about batches for gradient descent. I am confused about what exactly the difference between them is. Any help would be appreciated.
Thanks in advance
All of those methods are first-order optimization methods: they only require knowledge of gradients to minimize finite-sum functions. This means that we minimize a function F that is written as the sum of N functions f_{i}, and we can compute the gradient of each of those functions at any given point.
The GD method consists in using the gradient of F, which is equal to the sum of the gradients of all f_{i}, to do one update, i.e.
x <- x - alpha*grad(F)
Stochastic GD consists in randomly selecting one function f_{i} and doing an update using its gradient, i.e.
x <- x - alpha*grad(f_{i})
So each update is faster, but we need more updates to find the optimum.
Mini-batch GD lies in between those two strategies: it randomly selects m functions f_{i} to do one update.
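The three update rules can be sketched with one function that differs only in the batch size. The setup is assumed for illustration: f_i(x) = 0.5 * (a_i . x - b_i)^2 on synthetic noise-free data, gradients averaged over the batch (a rescaling of the sum by a constant), and hand-picked step sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
A = rng.normal(size=(N, d))
x_true = np.array([1.0, -2.0, 0.5])  # hypothetical minimizer
b = A @ x_true

def batch_grad(x, idx):
    """Average gradient of f_i(x) = 0.5 * (a_i . x - b_i)^2 over the rows in idx."""
    Ai = A[idx]
    return Ai.T @ (Ai @ x - b[idx]) / len(idx)

def run(batch_size, alpha, steps):
    x = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(N, size=batch_size, replace=False)  # random f_i's
        x = x - alpha * batch_grad(x, idx)
    return x

x_gd = run(batch_size=N, alpha=0.1, steps=500)       # GD: all N functions per update
x_sgd = run(batch_size=1, alpha=0.01, steps=20000)   # SGD: one f_i per update
x_mb = run(batch_size=16, alpha=0.05, steps=5000)    # mini-batch: m = 16 per update

print(x_gd, x_sgd, x_mb)
```

Each SGD update touches a single row, so it is far cheaper than a GD update, but many more updates are used here to reach the same point.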
For more information look at this link
Check this.
In both gradient descent (GD) and stochastic gradient descent (SGD), you iteratively update a set of parameters to minimize an error function.
In GD, you have to run through all the samples in your training set to do a single parameter update in a given iteration. In SGD, on the other hand, you use only one training sample, or a subset of training samples, to do the update in a given iteration. If you use a subset, it is called mini-batch stochastic gradient descent.
Thus, if the number of training samples is very large, gradient descent may take too long, because every parameter update requires a pass over the complete training set. SGD, on the other hand, will be faster because each update uses only one training sample, and it starts improving right from the first sample.
SGD often converges much faster than GD, but the error function is not minimized as well as with GD. In most cases, the close approximation that SGD gives for the parameter values is good enough, because the parameters reach the neighborhood of the optimal values and keep oscillating there.
Hope this will help you.

difference between LinearRegression and svm.SVR(kernel="linear")

First, there are questions on this forum very similar to this one, but trust me, none matches, so please do not mark it as a duplicate.
I have encountered two ways of doing linear regression with scikit-learn and I am failing to understand the difference between the two, especially since the first code calls train_test_split() while the other one calls the fit method directly.
I am studying with multiple resources and this single issue is very confusing to me.
The first, which uses SVR:
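import numpy as np
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
# df is assumed to be a pandas DataFrame with a 'label' column
X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
y = np.array(df['label'])
# train_test_split used to live in sklearn.cross_validation (removed in newer versions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = svm.SVR(kernel='linear')
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)  # R^2 on the held-out 20%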
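And the second is this one:
from sklearn import datasets, linear_model
# Assumed setup (not shown in the original excerpt): the diabetes toy dataset
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data
# Split the data into training/testing sets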
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
So my main focus is the difference between using svr(kernel="linear") and using LinearRegression()
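train_test_split (cross_validation.train_test_split in old scikit-learn versions, now in sklearn.model_selection): splits arrays or matrices into random train and test subsets.
In the second code, the split is not random: the last 20 samples are simply held out for testing.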
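A short sketch of the two ways of splitting, on a made-up ten-row array so the difference is easy to see. `random_state` is fixed only to make the example repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Random split: rows are assigned to the train and test sets at random.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Positional split, as in the second snippet: the last rows become the test set.
y_tr_slice, y_te_slice = y[:-2], y[-2:]

print(y_te, y_te_slice)
```

Both approaches hold out the same number of rows. The positional split is only safe when the row order carries no pattern.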
svm.SVR: Support Vector Regression (SVR) uses the same principles as the SVM for classification, with a few differences. Because the output is a real number, it has infinitely many possible values and cannot be predicted exactly, so a margin of tolerance (epsilon) is set: errors smaller than epsilon are not penalized. The algorithm is also more complicated than in the classification case. However, the main idea is the same: minimize the error by finding the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.
Linear Regression: In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression.
Reference:
https://cs.adelaide.edu.au/~chhshen/teaching/ML_SVR.pdf
This is what I found:
Intuitively, like all regressors it tries to fit a line to the data by minimising a cost function. However, the interesting part about SVR is that you can deploy a non-linear kernel. In this case you end up doing non-linear regression, i.e. fitting a curve rather than a line.
This process is based on the kernel trick and the representation of the solution/model in the dual rather than in the primal. That is, the model is represented as combinations of the training points rather than a function of the features and some weights. At the same time the basic algorithm remains the same: the only real change in the process of going non-linear is the kernel function, which changes from a simple inner product to some non linear function.
So SVR also allows non-linear fitting problems, while LinearRegression() only fits a linear model, i.e. a straight line or hyperplane (both can handle any number of features).
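A sketch of that difference on a curve that no straight line can follow. The sine data is made up, and all models use scikit-learn's default hyperparameters, which are not tuned here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel()

lin = LinearRegression().fit(X, y)
svr_lin = SVR(kernel="linear").fit(X, y)  # still a straight line
svr_rbf = SVR(kernel="rbf").fit(X, y)     # non-linear kernel: fits a curve

# R^2 on the training data, just to compare how well each shape can follow the sine.
r2 = {
    "LinearRegression": lin.score(X, y),
    "SVR linear": svr_lin.score(X, y),
    "SVR rbf": svr_rbf.score(X, y),
}
print(r2)
```

The two linear models are limited to a straight line through the data; only the RBF kernel lets the fitted function bend.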
The main difference between these methods is in their mathematical background!
We have samples X and want to predict target Y.
The Linear Regression method just minimizes the least squares error:
for one object the prediction is y = x^T * w, where w is the model's weights.
Loss(w) = Sum_{n=1..N} (x_n^T * w - y_n)^2 --> min over w
As it is a convex function, the global minimum will always be found.
After taking the derivative of Loss with respect to w and rewriting the sums as matrix products you'll get:
w = (X^T * X)^(-1) * (X^T * Y)
So in ML libraries w can be computed from the above formula (I believe sklearn does something equivalent, although it calls a least-squares solver rather than inverting the matrix explicitly).
X is the training samples passed when you call the fit method.
In predict, these weights are just multiplied by X_test.
So the solution is explicit and faster to compute than with iterative methods such as SVM (except for large datasets, where inverting the matrix becomes expensive).
In addition: Lasso and Ridge solve the same task but add a regularization term on the weights to their losses.
For Ridge you can still calculate the weights explicitly; Lasso (L1 penalty) has no closed-form solution in general and is fitted iteratively.
SVR with a linear kernel does almost the same thing, except that its optimization task also maximizes the margin (I apologize, but it is difficult to write down because I didn't find out how to write in TeX format here).
So it uses an iterative numerical solver to find the global optimum.
sklearn's SVR class even has a max_iter parameter, which is typical of iterative solvers.
To sum up: Linear Regression has an explicit solution, while SVM finds an approximation of the exact solution because it is solved numerically.
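A quick check of the explicit formulas against scikit-learn, as a sketch on synthetic data. `fit_intercept=False` is set so the models match the formula without a bias term; the data and `alpha` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = X @ np.array([1.0, 0.0, -2.0, 3.0]) + 0.1 * rng.normal(size=100)

# Least squares: w = (X^T X)^(-1) X^T Y, solved without forming the inverse.
w_ols = np.linalg.solve(X.T @ X, X.T @ Y)
ols = LinearRegression(fit_intercept=False).fit(X, Y)

# Ridge: the L2 penalty adds alpha * I to X^T X before solving.
alpha = 2.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(4), X.T @ Y)
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, Y)

print(np.abs(ols.coef_ - w_ols).max(), np.abs(ridge.coef_ - w_ridge).max())
```

Both differences should be at the level of floating-point round-off, which shows that the fitted coefficients coincide with the closed-form ones even if the library reaches them through a different solver.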

In sklearn what is the difference between a SVM model with linear kernel and a SGD classifier with loss=hinge

I see that in scikit-learn I can build an SVM classifier with a linear kernel in at least 3 different ways:
LinearSVC
SVC with kernel='linear' parameter
Stochastic Gradient Descent with loss='hinge' parameter
Now, I see that the difference between the first two classifiers is that the former is implemented in terms of liblinear and the latter in terms of libsvm.
How do the first two classifiers differ from the third one?
The first two always use the full data and solve a convex optimization problem with respect to these data points.
The latter can treat the data in batches and performs gradient descent aiming to minimize the expected loss with respect to the sample distribution, assuming that the examples are i.i.d. samples from that distribution.
The latter is typically used when the number of samples is very large or unbounded (streaming data). Observe that you can call the partial_fit method and feed it chunks of data.
Hope this helps?
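A sketch of the two usage patterns side by side, on synthetic linearly separable data. The chunk size of 200 and the data itself are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # hypothetical separating direction
y = (X @ w_true > 0).astype(int)

# Batch solver: sees all samples at once.
svc = LinearSVC().fit(X, y)

# Streaming: hinge-loss SGD fed one chunk at a time.
sgd = SGDClassifier(loss="hinge", random_state=0)
classes = np.unique(y)  # partial_fit needs the full label set on the first call
for start in range(0, len(X), 200):
    sgd.partial_fit(X[start:start + 200], y[start:start + 200], classes=classes)

print(svc.score(X, y), sgd.score(X, y))
```

The SGD model never holds more than one chunk, which is what makes it usable when the data arrives as a stream.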

Which Regression methods are suitable for binary valued features and continuous output?

I want to build a machine learning model for regression on a continuous output given binary-valued features (0, 1). The dimension of my problem is around 200.
Which of the following methods seems suitable for this kind of problem?
SVR with different Kernels
Regression random forest
MARS
Gradient boosting with regression tree
Kernel regression (Nadaraya-Watson kernel regression)
LSR and LARS
Stochastic gradient boosting
Intuitively speaking, anything requiring the calculation of a gradient is going to struggle on binary values. From your list, SVR and Forests would be the first place I'd look for a benchmark solution.
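A sketch of such a benchmark, assuming synthetic binary features and a made-up target. Only 20 features are used instead of about 200 to keep it quick, and both models run with mostly default settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 20))  # hypothetical binary (0/1) features
# Hypothetical continuous target with one interaction between features.
y = X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=300)

models = {
    "SVR": SVR(kernel="rbf"),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
print(scores)
```

The same loop can be extended with the other candidates from the list to compare them on the real data.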
You can also look at expectation maximization for Bernoulli mixture models.
It deals with binary input sets. You can find theory in book:
Christopher M. Bishop. "Pattern Recognition and Machine Learning".
