Is stochastic gradient descent a classifier or an optimizer? [closed]

I am new to machine learning and I am trying to analyze classification algorithms for a project of mine. I came across SGDClassifier in the sklearn library, but a lot of papers refer to SGD as an optimization technique. Can someone please explain how SGDClassifier is implemented?

Taken from the SGD scikit-learn documentation:
loss="hinge": (soft-margin) linear Support Vector Machine,
loss="modified_huber": smoothed hinge loss,
loss="log": logistic regression

SGD is indeed a technique used to find the minimum of a function.
SGDClassifier is a linear classifier (by default in sklearn it is a linear SVM) that uses SGD for training (that is, it searches for the minimum of the loss using SGD). According to the documentation:
SGDClassifier implements linear classifiers (SVM, logistic regression, a.o.)
with SGD training.
This estimator implements regularized linear models with stochastic
gradient descent (SGD) learning: the gradient of the loss is estimated
each sample at a time and the model is updated along the way with a
decreasing strength schedule (aka learning rate). SGD allows minibatch
(online/out-of-core) learning, see the partial_fit method. For best
results using the default learning rate schedule, the data should have
zero mean and unit variance.
This implementation works with data represented as dense or sparse
arrays of floating point values for the features. The model it fits
can be controlled with the loss parameter; by default, it fits a
linear support vector machine (SVM).
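To make the distinction concrete, here is a minimal sketch (on synthetic data, so the numbers are purely illustrative): the classifier is SGDClassifier, SGD is just its training procedure, and changing the loss parameter changes which model is fit.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

# Same optimizer (SGD), different models depending on the loss:
# hinge -> linear SVM, log_loss -> logistic regression
# (spelled "log" in older scikit-learn versions).
for loss in ("hinge", "log_loss"):
    # The docs recommend zero-mean, unit-variance features, hence the scaler.
    clf = make_pipeline(StandardScaler(), SGDClassifier(loss=loss, random_state=0))
    clf.fit(X, y)
    print(loss, clf.score(X, y))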

SGDClassifier is a linear classifier which implements regularized linear models with stochastic gradient descent (SGD) learning.
Other classifiers trained in a similar incremental fashion:
from sklearn.linear_model import (LogisticRegression,
                                  PassiveAggressiveClassifier,
                                  Perceptron, SGDClassifier)

# X is assumed to be the training feature matrix, shape (n_samples, n_features).
classifiers = [
    ("ASGD", SGDClassifier(average=True, max_iter=100)),
    ("Perceptron", Perceptron(tol=1e-3)),
    ("Passive-Aggressive I", PassiveAggressiveClassifier(loss='hinge',
                                                         C=1.0, tol=1e-4)),
    ("Passive-Aggressive II", PassiveAggressiveClassifier(loss='squared_hinge',
                                                          C=1.0, tol=1e-4)),
    # SAG is another stochastic solver; here C is scaled by the sample count.
    ("SAG", LogisticRegression(solver='sag', tol=1e-1, C=1.e4 / X.shape[0])),
]
Stochastic gradient descent (sgd) is a solver. It is a simple and efficient approach for discriminative learning of linear classifiers under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression.
Alternative solvers to sgd in neural_network.MLPClassifier are lbfgs and adam:
solver : {‘lbfgs’, ‘sgd’, ‘adam’}, default ‘adam’
The solver for weight optimization.
‘lbfgs’ is an optimizer in the family of quasi-Newton methods
‘sgd’ refers to stochastic gradient descent.
‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba
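A minimal sketch (toy data, purely illustrative) of swapping the solver while the network itself stays the same:

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)

# The architecture is fixed; only the weight optimizer changes.
for solver in ("lbfgs", "sgd", "adam"):
    mlp = MLPClassifier(solver=solver, max_iter=500, random_state=0)
    mlp.fit(X, y)
    print(solver, mlp.score(X, y))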
Details about the implementation of SGDClassifier can be found on the SGDClassifier documentation page.
In brief:
This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). SGD allows minibatch (online/out-of-core) learning
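The partial_fit method mentioned there is what enables the minibatch/out-of-core mode. A minimal sketch (synthetic chunks, my own illustration, not from the documentation):

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="hinge")
classes = np.array([0, 1])  # all labels must be declared on the first call

# Out-of-core learning: stream the data in chunks via partial_fit.
for chunk in range(10):
    X_chunk = rng.normal(size=(100, 5))
    y_chunk = (X_chunk[:, 0] > 0).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)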

Let's break each word down into its plain-English meaning:
Stochastic - random,
Gradient - slope,
Descent - downwards
Basically, this technique is used as an "optimizing algorithm" for finding the parameters that minimize a convex loss/cost function.
That is how we find the line (or hyperplane) with minimal loss for linear classifiers, i.e. SVM and logistic regression.
Some of the other forms in which gradient descent is performed:
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch
For more details on the above, please go through the referred link; a minimal sketch of the three variants follows it:
https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/
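A minimal NumPy sketch (my own illustration, on a toy linear least-squares problem) showing that the three variants differ only in how many samples feed each gradient step:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

def gradient(w, X_batch, y_batch):
    # Gradient of the mean squared error of a linear model.
    return 2 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

w, lr = np.zeros(3), 0.05
for epoch in range(50):
    # Batch GD would do one update per epoch:  w -= lr * gradient(w, X, y)
    # Mini-batch GD would update on small random chunks, e.g. 10 samples.
    # Stochastic GD updates on one sample at a time:
    for i in rng.permutation(len(y)):
        w -= lr * gradient(w, X[i:i+1], y[i:i+1])
print(w)  # should end up close to w_true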

Related

Learning rate & gradient descent difference?

What is the difference between the two? Both serve to reach the minimum point (lowest loss) of a function, for example.
I understand (I think) that the learning rate is multiplied by the gradient (slope) to take the gradient descent step, but is that so? Am I missing something?
What is the difference between the learning rate and the gradient?
Thanks
Deep learning neural networks are trained using the stochastic gradient descent algorithm.
Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm, referred to as simply backpropagation.
The amount that the weights are updated during training is referred to as the step size or the “learning rate.”
Specifically, the learning rate is a configurable hyperparameter
used in the training of neural networks that has a small positive
value, often in the range between 0.0 and 1.0.
The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more training epochs given the smaller changes made to the weights each update, whereas larger learning rates result in rapid changes and require fewer training epochs.
A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.
The challenge of training deep learning neural networks involves carefully selecting the learning rate. It may be the most important hyperparameter for the model.
The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.
— Page 429, Deep Learning, 2016.
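So yes, the update really is the gradient scaled by the learning rate: the gradient gives the direction (and raw magnitude) of the step, and the learning rate shrinks it to a step size. A minimal sketch (my own illustration, minimizing f(w) = w^2, whose gradient is 2w):

# Gradient descent on f(w) = w**2; the gradient is f'(w) = 2*w.
w = 5.0
lr = 0.1  # the learning rate scales the gradient to give the step size
for step in range(100):
    grad = 2 * w       # points uphill; its size depends on where we are
    w = w - lr * grad  # step downhill, scaled by the learning rate
print(w)  # approaches the minimum at w = 0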
For more on what the learning rate is and how it works, see the post:
How to Configure the Learning Rate Hyperparameter When Training Deep Learning Neural Networks
You can also refer to: Understand the Impact of Learning Rate on Neural Network Performance

Weak Learners of Gradient Boosting Tree for Classification/ Multiclass Classification

I am a beginner in the machine learning field and I want to learn how to do multiclass classification with Gradient Boosting Trees (GBT). I have read some articles about GBT, but only for regression problems, and I couldn't find a proper explanation of GBT for multiclass classification. I also checked GBT in the scikit-learn machine learning library. Its implementation is GradientBoostingClassifier, which uses regression trees as the weak learners for multiclass classification.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
Source : http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
The thing is, why do we use regression trees as the learners in GBT instead of classification trees? It would be very helpful if someone could explain why regression trees are used rather than classification trees, and how a regression tree can do classification. Thank you
You are interpreting 'regression' too literally here (as numeric prediction), which is not the case; remember, classification is handled with logistic regression. See, for example, the entry for loss in the documentation page you have linked:
loss : {‘deviance’, ‘exponential’}, optional (default=’deviance’)
loss function to be optimized. ‘deviance’ refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss ‘exponential’ gradient boosting recovers the AdaBoost algorithm.
So, a 'classification tree' is just a regression tree with loss='deviance'...
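A minimal sketch (toy data, purely illustrative) showing that GradientBoostingClassifier fits one regression tree per class per boosting stage on a 3-class problem:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)

gbt = GradientBoostingClassifier(n_estimators=10, random_state=0)
gbt.fit(X, y)

# estimators_ has shape (n_stages, n_classes): each stage fits one
# regression tree per class on the negative gradient of the deviance.
print(gbt.estimators_.shape)     # (10, 3)
print(gbt.predict_proba(X[:2]))  # probabilistic outputs, as the docs say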

Why stochastic gradient descent does not support non-linear SVM

I have read that SGD supports linear SVMs but not non-linear SVMs. Why is that? I was looking at the cost function of the non-linear SVM. It does have a "sum" sign at the beginning.
Please read about Mercer's theorem. Maybe it will shed some light!
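In short: a kernelized SVM's decision function is a weighted sum over training points rather than a fixed weight vector, so plain SGD on a parameter vector does not apply directly. One common workaround (my own addition, not from the answer above) is to approximate the kernel with an explicit feature map and then train a linear SVM on it with SGD; a sketch using scikit-learn's Nystroem transformer:

from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=500, noise=0.1, random_state=0)

# Nystroem maps the data into an approximate RBF feature space,
# where a linear SVM trained by SGD can separate the classes.
clf = make_pipeline(
    Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
)
clf.fit(X, y)
print(clf.score(X, y))  # near-perfect on this non-linear toy problem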

Which Regression methods are suitable for binary valued features and continuous output?

I want to build a machine learning model to regress a continuous output from binary-valued features (0/1). The dimensionality of my problem is around 200.
Which of the following methods seems suitable for this kind of problem?
SVR with different Kernels
Regression random forest
MARS
Gradient boosting with regression tree
Kernel regression (Nadaraya-Watson kernel regression)
LSR and LARS
Stochastic gradient boosting
Intuitively speaking, anything requiring the calculation of a gradient is going to struggle on binary values. From your list, SVR and random forests would be the first places I'd look for a benchmark solution.
You can also look at expectation maximization for Bernoulli mixture models.
It deals with binary input sets. You can find the theory in the book:
Christopher M. Bishop, "Pattern Recognition and Machine Learning".
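A minimal sketch (synthetic binary features, my own illustration) of the random-forest benchmark suggested above:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 200))  # 200 binary features, as in the question
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=500)

# Tree splits handle 0/1 features naturally: no gradients, no scaling needed.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(forest, X, y, cv=5, scoring="r2").mean())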

Difference between classification and regression, with SVMs

What is the exact difference between a Support Vector Machine classifier and a Support Vector Machine regression machine?
The one-sentence answer is that an SVM classifier performs binary classification and SVM regression performs regression.
While they perform very different tasks, both are characterized by the following points:
usage of kernels
absence of local minima
sparseness of the solution
capacity control obtained by acting on the margin
number of support vectors, etc.
For SVM classification the hinge loss is used; for SVM regression the epsilon-insensitive loss function is used.
SVM classification is more widely used and in my opinion better understood than SVM regression.
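A minimal sketch (toy data, illustrative only) contrasting the two estimators in scikit-learn; both use kernels and support vectors, but one predicts labels and the other continuous values:

import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# SVC: binary labels, trained with the hinge loss.
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
print(SVC(kernel="rbf").fit(X, y_class).predict(X[:3]))

# SVR: continuous targets, trained with the epsilon-insensitive loss.
y_reg = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)
print(SVR(kernel="rbf", epsilon=0.1).fit(X, y_reg).predict(X[:3]))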
