Why stochastic gradient descent does not support non-linear SVM - machine-learning

I have read that SGD supports linear SVMs, but not non-linear SVMs. Why is that? I was looking at the cost function of the non-linear SVM, and it does have a "sum" sign at the beginning.

Please read about Mercer's theorem; maybe it will shed some light!
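To make this concrete: a linear SVM's primal objective is a plain sum of per-sample hinge losses over explicit features, so SGD can take a gradient step one sample at a time; with a non-linear kernel the weight vector lives in an implicit feature space defined by the kernel, so plain SGD over explicit features does not apply directly. One common workaround is to make the kernel feature map explicit with an approximation and then train with SGD. The sketch below uses scikit-learn's RBFSampler for this; data and parameter values are illustrative, not tuned.
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_circles(n_samples=500, noise=0.1, factor=0.3, random_state=0)

# Plain linear SVM trained with SGD: cannot separate concentric circles well.
linear_svm = SGDClassifier(loss="hinge", max_iter=1000, random_state=0).fit(X, y)

# Explicit (approximate) RBF feature map, then a linear SVM trained with SGD.
approx_kernel_svm = make_pipeline(
    RBFSampler(gamma=2.0, n_components=300, random_state=0),
    SGDClassifier(loss="hinge", max_iter=1000, random_state=0),
).fit(X, y)

print("linear SVM accuracy:", linear_svm.score(X, y))
print("approx. kernel SVM accuracy:", approx_kernel_svm.score(X, y))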

Related

Learning algorithm to implement XOR gate

I know we can't use the perceptron learning algorithm to implement an XOR gate, because XOR is not linearly separable. So my question is: which learning algorithm and which neural network can we use to implement an XOR gate? I tried using the delta rule, but it is not producing the desired weight matrix.
Thank you!
A two-layer MLP (multi-layer perceptron) will do the trick.
Consider this article.
By the way, Wikipedia reads:
The delta rule is a gradient descent learning rule for updating the
weights of the inputs to artificial neurons in a single-layer neural
network.
The "single-layer neural network" here is the issue. As you said, a simple (single layer) perceptron does not have the representational power to capture XOR.

Is Stochastic gradient descent a classifier or an optimizer? [closed]

I am new to machine learning and I am trying to analyze classification algorithms for a project of mine. I came across SGDClassifier in the sklearn library, but a lot of papers refer to SGD as an optimization technique. Can someone please explain how SGDClassifier is implemented?
Taken from the SGD scikit-learn documentation:
loss="hinge": (soft-margin) linear Support Vector Machine,
loss="modified_huber": smoothed hinge loss,
loss="log": logistic regression
SGD is indeed a technique that is used to find the minima of a function.
SGDClassifier is a linear classifier (by default in sklearn it is a linear SVM) that uses SGD for training (that is, looking for the minima of the loss using SGD). According to the documentation:
SGDClassifier is a Linear classifiers (SVM, logistic regression, a.o.)
with SGD training.
This estimator implements regularized linear models with stochastic
gradient descent (SGD) learning: the gradient of the loss is estimated
each sample at a time and the model is updated along the way with a
decreasing strength schedule (aka learning rate). SGD allows minibatch
(online/out-of-core) learning, see the partial_fit method. For best
results using the default learning rate schedule, the data should have
zero mean and unit variance.
This implementation works with data represented as dense or sparse
arrays of floating point values for the features. The model it fits
can be controlled with the loss parameter; by default, it fits a
linear support vector machine (SVM).
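To make the distinction concrete, here is a minimal sketch (toy data, illustrative settings): SGDClassifier is the linear model being fitted, and SGD is merely the optimizer that fits it; changing the loss changes which linear model you get.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Scaling to zero mean / unit variance, as the documentation above recommends.
svm_like = make_pipeline(StandardScaler(),
                         SGDClassifier(loss="hinge", max_iter=1000, random_state=0))
logreg_like = make_pipeline(StandardScaler(),
                            SGDClassifier(loss="log_loss",  # older versions spell this "log"
                                          max_iter=1000, random_state=0))

print("hinge loss (linear SVM):", svm_like.fit(X, y).score(X, y))
print("log loss (logistic regression):", logreg_like.fit(X, y).score(X, y))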
SGDClassifier is a linear classifier which implements regularized linear models with stochastic gradient descent (SGD) learning.
Other classifiers (imports added; X is assumed to be the training feature matrix, defined elsewhere):
from sklearn.linear_model import (LogisticRegression, PassiveAggressiveClassifier,
                                  Perceptron, SGDClassifier)

classifiers = [
    ("ASGD", SGDClassifier(average=True, max_iter=100)),
    ("Perceptron", Perceptron(tol=1e-3)),
    ("Passive-Aggressive I", PassiveAggressiveClassifier(loss='hinge',
                                                         C=1.0, tol=1e-4)),
    ("Passive-Aggressive II", PassiveAggressiveClassifier(loss='squared_hinge',
                                                          C=1.0, tol=1e-4)),
    ("SAG", LogisticRegression(solver='sag', tol=1e-1, C=1.e4 / X.shape[0])),
]
Stochastic Gradient Descent (SGD) is a solver. It is a simple and efficient approach for discriminative learning of linear classifiers under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression.
Alternative solvers to sgd in neural_network.MLPClassifier are lbfgs and adam:
solver : {‘lbfgs’, ‘sgd’, ‘adam’}, default ‘adam’
The solver for weight optimization.
‘lbfgs’ is an optimizer in the family of quasi-Newton methods
‘sgd’ refers to stochastic gradient descent.
‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba
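As a small, hedged illustration (toy data and settings; relative results depend heavily on the dataset), the same MLPClassifier can be trained with each of the three solvers:
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for solver in ("lbfgs", "sgd", "adam"):
    clf = MLPClassifier(hidden_layer_sizes=(10,), solver=solver,
                        max_iter=500, random_state=0)
    clf.fit(X, y)  # may warn about convergence; settings are illustrative only
    print(solver, clf.score(X, y))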
Details about the implementation of SGDClassifier can be found on the SGDClassifier documentation page.
In brief:
This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). SGD allows minibatch (online/out-of-core) learning
Let's break each word down into its plain-English meaning:
Stochastic - Random,
Gradient - slope,
Descent - downwards
Basically, this technique is used as an optimization algorithm for finding the parameters that minimize a convex loss/cost function.
That is how we find the separating line (hyperplane) with minimal loss for linear classifiers, i.e. SVM and logistic regression.
Gradient descent comes in a few variants:
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch Gradient Descent
For more details on the above, please go through this link:
https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/
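To make the "optimizer" view concrete, here is a bare-bones sketch of SGD minimizing a regularized hinge loss one sample at a time with a decreasing learning rate. This is only an illustration of the idea, not scikit-learn's actual implementation; the data and the step-size schedule are made up.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # labels in {-1, +1}

w, b, lam = np.zeros(2), 0.0, 0.01               # weights, bias, L2 strength
t = 0
for epoch in range(5):
    for i in rng.permutation(len(X)):            # one random sample at a time
        t += 1
        eta = 0.5 / (1.0 + 0.01 * t)             # decreasing learning-rate schedule (illustrative)
        margin = y[i] * (X[i] @ w + b)
        if margin < 1:                           # hinge loss is active: step on loss + regularizer
            w += eta * (y[i] * X[i] - lam * w)
            b += eta * y[i]
        else:                                    # only the L2 regularizer contributes
            w -= eta * lam * w

print("learned weights:", w, "bias:", b)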

When should I use linear neural networks and when non-linear?

I am using feed forward, gradient descent backpropagation neural networks.
Currently I have only worked with non-linear networks where tanh is the activation function.
I was wondering.
What kind of tasks would you give to a neural networks with non-linear activation function and what kind of tasks for linear?
I know that networks with a linear activation function are used to solve linear problems.
What are those linear problems?
Any examples?
Thanks!
I'd say never: since a composition of linear functions is still linear, using a neural network with linear activations is just a complicated way of doing linear regression.
Whether to choose a linear model or something more complicated is up to you and depends on the data you have; this is (one of the reasons) why it is customary to hold out some data during training and use it to validate the model. Other ways of testing models are residual analysis, hypothesis testing, and so on.
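A quick numerical check of the "composition of linear functions is still linear" point (random weights, purely for illustration):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # "hidden" linear layer
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # output linear layer

two_layers = W2 @ (W1 @ x + b1) + b2                   # network with linear activations
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)             # equivalent single linear layer

print(np.allclose(two_layers, one_layer))              # True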

Begining to code Logistic regression in java

I want to code the logistic regression (classification) algorithm using Java.
The hypothesis is h(x) = 1 / (1 + e^(−Θᵀx)).
Can anyone please tell me what Θᵀ (theta to the power T) is?
I was able to code linear regression, whose hypothesis h(x) = Θᵀx is relatively easy, but I cannot get started with logistic regression.
Θᵀ is the transpose of the parameter vector Θ, and Θᵀx is the linear combination of the input features. If you know linear regression, then you can think of Θᵀx as the output of linear regression.
The first part of the model is the linear regression; its output is Θᵀx. Since logistic regression is not a regression but a classification problem, your output shouldn't be continuous. Instead, you require a binary output for any input. For this you need a function that maps the input range to values between 0 and 1, so that you can apply a threshold to the output to get the classification, and the suitable function for this is the sigmoid function, as you mentioned: g(z) = 1 / (1 + e^(−z)).
Regarding your question, the output of linear regression can be written as Θᵀx, which is the vectorized form of Θ₀x₀ + Θ₁x₁ + ... + Θₙxₙ. So Θᵀ is nothing but the transpose of the parameter vector.
For details on logistic regression and its implementation, check this link.
Θᵀ represents the transpose of the Θ matrix, where Θ is the vector (or matrix) of parameters. When writing code for these algorithms, I strongly advise you to use MATLAB or Octave first for the matrix calculations. Then, when you are sure that your algorithm is working correctly, implement it in Java.
Cheers,
Emil
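For illustration only, here is the hypothesis sketched in Python/NumPy rather than Java, with made-up values: Θᵀx is just the dot product of the parameter vector with the feature vector, and the sigmoid squashes it into (0, 1).
import numpy as np

def hypothesis(theta, x):
    z = theta @ x                        # Theta^T x: linear combination of features
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid g(z)

theta = np.array([0.5, -1.2, 2.0])       # example parameters (illustrative)
x = np.array([1.0, 0.3, 0.8])            # features, with x[0] = 1 as the bias term
p = hypothesis(theta, x)
print(p, "-> class", int(p >= 0.5))      # threshold at 0.5 for classification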

Which Regression methods are suitable for binary valued features and continuous output?

I want to build a machine learning model to regress a continuous output given binary-valued features (0/1). The dimension of my problem is around 200.
Which of the following methods seems suitable for this kind of problem?
SVR with different Kernels
Regression random forest
MARS
Gradient boosting with regression tree
Kernel regression (Nadaraya-Watson kernel regression)
LSR and LARS
Stochastic gradient boosting
Intuitively speaking, anything requiring the calculation of a gradient is going to struggle on binary values. From your list, SVR and Forests would be the first place I'd look for a benchmark solution.
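As a hedged starting point, a benchmark along those lines might look like this (synthetic binary features and an arbitrary target; all settings are illustrative):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 200)).astype(float)          # ~200 binary features
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=500)    # continuous target

for name, model in [("Random forest", RandomForestRegressor(n_estimators=200, random_state=0)),
                    ("SVR (RBF)", SVR(kernel="rbf", C=1.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, scores.mean())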
You can also look at expectation maximization for Bernoulli mixture models.
It deals with binary input sets. You can find the theory in the book:
Christopher M. Bishop, "Pattern Recognition and Machine Learning".
