Why is logistic regression called regression? [closed] - machine-learning

Closed 3 years ago.
As I understand it, linear regression predicts an outcome that can take continuous values, whereas logistic regression predicts an outcome that is discrete. That makes logistic regression look like a classification method. So why is it called regression?
There is also a related question: What is the difference between linear regression and logistic regression?

There is a close link between linear regression and logistic regression.
With linear regression you're looking for the parameters $k_i$:
$h = k_0 + \sum_i k_i X_i = K^T X$
(with $X_0 = 1$, so that $k_0$ is absorbed into $K$).
With logistic regression you have the same aim, but the equation is:
$h = g(K^T X)$
Where g is the sigmoid function:
$g(w) = \frac{1}{1 + e^{-w}}$
So:
$h = \frac{1}{1 + e^{-K^T X}}$
and you need to fit K to your data.
Assuming a binary classification problem, the output h is the estimated probability that the example X is a positive match in the classification task:
$P(Y = 1) = \frac{1}{1 + e^{-K^T X}}$
When the probability is greater than 0.5, we can predict "a match".
The probability is greater than 0.5 when:
$g(w) > 0.5$
and this is true when:
$w = K^T X > 0$
The hyperplane:
$K^T X = 0$
is the decision boundary.
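The link between the estimated probability and the decision boundary can be checked numerically. This is a minimal sketch in plain Python; the parameter values in K and the helper name predict_proba are made up for illustration and are not part of the answer above.

```python
import math

def sigmoid(w):
    """Logistic function g(w) = 1 / (1 + e^(-w))."""
    return 1.0 / (1.0 + math.exp(-w))

def predict_proba(K, x):
    """Estimated P(Y = 1) for features x; K[0] is the intercept k0."""
    w = K[0] + sum(k * xi for k, xi in zip(K[1:], x))
    return sigmoid(w)

# Hypothetical fitted parameters: k0 = -1, k1 = 2, k2 = 0.5
K = [-1.0, 2.0, 0.5]

# One point on each side of the hyperplane, and one exactly on it
for x in ([0.0, 0.0], [0.5, 0.0], [1.0, 1.0]):
    p = predict_proba(K, x)
    label = 1 if p > 0.5 else 0  # predict "a match" only above 0.5
    print(x, round(p, 3), label)
```

The middle point satisfies K^T X = 0, so its probability is exactly 0.5 and it sits on the decision boundary.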
In summary:
logistic regression is a generalized linear model using the same basic formula of linear regression but it is regressing for the probability of a categorical outcome.
This is a very abridged version. You can find a simple explanation in these videos (third week of Machine Learning by Andrew Ng).
You can also take a look at http://www.holehouse.org/mlclass/06_Logistic_Regression.html for some notes on the lessons.

As explained earlier, logistic regression is a generalized linear model using the same basic formula as linear regression, but it regresses for the probability of a categorical outcome.
We get a similar type of equation for both linear and logistic regression.
The difference is that linear regression gives continuous values of y for a given x, whereas logistic regression gives continuous values of p(y=1) for a given x, which are later converted to y=0 or y=1 based on a threshold value (0.5).

Logistic regression falls under the category of supervised learning. It measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities using the logistic/sigmoid function.
Logistic regression is somewhat similar to linear regression; we can see it as a generalized linear model.
In linear regression we predict output y based on a weighted sum of input variables.
y = c + x1*w1 + x2*w2 + x3*w3 + ... + xn*wn
The main purpose of linear regression is to estimate the values of c, w1, w2, ..., wn that minimize the cost function, and then to predict y.
Logistic regression does the same thing with one addition: it passes the result through a special function, called the logistic/sigmoid function, to produce the output y.
y = logistic(c + x1*w1 + x2*w2 + x3*w3 + ... + xn*wn)
y = 1 / (1 + e^(-(c + x1*w1 + x2*w2 + x3*w3 + ... + xn*wn)))

Related

In multi-class logistic regression, does SGD one training example update all the weights?

In multi-class logistic regression, let's say we use softmax and cross-entropy.
Does an SGD step on one training example update all the weights, or only the portion of the weights associated with the label?
For example, the label is one-hot [0,0,1].
Is the whole matrix W_{feature_dim \times num_class} updated, or only W^{3}_{feature_dim \times 1}?
Thanks
All of your weights are updated.
You have y = Softmax(W x + β), so to predict a y out of a single x you are making use of all your W weights. If something is used during the forward pass (prediction), then it also gets updated during the backward pass (SGD). Perhaps a more intuitive way of thinking about it is that you are essentially predicting the class membership probability for your features; assigning weight to some class means removing weight from another, so you need to update both.
Take for instance the simple case of x ∈ ℝ, y ∈ ℝ^3. Then W ∈ ℝ^{1×3}. Before activation, your prediction for some given x would look like: y = [y_1 = W_{11} x + β_1, y_2 = W_{12} x + β_2, y_3 = W_{13} x + β_3]. You have an error signal for all of these mini-predictions, coming out of categorical cross-entropy, for which you must then compute the derivative with respect to the W and β terms.
I hope this is clear
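To make this concrete, here is a small sketch of one SGD step for the scalar-input, three-class case described above. It relies on the standard result that, for softmax followed by cross-entropy, the gradient with respect to the k-th pre-activation is p_k - y_k. The weight values, the input, and the learning rate are invented for illustration.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

x = 2.0                  # single scalar feature (hypothetical value)
W = [0.1, -0.3, 0.2]     # hypothetical weights W11, W12, W13
b = [0.0, 0.0, 0.0]      # hypothetical biases (beta in the text)
y = [0, 0, 1]            # one-hot label: class 3 is correct

logits = [w * x + bk for w, bk in zip(W, b)]
p = softmax(logits)

# Gradient of the cross-entropy loss: dL/dlogit_k = p_k - y_k
grad_b = [pk - yk for pk, yk in zip(p, y)]
grad_W = [g * x for g in grad_b]

lr = 0.1                 # hypothetical learning rate
W_new = [w - lr * g for w, g in zip(W, grad_W)]

print("gradient:", grad_W)
print("updated weights:", W_new)
```

Every entry of the gradient is nonzero: the weight for the true class is pushed up, and the weights for the other two classes are pushed down, because every softmax output p_k is strictly positive.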

ML Classification - Decision Boundary Algorithm

Given a classification problem in Machine Learning the hypothesis is described as below.
h_θ(x) = g(θ'x)
z = θ'x
g(z) = 1 / (1 + e^(−z))
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x)≥0.5→y=1
hθ(x)<0.5→y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z) ≥ 0.5 when z ≥ 0
Remember:
z = 0, e^0 = 1 ⇒ g(z) = 1/2
z → ∞, e^(−z) → 0 ⇒ g(z) → 1
z → −∞, e^(−z) → ∞ ⇒ g(z) → 0
So if our input to g is θ^T x, then that means:
h_θ(x) = g(θ^T x) ≥ 0.5 when θ^T x ≥ 0
From these statements we can now say:
θ'x ≥ 0 ⇒ y = 1
θ'x < 0 ⇒ y = 0
The decision boundary is the line that separates the area where y = 0 from the area where y = 1, and it is created by our hypothesis function.
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?
This is basic logistic regression with a threshold. Your theta' * x is just vector notation for the dot product of your weight vector and your input. If you put that into the logistic function, which outputs a value strictly between 0 and 1, you then threshold that value at 0.5. If it is equal to or above 0.5, you treat the sample as positive, and as negative otherwise.
The classification algorithm is just that simple. The training is a bit more complicated: its goal is to find a weight vector theta that correctly classifies all your labeled data, or at least as much of it as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.
Courtesy of other posters on a different tech forum.
Solving for theta'*x >= 0 and theta'*x<0 gives the decision boundary. The RHS of the inequality ( i.e. 0) comes from the sigmoid function.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.
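For two features, that locus is a straight line that can be solved for directly. The sketch below assumes a hypothetical fitted theta = (theta0, theta1, theta2) with an intercept term; the function name boundary_x2 is invented for the example.

```python
import math

def g(z):
    """Sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))

def boundary_x2(theta, x1):
    """Solve theta0 + theta1*x1 + theta2*x2 = 0 for x2 (needs theta2 != 0)."""
    theta0, theta1, theta2 = theta
    return -(theta0 + theta1 * x1) / theta2

theta = [-3.0, 1.0, 1.0]  # hypothetical parameters: boundary is x1 + x2 = 3

for x1 in (0.0, 1.0, 3.0):
    x2 = boundary_x2(theta, x1)
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    # On the boundary z is 0, so the hypothesis g(z) is exactly 0.5
    print(x1, x2, g(z))
```

Points with theta'x above zero fall on the y = 1 side of this line, and points below zero fall on the y = 0 side.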

Why not logistic regression uses multiplication instead of addition for error matrix?

This is a very basic question, but I could not find enough reasons to convince myself. Why must logistic regression use multiplication instead of addition for the likelihood function l(w)?
Your question is more general than just joint likelihood for logistic regression. You're asking why we multiply probabilities instead of add them to represent a joint probability distribution. Two notes:
This applies when we assume random variables are independent. Otherwise we need to calculate conditional probabilities using the chain rule of probability. You can look at wikipedia for more information.
We multiply because that's how the joint distribution is defined. Here is a simple example:
Say we have two probability distributions:
X = 1, 2, 3, each with probability 1/3
Y = 0 or 1, each with probability 1/2
We want to calculate the joint likelihood function, L(X=x,Y=y), which is the probability that X takes the value x and Y takes the value y.
For example, L(X=1,Y=0) = P(X=1) * P(Y=0) = 1/6. It wouldn't make sense to write P(X=1) + P(Y=0) = 1/3 + 1/2 = 5/6.
Now it's true that in maximum likelihood estimation, we only care about the value of some parameter, theta, which maximizes the likelihood function. Because the logarithm is monotonically increasing, if theta maximizes L(X=x,Y=y) then the same theta also maximizes log L(X=x,Y=y). This is where you may have seen addition come into play: the log turns a product of probabilities into a sum of log-probabilities.
Hence we can write log P(X=x,Y=y) = log P(X=x) + log P(Y=y).
In short
This could be summarized as "joint probabilities represent an AND". When X and Y are independent, P(X AND Y) = P(X,Y) = P(X)P(Y). Not to be confused with P(X OR Y) = P(X) + P(Y) - P(X,Y).
Let me know if this helps.
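The example with X and Y can be reproduced in a few lines of Python. This is only a sketch using the two distributions given above; the function names joint and log_joint are invented here.

```python
import math

# Marginal distributions from the example: X uniform on {1, 2, 3}, Y uniform on {0, 1}
p_x = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
p_y = {0: 1 / 2, 1: 1 / 2}

def joint(x, y):
    """Joint probability of independent X and Y: a product."""
    return p_x[x] * p_y[y]

def log_joint(x, y):
    """Log of the joint probability: the product becomes a sum of logs."""
    return math.log(p_x[x]) + math.log(p_y[y])

print(joint(1, 0))                 # about 0.1667, i.e. 1/6
print(math.exp(log_joint(1, 0)))   # same value, recovered from the sum of logs

# The joint probabilities over all (x, y) pairs sum to 1, as a distribution must
total = sum(joint(x, y) for x in p_x for y in p_y)
print(total)
```

Adding the raw probabilities instead would give 5/6 for every pair, and the six pairs together would sum to 5, which cannot be a probability distribution.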

How to do gaussian/polynomial regression with scikit-learn?

Does scikit-learn provide facility to perform regression using a gaussian or polynomial kernel? I looked at the APIs and I don't see any.
Has anyone built a package on top of scikit-learn that does this?
Theory
Polynomial regression is a special case of linear regression; the main idea is in how you select your features. Consider multivariate regression with 2 variables, x1 and x2. Linear regression will look like this: y = a1 * x1 + a2 * x2.
Now you want a polynomial regression (let's make it a 2nd-degree polynomial). We create a few additional features: x1*x2, x1^2 and x2^2. So we get your 'linear regression':
y = a1 * x1 + a2 * x2 + a3 * x1*x2 + a4 * x1^2 + a5 * x2^2
This nicely illustrates an important concept, the curse of dimensionality, because the number of new features grows much faster than linearly with the degree of the polynomial. You can read more about this concept here.
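How fast the feature count grows can be computed directly. The number of monomials of total degree at most d in n variables, counting the constant term, is the binomial coefficient C(n + d, d). The helper name n_poly_terms below is made up for this sketch.

```python
from math import comb

def n_poly_terms(n_features, degree):
    """Count monomials of total degree <= degree, including the constant term."""
    return comb(n_features + degree, degree)

# Columns: degree, terms for 2 input variables, terms for 10 input variables
for degree in (1, 2, 3, 5):
    print(degree, n_poly_terms(2, degree), n_poly_terms(10, degree))
```

For 2 variables and degree 2 this gives 6, which matches the five features listed above plus a constant term.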
Practice with scikit-learn
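You do not need to do all this by hand in scikit-learn. Polynomial features are already available there (in version 0.15; check how to update it here).
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
# Two training samples with two features each, and their target values
X = [[0.44, 0.68], [0.99, 0.23]]
vector = [109.85, 155.72]
# Sample to predict; scikit-learn expects a 2D array (n_samples, n_features)
predict = [[0.49, 0.18]]
# Expand the inputs to all polynomial terms up to degree 2
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
# Reuse the already fitted transformer instead of refitting on new data
predict_ = poly.transform(predict)
# Ordinary least squares on the expanded features
clf = linear_model.LinearRegression()
clf.fit(X_, vector)
print(clf.predict(predict_))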
Either use Support Vector Regression (sklearn.svm.SVR) and set the appropriate kernel (see here),
or install the latest master version of sklearn and use the recently added sklearn.preprocessing.PolynomialFeatures (see here), and then OLS or Ridge on top of that.
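As a rough illustration of the SVR route, the sketch below fits one model with a Gaussian (RBF) kernel and one with a degree-2 polynomial kernel. The data is synthetic and the values of C and gamma are arbitrary picks for the example, not recommendations; in practice they would need tuning.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1D regression data, invented for this example
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel()

# Gaussian (RBF) kernel and polynomial kernel; hyperparameters are arbitrary
svr_rbf = SVR(kernel="rbf", C=100.0, gamma=0.1)
svr_poly = SVR(kernel="poly", degree=2, C=100.0)

svr_rbf.fit(X, y)
svr_poly.fit(X, y)

# Predict for two new inputs; the input must be 2D (n_samples, n_features)
X_new = np.array([[1.0], [2.5]])
pred_rbf = svr_rbf.predict(X_new)
pred_poly = svr_poly.predict(X_new)
print(pred_rbf, pred_poly)
```

Unlike the PolynomialFeatures approach, the kernel version never builds the expanded feature matrix explicitly.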

Neural Networks: Why does the perceptron rule only work for linearly separable data?

I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data?
Mitchell defines a perceptron as follows:
That is, the output y is 1 if the weighted sum of the inputs exceeds some threshold, and -1 otherwise.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random
weights, then iteratively apply the perceptron to each training
example, modifying the perceptron weights whenever it misclassifies
an example. This process is repeated, iterating through the training
examples as many times as needed until the perceptron classifies all
training examples correctly. Weights are modified at each step
according to the perceptron training rule, which revises the weight wi
associated with input xi according to the rule:
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of the x's, and you are in fact splitting your data into 2 classes using a hyperplane a_1 x_1 + … + a_n x_n > 0.
Consider a 2D example: X = (x, y) and W = (a, b), then X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -(a/b) x (assuming b > 0; the inequality flips for b < 0). The boundary a*x + b*y = 0 is a straight line, and it divides the 2D plane into 2 parts. If no such line separates the two classes, no weight vector classifies every example correctly, so the rule never stops updating.
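A small experiment shows the difference. The sketch below uses the commonly stated form of the perceptron training rule, w_i <- w_i + eta * (t - o) * x_i; since the formula is not reproduced in the question, treat that form as an assumption. It also starts from zero weights rather than random ones so the run is deterministic. AND is linearly separable, XOR is not.

```python
def train_perceptron(samples, eta=0.5, max_epochs=100):
    """Train with the rule w_i <- w_i + eta * (t - o) * x_i.

    Returns (weights, converged); weights[0] is the bias, paired with x0 = 1.
    """
    n = len(samples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            xs = [1.0] + list(x)
            # Perceptron output: 1 if the weighted sum is positive, else -1
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1
            if o != t:
                errors += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        if errors == 0:  # a full pass with no mistakes: done
            return w, True
    return w, False

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

w_and, and_ok = train_perceptron(AND)
w_xor, xor_ok = train_perceptron(XOR)
print("AND converged:", and_ok, w_and)
print("XOR converged:", xor_ok)
```

For XOR the loop always runs out of epochs, however large max_epochs is, because no line puts (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.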
