Optimizing weights in logistic regression ( log likelihood ) - machine-learning

In Logistic Regression:
hypothesis function,
h(x) = ( 1 + exp{-wx} )^-1
where, w - weights/parameters to be fit or optimized
The cost function is the negative log-likelihood. For a single training example (x, y), the log-likelihood is:
l(w) = y * log ( h(x) ) + (1 - y) * log ( 1 - h(x) )
The goal is to maximize l(w) over all the training examples and thereby estimate w.
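For concreteness, here is a minimal sketch of the hypothesis and the per-example log-likelihood defined above; the scalar x and w and the numeric values are just illustrative assumptions.

import numpy as np

def h(w, x):
    # logistic hypothesis h(x) = 1 / (1 + exp(-w*x))
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def log_likelihood(w, x, y):
    # l(w) = y*log(h(x)) + (1-y)*log(1-h(x)) for a single example (x, y)
    p = h(w, x)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

print(log_likelihood(w=2.0, x=1.5, y=1))   # near 0, since h(x) is close to 1
print(log_likelihood(w=2.0, x=1.5, y=0))   # clearly negative for the same h(x)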
Question :
Consider the situation wherein there are many more positive (y=1) training examples than negative (y=0) training examples.
For simplicity:
if we consider only the positive (y=1) examples:
Algorithm runs:
maximize ( l(w) )
=> maximize ( y * log ( h(x) ) )
=> maximize ( log( h(x) ) )
=> maximize ( h(x) ); since log(z) increases with z
=> maximize ( ( 1 + exp{-wx} )^-1 )
=> maximize ( wx );
since a larger wx will increase h(x) and move it closer to 1
In other words, the optimization algorithm will try to increase (wx) so as to better fit the data and increase likelihood.
However, it seems possible that there is an unintended way for the algorithm to increase (wx) without improving the solution (the decision boundary) in any way:
by scaling w: w' = k*w (where k is a positive constant)
We can increase (k*wx) without changing our solution in any way (see the sketch after the questions below).
1) Why is this not a problem ? Or is this a problem ?
2) One can argue that in a dataset with many more positive examples than negative examples, the algorithm will try to keep increasing ||w||.
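The following sketch illustrates the scaling argument on a tiny, perfectly separable toy dataset (all values are made up): multiplying w by k > 0 leaves the predicted signs, and hence the decision boundary, unchanged, yet the log-likelihood keeps growing toward 0.

import numpy as np

def h(w, X):
    return 1.0 / (1.0 + np.exp(-X @ w))

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, 0])
w = np.array([0.5, 0.5])

for k in [1, 5, 20]:
    p = h(k * w, X)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    # same signs (same boundary), but the likelihood keeps increasing toward 0
    print(k, np.sign(X @ (k * w)), ll)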

This is a problem sometimes, but it is solved by regularization
Only if the classes are perfectly separated
If there are only y=1 examples, the algorithm will indeed try to make wx as large as possible and never converge. But if you have only one class, you don't need logistic regression at all.
If the dataset is merely imbalanced (many more y=1 than y=0), logistic regression will in general suffer no convergence problems.
Let's see why. Suppose you have only one negative example x_0 and N identical positive examples x_1. Then the log-likelihood looks like
l(w) = N * log(h(x_1)) + log(1-h(x_0))
h(x) is bounded between 0 and 1, so both terms are bounded above by 0 but unbounded from below.
Now, if w is already large and you keep increasing it, the first term increases only marginally (because it is already close to 0), but the second term can decrease very fast (because log(x) tends to minus infinity very fast as x approaches 0). If you increase w without limit, l(w) will go to minus infinity. Thus, there is a finite w that maximizes the likelihood.
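A quick numerical sketch of this argument, with made-up values for N, x_1 and x_0 chosen so the classes are not linearly separable (both feature values are positive and there is no intercept): the log-likelihood rises and then falls, so its maximizer is finite.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, x_1, x_0 = 100, 1.0, 0.5                       # illustrative values
ws = np.linspace(0.1, 40, 400)                    # candidate scalar weights
ll = N * np.log(sigmoid(ws * x_1)) + np.log(1 - sigmoid(ws * x_0))

print("best w ~", ws[np.argmax(ll)])              # a finite maximizer, not w -> infinity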
But there is one important exception. It happens when the classes are perfectly separated by some hyperplane (it has little to do with class sizes). In this case, both the first and the second term will tend to 0 while ||w|| tends to infinity.
But if the classes are perfectly separated, you probably don't need logistic regression at all! Its power lies in the probabilistic prediction, but in the perfectly separated case the prediction can be made deterministically. So you could apply, say, an SVM to your data instead.
Or you can solve a regularized problem, maximizing l(w) - lambda*||w||. For example, logistic regression in scikit-learn does exactly this. In this case, once l(w) is sufficiently close to 0, the ||w|| term dominates and the objective function eventually decreases in w.
Thus, a small penalty in the objective function resolves all your worries. And this is a widely applied solution, not only in logistic regression but also in other linear models (Lasso, Ridge, etc.) and in neural networks.
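A minimal scikit-learn sketch of this behavior on a perfectly separable toy dataset (the data and the C values are arbitrary): C is the inverse regularization strength, so ||w|| grows as the penalty weakens but stays finite for any fixed C.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])      # perfectly separable toy data
y = np.array([0, 0, 1, 1])

for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=10000).fit(X, y)   # L2 penalty by default
    print(C, np.linalg.norm(clf.coef_))                       # ||w|| grows as the penalty weakens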

Related

Confused about sklearn’s implementation of OSVM

I have recently started experimenting with OneClassSVM (using Sklearn) for unsupervised learning, and I followed
this example .
I apologize for the silly questions, but I'm a bit confused about two things:
Should I train my SVM on both the regular examples and the outliers, or on the regular examples only?
Which of the labels predicted by the OSVM represents outliers: 1 or -1?
Once again I apologize for these questions, but for some reason I cannot find this documented anywhere.
As the example you reference is about novelty detection, the docs say:
novelty detection:
The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
Meaning: you should train on regular examples only.
The approach is based on:
Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural computation 13.7 (2001): 1443-1471.
Extract:
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.
The above docs also say:
Inliers are labeled 1, while outliers are labeled -1.
This can also be seen in your example code, extracted:
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
...
# all regular = inliers (defined above)
y_pred_test = clf.predict(X_test)
...
# -1 = outlier; X_test contains only regular points, so -1 counts as an error
n_error_test = y_pred_test[y_pred_test == -1].size
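For reference, here is a self-contained sketch of the same workflow with arbitrary toy data: train on regular examples only, then predict; inliers come back as +1 and outliers as -1. The parameter values are illustrative, not tuned.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2) + 2             # regular examples only
X_test = 0.3 * rng.randn(20, 2) + 2               # regular novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)
print(clf.predict(X_test))                        # mostly +1 (inliers)
print(clf.predict(X_outliers))                    # mostly -1 (outliers)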

ML Classification - Decision Boundary Algorithm

Given a classification problem in machine learning, the hypothesis is described as below.
hθ(x)=g(θ'x)
z = θ'x
g(z) = 1 / (1+e^−z)
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x) ≥ 0.5 → y = 1
hθ(x) < 0.5 → y = 0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z) ≥ 0.5 when z ≥ 0
Remember:
z = 0: e^0 = 1 ⇒ g(z) = 1/2
z → ∞: e^-z → 0 ⇒ g(z) = 1
z → -∞: e^-z → ∞ ⇒ g(z) = 0
So if our input to g is θ'x, then that means:
hθ(x) = g(θ'x) ≥ 0.5 when θ'x ≥ 0
From these statements we can now say:
θ'x ≥ 0 ⇒ y = 1
θ'x < 0 ⇒ y = 0
The decision boundary is the line that separates the region where y = 0 from the region where y = 1, and it is created by our hypothesis function.
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?
This is basic logistic regression with a threshold. Your theta' * x is just the vector notation for your weight vector multiplied by your input. If you put that into the logistic function, which outputs a value strictly between 0 and 1, you threshold that value at 0.5: if it is at or above 0.5, you treat the sample as positive, and as negative otherwise.
The classification algorithm is just that simple. The training is a bit more complicated, and its goal is to find a weight vector theta which correctly classifies all your labeled data, or at least as much of it as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.
Courtesy of other posters on a different tech forum.
Solving theta'*x >= 0 and theta'*x < 0 gives the two decision regions; the boundary between them is theta'*x = 0. The RHS of the inequality (i.e. 0) comes from thresholding the sigmoid function at 0.5.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.
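A small sketch of this, with a made-up theta for illustration: thresholding g(theta'x) at 0.5 gives exactly the same labels as checking the sign of theta'x, and setting theta'x = 0 gives the boundary line.

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])                 # [bias, w1, w2], hypothetical values
X = np.array([[1.0, 1.0, 1.0], [1.0, 3.0, 2.0]])   # rows are [1, x1, x2]

z = X @ theta
print(g(z) >= 0.5)    # predicted labels via the 0.5 threshold
print(z >= 0)         # identical labels: the threshold is equivalent to sign(theta'x)
# For this theta the 2D boundary is -3 + x1 + x2 = 0, i.e. the line x1 + x2 = 3.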

Wouldn't setting the first derivative of the cost function J to 0 give the exact Theta values that minimize the cost?

I am currently doing Andrew Ng's ML course. From my calculus knowledge, the first derivative test of a function gives its critical points, if there are any. And considering the convex nature of the Linear / Logistic Regression cost function, it is a given that there will be a global optimum. If that is the case, rather than going the long route of taking a minuscule baby step at a time to reach the global minimum, why don't we use the first derivative test to get the values of Theta that minimize the cost function J in a single attempt, and have a happy ending?
That being said, I do know that there is an alternative to Gradient Descent called the Normal Equation that does exactly this in a single step.
On second thought, I wonder if it is mainly because of the multiple unknown variables involved in the equation (which is why the partial derivative comes into play?).
Let's take an example:
The gradient of the simple (linear) regression cost function:
RSS(w) = (y - Hw)^T (y - Hw)
∇RSS(w) = -2 H^T (y - Hw)
y : output
H : feature (design) matrix
w : weights
RSS : residual sum of squares
Setting this gradient to 0 gives the closed-form solution:
w = (H^T H)^-1 H^T y
Now, assuming there are D features, inverting the D x D matrix H^T H takes roughly O(D^3) time. If there are a million features, this is computationally infeasible within a reasonable amount of time.
We use gradient descent methods instead, since they give reasonably accurate solutions in much less time.
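A small numpy sketch contrasting the two approaches on a tiny least-squares problem (the data and learning rate are made up for illustration); both recover essentially the same weights.

import numpy as np

rng = np.random.RandomState(0)
H = rng.randn(200, 3)                          # 200 examples, D = 3 features
w_true = np.array([2.0, -1.0, 0.5])
y = H @ w_true + 0.01 * rng.randn(200)

# Closed form (normal equation): w = (H^T H)^-1 H^T y
w_closed = np.linalg.solve(H.T @ H, H.T @ y)

# Gradient descent on RSS(w) = (y - Hw)^T (y - Hw)
w = np.zeros(3)
lr = 0.001
for _ in range(2000):
    w -= lr * (-2 * H.T @ (y - H @ w))         # step along the negative gradient
print(w_closed, w)                             # both close to w_true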

Why is the bias term not regularized in ridge regression?

In most classifiers (e.g. logistic / linear regression), the bias term is ignored while regularizing. Will we get better classification if we don't regularize the bias term?
Example:
Y = aX + b
Regularization is based on the idea that overfitting on Y is caused by a being "overly specific", so to speak, which usually manifests itself in large values of a's elements.
b merely offsets the relationship, so its scale is far less important to this problem. Moreover, if a large offset is needed for whatever reason, regularizing it would prevent the correct relationship from being found.
So the answer lies in this: in Y = aX + b, a is multiplied with the explanatory/independent variable, b is added to it.
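A small sketch of this idea for Y = aX + b, using a closed-form ridge fit in which the penalty matrix excludes the intercept (the data, true coefficients, and lambda are all made up): the large offset b is recovered unshrunk while a is regularized.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 2)
y = X @ np.array([1.5, -2.0]) + 10.0 + 0.1 * rng.randn(50)   # large offset b = 10

lam = 1.0
Xb = np.c_[np.ones(50), X]        # first column models the intercept b
P = lam * np.eye(3)
P[0, 0] = 0.0                     # do not penalize the bias term
coef = np.linalg.solve(Xb.T @ Xb + P, Xb.T @ y)
print(coef)                       # b comes out near 10, a is shrunk slightly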

Why is 1-norm SVM more sparse than 2-norm SVM?

How does using the 1-norm of the weights in the cost function increase sparsity, compared to using the 2-norm of the weights in the same cost function, for an SVM?
For 1-norm : Cost function- Minimize ||w||_1
For 2-norm : Cost function - Minimize ||w||_2
Is it related to LP-SVM?
Look at the partial derivative of the l_1 penalty with respect to some parameter: it is constant, no matter how small that parameter already is. So keeping a weight away from zero must buy a fixed reduction in error, regardless of how small the weight is; if it cannot, the weight is driven to exactly zero.
Compare this to the l_2 penalty, where the gradient scales with the size of the current parameter. As the parameter gets near 0, only an infinitesimal decrease in error is needed to offset the regularization penalty, so the weight settles at a small nonzero value instead of zero.
Note also that ||w||_2 < ||w||_1 for the same w when its entries lie between 0 and 1 (which usually happens), since the L2 norm squares the weights.
That's why ||w||_1 is the harsher penalty on small weights, which results in a sparse vector.
This is not specific to SVMs; many algorithms use L1 or L2 regularization.
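A short sketch contrasting the two penalties with scikit-learn's linear SVM on toy data where only 3 of 50 features are informative (the data and C are illustrative): the L1-penalized model typically zeroes out most weights, while the L2-penalized one keeps them all nonzero.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 50)                            # 50 features, only 3 informative
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

l1 = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
l2 = LinearSVC(penalty="l2", dual=False, C=0.1).fit(X, y)

print("nonzero weights, L1:", int(np.sum(l1.coef_ != 0)))   # typically only a few
print("nonzero weights, L2:", int(np.sum(l2.coef_ != 0)))   # typically all 50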
