Does logistic regression always find global optimum, assuming that the optimisation converges? - machine-learning

I am uncertain whether this holds both in general:
does logistic regression always find the global optimum?
and in particular:
does logistic regression always find the global optimum when the optimisation converges?

When the data are separable, the optimum is at infinity, so you will never reach it. Normally, though, any optimization algorithm you are using will reach a point from which no noticeable improvement can be attained by iterating further.
When the data are not separable, an adequately tuned algorithm will eventually find the global optimum, because the loss function is convex.
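Both cases can be seen in a tiny experiment. The following is a minimal sketch, not a production solver: it assumes a one-feature model trained by plain batch gradient descent on the unregularised log loss, and the toy datasets, learning rate and step counts are made up for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit(xs, ys, w=0.0, b=0.0, lr=0.5, steps=5000):
    """Batch gradient descent on the mean log loss of a 1-D logistic model."""
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the log loss w.r.t. the logit
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Overlapping classes (toy data): a finite minimiser exists.
xs_overlap = [-2.0, -1.0, 0.5, -0.5, 1.0, 2.0]
ys_overlap = [0, 0, 0, 1, 1, 1]

# Perfectly separable classes (toy data): no finite minimiser.
xs_sep = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys_sep = [0, 0, 0, 1, 1, 1]

# Two very different starting points end up at the same parameters.
w_a, b_a = fit(xs_overlap, ys_overlap, w=-5.0, b=3.0)
w_b, b_b = fit(xs_overlap, ys_overlap, w=4.0, b=-2.0)
print("overlap, start A:", w_a, b_a)
print("overlap, start B:", w_b, b_b)

# On separable data the weight keeps growing the longer we iterate.
w_short, _ = fit(xs_sep, ys_sep, steps=1000)
w_long, _ = fit(xs_sep, ys_sep, steps=20000)
print("separable, 1000 steps:", w_short)
print("separable, 20000 steps:", w_long)
```

On the overlapping data both runs should agree to many decimal places, which is what convexity predicts. On the separable data the weight never settles, although the improvement per step becomes negligible.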

Related

How to tune maximum entropy's parameter?

I am doing text classification with scikit-learn's logistic regression function (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). I am using grid search in order to choose a value for the C parameter. Do I need to do the same for the max_iter parameter? Why?
Both the C and max_iter parameters have default values in sklearn, which suggests they may need to be tuned. But, from what I understand, early stopping and L1/L2 regularization are two separate methods for avoiding overfitting, and performing one of them is enough. Am I incorrect in assuming that tuning the value of max_iter is equivalent to early stopping?
To summarize, here are my main questions:
1- Does max_iter need tuning? why? (the documentation says it is only useful for certain solvers)
2- Is tuning the max_iter equivalent to early stopping?
3- Should we perform early stopping and L1/L2 regularization at the same time?
Here are some simple, grossly simplified responses to your numbered questions:
1- Yes, sometimes you need to tune max_iter. Why? See next.
2- No. max_iter is the maximum number of iterations that the logistic regression classifier's solver is allowed to step through before being stopped. The aim is to reach a "stable" solution for the parameters of the logistic regression model, i.e., it is an optimisation problem. If your max_iter is too low, the solver may stop before reaching the optimal solution and your model is underfit. If your value is too high, the solver may run for a long time for little gain in accuracy. Because the logistic regression loss is convex, a low max_iter does not trap you in a local optimum; it just stops the solver short of the global one.
3- Yes or no.
a. L1/L2 regularisation is essentially "smoothing" of your complex model so that it does not overfit to the training data. If parameters become too large, they are penalised in the cost.
b. Early stopping is when you stop optimising your model (e.g., via gradient descent) at a stage you deem acceptable (before max_iter is reached). For example, a metric such as RMSE can be used to decide when to stop, or a comparison of the metrics on your training and held-out data.
c. When to use them? This depends on your problem. If you have a simple linear problem with limited features, you will not need regularisation or early stopping. If you have thousands of features and experience overfitting, then apply regularisation as one solution. If you only care about a certain level of accuracy and do not want to wait for the optimisation to run to the end while you are playing with parameters, you could apply early stopping.
Finally, how do you tune max_iter correctly? This depends on your problem at hand. If your classification metric shows your model is performing poorly, it could be that your solver has not taken enough steps to reach a minimum. I'd suggest you do this by hand and look at the cost vs. max_iter to see whether it is reaching a minimum properly, rather than automate it.
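The "cost vs. max_iter" check suggested above could look roughly like this. It is only a sketch: the synthetic dataset from make_classification and the particular max_iter values are assumptions standing in for your own data and grid, and it relies on scikit-learn raising a ConvergenceWarning when the solver stops at the iteration cap.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic stand-in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

results = []
for max_iter in (1, 5, 20, 100, 1000):
    clf = LogisticRegression(C=1.0, max_iter=max_iter)
    # Record warnings so we can tell whether the solver stopped at the cap.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf.fit(X, y)
    hit_cap = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    train_loss = log_loss(y, clf.predict_proba(X))
    results.append((max_iter, train_loss, hit_cap))
    print(f"max_iter={max_iter:5d}  train log loss={train_loss:.4f}  hit cap={hit_cap}")
```

Once the loss stops changing and the warning disappears, raising max_iter further buys nothing, so it behaves as a compute budget rather than a regulariser to be grid-searched like C.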

Local and global minima of the cost function in logistic regression

I'm misunderstanding the idea behind the minima in the derivation of the logistic regression formula.
The idea is to increase the hypothesis as much as possible (i.e. make the predicted probability of the correct class as close to 1 as possible), which in turn requires minimising the cost function $J(\theta)$ as much as possible.
Now I've been told that for this all to work, the cost function must be convex. My understanding of convexity requires there to be no maximums, and therefore there can only be one minimum, the global minimum. Is this really the case? If it's not, please explain why not. Also, if it's not the case, then that implies the possibility of multiple minima in the cost function, implying multiple sets of parameters yielding higher and higher probabilities. Is this possible? Or can I be certain that the returned parameters correspond to the global minimum and hence the highest probability/prediction?
The fact that we use a convex cost function does not guarantee a convex problem.
There is a distinction between a convex cost function and a convex method.
The typical cost functions you encounter (cross entropy, absolute loss, least squares) are designed to be convex.
However, the convexity of the problem depends also on the type of ML algorithm you use.
Linear models (linear regression, logistic regression, etc.) give you a convex optimization problem, so any minimum the solver converges to is a global one. When using neural nets with hidden layers, however, the problem is no longer guaranteed to be convex.
Thus, convexity is a property of your method as a whole, not only of your cost function!
LR is a linear classification method, so you get a convex optimization problem each time you use it! However, if the data is not linearly separable, a linear decision boundary cannot classify every point correctly, so even the global optimum may not be a good solution in that case.
Yes, logistic regression and linear regression aim to find weights and biases that improve the accuracy of the model (that is, make it work well with high probability on test data or real-world data). To achieve that, we look for the weights and biases that give the least deviation (the cost) between predictions and real outcomes. So if we plot the cost function and find its minimum, we achieve the same purpose. Hence we choose a model whose cost function has no local minima other than the global one (i.e. the cost function should be convex).
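To make the convexity claim for logistic regression concrete, here is a short derivation sketch. It assumes the standard cross-entropy cost over $m$ training pairs $(x^{(i)}, y^{(i)})$ with $y^{(i)} \in \{0, 1\}$ and hypothesis $h_\theta(x) = \sigma(\theta^\top x)$; apart from $J(\theta)$, this notation is introduced here and is not taken from the posts above.

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
$$

$$
\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
$$

$$
\nabla_\theta^2 J(\theta) = \frac{1}{m} \sum_{i=1}^{m} h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) x^{(i)} x^{(i)\top}
$$

$$
v^\top \nabla_\theta^2 J(\theta)\, v = \frac{1}{m} \sum_{i=1}^{m} h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) \left(v^\top x^{(i)}\right)^2 \ge 0 \quad \text{for every } v
$$

The Hessian is positive semidefinite everywhere, so $J(\theta)$ is convex and any local minimum is a global minimum. Convexity alone does not make the minimiser unique or finite: on perfectly separable data, for example, the unregularised cost keeps decreasing as $\|\theta\|$ grows.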

Why do we use regularization for training neural network?

In my understanding, I think it's to avoid over/under fitting, and for the faster calculation.
Is it right?
Your understanding is partially correct. Regularization will not help with underfitting. It can protect (to some extent) against overfitting. Furthermore, it will not speed up calculations (it is actually more expensive to compute something with added regularization), but it can lead to a simpler optimization problem and thus fewer steps required for convergence (as the resulting error surface is smoother).

When should one use LinearSVC or SVC?

From my research, I found three conflicting results:
SVC(kernel="linear") is better
LinearSVC is better
Doesn't matter
Can someone explain when to use LinearSVC vs. SVC(kernel="linear")?
It seems like LinearSVC is marginally better than SVC and is usually more finicky. But if scikit decided to spend time on implementing a specific case for linear classification, why wouldn't LinearSVC outperform SVC?
Mathematically, optimizing an SVM is a convex optimization problem, usually with a unique minimizer. This means that there is only one solution to this mathematical optimization problem.
The differences in results come from several aspects: SVC and LinearSVC are supposed to optimize the same problem, but in fact all liblinear estimators penalize the intercept, whereas libsvm ones don't (IIRC). This leads to a different mathematical optimization problem and thus different results. There may also be other subtle differences such as scaling and default loss function (edit: make sure you set loss='hinge' in LinearSVC). Next, in multiclass classification, liblinear does one-vs-rest by default whereas libsvm does one-vs-one.
SGDClassifier(loss='hinge') is different from the other two in the sense that it uses stochastic gradient descent rather than a batch solver, and may not converge to the same solution. However, the obtained solution may generalize better.
Between SVC and LinearSVC, one important decision criterion is that LinearSVC tends to be faster to converge the larger the number of samples is. This is due to the fact that the linear kernel is a special case, which is optimized for in Liblinear, but not in Libsvm.
The actual problem lies in scikit-learn's approach, where they call something an SVM that is not an SVM. By default, LinearSVC actually minimizes squared hinge loss instead of plain hinge loss; furthermore, it penalizes the size of the bias (which a standard SVM does not). For more details, refer to this other question:
Under what parameters are SVC and LinearSVC in scikit-learn equivalent?
So which one should you use? It is purely problem specific. By the no free lunch theorem, it is impossible to say "this loss function is best, period". Sometimes squared hinge loss will work better, sometimes the normal hinge loss.
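For reference, the two losses discussed above differ only in how they penalise margin violations. A minimal sketch, assuming labels in {-1, +1} and a signed margin y * f(x); the function names are illustrative, not part of scikit-learn:

```python
def hinge(margin):
    """Hinge loss for a signed margin y * f(x), with y in {-1, +1}."""
    return max(0.0, 1.0 - margin)

def squared_hinge(margin):
    """Squared hinge loss: the hinge loss, squared."""
    return hinge(margin) ** 2

for m in (-2.0, 0.0, 0.5, 1.0, 2.0):
    print(f"margin={m:+.1f}  hinge={hinge(m):.2f}  squared_hinge={squared_hinge(m):.2f}")
```

Squared hinge punishes badly misclassified points (very negative margins) much harder and small violations (margins between 0 and 1) more gently, which is one reason the two estimators can disagree on the same data.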

Neural nets (or similar) for regression problems

The motivating idea behind neural nets seems to be that they learn the "right" features to apply logistic regression to. Is there a similar approach for linear regression? (or just regression problems in general?)
Would doing the obvious thing of removing the application of a sigmoid function for all neurons (ie, including the hidden layers) make sense/work? (ie, each neuron is performing linear regression instead of logistic regression).
Alternatively, would doing the (maybe even more obvious) thing of just scaling output values to [0,1] work? (intuitively I would think not, as the sigmoid function seems like it would cause the net to arbitrarily favor extreme values) (edit: though I was just searching around some more, and saw that one technique is to scale based on mean and variance, which seems like it might deal with this issue -- so maybe this is more viable than I thought).
Or is there some other technique for doing "feature learning" for regression problems?
Check out this applet. Try to learn different functions. When you set linear activation functions at both the hidden and output layers, it fails to learn even the quadratic function. At least one layer needs a nonlinear activation such as the sigmoid function, see figures below.
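The reason an all-linear network cannot learn a quadratic is that stacked linear layers collapse into a single linear layer. A small sketch with made-up weights (the matrices and helper names below are hypothetical, chosen only to illustrate the algebra):

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output, all linear.
W1 = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]
b1 = [0.5, -1.0, 2.0]
W2 = [[2.0, -1.0, 0.5]]
b2 = [1.0]

def two_layer(x):
    # Hidden layer followed by output layer, no nonlinearity anywhere.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same map written as one linear layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

x = [3.0, -4.0]
print(two_layer(x), one_layer(x))
```

Both functions return the same value for any input, so adding linear hidden layers adds no expressive power beyond ordinary linear regression.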
There are different kinds of scaling. Standard scaling, as you mentioned, subtracts the mean and divides by the standard deviation of the training sample, and is the one most often used in machine learning. Just make sure you apply the same mean and std values from the training sample to the test sample.
The reason scaling is required is that the output of the sigmoid function lies in (0,1). I didn't try it, but I think it is better to scale the output even if you select a linear function at the output layer. Otherwise a large input at a hidden layer (with sigmoid) won't lead to a drastic change in output: the sigmoid function is approximately linear when the input is in a small range around zero, and outside that range the output changes much more slowly. You can try this yourself on your own data.
Besides, if you have various features, feature normalization, which puts different features on the same scale, is also recommended. The scaling speeds up gradient descent by avoiding many extra iterations that are required when one or more features take on much larger values than the rest.
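The point about reusing the training mean and std can be sketched as follows. The numbers and the helper names fit_scaler and transform are made up for illustration; in scikit-learn the same pattern is fitting a StandardScaler on the training data and calling transform on the test data.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return the mean and (population) standard deviation of the training values."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardise values using statistics computed on the training set."""
    return [(v - mu) / sigma for v in values]

train = [10.0, 20.0, 30.0, 40.0, 50.0]  # toy training feature
test = [25.0, 60.0]                      # toy test feature

mu, sigma = fit_scaler(train)            # statistics come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)  # same mu and sigma reused here
print(train_scaled)
print(test_scaled)
```

The scaled training values have zero mean and unit standard deviation; the test values generally will not, and that is expected.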
As @Ray mentioned, deep learning, which involves many levels of features, can help you with feature learning; it's not all linear combinations, though.
