How to calculate the regularization parameter in linear regression - machine-learning

When we have a high degree linear polynomial that is used to fit a set of points in a linear regression setup, to prevent overfitting, we use regularization, and we include a lambda parameter in the cost function. This lambda is then used to update the theta parameters in the gradient descent algorithm.
My question is how do we calculate this lambda regularization parameter?

The regularization parameter (lambda) is an input to your model so what you probably want to know is how do you select the value of lambda. The regularization parameter reduces overfitting, which reduces the variance of your estimated regression parameters; however, it does this at the expense of adding bias to your estimate. Increasing lambda results in less overfitting but also greater bias. So the real question is "How much bias are you willing to tolerate in your estimate?"
One approach you can take is to randomly subsample your data a number of times and look at the variation in your estimate. Then repeat the process for a slightly larger value of lambda to see how it affects the variability of your estimate. Keep in mind that whatever value of lambda you decide is appropriate for your subsampled data, you can likely use a smaller value to achieve comparable regularization on the full data set.

CLOSED FORM (TIKHONOV) VERSUS GRADIENT DESCENT
Hi! nice explanations for the intuitive and top-notch mathematical approaches there. I just wanted to add some specificities that, where not "problem-solving", may definitely help to speed up and give some consistency to the process of finding a good regularization hyperparameter.
I assume that you are talking about the L2 (a.k. "weight decay") regularization, linearly weighted by the lambda term, and that you are optimizing the weights of your model either with the closed-form Tikhonov equation (highly recommended for low-dimensional linear regression models), or with some variant of gradient descent with backpropagation. And that in this context, you want to choose the value for lambda that provides best generalization ability.
CLOSED FORM (TIKHONOV)
If you are able to go the Tikhonov way with your model (Andrew Ng says under 10k dimensions, but this suggestion is at least 5 years old) Wikipedia - determination of the Tikhonov factor offers an interesting closed-form solution, which has been proven to provide the optimal value. But this solution probably raises some kind of implementation issues (time complexity/numerical stability) I'm not aware of, because there is no mainstream algorithm to perform it. This 2016 paper looks very promising though and may be worth a try if you really have to optimize your linear model to its best.
For a quicker prototype implementation, this 2015 Python package seems to deal with it iteratively, you could let it optimize and then extract the final value for the lambda:
In this new innovative method, we have derived an iterative approach to solving the general Tikhonov regularization problem, which converges to the noiseless solution, does not depend strongly on the choice of lambda, and yet still avoids the inversion problem.
And from the GitHub README of the project:
InverseProblem.invert(A, be, k, l) #this will invert your A matrix, where be is noisy be, k is the no. of iterations, and lambda is your dampening effect (best set to 1)
GRADIENT DESCENT
All links of this part are from Michael Nielsen's amazing online book "Neural Networks and Deep Learning", recommended reading!
For this approach it seems to be even less to be said: the cost function is usually non-convex, the optimization is performed numerically and the performance of the model is measured by some form of cross validation (see Overfitting and Regularization and why does regularization help reduce overfitting if you haven't had enough of that). But even when cross-validating, Nielsen suggests something: you may want to take a look at this detailed explanation on how does the L2 regularization provide a weight decaying effect, but the summary is that it is inversely proportional to the number of samples n, so when calculating the gradient descent equation with the L2 term,
just use backpropagation, as usual, and then add (λ/n)*w to the partial derivative of all the weight terms.
And his conclusion is that, when wanting a similar regularization effect with a different number of samples, lambda has to be changed proportionally:
we need to modify the regularization parameter. The reason is because the size n of the training set has changed from n=1000 to n=50000, and this changes the weight decay factor 1−learning_rate*(λ/n). If we continued to use λ=0.1 that would mean much less weight decay, and thus much less of a regularization effect. We compensate by changing to λ=5.0.
This is only useful when applying the same model to different amounts of the same data, but I think it opens up the door for some intuition on how it should work, and, more importantly, speed up the hyperparametrization process by allowing you to finetune lambda in smaller subsets and then scale up.
For choosing the exact values, he suggests in his conclusions on how to choose a neural network's hyperparameters the purely empirical approach: start with 1 and then progressively multiply&divide by 10 until you find the proper order of magnitude, and then do a local search within that region. In the comments of this SE related question, the user Brian Borchers suggests also a very well known method that may be useful for that local search:
Take small subsets of the training and validation sets (to be able to make many of them in a reasonable amount of time)
Starting with λ=0 and increasing by small amounts within some region, perform a quick training&validation of the model and plot both loss functions
You will observe three things:
The CV loss function will be consistently higher than the training one, since your model is optimized for the training data exclusively (EDIT: After some time I've seen a MNIST case where adding L2 helped the CV loss decrease faster than the training one until convergence. Probably due to the ridiculous consistency of the data and a suboptimal hyperparametrization though).
The training loss function will have its minimum for λ=0, and then increase with the regularization, since preventing the model from optimally fitting the training data is exactly what regularization does.
The CV loss function will start high at λ=0, then decrease, and then start increasing again at some point (EDIT: this assuming that the setup is able to overfit for λ=0, i.e. the model has enough power and no other regularization means are heavily applied).
The optimal value for λ will be probably somewhere around the minimum of the CV loss function, it also may depend a little on how does the training loss function look like. See the picture for a possible (but not the only one) representation of this: instead of "model complexity" you should interpret the x axis as λ being zero at the right and increasing towards the left.
Hope this helps! Cheers,
Andres

The cross validation described above is a method used often in Machine Learning. However, choosing a reliable and safe regularization parameter is still a very hot topic of research in mathematics.
If you need some ideas (and have access to a decent university library) you can have a look at this paper:
http://www.sciencedirect.com/science/article/pii/S0378475411000607

Related

How to model multiple inputs to single output in classification?

Purpose:
I am trying to build a model to classify multiple inputs to a single output class, which is something like this:
{x_i1, x_i2, x_i3, ..., x_i16} (features) to y_i (class)
I am using a SVM to make the classification, but the 0/1-loss was bad (half of the data a misclassified), which leads me to the conclusion that the data might be non-linear. This is why I played around with polynomial basis function. I transformed each coefficient such that I get any combinations of polynomials up to degree 4, in the hope that my features are linear in the transformed space. my new transformed input looks like this:
{x_i1, ..., x_i16, x_i1^2, ..., x_i16^2, ... x_i1^4, ..., x_i16^4, x_i1^3, ..., x_i16^3, x_i1*x_i2, ...}
The loss was minimized but still not quite where I want to go. Since with the number of polynomial degree the chance of overfitting rises, i added regularization in order to counter balance that. I also added a forward greedy algorithm in order to pick up the coefficients which leads to minimal cross-validation error, but with no great improvement.
Question:
Is there a systematic way to figure out which transform leads to linear feature behaviour in the transformed space? Seems little odd to me that I have to try out every polynomial until it "fits". Are there perhaps better basis functions except polynomials? I understand that in low dimensional feature space, one can simply plot the data out and estimate the transform visually, but how can I do it in high dimensional space?
Maybe a little off topic but I also informed myself about PCA in order to throw away the components which doesnt provide much informations in the first place. Is this worth a try?
Thank you for your help.
Did you try other kernel functions such as RBF other than linear and polynomial? Since different dataset may have different characteristics, some kernel functions may work better than others do, especially in non-linear cases.
I don't know which tools you are using, but the following one also provides a guide for beginners on how to build SVM models:
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
It is always a good idea to have a feature selection step first, especially for high-dimensional data. Those noisy or irrelevant features should be taken away, leading to a better performance and higher efficiency.

Parameter delta in Adam method

I am using Adam method in caffe. It has a delta/epsilon tuning parameter (used to avoid divide by zero). In caffe, its default value is 1e-8. I can change it to 1e-6 or 1-e0. From tensorflow, I hear that this parameter will affect to performance of training, especially limited dataset.
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
If anyone has experimented with changing this parameter, please give me some advice about the effect of this parameter on performance?
Consider the update equation for Adam: epsilon is to prevent dividing by zero in the case that (the exponentially-decaying average of) the standard deviation of the gradients is zero.
Why would a low value of epsilon cause problems? Perhaps there are cases where some parameters settle to good values before others and having epsilon too low means those parameters get huge learning rates and get pushed away from those good values. I'd guess this would be more problematic in something like a resnet where a lot of the layers have little effect on a large portion of the examples.
On the other hand, setting epsilon higher both limits the parameter-wise learning rate effect and reduces all the learning rates, slowing down training. It's possible to find examples of higher values of epsilon helping simply because the learning rate was too high to begin with.

How does regularization parameter work in regularization?

In machine learning cost function, if we want to minimize the influence of two parameters, let's say theta3 and theta4, it seems like we have to give a large value of regularization parameter just like the equation below.
I am not quite sure why the bigger regularization parameter reduces the influence instead of increasing it. How does this function work?
It is because that the optimum values of thetas are found by minimizing the cost function.
As you increase the regularization parameter, optimization function will have to choose a smaller theta in order to minimize the total cost.
Quoting from similar question's answer:
At a high level you can think of regularization parameters as applying a kind of Occam's razor that favours simple solutions. The complexity of models is often measured by the size of the model w viewed as a vector. The overall loss function as in your example above consists of an error term and a regularization term that is weighted by λ, the regularization parameter. So the regularization term penalizes complexity (regularization is sometimes also called penalty). It is useful to think what happens if you are fitting a model by gradient descent. Initially your model is very bad and most of the loss comes from the error terms, so the model is adjusted to primarily to reduce the error term. Usually the magnitude of the model vector increases as the optimization progresses. As the model is improving and the model vector is growing the regularization term becomes a more significant part of the loss. Regularization prevents the model vector growing arbitrarily for negligible reductions in the error. λ just determines the relative importance of keeping the model simple relative to reducing training error.
There are different types of regularization terms in common use. The one you have, and most commonly used in SVMs, is L2 regularization. It has the side effect of spreading weight more evenly between the components of the model vector. The main alternative is L1 or lasso regularization which has the form λ∑i|wi|, ie it penalizes the sum absolute values of the model parameters. It favors concentrating the size of the model in only a few components, the opposite of L2 regularization. Generally L2 tends to be preferable for low dimensional models while lasso tends to work better for high dimensional models like text classification where it leads to sparse models, ie models with few non-zero parameters.
There is also elastic net regularization, which is just a weighted combination of L1 and L2 regularization. So you have 3 terms in your loss function: error term and the 2 regularization terms each with its own regularization parameter.
You said that you want to minimize the influence of two parameters, theta3 and theta4, meaning those two are both NOT important, so we are going to tell the model we want to fit by:
minimize the weights of theta3 and theta4 cause they don't really matter
And here is the learning process of the model:
Given theta3 and theta4 a really big parameter lambda , when theta3 or theta4 grows, your loss functions grows heavily relatively cause they(theta3 and theta4) both have a big multiplier(lambda), to minimize your object function(loss function), both theta3 and theta4 can only be chosen a very small value, saying that they are not important.
As regularization parameter increases from 0 to infinity, the residual sum of squares in linear regression decreases ,Variance of model decreases and Bias increases .
I will try it in most simple language. i think what you are asking is, how does adding a regularization term at the end deceases the value of parameters like theta3 and theta4 here.
So, lets first assume you added this to the end of your loss function which should massively increase the loss, making the function a bit more bias compared to before. Now we will use any optimization method, lets say gradient descent here and the job of gradient descent is to find all values of theta, now remember the fact that until this point we dont any value of theta and if you solve it you will realize the the values of theta are gonna be different if you hadnt used the regularization term at the end. To be exact, its gonna be less for theta3 and theta4.
So this will make sure your hypothesis has more bias and less variance. In simple term, it will make the equation is bit worse or not as exact as before but it will generalize the equation better.

Why is inference in Markov Random Fields hard?

I'm studying Markov Random Fields, and, apparently, inference in MRF is hard / computationally expensive. Specifically, Kevin Murphy's book Machine Learning: A Probabilistic Perspective says the following:
"In the first term, we fix y to its observed values; this is sometimes called the clamped term. In the second term, y is free; this is sometimes called the unclamped term or contrastive term. Note that computing the unclamped term requires inference in the model, and this must be done once per gradient step. This makes training undirected graphical models harder than training directed graphical models."
Why are we performing inference here? I understand that we're summing over all y's, which seems expensive, but I don't see where we're actually estimating any parameters. Wikipedia also talks about inference, but only talks about calculating the conditional distribution, and needing to sum over all non-specified nodes.. but.. that's not what we're doing here, is it?
Alternatively, any have good intuition on why inference in MRF is difficult?
Sources:
Chapter 19 of ML:PP: https://www.cs.ubc.ca/~murphyk/MLbook/pml-print3-ch19.pdf
Specific section seen below
When training your CRF, you want to estimate your parameters, \theta.
In order to do this, you can differentiate your loss function (Equation 19.38) with respect to \theta, set it to 0, and solve for \theta.
You can't analytically solve the equation for \theta if you do this though. You can, however, minimise Equation 19.38 by gradient descent. Since the loss function is convex, it is guaranteed that gradient descent will get you the globally optimal solution when it converges.
Equation 19.41 is the actual gradient which you need to compute in order to be able to do gradient descent. The first term is easy (and computationally cheap) to compute as you are summing up over the observed values of y. However, the second term requires you to do inference. In this term, you are not summing up over the observed value of y as in the first term. Instead, you need to compute the configuration of y (inference), and then calculate the value of the potential function under this configuration.

Why don't use regularization term instead of sparsity term in autoencoder?

I have read this article about autoencoder, which is introduced by Andrew Ng. In there, he use sparity like regularization to drop connection but formular of sparsity is different from regur. So, I want to know why we don't use directly regularization term like model NNs or logistic regression : (1/2 * m) * Theta^2 ?
First, let us start with some naming convention, both sparsity penalty and L2 penalty on weights can (and often are) called regularizers. Thus, the question should be "why use sparsity-based regularization instead of simple L2-norm based?". And there is no simple answer for this problem, since it goes not deeply into underlying mathematics and asks what is a better way to make sure our network creates a well generalizing representation - to keep parameters more or less in fixed sphere (L2 regularization, the one you propose) or to make sure that whatever we put as an input to the network, it will produce relatively simple representation (possibly at the cost of having lots of weights/neurons that rarely are used). Even on this level of abstraction it should show qualitatitative difference between these two regularizers, which will lead to building completely differnet models. Will sparsity term be better always? Probably not, nearly nothing in ML is "always better". But on average it seems like a less heuristic choice for an autoencoder - you want to have a kind of compression - thus you force your net to create compressed representation which is really ... well.. compressed (small!), while using L2 regularization would simply "squash" representation in terms of norm (since dot product through weights with small norm will not increase too much norm of the input), but it can still use "tiny bit" of each neuron, thus efficiently build a complex representation (using many units) but simply - with small activations.

Resources