Why do we use regularization for training neural network?

In my understanding, I think it's to avoid over/under fitting, and for the faster calculation.
Is it right?

Your understanding is partially correct. Regularization will not help with underfitting. It can protect (to some extent) against overfitting. Furthermore, it will not speed up calculations (it is actually more expensive to compute something with added regularization), but it can lead to a simpler optimization problem and thus fewer steps required for convergence (as the resulting error surface is smoother).


Bayesian Optimization does not improve prediction accuracy

What could be the reason for this?
There is not any guarantee that Bayesian optimization will provide optimal hyperparameter values; quoting from the definitive textbook Deep Learning, by Goodfellow, Bengio, and Courville (page 430):
Currently, we cannot unambiguously recommend Bayesian hyperparameter
optimization as an established tool for achieving better deep learning results or
for obtaining those results with less effort. Bayesian hyperparameter optimization
sometimes performs comparably to human experts, sometimes better, but fails
catastrophically on other problems. It may be worth trying to see if it works on a
particular problem but is not yet sufficiently mature or reliable.
In other words, it is actually just a heuristic (like grid search), and what you report does not necessarily mean that you are doing something wrong or that there is a problem with the procedure to be corrected...
I would like to extend @desertnaut's excellent answer with some intuition about what can go wrong and how one can improve Bayesian optimization. Bayesian optimization usually uses some form of distance (and correlation) computation between points (hyperparameter settings). Unfortunately, it is usually close to impossible to impose such a geometrical structure on the parameter space. One of the important issues connected to this problem is imposing a Lipschitz or linear dependency between the optimized value and the hyperparameters. To understand this in more detail, let us have a look at the
Integer(50, 1000, name="estimators")
parameter. Let us inspect how adding 100 estimators could change the behavior of the optimization problem. If we add 100 estimators to 50, we triple the number of estimators and probably significantly increase the expressive power. However, changing from 900 to 1000 should not matter nearly as much. So if the optimization process starts with, let's say, 600 estimators as a first guess, it would notice that changing the number of estimators by approximately 50 does not change much, so it would skip optimizing this hyperparameter (as it assumes a quasi-continuous, linear dependency). This might seriously harm the exploration process.
In order to overcome this issue it is better to use some sort of log distribution for this parameter. A similar trick was applied e.g. to the learning_rate parameter.
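To make the log-scale argument concrete, here is a minimal sketch in plain Python. The helper names `log_distance` and `sample_log_uniform_int` are hypothetical and only illustrate the idea; they are not part of any optimization library. If your `Integer` dimension comes from a library that offers a log-uniform prior, check its documentation for the exact argument rather than relying on this sketch.

```python
import math
import random


def log_distance(a, b):
    """Distance between two positive values measured on a log scale."""
    return abs(math.log(b) - math.log(a))


def sample_log_uniform_int(low, high, rng):
    """Draw an integer whose logarithm is uniform on [log(low), log(high)]."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    # Round to an integer and clamp to the allowed range.
    return min(high, max(low, round(value)))


# On a linear scale both steps are "+100"; on a log scale they differ a lot.
step_low = log_distance(50, 150)     # tripling the number of estimators
step_high = log_distance(900, 1000)  # roughly an 11% increase

rng = random.Random(0)  # fixed seed, only for reproducibility of the sketch
samples = [sample_log_uniform_int(50, 1000, rng) for _ in range(5)]
```

On the log scale the step from 50 to 150 is about ten times larger than the step from 900 to 1000, which matches the intuition that the first change matters far more.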

Why does one not use IOU for training?

When people try to solve the task of semantic segmentation with CNNs, they usually use a softmax cross-entropy loss during training (see Fully conv. - Long). But when it comes to comparing the performance of different approaches, measures like intersection-over-union are reported.
My question is why don't people train directly on the measure they want to optimize? Seems odd to me to train on some measure during training, but evaluate on another measure for benchmarks.
I can see that the IOU has problems for training samples, where the class is not present (union=0 and intersection=0 => division zero by zero). But when I can ensure that every sample of my ground truth contains all classes, is there another reason for not using this measure?
Check out this paper where they come up with a way to make the concept of IoU differentiable. I implemented their solution with amazing results!
It is like asking "why do we train classifiers with log loss and not accuracy?". The reason is really simple: you cannot directly train on most metrics, because they are not differentiable with respect to your parameters (or at least do not produce a nice error surface). Log loss (softmax cross-entropy) is a valid surrogate for accuracy. Now, you are completely right that it is plain wrong to train with something that is not a valid surrogate for the metric you are interested in, and the linked paper does not do a good job, since for at least a few of the metrics they consider we could easily show a good surrogate (for weighted accuracy, for example, all you have to do is weight the log loss as well).
Here's another way to think about this in a simple manner.
Remember that it is not sufficient to simply evaluate a metric such as accuracy or IoU while solving a relevant image problem. Evaluating the metric must also help the network learn in which direction the weights must be nudged towards, so that a network can learn effectively over iterations and epochs.
Evaluating this direction is what the earlier answers mean when they say the error must be differentiable. I suppose that there is nothing in the IoU metric itself that the network can use to say: "hey, it's not exactly here, but maybe I have to move my bounding box a little to the left!"
Just a trickle of an explanation, but hope it helps..
I always use mean IOU for training a segmentation model. More exactly, -log(MIOU). Plain -MIOU as a loss function will easily trap your optimizer around 0 because of its narrow range (0,1) and thus its steep surface. By taking its log scale, the loss surface becomes slow and good for training.
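A rough sketch of what a differentiable IoU-style loss could look like for a single class, in plain Python. This uses a common "soft" relaxation (products and sums of predicted probabilities instead of hard set operations); it is an assumption for illustration and not necessarily the formulation of the paper linked above or the exact loss the poster used. The names `soft_iou` and `neg_log_iou_loss` and the `eps` smoothing term are hypothetical.

```python
import math


def soft_iou(probs, targets, eps=1e-7):
    """Soft IoU between predicted probabilities and 0/1 targets.

    The intersection is approximated by sum(p * t) and the union by
    sum(p + t - p * t); eps avoids 0/0 when the class is absent.
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    union = sum(p + t - p * t for p, t in zip(probs, targets))
    return (intersection + eps) / (union + eps)


def neg_log_iou_loss(probs, targets):
    """Loss of the form -log(IoU), as described in the answer above."""
    return -math.log(soft_iou(probs, targets))


targets = [1.0, 1.0, 0.0, 0.0]          # flattened ground-truth mask
good_prediction = [0.9, 0.8, 0.1, 0.2]  # mostly agrees with the mask
bad_prediction = [0.2, 0.1, 0.9, 0.8]   # mostly disagrees with the mask

loss_good = neg_log_iou_loss(good_prediction, targets)
loss_bad = neg_log_iou_loss(bad_prediction, targets)
```

For a mean IoU over several classes, one would compute `soft_iou` per class and average before taking the logarithm. In a real training loop the same expressions would be written with a framework's tensor operations so that gradients are available.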

Are high values for c or gamma problematic when using an RBF kernel SVM?

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both c and gamma. The parameters which worked best for classifying known terms (test and training material differ of course) are rather high, c=2^10 and gamma=2^3.
So far the high parameters seem to work ok, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, yet those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluations turn out positive? Do I perhaps need another kernel type?
Thank you very much!
In general, you have to perform cross-validation to find out whether the parameters are all right or whether they lead to overfitting.
From the "intuition" perspective, it seems like a highly overfitted model. A high value of gamma means that your Gaussians are very narrow (concentrated around each point), which combined with a high C value will result in memorizing most of the training set. If you check the number of support vectors, I would not be surprised if it were 50% of your whole data. Another possible explanation is that you did not scale your data. Most ML methods, especially SVMs, require the data to be properly preprocessed. In particular, you should normalize (standardize) the input data so that it is more or less contained in the unit sphere.
RBF seems like a reasonable choice, so I would keep using it. A high value of gamma is not necessarily a bad thing; it depends on the scale at which your data lives. While a high C value can lead to overfitting, it is also affected by the scale, so in some cases it might be just fine.
If you think that your dataset is a good representation of the whole data, then you could use crossvalidation to test your parameters and have some peace of mind.
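To see why the effect of gamma depends on the scale of the data, here is a small plain-Python sketch of the RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2). The function name `rbf_kernel` and the example points are hypothetical; gamma = 2^3 is taken from the question.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel value between two points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


gamma = 2 ** 3  # the value reported in the question

# Two points that are 1.0 apart: the kernel value is essentially zero,
# so each training point only influences its immediate neighbourhood.
k_far = rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma)

# The same configuration on data that lives on a 10x smaller scale:
# the points are 0.1 apart and the kernel value stays close to 1.
k_near = rbf_kernel([0.0, 0.0], [0.1, 0.0], gamma)
```

The same gamma therefore gives very narrow Gaussians on one data scale and fairly wide ones on another, which is why the values of C and gamma are hard to judge without knowing how the features were scaled.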

Neural nets (or similar) for regression problems

The motivating idea behind neural nets seems to be that they learn the "right" features to apply logistic regression to. Is there a similar approach for linear regression? (or just regression problems in general?)
Would doing the obvious thing of removing the application of a sigmoid function for all neurons (ie, including the hidden layers) make sense/work? (ie, each neuron is performing linear regression instead of logistic regression).
Alternatively, would doing the (maybe even more obvious) thing of just scaling output values to [0,1] work? (intuitively I would think not, as the sigmoid function seems like it would cause the net to arbitrarily favor extreme values) (edit: though I was just searching around some more, and saw that one technique is to scale based on mean and variance, which seems like it might deal with this issue -- so maybe this is more viable than I thought).
Or is there some other technique for doing "feature learning" for regression problems?
Check out this applet. Try to learn different functions. When you dictate linear activation functions at both hidden and output layers, it even fails to learn the quadratic function. At least one layer needs to be set to sigmoid function, see figures below.
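The reason a purely linear network cannot learn a quadratic can be checked directly: stacking linear layers gives another linear map. The sketch below, in plain Python with hypothetical weights, shows that a two-layer linear network equals a single linear layer with weights W2·W1 and bias W2·b1 + b2.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[2.0, -1.0, 0.5]]
b2 = [0.7]


def two_layer_linear(x):
    """Network with linear activations in both layers."""
    hidden = [h + b for h, b in zip(matvec(W1, x), b1)]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]


# Collapse the two layers into one equivalent linear layer.
W_eq = matmul(W2, W1)
b_eq = [o + b for o, b in zip(matvec(W2, b1), b2)]


def single_layer(x):
    """Equivalent one-layer linear model."""
    return [o + b for o, b in zip(matvec(W_eq, x), b_eq)]
```

Since both functions agree for every input, the hidden layer adds no expressive power unless a nonlinearity such as the sigmoid is placed between the layers.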
There are different kinds of scaling. Standard scaling, as you mentioned, removes the effect of the mean and standard deviation of the training sample and is the one most often used in machine learning. Just make sure you apply the same mean and standard deviation, computed on the training sample, to the test sample.
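A minimal sketch of this with the standard library, for a single feature. The helper names `fit_scaler` and `transform` and the toy values are hypothetical; the point is only that the statistics come from the training sample and are reused unchanged on the test sample.

```python
import statistics


def fit_scaler(train_values):
    """Compute the mean and (population) standard deviation on training data."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def transform(values, mean, std):
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_scaler(train)              # fitted on the training sample only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)   # reuses the training statistics
```

The scaled test values will generally not have zero mean and unit variance themselves, and that is expected.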
Scaling is required because the output of the sigmoid function lies in (0,1). I have not tried it, but I think it is better to scale the output even if you select a linear function at the output layer. Otherwise, large inputs to the hidden layer (with sigmoid) will not lead to large changes in the output (the sigmoid function is approximately linear when its input is in a small range around zero; outside that range the output changes much more slowly). You can try this yourself on your own data.
Besides, if you have several features, feature normalization that brings the different features to the same scale is also recommended. The scaling speeds up gradient descent by avoiding many extra iterations that are required when one or more features take on much larger values than the rest.
As @Ray mentioned, deep learning, which involves many levels of features, can help you with feature learning; it is not all linear combinations, though.

How to calculate the regularization parameter in linear regression

When we fit a high-degree polynomial to a set of points in a linear regression setup, we use regularization to prevent overfitting, and we include a lambda parameter in the cost function. This lambda is then used to update the theta parameters in the gradient descent algorithm.
My question is how do we calculate this lambda regularization parameter?
The regularization parameter (lambda) is an input to your model so what you probably want to know is how do you select the value of lambda. The regularization parameter reduces overfitting, which reduces the variance of your estimated regression parameters; however, it does this at the expense of adding bias to your estimate. Increasing lambda results in less overfitting but also greater bias. So the real question is "How much bias are you willing to tolerate in your estimate?"
One approach you can take is to randomly subsample your data a number of times and look at the variation in your estimate. Then repeat the process for a slightly larger value of lambda to see how it affects the variability of your estimate. Keep in mind that whatever value of lambda you decide is appropriate for your subsampled data, you can likely use a smaller value to achieve comparable regularization on the full data set.
Hi! Nice explanations of the intuitive and top-notch mathematical approaches there. I just wanted to add some specifics that, while not "problem-solving", may definitely help to speed up and give some consistency to the process of finding a good regularization hyperparameter.
I assume that you are talking about L2 (a.k.a. "weight decay") regularization, linearly weighted by the lambda term, and that you are optimizing the weights of your model either with the closed-form Tikhonov equation (highly recommended for low-dimensional linear regression models), or with some variant of gradient descent with backpropagation. And that in this context, you want to choose the value for lambda that provides the best generalization ability.
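For reference, the closed-form Tikhonov (ridge) solution is w = (XᵀX + λI)⁻¹Xᵀy. The sketch below writes it out in plain Python for the simplest possible case, a single feature without an intercept, where it reduces to w = Σxy / (Σx² + λ). The function name `ridge_1d` and the toy data are hypothetical; a real multi-feature model would use a linear algebra library instead.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for one feature and no intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lam)


# Toy data generated by y = 2x, so the unregularized slope is exactly 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_unregularized = ridge_1d(xs, ys, lam=0.0)   # 28 / 14
w_regularized = ridge_1d(xs, ys, lam=14.0)    # 28 / (14 + 14)
```

Increasing lambda shrinks the weight towards zero, which is the bias the earlier answer refers to.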
If you are able to go the Tikhonov way with your model (Andrew Ng says under 10k dimensions, but this suggestion is at least 5 years old) Wikipedia - determination of the Tikhonov factor offers an interesting closed-form solution, which has been proven to provide the optimal value. But this solution probably raises some kind of implementation issues (time complexity/numerical stability) I'm not aware of, because there is no mainstream algorithm to perform it. This 2016 paper looks very promising though and may be worth a try if you really have to optimize your linear model to its best.
For a quicker prototype implementation, this 2015 Python package seems to deal with it iteratively, you could let it optimize and then extract the final value for the lambda:
In this new innovative method, we have derived an iterative approach to solving the general Tikhonov regularization problem, which converges to the noiseless solution, does not depend strongly on the choice of lambda, and yet still avoids the inversion problem.
And from the GitHub README of the project:
InverseProblem.invert(A, be, k, l) #this will invert your A matrix, where be is noisy be, k is the no. of iterations, and lambda is your dampening effect (best set to 1)
All links of this part are from Michael Nielsen's amazing online book "Neural Networks and Deep Learning", recommended reading!
For this approach there seems to be even less to say: the cost function is usually non-convex, the optimization is performed numerically, and the performance of the model is measured by some form of cross-validation (see Overfitting and Regularization and why does regularization help reduce overfitting if you haven't had enough of that). But even when cross-validating, Nielsen suggests something: you may want to take a look at this detailed explanation of how L2 regularization provides a weight-decaying effect, but the summary is that the effect is inversely proportional to the number of samples n, so when calculating the gradient descent equation with the L2 term,
just use backpropagation, as usual, and then add (λ/n)*w to the partial derivative of all the weight terms.
And his conclusion is that, when wanting a similar regularization effect with a different number of samples, lambda has to be changed proportionally:
we need to modify the regularization parameter. The reason is because the size n of the training set has changed from n=1000 to n=50000, and this changes the weight decay factor 1−learning_rate*(λ/n). If we continued to use λ=0.1 that would mean much less weight decay, and thus much less of a regularization effect. We compensate by changing to λ=5.0.
This is only useful when applying the same model to different amounts of the same data, but I think it opens up the door for some intuition on how it should work, and, more importantly, speed up the hyperparametrization process by allowing you to finetune lambda in smaller subsets and then scale up.
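The arithmetic behind the quoted example can be written down in a few lines. This is a sketch using the weight decay factor 1 − learning_rate·(λ/n) from the quote; the function names and the learning rate value are hypothetical, while n = 1000, n = 50000 and λ = 0.1 come from the quoted passage.

```python
def weight_decay_factor(learning_rate, lam, n):
    """Factor that multiplies each weight in one gradient step with L2."""
    return 1.0 - learning_rate * lam / n


def rescale_lambda(lam, n_old, n_new):
    """Scale lambda so that lam / n stays the same for a new sample count."""
    return lam * n_new / n_old


learning_rate = 0.5  # hypothetical value, only needed to evaluate the factor
lam_small = 0.1      # lambda tuned on n = 1000 samples
lam_full = rescale_lambda(lam_small, n_old=1000, n_new=50000)

decay_small = weight_decay_factor(learning_rate, lam_small, 1000)
decay_full = weight_decay_factor(learning_rate, lam_full, 50000)
decay_unscaled = weight_decay_factor(learning_rate, lam_small, 50000)
```

With the rescaled lambda the decay factor on the full data matches the one on the subset, whereas keeping λ = 0.1 on 50000 samples gives a factor much closer to 1, i.e. much less weight decay.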
For choosing the exact values, he suggests, in his conclusions on how to choose a neural network's hyperparameters, the purely empirical approach: start with 1 and then progressively multiply and divide by 10 until you find the proper order of magnitude, and then do a local search within that region. In the comments of this related SE question, the user Brian Borchers also suggests a very well-known method that may be useful for that local search:
Take small subsets of the training and validation sets (to be able to make many of them in a reasonable amount of time)
Starting with λ=0 and increasing by small amounts within some region, perform a quick training&validation of the model and plot both loss functions
You will observe three things:
The CV loss function will be consistently higher than the training one, since your model is optimized for the training data exclusively (EDIT: After some time I've seen a MNIST case where adding L2 helped the CV loss decrease faster than the training one until convergence. Probably due to the ridiculous consistency of the data and a suboptimal hyperparametrization though).
The training loss function will have its minimum for λ=0, and then increase with the regularization, since preventing the model from optimally fitting the training data is exactly what regularization does.
The CV loss function will start high at λ=0, then decrease, and then start increasing again at some point (EDIT: this assuming that the setup is able to overfit for λ=0, i.e. the model has enough power and no other regularization means are heavily applied).
The optimal value for λ will probably be somewhere around the minimum of the CV loss function; it may also depend a little on what the training loss function looks like. See the picture for a possible (but not the only) representation of this: instead of "model complexity" you should interpret the x axis as λ, being zero at the right and increasing towards the left.
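The procedure above can be sketched as a simple sweep over candidate λ values. This is a plain-Python illustration under strong simplifying assumptions: a one-feature ridge model without intercept, a tiny made-up data set, and an arbitrary grid of λ values. All names and numbers are hypothetical, and in practice the two loss curves would be plotted rather than just compared.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for one feature and no intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lam)


def mean_squared_error(w, xs, ys):
    """Average squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Small hypothetical training and validation subsets.
train_x = [0.5, 1.0, 1.5, 2.0]
train_y = [1.3, 1.8, 3.4, 3.7]
val_x = [0.8, 1.2, 1.7]
val_y = [1.5, 2.5, 3.3]

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]

results = []  # one (lambda, training loss, validation loss) tuple per candidate
for lam in lambdas:
    w = fit_ridge_1d(train_x, train_y, lam)
    results.append(
        (lam,
         mean_squared_error(w, train_x, train_y),
         mean_squared_error(w, val_x, val_y))
    )

# Pick the candidate with the lowest validation loss.
best_lambda = min(results, key=lambda row: row[2])[0]
```

For this model the training loss can only grow as λ increases, which matches the second observation in the list; where the validation minimum falls depends entirely on the data.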
Hope this helps! Cheers,
The cross validation described above is a method used often in Machine Learning. However, choosing a reliable and safe regularization parameter is still a very hot topic of research in mathematics.
If you need some ideas (and have access to a decent university library) you can have a look at this paper:
