In this article the author says
...without applying regularization we also run the risk of
underfitting...
Why might we get underfitting without regularization? Regularization makes the network simpler in order to avoid overfitting, not underfitting. So leaving regularization out should not cause underfitting.
We require regularization when our model is overfitting, i.e. when our training accuracy is considerably higher than our testing accuracy.
When our model is underfitting, we need to increase the complexity of the model (by, say, adding new features).
Hence, regularization is not a solution to underfitting, and that is what the author is trying to say.
Related
I would like to improve my understanding on how pruning would affect the accuracy of the training and test sets.
My current understanding is that it will improve accuracy on the test set because pruning prevents the tree from overfitting. Is this the right idea?
And how would pruning affect the accuracy on the training set? I think it reduces accuracy but why?
Any help is appreciated, thanks!
Pruning might lower the accuracy on the training set, since a pruned tree can no longer fit the training data as closely. However, if we do not counter overfitting by setting the appropriate parameters, we might end up building a model that fails to generalize.
Overfitting means that the model has learnt an overly complex function, which predicts perfectly on the training data but fails to generalize to unseen data. This is more of an issue with small training sets, since the set itself might not be representative enough of the new samples that can come in the future.
So you need to tune these parameters, limiting the maximum depth and the number of leaves, to prevent the model from being too complex.
You might want to read also about the Bias–variance tradeoff.
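To make the effect of pruning on both accuracies concrete, here is a minimal sketch. It assumes scikit-learn is available and uses a synthetic dataset with noisy labels; the dataset sizes and the limits `max_depth=3` and `max_leaf_nodes=8` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so a fully grown tree can overfit.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)


def fit_and_score(**tree_params):
    """Fit a tree with the given limits; return (train_accuracy, test_accuracy)."""
    tree = DecisionTreeClassifier(random_state=0, **tree_params)
    tree.fit(X_train, y_train)
    return tree.score(X_train, y_train), tree.score(X_test, y_test)


# Unrestricted tree: grows until every leaf is pure.
full_train, full_test = fit_and_score()
# Restricted tree: depth and leaf count are capped (a simple form of pruning).
pruned_train, pruned_test = fit_and_score(max_depth=3, max_leaf_nodes=8)

print(f"full tree:   train={full_train:.2f}  test={full_test:.2f}")
print(f"pruned tree: train={pruned_train:.2f}  test={pruned_test:.2f}")
```

The unrestricted tree memorizes the training set, so its training accuracy is at least as high as the pruned tree's. Whether the pruned tree does better on the test set depends on the data, which is why it is worth comparing both numbers.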
I am using a random forest. My test accuracy is 70%, while my train accuracy is only 34%. What should I do? How can I solve this problem?
Test accuracy should not be higher than train since the model is optimized for the latter. Ways in which this behavior might happen:
you did not use the same source dataset for test. You should do a proper train/test split in which both of them have the same underlying distribution. Most likely you provided a completely different (and more agreeable) dataset for test
an unreasonably high degree of regularization was applied. Even so there would need to be some element of "test data distribution is not the same as that of train" for the observed behavior to occur.
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting the harder patterns, so you'll need to increase the expressiveness of your model. In the case of random forests, this might mean training the trees to a greater depth.
First you should check the data that is used for training. I think there is some problem with the data, the data may not be properly pre-processed.
Also, in this case, you should try more epochs. Plot the learning curve to analyze when the model is going to converge.
You should check the following:
Both training and validation accuracy scores should increase and loss should decrease.
If there is something wrong in step 1 after any particular epoch, then train your model until that epoch only, because your model is over-fitting after that.
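The "train until that epoch only" rule above can be sketched as a small helper. This assumes you have already recorded one validation loss per epoch; the function name, the `patience` parameter, and the loss values are made up for illustration.

```python
def best_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss seen before
    the loss fails to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the waiting counter.
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss stopped improving
    return best_epoch


# Hypothetical validation losses: they improve, then start to rise.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
print(best_stopping_epoch(val_losses))
```

With these made-up numbers the validation loss bottoms out at epoch 4, so training past that point would only fit the training data more closely.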
I am recently studying Machine Learning with Coursera ML course, and some questions popped up while learning cost function with regularization.
Please give me your advice if you have any idea.
If I have enough training data, I think regularization would reduce the accuracy, because the model can already produce reliable, generalized output from the training set alone, without regularization. How can I decide whether or not I should use regularization?
Let’s suppose we have a model as follows: w3*x3 + w2*x2 + w1*x1 + w0, and x3 is the feature that particularly causes overfitting, meaning it has more outliers. In this situation, regularization seems somewhat unreasonable to me, because it acts on every weight. Do you know of a better approach for this case?
What is the best way to choose the value of lambda? I guess the simplest way is to train several times with different lambda values and compare their training accuracy. However, this is definitely inefficient when we have a huge amount of training data. I want to know how you choose the ideal lambda value.
Thanks for reading!
It's a bad idea to come up with guesses before you evaluate your model on validation data. When you talk about 'accuracy' in your question, which accuracy do you mean? Train set accuracy is not very useful for estimating how good your model is. Normally, regularization is desirable for many families of ML algorithms. In the case of linear regression, it is definitely worth doing. The only question is how much of it, i.e. the value of the lambda parameter. Also, you might want to try L1 instead of L2. Read this.
In machine learning, questions like this are normally answered using data. Try a model, investigate how it behaves, try different solutions for the issues you observe.
Read this: How to calculate the regularization parameter in linear regression
The title says it all: Should a neural network be able to have a perfect train accuracy? Mine saturates at ~0.9 accuracy and I am wondering if that indicates a problem with my network or the training data.
Training instances: ~4500 sequences with an average length of 10 elements.
Network: Bi-directional vanilla RNN with a softmax layer on top.
Perfect accuracy on training data is usually a sign of a phenomenon called overfitting (https://en.wikipedia.org/wiki/Overfitting) and the model may generalize poorly to unseen data. So, no, probably this alone is not an indication that there is something wrong (you could still be overfitting but it is not possible to tell from the information in your question).
You should check the accuracy of the NN on the validation set (data your network has not seen during training) and judge its generalizability. Usually it's an iterative process where you train many networks with different configurations in parallel and see which one performs best on the validation set. Also see cross validation (https://en.wikipedia.org/wiki/Cross-validation_(statistics))
If you have low measurement noise, a model may still not get zero training error. This could be for many reasons including that the model is not flexible enough to capture the true underlying function (which can be a complicated, high-dimensional, non-linear function). You can try increasing the number of hidden layers and nodes but you have to be careful about the same things like overfitting and only judge based on evaluation through cross validation.
You can definitely get a 100% accuracy on training datasets by increasing model complexity but I would be wary of that.
You cannot expect your model to be better on your test set than on your training set. This means if your training accuracy is lower than the desired accuracy, you have to change something. Most likely you have to increase the number of parameters of your model.
The reason why you might be ok with not having a perfect training accuracy is (1) the problem of overfitting (2) training time. The more complex your model is, the more likely is overfitting.
You might want to have a look at Structural Risk Minimization:
(source: svms.org)
When we have a high degree linear polynomial that is used to fit a set of points in a linear regression setup, to prevent overfitting, we use regularization, and we include a lambda parameter in the cost function. This lambda is then used to update the theta parameters in the gradient descent algorithm.
My question is how do we calculate this lambda regularization parameter?
The regularization parameter (lambda) is an input to your model so what you probably want to know is how do you select the value of lambda. The regularization parameter reduces overfitting, which reduces the variance of your estimated regression parameters; however, it does this at the expense of adding bias to your estimate. Increasing lambda results in less overfitting but also greater bias. So the real question is "How much bias are you willing to tolerate in your estimate?"
One approach you can take is to randomly subsample your data a number of times and look at the variation in your estimate. Then repeat the process for a slightly larger value of lambda to see how it affects the variability of your estimate. Keep in mind that whatever value of lambda you decide is appropriate for your subsampled data, you can likely use a smaller value to achieve comparable regularization on the full data set.
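The subsampling idea above can be sketched as follows. This assumes NumPy, an L2-regularized (ridge) linear regression solved in closed form, and synthetic data; the function names, the subsample fraction, and the lambda values are illustrative choices only.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares: (X^T X + lam*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def coefficient_spread(X, y, lam, n_rounds=200, frac=0.5, seed=0):
    """Mean standard deviation of the fitted coefficients across random subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    size = int(frac * n)
    fits = []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=size, replace=False)  # subsample without replacement
        fits.append(ridge_fit(X[idx], y[idx], lam))
    return float(np.std(np.array(fits), axis=0).mean())


# Synthetic regression problem (assumed for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(80, 10))
true_w = rng.normal(size=10)
y = X @ true_w + rng.normal(scale=2.0, size=80)

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:>6}: coefficient spread={coefficient_spread(X, y, lam):.4f}")
```

A smaller spread means a lower-variance estimate. The price is bias, because larger values of lambda shrink the coefficients towards zero, so the spread alone does not tell you which lambda is best.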
CLOSED FORM (TIKHONOV) VERSUS GRADIENT DESCENT
Hi! Nice explanations of the intuitive and top-notch mathematical approaches there. I just wanted to add some specifics that, while not "problem-solving", may definitely help to speed up the process of finding a good regularization hyperparameter and make it more consistent.
I assume that you are talking about L2 (a.k.a. "weight decay") regularization, linearly weighted by the lambda term, and that you are optimizing the weights of your model either with the closed-form Tikhonov equation (highly recommended for low-dimensional linear regression models), or with some variant of gradient descent with backpropagation. I also assume that, in this context, you want to choose the value of lambda that provides the best generalization ability.
CLOSED FORM (TIKHONOV)
If you are able to go the Tikhonov way with your model (Andrew Ng says under 10k dimensions, but this suggestion is at least 5 years old) Wikipedia - determination of the Tikhonov factor offers an interesting closed-form solution, which has been proven to provide the optimal value. But this solution probably raises some kind of implementation issues (time complexity/numerical stability) I'm not aware of, because there is no mainstream algorithm to perform it. This 2016 paper looks very promising though and may be worth a try if you really have to optimize your linear model to its best.
For a quicker prototype implementation, this 2015 Python package seems to deal with it iteratively, you could let it optimize and then extract the final value for the lambda:
In this new innovative method, we have derived an iterative approach to solving the general Tikhonov regularization problem, which converges to the noiseless solution, does not depend strongly on the choice of lambda, and yet still avoids the inversion problem.
And from the GitHub README of the project:
InverseProblem.invert(A, be, k, l) #this will invert your A matrix, where be is noisy be, k is the no. of iterations, and lambda is your dampening effect (best set to 1)
GRADIENT DESCENT
All links of this part are from Michael Nielsen's amazing online book "Neural Networks and Deep Learning", recommended reading!
For this approach there seems to be even less to say: the cost function is usually non-convex, the optimization is performed numerically, and the performance of the model is measured by some form of cross validation (see Overfitting and Regularization and why does regularization help reduce overfitting if you haven't had enough of that). But even when cross-validating, Nielsen suggests something. You may want to take a look at this detailed explanation of how L2 regularization provides a weight-decaying effect, but the summary is that the effect is inversely proportional to the number of samples n, so when calculating the gradient descent equation with the L2 term,
just use backpropagation, as usual, and then add (λ/n)*w to the partial derivative of all the weight terms.
And his conclusion is that, when wanting a similar regularization effect with a different number of samples, lambda has to be changed proportionally:
we need to modify the regularization parameter. The reason is because the size n of the training set has changed from n=1000 to n=50000, and this changes the weight decay factor 1−learning_rate*(λ/n). If we continued to use λ=0.1 that would mean much less weight decay, and thus much less of a regularization effect. We compensate by changing to λ=5.0.
This is only useful when applying the same model to different amounts of the same data, but I think it opens the door to some intuition on how it should work and, more importantly, it speeds up hyperparameter tuning by allowing you to fine-tune lambda on smaller subsets and then scale up.
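The arithmetic behind the quoted rescaling can be checked with a few lines. This is a minimal sketch: the helper names are made up, and the learning rate of 0.5 is an assumed value used only to evaluate the decay factor.

```python
def weight_decay_factor(learning_rate, lam, n):
    """Factor that multiplies each weight in one L2-regularized gradient step."""
    return 1.0 - learning_rate * lam / n


def rescale_lambda(lam, n_old, n_new):
    """Keep lam / n constant when the training-set size changes."""
    return lam * n_new / n_old


def l2_gradient_step(w, grad, learning_rate, lam, n):
    """One step of w <- w - lr * (grad + (lam / n) * w), with w and grad as lists."""
    return [wi - learning_rate * (gi + (lam / n) * wi) for wi, gi in zip(w, grad)]


lam_small = 0.1  # value used with n = 1000 in the quote
lam_large = rescale_lambda(lam_small, n_old=1000, n_new=50000)
print(lam_large)  # close to 5.0, matching the quote

# Same decay factor before and after rescaling (learning rate 0.5 is assumed).
print(weight_decay_factor(0.5, lam_small, 1000))
print(weight_decay_factor(0.5, lam_large, 50000))
```

Keeping λ/n fixed is what keeps the per-step shrinkage of the weights the same, which is why a lambda tuned on a subset can be scaled up with the data size.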
For choosing the exact values, he suggests in his conclusions on how to choose a neural network's hyperparameters the purely empirical approach: start with 1 and then progressively multiply or divide by 10 until you find the proper order of magnitude, and then do a local search within that region. In the comments of this SE related question, the user Brian Borchers also suggests a very well known method that may be useful for that local search:
Take small subsets of the training and validation sets (to be able to make many of them in a reasonable amount of time)
Starting with λ=0 and increasing by small amounts within some region, perform a quick training&validation of the model and plot both loss functions
You will observe three things:
The CV loss function will be consistently higher than the training one, since your model is optimized for the training data exclusively (EDIT: After some time I've seen a MNIST case where adding L2 helped the CV loss decrease faster than the training one until convergence. Probably due to the ridiculous consistency of the data and a suboptimal hyperparametrization though).
The training loss function will have its minimum for λ=0, and then increase with the regularization, since preventing the model from optimally fitting the training data is exactly what regularization does.
The CV loss function will start high at λ=0, then decrease, and then start increasing again at some point (EDIT: this assuming that the setup is able to overfit for λ=0, i.e. the model has enough power and no other regularization means are heavily applied).
The optimal value for λ will probably be somewhere around the minimum of the CV loss function, though it may also depend a little on what the training loss function looks like. See the picture for one possible (but not the only) representation of this: instead of "model complexity", you should interpret the x axis as λ, being zero at the right and increasing towards the left.
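The sweep described above can be sketched for the simplest case. This assumes NumPy, a linear model fitted with the closed-form ridge solution, and synthetic data with few training samples relative to the number of features, so that λ=0 can overfit. The grid of λ values is an arbitrary illustration, and both losses are plain mean squared errors without the penalty term.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares: (X^T X + lam*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


# Synthetic data: only 5 of the 30 features matter (assumed for illustration).
rng = np.random.default_rng(0)
n_train, n_val, n_features = 40, 200, 30
true_w = np.zeros(n_features)
true_w[:5] = rng.normal(size=5)
X_train = rng.normal(size=(n_train, n_features))
X_val = rng.normal(size=(n_val, n_features))
y_train = X_train @ true_w + rng.normal(scale=1.0, size=n_train)
y_val = X_val @ true_w + rng.normal(scale=1.0, size=n_val)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
train_loss, val_loss = [], []
for lam in lambdas:
    w = ridge_fit(X_train, y_train, lam)
    train_loss.append(mse(X_train, y_train, w))
    val_loss.append(mse(X_val, y_val, w))
    print(f"lambda={lam:>6}: train={train_loss[-1]:.3f}  val={val_loss[-1]:.3f}")

# Pick the lambda with the lowest validation loss.
best_lambda = lambdas[int(np.argmin(val_loss))]
print("best lambda on this grid:", best_lambda)
```

For this closed-form model the training loss is smallest at λ=0 and can only grow with λ, as in the second observation above. Where the validation minimum falls depends on the data, so a finer local search around `best_lambda` would be the next step.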
Hope this helps! Cheers,
Andres
The cross validation described above is a method used often in Machine Learning. However, choosing a reliable and safe regularization parameter is still a very hot topic of research in mathematics.
If you need some ideas (and have access to a decent university library) you can have a look at this paper:
http://www.sciencedirect.com/science/article/pii/S0378475411000607