Reading the paper 'Cyclical Learning Rates for Training Neural Networks' (https://arxiv.org/abs/1506.01186).
Does it make sense to use the learning rate finder if the model is overfitting? Other than reducing the number of iterations before the model overfits, will using the learning rate finder prevent overfitting?
From reading the paper there is no suggestion that this method reduces overfitting; is my interpretation correct?
I don't think changing the learning rate reduces overfitting. To avoid overfitting you might want to use L1/L2 regularization and dropout or one of its variants.
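For concreteness, here is a minimal Keras sketch of what that could look like; the layer sizes, the L2 factor and the dropout rate are arbitrary placeholders, not recommendations:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# A small fully-connected model with L2 weight penalties and dropout.
# The sizes, the 1e-4 penalty and the 0.5 dropout rate are illustrative
# placeholders; they need to be tuned on a validation set.
model = tf.keras.Sequential([
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```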
Related
I am new to this DNN field and I am fed up with tuning hyperparameters and other parameters in a DNN, because there are a lot of parameters to tune and it feels like a multivariable analysis without the help of a computer. How can a human move towards the highest accuracy that can be achieved for a task using a DNN, given the huge number of variables inside it? And how will we know what accuracy is possible to reach with a DNN, or do I have to give up on DNNs? I am lost. Help is appreciated.
Main problems I have:
1. What are the limits of DNNs / when do we have to give up on DNNs?
2. What is the proper way of tuning without missing good parameter values?
Here is the summary I got from learning the theory in this field. Corrections are much appreciated if I am wrong or have misunderstood something. You can add anything I missed. Sorted by importance according to my knowledge.
for overfitting -
1. reduce the number of layers
2. reduce the number of nodes per layer
3. add regularizers (L1 / L2 / L1-L2) - have to decide the factors
4. add dropout layers - have to decide the dropout rate
5. reduce the batch size
6. stop training earlier (early stopping; see the sketch after this list)
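For items 5 and 6 in particular, here is a minimal runnable Keras sketch with toy data; the patience of 5 and the batch size of 32 are arbitrary illustrative values:

```python
import numpy as np
import tensorflow as tf

# Dummy data just to make the sketch runnable end to end.
x_train, y_train = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
x_val, y_val = np.random.rand(200, 20), np.random.randint(0, 2, 200)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when the validation loss has not improved for 5 epochs and restore
# the best weights seen so far (item 6); batch_size controls item 5.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, batch_size=32, callbacks=[early_stop])
```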
for underfitting
1. increase the number of layers
2. increase the number of nodes per layer
3. add different types of layers (Conv, LSTM, ...)
4. add learning rate decay (decide the type and its parameters; a sketch follows after this list)
5. reduce the learning rate
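For items 4 and 5, one possible Keras sketch of a decaying learning rate schedule; the initial rate, decay steps and decay rate are placeholders that would normally be tuned on a validation set:

```python
import tensorflow as tf

# Exponentially decay the learning rate from 1e-3, multiplying it by 0.9
# every 1000 optimizer steps. All three numbers are illustrative only.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=1000,
    decay_rate=0.9)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss=..., metrics=[...])
```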
other than that, generally we can adjust:
1. the number of epochs (by watching what happens during training)
2. the learning rate
3. batch normalization - for faster learning
4. initialization techniques (zero / random / Xavier / He) - a sketch of points 3-5 follows after this list
5. different optimization algorithms
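For points 3-5 of that list, one possible Keras sketch; the layer sizes are placeholders and RMSprop is just one example of an alternative optimizer:

```python
import tensorflow as tf
from tensorflow.keras import layers

# He initialization for ReLU layers, batch normalization between layers,
# and an alternative optimizer (RMSprop instead of plain SGD or Adam).
model = tf.keras.Sequential([
    layers.Dense(128, kernel_initializer="he_normal", use_bias=False),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(64, kernel_initializer="he_normal", use_bias=False),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```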
auto-tuning methods
- GridSearchCV - but for this we have to choose what we want to change, and it takes a lot of time (a sketch follows below).
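A sketch of such a grid search with scikit-learn, here using its built-in MLPClassifier and a deliberately tiny parameter grid rather than a hand-built DNN; RandomizedSearchCV is a cheaper alternative once the grid grows:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Toy data just to make the example self-contained.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(32,), (64, 32)],
    "alpha": [1e-4, 1e-2],          # L2 penalty
    "learning_rate_init": [1e-3, 1e-2],
}

search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```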
Short Answer: You should experiment a lot!
Long Answer: At first, you may be overwhelmed by having plenty of knobs that you can tweak, but you gradually become experienced. A very quick way to gain some intuition on how you should tune the hyperparameters of your model is trying to replicate what other researchers have published. By replicating the results (and trying to improve the state-of-the-art), you acquire the intuition about deep learning.
I, personally, follow no particular order in tuning the hyperparameters of the model. Instead, I try to implement a dirty model and then improve it. For instance, if I see overshoots in validation accuracy, which might indicate that the model is bouncing around the sweet spot, I divide the learning rate by ten and see how it goes. If I see the model begin to overfit, I use early stopping to save the best parameters before overfitting. I also play with dropout rates and weight decay to find the best combination of them, so that the model fits well enough while keeping the regularization effect. And so on.
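The "divide the learning rate by ten" and early-stopping steps can be automated with callbacks. A hedged Keras sketch, assuming a compiled model and validation data; the factor and patience values are illustrative:

```python
import tensorflow as tf

# Divide the learning rate by 10 whenever the validation loss has not
# improved for 3 epochs, and keep the best weights via early stopping.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.1, patience=3),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=callbacks)
```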
To correct some of your assumptions: adding different types of layers will not necessarily help your model avoid overfitting. Moreover, sometimes (especially when using transfer learning, which is a trend these days) you cannot simply add a convolutional layer to your neural network.
Assuming you are dealing with computer vision tasks, data augmentation is another useful approach to increase the amount of available training data and improve your model's performance.
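A rough Keras sketch of such augmentation, assuming image arrays `x_train`/`y_train`; the transform ranges are placeholders and should suit your data (e.g. horizontal flips make sense for natural images, not for digits):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Randomly rotate, shift and flip training images on the fly.
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True)

# model.fit(datagen.flow(x_train, y_train, batch_size=32),
#           validation_data=(x_val, y_val), epochs=50)
```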
Also, note that Batch Normalization has a regularization effect as well. Weight decay is another implementation of L2 regularization that is widely used.
Another interesting technique that can improve the training of neural networks is the One Cycle policy for the learning rate and momentum (if applicable). Check out this paper: https://doi.org/10.1109/WACV.2017.58
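PyTorch, for example, ships a built-in scheduler for this policy. A minimal sketch with a stand-in model and placeholder step counts:

```python
import torch

# Stand-in model and placeholder epoch/step counts, just for illustration.
model = torch.nn.Linear(20, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epochs, steps_per_epoch = 10, 100
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1,
    epochs=epochs, steps_per_epoch=steps_per_epoch)

# The scheduler is stepped once per batch; it ramps the learning rate up to
# max_lr and back down over the cycle (and cycles momentum inversely).
for epoch in range(epochs):
    for step in range(steps_per_epoch):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 20)).sum()   # dummy loss
        loss.backward()
        optimizer.step()
        scheduler.step()
```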
That is to say, if during training you set your learning rate too high and unfortunately reached a local minimum where the loss is still high, is it better to retrain with a lower learning rate, or should you continue from the poorly performing model with a higher learning rate, in the hope that the loss will escape the local minimum?
In the strict sense, you don't have to retrain, as you can continue training with a lower learning rate (this is called a learning rate schedule). A very common approach is to lower the learning rate (usually by dividing by 10) each time the loss stagnates or becomes constant.
Another approach is to use an optimizer that adapts the learning rate per parameter based on the gradient magnitudes, so the effective step size naturally decays as you get closer to the minimum. Examples of this are Adam, Adagrad and RMSProp.
In any case, make sure to find the optimal learning rate on a validation set; this will considerably improve performance and make learning faster. This applies both to plain SGD and to any other optimizer.
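As a concrete illustration of continuing rather than retraining, a small PyTorch sketch; the stand-in model and the divide-by-ten factor are just for illustration:

```python
import torch

model = torch.nn.Linear(10, 1)                      # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# ... train until the loss stagnates ...

# Continue training from the same weights with the learning rate divided
# by 10; no need to re-initialize the model.
for group in optimizer.param_groups:
    group["lr"] *= 0.1

# Alternatively, switch to an adaptive optimizer such as Adam, which keeps
# the current weights and adapts per-parameter step sizes from then on.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```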
I have recently watched a video explaining that for Deep Learning, if you add more data, you don't need as much regularization, which sort of makes sense.
This being said, does this statement hold for "normal" machine learning algorithms like Random Forest, for example? And if so, when searching for the best hyper-parameters for the algorithm, in theory you should use as input all the data you have (which of course gets further divided into cross-validation sets, etc.), and not just a sample of it. This of course means a much longer training time, since for every combination of hyper-parameters you have X cross-validation folds that need to be trained, and so on.
So basically, is it fair to assume that the parameters found on a decently sized sample of your dataset are the "best" ones to use for the entire dataset, or isn't it?
Speaking from a statistician's point of view: it really depends on the quality of your estimator. If it's unbiased and low-variance, then a sample will be fine. If the variance is high, you'll want to use all the data you can.
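If compute is the bottleneck, one pragmatic pattern is to run the expensive search on a stratified subsample and then refit the chosen configuration on all the data. A scikit-learn sketch with toy data; the subsample fraction and the grid are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)

# Search on a stratified 20% subsample to keep the grid search cheap.
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.2,
                                      stratify=y, random_state=0)

grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      grid, cv=3, n_jobs=-1)
search.fit(X_sub, y_sub)

# Refit the winning configuration on the full dataset; ideally sanity-check
# it against a held-out set, since the best sample-level parameters are not
# guaranteed to be the best full-data parameters.
final_model = RandomForestClassifier(**search.best_params_, random_state=0)
final_model.fit(X, y)
```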
I am currently studying machine learning with the Coursera ML course, and some questions popped up while learning about the cost function with regularization.
Please give me your advice if you have any idea.
If I have a large enough amount of training data, I think regularization would reduce accuracy, because the model can obtain reliable and generalized output from the training set alone, without regularization. How can I make a good decision on whether or not I should use regularization?
Let's suppose we have a model as follows: w3*x3 + w2*x2 + w1*x1 + w0, and x3 is the feature which particularly causes overfitting, meaning it has more outliers. In this situation, I think regularization is somewhat unreasonable, because it affects every weight. Do you know any better way that I could use in this case?
What is the best way to choose the value of lambda? I guess the simplest way is to run training multiple times with different lambda values and compare their training accuracy. However, this is definitely inefficient when we have a huge amount of training data. I want to know how you choose the ideal lambda value.
Thanks for reading!
It's a bad idea to come up with guesses before you evaluate your model on validation data. When you talk about 'accuracy' in your question, which accuracy do you refer to? Training set accuracy is not very useful for estimating your model's goodness. Normally, regularization is desirable for many families of ML algorithms. In the case of linear regression, it is definitely worth doing. The question here is only the amount of it, i.e. the value of the lambda parameter. Also, you might want to try L1 instead of L2.
In machine learning, questions like this are normally answered using data. Try a model, investigate how it behaves, try different solutions for the issues you observe.
Read this: How to calculate the regularization parameter in linear regression
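On the last question, how to choose lambda: the standard approach is cross-validation over a grid of values rather than retraining by hand. A scikit-learn sketch with ridge regression; the alpha grid is a placeholder, and scikit-learn calls lambda `alpha`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=500, n_features=20, noise=10.0,
                       random_state=0)

# Try a log-spaced grid of regularization strengths and pick the one with
# the best cross-validation score.
alphas = np.logspace(-4, 4, 20)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("chosen lambda:", model.alpha_)
```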
Normally the learning rate is a value that we decide at the beginning, and it doesn't change with the number of iterations.
But in a SOM the learning rate changes with the iteration; what is the idea behind that?
As I understand it, the learning rate should decrease with the number of iterations. Why is that?
The reason is quite simple. SOMs are ill-designed in terms of mathematical models, and one needs to decrease the learning rate in order to ensure convergence. In other words, if you do not decrease this value, the learning procedure might never settle at all. This issue is somewhat addressed by more mathematical models called "Principal Curves" and "Principal Manifolds", which are much less popular but introduce a valid mathematical approach to learning SOM-like representations.
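To make the decay concrete, here is a bare-bones numpy sketch of a SOM-style update in which both the learning rate and the neighbourhood radius shrink with the iteration; the grid size, decay constants and data are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 3))                 # toy 3-D inputs
grid_w, grid_h = 10, 10
weights = rng.random((grid_w * grid_h, 3))   # one prototype per map node
coords = np.array([(i, j) for i in range(grid_w) for j in range(grid_h)])

n_iter, lr0, radius0 = 5000, 0.5, 5.0
for t in range(n_iter):
    # Both the learning rate and the neighbourhood radius decay over time;
    # without this decay the map keeps moving and never settles.
    lr = lr0 * np.exp(-t / n_iter)
    radius = radius0 * np.exp(-t / n_iter)

    x = data[rng.integers(len(data))]
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))     # best matching unit
    dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)     # grid distances
    influence = np.exp(-dist2 / (2 * radius ** 2))
    weights += lr * influence[:, None] * (x - weights)
```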