What does "learning rate warm-up" mean? [closed] - machine-learning

In machine learning, especially deep learning, what does it mean to warm up?
I've sometimes heard that in some models warm-up is a phase of training. But honestly, I don't know what it is because I'm very new to ML. I've never used or come across it until now, but I want to learn about it because I think it might be useful for me.
What is learning rate warm-up and when do we need it?

If your data set is highly differentiated, you can suffer from a sort of "early over-fitting". If your shuffled data happens to include a cluster of related, strongly-featured observations, your model's initial training can skew badly toward those features -- or worse, toward incidental features that aren't truly related to the topic at all.
Warm-up is a way to reduce the primacy effect of the early training examples. Without it, you may need to run a few extra epochs to get the convergence desired, as the model un-trains those early superstitions.
Many models afford this as a command-line option. The learning rate is increased linearly over the warm-up period. If the target learning rate is p and the warm-up period is n, then the first batch iteration uses 1*p/n for its learning rate; the second uses 2*p/n, and so on: iteration i uses i*p/n, until we hit the nominal rate at iteration n.
This means that the first iteration gets only 1/n of the primacy effect. This does a reasonable job of balancing that influence.
Note that the ramp-up is commonly on the order of one epoch -- but is occasionally longer for particularly skewed data, or shorter for more homogeneous distributions. You may want to adjust, depending on how functionally extreme your batches can become when the shuffling algorithm is applied to the training set.
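As a rough illustration of the schedule described above, here is a minimal sketch in plain Python (the function name warmup_lr and the numbers are illustrative only, not taken from any particular framework):

    def warmup_lr(step, target_lr, warmup_steps):
        """Linear warm-up: iteration i uses i * p / n until the nominal rate is reached."""
        if step < warmup_steps:
            return step * target_lr / warmup_steps
        return target_lr

    # Example: target learning rate p = 0.01, warm-up period n = 100 iterations
    for step in range(1, 4):
        print(step, warmup_lr(step, target_lr=0.01, warmup_steps=100))
    # prints roughly 0.0001, 0.0002, 0.0003 for the first three iterations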

It means that if you specify your learning rate to be, say, 2e-5, then during training the learning rate will be linearly increased from approximately 0 to 2e-5 over, say, the first 10,000 steps.

There are actually two strategies for warm-up, ref here.
constant: Use a lower learning rate than the base learning rate for the initial few steps.
gradual: In the first few steps, the learning rate is set lower than the base learning rate and is increased gradually to approach it as the step number increases, as @Prune and @Patel suggested.
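A hedged sketch of both strategies, written as learning-rate multipliers that could be plugged into, e.g., PyTorch's torch.optim.lr_scheduler.LambdaLR (the 0.1 factor, the 10,000-step window, and the toy model are illustrative assumptions, not something prescribed above):

    import torch

    model = torch.nn.Linear(10, 2)                     # toy model
    base_lr = 2e-5
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    warmup_steps = 10_000

    def constant_warmup(step):
        # constant: a fixed fraction of the base LR during warm-up, then the base LR
        return 0.1 if step < warmup_steps else 1.0

    def gradual_warmup(step):
        # gradual: ramp linearly from ~0 up to the base LR over warmup_steps
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=gradual_warmup)

    # in the training loop: call optimizer.step() and then scheduler.step() each batch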

Related

What order of hyperparameter tuning? [closed]

I am using a Neural Network for a classification problem and I am now at the point of tuning all the hyperparameters.
For now, I have seen many different hyperparameters that I have to tune:
Learning rate
batch-size
number of iterations (epoch)
For now, my tuning is quite "manual" and I am not sure I am doing everything in a proper way. Is there a special order in which to tune the parameters? E.g. learning rate first, then batch size, then ... I am not sure that all these parameters are independent. Which ones are clearly independent and which ones are clearly not? Should we then tune them together? Is there any paper or article which talks about properly tuning all the parameters in a special order?
There is even more than that! E.g. the number of layers, the number of neurons per layer, which optimizer to choose, etc...
So the real work in training a neural network is actually finding the best-suited parameters.
I would say there is no clear guideline because training a machine learning algorithm, in general, is always task-specific.
You see, there are many hyperparameters to tune, and you won't have time to try out every combination of them. For many hyperparameters, you will build up somewhat of an intuition about what a good choice would be, but for now, a great starting point is always to use what has been proven by others to work. So if you find a paper on the same or a similar task, you could try to use the same or similar parameters as they did.
Just to share with you some small experiences I've made:
I rarely vary the learning rate. I mostly choose the Adam optimizer and stick with it.
The batch size I try to choose as big as possible without running out of memory.
The number of iterations you could just set to e.g. 1000. You can always look at the current loss and decide for yourself to stop when the net e.g. isn't learning anymore.
Keep in mind these are in no way rules or strict guidelines, just some ideas until you've got a better intuition yourself. The more papers you've read and the more nets you've trained, the better you will understand what to choose when.
Hope this serves as a good starting point at least.
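If the manual tuning gets tedious, a plain random search over a few hyperparameters is a common, simple alternative. A minimal sketch (the ranges and the train_and_evaluate stub are placeholders you would replace with your own training code):

    import random

    def train_and_evaluate(lr, batch_size, epochs):
        # stand-in: train your network with these hyperparameters and
        # return a validation score; replace with your real training code
        return random.random()

    search_space = {
        "lr": [1e-4, 3e-4, 1e-3, 3e-3],
        "batch_size": [32, 64, 128, 256],
        "epochs": [10, 20, 50],
    }

    best_score, best_config = float("-inf"), None
    for _ in range(20):                                # 20 random trials
        config = {k: random.choice(v) for k, v in search_space.items()}
        score = train_and_evaluate(**config)
        if score > best_score:
            best_score, best_config = score, config

    print(best_config, best_score)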

What are the costs of many trainable parameters and layers in Machine Learning? [closed]

I'm curious as to what effect many trainable parameters have on a machine learning algorithm.
For instance, a CNN with many layers can extract more abstract data from images. But more layers means more trainable parameters. What downsides exist for making really deep models?
Do more params mean more RAM is required to load the model?
Do more params mean the model is computationally expensive i.e. requires better hardware to train?
Do more params mean the model takes longer to train and classify or is the change negligible in most cases?
I noticed that deeper models with more params have a larger file size when the models are saved. Are there any other costs?
Most sources I can find on this topic just mention keeping the params low but never elaborate on why exactly we do this.
Do more params mean more RAM is required to load the model?
Yes, it requires more RAM/VRAM depending on your training mode.
Do more params mean the model is computationally expensive, i.e. requires better hardware to train?
Yes.
Do more params mean the model takes longer to train and classify, or is the change negligible in most cases?
Yes, it takes longer as the number of calculations increases.
I noticed that deeper models with more params have a larger file size when the models are saved. Are there any other costs?
Aside from what you have already pointed out? Yes: the optimization can be affected, and the training regime is affected.
Most sources I can find on this topic just mention keeping the params low but never elaborate on why exactly we do this.
When the number of trainable parameters increases, the number of calculations also increases, which means higher hardware requirements, both in the form of RAM and processing power.
It also introduces difficulties in terms of optimization, such as overfitting, and can even affect the choice of the optimization algorithm altogether.
Transferring the model can also become an issue depending on your use case: consider, for example, IoT devices, drones, or autonomous vehicles/robots that require updates over bad/poor connections. (There are ways to reduce the model size and increase its performance, but in general, more parameters means more space/time.)
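To make the RAM and file-size point tangible, here is a small sketch (PyTorch assumed, arbitrary layer sizes) that counts trainable parameters and estimates their raw float32 storage; optimizer states and activations during training add considerably more on top of this:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(128 * 32 * 32, 10),                  # assumes 32x32 input images
    )

    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {n_params:,}")
    print(f"~{n_params * 4 / 1e6:.1f} MB just for the float32 weights")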

Cross-validation “balancing” for regression problems [closed]

Classification problems can exhibit a strong label imbalance in the given dataset. This can be overcome by subsampling certain classes or by attributing class weights, which allows for balancing the label distribution at least during model training. Stratification, on the other hand, allows for keeping a certain label distribution, which is preserved in every respective fold.
For a regression problem this is not defined by standard libraries, e.g. scikit-learn. There are a few approaches that cover stratification, and a well-written theoretical approach to regression subsampling by Scott Lowe here.
I am wondering why label balancing for regression, as opposed to classification, receives so little attention in the Machine Learning community. Regression problems also exhibit different characteristics that might be easier or harder to acquire in a data collection setting. And then, is there any framework or paper that further addresses this issue?
The complexity of the problem lies in the continuous nature of regression. When you have classification, it is very natural to split the samples into classes because they are basically already split into classes :) Now, if you have a regression, the number of possible splits is basically infinite and, most importantly, it is just impossible to know what a good split would be. As in the article you sent, you might apply sorted or fractional approaches, but in the end, you have no idea to what extent they would be correct. You can also split it into intervals. This is what the stack library does. In the documentation, it says: "For continuous target variable overstock uses binning and categoric split based on bins". What they do is first assign the continuous values to bins (classes) and then apply stratification on them.
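The binning idea can be reproduced directly with scikit-learn: discretize the continuous target into quantile bins and hand those bins to StratifiedKFold. A minimal sketch (the synthetic data and the choice of 10 bins are arbitrary):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = rng.lognormal(size=1000)                       # continuous regression target

    # assign each target to one of 10 quantile bins, then stratify on the bins
    edges = np.quantile(y, np.linspace(0, 1, 11))
    y_binned = np.digitize(y, edges[1:-1])

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in skf.split(X, y_binned):
        # each fold now has a roughly similar distribution of y
        pass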
There are not many studies on this because everything you can come up with is going to be a heuristic. However, there can be exceptions if you can incorporate some domain knowledge. As an example, let's say that you are trying to predict the frequency of some electromagnetic waves from some set of features. In that case, you have prior knowledge of how the wave frequencies are split (https://en.wikipedia.org/wiki/Electromagnetic_spectrum). So now it is natural to split them into continuous intervals with respect to their wavelengths and do a regression stratification. But otherwise, it is hard to come up with something that would generalize.
I personally never encountered a study on this.

What is the relation between time taken to detect an object in an image and training an image? [closed]

I have an ML model that takes X seconds to detect an object in an image on which it is trained. Does that mean it took at least X or X+Y seconds during training per image? Can you provide a detailed insight?
For instance, assume the training speed of an SSD512 model is 30 images per second on a hardware platform. Does this imply that I will be able to achieve an inference speed of at least 30 images per second (if not more)?
The question is not confined to neural network models; a generic insight is appreciated. I am dealing with Cascade Classifiers in my case. I am loading a cascade.xml trained model to detect an object. I want to know the relation between the time taken to train on an image and the time taken to detect an object after loading the trained model.
Since it is not stated, I assume here that you mean a neural network ML model.
The training process could be seen as two steps: running the network to detect the object and updating the weights to minimize the loss function.
Running the network: during training, in the forward part you essentially run the network as if you were detecting the object using the current network weights, which takes the X time you stated. It should take the same as when the network is used after training, for example on the test dataset (to keep things simple I am ignoring the mini-batch learning usually used, which might change things).
Updating the weights: this part of the training is done by completing the backpropagation algorithm, which tells you how changing the weights will affect your detection performance (i.e. lower the loss function for the current image); then usually a stochastic gradient descent iteration is done, which updates the weights. This is the Y you stated, which in fact can be bigger than X.
These two parts are done for every image (more commonly, for every mini-batch) in the training process.
UPDATE: You said in your response that you are looking for an answer for a generic algorithm. It is an interesting question! When looking at the training task, you always need to learn some kind of weights W that are the outcome of the training process and are the essence of what was learned. The update needs to make the learned function better, which basically sounds harder than simply running the function. I really don't know of any algorithm (certainly not the commonly used ones) that would take less training time than running time per image, but it might be theoretically possible.
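A rough way to see the X versus X + Y relationship for a neural network is to time a plain forward pass against a forward-plus-backward training step. A hedged sketch (PyTorch assumed, tiny toy model, timings will be noisy and hardware-dependent):

    import time
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(256, 512)
    target = torch.randint(0, 10, (256,))

    # X: inference only (no gradients tracked)
    t0 = time.perf_counter()
    with torch.no_grad():
        for _ in range(100):
            model(x)
    x_time = time.perf_counter() - t0

    # X + Y: forward pass plus backward pass and weight update
    t0 = time.perf_counter()
    for _ in range(100):
        optimizer.zero_grad()
        loss = criterion(model(x), target)
        loss.backward()
        optimizer.step()
    xy_time = time.perf_counter() - t0

    print(f"inference: {x_time:.3f}s, training step: {xy_time:.3f}s")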

Using Reinforcement Learning for Classification Problems [closed]

Can I use reinforcement learning on classification? Such as human activity recognition? And how?
There are two types of feedback. One is evaluative, which is used in reinforcement learning methods, and the other is instructive, which is used in supervised learning, mostly for classification problems.
When supervised learning is used, the weights of the neural network are adjusted based on the correct labels provided in the training dataset. So, on selecting a wrong class, the loss increases and the weights are adjusted, so that for an input of that kind, this wrong class is not chosen again.
However, in reinforcement learning, the system explores all the possible actions (class labels for various inputs in this case) and, by evaluating the reward, decides what is right and what is wrong. Until it hits the correct class label, it may keep giving a wrong class name, because that is the best possible output it has found so far. So it doesn't make use of the specific knowledge we have about the class labels, and hence the convergence rate slows significantly compared to supervised learning.
You can use reinforcement learning for classification problems, but it won't give you any added benefit and will instead slow down your convergence rate.
Short answer: Yes.
Detailed answer: yes, but it's overkill. Reinforcement learning is useful when you don't have a labeled dataset to learn the correct policy, so you need to develop the correct strategy based on the rewards. It also allows you to backpropagate through non-differentiable blocks (which I suppose is not your case). The biggest drawback of reinforcement learning methods is that they typically take a VERY large amount of time to converge. So, if you possess labels, it would be a LOT faster and easier to use regular supervised learning.
You may be able to develop an RL model that chooses which classifier to use, with the ground-truth labels being used to train the classifiers and the change in performance of those classifiers being the reward for the RL model. As others have said, it would probably take a very long time to converge, if it ever does. This idea may also require many tricks and tweaks to make it work. I would recommend searching for research papers on this topic.
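For illustration only, classification can be framed as a one-step RL problem: the "action" is the predicted class and the reward is 1 for a correct prediction, 0 otherwise, trained with a REINFORCE-style update. A minimal sketch (PyTorch assumed, toy data), mainly to show the evaluative-feedback setup that the answers above argue converges more slowly than ordinary supervised training:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(1000, 20)                          # toy features
    y = (X[:, 0] > 0).long()                           # toy binary labels

    policy = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    for epoch in range(200):
        dist = torch.distributions.Categorical(logits=policy(X))
        actions = dist.sample()                        # "explore" the class labels
        reward = (actions == y).float()                # evaluative feedback only
        # REINFORCE: raise the log-probability of actions with above-average reward
        loss = -(dist.log_prob(actions) * (reward - reward.mean())).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    accuracy = (policy(X).argmax(dim=1) == y).float().mean().item()
    print(f"accuracy after REINFORCE training: {accuracy:.2f}")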
