Effect of learning rate on the neural network

I have a dataset of about 100 numeric values. I have set the learning rate of my neural network to 0.0001, and I have successfully trained it on the dataset for over 1 million iterations. But my question is: what is the effect of very low learning rates in neural networks?

A low learning rate mainly implies slow convergence: you're moving down the loss function with smaller steps (the step size is the learning rate).
If your function is convex this is not a problem: you will wait longer, but you will still reach a good solution.
If, as in the case of deep neural networks, your function is not convex, then a low learning rate could lead you to a "good" optimum that is not the best one (getting stuck in a local minimum, without making steps big enough to jump out of it).
That's why there are adaptive optimization algorithms: such algorithms, like Adam, RMSProp, etc., keep a different learning rate for each weight in the network (every learning rate starts from the same value). This way, the optimizer can work on every parameter independently, with the aim of finding a better solution (and making the choice of the initial learning rate less critical).
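To make the step-size intuition concrete, here is a minimal sketch (plain gradient descent on a 1-D quadratic, not a full network; all values are illustrative) showing how the learning rate controls both convergence speed and stability:

    # Gradient descent on f(w) = w^2, whose gradient is 2w.
    def gradient_descent(lr, steps=50, w0=5.0):
        w = w0
        for _ in range(steps):
            grad = 2 * w       # gradient of f(w) = w^2
            w = w - lr * grad  # the learning rate scales each step
        return w

    print(gradient_descent(lr=0.0001))  # ~4.95: barely moves, very slow convergence
    print(gradient_descent(lr=0.1))     # ~0.0: converges to the minimum at w = 0
    print(gradient_descent(lr=1.1))     # huge magnitude: overshoots and diverges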

Related

What is meant by stability in relation to neural networks

I hear the terms stability/instability thrown around a lot when reading up on Deep Q Networks. I understand that stability is improved with the addition of a target network and replay buffer, but I fail to understand exactly what it's referring to.
What would the loss graph look like for an unstable vs. a stable neural network?
What does it mean when a neural network converges/diverges?
Stability, also known as algorithmic stability, is a notion in computational learning theory of how a machine learning algorithm is perturbed by small changes to its inputs. A stable learning algorithm is one for which the prediction does not change much when the training data is modified slightly.
Here, stability means: suppose you have 1000 training examples that you use to train the model, and it performs well. In terms of model stability, if you train the same model with 900 training examples, the model should still perform well; that's why it is also called algorithmic stability.
As for the loss graph: if the model is stable, the loss graph should look roughly the same for both training set sizes (1000 and 900), and noticeably different in the case of an unstable model.
Since in machine learning we want to minimize loss, when we say a model converges we mean that the model's loss value is within an acceptable margin and the model has reached the stage where no additional training would improve it.
Divergence (in the information-theoretic sense, such as the KL divergence) is a non-symmetric measure used to quantify the difference between distributions. For example, to measure the difference between two distributions you would use a divergence instead of a traditional symmetric metric like a distance.
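To illustrate the asymmetry, here is a minimal sketch (using the KL divergence as one common example; the distributions are made up):

    import numpy as np

    # KL divergence D(p || q) between two discrete distributions.
    # It is non-symmetric: D(p || q) != D(q || p) in general.
    def kl_divergence(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return np.sum(p * np.log(p / q))

    p = [0.7, 0.2, 0.1]
    q = [0.4, 0.4, 0.2]
    print(kl_divergence(p, q))  # ~0.18
    print(kl_divergence(q, p))  # ~0.19, a different value: not symmetric

The two directions give different values, which is exactly why a divergence is not a distance.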

TFF: test accuracy fluctuates

I train a ResNet50 model with TFF, and I use test accuracy on test data for evaluation, but I see many fluctuations, as shown in the figure below. How can I avoid this fluctuation?
I would say behavior such as this is to be expected for stochastic optimization in general. The inherent variance causes you to oscillate somewhere around a good solution. The magnitude of the variance and the properties of the optimization objective control how much this oscillation shows up in an accuracy metric.
For plain SGD, decreasing the learning rate decreases the variance and slows down convergence.
For optimization methods for federated learning, the story is a bit more complicated, but decreasing the client learning rate, or decreasing the number of local steps (while keeping other things the same), can have a similar effect, typically including slowing down convergence. More details can be found in https://arxiv.org/abs/2007.00878, also mentioned in the other answer. Potentially decreasing the client learning rate across rounds could also work. The details can also differ based on exactly which optimization method you are using.
How is the test accuracy calculated? How many local epochs do the clients train for?
If the global model is tested on a held-out set of examples, it is possible that clients are detrimentally overfitting during local training. As the global model approaches convergence, each client ends up training a model that works well for it individually, but may be diverging from the optimal global model (sometimes called client drift, https://arxiv.org/abs/1910.06378). This may occur when a client's local dataset has a distribution very different from the global distribution, and it is more likely when the client learning rates are high (https://arxiv.org/abs/2007.00878).
Decreasing the client learning rate, reducing the number of steps/batches, and other methods that cause the clients to do less "work" per communication round may reduce the fluctuation.
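As a rough illustration of that round-based decay idea (a hypothetical schedule, not part of the TFF API; the function name and constants are invented for this sketch):

    # Decay the client learning rate across communication rounds to damp
    # the oscillation as the global model approaches convergence.
    def client_learning_rate(round_num, initial_lr=0.1, decay_rate=0.99):
        return initial_lr * (decay_rate ** round_num)

    for round_num in (0, 50, 100, 200):
        print(round_num, client_learning_rate(round_num))

The decayed value would then be passed to whatever constructs the client optimizer in your federated training loop.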

How do we define a bad learning rate in a neural network?

I am trying to write a proper definition of a bad learning rate in a neural network, as follows:
A bad learning rate in a neural network is one that is set too low or too high: with too low a learning rate the network takes too long to train, while with too high a learning rate the weights change too quickly, which might result in unstable output.
Any suggestion would be much appreciated.
I believe that an efficient learning rate (alpha) depends on the data. The points that you have mentioned are absolutely correct with regard to inefficient learning rates, so there is no hard and fast rule for selecting alpha. Let me enumerate the steps that I take while deciding alpha:
You would obviously want a big alpha so that your model learns quickly.
Also note that a big alpha can lead to overshooting the minimum, and hence your hypothesis will not converge.
To take care of this, you can go for learning rate decay. This decreases your learning rate as you approach the minimum, slowing down learning so that your model doesn't overshoot.
A couple of ways to do this:
Step decay
Exponential decay
Linear decay
You can select any one of these and then train your model. Having said that, let me point out that it still requires some trial and error on your side until you get the optimum result.
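Here is a minimal sketch of the three schedules named above (the constants are illustrative defaults, not recommendations):

    import math

    def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
        # Halve the learning rate every `epochs_per_drop` epochs.
        return initial_lr * (drop ** (epoch // epochs_per_drop))

    def exponential_decay(initial_lr, epoch, k=0.05):
        # Shrink the learning rate smoothly by a factor exp(-k) per epoch.
        return initial_lr * math.exp(-k * epoch)

    def linear_decay(initial_lr, epoch, total_epochs=100):
        # Decrease the learning rate linearly to zero over `total_epochs`.
        return initial_lr * max(0.0, 1.0 - epoch / total_epochs)

    for epoch in (0, 10, 50, 100):
        print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch),
              linear_decay(0.1, epoch))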

Convnet suddenly drops accuracy

I was trying to train an emotion recognition model on the FER2013 dataset, using the architecture proposed in this paper.
The paper uses a different dataset than mine, so I made some modifications to the stride and filter size.
After a couple of hours of training, accuracy on both the training and the test set suddenly drops.
After that, the accuracy just stays around 0.1-0.2 for both sets and never improves again.
Does anybody know about this phenomenon?
In any neural network training, if both accuracies, i.e. training and validation, improve at first and then start decreasing, it is a sign that your network is failing to converge. More precisely, your optimizer has started overshooting.
One very likely reason for this is a high learning rate. Reduce your learning rate and then check your example again. Also, in your linked paper (at least at first glance) I couldn't see the learning rate mentioned. Since your data is different from the paper's, the same learning rate might not work as well.
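One practical way to act on this advice, assuming a Keras training loop (the constants here are illustrative, not tuned for FER2013): start from a lower learning rate and reduce it further when the validation loss plateaus.

    import tensorflow as tf

    # Lower initial learning rate, reduced further when val_loss plateaus.
    optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',  # watch the validation loss
        factor=0.1,          # multiply the LR by 0.1 on a plateau
        patience=5,          # wait 5 epochs without improvement first
        min_lr=1e-6)

    # model.compile(optimizer=optimizer, loss='categorical_crossentropy',
    #               metrics=['accuracy'])
    # model.fit(x_train, y_train, validation_data=(x_val, y_val),
    #           epochs=100, callbacks=[reduce_lr])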

Neural Networks: Online learning (one-example-at-a-time)

I have a neural network that performs a classification task, and it works fairly well when the training set is large enough.
However, I'm looking for a way to train the NN with one labelled example at a time.
That is, I intercept data, one example at a time, and I need to update my NN weights. I'm not allowed to store the data for future, batch training purposes.
I have a feed-forward NN built in Octave (following Stanford's ML course on Coursera). I can run back-prop on the NN using each new example, but that approach converges unreliably and takes a significant amount of time.
Are there any other more efficient algorithms for online learning out there in the context of Neural Networks?
I noticed Matlab's adapt() function seems to do just this. The documentation doesn't specify what it's doing behind the scenes. Is it just one iteration of back-prop?
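For reference, here is a minimal sketch of the one-example-at-a-time update the question describes (logistic regression stands in for the network; all names are illustrative):

    import numpy as np

    class OnlineClassifier:
        def __init__(self, n_features, lr=0.01):
            self.w = np.zeros(n_features)
            self.b = 0.0
            self.lr = lr

        def update(self, x, y):
            # One SGD step on a single labelled example (x, y in {0, 1}).
            p = 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))
            grad = p - y                  # gradient of the log loss w.r.t. the logit
            self.w -= self.lr * grad * x
            self.b -= self.lr * grad

    clf = OnlineClassifier(n_features=2)
    clf.update(np.array([1.0, -0.5]), 1)  # intercept one example, update, discard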
