Learning rate, Loss and Batch size - machine-learning

Is loss dependent upon learning rate and batch size? For example, if I keep the batch size at 4 and a learning rate of, say, 0.002, the loss does not converge, but if I change the batch size to 32 while keeping the learning rate the same, I get a converging loss curve. Is this okay?

I would say that the loss is highly dependent on the parameters you use for your training. On the other hand, I would not call it a dependency in terms of a mathematical function but rather a relation.
If your network does not learn, you need to tweak the parameters (architecture, learning rate, batch size, etc.).
It is hard to give a more specific answer to your question. Which parameter values are acceptable depends on the problem. However, if the loss converges and you can validate your solution, I would say that you are fine.
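As a rough illustration of how batch size can interact with a fixed learning rate, here is a minimal sketch on synthetic data with a tiny Keras model (only the batch sizes 4 and 32 and the learning rate 0.002 come from the question; everything else is an assumption chosen for illustration):

```python
# Minimal sketch: compare loss convergence for batch sizes 4 and 32 at a
# fixed learning rate of 0.002 (synthetic data, tiny model - assumptions only).
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 20)).astype("float32")
y = (x.sum(axis=1) > 0).astype("float32")          # toy binary labels

def make_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.002),
                  loss="binary_crossentropy")
    return model

for batch_size in (4, 32):
    history = make_model().fit(x, y, batch_size=batch_size, epochs=20, verbose=0)
    print(f"batch_size={batch_size}, final loss={history.history['loss'][-1]:.4f}")
```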

Related

Where do # of epochs and batch size belong in the hyperparameter tuning process?

I'm fairly new to machine learning, and working on optimizing hyperparameters for my model. I'm doing this via a randomized search. My question is: should I be searching over # of epochs and batch size along with my other hyperparameters (e.g. loss function, number of layers, etc.)? If not, should I fix these values first, find the other parameters, then return to tune these?
My concern is a) that searching over many epochs will be extremely time-consuming, so leaving it at one low value for the initial scan would be useful and b) that these parameters, esp. # of epochs, will disproportionately affect the results when the model is behaving well, and won't really give me much information about the rest of my architecture, as there should be a regime where more epochs, up to a point, are better. I know this isn't totally accurate, i.e. # of epochs is a real hyperparameter and too many can lead to overfitting issues, for example. Currently, my model is not clearly improving with # of epochs, though it was suggested by someone working on a similar problem within my area of research that this may be mitigated by implementing batch normalization, which is another parameter I am testing. Finally, I am worried that batch size will be quite affected by the fact that I am scaling my data down to 60% to allow my code to run reasonably (and I think the final model will be trained on vastly more data than the simulated data currently available to me).
I agree with your intuition on epochs. It is common to keep this value as low as possible in order to complete more training "experiments" in the same number of working hours. I don't have a great reference here, but I would welcome one in the comments.
For almost everything else, there is a paper by Leslie N. Smith that I can't recommend enough, A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay.
As you can see, batch size is included but epochs are not. You will also notice that the model architecture is not included (number of layers, layer size, etc). Neural Architecture Search is a huge research field in its own right, separate from hyper-parameter tuning.
As for the loss function, I can't think of any reason to "tune" that except in the context of an Auxiliary Loss for training only, which I suspect is not what you are talking about.
The loss function that will be applied to your validation or test set is part of the problem statement. That, along with the data, defines the problem you are solving. You don't change it by tuning; you change it by convincing a product manager that your alternative is better for the business need.
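To make the randomized-search idea from the question concrete, here is a minimal sketch that samples the hyper-parameters covered in Smith's paper (learning rate, batch size, momentum, weight decay) while keeping the number of epochs fixed and low. The search ranges and the `train_and_evaluate` helper are hypothetical placeholders, not taken from the question.

```python
# Hedged sketch: random search over learning rate, batch size, momentum and
# weight decay, with the number of epochs held fixed and low.
# `train_and_evaluate` is a hypothetical placeholder for your own training code.
import random

def sample_config(rng):
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform sampling
        "batch_size": rng.choice([16, 32, 64, 128]),
        "momentum": rng.choice([0.85, 0.9, 0.95, 0.99]),
        "weight_decay": 10 ** rng.uniform(-6, -3),
        "epochs": 5,                                   # kept low for cheap trials
    }

def run_search(train_and_evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        config = sample_config(rng)
        score = train_and_evaluate(**config)           # returns a validation metric
        results.append((score, config))
    return max(results, key=lambda item: item[0])      # best (score, config)
```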

Deep Neural Network - Order of the Parameters to tune

I am new to the DNN field and I am fed up with tuning hyperparameters and other parameters in a DNN, because there are a lot of parameters to tune and it is like a multivariate analysis without the help of a computer. How can a human move towards the highest accuracy that can be achieved for a task using a DNN, given the huge number of variables inside a DNN? And how will we know what accuracy is possible to get using a DNN, or do I have to give up on DNNs? I am lost. Help is appreciated.
Main problems I have:
1. What are the limits of DNNs / when do we have to give up on DNNs?
2. What is the proper way of tuning without missing good parameter values?
Here is the summary I got from studying the theory in this field. Corrections are much appreciated if I am wrong or have misunderstood. You can add anything I missed. Sorted by importance according to my knowledge.
For overfitting:
1. reduce the number of layers
2. reduce the number of nodes per layer
3. add regularizers (l1 / l2 / l1-l2) - have to decide the factors
4. add dropout layers - have to decide the dropout factor
5. reduce the batch size
6. stop training earlier (early stopping)
For underfitting:
1. increase the number of layers
2. increase the number of nodes per layer
3. add different types of layers (Conv, LSTM, ...)
4. add learning rate decay (decide the type and the parameters for that type)
5. reduce the learning rate
Other than that, generally we can adjust:
1. the number of epochs (by watching what happens during training)
2. the learning rate
3. batch normalization - for faster learning
4. initialization techniques (zero / random / Xavier / He)
5. different optimization algorithms
Auto-tuning methods:
- GridSearchCV - but for this, we have to choose what we want to change, and it takes a lot of time.
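To illustrate the GridSearchCV option listed above, here is a minimal sketch using scikit-learn's MLPClassifier on synthetic data as a stand-in for a small DNN; the estimator, the parameter grid, and the data are assumptions for illustration only.

```python
# Hedged sketch: grid search over a few common hyper-parameters, using
# scikit-learn's MLPClassifier as a stand-in for a small DNN.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],   # depth / width
    "alpha": [1e-5, 1e-4, 1e-3],                      # l2 regularization factor
    "learning_rate_init": [1e-3, 1e-2],
    "batch_size": [32, 128],
}

search = GridSearchCV(MLPClassifier(max_iter=200, random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```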
Short Answer: You should experiment a lot!
Long Answer: At first, you may be overwhelmed by having plenty of knobs that you can tweak, but you gradually become experienced. A very quick way to gain some intuition on how you should tune the hyperparameters of your model is trying to replicate what other researchers have published. By replicating the results (and trying to improve the state-of-the-art), you acquire the intuition about deep learning.
I, personally, follow no particular order in tuning the hyperparameters of the model. Instead, I try to implement a dirty model and try to improve it. For instance, if I see that there are overshoots in validation accuracy, which might be an indicator of the fact that the model is bouncing around the sweet spot, I divide the learning rate by ten and see how it goes. If I see the model begins to overfit, I use early stopping to save the best parameters before overfitting. I also play with dropout rates and weight decay to find the best combination of them in order to have the model fit enough while maintaining the regularization effect. And so on.
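For instance, in Keras the two habits above (dividing the learning rate by ten when validation stalls, and keeping the best weights before overfitting sets in) can be written as callbacks; the monitor, factor, and patience values below are assumptions for illustration.

```python
# Hedged sketch: Keras callbacks for the two habits described above.
import tensorflow as tf

callbacks = [
    # Divide the learning rate by ten when the validation loss stops improving.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3),
    # Stop training and restore the best weights once overfitting sets in.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=callbacks)
```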
To correct some of your assumptions, adding different types of layers will not necessarily help your model not to overfit. Moreover, sometimes (especially when using transfer learning, which is a trend these days), you cannot simply add a convolutional layer to your neural network.
Assuming you are dealing with computer vision tasks, data augmentation is another useful approach to increase the amount of data available to train your model and improve its performance.
Also, note that Batch Normalization also has a regularization effect. Weight Decay is another implementation of l2 regularization that is widely used.
Another interesting technique that can improve the training of neural networks is the One Cycle policy for learning rate and momentum (if applicable). Check this paper out: https://doi.org/10.1109/WACV.2017.58
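A minimal sketch of a one-cycle-style learning rate schedule as a Keras callback is shown below; the maximum and minimum learning rates, the number of epochs, and the linear warm-up/anneal shape are assumptions, and the cited papers describe the full policy (including the momentum schedule).

```python
# Hedged sketch: a simple one-cycle-style learning rate schedule for Keras.
# Warm up linearly to max_lr over the first half of training, then anneal back.
import tensorflow as tf

def one_cycle_schedule(total_epochs, max_lr=0.1, min_lr=0.001):
    def schedule(epoch, lr):
        half = total_epochs / 2
        if epoch < half:
            return min_lr + (max_lr - min_lr) * (epoch / half)
        return max_lr - (max_lr - min_lr) * ((epoch - half) / half)
    return schedule

lr_callback = tf.keras.callbacks.LearningRateScheduler(one_cycle_schedule(total_epochs=30))
# model.fit(x_train, y_train, epochs=30, callbacks=[lr_callback])
```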

Why does different batch-sizes give different accuracy in Keras?

I was using a Keras CNN to classify the MNIST dataset. I found that using different batch sizes gave different accuracies. Why is this so?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the effect of mini-batch gradient descent during the training process. You can find a good explanation here; I quote some notes from that link below:
Batch size is a slider on the learning process. Small values give a learning process that converges quickly at the cost of noise in the training process. Large values give a learning process that converges slowly with accurate estimates of the error gradient.
and also one important note from that link is:
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller.
This is the result of this paper.
EDIT
I should mention two more points here:
Because of the inherent randomness in machine learning algorithms, you should generally not expect machine learning algorithms (like deep learning algorithms) to produce the same results on different runs. You can find more details here.
On the other hand, both of your results are very close and effectively equal, so in your case we can say that the batch size has no effect on your network's results, based on what was reported.
This is not connected to Keras. The batch size, together with the learning rate, are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD), and they strongly affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them in the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimate, which helps regularize the model, and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
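To make the "noisy estimate" point concrete, here is a minimal NumPy sketch on synthetic linear-regression data (all values are assumptions) comparing the mini-batch gradient with the full-batch gradient during SGD.

```python
# Hedged sketch: mini-batch SGD on synthetic linear regression, showing that
# the mini-batch gradient is a noisy estimate of the full-batch gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.1, 32

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad_batch = 2 * Xb.T @ (Xb @ w - yb) / batch_size   # noisy estimate
    grad_full = 2 * X.T @ (X @ w - y) / len(X)           # exact gradient
    if step % 50 == 0:
        print(step, np.linalg.norm(grad_batch - grad_full))
    w -= lr * grad_batch                                 # SGD update

print("recovered weights:", np.round(w, 2))
```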
I want to add two points:
1) When using special treatments, it is possible to achieve similar performance with a very large batch size while speeding up the training process tremendously. For example, see
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest over-interpreting these numbers, because the difference is so subtle that it could be caused by noise. I bet that if you try models saved at a different epoch, you will see a different result.
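Regarding point 1, the "special treatments" in that paper include a linear learning-rate scaling rule with a gradual warm-up: when the mini-batch size is multiplied by k, the learning rate is multiplied by k and ramped up over the first few epochs. A minimal sketch (the base values are assumptions) might look like this.

```python
# Hedged sketch: linear learning-rate scaling with gradual warm-up, in the
# spirit of "Accurate, Large Minibatch SGD". Base values are assumptions.
def scaled_lr(base_lr, base_batch, batch_size):
    """Linear scaling rule: multiply the LR by the batch-size ratio."""
    return base_lr * batch_size / base_batch

def warmup_lr(target_lr, epoch, warmup_epochs=5):
    """Ramp the LR linearly from a small value up to target_lr."""
    if epoch < warmup_epochs:
        return target_lr * (epoch + 1) / warmup_epochs
    return target_lr

target = scaled_lr(base_lr=0.1, base_batch=256, batch_size=8192)   # -> 3.2
for epoch in range(8):
    print(epoch, warmup_lr(target, epoch))
```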

ResNet How to achieve accuracy as in the document?

I implemented ResNet for CIFAR-10 following this paper: https://arxiv.org/pdf/1512.03385.pdf
But my accuracy is significantly different from the accuracy reported in the paper.
Mine - 86%
Paper's - 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little bit too generic, but my opinion is that the network is overfitting to the training data set: as you can see, the training loss is quite low, but after epoch 50 the validation loss is not improving anymore.
I didn't read the paper in depth, so I don't know how they solved the problem, but increasing regularization might help. The following link will point you in the right direction: http://cs231n.github.io/neural-networks-3/
Below I copied the summary of the text:
Summary
To train a Neural Network:
- Gradient check your implementation with a small batch of data and be aware of the pitfalls.
- As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data.
- During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
- The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
- Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
- Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs) to fine (narrower ranges, training for many more epochs).
- Form model ensembles for extra performance.
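For the "halve the learning rate after a fixed number of epochs" point above, a minimal step-decay schedule might look like the following (the initial rate and step length are assumptions).

```python
# Hedged sketch: step decay that halves the learning rate every `step` epochs.
def step_decay(epoch, initial_lr=0.1, step=30):
    return initial_lr * (0.5 ** (epoch // step))

for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_decay(epoch))   # 0.1, 0.1, 0.05, 0.025, 0.0125
```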
I would argue that the difference in data pre-processing makes the difference in performance. They use padding and random crops, which in essence increases the number of training samples and decreases the generalization error. Also, as the previous poster said, you are missing regularization features such as weight decay.
You should take another look at the paper and make sure you implement everything the way they did.
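For reference, a minimal NumPy sketch of that kind of augmentation (4-pixel padding, a random 32x32 crop, and a random horizontal flip, roughly as described for CIFAR-10 in the ResNet paper) could look like this; the exact implementation details are assumptions.

```python
# Hedged sketch: CIFAR-10 style augmentation (pad 4 pixels, random 32x32 crop,
# random horizontal flip), roughly as described in the ResNet paper.
import numpy as np

def augment(image, rng, pad=4):
    """image: a (32, 32, 3) array; returns a randomly cropped/flipped copy."""
    h, w = image.shape[:2]
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]          # horizontal flip
    return crop

rng = np.random.default_rng(0)
dummy = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
print(augment(dummy, rng).shape)      # (32, 32, 3)
```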

Learning Rate parameter in an Artificial Neural Network software program

I am using a software program of the type that is known as an Artificial Neural Network. One of the parameters of the software is called Learning Rate (also known as alpha). The learning rate setting can be controlled by moving a slider back and forth. On one side of the slider is the value 1E-05 on the other side is just 1. In between are various values such as 9E-05, .000045, etc. What I want to know is which one of these 2 learning rates is the fastest learning rate, 1E-05 on one side or 1 on the other. Thanks.
Learning rate is not about the speed of training; it is the size of the step taken when using a rather naive approximation of the function (linear for first-order optimizers, quadratic for second-order ones). Consequently, a very small learning rate should lead to slow training, but a big learning rate can lead to a lack of training altogether. Furthermore, values in between need not behave monotonically (you can have training where a smaller learning rate actually converges faster than a bigger one). So even though we could naively say that a big learning rate means faster training, in general this is not true; moreover, one cannot say which learning rate is the fastest one. You can only use some general heuristics/observations here: you can start with a big learning rate, and if results are bad, try reducing it. But in terms of actual training-time guarantees, there are none.
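A tiny sketch on a one-dimensional quadratic (the function and step sizes are assumptions chosen for illustration) shows both failure modes: a very small learning rate barely moves, while a too-large one diverges.

```python
# Hedged sketch: gradient descent on f(x) = x^2 with different learning rates,
# illustrating slow convergence (tiny lr) versus divergence (too-large lr).
def descend(lr, steps=20, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x       # gradient of x^2 is 2x
    return x

for lr in (1e-5, 0.1, 0.5, 1.1):
    print(lr, descend(lr))
# 1e-5 barely moves, 0.1 converges, 0.5 reaches 0 in one step, 1.1 blows up.
```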
