Minibatching in Stochastic Gradient Descent and in Q-Learning - machine-learning

Background (may be skipped):
In training neural networks, stochastic gradient descent (SGD) is usually used: instead of computing the network's error on all members of the training set and updating the weights by gradient descent (which means waiting a long time before each weight update), use a minibatch of members each time, and treat the resulting error as an unbiased estimate of the true error.
In reinforcement learning, Q-learning is sometimes implemented with a neural network (as in deep Q-learning), and experience replay is used: instead of updating the weights using the agent's most recent (state, action, reward), update using a minibatch of random samples of old (state, action, reward) tuples, so that there is no correlation between subsequent updates.
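For concreteness, a minimal replay-buffer sketch (the class and method names are illustrative, not taken from any particular library):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state):
            # Store one transition; the oldest ones are discarded once capacity is reached.
            self.buffer.append((state, action, reward, next_state))

        def sample(self, batch_size):
            # Uniformly sample past transitions, which breaks the correlation
            # between consecutive updates.
            return random.sample(self.buffer, batch_size)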
The Question:
Is the following assertion correct? When minibatching in SGD, one weight update is performed per whole minibatch, while when minibatching in Q-learning, one weight update is performed per member of the minibatch.
One more thing:
I think this question is more suitable for Cross Validated, since it is a conceptual question about machine learning and has nothing to do with programming, but looking at questions tagged reinforcement-learning on Stack Overflow, I conclude that it is the norm to ask such questions here, and the number of responses I can get is larger.

The answer is no. The Q-network's parameters can be updated at once using all examples in a minibatch. Denote the members of the minibatch by (s1,a1,r1,s'1),(s2,a2,r2,s'2),... Then the loss is estimated relative to the current Q-network:
L = (Q(s1,a1) - (r1 + max_a' Q(s'1,a')))^2 + (Q(s2,a2) - (r2 + max_a' Q(s'2,a')))^2 + ...
This is an estimation of the true loss, which is an expectation over all (s,a,r). In this way, the updating of the parameters of Q is similar to SGD.
Notes:
the expression above could also contain a discount factor.
the estimation is biased since it does not contain a term representing the variance due to s', but this does not change the direction of the gradient.
sometimes, the target term in each square is computed not with the current Q but with a past (frozen) copy of Q (as with the target network in DQN, or in double Q-learning).
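For illustration, here is a minimal TensorFlow-style sketch of one minibatch update of the Q-network; the function and variable names are my own, and it includes the discount factor and frozen target network mentioned in the notes above:

    import tensorflow as tf

    def q_update(q_net, target_net, optimizer, batch, gamma=0.99):
        # batch: states (B, d), integer actions (B,), rewards (B,), next_states (B, d)
        states, actions, rewards, next_states = batch
        with tf.GradientTape() as tape:
            q_taken = tf.gather(q_net(states), actions, batch_dims=1)   # Q(s_i, a_i)
            # r_i + gamma * max_a' Q(s'_i, a'), evaluated with a frozen copy of Q
            targets = rewards + gamma * tf.reduce_max(target_net(next_states), axis=1)
            loss = tf.reduce_mean(tf.square(q_taken - tf.stop_gradient(targets)))
        # One parameter update for the whole minibatch, exactly as in SGD.
        grads = tape.gradient(loss, q_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
        return loss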

Related

Why do different batch sizes give different accuracy in Keras?

I was using a Keras CNN to classify the MNIST dataset. I found that using different batch sizes gave different accuracies. Why is this so?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the effect of mini-batch gradient descent during the training process. You can find a good explanation Here; I quote some notes from that link below:
Batch size is a slider on the learning process. Small values give a learning process that converges quickly at the cost of noise in the training process. Large values give a learning process that converges slowly with accurate estimates of the error gradient.
and also one important note from that link is:
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller.
This is the result of this paper.
EDIT
I should mention two more points here:
Because of the inherent randomness in machine learning algorithms, you generally should not expect machine learning algorithms (like deep learning algorithms) to produce the same results on different runs. You can find more details Here.
On the other hand, your two results are very close, so they are effectively equal. In your case, based on the reported results, we can say that the batch size has no effect on your network's results.
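If you want run-to-run comparability in Keras, a common approach (still not fully deterministic on GPU) is to fix the random seeds before building the model; a minimal sketch:

    import random
    import numpy as np
    import tensorflow as tf

    # Fix the sources of run-to-run randomness (weight initialization, data
    # shuffling, dropout). GPU kernels can still introduce tiny differences.
    random.seed(0)
    np.random.seed(0)
    tf.random.set_seed(0)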
This is not specific to Keras. The batch size and the learning rate are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD); they strongly affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them towards the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimation, which helps regularize the model and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
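To make that concrete, here is a minimal NumPy sketch of mini-batch SGD for linear regression; the batch size and learning rate are exactly the two hyper-parameters discussed above (all names and default values are illustrative):

    import numpy as np

    def minibatch_sgd(X, y, batch_size=32, lr=0.01, epochs=10):
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            perm = np.random.permutation(n)             # reshuffle the data every epoch
            for start in range(0, n, batch_size):
                idx = perm[start:start + batch_size]
                Xb, yb = X[idx], y[idx]
                # Noisy estimate of the gradient of the MSE loss on this minibatch.
                grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
                w -= lr * grad                          # one update per minibatch
        return w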
I want to add two points:
1) With special treatments, it is possible to achieve similar performance for a very large batch size while speeding up the training process tremendously. For example,
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest reading too much into these numbers. The difference is so subtle that it could be caused by noise. I bet that if you try models saved at a different epoch, you will see a different result.

Sequential or batch parameter estimation

This is the problem that I should describe. Unfortunately, the only technique I have studied for estimating the parameters of a linear regression is the classic gradient descent algorithm. Is that a "batch" or a "sequential" mode? And what is the difference between them?
I wasn't expecting to find exactly the question from the ML exam here! Well, the point is that, as James Phillips says, gradient descent is an iterative method, so-called sequential. Gradient descent is just an iterative optimization algorithm for finding the minimum of a function, but you can use it to find the best-fitting line. A complete batch approach would be, e.g., the linear least squares method, applying all the equations at once: you can find all the parameters by computing the partial derivatives of the sum of squared errors with respect to the parameters of the best-fit line and setting them to zero. Of course, as Phillips said, it is not a convenient method; it's more of a theoretical definition. Hope it is useful.
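A small NumPy sketch of the two modes for linear regression (my own illustration): the batch version solves the normal equations in one shot, while the sequential version updates the parameters one example at a time.

    import numpy as np

    def batch_fit(X, y):
        # Batch mode: linear least squares, all equations at once
        # (solves the normal equations X^T X w = X^T y, assuming X^T X is invertible).
        return np.linalg.solve(X.T @ X, X.T @ y)

    def sequential_fit(X, y, lr=0.01, epochs=50):
        # Sequential mode: stochastic gradient descent, one example at a time.
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                w -= lr * 2.0 * (xi @ w - yi) * xi
        return w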
From Liang et al. "A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks":
Batch learning is usually a time consuming affair as it may involve many iterations through the training data. In most applications, this may take several minutes to several hours and further the learning parameters (i.e., learning rate, number of learning epochs, stopping criteria, and other predefined parameters) must be properly chosen to ensure convergence. Also, whenever a new data is received batch learning uses the past data together with the new data and performs a retraining, thus consuming a lot of time. There are many industrial applications where online sequential learning algorithms are preferred over batch learning algorithms as sequential learning algorithms do not require retraining whenever a new data is received. The back-propagation (BP) algorithm and its variants have been the backbone for training SLFNs with additive hidden nodes. It is to be noted that BP is basically a batch learning algorithm. Stochastic gradient descent BP (SGBP) is one of the main variants of BP for sequential learning applications.
Basically, gradient descent is theorized in a batch way, but in practice you use iterative methods.
I think the question doesn't ask you to show two ways (batch and sequential) to estimate the parameters of the model, but instead to explain—either in a batch or sequential mode—how such an estimation would work.
For instance, if you are trying to estimate the parameters of a linear regression model, you could just describe likelihood maximization, which is equivalent to minimizing the least-squares error:
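A sketch of that equivalence, assuming i.i.d. Gaussian noise on the targets (this derivation is my own illustration, not quoted from the answer):

    % Model: y_i = w^T x_i + eps_i, with eps_i ~ N(0, sigma^2) i.i.d.
    \log p(y \mid X, w)
      = \sum_{i=1}^{n} \log \mathcal{N}\left(y_i \mid w^\top x_i, \sigma^2\right)
      = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2 + \mathrm{const}

so maximizing the log-likelihood over w is the same as minimizing the sum of squared errors.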
If you want to show a sequential mode, you can describe the gradient descent algorithm.

Is tuning batch size or epochs necessary for linear regression with TensorFlow?

I am working on an article where I focus on a simple problem – linear regression over a large data set in the presence of standard normal or uniform noise. I chose the Estimator API from TensorFlow as the modeling framework.
I am finding that hyperparameter tuning is, in fact, of little importance for such a machine learning problem when the number of training steps can be made sufficiently large. By hyperparameter I mean batch size or the number of epochs in the training data stream.
Is there any paper/article with formal proof of this?
I don't think there is a paper specifically focused on this question, because it's a more or less fundamental fact. The introductory chapter of this book discusses the probabilistic interpretation of machine learning in general and loss function optimization in particular.
In short, the idea is this: mini-batch optimization wrt (x1, ..., xn) is equivalent to consecutive optimization steps wrt the x1, ..., xn inputs, because the gradient is a linear operator. This means that the mini-batch update equals the sum of its individual updates. An important note here: I assume that the NN doesn't apply batch norm or any other layer that adds an explicit variation to the inference model (in that case the math is a bit hairier).
So the batch size can be seen as a purely computational idea that speeds up the optimization through vectorization and parallel computing. Assuming that one can afford arbitrarily long training and the data are properly shuffled, the batch size can be set to any value. But this isn't automatically true for all hyperparameters; for example, a very high learning rate can easily force the optimization to diverge, so don't make the mistake of thinking hyperparameter tuning isn't important in general.
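A quick NumPy check of that linearity claim for a squared-error loss (my own illustration; no batch norm or other batch-dependent layers):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(8, 3)          # a minibatch of 8 examples
    y = np.random.randn(8)
    w = np.random.randn(3)

    # Gradient of the mean squared error over the whole minibatch.
    batch_grad = 2.0 / len(X) * X.T @ (X @ w - y)

    # Mean of the per-example gradients.
    per_example = np.mean([2.0 * (xi @ w - yi) * xi for xi, yi in zip(X, y)], axis=0)

    # The two coincide because the gradient is a linear operator.
    assert np.allclose(batch_grad, per_example)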

Why do we need epochs?

In courses there is nothing about epochs, but in practice they are used everywhere.
Why do we need them if the optimizer finds the best weights in one pass? Why does the model improve?
Generally, whenever you want to optimize, you use gradient descent. Gradient descent has a parameter called the learning rate. In one iteration alone you cannot guarantee that the gradient descent algorithm will converge to a local minimum with the specified learning rate. That is why you iterate again, so that gradient descent converges better.
It's also good practice to change the learning rate per epoch, by observing the learning curves, for better convergence.
Why do we need [to train several epochs] if the optimizer finds the best weight in one pass?
That's wrong in most cases. Gradient descent methods (see a list of them) do not usually find the optimal parameters (weights) in one pass. In fact, I have never seen any case where the optimal parameters were even reached (except for constructed cases).
One epoch consists of many weight update steps. One epoch means that the optimizer has used every training example once. Why do we need several epochs? Because gradient descent is an iterative algorithm. It improves, but it only gets there in tiny steps. It uses only tiny steps because it can only use local information; it has no idea of the function beyond the current point at which it is.
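A tiny NumPy illustration of why several epochs help: each pass takes only small local steps, so the loss keeps dropping across epochs instead of being solved in one pass (all numbers are arbitrary):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(200, 5)
    true_w = np.random.randn(5)
    y = X @ true_w + 0.1 * np.random.randn(200)

    w = np.zeros(5)
    lr = 0.01
    for epoch in range(5):
        for xi, yi in zip(X, y):                  # one epoch = every training example used once
            w -= lr * 2.0 * (xi @ w - yi) * xi    # tiny local step per example
        mse = np.mean((X @ w - y) ** 2)
        print(f"epoch {epoch + 1}: MSE = {mse:.4f}")   # keeps improving across epochs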
You might want to read the gradient descent part of my optimization basics blog post.

Adequate Mean Squared Error while using Deep Autoencoder/ Deep Learning in general

I'm currently wondering when to stop training of Deep Autoencoders, especially when it seems to be stuck in a local minimum.
Is it essential to get the training criterion (e.g. MSE) down to, say, 0.000001 and force it to perfectly reconstruct the input, or is it okay to keep differences (e.g. stop when the MSE is at about 0.5), depending on the dataset used?
I know that a better reconstruction might lead to better classification results afterwards but is there a "rule of thumb" when to stop? I'm especially interested in rules that have no heuristic character like "if the MSE doesn't get smaller in x iterations".
I don't think it's possible to derive a general rule of thumb for this, as building NNs / machine learning models is a very problem-specific procedure, and generally there is no free lunch. Deciding what counts as a "good" training error to terminate at depends on various problem-specific factors, e.g. the noise in the data. Evaluating your NN only on the training set, with the sole objective of minimising the MSE, will often lead to overfitting: with only the training error as feedback, you might tune your NN to the noise in the training data. One method to avoid this is holdout validation. Instead of training your NN on all of the given data, you divide your data set into a training set, a validation set (and a test set).
Training set: used for training and as feedback to the NN; its error will naturally keep decreasing with longer training (at least down to "OK" MSE values for the specific problem).
Validation set: evaluate your NN with respect to it, but don't feed it back to your NN/genetic algorithm.
Along with the evaluation feedback from your training set, you should hence also evaluate the validation set, but without giving feedback to your neural network (NN).
Track the decrease in MSE for the training as well as the validation set; generally the training error will steadily decrease, whereas at some point the validation error will reach a minimum and start to increase with further training. Of course, you cannot know during runtime where this minimum occurs, so generally one stores the NN with the lowest validation error seen so far, and once this has seemingly not been updated for some time (i.e., in retrospect, we have passed a minimum of the validation error), the algorithm is terminated.
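A hedged sketch of that procedure (keep the model with the lowest validation error, stop once it has not improved for a while); the patience value and the callables are illustrative placeholders:

    import copy

    def train_with_early_stopping(model, train_epoch, val_mse, max_epochs=1000, patience=20):
        # train_epoch(model) runs one epoch of training; val_mse(model) returns validation MSE.
        best_mse, best_model, epochs_since_best = float("inf"), None, 0
        for epoch in range(max_epochs):
            train_epoch(model)
            mse = val_mse(model)
            if mse < best_mse:                      # new minimum of the validation error
                best_mse, best_model = mse, copy.deepcopy(model)
                epochs_since_best = 0
            else:
                epochs_since_best += 1
            if epochs_since_best >= patience:       # seemingly past the validation minimum: stop
                break
        return best_model, best_mse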
See e.g. the following article Neural Network: Train-validate-Test Stopping for details, as well as this SE-statistics thread discussing two different validation methods.
For the training/validation of Deep Autoencoders/Deep Learning, specifically w.r.t. overfitting, I find the article Dropout: A Simple Way to Prevent Neural Networks from Overfitting (*) to be valuable.
(*) By N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, University of Toronto.

Resources