This is the problem I need to address. Unfortunately, the only technique I have studied for estimating the parameters of a linear regression is the classic gradient descent algorithm. Is that a "batch" or a "sequential" method, and what is the difference between the two?
I wasn't expecting to find the exact question from my ML exam here! The point is that, as James Phillips says, gradient descent is an iterative method, hence "sequential". Gradient descent is just an iterative optimization algorithm for finding the minimum of a function, but you can use it to find the best-fitting line. A fully "batch" approach would be, for example, linear least squares, which uses all the equations at once: you find the parameters by taking the partial derivatives of the sum of squared errors with respect to the line's parameters and setting them to zero. Of course, as Phillips said, it is not always a convenient method in practice; it is more of a theoretical definition. Hope this is useful.
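To make the contrast concrete, here is a minimal sketch (my own illustration, not from the original answer) that fits the same 1-D line both ways, assuming a squared-error criterion:

```python
# Sketch: "batch" closed-form least squares vs. "sequential" per-example updates
# for y ~ w*x + b. Function names are my own.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

def batch_least_squares(x, y):
    """Batch mode: solve the normal equations once, using all data."""
    X = np.column_stack([x, np.ones_like(x)])     # design matrix [x, 1]
    w, b = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares solve
    return w, b

def sequential_gd(x, y, lr=0.005, epochs=50):
    """Sequential mode: stochastic gradient descent, one example at a time."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            err = yi - (w * xi + b)               # residual for this example
            w += lr * err * xi                    # gradient step on w
            b += lr * err                         # gradient step on b
    return w, b

print(batch_least_squares(x, y))  # close to (3.0, 2.0)
print(sequential_gd(x, y))        # also close, after enough passes
```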
From Liang et al. "A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks":
Batch learning is usually a time consuming affair as it may involve many iterations through the training data. In most applications, this may take several minutes to several hours and further the learning parameters (i.e., learning rate, number of learning epochs, stopping criteria, and other predefined parameters) must be properly chosen to ensure convergence. Also, whenever a new data is received batch learning uses the past data together with the new data and performs a retraining, thus consuming a lot of time. There are many industrial applications where online sequential learning algorithms are preferred over batch learning algorithms as sequential learning algorithms do not require retraining whenever a new data is received. The back-propagation (BP) algorithm and its variants have been the backbone for training SLFNs with additive hidden nodes. It is to be noted that BP is basically a batch learning algorithm. Stochastic gradient descent BP (SGBP) is one of the main variants of BP for sequential learning applications.
Basically, gradient descent is formulated in terms of the full-batch gradient in theory, but in practice you apply it iteratively, often on subsets of the data.
I think the question doesn't ask you to show two ways (batch and sequential) of estimating the model parameters, but rather to explain, in either batch or sequential mode, how such an estimation would work.
For instance, if you are trying to estimate the parameters of a linear regression model, you could simply describe likelihood maximization, which (under Gaussian noise) is equivalent to minimizing the squared error:
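Spelled out (this is the standard textbook derivation, assuming i.i.d. Gaussian noise, so that y_i = w^T x_i + eps_i with eps_i ~ N(0, sigma^2)):

```latex
% Log-likelihood of the data under the Gaussian noise assumption
\[
\log L(w) = -\frac{n}{2}\log\!\left(2\pi\sigma^{2}\right)
            - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\bigl(y_i - w^{\top}x_i\bigr)^{2}
\]
```

Maximizing this over w is therefore the same as minimizing the sum of squared errors; setting its gradient to zero gives the normal equations X^T X w = X^T y, which is the batch, closed-form solution.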
If you want to show a sequential mode, you can describe the gradient descent algorithm.
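For the sequential version, the per-example update under squared-error loss (with learning rate eta, my notation) is simply:

```latex
\[
w \;\leftarrow\; w + \eta\,\bigl(y_i - w^{\top}x_i\bigr)\,x_i
\]
```

applied repeatedly while cycling through the examples.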
I understand the three types of gradient descent, but my problem is that I cannot tell which type I should use for my model. I have read a lot, but I still don't get it.
There is no code; it is just a conceptual question.
Types of Gradient Descent:
Batch Gradient Descent: processes all of the training examples for each iteration of gradient descent. This is computationally expensive when the number of training examples is large, so it is usually not preferred in that case.
Stochastic Gradient Descent: processes one training example per iteration, so the parameters are updated after every single example. Each update is much cheaper than a batch update, but when the number of training examples is large, the number of iterations (and the associated overhead) grows accordingly.
Mini-Batch Gradient Descent: the most widely used variant, offering a good trade-off between speed and accuracy. Rather than using the complete data set, each iteration uses a set of m training examples (a "mini-batch") to compute the gradient of the cost function. Common mini-batch sizes range between 50 and 256, but they can vary across applications. (A short sketch contrasting the three variants follows below.)
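Here is a rough sketch (my own, in NumPy, for a linear model with mean-squared-error loss) showing that the three variants differ only in how many examples feed each gradient step:

```python
# Illustration of batch / stochastic / mini-batch gradient descent.
import numpy as np

def gradient(w, X, y):
    """Gradient of the mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train(X, y, variant="mini", batch_size=32, lr=0.01, epochs=100):
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        if variant == "batch":                        # all examples per step
            w -= lr * gradient(w, X, y)
        elif variant == "stochastic":                 # one example per step
            for i in rng.permutation(len(y)):
                w -= lr * gradient(w, X[i:i+1], y[i:i+1])
        else:                                         # mini-batch of m examples
            idx = rng.permutation(len(y))
            for start in range(0, len(y), batch_size):
                b = idx[start:start + batch_size]
                w -= lr * gradient(w, X[b], y[b])
    return w

X = np.random.default_rng(1).normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(train(X, y, variant="mini"))   # all three variants approach [1, -2, 0.5]
```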
There are also various optimization algorithms beyond the plain gradient descent variants, such as Adam, RMSprop, etc.
Which optimizer should we use?
The question is which optimizer to choose for our neural network model so that it converges quickly, learns properly, and tunes the internal parameters to minimize the loss function.
Adam works well in practice and often outperforms the other adaptive techniques.
If your input data is sparse, methods such as SGD, NAG, and momentum perform poorly; for sparse data sets you should use one of the adaptive learning-rate methods. An additional benefit is that you won't need to tune the learning rate much and will likely achieve good results with the default value.
If you want fast convergence when training a deep or highly complex neural network, then Adam or another adaptive learning-rate technique is a good choice, because these methods generally outperform the non-adaptive alternatives in such settings.
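Purely as an illustration (the model here is a throwaway two-layer network, not anything specific to your problem), switching optimizers in Keras is a one-line change, so it is cheap to try Adam against plain SGD with momentum:

```python
# Comparing optimizers in Keras; layer sizes and input shape are arbitrary,
# and default hyper-parameters are kept on purpose.
import tensorflow as tf

def build_model(optimizer):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=optimizer, loss="mse")
    return model

model_adam = build_model(tf.keras.optimizers.Adam())                    # adaptive, good default
model_sgd  = build_model(tf.keras.optimizers.SGD(0.01, momentum=0.9))   # classic SGD + momentum
model_rms  = build_model(tf.keras.optimizers.RMSprop())                 # another adaptive method
```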
I hope this helps you decide which one to use for your model.
I was using a CNN in Keras to classify the MNIST dataset and found that different batch sizes gave different accuracies. Why is that?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although the difference is very small, why is there any difference at all?
EDIT: I have found that the difference is only due to precision issues; the accuracies are in fact equal.
That is because of the effect of mini-batch gradient descent during the training process. You can find a good explanation here; I quote some notes from that link below:
Batch size is a slider on the learning process. Small values give a learning process that converges quickly at the cost of noise in the training process. Large values give a learning process that converges slowly with accurate estimates of the error gradient.
Another important note from that link is:
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller.
This is the conclusion reported in that paper.
EDIT
I should mention two more points here:
Because of the inherent randomness in machine learning algorithms, you should generally not expect them (deep learning algorithms in particular) to produce the same results on different runs. You can find more details here.
On the other hand, your two results are extremely close (and, as you found, actually equal), so based on the reported numbers we can say that the batch size had no meaningful effect on your network's results in this case.
This is not specific to Keras. The batch size and the learning rate are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD); they strongly affect the learning dynamics and therefore the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them in the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. This estimate is noisy, which helps regularize the model, and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
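To tie this back to the question above, here is a hedged sketch of the kind of comparison involved; the model and epoch count are placeholders, not the asker's exact setup, and only the batch_size argument changes between the two runs:

```python
# Two identical MNIST runs that differ only in batch_size passed to fit().
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def run(batch_size):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=3, batch_size=batch_size, verbose=0)
    return model.evaluate(x_test, y_test, verbose=0)[1]   # test accuracy

print(run(batch_size=10))    # noisier gradient estimates, many updates per epoch
print(run(batch_size=1000))  # smoother estimates, far fewer updates per epoch
```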
I want to add two points:
1) With special treatments, it is possible to achieve similar performance with a very large batch size while speeding up the training process tremendously. For example:
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
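As I understand it, the core recipe of that paper is to scale the learning rate linearly with the batch size and to ramp it up over a short warmup period. A rough sketch of that schedule (the reference values mirror the paper's ImageNet setup and are illustrative only):

```python
# Linear scaling rule with gradual warmup, as a plain learning-rate schedule.
def scaled_learning_rate(batch_size, epoch, step_in_epoch, steps_per_epoch,
                         base_lr=0.1, base_batch=256, warmup_epochs=5):
    target_lr = base_lr * batch_size / base_batch       # linear scaling rule
    if epoch < warmup_epochs:                            # gradual warmup phase
        progress = (epoch * steps_per_epoch + step_in_epoch) / (
            warmup_epochs * steps_per_epoch)
        return target_lr * progress
    return target_lr
```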
2) Regarding your MNIST example, I really wouldn't read too much into these numbers, because the difference is so subtle that it could be caused by noise. I bet that if you compare models saved at different epochs, you will see different results.
I am working on an article focused on a simple problem: linear regression over a large data set in the presence of standard normal or uniform noise. I chose TensorFlow's Estimator API as the modeling framework.
I am finding that hyperparameter tuning is, in fact, of little importance for such a machine learning problem when the number of training steps can be made sufficiently large. By hyperparameter I mean the batch size or the number of epochs in the training data stream.
Is there any paper/article with formal proof of this?
I don't think there is a paper specifically focused on this question, because it's a more or less fundamental fact. The introductory chapter of this book discusses the probabilistic interpretation of machine learning in general and loss function optimization in particular.
In short, the idea is this: a mini-batch optimization step w.r.t. the inputs (x1, ..., xn) is essentially equivalent to consecutive optimization steps w.r.t. x1, ..., xn individually, because the gradient is a linear operator. This means that the mini-batch update equals the sum of the individual per-example updates (evaluated at the same parameters). An important note here: I assume the network doesn't apply batch norm or any other layer that makes the forward pass depend on the composition of the batch (in that case the math is a bit hairier).
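Written out (notation mine), the linearity argument is just:

```latex
\[
\nabla_{\theta}\,\sum_{i=1}^{n} L_i(\theta) \;=\; \sum_{i=1}^{n} \nabla_{\theta} L_i(\theta)
\]
```

so the mini-batch gradient evaluated at fixed parameters is the sum (or the mean, which only rescales the learning rate) of the per-example gradients.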
So the batch size can be seen as a purely computational device that speeds up the optimization through vectorization and parallel computing. Assuming that one can afford arbitrarily long training and the data are properly shuffled, the batch size can be set to almost any value. But this isn't automatically true for all hyperparameters; for example, a very high learning rate can easily make the optimization diverge, so don't conclude that hyperparameter tuning is unimportant in general.
In courses there is nothing about epochs, but in practice they are used everywhere.
Why do we need them if the optimizer finds the best weights in one pass? Why does the model keep improving?
Generally, whenever you want to optimize, you use gradient descent, which has a parameter called the learning rate. In a single pass you cannot guarantee that gradient descent will converge to a local minimum with the specified learning rate, which is why you keep iterating so that it converges further.
It's also good practice to adjust the learning rate from epoch to epoch, by observing the learning curves, to get better convergence.
Why do we need [to train several epochs] if the optimizer finds the best weight in one pass?
That premise is wrong in most cases. Gradient descent methods (see a list of them) do not usually find the optimal parameters (weights) in one pass. In fact, I have never seen a case where the optimal parameters were even reached (except for constructed examples).
One epoch consists of many weight-update steps; it means the optimizer has used every training example once. Why do we need several epochs? Because gradient descent methods are iterative: the solution improves, but only in tiny steps. They can only take tiny steps because they rely purely on local information; they know nothing about the function beyond the current point.
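A minimal sketch of this (my own toy example: linear regression trained with per-example updates) shows the loss still shrinking after the first pass:

```python
# Loss keeps decreasing across epochs because each epoch is just another
# full pass of small gradient steps over the same data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=500)

w, lr = np.zeros(5), 0.01
for epoch in range(5):                        # several passes over the same data
    for i in rng.permutation(len(y)):         # one small step per example
        err = y[i] - X[i] @ w
        w += lr * err * X[i]
    loss = np.mean((y - X @ w) ** 2)
    print(f"epoch {epoch}: loss {loss:.4f}")  # improves across epochs
```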
You might want to read the gradient descent part of my optimization basics blog post.
I want my neural network to be trained on every new data point that it classifies incorrectly. Assuming I somehow obtain the correct label every time the network makes a mistake, how many backpropagation passes do I need to run on this single new instance in order to train my network for that particular case? Is there a better way to train a neural network in real-time scenarios?
It depends on the optimization algorithm you use. Backpropagation by itself calculates only the gradient, which is then used by the next iteration of the algorithm.
In the simplest case you can use a self-implemented gradient descent and monitor the behavior of your cost function. If the cost decreases by less than some threshold epsilon, you can break the optimization loop for the current instance; you can also cap the maximum number of iterations.
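A rough sketch of that stopping logic (all names and thresholds here are placeholders of my own):

```python
# Stop when the cost improves by less than epsilon, or after max_iters
# iterations, whichever comes first.
def optimize(step_fn, cost_fn, params, epsilon=1e-6, max_iters=1000):
    prev_cost = cost_fn(params)
    for _ in range(max_iters):
        params = step_fn(params)               # one gradient descent update
        cost = cost_fn(params)
        if abs(prev_cost - cost) < epsilon:    # converged "enough" for this instance
            break
        prev_cost = cost
    return params
```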
It is also worth using more advanced optimizers, such as fminunc in Matlab, which stop by themselves when an optimum is reached.
You may find this post about different termination conditions of gradient descent very useful.
I think learning from one single instance at a time is not really efficient: the cost function can behave erratically. Consider a batch-style approach instead, where you learn from small batches of new instances; it should give a smoother, more stable learning process.
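One hedged way to realize this with Keras (assuming `model` is an already-compiled classifier; the buffer size and number of updates are arbitrary choices of mine) is to accumulate corrected examples and update on the whole buffer with train_on_batch:

```python
# Buffer misclassified-but-corrected examples and fine-tune on small batches
# instead of hammering one instance with many backprop passes.
import numpy as np

buffer_x, buffer_y = [], []

def on_mistake(model, x, correct_label, batch_threshold=32, updates=3):
    buffer_x.append(x)
    buffer_y.append(correct_label)
    if len(buffer_x) >= batch_threshold:      # wait until a small batch accumulates
        xb = np.stack(buffer_x)
        yb = np.array(buffer_y)
        for _ in range(updates):              # a few gradient steps, not hundreds
            model.train_on_batch(xb, yb)
        buffer_x.clear()
        buffer_y.clear()
```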
To illustrate how the network's accuracy depends on the number of iterations and on the batch size, I experimented a bit with a neural network for recognizing handwritten digits. I had 4000 examples in the training set and 1000 examples in the validation set. I then ran the learning algorithm with different parameters and measured the resulting accuracy. You can see the result here:
Of course this plot describes only my particular case, but you can get some intuition on what to expect and on how to validate network parameters.