Is tuning batch size or epochs necessary for linear regression with TensorFlow? - machine-learning

I am working on an article where I focus on a simple problem – linear regression over a large data set in the presence of standard normal or uniform noise. I chose Estimator API from TensorFlow as the modeling framework.
I am finding that, hyperparameter tuning is, in fact, of little importance for such a machine learning problem when the number of training steps can be made sufficiently large. By hyperparameter I mean batch size or number of epochs in the training data stream.
Is there any paper/article with formal proof of this?

I don't think there is a paper specifically focused on this question, because it's a more or less fundamental fact. The introductory chapter of this book discusses the probabilistic interpretation of machine learning in general and loss function optimization in particular.
In short, the idea is this: mini-batch optimization wrt (x1,..., xn) is equivalent to consecutive optimization steps wrt x1, ..., xn inputs, because the gradient is a linear operator. This means that mini-batch update equals to the sum of its individual updates. Important note here: I assume that NN doesn't apply batch-norm or any other layer that adds an explicit variation to the inference model (in this case the math is a bit more hairy).
So the batch size can be seen as a pure computational idea that speeds up the optimization through vectorization and parallel computing. Assuming that one can afford arbitrarily long training and the data are properly shuffled, the batch size can be set to any value. But it isn't automatically true for all hyperparameters, for example very high learning rate can easily force the optimization to diverge, so don't make a mistake thinking hyperparamer tuning isn't important in general.

Related

How I can know which type of gradient descent to use?

I understand the three types of gradient descent, but my problem is I can not know which type I must use on my model. I have read a lot, but I did not get it.
No code, it is just question.
Types of Gradient Descent:
Batch Gradient Descent: It processes all the training examples for each iteration of gradient descent. But, this method is computationally expensive when the number of training examples is large and usually not preferred.
Stochastic Gradient Descent: It processes one training example in each iteration. Here, the parameters are being updated after each iteration. This method is faster than batch gradient descent method. But, it increases system overhead when number of training examples is large by increasing the number of iterations.
Mini Batch gradient descent: Mini batch algorithm is the most favorable and widely used algorithm that makes precise and faster results using a batch of m training examples. In mini batch algorithm rather than using the complete data set, in every iteration we use a set of m training examples called batch to compute the gradient of the cost function. Common mini-batch sizes range between 50 and 256, but can vary for different applications.
There are various other optimization algorithms apart from gradient descent variants, like adam, rmsprop, etc.
Which optimizer should we use?
The question was to choose the best optimizer for our Neural Network Model in order to converge fast and to learn properly and tune the internal parameters so as to minimize the Loss function.
Adam works well in practice and outperforms other Adaptive techniques.
If your input data is sparse then methods such as SGD, NAG and momentum are inferior and perform poorly. For sparse data sets one should use one of the adaptive learning-rate methods. An additional benefit is that we won't need to adjust the learning rate but likely achieve the best results with the default value.
If one wants fast convergence and train a deep neural network Model or a highly complex Neural Network, then Adam or any other Adaptive learning rate techniques should be used because they outperforms every other optimization algorithms.
I hope this would help you to decide which one to use for your model.

Why does different batch-sizes give different accuracy in Keras?

I was using Keras' CNN to classify MNIST dataset. I found that using different batch-sizes gave different accuracies. Why is it so?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although, the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the Mini-batch gradient descent effect during training process. You can find good explanation Here that I mention some notes from that link here:
Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the
cost of noise in the training process.
Large values give a learning
process that converges slowly with accurate estimates of the error
gradient.
and also one important note from that link is :
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a
given computational cost, across a wide range of experiments. In all
cases the best results have been obtained with batch sizes m = 32 or
smaller
Which is the result of this paper.
EDIT
I should mention two more points Here:
because of the inherent randomness in machine learning algorithms concept, generally you should not expect machine learning algorithms (like Deep learning algorithms) to have same results on different runs. You can find more details Here.
On the other hand both of your results are too close and somehow they are equal. So in your case we can say that the batch size has no effect on your network results based on the reported results.
This is not connected to Keras. The batch size, together with the learning rate, are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD), which entirely affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them towards the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimation, which helps regularize the model and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
I want to add two points:
1) When use special treatments, it is possible to achieve similar performance for a very large batch size while speeding-up the training process tremendously. For example,
Accurate, Large Minibatch SGD:Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest you to over-read these numbers. Because the difference is so subtle that it could be caused by noise. I bet if you try models saved on a different epoch, you will see a different result.

Sequential or batch parameters estimation

This is the problem that I should describe. Unfortunately the only one technique that I studied to estimate the parameters in the linear regression is the classic gradient descent algorithm. Is that one of "batch" or "sequential" mode ? And what is the difference between them ?
I wasn't expecting to find exactly the question from the ML exam here! Well the point is that as James Phillips says the gradient descent is an iterative method, so called sequential. The gradient descent is just an iterative optimization algorithm for finding the minimum of a function but you could use it to find the 'best-fitting line'. A complete batch way will be e.g. the Linear Least Squares method applying all the equations at once. You can find all the parameters calculating the partial derivatives of the sum of the square of the errors w.r.t. the best line fit and setting them to zero. Of course, as Phillips said it is not a convenient method, it's more a theoretical definition. Hope, it is useful.
From Liang et al. "A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks":
Batch learning is usually a time consuming affair as it may involve many iterations through the training data. In most applications, this may take several minutes to several hours and further the learning parameters (i.e., learning rate, number of learning epochs, stopping criteria, and other predefined parameters) must be properly chosen to ensure convergence. Also, whenever a new data is received batch learning uses the past data together with the new data and performs a retraining, thus consuming a lot of time. There are many industrial applications where online sequential learning algorithms are preferred over batch learning algorithms as sequential learning algorithms do not require retraining whenever a new data is received. The back-propagation (BP) algorithm and its variants have been the backbone for training SLFNs with additive hidden nodes. It is to be noted that BP is basically a batch learning algorithm. Stochastic gradient descent BP (SGBP) is one of the main variants of BP for sequential learning applications.
Basically, gradient descent is theorized in a batch way, but in practice you use iterative methods.
I think the question doesn't ask you to show two ways (batch and sequential) to estimate the parameters of the model, but instead to explain—either in a batch or sequential mode—how such an estimation would work.
For instance, if you are trying to estimate parameters for a linear regression model, you could just describe likelihood maximization, which is equivalent to minimize the least square error:
If you want to show a sequential mode, you can describe the gradient descent algorithm.

Is there any classifier which is able to make decisions very fast?

Most classification algorithms are developed to improve the training speed. However, is there any classifier or algorithm focusing on the decision making speed(low computation complexity and simple realizable structure)? I can get enough training data,and endure the long training time.
There are many methods which classify fast, you could more or less sort models by classification speed in a following way (first ones - the fastest, last- slowest)
Decision Tree (especially with limited depth)
Linear models (linear regression, logistic regression, linear svm, lda, ...) and Naive Bayes
Non-linear models based on explicit data transformation (Nystroem kernel approximation, RVFL, RBFNN, EEM), Kernel methods (such as kernel SVM) and shallow neural networks
Random Forest and other committees
Big Neural Networks (ie. CNN)
KNN with arbitrary distance
Obviously this list is not exhaustive, it just shows some general ideas.
One way of obtaining such model is to build a complex, slow model, then use it as a black box label generator to train a simplier model (but on potentialy infinite training set) - thus getting a fast classifier at the cost of very expensive training. There are many works showing that one can do that for example by training a shallow neural network on outputs of deep nn.
In general classification speed should not be a problem. Some exceptions are algorithms which have a time complexity depending on the number of samples you have for training. One example is k-Nearest-Neighbors which has no training time, but for classification it needs to check all points (if implemented in a naive way). Other examples are all classifiers which work with kernels since they compute the kernel between the current sample and all training samples.
Many classifiers work with a scalar product of the features and a learned coefficient vector. These should be fast enough in almost all cases. Examples are: Logistic regression, linear SVM, perceptrons and many more. See #lejlot's answer for a nice list.
If these are still too slow you might try to reduce the dimension of your feature space first and then try again (this also speeds up training time).
Btw, this question might not be suited for StackOverflow as it is quite broad and recommendation instead of problem oriented. Maybe try https://stats.stackexchange.com/ next time.
I have a decision tree which is represented in the compressed form and which is at least 4 times faster than the actual tree in classifying an unseen instance.

Gradient boosting predictions in low-latency production environments?

Can anyone recommend a strategy for making predictions using a gradient boosting model in the <10-15ms range (the faster the better)?
I have been using R's gbm package, but the first prediction takes ~50ms (subsequent vectorized predictions average to 1ms, so there appears to be overhead, perhaps in the call to the C++ library). As a guideline, there will be ~10-50 inputs and ~50-500 trees. The task is classification and I need access to predicted probabilities.
I know there are a lot of libraries out there, but I've had little luck finding information even on rough prediction times for them. The training will happen offline, so only predictions need to be fast -- also, predictions may come from a piece of code / library that is completely separate from whatever does the training (as long as there is a common format for representing the trees).
I'm the author of the scikit-learn gradient boosting module, a Gradient Boosted Regression Trees implementation in Python. I put some effort in optimizing prediction time since the method was targeted at low-latency environments (in particular ranking problems); the prediction routine is written in C, still there is some overhead due to Python function calls. Having said that: prediction time for single data points with ~50 features and about 250 trees should be << 1ms.
In my use-cases prediction time is often governed by the cost of feature extraction. I strongly recommend profiling to pin-point the source of the overhead (if you use Python, I can recommend line_profiler).
If the source of the overhead is prediction rather than feature extraction you might check whether its possible to do batch predictions instead of predicting single data points thus limiting the overhead due to the Python function call (e.g. in ranking you often need to score the top-K documents, so you can do the feature extraction first and then run predict on the K x n_features matrix.
If this doesn't help either you should try the limit the number of trees because the runtime cost for prediction is basically linear in the number of trees.
There are a number of ways to limit the number of trees without affecting the model accuracy:
Proper tuning of the learning rate; the smaller the learning rate, the more trees are needed and thus the slower is prediction.
Post-process GBM with L1 regularization (Lasso); See Elements of Statistical Learning Section 16.3.1 - use predictions of each tree as new features and run the representation through a L1 regularized linear model - remove those trees that don't get any weight.
Fully-corrective weight updates; instead of doing the line-search/weight update just for the most recent tree, update all trees (see [Warmuth2006] and [Johnson2012]). Better convergence - fewer trees.
If none of the above does the trick you could investigate cascades or early-exit strategies (see [Chen2012])
References:
[Warmuth2006] M. Warmuth, J. Liao, and G. Ratsch. Totally corrective boosting algorithms that maximize the margin. In Proceedings of the 23rd international conference on Machine learning, 2006.
[Johnson2012] Rie Johnson, Tong Zhang, Learning Nonlinear Functions Using Regularized Greedy Forest, arxiv, 2012.
[Chen2012] Minmin Chen, Zhixiang Xu, Kilian Weinberger, Olivier Chapelle, Dor Kedem, Classifier Cascade for Minimizing Feature Evaluation Cost, JMLR W&CP 22: 218-226, 2012.

Resources