SGD mini batches - all of the same size? - machine-learning

Stochastic gradient descent algorithms with mini-batches usually use the mini-batch size or count as a parameter.
Now, what I'm wondering is: do all of the mini-batches need to be of exactly the same size?
Take, for example, the training data from MNIST (60k training images) and a mini-batch size of 70.
If we go through it in a simple loop, that produces 857 mini-batches of size 70 (as specified) and one mini-batch of size 10.
Now, does it even matter that (using this approach) one mini-batch will be smaller than the others (worst-case scenario here: a mini-batch of size 1)?
Will this strongly affect the weights and biases that our network has learned over almost all of its training?

No, mini-batches do not have to be the same size. They are usually kept at a constant size for efficiency reasons (you do not have to reallocate memory or resize tensors). In practice you could even sample the batch size anew in each iteration.
However, the size of the batch does make a difference. It is hard to say which one is best, but using smaller/bigger batch sizes can result in different solutions (and always in different convergence speed). This is an effect of more stochastic motion (small batches) versus smoother updates (better gradient estimates from large batches). In particular, drawing the batch size from some predefined distribution lets you combine both effects, but the time spent fitting that distribution might not be worth it.
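As a concrete illustration of the simple-loop approach from the question, here is a minimal NumPy sketch (the minibatches helper is hypothetical): with 60,000 examples and a batch size of 70 it yields 857 full batches plus one leftover batch of 10, which you can keep, drop, or merge as you prefer.

```python
import numpy as np

def minibatches(data, batch_size=70, rng=None):
    """Yield mini-batches in order; the last one may be smaller than batch_size."""
    idx = np.arange(len(data))
    if rng is not None:
        rng.shuffle(idx)                      # reshuffle once per epoch
    for start in range(0, len(data), batch_size):
        yield data[idx[start:start + batch_size]]

data = np.arange(60_000)                      # stand-in for the 60k MNIST examples
sizes = [len(b) for b in minibatches(data, batch_size=70)]
print(sizes.count(70), sizes[-1])             # -> 857 10
```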

Related

optimal batch size for image classification using deep learning

I have a broad question, but it should still be relevant.
Let's say I am doing 2-class image classification using a CNN. A batch size of 32-64 should be sufficient for training purposes. However, if I had data with about 13 classes, surely a batch size of 32 would not be sufficient for a good model, as each batch might only get 2-3 images of each class. Is there a generic or approximate formula to determine the batch size for training? Or should it be determined as a hyperparameter using techniques like grid search or Bayesian methods?
Batch size is a hyperparameter, like e.g. the learning rate. It is really hard to say what the perfect size is for your problem.
The problem you are mentioning might exist, but it is only really relevant in specific problems where you can't just do random sampling, like face/person re-identification.
For "normal" problems random sampling is sufficient. The reason behind mini-batch training is to get more stable training. You want your weight updates to go in the right direction with regard to the global minimum of the loss function for the whole dataset. A mini-batch is an approximation of this.
With an increasing batch size you get fewer updates, but "better" updates. With a small batch size you get more updates, but they will more often go in the wrong direction. If the batch size is too small (e.g. 1) the network might take a long time to converge and thus increase the training time. Too large a batch size can hurt the generalization of the network. A good paper about the topic is On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.
Another interesting paper on the topic is Don't Decay the Learning Rate, Increase the Batch Size, which analyzes the effect of batch size on training. In general, learning rate and batch size have effects on each other.
In general, batch size is more a factor for reducing training time, because you can exploit parallelism and have fewer weight updates with an increasing batch size, as well as more stability. As with everything, look at what others did for a task comparable to your problem, take it as a baseline, and experiment with it a little. Also, with huge networks the available memory often limits the maximum batch size anyway.
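To make the "treat batch size as a hyperparameter" advice concrete, here is a rough sketch of a simple grid search over candidate batch sizes; the synthetic data and tiny logistic-regression model are purely illustrative stand-ins for your own training pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 2-class problem: 1000 training / 200 validation samples, 20 features
X = rng.normal(size=(1200, 20))
y = (X @ rng.normal(size=20) > 0).astype(float)
X_tr, y_tr, X_va, y_va = X[:1000], y[:1000], X[1000:], y[1000:]

def train_and_validate(batch_size, epochs=30, lr=0.1):
    w = np.zeros(20)
    for _ in range(epochs):
        order = rng.permutation(len(X_tr))
        for s in range(0, len(X_tr), batch_size):
            b = order[s:s + batch_size]
            p = 1.0 / (1.0 + np.exp(-X_tr[b] @ w))          # sigmoid predictions
            w -= lr * X_tr[b].T @ (p - y_tr[b]) / len(b)    # mini-batch gradient step
    p_va = 1.0 / (1.0 + np.exp(-X_va @ w))
    return ((p_va > 0.5) == (y_va > 0.5)).mean()            # validation accuracy

for bs in (8, 32, 64, 256):                                  # the grid over batch sizes
    print(bs, train_and_validate(bs))
```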

TensorFlow for image recognition, size of images

How can the size of an image affect training the model for this task?
My current training set holds images that are 2880 x 1800, but I am worried this may be too large to train on. In total my sample size will be about 200-500 images.
Would this just mean that I need more resources (GPU, RAM, distribution) when training my model?
If this is too large, how should I go about resizing? I want to mimic real-world photo resolutions as closely as possible for better accuracy.
Edit:
I would also be using the TFRecord format for the image files.
Your memory and processing requirements will be proportional to the pixel size of your image. Whether this is too large for you to process efficiently will depend on your hardware constraints and the time you have available.
With regard to resizing the images there is no one answer; you have to consider how best to preserve the information your algorithm will need to learn from your data while removing information that won't be useful. Reducing the size of your input images won't necessarily be a negative for accuracy. Consider two cases:
Handwritten digits
Here the images can be reduced considerably in size while maintaining all the structural information necessary for correct identification. Have a look at the MNIST data set: these images are distributed at 28 x 28 resolution and are identifiable to 99.7%+ accuracy.
Identifying Tree Species
Imagine a set of images of trees where individual leaves could help identify the species. Here you might find that reducing the image size removes small-scale detail of leaf shape in a way that's detrimental to the model, but that you get a similar result with a tight crop (which preserves individual leaves) rather than a resize. If that is the case, creating multiple crops from the same image gives you an augmented data set for training that can considerably improve results (something to consider, if possible, given that your training set is very small).
Deep learning models are achieving results around human level on many image classification tasks: if you struggle to identify your own images, then it's less likely you'll be able to train an algorithm to do so. This is often a useful starting point when considering how much scaling might be appropriate.
If you are using GPUs to train, this will definitely affect your training time. TensorFlow does most of the GPU allocation, so you don't have to worry about that. But with big photos you will experience long training times even though your dataset is small. You should consider data augmentation.
You could complement your resizing with data augmentation: resize to equal dimensions and then apply reflections and translations (i.e. geometric transformations).
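A rough TensorFlow sketch of that resize-plus-augmentation pipeline (the 300x300 target, the "images/*.jpg" file pattern and the specific augmentations are illustrative assumptions, not recommendations):

```python
import tensorflow as tf

def load_and_augment(path, target_size=(300, 300)):
    """Decode a JPEG, resize to a fixed shape, then apply simple geometric augmentation."""
    image = tf.image.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, target_size)               # equal dimensions for every sample
    image = tf.image.random_flip_left_right(image)            # reflection
    # small random translation via pad-then-crop
    image = tf.image.resize_with_crop_or_pad(image, target_size[0] + 20, target_size[1] + 20)
    image = tf.image.random_crop(image, (target_size[0], target_size[1], 3))
    return image / 255.0                                       # scale to [0, 1]

dataset = (tf.data.Dataset.list_files("images/*.jpg")          # hypothetical image folder
           .map(load_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))
```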
If your images are too big, your GPU might run out of memory before it can even start training, because it has to store the convolution outputs in its memory. If that happens, you can do some of the following things to reduce memory consumption:
resize the image
reduce batch size
reduce model complexity
To resize your images, there are many scripts just one Google search away, but I will add that in your case 1440 x 900 is probably a sweet spot.
Higher resolution images will result in longer training times and increased memory consumption (mainly GPU memory).
Depending on your concrete task, you might want to reduce the image size in order to fit a reasonable batch size of, say, 32 or 64 on the GPU, for stable learning.
Your accuracy is probably affected more by the size of your training set, so instead of going for image size, you might want to go for 500-1000 sample images. Recent publications like SSD - Single Shot MultiBox Detector achieve high accuracy values, such as an mAP of 72% on the PascalVOC dataset, while "only" using a 300x300 image resolution.
Resizing and augmentation: SSD, for instance, just scales every input image down to 300x300, independent of the aspect ratio, and this does not seem to hurt. You could also augment your data by mirroring, translating, etc. (I assume there are built-in methods in TensorFlow for that).

Should batch size matter at inference

I am training a model:
5-layer very narrow CNN,
followed by a 5-layer highway,
then fully connected and
softmax over 7 classes.
As there are 7 equally distributed classes, the random-guess classification accuracy would be 14% (1/7 is roughly 14%).
Actual accuracy is 40%, so the net does learn somewhat.
Now the weird thing is that it only learns with a batch size of 2. Batch sizes of 16, 32 or 64 don't learn at all.
Now the even weirder thing: if I take the checkpoint of the trained net (accuracy 40%, trained at batch size 2) and start it with a batch size of 32, I should keep getting my 40%, at least for the first couple of steps, right? I do when I restart at batch size 2. But with batch size 32 the initial accuracy is, guess what, 14%.
Any idea why the batch size would ruin inference? I fear I might have a shape error somewhere, but I cannot find anything.
Thanks for your thoughts.
There are two possible modes of work for a batch normalization layer at inference time:
Option 1: compute the activation mean and variance over the given inference batch
Option 2: use the running average mean and variance accumulated from the training batches
In PyTorch, for example, the track_running_stats parameter of the BatchNorm2d layer is set to True by default; in other words, Option 2 is the default.
If you choose Option 1 then, of course, the size of the inference batch and the characteristics of each sample in it will affect the outputs for the other samples.
So γ and β are learned during training and used at inference as-is, and if you do not change the default behavior, the "same" is true for E[x] and Var[x]. I put "same" in quotes on purpose, as these are just batch statistics accumulated during training.
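A small PyTorch sketch of the two modes (purely illustrative): in train mode the layer normalizes with the statistics of the current batch, so a sample's output depends on what else is in the batch; in eval mode it uses the running statistics and the batch size no longer matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3, track_running_stats=True)
x = torch.randn(32, 3, 8, 8)                        # a batch of 32 samples

bn.train()                                           # Option 1: batch statistics
out_in_batch_of_32 = bn(x)[:2]
out_in_batch_of_2 = bn(x[:2])
print(torch.allclose(out_in_batch_of_32, out_in_batch_of_2))   # False: batch mates matter

bn.eval()                                            # Option 2: running statistics
out_in_batch_of_32 = bn(x)[:2]
out_in_batch_of_2 = bn(x[:2])
print(torch.allclose(out_in_batch_of_32, out_in_batch_of_2))   # True: batch size is irrelevant
```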
Since we're already talking about batch size, I'd mention that it may sound tempting to use very large batch sizes in training, to get more accurate statistics and a better approximation of the loss function for each SGD step. Yet approximating the loss function too well has drawbacks, such as overfitting.
You should look at the accuracy when your model converges, not while it is still training. It's hard to compare the effects of different batch sizes during training, because they can get "lucky" and follow a good gradient path. In general, a smaller batch size tends to be noisier and can give you good peaks and bad drops in accuracy.
It's hard to tell without looking at the code, but I think that large batch sizes cause the gradient to be too large and the training cannot converge. One way to fight this effect would be to increase the batch size but decrease the learning rate. You can also try to clip the gradient magnitude.
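For reference, a minimal PyTorch sketch of clipping the gradient magnitude before the update step; the model, dummy data and max_norm threshold here are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 7)                        # placeholder model: 10 features -> 7 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 10)                         # dummy batch
y = torch.randint(0, 7, (32,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
# cap the global gradient norm before applying the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```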

Caffe: What can I do if only a small batch fits into memory?

I am trying to train a very large model. Therefore, I can only fit a very small batch size into GPU memory. Working with small batch sizes results in very noisy gradient estimates.
What can I do to avoid this problem?
You can change the iter_size in the solver parameters.
Caffe accumulates gradients over iter_size x batch_size instances in each stochastic gradient descent step.
So increasing iter_size can also give you a more stable gradient when you cannot use a large batch_size due to limited memory.
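Outside Caffe the same trick is usually called gradient accumulation; here is a rough PyTorch sketch of the idea (the model, data and accum_steps value are placeholders), which mimics what iter_size does in the solver.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

accum_steps = 4                                    # plays the role of iter_size
small_batches = [(torch.randn(8, 128), torch.randint(0, 10, (8,)))
                 for _ in range(accum_steps)]

optimizer.zero_grad()
for x, y in small_batches:
    loss = criterion(model(x), y) / accum_steps    # scale so the sum matches one big batch
    loss.backward()                                # gradients accumulate in .grad
optimizer.step()                                   # one update over accum_steps x batch_size samples
```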
As stated in this post, the batch size is not a problem in theory (the efficiency of stochastic gradient descent has been proven even with a batch size of 1). Make sure you implement your batches correctly (the samples should be randomly picked over your data).

Will larger batch size make computation time less in machine learning?

I am trying to tune a hyperparameter, namely the batch size, in a CNN. I have a computer with a Core i7 and 12 GB of RAM, and I am training a CNN on the CIFAR-10 dataset, which can be found in this blog. First, here is what I have read and learnt about batch size in machine learning:
Let's first suppose that we're doing online learning, i.e. that we're using a mini-batch size of 1. The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.
Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. As we know, we can use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them. Depending on the details of our hardware and linear algebra library, this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size 100, rather than computing the mini-batch gradient estimate by looping over the 100 training examples separately. It might take (say) only 50 times as long, rather than 100 times as long. Now, at first it seems as though this doesn't help us that much.
With our mini-batch of size 100 the learning rule for the weights looks like
$w \rightarrow w' = w - \frac{\eta}{100} \sum_x \nabla C_x,$
where the sum is over training examples in the mini-batch. This is versus
$w \rightarrow w' = w - \eta \nabla C_x$
for online learning.
Even if it only takes 50 times as long to do the mini-batch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the mini-batch case we increase the learning rate by a factor of 100, so the update rule becomes
$w \rightarrow w' = w - \eta \sum_x \nabla C_x.$
That's a lot like doing 100 separate instances of online learning with a learning rate of η. But it only takes 50 times as long as doing a single instance of online learning. Still, it seems distinctly possible that using the larger mini-batch would speed things up.
Now I tried it with the MNIST digit dataset: I ran a sample program and set the batch size to 1 at first. I noted down the training time needed for the full dataset. Then I increased the batch size and noticed that training became faster.
But when training with this code and github link, changing the batch size doesn't decrease the training time. It remained the same whether I used 30, 128 or 64. They say that they got 92% accuracy, and that after two or three epochs they were above 40% accuracy. But when I ran the code on my computer, without changing anything other than the batch size, I got worse results after 10 epochs: only 28%, and the test accuracy got stuck there in the following epochs. Then I thought, since they used a batch size of 128, I should use that too. But it got even worse, giving only 11% after 10 epochs and getting stuck there. Why is that?
Neural networks learn by doing gradient descent on an error function in weight space, which is parametrized by the training examples. This means the variables are the weights of the neural network. The function is "generic" and becomes specific when you use training examples. The "correct" way would be to use all training examples to construct this specific function. This is called "batch gradient descent" and is usually not done, for two reasons:
It might not fit in your RAM (usually GPU memory, since for neural networks you get a huge boost when you use the GPU).
It is actually not necessary to use all examples.
In machine learning problems, you usually have several thousands of training examples. But the error surface might look similar when you only look at a few (e.g. 64, 128 or 256) of them.
Think of it as a photo: to get an idea of what the photo is about, you usually don't need 2500x1800px resolution. A 256x256px image will give you a good idea of what the photo is about. However, you miss details.
So imagine gradient descent as a walk on the error surface: you start at one point and you want to find the lowest point. To do so, you walk downhill. Then you check your height again, check in which direction it goes down, and make a "step" (whose size is determined by the learning rate and a couple of other factors) in that direction. When you do mini-batch training instead of batch training, you walk down a different error surface: the low-resolution error surface. It might actually go up on the "real" error surface. But overall you will go in the right direction, and you can make single steps much faster!
Now, what happens when you make the resolution lower (the batch size smaller)?
Right, your image of what the error surface looks like gets less accurate. How much this affects you depends on factors like:
Your hardware/implementation
Dataset: How complex is the error surface and how well is it approximated by only a small portion of it?
Learning: How exactly are you learning (momentum? newbob? rprop?)
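The "low-resolution error surface" intuition can also be checked numerically. The sketch below (synthetic data and a simple least-squares loss, purely illustrative) estimates how far a mini-batch gradient deviates from the full-batch gradient for several batch sizes; the deviation shrinks as the batch grows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)
w = np.zeros(20)                                     # evaluate gradients at an arbitrary point

def grad(idx):
    """Mean least-squares gradient over the rows selected by idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

full = grad(np.arange(len(X)))                       # the "high-resolution" gradient
for bs in (1, 8, 64, 512):
    errs = [np.linalg.norm(grad(rng.choice(len(X), bs, replace=False)) - full)
            for _ in range(200)]
    print(bs, np.mean(errs))                          # average deviation from the full gradient
```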
I'd like to add to what has already been said here that a larger batch size is not always good for generalization. I've seen such cases myself, where an increase in batch size hurt validation accuracy, particularly for a CNN working on the CIFAR-10 dataset.
From "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima":
The stochastic gradient descent (SGD) method and its variants are
algorithms of choice for many Deep Learning tasks. These methods
operate in a small-batch regime wherein a fraction of the training
data, say 32–512 data points, is sampled to compute an approximation
to the gradient. It has been observed in practice that when using a
larger batch there is a degradation in the quality of the model, as
measured by its ability to generalize. We investigate the cause for
this generalization drop in the large-batch regime and present
numerical evidence that supports the view that large-batch methods
tend to converge to sharp minimizers of the training and testing
functions—and as is well known, sharp minima lead to poorer
generalization. In contrast, small-batch methods consistently converge
to flat minimizers, and our experiments support a commonly held view
that this is due to the inherent noise in the gradient estimation. We
discuss several strategies to attempt to help large-batch methods
eliminate this generalization gap.
Bottom line: you should tune the batch size, just like any other hyperparameter, to find an optimal value.
The 2018 opinion retweeted by Yann LeCun is the paper Revisiting Small Batch Training for Deep Neural Networks by Dominic Masters and Carlo Luschi, which suggests a good generic maximum batch size is:
32
with some interplay with the choice of learning rate.
The earlier 2016 paper On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima gives some reasons for not using big batches, which I paraphrase badly as: big batches are likely to get stuck in local ("sharp") minima, while small batches are not.
