How does NiftyNet handle multi-GPU training?

I'm using NiftyNet to train a CNN on 2 GPUs. As I understand it, each GPU is trained independently, since I get two loss values per iteration. Are the results of both GPUs combined at inference time?
I used to believe that using multiple GPUs reduces the training time, but with NiftyNet that doesn't seem to be the case.

Yes, correct; it does reduce training time in my case. Note that the batch size effectively doubles when you use multiple GPUs.
For example, if your batch size is 2, then with multiple GPUs each GPU still receives a batch of 2, so your effective batch size becomes 4.
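As a rough illustration of this data-parallel behaviour (generic TensorFlow 2 code, not NiftyNet's own implementation), the effective batch size is the per-GPU batch multiplied by the number of replicas:

import tensorflow as tf

# One replica per visible GPU; this only shows the batch-size arithmetic,
# NiftyNet builds its multi-GPU graph differently.
per_gpu_batch = 2
strategy = tf.distribute.MirroredStrategy()
global_batch = per_gpu_batch * strategy.num_replicas_in_sync
print("effective batch size:", global_batch)  # 4 when two GPUs are visible

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(dataset.batch(global_batch), ...)  # gradients are averaged across replicas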

Related

Does epoch size need to be an exact multiple of batch size?

When training a net, does it matter if the number of samples in the epoch is not an exact multiple of the batch size? My training code doesn't seem to mind if this is the case, though my loss curve is pretty noisy at the moment (in case that is a related issue).
This would be useful to know: if it is not an issue, it saves messing around with the dataset to make its size divisible by the batch size, and it may also waste less of the captured data.
"Does it matter if the number of samples in the epoch is not an exact multiple of the batch size?"
No, it does not. Your number of samples can be, say, 1000, and your batch size can be 400.
You can decide the total number of iterations (where each iteration = sampling a batch and doing gradient descent) based on the overall number of epochs you want to cover. Say you want roughly 5 epochs; then your number of iterations should be about 5 * 1000 / 400 = 12.5, rounded up to 13. So you sample a random batch 13 times to cover roughly 5 epochs.
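A quick way to turn that rule of thumb into code (plain Python, using the numbers from the example above):

import math

n_samples, batch_size, target_epochs = 1000, 400, 5
iterations = math.ceil(target_epochs * n_samples / batch_size)  # ceil(12.5) = 13
print(iterations)  # sample 13 random batches to cover roughly 5 epochs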
In the context of Convolutional Neural Networks (CNNs), batch size is the number of examples that are fed to the algorithm at a time. This is normally some small power of 2, like 32, 64, 128, etc. During training, the optimization algorithm computes the average cost over a batch and then runs backpropagation to update the weights. In a single epoch the algorithm runs $n_{\text{batches}} = n_{\text{examples}} / \text{batch size}$ times. Generally the algorithm needs to train for several epochs to achieve convergence of the weight values. Every batch is normally sampled randomly from the whole example set.
The idea is this: a mini-batch optimization step with respect to (x1, ..., xn) combines the individual updates with respect to x1, ..., xn, because the gradient is a linear operator. This means the mini-batch update equals the sum (or average) of its individual updates, all computed at the same weights. Important note here: I assume the network doesn't apply batch norm or any other layer that adds an explicit variation to the inference model (in that case the math is a bit more hairy).
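Written out, linearity of the gradient for a loss averaged over a mini-batch of size $n$ gives

$$\nabla_w \left[ \frac{1}{n}\sum_{i=1}^{n} L(x_i; w) \right] = \frac{1}{n}\sum_{i=1}^{n} \nabla_w L(x_i; w),$$

so a single SGD step on the mini-batch applies the average of the per-example gradients, all evaluated at the same weights $w$.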
So the batch size can be seen as a purely computational idea that speeds up the optimization through vectorization and parallel computing. Assuming that one can afford arbitrarily long training and the data are properly shuffled, the batch size can be set to any value. But the same isn't automatically true for all hyperparameters; for example, a very high learning rate can easily force the optimization to diverge, so don't conclude that hyperparameter tuning is unimportant in general.

Effect of Data Parallelism on Training Result

I'm currently trying to implement multi-GPU training for a TensorFlow network. One solution would be to run one model per GPU, each with its own data batch, and combine their weights after each training iteration. In other words, "data parallelism".
So, for example, if I use 2 GPUs, train with them in parallel, and combine their weights afterwards, shouldn't the resulting weights be different from training on those two data batches in sequence on one GPU? Because both GPUs start from the same input weights, whereas the single GPU has already modified its weights before seeing the second batch.
Is this difference just marginal, and therefore not relevant for the end result after many iterations?
The order of the batches fed into training makes some difference, but the difference may be small if you have a large number of batches. Each batch pulls the variables in the model a bit towards the minimum of the loss. A different order may make the path towards the minimum a bit different, but as long as the loss is decreasing, your model is training and its evaluation keeps getting better.
Sometimes, to avoid the same batches repeatedly pulling the model in the same direction, and to avoid it becoming good only for some of the input data, the input for each model replica is randomly shuffled before being fed into the training program.
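To see why the difference is usually small, here is a toy comparison (plain NumPy, made-up data, a simple least-squares model, not NiftyNet or TensorFlow code) of one data-parallel update, where the two replicas' weights are averaged after the step, against two sequential SGD steps on the same batches:

import numpy as np

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(8, 3)), rng.normal(size=8)  # batch for "GPU 0"
X2, y2 = rng.normal(size=(8, 3)), rng.normal(size=8)  # batch for "GPU 1"
w0, lr = np.zeros(3), 0.1

def grad(w, X, y):
    # Gradient of the mean squared error 0.5 * ||Xw - y||^2 / len(y)
    return X.T @ (X @ w - y) / len(y)

# Data parallelism: both replicas start from w0; averaging their updated
# weights is the same as applying the average of the two gradients.
w_parallel = w0 - lr * 0.5 * (grad(w0, X1, y1) + grad(w0, X2, y2))

# Sequential training: the second batch sees already-updated weights.
w_seq = w0 - lr * grad(w0, X1, y1)
w_seq = w_seq - lr * grad(w_seq, X2, y2)

print(np.abs(w_parallel - w_seq).max())  # small, but not exactly zero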

Fine-tuning in Caffe: Check failed: error == cudaSuccess, out of memory

I am relatively new to deep learning and its frameworks. Currently I am experimenting with the Caffe framework and trying to fine-tune Vgg16_places_365.
I am using an Amazon EC2 g2.8xlarge instance with 4 GPUs (each with 4 GB of RAM). However, when I try to train my model (using a single GPU), I get this error:
Check failed: error == cudaSuccess (2 vs. 0) out of memory
After I did some research, I found that one of the ways to solve this out of memory problem is by reducing the batch size in my train.prototxt
Caffe | Check failed: error == cudaSuccess (2 vs. 0) out of memory.
Initially I set the batch size to 50, and iteratively reduced it until the training ran (it worked with batch_size = 10).
Now the model is training, and I am pretty sure it will take quite a long time. However, as a newcomer to this domain, I am curious about the relation between this batch size and other parameters such as the learning rate, the stepsize and even the max iteration that we specify in solver.prototxt.
How significantly will the batch size affect the quality of the model (e.g. its accuracy)? How can the other parameters be used to improve that quality? Also, instead of reducing the batch size or scaling up my machine, is there another way to fix this problem?
To answer your first question regarding the relationship between parameters such as batch size, learning rate and maximum number of iterations, you are best off reading about the mathematical background. A good place to start might be this stats.stackexchange question: How large should the batch size be for stochastic gradient descent?. The answer briefly discusses the relation between batch size and learning rate (from your question I assume learning rate = stepsize) and also provides some references for further reading.
To answer your last question, with the dataset you are finetuning on and the model (i.e. the VGG16) being fixed (i.e. the input data of fixed size, and the model of fixed size), you will have a hard time avoiding the out of memory problem for large batch sizes. However, if you are willing to reduce the input size or the model size you might be able to use larger batch sizes. Depending on how (and what) exactly you are finetuning, reducing the model size may already be achieved by discarding learned layers or reducing the number/size of fully connected layers.
The remaining questions, i.e. how significantly the batch size influences quality/accuracy and how the other parameters influence quality/accuracy, are hard to answer without knowing the concrete problem you are trying to solve. The influence of, e.g., the batch size on the achieved accuracy may depend on various factors, such as the noise in your dataset, its dimensionality, its size, and other parameters such as the learning rate (= stepsize) or momentum. For these sorts of questions I recommend the textbook by Goodfellow et al.; chapter 11, for example, provides some general guidelines on choosing these hyperparameters (i.e. batch size, learning rate etc.).
Another way to solve your problem is to use all the GPUs on your machine. With 4x4 = 16 GB of RAM across your GPUs, that would be enough. If you are running Caffe from the command line, just add the --gpu argument as follows (assuming you have 4 GPUs indexed, as by default, 0,1,2,3):
build/tools/caffe train --solver=solver.prototxt --gpu=0,1,2,3
However, if you are using the Python interface, running with multiple GPUs is not yet supported.
I can point out some general hints to answer your question on the batchsize:
- The smaller the batch size, the more stochastic your learning will be: a lower chance of overfitting the training data, but a higher chance of not converging.
- Each iteration in Caffe fetches one batch of data, runs a forward pass and ends with backpropagation.
- Let's say your training data has 50,000 examples and your batch size is 10; then in 1000 iterations, 10,000 of your examples have been fed to the network. In the same scenario, if your batch size is 50, in 1000 iterations all your training data are seen by the network. This is called one epoch. You should design your batch size and maximum iterations so that your network is trained for a certain number of epochs (see the sketch below).
- stepsize in Caffe is the number of iterations your solver runs before multiplying the learning rate by the gamma value (if you have set your learning-rate policy to "step").
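A small sketch of that bookkeeping (hypothetical numbers, not taken from the question) relating batch_size, max_iter, stepsize and gamma to epochs covered and the decayed learning rate:

train_size = 50000   # number of training examples
batch_size = 10      # batch_size in train.prototxt
max_iter = 25000     # max_iter in solver.prototxt
base_lr = 0.01       # base_lr in solver.prototxt
stepsize = 5000      # stepsize in solver.prototxt (lr_policy: "step")
gamma = 0.1          # gamma in solver.prototxt

iters_per_epoch = train_size / batch_size    # 5000 iterations per epoch
epochs_covered = max_iter / iters_per_epoch  # 5 epochs in total

# With lr_policy "step", the learning rate at iteration i is
# base_lr * gamma ** (i // stepsize)
def lr_at(i):
    return base_lr * gamma ** (i // stepsize)

print(epochs_covered, lr_at(0), lr_at(5000), lr_at(24999))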

Caffe | solver.prototxt values setting strategy

On Caffe, I am trying to implement a fully convolutional network for semantic segmentation. I was wondering whether there is a specific strategy for setting the 'solver.prototxt' values of the following hyper-parameters:
test_iter
test_interval
iter_size
max_iter
Does it depend on the number of images you have for your training set? If so, how?
In order to set these values in a meaningful manner, you need to have a few more bits of information regarding your data:
1. Training set size: the total number of training examples you have; let's call this quantity T.
2. Training batch size: the number of training examples processed together in a single batch; this is usually set by the input data layer in 'train_val.prototxt'. For example, in this file the training batch size is set to 256. Let's denote this quantity by tb.
3. Validation set size: the total number of examples you set aside for validating your model; let's denote this by V.
4. Validation batch size: the value set in batch_size for the TEST phase. In this example it is set to 50. Let's call this vb.
Now, during training, you would like to get an un-biased estimate of the performance of your net every once in a while. To do so you run your net on the validation set for test_iter iterations. To cover the entire validation set you need to have test_iter = V/vb.
How often would you like to get this estimation? It's really up to you. If you have a very large validation set and a slow net, validating too often will make the training process too long. On the other hand, not validating often enough may prevent you from noticing whether and when your training process failed to converge. test_interval determines how often you validate: usually for large nets you set test_interval on the order of 5K; for smaller and faster nets you may choose lower values. Again, it's all up to you.
In order to cover the entire training set (completing an "epoch") you need to run T/tb iterations. Usually one trains for several epochs, thus max_iter=#epochs*T/tb.
Regarding iter_size: this allows you to average gradients over several training mini-batches; see this thread for more information.
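Putting the above together with made-up numbers (T, tb, V, vb and the number of epochs are hypothetical, chosen only to make the arithmetic concrete):

T = 10000    # training set size
tb = 256     # training batch size (TRAIN phase in train_val.prototxt)
V = 2000     # validation set size
vb = 50      # validation batch size (TEST phase in train_val.prototxt)
epochs = 30  # how many passes over the training set you want

test_iter = V // vb          # 40 iterations cover the whole validation set
max_iter = epochs * T // tb  # ~1171 iterations for roughly 30 epochs
test_interval = 500          # validate every 500 iterations; a judgment call

print(test_iter, max_iter, test_interval)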

What batch size for neural network?

I have a training set consisting of 36 data points. I want to train a neural network on it. I can choose as the batch size, for example, 1, 12 or 36 (any number that divides 36).
Of course, when I increase the batch size the training runtime decreases substantially.
Is there a disadvantage if I choose e.g. 12 as the batch size instead of 1?
There are no golden rules for batch sizes, period.
However, your dataset is extremely tiny, and the batch size will probably not matter at all; all your problems will come from lack of data, not from any hyperparameters.
I agree with lejlot. The batch size is not the problem in your current model building, given the very small data size. Once you move on to larger data that can't fit in memory, try different batch sizes (say, some powers of 2, e.g. 32, 128, 512, ...).
The choice of batch size depends on:
- Your hardware capacity and model architecture. Given enough memory and enough bus capacity for moving data from memory to the CPU/GPU, larger batch sizes result in faster learning. The debate is whether the quality remains the same.
- The algorithm and its implementation. For example, the Keras Python package (which runs on top of Theano or TensorFlow implementations of neural-network algorithms) states:
"A batch generally approximates the distribution of the input data better than a single input. The larger the batch, the better the approximation; however, it is also true that the batch will take longer to process and will still result in only one update. For inference (evaluate/predict), it is recommended to pick a batch size that is as large as you can afford without going out of memory (since larger batches will usually result in faster evaluating/prediction)."
You will have a better intuition after having tried different batch sizes. If your hardware and time allow, have the machine pick the right batch size for you (loop through different batch sizes as part of a grid search).
Here are some good answers: one, two.
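A minimal sketch of that grid search (Keras code; the model, the candidate batch sizes and the arrays X_train, y_train, X_val, y_val are all placeholders you would supply yourself):

import tensorflow as tf

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

results = {}
for batch_size in (1, 4, 12, 36):  # candidates that divide the 36 examples
    model = build_model()          # fresh weights for a fair comparison
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        batch_size=batch_size, epochs=50, verbose=0)
    results[batch_size] = min(history.history["val_loss"])

best = min(results, key=results.get)
print(results, "best batch size:", best)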
