NiftyNet Selective Sampler batches not taken from mix of volumes? - niftynet

I'm training on three CT volumes using the Selective Sampler to ensure that enough samples are taken around the RoI (due to class imbalance), with some random samples. I'm also augmenting the data by scaling, rotation, and flipping, which takes a significant amount of time whenever samples are created.
Setting sample_per_volume to some large value (such as 32768) and batch_size to 128, it seems like NiftyNet will do 256 iterations of 128 samples just taken from the first volume, then switch to samples only taken from the 2nd volume (with a sharp jump in loss) and so on. I want each batch of 128 samples to be a roughly even mixture of samples taken from all of the training volumes.
I've tried setting sample_per_volume to roughly 1/3 of the batch_size so that samples are reselected for each iteration, but this slows down each iteration from around 2s to 50-60s.
Am I misunderstanding something? Or is there a way around this to ensure my batches are made up of samples from a mix of all the training data? Thanks.

The samples populate a queue of length queue_length, given in the .ini file. They are then randomly taken from the queue to populate the batch.
I would make the queue_length parameter bigger. Then it will be filled with data from several different subjects.
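To see why a larger queue helps, here is a minimal sketch of a bounded shuffle queue. It is a toy model under stated assumptions, not NiftyNet's actual implementation: the function `batches_from_queue` is hypothetical, the producer is assumed to emit every sample of one volume before moving to the next, and only the parameter names `sample_per_volume`, `queue_length` and `batch_size` come from the question.

```python
import random

def batches_from_queue(sample_per_volume, n_volumes, queue_length, batch_size, seed=0):
    """Yield batches of volume ids drawn at random from a bounded queue.

    Toy model only: the producer emits all samples of volume 0, then all of
    volume 1, and so on; the consumer removes random items from the queue.
    """
    rng = random.Random(seed)
    stream = (v for v in range(n_volumes) for _ in range(sample_per_volume))
    queue = []
    while True:
        # Top the queue up to its capacity from the ordered stream.
        while len(queue) < queue_length:
            item = next(stream, None)
            if item is None:
                break
            queue.append(item)
        if len(queue) < batch_size:
            return
        batch = []
        for _ in range(batch_size):
            # Remove a uniformly random element via swap-and-pop.
            i = rng.randrange(len(queue))
            queue[i], queue[-1] = queue[-1], queue[i]
            batch.append(queue.pop())
        yield batch

# Queue shorter than one volume's worth of samples: the first batch is all volume 0.
small = next(batches_from_queue(64, 3, queue_length=32, batch_size=32))
# Queue long enough to hold samples from every volume: the first batch can mix them.
large = next(batches_from_queue(64, 3, queue_length=192, batch_size=32))
print(sorted(set(small)), sorted(set(large)))
```

In this toy model the batches only mix volumes once the queue spans more than one volume's worth of samples, so the queue length has to be considered together with sample_per_volume.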

Related

Does epoch size need to be an exact multiple of batch size?

When training a net, does it matter if the number of samples in the epoch is not an exact multiple of the batch size? My training code doesn't seem to mind if this is the case, though my loss curve is pretty noisy at the moment (in case that is a related issue).
This would be useful to know, as if it is not an issue it saves on messing around with the dataset to make it quantized by batch size. It may also be less wasteful of captured data.
"Does it matter if the number of samples in the epoch is not an exact multiple of the batch size?"
No, it does not. Your number of samples can be say 1000, and your batch size can be 400.
You can decide the total number of iterations (where each iteration = sampling a batch and doing one gradient descent step) based on the overall number of epochs you want to cover. Say you want to have roughly 5 epochs; then your number of iterations is roughly 5 * 1000 / 400 = 12.5, which rounds up to 13. So you will sample a random batch 13 times to get roughly 5 epochs.
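The arithmetic above can be sketched in a couple of lines. The helper name `iterations_for_epochs` is made up for illustration; the numbers are the ones from the answer.

```python
import math

def iterations_for_epochs(n_samples, batch_size, n_epochs):
    """Number of batches needed to see roughly n_epochs passes over the data."""
    return math.ceil(n_epochs * n_samples / batch_size)

n_iterations = iterations_for_epochs(n_samples=1000, batch_size=400, n_epochs=5)
print(n_iterations)  # 13
```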
In the context of Convolutional Neural Networks (CNNs), batch size is the number of examples that are fed to the algorithm at a time. This is normally some small power of 2 like 32, 64, 128, etc. During training, an optimization algorithm computes the average cost over a batch, then runs backpropagation to update the weights. In a single epoch the algorithm is run $n_{batches} = \left\lceil \frac{n_{examples}}{batchsize} \right\rceil$ times, with a smaller final batch when the division is not exact. Generally the algorithm needs to train for several epochs to achieve convergence of the weight values. Every batch is normally sampled randomly from the whole example set.
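As a small illustration of what happens when the example count is not a multiple of the batch size, the sketch below slices a dataset into batches and simply lets the last one be shorter. The helper `iter_batches` is hypothetical and not tied to any particular framework.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive batches; the final batch may be smaller."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

examples = list(range(1000))
sizes = [len(batch) for batch in iter_batches(examples, 400)]
print(sizes)  # [400, 400, 200]
```

Whether the short final batch is kept or dropped is a design choice; frameworks differ, so check the loader you actually use.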
The idea is this: a mini-batch optimization step with respect to (x1, ..., xn) is equivalent, to first order in the learning rate, to consecutive optimization steps with respect to the inputs x1, ..., xn, because the gradient is a linear operator. This means that the mini-batch update equals the sum of the individual updates computed at the same weights. Important note: I assume that the NN doesn't apply batch norm or any other layer that adds an explicit variation to the inference model (in that case the math is a bit more hairy).
So the batch size can be seen as a purely computational idea that speeds up the optimization through vectorization and parallel computing. Assuming that one can afford arbitrarily long training and the data are properly shuffled, the batch size can be set to any value. But this isn't automatically true for all hyperparameters. For example, a very high learning rate can easily force the optimization to diverge, so don't make the mistake of thinking hyperparameter tuning isn't important in general.
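The linearity argument can be checked numerically. The sketch below assumes a one-parameter model with squared error loss (an illustrative choice, not something from the answer) and compares the gradient of the batch loss, estimated by finite differences, with the mean of the per-example gradients evaluated at the same weight.

```python
def example_grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * x * (w * x - y)

def batch_loss(w, batch):
    """Mean squared error of the model w*x over a batch of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]  # made-up data
w = 0.5

grad_of_batch = numerical_grad(lambda v: batch_loss(v, batch), w)
mean_of_grads = sum(example_grad(w, x, y) for x, y in batch) / len(batch)
print(grad_of_batch, mean_of_grads)  # the two agree up to rounding error
```

Note that the agreement holds for gradients taken at the same weights; once the weights change between steps, consecutive single-example updates match the batch update only approximately.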

What does training instance mean?

I am new to machine learning. I just stumbled across the term 'training instances' in a paper about using a CNN for image segmentation. In that paper, a total of 1100 images were used for modeling. The authors chose sub-regions from the images for training, and they presented a classification performance curve over 500K training instances. I am confused about how they get such a large number of training instances from only 1100 images. Does one training instance mean one training sample, or something else related to the training size?
You can think of training instances as training batches. If there are millions of data points to process, you don't want to handle them all at the same time, but in instances or batches.
If you take 'n' images and split each image into 'm' sub-sections, you will get n x m sub-sections.
So in your case, suppose we split each image into 4096 sections (why 4096? Because it is an even 64x64 grid split). We get
1100 * 4096 = 4505600 sub-sections of the given training data.
To get 500K instances or subsets of the training data, we simply divide 4505600 by 500K to get about 9.
Thus we will get about 9 sub-sections in each of the 500K subsets.
If the images are sufficiently dense in terms of pixel resolution, and hence large in size, it may be possible to increase the number of sub-sections further to get a greater number of them in each training batch or instance.
An instance in a training dataset is a single observation of record data.
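The numbers above can be reproduced with a short calculation. The 64x64 grid is the first answer's own assumption, not something stated in the paper. The last line is an additional hedged reading: if one instance is a single observation, as in the definition just given, then 500K instances would correspond to a few hundred sub-regions per image.

```python
n_images = 1100
grid = 64                      # assumed 64x64 split, as in the answer above
n_instances = 500_000          # "500K training instances" from the question

sections_per_image = grid * grid
total_sections = n_images * sections_per_image
sections_per_instance = total_sections / n_instances

# Alternative reading: one instance = one sub-region sampled from an image.
subregions_per_image = n_instances / n_images

print(total_sections)                   # 4505600
print(round(sections_per_instance, 2))  # 9.01
print(round(subregions_per_image, 1))   # 454.5
```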

Time Series Prediction using LSTM

I am using Jason Brownlee's tutorial (mirror) to apply LSTM network on some syslog/network log data. He's a master!
I have syslog data (a specific event) for each day of the last year, so I am using an LSTM network for time series analysis. I am using the LSTM from the Keras deep learning library.
As I understand -
About Batch_size
A batch of data is a fixed-sized number of rows from the training
dataset that defines how many patterns to process before updating
the weights of the network. Based on the batch_size the Model
takes random samples from the data for the analysis. For time series
this is not desirable, hence the batch_size should always be 1.
About setting the shuffle value
By default, the samples within an epoch are shuffled prior to being exposed to the network. This is undesirable for the LSTM
because we want the network to build up state as it learns across
the sequence of observations. We can disable the shuffling of
samples by setting “shuffle” to “False”.
Scenario1 -
Using the above two rules/guidelines, I ran several trials with different numbers of neurons, epochs and layers, and got better results than the baseline model (persistence model).
Scenario2-
Without using the above rules/guidelines, I ran several trials with different numbers of neurons, epochs and layers, and got even better results than in Scenario 1.
Query - Setting shuffle to False and batch_size to 1 for time series: is this a rule or a guideline?
It seems logical reading the tutorial that the data for time series should not be shuffled as we do not want to change the sequence of data, but for my data the results are better if I let the data be shuffled.
In the end, I think what matters is whether I get better predictions from my runs.
I think I should favor concrete evidence, such as metrics, elbows, RMSEs, etc., over "theory".
Kindly enlighten.
It depends a lot on the size of your data and on the number of variables. In my experience, decreasing the batch size gives better results because the weights are updated more frequently, but on huge datasets this is very expensive. You have to play with this trade-off (training time vs. result).
As for shuffling, it may be that your data is not strongly correlated with the past. If that is the case, shuffling the data helps the network learn and generalize (as it does when the data is ordered by label). Check reason 7 of the following: 37 reasons your neural network is not working.
The larger the batch size, the more difficult it is to generalize (reason 11). When the data clearly depends on the past, you can declare your LSTM in Keras as stateful, which means "that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch", according to the Keras API. Hope this helps.
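To make the quoted description of "stateful" concrete, here is a toy sketch in plain Python. It is not Keras code and not an LSTM: the recurrent "cell" is just an exponential moving average, and `run_cell` and `process` are made-up names. It only illustrates the difference between resetting the state at every batch and carrying it over to the next one.

```python
def run_cell(sequence, state=0.0, decay=0.5):
    """Toy recurrent cell: the state is an exponential moving average of the inputs."""
    for x in sequence:
        state = decay * state + (1.0 - decay) * x
    return state

def process(batches, stateful):
    """Feed batches in order, optionally carrying the state across batches."""
    state = 0.0
    for batch in batches:
        if not stateful:
            state = 0.0  # stateless: start from scratch at every batch
        state = run_cell(batch, state)
    return state

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
batches = [series[0:2], series[2:4], series[4:6]]  # kept in time order, no shuffling

stateful_result = process(batches, stateful=True)
stateless_result = process(batches, stateful=False)
print(stateful_result, stateless_result)  # 5.015625 4.25
```

With the state carried over, processing the batches in order gives the same result as processing the whole series at once, which is also why the order of the batches matters in that mode.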

Effect of Data Parallelism on Training Result

I'm currently trying to implement multi-GPU training with the Tensorflow network. One solution for this would be to run one model per GPU, each having their own data batches, and combine their weights after each training iteration. In other words "Data Parallelism".
So for example if I use 2 GPUs, train with them in parallel, and combine their weights afterwards, then shouldn't the resulting weights be different compared to training with those two data batches in sequence on one GPU? Because both GPUs have the same input weights, whereas the single GPU has modified weights for the second batch.
Is this difference just marginal, and therefore not relevant for the end result after many iterations?
The order of the batches fed into training makes some difference, but the difference may be small if you have a large number of batches. Each batch pulls the variables in the model a bit towards the minimum of the loss. A different order may make the path towards the minimum a bit different. But as long as the loss is decreasing, your model is training and its evaluation becomes better and better.
Sometimes, to keep the same batches from "pulling" the model together and making it too good for only some input data, the input for each model replica is randomly shuffled before being fed into the training program.
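The size of the effect can be seen on a toy problem. The sketch below assumes a one-parameter model with squared error, plain SGD, two replicas with equally sized batches, and made-up data; it is not TensorFlow code. It compares averaging the replica weights, one step on the concatenated batch, and two steps in sequence on a single device.

```python
def batch_grad(w, batch):
    """Mean gradient of the squared error (w*x - y)**2 over a batch."""
    return sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)

def sgd_step(w, batch, lr):
    """One plain gradient descent step on the given batch."""
    return w - lr * batch_grad(w, batch)

lr, w0 = 0.01, 0.0
batch_a = [(1.0, 2.0), (2.0, 3.0)]  # data seen by replica A
batch_b = [(3.0, 7.0), (4.0, 9.0)]  # data seen by replica B

w_a = sgd_step(w0, batch_a, lr)
w_b = sgd_step(w0, batch_b, lr)

# Data parallel: both replicas start from w0, then their weights are averaged.
w_parallel = 0.5 * (w_a + w_b)

# Single device, one step on the concatenated batch.
w_combined = sgd_step(w0, batch_a + batch_b, lr)

# Single device, two steps in sequence: the second step sees updated weights.
w_sequential = sgd_step(w_a, batch_b, lr)

# Sum of the two parallel updates, for a like-for-like comparison with two steps.
w_summed = w0 + (w_a - w0) + (w_b - w0)

print(w_parallel, w_combined, w_sequential, w_summed)
```

In this toy case, averaging the replica weights is the same as one step on the doubled batch, while the sequential result differs from the summed parallel updates only by a term that shrinks with the square of the learning rate.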

What batch size for neural network?

I have a training set consisting of 36 data points. I want to train a neural network on it. I can choose as the batch size, for example, 1 or 12 or 36 (any number that divides 36).
Of course when I increase the batch size training runtime decreases substantially.
Is there a disadvantage if I choose e.g. 12 as the batch size instead of 1?
There are no golden rules for batch sizes. Period.
However, your dataset is extremely tiny, and the batch size will probably not matter at all; all your problems will come from lack of data, not from any hyperparameters.
I agree with lejlot. The batch size is not the problem in your current model building, given the very small data size. Once you move on to larger data that can't fit in memory, then try different batch sizes (say, some powers of 2, i.e. 32, 128, 512, ...).
The choice of batch size depends on:
Your hardware capacity and model architecture. Given enough memory and enough capacity on the bus carrying data from memory to the CPU/GPU, larger batch sizes result in faster learning. However, the debate is whether the quality remains.
The algorithm and its implementation. For example, the Keras Python package (which is based on either the Theano or the TensorFlow implementation of neural network algorithms) states:
A batch generally approximates the distribution of the input data
better than a single input. The larger the batch, the better the
approximation; however, it is also true that the batch will take
longer to process and will still result in only one update. For
inference (evaluate/predict), it is recommended to pick a batch size
that is as large as you can afford without going out of memory (since
larger batches will usually result in faster evaluating/prediction).
You will have a better intuition after having tried different batch sizes. If your hardware and time allow, have the machine pick the right batch size for you (loop through different batch sizes as part of the grid search).
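Looping over batch sizes can be sketched as follows. Everything here is an illustrative assumption: the 36 points are synthetic, the model is a one-parameter linear fit trained with plain mini-batch gradient descent, and `make_data` and `train` are made-up helpers rather than part of any library.

```python
import random

def make_data(n=36, seed=0):
    """Synthetic (x, y) pairs with y close to 3*x; stands in for a real dataset."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]

def train(data, batch_size, epochs=20, lr=0.1, seed=0):
    """Fit y = w*x by mini-batch gradient descent; return (final loss, update count)."""
    rng = random.Random(seed)
    data = list(data)
    w, updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)
            w -= lr * grad
            updates += 1
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    return loss, updates

data = make_data()
results = {bs: train(data, bs) for bs in (1, 12, 36)}
for bs, (loss, updates) in results.items():
    print(f"batch_size={bs:2d}  updates={updates:3d}  final_loss={loss:.4f}")
```

For the same number of epochs, smaller batches mean more weight updates (720, 60 and 20 here), which is the trade-off between runtime and update frequency discussed above. In a real grid search the final loss should be measured on held-out data rather than on the training set.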
Here are some good answers: one, two.
