Sometimes I run into a problem:
OOM when allocating tensor with shape
e.g.
OOM when allocating tensor with shape (1024, 100, 160)
Where 1024 is my batch size, and I don't know what the rest of the dimensions are. If I reduce the batch size or the number of neurons in the model, it runs fine.
Is there a generic way to calculate optimal batch size based on model and GPU memory, so the program doesn't crash?
In short: I want the largest batch size possible in terms of my model, which will fit into my GPU memory and won't crash the program.
From the recent Deep Learning book by Goodfellow et al., chapter 8:
Minibatch sizes are generally driven by the following factors:
Larger batches provide a more accurate estimate of the gradient, but
with less than linear returns.
Multicore architectures are usually
underutilized by extremely small batches. This motivates using some
absolute minimum batch size, below which there is no reduction in the
time to process a minibatch.
If all examples in the batch are to be
processed in parallel (as is typically the case), then the amount of
memory scales with the batch size. For many hardware setups this is
the limiting factor in batch size.
Some kinds of hardware achieve
better runtime with specific sizes of arrays. Especially when using
GPUs, it is common for power of 2 batch sizes to offer better runtime.
Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes
being attempted for large models.
Small batches can offer a
regularizing effect (Wilson and Martinez, 2003), perhaps due to the
noise they add to the learning process. Generalization error is often
best for a batch size of 1. Training with such a small batch size
might require a small learning rate to maintain stability because of
the high variance in the estimate of the gradient. The total runtime
can be very high as a result of the need to make more steps, both
because of the reduced learning rate and because it takes more steps
to observe the entire training set.
Which in practice usually means "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory".
You might want also to consult several good posts here in Stack Exchange:
Tradeoff batch size vs. number of iterations to train a neural network
Selection of Mini-batch Size for Neural Network Regression
How large should the batch size be for stochastic gradient descent?
Just keep in mind that the paper by Keskar et al. 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has received some objections from other respectable researchers in the deep learning community.
Hope this helps...
UPDATE (Dec 2017):
There is a new paper by Yoshua Bengio & team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading in the sense that it reports new theoretical & experimental results on the interplay between learning rate and batch size.
UPDATE (Mar 2021):
Of interest here is also another paper from 2018, Revisiting Small Batch Training for Deep Neural Networks (h/t to Nicolas Gervais), which runs contrary to the "larger is better" advice; quoting from the abstract:
The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.
You can estimate the largest batch size using:
Max batch size = available GPU memory bytes / 4 / (size of tensors + trainable parameters)
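Read literally, that rule of thumb can be written as a small helper. This is only a sketch under the stated assumptions: the function and its arguments are mine, 4 bytes assumes float32, and the result is an optimistic upper bound because it ignores the CUDA context, optimizer state, temporary buffers, etc.:
def max_batch_size(gpu_memory_bytes, activation_elements_per_sample, trainable_params):
    # available GPU memory bytes / 4 / (size of tensors + trainable parameters)
    return gpu_memory_bytes // 4 // (activation_elements_per_sample + trainable_params)

# Illustrative numbers only: a 16 GB card, ~3.5M activation values per sample,
# and the 1,127,495 trainable parameters from the summary below.
print(max_batch_size(16 * 1024**3, 3_500_000, 1_127_495))  # -> 928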
Use the summaries provided by torchsummary (pip install torchsummary) or the one built into Keras (model.summary()).
E.g.
from torchsummary import summary
summary(model)
.....
.....
================================================================
Total params: 1,127,495
Trainable params: 1,127,495
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 13.93
Params size (MB): 4.30
Estimated Total Size (MB): 18.25
----------------------------------------------------------------
Each instance you put in the batch requires a full forward/backward pass's worth of memory; the model itself is needed only once. People seem to prefer batch sizes that are powers of two, probably because of automatic layout optimization on the GPU.
Don't forget to linearly increase your learning rate when increasing the batch size.
Let's assume we have a Tesla P100 at hand with 16 GB memory.
(16000 - model_size) / forward_backward_size_per_sample
(16000 - 4.3) / 13.93 = 1148.29
Rounded down to a power of 2, this gives a batch size of 1024.
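The same arithmetic as a few lines of Python (the numbers come from the torchsummary output above; the power-of-two rounding at the end is the only addition):
gpu_memory_mb = 16000          # Tesla P100 with 16 GB
params_mb = 4.30               # "Params size (MB)" from the summary
fwd_bwd_mb_per_sample = 13.93  # "Forward/backward pass size (MB)" for one sample

max_batch = (gpu_memory_mb - params_mb) / fwd_bwd_mb_per_sample
print(max_batch)               # ~1148.3

batch_size = 2 ** (int(max_batch).bit_length() - 1)  # round down to a power of 2
print(batch_size)              # 1024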
Here is a function to find batch size for training the model:
def FindBatchSize(model):
    """Heuristically pick a batch size for a model that is yet to be trained."""
    import os
    import gc
    import psutil
    from keras import backend as K

    BatchFound = 16
    try:
        total_params = int(model.count_params())
        GCPU = "CPU"

        # Find out whether a GPU is available.
        try:
            # Older Keras/TensorFlow backend API.
            GCPU = "CPU" if K.tensorflow_backend._get_available_gpus() == [] else "GPU"
        except Exception:
            # Fall back to querying the devices through TensorFlow directly.
            from tensorflow.python.client import device_lib

            def get_available_gpus():
                local_device_protos = device_lib.list_local_devices()
                return [x.name for x in local_device_protos if x.device_type == 'GPU']

            GCPU = "GPU" if "gpu" in str(get_available_gpus()).lower() else "CPU"

        # Decide the batch size from GPU availability, CPU count and model complexity.
        if GCPU == "GPU" and os.cpu_count() > 15 and total_params < 1_000_000:
            BatchFound = 64
        if os.cpu_count() < 16 and total_params < 500_000:
            BatchFound = 64
        if GCPU == "GPU" and os.cpu_count() > 15 and 1_000_000 <= total_params < 2_000_000:
            BatchFound = 32
        if GCPU == "GPU" and os.cpu_count() > 15 and 2_000_000 <= total_params < 10_000_000:
            BatchFound = 16
        if GCPU == "GPU" and os.cpu_count() > 15 and total_params >= 10_000_000:
            BatchFound = 8
        if os.cpu_count() < 16 and total_params > 5_000_000:
            BatchFound = 8
        if total_params > 100_000_000:
            BatchFound = 1
    except Exception:
        pass

    try:
        # Scale the batch size down further if system memory is already under pressure.
        memory_used_percent = psutil.virtual_memory().percent
        if memory_used_percent > 75.0:
            BatchFound = 8
        if memory_used_percent > 85.0:
            BatchFound = 4
        if memory_used_percent > 90.0:
            BatchFound = 2
        if total_params > 100_000_000:
            BatchFound = 1
        print("Batch Size: " + str(BatchFound))
        gc.collect()
    except Exception:
        pass

    return BatchFound
I ran into a similar GPU memory error, which was solved by configuring the TensorFlow session with the following:
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)  # pass the config when creating the session
see: google colaboratory `ResourceExhaustedError` with GPU
Related
I am wondering why the number of images has no influence on the number of iterations when training. Here is an example to make my question clearer:
Suppose we have 6400 images for training to recognize 4 classes. Based on AlexeyAB's explanations, we keep batch = 64, subdivisions = 16 and write max_batches = 8000, since max_batches is determined by #classes x 2000.
Since we have 6400 images, a complete epoch requires 100 iterations. Therefore this training ends after 80 epochs.
Now, suppose that we have 12800 images. In that case, an epoch needs 200 iterations. Therefore the training ends after 40 epochs.
Since an epoch refers to one cycle through the full training dataset, I'm wondering why we don't increase the number of iterations when our dataset increases, in order to keep the number of epochs constant.
Said differently, I'm asking for a simple explanation as to why the number of epochs seems to be irrelevant to the quality of the training. I feel that it's a consequence of Yolo's construction but I am not knowledgeable enough to understand how.
Why the number of images has no influence on the number of iterations when training?
In darknet YOLO, the number of iterations depends on the max_batches parameter in the .cfg file. After running for max_batches iterations, darknet saves the final weights.
In each epoch, all the data samples are passed through the network, so if you have many images, the time for one epoch (and the number of iterations it takes) will be higher; you can test that by adding images to your data.
The sub-division accounts for the number of mini-batches. Let's say you have 100 images in your dataset, your batch size is 10, the sub-division is 2 and max_batches is 20.
So, in each iteration, 10 images are passed to the network in two mini-batches (each having 5 samples). Once you have done 20 batches (20*10 data samples), the training will be completed. (The details can be a little different; I'm using a slightly modified darknet by the original author pjreddie.)
The instructions have been updated now: max_batches is equal to classes*2000, but not less than the number of training images and not less than 6000. Please find it at this link.
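To make the arithmetic concrete, here is a plain-Python sketch of that rule; the helper function is mine (not darknet code), and the batch of 64 and the image counts come from the question:
def yolo_training_epochs(num_images, classes, batch=64):
    # max_batches = classes * 2000, but not less than the number of
    # training images and not less than 6000 (updated instructions).
    max_batches = max(classes * 2000, num_images, 6000)
    iterations_per_epoch = num_images / batch
    return max_batches / iterations_per_epoch

print(yolo_training_epochs(6400, classes=4))   # 8000 iterations,  100 per epoch -> 80 epochs
print(yolo_training_epochs(12800, classes=4))  # 12800 iterations, 200 per epoch -> 64 epochs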
On my MacBook Pro 13" I have the Blackmagic eGPU (AMD Radeon Pro 580) connected via USB-C. This should theoretically speed up my model training with Turi Create enormously.
For a small model, in my case 15 labeled images (4k x 3k) and 500 iterations, training takes about 2 hours with the eGPU. CPU-only training takes about 4 hours, so the eGPU does speed things up, but not dramatically.
The Turi Create guide says that an object detection model with ~700 images and 4000 iterations is processed in 1 hour, so way faster.
While using CreateML, I observe a performance increase of at least 5x for transfer learning during the feature detection phase when using the eGPU.
Is this a problem of the framework itself?
Can I optimize the data or training parameters for better usage of the eGPU?
Is the data too small or the resolution too big to have optimal GPU usage over USB-C?
Class : ObjectDetector
Schema
------
Model : darknet-yolo
Number of classes : 4
Non-maximum suppression threshold : 0.45
Input image shape : (3, 416, 416)
Training summary
----------------
Training time : 1h 29m 8s
Training epochs : 1066
Training iterations : 500
Number of examples (images) : 15
Number of bounding boxes (instances) : 49
Final loss (specific to model) : 1.808
It is the image size/resolution (4k x 3k) that creates the bottleneck for the GPU. Scaling the images down (and adjusting the labels accordingly) gets the full speed of the eGPU (100x vs. CPU).
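As a minimal sketch of that preprocessing step (assuming annotations in the format Turi Create's object detector expects, i.e. {'label': ..., 'coordinates': {'x', 'y', 'width', 'height'}}; the function name and target width are just illustrative):
from PIL import Image

def downscale(image_path, annotations, target_width=1024):
    """Resize one image and scale its bounding-box annotations by the same factor."""
    img = Image.open(image_path)
    scale = target_width / img.width
    resized = img.resize((target_width, round(img.height * scale)))
    scaled_annotations = [
        {"label": a["label"],
         "coordinates": {k: v * scale for k, v in a["coordinates"].items()}}
        for a in annotations
    ]
    return resized, scaled_annotations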
I'm reading this CS231n tutorial, about convolutional neural networks. They give an example about VGGNet:
http://cs231n.github.io/convolutional-networks/
VGGNet in detail. Lets break down the VGGNet in more detail as a case
study. The whole VGGNet is composed of CONV layers that perform 3x3
convolutions with stride 1 and pad 1, and of POOL layers that perform
2x2 max pooling with stride 2 (and no padding). We can write out the
size of the representation at each step of the processing and keep
track of both the representation size and the total number of weights:
Then they give a detailed calculation of the network structure:
But the thing is, for total memory the tutorial gives a result of 24M, but when I calculated it I only got about 15M! I simply added up all of the memories:
>>> 224*224*(3+64*2)+112*112*(64+128*2)+56*56*(128+256*3)+28*28*(256+512*3)+14*14*(512*4)+7*7*512+4096+4096+1000
15237608
Please help me.
Nice catch! Your calculation is correct; the total memory of the VGG representation is indeed
15.2M * 4 bytes ~= 61 MB
In fact, this error was reported a long time ago, but unfortunately the CS231n staff don't spend much time on website maintenance...
However, note that if you code the VGG network in any framework (Caffe, TensorFlow, etc.), the total model size will also include the parameters, and this part is much larger, as the authors show in their calculations (which seem right).
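For comparison, a quick back-of-the-envelope count of the VGG-16 (configuration D) weights shows why the parameter part dominates:
# (in_channels, out_channels) for each 3x3 conv layer of VGG-16
conv_cfg = [(3, 64), (64, 64),
            (64, 128), (128, 128),
            (128, 256), (256, 256), (256, 256),
            (256, 512), (512, 512), (512, 512),
            (512, 512), (512, 512), (512, 512)]
fc_cfg = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in conv_cfg)
fc_params = sum(fin * fout + fout for fin, fout in fc_cfg)
total = conv_params + fc_params

print(total)                 # 138,357,544 parameters
print(total * 4 / 1024**2)   # ~528 MB at 4 bytes per float32 weight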
When using the function tf.nn.fractional_max_pool in TensorFlow, in addition to the output pooled tensor it returns, it also returns a row_pooling_sequence and a col_pooling_sequence, which I presume are used in backpropagation to compute the gradient. This is in contrast to normal $2 \times 2$ max pooling, which just returns the pooled tensor.
My question is: do we have to handle the row_pooling and col_pooling values ourselves? How would we include them into a network to get backpropagation working properly? I modified a simple convolutional neural network to use fractional max pooling instead of 2 x 2 max pooling without making use of these values and the results were much poorer, leading me to believe we must explicitly handle these.
Here's the relevant portion of my code that makes use of the FMP:
def add_layer_ops_FMP(conv_func, x_input, W, keep_prob_layer, training_phase):
    # batch_norm, dropout_op and epsilon are defined elsewhere
    h_conv = conv_func(x_input, W, stride_l=1)
    h_BN = batch_norm(h_conv, training_phase, epsilon)
    h_elu = tf.nn.elu(h_BN)  # Rectified unit layer - change accordingly

    def dropout_no_training(h_elu=h_elu):
        return dropout_op(h_elu, keep_prob=1.0)

    def dropout_in_training(h_elu=h_elu, keep_prob_layer=keep_prob_layer):
        return dropout_op(h_elu, keep_prob=keep_prob_layer)

    h_drop = tf.cond(training_phase, dropout_in_training, dropout_no_training)
    # FMP layer; pooling_ratio is a required argument, the value here is just an example.
    # See Ben Graham's paper.
    h_pool, row_pooling_sequence, col_pooling_sequence = tf.nn.fractional_max_pool(
        h_drop, pooling_ratio=[1.0, 1.44, 1.44, 1.0])
    return h_pool
Link to function on github.
Do we need to handle row_pooling_sequence and col_pooling_sequence?
Even though the tf.nn.fractional_max_pool documentation says it returns 2 extra tensors which are needed to calculate the gradient, I believe we do not need to specially handle these 2 extra tensors or add them into the gradient calculation operation. The backpropagation of tf.nn.fractional_max_pool in TensorFlow is already registered into the gradient calculation flow by the _FractionalMaxPoolGrad function. As you can see in _FractionalMaxPoolGrad, the row_pooling_sequence and col_pooling_sequence are extracted via op.outputs[1] and op.outputs[2] and used to calculate the gradient.
@ops.RegisterGradient("FractionalMaxPool")
def _FractionalMaxPoolGrad(op, grad_0, unused_grad_1, unused_grad_2):
  """..."""
  return gen_nn_ops._fractional_max_pool_grad(op.inputs[0], op.outputs[0],
                                              grad_0, op.outputs[1],
                                              op.outputs[2],
                                              op.get_attr("overlapping"))
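A quick way to convince yourself is to ask for the gradient directly and see that it is computed without any manual handling of the pooling sequences. This is a minimal TF 1.x sketch; the shapes and pooling_ratio are just illustrative:
import tensorflow as tf

x = tf.random_normal([1, 12, 12, 3])
pooled, rows, cols = tf.nn.fractional_max_pool(x, pooling_ratio=[1.0, 1.44, 1.44, 1.0])
loss = tf.reduce_sum(pooled)
grads = tf.gradients(loss, x)  # rows/cols are consumed internally by _FractionalMaxPoolGrad

with tf.Session() as sess:
    print(sess.run(grads)[0].shape)  # (1, 12, 12, 3)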
Possible reasons for poorer performance after using fractional_max_pool (in my personal opinion):
In the fractional max-pooling paper, the author used fractional max pooling in a spatially-sparse convolutional network. According to his spatially-sparse convolutional network design, he actually extended the input spatial size by padding zeros. Additionally, fractional max pooling downsizes the input by a factor of pooling_ratio, which is often less than 2. These two things combined allow stacking more convolutional layers than regular max pooling does, and hence building a deeper network. (For example, with the CIFAR-10 dataset the non-padded input spatial size is 32x32; the spatial size drops to 4x4 after 3 convolutional layers and 3 max-pooling operations. Using fractional max pooling with pooling_ratio=1.4, the spatial size drops to 4x4 only after 6 convolutional and 6 fractional max-pooling layers.)
I experimented with building a CNN with 2 conv layers + 2 pooling layers (regular max pool vs. fractional max pool with pooling_ratio=1.47) + 2 fully-connected layers on the MNIST dataset. The one using regular max pooling produced better performance than the one using fractional max pooling (the latter was down by 15~20%). Comparing the spatial sizes before the fully-connected layers, the model with regular max pooling has a spatial size of 7x7, while the one with fractional max pooling has a spatial size of 12x12. Adding one more conv + fractional_max_pool to the latter model (so that the final spatial size dropped to 8x8) improved its performance to a level more comparable with the former model using regular max pooling.
In summary, I personally think the good performance in the Fractional Max-Pooling paper is achieved by combining a spatially-sparse CNN with fractional max pooling and small filters (and network-in-network), which enables building a deep network even when the input image spatial size is small. Hence, in a regular CNN, simply replacing regular max pooling with fractional max pooling does not necessarily give you better performance.
I have been trying to get into more details of resampling methods and implemented them on a small data set of 1000 rows. The data was split into 800 training set and 200 validation set. I used K-fold cross validation and repeated K-fold cross validation to train the KNN using the training set. Based on my understanding I have done some interpretations of the results - however, I have certain doubts about them (see questions below):
Results :
10-fold CV
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10-fold CV with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10-fold CV with 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10-fold CV with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
While selecting the parameter, k = 9 is the optimal value with the highest accuracy. However, I don't understand how to take Kappa into consideration when finally choosing the parameter value?
The number of repeats has to be increased until we get a stabilised result; the accuracy changes when the repeats are increased from 10 to 1000. However, the results are similar for 1000 and 2000 repeats. Is it right to consider the results of 1000/2000 repeats a stabilised performance estimate?
Is there any rule of thumb for the number of repeats?
Finally, should I now train the model on my complete training data (800 rows) and then test the accuracy on the validation set?
Accuracy and Kappa are just different classification performance metrics. In a nutshell, the difference is that Accuracy does not take possible class imbalance into account when calculating the metric, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R caret you can do so via the metric argument of caret::train.
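To illustrate the difference, here is a scikit-learn equivalent of the two metrics on a made-up, heavily imbalanced example:
from sklearn.metrics import accuracy_score, cohen_kappa_score

# 90/10 imbalanced ground truth; a "classifier" that always predicts the majority class
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))     # 0.9 -- looks good
print(cohen_kappa_score(y_true, y_pred))  # 0.0 -- no better than chance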
You would see a similar effect of slightly different performance results when running e.g. the 10-fold CV with 10 repeats multiple times; you will just get slightly different results for those as well. Something you should look out for is the variance of classification performance over your partitions and repeats. If you obtain a small variance, you can conclude that by training on all your data you will likely obtain a model that gives you similar (hence stable) results on new data. But if you obtain a huge variance, you can conclude that, just by chance (being lucky or unlucky), you might instead end up with a model that gives you either rather good or rather bad performance on new data. BTW: the variance of the prediction performance is something that e.g. R's caret::train will give you automatically, hence I'd advise using it.
See above: look at the variance and increase the repeats until you can e.g. repeat the whole process and obtain a similar average performance and variance of performance.
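As a sketch of what that check looks like in practice: the question uses R caret, but the same idea can be expressed with scikit-learn's repeated CV (X and y below are random stand-ins for the 800-row training set):
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=800, n_features=10, random_state=0)

for n_repeats in (10, 100):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=9), X, y,
                             cv=cv, scoring="accuracy")
    # If the mean and spread barely move as n_repeats grows, the estimate has stabilised.
    print(n_repeats, round(scores.mean(), 4), round(scores.std(), 4))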
Yes, CV and resampling methods exist to give you information about how well your model will perform on new data. So, after performing CV and resampling and obtaining this information, you will usually use all your data (both the train and the test partition!) to train the final model that you use in your application scenario.