Efficient inference of 3D deep learning model (pytorch) - memory

I am trying to use a PyTorch 3D UNet for inference (from here: https://github.com/wolny/pytorch-3dunet), which receives images of size (96, 96, 96). I would like to run it on CPU instances, but I am seeing very high memory usage (~18 GB). After researching the subject, I found out that this is due to the way convolutions are implemented on CPU (see https://discuss.pytorch.org/t/pytorch-high-memory-demand/2798/5). I thus have the following questions:
Is there a way to use a more memory-efficient implementation of the convolution in PyTorch?
How can I optimize my model for CPU inference? I saw that tools like AWS Neo, Intel OpenVINO, etc. exist; could they solve my problem?
Does TensorFlow have a similar problem with convolutions on CPU?
Any other tips or links on how to deploy such models efficiently are welcome!
Thanks!

You could benchmark your model's performance with DNN-Bench and choose the best inference engine for your application and your hardware. You might need to convert your model to ONNX first.
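For a rough sense of why the memory footprint in the question can grow so large, here is a back-of-the-envelope sketch. It assumes the CPU path lowers each convolution to a matrix multiply over an unfolded (im2col-style) copy of the input, which is one common implementation strategy. The channel count and kernel size are hypothetical, not taken from pytorch-3dunet.

```python
def im2col_buffer_bytes(in_channels, kernel_size, spatial_shape, bytes_per_element=4):
    """Size of the unfolded input for one 3D convolution (stride 1, 'same' padding).

    Assumption: the convolution is computed as a matrix multiply over an
    im2col-style buffer. Other implementations need far less scratch memory.
    """
    depth, height, width = spatial_shape
    rows = in_channels * kernel_size ** 3   # one row per (channel, kernel offset)
    cols = depth * height * width           # one column per output voxel
    return rows * cols * bytes_per_element


# Hypothetical layer: 64 input channels, 3x3x3 kernel, full-resolution 96^3 volume.
buffer_bytes = im2col_buffer_bytes(64, 3, (96, 96, 96))
print(f"unfolded buffer: {buffer_bytes / 2**30:.2f} GiB")
```

Under these assumptions a single full-resolution layer already needs several GiB of scratch space in float32, so measuring peak memory per engine is worth doing alongside the latency benchmark.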

Related

Multi GPU training for Transformers with different GPUs

I want to fine-tune a GPT-2 model using Huggingface’s Transformers, preferably the medium model, but the large one if possible. Currently, I have an RTX 2080 Ti with 11GB of memory and I can train the small model just fine.
My question is: will I run into any issues if I add an old Tesla K80 (24GB) to my machine and distribute the training? I cannot find information about using GPUs of different capacity during training or about the issues I could run into.
Will my model size limit essentially be the sum of all available GPU memory? (35GB?)
I’m not interested in doing this in AWS.
You already solved your problem. That's great. I would like to point out a different approach and address a few questions.
Will my model size limit essentially be the sum of all available GPU memory? (35GB?)
This depends on the training technique you use. Standard data parallelism replicates the model, the gradients and the optimizer states on every GPU, so each GPU must have enough memory to hold all of them. With GPUs of different capacity, the smallest one therefore sets the limit, not the sum. The data is split across the GPUs. However, the bottleneck is usually the model and the optimizer states, not the data.
A state-of-the-art approach to training is ZeRO (Zero Redundancy Optimizer). Not only the dataset, but also the model parameters, the gradients and the optimizer states are split across the GPUs. This allows you to train huge models without hitting OOM. The paper illustrates this with a figure: the baseline is the standard data-parallel case mentioned above, and the authors then progressively partition the optimizer states, the gradients and the model parameters across the GPUs and compare the memory usage per GPU.
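The per-GPU memory comparison can be sketched with the accounting used in the ZeRO paper. This is a rough estimate under the paper's assumptions (mixed-precision Adam, K = 12 bytes of optimizer state per parameter). It counts model states only, not activations or temporary buffers, and the function name is made up for illustration.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage, optimizer_bytes_per_param=12):
    """Model-state memory per GPU for data parallelism (stage 0) or ZeRO stages 1-3.

    Assumes fp16 weights (2 bytes/param), fp16 gradients (2 bytes/param) and
    K bytes/param of optimizer state, as in the ZeRO paper.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2 * num_params
    grads = 2 * num_params
    optim = optimizer_bytes_per_param * num_params
    if stage >= 1:      # stage 1: partition optimizer states
        optim /= num_gpus
    if stage >= 2:      # stage 2: also partition gradients
        grads /= num_gpus
    if stage >= 3:      # stage 3: also partition parameters
        params /= num_gpus
    return params + grads + optim


# The paper's running example: 7.5 billion parameters on 64 GPUs.
for stage in range(4):
    gb = model_state_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Stage 0 is the replicated baseline. Each further stage divides one more component by the number of GPUs, which is why the achievable model size grows with the GPU count under ZeRO but not under plain data parallelism.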
The authors of the paper created a library called DeepSpeed, and it is very easy to integrate with Huggingface. With it I was able to increase my model size from 260 million to 11 billion parameters :)
If you want to understand in detail how it works, here is the paper:
https://arxiv.org/pdf/1910.02054.pdf
More information on integrating DeepSpeed with Huggingface can be found here:
https://huggingface.co/docs/transformers/main_classes/deepspeed
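As a starting point, a minimal DeepSpeed configuration could look roughly like the sketch below, built as a Python dict and serialized to JSON. The choice of ZeRO stage 2 and the "auto" placeholders (which the Huggingface integration fills in from its own training arguments) are assumptions for illustration. Check the linked documentation for the options your versions support.

```python
import json

# Minimal sketch of a DeepSpeed config for use with the Huggingface Trainer.
# "auto" asks the Huggingface integration to copy the value from its own
# training arguments; the stage number is an illustrative choice.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
# Save this text as e.g. ds_config.json and pass the path to the Trainer via
# TrainingArguments(deepspeed="ds_config.json").
```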
PS: There is also the model parallelism technique, in which each GPU trains different layers of the model, but it has lost its popularity and is not being actively used.

Which is faster when deploying cnn models by TensorFlow Lite, Caffe2 or OpenCV?

We can deploy MobileNet on a smartphone with TensorFlow Lite, Caffe2 or OpenCV, and I think Caffe2 will provide the best performance, with higher fps. But why? Is the performance gap between them really that large? Thanks.
You should probably go for TensorFlow Lite. Last I looked, Caffe2 had almost zero smartphone GPU support, while TFLite now supports both iOS and many Android devices (all that support OpenGL ES >= 3.1). Using the GPU generally makes things several times faster, and with TFLite you can reduce the inference precision to half-float (FP16) for even more speed without much of an accuracy hit.
When you can't use the mobile GPU, you'll probably want to quantize your network to int8, which is easily doable with TensorFlow and TensorFlow Lite, whether during or after training. Caffe2 seems to need QNNPACK for quantization, which is claimed to be as much as 2 times faster. The catch is that it only works with two pre-trained models that they released (https://github.com/pytorch/QNNPACK/issues/12), so you can't convert your own model.
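To make the int8 idea concrete, here is a small, framework-free sketch of affine quantization: floats are mapped to 8-bit integers through a scale and a zero point, then mapped back. It only illustrates the arithmetic. The function names are made up, and this is not how TensorFlow Lite or QNNPACK implement it internally.

```python
def quantize_int8(values):
    """Affine-quantize a list of floats to int8; returns (q_values, scale, zero_point)."""
    lo = min(min(values), 0.0)          # include 0 so that 0.0 maps to an exact integer
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0      # fall back to 1.0 if all values are zero
    zero_point = round(-128 - lo / scale)
    q_values = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q_values, scale, zero_point


def dequantize_int8(q_values, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-0.8, 0.0, 0.3, 1.2]         # hypothetical weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
print(q, restored)
```

Each restored value differs from the original by a rounding error on the order of the scale. That error is the accuracy cost paid for 4x smaller weights and integer arithmetic.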
So I can't really think of a reason to use Caffe2 over TFLite.
I'm not sure about OpenCV's DNN module, but I seriously doubt it has mobile GPU support. There's a slight chance it has quantization.
Each framework introduces its own optimizations, so the results may differ significantly from device to device.

Speeding up inference of Keras models

I have a Keras model which is doing inference on a Raspberry Pi (with a camera). The Raspberry Pi has a really slow CPU (1.2 GHz) and no CUDA GPU, so the model.predict() stage takes a long time (~20 seconds). I'm looking for ways to reduce that as much as possible. I've tried:
Overclocking the CPU (+200 MHz), which saved a few seconds.
Using float16 instead of float32.
Reducing the image input size as much as possible.
Is there anything else I can do to increase the speed during inference? Is there a way to simplify a model.h5 and take a drop in accuracy? I've had success with simpler models, but for this project I need to rely on an existing model so I can't train from scratch.
The VGG16 / VGG19 architectures are very slow since they have lots of parameters. Check this answer.
Before any other optimization, try to use a simpler network architecture.
Google's MobileNet seems like a good candidate since it's implemented in Keras and was designed for more limited devices.
If you can't use a different network, you may compress the network with pruning. This blog post covers pruning with Keras specifically.
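As a rough illustration of what pruning does, the sketch below zeroes the smallest-magnitude weights of a flat weight list (magnitude pruning). It is a toy version with a made-up function name, not the procedure from the blog post. Real pipelines usually fine-tune after pruning to recover accuracy.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to 0.0."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in by_magnitude[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


layer_weights = [0.1, -0.5, 0.05, 0.9, -0.02, 0.3]   # hypothetical weights
print(magnitude_prune(layer_weights, sparsity=0.5))
```

Note that zeroed weights only save time if the runtime exploits the sparsity or the pruned channels are actually removed. Otherwise the model is the same size and just as slow.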
Maybe OpenVINO will help. OpenVINO is an open-source toolkit for network inference, and it optimizes the inference performance by, e.g., graph pruning and fusing some operations. The ARM support is provided by the contrib repository.
Here are the instructions on how to build an ARM plugin to run OpenVINO on Raspberry Pi.
Disclaimer: I work on OpenVINO.

How does the SHOGUN Toolbox convolutional neural network compare to Caffe and Theano?

I'm interested in implementing a convolutional neural network in my C++ program where I'm tracking tagged insects (I'm also using OpenCV). I see people mention Caffe, Torch and Theano a lot but I haven't heard the CNN in the SHOGUN Toolbox discussed. Does this CNN work well and would anyone recommend it if you're working in C++? I've used Theano via scikit-neuralnetwork in Python to test out some images and that worked really well, except unfortunately Theano is Python-only.
Shogun also has GPU support for some of the operations used in the NN code. This is work in progress though. At this point in time, other libraries might be faster. We mostly built these networks in there in order to be able to easily compare them to the other algorithms in the toolbox.
The advantage, however, is that you can use it from a large number of languages (while internally, C++ code is executed), which is useful if you don't want to use Python.
Here are some IPython notebooks that you could use as a basis to compare:
autoencoders for denoising and classification
(convolution) networks for digit classification
We appreciate any experience to be shared. Shogun is in constant development and especially the NNs attract a lot of people to work on them, so expect things to change. If you are interested in helping GPU-fying Shogun, please let us know.
The difference lies in speed. CNNs are computationally expensive, so a GPU implementation is at least 10 times faster than a CPU one. Caffe and Theano integrate CPU and GPU execution seamlessly, which may not be easy for you to implement yourself without much GPU programming experience.
Other factors may exist, including a unified interface for multilayer networks, stochastic gradient descent, etc., but I think speed is the most crucial of these.

Would it work and be faster if I call function in OpenCV GPU module in my kernel function?

OpenCV has a gpu module (GPU-accelerated Computer Vision, http://docs.opencv.org/modules/gpu/doc/gpu.html). Many of its functions already use GPU techniques, so I can call those functions directly. But I wonder whether it would be faster to write my own kernel and call functions from the OpenCV GPU module inside each kernel. This is for the case where I have many images: to handle each image, I would call an OpenCV function from the GPU module, so the parallelism would be nested (parallel within parallel).
Your question is not entirely clear to me, but I would like to say this: it's impossible to say which would be faster, unless somebody already implemented that same algorithm using the approach you have in mind, and then shared a report about the benchmark tests.
There's a number of factors involved:
It depends on the type of operation you are trying to implement: techniques that have a high arithmetic intensity are better fit for GPUs for sure, however, not all problems can be modeled for GPUs.
The size of the input images matters: the time spent sending data from RAM to the GPU might not pay off in the end, so running the algorithm on the CPU can be faster for small images.
The model/power of the CPU/GPU: if the computer has a really crappy GPU, then it's probably better to run the algorithms on the CPU.
What I'm saying is: don't assume OpenCV's GPU module will always run its algorithms faster than the CPU you've got. Test it, measure it! The only way to know for sure is through experimentation and benchmarking.
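A measurement harness along these lines might look like the sketch below. The two workloads are hypothetical placeholders written in plain Python. In a real test you would swap in your CPU routine and your GPU routine, and include the host-to-device and device-to-host copies in the timed region.

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, repeats=10):
    """Return the median wall-clock time (seconds) of fn(*args) over `repeats` runs."""
    for _ in range(warmup):             # warm caches before timing
        fn(*args)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Hypothetical stand-ins for the two implementations being compared.
def cpu_path(image):
    return [pixel * 2 for pixel in image]


def gpu_path(image):
    uploaded = list(image)              # placeholder for the RAM -> GPU copy
    result = [pixel * 2 for pixel in uploaded]
    return list(result)                 # placeholder for the GPU -> RAM copy


image = list(range(10_000))
print(f"cpu: {benchmark(cpu_path, image):.6f} s")
print(f"gpu: {benchmark(gpu_path, image):.6f} s")
```

When timing real GPU code, remember that kernel launches are asynchronous, so synchronize with the device before stopping the timer. Repeat the comparison for several image sizes, since the break-even point depends on the transfer cost.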
