Speeding up inference of Keras models - machine-learning

I have a Keras model which is doing inference on a Raspberry Pi (with a camera). The Raspberry Pi has a really slow CPU (1.2 GHz) and no CUDA GPU, so the model.predict() stage is taking a long time (~20 seconds). I'm looking for ways to reduce that by as much as possible. I've tried:
Overclocking the CPU (+200 MHz), which bought a few extra seconds.
Using float16 instead of float32.
Reducing the image input size as much as possible.
Is there anything else I can do to increase the speed during inference? Is there a way to simplify a model.h5 and take a drop in accuracy? I've had success with simpler models, but for this project I need to rely on an existing model, so I can't train from scratch.

The VGG16 / VGG19 architectures are very slow since they have a huge number of parameters. Check this answer.
Before any other optimization, try to use a simpler network architecture.
Google's MobileNet seems like a good candidate, since it is available in Keras and was designed for more limited devices (a loading sketch follows at the end of this answer).
If you can't use a different network, you may compress your network with pruning. This blog post specifically does pruning with Keras.
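As a concrete illustration, here is a minimal sketch of loading the pre-trained MobileNet bundled with Keras (keras.applications) and running one prediction. The ImageNet weights, the 224x224 input size, and the random frame are placeholder assumptions; adapt them to your own task. MobileNet's alpha width-multiplier argument (e.g. alpha=0.5) can additionally trade a bit of accuracy for extra speed.

    # Minimal sketch: load the MobileNet bundled with Keras and run one prediction.
    # ImageNet weights and the 224x224 input size are assumptions; replace the
    # random frame with a real camera capture.
    import numpy as np
    import tensorflow as tf

    model = tf.keras.applications.MobileNet(weights="imagenet", input_shape=(224, 224, 3))

    frame = np.random.rand(1, 224, 224, 3).astype("float32") * 255.0   # dummy camera frame
    frame = tf.keras.applications.mobilenet.preprocess_input(frame)

    preds = model.predict(frame)
    print(tf.keras.applications.mobilenet.decode_predictions(preds, top=3))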

Maybe OpenVINO will help. OpenVINO is an open-source toolkit for network inference, and it optimizes inference performance through, for example, graph pruning and fusing of some operations. ARM support is provided by the contrib repository.
Here are the instructions on how to build an ARM plugin to run OpenVINO on Raspberry Pi.
Disclaimer: I work on OpenVINO.
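For reference, once a model has been converted to OpenVINO's IR format, running it from Python takes only a few lines. This is only a sketch: the file names are placeholders, the input shape is made up, and it assumes the OpenVINO runtime (with the ARM plugin from the contrib repository) is installed on the device.

    # Rough sketch of OpenVINO inference from Python. "model.xml"/"model.bin"
    # are placeholder names for an already-converted IR model; the input shape
    # below is a dummy assumption.
    import numpy as np
    from openvino.runtime import Core

    core = Core()
    model = core.read_model("model.xml")         # picks up model.bin alongside
    compiled = core.compile_model(model, "CPU")  # ARM CPU via the contrib plugin

    dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
    results = compiled([dummy_input])            # mapping from outputs to arrays
    print(list(results.values())[0].shape)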

Related

Efficient inference of 3D deep learning model (pytorch)

I am trying to use a PyTorch 3D UNet for inference (from here: https://github.com/wolny/pytorch-3dunet), which receives images of size (96, 96, 96). I would like to use it on CPU instances, but I am getting very high memory usage (~18 GB). After researching the subject, I found out that this is due to the way convolutions are implemented on CPU (see https://discuss.pytorch.org/t/pytorch-high-memory-demand/2798/5). I thus have the following questions:
Is there a way to use a more memory-efficient implementation of the convolution in PyTorch?
How can I optimize my model for CPU inference? I saw that some tools like AWS Neo, Intel OpenVINO, etc. exist; could they solve my problem?
Does Tensorflow have a similar problem for using convolutions on CPU?
Any other tips or links on how to deploy such models efficiently are welcome!
Thanks!
You could benchmark your model's performance with DNN-Bench and choose the best inference engine for your application and your hardware. You might need to convert your model to ONNX first.
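For the ONNX step, a hedged sketch of the usual route is below: export the PyTorch model with torch.onnx.export, then run the exported graph on CPU with ONNX Runtime. The tiny stand-in network, the file name, and the (1, 1, 96, 96, 96) input shape are assumptions standing in for the actual 3D UNet.

    # Sketch: export a PyTorch model to ONNX and run it on CPU with ONNX Runtime.
    # The stand-in network and file name are placeholders for the real 3D UNet.
    import torch
    import torch.nn as nn
    import onnxruntime as ort

    model = nn.Sequential(                      # replace with the loaded 3D UNet
        nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv3d(8, 1, 3, padding=1)
    )
    model.eval()

    dummy = torch.randn(1, 1, 96, 96, 96)       # (batch, channels, D, H, W)
    torch.onnx.export(model, dummy, "unet3d.onnx",
                      input_names=["input"], output_names=["output"],
                      opset_version=11)

    sess = ort.InferenceSession("unet3d.onnx", providers=["CPUExecutionProvider"])
    out = sess.run(None, {"input": dummy.numpy()})[0]
    print(out.shape)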

Which is faster when deploying cnn models by TensorFlow Lite, Caffe2 or OpenCV?

We can deploy MobileNet on a smartphone with TensorFlow Lite, Caffe2, or OpenCV, and I think Caffe2 will provide the best performance with higher FPS. But why? Is the performance gap between them really that large? Thanks.
You should probably go for TensorFlow Lite. Last I looked, Caffe2 had almost zero smartphone GPU support, while TFLite now supports both iOS and many Android devices (all that have OpenGL ES >= 3.1). Using the GPU generally makes things several times faster, and with TFLite you can reduce the inference precision to half-float (FP16) for even more speed without much of an accuracy hit.
When you can't use the mobile GPU, you'll probably want to quantize your network to int8, which is easily doable with TensorFlow and TensorFlow Lite, whether during or after training (a post-training quantization sketch follows at the end of this answer). Caffe2 seems to need QNNPACK for quantization, which is claimed to be as much as 2 times faster. The catch is that it only works with the two pre-trained models they released (https://github.com/pytorch/QNNPACK/issues/12), so you can't convert your own model.
So I can't really think of a reason to use Caffe2 over TFLite.
I'm not sure about OpenCV's DNN module, but I seriously doubt it has mobile GPU support. There's a slight chance it has quantization.
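As a sketch of the int8 route mentioned above, TensorFlow Lite's post-training quantization looks roughly like this; the placeholder model and random calibration images are assumptions, so substitute your own trained network and a small set of real inputs.

    # Sketch of TensorFlow Lite post-training full-integer quantization.
    # The placeholder model and random calibration images are assumptions.
    import numpy as np
    import tensorflow as tf

    keras_model = tf.keras.applications.MobileNetV2(weights=None)       # stand-in model
    calibration_images = np.random.rand(100, 224, 224, 3).astype(np.float32)

    def representative_dataset():
        for img in calibration_images:
            yield [img[None, ...]]              # one sample per step, with batch dim

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

    with open("model_int8.tflite", "wb") as f:
        f.write(converter.convert())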
Each framework introduces its own optimizations, so the results may differ significantly across devices.

If I train a neural network with CUDA do I need to run the outputted algorithm with CUDA?

Let's say I used CUDA to train an object tracking program. Could I then put that program on another computer that doesn't have a powerful GPU and run the object tracking program there? Or is GPU support required to run the resulting algorithm as well as to train it?
No, it does not matter how you trained your model. You can execute it in a completely different environment, using a CPU, a GPU, the cloud, or whatever you want. Since inference is usually much cheaper than training, you will usually need much less powerful hardware.
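As an illustration (using Keras, though the same holds for any framework): a model trained on a CUDA machine and saved to disk loads and runs unchanged on a CPU-only box. The file name is hypothetical, and hiding the GPUs via CUDA_VISIBLE_DEVICES is only needed if you want to force CPU execution on a machine that does have one.

    # Sketch: run a GPU-trained Keras model on a CPU-only machine.
    # "tracker_model.h5" is a hypothetical file name.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # make TensorFlow ignore any GPUs

    import numpy as np
    import tensorflow as tf

    model = tf.keras.models.load_model("tracker_model.h5")
    frame = np.random.rand(1, 224, 224, 3).astype("float32")   # dummy input
    print(model.predict(frame))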

Neural networks: GPU vs no GPU

I need to train a recurrent neural network as a language model, and I decided to use Keras with the Theano backend for that. Is it better to use an ordinary PC with some graphics card instead of a "cool" server machine that can't do GPU computing? Is there a boundary (given perhaps by the architecture of the NN and the amount of training data) that separates "CPU-learnable" problems from those that can be done (in reasonable time) only by utilizing a GPU?
(I have access to an older production server in the company I work at. It has 16 cores and about 49 GB of available RAM, so I thought I was ready for training. Now I am reading about the GPU optimizations Theano does, and I am thinking I am basically screwed without one.)
Edit
I have just come across this article, where Tomáš Mikolov states they managed to train a single-layer recurrent neural network with 1024 states in 10 days while using only 24 CPUs and no GPU.
Is there a boundary
One boundary that separates CPU from GPU is memory access. If you access the values of your neural network often, the CPU does better, since it has faster access to RAM. If I'm not wrong, applying the updates (SGD, RMSProp, Adagrad, etc.) requires that those values be accessed.
A GPU is advisable when the amount of computation outweighs memory access, e.g. when training a deep neural network (see the timing sketch at the end of this answer).
that can be done (in reasonable time) only by utilizing gpu
Unfortunately, if you are trying to solve such a hard problem, Theano would be a bad choice, as it constrains you to running on a single machine. Try frameworks that allow running on multiple CPUs and GPUs across machines, such as Microsoft CNTK or Google TensorFlow.
thinking I am basically screwed
The difference (it may be a speed-up or a slow-down) won't be that big, depending on the neural network. Plus, running the neural network computation on your own machine can get in the way of your work. So you are probably better off using that spare server and making it useful.
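To get a feel for which side of that boundary your own model falls on, a rough timing sketch like the one below (modern TF/Keras rather than the Theano backend; the layer sizes and synthetic data are arbitrary placeholders) compares one training epoch on CPU and, if present, on GPU.

    # Rough benchmark sketch: time one training epoch of a small recurrent model
    # on CPU and (if available) GPU. Layer sizes and synthetic data are arbitrary.
    import time
    import numpy as np
    import tensorflow as tf

    def build_model():
        return tf.keras.Sequential([
            tf.keras.layers.Embedding(10000, 128),
            tf.keras.layers.LSTM(256),
            tf.keras.layers.Dense(10000, activation="softmax"),
        ])

    x = np.random.randint(0, 10000, size=(1024, 50))    # fake token sequences
    y = np.random.randint(0, 10000, size=(1024,))       # fake next-word targets

    devices = ["/CPU:0"] + (["/GPU:0"] if tf.config.list_physical_devices("GPU") else [])
    for device in devices:
        with tf.device(device):
            model = build_model()
            model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
            start = time.time()
            model.fit(x, y, batch_size=64, epochs=1, verbose=0)
            print(device, "1 epoch:", round(time.time() - start, 1), "s")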

How does the SHOGUN Toolbox convolutional neural network compare to Caffe and Theano?

I'm interested in implementing a convolutional neural network in my C++ program where I'm tracking tagged insects (I'm also using OpenCV). I see people mention Caffe, Torch and Theano a lot but I haven't heard the CNN in the SHOGUN Toolbox discussed. Does this CNN work well and would anyone recommend it if you're working in C++? I've used Theano via scikit-neuralnetwork in Python to test out some images and that worked really well, except unfortunately Theano is Python-only.
Shogun also has GPU support for some of the operations used in the NN code. This is work in progress, though, and at this point other libraries might be faster. We mostly built these networks into the toolbox in order to be able to easily compare them with the other algorithms it contains.
The advantage, however, is that you can use it from a large number of languages (while internally, C++ code is executed) -- useful if you don't want to use Python.
Here are some IPython notebooks that you could use as a basis to compare:
autoencoders for denoising and classification
(convolution) networks for digit classification
We appreciate any shared experience. Shogun is in constant development, and the NNs in particular attract a lot of people to work on them, so expect things to change. If you are interested in helping to GPU-ify Shogun, please let us know.
The difference lies in speed. CNNs are computationally expensive, so a GPU implementation is at least 10 times faster than a CPU one. Caffe and Theano provide seamless switching between CPU and GPU execution, which may not be easy for you to implement without much GPU programming experience.
Other factors exist, including a unified interface for multilayer networks, stochastic gradient descent, and so on, but I think the speed issue is the most crucial among them.
