I saw MTCNN being recommended but haven't seen a direct comparison of DLIB and MTCNN.
I assume that since MTCNN uses a neural network it might work better for more use cases, but also have some surprisingly horrible edge cases?
Has anyone done an analysis of error rate, performance under different conditions (GPU and CPU), and general eyeball observations of the two?
You can have a look at this amazing Kaggle notebook by timesler. It compares facenet-pytorch, DLIB, and MTCNN.
https://www.kaggle.com/timesler/comparison-of-face-detection-packages
"Each package is tested for its speed in detecting the faces in a set of 300 images (all frames from one video), with GPU support enabled. Detection is performed at 3 different resolutions.
Any one-off initialization steps, such as model instantiation, are performed prior to performance testing."
You can test this easily within deepface. My experiments show that mtcnn outperforms dlib.
#!pip install deepface
from deepface import DeepFace
# available detector backends: backends[2] is dlib, backends[3] is mtcnn
backends = ['opencv', 'ssd', 'dlib', 'mtcnn']
# detect and align the face with the chosen backend
DeepFace.detectFace("img.jpg", detector_backend = backends[0])
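To compare the two directly, a minimal sketch (assuming an img.jpg in the working directory and that detectFace returns the cropped face as a numpy array, as above) just times each backend:
import time
from deepface import DeepFace

for backend in ['dlib', 'mtcnn']:
    start = time.time()
    face = DeepFace.detectFace("img.jpg", detector_backend = backend)
    print(backend, "took", round(time.time() - start, 2), "s, face shape:", face.shape)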
I am trying to use a Pytorch 3D UNet for inference (from here: https://github.com/wolny/pytorch-3dunet) which receives images of size (96, 96, 96). I would like to use it on CPU instances, but I am getting very high memory usage (~18 GB). After researching the subject I found out that this was due to the way convolutions are implemented on CPU (see https://discuss.pytorch.org/t/pytorch-high-memory-demand/2798/5). I thus have the following questions:
Is there a way to use a more memory-efficient implementation of the convolution in Pytorch?
How can I optimize my model for CPU inference? I saw that some tools like AWS Neo, Intel OpenVINO, etc. exist; could they solve my problem?
Does Tensorflow have a similar problem for using convolutions on CPU?
Any other tips or links on how to deploy such models efficiently are welcome!
Thanks!
You could benchmark your model's performance with DNN-Bench and choose the best inference engine for your application and your hardware. You might need to convert your model to ONNX first.
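For the ONNX conversion step, a rough sketch with torch.onnx.export (the small Conv3d stack below is only a stand-in for the real pytorch-3dunet model; the file name and opset are arbitrary choices):
import torch
import torch.nn as nn

# stand-in for the loaded pytorch-3dunet model; swap in the real network here
model = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv3d(8, 1, 3, padding=1))
model.eval()

dummy_input = torch.randn(1, 1, 96, 96, 96)  # (batch, channels, depth, height, width)
torch.onnx.export(model, dummy_input, "unet3d.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)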
We can deploy MobileNet on a smartphone with TensorFlow Lite, Caffe2, or OpenCV, and I think Caffe2 will provide the best performance with higher fps. But why? Is the performance gap between them really that large? Thanks.
You should probably go for TensorFlow Lite. Last I looked, Caffe2 had almost zero smartphone GPU support, while TFLite now supports both iOS and many Android devices (all that have OpenGL ES >= 3.1). Using the GPU generally makes things several times faster, and you can reduce the inference precision to half-float (FP16) with TFLite for even more speed without much of an accuracy hit.
When you can't use the mobile GPU, you'll probably want to quantize your network to int8, which is easily doable with TensorFlow and TensorFlow Lite, whether during or after training. Caffe2 seems to need QNNPACK for quantization, which is claimed to be as much as 2 times faster. The catch is that it only works with two pre-trained models that they released (https://github.com/pytorch/QNNPACK/issues/12), so you can't convert your own model.
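As an illustration of how little code post-training quantization takes with TFLite, a rough sketch (the MobileNetV2 here is just a stand-in for your own Keras model; dropping the supported_types line gives dynamic-range int8 weights instead of FP16):
import tensorflow as tf

# stand-in for your trained Keras model
keras_model = tf.keras.applications.MobileNetV2(weights=None)

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # store weights as FP16
tflite_model = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_model)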
So I can't really think of a reason to use Caffe2 over TFLite.
I'm not sure about OpenCV's DNN module, but I seriously doubt it has mobile GPU support. There's a slight chance it has quantization.
Each framework introduces its own optimizations, so the results may differ significantly between devices.
I have a Keras model which is doing inference on a Raspberry Pi (with a camera). The Raspberry Pi has a really slow CPU (1.2 GHz) and no CUDA GPU, so the model.predict() stage is taking a long time (~20 seconds). I'm looking for ways to reduce that by as much as possible. I've tried:
Overclocking the CPU (+200 MHz), which bought a few extra seconds.
Using float16 instead of float32.
Reducing the image input size as much as possible.
Is there anything else I can do to increase the speed during inference? Is there a way to simplify the model (.h5) and accept a drop in accuracy? I've had success with simpler models, but for this project I need to rely on an existing model, so I can't train from scratch.
The VGG16/VGG19 architectures are very slow since they have lots of parameters. Check this answer.
Before any other optimization, try to use a simpler network architecture.
Google's MobileNet seems like a good candidate since it's implemented on Keras and it was designed for more limited devices.
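A rough sketch of pulling a small MobileNet straight from keras.applications (the alpha width multiplier and the 128x128 input size are just example settings for a constrained device):
from tensorflow.keras.applications import MobileNet

# smallest standard variant: width multiplier 0.25 with 128x128 inputs
model = MobileNet(input_shape=(128, 128, 3), alpha=0.25, weights="imagenet")
model.summary()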
If you can't use a different network, you may compress the network with pruning. This blog post specifically covers pruning with Keras.
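The linked post has its own recipe; as one concrete possibility, the tensorflow_model_optimization package does magnitude pruning of Keras models, roughly like this (the tiny Dense model and the 50% sparsity are arbitrary placeholders):
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# stand-in for the model loaded from model.h5
base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# zero out the 50% smallest weights during fine-tuning
pruned = tfmot.sparsity.keras.prune_low_magnitude(
    base_model,
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.5, begin_step=0),
)
pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# fine-tune with callbacks=[tfmot.sparsity.keras.UpdatePruningStep()], then
# strip the pruning wrappers before exporting the smaller model
final_model = tfmot.sparsity.keras.strip_pruning(pruned)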
Maybe OpenVINO will help. OpenVINO is an open-source toolkit for network inference, and it optimizes the inference performance by, e.g., graph pruning and fusing some operations. The ARM support is provided by the contrib repository.
Here are the instructions on how to build an ARM plugin to run OpenVINO on Raspberry Pi.
Disclaimer: I work on OpenVINO.
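Once the model is converted to OpenVINO IR (or ONNX), inference with the Python API looks roughly like this (the model path and input shape are placeholders; I'm assuming the openvino.runtime API of recent releases):
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")         # placeholder path to the converted IR / ONNX file
compiled = core.compile_model(model, "CPU")  # target device

input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input
result = compiled([input_data])[compiled.output(0)]
print(result.shape)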
I am exploring some of the deep learning libraries including Chainer, Torch, TensorFlow, and Theano.
I was formerly a Chainer user, and I find that Theano and TensorFlow offer great flexibility and seem to have good future potential.
However, what keeps me from moving to Theano or TensorFlow is the memory issue. Is there an option to make Theano or TensorFlow not keep the computation history? In Chainer this can be done by setting the volatile flag, so that I can evaluate large data with less memory because it does not keep unnecessary data (which is only needed when calculating gradients).
I am primarily working with RNNs, and the typical approach to training RNNs is truncated BPTT. However, I found that it is useful and slightly more accurate to feed the full sequence to the network when I only want forward computation, not backpropagation.
I tried to find this option in the documentation of both frameworks, but I couldn't find it. Is there a reason this feature cannot be implemented?
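For reference, what I mean in Chainer is roughly the following (newer Chainer versions replace the volatile flag with no_backprop_mode; the Linear link is just a placeholder for my RNN):
import numpy as np
import chainer
import chainer.links as L

model = L.Linear(100, 10)                       # placeholder for the real RNN
x_data = np.zeros((32, 100), dtype=np.float32)  # placeholder batch

# forward pass without storing the computation graph, so memory stays low
with chainer.no_backprop_mode():
    y = model(x_data)
print(y.shape)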
I'm interested in implementing a convolutional neural network in my C++ program where I'm tracking tagged insects (I'm also using OpenCV). I see people mention Caffe, Torch and Theano a lot but I haven't heard the CNN in the SHOGUN Toolbox discussed. Does this CNN work well and would anyone recommend it if you're working in C++? I've used Theano via scikit-neuralnetwork in Python to test out some images and that worked really well, except unfortunately Theano is Python-only.
Shogun also has GPU support for some of the operations used in the NN code, though this is a work in progress, and at this point other libraries might be faster. We mostly built these networks in order to be able to easily compare them to the other algorithms in the toolbox.
The advantage, however, is that you can use it from a large number of languages (while internally, C++ code is executed) -- useful if you don't want to use Python.
Here are some IPython notebooks that you could use as a basis to compare:
autoencoders for denoising and classification
(convolution) networks for digit classification
We appreciate any experience being shared. Shogun is under constant development, and the NNs in particular attract a lot of contributors, so expect things to change. If you are interested in helping GPU-ify Shogun, please let us know.
The difference lies in speed. CNNs are computationally expensive, so a GPU implementation is at least 10 times faster than a CPU one. Caffe and Theano provide seamless integration for running on either CPU or GPU, which may not be easy for you to implement yourself without much GPU programming experience.
Other factors exist as well, including a unified interface for multilayer networks, stochastic gradient descent, etc., but I think speed is the most crucial of them.