Why is opencv dnn slower if I use Halide? - opencv

I am testing the performance of some samples in the OpenCV source tree with and without Halide.
Surprisingly, performance is worse when Halide is used for the computation:
squeezenet_halide: ~24ms with halide and ~16ms without halide.
resnet_ssd_face: ~84ms with halide and ~36ms without halide.
I compiled Halide and OpenCV following the instructions in this tutorial. The OpenCV code was downloaded from the master branch of the OpenCV git repository.
I tested the performance using the sample files 'resnet_ssd_face.cpp' and 'squeezenet_halide.cpp'. In both cases I add one of these lines just before the call to 'forward' to enable or disable Halide:
net.setPreferableBackend(DNN_BACKEND_HALIDE); // use Halide
net.setPreferableBackend(DNN_BACKEND_DEFAULT); // do NOT use Halide
The time is measured with this code just after the call to 'forward':
std::vector<double> layersTimings;
double freq = cv::getTickFrequency() / 1000;
double time = net.getPerfProfile(layersTimings) / freq;
std::cout << "Time: " << time << " ms" << std::endl;
Is there anything missed in the tutorial? Should Halide be compiled with different parameters?
My setup is:
OS: Linux (Ubuntu 16.04)
CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
GPU: nVidia GeForce GT 730 (Driver Version: 384.90)
Cuda: CUDA Version 9.0.176

Taking into account the comment by Dmitry Kurtaev and looking at the wiki in the OpenCV GitHub account, I found a page that includes a benchmark comparing the different approaches (I had missed the links in the tutorial).
There is also a merge request that includes a similar benchmark.
In both, the timings show that performance with Halide is worse than with the original C++ implementation.
I assume the Halide integration is at an early stage. Moreover, as Zalman Stern comments, Halide scheduling is a work in progress, and the original optimizations in OpenCV's dnn module may be better tuned than the schedules included for Halide.
I hope these measurements improve in future versions of OpenCV, but for now, this is the performance.
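When comparing backends like this, it can help to time several runs after a few untimed warm-up calls, so that any one-off setup cost does not dominate a single measurement. Whether that affects the numbers in this thread is an assumption, not something established here. A minimal, dependency-free sketch (the `benchmark` helper and the stand-in workload are hypothetical; with OpenCV the callable would wrap `net.forward()`):

```python
import statistics
import time


def benchmark(run_once, warmup=3, repeats=20):
    """Return (median_ms, min_ms) of run_once() after `warmup` untimed calls."""
    for _ in range(warmup):
        run_once()  # untimed: absorbs lazy initialisation and cache effects
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples_ms), min(samples_ms)


# Stand-in workload (hypothetical); replace with e.g. `lambda: net.forward()`.
median_ms, best_ms = benchmark(lambda: sum(range(10_000)))
print(f"median {median_ms:.3f} ms, best {best_ms:.3f} ms")
```

Reporting the median rather than a single run makes the Halide and default-backend figures easier to compare, since an occasional slow run no longer skews the result.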

My answer is slightly off-topic but may be helpful.
For face detection + face alignment:
Normal SSD detection time: 50 - 55 ms
Using the OpenVINO inference engine: 40 - 45 ms

Related

Use .tflite with iOS and GPU

I have created a new tflite model based on MobilenetV2. It works well without quantization on the CPU on iOS. I should say that the TensorFlow team did a great job, many thanks.
Unfortunately there is a problem with latency. I use an iPhone 5s to test my model, and I get the following results on the CPU:
500ms for MobilenetV2 with a 224*224 input image.
250-300ms for MobilenetV2 with a 160*160 input image.
I used the following pod: 'TensorFlowLite', '~> 1.13.1'
That is not fast enough, so I read the TF documentation on optimization (post-training quantization). I suppose I need to use Float16 or UInt8 quantization and the GPU delegate (see https://www.tensorflow.org/lite/performance/post_training_quantization).
I used Tensorflow v2.1.0 to train and quantize my models.
Float16 quantization of weights (I used MobilenetV2 model after Float16 quantization)
https://github.com/tensorflow/examples/tree/master/lite/examples/image_segmentation/ios
pod 'TensorFlowLiteSwift', '0.0.1-nightly'
No errors, but the model doesn't work
pod 'TensorFlowLiteSwift', '2.1.0'
2020-05-01 21:36:13.578369+0300 TFL Segmentation[6367:330410] Initialized TensorFlow Lite runtime.
2020-05-01 21:36:20.877393+0300 TFL Segmentation[6367:330397] Execution of the command buffer was aborted due to an error during execution. Caused GPU Hang Error (IOAF code 3)
Full integer quantization of weights and activations
pod 'TensorFlowLiteGpuExperimental'
Code sample: https://github.com/makeml-app/MakeML-Nails/tree/master/Segmentation%20Nails
I used a MobilenetV2 model after uint8 quantization.
GpuDelegateOptions options;
options.allow_precision_loss = true;
options.wait_type = GpuDelegateOptions::WaitType::kActive;
//delegate = NewGpuDelegate(nullptr);
delegate = NewGpuDelegate(&options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk)
Segmentation Live[6411:331887] [DYMTLInitPlatform] platform initialization successful
Loaded model 1
resolved reporter
Didn't find op for builtin opcode 'PAD' version '2'
Is it possible to use a quantized MobilenetV2 model on iOS somehow? Hopefully I just made a mistake :) and it is possible.
Best regards,
Dmitriy
This is a link to the GitHub issue with answers: https://github.com/tensorflow/tensorflow/issues/39101
Sorry for the outdated documentation - the GPU delegate should be included in TensorFlowLiteSwift 2.1.0. However, it looks like you're using the C API, so depending on TensorFlowLiteC would be sufficient.
MobileNetV2 does work with the TFLite runtime on iOS, and if I recall correctly it doesn't have a PAD op. Can you attach your model file? With the information provided it's a bit hard to see what's causing the error. As a sanity check, you can get quantized and non-quantized versions of MobileNetV2 from here: https://www.tensorflow.org/lite/guide/hosted_models
For the int8 quantized model - as far as I know, the GPU delegate only works with FP32 and (possibly) FP16 inputs.

What's the right way of using OpenCV with OpenVINO?

Disclaimer: I have never used OpenCV or OpenVINO, or for that matter anything even close to ML, before. However, I've been racking my brain studying neural networks (reading material online) because I have to work with Intel's OpenVINO on an edge device.
Here's what the official documentation says about using OpenCV with OpenVINO (using OpenVINO's inference engine with OpenCV):
-> Optimize the pretrained model with OpenVINO's Model Optimizer (creating the IR file pair)
-> Use these IR files with OpenCV's dnn.readNet() // is this where the inference engine gets set?
https://docs.openvinotoolkit.org/latest/_docs_install_guides_installing_openvino_raspbian.html
I dug further and found a third-party reference. It takes a different approach:
-> Intermediate files (bin/xml) are not created. Instead the Caffe model file is used.
-> The inference engine is set explicitly with the following line:
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
https://www.learnopencv.com/using-openvino-with-opencv/
Now I know that to use OpenCV here we have to use its inference engine with pretrained models. I want to know which of the two approaches is the correct (or preferred) one, or whether I'm missing something.
You can get started using OpenVino from: https://docs.openvinotoolkit.org/latest/_docs_install_guides_installing_openvino_windows.html
You need a set of prerequisites to run your sample. OpenCV is the computer vision package, which can be used for image processing.
OpenVINO inference requires you to convert your trained model (.caffemodel, .pb, etc.) to Intermediate Representation (.xml, .bin) files.
For a better understanding and sample demos on OpenVino, watch the videos/subscribe to the OpenVino Youtube channel: https://www.youtube.com/channel/UCkN8KINLvP1rMkL4trkNgTg
If the topology you are using is supported by OpenVINO, the best option is to use the OpenCV build that ships with OpenVINO. For that you need to:
1. Initialize the OpenVINO environment by running setupvars.bat in your OpenVINO path (C:\Program Files (x86)\IntelSWTools\openvino\bin)
2. Generate the IR files (xml & bin) for your model using the Model Optimizer.
3. Run it using the inference engine samples in the path /inference_engine_samples_build/
If the topology is not supported, you can use the other procedure you mentioned.
The most common issues I ran into:
setupvars.bat must be run in the same terminal session as your program; alternatively, set the variables from Python with os.environ["varname"] = varvalue
OpenCV needs to be built with support for the inference engine (i.e. DLDT). There are pre-built binaries here: https://github.com/opencv/opencv/wiki/Intel%27s-Deep-Learning-Inference-Engine-backend
Target inference engine: net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
Target NCS2: net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
The OpenCV pre-built binary located in the OpenVino directory already has IE support and is also an option.
Note that the Neural Compute Stick 2, AKA NCS2 (OpenVINO IE/VPU/MYRIAD), requires models in FP16 format (float16). Also try to keep your image in this format to avoid conversion penalties. You can input images in any of these formats though: FP32, FP16, U8
I found this guide helpful: https://learnopencv.com/using-openvino-with-opencv/
Here's an example targeting the NCS2 from https://medium.com/sclable/intel-openvino-with-opencv-f5ad03363a38:
import cv2

# ARCH_FPATH, MODEL_FPATH, IMG_FPATH and CONF_THRESH must be defined beforehand.

# Load the model.
net = cv2.dnn.readNet(ARCH_FPATH, MODEL_FPATH)
# Specify target device.
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)  # NCS 2
# Read an image.
print("Processing input image...")
img = cv2.imread(IMG_FPATH)
if img is None:
    raise Exception(f'Image not found here: {IMG_FPATH}')
# Prepare input blob and perform inference
blob = cv2.dnn.blobFromImage(img, size=(672, 384), ddepth=cv2.CV_8U)
net.setInput(blob)
out = net.forward()
# Draw detected faces; box coordinates are normalised, so scale by the image size
for detect in out.reshape(-1, 7):
    conf = float(detect[2])
    xmin = int(detect[3] * img.shape[1])
    ymin = int(detect[4] * img.shape[0])
    xmax = int(detect[5] * img.shape[1])
    ymax = int(detect[6] * img.shape[0])
    if conf > CONF_THRESH:
        cv2.rectangle(img, (xmin, ymin), (xmax, ymax), color=(0, 255, 0))
There are more samples here (jupyter notebook/python): https://github.com/sclable/openvino_opencv
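The drawing loop in the example indexes each detection as a row of 7 values, with the confidence at index 2 and normalised box corners at indices 3 to 6. As a sketch of just that decoding step, here is a dependency-free version that runs on made-up rows instead of a real network output (the `decode_detections` helper and the sample values are hypothetical; with OpenCV the rows would come from `out.reshape(-1, 7)`):

```python
def decode_detections(rows, img_w, img_h, conf_thresh=0.5):
    """Convert rows of [image_id, label, conf, xmin, ymin, xmax, ymax]
    (coordinates normalised to 0..1) into pixel boxes above the threshold."""
    boxes = []
    for _image_id, _label, conf, x0, y0, x1, y1 in rows:
        if conf <= conf_thresh:
            continue  # skip weak detections before doing any scaling
        boxes.append((int(x0 * img_w), int(y0 * img_h),
                      int(x1 * img_w), int(y1 * img_h), conf))
    return boxes


# Made-up output: one confident detection and one weak one.
fake_rows = [
    [0, 1, 0.9, 0.25, 0.5, 0.75, 1.0],
    [0, 1, 0.1, 0.0, 0.0, 0.5, 0.5],
]
boxes = decode_detections(fake_rows, img_w=640, img_h=480)
print(boxes)  # [(160, 240, 480, 480, 0.9)]
```

Keeping the decoding separate from the drawing makes it easy to check the box arithmetic without a device attached.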

DL4J is super slow on GoogleNews-vectors file

I tried to execute the following example on DL4J (loading pre-trained vectors file):
File gModel = new File("./GoogleNews-vectors-negative300.bin.gz");
Word2Vec vec = WordVectorSerializer.loadGoogleModel(gModel, true);
InputStreamReader r = new InputStreamReader(System.in);
BufferedReader br = new BufferedReader(r);
for (; ; ) {
    System.out.print("Word: ");
    String word = br.readLine();
    if ("EXIT".equals(word)) break;
    Collection<String> lst = vec.wordsNearest(word, 20);
    System.out.println(word + " -> " + lst);
}
But it is super slow (taking ~10 minutes to calculate the nearest words, though they are correct).
There is enough memory (-Xms20g -Xmx20g).
When I run the same Word2Vec example from https://code.google.com/p/word2vec/
it gives the nearest words very quickly.
DL4J uses ND4J which claims to be twice as fast as Numpy: http://nd4j.org/benchmarking
Is there anything wrong with my code?
UPDATE: It is based on https://github.com/deeplearning4j/dl4j-0.4-examples.git (I didn't touch any dependencies, just tried to read the Google pre-trained vectors file). Word2VecRawTextExample works just fine (but the data size is relatively small).
To improve performance, I suggest the following:
Set the environment variable OMP_NUM_THREADS to the number of your logical cores
Install the Intel Math Kernel Library if you use Intel processors
Add the directory containing mkl_intel_thread.dll from the Intel Math Kernel Library to your PATH
This post is really old, and things should have improved a lot by now. I have run DL4J with a Word2Vec model in production with the following JVM-level settings, and it works on a t2.large box and up, with 8 GB of RAM or more:
java -Xmx2G -Dorg.bytedeco.javacpp.maxbytes=6G -Dorg.bytedeco.javacpp.maxphysicalbytes=6G
Also, I have not used the wordsNearest() method because it comes with the restriction that the corpus embeddings must be pre-computed; instead I wrote my own cosine similarity, which responds in under a millisecond.
The blog post about that is here:
https://medium.com/sumvit/building-text-similarity-system-from-ground-up-using-word2vec-and-deeplearning4j-dece9ae4e433
in case you want to know how to build nearest-word search or any other application such as text similarity (same basic principle).
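To illustrate the cosine-similarity idea mentioned above, here is a small sketch. It is not the implementation from the blog post, and it is written in Python rather than Java; the toy vectors and the `normalize` and `nearest` helpers are made up for the example. The point it shows is that if every vector is normalised once up front, each comparison reduces to a plain dot product:

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest(word, unit_vectors, top_n=3):
    """Rank the other words by cosine similarity; vectors must be unit length."""
    query = unit_vectors[word]
    scored = (
        (sum(a * b for a, b in zip(query, vec)), other)
        for other, vec in unit_vectors.items()
        if other != word
    )
    return [other for _score, other in heapq.nlargest(top_n, scored)]


# Toy 3-dimensional embeddings (made up); the GoogleNews vectors have 300 dimensions.
raw = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
    "pear": [0.15, 0.1, 0.9],
}
unit_vectors = {w: normalize(v) for w, v in raw.items()}  # normalise once, up front
print(nearest("king", unit_vectors, top_n=2))  # ['queen', 'apple']
```

For a full vocabulary the same computation is usually done as one matrix-vector product, which is where an optimized BLAS backend matters.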

Why is my OpenCV CUDA code running slower than the CPU for simple thresholding?

My CPU is Intel Core2 Duo T5550, GPU is GeForce 8400M G. CUDA version 5.5.22, OpenCV version 2.4.8.
The test code is as follows:
double t = (double)getTickCount();
gpu::threshold(src, dst, thres, binMax, THRESH_BINARY);
t = ((double)getTickCount() - t)/getTickFrequency();
cout << "Times passed in seconds: " << t << endl;
For a 3648*2736 image, the result is
CPU: Times passed in seconds: 0.0136336
GPU: Times passed in seconds: 0.0217714
Thanks!
Perhaps this is not surprising.
Your GeForce 8400M G is an old mobile card with only 8 cores (see the GeForce 8M series specifications), so you cannot extract much parallelism from it.
Bluntly speaking, GPUs have an advantage over multicore CPUs when you can extract massive parallelism using a large number of cores. In other words, to quickly build an Egyptian pyramid with slow workers (GPU cores) you need a large number of them. If you have only very few slow workers (8 in your case), then it is perhaps better to have even fewer (2 CPU cores, for example) but much faster workers.
EDIT
I just remembered coming across this post:
Finding minimum in GPU slower than CPU
which may help convince you that bad implementations (as underlined by Abid Rahman and Mailerdaimon) may lead to GPU code that is slower than CPU code. The situation is even worse if, as pointed out in the answer to the post above, you are also hosting the X display on your already limited GeForce 8400M G card.
In addition to what @JackOLantern said:
Every copy operation involving the GPU takes time! A lot of time compared to just computing on the CPU. This is why @Abid Rahman K's comment is a good idea: he suggested testing again with more complex code. The advantage of the GPU is fast parallel processing; one of its disadvantages is the relatively slow transfer rate when copying data to and from the GPU.
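To put a rough number on the copy cost, here is a back-of-the-envelope sketch for the image size in the question. The bandwidth figure is an assumption chosen for illustration, not a measurement of this laptop, and the single 8-bit channel is also assumed; `transfer_ms` is a hypothetical helper:

```python
def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Time to move num_bytes across the bus at the given bandwidth (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1000.0


width, height = 3648, 2736      # image size from the question
image_bytes = width * height    # assumes one 8-bit channel
bandwidth = 2.0                 # GB/s, assumed; measure your own system

round_trip_ms = 2 * transfer_ms(image_bytes, bandwidth)  # upload + download
cpu_ms = 13.6                   # CPU threshold time reported in the question

print(f"image size: {image_bytes / 1e6:.1f} MB")
print(f"round trip: {round_trip_ms:.1f} ms vs CPU total {cpu_ms} ms")
```

With these assumed numbers the two copies alone take about 10 ms, which is already close to the whole CPU run, so a kernel as cheap as a fixed threshold has little room to win once transfers are counted.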

Is a CUDA-programmed GPU suitable for implementation of OpenCV adaptive threshold?

On my system, for a 5 MP image with a large window size (75px), OpenCV's adaptiveThreshold takes a whopping 140 ms (roughly 20 times as long as linear operations) to complete, and I am looking to optimize it. I have noticed that the OpenCV gpu module does not implement a GPU version of adaptiveThreshold, so I have been thinking of implementing that algorithm for the GPU myself.
Can I hope for any speedup if I implement an adaptive threshold algorithm in CUDA, based on a large window size (50px+) and a large image (5 MP+), ignoring the overhead for loading memory into the GPU?
adaptiveThreshold documentation on opencv.org:
http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold
Building on Eric's answer:
The NPP CUDA library does not implement adaptiveThreshold, but it makes it VERY straightforward to build one (I just tested it and, anecdotally, it works):
1. Run a box filter on src (i.e. compute the mean window value for every pixel) and store the result in an intermediate image tmp.
2. Subtract a number K from each pixel in tmp.
3. Run a compare function between src and tmp into dst. The end.
The code may look like this (here K=0, so the 2nd step is omitted):
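// Box filter: write the per-pixel window mean of src into the intermediate image.
nppiFilterBox_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                     oDeviceIntermediate.data(), oDeviceIntermediate.pitch(),
                     oSizeROI, oAdapThreshWindowSize, oAnchor);
// Compare: src against the filtered image, written into the result image.
nppiCompare_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                   oDeviceIntermediate.data(), oDeviceIntermediate.pitch(),
                   oDeviceResult.data(), oDeviceResult.pitch(),
                   oSizeROI, NPP_CMP_LESS);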
Also, Wikipedia claims that applying a box filter 3 times in a row approximates a Gaussian filter to 97% accuracy.
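The box filter, subtract K, compare recipe from this answer can be checked on the CPU with a small reference sketch. This is plain Python rather than CUDA or NPP, the function names are made up, and two choices are assumptions of the sketch: the window is clipped at the image border, and a pixel is set when it is brighter than the local mean minus K (the NPP snippet uses the opposite comparison). The mean is computed from an integral image, so the cost per pixel does not depend on the window size:

```python
def box_mean(src, radius):
    """Mean of each pixel's (2*radius+1)^2 neighbourhood, clipped at the border."""
    h, w = len(src), len(src[0])
    # Integral image padded with a zero row and column:
    # integral[y][x] holds the sum of all pixels above and to the left of (y, x).
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += src[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            out[y][x] = total / ((y1 - y0) * (x1 - x0))
    return out


def adaptive_threshold(src, radius, k=0, max_value=255):
    """Box filter, subtract k, then compare src against the result."""
    tmp = box_mean(src, radius)
    return [
        [max_value if pix > mean - k else 0 for pix, mean in zip(src_row, tmp_row)]
        for src_row, tmp_row in zip(src, tmp)
    ]


# 4x4 toy image: uniform background with a single bright pixel.
image = [
    [10, 10, 10, 10],
    [10, 200, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
for row in adaptive_threshold(image, radius=1):
    print(row)
```

Only the bright pixel ends up set in the output, which is the behaviour expected from a mean-based adaptive threshold with K=0.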
Yes, this algorithm can be optimized on the GPU. I would expect to see an excellent speedup.
For ADAPTIVE_THRESH_MEAN_C, you could use a standard parallel reduction to calculate the arithmetic mean. For ADAPTIVE_THRESH_GAUSSIAN_C, you might use a kernel that performs per-pixel gaussian attenuation combined with a standard parallel reduction for the sum.
A CUDA implementation should give you a satisfying performance gain.
Since your window size is large, this operation should be compute-bound. The theoretical best-case time for a 5 MP image with a 75px window on a Tesla K20X GPU, running at peak throughput, should be about
5e6 * 75 * 75 / 3.95 Tflops = 7ms
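The estimate above can be reproduced with a few lines of arithmetic. The second figure is an addition for comparison, not part of the original answer: it assumes the box filter is run as a separable row pass plus column pass, so each pixel costs 2 * 75 operations instead of 75 * 75. Both numbers are idealised lower bounds at peak throughput, not predictions:

```python
pixels = 5e6            # 5 MP image
window = 75             # window side in pixels
peak_flops = 3.95e12    # single-precision peak used in the answer for a Tesla K20X

direct_ops = pixels * window * window   # one operation per window element per pixel
separable_ops = pixels * 2 * window     # row pass + column pass (assumed variant)

direct_ms = direct_ops / peak_flops * 1e3
separable_ms = separable_ops / peak_flops * 1e3

print(f"direct:    {direct_ops:.3e} ops -> {direct_ms:.2f} ms")
print(f"separable: {separable_ops:.3e} ops -> {separable_ms:.3f} ms")
```

The direct form comes out at about 7.12 ms, matching the figure in the answer, and the separable form at about 0.19 ms. At that point the filter would likely be limited by memory bandwidth rather than arithmetic.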
Here's a white paper about image convolution. It shows how to implement a high-performance box filter with CUDA.
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
NVIDIA's NPP library also provides a function nppiFilterBox(), which can be used to implement ADAPTIVE_THRESH_MEAN_C directly.
http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp
For ADAPTIVE_THRESH_GAUSSIAN_C, the function nppiFilter() with a proper mask could be used.
NPP doc pp.1009 http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf

Resources