Compiler for RISC-V vector code generation - vectorization

Is there a compiler available that generates vector instructions according to the new vector extension proposed in the RISC-V ISA specification v2.2?

No, such a compiler does not exist yet. The vector extension is still being written, but there should be a more complete draft and progress on a toolchain by the next RISC-V workshop.

As of early 2020, there are binutils/GCC branches that support the current 0.8 draft of the RISC-V Vector Extension "V".
That means with that experimental toolchain, you can use vector instructions in assembly code and in inline-assembler constructs.
It's unclear whether that GCC RISC-V "V" support already extends to auto-vectorization.
Vector intrinsics also aren't available yet.

Related

fft numpy style on iOS accelerate with non power of two data length

I'm working on reimplementing Python code on iOS (Swift).
I need to do an FFT (numpy style) on chunks of 1D data, each of size 1050 (windowed audio data).
Thankfully I found a related explanation and snippet of code on how to do an iOS FFT in numpy style (link).
However, I'm stuck because the Accelerate framework supports FFTs only on input lengths that are a power of two (or, more recently, f * 2^n, where f is 3, 5, or 15 and n is at least 3).
I tested my Python code with window size 1050, and it works great for my use case. But it is not straightforward to implement on iOS because of the above limitation.
It is not easy to dig into the numpy C code to see how it handles non-power-of-two lengths. This answer was a good starting point for me, but I still didn't get it.
Speed is also important here, which is why I'm not considering a brute-force DFT.
Any guidance here would be really appreciated.
IIRC, under the hood numpy's fft uses fftpack, a C conversion of an old NCAR Fortran math library; the actual numpy FFT is not implemented in Python code. You could very likely compile some fftpack C code using Xcode and use a bridging header to call it from iOS Swift code.
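As a sanity check of the claim that numpy's mixed-radix FFT handles an arbitrary length like 1050 (which is 2 · 3 · 5² · 7, not a power of two), here is a small numpy sketch comparing it against a direct O(n²) DFT. This is not iOS code; it just shows what any Accelerate replacement must reproduce:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1050                      # non-power-of-two length from the question
x = rng.standard_normal(n)

fast = np.fft.fft(x)          # mixed-radix FFT, works for any length

# Naive DFT: X[k] = sum_j x[j] * exp(-2*pi*i*j*k / n)
j = np.arange(n)
dft_matrix = np.exp(-2j * np.pi * np.outer(j, j) / n)
slow = dft_matrix @ x

# Identical up to floating-point rounding error
assert np.allclose(fast, slow)
```

Any C/C++ FFT you bridge into Swift (fftpack, or OpenCV's dft) should pass the same comparison for length-1050 input.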
Your answers/comments guided me to use C/C++ code to get the desired result (I didn't initially think of that as an option).
I ended up using OpenCV's dft function (which internally implements an FFT); it produces results similar to numpy's fft and, according to their docs, is faster than numpy.

drake: search for fixed points and trim points of a system

I have a LeafSystem in Drake with dynamics \dot{x} = f(x,u) written in DoCalcTimeDerivatives. The fixed points and trim points of this system are not trivial to find. Therefore, I imagine one would need to write a nonlinear optimization problem to find the fixed points:
find x, u
s.t. f(x, u) = 0
or
min_{x, u} ||f(x, u)||^2
I am wondering how I should take advantage of the dynamics I have already written in DoCalcTimeDerivatives of the LeafSystem, and write a nonlinear optimization that searches over x and u for the fixed points and trim points in Drake. Some existing examples in Drake would be greatly appreciated!
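Independent of Drake's API, the optimization itself is small. Here is a hedged sketch using scipy.optimize.least_squares on a hypothetical damped pendulum (the dynamics, the damping value, and the trim angle are all illustrative, not from Drake); in Drake you would evaluate the same f(x, u) through the system's derivatives instead of writing it by hand:

```python
import numpy as np
from scipy.optimize import least_squares

b = 0.1  # damping coefficient (assumed value for illustration)

def f(x, u):
    """Toy continuous-time dynamics xdot = f(x, u) for a damped pendulum."""
    return np.array([x[1], u - np.sin(x[0]) - b * x[1]])

theta_star = np.pi / 4  # trim condition: hold the pendulum at this angle

def residual(z):
    # Decision variables z = [x2, u]; x1 is pinned to theta_star.
    x = np.array([theta_star, z[0]])
    return f(x, z[1])  # we want f(x, u) = 0 at the trim point

sol = least_squares(residual, x0=np.zeros(2))
x2_star, u_star = sol.x
# At the trim point the velocity vanishes and the input cancels gravity:
# x2_star ≈ 0, u_star ≈ sin(pi/4)
```

The same pattern scales to the general case: make the state and input decision variables, pin whatever components define your trim condition, and drive the remaining components of f(x, u) to zero.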
It's simple to write for your case (and only slightly harder to write for the general case... it's on my TODO list).
Assuming your plant supports symbolic, then looking at the trajectory optimization will give you a sense for how you might write the constraint:
https://github.com/RobotLocomotion/drake/blob/master/systems/trajectory_optimization/direct_transcription.cc#L212
(the autodiff version is just below).
fwiw, the general case from the old matlab version is here:
https://github.com/RobotLocomotion/drake/blob/last_sha_with_original_matlab/drake/matlab/solvers/FixedPointProgram.m

Why is opencv dnn slower if I use Halide?

I am testing the performance of some samples in the OpenCV source tree, depending on whether Halide is used or not.
Surprisingly, the performance is worse when Halide is used:
squeezenet_halide: ~24ms with halide and ~16ms without halide.
resnet_ssd_face: ~84ms with halide and ~36ms without halide.
I compiled Halide and OpenCV following the instructions in this tutorial. The OpenCV code was downloaded from the master branch of the OpenCV git repository.
I tested the performance using the sample files 'resnet_ssd_face.cpp' and 'squeezenet_halide.cpp'. In both cases I include one of these lines just before the call to 'forward', to activate or deactivate Halide:
net.setPreferableBackend(DNN_BACKEND_HALIDE); // use Halide
net.setPreferableBackend(DNN_BACKEND_DEFAULT); // NOT use Halide
The time is measured with this code just after the call to the 'forward' function:
std::vector<double> layersTimings;
double freq = cv::getTickFrequency() / 1000;
double time = net.getPerfProfile(layersTimings) / freq;
std::cout << "Time: " << time << " ms" << std::endl;
Is there anything missed in the tutorial? Should Halide be compiled with different parameters?
My setup is:
OS: Linux (Ubuntu 16.04)
CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
GPU: nVidia GeForce GT 730 (Driver Version: 384.90)
Cuda: CUDA Version 9.0.176
Taking into account the comment by Dmitry Kurtaev and looking at the wiki of the OpenCV GitHub account, I found a page with a benchmark comparing different approaches (I had missed the links in the tutorial).
There is also a merge request that includes a similar benchmark.
In both of them, the measurements show that the performance using Halide is worse than with the original C++ approach.
I assume the Halide integration is at an early stage. Moreover, as Zalman Stern comments, the Halide scheduling is a work in progress, and the original optimizations in OpenCV's dnn module may well be better tuned than the included Halide schedules.
I hope these measurements improve in future versions of OpenCV, but for now, this is the performance.
My answer is slightly unrelated but may be helpful.
For face detection + face alignment:
Normal SSD detection time: 50-55 ms
Using the OpenVINO inference engine: 40-45 ms

Why should the disparity value in SGBM be divisible by 16?

I am working with OpenCV's SGBM (semi-global block matching) function. It takes two parameters, minDisparity and numberOfDisparities.
Why must numberOfDisparities be divisible by 16?
Probably to simplify the code internally, which uses SSE2. In general, SSE2 instructions:
Work on multiple numbers simultaneously; having the total number of elements be evenly divisible makes things simpler.
SSE2 requires 128-bit (16-byte) memory alignment; alignment can be more easily maintained when sizes are nice multiples of 16.
If you examine the OpenCV source code, you'll see lots of SSE2 code for the SGBM algorithm.
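A practical consequence for callers: if your scene needs, say, 100 disparity levels, round up to the next multiple of 16 before constructing the matcher. A minimal sketch (the helper name is illustrative):

```python
def round_up_to_multiple(n, m=16):
    """Smallest multiple of m that is >= n."""
    return ((n + m - 1) // m) * m

# A requested range of 100 disparities becomes 112 (= 7 * 16);
# a value that is already a multiple of 16 is left unchanged.
assert round_up_to_multiple(100) == 112
assert round_up_to_multiple(128) == 128
```

The rounded-up value is what you would pass as numberOfDisparities; minDisparity itself has no such restriction.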

OpenCV GPU Primitives

Are the OpenCV GPU primitives based on the CUDA NVIDIA Performance Primitives (NPP)?
By primitives I mean the same ones implemented in the NPP library, for example: boxFilter, Mirror, Convolution...
I would like to know about this as I'm planning to use the NPP library. However, OpenCV has more functions that could help me, for example with border treatment in image processing.
OpenCV uses the NPP library for some functions, but it is hard to create a complete list of them.
Some functions use only the NPP implementation (boxFilter, graphcut, histEven).
Other functions use different implementations for different input parameters. For example, cv::gpu::resize uses NPP for some input parameters (CV_8UC1 and CV_8UC3 types, INTER_NEAREST and INTER_LINEAR interpolation modes) and its own implementation for the others.
A great webinar about OpenCV on a GPU using CUDA:
Video - http://on-demand.gputechconf.com/gtc/2013/webinar/opencv.mp4
Slides PDF - http://on-demand.gputechconf.com/gtc/2013/webinar/opencv-gtc-express-shalini-gupta.pdf
