How to make the Intel C Compiler (icc) vectorize basic blocks? - vectorization

I am currently using icc (version 13.1.0.146) to compile C programs running in native mode on the Intel Xeon Phi coprocessor.
Consider the following two code fragments:
// fragment 1
array[pos] += 1;
array[pos + 1] += 1;
array[pos + 2] += 1;
array[pos + 3] += 1;
// fragment 2
for (int i = 0; i < 4; ++i)
array[i] += 1;
Unfortunately, only the loop is vectorized automatically. However, if i compile for the x86 platform, icc also vectorizes the "unrolled" version.
Is there a way to tell icc to vectorize basic blocks when compiling for the Xeon Phi, too?
Any help is appreciated. Thanks in advance!

The transformation which you are looking for here is "Loop materialization" which creates short running loops (number of iterations) from basic blocks and the loop body is very small. So generally are not good candidates for vectorization on Intel(R) Xeon Phi(TM) Coprocessor. This is because we want a significant workload in the loop body so that overhead of creating vector operands doesn't show up significantly in the overall execution time of the loop.

Related

MedianBlur() calculate max kernel size

I want to use MedianBlur function with very high Ksize, like 301 or more. But if I pass ksize too high, sometimes the function will crash. The error message is:
OpenCV Error: (k < 16) in cv::medianBlur_8u_O1, in file ../opencv\modules\imgproc\src\smooth.cpp
(I use opencv4nodejs, but I also tried the original OpenCV 3.4.6).
I did reduce the ksize in a try/catch loop, but not so effective, since I have to work with videos.
I did checkout the OpenCV source code and did some research.
In OpenCV 3.4.6, the crash come from line 241, file opencv\modules\imgproc\src\median_blur.simd.hpp:
for ( k = 0; k < 16 ; ++k )
{
sum += H.coarse[k];
if ( sum > t )
{
sum -= H.coarse[k];
break;
}
}
CV_Assert( k < 16 ); // Error here
t is caculated base on ksize. But sum and H.coarse array's calculations are quite complicated.
Did further researches, I found a scientific document about the algorithm: https://www.researchgate.net/publication/321690537_Efficient_Scalable_Median_Filtering_Using_Histogram-Based_Operations
I am trying to read but honestly, I don't understand too much.
How do I calculate the maximum ksize with a given image?
The maximum kernel size is determined from the bit depth of the image. As mentioned in the publication you cited:
"An 8-bit value is limited to a max value of 255. Our goal is to
support larger kernel sizes, including kernels that are greater in
size than 17 × 17, thus the larger 32-bit data type is used"
so for an image of data type CV_8U the maximum kernel size is 255.

How can I make this function as fast as the vDSP version?

I have written the function below, which uses vDSP function calls to compute a certain result. I thought it would be faster if I rewrote it using the 128 bit vFloat data type to avoid the vDSP function calls. But my vFloat code is still 2-3 times slower than the vDSP version.
I am targeting iOS mainly, but it would be best if the code also runs well on Mac OS.
I measure the speed of these functions on arrays of length 256, which is the typical array length for my application. I want to know how to get this function to run as fast as possible because I have many others like it and I am hoping that once I figure out how to optimize this one I can use the same principles for all the others.
Here is the vDSP version, which on Mac OS is 50% faster with aggressive optimizations enabled, or 2-3x faster with less aggressive compiler settings:
void asymptoticLimitTest2(float limit,
const float* input,
float* output,
size_t numSamples){
// input / limit => output
vDSP_vsdiv(input, 1, &limit, output, 1, numSamples);
// abs(output) => output
vDSP_vabs(output, 1, output, 1, numSamples);
// 1 + output => output
float one = 1.0;
vDSP_vsadd(output, 1, &one, output, 1, numSamples);
// input / output => output
vDSP_vdiv(output, 1, input, 1, output, 1, numSamples);
}
Here is my vFloat version, which I thought would be faster because it avoids all the function calls, but for my application's standard vector length of 256, is not faster:
void asymptoticLimitTest3(float limit,
const float* input,
float* output,
size_t numSamples){
vFloat limitv = {limit, limit, limit, limit};
vFloat onev = {1.0,1.0,1.0,1.0};
size_t n = numSamples;
// process in chunks of 8 samples
while(n > 4){
vFloat d = vfabsf(*(vFloat *)input / limitv) + onev;
*(vFloat *)output = *(vFloat *)input / d;
input += 4;
output += 4;
n -= 4;
}
// process the remaining samples individually
while(n > 0){
float d = fabsf(*input / limit) + 1.0;
*output = *input / d;
input++;
output++;
n--;
}
}
I am hoping to get asymptoticLimitTest3() to run faster than asymptoticLimitTest2(). I'm interested to hear any and all suggestions that will speed up asymptoticLimitTest3()
Thanks in advance for your help.
I did more tests on the timing, and it turns out that the vFloat version is generally NOT slower than the vDSP version. When I did the original tests, I was calling the same function over and over again in a for loop to get the timing info. When I rewrote the loop so that it interleaves the calls with calls to other functions (as would be more normal for an actual application), then the vFloat version is faster. Apparently the vDSP version was picking up some benefit from running the same code repeatedly, perhaps because the code itself remained in the cache.

Image computation on GPU and value returning

I have a C# project in which I retreive grey-scale images from cameras and do some computation with the image data. The computations are quite time-consuming since I need to loop over the total image several times and I am doing it all on the CPU.
Now I would like to try to get the evaluation running on the GPU, but I have a lot of struggle achieving that, since I never did any GPU calculations before.
The software should be able to run on several computers with varying hardware, so CUDA for example is not a solution for me, since the code should also run on laptops which only have onboard graphics. After some research I came accross Cloo (found it on this project), which seems to be a quite resonable choice.
So far I integrated Cloo in my project and tried to get this hello world example running. I guess it is running, since I don´t get any exception, but I don´t know where I can see the printed output.
For my computations I need to pass the image to the GPU and I also need the x-y coordinates during the computation. So, in C# the computation looks like this:
int a = 0;
for (int y = 0; y < img_height; y++){
for (int x = 0; x < img_width; x++){
a += image[x,y] * x * y;
}
}
int b = 0;
for (int y = 0; y < img_height; y++){
for (int x = 0; x < img_width; x++){
b += image[x,y] * (x-a) * y;
}
}
Now I want to have these calculations to run on the GPU, and I want to parallel the y-loop, so that in every task one x-loop is running. Then I could take all the resulting a values and add them up before the second loop block would start.
Afterwards I would like to return the values a and b to my C# code and use them there.
So, to wrap up my questions:
Is Cloo a recommendable choice for this task?
What is the best way to pass the image-data (16bit, short-array) and the dimensions (img_width, img_height) to the GPU?
How can I return a value from the GPU? As far as I know kernels are always used as kernel void...
What would be the best way to implement the loops?
I hope my questions are clear and I provided sufficient information to understand my struggles. Any help is appreciated. Thanks in advance.
Let's reverse engineer the problem. Understanding the efficient processing of the "dependency-chain" of image[][], image_height, image_width, a, b
Ad 4 ) the tandem of identical for-loops has a poor performance
given the defined code, there could be just a single loop, thus with reduced overhead costs and best with also maximising cache-aligned vectorised code.
Cache-Naive re-formulation:
int a = 0;
int c = 1;
for ( int y = 0; y < img_height; y++ ){
for ( int x = 0; x < img_width; x++ ){
int intermediate = image[x,y] * y; // .SET PROD(i[x,y],y)
a += x * intermediate; // .REUSE 1st
c -= intermediate; // .REUSE 2nd
}
}
int b = a * c; // was my fault upon being in a hurry leaving for weekend :o)
Moving the code into the split tandem loops is only increasing these overheads and devastating any possible cache-friendly tricks in the code-performance tweaking.
Ad 3 + 2 ) kernel call-signature + CPU-side methods allow this
OpenCL and Cloo document these details, so nothing magical beyond the documented methods is needed here.
Yet, there are latency costs associated with each such host-side to device-side + device-side to host-side transfers. Given you claim that the 16bit-1920x1200 image-data are to be re-processed ~ 10 times in a loop, there are some chances these latencies need not be spent on every such loop pass-through.
The worst performance-killer is a very shallow kernel mathematical density. The problem is, there is indeed not much to calculate in the kernel, so the chances for any efficient SIMD / GPU parallel tricks are indeed pretty low.
In this sense, the CPU-side smart-vectorised code will do much better than the ( H2D + D2H )-overheads-far latency-hostile computationally-shallow GPU-kernel processing.
Ad 1) Given 2+3 and 4 above, 1 may easily loose sense
As prototyped and given additional cache-friendly vectorised tricks, the in-ram + in-cache vectorised code will have chances to beat all OpenCL and mixed-GPU/CPU automated ad-hoc kernel compilation generated device code and it's computing efforts.

Can dispatch overhead be more expensive than an actual thread work?

Recently I'm thinking about possibilities for hard optimization. I mean those kind of optimization when you sometimes hardcode the loop from 3 iterations just to get something.
So one thought came to my mind. Imagine we have a buffer of 1024 elements. We want to multiply every single element of it by 2. And we create a simple kernel, where we pass a buffer, outBuffer, their size (to check if we outside of the bounds) and [[thread_position_in_grid]]. Then we just do a simple multiplaction and write that number to another buffer.
It will look a bit like that:
kernel void multiplyBy2(constant float* in [[buffer(0)]],
device float* out [[buffer(1)]],
constant Uniforms& uniforms [[buffer(2)]],
uint gid [[thread_position_in_grid]])
{
if (gid >= uniforms.buffer_size) { return; }
out[gid] = in[gid] * 2.0;
}
The thing I'm concerned about is If the actual thread work still worth the overhead that is produced by it's dispatching?
Would it be more effective to, for example, dispatch 4 times less threads, that do something like that
out[gid * 4 + 0] = in[gid + 0] * 2.0;
out[gid * 4 + 1] = in[gid + 1] * 2.0;
out[gid * 4 + 2] = in[gid + 2] * 2.0;
out[gid * 4 + 3] = in[gid + 3] * 2.0;
So that thread can work a little bit longer? Or it is better to make threads as thin as possible?
Yes, and this is true not merely in contrived examples, but in some real-world scenarios too.
For extremely simple kernels like yours, the dispatch overhead can swamp the work to be done, but there's another factor that may have an even bigger effect on performance: sharing fetched data and intermediate results.
If you have a kernel that, for example, reads the 3x3 neighborhood of a pixel from an input texture and writes the average to an output texture, you could share the fetched texture data and partial sums between adjacent pixels by operating on more than one pixel in your kernel function and reducing the total number of threads you dispatch.
Perhaps this sates your curiosity. For any practical application, Scott Hunter is right that you should profile on all target devices before and after optimizing.

How would to write a multiplication of double values in NEON assembly?

The line in question is pretty contained:
w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1]
Considering these variables are double (but I can downgrade to float), I think I can pass one value per register? Would it be more efficient to use the 2x2 matrix W directly?
EDIT 1:
This line is inside a loop that is fired hundreds of times per second and has real-time requirements. Instruments says this line takes 60% of the time of the loop.
EDIT 2:
This is the loop(s) I'm talking about:
for (int x=startingX; x<endingX; ++x)
{
for (int y=startingY; y<endingY; ++y)
{
Matx21d position(x,y);
// warp patch
uint8_t *data;
[self backwardWarpPatchWithWarpingMatrix:warpingMatrix withWarpData:&data withReferenceImage:_initialView withCenter:position];
// check that the backward patch was successful
if (!data)
continue;
// calculate zero mean (on the patch) sum of squared differences
int ssd = [self computeZMSSDScoreWithX:x withY:y withCurrentTargetPatch:data];
if (fabs(ssd) < bestSSD)
{
bestPosition = position;
bestSSD = ssd;
}
}
}
backwardWarpPatchWithWarpingMatrix:
Matx22d warpingMatrixInverse = warpingMatrix.inv();
double wmi0 = warpingMatrixInverse(0,0), wmi1 = warpingMatrixInverse(0,1), wmi2 = warpingMatrixInverse(1,0), wmi3 = warpingMatrixInverse(1,1);
if (isnan(wmi0))
{
warpingMatrixInverse = Matx22d::eye();
}
// Perform the warp on a larger patch.
int LEVEL_REF = 0, halfPatchSize = PATCH_SIZE/2;
Matx21d centerInLevel = center * (1.0 / (1<<LEVEL_REF));
__block Mat warped(PATCH_SIZE, PATCH_SIZE, CV_8UC1);
dispatch_apply(PATCH_SIZE, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t y)
{
for (int x=0; x<PATCH_SIZE; ++x)
{
double pp0 = x - halfPatchSize, pp1 = (double)y - halfPatchSize;
Matx21d multiplication(wmi0 * pp0 + wmi1 * pp1, wmi2 * pp0 + wmi3 * pp1);
Matx21d px(multiplication(0) + centerInLevel(0), multiplication(1) + centerInLevel(1));
double warpedPixel = [self interpolatePointInImage:referenceImage withU:px(0) withV:px(1)];
warped.at<uchar>(y,x) = (uint8_t)warpedPixel;
}
});
computeReferencePatchScores:
int x = (int)u;
int y = (int)v;
float subpixX = u - x,
subpixY = v - y,
oneMinusSubpixX = 1.0 - subpixX,
oneMinusSubpixY = 1.0 - subpixY;
float w00 = oneMinusSubpixX * oneMinusSubpixY,
w01 = oneMinusSubpixX * subpixY,
w10 = subpixX * oneMinusSubpixY,
w11 = 1.0f - w00 - w01 - w10;
const int stride = (int)image.step.p[0];
uchar* ptr = image.data + y * stride + x;
return w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1];
You typically don't translate a single line of code into assembly. For it to be worth writing in assembly, you have to first assume that you can generate better assembly than the compiler will. Sometimes that's true for vectorized code on NEON, but it's usually because you have special knowledge about a complex loop. You're unlikely to beat the compiler significantly on a single line of code (and will likely lose). Is this line part of a loop that you've profiled and identified as a major bottleneck? Have you already tried Accelerate? Have you analyzed the assembly the compiler is generating and found mistakes that it's making.
Trying to do this in ObjC++ is very inefficient. ObjC++ is a glue language for tying together C++ and ObjC; doing both in the same file imposes several performance costs, especially with ARC. Calling an ObjC method inside of a performance-critical inner-loop is very expensive in any case (even if there weren't mixed-in C++). You should never do any kind of function call (least of all an ObjC method dispatch) inside of a tight inner-loop. It's not clear where you're actually calling computeReferencePatchScores. The use of GCD here is probably hurting you more than helping (since it prevents the compiler from applying certain vector optimizations).
This is all to say: how a particular line of code is being compiled into assembly is by far the least of your problems in this code. Its structure is fighting clang's optimizer.
Step one is to step back and ask what computation you want to execute, and then read through the Core Image Programming Guide and the vImage Programming Guide and verify that it isn't already available. You might also look over OpenGL ES, but OpenGL is often a whole approach to drawing (so it's a bit more of a commitment). It looks like you're already using OpenCV, so make sure it doesn't have available functions to do what you want. (Most of what I see in there looks like stuff built into both OpenCV and vImage.)
The simplest way to improve performance without moving to more powerful frameworks is to move the entire loop into a single C++ function. Then the optimizer can see all the code and apply vector operations on its own. But the next step is to make use of the high-level high-performance frameworks already available.
In any case, you'll want to sit down and carefully work through exactly the calculations you need to perform (I usually do this by hand on paper). Make sure you're not duplicating anything, that you need every calculation you're performing, and that each change you make still generates the same result.
This looks to be a 2x2 convolution. If the data set is large, then vImageConvolve_PlanarF with a 3x3 kernel with some zero padding in it will do the job. It tries to skip work on kernel elements that are 0. You would need to convert the data set to single precision.
If the data set is small, then you are probably stuck with scalar code performance. Inline the function if you can. Perhaps you can figure out how to aggregate a bunch of these together to take advantage of a heavier duty high performance routine.
However, if the weights change from pixel to pixel, then a convolution isn't going to work. You may look instead at the N-dimensional lookup table feature in vImage/Transform.h, if your data set is not huge.
I am a bit skeptical that the time is really spent just in that line. It is best to look at the assembly view in instruments to see where the samples really land.

Resources