How can I make this function as fast as the vDSP version? - vectorization

I have written the function below, which uses vDSP function calls to compute a certain result. I thought it would be faster if I rewrote it using the 128 bit vFloat data type to avoid the vDSP function calls. But my vFloat code is still 2-3 times slower than the vDSP version.
I am targeting iOS mainly, but it would be best if the code also runs well on Mac OS.
I measure the speed of these functions on arrays of length 256, which is the typical array length for my application. I want to know how to get this function to run as fast as possible because I have many others like it and I am hoping that once I figure out how to optimize this one I can use the same principles for all the others.
Here is the vDSP version, which on Mac OS is 50% faster with aggressive optimizations enabled, or 2-3x faster with less aggressive compiler settings:
void asymptoticLimitTest2(float limit,
                          const float* input,
                          float* output,
                          size_t numSamples){
    // input / limit => output
    vDSP_vsdiv(input, 1, &limit, output, 1, numSamples);
    // abs(output) => output
    vDSP_vabs(output, 1, output, 1, numSamples);
    // 1 + output => output
    float one = 1.0;
    vDSP_vsadd(output, 1, &one, output, 1, numSamples);
    // input / output => output
    vDSP_vdiv(output, 1, input, 1, output, 1, numSamples);
}
Here is my vFloat version, which I thought would be faster because it avoids all the function calls, but for my application's standard vector length of 256, is not faster:
void asymptoticLimitTest3(float limit,
                          const float* input,
                          float* output,
                          size_t numSamples){
    vFloat limitv = {limit, limit, limit, limit};
    vFloat onev = {1.0, 1.0, 1.0, 1.0};
    size_t n = numSamples;
    // process in chunks of 4 samples
    while(n > 4){
        vFloat d = vfabsf(*(vFloat *)input / limitv) + onev;
        *(vFloat *)output = *(vFloat *)input / d;
        input += 4;
        output += 4;
        n -= 4;
    }
    // process the remaining samples individually
    while(n > 0){
        float d = fabsf(*input / limit) + 1.0;
        *output = *input / d;
        input++;
        output++;
        n--;
    }
}
I am hoping to get asymptoticLimitTest3() to run faster than asymptoticLimitTest2(). I'm interested in hearing any and all suggestions that will speed up asymptoticLimitTest3().
Thanks in advance for your help.

I did more tests on the timing, and it turns out that the vFloat version is generally NOT slower than the vDSP version. When I did the original tests, I was calling the same function over and over again in a for loop to get the timing info. When I rewrote the loop so that it interleaves the calls with calls to other functions (as would be more normal for an actual application), then the vFloat version is faster. Apparently the vDSP version was picking up some benefit from running the same code repeatedly, perhaps because the code itself remained in the cache.
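For reference, here is a sketch of the same computation written with NEON intrinsics (arm_neon.h); the helper name and the reciprocal-estimate trick are illustrative assumptions rather than part of the original code. The first division is folded into a precomputed reciprocal of |limit|, and the remaining vector division is replaced by a reciprocal estimate plus one Newton-Raphson refinement.

#include <arm_neon.h>   // NEON intrinsics; assumed available on the ARM targets
#include <math.h>
#include <stddef.h>

// Sketch only: same math as asymptoticLimitTest3. Add a second refinement
// step, or use vdivq_f32 on AArch64, if full float precision is required.
void asymptoticLimitTestNeon(float limit,
                             const float *input,
                             float *output,
                             size_t numSamples) {
    const float32x4_t invLimitv = vdupq_n_f32(1.0f / fabsf(limit));
    const float32x4_t onev = vdupq_n_f32(1.0f);
    size_t n = numSamples;
    while (n >= 4) {
        float32x4_t x = vld1q_f32(input);
        // d = 1 + |x| / |limit|
        float32x4_t d = vmlaq_f32(onev, vabsq_f32(x), invLimitv);
        // r ~= 1 / d, refined once: r = r * (2 - d * r)
        float32x4_t r = vrecpeq_f32(d);
        r = vmulq_f32(r, vrecpsq_f32(d, r));
        vst1q_f32(output, vmulq_f32(x, r));
        input += 4;
        output += 4;
        n -= 4;
    }
    // process the remaining samples individually
    while (n > 0) {
        float d = fabsf(*input / limit) + 1.0f;
        *output++ = *input++ / d;
        n--;
    }
}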

Related

WebGL2 maximum size of 3D texture?

I'm creating a 3D texture in WebGL2 with code along these lines:
gl.bindTexture(gl.TEXTURE_3D, texture);
const level = 0;
const internalFormat = gl.R32F;
const width = 512;
const height = 512;
const depth = 512;
const border = 0;
const format = gl.RED;
const type = gl.FLOAT;
const data = imagesDataArray;
// ... followed by the upload command
It seems that a size of 512x512x512 with 32F values is somewhat of a dealbreaker: Chrome (running on a laptop with 8 GB RAM) crashes when uploading a 3D texture of this size, but not always. Using a texture of, say, 512x512x256 seems to always work on my laptop.
Is there any way to tell in advance the maximum size of 3D texture that the GPU can accommodate under WebGL2?
Best regards
Unfortunately no, there isn't a way to tell how much space there is.
You can query the largest dimensions the GPU can handle, but you can't query the amount of memory it has available, just like you can't query how much memory is available to JavaScript.
That said, 512*512*512*1(R)*4(32F) is at least 0.5 GB. Does your laptop's GPU have 0.5 GB? You probably need at least 1 GB of GPU memory to use 0.5 GB as a texture, since the OS needs space for your app's windows etc.
Browsers also put different limits on how much memory you can use.
Some things to check:
How much GPU memory do you have? If it's not more than 0.5 GB, you're out of luck.
If your GPU has more than 0.5 GB, try a different browser; Firefox probably has different limits than Chrome.
Can you create the texture at all? Use gl.texStorage3D and then call gl.getError. Do you get an out-of-memory error, or does it crash right there?
If gl.texStorage3D does not crash, can you upload a little data at a time with gl.texSubImage3D?
I suspect this won't work even if gl.texStorage3D does work, because the browser will still have to allocate 0.5 GB to clear your texture. If it does work, that points to another issue: to upload a texture you need 3x-4x the memory, at least in Chrome.
There's your data in JavaScript: data = new Float32Array(size);
That data gets sent to the GPU process: gl.texSubImage3D (or any other texSub or texImage command).
The GPU process sends that data to the driver: glTexSubImage3D(...) in C++.
Whether the driver needs 1 or 2 copies I have no idea. It's possible it keeps a copy in RAM and uploads one to the GPU; it keeps the copy so it can re-upload the data if it needs to swap it out to make room for something else. Whether or not this happens is up to the driver.
Also note that, while I don't think this is the issue, the driver is allowed to expand the texture to RGBA32F, which would need 2 GB. It's probably not doing this, but I know that in the past certain formats were emulated.
Note: texImage potentially takes more memory than texStorage because the semantics of texImage mean the driver can't actually create the texture until just before you draw, since it has no idea whether you're going to add mip levels later. With texStorage, on the other hand, you tell the driver the exact size and number of mips up front, so it needs no intermediate storage.
function main() {
  const gl = document.createElement('canvas').getContext('webgl2');
  if (!gl) {
    return alert('need webgl2');
  }
  const tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_3D, tex);
  gl.texStorage3D(gl.TEXTURE_3D, 1, gl.R32F, 512, 512, 512);
  log('texStorage3D:', glEnumToString(gl, gl.getError()));
  const data = new Float32Array(512*512*512);
  for (let depth = 0; depth < 512; ++depth) {
    gl.texSubImage3D(gl.TEXTURE_3D, 0, 0, 0, depth, 512, 512, 1, gl.RED, gl.FLOAT, data, 512 * 512 * depth);
  }
  log('texSubImage3D:', glEnumToString(gl, gl.getError()));
}
main();

function glEnumToString(gl, value) {
  return Object.keys(WebGL2RenderingContext.prototype)
    .filter(k => gl[k] === value)
    .join(' | ');
}

function log(...args) {
  const elem = document.createElement('pre');
  elem.textContent = [...args].join(' ');
  document.body.appendChild(elem);
}

Image computation on GPU and value returning

I have a C# project in which I retrieve grey-scale images from cameras and do some computation with the image data. The computations are quite time-consuming, since I need to loop over the whole image several times, and I am doing it all on the CPU.
Now I would like to try to get the evaluation running on the GPU, but I am struggling to achieve that, since I have never done any GPU calculations before.
The software should be able to run on several computers with varying hardware, so CUDA, for example, is not a solution for me, since the code should also run on laptops that only have onboard graphics. After some research I came across Cloo (found it on this project), which seems to be a quite reasonable choice.
So far I have integrated Cloo into my project and tried to get this hello world example running. I guess it is running, since I don't get any exception, but I don't know where I can see the printed output.
For my computations I need to pass the image to the GPU and I also need the x-y coordinates during the computation. So, in C# the computation looks like this:
int a = 0;
for (int y = 0; y < img_height; y++){
    for (int x = 0; x < img_width; x++){
        a += image[x,y] * x * y;
    }
}

int b = 0;
for (int y = 0; y < img_height; y++){
    for (int x = 0; x < img_width; x++){
        b += image[x,y] * (x-a) * y;
    }
}
Now I want these calculations to run on the GPU, and I want to parallelize the y-loop, so that each task runs one x-loop. Then I could take all the resulting a values and add them up before the second loop block starts.
Afterwards I would like to return the values a and b to my C# code and use them there.
So, to wrap up my questions:
Is Cloo a recommendable choice for this task?
What is the best way to pass the image-data (16bit, short-array) and the dimensions (img_width, img_height) to the GPU?
How can I return a value from the GPU? As far as I know kernels are always used as kernel void...
What would be the best way to implement the loops?
I hope my questions are clear and I provided sufficient information to understand my struggles. Any help is appreciated. Thanks in advance.
Let's reverse-engineer the problem by understanding the efficient processing of the "dependency chain" of image[][], image_height, image_width, a, b.
Ad 4) The tandem of identical for-loops has poor performance.
Given the code as defined, there could be just a single loop, with reduced overhead costs, and ideally with cache-aligned, vectorised code as well.
Cache-Naive re-formulation:
int a = 0;
int c = 1;
for ( int y = 0; y < img_height; y++ ){
    for ( int x = 0; x < img_width; x++ ){
        int intermediate = image[x,y] * y; // .SET PROD(i[x,y],y)
        a += x * intermediate;             // .REUSE 1st
        c -= intermediate;                 // .REUSE 2nd
    }
}
int b = a * c; // was my fault upon being in a hurry leaving for weekend :o)
Splitting the code into the tandem loops only increases these overheads and defeats any cache-friendly tricks in the code-performance tweaking.
Ad 3 + 2) The kernel call-signature and CPU-side methods allow this.
OpenCL and Cloo document these details, so nothing magical beyond the documented methods is needed here.
Yet there are latency costs associated with each host-side to device-side (H2D) and device-side to host-side (D2H) transfer. Given that you say the 16-bit 1920x1200 image data are to be re-processed ~10 times in a loop, there is a chance that these latencies need not be paid on every pass through the loop.
The worst performance-killer is a very shallow arithmetic density in the kernel. The problem is that there is simply not much to calculate in the kernel, so the chances for any efficient SIMD / GPU parallel tricks are indeed pretty low.
In this sense, smart vectorised CPU-side code will do much better than a latency-hostile, computationally shallow GPU kernel burdened with the (H2D + D2H) transfer overheads.
Ad 1) Given 2 + 3 and 4 above, 1 may easily stop making sense.
As prototyped, and with additional cache-friendly vectorised tricks, the in-RAM and in-cache vectorised code will have a good chance of beating all OpenCL and mixed-GPU/CPU ad-hoc compiled kernel code and its computing efforts.
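As a concrete illustration of that last point, here is a minimal single-pass C sketch of the same computation (the flat row-major uint16_t layout, the 64-bit accumulators, and the function name are illustrative assumptions; the original project keeps the image in a C# short array):

#include <stdint.h>

// Single-pass sketch of the reformulated computation above, written as plain
// C so the compiler can auto-vectorise it and keep the data in cache.
// Sums are widened to 64-bit here (the original C# version used int).
void computeAB(const uint16_t *img, int width, int height,
               int64_t *outA, int64_t *outB) {
    int64_t a = 0;
    int64_t c = 1;
    for (int y = 0; y < height; ++y) {
        const uint16_t *row = img + (int64_t)y * width;
        for (int x = 0; x < width; ++x) {
            int64_t intermediate = (int64_t)row[x] * y; // PROD(i[x,y], y)
            a += (int64_t)x * intermediate;             // reused for a
            c -= intermediate;                          // reused for c
        }
    }
    *outA = a;
    *outB = a * c; // same recombination as above: b = a * c
}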

Can dispatch overhead be more expensive than the actual thread work?

Recently I have been thinking about possibilities for hard optimization; I mean the kind of optimization where you sometimes hardcode a loop of 3 iterations just to gain a little something.
So one thought came to my mind. Imagine we have a buffer of 1024 elements and we want to multiply every single element by 2. We create a simple kernel to which we pass the buffer, the outBuffer, their size (to check whether we are out of bounds), and [[thread_position_in_grid]]. Then we just do a simple multiplication and write the result to the other buffer.
It will look a bit like this:
kernel void multiplyBy2(constant float* in [[buffer(0)]],
                        device float* out [[buffer(1)]],
                        constant Uniforms& uniforms [[buffer(2)]],
                        uint gid [[thread_position_in_grid]])
{
    if (gid >= uniforms.buffer_size) { return; }
    out[gid] = in[gid] * 2.0;
}
What I'm wondering is whether the actual thread work is still worth the overhead produced by dispatching it.
Would it be more effective to, for example, dispatch four times fewer threads, each doing something like this:
out[gid * 4 + 0] = in[gid * 4 + 0] * 2.0;
out[gid * 4 + 1] = in[gid * 4 + 1] * 2.0;
out[gid * 4 + 2] = in[gid * 4 + 2] * 2.0;
out[gid * 4 + 3] = in[gid * 4 + 3] * 2.0;
so that each thread works a little bit longer? Or is it better to make threads as thin as possible?
Yes, and this is true not merely in contrived examples, but in some real-world scenarios too.
For extremely simple kernels like yours, the dispatch overhead can swamp the work to be done, but there's another factor that may have an even bigger effect on performance: sharing fetched data and intermediate results.
If you have a kernel that, for example, reads the 3x3 neighborhood of a pixel from an input texture and writes the average to an output texture, you could share the fetched texture data and partial sums between adjacent pixels by operating on more than one pixel in your kernel function and reducing the total number of threads you dispatch.
Perhaps this sates your curiosity. For any practical application, Scott Hunter is right that you should profile on all target devices before and after optimizing.

How would one write a multiplication of double values in NEON assembly?

The line in question is pretty contained:
w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1]
Considering these variables are double (but I can downgrade to float), I think I can pass one value per register? Would it be more efficient to use the 2x2 matrix W directly?
EDIT 1:
This line is inside a loop that is fired hundreds of times per second and has real-time requirements. Instruments says this line takes 60% of the time of the loop.
EDIT 2:
These are the loops I'm talking about:
for (int x=startingX; x<endingX; ++x)
{
    for (int y=startingY; y<endingY; ++y)
    {
        Matx21d position(x,y);

        // warp patch
        uint8_t *data;
        [self backwardWarpPatchWithWarpingMatrix:warpingMatrix withWarpData:&data withReferenceImage:_initialView withCenter:position];

        // check that the backward patch was successful
        if (!data)
            continue;

        // calculate zero mean (on the patch) sum of squared differences
        int ssd = [self computeZMSSDScoreWithX:x withY:y withCurrentTargetPatch:data];

        if (fabs(ssd) < bestSSD)
        {
            bestPosition = position;
            bestSSD = ssd;
        }
    }
}
backwardWarpPatchWithWarpingMatrix:
Matx22d warpingMatrixInverse = warpingMatrix.inv();
double wmi0 = warpingMatrixInverse(0,0), wmi1 = warpingMatrixInverse(0,1),
       wmi2 = warpingMatrixInverse(1,0), wmi3 = warpingMatrixInverse(1,1);
if (isnan(wmi0))
{
    warpingMatrixInverse = Matx22d::eye();
}

// Perform the warp on a larger patch.
int LEVEL_REF = 0, halfPatchSize = PATCH_SIZE/2;
Matx21d centerInLevel = center * (1.0 / (1<<LEVEL_REF));

__block Mat warped(PATCH_SIZE, PATCH_SIZE, CV_8UC1);
dispatch_apply(PATCH_SIZE, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t y)
{
    for (int x=0; x<PATCH_SIZE; ++x)
    {
        double pp0 = x - halfPatchSize, pp1 = (double)y - halfPatchSize;
        Matx21d multiplication(wmi0 * pp0 + wmi1 * pp1, wmi2 * pp0 + wmi3 * pp1);
        Matx21d px(multiplication(0) + centerInLevel(0), multiplication(1) + centerInLevel(1));
        double warpedPixel = [self interpolatePointInImage:referenceImage withU:px(0) withV:px(1)];
        warped.at<uchar>(y,x) = (uint8_t)warpedPixel;
    }
});
computeReferencePatchScores:
int x = (int)u;
int y = (int)v;

float subpixX = u - x,
      subpixY = v - y,
      oneMinusSubpixX = 1.0 - subpixX,
      oneMinusSubpixY = 1.0 - subpixY;

float w00 = oneMinusSubpixX * oneMinusSubpixY,
      w01 = oneMinusSubpixX * subpixY,
      w10 = subpixX * oneMinusSubpixY,
      w11 = 1.0f - w00 - w01 - w10;

const int stride = (int)image.step.p[0];
uchar* ptr = image.data + y * stride + x;

return w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1];
You typically don't translate a single line of code into assembly. For it to be worth writing in assembly, you have to first assume that you can generate better assembly than the compiler will. Sometimes that's true for vectorized code on NEON, but it's usually because you have special knowledge about a complex loop. You're unlikely to beat the compiler significantly on a single line of code (and will likely lose). Is this line part of a loop that you've profiled and identified as a major bottleneck? Have you already tried Accelerate? Have you analyzed the assembly the compiler is generating and found mistakes that it's making?
Trying to do this in ObjC++ is very inefficient. ObjC++ is a glue language for tying together C++ and ObjC; doing both in the same file imposes several performance costs, especially with ARC. Calling an ObjC method inside of a performance-critical inner-loop is very expensive in any case (even if there weren't mixed-in C++). You should never do any kind of function call (least of all an ObjC method dispatch) inside of a tight inner-loop. It's not clear where you're actually calling computeReferencePatchScores. The use of GCD here is probably hurting you more than helping (since it prevents the compiler from applying certain vector optimizations).
This is all to say: how a particular line of code is being compiled into assembly is by far the least of your problems in this code. Its structure is fighting clang's optimizer.
Step one is to step back and ask what computation you want to execute, and then read through the Core Image Programming Guide and the vImage Programming Guide and verify that it isn't already available. You might also look over OpenGL ES, but OpenGL is often a whole approach to drawing (so it's a bit more of a commitment). It looks like you're already using OpenCV, so make sure it doesn't have available functions to do what you want. (Most of what I see in there looks like stuff built into both OpenCV and vImage.)
The simplest way to improve performance without moving to more powerful frameworks is to move the entire loop into a single C++ function. Then the optimizer can see all the code and apply vector operations on its own. But the next step is to make use of the high-level high-performance frameworks already available.
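To make that concrete, here is a sketch of what the bilinear fetch could look like as a plain static-inline C function that the optimizer can see into and inline at the call site (the name, the float precision, and the row-major uint8_t layout are illustrative assumptions, not the original project's API):

#include <stdint.h>

// Sketch only: the bilinear fetch as a plain static-inline C function, so the
// compiler can inline it into the search loop instead of dispatching an ObjC
// method per pixel.
static inline float interpolatePixel(const uint8_t *image, int stride,
                                     float u, float v)
{
    const int x = (int)u;
    const int y = (int)v;
    const float subpixX = u - (float)x;
    const float subpixY = v - (float)y;
    const float w00 = (1.0f - subpixX) * (1.0f - subpixY);
    const float w01 = (1.0f - subpixX) * subpixY;
    const float w10 = subpixX * (1.0f - subpixY);
    const float w11 = 1.0f - w00 - w01 - w10;
    const uint8_t *ptr = image + y * stride + x;
    return w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride + 1];
}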
In any case, you'll want to sit down and carefully work through exactly the calculations you need to perform (I usually do this by hand on paper). Make sure you're not duplicating anything, that you need every calculation you're performing, and that each change you make still generates the same result.
This looks to be a 2x2 convolution. If the data set is large, then vImageConvolve_PlanarF with a 3x3 kernel with some zero padding in it will do the job. It tries to skip work on kernel elements that are 0. You would need to convert the data set to single precision.
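For the fixed-weight case, a hedged sketch of such a vImageConvolve_PlanarF call is shown below; the buffer setup is omitted and the kernel orientation is an assumption (depending on vImage's kernel convention, the 2x2 block may need to be mirrored about the centre):

#include <Accelerate/Accelerate.h>

// Sketch only: apply fixed bilinear weights as a 3x3 kernel with zero
// padding. The centre lines up with ptr[0]; w10 weights the pixel to the
// right (ptr[1]), w01 the pixel below (ptr[stride]), w11 below-right.
static vImage_Error convolveWithFixedWeights(const vImage_Buffer *src,
                                             const vImage_Buffer *dst,
                                             float w00, float w01,
                                             float w10, float w11)
{
    const float kernel[9] = {
        0.0f, 0.0f, 0.0f,
        0.0f, w00,  w10,
        0.0f, w01,  w11,
    };
    return vImageConvolve_PlanarF(src, dst,
                                  NULL,   // let vImage allocate temp storage
                                  0, 0,   // no region-of-interest offset
                                  kernel, 3, 3,
                                  0.0f,   // background colour (unused with edge-extend)
                                  kvImageEdgeExtend);
}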
If the data set is small, then you are probably stuck with scalar code performance. Inline the function if you can. Perhaps you can figure out how to aggregate a bunch of these together to take advantage of a heavier duty high performance routine.
However, if the weights change from pixel to pixel, then a convolution isn't going to work. You may look instead at the N-dimensional lookup table feature in vImage/Transform.h, if your data set is not huge.
I am a bit skeptical that the time is really spent just in that line. It is best to look at the assembly view in instruments to see where the samples really land.

CUDA coalesced access to global memory

I have read the CUDA programming guide, but I missed one thing. Let's say I have an array of 32-bit ints in global memory and I want to copy it to shared memory with coalesced access.
The global array has indices from 0 to 1023, and let's say I have 4 blocks, each with 256 threads.
__shared__ int sData[256];
When is coalesced access performed?
1.
sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y];
Addresses in global memory are copied from 0 to 255, each warp of 32 threads at a time, so is it ok here?
2.
sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y + someIndex];
If someIndex is not a multiple of 32, is it not coalesced? Misaligned addresses? Is that correct?
What you want ultimately depends on whether your input data is a 1D or 2D array, and whether your grid and blocks are 1D or 2D. The simplest case is both 1D:
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];
This is coalesced. The rule of thumb I use is that the most rapidly varying coordinate (the threadIdx) is added on as offset to the block offset (blockDim * blockIdx). The end result is that the indexing stride between threads in the block is 1. If the stride gets larger, then you lose coalescing.
The simple rule (on Fermi and later GPUs) is that if the addresses for all threads in a warp fall into the same aligned 128-byte range, then a single memory transaction will result (assuming caching is enabled for the load, which is the default). If they fall into two aligned 128-byte ranges, then two memory transactions result, etc.
On GT2xx and earlier GPUs, it gets more complicated. But you can find the details of that in the programming guide.
Additional examples:
Not coalesced:
shmem[threadIdx.x] = gmem[blockDim.x + blockIdx.x * threadIdx.x];
Not coalesced, but not too bad on GT200 and later:
stride = 2;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
Not coalesced at all:
stride = 32;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
Coalesced, 2D grid, 1D block:
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.x] = gmem[blockIdx.y * elementPitch +
                          blockIdx.x * blockDim.x + threadIdx.x];
Coalesced, 2D grid and block:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];
Your indexing in case 1 is wrong (or so intentionally strange that it seems wrong); some blocks access the same element from every thread, so there is no way to get coalesced access in those blocks.
Proof by example:
Grid = dim(2,2,0)
t(blockIdx.x, blockIdx.y)

// complete block reads at 0
t(0,0) -> sData[threadIdx.x] = gData[0];
// complete block reads at 2
t(0,1) -> sData[threadIdx.x] = gData[2];
// definitely coalesced
t(1,0) -> sData[threadIdx.x] = gData[threadIdx.x];
// not coalesced, since 2 is not a multiple of half the warp size (16)
t(1,1) -> sData[threadIdx.x] = gData[threadIdx.x + 2];
So its a "luck" game if a block is coalesced, so in general No
But coalesced memory reads rules are not as strict on newer cuda versions as before.
But for compatibility issues you should try to optimise kernels for lowest cuda versions, if it is possible.
Here is some nice source:
http://mc.stanford.edu/cgi-bin/images/0/0a/M02_4.pdf
The rules for which accesses can be coalesced are somewhat complicated and they have changed over time. Each new CUDA architecture is more flexible in what it can coalesce. I would say not to worry about it at first. Instead, do the memory accesses in whatever way is the most convenient and then see what the CUDA profiler says.
Your examples are correct if you intended to use a 1D grid and thread-geometry. I think the indexing you intended to use is [blockIdx.x*blockDim.x + threadIdx.x].
With #1, the 32 threads in a warp execute that instruction 'simultaneously' so their requests, which are sequential and aligned to 128B (32 x 4), are coalesced in both Tesla and Fermi architectures, I believe.
With #2, it is a bit blurry. If someIndex is 1, then it won't coalesce all of the 32 requests in a warp, but it might do partial coalescing. I believe Fermi devices will coalesce the accesses for threads 1-31 in a warp as a part of a 128B sequential segment of memory (and the first 4B, which no thread needs, are wasted). I think Tesla architecture devices would make that an uncoalesced access due to the misalignment, but I am not sure.
With someIndex as, say, 8, Tesla will have 32B aligned addresses, and Fermi might group them as 32B, 64B, and 32B. But the bottom line is, depending on the value of someIndex and the architecture, what happens is blurry, and it won't necessarily be terrible.
