Metal IOS simple passthrough compute kernel takes 10 miliseconds on iphone 5s - ios

I created simple passthrough compute kernel
kernel void filter(texture2d<float, access::read> inTexture [[texture(0)]],
texture2d<float, access::write> outTexture [[texture(1)]],
uint2 gridPos [[ thread_position_in_grid ]]) {
float4 color = inTexture.read(gridPos);
outTexture.write(color, gridPos);
}
Measuring the execution time
[self.timer start];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
CGFloat ms = [self.timer elapse];
Timer class works like this:
- (void)start {
self.startMach = mach_absolute_time();
}
- (CGFloat)elapse {
uint64_t end = mach_absolute_time();
uint64_t elapsed = end - self.startMach;
uint64_t nanosecs = elapsed * self.info.numer / self.info.denom;
uint64_t millisecs = nanosecs / 1000000;
return millisecs;
}
Dispatch call:
static const NSUInteger kGroupSize = 16;
- (MTLSize)threadGroupSize {
return MTLSizeMake(kGroupSize, kGroupSize, 1);
}
- (MTLSize)threadGroupsCount:(MTLSize)threadGroupSize {
return MTLSizeMake(self.provider.texture.width / kGroupSize,
self.provider.texture.height / kGroupSize, 1);
}
[commandEncoder dispatchThreadgroups:threadgroups
threadsPerThreadgroup:threadgroupSize];
gives me 13 ms on 512x512 rgba image and it grows lineary if I perform more passes.
Is this correct? It seems too much overhead for real time application.

Compute kernels are known to have rather high overhead on A7 processors. One thing to consider, though, is that this is basically the least flattering test you can run: a one-shot threadgroup dispatch might take ~2ms to get scheduled, but scheduling of subsequent dispatches can be up to an order of magnitude faster. Additionally there's little chance for latency hiding here. In practice, a much more complex kernel probably wouldn't take substantially longer to execute, and if you can interleave it with whatever rendering you might be doing, you might find performance to be acceptable.

Related

Optimize metal compute shader for image histogram

I have a metal shader that computes an image histogram like this:
#define CHANNEL_SIZE (256)
typedef atomic_uint HistoBuffer[CHANNEL_SIZE];
kernel void
computeHisto(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
device HistoBuffer &histo [[buffer(0)]],
uint2 grid [[thread_position_in_grid]]) {
if (grid.x >= sourceTexture.get_width() || grid.y >= sourceTexture.get_height()) { return; }
half gray = sourceTexture.read(grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&histo[grayvalue], 1, memory_order_relaxed);
}
This works as expected but takes too long (>1ms). I now tried to optimise this by reducing the number of atomic operations. I came up with the following improved code. The idea is to compute local histograms per thread group and add them later atomically into the global hist buffer.
kernel void
computeHisto_fast(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
device HistoBuffer &histo [[buffer(0)]],
uint2 t_pos_grid [[thread_position_in_grid]],
uint2 tg_pos_grid [[ threadgroup_position_in_grid ]],
uint2 t_pos_tg [[ thread_position_in_threadgroup]],
uint t_idx_tg [[ thread_index_in_threadgroup ]],
uint2 t_per_tg [[ threads_per_threadgroup ]]
)
{
threadgroup uint localhisto[CHANNEL_SIZE] = { 0 };
if (t_pos_grid.x >= sourceTexture.get_width() || t_pos_grid.y >= sourceTexture.get_height()) { return; }
half gray = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
localhisto[grayvalue]++;
// wait for all threads in threadgroup to finish
threadgroup_barrier(mem_flags::mem_none);
// copy the thread group result atomically into global histo buffer
if(t_idx_tg == 0) {
for(uint i=0;i<CHANNEL_SIZE;i++) {
atomic_fetch_add_explicit(&histo[i], localhisto[i], memory_order_relaxed);
}
}
}
There are 2 problems:
The improved routine does not yield identical results compared to the first and I currently don't see why ?
The run time didn't improve. in fact it takes 4 times the runtime of the unoptimised version. According to the debugger the for loop is the problem. But I do not understand this, since the number of atomic operation is reduced by 3 orders of magnitude, i.e. the thread group size, here (32x32)=1024.
Anbody who can explain what I am doing wrong here ? Thanks
EDIT: 2019-12-22:
According to Matthijs answer I have changed the local histogram also to atomic operations like this:
threadgroup atomic_uint localhisto[CHANNEL_SIZE] = {0};
half gray = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);
However the result sill is not the same as in the reference implementation above. There must be another severe conceptional bug ???
You'll still need to use atomic operations on the threadgroup memory, since it's still being shared by multiple threads. This should be faster than in your first version because there is less contention for the same locks.
I think the problem is with initializing shared memory, I don't think this definition does the job. Also, threadgroup level memory synchronization is required between zeroing shared memory and atomic update.
As for the device memory update, doing it using a single thread is clearly suboptimal. Updating the whole 256 length histogram in each threadblock can have a huge overhead depending on the size of the threadblock.
A sample I used for a small (16 element) histogram using 8x8 threadblocks:
kernel void gaussian_filter(device const uchar* data,
device atomic_uint* p_hist,
uint2 imageShape [[threads_per_grid]],
uint2 idx [[thread_position_in_grid]],
uint tidx [[thread_index_in_threadgroup]])
{
threadgroup atomic_uint sh_hist[16];
if (tidx < 16)
atomic_store_explicit(sh_hist + tidx, 0, memory_order_relaxed);
threadgroup_barrier(mem_flags::mem_threadgroup);
uint histBin = (uint)data[imageShape[0]*idx[1] + idx[0]]/16;
atomic_fetch_add_explicit(sh_hist + histBin, 1, memory_order_relaxed);
threadgroup_barrier(mem_flags::mem_threadgroup);
if (tidx < 16)
atomic_fetch_add_explicit(p_hist + tidx, atomic_load_explicit(sh_hist + tidx, memory_order_relaxed), memory_order_relaxed);
}

Summing the values in a Webgl2 R32F Texture by generating a MipMap

If I have rendered data into a R32F texture (of 2^18 (~250,000) texels) and I want to compute the sum of these values, is it possible to do this by asking the gpu to generate a mipmap?
(the idea being that the smallest mipmap level would have a single texel that contains the average of all the original texels)
What mipmap settings (clamp, etc) would I use to generate the correct average?
I'm not so good with webgl gymnastics, and would appreciate a snippet of how one would render into a R32F texture the numbers from 1 to 2^18 and then produce a sum over that texture.
For this number of texels, would this approach be faster than trying to transfer the texels back to the cpu and performing the sum in javascript?
Thanks!
There are no settings that define the algorithm used to generate mipmaps. Clamp settings, filter settings have no effect. There's only a hint you can set with gl.hint on whether to prefer quality over performance but a driver has no obligation to even pay attention to that flag. Further, every driver is different. The results of generating mipmaps is one of the differences used to fingerprint WebGL.
In any case if you don't care about the algorithm used and you just want to read the result of generating mipmaps then you just need to attach the last mip to a framebuffer and read the pixel after calling gl.generateMipmap.
You likely wouldn't render into a texture all the numbers from 1 to 2^18 but that's not hard. You'd just draw a single quad 512x512. The fragment shader could look like this
#version 300 es
precision highp float;
out vec4 fragColor;
void main() {
float i = 1. + gl_FragCoord.x + gl_FragCoord.y * 512.0;
fragColor = vec4(i, 0, 0, 0);
}
Of course you could pass in that 512.0 as a uniform if you wanted to work with other sizes.
Rendering to a floating point texture is an optional feature of WebGL2. Desktops support it but as of 2018 most mobile devices do not. Similarly being able to filter a floating point texture is also an optional feature which is also usually not supported on most mobile devices as of 2018 but is on desktop.
function main() {
const gl = document.createElement("canvas").getContext("webgl2");
if (!gl) {
alert("need webgl2");
return;
}
{
const ext = gl.getExtension("EXT_color_buffer_float");
if (!ext) {
alert("can not render to floating point textures");
return;
}
}
{
const ext = gl.getExtension("OES_texture_float_linear");
if (!ext) {
alert("can not filter floating point textures");
return;
}
}
// create a framebuffer and attach an R32F 512x512 texture
const numbersFBI = twgl.createFramebufferInfo(gl, [
{ internalFormat: gl.R32F, minMag: gl.NEAREST },
], 512, 512);
const vs = `
#version 300 es
in vec4 position;
void main() {
gl_Position = position;
}
`;
const fillFS = `
#version 300 es
precision highp float;
out vec4 fragColor;
void main() {
float i = 1. + gl_FragCoord.x + gl_FragCoord.y * 512.0;
fragColor = vec4(i, 0, 0, 0);
}
`
// creates a buffer with a single quad that goes from -1 to +1 in the XY plane
// calls gl.createBuffer, gl.bindBuffer, gl.bufferData
const quadBufferInfo = twgl.primitives.createXYQuadBufferInfo(gl);
const fillProgramInfo = twgl.createProgramInfo(gl, [vs, fillFS]);
gl.useProgram(fillProgramInfo.program);
// calls gl.bindBuffer, gl.enableVertexAttribArray, gl.vertexAttribPointer
twgl.setBuffersAndAttributes(gl, fillProgramInfo, quadBufferInfo);
// tell webgl to render to our texture 512x512 texture
// calls gl.bindBuffer and gl.viewport
twgl.bindFramebufferInfo(gl, numbersFBI);
// draw 2 triangles (6 vertices)
gl.drawElements(gl.TRIANGLES, 6, gl.UNSIGNED_SHORT, 0);
// compute the last mip level
const miplevel = Math.log2(512);
// get the texture twgl created above
const texture = numbersFBI.attachments[0];
// create a framebuffer with the last mip from
// the texture
const readFBI = twgl.createFramebufferInfo(gl, [
{ attachment: texture, level: miplevel },
]);
gl.bindTexture(gl.TEXTURE_2D, texture);
// try each hint to see if there is a difference
['DONT_CARE', 'NICEST', 'FASTEST'].forEach((hint) => {
gl.hint(gl.GENERATE_MIPMAP_HINT, gl[hint]);
gl.generateMipmap(gl.TEXTURE_2D);
// read the result.
const result = new Float32Array(4);
gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.FLOAT, result);
log('mip generation hint:', hint);
log('average:', result[0]);
log('average * count:', result[0] * 512 * 512);
log(' ');
});
function log(...args) {
const elem = document.createElement('pre');
elem.textContent = [...args].join(' ');
document.body.appendChild(elem);
}
}
main();
pre {margin: 0}
<script src="https://twgljs.org/dist/4.x/twgl-full.min.js"></script>
Note I used twgl.js to make the code less verbose. If you don't know how to make a framebuffer and attach textures or how to setup buffers and attributes, compile shaders, and set uniforms then you're asking way too broad a question and I suggest you go read some tutorials.
Let me point how there's no guarantee this method is faster than others. First off it's up to the driver. It's possible the driver does this in software (though unlikely).
one obvious speed up is to use RGBAF32 and let the code do 4 values at a time then read all 4 channels (R,G,B,A) at the end and sum those.
Also since you only care about the last 1x1 pixel mip your asking the code to render a lot more pixels than a more direct method. Really you only need to render 1 pixel, the result. But for this example of 2^18 values which is a 512x512 texture that means a 256x526, a 128x128, a 64x64, a 32x32, a 16x16, a 8x8, a 4x4, and a 2x2 mip are all allocated and computed which is arguably wasted time. In fact the spec says all mips are generated from the first mip. Of course a driver is free to take shortcuts and most likely generates mip N from mip N-1 as the result will be similar but that's not how the spec is defined. But, even generating one mip from the previous is 87380 values computed you didn't care about.
I'm only guessing it would be faster to generate in larger chucks than 2x2. At the same time there are texture caches and if I understand correctly they usually cache a rectangular part of a texture so that reading 4 values from a mip is fast. When you have a texture cache miss it can really kill your performance. So, if your chunks are too large it's possible you'd have lots of cache misses. You'd basically have to test and each GPU would likely show different performance characteristics.
Yet another speed up would be to consider using multiple drawing buffers then you can write 16 to 32 values per fragment shader iteration instead of just 4.

Filling Float buffer in Metal

Problem:
I need to fill a MTLBuffer of Floats with a constant value — say 1729.68921. I also need it to be as fast as possible.
Therefore I'm prohibited from filling the buffer on the CPU side (i.e. getting UnsafeMutablePointer<Float> from the MTLBuffer and assigning in serial manner).
My approach
Ideally I'd use MTLBlitCommandEncoder.fill(), however AFAIK it's only capable to fill a buffer with UInt8 values (given that UInt8 is 1 byte long and Float is 4 bytes long, I can't specify arbitrary value of my Float constant).
So far I can see only 2 options left, but both seem to be overkill:
create another buffer B filled with the constant value and copy its contents into my buffer via MTLBlitCommandEncoder
create a kernel function that'd fill the buffer
Questions
What's the fastest way of filling MTLBuffer of Floats with a
constant value?
Using a compute shader that writes to multiple buffer elements from each thread was the fastest approach in my experiments. This is hardware-dependent, so you should test on the full range of devices you expect the app to be deployed on.
I wrote two compute shaders: one that fills 16 contiguous array elements without checking against the array bounds, and one that sets a single array element after checking against the length of the buffer:
kernel void fill_16_unchecked(device float *buffer [[buffer(0)]],
constant float &value [[buffer(1)]],
uint index [[thread_position_in_grid]])
{
for (int i = 0; i < 16; ++i) {
buffer[index * 16 + i] = value;
}
}
kernel void single_fill_checked(device float *buffer [[buffer(0)]],
constant float &value [[buffer(1)]],
constant uint &buffer_length [[buffer(2)]],
uint index [[thread_position_in_grid]])
{
if (index < buffer_length) {
buffer[index] = value;
}
}
If you know that your buffer count will always be a multiple of the thread execution width multiplied by the number of elements you set in the loop, you can just use the first function. The second function is a fallback for when you might dispatch a grid that would otherwise overrun the buffer.
Once you have two pipelines built from these functions, you can dispatch the work with a pair of compute commands as follows:
NSInteger executionWidth = [unchecked16Pipeline threadExecutionWidth];
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setBuffer:buffer offset:0 atIndex:0];
[computeEncoder setBytes:&value length:sizeof(float) atIndex:1];
if (bufferCount / (executionWidth * 16) != 0) {
[computeEncoder setComputePipelineState:unchecked16Pipeline];
[computeEncoder dispatchThreadgroups:MTLSizeMake(bufferCount / (executionWidth * 16), 1, 1)
threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}
if (bufferCount % (executionWidth * 16) != 0) {
int remainder = bufferCount % (executionWidth * 16);
[computeEncoder setComputePipelineState:checkedSinglePipeline];
[computeEncoder setBytes:&bufferCount length:sizeof(bufferCount) atIndex:2];
[computeEncoder dispatchThreadgroups:MTLSizeMake((remainder / executionWidth) + 1, 1, 1)
threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}
[computeEncoder endEncoding];
Note that doing the work in this manner will not necessarily be faster than the naive approach that just writes one element per thread. In my tests, it was 40% faster on A8, roughly equivalent on A10, and 2-3x slower (!) on A9. Always test with your own workload.

How to Speed Up Metal Code for iOS/Mac OS

I'm trying to implement code in Metal that performs a 1D convolution between two vectors with lengths. I've implemented the following which works correctly
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
const device int& dataSize [[ buffer(1) ]],
const device float *filterVector [[ buffer(2) ]],
const device int& filterSize [[ buffer(3) ]],
device float *outVector [[ buffer(4) ]],
uint id [[ thread_position_in_grid ]]) {
int outputSize = dataSize - filterSize + 1;
for (int i=0;i<outputSize;i++) {
float sum = 0.0;
for (int j=0;j<filterSize;j++) {
sum += dataVector[i+j] * filterVector[j];
}
outVector[i] = sum;
}
}
My problem is it takes about 10 times longer to process (computation + data transfer to/from GPU) the same data using Metal than in Swift on a CPU. My question is how do I replace the inner loop with a single vector operation or is there another way to speed up the above code?
The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we'll invoke it for each element in the data vector. The kernel function simplifies to this:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
const constant int &dataSize [[ buffer(1) ]],
const constant float *filterVector [[ buffer(2) ]],
const constant int &filterSize [[ buffer(3) ]],
device float *outVector [[ buffer(4) ]],
uint id [[ thread_position_in_grid ]])
{
float sum = 0.0;
for (int i = 0; i < filterSize; ++i) {
sum += dataVector[id + i] * filterVector[i];
}
outVector[id] = sum;
}
In order to dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure that there's enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cause us to waste a small amount of memory and computation, but saves us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.
// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).
let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)
let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.
The following code shows how to render encoded commands in parallel on the GPU using the Objective-C Metal API (the threading code above only divides rendering of the output into grid sections for parallel processing; the calculations are still not performed in parallel). It is what you're referring to in your question, even while it's not exactly what you want. I've provided this answer to help anyone who might have stumbled upon this question, thinking that it was going to provide an answer related to parallel rendering (when, in fact, it does not):
- (void)drawInMTKView:(MTKView *)view
{
dispatch_async(((AppDelegate *)UIApplication.sharedApplication.delegate).cameraViewQueue, ^{
id <CAMetalDrawable> drawable = [view currentDrawable]; //[(CAMetalLayer *)view.layer nextDrawable];
MTLRenderPassDescriptor *renderPassDesc = [view currentRenderPassDescriptor];
renderPassDesc.colorAttachments[0].loadAction = MTLLoadActionClear;
renderPassDesc.colorAttachments[0].clearColor = MTLClearColorMake(0.0,0.0,0.0,1.0);
renderPassDesc.renderTargetWidth = self.texture.width;
renderPassDesc.renderTargetHeight = self.texture.height;
renderPassDesc.colorAttachments[0].texture = drawable.texture;
if (renderPassDesc != nil)
{
dispatch_semaphore_wait(self._inflight_semaphore, DISPATCH_TIME_FOREVER);
id <MTLCommandBuffer> commandBuffer = [self.metalContext.commandQueue commandBuffer];
[commandBuffer enqueue];
// START PARALLEL RENDERING OPERATIONS HERE
id <MTLParallelRenderCommandEncoder> parallelRCE = [commandBuffer parallelRenderCommandEncoderWithDescriptor:renderPassDesc];
// FIRST PARALLEL RENDERING OPERATION
id <MTLRenderCommandEncoder> renderEncoder = [parallelRCE renderCommandEncoder];
[renderEncoder setRenderPipelineState:self.metalContext.renderPipelineState];
[renderEncoder setVertexBuffer:self.metalContext.vertexBuffer offset:0 atIndex:0];
[renderEncoder setVertexBuffer:self.metalContext.uniformBuffer offset:0 atIndex:1];
[renderEncoder setFragmentBuffer:self.metalContext.uniformBuffer offset:0 atIndex:0];
[renderEncoder setFragmentTexture:self.texture
atIndex:0];
[renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
vertexStart:0
vertexCount:4
instanceCount:1];
[renderEncoder endEncoding];
// ADD SECOND, THIRD, ETC. PARALLEL RENDERING OPERATION HERE
.
.
.
// SUBMIT ALL RENDERING OPERATIONS IN PARALLEL HERE
[parallelRCE endEncoding];
__block dispatch_semaphore_t block_sema = self._inflight_semaphore;
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
dispatch_semaphore_signal(block_sema);
}];
if (drawable)
[commandBuffer presentDrawable:drawable];
[commandBuffer commit];
[commandBuffer waitUntilScheduled];
}
});
}
In the above example, you would duplicate the renderEncoder-related for each calculation you want to perform in parallel. I do not see how this would be of benefit to you in your code example, as one operation appears to be dependent on another. Probably, then, the best you could hope for is the code provided to you by warrenm, even though that doesn't really qualify as parallel rendering, though.

Why Global memory version is faster than constant memory in my CUDA code?

I am working on some CUDA program and I wanted to speed up computation using constant memory but it turned that using constant memory makes my code ~30% slower.
I know that constant memory is good at broadcasting reads to whole warps and I thought that my program could take an advantage of it.
Here is constant memory code:
__constant__ float4 constPlanes[MAX_PLANES_COUNT];
__global__ void faultsKernelConstantMem(const float3* vertices, unsigned int vertsCount, int* displacements, unsigned int planesCount) {
unsigned int blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
unsigned int vertexIndex = __mul24(blockId, blockDim.x) + threadIdx.x;
if (vertexIndex >= vertsCount) {
return;
}
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
//__syncthreads();
for (unsigned int planeIndex = 0; planeIndex < planesCount; ++planeIndex) {
float4 plane = constPlanes[planeIndex];
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
}
displacements[vertexIndex] = displacementSteps;
}
Global memory code is the same but it have one parameter more (with pointer to array of planes) and uses it instead of global array.
I thought that those first global memory reads
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
may cause "desynchronization" of threads and then they will not take an advantage of broadcasting of constant memory reads so I've tried to call __syncthreads(); before reading constant memory but it did not changed anything.
What is wrong? Thanks in advance!
System:
CUDA Driver Version: 5.0
CUDA Capability: 2.0
Parameters:
number of vertices: ~2.5 millions
number of planes: 1024
Results:
constant mem version: 46 ms
global mem version: 35 ms
EDIT:
So I've tried many things how to make the constant memory faster, such as:
1) Comment out the two global memory reads to see if they have any impact and they do not. Global memory was still faster.
2) Process more vertices per thread (from 8 to 64) to take advantage of CM caches. This was even slower then one vertex per thread.
2b) Use shared memory to store displacements and vertices - load all of them at beginning, process and save all displacements. Again, slower than shown CM example.
After this experience I really do not understand how the CM read broadcasting works and how can be "used" correctly in my code. This code probably can not be optimized with CM.
EDIT2:
Another day of tweaking, I've tried:
3) Process more vertices (8 to 64) per thread with memory coalescing (every thread goes with increment equal to total number of threads in system) -- this gives better results than increment equal to 1 but still no speedup
4) Replace this if statement
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
which is giving 'unpredictable' results with little bit of math to avoid branching using this code:
float dist = v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w;
int distInt = (int)(dist * (1 << 29)); // distance is in range (0 - 2), stretch it to int range
int sign = 1 | (distInt >> (sizeof(int) * CHAR_BIT - 1)); // compute sign without using ifs
displacementSteps += sign;
Unfortunately this is a lot of slower (~30%) than using the if so ifs are not that big evil as I thought.
EDIT3:
I am concluding this question that this problem probably can not be improved by using constant memory, those are my results*:
*Times reported as median from 15 independent measurements. When constant memory was not large enough for saving all planes (4096 and 8192), kernel was invoked multiple times.
Although a compute capability 2.0 chip has 64k of constant memory, each of the multi-processors has only 8k of constant-memory cache. Your code has each thread requiring access to all 16k of the constant memory, so you are losing performance through cache misses. To effectively use constant memory for the plane data, you will need to restructure your implementation.

Resources