CUDA coalesced access to global memory for a two-dimensional block

For 1D cases I've pretty much understood the whole coalesced access requirement of global memory in CUDA.
However I'm a bit stuck for the two-dimensional case (that is, we have a 2D grid made of 2D blocks).
Suppose I have a vector in_vector and in my kernel I want to access it in a coalesced manner. Like so:
__global__ void my_kernel(float* out_matrix, float* in_vector, int size)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
// ...
float vx = in_vector[i]; // This is good. Here we have coalesced access
float vy = in_vector[j]; // Not sure about this. All threads in my warp access the same global address. (See explanation)
// ...
// Do some calculations... Obtain result
}
In my understanding, for this 2D case the threads inside a block are "arranged" in a column-major fashion. E.g., assuming a (threadIdx.x, threadIdx.y) notation:
the first warp would be: (0, 0), (1, 0), (2, 0), ..., (31, 0),
the second warp would be: (0, 1), (1, 1), (2, 1), ..., (31, 1),
and so on...
In this case calling in_vector[i] gives us a coalesced access because each consecutive thread in the same warp will access consecutive addresses. However calling in_vector[j] seems like a bad idea, as each consecutive thread will access the same address in global memory (for example, all the threads in warp 0 will access in_vector[0], which would give us 32 different global memory requests).
Did I understand this correctly? If so, how can I make a coalesced access to global memory using in_vector[j]?

What you have shown in your question is only correct for certain block sizes. Your "coalesced" access:
int i = blockIdx.x * blockDim.x + threadIdx.x;
float vx = in_vector[i];
will result in coalesced access of in_vector from global memory only when blockDim.x is greater than or equal to 32. Even in the coalesced case, each thread within a block which shares the same threadIdx.x value reads the same word from global memory, which seems to be counter-intuitive and wasteful.
The correct way to ensure reads are unique per thread and coalesced is to calculate the thread number within the block and an offset within the grid, perhaps something like:
int tid = threadIdx.x + blockDim.x * threadIdx.y; // must use column major order
int bid = blockIdx.x + gridDim.x * blockIdx.y; // can either use column or row major
int offset = (blockDim.x * blockDim.y) * bid; // block id * threads per block
float vx = in_vector[tid + offset];
If your intention really isn't to read a unique value per thread, then you can save a lot of memory bandwidth and achieve coalescing using shared memory, something like this:
__shared__ float vx[32], vy[32];
int tid = threadIdx.x + blockDim.x * threadIdx.y;
if (tid < 32) {
vx[tid] = in_vector[blockIdx.x * blockDim.x + tid];
vy[tid] = in_vector[blockIdx.y * blockDim.y + tid];
}
__syncthreads();
and you will get a single warp reading unique values into shared memory once. Other threads can then read values from shared memory without requiring any further global memory access. Note that in the above example I followed the conventions of your code, even if it doesn't necessarily make that much sense to read in_vector twice in that way.
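Putting it together with the question's kernel signature, a minimal sketch of that idea (assuming blockDim.x and blockDim.y are both at most 32, so the two 32-element tiles cover the whole block) could look like this:
__global__ void my_kernel(float* out_matrix, float* in_vector, int size)
{
    __shared__ float vx[32], vy[32];
    int tid = threadIdx.x + blockDim.x * threadIdx.y;
    if (tid < 32) {
        vx[tid] = in_vector[blockIdx.x * blockDim.x + tid];
        vy[tid] = in_vector[blockIdx.y * blockDim.y + tid];
    }
    __syncthreads();
    // Every thread now reads its pair of values from shared memory,
    // so no further global memory traffic is needed.
    float my_vx = vx[threadIdx.x]; // same value as in_vector[i] in the question
    float my_vy = vy[threadIdx.y]; // same value as in_vector[j] in the question
    // ... do some calculations with my_vx and my_vy, write the result to out_matrix ...
}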

Related

Fast vectorized pixel-wise operations on images

I want to measure the degree of similarity between two same-sized grayscale images using the mean square error. I can't use any framework which is not a part of the macOS SDK (e.g. OpenCV, Eigen). A simple realization of this algorithm without vectorization looks like this:
vImage_Buffer imgA;
vImage_Buffer imgB;
NSUInteger mse = 0;
unsigned char *pxlsA = (unsigned char *)imgA.data;
unsigned char *pxlsB = (unsigned char *)imgB.data;
for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
NSUInteger d = pxlsA[i] - pxlsB[i];
mse += d * d;
}
Is there some way to do this without a loop, in a more vectorized way? Maybe something like:
mse = ((imgA - imgB) ^ 2).sum();
The answer to this question lies in the vDSP library, which is part of the macOS SDK.
https://developer.apple.com/documentation/accelerate/vdsp
vDSP - Perform basic arithmetic operations and common digital signal processing routines on large vectors.
In my situation my vectors are not really big, but still.
Firstly, you need to convert unsigned char * to float *, and, importantly, I don't know how to do this without a loop. Then you need two vDSP functions: vDSP_vsbsbm and vDSP_sve.
vDSP_vsbsbm - Multiplies the difference of two single-precision vectors by a second difference of two single-precision vectors.
vDSP_sve - Calculates the sum of values in a single-precision vector.
So the final code looks like this:
float *fpxlsA = (float *)malloc(imgA.height * imgA.width * sizeof(float));
float *fpxlsB = (float *)malloc(imgB.height * imgB.width * sizeof(float));
float *output = (float *)malloc(imgB.height * imgB.width * sizeof(float));
for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
fpxlsA[i] = (float)(pxlsA[i]);
fpxlsB[i] = (float)(pxlsB[i]);
}
vDSP_vsbsbm(fpxlsA, 1, fpxlsB, 1, fpxlsA, 1, fpxlsB, 1, output, 1, imgA.height * imgA.width);
float sum;
vDSP_sve(output, 1, &sum, imgA.height * imgA.width);
free(output);
free(fpxlsA);
free(fpxlsB);
So, this code does exactly what I wanted, in a more vectorized form. But the result isn't good enough. Comparing the performance of the loop approach and the vDSP approach, vDSP is two times faster if there isn't any additional memory allocation. But in reality, where the additional memory allocation takes place, the loop approach is slightly faster.
This appears to be part of Mac OS: https://developer.apple.com/documentation/accelerate
A nice and fast way to loop using pointer arithmetic would be as follows:
int d;
size_t i = imgA.height * imgA.width;
while ( i -- )
{
d = ( int )(*pxlsA++) - ( int )(*pxlsB++);
mse += d * d;
}
EDIT
Oops: since those are unsigned chars and since we calculate the difference, we need to use signed integers to do so.
And another edit: you must use pxls... here; I don't know what img... is.

What does the "simd" prefix mean in SceneKit?

There is a SCNNode category named SCNNode(SIMD), which declares some properties like simdPosition, simdRotation and so on. It seems these are duplicated properties of the original/normal properties position and rotation.
@property(nonatomic) simd_float3 simdPosition API_AVAILABLE(macos(10.13), ios(11.0), tvos(11.0), watchos(4.0));
@property(nonatomic) simd_float4 simdRotation API_AVAILABLE(macos(10.13), ios(11.0), tvos(11.0), watchos(4.0));
What's the difference between position and simdPosition? What does the prefix "simd" mean exactly?
SIMD: Single Instruction Multiple Data
SIMD instructions allow you to perform the same operation on multiple values at the same time.
Let's see an example
Serial Approach (NO SIMD)
We have these 4 Int32 values
let x0: Int32 = 10
let y0: Int32 = 20
let x1: Int32 = 30
let y1: Int32 = 40
Now we want to sum the 2 x and the 2 y values, so we write
let sumX = x0 + x1 // 40
let sumY = y0 + y1 // 60
In order to perform the 2 previous sums the CPU needs to
load x0 and x1 in memory and add them
load y0 and y1 in memory and add them
So the result is obtained with 2 operations.
I created some graphics (Step 1 and Step 2) to better show the idea.
SIMD
Let's now see how SIMD works.
First of all, we need the input values stored in the proper SIMD format, so
let x = simd_int2(10, 20)
let y = simd_int2(30, 40)
As you can see, the previous x and y are vectors. In fact, both x and y contain 2 components.
Now we can write
let sum = x + y
Let's see what the CPU does in order to perform the previous operations
load x and y in memory and add them
That's it!
Both components of x and both components of y are processed at the same time.
Parallel Programming
We are NOT talking about concurrent programming; this is real parallel programming.
As you can imagine, in certain operations the SIMD approach is way faster than the serial one.
Scene Kit
Let's see now an example in SceneKit
We want to add 10 to the x, y and z components of all the direct descendants of the scene node.
Using the classic serial approach we can write
for node in scene.rootNode.childNodes {
node.position.x += 10
node.position.y += 10
node.position.z += 10
}
Here a total of childNodes.count * 3 operations is executed.
Let's see now how we can convert the previous code to SIMD instructions
let delta = simd_float3(10)
for node in scene.rootNode.childNodes {
node.simdPosition += delta
}
This code is much faster than the previous one. I am not sure whether it is 2x or 3x faster but, believe me, it's way better.
Wrap up
If you need to perform the same operation several times on different values, just use the SIMD properties :)
SIMD is a small library built on top of vector types that you can import from <simd/simd.h>. It allows for more expressive and more performant code.
For instance using SIMD you can write
simd_float3 result = a + 2.0 * b;
instead of
SCNVector3 result = SCNVector3Make(a.x + 2.0 * b.x, a.y + 2.0 * b.y, a.z + 2.0 * b.z);
In Objective-C you cannot overload methods. That is, you cannot have both
@property(nonatomic) SCNVector3 position;
@property(nonatomic) simd_float3 position API_AVAILABLE(macos(10.13), ios(11.0), tvos(11.0), watchos(4.0));
The new SIMD-based API needed a different name, and that's why SceneKit exposes simdPosition.

Filling Float buffer in Metal

Problem:
I need to fill a MTLBuffer of Floats with a constant value — say 1729.68921. I also need it to be as fast as possible.
Therefore I'm prohibited from filling the buffer on the CPU side (i.e. getting UnsafeMutablePointer<Float> from the MTLBuffer and assigning in serial manner).
My approach
Ideally I'd use MTLBlitCommandEncoder.fill(), however AFAIK it's only capable of filling a buffer with UInt8 values (given that UInt8 is 1 byte long and Float is 4 bytes long, I can't specify an arbitrary value for my Float constant).
So far I can see only 2 options left, but both seem to be overkill:
create another buffer B filled with the constant value and copy its contents into my buffer via MTLBlitCommandEncoder
create a kernel function that'd fill the buffer
Questions
What's the fastest way of filling an MTLBuffer of Floats with a constant value?
Using a compute shader that writes to multiple buffer elements from each thread was the fastest approach in my experiments. This is hardware-dependent, so you should test on the full range of devices you expect the app to be deployed on.
I wrote two compute shaders: one that fills 16 contiguous array elements without checking against the array bounds, and one that sets a single array element after checking against the length of the buffer:
kernel void fill_16_unchecked(device float *buffer [[buffer(0)]],
constant float &value [[buffer(1)]],
uint index [[thread_position_in_grid]])
{
for (int i = 0; i < 16; ++i) {
buffer[index * 16 + i] = value;
}
}
kernel void single_fill_checked(device float *buffer [[buffer(0)]],
constant float &value [[buffer(1)]],
constant uint &buffer_length [[buffer(2)]],
uint index [[thread_position_in_grid]])
{
if (index < buffer_length) {
buffer[index] = value;
}
}
If you know that your buffer count will always be a multiple of the thread execution width multiplied by the number of elements you set in the loop, you can just use the first function. The second function is a fallback for when you might dispatch a grid that would otherwise overrun the buffer.
Once you have two pipelines built from these functions, you can dispatch the work with a pair of compute commands as follows:
NSInteger executionWidth = [unchecked16Pipeline threadExecutionWidth];
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setBuffer:buffer offset:0 atIndex:0];
[computeEncoder setBytes:&value length:sizeof(float) atIndex:1];
if (bufferCount / (executionWidth * 16) != 0) {
[computeEncoder setComputePipelineState:unchecked16Pipeline];
[computeEncoder dispatchThreadgroups:MTLSizeMake(bufferCount / (executionWidth * 16), 1, 1)
threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}
if (bufferCount % (executionWidth * 16) != 0) {
int remainder = bufferCount % (executionWidth * 16);
[computeEncoder setComputePipelineState:checkedSinglePipeline];
[computeEncoder setBytes:&bufferCount length:sizeof(bufferCount) atIndex:2];
[computeEncoder dispatchThreadgroups:MTLSizeMake((remainder / executionWidth) + 1, 1, 1)
threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}
[computeEncoder endEncoding];
Note that doing the work in this manner will not necessarily be faster than the naive approach that just writes one element per thread. In my tests, it was 40% faster on A8, roughly equivalent on A10, and 2-3x slower (!) on A9. Always test with your own workload.

Why is the global memory version faster than the constant memory version in my CUDA code?

I am working on a CUDA program and I wanted to speed up computation using constant memory, but it turned out that using constant memory makes my code ~30% slower.
I know that constant memory is good at broadcasting reads to whole warps and I thought that my program could take advantage of it.
Here is constant memory code:
__constant__ float4 constPlanes[MAX_PLANES_COUNT];
__global__ void faultsKernelConstantMem(const float3* vertices, unsigned int vertsCount, int* displacements, unsigned int planesCount) {
unsigned int blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
unsigned int vertexIndex = __mul24(blockId, blockDim.x) + threadIdx.x;
if (vertexIndex >= vertsCount) {
return;
}
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
//__syncthreads();
for (unsigned int planeIndex = 0; planeIndex < planesCount; ++planeIndex) {
float4 plane = constPlanes[planeIndex];
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
}
displacements[vertexIndex] = displacementSteps;
}
The global memory code is the same, but it has one more parameter (a pointer to the array of planes) and uses that instead of the constant array.
I thought that those first global memory reads
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
may cause "desynchronization" of the threads, so that they no longer take advantage of the broadcasting of constant memory reads, so I tried to call __syncthreads(); before reading constant memory, but it did not change anything.
What is wrong? Thanks in advance!
System:
CUDA Driver Version: 5.0
CUDA Capability: 2.0
Parameters:
number of vertices: ~2.5 millions
number of planes: 1024
Results:
constant mem version: 46 ms
global mem version: 35 ms
EDIT:
So I've tried many things to make the constant memory version faster, such as:
1) Comment out the two global memory reads to see if they have any impact and they do not. Global memory was still faster.
2) Process more vertices per thread (from 8 to 64) to take advantage of CM caches. This was even slower than one vertex per thread.
2b) Use shared memory to store displacements and vertices - load all of them at beginning, process and save all displacements. Again, slower than shown CM example.
After this experience I really do not understand how the CM read broadcasting works and how can be "used" correctly in my code. This code probably can not be optimized with CM.
EDIT2:
Another day of tweaking, I've tried:
3) Process more vertices (8 to 64) per thread with memory coalescing (every thread strides by an increment equal to the total number of threads in the system) -- this gives better results than an increment of 1, but still no speedup
4) Replace this if statement
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
which gives 'unpredictable' results, with a little bit of math to avoid branching, using this code:
float dist = v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w;
int distInt = (int)(dist * (1 << 29)); // distance is in range (0 - 2), stretch it to int range
int sign = 1 | (distInt >> (sizeof(int) * CHAR_BIT - 1)); // compute sign without using ifs
displacementSteps += sign;
Unfortunately this is a lot slower (~30%) than using the if, so ifs are not as big an evil as I thought.
EDIT3:
I am concluding this question: this problem probably cannot be improved by using constant memory. These are my results*:
*Times reported as median from 15 independent measurements. When constant memory was not large enough for saving all planes (4096 and 8192), kernel was invoked multiple times.
Although a compute capability 2.0 chip has 64k of constant memory, each of the multi-processors has only 8k of constant-memory cache. Your code has each thread requiring access to all 16k of the constant memory, so you are losing performance through cache misses. To effectively use constant memory for the plane data, you will need to restructure your implementation.
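One possible restructuring (not part of the original answer, just a sketch assuming the planes are passed as an ordinary global-memory pointer) is to stage the plane data through shared memory in tiles small enough to stay resident, so each loaded tile is reused by every thread in the block:
#define PLANE_TILE 256 // assumed tile size: 256 * sizeof(float4) = 4 KB of shared memory
__global__ void faultsKernelSharedTiles(const float3* vertices, unsigned int vertsCount,
                                        int* displacements, const float4* planes,
                                        unsigned int planesCount)
{
    __shared__ float4 planeTile[PLANE_TILE];

    unsigned int blockId = blockIdx.y * gridDim.x + blockIdx.x;
    unsigned int vertexIndex = blockId * blockDim.x + threadIdx.x;

    // No early return: every thread must reach the __syncthreads() calls below.
    bool active = (vertexIndex < vertsCount);
    float3 v = active ? vertices[vertexIndex] : make_float3(0.0f, 0.0f, 0.0f);
    int displacementSteps = active ? displacements[vertexIndex] : 0;

    for (unsigned int base = 0; base < planesCount; base += PLANE_TILE) {
        unsigned int count = planesCount - base;
        if (count > PLANE_TILE) count = PLANE_TILE;

        // The block cooperatively loads one tile of planes (coalesced global reads).
        for (unsigned int p = threadIdx.x; p < count; p += blockDim.x)
            planeTile[p] = planes[base + p];
        __syncthreads();

        if (active) {
            for (unsigned int p = 0; p < count; ++p) {
                float4 plane = planeTile[p];
                displacementSteps += (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0.0f) ? 1 : -1;
            }
        }
        __syncthreads(); // keep the tile intact until every thread has finished with it
    }

    if (active)
        displacements[vertexIndex] = displacementSteps;
}
With a 256-plane tile each plane is read from global memory once per block rather than being streamed repeatedly through the 8 KB constant cache, at the cost of 4 KB of shared memory per block.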

How to make a CUDA histogram kernel?

I am writing a CUDA kernel for a histogram of a picture, but I have no idea how to return an array from the kernel, and the array will change while other threads read it. Any possible solution for it?
__global__ void Hist(
TColor *dst, //input image
int imageW,
int imageH,
int*data
){
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
if(ix < imageW && iy < imageH)
{
int pixel = get_red(dst[imageW * (iy) + (ix)]);
//this assign specific RED value of image to pixel
data[pixel] ++; // ?? problem statement ...
}
}
#para d_dst: input image; TColor is equal to float4.
#para data: the array for histogram size [255]
extern "C" void
cuda_Hist(TColor *d_dst, int imageW, int imageH,int* data)
{
dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));
Hist<<<grid, threads>>>(d_dst, imageW, imageH, data);
}
Have you looked at the SDK sample? The "histogram" sample is available in the CUDA SDK (currently version 3.0 on the NVIDIA developer site, version 3.1 beta available for registered developers).
The documentation with the sample explains nicely how to handle your summation, either using global memory atomics on the GPU or by collecting the results for each block separately and then doing a separate reduction (either on the host or the GPU).
Histogramming is not particularly efficient when implemented with CUDA (or with GPGPU in general) - typically you need to generate lots of partial histograms in shared memory and then sum them. You might want to consider keeping this particular task on the CPU.
You will have to either use atomic functions to stop other threads from using the same memory, or use partial histograms. Either way it is not that efficient unless the input image is very, very large.
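As a rough illustration of the per-block partial histogram idea mentioned above (a sketch only, reusing the question's TColor and get_red, assuming 256 red-channel bins and a device that supports shared-memory atomics):
__global__ void HistShared(
    TColor *dst,  // input image, as in the question
    int imageW,
    int imageH,
    int *data)    // global histogram of 256 bins, assumed zeroed before launch
{
    __shared__ int localHist[256];

    // Each block first zeroes its own partial histogram in shared memory.
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    for (int bin = tid; bin < 256; bin += blockDim.x * blockDim.y)
        localHist[bin] = 0;
    __syncthreads();

    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < imageW && iy < imageH)
    {
        int pixel = get_red(dst[imageW * iy + ix]);
        atomicAdd(&localHist[pixel], 1); // shared-memory atomic: safe against racing threads
    }
    __syncthreads();

    // Merge this block's partial histogram into the global one with global atomics.
    for (int bin = tid; bin < 256; bin += blockDim.x * blockDim.y)
        if (localHist[bin] > 0)
            atomicAdd(&data[bin], localHist[bin]);
}
The host wrapper cuda_Hist can stay as it is, as long as data is cleared (for example with cudaMemset) before the kernel is launched.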
