Accessing global memory in CUDA is slow - memory

I have a CUDA kernel doing some computation on a local variable (in register), and after it gets computed, its value gets written into a global array p:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x,
        j = blockIdx.y*blockDim.y + threadIdx.y,
        k = blockIdx.z*blockDim.z + threadIdx.z,
        idx = k*dimX*dimY + j*dimX + i; // linear index: x fastest, then y, then z
    if (i >= dimX || j >= dimY || k >= dimZ)
        return;
    float val = 0;
    val = SomeComputationOnVal();
    p[idx] = val;
}
Unfortunately, this function executes very slowly.
However, it runs very fast if I do this:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x,
        j = blockIdx.y*blockDim.y + threadIdx.y,
        k = blockIdx.z*blockDim.z + threadIdx.z,
        idx = k*dimX*dimY + j*dimX + i; // linear index: x fastest, then y, then z
    if (i >= dimX || j >= dimY || k >= dimZ)
        return;
    float val = 0;
    //val = SomeComputationOnVal();
    p[idx] = val;
}
It also runs very fast if I do this:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x,
        j = blockIdx.y*blockDim.y + threadIdx.y,
        k = blockIdx.z*blockDim.z + threadIdx.z,
        idx = k*dimX*dimY + j*dimX + i; // linear index: x fastest, then y, then z
    if (i >= dimX || j >= dimY || k >= dimZ)
        return;
    float val = 0;
    val = SomeComputationOnVal();
    // p[idx] = val;
}
So I am confused and have no idea how to solve this problem. I stepped through the kernel with Nsight and did not find any access violations.
Here is how I launch the kernel (dimX: 924, dimY: 16, dimZ: 1120):
dim3 gridSize(dimX/blockSize.x+1, dimY/blockSize.y, dimZ/blockSize.z);
float* dev_p;
cudaMalloc((void**)&dev_p, dimX*dimY*dimZ*sizeof(float));
dd<<<gridSize, blockSize>>>(dev_p, dimX, dimY, dimZ);
Could anyone please give me some pointers? This does not make much sense to me. All of the computation of val is fast, and the final step is just to store val into p. p is never involved in the computation and appears only once. So why is it so slow?
The computation is basically a loop over a 512 x 512 matrix, which I'd say is a fair amount of work.

The computations you perform in SomeComputationOnVal are extremely expensive. Each thread reads at least 1 MB of data, which is off cache (or in L2 at best for a small part, should k vary in a small range). Over the roughly 16.5 million threads of your run, that totals about 16 TB of data. Even on a high-end GPU this would take about 2 minutes at the minimum, not to mention everything else that could slow it down.
Your function does not write any data to global memory and has no side effects, so the compiler may optimize the call away entirely if its output is not used.
That is why cases two and three, in which no calculation is actually performed, are very fast. Writing 64 MB to GPU memory with coalesced accesses is very fast (milliseconds range).
You can inspect the generated PTX to see whether the code was optimized out: use the --keep option of nvcc and look for the .ptx files.
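To make the estimate above concrete, here is a small host-side C++ sketch of the arithmetic. It assumes the 512 x 512 matrix holds 4-byte floats and that every thread reads each element once; the function names and the bandwidth figure used afterwards are illustrative assumptions, not measurements.

```cpp
#include <cassert>
#include <cstdint>

// Problem size taken from the question.
constexpr std::uint64_t kDimX = 924, kDimY = 16, kDimZ = 1120;

// One thread per element of p.
constexpr std::uint64_t thread_count() { return kDimX * kDimY * kDimZ; }

// Assumption: each thread reads every element of a 512 x 512 float matrix once.
constexpr std::uint64_t bytes_read_per_thread() {
    return 512ull * 512ull * sizeof(float);
}

// Total traffic generated by the reads of all threads.
constexpr std::uint64_t total_bytes_read() {
    return thread_count() * bytes_read_per_thread();
}

// Every thread stores a single float into p.
constexpr std::uint64_t total_bytes_written() {
    return thread_count() * sizeof(float);
}

// Lower bound on run time at a given sustained bandwidth (bytes per second).
constexpr double min_seconds(std::uint64_t bytes, double bytes_per_second) {
    return static_cast<double>(bytes) / bytes_per_second;
}
```

With a hypothetical sustained bandwidth of 150 GB/s, min_seconds(total_bytes_read(), 150e9) comes out a little under two minutes, whereas the 66 MB written to p takes well under a millisecond at the same rate.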


Can Montgomery multiplication be used to speed up the computation of (large number)! % (some prime)

This question originates in a comment I almost wrote below this question, where Zack is computing the factorial of a large number modulo a large number (that we will assume to be prime for the sake of this question). Zack is using the traditional computation of factorial, taking the remainder at each multiplication.
I almost commented that an alternative to consider was Montgomery multiplication, but thinking more about it, I have only seen this technique used to speed up several multiplications by the same multiplicand (in particular, to speed up the computation of a^n mod p).
My question is: can Montgomery multiplication be used to speed up the computation of n! mod p for large n and p?
Naively, no; you need to transform each of the n terms of the product into the "Montgomery space", so you have n full reductions mod m, the same as the "usual" algorithm.
However, a factorial isn't just an arbitrary product of n terms; it's much more structured. In particular, if you already have the "Montgomerized" kr mod m, then you can use a very cheap reduction to get (k+1)r mod m.
So this is perfectly feasible, though I haven't seen it done before. I went ahead and wrote a quick-and-dirty implementation (very untested, I wouldn't trust it very far at all):
// returns m^-1 mod 2**64 via clever 2-adic arithmetic
uint64_t inverse(uint64_t m) {
    assert(m % 2 == 1);
    uint64_t minv = 2 - m;
    uint64_t m_1 = m - 1;
    for (int i=1; i<6; i+=1) { m_1 *= m_1; minv *= (1 + m_1); }
    return minv;
}

uint64_t montgomery_reduce(__uint128_t x, uint64_t minv, uint64_t m) {
    return x + (__uint128_t)((uint64_t)x*-minv)*m >> 64;
}

uint64_t montgomery_multiply(uint64_t x, uint64_t y, uint64_t minv, uint64_t m) {
    return montgomery_reduce(full_product(x, y), minv, m);
}

uint64_t montgomery_factorial(uint64_t x, uint64_t m) {
    assert(x < m && m % 2 == 1);
    uint64_t minv = inverse(m); // m^-1 mod 2**64
    uint64_t r_mod_m = -m % m; // 2**64 mod m
    uint64_t mont_term = r_mod_m;
    uint64_t mont_result = r_mod_m;
    for (uint64_t k=2; k<=x; k++) {
        // Compute the montgomerized product term: kr mod m = (k-1)r + r mod m.
        mont_term += r_mod_m;
        if (mont_term >= m) mont_term -= m;
        // Update the result by multiplying in the new term.
        mont_result = montgomery_multiply(mont_result, mont_term, minv, m);
    }
    // Final reduction
    return montgomery_reduce(mont_result, minv, m);
}
and benchmarked it against the usual implementation:
__uint128_t full_product(uint64_t x, uint64_t y) {
    return (__uint128_t)x*y;
}

uint64_t naive_factorial(uint64_t x, uint64_t m) {
    assert(x < m);
    uint64_t result = x ? x : 1;
    while (x --> 2) result = full_product(result,x) % m;
    return result;
}
and against the usual implementation with some inline asm to fix a minor inefficiency:
uint64_t x86_asm_factorial(uint64_t x, uint64_t m) {
    assert(x < m);
    uint64_t result = x ? x : 1;
    while (x --> 2) {
        __asm__("mov %[result], %%rax; mul %[x]; div %[m]"
                : [result] "+d" (result) : [x] "r" (x), [m] "r" (m) : "%rax", "flags");
    }
    return result;
}
Results were as follows on my Haswell laptop for reasonably large x:
implementation   speedup
naive            1.00x
x86_asm          1.76x
montgomery       5.68x
So this really does seem to be a pretty nice win. The codegen for the Montgomery implementation is pretty decent, but could probably be improved somewhat further with hand-written assembly as well.
This is an interesting approach for "modest" x and m. Once x gets large, the various approaches that have sub-linear complexity in x will necessarily win out; factorial has so much structure that this method doesn't take advantage of.
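The step this approach hinges on, getting (k+1)r mod m from kr mod m with one addition and one conditional subtraction, can be checked in isolation. Below is a minimal sketch under a few assumptions: R = 2^64, m < 2^63 so the addition cannot wrap, and a compiler that provides __uint128_t (GCC or Clang). The helper names are made up for illustration and are not taken from the answer's code.

```cpp
#include <cassert>
#include <cstdint>

// 2^64 mod m: the Montgomery form of 1 when R = 2^64. Requires m != 0.
std::uint64_t r_mod_m(std::uint64_t m) { return (0 - m) % m; }

// Given term = k*r mod m, return (k+1)*r mod m without any division.
// Assumes term < m, r < m and m < 2^63, so term + r cannot wrap around.
std::uint64_t next_mont_term(std::uint64_t term, std::uint64_t r, std::uint64_t m) {
    term += r;
    if (term >= m) term -= m;
    return term;
}

// Reference value: (k * 2^64) mod m via one full 128-bit reduction.
std::uint64_t mont_term_reference(std::uint64_t k, std::uint64_t m) {
    return static_cast<std::uint64_t>((static_cast<__uint128_t>(k) << 64) % m);
}

// Walk k = 1..n incrementally and compare each term against the reference.
bool incremental_terms_match(std::uint64_t m, std::uint64_t n) {
    const std::uint64_t r = r_mod_m(m);
    std::uint64_t term = r;  // Montgomery form of k = 1
    for (std::uint64_t k = 1; k <= n; ++k) {
        if (term != mont_term_reference(k, m)) return false;
        term = next_mont_term(term, r, m);
    }
    return true;
}
```

For example, incremental_terms_match(1000000007, 10000) compares the first ten thousand incrementally computed terms against the full reductions for that modulus.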

CUDA: __syncthreads() before shared memory operation?

I'm in the rather poor situation of not being able to use the CUDA debugger. I'm getting some strange results from usage of __syncthreads in an application with a single shared array (deltas). The following piece of code is performed in a loop:
__syncthreads(); //if I comment this out, things get funny
deltas[lex_index_block] = intensity - mean;
__syncthreads(); //this line doesnt seem to matter regardless if the first sync is commented out or not
//after sync: do something with the values of delta written in this threads and other threads of this block
Basically, I have code with overlapping blocks (required due to the nature of the algorithm). The program does compile and run but somehow I get systematically wrong values in the areas of vertical overlap. This is very confusing to me as I thought that the correct way to sync is to sync after the threads have performed my write to the shared memory.
This is the whole function:
//XC without repetitions
template <int blocksize, int order>
__global__ void __xc(unsigned short* raw_input_data, int num_frames, int width, int height,
float * raw_sofi_data, int block_size, int order_deprecated){
//we make a distinction between real pixels and virtual pixels
//real pixels are pixels that exist in the original data
//overlap correction: every new block has a margin of 3 threads doing less work (only computing deltas)
int x_corrected = global_x() - blockIdx.x * 3;
int y_corrected = global_y() - blockIdx.y * 3;
//if the thread is responsible for any real pixel
if (x_corrected < width && y_corrected < height){
// __shared__ float deltas[blocksize];
__shared__ float deltas[blocksize];
//the outer pixels of a block do not update SOFI values as they do not have sufficient information available
//they are used only to compute mean and delta
//also, pixels at the global edge have to be thrown away (as there is not sufficient data to interpolate)
bool within_inner_block =
threadIdx.x > 0
&& threadIdx.y > 0
&& threadIdx.x < blockDim.x - 2
&& threadIdx.y < blockDim.y - 2
//global edge
&& x_corrected > 0
&& y_corrected > 0
&& x_corrected < width - 1
&& y_corrected < height - 1
//init virtual pixels
float virtual_pixels[order * order];
if (within_inner_block){
for (int i = 0; i < order * order; ++i) {
virtual_pixels[i] = 0;
float mean = 0;
float intensity;
int lex_index_block = threadIdx.x + threadIdx.y * blockDim.x;
//main loop
for (int frame_idx = 0; frame_idx < num_frames; ++frame_idx) {
//shared memory read and computation of mean/delta
intensity = raw_input_data[lex_index_3D(x_corrected,y_corrected, frame_idx, width, height)];
__syncthreads(); //if I comment this out, things break
deltas[lex_index_block] = intensity - mean;
__syncthreads(); //this doesnt seem to matter
mean = deltas[lex_index_block]/(float)(frame_idx+1);
//if the thread is responsible for correlated pixels, i.e. not at the border of the original frame
if (within_inner_block){
virtual_pixels[0] += deltas[lex_index_2D(
threadIdx.y + 1,
threadIdx.y - 1,
virtual_pixels[1] += deltas[lex_index_2D(
threadIdx.x + 1,
virtual_pixels[2] += deltas[lex_index_2D(
threadIdx.y + 1,
virtual_pixels[3] += deltas[lex_index_2D(
// xc_update<order>(virtual_pixels, delta2, mean);
if (within_inner_block){
for (int virtual_idx = 0; virtual_idx < order*order; ++virtual_idx) {
raw_sofi_data[lex_index_2D(x_corrected*order + virtual_idx % order,
y_corrected*order + (int)floorf(virtual_idx / order),
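From what I can see, there could be a hazard in your application between loop iterations. The write to deltas[lex_index_block] for loop iteration frame_idx+1 could be mapped to the same location as the read of deltas[lex_index_2D(threadIdx.x, threadIdx.y -1, blockDim.x)] in a different thread at iteration frame_idx. Without a barrier between them, the two accesses are unordered and the result is nondeterministic. This is what the first __syncthreads() prevents: it makes sure every thread has finished reading the deltas of the previous iteration before any thread overwrites them, which is why removing it breaks things. Try running the app with cuda-memcheck --tool racecheck.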
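The hazard can be illustrated without a GPU by replaying two schedules sequentially on the host. This is only a toy model in plain C++, not CUDA: each "thread" stores one value per iteration and reads its neighbour's value, and the function names and stored values are invented for the illustration. The second function replays one schedule that becomes possible once the barrier in front of the write is removed.

```cpp
#include <cassert>
#include <vector>

// Value that thread t stores into shared memory in iteration f (arbitrary).
int delta_value(int f, int t) { return 10 * f + t; }

// Both barriers present: all threads write, then all threads read their
// neighbour, and only then does the write of the next iteration begin.
std::vector<int> run_with_both_barriers(int threads, int iters) {
    std::vector<int> deltas(threads), sum(threads, 0);
    for (int f = 0; f < iters; ++f) {
        for (int t = 0; t < threads; ++t) deltas[t] = delta_value(f, t);
        // (barrier: every write of iteration f is visible)
        for (int t = 0; t < threads; ++t) sum[t] += deltas[(t + 1) % threads];
        // (barrier: every read of iteration f is finished)
    }
    return sum;
}

// Barrier before the write removed: one possible schedule in which thread 0
// finishes its read and already stores its value for iteration f+1 while the
// other threads are still reading the values of iteration f.
std::vector<int> run_without_leading_barrier(int threads, int iters) {
    std::vector<int> deltas(threads), sum(threads, 0);
    for (int t = 0; t < threads; ++t) deltas[t] = delta_value(0, t);
    for (int f = 0; f < iters; ++f) {
        sum[0] += deltas[1 % threads];
        if (f + 1 < iters) deltas[0] = delta_value(f + 1, 0);  // runs ahead
        for (int t = 1; t < threads; ++t) sum[t] += deltas[(t + 1) % threads];
        if (f + 1 < iters)
            for (int t = 1; t < threads; ++t) deltas[t] = delta_value(f + 1, t);
    }
    return sum;
}
```

In this model the last thread reads thread 0's slot, so it is the one that accumulates values from the wrong iteration in the second schedule, while the other threads still happen to see correct data.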

stripes while calculating image gradient with CUDA

I'm writing code for image denoising and came across a strange problem with stripes in the processed images. Basically, when I calculate the X-gradient of the image, horizontal stripes appear (or vertical ones for the Y direction) (linked image: Lena X gradient).
The whole algorithm works OK and it looks like I'm getting the correct answer (I'm comparing with a program in C), except for those annoying stripes (linked image: Lena result).
The distance between the stripes changes with the block size, and the stripes also appear in different positions each time I run the program! Here is the part of the program related to the gradient calculation. I have a feeling that I'm doing something very stupid :) Thank you!
#define BLKXSIZE 16
#define BLKYSIZE 16
#define idivup(a, b) ( ((a)%(b) != 0) ? (a)/(b)+1 : (a)/(b) )
void Diff4th_GPU(float* A, float* B, int N, int M, int Z, float sigma, int iter, float tau, int type)
float *Ad;
dim3 dimGrid(idivup(N,BLKXSIZE), idivup(M,BLKYSIZE));
int n = 1;
while (n <= iter) {
Diff4th2D<<<dimGrid,dimBlock>>>(Ad, N, M, sigma, iter, tau, type);
__global__ void Diff4th2D(float* A, int N, int M, float sigma, int iter, float tau, int type)
{
    float gradX, gradX_sq, gradY, gradY_sq, gradXX, gradYY, gradXY, sq_sum, xy_2, Lam, V_norm, V_orth, c, c_sq, lam_t;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int index = j + i*N;
    if ((i < N) && (j < M))
    {
        float gradX = 0, gradY = 0, gradXX = 0, gradYY = 0, gradXY = 0;
        if ((i>1) && (i<N)) {
            if ((j>1) && (j<M)){
                int indexN = (j)+(i-1)*(N);
                if (indexN > ((N*M)-1)) indexN = (N*M)-1;
                if (indexN < 0) indexN = 0;
                int indexS = (j)+(i+1)*(N);
                if (indexS > ((N*M)-1)) indexS = (N*M)-1;
                if (indexS < 0) indexS = 0;
                int indexW = (j-1)+(i)*(N);
                if (indexW > ((N*M)-1)) indexW = (N*M)-1;
                if (indexW < 0) indexW = 0;
                int indexE = (j+1)+(i)*(N);
                if (indexE > ((N*M)-1)) indexE = (N*M)-1;
                if (indexE < 0) indexE = 0;
                gradX = 0.5*(A[indexN]-A[indexS]);
                A[index] = gradX;
            }
        }
    }
}
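You have a race condition inside your kernel: each thread reads the neighbouring elements A[indexN] and A[indexS] and then overwrites A[index], so a neighbour may or may not already have been overwritten by another thread at the time it is read. That would also explain why the stripes move between runs and change with the block size.
Use different arrays for input and output.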
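The effect of reading and writing the same array can be reproduced on the host with a one-dimensional version of the stencil. This is a simplified C++ sketch, not the kernel from the question: the function names and the sample data are made up, and the "visiting order" stands in for whatever order the GPU happens to schedule threads in.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sample row used in the examples below (values chosen to be exact in float).
std::vector<float> sample_row() { return {0.0f, 1.0f, 4.0f, 9.0f, 16.0f, 25.0f}; }

// Central difference reading from `in` and writing to a separate output.
// Border elements are simply left at 0.
std::vector<float> gradient_out_of_place(const std::vector<float>& in) {
    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = 0.5f * (in[i - 1] - in[i + 1]);
    return out;
}

// Same stencil, but overwriting the input while visiting the interior
// elements in ascending or descending order. Each order models one possible
// interleaving of threads that share a single array.
std::vector<float> gradient_in_place(std::vector<float> a, bool ascending) {
    const std::size_t n = a.size();
    for (std::size_t s = 1; s + 1 < n; ++s) {
        const std::size_t i = ascending ? s : n - 1 - s;
        a[i] = 0.5f * (a[i - 1] - a[i + 1]);
    }
    return a;
}
```

With separate input and output the result does not depend on the visiting order at all, whereas the in-place variant gives different interior values for the two orders, which is the host-side analogue of stripes that move from run to run.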

CUDA: Shift arrays on shared memory

I am trying to load a flattened 2D matrix into shared memory, shift the data along x, write back to global memory shifting also along y. The input data is therefore shifted along x and y. What I have:
__global__ void test_shift(float *data_old, float *data_new)
{
    uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;
    __shared__ float VAR;
    __shared__ float VAR2[NUM_THREADS];
    // load from global to shared
    VAR = data_old[glob_index];
    // do some stuff on VAR
    if (threadIdx.x < NUM_THREADS - 1)
        VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
    // write to global memory
    if (threadIdx.y < ny - 1)
        glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x; // redefine glob_index to shift along y (+1)
    data_new[glob_index] = VAR2[threadIdx.x];
}
The call to the kernel:
test_shift <<< grid, block >>> (data_old, data_new);
and grid and blocks (blockDim.x is equal to the matrix width, i.e. 64):
dim3 block(NUM_THREADS, 1);
dim3 grid(1, ny);
I am not able to achieve it. Could someone please point out what's wrong with this? Should I use a strided index or an offset?
VAR should not have been declared as shared, because in the current form all threads scribble over each other's data when you load from global memory: VAR = data_old[glob_index];.
You also have an out-of-bounds access when you access VAR2[threadIdx.x + 1], so your kernel never finishes (depending on the compute capability of the device - 1.x devices didn't check shared memory accesses as rigorously).
You could have detected the latter by checking the return codes of all calls to CUDA functions for errors.
Shared variables are, well, shared by all threads in a single block. This means that you don't have blockDim.y sets of shared variables, but only a single set per block.
uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;
__shared__ float VAR;
__shared__ float VAR2[NUM_THREADS];
VAR = data_old[glob_index];
if (threadIdx.x < NUM_THREADS - 1)
VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
This instructs all threads in a block to write into a single variable (VAR). There is no synchronization before the variable is used in the second assignment, so the result is undefined: threads from the first warp may be reading the variable while threads from the second warp are still writing to it.
You should make VAR a local variable, or create a shared-memory array with one element per thread in the block.
if (threadIdx.y < ny - 1)
glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x;
data_new[glob_index] = VAR2[threadIdx.x];
VAR2[0] still contains garbage (you never write to it). Also, threadIdx.y is always zero in your blocks.
And avoid using uint. It has (or used to have) some performance problems.
Actually, for such a simple task you don't need shared memory at all:
__global__ void test_shift(float *data_old, float *data_new)
{
    int glob_index = threadIdx.x + blockIdx.y*blockDim.x;
    float VAR;
    // load from global to local
    VAR = data_old[glob_index];
    int glob_index_new;
    // calculate only if we are going to output something
    if ( (blockIdx.y < gridDim.y - 1) && ( threadIdx.x < blockDim.x - 1 ))
    {
        glob_index_new = threadIdx.x + 1 + (blockIdx.y + 1)*blockDim.x;
        // do some stuff on VAR
    } else // just write 0.0 to remove garbage
    {
        // threads of the last row zero row 0; the remaining last-column threads zero column 0 of rows 1 and up
        glob_index_new = (blockIdx.y == gridDim.y - 1) ? threadIdx.x : (blockIdx.y + 1)*blockDim.x;
        VAR = 0.0;
    }
    // write to global memory
    data_new[glob_index_new] = VAR;
}
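A kernel like this is easiest to validate against a host-side reference. Here is a minimal C++ sketch of the intended result, assuming a row-major matrix of the given width and height, a shift of +1 along both x and y, and zeros in the row and column that receive no data. The names shift_reference and iota_matrix are made up for this example.

```cpp
#include <cassert>
#include <vector>

// Host reference: out(x+1, y+1) = in(x, y); row 0 and column 0 of out stay 0.
std::vector<float> shift_reference(const std::vector<float>& in, int width, int height) {
    std::vector<float> out(in.size(), 0.0f);
    for (int y = 0; y + 1 < height; ++y)
        for (int x = 0; x + 1 < width; ++x)
            out[(x + 1) + (y + 1) * width] = in[x + y * width];
    return out;
}

// Test input: element (x, y) holds 1 + x + y*width, so every value is unique.
std::vector<float> iota_matrix(int width, int height) {
    std::vector<float> m(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            m[x + y * width] = static_cast<float>(1 + x + y * width);
    return m;
}
```

Copying data_new back to the host and comparing it element by element with shift_reference(data_old_host, width, height) shows immediately whether any element was left unwritten or shifted to the wrong place.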

Fast Pixel Count on Binary Image- ARM neon intrinsics - iOS Dev

Can someone tell me a fast function to count the number of white pixels in a binary image. I need it for iOS app dev. I am working directly on the memory of the image defined as
bool *imageData = (bool *) malloc(noOfPixels * sizeof(bool));
I am implementing the function
int whiteCount = 0;
for (int q=i; q<i+windowHeight; q++)
    for (int w=j; w<j+windowWidth; w++)
        if (imageData[q*W + w] == 1)
            whiteCount++;
This is obviously the slowest function possible. I heard that ARM NEON intrinsics on iOS can be used to perform several operations in one cycle. Maybe that's the way to go?
The problem is that I am not very familiar with them and don't have enough time to learn assembly language at the moment. So it would be great if anyone could post NEON intrinsics code for the problem mentioned above, or any other fast implementation in C/C++.
The only NEON intrinsics code I have been able to find online is for RGB-to-gray conversion.
Firstly you can speed up the original code a little by factoring out the multiply and getting rid of the branch:
int whiteCount = 0;
for (int q = i; q < i + windowHeight; q++)
{
    const bool * const row = &imageData[q * W];
    for (int w = j; w < j + windowWidth; w++)
    {
        whiteCount += row[w];
    }
}
(This assumes that imageData[] is truly binary, i.e. each element can only ever be 0 or 1.)
Here is a simple NEON implementation:
#include <arm_neon.h>
// ...
int q, w;
int whiteCount = 0;
uint32x4_t v_count = { 0 };
for (q = i; q < i + windowHeight; q++)
{
    const bool * const row = &imageData[q * W];
    uint16x8_t vrow_count = { 0 };
    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8((const uint8_t *)&row[w]); // load 16 x 8 bit pixels
        vrow_count = vpadalq_u8(vrow_count, v); // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w) // scalar clean up loop
    {
        whiteCount += row[w];
    }
    v_count = vpadalq_u16(v_count, vrow_count); // update 32 bit image counts
} // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount
(This assumes that imageData[] is truly binary, imageWidth <= 2^19, and sizeof(bool) == 1.)
Updated version for unsigned char and values of 255 for white, 0 for black:
#include <arm_neon.h>
// ...
int q, w;
int whiteCount = 0;
const uint8x16_t v_mask = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
uint32x4_t v_count = { 0 };
for (q = i; q < i + windowHeight; q++)
{
    const uint8_t * const row = &imageData[q * W];
    uint16x8_t vrow_count = { 0 };
    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8(&row[w]); // load 16 x 8 bit pixels
        v = vandq_u8(v, v_mask); // mask out all but LS bit
        vrow_count = vpadalq_u8(vrow_count, v); // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w) // scalar clean up loop
    {
        whiteCount += (row[w] == 255);
    }
    v_count = vpadalq_u16(v_count, vrow_count); // update 32 bit image counts
} // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount
(This assumes that imageData[] has values of 255 for white and 0 for black, and imageWidth <= 2^19.)
Note that all the above code is untested and may need some further work.
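Since the NEON code is untested, it can help to have a portable model of its loop structure to compare against on any machine. The sketch below is plain scalar C++, not NEON: two small arrays stand in for the 16-bit and 32-bit vector accumulators, and the inner lane loop mimics the pairwise add-and-accumulate step. It assumes one byte per pixel with values 0 or 1, and the names count_white_lanes and checkerboard are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of the vectorised count: 16 pixels per step, folded pairwise
// into eight 16-bit row counters, then pairwise into four 32-bit counters.
int count_white_lanes(const std::uint8_t* imageData, int W,
                      int i, int j, int windowHeight, int windowWidth) {
    int whiteCount = 0;
    std::uint32_t count32[4] = {0, 0, 0, 0};               // models v_count
    for (int q = i; q < i + windowHeight; ++q) {
        const std::uint8_t* row = &imageData[static_cast<std::size_t>(q) * W];
        std::uint16_t count16[8] = {0, 0, 0, 0, 0, 0, 0, 0};  // models vrow_count
        int w = j;
        for (; w <= j + windowWidth - 16; w += 16)          // "SIMD" loop
            for (int lane = 0; lane < 8; ++lane)            // pairwise accumulate
                count16[lane] = static_cast<std::uint16_t>(
                    count16[lane] + row[w + 2 * lane] + row[w + 2 * lane + 1]);
        for (; w < j + windowWidth; ++w)                    // scalar clean-up loop
            whiteCount += row[w];
        for (int lane = 0; lane < 4; ++lane)                // fold 16-bit into 32-bit
            count32[lane] += static_cast<std::uint32_t>(count16[2 * lane]) +
                             count16[2 * lane + 1];
    }
    return whiteCount +
           static_cast<int>(count32[0] + count32[1] + count32[2] + count32[3]);
}

// Test image: pixel (x, y) is white (1) when x + y is odd.
std::vector<std::uint8_t> checkerboard(int W, int H) {
    std::vector<std::uint8_t> img(static_cast<std::size_t>(W) * H);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            img[static_cast<std::size_t>(y) * W + x] =
                static_cast<std::uint8_t>((x + y) % 2);
    return img;
}
```

Running the same window through this model and through the intrinsics version on a device gives a quick consistency check, in particular for windows whose width is not a multiple of 16 and whose left edge j is not zero.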
The vectorized algorithm will do the comparisons and put them in a structure for you, but you'd still need to go through each element of the structure and determine if it's a zero or not.
How fast does that loop currently run and how fast do you need it to run? Also remember that NEON will work in the same registers as the floating point unit, so using NEON here may force an FPU context switch.
