CUDA: Shift arrays on shared memory - memory

I am trying to load a flattened 2D matrix into shared memory, shift the data along x, write back to global memory shifting also along y. The input data is therefore shifted along x and y. What I have:
__global__ void test_shift(float *data_old, float *data_new)
{
uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;
__shared__ float VAR;
__shared__ float VAR2[NUM_THREADS];
// load from global to shared
VAR = data_old[glob_index];
// do some stuff on VAR
if (threadIdx.x < NUM_THREADS - 1)
{
VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
}
__syncthreads();
// write to global memory
if (threadIdx.y < ny - 1)
{
glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x; // redefine glob_index to shift along y (+1)
data_new[glob_index] = VAR2[threadIdx.x];
}
The call to the kernel:
test_shift <<< grid, block >>> (data_old, data_new);
and grid and blocks (blockDim.x is equal to the matrix width, i.e. 64):
dim3 block(NUM_THREADS, 1);
dim3 grid(1, ny);
I am not able to achieve it. Could someone please point out what's wrong with this? Should I use a strided index or an offset?

VAR should not have been declared as shared, because in the current form all threads scribble over each other's data when you load from global memory: VAR = data_old[glob_index];.
You also have an out-of-bounds access when you access VAR2[threadIdx.x + 1], so your kernel never finishes (depending on the compute capability of the device - 1.x devices didn't check shared memory accesses as rigorously).
You could have detected the latter by checking the return codes of all calls to CUDA functions for errors.

Shared variables are, well, shared by all threads in a single block. This means that you don't have blockDim.y complects of shared variables but only a single complect per block.
uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;
__shared__ float VAR;
__shared__ float VAR2[NUM_THREADS];
VAR = data_old[glob_index];
if (threadIdx.x < NUM_THREADS - 1)
{
VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
}
This instructs all threads in a block to write data into a single variable (VAR). Next you have no synchronization, and you use this variable in the second assignment. This will have undefined result, because threads from the first warp are reading from this variable and threads from the second warp are still trying to write something there.
You should change VAR to be local, or create an array of shared memory variables for all threads in block.
if (threadIdx.y < ny - 1)
{
glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x;
data_new[glob_index] = VAR2[threadIdx.x];
}
In VAR2[0] you still have some garbage (you've never written there). threadIdx.y is always zero in your blocks.
And avoid using uints. They have (or used to have) some perfomance problems.
Actually, for such simple task you don't need to use shared memory
__global__ void test_shift(float *data_old, float *data_new)
{
int glob_index = threadIdx.x + blockIdx.y*blockDim.x;
float VAR;
// load from global to local
VAR = data_old[glob_index];
int glob_index_new;
// calculate only if we are going to output something
if ( (blockIdx.y < gridDim.y - 1) && ( threadIdx.x < blockDim.x - 1 ))
{
glob_index_new = threadIdx.x + 1 + (blockIdx.y + 1)*blockDim.x;
// do some stuff on VAR
} else // just write 0.0 to remove garbage
{
glob_index_new = ( (blockIdx.y == gridDim.y - 1) && ( threadIdx.x == blockDim.x - 1 ) ) ? 0 : ((blockIdx.y == gridDim.y - 1) ? threadIdx.x : (blockIdx.y)*blockDim.x );
VAR = 0.0;
}
// write to global memory
data_new[glob_index_new] = VAR;
}

Related

Accessing global memory in CUDA is slow

I have a CUDA kernel doing some computation on a local variable (in register), and after it gets computed, its value gets written into a global array p:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
int
i = blockIdx.x*blockDim.x + threadIdx.x,
j = blockIdx.y*blockDim.y + threadIdx.y,
k = blockIdx.z*blockDim.z + threadIdx.z,
idx = j*dimX*dimY + j*dimX +i;
if (i >= dimX || j >= dimY || k >= dimZ)
{
return;
}
float val = 0;
val = SomeComputationOnVal();
p[idx ]= val;
__syncthreads();
}
Unfortunately, this function executes very slow.
However, it runs very fast if I do this:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
int
i = blockIdx.x*blockDim.x + threadIdx.x,
j = blockIdx.y*blockDim.y + threadIdx.y,
k = blockIdx.z*blockDim.z + threadIdx.z,
idx = j*dimX*dimY + j*dimX +i;
if (i >= dimX || j >= dimY || k >= dimZ)
{
return;
}
float val = 0;
//val = SomeComputationOnVal();
p[idx ]= val;
__syncthreads();
}
It also runs very fast if I do this:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
int
i = blockIdx.x*blockDim.x + threadIdx.x,
j = blockIdx.y*blockDim.y + threadIdx.y,
k = blockIdx.z*blockDim.z + threadIdx.z,
idx = j*dimX*dimY + j*dimX +i;
if (i >= dimX || j >= dimY || k >= dimZ)
{
return;
}
float val = 0;
val = SomeComputationOnVal();
// p[idx ]= val;
__syncthreads();
}
So I am confused, and have no idea how to solve this problem. I have used NSight step in, and did not find access violations.
Here is how I launch the kernel (dimX:924; dimY: 16: dimZ: 1120):
dim3
blockSize(8,16,2),
gridSize(dimX/blockSize.x+1,dimY/blockSize.y, dimZ/blockSize.z);
float* dev_p; cudaMalloc((void**)&dev_p, dimX*dimY*dimZ*sizeof(float));
dd<<<gridSize, blockSize>>>( dev_p,dimX,dimY,dimZ);
Could anyone please gives some pointers? Because it does not make much sense to me. All computation of val is fast, and the final step is to move val into p. p never gets involved in the computation, and it only shows up once. So why is it so slow?
The computations are basically a loop over a 512 X 512 matrix. It is pretty fair amount of computation I'd say.
The computations you perform in the SomeComputationOnVal are extremely expensive. Each thread reads at least 1MB of data which is off cache (or in L2 at best for a small part should k vary in a small range) which totals for your run about 16 TB of data. Even on a high end gpu, it would take about 2 minutes to run, at the minimum. Not to mention everything that could slow this down.
Your function does not write any data in global memory and has no boundary effect. The compiler may decide to optimize out the method call should you not use the output.
Hence cases two and three not doing calculation are very fast. Writing 64 MB on gpu memory, with coesced threads is very fast (milliseconds range).
You can verify the generated ptx to see if code gets optimized out. Use the --keep option in nvcc and search for ptx files.

OpenCL :Access proper index by using globalid(.)

Hi,
I am coding in OpenCL.
I am converting a "C function" having 2D array starting from i=1 and j=1 .PFB .
cv::Mat input; //Input :having some data in it ..
//Image input size is :input.rows=288 ,input.cols =640
cv::Mat output(input.rows-2,input.cols-2,CV_32F); //Output buffer
//Image output size is :output.rows=286 ,output.cols =638
This is a code Which I want to modify in OpenCL:
for(int i=1;i<output.rows-1;i++)
{
for(int j=1;j<output.cols-1;j++)
{
float xVal = input.at<uchar>(i-1,j-1)-input.at<uchar>(i-1,j+1)+ 2*(input.at<uchar>(i,j-1)-input.at<uchar>(i,j+1))+input.at<uchar>(i+1,j-1) - input.at<uchar>(i+1,j+1);
float yVal = input.at<uchar>(i-1,j-1) - input.at<uchar>(i+1,j-1)+ 2*(input.at<uchar>(i-1,j) - input.at<uchar>(i+1,j))+input.at<uchar>(i-1,j+1)-input.at<uchar>(i+1,j+1);
output.at<float>(i-1,j-1) = xVal*xVal+yVal*yVal;
}
}
...
Host code :
//Input Image size is :input.rows=288 ,input.cols =640
//Output Image size is :output.rows=286 ,output.cols =638
OclStr->global_work_size[0] =(input.cols);
OclStr->global_work_size[1] =(input.rows);
size_t outBufSize = (output.rows) * (output.cols) * 4;//4 as I am copying all 4 uchar values into one float variable space
cl_mem cl_input_buffer = clCreateBuffer(
OclStr->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR ,
(input.rows) * (input.cols),
static_cast<void *>(input.data), &OclStr->returnstatus);
cl_mem cl_output_buffer = clCreateBuffer(
OclStr->context, CL_MEM_WRITE_ONLY| CL_MEM_USE_HOST_PTR ,
(output.rows) * (output.cols) * sizeof(float),
static_cast<void *>(output.data), &OclStr->returnstatus);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 0, sizeof(cl_mem), (void *)&cl_input_buffer);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 1, sizeof(cl_mem), (void *)&cl_output_buffer);
OclStr->returnstatus = clEnqueueNDRangeKernel(
OclStr->command_queue,
OclStr->objkernel,
2,
NULL,
OclStr->global_work_size,
NULL,
0,
NULL,
NULL
);
clEnqueueMapBuffer(OclStr->command_queue, cl_output_buffer, true, CL_MAP_READ, 0, outBufSize, 0, NULL, NULL, &OclStr->returnstatus);
kernel Code :
__kernel void Sobel_uchar (__global uchar *pSrc, __global float *pDstImage)
{
const uint cols = get_global_id(0)+1;
const uint rows = get_global_id(1)+1;
const uint width= get_global_size(0);
uchar Opsoble[8];
Opsoble[0] = pSrc[(cols-1)+((rows-1)*width)];
Opsoble[1] = pSrc[(cols+1)+((rows-1)*width)];
Opsoble[2] = pSrc[(cols-1)+((rows+0)*width)];
Opsoble[3] = pSrc[(cols+1)+((rows+0)*width)];
Opsoble[4] = pSrc[(cols-1)+((rows+1)*width)];
Opsoble[5] = pSrc[(cols+1)+((rows+1)*width)];
Opsoble[6] = pSrc[(cols+0)+((rows-1)*width)];
Opsoble[7] = pSrc[(cols+0)+((rows+1)*width)];
float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
pDstImage[(cols-1)+(rows-1)*width] = gx*gx + gy*gy;
}
Here I am not able to get the output as expected.
I am having some questions that
My for loop is starting from i=1 instead of zero, then How can I get proper index by using the global_id() in x and y direction
What is going wrong in my above kernel code :(
I am suspecting there is a problem in buffer stride but not able to further break my head as already broke it throughout a day :(
I have observed that with below logic output is skipping one or two frames after some 7/8 frames sequence.
I have added the screen shot of my output which is compared with the reference output.
My above logic is doing partial sobelling on my input .I changed the width as -
const uint width = get_global_size(0)+1;
PFB
Your suggestions are most welcome !!!
It looks like you may be fetching values in (y,x) format in your opencl version. Also, you need to add 1 to the global id to replicate your for loops starting from 1 rather than 0.
I don't know why there is an unused iOffset variable. Maybe your bug is related to this? I removed it in my version.
Does this kernel work better for you?
__kernel void simple(__global uchar *pSrc, __global float *pDstImage)
{
const uint i = get_global_id(0) +1;
const uint j = get_global_id(1) +1;
const uint width = get_global_size(0) +2;
uchar Opsoble[8];
Opsoble[0] = pSrc[(i-1) + (j - 1)*width];
Opsoble[1] = pSrc[(i-1) + (j + 1)*width];
Opsoble[2] = pSrc[i + (j-1)*width];
Opsoble[3] = pSrc[i + (j+1)*width];
Opsoble[4] = pSrc[(i+1) + (j - 1)*width];
Opsoble[5] = pSrc[(i+1) + (j + 1)*width];
Opsoble[6] = pSrc[(i-1) + (j)*width];
Opsoble[7] = pSrc[(i+1) + (j)*width];
float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
pDstImage[(i-1) + (j-1)*width] = gx*gx + gy*gy ;
}
I am a bit apprehensive about posting an answer suggesting optimizations to your kernel, seeing as the original output has not been reproduced exactly as of yet. There is a major improvement available to be made for problems related to image processing/filtering.
Using local memory will help you out by reducing the number of global reads by a factor of eight, as well as grouping the global writes together for potential gains with the single write-per-pixel output.
The kernel below reads a block of up to 34x34 from pSrc, and outputs a 32x32(max) area of the pDstImage. I hope the comments in the code are enough to guide you in using the kernel. I have not been able to give this a complete test, so there could be changes required. Any comments are appreciated as well.
__kernel void sobel_uchar_wlocal (__global uchar *pSrc, __global float *pDstImage, __global uint2 dimDstImage)
{
//call this kernel 1-dimensional work group size: 32x1
//calculates 32x32 region of output with 32 work items
const uint wid = get_local_id(0);
const uint wid_1 = wid+1; // corrected for the calculation step
const uint2 gid = (uint2)(get_group_id(0),get_group_id(1));
const uint localDim = get_local_size(0);
const uint2 globalTopLeft = (uint2)(localDim * gid.x, localDim * gid.y); //position in pSrc to copy from/to
//dimLocalBuff is used for the right and bottom edges of the image, where the work group may run over the border
const uint2 dimLocalBuff = (uint2)(localDim,localDim);
if(dimDstImage.x - globalTopLeft.x < dimLocalBuff.x){
dimLocalBuff.x = dimDstImage.x - globalTopLeft.x;
}
if(dimDstImage.y - globalTopLeft.y < dimLocalBuff.y){
dimLocalBuff.y = dimDstImage.y - globalTopLeft.y;
}
int i,j;
//save region of data into local memory
__local uchar srcBuff[34][34]; //34^2 uchar = 1156 bytes
for(j=-1;j<dimLocalBuff.y+1;j++){
for(i=x-1;i<dimLocalBuff.x+1;i+=localDim){
srcBuff[i+1][j+1] = pSrc[globalTopLeft.x+i][globalTopLeft.y+j];
}
}
mem_fence(CLK_LOCAL_MEM_FENCE);
//compute output and store locally
__local float dstBuff[32][32]; //32^2 float = 4096 bytes
if(wid_1 < dimLocalBuff.x){
for(i=0;i<dimLocalBuff.y;i++){
float gx = srcBuff[(wid_1-1)+ (i - 1)]-srcBuff[(wid_1-1)+ (i + 1)]+2*(srcBuff[wid_1+ (i-1)]-srcBuff[wid_1+ (i+1)])+srcBuff[(wid_1+1)+ (i - 1)]-srcBuff[(wid_1+1)+ (i + 1)];
float gy = srcBuff[(wid_1-1)+ (i - 1)]-srcBuff[(wid_1+1)+ (i - 1)]+2*(srcBuff[(wid_1-1)+ (i)]-srcBuff[(wid_1+1)+ (i)])+srcBuff[(wid_1-1)+ (i + 1)]-srcBuff[(wid_1+1)+ (i + 1)];
dstBuff[wid][i] = gx*gx + gy*gy;
}
}
mem_fence(CLK_LOCAL_MEM_FENCE);
//copy results to output
for(j=0;j<dimLocalBuff.y;j++){
for(i=0;i<dimLocalBuff.x;i+=localDim){
srcBuff[i][j] = pSrc[globalTopLeft.x+i][globalTopLeft.y+j];
}
}
}

CUDA: __syncthreads() before shared memory operation?

I'm in the rather poor situation of not being able to use the CUDA debugger. I'm getting some strange results from usage of __syncthreads in an application with a single shared array (deltas). The following piece of code is performed in a loop:
__syncthreads(); //if I comment this out, things get funny
deltas[lex_index_block] = intensity - mean;
__syncthreads(); //this line doesnt seem to matter regardless if the first sync is commented out or not
//after sync: do something with the values of delta written in this threads and other threads of this block
Basically, I have code with overlapping blocks (required due to the nature of the algorithm). The program does compile and run but somehow I get systematically wrong values in the areas of vertical overlap. This is very confusing to me as I thought that the correct way to sync is to sync after the threads have performed my write to the shared memory.
This is the whole function:
//XC without repetitions
template <int blocksize, int order>
__global__ void __xc(unsigned short* raw_input_data, int num_frames, int width, int height,
float * raw_sofi_data, int block_size, int order_deprecated){
//we make a distinction between real pixels and virtual pixels
//real pixels are pixels that exist in the original data
//overlap correction: every new block has a margin of 3 threads doing less work (only computing deltas)
int x_corrected = global_x() - blockIdx.x * 3;
int y_corrected = global_y() - blockIdx.y * 3;
//if the thread is responsible for any real pixel
if (x_corrected < width && y_corrected < height){
// __shared__ float deltas[blocksize];
__shared__ float deltas[blocksize];
//the outer pixels of a block do not update SOFI values as they do not have sufficient information available
//they are used only to compute mean and delta
//also, pixels at the global edge have to be thrown away (as there is not sufficient data to interpolate)
bool within_inner_block =
threadIdx.x > 0
&& threadIdx.y > 0
&& threadIdx.x < blockDim.x - 2
&& threadIdx.y < blockDim.y - 2
//global edge
&& x_corrected > 0
&& y_corrected > 0
&& x_corrected < width - 1
&& y_corrected < height - 1
;
//init virtual pixels
float virtual_pixels[order * order];
if (within_inner_block){
for (int i = 0; i < order * order; ++i) {
virtual_pixels[i] = 0;
}
}
float mean = 0;
float intensity;
int lex_index_block = threadIdx.x + threadIdx.y * blockDim.x;
//main loop
for (int frame_idx = 0; frame_idx < num_frames; ++frame_idx) {
//shared memory read and computation of mean/delta
intensity = raw_input_data[lex_index_3D(x_corrected,y_corrected, frame_idx, width, height)];
__syncthreads(); //if I comment this out, things break
deltas[lex_index_block] = intensity - mean;
__syncthreads(); //this doesnt seem to matter
mean = deltas[lex_index_block]/(float)(frame_idx+1);
//if the thread is responsible for correlated pixels, i.e. not at the border of the original frame
if (within_inner_block){
//WORKING WITH DELTA STARTS HERE
virtual_pixels[0] += deltas[lex_index_2D(
threadIdx.x,
threadIdx.y + 1,
blockDim.x)]
*
deltas[lex_index_2D(
threadIdx.x,
threadIdx.y - 1,
blockDim.x)];
virtual_pixels[1] += deltas[lex_index_2D(
threadIdx.x,
threadIdx.y,
blockDim.x)]
*
deltas[lex_index_2D(
threadIdx.x + 1,
threadIdx.y,
blockDim.x)];
virtual_pixels[2] += deltas[lex_index_2D(
threadIdx.x,
threadIdx.y,
blockDim.x)]
*
deltas[lex_index_2D(
threadIdx.x,
threadIdx.y + 1,
blockDim.x)];
virtual_pixels[3] += deltas[lex_index_2D(
threadIdx.x,
threadIdx.y,
blockDim.x)]
*
deltas[lex_index_2D(
threadIdx.x+1,
threadIdx.y+1,
blockDim.x)];
// xc_update<order>(virtual_pixels, delta2, mean);
}
}
if (within_inner_block){
for (int virtual_idx = 0; virtual_idx < order*order; ++virtual_idx) {
raw_sofi_data[lex_index_2D(x_corrected*order + virtual_idx % order,
y_corrected*order + (int)floorf(virtual_idx / order),
width*order)]=virtual_pixels[virtual_idx];
}
}
}
}
From what I can see, there could be a hazard in your application between loop iterations. The write to deltas[lex_index_block] for loop iteration frame_idx+1 could be mapped to the same location as the read of deltas[lex_index_2D(threadIdx.x, threadIdx.y -1, blockDim.x)] in a different thread at iteration frame_idx. The two accesses are unordered and the result is nondeterministic. Try running the app with cuda-memcheck --tool racecheck.

Cuda-memcheck not reporting out of bounds shared memory access

I am runnig the follwoing code using shared memory:
__global__ void computeAddShared(int *in , int *out, int sizeInput){
//not made parameters gidata and godata to emphasize that parameters get copy of address and are different from pointers in host code
extern __shared__ float temp[];
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int ltid = threadIdx.x;
temp[ltid] = 0;
while(tid < sizeInput){
temp[ltid] += in[tid];
tid+=gridDim.x * blockDim.x; // to handle array of any size
}
__syncthreads();
int offset = 1;
while(offset < blockDim.x){
if(ltid % (offset * 2) == 0){
temp[ltid] = temp[ltid] + temp[ltid + offset];
}
__syncthreads();
offset*=2;
}
if(ltid == 0){
out[blockIdx.x] = temp[0];
}
}
int main(){
int size = 16; // size of present input array. Changes after every loop iteration
int cidata[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
/*FILE *f;
f = fopen("invertedList.txt" , "w");
a[0] = 1 + (rand() % 8);
fprintf(f, "%d,",a[0]);
for( int i = 1 ; i< N; i++){
a[i] = a[i-1] + (rand() % 8) + 1;
fprintf(f, "%d,",a[i]);
}
fclose(f);*/
int* gidata;
int* godata;
cudaMalloc((void**)&gidata, size* sizeof(int));
cudaMemcpy(gidata,cidata, size * sizeof(int), cudaMemcpyHostToDevice);
int TPB = 4;
int blocks = 10; //to get things kicked off
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
while(blocks != 1 ){
if(size < TPB){
TPB = size; // size is 2^sth
}
blocks = (size+ TPB -1 ) / TPB;
cudaMalloc((void**)&godata, blocks * sizeof(int));
computeAddShared<<<blocks, TPB,TPB>>>(gidata, godata,size);
cudaFree(gidata);
gidata = godata;
size = blocks;
}
//printf("The error by cuda is %s",cudaGetErrorString(cudaGetLastError()));
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime , start, stop);
printf("time is %f ms", elapsedTime);
int *output = (int*)malloc(sizeof(int));
cudaMemcpy(output, gidata, sizeof(int), cudaMemcpyDeviceToHost);
//Cant free either earlier as both point to same location
cudaError_t chk = cudaFree(godata);
if(chk!=0){
printf("First chk also printed error. Maybe error in my logic\n");
}
printf("The error by threadsyn is %s", cudaGetErrorString(cudaGetLastError()));
printf("The sum of the array is %d\n", output[0]);
getchar();
return 0;
}
Clearly, the first while loop in computeAddShared is causing out of bounds error because I am allocating 4 bytes to shared memory. Why does cudamemcheck not catch this. Below is the output of cuda-memcheck
========= CUDA-MEMCHECK
time is 12.334816 msThe error by threadsyn is no errorThe sum of the array is 13
6
========= ERROR SUMMARY: 0 errors
Shared memory allocation granularity. The Hardware undoubtedly has a page size for allocations (probably the same as the L1 cache line side). With only 4 threads per block, there will "accidentally" be enough shared memory in a single page to let you code work. If you used a sensible number of threads block (ie. a round multiple of the warp size) the error would be detected because there would not be enough allocated memory.

CUDA memory limitations

If I try to send to my CUDA device a struct wich is heavier than the size of memory available, will CUDA give me any kind of warning or error?
I'm asking that because my GPU has 1024 MBytes (1073414144 bytes) Total amount of global memory, but I don't know how I should handle and eventual problem.
That's my code:
#define VECSIZE 2250000
#define WIDTH 1500
#define HEIGHT 1500
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
int width;
int height;
int* elements;
};
int main()
{
Matrix M;
M.width = WIDTH;
M.height = HEIGHT;
M.elements = (int *) calloc(VECSIZE,sizeof(int));
int row, col;
// define Matrix M
// Matrix generator:
for (int i = 0; i < M.height; i++)
for(int j = 0; j < M.width; j++)
{
row = i;
col = j;
if (i == j)
M.elements[row * M.width + col] = INFINITY;
else
{
M.elements[row * M.width + col] = (rand() % 2); // because 'rand() % 1' just does not seems to work ta all.
if (M.elements[row * M.width + col] == 0) // can't have zero weight.
M.elements[row * M.width + col] = INFINITY;
else if (M.elements[row * M.width + col] == 2)
M.elements[row * M.width + col] = 1;
}
}
// Declare & send device Matrix to Device.
Matrix d_M;
d_M.width = M.width;
d_M.height = M.height;
size_t size = M.width * M.height * sizeof(int);
cudaMalloc(&d_M.elements, size);
cudaMemcpy(d_M.elements, M.elements, size, cudaMemcpyHostToDevice);
int *d_k= (int*) malloc(sizeof(int));
cudaMalloc((void**) &d_k, sizeof (int));
int *d_width=(int*)malloc(sizeof(int));
cudaMalloc((void**) &d_width, sizeof(int));
unsigned int *width=(unsigned int*)malloc(sizeof(unsigned int));
width[0] = M.width;
cudaMemcpy(d_width, width, sizeof(int), cudaMemcpyHostToDevice);
int *d_height=(int*)malloc(sizeof(int));
cudaMalloc((void**) &d_height, sizeof(int));
unsigned int *height=(unsigned int*)malloc(sizeof(unsigned int));
height[0] = M.height;
cudaMemcpy(d_height, height, sizeof(int), cudaMemcpyHostToDevice);
/*
et cetera .. */
While you may not currently be sending enough data to the GPU to max out it's memory, when you do, your cudaMalloc will return the error code cudaErrorMemoryAllocation which as per the cuda api docs, signals that the memory allocation failed. I note that in your example code you are not checking the return values of the cuda calls. These return codes need to be checked to make sure your program is running correctly. The cuda api does not throw exceptions: you must check the return codes. See this article for info on checking the errors and getting meaningful messages about the errors
If you are using cutil.h, then it provides two very useful macros:
CUDA_SAFE_CALL (used while issuing functions like cudaMalloc, cudaMemcpy etc.)
and
CUT_CHECK_ERROR (used after executing a kernel to check for errors in kernel execution).
They take care of the errors, if any, by using the error checking mechanism detailed in the article provided by flipchart.

Resources