Copying memory allocated by cudaMallocPitch - memory

Can cudaMemcpy be used for memory allocated with cudaMallocPitch? If not, which function should be used instead? cudaMallocPitch returns linear memory, so I suppose that cudaMemcpy should work.

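You certainly could use cudaMemcpy to copy pitched device memory, but a flat copy also transfers the row padding, so it only lines up when source and destination share the same pitch. It is more usual to use cudaMemcpy2D, which takes a separate pitch for each side. An example of a pitched copy from host to device would look something like this: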
#include <cuda_runtime.h>
#include <assert.h>
typedef float real;
int main(void)
{
    cudaFree(0); // Establish context
    // Host array dimensions
    const size_t dx = 300, dy = 300;
    // For the CUDA API width and pitch are specified in bytes
    size_t width = dx * sizeof(real), height = dy;
    // Host array allocation
    real * host = new real[dx * dy];
    size_t pitch1 = dx * sizeof(real);
    // Device array allocation
    // pitch is determined by the API call
    real * device;
    size_t pitch2;
    assert( cudaMallocPitch((real **)&device, &pitch2, width, height) == cudaSuccess );
    // Sample memory copy - note source and destination pitches can be different
    assert( cudaMemcpy2D(device, pitch2, host, pitch1, width, height, cudaMemcpyHostToDevice) == cudaSuccess );
    // Release the allocations
    assert( cudaFree(device) == cudaSuccess );
    delete[] host;
    // Destroy context
    assert( cudaDeviceReset() == cudaSuccess );
    return 0;
}
(note: untested, caveat emptor and all that.....)
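To make the role of the two pitches concrete, here is a minimal host-only sketch of what a pitched 2D copy does: it copies `width` bytes per row and advances each side by its own pitch. The function name `copy_pitched_2d` is hypothetical and is not part of the CUDA API; it only mirrors the argument order of cudaMemcpy2D on ordinary host memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-only illustration of a pitched 2D copy: copies `height` rows of
// `width_bytes` bytes each, stepping each side by its own pitch.
// copy_pitched_2d is a hypothetical name, not a CUDA API.
void copy_pitched_2d(void* dst, std::size_t dst_pitch,
                     const void* src, std::size_t src_pitch,
                     std::size_t width_bytes, std::size_t height)
{
    // A row can never be wider than the pitch that separates rows.
    assert(width_bytes <= dst_pitch && width_bytes <= src_pitch);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dst_pitch, s + row * src_pitch, width_bytes);
    }
}
```

Copying two 3-byte rows from a tightly packed source (pitch 3) into a destination with pitch 4 leaves the fourth byte of each destination row untouched, which is the padding a flat copy would not account for.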


How to correctly manipulate a CV_16SC3 Mat in a CUDA Kernel

I am writing a CUDA program while working with OpenCV. I have an empty Mat of a given size (e.g. 1000x800), which I explicitly converted to a GpuMat with data type CV_16SC3. I want to manipulate the image in this format in the CUDA kernel. However, trying to manipulate the Mat does not seem to work correctly.
I am calling my CUDA kernel as follows:
my_kernel <<< gridDim, blockDim >>>( (unsigned short*) img.data, img.cols, img.rows, img.step);
and my sample kernel looks like this
__global__ void my_kernel( unsigned short* img, int width, int height, int img_step)
{
    int x, y, pixel;
    y = blockIdx.y * blockDim.y + threadIdx.y;
    x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= height)
        return;
    if (x >= width)
        return;
    pixel = (y * (img_step)) + (3 * x);
    img[pixel] = 255; //I know 255 is basically an uchar, this is just part of my test
    img[pixel+1] = 255;
    img[pixel+2] = 255;
}
I am expecting this small kernel sample to set all pixels to white. However, after downloading the Mat from the GPU again and visualizing it with imshow, not all the pixels are white and some weird black lines are present, which makes me believe that I am somehow writing to invalid memory addresses.
My guess is the following. The OpenCV documentation states that cv::Mat::data returns a uchar pointer. However, my Mat has a data type "16U" (unsigned short, to my knowledge). That is why I am casting the pointer to (unsigned short*) in the kernel launch. But apparently that is incorrect.
How should I correctly proceed to be able to read and write the Mat data as short in my kernel?
First of all, the input image type should be short instead of unsigned short, because the type of the Mat is 16SC3 (rather than 16UC3).
Now, since the image step is in bytes and the data type is short, the pixel index (or address) should be calculated taking into account the difference in byte width between the two. There are two ways to fix this issue.
Method 1:
__global__ void my_kernel( short* img, int width, int height, int img_step)
{
    int x, y;
    y = blockIdx.y * blockDim.y + threadIdx.y;
    x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= height)
        return;
    if (x >= width)
        return;
    //Reinterpret the input pointer as char* to allow jump in bytes instead of short
    char* imgBytes = reinterpret_cast<char*>(img);
    //Calculate row start address using the newly created pointer
    char* rowStartBytes = imgBytes + (y * img_step); // Jump in byte
    //Reinterpret the row start address back to required data type.
    short* rowStartShort = reinterpret_cast<short*>(rowStartBytes);
    short* pixelAddress = rowStartShort + ( 3 * x ); // Jump in short
    //Modify the image values
    pixelAddress[0] = 255;
    pixelAddress[1] = 255;
    pixelAddress[2] = 255;
}
Method 2:
Divide the input image step by the size of required data type (short). It may be done when passing the step as a kernel argument.
my_kernel<<<grid,block>>>( img, width, height, img_step/sizeof(short));
I have used Method 2 for quite a long time. It is a shortcut, but later, when I looked at the source code of certain image processing libraries, I realized that Method 1 is actually more portable, since the size of a type can vary across platforms.
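The two methods can be compared on the host with plain pointer arithmetic. The sketch below is only an illustration under stated assumptions: `pixel_at` and `element_index` are hypothetical helper names, and `channels` would be 3 for CV_16SC3.

```cpp
#include <cassert>
#include <cstddef>

// Method 1: step through rows in bytes, then index in elements.
// pixel_at is a hypothetical helper name.
short* pixel_at(short* base, std::size_t step_bytes, int x, int y, int channels)
{
    char* bytes = reinterpret_cast<char*>(base);
    char* row_start = bytes + static_cast<std::size_t>(y) * step_bytes; // jump in bytes
    return reinterpret_cast<short*>(row_start) + channels * x;          // jump in shorts
}

// Method 2: convert the byte step to an element step once, then use plain indexing.
// Only valid when step_bytes is a multiple of sizeof(short).
std::size_t element_index(std::size_t step_bytes, int x, int y, int channels)
{
    assert(step_bytes % sizeof(short) == 0);
    return static_cast<std::size_t>(y) * (step_bytes / sizeof(short))
         + static_cast<std::size_t>(channels * x);
}
```

Both helpers land on the same element whenever the step is a whole number of shorts. Multiplying `y` by the byte step and using the result directly as a short index, as the kernel in the question does, overshoots each row start by a factor of `sizeof(short)`.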

Passing Mat to OpenCL Kernels causes Segmentation fault

I want to pass an OpenCV Mat to a self-written OpenCL kernel for an FPGA (it doesn't support OpenCV's OpenCL module).
Host code:
Mat img = imread( "template.jpg", IMREAD_GRAYSCALE );
Mat output(img.rows, img.cols, CV_8UC1);
// Program, Context already declared
// Create Kernel
cl_kernel kernel = NULL;
kernel = clCreateKernel(program, "copy", &status);
// Create Command Queue and associate it with the device you want to execute on
cl_command_queue cmdQueue;
cmdQueue = clCreateCommandQueue(context,devices[0], 0, &status);
// Buffer, prob i do something wrong here
cl_mem buffer_img = clCreateBuffer(context,CL_MEM_READ_ONLY, sizeof(uint) * img.cols * img.rows, NULL,&status);
cl_mem buffer_outputimg = clCreateBuffer(context,CL_MEM_WRITE_ONLY, sizeof(uint) * img.cols * img.rows,NULL,&status);
status = clEnqueueWriteBuffer(cmdQueue, buffer_img,CL_FALSE,0,sizeof(uint) * img.cols * img.rows,&img,0,NULL,NULL);
// set kernel arguments
status = clSetKernelArg(kernel,0,sizeof(cl_mem),&buffer_img);
status = clSetKernelArg(kernel,1,sizeof(cl_mem),&buffer_outputimg);
size_t globalWorkSize[2];
globalWorkSize[0] = img.cols;
globalWorkSize[1] = img.rows;
status = clEnqueueNDRangeKernel(cmdQueue,kernel,2,NULL, globalWorkSize, NULL,0, NULL,NULL);
clEnqueueReadBuffer(cmdQueue,buffer_outputimg,CL_TRUE,0,sizeof(uint) * img.cols * img.rows, &output, 0, NULL, NULL);
//stop cpu till queue is finish
__kernel void copy(__global uchar * input, __global uchar * output)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    output[y * get_global_size(0) + x] = input[y * get_global_size(0) + x];
}
When executing it on the FPGA I get a segmentation fault, which is probably due to wrong handling of the OpenCV Mat.
Editing the host code as suggested by api55 solved the problem:
Mat img = imread( "scene.jpg", IMREAD_GRAYSCALE );
Mat output(img.rows, img.cols, CV_8UC1);
// Program, Context already declared
// Create Kernel
cl_kernel kernel = NULL;
kernel = clCreateKernel(program, "copy", &status);
// Create Command Queue and associate it with the device you want to execute on
cl_command_queue cmdQueue;
cmdQueue = clCreateCommandQueue(context,devices[0], 0, &status);
checkError(status, "Failed to create command queue");
// Buffer
cl_mem buffer_img = clCreateBuffer(context,CL_MEM_READ_ONLY, sizeof(uchar) * img.cols * img.rows, NULL,&status);
cl_mem buffer_outputimg = clCreateBuffer(context,CL_MEM_WRITE_ONLY, sizeof(uchar) * img.cols * img.rows,NULL,&status);
checkError(status, "Failed to create buffer_mask");
status = clEnqueueWriteBuffer(cmdQueue, buffer_img,CL_FALSE,0,sizeof(uchar) * img.cols * img.rows,img.data,0,NULL,NULL);
checkError(status, "Failed to enqueue buffer_img");
status = clSetKernelArg(kernel,0,sizeof(cl_mem),&buffer_img);
status = clSetKernelArg(kernel,1,sizeof(cl_mem),&buffer_outputimg);
size_t globalWorkSize[2];
globalWorkSize[0] = img.cols;
globalWorkSize[1] = img.rows;
status = clEnqueueNDRangeKernel(cmdQueue,kernel,2,NULL, globalWorkSize, NULL,0, NULL,NULL);
clEnqueueReadBuffer(cmdQueue,buffer_outputimg,CL_TRUE,0,sizeof(uchar) * img.cols * img.rows,output.data,0,NULL,NULL);
imwrite("output.jpg", output);
I do not have much experience with OpenCL, but I think it is an OpenCV/C++ problem.
The OpenCV Mat data lives in img.data, which is a uchar* pointing to a buffer of sizeof(T) * channels * rows * cols bytes.
Usually, T is uchar when loading images, and channels is 3 (unless it is a greyscale image). A 3-channel uchar image is 24 bits per pixel, and greyscale (as you are loading) is 8 bits per pixel, but you are sizing the transfer with uint, which is 32 bits. At some point the copy will run past the end of the image memory and cause the segmentation fault. Also, if you do not use the data pointer in the structure, you may be copying the header information and just the pointer to the data, not the data itself.
I suggest changing &img to img.data in:
status = clEnqueueWriteBuffer(cmdQueue, buffer_img,CL_FALSE,0,sizeof(uint) * img.cols * img.rows,&img,0,NULL,NULL);
Finally, you need to have the correct data type. I am not sure whether OpenCL can use uchar, but if it can't, change the cv::Mat to another type like this:
img.convertTo(img, CV_32S);
Do this after loading the image. It will change the data to int; OpenCV does not support matrices of unsigned int. Just make sure to change the type accordingly in the other places (e.g. sizeof(uint)), and if you convert the input, remember to create the output with the same type.
If you prefer float, use CV_32F, and if you prefer double, use CV_64F.
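The size mismatch described above can be checked with a little arithmetic. The following is a minimal sketch that assumes a densely packed image with no row padding; `packed_image_bytes` is a hypothetical helper, and for a continuous cv::Mat the same number should be available as img.total() * img.elemSize().

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by a densely packed image (no row padding assumed).
// packed_image_bytes is a hypothetical helper name.
std::size_t packed_image_bytes(int rows, int cols, int channels,
                               std::size_t bytes_per_channel)
{
    assert(rows >= 0 && cols >= 0 && channels > 0);
    return static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols)
         * static_cast<std::size_t>(channels) * bytes_per_channel;
}
```

For an 8-bit greyscale image the buffer holds rows * cols bytes. Sizing the transfer with a 4-byte element type asks for four times that amount, so the copy reads well past the end of the Mat's buffer.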

Unknown error when inverting image using cuda

I began to implement some simple image processing using CUDA, but I have an error in my code.
The error happens when I copy pixels from device to host.
This is my attempt:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <stdio.h>
using namespace cv;
unsigned char *h_pixels;
unsigned char *d_pixels;
int bufferSize;
int width,height;
const int BLOCK_SIZE = 32;
Mat image;
void get_pixels(const char* fileName)
{
    image = imread(fileName);
    bufferSize = image.size().width * image.size().height * 3 * sizeof(unsigned char);
    width = image.size().width;
    height = image.size().height;
    h_pixels = new unsigned char[bufferSize];
}
__global__ void invert_image(unsigned char* pixels,int width,int height)
{
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int cidx = (row * width + col) * 3;
    pixels[cidx] = 255 - pixels[cidx];
    pixels[cidx + 1] = 255 - pixels[cidx + 1];
    pixels[cidx + 2] = 255 - pixels[cidx + 2];
}
int main()
{
    get_pixels("image.jpg"); // placeholder path, the original file name is not shown
    cudaError_t err = cudaMalloc((void**)&d_pixels,bufferSize);
    err = cudaMemcpy(d_pixels,h_pixels,bufferSize,cudaMemcpyHostToDevice);
    dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
    dim3 dimGrid(width/dimBlock.x,height/dimBlock.y);
    invert_image<<<dimGrid,dimBlock>>>(d_pixels,width,height);
    unsigned char *pixels = new unsigned char[bufferSize];
    err= cudaMemcpy(pixels,d_pixels,bufferSize,cudaMemcpyDeviceToHost);// unknown error
    const char * errStr = cudaGetErrorString(err);
    cudaFree(d_pixels);
    image.data = pixels;
    namedWindow("display image");
    imshow("display image",image);
    return 0;
}
Also, how can I find out which error occurs on the CUDA device?
Thanks for your help.
OpenCV images are not guaranteed to be continuous: each row may be padded so that it starts on an aligned address (for example a 4-byte or 8-byte boundary). You should also pass the step field of the Mat to the CUDA kernel, so that you can calculate cidx correctly. The generic formula for the output index is:
cidx = row * (step/elementSize) + (NumberOfChannels * col);
in your case, it will be:
cidx = row * step + (3 * col);
Because of this row alignment, your buffer size should be equal to image.step * image.size().height.
The next thing is the one pointed out by @phoad in the third point: you should create enough thread blocks to cover the whole image.
Here is a generic formula for the grid, which will create enough blocks for any image size:
dim3 grid((width + block.x - 1)/block.x,(height + block.y - 1)/block.y);
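The grid formula and the step-based index are both plain integer arithmetic, so they can be sanity-checked on the host. This is a minimal sketch with hypothetical helper names (`blocks_to_cover`, `pixel_offset`); it assumes an 8-bit image, so the step in bytes equals the step in elements.

```cpp
#include <cassert>
#include <cstddef>

// Number of blocks needed to cover `extent` items with blocks of `block` items
// (integer ceiling division, the same arithmetic as the grid formula).
int blocks_to_cover(int extent, int block)
{
    assert(extent >= 0 && block > 0);
    return (extent + block - 1) / block;
}

// Byte offset of the first channel of pixel (row, col) in an 8-bit image
// whose rows are `step` bytes apart.
std::size_t pixel_offset(int row, int col, std::size_t step, int channels)
{
    assert(row >= 0 && col >= 0 && channels > 0);
    return static_cast<std::size_t>(row) * step
         + static_cast<std::size_t>(channels) * static_cast<std::size_t>(col);
}
```

For a width of 1000 and a block size of 32, truncating division gives 31 blocks, which cover only 992 columns, while the rounded-up formula gives 32. Because the last block then extends past the image edge, the kernel needs a bounds check such as `if (row >= height || col >= width) return;` before touching memory.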
1. First of all, be sure that the image file is read correctly.
2. Check that the device memory is allocated, e.g. with CUDA_SAFE_CALL(cudaMalloc(..)).
3. Check the dimensions of the image. If the dimensions are not multiples of BLOCK_SIZE, then you might be missing some indices and the image is not fully inverted.
4. Call cudaDeviceSynchronize after the kernel call and check its return value.
5. Do you get any error when you run the code without calling the kernel at all?
6. You are not freeing h_pixels and might have a memory leak.
7. Instead of using BLOCK_SIZE in the kernel, you might use blockDim.x, calculating indices like "blockIdx.x * blockDim.x + threadIdx.x".
8. Try not to touch the memory in the kernel code: comment out the memory updates in the kernel (the lines where you access the pixels array) and check whether the program still fails. If it no longer fails, you might be accessing memory out of bounds.
Use this command immediately after the kernel invocation to print the kernel errors:
printf("error code: %s\n",cudaGetErrorString(cudaGetLastError()));

Direct X Sprite Rendering Issue

I have been writing my own library using Direct X and have hit an odd issue. Whilst trying to render an animating sprite I am simply seeing a big black square:
I have stepped through the code obsessively and have concluded that it must be something about the loading of the actual sprites, because everything that I can see in my code is fine. Obviously, I cannot step into the functions such as BltFast, and so cannot tell if my sprite surfaces are being blitted onto the backbuffer successfully.
Here are my load and render functions for the sprite:
/**
 * loads a bitmap file and copies it to a directdraw surface
 * @param pID wait
 * @param pFileName name of the bitmap file to load into memory
 */
void Sprite::Load (const char *pID, const char *pFileName)
{
    // initialises the member variables with the new image id and file name
    mID = pID;
    mFileName = pFileName;
    // creates the necessary variables
    IDirectDrawSurface7 *tDDS;
    // stores bitmap image into HBITMAP handler
    GetObject (tHBM, sizeof (tBM), &tBM);
    // create surface for the HBITMAP to be copied onto
    ZeroMemory (&tDDSD, sizeof (tDDSD));
    tDDSD.dwSize = sizeof (tDDSD);
    tDDSD.dwWidth = tBM.bmWidth;
    tDDSD.dwHeight = tBM.bmHeight;
    DirectDraw::GetInstance ()->DirectDrawObject()->CreateSurface (&tDDSD, &tDDS, NULL);
    // copying bitmap image onto surface
    CopyBitmap(tDDS, tHBM, 0, 0, 0, 0);
    // deletes bitmap image now that it has been used
    // stores the new width and height of the image
    mSpriteWidth = tBM.bmWidth;
    mSpriteHeight = tBM.bmHeight;
    // sets the address of the bitmap surface to this temporary surface with the new bitmap image
    mBitmapSurface = tDDS;
}
/**
 * renders the sprites surface to the back buffer
 * @param pBackBuffer surface to render the sprite to
 * @param pX x co-ordinate to render to (default is 0)
 * @param pY y co-ordinate to render to (default is 0)
 */
void Sprite::Render (LPDIRECTDRAWSURFACE7 &pBackBuffer, float pX, float pY)
{
    if (mSpriteWidth > 800) mSpriteWidth = 800;
    RECT tFrom;
    tFrom.left = tFrom.top = 0;
    tFrom.right = mSpriteWidth;
    tFrom.bottom = mSpriteHeight;
    // bltfast parameters are (position x, position y, dd surface, draw rect, wait flag)
    // pBackBuffer->BltFast (0 + DirectDraw::GetInstance()->ScreenWidth(), 0, mBitmapSurface, &tFrom, DDBLTFAST_WAIT);
    pBackBuffer->BltFast (static_cast<DWORD>(pX + DirectDraw::GetInstance()->ScreenWidth()),
        static_cast<DWORD>(pY), mBitmapSurface, &tFrom, DDBLTFAST_WAIT);
}
The surfaces were simply not a compatible format.
Here's the fixed copybitmap function which I now call in the load function:
extern "C" HRESULT
DDCopyBitmap(IDirectDrawSurface7 * pdds, HBITMAP hbm, int x, int y,
             int dx, int dy)
{
    HDC hdcImage;
    HDC hdc;
    BITMAP bm;
    DDSURFACEDESC2 ddsd;
    HRESULT hr;
    if (hbm == NULL || pdds == NULL)
        return E_FAIL;
    // Make sure this surface is restored.
    pdds->Restore();
    // Select bitmap into a memoryDC so we can use it.
    hdcImage = CreateCompatibleDC(NULL);
    if (!hdcImage)
        OutputDebugString("createcompatible dc failed\n");
    SelectObject(hdcImage, hbm);
    // Get size of the bitmap
    GetObject(hbm, sizeof(bm), &bm);
    dx = dx == 0 ? bm.bmWidth : dx; // Use the passed size, unless zero
    dy = dy == 0 ? bm.bmHeight : dy;
    // Get size of surface.
    ddsd.dwSize = sizeof(ddsd);
    ddsd.dwFlags = DDSD_HEIGHT | DDSD_WIDTH;
    pdds->GetSurfaceDesc(&ddsd);
    if ((hr = pdds->GetDC(&hdc)) == DD_OK)
    {
        StretchBlt(hdc, 0, 0, ddsd.dwWidth, ddsd.dwHeight, hdcImage, x, y,
                   dx, dy, SRCCOPY);
        pdds->ReleaseDC(hdc);
    }
    DeleteDC(hdcImage);
    return hr;
}

CUDA memory limitations

If I try to send my CUDA device a struct which is larger than the memory available, will CUDA give me any kind of warning or error?
I'm asking because my GPU has 1024 MBytes (1073414144 bytes) of total global memory, but I don't know how I should handle an eventual problem.
That's my code:
#define VECSIZE 2250000
#define WIDTH 1500
#define HEIGHT 1500
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
    int width;
    int height;
    int* elements;
};
int main()
{
    Matrix M;
    M.width = WIDTH;
    M.height = HEIGHT;
    M.elements = (int *) calloc(VECSIZE,sizeof(int));
    int row, col;
    // define Matrix M
    // Matrix generator:
    for (int i = 0; i < M.height; i++)
    {
        for(int j = 0; j < M.width; j++)
        {
            row = i;
            col = j;
            if (i == j)
                M.elements[row * M.width + col] = INFINITY;
            else
            {
                M.elements[row * M.width + col] = (rand() % 2); // because 'rand() % 1' just does not seem to work at all.
                if (M.elements[row * M.width + col] == 0) // can't have zero weight.
                    M.elements[row * M.width + col] = INFINITY;
                else if (M.elements[row * M.width + col] == 2)
                    M.elements[row * M.width + col] = 1;
            }
        }
    }
    // Declare & send device Matrix to Device.
    Matrix d_M;
    d_M.width = M.width;
    d_M.height = M.height;
    size_t size = M.width * M.height * sizeof(int);
    cudaMalloc(&d_M.elements, size);
    cudaMemcpy(d_M.elements, M.elements, size, cudaMemcpyHostToDevice);
    int *d_k= (int*) malloc(sizeof(int));
    cudaMalloc((void**) &d_k, sizeof (int));
    int *d_width=(int*)malloc(sizeof(int));
    cudaMalloc((void**) &d_width, sizeof(int));
    unsigned int *width=(unsigned int*)malloc(sizeof(unsigned int));
    width[0] = M.width;
    cudaMemcpy(d_width, width, sizeof(int), cudaMemcpyHostToDevice);
    int *d_height=(int*)malloc(sizeof(int));
    cudaMalloc((void**) &d_height, sizeof(int));
    unsigned int *height=(unsigned int*)malloc(sizeof(unsigned int));
    height[0] = M.height;
    cudaMemcpy(d_height, height, sizeof(int), cudaMemcpyHostToDevice);
    /* et cetera .. */
}
While you may not currently be sending enough data to the GPU to max out its memory, when you do, your cudaMalloc will return the error code cudaErrorMemoryAllocation, which, as per the CUDA API docs, signals that the memory allocation failed. I note that in your example code you are not checking the return values of the CUDA calls. These return codes need to be checked to make sure your program is running correctly. The CUDA API does not throw exceptions: you must check the return codes. See this article for info on checking the errors and getting meaningful messages about them.
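As a rough check of how far the example is from the limit, the request size can be computed on the host before any allocation. This is only a sketch with hypothetical helper names (`matrix_bytes`, `fits_in`); the return code of cudaMalloc remains the authoritative answer, since other allocations also consume device memory.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for a width x height matrix of int, computed in size_t rather
// than int so that large dimensions do not overflow int arithmetic.
// matrix_bytes and fits_in are hypothetical helper names.
std::size_t matrix_bytes(int width, int height)
{
    assert(width >= 0 && height >= 0);
    return static_cast<std::size_t>(width) * static_cast<std::size_t>(height) * sizeof(int);
}

// True when `needed` bytes do not exceed `available` bytes.
bool fits_in(std::size_t needed, std::size_t available)
{
    return needed <= available;
}
```

With 4-byte ints, the 1500 x 1500 matrix from the question needs 9,000,000 bytes, far below the 1073414144 bytes reported for the card, whereas a 20000 x 20000 matrix would need 1.6 GB and the allocation would be expected to fail.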
If you are using cutil.h, then it provides two very useful macros:
CUDA_SAFE_CALL (used while issuing functions like cudaMalloc, cudaMemcpy etc.)
CUT_CHECK_ERROR (used after executing a kernel to check for errors in kernel execution).
They take care of the errors, if any, by using the error checking mechanism detailed in the article provided by flipchart.
