My code is giving an error message and I am trying to track down the cause of it. To make it easier to find the problem, I have stripped away code that apparently is not relevant to causing the error message. If you can tell me why the following simple code produces an error message, then I think I should be able to fix my original code:
#include "cuComplex.h"
#include <cutil.h>
__device__ void compute_energy(void *data, int isample, int nsamples) {
    cuDoubleComplex * const nminusarray = (cuDoubleComplex*)data;
    cuDoubleComplex * const f = (cuDoubleComplex*)(nminusarray+101);
    double * const abs_est_errorrow_all = (double*)(f+3);
    double * const rel_est_errorrow_all = (double*)(abs_est_errorrow_all+nsamples*51);
    int * const iid_all = (int*)(rel_est_errorrow_all+nsamples*51);
    int * const iiu_all = (int*)(iid_all+nsamples*21);
    int * const piv_all = (int*)(iiu_all+nsamples*21);
    cuDoubleComplex * const energyrow_all = (cuDoubleComplex*)(piv_all+nsamples*12);
    cuDoubleComplex * const refinedenergyrow_all = (cuDoubleComplex*)(energyrow_all+nsamples*51);
    cuDoubleComplex * const btplus_all = (cuDoubleComplex*)(refinedenergyrow_all+nsamples*51);
    cuDoubleComplex * const btplus = btplus_all+isample*21021;
    btplus[0] = make_cuDoubleComplex(0.0, 0.0);
}
__global__ void computeLamHeight(void *data, int nlambda) {
    compute_energy(data, blockIdx.x, nlambda);
}
int main(int argc, char *argv[]) {
    void *device_data;
    CUT_DEVICE_INIT(argc, argv);
    CUDA_SAFE_CALL(cudaMalloc(&device_data, 184465640));
    computeLamHeight<<<dim3(101, 1, 1), dim3(512, 1, 1), 45000>>>(device_data, 101);
}
I am using a GeForce GTX 480 and I am compiling the code like so:
nvcc -L /soft/cuda-sdk/4.0.17/C/lib -I /soft/cuda-sdk/4.0.17/C/common/inc -lcutil_x86_64 -arch sm_13 -O3 -Xopencc "-Wall"
The output is:
Using device 0: GeForce GTX 480
Cuda error in file '' in line 31 : unspecified launch failure.
EDIT: I have now further simplified the code. The following simpler code still produces the error message:
#include <cutil.h>
__global__ void compute_energy(void *data) {
    *(double*)((int*)data+101) = 0.0;
}
int main(int argc, char *argv[]) {
    void *device_data;
    CUT_DEVICE_INIT(argc, argv);
    CUDA_SAFE_CALL(cudaMalloc(&device_data, 101*sizeof(int)+sizeof(double)));
    compute_energy<<<dim3(1, 1, 1), dim3(1, 1, 1)>>>(device_data);
}
Now it is easy to see that the offset should be valid. I tried running cuda-memcheck and it says the following:
Using device 0: GeForce GTX 480
Cuda error in file '' in line 13 : unspecified launch failure.
========= Invalid __global__ write of size 8
========= at 0x00000020 in compute_energy
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x200200194 is misaligned
========= ERROR SUMMARY: 1 error
I tried searching the internet to find what is meant by the address being misaligned, but I failed to find an explanation. What is the deal?

It was very hard to parse your original code with all of those magic constants, but your updated repro case makes the problem immediately obvious. The GPU architecture requires every memory access to be naturally aligned, that is, the address must be a multiple of the size of the type being read or written. Your kernel writes a double through a pointer that is not correctly aligned. A double is a 64 bit type, so its address must be a multiple of 8 bytes, and an offset of 101 ints (404 bytes) is not. This:
*(double*)((int*)data+100) = 0.0; // 50th double
or this:
*(double*)((int*)data+102) = 0.0; // 51st double
are both legal. This:
*(double*)((int*)data+101) = 0.0; // not aligned to a 64 bit boundary
is not.
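The same check can be done on the host before launching anything. The following is only a sketch: `is_aligned` and `align_up` are hypothetical helper names, not part of CUDA, and it assumes a 4 byte int and an 8 byte double, as on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when a byte offset (or a full address) is a multiple of `alignment`.
inline bool is_aligned(std::uint64_t value, std::size_t alignment) {
    return value % alignment == 0;
}

// Round a byte offset up to the next multiple of `alignment`.
// `alignment` must be a power of two.
inline std::size_t align_up(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

With these, `is_aligned(101 * sizeof(std::int32_t), 8)` is false, and so is `is_aligned(0x200200194, 8)` for the address reported by cuda-memcheck. When packing arrays of different types into one allocation, passing each running offset through `align_up(offset, 8)` before placing a double array avoids the problem.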

The error indicates an out-of-bounds memory access; please check the offset value.


Increasing the number of threads has no effect on runtime

I have tried to implement an alpha image blending algorithm in CUDA C. There is no error in my code; it compiles fine. As I understand threading, if I run the code with more threads, the runtime should decrease. Instead, I got a weird pattern of runtimes. With 1 thread the runtime was 8.060539e-01 sec, with 4 threads it was 7.579031e-01 sec, with 8 threads it was 7.810102e-01 sec, and with 256 threads it was 7.875319e-01 sec.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include "timer.h"
#include "stb_image.h"
#include "stb_image_write.h"
__global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){
    int col = threadIdx.x + blockIdx.x*blockDim.x;
    int row = threadIdx.y + blockIdx.y*blockDim.y;
    if(col<width && row<height){
        size_t img_size = width * height * channels;
        if (Pout != NULL) {
            for (size_t i = 0; i < img_size; i++) {
                Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
            }
        }
    }
}
int main(int argc, char* argv[]){
    int thread_count;
    double start, finish;
    float alpha;
    int width, height, channels;
    unsigned char *new_img;
    thread_count = strtol(argv[1], NULL, 10);
    printf("Enter the value for alpha:");
    scanf("%f", &alpha);
    unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
    unsigned char *orange = stbi_load("orange.jpg", &width, &height, &channels, 0);
    size_t img_size = width * height * channels;
    //unsigned char *new_img = malloc(img_size);
    cudaMallocManaged(&new_img,img_size*sizeof(unsigned char));
    cudaMallocManaged(&apple,img_size* sizeof(unsigned char));
    cudaMallocManaged(&orange, img_size*sizeof(unsigned char));
    image_blend<<<1,16,thread_count>>>(new_img,apple, orange, width, height, channels,alpha);
    stbi_write_jpg("new_image.jpg", width, height, channels, new_img, 100);
    printf("\n Elapsed time for cuda = %e seconds\n", finish-start);
}
After getting this weird pattern of runtimes, I am a bit skeptical about my implementation. Can anyone tell me why I get those runtimes even though my code has no bug?
Let's start here:
image_blend<<<1,16,thread_count>>>(new_img,apple, orange, width, height, channels,alpha);
It seems evident you don't understand the kernel launch syntax:
The first number (1) is the number of blocks to launch.
The second number (16) is the number of threads per block.
The third number (thread_count) is the size of the dynamically allocated shared memory in bytes.
So our first observation will be that although you claimed to have changed the thread count, you didn't. You were changing the number of bytes of dynamically allocated shared memory. Since your kernel code doesn't use shared memory, this is a completely meaningless variable.
Let's also observe your kernel code:
for (size_t i = 0; i < img_size; i++)
Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
For every thread that passes your if test, each one of those threads will execute the entire for-loop and will process the entire image. That is not the general idea with writing CUDA kernels. The general idea is to break up the work so that each thread does a portion of the work, not the whole activity.
These are very basic observations. An orderly introduction to CUDA will get you beyond some of these basic concepts.
We could also point out that your kernel nominally expects a 2D launch, and you are not providing one, and perhaps many other observations. Another important concept that you are missing is that you cannot do this:
unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
cudaMallocManaged(&apple,img_size* sizeof(unsigned char));
and expect anything sensible to come from that. If you want to see how data is moved from a host allocation to the device, study nearly any CUDA sample code, such as vectorAdd. Using a managed allocation doesn't allow you to overwrite the pointer like you are doing and get anything useful from that.
I'll provide an example of how one might go about doing what I think you are suggesting, without providing a complete tutorial on CUDA. To provide an example, I'm going to skip the STB image loading routines. To understand the work you are trying to do here, the actual image content does not matter.
Here's an example of an image processing kernel (1D) that will:
Process the entire image, only once
Use less time, roughly speaking, as you increase the thread count.
You haven't provided your timer routine/code, so I'll provide my own:
$ cat
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start=0){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
unsigned char *i_load(int w, int h, int c, int init){
  unsigned char *res = new unsigned char[w*h*c];
  for (int i = 0; i < w*h*c; i++) res[i] = init;
  return res;
}
__global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){
  if (Pout != NULL) {
    size_t img_size = width * height * channels;
    for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < img_size; i+=gridDim.x*blockDim.x) // grid-stride loop
      Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
  }
}
int main(int argc, char* argv[]){
  int threads_per_block = 64;
  unsigned long long dt;
  float alpha;
  int width = 1920;
  int height = 1080;
  int channels = 3;
  size_t img_size = width * height * channels;
  int thread_count = img_size;
  if (argc > 1) thread_count = atoi(argv[1]);
  unsigned char *new_img, *m_apple, *m_orange;
  printf("Enter the value for alpha:");
  scanf("%f", &alpha);
  unsigned char *apple = i_load(width, height, channels, 10);
  unsigned char *orange = i_load(width, height, channels, 70);
  //unsigned char *new_img = malloc(img_size);
  cudaMallocManaged(&new_img,img_size*sizeof(unsigned char));
  cudaMallocManaged(&m_apple,img_size* sizeof(unsigned char));
  cudaMallocManaged(&m_orange, img_size*sizeof(unsigned char));
  memcpy(m_apple, apple, img_size);
  memcpy(m_orange, orange, img_size);
  int blocks;
  if (thread_count < threads_per_block) {threads_per_block = thread_count; blocks = 1;}
  else {blocks = thread_count/threads_per_block;}
  printf("running with %d blocks of %d threads\n", blocks, threads_per_block);
  dt = dtime_usec(0);
  image_blend<<<blocks, threads_per_block>>>(new_img,m_apple, m_orange, width, height, channels,alpha);
  cudaDeviceSynchronize(); // kernel launches are asynchronous: wait for completion before stopping the timer
  dt = dtime_usec(dt);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) printf("CUDA Error: %s\n", cudaGetErrorString(err));
  else printf("\n Elapsed time for cuda = %e seconds\n", dt/(float)USECPSEC);
  return 0;
}
$ nvcc -o t2130
$ ./t2130 1
Enter the value for alpha:0.2
running with 1 blocks of 1 threads
Elapsed time for cuda = 5.737880e-01 seconds
$ ./t2130 2
Enter the value for alpha:0.2
running with 1 blocks of 2 threads
Elapsed time for cuda = 3.230150e-01 seconds
$ ./t2130 32
Enter the value for alpha:0.2
running with 1 blocks of 32 threads
Elapsed time for cuda = 4.865200e-02 seconds
$ ./t2130 64
Enter the value for alpha:0.2
running with 1 blocks of 64 threads
Elapsed time for cuda = 2.623300e-02 seconds
$ ./t2130 128
Enter the value for alpha:0.2
running with 2 blocks of 64 threads
Elapsed time for cuda = 1.546000e-02 seconds
$ ./t2130
Enter the value for alpha:0.2
running with 97200 blocks of 64 threads
Elapsed time for cuda = 5.809000e-03 seconds
(CentOS 7, CUDA 11.4, V100)
The key methodology that allows the kernel to do all the work (only once) while making use of an "arbitrary" number of threads efficiently is the grid-stride loop.
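To see why a grid-stride loop covers every element exactly once regardless of the launch size, here is a small host-side model of the indexing. It is a sketch only: `visit_counts` and `each_visited_once` are hypothetical names, and the outer loop merely stands in for the threads that the GPU would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of a 1D grid-stride loop. Simulated thread `tid` out of
// `total_threads` visits elements tid, tid + total_threads, tid + 2*total_threads, ...
// Returns how many times each of the n elements was visited across all threads.
inline std::vector<int> visit_counts(std::size_t n, std::size_t total_threads) {
    assert(total_threads > 0);
    std::vector<int> counts(n, 0);
    for (std::size_t tid = 0; tid < total_threads; ++tid)       // every simulated thread
        for (std::size_t i = tid; i < n; i += total_threads)    // the grid-stride loop
            ++counts[i];
    return counts;
}

// True when every element was processed exactly once.
inline bool each_visited_once(const std::vector<int>& counts) {
    for (int c : counts)
        if (c != 1) return false;
    return true;
}
```

`each_visited_once(visit_counts(1000, t))` holds for t = 1, 64, or 4096: with fewer threads than elements each thread loops more often, and with more threads than elements the surplus threads simply do nothing.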

lldb - how to read the permissions of a memory region for a thread?

Apple says that on ARM64 Macs memory regions can have either write or execution permissions for a thread. How would someone find out the current permissions for a memory region for a thread in lldb? I have tried 'memory region ' but that returns rwx. I am working on a Just-In-Time compiler that will run on my M1 Mac. For testing I made a small simulation of a Just-In-Time compiler.
#include <cstdio>
#include <sys/mman.h>
#include <pthread.h>
#include <libkern/OSCacheControl.h>
#include <stdlib.h>
int main(int argc, const char * argv[]) {
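    size_t size = 1024 * 1024 * 640;
    // Assumed values: the usual protection and flags for a MAP_JIT mapping
    int prot = PROT_READ | PROT_WRITE | PROT_EXEC;
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT;
    int fd = -1;
    int offset = 0;
    unsigned *addr = 0;
    // allocate a mmap'ed region of memory
    addr = (unsigned *)mmap(0, size, prot, flags, fd, offset);
    if (addr == MAP_FAILED){
        printf("failure detected\n");
        return 1;
    }
    // Make the region writable for this thread
    pthread_jit_write_protect_np(0);
    // Write instructions to the memory
    addr[0] = 0xd2800005; // mov x5, #0x0
    addr[1] = 0x910004a5; // add x5, x5, #0x1
    addr[2] = 0x17ffffff; // b <address>
    // Make the region executable for this thread
    pthread_jit_write_protect_np(1);
    sys_icache_invalidate(addr, size);
    // Execute the code
    int(*f)() = (int (*)()) addr;
    (*f)();
    return 0;
}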
Once the assembly instructions start executing through the (*f)() call, I can pause execution in Xcode and type
memory region {address of instructions}
into the debugger. For some reason it keeps returning 'rwx'. Am I using the right command or could this be a bug with lldb?
When I run your little program on a Mac where I can poke around (I'm on x86_64 but it shouldn't matter, I don't actually need to run the instructions...) I see in lldb:
Process 43209 stopped
* thread #1, queue = '', stop reason = breakpoint 1.1
frame #0: 0x0000000100003f20 protectit`main at protectit.cpp:31
28 addr[2] = 0x17ffffff; // b <address>
30 pthread_jit_write_protect_np(1);
-> 31 sys_icache_invalidate(addr, size);
33 // Execute the code
34 int(*f)() = (int (*)()) addr;
Target 0: (protectit) stopped.
(lldb) memory region addr
[0x0000000101000000-0x0000000129000000) rwx
which is as you report. I then double-checked with vmmap:
> vmmap 43209 0x0000000101000000
0x101000000 is in 0x101000000-0x129000000; bytes after start: 0 bytes before end: 671088639
MALLOC_SMALL 100800000-101000000 [ 8192K 8K 8K 0K] rw-/rwx SM=PRV MallocHelperZone_0x1001c4000
---> VM_ALLOCATE 101000000-129000000 [640.0M 4K 4K 0K] rwx/rwx SM=PRV
GAP OF 0x5ffed7000000 BYTES
MALLOC_NANO 600000000000-600008000000 [128.0M 88K 88K 0K] rw-/rwx SM=PRV DefaultMallocZone_0x1001f1000
so vmmap agrees with lldb that the region is rwx.
Whatever pthread_jit_write_protect_np is doing, it doesn't seem to be changing the underlying memory region protections.
I found out the answer to my question is to read an undocumented Apple register called S3_6_c15_c1_5.
This code reads the raw value from the register:
// Returns the S3_6_c15_c1_5 register's value
uint64_t read_S3_6_c15_c1_5_register(void)
{
    uint64_t v;
    __asm__ __volatile__("isb sy\n"
                         "mrs %0, S3_6_c15_c1_5\n"
                         : "=r"(v)::"memory");
    return v;
}
This code tells you what your thread's current mode is:
// Returns the mode for a thread.
// Returns "Executable" or "Writable".
// Remember to free() the value returned by this function.
char *get_thread_mode()
{
    uint64_t value = read_S3_6_c15_c1_5_register();
    char *return_value = (char *) malloc(50);
    switch (value) {
    case 0x2010000030300000:
        sprintf(return_value, "Writable");
        break;
    case 0x2010000030100000:
        sprintf(return_value, "Executable");
        break;
    default:
        sprintf(return_value, "Unknown state: %llx", value);
    }
    return return_value;
}
This is a small test program to demonstrate these two functions:
int main(int argc, char *argv[]) {
    pthread_jit_write_protect_np(1);
    printf("Thread's mode: %s\n", get_thread_mode());
    // The mode is Executable
    pthread_jit_write_protect_np(0);
    printf("Thread's mode: %s\n", get_thread_mode());
    // The mode is Writable
    return 0;
}

Global device memory size limit when using statically allocated memory in CUDA

I thought the maximum size of global memory should be limited only by the GPU device, no matter whether it is allocated statically using __device__ __managed__ or dynamically using cudaMalloc.
But I found that when using __device__ __managed__, the maximum array size I can declare is much smaller than the GPU device limit.
The minimal working example is as follows:
#include <stdio.h>
#include <cuda_runtime.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
#define MX 64
#define MY 64
#define MZ 64
#define NX 64
#define NY 64
#define M (MX * MY * MZ)
__device__ __managed__ float A[NY][NX][M];
__device__ __managed__ float B[NY][NX][M];
__global__ void swapAB()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for(int j = 0; j < NY; j++)
        for(int i = 0; i < NX; i++)
            A[j][i][tid] = B[j][i][tid];
}
int main()
{
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    return 0;
}
It uses 64^5 * 2 * 4 / 2^30 GB = 8 GB of global memory, and I compile and run it on an NVIDIA Tesla K40c GPU, which has 12 GB of global memory.
Compiler cmd:
nvcc -gencode arch=compute_30,code=sm_30
Output warning:
warning: overflow in implicit constant conversion.
When I ran the generated executable, an error says:
GPUassert: an illegal memory access was encountered
Surprisingly, if I use dynamically allocated global memory of the same size (8 GB) via the cudaMalloc API instead, there is no compile warning and no runtime error.
I'm wondering whether there is any special limitation on the allocatable size of static global device memory in CUDA.
PS: OS and CUDA: CentOS 6.5 x64, CUDA-7.5.
This would appear to be a limitation of the CUDA runtime API. The root cause is this function (in CUDA 7.5):
void __cudaRegisterVar(
    void **fatCubinHandle,
    char *hostVar,
    char *deviceAddress,
    const char *deviceName,
    int ext,
    int size,
    int constant,
    int global
);
which only accepts a signed int for the size of any statically declared device variable. This limits the maximum size to 2^31 - 1 (2147483647) bytes. The warning you see is because the CUDA front end is emitting boilerplate code containing calls to __cudaRegisterVar like this:
__cudaRegisterManagedVariable(__T26, __shadow_var(A,::A), 0, 4294967296, 0, 0);
__cudaRegisterManagedVariable(__T26, __shadow_var(B,::B), 0, 4294967296, 0, 0);
It is the 4294967296 which is the source of the problem. The size will overflow the signed integer and cause the API call to blow up. So it seems you are limited to 2 GB per static variable for the moment. I would recommend raising this as a bug with NVIDIA if it is a serious problem for your application.
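The arithmetic behind that number can be checked on the host. This is just a sketch with hypothetical helper names (`array_bytes`, `fits_in_int32`, `low_32_bits`); it assumes a 4 byte float and a 32 bit int, and it does not call any CUDA API.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Size in bytes of one float array declared as [ny][nx][m], computed in 64 bits.
inline std::uint64_t array_bytes(std::uint64_t ny, std::uint64_t nx, std::uint64_t m) {
    return ny * nx * m * sizeof(float);
}

// True when the byte count can be represented by a signed 32-bit size parameter.
inline bool fits_in_int32(std::uint64_t bytes) {
    return bytes <= static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// The value that remains when only the low 32 bits of the size are kept.
inline std::uint32_t low_32_bits(std::uint64_t bytes) {
    return static_cast<std::uint32_t>(bytes);
}
```

For the arrays in the question, `array_bytes(64, 64, 64 * 64 * 64)` is 4294967296, `fits_in_int32` rejects it, and its low 32 bits are 0, which is consistent with the compiler's "overflow in implicit constant conversion" warning.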

main input error and delays when libvlc stream images in memory

I'm working on a project to stream images in memory with libvlc. As a test, I stream camera frames. I have two problems: first, there are huge delays (about 7 s), and second, the stream is very unstable.
It would be helpful if you can find some mistakes in my code!
I have these 3 errors repeated many times:
main input error: ES_OUT_SET_PCR is called too late(pts_delay increased to 692 ms)
main input error: ES_OUT_RESET_PCR called
avcodec decoder error: more than 5 seconds of late video -> dropping frame (computer too slow?)
I'm especially curious about the last error: why is there a decoder error when I only want to encode some images?
And here is my code:
#include <Windows.h>
#include <vlc.h>
#include <vlc_common.h>
#include <vlc_threads.h>
//#include <vlc/plugins/vlc_threads.h>
using namespace std;
#include <opencv2/opencv.hpp>
using namespace cv;
#define CAMERA_WIDTH 640
#define CAMERA_HEIGHT 480
vlc_mutex_t imem_get_mutex;
VideoCapture *g_camera;
int g_transport_number = 8080;
static int vlc_imem_get_callback(void *data, const char *cookie,
                                 int64_t * dts, int64_t * pts,
                                 unsigned *flags, size_t * size,
                                 void **output)
{
    Mat frame;
    (*g_camera) >> frame;
    *output = malloc(frame.rows * frame.cols * 3);
    memcpy(*output, frame.data, frame.rows * frame.cols * 3);
    if (pts)
        *pts = 1;
    if (dts)
        *dts = 1;
    // *size=(size_t)300;
    *size=(size_t)(frame.rows * frame.cols * 3);
    return 0;
}
static void vlc_imem_release_callback(void *data, const char *cookie,
                                      size_t size, void *unknown)
{
    // printf("release\n\n");
}
int main()
g_camera = new VideoCapture(0);
libvlc_instance_t * inst;
libvlc_media_player_t *mp;
libvlc_media_t *m;
char smem_options1[2000];
char venc_options[1000];
// sprintf(venc_options,"profile=baseline,level=3,keyint=50,bframes=3,no-cabac,ref=3,no-interlaced,vbv-maxrate=512,vbv-bufsize=256,aq-mode=0,no-mbtree,partitions=none,no-weightb,weightp=0,me=dia,subme=0,no-mixed-refs,no-8x8dct,trellis=0");
// sprintf(smem_options1,"#transcode{vcodec=h264,vb=1000,fps=30,scale=0,width=640,height=480,channels=1,samplerate=44100}:duplicate{dst=http{mux=ts,dst=:%d/test},dst=display",venc_options,g_transport_number);
char str_imem_get[100], str_imem_release[100],str_imem_data[100];
sprintf(str_imem_get, "--imem-get=%ld", vlc_imem_get_callback);
sprintf(str_imem_release, "--imem-release=%ld", vlc_imem_release_callback);
// sprintf(str_imem_data,"--imem-data=%ld",(long int)test_buffer);
const char * const vlc_args[] = {
inst = libvlc_new (sizeof (vlc_args) / sizeof (vlc_args[0]), vlc_args);
m = libvlc_media_new_location(inst, "imem://");
mp = libvlc_media_player_new_from_media (m);
libvlc_media_release (m);
libvlc_media_player_play (mp);
Sleep (200000);
libvlc_media_player_stop (mp);
libvlc_media_player_release (mp);
libvlc_release (inst);
return 0;
Thanks for your help and I'm sorry for my poor English...
I had this exact same problem and discovered the issue was related to the DTS and PTS values I was using when capturing from a live source using OpenCV, like you. I was calculating the DTS and PTS values in real time to avoid the pts_delay increase, but then after about 5 seconds, as in your case, the time between imem get callbacks kept increasing. I then used a fixed frame interval, such as 33333 (about one frame at 30 fps, in microseconds), added each time. This fixed the lag issue, but resulted in the first error with the clock reset. The solution I found was to set DTS to -1 (unused), and set the value of PTS to libvlc_clock(). For example:
int MyImemGetCallback (void *data,
const char *cookie,
int64_t *dts,
int64_t *pts,
unsigned *flags,
size_t * bufferSize,
void ** buffer)
MyImemData* imem = (MyImemData*)data;
if(imem == NULL)
return 0;
return 1; // Exit
if(imem->mFrameNumber == imem->mPrevFrameNumber)
return 0; // No new image data
// Update frame count information...
imem->mPrevFrameNumber = imem->mFrameNumber;
*dts = -1;
// You can use libvlc_clock to avoid PCR reset and delays
// on realtime data...
*pts = libvlc_clock();
*bufferSize =
    imem->mOpenCvImage->rows *
    imem->mOpenCvImage->cols *
    imem->mOpenCvImage->channels();
*buffer = imem->mOpenCvImage->data;
return 0; // Success.
I know this was posted 6 months ago, but I came across the same problem today and figured someone else might too.
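To make the two timestamp schemes described above concrete, here is a rough sketch. `now_usec` is only a hypothetical stand-in for libvlc_clock() built on std::chrono (it is not how libvlc implements it), and `frame_interval_usec` and `fixed_pts` are illustrative names for the fixed-interval approach.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for libvlc_clock(): microseconds from a monotonic clock.
inline std::int64_t now_usec() {
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
}

// Fixed per-frame interval in microseconds for a nominal frame rate
// (integer division, so 30 fps gives 33333).
inline std::int64_t frame_interval_usec(int fps) {
    assert(fps > 0);
    return 1000000 / fps;
}

// PTS of frame number n when a fixed interval is added for every frame.
inline std::int64_t fixed_pts(std::int64_t n, int fps) {
    return n * frame_interval_usec(fps);
}
```

A fixed-interval PTS advances by exactly 33333 per frame no matter when the camera actually delivers the frame, whereas a clock-based PTS follows the real capture time. With a live source whose frame timing jitters, the second behaviour is the one the answer above relies on.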

Unknown error when inverting image using cuda

I began to implement some simple image processing using CUDA, but I have an error in my code.
The error happens when I copy pixels from device to host.
This is my attempt:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <opencv2\core\core.hpp>
#include <opencv2\highgui\highgui.hpp>
#include <stdio.h>
using namespace cv;
unsigned char *h_pixels;
unsigned char *d_pixels;
int bufferSize;
int width,height;
const int BLOCK_SIZE = 32;
Mat image;
void get_pixels(const char* fileName)
{
    image = imread(fileName);
    bufferSize = image.size().width * image.size().height * 3 * sizeof(unsigned char);
    width = image.size().width;
    height = image.size().height;
    h_pixels = new unsigned char[bufferSize];
}
__global__ void invert_image(unsigned char* pixels,int width,int height)
{
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int cidx = (row * width + col) * 3;
    pixels[cidx] = 255 - pixels[cidx];
    pixels[cidx + 1] = 255 - pixels[cidx + 1];
    pixels[cidx + 2] = 255 - pixels[cidx + 2];
}
int main()
{
    cudaError_t err = cudaMalloc((void**)&d_pixels,bufferSize);
    err = cudaMemcpy(d_pixels,h_pixels,bufferSize,cudaMemcpyHostToDevice);
    dim3 dimGrid(width/dimBlock.x,height/dimBlock.y);
    unsigned char *pixels = new unsigned char[bufferSize];
    err= cudaMemcpy(pixels,d_pixels,bufferSize,cudaMemcpyDeviceToHost);// unknown error
    const char * errStr = cudaGetErrorString(err);
    cudaFree(d_pixels);
    image.data = pixels;
    namedWindow("display image");
    imshow("display image",image);
    return 0;
}
Also, how can I find out which error occurs on the CUDA device?
Thanks for your help.
OpenCV images are not necessarily continuous in memory. Each row may be padded to a 4 byte or 8 byte boundary. You should also pass the step field of the Mat to the CUDA kernel, so that you can calculate cidx correctly. The generic formula to calculate the output index is:
cidx = row * (step/elementSize) + (NumberOfChannels * col);
in your case, it will be:
cidx = row * step + (3 * col);
Because of this row alignment, your buffer size is equal to image.step * image.size().height.
The next thing is the one pointed out by @phoad in the third point. You should create enough thread blocks to cover the whole image.
Here is a generic formula for the grid which will create enough blocks for any image size.
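As a rough host-side illustration of the step-based indexing, assuming an 8-bit image so that the element size is 1 byte: the helper names below are hypothetical, and `padded_step` only models one possible padding rule rather than what OpenCV actually does for a given image.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of the first channel of pixel (row, col) in an 8-bit image
// whose rows are `step_bytes` apart (step_bytes >= width * channels).
inline std::size_t pixel_offset(std::size_t row, std::size_t col,
                                std::size_t step_bytes, std::size_t channels) {
    return row * step_bytes + channels * col;
}

// Number of bytes to allocate and copy for the whole padded image.
inline std::size_t padded_buffer_size(std::size_t step_bytes, std::size_t height) {
    return step_bytes * height;
}

// Model of a padded row: smallest multiple of `alignment` that holds one row.
inline std::size_t padded_step(std::size_t width, std::size_t channels,
                               std::size_t alignment) {
    assert(alignment > 0);
    std::size_t row_bytes = width * channels;
    return (row_bytes + alignment - 1) / alignment * alignment;
}
```

For a 5 pixel wide, 3 channel image padded to 4 bytes, the step is 16 rather than 15, so pixel (1, 2) starts at byte 22, while the packed formula (row * width + col) * 3 would give 21 and read the wrong bytes.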
dim3 grid((width + block.x - 1)/block.x,(height + block.y - 1)/block.y);
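The grid formula is a ceiling division. A minimal host-side sketch of it follows, with `blocks_needed` as a hypothetical helper name:

```cpp
#include <cassert>

// Number of blocks of size `block` needed to cover `extent` elements
// (integer ceiling division).
inline unsigned blocks_needed(unsigned extent, unsigned block) {
    assert(block > 0);
    return (extent + block - 1) / block;
}
```

For a 1920 x 1080 image with 32 x 32 blocks this gives 60 x 34 blocks, whereas the truncating 1080 / 32 gives only 33 and leaves the last 24 rows unprocessed. Because the rounded-up grid can extend past the image edge, the kernel then needs a guard such as `if (row < height && col < width)` before touching memory.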
First of all, be sure that the image file is read correctly.
Check that the device memory is allocated, with CUDA_SAFE_CALL(cudaMalloc(..)).
Check the dimensions of the image. If the dimensions of the image are not multiples of BLOCK_SIZE, then you might be missing some indices and the image is not fully inverted.
Call cudaDeviceSynchronize after the kernel call and check its return value.
Do you get any error when you run the code without calling the kernel at all?
You are not freeing h_pixels and might have a memory leak.
Instead of using BLOCK_SIZE in the kernel you might use "blockDim.x", calculating indices like "blockIdx.x * blockDim.x + threadIdx.x".
Try not to touch the memory in the kernel code: comment out the memory updates in the kernel (the lines where you access the pixels array) and check whether the program still fails. If it no longer fails, you might be accessing memory out of bounds.
Use this command immediately after the kernel invocation to print the kernel errors:
printf("error code: %s\n",cudaGetErrorString(cudaGetLastError()));
