CUDA memory allocation - image-processing

I am working with a Jetson TX2. I capture images from a camera as an unsigned char *image.
Then I need to do some image processing on the GPU. With the Jetson TX2, we can avoid host/device and device/host transfers because the RAM is shared between the GPU and the CPU. For that, I use:
int height = 6004;
int width = 7920;
int NumElement = height * width;
unsigned char *img1;
cudaMallocManaged(&img1, NumElement * sizeof(unsigned char));
With this method there is no PCIe bottleneck. My problem is how to assign the image from the buffer to img1.
This works, but it is too slow:
for (int i = 0; i < NumElement; i++)
    img1[i] = buffer[i];
I lose the advantage of the GPU with this naive for loop. And if I just do:
img1 = buffer;
then I have a problem when I enter the kernel.

Use cudaMemcpy with cudaMemcpyDefault, something like
cudaMemcpy(&img1[0], &buffer[0], NumElement * sizeof(unsigned char), cudaMemcpyDefault);
You could also potentially use memcpy.
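For example, a rough sketch (assuming buffer is the ordinary host pointer holding the captured frame and img1 is the managed allocation from above; error checking omitted):
// copy the captured frame into the managed allocation once per frame
cudaMemcpy(img1, buffer, NumElement * sizeof(unsigned char), cudaMemcpyDefault);

// or, since managed memory is also CPU-accessible, an ordinary memcpy works too:
memcpy(img1, buffer, NumElement * sizeof(unsigned char));
Either way, img1 can then be passed directly to the kernel.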

Related

Increasing the number of threads has no effect on runtime

I have tried to implement an alpha image blending algorithm in CUDA C. There are no errors in my code; it compiles fine. According to my thread logic, if I run the code with an increased number of threads, the runtime should decrease. Instead I get a weird pattern of runtimes: with 1 thread the runtime was 8.060539e-01 sec, with 4 threads it was 7.579031e-01 sec, with 8 threads it was 7.810102e-01 sec, and with 256 threads it was 7.875319e-01 sec.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include "timer.h"
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
__global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){
    int col = threadIdx.x + blockIdx.x*blockDim.x;
    int row = threadIdx.y + blockIdx.y*blockDim.y;
    if(col<width && row<height){
        size_t img_size = width * height * channels;
        if (Pout != NULL)
        {
            for (size_t i = 0; i < img_size; i++)
            {
                Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
            }
        }
    }
}
int main(int argc, char* argv[]){
    int thread_count;
    double start, finish;
    float alpha;
    int width, height, channels;
    unsigned char *new_img;
    thread_count = strtol(argv[1], NULL, 10);
    printf("Enter the value for alpha:");
    scanf("%f", &alpha);
    unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
    unsigned char *orange = stbi_load("orange.jpg", &width, &height, &channels, 0);
    size_t img_size = width * height * channels;
    //unsigned char *new_img = malloc(img_size);
    cudaMallocManaged(&new_img, img_size*sizeof(unsigned char));
    cudaMallocManaged(&apple, img_size*sizeof(unsigned char));
    cudaMallocManaged(&orange, img_size*sizeof(unsigned char));
    GET_TIME(start);
    image_blend<<<1,16,thread_count>>>(new_img, apple, orange, width, height, channels, alpha);
    cudaDeviceSynchronize();
    GET_TIME(finish);
    stbi_write_jpg("new_image.jpg", width, height, channels, new_img, 100);
    cudaFree(new_img);
    cudaFree(apple);
    cudaFree(orange);
    printf("\n Elapsed time for cuda = %e seconds\n", finish-start);
}
After getting this weird pattern in the runtimes, I am a bit skeptical about my implementation. Can anyone tell me why I get these runtimes even though my code has no apparent bugs?
Let's start here:
image_blend<<<1,16,thread_count>>>(new_img,apple, orange, width, height, channels,alpha);
It seems evident you don't understand the kernel launch syntax:
<<<1,16,thread_count>>>
The first number (1) is the number of blocks to launch.
The second number (16) is the number of threads per block.
The third number (thread_count) is the size of the dynamically allocated shared memory in bytes.
So our first observation will be that although you claimed to have changed the thread count, you didn't. You were changing the number of bytes of dynamically allocated shared memory. Since your kernel code doesn't use shared memory, this is a completely meaningless variable.
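For reference, a sketch of a launch that would actually vary the threads per block (this only fixes the launch syntax, not the kernel logic discussed next; thread_count is assumed to be at most 1024, the per-block limit):
// second parameter = threads per block; the third parameter (dynamic shared memory bytes) is simply omitted
image_blend<<<1, thread_count>>>(new_img, apple, orange, width, height, channels, alpha);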
Let's also observe your kernel code:
for (size_t i = 0; i < img_size; i++)
{
    Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
}
For every thread that passes your if test, each one of those threads will execute the entire for-loop and will process the entire image. That is not the general idea when writing CUDA kernels. The general idea is to break up the work so that each thread does a portion of the work, not the whole activity.
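As an illustration only (a sketch assuming a 2D launch covering width x height, interleaved channels, and a hypothetical kernel name), each thread could handle just its own pixel:
__global__ void image_blend_per_pixel(unsigned char *Pout, const unsigned char *pin1, const unsigned char *pin2,
                                      int width, int height, int channels, float alpha){
    int col = threadIdx.x + blockIdx.x*blockDim.x;
    int row = threadIdx.y + blockIdx.y*blockDim.y;
    if (col < width && row < height){
        // this thread blends only the channels of pixel (col, row)
        for (int c = 0; c < channels; c++){
            size_t i = ((size_t)row*width + col)*channels + c;
            Pout[i] = (1.0f - alpha)*pin1[i] + alpha*pin2[i];
        }
    }
}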
These are very basic observations. If you take advantage of an orderly introduction to CUDA, such as here, you can get beyond some of these basic concepts.
We could also point out that your kernel nominally expects a 2D launch, and you are not providing one, and perhaps many other observations. Another important concept that you are missing is that you cannot do this:
unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
...
cudaMallocManaged(&apple,img_size* sizeof(unsigned char));
and expect anything sensible to come from that. If you want to see how data is moved from a host allocation to the device, study nearly any CUDA sample code, such as vectorAdd. Using a managed allocation doesn't allow you to overwrite the pointer like you are doing and get anything useful from that.
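A sketch of the pattern that does work, keeping separate host and managed pointers (the names here are illustrative):
unsigned char *apple_host = stbi_load("apple.jpg", &width, &height, &channels, 0);
size_t img_size = (size_t)width * height * channels;

unsigned char *apple_managed;
cudaMallocManaged(&apple_managed, img_size);   // allocate managed memory
memcpy(apple_managed, apple_host, img_size);   // copy the loaded pixels into it
stbi_image_free(apple_host);                   // the managed copy is what the kernel uses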
I'll provide an example of how one might go about doing what I think you are intending, without providing a complete tutorial on CUDA. To keep the example short, I'm going to skip the STB image loading routines; the actual image content does not matter for the work being timed here.
Here's an example of an image processing kernel (1D) that will:
Process the entire image, only once
Use less time, roughly speaking, as you increase the thread count.
You haven't provided your timer routine/code, so I'll provide my own:
$ cat t2130.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start=0){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

unsigned char *i_load(int w, int h, int c, int init){
    unsigned char *res = new unsigned char[w*h*c];
    for (int i = 0; i < w*h*c; i++) res[i] = init;
    return res;
}

__global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){
    if (Pout != NULL)
    {
        size_t img_size = width * height * channels;
        for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < img_size; i+=gridDim.x*blockDim.x) // grid-stride loop
        {
            Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
        }
    }
}

int main(int argc, char* argv[]){
    int threads_per_block = 64;
    unsigned long long dt;
    float alpha;
    int width = 1920;
    int height = 1080;
    int channels = 3;
    size_t img_size = width * height * channels;
    int thread_count = img_size;
    if (argc > 1) thread_count = atoi(argv[1]);
    unsigned char *new_img, *m_apple, *m_orange;
    printf("Enter the value for alpha:");
    scanf("%f", &alpha);
    unsigned char *apple = i_load(width, height, channels, 10);
    unsigned char *orange = i_load(width, height, channels, 70);
    //unsigned char *new_img = malloc(img_size);
    cudaMallocManaged(&new_img, img_size*sizeof(unsigned char));
    cudaMallocManaged(&m_apple, img_size*sizeof(unsigned char));
    cudaMallocManaged(&m_orange, img_size*sizeof(unsigned char));
    memcpy(m_apple, apple, img_size);
    memcpy(m_orange, orange, img_size);
    int blocks;
    if (thread_count < threads_per_block) {threads_per_block = thread_count; blocks = 1;}
    else {blocks = thread_count/threads_per_block;}
    printf("running with %d blocks of %d threads\n", blocks, threads_per_block);
    dt = dtime_usec(0);
    image_blend<<<blocks, threads_per_block>>>(new_img, m_apple, m_orange, width, height, channels, alpha);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("CUDA Error: %s\n", cudaGetErrorString(err));
    else printf("\n Elapsed time for cuda = %e seconds\n", dt/(float)USECPSEC);
    cudaFree(new_img);
    cudaFree(m_apple);
    cudaFree(m_orange);
}
$ nvcc -o t2130 t2130.cu
$ ./t2130 1
Enter the value for alpha:0.2
running with 1 blocks of 1 threads
Elapsed time for cuda = 5.737880e-01 seconds
$ ./t2130 2
Enter the value for alpha:0.2
running with 1 blocks of 2 threads
Elapsed time for cuda = 3.230150e-01 seconds
$ ./t2130 32
Enter the value for alpha:0.2
running with 1 blocks of 32 threads
Elapsed time for cuda = 4.865200e-02 seconds
$ ./t2130 64
Enter the value for alpha:0.2
running with 1 blocks of 64 threads
Elapsed time for cuda = 2.623300e-02 seconds
$ ./t2130 128
Enter the value for alpha:0.2
running with 2 blocks of 64 threads
Elapsed time for cuda = 1.546000e-02 seconds
$ ./t2130
Enter the value for alpha:0.2
running with 97200 blocks of 64 threads
Elapsed time for cuda = 5.809000e-03 seconds
$
(CentOS 7, CUDA 11.4, V100)
The key methodology that allows the kernel to do all the work (only once) while making use of an "arbitrary" number of threads efficiently is the grid-stride loop.

Global device memory size limit when using statically allocated memory in CUDA

I thought the maximum size of global memory should be limited only by the GPU device, whether the memory is allocated statically using __device__ __managed__ or dynamically using cudaMalloc.
But I found that when using the __device__ __managed__ approach, the maximum array size I can declare is much smaller than the GPU device limit.
The minimal working example is as follows:
#include <stdio.h>
#include <cuda_runtime.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

#define MX 64
#define MY 64
#define MZ 64
#define NX 64
#define NY 64
#define M (MX * MY * MZ)

__device__ __managed__ float A[NY][NX][M];
__device__ __managed__ float B[NY][NX][M];

__global__ void swapAB()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for(int j = 0; j < NY; j++)
        for(int i = 0; i < NX; i++)
            A[j][i][tid] = B[j][i][tid];
}

int main()
{
    swapAB<<<M/256,256>>>();
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    return 0;
}
It uses 64^5 * 2 * 4 / 2^30 GB = 8 GB of global memory, and I compile and run it on an NVIDIA Tesla K40c GPU, which has 12GB of global memory.
Compiler cmd:
nvcc test.cu -gencode arch=compute_30,code=sm_30
Output warning:
warning: overflow in implicit constant conversion.
When I ran the generated executable, I got an error:
GPUassert: an illegal memory access was encountered test.cu
Surprisingly, if I instead dynamically allocate global memory of the same size (8GB) via the cudaMalloc API, there is no compile warning or runtime error.
I'm wondering if there is any special limitation on the allocatable size of static global device memory in CUDA.
Thanks!
PS: OS and CUDA: CentOS 6.5 x64, CUDA-7.5.
This would appear to be a limitation of the CUDA runtime API. The root cause is this function (in CUDA 7.5):
__cudaRegisterVar(
    void **fatCubinHandle,
    char *hostVar,
    char *deviceAddress,
    const char *deviceName,
    int ext,
    int size,
    int constant,
    int global
);
which only accepts a signed int for the size of any statically declared device variable. This would limit the maximum size to 2^31 (2147483648) bytes. The warning you see is because the CUDA front end is emitting boilerplate code containing calls to __cudaRegisterVar like this:
__cudaRegisterManagedVariable(__T26, __shadow_var(A,::A), 0, 4294967296, 0, 0);
__cudaRegisterManagedVariable(__T26, __shadow_var(B,::B), 0, 4294967296, 0, 0);
It is the 4294967296 which is the source of the problem. That size will overflow the signed integer and cause the API call to blow up. So it seems you are limited to 2GB per static variable for the moment. I would recommend raising this as a bug with NVIDIA if it is a serious problem for your application.
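As a workaround sketch (consistent with the observation above that dynamic allocation of the same total size works), the two 4GB arrays could be allocated at runtime instead of declared statically; the flattened indexing shown is illustrative:
float *A, *B;
size_t elems = (size_t)NY * NX * M;                    // 2^30 elements, i.e. 4GB of float per array
gpuErrchk( cudaMallocManaged(&A, elems * sizeof(float)) );
gpuErrchk( cudaMallocManaged(&B, elems * sizeof(float)) );
// pass A and B to the kernel as arguments and index manually,
// e.g. A[((size_t)j*NX + i)*M + tid] instead of A[j][i][tid]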

Why can't the same OpenCL code be run on IMX.6?

We're trying to use OpenCL for some image processing on IMX.6.
We used already-tested OpenCL code. In the kernel.cl file, the only OpenCL-specific part is
int i = get_global_id(0);
int j = get_global_id(1);
Everything else is plain C rather than OpenCL, and the code runs well on the PC.
However, when we test the code on IMX.6, every status code reports success, but we do not get the correct result.
The buffer read/write with clEnqueueReadBuffer works fine (we checked the uploaded image), but the kernel launched with clEnqueueNDRangeKernel produces no result.
Does anyone know why?
By the way, this is the 2000th question tagged opencl :)
Here is the whole code:
__kernel void IPM(__global const unsigned char* image_ROI_data, __global unsigned char* IPM_data, __global float* parameter_IPM)
{
    float camera_col = parameter_IPM[1];
    float camera_row = parameter_IPM[0];
    float camera_height = parameter_IPM[2];
    float camera_alpha = parameter_IPM[3];
    float camera_theta = parameter_IPM[4];
    float image_vp = parameter_IPM[5];
    float IPM_width = parameter_IPM[6];
    float IPM_height = parameter_IPM[7];
    int IPM_lineByte = (((int)IPM_width+3)/4)*4;
    int image_lineByte = (((int)camera_col+3)/4)*4;
    int i = get_global_id(0);
    int j = get_global_id(1);
    *(IPM_data+((int)IPM_height-j)*IPM_lineByte+i) = 0;
    float multiple = (float)(IPM_width/20);
    // Real x and Real y (they are both meters)
    float x = (float)(i-IPM_width/2)/multiple;
    float y = (float)(j)/multiple;
    // The coordinator in capture image.
    float u = (camera_row-1)*(atan(camera_height/sqrt(x*x+y*y))+camera_alpha-camera_theta)/(2*camera_alpha);
    float v = (camera_col-1)*(atan(x/y)+camera_alpha)/(2*camera_alpha);
    // If the point was in capture image, choose its pixel and fill the image.
    // As it is only a ROI so it is u-image_vp
    if (((int)u-(int)image_vp)>0 && (int)u<(int)camera_row && v>0 && v<camera_col)
    {
        *(IPM_data+((int)IPM_height-j)*IPM_lineByte+i)=
            *(image_ROI_data+((int)u-(int)image_vp)*image_lineByte+(int)v);
    }
}
int i = get_global_id(0);         // starts from zero
int j = get_global_id(1);         // this too
float x = (float)(i-IPM_width/2); // maybe zero, maybe not
float y = (float)(j)/multiple;    // becomes zero when j == 0
float v = (camera_col-1)*(atan(x/y)+camera_alpha)/(2*camera_alpha);
                               ^
                               |
                        division by zero
This becomes NaN or INF, and the rest follows.
Then you get a wrong result originating from this.
Especially when you use it in pointer arithmetic:
*(IPM_data+((int)IPM_height-j)*IPM_lineByte+i)=
    *(image_ROI_data+((int)u-(int)image_vp)*image_lineByte+(int)v);
                       ^
                       |
        game over (out-of-bounds access) if the "if" body is entered
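One possible guard, sketched only (it skips the work-items where y would be zero; what the algorithm should actually do for that row is up to you):
int i = get_global_id(0);
int j = get_global_id(1);
float y = (float)(j)/multiple;
if (y == 0.0f) {
    // atan(x/y) would divide by zero for this row; leave the output pixel at 0 and bail out
    *(IPM_data+((int)IPM_height-j)*IPM_lineByte+i) = 0;
    return;
}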
Your device supports only the OpenCL embedded profile, which is a subset of the full profile supported by your PC. Generally, you need to refactor your code to make it embedded-profile compatible.

Using loaded .raw image data as an IDirect3DTexture9 texture in DirectX9?

I'm trying to use a simple .raw loader as an easy way to load images into a program to be used as textures by DirectX9.
The problem is that the D3DX functions are not available to me at all, nor can I find them anywhere. I have written my own matrix routines fine, but can't use the D3DX texture file functions without some pointers.
I've done my homework, so I'm thinking what I need is the CreateTexture function plus some code to marry my unsigned char image with the IDirect3DTexture9 *DXTexture.
IDirect3DTexture9 *DXTexture;
unsigned char *texture;   // raw pixel data filled in by my loader
loadRawImage(&texture, "tex", 128, 128);
g_pD3DDevice->CreateTexture(128, 128, 0, D3DUSAGE_DYNAMIC, D3DFMT_A8R8G8B8,
                            D3DPOOL_DEFAULT, &DXTexture, NULL);
// code required here to marry my unsigned char image with DXTexture
g_pD3DDevice->SetTexture(0, DXTexture);
I've seen this page, which looks sort of like what I need:
http://www.gamedev.net/topic/567044-problem-loading-image-data-into-idirect3dtexture9/
IDirect3DTexture9* tempTexture = 0;
HRESULT hr = device->CreateTexture(this->width, this->height, 0, D3DUSAGE_DYNAMIC,
                                   D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT, &tempTexture, 0);
// assignment pointer
D3DCOLOR *Ptr;
unsigned char *tempPtr = 0; // increment pointer
int count = 0;              // index into color data
// lock texture and get ptr
D3DLOCKED_RECT rect;
hr = tempTexture->LockRect(0, &rect, 0, D3DLOCK_DISCARD);
tempPtr = (unsigned char*)rect.pBits; // assign to unsigned char pointer to make
                                      // pointer arithmetic smooth
for(unsigned int i = 0; i < this->height; i++)
{
    Ptr = (D3DCOLOR*)(tempPtr + i * rect.Pitch); // start of row i in the locked surface
    for(unsigned int j = 0; j < this->width; j++)
    {
        // read the colour components in a defined order before combining them
        unsigned char r = this->imageData[count++];
        unsigned char g = this->imageData[count++];
        unsigned char b = this->imageData[count++];
        Ptr[j] = D3DCOLOR_XRGB(r, g, b);
    }
}
tempTexture->UnlockRect(0);
Any pointers would be appreciated. This is for a small demo so code is being kept down to a minimum.
EDIT to respond to drop
Basically my question is: how can I use the loaded .raw image data as a DirectX9 texture? I know there must be some internal byte format in which IDirect3DTexture9 textures are arranged; I just need some pointers on how to convert my data to this format. This is without using D3DX functions.
Try an approach like the one below:
D3DLOCKED_RECT rect;
ppTexture->LockRect(0, &rect, 0, D3DLOCK_DISCARD);
unsigned char* dest = static_cast<unsigned char*>(rect.pBits);
// Note: a single memcpy like this assumes rect.Pitch == biWidth * 4;
// otherwise the copy must be done row by row (see the sketch below).
memcpy(dest, &pBitmapData[0], sizeof(unsigned char) * biWidth * biHeight * 4);
ppTexture->UnlockRect(0);
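A row-by-row variant, as a sketch (assuming the raw data is already 4 bytes per pixel in the D3DFMT_A8R8G8B8 layout; pBitmapData, biWidth and biHeight as above):
D3DLOCKED_RECT rect;
ppTexture->LockRect(0, &rect, 0, D3DLOCK_DISCARD);
unsigned char* dest = static_cast<unsigned char*>(rect.pBits);
const unsigned char* src = &pBitmapData[0];
for (int y = 0; y < biHeight; ++y)
{
    // copy one row of pixels, stepping the destination by the surface pitch
    memcpy(dest + y * rect.Pitch, src + y * biWidth * 4, biWidth * 4);
}
ppTexture->UnlockRect(0);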

CUDA shared array not getting values?

I am trying to implement a simple parallel reduction. I am using code from the CUDA SDK. But somehow there is a problem in my kernel: the shared array is not getting the values of the global array, and it's all zeroes.
extern __shared__ float4 sdata[];

// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = dev_src[i];
__syncthreads();

// do reduction in shared mem
for(unsigned int s=1; s < blockDim.x; s *= 2) {
    if(tid % (2*s) == 0){
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

// write result for this block to global mem
if(tid == 0)
    out[blockIdx.x] = sdata[0];
EDIT:
OK, I got it working by removing the extern keyword and giving the shared array a constant size like 512. I am in good shape now. Maybe someone can explain why it was not working with the extern keyword.
I think I know why this is happening, as I have faced it before. How are you calling the kernel?
Remember that in the call kernel<<<blocks,threads,sharedMemory>>>, sharedMemory is the size of the dynamically allocated shared memory in bytes. So, if you are declaring 512 float4 elements, the third parameter should be 512 * sizeof(float4). I think you are calling it as below, which is wrong:
kernel<<<blocks,threads,512>>> // this is wrong
Hope that helps
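For completeness, a sketch of a corrected launch that keeps the extern declaration (512 stands for the threads per block; the kernel and buffer names are placeholders):
// the third launch parameter is the dynamic shared memory size in bytes, not in elements
size_t sharedBytes = 512 * sizeof(float4);
reduce_kernel<<<blocks, 512, sharedBytes>>>(dev_src, out);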
