Global device memory size limit when using statically allocated memory in CUDA

I thought the maximum size of global memory should be limited only by the GPU device, no matter whether it is allocated statically using __device__ __managed__ or dynamically using cudaMalloc.
But I found that when using the __device__ __managed__ way, the maximum array size I can declare is much smaller than the GPU device limit.
The minimal working example is as follows:
#include <stdio.h>
#include <cuda_runtime.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
#define MX 64
#define MY 64
#define MZ 64
#define NX 64
#define NY 64
#define M (MX * MY * MZ)
__device__ __managed__ float A[NY][NX][M];
__device__ __managed__ float B[NY][NX][M];
__global__ void swapAB()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            A[j][i][tid] = B[j][i][tid];
}
int main()
{
    swapAB<<<M/256,256>>>();
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    return 0;
}
It uses 64^5 * 2 * 4 / 2^30 = 8 GB of global memory, and I compile and run it on an NVIDIA Tesla K40c GPU, which has 12 GB of global memory.
Compiler cmd:
nvcc test.cu -gencode arch=compute_30,code=sm_30
Output warning:
warning: overflow in implicit constant conversion.
When I run the generated executable, an error is reported:
GPUassert: an illegal memory access was encountered test.cu
Surprisingly, if I use dynamically allocated global memory of the same size (8 GB) via the cudaMalloc API instead, there is no compile warning or runtime error.
I'm wondering if there are any special limitations on the allocatable size of static global device memory in CUDA.
Thanks!
PS: OS and CUDA: CentOS 6.5 x64, CUDA-7.5.

This would appear to be a limitation of the CUDA runtime API. The root cause is this function (in CUDA 7.5):
__cudaRegisterVar(
    void **fatCubinHandle,
    char *hostVar,
    char *deviceAddress,
    const char *deviceName,
    int ext,
    int size,
    int constant,
    int global
);
which only accepts a signed int for the size of any statically declared device variable. This limits the maximum size to 2^31 (2147483648) bytes. The warning you see is because the CUDA front end is emitting boilerplate code containing registration calls like this:
__cudaRegisterManagedVariable(__T26, __shadow_var(A,::A), 0, 4294967296, 0, 0);
__cudaRegisterManagedVariable(__T26, __shadow_var(B,::B), 0, 4294967296, 0, 0);
It is the 4294967296 which is the source of the problem. That size overflows the signed integer and causes the API call to blow up. So it seems you are limited to 2GB per static variable for the moment. I would recommend raising this as a bug with NVIDIA if it is a serious problem for your application.
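In the meantime, a possible workaround is to allocate the arrays dynamically, since cudaMallocManaged takes a size_t size and is not routed through this registration path. A minimal sketch using the question's dimensions (the manual indexing is my assumption about how you would then access the data):
// Sketch: dynamically allocated replacements for the static A and B.
float *A, *B;
size_t bytes = (size_t)NY * NX * M * sizeof(float); // 4 GB per array; needs 64-bit math
gpuErrchk( cudaMallocManaged(&A, bytes) );
gpuErrchk( cudaMallocManaged(&B, bytes) );
// In the kernel, A[j][i][tid] becomes A[((size_t)j * NX + i) * M + tid].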

Related

Increasing the number of threads has no effect on runtime

I have tried to implement an alpha image blending algorithm in CUDA C. There are no errors in my code and it compiles fine. Per the threading logic, if I run the code with an increased number of threads, the runtime should decrease. Instead I got a weird pattern of runtimes: with 1 thread the runtime was 8.060539e-01 sec, with 4 threads it was 7.579031e-01 sec, with 8 threads it was 7.810102e-01 sec, and with 256 threads it was 7.875319e-01 sec.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include "timer.h"
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
__global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){
    int col = threadIdx.x + blockIdx.x*blockDim.x;
    int row = threadIdx.y + blockIdx.y*blockDim.y;
    if (col < width && row < height) {
        size_t img_size = width * height * channels;
        if (Pout != NULL)
        {
            for (size_t i = 0; i < img_size; i++)
            {
                Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
            }
        }
    }
}
int main(int argc, char* argv[]){
    int thread_count;
    double start, finish;
    float alpha;
    int width, height, channels;
    unsigned char *new_img;
    thread_count = strtol(argv[1], NULL, 10);
    printf("Enter the value for alpha:");
    scanf("%f", &alpha);
    unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
    unsigned char *orange = stbi_load("orange.jpg", &width, &height, &channels, 0);
    size_t img_size = width * height * channels;
    //unsigned char *new_img = malloc(img_size);
    cudaMallocManaged(&new_img, img_size*sizeof(unsigned char));
    cudaMallocManaged(&apple, img_size*sizeof(unsigned char));
    cudaMallocManaged(&orange, img_size*sizeof(unsigned char));
    GET_TIME(start);
    image_blend<<<1,16,thread_count>>>(new_img,apple, orange, width, height, channels,alpha);
    cudaDeviceSynchronize();
    GET_TIME(finish);
    stbi_write_jpg("new_image.jpg", width, height, channels, new_img, 100);
    cudaFree(new_img);
    cudaFree(apple);
    cudaFree(orange);
    printf("\n Elapsed time for cuda = %e seconds\n", finish-start);
}
After getting this weird pattern in the runtimes, I am a bit skeptical about my implementation. Can anyone explain why I get these runtimes even though my code compiles and runs without errors?
Let's start here:
image_blend<<<1,16,thread_count>>>(new_img,apple, orange, width, height, channels,alpha);
It seems evident you don't understand the kernel launch syntax:
<<<1,16,thread_count>>>
The first number (1) is the number of blocks to launch.
The second number (16) is the number of threads per block.
The third number (thread_count) is the size of the dynamically allocated shared memory in bytes.
So our first observation will be that although you claimed to have changed the thread count, you didn't. You were changing the number of bytes of dynamically allocated shared memory. Since your kernel code doesn't use shared memory, this is a completely meaningless variable.
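To make the three launch parameters concrete, here is a minimal, self-contained sketch (the demo kernel is illustrative, not from your code):
#include <cstdio>
// Illustrative kernel: prints its coordinates so the launch shape is visible.
__global__ void demo() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}
int main() {
    demo<<<2, 4>>>();        // 2 blocks of 4 threads, no dynamic shared memory
    demo<<<2, 4, 256>>>();   // same shape, plus 256 bytes of dynamic shared memory
    cudaDeviceSynchronize(); // flush device-side printf output
    return 0;
}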
Let's also observe your kernel code:
for (size_t i = 0; i < img_size; i++)
{
    Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
}
For every thread that passes your if test, each one of those threads will execute the entire for-loop and will process the entire image. That is not the general idea with writing CUDA kernels. The general idea is to break up the work so that each thread does a portion of the work, not the whole activity.
These are very basic observations. If you take advantage of an orderly introduction to CUDA, such as here, you can get beyond some of these basic concepts.
We could also point out that your kernel nominally expects a 2D launch, and you are not providing one, and perhaps many other observations. Another important concept that you are missing is that you cannot do this:
unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
...
cudaMallocManaged(&apple,img_size* sizeof(unsigned char));
and expect anything sensible to come from that. If you want to see how data is moved from a host allocation to the device, study nearly any CUDA sample code, such as vectorAdd. Using a managed allocation doesn't allow you to overwrite the pointer like you are doing and get anything useful from that.
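The essential pattern looks like this (a sketch with illustrative names; the complete example below uses the same idea):
// Keep the pointer returned by the host-side loader, allocate a separate
// managed buffer, and copy into it; do not overwrite the loader's pointer.
unsigned char *h_apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
size_t img_size = (size_t)width * height * channels;
unsigned char *m_apple;
cudaMallocManaged(&m_apple, img_size);
memcpy(m_apple, h_apple, img_size); // managed memory is host-accessible
stbi_image_free(h_apple);           // original host buffer no longer needed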
I'll provide an example of how one might go about doing what I think you are suggesting, without providing a complete tutorial on CUDA. To provide an example, I'm going to skip the STB image loading routines. To understand the work you are trying to do here, the actual image content does not matter.
Here's an example of an image processing kernel (1D) that will:
Process the entire image, only once
Use less time, roughly speaking, as you increase the thread count.
You haven't provided your timer routine/code, so I'll provide my own:
$ cat t2130.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start=0){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
unsigned char *i_load(int w, int h, int c, int init){
    unsigned char *res = new unsigned char[w*h*c];
    for (int i = 0; i < w*h*c; i++) res[i] = init;
    return res;
}
__global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){
    if (Pout != NULL)
    {
        size_t img_size = width * height * channels;
        for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < img_size; i += gridDim.x*blockDim.x) // grid-stride loop
        {
            Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
        }
    }
}
int main(int argc, char* argv[]){
    int threads_per_block = 64;
    unsigned long long dt;
    float alpha;
    int width = 1920;
    int height = 1080;
    int channels = 3;
    size_t img_size = width * height * channels;
    int thread_count = img_size;
    if (argc > 1) thread_count = atoi(argv[1]);
    unsigned char *new_img, *m_apple, *m_orange;
    printf("Enter the value for alpha:");
    scanf("%f", &alpha);
    unsigned char *apple = i_load(width, height, channels, 10);
    unsigned char *orange = i_load(width, height, channels, 70);
    //unsigned char *new_img = malloc(img_size);
    cudaMallocManaged(&new_img, img_size*sizeof(unsigned char));
    cudaMallocManaged(&m_apple, img_size*sizeof(unsigned char));
    cudaMallocManaged(&m_orange, img_size*sizeof(unsigned char));
    memcpy(m_apple, apple, img_size);
    memcpy(m_orange, orange, img_size);
    int blocks;
    if (thread_count < threads_per_block) {threads_per_block = thread_count; blocks = 1;}
    else {blocks = thread_count/threads_per_block;}
    printf("running with %d blocks of %d threads\n", blocks, threads_per_block);
    dt = dtime_usec(0);
    image_blend<<<blocks, threads_per_block>>>(new_img, m_apple, m_orange, width, height, channels, alpha);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("CUDA Error: %s\n", cudaGetErrorString(err));
    else printf("\n Elapsed time for cuda = %e seconds\n", dt/(float)USECPSEC);
    cudaFree(new_img);
    cudaFree(m_apple);
    cudaFree(m_orange);
}
$ nvcc -o t2130 t2130.cu
$ ./t2130 1
Enter the value for alpha:0.2
running with 1 blocks of 1 threads
Elapsed time for cuda = 5.737880e-01 seconds
$ ./t2130 2
Enter the value for alpha:0.2
running with 1 blocks of 2 threads
Elapsed time for cuda = 3.230150e-01 seconds
$ ./t2130 32
Enter the value for alpha:0.2
running with 1 blocks of 32 threads
Elapsed time for cuda = 4.865200e-02 seconds
$ ./t2130 64
Enter the value for alpha:0.2
running with 1 blocks of 64 threads
Elapsed time for cuda = 2.623300e-02 seconds
$ ./t2130 128
Enter the value for alpha:0.2
running with 2 blocks of 64 threads
Elapsed time for cuda = 1.546000e-02 seconds
$ ./t2130
Enter the value for alpha:0.2
running with 97200 blocks of 64 threads
Elapsed time for cuda = 5.809000e-03 seconds
$
(CentOS 7, CUDA 11.4, V100)
The key methodology that allows the kernel to do all the work (only once) while making use of an "arbitrary" number of threads efficiently is the grid-stride loop.

Cuda problems using shared buffer for simulated memory allocation [duplicate]

My code is giving an error message and I am trying to track down the cause of it. To make it easier to find the problem, I have stripped away code that apparently is not relevant to causing the error message. If you can tell me why the following simple code produces an error message, then I think I should be able to fix my original code:
#include "cuComplex.h"
#include <cutil.h>
__device__ void compute_energy(void *data, int isample, int nsamples) {
    cuDoubleComplex * const nminusarray = (cuDoubleComplex*)data;
    cuDoubleComplex * const f = (cuDoubleComplex*)(nminusarray+101);
    double * const abs_est_errorrow_all = (double*)(f+3);
    double * const rel_est_errorrow_all = (double*)(abs_est_errorrow_all+nsamples*51);
    int * const iid_all = (int*)(rel_est_errorrow_all+nsamples*51);
    int * const iiu_all = (int*)(iid_all+nsamples*21);
    int * const piv_all = (int*)(iiu_all+nsamples*21);
    cuDoubleComplex * const energyrow_all = (cuDoubleComplex*)(piv_all+nsamples*12);
    cuDoubleComplex * const refinedenergyrow_all = (cuDoubleComplex*)(energyrow_all+nsamples*51);
    cuDoubleComplex * const btplus_all = (cuDoubleComplex*)(refinedenergyrow_all+nsamples*51);
    cuDoubleComplex * const btplus = btplus_all+isample*21021;
    btplus[0] = make_cuDoubleComplex(0.0, 0.0);
}
__global__ void computeLamHeight(void *data, int nlambda) {
    compute_energy(data, blockIdx.x, nlambda);
}
int main(int argc, char *argv[]) {
    void *device_data;
    CUT_DEVICE_INIT(argc, argv);
    CUDA_SAFE_CALL(cudaMalloc(&device_data, 184465640));
    computeLamHeight<<<dim3(101, 1, 1), dim3(512, 1, 1), 45000>>>(device_data, 101);
    CUDA_SAFE_CALL(cudaThreadSynchronize());
}
I am using a GeForce GTX 480 and I am compiling the code like so:
nvcc -L /soft/cuda-sdk/4.0.17/C/lib -I /soft/cuda-sdk/4.0.17/C/common/inc -lcutil_x86_64 -arch sm_13 -O3 -Xopencc "-Wall" Main.cu
The output is:
Using device 0: GeForce GTX 480
Cuda error in file 'Main.cu' in line 31 : unspecified launch failure.
EDIT: I have now further simplified the code. The following simpler code still produces the error message:
#include <cutil.h>
__global__ void compute_energy(void *data) {
    *(double*)((int*)data+101) = 0.0;
}
int main(int argc, char *argv[]) {
    void *device_data;
    CUT_DEVICE_INIT(argc, argv);
    CUDA_SAFE_CALL(cudaMalloc(&device_data, 101*sizeof(int)+sizeof(double)));
    compute_energy<<<dim3(1, 1, 1), dim3(1, 1, 1)>>>(device_data);
    CUDA_SAFE_CALL(cudaThreadSynchronize());
}
Now it is easy to see that the offset should be valid. I tried running cuda-memcheck and it says the following:
========= CUDA-MEMCHECK
Using device 0: GeForce GTX 480
Cuda error in file 'Main.cu' in line 13 : unspecified launch failure.
========= Invalid __global__ write of size 8
========= at 0x00000020 in compute_energy
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x200200194 is misaligned
=========
========= ERROR SUMMARY: 1 error
I tried searching the internet to find what is meant by the address being misaligned, but I failed to find an explanation. What is the deal?
It was very hard to parse your original code with all of those magic constants, but your updated repro case makes the problem immediately obvious. The GPU architecture requires all accesses to be naturally aligned to the size of the type being accessed. Your kernel contains a pointer access which is not correctly aligned: doubles are a 64-bit type, and your addressing is not aligned to an even 64-bit boundary. This:
*(double*)((int*)data+100) = 0.0; // 50th double
or this:
*(double*)((int*)data+102) = 0.0; // 51st double
are both legal. This:
*(double*)((int*)data+101) = 0.0; // not aligned to a 64 bit boundary
is not.
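If you need to pack sections of different types into one raw buffer, one common fix (a sketch, not the only option) is to round each section's byte offset up to the alignment of its element type:
#include <stddef.h>
// Round a byte offset up to the next multiple of align (align must be a power of two).
__host__ __device__ size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) & ~(align - 1);
}
__global__ void compute_energy_fixed(void *data) {
    size_t off = 101 * sizeof(int);      // 404 bytes, not 8-byte aligned
    off = align_up(off, sizeof(double)); // rounds up to 408
    double *d = (double*)((char*)data + off);
    *d = 0.0;                            // aligned access, no fault
}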
The error indicates an out-of-bounds memory access; please check the offset value.

Cuda: Global memory broadcast to registers on Compute 5.0 architecture

I have the following code:
__global__ void someKernel(unsigned char * global_mem, unsigned int * start) {
    unsigned int size;
    size = *(unsigned int *)&global_mem[start[blockIdx.x]];
    //Do many things with size
}
Here, all of the threads in the same block read the same memory location from global memory and do many things with the value.
How fast/slow is the copy going to be? I have a block of 256 threads.
Will the global memory read be broadcast to the whole block, or only within a single warp (meaning I will have to do 256/32 = 8 reads from global memory)?
Would the following be better, where I first read from global memory into shared memory and then copy to a register (bearing in mind the warp divergence and synchronization overhead):
__global__ void someKernel(unsigned char * global_mem, unsigned int * start) {
    __shared__ int tmpsize;
    unsigned int size;
    int i = threadIdx.x;
    if (i == 0) {
        tmpsize = *(unsigned int *)&global_mem[start[blockIdx.x]];
    }
    __syncthreads();
    size = tmpsize;
    //Do many things with size
}
Would this be faster, considering that I have a huge grid with blocks of 256 threads, where each block reads a different start location?

How to solve CUDA Thrust library - for_each synchronization error?

I'm trying to modify a simple dynamic vector in CUDA using the Thrust library. But I'm getting a "launch_closure_by_value" error on the screen, indicating that the error is related to some synchronization process.
A simple 1D dynamic array modification is not possible due to this error.
My code segment which is causing the error is as follows.
From a .cpp file I call setIndexedGridInfo, which is defined in System.cu:
float* a= (float*)(malloc(8*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(8*sizeof(float)));
setIndexedGridInfo(a,b);
The code segment at System.cu:
void setIndexedGridInfo(float* a, float* b)
{
    thrust::device_ptr<float> d_oldData(a);
    thrust::device_ptr<float> d_newData(b);
    float c = 0.0;
    thrust::for_each(
        thrust::make_zip_iterator(thrust::make_tuple(d_oldData, d_newData)),
        thrust::make_zip_iterator(thrust::make_tuple(d_oldData+8, d_newData+8)),
        grid_functor(c));
}
grid_functor is defined in _kernel.cu
struct grid_functor
{
    float a;
    __host__ __device__
    grid_functor(float grid_Info) : a(grid_Info) {}
    template <typename Tuple>
    __device__
    void operator()(Tuple t)
    {
        volatile float data = thrust::get<0>(t);
        float pos = data + 0.1;
        thrust::get<1>(t) = pos;
    }
};
I also get these on the Output window (I use Visual Studio):
First-chance exception at 0x000007fefdc7cacd in Particles.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0029eb60..
First-chance exception at 0x000007fefdc7cacd in smokeParticles.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x0029ecf0..
Unhandled exception at 0x000007fefdc7cacd in Particles.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x0029ecf0..
What is causing the problem?
You are trying to use host memory pointers in functions expecting pointers in device memory. This code is the problem:
float* a= (float*)(malloc(8*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(8*sizeof(float)));
setIndexedGridInfo(a,b);
.....
thrust::device_ptr<float> d_oldData(a);
thrust::device_ptr<float> d_newData(b);
The thrust::device_ptr is intended for "wrapping" a device memory pointer allocated with the CUDA API so that thrust can use it. You are trying to treat a host pointer directly as a device pointer. That is illegal. You could modify your setIndexedGridInfo function like this:
void setIndexedGridInfo(float* a, float* b, const int n)
{
    thrust::device_vector<float> d_oldData(a, a+n);
    thrust::device_vector<float> d_newData(b, b+n);
    float c = 0.0;
    thrust::for_each(
        thrust::make_zip_iterator(thrust::make_tuple(d_oldData.begin(), d_newData.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(d_oldData.end(), d_newData.end())),
        grid_functor(c));
}
The device_vector constructor will allocate device memory and then copy the contents of your host memory to the device. That should fix the error you are seeing, although I am not sure what you are trying to do with the for_each iterator and whether the functor you have written is correct.
Edit:
Here is a complete, compilable, runnable version of your code:
#include <cstdlib>
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/copy.h>
struct grid_functor
{
    float a;
    __host__ __device__
    grid_functor(float grid_Info) : a(grid_Info) {}
    template <typename Tuple>
    __device__
    void operator()(Tuple t)
    {
        volatile float data = thrust::get<0>(t);
        float pos = data + 0.1f;
        thrust::get<1>(t) = pos;
    }
};
void setIndexedGridInfo(float* a, float*b, const int n)
{
    thrust::device_vector<float> d_oldData(a, a+n);
    thrust::device_vector<float> d_newData(b, b+n);
    float c = 0.0;
    thrust::for_each(
        thrust::make_zip_iterator(thrust::make_tuple(d_oldData.begin(), d_newData.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(d_oldData.end(), d_newData.end())),
        grid_functor(c));
    thrust::copy(d_newData.begin(), d_newData.end(), b);
}
int main(void)
{
    const int n = 8;
    float* a = (float*)(malloc(n*sizeof(float)));
    a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
    float* b = (float*)(malloc(n*sizeof(float)));
    setIndexedGridInfo(a, b, n);
    for (int i = 0; i < n; i++) {
        fprintf(stdout, "%d (%f,%f)\n", i, a[i], b[i]);
    }
    return 0;
}
I can compile and run this code on an OS 10.6.8 host with CUDA 4.1 like this:
$ nvcc -Xptxas="-v" -arch=sm_12 -g -G thrustforeach.cu
./thrustforeach.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(20): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(20): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : Compiling entry function '_ZN6thrust6detail7backend4cuda6detail23launch_closure_by_valueINS2_18for_each_n_closureINS_12zip_iteratorINS_5tupleINS0_15normal_iteratorINS_10device_ptrIfEEEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEEi12grid_functorEEEEvT_' for 'sm_12'
ptxas info : Used 14 registers, 160+0 bytes lmem, 16+16 bytes smem, 4 bytes cmem[1]
ptxas info : Compiling entry function '_ZN6thrust6detail7backend4cuda6detail23launch_closure_by_valueINS2_18for_each_n_closureINS_12zip_iteratorINS_5tupleINS0_15normal_iteratorINS_10device_ptrIfEEEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEEj12grid_functorEEEEvT_' for 'sm_12'
ptxas info : Used 14 registers, 160+0 bytes lmem, 16+16 bytes smem, 4 bytes cmem[1]
$ ./a.out
0 (0.000000,0.100000)
1 (1.000000,1.100000)
2 (2.000000,2.100000)
3 (3.000000,3.100000)
4 (4.000000,4.100000)
5 (5.000000,5.100000)
6 (6.000000,6.100000)
7 (7.000000,7.100000)

CUDA shared array not getting values?

I am trying to implement a simple parallel reduction, using code from the CUDA SDK. But somehow there is a problem in my kernel: the shared array is not getting the values of the global array, and it is all zeroes.
extern __shared__ float4 sdata[];

// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = dev_src[i];
__syncthreads();

// do reduction in shared mem
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

// write result for this block to global mem
if (tid == 0)
    out[blockIdx.x] = sdata[0];
EDIT: OK, I got it working by removing the extern keyword and making the shared array a constant size like 512. I am in good shape now. Maybe someone can explain why it was not working with the extern keyword?
I think I know why this is happening as I have faced this before. How are you calling the kernel?
Remember that in the call kernel<<<blocks,threads,sharedMemory>>>, sharedMemory should be the size of the shared memory in bytes. So if you are declaring 512 elements, the third parameter should be 512 * sizeof(float4). I think you are just calling it as below, which is wrong:
kernel<<<blocks,threads,512>>> // this is wrong
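The corrected call, as a sketch (kernel here is a placeholder name for your reduction kernel):
// For an extern __shared__ float4 sdata[] array of 512 elements,
// pass the size in bytes as the third launch parameter:
kernel<<<blocks, threads, 512 * sizeof(float4)>>> // correct
// A statically sized array needs no launch parameter, which is why
// removing extern and hard-coding the size worked:
// __shared__ float4 sdata[512];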
Hope that helps
