OpenCL CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE not consistent - memory

I am trying to understand the value of CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, specifically for a RX Vega 56 GPU. My own program queries and outputs the following:
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : 4244635648
Moreover, using clinfo:
% clinfo | grep constant
Max constant buffer size 4244635648 (3.953GiB)
Max number of constant args 8
I would expect something like 65536, but almost 4 GB for the constant buffer size seems far too large. Could someone explain to me what is going on? I was wondering whether this could be a driver problem ...
I saw other people getting large values for this parameter too. See here.

The GPU has video memory, also called VRAM, and all of it can be used as constant memory.
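For reference, a minimal sketch of how this value can be queried with clGetDeviceInfo (assuming device is a cl_device_id already obtained via clGetDeviceIDs):

#include <CL/cl.h>
#include <stdio.h>

/* Assumes 'device' is a valid cl_device_id obtained via clGetDeviceIDs(). */
void print_max_constant_buffer_size(cl_device_id device)
{
    cl_ulong max_const_buf = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(max_const_buf), &max_const_buf, NULL);
    printf("CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : %llu\n",
           (unsigned long long)max_const_buf);
}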

Related

OpenCL kernel out of resources based on number of loop iterations INSIDE the kernel. Can the compiled kernel be too large to fit on the GPU?

TL;DR I have an OpenCL kernel that loops for a large number of iterations and calls several user-made functions. The kernel works fine for a few iterations, but increasing the number of iterations causes a CL_OUT_OF_RESOURCES (-5) error. If the same kernel is executed on a better GPU it is able to loop for more iterations without the error. What can be causing this error based on the number of iterations? Is it possible that the loops are being unrolled, generating code larger than the GPU can hold?
I am developing an OpenCL kernel to run on GPU that computes a very complex function. To keep things organized, I have a "kernel.cl" file with the main kernel (the __kernel void function) and an "aux_functions.cl" file with ~20 auxiliary functions (they return int, int2, or int16, but are not __kernel functions) that are called several times by the kernel and by one another.
The problem specification is roughly as follows (justification for so many loops):
I have two arrays representing full HD images (1920x1080 integers)
For each 128x128 patch of one image, I must find the value of 4 parameters that optimize a given function (the second image is used to evaluate how good it is)
For the same 128x128 patch and the same 4 parameters, each 4x4 sub-patch is transformed slightly different based on its position inside the larger 128x128 patch
And I tried to model the execution as follows:
Each workgroup will compute the kernel for one 128x128 patch (I started processing only the 10 first patches -- 10 workgroups)
Each workgroup is composed of 256 workitems
Each workitem will test a distinct set of values (a fraction of a predefined set) for the 4 parameters based on its IDs
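A minimal host-side sketch of that launch configuration (the queue, kernel, buffers, and event are assumed to be set up elsewhere; names are illustrative):

/* Hypothetical launch: 10 workgroups of 256 workitems each, as a 1D NDRange. */
size_t local_size  = 256;              /* workitems per workgroup (one parameter subset each) */
size_t global_size = 10 * local_size;  /* 10 patches -> 10 workgroups                         */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size,
                                    0, NULL, &event);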
The main structure of the kernel is as follows:
__kernel void funct(__global int *referenceFrameSamples, __global int *currentFrameSamples, const int frameWidth, const int frameHeight, __global int *result){
    // Initialize some variables and get global and local IDs
    for(executed X times){ // These 4 outer loops are used to test different combinations of parameters to be applied in a given function in the inner loops
        for(executed Y times){
            for(executed Z times){
                for(executed W times){
                    // Simple assignments based on the outer for loops
                    for(executed 32x){ // Each execution of the inner loop applies a function to a 4x4 patch
                        for(executed 32x){
                            // The relevant computation is performed here
                            // Calls a couple of lightweight functions using only the private variables
                            // Calls a complex function that uses the __global int *referenceFrameSamples variable
                        }
                    }
                    // Compute something and use select() to update the best value
                }
            }
        }
    }
    // Write the best value to the __global *result buffer
}
The problem is that when the 4 outer loops are repeated a few times the kernel runs fine, but if I increase the iterations the kernel crashes with the error ERROR! clWaitForEvents returned CL_OUT_OF_RESOURCES (-5). I am testing it on a notebook with a GeForce 940MX GPU with 2GB, and the kernel starts crashing when X * Y * Z * W = 1024.
The clEnqueueNDRangeKernel() call returns no error; only the clWaitForEvents() called after it returns an error. I am using CLIntercept to profile the errors and running time of the kernel. Also, when the kernel runs smoothly I can measure the execution time correctly (shown next), but when it crashes, the "measured" execution time is ridiculously wrong (billions of milliseconds) even though it crashes within the first seconds.
cl_ulong time_start;
cl_ulong time_end;
// Query the start and end timestamps (in nanoseconds) recorded on the kernel's event
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double nanoSeconds = time_end - time_start;
printf("OpenCL execution time is: %0.3f milliseconds \n", nanoSeconds / 1000000.0);
What I tested:
Improving the complex auxiliary function that used a __global variable: instead of passing the __global pointer, I read the relevant part of the array into a private array and passed that as the argument. Outcome: improved running time in the successful cases, but it still fails at the same point
Reducing workgroups and workitems: even using 1 workgroup and 1 workitem (the absolute minimum) with the same number of iterations yields the same error. For a smaller number of iterations, running time decreases with fewer workitems/groups
Running the same kernel on a better GPU: after doing the previous 2 modifications (improved function and reduced workitems) I launched the kernel on a desktop equipped with a Titan V GPU with 12GB. It is able to compute the kernel with a larger number of iterations (I tried up to 1 million iterations) without giving CL_OUT_OF_RESOURCES, and the running time seems to increase linearly with the iterations. Although this is the computer that will actually run the kernel over a dataset to solve my problem, it is a server that must be accessed remotely. I would prefer to do the development on my notebook and deploy the final code on the server.
My guess: I know that function calls are inlined on the GPU. Since the program is crashing based on the number of iterations, my only guess is that these for loops are being unrolled, and with the inlined functions, the compiled kernel is too big to fit on the GPU (even with a single workitem). This also explains why using a better GPU allows increasing the number of iterations.
Question: What could be causing this CL_OUT_OF_RESOURCES error based on the number of iterations?
Of course I could reduce the number of iterations in each workitem, but then I would need multiple workgroups to process the same set of data (the same 128x128 patch) and would need to access global memory to select the best result between workgroups of the same patch. I may end up proceeding in this direction, but I would really like to know what is happening with the kernel at this moment.
Update after @doqtor's comment:
Using -cl-nv-verbose when building the program reports the following resource usage. Strangely, these values do not change with the number of iterations, whether the program runs successfully or crashes.
ptxas info : 0 bytes gmem, 532 bytes cmem[3]
ptxas info : Compiling entry function 'naive_affine_2CPs_CU_128x128_SR_64x64' for 'sm_50'
ptxas info : Function properties for naive_affine_2CPs_CU_128x128_SR_64x64
ptxas . 66032 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 140 registers, 360 bytes cmem[0]
Running clinfo reports that my GPU has
Registers per block (NV) 65536
Global memory size 2101870592 (1.958GiB)
Max memory allocation 525467648 (501.1MiB)
Local memory size 49152 (48KiB)
It seems that I am not using too many registers, but I don't know how those stack frame, cmem[0] and cmem[3] relate to the memory information reported by clinfo.
Is it possible that the loops are being unrolled and generating code larger than the GPU can hold?
Yes, that is part of the problem. The compiler sees that you have a loop with a fixed, small range, and it automatically unrolls it. This happens for the six nested loops, and then the assembly blows up. This leads to register spilling into global memory, which makes the application very slow.
However, even if the compiler does not unroll the loops, every thread does X*Y*Z*W*32*32 iterations of "The relevant computation", which takes an eternity. The system thinks the GPU has frozen up and you get CL_OUT_OF_RESOURCES.
Can you really not parallelize any of these six nested loops? The best solution would be to parallelize them all, that is, include them in the kernel range and launch a few hundred million threads that do "The relevant computation" without any loops. You should have as many independent threads/workgroups as possible to get the best performance (saturate the GPU).
Remember, your GPU has thousands of cores grouped into warps of 32 and SMs of 2 or 4 warps, and if you only launch a single workgroup, it will run on only a single SM with 64 or 128 cores while the remaining cores stay idle.
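As an illustration only, a sketch of how the outer parameter loops could be folded into the NDRange (the kernel name and the flattening of X*Y*Z*W into one dimension are assumptions, not the asker's actual code):

// Hypothetical 2D NDRange: dimension 0 = patch index, dimension 1 = parameter combination.
__kernel void funct_parallel(__global const int *referenceFrameSamples,
                             __global const int *currentFrameSamples,
                             const int frameWidth, const int frameHeight,
                             __global int *result)
{
    const int patch = get_global_id(0);   // which 128x128 patch
    const int comb  = get_global_id(1);   // flattened X*Y*Z*W parameter index
    // Decode comb into the 4 parameters here, e.g. x = comb % X; y = (comb / X) % Y; ...
    for (int i = 0; i < 32; i++) {        // 4x4 sub-patch rows
        for (int j = 0; j < 32; j++) {    // 4x4 sub-patch columns
            // The relevant computation for this (patch, parameter) pair
        }
    }
    // Reduce the per-combination scores afterwards (e.g. in local memory or a second kernel)
}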

I am trying to work with 5-dimensional arrays with a lot of steps. I get an error: 'std::bad_alloc'

I am working on a problem which requires 5-dimensional arrays (dynamically allocated) with a number of steps of at least 1000. The code works fine for 50 steps, but when the number of steps is increased it gives the error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Any suggestions?
I was about to write another comment but it got too long...
You asked for a way out of std::bad_alloc; I can think of two options:
Buy more RAM ;-)
Stop allocating so much memory
Given that you didn't provide any code and it is not clear what you mean by "steps", I will assume "steps" is the size of each dimension of the array.
Now, consider a one-dimensional (standard) array. A char array of size 50 would require 50 contiguous bytes in memory.
For a 2-dimensional array, the required memory would be 50^2: 2500 bytes.
For a 3-dimensional array, the required memory would be 50^3: 125 KB.
Going up to a five-dimensional array, you would need 50^5 (312500000) bytes, roughly 300 MB (not necessarily contiguous because you usually allocate each sub-array using a nested for loop).
If the length of the array is 1000 instead of 50, the required memory would be 1000^5 bytes (a 1 followed by 15 zeros), almost a petabyte!
And that's why you are running out of memory.
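A quick back-of-the-envelope check in C, assuming one byte per element (illustrative only):

#include <stdio.h>

int main(void)
{
    // Memory needed for an N^5 array of 1-byte elements, for N = 50 and N = 1000
    for (unsigned long long n = 50; n <= 1000; n *= 20) {
        unsigned long long bytes = n * n * n * n * n;
        printf("n = %4llu : %llu bytes (~%.1f GB)\n", n, bytes, bytes / 1e9);
    }
    return 0;
}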

How is virtual stack/heap size calculated given physical RAM constraints

I am working on a particular embedded device from Nordic. The datasheet states 256kB of flash, with 16kB of RAM, on a 32-bit Cortex M0. With that said, my main question is about the stack/heap size given the physical RAM constraints. I found documentation here regarding RAM/ROM management and their associated usages, however it's still unclear how the stack size is calculated. For example,
I found an assembly configuration (arm_startup.s) that sets the max stack/heap size to 2048 bytes.
                IF      :DEF: __STACK_SIZE
Stack_Size      EQU     __STACK_SIZE
                ELSE
Stack_Size      EQU     2048
                ENDIF

                AREA    STACK, NOINIT, READWRITE, ALIGN=3
Stack_Mem       SPACE   Stack_Size
__initial_sp

                IF      :DEF: __HEAP_SIZE
Heap_Size       EQU     __HEAP_SIZE
                ELSE
Heap_Size       EQU     2048
                ENDIF

                AREA    HEAP, NOINIT, READWRITE, ALIGN=3
__heap_base
Heap_Mem        SPACE   Heap_Size
__heap_limit

                PRESERVE8
                THUMB
As a test in main, I purposely called void *x = malloc(2064), which caused a stack overflow, to prove the 2048 bytes is being honored. My question is: if you only have 16kB of RAM, how can you allocate 2048 bytes to both the stack and heap? After searching around a bit I ended up in the Cortex-M0 documentation. I could not find any such constraint, but then again I may be looking at this from the wrong angle. Given the assembly snippet above, it would seem that somewhere, either in hardware or software, there is a total virtual RAM size of 4096 bytes, split into two for the stack and heap.
Is there some programming class going on? Why are we all of a sudden getting hit with the same or related questions?
If you have 16KB of RAM then all of your read/write information has to fit in it, yes? If you don't set those defines then that code is producing 2K of stack and 2K of heap, so 4K. 4K is less than 16K, so you have 12K left for .bss and .data.
A heap normally makes no sense on a microcontroller. Sure, you might port some library, as written, that mallocs, and you just fake that (but you do have to "allocate" it, basically define how much to use, and then implement a simple malloc). But other than, say, borrowing code written for a higher-level platform, you don't want to be using malloc.
"Allocating" stack doesn't make a lot of sense either. How are you managing that? Are you telling the compiler to check against some number before every push? If so, then it is compiler dependent, and you have to feed the compiler what it needs as well as deal with the excess code that produces, which consumes your limited resources and burns time out of your limited bandwidth. Do your system engineering up front instead, so you don't have to do it at runtime.
The Cortex-M has nothing to do with this, nor does anything hardware related at all. 2K + 2K < 16K. Just grade-school math.
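A minimal sketch of the "implement a simple malloc" idea mentioned above, as a hypothetical bump allocator carved out of a fixed static buffer (the pool size and names are illustrative):

#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 2048              /* decide this during system engineering */

static uint8_t pool[POOL_SIZE];     /* lives in .bss, counted against your 16K */
static size_t  pool_used;

/* Hand out 8-byte-aligned chunks; no free(), which is fine for port-and-fake use. */
void *simple_malloc(size_t size)
{
    size = (size + 7u) & ~(size_t)7u;
    if (size > POOL_SIZE - pool_used)
        return NULL;                /* pool exhausted: caller must handle it */
    void *p = &pool[pool_used];
    pool_used += size;
    return p;
}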

Why Enable/Disable A20 Line

I have a question about the A20 gate. I read an article about it saying that the mechanism exists to solve problems with address "wraparound" that appeared when newer CPUs got a 32-bit address bus instead of the older 20-bit bus.
It would seem to me that the correct way to handle the wraparound would be to turn off all of bits A20-A31, and not just A20.
Why is it sufficient to only turn off bit A20 to handle the problem?
The original problem had to do with x86 memory segmentation.
Memory would be accessed using segment:offset, where both segment and offset were 16-bit integers. The actual address was computed as segment*16+offset. On a 20-bit address bus, this would naturally get truncated to the lowest 20 bits.
This truncation could present a problem when the same code was run on an address bus wider than 20 bits, since instead of the wraparound, the program could access memory past the first megabyte. While not a problem per se, this could be a backwards compatibility issue.
To work around this issue, they introduced a way to force the A20 address line to zero, thereby forcing the wraparound.
Your question is: "Why just A20? What about A21-A31?"
Note that the highest location that could be addressed using the 16-bit segment:offset scheme was 0xffff * 16 + 0xffff = 0x10ffef. This fits in 21 bits. Thus, the lines A21-A31 were always zero, and it was only A20 that needed to be controlled.
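A small illustrative C snippet of that arithmetic (the masks merely model the 20-bit truncation and the A20 gate; this is not how real hardware is programmed):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t segment = 0xFFFF, offset = 0xFFFF;
    uint32_t linear  = segment * 16 + offset;   /* 0x10FFEF, needs 21 bits        */
    uint32_t wrapped = linear & 0xFFFFF;        /* 8086-style 20-bit wraparound   */
    uint32_t a20_off = linear & ~(1u << 20);    /* A20 forced to zero             */
    printf("linear  = 0x%06X\n", linear);       /* 0x10FFEF                       */
    printf("wrapped = 0x%06X\n", wrapped);      /* 0x00FFEF                       */
    printf("a20=0   = 0x%06X\n", a20_off);      /* 0x00FFEF, same result          */
    return 0;
}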

CL_OUT_OF_RESOURCES for 2 millions floats with 1GB VRAM?

It seems like 2 million floats should be no big deal, only 8MB out of 1GB of GPU RAM. I am able to allocate that much at times, and sometimes more, with no trouble. I get CL_OUT_OF_RESOURCES when I do a clEnqueueReadBuffer, which seems odd. Am I able to sniff out where the trouble really started? OpenCL shouldn't be failing like this at clEnqueueReadBuffer, right? It should fail when I allocate the data, right? Is there some way to get more details than just the error code? It would be cool if I could see how much VRAM was allocated when OpenCL declared CL_OUT_OF_RESOURCES.
I just had the same problem you had (took me a whole day to fix).
I'm sure people with the same problem will stumble upon this, that's why I'm posting to this old question.
You probably didn't check the maximum work-group size of the kernel.
This is how you do it:
size_t kernel_work_group_size;
// Maximum work-group size this specific kernel can be launched with on this device
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &kernel_work_group_size, NULL);
My devices (2x NVIDIA GTX 460 & an Intel i7 CPU) support a maximum work-group size of 1024, but the above code returns something around 500 when I query it for my path-tracing kernel.
When I used a workgroup size of 1024 it obviously failed and gave me the CL_OUT_OF_RESOURCES error.
The more complex your kernel becomes, the smaller the maximum workgroup size for it will become (or at least that's what I experienced).
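A minimal sketch of using that value when launching (queue, kernel, and device are assumed to exist; work_items is a hypothetical total work count):

/* Clamp the requested local size to what this kernel actually supports. */
size_t kernel_wgs;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernel_wgs), &kernel_wgs, NULL);
size_t local_size  = (256 < kernel_wgs) ? 256 : kernel_wgs;
size_t global_size = ((work_items + local_size - 1) / local_size) * local_size;
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size, 0, NULL, NULL);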
Edit:
I just realized you said "clEnqueueReadBuffer" instead of "clEnqueueNDRangeKernel"...
My answer was related to clEnqueueNDRangeKernel.
Sorry for the mistake.
I hope this is still useful to other people.
From another source:
- calling clFinish() gets you the error status for the calculation (rather than getting it when you try to read data).
- the "out of resources" error can also be caused by a 5s timeout if the (NVidia) card is also being used as a display
- it can also appear when you have pointer errors in your kernel.
A follow-up suggests running the kernel first on the CPU to ensure you're not making out-of-bounds memory accesses.
Not all available memory can necessarily be supplied to a single allocation request. Read up on heap fragmentation to learn more about why the largest allocation that can succeed is for the largest contiguous block of memory and how blocks get divided up into smaller pieces as a result of using the memory.
It's not that the resource is exhausted... It just can't find a single piece big enough to satisfy your request...
Out-of-bounds accesses in a kernel are typically silent (there is still no error at the kernel enqueue call).
However, if you try to read the kernel result later with clEnqueueReadBuffer(), this error will show up. It indicates something went wrong during kernel execution.
Check your kernel code for out-of-bounds reads/writes.
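An illustrative guard for the most common case, assuming a simple 1D kernel where n is the valid element count passed from the host:

// Hypothetical kernel: skip work-items that fall past the end of the buffer.
__kernel void scale(__global float *data, const int n, const float factor)
{
    int gid = get_global_id(0);
    if (gid >= n)          // the global size is often rounded up to a multiple of the local size,
        return;            // so out-of-range IDs must not touch memory
    data[gid] *= factor;
}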

Resources