OpenCL slow memory access in for loop

I have a program built in OpenCL in which each kernel accesses a read-only buffer located in global memory. At some point each kernel needs to copy some data from global memory into a temporary buffer, so I wrote a for loop that copies a region of global memory byte by byte into that buffer. I execute the kernel with a clEnqueueNDRangeKernel call placed inside a while loop. To measure how fast the clEnqueueNDRangeKernel command is, I added a counter called ups (updates per second) that is incremented at the end of each while-loop iteration. Every second I print the value of the counter and reset it to zero.
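In outline, the measuring loop looks like this (a simplified sketch; queue, kernel and globalSize are placeholder names set up elsewhere, error checking omitted):

#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

void measure(cl_command_queue queue, cl_kernel kernel, size_t globalSize)
{
    int ups = 0;
    time_t last = time(NULL);
    for (;;) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
        clFinish(queue); /* block so one loop iteration counts as one finished update */
        ++ups;
        if (time(NULL) - last >= 1) {
            printf("%d ups\n", ups);
            ups = 0;
            last = time(NULL);
        }
    }
}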
I noticed that my program was running slowly, at about 53 ups. After some investigation I found out that the problem was the memory copying loop that was described above. This is the code:
typedef uchar byte;

byte tempBuffer[128];                  // private temporary buffer
byte* destPtr = (byte*)&tempBuffer[0];
__global const byte* srcPtr = (__global const byte*)globalMemPtr;

for (size_t k = 0; k < regionSize; ++k) // byte-by-byte copy
{
    destPtr[k] = srcPtr[k];
}
The variable globalMemPtr is a pointer to the region of global memory that needs to be copied into the temporary buffer, and tempBuffer is that temporary buffer. The variable regionSize holds the size of the region to be copied, in bytes; in this case its value is 12.
What I noticed was that if I replace regionSize with the literal 12, the kernel runs much faster, at about 90 ups. My assumption is that the OpenCL compiler can optimize the for loop into a much faster copy when 12 is used, but cannot when regionSize is used.
Does anyone know what is happening? Can anyone help me?
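For completeness, this is the variant I am considering (a sketch: REGION_SIZE is an assumed macro name that would be defined at build time via the clBuildProgram options, e.g. "-DREGION_SIZE=12"; #pragma unroll support is vendor-dependent). The body replaces the copy loop shown above:

#ifndef REGION_SIZE
#define REGION_SIZE 12 /* fallback value, assumption for this sketch */
#endif

byte tempBuffer[128];
byte* destPtr = (byte*)&tempBuffer[0];
__global const byte* srcPtr = (__global const byte*)globalMemPtr;

#pragma unroll                          /* constant trip count: compiler can unroll */
for (size_t k = 0; k < REGION_SIZE; ++k)
{
    destPtr[k] = srcPtr[k];
}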

Related

OpenCL kernel out of resources based on number of loop iterations INSIDE the kernel. Can the compiled kernel be too large to fit on the GPU?

TL;DR I have an OpenCL kernel that loops for a large number of iterations and calls several user-made functions. The kernel works fine for a few iterations, but increasing the number of iterations causes a CL_OUT_OF_RESOURCES (-5) error. If the same kernel is executed on a better GPU it is able to loop for more iterations without the error. What can be causing this error based on the number of iterations? Is it possible that the loops are being unrolled and generating code larger than the GPU can hold?
I am developing an OpenCL kernel to run on a GPU that computes a very complex function. To keep things organized, I have a "kernel.cl" file with the main kernel (the __kernel void function) and an "aux_functions.cl" file with ~20 auxiliary functions (of type int, int2, int16, but not __kernel) that are called several times by the kernel and by one another.
The problem specification is roughly as follows (to justify the number of loops):
I have two arrays representing full HD images (1920x1080 integers)
For each 128x128 patch of one image, I must find the value of 4 parameters that optimize a given function (the second image is used to evaluate how good it is)
For the same 128x128 patch and the same 4 parameters, each 4x4 sub-patch is transformed slightly differently based on its position inside the larger 128x128 patch
And I tried to model the execution as follows:
Each workgroup will compute the kernel for one 128x128 patch (I started by processing only the first 10 patches -- 10 workgroups)
Each workgroup is composed of 256 workitems
Each workitem will test a distinct set of values (a fraction of a predefined set) for the 4 parameters based on its IDs
The main structure of the kernel is as follows:
__kernel void funct(__global int *referenceFrameSamples, __global int *currentFrameSamples, const int frameWidth, const int frameHeight, __global int *result){
    // Initialize some variables and get global and local IDs
    for(executed X times){ // These 4 outer loops are used to test different combinations of parameters to be applied in a given function in the inner loops
        for(executed Y times){
            for(executed Z times){
                for(executed W times){
                    // Simple assignments based on the outer for loops
                    for(executed 32x){ // Each execution of the inner loop applies a function to a 4x4 patch
                        for(executed 32x){
                            // The relevant computation is performed here
                            // Calls a couple of lightweight functions using only the private variables
                            // Calls a complex function that uses the __global int *referenceFrameSamples variable
                        }
                    }
                    // Compute something and use select() to update the best value
                }
            }
        }
    }
    // Write the best value to the __global *result buffer
}
The problem is that when the 4 outer loops are repeated only a few times the kernel runs fine, but if I increase the number of iterations the kernel crashes with the error ERROR! clWaitForEvents returned CL_OUT_OF_RESOURCES (-5). I am testing it on a notebook with a GeForce 940MX GPU with 2 GB, and the kernel starts crashing when X * Y * Z * W = 1024.
The clEnqueueNDRangeKernel() call returns no error; only the clWaitForEvents() called after it returns an error. I am using CLIntercept to profile the errors and the running time of the kernel. Also, when the kernel runs smoothly I can measure the execution time correctly (shown next), but when it crashes the "measured" execution time is ridiculously wrong (billions of milliseconds) even though it crashes within the first seconds.
cl_ulong time_start;
cl_ulong time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double nanoSeconds = (double)(time_end - time_start);
printf("OpenCL execution time is: %0.3f milliseconds\n", nanoSeconds / 1000000.0);
What I tested:
Improve the complex auxiliary function that used the __global variable: instead of passing the __global pointer, I read the relevant part of the array into a private array and passed that as an argument. Outcome: improved running time in the successful cases, but it still fails in the same case
Reduce workgroups and workitems: even using 1 workgroup and 1 workitem (the absolute minimum) with the same number of iterations yields the same error. For a smaller number of iterations, running time decreases with fewer workitems/workgroups
Running the same kernel on a better GPU: after making the previous 2 modifications (improved function and reduced workitems) I launched the kernel on a desktop equipped with a Titan V GPU with 12 GB. It is able to compute the kernel with a larger number of iterations (I tried up to 1 million) without giving CL_OUT_OF_RESOURCES, and the running time seems to increase linearly with the iterations. Although this is the computer that will actually run the kernel over a dataset to solve my problem, it is a server that must be accessed remotely. I would prefer to do the development on my notebook and deploy the final code to the server.
My guess: I know that function calls are inlined on GPUs. Since the program crashes based on the number of iterations, my only guess is that these for loops are being unrolled, and that with the inlined functions the compiled kernel is too big to fit on the GPU (even with a single workitem). This would also explain why a better GPU allows increasing the number of iterations.
Question: What could be causing this CL_OUT_OF_RESOURCES error based on the number of iterations?
Of course I could reduce the number of iterations in each workitem, but then I would need multiple workgroups to process the same set of data (the same 128x128 patch) and would need to access global memory to select the best result among the workgroups of the same patch. I may end up going in this direction, but I would really like to know what is happening inside the kernel at the moment.
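If I do go that way, I imagine something like the following sketch (all names assumed; it presumes each workgroup has already written its best score into a per-group slot in global memory), where a trivial second kernel picks the overall best:

// Hypothetical second pass: scan the per-workgroup results for one patch.
__kernel void reduce_best(__global const int *groupBest, const int numGroups,
                          __global int *best)
{
    if (get_global_id(0) == 0) {
        int b = groupBest[0];
        for (int g = 1; g < numGroups; ++g)
            b = min(b, groupBest[g]); // assuming a lower score is better
        *best = b;
    }
}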
Update after #doqtor comment:
Using -cl-nv-verbose when building the program reports the following resource usage. Strangely, these values do not change with the number of iterations, whether the program runs successfully or crashes.
ptxas info : 0 bytes gmem, 532 bytes cmem[3]
ptxas info : Compiling entry function 'naive_affine_2CPs_CU_128x128_SR_64x64' for 'sm_50'
ptxas info : Function properties for naive_affine_2CPs_CU_128x128_SR_64x64
ptxas         66032 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 140 registers, 360 bytes cmem[0]
Running clinfo reports that my GPU has
Registers per block (NV) 65536
Global memory size 2101870592 (1.958GiB)
Max memory allocation 525467648 (501.1MiB)
Local memory size 49152 (48KiB)
It seems that I am not using too many registers, but I don't know how the stack frame, cmem[0] and cmem[3] figures relate to the memory information reported by clinfo.
Is it possible that the loops are being unrolled and generating code larger than the GPU can hold?
Yes, that is part of the problem. The compiler sees that you have a loop with a fixed, small trip count and automatically unrolls it. This happens for the six nested loops, and then the assembly blows up. It also causes register spilling into global memory, which makes the application very slow.
However, even if the compiler does not unroll the loops, every thread does X*Y*Z*W*32*32 iterations of "the relevant computation", which takes an eternity. The system thinks the GPU has frozen and you get CL_OUT_OF_RESOURCES.
Can you really not parallelize any of these six nested loops? The best solution would be to parallelize them all: fold them into the kernel range and launch a few hundred million threads that do "the relevant computation" without any loops, as in the sketch below. You should have as many independent threads/workgroups as possible to get the best performance (saturate the GPU).
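A minimal sketch of that idea (assumes X, Y, Z and W are #defined at build time; the decode scheme and two-dimensional NDRange are my assumptions, not the asker's code):

// Launch X*Y*Z*W work-items per patch and decode the parameter indices
// from the flat global ID instead of looping over them.
__kernel void funct(__global int *referenceFrameSamples, __global int *currentFrameSamples,
                    const int frameWidth, const int frameHeight, __global int *result)
{
    size_t gid = get_global_id(0);    // 0 .. X*Y*Z*W - 1
    int w = (int)(gid % W);  gid /= W;
    int z = (int)(gid % Z);  gid /= Z;
    int y = (int)(gid % Y);  gid /= Y;
    int x = (int)gid;
    size_t patch = get_global_id(1);  // one NDRange row per 128x128 patch

    // ... only the two inner 32x32 loops remain here ...
    // Each work-item writes its candidate score; a separate reduction picks the best.
}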
Remember, your GPU has thousands of cores grouped into warps of 32 and SMs of 2 or 4 warps; if you only launch a single workgroup, it will run on only a single SM with 64 or 128 cores and the remaining cores will stay idle.

Where does the increment value for sbrk come from when using newlib nano malloc?

I have a target (STM32F030R8) that I am using with FreeRTOS and the newlib reentrant heap implementation (http://www.nadler.com/embedded/newlibAndFreeRTOS.html). This shim defines sbrk in addition to implementing the actual FreeRTOS memory allocation shim. The project is built with GNU Arm GCC using --specs=nosys.specs.
For the following example, FreeRTOS has not been started yet and malloc has not been called before. The MCU is fresh off boot and has only just copied initialized data from flash into RAM and configured clocks and basic peripherals.
Simply by having the lib in my project, I am seeing that sbrk is being called with very large increment values for seemingly small malloc calls.
The target has 8 KB of memory, of which I have 0x12b8 (~4 KB) bytes between the start of the heap and the end of RAM (the top of the stack).
I am seeing that if I allocate 1000 bytes with str = (char*) malloc(1000);, sbrk gets called twice: first with an increment value of 0x07e8 and then again with an increment value of 0x0c60. The result is that the desired total increment is 0x1448 (5192 bytes!), which of course overflows not just the stack but the available RAM.
What on earth is going on here? Why are these huge increment values being used by malloc for such a relatively small desired buffer allocation?
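For reference, this is roughly how I am observing the increments (a simplified sketch of an instrumented _sbrk_r; __HeapBase is an assumed linker symbol name, and the real shim also does bounds checking):

#include <reent.h>
#include <stddef.h>
#include <stdio.h>

extern char __HeapBase;               /* assumed linker symbol */
static char *heap_end = &__HeapBase;

void *_sbrk_r(struct _reent *r, ptrdiff_t incr)
{
    printf("sbrk(%ld), heap_end=%p\n", (long)incr, (void *)heap_end);
    char *prev = heap_end;
    heap_end += incr;                  /* bounds check omitted for brevity */
    return prev;
}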
I think it may not be possible to answer this definitively, other than to advise on debugging. The simplest approach is to step through the allocation code to determine where and why the allocation size request is being corrupted (as it appears to be). You will of course need to build the library from source, or at least include mallocr.c in your code to override any static library implementation.
In Newlib Nano the call-stack to _sbrk_r is rather simple (compared to regular Newlib). The increment is determined from the allocation size s in nano_malloc() in nano-mallocr.c as follows:
malloc_size_t alloc_size;
alloc_size = ALIGN_TO(s, CHUNK_ALIGN); /* size of aligned data load */
alloc_size += MALLOC_PADDING; /* padding */
alloc_size += CHUNK_OFFSET; /* size of chunk head */
alloc_size = MAX(alloc_size, MALLOC_MINCHUNK);
Then if a chunk of alloc_size is not found in the free list (which is always the case when free() has not been called), sbrk_aligned( alloc_size ) is called, which in turn calls _sbrk_r(). None of the padding, alignment or minimum-allocation macros adds such a large amount.
sbrk_aligned( alloc_size ) calls _sbrk_r() a second time if the returned break is not aligned. That second call should never be larger than CHUNK_ALIGN - (sizeof(void*)).
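For scale, here is a worked example under assumed macro values (CHUNK_ALIGN = 4, MALLOC_PADDING = 4, CHUNK_OFFSET = 8; verify against your build's nano-mallocr.c):

/* s = 1000                                   requested size            */
/* alloc_size = ALIGN_TO(1000, 4) = 1000      already 4-byte aligned    */
/* alloc_size += 4               -> 1004      MALLOC_PADDING            */
/* alloc_size += 8               -> 1012      CHUNK_OFFSET (chunk head) */
/* alloc_size = MAX(1012, MALLOC_MINCHUNK) = 1012                       */
/* So _sbrk_r should be asked for roughly 1 KB -- nowhere near          */
/* 0x07e8 + 0x0c60 = 0x1448 (5192) bytes, which suggests the size is    */
/* corrupted before it reaches nano_malloc().                           */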
In your debugger you should be able to inspect the call stack, or step through the call, to see where the parameter becomes incorrect.

shm_open and mmap with stl vector

I've read that an STL vector does not work well with SYS V shared memory. But if I use POSIX shm_open and then mmap with NULL (mmap(NULL, LARGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)), give a much larger size than my object (which contains the vector) needs, and after mapping add additional items to the vector, can there be a problem other than exceeding the LARGE_SIZE space? A related question: is it guaranteed on a recent SUSE Linux that when the region is mapped to the same start address (using the above syntax) in unrelated processes, my object will be mapped directly, with no (system) copy performed to propagate changed values between the processes (like what happens with ordinary files that are open-ed and then mmap-ed)?
Thanks!
Edit:
Is this correct then?

void* mem = allocate_memory_with_mmap(); // say from a shared region
MyType* ptr = new ( mem ) MyType( args );
ptr->~MyType(); // is this really needed?

Now in an unrelated process:

MyType* myptr = (MyType*)fetch_address_from_mmap(...);
myptr->printHelloWorld();
myptr->myvalue = 1; // writes to shared memory
myptr->~MyType(); // is this really needed?

And if I want to free the memory:

munmap(address...); // but this is done only once, when none of the processes use it any more
You are missing the fact that an STL vector is usually just a tuple of (memory pointer, allocated size, element count), where the actual memory for the contained objects is obtained from the allocator template parameter.
Placing an instance of std::vector in shared memory does not make any sense. You probably want to check out the boost::interprocess library instead.
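For illustration, a minimal sketch of a vector living in shared memory with Boost.Interprocess (the segment and object names "MySegment"/"MyVector" are made up; check the library docs for your Boost version):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>

namespace bip = boost::interprocess;
typedef bip::allocator<int, bip::managed_shared_memory::segment_manager> ShmAllocator;
typedef bip::vector<int, ShmAllocator> ShmVector;

int main()
{
    // Create (or open) a named segment; any process can find the vector by name.
    bip::managed_shared_memory segment(bip::open_or_create, "MySegment", 65536);
    ShmVector *vec = segment.find_or_construct<ShmVector>("MyVector")(
        segment.get_segment_manager());
    vec->push_back(42); // stored in shared memory, visible to other processes
    return 0;
}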
Edit 0:
Memory allocation and object construction are two distinct phases, though they are combined in a single statement like the one below (unless operator new is redefined for MyType):
// allocates from process heap and constructs
MyType* ptr = new MyType( args );
You can split these two phases with placement new:
void* mem = allocate_memory_somehow(); // say from a shared region
MyType* ptr = new ( mem ) MyType( args );
Though now you will have to explicitly call the destructor and release the memory:
ptr->~MyType();
release_memory_back_to_where_it_came_from( ptr );
This is essentially how you construct objects in shared memory in C++. Note though that types that store pointers are not suitable for shared memory, since a pointer valid in one process's memory space does not make any sense in another. Use explicit sizes and offsets instead.
Hope this helps.

CUDA: are access times for texture memory similar to coalesced global memory?

My kernel threads access a linear character array in a coalesced fashion. If I map the array to texture I don't see any speedup. The running times are almost the same. I'm working on a Tesla C2050 with compute capability 2.0 and read somewhere that global accesses are cached. Is that true? Perhaps that is why I am not seeing a difference in the running time.
The array in the main program is
char *dev_database = NULL;
cudaMalloc( (void**) &dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
and I bind it to texture texture<char> texdatabase with
cudaBindTexture(NULL, texdatabase, dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
Each thread then reads a character ch = tex1Dfetch(texdatabase, p + id), where id is threadIdx.x + blockIdx.x * blockDim.x and p is an offset.
I'm binding only once and dev_database is a large array. Actually I found that if the size is too large the bind fails. Is there a limit on the size of the array that can be bound? Thanks very much.
There are several possibilities for why you don't see any difference in performance, but the most likely is that this memory access is not your bottleneck. If it is not your bottleneck, making it faster will have no effect on performance.
Regarding caching: for this case, since you are reading only bytes, each warp will read 32 bytes, which means each group of 4 warps maps to one cache line. So assuming few cache conflicts, you will get up to 4x reuse from the cache. So if this memory access is a bottleneck, it is conceivable that the texture cache might not benefit you more than the general-purpose cache.
You should first determine if you are bandwidth bound and if this data access is the culprit. Once you have done that, then optimize your memory accesses. Another tactic to consider is to access 4 to 16 chars per thread per load (using a char4 or int4 struct with byte packing/unpacking) rather than one per thread, to increase the number of memory transactions in flight at a time, as in the sketch below; this can help saturate the global memory bus.
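A sketch of the char4 idea (the kernel shape and names are assumed; it also assumes p is a multiple of 4 and the database pointer is 4-byte aligned):

// Each thread loads 4 consecutive chars with a single 32-bit transaction.
__global__ void scan(const char4* __restrict__ db, int p, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n / 4) {
        char4 c4 = db[p / 4 + id]; // one coalesced 4-byte load
        // unpack and use c4.x, c4.y, c4.z, c4.w ...
    }
}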
There is a good presentation by Paulius Micikevicius from GTC 2010 that you might want to watch. It covers both analysis-driven optimization and the specific concept of memory transactions in flight.

Beginner assembly programming memory usage question

I've been getting into some assembly lately and it's fun, as it challenges everything I have learned. I was wondering if I could ask a few questions.
When running an executable, does the entire executable get loaded into memory?
From a bit of fiddling I've found that constants aren't really constants? Is it just a compiler thing?
const int i = 5;
_asm { mov i, 0 } // i is now 0 and compiles fine
So are all variables assigned with a constant value embedded into the file as well?
Meaning:
int a = 1;
const int b = 2;
void something()
{
    const int c = 3;
    int d = 4;
}
Will i find all of these variables embedded in the file (in a hex editor or something)?
If the executable is loaded into memory, then "constants" are technically using memory? I've read people on the net saying that constants don't use memory; is this true?
Your executable's text (i.e. code) and data segments get mapped into the process's virtual address space when the executable starts up, but the bytes might not actually be copied from the disk until those memory locations are accessed. See http://en.wikipedia.org/wiki/Demand_paging
C-language constants actually exist in memory, because you have to be able to take their address (that is, &i). Constants are usually placed in the .rdata segment of your executable image.
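A quick way to see this for yourself (a minimal example; the printed address varies per run):

#include <stdio.h>

const int i = 5; // typically placed in .rdata

int main(void)
{
    const int *p = &i;                 // taking the address forces storage
    printf("%p -> %d\n", (void *)p, *p);
    return 0;
}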
A constant is going to take up memory somewhere -- if you have the constant number 42 in your program, there must be somewhere in memory where the 42 is stored, even if that means it's stored as the operand of an immediate-mode instruction.
The OS loads the code and data segments to prepare them for execution. If the executable has a resource segment, the application loads parts of it on demand.
It's true that const variables take memory space, but compilers are free to optimize for memory usage and code size and embed their values in the code (in case they don't detect any address references for those variables).
const char* data, a.k.a. C strings, are usually interned by the compiler to save memory.
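For example (a sketch; whether the pointers compare equal is implementation-defined and depends on string pooling being enabled):

#include <stdio.h>

int main(void)
{
    const char *s1 = "hello";
    const char *s2 = "hello";
    // With string pooling, the compiler may emit a single copy of "hello",
    // so s1 and s2 can point to the same address (not guaranteed).
    printf("%s\n", s1 == s2 ? "interned" : "separate copies");
    return 0;
}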
