OpenCL constant memory caching

OpenCL constant memory caching - memory

If I execute a kernel that uses a small piece of constant memory, then write to that constant memory while the kernel is running, does the kernel immediately see the change, or is the contents of the constant memory "cached" upon kernel launch - or does the OpenCL driver unconditionally delay the constant memory update until the kernel is done running?
If the first or third options occur, then how can I execute the same kernel with different constant memory data simultaneously? Do I need to create multiple kernel/constant buffer objects and work with that? Note I can't precalculate anything as kernel launches are a result of external signals that can occur at any time and rate. I could also create kernel objects on the fly, but that seems like an ugly solution.

It's a fundamental concept in OpenCL that commands that are 'Enqueued' into the same command queue are executed in order. This includes WriteBuffer and similar commands. This means if you do
EnqueueNDKernalRange()
EnqueueWriteBuffer()
EnqueueNDKernalRange()
Then regardless of them being blocking or non-blocking, the write will only effect the second set of kernels.
If you're updating via a mapped pointer then it should be unmapped before any kernels run. Running kernels which access a buffer that is currently mapped is undefined (Spec 1.1 - Section 5.4.2.1).
As EnqueueMapBuffer and EnqueUnmapMemObject are also placed on the command queue, as long as you unmap the ordering of updates is still guaranteed.
Does that answer your question, or are you updating your buffer in another way?
how can I execute the same kernel with different constant memory data simultaneously? Do I need to create multiple kernel/constant buffer objects and work with that?
Yes, multiple buffer objects.

Related

Should CUDA stream be waited to be complete even if the output data are to be sent to OpenGL instead of CPU?

This is a general question, and although I use OpenCV as a framework, the question is broader than OpenCV's realm.
I am developing an image processing tool that will effectively get image from a webcam (yielding a host-memory located cv::Mat), upload it to a GPU device memory in CUDA (i.e. cv::GpuMat), do some processing using CUDA and get a result finalCudaMat, and finally send the result to OpenGL (i.e. cv::ogl::Buffer::mapDevice + finalCudaMat.copyTo(mappedOglBuffer)). Everything works as intended.
Since the whole process involves multiple steps, I use a CUDA stream object (cv::cuda::Stream) to be able to make CUDA calls asynchronous and not wait on every single operation to be finished on CPU side. Now if someone instead is to eventually copy the result to a CPU matrix (i.e. finalCudaMat.download(finalCpuMat)), as in a customary situation, typically a wait on the stream is required (cudaStream.waitForCompletion()) to ensure the result is ready before using the CPU side matrix.
In my case, the the result never gets back to the CPU as it continues to be rendered on the screen (a bit of OpenGL operations and shaders are also involved).
One way, it might be appropriate to wait for CUDA work to finish before starting to copy the GpuMat to OpenGL Buffer. So if I add the wait on stream, everything is working fine and the CUDA operations take ~2.5ms.
Another way, it feels like I don't need to wait for completion of the stream (all the results are consumed by the GPU anyway -- CPU is never invovled again). Therefore I can remove the cudaStream.waitForCompletion() call before performing finalCudaMat.copyTo(mappedOglBuffer), and everything seems to be working fine. The whole CUDA processing operation (basically any GPU task minus OpenGL related) apparently takes ~1.8ms for me.
In the past I have had bad experience of not properly synchronization GPU work if two different APIs were involved (e.g. do something on Direct3D 9, do not wait for it to finish, and then copy the resulting texture to a Direct3D 10 texture, and clearly on some frames the image becomes empty or torn).
At this point, the difference is tiny and doesn't affect my 60 FPS throughput. But I wonder if I am technically doing a correct work by removing the wait-on-stream operation. Any thoughts on this? Or maybe a document regarding OpenGL/CUDA interop that could help me?

The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability
In particular it says that
Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.
That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:
This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.
So you won't need to make the CPU wait on CUDA completion, but you must call cudaGraphicsUnmapResources which will put the appropriate barrier in the asynchronous instruction stream. Note that unlike your CPU transfer code, this call goes after CUDA copies data into the OpenGL buffer.

As Ben Voigt already pointed out, CUDA requires explicit synchronization with OpenGL (or any other graphics API that interoperates with it). Now this used the be kind of a chore, where one had to submit callbacks to the compute stream and use them to manually work with e.g. OpenGL fences.
However due to the advent of Vulkan and with it the support for external resources (and OpenGL extensions for that) you can in fact synchonize between CUDA and OpenGL command streams, by having both sides import platform native semaphores (cudaImportExternalSemaphore, GL_EXT_semaphore) and use them for mutual synchronization. It usually still involves a whole round trip through the CPU side driver, but since that part has to manage the command streams anyway it's not really an issue of efficiency.

How do modern OS's achieve idempotent cleanup functions for process deaths?

Let's say that I have an OS that implements malloc by storing a list of segments that the process points to in a process control block. I grab my memory from a free list and give it to the process.
If that process dies, I simply remove the reference to the segment from the process control block, and move the segment back to my free list.
Is it possible to create an idempotent function that does this process cleanup? How is it possible to create a function such that it can be called again, regardless of whether it was called many times before or if previous calls died in the middle of executing the cleanup function? It seems to me that you can't execute two move commands atomically.
How do modern OS's implement the magic involved in culling memory from processes that randomly die? How do they implement it so that it's okay for even the process performing the cull to randomly die, or is this a false assumption that I made?

I'll assume your question boils down to how the OS culls a process's memory if that process crashes.
Although I'm self educated in these matters, I'll give you two ways an OS can make sure any memory used by a process is reclaimed if the process crashes.
In a typical modern CPU and modern OS with virtual memory:
You have two layers of allocation. Whenever the process calls malloc, malloc tries to satisfy the request from already available memory pages the kernel gave the process. If not enough pages are available, malloc asks the kernel to allocate more pages.
In this case, whenever a process crashes or even if it exits normally, the kernel doesn't care what malloc did, or what memory the process forgot to release. It only needs to free all the pages it gave the process.
In a simpler OS that doesn't care much about performance, memory fragmentation or virtual memory and maybe not even about memory protection:
Malloc/free is implemented completely on the kernel side (e.g: system calls). Whenever a process calls malloc/free, the kernel does all the work, and therefore knows about all the memory that needs to be freed. Once the process crashes or exits, the kernel can cleanup. Since the kernel is never supposed to crash, and keep a record of all the allocated memory per process, it's trivial.
Like I said, I'm self educated, and I didn't check how for example Linux or Windows implement it.

Are cuda kernel calls synchronous or asynchronous

I read that one can use kernel launches to synchronize different blocks i.e., If i want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way, I can achieve global synchronization between blocks. However, the cuda c programming guide mentions that kernel calls are asynchronous ie. the CPU does not wait for the first kernel call to finish and thus, the CPU can also call the second kernel before the 1st has finished. However, if this is true, then we cannot use kernel launches to synchronize blocks. Please let me know where i am going wrong

Kernel calls are asynchronous from the point of view of the CPU so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that the control returns to the CPU immediately.
On the GPU side, if you haven't specified different streams to execute the kernel they will be executed by the order they were called (if you don't specify a stream they both go to the default stream and are executed serially). Only after the first kernel is finished the second one will execute.
This behavior is valid for devices with compute capability 2.x which support concurrent kernel execution. On the other devices even though kernel calls are still asynchronous the kernel execution is always sequential.
Check the CUDA C programming guide on section 3.2.5 which every CUDA programmer should read.

The accepted answer is not always correct.
In most cases, kernel launch is asynchronous. But in the following case, it is synchronous. And they are easily ignored by people.
environment variable CUDA_LAUNCH_BLOCKING equals to 1.
using a profiler(nvprof), without enabling concurrent kernel profiling
memcpy that involve host memory which is not page-locked.
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
From the NVIDIA CUDA programming guide(http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device).

Concurrent kernel execution is supported since 2.0 CUDA capability version.
In addition, a return to the CPU code can be made earlier than all the warp kernel to have worked.
In this case, you can provide synchronization yourself.

Short question about CUDA memory access

hey there,
assuming I have a problem where each thread calculates something (reading some parameters out of the constant memory and using them for calculation) and than stores it to a global memory matrix. this matrix gets never read, just writing access... is there now any sense of using shared memory first to store all the calculated values in and than later write them to the global memory? I think no because the writes to global memory stay the same in complete, so the writes to shared memory just add to the writes which I had before already....
Thanks!

There can be, depending on the access patterns in the kernel code. Using a shared memory buffer to "stage" output can be a useful way of ensure writes are coalesced, when the naive write would not be coalesced. This was pretty crucial for performance in the first couple of generations of CUDA compatible hardware (G80/G90). In newer hardware, the case for this is a lot less strong. Fermi cards have a pretty effective L1 and L2 cache scheme which can (within reason) get close to what used to be only achievable using shared memory without any extra code.
There isn't really a general answer to this question, because it depends a lot of the specifics of what any given code does, and what target hardware it is expected to run well on.

memory not freed in matlab?

I am running a script that animates a plot (simulation of a water flow). After a while, I kill the loop by doing ctrl-c.
After doing this several times I get the error:
??? Error: Out of memory.
And after I start receiving that error, every call to my script will generate it.
Now, it happens before anything inside the function that I am calling is executed, i.e even if I add the line a=1 as the first line of the function I am calling, I still get the error and no printout, so the code inside the function doesn't even get executed.
What could be causing this?

There are several possible reasons.
Most likely your script creates some variables that are filling up the memory. Run
clear all
before restarting the script, so that all the variables are cleared, or change your script to a function (which will automatically erase all temporary variables after the function returns). Note that this also clears all loaded functions, so your next execution of the script has to load them again which will slow down the next execution by a (usually tiny) bit. It may be sufficient to call clear only.
Maybe you're animating by plotting several plots over one another (without clearing the axes first). Thus you might run out of Java heap space. You can close the open figures individually, or run
close all
You can also increase the amount of Java Memory Matlab uses on your system (see instructions here) - note that the limit is generally rather low, annoyingly so if you want to tons of figures.
Especially if you're running an older version of Windows, you may get your memory fragmented. Matlab needs contiguous blocks of free space to assign variables. To check for memory fragmentation, run
memory
and look at the number for the maximum possible variable size. If this is much smaller than the size available for all arrays, it's time to restart Matlab (I guess if you use a Windows version that would require a reboot to fix the problem, you may want to look into getting a new computer with Win7).

You can also try the pack command, eg:
close all;
clear all;
pack;
to clear memory. Although after a recent mathworks seminar I asked one of the mathworks guru's and he also conformed #Andrew Janke's comment regarding memory fragmentation. Usually quitting and restarting matlab sorts this out for me (on XP).

clear all close all are straight-forward ways to free memory, which are known by all non-beginners.
The main issue is that when you have done some data large data processing, and cleared/closed everything off - there is still significant memory used by matlab.
This is a currently major problem with matlab, and to my knowledge there is no solution rather than restarting matlab, which is a pity.

It sounds like you are not clearing any of your variables. You should either provide a way to stop the loop without hitting ctrl-c (write a simple GUI with a "Stop" button and your display) and then clean up your workspace in the script or clear your variables at the start of the script.
Are you intentionally storing all the data (or some large component) on each iteration of your loop?

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart