pthreads and OpenCL

Can I create pthreads, and inside each pthread create an OpenCL environment and call the same kernel? What I am trying to do is launch OpenCL kernels in parallel on the same device. Is this possible?
Thanks for answering.

At first sight this seems unnecessary.
When you launch an OpenCL kernel with the clEnqueueNDRangeKernel() API call, you can launch as many work-items as you need, each running as its own lightweight thread on the same device. The OpenCL model is that a single context/command queue can launch hundreds to thousands of lightweight kernel threads on a GPU.
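A minimal pyopencl sketch of that single-queue model (the kernel, its name scale, and the buffer size are illustrative, not from the thread): one clEnqueueNDRangeKernel-style launch creates 2^20 work-items, and the device schedules them itself.

import numpy as np
import pyopencl as cl

src = """
__kernel void scale(__global float *a) {
    int i = get_global_id(0);
    a[i] = a[i] * 2.0f;
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, src).build()

host = np.arange(1 << 20, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=host)

prog.scale(queue, host.shape, None, buf)   # one enqueue, 2^20 work-items
cl.enqueue_copy(queue, host, buf)          # read the result back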

Yes, as Tim pointed out, when OpenCL already supports so many threads/kernels, why would you want to wrap OpenCL in pthreads? Furthermore, threads on the GPU are very lightweight compared to pthreads. Pthreads are costly and involve a lot of context-switching overhead, which might actually bring your performance down significantly.
But launching many kernels on the same command queue will execute the kernels sequentially. There should be a different command queue for each kernel. I believe a single context should not be a problem for launching the kernels in parallel...
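A hedged sketch of that one-context/two-queue idea in pyopencl (the kernel is illustrative, and whether the two launches actually overlap depends on the device and driver):

import numpy as np
import pyopencl as cl

src = "__kernel void scale(__global float *a) { int i = get_global_id(0); a[i] *= 2.0f; }"

ctx = cl.create_some_context()
q1 = cl.CommandQueue(ctx)   # independent command stream 1
q2 = cl.CommandQueue(ctx)   # independent command stream 2
prog = cl.Program(ctx, src).build()

mf = cl.mem_flags
a = np.ones(1 << 20, dtype=np.float32)
b = np.ones(1 << 20, dtype=np.float32)
buf_a = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)
buf_b = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=b)

prog.scale(q1, a.shape, None, buf_a)   # the runtime may overlap these two
prog.scale(q2, b.shape, None, buf_b)   # launches on hardware that supports it
q1.finish()
q2.finish()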

Related

What is the GPU scheduling method used in DirectX?

I am curious what GPU scheduling method is used in DirectX.
It's hard for me to get a full picture of DirectX's GPU scheduling method just by reading this MSDN section:
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/video-memory-management-and-gpu-scheduling
This paper classifies a number of GPU scheduling methods. Can DirectX's method be identified using its table?
https://pure.qub.ac.uk/portal/files/130042880/acm_surveys_GPU_2017.pdf
As far as I know, neither DirectX nor the Windows kernel does GPU scheduling. Windows just hands tasks submitted by running software to the GPU and closely monitors how long they take to complete. If it's more than 2 seconds, Windows detects a GPU timeout. See this article for what happens next.
IMO the approach works well enough for desktops. The vast majority of users run only a single GPU-demanding application at any given time, so they don't care that the "scheduler" is a simple FIFO queue.
However, if you're offering cloud-based GPGPU like the article you've linked, or a cloud-based gaming platform like GameFly (now part of EA), you likely want something better. NVIDIA definitely has a true GPU scheduler in its GRID vGPU drivers and/or hardware. I have no idea how it functions; I've never dealt with that technology.

Is there any way to know whether a dask worker is running on a CPU device or a GPU device?

Suppose a dask cluster has some CPU devices as well as some GPU devices. Each device runs a single dask worker. Now, the question is: how do I find out whether the underlying device of a dask worker is a CPU or a GPU?
For example, if the dask worker is running on a CPU device, I should know that it's running on a CPU, and if it is running on a GPU device, I should know the device type. Is there any method to determine this programmatically?
The linked answer in the comment above is about marking different workers beforehand by resource, and then assigning tasks depending on what resources they may need.
Perhaps you instead want to run your computation in a heterogeneous way, i.e., you don't mind which task runs on a GPU machine and which does not, but when a GPU is available, you want to make use of it. This case is actually very simple from dask's point of view.
Suppose you have a function that detects whether a GPU is present, and two functions you can run to process your data, depending on the case:
def process_data(d):
    # this_machine_has_gpu, gpu_process and cpu_process are
    # placeholders for your own detection and processing functions
    if this_machine_has_gpu():
        return gpu_process(d)
    else:
        return cpu_process(d)
A function with this structure can be used as a dask task like any other, whether with the delayed mechanism or with arrays/dataframes.
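For concreteness, a minimal usage sketch with dask.delayed; the scheduler address, the nvidia-smi-based GPU check, and the toy processing functions below are assumptions for illustration, not from the answer.

import shutil
import dask
from dask.distributed import Client

def this_machine_has_gpu():
    # crude stand-in: treat the presence of nvidia-smi as "has a GPU"
    return shutil.which("nvidia-smi") is not None

def gpu_process(d):
    return [x * 2 for x in d]   # placeholder for the GPU code path

def cpu_process(d):
    return [x * 2 for x in d]   # placeholder for the CPU code path

def process_data(d):
    return gpu_process(d) if this_machine_has_gpu() else cpu_process(d)

client = Client("tcp://scheduler:8786")   # hypothetical scheduler address

tasks = [dask.delayed(process_data)(list(range(10))) for _ in range(4)]
results = dask.compute(*tasks)   # each task takes the CPU or GPU path on whichever worker runs it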

ImageMagick and OpenMP: use all CPUs of a cluster?

I know I can use ImageMagick with OpenMP to use all the cores of my CPU, but the question is: can I use all the cores of my 10-computer cluster, or will ImageMagick only use the cores of my local CPU?
Thanks a lot.
OpenMP is based on the fork-join paradigm, which is bound to shared-memory systems by its very nature. Thus, you won't be able to use all the cores of your cluster, since they don't share their memory. OpenMP-enabled programs are limited to cores that share memory (the cores of the local machine).
There is a way around this, though it may not be worth it. You can simulate a virtual NUMA architecture over your 10-computer cluster and run ImageMagick on this virtual machine. ImageMagick will then believe it is running on a single system containing all your cores. That's what ScaleMP offers with its vSMP software. But the performance gains are strongly tied to the program's memory-access pattern, as accesses may go over the network in such a VM, which is orders of magnitude slower than cache or RAM access. You may take a significant performance hit depending on how memory is accessed in ImageMagick.
To directly do what you ask, instead of using OpenMP outside its use cases, you could use a Message Passing Interface (MPI) framework such as Open MPI to parallelize the program over multiple networked computers. That would require rewriting the parallel portion of ImageMagick to use MPI instead, which may be a pretty daunting task depending on ImageMagick's code base. If your workload is many independent images rather than one huge image, a much simpler MPI pattern is sketched below.
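A minimal mpi4py sketch of that simpler per-image pattern (the file paths, the convert invocation, and the shared-filesystem assumption are all illustrative, not from the answer); it would be launched with something like mpirun -np 80 python resize.py across the cluster:

import glob
import subprocess
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

images = sorted(glob.glob("input/*.png"))   # assumes a filesystem shared by all nodes

for path in images[rank::size]:             # round-robin: each rank takes every size-th image
    out = "out/" + path.split("/")[-1]
    subprocess.run(["convert", path, "-resize", "50%", out], check=True)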

MPI/Pthread program does not scale

I have an MPI/pthread program in which each MPI process runs on a separate computing node. Within each MPI process, a certain number of pthreads (1-8) is launched. However, no matter how many pthreads are launched within an MPI process, the overall performance is pretty much the same. I suspect all the pthreads are running on the same CPU core. How can I assign threads to different CPU cores?
Each computing node has 8 cores (two quad-core Nehalem processors).
Open MPI 1.4
Linux x86_64
Questions like this often depend on the problem at hand. Most likely, you are running into a resource-lock issue (the threads are competing for a lock); this would look like only one core doing any work, because only one thread can (effectively) make progress at any given time.
Setting CPU affinity for a particular thread is not a good solution. You should let the OS scheduler determine the optimal physical-core assignment for a given pthread.
Look at your code and try to figure out where you are locking when you shouldn't be, and whether you've come up with a correct parallel solution to the problem at hand. You should also test a version of the program using only pthreads (not MPI) and see whether it scales.
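For reference only, per-thread pinning on Linux looks like the sketch below, here in Python using os.sched_setaffinity on kernel thread IDs (in C the equivalent call is pthread_setaffinity_np). As the answer above notes, profile for lock contention before reaching for this.

import os
import threading

def worker(core_id):
    # pin the calling thread (identified by its kernel TID) to one core
    os.sched_setaffinity(threading.get_native_id(), {core_id})
    # ... do this thread's share of the work here ...

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()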

GPU memory latency hiding?

How does a kernel's working set, such as the number of registers it uses, affect the GPU's ability to hide memory latencies?
By spreading the lookup latency across a group of parallel threads (a warp). The GPU hides a stalled memory access by switching to other resident warps; the more registers each thread consumes, the fewer threads (and hence warps) can be resident at once, leaving less work available to cover the latency. Refer to the CUDA Programming Guide in the CUDA SDK for details.
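A back-of-the-envelope illustration of the register effect (the register-file size and per-thread counts are hypothetical numbers for illustration, not from the answer):

REGISTERS_PER_SM = 65536   # assumed register-file size per multiprocessor

for regs_per_thread in (32, 64, 128):
    resident_threads = REGISTERS_PER_SM // regs_per_thread
    warps = resident_threads // 32
    print(f"{regs_per_thread} regs/thread -> up to {resident_threads} resident threads ({warps} warps to switch among)")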
