(How) can I run multiple compute kernels concurrently in Metal (I don't necessarily need them to run in parallel at the exact same time, concurrent execution is sufficient)?
Are MTLCommandBuffers in two different MTLCommandQueues (on the same device) executed concurrently or not, i.e. do the queues behave like CUDA streams?
Note: I cannot further split the kernels into smaller subkernels.
Related
I am using dask on LSF cluster to process some images in parallel. The processing function itself uses joblib to perform multiple computations on the image in parallel.
It seems that setting the n_workers and cores parameters to some numbers will generally produce n_workers * cores futures running at the same time. I would like to have n_workers futures being processed at a time, each of them having cores cores at its disposal for use with joblib.
How do I achieve such result?
To compare and contrast the performance of three different algorithms in a scientific experiment, I am planning to use the Celery scheduler. These algorithms are implemented by three different tools. They may or may not have parallelism implemented, which I don't want to make any prior assumptions about. The dataset contains 10K data points. All three tools are supposed to run on all the data points, which translates to 30K tasks scheduled by the scheduler. All I want is to allocate the same amount of resources to all the tools, across all the executions.
Assume, my physical Ubuntu 18.04 server is equipped with 24 cores and 96 GB of RAM. Tasks are scheduled by 4 Celery workers, each handling a single task. I want to put an upper limit of 4 CPU cores and 16 GB of memory per task. Moreover, no two tasks should race for the same cores, i.e., 4 tasks should be using 16 cores in total, each scheduled on its own set of cores.
Is there any means to accomplish this setup, either through Celery, or cgroup, or by any other mechanism? I want to refrain from using docker, kubernetes, or any VM based approach, unless it is absolutely required.
Dealing with CPU cores should be fairly easy by setting concurrency to 6. But limiting memory usage is the hard part of the requirement, and I believe you can accomplish that by making the worker processes owned by a particular cgroup that you have set a memory limit on.
An alternative would be to run Celery workers in containers with specified limits.
I prefer not to do this, as there may be tasks (or tasks with particular arguments) that allocate a tiny amount of RAM, so it would be wasteful if you can't use 4 GB of RAM while such a task runs.
It's a pity that Celery autoscaling is deprecated (it is one of the coolest features of Celery, IMHO). It should not be a difficult task to implement a Celery autoscaler that scales up/down depending on memory utilization.
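For illustration, here is a minimal sketch of one way to confine each Celery worker process, assuming a cgroup v2 hierarchy at /sys/fs/cgroup with groups celery0..celery3 created in advance (each with memory.max set to 16G). The group names, the WORKER_SLOT environment variable, and the core layout are all hypothetical, and it pins cores with os.sched_setaffinity rather than a cpuset cgroup:

import os
from celery import Celery
from celery.signals import worker_process_init

app = Celery("experiment", broker="redis://localhost:6379/0")  # placeholder broker
app.conf.worker_concurrency = 1  # each worker handles one task at a time

# Four disjoint sets of four cores each (hypothetical layout for a 24-core box).
CORE_SETS = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}, {12, 13, 14, 15}]

@worker_process_init.connect
def confine_worker(**kwargs):
    slot = int(os.environ.get("WORKER_SLOT", "0"))  # set per worker at launch
    os.sched_setaffinity(0, CORE_SETS[slot])        # pin to this worker's own cores
    # Join the pre-created cgroup so its memory.max cap applies to this process.
    with open(f"/sys/fs/cgroup/celery{slot}/cgroup.procs", "w") as f:
        f.write(str(os.getpid()))

Each of the four workers would then be launched with its own WORKER_SLOT value and with sufficient privileges to write to the cgroup files.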
I'm setting up a Dask script to be executed on PSC Bridges P100 GPU nodes. These nodes offer 2 GPUs and 32 CPU-cores. I would like to start CPU and GPU-based dask-workers.
The CPU workers will be started as:
dask-worker --nprocs 1 --nthreads 1
while the GPU workers as:
CUDA_VISIBLE_DEVICES=0 dask-worker --nprocs 1 --nthreads 1
My workflow consists of a set of CPU only tasks and a set of GPU tasks, that depend on the results from the CPU tasks. Is there a way to bind a GPU task only to the GPU workers?
Furthermore, I would like to make sure that the GPU tasks land on the same compute node as the CPU task they depend on. Can I somehow do that?
For your sort of problem it makes sense to run dask using the dask.distributed backend (a more sophisticated task scheduler), which provides a feature called "worker resources".
For every worker it lets you specify virtual worker resources with an associated count, such as "GPU=2". On the client side you then specify which resources, and how many of each, a task requires. See the docs here; a short sketch of this follows after the list below.
To make sure a GPU task lands on the same compute node as the task it depends on, you could:
set the resources accordingly, i.e. split the tasks up explicitly by using node-specific resource labels such as "GPU1" and "GPU2"
alternatively, bundle the CPU task and the subsequent GPU task into one task, either by manually defining an encompassing function or by using dask graph optimizations as documented here (I'm thinking of "fusing" tasks).
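As a rough illustration of the resource-based approach (the scheduler address, function bodies and resource labels below are placeholders, not taken from the question):

from dask.distributed import Client

def preprocess(x):   # placeholder for a CPU-only stage
    return x * 2

def gpu_stage(x):    # placeholder for the GPU-dependent stage
    return x + 1

# Hypothetical worker launch commands (one pair per node), shown as comments:
#   dask-worker scheduler:8786 --nthreads 16 --resources "CPU=16"
#   CUDA_VISIBLE_DEVICES=0 dask-worker scheduler:8786 --nthreads 1 --resources "GPU=1"

client = Client("scheduler:8786")  # placeholder scheduler address

cpu = client.submit(preprocess, 21, resources={"CPU": 1})
gpu = client.submit(gpu_stage, cpu, resources={"GPU": 1})  # only runs on workers advertising GPU
print(gpu.result())

Using node-specific labels (e.g. a resource that exists only on node 1) in both the worker --resources flag and the submit calls is what ties dependent tasks to the same node.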
I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster.
The terms I came across are: thread, process, processor, node, worker, scheduler.
My question is how to set the number of each, and if there is a strict or recommend relationship between any of these. For example:
1 worker per node with n processes for the n cores on the node
Are threads and processes the same concept? In dask-mpi I have to set nthreads, but they show up as processes in the client.
Any other suggestions?
By "node" people typically mean a physical or virtual machine. That node can run several programs or processes at once (much like how my computer can run a web browser and text editor at once). Each process can parallelize within itself with many threads. Processes have isolated memory environments, meaning that sharing data within a process is free, while sharing data between processes is expensive.
Typically things work best on larger nodes (like 36 cores) if you cut them up into a few processes, each of which have several threads. You want the number of processes times the number of threads to equal the number of cores. So for example you might do something like the following for a 36 core machine:
Four processes with nine threads each
Twelve processes with three threads each
One process with thirty-six threads
Typically one decides between these choices based on the workload. The difference here is due to Python's Global Interpreter Lock, which limits parallelism for some kinds of data. If you are working mostly with Numpy, Pandas, Scikit-Learn, or other numerical programming libraries in Python, then you don't need to worry about the GIL, and you probably want to prefer a few processes with many threads each. This helps because data can move freely between your cores, since it all lives in the same process. However, if you're doing mostly pure Python programming, like dealing with text data, dictionaries/lists/sets, and doing most of your computation in tight Python for loops, then you'll want to prefer many processes with few threads each. This incurs extra communication costs, but lets you bypass the GIL.
In short, if you're using mostly numpy/pandas-style data, try to get at least eight threads or so in a process. Otherwise, maybe go for only two threads in a process.
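As a concrete sketch of the 36-core layouts above, using dask.distributed's LocalCluster on a single node (on a real cluster the same numbers would go to the dask-worker process/thread options):

from dask.distributed import Client, LocalCluster

# Numeric-heavy work (numpy/pandas): few processes, many threads each.
cluster = LocalCluster(n_workers=4, threads_per_worker=9)

# Mostly pure-Python work would instead prefer something like:
# cluster = LocalCluster(n_workers=12, threads_per_worker=3)

client = Client(cluster)
print(client)  # reports the resulting workers, threads and memory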
I wanted to start Erlang while varying the number of cores in order to test the scalability of my program. I expect that running the program on more cores should be faster than running it on fewer cores.
How can I specify the core limits?
In fact, I have tried -smp disable (and I supposed that it would then run on 1 core, wouldn't it?), but the execution time is still the same as with more cores.
I also tried +S 1:1 (assuming 1 scheduler so as to run on 1 core, as well as other scheduler numbers), but it seems nothing has changed.
Was that because of a characteristic of my program, or did I do something wrong in specifying the core limits?
And, if possible, could someone give some tips on how to scale Erlang programs.
Thank you very much.