DirectCompute optimal numthreads setup - directx

I've recently been playing with compute shaders and I'm trying to determine the best way to set up my [numthreads(x,y,z)] and Dispatch calls. My demo window is 800x600 and I am launching 1 thread per pixel. I am performing 2D texture modifications - nothing too heavy.
My first try was to specify
[numthreads(32,32,1)]
My Dispatch() calls are always
Dispatch(ceil(screenWidth/numThreads.x),ceil(screenHeight/numThreads.y),1)
So for the first instance that would be
Dispatch(25,19,1)
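On the host that works out to an integer ceiling division. A rough C++ sketch, assuming D3D11, with context standing in for my ID3D11DeviceContext (not shown here):

    // Round up so partial tiles at the right/bottom edges still get a thread group.
    // groupSizeX/groupSizeY must match the [numthreads(x, y, 1)] declared in the shader.
    const UINT groupSizeX = 32;
    const UINT groupSizeY = 32;
    const UINT groupsX = (screenWidth  + groupSizeX - 1) / groupSizeX; // ceil(800 / 32) = 25
    const UINT groupsY = (screenHeight + groupSizeY - 1) / groupSizeY; // ceil(600 / 32) = 19
    context->Dispatch(groupsX, groupsY, 1);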
This ran at 25-26 fps. I then reduced that to [numthreads(4,4,1)], which ran at 16 fps. Increasing it to [numthreads(16,16,1)] started yielding nicer results of about 30 fps.
Changing the Y dimension, [numthreads(16,8,1)] managed to push it to 32 fps.
My question is: is there a way to determine the thread-group size that utilizes the GPU most effectively, or is it just good ol' trial and error?

It's pretty GPU-specific but if you are on NVIDIA hardware you can try using the CUDA Occupancy Calculator.
I know you are using DirectCompute, but they map to the same underlying hardware. If you look at the output of FXC you can see the shared memory size and registers per thread in the assembly. Also you can deduce the compute capability from which card you have. Compute capability is the CUDA equivalent of profiles like cs_4_0, cs_4_1, cs_5_0, etc.
The goal is to increase the "occupancy", or in other words occupancy == 100% - %idle-due-to-HW-overhead

Profiling is the only way to guarantee maximum performance on a particular piece of hardware. But as a general rule, as long as you keep your live register count low (16 or lower) and don't use a ton of shared memory, thread groups of exactly 256 threads should be able to saturate most compute hardware (assuming you're dispatching at least 8 or so groups).
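As a quick sanity check against the numbers in the question: [numthreads(16,16,1)] is exactly 256 threads per group, and Dispatch(ceil(800/16), ceil(600/16), 1) = Dispatch(50, 38, 1) launches 1900 groups, comfortably above that minimum.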

Related

Can hardware threads access main memory at the same time?

I am trying to understand microarchitecture.
When an operating system schedules code to run on a CPU hardware thread (as in Intel HyperThreading), can each execution context issue memory reads in parallel or is the pipeline shared?
I am trying to do some rough calculations and complexity analysis, and I want to know whether memory bandwidth is shared, that is, whether I should divide my estimate by the number of physical cores (if the pipeline is shared) or by the number of hardware threads (if memory access is parallel).
Yes, the pipeline is shared, so it's possible for each of the two load execution units in a physical core to be running a uop from a different logical core, accessing L1d in parallel. (e.g. https://www.realworldtech.com/haswell-cpu/5/ / https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram)
Off-core (L2 miss) bandwidth doesn't scale with the number of logical cores, and one thread per core can fairly easily saturate it, especially with SIMD, if your code has high throughput (not bottlenecking on latency or branch misses) and low computational intensity (little ALU work per load of data into registers, or into L1d or L2 cache, whichever you're cache-blocking for), e.g. like a dot product.
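As a toy illustration, a plain dot product does only one multiply-add per two loads, so a single core running SIMD code is typically limited by how fast memory can feed it rather than by the ALUs. A minimal sketch:

    #include <cstddef>

    // One multiply-add per two 8-byte loads: very little ALU work per byte of data,
    // so throughput is bounded by memory bandwidth long before the ALUs are saturated.
    double dot(const double* a, const double* b, std::size_t n) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }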
Well-tuned high-throughput (instructions per cycle) code like linear algebra stuff (especially matmul) often doesn't benefit from more than 1 thread per physical core, instead suffering more cache misses when two threads are competing for the same L1d / L2 cache.
Cache-blocking aka loop tiling can help a lot, if you can loop again over a smaller chunk of data while it's still hot in cache. See How much of ‘What Every Programmer Should Know About Memory’ is still valid? (most of it).
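The classic example of loop tiling is a blocked matrix multiply. A minimal sketch; the tile size T is a placeholder that would need tuning against the real L1d/L2 sizes:

    #include <algorithm>
    #include <cstddef>

    // C += A * B for n x n row-major matrices, processed in T x T tiles so each tile of
    // B is reused many times while it is still hot in cache instead of being re-streamed
    // from memory on every pass over a row of C.
    void matmulBlocked(float* C, const float* A, const float* B, std::size_t n,
                       std::size_t T = 64) {
        for (std::size_t i0 = 0; i0 < n; i0 += T)
            for (std::size_t k0 = 0; k0 < n; k0 += T)
                for (std::size_t j0 = 0; j0 < n; j0 += T)
                    for (std::size_t i = i0; i < std::min(i0 + T, n); ++i)
                        for (std::size_t k = k0; k < std::min(k0 + T, n); ++k)
                            for (std::size_t j = j0; j < std::min(j0 + T, n); ++j)
                                C[i * n + j] += A[i * n + k] * B[k * n + j];
    }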

What do the charts in the System Panels signify in Wandb (PyTorch)

I recently started using the wandb module with my PyTorch script to ensure that the GPUs are operating efficiently. However, I am unsure as to what exactly the charts indicate.
I have been following the tutorial at this link, https://lambdalabs.com/blog/weights-and-bias-gpu-cpu-utilization/ , and was confused by one of its plots.
I am uncertain about the GPU % and the GPU Memory Access % charts. The descriptions in the blog are as follows:
GPU %: This graph is probably the most important one. It tracks the percent of the time over the past sample period during which one or more kernels was executing on the GPU. Basically, you want this to be close to 100%, which means the GPU is busy all the time doing data crunching. The above diagram has two curves. This is because there are two GPUs and only one of them (blue) is used for the experiment. The blue GPU is about 90% busy, which means it is not too bad but still has some room for improvement. The reason for this suboptimal utilization is the small batch size (4) we used in this experiment. The GPU fetches a small amount of data from its memory very often, and cannot saturate the memory bus nor the CUDA cores. Later we will see it is possible to bump up this number by merely increasing the batch size.
GPU Memory Access %: This is an interesting one. It measures the percent of the time over the past sample period during which GPU memory was being read or written. We should keep this percent low because you want the GPU to spend most of its time on computing instead of fetching data from its memory. In the above figure, the busy GPU spends around 85% of the time accessing memory. This is very high and causes some performance problems. One way to lower the percentage here is to increase the batch size, so data fetching becomes more efficient.
I had the following questions:
The aforementioned values do not sum to 100%. It seems as though our GPU can either be spending time on computation or spending time on reading/writing memory. How can the sum of these two values be greater than 100%?
Why does increasing batch size decrease the time spent accessing GPU Memory?
GPU utilization and GPU memory access would only add up to 100% if the hardware did the two things sequentially. But modern hardware doesn't work like that: the GPU can be computing at the same time as it is accessing memory.
GPU % is actually GPU utilization %. We want this to be 100%, so the GPU is doing the desired computation all of the time.
GPU memory access % is the amount of time the GPU is either reading from or writing to its memory. We want this number to be low: if it is high, there can be some delay before the GPU can use the data it needs to compute on. That doesn't mean that it's a sequential process.
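To make that concrete with made-up numbers: if, within a 1-second sample window, at least one kernel is executing for 0.90 s and memory is being read or written for 0.85 s, the two intervals overlap almost entirely, so the panels report 90% and 85% and their sum (175%) can exceed 100% without contradiction.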
W&B allows you to monitor both metrics and take decisions based on them. Recently I implemented a data pipeline using tf.data.Dataset. The GPU utilization was close to 0%, and memory access was close to 0% as well. I was reading three different image files and stacking them, so the CPU was the bottleneck. To counter this, I created a dataset with the images already stacked. The ETA went from 1 h per epoch to 3 min.
From the plot, you can infer that the GPU's memory access increased while GPU utilization is close to 100%. CPU utilization, which was the bottleneck, decreased.
Here's a nice article by Lukas answering this question.

cudaMallocPitch fails when multiple GPUs are controlled by separate CPU processes, despite enough memory being available

I'm getting an 'out of memory' error when using the cudaMallocPitch API with GeForce GTX 1080 Ti and/or GeForce GTX 1080 GPUs, which are part of a PC server that includes 4 GPUs (one 1080 Ti and three 1080s) and two CPUs.
Each GPU is controlled by a dedicated CPU thread which calls cudaSetDevice with the right device index at the beginning of its run.
Based on information in a configuration file, the application knows how many CPU threads to create.
I can also run my application several times as separate processes, each of which controls a different GPU.
I'm using OpenCV version 3.2 to perform image background subtraction.
First you create the BackgroundSubtractorMOG2 object with cv::cuda::createBackgroundSubtractorMOG2, and after that you call its apply method.
The first time the apply method is called, all required memory is allocated once.
My image size is 10000 columns by 7096 rows. Each pixel is 1 byte (grayscale).
When I run my application as one process with several threads (one per GPU), everything works fine, but when I run it 4 times as separate processes (one per GPU), the OpenCV apply function starts to fail in cudaMallocPitch with a 'not enough memory' error.
For all GPUs I verified that I had enough available memory before apply was called for the first time: the 1080s report ~5.5 GB free and the 1080 Ti ~8.3 GB free, while the requested allocation is width = 120000 bytes, height = 21288 rows, i.e. ~2.4 GB.
Please advise.
The source of the problem was found:
The cudaMallocPitch API returned cudaErrorMemoryAllocation because there was no OS virtual memory available; the OS uses this virtual memory when the process performs read/write accesses to GPU physical memory.
Because of that, the CUDA driver fails any kind of GPU physical memory allocation.
The tricky part was figuring out why this API failed while enough GPU physical memory existed (as checked with the cudaMemGetInfo API).
I started to analyze two points:
Why don't I have enough virtual memory on my PC?
By following the instructions at the link below, I increased its size and the problem disappeared:
https://www.online-tech-tips.com/computer-tips/simple-ways-to-increase-your-computers-performace-configuring-the-paging-file/
Why does my process consume a lot of OS virtual memory?
In the past I had figured out that, to get better performance during processing, I should allocate all required GPU physical memory only once at the beginning, because an allocation operation takes a lot of time depending on the required size.
Because I'm working with a frame of ~70 MB and my processing logic requires a huge number of auxiliary buffers, massive GPU and CPU memory areas had to be allocated, which emptied the available OS virtual memory.
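For reference, a minimal sketch of the kind of check described above, using the CUDA runtime API (the device index is a placeholder, the allocation size is taken from the question, and error handling is trimmed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaSetDevice(0);  // each process/thread selects its own device first

        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);  // reports free GPU *physical* memory only
        std::printf("free: %zu MB of %zu MB\n", freeBytes >> 20, totalBytes >> 20);

        // The pitched allocation can still return cudaErrorMemoryAllocation even when
        // freeBytes looks sufficient, e.g. when the OS has run out of the virtual memory
        // it needs for mapping GPU accesses (the situation described above).
        void* devPtr = nullptr;
        size_t pitch = 0;
        cudaError_t err = cudaMallocPitch(&devPtr, &pitch, 120000 /*width bytes*/, 21288 /*rows*/);
        if (err != cudaSuccess)
            std::printf("cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
        else
            cudaFree(devPtr);
        return 0;
    }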

In OpenCL, how can __local memory be faster when work-group sizes aren't part of the architecture?

Apologies for my naiveté if this question is silly, I'm new to GPGPU programming.
My question is, since the architecture of the device can't change, how is it that __local memory can be optimized for access by items only in the local work-group, when it's the user that chooses the work-group size (subject to divisibility)?
Local memory is usually attached to a certain cluster of execution units in GPU hardware. Work group size is indeed chosen by the client application, but the OpenCL implementation will impose a limit. Your application needs to query this via clGetKernelWorkGroupInfo() using the CL_KERNEL_WORK_GROUP_SIZE parameter name.
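A minimal sketch of that query, written against the OpenCL host API in C++; kernel and device here stand for an already-built cl_kernel and the cl_device_id it will run on:

    #include <CL/cl.h>

    // Returns the largest work-group size this kernel may use on this device; the
    // device's local memory size is the other limit your work-group has to respect.
    size_t maxWorkGroupSize(cl_kernel kernel, cl_device_id device) {
        size_t wgSize = 0;
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(wgSize), &wgSize, nullptr);

        cl_ulong localMemBytes = 0;
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(localMemBytes), &localMemBytes, nullptr);

        return wgSize;
    }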
There's some flexibility in work group size because most GPUs are designed so multiple threads of execution can be scheduled to run on a single execution unit. (A form of SMT.) Note also that the scheduled threads don't even need to be in the same work group, so if for example a GPU has 64 processors in a cluster, and supports 4-way SMT on each processor, those 256 threads could come from 1, 2, or 4 work groups, possibly even 8 or 16, depending on hardware and compiler capabilities.
Some GPUs' processors also use vector registers and instructions internally, so threads don't map 1:1 to OpenCL work items - one processor might handle 4 work items at once, for example.
Ultimately though, a work-group must fit onto the cluster of processors that is attached to one chunk of local memory; so you've got local memory size and maximum number of threads that can be scheduled on one cluster influencing the maximum work group size.
In general, try to minimise the amount of local memory your work group uses so that the OpenCL implementation has the maximum flexibility for scheduling work groups. (But definitely do use local memory when it helps performance! Just use as little of it as possible.)

NSThread, NSOperation or GCD for CoreMotion and accurate timing purposes?

I'm looking to do some high-precision Core Motion reading (>=100 Hz if possible) and motion analysis on the iPhone 4+, which will run continuously for the duration of the main part of the app. It's imperative that the motion response and the signals that the analysis code sends out are as lag-free as possible.
My original plan was to launch a dedicated NSThread based on the code in the metronome project as referenced here: Accurate timing in iOS, along with a protocol for motion analysers to link in and use the thread. I'm wondering whether GCD or NSOperation queues might be better?
My impression after copious reading is that they are designed to handle a quantity of discrete, one-off operations rather than a small number of operations performed over and over again on a regular interval and that using them every millisecond or so might inadvertently create a lot of thread creation/destruction overhead. Does anyone have any experience here?
I'm also wondering about the performance implications of an endless while loop in a thread (such as in the code in the above link). Does anyone know more about how things work under the hood with threads? I know that iPhone4 (and under) are single core processors and use some sort of intelligent multitasking (pre-emptive?) which switches threads based on various timing and I/O demands to create the effect of parallelism...
If you have a thread that has a simple "while" loop running endlessly but only doing any additional work every millisecond or so, does the processor's switching algorithm consider the endless loop a "high demand" on resources thus hogging them from other threads or will it be smart enough to allocate resources more heavily towards other threads in the "downtime" between additional code execution?
Thanks in advance for the help and expertise...
IMO the bottleneck is rather the sensors. The actual update frequency is most often not equal to what you have specified. See "update frequency set for deviceMotionUpdateInterval it's the actual frequency?" and "Actual frequency of device motion updates lower than expected, but scales up with setting".
Some time ago I made a couple of measurements using Core Motion and the raw sensor data as well. I needed a high update rate too, because I was doing Simpson integration and thus wanted to minimise errors. It turned out that the real frequency is always lower and that there is a limit at about 80 Hz. That was an iPhone 4 running iOS 4. But as long as you don't need this for scientific purposes, in most cases 60-70 Hz should fit your needs anyway.
