CUDA: Best number of pixels computed per thread (grayscale) - image-processing

I'm working on a program to convert an image to grayscale, using the CImg library. For each pixel I have to read the three R-G-B values, compute the corresponding gray value, and store the gray pixel in the output image. I'm working with an NVIDIA GTX 480. Some details about the card:
Microarchitecture: Fermi
Compute capability (version): 2.0
Cores per SM (warp size): 32
Streaming Multiprocessors: 15
Maximum number of resident warps per multiprocessor: 48
Maximum amount of shared memory per multiprocessor: 48KB
Maximum number of resident threads per multiprocessor: 1536
Number of 32-bit registers per multiprocessor: 32K
I'm using a square grid with blocks of 256 threads.
This program can take input images of different sizes (e.g. 512x512 px, 10000x10000 px). I observed that increasing the number of pixels assigned to each thread improves performance, so it's better than computing one pixel per thread. The problem is: how can I determine the number of pixels to assign to each thread statically? By running tests with every possible number? I know that on the GTX 480, 1536 is the maximum number of resident threads per multiprocessor. Do I have to take this number into account? The following is the code executed by the kernel.
// Grid-stride loop: each thread processes multiple pixels; the stride is the total number of threads in the grid.
for (unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x; i < width * height; i += gridDim.x * blockDim.x) {
    // Planar input layout: the R, G and B planes each hold width*height values.
    float r = static_cast< float >(inputImage[i]);
    float g = static_cast< float >(inputImage[(width * height) + i]);
    float b = static_cast< float >(inputImage[(2 * width * height) + i]);
    float grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
    grayPix = (grayPix * 0.6f) + 0.5f;   // darken; the +0.5f rounds on the cast below
    darkGrayImage[i] = static_cast< unsigned char >(grayPix);
}

The problem is: how can I determine the number of pixels to assign to each thread statically? By running tests with every possible number?
Beyond the code itself, you've mentioned an observed characteristic:
I observed that increasing the number of pixels assigned to each thread improves performance
This is actually a fairly common observation for these types of workloads, and it may also be the case that this is more evident on Fermi than on newer architectures. A similar observation occurs during matrix transpose. If you write a "naive" matrix transpose that transposes one element per thread, and compare it with the matrix transpose discussed here that transposes multiple elements per thread, you will discover, especially on Fermi, that the multiple element per thread transpose can achieve approximately the available memory bandwidth on the device, whereas the one-element-per-thread transpose cannot. This ultimately has to do with the ability of the machine to hide latency, and the ability of your code to expose enough work to allow the machine to hide latency. Understanding the underlying behavior is somewhat involved, but fortunately, the optimization objective is fairly simple.
GPUs hide latency by having lots of available work to switch to, when they are waiting on previously issued operations to complete. So if I have a lot of memory traffic, the individual requests to memory have a long latency associated with them. If I have other work that the machine can do while it is waiting for the memory traffic to return data (even if that work generates more memory traffic), then the machine can use that work to keep itself busy and hide latency.
The way to give the machine lots of work starts by making sure that we have enabled the maximum number of warps that can fit within the machine's instantaneous capacity. This number is fairly simple to compute: it is the product of the number of SMs on your GPU and the maximum number of warps that can be resident on each SM. We want to launch a kernel that meets or exceeds this number, but additional warps/blocks beyond this number don't necessarily help us hide latency.
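For example, this number can be queried at runtime with something like the following (a rough sketch assuming device 0; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Maximum number of warps that can be resident on the whole GPU at once:
    // (max resident threads per SM / warp size) * number of SMs.
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    int instantaneous = maxWarpsPerSM * prop.multiProcessorCount;

    // For a GTX 480: (1536 / 32) * 15 = 720 warps, i.e. 23040 threads.
    printf("max resident warps on device: %d (%d threads)\n",
           instantaneous, instantaneous * prop.warpSize);
    return 0;
}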
Once we have met the above number, we want to pack as much "work" as possible into each thread. Effectively, for the problem you describe and the matrix transpose case, packing as much work into each thread means handling multiple elements per thread.
So the steps are fairly simple:
Launch as many warps as the machine can handle instantaneously
Put all remaining work in the thread code, if possible.
Let's take a simplistic example. Suppose my GPU has 2 SMs, each of which can handle 4 warps (128 threads). Note that this is not the number of cores, but the "Maximum number of resident warps per multiprocessor" as indicated by the deviceQuery output.
My objective then is to create a grid of 8 warps, i.e. 256 threads total (in at least 2 threadblocks, so they can distribute to each of the 2 SMs) and make those warps perform the entire problem by handling multiple elements per thread. So if my overall problem space is a total of 1024x1024 elements, I would ideally want to handle 1024*1024/256 = 4096 elements per thread.
Note that this method gives us an optimization direction. We do not necessarily have to achieve this objective completely in order to saturate the machine. It might be the case that it is only necessary, for example, to handle 8 elements per thread, in order to allow the machine to fully hide latency, and usually another limiting factor will appear, as discussed below.
Following this method will tend to remove latency as a limiting factor for performance of your kernel. Using the profiler, you can assess the extent to which latency is a limiting factor in a number of ways, but a fairly simple one is to capture the sm_efficiency metric, and perhaps compare that metric in the two cases you have outlined (one element per thread, multiple elements per thread). I suspect you will find, for your code, that the sm_efficiency metric indicates a higher efficiency in the multiple elements per thread case, and this is indicating that latency is less of a limiting factor in that case.
Once you remove latency as a limiting factor, you will tend to run into one of the other two machine limiting factors for performance: compute throughput and memory throughput (bandwidth). In the matrix transpose case, once we have sufficiently dealt with the latency issue, then the kernel tends to run at a speed limited by memory bandwidth.
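Concretely, for a kernel like the grayscale one above, the launch configuration could be derived roughly like this (a sketch only; grayscaleKernel stands in for whatever the kernel is actually named, and the 256-thread block size matches the question):

// Sketch: launch just enough blocks to fill the GPU's instantaneous capacity;
// the grid-stride loop inside the kernel handles all remaining pixels.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int blockSize   = 256;
int maxResident = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
int numBlocks   = (maxResident + blockSize - 1) / blockSize;   // 23040/256 = 90 on a GTX 480

// Each thread then processes roughly (width*height)/(numBlocks*blockSize) pixels.
grayscaleKernel<<<numBlocks, blockSize>>>(inputImage, darkGrayImage, width, height);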

Related

What is maximum (ideal) memory bandwidth of an OpenCL device?

My OpenCL device memory-relevant specs are:
Max compute units 20
Global memory channels (AMD) 8
Global memory banks per channel (AMD) 4
Global memory bank width (AMD) 256 bytes
Global Memory cache line size 64 bytes
Does it mean that, to use my device at its full memory potential, it needs to have 8 work items on different CUs constantly reading memory chunks of 64 bytes? Are memory channels arranged so that different CUs can access memory simultaneously? Are memory reads of 64 bytes always treated as single reads, or only if the address is % 64 == 0?
Does memory bank quantity/width have anything to do with memory bandwidth, and is there a way to reason about memory performance with respect to memory banks when writing a kernel?
Memory bank quantity is useful to hint about strided access pattern performance and bank conflicts.
The cache line width must be the L2 cache line between L2 and the CU (L1). 64 bytes per cycle means 64 GB/s per compute unit (assuming only one active cache line per CU at a time and a 1 GHz clock; there can be multiple, e.g. 4 of them per L1). With 20 compute units, the total "L2 to L1" bandwidth must be 1.28 TB/s, but its main advantage over global memory must be the lower number of clock cycles needed to fetch data.
If you need to utilize global memory, then you need to approach bandwidth limits between L2 and main memory. That is related to memory channel width, number of memory channels and frequency.
GDDR channel width is 64 bits, HBM channel width is 128 bits. A single stack of HBM v1 has 8 channels, so that's a total of 1024 bits or 128 bytes. 128 bytes per cycle means 128 GB/s per GHz. More stacks mean more bandwidth; if 8 GB of memory is made of two stacks, then it's 256 GB/s.
If your data-set fits inside L2 cache, then you expect more bandwidth under repeated access.
But the true performance (as opposed to the on-paper numbers) can be measured with a simple benchmark that does a pipelined memory copy between two arrays.
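For example, such a copy benchmark could look roughly like this (sketched here in CUDA terms for brevity; an OpenCL version would use clEnqueueCopyBuffer and event profiling in the same way):

#include <cstdio>
#include <cuda_runtime.h>

// Minimal device-to-device copy benchmark: effective bandwidth counts
// the bytes read plus the bytes written.
int main() {
    const size_t n = 32 << 20;                 // 32M floats = 128 MB per array
    float *src, *dst;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemset(src, 1, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = 2.0 * n * sizeof(float) / (ms * 1e6);   // read + write
    printf("effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}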
Total performance of 8 work items depends on the capability of the compute unit. If it allows only 32 bytes per clock per work item, then you may need more work items. The compute unit should have some optimization stage, like packing similar addresses into one big memory access per CU, so you can even achieve maximum performance using only a single work group (but with multiple work items, not just 1; the number depends on how big an object each work item accesses and on the CU's capability). You can benchmark this with an array-summation or reduction kernel. Just one compute unit is generally more than enough to utilize global memory bandwidth, unless its single L2-L1 bandwidth is lower than the global memory bandwidth, which may not hold on the highest-end cards.
What is the parallelism between L2 and L1 for your card? Only 1 active line at a time? Then you probably require 8 work items distributed over 8 work groups.
According to AMD's RDNA datasheet, each shader is capable of 10-20 requests in flight, so if the L1-L2 communication of one RDNA compute unit is enough to use all the bandwidth of global memory, then even just a few work items from a single work group should be enough.
L1-L2 bandwidth:
It says there are 4 lines active between each L1 and the L2, so that must be 256 GB/s per compute unit. 4 work groups running on different CUs should be enough for a 1 TB/s main memory. I guess OpenCL has no access to this information, and it can change for new cards, so the best thing would be to benchmark various settings, from 1 CU to N CUs and from 1 work item to N work items. It shouldn't take much time to measure under no contention (i.e., when the whole GPU server is dedicated to you).
Shader bandwidth:
If these are per-shader limits, then a single shader can use all of its own CU L1-L2 bandwidth, especially when reading.
It also says the L0-L1 cache line size is 128 bytes, so one work item could use a data type that wide.
The N-way set-associative caches (L1 and L2 above) and a direct-mapped cache (maybe the texture cache?) use modulo mapping, but the LRU cache (L0 here) may not require modulo access. Since you need global memory bandwidth, you should look at the L2 cache line, which is N-way set-associative, hence the modulo. Even if the data is already in L0, the OpenCL spec may not let you do non-modulo-x access to the data. Also, you don't have to think about alignment if the array is of the same type as the data you need to work with.
If you don't want to fiddle with micro-benchmarking and don't know how many work items are required, then you can use async work-group copy commands in the kernel. The async copy implementation uses just the required number of shaders (or possibly no shaders at all, depending on the hardware). Then you can access the local memory fast, from a single work item.
But a single work item may require an unrolled loop to do the pipelining needed to use all the bandwidth of its CU. A single read/write operation will not fill the pipeline and will leave the latency visible (not hidden behind other latencies).
Note: the L2 clock frequency can be different from the main memory frequency, and not just 1 GHz. There could be an L3 cache or something else in there to bridge a different frequency. Perhaps it runs at the GPU frequency, e.g. 2 GHz; then all of the L1/L0 bandwidths are also higher, e.g. 512 GB/s per L1-L2 link. You may need to query CL_DEVICE_MAX_CLOCK_FREQUENCY for this. Either way, just 1 CU looks capable of using 90% of the bandwidth of high-end cards. An RX 6800 XT has 512 GB/s of main memory bandwidth and a 2 GHz GPU clock, so it can likely do it with only 1 CU.

iOS Metal compute pipeline slower than CPU implementation for search task

I made a simple experiment by implementing a naive character search algorithm searching 1,000,000 rows of 50 characters each (a 50 million character map) on both the CPU and the GPU (using the iOS 8 Metal compute pipeline).
The CPU implementation uses a simple loop; the Metal implementation gives each kernel thread 1 row to process (source code below).
To my surprise, the Metal implementation is on average 2-3 times slower than the simple, linear CPU version (if I use 1 core) and 3-4 times slower if I employ 2 cores (each of them searching half of the database)!
I experimented with different threads-per-group values (16, 32, 64, 128, 512) yet still got very similar results.
iPhone 6:
CPU 1 core: approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU: approx 0.35 sec (release mode, validation disabled)
I can see the Metal shader spending more than 90% of its time accessing memory (see below).
What can be done to optimise it?
Any insights will be appreciated, as there are not many sources on the internet (besides the standard Apple programming guides) providing details on memory access internals & trade-offs specific to the Metal framework.
METAL IMPLEMENTATION DETAILS:
Host code gist:
https://gist.github.com/lukaszmargielewski/0a3b16d4661dd7d7e00d
Kernel (shader) code:
https://gist.github.com/lukaszmargielewski/6b64d06d2d106d110126
GPU frame capture profiling results:
The GPU shader is also striding vertically through memory, whereas the CPU is moving horizontally. Consider the addresses actually touched more or less concurrently by each thread executing in lockstep in your shader as you read charTable. The GPU will probably run a good deal faster if your charTable matrix is transposed.
Also, because this code executes in a SIMD fashion, each GPU thread will probably have to run the loop to the full search phrase length, whereas the CPU will get to take advantage of early outs. The GPU code might actually run a little faster if you remove the early outs and just keep the code simple. Much depends on the search phrase length and likelihood of a match.
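To illustrate the layout point, a rough sketch in CUDA-style code (the kernel and variable names are made up; Metal's SIMD-groups play the role of warps here):

// Row-major layout: at step c, neighbouring threads read addresses rowLen apart,
// so each warp/SIMD-group generates many separate memory transactions.
__global__ void scanRowMajor(const char *table, int rowLen, int numRows,
                             int phraseLen, int *hits) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;
    int sum = 0;
    for (int c = 0; c < phraseLen; ++c)
        sum += table[row * rowLen + c];      // strided across the warp
    hits[row] = sum;
}

// Transposed (column-major) layout: at step c, neighbouring threads read
// consecutive addresses, so the accesses coalesce into a few wide transactions.
__global__ void scanColMajor(const char *tableT, int rowLen, int numRows,
                             int phraseLen, int *hits) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;
    int sum = 0;
    for (int c = 0; c < phraseLen; ++c)
        sum += tableT[c * numRows + row];    // contiguous across the warp
    hits[row] = sum;
}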
I'll take my guess too: the GPU isn't optimized for if/else and it doesn't predict branches (it probably executes both sides), so try to rewrite the algorithm in a more linear way, without any conditionals or with them reduced to a bare minimum.

IOPS versus Throughput

What is the key difference between IOPS and Throughput in large data storage?
Does file size have an effect on IOPS? Why?
IOPS measures the number of read and write operations per second, while throughput measures the number of bits read or written per second.
Although they measure different things, they generally track each other, as long as the IO operations are about the same size.
If you have large files, you simply need more IO operations to read the entire file. The file size has no effect on the IOPS as it measures the number of clusters read or written, not the number of files.
If you have small files, there will be more overhead, so while the IOPS and throughput look good, you may experience a lower actual performance.
This is the analogy I came up with when talking about Throughput and IOPS.
Think of it as:
You have 4 buckets (Disk blocks) of the same size that you want to fill or empty with water.
You'll be using a jug to transfer the water into the buckets. Now your question will be:
At a given time (per second), how many jugs of water can you pour (write) or withdraw (read)? This is IOPS.
At a given time (per second) what's the amount (bit, kb, mb, etc) of water the jug can transfer into/out of the bucket continuously? This is throughput.
Additionally, there is a delay in the process of you pouring and/or withdrawing the water. This is Latency.
There are 3 things to consider when talking about IOPS and Throughput:
Size (file size/block size)
Patterns (Random/Sequential)
Mix (Read/Write) percentage
Disk IOPS describes the count of input/output operations on the disk per second, regardless of block size.
Disk throughput describes how much data may be transferred per second, so the block size plays a huge role in calculating the throughput required by an app.
Let's take as a sample 3000 IOPS and a SQL database engine. The block size in terms of the DB engine is called the page size, and for SQL Server it's equal to 8 KB. If you wish to calculate the actual throughput when the IOPS are defined, you end up with the formula below:
throughput = [IOPS] * [block size] = 3000 * 8 = 24 000 KB/s = 24 MB/s
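The same relation can be rearranged to estimate the IOPS needed for a throughput target (the numbers below are only an illustration):
required IOPS = [throughput] / [block size] = 100 MB/s / 8 KB = 100 000 / 8 = 12 500 IOPS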
IOPS - the number of read/write operations per second; mostly relevant for OLTP transactions, used in AWS for DBs like Cassandra.
Throughput - the number of bits transferred per second, i.e. data transferred per second.
Mainly a measure for high-data-transfer applications like big data Hadoop and Kafka streaming.
IOPS - the number of input/output operations a storage system can perform per second, each measured from start to finish.
Throughput - data transfer speed, usually expressed in megabytes per second. Earlier it was measured in kilobytes, but now megabytes have become the standard.
For more about this, see: What is the difference between IOPS and throughput?

Block dimensions in CUDA

I have an NVIDIA GTX 570 (compute capability 2.0) running CUDA 4.0.
The deviceQuery executable in the CUDA SDK gives me information on my CUDA device and its various properties. Two of the lines in the output are
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Why is the 3rd dimension of the block restricted to at most 64 threads, whereas the X and Y dimensions can vary up to 1024 threads?
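(For reference, the same limits can also be read at runtime through the runtime API; a minimal sketch, assuming device 0:)

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}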
EDIT2: Also, please take this with a grain of salt; this is a purely hypothetical answer, or a guess. There may indeed be a clear hardware-based reason why 64 is the maximum. Frankly, I don't know, and my answer is based on the assumption that there is no such hardware limit, per se.
It's probably a combination of three things: first, there is a limit to the number of threads which can be resident inside a block; second, block dimensions are typically in multiples of 32, and even more often in powers of 2 greater than 32; third, coordinate systems used in the solution of multi-dimensional problems are most often oriented so that you're looking at the scene directly (i.e., with the important bits more distributed in X and Y than in Z).
CUDA naturally has to support 1D access, as this is an immensely common and efficient access pattern when it is applicable. To support this, the X dimension must be allowed to vary over the entire range of 1024 threads.
To support 2D access, which is less common, CUDA should minimally support up to 512 in the X dimension (using the convention that the X dimension should be oriented in the coordinate system so that it measures the biggest spread) and 32 in the Y dimension. It must support up to 1024 in the X dimension, and I suppose they relax the requirement that the X dimension be no smaller than the Y dimension and allow the full 1024 range of Y values. However, in my understanding, 32 would have been plenty big for the Y dimension maximum.
To support 3D access, maintaining X, Y >= Z and trying to reach 1024, it seems that in the best case X=Y=Z=10 (since 10*10*10 = 1000 <= 1024); so there's no real argument for allowing Z to be greater than 10, given my assumptions.
In summary, I don't see why they couldn't have made the maximums (1024, 32, 10). My question is why make them (1024, 1024, 64)? The only answer I keep coming back to is to allow some flexibility to programmers to violate the X>=Y>=Z coordinate system convention.
Edit: given my summary and hypothetical answer, the real answer to your question is this: it's an arbitrary decision.
My wild guess is that it's because threadIdx.x, threadIdx.y and threadIdx.z are kept in a single special 32-bit register, possibly together with some other additional data. Maybe the warp id? Or maybe a multiprocessor-block id, to identify which block a given thread handles if a given multiprocessor runs more than one?
This is purely speculative, I have no data to support it, but I would imagine that they want to have as few special registers as possible.

Can FFT length affect filtering accuracy?

I am designing a fractional delay filter, and my Lagrange coefficients of order 5, h(n), have 6 taps in the time domain. I have tested convolving h(n) with x(n), a 5000-sample signal, in MATLAB, and the result seems OK. When I tried to use the FFT and IFFT method, the output was totally wrong. My FFT is actually computed with 8192 points in the frequency domain, which is the nearest power of 2 for the 5000 signal samples. For the IFFT portion, I convert the 8192 frequency-domain points back to 5000 samples in the time domain. So, the problem is: why does this work with convolution, but not with FFT multiplication? Does converting my 6-tap h(n) to 8192 points in the frequency domain cause this problem?
Actually, I have tried the overlap-save method, which performs the FFT and multiplication on smaller chunks of x(n), doing it 5 times separately. The result seems slightly better than before, and at least I can see the waveform pattern, but it is still slightly distorted. So, any idea where it goes wrong, and what is the solution? Thank you.
The reason I am implementing the circular convolution in the frequency domain instead of the time domain is that I am trying to merge the Lagrange filter with another low-pass filter in the frequency domain, so that the implementation can be more efficient. Of course, I believe that implementing the filtering in the frequency domain will be much faster than convolution in the time domain. The LP filter has 120 taps in the time domain. Due to memory constraints, the raw data including the padding is limited to 1024 samples in length, and so are the FFT bins.
My Lagrange filter has only 6 taps, which is hugely different from 1024 bins, and I suspect that taking the FFT of the 6 taps into 1024 bins in the frequency domain causes the error. Here is my MATLAB code for the Lagrange filter only. This is just test code, not implementation code. It's a bit messy, sorry about that. I'd really appreciate any further advice on this problem. Thank you.
t=1:5000;
fs=2.5*(10^12);
A=70000;
x=A*sin(2*pi*10.*t.*(10^6).*t./fs);
delay=0.4;
N=5;
n = 0:N;
h = ones(1,N+1);
for k = 0:N
index = find(n ~= k);
h(index) = h(index) * (delay-k)./ (n(index)-k);
end
pad=zeros(1,length(h)-1);
out=[];
H=fft(h,1024);   % zero-pad the 6-tap h(n) to the 1024-point FFT length
for i=0:1:ceil(length(x)/(1024-length(h)+1))-1
if (i ~= ceil(length(x)/(1024-length(h)+1))-1)
a=x(1,i*(1024-length(h)+1)+1:(i+1)*(1024-length(h)+1));
else
temp=x(1,i*(1024-length(h)+1)+1:length(x));
a=[temp zeros(1,1024-length(h)+1-length(temp))];
end
xx=[pad a];
X=fft(xx,1024);
Y=H.*X;
y=abs(ifft(Y,1024));
out=[out y(1,length(h):length(y))];
pad=y(1,length(a)+1:length(y));
end
Some comments:
The nearest power of two is actually 4096. Do you expect the remaining 904 samples to contribute much? I would guess that they are significant only if you are looking for relatively low-frequency features.
How did you pad your signal out to 8192 samples? Padding your sample out to 8192 implies that approximately 40% of your data is "fictional". If you used zeros to lengthen your dataset, you likely injected a step change at the pad point - which implies a lot of high-frequency content.
A short code snippet demonstrating your methods couldn't hurt.
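One more general point, as a rule of thumb rather than a diagnosis of this specific code: multiplying FFTs implements circular convolution, which only matches linear convolution when the transform length satisfies
N_fft >= length(x) + length(h) - 1   (here 5000 + 6 - 1 = 5005, so 8192 is long enough provided both x(n) and h(n) are zero-padded to 8192 before the transforms)
and for overlap-save with a 1024-point FFT and a 120-tap filter, each block yields only 1024 - 120 + 1 = 905 valid output samples; the first 119 samples of each IFFT block are circularly wrapped and must be discarded.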
