Block dimensions in CUDA - sdk

I have an NVIDIA GTX 570 (compute capability 2.0) running CUDA 4.0.
The deviceQuery executable in the CUDA SDK gives me information on my CUDA device and its various properties. Two of the lines in the output are
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Why is the third dimension of a block limited to 64 threads, whereas the X and Y dimensions can each go up to 1024 threads?

EDIT2: Also, please take this with a grain of salt; this is a purely hypothetical answer, or a guess. There may indeed be a clear hardware-based reason why 64 is the maximum. Frankly I don't know, and my answer is based on the assumption that there is no such hardware limit, per se.
It's probably a combination of three things: first, there is a limit to the number of threads which can be resident inside a block; second, block dimensions are typically in multiples of 32, and even more often in powers of 2 greater than 32; third, coordinate systems used in the solution of multi-dimensional problems are most often oriented so that you're looking at the scene directly (i.e., with the important bits more distributed in X and Y than in Z).
CUDA naturally has to support 1D access, as this is an immensely common and efficient access pattern when it is applicable. To support this, the X dimension must be allowed to vary over the entire range of 1024 threads.
To support 2D access, which is less common, CUDA should minimally support up to 512 in the X dimension (using the convention that the X dimension should be oriented in the coordinate system so that it measures the biggest spread) and 32 in the Y dimension. It must support up to 1024 in the X dimension, and I suppose they relax the requirement that the X dimension be no smaller than the Y dimension and allow the full 1024 range of Y values. However, in my understanding, 32 would have been plenty big for the Y dimension maximum.
To support 3D access, maintaining X, Y >= Z and trying to reach 1024, it seems to me that in the best case X=Y=Z=10 (10 x 10 x 10 = 1000 threads); so there's no real argument for allowing Z to be greater than 10, given my assumptions.
In summary, I don't see why they couldn't have made the maximums (1024, 32, 10). My question is why make them (1024, 1024, 64)? The only answer I keep coming back to is to allow some flexibility to programmers to violate the X>=Y>=Z coordinate system convention.
Edit: given my summary and hypothetical answer, the real answer to your question is this: it's an arbitrary decision.
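The two deviceQuery limits interact: each dimension has its own cap, and the product of all three has a separate cap. A minimal host-side sketch of that check, written in plain Python rather than CUDA; the function name `block_is_valid` is made up for illustration, and the constants are simply the values quoted in the question:

```python
from math import prod

# Limits reported by deviceQuery for the compute capability 2.0 device in the question.
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)

def block_is_valid(dim):
    """Return True if an (x, y, z) block shape satisfies both limits."""
    within_each_dim = all(1 <= d <= m for d, m in zip(dim, MAX_BLOCK_DIM))
    return within_each_dim and prod(dim) <= MAX_THREADS_PER_BLOCK

# Print each shape, its thread count, and whether it passes both checks.
for shape in [(1024, 1, 1), (32, 32, 1), (10, 10, 10), (4, 4, 64), (32, 32, 2), (1, 1, 128)]:
    print(shape, prod(shape), block_is_valid(shape))
```

Shapes such as (32, 32, 2) stay within every per-dimension limit but exceed the total thread count, which is why the per-dimension maximums alone say little about which blocks can be launched.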

My wild guess is that threadIdx.x, threadIdx.y and threadIdx.z are kept in a single special 32-bit register, possibly together with some additional data. Maybe the warp id? Or maybe a multiprocessor-block id to identify which block a given thread belongs to, if a given multiprocessor runs more than one?
This is purely speculative, I have no data to support it, but I would imagine that they want to have as few special registers as possible.
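To see why the guess is at least arithmetically plausible: indices below 1024, 1024 and 64 need 10, 10 and 6 bits, which is 26 bits in total. The sketch below packs three indices into one integer under that assumed layout. The layout, the field order and the names `pack` and `unpack` are invented for illustration; nothing here describes how the hardware actually stores thread indices.

```python
# Bits needed to index each dimension: 1024 -> 10, 1024 -> 10, 64 -> 6.
X_BITS, Y_BITS, Z_BITS = 10, 10, 6

def pack(x, y, z):
    """Pack three indices into one integer (hypothetical layout: x low, z high)."""
    assert 0 <= x < (1 << X_BITS) and 0 <= y < (1 << Y_BITS) and 0 <= z < (1 << Z_BITS)
    return x | (y << X_BITS) | (z << (X_BITS + Y_BITS))

def unpack(word):
    """Recover the three indices from a packed integer."""
    x = word & ((1 << X_BITS) - 1)
    y = (word >> X_BITS) & ((1 << Y_BITS) - 1)
    z = (word >> (X_BITS + Y_BITS)) & ((1 << Z_BITS) - 1)
    return x, y, z

word = pack(1023, 1023, 63)
print(word.bit_length())                   # bits used by the largest index triple
print(32 - (X_BITS + Y_BITS + Z_BITS))     # bits left over in a 32-bit word
```

Under this assumed layout a few bits of a 32-bit word remain free, which would be consistent with the idea that other data shares the register, but it does not confirm it.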


What is maximum (ideal) memory bandwidth of an OpenCL device?

My OpenCL device memory-relevant specs are:
Max compute units 20
Global memory channels (AMD) 8
Global memory banks per channel (AMD) 4
Global memory bank width (AMD) 256 bytes
Global Memory cache line size 64 bytes
Does it mean that, to use my device to its full memory potential, I need 8 work items on different CUs constantly reading memory chunks of 64 bytes? Are memory channels arranged so that different CUs can access memory simultaneously? Are memory reads of 64 bytes always treated as single reads, or only if address % 64 == 0?
Does memory bank quantity/width have anything to do with memory bandwidth, and is there a way to reason about memory performance with respect to memory banks when writing a kernel?
Memory bank quantity is useful to hint about strided access pattern performance and bank conflicts.
The cache line width must be the L2 cache line between L2 and the CU (L1). 64 bytes per cycle means 64 GB/s per compute unit (assuming there is only 1 active cache line per CU at a time and a 1 GHz clock; there can be multiple, like 4 of them, per L1 too). With 20 compute units, the total L2-to-L1 bandwidth must be 1.28 TB/s, but its main advantage over global memory must be the lower number of clock cycles needed to fetch data.
If you need to utilize global memory, then you need to approach bandwidth limits between L2 and main memory. That is related to memory channel width, number of memory channels and frequency.
GDDR channel width is 64 bits; HBM channel width is 128 bits. A single stack of HBM v1 has 8 channels, so it's a total of 1024 bits or 128 bytes. 128 bytes per cycle means 128 GB/s per GHz. More stacks mean more bandwidth. If 8 GB of memory is made of two stacks, then it's 256 GB/s.
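The arithmetic above is channels times channel width times transfer rate. A small sketch that reproduces the HBM figures from this answer; the function name and parameters are hypothetical, and the 1 GHz transfer rate is the same simplifying assumption the answer makes, not a datasheet value:

```python
def peak_bandwidth_gb_s(channels, channel_width_bits, transfers_per_second):
    """Bytes moved per transfer across all channels, times transfers per second, in GB/s."""
    bytes_per_transfer = channels * channel_width_bits // 8
    return bytes_per_transfer * transfers_per_second / 1e9

# One HBM v1 stack as described above: 8 channels of 128 bits, assumed 1e9 transfers/s.
one_stack = peak_bandwidth_gb_s(channels=8, channel_width_bits=128, transfers_per_second=1e9)
two_stacks = 2 * one_stack
print(one_stack, two_stacks)
```

These are paper numbers only; measured bandwidth is normally lower, which is why the benchmark suggested in this answer is still worth running.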
If your data-set fits inside L2 cache, then you expect more bandwidth under repeated access.
But the true performance (instead of on paper) can be measured by a simple benchmark that does pipelined memory copy between two arrays.
The total performance of 8 work items depends on the capability of the compute unit. If it allows only 32 bytes per clock per work item, then you may need more work items. The compute unit must have some optimization phase, like packing similar addresses into one big memory access per CU. So you can even achieve maximum performance using only a single work group (but using multiple work items, not just 1; the number depends on how big an object each work item is accessing and on the CU's capability). You can benchmark this with an array-summation or reduction kernel. Just 1 compute unit is generally more than enough to utilize global memory bandwidth, unless its single L2-L1 bandwidth is lower than the global memory bandwidth. This may not be true for the highest-end cards, though.
What is the parallelism between L2 and L1 for your card? Only 1 active line at a time? Then you probably require 8 work items distributed over 8 work groups.
According to AMD's datasheet on RDNA, each shader can have 10-20 requests in flight, so if the L1-L2 link of a single RDNA compute unit is enough to use all the bandwidth of global memory, then even just a few work items from a single work group should be enough.
L1-L2 bandwidth:
It says 4 lines are active between each L1 and the L2. So it must have 256 GB/s per compute unit. 4 work groups running on different CUs should be enough for a 1 TB/s main memory. I guess OpenCL has no access to this information, and it can change for new cards, so the best thing would be to benchmark various settings, from 1 CU to N CUs and from 1 work item to N work items. It shouldn't take much time to measure under no contention (i.e. when the whole GPU server is dedicated to you).
Shader bandwidth:
If these are per-shader limits, then a single shader can use all of its own CU L1-L2 bandwidth, especially when reading.
It also says the L0-L1 cache line size is 128 bytes, so 1 work item could be using a data type that wide.
N-way-set-associative cache (L1, L2 in above pictures) and direct-mapped cache (maybe texture cache?) use the modulo mapping. But LRU (L0 here) may not require the modulo access. Since you need global memory bandwidth, you should look at L2 cache line which is n-way-set-associative hence the modulo. Even if data is already in L0, the OpenCL spec may not let you do non-modulo-x access to data. Also you don't have to think about alignment if the array is of type of the data you need to work with.
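The "modulo" behaviour can be made concrete with a generic cache model: an address maps to a line by integer division and to a set by a modulo, so a 64-byte read that is not aligned to 64 bytes straddles two lines. This is a textbook sketch, not a description of the AMD hardware; `NUM_SETS` and both function names are made up for illustration.

```python
LINE_BYTES = 64   # cache line size reported in the question
NUM_SETS = 256    # hypothetical number of sets, chosen only for illustration

def lines_touched(address, size, line_bytes=LINE_BYTES):
    """Number of distinct cache lines covered by a read of `size` bytes at `address`."""
    first = address // line_bytes
    last = (address + size - 1) // line_bytes
    return last - first + 1

def set_index(address, line_bytes=LINE_BYTES, num_sets=NUM_SETS):
    """Set that the line containing `address` maps to in a set-associative cache."""
    return (address // line_bytes) % num_sets

print(lines_touched(128, 64))   # read starting on a line boundary
print(lines_touched(130, 64))   # same size, shifted by 2 bytes
print(set_index(0), set_index(LINE_BYTES * NUM_SETS))   # addresses one full wrap apart
```

In this model, only reads whose start address is a multiple of the line size fit in one line, which is one way to interpret the "% 64 == 0" part of the question.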
If you don't want to fiddle with microbenchmarking and don't know how many work items are required, then you can use the async work-group copy commands in the kernel. The async copy implementation uses just the required number of shaders (or no shaders at all, depending on hardware). Then you can access the local memory fast, from a single work item.
But, a single workitem may require an unrolled loop to do the pipelining to use all the bandwidth of its CU. Just a single read/write operation will not fill the pipeline and make the latency visible (not hidden behind other latencies).
Note: the L2 clock frequency can differ from the main memory frequency and need not be 1 GHz. There could be an L3 cache or something else in between to adapt to a different frequency. Perhaps it's the GPU frequency, like 2 GHz. Then all of the L1 and L0 bandwidths are also higher, like 512 GB/s per L1-L2 link. You may need to query CL_DEVICE_MAX_CLOCK_FREQUENCY for this. In any case, just 1 CU looks capable of using the bandwidth of 90% of high-end cards. An RX 6800 XT has 512 GB/s main memory bandwidth and a 2 GHz GPU clock, so 1 CU can likely saturate it.
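The per-CU estimates in this answer all follow one formula: line size times active lines times clock. A rough sketch that turns it into a "how many CUs do I need" estimate; the function names are hypothetical, and the inputs are the figures quoted above rather than measured values:

```python
import math

def l1_l2_bandwidth_gb_s(line_bytes, active_lines, clock_ghz):
    """Paper bandwidth of one CU's L1-L2 link: bytes per cycle times GHz gives GB/s."""
    return line_bytes * active_lines * clock_ghz

def compute_units_needed(main_memory_gb_s, per_cu_gb_s):
    """Smallest number of CUs whose combined L1-L2 bandwidth covers main memory."""
    return math.ceil(main_memory_gb_s / per_cu_gb_s)

per_cu_1ghz = l1_l2_bandwidth_gb_s(64, 4, 1.0)   # 4 active 64-byte lines at 1 GHz
per_cu_2ghz = l1_l2_bandwidth_gb_s(64, 4, 2.0)   # same link at 2 GHz
print(per_cu_1ghz, compute_units_needed(1000, per_cu_1ghz))   # 1 TB/s main memory
print(per_cu_2ghz, compute_units_needed(512, per_cu_2ghz))    # 512 GB/s main memory
```

The estimate ignores latency and contention, so it gives a lower bound on the number of CUs to try first in a benchmark, not a guarantee.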

Gigabyte or Gibibyte (1000 or 1024)?

This may be a duplicate, and I apologise if that is so, but I really want a definitive answer, as it seems to change depending on where I look.
Is it acceptable to say that a gigabyte is 1024 megabytes or should it be said that it is 1000 megabytes? I am taking computer science at GCSE and a typical exam question could be how many bytes in a kilobyte and I believe the exam board, AQA, has the answer for such a question as 1024 not 1000. How is this? Are both correct? Which one should I go with?
Thanks in advance- this has got me rather bamboozled!
The sad fact is that it depends on who you ask. But computer terminology is slowly being aligned with normal terminology, in which kilo is 10^3 (1,000), mega is 10^6 (1,000,000), and giga is 10^9 (1,000,000,000).
This is reflected in the International System of Quantities and the International Electrotechnical Commission, which define gigabyte as 10^9 and use gibibyte for the computer-specific 1024 x 1024 x 1024 value.
The reason it "depends who you ask," is that for many years, specifically in relation to "bytes" of storage, the prefixes kilo, mega, and giga meant 1024, 1024^2, and 1024^3. But that flies in the face of normal convention with regard to these prefixes. So again, computer terminology is being aligned with non-computer terminology.
The term gigabyte is commonly used to mean either 1000^3 bytes or 1024^3 bytes depending on the context. Disk manufacturers prefer the decimal term while memory manufacturers use the binary.
Decimal definition
1 GB = 1,000,000,000 bytes (= 1000^3 B = 10^9 B)
Based on powers of 10, this definition uses the prefix as defined in the International System of Units (SI). This is the recommended definition by the International Electrotechnical Commission (IEC). This definition is used in networking contexts and most storage media, particularly hard drives, flash-based storage, and DVDs, and is also consistent with the other uses of the SI prefix in computing, such as CPU clock speeds or measures of performance.
Binary definition
1 GiB = 1,073,741,824 bytes (= 1024^3 B = 2^30 B).
The binary definition uses powers of the base 2, as is the architectural principle of binary computers. This usage is widely promulgated by some operating systems, such as Microsoft Windows in reference to computer memory (e.g., RAM). This definition is synonymous with the unambiguous unit gibibyte.
The difference between units based on decimal and binary prefixes increases as a semi-logarithmic (linear-log) function—for example, the decimal kilobyte value is nearly 98% of the kibibyte, a megabyte is under 96% of a mebibyte, and a gigabyte is just over 93% of a gibibyte value. This means that a 300 GB (279 GiB) hard disk might be indicated variously as 300 GB, 279 GB or 279 GiB, depending on the operating system.
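The percentages and the 300 GB example can be checked directly. A short sketch; the helper names are made up for illustration:

```python
PREFIXES = ["kilo/kibi", "mega/mebi", "giga/gibi"]

def decimal_to_binary_ratio(power):
    """Size of the decimal unit as a fraction of the binary unit (power 1 = kilo, 2 = mega, ...)."""
    return 1000 ** power / 1024 ** power

# Show how the gap widens with each larger prefix.
for power, name in enumerate(PREFIXES, start=1):
    print(f"{name}: {decimal_to_binary_ratio(power):.2%}")

def gb_to_gib(gigabytes):
    """Convert decimal gigabytes to binary gibibytes."""
    return gigabytes * 1000 ** 3 / 1024 ** 3

print(f"300 GB = {gb_to_gib(300):.1f} GiB")
```

An operating system that divides by 1024^3 but labels the result "GB" is what produces the "279 GB" reading for a 300 GB disk.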
The Wikipedia article https://en.wikipedia.org/wiki/Gigabyte has a good writeup of the confusion surrounding the usage of the term

CUDA: Best number of pixel computed per thread (grayscale)

I'm working on a program to convert an image to grayscale. I'm using the CImg library. For each pixel, I have to read the 3 R-G-B values, calculate the corresponding gray value and store the gray pixel in the output image. I'm working with an NVIDIA GTX 480. Some details about the card:
Microarchitecture: Fermi
Compute capability (version): 2.0
Cores per SM (warp size): 32
Streaming Multiprocessors: 15
Maximum number of resident warps per multiprocessor: 48
Maximum amount of shared memory per multiprocessor: 48KB
Maximum number of resident threads per multiprocessor: 1536
Number of 32-bit registers per multiprocessor: 32K
I'm using a square grid with blocks of 256 threads.
This program can take input images of different sizes (e.g. 512x512 px, 10000x10000 px). I observed that incrementing the number of the pixels assigned to each thread increments the performance, so it's better than computing one pixel per thread. The problem is, how can I determine the number of pixels to assign to each thread statically? Computing tests with every possible number? I know that on the GTX 480, 1536 is the maximum number of resident threads per multiprocessor. Do I have to consider this number? The following is the code executed by the kernel.
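// Grid-stride loop, assuming a 1D grid of 1D blocks: each thread starts at its
// global index and advances by the total number of threads in the grid.
for (i = (blockIdx.x * blockDim.x) + threadIdx.x; i < width * height; i += (gridDim.x * blockDim.x)) {
    float grayPix = 0.0f;
    // The input is planar: all R values, then all G values, then all B values.
    float r = static_cast< float >(inputImage[i]);
    float g = static_cast< float >(inputImage[(width * height) + i]);
    float b = static_cast< float >(inputImage[(2 * width * height) + i]);
    // Weighted sum of the three channels gives the gray level.
    grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
    // Darken the result, then add 0.5 so the cast below rounds to nearest.
    grayPix = (grayPix * 0.6f) + 0.5f;
    darkGrayImage[i] = static_cast< unsigned char >(grayPix);
}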
The problem is, how can I determine the number of pixels to assign to each thread statically? Computing tests with every possible number?
Although you haven't shown any code, you've mentioned an observed characteristic:
I observed that incrementing the number of the pixels assigned to each thread increments the performance,
This is actually a fairly common observation for these types of workloads, and it may also be the case that this is more evident on Fermi than on newer architectures. A similar observation occurs during matrix transpose. If you write a "naive" matrix transpose that transposes one element per thread, and compare it with the matrix transpose discussed here that transposes multiple elements per thread, you will discover, especially on Fermi, that the multiple element per thread transpose can achieve approximately the available memory bandwidth on the device, whereas the one-element-per-thread transpose cannot. This ultimately has to do with the ability of the machine to hide latency, and the ability of your code to expose enough work to allow the machine to hide latency. Understanding the underlying behavior is somewhat involved, but fortunately, the optimization objective is fairly simple.
GPUs hide latency by having lots of available work to switch to, when they are waiting on previously issued operations to complete. So if I have a lot of memory traffic, the individual requests to memory have a long latency associated with them. If I have other work that the machine can do while it is waiting for the memory traffic to return data (even if that work generates more memory traffic), then the machine can use that work to keep itself busy and hide latency.
The way to give the machine lots of work starts by making sure that we have enabled the maximum number of warps that can fit within the machine's instantaneous capacity. This number is fairly simple to compute: it is the product of the number of SMs on your GPU and the maximum number of warps that can be resident on each SM. We want to launch a kernel that meets or exceeds this number, but additional warps/blocks beyond this number don't necessarily help us hide latency.
Once we have met the above number, we want to pack as much "work" as possible into each thread. Effectively, for the problem you describe and the matrix transpose case, packing as much work into each thread means handling multiple elements per thread.
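Handling multiple elements per thread is commonly written as a grid-stride loop: each thread starts at its global index and steps by the total number of threads. The sketch below simulates that index pattern on the host in plain Python, for a 1D launch, to show that every element is visited exactly once. The function name and the tiny launch size are hypothetical and chosen only to keep the output readable.

```python
def pixels_for_thread(block_idx, thread_idx, grid_dim, block_dim, n_pixels):
    """Indices visited by one thread of a 1D launch running a grid-stride loop."""
    start = block_idx * block_dim + thread_idx   # global index of this thread
    stride = grid_dim * block_dim                # total number of threads in the grid
    return range(start, n_pixels, stride)

grid_dim, block_dim, n_pixels = 4, 8, 100   # small hypothetical launch: 32 threads, 100 pixels
visits = [0] * n_pixels
for b in range(grid_dim):
    for t in range(block_dim):
        for i in pixels_for_thread(b, t, grid_dim, block_dim, n_pixels):
            visits[i] += 1

print(all(v == 1 for v in visits))   # every pixel handled exactly once
print(len(pixels_for_thread(0, 0, grid_dim, block_dim, n_pixels)))   # pixels handled by thread 0
```

With this pattern the number of elements per thread is not fixed in the code; it follows from the launch size, so the same kernel works for any image size.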
So the steps are fairly simple:
1. Launch as many warps as the machine can handle instantaneously.
2. Put all remaining work in the thread code, if possible.
Let's take a simplistic example. Suppose my GPU has 2 SMs, each of which can handle 4 warps (128 threads). Note that this is not the number of cores, but the "Maximum number of resident warps per multiprocessor" as indicated by the deviceQuery output.
My objective then is to create a grid of 8 warps, i.e. 256 threads total (in at least 2 threadblocks, so they can distribute to each of the 2 SMs) and make those warps perform the entire problem by handling multiple elements per thread. So if my overall problem space is a total of 1024x1024 elements, I would ideally want to handle 1024*1024/256 elements per thread.
Note that this method gives us an optimization direction. We do not necessarily have to achieve this objective completely in order to saturate the machine. It might be the case that it is only necessary, for example, to handle 8 elements per thread, in order to allow the machine to fully hide latency, and usually another limiting factor will appear, as discussed below.
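The sizing rule can be written out as arithmetic. The sketch below applies it to the toy GPU from the example and to the GTX 480 figures quoted in the question (15 SMs, 48 resident warps per SM). The function names are hypothetical, and the calculation ignores register and shared memory usage, which can lower the number of warps that are actually resident.

```python
import math

WARP_SIZE = 32

def resident_threads(num_sms, max_warps_per_sm):
    """Threads the whole GPU can hold at once, ignoring register/shared-memory limits."""
    return num_sms * max_warps_per_sm * WARP_SIZE

def elements_per_thread(n_elements, n_threads):
    """Elements each thread must handle so that n_threads cover the whole problem."""
    return math.ceil(n_elements / n_threads)

toy = resident_threads(2, 4)                       # the 2-SM, 4-warp example above
print(toy, elements_per_thread(1024 * 1024, toy))

gtx480 = resident_threads(15, 48)                  # figures from the question
print(gtx480, gtx480 // 256)                       # threads, and blocks of 256 threads
print(elements_per_thread(512 * 512, gtx480),
      elements_per_thread(10000 * 10000, gtx480))
```

As the text notes, this is a direction rather than a target that must be hit exactly; the profiler decides whether fewer elements per thread already hide the latency.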
Following this method will tend to remove latency as a limiting factor for performance of your kernel. Using the profiler, you can assess the extent to which latency is a limiting factor in a number of ways, but a fairly simple one is to capture the sm_efficiency metric, and perhaps compare that metric in the two cases you have outlined (one element per thread, multiple elements per thread). I suspect you will find, for your code, that the sm_efficiency metric indicates a higher efficiency in the multiple elements per thread case, and this is indicating that latency is less of a limiting factor in that case.
Once you remove latency as a limiting factor, you will tend to run into one of the other two machine limiting factors for performance: compute throughput and memory throughput (bandwidth). In the matrix transpose case, once we have sufficiently dealt with the latency issue, then the kernel tends to run at a speed limited by memory bandwidth.

about CUFFT input sizes

It's written that the CUFFT library supports algorithms that are highly optimized for input sizes that can be written in the following form: 2^a x 3^b x 5^c x 7^d.
How did they manage to do that?
As far as I know, an FFT only gives the best performance for input sizes of the form 2^a.
This means that input sizes with prime factors larger than 7 would go slower.
The Cooley-Tukey algorithm can operate on a variety of DFT lengths which can be expressed as N = N_1*N_2. The algorithm recursively expresses a DFT of length N into N_1 smaller DFTs of length N_2.
As you note, the fastest is generally the radix-2 factorization, which recursively breaks a DFT of length N into 2 smaller DFTs of length N/2, running in O(N log N).
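To illustrate that the recursion is not tied to radix 2, here is a minimal mixed-radix sketch in pure Python. It splits a length-N input into p interleaved sub-sequences of length N/p, where p is the smallest of 2, 3, 5, 7 that divides N (so p plays the role of N_1 and N/p of N_2), and falls back to a naive DFT when none divides N. This is a teaching sketch with made-up function names; it says nothing about how cuFFT is actually implemented.

```python
import cmath

def naive_dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def mixed_radix_fft(x):
    """Decimation-in-time Cooley-Tukey recursion using radices 2, 3, 5 and 7."""
    n = len(x)
    if n == 1:
        return list(x)
    p = next((r for r in (2, 3, 5, 7) if n % r == 0), None)
    if p is None:
        return naive_dft(x)   # no small factor left: fall back to the definition
    m = n // p
    # p sub-transforms of length m over the interleaved sub-sequences x[r], x[r+p], ...
    subs = [mixed_radix_fft(x[r::p]) for r in range(p)]
    # Recombine with twiddle factors; each sub-transform is periodic with period m.
    return [sum(cmath.exp(-2j * cmath.pi * r * k / n) * subs[r][k % m] for r in range(p))
            for k in range(n)]

signal = [complex(i % 7, 0) for i in range(2 * 3 * 5)]   # length 30 = 2 * 3 * 5
fast, slow = mixed_radix_fft(signal), naive_dft(signal)
print(max(abs(a - b) for a, b in zip(fast, slow)))       # difference from the naive DFT
```

When every factor of N is small, each recursion level does only a small amount of work per output, which is why lengths built from small primes are the fast cases.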
However, the actual performance will depend on hardware and implementation. For example, if we are considering the cuFFT with a thread warp size of 32 then DFTs that have a length of some multiple of 32 would be optimal (note: just an example, I'm not aware of the actual optimizations that exist under the hood of the cuFFT.)
Short answer: the underlying code is optimized for any prime factorization up to 7 based on the Cooley-Tukey radix-n algorithm.
http://mathworld.wolfram.com/FastFourierTransform.html
https://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm
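A practical consequence is that it can pay to pad a signal up to the next length whose only prime factors are 2, 3, 5 and 7. A small sketch for checking a length and finding the next such size; the helper names are hypothetical and are not part of the cuFFT API:

```python
def is_7_smooth(n):
    """True if n can be written as 2^a * 3^b * 5^c * 7^d."""
    if n < 1:
        return False
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1

def next_7_smooth(n):
    """Smallest length >= n with no prime factor larger than 7."""
    while not is_7_smooth(n):
        n += 1
    return n

print(is_7_smooth(1000), is_7_smooth(1001), next_7_smooth(1001))
```

Whether padding is actually faster depends on the library and the hardware, so it is worth timing both sizes.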

Why is RAM in powers of 2?

Why is the amount of RAM always a power of 2?
512, 1024, etc.
Specifically, what is the difference between using 512, 768, and 1024 RAM for an Android emulator?
Memory is closely tied to the CPU, so making their size a power of two
means that multiple modules can be packed requiring a minimum of logic
in order to switch between them; only a few bits from the end need to
be checked (since the binary representation of the size is 1000...0000
regardless of its size) instead of many more bits were it not a power
of two.
Hard drives are not tied to the CPU and not packed in the same manner,
so exactness of their size is not required.
from https://superuser.com/questions/235030/why-are-ram-size-usually-in-powers-of-2-512-mb-1-2-4-8-gb
as referenced by BrajeshKumar in the comments on the OP. Thanks Brajesh!
Because computers deal with binary values (0 and 1): registers are either on (1) or off (0).
So if you use powers of 2, your hardware will use 100% of the registers.
If computers used ternary values in their circuits, then we'd have memory, processors and anything else in powers of 3.
I think it is related to the number of bits in an address bus (or the bits used to select between address spaces). n bits can address 2^n bytes, so whenever the number of address bits increases to n+1, the space automatically increases by a factor of 2. The manufacturers use their maximum address capacity when including memory chips in the design.
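Both points, the doubling per address bit and the cheap module selection for power-of-two sizes, can be shown with a few bit operations. A minimal sketch; the function names and the example address are made up for illustration:

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses that n address bits can express."""
    return 1 << address_bits

def split_address(address, module_bytes):
    """Split an address into (module index, offset) for power-of-two sized modules."""
    assert module_bytes > 0 and module_bytes & (module_bytes - 1) == 0, \
        "module size must be a power of two"
    offset_bits = module_bytes.bit_length() - 1
    # The high bits pick the module and the low bits are the offset inside it.
    return address >> offset_bits, address & (module_bytes - 1)

print(addressable_bytes(29) // 2**20, addressable_bytes(30) // 2**20)   # sizes in MiB
print(split_address(0x2400_0010, 512 * 2**20))   # which 512 MiB module, and where in it
```

With a size that is not a power of two, the same split would need a division and a remainder instead of a shift and a mask, which is the extra logic the quoted answer refers to.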
In the Android emulator, increasing the RAM may make your program more efficient, because when your application exceeds the available RAM, a part of ROM (non-volatile memory) is used instead, and it is slower.
