What does aspect ratio mean in memory? - xilinx

Does anyone know what aspect ratio means for memories, and how it differs between block RAM and distributed RAM in Xilinx FPGAs?
Thanks

"Aspect ratio" refers to the number of address bits and data bits when accessing memory.
For example, say you have memory that holds 65,536 bits. If you use 8 data bits per address, you have 8,192 addresses (65,536 / 8 = 8,192), which means 13 address bits (2^13 = 8,192). So one aspect ratio for accessing 65,536 bits is 8 data bits and 13 address bits.
But say you want 16 data bits per address. Then you can only have 4,096 addresses since 4,096 * 16 is 65,536, and that's all the memory you have. In that case, you can only have 12 address bits. So another aspect ratio for accessing 65,536 bits of memory is 16 data bits and 12 address bits.
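
As a quick illustration of this arithmetic, here is a small sketch that enumerates the power-of-two aspect ratios for the same 65,536-bit memory:

```python
TOTAL_BITS = 65_536

for data_bits in (1, 2, 4, 8, 16, 32):
    depth = TOTAL_BITS // data_bits        # number of addressable words
    addr_bits = depth.bit_length() - 1     # log2(depth); depth is a power of two
    print(f"{data_bits:2d} data bits x {depth:5d} words -> {addr_bits} address bits")
```

The 8-data-bit row prints 8,192 words and 13 address bits, matching the example above.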

Aspect ratio is another term for the data organization (depth × width) of a RAM block.
In Xilinx FPGAs, a Block RAM is a dedicated two-port memory containing several kilobits of RAM. The FPGA contains several (or many) of these blocks.
Inside each small logic block is a configurable lookup table (LUT). It is normally used for logic functions, but you can reconfigure it as a few bits of RAM, and you can combine several (or many) of them into a larger RAM. This is distributed RAM.
Both types of RAM can be initialized with data, or used as ROM.
More information can be found here:
http://www.xilinx.com/support/index.htm#nav=sd-nav-link-182711&tab=tab-sd

Related

What is maximum (ideal) memory bandwidth of an OpenCL device?

My OpenCL device memory-relevant specs are:
Max compute units 20
Global memory channels (AMD) 8
Global memory banks per channel (AMD) 4
Global memory bank width (AMD) 256 bytes
Global Memory cache line size 64 bytes
Does it mean that to utilize my device's full memory potential, it needs 8 work items on different CUs, each constantly reading 64-byte chunks of memory? Are the memory channels arranged so that different CUs can access memory simultaneously? Are 64-byte memory reads always treated as single reads, or only if the address is aligned (address % 64 == 0)?
Does the quantity/width of the memory banks have anything to do with memory bandwidth, and is there a way to reason about memory performance with respect to the banks when writing a kernel?
The memory bank quantity is mainly a hint about strided-access performance and bank conflicts.
The cache line width must be the L2 cache line size between L2 and the CU (L1). 64 bytes per cycle means 64 GB/s per compute unit (assuming only one active cache line per CU at a time and a 1 GHz clock; there can also be multiple lines per L1, e.g. 4). With 20 compute units, the total "L2 to L1" bandwidth would be 1.28 TB/s, but its main advantage over global memory is the lower number of clock cycles needed to fetch data.
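A quick back-of-envelope check of those numbers (the 64-byte line, 1 GHz clock, and 20 CUs are the assumptions stated above):

```python
line_bytes = 64                   # one active cache line per CU per cycle
clock_hz = 1_000_000_000          # assumed 1 GHz
compute_units = 20

per_cu = line_bytes * clock_hz    # 64 GB/s per compute unit
total = per_cu * compute_units    # 1.28 TB/s aggregate L2 -> L1
print(per_cu / 1e9, "GB/s per CU,", total / 1e12, "TB/s total")
```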
If you need to utilize global memory, then you need to approach bandwidth limits between L2 and main memory. That is related to memory channel width, number of memory channels and frequency.
A GDDR channel is 64 bits wide; an HBM channel is 128 bits wide. A single stack of HBM v1 has 8 channels, so that's a total of 1024 bits, or 128 bytes. 128 bytes per cycle means 128 GB/s per GHz. More stacks mean more bandwidth: if 8 GB of memory is made of two stacks, that's 256 GB/s.
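The same arithmetic as a small helper, a sketch only (the channel widths and clocks are the assumed figures above, not queried from a device):

```python
def peak_bw_gbs(channel_bits, channels, ghz):
    """Peak DRAM bandwidth in GB/s: channel width * channel count * effective clock."""
    return channel_bits / 8 * channels * ghz

print(peak_bw_gbs(128, 8, 1.0))    # one HBM v1 stack: 128.0 GB/s per GHz
print(peak_bw_gbs(128, 16, 1.0))   # two stacks: 256.0 GB/s
print(peak_bw_gbs(64, 8, 2.0))     # e.g. 8 GDDR channels at 2 GHz effective
```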
If your data set fits inside the L2 cache, then you can expect more bandwidth under repeated access.
But the true performance (as opposed to the on-paper figures) can be measured by a simple benchmark that does a pipelined memory copy between two arrays.
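A minimal host-side version of such a copy benchmark might look like the sketch below. Note it measures host memory, not the OpenCL device, so treat it only as a template for the equivalent device-side kernel:

```python
import time
import numpy as np

n = 256 * 1024 * 1024                 # 256 MiB source buffer
src = np.ones(n, dtype=np.uint8)
dst = np.empty_like(src)

t0 = time.perf_counter()
np.copyto(dst, src)                   # the pipelined copy under test
t1 = time.perf_counter()

# One read plus one write per byte = 2 * n bytes of traffic.
print(f"{2 * n / (t1 - t0) / 1e9:.1f} GB/s effective copy bandwidth")
```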
The total performance of 8 work items depends on the capability of the compute unit. If it allows only 32 bytes per clock per work item, then you may need more work items. The compute unit should have an optimization phase that coalesces similar addresses into one big memory access, so you can achieve maximum performance even with a single work group (but using multiple work items, not just 1; the number depends on how big an object each work item is accessing and on the CU's capability). You can benchmark this with an array-summation or reduction kernel. Just one compute unit is generally more than enough to utilize global memory bandwidth, unless its single L2-L1 bandwidth is lower than the global memory bandwidth. But that may not hold for the highest-end cards.
What is the parallelism between L2 and L1 for your card? Only 1 active line at a time? Then you probably require 8 work items distributed over 8 work groups.
According to AMD's datasheet about RDNA, each shader is capable of keeping 10-20 requests in flight, so if the L1-L2 link of one RDNA compute unit is enough to use all the bandwidth of global memory, then even just a few work items from a single work group should be enough.
L1-L2 bandwidth:
It says 4 lines are active between each L1 and the L2, so each compute unit should get 256 GB/s. Four work groups running on different CUs should then be enough for a 1 TB/s main memory. I guess OpenCL has no access to this information, and it can change for new cards, so the best thing is to benchmark various settings, from 1 CU to N CUs and from 1 work item to N work items. It shouldn't take much time to measure under no contention (i.e. when the whole GPU server is dedicated to you).
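The "four work groups" figure follows from a simple ceiling division (all the numbers here are the assumed ones above):

```python
import math

per_cu_gbs = 4 * 64        # 4 active lines * 64 GB/s each = 256 GB/s per CU
main_mem_gbs = 1000        # ~1 TB/s main memory, assumed

print(math.ceil(main_mem_gbs / per_cu_gbs))  # -> 4 CUs / work groups needed
```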
Shader bandwidth:
If these are per-shader limits, then a single shader can use all of its own CU's L1-L2 bandwidth, especially when reading.
The datasheet also says the L0-L1 cache line size is 128 bytes, so one work item could use a data type that wide.
An N-way set-associative cache (L1 and L2 in the pictures above) and a direct-mapped cache (maybe the texture cache?) use modulo mapping, but an LRU cache (L0 here) may not require modulo access. Since you need global memory bandwidth, you should look at the L2 cache line, which is N-way set-associative, hence the modulo. Even if the data is already in L0, the OpenCL spec may not let you do non-modulo-aligned access to it. Also, you don't have to think about alignment if the array is of the type of the data you need to work with.
If you don't want to fiddle with microbenchmarking and don't know how many work items are required, then you can use the async work-group copy functions in the kernel. The async copy implementation uses just the required number of shaders (or possibly no shaders at all, depending on the hardware). Then you can access the local memory fast, even from a single work item.
But a single work item may require an unrolled loop to do the pipelining and use all the bandwidth of its CU. Just a single read/write operation will not fill the pipeline, and the latency stays visible (not hidden behind other latencies).
Note: the L2 clock frequency can be different from the main memory frequency, not just 1 GHz. There could be an L3 cache or something else in there to adapt to a different frequency. Perhaps it runs at the GPU frequency, e.g. 2 GHz; then all of the L1 and L0 bandwidths are higher too, e.g. 512 GB/s per L1-L2 link. You may need to query CL_DEVICE_MAX_CLOCK_FREQUENCY for this. Either way, just 1 CU looks capable of using the full bandwidth on perhaps 90% of high-end cards. An RX 6800 XT has 512 GB/s of main memory bandwidth and a 2 GHz GPU clock, so it can likely do it with only 1 CU.
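That query is easy from the host side; a sketch assuming the pyopencl package (which exposes the CL_DEVICE_* info as lowercase properties):

```python
import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(dev.name,
              dev.max_compute_units, "CUs,",
              dev.max_clock_frequency, "MHz,",   # CL_DEVICE_MAX_CLOCK_FREQUENCY
              dev.global_mem_size // 2**20, "MiB global memory")
```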

Word size and memory addresses

My understanding of word size and memory addresses is as follows. An 8-bit machine will have an 8-bit address bus and thus 256 memory addresses. A memory address is the location of each unit of storage, so this machine can make use of 256 bytes of RAM. Now, a 32-bit machine has a 32-bit word size, i.e. 4 bytes. At this point I get confused. In terms of usable memory, online sources tell me 4 GB, but if each memory address is 4 bytes in size, then surely it only has a quarter as many usable addresses, i.e. only 1 GB of RAM can be used? What am I mixing up here?
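(The resolution, as the answer to the next question explains, is that such machines are byte-addressable: each address names one byte regardless of the word size. A quick check of that arithmetic:)

```python
addr_bits = 32
bytes_per_address = 1      # byte-addressable: each address names one byte
total = 2**addr_bits * bytes_per_address
print(total, "bytes =", total // 2**30, "GiB")   # 4294967296 bytes = 4 GiB
```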

Calculating memory size based on address bit-length and memory cell contents

I am trying to calculate the maximum memory size knowing the bit length of an address and the size of the memory cell.
It is my understanding that if the address is n bits then there are 2^n memory locations. But then to calculate the actual memory size of the machine, you would need to multiply the number of addresses by the size of the memory cell. Is that correct?
To put it another way,
Step 1: determine the length of the address in bits (n bits).
Step 2: calculate the number of memory locations: 2^n.
Step 3: multiply the number of memory locations by the size of each memory cell in bytes.
If each cell were 2 bytes, for example, I would multiply the 2^n addresses by the 2 bytes per memory cell.
So total memory would be 2^n (address count) * x bytes (cell size)?
"actual memory size of the machine"
I will assume here that you mean the physical address space of the machine in question, ignoring virtual addressing, etc.
Most modern machines are byte-addressable (8-bit), meaning that each address refers to 1 byte. In this case, assuming that you have an n-bit processor with a matching n-bit address bus (there are cases where these don't match, e.g. Pentium processors), the number of memory locations possible is 2^n bytes.
If you have more specialized hardware (embedded microcontrollers, etc.) that is word-addressable (16-bit, 32-bit), then you are correct: you would multiply 2^n * (word size in bits) / 8 to get the number of bytes, as in the sketch below.
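A worked instance of that formula (the 16-bit address and 32-bit word here are made-up example values):

```python
addr_bits = 16      # example: 16-bit address bus
word_bits = 32      # example: 32-bit-word-addressable memory

total_bytes = 2**addr_bits * word_bits // 8
print(total_bytes, "bytes =", total_bytes // 1024, "KiB")  # 262144 bytes = 256 KiB
```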
That being said, when you take into consideration virtual addressing and physical bus sizes that might not match the processor's address lines, you would have to look at that specific machine for its "theoretical limit".

Why is RAM in powers of 2?

Why is the amount of RAM always a power of 2?
512, 1024, etc.
Specifically, what is the difference between using 512, 768, and 1024 RAM for an Android emulator?
Memory is closely tied to the CPU, so making its size a power of two means that multiple modules can be packed together with a minimum of logic to switch between them; only a few bits from the end need to be checked (since the binary representation of the size is 1000...0000 regardless of its magnitude) instead of many more bits were it not a power of two.
Hard drives are not tied to the CPU and not packed in the same manner, so exactness of their size is not required.
from https://superuser.com/questions/235030/why-are-ram-size-usually-in-powers-of-2-512-mb-1-2-4-8-gb
as referenced by BrajeshKumar in the comments on the OP. Thanks Brajesh!
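To make the "only a few bits need to be checked" point concrete, here is a small sketch; the 512 MiB module size is an arbitrary assumption:

```python
MODULE_SIZE = 512 * 1024 * 1024             # 512 MiB per module, a power of two
MODULE_BITS = MODULE_SIZE.bit_length() - 1  # log2(MODULE_SIZE) = 29

def module_select(addr):
    module = addr >> MODULE_BITS            # the top bits alone pick the module
    offset = addr & (MODULE_SIZE - 1)       # the low bits index within the module
    return module, offset

print(module_select(0x2000_0005))  # -> (1, 5): second module, offset 5
```

With a non-power-of-two module size, the same selection would need a full compare or divide instead of a shift and a mask.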
Because computers deal with binary values such as 0 and 1: a register bit is either on (1) or off (0).
So if you use powers of 2, your hardware will use 100% of the range of its address registers.
If computers used ternary values in their circuits, then we'd have memory, processors and anything else in powers of 3.
I think it is related to the number of bits in the address bus (or the bits used to select between address spaces). n bits can address 2^n bytes, so whenever the number of address bits increases to n+1, the addressable space automatically doubles. Manufacturers use the maximum address capacity when adding memory chips to a design.
In the Android emulator, the increase in RAM may make your program more efficient, because when your application exceeds the available RAM, part of the slower non-volatile storage is used instead.

Maximum memory size a system can support

Suppose that I have a computer with an address register of size 16 bits (MAR, for example). The smallest addressable unit in this computer is a word and each word is of size 2 bytes. What is the maximum memory size (in bytes) this system can support?
I thought it would be 2^16 = 65536 bytes, but the part about the smallest addressable unit implies that this is not the way to solve it.
Thanks in advance
There is no direct correlation between the maximum amount of memory a system can support and the size of its address registers.
16-bit computers 30 years ago could very well support more than 64 kilobytes. On the other hand, modern 64-bit processors typically only have address lines for 52 bits (or fewer), but even so, a typical computer cannot come close to supporting 2^52 bytes of memory.
Typical 64-bit computers today could in theory address 16 exbibytes, but present-day CPUs only support 4 petabytes of physical and 256 terabytes of per-process virtual memory. Typical desktop mainboards support 128 GiB at most, if you buy extra-expensive DIMMs. With affordable DIMMs, you're limited to about half as much (there are only so many slots).
Operating systems typically allow main memory sizes only in the hundreds of gigabytes (e.g. 512 GiB for Windows 8 Enterprise/Professional and 128 GiB otherwise, or as little as 16 GiB for Windows 7 Home Premium).
Generally the smallest addressable unit is one byte; as you calculated, if it were one byte here, the answer would be 2^16 * 1 = 65,536 bytes. However, because on this system there are two bytes per address, it is actually 2^16 * 2 = 131,072 bytes.
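The same calculation in a couple of lines, using the numbers from the question:

```python
addr_bits = 16        # 16-bit memory address register (MAR)
bytes_per_word = 2    # smallest addressable unit is a 2-byte word

print(2**addr_bits * bytes_per_word)  # 131072 bytes = 128 KiB
```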
