I am designing a simple memory controller and PHY on an Artix-7 FPGA but am having problems reading the datasheet. The timings in the memory part's datasheet (and in the JEDEC JESD79-3F doc) are expressed in CK/tCK/nCK units, which seem ambiguous to me when the memory is not run at its nominal frequency (e.g. at a clock lower than 666 MHz for a 1333 MT/s module).
If I run a 1333 MT/s module at a frequency of 300 MHz -- still allowed with the DLL on, as per the datasheet speed bins -- is the CK/tCK/nCK unit equal to 1.5 ns (from the module's native 666 MHz), or 3.33 ns (from the frequency it is actually run at)? On one hand it makes sense that certain delays are constant, but then again some delays are expressed relative to the clock edges on the CK/CK# pins (like CL or CWL).
That is to say, some timing parameters in the datasheet only change when changing speed bins. E.g. tRP is 13.5 ns for a 1333 part, which is also backwards compatible with the tRP of 13.125 ns of a 1066 part -- no matter the chosen operating frequency of the physical clock pins of the device.
But then, running a DDR3 module at 300 MHz only allows usage of CL = CWL = 5, which is again expressed in "CK" units. To my understanding, this means 5 periods of the input clock, i.e. 5 * 3.33 ns.
I suppose all I am asking is whether the "CK" (or nCK or tCK) unit is tied to the chosen speed bin (tCK = 1.5 ns when choosing DDR3-1333) or to the actual frequency of the clock signal provided to the memory module by the controlling hardware (e.g. 3.33 ns when running at 600 MT/s)?
This is the response from u/Allan-H on Reddit, which helped me reach a conclusion:
When you set the CL in the mode register, that's the number of clocks that the chip will wait before putting the data on the pins. That clock is the clock that your controller is providing to the chip (it's SDRAM, after all).
It's your responsibility to ensure that the number of clocks you program (e.g. CL=5) when multiplied by the clock period (e.g. 1.875ns) is at least as long as the access time of the RAM. Note that you program a number of clocks, but the important parameter is actually time. The RAM must have the data ready before it can send it to the output buffers.
Now let's run the RAM at a lower speed, say 312.5MHz (3.2ns period). We now have the option of programming CL to be as low as 3, since 3 x 3.2ns > 5 x 1.875ns.
BTW, since we are dealing with fractions of a ns, we also need to take the clock jitter into account.
Counterintuitively, the DRAM chip doesn't know how fast it is; it must be programmed with that information by the DRAM controller. That information might be hard coded into the controller (e.g. for an FPGA implementation) or by software which would typically read the SPD EEPROM on the DIMM to work out the speed grade then write the appropriate values into the DRAM controller.
This also explains timing values defined as e.g. "Greater of 3CK or 5ns". In this case, the memory chip cannot respond faster than 5 ns, but the internal logic also needs 3 positive clock edges on the input CK pins to complete the action defined by this example parameter.
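To make this concrete, here is a minimal sketch (plain Python, not tied to any particular controller or vendor tool) of how the nanosecond-valued parameters turn into clock counts at the clock actually provided, including the "greater of n CK or X ns" rule; the 300 MHz clock and the tRP value are the ones discussed above.

```python
# Minimal sketch: convert analog (ns) DDR3 timing parameters into clock
# counts (nCK) at the *actual* operating frequency, rounding up.
import math

t_ck_ns = 1000.0 / 300.0        # actual clock period at 300 MHz: ~3.333 ns

def to_nck(t_ns, t_ck_ns=t_ck_ns):
    """Round a nanosecond-valued parameter up to a whole number of clocks."""
    return math.ceil(t_ns / t_ck_ns)

def greater_of(n_ck, t_ns, t_ck_ns=t_ck_ns):
    """Handle 'greater of n CK or X ns' style parameters."""
    return max(n_ck, to_nck(t_ns, t_ck_ns))

print(to_nck(13.5))        # tRP = 13.5 ns -> 5 clocks at 300 MHz
print(greater_of(3, 5.0))  # 'greater of 3 CK or 5 ns' -> 3 clocks at 300 MHz
```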
My OpenCL device's memory-relevant specs are:
Max compute units: 20
Global memory channels (AMD): 8
Global memory banks per channel (AMD): 4
Global memory bank width (AMD): 256 bytes
Global memory cache line size: 64 bytes
Does this mean that, to use the device's full memory potential, I need 8 work items on different CUs constantly reading 64-byte chunks of memory? Are memory channels arranged so that different CUs can access memory simultaneously? Are 64-byte memory reads always treated as single reads, or only if the address is % 64 == 0?
Does the memory bank quantity/width have anything to do with memory bandwidth, and is there a way to reason about memory performance with respect to memory banks when writing a kernel?
Memory bank quantity is mainly a hint about strided access pattern performance and bank conflicts.
The cache line width must be the L2 cache line used between L2 and the CU (L1). 64 bytes per cycle means 64 GB/s per compute unit (assuming only 1 active cache line per CU at a time and a 1 GHz clock); there can also be multiple active lines per L1, e.g. 4. With 20 compute units, the total "L2 to L1" bandwidth would be 1.28 TB/s, but its main advantage over global memory is that fewer clock cycles are needed to fetch data.
If you need to utilize global memory, then you need to approach bandwidth limits between L2 and main memory. That is related to memory channel width, number of memory channels and frequency.
A GDDR channel is 64 bits wide, an HBM channel is 128 bits wide. A single stack of HBM v1 has 8 channels, so that's a total of 1024 bits or 128 bytes. 128 bytes per cycle means 128 GB/s per GHz. More stacks mean more bandwidth: if 8 GB of memory is made of two stacks, that's 256 GB/s.
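As a rough sketch of that arithmetic (the channel widths and the 1 GHz transfer rate are assumptions carried over from this answer, not values queried from a device):

```python
# Back-of-envelope peak bandwidth from channel count, channel width and clock.
def peak_bandwidth_gb_s(channels, channel_width_bits, clock_ghz):
    bytes_per_cycle = channels * channel_width_bits / 8
    return bytes_per_cycle * clock_ghz        # one transfer per cycle assumed

print(peak_bandwidth_gb_s(8, 128, 1.0))   # one HBM v1 stack: 128.0 GB/s per GHz
print(peak_bandwidth_gb_s(16, 128, 1.0))  # two stacks: 256.0 GB/s per GHz
print(peak_bandwidth_gb_s(8, 64, 1.0))    # 8 GDDR-style channels: 64.0 GB/s
                                          # per GHz of effective transfer rate
```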
If your data-set fits inside L2 cache, then you expect more bandwidth under repeated access.
But the true performance (instead of on paper) can be measured by a simple benchmark that does pipelined memory copy between two arrays.
The total performance of 8 work items depends on the capability of the compute unit. If it allows only 32 bytes per clock per work item, you may need more work items. The compute unit should have some optimization stage, such as packing similar addresses into one big memory access, so you can even reach maximum performance with a single work group (but with multiple work items, not just 1; the number depends on how big an object each work item accesses and on the CU's capability). You can benchmark this with an array-summation or reduction kernel. A single compute unit is generally more than enough to use up the global memory bandwidth, unless its L2-L1 bandwidth is lower than the global memory bandwidth -- which may not hold for the highest-end cards.
What is the parallelism between L2 and L1 for your card? Only 1 active line at a time? Then you probably require 8 work items distributed over 8 work groups.
According to AMD's RDNA documentation, each shader can keep 10-20 requests in flight, so if the L1-L2 path of a single RDNA compute unit is enough to use all of the global memory bandwidth, then even just a few work items from a single work group should be enough.
L1-L2 bandwidth:
It says 4 lines can be active between each L1 and the L2, so that should be 256 GB/s per compute unit, and 4 work groups running on different CUs should be enough for 1 TB/s of main memory bandwidth. I guess OpenCL has no access to this information, and it can change for new cards, so the best thing would be to benchmark various settings, from 1 CU to N CUs and from 1 work item to N work items. It shouldn't take much time to measure under no contention (i.e. when the whole GPU server is dedicated to you).
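Written out, assuming the 4 active lines, 64-byte line and 1 GHz cache clock mentioned here (none of which OpenCL reports directly):

```python
import math

lines_per_cu, line_bytes, clock_ghz = 4, 64, 1.0   # assumed values from above
per_cu_gb_s = lines_per_cu * line_bytes * clock_ghz        # L1-L2 GB/s per CU
target_gb_s = 1000                                         # ~1 TB/s main memory
print(per_cu_gb_s, math.ceil(target_gb_s / per_cu_gb_s))   # 256.0  -> 4 CUs needed
```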
Shader bandwidth:
If these are per-shader limits, then a single shader can use all of its own CU's L1-L2 bandwidth, especially when reading.
It also says the L0-L1 cache line size is 128 bytes, so 1 work item could use a data type that wide.
N-way set-associative caches (L1 and L2 here) and direct-mapped caches (maybe the texture cache?) use modulo mapping, but an LRU cache (L0 here) may not require modulo access. Since you need global memory bandwidth, you should look at the L2 cache line, which is n-way set-associative, hence the modulo. Even if the data is already in L0, the OpenCL spec may not let you do non-modulo-aligned access to it. Also, you don't have to think about alignment if the array is of the same type as the data you work with.
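A toy illustration of that modulo (set-index) mapping; the line size and set count are made-up example values, not the GPU's actual cache geometry:

```python
line_bytes, num_sets = 64, 256        # hypothetical cache geometry

def cache_set(addr):
    # set index = (address / line size) mod number of sets
    return (addr // line_bytes) % num_sets

# Addresses 16 KiB apart land in the same set, which is why large power-of-two
# strides can cause conflicts in set-associative caches.
print(cache_set(0x0000), cache_set(0x4000))   # 0 0
```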
If you don't want to fiddle with microbenchmarking and don't know how many work items are required, you can use the async work-group copy commands in the kernel. The async copy implementation uses just the required number of shaders (or possibly no shaders at all, depending on the hardware). Then you can access local memory quickly, even from a single work item.
But a single work item may need an unrolled loop to do enough pipelining to use all the bandwidth of its CU; just a single read/write operation will not fill the pipeline and will leave the latency visible (not hidden behind other latencies).
Note: the L2 clock frequency can differ from the main memory frequency and is not necessarily 1 GHz; there could be an L3 cache or something else in between to bridge the different frequencies. Perhaps it runs at the GPU frequency, e.g. 2 GHz, in which case all of the L1/L0 bandwidths are also higher, e.g. 512 GB/s per L1-L2 link. You may need to query CL_DEVICE_MAX_CLOCK_FREQUENCY for this. In any case, just 1 CU looks capable of using the full bandwidth of 90% of high-end cards. An RX 6800 XT has 512 GB/s of main memory bandwidth and a 2 GHz GPU clock, so it can likely saturate it with only 1 CU.
I want to check if my understanding is correct, but I cannot find any precise explanation or examples. Let's say I have UART communication set to 57600 bits/second and I am transferring 8-bit chars. Let's say I choose no parity, and since I need one start bit and one stop bit, that means that essentially for transferring one char I need to transfer 10 bits. Does that mean the transfer speed would be 5760 chars/second?
Your calculation is essentially correct.
But the 5760 chars/second would be the maximum transfer rate. Since it's an asynchronous link, the UART transmitter is allowed to idle the line between character frames.
IOW the baud rate only applies to the bits of the character frame.
The rate that characters are transmitted depends on whether there is data available to keep the transmitter busy/saturated.
For instance, if a microcontroller used programmed I/O (with either polling or interrupts) instead of DMA for UART transmitting, high-priority interrupts could stall the transmissions and introduce delays between frames.
Baud rate = 57600
Time for 1 bit: 1 / 57600 s ≈ 17.36 us
Time for a frame with 10 bits ≈ 173.6 us
This means a maximum of 1 / 173.6 us ≈ 5760 frames (characters) / s
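The same arithmetic, generalized to other frame formats (plain Python, nothing hardware-specific assumed):

```python
def max_char_rate(baud, data_bits=8, parity_bits=0, start_bits=1, stop_bits=1):
    # Upper bound only: an asynchronous transmitter may idle between frames.
    frame_bits = start_bits + data_bits + parity_bits + stop_bits
    return baud / frame_bits

print(max_char_rate(57600))                 # 5760.0 chars/s for 8N1
print(max_char_rate(57600, parity_bits=1))  # ~5236 chars/s for 8E1 or 8O1
```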
I intend to use the BeagleBone to sample a shaped signal on the order of 1 microsecond. I need to fit the signal afterwards, and therefore I would like a sampling rate of, say, 10 MHz -- something that seems feasible with the PRU and libpruio. The point is, looking at the ADC specifications it seems there is a limit at 200 kHz. Is my reasoning correct?
thanks
You'll need additional hardware for a sampling rate of 10 MHz! libpruio isn't designed to work at that speed, and neither is the BBB hardware.
The ADC subsystem in the AM335x CPU is clocked at 24 MHz and needs 15 cycles per sample (14 in continuous mode). This leads to a maximum sample rate of 1.6 (about 1.71) MSamples/s. See the SRM, chapter 12, for details.
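The sample-rate figures written out (the ADC clock and cycle counts are the ones stated above, not measured):

```python
adc_clock_hz = 24_000_000
print(adc_clock_hz / 15 / 1e6)   # one-shot: 15 cycles/sample -> 1.6 MSamples/s
print(adc_clock_hz / 14 / 1e6)   # continuous: 14 cycles/sample -> ~1.71 MSamples/s
```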
The problem is getting the samples into host memory. I couldn't get this working faster than ~250 kSamples/s (by CPU access -- I didn't try DMA).
As long as you don't need more values than the FIFO can hold, you can sample a single line at a maximum of 1.7 MHz.
BR
Why is the amount of RAM always a power of 2?
512, 1024, etc.
Specifically, what is the difference between using 512, 768, and 1024 MB of RAM for an Android emulator?
Memory is closely tied to the CPU, so making their size a power of two means that multiple modules can be packed requiring a minimum of logic in order to switch between them; only a few bits from the end need to be checked (since the binary representation of the size is 1000...0000 regardless of its size) instead of many more bits were it not a power of two.
Hard drives are not tied to the CPU and not packed in the same manner, so exactness of their size is not required.
from https://superuser.com/questions/235030/why-are-ram-size-usually-in-powers-of-2-512-mb-1-2-4-8-gb
as referenced by BrajeshKumar in the comments on the OP. Thanks Brajesh!
Because computers deal with binary values, 0 and 1: registers are either on (1) or off (0).
So if you use powers of 2, your hardware will use 100% of the registers.
If computers used ternary values in their circuits, then we'd have memory, processors and anything else in powers of 3.
I think it is related to the number of bits in the address bus (or the bits used to select between address spaces). n bits can address 2^n bytes, so whenever the number of address bits increases to n+1, the addressable space automatically doubles. Manufacturers use their maximum addressing capacity when including memory chips in the design.
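A quick illustration of the 2^n point, with arbitrary example bit widths:

```python
for n in (29, 30):
    print(n, "address bits ->", 2**n // 2**20, "MiB addressable")
# 29 bits -> 512 MiB, 30 bits -> 1024 MiB; a 768 MiB configuration would leave
# part of the 30-bit address space unused.
```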
In the Android emulator, increasing the RAM may make your program run better, because when your application exceeds the available RAM, part of the slower ROM (non-volatile memory) is used instead.
I'm trying to learn about DMA transfer rates and I don't understand this question. I have the answers but don't know how to get there.
This question concerns the use of DMA to handle the input and storage in memory of data arriving at an input interface, the data rates that can be achieved using this mechanism, and the bus bandwidth (capacity) used for particular data rates.
You are given details of the clock cycles executed for each DMA transfer, and of the clock cycles needed to acquire and release the busses.
Below you are given: the number of clock cycles required for the DMA device to transfer a single data item between the input interface and memory, the number of clock cycles to acquire and release the system busses, the size (in bits) of each data item, and the clock frequency.
number of clock cycles for each data transfer = 8
number of clock cycles to acquire and release busses = 4
number of bits per data item = 8
clock frequency = 20 MHz
A) What is the maximum achievable data rate in Kbits/second?
B) What percentage of the bus clocks are used by the DMA device if the data rate is 267Kbits/sec?
Answers
A)20000.0
B)2.0
Thanks in advance.
There are two modes of transferring data using DMA:
1. Burst Mode
Once the DMA controller is granted access to the system bus by the CPU, it transfers all bytes of data in the data block before releasing control of the system buses back to the CPU. The CPU is blocked from using the memory bus for a long time: the DMA controller will not release bus access until the entire block of data has been transferred.
2. Cycle Stealing Mode
Once the DMA controller is granted access to the system bus by the CPU, it transfers one byte of data and then releases the memory bus back to the CPU. For the next byte it has to acquire bus access from the CPU again via the BR and BG signals (BUS REQUEST and BUS GRANT). It acquires and releases bus access for every byte until the entire block of data has been transferred.
In the above example:
The clock frequency is 20 MHz (Hz means cycles per second), i.e. 20 million clock cycles per second (20 x 10^6 cycles/second).
Every byte transferred between the I/O interface and memory takes 8 clock cycles, and there are 20 x 10^6 clock cycles available per second. In cycle stealing mode, every byte additionally needs 4 clock cycles for bus grant and release, so transferring one byte between the I/O interface and memory requires 12 clock cycles. That means 2/3 of the clock cycles are used for data transfer and 1/3 for bus access. During the transfer itself, one clock cycle moves one bit of data, so 2/3 of the 20 million clock cycles transfer data: about 13333 kbits/s between the I/O interface and memory. The maximum achievable data rate in this mode is therefore 13333 kbits/s, and 2% of that is approximately 267 kbits/s.
In burst mode, once the DMA controller has acquired the bus, it only releases it after the entire transfer, so essentially all 20000 x 10^3 clock cycles per second are used to transfer 20000 x 10^3 bits, i.e. 20000 kbits/s; only the initial 4 clock cycles are spent on bus access, so the maximum achievable data rate is approximately 20000 kbits/s.
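Putting the numbers from the question into a short sketch (cycle counts and clock frequency are those given above):

```python
clock_hz      = 20_000_000
bits_per_item = 8
xfer_cycles   = 8      # cycles to move one data item
bus_cycles    = 4      # cycles to acquire and release the busses

# Burst mode: bus overhead is paid once, so effectively 8 cycles per 8 bits.
burst_kbps = clock_hz * bits_per_item / xfer_cycles / 1000
print(burst_kbps)                        # 20000.0 kbits/s  -> answer A

# Cycle-stealing mode: 12 cycles per 8-bit item.
steal_kbps = clock_hz * bits_per_item / (xfer_cycles + bus_cycles) / 1000
print(steal_kbps)                        # ~13333 kbits/s

# Bus utilisation at 267 kbits/s in cycle-stealing mode (answer B).
items_per_second = 267_000 / bits_per_item
cycles_used = items_per_second * (xfer_cycles + bus_cycles)
print(round(100 * cycles_used / clock_hz, 1))   # 2.0 %
```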
To find the maximum achievable data rate, divide the number of bits per data item by the number of clock cycles each transfer takes and multiply by the clock frequency:
max data rate = (number of bits per data item / number of clock cycles per data transfer) x clock frequency = (8 / 8) x 20 MHz = 20 Mbits/s = 20000 kbits/s
I'm stuck on the second part so if you managed to find the answer, please share :)