How does burst-mode DMA speed up data transfer between main memory and I/O devices?

According to Wikipedia, there are three kinds of DMA modes, namely burst mode, cycle stealing mode and transparent mode.
In the Burst Mode, the DMA controller will take over the control of the bus. Before the transfer completes, CPU tasks that need the bus will be suspended. However, in each instruction cycle, the fetch cycle has to reference the main memory. Therefore, during the transfer, the CPU will be idle doing no work, which is essentially the same as being occupied by the transferring work, under interrupt-driven IO.
In my understanding, the cycle stealing mode is essentially the same. The only difference is that in that mode the CPU uses one in every two consecutive cycles, as opposed to being totally idle in burst mode.
Does burst mode DMA make a difference by skipping the fetch and decoding cycles needed when using interrupt-driven I/O and thus accomplish one transfer per clock cycle instead of one instruction cycle and thus speed the process up?
Thanks a lot!

how does burst-mode DMA speed up data transfer between main memory and I/O devices?
There is no "speed up" as you allege, nor is any "speed up" typically necessary/possible. The data transfer is not going to occur any faster than the slower of the source or destination.
The DMA controller will consolidate several individual memory requests into occasional burst requests, so the benefit of burst mode is reduced memory contention due to a reduction in the number of memory arbitrations.
Burst mode combined with a wide memory word improves memory bandwidth utilization. For example, with a 32-bit wide memory, four sequential byte reads consolidated into a single burst could result in only one memory access cycle.
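As a rough illustration of the 32-bit example above, here is a minimal Python sketch that counts memory access cycles with and without consolidation. The function name and the assumption that the bytes are sequential and aligned are mine, not part of any real DMA controller's interface.

```python
import math

def memory_accesses(n_bytes: int, bus_width_bytes: int, burst: bool) -> int:
    """Count memory access cycles for n_bytes sequential, aligned byte reads."""
    if burst:
        # Sequential bytes are consolidated into full-width accesses.
        return math.ceil(n_bytes / bus_width_bytes)
    # Without consolidation, every byte read is its own memory access.
    return n_bytes

individual = memory_accesses(4, 4, burst=False)
consolidated = memory_accesses(4, 4, burst=True)
print(individual, consolidated)
```

With a 4-byte-wide memory, the four individual reads cost four accesses while the consolidated burst costs one, which is the reduction in arbitrations described above.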
Before the transfer completes, CPU tasks that need the bus will be suspended.
The concept of "task" does not exist at this level of operations. There is no "suspension" of anything. At most the CPU has to wait (i.e. insertion of wait states) to gain access to memory.
However, in each instruction cycle, the fetch cycle has to reference the main memory.
Not true. A hit in the instruction cache will make a memory access unnecessary.
Therefore, during the transfer, the CPU will be idle doing no work, which is essentially the same as being occupied by the transferring work, under interrupt-driven IO.
Faulty assumption for every cache hit.
Apparently you are misusing the term "interrupt-driven IO" to really mean programmed I/O using interrupts.
Equating a wait cycle or two to the execution of numerous instructions of an interrupt service routine for programmed I/O is a ridiculous exaggeration.
And "interrupt-driven IO" (in its proper meaning) does not exclude the use of DMA.
In my understanding, the cycle stealing mode is essentially the same.
Then your understanding is incorrect.
If the benefits of DMA are so minuscule or nonexistent as you allege, then how do you explain the existence of DMA controllers, and the preference of using DMA over programmed I/O?
Does burst mode DMA make a difference by skipping the fetch and decoding cycles needed when using interrupt-driven I/O and thus accomplish one transfer per clock cycle instead of one instruction cycle and thus speed the process up?
Comparing DMA to "interrupt-driven I/O" is illogical. See this.
Programmed I/O using interrupts requires a lot more than just the one instruction that you allege.
I'm unfamiliar with any CPU that can read a device port, write that value to main memory, bump the write pointer, and check if the block transfer is complete all with just a single instruction.
And you're completely ignoring the ISR code (e.g. save and then restore processor state) that is required to be executed for each interrupt (that the device would issue for requesting data).
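To make the per-interrupt work concrete, here is a small Python model of programmed I/O using interrupts. It is only a sketch: the step list mirrors the operations named above, "steps" are not cycle counts, and `pio_transfer` is a hypothetical name.

```python
# Hypothetical model: each entry is one piece of work the ISR performs per byte.
ISR_STEPS = (
    "save processor state",
    "read device port",
    "write value to main memory",
    "bump write pointer",
    "check whether block is complete",
    "restore processor state",
)

def pio_transfer(device_data):
    """Move device_data into a buffer, one simulated interrupt per byte."""
    port = list(device_data)      # bytes the device will present, in order
    buf = [0] * len(port)         # destination buffer in "main memory"
    idx = 0
    steps = 0
    while idx < len(buf):         # the device raises one interrupt per byte
        buf[idx] = port.pop(0)    # read port, then write to memory
        idx += 1                  # bump the write pointer
        steps += len(ISR_STEPS)   # every step runs again on each interrupt
    return buf, steps

buf, steps = pio_transfer([0x10, 0x20, 0x30, 0x40])
print(buf, steps)
```

Even in this simplified model, each byte costs several steps, each of which is at least one instruction on a real CPU.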

When used with many older or simpler CPUs, burst mode DMA can speed up data transfer in cases where a peripheral is able to accept data at a rate faster than the CPU itself could supply it. On a typical ARM, for example, a loop like:
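lp:
    ldr r0,[r1,r2]   ; r1 points to address *after* end of buffer; r2 is a negative offset counting up to zero
    strb r0,[r3]     ; store the low byte to the fixed address in r3 (the peripheral's data port)
    lsr r0,r0,#8     ; shift the next byte into the low position
    strb r0,[r3]
    lsr r0,r0,#8
    strb r0,[r3]
    lsr r0,r0,#8
    strb r0,[r3]
    adds r2,#4       ; advance the offset; sets the Z flag once the end of the buffer is reached
    bne lp           ; loop until the offset reaches zero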
would likely take at least 11 cycles for each group of four bytes to transfer (including five 32-bit instruction fetches, one 32-bit data fetch, four 8-bit writes, plus a wasted fetch for the instruction following the loop). A burst-mode DMA operation, by contrast, would only need 5 cycles per group (assuming the receiving device was able to accept data that fast).
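The cycle counts above can be tallied in a few lines of Python. The CPU breakdown comes straight from the text; the split of the DMA's 5 cycles into one 32-bit read plus four 8-bit writes is my assumption.

```python
# Bus cycles per 4-byte group for the CPU loop (counts from the text above).
instruction_fetches = 5   # five 32-bit instruction fetches
data_fetches = 1          # one 32-bit data fetch
byte_writes = 4           # four 8-bit writes to the device
wasted_fetch = 1          # fetch of the instruction following the loop
cpu_cycles = instruction_fetches + data_fetches + byte_writes + wasted_fetch

# Assumed breakdown for DMA: only the data accesses remain.
dma_cycles = data_fetches + byte_writes

speedup = cpu_cycles / dma_cycles
print(cpu_cycles, dma_cycles, round(speedup, 2))
```

Under these assumptions the DMA moves the same four bytes in a bit less than half the bus cycles, because the instruction fetches disappear.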
Because a typical low-end ARM will only use the bus about every other cycle when running most kinds of code, a DMA controller that grabs the bus on every other cycle could allow the CPU to run at almost normal speed while the DMA controller performed one access every other cycle. On some platforms, it may be possible to have a DMA controller perform transfers on every cycle where the CPU isn't doing anything, while giving the CPU priority on cycles where it needs the bus. DMA performance would be highly variable in such a mode (no data would get transferred while running code that needs the bus on every cycle) but DMA operations would have no impact on CPU performance.
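A toy Python simulation of that last mode, where the DMA controller only uses cycles the CPU leaves free. The function and the two CPU access patterns are hypothetical and only meant to show why throughput varies while the CPU never stalls.

```python
def dma_transfers(cpu_needs_bus, cycles):
    """Count DMA transfers when the CPU always has priority on the bus.

    cpu_needs_bus(c) returns True if the CPU wants the bus on cycle c.
    The CPU is never delayed in this mode, so only DMA throughput varies.
    """
    transfers = 0
    for c in range(cycles):
        if not cpu_needs_bus(c):
            transfers += 1   # bus is idle this cycle, so DMA moves one item
    return transfers

def typical_code(c):
    return c % 2 == 0        # CPU uses the bus about every other cycle

def bus_heavy_code(c):
    return True              # CPU needs the bus on every cycle

print(dma_transfers(typical_code, 100), dma_transfers(bus_heavy_code, 100))
```

With the "typical" pattern the DMA gets half of the cycles; with bus-heavy code it gets none, which is the highly variable performance described above.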

Related

Does a DMA controller copy one word of memory at a time?

A DMA controller greatly speeds up memory copy operations because the data in memory doesn't have to be read into the CPU.
From what I've read, DMA controllers can "copy a block of memory from one location to another" in one operation, but thinking about this at a low level, I'm guessing the DMA ultimately has to iterate over memory one word at a time. Is that correct? Is that one word per clock cycle? One word per two clock cycles? (one for read memory into the DMA, one for write to memory) Or does the DMA have a circuit that can somehow (I can't imagine how) copy large chunks of memory in one or two clock cycles?
If the CPU tells the DMA to copy 1024 bytes of memory from one address to another, how many clock cycles will the CPU have free to perform other tasks while waiting for the DMA to finish?
Is it possible to have an architecture where the DMA is doing a memory copy using one bus, while the CPU can access memory at the same time in a different area? Say, in a different bank?
I'm sure it's architecture dependent, so for the answers just pick one or more 8 or 16 bit home micros.
Yes, this is architecture dependent. Usually there is a main memory bus, and one or more caches. All buses in the system are not required to be the same width. Memory buses are usually wider than processor words; a bus might be 64 bits wide, so it loads 64 bits at a time.
Then the destination might be the same memory bus, or PCIE, or even another bus that is memory mapped, in which case the transfers might be constrained by the destination bus width.
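As a back-of-the-envelope sketch for the 1024-byte case from the question, the following Python function counts bus beats on each side of a copy. It assumes aligned data, one beat per bus-width unit, and ignores arbitration and burst setup; the name `copy_beats` is made up for illustration.

```python
import math

def copy_beats(n_bytes, src_width_bytes, dst_width_bytes):
    """Return (read_beats, write_beats) for copying n_bytes between two buses."""
    reads = math.ceil(n_bytes / src_width_bytes)
    writes = math.ceil(n_bytes / dst_width_bytes)
    return reads, writes

# Same 64-bit bus on both sides.
print(copy_beats(1024, 8, 8))
# 64-bit source, 16-bit destination: the narrower side dominates.
print(copy_beats(1024, 8, 2))
```

The second call shows the point about the destination bus: the number of writes grows as the destination gets narrower, regardless of how wide the source is.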
How many clock cycles are available again depends on how things are done. Usually in a µC the DMA triggers an interrupt when it is done, and the CPU does not have to do anything in the meantime. Another option is polling.
Dual-port memories exist, but usually there is only 1 main memory bus. IIRC banks are usually a trick to avoid large addresses, but use the same memory bus.
Cache and bus arbitration are used to mitigate bus contention; the user really shouldn't care about that. You can have a look at your µC datasheet if you want reliable information.

Is memory outside each core always conceptually flat/uniform/synchronous in a multiprocessor system?

Multiprocessor systems perform "real" memory operations (those that influence definitive executions, not just speculative execution) out of order and asynchronously, as waiting for global synchronization of global state would needlessly stall all executions nearly all the time. On the other hand, immediately outside each individual core, it seems that the memory system, starting with L1 cache, is purely synchronous, consistent, and flat from the point of view of allowed behavior (allowed semantics); obviously timing depends on the cache size and behavior.
So on a CPU, at one extreme there are named "registers", which are private by definition, and at the other extreme there is memory, which is shared. It seems a shame that outside the minuscule space of registers, which have peculiar naming or addressing modes, the memory is always global, shared and globally synchronous, and effectively entirely subject to all fences, even if it's memory used as unnamed registers, for the purpose of storing more data than would fit in the few registers, without any possibility of being examined by other threads (except by debugging with ptrace, which obviously stalls, halts, serializes and stores the complete observable state of an execution).
Is that always the case on modern computers (modern = those that can reasonably support C++ and Java)?
Why doesn't the dedicated L1 cache provide register-like semantics for those memory units that are only used by a particular core? The cache must track which memory is shared, no matter what. Memory operations on such local data don't have to be stalled when strict global ordering of memory operations is needed, as no other core is observing it, and the cache has the power to stall such external accesses if needed. The cache would just have to know which memory units are private (not globally readable) until a stall of out-of-order operations, which makes them consistent (the cache would probably need a way to ask the core to serialize operations and publish a consistent state in memory).
Do all CPUs stall and synchronize all memory accesses on a fence or synchronizing operation?
Can the memory be used as an almost infinite register resource not subject to fencing?
In practice, a single core operating on memory that no other threads are accessing doesn't slow down much in order to maintain global memory semantics, vs. how a uniprocessor system could be designed.
But on a big multi-socket system, especially x86, cache-coherency (snooping the other socket) is part of what makes memory latency for accesses that miss in private caches worse than on a single-socket system.
Yes, all multi-core systems that you can run a single multi-threaded program on have coherent shared memory between all cores, using some variant of the MESI cache-coherency protocol. (Any exceptions to this rule are considered exotic and have to be programmed specially.)
Huge systems with multiple separate coherency domains that require explicit flushing are more like a tightly-coupled cluster for efficient message passing, not an SMP system. (Normal NUMA multi-socket systems are cache-coherent: "Is mov + mfence safe on NUMA?" goes into detail for x86 specifically.)
While a core has a cache line in MESI Modified or Exclusive state, it can modify it without notifying other cores about changes. M and E states in one cache mean that no other caches in the system have any valid copy of the line. But loads and stores still have to respect the memory model, e.g. an x86 core still has to commit stores to L1d cache in program order.
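The M/E behaviour can be sketched as a tiny state function in Python. This is a simplified textbook MESI model with names of my choosing, not the protocol of any specific CPU (real designs use variants such as MESIF or MOESI).

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

def local_write(state):
    """Return (new_state, needs_coherence_traffic) for a store by the owning core."""
    if state in (State.M, State.E):
        # No other cache holds a valid copy, so the line can be modified silently.
        return State.M, False
    # Shared: other copies must be invalidated first.
    # Invalid: the line must be fetched with ownership first.
    return State.M, True

for s in State:
    print(s.name, local_write(s))
```

Only the S and I cases generate traffic visible to other cores; stores to lines already held in M or E stay local to the cache.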
L1d and L2 are part of a modern CPU core, but you're right that L1d is not actually modified speculatively. It can be read speculatively.
Most of what you're asking about is handled by a store buffer with store forwarding, allowing store/reload to execute without waiting for the store to become globally visible.
See "what is a store buffer?" and "Size of store buffers on Intel hardware? What exactly is a store buffer?"
A store buffer is essential for decoupling speculative out-of-order execution (writing data+address into the store buffer) from in-order commit to globally-visible L1d cache.
It's very important even for an in-order core, otherwise cache-miss stores would stall execution. And generally you want a store buffer to coalesce consecutive narrow stores into a single wider cache write, especially for weakly-ordered uarches that can do so aggressively; many non-x86 microarchitectures only have fully efficient commit to cache for aligned 4-byte or wider chunks.
On a strongly-ordered memory model, speculative out-of-order loads and checking later to see if any other core invalidated the line before we're "allowed" to read it is also essential for high performance, allowing hit-under-miss for out-of-order exec to continue instead of one cache miss load stalling all other loads.
There are some limitations to this model:
limited store-buffer size means we don't have much private store/reload space
a strongly-ordered memory model stops private stores from committing to L1d out of order, so a store to a shared variable that has to wait for the line from another core could result in the store buffer filling up with private stores.
memory barrier instructions like x86 mfence or lock add, or ARM dsb ish have to drain the store buffer, so stores to (and reloads from) thread-private memory that's not in practice shared still have to wait for stores you care about to become globally visible.
conversely, waiting for a shared store you care about to become visible (with a barrier or a release-store) has to also wait for private memory operations even if they're independent.
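For readers who want something concrete, here is a toy Python model of a store buffer with store forwarding. It is a sketch under heavy simplification: a dict stands in for globally visible L1d, there is no capacity limit, and `drain` plays the role of a full barrier. All names are hypothetical.

```python
class StoreBuffer:
    """Toy store buffer: stores stay private until drained in program order."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for globally visible cache
        self.pending = []      # (address, value) pairs in program order

    def store(self, addr, value):
        # Executing a store only writes data + address into the buffer.
        self.pending.append((addr, value))

    def load(self, addr):
        # Store forwarding: the youngest pending store to this address wins.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain(self):
        # Commit everything in program order, as a barrier would require.
        for a, v in self.pending:
            self.memory[a] = v
        self.pending.clear()

mem = {}
sb = StoreBuffer(mem)
sb.store(0x10, 7)
print(sb.load(0x10), 0x10 in mem)   # reload sees the store before it is global
sb.drain()
print(mem[0x10])
```

The reload succeeds before the store is globally visible, which is the private store/reload behaviour described above, while `drain` commits private and shared stores together, which is the limitation in the last two points.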
the memory is always global, shared and globally synchronous, and
effectively entirely subject to all fences, even if it's memory used
as unnamed registers,
I'm not sure what you mean here. If a thread is accessing private data (i.e., not shared with any other thread), then there is almost no need for memory fence instructions (1). Fences are used to control the order in which memory accesses from one core are seen by other cores.
Why doesn't the dedicated L1 cache provide register-like semantics for
those memory units that are only used by a particular execution unit?
I think (if I understand you correctly) what you're describing is called a scratchpad memory (SPM), which is a hardware memory structure that is mapped to the architectural physical address space or has its own physical address space. The software can directly access any location in an SPM, similar to main memory. However, unlike main memory, SPM has a higher bandwidth and/or lower latency than main memory, but is typically much smaller in size.
SPM is much simpler than a cache because it doesn't need tags, MSHRs, a replacement policy, or hardware prefetchers. In addition, the coherence of SPM works like main memory, i.e., it comes into play only when there are multiple processors.
SPM has been used in many commercial hardware accelerators such as GPUs, DSPs, and manycore processors. One example I am familiar with is the MCDRAM of the Knights Landing (KNL) manycore processor, which can be configured to work as near memory (i.e., an SPM), a last-level cache for main memory, or as a hybrid. The portion of the MCDRAM that is configured to work as SPM is mapped to the same physical address space as DRAM and the L2 cache (which is private to each tile) becomes the last-level cache for that portion of MCDRAM. If there is a portion of MCDRAM that is configured as a cache for DRAM, then it would be the last-level cache of DRAM only and not the SPM portion. MCDRAM has a much higher bandwidth than DRAM, but the latency is about the same.
In general, SPM can be placed anywhere in the memory hierarchy. For example, it could be placed at the same level as the L1 cache. SPM improves performance and reduces energy consumption when there is no or little need to move data between SPM and DRAM.
SPM is very suitable for systems with real-time requirements because it provides guarantees regarding the maximum latency and/or lowest bandwidth, which is necessary to determine with certainty whether real-time constraints can be met.
SPM is not very suitable for general-purpose desktop or server systems where there can be multiple applications running concurrently. Such systems don't have real-time requirements and, currently, the average bandwidth demand doesn't justify the cost of including something like MCDRAM. Moreover, using an SPM at the L1 or L2 level imposes size constraints on the SPM and the caches and makes it difficult for the OS and applications to exploit such a memory hierarchy.
Intel Optane DC memory can be mapped to the physical address space, but it is at the same level as main memory, so it's not considered an SPM.
Footnotes:
(1) Memory fences may still be needed in single-thread (or uniprocessor) scenarios. For example, if you want to measure the execution time of a specific region of code on an out-of-order processor, it may be necessary to wrap the region between two suitable fence instructions. Fences are also required when communicating with an I/O device through write-combining memory-mapped I/O pages to ensure that all earlier stores have reached the device.

How DMA and PCIe play together?

In a PCIe configuration, devices have dedicated addresses and they send data in peer-to-peer mode to each other: every device can write whenever it wants and the switches take care of passing the data forward correctly. There is no need to have a "bus master" that decides when and how data will be transmitted.
How does DMA come into play in such a configuration? To me it seems that DMA is an outdated feature, which is not needed in a PCIe configuration. Every device can send data to the main memory, or read from it; obviously the main memory will always be the "slave" in such operations.
Or is there some other functionality of DMA, which I am missing?
Thank you in advance!
When a device other than a CPU accesses memory that is attached to a CPU, this is called direct memory access (DMA). So any PCIe read or write requests issued from PCIe devices constitute DMA operations. This can be extended with 'device to device' or 'peer to peer' DMA where devices perform reads and writes against each other without involving the CPU or system memory.
There are two main advantages of DMA: First, DMA operations can move data into and out of memory with minimal CPU load, improving software efficiency. Second, the CPU can only issue reads and writes of whatever the CPU word size is, which results in very poor throughput over the PCIe bus due to TLP headers and other protocol overheads. Devices directly issuing read and write requests can issue read and write operations with much larger payloads, resulting in higher throughput and more efficient use of the bus bandwidth.
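The throughput argument can be put into numbers with a small Python sketch. The 24-byte per-packet overhead is an assumed round figure for header, link-layer and framing bytes; the real value depends on the PCIe generation and header format, so treat the results as illustrative only.

```python
def tlp_efficiency(payload_bytes, overhead_bytes=24):
    """Fraction of link bytes that carry payload (overhead_bytes is an assumption)."""
    return payload_bytes / (payload_bytes + overhead_bytes)

# CPU-sized write (8 bytes) versus a device-issued write with a larger payload.
for payload in (8, 64, 256):
    print(payload, round(tlp_efficiency(payload), 3))
```

With this assumed overhead, an 8-byte write spends three quarters of the link bytes on overhead, while a 256-byte payload is above 90% efficient, which is why device-issued requests make better use of the bus.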
So, DMA is absolutely not obsolete or outdated - basically all high-performance devices connected over PCIe will use DMA to use the bus efficiently.

cache read system memory vs cpu read system memory

On an ARM-based SoC running Android/Linux, I observed the following:
Method 1: Allocate a memory area as un-cached for device DMA input. After the DMA finishes, the content of this memory area is copied to another system memory area.
Method 2: Allocate a memory area as cached for device DMA input. After the DMA finishes, invalidate the memory range, then copy the content of this memory area to another system memory area.
The size of the memory area allocated is about 2MB, which is larger than the cache size (the L2 cache size is 256KB).
Method 2 is 10x faster than method 1. That is, the memory copy operation of method 2 is 10x faster than that of method 1.
I speculate that in method 2 the cache reads from system memory in cache-line-sized units during the copy, whereas in method 1 the CPU has to read from system memory in bus-transaction-sized units, bypassing the cache hardware.
However, I cannot find an explicit explanation. I would appreciate a detailed explanation from anyone who can help.
There are so many hardware items involved that it is difficult to give specifics. The SOC determines a lot of this. However, what you observe is typical in performance terms for modern ARM systems.
The main factor is SDRAM. All DRAM is structured with 'rows' and 'columns' (see DRAM history). On the DRAM chip, an entire 'row' can be read at one time. I.e., there is a matrix of transistors and there is a physical point/wiring where an entire row can be read (in fact there may be SRAM to store the ROW on the chip). When you read from another 'row', you need to 'un-charge/pre-charge' the wiring to access the new 'row'. This takes some time. The main point is that DRAM can read sequential memory very fast in large chunks. Also, there is no command overhead as the memory streams out with each clock edge.
If you mark memory as un-cached, then a CPU/SOC may issue single beat reads. Often these will 'pre-charge' consuming extra cycles during a single read/write and many extra commands must be sent to the DRAM device.
SDRAM also has 'banks'. A bank has a separate 'ROW' buffer (static RAM/multi-transistor memory) which allows you to read from one bank to another without having to recharge/re-read. The banks are often very far apart. If your OS has physically allocated the 'un-cached' memory in a different bank from the 2nd 'cached' area, then this will also add some efficiency. It is common in an OS to manage cached/un-cached memory separately (for MMU issues). The memory pools are often distant enough to be in separate banks.
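To illustrate the row and bank effect, here is a toy open-row model in Python that counts row activations for a copy loop. The address-to-bank mapping, row size and bank count are invented for the example; real SDRAM controllers map addresses differently.

```python
def row_activations(addresses, row_bytes=2048, num_banks=4):
    """Count row activations, assuming each bank keeps its last row open."""
    open_row = {}                      # bank -> currently open row
    activations = 0
    for addr in addresses:
        row = addr // row_bytes
        bank = row % num_banks         # hypothetical mapping: rows interleave across banks
        if open_row.get(bank) != row:
            activations += 1           # precharge + activate needed
            open_row[bank] = row
    return activations

def copy_pattern(src, dst, n_bytes, word=4):
    """Alternate one read from src and one write to dst, word by word."""
    for off in range(0, n_bytes, word):
        yield src + off
        yield dst + off

same_bank = row_activations(copy_pattern(0, 4 * 2048, 2048))
other_bank = row_activations(copy_pattern(0, 1 * 2048, 2048))
print(same_bank, other_bank)
```

When source and destination rows share a bank, every access reopens a row; when they sit in different banks, each row is opened once and the rest of the copy streams from the open rows.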

Does a one cycle instruction take one cycle, even if RAM is slow?

I am using an embedded RISC processor. There is one basic thing I have a problem figuring out.
The CPU manual clearly states that the instruction ld r1, [p1] (in C: r1 = *p1) takes one cycle. Size of register r1 is 32 bits. However, the memory bus is only 16 bits wide. So how can it fetch all data in one cycle?
The clock counts in the manual assume full-width, zero-wait-state memory. The time it takes for the core to execute that instruction is one clock cycle.
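As a rough model of what happens outside the core, this Python sketch counts bus cycles for a load. It assumes each bus transfer takes one cycle plus a fixed number of wait states, which is a simplification; `bus_cycles` is a hypothetical helper, not something from the CPU manual.

```python
import math

def bus_cycles(data_bits, bus_bits, wait_states=0):
    """Bus cycles to move data_bits over a bus_bits-wide bus."""
    transfers = math.ceil(data_bits / bus_bits)   # a narrow bus needs several transfers
    return transfers * (1 + wait_states)

print(bus_cycles(32, 32))                 # the case the manual's figure assumes
print(bus_cycles(32, 16))                 # 32-bit load over a 16-bit bus
print(bus_cycles(32, 16, wait_states=2))  # same, with slow memory
```

The core still "executes" the load in one cycle in the manual's sense; the extra bus cycles show up as stalls.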
There was a time when each instruction took a different number of clock cycles. Memory was relatively fast then too, usually zero wait state. There was a time before pipelines as well where you had to burn a clock cycle fetching, then a clock cycle decoding, then a clock cycle executing, plus extra clock cycles for variable length instructions and extra clock cycles if the instruction had a memory operation.
Today clock speeds are high and chip real estate is relatively cheap, so a one clock cycle add or multiply is the norm, as are pipelines and caches. Processor clock speed is no longer the determining factor for performance. Memory is relatively expensive and slow. So caches (configuration, number of and size), bus size, memory speed, and peripheral speed determine the overall performance of a system. Normally increasing the processor clock speed but not the memory or peripherals will show minimal if any performance gain; on some occasions it can make it slower.
Memory size and wait states are not part of the clock execution spec in the reference manual; they are talking about only what the core itself costs you in units of clocks for each of the instructions. If it is a Harvard architecture where the instruction and data bus are separate, then one clock is possible with the memory cycle. The fetch of the instruction happens at least the prior clock cycle if not before that, so at the beginning of the clock cycle the instruction is ready; decode and execute (the read memory cycle) happen during that one clock, and at the end of the clock cycle the result of the read is latched into the register. If the instruction and data bus are shared, then you could argue that it still finishes in one clock cycle, but you do not get to fetch the next instruction so there is a bit of a stall there; they might cheat and call that one clock cycle.
My understanding is: when we say an instruction takes one cycle, it does not mean that the instruction will be finished in one cycle. We should take the instruction pipeline into account. Suppose your CPU has a 5-stage pipeline; that instruction would take 5 cycles if it were executed sequentially.
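The difference between latency and throughput in a pipeline can be shown with the usual idealized formula, sketched here in Python. It assumes no stalls or hazards, and the function names are mine.

```python
def pipelined_cycles(n_instructions, stages=5):
    """Cycles to finish n instructions on an ideal pipeline (no stalls)."""
    if n_instructions == 0:
        return 0
    # The first instruction needs `stages` cycles; each later one completes a cycle after.
    return stages + n_instructions - 1

def sequential_cycles(n_instructions, stages=5):
    """Cycles if each instruction ran through all stages before the next began."""
    return stages * n_instructions

print(pipelined_cycles(1), pipelined_cycles(100), sequential_cycles(100))
```

A single instruction still takes 5 cycles from fetch to completion, but over a long run the pipeline finishes roughly one instruction per cycle, which is what "one cycle" in the manual refers to.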

Resources