How do DMA and PCIe play together?

In a PCIe configuration, devices have dedicated addresses and send data to each other in peer-to-peer mode - every device can write whenever it wants, and the switches take care of forwarding the data correctly. There is no need for a "bus master" that decides when and how data will be transmitted.
How does DMA come into play in such a configuration? To me it seems that DMA is an outdated feature that is not needed in a PCIe configuration. Every device can send data to the main memory, or read from it - obviously the main memory will always be the "slave" in such operations.
Or is there some other functionality of DMA, which I am missing?
Thank you in advance!

When a device other than a CPU accesses memory that is attached to a CPU, this is called direct memory access (DMA). So any PCIe read or write requests issued from PCIe devices constitute DMA operations. This can be extended with 'device to device' or 'peer to peer' DMA where devices perform reads and writes against each other without involving the CPU or system memory.
There are two main advantages of DMA. First, DMA operations can move data into and out of memory with minimal CPU load, improving software efficiency. Second, the CPU can only issue reads and writes of its word size, which results in very poor throughput over the PCIe bus due to TLP headers and other protocol overheads. Devices issuing their own read and write requests can use much larger payloads, resulting in higher throughput and more efficient use of the bus bandwidth.
So DMA is absolutely not obsolete or outdated - essentially every high-performance device connected over PCIe uses DMA to make efficient use of the bus.
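To make this concrete, here is a minimal sketch of the usual Linux pattern: the driver allocates a DMA-able buffer and hands its bus address to the device, which then masters the bus itself. dma_alloc_coherent() is the standard kernel API; the register offsets and the 64-bit address register belong to a hypothetical device.

/* Minimal sketch: give a (hypothetical) PCIe device a buffer to DMA into. */
#include <linux/pci.h>
#include <linux/io.h>
#include <linux/dma-mapping.h>

#define REG_DMA_ADDR 0x10  /* hypothetical: bus address of the buffer     */
#define REG_DMA_LEN  0x18  /* hypothetical: transfer length in bytes      */
#define REG_DMA_CTRL 0x20  /* hypothetical: write 1 to start the transfer */

static int start_dma_receive(struct pci_dev *pdev, void __iomem *bar, size_t len)
{
    dma_addr_t bus_addr;
    void *cpu_addr;

    /* Allocate a buffer visible to both the CPU and the device. */
    cpu_addr = dma_alloc_coherent(&pdev->dev, len, &bus_addr, GFP_KERNEL);
    if (!cpu_addr)
        return -ENOMEM;

    /* Hand the device the bus address; from here on the device is the
     * bus master and issues the (large-payload) write TLPs itself. */
    writeq(bus_addr, bar + REG_DMA_ADDR);  /* assumes a 64-bit register */
    writel((u32)len, bar + REG_DMA_LEN);
    writel(1, bar + REG_DMA_CTRL);
    return 0;
}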

Related

Memcpy from PCIe memory takes more time than memcpy to PCIe memory

I am trying to read/write data between a Linux PC and a PCIe 2.0 (2-lane) device. The memories for reading and writing are at different RAM locations in the PCIe device, and they are mapped on the Linux PC using ioremap. My use case requires 18 MB/s read/write throughput, which the PCIe link obviously supports. The memory at the PCIe device is uncached.
I am able to achieve the write throughput, i.e. when I write from Linux PC local memory to PCIe device memory using memcpy: the memcpy takes less than 1 ms for 9216 bytes of data. But when I read the ioremapped PCIe memory into Linux local memory, data loss occurs. I profiled the memcpy and it takes more than 1 ms, sometimes 2 ms, for 9216 bytes of data. I don't want to use DMA for this operation.
Any thoughts on what can be the problem in this case? How can I handle this?
That's entirely expected, and there is nothing you can do about it. The CPU can only issue serialized word-sized reads and writes, which have very poor throughput over the PCIe link due to protocol overheads. Every operation carries 24 or 28 byte-times of overhead - a 12- or 16-byte TLP header plus 12 byte-times of link-layer overhead - while the CPU can only operate on 4 or 8 bytes at a time. That is at best 25% efficient (8/(8+24) = 25%) and at worst 12.5% efficient (4/(4+28) = 12.5%).
The protocol overhead is not the only issue, however. Writes in PCIe are posted, so the CPU can simply issue a bunch of back-to-back writes which eventually make their way onto the bus and to the device. When reading, on the other hand, the CPU can only issue a single read operation, wait for the request and its completion to traverse the bus, store the result, issue another read, and so on. Since it can only operate on 8 bytes at a time, performance is horrible due to the relatively high latency of the PCIe bus (which can be on the order of microseconds per transfer).
The solution? Use DMA. PCIe is specifically designed to support efficient DMA operations over the bus, as devices can issue much larger read and write operations - at least up to 128 bytes per operation, since that is the minimum Max_Payload_Size the specification allows.
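For reference, this is roughly the read path the question describes. memcpy_fromio() is the standard Linux API for copying out of an ioremapped BAR (plain memcpy on MMIO is discouraged), but even it must issue word-sized reads one at a time - no CPU copy routine can batch loads into the large read requests a DMA engine would use:

/* Sketch of the CPU-driven read path from the question. Each load from
 * the ioremapped BAR becomes one non-posted PCIe read: the CPU stalls
 * for a full bus round trip per word, which is why throughput collapses
 * on reads but not on (posted) writes. */
#include <linux/io.h>

static void slow_read(void *dst, const void __iomem *bar_src, size_t len)
{
    /* Correct API for MMIO, but still serialized word-sized reads. */
    memcpy_fromio(dst, bar_src, len);
}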

How does burst-mode DMA speed up data transfer between main memory and I/O devices?

According to Wikipedia, there are three kinds of DMA modes, namely, the Burst Mode, the cycle stealing mode and the transparent mode.
In the Burst Mode, the DMA controller will take over the control of the bus. Before the transfer completes, CPU tasks that need the bus will be suspended. However, in each instruction cycle, the fetch cycle has to reference the main memory. Therefore, during the transfer, the CPU will be idle doing no work, which is essentially the same as being occupied by the transferring work, under interrupt-driven IO.
In my understanding, the cycle stealing mode is essentially the same. The only difference is that in that mode the CPU gets to use one of every two consecutive bus cycles, as opposed to being totally idle in burst mode.
Does burst-mode DMA make a difference by skipping the fetch and decode cycles needed when using interrupt-driven I/O, and thus accomplish one transfer per clock cycle instead of one per instruction cycle, and thus speed the process up?
Thanks a lot!
How does burst-mode DMA speed up data transfer between main memory and I/O devices?
There is no "speed up" as you allege, nor is any "speed up" typically necessary or possible. The data transfer is not going to occur any faster than the slower of the source or the destination allows.
The DMA controller will consolidate several individual memory requests into occasional burst requests, so the benefit of burst mode is reduced memory contention due to a reduction in the number of memory arbitrations.
Burst mode combined with a wide memory word improves memory bandwidth utilization. For example, with a 32-bit wide memory, four sequential byte reads consolidated into a single burst could result in only one memory access cycle.
Before the transfer completes, CPU tasks that need the bus will be suspended.
The concept of "task" does not exist at this level of operations. There is no "suspension" of anything. At most the CPU has to wait (i.e. insertion of wait states) to gain access to memory.
However, in each instruction cycle, the fetch cycle has to reference the main memory.
Not true. A hit in the instruction cache will make a memory access unnecessary.
Therefore, during the transfer, the CPU will be idle doing no work, which is essentially the same as being occupied by the transferring work, under interrupt-driven IO.
Faulty assumption for every cache hit.
Apparently you are misusing the term "interrupt-driven IO" to really mean programmed I/O using interrupts.
Equating a wait cycle or two to the execution of numerous instructions of an interrupt service routine for programmed I/O is a ridiculous exaggeration.
And "interrupt-driven IO" (in its proper meaning) does not exclude the use of DMA.
In my understanding, the cycle stealing mode is essentially the same.
Then your understanding is incorrect.
If the benefits of DMA are so minuscule or nonexistent as you allege, then how do you explain the existence of DMA controllers, and the preference of using DMA over programmed I/O?
Does burst-mode DMA make a difference by skipping the fetch and decode cycles needed when using interrupt-driven I/O, and thus accomplish one transfer per clock cycle instead of one per instruction cycle, and thus speed the process up?
Comparing DMA to "interrupt-driven I/O" is illogical; the two are not alternatives to one another.
Programmed I/O using interrupts requires a lot more than just the one instruction that you allege.
I'm unfamiliar with any CPU that can read a device port, write that value to main memory, bump the write pointer, and check if the block transfer is complete all with just a single instruction.
And you're completely ignoring the ISR code (e.g. save and then restore processor state) that is required to be executed for each interrupt (that the device would issue for requesting data).
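To make that cost concrete, here is a minimal sketch of a programmed-I/O interrupt handler for a hypothetical memory-mapped device. The register addresses are invented, but the steps are exactly the ones listed above, and every one of them executes per data item, on top of the interrupt entry/exit overhead:

/* Hypothetical device registers, for illustration only. */
#include <stdint.h>

#define DEV_DATA   (*(volatile uint32_t *)0x40000000u) /* data port        */
#define DEV_STATUS (*(volatile uint32_t *)0x40000004u) /* control/status   */

static uint32_t *wr_ptr;   /* current position in the destination buffer */
static uint32_t *buf_end;  /* one past the end of the buffer             */

void device_isr(void)
{
    uint32_t v = DEV_DATA;   /* 1. read the device port                  */
    *wr_ptr = v;             /* 2. write the value to main memory        */
    wr_ptr++;                /* 3. bump the write pointer                */
    if (wr_ptr == buf_end)   /* 4. check if the block is complete        */
        DEV_STATUS = 1;      /*    hypothetical: stop further interrupts */
    /* 5. plus interrupt acknowledge, processor state save/restore, and
     *    return-from-interrupt - all repeated for every transfer. */
}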
When used with many older or simpler CPUs, burst mode DMA can speed up data transfer in cases where a peripheral is able to accept data at a rate faster than the CPU itself could supply it. On a typical ARM, for example, a loop like:
lp:
  ldr   r0,[r1,r2]   ; fetch one word; r1 points *after* the buffer end,
                     ; r2 holds a negative offset that counts up to zero
  strb  r0,[r3]      ; write byte 0 to the output port
  lsr   r0,r0,#8
  strb  r0,[r3]      ; write byte 1
  lsr   r0,r0,#8
  strb  r0,[r3]      ; write byte 2
  lsr   r0,r0,#8
  strb  r0,[r3]      ; write byte 3
  adds  r2,#4        ; advance the offset and set the flags
  bne   lp           ; loop until the offset reaches zero
would likely take at least 11 cycles for each group of four bytes transferred (five 32-bit instruction fetches, one 32-bit data fetch, four 8-bit writes, plus a wasted fetch for the instruction following the loop). A burst-mode DMA operation, by contrast, would need only 5 cycles per group (assuming the receiving device was able to accept data that fast).
Because a typical low-end ARM will only use the bus about every other cycle when running most kinds of code, a DMA controller that grabs the bus on every other cycle could allow the CPU to run at almost normal speed while the DMA controller performed one access every other cycle. On some platforms, it may be possible to have a DMA controller perform transfers on every cycle where the CPU isn't doing anything, while giving the CPU priority on cycles where it needs the bus. DMA performance would be highly variable in such a mode (no data would get transferred while running code that needs the bus on every cycle) but DMA operations would have no impact on CPU performance.

How can a PCIe card DMA data into CPU RAM?

This is in reference to this answer given to a similar DMA/PCI question. I gathered from that answer that the PC does not have a DMA engine capable of transferring data to/from a PCI card, and that the PCI card must provide the DMA capabilities itself. I have received similar answers from colleagues saying, "A two-way DMA needs to be on the FPGA (referring to the PCI card) to enable burst transfers to/from CPU memory."
My understanding is that when the PC receives a read request, it needs to fulfill it by creating a completion packet with the requested data. So, if the card requests a page of data (4096 bytes), the PC needs to return packets totaling 4096 bytes. How does the card's DMA engine reach across the bus to fill the needed packet, as that answer suggests?
I think there might be a misunderstanding here. The card does not "reach across" the bus to use a DMA function in the PC.
The card itself is a bus master. It can directly read and write the entire memory of the PC, just like the CPU can.
From the PC memory system point of view, there is no difference between the card or the main CPU in the PC. Both are bus masters. Both can perform reads and writes to memory.
Bursts of 4096 bytes are not supported, however. You will have to split the transfer up into multiple smaller bursts.
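As a rough illustration of the splitting, assuming negotiated Max_Payload_Size values of 128 or 256 bytes (both common on real links), a 4096-byte page becomes a series of TLPs:

/* Back-of-envelope: how a 4096-byte transfer breaks into PCIe TLPs. */
#include <stdio.h>

int main(void)
{
    const unsigned transfer = 4096;            /* one page               */
    const unsigned payloads[] = { 128, 256 };  /* common MPS values      */

    for (unsigned i = 0; i < 2; i++)
        printf("%u-byte max payload: %u TLPs per 4096-byte page\n",
               payloads[i], transfer / payloads[i]);  /* 32 or 16 TLPs   */
    return 0;
}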

How long does it take to set up an I/O controller on a PCIe bus

Say I have an InfiniBand or similar PCIe device and a fast Intel Core CPU and I want to send e.g. 8 bytes of user data over the IB link. Say also that there is no device driver or other kernel: we're keeping this simple and just writing directly to the hardware. Finally, say that the IB hardware has previously been configured properly for the context, so it's just waiting for something to do.
Q: How many CPU cycles will it take the local CPU to tell the hardware where the data is and that it should start sending it?
More info: I want to estimate the cost of using PCIe communication services compared to CPU-local services (e.g. using a coprocessor). What I am expecting is that there will be a number of writes to registers on the PCIe bus, for example setting up the address and length of a packet, and possibly some reads and writes of status and/or control registers. I expect each of these to take several hundred CPU cycles, so I would expect the overall setup to take on the order of 1000 to 2000 CPU cycles. Would I be right?
I am just looking for a ballpark answer...
Your ballpark number is correct.
If you want to send an 8-byte payload using an RDMA write, first you will write the request descriptor to the NIC using programmed I/O (PIO), and then the NIC will fetch the payload using a PCIe DMA read. I'd expect both the PIO write and the DMA read to take between 200 and 500 nanoseconds, with the PIO write being the faster of the two.
You can get rid of the DMA read and save some latency by putting the payload inside the request descriptor.
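To illustrate the inline-payload idea, here is a rough sketch. The descriptor layout is invented for illustration - real NICs define their own formats - but the principle holds: if the 8-byte payload travels inside the PIO-written descriptor, the NIC never needs the extra DMA read to fetch it, saving one bus round trip:

#include <stdint.h>
#include <string.h>

struct wqe_inline {          /* hypothetical work-queue entry layout */
    uint64_t remote_addr;    /* RDMA destination address             */
    uint32_t rkey;           /* remote memory key                    */
    uint32_t byte_len;       /* payload length (here: 8)             */
    uint8_t  payload[8];     /* the user data, carried inline        */
};

void post_inline_write(volatile void *nic_bar, const struct wqe_inline *wqe)
{
    /* One PIO burst (ideally via write-combined stores) pushes the whole
     * descriptor, payload included, to the NIC - no follow-up DMA read
     * of the payload is required. */
    memcpy((void *)nic_bar, wqe, sizeof(*wqe));
}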

Realistic data rate over PCI bus using DMA?

What is the realistic data transfer rate over a 32-bit/33 MHz PCI bus? We need to transfer 32K 32-bit samples from a PCI card to an Intel CPU running Windows. I would expect the block to transfer in about 1 ms, but it is taking 40 ms. The PCI board has a PLX PCI-9056. We are accessing card memory through a virtual address, and our CPU is maxed out, which makes me think the data rate is being held back by CPU involvement. If we switch to DMA, will the transfer get closer to 1 ms? The reason I have my doubts is that the PXI SDK User Manual states:
"BAR space memory read/write is generally slow in relative terms. Reads are typically only 2-4MB/s."
You should check whether you can enable burst mode and continuous burst, so that multiple DWords can be transmitted without new address cycles. This makes things much faster. The PLX PCI-9056 supports this option, but it must be enabled by software accordingly.
We see data rates of up to 90 MB/s with DMA master transfers on our custom-designed frame grabber card.
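A quick back-of-envelope check, taking the manual's quoted 2-4 MB/s figure for BAR-space reads at face value, shows the observed 40 ms is right where programmed I/O would land, and that ~1 ms needs bursts/DMA:

#include <stdio.h>

int main(void)
{
    const double block = 32768.0 * 4.0;  /* 32K 32-bit samples = 131072 bytes */
    const double peak  = 33e6 * 4.0;     /* 32-bit @ 33 MHz = 132 MB/s peak   */

    printf("full-speed burst:   %.1f ms\n", 1e3 * block / peak);  /* ~1.0 ms */
    printf("2-4 MB/s BAR reads: %.0f-%.0f ms\n",
           1e3 * block / 4e6, 1e3 * block / 2e6);                 /* 33-66 ms */
    return 0;
}

The measured 40 ms falls inside that 33-66 ms window, which suggests bus-master DMA should indeed bring the transfer much closer to the 1 ms figure.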
