How long does it take to set up an I/O controller on PCIe bus - device-driver

Say I have an InfiniBand or similar PCIe device and a fast Intel Core CPU, and I want to send e.g. 8 bytes of user data over the IB link. Say also that there is no device driver or other kernel code involved: we're keeping this simple and just writing directly to the hardware. Finally, say that the IB hardware has previously been configured properly for the context, so it's just waiting for something to do.
Q: How many CPU cycles will it take the local CPU to tell the hardware where the data is and that it should start sending it?
More info: I want to get an estimate of the cost of using PCIe communication services compared to CPU-local services (e.g. using a coprocessor). What I am expecting is that there will be a number of writes to registers on the PCIe bus, for example setting up the address and length of a packet, and possibly some reads and writes of status and/or control registers. I expect each of these to take several hundred CPU cycles, so I would expect the overall setup to take on the order of 1000 to 2000 CPU cycles. Would I be right?
I am just looking for a ballpark answer...

Your ballpark number is correct.
If you want to send an 8-byte payload using an RDMA write, first you will write the request descriptor to the NIC using programmed I/O (PIO), and then the NIC will fetch the payload using a PCIe DMA read. I'd expect both the PIO and the DMA read to take between 200 and 500 nanoseconds, although the PIO should be faster.
You can get rid of the DMA read and save some latency by putting the payload inside the request descriptor.
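To make the sequence concrete, here is a minimal userspace sketch of that "inline" approach. The PCI device path, the BAR offsets and the descriptor word format are all invented for illustration - a real NIC defines its own WQE/doorbell layout - but the shape of the operation (a few PIO writes, then a doorbell) is what the estimate above is counting.

```c
/*
 * Minimal sketch only: an "inline send" posted purely with PIO writes.
 * The BDF path, BAR offsets and descriptor word format below are all
 * made up for illustration; a real NIC defines its own layout.
 */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR_SIZE    0x1000
#define DESC_SLOT   0x800           /* assumed: descriptor slot inside BAR0 */
#define DOORBELL    0x900           /* assumed: doorbell register           */

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0",
                  O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;

    volatile uint8_t *bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED)
        return 1;

    uint64_t payload = 0x1122334455667788ULL;   /* the 8 bytes of user data */

    /* Write the descriptor: an opcode/length word plus the inline payload,
     * so the NIC never needs a DMA read to fetch the data. */
    *(volatile uint32_t *)(bar + DESC_SLOT)     = (1u << 24) | 8;
    *(volatile uint64_t *)(bar + DESC_SLOT + 8) = payload;

    __sync_synchronize();                       /* order descriptor before doorbell */
    *(volatile uint32_t *)(bar + DOORBELL) = 1; /* ring the doorbell: start sending */

    munmap((void *)bar, BAR_SIZE);
    close(fd);
    return 0;
}
```

On an uncached mapping each of those stores goes out as its own posted write TLP; the handful needed here is what the 200-500 ns PIO estimate above covers.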

Related

How do DMA and PCIe play together?

In a PCIe configuration, devices have dedicated addresses and they send data to each other in peer-to-peer mode - every device can write whenever it wants, and the switches take care of forwarding the data correctly. There is no need for a "bus master" that decides when and how data will be transmitted.
How does DMA come into play in such a configuration? To me it seems that DMA is an outdated feature that is not needed in a PCIe configuration. Every device can send data to the main memory, or read from it - obviously the main memory will always be the "slave" in such operations.
Or is there some other functionality of DMA, which I am missing?
Thank you in advance!
When a device other than a CPU accesses memory that is attached to a CPU, this is called direct memory access (DMA). So any PCIe read or write requests issued from PCIe devices constitute DMA operations. This can be extended with 'device to device' or 'peer to peer' DMA where devices perform reads and writes against each other without involving the CPU or system memory.
There are two main advantages of DMA. First, DMA operations can move data into and out of memory with minimal CPU load, improving software efficiency. Second, the CPU can only issue reads and writes of whatever the CPU word size is, which results in very poor throughput over the PCIe bus due to TLP headers and other protocol overheads. Devices issuing their own read and write requests can use much larger payloads, resulting in higher throughput and more efficient use of the bus bandwidth.
So, DMA is absolutely not obsolete or outdated - basically all high-performance devices connected over PCIe will use DMA to use the bus efficiently.
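As a concrete illustration of "the device does the transfer", here is a minimal kernel-side sketch (not a complete driver) of handing a buffer's bus address to a device so that the device itself issues the PCIe read TLPs. The REG_DMA_* register offsets are hypothetical - every device defines its own programming model - but dma_map_single() and friends are the standard Linux DMA API.

```c
#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/kernel.h>

#define REG_DMA_ADDR_LO  0x10   /* assumed device registers */
#define REG_DMA_ADDR_HI  0x14
#define REG_DMA_LEN      0x18
#define REG_DMA_START    0x1c

static int start_device_dma(struct pci_dev *pdev, void __iomem *bar,
                            void *buf, size_t len)
{
    dma_addr_t bus_addr = dma_map_single(&pdev->dev, buf, len, DMA_TO_DEVICE);

    if (dma_mapping_error(&pdev->dev, bus_addr))
        return -ENOMEM;

    /* Tell the device where the buffer is and how big it is ... */
    iowrite32(lower_32_bits(bus_addr), bar + REG_DMA_ADDR_LO);
    iowrite32(upper_32_bits(bus_addr), bar + REG_DMA_ADDR_HI);
    iowrite32((u32)len,                bar + REG_DMA_LEN);

    /* ... then kick it off: the device now issues large PCIe read TLPs
     * itself, with no further CPU involvement until its completion IRQ. */
    iowrite32(1, bar + REG_DMA_START);
    return 0;
}
```

The CPU's only work is the mapping and a few register writes; the bulk data then moves as large TLPs generated by the device.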

Memcpy from PCIe memory takes more time than memcpy to PCIe memory

I am trying to read/write data to/from a PCIe 2.0 (x2 lane) device from a Linux PC. The memories for reading and writing are at different RAM locations in the PCIe device, and they are mapped on the Linux PC using ioremap. My use case is to achieve 18 MB/s read/write throughput, which is obviously well within what the PCIe link supports. The memory at the PCIe device is uncached.
I am able to achieve the write throughput, i.e. when I write from Linux PC local memory to PCIe device memory using memcpy; the memcpy takes less than 1 ms for 9216 bytes of data. But when I read the ioremapped PCIe memory into Linux local memory, data loss occurs. I profiled the memcpy and it takes more than 1 ms, sometimes 2 ms, for 9216 bytes of data. I don't want to use DMA for this operation.
Any thoughts on what can be the problem in this case? How can I handle this?
That's entirely expected, and there is nothing you can do about it. The CPU can only issue serialized word-sized reads and writes, which have very poor throughput over the PCIe link due to protocol overheads. Every operation has 24 or 28 byte-times worth of overhead associated with it - that's a 12- or 16-byte TLP header plus 12 byte-times of link-layer overhead - and the CPU can only operate on 4 or 8 bytes at a time, which is at best 25% efficient (8 / (8 + 24) = 25%) and at worst 12.5% efficient (4 / (4 + 28) = 12.5%).
The protocol overhead is not the only issue, however. Writes in PCIe are posted, so the CPU can simply issue a bunch of back-to-back writes which eventually make their way onto the bus and to the device. When reading, on the other hand, the CPU can only issue a single read operation, wait for the request and completion to traverse the bus (a full round trip), store the result, issue another read, and so on. Since it can only operate on 8 bytes at a time, the performance is horrible due to the relatively high latency over the PCIe bus, which can be on the order of microseconds per transfer.
The solution? Use DMA. PCIe is specifically designed to support efficient DMA operations over the bus, since devices can issue much larger read and write operations - at least 128 bytes per operation, the smallest maximum payload size the spec allows.
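To see why larger payloads matter, here is a tiny program that plugs the overhead figures quoted above (24-28 byte-times per transaction) into the efficiency formula payload / (payload + overhead). The 128-byte case is only indicative, since the exact overhead depends on header size and link-layer framing.

```c
#include <stdio.h>

/* Link efficiency = payload / (payload + per-transaction overhead). */
static double eff(double payload, double overhead)
{
    return 100.0 * payload / (payload + overhead);
}

int main(void)
{
    printf("CPU, 8-byte access, 24B overhead : %5.1f%%\n", eff(8, 24));
    printf("CPU, 4-byte access, 28B overhead : %5.1f%%\n", eff(4, 28));
    printf("DMA, 128-byte TLP,  24B overhead : %5.1f%%\n", eff(128, 24));
    return 0;
}
```

That prints 25.0%, 12.5% and roughly 84% respectively, which is the whole argument for letting the device move the data.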

How can a PCIe card DMA data into CPU RAM?

This is in reference to this answer given to a similar DMA/PCI question. I gathered from this answer that the PC does not have a DMA engine capable of transferring data to/from a PCI card, and that the PCI card must provide the DMA capabilities. I have received similar answers from colleagues saying, "A two-way DMA needs to be on the FPGA (referring to the PCI card) to enable burst transfers to/from CPU memory."
My understanding is that when the PC receives a read request, it needs to fulfill the read request by creating a return packet with the data requested. So, if the card requests a page of data (4096 bytes), the PC needs to return a packet with 4096 bytes. How does the card's DMA engine reach across the bus and fill the needed packet, as that answer suggests?
I think there might be a misunderstanding here. The card does not "reach across" the bus to use a DMA function in the PC.
The card itself is a bus master. It can directly read and write the entire memory of the PC, just like the CPU can.
From the PC memory system point of view, there is no difference between the card or the main CPU in the PC. Both are bus masters. Both can perform reads and writes to memory.
Bursts of 4096 bytes are not supported. You will have to split the transfer up into multiple smaller bursts.
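On the driver side the only things to do are enable bus mastering and find out how large a single transaction may be; how a 4096-byte transfer gets carved up then follows from the negotiated maximum payload size. A hedged kernel-side sketch - the helpers are the standard Linux ones, the reporting is just illustrative:

```c
#include <linux/pci.h>
#include <linux/kernel.h>

/* Sketch: let the card master the bus and report how a 4096-byte
 * transfer must be split into individual PCIe write transactions. */
static void enable_and_report(struct pci_dev *pdev)
{
    int mps, mrrs;
    size_t total = 4096, write_tlps;

    pci_set_master(pdev);              /* allow the card to issue DMA */

    mps  = pcie_get_mps(pdev);         /* max payload per write TLP   */
    mrrs = pcie_get_readrq(pdev);      /* max read request size       */

    write_tlps = DIV_ROUND_UP(total, (size_t)mps);

    dev_info(&pdev->dev,
             "MPS=%d MRRS=%d: a %zu-byte burst becomes %zu write TLPs\n",
             mps, mrrs, total, write_tlps);
}
```

With a typical MPS of 128 or 256 bytes, 4096 bytes therefore goes out as 32 or 16 back-to-back write TLPs rather than one giant burst.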

Use DMA transfers with Cyclone V Avalon-MM for PCIe

Is it possible to do DMA transfers with the IP core «Cyclone V Avalon-MM for PCIe» provided by Altera in Qsys (Quartus 14.0)?
Altera provides an IP core named «Cyclone V Avalon-MM DMA for PCIe» to do DMA transfers, but this IP core does not support PCIe Gen1 with a x1 lane.
The demo (ep_g1x1) design for «Cyclone V Avalon-MM for PCIe» includes a DMA block that is connected to the Avalon-MM TX bus of the PCIe IP core.
So I'm wondering whether it is possible to write data from this DMA block to the root complex (host), because I can't find how to do that.
From my brief skim of the material, it should be possible to issue DMA reads or writes from an RC to your Cyclone V (EP) using the IP core you're interested in.
I've done DMA reads and writes on a Stratix V, however it was in a non-Qsys design just using the PCIe core HIP block (custom TLP encoding and decoding logic). This block just seems to be a wrapper around their PCIe HIP block that also handles the transaction layer for you.
The first step will be to get your RC to issue PCIe DMA read or write requests. In the case of a read, the RC sends a memory read request with a length greater than 1 DWORD, and the FPGA returns the data in Completion with Data (CplD) TLPs. I would suggest dedicating an entire BAR to the memory space you want to DMA from on the FPGA, to keep your address targeting simple.
On the FPGA side, I would suggest using Signal Tap and probing the Rxm* interface signals on the core. This way you can see the exact timing of the DMA read request that comes out of the core. My guess is that the RXMRead_<n>_o signal will go high, indicating the start of the request, at which point you'll have to decode and pass the RxmAddress_<n>_o and RXMBurstCount_<n>_o to some glue logic that will fetch the requested data from the FPGA's memory. Once you're ready to send back the data, assert RXMReadDataValid_<n>_i for each valid word being sent.
I'm guessing that the «Cyclone V Avalon-MM DMA for PCIe» core that you referenced takes care of that 'glue' logic for you, and allows you to connect straight to an SDRAM controller on your Qsys bus. Altera doesn't usually encrypt their megafunction code, so if your SystemVerilog is strong, it might be worth digging through the generated files to see if you can reuse that bit of code in some way.
As for core settings, the only thing I saw that you need to look out for is making sure the Single DW Completer setting is turned OFF. Otherwise the core will abort any requests it receives with a length greater than 1 DWORD.
Hope that helped somewhat.
I finally managed to make DMA requests with the «Cyclone V Avalon-MM for PCIe» Altera IP core, so yes, it is possible.
On my system, the root complex (RC) is an i.MX6 running Linux, so most of the tricks are in fact on the Linux side.
In the Linux driver, a page must be allocated with a dma_alloc_coherent() call, and the bus address of this page must be written to the CRA registers named ADDR_MAP_LO0 and ADDR_MAP_HI0.
On my system memory pages are 4 KiB, so I had to configure the «address translation settings» of the PCIe hard IP with 4 KiB pages to match.
Once that was done, I simply connected the DMA controller provided by Qsys to the TX Avalon-MM slave port of the PCIe IP.
Telling the DMA controller to write data to this port automatically generates TLPs from the FPGA that write into i.MX6 RAM.
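For reference, a minimal sketch of the Linux side of that setup, assuming the core's CRA slave is already ioremapped at cra and that translation entry 0 (ADDR_MAP_LO0/HI0) sits at the offsets below - check the generated core's register map, since the actual offsets depend on the Qsys configuration:

```c
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/kernel.h>

#define CRA_ADDR_MAP_LO0  0x1000   /* assumed offset of translation entry 0 */
#define CRA_ADDR_MAP_HI0  0x1004

static void *setup_dma_page(struct device *dev, void __iomem *cra,
                            dma_addr_t *bus_addr)
{
    /* One 4 KiB page, matching the 4 KiB pages configured in the
     * «address translation settings» of the PCIe hard IP. */
    void *cpu_addr = dma_alloc_coherent(dev, PAGE_SIZE, bus_addr, GFP_KERNEL);

    if (!cpu_addr)
        return NULL;

    /* Point translation entry 0 at this page: Avalon-MM writes landing in
     * the corresponding window become memory-write TLPs into i.MX6 RAM. */
    iowrite32(lower_32_bits(*bus_addr), cra + CRA_ADDR_MAP_LO0);
    iowrite32(upper_32_bits(*bus_addr), cra + CRA_ADDR_MAP_HI0);

    return cpu_addr;
}
```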

Realistic data rate over PCI bus using DMA?

What is the realistic data transfer rate over a 32-bit/33 MHz PCI bus? We need to transfer 32K 32-bit samples from a PCI card to an Intel CPU running Windows. I would think the block would transfer in 1 msec, but it is taking 40 msec. The PCI board has a PLX PCI-9056. We are accessing card memory through a virtual address, but our CPU is maxed out, which makes me think the data rate is being held up by CPU involvement. If we go to DMA, will the transfer take closer to 1 msec? The reason I have my doubts is that the PLX SDK User Manual states:
"BAR space memory read/write is generally slow in relative terms. Reads are typically only 2-4MB/s."
You should check whether you can enable burst mode and continuous burst, so that multiple DWORDs can be transmitted without new address cycles. This makes things much faster. The PLX PCI-9056 supports this option, but it must be enabled by software accordingly.
We have seen data rates of up to 90 MB/s with DMA master transfers on our custom-designed frame grabber card.
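A quick sanity check on the numbers in the question, using only figures already given (32K 32-bit samples, the observed 40 msec, the 32-bit/33 MHz bus peak of roughly 132 MB/s, and the 2-4 MB/s quoted for BAR reads):

```c
#include <stdio.h>

int main(void)
{
    double bytes    = 32768.0 * 4.0;   /* 32K samples x 4 bytes = 128 KiB        */
    double bus_peak = 33e6 * 4.0;      /* 32-bit @ 33 MHz = 132 MB/s theoretical */

    printf("transfer size          : %.0f bytes\n", bytes);
    printf("time at bus peak       : %.2f ms\n",   bytes / bus_peak * 1e3);
    printf("effective rate @ 40 ms : %.1f MB/s\n", bytes / 40e-3 / 1e6);
    return 0;
}
```

That works out to about 1 msec at the bus peak and about 3.3 MB/s for the observed 40 msec, squarely inside the 2-4 MB/s quoted for BAR reads - consistent with the transfer being limited by CPU-driven reads rather than the bus, and with DMA getting you much closer to the 1 msec figure.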
