How is data transferred from memory to PCIe cards?

My question is about data transfer between a PCIe peripheral and system memory. Example:
Suppose we need to send a big chunk of data (stored in system memory) over an Ethernet network.
How will it usually be done?
Will the Ethernet card's controller have to request the data via "bus mastering" (after being programmed by the CPU to do so)?
Or will a centralized DMA controller near the CPU just write into the card's buffers (after being programmed by the CPU to do so)?
In other words:
Does the card have to ask for the data? Or can the data simply be written to the card by a CPU-side DMA controller?

PCIe bulk data transfers are generally performed by the device itself, using bus mastering, as you guessed.
In your question you imply that having a separate DMA controller do the transfer is simpler than having the device do it, but I think the opposite is true: it is simpler for the device to do it.
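To make that concrete, here is a minimal sketch of what the CPU side typically does for a bus-mastering NIC: the driver fills a transmit descriptor with the bus address and length of the buffer in system memory, then rings a doorbell register, and the card fetches the payload itself with PCIe memory read requests. The descriptor layout and register names below are invented for illustration; every real NIC defines its own.

```c
#include <stdint.h>

/* Hypothetical transmit descriptor: the card reads this, then issues its own
 * PCIe memory reads (bus mastering) to fetch 'len' bytes from 'buf_addr'. */
struct tx_desc {
    uint64_t buf_addr;   /* bus/physical address of the payload in system RAM */
    uint32_t len;        /* payload length in bytes */
    uint32_t flags;      /* e.g. OWN bit handing the descriptor to the card   */
};

#define TX_DESC_OWN      (1u << 31)
#define REG_TX_DOORBELL  0x40        /* invented MMIO register offset */

/* 'mmio' is the card's BAR mapped into the driver's address space. */
static void send_packet(volatile uint32_t *mmio, struct tx_desc *ring,
                        unsigned idx, uint64_t buf_bus_addr, uint32_t len)
{
    ring[idx].buf_addr = buf_bus_addr;   /* descriptor ring lives in system RAM */
    ring[idx].len      = len;
    ring[idx].flags    = TX_DESC_OWN;    /* hand ownership to the card          */

    /* One small MMIO write tells the card there is work to do; the bulk data
     * itself is pulled by the card's DMA engine, not pushed by the CPU. */
    mmio[REG_TX_DOORBELL / 4] = idx;
}
```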

Related

How do DMA and PCIe play together?

In a PCIe configuration, devices have dedicated addresses and send data to each other in a peer-to-peer fashion: every device can write whenever it wants, and the switches take care of forwarding the data correctly. There is no need for a "bus master" that decides when and how data will be transmitted.
How does DMA come into play in such a configuration? To me it seems that DMA is an outdated feature that is not needed with PCIe. Every device can send data to main memory, or read from it; obviously main memory will always be the "slave" in such operations.
Or is there some other functionality of DMA that I am missing?
Thank you in advance!
When a device other than a CPU accesses memory that is attached to a CPU, this is called direct memory access (DMA). So any PCIe read or write requests issued from PCIe devices constitute DMA operations. This can be extended with 'device to device' or 'peer to peer' DMA where devices perform reads and writes against each other without involving the CPU or system memory.
There are two main advantages of DMA: First, DMA operations can move data into and out of memory with minimal CPU load, improving software efficiency. Second, the CPU can only issue reads and writes of whatever the CPU word size is, which results in very poor throughput over the PCIe bus due to TLP headers and other protocol overheads. Devices directly issuing read and write requests can issue read and write operations with much larger payloads, resulting in higher throughput and more efficient use of the bus bandwidth.
So, DMA is absolutely not obsolete or outdated - basically all high-performance devices connected over PCIe will use DMA to use the bus efficiently.
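A rough calculation illustrates the second point. Assuming on the order of 24 bytes of TLP header and link-layer framing per transaction (the exact figure depends on header format and PCIe generation), word-sized accesses waste most of the link bandwidth, while large DMA payloads use it efficiently:

```c
#include <stdio.h>

/* Rough PCIe efficiency estimate: payload / (payload + per-TLP overhead).
 * The ~24-byte overhead (header + sequence number + LCRC + framing) is an
 * approximation; exact numbers depend on header size and PCIe generation. */
static double efficiency(double payload_bytes)
{
    const double overhead = 24.0;
    return payload_bytes / (payload_bytes + overhead);
}

int main(void)
{
    printf("4-byte (CPU word) accesses: %.0f%% of link bandwidth\n",
           100.0 * efficiency(4));      /* roughly 14% */
    printf("256-byte DMA payloads     : %.0f%% of link bandwidth\n",
           100.0 * efficiency(256));    /* roughly 91% */
    return 0;
}
```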

How can a PCIe card DMA data into CPU RAM?

This is in reference to this answer given to a similar DMA/PCI question. I gathered from that answer that the PC does not have a DMA engine capable of transferring data to/from a PCI card, and that the PCI card must provide the DMA capabilities. I have received similar answers from colleagues saying, "A two-way DMA needs to be on the FPGA (referring to the PCI card) to enable burst transfers to/from CPU memory."
My understanding is that when the PC receives a read request, it needs to fulfill it by creating a return packet with the data requested. So, if the card requests a page of data (4096 bytes), the PC needs to return a packet with 4096 bytes. How does the card's DMA engine reach across the bus and fill the needed packet, as that answer suggests?
I think there might be a misunderstanding here. The card does not "reach across" the bus to use a DMA function in the PC.
The card itself is a bus master. It can directly read and write the entire memory of the PC, just like the CPU can.
From the PC memory system point of view, there is no difference between the card or the main CPU in the PC. Both are bus masters. Both can perform reads and writes to memory.
Bursts of 4096 bytes are not supported, however; you will have to split the transfer into multiple smaller bursts.
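As a sketch of that splitting, the transfer is broken into bursts no larger than the negotiated Max_Payload_Size / Max_Read_Request_Size. The 256-byte limit and the issue_read_request() placeholder below are just illustrative assumptions:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder for whatever actually emits a PCIe memory read request
 * (in hardware this is the card's DMA engine, not host code). */
static void issue_read_request(uint64_t bus_addr, uint32_t len)
{
    printf("MRd: addr=0x%llx len=%u bytes\n",
           (unsigned long long)bus_addr, len);
}

/* Split a large transfer (e.g. the 4096-byte page from the question) into
 * bursts that respect the negotiated maximum request size; 256 bytes here
 * is just an example value, the real one is negotiated at enumeration. */
static void dma_read_block(uint64_t bus_addr, size_t total_len)
{
    const uint32_t max_req = 256;

    while (total_len > 0) {
        uint32_t chunk = total_len > max_req ? max_req : (uint32_t)total_len;
        issue_read_request(bus_addr, chunk);
        bus_addr  += chunk;
        total_len -= chunk;
    }
}

int main(void)
{
    dma_read_block(0x10000000ULL, 4096);
    return 0;
}
```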

Use DMA transfers with Cyclone V Avalon-MM for PCIe

Is it possible to do DMA transfers with the «Cyclone V Avalon-MM for PCIe» IP core provided by Altera in Qsys (Quartus 14.0)?
Altera provides an IP core named «Cyclone V Avalon-MM DMA for PCIe» to do DMA transfers, but that core does not support PCIe Gen1 with a x1 lane.
The demo design (ep_g1x1) for «Cyclone V Avalon-MM for PCIe» includes a DMA block that is connected to the Avalon-MM TX bus of the PCIe IP core.
So I'm wondering whether it's possible to write data from this DMA block to the root complex (host), because I can't find how to do that.
From my brief skim of the material, it should be possible to issue DMA reads or writes from an RC to your Cyclone V (EP) using the IP core you're interested in.
I've done DMA reads and writes on a Stratix V, but that was in a non-Qsys design using just the PCIe HIP block with custom TLP encoding and decoding logic. The core you're looking at seems to be a wrapper around the same PCIe HIP block that also handles the transaction layer for you.
The first step will be to get your RC to issue PCIe DMA read or write requests. In the case of a read, the RC issues a memory read request (MRd) with a length greater than 1 DWORD, and your endpoint returns the data in Completion with Data (CplD) TLPs. I would suggest dedicating an entire BAR to map the memory space you want to DMA from on the FPGA, to keep your address targeting simple.
On the FPGA side, I would suggest using Signal Tap and probing the Rxm* interface signals on the core. That way you can see the exact timing of the DMA read request that comes out of the core. My guess is that the RXMRead_<n>_o signal will go high to indicate the start of the request, at which point you'll have to decode the RxmAddress_<n>_o and RXMBurstCount_<n>_o signals and pass them to some glue logic that fetches the requested data from the FPGA's memory. Once you're ready to send back the data, assert RXMReadDataValid_<n>_i for each valid word being sent.
I'm guessing that the «Cyclone V Avalon-MM DMA for PCIe» core you referenced takes care of that 'glue' logic for you and lets you connect straight to an SDRAM controller on your Qsys bus. Altera doesn't usually encrypt their megafunction code, so if your SystemVerilog is strong, it might be worth digging through the generated files to see whether you can reuse that bit of code in some way.
As for core settings, the only thing that I saw that you need to look out for is making sure the Single DW Completer setting is turned OFF. Otherwise the core will abort any requests it receives with a length greater than 1 DWORD.
Hope that helped somewhat.
I finally managed to make DMA requests with the «Cyclone V Avalon-MM for PCIe» Altera IP core, so yes, it is possible.
On my system, the root complex (RC) is the PCIe controller of an i.MX6 running Linux, so most of the tricks are in fact on the Linux side.
In the Linux driver, a page must be allocated with a dma_alloc_coherent() call, and the bus address of this page must be written to the CRA registers named ADDR_MAP_LO0 and ADDR_MAP_HI0.
On my system memory pages are 4 KiB, so I had to configure the «address translation settings» of the PCIe hard IP with 4 KiB pages to match.
Once that was done, I simply connected the DMA controller provided by Qsys to the TX Avalon-MM slave port of the PCIe IP.
Telling the DMA controller to write data to this port automatically generates TLPs from the FPGA that write into the i.MX6's RAM.
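For reference, a minimal sketch of the host-side part of this flow might look like the fragment below. The CRA BAR index and the ADDR_MAP_LO0/ADDR_MAP_HI0 offsets are assumptions for illustration only; the real values depend on how the Avalon-MM for PCIe core is configured in Qsys.

```c
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>

#define CRA_BAR        0        /* assumed: CRA slave exposed through BAR0  */
#define ADDR_MAP_LO0   0x1000   /* assumed register offset                  */
#define ADDR_MAP_HI0   0x1004   /* assumed register offset                  */
#define ATT_PAGE_SIZE  4096     /* must match the core's translation pages  */

static int setup_dma_page(struct pci_dev *pdev)
{
    void __iomem *cra;
    void *cpu_buf;
    dma_addr_t bus_addr;

    cra = pci_iomap(pdev, CRA_BAR, 0);
    if (!cra)
        return -ENOMEM;

    /* One coherent page that the FPGA's DMA controller will write into. */
    cpu_buf = dma_alloc_coherent(&pdev->dev, ATT_PAGE_SIZE, &bus_addr,
                                 GFP_KERNEL);
    if (!cpu_buf) {
        pci_iounmap(pdev, cra);
        return -ENOMEM;
    }

    /* Program one address-translation entry: Avalon-MM accesses in this
     * window are translated to 'bus_addr' on the PCIe side. The low word
     * may also carry control bits in the real core; check the docs. */
    iowrite32(lower_32_bits(bus_addr), cra + ADDR_MAP_LO0);
    iowrite32(upper_32_bits(bus_addr), cra + ADDR_MAP_HI0);

    return 0;
}
```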

How long does it take to set up an I/O controller on the PCIe bus?

Say I have an InfiniBand or similar PCIe device and a fast Intel Core CPU, and I want to send e.g. 8 bytes of user data over the IB link. Say also that there is no device driver or other kernel software involved: we're keeping this simple and just writing directly to the hardware. Finally, say that the IB hardware has previously been configured properly for the context, so it's just waiting for something to do.
Q: How many CPU cycles will it take the local CPU to tell the hardware where the data is and that it should start sending it?
More info: I want to get an estimate of the cost of using PCIe communication services compared to CPU-local services (e.g. using a coprocessor). What I am expecting is that there will be a number of writes to registers on the PCIe bus, for example setting up the address and length of a packet, and possibly some reads and writes of status and/or control registers. I expect each of these will take several hundred CPU cycles, so the overall setup should take on the order of 1000 to 2000 CPU cycles. Would I be right?
I am just looking for a ballpark answer...
Your ballpark number is correct.
If you want to send an 8-byte payload using an RDMA write, first you will write the request descriptor to the NIC using programmed I/O (PIO), and then the NIC will fetch the payload using a PCIe DMA read. I'd expect both the PIO and the DMA read to take between 200 and 500 nanoseconds, although the PIO should be faster.
You can get rid of the DMA read and save some latency by putting the payload inside the request descriptor.
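As a purely hypothetical illustration of that last point, a request descriptor with the payload inlined might look like the sketch below. The struct layout, flag encoding, and doorbell details are made up, since every NIC defines its own descriptor format (InfiniBand verbs expose this as inline data in a work request).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-byte request descriptor with the payload inlined, so the
 * NIC does not need a separate DMA read just to fetch 8 bytes of data. */
struct rdma_write_desc {
    uint32_t opcode_flags;    /* e.g. RDMA_WRITE | INLINE (invented encoding) */
    uint32_t length;          /* payload length in bytes                      */
    uint64_t remote_addr;     /* target address on the remote node            */
    uint32_t rkey;            /* remote memory key                            */
    uint32_t reserved;
    uint8_t  inline_data[40]; /* small payloads travel inside the descriptor  */
};

/* 'nic_sq_slot' is a slot in the NIC's submission queue BAR, typically a
 * write-combining mapping obtained via mmap of the device. */
static void post_inline_write(void *nic_sq_slot, uint64_t remote_addr,
                              uint32_t rkey, const void *payload, uint32_t len)
{
    struct rdma_write_desc d = {
        .opcode_flags = 0x1,   /* hypothetical RDMA_WRITE|INLINE value */
        .length       = len,
        .remote_addr  = remote_addr,
        .rkey         = rkey,
    };

    if (len > sizeof(d.inline_data))
        return;                /* payload too large to inline in this sketch */
    memcpy(d.inline_data, payload, len);

    /* Through a write-combining mapping this typically leaves the CPU as one
     * or two 64-byte posted writes; no DMA read is needed afterwards. */
    memcpy(nic_sq_slot, &d, sizeof(d));
}
```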

Realistic data rate over PCI bus using DMA?

What is the realistic data transfer rate over a 32-bit/33 MHz PCI bus? We need to transfer 32K 32-bit samples from a PCI card to an Intel CPU running Windows. I would think the block would transfer in 1 ms, but it is taking 40 ms. The PCI board has a PLX PCI-9056. We are accessing card memory through a virtual address, but our CPU is maxed out, which makes me think the data rate is being held back by CPU involvement. If we switch to DMA, will the transfer get closer to 1 ms? The reason I have my doubts is that the PXI SDK User Manual states:
"BAR space memory read/write is generally slow in relative terms. Reads are typically only 2-4MB/s."
You should check whether you can enable burst mode and continuous burst, so that multiple DWORDs can be transmitted without new address cycles. This makes things much faster. The PLX PCI9056 supports this option, but it must be enabled accordingly by software.
We see data rates of up to 90 MB/s with DMA master transfers on our custom-designed frame grabber card.
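A quick back-of-the-envelope check ties these numbers together (the 3 MB/s PIO figure is simply the middle of the 2-4 MB/s range quoted from the manual):

```c
#include <stdio.h>

int main(void)
{
    const double bytes    = 32768.0 * 4.0;  /* 32K samples x 32 bits = 128 KiB    */
    const double pci_peak = 33e6 * 4.0;     /* 32-bit/33 MHz PCI peak: ~132 MB/s  */
    const double pio_rate = 3e6;            /* ~2-4 MB/s BAR reads per the manual */
    const double dma_rate = 90e6;           /* rate reported for DMA master xfer  */

    printf("theoretical PCI peak : %.2f ms\n", bytes / pci_peak * 1e3); /* ~1 ms   */
    printf("PIO via BAR          : %.1f ms\n", bytes / pio_rate * 1e3); /* ~44 ms  */
    printf("DMA master transfer  : %.2f ms\n", bytes / dma_rate * 1e3); /* ~1.5 ms */
    return 0;
}
```

The ~44 ms PIO estimate is close to the observed 40 ms, and the reported DMA rate would bring the transfer back near the 1 ms ballpark.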
