Realistic data rate over PCI bus using DMA?

What is the realistic data transfer rate over a 32-bit/33 MHz PCI bus? We need to transfer 32K 32-bit samples from a PCI card to an Intel CPU running Windows. I would expect the block to transfer in about 1 msec, but it is taking 40 msec. The PCI board has a PLX PCI-9056. We are accessing card memory through a virtual address, but our CPU is maxed out, which makes me think the data rate is being held up by CPU involvement. If we go to DMA, will the transfer get closer to 1 msec? The reason I have my doubts is that the PLX SDK User Manual states:
"BAR space memory read/write is generally slow in relative terms. Reads are typically only 2-4MB/s."

You should check whether you can enable burst mode and continuous burst, so that multiple DWords can be transferred without new address cycles. This makes things much faster. The PLX PCI-9056 supports this, but it must be configured accordingly in software.
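For illustration, here is a minimal sketch of what that software setup might look like for a block-mode transfer on a 9056-style bridge. The register names follow the chip's DMA channel 0 register set, but the offsets and bit positions shown are assumptions and must be checked against the PCI 9056 data book before use:

    /* Sketch only: programming a block-mode DMA transfer on a 9056-style
     * PCI bridge.  Register offsets and bit positions are placeholders --
     * take the real values from the PCI 9056 data book. */
    #include <stdint.h>

    #define DMAMODE0   0x80  /* placeholder offsets -- verify in the data book */
    #define DMAPADR0   0x84
    #define DMALADR0   0x88
    #define DMASIZ0    0x8C
    #define DMADPR0    0x90
    #define DMACSR0    0xA8

    static volatile uint8_t *regs;   /* mapped local configuration space */

    static inline void wr32(uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(regs + off) = val;
    }

    void start_block_dma(uint32_t pci_addr, uint32_t local_addr, uint32_t nbytes)
    {
        /* Enable bursting on the local bus side; exact bit numbers are
         * assumptions -- the data book defines burst enable, continuous
         * burst (BTERM#) and bus-width bits in DMAMODE0. */
        wr32(DMAMODE0, (1u << 8)   /* burst enable (assumed bit)      */
                     | (1u << 7)   /* continuous burst (assumed bit)  */
                     | 0x3);       /* 32-bit local bus (assumed code) */

        wr32(DMAPADR0, pci_addr);    /* PCI (host memory) address */
        wr32(DMALADR0, local_addr);  /* local (card) address      */
        wr32(DMASIZ0,  nbytes);      /* transfer size in bytes    */
        wr32(DMADPR0,  (1u << 3));   /* direction: local -> PCI (assumed bit) */

        /* Enable + start channel 0 (bit positions again from the data book). */
        *(regs + DMACSR0) = 0x03;
    }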
We get data rates of up to 90 MB/s with DMA master transfers on our custom-designed frame-grabber card.
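As a sanity check on these numbers (a rough calculation, assuming the ~132 MB/s theoretical peak of 32-bit/33 MHz PCI, the 2-4 MB/s BAR-read figure quoted in the question, and the ~90 MB/s DMA rate mentioned above):

    /* Back-of-envelope timing for 32K 32-bit samples over 32-bit/33 MHz PCI. */
    #include <stdio.h>

    int main(void)
    {
        const double bytes = 32768.0 * 4.0;   /* 32K 32-bit samples = 128 KiB   */
        const double peak  = 132e6;           /* theoretical PCI peak, bytes/s  */
        const double pio   = 3e6;             /* ~2-4 MB/s BAR-space reads      */
        const double dma   = 90e6;            /* DMA master rate reported above */

        printf("peak: %.2f ms\n", 1e3 * bytes / peak);  /* ~1 ms   */
        printf("PIO : %.2f ms\n", 1e3 * bytes / pio);   /* ~44 ms  */
        printf("DMA : %.2f ms\n", 1e3 * bytes / dma);   /* ~1.5 ms */
    }

The observed 40 msec falls right in the PIO range, so switching to DMA master transfers should bring the block time back toward 1-1.5 msec.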

Related

Does a DMA controller copy one word of memory at a time?

A DMA controller greatly speeds up memory copy operations because the data in memory doesn't have to be read into the CPU.
From what I've read, DMA controllers can "copy a block of memory from one location to another" in one operation, but thinking about this at a low level, I'm guessing the DMA ultimately has to iterate over memory one word at a time. Is that correct? Is that one word per clock cycle? One word per two clock cycles (one to read a word into the DMA, one to write it back to memory)? Or does the DMA have a circuit that can somehow (I can't imagine how) copy large chunks of memory in one or two clock cycles?
If the CPU tells the DMA to copy 1024 bytes of memory from one address to another, how many clock cycles will the CPU have free to perform other tasks while waiting for the DMA to finish?
Is it possible to have an architecture where the DMA is doing a memory copy using one bus, while the CPU can access memory at the same time in a different area? Say, in a different bank?
I'm sure it's architecture-dependent, so for the answers just pick one or more 8- or 16-bit home micros.
Yes, this is architecture-dependent. Usually there is a main memory bus and one or more caches. Not all buses in the system have to be the same width. Memory buses are often wider than the processor word; a 64-bit-wide bus, for example, loads 64 bits at a time.
The destination might be on the same memory bus, on PCIe, or even on another bus that is memory-mapped, in which case the transfer may be constrained by the destination bus width.
How many clock cycles the CPU has free again depends on how things are done. Usually, in a µC, the DMA controller triggers an interrupt when it is done, and the CPU does nothing related to the transfer in the meantime. Another option is polling.
Dual-port memories exist, but usually there is only one main memory bus. IIRC, banks are usually a trick to avoid large addresses; they still use the same memory bus.
Caches and bus arbitration are used to mitigate bus contention; the user really shouldn't have to care about that. Have a look at your µC's datasheet if you want reliable information.
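To make the word-at-a-time picture concrete, here is a conceptual sketch in C of what a simple memory-to-memory DMA engine does per transfer; this is hardware behaviour written as code, not any particular controller:

    /* Conceptual model: one bus read and one bus write per word, with the
     * word size set by the memory bus width rather than the CPU word. */
    #include <stddef.h>
    #include <stdint.h>

    void dma_copy(volatile uint64_t *dst, const volatile uint64_t *src, size_t words)
    {
        while (words--) {
            uint64_t w = *src++;   /* one bus read cycle (e.g. 64-bit wide bus) */
            *dst++ = w;            /* one bus write cycle                       */
        }
        /* A real controller would now raise a "transfer complete" interrupt
         * or set a status flag for the CPU to poll. */
    }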

Large-capacity SPI storage for PIC12F1840?

I'm using a PIC12F1840 chip with an MPU9250 accelerometer to collect movement data. I'm currently using a 1 Kb SPI RAM chip, but it gets full quite quickly, and there is also data loss when reading the data back from it (due to the RAM's need for continuous power!).
It doesn't have to be fast; the MPU's highest output rate is only 184 Hz, and I'm planning to use it at a slower setting by default. Can someone suggest a suitable type of memory?
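For a rough idea of how quickly a small buffer fills, here is a back-of-envelope calculation; the 12-byte sample size and the 1 KB capacity are assumptions, so adjust them for the actual data format and RAM part:

    #include <stdio.h>

    int main(void)
    {
        const double rate_hz      = 184.0;   /* assumed sample rate             */
        const double sample_bytes = 12.0;    /* assumed accel+gyro sample size  */
        const double capacity     = 1024.0;  /* assumed 1 KB of SPI RAM         */

        printf("fill rate : %.0f B/s\n", rate_hz * sample_bytes);            /* ~2200 B/s */
        printf("time full : %.2f s\n", capacity / (rate_hz * sample_bytes)); /* ~0.46 s   */
    }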

How DMA and PCIe play together?

In a PCIe configuration, devices have dedicated addresses and send data to each other peer-to-peer: every device can write whenever it wants, and the switches take care of forwarding the data correctly. There is no need for a "bus master" that decides when and how data will be transmitted.
How does DMA come into play in such a configuration? To me it seems that DMA is an outdated feature that is not needed with PCIe: every device can send data to main memory or read from it, and obviously main memory will always be the "slave" in such operations.
Or is there some other functionality of DMA that I am missing?
Thank you in advance!
When a device other than a CPU accesses memory that is attached to a CPU, this is called direct memory access (DMA). So any PCIe read or write requests issued from PCIe devices constitute DMA operations. This can be extended with 'device to device' or 'peer to peer' DMA where devices perform reads and writes against each other without involving the CPU or system memory.
There are two main advantages of DMA. First, DMA operations can move data into and out of memory with minimal CPU load, improving software efficiency. Second, a CPU can only issue individual reads and writes no larger than its word size, which results in very poor throughput over the PCIe link because of per-TLP header and other protocol overhead. A device issuing its own read and write requests can use much larger payloads, giving higher throughput and more efficient use of the bus bandwidth.
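To put rough numbers on the payload-size argument (the ~24 bytes of per-TLP overhead used here is an approximation; the exact figure depends on header format and link framing):

    #include <stdio.h>

    /* Fraction of link bytes that carry actual payload for a given TLP size. */
    static double efficiency(double payload, double overhead)
    {
        return payload / (payload + overhead);
    }

    int main(void)
    {
        const double overhead = 24.0;  /* approx. header + framing + LCRC per TLP */
        printf("8-byte PIO write : %4.0f%%\n", 100 * efficiency(8,   overhead)); /* ~25% */
        printf("128-byte DMA TLP : %4.0f%%\n", 100 * efficiency(128, overhead)); /* ~84% */
        printf("256-byte DMA TLP : %4.0f%%\n", 100 * efficiency(256, overhead)); /* ~91% */
    }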
So, DMA is absolutely not obsolete or outdated - basically all high-performance devices connected over PCIe will use DMA to use the bus efficiently.

How can a PCIe card DMA data into CPU RAM?

This is in reference to this answer given to a similar DMA/PCI question. I gathered from that answer that the PC does not have a DMA engine capable of transferring data to/from a PCI card, and that the PCI card must provide the DMA capability itself. I have received similar answers from colleagues saying, "A two-way DMA needs to be on the FPGA (referring to the PCI card) to enable burst transfers to/from CPU memory."
My understanding is that when the PC receives a read request, it needs to fulfill it by returning a completion packet with the requested data. So, if the card requests a page of data (4096 bytes), the PC needs to return a packet with 4096 bytes. How does the card's DMA reach across the bus to fill the needed packet, as that answer suggests?
I think there might be a misunderstanding here. The card does not "reach across" the bus to use a DMA function in the PC.
The card itself is a bus master. It can directly read and write the entire memory of the PC, just like the CPU can.
From the PC memory system point of view, there is no difference between the card or the main CPU in the PC. Both are bus masters. Both can perform reads and writes to memory.
Bursts of 4096 bytes are not supported, though; you will have to split the transfer into multiple smaller bursts.
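As an illustration, this is roughly what the splitting looks like from the card's side. MAX_PAYLOAD and issue_write_tlp() are hypothetical placeholders for the negotiated Max_Payload_Size and the FPGA's TLP-generation logic:

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_PAYLOAD 256u   /* negotiated Max_Payload_Size, assumed value */

    static void issue_write_tlp(uint64_t host_addr, const void *data, size_t len)
    {
        /* Hypothetical stand-in for the card's TLP-generation logic. */
        (void)host_addr; (void)data; (void)len;
    }

    void dma_write_block(uint64_t host_addr, const uint8_t *data, size_t len)
    {
        while (len) {
            /* Also stop at 4 KiB boundaries: a TLP must not cross one. */
            size_t to_boundary = 4096u - (host_addr & 0xFFFu);
            size_t chunk = len < MAX_PAYLOAD ? len : MAX_PAYLOAD;
            if (chunk > to_boundary)
                chunk = to_boundary;

            issue_write_tlp(host_addr, data, chunk);
            host_addr += chunk;
            data      += chunk;
            len       -= chunk;
        }
    }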

How long does it take to set up an I/O controller on PCIe bus

Say I have an InfiniBand or similar PCIe device and a fast Intel Core CPU and I want to send e.g. 8 bytes of user data over the IB link. Say also that there is no device driver or other kernel: we're keeping this simple and just writing directly to the hardware. Finally, say that the IB hardware has previously been configured properly for the context, so it's just waiting for something to do.
Q: How many CPU cycles will it take the local CPU to tell the hardware where the data is and that it should start sending it?
More info: I want to get an estimate of the cost of using PCIe communication services compared to CPU-local services (e.g. using a coprocessor). What I am expecting is that there will be a number of writes to registers on the PCIe device, for example setting up the address and length of a packet, and possibly some reads and writes of status and/or control registers. I expect each of these to take several hundred CPU cycles, so I would expect the overall setup to take on the order of 1000 to 2000 CPU cycles. Would I be right?
I am just looking for a ballpark answer...
Your ballpark number is correct.
If you want to send an 8-byte payload using an RDMA write, first you will write the request descriptor to the NIC using programmed I/O (PIO), and then the NIC will fetch the payload using a PCIe DMA read. I'd expect both the PIO write and the DMA read to take between 200 and 500 nanoseconds, with the PIO being the faster of the two.
You can get rid of the DMA read and save some latency by putting the payload inside the request descriptor.
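A minimal sketch of that inline-payload idea; the descriptor layout and doorbell mapping here are purely illustrative, not any real NIC's format:

    #include <stdint.h>
    #include <string.h>

    struct inline_wqe {                /* hypothetical work-queue entry layout */
        uint32_t opcode_flags;         /* e.g. "RDMA write" + "inline" flags   */
        uint32_t length;               /* payload length = 8                   */
        uint64_t remote_addr;          /* target address on the peer           */
        uint8_t  payload[8];           /* data carried inside the descriptor   */
        uint8_t  pad[8];               /* pad to a full 32-byte PIO/WC chunk   */
    };

    void post_inline_write(volatile void *nic_db, uint64_t remote_addr,
                           const void *data)
    {
        struct inline_wqe wqe = {0};
        wqe.opcode_flags = 0x1;        /* illustrative encoding */
        wqe.length       = 8;
        wqe.remote_addr  = remote_addr;
        memcpy(wqe.payload, data, 8);

        /* Copy the descriptor into the memory-mapped doorbell region; with a
         * write-combining mapping this goes out as one or two PIO writes and
         * no payload DMA read is needed. */
        memcpy((void *)nic_db, &wqe, sizeof wqe);
    }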
