I have a simple theoretical question. The DMAs I know usually have half-full or full interrupts. If I want to use a DMA for data transfer from a peripheral, how can I ensure I have received all the data, since the data may not end on a DMA transfer boundary?
For example, a serial port might send 5 bytes: I would get an interrupt for the first 4 combined together (assuming the DMA transfer size is 4), but nothing for the 5th one. What is the method people usually use to solve such a problem?
My best approach is this:
Set up a DMA memory region; let's say it spans addresses 0x2 to 0x1000.
The serial device writes bytes into this region as a circular buffer.
Each time the serial device writes, it updates its "write pointer" and stores it in bytes 0x0 and 0x1.
The PC host can DMA the write pointer and compare it with its own read pointer. The read pointer can be kept on the PC host and does not need to be involved in DMA at all. The PC then knows how much data to read, and it also knows whether an underflow or overflow has occurred.
This should be a good starting point for what you want.
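As a minimal sketch of the host-side logic, assuming the hypothetical layout above (a 2-byte, little-endian write pointer at offset 0x0, data region from 0x2 to 0x1000), the consumer could look roughly like this:

```c
/* Minimal sketch of the host-side consumer for the circular buffer
 * described above. Offsets and sizes are assumptions for illustration:
 * bytes 0x0-0x1 hold the device-maintained write pointer (assumed
 * little-endian), and the data region runs from 0x2 to 0x1000. */
#include <stdint.h>
#include <stddef.h>

#define BUF_START 0x2u
#define BUF_END   0x1000u
#define BUF_SIZE  (BUF_END - BUF_START)

/* dma_region points at a local copy of the shared region obtained via DMA. */
size_t read_new_bytes(const uint8_t *dma_region, uint16_t *read_ptr,
                      uint8_t *out, size_t out_len)
{
    /* Device-maintained write pointer lives in the first two bytes. */
    uint16_t write_ptr = (uint16_t)(dma_region[0] | (dma_region[1] << 8));

    size_t count = 0;
    while (*read_ptr != write_ptr && count < out_len) {
        out[count++] = dma_region[BUF_START + *read_ptr];
        *read_ptr = (uint16_t)((*read_ptr + 1) % BUF_SIZE);
    }
    return count;   /* number of newly consumed bytes */
}
```

Overflow detection (the device's write pointer lapping the host's read pointer) is omitted here for brevity.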
I am trying to read/write data to/from a PCIe 2.0 (2-lane) device from a Linux PC. The memory for reading and writing is at different RAM locations in the PCIe device. Those memories are mapped on the Linux PC using ioremap. My use case is to achieve 18 MB/s read/write throughput, which is well within what the PCIe link supports. The memory at the PCIe device is uncached.
I am able to achieve the write throughput, i.e., when I write from Linux PC local memory to PCIe device memory using memcpy, the memcpy takes less than 1 ms for 9216 bytes of data. But when I read from the ioremapped PCIe memory into Linux local memory, data loss occurs. I profiled the memcpy and it takes more than 1 ms, sometimes 2 ms, for 9216 bytes of data. I don't want to use DMA for this operation.
Any thoughts on what the problem could be in this case? How can I handle this?
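For context, the read path being described presumably looks roughly like the following kernel-side sketch. The BAR address, transfer wrapper, and names are assumptions; memcpy_fromio is the usual accessor for ioremapped memory, though using it does not change the underlying PCIe read behaviour discussed in the answer below:

```c
/* Sketch of copying from ioremapped PCIe device memory in a Linux
 * kernel module. Addresses, sizes, and function names are illustrative. */
#include <linux/io.h>
#include <linux/types.h>
#include <linux/errno.h>

#define DEV_MEM_PHYS  0xf0000000UL   /* assumed BAR address of the device memory */
#define XFER_LEN      9216           /* transfer size from the question */

static int read_from_device(u8 *local_buf)
{
    void __iomem *dev_mem = ioremap(DEV_MEM_PHYS, XFER_LEN);

    if (!dev_mem)
        return -ENOMEM;

    /* Each access here turns into an individual, serialized PCIe read. */
    memcpy_fromio(local_buf, dev_mem, XFER_LEN);

    iounmap(dev_mem);
    return 0;
}
```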
That's entirely expected, and there is nothing you can do about it. The CPU can only issue serialized, word-sized reads and writes, which have very poor throughput over the PCIe link due to protocol overheads. Every operation carries 24 or 28 byte-times worth of overhead: a 12- or 16-byte TLP header plus 12 byte-times of link-layer overhead. Since the CPU can only operate on 4 or 8 bytes at a time, that is at best 25% efficient (8/(8+24) = 25%) and at worst 12.5% efficient (4/(4+28) = 12.5%).
The protocol overhead is not the only issue, however. Writes in PCIe are posted, so the CPU can simply issue a bunch of back-to-back writes which eventually make their way onto the bus and to the device. When reading, on the other hand, the CPU can only issue a single read operation, wait for it to traverse the bus twice (request out, completion back), store the result, issue another read, and so on. Since it can only operate on 8 bytes at a time, the performance is horrible due to the relatively high latency over the PCIe bus (it can be on the order of microseconds for each transfer).
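As a rough back-of-the-envelope check (the 1 µs round-trip latency is an assumed ballpark figure, not a measurement):

```c
/* Rough estimate of throughput for serialized 8-byte PCIe reads,
 * assuming ~1 microsecond round-trip latency per read. */
#include <stdio.h>

int main(void)
{
    double bytes_per_read = 8.0;     /* CPU word size */
    double latency_s      = 1e-6;    /* assumed round-trip latency per read */
    double throughput     = bytes_per_read / latency_s;

    /* Prints roughly 8 MB/s -- in the same range as the ~5-9 MB/s
     * implied by the question's 9216 bytes in 1-2 ms, and well under
     * the 18 MB/s target. */
    printf("~%.1f MB/s\n", throughput / 1e6);
    return 0;
}
```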
The solution? Use DMA. PCIe is specifically designed to support efficient DMA operations over the bus, since devices can issue much larger read and write requests, with a maximum payload size of at least 128 bytes per transaction.
Say I have an InfiniBand or similar PCIe device and a fast Intel Core CPU, and I want to send, e.g., 8 bytes of user data over the IB link. Say also that there is no device driver or other kernel involvement: we're keeping this simple and just writing directly to the hardware. Finally, say that the IB hardware has previously been configured properly for the context, so it's just waiting for something to do.
Q: How many CPU cycles will it take the local CPU to tell the hardware where the data is and that it should start sending it?
More info: I want to get an estimate of the cost of using PCIe communication services compared to CPU-local services (e.g. using a coprocessor). What I am expecting is that there will be a number of writes to registers on the PCIe bus, for example setting up an address and length of a packet, and possibly some reads and writes of status and/or control registers. I expect each of these to take several hundred CPU cycles, so I would expect the overall setup to take on the order of 1000 to 2000 CPU cycles. Would I be right?
I am just looking for a ballpark answer...
Your ballpark number is correct.
If you want to send an 8-byte payload using an RDMA write, first you will write the request descriptor to the NIC using programmed I/O (PIO), and then the NIC will fetch the payload using a PCIe DMA read. I'd expect both the PIO write and the DMA read to take between 200 and 500 nanoseconds each, although the PIO should be faster.
You can get rid of the DMA read and save some latency by putting the payload inside the request descriptor.
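As a minimal sketch of the PIO step (the descriptor layout, register offsets, and names here are hypothetical; a real NIC defines its own formats), assuming the device's BAR has already been mapped into the address space:

```c
/* Hypothetical example of posting a work-request descriptor to a NIC
 * via programmed I/O. The descriptor layout and doorbell offset are
 * made up for illustration; real hardware defines its own. */
#include <stdint.h>

struct wqe {                 /* hypothetical work-queue entry */
    uint64_t payload_addr;   /* physical address of the 8-byte payload */
    uint32_t length;         /* payload length in bytes */
    uint32_t opcode;         /* e.g. "RDMA write" */
};

#define DOORBELL_OFFSET 0x800u   /* hypothetical doorbell register offset */

static void post_send(volatile uint8_t *bar, const struct wqe *req)
{
    volatile struct wqe *slot = (volatile struct wqe *)bar;

    /* PIO: copy the descriptor into device memory, one word at a time. */
    slot->payload_addr = req->payload_addr;
    slot->length       = req->length;
    slot->opcode       = req->opcode;

    /* Ensure the descriptor writes are visible before ringing the doorbell. */
    __sync_synchronize();

    /* Doorbell write tells the NIC there is work to fetch. */
    *(volatile uint32_t *)(bar + DOORBELL_OFFSET) = 1;
}
```

If the payload is inlined into the descriptor, as suggested above, the payload_addr field would carry the data itself and the NIC's DMA read is avoided.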
Say we have a 32-bit-wide memory bus to a shared memory in a network switch. I want to make the storing of packets as parallel as possible, so I put a DMA engine after each input port, so that the switch controller is not blocked until a packet is stored completely. Assume each input port's packet is 8 bits. Could the memory bus be decomposed into four 8-bit sub-buses so that each DMA engine can move an 8-bit packet into its corresponding memory address in parallel (ignoring conflicts for now)?
Sorry for such a weird question; I don't know much about computer organization and architecture.
I've been going through many technical documents on packet capture/processing and host stacks trying to understand it all; there are a few areas where I'm troubled, and hopefully someone can help.
Assuming you're running tcpdump:
After a packet gets copied from a NIC's ring buffer (that's physical NIC memory, right?),
does it immediately get stored in an mbuf? And then BPF gets a copy of the packet from the mbuf, which is stored in the BPF buffer, so there are two copies in memory at the same time? I'm trying to understand the exact process.
Or is it more like this: the packet gets copied from the NIC to both the mbuf (for host-stack processing) and to the BPF buffer pseudo-simultaneously?
Once a packet goes through host-stack processing, the ip/tcp input functions take the mbuf as the packet's location (i.e., packets are stored in mbufs). If the packet is not addressed to the system, say it was received by monitoring traffic via a hub or a SPAN/monitor port, the packet is discarded and never makes its way up the host stack.
I seem to have come across diagrams which show the NIC ring buffer (RX/TX) inside a kernel "box", separating it from userspace, which makes me second-guess whether a ring buffer is actually allocated in system memory rather than being the physical memory on the NIC.
Assuming that a ring buffer refers to the NIC's physical memory, is it correct that the device driver determines the size of the NIC ring buffer, physical limitations aside? E.g., can I shrink the buffer by modifying the driver?
Thanks!
The ETHER_BPF_MTAP macro calls bpf_mtap(), which expects a packet in mbuf format, and BPF copies the data from this mbuf into its internal buffer.
But mbufs can use external storage, so there may or may not be a copy from the NIC ring buffer into the mbuf. An mbuf can actually contain the packet data itself, or serve just as a header with a reference to the receive buffer.
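As a simplified illustration of that distinction (this is not the real FreeBSD struct mbuf, just a sketch of the idea):

```c
/* Simplified sketch of the "data inside the mbuf" vs. "mbuf points at
 * external storage" distinction. Field names are illustrative only. */
#include <stdint.h>
#include <stddef.h>

struct fake_mbuf {
    size_t   len;            /* bytes of packet data */
    uint8_t *ext_buf;        /* if non-NULL, data lives in external storage
                                (e.g. a driver receive buffer), no copy made */
    uint8_t  data[256];      /* otherwise a copy of the packet lives here */
};

static const uint8_t *mbuf_data(const struct fake_mbuf *m)
{
    /* BPF (or the stack) reads from wherever the mbuf says the data is. */
    return m->ext_buf ? m->ext_buf : m->data;
}
```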
Also, current NICs use their small onboard memory (128/96/... KB) only as a FIFO and immediately transfer all data to ring buffers in main memory. So yes, you really can adjust the buffer size in the device driver.
Does this mean that the buffers of I/O devices are assigned addresses in the total memory space, just like the bytes of main memory are?
That's basically it. You have I/O devices which monitor the address lines (and data lines, and control lines) of your processor to "capture" certain addresses and act on them.
For example, you may have a memory mapped keyboard device (using address 0xff00) that basically collects the keystrokes from the physical keyboard and buffers them ready to be received by the processor.
So, when it sees address 0xff00 on the address lines along with a read signal (such as a memio line and the r/not-w line both going high, indicating a memory read is desired), it will put the code for the keypress onto the data lines and signal the processor to read it.
If no keypresses are buffered, it may just give back a code of 0 (it depends entirely on the protocol).
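As a minimal sketch of how software might poll such a memory-mapped keyboard (the address 0xff00 comes from the example above, treating 0 as "no key" is the assumed protocol, and this only makes sense on bare-metal hardware that actually decodes that address):

```c
/* Bare-metal sketch: poll a hypothetical memory-mapped keyboard register
 * at 0xff00. The volatile qualifier forces a real bus read each time
 * instead of letting the compiler cache the value. */
#include <stdint.h>

#define KBD_DATA ((volatile uint8_t *)0xff00u)

/* Busy-wait until the device puts a non-zero keycode on the data lines. */
static uint8_t wait_for_key(void)
{
    uint8_t code;
    do {
        code = *KBD_DATA;   /* each read goes out on the address/data bus */
    } while (code == 0);    /* 0 means "no keypress buffered" in this example */
    return code;
}
```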
Pretty much. It's not that the actual peripheral hardware buffers must be mapped directly; the OS / memory mapper will take care of it somehow.