FreeBSD: Questions about NIC ring buffers, mbufs, and bpf buffers

I've been going through many technical documents on packet capture/processing and host stacks, trying to understand it all. There are a few areas where I'm troubled; hopefully someone can help.
Assuming you're running tcpdump:
After a packet gets copied from a NIC's ring buffer (physical NIC memory, right?), does it immediately get stored into an mbuf? And does BPF then get a copy of the packet from the mbuf, which is stored in the BPF buffer, so that there are two copies in memory at the same time? I'm trying to understand the exact process.
Or is it more like: the packet gets copied from the NIC to both the mbuf (for host stack processing) and to the BPF pseudo-simultaneously?
Once a packet goes through host stack processing, the ip/tcp input functions take a pointer to the mbuf holding it (i.e. packets are stored in mbufs). If the packet is not addressed to the system, say because it was received while monitoring traffic via a hub or SPAN/monitor port, the packet is discarded and never makes its way up the host stack.
I seem to have come across diagrams which show the NIC ring buffer (RX/TX) in a kernel "box", separating it from userspace, which makes me second-guess whether a ring buffer is actually allocated system memory, distinct from the physical memory on a NIC.
Assuming that a ring buffer refers to the NIC's physical memory, is it correct that the device driver determines the size of the NIC ring buffer, setting physical limitations aside? e.g. can I shrink the buffer by modifying the driver?
Thanks!

The ETHER_BPF_MTAP macro calls bpf_mtap(), which expects the packet in mbuf format, and bpf copies the data from this mbuf into its internal buffer.
But mbufs can use external storage, so there may or may not be a copy from the NIC ring buffer to the mbuf. An mbuf can either contain the packet data itself or serve just as a header that references the receive buffer.
Also, current NICs use their small (128/96/... KB) onboard memory only as a FIFO and immediately transfer all data to ring buffers in main memory. So you really can adjust the buffer size in the device driver.
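To make the copy/no-copy distinction concrete, here is a minimal Python model of it. This is only a sketch, not FreeBSD code: the class names RxRing, Mbuf and BpfBuffer and the method names dma_write and mtap are hypothetical stand-ins. It assumes the ring lives in main memory, the mbuf merely references a ring slot (external storage), and the bpf tap is the only step that copies.

```python
class RxRing:
    """Receive ring in main memory; the NIC stores frames into its slots via DMA."""

    def __init__(self, slots, slot_size):
        self.slots = [bytearray(slot_size) for _ in range(slots)]
        self.lengths = [0] * slots
        self.head = 0  # next slot the NIC will fill

    def dma_write(self, frame):
        """Stand-in for the NIC's DMA engine storing one frame; returns the slot index."""
        idx = self.head
        self.slots[idx][:len(frame)] = frame
        self.lengths[idx] = len(frame)
        self.head = (self.head + 1) % len(self.slots)
        return idx


class Mbuf:
    """Header-only mbuf: references external storage instead of holding the data."""

    def __init__(self, storage, length):
        self.data = memoryview(storage)[:length]  # a reference, not a copy


class BpfBuffer:
    """The capture buffer: keeps its own copy of every tapped packet."""

    def __init__(self):
        self.store = bytearray()
        self.packets = 0

    def mtap(self, m):
        self.store += bytes(m.data)  # the one real copy in this model
        self.packets += 1


ring = RxRing(slots=4, slot_size=64)
idx = ring.dma_write(b"\x01\x02\x03\x04")
m = Mbuf(ring.slots[idx], ring.lengths[idx])
bpf = BpfBuffer()
bpf.mtap(m)

ring.slots[idx][0] = 0xFF          # overwrite the ring slot afterwards
print(m.data[0] == 0xFF)           # True: the mbuf shares the ring's memory
print(bpf.store[0] == 0x01)        # True: bpf holds an independent copy
```

In this model there are two copies of the packet only because of the tap; a driver that copies ring data into the mbuf instead of referencing it would add a third.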

Related

LAN Driver Interruptions

I need to know how the computer handles Local Area Network Input and Output Processor interrupts. I have been looking for a while but can't seem to find anything. I came across some RJ-45 port information, but not much of what I specifically need. If someone has some information on how the CPU interrupts a process to call the pointer and therefore the driver, plus how this process works, it would be much appreciated.
Thanks

Typically, the driver for the LAN card configures the card to issue an interrupt when the receive buffer gets close to full or the send buffer gets close to empty. Typically, these buffers live in system memory and the network hardware uses DMA to pull transmitted packets from, and store received packets in, system memory.
When the interrupt triggers, some process on some core is typically interrupted and the network code begins executing. If it's a send interrupt and there are more packets to send, more packets are attached to the send buffer. If it's a receive interrupt, typically more packet buffers are attached to the receive buffer. The driver typically arranges for a "bottom half" to be dispatched to handle whatever other work needs to be done (such as processing the received packets), and then the interrupt completes.
There's a ton of possible variation based upon many factors, but this is the basic idea.
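The receive path described above can be sketched as a small Python model. This is illustrative only; NicDriver and its method names are hypothetical, and a real driver does this with descriptor rings and DMA rather than Python queues. The interrupt handler only hands filled buffers to a deferred queue and reposts empty buffers, while the "bottom half" does the actual packet processing later.

```python
from collections import deque


class NicDriver:
    def __init__(self, ring_size):
        self.rx_ring = deque()         # buffers the NIC has filled via DMA
        self.free_buffers = ring_size  # empty buffers currently posted to the NIC
        self.deferred = deque()        # work queued for the bottom half
        self.delivered = []            # packets handed to the network stack

    def nic_receive(self, frame):
        """Hardware side: store a frame in a posted buffer, or drop it if none is free."""
        if self.free_buffers == 0:
            return False
        self.free_buffers -= 1
        self.rx_ring.append(frame)
        return True

    def interrupt(self):
        """Top half: take filled buffers off the ring and attach fresh ones."""
        while self.rx_ring:
            self.deferred.append(self.rx_ring.popleft())
            self.free_buffers += 1     # a replacement buffer is posted

    def bottom_half(self):
        """Deferred work: process the received packets outside interrupt context."""
        while self.deferred:
            self.delivered.append(self.deferred.popleft())


nic = NicDriver(ring_size=2)
nic.nic_receive(b"pkt-a")
nic.nic_receive(b"pkt-b")
print(nic.nic_receive(b"pkt-c"))  # False: ring full, frame dropped
nic.interrupt()                    # buffers reposted, work deferred
nic.bottom_half()
print(nic.delivered)               # [b'pkt-a', b'pkt-b']
```

The dropped third frame shows why the interrupt is raised before the receive buffer is completely full: the driver needs time to attach more buffers.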

How long does it take to set up an I/O controller on PCIe bus

Say I have an InfiniBand or similar PCIe device and a fast Intel Core CPU and I want to send e.g. 8 bytes of user data over the IB link. Say also that there is no device driver or other kernel: we're keeping this simple and just writing directly to the hardware. Finally, say that the IB hardware has previously been configured properly for the context, so it's just waiting for something to do.
Q: How many CPU cycles will it take the local CPU to tell the hardware where the data is and that it should start sending it?
More info: I want to get an estimate of the cost of using PCIe communication services compared to CPU-local services (e.g. using a coprocessor). What I am expecting is that there will be a number of writes to registers on the PCIe bus, for example setting up the address and length of a packet, and possibly some reads and writes of status and/or control registers. I expect each of these will take several hundred CPU cycles, so I would expect the overall setup to take on the order of 1000 to 2000 CPU cycles. Would I be right?
I am just looking for a ballpark answer...

Your ballpark number is correct.
If you want to send an 8-byte payload using an RDMA write, first you will write the request descriptor to the NIC using programmed I/O (PIO), and then the NIC will fetch the payload using a PCIe DMA read. I'd expect both the PIO and the DMA read to take between 200 and 500 nanoseconds each, although the PIO should be faster.
You can get rid of the DMA read and save some latency by putting the payload inside the request descriptor.
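As a rough check of those figures against the question's estimate in cycles, the arithmetic can be written out. The 3 GHz clock is an assumption (the question only says "a fast Intel Core CPU"), and using the same 200 to 500 ns range for both steps slightly overstates the PIO cost, since the PIO should be the faster of the two.

```python
CLOCK_GHZ = 3  # assumed core clock; at 1 GHz, one cycle takes one nanosecond

PIO_NS = (200, 500)       # descriptor write, (low, high) in nanoseconds
DMA_READ_NS = (200, 500)  # payload fetch by the NIC, (low, high) in nanoseconds


def ns_to_cycles(ns, clock_ghz=CLOCK_GHZ):
    """Convert a duration in nanoseconds to CPU cycles."""
    return ns * clock_ghz


def send_cost_cycles(inline_payload):
    """Return (low, high) cycle estimates for one small send."""
    low, high = PIO_NS
    if not inline_payload:
        # The NIC has to fetch the payload with a separate DMA read.
        low += DMA_READ_NS[0]
        high += DMA_READ_NS[1]
    return ns_to_cycles(low), ns_to_cycles(high)


print(send_cost_cycles(inline_payload=False))  # (1200, 3000)
print(send_cost_cycles(inline_payload=True))   # (600, 1500)
```

Under these assumptions the non-inline case lands in the same ballpark as the 1000 to 2000 cycles guessed in the question, and inlining the payload roughly halves it.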

Receiving data with DMA

I have a simple theoretical question. The DMAs I know usually have half-full or full interrupts. If I want to use a DMA for data transfer from a peripheral, how can I ensure I got all the data, since the data may not end on a DMA transfer boundary?
For example, a serial port might send 5 bytes. I would get an interrupt for the first 4 combined together (assuming the DMA size is 4), but nothing for the 5th one. What method do people usually use to solve such a problem?

My best approach is this:
Set up a DMA memory region. Let's say its addresses are 0x2 to 0x1000.
The serial device writes bytes into this region, treating it as a circular buffer.
Each time the serial device writes, it updates its "write pointer" and saves it in bytes 0x0 and 0x1.
The PC host can DMA the write pointer and compare it with its own read pointer. The read pointer can be kept on the PC host and not deal with DMA at all. Then the PC knows how much memory to read, and it also knows if there has been an underflow or overflow.
This should be a good starting point for what you want.
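The scheme above can be sketched in Python as follows. This is a model under stated assumptions, not a driver: SharedRegion and HostReader are hypothetical names, the "write pointer" is kept as a free-running 16-bit byte count (which is what makes overflow detectable), and the data area is a power-of-two size so the count wraps cleanly, which differs slightly from the 0x2 to 0x1000 layout.

```python
class SharedRegion:
    """DMA-visible memory: a 2-byte write counter followed by a circular data area."""

    def __init__(self, data_size):
        # Assumption: power-of-two size (at most 2**15) so the 16-bit counter wraps cleanly.
        assert 0 < data_size <= 1 << 15 and data_size & (data_size - 1) == 0
        self.data_size = data_size
        self.mem = bytearray(2 + data_size)

    def write_count(self):
        """Read the device's write counter from bytes 0x0 and 0x1."""
        return int.from_bytes(self.mem[:2], "little")

    def device_write(self, payload):
        """Device side: store bytes circularly, then publish the new counter."""
        count = self.write_count()
        for b in payload:
            self.mem[2 + count % self.data_size] = b
            count = (count + 1) & 0xFFFF
        self.mem[:2] = count.to_bytes(2, "little")


class HostReader:
    """Host side: keeps its own read counter, never shared with the device."""

    def __init__(self, region):
        self.region = region
        self.read_count = 0

    def poll(self):
        """Return (new_bytes, overflowed)."""
        wc = self.region.write_count()
        size = self.region.data_size
        pending = (wc - self.read_count) & 0xFFFF
        overflowed = pending > size
        if overflowed:
            # The oldest bytes were overwritten; resume at the oldest survivor.
            self.read_count = (wc - size) & 0xFFFF
            pending = size
        out = bytearray()
        for _ in range(pending):
            out.append(self.region.mem[2 + self.read_count % size])
            self.read_count = (self.read_count + 1) & 0xFFFF
        return bytes(out), overflowed


region = SharedRegion(data_size=8)
host = HostReader(region)
region.device_write(b"hello")        # 5 bytes: not a multiple of any block size
print(host.poll())                   # (b'hello', False)
print(host.poll())                   # (b'', False): nothing new
region.device_write(b"0123456789")   # 10 bytes into an 8-byte area
print(host.poll())                   # (b'23456789', True)
```

Because the host compares counters rather than waiting for a block-sized interrupt, the 5th byte in the question's example is picked up on the next poll.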

Memory data bus decomposition

Say we have a 32-bit wide memory bus to a shared memory in a network switch. I want to maximize parallelism when storing packets. I put a DMA after each input port, so the switch controller will not be blocked until one packet is stored completely. Assume one packet on each input port is 8 bits. Could the memory bus be decomposed into four 8-bit sub-buses, so that each DMA could write an 8-bit wide packet to its corresponding memory address in parallel (ignoring conflicts for now)?
Sorry for such a weird question, and for not knowing much about computer organization and architecture.

What is the meaning of memory-mapped I/O?

Does it mean that the buffers of I/O devices are assigned addresses in the total memory space, just like the bytes of main memory are?

That's basically it. You have I/O devices which monitor the address lines (and data lines, and control lines) of your processor to "capture" certain addresses and act on them.
For example, you may have a memory mapped keyboard device (using address 0xff00) that basically collects the keystrokes from the physical keyboard and buffers them ready to be received by the processor.
So, when it sees address 0xff00 on the address lines and a read signal (such as a memio line and the r/not-w line both going high, indicating a memory read is desired), it will inject the code for the keypress onto the data lines and signal the processor to read it.
If no keypresses are buffered, it may just give back a code of 0 (it depends entirely on the protocol).
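The keyboard example can be modelled in a few lines of Python. This is a toy model only: Bus and Keyboard are hypothetical names, the bus simply decodes the address in software where real hardware watches the address and control lines, and the device is treated as read-only.

```python
from collections import deque

KEYBOARD_ADDR = 0xFF00  # address "captured" by the keyboard device, as in the example


class Keyboard:
    """Collects keystrokes and buffers them until the processor reads them."""

    def __init__(self):
        self.buffer = deque()

    def press(self, code):
        self.buffer.append(code)

    def read(self):
        # Protocol assumption from the text: 0 means no keypress is buffered.
        return self.buffer.popleft() if self.buffer else 0


class Bus:
    """Routes each access to RAM or to the device that claims the address."""

    def __init__(self, ram_size, keyboard):
        self.ram = bytearray(ram_size)
        self.keyboard = keyboard

    def read(self, addr):
        if addr == KEYBOARD_ADDR:
            return self.keyboard.read()  # the device drives the data lines
        return self.ram[addr]

    def write(self, addr, value):
        if addr == KEYBOARD_ADDR:
            return                       # read-only device in this sketch
        self.ram[addr] = value


kbd = Keyboard()
bus = Bus(ram_size=0x10000, keyboard=kbd)
kbd.press(0x41)
print(hex(bus.read(KEYBOARD_ADDR)))  # 0x41: looks like an ordinary memory load
print(bus.read(KEYBOARD_ADDR))       # 0: nothing buffered
```

From the processor's point of view both calls are plain reads of address 0xff00; only the bus knows that no RAM is involved.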

Pretty much. The actual peripheral hardware buffers need not be mapped directly; the OS or mapper will take care of it somehow.
