Say we have a 32-bit wide memory bus to a shared memory in a network switch, and I want to store packets with as much parallelism as possible. I put a DMA engine after each input port so that the switch controller is not blocked until a packet has been stored completely. Assume each packet from an input port is 8 bits wide. Could the memory bus be split into four 8-bit sub-buses so that each DMA engine can write its 8-bit packet to the corresponding memory address in parallel (ignoring conflicts for now)?
Sorry for such a weird question, and for not knowing much about computer organization and architecture.
Related
I am trying to read and write data between a Linux PC and a PCIe 2.0 (2-lane) device. The memory regions for reading and writing are at different RAM locations in the PCIe device, and they are mapped on the Linux PC using ioremap. My use case needs 18 MBytes/second of read/write throughput, which the PCIe link obviously supports. The memory on the PCIe device is uncached.
I can achieve the write throughput, i.e. when I write from Linux PC local memory to PCIe device memory using memcpy; the memcpy takes less than 1 ms for 9216 bytes of data in this case. But when I read the ioremapped PCIe memory into Linux local memory, data is lost. I profiled the memcpy and it takes more than 1 ms, sometimes 2 ms, for 9216 bytes of data. I don't want to use DMA for this operation.
Any thoughts on what can be the problem in this case? How can I handle this?
That's entirely expected, and there is nothing you can do about that. The CPU can only issue serialized word-sized reads and writes, which have very poor throughput over the PCIe link due to protocol overheads. Every operation has 24 or 28 byte-times worth of overhead associated with it: a 12 or 16 byte TLP header plus 12 byte-times of link layer overhead. The CPU can only operate on 4 or 8 bytes at a time, which is at best 25% efficient (8/(8+24) = 25%) and at worst 12.5% efficient (4/(4+28) = 12.5%).
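The efficiency arithmetic above can be reproduced with a short sketch. The function name and the 128-byte comparison case are illustrative assumptions; the header and link-layer overhead figures are the ones quoted in the answer.

```python
def pcie_efficiency(payload_bytes, header_bytes, link_overhead_bytes=12):
    """Fraction of link byte-times that carry payload for a single transfer."""
    total = payload_bytes + header_bytes + link_overhead_bytes
    return payload_bytes / total

best = pcie_efficiency(8, 12)    # 8-byte CPU access, 12-byte TLP header
worst = pcie_efficiency(4, 16)   # 4-byte CPU access, 16-byte TLP header
large = pcie_efficiency(128, 16) # assumed 128-byte payload, for comparison only

print(f"best={best:.1%} worst={worst:.1%} large={large:.1%}")
```

The first two values match the 25% and 12.5% figures; the third only illustrates how the fixed overhead shrinks in relative terms once each transfer carries more data.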
The protocol overhead is not the only issue, however. Writes in PCIe are posted, so the CPU can simply issue a bunch of back-to-back writes which eventually make their way onto the bus and to the device. On the other hand, when reading, the CPU can only issue a single read operation, wait for it to traverse the bus twice, store the result, issue another read, etc. Since it can only operate on 8 bytes at a time, the performance is horrible due to the relatively high latency over the PCIe bus (can be on the order of microseconds for each transfer).
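As a rough sanity check, the serialized-read model above can be turned into a time estimate for the 9216-byte buffer from the question. The 1 microsecond round trip is an assumed figure chosen only because the answer says latency can be on the order of microseconds; real hardware will differ.

```python
def pio_read_time_us(total_bytes, bytes_per_read, round_trip_us):
    """Time to read a buffer when each read must finish before the next is issued."""
    reads = -(-total_bytes // bytes_per_read)  # ceiling division
    return reads * round_trip_us

ROUND_TRIP_US = 1.0  # assumption: one microsecond per non-posted read

t8 = pio_read_time_us(9216, 8, ROUND_TRIP_US)  # 8-byte reads
t4 = pio_read_time_us(9216, 4, ROUND_TRIP_US)  # 4-byte reads

print(f"8-byte reads: {t8:.0f} us, 4-byte reads: {t4:.0f} us")
```

Under that assumption the estimate lands between roughly 1 ms and 2 ms, which is in the same range as the memcpy times reported in the question.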
The solution? Use DMA. PCIe is specifically designed to support efficient DMA operations over the bus, because devices can issue much larger read and write operations: every device supports a payload of at least 128 bytes per operation.
Say I have an InfiniBand or similar PCIe device and a fast Intel Core CPU and I want to send e.g. 8 bytes of user data over the IB link. Say also that there is no device driver or other kernel: we're keeping this simple and just writing directly to the hardware. Finally, say that the IB hardware has previously been configured properly for the context, so it's just waiting for something to do.
Q: How many CPU cycles will it take the local CPU to tell the hardware where the data is and that it should start sending it?
More info: I want to get an estimate of the cost of using PCIe communication services compared to CPU-local services (e.g. using a coprocessor). What I am expecting is that there will be a number of writes to registers on the PCIe bus, for example setting up the address and length of a packet, and possibly some reads and writes of status and/or control registers. I expect each of these will take several hundred CPU cycles, so I would expect the overall setup to take on the order of 1000 to 2000 CPU cycles. Would I be right?
I am just looking for a ballpark answer...
Your ballpark number is correct.
If you want to send an 8-byte payload using an RDMA write, you first write the request descriptor to the NIC using programmed I/O (PIO), and then the NIC fetches the payload using a PCIe DMA read. I'd expect both the PIO and the DMA read to take between 200 and 500 nanoseconds, although the PIO should be faster.
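To relate those latencies to the cycle counts in the question, here is a minimal conversion sketch. The 3 GHz clock is an assumption (the question only says "a fast Intel Core CPU"), and treating the PIO and the DMA read as two sequential steps that each fall in the 200 to 500 ns window is one possible reading of the estimate.

```python
def ns_to_cycles(latency_ns, cpu_ghz):
    """Convert a latency in nanoseconds to CPU cycles at the given clock rate."""
    return latency_ns * cpu_ghz

CPU_GHZ = 3.0  # assumption: clock frequency of the host CPU

per_step_low = ns_to_cycles(200, CPU_GHZ)
per_step_high = ns_to_cycles(500, CPU_GHZ)

# Assumed interpretation: PIO write followed by a DMA read, each 200-500 ns.
total_low = 2 * per_step_low
total_high = 2 * per_step_high

print(f"per step: {per_step_low:.0f}-{per_step_high:.0f} cycles")
print(f"PIO + DMA read: {total_low:.0f}-{total_high:.0f} cycles")
```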
You can get rid of the DMA read and save some latency by putting the payload inside the request descriptor.
In I/O-mapped I/O (as opposed to memory-mapped I/O), a certain set of addresses is fixed for I/O devices. Are these addresses part of the RAM, so that this much of the physical address space is unusable? Does it correspond to the 'Hardware Reserved' memory in the attached picture?
If yes, how is it decided which bits of an address are used for addressing I/O devices? (The I/O address space would be much smaller than the actual memory, and I have read that this helps reduce the number of pins/bits used by the decoding circuit.)
What would happen if one tries to access, in assembly, any address that belongs to this address space?
I/O mapped I/O doesn't use the same address space as memory mapped I/O. The latter does use part of the address space normally used by RAM and therefore "steals" addresses that no longer belong to RAM.
The set of address ranges that are used by different memory mapped I/O is what you see as "Hardware reserved".
As for how it is decided where memory mapped devices are addressed, this is largely handled by the PnP subsystem, either in the BIOS or in the OS. Memory-mapped devices, with few exceptions, are PnP devices, which means that the base address of each of them can be changed (for PCI devices, the base address of the memory mapped registers, if any, is contained in a BAR, or Base Address Register, which is part of the PCI configuration space).
Saving pins for decoding devices (lazy decoding) is (was) done on early 8-bit systems, to save decoders and reduce costs. It has nothing to do with memory mapped versus I/O mapped devices; lazy decoding may be used in both situations. For example, a designer could decide that the 16-bit address range C000-FFFF is going to be reserved for memory mapped devices. To decide whether to enable a memory chip or a device, it's enough to look at the values of A15 and A14. If both address lines are high, then the block addressed is C000-FFFF, and that means the memory chip enables will be deasserted. On the other hand, a designer could decide that the 8-bit I/O port 254 is going to be assigned to a device, and to decode this address the device only looks at the state of A0, needing no decoders to find out the port address (this is, for example, what the ZX Spectrum does for addressing the ULA).
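The two lazy-decoding examples above can be sketched as simple bit tests. The function names are made up for illustration; the address lines checked are the ones named in the text.

```python
def selects_mmio_block(address):
    """True when A15 and A14 are both high, i.e. the address is in C000-FFFF."""
    return ((address >> 14) & 0b11) == 0b11

def selects_ula_port(port):
    """True when A0 is low; all other address lines are ignored."""
    return (port & 1) == 0

# Only two address lines are needed to recognise the whole C000-FFFF block.
print(selects_mmio_block(0xC000), selects_mmio_block(0xBFFF))

# Port 254 is selected, but so is every other even port: the price of
# looking at a single address line is that the device appears at many aliases.
print(selects_ula_port(254), selects_ula_port(252), selects_ula_port(255))
```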
If a program (written in any language that allows you to read and write arbitrary memory locations) tries to access a memory address reserved for a device, and assuming that the paging and protection mechanism allows such access, what happens will depend solely on what the device does when that address is accessed. A well known memory mapped device in PCs is the frame buffer. If the graphics card is configured to display color text mode with its default base address, any 8-bit write to an even physical address between B8000 and B8F9F will cause the character whose ASCII code is the value written to show on screen, in a location that depends on the address chosen.
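To make "a location that depends on the address chosen" concrete, here is a small sketch that maps a text-mode address to a screen cell. It assumes the 80x25 layout, which is consistent with the B8000-B8F9F range (80 x 25 x 2 = 4000 bytes); the function name is hypothetical.

```python
TEXT_BASE = 0xB8000
COLUMNS = 80  # assumption: 80x25 color text mode
ROWS = 25

def text_cell(address):
    """Return (row, column) of the character cell written at an even address."""
    offset = address - TEXT_BASE
    if not 0 <= offset < COLUMNS * ROWS * 2:
        raise ValueError("address is outside the text-mode frame buffer")
    if offset % 2:
        # Odd addresses hold the attribute (color) byte of the cell.
        raise ValueError("odd address: attribute byte, not a character")
    return divmod(offset // 2, COLUMNS)

print(text_cell(0xB8000))  # first cell
print(text_cell(0xB80A0))  # 160 bytes in: start of the second row
print(text_cell(0xB8F9E))  # last even address in the range
```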
I/O mapped devices don't collide with memory, as they use a different address space, with different instructions to read and write values to addresses (ports). These devices cannot be addressed using machine code instructions that target memory.
Memory mapped devices share the address space with RAM. Depending on the system configuration, memory mapped registers can be present all the time at some addresses, preventing the system from using those addresses for RAM, or memory mapped devices may "shadow" memory at times, allowing the program to change the I/O configuration to choose whether a certain memory region is decoded as in use by a device or as regular RAM. This is, for example, what the Commodore 64 does to let the user have 64KB of RAM while still allowing access to device registers: it temporarily disables access to the RAM that is "behind" the device currently being accessed at that same address.
At the hardware level, what is happening is that there are two different signals: MREQ and IOREQ. The first one is asserted on every memory instruction, the second one on every I/O instruction. So this code...
MOV BX,1234h ;DX cannot be used as a memory base register, so BX holds the memory address
MOV DX,1234h
MOV AL,[BX] ;reads memory address 1234h (memory address space)
IN AL,DX ;reads I/O port 1234h (I/O address space)
Both put the value 1234h on the CPU address bus, and both assert the RD pin to indicate a read, but the first one will assert MREQ to indicate that the address belongs to the memory address space, and the second one will assert IOREQ to indicate that it belongs to the I/O address space. The I/O device at port 1234h is connected to the system bus so that it is enabled only if the address is 1234h, RD is asserted and IOREQ is asserted. This way, it cannot collide with a RAM chip addressed at 1234h, because the latter will be enabled only if MREQ is asserted (the CPU ensures that IOREQ and MREQ cannot be asserted at the same time).
These two address spaces don't exist in all CPUs. In fact, the majority of them don't have a separate I/O space, and therefore they have to memory map all their devices.
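The selection logic can be modelled in a few lines. This is only a toy sketch: the `Bus` class, its dictionaries, and the 0xFF value returned for unpopulated addresses are assumptions made for illustration, not a description of any particular CPU.

```python
class Bus:
    """Toy system bus with separate memory and I/O address spaces."""

    def __init__(self):
        self.memory = {}  # address -> byte, enabled only when MREQ is asserted
        self.ports = {}   # address -> byte, enabled only when IOREQ is asserted

    def read(self, address, mreq=False, ioreq=False):
        # The CPU never asserts both request lines, or neither, during a read.
        if mreq == ioreq:
            raise ValueError("exactly one of MREQ and IOREQ must be asserted")
        space = self.memory if mreq else self.ports
        return space.get(address, 0xFF)  # assumption: idle bus reads as 0xFF

bus = Bus()
bus.memory[0x1234] = 0xAA  # RAM chip responding at 1234h
bus.ports[0x1234] = 0x55   # I/O device responding at port 1234h

from_memory = bus.read(0x1234, mreq=True)   # like the memory read via MOV
from_port = bus.read(0x1234, ioreq=True)    # like IN AL,DX
print(hex(from_memory), hex(from_port))
```

The same address value yields two different results because the request line, not the address, chooses which device is enabled.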
I have a simple theoretical question. The DMA controllers I know usually have half-full or full interrupts. If I want to use DMA for data transfer from a peripheral, how can I ensure I got all the data, since the data may not end on a DMA transfer boundary?
For example, a serial port might send 5 bytes. I would get an interrupt for the first 4 combined together (assuming the DMA size is 4), but nothing for the 5th one. What method do people usually use to solve such a problem?
My best approach is this:
Set up a DMA memory region, let's say from address 0x2 to 0x1000.
The serial device writes bytes into this region, treating it as a circular buffer.
Each time the serial device writes, it updates its "write pointer" and saves it in bytes 0x0 and 0x1.
The PC host can DMA the write pointer and compare it with its own read pointer. The read pointer can be kept on the PC host and need not involve DMA at all. The PC then knows how much memory to read, and it also knows if there has been an underflow or overflow.
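The steps above can be sketched as follows. The `RingReader` class and `device_write` helper are hypothetical names, the buffer is an 8-byte toy rather than the 0x2 to 0x1000 region, and the write pointer is passed in directly instead of being fetched by DMA.

```python
class RingReader:
    """Host-side reader that tracks its own read pointer into a circular buffer."""

    def __init__(self, size):
        self.size = size
        self.read_ptr = 0

    def available(self, write_ptr):
        """Bytes written by the device that the host has not consumed yet."""
        return (write_ptr - self.read_ptr) % self.size

    def consume(self, buffer, write_ptr):
        """Copy out everything up to the device's write pointer."""
        count = self.available(write_ptr)
        data = bytes(buffer[(self.read_ptr + i) % self.size] for i in range(count))
        self.read_ptr = write_ptr
        return data

def device_write(buffer, write_ptr, payload):
    """Simulate the device: store bytes circularly and return the new write pointer."""
    for byte in payload:
        buffer[write_ptr] = byte
        write_ptr = (write_ptr + 1) % len(buffer)
    return write_ptr

ring = bytearray(8)
reader = RingReader(len(ring))

wp = device_write(ring, 0, b"hello")    # 5 bytes: not a multiple of a 4-byte transfer
first = reader.consume(ring, wp)
wp = device_write(ring, wp, b"abcdef")  # wraps past the end of the buffer
second = reader.consume(ring, wp)
print(first, second)
```

Note that this sketch treats equal pointers as "empty", so it cannot tell a completely full buffer from an empty one; detecting overflow reliably would need extra state such as free-running counters, which is an addition beyond what the answer describes.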
This should be a good starting point for what you want.
I've been going through many technical documents on packet capture/processing and host stacks, trying to understand it all. There are a few areas where I'm troubled; hopefully someone can help.
Assuming you're running tcpdump:
After a packet gets copied from a NIC's ring buffer (physical NIC memory, right?), does it immediately get stored into an mbuf? Does BPF then get a copy of the packet from the mbuf, which is stored in the BPF buffer, so that there are two copies in memory at the same time? I'm trying to understand the exact process.
Or is it more like: the packet gets copied from the NIC to both the mbuf (for host stack processing) and to the BPF pseudo-simultaneously?
Once a packet goes through host stack processing, the ip/tcp input functions take a pointer to the mbuf holding the packet (i.e. packets are stored in mbufs). If the packet is not addressed to this system, say because it was received while monitoring traffic via a hub or a SPAN/monitor port, it is discarded and never makes its way up the host stack.
I seem to have come across diagrams which show the NIC ring buffer (RX/TX) in a kernel "box", separating it from userspace, which makes me wonder whether a ring buffer is actually allocated in system memory rather than in the physical memory on the NIC.
Assuming that a ring buffer refers to the NIC's physical memory, is it correct that the device driver determines the size of the NIC ring buffer, setting physical limitations aside? e.g. can I shrink the buffer by modifying the driver?
Thanks!
The ETHER_BPF_MTAP macro calls bpf_mtap(), which expects the packet in mbuf format, and bpf copies the data from this mbuf into its internal buffer.
But mbufs can use external storage, so there may or may not be a copy from the NIC ring buffer to the mbuf. An mbuf can either contain the packet data itself or serve just as a header that references the receive buffer.
Also, current NICs use their small (128/96/... Kb) onboard memory only as a FIFO and immediately transfer all data to ring buffers in main memory, so you really can adjust the buffer size in the device driver.