Is the communication between a CPU and peripherals middleman'd by an MMU?

I'm aware that in most modern architectures the CPU sends read and write requests to a memory management unit rather than directly to the RAM controller.
If other peripherals are also addressed, that is to say, read from and written to using an address bus, then are these addresses also accessed through a virtual address? In other words, to speak to a USB drive etc. does the CPU send the target virtual address to an MMU which translates it to a physical one? Or does it simply write to a physical address with no intermediary device?

I can't speak globally; there may be exceptions. But the general idea is that the CPU's memory interface goes completely through the MMU (and completely through a cache, or layers of caches).
For peripherals to really work (otherwise a status register would be cached on the first read and subsequent reads would get the cached version, not the real version), you have to set the peripheral's address space to be not cached. So, for example, on an ARM, and no doubt on others where you have separate I- and D-cache enables, you can turn on the I-cache without the MMU, but to turn on the D-cache and not have this peripheral problem you need the MMU on, with the peripheral space in the tables and marked as not cached.
It is up to the software designers to decide whether they want the virtual addresses for the peripherals to match the physical ones or to move the peripherals elsewhere; both have pros and cons.
It is certainly possible to design a chip/system where an address space is automatically not sent through the MMU or cache, but that can make the busses ugly, and/or the chip may have separate busses for peripherals and RAM, or other solutions, so the above is not necessarily a universal answer; but for, say, an ARM, and I would assume an x86, that is how it works. On the ARMs I am familiar with, the MMU and L1 cache are in the core, the L2 is outside, and the L3 is beyond that if you have one. The L2 (if you have one, from ARM) is literally between the core and the world, but the AXI/AMBA bus has cacheable settings, so each transaction may or may not be marked as cacheable; if not cacheable, it passes right through the L2 logic. If the MMU is enabled, it determines that on a per-transaction basis.
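As a rough illustration of "peripheral space in the tables and marked as not cached", here is a hedged C sketch of building an ARMv7-A short-descriptor section entry for a 1 MiB device region. The UART address and the exact attribute choices are assumptions for illustration; check the ARM ARM for your core before relying on them.

```c
#include <stdint.h>

/* Sketch: one ARMv7-A short-descriptor *section* entry (1 MiB).
 * Bit layout: [31:20] section base address, [14:12] TEX,
 * [11:10] AP[1:0], [3] C, [2] B, [1:0] = 0b10 for a section.
 * Device (non-cacheable) memory is typically TEX=000, C=0, B=1. */
static uint32_t section_entry_device(uint32_t phys_addr_1mb_aligned)
{
    uint32_t entry = phys_addr_1mb_aligned & 0xFFF00000u; /* section base */
    entry |= (0u << 12);   /* TEX = 000                                   */
    entry |= (0u << 3);    /* C = 0 -> not cacheable                      */
    entry |= (1u << 2);    /* B = 1 -> shareable Device memory            */
    entry |= (3u << 10);   /* AP = 0b11 -> full access (simplified)       */
    entry |= 0x2u;         /* descriptor type: section                    */
    return entry;
}

/* Usage sketch: map a hypothetical UART at physical 0x10100000
 * one-to-one (virtual == physical) in a 4096-entry first-level table. */
void map_uart(uint32_t *ttb)
{
    const uint32_t uart_phys = 0x10100000u;   /* invented address */
    ttb[uart_phys >> 20] = section_entry_device(uart_phys);
}
```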

Actually, the virtual-to-physical translation is done inside the CPU on almost all modern (and at this point, even most old) architectures. Even the DRAM and PCIe controllers (previously in the Northbridge) have made it onto the CPU. So a modern CPU doesn't even talk to an external RAM controller; it talks to DRAM directly through its integrated memory controller.
If other peripherals are also addressed, that is to say, read from and written to using an address bus, then are these addresses also accessed through a virtual address?
At least in the case of x86, yes. You can map your memory-mapped I/O ranges anywhere in virtual address space. Good thing too: otherwise the virtual address space would necessarily mirror the weird physical layout, with "holes" that you couldn't map real RAM into, because then you'd have two things in the same place.
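To make "map it anywhere, just not cached" concrete, here is a hedged sketch of composing an x86-64 page-table entry for a 4 KiB MMIO page with the PCD (cache disable) bit set. The physical address is an invented example, and a real kernel would additionally involve PAT/MTRR settings.

```c
#include <stdint.h>

/* Sketch: x86-64 4 KiB PTE for an MMIO page.
 * Bit 0 = Present, bit 1 = Writable, bit 3 = PWT, bit 4 = PCD,
 * bits 51:12 = physical frame address.                         */
#define PTE_PRESENT  (1ull << 0)
#define PTE_WRITABLE (1ull << 1)
#define PTE_PWT      (1ull << 3)
#define PTE_PCD      (1ull << 4)

static uint64_t mmio_pte(uint64_t mmio_phys)
{
    uint64_t frame = mmio_phys & 0x000FFFFFFFFFF000ull;
    return frame | PTE_PRESENT | PTE_WRITABLE | PTE_PWT | PTE_PCD;
}
/* e.g. mmio_pte(0xFEC00000) can back any free virtual page,
 * regardless of where the MMIO region sits in physical space. */
```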

Related

What is the memory map section in RISCV

I'm familiar with the MIPS architecture, and I know that MIPS has memory segments such as kseg0 and kseg1, which determine whether a segment is cached or mapped. For example, you should place I/O devices (like a UART) in the uncached segment.
But I couldn't find anything similar in the RISC-V architecture. So how does a RISC-V OS know whether an address should be mapped or not?
By the way: I know the value in the satp CSR describes the translation mode. When the OS is running, the value must be set to something other than "Bare" (MMU disabled) so that the OS can support virtual memory. So when the CPU accesses the UART address, is the value in satp still not "Bare"? But shouldn't it be "Bare"?
RISC-V is a family of instruction sets, ranging from MCU-style processors that have no memory mapping and no memory protection mechanisms (Physical Memory Protection is optional) up to application-class processors with full virtual memory support.
From your question, I assume you are talking about processors that support User and Supervisor level ISA, as documented in the RISC-V privileged spec.
It sounds like you want a spec describing which physical addresses are cacheable. Looking at the list of CSRs, I believe this information is not in the CSRs because it is platform specific. In systems I've worked with, it is either hard-coded in platform drivers or passed via device-tree.
For Linux, the device-tree entries are not RISC-V specific: there are device tree entries specifying the physical address range of memory. Additionally, each I/O device would have a device tree entry specifying its physical address range.
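For illustration, here is a hedged sketch of how a Linux platform driver typically picks up that physical range from the device tree and maps it; the compatible string and driver names are invented, and nothing here is RISC-V specific.

```c
/* Sketch of a Linux platform driver probe: the physical address range
 * comes from the device tree ("reg" property), not from any RISC-V CSR. */
#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/of.h>
#include <linux/io.h>

static int demo_uart_probe(struct platform_device *pdev)
{
    struct resource *res;
    void __iomem *regs;

    /* Physical base/size as declared in the DT node's "reg" property. */
    res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
    regs = devm_ioremap_resource(&pdev->dev, res);   /* uncached mapping */
    if (IS_ERR(regs))
        return PTR_ERR(regs);

    dev_info(&pdev->dev, "mapped UART at %pR\n", res);
    return 0;
}

static const struct of_device_id demo_uart_of_match[] = {
    { .compatible = "demo,uart" },   /* hypothetical compatible string */
    { }
};

static struct platform_driver demo_uart_driver = {
    .probe  = demo_uart_probe,
    .driver = { .name = "demo-uart", .of_match_table = demo_uart_of_match },
};
module_platform_driver(demo_uart_driver);
MODULE_LICENSE("GPL");
```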
You can read the RISC-V privileged spec (The RISC-V Instruction Set Manual Volume II: Privileged Architecture 3.5 Physical Memory Attributes).
"For RISC-V, we separate out specification and checking of PMAs into a separate hardware structure, the PMA checker. In many cases, the attributes are known at system design time for each physical address region, and can be hardwired into the PMA checker. Where the attributes are run-time configurable, platform-specific memory-mapped control registers can be provided to specify these attributes at a granularity appropriate to each region on the platform (e.g., for an on-chip SRAM that can be flexibly divided between cacheable and uncacheable uses)"
I think if you want to distinguish cacheable from non-cacheable regions in RISC-V, you need to design a PMA unit that lets the MMU check the memory attributes.
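As a toy model of what a hardwired PMA checker effectively computes, a lookup over a fixed platform-specific table might look like the sketch below; all the addresses are invented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a hardwired PMA table: which physical regions are
 * cacheable. A real PMA checker is hardware; addresses are invented. */
struct pma_region { uint64_t base, size; bool cacheable; };

static const struct pma_region pma_table[] = {
    { 0x80000000ull, 0x40000000ull, true  },  /* DRAM               */
    { 0x10000000ull, 0x00001000ull, false },  /* UART MMIO          */
    { 0x02000000ull, 0x00010000ull, false },  /* other MMIO block   */
};

static bool pma_is_cacheable(uint64_t paddr)
{
    for (size_t i = 0; i < sizeof pma_table / sizeof pma_table[0]; i++)
        if (paddr - pma_table[i].base < pma_table[i].size)
            return pma_table[i].cacheable;
    return false;   /* unknown regions treated as uncacheable here */
}
```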

Purpose of logical address?

What is the purpose of a logical address? Why should the CPU generate logical addresses? It could directly use the relocation (base) register and the limit register to execute a process. Why should the MMU make a mapping between logical and physical addresses?
Why?
Because this gives the Operating System a way to securely manage memory.
Why is secure memory management necessary?
Imagine if there was no logical addressing. All processes were given direct access to physical addresses. A multi-process OS runs several different programs simultaneously. Imagine you are editing an important letter in MS Word while listening to music on YouTube on a very recently released browser. The browser is buggy and writes bogus values to a range of physical addresses that were being used by the Word program to store the edits of your letter. All of that information is corrupt!
Highly undesirable situation.
How can the OS prevent this?
Maintain a mapping of physical addresses allocated to each process and make sure one process cannot access the memory allocated to another process!
Clearly, having actual physical addresses exposed to programs is not a good idea. Since memory is then handled totally by the OS, we need an abstraction that we can provide to processes with a simple API that would make it seem that the process was dealing with physical memory, but all allocations would actually be handled by the OS.
Here comes virtual memory!
Logical addresses exist so that our physical memory can be managed securely.
A logical address is the reference a program uses to access a physical memory location.
A logical address is generated so that a user program never directly accesses physical memory and so that a process does not occupy memory acquired by another process, thereby corrupting that process.
A logical address gives us assurance that a new process will not occupy memory space already occupied by any other process.
In execution-time binding, the MMU maps logical addresses to physical addresses; in this type of binding, the logical address is specifically referred to as a virtual address.
On its own the address has no meaning: it is there to give the user the illusion of a large memory for its processes. The addresses gain meaning only when the mapping occurs and they are translated into real addresses present in physical memory.
I would also like to mention that the base register and limit register are loaded by executing privileged instructions; privileged instructions execute only in kernel mode, and only the operating system has access to kernel mode, so a user program cannot load these registers directly.
So the CPU first generates the logical address, and then the MMU, set up by the operating system, takes over and does the mapping.
The binding of a process's instructions and data to memory is done at compile time, load time, or execution time. Logical addresses come into the picture only if the process is moved during its execution from one memory segment to another. The logical address is the address of the process before any relocation happens (memory address = 10). Once relocation has happened (the process moves to memory address = 100), the memory management unit (MMU) redirects the CPU to the correct memory location by keeping the difference between the relocated address and the original address (100 - 10 = 90) in the relocation register (the base register acts as the relocation register here). When the CPU has to access data at memory address 10, the MMU adds 90 (the value in the relocation register) to the address and fetches the data from memory address 100.
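A tiny C simulation of that base/limit (relocation-register) scheme, reusing the numbers from the example (relocation register 90, so logical address 10 becomes physical address 100); the limit value is an arbitrary choice.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy MMU: execution-time binding with a relocation (base) register
 * and a limit register, using the numbers from the example above.  */
struct mmu {
    unsigned base;   /* relocation register */
    unsigned limit;  /* size of the process's logical address space */
};

static unsigned translate(const struct mmu *m, unsigned logical)
{
    if (logical >= m->limit) {               /* protection check */
        fprintf(stderr, "trap: address %u out of bounds\n", logical);
        exit(EXIT_FAILURE);
    }
    return m->base + logical;                /* relocation */
}

int main(void)
{
    struct mmu m = { .base = 90, .limit = 50 };
    printf("logical 10 -> physical %u\n", translate(&m, 10)); /* prints 100 */
    return 0;
}
```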

How does DMA work with PCI Express devices?

Let's suppose a CPU wants to make a DMA read transfer from a PCI Express device. Communication with PCI Express devices is carried by transaction layer packets (TLPs). Theoretically, the maximum payload size of a TLP is 1024 doublewords (4 KiB). So how does a DMA controller act when the CPU gives a PCI Express device a DMA read command for 4 megabytes?
In the PCIe enumeration phase, the maximum allowed payload size is determined (it can be lower than the device's max payload size: e.g. an intermediate PCIe switch may have a lower maximum payload size).
Most PCIe devices are DMA masters, so the driver transfers the command to the device. The device will then send several write packets to transmit the 4 MiB in however many maximum-sized TLP chunks are needed.
Edit 1 in reply to comment 1:
A PCI-based bus has no "DMA controller" in the form of a chip or a sub-circuit in the chipset. Every device on the bus can become a bus master. The main memory is always a slave.
Let's assume you have built your own PCIe device card, which can act as a PCI master, and your program (running on the CPU) wants to send data from that card to main memory (4 MiB).
The device driver knows the memory mapping for that particular memory region from the operating system (some keywords: memory-mapped I/O, PCI bus enumeration, PCI BARs).
The driver transfers the command (write), source address, destination address and length to the device. This can be done by sending bytes to a special address inside a pre-defined BAR or by writing into the PCI config space. The DMA master on the card checks these special regions for new tasks (scatter-gather lists). If so, these tasks get enqueued.
Now the DMA master knows where to send the data and how much. It will read the data from local memory, wrap it into TLPs of the maximum payload size (e.g. 512 bytes; the maximum payload size on the path device <---> main memory is known from enumeration), and send them to the destination address. The PCI address-based routing mechanisms direct these TLPs to the main memory.
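To illustrate the chunking arithmetic the device's DMA engine performs, here is a hedged C sketch; issue_mem_write_tlp() is a made-up stand-in for the device's TLP transmit logic, and 512 bytes is just one possible negotiated maximum payload size.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stub standing in for the device's TLP transmit logic (hypothetical). */
static void issue_mem_write_tlp(uint64_t bus_addr, const void *src, size_t len)
{
    (void)src;
    printf("MemWr TLP: addr=0x%llx len=%zu\n", (unsigned long long)bus_addr, len);
}

/* Slice a large transfer into memory-write TLPs of the negotiated
 * maximum payload size, issued back to back.                        */
static void dma_write(uint64_t dst_bus_addr, const uint8_t *src,
                      size_t total, size_t max_payload /* e.g. 512 */)
{
    while (total) {
        size_t chunk = total < max_payload ? total : max_payload;
        issue_mem_write_tlp(dst_bus_addr, src, chunk);
        dst_bus_addr += chunk;
        src          += chunk;
        total        -= chunk;
    }
    /* 4 MiB at 512 B per TLP -> 8192 TLPs. */
}
```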
@Paebbels has already explained most of it. In PCI/PCI-e, "DMA" is implemented in terms of bus mastering, and it's the bus-master-capable peripheral devices that hold the reins. The peripheral device has the memory read/write transactions at its disposal, and it's up to the peripheral device what granularity and ordering of the writes (or reads) it will use. I.e. the precise implementation details are hardware-specific to the peripheral device, and the corresponding software driver running on the host CPU must know how to operate the particular peripheral device to provoke the desired DMA traffic in it.
Regarding the "memory management aspect", let me refer my distinguished audience to two chapters of a neat book by Jon Corbet, on exactly this topic in Linux. Memory management bordering on DMA, under the hood of the OS kernel. Linux and its source code and documentation are generally a good place (open source) to start looking for "how things work under the hood". I'll try to summarize the topic a bit.
First of all, please note that DMA access to the host's RAM (from a peripheral PCI device) is a different matter than PCI MMIO = where the peripheral device possesses a private bank of RAM of its very own and wants to make that available to the host system via an MMIO BAR. This is different from DMA, a different mechanism (although not quite), or maybe "the opposite perspective" if you will... suppose that the difference between a host and a peripheral device on the PCI/PCI-e bus is not great, and the host bridge / root complex merely has a somewhat special role in the tree topology, bus initialization and whatnot :-) I hope I've confused you enough.
The computer system containing a PCI(-e) bus tree and a modern host CPU actually works with several "address spaces". You've probably heard about the CPU's physical address space (spoken at the "front side bus" among the CPU cores, the RAM controller and the PCI root bridge) vs. the "virtual address spaces", managed by the OS with the help of some HW support on part of the CPU for individual user-space processes (including one such virtual space for the kernel itself, not identical with the physical address space). Those two address spaces, the physical one and the manifold virtual, occur irrespective of the PCI(-e) bus. And, guess what: the PCI(-e) bus has its own address space, called the "bus space". Note that there's also the so called "PCI configuration space" = yet another parallel address space. Let's abstract from the PCI config space for now, as access to it is indirect and complicated anyway = does not "get in the way" of our topic here.
So we have three different address spaces (or categories): the physical address space, the virtual spaces, and the PCI(-e) bus space. These need to be "mapped" to each other. Addresses need to be translated. The virtual memory management subsystem in the kernel uses its page tables and some x86 hardware magic (keyword: MMU) to do its job: translate from virtual to physical addresses. When speaking to PCI(-e) devices, or rather their "memory mapped IO", or when using DMA, addresses need to be translated between the CPU physical address space and the PCI(-e) bus space. In the hardware, in bus transactions, it is the job of the PCI(-e) root complex to handle the payload traffic, including address translation. And on the software side, the kernel provides functions (as part of its internal API) to drivers to be able to translate addresses where needed. As much as the software is only concerned about its respective virtual address space, when talking to PCI(-e) peripheral devices, it needs to program their "base address registers" for DMA with addresses from the "bus space", as that's where the PCI(-e) peripherals live. The peripherals are not gonna play the "game of multiple address translations" actively with us... It's up to the software, or specifically the OS, to make the PCI(-e) bus space allocations a part of the host CPU's physical address space, and to make the host physical space accessible to the PCI devices. (Although not a typical scenario, a host computer can even have multiple PCI(-e) root complexes, hosting multiple trees of the PCI(-e) bus. Their address space allocations must not overlap in the host CPU physical address space.)
There's a shortcut, although not quite: in an x86 PC, the PCI(-e) address space and the host CPU physical address space are one and the same.
Not sure if this is hardwired in the HW (the root complex just doesn't have any specific mapping/translation capability) or if this is how "things happen to be done", in the BIOS/UEFI and in Linux. Suffice to say that this happens to be the case.
But, at the same time, this doesn't make the life of a Linux driver writer any easier. Linux is made to work on various HW platforms; it does have an API for translating addresses, and the use of that API is mandatory when crossing between address spaces.
Maybe interestingly, the API shorthands relevant in the context of PCI(-e) drivers and DMA are "bus_to_virt()" and "virt_to_bus()". Because, to software, what matters is its respective virtual address - so why complicate things for the driver author by forcing him to translate (and keep track of) the virtual, the physical and the bus address spaces, right? There are also shorthands for allocating memory for DMA use: pci_alloc_consistent() and pci_map_single() - and their deallocation counterparts, and several companions - if interested, you really should refer to Jon Corbet's book and further docs (and kernel source code).
So as a driver author, you allocate a piece of RAM for DMA use, you get a pointer of your respective "virtual" flavour (some kernel space), and then you translate that pointer into the PCI "bus" space, which you can then quote to your PCI(-e) peripheral device = "this is where you can upload the input data".
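A hedged Linux-kernel sketch of that flow, using the current DMA API (dma_alloc_coherent) rather than the older pci_* shorthands named above; the BAR register offsets and the device protocol are invented for illustration.

```c
#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/io.h>

/* Hypothetical BAR0 register offsets for an imaginary device. */
#define REG_DMA_ADDR_LO  0x00
#define REG_DMA_ADDR_HI  0x04
#define REG_DMA_LEN      0x08
#define REG_DMA_START    0x0C

static int demo_start_dma(struct pci_dev *pdev, void __iomem *bar0, size_t len)
{
    dma_addr_t bus_addr;   /* address in PCI "bus space" for the device */
    void *cpu_addr;        /* kernel virtual address for the driver     */

    cpu_addr = dma_alloc_coherent(&pdev->dev, len, &bus_addr, GFP_KERNEL);
    if (!cpu_addr)
        return -ENOMEM;

    /* Tell the peripheral where (in bus space) to upload the data. */
    iowrite32(lower_32_bits(bus_addr), bar0 + REG_DMA_ADDR_LO);
    iowrite32(upper_32_bits(bus_addr), bar0 + REG_DMA_ADDR_HI);
    iowrite32((u32)len,                bar0 + REG_DMA_LEN);
    iowrite32(1,                       bar0 + REG_DMA_START);

    /* ... wait for an IRQ / completion, then read cpu_addr;
     *     free later with dma_free_coherent().               */
    return 0;
}
```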
You can then instruct your peripheral to do a DMA transaction into your allocated memory window. The DMA window in RAM can be bigger (and typically is) than the "maximum PCI-e transaction size" - which means, that the peripheral device needs to issue several consecutive transactions to accomplish a transfer of the whole allocated window (which may or may not be required, depending on your application). Exactly how that fragmented transfer is organized, that's specific to your PCI peripheral hardware and your software driver. The peripheral can just use a known integer count of consecutive offsets back to back. Or it can use a linked list. The list can grow dynamically. You can supply the list via some BAR to the peripheral device, or you can use a second DMA window (or subsection of your single window) to construct the linked list in your RAM, and the peripheral PCI device will just run along that chain. This is how scatter-gather DMA works in practical contemporary PCI-e devices.
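One common shape for such a device-walked linked list is a chain of fixed-size descriptors sitting in DMA-able memory; the field names and flag bit below are an invented example of the pattern, not any particular device's format.

```c
#include <stdint.h>

/* Invented scatter-gather descriptor layout: each entry names one
 * chunk of the DMA window, plus a bus-space pointer to the next one. */
struct sg_desc {
    uint64_t buf_bus_addr;   /* where this chunk lives, in bus space  */
    uint32_t length;         /* bytes in this chunk                   */
    uint32_t flags;          /* bit 0 = last descriptor in the chain  */
    uint64_t next_bus_addr;  /* bus address of the next descriptor    */
} __attribute__((packed));

#define SG_FLAG_LAST 0x1u

/* The driver fills one descriptor; the chain itself lives in a
 * coherent DMA buffer, and the bus address of the first descriptor
 * is written into a device register so the peripheral can walk it. */
static void sg_fill(struct sg_desc *d, uint64_t buf, uint32_t len,
                    uint64_t next, int last)
{
    d->buf_bus_addr  = buf;
    d->length        = len;
    d->next_bus_addr = next;
    d->flags         = last ? SG_FLAG_LAST : 0;
}
```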
The peripheral device can signal back completion or some other events using IRQ. In general, the operation of a peripheral device involving DMA will be a mixture of direct polling access to BAR's, DMA transfers and IRQ signaling.
As you may have inferred, when doing DMA, the peripheral device need NOT necessarily possess a private buffer on board, that would be as big as your DMA window allocation in the host RAM. Quite the contrary - the peripheral can easily "stream" the data from (or to) an internal register that's one word long (32b/64b), or a buffer worth a single "PCI-e payload size", if the application is suited for that arrangement. Or a minuscule double buffer or some such. Or the peripheral can indeed have a humongous private RAM to launch DMA against - and such a private RAM need not be mapped to a BAR (!) if direct MMIO access from the bus is not required/desired.
Note that a peripheral can launch DMA to another peripheral's MMIO BAR just as easily, as it can DMA-transfer data to/from the host RAM. I.e., given a PCI bus, two peripheral devices can actually send data directly to each other, without using bandwidth on the host's "front side bus" (or whatever it is nowadays, north of the PCI root complex: quickpath, torus, you name it).
During PCI bus initialization, the BIOS/UEFI or the OS allocates windows of bus address space (and physical address space) to PCI bus segments and peripherals - to satisfy the BARs' hunger for address space, while keeping the allocations non-overlapping systemwide. Individual PCI bridges (including the host bridge / root complex) get configured to "decode" their respective allocated spaces, but "remain in high impedance" (silent) for addresses that are not their own. Feel free to google on your own on "positive decode" vs. "subtractive decode", where one particular path down the PCI(-e) bus can be turned into an "address sink of last resort", maybe just for the range of the legacy ISA etc.
Another tangential note maybe: if you've never programmed simple MMIO in a driver, i.e. used BAR's offered by PCI devices, know ye that the relevant keyword (API call) is ioremap() (and its counterpart iounmap, upon driver unload). This is how you make your BAR accessible to memory-style access in your living driver.
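A minimal hedged sketch of that ioremap() pattern for a PCI BAR; the BAR index and register offset are invented.

```c
#include <linux/pci.h>
#include <linux/io.h>

#define DEMO_STATUS_REG 0x10   /* invented register offset within BAR0 */

static void __iomem *demo_map_bar0(struct pci_dev *pdev)
{
    /* CPU physical address and length of BAR0, assigned at enumeration. */
    resource_size_t start = pci_resource_start(pdev, 0);
    resource_size_t len   = pci_resource_len(pdev, 0);

    void __iomem *bar0 = ioremap(start, len);   /* uncached MMIO mapping */
    if (bar0)
        pr_info("demo: status = 0x%08x\n", readl(bar0 + DEMO_STATUS_REG));
    return bar0;                                /* iounmap() on unload   */
}
```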
And: you can make your mapped MMIO bar, or your DMA window, available directly to a user-space process, using a call to mmap(). Thus, your user-space process can then access that memory window directly, without having to go through the expensive and indirect rabbit hole of the ioctl().
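And from the user-space side, a hedged sketch of mmap()-ing a BAR that the kernel already exposes through sysfs; the device path is a placeholder for your own device's address.

```c
/* User-space sketch: map BAR0 of a PCI device exposed via sysfs.
 * The sysfs path is a placeholder; substitute your device's address. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:03:00.0/resource0";
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 4096;                        /* assume a 4 KiB BAR */
    volatile uint32_t *bar = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    printf("register[0] = 0x%08x\n", bar[0]); /* direct MMIO read */

    munmap((void *)bar, len);
    close(fd);
    return 0;
}
```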
Umm. Modulo PCI bus latencies and bandwidth, the cacheable attribute etc.
I feel that this is where I'm getting too deep down under the hood, and running out of steam... corrections welcome.

How are 'Input/Output' ports mapped into memory?

I have been trying to understand I/O ports and their mapping into the memory and I/O address spaces. I read about 'memory-mapped I/O' and was wondering how this is accomplished by the OS/hardware. Does the OS/hardware use some kind of table to map the address specified in an instruction to the respective port?
Implementations differ in many ways. But the basic idea is that when a read or write occurs for a memory address, the microprocessor outputs the address on its bus. Hardware (called an 'address decoder') detects that the address is for a particular memory-mapped I/O device and enables that device as the target of the operation.
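Here is a small software model of what such an address decoder does in hardware; the address ranges are invented.

```c
#include <stdint.h>
#include <stdio.h>

/* Software model of an address decoder: pick a target for each bus
 * address. Real decoders are combinational logic; ranges are invented. */
enum target { TARGET_RAM, TARGET_UART, TARGET_GPIO, TARGET_NONE };

static enum target decode(uint32_t addr)
{
    if (addr < 0x20000000u)                        return TARGET_RAM;
    if (addr >= 0x40001000u && addr < 0x40002000u) return TARGET_UART;
    if (addr >= 0x40002000u && addr < 0x40003000u) return TARGET_GPIO;
    return TARGET_NONE;   /* nobody claims the address */
}

int main(void)
{
    printf("0x40001000 -> target %d (UART)\n", decode(0x40001000u));
    return 0;
}
```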
Typically, the OS doesn't do anything special. On some platforms, the BIOS or operating system may have to configure certain parameters for the hardware to work properly.
For example, the range may have to be set as uncacheable to prevent the caching logic from reordering operations to devices that care about the order in which things happen. (Imagine if one write tells the hardware what operation to do and another write tells the hardware to start. Reordering those could be disastrous.)
On some platforms, the operating system or BIOS may have to set certain memory-mapped I/O ranges as 'slow' by adding wait states. This is because the hardware that's the target of the operation may not be as fast as the system memory is.
Some devices may allow the operating system to choose where in memory to map the device. This is typical of newer plug-and-play devices on the PC platform.
In some devices, such as microcontrollers, this is all done entirely inside a single chip. A write to a particular address is routed in hardware to a particular port or register. This can include general-purpose I/O registers which interface to pins on the chip.
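On a typical microcontroller this boils down to writing through a pointer at a documented register address; the base address and bit position below are invented, so take the real ones from your chip's datasheet.

```c
#include <stdint.h>

/* Invented addresses: on a real MCU these come from the datasheet's
 * memory map, and the write is routed in hardware to the GPIO block. */
#define GPIO_BASE   0x40020000u
#define GPIO_ODR    (*(volatile uint32_t *)(GPIO_BASE + 0x14))

void led_on(void)
{
    GPIO_ODR |= (1u << 5);   /* drive pin 5 high: just a memory write */
}
```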

Confused over memory mapping

I've recently started getting into low level stuff and looking into bootloaders and operating systems etc...
As I understand it, for ARM processors at least, peripherals are initialized by the bootloader and then they are mapped into the physical memory space. From there, code can access the peripherals by simply writing values to the memory addresses mapped to the peripherals' registers. Later, if the chip has an MMU, it can be used to further remap them into virtual memory spaces. Am I right?
What I don't understand are (assuming what I have said above is correct):
How does the bootloader initialize the peripherals if they haven't been mapped to an address space yet?
With virtual memory mapping, there are tables that tell the MMU where to map what. But what determines where peripherals are mapped in physical memory?
When a device boots, the MMU is turned off and you will typically be running in supervisor mode. This means that any addresses provided are physical addresses.
Each ARM SoC (System on Chip) has a memory map. The correspondence of addresses to devices is determined by which physical data and address lines are connected to which parts of the processor. All this information can be found in the technical reference manual. For OMAP4 chips this can be found here.
There are several ways to connect off-chip devices. One is using the GPMC. Here you will need to specify the address in the GPMC that you want to use on the chip.
When the MMU is then turned on, these addresses may change depending on how the MMU is programmed. Typically direct access to hardware will also only be available in kernel mode.
Though this is an old question, I thought of answering it as it might help others like me trying to get sufficient answers from Stack Overflow.
Your explanation is almost correct, but I want to add a little explanation on this one:
peripherals are initialized by the bootloader and then they are mapped into the physical memory space
On-chip peripherals already have a predefined physical address space. For other externally mapped I/O peripherals (like PCIe), we need to configure a physical address space, but the range it can occupy is still predefined. They cannot be configured at an arbitrary address space.
Now to your questions, here are my answers..
How does the bootloader initialize the peripherals if they haven't been mapped to an address space yet?
As I mentioned above, all (on-chip) peripherals have a predefined physical address space (usually listed in the memory map chapter of the processor reference manual). So bootloaders (assuming the MMU is off) can access them directly, as in the sketch below.
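A bootloader can talk to such a peripheral with nothing more than a volatile pointer to its documented physical address; here is a minimal bare-metal sketch, with an invented UART base address and register layout.

```c
#include <stdint.h>

/* Bare-metal sketch, MMU off: physical addresses are used directly.
 * The UART base and register offsets are invented; take yours from
 * the SoC's memory map chapter.                                      */
#define UART_BASE  0x10100000u
#define UART_DR    (*(volatile uint32_t *)(UART_BASE + 0x00)) /* data  */
#define UART_FR    (*(volatile uint32_t *)(UART_BASE + 0x18)) /* flags */
#define UART_TXFF  (1u << 5)                                  /* TX FIFO full */

void uart_putc(char c)
{
    while (UART_FR & UART_TXFF)   /* wait until the TX FIFO has room */
        ;
    UART_DR = (uint32_t)c;        /* plain store to a physical address */
}
```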
With virtual memory mapping, there are tables that tell the MMU where to map what. But what determines where peripherals are mapped in physical memory?
With virtual memory mapping, there are page tables (created and stored in physical DRAM by the kernel) that tell the MMU how to map virtual addresses to physical addresses. In a Linux kernel with 1 GiB of kernel virtual space (say, kernel virtual addresses from 0xc0000000 to 0xffffffff), on-chip peripherals need a VM range from within that kernel VM space (so that the kernel, and only the kernel, can access them); the page tables are set up to map each peripheral's virtual address to its actual physical address (the one defined in the reference manual).
You can't remap peripherals in an ARM processor; all peripheral devices correspond to fixed positions in the memory map. Even registers are mapped to internal RAM locations that have permanently fixed positions. The only things you can remap are memory devices like SRAM, FLASH, etc., via the FSMC or a similar core feature. You can, however, remap a memory-mapped add-on custom peripheral that is not part of the core itself, let's say a hard disk controller for instance, but what is inside the ARM core is fixed.
A good start is to take a look at processor datasheets at company sites like Philips and ST, or ARM architecture itself at www.arm.com.

Resources