Is the memory snooping possible against the Intel architecture? - memory

I am interested in developing external hardware monitor in Intel architecture.
I want to know if it is possible to snoop the bus between the CPU and DRAM to know which physical address is being read or written.
In other words, can we attach any devices to the motherboard or use other devices such as graphics card to snoop the bus to know which memory area is accessed by the CPU?
The reason I want to use the hardware approach is we don't fully trust the kernel or hypervisor which translate the VA to PA under our threat model.

Related

Is Intel QuickPath Interconnect (QPI) used by processors to access memory?

I have read An Introduction to the Intel® QuickPath Interconnect. The document does not mention that QPI is used by processors to access memory. So I think that processors don't access memory through QPI.
Is my understanding correct?
Intel QuickPath Interconnect (QPI) is not wired to the DRAM DIMMs and as such is not used to access the memory that connected to the CPU integrated memory controller (iMC).
In the paper you linked this picture is present
That shows the connections of a processor, with the QPI signals pictured separately from the memory interface.
A text just before the picture confirm that QPI is not used to access memory
The processor
also typically has one or more integrated memory
controllers. Based on the level of scalability
supported in the processor, it may include an
integrated crossbar router and more than one
Intel® QuickPath Interconnect port.
Furthermore, if you look at a typical datasheet you'll see that the CPU pins for accessing the DIMMs are not the ones used by QPI.
The QPI is however used to access the uncore, the part of the processor that contains the memory controller.
Courtesy of QPI article on Wikipedia
QPI is a fast internal general purpose bus, in addition to giving access to the uncore of the CPU it gives access to other CPUs' uncore.
Due to this link, every resource available in the uncore can potentially be accessed with QPI, including the iMC of a remote CPU.
QPI define a protocol with multiple message classes, two of them are used to read memory using another CPU iMC.
The flow use a stack similar to the usual network stack.
Thus the path to remote memory include a QPI segment but the path to local memory doesn't.
Update
For Xeon E7 v3-18C CPU (designed for multi-socket systems), the Home agent doesn't access the DIMMS directly instead it uses an Intel SMI2 link to access the Intel C102/C104 Scalable Memory Buffer that in turn accesses the DIMMS.
The SMI2 link is faster than the DDR3 and the memory controller implements reliability or interleaving with the DIMMS.
Initially the CPU used a FSB to access the North bridge, this one had the memory controller and was linked to the South bridge (ICH - IO Controller Hub in Intel terminology) through DMI.
Later the FSB was replaced by QPI.
Then the memory controller was moved into the CPU (using its own bus to access memory and QPI to communicate with the CPU).
Later, the North bridge (IOH - IO Hub in Intel terminology) was integrated into the CPU and was used to access the PCH (that now replaces the south bridge) and PCIe was used to access fast devices (like the external graphic controller).
Recently the PCH has been integrated into the CPU as well that now exposes only PCIe, DIMMs pins, SATAexpress and any other common internal bus.
As a rule of thumb the buses used by the processors are:
To other CPUs - QPI
To IOH - QPI (if IOH present)
To the uncore - QPI
To DIMMs - Pins as the DRAM technology (DDR3, DDR4, ...) support mandates. For Xeon v2+ Intel uses a fast SMI(2) link to connect to an off-core memory controller (Intel C102/104) that handle the DIMMS and channels based on two configurations.
To PCH - DMI
To devices - PCIe, SATAexpress, I2C, and so on.
Yes, QPI is used to access all remote memory on multi-socket systems, and much of its design and performance is intended to support such access in a reasonable fashion (i.e., with latency and bandwidth not too much worse than local access).
Basically, most x86 multi-socket systems are lightly1 NUMA: every DRAM bank is attached to a the memory controller of a particular socket: this memory is then local memory for that socket, while the remaining memory (attached to some other socket) is remote memory. All access to remote memory goes over the QPI links, and on many systems2 that is fully half of all memory access and more.
So QPI is designed to be low latency and high bandwidth to make such access still perform well. Furthermore, aside from pure memory access, QPI is the link through which the cache coherence between sockets occurs, e.g., notifying the other socket of invalidations, lines which have transitioned into the shared state, etc.
1 That is, the NUMA factor is fairly low, typically less than 2 for latency and bandwidth.
2 E.g., with NUMA interleave mode on, and 4 sockets, 75% of your access is remote.

How does DMA work with PCI Express devices?

Closed. This question is not about programming or software development. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 3 days ago.
The community reviewed whether to reopen this question 3 days ago and left it closed:
Original close reason(s) were not resolved
Improve this question
Let's suppose a CPU wants to make a DMA read transfer from a PCI Express device. Communication to PCI Express devices is provided by transaction layer packets (TLP). Theoretically, the maximum payload size is 1024 doubleword for TLP. So how does a DMA controller act when a CPU gives a DMA read command to PCI Express device in size of 4 megabyte?
In the PCIe enumeration phase, the maximum allowed payload size is determined (it can be lower then the device's max payload size: e.g. a intermediate PCIe switch has a lower max. payload size).
Most PCIe devices are DMA masters, so the driver transfers the command to the device. The device will send several write packets to transmit 4 MiB in xx max sized TLP chunks.
Edit 1 in reply to comment 1:
A PCI based bus has no "DMA Controller" in form of a chip or a sub circuit in the chipset. Every device on the bus can become a bus master. The main memory is always a slave.
Let's assume you have build your own PCIe device card, which can act as an PCI master and your program (running on CPU) wants to send data from that card to main memory (4 MiB).
The device driver knows the memory mapping for that particular memory region from operating system (some keywords: memory mapped I/O, PCI bus enumeration, PCI BARs, ).
The driver transfers the command (write), source-address, destination-address and length to the device. This can be done by sending bytes to a special address inside an pre-defined BAR or by writing into the PCI config space. The DMA master on the cards checks these special regions for new tasks (scatter-gather lists). If so, theses tasks get enqueued.
Now the DMA master knows where to send, how many data. He will read the data from local memory and wrap it into 512 byte TLPs of max payload size (the max payload size on path device <---> main memory is known from enumeration) and send it to the destination address. The PCI address-based routing mechanisms direct these TLPs to the main memory.
#Paebbels has already explained most of it. In PCI/PCI-e, "DMA" is implemented in terms of bus mastering, and it's the bus-master-capable peripheral devices that hold the reins. The peripheral device has the memory read/write transactions at its disposal, and it's up to the peripheral device, what granularity and ordering of the writes (or reads) it will use. I.e. the precise implementation details are hardware-specific to the peripheral device, and the corresponding software driver running on the host CPU must know how to operate the particular peripheral device, to provoke the desired DMA traffic in it.
Regarding the "memory management aspect", let me refer my distinguished audience to two chapters of a neat book by Jon Corbet, on exactly this topic in Linux. Memory management bordering on DMA, under the hood of the OS kernel. Linux and its source code and documentation are generally a good place (open source) to start looking for "how things work under the hood". I'll try to summarize to topic a bit.
First of all, please note that DMA access to the host's RAM (from a peripheral PCI device) is a different matter than PCI MMIO = where the peripheral device possesses a private bank of RAM of its very own, wants to make that available to the host system via a MMIO BAR. This is different from DMA, a different mechanism (although not quite), or maybe "the opposite perspective" if you will... suppose that the difference between a host and a peripheral device on the PCI/PCI-e is not great, and the host bridge / root complex merely has a somewhat special role in the tree topology, bus initialization and whatnot :-) I hope I've confused you enough.
The computer system containing a PCI(-e) bus tree and a modern host CPU actually works with several "address spaces". You've probably heard about the CPU's physical address space (spoken at the "front side bus" among the CPU cores, the RAM controller and the PCI root bridge) vs. the "virtual address spaces", managed by the OS with the help of some HW support on part of the CPU for individual user-space processes (including one such virtual space for the kernel itself, not identical with the physical address space). Those two address spaces, the physical one and the manifold virtual, occur irrespective of the PCI(-e) bus. And, guess what: the PCI(-e) bus has its own address space, called the "bus space". Note that there's also the so called "PCI configuration space" = yet another parallel address space. Let's abstract from the PCI config space for now, as access to it is indirect and complicated anyway = does not "get in the way" of our topic here.
So we have three different address spaces (or categories): the physical address space, the virtual spaces, and the PCI(-e) bus space. These need to be "mapped" to each other. Addresses need to be translated. The virtual memory management subsystem in the kernel uses its page tables and some x86 hardware magic (keyword: MMU) to do its job: translate from virtual to physical addresses. When speaking to PCI(-e) devices, or rather their "memory mapped IO", or when using DMA, addresses need to be translated between the CPU physical address space and the PCI(-e) bus space. In the hardware, in bus transactions, it is the job of the PCI(-e) root complex to handle the payload traffic, including address translation. And on the software side, the kernel provides functions (as part of its internal API) to drivers to be able to translate addresses where needed. As much as the software is only concerned about its respective virtual address space, when talking to PCI(-e) peripheral devices, it needs to program their "base address registers" for DMA with addresses from the "bus space", as that's where the PCI(-e) peripherals live. The peripherals are not gonna play the "game of multiple address translations" actively with us... It's up to the software, or specifically the OS, to make the PCI(-e) bus space allocations a part of the host CPU's physical address space, and to make the host physical space accessible to the PCI devices. (Although not a typical scenario, a host computer can even have multiple PCI(-e) root complexes, hosting multiple trees of the PCI(-e) bus. Their address space allocations must not overlap in the host CPU physical address space.)
There's a shortcut, although not quite: in an x86 PC, the PCI(-e) address space and the host CPU physical address space, are one.
Not sure if this is hardwired in the HW (the root complex just doesn't have any specific mapping/translation capability) or if this is how "things happen to be done", in the BIOS/UEFI and in Linux. Suffice to say that this happens to be the case.
But, at the same time, this doesn't make the life of a Linux driver writer any easier. Linux is made to work on various HW platforms, it does have an API for translating addresses, and the use of that API is mandatory, when crossing between address spaces.
Maybe interestingly, the API shorthands complicit in the context of PCI(-e) drivers and DMA, are "bus_to_virt()" and "virt_to_bus()". Because, to software, what matters is its respective virtual address - so why complicate things to the driver author by forcing him to translate (and keep track of) the virtual, the physical and the bus address space, right? There are also shorthands for allocating memory for DMA use: pci_alloc_consistent() and pci_map_single() - and their deallocation counterparts, and several companions - if interested, you really should refer to Jon Corbet's book and further docs (and kernel source code).
So as a driver author, you allocate a piece of RAM for DMA use, you get a pointer of your respective "virtual" flavour (some kernel space), and then you translate that pointer into the PCI "bus" space, which you can then quote to your PCI(-e) peripheral device = "this is where you can upload the input data".
You can then instruct your peripheral to do a DMA transaction into your allocated memory window. The DMA window in RAM can be bigger (and typically is) than the "maximum PCI-e transaction size" - which means, that the peripheral device needs to issue several consecutive transactions to accomplish a transfer of the whole allocated window (which may or may not be required, depending on your application). Exactly how that fragmented transfer is organized, that's specific to your PCI peripheral hardware and your software driver. The peripheral can just use a known integer count of consecutive offsets back to back. Or it can use a linked list. The list can grow dynamically. You can supply the list via some BAR to the peripheral device, or you can use a second DMA window (or subsection of your single window) to construct the linked list in your RAM, and the peripheral PCI device will just run along that chain. This is how scatter-gather DMA works in practical contemporary PCI-e devices.
The peripheral device can signal back completion or some other events using IRQ. In general, the operation of a peripheral device involving DMA will be a mixture of direct polling access to BAR's, DMA transfers and IRQ signaling.
As you may have inferred, when doing DMA, the peripheral device need NOT necessarily possess a private buffer on board, that would be as big as your DMA window allocation in the host RAM. Quite the contrary - the peripheral can easily "stream" the data from (or to) an internal register that's one word long (32b/64b), or a buffer worth a single "PCI-e payload size", if the application is suited for that arrangement. Or a minuscule double buffer or some such. Or the peripheral can indeed have a humongous private RAM to launch DMA against - and such a private RAM need not be mapped to a BAR (!) if direct MMIO access from the bus is not required/desired.
Note that a peripheral can launch DMA to another peripheral's MMIO BAR just as easily, as it can DMA-transfer data to/from the host RAM. I.e., given a PCI bus, two peripheral devices can actually send data directly to each other, without using bandwidth on the host's "front side bus" (or whatever it is nowadays, north of the PCI root complex: quickpath, torus, you name it).
During PCI bus initialization, the BIOS/UEFI or the OS allocates windows of bus address space (and physical address space) to PCI bus segments and peripherals - to satisfy the BARs' hunger for address space, while keeping the allocations non-overlapping systemwide. Individual PCI bridges (including the host bridge / root complex) get configured to "decode" their respective allocated spaces, but "remain in high impedance" (silent) for addresses that are not their own. Feel free to google on your own on "positive decode" vs. "subtractive decode", where one particular path down the PCI(-e) bus can be turned into an "address sink of last resort", maybe just for the range of the legacy ISA etc.
Another tangential note maybe: if you've never programmed simple MMIO in a driver, i.e. used BAR's offered by PCI devices, know ye that the relevant keyword (API call) is ioremap() (and its counterpart iounmap, upon driver unload). This is how you make your BAR accessible to memory-style access in your living driver.
And: you can make your mapped MMIO bar, or your DMA window, available directly to a user-space process, using a call to mmap(). Thus, your user-space process can then access that memory window directly, without having to go through the expensive and indirect rabbit hole of the ioctl().
Umm. Modulo PCI bus latencies and bandwidth, the cacheable attribute etc.
I feel that this is where I'm getting too deep down under the hood, and running out of steam... corrections welcome.

PCI Express BAR memory mapping basic understanding

I am trying to understand how PCI Express works so i can write a windows driver that can read and write to a custom PCI Express device with no on-board memory.
I understand that the Base Address Registers (BAR) in the PCIE configuration space hold the memory address that the PCI Express should respond to / is allowed to write to. (Is that correct understood?)
My questions are the following:
What is a "bus-specific address" compared to physical address when talking about PCIE?
When and how is the BAR populated with addresses? Is the driver responsible for allocating memory and writing the address to the peripheral BAR?
Is DMA used when transferring data from peripheral to host memory?
I appreciate your time.
Best regards,
i'm also working on device driver (albeit on linux) with a custom board. Here is my attempt on answering your questions:
The BARs represent memory windows as seen by the host system (CPUs) to talk to the device. The device doesn't write into that window but merely answers TLPs (Transaction Layer Packets) requests (MRd*, MWr*).
I would say "bus-specific" = "physical" addresses if your architecture doesn't have a bus layer traslation mechanism. Check this thread for more info.
In all the x86 consumer PCs i've used so far, the BAR address seemed to be allocated either by the BIOS or at OS boot. The driver has to work with whatever address has been allocated.
The term DMA seems to abused instead of bus mastering which I believe is the correct term in PCIe. In PCIe every device may be a bus master (if allowed in its command register bit 2). It does so by sending MRd, MWr TLPs to other devices in the bus (but generally to the system memory) and signalling interrupts to the CPU.
From your query its clear that you want to write a driver for a PCIe slave device. To understand the scheme of things happening behind PCIe transfer, a lot of stuff is available on internet (Like PCIe Bus Enumeration, Peripheral Address Mapping to the memory etc).
Yes your understanding is correct regarding the mapping of PCIe registers to the memory and you can read/write them.( For e.g in case of linux PCIe device driver you can do this using "ioremap" ).
An address bus is used to specify a physical address. When a processor or DMA-enabled device needs to read or write to a memory location, it specifies that memory location on the address bus. Nothing more to add to that.
"PCIe bus enumeration" topic will answer your 2nd question.
Your third question is vague. You mean slave PCIe device. Assuming it is, yes you can transfer data between a slave PCIe device and host using a DMA controller.
I am working on a project which involves "PCIe-DMA" connected with the host over the PCIe bus. Really depends on your design and implementation. So in my case PCIe-DMA is itself a slave PCIe device on the target board connected to the host over PCIe.
clarification for your doubts/questions is here.
1> There are many devices thats sits on BUS like PCI which sees Memeory in terms that are different from a Physical address, those are called bus addresses.
For example if you are initaiating DMA from a device sitting on bus to Main memory of system then destination address should be corresponding bus address of same physical address in Memmory
2> BARS gets populated at the time of enumeration, in a typical PC it is at boot time when your PCI aware frimware enumerate PCI devices presents on slot and allocate addresses and size to BARS.
3> yes you can use both DMA initiated or CPU initiated operations on these BARS.
-- flyinghigh

Is cudaMemcpy3DPeer supported on geforce cards?

Is it possible to use peer-to-peer memory transfer on GeForce cards or is it allowed only on Teslas? I assume cards are 2 GTX690s (each one has two GPUs on board).
I have tried copying between Quadro 4000 and Quadro 600, and it failed. I was transferring 3D arrays using cudaMemcpy3DPeer by filling the cudaMemcpy3DPeerParms struct.
Peer-to-peer memory copy should work on Geforce and Quadro as well as Tesla, see the programming guide for more details.
Memory copies can be performed between the memories of two different devices.
When a unified address space is used for both devices (see Unified Virtual Address
Space), this is done using the regular memory copy functions mentioned in Device
Memory.
Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(),
cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync()
Peer-to-peer memory access, which is where one GPU can directly read from another GPU, requires UVA (which means 64-bit OS) and Tesla and compute capability 2.0 or higher.
Tesla Compute Cluster Mode for Windows), on Windows XP, or on Linux, devices
of compute capability 2.0 and higher from the Tesla series may address each other’s
memory (i.e., a kernel executing on one device can dereference a pointer to the memory
of the other device).

How 'Input/Output' ports are mapped into memory?

I have been trying to understand I/O ports and their mappings with the memory & I/O address space. I read about 'Memory Mapped I/O' and was wondering how this is accomplished by OS/Hardware. Does OS/Hardware uses some kind of table to map address specified in the instruction to respective port ?
Implementations differ in many ways. But the basic idea is that when a read or write occurs for a memory address, the microprocessor outputs the address on its bus. Hardware (called an 'address decoder') detects that the address is for a particular memory-mapped I/O device and enables that device as the target of the operation.
Typically, the OS doesn't do anything special. On some platforms, the BIOS or operating system may have to configure certain parameters for the hardware to work properly.
For example, the range may have to be set as uncacheable to prevent the caching logic from reordering operations to devices that care about the order in which things happen. (Imagine if one write tells the hardware what operation to do and another write tells the hardware to start. Reordering those could be disastrous.)
On some platforms, the operating system or BIOS may have to set certain memory-mapped I/O ranges as 'slow' by adding wait states. This is because the hardware that's the target of the operation may not be as fast as the system memory is.
Some devices may allow the operating system to choose where in memory to map the device. This is typical of newer plug-and-play devices on the PC platform.
In some devices, such as microcontrollers, this is all done entirely inside a single chip. A write to a particular address is routed in hardware to a particular port or register. This can include general-purpose I/O registers which interface to pins on the chip.

Resources