Architecture of IMCs (Integrated Memory Controllers) in the latest Intel processors - memory

I have been looking into the Xeon architecture for a server application. I saw that Xeon supports a quad-channel memory architecture with 3 DIMMs per channel. I have attached a page from Intel's Xeon datasheet.
I got this from:
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-1600-2600-vol-2-datasheet.pdf (page 392, section 4.4).
I have a question about the statement that the DRAM controllers share a common address decode and DMA engine.
If I have 4 cores on the Xeon processor, will I be able to access the 4 DDR channels simultaneously? For example, can one CPU core write to DDR channel 1 while another CPU core reads from DDR channel 2?
controllers share a common address decode
Also, does the above statement mean that the DMA engine can serve only a single channel at a time?
Appreciate any support.

Related

Is Intel QuickPath Interconnect (QPI) used by processors to access memory?

I have read An Introduction to the Intel® QuickPath Interconnect. The document does not mention that QPI is used by processors to access memory. So I think that processors don't access memory through QPI.
Is my understanding correct?
Intel QuickPath Interconnect (QPI) is not wired to the DRAM DIMMs and as such is not used to access the memory connected to the CPU's integrated memory controller (iMC).
The paper you linked includes a figure that shows the connections of a processor, with the QPI signals pictured separately from the memory interface.
The text just before that figure confirms that QPI is not used to access memory:
The processor also typically has one or more integrated memory controllers. Based on the level of scalability supported in the processor, it may include an integrated crossbar router and more than one Intel® QuickPath Interconnect port.
Furthermore, if you look at a typical datasheet you'll see that the CPU pins for accessing the DIMMs are not the ones used by QPI.
QPI is, however, used to access the uncore, the part of the processor that contains the memory controller.
(Picture courtesy of the QPI article on Wikipedia.)
QPI is a fast, general-purpose internal bus: in addition to giving access to the uncore of the local CPU, it gives access to other CPUs' uncores.
Thanks to this link, every resource available in the uncore can potentially be reached over QPI, including the iMC of a remote CPU.
QPI defines a protocol with multiple message classes, two of which are used to read memory through another CPU's iMC.
The flow uses a stack similar to the usual network stack.
Thus the path to remote memory includes a QPI segment, but the path to local memory doesn't.
Update
For the Xeon E7 v3-18C CPU (designed for multi-socket systems), the Home Agent doesn't access the DIMMs directly; instead it uses an Intel SMI2 link to the Intel C102/C104 Scalable Memory Buffer, which in turn accesses the DIMMs.
The SMI2 link is faster than DDR3, and the memory controller implements either a reliability mode or an interleaving mode with the DIMMs.
Initially the CPU used the FSB to access the North Bridge, which contained the memory controller and was linked to the South Bridge (ICH, I/O Controller Hub in Intel terminology) through DMI.
Later the FSB was replaced by QPI.
Then the memory controller was moved into the CPU, which uses its own bus to access memory and QPI to communicate with the other CPUs and the chipset.
Later the North Bridge (IOH, I/O Hub in Intel terminology) was also integrated into the CPU; DMI was used to access the PCH (which now replaces the South Bridge) and PCIe was used to access fast devices (like an external graphics controller).
Recently the PCH has been integrated into the CPU as well, so the CPU now exposes only PCIe, the DIMM pins, SATA Express and the other common internal buses.
As a rule of thumb, the buses used by the processors are:
To other CPUs - QPI
To the IOH - QPI (if an IOH is present)
To the uncore - QPI
To DIMMs - pins as the DRAM technology (DDR3, DDR4, ...) mandates. For Xeon v2+, Intel uses a fast SMI(2) link to connect to an off-core memory controller (Intel C102/104) that handles the DIMMs and channels based on two configurations.
To the PCH - DMI
To devices - PCIe, SATA Express, I2C, and so on.
Yes, QPI is used to access all remote memory on multi-socket systems, and much of its design and performance is intended to support such access in a reasonable fashion (i.e., with latency and bandwidth not too much worse than local access).
Basically, most x86 multi-socket systems are lightly1 NUMA: every DRAM bank is attached to the memory controller of a particular socket; this memory is then local memory for that socket, while the remaining memory (attached to some other socket) is remote memory. All access to remote memory goes over the QPI links, and on many systems2 that is fully half of all memory accesses or more.
So QPI is designed to be low latency and high bandwidth to make such access still perform well. Furthermore, aside from pure memory access, QPI is the link through which cache coherence between sockets is maintained, e.g., notifying the other socket of invalidations, of lines that have transitioned into the shared state, etc.
1 That is, the NUMA factor is fairly low, typically less than 2 for latency and bandwidth.
2 E.g., with NUMA interleave mode on and 4 sockets, pages are spread round-robin across the nodes, so only a quarter of them are local to any given socket and 75% of your accesses are remote.
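To make the local/remote split concrete, here is a minimal sketch, assuming a two-node Linux box with libnuma installed (the node numbers and buffer size are illustrative): it pins the calling thread to node 0 and times a read pass over a buffer allocated on node 0 versus one allocated on node 1. The second pass has to cross the socket interconnect (QPI on these systems), and the ratio you see is roughly the NUMA factor from footnote 1.

/* local_vs_remote.c - compare local vs. remote NUMA reads (sketch).
 * Build: gcc -O2 local_vs_remote.c -lnuma -o local_vs_remote
 * Assumes a 2-node system; node IDs and sizes are illustrative. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256 MiB, large enough to defeat the caches */

static double read_pass(volatile char *buf, size_t size)
{
    struct timespec t0, t1;
    unsigned long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < size; i += 64)    /* one touch per cache line */
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    numa_run_on_node(0);                     /* execute on socket 0 */

    char *local  = numa_alloc_onnode(BUF_SIZE, 0);  /* memory behind socket 0's iMC */
    char *remote = numa_alloc_onnode(BUF_SIZE, 1);  /* memory behind socket 1's iMC */
    memset(local, 1, BUF_SIZE);              /* fault the pages in */
    memset(remote, 1, BUF_SIZE);

    printf("local  read pass: %.3f s\n", read_pass(local, BUF_SIZE));
    printf("remote read pass: %.3f s\n", read_pass(remote, BUF_SIZE));

    numa_free(local, BUF_SIZE);
    numa_free(remote, BUF_SIZE);
    return 0;
}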

Does each core have its own private set of registers?

Looking at this Intel Core i7 Nehalem microarchitecture diagram,
it seems that each core has its own private register file. So I have a couple of short questions, because I thought there was only one set of registers, regardless of the number of cores.
Does each core have its own private set of registers (rax, rbx, rsp and so on)?
Does each core have its own MMU and TLB, rather than just one shared across all cores?
I know the questions are highly microarchitecture dependent, but I think the majority of modern x64 Intel CPUs follow the same design principles.
Each core has its own set of registers, MMU, TLB, level 1 caches (data and instruction), level 2 cache (this depends on the processor), etc. Cache coherency is maintained across cores via "QPI", and in the case of high-end Core i7 and server processors like Xeon, cache coherency is maintained across processors on a multi-processor motherboard by exposing "QPI" on the external pins of those processors (for processors where multi-processor cache coherency is not supported, "QPI" is not "exposed").
Wiki article: Nehalem
Yes, each core has its own set of registers. A "core" is the equivalent of a separate CPU in its own socket, but with "multicore" the electrical wiring is simpler.
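As a rough, Linux-specific illustration (core numbers 0 and 1 are assumptions; adapt them to your topology), the sketch below pins two threads to different cores and has each read its own architectural stack pointer with inline assembly. The two values differ because every running thread, and therefore every core executing one, holds its own full copy of the architectural register set, which the OS saves and restores per thread on context switches.

/* per_core_regs.c - each hardware thread has its own register context (sketch).
 * Build: gcc -O2 -pthread per_core_regs.c -o per_core_regs
 * x86-64 Linux only; core numbers 0 and 1 are illustrative. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdint.h>

static void *report(void *arg)
{
    int core = *(int *)arg;

    /* Pin this thread to one core so the register snapshot below
     * is taken on a known core. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Read the architectural stack pointer of this thread/core. */
    uint64_t rsp;
    __asm__ volatile("mov %%rsp, %0" : "=r"(rsp));

    printf("core %d: rsp = %#lx\n", core, (unsigned long)rsp);
    return NULL;
}

int main(void)
{
    int cores[2] = {0, 1};
    pthread_t t[2];

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, report, &cores[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    /* The two threads print different rsp values: each has its own copy of
     * the architectural registers, supplied by the physical register state
     * of whichever core it runs on. */
    return 0;
}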

Count read and write accesses to memory

On a Linux machine, I need to count the number of read and write accesses to memory (DRAM) performed by a process. The machine has a NUMA configuration and I am binding the process to access memory from a single remote NUMA node using numactl. The process is running on CPUs in node 0 and accessing memory in node 1.
Currently, I am using perf to count LLC load-miss and LLC store-miss events as an estimate of the read and write accesses to memory, since I assumed LLC misses have to be served from memory. Is this approach correct, i.e. are these events relevant? And are there any alternatives for obtaining the read and write access counts?
Processor : Intel Xeon E5-4620
Kernel : Linux 3.9.0+
Depending on your hardware, you should be able to access performance counters located on the memory side to count memory accesses exactly. On Intel processors these are called uncore events. I know that you can also count the same thing on AMD processors.
Counting LLC misses is not entirely correct, because some mechanisms, such as the hardware prefetcher, may cause a significant number of additional memory accesses.
Regarding your hardware, unfortunately you will have to use raw events (in perf terminology). These events can't be generalized by perf because they are processor-specific, so you will have to look into your processor's manual to find the raw encoding of the event to give to perf. For your Intel processor you should look at chapter 18.9.8 "Intel® Xeon® Processor E5 Family Uncore Performance Monitoring Facility" and chapter 19 "Performance-Monitoring Events" of the Intel Software Developer's Manual, available here. In these documents you'll need the exact ID of your processor, which you can get from /proc/cpuinfo.
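To show the mechanics only, here is a sketch using the perf_event_open(2) syscall directly. The raw config value is a placeholder you must replace with the event encoding taken from the SDM chapters above, the workload function is hypothetical, and for uncore/iMC counters you would use the PMU type advertised under /sys/bus/event_source/devices/ (counting per socket with pid = -1) rather than PERF_TYPE_RAW.

/* raw_event.c - count one raw PMU event around a workload (sketch).
 * Build: gcc -O2 raw_event.c -o raw_event
 * The 0x0000 config below is a placeholder: take the real event
 * encoding from the Intel SDM chapters cited above. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    /* glibc provides no wrapper for this syscall. */
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static void workload(void)
{
    /* Placeholder for the memory-intensive code you want to measure. */
    static volatile char buf[1 << 20];
    for (int pass = 0; pass < 100; pass++)
        for (size_t i = 0; i < sizeof(buf); i += 64)
            buf[i]++;
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;         /* raw, processor-specific event         */
    attr.size = sizeof(attr);
    attr.config = 0x0000;              /* <-- placeholder: umask|event from SDM */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* Count for the calling process on any CPU. For uncore (iMC) events,
     * counting is per socket: pass pid = -1 and cpu = a CPU of that socket,
     * and use the PMU type from /sys/bus/event_source/devices/uncore_*. */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    workload();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        perror("read");
        return 1;
    }
    printf("event count: %llu\n", (unsigned long long)count);

    close(fd);
    return 0;
}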

Is cudaMemcpy3DPeer supported on geforce cards?

Is it possible to use peer-to-peer memory transfer on GeForce cards, or is it allowed only on Teslas? I assume the cards are 2 GTX 690s (each one has two GPUs on board).
I have tried copying between a Quadro 4000 and a Quadro 600, and it failed. I was transferring 3D arrays using cudaMemcpy3DPeer by filling the cudaMemcpy3DPeerParms struct.
Peer-to-peer memory copy should work on GeForce and Quadro as well as Tesla; see the programming guide for more details.
Memory copies can be performed between the memories of two different devices. When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory. Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync().
Peer-to-peer memory access, which is where one GPU can directly read from another GPU's memory, requires UVA (which implies a 64-bit OS), a Tesla card, and compute capability 2.0 or higher.
... Tesla Compute Cluster Mode for Windows), on Windows XP, or on Linux, devices of compute capability 2.0 and higher from the Tesla series may address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device).
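For reference, here is a host-side sketch (compile with nvcc; device ordinals 0 and 1 and the 3D extent are illustrative) that checks for direct peer capability and fills cudaMemcpy3DPeerParms for a device-to-device 3D copy. If direct peer access is unavailable between the two GPUs, the runtime should stage the copy through host memory, only more slowly.

/* p2p_copy3d.cu - 3D peer copy between two GPUs (sketch, compile with nvcc).
 * Device ordinals 0/1 and the extent below are illustrative. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

#define CHECK(call)                                                         \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "%s failed: %s\n", #call,                       \
                    cudaGetErrorString(err));                               \
            return 1;                                                       \
        }                                                                   \
    } while (0)

int main(void)
{
    const int srcDev = 0, dstDev = 1;

    /* 3D extent: width is in BYTES for non-array (pitched) copies. */
    struct cudaExtent extent;
    extent.width  = 256 * sizeof(float);
    extent.height = 256;
    extent.depth  = 64;

    /* Check for direct peer access (needs UVA, i.e. a 64-bit OS). Without
     * it the peer copy is typically staged through host memory. */
    int srcToDst = 0, dstToSrc = 0;
    CHECK(cudaDeviceCanAccessPeer(&srcToDst, srcDev, dstDev));
    CHECK(cudaDeviceCanAccessPeer(&dstToSrc, dstDev, srcDev));
    if (srcToDst && dstToSrc) {
        CHECK(cudaSetDevice(srcDev));
        CHECK(cudaDeviceEnablePeerAccess(dstDev, 0));
        CHECK(cudaSetDevice(dstDev));
        CHECK(cudaDeviceEnablePeerAccess(srcDev, 0));
    } else {
        printf("no direct P2P; the peer copy is staged through the host\n");
    }

    /* Allocate pitched 3D buffers, one on each device. */
    struct cudaPitchedPtr srcPtr, dstPtr;
    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaMalloc3D(&srcPtr, extent));
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaMalloc3D(&dstPtr, extent));

    /* Fill the peer-copy descriptor; the zeroed fields (srcPos, dstPos,
     * the cudaArray members) mean "start at the origin, no CUDA array". */
    struct cudaMemcpy3DPeerParms p;
    memset(&p, 0, sizeof(p));
    p.srcDevice = srcDev;
    p.srcPtr    = srcPtr;
    p.dstDevice = dstDev;
    p.dstPtr    = dstPtr;
    p.extent    = extent;

    CHECK(cudaMemcpy3DPeer(&p));
    CHECK(cudaDeviceSynchronize());
    printf("3D peer copy done\n");

    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaFree(srcPtr.ptr));
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaFree(dstPtr.ptr));
    return 0;
}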

Nehalem memory architecture address mapping

Given a 2 processor Nehalem Xeon server with 12GB of RAM (6x2GB), how are memory addresses mapped onto the physical memory modules?
I would imagine that on a single-processor Nehalem with 3 identical memory modules, the address space would be striped over the modules to give better memory bandwidth. But with what stripe size? And how does the second processor (plus its memory) change that picture?
Intel is not very clear on this; you have to dig into their hardcore technical documentation to find out all the details. Here's my understanding. Each processor has an integrated memory controller. Some Nehalems have triple-channel controllers, some have dual-channel controllers. Each memory module is assigned to one of the processors. Triple channel means that accesses are interleaved across three banks of modules, dual channel across two banks.
The specific interleaving pattern is configurable to some extent, but, given the design, you'll almost inevitably end up with 64- to 256-byte stripes.
If one of the processors wants to access memory that's attached to the IMC of another processor, the access goes through both processors and incurs additional latency.
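Purely as an illustrative model, not the actual Nehalem decode (the real mapping is BIOS-configured and may hash address bits), here is what a 64-byte stripe across three channels would look like: consecutive cache lines rotate through the channels, so even a single core streaming sequentially exercises all three.

/* channel_map.c - toy model of 3-channel, 64-byte interleaving (illustrative
 * only; the real Nehalem address decode is BIOS-configured and may hash bits). */
#include <stdio.h>
#include <stdint.h>

#define STRIPE   64u   /* assumed interleave granularity (one cache line) */
#define CHANNELS 3u    /* triple-channel iMC */

static unsigned channel_of(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / STRIPE) % CHANNELS);
}

int main(void)
{
    /* Consecutive cache lines rotate through the three channels. */
    for (uint64_t addr = 0; addr < 8 * STRIPE; addr += STRIPE)
        printf("addr 0x%04llx -> channel %u\n",
               (unsigned long long)addr, channel_of(addr));
    return 0;
}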
