On a Linux machine, I need to count the number of read and write accesses to memory (DRAM) performed by a process. The machine has a NUMA configuration and I am binding the process to access memory from a single remote NUMA node using numactl. The process is running on CPUs in node 0 and accessing memory in node 1.
Currently, I am using perf to count LLC load miss and LLC store miss events as an estimate of read and write accesses to memory, since I assumed LLC misses would have to be served from memory. Is this approach correct, i.e. are these events relevant? And are there any alternatives for obtaining the read and write access counts?
Processor: Intel Xeon E5-4620
Kernel: Linux 3.9.0+
Depending on your hardware, you should be able to access performance counters located on the memory side to count memory accesses exactly. On Intel processors, these are called uncore events. I know that you can also count the same thing on AMD processors.
Counting LLC misses is not entirely accurate, because some mechanisms such as the hardware prefetcher may lead to a significant number of memory accesses.
Regarding your hardware, unfortunately you will have to use raw events (in perf terminology). These events can't be generalized by perf because they are processor-specific, so you will have to look in your processor's manual to find the raw encoding of the event to pass to perf. For your Intel processor, see chapter 18.9.8, "Intel® Xeon® Processor E5 Family Uncore Performance Monitoring Facility", and chapter 19, "Performance-Monitoring Events", of the Intel Software Developer's Manual, available here. To navigate these documents you'll need the exact ID of your processor, which you can get from /proc/cpuinfo.
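As a rough illustration of what that looks like in practice, here is a minimal sketch that opens one uncore iMC counter through perf_event_open(2), the syscall that perf uses underneath. The PMU name (uncore_imc_0), the event/umask values (CAS_COUNT.RD = event 0x04, umask 0x03) and the config bit layout are assumptions for a Sandy Bridge-EP part; verify them against /sys/bus/event_source/devices/uncore_imc_0/format and the uncore manual for your exact CPU.

```c
/* Sketch only (not a drop-in tool): open one uncore iMC counter via
 * perf_event_open(2). Event encoding and PMU name are assumptions to be
 * checked against your SDM / sysfs. Needs root (or a relaxed
 * perf_event_paranoid) because uncore counting is system-wide. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    /* Uncore PMUs get a dynamic type id; read it from sysfs. */
    unsigned type;
    FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
    if (!f || fscanf(f, "%u", &type) != 1) { perror("uncore_imc_0"); return 1; }
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = type;
    attr.config = 0x0304;   /* event=0x04, umask=0x03: CAS_COUNT.RD (assumed layout: event in bits 0-7, umask in 8-15) */

    /* Uncore counters are per-socket and system-wide: pid = -1, and the cpu
     * argument selects the socket. Use a CPU number on the socket whose
     * DIMMs the process is bound to (node 1 in the question). */
    int cpu_on_target_socket = 8;   /* placeholder value */
    int fd = perf_event_open(&attr, -1, cpu_on_target_socket, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    sleep(1);   /* run or wait for the workload here instead */

    uint64_t count;
    if (read(fd, &count, sizeof count) == sizeof count)
        printf("DRAM read CAS commands: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```

For writes you would open a second counter with the CAS_COUNT.WR encoding (umask 0x0C on that generation), and you would normally repeat this for each uncore_imc_N channel and sum the results.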
I'm a little confused about Intel Optane DC.
I want my Optane DC device to be able to act as both DRAM and storage.
On the one hand, I understood that only the "Intel Optane DC Persistent Memory" DIMM is able to act as DRAM, because it has two modes (Memory Mode and App Direct Mode).
On the other hand, in this link: https://www.intel.com/content/www/us/en/products/docs/memory-storage/solid-state-drives/optane-ssd-dc-p4800x-mdt-brief.html
I read that "Together, DRAM and Intel® Optane™ SSDs with Intel® Memory Drive Technology emulate a single volatile memory pool".
I'm confused: is an Intel Optane DC SSD able to act as DRAM, or only the Intel Optane DC Persistent Memory DIMM?
Yes you can use a P4800x with Intel's IMDT (Intel Memory Drive Technology) software to give the illusion of more RAM by using the Optane DC SSD as swap space. This is what you want. IMDT sets up a hypervisor that gives the OS the illusion of DRAM + SSD as physical memory, instead of just letting the OS use it as swap space normally.
Apparently this works well when you already have enough physical RAM for most of your working set, and IMDT has smart prefetching algorithms that try to page in ahead of when a page will be needed.
One advantage to running the OS under the IMDT hypervisor instead of just using the SSD as swap space is that it will get the OS to use some of that extra space for pagecache (aka disk caching), instead of needing special code to use (some of) an SSD as cache for a slower disk.
But no, it's not Optane DC Persistent Memory, that's something else.
See also a SuperUser answer for more about Optane vs. Optane DC PM. And Hadi Brais added some nice sections to it about IMDT for Optane SSDs.
The P4800X is connected over PCI Express (as you can see in the pictures at https://www.anandtech.com/show/11930/intel-optane-ssd-dc-p4800x-750gb-handson-review, for example). So it's not an NV-DIMM; you can't stick it in a DIMM socket and have the CPU access it over the memory bus. The form factor isn't DIMM.
As far as hardware goes, there are 3 things with the Optane brand name:
Consumer grade "Optane" SSDs. Just a fast PCIe NVMe using 3D XPoint memory instead of NAND flash.
Enterprise "Optane DC" SSDs. Just a fast PCIe NVMe using 3D XPoint memory. Not fundamentally different from the consumer stuff, just faster and higher power-consumption. P4800x is this.
The "expand your RAM" functionality here is pure software, fairly similar (and possibly worse) than just creating a swap partition on it and letting the OS handle paging to it. Especially if you weren't using virtualization already.
Enterprise "Optane DC Persistent Memory" (PM for short). 3D XPoint memory that's truly mapped (by hardware) into physical address space for access with ordinary load/store instruction, without going through a driver for each read/write. e.g. Linux mmap(MAP_SYNC) and using clflush or clwb asm instructions in user-space to commit data to persistent storage.
PM is still slower than DRAM, though, so if you just want volatile memory you might still use it as swap space like IMDT. One key use-case for DC PM is giving databases the ability to commit to persistent storage without going through the OS. This allows out-of-order execution around I/O, as well as much lower overhead.
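For a concrete picture of that user-space commit path, here is a minimal sketch (not from the original answer) assuming a DAX-mounted filesystem at a made-up path /mnt/pmem0 backed by Optane DC PM:

```c
/* Minimal sketch: persist a store from user space with MAP_SYNC + clwb.
 * The mount point and file name are made up; MAP_SYNC needs a DAX-capable
 * filesystem. Build with something like: gcc -mclwb pmem_commit.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>    /* on older glibc, MAP_SYNC may live in <linux/mman.h> */
#include <unistd.h>
#include <immintrin.h>   /* _mm_clwb, _mm_sfence */

int main(void)
{
    int fd = open("/mnt/pmem0/log.bin", O_RDWR);   /* hypothetical DAX-backed file */
    if (fd < 0) return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED) return 1;

    strcpy(p, "committed record");   /* an ordinary store, no driver involved */
    _mm_clwb(p);                     /* write back the cache line holding it */
    _mm_sfence();                    /* order the write-back before later durability-dependent work */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```

The point is that no system call sits on the commit path: the cache-line write-back plus fence is what makes the store durable on these platforms.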
See articles like https://www.techspot.com/news/79483-intel-announces-optane-dc-persistent-memory-dimms.html which put Optane DC Persistent Memory above Optane DC in the classic pyramid storage hierarchy.
AFAIK, Optane DC PM devices only exist in a DIMM form factor, not PCIe (and use something like DDR4 signalling). This requires special support from the CPU because modern CPUs integrate the memory controller.
In theory you could have a PCIe device that exposed some persistent storage in a PCIe memory region. Those are part of physical address space and can be configured as write-back cacheable. (Or can they? Mapping MMIO region write-back does not work) So they could be memory-mapped into userland virtual address space. But I don't think any PCIe Optane DC Persistent Memory devices exist, probably because PCIe command latency is (much) higher than over the DDR4 bus. Bandwidth is also lower. So it makes sense to use it as fast swap space (copying in a whole page), not as write-back cacheable physical memory where you could have cache misses waiting a very long time.
(Margaret Bloom also comments re: block size of writes maybe being a problem.)
i.e. you don't want a "hot" part of your working set on memory that the CPU accesses over the PCIe bus. You probably don't even want that for Optane DC PM.
Optane / 3D XPoint is always persistent storage; it's up to software whether you take advantage of that or just use it as slower volatile RAM.
It's not literally DRAM; that term has a specific technical meaning (dynamic = data stored in tiny capacitors that need frequent refreshing). 3D XPoint isn't dynamic, and isn't even volatile. But you can use it as an equivalent because 3D XPoint memory has very good write endurance (it doesn't wear out like NAND flash). When people talk about using Optane as more DRAM, they're using the term to mean simply volatile RAM, filling the same role that DRAM traditionally fills.
I have read An Introduction to the Intel® QuickPath Interconnect. The document does not mention that QPI is used by processors to access memory. So I think that processors don't access memory through QPI.
Is my understanding correct?
Intel QuickPath Interconnect (QPI) is not wired to the DRAM DIMMs and as such is not used to access the memory that is connected to the CPU's integrated memory controller (iMC).
In the paper you linked there is a picture showing the connections of a processor, with the QPI signals drawn separately from the memory interface.
The text just before the picture confirms that QPI is not used to access memory:
The processor also typically has one or more integrated memory controllers. Based on the level of scalability supported in the processor, it may include an integrated crossbar router and more than one Intel® QuickPath Interconnect port.
Furthermore, if you look at a typical datasheet you'll see that the CPU pins for accessing the DIMMs are not the ones used by QPI.
QPI is, however, used to access the uncore, the part of the processor that contains the memory controller.
[Diagram courtesy of the QPI article on Wikipedia]
QPI is a fast internal general-purpose bus; in addition to giving access to the CPU's own uncore, it gives access to other CPUs' uncores.
Due to this link, every resource available in the uncore can potentially be accessed with QPI, including the iMC of a remote CPU.
QPI defines a protocol with multiple message classes; two of them are used to read memory through another CPU's iMC.
The flow uses a stack similar to the usual network stack.
Thus the path to remote memory includes a QPI segment, but the path to local memory doesn't.
Update
For the Xeon E7 v3 18-core CPU (designed for multi-socket systems), the home agent doesn't access the DIMMs directly; instead it uses an Intel SMI2 link to reach the Intel C102/C104 Scalable Memory Buffer, which in turn accesses the DIMMs.
The SMI2 link is faster than DDR3, and the memory buffer implements either reliability features or interleaving across the DIMMs.
Initially the CPU used an FSB to access the north bridge, which contained the memory controller and was linked to the south bridge (ICH, I/O Controller Hub in Intel terminology) through DMI.
Later the FSB was replaced by QPI.
Then the memory controller was moved into the CPU (the memory controller using its own bus to access memory, with QPI used to communicate with other CPUs and the IOH).
Later, the north bridge (IOH, I/O Hub in Intel terminology) was integrated into the CPU and was used to access the PCH (which now replaces the south bridge), while PCIe was used to access fast devices (like an external graphics controller).
Recently the PCH has been integrated into the CPU as well, so the CPU now exposes only PCIe, DIMM pins, SATA Express, and the other common internal buses.
As a rule of thumb the buses used by the processors are:
To other CPUs - QPI
To IOH - QPI (if IOH present)
To the uncore - QPI
To DIMMs - pins as mandated by the DRAM technology in use (DDR3, DDR4, ...). For Xeon v2+, Intel uses a fast SMI(2) link to connect to an off-core memory buffer (Intel C102/104) that handles the DIMMs and channels in one of two configurations.
To PCH - DMI
To devices - PCIe, SATAexpress, I2C, and so on.
Yes, QPI is used to access all remote memory on multi-socket systems, and much of its design and performance is intended to support such access in a reasonable fashion (i.e., with latency and bandwidth not too much worse than local access).
Basically, most x86 multi-socket systems are lightly1 NUMA: every DRAM bank is attached to the memory controller of a particular socket. This memory is then local memory for that socket, while the remaining memory (attached to some other socket) is remote memory. All access to remote memory goes over the QPI links, and on many systems2 that is fully half of all memory accesses or more.
So QPI is designed to be low latency and high bandwidth to make such access still perform well. Furthermore, aside from pure memory access, QPI is the link through which the cache coherence between sockets occurs, e.g., notifying the other socket of invalidations, lines which have transitioned into the shared state, etc.
1 That is, the NUMA factor is fairly low, typically less than 2 for latency and bandwidth.
2 E.g., with NUMA interleave mode on and 4 sockets, each line has only a 1-in-4 chance of being local, so 75% of your accesses are remote.
I'm a bit confused about the difference between shared memory and distributed memory. Can you clarify?
Is shared memory for one processor and distributed memory for many (over a network)?
Why do we need distributed memory, if we have shared memory?
Short answer
Shared memory and distributed memory are low-level programming abstractions that are used with certain types of parallel programming. Shared memory allows multiple processing elements to share the same location in memory (that is, to see each other's reads and writes) without any other special directives, while distributed memory requires explicit commands to transfer data from one processing element to another.
Detailed answer
There are two issues to consider regarding the terms shared memory and distributed memory. One is what they mean as programming abstractions, and the other is what they mean in terms of how the hardware is actually implemented.
In the past there were true shared-memory, cache-coherent multiprocessor systems. The processors communicated with each other and with shared main memory over a shared bus. This meant that any access from any processor to main memory would have equal latency. Today these types of systems are not manufactured. Instead there are various point-to-point links between processing elements and memory elements (this is the reason for non-uniform memory access, or NUMA). However, the idea of communicating directly through memory remains a useful programming abstraction. So in many systems this is handled by the hardware and the programmer does not need to insert any special directives. Some common programming techniques that use these abstractions are OpenMP and Pthreads.
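As a tiny illustration (not part of the original answer), here is what the shared-memory abstraction looks like with OpenMP: every thread reads and writes the same array directly, and no explicit data transfer appears in the code.

```c
/* Shared-memory sketch with OpenMP: threads work on one shared array.
 * Build with e.g.: gcc -fopenmp shared_sum.c */
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double a[N];
    double sum = 0.0;

    /* All threads see the same array `a`; the reduction clause just
     * combines each thread's private partial sum at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}
```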
Distributed memory has traditionally been associated with processors performing computation on local memory and then, once the computation is done, using explicit messages to transfer data to and from remote processors. This adds complexity for the programmer, but simplifies the hardware implementation because the system no longer has to maintain the illusion that all memory is actually shared. This type of programming has traditionally been used with supercomputers that have hundreds or thousands of processing elements. A commonly used technique is MPI.
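By contrast, a distributed-memory sketch in MPI (again, just an illustration) has each rank owning its own buffer, and data moves only through explicit send/receive calls:

```c
/* Distributed-memory sketch with MPI: data moves only via explicit messages.
 * Build/run with e.g.: mpicc mpi_ping.c -o mpi_ping && mpirun -np 2 ./mpi_ping */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local[4] = { 0 };   /* each rank has its own copy, in its own memory */
    if (rank == 0) {
        for (int i = 0; i < 4; i++) local[i] = i + 1.0;
        MPI_Send(local, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* explicit transfer */
    } else if (rank == 1) {
        MPI_Recv(local, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n", local[0], local[1], local[2], local[3]);
    }

    MPI_Finalize();
    return 0;
}
```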
However, supercomputers are not the only systems with distributed memory. Another example is GPGPU programming, which is available for many desktop and laptop systems sold today. Both CUDA and OpenCL require the programmer to explicitly manage sharing between the CPU and the GPU (or other accelerator in the case of OpenCL). This is largely because, when GPU programming started, the GPU and CPU memories were separated by the PCI bus, which has a very long latency compared to performing computation on locally attached memory. So the programming models were developed assuming that the memory was separate (or distributed) and communication between the two processing elements (CPU and GPU) required explicit communication. Now that many systems have GPU and CPU elements on the same die, there are proposals to allow GPGPU programming to have an interface that is more like shared memory.
In modern x86 terms, for example, all the CPUs in one physical computer share memory. e.g. 4-socket system with four 18-core CPUs. Each CPU has its own memory controllers, but they talk to each other so all the CPUs are part of one coherency domain. The system is NUMA shared memory, not distributed.
A room full of these machines form a distributed-memory cluster which communicates by sending messages over a network.
Practical considerations are one major reason for distributed memory: it's impractical to have thousands or millions of CPU cores sharing the same memory with any kind of coherency semantics that make it worth calling it shared memory.
Does anybody know of any simulator I can use to measure statistics of memory access latencies for multicore processors?
Are there such statistics (for any kind of multicore) already published somewhere?
You might try CodeAnalyst from AMD, which monitors the performance registers during program execution on AMD processors, multi-core too where applicable.
I don't know the name of Intel's equivalent product.
Given a 2 processor Nehalem Xeon server with 12GB of RAM (6x2GB), how are memory addresses mapped onto the physical memory modules?
I would imagine that on a single processor Nehalem with 3 identical memory modules, the address space would be striped over the modules to give better memory bandwidth. But with what kind of stripe size? And how does the second processor (+memory) change that picture?
Intel is not very clear on that; you have to dig into their hardcore technical documentation to find out all the details. Here's my understanding. Each processor has an integrated memory controller. Some Nehalems have triple-channel controllers, some have dual-channel controllers. Each memory module is assigned to one of the processors. Triple channel means that accesses are interleaved across three banks of modules, dual channel across two.
The specific interleaving pattern is configurable to some extent, but, given their design, it's almost inevitable that you'll end up with 64 to 256 byte stripes.
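Just to make the idea of striping concrete, here is a toy model (my own illustration, not Intel's actual mapping, which is configurable and more involved) that assumes a plain round-robin interleave with a 64-byte granule across three channels:

```c
/* Toy model of channel interleaving: NOT the real Nehalem address map.
 * Stripe size and round-robin policy are assumptions for illustration. */
#include <stdio.h>
#include <stdint.h>

#define STRIPE   64u   /* assumed interleave granule (one cache line) */
#define CHANNELS 3u    /* triple-channel iMC */

int main(void)
{
    uint64_t addrs[] = { 0x00, 0x40, 0x80, 0xC0, 0x100, 0x140 };
    for (unsigned i = 0; i < sizeof addrs / sizeof addrs[0]; i++) {
        uint64_t stripe_index = addrs[i] / STRIPE;
        printf("addr 0x%-5llx -> channel %llu\n",
               (unsigned long long)addrs[i],
               (unsigned long long)(stripe_index % CHANNELS));
    }
    return 0;
}
```

With a 64-byte stripe, consecutive cache lines land on different channels, which is how interleaving spreads bandwidth across all the channels.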
If one of the processors wants to access memory that's attached to the IMC of some other processor, the access goes through both processors and incurs additional latency.