Cache coherence protocol on AMD Opteron chips (MOESI?) - memory

If I may start with an example.
Say we have a system of 4 sockets, where each socket has 4 cores and 2GB of RAM, with ccNUMA (cache-coherent non-uniform memory access) memory.
Let's say one process runs on each of the 4 sockets (P1-P4), and all of them have some shared memory region allocated in P2's RAM, denoted SHM. This means any load/store to that region will incur a lookup into P2's directory, correct? If so, then... when that lookup happens, is it equivalent to accessing RAM in terms of latency? Where does this directory reside physically? (See below.)
With a more concrete example:
Say P2 does a LOAD on SHM and that data is brought into P2's L3 cache with the tag '(O)wner'. Furthermore, say P4 does a LOAD on the same SHM. This will cause P4 to do a lookup into P2's directory, and since the data is tagged as Owned by P2, my question is:
Does P4 get SHM from P2's RAM, or does it ALWAYS get the data from P2's L3 cache?
If it always gets the data from the L3 cache, wouldn't it be faster to get the data directly from P2's RAM, since it already has to do a lookup in P2's directory anyway? My understanding is that the directory literally sits on top of the RAM.
Sorry if I'm grossly misunderstanding what is going on here, but I hope someone can help clarify this.
Also, is there any data on how fast such a directory lookup is? In terms of data retrieval, is there documentation on the average latencies of such lookups? How many cycles for an L3 read hit, a read miss, a directory lookup, etc.?

It depends on whether the Opteron processor implements the HT Assist mechanism.
If it does not, then there is no directory. In your example, when P4 issues a load, a memory request will arrive at P2's memory controller. P2 will answer back with the cache line and will also send a probe message to the other two sockets. Finally, those two sockets will answer back to P4 with an ACK saying they do not have a copy of the cache line.
If HT Assist is enabled (typically on 6-core and larger sockets), then part of each L3 cache holds a snoop filter (directory) used to record which cores are keeping a line. Thus, in your example, no probe messages are sent to the other two sockets, because the home node P2 looks up the HT Assist directory and finds that no one else has a copy of the line (this is a simplification, since the state of the line would be Exclusive instead of Owned and no directory lookup would be needed).
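To make the two paths concrete, here is a toy sketch (plain C, not the real Opteron microarchitecture; all types and names below are invented for illustration) of how a home node might service a remote read with and without an HT Assist-style snoop filter:

/* Toy illustration only (hedged): a home node serving a remote read, with and
 * without an HT Assist-style snoop filter. All names are invented. */
#include <stdbool.h>
#include <stdio.h>

enum state { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED };   /* MOESI */

struct dir_entry { enum state st; int owner; };   /* tracks one cache line */

struct node {
    bool has_snoop_filter;          /* HT Assist present? */
    struct dir_entry filter[1];     /* toy directory: a single tracked line */
};

/* The home node (e.g. P2) handles a read request from a remote node (e.g. P4). */
static void serve_remote_read(struct node *home, int requester, int num_nodes)
{
    if (!home->has_snoop_filter) {
        /* No directory: reply from local DRAM and broadcast probes to every
         * other node; each of them must ACK back to the requester. */
        printf("reply to P%d from DRAM, probe %d other nodes\n",
               requester, num_nodes - 2);
        return;
    }
    /* HT Assist: consult the snoop filter kept in part of the home L3. */
    struct dir_entry *e = &home->filter[0];
    if (e->st == OWNED || e->st == MODIFIED)
        printf("forward P%d's request to owner node P%d (cache-to-cache)\n",
               requester, e->owner);
    else
        printf("reply to P%d from DRAM, no probes needed\n", requester);
}

int main(void)
{
    struct node p2_plain  = { false, { { INVALID, -1 } } };
    struct node p2_assist = { true,  { { EXCLUSIVE, 2 } } };
    serve_remote_read(&p2_plain, 4, 4);    /* P4 loads, no HT Assist */
    serve_remote_read(&p2_assist, 4, 4);   /* P4 loads, HT Assist enabled */
    return 0;
}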

Related

Counting number of allocations into the Write Pending Queue - unexpected low result on NV memory

I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS, which is supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs with the Cascade Lake architecture, with 2 memory controllers per CPU. The Linux version is 5.4.0-3-amd64. I have the following simple loop, and I am reading this counter for it. Array elements are 64 bytes in size, equal to a cache line.
for (int i = 0; i < 1000000; i++) {
    array[i].value = 2;
}
For this loop, when I map the memory to the DRAM NUMA node, the counter gives around 150,000 as a result, which maybe makes sense: there are 6 channels in total for the 2 memory controllers in front of this NUMA node, which use DRAM DIMMs in interleaving mode. Then for each channel there is one separate WPQ, I believe, so skx_unc_imc0 gets 1/6 of all the stores. There are skx_unc_imc0-5 counters, which I got with papi_native_avail, supposedly one per channel.
The unexpected result is when, instead of mapping to the DRAM NUMA node, I map the program to Non-Volatile Memory, which is presented as a separate NUMA node on the same socket. There are 6 NVM DIMMs per socket, which create one interleaved region. So when writing to NVM, there should similarly be 6 different channels used, and in front of each there is the same single WPQ, which should again get 1/6 of the write inserts.
But UNC_M_WPQ_INSERTS returns only around 1,000 as a result on NV memory. I don't understand why; I expected it to similarly give around 150,000 writes in the WPQ.
Am I interpreting/understanding something wrong? Or are there two different WPQs per channel depending on whether the write goes to DRAM or NVM? Or what else could be the explanation?
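For reference, the test is essentially the following sketch (the struct layout is an assumption matching the 64-byte element size described above, assuming an 8-byte long; binding the allocation to the DRAM or NVM NUMA node is done externally, e.g. with numactl --membind):

/* Hedged sketch of the measured loop: one 64-byte element per cache line
 * (assuming an 8-byte long), 1,000,000 stores. NUMA placement (DRAM vs. NVM
 * node) is assumed to be done externally, e.g. via numactl --membind. */
#include <stdlib.h>

#define N 1000000

struct elem {
    long value;
    char pad[56];                  /* pad the element to 64 bytes */
};

int main(void)
{
    struct elem *array = malloc(N * sizeof *array);
    if (!array)
        return 1;
    for (int i = 0; i < N; i++)
        array[i].value = 2;        /* one store per cache line */
    free(array);
    return 0;
}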
It turns out that UNC_M_WPQ_INSERTS counts the number of allocations into the Write Pending Queue, but only for writes to DRAM.
Intel has added a corresponding hardware counter for persistent memory: UNC_M_PMM_WPQ_INSERTS, which counts write requests allocated in the PMM Write Pending Queue for Intel® Optane™ DC persistent memory.
However, no such native event shows up in papi_native_avail, which means it can't be monitored with PAPI yet. In Linux version 5.4, some of the PMM counters can be found directly in perf list under uncore events, such as unc_m_pmm_bandwidth.write - Intel Optane DC persistent memory bandwidth write (MB/sec), derived from unc_m_pmm_wpq_inserts, unit: uncore_imc. This implies that even though UNC_M_PMM_WPQ_INSERTS is not listed directly as an event in perf list, it should exist on the machine.
As described here, the EventCode for this counter is 0xE7, so it can be used with perf as a raw hardware event descriptor as follows: perf stat -e uncore_imc/event=0xe7/. However, it seems that it does not support event modifiers to restrict counting to user space. Then, after pinning the thread to the same socket as the NVM NUMA node, for the program that basically only runs the loop described in the question, the result from perf roughly makes sense:
Performance counter stats for 'system wide':
    1,035,380   uncore_imc/event=0xe7/
So far this seems to be the best guess.

Cassandra Data storage: data directory space not equal to the space occupied

This is a beginners question on Cassandra Architecture.
I have a 3-node Cassandra cluster. The data directory is at $CASSANDRA_HOME/data/data. I've loaded a huge data set. I did a nodetool flush and then nodetool tablestats on the table into which I loaded the data. It says the total space occupied is around 50GiB. I was curious and checked the size of my data directory with du $CASSANDRA_HOME/data/data on each of the nodes, which shows around 1-2GB on each. How could the data directory be smaller than the space occupied by a single table? Am I missing something? My table is created with replication factor 1.
du reports the true storage capacity used by the paths given to it. This is not always directly related to the size of the data stored in those paths.
Two main factors make the output of du differ from other storage-usage figures you might get (e.g. from Cassandra).
du might report a smaller number than expected for two reasons: ⓐ It combines hard links. This means that if the paths given to it contain hard-linked files (I won't explain hard links here, but the term is a fixed one for Unix-like operating systems, so it can be looked up easily), these are counted only once even though the files appear multiple times. ⓑ It is aware of sparse files; these are files which contain large (sometimes huge) areas of empty space (zero bytes). Many Unix-like file systems can store these efficiently, depending on how they were created.
du might report a larger number than expected because file systems have some overhead. To store a file of n bytes, n + h bytes need to be stored, where h depends on the file system and its configuration. The most important factor is that file systems typically store files in a block structure. If a file's size isn't an exact multiple of the file system's block size, the last block is still allocated entirely to this file, so some of its space is wasted. du shows the whole block as allocated because, in fact, it is.
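Both effects can be observed directly. Here is a minimal sketch (POSIX C; the file name is made up) that creates a sparse file whose apparent size is 1 GiB while the blocks actually allocated, which is what du counts, stay near zero:

/* Hedged demo of a sparse file: apparent size vs. blocks actually allocated.
 * The file name is made up; run it in a scratch directory. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sparse.demo", O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    ftruncate(fd, 1L << 30);       /* 1 GiB of apparent size, no data written */

    struct stat st;
    fstat(fd, &st);
    printf("apparent size: %lld bytes\n", (long long)st.st_size);
    printf("allocated:     %lld bytes\n", (long long)st.st_blocks * 512);
    close(fd);
    return 0;
}

Running du -h on such a file reports roughly the allocated figure, not the 1 GiB apparent size.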
So in your case, Cassandra might report 50GiB of space occupied, but a lot of it might be empty (never-written-to) space. This can be stored as sparse files on the file system, which in fact only use about 2GiB of storage (which is what du shows).

Write back or write through to main memory

write-through: data is written to main memory through the cache immediately.
write-back: data is written at a later time.
I have a shared memory region located on NUMA node 1. Suppose that process A, executing on node 0, modifies the contents of the shared memory, and then process B, executing on node 1, wants to read those contents.
If the cache is in write-through mode, then the contents modified by process A will be in node 1's main memory, since node 0's writes to node 1's main memory go through node 1's L3 cache; process B can then get the contents modified by process A from node 1's L3 cache, not from node 1's main memory.
If it is in write-back mode, then when process B on node 1 wants to read the contents of the shared memory, the cache line will be in node 0's L3 cache, and getting it will cost more since it is in node 0's cache.
I would like to know which mode an Intel(R) Xeon(R) CPU E5-2643 will choose.
Or will the Xeon decide which mode to use on its own, with nothing the programmer can do?
Edit :
dmidecode -t cache
shows that the Xeon cache operational mode is write-back, which looks reasonable, referring to
http://www.cs.cornell.edu/courses/cs3410/2013sp/lecture/18-caches3-w.pdf
Cache coherency on Intel (and AMD) x86-64 NUMA architectures does not work like that of a RAID array... Instead of having a single write-through or write-back cache, the two or four processor packages use a snooping and transfer protocol to synchronize and share their L3 caches. OS-level support for controlling such things is generally very rough, even though NUMA has been mainstream for about ten years now.
Speaking specifically about Linux, control over the cache behaviour really boils down to a handful of process-level settings:
What core(s) your code is allowed to run on.
Whether your process is allowed to allocate non-local node memory.
Whether your process interleaves all of its allocations between Numa nodes.
By default, the Linux kernel will allocate process memory from the NUMA node the process is actively running on, falling back to allocations on the other node if there's memory pressure on the local node.
You can control the pushing of data into and out of the local node's L3 cache using x86 assembly primitives such as PREFETCH and CLFLUSH, but in general you really, really, really should not care about anything more than your process running locally with its allocated memory.
For more information on this, I'd encourage you to read some of the Linux documentation on NUMA, and possibly also Intel's (QPI is the name of the cache-sharing technology).
A good start for you would probably be the Linux 'numactl' manpage (https://linux.die.net/man/8/numactl)
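For completeness, here is a hedged sketch of what those process-level knobs look like from C using libnuma (link with -lnuma; the node numbers and sizes are made-up examples, not recommendations):

/* Hedged sketch of the process-level NUMA controls mentioned above, using
 * libnuma (link with -lnuma). Node numbers and sizes are made-up examples. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* 1. Restrict the process to the cores of one node. */
    numa_run_on_node(0);

    /* 2. Allocate memory explicitly on a given node (the local one here). */
    char *buf = numa_alloc_onnode(1 << 20, 0);   /* 1 MiB on node 0 */
    if (!buf)
        return 1;
    buf[0] = 42;                                 /* touch it so it gets backed */

    /* 3. Or interleave all further allocations across nodes:
     *    numa_set_interleave_mask(numa_all_nodes_ptr); */

    numa_free(buf, 1 << 20);
    return 0;
}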

Where is memory interleaving and memory split up into ranks happening in Linux kernel?

I am working on a course homework assignment on the sysfs virtual file system in the Linux kernel. As part of setting up the sysfs virtual file system, the Linux kernel organizes physical memory into blocks, and further into sections, in the directory /sys/devices/system/memory. In that directory, memory chunks are represented as memory0, memory1, memory2, etc.
After digging through the Linux kernel, I have found that the memory is split into 128MB blocks and then further into sections of memory, and I found the code which does this in the C file Memory.c. In that file, the method memory_dev_init() has the logic for the whole memory-block splitting and dividing into sections (or that's what I understood :) ). As per my professor, memory in Linux is split up into ranks, and ranks contain interleaved memory addresses as shown below:
rank0: [0-512KB] [2048KB-2560KB] [4096KB-4608KB] ...
rank1: [512KB-1024KB] [2560KB-3072KB] [4608KB-5120KB] ...
rank2: [1024KB-1536KB] [3072KB-3584KB] [5120KB-...
rank3: [1536KB-2048KB] [3584KB-4096KB] ...
As part of my homework, I want to change the rank format to the following, so that I can get contiguous memory blocks:
rank0: [0-512KB] [512KB-1024KB] [1024KB-1536KB]...
rank1: [1536KB-2048KB] [2048KB-2560KB] [2560KB-3072KB]...
rank2: [3072KB-3584KB] [3584KB-4096KB] [4096KB-4608KB]...
rank3: [4608KB-5120KB] ...
So I just want to know where exactly this memory interleaving and the existing ranking happen in the current Linux kernel. Could anyone please point me in the right direction?
I'm not quite sure, as I don't see any practical use for this; it is indeed a sort of academic research... What you are trying to achieve is achievable by disabling memory interleaving entirely. I guess that after you disable interleaving you will see the proper "picture" in sysfs as well.
In other words -- no coding required, just the change of configuration.
Have a look at the memory interleave settings in the BIOS. Here's a post which describes how to do this on a couple of platforms.
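If you then want to sanity-check what sysfs reports, a minimal sketch that just reads the block size exposed next to the memoryN directories mentioned in the question (the file /sys/devices/system/memory/block_size_bytes) could look like this:

/* Hedged sketch: read the memory block size that sysfs exposes alongside the
 * memoryN directories. Purely observational, no kernel changes involved. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/devices/system/memory/block_size_bytes", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    unsigned long long block_size;
    if (fscanf(f, "%llx", &block_size) == 1)     /* the value is printed in hex */
        printf("memory block size: %llu MB\n", block_size >> 20);
    fclose(f);
    return 0;
}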

4 questions about processor architecture. (Computer engineering)

Our teacher has asked us around 50 true-or-false questions in preparation for our final exam. I could find an answer for most of them online or by asking a relative. However, these 4 questions are driving me crazy. Most of these questions aren't that hard, I just can't get a satisfying answer anywhere. Sorry, the original questions are not written in English; I had to translate them myself. If you don't understand something, please tell me.
Thanks!
True or false
The size of the manipulated address used by the processor determines the size of the virtual memory. However, the size of the cache memory is independent of it.
For a long time, DRAM technology remained incompatible with the CMOS technology used for the standard logic in processors. This is the reason DRAM memory is (most of the time) placed outside the processor (on a different chip).
Pagination lets multiple virtual address spaces correspond to a single physical address space.
A set-associative cache memory with sets of 1 line is an entirely associative cache memory, because one memory block can go in any set, since each set is the same size as a block.
"Manipulated address" is not a term of the art. You have an m-bit virtual address mapping to an n-bit physical address. Yes, a cache may be of any size up to the physical address size, but typically is much smaller. Note that cache lines are tagged with virtual or more typically physical address bits corresponding to the maximum virtual or physical address range of the machine.
Yes, DRAM processes and logic processes are each tuned for different objectives and involve different process steps (different materials and thicknesses to lay down DRAM capacitor stacks/trenches, for example), and historically you haven't built processors in DRAM processes (except the Mitsubishi M32RD) nor DRAM in logic processes. The exception is so-called eDRAM, which IBM likes to use in its SOI processes and which serves as the last-level cache in IBM microprocessors such as the POWER7.
"Pagination" is what we call issuing a form feed so that text output begins at the top of the next page. "Paging" on the other hand is sometimes a synonym for virtual memory management, by which a virtual address is mapped (on a page by page basis) to a physical address. If you set up your page tables just so it allows multiple virtual addresses (indeed, virtual addresses from different processes' virtual address spaces) to map to the same physical address and hence the same location in real RAM.
"An associative cache memory with sets of 1 line is an entierly associative cache memory, because one memory block can go in any set since each sets are of the same size that of the block."
Hmm, that's a strange question. Let's break it down. 1) You can have a direct mapped cache, in which an address maps to only one cache line. 2) You can have a fully associative cache, in which an address can map to any cache line; there is something like a CAM (content addressible memory) tag structure to find which if any line matches the address. Or 3) you can have an n-way set associative cache, in which you have, essentially, n sets of direct mapped caches, and a given address can map to one of n lines. There are other more esoteric cache organizations, but I doubt you're being taught them.
So let's parse the statement. "An associative cache memory". Well that rules out direct mapped caches. So we're left with "fully associative" and "n-way set associative". It has sets of 1 line. OK, so if it is set associative, then instead of something traditional like 4-ways x 64 lines/way, it is n-ways x 1 lines/way. In other words, it is fully associative. I would say this is a true statement, except the term of the art is "fully associative" not "entirely associative."
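A small, hedged sketch of that breakdown (the parameters are made up): with S sets, the set-index field is log2(S) bits wide, so with a single set the index field vanishes and every lookup compares tags only, i.e. the cache behaves fully associatively.

/* Hedged sketch: how an address splits into tag / set index / line offset for
 * a cache with a given number of sets. All parameters are made up. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64

static void split(uint64_t addr, unsigned sets)
{
    unsigned offset_bits = 6;                    /* log2(LINE_BYTES) */
    unsigned index_bits  = 0;
    while ((1u << index_bits) < sets)
        index_bits++;                            /* log2(sets) */

    uint64_t offset = addr & (LINE_BYTES - 1);
    uint64_t index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
    uint64_t tag    = addr >> (offset_bits + index_bits);
    printf("sets=%2u -> tag=0x%llx index=%llu offset=%llu\n", sets,
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
}

int main(void)
{
    uint64_t addr = 0x12345678;
    split(addr, 64);   /* e.g. a 4-way cache with 64 sets: index picks the set */
    split(addr, 1);    /* a single set: no index bits, tags do all the work,
                          which is exactly a fully associative cache */
    return 0;
}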
Makes sense?
Happy hacking!
True, more or less (it depends on the accuracy of your translation I guess :) ) The number of bits in addresses sets an upper limit on the virtual memory space; you could, of course, choose not to use all the bits. The size of the memory cache depends on how much actual memory is installed, which is independent; but of course if you had more memory than you can address, then it still can't be used.
Almost certainly false. We have RAM on separate chips so that we can install more without building a whole new computer or replacing the CPU.
There is no a priori upper or lower limit to the cache size, though in a real application certain sizes make more sense than others, of course.
I don't know of any incompatibility. The reason why we use SRAM as on-die cache is because it's faster.
Maybe you can force an MMU to map different virtual addresses to the same physical location, but usually it's used the other way around.
I don't understand the question.

Resources