Write back or write through to main memory

write-through: data is written to main memory through the cache immediately.
write-back: data is written to main memory at a later time, e.g. when the cache line is evicted.
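To make the difference concrete, here is a minimal, purely illustrative C sketch of a single cache line under the two policies; the cache_line structure and the function names are invented for this example and do not correspond to any real hardware or library interface.

#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 64

/* Hypothetical model: one cache line backed by main memory. */
struct cache_line {
    unsigned char data[LINE_SIZE];
    bool dirty;                      /* only meaningful for write-back */
};

static unsigned char main_memory[LINE_SIZE];

/* Write-through: every store updates the cache AND main memory immediately. */
static void store_write_through(struct cache_line *line, int offset, unsigned char value)
{
    line->data[offset] = value;
    main_memory[offset] = value;     /* memory is never stale */
}

/* Write-back: only the cache is updated; the line is marked dirty and main
 * memory is updated later, when the line is evicted. */
static void store_write_back(struct cache_line *line, int offset, unsigned char value)
{
    line->data[offset] = value;
    line->dirty = true;              /* memory is stale until eviction */
}

static void evict(struct cache_line *line)
{
    if (line->dirty) {
        memcpy(main_memory, line->data, LINE_SIZE);
        line->dirty = false;
    }
}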
I have a shared memory region, located on NUMA Node 1. Suppose that process A, running on Node 0, modifies the contents of the shared memory, and then process B, running on Node 1, wants to read the contents of the shared memory.
If it is in write-through mode, then the contents modified by process A will be in Node 1's main memory, since the write from Node 0 to Node 1's main memory goes through Node 1's L3 cache; process B can then get the contents modified by process A from Node 1's L3 cache, not from Node 1's main memory.
If it is in write-back mode, then when process B on Node 1 wants to read the contents of the shared memory, the cache line will still be in Node 0's L3 cache, and getting it will cost more since it is in Node 0's cache.
I would like to know which mode the Intel(R) Xeon(R) CPU E5-2643 will choose,
or does the Xeon decide which mode it uses on its own, with nothing the programmer can do?
Edit:
dmidecode -t cache
shows that the Xeon cache operational mode is write-back, which looks reasonable, referring to
http://www.cs.cornell.edu/courses/cs3410/2013sp/lecture/18-caches3-w.pdf

Cache coherency on Intel (and AMD) x86-64 NUMA architectures does not work like that of a RAID array... Instead of having a single write-through or write-back cache, the two or four processor packages have a snooping & transfer protocol for synchronizing and sharing their L3 caches. OS-level support for controlling such things is generally very rough, even though NUMA has been mainstream for about ten years now.
Speaking specifically about Linux, control over the cache settings really boils down to a handful of process-level settings:
What core(s) your code is allowed to run on.
Whether your process is allowed to allocate non-local node memory.
Whether your process interleaves all of its allocations between NUMA nodes.
By default, the Linux kernel will allocate process memory from the NUMA node the process is actively running on, falling back to allocations on the other node if there's memory pressure on the local node.
You can control the pushing of data in and out of the L3 cache of the local node using x86 assembly primitives like LOCK, but in general you really, really, really should not care about anything more than your process running locally with its allocated memory.
For more information on this, I'd encourage you to read some of the Linux documentation on NUMA, and possibly also Intel's (QPI is the name of the cache-sharing technology).
A good start for you would probably be the Linux 'numactl' manpage (https://linux.die.net/man/8/numactl)
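If you want those process-level settings from inside a program rather than through numactl, the libnuma API (link with -lnuma) exposes the same knobs. The following is a minimal sketch, assuming libnuma is installed and that node 0 exists in your topology:

#include <numa.h>      /* libnuma; link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Restrict this process to the CPUs of node 0 ... */
    numa_run_on_node(0);

    /* ... and allocate 1 MiB that lives in node 0's memory, so the process
     * runs locally with its allocated memory. */
    size_t size = 1 << 20;
    char *buf = numa_alloc_onnode(size, 0);
    if (!buf)
        return 1;

    memset(buf, 0, size);            /* touch the pages */
    numa_free(buf, size);
    return 0;
}

numa_alloc_interleaved() and numa_set_localalloc() cover the interleaving and local-allocation policies from the list above.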

Related

Counting number of allocations into the Write Pending Queue - unexpected low result on NV memory

I am trying to use some of the uncore hardware counters, such as: skx_unc_imc0-5::UNC_M_WPQ_INSERTS. It's supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs with the Cascade Lake architecture, with 2 memory controllers per CPU. The Linux version is 5.4.0-3-amd64. I have the following simple loop, and I am reading this counter for it. Array elements are 64 bytes in size, equal to a cache line.
for (int i = 0; i < 1000000; i++) {
    array[i].value = 2;   // each store touches one 64-byte element, i.e. one cache line
}
For this loop, when I map the memory to the DRAM NUMA node, the counter gives around 150,000 as a result, which maybe makes sense: there are 6 channels in total for the 2 memory controllers in front of this NUMA node, which use DRAM DIMMs in interleaving mode. Then for each channel there is one separate WPQ, I believe, so skx_unc_imc0 gets about 1/6 of all the stores (1,000,000 / 6 ≈ 167,000, close to what is observed). There are skx_unc_imc0-5 counters that I got with papi_native_avail, supposedly one for each channel.
The unexpected result is when, instead of mapping to the DRAM NUMA node, I map the program to non-volatile memory, which is presented as a separate NUMA node on the same socket. There are 6 NVM DIMMs per socket, which create one interleaved region. So when writing to NVM, there should similarly be 6 different channels used and, in front of each, the same single WPQ, which should again get 1/6 of the write inserts.
But UNC_M_WPQ_INSERTS returns only around 1,000 as a result on NV memory. I don't understand why; I expected it to give similarly around 150,000 writes in the WPQ.
Am I interpreting/understanding something wrong? Or are there two different WPQs per channel depending on whether the write goes to DRAM or NVM? Or what else could be the explanation?
It turns out that UNC_M_WPQ_INSERTS counts the number of allocations into the Write Pending Queue only for writes to DRAM.
Intel has added a corresponding hardware counter for persistent memory: UNC_M_PMM_WPQ_INSERTS, which counts write requests allocated in the PMM Write Pending Queue for Intel® Optane™ DC persistent memory.
However, there is no such native event showing up in papi_native_avail, which means it can't be monitored with PAPI yet. In Linux version 5.4, some of the PMM counters can be found directly in perf list uncore, such as unc_m_pmm_bandwidth.write - Intel Optane DC persistent memory bandwidth write (MB/sec), derived from unc_m_pmm_wpq_inserts, unit: uncore_imc. This implies that even though UNC_M_PMM_WPQ_INSERTS is not directly listed in perf list as an event, it should exist on the machine.
As described here, the EventCode for this counter is 0xE7, so it can be used with perf as a raw hardware event descriptor as follows: perf stat -e uncore_imc/event=0xe7/. However, it seems that event modifiers to restrict counting to user space are not supported with perf for this event. Then, after pinning the thread to the same socket as the NVM NUMA node, for the program that basically only does the loop described in the question, the result of perf kind of makes sense:
Performance counter stats for 'system wide':

         1,035,380      uncore_imc/event=0xe7/
So far this seems to be the best guess.
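If you would rather read the raw event from code than through perf stat, the perf_event_open syscall can be pointed at the uncore IMC PMU. The sketch below is only an illustration: it assumes the PMU is exposed as /sys/bus/event_source/devices/uncore_imc_0 (the exact name varies by machine and kernel) and that you run with the same privileges perf needs; pin the measured thread to a CPU on the socket of the NVM NUMA node, as above.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Dynamic PMU type number for the first IMC; the sysfs name is an
     * assumption and may differ on your machine. */
    FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
    if (!f) { perror("open PMU type"); return 1; }
    int pmu_type = 0;
    fscanf(f, "%d", &pmu_type);
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = pmu_type;
    attr.config = 0xe7;              /* raw event code, as in uncore_imc/event=0xe7/ */

    /* Uncore counters are per-socket, not per-task: pid = -1 and a CPU
     * belonging to the socket of interest; no user/kernel modifiers. */
    int fd = syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the store loop from the question here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("uncore_imc_0 event 0xe7: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}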

How do you register a region of user space memory in a kernel module?

I am working on a Linux module to interface with a third-party device. When this device is ready to give my module information, it writes directly to physical RAM address 0x900000.
When I check /proc/iomem, I get:
00000000-3fffffff : System RAM
00008000-00700fff : Kernel code
00742000-007a27b3 : Kernel data
From my understanding, this means that it is writing to an address that is floating out in the middle of user space.
I know that this is not an optimal situation and it would be better to be able to use memory-mapped addresses/registers, but I don’t have the option of changing the way it works right now.
How do I have my kernel module safely claim the user-space memory region from 0x900000 to 0x901000?
I tried mmap and ioremap but those are really for memory-mapped registers, not accessing memory that already ‘exists’ in userspace. I believe that I can read/write from the address by just using the pointer, but that doesn’t prevent corruption if that region is allocated to another process.
You can tell the kernel to restrict the memory it uses by setting the mem parameter in the boot args:
mem=1M#0x900000 --> instructs the kernel to use 1M starting from 0x900000
You can have multiple mem entries in the boot args,
for example: mem=1M#0x900000 mem=1M#0xA00000
The following command should tell you the memory region allocated to the kernel:
cat /proc/iomem | grep System
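Once the region has been carved out of the memory the kernel manages (via mem= as above), the module can claim and map it explicitly. Here is a minimal sketch using the standard request_mem_region()/ioremap() kernel API, with the 0x900000 address and 4KB size taken from the question:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/io.h>
#include <linux/ioport.h>

#define SHM_PHYS_ADDR  0x900000
#define SHM_SIZE       0x1000

static void __iomem *shm_base;

static int __init shm_init(void)
{
    /* Claim the physical range so no other driver can grab it. */
    if (!request_mem_region(SHM_PHYS_ADDR, SHM_SIZE, "thirdparty-shm"))
        return -EBUSY;

    /* Map it into the kernel's virtual address space. */
    shm_base = ioremap(SHM_PHYS_ADDR, SHM_SIZE);
    if (!shm_base) {
        release_mem_region(SHM_PHYS_ADDR, SHM_SIZE);
        return -ENOMEM;
    }

    pr_info("thirdparty-shm: first word = 0x%x\n", ioread32(shm_base));
    return 0;
}

static void __exit shm_exit(void)
{
    iounmap(shm_base);
    release_mem_region(SHM_PHYS_ADDR, SHM_SIZE);
}

module_init(shm_init);
module_exit(shm_exit);
MODULE_LICENSE("GPL");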

ARM: Safe physical memory position (to reserve) for my ARM hypervisor in relation to a Linux/Android guest

I am developing a basic hypervisor on ARM (using the board Arndale Exynos 5250).
I want to load Linux (Ubuntu or something else)/Android as the guest. Currently I'm using a Linaro distribution.
I'm almost there, most of the big problems have already been dealt with, except for the last one: reserving memory for my hypervisor such that the kernel does not try to OVERWRITE it BEFORE parsing the FDT or the kernel command line.
The problem is that my Linaro distribution's U-Boot passes an FDT in R2 to the Linux kernel, BUT the kernel tries to overwrite my hypervisor's memory before seeing that I reserved that memory region in the FDT (by decompiling the DTB, modifying the DTS and recompiling it). I've tried to change the kernel command-line parameters, but they are also parsed AFTER the kernel tries to overwrite my reserved portion of memory.
Thus, what I need is a safe memory location in physical RAM to put my hypervisor's code, such that the Linux kernel won't try to access (r/w) it BEFORE parsing the FDT or its kernel command line.
Context details:
The system RAM layout on the Exynos 5250 is: physical RAM starts at 0x4000_0000 (=1GB) and has length 0x8000_0000 (=2GB).
The Linux kernel is loaded (by U-Boot) at 0x4000_7000, its size (uncompressed uImage) is less than 5MB, and its entry point is set to be at 0x4000_8000.
uInitrd is loaded at 0x4200_0000 and is less than 2MB in size.
The FDT (board.dtb) is loaded at 0x41f0_0000 (passed in R2) and is less than 35KB in size.
I currently load my hypervisor at 0x40C0_0000 and I want to reserve 200MB (0x0C80_0000) starting from that address, but the kernel tries to write there (a stage 2 HYP trap tells me that) before looking in the FDT or in the command line to see that the region is actually reserved. If instead I load my hypervisor at 0x5000_0000 (without even modifying the original DTB or the command line), it does not try to overwrite me!
The FDT is passed directly, not through ATAGs
Since when loading my hypervisor at 0x5000_0000 the kernel does not try to overwrite it whatsoever, I assume there are memory regions that Linux does not touch before parsing the FDT/command-line. I need to know whether this is true or not, and if true, some details regarding these memory regions.
Thanks!
RELATED QUESTION:
Does anyone happen to know what the priority is between the following: ATAGs / kernel command line / FDT? For instance, if I reserve memory through the kernel command line but not in the FDT (.dtb), should it work, or is the command line overridden by the FDT? Is there some kind of concatenation between these three?
As per https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/arm/Booting, safe locations start 128MB from the start of RAM (assuming the kernel is loaded in that region, which it should be). If a zImage was loaded lower in memory than what is likely to be the end address of the decompressed image, it might relocate itself higher up before it starts decompressing. But in addition to this, the kernel has a .bss region beyond the end of the decompressed image in memory.
(Do also note that your FDT and initrd locations already violate this specification, and that the memory block you are wanting to reserve covers the locations of both of these.)
Effectively, your reserved area should go after the FDT and initrd in memory - which 0x50000000 is. But anything more than 0x0800_0000 (128MB) past the start of RAM - i.e. at or above 0x4800_0000 on this board - should work portably, so long as it doesn't overwrite the FDT, initrd or U-Boot in memory.
The priority of kernel/FDT/bootloader command line depends on the kernel configuration - do a menuconfig and check under "Boot options". You can combine ATAGS with the built-in command lines, but not FDT - after all, the FDT chosen node is supposed to be generated by the bootloader - U-boot's FDT support is OK so you should let it do this rather than baking it into the .dts if you want an FDT command line.
The kernel is pretty conservative before it's got its memory map since it has to blindly trust the bootloader has laid things out as specified. U-boot on the other hand is copying bits of itself all over the place and is certainly the culprit for the top end of RAM - if you #define DEBUG in (I think) common/board_f.c you'll get a dump of what it hits during relocation (not including the Exynos iRAM SPL/boot code stuff, but that won't make a difference here anyway).

cache coherence protocol AMD Opteron chips (MOESI?)

If I may start with an example.
Say we have a system of 4 sockets, where each socket has 4 cores and 2GB of RAM, with ccNUMA (cache-coherent non-uniform memory access) memory.
Let's say there are 4 processes, one running on each socket, and all have some shared memory region allocated in P2's RAM, denoted SHM. This means any load/store to that region will incur a lookup into P2's directory, correct? If so, then... when that lookup happens, is it equivalent to accessing RAM in terms of latency? Where does this directory reside physically? (See below.)
With a more concrete example:
Say P2 does a LOAD on SHM and that data is brought into P2's L3 cache with the tag '(O)wner'. Furthermore, say P4 does a LOAD on the same SHM. This will cause P4 to do a lookup into P2's directory, and since the data is tagged as Owned by P2 my question is:
Does P4 get SHM from P2's RAM or does it ALWAYS get the data from P2's L3 cache?
If it always gets the data from the L3 cache, wouldn't it be faster to get the data directly from P2's RAM, since it already has to do a lookup in P2's directory anyway? And my understanding is that the directory is literally sitting on top of the RAM.
Sorry if I'm grossly misunderstanding what is going on here, but I hope someone can help clarify this.
Also, is there any data on how fast such a directory lookup is? In terms of data retrieval, is there documentation on the average latencies for such lookups? How many cycles for an L3 read hit, a read miss, a directory lookup, etc.?
It depends on whether the Opteron processor implements the HT Assist mechanism.
If it does not, then there is no directory. In your example, when P4 issues a load, a memory request will arrive at P2's memory controller. P2 will answer back with the cache line and will also send probe messages to the other two sockets. Finally, these other two sockets will answer back to P4 with an ACK saying they do not have a copy of the cache line.
If HT Assist is enabled (typically for 6-core and higher sockets), then each L3 cache contains a snoop filter (directory) used to record which caches are keeping a line. Thus, in your example, P4 will not send probe messages to the other two sockets, as it looks up the HT Assist directory to find out that no one else has a copy of the line (this is a simplification, as the state of the line would be Exclusive instead of Owned and no directory lookup would be needed).

GC and memory limit issues with R

I am using R on some relatively big data and am hitting some memory issues. This is on Linux. I have significantly less data than the available memory on the system so it's an issue of managing transient allocation.
When I run gc(), I get the following listing
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   2147186  114.7    3215540  171.8   2945794  157.4
Vcells 251427223 1918.3  592488509 4520.4 592482377 4520.3
yet R appears to have 4GB allocated in resident memory and 2GB in swap. I'm assuming this is OS-allocated memory that R's memory management system will allocate and GC as needed. However, let's say that I don't want R to OS-allocate more than 4GB, to prevent swap thrashing. I could always ulimit, but then it would just crash instead of working within the reduced space and GCing more often. Is there a way to specify an arbitrary maximum for the gc trigger and make sure that R never OS-allocates more? Or is there something else I could do to manage memory usage?
In short: no. I found that you simply cannot micromanage memory management and gc().
On the other hand, you could try to keep your data in memory, but 'outside' of R. The bigmemory package makes that fairly easy. Of course, using a 64-bit version of R and ample RAM may make the problem go away too.

Resources