Prometheus container-level buffered memory metric is missing

When I check the Prometheus custom metrics, I can see container_memory_cache, but container-level buffered-memory data is not available. When I run vmstat -S M, I can get the buffered memory as shown below, but in a Kubernetes architecture, running this command inside each pod would waste resources. Is there an alternative way to get this data for each pod? In addition, the vmstat metrics do not include buffered memory either. Any ideas? Thanks
vmstat -S M
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 0 456 2 19594 0 0 6 39 4 3 7 5 87 0 0

You can find a lot of information regarding container memory usage in cgroups, for example
/sys/fs/cgroup/memory/docker/{id}/memory.stat
Such files can be accessed from the node OS and will provide you with all the available information about a container's memory usage.
However, as far as I understand, buffered memory is not available inside containers, to avoid double caching by the node OS and the container itself.
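For example, a quick way to inspect these counters from the node (a sketch assuming cgroup v1 and the Docker runtime; the container ID is a placeholder):
# run on the node, not inside the pod
CONTAINER_ID=<full-container-id>
grep -E '^(cache|rss|mapped_file)' /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.stat
Exporters such as cAdvisor read these same cgroup files, which is where metrics like container_memory_cache come from, so you usually do not need to run anything inside the pods themselves.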

Related

Counting number of allocations into the Write Pending Queue - unexpected low result on NV memory

I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS, which is supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs with the Cascade Lake architecture, with 2 memory controllers per CPU. The Linux version is 5.4.0-3-amd64. I have the following simple loop and I am reading this counter for it. Array elements are 64 bytes in size, equal to a cache line.
for (int i = 0; i < 1000000; i++) {
    array[i].value = 2;
}
For this loop, when I map memory to the DRAM NUMA node, the counter gives around 150,000 as a result, which maybe makes sense: there are 6 channels in total for the 2 memory controllers in front of this NUMA node, which use the DRAM DIMMs in interleaving mode. Then for each channel there is one separate WPQ, I believe, so skx_unc_imc0 gets about 1/6 of all the stores (1,000,000 stores / 6 channels ≈ 167,000, close to the observed ~150,000). There are skx_unc_imc0-5 counters that I got with papi_native_avail, supposedly one for each channel.
The unexpected result is when, instead of mapping to the DRAM NUMA node, I map the program to Non-Volatile Memory, which is presented as a separate NUMA node on the same socket. There are 6 NVM DIMMs per socket that form one interleaved region. So when writing to NVM, 6 different channels should similarly be used, and in front of each there is the same single WPQ, which should again get 1/6 of the write inserts.
But UNC_M_WPQ_INSERTS returns only around 1,000 as a result on NV memory. I don't understand why; I expected it to similarly report around 150,000 writes into the WPQ.
Am I interpreting/understanding something wrong? Or are there two different WPQs per channel depending on whether the write goes to DRAM or NVM? Or what else could be the explanation?
It turns out that UNC_M_WPQ_INSERTS counts the number of allocations into the Write Pending Queue only for writes to DRAM.
Intel has added a corresponding hardware counter for persistent memory: UNC_M_PMM_WPQ_INSERTS, which counts write requests allocated in the PMM Write Pending Queue for Intel® Optane™ DC persistent memory.
However, there is no such native event showing up in papi_native_avail, which means it cannot be monitored with PAPI yet. In Linux 5.4, some of the PMM counters can be found directly in perf list uncore, such as unc_m_pmm_bandwidth.write - Intel Optane DC persistent memory bandwidth write (MB/sec), derived from unc_m_pmm_wpq_inserts, unit: uncore_imc. This implies that even though UNC_M_PMM_WPQ_INSERTS is not directly listed in perf list as an event, it should exist on the machine.
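For instance, one way to check which PMM-related events the kernel exposes on a given machine (the output will vary with kernel version and hardware):
perf list | grep -i pmm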
As described here, the EventCode for this counter is 0xE7, so it can be used with perf as a raw hardware event descriptor as follows: perf stat -e uncore_imc/event=0xe7/. However, it does not seem to support event modifiers to specify user-space-only counting with perf. Then, after pinning the thread to the same socket as the NVM NUMA node, for a program that basically only runs the loop described in the question, the result from perf roughly makes sense:
Performance counter stats for 'system wide':
    1,035,380      uncore_imc/event=0xe7/
So far this seems to be the best guess.
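A minimal invocation along those lines might look like the following (a sketch; ./store_loop and the NUMA node numbers are placeholders for the test program and the binding described above):
# run the store loop on the socket local to the NVM node, allocating from that node,
# and count PMM WPQ inserts system-wide (raw event 0xE7) for the duration of the run
perf stat -a -e uncore_imc/event=0xe7/ -- numactl --cpunodebind=0 --membind=2 ./store_loop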

Write back or write through to main memory

write-through: data is written to main memory through the cache immediately.
write-back: data is written to main memory at a later time.
I have a shared memory region located in NUMA node 1. Suppose Process A, executing on Node 0, modifies the contents of the shared memory, and then Process B, executing on Node 1, wants to read those contents.
If it is write-through mode, then the contents modified by Process A will be in Node 1's main memory, since a write from Node 0 to Node 1's main memory goes through Node 1's L3 cache; Process B can then get the contents modified by Process A from Node 1's L3 cache, not from Node 1's main memory.
If it is write-back mode, then when Process B on Node 1 wants to read the contents of the shared memory, the cache line will still be in Node 0's L3 cache, and getting it will cost more since it is in Node 0's cache.
I would like to know which mode the Intel(R) Xeon(R) CPU E5-2643 chooses, or does the Xeon decide on its own which mode to use, with nothing the programmer can do?
Edit:
dmidecode -t cache
shows that the Xeon cache operational mode is write-back, which looks reasonable, referring to
http://www.cs.cornell.edu/courses/cs3410/2013sp/lecture/18-caches3-w.pdf
Cache coherency on Intel (and AMD) x86-64 NUMA architectures does not work like that of a RAID array... Instead of having a single write-through or write-back cache, the two or four processor packages have a snooping & transfer protocol for synchronizing and sharing their L3 caches. OS-level support for controlling such things is generally very rough, even though NUMA has been mainstream for about ten years now.
Speaking specifically about Linux, control over the cache settings really boils down to a handful of process-level settings:
What core(s) your code is allowed to run on.
Whether your process is allowed to allocate non-local node memory.
Whether your process interleaves all of its allocations between NUMA nodes.
By default, the Linux kernel will allocate process memory from the NUMA node the process is actively running on, falling back to allocations on the other node if there's memory pressure on the local node.
You can control the pushing of data in and out of the L3 cache of the local node using x86 assembly primitives like LOCK, but in general you really, really, really should not care about anything more than your process running locally with its allocated memory.
For more information on this, I'd encourage you to read some of the Linux documentation on NUMA, and possibly also Intel's (QPI is the name of the cache-sharing technology).
A good start for you would probably be the Linux 'numactl' manpage (https://linux.die.net/man/8/numactl)
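For instance, the process-level controls listed above map onto numactl roughly like this (a sketch; ./app and the node numbers are placeholders):
numactl --hardware                          # show the node/CPU/memory topology
numactl --cpunodebind=1 --membind=1 ./app   # run on node 1 and allocate only from node 1's memory
numactl --interleave=all ./app              # interleave allocations across all NUMA nodes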

Lua and Torch issues with GPU

I am trying to run the Lua-based program from OpenNMT. I have followed the procedure from here: http://forum.opennmt.net/t/text-summarization-on-gigaword-and-rouge-scoring/85
I have used the command:
th train.lua -data textsum-train.t7 -save_model textsum1 -gpuid 0 1 2 3 4 5 6 7
I am using 8 GPUs, but the process is still very slow, as if it were running on the CPU. Kindly let me know what the solution might be for optimizing the GPU usage.
Here are the stats of the GPU usage:
Kindly let me know how I can make the process run faster using all of the GPUs. I have 11 GB available, but the process only consumes 2 GB or less. Hence the process is very slow.
As per the OpenNMT documentation, you need to remove the 0 right after the -gpuid option, since 0 stands for the CPU, and you are effectively reducing the training speed to that of a CPU-powered run.
To use data parallelism, assign a list of GPU identifiers to the -gpuid option. For example:
th train.lua -data data/demo-train.t7 -save_model demo -gpuid 1 2 4
will use the first, the second and the fourth GPU of the machine as returned by the CUDA API.
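Applied to the command in the question, that means dropping the leading 0, for example (assuming the eight GPUs are numbered 1 through 8 by the CUDA API):
th train.lua -data textsum-train.t7 -save_model textsum1 -gpuid 1 2 3 4 5 6 7 8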

cache coherence protocol AMD Opteron chips (MOESI?)

If I may start with an example.
Say we have a system of 4 sockets, where each socket has 4 cores and 2 GB of RAM, with
ccNUMA (cache-coherent non-uniform memory access) type of memory.
Let's say 4 processes are running, one on each socket, and all have some shared memory region allocated in P2's RAM, denoted SHM. This means any load/store to that region will incur a lookup into P2's directory, correct? If so, then... when that lookup happens, is it equivalent to accessing RAM in terms of latency? Where does this directory reside physically? (See below)
With a more concrete example:
Say P2 does a LOAD on SHM and that data is brought into P2's L3 cache with the tag '(O)wner'. Furthermore, say P4 does a LOAD on the same SHM. This will cause P4 to do a lookup into P2's directory, and since the data is tagged as Owned by P2, my question is:
Does P4 get SHM from P2's RAM or does it ALWAYS get the data from P2's L3 cache?
If it always gets the data from the L3 cache, wouldn't it be faster to get the data directly from P2's RAM, since it already has to do a lookup in P2's directory? And my understanding is that the directory literally sits on top of the RAM.
Sorry if I'm grossly misunderstanding what is going on here, but I hope someone can help clarify this.
Also, is there any data on how fast such a directory lookup is? In terms of data retrieval, is there documentation on the average latencies of such lookups? How many cycles for an L3 read hit, a read miss, or a directory lookup, etc.?
It depends on whether the Opteron processor implements the HT Assist mechanism.
If it does not, then there is no directory. In your example, when P4 issues a load, a memory request will arrive at P2's memory controller. P2 will answer back with the cache line and will also send probe messages to the other two sockets. Finally, those other two sockets will answer back to P4 with an ACK saying they do not have a copy of the cache line.
If HT Assist is enabled (typically for 6-core and higher sockets), then each L3 cache contains a snoop filter (directory) used to record which cores are holding a line. Thus, in your example, P2 will not send probe messages to the other two sockets, as it can look up the HT Assist directory and find out that no one else has a copy of the line (this is a simplification, as the state of the line would be Exclusive instead of Owned and no directory lookup would be needed).

Is there a monitoring tool like xentop that will track historical data?

I'd like to view historical data for guest cpu/memory/IO usage, rather than just current usage.
There is a Perl program I have written that does this. See link text.
It also supports logging to a URL.
Features:
perl xenstat.pl -- generate cpu stats every 5 secs
perl xenstat.pl 10 -- generate cpu stats every 10 secs
perl xenstat.pl 5 2 -- generate cpu stats every 5 secs, 2 samples
perl xenstat.pl d 3 -- generate disk stats every 3 secs
perl xenstat.pl n 3 -- generate network stats every 3 secs
perl xenstat.pl a 5 -- generate cpu avail (e.g. cpu idle) stats every 5 secs
perl xenstat.pl 3 1 http://server/log.php -- gather 3 secs cpu stats and send to URL
perl xenstat.pl d 4 1 http://server/log.php -- gather 4 secs disk stats and send to URL
perl xenstat.pl n 5 1 http://server/log.php -- gather 5 secs network stats and send to URL
Sample output:
[server~]# xenstat 5
cpus=2
40_falcon 2.67% 2.51 cpu hrs in 1.96 days ( 2 vcpu, 2048 M)
52_python 0.24% 747.57 cpu secs in 1.79 days ( 2 vcpu, 1500 M)
54_garuda_0 0.44% 2252.32 cpu secs in 2.96 days ( 2 vcpu, 750 M)
Dom-0 2.24% 9.24 cpu hrs in 8.59 days ( 2 vcpu, 564 M)
40_falc 52_pyth 54_garu Dom-0 Idle
2009-10-02 19:31:20 0.1 0.1 82.5 17.3 0.0 *****
2009-10-02 19:31:25 0.1 0.1 64.0 9.3 26.5 ****
2009-10-02 19:31:30 0.1 0.0 50.0 49.9 0.0 *****
Try Nagios or Munin.
Xentop is a tool to monitor the domains (VMs) running under Xen. VMware's ESX has a similar tool (I believe it's called esxtop).
The problem is that you'd like to see the historical CPU/Mem usage for domains on your Xen system, correct?
As with all virtualization layers, there are two views of this information relevant to admins: the burden imposed by the domain on the host, and what the domain thinks its process load is. If the domain thinks it is running low on resources but the host is not, it is easy to allocate more resources to the domain from the host. If the host runs out of resources, you'll need to optimize or turn off some of the domains.
Unfortunately, I don't know of any free tools to do this. XenSource provides a rich XML-RPC API to control and monitor their systems. You could easily build something from that.
If you only care about the domain-view of its own resources, I'm sure there are plenty of monitoring tools already available that fit your need.
As a disclaimer, I should mention that the company I work for, Leostream, builds virtualization management software. Unfortunately, it does not really do utilization monitoring.
Hope this helps.
Both Nagios and Munin seem to have plugins/support for Xen data collection.
A Xen Virtual Machine Monitor Plugin for Nagios
munin plugins
