What is the difference between 'time -f "%M"' and 'valgrind --tool=massif'?

I want to see the peak memory usage of a command. I have a parametrized algorithm and I want to know when the program will crash with an out-of-memory error on my machine (12GB RAM).
I tried:
/usr/bin/time -f "%M" command
valgrind --tool=massif command
The first one gave me 1414168 (1.4GB; thank you ks1322 for pointing out it is measured in KB!) and valgrind gave me
$ ms_print massif.out
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
75 26,935,731,596 22,420,728 21,956,875 463,853 0
I'm a bit confused about which number I should take, but let's assume "total" (22MB).
And massif-visualizer shows me a different number again.
Now I have 3 different numbers for the same command:
valgrind --tool=massif command + ms_print: 22MB
valgrind --tool=massif command + massif-visualizer: 206MB (this is what I see in htop and I guess this is what I'm interested in)
time -f "%M" command: 1.4GB
Which is the number I should look at? Why are the numbers different at all?

/usr/bin/time -f "%M" measures the maximum RSS (resident set size), that is, the memory used by the process that is in RAM and not swapped out. This memory includes the heap, the stack, the data segment, etc.
It measures the max RSS of the child processes (including grandchildren) taken individually, not the max of the sum of their RSS.
valgrind --tool=massif, as the documentation says:
measures only heap memory, i.e. memory allocated with malloc, calloc, realloc, memalign, new, new[], and a few other, similar functions. This means it does not directly measure memory allocated with lower-level system calls such as mmap, mremap, and brk
This measures only the memory in the child (not grandchildren).
This does not measure the stack nor the text and data segments.
(options like --pages-as-heap=yes and --stacks=yes make it measure more)
So in your case the differences are:
time takes into account the grandchildren, while valgrind does not
time does not measure the memory swapped out, while valgrind does
time measures the stack and data segments, while valgrind does not
You should now:
check if some children are responsible for the memory consumption
try profiling with valgrind --tool=massif --stacks=yes to check the stack
try profiling with valgrind --tool=massif --pages-as-heap=yes to check the rest of the memory usage
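To see the gap concretely, here is a minimal C sketch (my own illustration, not code from the question): it allocates 100MB twice, once with malloc and once with mmap, and touches every page. /usr/bin/time -f "%M" should count both regions, while a default massif run only sees the malloc part:

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define SIZE (100UL * 1024 * 1024)  /* 100MB each */

int main(void) {
    /* Heap allocation: massif intercepts malloc, so this is counted. */
    char *heap = malloc(SIZE);
    /* Anonymous mapping: not counted by massif's default mode. */
    char *map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap == NULL || map == MAP_FAILED)
        return 1;
    /* Touch every page so both regions become resident and are
       included in the max RSS that /usr/bin/time -f "%M" reports. */
    memset(heap, 1, SIZE);
    memset(map, 1, SIZE);
    munmap(map, SIZE);
    free(heap);
    return 0;
}

With --pages-as-heap=yes, massif switches to tracking memory at the page level, and the two tools should roughly agree again.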

Related

Counting the number of allocations into the Write Pending Queue - unexpectedly low result on NV memory

I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS, which is supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs (Cascade Lake) with 2 memory controllers per CPU; the Linux version is 5.4.0-3-amd64. I have the following simple loop and I am reading this counter for it. Array elements are 64 bytes in size, equal to one cache line.
for (int i = 0; i < 1000000; i++) {
    array[i].value = 2;
}
For this loop, when I map memory to the DRAM NUMA node, the counter gives around 150,000 as a result, which maybe makes sense: there are 6 channels in total for the 2 memory controllers in front of this NUMA node, which use DRAM DIMMs in interleaving mode. For each channel there is one separate WPQ, I believe, so skx_unc_imc0 gets 1/6 of all the stores. There are skx_unc_imc0-5 counters listed by papi_native_avail, supposedly one per channel.
The unexpected result comes when, instead of mapping to the DRAM NUMA node, I map the program to Non-Volatile Memory, which is presented as a separate NUMA node on the same socket. There are 6 NVM DIMMs per socket, which create one interleaved region. So when writing to NVM, 6 channels should similarly be used, and in front of each there is the same single WPQ, which should again get 1/6 of the write inserts.
But UNC_M_WPQ_INSERTS returns only around 1,000 on NV memory. I don't understand why; I expected it to similarly report around 150,000 WPQ writes.
Am I interpreting/understanding something wrong? Or are there two different WPQs per channel depending on whether the write goes to DRAM or NVM? Or what else could be the explanation?
It turns out that UNC_M_WPQ_INSERTS counts the number of allocations into the Write Pending Queue, only for writes to DRAM.
Intel has added a corresponding hardware counter for persistent memory: UNC_M_PMM_WPQ_INSERTS, which counts write requests allocated in the PMM Write Pending Queue for Intel® Optane™ DC persistent memory.
However, there is no such native event showing up in papi_native_avail, which means it can't be monitored with PAPI yet. In Linux version 5.4, some of the PMM counters can be found directly in perf list uncore, such as unc_m_pmm_bandwidth.write - Intel Optane DC persistent memory bandwidth write (MB/sec), derived from unc_m_pmm_wpq_inserts, unit: uncore_imc. This implies that even though UNC_M_PMM_WPQ_INSERTS is not directly listed in perf list as an event, it should exist on the machine.
As described here, the EventCode for this counter is 0xE7, so it can be used with perf as a raw hardware event descriptor: perf stat -e uncore_imc/event=0xe7/. However, it does not seem to support event modifiers for restricting counting to user space. After pinning the thread to the same socket as the NVM NUMA node, for the program that basically only runs the loop described in the question, the result of perf makes rough sense:
Performance counter stats for 'system wide':
    1,035,380    uncore_imc/event=0xe7/
So far this seems to be the best guess.
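As a rough illustration of what perf stat -e uncore_imc/event=0xe7/ does underneath, here is a hedged C sketch using perf_event_open (the sysfs path and the reading of 0xE7 as UNC_M_PMM_WPQ_INSERTS follow the discussion above; uncore events are counted system-wide per socket, so pid is -1 and a CPU on the right socket must be chosen; error handling is minimal):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    /* The dynamic PMU type of the first IMC is exported in sysfs. */
    FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
    int type;
    if (!f || fscanf(f, "%d", &type) != 1)
        return 1;
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;      /* uncore_imc_0 PMU */
    attr.config = 0xE7;    /* raw event code, assumed UNC_M_PMM_WPQ_INSERTS */

    /* Uncore events are per-socket: pid = -1 (system wide), cpu = 0. */
    int fd = syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    sleep(1);              /* ...run the measured loop here instead... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("PMM WPQ inserts (imc0): %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}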

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep with GNU parallel on a single machine with multiple cores, based on the "large_file" size, the "small_file" size and the machine I'm using, to get the fastest performance possible? (Or please correct me if there is something else I'm missing here.) What performance issues/speed bottlenecks will I run into when setting it too high or too low? I understand what block-size does, in that it chops large_file into chunks and sends those chunks to each job, but I'm still missing how and why that would impact the speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting is done on the fly, so the file is not read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.
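For intuition only, here is a small C sketch of the kind of splitting --pipepart --block -1 performs (an illustration, not GNU parallel's actual implementation): it computes one byte range per jobslot and moves each boundary forward to the next newline, so each job can read its chunk straight from the file without the whole file ever being held in RAM:

#include <stdio.h>
#include <stdlib.h>

/* Print n byte ranges covering the file, each ending on a newline. */
int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s file njobs\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (!f) { perror("fopen"); return 1; }
    int njobs = atoi(argv[2]);

    fseek(f, 0, SEEK_END);
    long size = ftell(f);

    long start = 0;
    for (int i = 1; i <= njobs; i++) {
        long end = size * i / njobs;  /* tentative, may split a line */
        if (i < njobs) {
            fseek(f, end, SEEK_SET);
            int c;
            while ((c = fgetc(f)) != EOF && c != '\n')
                end++;            /* push boundary to end of current line */
            if (c == '\n')
                end++;            /* chunk ends just after the newline */
        } else {
            end = size;           /* last chunk takes the remainder */
        }
        printf("job %d: bytes %ld-%ld\n", i, start, end - 1);
        start = end;
    }
    fclose(f);
    return 0;
}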

SBCL used memory reports from top and (room) differs

I am running SBCL 1.0.51 on a Linux (Fedora 15) 32-bit system (kernel 3.6.5) with 1GB RAM and 256MB of swap space.
I fire up sbcl --dynamic-space-size 125 and start calling a function that makes ~10000 HTTP requests (using drakma) to an HTTP (CouchDB) server, and I just format the results of an operation on the returned data to standard output.
After each call I do a (sb-ext:gc :full t) and then (room). The results are not growing. No matter how many times I run the function, (room) reports the same used space (with some ups and downs, but around the same average which does not grow).
BUT: after every call to the function, top reports that the VIRT and RES figures of the sbcl process keep growing, even beyond the 125MB space I told sbcl to use. So I have the following questions:
Why does the top-reported memory keep growing while (room) says it does not? The only thing I can think of is some leakage through FFI. I am not directly calling out with FFI, but maybe some drakma dependency does and forgets to free its C garbage. Anyway, I don't know if this could even be an explanation. Could it be something else? Any insights?
Why isn't --dynamic-space-size honoured?

Erlang: discrepancy of memory usage figures

When I run my WebSocket test, I found the following interesting memory usage results:
Server started, no connections
[{total,573263528},
{processes,17375688},
{processes_used,17360240},
{system,555887840},
{atom,472297},
{atom_used,451576},
{binary,28944},
{code,3774097},
{ets,271016}]
44 processes,
System: 705M,
Erlang Residence: 519M
100K Connections
[{total,762564512},
{processes,130105104},
{processes_used,130089656},
{system,632459408},
{atom,476337},
{atom_used,456484},
{binary,50160},
{code,3925064},
{ets,7589160}]
100044 processes,
System: 1814M,
Erlang Residence: 950M
200K Connections
(server restarted and connections created from 0, not continued from case 2)
[{total,952040232},
{processes,243161192},
{processes_used,243139984},
{system,708879040},
{atom,476337},
{atom_used,456484},
{binary,70856},
{code,3925064},
{ets,14904760}]
200044 processes,
System: 3383M,
Erlang Residence: 1837M
The "System:" and "Erlang Residence:" figures are provided by htop; the others are the output of the memory() call from the Erlang shell. Please look at the total and the Erlang residence memory. With no connections, these two are roughly the same; with 100K connections, the residence memory is a little larger than total; with 200K connections, it is almost double the total.
Can anybody explain?
The most probable answer to your question is memory fragmentation.
Allocating OS memory is expensive, so Erlang tries to manage memory for you.
When Erlang allocates memory, it creates an entity called a "carrier", which consists of many "blocks". Erlang's memory(total) reports the sum of all block sizes (the memory actually used), while the OS reports the sum of all carrier sizes (memory used plus memory preallocated). Both sums can be read from the Erlang VM. If (sum of block sizes)/(sum of carrier sizes) << 1, then the VM has a hard time freeing carriers: there may be many big carriers with only a couple of blocks used. You can read the raw numbers with erlang:system_info({allocator, Type}), but there is an easier way: you can check it using the Recon library:
http://ferd.github.io/recon/recon_alloc.html
Firstly check:
recon_alloc:fragmentation(current).
and next:
recon_alloc:fragmentation(max).
This should explain the difference between the total memory reported by the Erlang VM and by the OS. If you are sending many small messages over websockets, you can decrease the fragmentation by running Erlang with two options:
erl +MBas aobf +MBlmbcs 512
The first option changes the block allocation strategy from best fit to address order best fit, which could help squeeze more blocks into the first carriers; the second decreases the maximum multiblock carrier size, which makes carriers smaller (this should make freeing them easier).
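The underlying effect is easy to reproduce in plain C (an analogy to carriers/blocks, not the Erlang allocator itself): allocate many blocks, free every other one, and watch the OS-visible RSS stay high because the allocator cannot return half-empty pages:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBLOCKS 10000
#define BLOCKSIZE 4096

/* Print the VmRSS line from /proc/self/status (Linux-specific). */
static void print_rss(const char *label) {
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    while (f && fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", label, line);
    if (f) fclose(f);
}

int main(void) {
    static char *blocks[NBLOCKS];

    for (int i = 0; i < NBLOCKS; i++) {
        blocks[i] = malloc(BLOCKSIZE);
        memset(blocks[i], 1, BLOCKSIZE);  /* touch so pages are resident */
    }
    print_rss("all allocated:");

    /* Free every other block: half the memory is now unused, but the
       holes keep almost every page partially occupied, so the RSS
       reported by the OS stays high - this is fragmentation. */
    for (int i = 0; i < NBLOCKS; i += 2) {
        free(blocks[i]);
        blocks[i] = NULL;
    }
    print_rss("half freed:   ");
    return 0;
}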

Measure top memory consumption (linux program)

How can I measure the top (maximum) memory usage of some program?
It does a lot of malloc/free and runs rather fast, so I can't see the max memory in top.
I want something like the time utility:
$ time ./program
real xx sec
user xx sec
sys xx sec
and
$ mem_report ./program
max memory used xx mb
shared mem xx mb
The time you are calling is your shell's built-in. If you call /usr/bin/time (the program), you will get some knowledge of resident memory usage. Note however that it may not count memory-mapped files, shared memory and other details which you may need.
If you are on Linux, you can wrap your program in a script that polls:
# for your current process
/proc/self/statm
# or a process you know the pid of
/proc/{pid}/statm
and writes out the results - you can aggregate them afterwards.
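If a polling script is too coarse, the mem_report wrapper from the question can be approximated in a few lines of C via wait4, which reports the child's resource usage including the maximum RSS (a sketch; on Linux ru_maxrss is in KB and has the same caveats as /usr/bin/time):

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/resource.h>

/* Usage: ./mem_report command [args...] */
int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[1], &argv[1]);  /* run the measured program */
        perror("execvp");
        _exit(127);
    }
    int status;
    struct rusage ru;
    /* wait4 fills in resource usage for the exited child,
       including the maximum resident set size. */
    if (wait4(pid, &status, 0, &ru) < 0) {
        perror("wait4");
        return 1;
    }
    printf("max memory used %ld KB\n", ru.ru_maxrss);
    return 0;
}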
