perf report - what are the two percentage columns?

Does anyone know where the perf report outputs are documented? Or in particular what the two percentage columns on the left are? I have found a number of examples showing a single percentage column but I get two. The commands I used are given below.
Many thanks!
perf record -g -a sleep 1
# Note: -a == all CPUs; -g enables backtrace recording for call graphs
perf report
+ 78.09% 0.00% node libc-2.19.so [.] __libc_start_main
+ 77.71% 0.00% node node [.] node::Start(int, char**)
...

Using -a does not make a lot of sense if you want to inspect the sleep command.
It is usually the percentage of time spent in the function: the first column counts the function together with all its children (the functions it calls), and the second counts the function itself, without its children.
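If you only want the single self-overhead column shown in the examples you found, you can ask perf report to hide the accumulated column; a minimal sketch, assuming a reasonably recent perf:
# Hide the accumulated "Children" column and show only "Self"
perf report --no-children
# Default behaviour: show both "Children" (function plus its callees) and "Self"
perf report --children
With the default mode the first column can exceed the second, because time spent in callees is attributed to every caller on the stack.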

Related

Snakemake limit the memory usage of jobs

I need to run 20 genomes with Snakemake, so I am using basic steps like alignment, markduplicates, realignment, base-call recalibration and so on in Snakemake. The machine I am using has up to 40 virtual cores and 70 GB of memory, and I run the program like this.
snakemake -s Snakefile -j 40
This works fine, but as soon as it runs markduplicates alongside other programs, it stops; I think it exceeds the available 70 GB and crashes.
Is there a way to tell Snakemake to keep the total memory use of all running jobs under 60 GB? I would like Snakemake to run fewer jobs in order to stay under 60 GB, since some of the steps require a lot of memory. The command line below crashed as well and used more memory than allocated.
snakemake -s Snakefile -j 40 --resources mem_mb=60000
It's not enough to specify --resources mem_mb=60000 on the command line; you also need to specify mem_mb for the rules you want to keep in check. E.g.:
rule markdups:
    input: ...
    output: ...
    resources:
        mem_mb=20000
    shell: ...

rule sort:
    input: ...
    output: ...
    resources:
        mem_mb=1000
    shell: ...
This will submit jobs in such a way that you don't exceed a total of 60 GB at any one time. E.g. this will keep running at most 3 markdups jobs, or 2 markdups jobs and 20 sort jobs, or 60 sort jobs.
Rules without mem_mb will not be counted towards memory usage, which is probably ok for rules that e.g. copy files and do not need much memory.
How much to assign to each rule is mostly up to your guess. The top and htop commands help in monitoring jobs and figuring out how much memory they need. More elaborate solutions could be devised, but I'm not sure it's worth it... If you use a job scheduler like Slurm, the log files should give you the peak memory usage of each job, so you can use them for future guidance. Maybe others have better suggestions.
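Once the rules declare mem_mb, the invocation from the question should stay within the 60 GB budget; as a quick sanity check (a sketch, just the same command plus -n), a dry run parses the Snakefile and lists the jobs without executing anything:
# Dry run: check that the Snakefile still parses and list the jobs that would be executed
snakemake -s Snakefile -j 40 --resources mem_mb=60000 -n
# Real run once the plan looks right
snakemake -s Snakefile -j 40 --resources mem_mb=60000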

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep with GNU parallel on a single machine with multiple cores, based on the large_file size, the small_file size and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else that I'm missing here)? What are the performance issues/speed bottlenecks I'll run into when setting it too high or too low? I understand what block-size does, in that it splits large_file into chunks and sends those chunks to each job, but I'm still missing how and why that would impact the speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting will be done on the fly, so it will not be read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, or command startup time.
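A practical way to find a good value is simply to time a few runs with different block sizes and compare them (a sketch reusing the file names from the question):
# One block per jobslot (10 blocks in total)
time parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
# Ten blocks per jobslot (100 blocks): helps when some blocks are much slower than others
time parallel --pipepart --block -10 --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
# Fixed block size, as in the original command
time parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
While a run is going, top (CPU) and iostat (disk) can show which resource is saturated.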

What is the difference between 'time -f "%M"' and 'valgrind --tool=massif'?

I want to see the peak memory usage of a command. I have a parametrized algorithm and I want to know when the program will crash due to an out-of-memory error on my machine (12 GB RAM).
I tried:
/usr/bin/time -f "%M" command
valgrind --tool=massif command
The first one gave me 1414168 (1.4GB; thank you ks1322 for pointing out it is measured in KB!) and valgrind gave me
$ ms_print massif.out
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
75 26,935,731,596 22,420,728 21,956,875 463,853 0
I'm a bit confused which number I should take, but let's assume "total" (22MB).
And the massif-visualizer shows me a peak of about 206MB.
Now I have 3 different numbers for the same command:
valgrind --tool=massif command + ms_print: 22MB
valgrind --tool=massif command + massif-visualizer: 206MB (this is what I see in htop and I guess this is what I'm interested in)
time -f "%M" command: 1.4GB
Which is the number I should look at? Why are the numbers different at all?
/usr/bin/time -f "%M" measures the maximum RSS (resident set size), that is the memory used by the process that is in RAM and not swapped out. This memory includes the heap, the stack, the data segment, etc.
This measures the max RSS of the children processes (including grandchildren) taken individually (not the max of the sum of the RSS of the children).
valgrind --tool=massif, as the documentation says:
measures only heap memory, i.e. memory allocated with malloc, calloc, realloc, memalign, new, new[], and a few other, similar functions. This means it does not directly measure memory allocated with lower-level system calls such as mmap, mremap, and brk
This measures only the memory in the child (not grandchildren).
This does not measure the stack nor the text and data segments.
(options like --pages-as-heap=yes and --stacks=yes make it measure more)
So in your case the differences are:
time takes into account the grandchildren, while valgrind does not
time does not measure the memory swapped out, while valgrind does
time measures the stack and data segments, while valgrind does not
You should now:
check if some children are responsible for the memory consumption
try profiling with valgrind --tool=massif --stacks=yes to check the stack
try profiling with valgrind --tool=massif --pages-as-heap=yes to check the rest of the memory usage
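Put together, a hedged set of commands for narrowing this down could look like this (command stands for your actual program):
# Max RSS of the process and its descendants, reported in kilobytes
/usr/bin/time -f "%M" command
# Heap plus stack
valgrind --tool=massif --stacks=yes command
ms_print massif.out.<pid>   # massif writes its output to massif.out.<pid> by default
# Count every page the process maps (malloc, mmap, brk, ...), closer to what htop shows
valgrind --tool=massif --pages-as-heap=yes command
ms_print massif.out.<pid>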

What are the columns in perf-stat when run with -r and -x

I'm trying to interpret the results of perf-stat run on a program. I know that it was run with -r 30 and -x. From https://perf.wiki.kernel.org/index.php/Tutorial it says that if run with -r the stddev will be reported, but I'm not sure which of these columns that is, and I'm having trouble finding information on the output when run with -x. One example of the output I've received is this
19987,,cache-references,0.49%,562360,100.00
256,,cache-misses,10.65%,562360,100.00
541747,,branches,0.07%,562360,100.00
7098,,branch-misses,0.78%,562360,100.00
60,,page-faults,0.43%,560411,100.00
0.560244,,cpu-clock,0.28%,560411,100.00
0.560412,,task-clock,0.28%,560411,100.00
My guess is that the % column is the standard deviation as a percentage of the first column but I'm not sure.
My question in summary is what do the columns represent? Which column is the standard deviation?
You are very close. Here are some blanks filled in.
1. Arithmetic mean of the measured values.
2. The unit, if known. E.g. on my system it shows 'msec' for 'cpu-clock'.
3. Event name.
4. Standard deviation, scaled so that the mean = 100%.
5. Time during which counting this event was actually running.
6. Fraction of the enabled time during which this event was actually running (in %).
The last two are relevant for multiplexing: If there are more counters selected than can be recorded concurrently, the denoted percentage will drop below 100.
On my system (Linux 5.0.5, not sure since when this is available), there is also a shadow stat for some metrics, which computes a derived metric. For example, cpu-clock computes 'CPUs utilized' and branch-misses computes the fraction of all branches that are missed. That adds two more columns:
7. Shadow stat value.
8. Shadow stat description.
Note that this format changes with some other options. For example, if you display the metrics with a finer-grained grouping (e.g. per CPU), information about those groups will be prepended as additional columns.
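To tie this back to the output in the question, here is a sketch of reproducing it and reading one line (./your_program is a placeholder for the benchmarked binary):
# 30 repetitions, CSV-style output with ',' as the field separator
# (./your_program is a placeholder; substitute the program that was measured)
perf stat -r 30 -x, -e cache-references,cache-misses,branches,branch-misses,page-faults,cpu-clock,task-clock ./your_program
# Reading the first line of the question's output:
# 19987,,cache-references,0.49%,562360,100.00
# = mean over the 30 runs, unit (empty), event name, stddev as a percentage of the mean,
#   time the counter was running, fraction of enabled time it was running (%)
So the standard deviation is the 0.49% column, i.e. roughly 0.49% of 19987.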

How to split/filter events from perf.data?

Question: Is there a way to extract the samples for a specific event from a perf.data that has multiple events?
Context
I have a recorded sample of two events which I took by running something like
perf record -e cycles,instructions sleep 30
As far as I can see, other perf commands such as perf diff and perf script don't have an events filter flag. So it would be useful to split my perf.data into cycles.perf.data and instructions.perf.data.
Thanks!
