Perf Stat vs Perf Record

I am confused about the difference between perf record and perf stat when it comes to counting events like page-faults, cache-misses, and anything else from perf list. I have two questions below; the answer to "Question 1" might also help answer "Question 2", but I wrote them out explicitly in case it doesn't.
Question 1:
It is my understanding that perf stat gets a "summary" of counts, but when used with the -I option it reports counts at the specified millisecond interval. With this option, does it sum the counts over the interval, average them over the interval, or something else entirely? I assume it is summed. The perf wiki states it is aggregated, but I guess that could mean either.
Question 2:
Why doesn't perf stat -e <event1> -I 1000 sleep 5 give about the same counts as if I summed up the counts over each second for the following command perf record -e <event1> -F 1000 sleep 5?
For example, if I use "page-faults" as event1, I get the outputs listed below under each command. (I am assuming the period field in perf record's perf.data file holds the counts for the event.)
PERF STAT
perf stat -e page-faults -I 1000 sleep 5
# time counts unit events
1.000252928 54 page-faults
2.000498389 <not counted> page-faults
3.000569957 <not counted> page-faults
4.000659987 <not counted> page-faults
5.000837864 2 page-faults
PERF RECORD
perf record -e page-faults -F 1000 sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (6 samples) ]
perf script -F period
1
1
1
5
38
164
I expected that if I summed up the counts from perf stat I would get the same as the sum from perf record. If I use the -c option with perf record and give an argument of 1 I do get a close match. Is this just a coincidence because of the relatively low number of page faults?
References I have used so far:
brendangregg's perf blog
The perf record and perf stat links on the page mentioned above as the "perf wiki"
I dug around here to see how and when perf record actually records vs when it writes to perf.data.
Thanks in advance for any and all insight you can provide.

First of all, your test case of using sleep and page-faults is not ideal: there should be no page-fault events during the sleep, so you can't really expect anything interesting. For the sake of easier reasoning, I suggest using the ref-cycles (hardware) event and a busy workload such as awk 'BEGIN { while(1){} }'.
Question 1: It is my understanding that perf stat gets a "summary" of
counts but when used with the -I option gets the counts at the
specified millisecond interval. With this option does it sum up the
counts over the interval or get the average over the interval, or
something else entirely? I assume it is summed up.
Yes. The values are just summed up. You can confirm that by testing:
$ perf stat -e ref-cycles -I 1000 timeout 10s awk 'BEGIN { while(1){} }'
# time counts unit events
1.000105072 2,563,666,664 ref-cycles
2.000267991 2,577,462,550 ref-cycles
3.000415395 2,577,211,936 ref-cycles
4.000543311 2,577,240,458 ref-cycles
5.000702131 2,577,525,002 ref-cycles
6.000857663 2,577,156,088 ref-cycles
[ ... snip ... ]
[ Note that it may not be as nicely consistent on all systems due to dynamic frequency scaling ]
$ perf stat -e ref-cycles -I 3000 timeout 10s awk 'BEGIN { while(1){} }'
# time counts unit events
3.000107921 7,736,108,718 ref-cycles
6.000265186 7,732,065,900 ref-cycles
9.000372029 7,728,302,192 ref-cycles
Question 2: Why doesn't perf stat -e <event1> -I 1000 sleep 5 give
about the same counts as if I summed up the counts over each second
for the following command perf record -e <event1> -F 1000 sleep 5?
perf stat -I is in milliseconds, whereas perf record -F is in Hz (1/s), so the command corresponding to perf stat -I 1000 is perf record -F 1. In fact, with our more stable event/workload, this looks better:
$ perf stat -e ref-cycles -I 1000 timeout 10s awk 'BEGIN { while(1){} }'
# time counts unit events
1.000089518 2,578,694,534 ref-cycles
2.000203872 2,579,866,250 ref-cycles
3.000294300 2,579,857,852 ref-cycles
4.000390273 2,579,964,842 ref-cycles
5.000488375 2,577,955,536 ref-cycles
6.000587028 2,577,176,316 ref-cycles
7.000688250 2,577,334,786 ref-cycles
8.000785388 2,577,581,500 ref-cycles
9.000876466 2,577,511,326 ref-cycles
10.000977965 2,577,344,692 ref-cycles
10.001195845 466,674 ref-cycles
$ perf record -e ref-cycles -F 1 timeout 10s awk 'BEGIN { while(1){} }'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.008 MB perf.data (17 samples) ]
$ perf script -F time,period
3369070.273722: 1
3369070.273755: 1
3369070.273911: 3757
3369070.273916: 3015133
3369070.274486: 1
3369070.274556: 1
3369070.274657: 1778
3369070.274662: 2196921
3369070.275523: 47192985748
3369072.663696: 2578692405
3369073.663547: 2579122382
3369074.663609: 2580015300
3369075.664085: 2579873741
3369076.664433: 2578638211
3369077.664379: 2578378119
3369078.664175: 2578166440
3369079.663896: 2579238122
So you see, the results eventually become stable for perf record -F as well. Unfortunately, the documentation of perf record is very poor. You can learn what the -c and -F settings mean by looking at the documentation of the underlying system call, man perf_event_open:
sample_period, sample_freq
A "sampling" event is one that generates an overflow notification every N events, where N is given by sample_period. A sampling event has sample_period > 0. When an overflow occurs, requested data is recorded in the mmap buffer. The sample_type field controls what data is recorded on each overflow.
sample_freq can be used if you wish to use frequency rather than period. In this case, you set the freq flag. The kernel will adjust the sampling period to try and achieve the desired rate. The rate of adjustment is a timer tick.
So while perf stat uses an internal timer to read the value of the counter every -I milliseconds, perf record sets an event overflow threshold to take a sample every -c events, i.e. every N events (e.g. every N page-faults or cycles). With -F, it tries to regulate this overflow value to achieve the desired frequency: it tries different values and tunes the period up or down accordingly. This eventually works for counters with a stable rate, but gives erratic results for dynamic events.
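You can see the fixed-period behaviour directly with a quick check along the same lines (a sketch; the exact number of samples will vary with your system):
$ perf record -e ref-cycles -c 1000000 timeout 2s awk 'BEGIN { while(1){} }'
$ perf script -F period
With -c, every sample reports the same period (1000000 here), whereas with -F the period column shows the kernel's changing period estimates as it tunes toward the requested rate, just like in the perf script output above.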

Related

Extract specific number from command output

I have the following issue.
In a script, I have to execute the hdparm command on /dev/xvda1.
From the command output, I have to extract the calculated MB/sec values.
So, for example, if executing the command gives this output:
/dev/xvda1:
Timing cached reads: 15900 MB in 1.99 seconds = 7986.93 MB/sec
Timing buffered disk reads: 478 MB in 3.00 seconds = 159.09 MB/sec
I have to extract 7986.93 and 159.09.
I tried:
grep -o -E '[0-9]+', but it returns all six numbers in the output
grep -o -E '[0-9]', but it returns only the first character of the six values.
grep -o -E '[0-9]+$', but the output is empty, I suppose because the numbers are not the last characters of the output.
How can I achieve my purpose?
To get the last number, you can add a .* in front; it will match as much as possible, eating away all the other numbers. However, to exclude that part from the output, you need GNU grep or pcregrep or sed.
grep -Po '.* \K[0-9.]+'
Or
sed -En 's/.* ([0-9.]+).*/\1/p'
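You can check either variant against one of the sample lines (assuming GNU grep and GNU sed):
$ line='Timing buffered disk reads: 478 MB in 3.00 seconds = 159.09 MB/sec'
$ echo "$line" | grep -Po '.* \K[0-9.]+'
159.09
$ echo "$line" | sed -En 's/.* ([0-9.]+).*/\1/p'
159.09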
Consider using awk to just print the fields you want rather than matching on numbers. This will work using any awk in any shell on every Unix box:
$ hdparm whatever | awk 'NF>1{print $(NF-1)}'
7986.93
159.09

check_cpu + nsclient : set critical threshold only on 5min period

I am using Centreon (Nagios) to monitor the CPUs of some VMs using NSClient. In my case it only makes sense to set the critical state of the CPU probe if the average CPU load is > 95 over the 5m period. Is this achievable?
I cannot find documentation on how to specify that in the critical parameter.
Default command
check_cpu
Returns
CPU Load ok
'total 5m load'=0%;80;90 'total 1m load'=0%;80;90 'total 5s load'=7%;80;90
Command with a specific threshold (but any time period can match)
check_cpu "critical=load > 90"
It is not exactly what I wanted to do, but here is what I did:
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "crit=load > 95" "warn=load > 90" time=5m
Which limits the output to the 5m time period.
Note that to execute this from Centreon you have to set the following variables inside the nsclient.ini file (I wasted a lot of time on that one):
[/settings/NRPE/server]
allow nasty characters=true
[/settings/external scripts]
allow nasty characters=true
Check this configuration:
define service{
        use                     generic-service
        host_name               xxx
        service_description     CPU Load
        check_command           check_nrpe!check_load
        contact_groups          sysadmin
}
---
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
You can try something like this:
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "warning=time = '5m' and load > 80" "critical=time = '5m' and load > 90" show-all
You can also check the documentation for more info.

How to monitor resources during slurm job?

I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system, and I'm interested in plotting the CPU and memory usage over time, i.e. while the job is running. I know about sacct and sstat and I was thinking of including these commands in my submission script, e.g. something along the lines of:
#!/bin/bash
#SBATCH <options>
# Running the actual job in background
srun my_program input.in output.out &
# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
#sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
#update job status
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
if [ "$JobStatus" == "RUNNING" ]; then
if [ $FIRST -eq 0 ]; then
sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
FIRST=1
else
sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
fi
sleep $STIME
elif [ "$JobStatus" == "PENDING" ]; then
sleep $STIME
else
sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
JobStatus="COMPLETED"
break
fi
done
However, I'm not really convinced of this solution:
- sstat unfortunately doesn't show how many CPUs are used at the moment (only the average)
- MaxRSS is also not helpful if I try to record memory usage over time
- there still seems to be some error (the script doesn't stop after the job finishes)
Does anyone have an idea how to do that properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.
Slurm offers a plugin to record a profile of a job (CPU usage, memory usage, even disk/network I/O for some technologies) into an HDF5 file. The file contains a time series for each tracked measure, and you can choose the time resolution.
You can activate it with
#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
See the documentation here.
To check that this plugin is installed, run
scontrol show config | grep AcctGatherProfileType
It should output AcctGatherProfileType = acct_gather_profile/hdf5.
The files are created in the folder referred to by the ProfileHDF5Dir Slurm configuration parameter (in slurm.conf).
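As a sketch, assuming the plugin is configured, you would request profiling in your batch script and then merge the per-node HDF5 files with sh5util once the job has finished:
#!/bin/bash
#SBATCH --profile=task
srun my_program input.in output.out
Then, after completion:
sh5util -j <jobid> -o profile.h5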
As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:
pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt
This will give you CPU and memory usage per process.
As a final note, your script never stops because of a chicken-and-egg problem: the while loop terminates when the job terminates, but the submission script is itself part of the job, so it is killed as soon as the job completes. The condition "$JobStatus" == "COMPLETED" will therefore never be observed from within the script.
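One way around this (a sketch building on your script, not a tested solution) is to invert the structure: run the monitoring loop in the background and the actual job in the foreground, then stop the monitor once srun returns:
#!/bin/bash
#SBATCH <options>
# record resource usage every 15 seconds in the background
(
    while true; do
        sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j "$SLURM_JOB_ID" >> usage.txt
        sleep 15
    done
) &
MONITOR_PID=$!
# run the actual job in the foreground
srun my_program input.in output.out
# the job has finished, so stop the monitor
kill "$MONITOR_PID"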

Sending a nagios alert when graphite does not get data

I am collecting some metrics using Graphite, but sometimes no data comes into it (probably because the server has gone down, or there is no network connectivity). I want Nagios to send me an alert during such an event. How do I do that?
You could use the check_file_age script from nagios-plugins to check a single known datapoint of interest per system that you are collecting data from.
check_file_age -w 600 -c 1800 /opt/graphite/storage/whisper/servers/$(hostname -f)/cpu/idl.wsp
That would warn you if that metric had not been updated for 10 minutes (600 seconds) and go critical after 30 minutes (1800 seconds).
Alternatively, you could run a find command over all the data points and report any that have not been updated in n hours.
#!/bin/bash
OLD_GRAPHS=$(find /opt/graphite/storage/whisper -mmin +120 -type f | wc -l)
if [[ "$OLD_GRAPHS" -gt 0 ]]; then
    echo "Found ${OLD_GRAPHS} graph(s) without an update in 120 minutes"
    exit 1
fi
echo "All graphs are up to date"
exit 0
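To have Nagios run that script, you could expose it over NRPE; a sketch with placeholder names (note that Nagios treats exit code 1 as WARNING and 2 as CRITICAL, so use exit 2 if missing data should be critical):
# on the Graphite host, in nrpe.cfg:
command[check_graphite_fresh]=/usr/local/nagios/libexec/check_graphite_fresh.sh
# on the Nagios host:
check_nrpe -H <graphite-host> -c check_graphite_fresh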

How to display all bandwidth values in iperf

I want to capture all bandwidth values in the iperf output, not only the Mbits values but bits and Kbits as well.
[3] 0.0 - 1.0 sec 128 Kbytes 1.05 Mbits/sec
[3] 1.0 - 2.0 sec 0 Kbytes 0.00 bits/sec
[3] 2.0 - 3.0 sec 90 Kbytes 900.5 Kbits/sec
So far I know about this:
iperf -c 10.0.0.1 -i 1 -t 100 | grep -Po '[0-9.]*(?= Mbits/sec)'
but that only captures the Mbits values. How can I capture bits/sec and Kbits/sec at the same time as Mbits/sec?
Thank you
I know this is old, but in case someone stumbles upon it: you could add an optional character class to your grep:
grep -Po '[0-9.]*(?= [KM]*bits/sec)'
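For example, against two of the sample lines:
$ printf '%s\n' '[3] 1.0 - 2.0 sec 0 Kbytes 0.00 bits/sec' '[3] 2.0 - 3.0 sec 90 Kbytes 900.5 Kbits/sec' | grep -Po '[0-9.]*(?= [KM]*bits/sec)'
0.00
900.5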
This should do it:
iperf -c 10.0.0.1 -i 1 -t 100 | awk '{print $5}' FPAT='[.0-9]+'
FPAT='[.0-9]+' defines a field as one or more of the characters . and 0-9 (note that FPAT is a GNU awk feature).
{print $5} prints just the rate, which is the fifth such field on each line.
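A quick check against one of the sample lines:
$ echo '[3] 2.0 - 3.0 sec 90 Kbytes 900.5 Kbits/sec' | awk '{print $5}' FPAT='[.0-9]+'
900.5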
You might want to man iperf to see what's supported. Here's the relevant option from 2.0.10:
-f, --format
[abkmgKMG] format to report: adaptive, bits, Kbits, Mbits, KBytes, MBytes (see NOTES for more)
