Is it possible to raise the sampling frequency of perf stat?

I am using perf for profiling, but the number of PMU events I monitor is higher than the number of hardware counters, so the round-robin multiplexing strategy is triggered. However, some of my test cases run for less than a millisecond, which means that if the execution time is shorter than the multiplicative inverse of the default switch frequency (1000 Hz), some events may not be profiled at all.
How can I raise the switch frequency of perf stat, the way perf record -F <frequency> raises the sampling frequency, to make sure that every event is recorded, even if the measurement overhead increases slightly?

First off, remember that sampling is different from counting.
perf record samples the events that occur during the profiling period. This means it does not record every event that happened (this can be tweaked, of course!). You can raise the frequency of sample collection to increase the number of samples collected; roughly, for every N (some number > 0) events that occur, perf record records only one of them.
perf stat, by contrast, counts all the events that occur. For each event that happens, perf stat counts it and tries not to miss any, unlike sampling. Of course, the counts may not be accurate if multiplexing is involved (i.e. when the number of events measured is greater than the number of available hardware counters). There is no concept of a sampling frequency in perf stat, since all it does is directly count the events you ask it to measure.
You can see this in the Linux kernel source code: perf stat sets the sample period (the inverse of the sample frequency) to 0, so you know what the sample frequency is ;)
Anyway, what you can do is run perf stat in verbose mode (perf stat -v) to see and understand what is happening with all of the events you are measuring.
To understand more about perf stat, you can also read this answer.

Related

How to aggregate PAPI uncore events such as skx_unc_imc0 to measure for all devices?

PAPI counts events per device (which can be an integrated memory controller (iMC) or a caching and home agent (CHA)). These are counted as separate events, such as skx_unc_imc0::UNC_M_RPQ_OCCUPANCY for iMC 0. Is there a way to measure this for all the iMCs at the same time?
The Linux perf tool can do this, an equivalent event is unc_m_rpq_occupancy, but this cannot be measured from PAPI.

How to space out influxdb continuous query execution?

I have many InfluxDB continuous queries (CQs) used to downsample data over a period of time. At one point the load became high and InfluxDB ran out of memory while executing the continuous queries.
Say I have 10 CQs, and all 10 execute in InfluxDB at the same time. That impacts memory heavily. I am not sure whether there is a way to evenly space out, or add some delay between, the execution of each CQ. My speculation is that executing all the CQs at the same time makes InfluxDB crash. All the CQs are specified in the InfluxDB config, so I hoped there might be a way to include a time delay between the CQs there, but I don't know exactly how. One sample CQ:
CREATE CONTINUOUS QUERY "cq_volume_reads" ON "metrics"
BEGIN
SELECT sum(reads) as reads INTO rollup1.tier_volume FROM
"metrics".raw.tier_volume GROUP BY time(10m),*
END
And also I don't know whether this is the best way to resolve the problem. Any thoughts on this approach or suggesting any better approach will be much appreciated. It would be great to get suggestions in using debugging tools for influxdb as well. Thanks!
@Rajan - A few comments:
The canonical documentation for CQs is here. Much of what I'm suggesting is from there.
Are you using back-referencing? I see your example CQ uses GROUP BY time(10m),* - the * wildcard is usually used with backreferences. Otherwise, I don't believe you need to include the * to indicate grouping by all tags - it should already be grouped by all tags.
If you are using backreferences, that runs the CQ for each measurement in the metrics database. This is potentially very many CQ executions at the same time, especially if you have many CQ defined this way.
You can set an offset with GROUP BY time(10m, <offset>), but this also shifts the time interval used by your aggregation function (sum in your example): if your offset is 1 minute, then each point will be a sum of the data between e.g. 13:11 -> 13:21 instead of 13:10 -> 13:20. This will offset execution, but may not work for your downsampling use case. From a signal processing standpoint, a 1 minute offset wouldn't change the validity of the downsampled data, but it might produce unwanted graphical display problems depending on what you are doing. I do suggest trying this option.
Otherwise, you can try to reduce the number of downsampling CQs to reduce memory pressure or downsample on a larger timescale (e.g. 20m) or lastly, increase the hardware resources available to InfluxDB.
For managing memory usage, look at this post. There are not many adjustments in 1.8 but there are some.

Counting number of allocations into the Write Pending Queue - unexpected low result on NV memory

I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS, which is supposed to count the number of allocations into the Write Pending Queue (WPQ). The machine has 2 Intel Xeon Gold 5218 (Cascade Lake) CPUs, with 2 memory controllers per CPU. The Linux version is 5.4.0-3-amd64. I have the following simple loop and I am reading this counter for it. The array elements are 64 bytes in size, equal to a cache line.
for (int i = 0; i < 1000000; i++) {
    array[i].value = 2;
}
For this loop, when I map the memory to the DRAM NUMA node, the counter gives around 150,000 as a result, which roughly makes sense: there are 6 channels in total for the 2 memory controllers in front of this NUMA node, which use DRAM DIMMs in interleaving mode. Then for each channel there is one separate WPQ, I believe, so skx_unc_imc0 sees about 1/6 of all the stores. There are skx_unc_imc0-5 counters that I got with papi_native_avail, supposedly one for each channel.
The unexpected result is when, instead of mapping to the DRAM NUMA node, I map the program to Non-Volatile Memory, which is presented as a separate NUMA node on the same socket. There are 6 NVM DIMMs per socket, which form one interleaved region. So when writing to NVM, 6 different channels should similarly be used, and in front of each there is the same single WPQ, which should again get 1/6 of the write inserts.
But UNC_M_WPQ_INSERTS returns only around 1,000 as a result on NV memory. I don't understand why; I expected it to similarly give around 150,000 writes in the WPQ.
Am I interpreting or understanding something wrong? Or are there two different WPQs per channel, depending on whether the write goes to DRAM or NVM? Or what else could be the explanation?
It turns out that UNC_M_WPQ_INSERTS counts the number of allocations into the Write Pending Queue only for writes to DRAM.
Intel has added a corresponding hardware counter for Persistent Memory: UNC_M_PMM_WPQ_INSERTS, which counts write requests allocated in the PMM Write Pending Queue for Intel® Optane™ DC persistent memory.
However, there is no such native event showing up in papi_native_avail, which means it can't be monitored with PAPI yet. In Linux version 5.4, some of the PMM counters can be found directly in perf list uncore, such as unc_m_pmm_bandwidth.write - Intel Optane DC persistent memory bandwidth write (MB/sec), derived from unc_m_pmm_wpq_inserts, unit: uncore_imc. This implies that even though UNC_M_PMM_WPQ_INSERTS is not directly listed in perf list as an event, it should exist on the machine.
As described here, the EventCode for this counter is 0xE7, so it can be used with perf as a raw hardware event descriptor as follows: perf stat -e uncore_imc/event=0xe7/. However, it seems that perf does not support event modifiers on it to restrict counting to user space. After pinning the thread to the same socket as the NVM NUMA node, for the program that basically only runs the loop described in the question, the result from perf roughly makes sense:
Performance counter stats for 'system wide':

        1,035,380      uncore_imc/event=0xe7/
So far this seems to be the best guess.

Getting Linux perf samples even if my program is in sleep state?

With a program that has no sleep call, perf collects callgraph samples well.
void main()
{
    while (true)
    {
        printf(...);
    }
}
For example, more than 1,000 samples in a second.
I recorded the profile with:
sudo perf record -p <process_id> -g
and viewed it with perf report.
However, when I do it with a program with sleep function, perf does not collect callgraph samples well: only a few samples in a second.
void main()
{
    while (true)
    {
        sleep(1);
        printf(...);
    }
}
I want to collect the callgraph samples even when my program is in the sleep state, a.k.a. device time. In Windows with VSPerf, the callgraph during the sleep state is also collected well.
Collecting callgraph for sleep state is needed for finding performance bottleneck not only in CPU time but also in device time (e.g. accessing database).
I guess there may be a perf option for collecting samples even if my program is in sleep state, because not only I but also many other programmers may want it.
How can I get the perf samples even if my program is in a sleep state?
After posting this question, we found that perf record -c 1 captures about 10 samples per second. Without -c 1, perf captured 0.3 samples per second. 10 samples per second is much better for now, but it is still far below 1,000 samples per second.
Is there any better way?
CPU samples while your process is in the sleep state are mostly useless, but you could emulate this behavior by using an event that records the beginning and end of the sleep syscall (capturing the stacks), and then adding the "sleep stacks" yourself in post-processing: duplicate the entry stack a number of times consistent with the duration of each sleep.
After all, the stack isn't going to change while the thread sleeps.
When you specify a profiling target, perf will only account for events generated by that target. Quite naturally, a sleeping target doesn't generate many performance events.
If you would like to see other processes (like a database?) in your callgraph reports, try system-wide sampling:
-a, --all-cpus
System-wide collection from all CPUs (default if no target is specified).
(from perf man page)
In addition, if you plan to spend a lot of time actually looking at the reports, there is a tool I cannot recommend enough: FlameGraphs. This visualization may save you a great deal of effort.

Flash Memory Management

I'm collecting data on an ARM Cortex M4 based evaluation kit in a remote location and would like to log the data to persistent memory for access later.
I would be logging roughly 300 bytes once every hour, and would want to come collect all the data with a PC after roughly 1 week of running.
I understand that I should attempt to minimize the number of writes to flash, but I don't have a great understanding of the best way to do this. I'm looking for a resource that would explain memory management techniques for this kind of situation.
I'm using the ADUCM350 which looks like it has 3 separate flash sections (128kB, 256kB, and a 16kB eeprom).
For logging applications, the simplest and most effective wear-leveling tactic is to treat the entire flash array as a giant ring buffer.
Define the entry size to be some integer fraction of the smallest erasable flash unit. Say a sector is 4 KB (4096 bytes); let the entry size be 256 bytes.
This keeps all log entries sector-aligned and allows you to erase any sector without cutting a log entry in half.
At boot, walk the memory and find the first empty entry. This is the write_pointer.
When a log entry is written, simply write it at write_pointer and increment write_pointer.
If write_pointer lands on a sector boundary, erase the sector at write_pointer to make room for the next writes. Essentially, this guarantees that there is always at least one empty log entry for you to find at boot, which is what lets you restore the write_pointer.
If you dedicate 128 KB to the log entries and the flash has an endurance of 20,000 write/erase cycles, this gives you a total of 10,240,000 entries written before failure, or about 1,168 years of continuous hourly logging...
