How to aggregate PAPI uncore events such as skx_unc_imc0 to measure for all devices? - perf

PAPI counts events per device (which can be iMC, caching and home agent(cha)). These are counted as separate events such as skx_unc_imc0::UNC_M_RPQ_OCCUPANCY for iMC 0. Is there a way to measure this for all the iMCs at the same time?
The Linux perf tool can do this, an equivalent event is unc_m_rpq_occupancy, but this cannot be measured from PAPI.

Related

Counting number of allocations into the Write Pending Queue - unexpected low result on NV memory

I am trying to use some of the uncore hardware counters, such as: skx_unc_imc0-5::UNC_M_WPQ_INSERTS. It's supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs with cascade lake architecture, with 2 memory controllers per CPU. linux version is 5.4.0-3-amd64. I have the following simple loop and I am reading this counter for it. Array elements are 64 byte in size, equal to cache line.
for(int i=0; i < 1000000; i++){
array[i].value=2;
}
For this loop, when I map memory to DRAM NUMA node, the counter gives around 150,000 as a result, which maybe makes sense: There are 6 channels in total for 2 memory controllers in front of this NUMA node, which use DRAM DIMMs in interleaving mode. Then for each channel there is one separate WPQ I believe, so skx_unc_imc0 gets 1/6 from the entire stores. There are skx_unc_imc0-5 counters that I got with papi_native_avail, supposedly each for different channels.
The unexpected result is when instead of mapping to DRAM NUMA node, I map the program to Non-Volatile Memory, which is presented as a separate NUMA node to the same socket. There are 6 NVM DIMMs per-socket, that create one Interleaved Region. So when writing to NVM, there should be similarly 6 different channels used and in front of each, there is same one WPQ, that should get again 1/6 write inserts.
But UNC_M_WPQ_INSERTS returns only around up 1000 as a result on NV memory. I don't understand why; I expected it to give similarly around 150,000 writes in WPQ.
Am I interpreting/understanding something wrong? Or is there two different WPQs per channel depending wether write goes to DRAM or NVM? Or what else could be the explanation?
It turns out that UNC_M_WPQ_INSERTS counts the number of allocations into the Write Pending Queue, only for writes to DRAM.
Intel has added corresponding hardware counter for Persistent Memory: UNC_M_PMM_WPQ_INSERTS which counts write requests allocated in the PMM Write Pending Queue for Intel® Optane™ DC persistent memory.
However there is no such native event showing up in papi_native_avail which means it can't be monitored with PAPI yet. In linux version 5.4, some of the PMM counters can be directly found in perf list uncore such as unc_m_pmm_bandwidth.write - Intel Optane DC persistent memory bandwidth write (MB/sec), derived from unc_m_pmm_wpq_inserts, unit: uncore_imc. This implies that even though UNC_M_PMM_WPQ_INSERTS is not directly listed in perf list as an event, it should exist on the machine.
As described here the EventCode for this counter is: 0xE7, therefore it can be used with perf as a raw hardware event descriptor as following: perf stat -e uncore_imc/event=0xe7/. However, it seems that it does not support event modifiers to specify user-space counting with perf. Then after pinning the thread in the same socket as the NVM NUMA node, for the program that basically only does the loop described in the question, the result of perf kind of makes sense:
Performance counter stats for 'system wide': 1,035,380 uncore_imc/event=0xe7/
So far this seems to be the the best guess.

Is it possible to raise the sampling frequency of perf stat?

I am using perf for profiling, but the number of monitored PMU events is higher than the number of the hardware counters, so round-robin multiplexing strategy is triggered. However, some of my test cases may run less than a millisecond, which means that if the execution time is less than the multiplicative inverse of the default switch frequency (1000Hz), some events may not be profiled.
How to raise the sampling frequency of perf stat like perf record -F <freqency> to make sure that every events will be recorded even if the measurement overhead may slightly increase?
First off, remember that sampling is different than counting.
perf record will invariably do a sampling of all the events that occured during the time period of profiling. This means that it will not count all of the events that happened (this can be tweaked of course!). You can modify the frequency of sample collection to increase the number of samples that get collected. It will usually be like for every 10 (or whatever number > 0) events that occur, perf record will only record 1 of them.
perf stat will do a counting of all the events that occur. For each event that happens, perf stat will count it and will try not to miss any, unlike sampling. Of course, the number of events counted may not be accurate if there is multiplexing involved (i.e. when the number of events measured is greater than the number of available hardware counters). There is no concept of setting up frequencies in perf stat since all it does is a direct count of all the events that you intend to measure.
This is the proof from the linux kernel source code :-
You can see it sets up sample period (the inverse of sample freq) to be 0 - so you know what sample freq is ;)
Anyway, what you can do is a verbose reading of perf stat using perf stat -v to see and understand what is happening with all of the events that you are measuring.
To understand more about perf stat, you can also read this answer.

Timing Advance in GSM

I have a bunch of questions concerning Timing Advance in GSM :
When is it defined ?
Is it the phone or the BTS who's in charge of defining it's value ?
is it dynamic, does it depends on certain situations ?
Let's say that I figured out a way to get the exact value of the Timing Advance (GSM Layer 1 Transmission level) from the phone's modem :
In order to verify my solution, I'm supposed to put my phone over and over in a situation where he have to use/change the Timing Advance while I log its value...
How can I do that ?
Thanks
In the GSM cellular mobile phone standard, timing advance value corresponds to the length of time a signal takes to reach the base station from a mobile phone. GSM uses TDMA technology in the radio interface to share a single frequency between several users, assigning sequential timeslots to the individual users sharing a frequency. Each user transmits periodically for less than one-eighth of the time within one of the eight timeslots. Since the users are at various distances from the base station and radio waves travel at the finite speed of light, the precise arrival-time within the slot can be used by the base station to determine the distance to the mobile phone. The time at which the phone is allowed to transmit a burst of traffic within a timeslot must be adjusted accordingly to prevent collisions with adjacent users. Timing Advance (TA) is the variable controlling this adjustment.
Technical Specifications 3GPP TS 05.10[1] and TS 45.010[2] describe the TA value adjustment procedures. The TA value is normally between 0 and 63, with each step representing an advance of one bit period (approximately 3.69 microseconds). With radio waves travelling at about 300,000,000 metres per second (that is 300 metres per microsecond), one TA step then represents a change in round-trip distance (twice the propagation range) of about 1,100 metres. This means that the TA value changes for each 550-metre change in the range between a mobile and the base station. This limit of 63 × 550 metres is the maximum 35 kilometres that a device can be from a base station and is the upper bound on cell placement distance.
A continually adjusted TA value avoids interference to and from other users in adjacent timeslots, thereby minimizing data loss and maintaining Mobile QoS (call quality-of-service).
Timing Advance is significant for privacy and communications security, as its combination with other variables can allow GSM localization to find the device's position and tracking the mobile phone user. TA is also used to adjust transmission power in Space-division multiple access systems.
This limited the original range of a GSM cell site to 35km as mandated by the duration of the standard timeslots defined in the GSM specification. The maximum distance is given by the maximum time that the signal from the mobile/BTS needs to reach the receiver of the mobile/BTS on time to be successfully heard. At the air interface the delay between the transmission of the downlink (BTS) and the uplink (mobile) has an offset of 3 timeslots. Until now the mobile station has used a timing advance to compensate for the propagation delay as the distance to the BTS changes. The timing advance values are coded by 6 bits, which gives the theoretical maximum BTS/mobile separation as 35km.
By implementing the Extended Range feature, the BTS is able to receive the uplink signal in two adjacent timeslots instead of one. When the mobile station reaches its maximum timing advance, i.e. maximum range, the BTS expands its hearing window with an internal timing advance that gives the necessary time for the mobile to be heard by the BTS even from the extended distance. This extra advance is the duration of a single timeslot, a 156 bit period. This gives roughly 120 km range for a cell.[3] and is implemented in sparsely populated areas and to reach islands for example.
Hope this Answer the question:)
It's defined everytime the BTS needs to set the define the phone's transmission power, which happens quite often.
It's the core system (BTS in GSM) who totally in charge of defining it's value.
It's very dynamic, and change a lot. Globally, the GSM core system is constantly trying to find the exact distance between the BTS and the MS, so it constantly make a kind of "ping" to calculate it. The result of such operations is generally not that accurate since there are a lot of obstacles between the mobile and the BTS (it's not a direct link in an open space).
Such operations happens a lot, so use your smartphone. Simply.

Calclulate Power consumption when CPU don't change to LPM Mode in Contiki

I need to calculate power consumption of CPU. According to this formula.
Power(mW) = cpu * 1.8 / time.
Where time is the sum of cpu + lpm.
I need to measure at the start of certain process and at the end, however the time passed it is to short, and cpu don't change to lpm mode as seen in the next values taken with powertrace_print().
all_cpu all_lpm all_transmit all_listen
116443 1514881 148 1531616
17268 1514881 148 1532440
Calculating power consumption of cpu I got 1.8 mW (which is exactly the value of current draw of CPU in active mode).
My question is, how calculate power consumption in this case?
If MCU does not go into a LPM, then it spends all the time in active mode, so the result of 1.8 mW you get looks correct.
Perhaps you want to ask something different? If you want to measure the time required to execute a specific block of code, you can add RTIMER_NOW() calls at the start and end of the block.
The time resolution of RTIMER_NOW may be too coarse for short-time operations. You can use a higher frequency timer for that, depending on your platform, e.g. read the TBR register for timing if you're compiling for a msp430 based sensor node.

Contiki OS CC2538: Reducing current / power consumption

I am trying to drive down the current consumption of the contiki os running on the CC2538 development kit.
I would like to operate the device from a CR2032 with a run life of 2 years. To achieve this I would need an average current less than 100uA.
However when I run the following at 3V, I get the following results:
contiki/examples/hello-world = 0.4mA - 2mA
contiki/examples/er-rest-example/er-example-client = 27mA
contiki/examples/er-rest-example/er-example-server = 27mA
thingsquare websocket example = 4mA
I have also designed my own target platform based on the cc2538 and get similar results.
I have read the guide at https://github.com/contiki-os/contiki/blob/648d3576a081b84edd33da05a3a973e209835723/platform/cc2538dk/README.md
and have ensured that in the contiki-conf.h file:
- LPM_CONF_ENABLE 1
- LPM_CONF_MAX_PM 2
Can anyone give me some pointers as to how I can get the current down. It would be most appreciated.
Regards,
Shane
How did you measure the current?
You have to be aware that using a basic ampere meter to measure the current consumption of contiki-os wouldn't give you relevant results. The system is turning on/off the radio at a relative high rate (8Hz by default) in order to perform the CCA. This might not be very easy to catch for an ampere meter.
To have an idea of the current consumption when the device is in deep sleep (and then make calculation to determine the averaged current consumption), I'd rather put the device in the PM state before the program reach the infinite while loop. I used the following code to do that:
lpm_enter();
REG(SYS_CTRL_PMCTL) = SYS_CTRL_PMCTL_PM2;
do { asm("wfi"::); } while(0);
leds_on(LEDS_RED); // should not reach here
while(1){
...
On the CC2538, the CCA check consumes about 10-15mA and last approximately 2ms. When the radio transmit a packet, it consume 25mA. Have a look at this post: Contiki UDP packet transmission duration with CC2538.
Furthermore, to save a little more current, turn off the serial com:
#define CC2538_CONF_QUIET 1
Are you using the SmartRF board? If you want to make proper current measurement with this board, you have to remove every jumpers: P486, P487, P411 and P408. Keep only the jumpers of BTN_SEL and the RESET signals.

Resources