How to convert task-clock perf-event to seconds or milliseconds? - perf

I am trying to use perf for performance analysis.
When I use perf stat, it reports the execution time:
Performance counter stats for './quicksort_ver1 input.txt 10000':
7.00 msec task-clock:u # 0.918 CPUs utilized
2,679,253 cycles:u # 0.383 GHz (9.58%)
18,034,446 instructions:u # 6.73 insn per cycle (23.56%)
5,764,095 branches:u # 822.955 M/sec (37.62%)
5,030,025 dTLB-loads # 718.150 M/sec (51.69%)
2,948,787 dTLB-stores # 421.006 M/sec (65.75%)
5,525,534 L1-dcache-loads # 788.895 M/sec (48.31%)
2,653,434 L1-dcache-stores # 378.838 M/sec (34.25%)
4,900 L1-dcache-load-misses # 0.09% of all L1-dcache hits (20.16%)
66 LLC-load-misses # 0.00% of all LL-cache hits (6.09%)
<not counted> LLC-store-misses (0.00%)
<not counted> LLC-loads (0.00%)
<not counted> LLC-stores (0.00%)
0.007631774 seconds time elapsed
0.006655000 seconds user
0.000950000 seconds sys
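(As a side note on reading this output: the "CPUs utilized" figure is simply task-clock divided by wall-clock time. A quick sanity check with the rounded numbers above:)
# 7.00 ms of task-clock over 7.631774 ms of wall time ~= 0.917 CPUs utilized
# (perf prints 0.918 because it divides the unrounded values).
awk 'BEGIN { printf "%.3f\n", 7.00 / 7.631774 }'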
However, when I use perf record, I observe that 45 samples and 14999985 events are collected for task-clock.
Samples: 45 of event 'task-clock:u', Event count (approx.): 14999985
Children Self Command Shared Object Symbol
+ 91.11% 0.00% quicksort_ver1 quicksort_ver1 [.] _start
+ 91.11% 0.00% quicksort_ver1 libc-2.17.so [.] __libc_start_main
+ 91.11% 0.00% quicksort_ver1 quicksort_ver1 [.] main
Is there any way to convert task-clock events to seconds or milliseconds?

I got the answer with a little bit of experimentation: the basic unit of the task-clock event is the nanosecond.
Stats collected with perf stat:
$ sudo perf stat -e task-clock:u ./bubble_sort input.txt 50000
Performance counter stats for './bubble_sort input.txt 50000':
11,617.33 msec task-clock:u # 1.000 CPUs utilized
11.617480215 seconds time elapsed
11.615856000 seconds user
0.002000000 seconds sys
Stats collected with perf record:
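(The data below was recorded with an invocation along these lines; the exact options are an assumption based on the event name shown in the report:)
# Sample the run on the task-clock:u software event, writing the perf.data
# file that the perf report invocation below reads.
sudo perf record -e task-clock:u ./bubble_sort input.txt 50000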
$ sudo perf report
Samples: 35K of event 'task-clock:u', Event count (approx.): 11715321618
Overhead Command Shared Object Symbol
73.75% bubble_sort bubble_sort [.] bubbleSort
26.15% bubble_sort bubble_sort [.] swap
0.07% bubble_sort libc-2.17.so [.] _IO_vfscanf
Observe that in both cases the number of samples changed, but the event count is approximately the same.
perf stat reports the elapsed time as 11.617480215 seconds, while perf report reports a total task-clock event count of 11715321618.
11715321618 nanoseconds = 11.715321618 seconds, which is approximately equal to the 11.615856000 seconds of user time.
So the basic unit of the task-clock event is apparently the nanosecond.
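As a quick check, here is a small sketch that converts a perf report event count to seconds and milliseconds, assuming one task-clock event equals one nanosecond (as observed above):
# EVENTS is the 'Event count (approx.)' value printed by perf report.
EVENTS=11715321618
awk -v e="$EVENTS" 'BEGIN { printf "%.9f s (%.3f ms)\n", e / 1e9, e / 1e6 }'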

Related

Why do the same tasks cost different CPU on linux kernel 4.9 and 5.4?

My application is a compute-intensive task (i.e. video encoding). When it is running on linux kernel 4.9 (Ubuntu 16.04), the CPU usage is 3300%. But when it is running on linux kernel 5.4 (Ubuntu 20.04), the CPU usage is just 2850%. I promise the processes do the same job.
So I wonder if the linux kernel did some CPU scheduling optimization or related work between 4.9 and 5.4? Could you give any advice on how to investigate the reason?
I am not sure if the version of glibc has an effect or not; for your information, the glibc version is 2.23 on linux kernel 4.9 and 2.31 on linux kernel 5.4.
CPU Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.000
BogoMIPS: 4401.69
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Output of perf stat on Linux Kernel 4.9
Performance counter stats for process id '32504':
3146297.833447 cpu-clock (msec) # 32.906 CPUs utilized
1,718,778 context-switches # 0.546 K/sec
574,717 cpu-migrations # 0.183 K/sec
2,796,706 page-faults # 0.889 K/sec
6,193,409,215,015 cycles # 1.968 GHz (30.76%)
6,948,575,328,419 instructions # 1.12 insn per cycle (38.47%)
540,538,530,660 branches # 171.801 M/sec (38.47%)
33,087,740,169 branch-misses # 6.12% of all branches (38.50%)
1,966,141,393,632 L1-dcache-loads # 624.906 M/sec (38.49%)
184,477,765,497 L1-dcache-load-misses # 9.38% of all L1-dcache hits (38.47%)
8,324,742,443 LLC-loads # 2.646 M/sec (30.78%)
3,835,471,095 LLC-load-misses # 92.15% of all LL-cache hits (30.76%)
<not supported> L1-icache-loads
187,604,831,388 L1-icache-load-misses (30.78%)
1,965,198,121,190 dTLB-loads # 624.607 M/sec (30.81%)
438,496,889 dTLB-load-misses # 0.02% of all dTLB cache hits (30.79%)
7,139,892,384 iTLB-loads # 2.269 M/sec (30.79%)
260,660,265 iTLB-load-misses # 3.65% of all iTLB cache hits (30.77%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
95.615072142 seconds time elapsed
Output of perf stat on Linux Kernel 5.4
Performance counter stats for process id '3355137':
2,718,192.32 msec cpu-clock # 29.184 CPUs utilized
1,719,910 context-switches # 0.633 K/sec
448,685 cpu-migrations # 0.165 K/sec
3,884,586 page-faults # 0.001 M/sec
5,927,930,305,757 cycles # 2.181 GHz (30.77%)
6,848,723,995,972 instructions # 1.16 insn per cycle (38.47%)
536,856,379,853 branches # 197.505 M/sec (38.47%)
32,245,288,271 branch-misses # 6.01% of all branches (38.48%)
1,935,640,517,821 L1-dcache-loads # 712.106 M/sec (38.47%)
177,978,528,204 L1-dcache-load-misses # 9.19% of all L1-dcache hits (38.49%)
8,119,842,688 LLC-loads # 2.987 M/sec (30.77%)
3,625,986,107 LLC-load-misses # 44.66% of all LL-cache hits (30.75%)
<not supported> L1-icache-loads
184,001,558,310 L1-icache-load-misses (30.76%)
1,934,701,161,746 dTLB-loads # 711.760 M/sec (30.74%)
676,618,636 dTLB-load-misses # 0.03% of all dTLB cache hits (30.76%)
6,275,901,454 iTLB-loads # 2.309 M/sec (30.78%)
391,706,425 iTLB-load-misses # 6.24% of all iTLB cache hits (30.78%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
93.139551411 seconds time elapsed
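(For reference, a sketch of how such per-process counters can be collected; the PID and the 95-second window are placeholders mirroring the outputs above:)
# Attach perf stat to an already-running process for ~95 seconds;
# repeating -d adds the cache and TLB counters shown above.
PID=32504   # hypothetical PID: replace with the encoder's actual PID
sudo perf stat -d -d -p "$PID" -- sleep 95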
UPDATE:
It is confirmed that the performance gain comes from linux kernel 5.4, because the performance on linux kernel 5.3 is the same as on linux kernel 4.9.
It is confirmed that the performance gain has no relation to libc, because on linux kernel 5.10, whose libc is 2.23, the performance is the same as on linux kernel 5.4, whose libc is 2.31.
It seems the performance gain comes from this fix:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de53fd7aedb100f03e5d2231cfce0e4993282425

Best way for logging CPU & GPU utilization every second in linux

I want to get the CPU and GPU utilisation of my cuda program and plot them like this.
What's the best way?
Here is my script:
### [1] Running my cuda program in background
./my_cuda_program &
PID_MY_CUDA_PROGRAM=$!
### [2] Getting CPU & GPU utilization in background
sar 1 | sed --unbuffered -e 's/^/SYSSTAT:/' &
PID_SYSSTAT=$!
nvidia-smi --format=csv --query-gpu=timestamp,utilization.gpu -l 1 \
| sed --unbuffered -e 's/^/NVIDIA_SMI:/' &
PID_NVIDIA_SMI=$!
### [3] waiting for the [1] process to finish,
### and then kill [2] processes
wait ${PID_MY_CUDA_PROGRAM}
kill ${PID_SYSSTAT}
kill ${PID_NVIDIA_SMI}
exit
The script outputs:
SYSSTAT:Linux 4.15.0-176-generic (ubuntu00) 05/06/22 _x86_64_ (4 CPU)
NVIDIA_SMI:timestamp, utilization.gpu [%]
NVIDIA_SMI:2022/05/06 23:57:00.245, 7 %
SYSSTAT:
SYSSTAT:23:57:00 CPU %user %nice %system %iowait %steal %idle
SYSSTAT:23:57:01 all 8.73 0.00 5.74 7.48 0.00 78.05
NVIDIA_SMI:2022/05/06 23:57:01.246, 1 %
SYSSTAT:23:57:02 all 23.31 0.00 6.02 0.00 0.00 70.68
NVIDIA_SMI:2022/05/06 23:57:02.246, 16 %
SYSSTAT:23:57:03 all 25.56 0.00 3.76 0.00 0.00 70.68
NVIDIA_SMI:2022/05/06 23:57:03.246, 15 %
SYSSTAT:23:57:04 all 22.69 0.00 6.48 0.00 0.00 70.82
NVIDIA_SMI:2022/05/06 23:57:04.246, 21 %
SYSSTAT:23:57:05 all 25.81 0.00 3.26 0.00 0.00 70.93
It's a bit annoying to parse the log above.
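A minimal parsing sketch, assuming the prefixes and column layout shown above (combined.log is a placeholder for the captured output):
# CPU: keep the per-second 'all' rows; utilization = 100 - %idle (last column).
grep '^SYSSTAT:' combined.log | awk '$2 == "all" { sub(/^SYSSTAT:/, "", $1); printf "%s,%.2f\n", $1, 100 - $NF }' > cpu_util.csv
# GPU: strip the prefix and the trailing ' %', then drop the CSV header line.
grep '^NVIDIA_SMI:' combined.log | sed -e 's/^NVIDIA_SMI://' -e 's/ %$//' | tail -n +2 > gpu_util.csv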

How to analyze unsuccessful builds in the analysis phase?

A bazel binary that I am building completes unsuccessfully during the analysis phase. What flags and tools can I use to debug why it fails during analysis?
Currently, clean builds return the following output
ERROR: build interrupted
INFO: Elapsed time: 57.819 s
FAILED: Build did NOT complete successfully (133 packages loaded)
If I retry building after failed completion, I receive the following output
ERROR: build interrupted
INFO: Elapsed time: 55.514 s
FAILED: Build did NOT complete successfully (68 packages loaded)
What flags can I use to identify
what packages are being loaded
what package the build is being interrupted on
whether the interruption is coming from a timeout or an external process.
Essentially, something similar to --verbose_failures but for the analysis phase rather than the execution phase.
So far I have run my build through the build profiler and have not been able to glean any insight. Here is the output of my build profile:
WARNING: This information is intended for consumption by Blaze developers only, and may change at any time. Script against it at your own risk
INFO: Loading /<>/result
INFO: bazel profile for <> at Mon Jun 04 00:10:11 GMT 2018, build ID: <>, 49405 record(s)
INFO: Aggregating task statistics
=== PHASE SUMMARY INFORMATION ===
Total launch phase time 9.00 ms 0.02%
Total init phase time 91.0 ms 0.16%
Total loading phase time 1.345 s 2.30%
Total analysis phase time 57.063 s 97.53%
Total run time 58.508 s 100.00%
=== INIT PHASE INFORMATION ===
Total init phase time 91.0 ms
Total time (across all threads) spent on:
Type Total Count Average
=== LOADING PHASE INFORMATION ===
Total loading phase time 1.345 s
Total time (across all threads) spent on:
Type Total Count Average
CREATE_PACKAGE 0.67% 9 3.55 ms
VFS_STAT 0.69% 605 0.05 ms
VFS_DIR 0.96% 255 0.18 ms
VFS_OPEN 2.02% 8 12.1 ms
VFS_READ 0.00% 5 0.01 ms
VFS_GLOB 23.74% 1220 0.93 ms
SKYFRAME_EVAL 24.44% 3 389 ms
SKYFUNCTION 36.95% 8443 0.21 ms
SKYLARK_LEXER 0.19% 31 0.29 ms
SKYLARK_PARSER 0.68% 31 1.04 ms
SKYLARK_USER_FN 0.03% 5 0.27 ms
SKYLARK_BUILTIN_FN 5.91% 349 0.81 ms
=== ANALYSIS PHASE INFORMATION ===
Total analysis phase time 57.063 s
Total time (across all threads) spent on:
Type Total Count Average
CREATE_PACKAGE 0.30% 138 3.96 ms
VFS_STAT 0.05% 2381 0.03 ms
VFS_DIR 0.19% 1020 0.35 ms
VFS_OPEN 0.04% 128 0.61 ms
VFS_READ 0.00% 128 0.01 ms
VFS_GLOB 0.92% 3763 0.45 ms
SKYFRAME_EVAL 31.13% 1 57.037 s
SKYFUNCTION 65.21% 32328 3.70 ms
SKYLARK_LEXER 0.01% 147 0.10 ms
SKYLARK_PARSER 0.03% 147 0.39 ms
SKYLARK_USER_FN 0.20% 343 1.08 ms
As for my command, I am running
bazel build src:MY_TARGET --embed_label MY_LABEL --stamp --show_loading_progress
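(For reference, a sketch of how the profile above can be produced and summarized; the profile path is a placeholder:)
bazel build src:MY_TARGET --profile=/tmp/my_target.profile
bazel analyze-profile /tmp/my_target.profile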
Use the --host_jvm_debug startup flag to debug Bazel itself during a build.
From https://bazel.build/contributing.html:
Debugging Bazel
Start creating a debug configuration for both C++ and
Java in your .bazelrc with the following:
build:debug -c dbg
build:debug --javacopt="-g"
build:debug --copt="-g"
build:debug --strip="never"
Then you can rebuild Bazel with bazel build --config debug //src:bazel and use your favorite debugger to start debugging.
For debugging the C++ client you can just run it from gdb or lldb as
you normally would. But if you want to debug the Java code, you must
attach to the server using the following:
Run Bazel with debugging option --host_jvm_debug before the command (e.g., bazel --batch --host_jvm_debug build //src:bazel).
Attach a debugger to the port 5005. With jdb for instance, run jdb -attach localhost:5005. From within Eclipse, use the remote
Java application launch configuration.
Our IntelliJ plugin has built-in debugging support

Finding the memory consumption of each redis DB

The problem
One of my Python Redis clients fails with the following exception:
redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
I have checked the redis machine, and it seems to be out of memory:
free
total used free shared buffers cached
Mem: 3952 3656 295 0 1 9
-/+ buffers/cache: 3645 306
Swap: 0 0 0
top
top - 15:35:03 up 14:09, 1 user, load average: 0.06, 0.17, 0.16
Tasks: 114 total, 2 running, 112 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.2 st
KiB Mem: 4046852 total, 3746772 used, 300080 free, 1668 buffers
KiB Swap: 0 total, 0 used, 0 free. 11364 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1102 root 20 0 3678836 3.485g 736 S 1.3 90.3 10:12.53 redis-server
1332 ubuntu 20 0 41196 3096 972 S 0.0 0.1 0:00.12 zsh
676 root 20 0 10216 2292 0 S 0.0 0.1 0:00.03 dhclient
850 syslog 20 0 255836 2288 124 S 0.0 0.1 0:00.39 rsyslogd
I am using a few dozen Redis DBs in a single Redis instance. Each DB is denoted by a numeric id given to redis-cli, e.g.:
$ redis-cli -n 80
127.0.0.1:6379[80]>
How do I know how much memory each DB consumes, and what the largest keys in each DB are?
You CANNOT get the used memory for each DB. With the INFO command, you can only get the total memory used by the Redis instance. Redis records the newly allocated memory size each time it dynamically allocates some memory; however, it doesn't keep such a record per DB. It also doesn't keep any record of the largest keys.
Normally, you should configure your Redis instance with maxmemory and maxmemory-policy (i.e. the eviction policy used when maxmemory is reached).
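For example, a minimal redis.conf sketch (the 2gb cap and the LRU policy are assumptions; size the cap to your box, e.g. the 4 GB machine shown above):
# Cap Redis memory and evict least-recently-used keys once the cap is reached.
maxmemory 2gb
maxmemory-policy allkeys-lru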
You can write a small shell script like this (it shows the element count in each DB):
#!/bin/bash
max_db=501
i=0
while [ $i -lt $max_db ]
do
    echo "db_number: $i"
    redis-cli -n $i dbsize
    i=$((i+1))
done
Example output:
db_number: 0
(integer) 71
db_number: 1
(integer) 0
db_number: 2
(integer) 1
db_number: 3
(integer) 1
db_number: 4
(integer) 0
db_number: 5
(integer) 1
db_number: 6
(integer) 28
db_number: 7
(integer) 1
I know that a database can have just one very large key, but anyway, in some cases this script can help.
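If you also want a hint about the largest keys, a hedged extension of the same loop using redis-cli's --bigkeys sampler might look like this (max_db=16 is an assumption; match it to the 'databases' setting in redis.conf):
#!/bin/bash
# For each DB: key count via DBSIZE, then sample the biggest keys per type.
max_db=16
for ((i = 0; i < max_db; i++)); do
    echo "=== db_number: $i ==="
    redis-cli -n "$i" dbsize
    redis-cli -n "$i" --bigkeys
done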

Memcache_connect Connection timed out

I get 10 to 20 of these errors all within 1 second of each other:
Memcache_connect Connection timed out
This happens several times a day, on a server with about 2500 daily active users and 1GB of RAM. I don't think the server is swapping. Most of the time, I'm at less than 75% memory utilization and less than 25% CPU utilization. The load averages are usually less than 9. I followed the debugging instructions here: http://code.google.com/p/memcached/wiki/Timeouts
Here are my memcache stats:
stats
STAT pid 15365
STAT uptime 173776
STAT time 1329157234
STAT version 1.2.8
STAT pointer_size 32
STAT rusage_user 1171.316354
STAT rusage_system 7046.435826
STAT curr_items 28494
STAT total_items 4039745
STAT bytes 3371127
STAT curr_connections 36
STAT total_connections 102206685
STAT connection_structures 328
STAT cmd_flush 0
STAT cmd_get 73532547
STAT cmd_set 4039745
STAT get_hits 40779162
STAT get_misses 32753385
STAT evictions 0
STAT bytes_read 2153565193
STAT bytes_written 38768040520
STAT limit_maxbytes 67108864
STAT threads 2
STAT accepting_conns 1
STAT listen_disabled_num 0
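(As an aside, a small sketch for keeping an eye on the connection-related counters over time, assuming the default host and port:)
# Poll memcached once per second; 'quit' makes nc exit after the reply.
while sleep 1; do
    printf 'stats\r\nquit\r\n' | nc 127.0.0.1 11211 | grep -E 'curr_connections|listen_disabled_num'
done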
My hypothesis is that I'm running out of TIME_WAIT buckets:
netstat -n | grep TIME_WAIT | wc -l
51892
But I don't know if that's too high or not.
I'm on Solaris (on the Joyent servers) and the tcp_time_wait_interval is set to 60000. Some other readings suggested decreasing this setting to 30000, or 15000. But this doesn't seem like a scalable solution to me.
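(For completeness, on Solaris the interval can be inspected and changed with ndd; the 30000 below is just the figure from those readings, not a recommendation:)
# Read the current TIME_WAIT interval in milliseconds, then lower it (requires root).
ndd -get /dev/tcp tcp_time_wait_interval
ndd -set /dev/tcp tcp_time_wait_interval 30000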
How do I know that it's running out of buckets? Should I increase the number of TIME_WAIT buckets? If so, how?
