Why does the same task cost different CPU on Linux kernel 4.9 and 5.4?

My application is a compute-intensive task (i.e. video encoding). When it runs on Linux kernel 4.9 (Ubuntu 16.04), the CPU usage is 3300%, but when it runs on Linux kernel 5.4 (Ubuntu 20.04), the CPU usage is only 2850%. I can confirm the processes are doing the same job.
So I wonder whether the Linux kernel gained some CPU scheduling optimization or related work between 4.9 and 5.4. Could you give any advice on how to investigate the reason?
I am not sure whether the glibc version has an effect; for your information, glibc is 2.23 on the Linux kernel 4.9 system and 2.31 on the Linux kernel 5.4 system.
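For reference, per-process counters like the ones shown below can be collected by attaching perf stat to the running process; a minimal sketch, with the PID and the measurement window as placeholders:
$ # attach to the already-running encoder and count events until `sleep` exits
$ sudo perf stat -p <pid> -- sleep 95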
CPU Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.000
BogoMIPS: 4401.69
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Output of perf stat on Linux Kernel 4.9
Performance counter stats for process id '32504':
3146297.833447 cpu-clock (msec) # 32.906 CPUs utilized
1,718,778 context-switches # 0.546 K/sec
574,717 cpu-migrations # 0.183 K/sec
2,796,706 page-faults # 0.889 K/sec
6,193,409,215,015 cycles # 1.968 GHz (30.76%)
6,948,575,328,419 instructions # 1.12 insn per cycle (38.47%)
540,538,530,660 branches # 171.801 M/sec (38.47%)
33,087,740,169 branch-misses # 6.12% of all branches (38.50%)
1,966,141,393,632 L1-dcache-loads # 624.906 M/sec (38.49%)
184,477,765,497 L1-dcache-load-misses # 9.38% of all L1-dcache hits (38.47%)
8,324,742,443 LLC-loads # 2.646 M/sec (30.78%)
3,835,471,095 LLC-load-misses # 92.15% of all LL-cache hits (30.76%)
<not supported> L1-icache-loads
187,604,831,388 L1-icache-load-misses (30.78%)
1,965,198,121,190 dTLB-loads # 624.607 M/sec (30.81%)
438,496,889 dTLB-load-misses # 0.02% of all dTLB cache hits (30.79%)
7,139,892,384 iTLB-loads # 2.269 M/sec (30.79%)
260,660,265 iTLB-load-misses # 3.65% of all iTLB cache hits (30.77%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
95.615072142 seconds time elapsed
Output of perf stat on Linux Kernel 5.4
Performance counter stats for process id '3355137':
2,718,192.32 msec cpu-clock # 29.184 CPUs utilized
1,719,910 context-switches # 0.633 K/sec
448,685 cpu-migrations # 0.165 K/sec
3,884,586 page-faults # 0.001 M/sec
5,927,930,305,757 cycles # 2.181 GHz (30.77%)
6,848,723,995,972 instructions # 1.16 insn per cycle (38.47%)
536,856,379,853 branches # 197.505 M/sec (38.47%)
32,245,288,271 branch-misses # 6.01% of all branches (38.48%)
1,935,640,517,821 L1-dcache-loads # 712.106 M/sec (38.47%)
177,978,528,204 L1-dcache-load-misses # 9.19% of all L1-dcache hits (38.49%)
8,119,842,688 LLC-loads # 2.987 M/sec (30.77%)
3,625,986,107 LLC-load-misses # 44.66% of all LL-cache hits (30.75%)
<not supported> L1-icache-loads
184,001,558,310 L1-icache-load-misses (30.76%)
1,934,701,161,746 dTLB-loads # 711.760 M/sec (30.74%)
676,618,636 dTLB-load-misses # 0.03% of all dTLB cache hits (30.76%)
6,275,901,454 iTLB-loads # 2.309 M/sec (30.78%)
391,706,425 iTLB-load-misses # 6.24% of all iTLB cache hits (30.78%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
93.139551411 seconds time elapsed
UPDATE:
It is confirmed that the performance gain comes from Linux kernel 5.4 itself, because the performance on Linux kernel 5.3 is the same as on Linux kernel 4.9.
It is confirmed that the performance gain has no relation to libc, because on Linux kernel 5.10 with libc 2.23 the performance is the same as on Linux kernel 5.4 with libc 2.31.

It seems the performance gain comes from this fix:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de53fd7aedb100f03e5d2231cfce0e4993282425
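That commit appears to be the CFS bandwidth-control fix that removes expiration of CPU-local runtime slices; if so, it should only matter when the process runs under a CPU quota (cgroup CPU bandwidth control, for example inside a container). A sketch for checking whether the encoder was being throttled on the kernel 4.9 machine, assuming cgroup v1 (the PID and cgroup path are placeholders):
$ # find which cpu cgroup the encoder belongs to
$ grep cpu /proc/<pid>/cgroup
$ # nr_throttled and throttled_time show how often and how long the group was throttled
$ cat /sys/fs/cgroup/cpu,cpuacct/<cgroup-path>/cpu.stat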

Related

Stardog on VM Linux Ubuntu - memory capacity

We are experiencing performance problems with Stardog requests (at least about 500,000 ms to get an answer). We followed the Debian Based Systems installation described in the Stardog documentation and have a Stardog service installed in our Ubuntu VM.
Azure machine: Standard D4s v3 (4 virtual processors, 16 GiB memory)
Total memory of the VM = 16 GiB
We tested several sets of JVM options:
-Xms4g -Xmx4g -XX:MaxDirectMemorySize=8g
-Xms8g -Xmx8g -XX:MaxDirectMemorySize=8g
We also tried upgrading the VM to a bigger machine, but without success:
Azure: Standard D8s v3 - 8 virtual processors, 32 GiB memory
Running the command systemctl status stardog on the machine with 32 GiB of memory,
we get:
stardog.service - Stardog Knowledge Graph
Loaded: loaded (/etc/systemd/system/stardog.service; enabled; vendor prese>
Active: active (running) since Tue 2023-01-17 15:41:40 UTC; 1min 35s ago
Docs: https://www.stardog.com/
Process: 797 ExecStart=/opt/stardog/stardog-server.sh start (code=exited, s>
Main PID: 969 (java)
Tasks: 76 (limit: 38516)
Memory: 1.9G
CGroup: /system.slice/stardog.service
└─969 java -Dstardog.home=/var/opt/stardog/ -Xmx8g -Xms8g XX:MaxD
stardog-admin server status :
Access Log Enabled : true
Access Log Type : text
Audit Log Enabled : true
Audit Log Type : text
Backup Storage Directory : .backup
CPU Load : 1.88 %
Connection Timeout : 10m
Export Storage Directory : .exports
Memory Heap : 305M (Max: 8.0G)
Memory Mode : DEFAULT{Starrocks.block_cache=20, Starrocks.dict_block_cache=10, Native.starrocks=70, Heap.dict_value=50, Starrocks.txn_block_cache=5, Heap.dict_index=50, Starrocks.untracked_memory=20, Starrocks.memtable=40, Starrocks.buffer_pool=5, Native.query=30}
Memory Query Blocks : 0B (Max: 5.7G)
Memory RSS : 4.3G
Named Graph Security : false
Platform Arch : amd64
Platform OS : Linux 5.15.0-1031-azure, Java 1.8.0_352
Query All Graphs : false
Query Timeout : 1h
Security Disabled : false
Stardog Home : /var/opt/stardog
Stardog Version : 8.1.1
Strict Parsing : true
Uptime : 2 hours 18 minutes 51 seconds
Knowing that only the Stardog server is installed on this VM, with an 8G JVM heap and 20G of direct memory for Java, is it normal to see 1.9G in memory (no query in progress)
and 4.1G (when a query is in progress)?
"databases.xxxx.queries.latency": {
"count": 7,
"max": 471.44218324400003,
"mean": 0.049260736982859085,
"min": 0.031328932000000004,
"p50": 0.048930366,
"p75": 0.048930366,
"p95": 0.048930366,
"p98": 0.048930366,
"p99": 0.048930366,
"p999": 0.048930366,
"stddev": 0.3961819852037625,
"m15_rate": 0.0016325388459502614,
"m1_rate": 0.0000015369791915358426,
"m5_rate": 0.0006317127755974434,
"mean_rate": 0.0032760240366080024,
"duration_units": "seconds",
"rate_units": "calls/second"
Of all your queries, the slowest took about 8 minutes to complete, while the others completed very quickly. It is best to identify the slow query and profile it.
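As for the memory numbers: even with -Xms8g, heap pages only become resident (and therefore visible in systemd's Memory value and in RSS) as they are touched, so 1.9G at idle and around 4G while a query runs is not unusual. A sketch of checking the actual heap usage with standard JDK 8 tools, where PID 969 is taken from the systemctl output above:
$ # committed vs. used heap of the Stardog JVM
$ sudo jmap -heap 969
$ # or sample the GC generation sizes five times, one second apart
$ sudo jstat -gc 969 1000 5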

How to convert task-clock perf-event to seconds or milliseconds?

I am trying to use perf for performance analysis.
When I use perf stat, it reports the execution time:
Performance counter stats for './quicksort_ver1 input.txt 10000':
7.00 msec task-clock:u # 0.918 CPUs utilized
2,679,253 cycles:u # 0.383 GHz (9.58%)
18,034,446 instructions:u # 6.73 insn per cycle (23.56%)
5,764,095 branches:u # 822.955 M/sec (37.62%)
5,030,025 dTLB-loads # 718.150 M/sec (51.69%)
2,948,787 dTLB-stores # 421.006 M/sec (65.75%)
5,525,534 L1-dcache-loads # 788.895 M/sec (48.31%)
2,653,434 L1-dcache-stores # 378.838 M/sec (34.25%)
4,900 L1-dcache-load-misses # 0.09% of all L1-dcache hits (20.16%)
66 LLC-load-misses # 0.00% of all LL-cache hits (6.09%)
<not counted> LLC-store-misses (0.00%)
<not counted> LLC-loads (0.00%)
<not counted> LLC-stores (0.00%)
0.007631774 seconds time elapsed
0.006655000 seconds user
0.000950000 seconds sys
However, when I use perf record, I observe that 45 samples and 14,999,985 events are collected for task-clock.
Samples: 45 of event 'task-clock:u', Event count (approx.): 14999985
Children Self Command Shared Object Symbol
+ 91.11% 0.00% quicksort_ver1 quicksort_ver1 [.] _start
+ 91.11% 0.00% quicksort_ver1 libc-2.17.so [.] __libc_start_main
+ 91.11% 0.00% quicksort_ver1 quicksort_ver1 [.] main
Is there any way to convert task-clock events to seconds or milliseconds?
I got the answer with a bit of experimentation. The basic unit of the task-clock event is the nanosecond.
Stats collected with perf stat:
$ sudo perf stat -e task-clock:u ./bubble_sort input.txt 50000
Performance counter stats for './bubble_sort input.txt 50000':
11,617.33 msec task-clock:u # 1.000 CPUs utilized
11.617480215 seconds time elapsed
11.615856000 seconds user
0.002000000 seconds sys
Stats collected with perf record:
$ sudo perf report
Samples: 35K of event 'task-clock:u', Event count (approx.): 11715321618
Overhead Command Shared Object Symbol
73.75% bubble_sort bubble_sort [.] bubbleSort
26.15% bubble_sort bubble_sort [.] swap
0.07% bubble_sort libc-2.17.so [.] _IO_vfscanf
Observe that in both cases the number of samples changed, but the event count is approximately the same.
perf stat reports the elapsed time as 11.617480215 seconds, while perf report reports a total task-clock event count of 11,715,321,618.
11,715,321,618 nanoseconds = 11.715321618 seconds, which is approximately equal to the 11.615856000 seconds of user time.
So apparently the basic unit of the task-clock event is the nanosecond.
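Given that, converting the event count from perf report into time is just a division; a quick sketch with bc using the total from the report above:
$ # task-clock counts nanoseconds: divide by 1e6 for milliseconds, 1e9 for seconds
$ echo "scale=3; 11715321618 / 1000000" | bc
11715.321
$ echo "scale=9; 11715321618 / 1000000000" | bc
11.715321618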

Understanding docker container CPU usage

docker stats shows that the CPU usage is very high, but top output shows that 88.3% of the CPU is idle. Inside the container is a Java HTTP Thrift service.
docker stats :
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
8a0488xxxx5 540.9% 41.99 GiB / 44 GiB 95.43% 0 B / 0 B 0 B / 35.2 MB 286
top output :
top - 07:56:58 up 2 days, 22:29, 0 users, load average: 2.88, 3.01, 3.05
Tasks: 13 total, 1 running, 12 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.2 us, 2.7 sy, 0.0 ni, 88.3 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
KiB Mem: 65959920 total, 47983628 used, 17976292 free, 357632 buffers
KiB Swap: 7999484 total, 0 used, 7999484 free. 2788868 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8823 root 20 0 58.950g 0.041t 21080 S 540.9 66.5 16716:32 java
How can I reduce the reported CPU usage and bring it under 100%?
According to the top man page:
When operating in Solaris mode (`I' toggled Off), a task's cpu usage will be divided by the total number of CPUs. After issuing this command, you'll be told the new state of this toggle.
So by pressing the I key while top is running in interactive mode, you switch to Solaris mode and the CPU usage is divided by the total number of CPUs (or cores).
P.S.: This option is not available on all versions of top.
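Also note that docker stats reports CPU% relative to a single core, so on a multi-core host it can legitimately exceed 100%. A small sketch of normalizing it to a host-wide figure (the 540.9% comes from the docker stats output above; the 40 is only an illustrative core count, use whatever nproc reports on your host):
$ # number of online CPUs on the host
$ nproc
$ # host-wide share of the container, assuming a 40-core host
$ echo "scale=1; 540.9 / 40" | bc
13.5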

What are the memory requirements for OrientDB / can I run it on an EC2 micro?

I would like to run OrientDB on an EC2 micro (free tier) instance. I am unable to find official OrientDB documentation that gives memory requirements; however, I found this question which says 512MB should be fine. I am running an EC2 micro instance, which has 1GB of RAM. However, when I try to run OrientDB I get the JRE error shown below. My initial thought was that I needed to increase the JRE memory using -Xmx, but I guess it would be the shell script that would do this. Has anyone successfully run OrientDB on an EC2 micro instance or run into this problem?
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007a04a0000, 1431699456, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (malloc) failed to allocate 1431699456 bytes for committing reserved memory.
An error report file with more information is saved as:
/tmp/jvm-14728/hs_error.log
Here are the contents of the error log:
OS:Linux
uname:Linux 4.14.47-56.37.amzn1.x86_64 #1 SMP Wed Jun 6 18:49:01 UTC 2018 x86_64
libc:glibc 2.17 NPTL 2.17
rlimit: STACK 8192k, CORE 0k, NPROC 3867, NOFILE 4096, AS infinity
load average:0.00 0.00 0.00
/proc/meminfo:
MemTotal: 1011168 kB
MemFree: 322852 kB
MemAvailable: 822144 kB
Buffers: 83188 kB
Cached: 523056 kB
SwapCached: 0 kB
Active: 254680 kB
Inactive: 369952 kB
Active(anon): 18404 kB
Inactive(anon): 48 kB
Active(file): 236276 kB
Inactive(file): 369904 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 36 kB
Writeback: 0 kB
AnonPages: 18376 kB
Mapped: 31660 kB
Shmem: 56 kB
Slab: 51040 kB
SReclaimable: 41600 kB
SUnreclaim: 9440 kB
KernelStack: 1564 kB
PageTables: 2592 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 505584 kB
Committed_AS: 834340 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 49152 kB
DirectMap2M: 999424 kB
CPU:total 1 (initial active 1) (1 cores per cpu, 1 threads per core) family 6 model 63 stepping 2, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, avx, avx2, aes, erms, tsc
/proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
stepping : 2
microcode : 0x3c
cpu MHz : 2400.043
cache size : 30720 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4800.05
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Memory: 4k page, physical 1011168k(322728k free), swap 0k(0k free)
vm_info: OpenJDK 64-Bit Server VM (24.181-b00) for linux-amd64 JRE (1.7.0_181-b00), built on Jun 5 2018 20:36:03 by "mockbuild" with gcc 4.8.5 20150623 (Red Hat 4.8.5-28)
time: Mon Aug 20 20:51:08 2018
elapsed time: 0 seconds
OrientDB can easily run in 512MB, though your performance and throughput will not be as high. In OrientDB 3.0.x you can use the environment variable ORIENTDB_OPTS_MEMORY to set it. On the command line I can, for example, run:
cd $ORIENTDB_HOME/bin
export ORIENTDB_OPTS_MEMORY="-Xmx512m"
./server.sh
(where $ORIENTDB_HOME is where you have OrientDB installed) and I'm running with 512MB of memory.
As an aside, if you look in $ORIENTDB_HOME/bin/server.sh you'll see that there is even code to check whether the server is running on a Raspberry Pi, and those range from 256MB to 1GB of RAM, so the t2.micro will run just fine.
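If you want to confirm that the variable was actually picked up, one quick sketch is to inspect the flags of the running Java process after starting server.sh:
$ # list the JVM arguments of the running server and look for the heap flag
$ ps -o args= -C java | tr ' ' '\n' | grep -i 'Xmx'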

When Cassandra is running, almost all RAM is consumed - why?

I have CentOS 6.8, Cassandra 3.9, and 32 GB of RAM. Once Cassandra has started, it begins consuming memory, and the 'cached' memory value keeps growing when I query from CQLSH or Apache Spark; in the process, very little memory remains for other work such as cron jobs.
Here are some details from my system
free -m
total used free shared buffers cached
Mem: 32240 32003 237 0 41 24010
-/+ buffers/cache: 7950 24290
Swap: 2047 25 2022
And here is the output of top -M command
top - 08:54:39 up 5 days, 16:24, 4 users, load average: 1.22, 1.20, 1.29
Tasks: 205 total, 2 running, 203 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.5%us, 1.2%sy, 19.8%ni, 75.3%id, 0.1%wa, 0.1%hi, 0.0%si, 0.0%st
Mem: 31.485G total, 31.271G used, 219.410M free, 42.289M buffers
Swap: 2047.996M total, 25.867M used, 2022.129M free, 23.461G cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14313 cassandr 20 0 595g 28g 22g S 144.5 91.3 300:56.34 java
You can see only about 220 MB is free and 23.46 GB is cached.
My question is how to configure Cassandra so that it only uses 'cached' memory up to a certain value and leaves more RAM available for other processes.
Thanks in advance.
In Linux in general, cached memory like your 23 GB is perfectly fine. This memory is used as filesystem cache and so on - not by Cassandra itself. Linux systems tend to use all available memory.
This helps speed up your system in many ways by avoiding disk reads.
You can still use the cached memory - just start processes and use your RAM; the kernel will free it immediately.
You can set the heap sizes in cassandra-env.sh in the conf folder. This article should help: http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsTuneJVM.html
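For reference, the heap-related settings in that file are plain shell variables; a minimal sketch with illustrative values rather than recommendations (the page cache itself cannot be capped this way - the kernel reclaims it on demand):
# in <cassandra-install>/conf/cassandra-env.sh
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"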
