Best way to log CPU & GPU utilization every second in Linux (NVIDIA)

I want to record the CPU and GPU utilization of my CUDA program once per second and plot them over time.
What's the best way to do this?
Here is my script:
### [1] Running my cuda program in background
./my_cuda_program &
PID_MY_CUDA_PROGRAM=$!
### [2] Getting CPU & GPU utilization in background
sar 1 | sed --unbuffered -e 's/^/SYSSTAT:/' &
PID_SYSSTAT=$!
nvidia-smi --format=csv --query-gpu=timestamp,utilization.gpu -l 1 \
| sed --unbuffered -e 's/^/NVIDIA_SMI:/' &
PID_NVIDIA_SMI=$!
### [3] Waiting for the [1] process to finish,
### then killing the [2] processes
wait ${PID_MY_CUDA_PROGRAM}
kill ${PID_SYSSTAT}
kill ${PID_NVIDIA_SMI}
exit
That outputs:
SYSSTAT:Linux 4.15.0-176-generic (ubuntu00) 05/06/22 _x86_64_ (4 CPU)
NVIDIA_SMI:timestamp, utilization.gpu [%]
NVIDIA_SMI:2022/05/06 23:57:00.245, 7 %
SYSSTAT:
SYSSTAT:23:57:00 CPU %user %nice %system %iowait %steal %idle
SYSSTAT:23:57:01 all 8.73 0.00 5.74 7.48 0.00 78.05
NVIDIA_SMI:2022/05/06 23:57:01.246, 1 %
SYSSTAT:23:57:02 all 23.31 0.00 6.02 0.00 0.00 70.68
NVIDIA_SMI:2022/05/06 23:57:02.246, 16 %
SYSSTAT:23:57:03 all 25.56 0.00 3.76 0.00 0.00 70.68
NVIDIA_SMI:2022/05/06 23:57:03.246, 15 %
SYSSTAT:23:57:04 all 22.69 0.00 6.48 0.00 0.00 70.82
NVIDIA_SMI:2022/05/06 23:57:04.246, 21 %
SYSSTAT:23:57:05 all 25.81 0.00 3.26 0.00 0.00 70.93
It's a bit annoying to parse the interleaved log above, though.
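One way to make it easier to plot (a minimal sketch, not part of the original script; the file name utilization.log is an assumption, i.e. the combined output above redirected to a file) is to split the log back into two per-second CSV series with grep and awk:
# Hypothetical post-processing; assumes the combined log was saved as utilization.log
# CPU utilization per second: time, 100 - %idle
grep '^SYSSTAT:' utilization.log \
| awk '$2 == "all" { sub(/^SYSSTAT:/, "", $1); printf "%s,%.2f\n", $1, 100 - $NF }' \
> cpu.csv
# GPU utilization per second: timestamp, utilization.gpu (the CSV header line is skipped)
grep '^NVIDIA_SMI:' utilization.log \
| awk -F', ' 'NR > 1 { sub(/^NVIDIA_SMI:/, "", $1); gsub(/ %/, "", $2); print $1 "," $2 }' \
> gpu.csv
The resulting cpu.csv and gpu.csv can then be fed to gnuplot, matplotlib, or a spreadsheet.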

Related

How to convert task-clock perf-event to seconds or milliseconds?

I am trying to use perf for performance analysis.
When I use perf stat, it reports the execution time:
Performance counter stats for './quicksort_ver1 input.txt 10000':
7.00 msec task-clock:u # 0.918 CPUs utilized
2,679,253 cycles:u # 0.383 GHz (9.58%)
18,034,446 instructions:u # 6.73 insn per cycle (23.56%)
5,764,095 branches:u # 822.955 M/sec (37.62%)
5,030,025 dTLB-loads # 718.150 M/sec (51.69%)
2,948,787 dTLB-stores # 421.006 M/sec (65.75%)
5,525,534 L1-dcache-loads # 788.895 M/sec (48.31%)
2,653,434 L1-dcache-stores # 378.838 M/sec (34.25%)
4,900 L1-dcache-load-misses # 0.09% of all L1-dcache hits (20.16%)
66 LLC-load-misses # 0.00% of all LL-cache hits (6.09%)
<not counted> LLC-store-misses (0.00%)
<not counted> LLC-loads (0.00%)
<not counted> LLC-stores (0.00%)
0.007631774 seconds time elapsed
0.006655000 seconds user
0.000950000 seconds sys
However, when I use perf record, I observe that 45 samples and 14999985 events are collected for task-clock:
Samples: 45 of event 'task-clock:u', Event count (approx.): 14999985
Children Self Command Shared Object Symbol
+ 91.11% 0.00% quicksort_ver1 quicksort_ver1 [.] _start
+ 91.11% 0.00% quicksort_ver1 libc-2.17.so [.] __libc_start_main
+ 91.11% 0.00% quicksort_ver1 quicksort_ver1 [.] main
Is there any way to convert task-clock events to seconds or milliseconds?
I got the answer with a little experimentation: the basic unit of the task-clock event is a nanosecond.
Stats collected with perf stat:
$ sudo perf stat -e task-clock:u ./bubble_sort input.txt 50000
Performance counter stats for './bubble_sort input.txt 50000':
11,617.33 msec task-clock:u # 1.000 CPUs utilized
11.617480215 seconds time elapsed
11.615856000 seconds user
0.002000000 seconds sys
Stats collected with perf record (viewed with perf report):
$ sudo perf report
Samples: 35K of event 'task-clock:u', Event count (approx.): 11715321618
Overhead Command Shared Object Symbol
73.75% bubble_sort bubble_sort [.] bubbleSort
26.15% bubble_sort bubble_sort [.] swap
0.07% bubble_sort libc-2.17.so [.] _IO_vfscanf
Observe that in both cases the sample count changes, but the event count is approximately the same.
perf stat reports the elapsed time as 11.617480215 seconds, and perf report reports a total task-clock event count of 11,715,321,618.
11,715,321,618 nanoseconds = 11.715321618 seconds, which is approximately equal to the 11.615856000 seconds of user time.
So the basic unit of the task-clock event is apparently one nanosecond.
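So the conversion is just a division by 10^9. As a quick arithmetic check against the numbers above (a minimal sketch):
# task-clock event counts are in nanoseconds; divide by 1e9 to get seconds
echo "scale=9; 11715321618 / 1000000000" | bc
# prints 11.715321618, close to the ~11.6 s that perf stat reported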

Tensorflow jobs causing high disk load on the Root Filesystem

When running training with TensorFlow, the root filesystem, which is on two SSDs, sees extremely high disk utilization, with waits as high as 39 minutes. I'm trying to figure out what is causing it.
11:42:43 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
11:42:45 PM sda 2.00 0.00 4096.00 2048.00 143.11 57598.00 500.00 100.00
11:42:45 PM md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:42:45 PM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:42:45 PM sdc 2.00 0.00 4096.00 2048.00 143.10 57594.00 500.00 100.00
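One way to narrow this down (a sketch, assuming iotop and the pidstat tool from the sysstat package are installed) is per-process I/O accounting, which shows which process is actually issuing the writes:
# show only processes currently doing I/O, per process, with accumulated totals
sudo iotop -o -P -a
# or, from the same sysstat package as sar: per-process disk read/write rates every second
pidstat -d 1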

Why would an empty Rails process have a different memory footprint each time it starts up?

We've been trying to figure out how to reduce the boot-up memory footprint of our rails app by identifying memory-hungry gems and finding alternatives or solutions.
But there's one behavior on OS X that I find baffling.
With a brand-new generated Rails app (rails new memoryusage), with no Gemfile, no models, no data and no transactions, the memory OS X displays for the corresponding ruby process when rails c starts up will vary each time it is started, from as low as 60 MB to as high as 65 MB, with no discernible pattern as to why the same app might require less or more memory per execution.
I imagine this has to do in some way with how Ruby allocates memory, but I'm not completely clear on why its memory allocation would vary so wildly for the same code and no variable processing.
We see similarly unpredictable behavior when we try to calculate the memory consumed by the process after each gem in the Gemfile is required. We load up a vanilla Rails process, then in rails c we run a script that parses the Gemfile and requires each gem individually, logging the memory before and after each require. What we notice is that not only does the memory footprint lack a consistent starting point, but the incremental 'steps' in our memory consumption also vary wildly.
We booted up our process three times, one after the other, and measured the startup memory and the incremental memory required by each gem. Not only did the startup boot memory footprints bounce between 60 MB and 92 MB, but the points at which we saw memory jumps on loading each gem were inconsistent: sometimes loading SASS would eat up an additional 5 MB, sometimes it wouldn't; sometimes active_merchant would demand an additional 10 MB, other times it wouldn't.
: BOOT UP #1 : BOOT UP #2 : BOOT UP #3
gem : increment | total : increment | total : increment | total
rails : 0.00 | 59.71 : 0.00 | 92.54 : 0.18 | 67.76
unicorn : 0.52 | 60.24 : 0.52 | 93.06 : 3.35 | 71.12
haml : 8.77 | 69.02 : 1.88 | 94.94 : 9.45 | 80.57
haml-rails : 0.00 | 69.02 : 0.00 | 94.94 : 0.00 | 80.57
sass : 4.36 | 73.38 : 6.95 | 101.89 : 0.99 | 81.55
mongoid : 0.00 | 73.38 : 0.00 | 101.89 : 0.00 | 81.55
compass : 11.56 | 84.93 : 3.23 | 105.12 : 8.41 | 89.96
compass-rails : 0.00 | 84.93 : 0.08 | 105.20 : 0.00 | 89.96
compass_twitter_bootstrap: 0.00 | 84.93 : 0.00 | 105.20 : 0.00 | 89.96
profanalyzer : 0.59 | 85.52 : 0.46 | 105.66 : 0.64 | 90.60
simple_form : 0.34 | 85.87 : 0.35 | 106.01 : 0.00 | 90.60
sorcery : 0.00 | 85.87 : 0.25 | 106.26 : 1.07 | 91.67
validates_timeliness: 1.47 | 87.34 : 1.82 | 108.07 : 1.62 | 93.29
mongoid_token : 0.00 | 87.34 : 0.00 | 108.07 : 0.00 | 93.29
nested_form : 0.00 | 87.34 : 0.00 | 108.07 : 0.01 | 93.30
nokogiri : 0.86 | 88.20 : 1.16 | 109.24 : 1.37 | 94.67
carmen : 0.00 | 88.20 : 0.07 | 109.30 : 0.00 | 94.67
carrierwave/mongoid : 2.78 | 90.98 : 0.38 | 109.69 : 0.13 | 94.80
yajl : 0.04 | 91.02 : 0.04 | 109.73 : 0.04 | 94.84
multi_json : 0.00 | 91.02 : 0.00 | 109.73 : 0.00 | 94.84
uuid : 0.00 | 91.03 : 0.00 | 109.73 : 0.41 | 95.25
tilt : 0.00 | 91.03 : 0.00 | 109.73 : 0.00 | 95.25
dynamic_form : 0.00 | 91.04 : 0.00 | 109.73 : 0.00 | 95.25
forem : 0.03 | 91.07 : 0.00 | 109.73 : 0.00 | 95.25
browser : 0.00 | 91.07 : 0.00 | 109.73 : 0.00 | 95.25
activemerchant : 2.17 | 93.24 : 1.18 | 110.92 : 10.58 | 105.83
kaminari : 0.00 | 93.24 : 0.00 | 110.92 : 0.00 | 105.83
merit : 0.00 | 93.24 : 0.00 | 110.92 : 0.00 | 105.83
memcachier : 0.00 | 93.24 : 0.00 | 110.92 : 0.00 | 105.83
dalli : 0.01 | 93.25 : 0.05 | 110.96 : 0.34 | 106.17
bitly : 2.47 | 95.72 : 9.43 | 120.40 : 1.53 | 107.70
em-synchrony : 1.00 | 96.72 : 0.18 | 120.57 : 0.55 | 108.24
em-http-request : 5.56 | 102.28 : 2.15 | 122.72 : 1.40 | 109.64
httparty : 0.00 | 102.28 : 0.00 | 122.72 : 0.00 | 109.64
rack-block : 0.00 | 102.28 : 0.00 | 122.72 : 0.00 | 109.64
resque/server : 1.21 | 103.49 : 1.73 | 124.45 : 1.68 | 111.32
resque_mailer : 0.00 | 103.49 : 0.00 | 124.45 : 0.00 | 111.32
rack-timeout : 0.00 | 103.49 : 0.00 | 124.45 : 0.00 | 111.32
chronic : 1.66 | 105.15 : 0.67 | 125.12 : 0.64 | 111.96
oink : 0.00 | 105.15 : 0.00 | 125.12 : 0.00 | 111.96
dotenv-rails : 0.00 | 105.15 : 0.00 | 125.12 : 0.00 | 111.96
jquery-rails : 0.00 | 105.15 : 0.03 | 125.15 : 0.00 | 111.96
jquery-ui-rails : 0.00 | 105.15 : 0.00 | 125.15 : 0.00 | 111.96
It's clear to me that there's something very basic that I'm missing and don't understand about how memory is allocated to Ruby processes, but I'm having a hard time figuring out why it could be this seemingly stochastic. Anyone have any thoughts?
I'm going to take a wild guess and say this is caused by address space layout randomization and by interactions with shared libraries whose footprints are influenced by running programs that are not in your test case.
OS X has received increasing support for ASLR starting with 10.5 and as of 10.8 even the kernel is randomly relocated.
In some cases, ASLR can cause a program segment to use extra pages depending on whether the offset causes a page boundary to be crossed. Since there are many segments and many libraries, this effect is difficult to predict.
I'm also wondering if (given the huge differences you are seeing) perhaps this is a reporting issue in OS X. I wonder if the overhead for shared objects is being charged unfairly depending on load order.
You can test this by disabling ASLR; see this Stack Overflow question: Disabling ASLR in Mac OS X Snow Leopard.
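One rough way to quantify the boot-to-boot variance itself (a minimal sketch, not from the original post; the rails runner invocation is my assumption, and /usr/bin/time -l is the BSD time on OS X that prints rusage):
# record the peak RSS of several cold boots; "maximum resident set size" is reported in bytes on OS X
for i in 1 2 3 4 5; do
/usr/bin/time -l bundle exec rails runner 'exit' 2>&1 | grep 'maximum resident set size'
done
Comparing that spread with ASLR enabled and disabled would help confirm or rule out the explanation above.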

RVM on Ubuntu 12.04 VPS

~# curl -L get.rvm.io | bash -s stable
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 185 100 185 0 0 11 0 0:00:16 0:00:15 0:00:01 1057
100 9979 100 9979 0 0 317 0 0:00:31 0:00:31 --:--:-- 20235
Downloading RVM from wayneeseguin branch stable
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 125 100 125 0 0 8 0 0:00:15 0:00:15 --:--:-- 384
100 125 100 125 0 0 1 0 0:02:05 0:01:33 0:00:32 0curl: (7) couldn't connect to host
Could not download 'https://github.com/wayneeseguin/rvm/tarball/stable'.
curl returned status '7'.
I think your answer is here:
https://github.com/wayneeseguin/rvm/issues/804

Strange "profiling" output. from Rails command line

I'm suddenly experiencing that rake and rails sort of bail out with some strange output. It looks most of all like a process list, but it's clearly Ruby/Rails related. It's also several pages long; I actually had to increase the scrollback setting in my terminal to see what was going on before this output started.
Here's a short excerpt:
0.00 48.12 0.00 1 0.00 0.00 Rails::Rack::LogTailer#tail!
0.00 48.12 0.00 3 0.00 0.00 WEBrick::HTTPResponse#[]
0.00 48.12 0.00 1 0.00 0.00 Rack::Utils::HeaderHash#each
0.00 48.12 0.00 2 0.00 0.00 Range#begin
0.00 48.12 0.00 1 0.00 0.00 Range#end
0.00 48.12 0.00 1 0.00 10.00 Rack::File#each
0.00 48.12 0.00 1 0.00 0.00 WEBrick::HTTPRequest#fixup
0.00 48.12 0.00 1 0.00 0.00 Kernel.raise
0.00 48.12 0.00 1 0.00 0.00 Exception#to_s
0.00 48.12 0.00 1 0.00 0.00 WEBrick::GenericServer#stop
0.00 48.12 0.00 1 0.00 0.00 WEBrick::BasicLog#debug?
This particular output came after I killed WEBrick (Ctrl+C). I also see this when running tests (it seems to show up after each test/file) and when running rake db:migrate (it shows up when the migration is done).
I'm currently running Rails 3.1.0 (upgraded from 3.0.5 hoping that would solve this) and ruby 1.9.2p180 installed through RVM.
Any ideas why this is happening?
You had a model named Profile. When you removed it, Rails still attempted to load that constant; since the file no longer existed, it looked elsewhere on the load path and picked up the profiler from the standard library. So what you are seeing is your application being (accidentally) profiled.
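You can reproduce the effect directly (a minimal sketch; this relies on profile being in the Ruby standard library, as it is on 1.9.x): requiring it makes the interpreter print exactly this kind of table when the process exits.
# requiring the stdlib profiler prints a per-method profile table on exit
ruby -rprofile -e 'puts "hello"'
So a leftover reference to a Profile constant (or a stray require 'profile') anywhere on the load path is enough to trigger it.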
