Is it better to use @allocated versus "bytes allocated" for measuring memory usage? I'm a bit surprised that bytes allocated changes from invocation to invocation.
julia> @timev map(x->2*x, [1:100])
0.047360 seconds (89.54 k allocations: 4.269 MiB)
elapsed time (ns): 47359831
bytes allocated: 4476884
pool allocs: 89536
non-pool GC allocs:1
1-element Array{StepRange{Int64,Int64},1}:
2:2:200
julia> @timev map(x->2*x, [1:100])
0.047821 seconds (89.56 k allocations: 4.271 MiB)
elapsed time (ns): 47820714
bytes allocated: 4478708
pool allocs: 89554
non-pool GC allocs:1
1-element Array{StepRange{Int64,Int64},1}:
2:2:200
julia> @timev map(x->2*x, [1:100])
0.045273 seconds (89.58 k allocations: 4.274 MiB)
elapsed time (ns): 45272518
bytes allocated: 4481108
pool allocs: 89580
non-pool GC allocs:1
1-element Array{StepRange{Int64,Int64},1}:
2:2:200
Firstly, you should read the performance tips section of the Julia manual: https://docs.julialang.org/en/v1/manual/performance-tips/index.html
You are violating tip number one: don't benchmark in global scope. A big red flag should be that this simple operation takes 4/100 of a second and allocates 4MB.
For benchmarking, always use the BenchmarkTools.jl package. Below is example usage.
(BTW, do you really mean to operate on [1:100]? This is a single-element vector, where the single element is a Range object. Did you perhaps intend to work on 1:100 or maybe collect(1:100)?)
julia> using BenchmarkTools
julia> foo(y) = map(x->2*x, y)
foo (generic function with 2 methods)
julia> v = 1:100
1:100
julia> @btime foo($v)
73.372 ns (1 allocation: 896 bytes)
julia> v = collect(1:100);
julia> @btime foo($v);
73.699 ns (2 allocations: 912 bytes)
julia> @btime foo($v);
73.100 ns (2 allocations: 912 bytes)
julia> @btime foo($v);
74.033 ns (2 allocations: 912 bytes)
julia> v = [1:100];
julia> @btime foo($v);
55.563 ns (2 allocations: 128 bytes)
As you can see, runtimes are almost 6 orders of magnitude faster than what you are seeing, and allocations are stable.
Notice also that the last example, which uses [1:100], is faster than the others, but that's because it's doing something else.
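To see what that "something else" is, compare what foo returns for each input (a minimal sketch reusing the definitions above):
foo([1:100])        # 1-element Vector whose single element is the StepRange 2:2:200
foo(1:100)          # 100-element Vector{Int64}: [2, 4, 6, ..., 200]
foo(collect(1:100)) # the same 100 doubled integers, computed from a materialized Vector
In the [1:100] case the anonymous function is applied only once, to the whole range, and 2 * (1:100) simply produces a new range without touching 100 elements, which is why it does less work.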
Consider the following code
using Distributions
using BenchmarkTools
u = randn(100, 2)
res = ones(100)
idx = 1
u_vector = u[:, idx]
@btime $res = $1.0 .- $u_vector;
@btime $res = $1.0 .- $u[:,idx];
@btime @views $res = $1.0 .- $u[:,idx];
These are the results that I got from the three lines with @btime:
julia> @btime $res = $1.0 .- $u_vector;
37.478 ns (1 allocation: 896 bytes)
julia> @btime $res = $1.0 .- $u[:,idx];
607.383 ns (13 allocations: 1.97 KiB)
julia> @btime @views $res = $1.0 .- $u[:,idx];
397.597 ns (6 allocations: 1.08 KiB)
The second @btime line has the most time and allocations, but that's in line with my expectation, since I'm slicing. However, I'm not sure why the third line with @views is not the same as the first line. I thought that by using @views I was no longer creating a copy. Is there a way to "fix" the third line? In my real code, the user provides idx, so idx is not known in advance; therefore, I want to reduce allocations when I slice.
What I assume you are looking for is:
julia> @btime $res .= 1.0 .- view($u, :,$idx);
13.126 ns (0 allocations: 0 bytes)
The point is that you want to avoid allocating the vector on the RHS, and that is why you should use .= rather than =.
I also changed @views to a direct view call. It does not matter here, but in general using @views can be tricky at times, so I avoid it unless there is a reason; see here.
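Since in the real code idx comes from the user, one way to package this is an in-place helper (a minimal sketch; subtract_column! is just an illustrative name, not something from a package):
# Writes into the preallocated res; the view avoids copying the column,
# and .= avoids allocating a temporary for the broadcast result.
function subtract_column!(res, u, idx)
    res .= 1.0 .- view(u, :, idx)
    return res
end
You can then benchmark it the same way, e.g. @btime subtract_column!($res, $u, $idx).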
On my machine, the results are different. Note that you shouldn't use $ for 1.0; on the other hand, you should use $ for idx. Here are the results on my machine:
julia> @btime $res = 1.0 .- $u_vector;
103.455 ns (1 allocation: 896 bytes)
julia> @btime $res = 1.0 .- $u[:,$idx];
241.978 ns (2 allocations: 1.75 KiB)
julia> @btime @views $res = 1.0 .- $u[:,$idx];
105.058 ns (1 allocation: 896 bytes)
I ran the test several times with similar results; here is the output of versioninfo():
julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
Threads: 1 on 8 virtual cores
I'd like to know if my PyTorch code is fully utilizing the GPU SMs. According to this question, gpu-util in nvidia-smi only shows how much of the time at least one SM was in use.
I also saw that typing nvidia-smi dmon gives the following table:
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
# Idx W C C % % % % MHz MHz
0 132 71 - 58 18 0 0 6800 1830
One would think that sm% is the SM utilization, but I couldn't find any documentation on what sm% means. The number given is exactly the same as gpu-util in nvidia-smi.
Is there any way to check the SM utilization?
On a side note, is there any way to check memory bandwidth utilization?
I'm trying to locate a memory leak in a certain process via WinDbg, and have come across a strange problem.
Using WinDbg, I created two memory dump snapshots, one before and one after the leak, which showed an increase of around 20 MB (detected via Performance Monitor's private bytes counter). The !heap -s command indeed shows a similar size difference in one of the heaps before and after the leak:
Before:
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
03940000 08000002 48740 35312 48740 4372 553 9 5 2d LFH
External fragmentation 12 % (553 free blocks)
03fb0000 08001002 7216 3596 7216 1286 75 4 8 0 LFH
External fragmentation 35 % (75 free blocks)
05850000 08001002 60 16 60 5 2 1 0 0
...
After:
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
03940000 08000002 64928 55120 64928 6232 1051 26 5 51 LFH
External fragmentation 11 % (1051 free blocks)
03fb0000 08001002 7216 3596 7216 1236 73 4 8 0 LFH
External fragmentation 34 % (73 free blocks)
05850000 08001002 60 16 60 5 2 1 0 0
...
Look at the first heap (03940000): the committed size differs by about 55120 - 35312 = 19808 KB ≈ 20 MB.
However, when I inspect that heap with !heap -stat -h 03940000, it displays the following for both dump files:
size #blocks total ( %) (percent of total busy bytes)
3b32 1 - 3b32 (30.94)
1d34 1 - 1d34 (15.27)
880 1 - 880 (4.44)
558 1 - 558 (2.79)
220 1 - 220 (1.11)
200 2 - 400 (2.09)
158 1 - 158 (0.70)
140 2 - 280 (1.31)
...(rest of the lines show no difference)
size #blocks total ( %) (percent of total busy bytes)
3b32 1 - 3b32 (30.95)
1d34 1 - 1d34 (15.27)
880 1 - 880 (4.44)
558 1 - 558 (2.79)
220 1 - 220 (1.11)
200 2 - 400 (2.09)
158 1 - 158 (0.70)
140 2 - 280 (1.31)
...(rest of the lines show no difference)
As you can see, there is hardly any difference between the two, despite the aforementioned 20 MB size difference.
Is there an explanation for that?
Note: I have also inspected the Unmanaged memory using UMDH - there wasn't a noticeable size difference there.
We are doing some performance measurements including some memory footprint measurements. We've been doing this with GNU time.
But I cannot tell whether it measures in kilobytes (1000 bytes) or kibibytes (1024 bytes).
The man page for my system says of the %M format key (which we are using to measure peak memory usage): "Maximum resident set size of the process during its lifetime, in Kbytes."
I assume K here means the SI "Kilo" prefix, and thus kilobytes.
But having looked at a few other memory measurements of various things through various tools, I trust that assumption like I'd trust a starved lion to watch my dogs during a week-long vacation.
I need to know, because for our tests 1000 vs 1024 Kbytes adds up to a difference of nearly 8 gigabytes, and I'd like to think I can cut down the potential error in our measurements by a few billion.
Using the below testing setup, I have determined that GNU time on my system measures in Kibibytes.
The program below (allocator.c) allocates data and touches it every 128 bytes (the step constant, well under the 4 KiB page size) to ensure that it all gets paged in. Note: this test only works if you can page in the entirety of the allocated data; otherwise time's measurement will only reflect the largest amount that was resident at once.
allocator.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define min(a,b) ( ( (a)>(b) )? (b) : (a) )
volatile char access;
volatile char* data;
const int step = 128;
int main(int argc, char** argv ){
unsigned long k = strtoul( argv[1], NULL, 10 );
if( k > 0 ){
printf( "Allocating %lu (%s) bytes\n", k, argv[1] );
data = (char*) malloc( k );
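/* touch the buffer every step bytes (well under the 4 KiB page size) so each page is faulted in */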
for( unsigned long i = 0; i < k; i += step ){
data[min(i,k-1)] = (char) i;
}
free( (void*) data );
} else {
printf("Bad size: %s => %lu\n", argv[1], k );
}
return 0;
}
compile with: gcc -O3 allocator.c -o allocator
Runner Bash Script:
kibibyte=1024
kilobyte=1000
mebibyte=$(expr 1024 \* ${kibibyte})
megabyte=$(expr 1000 \* ${kilobyte})
gibibyte=$(expr 1024 \* ${mebibyte})
gigabyte=$(expr 1000 \* ${megabyte})
for mult in $(seq 1 3);
do
bytes=$(expr ${gibibyte} \* ${mult} )
echo ${mult} GiB \(${bytes} bytes\)
echo "... in kibibytes: $(expr ${bytes} / ${kibibyte})"
echo "... in kilobytes: $(expr ${bytes} / ${kilobyte})"
/usr/bin/time -v ./allocator ${bytes}
echo "===================================================="
done
For me this produces the following output:
1 GiB (1073741824 bytes)
... in kibibytes: 1048576
... in kilobytes: 1073741
Allocating 1073741824 (1073741824) bytes
Command being timed: "./a.out 1073741824"
User time (seconds): 0.12
System time (seconds): 0.52
Percent of CPU this job got: 75%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.86
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1049068
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 262309
Voluntary context switches: 7
Involuntary context switches: 2
Swaps: 0
File system inputs: 16
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
====================================================
2 GiB (2147483648 bytes)
... in kibibytes: 2097152
... in kilobytes: 2147483
Allocating 2147483648 (2147483648) bytes
Command being timed: "./a.out 2147483648"
User time (seconds): 0.21
System time (seconds): 1.09
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.31
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2097644
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 524453
Voluntary context switches: 4
Involuntary context switches: 3
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
====================================================
3 GiB (3221225472 bytes)
... in kibibytes: 3145728
... in kilobytes: 3221225
Allocating 3221225472 (3221225472) bytes
Command being timed: "./a.out 3221225472"
User time (seconds): 0.38
System time (seconds): 1.60
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.98
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3146220
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 786597
Voluntary context switches: 4
Involuntary context switches: 3
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
====================================================
In the "Maximum resident set size" entry, I see values that are closest to the kibibytes value I expect from that raw byte count. There is some difference because its possible that some memory is being paged out (in cases where it is lower, which none of them are here) and because there is more memory being consumed than what the program allocates (namely, the stack and the actual binary image itself).
Versions on my system:
> gcc --version
gcc (GCC) 6.1.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> /usr/bin/time --version
GNU time 1.7
> lsb_release -a
LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 6.10 (Final)
Release: 6.10
Codename: Final
I am reading chapter 16 from OSTEP on memory segmentation.
In an example in that section, the book translates a 15KB virtual address to a physical address:
| Segment | Base | Size | Grow Positive |
| Code | 32KB | 2K | 1 |
| Heap | 34KB | 2K | 1 |
| Stack | 28KB | 2K | 0(negative) |
To translate the 15KB virtual address to a physical address (in the textbook):
1. 15KB in binary => 11 1100 0000 0000
2. the top 2 bits (11) determine the segment, which is the stack.
3. we are left with 3KB, used to obtain the correct offset:
4. 3KB - maximum segment size = 3KB - 4KB = -1KB
5. physical address = 28KB - 1KB = 27KB
My question is, in step 4, why is the maximum segment 4KB--isn't it 2KB?
in step 4, why is the maximum segment 4KB--isn't it 2KB?
For that part of the book, they're assuming that the hardware uses the highest 2 bits of the (14-bit) virtual address to determine which segment is being used. This leaves "14 - 2 = 12 bits" for the offset within a segment, so it's impossible for the hardware to support segments larger than 4 KiB (because the offset is 12 bits and 2**12 is 4 KiB).
Of course, just because the maximum possible size of a segment is 4 KiB doesn't mean you can't have a smaller segment (e.g. a 2 KiB segment). For expand-down segments I'd assume that the hardware being described in the book does something like "if(max_segment_size - offset >= segment_limit) { segmentation_fault(); }", so if the segment's limit is 2 KiB and "max_segment_size - offset = 4 KiB - 3 KiB = 1 KiB" it'd be fine (no segmentation fault) because 1 KiB is less than the segment limit (2 KiB).
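To make that concrete, here is the book's example worked through in a small sketch (Julia, purely for the arithmetic; the bounds check mirrors the pseudocode assumed above):
seg_shift = 12                     # 14 address bits - 2 segment-selector bits
seg_max   = 1 << seg_shift         # 4 KiB: the largest offset the hardware can express
base, limit = 28 * 1024, 2 * 1024  # stack segment: base 28KB, size 2KB, grows down
va      = 15 * 1024                # the 15KB virtual address from the example
segment = va >> seg_shift          # 0b11, i.e. the stack segment
offset  = va & (seg_max - 1)       # 3KB
seg_max - offset >= limit && error("segmentation fault")  # grow-down bounds check
physical = base + (offset - seg_max)  # 28KB + (-1KB) = 27KB (27648)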
Note: Because no modern CPUs and no modern operating systems use segmentation (and because segmentation works differently on other CPUs, e.g. with segment registers rather than "highest N bits select segment"), I'd be tempted to quickly skim through chapter 16 without paying much attention. The important part is "paging" (starting in chapter 18 of the book).