How can I measure the peak (maximum) memory usage of a program?
It does a lot of malloc/free and runs rather fast, so I can't see the maximum in top.
I want something like the time utility:
$ time ./program
real xx sec
user xx sec
sys xx sec
and
$ mem_report ./program
max memory used xx mb
shared mem xx mb
The time you are calling there is your shell's builtin. If you call /usr/bin/time (the external program) instead, you will get some knowledge of resident memory usage. Note, however, that it may not count memory-mapped files, shared memory, and other details which you may need.
If you are on Linux, you can wrap your program in a script that polls:
# for your current process
/proc/self/statm
# or a process you know the pid of
/proc/{pid}/statm
and writes out the results; you can aggregate them afterwards.
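A minimal sketch of such a wrapper (the name mem_report.sh is made up; it assumes GNU coreutils for the fractional sleep, and sampling can miss allocation spikes that happen between polls):
#!/bin/sh
# mem_report.sh: run a command and report its peak resident set size,
# polling /proc/<pid>/statm from a background loop.
"$@" &                              # start the program under test
pid=$!
tmp=$(mktemp); echo 0 > "$tmp"
(
    max=0
    # field 2 of statm is the resident set size, in pages; the loop
    # ends once the program is reaped and /proc/<pid> disappears
    while pages=$(awk '{print $2}' /proc/$pid/statm 2>/dev/null) && [ -n "$pages" ]; do
        [ "$pages" -gt "$max" ] && { max=$pages; echo "$max" > "$tmp"; }
        sleep 0.05
    done
) &
wait $pid                           # reap the program
wait                                # let the poller finish
echo "max memory used: $(( $(cat "$tmp") * $(getconf PAGESIZE) / 1024 )) kB"
rm -f "$tmp"
Invoked as: ./mem_report.sh ./program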
Due to temporary (hopefully) financial problems I have to use an old laptop. Its FSB (Front Side Bus) clock is 333 MHz (https://www.techsiting.com/mt-s-vs-mhz/). It has 2 SO-DIMM slots for DDR2 SDRAM. It previously had only one 2 GB DIMM, and it was a nightmare.
Each slot can handle at most 2 GB, so the maximum amount of memory is 4 GB. Knowing that DDR stands for double data rate, I bought for funny money (10 euro) two DDR2 SO-DIMM 800 MHz modules, hoping to get 2 x 333 MHz -> 667 MT/s after the divider (assuming the memory divider is 1:2; it's double data rate, isn't it? No idea how they avoided 666). Since I have a Core2Duo, I even had a very little hope of getting 4 x 333 MHz = 1333 MT/s.
But it seems that my memory divider is 1:1, so I get either
2 x 333 MHz x divider = 333 MT/s
4 x 333 MHz x divider = ?
And utilities like lshw and dmidecode seem to confirm that:
~ >>> sudo lshw -C memory | grep clock
clock: 333MHz (3.0ns) # notice 333MHz here
clock: 333MHz (3.0ns) # notice 333MHz here
~ >>> sudo dmidecode --type memory | grep Speed
Supported Speeds:
Current Speed: Unknown
Current Speed: Unknown
Speed: 333 MT/s # notice 333MT/s here
Speed: 333 MT/s # notice 333MT/s here
~ >>>
So my 333 MHz on the FSB has been multiplied by 1 (one) and I got 333 MT/s (if I understood correctly). I'm still satisfied: the OS does not swap as much, the boot process is faster, programs start faster, the browser does not hang every hour, and I can open many more tabs. I just want to know, since I have a Core2Duo, which **MT/s** do I have of these two? Or maybe it is even more complicated?
2 x 333 MHz x divider = 333 MT/s
4 x 333 MHz x divider = 667 MT/s # 4 because of Duo
And is there any difference, for a two-processor system with just 4 GB of RAM, when MT/s == MHz?
PS: The BIOS is old (although the latest available), and I cannot see the real FSB clock there, nor change it, nor change the memory divider.
It looks like there's no point in checking the I/O bus clock with some Linux command/tool, because it is always just half of the memory data rate, if what is written in electronics.stackexchange.com/a/424928 is correct:
I/O bus clock is always half of bus data rate.
So my old machine has these parameters:
It is DDR2-333 (not standardized by JEDEC, whose DDR2 grades start at DDR2-400)
It has memory MHz = 333
It has memory MT/s = 333
It has I/O bus MHz = 166.5 # just because
The thing I still don't get: since I have a Core2Duo, is my memory at 333 MT/s or 666 MT/s?
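For reference, here are the standard JEDEC DDR2 clock relationships as a back-of-the-envelope check (general arithmetic, not a claim about this particular laptop):
data rate (MT/s) = 2 x I/O bus clock (MHz)    # "double data rate": one transfer per clock edge
I/O bus clock    = 2 x internal array clock   # specific to DDR2; for original DDR the two are equal
# e.g. DDR2-667: 166 MHz array clock -> 333 MHz I/O bus -> 667 MT/s
# here: 333 MT/s reported -> 166.5 MHz I/O bus, matching the list above
# note that the number of CPU cores appears nowhere in this arithmetic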
How do I set the block-size parameter when running grep through GNU parallel on a single machine with multiple cores, based on the size of large_file, the size of small_file, and the machine I'm using, to get the fastest performance possible? (Or please correct me if there is something else I'm missing here.) What are the performance issues/speed bottlenecks I'll run into when setting it too high or too low? I understand what block-size does: it chops large_file into chunks and sends those chunks to each job. But I'm still missing how and why that would impact execution speed.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per job slot (here 10 chunks). The splitting is done on the fly, so the file will not be read into RAM first.
Splitting into n evenly sized blocks (where n = the number of jobs run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop the file into more pieces. E.g. --block -10 will split it into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of: disk I/O, CPU, RAM, or command startup time.
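Since trial and error is the way to go, here is a rough way to compare candidate block sizes, reusing the exact command from the question (timings go to stderr; the grep output is discarded):
for b in -1 -10 100M 500M; do
    echo "--block $b" >&2
    /usr/bin/time -f "elapsed: %e s" \
        parallel --pipepart --block $b --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
done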
I want to see the peak memory usage of a command. I have a parametrized algorithm and I want to know when the program will crash due to an out-of-memory error on my machine (12 GB RAM).
I tried:
/usr/bin/time -f "%M" command
valgrind --tool=massif command
The first one gave me 1414168 (1.4 GB; thank you ks1322 for pointing out it is measured in kB!) and valgrind gave me:
$ ms_print massif.out
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
75 26,935,731,596 22,420,728 21,956,875 463,853 0
I'm a bit confused about which number I should take, but let's assume "total" (22 MB).
And massif-visualizer shows me yet another picture (screenshot omitted).
Now I have 3 different numbers for the same command:
valgrind --tool=massif command + ms_print: 22MB
valgrind --tool=massif command + massif-visualizer: 206MB (this is what I see in htop and I guess this is what I'm interested in)
time -f "%M" command: 1.4GB
Which is the number I should look at? Why are the numbers different at all?
/usr/bin/time -f "%M" measures the maximum RSS (resident set size), that is the memory used by the process that is in RAM and not swapped out. This memory includes the heap, the stack, the data segment, etc.
This measures the max RSS of the child processes (including grandchildren) taken individually (not the max of the sum of the children's RSS).
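For example, spelling out the format string (the 1414168 figure is the one from the question; %M is in kilobytes):
/usr/bin/time -f "max RSS: %M kB" command
max RSS: 1414168 kB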
valgrind --tool=massif, as the documentation says:
measures only heap memory, i.e. memory allocated with malloc, calloc, realloc, memalign, new, new[], and a few other, similar functions. This means it does not directly measure memory allocated with lower-level system calls such as mmap, mremap, and brk
This measures only the memory in the child (not grandchildren).
This does not measure the stack nor the text and data segments.
(options like --pages-as-heap=yes and --stacks=yes make it measure more)
So in your case the differences are:
time takes into account the grandchildren, while valgrind does not
time does not measure the memory swapped out, while valgrind does
time measures the stack and data segments, while valgrind does not
You should now:
check whether some children are responsible for the memory consumption
try profiling with valgrind --tool=massif --stacks=yes to check the stack
try profiling with valgrind --tool=massif --pages-as-heap=yes to check the rest of the memory usage
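Concretely, those two follow-up runs look like this (a sketch; by default Massif writes its data to massif.out.<pid>, which ms_print then renders):
valgrind --tool=massif --stacks=yes command        # also measure the stack
valgrind --tool=massif --pages-as-heap=yes command # page-level view: mmap, brk, etc.
ms_print massif.out.<pid>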
I am using a MAX10 FPGA and have interfaced DDR3 memory to it. I have noticed that my DDR3 memory is slow compared to on-chip memory. I noticed this because I wrote a blinking-LEDs program, and the same delay function runs faster from on-chip memory than from DDR3. What can be done to increase the speed? And what might be wrong? My system clock runs at 50 MHz.
P.S. There are no instruction or data caches in my system.
First, as you describe it, your function is not pipelined: you do something with memory and then blink the LED, so everything runs in sequence.
In this case, you should estimate the response time and throughput of your memory. For example, suppose you read a value from memory and then do an add, ten times over. If each read always waits for the previous add, your total time is about 10 x response time + 10 x add time.
The difference is memory response time. On-chip RAM can respond in 1 cycle at 50 MHz, but DDR3 takes about 80 ns. That's the difference.
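To put rough numbers on that (using the 50 MHz system clock from the question, so one cycle is 20 ns, and the ~80 ns DDR3 latency estimated above):
on-chip: 10 reads x 20 ns  = 200 ns
DDR3:    10 reads x ~80 ns = ~800 ns  # roughly 4x slower before any compute time is added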
But you can restructure your module into a pipelined pattern: read/write data in parallel with your other work, and issue DDR reads/writes ahead of time. That's like the cache in a PC, and it can save some time.
By the way, DDR throughput depends heavily on your access pattern. If you read or write data at sequential addresses, you will get higher throughput.
In any case, external memory's throughput and response time can never beat internal memory's.
Forgive my English.
I am running SBCL 1.0.51 on a Linux (Fedora 15) 32-bit system (kernel 3.6.5) with 1 GB RAM and 256 MB swap space.
I fire up sbcl --dynamic-space-size 125 and start calling a function that makes ~10000 HTTP requests (using drakma) to an HTTP (CouchDB) server, and I just format the results of an operation on the returned data to standard output.
After each call I do a (sb-ext:gc :full t) and then (room). The results are not growing: no matter how many times I run the function, (room) reports the same used space (with some ups and downs, but around the same average, which does not grow).
BUT: after every call to the function, top reports that the VIRT and RES amounts of the sbcl process keep growing, even beyond the 125 MB space I told SBCL to claim for itself. So I have the following questions:
Why does top-reported memory keep growing while (room) says it does not? The only thing I can think of is some leakage through the FFI. I am not directly calling out with FFI, but maybe some drakma dependency does and forgets to free its C garbage. Anyway, I don't know if that could even be an explanation. Could it be something else? Any insights?
Why isn't --dynamic-space-size honoured?