Measuring memory performance in Erlang

Is there a way to measure the total memory usage when running a program in Erlang? My benchmarks spawn a process which in turn spawns more processes, and so on; towards the end they all terminate until only the initial process is left to receive some result.
I am interested in the peak momentary memory usage of the whole run: assuming memory usage is 0 before I spawn the initial process, how high does it get at its highest point?
I looked at this thread: GC performance in Erlang, which describes process_info/2. It seems, however, that if I spawn a process, the memory reported by process_info(self(), memory) does not increase.
Percept seems to mainly gather statistics of processes and their lifetimes, rather than their resource consumption.
Any help is appreciated.

Related

Writing millions of records to an mnesia table takes up a lot of memory (RAM) that is not reclaimed even after the records are deleted

I am running an Erlang application that regularly writes millions of records to an mnesia table that backs a scheduler. When their time comes due, the records are executed and removed from the table. The table uses disc_copies storage with {type, ordered_set}. I use transactions for writing and dirty operations for deleting records.
In one experiment I write 2 million records and then delete all of them: the RAM was not reclaimed after the run finished. There is also a spike that roughly doubles memory usage when I start deleting the records. For example, the beam process starts at 75 MB and ends the experiment at 410 MB. I used erlang:memory() to inspect memory before and after and found that most of it was attributed to processes_used and binary, even though I do not work with binaries directly. If I call erlang:garbage_collect(Pid) on every running process, most of the memory is reclaimed, leaving 180 MB in use.
Any suggestions for troubleshooting this issue would be highly appreciated. Thank you so much.
Answer from Rickard Green of the Erlang/OTP team:
The above does not indicate a bug.
A process is not garbage collected unless it reaches certain limits, for example, when it needs to allocate heap data and there is no free heap available. If a process stops executing, it does not matter how much time passes: it will not garbage collect by itself unless it reaches one of these limits. A garbage collection can, however, be forced by calling erlang:garbage_collect().
A process that has had a lot of live data (and has therefore grown large) but has no live data at the time of the garbage collection will not shrink back to its original size immediately. Instead it gets a relatively large heap. That heap space is free for use by the process, but from the system's point of view it is allocated. The relatively large heap is chosen in order to avoid triggering garbage collections unnecessarily often.
Your own processes are not the only ones affected when you execute; other processes may also build up heap in order to serve yours.
If you look at memory consumption via top or similar tools, it is also expected that memory usage will have increased after execution even if you are able to garbage collect every process back down to its initial size. This is due to memory allocators placing memory blocks into larger chunks of memory, which cannot be returned to the operating system until the whole chunk is free. More or less every memory allocation system in existence has this characteristic.

Memory profiler tool to estimate the improvement from enabling NUMA

I work on a low-latency application that, in my opinion, would greatly benefit from enabling NUMA awareness (or from improving memory locality in general).
Is there a profiling tool that would give me an estimate of the potential improvement, perhaps as a percentage or factor by which execution time could be reduced?
I was considering using cachegrind. I would expect a lot of last-level (LL) cache misses, but that still would not give me an idea of the expected improvement.
Thanks a lot.
Edit:
The goal here is to reduce latency. Currently a single thread runs at startup and does all the allocations. A better implementation, I believe, would be to pin the threads to CPU cores and have each thread make the allocations it needs itself. Before doing that, I would like to get some kind of estimate of the benefit in terms of latency.
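For context, the restructuring described above is usually implemented by pinning each worker thread to a core and relying on Linux's default first-touch policy, so that each thread's data ends up on its local NUMA node. A minimal sketch of that idea (Linux-specific; the thread count, buffer size and function names are illustrative assumptions, not taken from the question):

```cpp
// Sketch: pin each worker to a core, then allocate and first-touch its own
// buffer from that thread so the pages are backed on the local NUMA node.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void worker(int core, std::size_t n_doubles) {
    pin_to_core(core);  // pin before touching any of the working memory
    std::unique_ptr<double[]> local(new double[n_doubles]);
    for (std::size_t i = 0; i < n_doubles; ++i)
        local[i] = 0.0;  // first touch happens here, on the pinned core
    // ... run the latency-critical work on `local` ...
}

int main() {
    const int n_threads = 4;                              // illustrative
    const std::size_t n_doubles = std::size_t{1} << 24;   // ~128 MB per thread
    std::vector<std::thread> threads;
    for (int i = 0; i < n_threads; ++i)
        threads.emplace_back(worker, i, n_doubles);
    for (auto& t : threads) t.join();
    return 0;
}
```

As for estimating the benefit beforehand: cache-miss counts from cachegrind do not model NUMA at all, so in practice the most reliable estimate tends to come from measuring the pinned/first-touch variant directly against the current layout.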

MPI internal buffer memory issue

I am dealing with a Fortran code parallelized with MPI, and I'm running out of memory on intensive runs. I'm careful to allocate nearly all of the required memory at the beginning of my simulation. Static memory allocation in subroutines is typically small, but if I were to run out of memory because of those subroutines, it would happen early in the simulation, since that allocation should not grow as time progresses. My issue is that far into the simulation I run into memory errors such as:
Insufficient memory to allocate Fortran RTL message buffer, message #174 = hex 000000ae.
The only thing that I can think of is that my MPI calls are using memory that I cannot preallocate at the beginning of the simulation. I'm using mostly MPI_Allreduce, MPI_Alltoall, and MPI_Alltoallv while the simulation is running and sometimes I am passing large amounts of data. Could the memory issues be a result of internal buffers created by MPI? How can I prevent a surprise memory issue like this? Can this internal buffer grow during the simulation?
I've looked at Valgrind and besides the annoying MPI warnings, I'm not seeing any other memory issues.
It's a bit hard to tell if MPI is at fault here without knowing more details. You can try massif (one of the valgrind tools) to find out where memory is being allocated.
Be sure you don't introduce any resource leaks: If you create new MPI resources (communicators, groups, requests etc.), make sure to release them properly.
In general, be aware of the buffer sizes required for all-to-all communication, especially at large scale. Use MPI_IN_PLACE if possible, or send data in small chunks rather than as single large blocks.
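To make those last two suggestions concrete, here is a rough sketch using the MPI C API from C++ (the question is Fortran, but the calls are the same; counts, datatypes and chunk sizes are illustrative assumptions): an in-place MPI_Alltoall that needs no separate send buffer, and a large exchange split into fixed-size chunks so that neither the application nor the MPI library has to stage the whole payload at once.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // In-place all-to-all: block i of buf is overwritten by the block received
    // from rank i, so only one buffer of nprocs*count doubles exists.
    const int count = 1024;  // doubles per peer (illustrative)
    std::vector<double> buf(static_cast<std::size_t>(nprocs) * count, rank);
    MPI_Alltoall(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                 buf.data(), count, MPI_DOUBLE, MPI_COMM_WORLD);

    // Chunked exchange: split a large per-peer payload into smaller pieces so
    // the peak footprint per call stays bounded.
    const int total_per_peer = 1 << 20;  // elements per peer (illustrative)
    const int chunk = 1 << 16;           // elements per call
    std::vector<double> send(static_cast<std::size_t>(nprocs) * chunk);
    std::vector<double> recv(static_cast<std::size_t>(nprocs) * chunk);
    for (int off = 0; off < total_per_peer; off += chunk) {
        // ... fill `send` with the slice [off, off+chunk) destined for each peer ...
        MPI_Alltoall(send.data(), chunk, MPI_DOUBLE,
                     recv.data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD);
        // ... consume `recv` before the next iteration ...
    }

    MPI_Finalize();
    return 0;
}
```

The chunked variant trades a few extra collective calls for a bounded peak footprint, which is usually the right trade at large scale.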

Do CPU usage numbers/percentages (e.g. Task Manager) include memory I/O times?

This is something I've always wondered but never looked up.
When an OS reports 100% CPU usage, does that necessarily mean the bottleneck is computation performed by the CPU, or does the figure include stall time spent loading data from the L1, L2 and L3 caches and from RAM?
If it does include stall times, is there a tool that can break the figure down into its components?
The CPU usage reported by the OS includes time stalled waiting for memory accesses (as well as stalls from data dependencies on high latency computational operations like division).
I suspect that one could use hardware performance counters to get a better handle on where the time is going, but I am not familiar with the details of using them.
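On Linux, those counters are exposed through the perf_event interface; `perf stat` already reports stalled-cycles-frontend and stalled-cycles-backend where the hardware supports them. A minimal sketch of reading the same counters programmatically (Linux-only; the busy loop is a stand-in workload, and the stalled-cycles events are not available on every CPU):

```cpp
// Sketch: count total cycles and backend stall cycles around a region of
// interest using perf_event_open (Linux).
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

static int open_counter(std::uint64_t config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;        // start stopped; enabled explicitly below
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // pid 0 = this process, cpu -1 = any CPU, no group, no flags
    return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
}

int main() {
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int stalls = open_counter(PERF_COUNT_HW_STALLED_CYCLES_BACKEND);
    if (cycles < 0 || stalls < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(cycles, PERF_EVENT_IOC_RESET, 0);
    ioctl(stalls, PERF_EVENT_IOC_RESET, 0);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(stalls, PERF_EVENT_IOC_ENABLE, 0);

    // Region of interest: replace with the real workload.
    volatile double sink = 0.0;
    for (long i = 0; i < 10000000; ++i) sink += i * 0.5;

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(stalls, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t c = 0, s = 0;
    if (read(cycles, &c, sizeof(c)) != static_cast<ssize_t>(sizeof(c)) ||
        read(stalls, &s, sizeof(s)) != static_cast<ssize_t>(sizeof(s))) {
        std::perror("read");
        return 1;
    }
    std::printf("cycles=%llu stalled(backend)=%llu (%.1f%% of cycles)\n",
                static_cast<unsigned long long>(c),
                static_cast<unsigned long long>(s),
                c ? 100.0 * s / c : 0.0);
    return 0;
}
```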

Cooperative memory usage across threads?

I have an application with multiple threads processing work from a todo queue. I have no influence over what gets into the queue or in what order (it is fed externally by the user). A single work item may take anywhere from a couple of seconds to several hours of runtime and should not be interrupted while processing. A single work item may also consume between a couple of megabytes and around 2 GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8 GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this. The options I can think of are:
1. Plan conservatively and run only 4 threads. The worst case is no longer a problem, but we waste a lot of parallelism, making the average case much slower.
2. Make each thread check available memory (or rather the total memory allocated by all threads) before starting a new item, and only start when more than 2 GB is left. Recheck periodically, hoping that other threads will finish their memory hogs so that we may start eventually.
3. Try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding the user's choice) or simply adjust the number of running worker threads.
4. More ideas?
I'm currently leaning towards option 2 because it seems simple to implement and solves most cases (a rough sketch of the bookkeeping it needs follows the question). However, I'm still wondering what standard ways of handling situations like this exist. The operating system must do something very similar at the process level, after all...
regards,
Sören
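A rough sketch of the bookkeeping option 2 implies (C++; the class and helper names are illustrative, and it assumes some per-item memory estimate, even a pessimistic one, is available before the item starts):

```cpp
// Sketch: a shared memory budget that each worker must reserve from before
// picking up an item; it blocks while the reservation would exceed the limit.
#include <condition_variable>
#include <cstddef>
#include <mutex>

class MemoryBudget {
public:
    explicit MemoryBudget(std::size_t budget_bytes) : free_(budget_bytes) {}

    // Block until `bytes` can be reserved. Assumes no single estimate exceeds
    // the whole budget; oversized items should be clamped or rejected upstream.
    void reserve(std::size_t bytes) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return bytes <= free_; });
        free_ -= bytes;
    }

    // Return the reservation once the work item has finished.
    void release(std::size_t bytes) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_ += bytes;
        }
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t free_;
};

// Usage inside each worker thread -- estimate_bytes() is a placeholder for
// whatever per-item estimate (or a pessimistic 2 GB default) is available:
//
//   budget.reserve(estimate_bytes(item));
//   process(item);                         // seconds to hours
//   budget.release(estimate_bytes(item));
```

Tracking self-reported reservations rather than querying the OS for free memory sidesteps part of the concern raised in the first answer below, namely that "memory looks free now" is no guarantee it will still be free when the item actually allocates.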
So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy: do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (e.g. random versus sequential) you could load the data progressively. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
I have continued the discussion on Herb Sutter's blog and provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!
