Why is memory RSS low but throughput high?

I'm working on a NUMA-related benchmark and have been stuck on an issue for a week.
I use numactl to pin a matrix multiplication workload to node0's CPUs and node1's memory, like this:
[image: workload]
However, I observe that a large share of memory accesses still goes to node0 (its throughput is high), whether I measure with pcm or numatop:
[images: pcm and numatop output]
Then I tracked the workload's page mappings by watching /proc/<pid>/numa_maps and dumping the number of pages on both NUMA nodes while the workload was running:
[image: pages per NUMA node over time]
The funny thing is that node0's number of active pages stays low the whole time.
This doesn't make sense to me: why is RSS low when memory throughput is high?
FYI, the workload above is written in Python and uses NumPy. When I run it in Go with another library, both RSS and throughput on node0 are low. I can't figure out what happens in the Python runtime that sends so many memory accesses to node0.
I assumed that dynamic libraries might reside on node0 at runtime, so I tracked those libraries (see figure) and evicted them from memory before running. However, the result stays the same.
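For anyone who wants to reproduce the per-node page counts, here is a minimal Python sketch of the numa_maps tracking described above. It assumes the usual numa_maps line format, where each mapping carries fields such as N0=<pages> and N1=<pages>. The function names and the sample text are made up for illustration, and the counts are in units of each mapping's kernel page size.

```python
import re
from collections import Counter

# Matches per-node page counts such as "N0=7" or "N1=1024".
NODE_PAGES = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over every mapping line."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in NODE_PAGES.findall(line):
            totals[int(node)] += int(pages)
    return dict(totals)


def read_numa_maps(pid):
    """Read /proc/<pid>/numa_maps for a running process (Linux only)."""
    with open(f"/proc/{pid}/numa_maps") as f:
        return f.read()


# Made-up sample in the numa_maps format, for illustration only.
sample = """\
00400000 default file=/usr/bin/python3 mapped=12 mapcount=3 N0=12 kernelpagesize_kB=4
7f2c00000000 bind:1 anon=2048 dirty=2048 N1=2048 kernelpagesize_kB=4
7f2c10000000 default anon=40 dirty=40 N0=8 N1=32 kernelpagesize_kB=4
"""

print(pages_per_node(sample))
```

On a live system, pages_per_node(read_numa_maps(pid)) can be called in a loop to sample the distribution while the workload runs.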

Related

Erlang ETS memory fragmentation

I have an Erlang cluster where the 'total' reported by erlang:memory() is between 2 and 2.5GB from idle to busy time, day in, day out. ETS memory usage is around 440M and stays there no matter what. The data within ETS is heavily transient and changes completely throughout the day. Tomorrow's data is guaranteed to have nothing in common with today's.
Linux top says beam is using about 10 gigabytes. free -m 'used' agrees with that (the machine really only runs beam). The overall memory usage of the system grows steadily, about 1% per day on 16GB systems. There is some variance across nodes, but not by a lot, and OS 'used' memory is always several times more than the erlang:memory() total.
erlang:system_info({allocator, ets_alloc}) shows 20 allocators. Most have data that looks something like this (full output of command is here):
{mbcs_pool,[{blocks,2054},
{blocks_size,742672},
{carriers,10},
{carriers_size,17825792}]},
1) Does this mean that 742K bytes (words?) of memory are actually taking 17M of OS memory?
2) As this post suggests, should we add '+MEas bf' to the VM args, in order to reduce overhead?
3) What else can I do to avoid actually running out of memory?
This is R17.5, but we will be migrating to R19.3 in the next deployment (this week). We don't have recon in the current deployment but will be adding it in the next one. Also, I can't imagine this matters, but beam is running inside an Alpine container.
In case someone else runs into this later: this was not actually leaked memory.
Erlang's default memory allocator strategy may not be optimal for your use case, depending on what you do and on how Erlang is configured to allocate blocks. It turns out that, in some cases, memory that is "free" from Erlang's point of view is not necessarily released to the OS immediately, because of allocator fragmentation.
It's somewhat explained here: http://erlang.org/doc/man/erts_alloc.html
The default allocator strategy for the version of Erlang we used at the time is aoffcbf (address order first fit carrier best fit). In our case, this resulted in very high memory fragmentation (10+GB worth of overhead). When troubleshooting these things, erlang:system_info(allocator) and erlang:system_info({allocator, Alloc}) are your friends. Changing to aobff (address order best fit) resulted in much more efficient memory usage. In truth, as long as the machine doesn't run out of physical memory, it doesn't matter, but we were getting dangerously close to the physical limit, and you do not want to start paging. With aobff, we never passed 4GB, even after the node had been up for 18 months. With aoffcbf, we would pass 10GB in a few weeks.
As always, YMMV, as it all depends on the type, size, etc. of the blocks that are allocated, and on how long they live.
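To put a number on the fragmentation, the mbcs_pool figures quoted in the question can be turned into a utilization ratio. This is only a rough sketch: it assumes blocks_size and carriers_size are reported in the same unit (bytes), and the helper name is made up for illustration.

```python
def carrier_utilization(blocks_size, carriers_size):
    """Fraction of carrier memory that is occupied by live blocks."""
    return blocks_size / carriers_size


# Figures from the mbcs_pool entry quoted in the question.
blocks_size = 742_672
carriers_size = 17_825_792

util = carrier_utilization(blocks_size, carriers_size)
print(f"utilization: {util:.1%}")
print(f"carrier memory held per unit of live data: {carriers_size / blocks_size:.1f}x")
```

A low ratio across many allocator instances is the kind of pattern that points to fragmentation rather than a leak.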

Castalia Memory Issue

My application layer protocol works fine, but when the number of nodes is large (more than 600) the simulation exits without any error.
I traced the code and didn't find any problem. It looks like a memory problem, since the number of nodes is large and they perform many operations.
Update:
In my application:
Each node broadcasts 2 msgs/second for the whole simulation time.
The msgs contain a lot of information related to my application.
All the nodes are static.
Using BypassRouting, BypassMAC, Radio cc2420.
Castalia works with more than 600 nodes, and in my previous experiments it reached 2500, but only with a short simulation time. So it depends on the relation between the number of nodes, the simulation time, and the number of messages sent per second.
A single experiment runs successfully, but when running with, for example, 30 seeds (i.e. -r 30) and 110 nodes:
it stopped after experiment 13 with simulation time = 1000s
and it stopped after experiment 22 with simulation time = 600s
How can I free memory from unnecessary things during simulation runs?
(Note: previously I increased the swap space, and that worked up to a certain limit.)
Thanks,
Without more information on your application and the simulation scenario, it's hard to provide very specific suggestions. At the very least, you could provide your ini file and information about any custom modules you are using (your application module, for example). Are you using any mobile nodes? Which protocols are you using? What does your app module do? In general, Castalia should be able to handle 600 nodes. In the past, we have tested Castalia with thousands of (static) nodes.
You could use a memory profiler. An excellent tool (a suite of tools really) is valgrind. You can find memory leaks, and you can also memory profile your program. The heap profiler tool of valgrind is called 'massif':
Massif is a heap profiler. It performs detailed heap profiling by taking regular snapshots of a program's heap. It produces a graph showing heap usage over time, including information about which parts of the program are responsible for the most memory allocations. The graph is supplemented by a text or HTML file that includes more information for determining where the most memory is being allocated. Massif runs programs about 20x slower than normal.
Read the valgrind documentation for more info. This is the way you invoke the tool:
valgrind --tool=massif <executable> <arguments>
The executable in this case is CastaliaBin (not the Castalia python script, which is a higher level execution tool).
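Massif writes its snapshots to a text file (massif.out.<pid> by default), which is normally viewed with ms_print. As a rough illustration of what that file contains, the Python sketch below pulls the heap size out of each snapshot and reports the peak. It assumes the plain key=value snapshot fields (time, mem_heap_B) of the massif output format; the function name and the sample data are made up.

```python
def heap_snapshots(massif_text):
    """Return a list of (time, heap_bytes) pairs, one per snapshot."""
    snapshots = []
    time = None
    for line in massif_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # comment, separator, or heap-tree line
        if key == "time":
            time = int(value)
        elif key == "mem_heap_B":
            snapshots.append((time, int(value)))
    return snapshots


# Made-up excerpt in the massif.out format, for illustration only.
sample = """\
desc: (none)
cmd: ./CastaliaBin
time_unit: i
#-----------
snapshot=0
#-----------
time=0
mem_heap_B=0
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=1
#-----------
time=1200
mem_heap_B=4096
mem_heap_extra_B=16
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=2
#-----------
time=2500
mem_heap_B=16384
mem_heap_extra_B=64
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=3
#-----------
time=3100
mem_heap_B=8192
mem_heap_extra_B=32
mem_stacks_B=0
heap_tree=empty
"""

snaps = heap_snapshots(sample)
peak_time, peak_bytes = max(snaps, key=lambda s: s[1])
print(f"{len(snaps)} snapshots, peak heap {peak_bytes} B at time {peak_time}")
```

A heap that keeps climbing from one snapshot to the next across repeated runs would be consistent with the memory exhaustion suspected in the question.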

Azure Website Memory Usage - what's OK

I have been trying to determine the best instance/size combination for 21 sites on Azure Websites because of what I think is memory pressure. CPU is not an issue (< 3%).
I have tried 1 medium and 1 or 2 smalls. Medium improved overall performance by about 15ms response time on the busiest site (per New Relic), probably because it doubles the cores (and memory).
Using the new preview portal's memory quota module:
1 or 2 smalls runs about 80%-90% average memory
1 medium runs about 70%
That makes no sense considering the medium is double the memory. Is the larger memory availability not forcing the GC to run as often on the medium instance?
What % of memory can an instance run at without impacting performance?
Is 80-90% OK on the small instance?
Will adding more small instances help with a memory problem, given that it basically just creates the same setup on every instance, and each one will eventually use the same amount of memory?
I have not been able to isolate any performance issues on any of the 21 sites, but I don't want a surprise if I am running too close to the limit.
Run the sites on a smaller instance and set auto-scale to ramp when it gets to high CPU. Monitor how often it needs to scale and if it scales often just switch to a larger instance size permanently.
Coming back to this for some user edification. It seems that the ~80% range is normal and is just using "what's available." No perf issues or failures ever.
So if you come across this and are worried about high memory usage, you'll need to keep an eye on it to determine whether it's just normal or whether you have a memory leak.
I know, crappy answer :)

Detailed multitasking monitoring

I'm trying to put together a model of a computer and run some simulations on it (part of a school assignment). It's a very simple model: a CPU, a disk and a process generator that generates user processes, which take turns using the CPU and accessing the disk. (I've decided to omit the various system processes because they use next to no CPU time. I'm basing this on the Microsoft Process Explorer tool, running on Windows 7.) And this is where I'm stuck.
I have no idea how to get relevant data on how often various processes read from or write to disk, how much data they transfer at once, and how much time they spend using the CPU. Let's say I want to get some statistics for some typical operations on a PC: playing music/movies, browsing the internet, playing games, working with Office, video editing and so on. Is there even a way to gather such data?
I'm simulating preemptive multitasking using RR with a time quantum of 15ms for switching processes, and this is how it looks:
->Process gets to CPU
->Process does its work in 0-15ms, gives up the CPU or is cut off
And now, two options arise:
a) the process just sits and waits until it gets the CPU again, or until it gets some user input if there is nothing to do
b) the process requested data from disk, and does not rejoin the queue until that data is available
I would like the model to choose between a) and b) based on a probability, for example 90% for a) and 10% for b). But I do not know how to make those percentages at least somewhat realistic for a certain type of process. Also, how much data can and does a process typically access at once?
Any hints, sources, utilities available for this?
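For reference, the model described above can be sketched in a few lines of Python. This is a toy under several assumptions that are not in the question: burst lengths are drawn uniformly from 0 to 15 ms, every disk request takes a fixed time, and the disk serves any number of requests in parallel. All names and parameters are made up for illustration.

```python
import heapq
import random
from collections import deque

QUANTUM_MS = 15.0  # round-robin time quantum from the question


def simulate(num_procs, p_disk, disk_ms, total_ms, seed=0):
    """Round-robin toy model: after each burst a process either rejoins
    the ready queue (option a) or waits for disk data (option b)."""
    if num_procs < 1 or total_ms <= 0:
        raise ValueError("need at least one process and a positive duration")
    rng = random.Random(seed)
    ready = deque(range(num_procs))
    blocked = []  # min-heap of (wake_time, pid)
    clock = cpu_busy = 0.0
    disk_requests = 0
    while clock < total_ms:
        # Processes whose disk data has arrived rejoin the ready queue.
        while blocked and blocked[0][0] <= clock:
            ready.append(heapq.heappop(blocked)[1])
        if not ready:
            # CPU idles until the next disk request completes.
            clock = blocked[0][0]
            continue
        pid = ready.popleft()
        burst = rng.uniform(0.0, QUANTUM_MS)  # assumed burst distribution
        clock += burst
        cpu_busy += burst
        if rng.random() < p_disk:
            disk_requests += 1
            heapq.heappush(blocked, (clock + disk_ms, pid))
        else:
            ready.append(pid)
    return {"cpu_utilization": cpu_busy / clock, "disk_requests": disk_requests}


result = simulate(num_procs=5, p_disk=0.10, disk_ms=8.0, total_ms=10_000)
print(result)
```

Changing p_disk per process type is where measured statistics would plug in.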
I think I found an answer myself, albeit an unreliable one.
The Process Explorer utility for Windows measures disk I/O, both by volume and by number of occurrences. So there's a rough way to get the answer:
say a process performs 3,000 reads in 30 minutes, while using 2% of the CPU during that time (assuming a single-core CPU). So the process has used 36,000ms of CPU time, divided into ~4,800 blocks (this is the unreliable part: the process in all probability does not use the whole time slot, so I'll just divide by half the time slot, i.e. 7.5ms). 3000/4800 gives a 62.5% chance of reading data after using the CPU.
I hope I did not misunderstand the "reads" statistic in Process Explorer.
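A minimal Python sketch of this estimate, taking the half-slot assumption literally (7.5 ms of CPU per block). The function name and parameters are illustrative only.

```python
def disk_read_probability(reads, wall_ms, cpu_share, block_ms):
    """Estimate the chance that a CPU block is followed by a disk read."""
    cpu_ms = wall_ms * cpu_share   # CPU time actually used
    blocks = cpu_ms / block_ms     # number of CPU bursts, by assumption
    return cpu_ms, blocks, reads / blocks


cpu_ms, blocks, p_read = disk_read_probability(
    reads=3000,
    wall_ms=30 * 60 * 1000,  # 30 minutes
    cpu_share=0.02,          # 2% of a single core
    block_ms=15 / 2,         # half of the 15 ms quantum
)
print(f"CPU time: {cpu_ms:.0f} ms, blocks: {blocks:.0f}, P(read): {p_read:.1%}")
```

The result is very sensitive to block_ms, which is exactly the unreliable part of the estimate.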

What will happen if an application is too large to be loaded into the available RAM?

There is a chance that a heavyweight application needs to be launched on a low-spec system (especially when the system has too little memory).
Also, what happens when we already have a lot of applications open and keep opening new ones?
I have only seen applications taking a long time to respond or hanging for a while when I use them on a low-spec system with little memory and an old processor.
How is the system able to accommodate many applications when memory is low (like 128 MB or less)?
Does it involve paging or something else?
Can someone please explain the theory behind this?
"Heavyweight" is a very vague term. When the OS loads your program, the EXE is mapped in your address space, but only the code pages that run (or data pages that are referenced) are paged in as necessary.
You will likely get horrible performance if pages need to constantly be swapped as the program runs (aka many hard page faults), but it should work.
Since your commit charge is near the commit limit, and the commit limit will likely have no room to grow, you will also likely receive many malloc()/VirtualAlloc(..., MEM_COMMIT)/HeapAlloc()/{Local|Global}Alloc() failures, so you need to check the return codes in your program.
Some keywords for search engines are: paging, swapping, virtual memory.
Wikipedia has an article called Paging (Redirected from Swap space).
Systems usually rely on virtual memory. Virtual memory pages are mapped to physical memory when they are used. If a physical page is needed and none is free, another page is written to disk. This is called swapping, and it explains why crowded systems get slow and why memory upgrades improve performance.
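The eviction behaviour described above can be illustrated with a toy Python simulation. It assumes a least-recently-used (LRU) replacement policy, which is only one possible choice and not necessarily what any particular OS does. The function name and the reference string are made up.

```python
from collections import OrderedDict


def simulate_lru(references, num_frames):
    """Count page faults and evictions for a sequence of page accesses."""
    frames = OrderedDict()  # resident pages, least recently used first
    faults = evictions = 0
    for page in references:
        if page in frames:
            frames.move_to_end(page)  # hit: mark as most recently used
            continue
        faults += 1
        if len(frames) >= num_frames:
            frames.popitem(last=False)  # evict the least recently used page
            evictions += 1
        frames[page] = None
    return faults, evictions


refs = [1, 2, 1, 3, 2]
for n in (2, 3):
    faults, evictions = simulate_lru(refs, n)
    print(f"{n} frames: {faults} faults, {evictions} evictions")
```

With fewer frames, the same access pattern causes more faults and evictions, which is the slowdown seen on memory-starved machines.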
