Memory Management in Ruby

I'm puzzled by some of Ruby's behaviour and how it manages memory.
I understand when the Ruby GC (major or minor) runs: if an object count goes above its threshold or limit (i.e. heap_available_slots, old_objects_limit, remembered_shady_object_limit, malloc_limit), Ruby triggers a GC.
And if, after GC, Ruby still can't find enough memory, it allocates more (basically malloc, I'm assuming) for the running program.
Also, it's a known fact that Ruby does not release memory back to the OS immediately.
Now:
What I fail to understand is how Ruby can release memory back to the OS without triggering any GC.
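For reference, the counters and limits mentioned above can be read at runtime via GC.stat; a minimal sketch (key names are those from Ruby 2.x's GC.stat, the same ones shown in the output below):

```ruby
# Minimal sketch: churn short-lived objects and watch GC.stat counters move.
before = GC.stat(:count)
100_000.times { Object.new }          # allocate garbage to push on the thresholds
after = GC.stat(:count)

puts "GC ran #{after - before} time(s)"
puts "malloc_increase_bytes_limit: #{GC.stat(:malloc_increase_bytes_limit)}"
```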
Example
require 'rbtrace'
index = 1
array = []
while index < 20_000_000
  array << index
  index += 1
end

sleep 10
print "-"
array = nil
sleep
Here is my example, run with ruby 2.2.2p95.
htop shows the RSS of the process (test.rb, PID 11843) reaching 161 MB.
GC.stat (captured via the rbtrace gem) looks like this (pay close attention to the GC count):
rbtrace -p 11843 -e '[Time.now,Process.pid,GC.stat]'
[Time.now,Process.pid,GC.stat]
=> [2016-07-27 13:50:28 +0530, 11843,
{
"count": 7,
"heap_allocated_pages": 74,
"heap_sorted_length": 75,
"heap_allocatable_pages": 0,
"heap_available_slots": 30162,
"heap_live_slots": 11479,
"heap_free_slots": 18594,
"heap_final_slots": 89,
"heap_marked_slots": 120,
"heap_swept_slots": 18847,
"heap_eden_pages": 74,
"heap_tomb_pages": 0,
"total_allocated_pages": 74,
"total_freed_pages": 0,
"total_allocated_objects": 66182,
"total_freed_objects": 54614,
"malloc_increase_bytes": 8368,
"malloc_increase_bytes_limit": 33554432,
"minor_gc_count": 4,
"major_gc_count": 3,
"remembered_wb_unprotected_objects": 0,
"remembered_wb_unprotected_objects_limit": 278,
"old_objects": 14,
"old_objects_limit": 10766,
"oldmalloc_increase_bytes": 198674592,
"oldmalloc_increase_bytes_limit": 20132659
}]
*** detached from process 11843
GC count => 7
Approximately 25 minutes later, memory has dropped to 6 MB, but the GC count is still 7.
[Time.now,Process.pid,GC.stat]
=> [2016-07-27 14:16:02 +0530, 11843,
{
"count": 7,
"heap_allocated_pages": 74,
"heap_sorted_length": 75,
"heap_allocatable_pages": 0,
"heap_available_slots": 30162,
"heap_live_slots": 11581,
"heap_free_slots": 18581,
"heap_final_slots": 0,
"heap_marked_slots": 120,
"heap_swept_slots": 18936,
"heap_eden_pages": 74,
"heap_tomb_pages": 0,
"total_allocated_pages": 74,
"total_freed_pages": 0,
"total_allocated_objects": 66284,
"total_freed_objects": 54703,
"malloc_increase_bytes": 3248,
"malloc_increase_bytes_limit": 33554432,
"minor_gc_count": 4,
"major_gc_count": 3,
"remembered_wb_unprotected_objects": 0,
"remembered_wb_unprotected_objects_limit": 278,
"old_objects": 14,
"old_objects_limit": 10766,
"oldmalloc_increase_bytes": 198663520,
"oldmalloc_increase_bytes_limit": 20132659
}]
Question: I was under the impression that Ruby releases memory whenever GC is triggered, but that is clearly not the case here.
Can anybody provide detail on how the memory is released back to the OS (as in, who triggered the release, since it's surely not the GC)?
OS: OS X version 10.11.12

You are correct, it's not GC that changed the physical memory requirements, it's the OS kernel.
You need to look at the VIRT column, not the RES column. As you can see VIRT stays exactly the same.
RES is physical (resident) memory, VIRT is virtual (allocated, but currently unused) memory.
When the process sleeps, it isn't using its memory or doing anything, so the OS memory manager decides to swap out part of the physical memory and move it to swap space.
Why keep an idle process hogging physical memory for no reason? The OS is smart and swaps out as much unused physical memory as possible; that's why you see a reduction in RES.
I suspect you would see the same effect even without array = nil, just by sleeping long enough. Once the process stops sleeping and accesses something in the array, RES will jump back up again.
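You can watch this from inside the process by sampling its own RSS around a GC.start; a rough sketch (reading RSS via ps, which assumes a ps(1) that supports `-o rss=`, as on macOS and Linux):

```ruby
# Rough sketch: sample our own resident set size (in KB) via ps.
# Assumes `ps -o rss=` works on this platform (macOS and Linux both support it).
def rss_kb
  `ps -o rss= -p #{Process.pid}`.to_i
end

array = Array.new(5_000_000) { |i| i }   # grow the heap
grown = rss_kb
array = nil
GC.start                                  # collect; Ruby may still keep the pages
after_gc = rss_kb

puts "RSS after growth: #{grown} KB, after GC.start: #{after_gc} KB"
```

Note that after_gc often stays close to grown: the GC reclaims slots for reuse, while it's the kernel that later decides what to page out.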
You can read some more discussion through these:
What is RSS and VSZ in Linux memory management
http://www.darkcoding.net/software/resident-and-virtual-memory-on-linux-a-short-example/
What's the difference between "virtual memory" and "swap space"?
http://www.tldp.org/LDP/tlk/mm/memory.html

Related

What is causing an extremely low memory fragmentation ratio?

We are seeing some odd memory issues with our Redis 4.0.2 instances. The master instance has a ratio of 0.12, whereas the slaves have reasonable ratios that hover just above 1. When we restart the master instance, the memory fragmentation ratio goes back to 1 until we hit our peak load times and the ratio goes back down to less than 0.2. The OS (Ubuntu) is telling us that the redis instance is using 13GB of virtual memory and 1.6GB of RAM. And once this happens, most of the data gets swapped out to disk and the performance grinds almost to a halt.
Our keys tend to last for a day or two before being purged. Most values are hashes and zsets with ~100 or so entries and each entry being less than 1kb or so.
We are not sure what is causing this. We have tried tweaking the OS overcommit_memory setting, and we also tried the new MEMORY PURGE command, but neither seemed to help. We are looking for other things to explore and suggestions to try; any advice would be appreciated.
What is the likely cause of this and how can we bring the ratio back closer to 1?
Here is a dump of our memory info:
127.0.0.1:8000> info memory
# Memory
used_memory:12955019496
used_memory_human:12.07G
used_memory_rss:1676115968
used_memory_rss_human:1.56G
used_memory_peak:12955019496
used_memory_peak_human:12.07G
used_memory_peak_perc:100.00%
used_memory_overhead:19789422
used_memory_startup:765600
used_memory_dataset:12935230074
used_memory_dataset_perc:99.85%
total_system_memory:33611145216
total_system_memory_human:31.30G
used_memory_lua:945152
used_memory_lua_human:923.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:0.13
mem_allocator:jemalloc-4.0.3
active_defrag_running:0
lazyfree_pending_objects:0
And our memory stats:
127.0.0.1:8000> memory stats
1) "peak.allocated"
2) (integer) 12954706848
3) "total.allocated"
4) (integer) 12954623968
5) "startup.allocated"
6) (integer) 765600
7) "replication.backlog"
8) (integer) 1048576
9) "clients.slaves"
10) (integer) 33716
11) "clients.normal"
12) (integer) 184494
13) "aof.buffer"
14) (integer) 0
15) "db.0"
16) 1) "overhead.hashtable.main"
2) (integer) 17691184
3) "overhead.hashtable.expires"
4) (integer) 32440
17) "overhead.total"
18) (integer) 19756010
19) "keys.count"
20) (integer) 337422
21) "keys.bytes-per-key"
22) (integer) 38390
23) "dataset.bytes"
24) (integer) 12934867958
25) "dataset.percentage"
26) "99.853401184082031"
27) "peak.percentage"
28) "99.999359130859375"
29) "fragmentation"
30) "0.12932859361171722"

malloc kernel panics instead of returning NULL

I'm attempting to do an exercise from "Expert C Programming" where the point is to see how much memory a program can allocate. It hinges on malloc returning NULL when it cannot allocate anymore.
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int totalMB = 0;
    int oneMeg = 1 << 20;
    while (malloc(oneMeg)) {
        ++totalMB;
    }
    printf("Allocated %d MB total\n", totalMB);
    return 0;
}
Rather than printing the total, I get a kernel panic after allocating ~8 GB on my 16 GB MacBook Pro.
Kernel panic log:
Anonymous UUID: 0B87CC9D-2495-4639-EA18-6F1F8696029F
Tue Dec 13 23:09:12 2016
*** Panic Report ***
panic(cpu 0 caller 0xffffff800c51f5a4): "zalloc: zone map exhausted while allocating from zone VM map entries, likely due to memory leak in zone VM map entries (6178859600 total bytes, 77235745 elements allocated)"#/Library/Caches/com.apple.xbs/Sources/xnu/xnu-3248.50.21/osfmk/kern/zalloc.c:2628
Backtrace (CPU 0), Frame : Return Address
0xffffff91f89bb960 : 0xffffff800c4dab12
0xffffff91f89bb9e0 : 0xffffff800c51f5a4
0xffffff91f89bbb10 : 0xffffff800c5614e0
0xffffff91f89bbb30 : 0xffffff800c5550e2
0xffffff91f89bbba0 : 0xffffff800c554960
0xffffff91f89bbd90 : 0xffffff800c55f493
0xffffff91f89bbea0 : 0xffffff800c4d17cb
0xffffff91f89bbf10 : 0xffffff800c5b8dca
0xffffff91f89bbfb0 : 0xffffff800c5ecc86
BSD process name corresponding to current thread: a.out
Mac OS version:
15F34
I understand that this can easily be fixed by the doctor's cliché of "It hurts when you do that? Then don't do that," but I want to understand why malloc isn't working as expected.
OS X 10.11.5
For the definitive answer to that question, you can look at the source code, which you'll find here:
zalloc.c source in XNU
In that source file find the function zalloc_internal(). This is the function that gives the kernel panic.
In the function you'll find a "for (;;) {" loop, which basically tries to allocate the memory you're requesting in the specified zone. If there isn't enough space, it immediately tries again. If that fails it does a zone_gc() (garbage collect) to try to reclaim memory. If that also fails, it simply kernel panics - effectively halting the computer.
If you want to understand how zalloc.c works, look up zone-based memory allocators.
Your program is making the kernel run out of space in the zone called "VM map entries", which is a predefined zone allocated at boot. You could probably get the result you are expecting from your program, without a kernel panic, if you allocated more than 1 MB at a time.
In essence it is not really a problem for the kernel to allocate you several gigabytes of memory. However, allocating thousands of smaller allocations summing up to those gigabytes is much harder.

Single-threaded program profiles 15% of runtime in semaphore_wait_trap

On Mac OS using mono, if I compile and profile the program below, I get the following results:
% fsharpc --nologo -g foo.fs -o foo.exe
% mono --profile=default:stat foo.exe
...
Statistical samples summary
Sample type: cycles
Unmanaged hits: 336 (49.1%)
Managed hits: 349 (50.9%)
Unresolved hits: 1 ( 0.1%)
Hits % Method name
154 22.48 Microsoft.FSharp.Collections.SetTreeModule:height ...
105 15.33 semaphore_wait_trap
74 10.80 Microsoft.FSharp.Collections.SetTreeModule:add ...
...
Note the second entry, semaphore_wait_trap.
Here is the program:
[<EntryPoint>]
let main args =
    let s = seq { 1..1000000 } |> Set.ofSeq
    s |> Seq.iter (fun _ -> ())
    0
I looked in the source for the Set module, but I didn't find any (obvious) locking.
Is my single-threaded program really spending 15% of its execution time messing with semaphores? If it is, can I make it not do that and get a performance boost?
According to Instruments, it's sgen/gc calling semaphore_wait_trap.
Sgen is documented as stopping all other threads while it collects:
"Before doing a collection (minor or major), the collector must stop all running threads so that it can have a stable view of the current state of the heap, without the other threads changing it."
In other words, when the code is trying to allocate memory and a GC is required, the time it takes shows up under semaphore_wait_trap since that's your application thread. I suspect the mono profiler doesn't profile the gc thread itself so you don't see the time in the collection code.
The germane output then is really the GC summary:
GC summary
GC resizes: 0
Max heap size: 0
Object moves: 1002691
Gen0 collections: 123, max time: 14187us, total time: 354803us, average: 2884us
Gen1 collections: 3, max time: 41336us, total time: 60281us, average: 20093us
If you want your code to run faster, don't collect as often.
Understanding the actual cost of collection can be done through dtrace since sgen has dtrace probes.
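One way to collect less often is to give sgen a larger nursery via the MONO_GC_PARAMS environment variable; a hedged example (nursery-size is a documented sgen option, but the 64m value here is an arbitrary starting point to tune, and foo.exe is the compiled program from the question):

```shell
# Enlarge sgen's nursery so Gen0 collections happen less frequently,
# then re-run the statistical profiler to compare.
MONO_GC_PARAMS=nursery-size=64m mono --profile=default:stat foo.exe
```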

erlang crash dump no more index entries

I have a problem with Erlang.
One of my Erlang nodes crashes and generates erl_crash.dump with reason no more index entries in atom_tab (max=1048576).
I checked the dump file and found that there are a lot of atoms of the form 'B\2209\000... (about 1000000 entries):
=proc:<0.11744.7038>
State: Waiting
Name: 'B\2209\000d\022D.
Spawned as: proc_lib:init_p/5
Spawned by: <0.5032.0>
Started: Sun Feb 23 05:23:27 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Reductions: 1992
Stack+heap: 1597
OldHeap: 1597
Heap unused: 918
OldHeap unused: 376
Program counter: 0x0000000001eb7700 (gen_fsm:loop/7 + 140)
CP: 0x0000000000000000 (invalid)
arity = 0
Do you have any idea what these are?
Atoms
By default, the maximum number of atoms is 1048576. This limit can be raised or lowered using the +t option.
Note: an atom refers into an atom table which also consumes memory. The atom text is stored once for each unique atom in this table. The atom table is not garbage-collected.
I think your program produces a lot of atoms, and the number of atoms has reached the atom limit.
You can use the +t option to change the atom limit of your Erlang VM when you start your Erlang node.
So this tells you that you are generating atoms: somewhere list_to_atom/1 is being called with a variable argument. And because you have processes with this sort of name, you are also register/2-ing processes under these names. It may be your own code or some third-party module you use. It's bad behaviour: don't do it, and don't use modules that do.
To be honest, I can imagine a design where I would do this intentionally, but it is a very special case and obviously not the case when you have to ask this question.
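A common defence is to only ever convert input to atoms that already exist, so untrusted data can never grow the (non-garbage-collected) atom table; a sketch (the module and function names here are made up for illustration):

```erlang
-module(atom_safety).
-export([safe_to_atom/1]).

%% Convert a string to an atom only if that atom already exists in the
%% atom table; otherwise reject it instead of growing the table.
safe_to_atom(Str) ->
    try
        {ok, list_to_existing_atom(Str)}
    catch
        error:badarg -> {error, unknown_atom}
    end.
```

list_to_existing_atom/1 raises badarg for unknown atoms, which is exactly what prevents the atom_tab exhaustion described in this crash dump.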

retrieval of data from ETS table

I know that lookup time is constant for ETS tables. But I also heard that the table is kept outside the process, and when retrieving data it needs to be copied onto the process heap, which is expensive. But then, how to explain this:
18> {Time, [[{ok, Binary}]]} = timer:tc(ets, match, [utilo, {a, '$1'}]).
{0,
[[{ok,<<255,216,255,225,63,254,69,120,105,102,0,0,73,
73,42,0,8,0,0,0,10,0,14,...>>}]]}
19> size(Binary).
1759017
1.7 MB binary takes 0 time to be retrieved from the table!?
EDIT: After I saw Odobenus Rosmarus's answer, I decided to convert the binary to list. Here is the result:
1> {ok, B} = file:read_file("IMG_2171.JPG").
{ok,<<255,216,255,225,63,254,69,120,105,102,0,0,73,73,42,
0,8,0,0,0,10,0,14,1,2,0,32,...>>}
2> size(B).
1986392
3> L = binary_to_list(B).
[255,216,255,225,63,254,69,120,105,102,0,0,73,73,42,0,8,0,0,
0,10,0,14,1,2,0,32,0,0|...]
4> length(L).
1986392
5> ets:insert(utilo, {a, L}).
true
6> timer:tc(ets, match, [utilo, {a, '$1'}]).
{106000,
[[[255,216,255,225,63,254,69,120,105,102,0,0,73,73,42,0,8,0,
0,0,10,0,14,1,2|...]]]}
Now it takes 106000 microseconds to retrieve a 1986392-element list from the table, which is pretty fast, isn't it? Lists are 2 words per element, so the copied data is several times the size of the 1.7 MB binary.
EDIT 2: I started a thread on erlang-question (http://groups.google.com/group/erlang-programming/browse_thread/thread/5581a8b5b27d4fe1) and it turns out that 0.1 second is pretty much the time it takes to do memcpy() (move the data to the process's heap). On the other hand Odobenus Rosmarus's answer explains why retrieving binary takes 0 time.
Binaries themselves (those longer than 64 bytes) are stored on a special binary heap, outside the process heap.
So retrieving a binary from the ETS table moves only the 'ProcBin' part of the binary to the process heap (roughly, a pointer to the start of the binary in binary memory, plus its size).
