I use the heart program to restart an Erlang node when it becomes unresponsive. However, I am finding it hard to understand why the node freezes. SASL logs don't show any errors, and my own logs don't seem to show anything remarkable happening at those times. Can anybody give advice on debugging this sort of thing?
By default the heart program issues a SIGKILL to kill off the unresponsive VM so it can quickly start a new one. This makes getting any useful information about the VM pretty much impossible. Something I've tried in the past is to patch the heart program to avoid the hard kill and instead get the VM to create a crash dump and a coredump. I used a patch like this (this one is for Erlang/OTP R14B02):
--- erts/etc/common/heart.c.orig 2011-04-17 12:11:24.000000000 -0400
+++ erts/etc/common/heart.c 2011-04-17 12:12:36.000000000 -0400
@@ -559,10 +559,11 @@
     int res;
     if(heart_beat_kill_pid != 0){
         pid = (pid_t) heart_beat_kill_pid;
-        res = kill(pid,SIGKILL);
+        res = kill(pid,SIGUSR1);
+        sleep(4);
         for(i=0; i < 5 && res == 0; ++i){
             sleep(1);
-            res = kill(pid,SIGKILL);
+            res = kill(pid,i < 2 ? SIGQUIT : SIGKILL);
         }
         if(errno != ESRCH){
             print_error("Unable to kill old process, "
As you can see, with this patch heart will first issue a SIGUSR1 to try to get the VM to create a crash dump. Since this can take a while, heart then sleeps for 4 seconds. You might have to increase this sleep time if you're not getting full crash dumps. After that, heart tries twice to issue a SIGQUIT in the hope of getting a coredump, and if that fails, issues a SIGKILL.
Note that this patch will slow down heart's VM restart due to the time required to wait for the crash dumps and coredumps. If you use it in production, be aware of this limitation.
You could try calling erlang:halt/1 from your HEART_COMMAND, thus creating a crash dump from the unresponsive node.
You can try using the erl_call tool with e.g. -a erlang halt 123.
If the Erlang node can't respond to this, that in itself is also interesting information.
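For example, a minimal sketch of such an invocation; the node name, cookie, and the "heart timeout" slogan are placeholders, and note that (as far as I know) erlang:halt/1 only writes an erl_crash.dump when given a string slogan, while a plain integer just sets the exit status:
# Ask the unresponsive node to write an erl_crash.dump before it is restarted.
erl_call -c mycookie -n mynode@myhost -a 'erlang halt ["heart timeout"]'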
Did you try increasing HEART_BEAT_TIMEOUT? Maybe the node is just bogged down a bit and misses the timeout, but doesn't actually freeze.
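If I remember the heart documentation correctly, the timeout (in seconds, default 60) is picked up from the environment when the node is started, along these lines (120 is just an example value):
erl -heart -env HEART_BEAT_TIMEOUT 120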
If you have any idea why it is freezing, you could try to trace the module using dbg.
http://www.erlang.org/doc/man/dbg.html
In short, try
dbg:tracer(), dbg:p(all,c), dbg:tpl(Module, Function, x).
To stop this tracing, issue
dbg:ctpl()
See documentation for more info.
Note: Change Module and Function to whatever you want to trace, leave x as it is. You can also skip Function and only give Module, x.
Warning: Running this on a live system can be dangerous as the amount of information that is going to be printed to the shell can be enormous.
I hope this question isn't too open-ended. I ran into a memory issue with Rust, where I got an "out of memory" from calling next on an Iterator trait object. I'm unsure how to debug it. Prints have only brought me to the point where the failure occurs. I'm not very familiar with other tools such as ltrace, so although I could create a trace (231MiB, pff), I didn't really know what to do with it. Is a trace like that useful? Would I do better to grab gdb/lldb? Or Valgrind?
In general, I would take the following approach:
Boilerplate reduction: Try to narrow down the problem of the OOM, so that you don't have too much additional code around. In other words: the quicker your program crashes, the better. Sometimes it is also possible to rip out a specific piece of code and put it into an extra binary, just for the investigation.
Problem size reduction: Lower the problem from an OOM to a simple "too much memory", so that you can actually tell that some part wastes memory even though it doesn't yet lead to an OOM. If it is too hard to tell whether you see the issue or not, you can lower the memory limit. On Linux, this can be done using ulimit:
ulimit -Sv 500000 # that's 500MB
./path/to/exe --foo
Information gathering: If your problem is small enough, you are ready to collect information with a lower noise level. There are multiple ways you can try. Just remember to compile your program with debug symbols. It can also be an advantage to turn off optimizations, since optimization usually leads to information loss. Both can be achieved by NOT using the --release flag during compilation, as sketched below.
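With cargo that is simply a matter of which profile you build; the binary name below is the same placeholder the rest of this answer uses:
# Debug profile: optimizations off, debug info on by default.
cargo build
./target/debug/exe --foo
# For comparison, `cargo build --release` turns optimizations on, which makes
# profiles and traces much harder to map back to source lines.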
Heap profiling: One way is to use gperftools:
LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPPROFILE=/tmp/profile ./path/to/exe --foo
pprof --gv ./path/to/exe /tmp/profile/profile.0100.heap
This shows you a graph that visualizes which parts of your program eat how much memory. See the official docs for more details.
rr: Sometimes it's very hard to figure out what is actually happening, especially after you created a profile. Assuming you did a good job in step 2, you can use rr:
rr record ./path/to/exe --foo
rr replay
This will spawn a GDB with superpowers. The difference from a normal debug session is that you can not only continue but also reverse-continue. Basically, your program is executed from a recording in which you can jump back and forth as you want. This wiki page provides some additional examples. One thing to point out is that rr only seems to work with GDB.
Good old debugging: Sometimes you get traces and recordings that are still way too large. In that case you can (in combination with the ulimit trick) just use GDB and wait until the program crashes:
gdb --args ./path/to/exe --foo
You should now get a normal debugging session where you can examine the current state of the program. GDB can also be launched with coredumps. The general problem with this approach is that you cannot go back in time and you cannot continue the execution; you only see the current state, including all stack frames and variables. Here you could also use LLDB if you want.
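For the coredump case the invocation is simply (both paths being placeholders):
gdb ./path/to/exe /path/to/core
Inside that session, bt full and info locals show the stack and variables from the moment the dump was written, much like in the live case above.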
(Potential) fix + repeat: After you have a clue about what might be going wrong, you can try to change your code. Then try again. If it's still not working, go back to step 3 and try again.
Valgrind and other tools work fine, and should work out of the box as of Rust 1.32. Earlier versions of Rust require changing the global allocator from jemalloc to the system's allocator so that Valgrind and friends know how to monitor memory allocations.
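On toolchains older than 1.32 (but new enough to have the attribute, roughly 1.28 onward), the switch was a couple of lines in the crate root. A minimal sketch, not taken from the question's code:
use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // With the attribute above, this Vec (and every other allocation in the
    // program) goes through the system allocator, which Valgrind can track.
    let v = vec![0u8; 1024];
    println!("allocated {} bytes", v.len());
}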
In this answer, I use the macOS developer tool Instruments, as I'm on macOS, but Valgrind / Massif / Cachegrind work similarly.
Example: An infinite loop
Here's a program that "leaks" memory by pushing 1MiB Strings into a Vec and never freeing it:
use std::{thread, time::Duration};

fn main() {
    let mut held_forever = Vec::new();
    loop {
        held_forever.push("x".repeat(1024 * 1024));
        println!("Allocated another");
        thread::sleep(Duration::from_secs(3));
    }
}
In the Instruments output you can see memory growth over time, as well as the exact stack trace that allocated the memory.
Example: Cycles in reference counts
Here's an example of leaking memory by creating an infinite reference cycle:
use std::{cell::RefCell, rc::Rc};

struct Leaked {
    data: String,
    me: RefCell<Option<Rc<Leaked>>>,
}

fn main() {
    let data = "x".repeat(5 * 1024 * 1024);
    let leaked = Rc::new(Leaked {
        data,
        me: RefCell::new(None),
    });
    let me = leaked.clone();
    *leaked.me.borrow_mut() = Some(me);
}
See also:
Why does Valgrind not detect a memory leak in a Rust program using nightly 1.29.0?
Handling memory leak in cyclic graphs using RefCell and Rc
Minimal `Rc` Dependency Cycle
In general, to debug, you can use either a log-based approach (either by inserting the logs yourself, or having a tool such as ltrace, ptrace, ... generate the logs for you) or you can use a debugger.
Note that ltrace, ptrace or debugger-based approaches require that you be able to reproduce the problem; I tend to favor manual logs because I work in an industry where bug reports are generally too imprecise to allow immediate reproduction (and thus we use logs to create the reproducer scenario).
Rust supports both approaches, and the standard toolset that one uses for C or C++ programs works well for it.
My personal approach is to have some logging in place to quickly narrow down where the issue occurs, and if logging is insufficient, to fire up a debugger for a more fine-combed inspection. In this case, I would recommend going straight to the debugger.
A panic is generated, which means that by breaking on the call to the panic hook, you get to see both the call stack and memory state at the moment where things go awry.
Launch your program with the debugger, set a break point on the panic hook, run the program, profit.
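With GDB that boils down to something like the following; treat it as a sketch, since the exact panic symbol can vary between toolchain versions (rust_panic has worked for me; it exists in the standard library precisely as a breakpoint target):
rust-gdb --args ./path/to/exe --foo
(gdb) break rust_panic
(gdb) run
(gdb) backtrace
Once the breakpoint hits, backtrace and the usual frame/print commands show the call stack and state at the moment things go awry.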
I am trying to understand why an installation file hangs up using Windbg, but I am at a point where I can't stop the execution.
As background, I had already been able to install this program on the same PC, but for some reason I had then uninstalled it, and now I can't re-install it (I tried to clean up everything from the old installation, incl. the registry). Now this setup.exe starts and stays idle among the running processes without doing anything.
But let's go to the actual question. I am trying to use Windbg for the first time (I only had some practice with the old 8086 debug at DOS-time :-), so please bear with me if I'm asking something straightforward).
I have tracked the code up to a point where there is a RET instruction. I am able to stop the debugger at the RET, but as soon as I "step into" the RET, the execution starts and does not stop, while I was expecting it to just go to the instruction following the previous CALL. From how I see things, it seems that after the RET the execution goes somewhere else ... how is that possible? Also, just before the RET there is a SYSCALL that I don't fully understand ... can it have an impact?
This is the portion of the code I am examining at the moment:
ntdll!NtTerminateThread:
00007ff9`fc8b5b20 4c8bd1 mov r10,rcx
00007ff9`fc8b5b23 b853000000 mov eax,53h
00007ff9`fc8b5b28 f604250803fe7f01 test byte ptr [SharedUserData+0x308 (00000000`7ffe0308)],1
00007ff9`fc8b5b30 7503 jne ntdll!NtTerminateThread+0x15 (00007ff9`fc8b5b35)
00007ff9`fc8b5b32 0f05 syscall
00007ff9`fc8b5b34 c3 ret
00007ff9`fc8b5b35 cd2e int 2Eh
00007ff9`fc8b5b37 c3 ret
I am stuck at the first RET instruction, at address 5b34.
At this time, this is the stack call:
00000000`0203fc38 00007ff9`fc86c63e ntdll!NtTerminateThread+0x14
00000000`0203fc40 00007ff9`fc8d903a ntdll!RtlExitUserThread+0x4e
00000000`0203fc80 00007ff9`fc86c5c5 ntdll!DbgUiRemoteBreakin+0x5a
00000000`0203fcb0 00000000`00000000 ntdll!RtlUserThreadStart+0x45
so my understanding is that execution should continue at address 00007ff9`fc86c63e. However, even if I add a BP at this address, or if I just go for a trace, the execution continues and keeps running some idle loop until I hit the "pause" button in windbg, after which it resumes at a completely different address.
In case the registers are relevant, here are some of them:
rax: 353000
rbx: 0
rcx: 0
rdx: 0
rsp: 203fc38
rdi: 7ff9c8d8fe0
rip: 7ff9fc8b5b34
So, where am I going wrong? How can I see where the code goes after this RET?
Thanks in advance for any help,
Bob
When a thread exits, execution won't return to the return address. The system is free to schedule another thread in the process that is ready to run.
The stack shows NtTerminateThread, which is a function that does not return, i.e. something declared like
__declspec(noreturn) foo(...);
By the way, when you say it goes elsewhere, do you mean the app keeps running and is not terminated? If so, hit Ctrl+Break and check what the other threads are doing; in user mode, ~*kb should show the call stacks of all threads.
Answering the comment about where it goes: a process is a collection of threads. Each thread has a stack, and each thread gets a bit of time to execute from the scheduler (the thread quantum). A thread with a lower priority can be preempted by higher-priority threads, by interrupts, by APCs, by DPCs, and so on.
When a thread has completed its quantum, or is preempted by some VIP cavalcade that happens to travel on the road this poor thread is walking, a trap (_KTRAP) is built, the poor thread's position (EIP) is saved into it, and the thread is put behind the barricade with the other waiting threads. When the VIP cavalcade's dust has settled, the police open the barricade and let the poor thread walk on from where it stopped. For such gory details you may need a kernel-debugging setup and may need to control your process from a kernel debugger.
When you hit the return, the OS sees the thread is dead and has no return address, so it checks the ready threads (!ready in a kernel debugger), selects the one with the highest priority, and gives it a quantum to enjoy.
So before hitting the return, check what all the other threads in your app are doing, set an appropriate breakpoint on the threads of interest, and then hit the return; when the other thread executes its quantum, your breakpoint will be hit.
You're looking at the wrong thread!
From the partial output you supplied, it seems like you're attaching to a running process (rather than starting it from the debugger). To break into a running process, the debugger injects a thread into the target process that basically contains a hardcoded int 3 instruction and not much more.
It does it by calling ntdll!RtlpCreateUserThreadEx (the internal undocumented native parallel of CreateRemoteThread) supplying ntdll!DbgUiRemoteBreakin as the start address for the new thread.
The sole purpose of this synthetic thread is to generate the breakpoint exception. This exception causes the operating system to stop running the target process and passes control to the debugger. After it does this it's not needed anymore and it commits suicide.
What you're supposed to do at this point is probably switch to your thread of interest using the ~s command, set breakpoints, and then continue execution.
If you try to step through this synthetic thread, it will just end, and then the process will continue doing whatever it was doing before you broke into it, which is pretty much the opposite of what you want.
That's what this stack means:
00000000`0203fc38 00007ff9`fc86c63e ntdll!NtTerminateThread+0x14
00000000`0203fc40 00007ff9`fc8d903a ntdll!RtlExitUserThread+0x4e
00000000`0203fc80 00007ff9`fc86c5c5 ntdll!DbgUiRemoteBreakin+0x5a
00000000`0203fcb0 00000000`00000000 ntdll!RtlUserThreadStart+0x45
ntdll!RtlUserThreadStart is the real user-mode entry point of all user-mode threads and you can see that it just called ntdll!DbgUiRemoteBreakin after which you continued a bit until the thread finally ends itself.
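In practice the sequence in WinDbg looks roughly like this; the thread index and the module!symbol used for the breakpoint are placeholders, not values taken from your session:
$$ Show the call stack of every thread and pick the one doing the real work.
~* kb
$$ Switch to that thread (2 is only an example index).
~2 s
$$ Set a breakpoint wherever you want to stop; setup!InterestingFunction is made up.
bp setup!InterestingFunction
$$ Let the whole process run again.
g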
I'm running ejabberd, and every so often it crashes. To figure out why it crashed, I know to look in the erl_crash.dump. The problem is, there doesn't seem to be any erl_crash.dump file. There is a core dump file though. Loading it into gdb and running "bt full," here are the top two frames:
(gdb) bt full
#0 0x000000000054df83 in prepare_crash_dump (secs=<optimized out>) at sys/unix/sys.c:735
max = <optimized out>
env = "\005", '\000' <repeats 15 times>"\200, \373!ڴ"
heart_port = 0x7fb46f31eab0
hp = 0x7fb4d6efb938
heart_fd = {865035, -1}
has_heart = 0
i = <optimized out>
envsz = <optimized out>
heap = {4460060, 140412855877120, 1}
list = 18446744073709551611
#1 erts_sys_prepare_crash_dump (secs=<optimized out>) at sys/unix/sys.c:780
So, it appears that it crashed while it was trying to write the crash dump, but didn't get all the way. I did some research, and it sounds a lot like a problem that had been posted earlier (https://groups.google.com/forum/#!msg/erlang-programming/XH2Uly6hsLY/aeR2Yx2UkZMJ). Heart was not enabled on the command line, which means this shouldn't be the problem, but... in the core dump, heart_port is set to something non-null. This should mean that heart is lurking somewhere, shouldn't it? If so, is there a way to tell heart to really not run?
This is the Erlang VM crashing, not an Erlang process crashing, so there is no erl_crash.dump generated. From my experience, I suspect it did not core in prepare_crash_dump, but that you have the wrong binaries loaded into gdb. If you are not debugging on the system that crashed, you should copy the Erlang binaries down and point GDB to them.
In erts 8.0 the release notes say: "Make sure to create a crash dump when running out of memory. This was accidentally removed in the erts-7.3 release."
So if your VM is affected by this bug and is crashing because it runs out of memory, it won't generate a crash dump.
I have a project which has lots of modules, each one has different running threads. I wrote a little script which goes through each one and safely reloads the code (for hot swaps):
reload_all() ->
    ?MODULE:reload_all(?MODULE_LIST).

reload_all([]) -> ok;
reload_all([T|C]) ->
    io:fwrite("Purging ~w\n", [T]),
    try_purge(T),
    {module, T} = code:load_file(T),
    ?MODULE:reload_all(C).

try_purge(T) -> try_purge(T, 1).

try_purge(T, Wait) ->
    case code:soft_purge(T) of
        true -> ok;
        false ->
            io:fwrite("* Waiting ~w seconds for ~w module\n", [Wait, T]),
            timer:sleep(Wait * 1000),
            try_purge(T, Wait + 1)
    end.
It uses the soft_purge() function, which only purges the code if there are no threads running the "old" code that would be killed by a normal purge. It will wait in increasing intervals and keep trying. I've designed the project so that the wait should never be more than a minute total, but realistically it should always be more or less instant.
The problem I'm running into is that sometimes a module will have a bug causing it to block indefinitely for one reason or another, and my reload_all() script never completes. This is the desired behavior: it lets me know that something is wrong. The problem is that tracking down the bug involves lots and lots of testing and analyzing of the code, which sometimes doesn't even work because the bug only shows up in the production environment and not in the testing one.
My question is: Is there a way to identify which threads are running the "old" code in a module, and see which function they are currently stuck in?
You can check whether a module has old code lingering with erlang:check_old_code/1, and whether a given process is executing that old code with erlang:check_process_code/2. See the Erlang manual for details.
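A small sketch of how the two calls can be combined from a shell on the affected node; the function name is made up, and the process_info items are the standard current_function / current_stacktrace ones:
%% List every process still executing old code for Module, together with the
%% function it is currently in, so you can see where it is stuck.
old_code_holders(Module) ->
    case erlang:check_old_code(Module) of
        false ->
            [];   % no old code lingering for this module
        true ->
            [{Pid, erlang:process_info(Pid, [current_function, current_stacktrace])}
             || Pid <- erlang:processes(),
                erlang:check_process_code(Pid, Module)]
    end.
Running it for a module that soft_purge/1 keeps refusing to purge should point straight at the offending processes and the functions they are sitting in.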
OpenCL doesn't have a global barrier that will stop all threads, so I'm trying to create a work around with the following code:
void barrier(__global uint* scratch) {
    uint nThreads = get_global_size(0);
    atom_inc(scratch);
    /* this loop never terminates */
    while(scratch[0] < nThreads) {
        continue;
    }
}
The idea is that each thread loops until all of them increment that one piece of memory.
However, the value read from scratch[0] never changes for the threads once it's been read, and it loops forever. I know it's being incremented because it's the correct value when I read it back to the host.
Is the global memory being locally cached? What's going on here?
Found the problem: the order in which work groups are executed is implementation defined. This means that some threads might start only after others have finished.
In the code I gave, the work groups that are started first will loop forever waiting on the others to hit the 'barrier'. And the work groups that would be started later won't ever start, because they're waiting for the first ones to finish.
If the implementation (I'm on a Radeon 5750, using Stream SDK 2.2) executes all work groups concurrently, then it probably wouldn't be an issue. But that's not the case for my setup.