Erlang: No Crash Dump

I'm running ejabberd, and every so often it crashes. To figure out why it crashed, I know to look in the erl_crash.dump. The problem is, there doesn't seem to be any erl_crash.dump file. There is a core dump file though. Loading it into gdb and running "bt full," here are the top two frames:
(gdb) bt full
#0 0x000000000054df83 in prepare_crash_dump (secs=<optimized out>) at sys/unix/sys.c:735
max = <optimized out>
env = "\005", '\000' <repeats 15 times>"\200, \373!ڴ"
heart_port = 0x7fb46f31eab0
hp = 0x7fb4d6efb938
heart_fd = {865035, -1}
has_heart = 0
i = <optimized out>
envsz = <optimized out>
heap = {4460060, 140412855877120, 1}
list = 18446744073709551611
#1 erts_sys_prepare_crash_dump (secs=<optimized out>) at sys/unix/sys.c:780
So, it appears that it crashed while it was trying to write the crash dump, but didn't get all the way. I did some research, and it sounds a lot like a problem that had been posted earlier (https://groups.google.com/forum/#!msg/erlang-programming/XH2Uly6hsLY/aeR2Yx2UkZMJ). Heart was not enabled on the command line, which means this shouldn't be the problem, but... in the core dump, heart_port is set to something non-null. This should mean that heart is lurking somewhere, shouldn't it? If so, is there a way to tell heart to really not run?

This is the Erlang VM crashing, not an Erlang process crashing, so there is no erl_crash.dump generated. From my experience, I suspect it did not core in prepare_crash_dump, but that you have the wrong binaries loaded into gdb. If you are not debugging on the system that crashed, you should copy the Erlang binaries down and point gdb at them.
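For example (paths here are hypothetical; use the beam.smp from the exact erts version that produced the core):

# Copy the binaries from the crashed system, then point gdb at
# the matching beam.smp together with the core file:
gdb /usr/local/lib/erlang/erts-5.8.3/bin/beam.smp core
(gdb) bt full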

The erts 8.0 release notes include this fix: "Make sure to create a crash dump when running out of memory. This was accidentally removed in the erts-7.3 release."
So if your VM is affected by this bug and is crashing for that reason, no crash dump will be generated.
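To check whether your node runs an affected erts version, you can query it directly (a quick sketch):

# Print the erts version of the local Erlang installation
erl -noshell -eval 'io:format("~s~n", [erlang:system_info(version)]), halt().'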

Related

Drake -- Coredump in Simulation When There is a Contact

I have a Kuka arm and some objects set up in my simulation (very similar to the manipulation station example), and I have been running into the coredump error below whenever there is contact between the robot and the objects.
"abort: Failure at multibody/plant/multibody_plant.cc:1640 in CalcImplicitStribeckResults(): condition 'info == ImplicitStribeckSolverResult::kSuccess' failed.
Aborted (core dumped)"
Decreasing the integration step size for the simulator did not help, so I ended up tracing back the error and commenting out the condition that causes it ("DRAKE_DEMAND(info == ImplicitStribeckSolverResult::kSuccess);"), after which it seems to coredump a lot less often.
However, I am guessing that condition is there for a reason, so would commenting the line out cause any other issues in the simulation? What is the proper way to fix the coredump problem?
In Drake PR #12503, the ImplicitStribeck code was refactored to reflect the notation in the TAMSI arXiv paper, and in #12361 it was changed to provide a more helpful exception with troubleshooting tips:
https://github.com/RobotLocomotion/drake/blob/v0.15.0/multibody/plant/multibody_plant.cc#L1866-L1878
Can you try out a later version (e.g. 0.15.0), and then try out the troubleshooting instructions there? (You've already tried changing step size in the simulator, but you may want to check on the stiffness of your overall system, etc.)

How do I debug a memory issue in Rust?

I hope this question isn't too open-ended. I ran into a memory issue with Rust, where I got an "out of memory" from calling next on an Iterator trait object. I'm unsure how to debug it. Prints have only brought me to the point where the failure occurs. I'm not very familiar with other tools such as ltrace, so although I could create a trace (231MiB, pff), I didn't really know what to do with it. Is a trace like that useful? Would I do better to grab gdb/lldb? Or Valgrind?
In general I would try to do the following approach:
Boilerplate reduction: Try to narrow down the problem of the OOM, so that you don't have too much additional code around. In other words: the quicker your program crashes, the better. Sometimes it is also possible to rip out a specific piece of code and put it into an extra binary, just for the investigation.
Problem size reduction: Lower the problem from an OOM to a simple "too much memory", so that you can actually tell that some part wastes memory even though it does not lead to an OOM. If it is too hard to tell whether you see the issue or not, you can lower the memory limit. On Linux, this can be done using ulimit:
ulimit -Sv 500000 # that's 500MB
./path/to/exe --foo
Information gathering: If your problem is small enough, you are ready to collect information which has a lower noise level. There are multiple ways you can try. Just remember to compile your program with debug symbols. It might also be an advantage to turn off optimization, since it usually leads to information loss. Both can be achieved by NOT using the --release flag during compilation.
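For example, with cargo (the binary name is a placeholder):

cargo build                  # debug profile: debug symbols, no optimization
./target/debug/exe --foo     # instead of: cargo build --release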
Heap profiling: One way is to use gperftools:
LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPPROFILE=/tmp/profile ./path/to/exe --foo
pprof --gv ./path/to/exe /tmp/profile/profile.0100.heap
This shows you a graph which symbolizes which parts of your program eat which amount of memory. See official docs for more details.
rr: Sometimes it's very hard to figure out what is actually happening, especially after you created a profile. Assuming you did a good job in step 2, you can use rr:
rr record ./path/to/exe --foo
rr replay
This will spawn a GDB with superpowers. The difference from a normal debug session is that you can not only continue but also reverse-continue. Basically, your program is executed from a recording through which you can jump back and forth as you want. The rr project's wiki provides some additional examples. One thing to point out is that rr only seems to work with GDB.
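A sketch of the classic rr workflow (the variable name is a placeholder): run forward to the failure, set a watchpoint on the corrupted data, then run backwards to whoever last touched it:

rr replay
(rr) continue                   # run forward to the crash
(rr) watch -l suspicious_var    # hypothetical variable; -l watches its address
(rr) reverse-continue           # run backwards to the last write of that address
(rr) bt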
Good old debugging: Sometimes you get traces and recordings that are still way too large. In that case you can (in combination with the ulimit trick) just use GDB and wait until the program crashes:
gdb --args ./path/to/exe --foo
You should now get a normal debugging session where you can examine the current state of the program. GDB can also be launched with coredumps. The general problem with this approach is that you cannot go back in time and you cannot continue execution. So you only see the current state, including all stack frames and variables. Here you could also use LLDB if you want.
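For instance, a post-mortem session on a core file (assuming your shell allows core files to be written):

ulimit -c unlimited       # allow the kernel to write core files
./path/to/exe --foo       # run until it crashes and dumps core
gdb ./path/to/exe core    # inspect the dump post-mortem
(gdb) bt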
(Potential) fix + repeat: After you have a clue about what might be going wrong, you can try to change your code. Then try again. If it's still not working, go back to step 3 and try again.
Valgrind and other tools work fine, and should work out of the box as of Rust 1.32. Earlier versions of Rust require changing the global allocator from jemalloc to the system's allocator so that Valgrind and friends know how to monitor memory allocations.
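A minimal invocation could look like this (the binary path is a placeholder):

valgrind --leak-check=full ./target/debug/exe --foo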
In this answer, I use the macOS developer tool Instruments, as I'm on macOS, but Valgrind / Massif / Cachegrind work similarly.
Example: An infinite loop
Here's a program that "leaks" memory by pushing 1MiB Strings into a Vec and never freeing it:
use std::{thread, time::Duration};

fn main() {
    let mut held_forever = Vec::new();
    loop {
        held_forever.push("x".repeat(1024 * 1024));
        println!("Allocated another");
        thread::sleep(Duration::from_secs(3));
    }
}
You can see memory growth over time, as well as the exact stack trace that allocated the memory.
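If you are not on macOS, Valgrind's Massif produces a comparable growth-over-time view (a sketch; the output file name contains the actual pid):

valgrind --tool=massif ./target/debug/exe
ms_print massif.out.12345     # replace 12345 with the pid from the run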
Example: Cycles in reference counts
Here's an example of leaking memory by creating an infinite reference cycle:
use std::{cell::RefCell, rc::Rc};

struct Leaked {
    data: String,
    me: RefCell<Option<Rc<Leaked>>>,
}

fn main() {
    let data = "x".repeat(5 * 1024 * 1024);
    let leaked = Rc::new(Leaked {
        data,
        me: RefCell::new(None),
    });
    let me = leaked.clone();
    *leaked.me.borrow_mut() = Some(me);
}
See also:
Why does Valgrind not detect a memory leak in a Rust program using nightly 1.29.0?
Handling memory leak in cyclic graphs using RefCell and Rc
Minimal `Rc` Dependency Cycle
In general, to debug, you can use either a log-based approach (either by inserting the logs yourself, or by having a tool such as ltrace, ptrace, ... generate the logs for you) or you can use a debugger.
Note that ltrace, ptrace or debugger-based approaches require that you be able to reproduce the problem; I tend to favor manual logs because I work in an industry where bug reports are generally too imprecise to allow immediate reproduction (and thus we use logs to create the reproducer scenario).
Rust supports both approaches, and the standard toolset that one uses for C or C++ programs works well for it.
My personal approach is to have some logging in place to quickly narrow down where the issue occurs, and if logging is insufficient, to fire up a debugger for a more fine-combed inspection. In this case, I would recommend going straight for the debugger.
A panic is generated, which means that by breaking on the call to the panic hook, you get to see both the call stack and memory state at the moment where things go awry.
Launch your program with the debugger, set a break point on the panic hook, run the program, profit.
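With GDB, that could look like this (rust_panic is the symbol the Rust runtime deliberately keeps un-inlined for debuggers; if it's not found in your toolchain, breaking on abort is a fallback):

gdb --args ./path/to/exe --foo
(gdb) break rust_panic    # the panic machinery funnels through this function
(gdb) run
(gdb) bt                  # call stack at the moment of the panic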

Execution was interrupted, reason: EXC_BAD_ACCESS (code=1, address=0xb06b9940)

I'm new to lldb and trying to diagnose an error by using po [$eax class]
The error shown in the UI is:
Thread 1: EXC_BREAKPOINT (code=EXC_i386_BPT, subcode=0x0)
Here is the lldb console including what I entered and what was returned:
(lldb) po [$eax class]
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=1, address=0xb06b9940).
The process has been returned to the state before expression evaluation.
The global breakpoint state toggle is off.
Your app is getting stopped because the code you are running raised an uncaught Mach exception. Mach exceptions are the equivalent of BSD signals for the Mach kernel, which makes up the lowest levels of the macOS operating system.
In this case, the particular Mach exception is EXC_BREAKPOINT. EXC_BREAKPOINT is a common source of confusion... Because it has the word "breakpoint" in the name people think that it is a debugger breakpoint. That's not entirely wrong, but the exception is used more generally than that.
EXC_BREAKPOINT is in fact the exception that the lower layers of Mach report when a program executes a certain instruction (a trap instruction). That trap instruction is used by lldb to implement breakpoints, but it is also used as an alternative to assert in various bits of system software. For instance, Swift raises this error if you access past the end of an array. It is a way to stop your program right at the point of the error. If you are running outside the debugger, this will lead to a crash. But if you are running in the debugger, control will be returned to the debugger with this EXC_BREAKPOINT stop reason.
To avoid confusion, lldb will never show you EXC_BREAKPOINT as the stop reason if the trap was one that lldb inserted in the program you are debugging to implement a debugger breakpoint. It will always say breakpoint n.n instead.
So if you see a thread stopped with EXC_BREAKPOINT as its stop reason, that means you've hit some kind of fatal error, usually in some system library used by your program. A backtrace at this point will show you what component is raising that error.
Anyway, having hit that error, you then tried to figure out the class of the value in the eax register by running po [$eax class]. Calling that method (which causes code to run in the program you are debugging) led to a crash. That's what the error message you cite was telling you.
That's almost surely because $eax doesn't point to a valid ObjC object, so you're just calling a method on some random value, and that's crashing.
Note, if you are debugging a 64 bit program, then $eax is actually the lower 32 bits of the real argument passing register - $rax. The bottom 32 bits of a 64 bit pointer is unlikely to be a valid pointer value, so it is not at all surprising that calling class on it led to a crash.
If you were trying to call class on the first passed argument (self in ObjC methods) on 64 bit Intel, you really wanted to do:
(lldb) po [$rax class]
Note, that was also unlikely to work, since $rax only holds self at the start of the function. After that it gets used as a scratch register. So if you are any distance into the function (which the fact that your code fatally failed some test makes seem likely), $rax would be unlikely to still hold self.
Note also, if this is a 32 bit program, then $eax is not in fact used for argument passing - 32 bit Intel code passes arguments on the stack, not in registers.
Anyway, the first thing to do to figure out what went wrong is to print the backtrace when you get this exception, and see what code was being run at the time the error occurred.
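In lldb:

(lldb) thread backtrace       # or just: bt
(lldb) thread backtrace all   # stacks for every thread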
Cleaning the project and restarting Xcode worked for me.
I'm adding my solution, as I've struggled with the same problem and I didn't find this solution anywhere.
In my case I had to run Product -> Clean Build Folder (Clean + Option key) and rebuild my project. Breakpoints and lldb commands started to work properly.

"EXC_BAD_ACCESS" vs "Segmentation fault". Are both same practically?

In my first few dummy apps (for practice while learning), I came across a lot of EXC_BAD_ACCESS errors, which taught me that Bad Access means: you are touching/accessing an object that you shouldn't, because either it is not allocated yet, it has been deallocated, or you are simply not authorized to access it.
Look at this sample code, which has a bad-access issue because I am trying to modify a const:
- (void)myStartMethod {
    NSString *str = @"testing";
    const char *charStr = [str UTF8String];
    ((char *)charStr)[4] = '\0'; // bad access on this line (const cast away so the write compiles)
    NSLog(@"%s", charStr);
}
A segmentation fault, meanwhile, is described as: a specific kind of error caused by accessing memory that "does not belong to you." It's a helper mechanism that keeps you from corrupting memory and introducing hard-to-debug memory bugs. Whenever you get a segfault you know you are doing something wrong with memory.
I want to know two things.
First, am I right about Objective-C's EXC_BAD_ACCESS? Did I understand it correctly?
Second, are EXC_BAD_ACCESS and segmentation faults the same thing, with Apple having merely improvised the name?
No, EXC_BAD_ACCESS is not the same as SIGSEGV.
EXC_BAD_ACCESS is a Mach exception (Mach, combined with BSD, composes xnu, the Mac OS X kernel), while SIGSEGV is a POSIX signal. When crashes occur with the cause given as EXC_BAD_ACCESS, the signal is often reported in parentheses immediately after: for instance, EXC_BAD_ACCESS(SIGSEGV). However, there is one other POSIX signal that can be seen in conjunction with EXC_BAD_ACCESS: SIGBUS, reported as EXC_BAD_ACCESS(SIGBUS).
SIGSEGV is most often seen when reading from/writing to an address that is not at all mapped in the memory map, like the NULL pointer, or attempting to write to a read-only memory location (as in your example above). SIGBUS, on the other hand, can be seen even for addresses the process has legitimate access to. For instance, SIGBUS can smite a process that dares to load/store from/to an unaligned memory address with instructions that assume an aligned address, or a process that attempts to write to a page for which it does not have the privilege to do so.
Thus EXC_BAD_ACCESS can best be understood as the set of both SIGSEGV and SIGBUS, and refers to all ways of incorrectly accessing memory (whether because said memory does not exist, or does exist but is misaligned, privileged or whatnot), hence its name: exception – bad access.
To feast your eyes, here is the code, within the xnu-1504.15.3 (Mac OS X 10.6.8 build 10K459) kernel source code, file bsd/uxkern/ux_exception.c beginning at line 429, that translates EXC_BAD_ACCESS to either SIGSEGV or SIGBUS.
/*
 * ux_exception translates a mach exception, code and subcode to
 * a signal and u.u_code. Calls machine_exception (machine dependent)
 * to attempt translation first.
 */
static
void ux_exception(
    int exception,
    mach_exception_code_t code,
    mach_exception_subcode_t subcode,
    int *ux_signal,
    mach_exception_code_t *ux_code)
{
    /*
     * Try machine-dependent translation first.
     */
    if (machine_exception(exception, code, subcode, ux_signal, ux_code))
        return;

    switch(exception) {
    case EXC_BAD_ACCESS:
        if (code == KERN_INVALID_ADDRESS)
            *ux_signal = SIGSEGV;
        else
            *ux_signal = SIGBUS;
        break;

    case EXC_BAD_INSTRUCTION:
        *ux_signal = SIGILL;
        break;
    ...
Edit in relation to another of your questions
Please note that exception here does not refer to an exception at the level of the language, of the type one may catch with syntactic sugar like try{} catch{} blocks. Exception here refers to the actions of a CPU on encountering certain types of mistakes in your program (they may or may not be fatal), like a null-pointer dereference, that require outside intervention.
When this happens, the CPU is said to raise what is commonly called either an exception or an interrupt. This means that the CPU saves what it was doing (the context) and deals with the exceptional situation.
To deal with such an exceptional situation, the CPU does not start executing any "exception-handling" code (catch-blocks or suchlike) in your application. It first gives the OS control, by starting to execute a kernel-provided piece of code called an interrupt service routine. This is a piece of code that figures out what happened to which process, and what to do about it. The OS thus has an opportunity to judge the situation, and take the action it wants.
The action it does for an invalid memory access (such as a null pointer dereference) is to signal the guilty process with EXC_BAD_ACCESS(SIGSEGV). The action it does for a misaligned memory access is to signal the guilty process with EXC_BAD_ACCESS(SIGBUS). There are many other exceptional situations and corresponding actions, not all of which involve signals.
We're now back in the context of your program. If your program receives the SIGSEGV or SIGBUS signal, it will invoke the signal handler that was installed for that signal, or the default one if none was. It is rare for people to install custom handlers for SIGSEGV and SIGBUS, and the default handlers shut down your program, so that is what you usually get.
This sort of exception is therefore completely unlike the sort one throws in try{} blocks and catch{}es. Those exceptions are handled purely within the application, without involving the OS at all. Here, a throw statement is simply a glorified jump to the innermost catch block that handles that exception. As the exception bubbles up through the stack, it unwinds the stack behind it, running destructors and suchlike as needed.
Basically yes: an EXC_BAD_ACCESS is usually paired with a SIGSEGV, which is a signal that warns about a segmentation failure.
A segmentation failure is raised whenever you are working with a pointer that points to invalid data (maybe memory not belonging to the process, maybe read-only, maybe an invalid address in general).
Don't think about a segmentation fault in terms of "accessing an object": you are accessing a memory location, that is, an address. That address must be considered coherent by the OS's memory protection system.
Not all errors related to accessing invalid data can be tracked by the memory manager. Think of a pointer to a stack-allocated variable: the address is still considered valid even though its contents are no longer meaningful once the stack frame has been restored.

Debugging Erlang heart timeouts

I use the heart program to restart an Erlang node when it becomes unresponsive. However, I am finding it hard to understand why the node freezes. SASL logs don't show any errors, and my own logs don't seem to show anything remarkable happening at those times. Can anybody give advice on debugging this sort of thing?
By default the heart program issues a SIGKILL to kill off the unresponsive VM so it can quickly start a new one. This makes getting any useful information about the VM pretty much impossible. Something I've tried in the past is to patch the heart program to avoid the hard kill and instead get the VM to create a crash dump and a coredump. I used a patch like this (this one is for Erlang/OTP R14B02):
--- erts/etc/common/heart.c.orig	2011-04-17 12:11:24.000000000 -0400
+++ erts/etc/common/heart.c	2011-04-17 12:12:36.000000000 -0400
@@ -559,10 +559,11 @@
     int res;
     if(heart_beat_kill_pid != 0){
         pid = (pid_t) heart_beat_kill_pid;
-        res = kill(pid,SIGKILL);
+        res = kill(pid,SIGUSR1);
+        sleep(4);
         for(i=0; i < 5 && res == 0; ++i){
             sleep(1);
-            res = kill(pid,SIGKILL);
+            res = kill(pid,i < 2 ? SIGQUIT : SIGKILL);
         }
         if(errno != ESRCH){
             print_error("Unable to kill old process, "
As you can see, with this patch heart will first issue a SIGUSR1 to try to get the VM to create a crash dump. Since this can take a while, heart then sleeps for 4 seconds. You might have to increase this sleep time if you're not getting full crash dumps. After that, heart tries twice to issue a SIGQUIT in the hope of getting a coredump, and if that fails, issues a SIGKILL.
Note that this patch will slow down heart's VM restart due to the time required to wait for the crash dumps and coredumps. If you use it in production, be aware of this limitation.
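Independent of the patch, note that the VM writes an erl_crash.dump on receiving SIGUSR1 (the VM terminates afterwards), so you can also trigger a dump by hand on a hung node (the pgrep usage is just one way to find the pid):

kill -USR1 $(pgrep -f beam.smp)   # force a crash dump; the VM exits afterwards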
You could try to call erlang:halt/1 from your HEART_COMMAND, thus creating a crash dump from the unresponsive node.
You can try using the erl_call tool with e.g. -a erlang halt 123.
If the Erlang node can't respond to this, that is also interesting information.
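A full invocation might look like this (the node name and cookie are placeholders):

# Ask the node to apply erlang:halt(123) remotely
erl_call -c mycookie -n mynode@myhost -a 'erlang halt [123]'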
Did you try increasing HEART_BEAT_TIMEOUT? Maybe the node is just bogged down a bit and misses the timeout but doesn't actually freeze.
If you have any idea of why it is freezing you could try to trace the module using dbg.
http://www.erlang.org/doc/man/dbg.html
In short try
dbg:tracer(), dbg:p(all,c), dbg:tpl(Module, Function, x).
If you want to stop this tracing, issue
dbg:ctpl()
See documentation for more info.
Note: Change Module and Function to whatever you want to trace, leave x as it is. You can also skip Function and only give Module, x.
Warning: Running this on a live system can be dangerous as the amount of information that is going to be printed to the shell can be enormous.
