Discovering blocking Erlang threads

I have a project with lots of modules, each with different running threads. I wrote a little script which goes through each one and safely reloads the code (for hot swaps):
reload_all() ->
    ?MODULE:reload_all(?MODULE_LIST).

reload_all([]) -> ok;
reload_all([T|C]) ->
    io:fwrite("Purging ~w\n", [T]),
    try_purge(T),
    {module, T} = code:load_file(T),
    ?MODULE:reload_all(C).

try_purge(T) -> try_purge(T, 1).

try_purge(T, Wait) ->
    case code:soft_purge(T) of
        true -> ok;
        false ->
            io:fwrite("* Waiting ~w seconds for ~w module\n", [Wait, T]),
            timer:sleep(Wait * 1000),
            try_purge(T, Wait + 1)
    end.
It uses the soft_purge() function, which only purges the code if there are no threads running the "old" code that would be killed by the normal purge command. It waits in increasing intervals and keeps trying. I've designed the project so that the wait should never be more than a minute total, but realistically it should always be more or less instant.
The problem I'm running into is that sometimes a module will have a bug causing it to block indefinitely for one reason or another, and my reload_all() script never completes. This is the desired behavior; it lets me know that something is wrong. The trouble is that tracking down the bug involves lots and lots of testing and analyzing of the code, which sometimes doesn't even work, because the bug only shows up in the production environment and not in the testing one.
My question is: Is there a way to identify which threads are running the "old" code in a module, and see which function they are currently stuck in?

You can check whether a module has old code loaded with erlang:check_old_code/1, and whether a given process is still executing that old code with erlang:check_process_code/2. See the Erlang manual.
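For example, a minimal sketch (the function name is illustrative, not from the manual) that lists every process still executing old code of a module, together with the function each one is currently executing:

old_code_processes(Module) ->
    %% check_process_code/2 is true when Pid still runs old code of
    %% Module; current_function/current_stacktrace show where it is stuck.
    [{Pid,
      erlang:process_info(Pid, current_function),
      erlang:process_info(Pid, current_stacktrace)}
     || Pid <- erlang:processes(),
        erlang:check_process_code(Pid, Module)].

Running old_code_processes(Module) on the production node (e.g. from a remote shell) should point directly at the processes that keep soft_purge/1 returning false.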

Related

How do I debug a memory issue in Rust?

I hope this question isn't too open-ended. I ran into a memory issue with Rust, where I got an "out of memory" from calling next on an Iterator trait object. I'm unsure how to debug it. Prints have only brought me to the point where the failure occurs. I'm not very familiar with other tools such as ltrace, so although I could create a trace (231MiB, pff), I didn't really know what to do with it. Is a trace like that useful? Would I do better to grab gdb/lldb? Or Valgrind?
In general, I would try the following approach:
Boilerplate reduction: Try to narrow down the problem of the OOM, so that you don't have too much additional code around. In other words: the quicker your program crashes, the better. Sometimes it is also possible to rip out a specific piece of code and put it into an extra binary, just for the investigation.
Problem size reduction: Reduce the problem from an OOM to plain "too much memory", so that you can actually tell that some part wastes memory even though it does not lead to an OOM. If it is too hard to tell whether you see the issue or not, you can lower the memory limit. On Linux, this can be done using ulimit:
ulimit -Sv 500000 # that's 500MB
./path/to/exe --foo
Information gathering: If your problem is small enough, you are ready to collect information with a lower noise level. There are multiple ways you can try. Just remember to compile your program with debug symbols. It might also be an advantage to turn off optimization, since optimization usually leads to information loss. Both can be achieved by NOT using the --release flag during compilation.
Heap profiling: One way is to use gperftools:
LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPPROFILE=/tmp/profile ./path/to/exe --foo
pprof --gv ./path/to/exe /tmp/profile/profile.0100.heap
This shows you a graph which symbolizes which parts of your program eat which amount of memory. See official docs for more details.
rr: Sometimes it's very hard to figure out what is actually happening, especially after you created a profile. Assuming you did a good job in step 2, you can use rr:
rr record ./path/to/exe --foo
rr replay
This will spawn a GDB with superpowers. The difference from a normal debug session is that you can not only continue but also reverse-continue. Basically, your program is executed from a recording where you can jump back and forth as you want. This wiki page provides some additional examples. One thing to point out is that rr only seems to work with GDB.
Good old debugging: Sometimes you get traces and recordings that are still way too large. In that case you can (in combination with the ulimit trick) just use GDB and wait until the program crashes:
gdb --args ./path/to/exe --foo
You should now get a normal debugging session where you can examine the current state of the program. GDB can also be launched with core dumps. The general problem with this approach is that you cannot go back in time and you cannot continue execution. So you only see the current state, including all stack frames and variables. Here you could also use LLDB if you want.
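As an aside, loading a core dump is just a matter of passing it to GDB after the executable (paths illustrative):

gdb ./path/to/exe /path/to/core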
(Potential) fix + repeat: After you have a clue about what might be going wrong, you can try to change your code. Then try again. If it's still not working, go back to step 3 and try again.
Valgrind and other tools work fine, and should work out of the box as of Rust 1.32. Earlier versions of Rust require changing the global allocator from jemalloc to the system's allocator so that Valgrind and friends know how to monitor memory allocations.
In this answer, I use the macOS developer tool Instruments, as I'm on macOS, but Valgrind / Massif / Cachegrind work similarly.
Example: An infinite loop
Here's a program that "leaks" memory by pushing 1MiB Strings into a Vec and never freeing it:
use std::{thread, time::Duration};

fn main() {
    let mut held_forever = Vec::new();
    loop {
        held_forever.push("x".repeat(1024 * 1024));
        println!("Allocated another");
        thread::sleep(Duration::from_secs(3));
    }
}
You can see memory growth over time, as well as the exact stack trace that allocated the memory.
Example: Cycles in reference counts
Here's an example of leaking memory by creating an infinite reference cycle:
use std::{cell::RefCell, rc::Rc};

struct Leaked {
    data: String,
    me: RefCell<Option<Rc<Leaked>>>,
}

fn main() {
    let data = "x".repeat(5 * 1024 * 1024);
    let leaked = Rc::new(Leaked {
        data,
        me: RefCell::new(None),
    });
    let me = leaked.clone();
    *leaked.me.borrow_mut() = Some(me);
}
See also:
Why does Valgrind not detect a memory leak in a Rust program using nightly 1.29.0?
Handling memory leak in cyclic graphs using RefCell and Rc
Minimal `Rc` Dependency Cycle
In general, to debug, you can use either a log-based approach (either by inserting the logs yourself, or by having a tool such as ltrace, ptrace, ... generate the logs for you) or you can use a debugger.
Note that ltrace, ptrace or debugger-based approaches require that you be able to reproduce the problem; I tend to favor manual logs because I work in an industry where bug reports are generally too imprecise to allow immediate reproduction (and thus we use logs to create the reproducer scenario).
Rust supports both approaches, and the standard toolset that one uses for C or C++ programs works well for it.
My personal approach is to have some logging in place to quickly narrow down where the issue occurs, and, if logging is insufficient, to fire up a debugger for a finer-grained inspection. In this case, I would recommend going straight for the debugger.
A panic is generated, which means that by breaking on the call to the panic hook, you get to see both the call stack and memory state at the moment where things go awry.
Launch your program with the debugger, set a break point on the panic hook, run the program, profit.
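With GDB, that could look like this (a sketch; rust_panic is an unmangled symbol that the Rust standard library exposes specifically so debuggers can break on panics, but verify it against your toolchain):

gdb --args ./path/to/exe --foo
(gdb) break rust_panic
(gdb) run
(gdb) backtrace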

Random bad_object_header mnesia/dets error

I am having a very weird error with mnesia. I have about 10 tables that mnesia is recording, and usually it works fine. However, in a certain place in my code, whenever I try to read from a particular table (trying to read from other tables is fine) I get a DETS error.
I reduced my code to
{atomic, ok} = mnesia:transaction(fun() ->
    [Entry] = mnesia:read(table_name, Key),
    ok
end)
I have a try/catch block around the transaction, and the error I get is this:
error:{badmatch,
       {aborted,
        {{badmatch,
          {error, {bad_object_header, "/path/to/table_name.DAT"}}},
         [{callback, '-handle/2-fun-0-', 1,
           [{file, "src/src.erl"}, {line, 234}]},
          {mnesia_tm, apply_fun, 3,
           [{file, "mnesia_tm.erl"}, {line, 830}]},
          {mnesia_tm, execute_transaction, 5,
           [{file, "mnesia_tm.erl"}, {line, 810}]}]}}}
Unfortunately I cannot reproduce the error with a short example. Even if I call the function from the REPL, it doesn't error. It only errors when it happens in my actual code. But it does happen reliably every time.
If I take out the mnesia:read line, everything works fine. I have tried remaking the schema and the tables, and that didn't help. It's really weird because my code goes on later to use the table successfully. It is only if it's used from this one place that it fails.
What could be going wrong?
Update
I experimented some more and it seems that the error only happens when two of these transactions happen (in different processes) nearly simultaneously. Isn't mnesia meant to be used in this way?
Update 2
Turned out that the problem was fixed by downgrading my Erlang installation on Arch Linux from R16B-6 to R16B-3. Hopefully this bug will be ironed out soon.
The symptom means that either the file operation reading a specific part of the file runs out of memory, or you are trying to read from a non-existent position in the file. So if you are not running out of memory (which you would have noticed), it is likely that there is a race condition in DETS's handling of that file.
I get the same error from time to time. It has been occurring since I upgraded my Debian server, maybe one or two months ago.
Here is my error:
Error in process <0.84.0> on node 'yaws#overnux' with exit value: {{case_clause,{error,{bad_object_header,"/var/www/d-lan/db/d_lan_downloads_count.dets"}}},[{d_lan_db,loop,0,[]},{string,strip,1,[]}]}
I think it's an Erlang regression because I didn't change the code for a long time and it was working fine before the upgrade.
I'm only using DETS, not Mnesia. I have no concurrent access to the file.
Here is my code, it's very simple: https://github.com/Ummon/D-LAN/blob/website/modules/erl/d_lan_db.erl#L103

Interrupting a process in Erlang

I am new to Erlang.
I wonder if it is possible to interrupt a processor in Erlang. Assume we have processor x executing a function f1() that takes a long time to execute. I would like to find an efficient way to interrupt processor x to execute a function f2(), and after the execution of f2() it goes back to executing f1() from where it was interrupted.
One way of doing this (although not exactly what I want) is to let f1() be executed by a processor (name it f1_proc), while the creator of f1_proc waits for messages such as [interrupt, f1_terminated, etc.], where if interrupt is received, f2() is executed.
However, this is not exactly what I want. What if f2() depends on f1()? In this case, f1() is paused, f2() is executed, and then f1() should start from where it stopped. I know we can terminate a process, but can we pause one?
The answer to your question is no, this can't be done. There is no way to pause a process from the "outside" without any hook (e.g. receive clause) inside the process.
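To make the point concrete, here is a minimal sketch (do_step/1 is a hypothetical work function) of the kind of hook a process would need in order to be pausable: it must cooperate by polling its mailbox between work steps:

worker(State) ->
    receive
        pause ->
            %% Block until some other process sends 'resume'.
            receive resume -> ok end,
            worker(State)
    after 0 ->
        %% No control message pending: do one unit of work and loop.
        worker(do_step(State))
    end.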
I think your question title (processor) is a bit misleading, considering you are trying to work with Erlang processes.
You should try working with the erlang:hibernate/3 BIF.
Directly from its documentation:
Puts the calling process into a wait state where its memory allocation has been reduced as much as possible, which is useful if the process does not expect to receive any messages in the near future.
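For illustration, a minimal sketch (idle_loop/1 would have to be exported; do_job/1 is hypothetical) of a server loop that hibernates between messages:

idle_loop(State) ->
    receive
        {work, From, Job} ->
            From ! {done, do_job(Job)},
            %% Instead of looping directly, hibernate: the stack is thrown
            %% away, the heap is minimized, and ?MODULE:idle_loop(State)
            %% is called when the next message arrives.
            erlang:hibernate(?MODULE, idle_loop, [State])
    end.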
Using timers and message passing between processes, you can enforce your workflow, e.g. pausing one process if it takes too much time while another continues its work.
Though your use case is not so clear in the question, you can also have both (in fact, more) processes working in parallel without having to wait for one another, and get notified once a process has finished its job.
One way to do it is to simply start both functions in different processes. When f2() is dependent on a result from f1(), it receives a message with the needed data. When f1() is done calculating that data, it sends it to the f2() process.
If f2() reaches the receive clause too early, it will automatically pause and wait until the message arrives (hence letting f1() continue its work). If f1(), however, is done first, it will carry on with its other tasks until preempted automatically by the Erlang process scheduler.
You can also make f1() pause by letting it wait for a message from f2() as well. In that case, make sure that f1() waits AFTER it has sent its message to avoid deadlocks.
Example:
f1(F2Pid) ->
    Data = ...,
    F2Pid ! {f1data, Data},
    ... continue other tasks ...

f2() ->
    ... do some work ...,
    Data = receive
               {f1data, F1Data} -> F1Data
           end,
    ... do some work with Data ...

main() ->
    F2Pid = spawn_link(?MODULE, f2, []),
    f1(F2Pid).
This message passing is fundamental to the Erlang programming model. You don't need to invent synchronisation or locks. Just receive a message and Erlang will make sure you get that message (and that message only).
I don't know how you are learning Erlang, but I recommend the book Erlang Programming by Cesarini & Thompson (O'Reilly). The book covers, in great detail and with good examples, all you need to know about message passing and concurrency.

Erlang Concurrency Model

This could be a very basic question, but is Erlang capable of calling a method on another process and waiting for it to respond back with some object type, without sleeping threads?
Well, if you're waiting for an answer, the calling process will have to sleep eventually… But that's no big deal.
While processes are stuck on the receive loop, other processes can work. In fact, it's not uncommon to have thousands of processes just waiting for messages. And since Erlang processes are not true OS threads, they're very lightweight so the performance loss is minimal.
In fact, the way sleep is implemented looks like:
sleep(Milliseconds) ->
    receive
        % Intentionally left empty
    after Milliseconds -> ok
    end.
Yes, it is possible to peek into the mailbox if that is what you mean. Say we have sent a message to another process and now we want to see if the other process has sent something back to us. But we don't want to block on the receive:
receive
    Pattern -> Body;
    Pattern2 -> Body2
after 0 ->
    AfterBody
end
will try to match messages in the mailbox against Pattern and Pattern2. If none matches, it will immediately time out and go to AfterBody. This allows you to implement a non-blocking peek into the mailbox.
If the process is a gen_server the same thing can be had by playing with the internal state and the Timeout setting when a callback returns to the gen_server's control. You can set a Timeout of 0 to achieve this.
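For example, the relevant callbacks might look like this (a minimal sketch; do_background_work/1 is hypothetical):

handle_call(Request, _From, State) ->
    %% Reply, then ask gen_server to deliver 'timeout' to handle_info/2
    %% if the mailbox is still empty after 0 ms.
    {reply, {ok, Request}, State, 0}.

handle_info(timeout, State) ->
    %% No message arrived within the timeout; regain control here.
    {noreply, do_background_work(State)}.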
What I am getting from the question is that we are talking about synchronous message passing. Yes! Erlang can do this perfectly well; it's the most basic way of handling concurrency in Erlang. Consider the code below:
rpc(Request, To) ->
    MySelf = self(),
    To ! {MySelf, Request},
    receive
        {To, Reply} -> Reply
    after timer:seconds(5) -> erlang:exit({error, timedout})
    end.
The code above shows that a process sends a message to another and immediately goes into waiting (for a reply) without having to sleep. If it does not get a reply within 5 seconds, it will exit.

Erlang: side effect(s) to calling mnesia:create_schema more than once?

Is there a side effect to calling mnesia:create_schema() on each application start?
From what I keep reading, this function should only be called once per database instance. Is it a big issue to call it more than once on an existing database?
I've done this before in development, and it spits out warnings on the tables that already exist. However, I wouldn't make it a practice to rerun it in production, since it's possible that it has side effects I'm unaware of, and even if it doesn't now, there is no guarantee that it won't in future releases.
Why do you want to run it multiple times?
It has no side effect, but later calls will result in {error, {Node,{already_exists,Node}}}. You can use something like
ensure_schema() ->
    Node = node(),
    case mnesia:create_schema([Node]) of
        ok -> ok;
        {error, {Node, {already_exists, Node}}} -> ok;
        Error -> Error
    end.
Well, it could throw an exception on the second call. Just catch it.
