spawn_monitor() and 'DOWN' messages - erlang

Is it (theoretically) possible for a process that has been spawn_monitor()'ed to exit (normally or on error) without sending a 'DOWN' message to the parent process? I have a very strange process leak; it seems like some of the processes do not send a 'DOWN' message. I am using the Erlang package that comes with Ubuntu 9.10. Maybe it is a known bug?

You'll need to show some code. Monitoring is pretty core to the way Erlang works.
It's hard to tell what your actual problem is since you're not describing what you're seeing, so I'll have to guess.
You're either not trying to receive the down message or the process isn't exiting.
If you have processes leaking, it sounds like they're not actually exiting.
You very well may be trying to build your own supervisor module. I'd strongly suggest using OTP's supervisor if you want sane process tree shutdown and/or restart.
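To rule out the first of those, here is a minimal sketch of what the receive side should look like; do_work/0 and the 60-second timeout are just placeholders for illustration:

    start_and_wait() ->
        %% Spawn a worker under a monitor and wait for its 'DOWN' message.
        {Pid, Ref} = spawn_monitor(fun() -> do_work() end),
        receive
            {'DOWN', Ref, process, Pid, Reason} ->
                io:format("worker ~p exited with reason ~p~n", [Pid, Reason])
        after 60000 ->
            %% If this times out, the worker is still alive (i.e. leaked),
            %% not a case of a missing 'DOWN' message.
            {still_alive, Pid}
        end.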

Maybe you demonitored the process at some point?
Reading from the doc for erlang:demonitor/1:
Once erlang:demonitor(MonitorRef) has returned it is guaranteed that no {'DOWN', MonitorRef, _, _, _} message due to the monitor will be placed in the caller's message queue in the future. A {'DOWN', MonitorRef, _, _, _} message might have been placed in the caller's message queue prior to the call, though. Therefore, in most cases, it is advisable to remove such a 'DOWN' message from the message queue after monitoring has been stopped. erlang:demonitor(MonitorRef, [flush]) can be used instead of erlang:demonitor(MonitorRef) if this cleanup is wanted.
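In other words, if you demonitor without flushing, an already-delivered 'DOWN' message can linger in the mailbox. A small sketch of the safe pattern, assuming Ref is the reference returned by monitor/2 or spawn_monitor/1:

    stop_watching(Ref) ->
        %% Stop monitoring and drop any 'DOWN' already sitting in the mailbox.
        true = erlang:demonitor(Ref, [flush]),
        ok.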

Related

Erlang process termination: Where/When does it happen?

Consider processes all linked in a tree, either a formal supervision tree, or some ad-hoc structure.
Now, consider some child or worker down in this tree, with a parent or supervisor above it. I have two questions.
We would like to "gracefully" exit this process if it needs to be killed or shut down, because it could be halfway through updating some account balance. Assume we have properly coded up some terminate function and connected this process to others with the proper plumbing. Now assume this process is in its main loop doing work. The signal to terminate comes in. Where exactly (or perhaps the question should be WHEN exactly) does this termination happen? In other words, when will terminate be called? Will the process preempt itself right in the middle of the loop it is running and call terminate? Will it wait until the end of the loop but before starting the loop again? Will it only do it while in receive mode? Etc.
Same question, but without a terminate function having been coded. Assume the parent process is a supervisor, and this child is following normal OTP conventions. The parent tells the child to shut down, or the parent crashes, or whatever. The child is in its main loop. When/where/how does shutdown occur? In the middle of the main loop? After it? Etc.
It is quite nicely explained in the docs (sections 12.4, 12.5, 12.6, 12.7).
There are two cases:
Your process terminated due to some bad logic.
It throws an error, so it can be in the middle of work, and this could be bad. If you want to prevent that, you can define a mechanism that involves two processes: the first one begins the transaction, the second one does the actual work, and after that the first one commits the changes. If something bad happens to the second process (it dies because of an error), the first one simply does not commit the changes (see the first sketch after this list).
You are trying to kill the process from outside. For example, when your supervisor restarts it or a linked process dies.
In this case, you can also be in the middle of something, but Erlang gives you the trap_exit flag. It means that instead of dying, the process will receive a message that you can handle. That in turn means that the terminate function will be called once you get to the receive block. So the process will finish one chunk of work, and when it is ready for the next one, it will call terminate and after that die.
So you can bypass the exiting by using trap_exit. You can also bypass trap_exit by sending exit(Pid, kill), which terminates the process even if it traps exits.
There is no way to bypass exit(Pid, kill), so be careful when using it.
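A rough sketch of the two-process transaction idea from the first case; do_transaction_work/0 and commit/1 are hypothetical application functions:

    run_transaction() ->
        Parent = self(),
        %% The second process does the work; the first one only commits if it succeeds.
        {Pid, Ref} = spawn_monitor(fun() -> Parent ! {done, self(), do_transaction_work()} end),
        receive
            {done, Pid, Result} ->
                erlang:demonitor(Ref, [flush]),
                commit(Result);          %% worker succeeded: apply the changes
            {'DOWN', Ref, process, Pid, _Reason} ->
                not_committed            %% worker died mid-way: nothing is committed
        end.

And a sketch of the second case; do_one_chunk_of_work/0, handle/1 and terminate/1 are placeholders standing in for the process's real main loop and cleanup:

    init_loop() ->
        process_flag(trap_exit, true),
        loop().

    loop() ->
        do_one_chunk_of_work(),              %% never interrupted in the middle of a chunk
        receive
            {'EXIT', _From, Reason} ->
                terminate(Reason);           %% graceful shutdown once we reach the receive
            Msg ->
                handle(Msg),
                loop()
        after 0 ->
            loop()
        end.
    %% exit(Pid, kill) bypasses all of this: the process is terminated
    %% unconditionally and no {'EXIT', ...} message is delivered to it.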

What happens in Erlang if return receipt never arrives?

I just happened to read the thesis of Joe Armstrong and don't have much prior knowledge of Erlang. I wonder what happens if a delivery receipt for some message never arrives. What does the sending actor do? Does it send the message another time? This could confuse the recipient actor when it receives the same message a second time. It has to be able to tell that its receipt was not received and that the second message is therefore void.
That kind of problem has always kept me away from solutions where message delivery is not transactional. I think I know the answer: the sending actor tells its supervising actor that something must be wrong when it didn't obtain a receipt in reasonable time, causing the supervisor to take some action (like restarting the involved actors or something). Is this correct? I see no other solution that doesn't result in theoretically possible infinite message sends.
Thanks for any answer,
Oliver
In Erlang, the sender of a message usually forgets it immediately after sending it and continues its job. If an application needs an acknowledgement of the message reception, you have to build your own protocol (or use an existing one). There are many good reasons for that.
One is that most of the time it is not necessary to have this handshake. The most likely reason for a message to be ignored is that the receiving process does not exist anymore, or died in the meantime, and in this case there is very little interesting the sender can do about it.
Also, the handshake is a blocking action, so there is a performance impact and a risk of deadlock.
The acknowledgement should also be a message, but this one should not be acknowledged itself, otherwise you create a never-ending loop of messages. Only the application can know what to do (for example, whether to use a send with acknowledgement or not), and it is really easy to write this kind of function (or use a behaviour that implements it). For example:
send_with_ack(To, Mess, Timeout, Ack) ->
    Ref = make_ref(),
    To ! {Mess, self(), Ref},
    receive
        {Ack, Ref} -> Ack
    after Timeout ->
        {error, timeout}
    end.
receiving_process() ->
    ...
    receive
        {Pattern_matching_Mess, From, Ref} ->
            do_something(),
            From ! {Ack, Ref},   %% the Ack term for this kind of message is known by the receiver
            do_somethingelse();
        Mess1 ->
            do_otherthing()
    end,
    ...
With a little work, it is even possible to delegate the monitoring of message delivery to a new process (a non-blocking check) and, by using a linked process, force a crash of the sender if the timeout is reached.
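For example, a hedged sketch of that delegation; the watcher's pid is handed to the receiver as the reply-to address, and the {no_ack, Mess} exit reason is just illustrative:

    send_with_watcher(To, Mess, Timeout, Ack) ->
        Ref = make_ref(),
        %% The linked watcher waits for the ack; the sender continues immediately.
        Watcher = spawn_link(fun() ->
                                     receive
                                         {Ack, Ref} -> ok
                                     after Timeout ->
                                         %% No ack in time: crash, and take the
                                         %% linked sender down with us.
                                         exit({no_ack, Mess})
                                     end
                             end),
        To ! {Mess, Watcher, Ref},
        Ref.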

Race condition between trap_exit EXIT msg and common msg

Hi, the question is as follows:
Assume we have processes A and B which are linked. Process A's trap_exit flag is set to true. Let process B send a message to A and then exit:
PidA ! 'msg',
exit(reason).
What I want to know is whether we can be sure that process A will receive 'msg' first and only after it the {'EXIT', Pid, reason} message. Can we predict the ordering of the messages? I can't find any proof in the documentation; I guess that it will work that way, but I need some proof. I don't want to have a race condition here.
So as not to leave this question hanging, this is the discussion in the erlang-questions mailing list:
http://thread.gmane.org/gmane.comp.lang.erlang.general/66788
Long story short: all messages are signals (or all signals are messages), exits are seen as messages from the process, guaranteed to arrive in the same order they were sent.
Sounds like a code smell to me. Why do you need to rely on trap_exit? Have you thought of alternatives, e.g. proper monitoring?
I've got the O'Reilly Erlang programming book here, and in Chapter 4, in the section Message Passing, it says:
Messages are stored in the mailbox in the order in which they are delivered. If two messages are sent from one process to another, the messages are guaranteed to be received in the same order in which they are sent. This guarantee is not extended to messages sent from different processes, however, and in this case the ordering is VM-dependent.
However, in your case, I'm not sure the exit message actually comes from process B. It might originate somewhere in the bowels of the VM. If I wanted to be sure, I would actually have process A trigger the exit of process B when it receives your notification message instead.
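A small sketch of the situation as discussed above, assuming the ordering guarantee holds: A traps exits, and because B's exit signal is sent after the message, 'msg' should always be in A's mailbox first.

    demo() ->
        process_flag(trap_exit, true),
        A = self(),
        B = spawn_link(fun() ->
                               A ! msg,
                               exit(some_reason)
                       end),
        receive msg -> io:format("got msg from ~p first~n", [B]) end,
        receive {'EXIT', B, Reason} -> io:format("then 'EXIT' with ~p~n", [Reason]) end.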

Handling the cleanup of the gen_server state

I have a gen_server running that must clean up its state whenever it is stopped normally or crashes unexpectedly. The cleanup basically consists of deleting a few files.
At the moment, when the gen_server crashes or is stopped normally, the cleanup is done in terminate/2.
Is there any reason why terminate/2 would not be called if the gen_server crashes?
Should some other process monitor the gen_server and do the cleanup if the gen_server dies unexpectedly?
So, the code is like this:
terminate(normal, _State) ->
    %% Invoked when the process stops normally
    %% Clean up the mess
    ok;
terminate(_Error, _State) ->
    %% Invoked when the process crashes
    %% Clean up the mess
    ok.
EDIT: I found this email in the official mailing list which is talking about the same thing:
http://groups.google.com/group/erlang-programming/browse_thread/thread/9a1ba2d974775ce8
As Adam says below, if we want to avoid trapping exits in the gen_server, we could use different approaches.
But if we trap exits, terminate/2 seems to be a safe place to do the cleanup, as it will always be called. Furthermore, we must correctly handle the 'EXIT' messages delivered to terminate/2 and handle_call/3, propagating the errors correctly between workers and supervisors.
terminate/2 is called when a crash occurs inside the gen_server, even if it doesn't trap exits. It will not be called if the gen_server receives an 'EXIT' signal from some other process linked to it; if you need to clean up in that case too, it should trap exits (using process_flag(trap_exit, true)).
This behavior is a bit unfortunate because it makes it difficult to write a reliable shutdown procedure for a gen_server process. Also, it is not a good habit to trap exits just for the sake of being able to run terminate/2, since you might catch a lot of other errors which makes it harder to debug the system.
I would consider three options:
Handle the left over files when the next instance of the process starts (for example, in init/1)
Trap exits, clean up the files, and then crash again with the same reason
Have a 3rd process which monitors the gen_server whose only purpose is to clean up the files
Option 1 is probably the best option, since at least the code doesn't trap exits and you get persistent state for free. Option 2 is not so nice for the reasons described above, that it can hide and obscure other errors. 3 is messy because the cleanup process might not be done before the gen_server is started again.
Think carefully about why you want to clean up, and if it really has to be done when the process crashes (it is a bug, after all). Be careful that you don't end up doing too much defensive programming.
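If you do go with option 2 anyway, a hedged sketch; build_state/1 and cleanup_files/1 are hypothetical helpers for the files tracked in the state:

    init(Args) ->
        %% Trap exits so terminate/2 also runs when a linked process
        %% (e.g. the supervisor) asks this gen_server to shut down.
        process_flag(trap_exit, true),
        {ok, build_state(Args)}.

    terminate(_Reason, State) ->
        cleanup_files(State),    %% delete the leftover files
        ok.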
This is quite fresh and relevant
When does terminate/2 get called in a gen_server?

Problem stopping an Erlang SSH channel

NOTE: I'll use the ssh_sftp channel as an example here, but I've noticed the same behaviour when using different channels.
After starting a channel:
{ok, ChannelPid} = ssh_sftp:start_channel(State#state.cm),
(where cm is my Connection Manager), I'm performing an operation through the channel. Say:
ssh_sftp:write_file(ChannelPid, FilePath, Content),
Then, I'm stopping the channel:
ssh_sftp:stop_channel(ChannelPid),
Since, as far as I know, the channel is implemented as a gen_server, I was expecting the requests to be serialized.
Well, after a bit of tracing, I've noticed that the channel is somehow stopped before the file write is completed and its result is sent back through the channel. As a consequence, the response is never delivered, since the channel doesn't exist anymore.
If I don't stop the channel explicitly, everything works fine and the file write (or any other operation performed through the channel) completes correctly. But I would prefer not to leave channels open. On the other hand, I would also prefer to avoid implementing my own receive handler to wait for the result before the channel can be stopped.
I'm probably missing something trivial here. Do you have any idea why this is happening and/or how I could fix it?
I repeat, ssh_sftp is just an example. I'm using my own channels, implemented using the existing channels in the Erlang SSH application as a template.
As you can see in ssh_sftp.erl, it forcefully kills the channel after a 5-second timeout with exit(Pid, kill), which interrupts the process regardless of whether it is in the middle of something or not.
Related quote from erlang man:
If Reason is the atom kill, that is if exit(Pid, kill) is called, an untrappable exit signal is sent to Pid which will unconditionally exit with exit reason killed.
I had a similar issue with ssh_connection:exec/4. The problem is that these ssh sibling modules (ssh_connection, ssh_sftp, etc.) all appear to behave asynchronously, so closing the channel or the ssh connection itself will shut down the ongoing action.
The options are:
1) Do not close the connection: this may lead to a leak of resources. (That was the purpose of my question here.)
2) After the sftp transfer, introduce a monitoring function that waits by checking the file you are transferring on the remote server (a checksum check). This can be based on ssh_connection:exec, polling on the file you are transferring. Once the checksum matches what you expect, you can free the main module.
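Instead of a checksum via ssh_connection:exec, one hedged variant of option 2 is to poll the remote file's size over the same sftp channel before stopping it; wait_for_remote_file/4 and the 200 ms retry interval are just illustrative:

    -include_lib("kernel/include/file.hrl").

    wait_for_remote_file(_ChannelPid, _Path, _Size, 0) ->
        {error, timeout};
    wait_for_remote_file(ChannelPid, Path, Size, Retries) ->
        case ssh_sftp:read_file_info(ChannelPid, Path) of
            {ok, #file_info{size = Size}} ->
                %% The remote file has the expected size: safe to stop the channel.
                ssh_sftp:stop_channel(ChannelPid);
            _ ->
                timer:sleep(200),
                wait_for_remote_file(ChannelPid, Path, Size, Retries - 1)
        end.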
