Remote nodes, group leaders and printouts - erlang

Given two Erlang nodes, "foo#host" and "bar#host", the following produces a print-out on "foo":
(foo#host) rpc:call('bar#host', io, format, ["~p", [test]]).
While the following prints out on "bar":
(foo#host) rpc:call('bar#host', erlang, display, [test]).
Even if erlang:display/1 is supposed to be used for debug only, both functions are supposed to send stuff to the standard output. Each process should inherit the group leader from its parent, so I would expect that the two functions would behave in a consistent way.
Is there any rationale for the above behaviour?

The reason for this difference in behaviour is where and by whom the output is done:
erlang:display/1 is a BIF and is handled directly by the BEAM which writes it straight out to its standard output without going anywhere near Erlang's io-system. So doing this on bar results in it printed to bar's standard output.
io:format/1/2 is handled by the Erlang io-system. As no IoDevice has been given it sends an io-request to its group leader. The rpc:call/4 is implemented in such away that the remotely spawned process inherits the group leader of the process doing the RPC call. So the output goes to the standard output of the calling process. So doing an RPC call on foo to the node bar results in the output going to foo's standard output.
Hence the difference. It is interesting to note that no special handling of this is needed in the Erlang io-system, once the group leader has been set it all works transparently.

Related

Dask - Understanding diagnostics - memory:list

I am working on some fairly complex application that is making use of Dask framework, trying to increase the performance. To that end I am looking at the diagnostics dashboard. I have two use-cases. On first I have a 1GB parquet file split in 50 parts, and on second use case I have the first part of the above file, split over 5 parts, which is what used for the following charts:
The red node is called "memory:list" and I do not understand what it is.
When running the bigger input this seems to block the whole operation.
Finally this is what I see when I go inside those nodes:
I am not sure where I should start looking to understand what is generating this memory:list node, especially given how there is no stack button inside the task as it often happens. Any suggestions ?
Red nodes are in memory. So this computation has occurred, and the result is sitting in memory on some machine.
It looks like the type of the piece of data is a Python list object. Also, the name of the task is list-159..., so probably this is the result of calling the list Python function.

How to restart Erlang Supervised function with parameters?

I'm learning Erlang, and have a Supervisor question...
I have a function which requires 3 parameters (string, string, number). I want to Supervise that and make sure that if it fails, it gets restarted with the 3 parameters I passed it.
Is this something a Supervisor can handle, or do I need to look into some other concept?
Thanks.
Update 1/23/2016
One thing I want to mention... I have a list of 1439 entries. I need to create a Supervisor for each entry in that list. Each entry will incur different arguments. For example, here's some psuedo-code (reminiscent of Ruby):
(360..1799).each do |index|
export(output_path, table_name, index) # Supervise this
end
This is triggered by a user interaction at runtime. The output_path and table_name are dynamic too, but won't change for a given batch. Unravelled, a run may look something like this:
export("/output/2016-01-23/", "temp1234", 360)
export("/output/2016-01-23/", "temp1234", 361)
export("/output/2016-01-23/", "temp1234", 362)
.
.
So if 361 fails, I need it restarted with "/output/2016-01-23/", temp1234, and 361.
Is this something I can do with a Supervisor?
Yes, this is what supervisor does, but you mean "arguments", not "parameters".
For ordinary (not simple_one_for_one) supervisors your init/1 implementation returns a list of so called child specifications, each of which specifies a child that will be spawned, and the arguments that will be passed are provided as part of this child specification.
With simple_one_for_one supervisors you still have to provide a child specification, but you provide the arguments when you start each supervised child.
In all cases your supervised children will be restarted with the same arguments.
General Concepts
Erlang systems are divided into modules.
Each module are composed of functions.
Works are done by processes and they use functions to do it.
A supervisor is a process which supervises other processes.
So your function will sure be executed inside a process, and you can specify and supervise that process by a supervisor. This way when your function fails and crashes, the process that are executing it will crash as well. When the process crashes, its supervisor will find out and can act upon it based on a predefined restart strategy. These strategies could be one_for_one, one_for_all and rest_for_one.
You can check its manual for further options.

How can I emit summary data for each window even if a given window was empty?

It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed and use Sum.integersGlobally and then emit a record based off that, giving me a singleton per window, I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails, and you have to use withoutDefaults which will then emit nothing if the window was empty.
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this will be cost prohibitive for many cases. For a use case like yours where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.

Erlang/OTP How to notify parent process that child processes are idle and no messages in their mailbox

I would like to design a process hierarchy where there is a a parent process P which acts like a gatekeeper and delegates the work(messages/events from its client processes) to it's children processes C1,C2..Cn which collaborate with each other and may send the result back to P. The children processes cannot talk to any process outside, only P.
The challenge is that though P may have multiple messages from its clients, it should accept only one message, delegate the work to C1..Cn and ONLY accept the next message from its clients
when all its children processes are done(or idle) and there are no more messages circulating between C1 to Cn.
P finishes accepting messages from C1..Cn so that it can return the result to its clients
Constraints:
Idle for me is when they are waiting with a receive (blocking) or simply exited.
C1 to Cn are finite state machines. Some or all of them may send messages back to C. Or there may be no messages to be sent back to C. Even if no messages are sent back to C, C has to figure out that all of them are done with no messages between them.
If any of C1 to Cn have been pre-empted, then it is considered busy(this may be obvious but I thought I'll put it here for completion) and C will not receive the next message
Is there an OTP pattern or library which will do this for me (before I hack something?). I know that process_info can let me know if the mailbox of a process are empty and I could keep on checking the children's mailboxes from P but it would be unnecessary polling from P.
EDIT GENERAL: I am trying to implement a reactive variant of Flow Based Programming on the Erlang platform. This has the notion of 'hierarchical processes' or composites which themselves may contain composite processes until we reach some boxes of actual code...I am going to research(looking at monitor,process_info,process_flag) but I wanted to respond to your excellent answers
EDIT RECURSIVE PARENTS: Each of C1 and Cn can themselves be parent/composite processes. If I just spawn processes and let them exit immediately, I'll have to create the chain of Composites everytime as C1..Cn may themselves be composites (which spawn composites..and so on). Finally, when we reach a leaf box(which is not a composite node), they are supposed to be finite state machines.. so I'm not sure of spawning and making them exit quickly if the are FSMs.
EDIT TKOWAL: Since I am trying to create a generic parent/composite process, it does not know 'when' the task ends. All it does is relay the messages it receives from its children to it's siblings with the 'constraint' that it will not accept the next message from its client/siblings until its children are 'done'. The children C1..Cn may send not just one but many messages. I understand from your proposal, that wait_for_task_finish will stop blocking the moment it gets the first message. But more messages may be emitted too by P's children. P should wait for all messages. Also, having a task_end symbol will not work for the same reason(i.e. multiple messages possible from the children)
Given how inexpensive it is to start up Erlang processes, your gatekeeper could start new children for each incoming task, and then wait for them all to exit normally once they complete their work.
But in general, it sounds like you're looking for a process pool. There are a few of these already available, such as poolboy and sidejob. Pools can be harder to get right than you think, so I advise using an existing proven pool implementation before attempting to write your own.
After edits, this became entirely different question, so I am posting new answer.
If you are trying to write Flow Based Programming, then you are probably solving wrong problem. FBP is effective, because almost everything is asynchronous and you start processing next request immediately after you finished with previous one.
So, the answer is - don't wait for children to finish:
In FBP, there is no time dependencies between the components. So if I
have a chunk of data, it should be able to flow from one end of the
diagram to the other regardless of how any other pieces of data are
being handled. In order to program an FBP system, you have to minimize
your dependencies.
source
When creating parent and children, you know all the connections between blocks, so just configure children to send processed data directly to next block. For example: P1 has children C1 and C2. You send message to P1, it delegates it to C1, packet flows couple of times between C1 and C2 and after that, C1 or C2 sends it directly to P2.
Blocks should be stateless. They output should not depend on previous requests, so even if C1 and C2 are processing data from two different requests to P1 - it is OK. There could be situations, where P1 gets data packet D1 and then D2, but will output answers in different order R2 and then R1. It is also OK. You can use Erlang reference to tag messages and then check, which response is from which request.
I don't think, there is ready library for that, but it is really easy to hack, unless I missed something. Your P process should look like this:
ready_for_next_task() ->
receive
{task, Task, CallerPid} ->
send_task_to_workers(Task)
end,
wait_for_task_finish(CallerPid).
wait_for_task_finish(CallerPid) ->
receive
{task_end, Response} ->
CallerPid ! Response
end,
ready_for_next_task().
In wait_for_task_finish/1 you have only one clause for receive, so it will not accept next task, until current one is finished. If you are waiting for multiple responses from workers, you can simply add second clause to receive with some partial response and call wait_for_task_finish/1 recursively.
It is always better to have some indicator, that the processing ended, because you don't have guarantees on message delivery time. This means, that you could check, that all processes currently are waiting for message and think, that they ended processing, but actually, they did not started yet or one of them send message to other and you caught them before the second one had it in message box.
If the processes C1..Cn have only parts of actual work and don't know about the progress, than the gatekeeper P should know how many parts there were, receive all of them one by one and then call ready_for_next_task/1.

using AUGraph vs direct connections between Audio Units by means of a GCD queue

So i've gone thru the Audio Unit Hosting Guide for iOS and they hint all along that one might always want to use an AUGraph instead of direct connections between AUs. Among the most notable reasons they mention a high-priority thread, being able to reconfigure the graph while it is running and general thread-safety.
My problem is that I'm trying to get as close to "making my own custom dsp effects" as possible given that iOS does not really let you dynamically load custom code. So, my approach is to create a generic output unit and write the DSP code in its render callback. Now the problem with this approach is if I wanted to chain two of these units with custom callbacks in a graph. Since a graph must have a single output au (for head), trying to add any more output units won't fly. That is, I cant have 2 Generic I/O units and a Remote I/O unit in sequence in a graph.
If I really wanted to use AUGraphs, I can think of one solution along the lines of:
A graph interface that internally keeps an AUGraph with a single Output unit plus, for each connected node in the graph, a list of "custom callback" generic output nodes that are in theory connected to such node. Maybe it could be a class / interface over AUNode instead, but hopefully you get the idea.
If I add non output units to this graph, it essentially connects them in the usual way to the backing AUGraph.
If however, I add a generic output node, this really means adding the node's au to the list and whichever node I am connecting in the graph to, actually gets its input scope / element 0 callback set to something like:
for each unit in your connected list:
call AudioUnitRender(...) and merge in ioData;
So that the node which is "connected" to any number of those "custom" nodes basically pulls the processed output from them and outputs it to whatever next non-custom node.
Sure there might be some loose ends in the idea above, but I think this would work with a bit more thought.
Now my question is, what if instead I do direct connections between AUs without an AUGraph whatsoever? Without an AUGraph in the picture, I can definitely connect many generic output units with callbacks to one final Remote I/O output and this seems to work just fine. Thing is kAudioUnitProperty_MakeConnection is a property. So I'm not sure that once an AU is initialized I can re set properties. I believe if I uninitialize then it's ok. If so, couldn't I just get GCD's high priority queue and dispatch any changes in async blocks that uninitialize, re connect and initialize again?

Resources