Task Parallel Library vs Async Workflows - f#

I have some stuff written in c# that executes concurrent code, making heavy use of the Task Parallel Library (Task and Future continuation chains).
I'm now porting this to F# and am trying to figure out the pros and cons of using F# Async workflows vs. the constructs in the TPL. I'm leaning towards TPL, but I think it could be done either way.
Does anyone have tips and wisdom about writing concurrent programs in F# to share?

The name pretty much sums up the difference: asynchronous programming vs. parallel programming. But in F# you can mix and match.
F# Asynchronous Workflows
F# async workflows are helpful when you want to have code execute asynchronously, that is starting a task and not waiting around for the final result. The most common usage of this is IO operations. Having your thread sit there in an idle loop waiting for your hard disk to finish writing wastes resources.
If you began the write operation asynchronously you can suspend the thread and have it woken up later by a hardware interrupt.
Task Parallel Library
The Task Parallel Library in .NET 4.0 abstracts the notion of a task - such as decoding an MP3, or reading some results from a database. In these situations you actually want the result of the computation and at some point in time later are waiting for the operation's result. (By accessing the .Result property.)
You can easily mix and match these concepts. Such as doing all of your IO operations in a TPL Task object. To the programmer you have abstracted the need to 'deal with' that extra thread, but under the covers you're wasting resources.
Like wise you can create a series of F# async workflows and run them in parallel (Async.Parallel) but then you need to wait for the final result (Async.RunSynchronously). This frees you from needing to explicitly start all the tasks, but really you are just performing the computation in parallel.
In my experience I find that the TPL is more useful because usually I want to execute N operations in parallel. However, F# async workflows are ideal when there is something that is going on 'behind the scenes' such as a Reactive Agent or Mailbox type thing. (You send something a message, it processes it and sends it back.)
Hope that helps.

In 4.0 I would say:
If your function is sequential, use Async workflows. They simply read better.
Use the TPL for everything else.
It's also possible to mix and match. They've added support for running a workflow as a task and creating tasks that follow the async Begin/End pattern using TaskFactory.FromAsync, the TPL equivalent of Async.FromBeginEnd or Async.BuildPrimitive.
let func() =
let file = File.OpenRead("foo")
let buffer = Array.zeroCreate 1024
let task1 = Task.Factory.FromAsync(file.BeginRead(buffer, 0, buffer.Length, null, null), file.EndRead)
task1.Start()
let task2 = Async.StartAsTask(file.AsyncRead(1024))
printfn "%d" task2.Result.Length
It's also worth noting that both the Async Workflows runtime and the TPL are going to create an extra kernel primitive (an Event) and use WaitForMultipleObjects to track I/O completion, rather than using completion ports and callbacks. This is undesirable in some applications.

Related

How to monitor Flux.onBackpressureBuffer() queue size

In my reactive application I have hot Publisher with slow Subscriber. To handle lack of demand I am using onBackpressureBuffer but possible overflow errors are kinda scary.
How can I monitor number of elements present in the queue created by Flux.onBackpressureBuffer(maxSize)? Preferably with built-in reactor metrics() method. I am using Spring Boot + Micrometer if it makes any difference.
Although we didn't we find an easy way to this in Reactor, but we found a bit "hacky" one. Here it is: https://github.com/allegro/envoy-control/blob/master/envoy-control-core/src/main/kotlin/pl/allegro/tech/servicemesh/envoycontrol/utils/ReactorUtils.kt#L34
This function measures buffer size of various Flux operators. It is not guaranteed to work on every operator, but it was tested on onBackpressureBuffer with positive results.
It is written in Kotlin, but it should be very easy to port it to Java.
The essence of this code in case of onBackpressureBuffer is to cast Subscription to Scannable, and then use BUFFERED attribute:
flux
.onBackressureBuffer(maxSize)
.doOnSubscribe { subscription ->
// ...
val queueSize = Scannable.from(subscription).scan(Scannable.Attr.BUFFERED)
// ...
}

Asynchronous Xarray writing to Zarr

all. I'm using a Dask Distributed cluster to write Zarr+Dask-backed Xarray Datasets inside of a loop, and the dataset.to_zarr is blocking. This can really slow things down when there are straggler chunks that block the continuation of the loop. Is there a way to do the .to_zarr asynchronously, so that the loop can continue with the next dataset write without being held up by a few straggler chunks?
With the distributed scheduler, you get async behaviour without any special effort. For example, if you are doing arr.to_zarr, then indeed you are going to wait for completion. However, you could do the following:
client = Client(...)
out = arr.to_zarr(..., compute=False)
fut = client.compute(out)
This will return a future, fut, whose status reflects the current state of the whole computation, and you can choose whether to wait on it or to continue submitting new work. You could also display it to a progress bar (in the notebook) which will update asynchronously whenever the kernel is not busy.

Apache Beam: DoFn.Setup equivalent in Python SDK

What is the recommended way to do expensive one-off initialization in a Beam Python DoFn? The Java SDK has DoFn.Setup, but there doesn't appear to be an equivalent in Beam Python.
Is the best way currently to attach objects to threading.local() in the DoFn initializer?
Dataflow Python is not particularly transparent about the optimal method for initializing expensive objects. There are a few mechanisms by which objects can be instantiated infrequently (it is currently not ideal to perform exactly once initialization). Below are outlined some of the experiments I have run and conclusions I have come to. Hopefully someone from the Beam community can help correct me wherever I have strayed.
__init__
Although the __init__ method can be used to initialize an expensive object exactly once, this initialization does not happen on the Worker machines. The object will need to be serialized in order to be sent off to the Worker which, for large objects, as well as Tensorflow models, can be quite unwieldy or not work at all. Furthermore, since this object will be serialized and sent over a wire, it is not secure to perform initializations here, as payloads can be intercepted. The recommendation is against using this method.
start_bundle()
Dataflow processes data in discrete groups that it calls bundles. These are fairly well defined in batch processes, but in streaming they are dependent on the throughput. There are no mechanisms for configuring how Dataflow creates its bundles, and in fact the size of a bundle is entirely dictated by Dataflow. The start_bundle() method will be called on the Worker and can be used to initialize state, however experiments find that in a streaming context, this method is called more frequently than desired, and expensive re-initializations would happen quite often.
Lazy initialization
This methodology was suggested by the Beam docs and is somewhat surprisingly the most performant. Lazy initialization means that you create some stateful parameter that you initialize to None, then execute code such as the following:
if self.expensive_object is None:
self.expensive_object = self.__expensive_initialization()
You can execute this code directly in your process() method. You can also put together some helper functions easily enough that rely on global state so that you can have functions such as (an example of what this might look like is at the bottom of this post):
self.expensive_object = get_or_initialize_global(‘expensive_object’, self.__expensive_initialization)
Experiments
The following experiments were run on a job that was configured using both start_bundle and the lazy initialization method described above, with appropriate logging to indicate invocation. Various throughput was published to the appropriate queue and the results were recorded accordingly.
At a rate of 1 msg/sec over 100s:
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 100
LAZY INITIALIZATION 25
TOTAL MESSAGES 100
At a rate of 10 msg/sec over 100s
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 942
LAZY INITIALIZATION 3
TOTAL MESSAGES 1000
At a rate of 100 msg/sec over 100s
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 2447
LAZY INITIALIZATION 30
TOTAL MESSAGES 10000
At a rate of 1000 msg/sec over 100s
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 2293
LAZY INITIALIZATION 36
TOTAL MESSAGES 100000
Takeaways
Although start_bundle works well for high throughput, lazy initialization is nonetheless the most performant by a wide margin regardless of throughput. It is the recommended way of performing expensive initializations on Python Beam. This result is perhaps not too surprising given this quote from the official docs:
Setup - called once per DoFn instance before anything else; this has not been implemented in the Python SDK so the user can work around just with lazy initialization
The fact that is is called a "work around" is not particularly encouraging though, and maybe we can expect something more robust in the near future.
Code Samples
Courtesy of Andreas Jansson:
def get_or_initialize_global(object_key, initialize_expensive_object):
if object_key in globals():
expensive_object = globals()[object_key]
else:
expensive_object = initialize_expensive_object()
globals()[object_key] = expensive_object
Setup and teardown have now been added to the Python SDK and are the recommended way to do expensive one-off initialization in a Beam Python DoFn.
This sounds like it could be it https://beam.apache.org/releases/pydoc/2.8.0/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.start_bundle

Erlang: When to use functions vs processes?

My task is to process files inside a zip file. So I write bunch of independent functions and compose them to get the desired result. That's one way of doing things. Now instead of having it all written as functions, I write some of them as processes with selective receives and all, and every things is cool. But then pondering on this a bit further, I'm thinking like, do we need functions at all? Couldn't I replace or convert all those functions into processes that communicates to itself and to other processes? So there lies my doubt. When to use functions and when to use processes? Is there any advantage from performance standpoint in using functions (like caching)? Doesn't code blocks in the processes get cached similarly?
So in our example what's the standard idiom to proceed with? Current pseudo code below.
start() ->
FL = extract("..path"),
FPids = lists:map(open_file, FL), % get file Pids,
lists:foreach(fun(FPid) ->
CPid = spawn_compute_process(),
rpc(CPid,{compute,FPid})
end, FPids).
compute() ->
receive
{Pid,{..}} ->
Line = read_line(..),
TL = tidy_line(Line), % an independent function. But couldn't it be a guard within this process?
..
end.
extract(FilePath) -> FilesList.
read_line(FPid) -> line.
So how do you actually write code? Like, write smaller independent functions first and then wrap them up inside processes?
Thanks.
The short answer is that you use processes to exploit concurrency. Replacing functions with processes where you sequentially run one process, then send its value to another process which then does its work and sends its result to the next process etc each process terminating after its done its bit is the wrong use of processes. Here you are just evaluating something sequentially by sending data from one process to another instead of calling functions.
If, however, you intend this chain of processes to be able to process multiple sequences of "calls" concurrently then it is a different matter. Then you are using the processes for concurrency. The more general way of doing this in erlang is to create a separate process for each sequence and exploit the concurrency in that manner.
Another use of processes is to manage state.

MailboxProcessor usage guidelines?

I've just discovered the MailboxProcessor in F# and it's usage as a "state machine" ... but I can't find much on the recommended usage of them.
For example... say I'm making a simple game with 100 on-screen enemies should I use a MailboxProcessor to change enemy position and health; giving me 200 active MailboxProcessor?
Is there any clever thread management going on under the hood? should I try and limit the amount of active MailboxProcessor I have or can I keep banging them out willy-nilly?
Thanks in advance,
JD.
A MailboxProcessor for enemy simulation might look like this:
MailboxProcessor.Start(fun inbox ->
async {
while true do
let! message = inbox.Receive()
processMessage(message)
})
It does not consume a thread while it waits for a message to arrive (let! message = line). However, once message arrives it will consume a thread (on a thread pool). If you have a 100 mailbox processors that all receive a message simultaneously, they will all attempt to wake up and consume a thread. Since here message processing is CPU bound, 100s of mailbox processors will all wake up and start spawning (thread pool) threads. This is not a great performance.
One situation mailbox processors excel in is the situation where there is a lot of concurrent clients all sending messages to one processor (imagine several parallel web crawlers all downloading pages and sinking results to a queue). On-screen enemies case appears different - it is many entities responding to a single source of messages (player movement/time ticks).
Another example where thousands of MailboxProcessors is a great solution is I/O bound MailboxProcessor:
MailboxProcessor.Start(fun inbox ->
async {
while true do
let! message = inbox.Receive()
match message with
| ->
do! AsyncWrite("something")
let! response = AsyncResponse()
...
})
Here after receiving a message the agent very quickly yields a thread but still needs to maintain state across asynchronous operations. This will scale very very well in practice - you can run thousands and thousands of such agents: this is a great way to write a web server.
As per
http://blogs.msdn.com/b/dsyme/archive/2010/02/15/async-and-parallel-design-patterns-in-f-part-3-agents.aspx
you can bang them out willy-nilly. Try it! They use the ThreadPool. I have not tried this for a real-time GUI game app, but I would not be surprised if this is 'good enough'.
say I'm making a simple game with 100 on-screen enemies should I use a MailboxProcessor to change enemy position and health; giving me 200 active MailboxProcessor?
I don't see any reason to try to use MailboxProcessor for that. A serial loop is likely to be much simpler and faster.
Is there any clever thread management going on under the hood?
Yes, lots. But is it designed for asynchronous concurrent programming (particularly scalable IO) and your program isn't really doing that.
should I try and limit the amount of active MailboxProcessor I have or can I keep banging them out willy-nilly?
You can bang them out willy-nilly but they are far from optimized and performance is much worse than serial code.
Maybe this or this can help?

Resources