In my reactive application I have a hot Publisher with a slow Subscriber. To handle the lack of demand I am using onBackpressureBuffer, but possible overflow errors are kinda scary.
How can I monitor the number of elements in the queue created by Flux.onBackpressureBuffer(maxSize)? Preferably with the built-in Reactor metrics() method. I am using Spring Boot + Micrometer, if that makes any difference.
We didn't find an easy way to do this in Reactor, but we found a slightly "hacky" one. Here it is: https://github.com/allegro/envoy-control/blob/master/envoy-control-core/src/main/kotlin/pl/allegro/tech/servicemesh/envoycontrol/utils/ReactorUtils.kt#L34
This function measures buffer size of various Flux operators. It is not guaranteed to work on every operator, but it was tested on onBackpressureBuffer with positive results.
It is written in Kotlin, but it should be very easy to port it to Java.
The essence of this code, in the case of onBackpressureBuffer, is to wrap the Subscription in a Scannable and then read the BUFFERED attribute:
flux
    .onBackpressureBuffer(maxSize)
    .doOnSubscribe { subscription ->
        // ...
        val queueSize = Scannable.from(subscription).scan(Scannable.Attr.BUFFERED)
        // ...
    }
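To get this into Micrometer, one option is to register a gauge that reads the scanned value. Below is a minimal Kotlin sketch of that idea; monitorBufferSize is a hypothetical helper name (not part of Reactor or the linked code), and it assumes a MeterRegistry is available, e.g. the one auto-configured by Spring Boot:

import io.micrometer.core.instrument.MeterRegistry
import reactor.core.Scannable
import reactor.core.publisher.Flux
import java.util.concurrent.atomic.AtomicReference

// Hypothetical helper: exposes the BUFFERED attribute of the subscription
// created by the upstream operator as a Micrometer gauge.
fun <T> Flux<T>.monitorBufferSize(registry: MeterRegistry, name: String): Flux<T> {
    val scannable = AtomicReference<Scannable?>()
    registry.gauge(name, scannable) { ref ->
        ref.get()?.scan(Scannable.Attr.BUFFERED)?.toDouble() ?: 0.0
    }
    // Capture the subscription of the operator directly upstream (here: onBackpressureBuffer).
    return doOnSubscribe { subscription -> scannable.set(Scannable.from(subscription)) }
}

// Usage: apply it directly after onBackpressureBuffer so the scanned
// subscription is the buffering one.
// flux.onBackpressureBuffer(maxSize)
//     .monitorBufferSize(meterRegistry, "backpressure.buffer.size")

As noted above, whether BUFFERED is actually exposed depends on the operator, so treat this as best-effort monitoring.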
I wonder, is there any difference in behavior/guarantees between the MonoJust and FluxJust created with exactly one argument?
From the source code of Reactor Core 3.3.7 I can see that the former uses Operators#ScalarSubscription as its subscription object, while the latter uses its private WeakScalarSubscription.
The only difference between these two is that ScalarSubscription has a volatile int once counter, defined and checked on each method call, which more or less ensures that onComplete() is called exactly once. WeakScalarSubscription, on the other hand, uses a boolean terminado field (a non-volatile flag) for the same purpose, but without the "exactly once" guarantee for the onComplete() call.
Using volatile in Java has its price, which is paid, for example, when one creates a lot of these objects (with Mono.just(1) or Flux.just(1)) in highly concurrent client code. (As we do in our project, inside a flatMap that runs in parallel on a dedicated thread pool.)
There's no class javadoc for MonoJust, so I wonder if my assumptions are correct: that the only difference is that FluxJust may send the completion signal more than once in some circumstances — and that's it? Or are there other subtle differences?
I think the biggest difference is in how you use Flux and Mono. A Mono emits at most one item or an error and then completes, whereas a Flux can emit more than one element before an error or completion signal.
The just() methods are meant to take one element (or a vararg list of elements, in the Flux variant) and emit it immediately. I can imagine cases where a Flux with only one element is returned.
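To make the practical contract concrete, here is a small illustrative Kotlin sketch (not from either class's javadoc); from the subscriber's point of view both publishers emit one onNext followed by onComplete for a single value, and the once/terminado distinction discussed above is an internal detail:

import reactor.core.publisher.Flux
import reactor.core.publisher.Mono

fun main() {
    // Both emit exactly one onNext followed by onComplete for a single value.
    Mono.just(1).subscribe(
        { value -> println("mono value: $value") },
        { error -> error.printStackTrace() },
        { println("mono complete") })

    Flux.just(1).subscribe(
        { value -> println("flux value: $value") },
        { error -> error.printStackTrace() },
        { println("flux complete") })

    // Flux.just additionally has a vararg overload, which Mono does not.
    Flux.just(1, 2, 3).subscribe { value -> println("flux vararg: $value") }
}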
What is the recommended way to do expensive one-off initialization in a Beam Python DoFn? The Java SDK has DoFn.Setup, but there doesn't appear to be an equivalent in Beam Python.
Is the best way currently to attach objects to threading.local() in the DoFn initializer?
Dataflow Python is not particularly transparent about the optimal method for initializing expensive objects. There are a few mechanisms by which objects can be instantiated infrequently (it is currently not ideal to perform exactly once initialization). Below are outlined some of the experiments I have run and conclusions I have come to. Hopefully someone from the Beam community can help correct me wherever I have strayed.
__init__
Although the __init__ method can be used to initialize an expensive object exactly once, this initialization does not happen on the Worker machines. The object will need to be serialized in order to be sent off to the Worker, which, for large objects as well as TensorFlow models, can be quite unwieldy or not work at all. Furthermore, since this object will be serialized and sent over a wire, it is not secure to perform initializations here, as payloads can be intercepted. The recommendation is against using this method.
start_bundle()
Dataflow processes data in discrete groups that it calls bundles. These are fairly well defined in batch processes, but in streaming they are dependent on the throughput. There are no mechanisms for configuring how Dataflow creates its bundles, and in fact the size of a bundle is entirely dictated by Dataflow. The start_bundle() method will be called on the Worker and can be used to initialize state, however experiments find that in a streaming context, this method is called more frequently than desired, and expensive re-initializations would happen quite often.
Lazy initialization
This methodology was suggested by the Beam docs and is somewhat surprisingly the most performant. Lazy initialization means that you create some stateful parameter that you initialize to None, then execute code such as the following:
if self.expensive_object is None:
    self.expensive_object = self.__expensive_initialization()
You can execute this code directly in your process() method. You can also put together helper functions that rely on global state, so that you can write something like the following (an example of what this might look like is at the bottom of this post):
self.expensive_object = get_or_initialize_global('expensive_object', self.__expensive_initialization)
Experiments
The following experiments were run on a job that was configured to use both start_bundle and the lazy initialization method described above, with appropriate logging to indicate invocation. Various throughputs were published to the appropriate queue and the results were recorded accordingly.
At a rate of 1 msg/sec over 100s:

Context                   Number of Invocations
------------------------------------------------------------
NEW BUNDLE                                  100
LAZY INITIALIZATION                          25
TOTAL MESSAGES                              100

At a rate of 10 msg/sec over 100s:

Context                   Number of Invocations
------------------------------------------------------------
NEW BUNDLE                                  942
LAZY INITIALIZATION                           3
TOTAL MESSAGES                             1000

At a rate of 100 msg/sec over 100s:

Context                   Number of Invocations
------------------------------------------------------------
NEW BUNDLE                                 2447
LAZY INITIALIZATION                          30
TOTAL MESSAGES                            10000

At a rate of 1000 msg/sec over 100s:

Context                   Number of Invocations
------------------------------------------------------------
NEW BUNDLE                                 2293
LAZY INITIALIZATION                          36
TOTAL MESSAGES                           100000
Takeaways
Although start_bundle works well for high throughput, lazy initialization is nonetheless the most performant by a wide margin regardless of throughput. It is the recommended way of performing expensive initializations on Python Beam. This result is perhaps not too surprising given this quote from the official docs:
Setup - called once per DoFn instance before anything else; this has not been implemented in the Python SDK so the user can work around just with lazy initialization
The fact that it is called a "work around" is not particularly encouraging though, and maybe we can expect something more robust in the near future.
Code Samples
Courtesy of Andreas Jansson:
def get_or_initialize_global(object_key, initialize_expensive_object):
    if object_key in globals():
        expensive_object = globals()[object_key]
    else:
        expensive_object = initialize_expensive_object()
        globals()[object_key] = expensive_object
    return expensive_object
Setup and teardown have now been added to the Python SDK and are the recommended way to do expensive one-off initialization in a Beam Python DoFn.
This sounds like it could be it: https://beam.apache.org/releases/pydoc/2.8.0/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.start_bundle
Possible Duplicate:
Is R's apply family more than syntactic sugar
Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".
Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc)?
Thanks in advance!
Josh
There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
Firstly, for(), apply(), and sapply() will generally be just as quick as each other if executed correctly. lapply() does more of its operating in compiled code within the R internals than the others, so it can be faster than those functions. The speed advantage appears greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, all of these will be calling R functions, so they need to be interpreted and then run.
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))
for(i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}
That is a silly example, as > is a vectorised operator, but I wanted something to make a point, namely that you have to manage the output yourself. The main thing is that with for() loops you should always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, then allocate a reasonable chunk of storage, and then in the loop check whether you have exhausted that storage and bolt on another big chunk if so.
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (as lapply() does), then that is where the performance gain can come from, over apply(), say, which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from some argument manipulation and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
From Burns' R Inferno (pdf), p25:
Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.
I have a bunch of threads that are doing lots of communication with each other.
I would prefer this be lock free.
For each thread, I want to have a mailbox where other threads can send it messages (but only the owner can remove messages). This is a multiple-producer single-consumer situation. Is it possible for me to do this in a lock-free / high-performance manner? (This is in the inner loop of a gigantic simulation.)
A lock-free multiple-producer single-consumer (MPSC) queue is one of the easiest lock-free algorithms to implement.
The most basic implementation requires a simple lock-free singly-linked list (SList) with only push() and flush(). The functions are available in the Windows API as InterlockedFlushSList() and InterlockedPushEntrySList() but these are very easy to roll on your own.
Multiple Producers push() items onto the SList using a CAS (interlocked compare-and-swap).
The Single Consumer does a flush(), which swaps the head of the SList with NULL using an XCHG (interlocked exchange). The Consumer then has the list of items in reverse order.
To process the items in order, you must simply reverse the list returned from flush() before processing it. If you do not care about order, you can simply walk the list immediately to process it.
Two notes if you roll your own functions:
1) If you are on a system with weak memory ordering (e.g. PowerPC), you need to put a "release" memory barrier at the beginning of the push() function and an "acquire" memory barrier at the end of the flush() function.
2) You can make the functions considerably simpler and more optimized, because the ABA issue with SLists occurs in the pop() function. You cannot have ABA issues with an SList if you use only push() and flush(). This means you can implement it as a single pointer, very similar to the non-lock-free code, and there is no need for an ABA-prevention sequence counter.
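For illustration, here is a minimal Kotlin sketch of the push()/flush() scheme described above, written against java.util.concurrent.atomic instead of the Windows SList API; the class and member names (MpscMailbox, push, flush) are made up for this example:

import java.util.concurrent.atomic.AtomicReference

class MpscMailbox<T> {
    private class Node<T>(val value: T, val next: Node<T>?)

    private val head = AtomicReference<Node<T>?>(null)

    // Any number of producer threads: CAS a new node onto the head of the list.
    fun push(value: T) {
        while (true) {
            val current = head.get()
            val node = Node(value, current)
            if (head.compareAndSet(current, node)) return
        }
    }

    // The single consumer: atomically take the whole list (exchange the head
    // with null), then reverse it so items come back in the order they were pushed.
    fun flush(): List<T> {
        var node = head.getAndSet(null)
        val items = ArrayList<T>()
        while (node != null) {
            items.add(node.value)
            node = node.next
        }
        items.reverse()
        return items
    }
}

The consumer would call flush() in its loop and process the returned batch, while producers only ever touch push(). On the JVM, the AtomicReference operations already provide the release/acquire ordering mentioned in note 1.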
Sure, if you have an atomic CompareAndSwap instruction:
/* Claim a free slot: as used here, CompareAndSwap(ptr, newValue, expected)
   atomically sets *ptr to newValue if it currently equals expected and
   returns the previous value, so a return of false means we won the slot. */
for (i = 0; ; i = (i + 1) % MAILBOX_SIZE)
{
    if ((mailbox[i].owned == false) &&
        (CompareAndSwap(&mailbox[i].owned, true, false) == false))
        break;
}
mailbox[i].message = message;
mailbox[i].ready = true;    /* publish the message to the consumer */
After reading a message, the consuming thread just sets mailbox[i].ready = false; mailbox[i].owned = false; (in that order).
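For completeness, here is a Kotlin rendering of the same slot-claiming scheme, including the consumer side described in the last sentence; the names (SlotMailbox, post, drain) are made up for this sketch:

import java.util.concurrent.atomic.AtomicBoolean

// One slot per pending message; owned is claimed with CAS, ready publishes the payload.
class Slot<T> {
    val owned = AtomicBoolean(false)
    @Volatile var ready = false
    @Volatile var message: T? = null
}

class SlotMailbox<T>(size: Int) {
    private val slots = List(size) { Slot<T>() }

    // Producer: claim a free slot with CAS, write the message, then mark it ready.
    fun post(message: T) {
        var i = 0
        while (true) {
            val slot = slots[i]
            if (!slot.owned.get() && slot.owned.compareAndSet(false, true)) {
                slot.message = message
                slot.ready = true
                return
            }
            i = (i + 1) % slots.size
        }
    }

    // Consumer (owner thread only): read ready slots, then release them,
    // clearing ready before owned, in that order.
    fun drain(consume: (T) -> Unit) {
        for (slot in slots) {
            if (slot.ready) {
                consume(slot.message!!)
                slot.ready = false
                slot.owned.set(false)
            }
        }
    }
}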
Here's a paper from the University of Rochester illustrating a non-blocking concurrent queue. The algorithm described in the paper shows one technique for making a lockless queue.
You may want to look at Intel Threading Building Blocks; I recall being at a lecture by an Intel developer that mentioned something along those lines.
I have some stuff written in c# that executes concurrent code, making heavy use of the Task Parallel Library (Task and Future continuation chains).
I'm now porting this to F# and am trying to figure out the pros and cons of using F# Async workflows vs. the constructs in the TPL. I'm leaning towards TPL, but I think it could be done either way.
Does anyone have tips and wisdom about writing concurrent programs in F# to share?
The name pretty much sums up the difference: asynchronous programming vs. parallel programming. But in F# you can mix and match.
F# Asynchronous Workflows
F# async workflows are helpful when you want to have code execute asynchronously, that is starting a task and not waiting around for the final result. The most common usage of this is IO operations. Having your thread sit there in an idle loop waiting for your hard disk to finish writing wastes resources.
If you began the write operation asynchronously you can suspend the thread and have it woken up later by a hardware interrupt.
Task Parallel Library
The Task Parallel Library in .NET 4.0 abstracts the notion of a task - such as decoding an MP3, or reading some results from a database. In these situations you actually want the result of the computation, and at some later point in time you wait for it (by accessing the .Result property).
You can easily mix and match these concepts. Such as doing all of your IO operations in a TPL Task object. To the programmer you have abstracted the need to 'deal with' that extra thread, but under the covers you're wasting resources.
Likewise, you can create a series of F# async workflows and run them in parallel (Async.Parallel), but then you need to wait for the final result (Async.RunSynchronously). This frees you from needing to explicitly start all the tasks, but really you are just performing the computation in parallel.
In my experience I find that the TPL is more useful because usually I want to execute N operations in parallel. However, F# async workflows are ideal when there is something that is going on 'behind the scenes' such as a Reactive Agent or Mailbox type thing. (You send something a message, it processes it and sends it back.)
Hope that helps.
In 4.0 I would say:
If your function is sequential, use Async workflows. They simply read better.
Use the TPL for everything else.
It's also possible to mix and match. They've added support for running a workflow as a task and creating tasks that follow the async Begin/End pattern using TaskFactory.FromAsync, the TPL equivalent of Async.FromBeginEnd or Async.BuildPrimitive.
let func() =
    let file = File.OpenRead("foo")
    let buffer = Array.zeroCreate 1024
    // Tasks returned by FromAsync are already started, so do not call Start() on them.
    let task1 = Task.Factory.FromAsync(file.BeginRead(buffer, 0, buffer.Length, null, null), file.EndRead)
    printfn "%d" task1.Result
    let task2 = Async.StartAsTask(file.AsyncRead(1024))
    printfn "%d" task2.Result.Length
It's also worth noting that both the Async Workflows runtime and the TPL are going to create an extra kernel primitive (an Event) and use WaitForMultipleObjects to track I/O completion, rather than using completion ports and callbacks. This is undesirable in some applications.