In the iOS 13 Combine framework, there are three collect operator methods. The first two are obvious but the third uses types I can't figure out.
collect(_:options:)
https://developer.apple.com/documentation/foundation/timer/timerpublisher/3329497-collect
func collect<S>(_ strategy: Publishers.TimeGroupingStrategy<S>,
options: S.SchedulerOptions? = nil)
-> Publishers.CollectByTime<Timer.TimerPublisher, S>
where S : Scheduler
Can anyone give an example of how one would call this method?
After some struggle, I came up with an example like this:
let t = Timer.publish(every: 0.4, on: .main, in: .default)
t
    .scan(0) { prev, _ in prev + 1 }
    .collect(.byTime(DispatchQueue.main, .seconds(1))) // *
    .sink(receiveCompletion: { print($0) }) { print($0) }
    .store(in: &storage)
let cancellable = t.connect()
delay(3) { cancellable.cancel() }
(where storage is the usual Set<AnyCancellable> to keep the subscriber alive).
The output is:
[1, 2]
[3, 4, 5]
[6, 7]
So we are publishing a new number about every 0.4 seconds, but collect only does its thing every 1 second. Thus, the first two values arrive, publishing 1 and 2, and then collect does its thing, accumulates all the values that have arrived so far, and publishes them as an array, [1,2]. And so on. Every second, whatever has come down the pipeline so far is accumulated into an array and published as an array.
The available TimeGroupingStrategy mechanisms are defined as cases of that enum. As of iOS 13.3 there are still just two:
byTime
byTimeOrCount
In either case, the first parameter is a scheduler on which to run the collection (an immediate scheduler, DispatchQueue, RunLoop, or OperationQueue); its type is usually just inferred from whatever you pass in. Along with the scheduler goes a Stride - a time interval you specify - over which the operator buffers values.
With byTime, the operator collects and buffers as many elements as arrive during the interval you specify, using an unbounded amount of memory to do so. byTimeOrCount additionally limits how many items get buffered to a specific count.
The two means of specifying these are:
let q = DispatchQueue(label: self.debugDescription)
publisher
.collect(.byTime(q, 1.0))
or
let q = DispatchQueue(label: self.debugDescription)
publisher
.collect(.byTimeOrCount(q, 1.0, 10))
These use a DispatchQueue, but you could just as easily use any of the other schedulers.
If you just pass in a Double for the stride, it is interpreted as a number of seconds.
In both cases, when the time (or count, if that version is specified) is elapsed, the operator will publish an array of the collected values to its subscribers in turn.
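For instance, here is a minimal sketch of byTimeOrCount against the same timer pipeline as above (the count cap of 2 is arbitrary, chosen so the count limit fires before the one-second time limit; storage is again a Set<AnyCancellable>):
import Combine
import Foundation

var storage = Set<AnyCancellable>()

let t = Timer.publish(every: 0.4, on: .main, in: .default)
t
    .scan(0) { prev, _ in prev + 1 }
    // Publish a buffer every second OR as soon as 2 values have
    // accumulated, whichever comes first.
    .collect(.byTimeOrCount(DispatchQueue.main, .seconds(1), 2))
    .sink(receiveCompletion: { print($0) }) { print($0) }
    .store(in: &storage)
let cancellable = t.connect()
Since a value arrives roughly every 0.4 seconds, two values accumulate well within each second, so the output should look like [1, 2], [3, 4], [5, 6], and so on: here the count limit, not the time limit, is what triggers each emission.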
I have a dask dataframe and want to compute some tasks that are independent. Some tasks are faster than others but I'm getting the result of each task after longer tasks have completed.
I created a local Client and use client.compute() to send tasks. Then I use future.result() to get the result of each task.
I'm using threads to ask for both results at the same time and measure how long each takes to compute, like this:
import time
import threading

import dask.dataframe as dd
from dask.distributed import Client

def get_result(future, i):
    t0 = time.time()
    print("calculating result", i)
    result = future.result()
    print("result {} took {}".format(i, time.time() - t0))

client = Client()
df = dd.read_csv(path_to_csv)
future1 = client.compute(df[df.x > 200])
future2 = client.compute(df[df.x > 500])
threading.Thread(target=get_result, args=[future1, 1]).start()
threading.Thread(target=get_result, args=[future2, 2]).start()
I expect the output of the above code to be something like:
calculating result 1
calculating result 2
result 2 took 10
result 1 took 46
Since the first task is larger.
But instead, I got both at the same time:
calculating result 1
calculating result 2
result 2 took 46.3046760559082
result 1 took 46.477620363235474
I assume that is because future2 actually computes in the background and finishes before future1, but it waits until future1 completes to return.
Is there a way I can get the result of future2 the moment it finishes?
You do not need to make threads to use futures in an asynchronous fashion - they are already inherently async, and monitor their status in the background. If you want to get results in the order they are ready, you should use as_completed.
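A minimal sketch of that approach (reusing future1 and future2 from your code; as_completed comes from dask.distributed):
from dask.distributed import as_completed

# Yields each future as soon as it finishes, regardless of submission order,
# so the faster filter's result prints without waiting for the slower one.
for future in as_completed([future1, future2]):
    print(future.result())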
However, for your specific situation, you may want to simply view the dashboard (or use df.visualize()) to understand the computation that is happening. Both futures depend on reading the CSV, and that one task will be required before either can run - it probably takes the vast majority of the time. Dask does not know, without scanning all of the data, which rows have what value of x.
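For example, a quick way to see that shared dependency (this assumes the optional graphviz dependency is installed; the filename is arbitrary):
import dask
import dask.dataframe as dd

df = dd.read_csv(path_to_csv)
out1 = df[df.x > 200]
out2 = df[df.x > 500]

# Render both task graphs together: both filters hang off the same read_csv tasks.
dask.visualize(out1, out2, filename="graph.png")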
I'd like to apply filter on my Flux based on a state calculated from previous values. However, it is recommended to avoid using state in operators according to the javadoc
Note that using state in the java.util.function / lambdas used within Flux operators should be avoided, as these may be shared between several Subscribers.
For example, Flux#distinct filters out items that appeared earlier. How can we implement our own version of distinct?
I have found an answer to my question. Flux#distinct can take a Supplier which provides the initial state and a BiPredicate which performs the "distinct" check, so we can keep arbitrary state in the supplied object and decide whether to keep each element.
Following code shows how to keep the first 3 elements of each mod2 group without changing the order.
import com.google.common.collect.ImmutableList;
import java.util.HashMap;
import reactor.core.publisher.Flux;
import reactor.test.StepVerifier;

// Get the first 3 elements per mod-2 group.
Flux<Integer> first3PerMod2 =
    Flux.fromIterable(ImmutableList.of(9, 3, 7, 4, 5, 10, 6, 8, 2, 1))
        .distinct(
            // Group by mod 2.
            num -> num % 2,
            // Counter storing how many elements have been seen per group.
            () -> new HashMap<Integer, Integer>(),
            // Increment this group's counter and keep the element while
            // its group's running count (including this one) is at most 3.
            (map, key) -> map.merge(key, 1, Integer::sum) <= 3,
            // Clean up the state on termination.
            map -> map.clear());

StepVerifier.create(first3PerMod2).expectNext(9, 3, 7, 4, 10, 6).verifyComplete();
I'm creating a library for creating data processing workflows using Reactor 3. Each task will have an input flux and an output flux. The input flux is provided by the user. The output flux is created by the library. Tasks can be chained to form a DAG. Something like this: (It's in Kotlin)
val base64 = task<String, String>("base64") {
input { Flux.just("a", "b", "c", "d", "e") }
outputFn { ... get the output values ... }
scriptFn { ... do some stuff ... }
}
val step2 = task<List<String>, String>("step2") {
input { base64.output.buffer(3) }
outputFn { ... }
scriptFn { ... }
}
I have the requirement to limit concurrency for the whole workflow. Only a configured number of inputs can be processed at once. In the example above for a limit of 3 this would mean task base64 would run with inputs "a", "b", and "c" first, then wait for each to complete before processing "d", "e" and the "step2" tasks.
How can I apply such limitations when creating output fluxes from input fluxes? Could a TopicProcessor somehow be applied? Maybe some sort of custom scheduler or processor? How would back-pressure work? Do I need to worry about creating a buffer?
Backpressure propagates from the final subscriber up, across the whole chain. But operators in the chain can ask for data in advance (prefetch) or even "rewrite" the request. For example, in the case of buffer(3), if that operator receives a request(1) it will perform a request(3) upstream ("1 buffer == max 3 elements, so I can request enough from my source to fill the 1 buffer I was asked for").
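Here is a small sketch that makes the request rewriting visible (the class name is arbitrary; log() sits upstream of buffer(3), so it logs the requests buffer makes to its source):
import java.util.List;
import org.reactivestreams.Subscription;
import reactor.core.publisher.BaseSubscriber;
import reactor.core.publisher.Flux;

public class PrefetchDemo {
    public static void main(String[] args) {
        Flux.range(1, 9)
            .log()     // should log a request(3) coming from buffer(3) below
            .buffer(3)
            .subscribe(new BaseSubscriber<List<Integer>>() {
                @Override
                protected void hookOnSubscribe(Subscription s) {
                    request(1);  // downstream asks for exactly one buffer
                }

                @Override
                protected void hookOnNext(List<Integer> buffer) {
                    System.out.println(buffer);  // prints [1, 2, 3]
                }
            });
    }
}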
If the input is always provided by the user, this will be hard to abstract away...
There is no easy way to rate limit sources across multiple pipelines or even multiple subscriptions to a given pipeline (a Flux).
Using a shared Scheduler in multiple publishOn will not work because publishOn selects a Worker thread and sticks to it.
However, if your question is more specifically about the base64 task being limited, maybe the effect can be obtained from flatMap's concurrency parameter?
input.flatMap(someString -> asyncProcess(someString), 3, 1);
This will let at most 3 occurrences of asyncProcess run at a time; each time one terminates, it starts a new one with the next value from input.
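For instance, a sketch under the assumption of a hypothetical asyncProcess that takes about a second per value (the class name is arbitrary too):
import java.time.Duration;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class ConcurrencyDemo {
    // Hypothetical async step: pretend each value takes about a second.
    static Mono<String> asyncProcess(String value) {
        return Mono.just(value.toUpperCase()).delayElement(Duration.ofSeconds(1));
    }

    public static void main(String[] args) {
        Flux.just("a", "b", "c", "d", "e")
            // At most 3 inner publishers subscribed at a time; prefetch 1
            // so values are pulled from the source only as slots free up.
            .flatMap(ConcurrencyDemo::asyncProcess, 3, 1)
            .doOnNext(System.out::println)
            .blockLast();
        // "a", "b", "c" start together; "d" starts only when one finishes,
        // so the whole flux completes in about 2 seconds instead of 1.
    }
}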
I'm trying to make a program that adds multiple nodes at random intervals between 0 and 3 seconds. Can you please explain why I need runAction or SKAction in the first place? And why can't I put the random function I made inside this block? Also, is there a way to convert the loop into a while loop so that I can break it more easily?
This is what I have right now:
let wait = Double(random(min: 0.0, max: 3.0))
runAction(SKAction.repeatActionForever(
    SKAction.sequence([
        SKAction.runBlock(addNode),
        SKAction.waitForDuration(wait)
    ])
))
I tried this, but it doesn't seem to work:
var wait = Double(random(min:0.0, max:3.0))
var x = true
while x == true
{
addNode()
SKAction.waitForDuration(wait)
wait = Double(random(min:0.0, max:3.0))
}
waitForDuration takes a range parameter that varies the wait by up to plus or minus half the range you specify, so if you specify a range of 2, the actual wait will differ from your duration by anywhere between -1 and +1 seconds.
E.g., with a duration of 5 seconds and a range of 2, the results might be:
Waits 4
Waits 5.5
Waits 4.47
Waits 4.93
Waits 5.99
To answer the specific question:
SKAction.waitForDuration(1.5, withRange: 3)
A duration of 1.5 with a range of 3 yields a wait anywhere from 0 to 3 seconds, matching the random interval you wanted.
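Putting it together, a minimal sketch using the same pre-Swift-3 API names as your question (the "spawn" key is arbitrary; it gives you a handle to stop the repetition):
runAction(SKAction.repeatActionForever(
    SKAction.sequence([
        SKAction.runBlock(addNode),
        // 1.5 ± 1.5 seconds: a fresh random wait between 0 and 3 seconds
        // on every repetition, no custom random helper needed.
        SKAction.waitForDuration(1.5, withRange: 3)
    ])
), withKey: "spawn")
When you want to "break the loop", remove the action by its key: removeActionForKey("spawn").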
We have a DataFlow job that is subscribed to a PubSub stream of events. We have applied sliding windows of 1 hour with a 10 minute period. In our code, we perform a Count.perElement to get the counts for each element and we then want to run this through a Top.of to get the top N elements.
At a high level:
1) Read from pubSub IO
2) Window.into(SlidingWindows.of(windowSize).every(period)) // windowSize = 1 hour, period = 10 mins
3) Count.perElement
4) Top.of(n, comparisonFunction)
What we're seeing is that the window is being applied twice, so data seems to be watermarked 1 hour 40 mins (instead of 50 mins) behind the current time. When we dig into the job graph on the Dataflow console, we see that there are two GroupByKey operations being performed on the data:
1) As part of Count.perElement. Watermark on the data from this step onwards is 50 minutes behind current time which is expected.
2) As part of the Top.of (in the Combine.PerKey). The watermark on this seems to be another 50 minutes behind the current time. Thus, data in the steps below this is watermarked 1 hour 40 mins behind.
This ultimately manifests in some downstream graphs being 50 minutes late.
Thus it seems like every time a GroupByKey is applied, windowing seems to kick in afresh.
Is this expected behavior? Is there any way we can make the windowing apply only to the Count.perElement and turn it off after that?
Our code is something on the lines of:
final int top = 50;
final Duration windowSize = standardMinutes(60);
final Duration windowPeriod = standardMinutes(10);
final SlidingWindows window = SlidingWindows.of(windowSize).every(windowPeriod);
options.setWorkerMachineType("n1-standard-16");
options.setWorkerDiskType("compute.googleapis.com/projects//zones//diskTypes/pd-ssd");
options.setJobName(applicationName);
options.setStreaming(true);
options.setRunner(DataflowPipelineRunner.class);
final Pipeline pipeline = Pipeline.create(options);
// Get events
final String eventTopic =
"projects/" + options.getProject() + "/topics/eventLog";
final PCollection<String> events = pipeline
.apply(PubsubIO.Read.topic(eventTopic));
// Create toplist
final PCollection<List<KV<String, Long>>> topList = events
.apply(Window.into(window))
.apply(Count.perElement()) //as eventIds are repeated
// get top n to get top events
.apply(Top.of(top, orderByValue()).withoutDefaults());
Windowing is not applied each time there is a GroupByKey. The lag you were seeing was likely the result of two issues, both of which should be resolved.
The first was that data that was buffered for later windows at the first group by key was preventing the watermark from advancing, which meant that the earlier windows were getting held up at the second group by key. This has been fixed in the latest versions of the SDK.
The second was that the sliding windows were causing the amount of data to increase significantly. A new optimization has been added which uses the combining operation (you mentioned Count and Top) to reduce the amount of data.