GCD concurrent queue not starting tasks in FIFO order [duplicate] - ios

I have a class which contains two methods as per the example in Mastering Swift by Jon Hoffman. The class is as below:
class DoCalculation {
    func doCalc() {
        let x = 100
        let y = x * x
        _ = y / x
    }

    func performCalculation(_ iterations: Int, tag: String) {
        let start = CFAbsoluteTimeGetCurrent()
        for _ in 0..<iterations {
            self.doCalc()
        }
        let end = CFAbsoluteTimeGetCurrent()
        print("time for \(tag): \(end - start)")
    }
}
Now, in the viewDidLoad() of the ViewController from the single-view template, I create an instance of the above class and a concurrent queue. I then add blocks that call the performCalculation(_:tag:) method to the queue.
cqueue.async {
print("Starting async1")
calculation.performCalculation(10000000, tag: "async1")
}
cqueue.async {
print("Starting async2")
calculation.performCalculation(1000, tag: "async2")
}
cqueue.async {
print("Starting async3")
calculation.performCalculation(100000, tag: "async3")
}
Every time I run the application on the simulator, the start statements print in a different order. Example outputs are below:
Example 1:
Starting async1
Starting async3
Starting async2
time for async2: 4.1961669921875e-05
time for async3: 0.00238299369812012
time for async1: 0.117094993591309
Example 2:
Starting async3
Starting async2
Starting async1
time for async2: 2.80141830444336e-05
time for async3: 0.00216799974441528
time for async1: 0.114436984062195
Example 3:
Starting async1
Starting async3
Starting async2
time for async2: 1.60336494445801e-05
time for async3: 0.00220298767089844
time for async1: 0.129496037960052
I don't understand why the blocks don't start in FIFO order. Can somebody please explain what I am missing here?
I know they will be executed concurrently, but it's stated that a concurrent queue respects FIFO order when starting tasks and only makes no guarantee about which one completes first. So at least the start statements should have printed as
Starting async1
Starting async3
Starting async2
and the completion statements could be in any order:
time for async2: 4.1961669921875e-05
time for async3: 0.00238299369812012
time for async1: 0.117094993591309

A concurrent queue runs the jobs you submit to it concurrently. That's what it's for.
If you want a queue that runs jobs one at a time in FIFO order, you want a serial queue.
I see what you're saying about the docs claiming that the jobs will be submitted in FIFO order, but your test doesn't really establish the order in which they're run. If the concurrent queue has 2 threads available but only one processor to run those threads on, it might swap out one of the threads before it gets a chance to print, run the other job for a while, and then go back to running the first job. There's no guarantee that a job runs to the end before getting swapped out.
I don't think a print statement gives you reliable information about the order in which the jobs are started.

cqueue is a concurrent queue, which dispatches your blocks of work to up to three different threads (depending on thread availability) at almost the same time, but you cannot control when each thread completes its work.
If you want to perform tasks serially on a background queue, you are much better off using a serial queue.
let serialQueue = DispatchQueue(label: "serialQueue")
A serial queue starts the next task in the queue only when the previous task has completed.
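The serial-queue guarantee can be sketched as follows. This is a minimal illustration, assuming a command-line Swift program; Recorder and runSerially are hypothetical helpers introduced here, not part of GCD. The tasks record the order in which they actually ran, and on a serial queue that order matches the submission order.

```swift
import Foundation

// Hypothetical helper: a lock-protected list so tasks can record when they run.
final class Recorder: @unchecked Sendable {
    private let lock = NSLock()
    private var values: [Int] = []

    func append(_ value: Int) {
        lock.lock()
        values.append(value)
        lock.unlock()
    }

    var snapshot: [Int] {
        lock.lock()
        defer { lock.unlock() }
        return values
    }
}

// Dispatches `count` tasks to a serial queue and returns the order in which they ran.
func runSerially(count: Int) -> [Int] {
    let serialQueue = DispatchQueue(label: "serialQueue")
    let group = DispatchGroup()
    let recorder = Recorder()

    for i in 1...count {
        serialQueue.async(group: group) {
            recorder.append(i)
        }
    }

    group.wait()   // block until every task has finished
    return recorder.snapshot
}

print(runSerially(count: 3))   // [1, 2, 3]
```

Swapping in a queue created with attributes: .concurrent would make the recorded order unpredictable, which is the behavior described in the question.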

"I don't understand why the blocks don't start in FIFO order" How do you know they don't? They do start in FIFO order!
The problem is that you have no way to test that. The notion of testing it is, in fact, incoherent. The soonest you can test anything is the first line of each block — and by that time, it is perfectly legal for another line of code from another block to execute, because these blocks are asynchronous. That is what asynchronous means.
So, they start in FIFO order, but there is no guarantee about the order in which, given multiple asynchronous blocks, their first lines will be executed.

With a concurrent queue, you are effectively specifying that they can run at the same time. So while they’re added in FIFO manner, you have a race condition between these various worker threads, and thus you have no assurance which will hit its respective print statement first.
So, this raises the question: Why do you care which order they hit their respective print statements? If order is really important, you shouldn't be using a concurrent queue. Or, to put it the other way, if you want to use a concurrent queue, write code that isn't dependent upon the order in which the tasks run.
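One way to write order-independent code is to key each result by the index of the task that produced it rather than by arrival time. The sketch below assumes a command-line Swift program; ResultStore, squaresConcurrently, and the queue label are hypothetical names used only for illustration.

```swift
import Foundation

// Hypothetical helper: stores each result in the slot of the task that produced it.
final class ResultStore: @unchecked Sendable {
    private let lock = NSLock()
    private var slots: [Int]

    init(count: Int) {
        slots = Array(repeating: 0, count: count)
    }

    func set(_ value: Int, at index: Int) {
        lock.lock()
        slots[index] = value
        lock.unlock()
    }

    var snapshot: [Int] {
        lock.lock()
        defer { lock.unlock() }
        return slots
    }
}

// Runs `count` tasks on a concurrent queue; the returned array is in
// submission order no matter which task happened to run first.
func squaresConcurrently(count: Int) -> [Int] {
    let queue = DispatchQueue(label: "example.concurrent", attributes: .concurrent)
    let group = DispatchGroup()
    let store = ResultStore(count: count)

    for i in 0..<count {
        queue.async(group: group) {
            store.set(i * i, at: i)   // keyed by index, not by arrival time
        }
    }

    group.wait()   // block until every task has finished
    return store.snapshot
}

print(squaresConcurrently(count: 5))   // [0, 1, 4, 9, 16]
```

The tasks may still race with each other, but the race no longer affects the result, so nothing depends on the FIFO start order.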
You asked:
Would you suggest some way to get the info when a Task is dequeued from the queue so that I can log it to get the FIFO order.
If you're asking how to rely on FIFO starting of tasks on a concurrent queue in a real-world app, the answer is "you don't", because of the aforementioned race condition. When using concurrent queues, never write code that is strictly dependent upon the FIFO behavior.
If you're asking how to verify this empirically for purely theoretical purposes, just do something that ties up the CPUs and frees them up one by one:
// utility function to spin for a certain amount of time
func spin(for seconds: TimeInterval, message: String) {
    let start = CACurrentMediaTime()
    while CACurrentMediaTime() - start < seconds { }
    os_log("%@", message)
}

// my concurrent queue
let queue = DispatchQueue(label: label, attributes: .concurrent)

// just something to occupy the CPUs, with varying
// lengths of time; don’t worry about these re FIFO behavior
for i in 0 ..< 20 {
    queue.async {
        spin(for: 2 + Double(i) / 2, message: "\(i)")
    }
}

// Now, add three tasks on concurrent queue, demonstrating FIFO
queue.async {
    os_log(" 1 start")
    spin(for: 2, message: " 1 stop")
}
queue.async {
    os_log(" 2 start")
    spin(for: 2, message: " 2 stop")
}
queue.async {
    os_log(" 3 start")
    spin(for: 2, message: " 3 stop")
}
You'll be able to see those last three tasks are run in FIFO order.
The other approach, if you want to confirm precisely what GCD is doing, is to refer to the libdispatch source code. It's admittedly pretty dense code, so it's not exactly obvious, but it's something you can dig into if you're feeling ambitious.

Related

Is there an equivalent to Akka Streams' `conflate` and/or `batch` operators in Reactor?

I am looking for an equivalent of the batch and conflate operators from Akka Streams in Project Reactor, or some combination of operators that mimic their behavior.
The idea is to aggregate upstream items when the downstream backpressures in a reduce-like manner.
Note that this is different from this question because the throttleLatest / conflate operator described there is different from the one in Akka Streams.
Some background regarding what I need this for:
I am watching a change stream on a MongoDB and for every change I run an aggregate query on the MongoDB to update some metric. When lots of changes come in, the queries can't keep up and I'm getting errors. As I only need the latest value of the aggregate query, it is fine to aggregate multiple change events and run the aggregate query less often, but I want the metric to be as up-to-date as possible so I want to avoid waiting a fixed amount of time when there is no backpressure.
The closest I could come so far is this:
changeStream
.window(Duration.ofSeconds(1))
.concatMap { it.reduce(setOf<String>(), { applicationNames, event -> applicationNames + event.body.sourceReference.applicationName }) }
.concatMap { Flux.fromIterable(it) }
.concatMap { taskRepository.findTaskCountForApplication(it) }
but this would always wait for 1 second regardless of backpressure.
What I would like is something like this:
changeStream
.conflateWithSeed({setOf(it.body.sourceReference.applicationName)}, {applicationNames, event -> applicationNames + event.body.sourceReference.applicationName})
.concatMap { Flux.fromIterable(it) }
.concatMap { taskRepository.findTaskCountForApplication(it) }
I assume you always run only one query at a time, with no parallel execution. My idea is to buffer elements in a list (which can be easily aggregated) for as long as the query is running. As soon as the query finishes, the next list is processed.
I tested it with the following code:
// AtomicBoolean (java.util.concurrent.atomic) is used because a plain local
// boolean cannot be reassigned from inside a lambda.
AtomicBoolean isQueryRunning = new AtomicBoolean(false);
Flux.range(0, 1000000)
    .delayElements(Duration.ofMillis(10))
    .bufferUntil(aLong -> !isQueryRunning.get())
    .doOnNext(integers -> isQueryRunning.set(true))
    .concatMap(integers -> Mono.fromCallable(() -> {
            int sleepTime = new Random().nextInt(10000);
            System.out.println("processing " + integers.size() + " elements. Sleep time: " + sleepTime);
            Thread.sleep(sleepTime);
            return "";
        })
        .subscribeOn(Schedulers.elastic())
    ).doOnNext(s -> isQueryRunning.set(false))
    .subscribe();
Which prints
processing 1 elements. Sleep time: 4585
processing 402 elements. Sleep time: 2466
processing 223 elements. Sleep time: 2613
processing 236 elements. Sleep time: 5172
processing 465 elements. Sleep time: 8682
processing 787 elements. Sleep time: 6780
It's clearly visible that the size of the next batch is proportional to the previous query's execution time (sleep time).
Note that this is not a "real" backpressure solution, just a workaround. It's also not suited for parallel execution, and it might require some tuning to prevent running queries for empty batches.

How to get results of tasks when they finish and not after all have finished in Dask?

I have a dask dataframe and want to compute some tasks that are independent. Some tasks are faster than others but I'm getting the result of each task after longer tasks have completed.
I created a local Client and use client.compute() to send tasks. Then I use future.result() to get the result of each task.
I'm using threads to ask for results at the same time and measure the time for each result to compute like this:
import threading
import time

import dask.dataframe as dd
from dask.distributed import Client

def get_result(future, i):
    t0 = time.time()
    print("calculating result", i)
    result = future.result()
    print("result {} took {}".format(i, time.time() - t0))

client = Client()
df = dd.read_csv(path_to_csv)
future1 = client.compute(df[df.x > 200])
future2 = client.compute(df[df.x > 500])
threading.Thread(target=get_result, args=[future1, 1]).start()
threading.Thread(target=get_result, args=[future2, 2]).start()
I expect the output of the above code to be something like:
calculating result 1
calculating result 2
result 2 took 10
result 1 took 46
Since the first task is larger.
But instead I got both at the same time
calculating result 1
calculating result 2
result 2 took 46.3046760559082
result 1 took 46.477620363235474
I assume that is because future2 actually computes in the background and finishes before future1, but waits until future1 is completed before returning.
Is there a way I can get the result of future2 at the moment it finishes?
You do not need to make threads to use futures in an asynchronous fashion - they are already inherently async, and monitor their status in the background. If you want to get results in the order they are ready, you should use as_completed.
However, for your specific situation, you may want to simply view the dashboard (or use df.visualize()) to understand the computation that is happening. Both futures depend on reading the CSV, and this one task must complete before either can run; it probably takes the vast majority of the time. Dask does not know, without scanning all of the data, which rows have what value of x.

async operation concurrency understanding

These queues manage the tasks you provide to GCD and execute those tasks in FIFO order. This guarantees that first task added to the queue is the first task started in the queue, the second task added will be the second to start, and so on down the line.
The code below
let anotherQueue = DispatchQueue(label: "com.gcdTest.Queue", qos: .userInteractive)
anotherQueue.async {
    anotherQueue.async {
        anotherQueue.async {
            anotherQueue.async {
                print("task 6")
                for _ in 1...300 { }
            }
        }
        print("task 3")
        for _ in 301...600 {}
    }
    anotherQueue.async {
        anotherQueue.async {
            print("task 5")
            for _ in 700...900 {}
        }
        print("task 4")
        for _ in 5000...7000 {}
    }
    print("task 1")
    for _ in 9000...10000 {}
}
anotherQueue.async {
    print("task 2")
    for _ in 1...1000 {}
}
produces output
task 1
task 2
task 3
task 4
task 5
task 6
But when we run the same code on a concurrent queue, it produces unpredictable output.
For example, change the first line of the code to the line below:
let anotherQueue = DispatchQueue(label: "com.gcdTest.Queue", qos: .userInteractive, attributes: .concurrent)
output
task 3
task 2
task 1
task 4
task 5
task 6
The definition states:
Tasks in concurrent queues are guaranteed to start in the order they were added…and that’s about all you’re guaranteed!
So I expected output similar to what the serial queue (the default) produces: task 1, task 2, task 3, task 4, task 5, task 6.
Can anyone please help me understand where I am going wrong?
Bottom line, GCD will always start the tasks on a queue in the order that they were dispatched to that queue. In the case of a serial queue, that means that they will run sequentially, in that order, and this behavior is easily observable.
In the case of a concurrent queue, however, while it will start the tasks in the queued order, for tasks that are dispatched quickly in succession, they may all start quickly in succession, too, running concurrently with each other. In short, they may start running at nearly the same time, and you therefore have no assurances which will encounter its respective print statement first. Just because the concurrent queue started one task a few milliseconds after another, that provides no assurances regarding the order that those two tasks encounter their respective print statements.
In short, instead of deterministic behavior for the sequence of the print statements, you have a simple race with non-deterministic behavior.
As an aside, while it's clear that your example introduces races when employed on a concurrent queue, it should be noted that because of your nested dispatch statements, you'll have race conditions on your serial queue, too. It looks like the sequence of behavior is entirely predictable on serial queue, but it's not.
Let's consider a simplified version of your example. I'm assuming that we'll start this from the main thread:
queue.async {
    queue.async {
        print("task 3")
    }
    print("task 1")
}
queue.async {
    print("task 2")
}
Clearly, task 1 will be added to the queue first and if that queue is free, it will start immediately on that background thread, while the main thread proceeds. But as the code on the main thread approaches the dispatching of task 2, task 1 will start and will proceed to dispatch task 3. You have a classic race between the dispatching of task 2 and task 3.
Now, in practice, you'll see task 2 dispatched before task 3, but it doesn't take much of a delay to introduce non-deterministic behavior. For example, on my computer, inserting a Thread.sleep(forTimeInterval: 0.00005) before dispatching task 2 was enough to manifest the non-deterministic behavior. But even without delays (or for loops of a certain number of iterations), the behavior is technically non-deterministic.
But we can create a simple example that eliminates the races implicit in the above examples, yet still illustrates the difference between serial and concurrent queue behavior that you were originally asking about:
for i in 0 ..< 10 {
queue.async { [i] in
print(i)
}
}
This is guaranteed to print in order on serial queue, but not necessarily so on a concurrent queue.

Concurrent Queue Issue - iOS/Swift

In my program I need two tasks to run simultaneously in the background. To do that, I have used a concurrent queue as below:
let concurrentQueue = DispatchQueue(label: "concurrentQueue", qos: .utility, attributes: .concurrent)
concurrentQueue.async {
for i in 0 ..< 10{
print(i)
}
}
concurrentQueue.async {
for i in (0 ..< 10).reversed(){
print(i)
}
}
Here I need the output like this,
0
9
1
8
2
7
3
6
4
5
5
4
6
3
7
2
8
1
9
0
But what I get is,
I referred below tutorial in order to have some basic knowledge about Concurrent Queues in Swift 3
https://www.appcoda.com/grand-central-dispatch/
Can someone tell me what is wrong with my code? Or is this the result I should expect? Is there any other way to get this done? Any help would be highly appreciated.
There is nothing wrong with your code sample. That is the correct syntax for submitting two tasks to a concurrent queue.
The problem is the expectation that you'd necessarily see them run concurrently. There are two issues that could affect this:
One issue is that the first dispatched task can run so quickly that it just happens to finish before the second task gets going. If you slow them down a bit, you'll see your concurrent behavior:
let concurrentQueue = DispatchQueue(label: Bundle.main.bundleIdentifier! + ".concurrentQueue", qos: .utility, attributes: .concurrent)
concurrentQueue.async {
for i in 0 ..< 10 {
print("forward: ", i)
Thread.sleep(forTimeInterval: 0.1)
}
}
concurrentQueue.async {
for i in (0 ..< 10).reversed() {
print("reversed:", i)
Thread.sleep(forTimeInterval: 0.1)
}
}
You'd never sleep in production code, but for pedagogical purposes, it can better illustrate the issue.
You can also omit the sleep calls, and just increase the numbers dramatically (e.g. 1_000 or 10_000), and you might start to see concurrent processing taking place.
The other issue is that your device could be resource constrained, preventing it from running the code concurrently. Devices have a limited number of CPU cores to run concurrent tasks. Just because you submitted the tasks to a concurrent queue, it doesn't mean the device is capable of running the two tasks at the same time. It depends upon the hardware and what else is running on that device.
By the way, note that you might see different behavior on the simulator (which is using your Mac's CPU, which could be running many other tasks) than on a device. You might want to make sure to test this behavior on an actual device, if you're not already.
Also note that you say you "need" the output to alternate print statements between the two queues. While the screen snapshots from your tutorial suggest that this should be expected, you have absolutely no assurances that this will be the case.
If you really need them to alternate back and forth, you have to add some mechanism to coordinate them. You can use semaphores (which I'm reluctant to suggest simply because they're such a common source of problems, especially for new developers) or operation queues with dependencies.
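The operation-queue option mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: Log and runAlternating are hypothetical helpers, and each print from the question is modeled as its own BlockOperation that depends on the previous one.

```swift
import Foundation

// Hypothetical helper: a lock-protected log of the lines in the order they ran.
final class Log: @unchecked Sendable {
    private let lock = NSLock()
    private var lines: [String] = []

    func append(_ line: String) {
        lock.lock()
        lines.append(line)
        lock.unlock()
    }

    var snapshot: [String] {
        lock.lock()
        defer { lock.unlock() }
        return lines
    }
}

// Builds one operation per step and chains them with dependencies so that
// the "forward" and "reversed" steps strictly alternate.
func runAlternating(count: Int) -> [String] {
    let queue = OperationQueue()
    let log = Log()
    var operations: [Operation] = []

    for i in 0..<count {
        for label in ["forward: \(i)", "reversed: \(count - 1 - i)"] {
            let operation = BlockOperation { log.append(label) }
            if let previous = operations.last {
                operation.addDependency(previous)   // wait for the prior step
            }
            operations.append(operation)
        }
    }

    queue.addOperations(operations, waitUntilFinished: true)
    return log.snapshot
}

print(runAlternating(count: 2))
// ["forward: 0", "reversed: 1", "forward: 1", "reversed: 0"]
```

Note the trade-off: chaining every step this way makes the order deterministic precisely because the steps no longer run concurrently with each other.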
Maybe you could try using semaphores.
let semaphore1 = DispatchSemaphore(value: 1)
let semaphore2 = DispatchSemaphore(value: 0)
concurrentQueue.async {
for i in 0 ..< 10{
semaphore1.wait()
print(i)
semaphore2.signal()
}
}
concurrentQueue.async {
for i in (0 ..< 10).reversed(){
semaphore2.wait()
print(i)
semaphore1.signal()
}
}

How to find out the PID of Flink's execution process?

I want to measure Flink's performance with performance counters (perf). My code:
var text = env.readTextFile("<filename>")
var counts = text.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1)
counts.writeAsText("<filename_result>", WriteMode.OVERWRITE)
env.execute()
I know the PID of the JobManager. I can also see the TID of the thread (CHAIN DataSource) that runs the execute() command during execution. But the TID changes on each execution, so I can't rely on it. Is there a way to figure out the PID of the JobManager's child process that runs the execute() command? And are there different child processes for every transformation (e.g. flatMap) of the rdd? If so, is it possible to find out their distinct PIDs?
The individual operators are not executed in distinct processes. The JobManager and the TaskManagers are started as Java processes. The TaskManager then runs a set of parallel tasks (corresponding to the operators). Each parallel task is executed in its own thread. When you start Flink, then the system will create files /tmp/your-name-taskmanager.pid and /tmp/your-name-jobmanager.pid which contain the PID of the processes.
