When Dask tasks run multiple times, which result is used? - dask

First, read this question:
Repeated task execution using the distributed Dask scheduler
Now, when Dask decides to rerun a task because of work stealing or a task failure (for example, as a result of per-process memory limits), which task result gets passed to the next node of the DAG? We are using nested tasks, e.g.
@dask.delayed
def add(n):
    return n + 1

t_a = add(1)
t_b = add(t_a)
the_output = add(add(add(t_b)))
So if one of these tasks fails, or gets stolen, and is run twice, which result gets passed to the next node in the DAG?
Further background for those interested:
The reason this has come up is that our task writes to a database. If it runs twice, we get an integrity error because it tries to insert the same record twice (there is a uniqueness constraint on the combination of id and version). The current plan is to make the task idempotent by catching the integrity error inside the task, but I still don't understand how Dask "chooses" a result.
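For reference, a minimal sketch of that idempotency plan, using sqlite3 purely as a stand-in for the real database; the table name, column names and helper are made up for illustration:

import sqlite3

conn = sqlite3.connect("results.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS results ("
    "id INTEGER, version INTEGER, payload TEXT, "
    "PRIMARY KEY (id, version))"
)

def store_result(record_id, version, payload):
    # A rerun of the task hits the (id, version) constraint; swallowing the
    # IntegrityError makes the write idempotent.
    try:
        with conn:
            conn.execute(
                "INSERT INTO results (id, version, payload) VALUES (?, ?, ?)",
                (record_id, version, payload),
            )
    except sqlite3.IntegrityError:
        pass  # the record was already written by an earlier run of this task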

If you have a situation like add(add(add(t_b)))
Or more generally
x = add(1)
y = add(x)
z = add(y)
Even though those all use the same function, they are all separate tasks. Dask sees that they have different inputs and so it treats them differently.
So if one of these tasks fails, or gets stolen, and is run twice, which result gets passed to the next node in the DAG?
In all of these cases, there is only one valid result on the cluster at once. A stolen task is only run on the new machine, not the old one. If the result of a task is lost and has to be rerun then only the new value will be present anywhere (the old value was lost, remember).
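To make the "separate tasks" point concrete, here is a small sketch (not from the original answer) showing that each call produces its own key in the task graph, which is what the scheduler actually tracks:

import dask

@dask.delayed
def add(n):
    return n + 1

x = add(1)
y = add(x)
z = add(y)

# Three calls to the same function produce three distinct keys, i.e. three tasks.
print(x.key)
print(y.key)
print(z.key)

# The graph behind z contains one entry per task.
print(list(z.__dask_graph__()))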

Related

How is exactly-once processing maintained during worker failures or bundle retries?

I have a pipeline running on Dataflow that ingests files containing several thousand records. These files arrive at a steady frequency and are processed by a stateful ParDo with timers that attempts to throttle the rate of ingest by batching and holding the files until the timer fires; they are then expanded into individual record elements by a file-processing ParDo and finally written to BigQuery destinations.
On occasion, after an intermittent event such as an OOM or an autoscaling event, I have seen Dataflow re-emit the files from the stateful ParDo once the event resolves, causing duplicate record elements downstream when the file-processing ParDo reprocesses the files. I understand that bundles are retried if there is a failure, but do the retries account for duplicates?
How is exactly-once processing achieved in this context, especially with regard to the State/Timer API, given that I am seeing duplicates at my destination?
Dataflow achieves exactly-once processing by ensuring that data produced by failing workers is not passed downstream (or, more precisely, if work is retried, only one successful result is consumed downstream). For example, if stage A of your pipeline is producing elements and stage B is counting them, and workers in stage A fail and are retried, duplicate elements will not be counted by stage B (though of course stage B might itself have to be retried). This also applies to state and timers: a given bundle of work is either committed in its entirety (i.e. the set of inputs is marked as consumed, and the set of outputs is committed atomically with the consumption/setting of state and timers) or entirely discarded (state and timers are left unconsumed/untouched and the retry will not be influenced by what happened before).
What is not exactly once is interactions with external systems (due to the possibility of retries). These are instead at least once, and so to guarantee correctness all such interactions should be idempotent. Sinks often achieve this by assigning a unique id such that multiple writes can be deduplicated in the downstream system. For files, one can write to temporary files, and then rename the "winning" set of shards to the final destination after a barrier. It's not clear from your question what files you're emitting (or ingesting) but hopefully this should be helpful in understanding how the system works.
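As a concrete illustration of "assigning a unique id such that multiple writes can be deduplicated in the downstream system", here is a minimal sketch with sqlite3 standing in for the sink; the deterministic id derived from the element is an assumption for the example, not Dataflow's actual mechanism:

import hashlib
import sqlite3

conn = sqlite3.connect("sink.db")
conn.execute("CREATE TABLE IF NOT EXISTS sink (dedup_id TEXT PRIMARY KEY, payload TEXT)")

def write_element(payload):
    # Derive a deterministic id from the element itself, so a retried bundle
    # produces the same id and the duplicate row is silently ignored.
    dedup_id = hashlib.sha256(payload.encode()).hexdigest()
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO sink (dedup_id, payload) VALUES (?, ?)",
            (dedup_id, payload),
        )

write_element("record-1")
write_element("record-1")  # a retried write is a no-op, so the sink stays effectively exactly-once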
More specifically, say the initial state is {state: A, timers: [X, Y], inputs: [i, j, k]}. Suppose further that when processing the bundle (these timers and inputs) the state is updated to B, we emit elements m, and n downstream, and we set a timer W.
If the bundle succeeds, the new state will be {state: B, timers: [W], inputs: []} and the elements [m, n] are guaranteed to be passed downstream. Furthermore, any competing retry of this bundle would always fail.
On the other hand, if the bundle fails (even if it "emitted" some of the elements or tried to update the state) the resulting state of the system will be {state: A, timers: [X, Y], inputs: [i, j, k]} for a fresh retry and nothing that was emitted from this failed bundle will be observed downstream.
Another way to look at it is that the set {inputs consumed, timers consumed, state modifications, timers set, outputs to produce downstream} is written to the backing "database" in a single transaction. Only a single successful attempt is ever committed, failed attempts are discarded.
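A toy model of that single-transaction commit, with sqlite3 standing in for the backing "database" (purely a conceptual illustration, not Dataflow's storage layer): every effect of the bundle either commits together or rolls back together.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
with conn:
    conn.execute("INSERT INTO kv VALUES ('state', 'A'), ('timers', 'X,Y'), ('inputs', 'i,j,k')")

def process_bundle(conn, fail):
    # Every effect of the bundle happens inside one transaction: update state,
    # consume inputs/timers, set the new timer, emit the outputs.
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE kv SET value = 'B' WHERE key = 'state'")
        conn.execute("UPDATE kv SET value = 'W' WHERE key = 'timers'")
        conn.execute("UPDATE kv SET value = '' WHERE key = 'inputs'")
        conn.execute("INSERT INTO kv VALUES ('outputs', 'm,n')")
        if fail:
            raise RuntimeError("worker died mid-bundle")

try:
    process_bundle(conn, fail=True)   # failed attempt: nothing is committed
except RuntimeError:
    pass
print(dict(conn.execute("SELECT key, value FROM kv")))  # still state A, timers X,Y, inputs i,j,k

process_bundle(conn, fail=False)      # successful attempt commits everything at once
print(dict(conn.execute("SELECT key, value FROM kv")))  # now state B, timer W, inputs consumed, outputs m,n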
More details can be found at https://beam.apache.org/documentation/runtime/model/

Asynchronous Xarray writing to Zarr

Hi all. I'm using a Dask Distributed cluster to write Dask-backed Xarray Datasets to Zarr inside a loop, and the dataset.to_zarr call is blocking. This can really slow things down when there are straggler chunks that block the continuation of the loop. Is there a way to do the .to_zarr asynchronously, so that the loop can continue with the next dataset write without being held up by a few straggler chunks?
With the distributed scheduler, you get async behaviour without any special effort. For example, if you are doing arr.to_zarr, then indeed you are going to wait for completion. However, you could do the following:
from dask.distributed import Client

client = Client(...)
out = arr.to_zarr(..., compute=False)
fut = client.compute(out)
This will return a future, fut, whose status reflects the current state of the whole computation, and you can choose whether to wait on it or to continue submitting new work. You could also display it to a progress bar (in the notebook) which will update asynchronously whenever the kernel is not busy.
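Applied to the loop in the question, that looks roughly like the sketch below; `datasets`, the store paths and the scheduler address are hypothetical stand-ins for whatever you are iterating over:

from dask.distributed import Client

client = Client("the_scheduler:8786")  # placeholder address

futures = []
for i, ds in enumerate(datasets):  # datasets: your Dask-backed xarray Datasets (hypothetical)
    delayed_write = ds.to_zarr(f"output_{i}.zarr", compute=False)
    futures.append(client.compute(delayed_write))
    # the loop moves on immediately; straggler chunks no longer block the next write

client.gather(futures)  # wait for every write to finish at the end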

Flink: how to control the execution sequence of task

I am trying to write a JUnit test which has two sources and two operators: the first operator is used to persist the data into state, then the second operator will read the state back out.
The weird thing I found is that the second operator always runs before the first one, which persists the data into the state.
So, my question is how to control the execution sequence of the task within the same flink job program?

DASK - Stopping workers during execution causes completed tasks to be launched twice

I want to use Dask to process some 5000 batch tasks that store their results in a relational database, and after they are all completed I want to run a final task that will query the database and generate a result file (which will be stored in AWS S3).
So it's more or less like this:
from dask import bag, delayed
from dask.distributed import Client

batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)

client = Client('the_scheduler:8786')
client.compute(graph)
And this works, but: Near the end of processing, many workers are idle and I would like to be able to turn them off (and save some money on AWS EC2), but if I do that, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.
I understand that this is actually a feature, not a bug, as Dask is trying to keep track of all the results before starting read_database_and_store_bundled_result_into_s3, but: Is there any way that I can tell dask to just orchestrate the distributed processing graph and not worry about state management?
I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.
from dask.distributed import Client, as_completed

client = Client('the_scheduler:8786')
futures = client.map(process_batch_and_store_results_in_database, my_batches())

seq = as_completed(futures)
del futures  # now the only reference to the futures is within seq

for future in seq:
    pass  # let each future be garbage collected once it completes
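Once the loop has drained, the batch results are already in the database, so the final step can be submitted on its own. A minimal sketch, assuming read_database_and_store_bundled_result_into_s3 can be called without the intermediate results:

# Submit the final task by itself instead of as a node that holds all 5000 results.
final = client.submit(read_database_and_store_bundled_result_into_s3)
final.result()  # block until the result file has been written to S3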

Parallel depth-first search in Erlang is slower than its sequential counterpart

I am trying to implement a modified parallel depth-first search algorithm in Erlang (let's call it *dfs_mod*).
All I want to get is all the 'dead-end paths', which are basically the paths that are returned when *dfs_mod* visits a vertex without neighbours or a vertex whose neighbours have already been visited. I save each path to ets_table1 if my custom function fun1(Path) returns true, and to ets_table2 if fun1(Path) returns false (I need to filter the resulting 'dead-end' paths with a custom filter).
I have implemented a sequential version of this algorithm and for some strange reason it performs better than the parallel one.
The idea behind the parallel implementation is simple:
visit a Vertex from [Vertex|Other_vertices] = Unvisited_neighbours,
add this Vertex to the current path;
send {self(), wait} to the 'collector' process;
run *dfs_mod* for Unvisited_neighbours of the current Vertex in a new process;
continue running *dfs_mod* with the rest of the provided vertices (Other_vertices);
when there are no more vertices to visit - send {self(), done} to the collector process and terminate;
So, basically each time I visit a vertex with unvisited neighbours I spawn a new depth-first search process and then continue with the other vertices.
Right after spawning the first *dfs_mod* process, I start to collect all {Pid, wait} and {Pid, done} messages (the wait message keeps the collector waiting for all the done messages). After N milliseconds of waiting, the collector function returns ok.
For some reason, this parallel implementation runs from 8 to 160 seconds while the sequential version runs just 4 seconds (the testing was done on a fully-connected digraph with 5 vertices on a machine with Intel i5 processor).
Here are my thoughts on such a poor performance:
I pass the digraph Graph to each new process which runs *dfs_mod*. Maybe doing digraph:out_neighbours(Graph) against one digraph from many processes causes this slowness?
I accumulate the current path in a list and pass it to each new spawned *dfs_mod* process, maybe passing so many lists is the problem?
I use an ETS table to save a path each time I visit a new vertex and add it to the path. The ETS properties are [bag, public, {write_concurrency, true}], but maybe I am doing something wrong?
each time I visit a new vertex and add it to the path, I check a path with a custom function fun1() (it basically checks if the path has vertices labeled with letter "n" occurring before vertices with "m" and returns true/false depending on the result). Maybe this fun1() slows things down?
I have tried to run *dfs_mod* without collecting done and wait messages, but htop shows a lot of Erlang activity for quite a long time after *dfs_mod* returns ok in the shell, so I do not think that the active message passing slows things down.
How can I make my parallel dfs_mod run faster than its sequential counterpart?
Edit: when I run the parallel *dfs_mod*, pman shows no processes at all, although htop shows that all 4 CPU threads are busy.
There is no quick way to know without the code, but here's a quick list of why this might fail:
You might be confusing parallelism and concurrency. Erlang's model is shared-nothing and aims for concurrency first (running distinct units of code independently). Parallelism is only an optimization of this (running some of the units of code at the same time). Usually, parallelism will take form at a higher level, say you want to run your sorting function on 50 different structures -- you then decide to run 50 of the sequential sort functions.
You might have synchronization problems or sequential bottlenecks, effectively changing your parallel solution into a sequential one.
The overhead of copying data, context switching and whatnot dwarfs the gains you get from parallelism. The former is especially true of large data sets that you break into sub-data-sets and then join back into a large one. The latter is especially true of highly sequential code, as seen in process ring benchmarks.
If I wanted to optimize this, I would try to reduce message passing and data copying to a minimum.
If I were the one working on this, I would keep the sequential version. It does what it says it should do, and when it is part of a larger system, as soon as you have more processes than cores, parallelism will come from the many calls to the sort function rather than from branches within the sort function. In the long run, if it is part of a server or service, using the sequential version N times should have no more negative impact than a parallel one that ends up creating many, many more processes to do the same task and risks overloading the system more.