I understand that dask work well in batch mode like this
def load(filename):
...
def clean(data):
...
def analyze(sequence_of_data):
...
def store(result):
with open(..., 'w') as f:
f.write(result)
dsk = {'load-1': (load, 'myfile.a.data'),
'load-2': (load, 'myfile.b.data'),
'load-3': (load, 'myfile.c.data'),
'clean-1': (clean, 'load-1'),
'clean-2': (clean, 'load-2'),
'clean-3': (clean, 'load-3'),
'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
'store': (store, 'analyze')}
from dask.multiprocessing import get
get(dsk, 'store') # executes in parallel
Can we use dask to process streaming channel , where the number of chunks is unknown or even endless?
Can it perform the computation in an incremental way. for example can the 'analyze' step above could process ongoing chunks?
must we call the "get" operation only after all the data chunks are known , could we add new chunks after the "get" was called
Edit: see newer answer below
No
The current task scheduler within dask expects a single computational graph. It does not support dynamically adding to or removing from this graph. The scheduler is designed to evaluate large graphs in a small amount of memory; knowing the entire graph ahead of time is critical for this.
However, this doesn't stop one from creating other schedulers with different properties. One simple solution here is just to use a module like conncurrent.futures on a single machine or distributed on multiple machines.
Actually Yes
The distributed scheduler now operates fully asynchronously and you can submit tasks, wait on a few of them, submit more, cancel tasks, add/remove workers etc. all during computation. There are several ways to do this, but the simplest is probably the new concurrent.futures interface described briefly here:
http://dask.pydata.org/en/latest/futures.html
Related
I am confronted with a problem when submitting many jobs to a cluster, where each job is calculating some data and saving it (with many variables in terms of a .jld file) to some drive, for example like this
function f(savedir, pid, params)
...
save(savedir*"$(pid).jld",result)
end
After the calculation I need to process the data and load each .jld file to access the variables individually. Even though the final reduction is rather small, this takes a lot of time. I though about saving it all to one .jld file, but in that case I run into the problem that the file is potentially accessed at the same time, since the jobs run in parallel. Further I though about collecting the data in a out-of-core fashion using juliaDB, but in the end I do not see why this should be any better. I know that this could be solved with some database server, but this seems to be an overkill for my problem. How do you deal with this kind of problems?
Best,
v.
If the data is small simply use the IOBuffer mechanism and send it from workers to the master:
using Distributed, Serialization
addprocs(4)
#everywhere using Distributed, Serialization
rrs = #distributed (hcat) for i in 1:12
b=IOBuffer()
myres = (rand(), randn(), myid()) # emulates some big computations
# that you are running
serialize(b,myres)
b.data
end
And here is a sample code deserializing back the results:
julia> for i in 1:size(rrs,2)
res = deserialize(IOBuffer(#view rrs[:, i]))
println(res)
end
(0.8656737453513623, 1.0594978554855077, 2)
(0.6637467726391784, 0.35682413048990763, 2)
(0.32579653913039386, 0.2512902466296038, 2)
(0.3033490905926888, 1.7662416364260713, 3)
...
If your data is too big and your cluster is distributed than you need to use some other orchestration mechanism. One possible lightweight solution that I use sometimes is the following bunch of bash codes: https://github.com/pszufe/KissCluster This tool is a set of bash script built around the following bash command very useful for any file-based scenario:
nohup seq $start $end | xargs --max-args=1 --max-procs=$nproc julia run.jl &>> somelogfile.txt &
Nevertheless when possible consider using Julia's Distributed package.
In the context of the Dask distributed scheduler w/ a LocalCluster: Can somebody help me understand the dynamics of having a large (heap) mapping function?
For example, consider the Dask Data Frame ddf and the map_partitions operation:
def mapper():
resource=... #load some large resource eg 50MB
def inner(pdf):
return pdf.apply(lambda x: ..., axis=1)
return inner
mapper_fn = mapper() #50MB on heap
ddf.map_partitions(mapper_fn)
What happens here? Dask will serialize mapper_fn and send to all tasks? Say, I have n partitions so, n tasks.
Empirically, I've observed, that if I have 40 tasks and a 50MB mapper, then it takes about 70s taks to start working, the cluster seems to kinda sit there w/ full CPU, but the dashboard shows nothing. What is happening here? What are the consequences of having large (heap) functions in the dish distributed scheduler?
Dask serializes non-trivial functions with cloudpickle, and includes the serialized version of those functions in every task. This is highly inefficient. We recommend that you not do this, but instead pass data explicitly.
resource = ...
ddf.map_partitions(func, resource=resource)
This will be far more efficient.
I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
d = []
for lot in lots:
lot_data = data[data["LOTID"]==lot]
trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
d.append(trmat)
df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', "to"]), d)
Visualized graph of the operations
General rule: if your data comfortable fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release python's internal lock, the GIL, in which case although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processed; however using processes comes at a cost for moving data between client and processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing from the client.
Short story, you should do some experimentation, measure well, and read the data-frame and distributed scheduler documentation carefully.
I'm trying to read a 100000 data records of about 100kB each simultaneously from 50 disks, shuffling them, and writing it to 50 output disks at disk speed. What's a good way of doing that with Dask?
I've tried creating 50 queues and submitting 50 reader/writer functions using 100 workers (all on different machines, this is using Kubernetes). I ramp up first the writers, then the readers gradually. The scheduler gets stuck at 100% CPU at around 10 readers, and then gets timeouts when any more readers are added. So this approach isn't working.
Most dask operations have something like 1ms of overhead. As a result Dask is not well suited to be placed within innermost loops. Typically it is used at a coarser level, parallelizing across many Python functions, each of which is expected to take 100ms.
In a situation like yours I would push data onto a shared message system like Kafka, and then use Dask to pull off chunks of data when appropriate.
Data transfer
If your problem is in the bandwidth limitation of moving data through dask queues then you might consider turning your data into dask-reference-able futures before placing things into queues. See this section of the Queue docstring: http://dask.pydata.org/en/latest/futures.html#distributed.Queue
Elements of the Queue must be either Futures or msgpack-encodable data (ints, strings, lists, dicts). All data is sent through the scheduler so it is wise not to send large objects. To share large objects scatter the data and share the future instead.
So you probably want something like the following:
def f(queue):
client = get_client()
for fn in local_filenames():
data = read(fn)
future = client.scatter(data)
queue.put(future)
Shuffle
If you're just looking to shuffle data then you could read it with something like dask.bag or dask.dataframe
df = dd.read_parquet(...)
and then sort your data using the set_index method
df.set_index('my-column')
I am trying to implement a modified parallel depth-first search algorithm in Erlang (let's call it *dfs_mod*).
All I want to get is all the 'dead-end paths' which are basically the paths that are returned when *dfs_mod* visits a vertex without neighbours or a vertex with neighbours which were already visited. I save each path to ets_table1 if my custom function fun1(Path) returns true and to ets_table2 if fun1(Path) returns false(I need to filter the resulting 'dead-end' paths with some customer filter).
I have implemented a sequential version of this algorithm and for some strange reason it performs better than the parallel one.
The idea behind the parallel implementation is simple:
visit a Vertex from [Vertex|Other_vertices] = Unvisited_neighbours,
add this Vertex to the current path;
send {self(), wait} to the 'collector' process;
run *dfs_mod* for Unvisited_neighbours of the current Vertex in a new process;
continue running *dfs_mod* with the rest of the provided vertices (Other_vertices);
when there are no more vertices to visit - send {self(), done} to the collector process and terminate;
So, basically each time I visit a vertex with unvisited neighbours I spawn a new depth-first search process and then continue with the other vertices.
Right after spawning a first *dfs_mod* process I start to collect all {Pid, wait} and {Pid, done} messages (wait message is to keep the collector waiting for all the done messages). In N milliseconds after waiting the collector function returns ok.
For some reason, this parallel implementation runs from 8 to 160 seconds while the sequential version runs just 4 seconds (the testing was done on a fully-connected digraph with 5 vertices on a machine with Intel i5 processor).
Here are my thoughts on such a poor performance:
I pass the digraph Graph to each new process which runs *dfs_mod*. Maybe doing digraph:out_neighbours(Graph) against one digraph from many processes causes this slowness?
I accumulate the current path in a list and pass it to each new spawned *dfs_mod* process, maybe passing so many lists is the problem?
I use an ETS table to save a path each time I visit a new vertex and add it to the path. The ETS properties are ([bag, public,{write_concurrency, true}), but maybe I am doing something wrong?
each time I visit a new vertex and add it to the path, I check a path with a custom function fun1() (it basically checks if the path has vertices labeled with letter "n" occurring before vertices with "m" and returns true/false depending on the result). Maybe this fun1() slows things down?
I have tried to run *dfs_mod* without collecting done and wait messages, but htop shows a lot of Erlang activity for quite a long time after *dfs_mod* returns ok in the shell, so I do not think that the active message passing slows things down.
How can I make my parallel dfs_mod run faster than its sequential counterpart?
Edit: when I run the parallel *dfs_mod*, pman shows no processes at all, although htop shows that all 4 CPU threads are busy.
There is no quick way to know without the code, but here's a quick list of why this might fail:
You might be confusing parallelism and concurrency. Erlang's model is shared-nothing and aims for concurrency first (running distinct units of code independently). Parallelism is only an optimization of this (running some of the units of code at the same time). Usually, parallelism will take form at a higher level, say you want to run your sorting function on 50 different structures -- you then decide to run 50 of the sequential sort functions.
You might have synchronization problems or sequential bottlenecks, effectively changing your parallel solution into a sequential one.
The overhead of copying data, context switching and whatnot dwarfs the gains you have in terms of parallelism. This former is especially true of large data sets that you break into sub data sets, then join back into a large one. The latter is especially true of highly sequential code, as seen is the process ring benchmarks.
If I wanted to optimize this, I would try to reduce message passing and data copying to a minimum.
If I were the one working on this, I would keep the sequential version. It does what it says it should do, and when part of a larger system, as soon as you have more processes than core, parallelism will come from the many calls to the sort function rather than branches of the sort function. In the long run, if part of a server or service, using the sequential version N times should have no more negative impact than a parallel one that ends up creating many, many more processes to do the same task, and risk overloading the system more.