Streamz with Dask Distributed

Based on the streamz documentation, one could leverage a dask distributed cluster in the following way:
from distributed import Client
client = Client('tcp://localhost:8786') # Connect to scheduler that has distributed workers
from streamz import Stream
source = Stream()
(source.scatter()       # scatter local elements to cluster, creating a DaskStream
       .map(increment)  # map a function remotely
       .buffer(5)       # allow five futures to stay on the cluster at any time
       .gather()        # bring results back to local process
       .sink(write))    # call write locally

for x in range(10):
    source.emit(x)
Conceptually, it isn't clear why we don't have to pass the dask distributed client in as a parameter to instantiate Stream(). More specifically, how does Stream() know what scheduler to attach to?
What would you do if you had two schedulers that have workers on unrelated nodes like:
from distributed import Client
client_1 = Client('tcp://1.2.3.4:8786')
client_2 = Client('tcp://10.20.30.40:8786')
How does one create two streams for client_1 and client_2, respectively?

The basic rule in Dask is: if there is a distributed client defined, use it for any Dask computations. If there is more than one distributed client, use the most recently created one that is still alive.
Streamz does not explicitly let you choose which client to use when you .scatter(); it uses dask.distributed.default_client() to pick one. You may wish to raise an issue with them to allow a client= keyword. The workflow doesn't even fit a context-based approach. For now, if you wanted multiple simultaneous streamz pipelines working with data on different Dask clusters, you would probably have to manipulate the state of dask.distributed.client._global_clients.
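To make that rule concrete, here is a rough sketch of the two-scheduler scenario from the question; the default_client() behaviour shown is what streamz relies on, and closing a client is only one crude way to steer subsequent work:
from distributed import Client
from distributed.client import default_client  # what streamz consults internally

client_1 = Client('tcp://1.2.3.4:8786')
client_2 = Client('tcp://10.20.30.40:8786')

# The most recently created client that is still alive wins, so a stream
# built now would scatter its elements to client_2's cluster.
assert default_client() is client_2

# Closing client_2 leaves client_1 as the default again; anything finer
# grained means poking at dask.distributed.client._global_clients (internal API).
client_2.close()
assert default_client() is client_1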

Related

Running functions that are themselves multithreaded on a dask cluster

I have a function that uses image-processing routines which are themselves multithreaded. I distribute many of those function calls on a dask cluster.
First, I started a scheduler on a host: dask-scheduler. Then I started the workers: dask-worker --nthreads 1 --memory-limit 0.9 tcp://scheduler:8786.
The python code looks similar to this:
import SimpleITK as sitk

def func(filename):
    sitk.ProcessObject.SetGlobalDefaultNumberOfThreads(4)  # limit to four threads
    img = sitk.ReadImage(filename)
    # Do more stuff and store resulting image
    # SimpleITK is already multithreaded
    return 'someresult'

# [...]
from distributed import Client

client = Client('tcp://scheduler:8786')
futures = client.map(func, ['somefile', 'someotherfile'])
for result in client.gather(futures):
    print(result)
Right now, I have to set the number of threads for each worker to one, in order not to overcommit the CPU on the worker node. But in some cases it makes sense to limit the number of cores used by SimpleITK, because the gain is not so high. Instead, I could run multiple function calls in parallel on the same host.
But in that case I would have to calculate all the core usages by hand.
Ideally, I would like to set an arbitrary number of cores each function can use and dask should decide how many parallel functions invocations are started on each node, given the number of available threads. I.e. is it possible to specify the number of threads a function will use?
No, Dask is not able to limit the number of threads spawned by some function, and it doesn't attempt to measure this either.
The only thing I can think you might want to do is use Dask's abstract resources, where you control how much of each labelled quantity is available per worker and how much each task needs to run.
futures = client.map(func, ['somefile', 'someotherfile'], resources=...)
I don't see an obvious way to assign resources to workers using Cluster() (i.e., the default LocalCluster); you may need to use the CLI.
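As a hedged sketch of the abstract-resources approach (the label "threads" and all the numbers below are made up for illustration; Dask only does the bookkeeping, nothing enforces actual CPU usage):
# On a 16-core node, start the worker advertising an abstract "threads" pool
# and enough dask threads to run several tasks at once:
#
#   dask-worker tcp://scheduler:8786 --nthreads 4 --resources "threads=16"

from distributed import Client

client = Client('tcp://scheduler:8786')

# Each func call declares that it consumes 4 units of "threads", so at most
# four such tasks run concurrently on a 16-unit worker.
futures = client.map(func, ['somefile', 'someotherfile'],
                     resources={'threads': 4})
results = client.gather(futures)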

Forcing Locality on Dask Dataframe Subsets

I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this.
All the dask-distributed examples/docs I see populate the initial data load from a network resource (hdfs, s3, etc.) and do not appear to extend DAG optimization to the load portion (they seem to assume that a network load is a necessary evil and just eat the initial cost). This is underscored in the answer to another question: Does Dask communicate with HDFS to optimize for data locality?
However, I can see cases where we would want this. For example, if we have a sharded database plus dask workers co-located on the nodes of this DB, we would want to force records from only the local shard to be populated into the local dask workers. From the documentation/examples, network criss-cross seems like a necessarily assumed cost. Is it possible to force parts of a single dataframe to be obtained from specific workers?
The alternative, which I've tried, is to try and force each worker to run a function (iteratively submitted to each worker) where the function loads only the data local to that machine/shard. This works, and I have a bunch of optimally local dataframes with the same column schema -- however -- now I don't have a single dataframe but n dataframes. Is it possible to merge/fuse dataframes across multiple machines so there is a single dataframe reference, but portions have affinity (within reason, as decided by the task DAG) to specific machines?
You can produce dask "collections" such as a dataframe from futures and delayed objects, which inter-operate nicely with each other.
For each partition, where you know which machine should load it, you can produce a future as follows:
f = c.submit(make_part_function, args, workers={'my.worker.ip'})
where c is the dask client and the address is the machine you'd want to see it happen on. You can also give allow_other_workers=True if this is a preference rather than a requirement.
To make a dataframe, from a list of such futures, you could do
df = dd.from_delayed([dask.delayed(f) for f in futures])
and ideally provide a meta=, giving a description of the expected dataframe. Now, further operations on a given partition will prefer to be scheduled on the same worker which already holds the data.
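For example, a minimal sketch of passing meta= (the column names and dtypes here are placeholders for whatever your partitions actually contain):
import pandas as pd
import dask
import dask.dataframe as dd

# Placeholder schema: an empty pandas DataFrame with the expected columns/dtypes
meta = pd.DataFrame({'id': pd.Series(dtype='int64'),
                     'value': pd.Series(dtype='float64')})

# futures is the list of per-partition futures created with c.submit above
df = dd.from_delayed([dask.delayed(f) for f in futures], meta=meta)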
I am also interested in having the capability to restrict computation to a specific node (and data localized to that node). I have tried to implement the above with a simple script (see below), but looking at the resulting data frame I get the error (from dask/dataframe/utils.py::check_meta()):
ValueError: Metadata mismatch found in `from_delayed`.
Expected partition of type `DataFrame` but got `DataFrame`
Example:
from dask.distributed import Client
import dask.dataframe as dd
import dask
client = Client(address='<scheduler_ip>:8786')
client.restart()
filename_1 = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
filename_2 = 'http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv'
future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
client.has_what()
# Returns: {'tcp://<w1_ip>:41942': ('read_csv-c08b231bb22718946756cf46b2e0f5a1',),
# 'tcp://<w2_ip>:41942': ('read_csv-e27881faa0f641e3550a8d28f8d0e11d',)}
df = dd.from_delayed([dask.delayed(f) for f in [future_1, future_2]])
type(df)
# Returns: dask.dataframe.core.DataFrame
df.head()
# Returns:
# ValueError: Metadata mismatch found in `from_delayed`.
# Expected partition of type `DataFrame` but got `DataFrame`
Note: The dask environment has two worker nodes (aliased to w1 and w2) and a scheduler node; the script is running on an external host.
dask==1.2.2, distributed==1.28.1
It is odd to call many dask dataframe functions in parallel. Perhaps you meant to call many Pandas read_csv calls in parallel instead?
# future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
# future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
import pandas
future_1 = client.submit(pandas.read_csv, filename_1, workers='w1')
future_2 = client.submit(pandas.read_csv, filename_2, workers='w2')
See https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more information
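Putting the locality pinning and the pandas-per-partition advice together, a sketch of the corrected script might look like this (worker aliases and sample URLs are taken from the question; note that in practice all partitions must share the same columns, which the two sample CSVs here do not):
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client(address='<scheduler_ip>:8786')

filename_1 = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
filename_2 = 'http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv'

# Each partition is a plain pandas DataFrame, built on the worker that should own it
future_1 = client.submit(pd.read_csv, filename_1, workers='w1')
future_2 = client.submit(pd.read_csv, filename_2, workers='w2')

# Stitch the per-worker partitions into one dask dataframe; a meta= argument
# (as sketched earlier) avoids any schema guessing
df = dd.from_delayed([dask.delayed(f) for f in [future_1, future_2]])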

Dask distributed perform computations without returning data

I have a dynamic Dask Kubernetes cluster.
I want to load 35 parquet files (about 1.2 GB) from Gcloud storage into a Dask Dataframe, process it with apply(), and then save the result back to Gcloud as parquet.
During loading of the files from Gcloud storage, cluster memory usage increases to about 3-4 GB. Then workers (each with 2 GB of RAM) are terminated/restarted and some tasks get lost,
so the cluster starts computing the same things in a circle.
I removed the apply() operation and left only read_parquet() to test
whether my custom code was causing the trouble, but the problem was the same, even with just a single read_parquet() operation. This is the code:
from dask.distributed import Client, get_client
import dask.dataframe as dd

client = Client('<ip>:8786')
client.restart()

def command():
    client = get_client()
    df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet',
                         storage_options={'token': 'cloud'},
                         engine='fastparquet')
    df = df.compute()

x = client.submit(command)
x.result()
Note: I'm submitting a single command function to run all necessary commands to avoid problems with gcsfs authentication inside a cluster
After some investigation, I understood that the problem could be in .compute(), which returns all the data to a single process; but that process (my command function) is itself running on a worker. Because of that, the worker doesn't have enough RAM, crashes, and loses all its computed tasks, which triggers the tasks to re-run.
My goal is:
to read from parquet files
perform some computations with apply()
and write it back to Gcloud storage in parquet format, without ever returning the data from the cluster.
So, simply put, I want to keep the data on the cluster and not bring it back. Just compute and save the data somewhere else.
After reading the Dask distributed docs, I found the client.persist()/compute() and .scatter() methods. They look like what I need, but I don't really understand how to use them.
Could you, please, help me with client.persist() and client.compute() methods for my example
or suggest another way to do it? Thank you very much!
Dask version: 0.19.1
Dask distributed version: 1.23.1
Python version: 3.5.1
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute() # this triggers computations, but brings all of the data to one machine and creates a Pandas dataframe
df = df.persist() # this triggers computations, but keeps all of the data in multiple pandas dataframes spread across multiple machines
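A minimal sketch of the keep-everything-on-the-cluster workflow then looks like the following, assuming gcsfs authentication works from wherever this runs (otherwise wrap it in a submitted function as you already do, just without the final .compute()), and with the output path as a placeholder:
import dask.dataframe as dd

df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet',
                     storage_options={'token': 'cloud'},
                     engine='fastparquet')

# ... your apply()-based computations here; keep them lazy, i.e. do not
# call .compute() on the full dataframe ...

# to_parquet is itself a distributed write: each worker writes its own
# partitions straight to Gcloud storage, so no data is gathered locally
df.to_parquet('gcs://<bucket>/output/',
              storage_options={'token': 'cloud'},
              engine='fastparquet')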

Parallelizing in-memory tasks in dask using shared memory (no sending to other processes)?

I have a trivially parallelizable in-memory problem, but one that does not give great speedups with regular Python multiprocessing (only about 2x), due to the need for sending lots of data back and forth between processes. I'm hoping dask can help.
My code basically looks like this:
delayed_results = []
for key, kdf in natsorted(scdf.groupby(grpby_key)):
    d1 = dd.from_pandas(kdf, npartitions=1)
    d2 = dd.from_pandas(other_dfs[key], npartitions=1)
    result = dask.delayed(function)(d1, d2, key=key, n_jobs=n_jobs, **kwargs)
    delayed_results.append(result)

outdfs = dask.compute(*delayed_results)
This is what my old joblib code looked like:
outdfs = Parallel(n_jobs=n_jobs)(delayed(function)(scdf, other_dfs[key], key=key, n_jobs=n_jobs, **kwargs) for key, scdf in natsorted(scdf.groupby(grpby_key)))
However, the dask code is much much slower and more memory-consuming, both for the threaded and multiprocessing schedulers. I was hoping that dask could be used to parallelize tasks without needing to send stuff to other processes. Is there a way to use multiple processes with dask by using shared memory?
Btw, the docs have a reference to http://distributed.readthedocs.io/en/latest/local-cluster.html where they explain that this scheduler "handles data locality with more sophistication, and so can be more efficient than the multiprocessing scheduler on workloads that require multiple processes."
But they have no examples of its usage. What should I replace my dask.compute() call with in the code above to try the local cluster?
So you can just do the following
from distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
<your code>
Distributed will by default register itself as the executor, and you can just use dask.compute as normal.
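For instance, a minimal sketch (n_workers and threads_per_worker are illustrative; the rest of your code is unchanged):
from distributed import LocalCluster, Client
import dask

# One thread per worker process, so each group is handled in its own process
# while the distributed scheduler takes care of moving data between them
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# ... build delayed_results exactly as in the snippet above ...

# dask.compute now runs on the LocalCluster instead of the local
# threaded/multiprocessing scheduler
outdfs = dask.compute(*delayed_results)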

DASK - Stopping workers during execution causes completed tasks to be launched twice

I want to use dask to process some 5000 batch tasks that store their results in a relational database, and after they are all completed I want to run a final task that will query the database and generate a result file (which will be stored in AWS S3).
So it's more or less like this:
from dask import bag, delayed
batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)
client = Client('the_scheduler:8786')
client.compute(graph)
And this works, but: Near the end of processing, many workers are idle and I would like to be able to turn them off (and save some money on AWS EC2), but if I do that, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.
I understand that this is actually a feature, not a bug, as Dask is trying to keep track of all the results before starting read_database_and_store_bundled_result_into_s3, but: Is there any way that I can tell dask to just orchestrate the distributed processing graph and not worry about state management?
I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.
from dask.distributed import Client, as_completed

client = Client('the_scheduler:8786')

futures = client.map(process_batch_and_store_results_in_database, my_batches())

seq = as_completed(futures)
del futures  # now the only reference to the futures is within seq

for future in seq:
    pass  # let future be garbage collected
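Once the loop has drained seq, every batch has finished and its future has been released, so removing idle workers no longer causes recomputation. Assuming the final step can rebuild what it needs purely from the database (rather than from the results argument used in the bag version), it can then be submitted on its own:
# All batch results are already stored in the database at this point
final = client.submit(read_database_and_store_bundled_result_into_s3)
final.result()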
