Parallelization on cluster dask - dask

I'm looking for the best way to parallelize on a cluster the following problem. I have several files
folder/file001.csv
folder/file002.csv
:
folder/file100.csv
They are disjoints with respect to the key I want to use to groupby, that is if a set of keys is in file1.csv any of these keys has an item in any other files.
In one side I can just run
df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way to do so in a sort of
delayed-groupby way.
Every filexxx.csv fits in memory on a node. Given that every node has n cores it will be ideal use all of them. For every single file I can use this hacky way
import numpy as np
import multiprocessing as mp
cores = mp.cpu_count() #Number of CPU cores on your system
partitions = cores #Define as many partitions as you want
def parallelize(data, func):
data_split = np.array_split(data, partitions)
pool = mp.Pool(cores)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
data = parallelize(data, f);
And, again, I'm not sure if there is an efficent dask way to do so.

you could use a Client (will run in multi process by default) and read your data with a certain blocksize. you can get the amount of workers (and number of cores per worker) with the ncores method and then calculate the optimal blocksize.
however according to the documantaion blocksize is by default "computed based on available physical memory and the number of cores."
so i think the best way to do it is a simple:
from distributed import Client
# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')
ddf = dd.read_csv("folder/*")
EDIT: after that use map_partitions and do the gorupby for each partition:
# Note ddf is a dask dataframe and df is a pandas dataframe
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)
don't use compute because it will result in a single pandas.dataframe, instead use a dask output method to keep the entire process parallel and larger then ram compatible.

Related

How to ensure number of `partitions` is equally distributed across workers with dask and dask-cudf?

I am trying to do a basic ETL workflow on large files across workers using dask-cudf across a large amount of workers .
Problem:
Initially the scheduler schedules equal amounts of partitions to be read across workers but during the pre-processing it tends to distribute/shuffle them across workers.
The minimum number of partitions that a worker gets is 4 and the maximum partitions that it gets is 19 (total partitions = apprx. 300, num_workers = 22) this behavior causes problem downstream as i want equal distribution of partitions across workers.
Is there a way to prevent this behavior ?
I thought below will help with that but it does not .
# limit work-stealing as much as possible
dask.config.set({'distributed.scheduler.work-stealing': False})
dask.config.set({'distributed.scheduler.bandwidth': 1})
Workflow being done:
read
fill-na
down-casting/other logic
df = dask_cudf.read_csv(path = `big_files`,
names = names,
delimiter='\t',
dtype = read_dtype_ls,
chunksize=chunksize)
df = df.map_partitions(lambda df:df.fillna(-1))
def transform_col_int64_to_int32(df, columns):
"""
This function casts int64s columns to int32s
we are using this to transform int64s to int32s and overflows seem to be consitent
"""
for col in columns:
df[col] = df[col].astype(np.int32)
return df
df = df.map_partitions(transform_col_int64_to_int32,cat_col_names)
df = df.persist()
Dask schedules where tasks run based on a number of factors, including data dependencies, runtime, memory use, and so on. Typically the answer to these questions is "just let it do it's thing". The most common cause of poor scheduling is having too few chunks.
However, if you explicitly need a more rebalanced distribution then you can try the Client.rebalance method.
wait(df)
client.rebalance(df)
However beware that rebalance is not as robust as other Dask operations. It's best to do it at a time when there isn't a ton of other work going on (hence the call to dask.distributed.wait above).
Also, I would turn on work stealing. Work stealing is another name for load balancing.

Forcing Locality on Dask Dataframe Subsets

I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this.
All the dask-distributed examples/docs I see are populating the initial data load from a network resource (hdfs, s3, etc) and does not appear to extend the DAG optimization to the load portion (seems to assume that a network load is a necessary evil and just eats the initial cost.) This is underscored on the answer to another question: Does Dask communicate with HDFS to optimize for data locality?
However, I can see cases where we would want this. For example, if we have a sharded database + dask workers co-located on nodes of this DB, we would want to force records from only the local shard to be populated into the local dask workers. From the documentation/examples, network cris-cross seems like a necessarily assumed cost. Is it possible to force parts of a single dataframe to be obtained from specific workers?
The alternative, which I've tried, is to try and force each worker to run a function (iteratively submitted to each worker) where the function loads only the data local to that machine/shard. This works, and I have a bunch of optimally local dataframes with the same column schema -- however -- now I don't have a single dataframe but n dataframes. Is it possible to merge/fuse dataframes across multiple machines so there is a single dataframe reference, but portions have affinity (within reason, as decided by the task DAG) to specific machines?
You can produce dask "collections" such as a dataframe from futures and delayed objects, which inter-operate nicely with each other.
For each partition, where you know which machine should load it, you can produce a future as follows:
f = c.submit(make_part_function, args, workers={'my.worker.ip'})
where c is the dask client and the address is the machine you'd want to see it happen on. You can also give allow_other_workers=True is this is a preference rather than a requirement.
To make a dataframe, from a list of such futures, you could do
df = dd.from_delayed([dask.delayed(f) for f in futures])
and ideally provide a meta=, giving a description of the expected dataframe. Now, further operations on a given partition will prefer to be scheduled on the same worker which already holds the data.
I am also interested in having the capability to restrict computation to a specific node (and data localized to that node). I have tried to implement the above with a simple script (see below) but looking at the resulting data frame, results the error (from dask/dataframe/utils.py::check_meta()):
ValueError: Metadata mismatch found in `from_delayed`.
Expected partition of type `DataFrame` but got `DataFrame`
Example:
from dask.distributed import Client
import dask.dataframe as dd
import dask
client = Client(address='<scheduler_ip>:8786')
client.restart()
filename_1 = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
filename_2 = 'http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv'
future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
client.has_what()
# Returns: {'tcp://<w1_ip>:41942': ('read_csv-c08b231bb22718946756cf46b2e0f5a1',),
# 'tcp://<w2_ip>:41942': ('read_csv-e27881faa0f641e3550a8d28f8d0e11d',)}
df = dd.from_delayed([dask.delayed(f) for f in [future_1, future_2]])
type(df)
# Returns: dask.dataframe.core.DataFrame
df.head()
# Returns:
# ValueError: Metadata mismatch found in `from_delayed`.
# Expected partition of type `DataFrame` but got `DataFrame`
Note The dask environment has a two worker nodes (aliased to w1 and w2) a scheduler node and the script is running on an external host.
dask==1.2.2, distributed==1.28.1
It is odd to call many dask dataframe functions in parallel. Perhaps you meant to call many Pandas read_csv calls in parallel instead?
# future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
# future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
future_1 = client.submit(pandas.read_csv, filename_1, workers='w1')
future_2 = client.submit(pandas.read_csv, filename_2, workers='w2')
See https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more information

Dask Distributed with Asynchronous Real-time Parallelism

I'm reading the documentation on dask.distributed and it looks like I could submit functions to the distributed cluster via client.submit().
I have an existing function some_func that is grabbing individual documents (say, a text file) asynchronously and I want to take the raw document and grab all words that don't contain a vowel and shove it back into a different database. This data processing step is blocking.
Assuming that there are several million documents and the distributed cluster only has 10 nodes with 1 process available (i.e., it can only process 10 documents at a time), how will dask.distributed handle the flow of the documents that it needs to process?
Here is some example code:
client = dask.distributed('tcp://1.2.3.4:8786')
def some_func():
doc = retrieve_next_document_asynchronously()
client.submit(get_vowelless_words, doc)
def get_vowelless_words(doc):
vowelless_words = process(doc)
write_to_database(vowelless_words)
if __name__ == '__main__':
for i in range(1000000):
some_func()
Since the processing of a document is blocking and the cluster can only handle 10 documents simultaneously, what happens when 30 other documents are retrieved while the cluster is busy? I understand that client.submit() is asynchronous and it would return a concurrent future but what would happen in this case? Would it hold the document in memory until it 1/10 cores are available and potentially cause the machine to run out of memory after, say, if 1,000 documents are waiting.
What would the scheduler do in this case? FIFO? Should I somehow change the code so that it waits for a core to be available before retrieving the next document? How might that be accomplished?
To use Queues with dask, below is a modified example of using dask Queues with a distributed cluster (based on the documentation):
#!/usr/bin/env python
import distributed
from queue import Queue
from threading import Thread
client = distributed.Client('tcp://1.2.3.4:8786')
nprocs = len(client.ncores())
def increment(x):
return x+1
def double(x):
return 2*x
input_q = Queue(maxsize=nprocs)
remote_q = client.scatter(input_q)
remote_q.maxsize = nprocs
inc_q = client.map(increment, remote_q)
inc_q.maxsize = nprocs
double_q = client.map(double, inc_q)
double_q.maxsize = nprocs
result_q = client.gather(double_q)
def load_data(q):
i = 0
while True:
q.put(i)
i += 1
load_thread = Thread(target=load_data, args=(input_q,))
load_thread.start()
while True:
size = result_q.qsize()
item = result_q.get()
print(item, size)
In this case, we explicitly limit the maximum size of each queue to be equal to the number of distributed processes that are available. Otherwise, the while loop will overload the cluster. Of course, you can adjust the maxsize to be some multiple of the number of available processes as well. For simple functions like increment and double, I found that maxsize = 10*nprocs is still reasonable but this will surely be limited by the amount of time that it takes to run your custom function.
When you call submit all of the arguments are serialized and immediately sent to the scheduler. An alternative would be to both get documents and process them on the cluster (this assumes that documents are globally visible from all workers).
for fn in filenames:
doc = client.submit(retrieve_doc, fn)
process = client.submit(process_doc, doc)
fire_and_forget(process)
If documents are only available on your client machine and you want to restrict flow then you might consider using dask Queues or the as_completed iterator.

Dask distributed perform computations without returning data

I have a dynamic Dask Kubernetes cluster.
I want to load 35 parquet files (about 1.2GB) from Gcloud storage into Dask Dataframe then process it with apply() and after saving the result to parquet file to Gcloud.
During loading files from Gcloud storage, a cluster memory usage is increasing to about 3-4GB. Then workers (each worker has 2GB of RAM) are terminated/restarted and some tasks getting lost,
so cluster starts computing the same things in a circle.
I removed apply() operation and leave only read_parquet() to test
if my custom code causes a trouble, but the problem was the same, even with just single read_parquet() operation. This is a code:
client = Client('<ip>:8786')
client.restart()
def command():
client = get_client()
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute()
x = client.submit(command)
x.result()
Note: I'm submitting a single command function to run all necessary commands to avoid problems with gcsfs authentication inside a cluster
After some investigation, I understood that problem could be in .compute() which returns all data to a process, but this process (my command function) is running on a worker. Because of that, a worker doesn't have enough RAM, crashes and lose all computed task which triggers tasks re-run.
My goal is:
to read from parquet files
perform some computations with apply()
and without even returning data from a cluster write it back to Gcloud storage in parquet format.
So, simply I want to keep data on a cluster and not return it back. Just compute and save data somewhere else.
After reading Dask distributed docs, I have found client.persist()/compute() and .scatter() methods. They look like what I need, but I don't really understand how to use them.
Could you, please, help me with client.persist() and client.compute() methods for my example
or suggest another way to do it? Thank you very much!
Dask version: 0.19.1
Dask distributed version: 1.23.1
Python version: 3.5.1
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute() # this triggers computations, but brings all of the data to one machine and creates a Pandas dataframe
df = df.persist() # this triggers computations, but keeps all of the data in multiple pandas dataframes spread across multiple machines

Batch results of intermediate dask computation

I have a large (10s of GB) CSV file that I want to load into dask, and for each row, perform some computation. I also want to write the results of the manipulated CSV into BigQuery, but it'd be better to batch network requests to BigQuery in groups of say, 10,000 rows each, so I don't incur network overhead per row.
I've been looking at dask delayed and see that you can create an arbitrary computation graph, but I'm not sure if this is the right approach: how do I collect and fire off intermediate computations based on some group size (or perhaps time elapsed). Can someone provide a simple example on that? Say for simplicity we have these functions:
def change_row(r):
# Takes 10ms
r = some_computation(r)
return r
def send_to_bigquery(rows):
# Ideally, in large-ish groups, say 10,000 rows at a time
make_network_request(rows)
# And here's how I'd use it
import dask.dataframe as dd
df = dd.read_csv('my_large_dataset.csv') # 20 GB
# run change_row(r) for each r in df
# run send_to_big_query(rows) for each appropriate size group based on change_row(r)
Thanks!
The easiest thing that you can do is provide a block size parameter to read_csv, which will get you approximately the right number of rows per block. You may need to measure some of your data or experiment to get this right.
The rest of your task will work the same way as any other "do this generic thing to blocks of data-frame": the `map_partitions' method (docs).
def alter_and_send(df):
rows = [change_row(r) for r in df.iterrows()]
send_to_big_query(rows)
return df
df.map_partitions(alter_and_send)
Basically, you are running the function on each piece of the logical dask dataframe, which are real pandas dataframes.
You may actually want map, apply or other dataframe methods in the function.
This is one way to do it - you don't really need the "output" of the map, and you could have used to_delayed() instead.

Resources