dask.distributed not utilising the cluster - dask

I'm not able to process this block using the distributed cluster.
import pandas as pd
from dask import dataframe as dd
import dask
df = pd.DataFrame({'reid_encod': [[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]]})
dask_df = dd.from_pandas(df, npartitions=3)
save_val = []
def add(dask_df):
for _, outer_row in dask_df.iterrows():
for _, inner_row in dask_df.iterrows():
for base_encod in outer_row['reid_encod']:
for compare_encod in inner_row['reid_encod']:
val = base_encod + compare_encod
save_val.append(val)
return save_val
from dask.distributed import Client
client = Client(...)
dask_compute = dask.delayed(add)(dask_df)
dask_compute.compute()
Also I have few queries
Does dask.delayed use the available clusters to do the computation.
Can I paralleize the for loop iteratition of this pandas DF using delayed, and use multiple computers present in the cluster to do computations.
does dask.distributed work on pandas dataframe.
can we use dask.delayed in dask.distributed.
If the above programming approach is wrong, can you guide me whether to choose delayed or dask DF for the above scenario.

For the record, some answers, although I wish to note my earlier general points about this question
Does dask.delayed use the available clusters to do the computation.
If you have created a client to a distributed cluster, dask will use it for computation unless you specify otherwise.
Can I paralleize the for loop iteratition of this pandas DF using delayed, and use multiple computers present in the cluster to do computations.
Yes, you can in general use delayed with pandas dataframes for parallelism if you wish. However, your dataframe only has one row, so it is not obvious in this case how - it depends on what you really want to achieve.
does dask.distributed work on pandas dataframe.
Yes, you can do anything that python can do with distributed, since it is just python processes executing code. Whether it brings you the performance you are after is a separate question
can we use dask.delayed in dask.distributed.
Yes, distributed can execute anything that dask in general can, including delayed functions/objects
If the above programming approach is wrong, can you guide me whether to choose delayed or dask DF for the above scenario.
Not easily, it is not clear to me that this is a dataframe operation at all. It seems more like an array - but, again, I note that your function does not actually return anything useful at all.
In the tutorial: passing pandas dataframes to delayed ; same with dataframe API.

The main problem with your code is sketched in this section of the best practices: don't pass Dask collections to delayed functions. This means, you should use either the delayed API or the dataframe API. While you can convert dataframes<->delayed, simply passing like this is not recommended.
Furthermore,
you only have one row in your dataframe, so you only get one partition and no parallelism whatever. You can only slow things down like this.
this appears to be an everything-to-everything (N^2) operation, so if you had many rows (the normal case for Dask), it would presumably take extremely long, no matter how many cores you used
passing lists in a pandas row is not a great idea, perhaps you wanted to use an array?
the function doesn't return anything useful, so it's not at all clear what you are trying to achieve. Under the description of MVCE, you will see references to "expected outcome" and "what went wrong". To get more help, please be more precise.

Related

Understanding dask cudf object lifecycle

I want to understand the efficient memory management process for Dask objects. I have setup a Dask GPU cluster and I am able to execute tasks that runs across the cluster. However, with the dask objects, especially when I run the compute function, the process that runs on the GPU is quickly growing by using more and more of the memory and soon I am getting "Run out of memory Error".
I want to understand how I can release the memory from dask object once I am done with using them. In this following example, after the compute function how can I release that object. I am running the following code for a few times. The memory keeps growing in the process where it is running
import cupy as cp
import pandas as pd
import cudf
import dask_cudf
nrows = 100000000
df2 = cudf.DataFrame({'a': cp.arange(nrows), 'b': cp.arange(nrows)})
ddf2 = dask_cudf.from_cudf(df2, npartitions=5)
ddf2['c'] = ddf2['a'] + 5
ddf2
ddf2.compute()
Please check this blog post by Nick Becker. you may want to set up a client first.
You read into cudf first, which you shouldn't do as practice. You should read directly into dask_cudf.
When dask_cudf computes, the result returns as a cudf dataframe, which MUST fit into the remaining memory of your GPU. Chances are reading into cudf first may have taken a chunk of your memory.
Then, you can delete a dask object when you are done using client.cancel().

Forcing Locality on Dask Dataframe Subsets

I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this.
All the dask-distributed examples/docs I see are populating the initial data load from a network resource (hdfs, s3, etc) and does not appear to extend the DAG optimization to the load portion (seems to assume that a network load is a necessary evil and just eats the initial cost.) This is underscored on the answer to another question: Does Dask communicate with HDFS to optimize for data locality?
However, I can see cases where we would want this. For example, if we have a sharded database + dask workers co-located on nodes of this DB, we would want to force records from only the local shard to be populated into the local dask workers. From the documentation/examples, network cris-cross seems like a necessarily assumed cost. Is it possible to force parts of a single dataframe to be obtained from specific workers?
The alternative, which I've tried, is to try and force each worker to run a function (iteratively submitted to each worker) where the function loads only the data local to that machine/shard. This works, and I have a bunch of optimally local dataframes with the same column schema -- however -- now I don't have a single dataframe but n dataframes. Is it possible to merge/fuse dataframes across multiple machines so there is a single dataframe reference, but portions have affinity (within reason, as decided by the task DAG) to specific machines?
You can produce dask "collections" such as a dataframe from futures and delayed objects, which inter-operate nicely with each other.
For each partition, where you know which machine should load it, you can produce a future as follows:
f = c.submit(make_part_function, args, workers={'my.worker.ip'})
where c is the dask client and the address is the machine you'd want to see it happen on. You can also give allow_other_workers=True is this is a preference rather than a requirement.
To make a dataframe, from a list of such futures, you could do
df = dd.from_delayed([dask.delayed(f) for f in futures])
and ideally provide a meta=, giving a description of the expected dataframe. Now, further operations on a given partition will prefer to be scheduled on the same worker which already holds the data.
I am also interested in having the capability to restrict computation to a specific node (and data localized to that node). I have tried to implement the above with a simple script (see below) but looking at the resulting data frame, results the error (from dask/dataframe/utils.py::check_meta()):
ValueError: Metadata mismatch found in `from_delayed`.
Expected partition of type `DataFrame` but got `DataFrame`
Example:
from dask.distributed import Client
import dask.dataframe as dd
import dask
client = Client(address='<scheduler_ip>:8786')
client.restart()
filename_1 = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
filename_2 = 'http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv'
future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
client.has_what()
# Returns: {'tcp://<w1_ip>:41942': ('read_csv-c08b231bb22718946756cf46b2e0f5a1',),
# 'tcp://<w2_ip>:41942': ('read_csv-e27881faa0f641e3550a8d28f8d0e11d',)}
df = dd.from_delayed([dask.delayed(f) for f in [future_1, future_2]])
type(df)
# Returns: dask.dataframe.core.DataFrame
df.head()
# Returns:
# ValueError: Metadata mismatch found in `from_delayed`.
# Expected partition of type `DataFrame` but got `DataFrame`
Note The dask environment has a two worker nodes (aliased to w1 and w2) a scheduler node and the script is running on an external host.
dask==1.2.2, distributed==1.28.1
It is odd to call many dask dataframe functions in parallel. Perhaps you meant to call many Pandas read_csv calls in parallel instead?
# future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
# future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
future_1 = client.submit(pandas.read_csv, filename_1, workers='w1')
future_2 = client.submit(pandas.read_csv, filename_2, workers='w2')
See https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more information

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
d = []
for lot in lots:
lot_data = data[data["LOTID"]==lot]
trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
d.append(trmat)
df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', "to"]), d)
Visualized graph of the operations
General rule: if your data comfortable fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release python's internal lock, the GIL, in which case although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processed; however using processes comes at a cost for moving data between client and processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing from the client.
Short story, you should do some experimentation, measure well, and read the data-frame and distributed scheduler documentation carefully.

Parallelization on cluster dask

I'm looking for the best way to parallelize on a cluster the following problem. I have several files
folder/file001.csv
folder/file002.csv
:
folder/file100.csv
They are disjoints with respect to the key I want to use to groupby, that is if a set of keys is in file1.csv any of these keys has an item in any other files.
In one side I can just run
df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way to do so in a sort of
delayed-groupby way.
Every filexxx.csv fits in memory on a node. Given that every node has n cores it will be ideal use all of them. For every single file I can use this hacky way
import numpy as np
import multiprocessing as mp
cores = mp.cpu_count() #Number of CPU cores on your system
partitions = cores #Define as many partitions as you want
def parallelize(data, func):
data_split = np.array_split(data, partitions)
pool = mp.Pool(cores)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
data = parallelize(data, f);
And, again, I'm not sure if there is an efficent dask way to do so.
you could use a Client (will run in multi process by default) and read your data with a certain blocksize. you can get the amount of workers (and number of cores per worker) with the ncores method and then calculate the optimal blocksize.
however according to the documantaion blocksize is by default "computed based on available physical memory and the number of cores."
so i think the best way to do it is a simple:
from distributed import Client
# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')
ddf = dd.read_csv("folder/*")
EDIT: after that use map_partitions and do the gorupby for each partition:
# Note ddf is a dask dataframe and df is a pandas dataframe
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)
don't use compute because it will result in a single pandas.dataframe, instead use a dask output method to keep the entire process parallel and larger then ram compatible.

Dask performances: workflow doubts

I'm confused about how to get the best from dask.
The problem
I have a dataframe which contains several timeseries (every one has its own key) and I need to run a function my_fun on every each of them. One way to solve it with pandas involves
df = list(df.groupby("key")) and then apply my_fun
with multiprocessing. The performances, despite the huge usage of RAM, are pretty good on my machine and terrible on google cloud compute.
On Dask my current workflow is:
import dask.dataframe as dd
from dask.multiprocessing import get
Read data from S3. 14 files -> 14 partitions
`df.groupby("key").apply(my_fun).to_frame.compute(get=get)
As I didn't set the indices df.known_divisions is False
The resulting graph is
and I don't understand if what I see it is a bottleneck or not.
Questions:
Is it better to have df.npartitions as a multiple of ncpu or it doesn't matter?
From this it seems that is better to set the index as key. My guess is that I can do something like
df["key2"] = df["key"]
df = df.set_index("key2")
but, again, I don't know if this is the best way to do it.
For questions like "what is taking time" in Dask, you are generally recommended to use the "distributed" scheduler rather than multiprocessing - you can run with any number of processes/threads you like, but you have much more information available via the diagnostics dashboard.
For your specific questions, if you are grouping over a column that is not nicely split between partitions and applying anything other than the simple aggregations, you will inevitably need a shuffle. Setting the index does this shuffle for you as a explicit step, or you get the implicit shuffle apparent in your task graph. This is a many-to-many operation, each aggregation tasks needs input from every original partition, hence the bottle-neck. There is no getting around that.
As for number of partitions, yes you can have sub-optimal conditions like 9 partitions on 8 cores (you will calculate 8 tasks, and then perhaps block for the final task on one core while the others are idle); but in general you can depend on dask to make reasonable scheduling decisions so long as you are not using a very small number of partitions. In many cases, it will not matter much.

Resources