When using the Dask dataframe where method I get a "distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS" warning. This continues until the system runs out of memory and swap. Is there a workaround for this, or am I doing something wrong? The file I'm reading can be found at https://lcb.app.box.com/s/e89t59s0yb558tjoncjsid710oirqbgd?page=1. You have to read it in with Pandas and save it as a parquet file for Dask to read it.
from dask import dataframe as dd
import dask.array as da
from dask.distributed import Client
from pathlib import Path
import os
file_path = Path('../cannabis_data_science/wa/nov_2021')
client = Client(n_workers=2, threads_per_worker=2, memory_limit='15GB')
client
sale_items_df = dd.read_parquet(path=file_path / 'SaleItems_1.parquet', blocksize='100MB').reset_index()
# this causes the warning
x = sale_items_df.description.where(sale_items_df.description.isna(), sale_items_df.name).compute()
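For reference, the Pandas-to-parquet conversion step mentioned above might look roughly like this (a sketch; the downloaded CSV filename is a guess, adjust it to whatever the Box file is actually called):
import pandas as pd
# Hypothetical filename for the Box download.
raw_df = pd.read_csv('SaleItems_1.csv', low_memory=False)
raw_df.to_parquet(file_path / 'SaleItems_1.parquet')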
I have a large dataset stored as zipped npy files. How can I stack a given subset of these into a Dask array?
I'm aware of dask.array.from_npy_stack but I don't know how to use it for this.
Here's a crude first attempt that uses up all my memory:
import numpy as np
import dask.array as da
data = np.load('data.npz')
def load(files):
    list_ = [da.from_array(data[file]) for file in files]
    return da.stack(list_)
x = load(['foo', 'bar'])
Well, you can't load a large npz file into memory, because then you're already out of memory. I would read each one in lazily, in a delayed fashion, and then call da.from_array and da.stack, roughly as you already do in your example.
Here are some docs that may help if you haven't seen them before: https://docs.dask.org/en/latest/array-creation.html#using-dask-delayed
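A rough sketch of that approach, following the linked docs (the keys, shape, and dtype below are placeholders; adjust them to your data):
import numpy as np
import dask
import dask.array as da

@dask.delayed
def load_member(npz_path, key):
    # Opening the .npz inside the task means the bytes are read only
    # when (and where) the worker actually needs them.
    with np.load(npz_path) as data:
        return data[key]

def load(npz_path, keys, shape, dtype):
    # from_delayed needs the shape and dtype up front, since nothing is read yet.
    arrays = [da.from_delayed(load_member(npz_path, key), shape=shape, dtype=dtype)
              for key in keys]
    return da.stack(arrays)

x = load('data.npz', ['foo', 'bar'], shape=(1000, 1000), dtype='float64')  # placeholder shape/dtype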
I'm not able to process this block using the distributed cluster.
import pandas as pd
from dask import dataframe as dd
import dask
df = pd.DataFrame({'reid_encod': [[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]]})
dask_df = dd.from_pandas(df, npartitions=3)
save_val = []
def add(dask_df):
    for _, outer_row in dask_df.iterrows():
        for _, inner_row in dask_df.iterrows():
            for base_encod in outer_row['reid_encod']:
                for compare_encod in inner_row['reid_encod']:
                    val = base_encod + compare_encod
                    save_val.append(val)
    return save_val
from dask.distributed import Client
client = Client(...)
dask_compute = dask.delayed(add)(dask_df)
dask_compute.compute()
Also, I have a few queries:
Does dask.delayed use the available cluster to do the computation?
Can I parallelize the for-loop iteration of this pandas DataFrame using delayed, and use multiple computers present in the cluster to do the computation?
Does dask.distributed work on a pandas dataframe?
Can we use dask.delayed in dask.distributed?
If the above programming approach is wrong, can you guide me on whether to choose delayed or a Dask DataFrame for the above scenario?
For the record, here are some answers, although I refer you to my earlier general points about this question:
Does dask.delayed use the available cluster to do the computation?
If you have created a client to a distributed cluster, dask will use it for computation unless you specify otherwise.
Can I parallelize the for-loop iteration of this pandas DataFrame using delayed, and use multiple computers present in the cluster to do the computation?
Yes, you can in general use delayed with pandas dataframes for parallelism if you wish. However, your dataframe only has one row, so it is not obvious in this case how - it depends on what you really want to achieve.
Does dask.distributed work on a pandas dataframe?
Yes, you can do anything that Python can do with distributed, since it is just Python processes executing code. Whether it brings you the performance you are after is a separate question.
Can we use dask.delayed in dask.distributed?
Yes, distributed can execute anything that dask in general can, including delayed functions/objects.
If the above programming approach is wrong, can you guide me on whether to choose delayed or a Dask DataFrame for the above scenario?
Not easily: it is not clear to me that this is a dataframe operation at all. It seems more like an array operation - but, again, I note that your function does not actually return anything useful.
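As a minimal sketch of the first and last two answers above (a dask.delayed computation executed through a distributed Client), assuming a plain local Client() for illustration:
import dask
from dask.distributed import Client

client = Client()  # once created, the client becomes the default scheduler

@dask.delayed
def inc(x):
    return x + 1

total = dask.delayed(sum)([inc(i) for i in range(10)])
print(total.compute())  # runs on the distributed cluster and prints 55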
In the tutorial: passing pandas dataframes to delayed; same with the dataframe API.
The main problem with your code is sketched in this section of the best practices: don't pass Dask collections to delayed functions. This means you should use either the delayed API or the dataframe API. While you can convert between dataframes and delayed, simply passing one into the other like this is not recommended.
Furthermore,
you only have one row in your dataframe, so you only get one partition and no parallelism whatever. You can only slow things down like this.
this appears to be an everything-to-everything (N^2) operation, so if you had many rows (the normal case for Dask), it would presumably take an extremely long time, no matter how many cores you used
passing lists in a pandas row is not a great idea; perhaps you wanted to use an array? (see the sketch after this list)
the function doesn't return anything useful, so it's not at all clear what you are trying to achieve. Under the description of MVCE, you will see references to "expected outcome" and "what went wrong". To get more help, please be more precise.
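For illustration only (not from the original question or answers), a rough sketch of how the nested loops could be expressed as a single array operation, assuming df is the pandas dataframe defined above:
import numpy as np

# Collect the per-row lists into a 2-D array, then flatten it.
encodings = np.array(df['reid_encod'].tolist())
flat = encodings.ravel()

# Broadcasting produces every pairwise sum in one vectorised step,
# the same N^2 combinations the four nested loops accumulate.
pairwise_sums = flat[:, None] + flat[None, :]
If the data were large, the same broadcast could be done on dask arrays (e.g. via da.from_array), but the N^2 cost noted above remains.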
System Info: CentOS, Python 3.5.2, 64 cores, 96 GB RAM
So I'm trying to load a large array (50 GB) from an HDF5 file into RAM (96 GB). Each chunk is around 1.5 GB, less than the worker memory limit. It never seems to complete, sometimes crashing or restarting workers; also, I don't see the memory usage on the web dashboard increasing or tasks being executed.
Should this work or am I missing something obvious here?
import dask.array as da
import h5py
from dask.distributed import LocalCluster, Client
from matplotlib import pyplot as plt
lc = LocalCluster(n_workers=64)
c = Client(lc)
f = h5py.File('50GB.h5', 'r')
data = f['data']
# data.shape = 2000000, 1000
x = da.from_array(data, chunks=(2000000, 100))
x = c.persist(x)
This was a misunderstanding of the way chunks and workers interact. Specifically, changing the way the LocalCluster is initialised fixes the issue:
lc = LocalCluster(n_workers=1)  # This way one worker has ~90 GB of memory, so the array can be persisted
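An alternative sketch (my own, not from the original answer) that follows the same chunks-versus-workers reasoning would be to keep several workers but make each chunk small relative to the per-worker memory limit:
import dask.array as da
import h5py
from dask.distributed import LocalCluster, Client

# 8 workers x 12 GB stays within the 96 GB of RAM, and ~160 MB chunks
# (assuming float64) sit far below the per-worker limit.
lc = LocalCluster(n_workers=8, memory_limit='12GB')
c = Client(lc)
f = h5py.File('50GB.h5', 'r')
x = da.from_array(f['data'], chunks=(200000, 100))
x = c.persist(x)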
I'm looking for the best way to parallelize the following problem on a cluster. I have several files:
folder/file001.csv
folder/file002.csv
:
folder/file100.csv
They are disjoint with respect to the key I want to use for the groupby; that is, if a set of keys is in file001.csv, none of these keys has an item in any other file.
On the one hand, I can just run
df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way to do so in a sort of
delayed-groupby way.
Every filexxx.csv fits in memory on a node. Given that every node has n cores, it would be ideal to use all of them. For every single file I can use this hacky way:
import numpy as np
import pandas as pd
import multiprocessing as mp
cores = mp.cpu_count()  # Number of CPU cores on your system
partitions = cores  # Define as many partitions as you want
def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data
data = parallelize(data, f)
And, again, I'm not sure if there is an efficient dask way to do so.
You could use a Client (it will run in multi-process mode by default) and read your data with a certain blocksize. You can get the number of workers (and the number of cores per worker) with the ncores method and then calculate the optimal blocksize.
However, according to the documentation, blocksize is by default "computed based on available physical memory and the number of cores."
So I think the best way to do it is simply:
from distributed import Client
import dask.dataframe as dd
# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')
ddf = dd.read_csv("folder/*")
EDIT: after that, use map_partitions and do the groupby for each partition:
# Note ddf is a dask dataframe and df is a pandas dataframe
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)
Don't use compute, because it will result in a single pandas DataFrame; instead, use a Dask output method to keep the entire process parallel and larger-than-RAM compatible.
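For example (a sketch with a hypothetical output path), the grouped result can be written straight back to disk in parallel:
# Writing each partition out directly avoids collecting everything
# into a single in-memory pandas DataFrame.
new_ddf.to_parquet("folder/grouped_output/")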