Dask Memory Leak Workaround - dask

When using the Dask DataFrame where method I get a "distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS" warning. This continues until the system runs out of memory and swap. Is there a workaround for this, or am I doing something wrong? The file I'm reading can be found at https://lcb.app.box.com/s/e89t59s0yb558tjoncjsid710oirqbgd?page=1. You have to read it in with Pandas and save it as a parquet file for Dask to read it.
from dask import dataframe as dd
import dask.array as da
from dask.distributed import Client
from pathlib import Path
import os

file_path = Path('../cannabis_data_science/wa/nov_2021')

client = Client(n_workers=2, threads_per_worker=2, memory_limit='15GB')
client

sale_items_df = dd.read_parquet(path=file_path / 'SaleItems_1.parquet', blocksize='100MB').reset_index()

# this causes the warning
x = sale_items_df.description.where(sale_items_df.description.isna(), sale_items_df.name).compute()

Related

Understanding dask cudf object lifecycle

I want to understand the efficient memory management process for Dask objects. I have set up a Dask GPU cluster and I am able to execute tasks that run across the cluster. However, with the Dask objects, especially when I run the compute function, the process running on the GPU quickly grows by using more and more of the memory and soon I get a "run out of memory" error.
I want to understand how I can release the memory from a Dask object once I am done using it. In the following example, how can I release the object after the compute function? I am running the following code a few times, and the memory keeps growing in the process where it is running.
import cupy as cp
import pandas as pd
import cudf
import dask_cudf

nrows = 100000000

# Build a 100M-row cuDF dataframe in GPU memory, then wrap it as a dask_cudf dataframe
df2 = cudf.DataFrame({'a': cp.arange(nrows), 'b': cp.arange(nrows)})
ddf2 = dask_cudf.from_cudf(df2, npartitions=5)
ddf2['c'] = ddf2['a'] + 5
ddf2

# compute() pulls the full result back as a single cuDF dataframe
ddf2.compute()
Please check this blog post by Nick Becker. You may want to set up a client first.
You read into cuDF first, which is not recommended practice: you should read directly into dask_cudf.
When dask_cudf computes, the result is returned as a cuDF dataframe, which MUST fit into the remaining memory of your GPU. Chances are that reading into cuDF first has already taken a chunk of your memory.
Then, you can release a Dask object when you are done with it using client.cancel().
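To make those points concrete, here is a minimal sketch (the cluster setup, file name, and column name are assumptions for illustration, not from the original post): set up a client, read directly into dask_cudf instead of building a cuDF dataframe first, keep the result distributed with persist rather than pulling it back with compute, and release it with client.cancel() when finished.

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf

# Assumed setup: one worker per GPU on this machine
cluster = LocalCUDACluster()
client = Client(cluster)

# Read straight into dask_cudf so no single cuDF dataframe has to fit on one GPU
ddf = dask_cudf.read_csv('data.csv')   # hypothetical file
ddf['c'] = ddf['a'] + 5                # assumes a numeric column named 'a'

# Keep the result distributed across the workers instead of collecting it
ddf = ddf.persist()
wait(ddf)

# When done, release the GPU memory held by the collection
client.cancel(ddf)
del ddf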

How can I create a Dask array from zipped .npy files?

I have a large dataset stored as zipped npy files. How can I stack a given subset of these into a Dask array?
I'm aware of dask.array.from_npy_stack but I don't know how to use it for this.
Here's a crude first attempt that uses up all my memory:
import numpy as np
import dask.array as da

data = np.load('data.npz')

def load(files):
    list_ = [da.from_array(data[file]) for file in files]
    return da.stack(list_)

x = load(['foo', 'bar'])
Well, you can't load a large npz file into memory up front, because then you're already out of memory. I would read each array in a delayed fashion, and then call da.from_array and da.stack much as you already do in your example.
Here are some docs that may help if you haven't seen them before: https://docs.dask.org/en/latest/array-creation.html#using-dask-delayed
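A minimal sketch of that suggestion, assuming every array in the archive has the same, known shape and dtype (the shape, dtype, and member names below are placeholders):

import numpy as np
import dask
import dask.array as da

SHAPE = (1000, 1000)   # assumed per-array shape; replace with the real one
DTYPE = np.float64     # assumed dtype

@dask.delayed
def load_member(path, name):
    # Open the archive inside the task so nothing is read until compute time
    with np.load(path) as npz:
        return npz[name]

def load(path, names):
    arrays = [
        da.from_delayed(load_member(path, name), shape=SHAPE, dtype=DTYPE)
        for name in names
    ]
    return da.stack(arrays)

x = load('data.npz', ['foo', 'bar'])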

Save larger than memory Dask array to hdf5 file

I need to save dask arrays to hdf5 when using dask distributed. My situation is very similar to the one described in this issue: https://github.com/dask/dask/issues/3351. Basically this code will work:
import dask.array as da
from distributed import Client
import h5py
from dask.utils import SerializableLock

def create_and_store_dask_array():
    data = da.random.normal(10, 0.1, size=(1000, 1000), chunks=(100, 100))
    data.to_hdf5('test.h5', '/test')
    # this fails too
    # f = h5py.File('test.h5', 'w')
    # dset = f.create_dataset('/matrix', shape=data.shape)
    # da.store(data, dset)
    # f.close()

create_and_store_dask_array()
But as soon as I try to involve the distributed scheduler I get a TypeError: can't pickle _thread._local objects.
import dask.array as da
from distributed import Client
import h5py
from dask.utils import SerializableLock
from dask.distributed import Client, LocalCluster, progress, performance_report

def create_and_store_dask_array():
    data = da.random.normal(10, 0.1, size=(1000, 1000), chunks=(100, 100))
    data.to_hdf5('test.h5', '/test')
    # this fails too
    # f = h5py.File('test.h5', 'w')
    # dset = f.create_dataset('/matrix', shape=data.shape)
    # da.store(data, dset)
    # f.close()

cluster = LocalCluster(n_workers=35, threads_per_worker=1)
client = Client(cluster)
create_and_store_dask_array()
I am currently working around this by submitting my computations to the scheduler in small pieces, gathering the results in memory, and saving the arrays with h5py, but this is very, very slow. Can anyone suggest a good workaround for this problem? The issue discussion implies that xarray can take a dask array and write it to an hdf5 file, although this also seems very slow.
import xarray as xr
import netCDF4
import dask.array as da
from distributed import Client, LocalCluster
import h5py
from dask.utils import SerializableLock

cluster = LocalCluster(n_workers=35, threads_per_worker=1)
client = Client(cluster)

data = da.random.normal(10, 0.1, size=(1000, 1000), chunks=(100, 100))
# data.to_hdf5('test.h5', '/test')
test = xr.DataArray(data, dims=None, coords=None)
# save as hdf5
test.to_netcdf("test.h5", mode='w', format="NETCDF4")
If anyone could suggest a way to deal with this I would be very interested in finding a solution (particularly one that does not involve adding additional dependencies).
Thanks in advance,
h5py objects are not serializable, so they are hard to move between different processes in a distributed context. The explicit to_hdf5 method works around this; the more general store method doesn't special-case HDF5 in the same way.
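One way to see that constraint in practice (a sketch under the assumption that a single-machine write is acceptable, not the library's prescribed fix): keep the h5py file and dataset in one process and drive the write with the threaded scheduler, so nothing h5py-related ever has to be pickled.

import dask.array as da
import h5py
from dask.utils import SerializableLock

data = da.random.normal(10, 0.1, size=(1000, 1000), chunks=(100, 100))

# The h5py objects stay in this process; the threaded scheduler avoids
# serializing them, at the cost of not using distributed workers for the write
with h5py.File('test.h5', 'w') as f:
    dset = f.create_dataset('/matrix', shape=data.shape, dtype=data.dtype)
    da.store(data, dset, scheduler='threads', lock=SerializableLock())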

Limitations to using LocalCluster? Crashing persisting 50GB of data to 90GB of memory

System Info: CentOS, python 3.5.2, 64 cores, 96 GB ram
So I'm trying to load a large array (50GB) from an HDF5 file into RAM (96GB). Each chunk is around 1.5GB, less than the worker memory limit. It never seems to complete, sometimes crashing or restarting workers; I also don't see the memory usage on the web dashboard increasing or tasks being executed.
Should this work or am I missing something obvious here?
import dask.array as da
import h5py
from dask.distributed import LocalCluster, Client
from matplotlib import pyplot as plt
lc = LocalCluster(n_workers=64)
c = Client(lc)
f = h5py.File('50GB.h5', 'r')
data = f['data']
# data.shape = 2000000, 1000
x = da.from_array(data, chunks=(2000000, 100))
x = c.persist(x)
This was a misunderstanding of the way chunks and workers interact. With 64 workers, the 96GB of RAM is divided so that each worker gets only about 1.5GB, barely enough to hold a single chunk. Changing the way the LocalCluster is initialised fixes the issue:
lc = LocalCluster(n_workers=1)  # this way the single worker has ~90GB of memory, so the array can be persisted
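A variation on the same idea, with sizes that are assumptions rather than part of the original answer: keep several workers but give each an explicit memory limit comfortably above the ~1.5GB chunk size, so holding a few chunks never pushes a worker over its budget.

from dask.distributed import LocalCluster, Client

# Assumed sizing: 4 workers x 20GB leaves headroom on a 96GB machine and
# still holds the 50GB array spread across the workers
lc = LocalCluster(n_workers=4, threads_per_worker=16, memory_limit='20GB')
c = Client(lc)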

Value Error on Dask DataFrames

I am using dask to read a csv file. However, I couldn't apply or compute any operation on it because of this error:
Do you have any ideas about what this error is and how to fix it?
When reading a csv file in dask, errors come up when it does not recognize the correct dtype of the columns.
For example, we read a csv file using dask as follows:
import dask.dataframe as dd
df = dd.read_csv(r'\data\file.txt', sep='\t', header='infer')
This prompts the error mentioned above.
To solve this problem, as suggested by @mrocklin in this comment, https://github.com/dask/dask/issues/1166, we need to determine the dtype of the columns. We can do this by reading the csv file in pandas, identifying the data types, and passing them as an argument when reading the csv with dask:
import pandas as pd

df_pd = pd.read_csv(r'\data\file.txt', sep='\t', header='infer')
dt = df_pd.dtypes.to_dict()
df = dd.read_csv(r'\data\file.txt', sep='\t', header='infer', dtype=dt)
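A lighter-weight variant of the same idea (the sample size and column name below are illustrative assumptions): infer the dtypes from only the first rows of the file instead of reading all of it into pandas, or override the dtype of just the column dask mis-guesses.

import pandas as pd
import dask.dataframe as dd

# Infer dtypes from a sample of the file rather than the whole thing
sample = pd.read_csv(r'\data\file.txt', sep='\t', header='infer', nrows=10000)
df = dd.read_csv(r'\data\file.txt', sep='\t', header='infer',
                 dtype=sample.dtypes.to_dict())

# Or override only the problematic column ('some_column' is a hypothetical name)
df = dd.read_csv(r'\data\file.txt', sep='\t', header='infer',
                 dtype={'some_column': 'object'})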

Resources