dask read_parquet runs out of memory - dask

I'm trying to read a big (will not fit in memory) parquet dataset, and then sample from it. Each partition of the dataset fits perfectly in memory.
The dataset is about 20 GB on disk, divided into 104 partitions of about 200 MB each. I don't want to use more than 40 GB of memory at any point, so I'm setting n_workers and memory_limit accordingly.
My hypothesis was that Dask would load as many partitions as it could handle, sample from them, scrap them from memory and then continue loading the next ones. Or something like that.
Instead, judging by the execution graph (104 load operations in parallel, each followed by a sample), it looks like it tries to load all the partitions simultaneously, and therefore the workers keep getting killed for running out of memory.
Am I missing something?
This is my code:
from dask.distributed import Client
import dask.dataframe as dd

# 4 workers, each limited to 10e9 bytes (10 GB), i.e. 40 GB in total
client = Client(n_workers=4, memory_limit=10e9)

df = dd.read_parquet('/path/to/dataset/')
df = df.sample(frac=0.01)   # keep 1% of the rows
df = df.compute()           # materialize the sample as a pandas DataFrame
To reproduce the error, you can create a mock dataset 1/10th the size of the one I was trying to load using the code below, and run my code with memory_limit=1e9 (1 GB) per worker to compensate.
from dask.distributed import Client
from dask import datasets

client = Client()  # add n_workers/memory_limit restrictions depending on your system here

# ~2 GB of synthetic time series data, split into 104 partitions like the real dataset
df = datasets.timeseries(end='2002-12-31')
df = df.repartition(npartitions=104)
df.to_parquet('./mock_dataset')

Parquet is an efficient binary format, with encoding and compression. There is a very good chance that in memory, it takes up far more space than you think.
In order to sample the data at 1%, each partition is being loaded and expanded into memory in its entirety before being sub-selected. This comes with considerable memory overhead from buffer copies. Each worker thread will need to accommodate the currently-processed chunk, as well as the results that have been accumulated so far on that worker, and then a task will copy all of these for the final concat operation (which also involves copies and overhead).
The general recommendation is that each worker should have access to "several times" the in-memory size of each partition, and in your case those are ~200 MB on disk and bigger in memory.
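If the goal is just to obtain the 1% sample, one possible workaround (a sketch only; the paths and worker settings are illustrative, and it assumes the sampled 1% fits comfortably in memory) is to keep the sample lazy and write it back to parquet, so each worker can drop a partition as soon as its sampled rows are written, instead of holding every result for one final concat:

from dask.distributed import Client
import dask.dataframe as dd

# One thread per worker keeps each worker to one in-flight partition at a time.
client = Client(n_workers=4, threads_per_worker=1, memory_limit=10e9)

df = dd.read_parquet('/path/to/dataset/')
sample = df.sample(frac=0.01)

# Write the sample out instead of calling .compute() on it directly.
sample.to_parquet('/path/to/sample/')

# If a pandas DataFrame is still needed, the written sample (~1% of 20 GB)
# should be small enough to read back and compute in one go.
result = dd.read_parquet('/path/to/sample/').compute()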

Related

DBSCAN running out of memory and getting killed

I am passing data normalized using MinMaxScaler to DBSCAN's fit_predict. My data is very small (12 MB, around 180,000 rows and 9 columns). However, while running this, the memory usage quickly climbs and the kernel gets killed (I presume by the OOM killer). I even tried it on a server with 256 GB RAM and it fails fairly quickly.
Here is my repro code:
import pandas as pd
from sklearn.cluster import DBSCAN

X_ml = pd.read_csv('Xml.csv')   # ~180,000 rows x 9 columns, already MinMax-scaled
dbscan = DBSCAN(eps=0.28, min_samples=9)
dbscan_pred = dbscan.fit_predict(X_ml)
and here is my Xml.csv data file.
Any ideas how to get it working?

conversion of xarray to numpy array - oom-kill event

I'm using xarray to read in and modify a data set for my analysis.
Here is the data repr (a dask-backed dataset split into 183 chunks):
To plot the data I have to convert it to a numpy array:
Z_diff.values()
When doing so I get the error message:
slurmstepd: error: Detected 1 oom-kill event(s) in step 33179783.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
I'm using the following settings:
#SBATCH --ntasks-per-node=16
#SBATCH --nodes=4
#SBATCH --mem=250G
It looks like you're just running out of memory. When using dask with xarray, the data is split into chunks (183 of them in your case), so only a small portion of the dataset is kept in memory at once. Numpy arrays don't work that way: the whole dataset is read into memory when you use .values(), and that exceeds your memory.
Depending on what you're trying to plot, you might be able to plot each chunk individually or plot data from each chunk one at a time on the same plot. Or just plot a subset of the data to avoid reading all the data into memory at once. Lastly, if available, you could potentially also request more memory until you no longer get this error.
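For example (just a sketch; it assumes Z_diff is a dask-backed xarray DataArray with a dimension named time, which may not match the real data), subsetting or reducing before plotting keeps the amount of data that actually gets computed small:

import matplotlib.pyplot as plt

# Plot a single slice; xarray/dask only loads the chunks needed for it.
Z_diff.isel(time=0).plot()
plt.savefig('z_diff_t0.png')

# Or reduce along a dimension first, so only the small result is materialized.
Z_diff.mean(dim='time').plot()
plt.savefig('z_diff_mean.png')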

Dask-ml ParallelPostFit not using distributed and causing memory error on local machine

I want to do Random Forest predictions on a large dataset and save the result as a dataframe. I read https://examples.dask.org/machine-learning/parallel-prediction.html and it says "Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine", but I can't figure out how to do this. I tried it by connecting to a distributed cluster and doing:
import dask.array as da
import dask.dataframe as dd

x = da.from_array(i, chunks=100000)   # i is the input numpy array
t = model.predict(x)                  # model wraps the random forest in ParallelPostFit
t = client.persist(t)
df = dd.from_array(t)
df.to_parquet("xy.parquet")
However, this does not trigger any computation on the cluster (observed via the dashboard), and it runs my 1 TB RAM machine into a memory error when to_parquet computes, even for a test where the numpy size of x and t is 7 GB. Anything else I submit to the cluster is computed there.
So how do I save the results of the prediction?
EDIT:
This seems to be an issue with the size of the input x. It has shape (24507731, 8). If I instead just throw in random data with shape (24507, 8), the computation finishes. This is quite surprising, as ParallelPostFit is supposed to make prediction on large data possible in the first place.

dask dataframe: merging two dataframes, imputing missing values and writing to csv only uses part of each CPU (~20% per CPU)

I want to merge two dask dataframes, impute missing values with the column median, and export the merged dataframe to csv files.
I have one problem: my current code cannot utilize all 8 CPUs (only ~20% of each CPU is used).
I am not sure which part limits the CPU usage. Here is reproducible code:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.c_[np.random.randint(100, size=(10000, 1)), np.random.randn(10000, 3)],
    columns=['id', 'a', 'b', 'c'])
df2 = pd.DataFrame(
    np.c_[np.array(range(100)), np.random.randn(100, 10000)],
    columns=['id'] + ['d_' + str(i) for i in range(10000)])
df1.id = df1.id.astype(int).astype(object)
df2.id = df2.id.astype(int).astype(object)

## some cells are missing in df2
df2.iloc[:, 1:] = df2.iloc[:, 1:].mask(np.random.random(df2.iloc[:, 1:].shape) < .05)

## dask code starts here
import dask.dataframe as dd
from dask.distributed import Client

ddf1 = dd.from_pandas(df1, npartitions=3)
ddf2 = dd.from_pandas(df2, npartitions=3)
ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())   # quantile() defaults to q=0.5, i.e. the column median
ddf.to_csv('train_*.csv', index=None, header=None)
Although all 8 CPUs are invoked, only ~20% of each is utilized. Can I change the code to improve the CPU usage?
Firstly, note that if you don't specify otherwise, Dask will use threads for execution. With threads, only one Python operation can run at a time (the "GIL"), except in lower-level code that explicitly releases the lock. The "merge" operation involves a lot of shuffling of data in memory, and I suspect it only releases the lock some of the time.
Secondly, all of the output is being written to the filesystem, so you will always have a bottleneck here: however fast other processing may be, you still need to feed all of it through the storage bus.
If the CPUs are each working at ~20%, I daresay this is still faster than a single-core version? Put simply, some workloads just parallelise better than others.
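If the GIL really is the limiting factor, one thing worth trying (a sketch, not a guaranteed speed-up; ddf1 and ddf2 are the dask dataframes from the question, and the worker counts are illustrative) is running the same graph on process-based workers, so the pandas-heavy merge/fillna tasks each get their own interpreter:

from dask.distributed import Client

# Process-based workers with one thread each, so GIL-bound pandas code
# can run in parallel across processes instead of contending in one process.
client = Client(n_workers=8, threads_per_worker=1)

ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)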

tensorflow conv2d memory consumption explain?

output = tf.nn.conv2d(input, weights, strides = [1,3,3,1], padding = 'VALID')
My input has shape 200x225x225x1 and the weights have shape 15x15x1x64. Hence, the output has shape 200x71x71x64, since (225-15)/3 + 1 = 71.
TensorBoard shows that this operation consumes 768 MB in total. Assuming it accounts for the size of the input (38.6 MB), weights (0.06 MB) and output (246.2 MB), the total memory consumption should not exceed 300 MB. So where does the rest of the memory consumption come from?
Although I'm not able to reproduce your graph and values based on the information provided, it's possible that you're seeing additional memory usage due to intermediate values materialized during the computation of Conv2D. It's also possible that the instrumentation is incorrect (e.g. reshape operations that do not result in a copy of tensor memory end up duplicating the "memory usage" in the TF Node Stats instrumentation). Without a reproducible test case, it's hard to say more. If you do feel this is a bug in TensorFlow, please do raise an issue on GitHub!
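For reference, here is a quick back-of-the-envelope calculation of the tensor sizes (assuming float32), plus the size of a hypothetical im2col-style patch buffer that some convolution implementations materialize. This only illustrates how an intermediate can dwarf input + output; it is not a claim about what this particular TF kernel actually allocates.

MB = 1024 ** 2
f32 = 4  # bytes per float32 element

input_bytes  = 200 * 225 * 225 * 1 * f32          # ~38.6 MB
weight_bytes = 15 * 15 * 1 * 64 * f32             # ~0.06 MB
output_bytes = 200 * 71 * 71 * 64 * f32           # ~246.2 MB

# A lowered (im2col) convolution builds a patch matrix of shape
# (N * H_out * W_out, KH * KW * C_in) before the matrix multiply.
im2col_bytes = 200 * 71 * 71 * (15 * 15 * 1) * f32  # ~865 MB

for name, b in [('input', input_bytes), ('weights', weight_bytes),
                ('output', output_bytes), ('im2col buffer', im2col_bytes)]:
    print(f'{name:14s} {b / MB:8.1f} MB')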
