I'm trying to run some code on data that doesn't fit in GPU memory (the same happens with CPU memory; our data is usually stored as a zarr array), and I'm not sure how I could do that with Dask.
I found this example and am following a similar strategy, but I get several warnings like:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 12.85 GiB -- Worker memory limit: 7.45 GiB
and the data is not being processed on the GPU.
For example:
import cupy as cp
import numpy as np
import dask.array as da
from dask_image import ndfilters as dfilters
from dask.distributed import Client
from functools import partial
if __name__ == '__main__':
    client = Client(memory_limit='8GB', processes=False)
    arr = da.from_array(np.zeros((50, 256, 512, 512), dtype=np.uint16), chunks=(1, 64, 256, 256))
    arr = arr.map_blocks(cp.asarray)
    filtering = partial(dfilters.gaussian_filter, sigma=2)
    scattered_data = client.scatter(arr)
    sent = client.submit(filtering, scattered_data)
    filtered = sent.result().compute()
    client.close()
The GPU has 24GB of memory.
Thanks in advance.
To answer the specific question: no, there is no way for Dask to know or control how much memory will be used internally to a task. From Dask's point of view, this is arbitrary code and it is simply "called" by python. Monitoring the total process memory in a separate thread is the best tool available.
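That monitoring can be as simple as a background thread polling the process's resident memory. A minimal sketch, assuming psutil is available (this is my addition, not part of the original answer):
import threading
import time

import psutil

def monitor_rss(interval=1.0, stop_event=None):
    # psutil.Process() with no argument refers to the current process
    proc = psutil.Process()
    while stop_event is None or not stop_event.is_set():
        print(f"process RSS: {proc.memory_info().rss / 2**30:.2f} GiB")
        time.sleep(interval)

stop = threading.Event()
threading.Thread(target=monitor_rss, args=(1.0, stop), daemon=True).start()
# ... run the Dask/CuPy work here ...
stop.set()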
Previously:
Don't do this:
da.from_array(np.zeros((50, 256, 512, 512), dtype=np.uint16), chunks=(1, 64, 256, 256))
You are materialising a large array, chopping it up and shipping it to workers, where it will need to be deserialised before use. Always make your data in the workers if you can, which in this simplistic case would amount to
da.zeros((50, 256, 512, 512), dtype=np.uint16, chunks=(1, 64, 256, 256))
or in the case of zarr by using da.from_zarr.
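As a rough sketch of that zarr route (the store path is a placeholder, and the GPU step is the question's own map_blocks call), the graph stays lazy so each chunk is only read on the worker that processes it:
import cupy as cp
import dask.array as da
from dask_image import ndfilters as dfilters

# '/path/to/data.zarr' is a placeholder; nothing is read until compute(),
# and each chunk is loaded on the worker that processes it.
arr = da.from_zarr('/path/to/data.zarr')
arr = arr.map_blocks(cp.asarray)                   # move chunks to the GPU lazily
filtered = dfilters.gaussian_filter(arr, sigma=2)  # chunk-wise gaussian filter
result = filtered.compute()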
Related
I am passing data normalized with MinMaxScaler to DBSCAN's fit_predict. My data is very small (12 MB, around 180,000 rows and 9 columns). However, while running this, memory usage quickly climbs and the kernel gets killed (I presume by the OOM killer). I even tried it on a server with 256 GB of RAM and it fails fairly quickly.
Here is my repro code:
import pandas as pd
X_ml = pd.read_csv('Xml.csv')
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.28, min_samples=9)
dbscan_pred = dbscan.fit_predict(X_ml)
and here is my Xml.csv data file.
Any ideas how to get it working?
I am using a GPU to run some very large deep learning models. With a batch size of 8 the data fits in memory, but with a batch size of 16 I get a CUDA out-of-memory error and have to kill the process.
My question is: before actually passing the data to the GPU, is there a way to know how much GPU memory the data will occupy?
For example, the following code shows how I create a PyTorch DataLoader and pass each batch to the GPU. Can I know how large a batch is before I call batch.to(device)?
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
for step, batch in enumerate(train_dataloader):
b_input_ids = batch[0].to(device)
b_input_mask = batch[1].to(device)
b_labels = batch[2].to(device)
I would recommend using the torchsummary package here.
pip install torchsummary
and in use
from torchsummary import summary
myModel.cuda()
summary(myModel, (shapeOfInput)) # where shapeOfInput is a tuple of the sample's dimensions
This will give you the size of the model, the size of the forward pass, and the size of the backward pass in MB for a batch size of 1, and you can then multiply by your batch size.
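If you only want the size of the batch tensors themselves (rather than the model and its activations), a small helper of my own can give a rough lower bound before batch.to(device) is called; it ignores CUDA allocator overhead and any copies made later:
def batch_nbytes(batch):
    # raw buffer size of every tensor in the batch, in bytes
    return sum(t.element_size() * t.nelement() for t in batch)

for step, batch in enumerate(train_dataloader):
    print(f"batch {step}: ~{batch_nbytes(batch) / 2**20:.1f} MiB")
    break  # one batch is enough for an estimate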
I'm trying to read a big parquet dataset (it will not fit in memory) and then sample from it. Each partition of the dataset fits comfortably in memory.
The dataset is about 20 GB of data on disk, divided into 104 partitions of about 200 MB each. I don't want to use more than 40 GB of memory at any point, so I'm setting n_workers and memory_limit accordingly.
My hypothesis was that Dask would load as many partitions as it could handle, sample from them, drop them from memory, and then continue loading the next ones. Or something like that.
Instead, judging by the execution graph (104 load operations in parallel, each followed by a sample), it looks like it tries to load all partitions simultaneously, and therefore the workers keep getting killed for running out of memory.
Am I missing something?
This is my code:
from datetime import datetime
from dask.distributed import Client
client = Client(n_workers=4, memory_limit=10e9)  # 10 GB per worker
import dask.dataframe as dd
df = dd.read_parquet('/path/to/dataset/')
df = df.sample(frac=0.01)
df = df.compute()
To reproduce the error, you can create a mock dataset 1/10th the size of the one I was trying to load using the code below, and run my code with memory_limit=1e9 (1 GB) to compensate.
from dask.distributed import Client
client = Client() #add restrictions depending on your system here
from dask import datasets
df = datasets.timeseries(end='2002-12-31')
df = df.repartition(npartitions=104)
df.to_parquet('./mock_dataset')
Parquet is an efficient binary format, with encoding and compression. There is a very good chance that in memory, it takes up far more space than you think.
In order to sample the data at 1%, each partition is loaded and expanded into memory in its entirety before being sub-selected. This comes with the considerable memory overhead of buffer copies. Each worker thread needs to accommodate the currently-processed chunk as well as the results accumulated so far on that worker, and then a task will copy all of these for the final concat operation (which also involves copies and overhead).
The general recommendation is that each worker should have access to "several times" the in-memory size of each partition; in your case the partitions are ~200 MB on disk and considerably bigger in memory.
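One way to see how much a single partition actually expands to is to load just one of them and ask pandas for its deep memory usage; this is a quick check of my own, reusing the placeholder path from the question:
import dask.dataframe as dd

df = dd.read_parquet('/path/to/dataset/')
part = df.get_partition(0).compute()   # load only the first partition
print(part.memory_usage(deep=True).sum() / 2**20, "MiB in memory")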
I want to merge two Dask dataframes, impute missing values with the column median, and export the merged dataframe to CSV files.
The problem: my current code cannot fully utilize all 8 CPUs (only ~20% of each CPU is used).
I am not sure which part limits the CPU usage. Here is reproducible code:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(
np.c_[(np.random.randint(100, size=(10000, 1)), np.random.randn(10000, 3))],
columns=['id', 'a', 'b', 'c'])
df2 = pd.DataFrame(
np.c_[(np.array(range(100)), np.random.randn(100, 10000))],
columns=['id'] + ['d_' + str(i) for i in range(10000)])
df1.id=df1.id.astype(int).astype(object)
df2.id=df2.id.astype(int).astype(object)
## some cells are missing in df2
df2.iloc[:, 1:] = df2.iloc[:,1:].mask(np.random.random(df2.iloc[:, 1:].shape) < .05)
## dask codes starts here
import dask.dataframe as dd
from dask.distributed import Client
ddf1 = dd.from_pandas(df1, npartitions=3)
ddf2 = dd.from_pandas(df2, npartitions=3)
ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)
Although all 8 CPUs are invoked, only ~20% of each is utilized. Can I change the code to improve CPU usage?
Firstly, note that if you don't specify otherwise, Dask will use threads for execution. With threads, only one Python operation can run at a time (the GIL), except in some lower-level code that explicitly releases the lock. The merge operation involves a lot of shuffling of data in memory, and I suspect it releases the lock only some of the time.
Secondly, all of the output is being written to the filesystem, so you will always have a bottleneck here: however fast other processing may be, you still need to feed all of it through the storage bus.
If the CPUs are working at ~20%, I daresay this is still faster than a single-core version? Put simply, some workloads just parallelise better than others.
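If you want to test whether the GIL is the limiting factor, one option (my addition, not something the original answer prescribes) is to run the same code on a process-based cluster, which avoids the GIL at the cost of serialising data between workers:
from dask.distributed import Client

if __name__ == '__main__':
    client = Client(n_workers=4, threads_per_worker=2, processes=True)
    # ... build ddf exactly as above, then:
    # ddf.to_csv('train_*.csv', index=None, header=None)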
I am using Keras with the TensorFlow backend on Windows 7 with an NVIDIA Quadro M2000M GPU.
When I initialize my model, which contains 5 GRU, 5 Dropout and 1 Dense layers, GPU memory usage jumps to 3800 MB of 4096 MB and stays there until I restart my Spyder session. Clearing the session within Spyder with:
K.clear_session()
does not work; the memory usage stays at that high level.
Is it normal for such a model to allocate this much GPU memory? What can I change so the memory is used properly? I want to improve training speed, and I think this high memory usage prevents the GPU from reaching its full potential.
Update
My model looks like this:
model = Sequential()
layers = [1, 70, 50, 100, 50, 20, 1]
model.add(GRU(
    layers[1],
    #batch_size = 32,
    input_shape=(sequence_length, anzahl_features),
    return_sequences=True))
model.add(Dropout(dropout_1))
model.add(GRU(
    layers[2],
    #batch_size = 32,
    return_sequences=True))
model.add(Dropout(dropout_2))
model.add(GRU(
    layers[3],
    #batch_size = 32,
    return_sequences=True))
model.add(Dropout(dropout_3))
model.add(GRU(
    layers[4],
    #batch_size = 32,
    return_sequences=True))
model.add(Dropout(dropout_4))
model.add(GRU(
    layers[5],
    #batch_size = 32,
    return_sequences=False))
model.add(Dropout(dropout_5))
model.add(Dense(layers[6]))
model.add(Activation('sigmoid'))
My feature matrix has the size 506x500x35 (506 examples, sequence length 500, and 35 features). The batch size is set to 128. Side note: I am not saying this is the perfect feature matrix or model configuration.
Here is also a screenshot of GPU-Z, taken after I restarted Spyder and ran the model up to the second epoch:
By default, TensorFlow allocates the whole GPU memory.
If you want better control over GPU memory usage, you can use one of these methods:
the per_process_gpu_memory_fraction config option, or
the allow_growth config option.
Here is the full piece of code for controlling memory usage on your GPU:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))
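If you are on TensorFlow 2.x, where Session and ConfigProto no longer exist, the rough equivalent of allow_growth (my addition, not part of the original answer) is:
import tensorflow as tf

# request memory growth per physical GPU instead of grabbing it all up front
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)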
GPU training speed may also depend on your batch size. Is the batch size the same on CPU and GPU? Is CUDA installed correctly?