Fastest way to load 1.5 million images into a Dask cluster

I'm trying to persist 1.5 million images to a dask cluster as a dask array, and then get some summary stats. I'm following an image processing tutorial from #mrocklin's blog and have edited my script to be a minimally reproducible example:
import time

import dask
import dask.array as da
import numpy as np
from distributed import Client

client = Client()

def get_imgs(num_imgs):
    def get():
        arr = np.random.randint(2000, size=(3, 120, 120)).flatten()
        return arr
    delayed_get = dask.delayed(get)
    return [da.from_delayed(delayed_get(), shape=(3 * 120 * 120,), dtype=np.uint16)
            for num in range(num_imgs)]

imgs = get_imgs(1500000)
imgs = da.stack(imgs, axis=0)
client.persist(imgs)
The persist step causes my jupyter process to crash. Is that because the persist step causes a bunch of operations to be done on each object in the collection, and the collection is too large to fit in memory? So I use scatter instead:
start = time.time()
imgs_future = client.scatter(imgs, broadcast=True)
print(time.time() - start)
But the jupyter process crashes, or the network connection to the scheduler gets lost.
So I tried breaking up the scatter step:
st = time.time()
chunk_size = 50000
chunk_num = 0
chunk_futures = []
start = 0
end = start + chunk_size
is_last_chunk = False

for dataset in client.list_datasets():
    client.unpublish_dataset(dataset)

while True:
    cst = time.time()
    chunk = imgs[start:end]
    cst1 = time.time()
    if start == 0:
        print('loaded chunk in', cst1 - cst)
    if len(chunk) == 0:
        break
    chunk_future = client.scatter(chunk)
    chunk_futures.append(chunk_future)
    dataset_name = "chunk_{}".format(chunk_num)
    client.publish_dataset(**{dataset_name: chunk_future})
    if start == 0:
        print('submitted chunk in', time.time() - cst1)
    start = end
    if is_last_chunk:
        break
    chunk_num += 1
    end = start + chunk_size
    if end > len(imgs):
        is_last_chunk = True
        end = len(imgs)
    if start == end:
        break
    if chunk_num % 5 == 0:
        print('chunk_num', chunk_num, 'start', start)

print('completed in', time.time() - st)
But this approach results in the connection being lost as well. What's the recommended approach to persisting a large image dataset in a cluster in an asynchronous way?
I've looked at the delayed best practices and what jumps out at me is that I may be using too many tasks? So maybe I need to do more batching in each get() call.

Is that because the persist step causes a bunch of operations to be done on each object in the collection, and the collection is too large to fit in memory?
The best way to find out if this is the case is by using Dask's dashboard.
https://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
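As a small aside (not part of the original answer), when using dask.distributed the dashboard address is available directly on the client object:

from distributed import Client

client = Client()
print(client.dashboard_link)  # open this URL in a browser to watch task progress and worker memory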
I'm following an image processing tutorial from #mrocklin's blog
That post is somewhat old. You may also want to take a look at this more recent post:
https://blog.dask.org/2019/06/20/load-image-data
I've looked at the delayed best practices and what jumps out at me is that I may be using too many tasks? So maybe I need to do more batching in each get() call.
Yes, that might be a problem. If you can keep the number of tasks down that would be nice.
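As a concrete illustration (a sketch I'm adding here, not code from the original answer), batching many images into each delayed call keeps the graph at a few thousand tasks instead of 1.5 million; the shapes and random data mirror the question's example:

import dask
import dask.array as da
import numpy as np

def get_batch(n):
    # stand-in for loading `n` real images, flattened to (n, 3*120*120)
    return np.random.randint(2000, size=(n, 3 * 120 * 120)).astype(np.uint16)

def get_imgs_batched(num_imgs, batch_size=1000):
    delayed_get = dask.delayed(get_batch)
    chunks = []
    for start in range(0, num_imgs, batch_size):
        n = min(batch_size, num_imgs - start)
        chunks.append(da.from_delayed(delayed_get(n),
                                      shape=(n, 3 * 120 * 120),
                                      dtype=np.uint16))
    return da.concatenate(chunks, axis=0)

imgs = get_imgs_batched(1_500_000, batch_size=1000)  # ~1500 tasks instead of 1.5 million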

Related

Dask Delayed with xarray - compute() result is still delayed

I tried to perform with Dask and xarray some analysis (e.g. avg) over two datasets, then compute a difference between the two results.
This is my code
cluster = LocalCluster(n_workers=5, threads_per_worker=3, **worker_kwargs)

def calc_avg(path):
    mean = xr.open_mfdataset(path, combine='nested', concat_dim="time",
                             parallel=True, decode_times=False, decode_cf=False
                             )['var'].sel(lat=slice(south, north),
                                          lon=slice(west, east)).mean(dim='time')
    return mean

def diff_(x, y):
    return x - y

p1 = "/path/to/first/multi-file/dataset"
p2 = "/path/to/second/multi-file/dataset"

a = dask.delayed(calc_avg)(p1)
b = dask.delayed(calc_avg)(p2)
total = dask.delayed(diff_)(a, b)
result = total.compute()
The execution time here is 17s.
However, plotting the result (result.plot()) takes more than 1 min, so it seems that the calculation actually happens when trying to plot the result.
Is this the proper way to use Dask delayed?
You're wrapping a call to xr.open_mfdataset, which is itself a dask operation, in a delayed function. So when you call result.compute, you execute the functions calc_avg and mean; however, calc_avg returns a dask-backed DataArray. So yes, the 17s task only converts the scheduled delayed dask graph of calc_avg and mean into a scheduled dask.array graph of open_mfdataset and array operations; the actual data work is still deferred, which is why it only happens when you plot.
To resolve this, drop the delayed wrappers and simply use the dask.array xarray workflow:
a = calc_avg(p1) # this is already a dask array because
# calc_avg calls open_mfdataset
b = calc_avg(p2) # so is this
total = a - b # dask understands array math, so this "just works"
result = total.compute() # execute the scheduled job
See the xarray guide to parallel computing with dask for an introduction.

I'm using Dask to apply LabelingFunction using Snorkel on multiple datasets but it seems to take forever. Is this normal?

My problem is as follows:
I have several datasets (900K, 1.7M and 1.7M entries) in CSV format which I load into multiple Dask DataFrames.
Then I concatenate them all into one Dask DataFrame that I can feed to my Snorkel Applier, which applies a bunch of Labeling Functions to each row of my DataFrame and returns a numpy array with as many rows as there are in the DataFrame and as many columns as there are Labeling Functions.
The call to the Snorkel Applier seems to take forever when I do that with the 3 datasets (more than 2 days...). However, if I run the code with only the first dataset, the call takes around 2 hours (in that case I of course skip the concatenation step).
So I was wondering how this can be. Should I change the number of partitions in the concatenated DataFrame, or am I maybe using Dask badly in the first place?
Here is the code I'm using:
import os
import time
from datetime import timedelta

import dask.dataframe as dd
import numpy as np
from snorkel.labeling.apply.dask import DaskLFApplier

start = time.time()

# lfs are the labeling functions to apply; one of them featurizes one of the columns of my
# DataFrame and applies a sklearn classifier (I set n_jobs to None when loading the model)
applier = DaskLFApplier(lfs)

# If I have only one CSV to read
if isinstance(PATH_TO_CSV, str):
    training_data = dd.read_csv(PATH_TO_CSV, lineterminator=os.linesep,
                                na_filter=False, dtype={'size': 'int32'})
    slices = None
# If I have several CSVs
elif isinstance(PATH_TO_CSV, list):
    training_data_list = [dd.read_csv(path, lineterminator=os.linesep,
                                      na_filter=False, dtype={'size': 'int32'})
                          for path in PATH_TO_CSV]
    training_data = dd.concat(training_data_list, axis=0)
    # bookkeeping so I know where to slice the final result
    # and can assign each part back to its dataset
    df_sizes = [len(df) for df in training_data_list]
    cut_idx = np.insert(np.cumsum(df_sizes), 0, 0)
    slices = list(zip(cut_idx[:-1], cut_idx[1:]))

# The call that lasts forever: everything above runs perfectly fine on my 3 datasets without this line
L_train = applier.apply(training_data)

end = time.time()
print('Time elapsed: {}'.format(timedelta(seconds=end - start)))
If you need more info, I will try to provide as much as I can.
Thanks in advance for your help :)
It seems that by default the applier function uses processes, so it does not benefit from the additional workers you might have available:
# add this to the beginning of your code
from dask.distributed import Client
client = Client()
# you can see the address of the client by typing `client` and opening the dashboard
# skipping your other code
# you need to pass the client explicitly to the applier
# after launching this open the dashboard and watch the workers work :)
L_train = applier.apply(training_data, scheduler=client)

Pytorch: Calculating running time on GPU and CPU of a for loop

I am really new to PyTorch, and I spent the whole day confused trying to figure out why my network runs slower on GPU than on CPU. I don't understand why, when I measure the running time using time.time(), the time of the whole loop is so different from the sum of the individual running times. Here is part of my code. Could anybody help me? Appreciate it!
time_out = 0
time_in = 0
for epoch in tqdm(range(self.n_epoch)):
    running_loss = 0
    running_error = 0
    running_acc = 0
    if self.cuda:
        torch.cuda.synchronize()  # time_out_start
    epst1 = time.time()
    for step, (batch_x, batch_y) in enumerate(self.normal_loader):
        if self.cuda:
            torch.cuda.synchronize()  # time_in_start
        t1 = time.time()
        batch_x, batch_y = batch_x.to(self.device), batch_y.to(self.device)
        b_x = Variable(batch_x)
        b_y = Variable(batch_y)
        pred_y = self.model(b_x)
        #print(pred_y)
        loss = self.criterion(pred_y, b_y)
        error = mae(pred_y.detach().cpu().numpy(), b_y.detach().cpu().numpy())
        acc = r2(b_y.detach().cpu().numpy(), pred_y.detach().cpu().numpy())
        #print(loss)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        running_acc += acc
        running_loss += loss.item()
        running_error += error
        if self.cuda:
            torch.cuda.synchronize()  # time_in_end
        t6 = time.time()
        time_in += t6 - t1
    if self.cuda:
        torch.cuda.synchronize()  # time_out_end
    eped1 = time.time()
    time_out += eped1 - epst1
print('loop time(out)', time_out)
print('loop time(in)', time_in)
The result is:
CPU:
EPOCH 10:  out: 1.283s   in: 0.695s
EPOCH 50:  out: 6.43s    in: 3.288s
EPOCH 100: out: 12.646s  in: 6.386s
GPU:
EPOCH 10:  out: 3.92s    in: 1.471s
EPOCH 50:  out: 9.35s    in: 3.04s
EPOCH 100: out: 18.418s  in: 5.655s
I understand that transferring data from CPU to GPU costs some time, so as the number of epochs goes up, the GPU calculation time should become less than the CPU time. My questions are:
Why is the time I record outside of the loop so different from the one inside? Is there any step I missed when recording the running time?
And why does the GPU cost more outside-time even though the inside-time is already less than the CPU time?
The network is really simple:
class Model(nn.Module):
    def __init__(self, n_input, n_nodes1, n_nodes2):
        super(Model, self).__init__()
        self.n_input = n_input
        self.n_nodes1 = n_nodes1
        self.n_nodes2 = n_nodes2
        self.l1 = nn.Linear(self.n_input, self.n_nodes1)
        self.l2 = nn.Linear(self.n_nodes1, self.n_nodes2)
        self.l3 = nn.Linear(self.n_nodes2, 1)

    def forward(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        h = self.l3(h2)
        return h
The training data is formed as follows (regression problem; the inputs x are descriptors and y is the target value):
def load_train_normal(self, x, y, batch_size=100):
    if batch_size:
        self.batch_size = batch_size
    self.x_train_n, self.y_train_n = Variable(torch.from_numpy(x).float()), Variable(torch.from_numpy(y).float())
    #x, y = Variable(torch.from_numpy(x).float()), Variable(torch.from_numpy(y).float())
    self.dataset = Data.TensorDataset(self.x_train_n, self.y_train_n)
    self.normal_loader = Data.DataLoader(
        dataset=self.dataset,
        batch_size=self.batch_size,
        shuffle=True,
        num_workers=2,
    )
Why is the time I record outside of the loop so different from the one inside? Is there any step I missed when recording the running time?
self.normal_loader is not just a plain dictionary, vector or something as simple as that. Iterating over it takes a significant amount of time.
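One quick way to see this (my own sketch, not from the original answer) is to time a pass over the loader with no model work at all:

import time

t0 = time.time()
for batch_x, batch_y in self.normal_loader:
    pass  # no transfer, no forward/backward; measures only the DataLoader itself
print('pure DataLoader iteration time:', time.time() - t0)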
And why does the GPU cost more outside-time even though the inside-time is already less than the CPU time?
torch.cuda.synchronize() is a heavy operation, even when it doesn't do anything useful, as in this case: pred_y.detach().cpu() had already enforced synchronization.
As to how to make it faster: drop the synchronize() calls, they don't do you any good.
And then defer the processing of pred_y until later. Much later. You want to have called the model at least 2 or 3 times before you trigger the first download of results. The simpler the model and the smaller the data, the more iterations you have to wait.
Because transfers to and from the GPU don't just "take time", they imply synchronization. Without synchronization, the execution model on the GPU mostly "lags behind", with data uploads to the GPU already being asynchronous behind the scenes, and actual execution only being queued behind them. If you don't synchronize by accident or explicitly, workloads start to overlap, stuff (uploads, execution, CPU work) starts running in parallel. Your effective execution time approaches max(upload, download, GPU execution, CPU execution).
If you synchronize, there are no tasks to overlap and no batches to form from same-typed tasks. Upload, execution, download, CPU part: it all happens sequentially. Your execution time ends up being upload + download + GPU execution + CPU execution, plus some additional overhead from breaking batching at the driver level. So it's easily 5-10x slower than it should be.
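To make the deferral concrete, here is a minimal sketch (my own illustration, not code from the thread) of the inner loop with metrics buffered on the GPU and drained only every few steps; it reuses the names from the question's training loop, and the Variable wrappers are dropped since they are no-ops in modern PyTorch:

pending = []  # detached GPU tensors waiting to be turned into metrics

for step, (batch_x, batch_y) in enumerate(self.normal_loader):
    b_x = batch_x.to(self.device, non_blocking=True)
    b_y = batch_y.to(self.device, non_blocking=True)

    pred_y = self.model(b_x)
    loss = self.criterion(pred_y, b_y)

    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    # keep references on the GPU; no .cpu() or .item() here, so no implicit sync
    pending.append((pred_y.detach(), b_y.detach(), loss.detach()))

    # drain the buffer only every few steps, paying the sync cost once per drain
    if len(pending) >= 3:
        for p, t, l in pending:
            p_np, t_np = p.cpu().numpy(), t.cpu().numpy()
            running_error += mae(p_np, t_np)
            running_acc += r2(t_np, p_np)
            running_loss += l.item()
        pending.clear()

# drain whatever is left at the end of the epoch
for p, t, l in pending:
    p_np, t_np = p.cpu().numpy(), t.cpu().numpy()
    running_error += mae(p_np, t_np)
    running_acc += r2(t_np, p_np)
    running_loss += l.item()
pending.clear()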

Efficiently passing additional information per group to Dask dataframe.groupby.apply function

I have a Dask dataframe that is grouped, and then a function is applied to each group. That function uses some pre-calculated metrics from another dataframe as part of its work.
In the actual code, all the data is in parquet datasets loaded from S3 and run on a distributed Dask cluster. Here's a simplified example using csv files.
profiles.csv
company,stat1
1000,10
2000,20
catalog.csv
company,desc
1000,ABC
1000,def
2000,GHI
2000,jkl
code
from dask import dataframe as ddf

profiles_df = ddf.read_csv("profiles.csv").set_index("company")
catalog_df = ddf.read_csv("catalog.csv").set_index("company")

def refine(group_df):
    profile = profiles_df.loc[group_df.name].compute()
    group_df["desc_"] = group_df["desc"].apply(lambda t: f"{t}-{int(profile.stat1)}")
    return group_df

catalog_grouped_df = catalog_df.groupby("company")

refined_catalog_meta = catalog_df._meta.copy()
refined_catalog_meta["desc_"] = None

refined_catalog_df = catalog_grouped_df.apply(refine, meta=refined_catalog_meta)
refined_catalog_df.compute()
This works, except that the source profiles_df csv/parquet is read over and over again for each invocation of refine(group_df). How do I improve this so that profiles_df is read once, and then only the row relevant to each group is passed to or accessed by the refine function?
Update
I've managed to stop the repeated reads from the source Parquet datasets by reading profiles_df once and scattering it. Something like this:
from dask import dataframe as ddf
from dask.distributed import default_client

profiles_df = ddf.read_csv("profiles.csv").set_index("company")
catalog_df = ddf.read_csv("catalog.csv").set_index("company")

def refine(group_df):
    profile = profiles_df.loc[group_df.name].compute()
    group_df["desc_"] = group_df["desc"].apply(lambda t: f"{t}-{int(profile.stat1)}")
    return group_df

profiles_df = default_client().scatter(profiles_df.compute(), broadcast=True)

catalog_grouped_df = catalog_df.groupby("company")

refined_catalog_meta = catalog_df._meta.copy()
refined_catalog_meta["desc_"] = None

refined_catalog_df = catalog_grouped_df.apply(refine, meta=refined_catalog_meta)
refined_catalog_df.compute()
…
The main downside is that profiles_df is read on the calling client and then sent to the scheduler. Is there a way I can get the scheduler or a worker to do the read and scatter?
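One direction I'm considering (an untested sketch, and I'm not sure it slots cleanly into the groupby.apply above) is to have a worker do the read via client.submit and then replicate the resulting future to all workers:

import pandas as pd
from dask.distributed import default_client

client = default_client()

def read_profiles(path):
    # runs on a worker; only a future comes back to the client
    return pd.read_csv(path).set_index("company")

profiles_future = client.submit(read_profiles, "profiles.csv")
client.replicate([profiles_future])  # copy the result to every worker, like broadcast=True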

TensorFlow: does tf.train.batch automatically load the next batch when the batch has finished training?

For instance, after I have created my operations, fed the batch data through the operation and run the operation, does tf.train.batch automatically feed in another batch of data to the session?
I ask this because tf.train.batch has an allow_smaller_final_batch argument which makes it possible for the final batch to be loaded with a size smaller than the indicated batch size. Does this mean that, even without a loop, the next batch could be automatically fed? The tutorial code leaves me rather confused. When I load a single batch, I get literally a single batch of shape [batch_size, height, width, num_channels], but the documentation says it "Creates batches of tensors in tensors". Also, when I read the tutorial code in the tf-slim walkthrough, where there is a function called load_batch, only 3 tensors are returned: images, images_raw, labels. Where are the 'batches' of data as explained in the documentation?
Thank you for your help.
... does tf.train.batch automatically feed in another batch of data to the session?
No. Nothing happens automatically. You must call sess.run(...) again to load a new batch.
Does this mean even without a loop, the next batch could be automatically fed?
No. tf.train.batch(..) will always load batch_size tensors. If you have, for example, 100 images and batch_size=30, then you will get 3 full batches of 30, i.e. you can call sess.run(batch) three times before the input queue starts from the beginning (or stops if epoch=1). This means that you miss out on 100 - 3*30 = 10 samples from training. In case you do not want to miss them, you can do tf.train.batch(..., allow_smaller_final_batch=True); now you will get 3x 30-sample batches and 1x 10-sample batch before the input queue restarts.
Let me also elaborate with a code sample:
queue = tf.train.string_input_producer(filenames,
                                       num_epochs=1)  # only iterate through all samples in the dataset once

reader = tf.TFRecordReader()  # or any reader you need
_, example = reader.read(queue)

image, label = your_conversion_fn(example)

# batch will now load up to 100 image-label pairs on sess.run(...)
# most tf ops are tuned to work on batches
# this is faster and also gives a better result on e.g. gradient calculation
batch = tf.train.batch([image, label], batch_size=100)

with tf.Session() as sess:
    # "boilerplate" code
    sess.run([
        tf.local_variables_initializer(),
        tf.global_variables_initializer(),
    ])

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    try:
        # in most cases coord.should_stop() will return True
        # when there are no more samples to read
        # if num_epochs=0 then it will run forever
        while not coord.should_stop():
            # will start reading and processing data from the input queue
            # and "fetch" the results of the computation graph
            # into raw_images and raw_labels
            raw_images, raw_labels = sess.run(batch)
    finally:
        coord.request_stop()
        coord.join(threads)
You need to call sess.run and pass the batch to it every time you want to load the next batch. See the code below.
img = [0, 1, 2, 3, 4, 5, 6, 7, 8]
lbl = [0, 1, 2, 3, 4, 5, 6, 7, 8]
images = tf.convert_to_tensor(img)
labels = tf.convert_to_tensor(lbl)
input_queue = tf.train.slice_input_producer([images, labels])
sliced_img = input_queue[0]
sliced_lbl = input_queue[1]

img_batch, lbl_batch = tf.train.batch([sliced_img, sliced_lbl], batch_size=3)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(0, 3):  # fetch 3 batches of size 3
        image_batch, label_batch = sess.run([img_batch, lbl_batch])
        print(image_batch, label_batch)

    coord.request_stop()
    coord.join(threads)
The output will be something like this:
[4,1,8] [4,1,8]
[2,3,7] [2,3,7]
[2,6,8] [2,6,8]
I modified the code from https://github.com/tensorflow/models/blob/master/research/slim/slim_walkthrough.ipynb and bodokaiser's answer above. Please note that this is from the evaluation script at https://github.com/tensorflow/models/tree/master/research/slim, eval_image_classifier.py. The most important modification to the eval_image_classifier.py code is to add num_epochs=1 to the DatasetDataProvider line. That way, all the images are accessed exactly once for inference.
provider = slim.dataset_data_provider.DatasetDataProvider(
    dataset,
    shuffle=False,
    common_queue_capacity=2 * FLAGS.batch_size,
    common_queue_min=FLAGS.batch_size,
    num_epochs=1)
[image, label] = provider.get(['image', 'label'])

images, labels = tf.train.batch(
    [image, label],
    batch_size=FLAGS.batch_size,
    num_threads=FLAGS.num_preprocessing_threads,
    capacity=1 * FLAGS.batch_size)

with tf.Session() as sess:
    sess.run([tf.local_variables_initializer(),
              tf.global_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            np_image, np_label = sess.run([images, labels])
    except:
        coord.request_stop()
        coord.join(threads)
