Suppose I've got a Dask dataframe df, and I apply some computations to it.
Mathematically,
df1 = f1(df)
df2 = f2(df1)
df3 = f3(df1)
Now if I run df2.compute(), and after that I run df1.compute(), how can I stop Dask from recomputing the result of df1?
Taking the other case: if I run df3.compute() and then df2.compute(), how can I tell Dask to reuse the value of df1 (already computed during df3.compute()) when running df2.compute()?
You can use dask.persist to create a Dask dataframe whose subgraph has already been computed (or is computing in the background).
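For the first case, a minimal sketch of what that might look like, assuming f1, f2, and f3 are ordinary Dask dataframe operations:

# Persist df1 so its computed partitions stay in memory; df2 and df3 then
# build on the persisted data instead of recomputing f1(df).
df1 = f1(df).persist()

df2 = f2(df1)
df3 = f3(df1)

df2.compute()  # reuses the persisted df1 partitions
df3.compute()  # likewise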
If you are using the local scheduler then you should take a look at dask.cache.Cache:

from dask.cache import Cache

cache = Cache(4e9)  # opportunistic cache holding up to ~4 GB of intermediate results
cache.register()    # register() returns None, so keep the Cache object in its own variable
I'm generating a large (65k x 65k x 3) 3D signal distributed among several nodes using Dask arrays.
In the next step, I need to extract a few thousands tiles from this array using slices stored in a Dask bag. My code looks like this:
import numpy as np
import dask.array as da
import dask.bag as db
from dask.distributed import Client

def pick_tile(window, signal):
    return np.array(signal[window])

def computation_on_tile(signal_tile):
    ...  # do some rather short computation on a (n x n x 3) signal tile

dask_client = Client(....)

signal_array = generate_signal(...)                      # returns a dask array
signal_slices = db.from_sequence(generate_slices(...))   # fixed-size slices
signal_tiles = signal_slices.map(pick_tile, signal=signal_array)
result = dask_client.compute(signal_tiles.map(computation_on_tile), sync=True)
My issue is that the computation takes a lot of time. I tried to scatter my signal array using:
signal_array = dask_client.scatter(generate_signal(...))
But it doesn't help performance (~12 min. to compute). In comparison, the computation of the full signal and the stdev of the first layer takes approximately 2 minutes.
Is there an efficient way to pick a lot of slices from a distributed Dask array?
If you have only a few thousand slices then I recommend using a normal Python list rather than Dask Bag. It will likely be much faster and much simpler.
Then you can slice your array many times:
tiles = [dask_array[slc] for slc in slices]
And compute them if you want:
tiles = dask.compute(*tiles)
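To also run the per-tile computation from the question, one option (a sketch, assuming computation_on_tile, generate_signal, and generate_slices are the functions from the question) is to wrap it with dask.delayed:

import dask

signal_array = generate_signal(...)      # dask array, as in the question
slices = list(generate_slices(...))      # a plain Python list, not a Bag

tiles = [signal_array[slc] for slc in slices]                        # lazy slices
lazy_results = [dask.delayed(computation_on_tile)(t) for t in tiles]
results = dask.compute(*lazy_results)    # one call, so underlying chunks are shared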
I want to merge two Dask dataframes, impute missing values with the column median, and export the merged dataframe to CSV files.
The problem is that my current code cannot fully utilize all 8 CPUs (only ~20% of each CPU is used).
I am not sure which part limits the CPU usage. Here is a reproducible example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(
    np.c_[(np.random.randint(100, size=(10000, 1)), np.random.randn(10000, 3))],
    columns=['id', 'a', 'b', 'c'])
df2 = pd.DataFrame(
    np.c_[(np.array(range(100)), np.random.randn(100, 10000))],
    columns=['id'] + ['d_' + str(i) for i in range(10000)])
df1.id = df1.id.astype(int).astype(object)
df2.id = df2.id.astype(int).astype(object)
## some cells are missing in df2
df2.iloc[:, 1:] = df2.iloc[:,1:].mask(np.random.random(df2.iloc[:, 1:].shape) < .05)
## dask codes starts here
import dask.dataframe as dd
from dask.distributed import Client
ddf1 = dd.from_pandas(df1, npartitions=3)
ddf2 = dd.from_pandas(df2, npartitions=3)
ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)
Although all 8 CPUs are invoked, only ~20% of each CPU is utilized. Can I change the code to improve the CPU usage?
Firstly, note that if you don't specify otherwise, Dask will use threads for execution. With threads, only one Python operation can occur at a time (the "GIL"), except in some lower-level code which explicitly releases the lock. The merge operation involves a lot of shuffling of data in memory, and I suspect it releases the lock only some of the time.
Secondly, all of the output is being written to the filesystem, so you will always have a bottleneck there: however fast the other processing may be, you still need to feed all of it through the storage bus.
If the CPUs are working at ~20%, I daresay this is still faster than a single-core version? Put simply, some workloads just parallelise better than others.
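If the GIL is indeed the limiting factor, one thing worth trying (a sketch, not a guaranteed fix) is the distributed scheduler with single-threaded worker processes, so that each process has its own GIL:

from dask.distributed import Client

# Local cluster of 8 single-threaded worker processes (hypothetical sizing;
# adjust n_workers / threads_per_worker to your machine).
client = Client(n_workers=8, threads_per_worker=1)

ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)

Whether this helps depends on how much of the time is spent in the merge versus writing the CSV files.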
Although my data-frame has float values everywhere, when I pass it through k-means it complains that it couldn't convert a string to float.
How can I convert NaN values, if any, to float values across the entire data-frame?
This should do the job: convert all string-format columns to categorical codes (or alternatively one-hot encode the variables in those columns).
import numpy as np
import pandas
from sklearn.cluster import KMeans

df = pandas.read_csv('zipIncome.csv')
print(df)

# Replace col_name with the name of each string-typed column.
df[col_name] = df[col_name].astype('category')
df[col_name] = df[col_name].cat.codes

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600, algorithm='auto').fit(df)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
Based on your code, it would seem that you only instantiated KMeans but haven't used it.
You'll need input data X that is clean (i.e. no strings etc.):
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto')
clusters = kmeans.fit_predict(X)
Now clusters holds the cluster number for each sample in X.
(Alternatively, you can call fit(X) and later predict(X) separately, but ultimately it is the predict step that outputs the cluster labels you need.)
If you later want to get clusters for new data, you should use kmeans.predict(new_data) rather than fit_predict(), so that KMeans applies what it learned from X to your new_data (or, depending on your needs, you might want to retrain it).
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is the name of your new column; you can of course call it whatever you want.
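Putting the pieces together, a rough end-to-end sketch (assuming df is the data-frame from the question and col_name names its string-typed column):

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv('zipIncome.csv')

# Encode the string column as integer codes so every feature is numeric.
df[col_name] = df[col_name].astype('category').cat.codes

# Fill any remaining NaNs (here with the column median) so KMeans gets clean input.
X = df.fillna(df.median())

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600)
clusters = kmeans.fit_predict(X)

df['cluster'] = clusters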
I'm dealing with CIFAR10 and I use torchvision.datasets to create it. I need the GPU to accelerate the calculation, but I can't find a way to put the whole dataset onto the GPU at one time. My model needs mini-batches, and dealing with each batch separately is really time-consuming.
I've tried to put each mini-batch onto the GPU separately, but it seems really time-consuming.
TL;DR
You won't save time by moving the entire dataset at once.
I don't think you'd necessarily want to do that even if you have the GPU memory to handle the entire dataset (of course, CIFAR10 is tiny by today's standards).
I tried various batch sizes and timed the transfer to GPU as follows:
from time import time
from torch.utils.data import DataLoader

num_workers = 1  # Set this as needed

def time_gpu_cast(batch_size=1):
    # dataset is the CIFAR10 dataset from the question
    start_time = time()
    for x, y in DataLoader(dataset, batch_size, num_workers=num_workers):
        x.cuda(); y.cuda()
    return time() - start_time

# Try various batch sizes
cast_times = [(2 ** bs, time_gpu_cast(2 ** bs)) for bs in range(15)]

# Try the entire dataset like you want to do
cast_times.append((len(dataset), time_gpu_cast(len(dataset))))

plot(*zip(*cast_times))  # Plot the time taken (e.g. matplotlib.pyplot.plot)
For num_workers = 1, this is what I got:
And if we try parallel loading (num_workers = 8), it becomes even clearer:
I've got an answer and I'm going to try it later. It seems promising.
You can write a dataset class where, in the __init__ function, you read the entire dataset, apply all the transformations you need, and convert the samples to tensor format. Then send this tensor to the GPU (assuming there is enough memory). Then, in the __getitem__ function, you can simply use the index to retrieve elements of that tensor, which is already on the GPU.
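A minimal sketch of that idea (assuming the CIFAR10 setup from the question; the class name GPUCIFAR10 is just illustrative):

import torch
from torch.utils.data import Dataset
from torchvision import datasets, transforms

class GPUCIFAR10(Dataset):
    def __init__(self, root, train=True, device='cuda'):
        base = datasets.CIFAR10(root, train=train, download=True,
                                transform=transforms.ToTensor())
        # Stack everything once and push it to the GPU up front.
        self.data = torch.stack([img for img, _ in base]).to(device)
        self.targets = torch.tensor(base.targets, device=device)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        # Index tensors that already live on the GPU; no host-to-device copy here.
        return self.data[idx], self.targets[idx]

If you wrap this in a DataLoader, keep num_workers=0, since worker processes don't play well with tensors that already live on the GPU.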
Coming from C++, I am used to libraries using expression templates where matrix operations like:
D = A*(B+C)
do not create temporaries and the element-wise
D(i,j) = A(i,j)*(B(i,j)+C(i,j))
operation is done inside the loop without creating temporary matrices for the operations in the right hand side.
Is this possible with Dask arrays? Does Dask's "lazy evaluation" also do this, or does the term just refer to on-demand computation of the operation graph?
Thanks.
As of 2018-11-11 the answer is "yes, dask array avoids full temporaries at the large scale but, no, it doesn't avoid allocating temporaries at the Numpy/blockwise level".
Dask arrays are composed of many Numpy arrays, and Dask array operations are carried out by performing those operations on the Numpy chunks. When you do A * (B + C), that operation happens on every matching set of Numpy chunks exactly as Numpy would perform it, which includes allocating temporaries.
However, because Dask can operate chunk-wise, it doesn't have to allocate all of the (B + C) chunks before moving on.
You're correct that, because Dask is lazy, it has an opportunity to be more clever than Numpy here. You can track progress on this issue here: https://github.com/dask/dask/issues/4038
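To make the chunk-wise behaviour concrete, here is a small sketch (array and chunk sizes are just illustrative):

import dask.array as da

# Three 20000 x 20000 arrays split into 1000 x 1000 chunks.
A = da.random.random((20000, 20000), chunks=(1000, 1000))
B = da.random.random((20000, 20000), chunks=(1000, 1000))
C = da.random.random((20000, 20000), chunks=(1000, 1000))

# Nothing is computed yet; this only builds a task graph.
D = A * (B + C)

# Each chunk of D is computed as a_chunk * (b_chunk + c_chunk), one chunk at a
# time, so (B + C) is materialised per chunk, never as a full 20000 x 20000 array.
result = D.sum().compute()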