dask dataframe: merging two dataframes, imputing missing values and writing to csv only uses part of each CPU (~20%)

I want to merge two dask dataframes, impute missing values with column median and export the merged dataframe to csv files.
I have one problem: my current code cannot fully utilize the 8 CPUs (only ~20% of each CPU is used).
I am not sure which part limits the CPU usage. Here is the reproducible code:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(
np.c_[(np.random.randint(100, size=(10000, 1)), np.random.randn(10000, 3))],
columns=['id', 'a', 'b', 'c'])
df2 = pd.DataFrame(
np.c_[(np.array(range(100)), np.random.randn(100, 10000))],
columns=['id'] + ['d_' + str(i) for i in range(10000)])
df1.id=df1.id.astype(int).astype(object)
df2.id=df2.id.astype(int).astype(object)
## some cells are missing in df2
df2.iloc[:, 1:] = df2.iloc[:,1:].mask(np.random.random(df2.iloc[:, 1:].shape) < .05)
## dask code starts here
import dask.dataframe as dd
from dask.distributed import Client
ddf1 = dd.from_pandas(df1, npartitions=3)
ddf2 = dd.from_pandas(df2, npartitions=3)
ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)
Although all 8 CPUs are in use, only ~20% of each is utilized. How can I change the code to improve CPU usage?

Firstly, note that if you don't specify otherwise, Dask will use threads for execution. With threads, only one Python operation can run at a time (because of the "GIL"), except for some lower-level code which explicitly releases the lock. The "merge" operation involves a lot of shuffling of data in memory, and I suspect it releases the lock only some of the time.
Secondly, all of the output is being written to the filesystem, so you will always have a bottleneck here: however fast other processing may be, you still need to feed all of it through the storage bus.
If the CPUs are working ~20%, I daresay this is still faster than a single-core version? Put simply, some workloads just parallelise better than others.

Related

Best way to pick numerous slices from a Dask array

I'm generating a large (65k x 65k x 3) 3D signal distributed among several nodes using Dask arrays.
In the next step, I need to extract a few thousand tiles from this array using slices stored in a Dask bag. My code looks like this:
import numpy as np
import dask.array as da
import dask.bag as db
from dask.distributed import Client
def pick_tile(window, signal):
    # slice the signal with the given window and materialize it as a NumPy array
    return np.array(signal[window])
def computation_on_tile(signal_tile):
    # do some rather short computation on a (n x n x 3) signal tile.
    ...
dask_client = Client(....)
signal_array = generate_signal(...) # returns a dask array
signal_slices = db.from_sequence(generate_slices(...)) # fixed size slices
signal_tiles = signal_slices.map(pick_tile, signal=signal_array)
result = dask_client.compute(signal_tiles.map(computation_on_tile), sync=True)
My issue is that the computation takes a lot of time. I tried to scatter my signal array using:
signal_array = dask_client.scatter(generate_signal(...))
But it doesn't help performance (~12 min. to compute). In comparison, the computation of the full signal and the stdev of the first layer takes approximately 2 minutes.
Is there an efficient way to pick a lot of slices from a distributed Dask array?
If you have only a few thousand slices then I recommend using a normal Python list rather than Dask Bag. It will likely be much faster and much simpler.
Then you can slice your array many times:
tiles = [dask_array[slc] for slc in slices]
And compute these if you want
tiles = dask.compute(*tiles)
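The approach above can be sketched end to end as follows. The array size, chunking, tile size and window corners are illustrative assumptions, not values from the question.

```python
import numpy as np
import dask
import dask.array as da

# Small stand-in for the 65k x 65k x 3 signal (sizes are illustrative).
signal = da.from_array(np.arange(64 * 64 * 3).reshape(64, 64, 3),
                       chunks=(16, 16, 3))

# Fixed-size windows as plain tuples of slices, kept in a Python list.
tile_size = 8
corners = [(0, 0), (10, 20), (40, 56)]
slices = [(slice(r, r + tile_size), slice(c, c + tile_size))
          for r, c in corners]

# Each indexing operation is lazy; nothing is computed yet.
lazy_tiles = [signal[slc] for slc in slices]

# One compute call evaluates all tiles and returns a tuple of NumPy arrays.
tiles = dask.compute(*lazy_tiles)
```

Computing all tiles in a single `dask.compute` call lets Dask share the work of loading chunks that several tiles overlap.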

dask read_parquet runs out of memory

I'm trying to read a big (will not fit in memory) parquet dataset, and then sample from it. Each partition of the dataset fits perfectly in memory.
The dataset is about 20GB of data on disk, divided into 104 partitions of about 200MB each. I don't want to use more than 40GB of memory at any point, so I'm setting n_workers and memory_limit accordingly.
My hypothesis was that Dask would load as many partitions as it could handle, sample from them, drop them from memory and then continue loading the next ones. Or something like that.
Instead, judging by the execution graph (104 load operations in parallel, after each a sample), it looks like it tries to load all partitions simultaneously, and therefore the workers keep getting killed for running out of memory.
Am I missing something?
This is my code:
from datetime import datetime
from dask.distributed import Client
client = Client(n_workers=4, memory_limit=10e9)  # 10 GB per worker
import dask.dataframe as dd
df = dd.read_parquet('/path/to/dataset/')
df = df.sample(frac=0.01)
df = df.compute()
To reproduce the error, you can create a mock dataset 1/10th the size of the one I was trying to load using the code below, and run my code with memory_limit=1e9 (1 GB) to compensate.
from dask.distributed import Client
client = Client() #add restrictions depending on your system here
from dask import datasets
df = datasets.timeseries(end='2002-12-31')
df = df.repartition(npartitions=104)
df.to_parquet('./mock_dataset')
Parquet is an efficient binary format, with encoding and compression. There is a very good chance that in memory, it takes up far more space than you think.
In order to sample the data at 1%, each partition is loaded and expanded into memory in its entirety before being sub-selected. This comes with considerable memory overhead from buffer copies. Each worker thread needs to accommodate the chunk it is currently processing, as well as the results accumulated so far on that worker, and then a task will copy all of these for the final concat operation (which also involves copies and overhead).
The general recommendation is that each worker should have access to "several times" the in-memory size of each partition, and in your case, those are ~200MB on disk and bigger in memory.

Lazy evaluation of Dask arrays to avoid temporaries

Coming from C++, I am used to libraries using expression templates where matrix operations like:
D = A*(B+C)
do not create temporaries and the element-wise
D(i,j) = A(i,j)*(B(i,j)+C(i,j))
operation is done inside the loop without creating temporary matrices for the operations in the right hand side.
Is this possible with Dask arrays? Does Dask's "lazy evaluation" also do this, or does the term just refer to computing the operation graph on demand?
Thanks.
As of 2018-11-11 the answer is "yes, dask array avoids full temporaries at the large scale but, no, it doesn't avoid allocating temporaries at the Numpy/blockwise level".
Dask arrays are composed of many Numpy arrays. And Dask array operations are achieved by performing those operations on the Numpy array chunks. When you do A * (B + C) that operation happens on every matching set of numpy array chunks as numpy would perform the operation, which includes allocating temporaries.
However, because Dask can operate chunk-wise it doesn't have to allocate all of the (B + C) chunks before moving on.
You're correct that because Dask is lazy it has an opportunity to be more clever than Numpy here. You can track progress on this issue here: https://github.com/dask/dask/issues/4038
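A small sketch of the chunk-wise behaviour described above, using tiny arrays and an arbitrary 2 x 2 chunking chosen only for illustration:

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
a, b, c = (rng.random((4, 4)) for _ in range(3))

# Wrap each NumPy array as a Dask array made of 2 x 2 chunks.
A, B, C = (da.from_array(x, chunks=(2, 2)) for x in (a, b, c))

# Building the expression only records a task graph; no data is touched.
D = A * (B + C)

# compute() evaluates the graph chunk by chunk. Within each chunk NumPy
# still allocates a temporary for (B + C), but only a chunk-sized one.
d = D.compute()
```

The result matches `a * (b + c)` computed directly in NumPy; the difference is only in how much intermediate data has to exist at once.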

KNN classifier taking too much time even on gpu

I am classifying the MNIST digits using KNN on Kaggle, but the last step is taking too much time to execute, even though the MNIST data is just 15 MB. I am still waiting for it to finish. Can you point out any problem in my code? Thanks.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
print(os.listdir("../input"))
# Loading dataset (train and test files were swapped in the original snippet)
train=pd.read_csv('../input/mnist_train.csv')
test=pd.read_csv('../input/mnist_test.csv')
X_train=train.drop('label',axis=1)
y_train=train['label']
X_test=test.drop('label',axis=1)
y_test=test['label']
from sklearn.neighbors import KNeighborsClassifier
clf=KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train,y_train)
accuracy=clf.score(X_test,y_test)
accuracy
There isn't anything wrong with your code per se. KNN is just a slow algorithm. It's slow for you because computing distances between images is expensive at scale, and because the problem is large enough that your cache can't really be used effectively.
Without using a different library or coding your own GPU kernel, you can probably get a speed boost by replacing
clf=KNeighborsClassifier(n_neighbors=3)
with
clf=KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
to at least use all of your cores.
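A minimal runnable sketch of that change is shown below. It uses the small digits dataset bundled with scikit-learn as a stand-in for the Kaggle MNIST CSV files, so the timings are not comparable to the question's; the split parameters are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small 8x8 digits dataset bundled with scikit-learn (not the Kaggle CSVs).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# n_jobs=-1 asks scikit-learn to use all available cores for neighbor searches.
clf = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Most of the time is spent in `score` (prediction), not in `fit`, so that is where the extra cores help.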
You are not actually using the GPU on Kaggle: scikit-learn's KNeighborsClassifier does not support GPUs.
To use the GPU for KNN you need a library that supports it, and you need to request the GPU explicitly, otherwise it defaults to the CPU. For example, SimbSIG provides its own KNeighborsClassifier; the documentation is here: https://simbsig.readthedocs.io/en/latest/KNeighborsClassifier.html
knn = KNeighborsClassifier(n_neighbors=3, device = 'gpu')

Use already done computation wisely

Say I have a dask dataframe df and I apply some computations to it.
Mathematically,
df1 = f1(df)
df2 = f2(df1)
df3 = f3(df1)
If I run df2.compute() and then df1.compute(), how can I stop dask from recomputing the result of df1?
In the other case, if I run df3.compute() and then df2.compute(), how can I tell dask to reuse the value of df1 that was already computed during df3.compute() when running df2.compute()?
You can use dask.persist to create a dask dataframe whose subgraph has already been computed (or, with the distributed scheduler, is being computed in the background).
If you are using the local scheduler then you should take a look at dask.cache.Cache:
from dask.cache import Cache
cache = Cache(4e9).register()
