Coming from C++, I am used to libraries using expression templates where matrix operations like:
D = A*(B+C)
do not create temporaries and the element-wise
D(i,j) = A(i,j)*(B(i,j)+C(i,j))
operation is done inside a single loop, without creating temporary matrices for the operations on the right-hand side.
Is this possible with Dask arrays? Does Dask's "lazy evaluation" also do this, or does that term just refer to on-demand computation of the operation graph?
Thanks.
As of 2018-11-11 the answer is "yes, dask array avoids full temporaries at the large scale but, no, it doesn't avoid allocating temporaries at the Numpy/blockwise level".
Dask arrays are composed of many Numpy arrays, and Dask array operations are achieved by performing those operations on the Numpy chunks. When you do A * (B + C), that operation happens on every matching set of numpy chunks exactly as numpy would perform it, which includes allocating temporaries.
However, because Dask can operate chunk-wise it doesn't have to allocate all of the (B + C) chunks before moving on.
You're correct that because Dask is lazy it has an opportunity to be more clever than Numpy here. You can track progress on this issue here: https://github.com/dask/dask/issues/4038
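For illustration, here is a minimal sketch of what this means in practice (the array sizes and chunk shapes are arbitrary):

import dask.array as da

# Three dask arrays, each split into 1000x1000 numpy chunks
A = da.random.random((4000, 4000), chunks=(1000, 1000))
B = da.random.random((4000, 4000), chunks=(1000, 1000))
C = da.random.random((4000, 4000), chunks=(1000, 1000))

D = A * (B + C)      # builds a task graph; nothing is allocated yet

# On compute, each chunk is processed independently: numpy allocates a
# temporary for (B + C) per 1000x1000 chunk, but never for the full array
result = D.compute()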
Related
I'm generating a large (65k x 65k x 3) 3D signal distributed among several nodes using Dask arrays.
In the next step, I need to extract a few thousands tiles from this array using slices stored in a Dask bag. My code looks like this:
import numpy as np
import dask.array as da
import dask.bag as db
from dask.distributed import Client

def pick_tile(window, signal):
    return np.array(signal[window])

def computation_on_tile(signal_tile):
    # do some rather short computation on an (n x n x 3) signal tile
    ...

dask_client = Client(....)
signal_array = generate_signal(...)  # returns a dask array
signal_slices = db.from_sequence(generate_slices(...))  # fixed-size slices
signal_tiles = signal_slices.map(pick_tile, signal=signal_array)
result = dask_client.compute(signal_tiles.map(computation_on_tile), sync=True)
My issue is that the computation takes a lot of time. I tried to scatter my signal array using:
signal_array = dask_client.scatter(generate_signal(...))
But it doesn't help performance (~12 min. to compute). In comparison, the computation of the full signal and the stdev of the first layer takes approximately 2 minutes.
Is there an efficient way to pick a lot of slices from a distributed Dask array?
If you have only a few thousand slices then I recommend using a normal Python list rather than Dask Bag. It will likely be much faster and much simpler.
Then you can slice your array many times:
tiles = [dask_array[slc] for slc in slices]
And compute them all at once if you want:
tiles = dask.compute(*tiles)
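Putting it together, a minimal sketch (the array shape and slice generation are illustrative):

import dask
import dask.array as da

x = da.random.random((65_000, 65_000, 3), chunks=(1000, 1000, 3))

# A plain Python list of slices; no dask.bag needed
slices = [(slice(i, i + 256), slice(0, 256)) for i in range(0, 65_000, 256)]

tiles = [x[slc] for slc in slices]   # lazy: each tile is a small dask array
tiles = dask.compute(*tiles)         # materialize them all in one pass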
I want to merge two Dask dataframes, impute missing values with the column median, and export the merged dataframe to CSV files.
I've got one problem: my current code cannot fully utilize all 8 CPUs (~20% usage on each CPU).
I am not sure which part limits the CPU usage. Here is a reproducible example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(
    np.c_[(np.random.randint(100, size=(10000, 1)), np.random.randn(10000, 3))],
    columns=['id', 'a', 'b', 'c'])
df2 = pd.DataFrame(
    np.c_[(np.array(range(100)), np.random.randn(100, 10000))],
    columns=['id'] + ['d_' + str(i) for i in range(10000)])
df1.id = df1.id.astype(int).astype(object)
df2.id = df2.id.astype(int).astype(object)
## some cells are missing in df2
df2.iloc[:, 1:] = df2.iloc[:,1:].mask(np.random.random(df2.iloc[:, 1:].shape) < .05)
## dask codes starts here
import dask.dataframe as dd
from dask.distributed import Client
ddf1 = dd.from_pandas(df1, npartitions=3)
ddf2 = dd.from_pandas(df2, npartitions=3)
ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)
Although all 8 CPUs are invoked, only ~20% of each is utilized. Can I change the code to improve the CPU usage?
Firstly, note that if you don't specify otherwise, Dask will use threads for execution. Within threads, only one Python operation can occur at a time (the "GIL"), except in lower-level code which explicitly releases the lock. The "merge" operation involves a lot of shuffling of data in memory, and I suspect it releases the lock some of the time.
Secondly, all of the output is being written to the filesystem, so you will always have a bottleneck here: however fast other processing may be, you still need to feed all of it through the storage bus.
If the CPUs are each working at ~20%, I daresay this is still faster than a single-core version? Put simply, some workloads just parallelise better than others.
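If the GIL does turn out to be the bottleneck, one thing you could try is process-based execution; a minimal sketch (the worker counts are illustrative):

import dask
from dask.distributed import Client

# Option 1: the multiprocessing scheduler
dask.config.set(scheduler='processes')

# Option 2: a local distributed cluster with one thread per worker process,
# so pure-Python work is not serialized by the GIL
client = Client(n_workers=8, threads_per_worker=1)

Note that neither option removes the filesystem bottleneck from the second point above.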
Suppose I've got a Dask dataframe df and I apply some computations to it.
Mathematically,
df1 = f1(df)
df2 = f2(df1)
df3 = f3(df1)
Now if I run df2.compute() and after that df1.compute(), how can I stop Dask from recomputing the result of df1?
Taking the other case, if I run df3.compute() and then df2.compute(), how can I tell Dask to reuse the already computed value of df1 (computed during df3.compute()) when running df2.compute()?
You can use dask.persist to create a Dask dataframe whose subgraph has been computed (or is computing in the background).
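For example, a minimal sketch using the f1/f2/f3 names from the question:

df1 = f1(df)
df1 = df1.persist()   # compute df1's subgraph once and keep it in memory
df2 = f2(df1)
df3 = f3(df1)

df2.compute()         # uses the persisted df1
df3.compute()         # reuses it; df1 is not recomputed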
If you are using the local scheduler then you should take a look at dask.cache.Cache
from dask.cache import Cache

cache = Cache(4e9)  # cache up to 4 GB of intermediate results
cache.register()    # register() returns None, so don't chain it off the constructor
Using the example from http://dask.pydata.org/en/latest/array-creation.html:
from glob import glob
import h5py
import dask.array as da

filenames = sorted(glob('2015-*-*.hdf5'))
dsets = [h5py.File(fn)['/data'] for fn in filenames]
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
x = da.concatenate(arrays, axis=0)  # concatenate arrays along the first axis
I'm having trouble understanding the last line: is x a dask array of "dask arrays", or a "normal" numpy array that points to as many dask arrays as there were datasets across all the HDF5 files?
Is there any increase in performance (thread- or memory-based) during the file-read stage as a result of da.from_array, or is it only when you concatenate into the dask array x that you should expect improvements?
The objects in the arrays list are all dask arrays, one for each file.
The x object is also a dask array that combines all of the results of the dask arrays in the arrays list. It isn't a dask.array of dask arrays; it's just a single flattened dask array with a larger first dimension.
There will probably not be an increase in performance for reading data. You're likely to be I/O bound by your disk bandwidth. Most people in this situation are using dask.array because they have more data than can conveniently fit into RAM. If this isn't valuable to you then I would stick with NumPy.
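To see that x is a single flat dask array rather than a container of arrays, you can inspect it directly; a small sketch:

print(type(x))     # dask.array.core.Array, not a numpy array of dask arrays
print(x.chunks)    # one tuple of block sizes per axis, spanning all the files

# Operations on x build one graph over all the underlying HDF5 datasets
# and only touch the disk when computed
total = x.sum().compute()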
How do I use scikit-learn to train a model on a large csv data (~75MB) without running into memory problems?
I'm using IPython notebook as the programming environment, and pandas+sklearn packages to analyze data from kaggle's digit recognizer tutorial.
The data is available on the webpage, there is a link to my code, and here is the error message:
KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading large dataset using read_csv
function. To bypass this problem temporarily, I have to restart the
kernel, which then read_csv function successfully loads the file, but
the same error occurs when I run the same cell again.
When the read_csv function loads the file successfully and I have made changes to the dataframe, I can pass the features and labels to KNeighborsClassifier's fit() function. At that point, a similar memory error occurs.
I tried the following:
Iterating through the CSV file in chunks and fitting the data accordingly, but the problem is that the predictive model is overwritten for every new chunk of data.
What do you think I can do to successfully train my model without running into memory problems?
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, though two different columns can have distinct datatypes (e.g. integer, dates, strings).
When you pass a DataFrame instance to a scikit-learn model, it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
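For instance, a minimal sketch of parsing straight to float32 (the file name and column layout are assumptions based on the Kaggle data):

import numpy as np

# skiprows=1 drops the CSV header; parsing directly to float32 halves the
# memory footprint compared with going through a float64 intermediate
data = np.loadtxt('train.csv', delimiter=',', skiprows=1, dtype=np.float32)
X, y = data[:, 1:], data[:, 0]   # assume the first column is the digit label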
Also, if your data is very sparse (many zero values), it is better to use a scipy.sparse datastructure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not well suited to sparse data, and I am not sure there exists a direct CSV-to-scipy.sparse parser.
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:
https://github.com/scikit-learn/scikit-learn/issues/325
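As a side note on the chunked approach mentioned in the question: the model is overwritten because fit() restarts training from scratch on every call. Estimators that implement partial_fit (not KNeighborsClassifier, but e.g. SGDClassifier) can instead be updated incrementally; a hedged sketch, assuming the Kaggle CSV layout with a 'label' column:

import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = list(range(10))   # all digit labels must be declared up front

# Stream the CSV in chunks; partial_fit updates the model instead of
# discarding what was learned from previous chunks
for chunk in pd.read_csv('train.csv', chunksize=5000):
    y = chunk['label'].values
    X = chunk.drop(columns='label').values.astype('float32')
    clf.partial_fit(X, y, classes=classes)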