Using numba functions in map_blocks - dask

I have successfully used map_blocks a few times on dask arrays. I'm now trying to use a numba function to act on each block, modifying one of the inputs in place.
The numba function takes in 2 numpy arrays and updates the second one; the updated array is then returned so that map_blocks can use it as the result.
The function works fine on plain numpy arrays, but Python just crashes when it is called from map_blocks. Numba functions that do not act on an input array behave normally (although it is difficult to get them to do anything useful in that case).
Is this a known limitation? A bug? Am I using it wrong?!
Update
I've finally boiled it down to a reproducible example with a trivial numba function, and I get a clearer idea of the problem. However I'm still unclear on how to resolve the issue. Here's the code:
import numpy as np
from numba import jit, float64, int64
from dask.distributed import Client, LocalCluster
import dask.array as da

cluster = LocalCluster()
c = Client(cluster)

size = int(1e5)
a = np.arange(size, dtype='float64')
b = np.zeros((size,), dtype='float64')
dista = da.from_array(a, chunks=size//4)
distb = da.from_array(b, chunks=size//4)

@jit(float64[:](float64[:], float64[:]))
def crasher(x, y):
    for i in range(x.shape[0]):
        y[i] = x[i]*2
    return y

distc = da.map_blocks(crasher, dista, distb, dtype='float64')
c = distc.compute()  # it all crashes at this point
And I now get a more comprehensible error rather than just a straight up crash:
TypeError: No matching definition for argument type(s) readonly array(float64, 1d, C), readonly array(float64, 1d, C)
So if numba is receiving numpy arrays with write=False set, how do you get numba to do any useful work? You can't put an array creation line in the numba function, and you can't feed it writeable arrays.
Any views on how to achieve this?

Here is a version of your code with array creation, which runs fine in numba's nopython mode:
import numpy as np
from numba import jit, float64, int64
from dask.distributed import Client, LocalCluster
import dask.array as da

cluster = LocalCluster()
c = Client(cluster)

size = int(1e5)
a = np.arange(size, dtype='float64')
dista = da.from_array(a, chunks=size//4)

@jit(nopython=True)
def crasher(x):
    y = np.empty_like(x)
    for i in range(x.shape[0]):
        y[i] = x[i]*2
    return y

distc = da.map_blocks(crasher, dista, dtype='float64')
c = distc.compute()
Note the y = np.empty_like(x) line: array creation is supported inside nopython-mode functions. See the list of supported numpy functions in the numba documentation.
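As a further, untested sketch (not from the original answer): if you want to keep an eagerly compiled signature, numba's types.Array accepts a readonly flag, so you can declare the input exactly as dask delivers it and still allocate the output inside the function.

import numpy as np
from numba import jit, float64, types

# Read-only 1-d C-contiguous float64, matching the blocks dask hands over.
readonly_f64 = types.Array(float64, 1, 'C', readonly=True)

@jit(float64[:](readonly_f64), nopython=True)
def doubler(x):
    y = np.empty_like(x)  # output allocated inside the jitted function
    for i in range(x.shape[0]):
        y[i] = x[i]*2
    return y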

Related

Understanding dask cudf object lifecycle

I want to understand the efficient memory management process for Dask objects. I have set up a Dask GPU cluster and I am able to execute tasks that run across the cluster. However, with the dask objects, especially when I run the compute function, the process running on the GPU quickly grows by using more and more memory, and soon I get a "run out of memory" error.
I want to understand how I can release the memory from a Dask object once I am done using it. In the following example, after the compute call, how can I release that object? I am running the following code a few times; the memory keeps growing in the process where it is running.
import cupy as cp
import pandas as pd
import cudf
import dask_cudf
nrows = 100000000
df2 = cudf.DataFrame({'a': cp.arange(nrows), 'b': cp.arange(nrows)})
ddf2 = dask_cudf.from_cudf(df2, npartitions=5)
ddf2['c'] = ddf2['a'] + 5
ddf2
ddf2.compute()
Please check this blog post by Nick Becker; you may want to set up a client first.
You read into cudf first, which you shouldn't do in practice; you should read directly into dask_cudf.
When dask_cudf computes, the result comes back as a cudf dataframe, which MUST fit into the remaining memory of your GPU. Chances are that reading into cudf first has already taken a chunk of your memory.
Then, when you are done with a dask object, you can delete it using client.cancel().
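A rough sketch of that advice might look like the following. It is not from the original answer: the file path, the column names, and the dask_cuda LocalCUDACluster setup are hypothetical stand-ins for whatever your actual cluster and data look like.

import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Set up a client first, as the answer suggests.
cluster = LocalCUDACluster()
client = Client(cluster)

# Read directly into dask_cudf instead of building a cudf frame first,
# so no single-GPU copy of the full data is materialised.
ddf = dask_cudf.read_csv('data/*.csv')  # hypothetical path
ddf['c'] = ddf['a'] + 5                 # hypothetical column
ddf = ddf.persist()

# ... use ddf ...

# When finished, cancel the tasks backing the collection and drop the reference.
client.cancel(ddf)
del ddf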

How can I create a Dask array from zipped .npy files?

I have a large dataset stored as zipped npy files. How can I stack a given subset of these into a Dask array?
I'm aware of dask.array.from_npy_stack but I don't know how to use it for this.
Here's a crude first attempt that uses up all my memory:
import numpy as np
import dask.array as da

data = np.load('data.npz')

def load(files):
    list_ = [da.from_array(data[file]) for file in files]
    return da.stack(list_)

x = load(['foo', 'bar'])
Well, you can't load a large npz file into memory, because then you're already out of memory. I would read each one in a delayed fashion, and then call da.from_array and da.stack, as you more or less do in your example.
Here are some docs that may help if you haven't seen them before: https://docs.dask.org/en/latest/array-creation.html#using-dask-delayed
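One possible concrete version of that (not from the original answer) uses dask.delayed to open the archive lazily in each task and da.from_delayed to wrap the results; it assumes you know each member's shape and dtype up front, and the file name, member names, shape, and dtype below are placeholders.

import numpy as np
import dask
import dask.array as da

@dask.delayed
def load_member(path, name):
    # np.load on an .npz only decompresses a member when it is accessed,
    # so each task reads just the array it needs.
    with np.load(path) as archive:
        return archive[name]

def load(path, names, shape, dtype):
    # shape and dtype are assumed known up front; adjust to your data.
    arrays = [da.from_delayed(load_member(path, name), shape=shape, dtype=dtype)
              for name in names]
    return da.stack(arrays)

x = load('data.npz', ['foo', 'bar'], shape=(1000, 1000), dtype='float64')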

Save larger than memory Dask array to hdf5 file

I need to save dask arrays to hdf5 when using dask distributed. My situation is very similar to the one described in this issue: https://github.com/dask/dask/issues/3351. Basically this code will work:
import dask.array as da
from distributed import Client
import h5py
from dask.utils import SerializableLock
def create_and_store_dask_array():
    data = da.random.normal(10, 0.1, size=(1000, 1000), chunks=(100, 100))
    data.to_hdf5('test.h5', '/test')
    # this fails too
    # f = h5py.File('test.h5', 'w')
    # dset = f.create_dataset('/matrix', shape=data.shape)
    # da.store(data, dset)
    # f.close()

create_and_store_dask_array()
But as soon as I try to involve the distributed scheduler, I get a TypeError: can't pickle _thread._local objects.
import dask.array as da
from distributed import Client
import h5py
from dask.utils import SerializableLock
from dask.distributed import Client, LocalCluster, progress, performance_report

def create_and_store_dask_array():
    data = da.random.normal(10, 0.1, size=(1000, 1000), chunks=(100, 100))
    data.to_hdf5('test.h5', '/test')
    # this fails too
    # f = h5py.File('test.h5', 'w')
    # dset = f.create_dataset('/matrix', shape=data.shape)
    # da.store(data, dset)
    # f.close()

cluster = LocalCluster(n_workers=35, threads_per_worker=1)
client = Client(cluster)
create_and_store_dask_array()
I am currently working around this by submitting my computations to the scheduler in small pieces, gathering the results in memory and saving the arrays with h5py, but this is very, very slow. Can anyone suggest a good workaround for this problem? The issue discussion implies that xarray can take a dask array and write it to an hdf5 file, although this also seems very slow.
import xarray as xr
import netCDF4
import dask.array as da
from distributed import Client, LocalCluster
import h5py
from dask.utils import SerializableLock

cluster = LocalCluster(n_workers=35, threads_per_worker=1)
client = Client(cluster)

data = da.random.normal(10, 0.1, size=(1000, 1000), chunks=(100, 100))
# data.to_hdf5('test.h5', '/test')
test = xr.DataArray(data, dims=None, coords=None)
# save as hdf5
test.to_netcdf("test.h5", mode='w', format="NETCDF4")
If anyone could suggest a way to deal with this I would be very interested in a solution (particularly one that does not involve adding extra dependencies).
Thanks in advance.
h5py objects are not serializable, and so are hard to move between different processes in a distributed context. The explicit to_hdf5 method works around this; the more general store method doesn't special-case HDF5 in the same way.
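For reference, a rough sketch of the block-wise workaround described in the question (compute each chunk on the cluster, gather it back, and write it with h5py in the client process) might look like this; it avoids pickling any h5py objects, at the cost of streaming every chunk through the client, which is why it is slow.

import numpy as np
import dask.array as da
import h5py
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

data = da.random.normal(10, 0.1, size=(1000, 1000), chunks=(100, 100))

# Chunk boundaries along each axis, derived from the array's chunk sizes.
row_edges = np.cumsum((0,) + data.chunks[0])
col_edges = np.cumsum((0,) + data.chunks[1])

# All h5py objects stay in the client process, so nothing unpicklable is
# shipped to the workers; only plain numpy chunks come back over the network.
with h5py.File('test.h5', 'w') as f:
    dset = f.create_dataset('/test', shape=data.shape, dtype=data.dtype)
    for i in range(data.numblocks[0]):
        for j in range(data.numblocks[1]):
            chunk = data.blocks[i, j].compute()  # computed on the cluster
            dset[row_edges[i]:row_edges[i + 1],
                 col_edges[j]:col_edges[j + 1]] = chunk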

dask.distributed not utilising the cluster

I'm not able to process this block using the distributed cluster.
import pandas as pd
from dask import dataframe as dd
import dask

df = pd.DataFrame({'reid_encod': [[1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10]]})
dask_df = dd.from_pandas(df, npartitions=3)
save_val = []

def add(dask_df):
    for _, outer_row in dask_df.iterrows():
        for _, inner_row in dask_df.iterrows():
            for base_encod in outer_row['reid_encod']:
                for compare_encod in inner_row['reid_encod']:
                    val = base_encod + compare_encod
                    save_val.append(val)
    return save_val

from dask.distributed import Client
client = Client(...)

dask_compute = dask.delayed(add)(dask_df)
dask_compute.compute()
Also I have a few queries:
Does dask.delayed use the available cluster to do the computation?
Can I parallelize the for-loop iteration over this pandas DF using delayed, and use multiple computers present in the cluster to do the computations?
Does dask.distributed work on pandas dataframes?
Can we use dask.delayed with dask.distributed?
If the above programming approach is wrong, can you guide me on whether to choose delayed or a dask DF for the above scenario?
For the record, some answers, although I wish to note my earlier general points about this question:
Does dask.delayed use the available cluster to do the computation?
If you have created a client connected to a distributed cluster, dask will use it for computation unless you specify otherwise.
Can I parallelize the for-loop iteration over this pandas DF using delayed, and use multiple computers present in the cluster to do the computations?
Yes, you can in general use delayed with pandas dataframes for parallelism if you wish. However, your dataframe only has one row, so it is not obvious in this case how; it depends on what you really want to achieve.
Does dask.distributed work on pandas dataframes?
Yes, you can do anything that Python can do with distributed, since it is just Python processes executing code. Whether it brings you the performance you are after is a separate question.
Can we use dask.delayed with dask.distributed?
Yes, distributed can execute anything that dask in general can, including delayed functions/objects.
If the above programming approach is wrong, can you guide me on whether to choose delayed or a dask DF for the above scenario?
Not easily: it is not clear to me that this is a dataframe operation at all. It seems more like an array operation; but, again, I note that your function does not actually return anything useful at all.
In the tutorial: passing pandas dataframes to delayed; the same applies to the dataframe API.
The main problem with your code is sketched in this section of the best practices: don't pass Dask collections to delayed functions. This means you should use either the delayed API or the dataframe API (a sketch of the dataframe-API route follows the points below). While you can convert between dataframes and delayed objects, simply passing one into the other like this is not recommended.
Furthermore,
you only have one row in your dataframe, so you only get one partition and no parallelism whatsoever; you can only slow things down like this.
this appears to be an everything-to-everything (N^2) operation, so if you had many rows (the normal case for Dask), it would presumably take an extremely long time, no matter how many cores you used.
passing lists in a pandas row is not a great idea; perhaps you wanted to use an array?
the function doesn't return anything useful, so it's not at all clear what you are trying to achieve. Under the description of MCVE, you will see references to "expected outcome" and "what went wrong". To get more help, please be more precise.
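Not part of the original answer, but to make the "pick one API" point concrete: with the dataframe API you would push per-partition work into map_partitions instead of handing the whole Dask dataframe to a delayed function. The pairwise_sums helper below is hypothetical, mirrors only the inner loops of the question's add function, and pairs rows within a partition only; an all-pairs computation across partitions would need a different approach.

import numpy as np
import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame({'reid_encod': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
dask_df = dd.from_pandas(df, npartitions=3)

def pairwise_sums(partition):
    # Hypothetical helper: all pairwise element sums within one partition.
    values = np.concatenate(partition['reid_encod'].to_list())
    sums = (values[:, None] + values[None, :]).ravel()
    return pd.Series([sums.tolist()])

result = dask_df.map_partitions(pairwise_sums, meta=(None, 'object')).compute()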

ParamGridBuilder in PySpark does not work with LinearRegressionSGD

I'm trying to figure out why LinearRegressionWithSGD does not work with Spark's ParamGridBuilder. From the Spark documentation:
lr = LinearRegression(maxIter=10)
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()
However, changing LinearRegression to LinearRegressionWithSGD simply does not work. Consequently, SGD-specific parameters (such as iterations or miniBatchFraction) cannot be passed in either.
Thanks!!
That is because you are trying to mix functionality from two different libraries: LinearRegressionWithSGD comes from pyspark.mllib (i.e. the old, RDD-based API), while both LinearRegression & ParamGridBuilder come from pyspark.ml (the new, dataframe-based API).
Indeed, a few lines before the code snippet in the documentation you quote (BTW, in the future it would be good to provide a link, too) you'll find the line:
from pyspark.ml.regression import LinearRegression
while for LinearRegressionWithSGD you have used something like:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
These two libraries are not compatible: pyspark.mllib takes RDD's of LabeledPoint as input, which is not compatible with the dataframes used in pyspark.ml; and since ParamGridBuilder is part of the latter, it can only be used with dataframes, and not with algorithms included in pyspark.mllib (check the documentation links provided above).
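For context (this example is not from the original answer), the RDD-based API is called roughly like this, assuming an active SparkContext sc; it consumes an RDD of LabeledPoint rather than a dataframe, which is why ParamGridBuilder cannot be pointed at it:

# Illustrative mllib usage, shown only to highlight the different input type:
# it consumes an RDD of LabeledPoint, not a DataFrame, so none of the
# pyspark.ml tuning tools (ParamGridBuilder, CrossValidator, ...) apply.
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

rdd = sc.parallelize([
    LabeledPoint(1.0, [1.0]),
    LabeledPoint(0.0, [0.0]),
])
model = LinearRegressionWithSGD.train(rdd, iterations=10, miniBatchFraction=1.0)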
Moreover, keep in mind that LinearRegressionWithSGD is deprecated in Spark 2:
Note: Deprecated in 2.0.0. Use ml.regression.LinearRegression.
UPDATE: Thanks to @rvisio's comment below, we now know that, although undocumented, one can actually use solver='sgd' for LinearRegression in pyspark.ml; here is a short example adapted from the docs:
spark.version
# u'2.2.0'

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

df = spark.createDataFrame([
    (1.0, 2.0, Vectors.dense(1.0)),
    (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])

lr = LinearRegression(maxIter=5, regParam=0.0, solver="sgd", weightCol="weight")  # solver='sgd'
model = lr.fit(df)  # works OK
lr.getSolver()
# 'sgd'
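Not shown in the original answer, but to tie this back to the question: once you are on pyspark.ml's LinearRegression with solver='sgd', the usual ParamGridBuilder/CrossValidator machinery applies. A hedged sketch, reusing the lr and df from the example above:

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Grid over the same parameters as the documentation snippet in the question,
# but with the sgd-backed estimator defined above.
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True]) \
    .build()

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=RegressionEvaluator(),
                    numFolds=2)

# cvModel = cv.fit(training_df)  # fit on a real, larger dataframe than the
#                                # two-row toy df above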
