Is it possible to create a dask array from a delayed value by specifying its shape with another delayed value?
My algorithm won't give me the shape of the array until pretty late in the computation.
Eventually, I will create some blocks whose shapes are determined by intermediate results of my computation, and then call da.concatenate on all of them (or da.block, if it were more flexible).
It isn't too detrimental if I can't, but it would be cool if I could.
Sample code
from dask import delayed
from dask import array as da
import numpy as np
n_shape = (3, 3)
shape = delayed(n_shape, nout=2)
d_shape = (delayed(n_shape[0]), delayed(n_shape[1]))
n = delayed(np.zeros)(n_shape, dtype=float)
# this doesn't work
# da.from_delayed(n, shape=shape, dtype=float)
# this doesn't work either, but I think goes a little deeper
# into the function call
da.from_delayed(n, shape=d_shape, dtype=float)
You cannot provide a delayed shape, but you can declare a dimension as unknown by using np.nan wherever you don't know its size.
Example
import random
import numpy as np
import dask
import dask.array as da
@dask.delayed
def f():
    return np.ones((5, random.randint(10, 20)))  # a 5 x ? array
values = [f() for _ in range(5)]
arrays = [da.from_delayed(v, shape=(5, np.nan), dtype=float) for v in values]
x = da.concatenate(arrays, axis=1)
>>> x
dask.array<concatenate, shape=(5, nan), dtype=float64, chunksize=(5, nan)>
>>> x.shape
(5, nan)
>>> x.compute().shape
(5, 88)
Docs
See http://dask.pydata.org/en/latest/array-chunks.html#unknown-chunks
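If you later need the real shape (for slicing or reshaping, say), the unknown dimensions can be filled in after the fact. Below is a minimal sketch, assuming a dask version that provides Array.compute_chunk_sizes; make_block and widths are illustrative names that are not part of the example above. Be aware that this runs the delayed blocks in order to measure them.

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def make_block(width):
    # stand-in for a computation whose output width is only known at run time
    return np.ones((5, width))

widths = [3, 4, 5]  # illustrative; in practice these come out of the computation
arrays = [
    da.from_delayed(make_block(w), shape=(5, np.nan), dtype=float)
    for w in widths
]
x = da.concatenate(arrays, axis=1)
shape_before = x.shape  # second dimension is nan at this point

# Runs the blocks to measure them and fills in the chunk sizes.
x = x.compute_chunk_sizes()
shape_after = x.shape
print(shape_before, shape_after)
```

After this call the array behaves like any other dask array with known chunks.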
I am working on a project where I need to group by several columns depending on the task and I have unknown division issues with dask because of this.
Here is a sample of the problem
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000), "col2": np.random.randint(101, 200, 100000), "col3": np.random.uniform(0, 4, 100000)})
ddf = dd.from_pandas(df, npartitions=100)
ddf = ddf.set_index("col1")
ddf["col2_sum"] = ddf.groupby("col1")["col3"].transform("sum", meta=('x', 'float64')) # works
print(ddf.compute())
This works because I am grouping by an indexed column. However,
ddf["col2_sum2"] = ddf.groupby("col2")["col3"].transform("sum", meta=('x', 'float64'))
This doesn't work because of ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
I tried to solve it this way:
ddf_new = ddf[["col2", "col3"]].set_index("col2")
ddf_new["col2_sum2"] = ddf_new.groupby("col2")["col3"].transform("sum", meta=('x', 'float64'))
ddf_new = ddf_new.drop(columns=["col3"])
ddf = ddf.merge(ddf_new, on=["col2"], how="outer") # works but expensive round trip
print(ddf.compute())
But dask merges like this are very expensive. Is there a better way of solving this problem?
The solution you created seems reasonable. I would make one improvement (if it is feasible with the actual data): if ddf_new is computed first, it becomes a pandas dataframe, so the merge of ddf and ddf_new becomes a lot faster because there is less data to shuffle around.
Update: to avoid sending the pandas dataframe from the workers to the client and back, you could also do ddf_new = client.compute(ddf_new) and pass around just the future (a reference to the computed pandas dataframe).
I want to create features (additional columns) from a dataframe, and I have the following structure for many functions.
Following this documentation https://docs.dask.org/en/stable/delayed-best-practices.html I have come up with the code below.
However I get the error message: concurrent.futures._base.CancelledError and many times I get the warning: distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
I understand that the object I am appending to delay is very large (it works fine when I use the commented-out df), which is why the program crashes, but is there a better way of doing it?
import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask
def main():
    #df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000), "col2": np.random.randint(101, 200, 100000), "col3": np.random.uniform(0, 4, 100000)})
    df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000000), "col2": np.random.randint(101, 200, 100000000), "col3": np.random.uniform(0, 4, 100000000)})
    ddf = dd.from_pandas(df, npartitions=100)
    ddf = ddf.set_index("col1")
    delay = []
    def create_col_sth():
        group = ddf.groupby("col1")["col3"]
        @dask.delayed
        def small_fun(lag):
            return f"col_{lag}", group.transform(lambda x: x.shift(lag), meta=('x', 'float64')).apply(lambda x: np.log(x), meta=('x', 'float64'))
        for lag in range(5):
            x = small_fun(lag)
            delay.append(x)
    create_col_sth()
    delayed = dask.compute(*delay)
    for data in delayed:
        ddf[data[0]] = data[1]
    ddf.to_parquet("test", engine="fastparquet")
if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main()
Not sure if this will resolve all of your issues, but generally you don't need to (and shouldn't) mix delayed and dask.dataframe operations like this. Additionally, you shouldn't pass large data objects into delayed functions through closures, like group in your example. Instead, include them as explicit arguments, or in this case, don't use delayed at all and use native dask.dataframe operations or in-memory operations with dask.dataframe.map_partitions.
Implementing these, I would rewrite your main function as follows:
df = pd.DataFrame({
"col1": np.random.randint(1, 100, 100000000),
"col2": np.random.randint(101, 200, 100000000),
"col3": np.random.uniform(0, 4, 100000000),
})
ddf = dd.from_pandas(df, npartitions=100)
ddf = ddf.set_index("col1")
group = ddf.groupby("col1")["col3"]
# directly assign the dataframe operations as columns
for lag in range(5):
    ddf[f"col_{lag}"] = (
        group
        .transform(lambda x: x.shift(lag), meta=('x', 'float64'))
        .apply(lambda x: np.log(x), meta=('x', 'float64'))
    )
# this triggers the operation implicitly - no need to call compute
ddf.to_parquet("test", engine="fastparquet")
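For cases where delayed really is the right tool, the point about explicit arguments can be sketched roughly as follows. This is an illustrative example with a NumPy array standing in for the large object; data, tail_sum, and the sizes are made-up names, not anything from the question.

```python
import numpy as np
import dask

data = np.arange(1_000_000, dtype="int64")  # stand-in for a large in-memory object

@dask.delayed
def tail_sum(values, lag):
    # `values` arrives as an explicit argument, not through a closure
    return int(values[lag:].sum())

# Wrap the large object once so every task refers to the same graph node
# instead of embedding its own copy.
data_delayed = dask.delayed(data)
tasks = [tail_sum(data_delayed, lag) for lag in range(3)]
results = dask.compute(*tasks)
print(results)
```

Wrapping the object once with dask.delayed follows the delayed best-practices page linked in the question.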
After long periods of frustration with Dask, I think I hacked the holy grail of refactoring your pandas transformations wrapped with dask.
Learning points:
Index intelligently. If you are grouping by or merging on certain columns, consider indexing on them.
Partition and repartition intelligently. A dataframe of 10k rows and another of 1m rows should naturally have different numbers of partitions.
Don't use dask dataframe transformation methods, with a few exceptions such as merge. The rest should be pandas code wrapped in map_partitions.
Don't accumulate overly large task graphs; consider saving to disk after, for example, indexing or a complex transformation.
If possible, filter the dataframe and work with a smaller subset. You can always merge it back into the bigger dataset.
If you are working on your local machine, set the memory limits within the boundaries of your system's specifications. This point is very important. In the example below I create one million rows of 3 numeric columns at 8 bytes per value (int64 or float64), so 24 bytes per row and 24 million bytes in total.
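The memory estimate in the last point can be checked with plain pandas before any dask is involved. A small sketch, assuming all three columns are stored as 8-byte types (the explicit int64 dtype is there because the default integer width of np.random.randint can differ between platforms):

```python
import numpy as np
import pandas as pd

size = 1_000_000
df = pd.DataFrame({
    "col1": np.random.randint(1, 10000, size, dtype="int64"),
    "col2": np.random.randint(101, 20000, size, dtype="int64"),
    "col3": np.random.uniform(0, 100, size),  # float64
})

expected_bytes = size * 3 * 8  # rows x columns x bytes per value
data_bytes = int(df.memory_usage(index=False).sum())
print(expected_bytes, data_bytes)
```

The index and any intermediate copies made during a transformation come on top of this, so the per-worker memory limit needs headroom beyond the raw data size.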
import gc
import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask
# https://stackoverflow.com/questions/52642966/repartition-dask-dataframe-to-get-even-partitions
def _rebalance_ddf(ddf):
    """Repartition dask dataframe to ensure that partitions are roughly equal size.
    Assumes `ddf.index` is already sorted.
    """
    if not ddf.known_divisions:  # e.g. for read_parquet(..., infer_divisions=False)
        ddf = ddf.reset_index().set_index(ddf.index.name, sorted=True)
    index_counts = ddf.map_partitions(lambda _df: _df.index.value_counts().sort_index()).compute()
    index = np.repeat(index_counts.index, index_counts.values)
    divisions, _ = dd.io.io.sorted_division_locations(index, npartitions=ddf.npartitions)
    return ddf.repartition(divisions=divisions)
def main(client):
    size = 1000000
    df = pd.DataFrame({"col1": np.random.randint(1, 10000, size), "col2": np.random.randint(101, 20000, size), "col3": np.random.uniform(0, 100, size)})
    # Select an appropriate number of partitions
    ddf = dd.from_pandas(df, npartitions=500)
    del df
    gc.collect()
    # If you want to group by a certain column, it is best if that column is the index
    ddf = ddf.set_index("col1")
    ddf = _rebalance_ddf(ddf)
    print(ddf.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf.memory_usage(deep=True).sum().compute())
    # Always save your data to avoid big task graphs. If you omit this step, processing will fail
    ddf.to_parquet("test", engine="fastparquet")
    ddf = dd.read_parquet("test")
    # Dummy code to create a dataframe to be merged based on col1
    ddf2 = ddf[["col2", "col3"]]
    ddf2["col2/col3"] = ddf["col2"] / ddf["col3"]
    ddf2 = ddf2.drop(columns=["col2", "col3"])
    # Repartition the data
    ddf2 = _rebalance_ddf(ddf2)
    print(ddf2.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf2.memory_usage(deep=True).sum().compute())
    def mapped_fun(data):
        for lag in range(5):
            data[f"col_{lag}"] = data.groupby("col1")["col3"].transform(lambda x: x.shift(lag)).apply(lambda x: np.log(x))
        return data
    # Process the groupby transformation in pandas, wrapped with Dask. If you use the Dask functions
    # to do this you will have a variety of issues.
    ddf = ddf.map_partitions(mapped_fun)
    # Additionally, you can merge ddf with ddf2, but do it on an indexed column, otherwise you run into a variety of issues
    ddf = ddf.merge(ddf2, on=['col1'], how="left")
    ddf.to_parquet("final", engine="fastparquet")
if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main(client)
I was tasked with the creation of a dataset to test the functionality of the code we're working on.
The dataset must have a group of tensors that will be used later on in a generative model.
I'm trying to save the tensors to a .pt file, but I'm overwriting the tensors, thus creating a file with only one. I've read about torch.utils.data.dataset, but I'm not able to figure out on my own how to use it.
Here is my code:
import torch
import numpy as np
from torch.utils.data import Dataset
#variables that will be used to create the size of the tensors:
num_jets, num_particles, num_features = 1, 30, 3
for i in range(100):
    #tensor from a gaussian dist with mean=5,std=1 and shape=size:
    tensor = torch.normal(5,1,size=(num_jets, num_particles, num_features))
    #We will need the tensors to be of the cpu type
    tensor = tensor.cpu()
    #save the tensor to 'tensor_dataset.pt'
    torch.save(tensor,'tensor_dataset.pt')
#open the recently created .pt file inside a list
tensor_list = torch.load('tensor_dataset.pt')
#prints the list. Just one tensor inside .pt file
print(tensor_list)
Reason: You overwrite tensor (and the saved file) on every iteration of the loop, so you never build up a list and only the last tensor is left at the end.
Solution: since you know the size of each tensor, you can allocate one tensor up front and fill it in the loop:
import torch
import numpy as np
from torch.utils.data import Dataset
num_jets, num_particles, num_features = 1, 30, 3
lst_tensors = torch.empty(size=(100,num_jets, num_particles, num_features))
for i in range(100):
    lst_tensors[i] = torch.normal(5,1,size=(num_jets, num_particles, num_features))
    lst_tensors[i] = lst_tensors[i].cpu()
torch.save(lst_tensors,'tensor_dataset.pt')
tensor_list = torch.load('tensor_dataset.pt')
print(tensor_list.shape) # [100,1,30,3]
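If preallocating feels awkward, an alternative (not what the answer above does) is to collect the tensors in a Python list and combine them with torch.stack before saving once. A minimal sketch; the temporary directory is only there to keep the example self-contained:

```python
import os
import tempfile
import torch

num_jets, num_particles, num_features = 1, 30, 3

tensors = []
for _ in range(100):
    # each tensor is drawn from a gaussian with mean=5, std=1
    tensors.append(torch.normal(5.0, 1.0, size=(num_jets, num_particles, num_features)))

# Combine the list into one tensor with a new leading dimension.
stacked = torch.stack(tensors)

# Save once, after the loop, so nothing gets overwritten.
path = os.path.join(tempfile.mkdtemp(), "tensor_dataset.pt")
torch.save(stacked, path)
loaded = torch.load(path)
print(loaded.shape)
```

Either way, the important part is that torch.save is called once on the collected data rather than inside the loop.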
I'm working with datashader and dask but I'm having problems when trying to plot with a cluster running. To make it more concrete, I have the following example (embedded in a bokeh plot):
import holoviews as hv
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
import numpy as np
from holoviews.operation.datashader import datashade
import datashader.transfer_functions as tf
#initialize the client/cluster
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
dask_client = Client(cluster)
def datashade_plot():
    hv.extension('bokeh')
    #create some random data (in the actual code this is a parquet file with millions of rows, this is just an example)
    delta = 1/1000
    x = np.arange(0, 1, delta)
    y = np.cumsum(np.sqrt(delta)*np.random.normal(size=len(x)))
    df = pd.DataFrame({'X':x, 'Y':y})
    #create dask dataframe
    points_dd = dd.from_pandas(df, npartitions=3)
    #create plot
    points = hv.Curve(points_dd)
    return datashade(points)
dask_client.submit(datashade_plot,).result()
This raises a:
TypeError: can't pickle weakref objects
My theory is that this happens because you can't distribute the datashade operations across the cluster. Sorry if this is a noob question, I'd be very grateful for any advice you could give me.
I think you want to go the other way. That is, pass datashader a dask dataframe instead of a pandas dataframe:
>>> from dask import dataframe as dd
>>> import multiprocessing as mp
>>> dask_df = dd.from_pandas(df, npartitions=mp.cpu_count())
>>> dask_df = dask_df.persist()
...
>>> cvs = datashader.Canvas(...)
>>> agg = cvs.points(dask_df, ...)
XREF: https://datashader.org/user_guide/Performance.html
What is the best way to iterate da.linalg.inv over a multi-dimensional dask array?
I have a dask array of shape (4, 4, 8, 8), and need to compute the inverse of the last two dimensions. With numpy, np.linalg.inv loops over all dimensions except the last two, so in the following example, I can just call np.linalg.inv(A).
I have chosen to use a for loop, but I have read about gufuncs in dask (the documentation seems a little outdated). However, I'm not sure how to implement it, particularly the "signature" bit.
import dask.array as da
import numpy as np
A = da.random.random((4,4,8,8))
A2 = A.reshape((-1,) + A.shape[-2:])
B = [da.linalg.inv(a) for a in A2]
B2 = da.asarray(B)
B3 = B2.reshape(A.shape)
np.testing.assert_array_almost_equal(
    np.linalg.inv(A.compute()),
    B3
)
My attempt at a gufunc leads to an error:
def foo(x):
    return da.linalg.inv(x)
gufoo = da.gufunc(foo, signature="()->()", output_dtypes=float, vectorize=True)
gufoo(A2).compute() # IndexError: tuple index out of range
I think you want to apply the NumPy function np.linalg.inv to your Dask array rather than the Dask array function.
np.linalg.inv already behaves like a gufunc (it broadcasts over the leading dimensions), so it might work as expected today:
np.linalg.inv(A)
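On the "signature" part of the question: a signature of "()->()" declares a scalar in and a scalar out, whereas matrix inversion consumes two core dimensions and returns two. A minimal sketch of what that could look like with da.apply_gufunc, assuming the last two dimensions each sit in a single chunk. The seed, the chunking, and the identity term added to keep the matrices well conditioned are choices made for this illustration only.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
# Add a multiple of the identity so every 8x8 matrix is safely invertible.
a_np = rng.random((4, 4, 8, 8)) + 8 * np.eye(8)
# The last two (core) dimensions must each sit in a single chunk.
A = da.from_array(a_np, chunks=(2, 2, 8, 8))

inv = da.apply_gufunc(
    np.linalg.inv,
    "(i,j)->(i,j)",  # one matrix in, one matrix of the same shape out
    A,
    output_dtypes=float,
)
result = inv.compute()
print(result.shape)
```

The leading dimensions are treated as loop dimensions, so they can be chunked however you like and each block is inverted with the NumPy function.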