How do I operate on groups returned by Dask's group by?

I have the following table.
   value category
0      2        A
1     20        B
2      4        A
3     40        B
I want to add a mean column that contains the mean of the values for each category.
   value category  mean
0      2        A   3.0
1     20        B  30.0
2      4        A   3.0
3     40        B  30.0
I can do this in pandas like so
p = pd.DataFrame({"value": [2, 20, 4, 40], "category": ["A", "B", "A", "B"]})
groups = []
for _, group in p.groupby("category"):
    group.loc[:, "mean"] = group.loc[:, "value"].mean()
    groups.append(group)
pd.concat(groups).sort_index()
How do I do the same thing in Dask?
I can't use the pandas functions as-is because you can't enumerate over a groupby object in Dask. This
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
list(d.groupby("category"))
raises KeyError: 'Column not found: 0'.
I can use an apply function to calculate the mean in Dask.
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
q = d.groupby(["category"]).apply(lambda group: group["value"].mean(), meta="object")
q.compute()
returns
category
A 3.0
B 30.0
dtype: float64
But I can't figure out how to fold these back into the rows of the original table.

I would use a merge to achieve this operation:
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({
    'value': [2, 20, 4, 40],
    'category': ['A', 'B', 'A', 'B']
})
ddf = dd.from_pandas(df, npartitions=1)

# Lazy-compute mean per category
mean_by_category = (ddf
    .groupby('category')
    .agg({'value': 'mean'})
    .rename(columns={'value': 'mean'})
).persist()
mean_by_category.head()

# Assign 'mean' value to each corresponding category
ddf = ddf.merge(mean_by_category, left_on='category', right_index=True)
ddf.head()
Which should then output:
  category  value  mean
0        A      2   3.0
2        A      4   3.0
1        B     20  30.0
3        B     40  30.0
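Since the merge regroups the rows by category, the original row order (as in the pandas version above) can be restored by sorting on the index after computing; a small usage sketch:
ddf.compute().sort_index()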

Related

Dask - custom aggregation

I just learned about Dask yesterday and am now upgrading my work from pandas... but am stuck trying to translate simple custom aggregations.
I'm not fully (or probably even at all) understanding how the Series are represented inside those internal lambda functions; normally I would step in with breakpoint() to inspect them, but that isn't an option here. When I try to get an element of x by index, I get a "Meta" error.
Any help/pointers would be appreciated.
import dask.dataframe as dd
import pandas as pd
# toy df
df = dd.from_pandas(pd.DataFrame(dict(a=[1, 1, 2, 2],
                                      b=[100, 100, 200, 250])), npartitions=2)
df.compute()
   a    b
0  1  100
1  1  100
2  2  200
3  2  250
# PART 1 - for conceptual understanding - replicating trivial list
# intended result
df.groupby('a').agg(list).compute()
            b
a
1  [100, 100]
2  [200, 250]
# replicate manually with custom aggregation
list_manual = dd.Aggregation('list_manual', lambda x: list(x), \
lambda x1: list(x1))
res = df.groupby('a').agg(list_manual).compute()
res
b
0 (0, [(1, 0 100\n1 100\nName: b, dtype: i...
res.b[0]
(0,
0 (1, [100, 100])
0 (2, [200, 250])
Name: list_manual-b-6785578d38a71d6dbe0d1ac6515538f7, dtype: object)
# looks like the grouping tuple wasn't even unpacked (1, 2 groups are there)
# ... instead it all got collapsed into one thing
# PART 2 - custom function
# with pandas - intended behavior
collect_uniq = lambda x: list(set(x))
dfp = df.compute()
dfp.groupby('a').agg(collect_uniq)
            b
a
1       [100]
2  [200, 250]
#now trying the same in Dask
collect_uniq_dask = dd.Aggregation('collect_uniq_dask', \
lambda x: list(set(x)), lambda x1: list(x1))
res = df.groupby('a').agg(collect_uniq_dask).compute()
# gives TypeError("unhashable type: 'Series'")
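For reference, the chunk and agg callables of dd.Aggregation receive pandas SeriesGroupBy objects rather than plain sequences, which is why list(x) above produces (key, Series) tuples instead of unpacked groups. A minimal sketch of the usual pattern, assuming the goal is one list per group (apply inside each grouped chunk, then flatten the per-partition lists in the agg step):
import itertools

list_manual = dd.Aggregation(
    'list_manual',
    chunk=lambda s: s.apply(list),  # per partition: one list per group
    agg=lambda s: s.apply(lambda parts: list(itertools.chain.from_iterable(parts))),  # merge partition lists
)
df.groupby('a').agg(list_manual).compute()
The unique variant should follow the same shape, e.g. building sets in the chunk step and taking their union in the agg step before converting back to a list.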

Repartition dask dataframe while maintaining its original sequence

I am trying to repartition multiple .parquet files in order to save a specific number of parquet files. I have time-series data that is dependent on the number of observations (instead of timestamps) for each client, so I need to ensure that the partitioning will not split a series over two files. In addition, I want to preserve the order, since I have the labels stored elsewhere. Here is an example of what I am trying to do:
import pandas as pd
import dask.dataframe as dd
ids = [9635, 1536, 8477, 1088, 6411, 2251]
df = pd.DataFrame({
    "partition": [0]*3 + [1]*3 + [2]*3 + [3]*3 + [4]*3 + [5]*3,
    "customer_id": [ids[0]]*3 + [ids[1]]*3 + [ids[2]]*3 + [ids[3]]*3 + [ids[4]]*3 + [ids[5]]*3,
    "x": range(18)})
# indexing on "customer_id" here
df = df.set_index("customer_id")
ddf = dd.from_pandas(df, npartitions=6)
ddf.to_parquet("my_parquets")
read_ddf = dd.read_parquet("my_parquets/*.parquet")
last_idx = [ids[-1]]
my_divisions = ids + last_idx
read_ddf.divisions = my_divisions
# Split into two equal partitions with three customers each
new_divisions = [my_divisions[0], my_divisions[3], my_divisions[5]]
new_ddf = read_ddf.repartition(divisions=new_divisions)
which raises an error:
ValueError: New division must be sorted
I have tried an alternative approach, which involves setting the "partition" column as the index and modifying the index to "ids" later, but this sorts my entire dataframe, which is undesired because the new sequence no longer matches the labels stored. This is shown here:
import pandas as pd
import dask.dataframe as dd
ids = [9635, 1536, 8477, 1088, 6411, 2251]
df = pd.DataFrame({
    "partition": [0]*3 + [1]*3 + [2]*3 + [3]*3 + [4]*3 + [5]*3,
    "customer_id": [ids[0]]*3 + [ids[1]]*3 + [ids[2]]*3 + [ids[3]]*3 + [ids[4]]*3 + [ids[5]]*3,
    "x": range(18)})
# indexing on the defined "partition" instead
df = df.set_index("partition")
ddf = dd.from_pandas(df, npartitions=6)
ddf.to_parquet("my_parquets")
read_ddf = dd.read_parquet("my_parquets/*.parquet")
# my_range is equivalent to the list of partitions
my_range = [i for i in range(0,6)]
last_idx = [my_range[-1]]
my_divisions = my_range + last_idx
read_ddf.divisions = my_divisions
new_divisions = [0, 2, 4, 5]
new_ddf = read_ddf.repartition(divisions=new_divisions)
# Need the "customer_id" as index
new_ddf = new_ddf.set_index("customer_id", drop = True)
But this sorts the dataframe by the index and messes up the structure, while I would like to keep the original order.
print("Partition 0")
print(new_ddf.get_partition(0).compute())
print("-------------------")
print("Partition 1")
print(new_ddf.get_partition(1).compute())
print("-------------------")
print("Partition 2")
print(new_ddf.get_partition(2).compute())
Partition 0
Empty DataFrame
Columns: [x]
Index: []
-------------------
Partition 1
              x
customer_id
1088          9
1088         10
1088         11
1536          3
1536          4
1536          5
-------------------
Partition 2
              x
customer_id
2251         15
2251         16
2251         17
6411         12
6411         13
6411         14
8477          6
8477          7
8477          8
9635          0
9635          1
9635          2
Are there any workarounds for this issue? I am aware that set_index in dask is quite expensive, but none of the approaches are currently working. Also, in my case I already have the .parquet files with the preprocessed data, so I only created the initial dataframe using pandas for demonstration purposes (it would have been much easier to specify the number of partitions in the first step if I had all the data in pandas).
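One possible workaround that avoids divisions and set_index entirely (a sketch, assuming dask's default part.N.parquet naming and that each new partition should hold three of the original files) is to read the files lazily in fixed groups and rebuild the dataframe with dd.from_delayed, which keeps the on-disk row order:
import glob
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def read_group(paths):
    # Read a fixed group of files with pandas so their row order is preserved as written
    return pd.concat([pd.read_parquet(p) for p in paths])

paths = sorted(glob.glob("my_parquets/part.*.parquet"))  # with more than 10 files, sort on the numeric part instead
groups = [paths[i:i + 3] for i in range(0, len(paths), 3)]  # three original files per new partition
new_ddf = dd.from_delayed([read_group(g) for g in groups])
Note that dd.from_delayed may run one of the delayed reads to infer metadata unless a meta argument is supplied.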

Best way to parallelize computation over dask blocks that do not return np arrays?

I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found that a combination of da.overlap.overlap and to_delayed().ravel() is able to get the job done, if I pass in the relevant block key and chunk information.
Edit:
Thanks to @AnnaM, who caught bugs in the original post and then made it general! Building off of her comments, I'm including an updated version of the code. Also, in response to Anna's interest in memory usage, I verified that this does not seem to take up more memory than naively expected.
def extract_features_generalized(chunk, offsets, depth, columns):
    shape = np.asarray(chunk.shape)
    offsets = np.asarray(offsets)
    depth = np.asarray(depth)
    coordinates = np.stack(np.nonzero(chunk)).T
    keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
    data = coordinates + offsets - depth
    df = pd.DataFrame(data=data, columns=columns)
    return df[keep]

def my_overlap_generalized(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr,
                            chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2),
                            columns=['r', 'c', 'z', 'y', 'x'],
                            boundary=tuple(['reflect']*5))
df.compute().reset_index()
-- Remainder of original post, including original bugs --
My example only does xy overlaps, but it's easy to generalize. Is there anything below that is suboptimal or could be done better? Is anything likely to break because it's relying on low-level information that could change (e.g. block key)?
def my_overlap(data, chunk_xy, depth_xy):
    data = data.rechunk((-1, -1, -1, chunk_xy, chunk_xy))
    data_overlapping_chunks = da.overlap.overlap(data,
                                                 depth=(0, 0, 0, depth_xy, depth_xy),
                                                 boundary={3: 'reflect', 4: 'reflect'})
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
        dfs.append(df_block)
    # All computation is delayed, so downstream computations need to know the format of the data.
    # If the meta information is not specified, a single computation will be done (which could be
    # expensive) at this point to infer the metadata.
    # This empty dataframe has the index, column, and type information we expect in the computation.
    columns = ['r', 'c', 'z', 'y', 'x']
    # The dtypes are float64, except for a small number of columns
    df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
    df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
    df_meta.index.name = 'feature'
    return dd.from_delayed(dfs, meta=df_meta)

def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y + offsets[3] - depth_xy,
                       'x': x + offsets[4] - depth_xy})
    df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
    return df
data = np.zeros((2,4,8,16,16)) # round, channel, z, y, x
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()
First of all, thanks for posting your code. I am working on a similar problem and this was really helpful for me.
When testing your code, I discovered a few mistakes in the extract_features function that prevent your code from returning correct indices.
Here is a corrected version:
def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
    df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
    df['y'] = df['y'] + offsets[3] - depth_xy
    df['x'] = df['x'] + offsets[4] - depth_xy
    return df
The updated code now returns the indices that were set to 1:
   index  r  c  z  y  x
0      0  0  0  4  2  2
1      1  0  1  4  6  2
2      2  0  3  4  2  2
3      1  1  2  4  8  2
For comparison, this is the output of the original version:
   index  r  c  z  y  x
0      1  0  1  4  6  2
1      3  1  2  4  8  2
2      0  0  1  4  6  2
3      1  1  2  4  8  2
It returns rows 2 and 4, twice each.
The reason this happens is three mistakes in the extract_features function:
You first add the offset and subtract the depth and then filter out the overlapping parts: the order needs to be swapped
df.y > depth_xy should be replaced with df.y >= depth_xy
df.z should be replaced with df.x, since it is the x dimension that has an overlap
To optimize this even further, here is a generalized version of the code that works for an arbitrary number of dimensions:
def extract_features_generalized(chunk, offsets, depth, columns):
    coordinates = np.nonzero(chunk)
    df = pd.DataFrame()
    rows_to_keep = np.ones(len(coordinates[0]), dtype=int)
    for i in range(len(columns)):
        df[columns[i]] = coordinates[i]
        rows_to_keep = rows_to_keep * np.array((df[columns[i]] >= depth[i])) * \
                       np.array((df[columns[i]] < (chunk.shape[i] - depth[i])))
        df[columns[i]] = df[columns[i]] + offsets[i] - depth[i]
    del coordinates
    return df[rows_to_keep > 0]
def my_overlap_generalized(data, chunksize, depth, columns):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth,
                                                 boundary=tuple(['reflect']*len(columns)))
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr, chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2), columns=['r', 'c', 'z', 'y', 'x'])
df.compute().reset_index()
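On the question of whether relying on block.key could break: the same per-block offsets can instead be derived from the public .chunks attribute, which also stays correct if the chunks are not all the same size. A sketch along those lines (the name my_overlap_from_chunks is just illustrative; it reuses extract_features_generalized from above, and itertools.product yields offsets in the same C order as to_delayed().ravel()):
import itertools

def my_overlap_from_chunks(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)
    # Start offset of every block in each dimension, taken from data.chunks
    block_starts = [np.cumsum((0,) + c[:-1]) for c in data.chunks]
    all_offsets = itertools.product(*block_starts)
    dfs = []
    for block, offsets in zip(data_overlapping_chunks.to_delayed().ravel(), all_offsets):
        dfs.append(dask.delayed(extract_features_generalized)(
            block, offsets=np.array(offsets), depth=depth, columns=columns))
    return dd.from_delayed(dfs)
As the original post notes, passing an explicit meta DataFrame to dd.from_delayed also avoids the one-off metadata inference computation.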

Find frequency of elements in a dask array without losing information about the array shape?

I need to find the frequency of every element in the array while keeping the information about the array shape. This is because I'll need to iterate over it later on.
I tried this solution as well as this one. They work well for numpy; however, they don't seem to work in dask due to the limitation that dask arrays need to know their size for most operations.
import dask.array as da
arr = da.from_array([1, 1, 1, 2, 3, 4, 4])
unique, counts = da.unique(arr, return_counts=True)
print(unique)
# dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,)>
print(counts)
# dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,)>
I am looking for something similar to this:
import dask.array as da
arr = da.from_array([1, 1, 1, 2, 3, 4, 4])
print(da.frequency(arr))
# {1: 3, 2: 1, 3:1, 4:2}
I found that this solution was the fastest for a large amount of data (~37.5 billion elements) with many unique values (>50k).
import dask
import dask.array as da
arr = da.from_array(some_large_array)
bincount = da.bincount(arr)
bincount = bincount[bincount != 0] # Remove elements not in the initial array
unique = da.unique(arr)
# Computing both here gives the arrays known shapes
unique, counts = dask.compute(unique, bincount)
unique = da.from_array(unique)
counts = da.from_array(counts)
frequency = da.transpose(
    da.vstack([unique, counts])
)
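If a plain dictionary like the one in the question is ultimately wanted, the computed value/count pairs can be unpacked directly; a small usage sketch:
freq = {int(value): int(count) for value, count in frequency.compute()}
# e.g. {1: 3, 2: 1, 3: 1, 4: 2} for the small example array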
Perhaps you can call dask.compute directly after creating the frequency counts. Presumably at this point your dataset is small and now would be a good time to transition away from Dask Array and back to NumPy:
import dask
import dask.array as da
arr = da.from_array([1, 1, 1, 2, 3, 4, 4])
unique, counts = da.unique(arr, return_counts=True)
unique, counts = dask.compute(unique, counts)
result = dict(zip(unique, counts))
# {1: 3, 2: 1, 3: 1, 4: 2}

How to do groupby filter in Dask

I am attempting to take a dask dataframe, group by column 'A' and remove the groups where there are fewer than MIN_SAMPLE_COUNT rows.
For example, the following code works in pandas:
import pandas as pd
import dask as da
MIN_SAMPLE_COUNT = 1
x = pd.DataFrame([[1,2,3], [1,5,6], [2,8,9], [1,3,5]])
x.columns = ['A', 'B', 'C']
grouped = x.groupby('A')
x = grouped.filter(lambda x: x['A'].count().astype(int) > MIN_SAMPLE_COUNT)
However, in Dask if I try something analogous:
import pandas as pd
import dask
MIN_SAMPLE_COUNT = 1
x = pd.DataFrame([[1,2,3], [1,5,6], [2,8,9], [1,3,5]])
x.columns = ['A', 'B', 'C']
x = dask.dataframe.from_pandas(x, npartitions=2)
grouped = x.groupby('A')
x = grouped.filter(lambda x: x['A'].count().astype(int) > MIN_SAMPLE_COUNT)
I get the following error message:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\groupby.py in __getattr__(self, key)
1162 try:
-> 1163 return self[key]
1164 except KeyError as e:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\groupby.py in __getitem__(self, key)
1153 # error is raised from pandas
-> 1154 g._meta = g._meta[key]
1155 return g
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in __getitem__(self, key)
274 if key not in self.obj:
--> 275 raise KeyError("Column not found: {key}".format(key=key))
276 return self._gotitem(key, ndim=1)
KeyError: 'Column not found: filter'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
<ipython-input-55-d8a969cc041b> in <module>()
1 # Remove sixty second blocks that have fewer than MIN_SAMPLE_COUNT samples.
2 grouped = dat.groupby('KPI_60_seconds')
----> 3 dat = grouped.filter(lambda x: x['KPI_60_seconds'].count().astype(int) > MIN_SAMPLE_COUNT)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\groupby.py in __getattr__(self, key)
1163 return self[key]
1164 except KeyError as e:
-> 1165 raise AttributeError(e)
1166
1167 #derived_from(pd.core.groupby.DataFrameGroupBy)
AttributeError: 'Column not found: filter'
The error message suggests that the filter method used in Pandas has not been implemented in Dask (nor did I find it after a search).
Is there Dask functionality that captures what I am looking to do? I have gone through the Dask API and nothing stood out to me as what I need. I am currently using Dask '1.1.1'.
Thank you for your help.
Fairly new to Dask myself. One way to achieve what you are trying to do could be as follows:
Dask version: 0.17.3
import pandas as pd
import dask.dataframe as dd
MIN_SAMPLE_COUNT = 1
x = pd.DataFrame([[1,2,3], [1,5,6], [2,8,9], [1,3,5]])
x.columns = ['A', 'B', 'C']
print("x (before):")
print(x) # still pandas
x = dd.from_pandas(x, npartitions=2)
grouped = x.groupby('A').B.count().reset_index()
grouped = grouped.rename(columns={'B': 'Count'})
y = dd.merge(x, grouped, on=['A'])
y = y[y.Count > MIN_SAMPLE_COUNT]
x = y[['A', 'B', 'C']]
print("x (after):")
print(x.compute()) # needs compute for conversion to pandas df
Output:
x (before):
   A  B  C
0  1  2  3
1  1  5  6
2  2  8  9
3  1  3  5
x (after):
   A  B  C
0  1  2  3
1  1  5  6
1  1  3  5
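If the number of distinct keys is small enough to compute eagerly, another pattern worth considering (a sketch, not tied to a particular Dask version) is to compute the group sizes first and then filter lazily with isin, which also preserves the original index:
import pandas as pd
import dask.dataframe as dd

MIN_SAMPLE_COUNT = 1
x = pd.DataFrame([[1, 2, 3], [1, 5, 6], [2, 8, 9], [1, 3, 5]], columns=['A', 'B', 'C'])
x = dd.from_pandas(x, npartitions=2)

counts = x.groupby('A').size().compute()           # small result, computed eagerly
keep = counts[counts > MIN_SAMPLE_COUNT].index.tolist()
x = x[x['A'].isin(keep)]                           # still lazy
print(x.compute())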
