dask df.col.unique() vs df.col.drop_duplicates()

In dask what is the difference between
df.col.unique()
and
df.col.drop_duplicates()
Both return a series containing the unique elements of df.col.
There is a difference in the index: the unique result is indexed 1..N, while the drop_duplicates result is indexed by an arbitrary-looking sequence of numbers.
What is the significance of the index returned by drop_duplicates?
Is there any reason to use one over the other if the index is not important?

Dask.dataframe has both because Pandas has both, and dask.dataframe mostly copies the Pandas API. Unique is a holdover from Pandas' history with Numpy.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]}, index=pd.Index(['a', 'b', 'A'], name='I'))
In [3]: df.x.drop_duplicates()
Out[3]:
I
a 1
b 2
Name: x, dtype: int64
In [4]: df.x.unique()
Out[4]: array([1, 2])
In dask.dataframe we deviate slightly and return a dask.dataframe.Series rather than a dask.array.Array, because the length of the result can't be known in advance, so a lazy dask array can't be constructed.
In practice there is little reason to use unique over drop_duplicates.
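For concreteness, here is a minimal sketch of how the two calls look on a dask.dataframe.Series, reusing the pandas example above (both results stay lazy until compute):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]},
                   index=pd.Index(['a', 'b', 'A'], name='I'))
ddf = dd.from_pandas(pdf, npartitions=2)

# drop_duplicates keeps the original index labels of the kept rows,
# while unique does not preserve them.
print(ddf.x.drop_duplicates().compute())
print(ddf.x.unique().compute())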

Related

Dask division issue after groupby

I am working on a project where I need to group by different columns depending on the task, and because of this I run into unknown-division issues with dask.
Here is a sample of the problem:
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000), "col2": np.random.randint(101, 200, 100000), "col3": np.random.uniform(0, 4, 100000)})
ddf = dd.from_pandas(df, npartitions=100)
ddf = ddf.set_index("col1")
ddf["col2_sum"] = ddf.groupby("col1")["col3"].transform("sum", meta=('x', 'float64')) # works
print(ddf.compute())
This works because I am grouping by an indexed column. However,
ddf["col2_sum2"] = ddf.groupby("col2")["col3"].transform("sum", meta=('x', 'float64'))
This fails with: ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
I have tried to solve it this way:
ddf_new = ddf[["col2", "col3"]].set_index("col2")
ddf_new["col2_sum2"] = ddf_new.groupby("col2")["col3"].transform("sum", meta=('x', 'float64'))
ddf_new = ddf_new.drop(columns=["col3"])
ddf = ddf.merge(ddf_new, on=["col2"], how="outer") # works but expensive round trip
print(ddf.compute())
But this requires a very expensive dask merge. Is there a better way of solving this problem?
The solution you created seems reasonable. I would make one improvement (if this is feasible with the actual data): if ddf_new is computed, it becomes a pandas DataFrame, so the merge of ddf and ddf_new becomes a lot faster, as there is less data to shuffle around.
Update: to also avoid sending the pandas DataFrame from the workers to the client and back, you could do ddf_new = client.compute(ddf_new) and pass around just the future (a reference to the computed pandas DataFrame).
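Here is a sketch of that idea; note that it swaps the transform for a plain groupby aggregation (my substitution, not part of the answer above) so that the intermediate result is small enough to compute into pandas before merging:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000),
                   "col2": np.random.randint(101, 200, 100000),
                   "col3": np.random.uniform(0, 4, 100000)})
ddf = dd.from_pandas(df, npartitions=100).set_index("col1")

# Aggregate per col2 (this works even with unknown divisions), then
# materialize the small result as a pandas DataFrame.
sums = ddf.groupby("col2")["col3"].sum().compute().rename("col2_sum2").reset_index()

# Merging a dask dataframe with a small pandas dataframe avoids a full
# dask-to-dask shuffle. (With a distributed Client, client.compute on the lazy
# aggregation could keep the result on the workers as a future instead.)
ddf = ddf.merge(sums, on="col2", how="left")
print(ddf.head())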

How to iterate da.linalg.inv over a dask array dimension

What is the best way to iterate da.linalg.inv over a multi-dimensional dask array?
I have a dask array of shape (4, 4, 8, 8), and need to compute the inverse of the last two dimensions. With numpy, np.linalg.inv loops over all dimensions except the last two, so in the following example, I can just call np.linalg.inv(A).
I have chosen to use a for loop, but I have read about gufuncs in dask (the documentation seems a little outdated). However, I'm not sure how to implement it, particularly the "signature" bit:
import dask.array as da
import numpy as np
A = da.random.random((4,4,8,8))
A2 = A.reshape((-1,) + A.shape[-2:])
B = [da.linalg.inv(a) for a in A2]
B2 = da.asarray(B)
B3 = B2.reshape(A.shape)
np.testing.assert_array_almost_equal(
    np.linalg.inv(A.compute()),
    B3
)
My attempt at a gufunc leads to an error:
def foo(x):
    return da.linalg.inv(x)
gufoo = da.gufunc(foo, signature="()->()", output_dtypes=float, vectorize=True)
gufoo(A2).compute() # IndexError: tuple index out of range
I think that you want to apply the numpy function np.linalg.inv over your Dask array rather than the dask array function.
If np.linalg.inv is already a gufunc, then it might work as expected today:
np.linalg.inv(A)
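If that direct call doesn't dispatch the way you want, an alternative sketch (my suggestion, not part of the answer) is da.apply_gufunc with the full matrix signature, so that the last two axes become core dimensions (they must not be split across chunks):
import numpy as np
import dask.array as da

A = da.random.random((4, 4, 8, 8))

# np.linalg.inv already broadcasts over the leading axes, so vectorize is not
# needed; the core dimensions (i, j) are the trailing 8x8 matrix axes.
B = da.apply_gufunc(np.linalg.inv, "(i,j)->(i,j)", A, output_dtypes=float)

np.testing.assert_array_almost_equal(np.linalg.inv(A.compute()), B.compute())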

Dask Dataframe Greater than a Delayed Number

Is there a way to do this but with the threshold as a delayed number?
import dask
import pandas as pd
import dask.dataframe as dd
threshold = 3
df = pd.DataFrame({'something': [1,2,3,4]})
ddf = dd.from_pandas(df, npartitions=2)
ddf[ddf['something'] >= threshold]
What if threshold is:
threshold = dask.delayed(3)
At the moment it gives me:
TypeError('Truth of Delayed objects is not supported')
I want to keep the ddf as a dask dataframe, and not turn it into a pandas dataframe. I'm wondering if there are combinator forms that also take delayed values.
Dask has no way to know that the concrete value in that Delayed object is an integer, so there's no way to know what to do with it in the operation (align, broadcast, etc.).
If you use something like a zero-dimensional array, things seem OK:
In [32]: df = dd.from_pandas(pd.DataFrame({"A": [1, 2, 3, 4]}), 2)
In [33]: threshold = da.from_array(np.array([3]))[0]
In [34]: df.A > threshold
Out[34]:
Dask Series Structure:
npartitions=2
0 bool
2 ...
3 ...
Name: A, dtype: bool
Dask Name: gt, 8 tasks
In [35]: df[df.A > threshold].compute()
Out[35]:
A
3 4
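If the threshold genuinely starts out as a Delayed, one way to turn it into such a zero-dimensional dask array is da.from_delayed; a sketch, assuming the delayed value is a plain integer scalar:
import dask
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

threshold = dask.delayed(3)

# Wrap the delayed scalar as a 0-d numpy array, then as a 0-d dask array with
# a declared shape and dtype, which the comparison can broadcast against.
threshold_arr = da.from_delayed(dask.delayed(np.asarray)(threshold, dtype="int64"),
                                shape=(), dtype="int64")

ddf = dd.from_pandas(pd.DataFrame({"something": [1, 2, 3, 4]}), npartitions=2)
print(ddf[ddf["something"] >= threshold_arr].compute())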

Grouping dask.bag items into distinct partitions

I was wondering if somebody could help me understand the way Bag objects handle partitions. Put simply, I am trying to group items currently in a Bag so that each group is in its own partition. What's confusing me is that the Bag.groupby() method asks for a number of partitions. Shouldn't this be implied by the grouping function? E.g., two partitions if the grouping function returns a boolean?
>>> a = dask.bag.from_sequence(range(20), npartitions = 1)
>>> a.npartitions
1
>>> b = a.groupby(lambda x: x % 2 == 0)
>>> b.npartitions
1
I'm obviously missing something here. Is there a way to group Bag items into separate partitions?
Dask bag may put several groups within one partition.
In [1]: import dask.bag as db
In [2]: b = db.range(10, npartitions=3).groupby(lambda x: x % 5)
In [3]: partitions = b.to_delayed()
In [4]: partitions
Out[4]:
[Delayed(('groupby-collect-f00b0aed94fd394a3c61602f5c3a4d42', 0)),
Delayed(('groupby-collect-f00b0aed94fd394a3c61602f5c3a4d42', 1)),
Delayed(('groupby-collect-f00b0aed94fd394a3c61602f5c3a4d42', 2))]
In [5]: for part in partitions:
...: print(part.compute())
...:
[(0, [0, 5]), (3, [3, 8])]
[(1, [1, 6]), (4, [4, 9])]
[(2, [2, 7])]
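Bag.groupby does take an npartitions argument for its output, but keys are assigned to partitions by hashing, so one group per partition is still not guaranteed; here is a small sketch (the npartitions value is just illustrative):
import dask.bag as db

# Ask for five output partitions for five distinct keys; hashing may still put
# two keys into the same partition, so inspect the result rather than assuming
# a one-to-one mapping of groups to partitions.
b = db.range(10, npartitions=3).groupby(lambda x: x % 5, npartitions=5)
for part in b.to_delayed():
    print(part.compute())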

Can't drop columns or slice dataframe using dask?

I am trying to use dask instead of pandas since I have a 2.6 GB CSV file.
I load it and I want to drop a column, but it seems that neither the drop method
df.drop('column') nor slicing df[:, :-1]
is implemented yet. Is this the case, or am I just missing something?
We implemented the drop method in this PR. This is available as of dask 0.7.0.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.drop('y', axis=1).compute()
Out[5]:
x
0 1
1 2
2 3
Even before that, one could also select the desired columns by name, though of course this is less attractive if you have many columns.
In [6]: ddf[['x']].compute()
Out[6]:
x
0 1
1 2
2 3
This should work, where columns is a list of the column labels to drop:
print(ddf.shape)
ddf = ddf.drop(columns, axis=1)
print(ddf.shape)
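As a usage note, recent versions of pandas and dask also accept a columns= keyword on drop, which avoids the axis argument; a short sketch along the lines of the example above:
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]}), npartitions=2)

# Equivalent to ddf.drop('y', axis=1); columns= accepts a label or list of labels.
print(ddf.drop(columns=['y']).compute())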
