Can't drop columns or slice a dataframe using dask?

I am trying to use dask instead of pandas since I have a 2.6 GB CSV file.
I load it and I want to drop a column, but it seems that neither the drop method
df.drop('column') nor slicing df[:, :-1]
is implemented yet. Is this the case, or am I just missing something?

We implemented the drop method in this PR. This is available as of dask 0.7.0.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.drop('y', axis=1).compute()
Out[5]:
   x
0  1
1  2
2  3
Previously you could also have selected the columns to keep by name, though of course this is less attractive if you have many columns.
In [6]: ddf[['x']].compute()
Out[6]:
   x
0  1
1  2
2  3

This should work, where columns is a list of the column labels to drop:
print(ddf.shape)
columns = ['y']          # column labels to drop
ddf = ddf.drop(columns, axis=1)
print(ddf.shape)
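If you need to drop several columns at once, passing a list of labels works the same way as in pandas. A minimal sketch (the frame and column names here are made up for illustration):

import pandas as pd
import dask.dataframe as dd

# Hypothetical frame with extra 'y' and 'z' columns to discard
df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1], 'z': [0, 0, 0]})
ddf = dd.from_pandas(df, npartitions=2)

columns = ['y', 'z']            # labels of the columns to drop
ddf = ddf.drop(columns, axis=1)
print(ddf.compute())            # only 'x' remains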

Related

Dask Dataframe Greater than a Delayed Number

Is there a way to do this but with the threshold as a delayed number?
import dask
import pandas as pd
import dask.dataframe as dd
threshold = 3
df = pd.DataFrame({'something': [1,2,3,4]})
ddf = dd.from_pandas(df, npartitions=2)
ddf[ddf['something'] >= threshold]
What if threshold is:
threshold = dask.delayed(3)
At the moment it gives me:
TypeError('Truth of Delayed objects is not supported')
I want to keep ddf as a dask dataframe and not turn it into a pandas dataframe. I'm wondering if there are combinator forms that also take delayed values.
Dask has no way to know that the concrete value inside that Delayed object is an integer, so there's no way to know what to do with it in the operation (align, broadcast, etc.).
If you use something like a zero-dimensional dask array instead, things seem OK:
In [30]: import numpy as np
In [31]: import dask.array as da
In [32]: df = dd.from_pandas(pd.DataFrame({"A": [1, 2, 3, 4]}), 2)
In [33]: threshold = da.from_array(np.array([3]))[0]
In [34]: df.A > threshold
Out[34]:
Dask Series Structure:
npartitions=2
0 bool
2 ...
3 ...
Name: A, dtype: bool
Dask Name: gt, 8 tasks
In [35]: df[df.A > threshold].compute()
Out[35]:
   A
3  4
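Another option, if you want to keep the threshold as a plain Delayed, is map_partitions, which accepts Delayed objects as arguments and materializes them at compute time. A minimal sketch under that assumption; meta is passed explicitly so dask doesn't have to infer the schema by calling the function on the Delayed:

import dask
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'something': [1, 2, 3, 4]})
ddf = dd.from_pandas(df, npartitions=2)
threshold = dask.delayed(3)

# Each partition is filtered with the concrete threshold value at compute time.
filtered = ddf.map_partitions(
    lambda part, t: part[part['something'] >= t],
    threshold,
    meta=df.iloc[:0],          # empty frame with the same schema
)
print(filtered.compute())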

Dask `.dt` after conversion

I have a dask dataframe with a timestamp column, and I need to get day of the week and month from it.
Here is the ddf construction
from glob import glob
import pandas as pd
import dask.dataframe as dd
from dask import delayed
dfs = [delayed(pd.read_csv)(path) for path in glob('../data/20*.zip')]
df = dd.from_delayed(dfs)
meta = ('starttime', pd.Timestamp)
df['start'] = df.starttime.map_partitions(pd.to_datetime, meta=meta)
Now, if I use something like
df.start.head(10).dt.year, it works (returns a year), which means the column has been converted.
However, when I try to add a new column, it raises an error:
df['dow'] = df['start'].dt.dayofweek (or any other .dt accessor, for that matter):
AttributeError: 'Series' object has no attribute 'dayofweek'
What am I missing here?
I think your meta isn't quite right (it raises an error for me on the latest dask and pandas). Here's a reproducible example that works
In [41]: import numpy as np
In [42]: import pandas as pd
In [43]: import dask.dataframe as dd
In [44]: df = pd.DataFrame({"A": pd.date_range("2017", periods=12)})
In [45]: df['B'] = df.A.astype(str)
In [46]: ddf = dd.from_pandas(df, 2)
In [47]: ddf['C'] = ddf.B.map_partitions(pd.to_datetime, meta=("B", "datetime64[ns]"))
In [48]: ddf.C.dt.dayofweek
Out[48]:
Dask Series Structure:
npartitions=2
0 int64
6 ...
11 ...
Name: C, dtype: int64
Dask Name: dt-dayofweek, 12 tasks
In [49]: ddf.C.dt.dayofweek.compute()
Out[49]:
0     6
1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     0
9     1
10    2
11    3
Name: C, dtype: int64
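Applied to the construction in the question, the fix is to describe the result with a (name, dtype) pair rather than the pd.Timestamp class. A sketch assuming the same file layout as in the question:

from glob import glob
import pandas as pd
import dask.dataframe as dd
from dask import delayed

dfs = [delayed(pd.read_csv)(path) for path in glob('../data/20*.zip')]
df = dd.from_delayed(dfs)

# meta is (name, dtype); pd.Timestamp is a class, not a dtype, which leaves
# the column typed as object and breaks the .dt accessor.
df['start'] = df.starttime.map_partitions(
    pd.to_datetime, meta=('starttime', 'datetime64[ns]'))
df['dow'] = df['start'].dt.dayofweek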
Does that work for you? If not, could you edit your question to include a minimal example?

Grouping dask.bag items into distinct partitions

I was wondering if somebody could help me understand the way Bag objects handle partitions. Put simply, I am trying to group items currently in a Bag so that each group is in its own partition. What's confusing me is that the Bag.groupby() method asks for a number of partitions. Shouldn't this be implied by the grouping function? E.g., two partitions if the grouping function returns a boolean?
>>> import dask.bag
>>> a = dask.bag.from_sequence(range(20), npartitions=1)
>>> a.npartitions
1
>>> b = a.groupby(lambda x: x % 2 == 0)
>>> b.npartitions
1
I'm obviously missing something here. Is there a way to group Bag items into separate partitions?
Dask Bag may put several groups within one partition; the npartitions argument to groupby only controls how many output partitions the grouped data is spread across, not how many groups there are.
In [1]: import dask.bag as db
In [2]: b = db.range(10, npartitions=3).groupby(lambda x: x % 5)
In [3]: partitions = b.to_delayed()
In [4]: partitions
Out[4]:
[Delayed(('groupby-collect-f00b0aed94fd394a3c61602f5c3a4d42', 0)),
Delayed(('groupby-collect-f00b0aed94fd394a3c61602f5c3a4d42', 1)),
Delayed(('groupby-collect-f00b0aed94fd394a3c61602f5c3a4d42', 2))]
In [5]: for part in partitions:
...: print(part.compute())
...:
[(0, [0, 5]), (3, [3, 8])]
[(1, [1, 6]), (4, [4, 9])]
[(2, [2, 7])]
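If you really do need one group per partition and the group keys are known up front, one workaround is to build a single-partition bag per key and concatenate them. A sketch under that assumption (note that it filters the source bag once per key, so it is not free):

import dask.bag as db

b = db.range(10, npartitions=3)
keys = [0, 1, 2, 3, 4]                      # known values of x % 5

# One single-partition bag per key; concat then yields one partition per group.
per_group = [b.filter(lambda x, k=k: x % 5 == k).repartition(npartitions=1)
             for k in keys]
combined = db.concat(per_group)
print(combined.npartitions)                 # 5, one partition per group
for part in combined.to_delayed():
    print(part.compute())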

Dask: outer join read from multiple csv files

import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask import delayed
df1 = pd.DataFrame({'a': np.arange(10), 'b': np.random.rand()})
df1 = df1.astype({'a':np.float64})
df2 = pd.DataFrame({'a': np.random.rand(5), 'c': 1})
df1.to_csv('df1.csv')
df2.to_csv('df2.csv')
dd.read_csv('*.csv').compute()
This gives an inner join result:
   Unnamed: 0         a         b
0           0  0.000000  0.218319
1           1  1.000000  0.218319
2           2  2.000000  0.218319
...
And:
df1_delayed = delayed(lambda: df1)()
df2_delayed = delayed(lambda: df2)()
dd.from_delayed([df1_delayed, df2_delayed]).compute()
This gives an outer join result:
          a         b    c
0  0.000000  0.218319  NaN
1  1.000000  0.218319  NaN
2  2.000000  0.218319  NaN
...
How can I make read_csv work in the same way?
EDIT:
Even passing a dtype schema down to pandas doesn't work:
dd.read_csv('*.csv', dtype={'a':np.float64, 'b': np.float64, 'c': np.float64}).compute()
Generally, dask.dataframe assumes that all of the pandas dataframes that form the dask.dataframe have the same columns and dtypes. Behavior is ill-defined if this is not the case.
If your CSVs have different columns and dtypes, then I recommend using dask.delayed as you've done in your second example, and explicitly adding the missing empty columns before calling dask.dataframe.from_delayed.
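A sketch of that recommendation, reusing the df1.csv / df2.csv files written above; read_and_align is a hypothetical helper that pads each file out to the union of columns:

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed

def read_and_align(path, all_columns):
    # Read one CSV and add any missing columns as NaN so every partition
    # shares the same schema before from_delayed stitches them together.
    df = pd.read_csv(path, index_col=0)
    for col in all_columns:
        if col not in df.columns:
            df[col] = np.nan
    return df[all_columns]

all_columns = ['a', 'b', 'c']      # union of the columns across both files
parts = [delayed(read_and_align)(path, all_columns)
         for path in ['df1.csv', 'df2.csv']]
ddf = dd.from_delayed(parts)
print(ddf.compute())               # outer-join-style result with NaN fills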

dask df.col.unique() vs df.col.drop_duplicates()

In dask what is the difference between
df.col.unique()
and
df.col.drop_duplicates()
Both return a series containing the unique elements of df.col.
There is a difference in the index: the unique result is indexed 1..N, while drop_duplicates is indexed by an arbitrary-looking sequence of numbers.
What is the significance of the index returned by drop_duplicates?
Is there any reason to use one over the other if the index is not important?
Dask.dataframe has both because Pandas has both, and dask.dataframe mostly copies the Pandas API. unique is a holdover from Pandas' history with NumPy.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]}, index=pd.Index(['a', 'b', 'A'], name='I'))
In [3]: df.x.drop_duplicates()
Out[3]:
I
a    1
b    2
Name: x, dtype: int64
In [4]: df.x.unique()
Out[4]: array([1, 2])
In dask.dataframe we deviate slightly and choose to return a dask.dataframe.Series rather than a dask.array.Array, because the length of the result can't be known ahead of time, so it can't be represented as a lazy array.
In practice there is little reason to use unique over drop_duplicates.
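For comparison, a small sketch of the same frame run through dask (dd.from_pandas and npartitions=2 are my additions); both calls return a lazy Series with the same unique values, and only the index differs:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]},
                  index=pd.Index(['a', 'b', 'A'], name='I'))
ddf = dd.from_pandas(df, npartitions=2)

# Unlike pandas, dask's unique() returns a lazy Series rather than an array,
# because the number of unique values isn't known before computing.
print(type(ddf.x.unique()))               # dask Series
print(ddf.x.unique().compute())           # values 1 and 2
print(ddf.x.drop_duplicates().compute())  # same values, original index kept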
