Dask `.dt` after conversion

I have a dask dataframe with a timestamp column, and I need to get day of the week and month from it.
Here is the ddf construction
dfs = [delayed(pd.read_csv)(path) for path in glob('../data/20*.zip')]
df = dd.from_delayed(dfs)
meta = ('starttime', pd.Timestamp)
df['start'] = df.starttime.map_partitions(pd.to_datetime, meta=meta)
Now, if I use something like
df.start.head(10).dt.year, it works (returns a year), which means that the column is converted.
However, when I try to get a new column, it raises an error:
df['dow'] = df['start'].dt.dayofweek (or any other ".dt" option, for that matter):
AttributeError: 'Series' object has no attribute 'dayofweek'
What am I missing here?

I think your meta isn't quite right (it raises an error for me on the latest dask and pandas). Here's a reproducible example that works
In [41]: import numpy as np
In [42]: import pandas as pd
In [43]: import dask.dataframe as dd
In [44]: df = pd.DataFrame({"A": pd.date_range("2017", periods=12)})
In [45]: df['B'] = df.A.astype(str)
In [46]: ddf = dd.from_pandas(df, 2)
In [47]: ddf['C'] = ddf.B.map_partitions(pd.to_datetime, meta=("B", "datetime64[ns]"))
In [48]: ddf.C.dt.dayofweek
Out[48]:
Dask Series Structure:
npartitions=2
0 int64
6 ...
11 ...
Name: C, dtype: int64
Dask Name: dt-dayofweek, 12 tasks
In [49]: ddf.C.dt.dayofweek.compute()
Out[49]:
0 6
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 0
9 1
10 2
11 3
Name: C, dtype: int64
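For the snippet in the question, the fix is presumably just a matter of passing a dtype string in the meta instead of the pd.Timestamp class; a minimal sketch reusing the paths and column names from the question:
import pandas as pd
import dask.dataframe as dd
from dask import delayed
from glob import glob

dfs = [delayed(pd.read_csv)(path) for path in glob('../data/20*.zip')]
df = dd.from_delayed(dfs)
# meta is (column name, dtype); use the dtype string, not the pd.Timestamp class
df['start'] = df.starttime.map_partitions(pd.to_datetime,
                                           meta=('starttime', 'datetime64[ns]'))
# with a proper datetime dtype, the .dt accessor works lazily
df['dow'] = df['start'].dt.dayofweek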
Does that work for you? If not, could you edit your question to include a minimal example?

Related

Dask Dataframe Greater than a Delayed Number

Is there a way to do this but with the threshold as a delayed number?
import dask
import pandas as pd
import dask.dataframe as dd
threshold = 3
df = pd.DataFrame({'something': [1,2,3,4]})
ddf = dd.from_pandas(df, npartitions=2)
ddf[ddf['something'] >= threshold]
What if threshold is:
threshold = dask.delayed(3)
At the moment it gives me:
TypeError('Truth of Delayed objects is not supported')
I want to keep the ddf as a dask dataframe and not turn it into a pandas dataframe. I am wondering if there are combinator forms that also take delayed values.
Dask has no way to know that the concrete value in that Delayed object is an integer, so there's no way to know what to do with it in the operation (align, broadcast, etc.).
If you use something like a zero-dimensional (scalar) dask array, things seem OK:
In [28]: import numpy as np
In [29]: import pandas as pd
In [30]: import dask.array as da
In [31]: import dask.dataframe as dd
In [32]: df = dd.from_pandas(pd.DataFrame({"A": [1, 2, 3, 4]}), 2)
In [33]: threshold = da.from_array(np.array([3]))[0]
In [34]: df.A > threshold
Out[34]:
Dask Series Structure:
npartitions=2
0 bool
2 ...
3 ...
Name: A, dtype: bool
Dask Name: gt, 8 tasks
In [35]: df[df.A > threshold].compute()
Out[35]:
A
3 4
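If you really want to start from a Delayed object, one option (just a sketch, assuming comparisons against a zero-dimensional dask array behave as in the example above) is to wrap it with da.from_delayed, which attaches the shape and dtype information Dask needs:
import dask
import pandas as pd
import dask.array as da
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"A": [1, 2, 3, 4]}), npartitions=2)
# wrap the delayed scalar as a 0-d dask array so its dtype and shape are known
threshold = da.from_delayed(dask.delayed(3), shape=(), dtype=int)
df[df.A >= threshold].compute()
The dataframe stays lazy throughout; the delayed threshold is only evaluated when .compute() runs.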

Dask Equivalent of pd.to_numeric

I am trying to read multiple CSV files, each around 15 GB, using dask read_csv. While doing this, dask interprets a particular column as float; however, it has a few values which are of string type, and it later fails when I try to perform some operation, stating it cannot convert string to float. Hence I used the dtype=str argument to read all the columns as strings. Now I want to convert that particular column to numeric with errors='coerce', so that the records containing strings are converted to NaN values and the rest are converted to float correctly. Can you please advise how this can be achieved using dask?
I have already tried astype conversion:
import dask.dataframe as dd
df = dd.read_csv("./*.csv", encoding='utf8',
                 assume_missing=True,
                 usecols=col_names.values.tolist(),
                 dtype=str)
df["mycol"] = df["mycol"].astype(float)
search_df = df.query('mycol > 0').compute()
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+-----------------------------------+--------+----------+
| Column | Found | Expected |
+-----------------------------------+--------+----------+
| mycol | object | float64 |
+-----------------------------------+--------+----------+
The following columns also raised exceptions on conversion:
- mycol
ValueError("could not convert string to float: 'cliqz.com/tracking'")
# Reproducible example
import dask.dataframe as dd
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True)
df.dtypes  # the count column will appear as float, but it has a couple of dirty values as strings
search_df = df.query('count > 0').compute()  # this line will give the type conversion error

# Edit, with one possible solution; but is this optimal while using dask?
import dask.dataframe as dd
import pandas as pd
to_n = lambda x: pd.to_numeric(x, errors="coerce")
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True,
                 converters={"count": to_n})
df.dtypes
search_df = df.query('count > 0').compute()
I had a similar problem and I solved it using .where.
import numpy as np
import pandas as pd
import dask.dataframe as dd
p = dd.from_pandas(pd.Series(["1", "2", np.nan, "3", "4"]), 1)
p.where(~p.isna(), 999).astype("u4")
Or perhaps replacing the .where line with:
p.where(p.str.isnumeric(), 999).astype("u4")
In my case my DataFrame (or Series) was the result of other operations, so I couldn't apply it directly to read_csv.
As of March 2020, dask.dataframe.to_numeric() has been implemented and is documented in the Dask DataFrame API reference.
Here's a minimal example:
import pandas as pd
import dask.dataframe as dd
# create dask dataframe with dummy data incl. number as string
data = {'A': '1', 'B': 2, 'C': 3}
df = pd.DataFrame([data])
ddf = dd.from_pandas(df, npartitions=3)
# inspect dtypes
ddf.dtypes
> A object
> B int64
> C int64
> dtype: object
# apply to_numeric method
ddf.A = dd.to_numeric(ddf.A)
# verify dtypes
ddf.dtypes
> A int64
> B int64
> C int64
> dtype: object
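Applied to the original problem, this would presumably look something like the sketch below (mycol and the glob pattern are taken from the question, and errors='coerce' is assumed to behave as in pandas):
import dask.dataframe as dd

# read everything as string, then coerce the dirty column;
# non-numeric values become NaN instead of raising
df = dd.read_csv("./*.csv", encoding='utf8', assume_missing=True, dtype=str)
df["mycol"] = dd.to_numeric(df["mycol"], errors="coerce")
search_df = df.query('mycol > 0').compute()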

Groupby - rolling alternative in Dask

I am trying to implement a rolling average which resets whenever a '1' is encountered in a column labeled 'A'.
For example, the following functionality works in Pandas.
import pandas as pd
x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']
x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
If I try an analogous code in Dask, I get the following:
import pandas as pd
import dask
x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']
x = dask.dataframe.from_pandas(x, npartitions=3)
x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-189-b6cd808da8b1> in <module>()
7 x = dask.dataframe.from_pandas(x, npartitions=3)
8
----> 9 x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
10 x
AttributeError: 'SeriesGroupBy' object has no attribute 'rolling'
After searching through the Dask API documentation I have not been able to find an implementation of what I am looking for.
Can anyone suggest an implementation of this algorithm in a Dask compatible way?
Thank you :)
Since then I found the following code snippet:
df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()
at Dask rolling function by group syntax.
Here is an adapted toy example:
import pandas as pd
import dask.dataframe as dd
x = pd.DataFrame([[1,2,3], [2,3,4], [4,5,6], [2,3,4], [4,5,6], [4,5,6], [2,3,4]])
x['bool'] = [0,0,0,1,0,1,0]
x.columns = ['a', 'b', 'x', 'bool']
ddf = dd.from_pandas(x, npartitions=4)
ddf['cumsum'] = ddf['bool'].cumsum()
df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()
df1
This has the correct functionality, but the order of the indices is now incorrect. Alternatively, if one knows how to preserve the order of the index, that would be a suitable solution.
You might want to construct your own rolling operation using the map_overlap or _cum_agg methods (_cum_agg is unfortunately not well documented).
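As a sketch of the map_overlap route (this only reproduces the plain rolling mean and keeps the original index order; the reset-on-1 logic would still need the groupby):
import pandas as pd
import dask.dataframe as dd

x = pd.DataFrame([[0, 2, 3], [0, 5, 6], [0, 8, 9], [1, 8, 9],
                  [0, 8, 9], [0, 8, 9], [0, 3, 5]], columns=['A', 'B', 'C'])
ddf = dd.from_pandas(x, npartitions=3)
# each partition sees one extra row from its predecessor, which is exactly
# the context a rolling window of length 2 needs at a partition boundary
rolled = ddf['B'].map_overlap(lambda s: s.rolling(2).mean(),
                              before=1, after=0, meta=('B', 'f8'))
rolled.compute()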

dask can not read the file that pandas can

I have a csv file that can be accessed using pandas but fails with dask dataframe.
I am using the exact same parameters and still getting an error with dask.
Pandas use case:
import pandas as pd
mycols = ['id', 'tran_id', 'client_id', 'm_text', 'retry', 'tran_date']
df = pd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols, skipinitialspace=True,
                 escapechar='\\', engine='python', dtype=str)
Pandas output:
df.retry.value_counts()
1     2792174
2      907081
3      116369
6        6475
4        5598
7        1314
5        1053
8         288
16          3
13          3
Name: retry, dtype: int64
dask code:
import dask.dataframe as dd
from dask.distributed import Client
client = Client('Dask-Scheduler.local-dask:8786')
df = dd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols, skipinitialspace=True,
                 escapechar='\\', engine='python', dtype=str,
                 storage_options={'anon': False, 'key': 'xxx', 'secret': 'xxx'})
df_persisted = client.persist(df)
df_persisted.retry.value_counts().compute()
Dask Output:
ParserError: unexpected end of data
I have tried opening smaller (and bigger) files in dask and there was no issue with them. It is possible that this file may have unclosed quotations. I cannot see any reason why dask is unable to read the file.
Dask splits your files by looking for the line separator character b"\n". It looks for this single byte in parts of the file, so that the whole thing does not need to be read beforehand. When it finds one, it is not aware of whether the byte is escaped or within a quoted scope.
Thus, the chunking up of a large file by Dask can fail, and it appears that this is happening for you: some block is finishing on a newline which is not really a line ending.
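One workaround worth trying (assuming the problem really is a newline inside a quoted or escaped field) is blocksize=None, which makes Dask read each file as a single partition instead of splitting it on bytes, so the parsing happens exactly as it would in pandas:
import dask.dataframe as dd

mycols = ['id', 'tran_id', 'client_id', 'm_text', 'retry', 'tran_date']
# blocksize=None: one partition per file, no byte-level splitting,
# so quoted or escaped newlines can no longer break the chunking
df = dd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols, skipinitialspace=True,
                 escapechar='\\', engine='python', dtype=str,
                 blocksize=None,
                 storage_options={'anon': False, 'key': 'xxx', 'secret': 'xxx'})
You lose parallelism within a single file this way, but across many files you still get one partition per file.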

Can't drop columns or slice dataframe using dask?

I am trying to use dask instead of pandas since I have a 2.6 GB csv file.
I load it and I want to drop a column, but it seems that neither the drop method
df.drop('column') nor slicing df[:, :-1]
is implemented yet. Is this the case, or am I just missing something?
We implemented the drop method in this PR. This is available as of dask 0.7.0.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.drop('y', axis=1).compute()
Out[5]:
x
0 1
1 2
2 3
Before drop was available, you could also just select the columns you wanted by name, though of course this can be less attractive if you have many columns.
In [6]: ddf[['x']].compute()
Out[6]:
x
0 1
1 2
2 3
This should work, where columns is a list of the column labels you want to drop:
columns = ['y']
print(ddf.shape)  # note: the row count in a dask dataframe's shape is lazy
ddf = ddf.drop(columns, axis=1)
print(ddf.shape)
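On reasonably recent dask versions you can presumably also use the pandas-style columns= keyword instead of labels plus axis=1:
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]}), npartitions=2)
# drop by keyword; mirrors pandas DataFrame.drop(columns=...)
ddf.drop(columns=['y']).compute()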
