Dask Equivalent of pd.to_numeric

I am trying to read multiple CSV files, each around 15 GB, using dask read_csv. While performing this task, dask interprets a particular column as float; however, it has a few values of string type, and later it fails when I try to perform some operation, stating it cannot convert string to float. Hence I used the dtype=str argument to read all the columns as strings. Now I want to convert that particular column to numeric with errors='coerce' so that the records containing strings are converted to NaN values and the rest are converted to float correctly. Can you please advise how this can be achieved using dask?
Have already tried: astype conversion
import dask.dataframe as dd
df = dd.read_csv("./*.csv", encoding='utf8',
                 assume_missing=True,
                 usecols=col_names.values.tolist(),
                 dtype=str)
df["mycol"] = df["mycol"].astype(float)
search_df = df.query('mycol > 0').compute()
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+-----------------------------------+--------+----------+
| Column | Found | Expected |
+-----------------------------------+--------+----------+
| mycol | object | float64 |
+-----------------------------------+--------+----------+
The following columns also raised exceptions on conversion:
- mycol
ValueError("could not convert string to float: 'cliqz.com/tracking'")
# Reproducible example
import dask.dataframe as dd
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True)
df.dtypes  # the count column will appear as float, but it has a couple of dirty string values
search_df = df.query('count > 0').compute()  # this line raises the type conversion error
# Edit with one possible solution, but is this optimal while using dask?
import dask.dataframe as dd
import pandas as pd

to_n = lambda x: pd.to_numeric(x, errors="coerce")
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True,
                 converters={"count": to_n})
df.dtypes
search_df = df.query('count > 0').compute()

I had a similar problem and I solved it using .where.
import numpy as np
import pandas
import dask.dataframe as ddf

p = ddf.from_pandas(pandas.Series(["1", "2", np.nan, "3", "4"]), 1)
p.where(~p.isna(), 999).astype("u4")
Or perhaps replacing the second line with:
p.where(p.str.isnumeric(), 999).astype("u4")
In my case my DataFrame (or Series) was the result of other operations, so I couldn't apply it directly to read_csv.

As of March 2020, dask.dataframe.to_numeric() has been implemented; it is documented in the Dask DataFrame API reference.
Here's a minimal example:
import pandas as pd
import dask.dataframe as dd
# create dask dataframe with dummy data incl. number as string
data = {'A': '1', 'B': 2, 'C': 3}
df = pd.DataFrame([data])
ddf = dd.from_pandas(df, npartitions=3)
# inspect dtypes
ddf.dtypes
> A object
> B int64
> C int64
> dtype: object
# apply the to_numeric method
ddf['A'] = dd.to_numeric(ddf['A'])
# verify dtypes
ddf.dtypes
> A int64
> B int64
> C int64
> dtype: object
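Applied to the question, a hedged sketch (the file and column names are taken from the reproducible example above; untested against the real data): read the dirty column as string, then coerce it so that non-numeric strings become NaN.
import dask.dataframe as dd

df = dd.read_csv("mydata.csv", encoding='utf8', dtype={"count": str})
# errors="coerce" turns unparseable strings into NaN instead of raising
df["count"] = dd.to_numeric(df["count"], errors="coerce")
search_df = df.query('count > 0').compute()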

Related

Dask dataframe parallel task

I want to create features (additional columns) from a dataframe, and I have the following structure for many functions.
Following this documentation https://docs.dask.org/en/stable/delayed-best-practices.html I have come up with the code below.
However, I get the error message concurrent.futures._base.CancelledError, and many times I get the warning: distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
I understand that the object I am appending to delay is very large (it works OK when I use the commented-out df), which is why the program crashes, but is there a better way of doing it?
import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask

def main():
    # df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000), "col2": np.random.randint(101, 200, 100000), "col3": np.random.uniform(0, 4, 100000)})
    df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000000), "col2": np.random.randint(101, 200, 100000000), "col3": np.random.uniform(0, 4, 100000000)})
    ddf = dd.from_pandas(df, npartitions=100)
    ddf = ddf.set_index("col1")
    delay = []

    def create_col_sth():
        group = ddf.groupby("col1")["col3"]

        @dask.delayed
        def small_fun(lag):
            return f"col_{lag}", group.transform(lambda x: x.shift(lag), meta=('x', 'float64')).apply(lambda x: np.log(x), meta=('x', 'float64'))

        for lag in range(5):
            x = small_fun(lag)
            delay.append(x)

    create_col_sth()
    delayed = dask.compute(*delay)
    for data in delayed:
        ddf[data[0]] = data[1]
    ddf.to_parquet("test", engine="fastparquet")

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main()
Not sure if this will resolve all of your issues, but generally you don't need to (and shouldn't) mix delayed and dask.dataframe operations like this. Additionally, you shouldn't pass large data objects into delayed functions through closures, like group in your example. Instead, include them as explicit arguments, or in this case, don't use delayed at all and use dask.dataframe native operations or in-memory operations with dask.dataframe.map_partitions.
Implementing these, I would rewrite your main function as follows:
df = pd.DataFrame({
    "col1": np.random.randint(1, 100, 100000000),
    "col2": np.random.randint(101, 200, 100000000),
    "col3": np.random.uniform(0, 4, 100000000),
})
ddf = dd.from_pandas(df, npartitions=100)
ddf = ddf.set_index("col1")
group = ddf.groupby("col1")["col3"]

# directly assign the dataframe operations as columns
for lag in range(5):
    ddf[f"col_{lag}"] = (
        group
        .transform(lambda x: x.shift(lag), meta=('x', 'float64'))
        .apply(lambda x: np.log(x), meta=('x', 'float64'))
    )

# this triggers the operation implicitly - no need to call compute
ddf.to_parquet("test", engine="fastparquet")
After long periods of frustration with Dask, I think I have hacked the holy grail of refactoring pandas transformations wrapped with dask.
Learning points:
Index intelligently. If you are grouping by or merging, consider indexing the columns you use for those operations.
Partition and repartition intelligently. If you have a dataframe of 10k rows and another of 1m rows, they should naturally have different numbers of partitions.
Don't use dask dataframe transformation methods except, for example, merge. The other transformations are better written as pandas code wrapped in map_partitions.
Don't accumulate overly large task graphs, so consider saving intermediate results, for example after indexing or after a complex transformation.
If possible, filter the dataframe and work with a smaller subset; you can always merge this back into the bigger data set.
If you are working on your local machine, set the memory limits within the boundaries of your system specifications. This point is very important. In the example below I create one million rows of 3 columns: one is an int64 and two are float64, each 8 bytes, so 24 bytes per row, which gives me 24 million bytes in total.
import gc
import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask

# https://stackoverflow.com/questions/52642966/repartition-dask-dataframe-to-get-even-partitions
def _rebalance_ddf(ddf):
    """Repartition dask dataframe to ensure that partitions are roughly equal size.

    Assumes `ddf.index` is already sorted.
    """
    if not ddf.known_divisions:  # e.g. for read_parquet(..., infer_divisions=False)
        ddf = ddf.reset_index().set_index(ddf.index.name, sorted=True)
    index_counts = ddf.map_partitions(lambda _df: _df.index.value_counts().sort_index()).compute()
    index = np.repeat(index_counts.index, index_counts.values)
    divisions, _ = dd.io.io.sorted_division_locations(index, npartitions=ddf.npartitions)
    return ddf.repartition(divisions=divisions)

def main(client):
    size = 1000000
    df = pd.DataFrame({"col1": np.random.randint(1, 10000, size), "col2": np.random.randint(101, 20000, size), "col3": np.random.uniform(0, 100, size)})

    # Select appropriate partitions
    ddf = dd.from_pandas(df, npartitions=500)
    del df
    gc.collect()

    # If you want to group by a certain column, it is always best if that column is an indexed one
    ddf = ddf.set_index("col1")
    ddf = _rebalance_ddf(ddf)
    print(ddf.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf.memory_usage(deep=True).sum().compute())

    # Always persist your data to prevent big task graphs; in fact, if you omit this step processing will fail
    ddf.to_parquet("test", engine="fastparquet")
    ddf = dd.read_parquet("test")

    # Dummy code to create a dataframe to be merged based on col1
    ddf2 = ddf[["col2", "col3"]]
    ddf2["col2/col3"] = ddf["col2"] / ddf["col3"]
    ddf2 = ddf2.drop(columns=["col2", "col3"])

    # Repartition the data
    ddf2 = _rebalance_ddf(ddf2)
    print(ddf2.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf2.memory_usage(deep=True).sum().compute())

    def mapped_fun(data):
        for lag in range(5):
            data[f"col_{lag}"] = data.groupby("col1")["col3"].transform(lambda x: x.shift(lag)).apply(lambda x: np.log(x))
        return data

    # Process the groupby transformation in pandas, but wrapped with Dask; if you use the Dask functions
    # to do this you will run into a variety of issues.
    ddf = ddf.map_partitions(mapped_fun)

    # Additionally, you can merge ddf with ddf2, but only on an indexed column, otherwise you run into a variety of issues
    ddf = ddf.merge(ddf2, on=['col1'], how="left")
    ddf.to_parquet("final", engine="fastparquet")

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main(client)

Dask Dataframe Greater than a Delayed Number

Is there a way to do this but with the threshold as a delayed number?
import dask
import pandas as pd
import dask.dataframe as dd
threshold = 3
df = pd.DataFrame({'something': [1,2,3,4]})
ddf = dd.from_pandas(df, npartitions=2)
ddf[ddf['something'] >= threshold]
What if threshold is:
threshold = dask.delayed(3)
At the moment it gives me:
TypeError('Truth of Delayed objects is not supported')
I want to keep the ddf as a dask dataframe, and not turn it into a pandas dataframe. I am wondering if there are combinator forms that also take delayed values.
Dask has no way to know that the concrete value in that Delayed object is an integer, so there's no way to know what to do with it in the operation (align, broadcast, etc.).
If you use something like a zero-dimensional dask array (a scalar), things seem OK:
In [31]: import numpy as np, dask.array as da
In [32]: df = dd.from_pandas(pd.DataFrame({"A": [1, 2, 3, 4]}), 2)
In [33]: threshold = da.from_array(np.array([3]))[0]
In [34]: df.A > threshold
Out[34]:
Dask Series Structure:
npartitions=2
0 bool
2 ...
3 ...
Name: A, dtype: bool
Dask Name: gt, 8 tasks
In [35]: df[df.A > threshold].compute()
Out[35]:
A
3 4
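If the threshold really does start out as a Delayed object, one hedged option (a sketch, not verified across dask versions) is to wrap it into a zero-dimensional dask array with da.from_delayed, which supplies the shape and dtype information dask needs for the comparison:
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'something': [1, 2, 3, 4]}), npartitions=2)

threshold = dask.delayed(3)
# wrap the delayed value as a 0-d numpy array and declare its shape/dtype
threshold_arr = da.from_delayed(dask.delayed(np.asarray)(threshold), shape=(), dtype=int)

result = ddf[ddf['something'] >= threshold_arr]
print(result.compute())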

convert dask.bag of dictionaries to dask.dataframe using dask.delayed and pandas.DataFrame

I am struggling to convert a dask.bag of dictionaries, via dask.delayed pandas.DataFrames, into a final dask.dataframe.
I have one function (make_dict) that reads files into a rather complex nested dictionary structure and another function (make_df) to turn these dictionaries into a pandas.DataFrame (the resulting dataframe is around 100 MB for each file). I would like to append all dataframes into a single dask.dataframe for further analysis.
Up to now I was using dask.delayed objects to load, convert and append all data, which works fine (see example below). However, for future work I would like to store the loaded dictionaries in a dask.bag using dask.persist().
I managed to load the data into a dask.bag, resulting in a list of dicts or a list of pandas.DataFrames that I can use locally after calling compute(). When I tried turning the dask.bag into a dask.dataframe using to_delayed(), however, I got stuck with an error (see below).
It feels like I am missing something rather simple here, or maybe my approach to dask.bag is wrong?
The example below shows my approach using simplified functions and throws the same error. Any advice on how to tackle this is appreciated.
import numpy as np
import pandas as pd
import dask
import dask.dataframe
import dask.bag

print(dask.__version__)  # 1.1.4
print(pd.__version__)    # 0.24.2

def make_dict(n=1):
    return {"name": "dictionary", "data": {'A': np.arange(n), 'B': np.arange(n)}}

def make_df(d):
    return pd.DataFrame(d['data'])

k = [1, 2, 3]

# using dask.delayed
dfs = []
for n in k:
    delayed_1 = dask.delayed(make_dict)(n)
    delayed_2 = dask.delayed(make_df)(delayed_1)
    dfs.append(delayed_2)
ddf1 = dask.dataframe.from_delayed(dfs).compute()  # this works as expected

# using dask.bag and turning the bag of dicts into a bag of DataFrames
b1 = dask.bag.from_sequence(k).map(make_dict)
b2 = b1.map(make_df)
df = pd.DataFrame().append(b2.compute())  # <- I would like to do this using delayed dask.DataFrames like above
ddf2 = dask.dataframe.from_delayed(b2.to_delayed()).compute()  # <- this fails
# error:
# ValueError: Expected iterable of tuples of (name, dtype), got [   A  B
# 0  0  0]
What I ultimately would like to do using the distributed scheduler:
b = dask.bag.from_sequence(k).map(make_dict)
b = b.persist()
ddf = dask.dataframe.from_delayed(b.map(make_df).to_delayed())
In the bag case the delayed objects point to lists of elements, so you have a list of lists of pandas dataframes, which is not quite what you want. Two recommendations:
Just stick with dask.delayed. It seems to work well for you.
Use the Bag.to_dataframe method, which expects a bag of dicts and does the dataframe conversion itself. A sketch of this second option follows below.
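A hedged sketch of the Bag.to_dataframe route (the to_rows helper is hypothetical; it just flattens the nested dictionaries from the question into one flat dict per row so that to_dataframe can infer the columns):
import numpy as np
import dask.bag as db

def make_dict(n=1):
    return {"name": "dictionary", "data": {'A': np.arange(n), 'B': np.arange(n)}}

def to_rows(d):
    # hypothetical helper: one flat dict per row of the nested 'data' dict
    data = d["data"]
    return [{"A": a, "B": b} for a, b in zip(data["A"], data["B"])]

k = [1, 2, 3]
b = db.from_sequence(k).map(make_dict).persist()
ddf = b.map(to_rows).flatten().to_dataframe()
print(ddf.compute())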

Groupby - rolling alternative in Dask

I am trying to implement a rolling average which resets whenever a '1' is encountered in a column labeled 'A'.
For example, the following functionality works in Pandas.
import pandas as pd
x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']
x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
If I try analogous code in Dask, I get the following:
import pandas as pd
import dask
import dask.dataframe
x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']
x = dask.dataframe.from_pandas(x, npartitions=3)
x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-189-b6cd808da8b1> in <module>()
7 x = dask.dataframe.from_pandas(x, npartitions=3)
8
----> 9 x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
10 x
AttributeError: 'SeriesGroupBy' object has no attribute 'rolling'
After searching through the Dask API documentation I have not been able to find an implementation of what I am looking for.
Can anyone suggest an implementation of this algorithm in a Dask-compatible way?
Thank you :)
Since then I found the following code snippet:
df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()
at Dask rolling function by group syntax.
Here is an adapted toy example:
import pandas as pd
import dask.dataframe as dd
x = pd.DataFrame([[1,2,3], [2,3,4], [4,5,6], [2,3,4], [4,5,6], [4,5,6], [2,3,4]])
x['bool'] = [0,0,0,1,0,1,0]
x.columns = ['a', 'b', 'x', 'bool']
ddf = dd.from_pandas(x, npartitions=4)
ddf['cumsum'] = ddf['bool'].cumsum()
df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()
df1
This has the correct functionality, but the order of the indices is now incorrect. Alternatively, if one knows how to preserve the order of the index, that would be a suitable solution.
You might want to construct your own rolling operation using the map_overlap or the _cum_agg methods (cum_agg is unfortunately not well documented).
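A hedged sketch of the map_overlap route, applied to the original example (the resetting_rolling_mean helper is hypothetical; for a window of size w, before should be w - 1 so every row sees enough history across partition boundaries, and the original row order is preserved because the work stays per-partition):
import pandas as pd
import dask.dataframe as dd

x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9], [0,8,9], [0,8,9],
                  [0,3,5], [1,8,9], [0,8,9], [0,8,9], [0,3,5]],
                 columns=['A', 'B', 'C'])
ddf = dd.from_pandas(x, npartitions=3)

def resetting_rolling_mean(pdf, window=2):
    # plain pandas on each (overlapped) partition: groups restart wherever A == 1
    out = pdf.copy()
    rolled = pdf.groupby(pdf['A'].cumsum())['B'].rolling(window).mean()
    out['avg'] = rolled.reset_index(level=0, drop=True)
    return out

meta = x.head(0).assign(avg=0.0)
result = ddf.map_overlap(resetting_rolling_mean, before=1, after=0, meta=meta)
print(result.compute())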

Dask `.dt` after conversion

I have a dask dataframe with a timestamp column, and I need to get day of the week and month from it.
Here is the ddf construction
from glob import glob
import pandas as pd
import dask.dataframe as dd
from dask import delayed

dfs = [delayed(pd.read_csv)(path) for path in glob('../data/20*.zip')]
df = dd.from_delayed(dfs)
meta = ('starttime', pd.Timestamp)
df['start'] = df.starttime.map_partitions(pd.to_datetime, meta=meta)
Now, if I use something like
df['start'].head(10).dt.year, it works (returns the year), which means that the column is converted.
However, when I try to get a new column, it raises an error:
df['dow'] = df['start'].dt.dayofweek (or any other ".dt" option, for that matter):
AttributeError: 'Series' object has no attribute 'dayofweek'
What am I missing here?
I think your meta isn't quite right (it raises an error for me on the latest dask and pandas). Here's a reproducible example that works:
In [41]: import numpy as np
In [42]: import pandas as pd
In [43]: import dask.dataframe as dd
In [44]: df = pd.DataFrame({"A": pd.date_range("2017", periods=12)})
In [45]: df['B'] = df.A.astype(str)
In [46]: ddf = dd.from_pandas(df, 2)
In [47]: ddf['C'] = ddf.B.map_partitions(pd.to_datetime, meta=("B", "datetime64[ns]"))
In [48]: ddf.C.dt.dayofweek
Out[48]:
Dask Series Structure:
npartitions=2
0 int64
6 ...
11 ...
Name: C, dtype: int64
Dask Name: dt-dayofweek, 12 tasks
In [49]: ddf.C.dt.dayofweek.compute()
Out[49]:
0 6
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 0
9 1
10 2
11 3
Name: C, dtype: int64
Does that work for you? If not, could you edit your question to include a minimal example?
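Applied back to the snippet in the question, a hedged sketch of the fix (column names taken from the question, untested against the real files) is to give the meta a datetime64[ns] dtype rather than the pd.Timestamp class:
df['start'] = df.starttime.map_partitions(
    pd.to_datetime, meta=('starttime', 'datetime64[ns]')
)
df['dow'] = df['start'].dt.dayofweek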
