I'm looking into using dask for time-series research with large volumes of data. One common operation I use is realignment of data to a different index (the reindex operation on pandas DataFrames). I noticed that the reindex function is not currently supported in the dask dataframe API, but it is in the DataArray API. Are there plans to add this function?
I believe you could use the DataFrame.set_index() method combined with .resample() for the same purpose.
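A rough sketch of that approach (the file and column names are placeholders, and resample assumes a datetime index with known divisions):

import dask.dataframe as dd

# Illustrative input; 'timestamp' is assumed to hold datetimes
ddf = dd.read_parquet("ticks.parquet")

# set_index sorts/partitions the data on the new index (an expensive shuffle),
# after which resample can realign it to a regular frequency
ddf = ddf.set_index("timestamp")
hourly = ddf.resample("1h").mean()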
I'd like to run an asynchronous dask dataframe computation with dd.persist() and then be able to track the status of an individual partition. The goal is to get access to partial results in a non-blocking way.
Here is the desired pseudocode:
dd = dd.persist()
if dd.partitions[0].__dask_status__ == 'finished':
    # Partial non-blocking result access
    df = dd.partitions[0].compute()
Using dask futures works well, but submitting many individual partitions is very slow compared to a single dd.persist(), and having one future per partition breaks the dashboard "groups" tab by showing too many blocks.
futures = list(map(client.compute, dd.partitions))
[Screenshot: broken dask dashboard "groups" tab]
The function you probably want is distributed.futures_of, which lists the running futures of a collection. You can either examine this list yourself, looking at the status of the futures, or use it with distributed.as_completed and a for-loop to process the partitions as they become available. The keys of the futures look like (collection-name, partition-index), so you know which partition each one belongs to.
The reason dd.partitions[i] (or looping over these with list) doesn't work well is that it creates a new graph for each partition, so you end up submitting much more to the scheduler than the single call to .persist().
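A rough sketch of that pattern (the example dataframe and cluster setup are placeholders):

import pandas as pd
import dask.dataframe as dd
from distributed import Client, as_completed, futures_of

client = Client()  # assumes a local or already-running cluster

# Illustrative collection; any dask dataframe works here
ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=10)
ddf = ddf.persist()        # a single submission covering all partitions
futures = futures_of(ddf)  # the futures backing those partitions

# Process partitions as they finish, without blocking on the whole collection
for future in as_completed(futures):
    name, partition_index = future.key  # keys look like (collection-name, partition-index)
    partial_df = future.result()        # the finished pandas partition
    print(f"partition {partition_index}: {len(partial_df)} rows")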
I was wondering if anyone knew the proper way to write out a group of files based on the value of a column in Dask. In other words, I want to group rows by the value of a column and write each group out to its own CSV. I've been trying to use the groupby-apply paradigm with Dask, but the problem is that it does not return a dask.dataframe object, so the function I apply uses the pandas API.
Is there a better way to approach what I'm trying to do? A scalable solution would be much appreciated because some of the data that I'm dealing with is very large.
Thanks!
If you were saving to parquet, then the partition_on kwarg would be useful. If you are saving to CSV, then it's possible to do something similar with (rough pseudocode):
def save_partition(df, partition_info=None):
    for group_label, group_df in df.groupby('some_col'):
        csv_name = f"{group_label}_partition_{partition_info['number']}.csv"
        group_df.to_csv(csv_name)

delayed_save = ddf.map_partitions(save_partition)
The delayed_save can then be computed when convenient.
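For the parquet route, a minimal sketch of the partition_on approach (the paths and column name are placeholders):

import dask.dataframe as dd

# Illustrative input; 'some_col' is the column whose values define the groups
ddf = dd.read_csv("input-*.csv")

# Writes one subdirectory per distinct value of 'some_col',
# e.g. out/some_col=A/part.0.parquet, out/some_col=B/part.0.parquet, ...
ddf.to_parquet("out/", partition_on=["some_col"])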
I have created a dataframe with a non-sorted index in pandas and saved it to parquet. Later, when I load it with dask, how do I sort the index? Do I have to do something like
pdf.reset_index().set_index(idx)?
As far as I am aware, the answer is yes, your approach is correct. For example, searching for "sort_index" in Dask issues does not really yield any relevant results.
Keep in mind that sorting out-of-core is quite a difficult operation. It's possible you might get more stable results (or even better performance) in Pandas if your dataset fits in your memory.
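If you do go the dask route, it might look roughly like this (the path is a placeholder, and the column name produced by reset_index depends on what the original index was called; 'idx' here is just an assumption):

import dask.dataframe as dd

ddf = dd.read_parquet("data.parquet")

# reset_index turns the unsorted index into a regular column;
# set_index then shuffles and sorts the data on it (an expensive operation)
ddf = ddf.reset_index().set_index("idx")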
I'm using a bucket for collecting tick data for multiple symbols on Binance (e.g. ETH/BTC and BNB/BTC), storing them in different measurements (binance_ethbtc and binance_bnbbtc respectively), and that's working fine. Beyond that, I'd like to aggregate OHLC data into another bucket, just like this guy here. I've already managed to write Flux code for aggregating this data for a single measurement, but then it got me wondering: do I need to write a task for EVERY measurement I have? Isn't there a way of iterating over the measurements in a bucket and aggregating the data into another one?
Thanks to FixTestRepeat on the InfluxDB community, I've managed to do it (and iterating over measurements is not necessary). He showed me that if I remove the filter on the _measurement field, the query will yield as many series as there are measurements. More information here
As part of a data workflow I need to modify values in a subset of dask dataframe columns and pass the results on for further computation. In particular, I'm interested in two cases: mapping columns and mapping partitions. What is the recommended safe and performant way to act on the data? I'm running it in a distributed setup on a cluster with multiple worker processes on each host.
Case 1.
I want to run:
res = dataframe.column.map(func, ...)
This returns a Series, so I assume the original dataframe is not modified. Is it safe to assign the column back to the dataframe, e.g. dataframe['column'] = res? Probably not. Should I make a copy with .copy() and then assign the result to it, like:
dataframe2 = dataframe.copy()
dataframe2['column'] = dataframe.column.map(func, ...)
Any other recommended way to do it?
Case 2
I need to map partitions of the dataframe:
df.map_partitions(mapping_func, meta=df)
Inside mapping_func() I want to modify values in chosen columns, either by using partition[column].map or simply by creating a list comprehension. Again, how do I modify the partition safely and return it from the mapping function?
The partition received by the mapping function is a pandas dataframe (a copy of the original data?), but while modifying data in place I'm seeing some crashes (no exception/error messages, though). The same goes for calling partition.copy(deep=False); it doesn't work. Should the partition be deep-copied and then modified in place? Or should I always construct a new dataframe out of the new/mapped column data and the original/unmodified series/columns?
You can safely modify a dask.dataframe
Operations like the following are supported and safe
df['col'] = df['col'].map(func)
This modifies the task graph in place but does not modify the data in place (assuming that the function func creates a new series).
You cannot safely modify a partition
Your second case, where you map_partitions a function that modifies a pandas dataframe in place, is not safe. Dask expects to be able to reuse data, call functions twice if necessary, etc. If you have such a function, then you should create a copy of the pandas dataframe first within that function.
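A minimal sketch of that copy-first pattern (the column name and the transformation are only illustrative):

import pandas as pd
import dask.dataframe as dd

def mapping_func(partition):
    # Copy first so the partition object held by dask is never mutated in place
    partition = partition.copy()
    partition["column"] = partition["column"].map(lambda x: x * 2)
    return partition

ddf = dd.from_pandas(pd.DataFrame({"column": range(10)}), npartitions=2)
result = ddf.map_partitions(mapping_func)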