Why does the number of dask tasks increase with an "optimized" chunking compared to a "basic" chunking schema? - dask

I'm trying to understand how different chunking schemas can speed up or slow down my computation using xarray and dask.
I have read the dask and xarray guides but I might have missed something.
Problem
I have 2 stores with the same content but chunked differently.
Both contain a data variable tasmax and the necessary coordinate variables and metadata for it to be opened with xarray.
tasmax shape is <xarray.DataArray 'tasmax' (time: 3660, lat: 256, lon: 512)>
The first storage is a zarr store zarr_init which I made from netCDF files, 1 file per year, 10 .nc files.
When opening it with xarray I get a chunking schema of chunksize=(366, 256, 512), thus 1 year per chunk, same as the initial netCDF storage.
Each chunk is around 191MB.
The second storage, zarr_time_opti, is also a zarr store, but there is no chunking on the time dimension.
When I open it with xarray and inspect tasmax, its chunking schema is chunksize=(3660, 114, 115).
Each chunk is around 191MB as well.
Naively, I would expect spatially independent computations to run much faster and to generate much fewer tasks on zarr_time_opti than on zarr_init.
However, I observe the complete opposite:
When computing the same calculation based on groupby("time.month"), I get 2370 tasks with zarr_time_opti and only 570 tasks with zarr_init. As you can see from the MRE below, this has nothing to do with zarr itself, as I'm able to reproduce the issue with only xarray and dask.
So my questions are:
What is the mechanism in xarray or dask which creates that many tasks?
Then, what would be the strategy to find the best chunking schema?
MRE
import xarray as xr


def simple_climate_index(da):
    import time

    time_start = time.perf_counter()
    # computations
    res = (da.groupby("time.month") - da.groupby("time.month").mean("time")).compute()
    # summer_days = (da > 25).resample(time="MS").sum().compute()
    time_elapsed = time.perf_counter() - time_start
    print(f"wall time: {time_elapsed} secs")


def mre_so():
    import distributed
    import numpy as np
    import pandas as pd

    client = distributed.Client(memory_limit="16GB", n_workers=1, threads_per_worker=4)
    tasmax = xr.DataArray(
        data=np.empty((3660, 256, 512), dtype=float),
        dims=["time", "lat", "lon"],
        coords=dict(
            time=pd.date_range("2042-01-01", periods=3660, freq="D"),
            lat=np.arange(256),
            lon=np.arange(512),
        ),
        name="tasmax",
        attrs={"units": "degC"},
    )
    da_optimized = tasmax.copy(deep=True).chunk(dict(time=-1, lat=114, lon=115))
    simple_climate_index(da_optimized)
    # wall time: ~47 secs - 2370 tasks (observed on client)
    da_init = tasmax.copy(deep=True).chunk(dict(time=366, lat=-1, lon=-1))
    simple_climate_index(da_init)
    # wall time: ~37 secs - 570 tasks (observed on client)


if __name__ == "__main__":
    mre_so()
Notes
zarr_time_opti is obtained by rechunking zarr_init with rechunker, a library to efficiently rewrite to different chunking schemas.
In reality, I'm doing time series analyses by computing (for example) the 90th daily percentile over 30 years on each pixel and then computing the exceedance rate of tasmax compared to this percentile, again on each pixel.
In this case, using ~100 years, I get around 2000 tasks when time is chunked and around 85000 when time is not chunked.

What is the mechanism in xarray or dask which creates that many tasks?
In the case of da_optimized, you seem to be chunking along both lat and lon dimensions, and in da_init, you're chunking along only the time dimension.
(chunk-layout visualizations for da_optimized and da_init not reproduced here)
When you do a compute, in the beginning, each task will correspond to one chunk.
Sidenotes about your specific example:
da_optimized starts with 15 chunks and da_init with 10; this alone contributes to fewer overall tasks in da_init. So, to balance them, I've modified it to be:
da_optimized = tasmax.copy(deep=True).chunk(dict(time=-1, lat=128, lon=103))
While executing, xarray shows this warning: PerformanceWarning: Slicing with an out-of-order index is generating 11 times more chunks. So, I've simplified the computation in simple_climate_index to be:
res = da.groupby("time.month").mean("time").compute()
The best chunking technique would depend on what operation you're doing.
For a groupby operation, commonly seen in pandas, I can see why da_init has fewer tasks and is faster: all the lat+lon data is kept within a single chunk for any given timestamp. (Moreover, Dask can optimize the number of chunks based on the groups in this case. For example, you're grouping by month, so even if you start with 100 chunks, you'll end up with 12 groups, which can potentially be stored as one group per chunk, so 12 chunks in total. I'm not sure if xarray actually does this optimization, I'm just saying it's possible.)
In da_optimized, a groupby will require communication between chunks because the lat+lon data are spread across different chunks, which will result in more tasks, and therefore a performance penalty.
Here are the (task) graph visualizations for both operations:
(task-graph images for da_optimized and da_init not reproduced here)
Then, what would be the strategy to find the best chunking schema?
Since you're doing the groupby() on "time", the task graph would be most efficient if you chunk along the same (time) dimension.
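As a rough sketch of that strategy applied to your example (the numbers are illustrative; the idea is to keep the full lat/lon grid in each chunk and tune the time chunk so each chunk stays around 100-200 MB):

da = tasmax.chunk(dict(time=366, lat=-1, lon=-1))   # chunk along time only, as in da_init
print(da.chunks)   # ((366, ..., 366), (256,), (512,)) -> 10 chunks, whole lat/lon grid in each

# every chunk holds the full lat/lon grid, so the monthly groupby only has to
# combine chunks along the time axis
res = (da.groupby("time.month") - da.groupby("time.month").mean("time")).compute()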

Related

Does GPU accelerate data preprocessing in ML tasks?

I am doing a machine learning (value prediction) task. Preprocessing the data takes a very long time. I have a csv file with around 640,000 rows, and I am trying to subtract the dates of consecutive rows and calculate the time duration. The csv file looks as attached. For example, 2011-08-17 to 2011-08-19 takes 2 days, and I would like to write 2 to the "time duration" column. I've used the Python datetime functions to do this, and it costs a lot of time.
import pandas as pd
from datetime import datetime

data = pd.read_csv(f'{proj_dir}/raw data/measures.csv', encoding="cp1252")
file = data[['ID', 'date', 'value1', 'value2', 'duration']]

def time_subtraction(date, prev_date):
    diff = datetime.strptime(date, '%Y-%m-%d') - datetime.strptime(prev_date, '%Y-%m-%d')
    diff_days = diff.days
    return diff_days

def calculate_time_duration(dataframe, set_0_indices):
    for i in range(dataframe.shape[0]):
        # For each patient, sets "Time Duration" at the first measurement to be 0
        if i in set_0_indices.values:
            dataframe.iloc[i, 4] = 0  # set time duration to 0 (beginning of this patient)
        else:  # time subtraction
            dataframe.iloc[i, 4] = time_subtraction(date=dataframe.iloc[i, 1], prev_date=dataframe.iloc[i-1, 1])
    return dataframe

# I am running on Google Colab. This line takes very long.
result = calculate_time_duration(dataframe=file, set_0_indices=set_time_0_indices)
I wonder if there are any ways to accelerate this process. Does using a GPU help? I have access to a remote GPU, but I don't know if using a GPU helps with data preprocessing. By the way, under what scenario can GPUs really make things faster? Thanks in advance!
(image: what my data looks like)
Regarding updating your data in a faster fashion please see this post.
Regarding speed improvements using the GPU: you can only benefit from the GPU if there are optimized operations that can actually run on the GPU, and preprocessing like yours is normally not in that scope. You must also consider that you would need to transfer the data to the GPU first before computing anything, and then transfer the results back. In your case, this would take much longer than the actual speedup, especially since your operation on the data is quite simple. I'm sure using the correct pandas syntax will lead to the speedup you want in preprocessing.
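For illustration, a vectorized pandas version of that computation could look roughly like this (a sketch only; it assumes the 'date' column is in '%Y-%m-%d' format, the frame has a default RangeIndex, and set_time_0_indices holds the positional indices of each patient's first measurement):

import pandas as pd

data = pd.read_csv(f'{proj_dir}/raw data/measures.csv', encoding="cp1252")
dates = pd.to_datetime(data['date'], format='%Y-%m-%d')
# day differences to the previous row, computed in one vectorized pass
data['duration'] = dates.diff().dt.days
# reset the duration to 0 at each patient's first measurement
data.loc[data.index.isin(set_time_0_indices), 'duration'] = 0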

Dask utilising all ram when saving to parquet

I am having issues using dask. It is very slow compared to pandas, especially when reading large datasets of up to 40 GB. The dataset grows to about 100+ columns, which are mainly float64, after some additional processing. (This is quite slow, especially when I call compute like so: output = df[["date", "permno"]].compute(scheduler='threading'))
I think I could live with the delay, even if frustrating. However, when I try to save the data to parquet with df.to_parquet('my data frame', engine="fastparquet"), it runs out of memory on a server with about 110 GB of RAM. I notice that the buff/cache memory shown by free -h goes up from about 40 MB to 40+ GB.
I am confused how this is possible, given that dask does not load everything into memory. I use 100 partitions for the dataset in dask.
Dask computations are executed lazily. The underlying operations aren't actually executed until the last possible moment. Here's what I can gather from your question / comment:
you read a 40GB dataset
you run grouping / sorting
you join with other datasets
you try to write to Parquet
The computation bottleneck isn't necessarily related to the Parquet writing part. Your bottleneck may be with the grouping, sorting, or joining.
You may need to perform a broadcast join, strategically persist, or repartition, it's hard to say given the information provided.
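For example, the kinds of adjustments meant here might look like this (a sketch only; the paths and sizes are illustrative, and whether persisting helps depends on where your actual bottleneck is):

import dask.dataframe as dd

df = dd.read_csv("my_large_dataset/*.csv")
# smaller partitions keep peak memory per worker down
df = df.repartition(partition_size="100MB")
# persist intermediate results only if they fit comfortably in memory
# df = df.persist()
df.to_parquet("my_data_frame", engine="fastparquet")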

Dask Distributed Scheduler and Large Functions

In the context of the Dask distributed scheduler w/ a LocalCluster: Can somebody help me understand the dynamics of having a large (heap) mapping function?
For example, consider the Dask Data Frame ddf and the map_partitions operation:
def mapper():
    resource = ...  # load some large resource, e.g. 50MB

    def inner(pdf):
        return pdf.apply(lambda x: ..., axis=1)

    return inner

mapper_fn = mapper()  # 50MB on heap
ddf.map_partitions(mapper_fn)
What happens here? Will Dask serialize mapper_fn and send it along with every task? Say I have n partitions, so n tasks.
Empirically, I've observed that if I have 40 tasks and a 50MB mapper, then it takes about 70 s for tasks to start working; the cluster seems to kinda sit there with full CPU, but the dashboard shows nothing. What is happening here? What are the consequences of having large (heap) functions in the dask distributed scheduler?
Dask serializes non-trivial functions with cloudpickle, and includes the serialized version of those functions in every task. This is highly inefficient. We recommend that you not do this, but instead pass data explicitly.
resource = ...
ddf.map_partitions(func, resource=resource)
This will be far more efficient.
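With the distributed scheduler, another option along the same lines is to ship the resource to the workers once and pass the resulting future (a sketch; it assumes a running Client and that func takes a resource keyword argument):

from distributed import Client

client = Client()
# send the ~50MB resource to the workers once, instead of embedding it in every task
resource_future = client.scatter(resource, broadcast=True)
ddf.map_partitions(func, resource=resource_future)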

Batch results of intermediate dask computation

I have a large (10s of GB) CSV file that I want to load into dask, and for each row, perform some computation. I also want to write the results of the manipulated CSV into BigQuery, but it'd be better to batch network requests to BigQuery in groups of say, 10,000 rows each, so I don't incur network overhead per row.
I've been looking at dask delayed and see that you can create an arbitrary computation graph, but I'm not sure if this is the right approach: how do I collect and fire off intermediate computations based on some group size (or perhaps time elapsed). Can someone provide a simple example on that? Say for simplicity we have these functions:
def change_row(r):
    # Takes 10ms
    r = some_computation(r)
    return r

def send_to_bigquery(rows):
    # Ideally, in large-ish groups, say 10,000 rows at a time
    make_network_request(rows)

# And here's how I'd use it
import dask.dataframe as dd
df = dd.read_csv('my_large_dataset.csv')  # 20 GB
# run change_row(r) for each r in df
# run send_to_bigquery(rows) for each appropriately sized group of results from change_row(r)
Thanks!
The easiest thing that you can do is provide a blocksize parameter to read_csv, which will get you approximately the right number of rows per block. You may need to measure some of your data, or experiment, to get this right.
The rest of your task will work the same way as any other "do this generic thing to blocks of a data-frame" case: the map_partitions method (docs).
def alter_and_send(df):
    # iterrows() yields (index, row) pairs, so unpack and keep only the row
    rows = [change_row(r) for _, r in df.iterrows()]
    send_to_bigquery(rows)
    return df

df.map_partitions(alter_and_send)
Basically, you are running the function on each piece of the logical dask dataframe, which are real pandas dataframes.
You may actually want map, apply or other dataframe methods in the function.
This is one way to do it - you don't really need the "output" of the map, and you could have used to_delayed() instead.
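Putting those pieces together, the whole pipeline might look roughly like this (a sketch; blocksize is a byte target per partition, so you would tune it until each block holds roughly 10,000 rows):

import dask.dataframe as dd

df = dd.read_csv('my_large_dataset.csv', blocksize='64MB')
# each partition arrives in alter_and_send as a real pandas DataFrame;
# dask may first call it on an empty frame to infer the output metadata
df.map_partitions(alter_and_send).compute()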

large data shuffle causes timeouts

I'm trying to read 100,000 data records of about 100 kB each simultaneously from 50 disks, shuffle them, and write them to 50 output disks at disk speed. What's a good way of doing that with Dask?
I've tried creating 50 queues and submitting 50 reader/writer functions using 100 workers (all on different machines, this is using Kubernetes). I ramp up first the writers, then the readers gradually. The scheduler gets stuck at 100% CPU at around 10 readers, and then gets timeouts when any more readers are added. So this approach isn't working.
Most dask operations have something like 1ms of overhead. As a result Dask is not well suited to be placed within innermost loops. Typically it is used at a coarser level, parallelizing across many Python functions, each of which is expected to take 100ms.
In a situation like yours I would push data onto a shared message system like Kafka, and then use Dask to pull off chunks of data when appropriate.
Data transfer
If your problem is in the bandwidth limitation of moving data through dask queues then you might consider turning your data into dask-reference-able futures before placing things into queues. See this section of the Queue docstring: http://dask.pydata.org/en/latest/futures.html#distributed.Queue
Elements of the Queue must be either Futures or msgpack-encodable data (ints, strings, lists, dicts). All data is sent through the scheduler so it is wise not to send large objects. To share large objects scatter the data and share the future instead.
So you probably want something like the following:
from distributed import get_client

def f(queue):
    client = get_client()
    for fn in local_filenames():
        data = read(fn)
        future = client.scatter(data)  # scatter the data, put only the future on the queue
        queue.put(future)
Shuffle
If you're just looking to shuffle data then you could read it with something like dask.bag or dask.dataframe
df = dd.read_parquet(...)
and then sort your data using the set_index method
df.set_index('my-column')
