xarray load() running slowly on open_mfdataset() data - dask

It is taking 21 seconds to run xarray.DataArray.values for a dataset that I have opened with open_mfdataset().
Getting the values from a larger array that I opened with open_dataset() is over 1000 times quicker. EDIT: looping over the multiple files using a for loop is also much quicker than using open_mfdataset(). See edit at the bottom.
Could you help me to understand why this happens, or what to look for, and if there is a faster way for me to open 40 netCDFs, do some selecting, and export the selected data to numpy?
My code is along these lines:
ds = xr.open_mfdataset(myfiles_list, concat_dim='new_dim')
ds = ds.sel(time=selected_date)
ds = ds.sel(latitude=slice(ymin, ymax), longitude=slice(xmin, xmax))
vals = ds['temperature'].values # this line takes 18.9 secs
# total time: 21 secs
# vals.shape = (40, 1, 26, 17)
vs
onefile = xr.open_dataset('/path/to/data/single_file.nc')
vals = onefile['temperature'].values # this line takes 0.005 secs
# total time: 0.018 secs
# vals.shape = (93, 40, 26, 17)
Thanks.
EDIT - Extra Info:
I should clarify that it seems to be the loading that is slow. When values is called an array that was previously lazy gets loaded. If I insert an explicit load() command then the loading is slow but the values command is then quick:
ds = xr.open_mfdataset(myfiles_list, concat_dim='new_dim')
ds = ds.sel(time=selected_date)
ds = ds.sel(latitude=slice(ymin, ymax), longitude=slice(xmin, xmax))
ds = ds.load() # this line takes 19 secs
vals = ds['temperature'].values # this line takes <10 ms
# total time: 21 secs
# vals.shape = (40, 1, 26, 17)
If, instead of using open_mfdataset(), I do a for loop over my list of files, extract a numpy array from each one, and do the concatenation in numpy then it only takes 1 second. In this MWE this solves my whole problem, but in my complete code I do need to use open_mfdataset():
list_of_arrays = []
for file in myfiles_list:
    ds = xr.open_dataset(file)
    ds = ds.sel(time=selected_date)
    ds = ds.sel(latitude=slice(ymin, ymax), longitude=slice(xmin, xmax))
    list_of_arrays.append(ds['temperature'].values)
vals = np.concatenate(list_of_arrays, axis=0)
# total time: 1.0 secs
# vals.shape = (40, 26, 17)

xarray.open_mfdataset opens every file as its own xarray.Dataset, collects them in a Python list, and concatenates them only after all files have been parsed.
So every file has to be opened and stored in that list before anything is combined. If you profile the code, you will see that parsing the files takes most of the time, but that time does not scale with file size: a file twice as large does not take twice as long to parse. Finally, the concatenation itself also takes time.

Related

Dask dataframe parallel task

I want to create features(additional columns) from a dataframe and I have the following structure for many functions.
Following this documentation https://docs.dask.org/en/stable/delayed-best-practices.html I have come up with the code below.
However I get the error message: concurrent.futures._base.CancelledError and many times I get the warning: distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
I understand that the object I am appending to delay is very large (it works fine when I use the commented-out df), which is why the program crashes, but is there a better way of doing it?
import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask
def main():
    #df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000), "col2": np.random.randint(101, 200, 100000), "col3": np.random.uniform(0, 4, 100000)})
    df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000000), "col2": np.random.randint(101, 200, 100000000), "col3": np.random.uniform(0, 4, 100000000)})
    ddf = dd.from_pandas(df, npartitions=100)
    ddf = ddf.set_index("col1")
    delay = []
    def create_col_sth():
        group = ddf.groupby("col1")["col3"]
        @dask.delayed
        def small_fun(lag):
            return f"col_{lag}", group.transform(lambda x: x.shift(lag), meta=('x', 'float64')).apply(lambda x: np.log(x), meta=('x', 'float64'))
        for lag in range(5):
            x = small_fun(lag)
            delay.append(x)
    create_col_sth()
    delayed = dask.compute(*delay)
    for data in delayed:
        ddf[data[0]] = data[1]
    ddf.to_parquet("test", engine="fastparquet")
if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main()
Not sure if this will resolve all of your issues, but generally you don't need to (and shouldn't) mix delayed and dask.dataframe operations like this. Additionally, you shouldn't pass large data objects into delayed functions through closures, like group in your example. Instead, include them as explicit arguments or, in this case, don't use delayed at all and use native dask.dataframe operations, or in-memory operations with dask.dataframe.map_partitions.
Implementing these, I would rewrite your main function as follows:
df = pd.DataFrame({
"col1": np.random.randint(1, 100, 100000000),
"col2": np.random.randint(101, 200, 100000000),
"col3": np.random.uniform(0, 4, 100000000),
})
ddf = dd.from_pandas(df, npartitions=100)
ddf = ddf.set_index("col1")
group = ddf.groupby("col1")["col3"]
# directly assign the dataframe operations as columns
for lag in range(5):
    ddf[f"col_{lag}"] = (
        group
        .transform(lambda x: x.shift(lag), meta=('x', 'float64'))
        .apply(lambda x: np.log(x), meta=('x', 'float64'))
    )
# this triggers the operation implicitly - no need to call compute
ddf.to_parquet("test", engine="fastparquet")
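To see what each col_{lag} column contains, here is a small pandas-only sketch of the same shift-then-log logic on a toy frame. The toy values are made up for illustration, and groupby(...).shift(lag) is used on the assumption that it is an equivalent shortcut for transform(lambda x: x.shift(lag)).

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): two groups keyed by col1.
pdf = pd.DataFrame(
    {"col1": [1, 1, 1, 2, 2], "col3": [1.0, 2.0, 4.0, 8.0, 16.0]}
).set_index("col1")

for lag in range(3):
    # Shift col3 within each col1 group, then take the natural log.
    shifted = pdf.groupby("col1")["col3"].shift(lag)
    pdf[f"col_{lag}"] = np.log(shifted)

print(pdf)
```

The first lag rows of every group become NaN, because there is no earlier value within the group to shift in.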
After long periods of frustration with Dask, I think I have found the holy grail for refactoring pandas transformations wrapped with Dask.
Learning points:
Index intelligently. If you are grouping by or merging on certain columns, consider setting those columns as the index.
Partition and repartition intelligently. A dataframe of 10k rows and another of 1m rows should naturally have different numbers of partitions.
Avoid the Dask dataframe transformation methods, with the exception of, for example, merge. Other transformations should be written in pandas and wrapped with map_partitions.
Don't accumulate overly large task graphs; consider saving to disk after, for example, indexing or a complex transformation.
If possible, filter the dataframe and work with a smaller subset; you can always merge it back into the bigger dataset.
If you are working on your local machine, set the memory limits within the boundaries of your system specifications. This point is very important. In the example below I create one million rows with 3 columns, one int64 and two float64. Each value takes 8 bytes, so a row takes 24 bytes and the whole frame about 24 million bytes.
import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask
import gc
# https://stackoverflow.com/questions/52642966/repartition-dask-dataframe-to-get-even-partitions
def _rebalance_ddf(ddf):
    """Repartition dask dataframe to ensure that partitions are roughly equal size.
    Assumes `ddf.index` is already sorted.
    """
    if not ddf.known_divisions:  # e.g. for read_parquet(..., infer_divisions=False)
        ddf = ddf.reset_index().set_index(ddf.index.name, sorted=True)
    index_counts = ddf.map_partitions(lambda _df: _df.index.value_counts().sort_index()).compute()
    index = np.repeat(index_counts.index, index_counts.values)
    divisions, _ = dd.io.io.sorted_division_locations(index, npartitions=ddf.npartitions)
    return ddf.repartition(divisions=divisions)
def main(client):
    size = 1000000
    df = pd.DataFrame({"col1": np.random.randint(1, 10000, size), "col2": np.random.randint(101, 20000, size), "col3": np.random.uniform(0, 100, size)})
    # Select appropriate partitions
    ddf = dd.from_pandas(df, npartitions=500)
    del df
    gc.collect()
    # This is correct if you want to group by a certain column it is always best if that column is an indexed one
    ddf = ddf.set_index("col1")
    ddf = _rebalance_ddf(ddf)
    print(ddf.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf.memory_usage(deep=True).sum().compute())
    # Always persist your data to prevent big task graphs actually if you omit this step processing will fail
    ddf.to_parquet("test", engine="fastparquet")
    ddf = dd.read_parquet("test")
    # Dummy code to create a dataframe to be merged based on col1
    ddf2 = ddf[["col2", "col3"]]
    ddf2["col2/col3"] = ddf["col2"] / ddf["col3"]
    ddf2 = ddf2.drop(columns=["col2", "col3"])
    # Repartition the data
    ddf2 = _rebalance_ddf(ddf2)
    print(ddf2.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf2.memory_usage(deep=True).sum().compute())
    def mapped_fun(data):
        for lag in range(5):
            data[f"col_{lag}"] = data.groupby("col1")["col3"].transform(lambda x: x.shift(lag)).apply(lambda x: np.log(x))
        return data
    # Process the group by transformation in pandas but wrapped with Dask if you use the Dask functions to do this you will
    # have a variety of issues.
    ddf = ddf.map_partitions(mapped_fun)
    # Additional... you can merge ddf with ddf2 but on an indexed column otherwise you run into a variety of issues
    ddf = ddf.merge(ddf2, on=['col1'], how="left")
    ddf.to_parquet("final", engine="fastparquet")
if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main(client)
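The memory estimate from the learning points can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it counts the raw values of three 8-byte columns and ignores the index and any pandas overhead, and the 500 partitions are taken from the from_pandas call above.

```python
# Rough size of the example frame: raw values only, no index or pandas overhead.
rows = 1_000_000
columns = 3            # three numeric columns, each an 8-byte type
bytes_per_value = 8    # int64 and float64 both use 8 bytes per value
npartitions = 500      # as passed to dd.from_pandas in the example

total_bytes = rows * columns * bytes_per_value
bytes_per_partition = total_bytes / npartitions

print(f"{total_bytes:,} bytes in total")
print(f"{bytes_per_partition:,.0f} bytes per partition")
```

Numbers of this size are far below the 8GB memory limit per worker, which is the point of sizing the data before choosing cluster limits.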

Efficiently loading large time series data into InfluxDB

I am trying to load 100 billion (thousands of columns, millions of rows) multi-dimensional time series datapoints into InfluxDB from a CSV file.
I am currently doing it through line protocol as follows (my codebase is in Python):
f = open(args.file, "r")
l = []
bucket_size = 100
if rows > 10000:
    bucket_size = 10
for x in tqdm(range(rows)):
    s = f.readline()[:-1].split(" ")
    v = {}
    for y in range(columns):
        v["dim" + str(y)] = float(s[y + 1])
    time = (get_datetime(s[0])[0] - datetime(1970, 1, 1)).total_seconds() * 1000000000
    time = int(time)
    body = {"measurement": "puncte", "time": time, "fields": v }
    l.append(body)
    if len(l) == bucket_size:
        while True:
            try:
                client.write_points(l)
            except influxdb.exceptions.InfluxDBServerError:
                continue
            break
        l = []
client.write_points(l)
final_time = datetime.now()
final_size = get_size()
seconds = (final_time - initial_time).total_seconds()
As the code above shows, my code is reading the dataset CSV file and preparing batches of 10000 data points, then sending the datapoints using client.write_points(l).
However, this method is not very efficient. In fact, I am trying to load 100 billion data points and this is taking way longer than expected: loading only 3 million rows with 100 columns each has been running for 29 hours and still has 991 hours to go!
I am certain there is a better way to load the dataset into InfluxDB. Any suggestions for faster data loading?
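A side note on the timestamp computation in the code above: multiplying total_seconds() by 1000000000 goes through a 64-bit float, which carries roughly 15 to 16 significant digits, while a nanosecond epoch timestamp has 19. A minimal sketch that stays in integer arithmetic is shown here; the helper name to_epoch_ns is made up for illustration and it assumes naive datetimes, as in the question.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def to_epoch_ns(dt):
    """Naive datetime -> integer nanoseconds since the Unix epoch (hypothetical helper)."""
    # Floor-dividing a timedelta by a timedelta yields an exact int.
    microseconds = (dt - EPOCH) // timedelta(microseconds=1)
    return microseconds * 1000

ts = to_epoch_ns(datetime(2019, 1, 1, 0, 0, 0, 123456))
print(ts)
```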
Try loading the data in multiple concurrent threads. This should give a speedup on multi-CPU machines.
Another option is to feed the CSV file directly to time series database without additional transformations. See this example.
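The suggestion to write from multiple concurrent threads could be sketched roughly as follows. This is not a definitive implementation: write_batch is a hypothetical stand-in for client.write_points(batch), the batch size and worker count are arbitrary, and whether the real client can be shared safely across threads is an assumption you would need to check.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def write_batch(batch):
    # Hypothetical stand-in for client.write_points(batch);
    # here it only reports how many points it was given.
    return len(batch)

def load_concurrently(points, batch_size=5000, workers=4):
    # Each batch is written by one of `workers` threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(write_batch, batched(points, batch_size)))

points = (
    {"measurement": "puncte", "time": i, "fields": {"dim0": float(i)}}
    for i in range(12_345)
)
written = load_concurrently(points)
print(written)
```

Note that pool.map submits all batches up front, so for a file of the size described in the question you would want to bound the number of batches in flight rather than materialise them all at once.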

Xarray Distributed Failed to serialize

I need to upsample through a linear interpolation some satellite images organized in a DataArray.
As long as I run the code locally I have no issue, but if I try to replicate the interpolation on a distributed system, I get this error:
`Could not serialize object of type tuple`
To replicate the problem, all that is needed is to switch between a distributed and a local environment.
Here is the distributed version of the code.
n_time = 365
px = 2000
lat = np.linspace(19., 4., px)
lon = np.linspace(34., 53., px)
time = pd.date_range('1/1/2019', periods=n_time, freq='D')
data = xr.DataArray(np.random.random((n_time, px, px)), dims=('time', 'lat',
'lon'),coords={'time': time, 'lat': lat, 'lon': lon})
data = data.chunk({'time':1})
#upsampling
nlat = np.linspace(19., 4., px*2)
nlon = np.linspace(34., 53., px*2)
interp = data.interp(lat=nlat, lon=nlon)
computed = interp.compute()
Does anyone have an idea of how to work around the problem?
EDIT 1:
As it seems that my first MRE was not clear enough, I decided to rewrite it with all the input received so far.
I need to upsample a satellite dataset from 500 m to 250 m. Since chunking along the dimension to be interpolated is not yet supported **, the final goal is to figure out a workaround and upsample each image of the 500 m dataset.
px = 2000
n_time = 365
time = pd.date_range('1/1/2019', periods=n_time, freq='D')
# dataset to be upsampled
lat_500 = np.linspace(19., 4., px)
lon_500 = np.linspace(34., 53., px)
da_500 = xr.DataArray(dsa.random.random((n_time, px, px),
chunks=(1, 1000, 1000)),
dims=('time', 'lat', 'lon'),
coords={'time': time, 'lat': lat_500, 'lon': lon_500})
# reference dataset
lat_250 = np.linspace(19., 4., px * 2)
lon_250 = np.linspace(34., 53., px * 2)
da_250 = xr.DataArray(dsa.random.random((n_time, px * 2, px * 2),
chunks=(1, 1000, 1000)),
dims=('time', 'lat', 'lon'),
coords={'time': time, 'lat': lat_250, 'lon': lon_250})
# upsampling
da_250i = da_500.interp(lat=lat_250, lon=lon_250)
#fake index
fNDVI = (da_250i-da_250)/(da_250i+da_250)
# compute=False makes to_netcdf return a delayed object that .compute() can run
fNDVI.to_netcdf(r'c:\temp\output.nc', compute=False).compute()
This should recreate the problem while avoiding the memory impact, as suggested by Rayan. In any case, the two datasets can be dumped to disk and then reloaded.
** Note: it seems that work is under way to implement interpolation on chunked datasets, but it is not fully available yet. Details here: https://github.com/pydata/xarray/pull/4155
I believe that there are two things that cause this example to crash, both likely related to memory usage:
You populate your original dataset with a large numpy array (np.random.random((n_time, px, px))) and then call .chunk after the fact. This forces Dask to pass a large object around in its graphs. Solution: use a lazy loading method.
Your object interp requires 47 GB of memory. This is too much for most computers to handle. Solution: add a reduction step before calling compute. This allows you to check whether your interpolation is working properly without simultaneously loading all the results into RAM.
With these modifications, the code looks like this:
import numpy as np
import dask.array as dsa
import pandas as pd
import xarray as xr
n_time = 365
px = 2000
lat = np.linspace(19., 4., px)
lon = np.linspace(34., 53., px)
time = pd.date_range('1/1/2019', periods=n_time, freq='D')
# use dask to lazily create the random data, not numpy
# this avoids populating the dask graph with large objects
data = xr.DataArray(dsa.random.random((n_time, px, px),
chunks=(1, px, px)),
dims=('time', 'lat', 'lon'),
coords={'time': time, 'lat': lat, 'lon': lon})
# upsampling
nlat = np.linspace(19., 4., px*2)
nlon = np.linspace(34., 53., px*2)
# this object requires 47 GB of memory
# computing it directly is not an option on most computers
interp = data.interp(lat=nlat, lon=nlon)
# instead, we reduce in the time dimension before computing
interp.mean(dim='time').compute()
This ran in a few minutes on my laptop.
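The 47 GB figure can be reproduced with a quick calculation, assuming the interpolated array holds float64 values (8 bytes each) on the doubled 4000 x 4000 grid:

```python
# Size of the interpolated array: n_time slices on a grid doubled in lat and lon.
n_time, px = 365, 2000
bytes_per_float64 = 8

n_values = n_time * (2 * px) * (2 * px)
total_bytes = n_values * bytes_per_float64
bytes_per_time_slice = (2 * px) * (2 * px) * bytes_per_float64

print(f"{total_bytes / 1e9:.1f} GB in total")
print(f"{bytes_per_time_slice / 1e6:.0f} MB per time step")
```

A single time step is small enough to handle, which is why reducing over time (or writing slice by slice) works where computing the whole array at once does not.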
In response to your edited question, I have a new solution.
In order to interpolate across the lat / lon dimensions, you need to rechunk the data. I added this line before the interpolation step
da_500 = da_500.chunk({'lat': -1, 'lon': -1})
After doing that, the computation executed without errors for me in distributed mode.
from dask.distributed import Client
client = Client()
# compute=False makes to_netcdf return a delayed object that .compute() can run
fNDVI.to_netcdf(r'~/tmp/test.nc', compute=False).compute()
I did notice that the computation was rather memory intensive. I recommend monitoring the dask dashboard to see if you are running out of memory.

Dask - Quickest way to get row length of each partition in a Dask dataframe

I'd like to get the length of each partition in a number of dataframes. I'm presently getting each partition and then getting the size of the index for each partition. This is very, very slow. Is there a better way?
Here's a simplified snippet of my code:
temp_dd = dd.read_parquet(read_str, gather_statistics=False)
temp_dd = dask_client.scatter(temp_dd, broadcast=True)
dask_wait([temp_dd])
temp_dd = dask_client.gather(temp_dd)
while row_batch <= max_row:
    row_batch_dd = temp_dd.get_partition(row_batch)
    row_batch_dd = row_batch_dd.dropna()
    row_batch_dd_len = row_batch_dd.index.size # <-- this is the current way I'm determining the length
    row_batch = row_batch + 1
I note that, while I am reading a parquet, I can't simply use the parquet info (which is very fast) because, after reading, I do some partition-by-partition processing and then drop the NaNs. It's the post-processed length per partition that I'd like.
df = dd.read_parquet(fn, gather_statistics=False)
df = df.dropna()
df.map_partitions(len).compute()

Apply function along time dimension of XArray

I have an image stack stored in an XArray DataArray with dimensions time, x, y on which I'd like to apply a custom function along the time axis of each pixel such that the output is a single image of dimensions x,y.
I have tried: apply_ufunc but the function fails stating that I need to first load the data into RAM (i.e. cannot use a Dask Array). Ideally, I'd like to keep the DataArray as Dask Arrays internally as it isn't possible to load the entire stack into RAM. The exact error message is:
ValueError: apply_ufunc encountered a dask array on an argument, but handling for dask arrays has not been enabled. Either set the dask argument or load your data into memory first with .load() or .compute()
My code currently looks like this:
import numpy as np
import xarray as xr
import pandas as pd
def special_mean(x, drop_min=False):
    s = np.sum(x)
    n = len(x)
    if drop_min:
        s = s - x.min()
        n -= 1
    return s/n
times = pd.date_range('2019-01-01', '2019-01-10', name='time')
data = xr.DataArray(np.random.rand(10, 8, 8), dims=["time", "y", "x"], coords={'time': times})
data = data.chunk({'time':10, 'x':1, 'y':1})
res = xr.apply_ufunc(special_mean, data, input_core_dims=[["time"]], kwargs={'drop_min': True})
If I do load the data into RAM using .compute then I still end up with an error which states:
ValueError: applied function returned data with unexpected number of dimensions: 0 vs 2, for dimensions ('y', 'x')
I'm not sure entirely what I am missing/doing wrong.
def special_mean(x, drop_min=False):
    s = np.sum(x)
    n = len(x)
    if drop_min:
        s = s - x.min()
        n -= 1
    return s/n
times = pd.date_range('2019-01-01', '2019-01-10', name='time')
data = xr.DataArray(np.random.rand(10, 8, 8), dims=["time", "y", "x"], coords={'time': times})
data = data.chunk({'time':10, 'x':1, 'y':1})
res = xr.apply_ufunc(special_mean, data, input_core_dims=[["time"]], kwargs={'drop_min': True}, dask='allowed', vectorize=True)
The code above using the vectorize argument should work.
My aim was also to implement apply_ufunc from Xarray such that it can compute the special mean across x and y.
Ales' example worked for me, but only after omitting the line that chunks the data. Otherwise I get:
ValueError: applied function returned data with unexpected number of dimensions. Received 0 dimension(s) but expected 2 dimensions with names: ('y', 'x')
Interestingly, I realized that if you want the output of apply_ufunc to be 3D instead of 2D, you need to add output_core_dims=[["time"]] to the apply_ufunc call.
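For completeness, here is a sketch of a variant that keeps the data as Dask arrays. It is not taken from the answers above and rests on two assumptions: the function is rewritten to reduce along the last axis (apply_ufunc moves core dimensions to the end), and dask='parallelized' is used, which requires the core dimension time to sit in a single chunk.

```python
import numpy as np
import pandas as pd
import xarray as xr

def special_mean(x, drop_min=False):
    # apply_ufunc moves the core dimension ("time") to the last axis,
    # so reduce along axis=-1 instead of over the whole array.
    s = x.sum(axis=-1)
    n = x.shape[-1]
    if drop_min:
        s = s - x.min(axis=-1)
        n -= 1
    return s / n

times = pd.date_range('2019-01-01', '2019-01-10', name='time')
values = np.random.rand(10, 8, 8)
data = xr.DataArray(values, dims=["time", "y", "x"], coords={'time': times})
# Keep "time" in one chunk; chunk only the non-core dimensions.
data = data.chunk({'time': -1, 'x': 4, 'y': 4})

res = xr.apply_ufunc(
    special_mean, data,
    input_core_dims=[["time"]],
    kwargs={'drop_min': True},
    dask='parallelized',
    output_dtypes=[float],
)
result = res.compute()
print(result.dims, result.shape)
```

Because the function is already vectorized over the leading axes, vectorize=True is not needed here, which avoids a Python-level loop over every pixel.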

Resources