Apply function along time dimension of XArray - dask

I have an image stack stored in an XArray DataArray with dimensions time, x, y on which I'd like to apply a custom function along the time axis of each pixel such that the output is a single image of dimensions x,y.
I have tried apply_ufunc, but the function fails, stating that I need to first load the data into RAM (i.e. that it cannot use a Dask array). Ideally, I'd like to keep the DataArray backed by Dask arrays internally, as it isn't possible to load the entire stack into RAM. The exact error message is:
ValueError: apply_ufunc encountered a dask array on an argument, but handling for dask arrays has not been enabled. Either set the dask argument or load your data into memory first with .load() or .compute()
My code currently looks like this:
import numpy as np
import xarray as xr
import pandas as pd

def special_mean(x, drop_min=False):
    s = np.sum(x)
    n = len(x)
    if drop_min:
        s = s - x.min()
        n -= 1
    return s / n

times = pd.date_range('2019-01-01', '2019-01-10', name='time')
data = xr.DataArray(np.random.rand(10, 8, 8), dims=["time", "y", "x"], coords={'time': times})
data = data.chunk({'time': 10, 'x': 1, 'y': 1})

res = xr.apply_ufunc(special_mean, data, input_core_dims=[["time"]], kwargs={'drop_min': True})
If I do load the data into RAM using .compute(), I still end up with an error, which states:
ValueError: applied function returned data with unexpected number of dimensions: 0 vs 2, for dimensions ('y', 'x')
I'm not entirely sure what I am missing/doing wrong.

import numpy as np
import xarray as xr
import pandas as pd

def special_mean(x, drop_min=False):
    s = np.sum(x)
    n = len(x)
    if drop_min:
        s = s - x.min()
        n -= 1
    return s / n

times = pd.date_range('2019-01-01', '2019-01-10', name='time')
data = xr.DataArray(np.random.rand(10, 8, 8), dims=["time", "y", "x"], coords={'time': times})
data = data.chunk({'time': 10, 'x': 1, 'y': 1})

res = xr.apply_ufunc(special_mean, data, input_core_dims=[["time"]], kwargs={'drop_min': True},
                     dask='allowed', vectorize=True)
The code above using the vectorize argument should work.
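If you want the computation to stay fully lazy as well, a variant of the same call with dask='parallelized' should also work; this is just a sketch, reusing special_mean and data from above, and output_dtypes is typically required in this mode:
res_lazy = xr.apply_ufunc(special_mean, data, input_core_dims=[["time"]],
                          kwargs={'drop_min': True}, dask='parallelized',
                          vectorize=True, output_dtypes=[float])
res_lazy.compute()  # the result stays a dask-backed DataArray until computed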

My aim was also to use apply_ufunc from xarray so that it computes the special mean across x and y.
I liked Ales's example, provided of course that the line related to chunking is omitted. Otherwise:
ValueError: applied function returned data with unexpected number of dimensions. Received 0 dimension(s) but expected 2 dimensions with names: ('y', 'x')

Interestingly, I realized that, in some situations, to make the output of apply_ufunc 3D instead of 2D, we need to add output_core_dims=[["time"]] to the apply_ufunc call.
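As a hedged illustration (reusing the data array from above; demean is just a made-up helper), a function that keeps the time axis needs "time" declared as an output core dimension too:
def demean(x):
    # x arrives with "time" moved to the last axis
    return x - x.mean(axis=-1, keepdims=True)

res_3d = xr.apply_ufunc(demean, data, input_core_dims=[["time"]],
                        output_core_dims=[["time"]], dask='parallelized',
                        output_dtypes=[data.dtype])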

Xarray Distributed Failed to serialize

I need to upsample, through linear interpolation, some satellite images organized in a DataArray.
As long as I run the code locally I have no issue, but if I try to replicate the interpolation over a distributed system, I get back this error:
`Could not serialize object of type tuple`
To replicate the problem, all that's needed is to switch between a distributed and a local environment.
Here is the distributed version of the code.
import numpy as np
import pandas as pd
import xarray as xr
# (in the failing case, a dask.distributed client is assumed to be running)

n_time = 365
px = 2000
lat = np.linspace(19., 4., px)
lon = np.linspace(34., 53., px)
time = pd.date_range('1/1/2019', periods=n_time, freq='D')
data = xr.DataArray(np.random.random((n_time, px, px)),
                    dims=('time', 'lat', 'lon'),
                    coords={'time': time, 'lat': lat, 'lon': lon})
data = data.chunk({'time': 1})

# upsampling
nlat = np.linspace(19., 4., px * 2)
nlon = np.linspace(34., 53., px * 2)
interp = data.interp(lat=nlat, lon=nlon)
computed = interp.compute()
Does anyone have an idea of how to work around the problem?
EDIT 1:
It seems I wasn't clear enough in my first MRE, so I've decided to rewrite it with all the input received so far.
I need to upsample a satellite dataset from 500 m to 250 m resolution. Since chunking along the dimensions to be interpolated is not yet supported**, the final goal is to figure out a workaround for upsampling each image of the 500 m dataset.
import numpy as np
import dask.array as dsa
import pandas as pd
import xarray as xr

px = 2000
n_time = 365
time = pd.date_range('1/1/2019', periods=n_time, freq='D')

# dataset to be upsampled
lat_500 = np.linspace(19., 4., px)
lon_500 = np.linspace(34., 53., px)
da_500 = xr.DataArray(dsa.random.random((n_time, px, px), chunks=(1, 1000, 1000)),
                      dims=('time', 'lat', 'lon'),
                      coords={'time': time, 'lat': lat_500, 'lon': lon_500})

# reference dataset
lat_250 = np.linspace(19., 4., px * 2)
lon_250 = np.linspace(34., 53., px * 2)
da_250 = xr.DataArray(dsa.random.random((n_time, px * 2, px * 2), chunks=(1, 1000, 1000)),
                      dims=('time', 'lat', 'lon'),
                      coords={'time': time, 'lat': lat_250, 'lon': lon_250})

# upsampling
da_250i = da_500.interp(lat=lat_250, lon=lon_250)

# fake index
fNDVI = (da_250i - da_250) / (da_250i + da_250)
fNDVI.to_netcdf(r'c:\temp\output.nc')  # to_netcdf computes (and writes) by default
This should recreate the problem while avoiding the memory impact, as suggested by Rayan. In any case, the two datasets can be dumped to disk and then reloaded.
**Note: it seems work is underway to implement interpolation over chunked dimensions, but it isn't fully available yet. Details here: https://github.com/pydata/xarray/pull/4155
I believe that there are two things that cause this example to crash, both likely related to memory usage:
1. You populate your original dataset with a large numpy array (np.random.random((n_time, px, px))) and then call .chunk after the fact. This forces Dask to pass a large object around in its graphs. Solution: use a lazy loading method.
2. Your object interp requires 47 GB of memory. This is too much for most computers to handle. Solution: add a reduction step before calling compute. This lets you check whether your interpolation is working properly without loading all the results into RAM at once.
With these modifications, the code looks like this:
import numpy as np
import dask.array as dsa
import pandas as pd
import xarray as xr

n_time = 365
px = 2000
lat = np.linspace(19., 4., px)
lon = np.linspace(34., 53., px)
time = pd.date_range('1/1/2019', periods=n_time, freq='D')

# use dask to lazily create the random data, not numpy
# this avoids populating the dask graph with large objects
data = xr.DataArray(dsa.random.random((n_time, px, px), chunks=(1, px, px)),
                    dims=('time', 'lat', 'lon'),
                    coords={'time': time, 'lat': lat, 'lon': lon})

# upsampling
nlat = np.linspace(19., 4., px * 2)
nlon = np.linspace(34., 53., px * 2)

# this object requires 47 GB of memory
# computing it directly is not an option on most computers
interp = data.interp(lat=nlat, lon=nlon)

# instead, we reduce in the time dimension before computing
interp.mean(dim='time').compute()
This ran in a few minutes on my laptop.
In response to your edited question, I have a new solution.
In order to interpolate across the lat / lon dimensions, you need to rechunk the data. I added this line before the interpolation step
da_500 = da_500.chunk({'lat': -1, 'lon': -1})
After doing that, the computation executed without errors for me in distributed mode.
from dask.distributed import Client
client = Client()
fNDVI.to_netcdf(r'~/tmp/test.nc')  # to_netcdf computes (and writes) by default
I did notice that the computation was rather memory intensive. I recommend monitoring the dask dashboard to see if you are running out of memory.
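For reference, a minimal sketch of how to reach the dashboard (the worker count and memory limit here are just illustrative):
from dask.distributed import Client

client = Client(n_workers=4, memory_limit='4GB')  # illustrative settings
print(client.dashboard_link)  # open this URL to watch per-worker memory usage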

How to iterate da.linalg.inv over a dask array dimension

What is the best way to iterate da.linalg.inv over a multi-dimensional dask array?
I have a dask array of shape (4, 4, 8, 8), and need to compute the inverse of the last two dimensions. With numpy, np.linalg.inv loops over all dimensions except the last two, so in the following example, I can just call np.linalg.inv(A).
I have chosen to use a for loop, but I have read about gufuncs in dask (the documentation seems a little outdated). However, I'm not sure how to implement it, particularly the "signature" bit:
import dask.array as da
import numpy as np
A = da.random.random((4,4,8,8))
A2 = A.reshape((-1,) + A.shape[-2:])
B = [da.linalg.inv(a) for a in A2]
B2 = da.asarray(B)
B3 = B2.reshape(A.shape)
np.testing.assert_array_almost_equal(
    np.linalg.inv(A.compute()),
    B3
)
My attempt at a gufunc leads to an error:
def foo(x):
    return da.linalg.inv(x)

gufoo = da.gufunc(foo, signature="()->()", output_dtypes=float, vectorize=True)

gufoo(A2).compute()  # IndexError: tuple index out of range
I think that you want to apply the NumPy function np.linalg.inv over your Dask array rather than the Dask array function.
If np.linalg.inv is already a gufunc, then it might work as expected today:
np.linalg.inv(A)
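If that automatic dispatch doesn't kick in for your dask version, a sketch of the explicit route is to wrap the NumPy function with dask.array.apply_gufunc and declare the last two axes as core dimensions:
import dask.array as da
import numpy as np

A = da.random.random((4, 4, 8, 8))
# np.linalg.inv already broadcasts over the leading dimensions,
# so only the matrix axes are declared as core dimensions
B = da.apply_gufunc(np.linalg.inv, "(i,j)->(i,j)", A, output_dtypes=float)
np.testing.assert_array_almost_equal(np.linalg.inv(A.compute()), B.compute())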

How to solve "not all divisions are known" error?

I'm trying to filter a Dask dataframe with groupby.
df = df.set_index('ngram');
sizes = df.groupby('ngram').size();
df = df[sizes > 15];
However, df.head(15) throws the error ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index. The divisions on sizes are not known:
>>> df.known_divisions
True
>>> sizes.known_divisions
False
A workaround is to do sizes.compute() or .to_csv(...) and then read it back into Dask with dd.from_pandas or dd.read_csv; then sizes.known_divisions returns True. That's a notable inconvenience.
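For completeness, a sketch of that workaround (names and the threshold follow the code above):
import dask.dataframe as dd

sizes_pd = sizes.compute()                              # materialize the group sizes
sizes_dd = dd.from_pandas(sizes_pd.sort_index(), npartitions=1)
sizes_dd.known_divisions                                # True
df[sizes_dd > 15].head(15)                              # now the filter can align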
How else can this be solved? Am I using Dask wrong?
Note: there's an unanswered duplicate here.
In the common case you describe, it appears that your indexing series is in fact much smaller than the source dataframe you want to apply it to. In this case, it makes sense to materialise it and use simple indexing like this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'ngram': np.random.choice([1, 2, 3], size=1000),
                   'other': np.random.randn(1000)})  # fake data
d = dd.from_pandas(df, npartitions=3)
sizes = d.groupby('ngram').size().compute()
d = d.set_index('ngram')  # also sorts the divisions
ngrams = sizes[sizes > 300].index.tolist()  # a list of good ngrams
d.loc[ngrams].compute()
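An alternative sketch that skips set_index entirely and filters on the column with isin (reusing ngrams and the fake df from above):
d2 = dd.from_pandas(df, npartitions=3)        # 'ngram' kept as a plain column
d2[d2['ngram'].isin(ngrams)].compute()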

How to normalise all training samples at once using MinMaxScaler

I have 1320 training samples (sea surface temperature), and each sample is a 2D array of shape (160, 320), so the full array has shape (1320, 160, 320). I would like to normalize them to values between 0 and 1 using MinMaxScaler(). I get the error "Found array with dim 3. MinMaxScaler expected <= 2.". My code is as follows. I could loop through all 1320 samples, normalizing them one by one, but I would like to know if there is a way to normalize all of them at once, because the max and min of each sample are not the same.
scaler = prep.MinMaxScaler()
sst = scaler.fit_transform(sst)
As far as I know, you can't really do it only using MinMaxScaler(). np.apply_along_axis won't be useful either since you want to apply a min-max scaler over 2D slices. One solution could be something like this:
import numpy as np

a = np.random.random((2, 3, 3))

def customMinMaxScaler(X):
    return (X - X.min()) / (X.max() - X.min())

np.array([customMinMaxScaler(x) for x in a])
But I guess it wouldn't be much faster than iterating over the samples.
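A possible vectorized variant (a sketch, assuming sst has shape (1320, 160, 320)) computes each sample's min and max with keepdims so broadcasting does the scaling without a Python loop:
import numpy as np

mins = sst.min(axis=(1, 2), keepdims=True)   # per-sample minimum, shape (1320, 1, 1)
maxs = sst.max(axis=(1, 2), keepdims=True)   # per-sample maximum
sst_scaled = (sst - mins) / (maxs - mins)    # each 2D field now lies in [0, 1]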

Create a List and Use it in Loss Function Tensorflow

I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is a list of size [1, batch_size] that is output by a neural network. I check whether the first value of this list is in a specific range passed in as a placeholder called valid_range; if it is, I add 1 to a list, and if it is not, I add -1. The goal is to make all predictions of the network fall in the range, so the correct predictions form a tensor of all 1s, which I call correct_predictions.
values_list = []
for j in range(batch_size):
    a = results[0, j] >= valid_range[0]
    b = results[0, j] <= valid_range[1]
    c = tf.logical_and(a, b)
    if (c == 1):
        values_list.append(1)
    else:
        values_list.append(-1.)

values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in Tensorboard, and I notice that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in Tensorflow to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use an l2 loss function, multiplying it by the binary vector I had from before, values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1. This way, when the prediction is in the range the loss is 0; otherwise it is the normal l2 loss. As I am unable to see the values of the tensors, I am not sure whether this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure whether the multiplication is being done correctly and whether values_list_tensor is calculated accurately. Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(results[0, :], 1))))
Thanks
To answer the question in the comment: one way to write a piece-wise function is using tf.cond. For example, here is a function that returns 0 in [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda: x, lambda: 0.0)
y.eval({x: 1.5})  # prints 1.5
y.eval({x: 0.5})  # prints 0.0
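For the original loss question, a hedged sketch of the same piece-wise idea expressed purely with tensor ops (so gradients can flow), assuming results and valid_range are the tensors defined in the question:
# distance below / above the allowed interval; zero when the prediction is inside it
below = tf.nn.relu(valid_range[0] - results[0, :])
above = tf.nn.relu(results[0, :] - valid_range[1])
range_loss = tf.reduce_mean(tf.square(below + above))  # differentiable range penalty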
