How to iterate da.linalg.inv over a dask array dimension - dask

What is the best way to iterate da.linalg.inv over a multi-dimensional dask array?
I have a dask array of shape (4, 4, 8, 8), and need to compute the inverse of the last two dimensions. With numpy, np.linalg.inv loops over all dimensions except the last two, so in the following example, I can just call np.linalg.inv(A).
I have chosen to use a for loop, but I have read about gufuncs in dask (the documentation seems a little outdated). However, I'm not sure how to implement it, particularly the "signature" bit:
import dask.array as da
import numpy as np
A = da.random.random((4,4,8,8))
A2 = A.reshape((-1,) + A.shape[-2:])
B = [da.linalg.inv(a) for a in A2]
B2 = da.asarray(B)
B3 = B2.reshape(A.shape)
np.testing.assert_array_almost_equal(
    np.linalg.inv(A.compute()),
    B3
)
My attempt at a gufunc leads to an error:
def foo(x):
    return da.linalg.inv(x)
gufoo = da.gufunc(foo, signature="()->()", output_dtypes=float, vectorize=True)
gufoo(A2).compute() # IndexError: tuple index out of range

I think that you want to apply the NumPy function np.linalg.inv over your Dask array rather than the Dask array function da.linalg.inv.
If np.linalg.inv is already a gufunc, then it might work as expected today:
np.linalg.inv(A)
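If the direct call does not dispatch as hoped, one option is to state the gufunc signature explicitly. The sketch below assumes da.apply_gufunc is available and that the last two dimensions each sit in a single chunk; the signature "(i,j)->(i,j)" says that the function consumes a matrix and returns a matrix of the same shape, which is what the "()->()" attempt in the question was missing. The diagonally dominant test data is an assumption made only so that every block is invertible.

```python
import numpy as np
import dask.array as da

# Assumed test data: diagonally dominant 8x8 blocks, so every matrix is invertible.
rng = np.random.default_rng(0)
A_np = rng.random((4, 4, 8, 8)) + 8 * np.eye(8)

# Chunk only the leading (loop) dimensions; each 8x8 matrix stays inside one chunk.
A = da.from_array(A_np, chunks=(2, 2, 8, 8))

# "(i,j)->(i,j)": the last two axes are the core dimensions of input and output.
# np.linalg.inv already loops over leading axes, so no vectorize=True is needed.
B = da.apply_gufunc(np.linalg.inv, "(i,j)->(i,j)", A, output_dtypes=float)

result = B.compute()
```

Because NumPy's inv handles the leading axes of each chunk itself, the reshape and the Python-level loop from the question are not needed in this variant.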

Related

ValueError: setting an array element with a sequence when using Onehotencoder on multiple columns

I'm fairly new to machine learning and I am using the following code to encode my categorical data for preprocessing:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(handle_unknown = 'ignore'), [0])],remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=np.float)
which works when I only have one categorical column of data in X.
However when I have multiple columns of categorical data I change my code to :
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(handle_unknown = 'ignore'), [0,1,2,3,4,5,10,14,15])],remainder='passthrough')
but I get the following error when calling the np.array function:
ValueError: setting an array element with a sequence
From what I understand, all I need to do is specify which columns I'm one-hot encoding, as in the line above. So why does one version work and the other give an error? What should I do to fix it?
Also: if I remove the
dtype=np.float
from the np.array function I don't get an error - but I also don't get anything returned in X
Never mind, I was able to answer my own question.
For anyone interested, what I did was change the line
X = np.array(ct.fit_transform(X), dtype=np.float)
to:
X = ct.fit_transform(X).toarray()
The code works perfectly now.
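A likely explanation for why .toarray() helps: ColumnTransformer returns a SciPy sparse matrix when the combined output is sparse enough (its sparse_threshold parameter, 0.3 by default), and np.array cannot turn a sparse matrix into a float array. With only one encoded column the output may be dense enough to come back as a regular array. The sketch below handles both cases; the toy data and column indices are made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns followed by one numeric column.
X = np.array(
    [["a", "x", 1.0],
     ["b", "y", 2.0],
     ["a", "y", 3.0],
     ["b", "x", 4.0]],
    dtype=object,
)

ct = ColumnTransformer(
    [("one_hot_encoder", OneHotEncoder(handle_unknown="ignore"), [0, 1])],
    remainder="passthrough",
)
Xt = ct.fit_transform(X)

# fit_transform may return either a sparse matrix or a dense array,
# depending on how sparse the stacked output is, so handle both.
X_dense = Xt.toarray() if sparse.issparse(Xt) else np.asarray(Xt)
X_dense = X_dense.astype(float)
```

Passing sparse_threshold=0 to ColumnTransformer is another way to ask for a dense result, at the cost of more memory when there are many categories.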

Apply function along time dimension of XArray

I have an image stack stored in an XArray DataArray with dimensions time, x, y on which I'd like to apply a custom function along the time axis of each pixel such that the output is a single image of dimensions x,y.
I have tried apply_ufunc, but the function fails, stating that I need to first load the data into RAM (i.e. it cannot use a Dask array). Ideally, I'd like to keep the DataArray backed by Dask arrays, as it isn't possible to load the entire stack into RAM. The exact error message is:
ValueError: apply_ufunc encountered a dask array on an argument, but handling for dask arrays has not been enabled. Either set the dask argument or load your data into memory first with .load() or .compute()
My code currently looks like this:
import numpy as np
import xarray as xr
import pandas as pd
def special_mean(x, drop_min=False):
    s = np.sum(x)
    n = len(x)
    if drop_min:
        s = s - x.min()
        n -= 1
    return s/n
times = pd.date_range('2019-01-01', '2019-01-10', name='time')
data = xr.DataArray(np.random.rand(10, 8, 8), dims=["time", "y", "x"], coords={'time': times})
data = data.chunk({'time':10, 'x':1, 'y':1})
res = xr.apply_ufunc(special_mean, data, input_core_dims=[["time"]], kwargs={'drop_min': True})
If I do load the data into RAM using .compute then I still end up with an error which states:
ValueError: applied function returned data with unexpected number of dimensions: 0 vs 2, for dimensions ('y', 'x')
I'm not sure entirely what I am missing/doing wrong.
def special_mean(x, drop_min=False):
    s = np.sum(x)
    n = len(x)
    if drop_min:
        s = s - x.min()
        n -= 1
    return s/n
times = pd.date_range('2019-01-01', '2019-01-10', name='time')
data = xr.DataArray(np.random.rand(10, 8, 8), dims=["time", "y", "x"], coords={'time': times})
data = data.chunk({'time':10, 'x':1, 'y':1})
res = xr.apply_ufunc(special_mean, data, input_core_dims=[["time"]], kwargs={'drop_min': True}, dask='allowed', vectorize=True)
The code above using the vectorize argument should work.
My aim was also to implement apply_ufunc from Xarray such that it can compute the special mean across x and y.
Ales' example worked for me, but only after omitting the line that chunks the data. Otherwise I get:
ValueError: applied function returned data with unexpected number of dimensions. Received 0 dimension(s) but expected 2 dimensions with names: ('y', 'x')
Interestingly, I found that to get a 3D output from apply_ufunc instead of a 2D one, we need to add output_core_dims=[["time"]] to the apply_ufunc call.
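For chunked data, an alternative worth trying is to make the function itself reduce over the last axis and pass dask="parallelized". This is a sketch rather than the approach from the answers above: it relies on apply_ufunc moving the core dimension ("time") to the last axis, and on "time" being held in a single chunk. The chunk sizes are arbitrary.

```python
import numpy as np
import xarray as xr

def special_mean(x, drop_min=False):
    # apply_ufunc moves the core dimension ("time") to the last axis,
    # so reduce over axis=-1 rather than over the whole block.
    s = x.sum(axis=-1)
    n = x.shape[-1]
    if drop_min:
        s = s - x.min(axis=-1)
        n -= 1
    return s / n

values = np.random.default_rng(0).random((10, 8, 8))
data = xr.DataArray(values, dims=["time", "y", "x"])
# Keep "time" in one chunk (-1) and split only the spatial dimensions.
data = data.chunk({"time": -1, "y": 4, "x": 4})

res = xr.apply_ufunc(
    special_mean,
    data,
    input_core_dims=[["time"]],
    kwargs={"drop_min": True},
    dask="parallelized",
    output_dtypes=[float],
)
image = res.compute()  # 2D result with dimensions ("y", "x")
```

Because the function works on whole blocks, vectorize=True is not required here, which avoids a Python-level loop over pixels.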

convert dask.bag of dictionaries to dask.dataframe using dask.delayed and pandas.DataFrame

I am struggling to convert a dask.bag of dictionaries into dask.delayed pandas.DataFrames into a final dask.dataframe
I have one function (make_dict) that reads files into a rather complex nested dictionary structure and another function (make_df) to turn these dictionaries into a pandas.DataFrame (resulting dataframe is around 100 mb for each file). I would like to append all dataframes into a single dask.dataframe for further analysis.
Up to now I was using dask.delayed objects to load, convert and append all data which works fine (see example below). However for future work I would like to store the loaded dictionaries in a dask.bag using dask.persist().
I managed to load the data into dask.bag, resulting in a list of dicts or list of pandas.DataFrame that I can use locally after calling compute(). When I tried turning the dask.bag into a dask.dataframe using to_delayed() however, I got stuck with an error (see below).
It feels like I am missing something rather simple here or maybe my approach to dask.bag is wrong?
The below example shows my approach using simplified functions and throws the same error. Any advice on how to tackle this is appreciated.
import numpy as np
import pandas as pd
import dask
import dask.dataframe
import dask.bag
print(dask.__version__) # 1.1.4
print(pd.__version__) # 0.24.2
def make_dict(n=1):
    return {"name":"dictionary","data":{'A':np.arange(n),'B':np.arange(n)}}
def make_df(d):
    return pd.DataFrame(d['data'])
k = [1,2,3]
# using dask.delayed
dfs = []
for n in k:
    delayed_1 = dask.delayed(make_dict)(n)
    delayed_2 = dask.delayed(make_df)(delayed_1)
    dfs.append(delayed_2)
ddf1 = dask.dataframe.from_delayed(dfs).compute() # this works as expected
# using dask.bag and turning bag of dicts into bag of DataFrames
b1 = dask.bag.from_sequence(k).map(make_dict)
b2 = b1.map(make_df)
df = pd.DataFrame().append(b2.compute()) # <- I would like to do this using delayed dask.DataFrames like above
ddf2 = dask.dataframe.from_delayed(b2.to_delayed()).compute() # <- this fails
# error:
# ValueError: Expected iterable of tuples of (name, dtype), got [ A B
# 0 0 0]
what I ultimately would like to do using the distributed scheduler:
b = dask.bag.from_sequence(k).map(make_dict)
b = b.persist()
ddf = dask.dataframe.from_delayed(b.map(make_df).to_delayed())
In the bag case, the delayed objects point to lists of elements, so you have a list of lists of pandas dataframes, which is not quite what you want. Two recommendations:
1. Just stick with dask.delayed. It seems to work well for you.
2. Use the Bag.to_dataframe method, which expects a bag of dicts and does the dataframe conversion itself.

Can I create a dask array with a delayed shape

Is it possible to create a dask array from a delayed value by specifying its shape with another delayed value?
My algorithm won't give me the shape of the array until pretty late in the computation.
Eventually, I will be creating some blocks with shapes specified by the intermediate results of my computation, then calling da.concatenate on all the results (or da.block, if it were more flexible).
I don't think it is too detrimental if I can't, but it would be cool if I could.
Sample code
from dask import delayed
from dask import array as da
import numpy as np
n_shape = (3, 3)
shape = delayed(n_shape, nout=2)
d_shape = (delayed(n_shape[0]), delayed(n_shape[1]))
n = delayed(np.zeros)(n_shape, dtype=np.float)
# this doesn't work
# da.from_delayed(n, shape=shape, dtype=np.float)
# this doesn't work either, but I think goes a little deeper
# into the function call
da.from_delayed(n, shape=d_shape, dtype=np.float)
You cannot provide a delayed shape, but you can state that the shape is unknown by using np.nan as the value wherever you don't know a dimension.
Example
import random
import numpy as np
import dask
import dask.array as da
@dask.delayed
def f():
    return np.ones((5, random.randint(10, 20))) # a 5 x ? array
values = [f() for _ in range(5)]
arrays = [da.from_delayed(v, shape=(5, np.nan), dtype=float) for v in values]
x = da.concatenate(arrays, axis=1)
>>> x
dask.array<concatenate, shape=(5, nan), dtype=float64, chunksize=(5, nan)>
>>> x.shape
(5, nan)
>>> x.compute().shape
(5, 88)
Docs
See http://dask.pydata.org/en/latest/array-chunks.html#unknown-chunks
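Some operations need concrete chunk sizes. In that case, the unknown dimensions can be resolved later, at the cost of computing the blocks. The sketch below assumes a dask version that provides Array.compute_chunk_sizes; the block widths are fixed, made-up values so that the result is predictable.

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def make_block(width):
    # Stand-in for a computation whose output width is only known at run time.
    return np.ones((5, width))

widths = [3, 4, 5]  # hypothetical widths, unknown to dask when the graph is built
arrays = [
    da.from_delayed(make_block(w), shape=(5, np.nan), dtype=float)
    for w in widths
]
x = da.concatenate(arrays, axis=1)
unknown_shape = x.shape  # second entry is nan at this point

# Computes the blocks once to learn their sizes and updates x's chunks.
x = x.compute_chunk_sizes()
known_shape = x.shape
```

After this call, shape-dependent operations such as slicing along the second axis work as usual.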

Horizontal Histogram in OpenCV

I am a newbie to OpenCV, and I am working on a senior project related to image processing. I have a question: can I make a horizontal or vertical histogram with some OpenCV functions?
Thanks,
Truong
The most efficient way to do this is by using the cvReduce function. It has a parameter that lets you select whether you want a horizontal or a vertical projection.
You can also do it by hand with the functions cvGetCol and cvGetRow combined with cvSum.
Based on the link you provided in a comment, this is what I believe you're trying to do.
You want to create an array with n elements, where n is the number of columns in the input image. The value of the nth element of the array is the sum of all the pixels in the nth column.
You can calculate this array by looping over the columns of the input image, using cvGetSubRect to access the pixels in that column, and cvSum to sum those pixels.
Here is some Python code that does that, assuming a grayscale image:
import cv
def verticalProjection(img):
    "Return a list containing the sum of the pixels in each column"
    (w,h) = cv.GetSize(img)
    sumCols = []
    for j in range(w):
        col = cv.GetSubRect(img, (j,0,1,h))
        sumCols.append(cv.Sum(col)[0])
    return sumCols
Updating carnieri's answer (some of the cv functions no longer work today):
import numpy as np
import cv2
def verticalProjection(img):
    "Return a list containing the sum of the pixels in each column"
    (h, w) = img.shape[:2]
    sumCols = []
    for j in range(w):
        col = img[0:h, j:j+1] # y1:y2, x1:x2
        sumCols.append(np.sum(col))
    return sumCols
Regards.
An example of using cv2.reduce with OpenCV 3 in Python :
import numpy as np
import cv2
img = cv2.imread("test_1.png")
x_sum = cv2.reduce(img, 0, cv2.REDUCE_SUM, dtype=cv2.CV_32S)
y_sum = cv2.reduce(img, 1, cv2.REDUCE_SUM, dtype=cv2.CV_32S)
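Since cv2 images are NumPy arrays, the same projections can also be written as plain axis sums without an explicit loop. This is a sketch assuming a single-channel (grayscale) image stored as a 2D array; the function names and the tiny test image are made up for illustration.

```python
import numpy as np

def vertical_projection(img):
    """Sum of the pixels in each column (one value per x position)."""
    # dtype=np.int64 avoids overflow when summing many uint8 pixels.
    return img.sum(axis=0, dtype=np.int64)

def horizontal_projection(img):
    """Sum of the pixels in each row (one value per y position)."""
    return img.sum(axis=1, dtype=np.int64)

# Tiny 2x3 stand-in for a grayscale image.
img = np.array([[1, 2, 3],
                [4, 5, 6]], dtype=np.uint8)

col_sums = vertical_projection(img)    # length equals the image width
row_sums = horizontal_projection(img)  # length equals the image height
```

The resulting arrays can be passed directly to a plotting function to draw the projection as a histogram.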
