I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found that a combination of da.overlap.overlap and to_delayed().ravel() gets the job done, as long as I pass in the relevant block key and chunk information.
Edit:
Thanks to @AnnaM, who caught bugs in the original post and then generalized the code! Building on her comments, I'm including an updated version of the code. In response to Anna's question about memory usage, I also verified that this does not seem to use more memory than naively expected.
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

def extract_features_generalized(chunk, offsets, depth, columns):
    shape = np.asarray(chunk.shape)
    offsets = np.asarray(offsets)
    depth = np.asarray(depth)
    # Coordinates of nonzero voxels, one row per point.
    coordinates = np.stack(np.nonzero(chunk)).T
    # Keep only points inside the non-overlapping core of the block.
    keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
    data = coordinates + offsets - depth
    df = pd.DataFrame(data=data, columns=columns)
    return df[keep]
def my_overlap_generalized(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        # block.key[1:] is the block's index along each dimension; with the uniform
        # chunks produced by rechunk, this gives the block's offset in the full array.
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(
            block, offsets=offsets, depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr,
                            chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2),
                            columns=['r', 'c', 'z', 'y', 'x'],
                            boundary=tuple(['reflect'] * 5))
df.compute().reset_index()
-- Remainder of original post, including original bugs --
My example only does xy overlaps, but it's easy to generalize. Is there anything below that is suboptimal or could be done better? Is anything likely to break because it's relying on low-level information that could change (e.g. block key)?
def my_overlap(data, chunk_xy, depth_xy):
    data = data.rechunk((-1, -1, -1, chunk_xy, chunk_xy))
    data_overlapping_chunks = da.overlap.overlap(
        data,
        depth=(0, 0, 0, depth_xy, depth_xy),
        boundary={3: 'reflect', 4: 'reflect'})

    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
        dfs.append(df_block)

    # All computation is delayed, so downstream computations need to know the format
    # of the data. If the meta information is not specified, a single computation will
    # be done (which could be expensive) at this point to infer the metadata.
    # This empty dataframe has the index, column, and type information we expect
    # in the computation.
    columns = ['r', 'c', 'z', 'y', 'x']

    # The dtypes are float64, except for a small number of columns.
    df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
    df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
    df_meta.index.name = 'feature'

    return dd.from_delayed(dfs, meta=df_meta)
def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y + offsets[3] - depth_xy,
                       'x': x + offsets[4] - depth_xy})
    df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
    return df
data = np.zeros((2,4,8,16,16)) # round, channel, z, y, x
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()
First of all, thanks for posting your code. I am working on a similar problem and this was really helpful for me.
When testing your code, I discovered a few mistakes in the extract_features function that prevent your code from returning correct indices.
Here is a corrected version:
def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
    df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
    df['y'] = df['y'] + offsets[3] - depth_xy
    df['x'] = df['x'] + offsets[4] - depth_xy
    return df
The updated code now returns the indices that were set to 1:
   index  r  c  z  y  x
0      0  0  0  4  2  2
1      1  0  1  4  6  2
2      2  0  3  4  2  2
3      1  1  2  4  8  2
For comparison, this is the output of the original version:
   index  r  c  z  y  x
0      1  0  1  4  6  2
1      3  1  2  4  8  2
2      0  0  1  4  6  2
3      1  1  2  4  8  2
It returns points 2 and 4, two times each.
The reason this happens is three mistakes in the extract_features function:
1. The offset is added and the depth subtracted before the overlapping parts are filtered out; the order needs to be swapped.
2. df.y > depth_xy should be replaced with df.y >= depth_xy.
3. df.z should be replaced with df.x, since it is the x dimension that has an overlap.
To optimize this even further, here is a generalized version of the code that works for an arbitrary number of dimensions:
def extract_features_generalized(chunk, offsets, depth, columns):
    coordinates = np.nonzero(chunk)
    df = pd.DataFrame()
    rows_to_keep = np.ones(len(coordinates[0]), dtype=int)
    for i in range(len(columns)):
        df[columns[i]] = coordinates[i]
        rows_to_keep = rows_to_keep * np.array((df[columns[i]] >= depth[i])) * \
            np.array((df[columns[i]] < (chunk.shape[i] - depth[i])))
        df[columns[i]] = df[columns[i]] + offsets[i] - depth[i]
    del coordinates
    return df[rows_to_keep > 0]
def my_overlap_generalized(data, chunksize, depth, columns):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth,
                                                 boundary=tuple(['reflect'] * len(columns)))
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr, chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2), columns=['r', 'c', 'z', 'y', 'x'])
df.compute().reset_index()
I'm trying to adapt the recommendation in Section 12.7.6.5 of the manual (to use a dedicated environment rather than the global environment) to interactive usage with r_make().
What I did was modify the _drake.R configuration script as follows:
envir <- new.env(parent = globalenv())
source("R/packages.R", local = envir) # Load your packages, e.g. library(drake).
source("R/functions.R", local = envir) # Define your custom code as a bunch of functions.
source("R/plan.R", local = envir) # Create your drake plan.
drake_config(plan, envir = envir)
for whatever packages, functions, and plan are in those files.
When I run
library(drake)
r_make()
I get:
Error in if (nrow(plan) < 1L) { : argument is of length zero
Error: <callr_status_error: callr subprocess failed: argument is of length zero>
-->
<callr_remote_error in if (nrow(plan) < 1L) { ...:
argument is of length zero>
in process 1598
See `.Last.error.trace` for a stack trace.
Am I missing something?
If you are already using r_make(), you most likely do not need to bother with envir. Because r_make() begins and ends in its own isolated callr::r() process, the global environment of the master session is already protected. In fact, r_make() is much better than envir when it comes to environment reproducibility, so you are already on the right track.
But if you still want to use envir, please make sure the plan is accessible to the code that calls drake_config(), i.e. the global environment of the session that runs _drake.R. So you can either call drake_config(envir$plan, envir = envir), or write source("plan.R") instead of source("plan.R", local = envir).
Examples:
writeLines(
  c(
    "library(drake)",
    "plan <- drake_plan(x = 1)"
  ),
  "plan.R"
)
writeLines(
  c(
    "envir <- new.env(parent = globalenv())",
    "source(\"plan.R\", local = envir)",
    "ls() # does not contain the plan",
    "ls(envir) # contains the plan",
    "drake_config(envir$plan, envir = envir)"
  ),
  "_drake.R"
)
cat(readLines("plan.R"), sep = "\n")
#> library(drake)
#> plan <- drake_plan(x = 1)
cat(readLines("_drake.R"), sep = "\n")
#> envir <- new.env(parent = globalenv())
#> source("plan.R", local = envir)
#> ls() # does not contain the plan
#> ls(envir) # contains the plan
#> drake_config(envir$plan, envir = envir)
library(drake)
r_make()
#> target x
Created on 2020-01-13 by the reprex package (v0.3.0)
writeLines(
  c(
    "library(drake)",
    "plan <- drake_plan(x = 1)"
  ),
  "plan.R"
)
writeLines(
  c(
    "envir <- new.env(parent = globalenv())",
    "source(\"plan.R\") # source into global envir",
    "ls()",
    "ls(envir)",
    "drake_config(plan, envir = envir)"
  ),
  "_drake.R"
)
cat(readLines("plan.R"), sep = "\n")
#> library(drake)
#> plan <- drake_plan(x = 1)
cat(readLines("_drake.R"), sep = "\n")
#> envir <- new.env(parent = globalenv())
#> source("plan.R") # source into global envir
#> ls()
#> ls(envir)
#> drake_config(plan, envir = envir)
library(drake)
r_make()
#> target x
Created on 2020-01-13 by the reprex package (v0.3.0)
Is it possible to use a map() transformation whose grouping variable is defined in an external plan?
In other words, this works for me:
plan_a = drake_plan(
  foo = target(x + 1, transform = map(x = c(4, 5, 6))),
  bar = target(y + 5, transform = map(foo))
)
but this doesn't:
plan_a = drake_plan(
  foo = target(x + 1, transform = map(x = c(4, 5, 6)))
)
plan_b = drake_plan(
  bar = target(y + 5, transform = map(foo))
)
bind_plans(plan_a, plan_b)
Thanks!
I found a solution using the transform_plan function.
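For reference, here is a minimal sketch of what that can look like (my own reconstruction, not the original poster's code), assuming drake's transform = FALSE argument to drake_plan() and the transform_plan() function: define both plans with the transformations left unapplied, bind them, and only then apply the transformations, so that map(foo) in plan_b can see the foo targets declared in plan_a.
library(drake)

# Define the transformations but defer them with transform = FALSE.
plan_a <- drake_plan(
  foo = target(x + 1, transform = map(x = c(4, 5, 6))),
  transform = FALSE
)
plan_b <- drake_plan(
  bar = target(y + 5, transform = map(foo)),
  transform = FALSE
)

# Bind the untransformed plans, then apply all transformations at once.
plan <- transform_plan(bind_plans(plan_a, plan_b))
plan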
I would like to use the wildcard to generate a bunch of targets and then have another set of targets that refer to those original targets. I think this example captures my idea:
plan <- drake_plan(
  sub_task = runif(1000, min = mean__, max = 50),
  full_task = sub_task * 2
)
step <- 1:4
full_plan <- evaluate_plan(
  plan,
  rules = list(
    mean__ = step
  )
)
So what I get now is 5 targets: 4 sub_tasks and a single full_task. What I'm looking for is 8 targets: the 4 sub_tasks (which are fine), and 4 more full_tasks, each based on one of those 4 sub_tasks.
This question comes up regularly, and I like how you phrased it.
More about the problem
For onlookers, I will print out the plan and the graph of the current (problematic) workflow.
library(drake)
plan <- drake_plan(
  sub_task = runif(1000, min = mean__, max = 50),
  full_task = sub_task * 2
)
step <- 1:4
full_plan <- evaluate_plan(
  plan,
  rules = list(
    mean__ = step
  )
)
full_plan
#> # A tibble: 5 x 2
#>   target     command
#>   <chr>      <chr>
#> 1 sub_task_1 runif(1000, min = 1, max = 50)
#> 2 sub_task_2 runif(1000, min = 2, max = 50)
#> 3 sub_task_3 runif(1000, min = 3, max = 50)
#> 4 sub_task_4 runif(1000, min = 4, max = 50)
#> 5 full_task  sub_task * 2
config <- drake_config(full_plan)
vis_drake_graph(config)
Created on 2018-12-18 by the reprex package (v0.2.1)
Solution
As you say, we want full_task_* targets that depend on their corresponding sub_task_* targets. To accomplish this, we need to use the mean__ wildcard in the full_task_* commands as well. Wildcards are an early-days interface based on text replacement, so they do not need to be independent variable names in their own right.
library(drake)
plan <- drake_plan(
  sub_task = runif(1000, min = mean__, max = 50),
  full_task = sub_task_mean__ * 2
)
step <- 1:4
full_plan <- evaluate_plan(
  plan,
  rules = list(
    mean__ = step
  )
)
full_plan
#> # A tibble: 8 x 2
#>   target      command
#>   <chr>       <chr>
#> 1 sub_task_1  runif(1000, min = 1, max = 50)
#> 2 sub_task_2  runif(1000, min = 2, max = 50)
#> 3 sub_task_3  runif(1000, min = 3, max = 50)
#> 4 sub_task_4  runif(1000, min = 4, max = 50)
#> 5 full_task_1 sub_task_1 * 2
#> 6 full_task_2 sub_task_2 * 2
#> 7 full_task_3 sub_task_3 * 2
#> 8 full_task_4 sub_task_4 * 2
config <- drake_config(full_plan)
vis_drake_graph(config)
Created on 2018-12-18 by the reprex package (v0.2.1)
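As an aside (my addition, not part of the original answer): newer drake versions can express the same pattern with the static branching transform API instead of wildcards. A rough equivalent sketch, assuming drake >= 7.0:
library(drake)

plan <- drake_plan(
  # One sub_task per value of mean, analogous to the mean__ wildcard above.
  sub_task = target(runif(1000, min = mean, max = 50), transform = map(mean = c(1, 2, 3, 4))),
  # One full_task per sub_task, each depending on its corresponding sub_task.
  full_task = target(sub_task * 2, transform = map(sub_task))
)
plan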
In the code snippet below, the functions f and g return different values. From reading the code, you would expect them to behave the same. I am guessing it has to do with the closure v -> innerprodfn(m, v). How do I get the desired behaviour, where f and g return the same values?
type Mat{T<:Number}
    data::Matrix{T}
end

innerprodfn{T}(m::Mat{T}, v::Array{T}) = i -> (m.data*v)[i]
innerprodfn{T}(m::Mat{T}, vv::Matrix{T}) = mapslices(v -> innerprodfn(m, v), vv, 1)
m = Mat(collect(reshape(0:5, 2, 3)))
v = collect(reshape(0:11, 3, 4))
f = innerprodfn(m, v[:,1])
g = innerprodfn(m, v)[1]
m.data * v
# 10 28 46 64
# 13 40 67 94
[f(1) g(1); f(2) g(2)]
# 10 64
# 13 94
I don't have an explanation for the observed behavior, but on a recent nightly version of Julia one gets the expected result.
On 0.5, a workaround is to use a comprehension:
innerprodfn{T}(m::Mat{T}, vv::Matrix{T}) = [innerprodfn(m, vv[:,i]) for i in indices(vv, 2)]
Of course, this works on 0.6 as well.