Operation over two arrays with different numblocks - dask

I am trying to implement take_along_axis for dask array.
What is the standard way to map an operation that takes in a block from dask array A, and the corresponding block from dask array B?
Should I use a rechunk when A.numblocks != B.numblocks?

Rechunking is fine. Dask.array algorithms align chunks internally fairly frequently. You might also consider using functions like map_blocks or atop if they work for you. Dask developers optimize these.
http://dask.pydata.org/en/latest/array-api.html#dask.array.core.atop
http://dask.pydata.org/en/latest/array-api.html#dask.array.core.map_blocks
These are only important though if you're constructing your own algorithms. Normal use of dask.arrays does not really require thinking about rechunking explicitly.

Related

Use map_blocks to compute a heap based on the contents of each block

I've created a Dask cluster on my laptop, and have loaded a NetCDF dataset on it using xarray.open_dataset('some_data.nc',chunks={'lat':'auto', 'lon':'auto', 'time':-1})
I've converted this to a distributed array of time series, ts, one per (lat,lon) pair. For this array, ts.chunks is:
((1555200, 1555200, 1555200, 1555200, 1555200, 1555200), (12,))
Now what I'd like to do is create one heapq per chunk with entries computed one per row of each chunk. I was hoping I could use map_blocks for this, but I don't see how. Also, I want to do some reduction based on those heaps.
Is there a straightforward way to accomplish this? Thanks.
One simple way to accomplish this would be to switch to Dask delayed. See https://docs.dask.org/en/latest/delayed-collections.html

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
d = []
for lot in lots:
lot_data = data[data["LOTID"]==lot]
trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
d.append(trmat)
df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', "to"]), d)
Visualized graph of the operations
General rule: if your data comfortable fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release python's internal lock, the GIL, in which case although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processed; however using processes comes at a cost for moving data between client and processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing from the client.
Short story, you should do some experimentation, measure well, and read the data-frame and distributed scheduler documentation carefully.

Change order of operation applied to a dask bag

I am using a dask bag to handle parallelization of data processing on traces collected from a set of experiments. The paths to the data files for each experiment are turned into custom objects and common operations I perform on this type of data are object methods.
Each object has an identification number associated with the particular experiment. And at some point in the program I want to use this ID number to remove some of the experiments. As in this task graph, where a object is created from a sequence, detrending and deconvolution functions are then applied followed by a remove operation.
Because experiment identification number is static the remove operation can be performed at any step in the task graph and the end result will be the same. However if the remove operation is performed following other computationally costly methods the result will be mush slower, due to these computations being performed unnecessarily on objects that will end up being removed.
Is there a way to insert an operation at an earlier point in the task graph for the bag so that if someone adds a remove operation at any point, it will be the first operation performed?
Instead of using dask bag you might want to look at dask delayed, which might give you a bit more flexibility:
http://dask.pydata.org/en/latest/delayed.html
If you really want to muck about with the task graph directly then you should read about the graph specification
http://dask.pydata.org/en/latest/spec.html

Computing in-place with dask

Short version
I have a dask array whose graph is ultimately based on a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and store the results back into the original backup numpy arrays, making the whole thing an in-place operation?
If you're thinking "you're using dask wrong" then see the long version below for why I feel the need to do this.
Long version
I'm using dask for an application where the original data is sourced from in-memory numpy arrays that contain data collected from a scientific instrument. The goal is to fill most of the RAM (say 75%+) with the original data, which means that there isn't enough to make an in-memory copy. That makes it semantically a bit like an out-of-core problem, in that any derived value can only be realised in memory in chunks rather than all at once.
Dask is well-suited to this, except for one wrinkle. I'm simplifying a lot, but on most of the data (call it X), we need to apply an element-wise operation f, compute some summary statistics s(f(X)), and use that to compute another result over the data, say t(s(f(X)), f(X)). While all the functions are dask-friendly (can be done on a per-chunk basis), trying to simply run this dask graph would cause f(X) to all be held in memory at once because the chunks are all needed for the second pass. An alternative is to explicitly compute s before asking for t (as suggested by https://github.com/dask/dask/issues/874), and thus pay to compute f(X) twice, but it's a somewhat expensive operation so I'd like to avoid that.
However, once f has been applied, the original data are no longer needed. So I'd like to run da.store(f(X)) and have it store the results in the original backing numpy arrays. Technically I think I know how to set that up, and as long as I can be sure that each piece of data is fully consumed before it is overwritten then there are no race conditions, but I'm worried that I may be breaking an API contract by changing back data underneath dask and that it might go wrong in some way. Is there any way to guarantee that it is safe?
One way I can immediately see this going wrong is if several of the input arrays have the same contents and hence get given the same name in dask, causing them to the unified in the graph. I'm using name=False in da.from_array though, so that shouldn't be an issue.

What are the advantages of the "apply" functions? When are they better to use than "for" loops, and when are they not? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Is R's apply family more than syntactic sugar
Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".
Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc)?
Thanks in advance!
Josh
There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
Firstly, for() and apply(), sapply() will generally be just as quick as each other if executed correctly. lapply() does more of it's operating in compiled code within the R internals than the others, so can be faster than those functions. It appears the speed advantage is greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, these all will be calling R functions so they need to be interpreted and then run.
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set-up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))
for(i in IN) {
OUT[i] <- IN > 0.5
}
that is a silly example as > is a vectorised operator but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops, you always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, then allocate a reasonable chunk of storage, and then in the loop check if you have exhausted that storage, and bolt on another big chunk of storage.
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
As to the last point, about why lapply() can possibly be faster that for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (e.g. lapply()) then that is where the performance gain can come from over apply() say which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from a manipulation, and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
From Burns' R Inferno (pdf), p25:
Use an explicit for loop when each
iteration is a non-trivial task. But a
simple loop can be more clearly and
compactly expressed using an apply
function. There is at least one
exception to this rule ... if the result will
be a list and some of the components
can be NULL, then a for loop is
trouble (big trouble) and lapply gives
the expected answer.

Resources