Defining dask worker resources for a dataframe operation

I am applying multiple operations to a dask dataframe. Can I define distributed worker resource requirements for a particular operation?
e.g. I call something like:
df.fillna(value="").map_partitions(...).map(...)
I want to specify resource requirements for map_partitions() (potentially different from those for map()), but it seems the method does not accept a resources parameter.
PS. Alternatively, I figured out that I can call client.persist() after map_partitions() and specify resources in this call, but this immediately triggers the computation.

You can specify resource constraints on particular parts of your computation when you call compute or persist by providing the intermediate collection.
import dask.dataframe as dd

x = dd.read_csv(...)
y = x.map_partitions(func)
z = y.map(func2)

# Constrain only the tasks belonging to the intermediate collection y
z.compute(resources={tuple(y._keys()): {'GPU': 1}})
Thank you for the question. I went to include a link to the documentation about this feature and found that it was undocumented; I'll fix that shortly.
It looks like today there is a bug where the intermediate keys may be optimized away in some situations (though this is less likely for dataframe operations), so you may also want to pass the optimize_graph=False keyword.
z.compute(resources={tuple(y._keys()): {'GPU': 1}}, optimize_graph=False)
See https://github.com/dask/distributed/pull/1362

Related

Do eager pipes create multiple lists in F#?

When using a list and pipes, does it create multiple intermediate lists? If so, isn't this bad for garbage collection?
let mylist = [...]
let filterByPipes ls =
    somefilter ls     // is a list created here
    |> otherFilter    // and then again here
    |> filtersForDays // and finally returned here
Yes. F# uses applicative (eager) evaluation order, which means every value is fully evaluated before it can be passed as a parameter to a function, so at every step a new list is created.
As to whether it's bad for garbage collection: "first measure, then optimize" is the first rule of optimization.
In most circumstances the amount of data is so minuscule that it doesn't make any real difference, and certainly not enough to outweigh the massive gain in code readability.
But if your measurements do identify a bottleneck in this particular place, and it turns out to be significant for the end goal, then yes, you'll have to manually fuse the processing chain (see the sketch below), or perhaps employ a more efficient data structure. It all depends on what the optimization goal is.
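To illustrate fusing (in Python, since most of this page uses it), here is a minimal sketch; the three predicates are hypothetical stand-ins for the filters above:

# Hypothetical predicates standing in for somefilter / otherFilter / filtersForDays.
def p1(x): return x > 0
def p2(x): return x % 2 == 0
def p3(x): return x < 100

def filter_by_stages(ls):
    # Unfused: each stage materializes a full intermediate list.
    stage1 = [x for x in ls if p1(x)]
    stage2 = [x for x in stage1 if p2(x)]
    return [x for x in stage2 if p3(x)]

def filter_fused(ls):
    # Fused: a single pass that allocates only the final list.
    return [x for x in ls if p1(x) and p2(x) and p3(x)]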

Can I assume the running order of LeafSystem's CalcOutput function?

I am working on a LeafSystem like this:
class exampleLeafSystem(LeafSystem):
    def __init__(self, plant):
        self._plant = plant
        self._plant_context = plant.CreateDefaultContext()
        self.DeclareVectorInputPort("q_v", BasicVector(6))
        self.DeclareVectorOutputPort("tau", BasicVector(3), self.TauCalcOutput)
        self.DeclareVectorOutputPort("xe", BasicVector(3), self.xeCalcOutput)

    def TauCalcOutput(self, context, output):
        q_v = self.get_input_port(0).Eval(context)
        self._plant.SetPositionsAndVelocities(self._plant_context, q_v)
        # Do some calculation with self._plant and self._plant_context to get the output
        output.SetFromVector(tau)

    def xeCalcOutput(self, context, output):
        q_v = self.get_input_port(0).Eval(context)
        self._plant.SetPositionsAndVelocities(self._plant_context, q_v)
        # Do some calculation with self._plant and self._plant_context to get the output
        output.SetFromVector(xe)
In the two methods TauCalcOutput and xeCalcOutput, I first need to update the MultibodyPlant's state information and then do the calculation to compute the output. However, since I do not know the order in which the two CalcOutput methods are called, for both methods to use the newest state information I have to write
q_v = self.get_input_port(0).Eval(context)
self._plant.SetPositionsAndVelocities(self._plant_context, q_v)
in both methods, which seems a bit unnecessary. If I could assume the order in which the CalcOutput methods are run, for example that TauCalcOutput always runs first, then I could have only
q_v = self.get_input_port(0).Eval(context)
self._plant.SetPositionsAndVelocities(self._plant_context, q_v)
in the TauCalcOutput method, without worrying that the xeCalcOutput method will use MultibodyPlant state information that is one step stale.
So my question is: is there a specific order in which the CalcOutput methods are called?
No. The contract is that output ports can be called in any order at any time, and are only called when they are evaluated (e.g. by a downstream system). The order in which they are called depends on the other systems in the Diagram; they might not all be called during a single simulation step (e.g. if one output is consumed by a discrete-time system with time step 0.1 and another with time step 0.2), or might not be called at all if they are not connected. Users can even call get_output_port().Eval() manually.
For a general approach to avoiding duplicate computation in output ports, you should store the result of the shared computation in the Context, either as state or as a "cache entry". See DeclareCacheEntry for more details.
For this workflow specifically, perhaps the simplest solution is to check whether the positions and velocities are already set to the same value, to avoid repeating the kinematics evaluation; for instance, as you see here:
https://github.com/RobotLocomotion/drake/blob/6e6e37ffa677362245773f13c0628f0042b47414/multibody/inverse_kinematics/kinematic_constraint_utilities.cc#L47-L54
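A minimal sketch of that check in Python, mirroring the linked C++ code (the helper name is mine, and the numpy equality test is an assumption):

import numpy as np

def _update_plant_context(self, context):
    # Skip the kinematics update when the plant context already
    # holds the same positions and velocities (cf. the C++ snippet above).
    q_v = self.get_input_port(0).Eval(context)
    if not np.array_equal(
            self._plant.GetPositionsAndVelocities(self._plant_context), q_v):
        self._plant.SetPositionsAndVelocities(self._plant_context, q_v)

Both TauCalcOutput and xeCalcOutput could then start with self._update_plant_context(context) instead of duplicating those two lines.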

Can Dask computational graphs keep intermediate data so re-compute is not necessary?

I am very impressed with Dask and I am trying to determine if it is the right tool for my problem. I am building a project for interactive data exploration where users can interactively change parameters of a figure. Sometimes these changes require re-computing the entire pipeline to make the graph (e.g. "show data from a different time interval"), but sometimes not. For instance, "change the smoothing parameter" should not require the system to reload the raw unsmoothed data, because the underlying data is the same; only the processing changes. The system should instead use the existing raw data that has already been loaded.
I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on which parameters of the data visualization have been changed. It looks like the caching system in Dask is close to what I need, but it was designed with a slightly different use case in mind. I see there is a persist method, but I'm not sure if that would work either. Is there an easy way to accomplish this in Dask, or is there another project that would be more appropriate?
"change the smoothing parameter" should not require the system to reload the raw unsmoothed data
Two options:
The builtin functools.lru_cache will cache every unique input. The bound on memory use is the maxsize parameter, which controls how many input/output pairs are stored (a sketch follows this list).
Using persist in the right places will compute that object as mentioned at https://distributed.dask.org/en/latest/manage-computation.html#client-persist. It will not require re-running computation to get the object in later computation; functionally, it's the same as lru_cache.
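For the first option, a minimal sketch (the loader function and its argument are hypothetical):

import pandas as pd
from functools import lru_cache

@lru_cache(maxsize=32)  # bound memory: keep at most 32 input/output pairs
def load_raw(path):
    # A second call with the same path returns the cached DataFrame
    # instead of re-reading from disk.
    return pd.read_csv(path)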
For the second option, here is an example where the code reads from disk twice:
>>> import dask.dataframe as dd
>>> df = dd.read_csv(...)
>>> # df = df.persist() # uncommenting this line → only read from disk once
>>> df[df.x > 0].mean().compute()
24.9
>>> df[df.y > 0].mean().compute()
0.1
Uncommenting that line means this code only reads from disk once, because the task graph for the CSV read is computed once and the result is kept in memory. For your application it sounds like I would use persist intelligently: https://docs.dask.org/en/latest/best-practices.html#persist-when-you-can
What if two smoothing parameters need to be visualized? In that case, I'd avoid calling compute repeatedly: https://docs.dask.org/en/latest/best-practices.html#avoid-calling-compute-repeatedly
lower, upper = client.compute([df.x.min(), df.x.max()])  # two futures sharing one task graph
This will share the task graph for min and max so unnecessary computation is not performed.
I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on what parameters of the data visualization have been changed.
Dask has a smart caching ability: https://docs.dask.org/en/latest/caching.html#automatic-opportunistic-caching. Part of the documentation says
Another approach is to watch all intermediate computations, and guess which ones might be valuable to keep for the future. Dask has an opportunistic caching mechanism that stores intermediate tasks that show the following characteristics:
Expensive to compute
Cheap to store
Frequently used
I think this is what you're looking for; it'll store values depending on those attributes.
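A minimal sketch of turning it on (this is the cachey-backed cache for the local schedulers, so it applies to the default threaded scheduler rather than a distributed cluster):

from dask.cache import Cache  # requires the optional `cachey` package

cache = Cache(2e9)  # keep up to ~2 GB of "valuable" intermediate results
cache.register()    # active globally for subsequent .compute() calls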

Semaphores in dask.distributed?

I have a dask cluster with n workers and want the workers to run queries against a database. But the database can only handle m queries in parallel, where m < n. How can I model that in dask.distributed? Only m workers should work on such a task in parallel.
I have seen that distributed supports locks (http://distributed.readthedocs.io/en/latest/api.html#distributed.Lock). But with that, I could do only one query in parallel, not m.
I have also seen that I could define resources per worker (https://distributed.readthedocs.io/en/latest/resources.html). But that does not fit either, as the database is independent of the workers. I would either have to define 1 database resource per worker (which leads to too many parallel queries), or I would have to distribute m database resources across n workers, which is difficult when setting up the cluster and suboptimal in execution.
Is it possible to define something like semaphores in dask to solve that?
You could probably hack something together with Locks and Variables.
A cleaner solution would be to implement Semaphores much like how Locks are implemented. Depending on your experience this may not be that hard (the lock implementation is 150 lines), and it would be a welcome pull request.
https://github.com/dask/distributed/blob/master/distributed/lock.py
You can use a dask.distributed.Queue:

import dask.distributed

class DDSemaphore(object):
    """Dask Distributed Semaphore built on a distributed Queue."""
    def __init__(self, value=1):
        self._q = dask.distributed.Queue()
        for _ in range(value):
            self._q.put(42)

    def acquire(self):
        # Blocks until a token is available.
        self._q.get()

    def release(self):
        self._q.put(42)
dask.distributed now contains a Semaphore class that can be used via the distributed Futures API:
https://docs.dask.org/en/stable/futures.html#id1
If you're using Dask collections such as Bag, DataFrame, or Array, then you may need to obtain their enclosed Future objects to use them with the Semaphore. Do that with futures_of()
https://docs.dask.org/en/stable/user-interfaces.html?highlight=futures_of#combining-interfaces
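A minimal sketch of the Semaphore with the Futures API (the local client and the query body are placeholders):

from dask.distributed import Client, Semaphore

client = Client()  # assumes a local or already-running cluster
sem = Semaphore(max_leases=2, name="database")  # at most 2 holders at once

def query_db(i):
    with sem:  # blocks until one of the 2 leases is free
        return i * 2  # placeholder for the actual database query

results = client.gather(client.map(query_db, range(10)))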

Lazy repartitioning of dask dataframe

After several stages of lazy dataframe processing, I need to repartition my dataframe before saving it. However, the .repartition() method requires me to know the number of partitions (as opposed to the size of partitions), and that depends on the size of the data after processing, which is as yet unknown.
I think I can lazily compute the size with df.memory_usage().sum(), but repartition() does not seem to accept it (a lazy scalar) as an argument.
Is there a way to do this kind of adaptive (data-size-based) lazy repartitioning?
PS. Since this is the (almost) last step in my pipeline, I can probably work around this by converting to delayed and repartitioning "manually" (I don't need to go back to dataframe), but I'm looking for a simpler way to do this.
PS. Repartitioning by partition size would also be a very useful feature
Unfortunately Dask's task-graph construction happens immediately and there is no way to partition (or do any operation) in a way where the number of partitions is not immediately known or is lazily computed.
You could, as you suggest, switch to lower-level systems like delayed. In this case I would switch to using futures and track the size of results as they come in, triggering appropriate merging of partitions on the fly. This is probably far more complex than is desired, though; for completeness, a sketch follows.
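The sketch below uses hypothetical inputs (file pattern, target size, save step), and it pulls each finished partition back to the client, which only makes sense if the final save happens there:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, as_completed

client = Client()                         # assumes a running cluster
ddf = dd.read_csv("data-*.csv")           # placeholder for the lazy pipeline
parts = client.compute(ddf.to_delayed())  # one future per partition

buffer, buffered_bytes = [], 0
for future in as_completed(parts):
    pdf = future.result()
    buffer.append(pdf)
    buffered_bytes += pdf.memory_usage(deep=True).sum()
    if buffered_bytes >= 100e6:           # flush ~100 MB output chunks
        chunk = pd.concat(buffer)
        # ... save `chunk` here ...
        buffer, buffered_bytes = [], 0
# Any leftover frames in `buffer` would be concatenated and saved as well.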
