Losing the local client when calling dask inside of a distributed-spawned function - dask

I am trying to perform some dask operations inside of a function which is sent to a worker through distributed. A simplified version of the code is:
client = Client(...)
X_ = dask.array.from_array(...)
X = dask.persist(X_)

def func(X, b):
    with distributed.local_client() as c:
        with dask.set_options(get=c.get):
            return dask.lu_solve(X, b)

client.persist(dask.do(func)(X, b))
The problem is that when doing this for several X, b instances, sometimes it works and sometimes I get the exception Exception: Client not running. Status: closed.
Any idea on how to address this?

When you pass the dask.array inputs X and b to a dask.delayed function they arrive as NumPy arrays, so I recommend just using NumPy functions instead.
Alternatively, maybe you're trying to accomplish something else?
If you want to call a dask.array function on dask.arrays you can do it from your normal Python session; there is no reason to use a local_client.
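A minimal sketch of that suggestion, assuming the goal is simply to solve a small system inside the delayed function (the np.linalg.solve call and the array shapes are illustrative, not taken from the question):

import numpy as np
import dask

@dask.delayed
def func(X, b):
    # X and b arrive here as concrete NumPy arrays, not dask collections
    return np.linalg.solve(X, b)

result = func(np.random.random((4, 4)), np.random.random(4)).compute()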

Related

Mismanagement of dask future results slows down performance

I'm looking for any suggestion on how to solve the bottleneck described below.
Within a dask distributed infrastructure I map some futures and collect the results as they become ready. Once retrieved, I have to invoke a time-consuming, blocking pandas function and, unfortunately, this function can't be avoided.
Ideally I would have something that lets me create another process, detached from the for loop, that is able to ingest the flow of results. Due to other constraints, not present in the example, the output can't be serialized and sent to the workers and must be processed on the master.
Here is a small mock-up. Just grab the idea and don't focus too much on the details of the code:
import random
import time

import numpy as np
import pandas as pd
from dask.distributed import Client, as_completed

class pxldrl(object):
    def __init__(self, df):
        self.table = df

def simulation(list_param):
    time.sleep(random.random())
    val = sum(list_param) / 4
    if val < 0.5:
        result = {'param_e': val}
    else:
        result = {'param_f': val}
    return pxldrl(result)

def costly_function(result, output):
    time.sleep(1)
    # blocking pandas function
    output = output.append(result.table, sort=False, ignore_index=True)
    return output

def main():
    client = Client(n_workers=4, threads_per_worker=1)
    output = pd.DataFrame(columns=['param_e', 'param_f'])
    input = pd.DataFrame(np.random.random(size=(100, 4)),
                         columns=['param_a', 'param_b', 'param_c', 'param_d'])
    for i in range(2):
        futures = client.map(simulation, input.values)
        for future, result in as_completed(futures, with_results=True):
            output = costly_function(result, output)
It sounds like you want to run costly_function in a separate thread. Perhaps you could use the threading or concurrent.futures module to run your entire routine on a separate thread?
If you wanted to get fancy, you could even use Dask again and create a second client that ran within this process:
local_client = Client(processes=False)
and use that (although you'll have to be careful about mixing futures between clients, which won't work).
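A minimal sketch of the first suggestion, reusing the names from the mock-up above: results from the as_completed loop are handed to a queue, and a single consumer thread runs costly_function so the gathering loop is never blocked. The None sentinel and the one-element output_holder list are my own illustrative choices, not part of the original code:

import queue
import threading

result_queue = queue.Queue()
output_holder = [pd.DataFrame(columns=['param_e', 'param_f'])]

def consumer(holder):
    while True:
        result = result_queue.get()
        if result is None:                  # sentinel: no more results coming
            break
        holder[0] = costly_function(result, holder[0])

thread = threading.Thread(target=consumer, args=(output_holder,))
thread.start()

futures = client.map(simulation, input.values)
for future, result in as_completed(futures, with_results=True):
    result_queue.put(result)                # hand off without blocking this loop

result_queue.put(None)                      # signal completion
thread.join()
output = output_holder[0]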

Dask: what function variable is best to choose for visualize()

I am trying to understand Dask delayed more deeply, so I decided to work through the examples here. I modified some of the code to reflect how I want to use Dask (see below), but the results are different than what I expected, i.e. a tuple instead of a list. When I try to apply .visualize() to see what the execution graph looks like, I get nothing.
I worked through all the examples in 'delayed.ipynb' and they all work properly including all the visualizations. I then modified the 'for' loop for one example:
for i in range(256):
    x = inc(i)
    y = dec(x)
    z = add(x, y)
    zs.append(z)
to a function call that uses a list comprehension. The result is a variation on the original working example.
%%time
import time
import random
from dask import delayed, compute, visualize

zs = []

@delayed
def inc(x):
    time.sleep(random.random())
    return x + 1

@delayed
def dec(x):
    time.sleep(random.random())
    return x - 1

@delayed
def add(x, y):
    time.sleep(random.random())
    return x + y

def myloop(x):
    x.append([add(inc(i), dec(inc(i))) for i in range(8)])
    return x

result = myloop(zs)
final = compute(*result)
print(final)
I have tried printing out result (the function call), which provides the expected list of delayed calls, but when I print the result of compute I unexpectedly get the desired list as part of a tuple. Why don't I get just a list?
When I try to visualize the execution graph I get nothing at all. I was expecting to see as many nodes as are in the generated list.
I did not think I made any significant modifications to the example, so what am I not understanding?
The visualize function has the same call signature as compute, so if your compute(*result) call works then try visualize(*result).
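For example (the filename keyword is optional and shown only for illustration):

visualize(*result, filename='myloop_graph')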

Funny behavior with numba - guvectorized functions using argmax()

Consider the following script:
from numba import guvectorize, u1, i8
import numpy as np

@guvectorize([(u1[:], i8)], '(n)->()')
def f(x, res):
    res = x.argmax()

x = np.array([1, 2, 3], dtype=np.uint8)
print(f(x))
print(x.argmax())
print(f(x))
When running it, I get the following:
4382569440205035030
2
2
Why is this happening? Is there a way to get it right?
Python doesn't have references, so res = ... does not actually assign to the output parameter; instead it rebinds the name res. I believe res then points to uninitialized memory, which is why your first run gives a seemingly random value.
Numba works around this with the slice syntax ([:]), which does mutate res; you also need to declare the output type as an array. A working function is:
@guvectorize([(u1[:], i8[:])], '(n)->()')
def f(x, res):
    res[:] = x.argmax()
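As a quick check, re-running the question's example with this version should print the expected index both times (output shown as comments, assuming the same input array):

x = np.array([1, 2, 3], dtype=np.uint8)
print(f(x))          # 2
print(x.argmax())    # 2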

Dask compute fails when using client, works when no client setup

I am trying to use the dask client to parallelize my compute. When I run df.compute() I get the correct output (though it is very slow), but when I run the same thing after setting up a client, I get the following error:
distributed.protocol.pickle - INFO - Failed to serialize <function part at 0x7fd5186ed730>. Exception: can't pickle _thread.RLock objects
Here is my code; in the first df.compute() I get the expected result, in the second I do not.
@dask.delayed
def part(x):
    lower, upper = x
    q = "SELECT id,tfidf_vec,emb_vec FROM document_table"
    lines = man.session.execute(q)
    counter = lower
    df = []
    for line in lines:
        df.append(line)
        counter += 1
        if counter == upper:
            break
    return pd.DataFrame(df)

parts = [part(x) for x in [[0, 100000], [100000, 200000]]]
df = dd.from_delayed(parts)
df.compute()

from dask.distributed import Client
client = Client('127.0.0.1:8786')
df.compute()
Your function contains a reference to man.session, which is part of the function's closure. When you use the default scheduler, threads, the object can be shared between the threads that execute your code. When you use the distributed scheduler, the function must be serialised and sent to workers in different processes.
You should make a function which creates the session object on each invocation, as was suggested as an answer to your very similar question.
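A minimal sketch of that suggestion, where make_session() is a hypothetical helper standing in for however man was originally constructed; the point is only that the connection is opened inside the delayed function instead of being captured in its closure:

import dask
import pandas as pd

@dask.delayed
def part(x):
    lower, upper = x
    session = make_session()   # hypothetical: open the connection on each call
    lines = session.execute("SELECT id,tfidf_vec,emb_vec FROM document_table")
    counter = lower
    rows = []
    for line in lines:
        rows.append(line)
        counter += 1
        if counter == upper:
            break
    return pd.DataFrame(rows)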

Dask Delayed ignores name for dependent variables

When creating a graph of calculations using delayed, I'm trying to assign names so that if I visualize the graph it's readable. However, for delayed variables that depend on functions, the name parameter doesn't seem to affect the key. Here's a toy example:
import numpy as np
import pandas as pd
from dask import delayed

def calc_avg(a, b):
    return pd.concat([a, b], axis=1).mean(axis=1)

def calc_ratio(a, b):
    return a / b

a = delayed(pd.Series(np.random.rand(10)), name='a')
b = delayed(pd.Series(np.random.rand(10)), name='b')
c = delayed(pd.Series(np.random.rand(10)), name='c')

x = delayed(calc_avg, name='avg_result')(a, b)
y = delayed(calc_ratio, name='ratio_result')(x, c)

y.visualize()
You can see the visualization here (I can't embed images), but rather than seeing 'avg_result' I see 'calc_avg-#0', and rather than 'ratio_result' I see 'calc_ratio-#1'. If I look at x.key or y.key, they do not match the names that I provided. Is this the expected behavior?
The key of a dask result needs to be unique for every combination of the function that was delayed and the inputs you give it. What you see above is the expected behaviour: you are naming the function, but a call with different inputs would expect a different output, so the key must be different.
You can specify the key you'd like associated not when you define the delayed function, but when you call it:
x = delayed(calc_avg)(a, b, dask_key_name='avg_result')
y = delayed(calc_ratio)(x, c, dask_key_name='ratio_result')
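With dask_key_name the keys match the supplied names exactly, which you can verify directly (output shown as a comment):

print(x.key, y.key)   # avg_result ratio_result
y.visualize()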
