Dask compute fails when using client, works when no client setup - dask

I am trying to use the dask client to parallelize my compute. When I run df.compute() I get the correct output (though it is very slow), but when I run the same thing after setting up a client, I get the following error:
distributed.protocol.pickle - INFO - Failed to serialize <function part at 0x7fd5186ed730>. Exception: can't pickle _thread.RLock objects
Here is my code, in the first df.compute(), I get the expected result, in the second I do not.
#dask.delayed
def part(x):
lower, upper = x
q = "SELECT id,tfidf_vec,emb_vec FROM document_table"
lines=man.session.execute(q)
counter = lower
df = []
for line in lines:
df.append(line)
counter += 1
if counter == upper:
break
return pd.DataFrame(df)
parts = [part(x) for x in [[0,100000],[100000,200000]]]
df = dd.from_delayed(parts)
df.compute()
from dask.distributed import Client
client = Client('127.0.0.1:8786')
df.compute()

Your function contains a reference to man.session, which is part of the function closure. When you use the default scheduler, threads, the object can be shared between the threads that execute your code. When you use the distributed scheduler, the function must be serialised and sent to workers in difference process(es).
You should make a function which creates the session object on each invocation, as was suggested as an answer to your very similar question.

Related

Mismanagement of dask future results slow down the performances

I'm looking for any suggestion on how to solve the bottleneck below described.
Within a dask distributed infrastructure I map some futures and gain results whenever they are ready. Once retrieved I've to invoke a time consuming, blocking "pandas" function and, unfortunately, this function can't be avoided.
The optimum would be to have something that let me create another process, detached from the for loop, that's able to ingest the flow of results. For other constraints, not present in the example, the output can't be serialized and sent to workers and must be processed on the master.
here a small mockup. Just grab the idea and not focus too much on the details of the code.
class pxldrl(object):
def __init__(self, df):
self.table = df
def simulation(list_param):
time.sleep(random.random())
val = sum(list_param)/4
if val < 0.5:
result = {'param_e': val}
else:
result = {'param_f': val}
return pxldrl(result)
def costly_function(result, output):
time.sleep(1)
# blocking pandas function
output = output.append(result.table, sort=False, ignore_index=True)
return output
def main():
client = Client(n_workers=4, threads_per_worker=1)
output = pd.DataFrame(columns=['param_e', 'param_f'])
input = pd.DataFrame(np.random.random(size=(100, 4)),
columns=['param_a', 'param_b', 'param_c', 'param_d'])
for i in range(2):
futures = client.map(simulation, input.values)
for future, result in as_completed(futures, with_results=True):
output = costly_function(result, output)
It sounds like you want to run costly_function in a separate thread. Perhaps you could using the threading or concurrent.futures module to run your entire routine on a separate thread?
If you wanted to get fancy, you could even use Dask again and create a second client that ran within this process:
local_client = Client(processes=False)
and use that. (although you'll have to be careful about mixing futures between clients, which won't work)

Dask: what function variable is best to choose for visualize()

I am trying to understand Dask delayed more deeply so I decided to work through the examples here. I modified some of the code to reflect how I want to use Dask (see below). But the results are different than what I expected ie. a tuple vs list. When I try to apply '.visualize()' to see what the execution graph looks like I get nothing.
I worked through all the examples in 'delayed.ipynb' and they all work properly including all the visualizations. I then modified the 'for' loop for one example:
for i in range(256):
x = inc(i)
y = dec(x)
z = add(x, y)
zs.append(z)
to a function call the uses a list comprehension. The result is a variation on the original working example.
%%time
import time
import random
from dask import delayed, compute, visualize
zs = []
#delayed
def inc(x):
time.sleep(random.random())
return x + 1
#delayed
def dec(x):
time.sleep(random.random())
return x - 1
#delayed
def add(x, y):
time.sleep(random.random())
return x + y
def myloop(x):
x.append([add(inc(i), dec(inc(i))) for i in range(8)])
return x
result = myloop(zs)
final = compute(*result)
print(final)
I have tried printing out 'result' (function call) which provides the expected list of delay calls but when I print the results of 'compute' I unexpectedly get the desired list as part of a tuple. Why don't I get a just a list?
When I try to 'visualize' the execution graph I get nothing at all. I was expecting to see as many nodes as are in the generated list.
I did not think I made any significant modifications to the example so what am I not understanding?
The visualize function has the same call signature as compute. So if your compute(*result) call works then try visualize(*result)

I have collection of futures which are result of persist on dask dataframe. How to do a delayed operation on them?

I have setup a scheduler and 4 worker nodes to do some processing on csv. size of the csv is just 300 mb.
df = dd.read_csv('/Downloads/tmpcrnin5ta',assume_missing=True)
df = df.groupby(['col_1','col_2']).agg('mean').reset_index()
df = client.persist(df)
def create_sep_futures(symbol,df):
symbol_df = copy.deepcopy(df[df['symbol' == symbol]])
return symbol_df
lazy_values = [delayed(create_sep_futures)(symbol, df) for symbol in st]
future = client.compute(lazy_values)
result = client.gather(future)
st list contains 1000 elements
when I do this, I get this error:
distributed.worker - WARNING - Compute Failed
Function: create_sep_futures
args: ('PHG', symbol col_3 col_2 \
0 A 1.451261e+09 23.512857
1 A 1.451866e+09 23.886857
2 A 1.452470e+09 25.080429
kwargs: {}
Exception: KeyError(False,)
My assumption is that workers should get full dataframe and query on it. But I think it just gets the block and tries to do it.
What is the workaround for it? Since dataframe chunks are already in workers memory. I don't want to move the dataframe to each worker.
Operations on dataframes, using the dataframe syntax and API, are lazy (delayed) by default, you need do nothing more.
First problem: your syntax is wrong df[df['symbol' == symbol]] => df[df['symbol'] == symbol]. That is the origin of the False key.
So the solution you are probably looking for:
future = client.compute(df[df['symbol'] == symbol])
If you do want to work on the chunks separately, you can look into df.map_partitions, which you use with a normal function and takes care of passing data or delayed/futures or df.to_delayed, which will give you a set of delayed objects which you can use with a delayed function.

wrk executing Lua script

My question is that when I run
wrk -d10s -t20 -c20 -s /mnt/c/xxxx/post.lua http://localhost:xxxx/post
the Lua script that is only executed once? It will only put one item into the database at the URL.
-- example HTTP POST script which demonstrates setting the
-- HTTP method, body, and adding a header
math.randomseed(os.time())
number = math.random()
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"name": "' .. tostring(number) .. '", "title":"test","enabled":true,"defaultValue":false}'
Is there a way to make it create the 'number' variable dynamically and keep adding new items into the database until the 'wrk' command has finished its test? Or that it will keep executing the script for the duration of the test creating and inserting new 'number' variables into 'wrk.body' ?
Apologies I have literally only being looking at Lua for a few hours.
Thanks
When you do
number = math.random
you're not setting number to a random number, you're setting it equal to the function math.random. To set the variable to the value returned by the function, that line should read
number = math.random()
You may also need to set a random seed (with the math.randomseed() function and your choice of an appropriately variable argument - system time is common) to avoid math.random() giving the same result each time the script is run. This should be done before the first call to math.random.
As the script is short, system time probably isn't a good choice of seed here (the script runs far quicker than the value from os.time() changes, so running it several times immediately after one another gives the same results each time). Reading a few bytes from /dev/urandom should give better results.
You could also just use /dev/urandom to generate a number directly, rather than feeding it to math.random as a seed. Like in the code below, as taken from this answer. This isn't a secure random number generator, but for your purposes it would be fine.
urand = assert (io.open ('/dev/urandom', 'rb'))
rand = assert (io.open ('/dev/random', 'rb'))
function RNG (b, m, r)
b = b or 4
m = m or 256
r = r or urand
local n, s = 0, r:read (b)
for i = 1, s:len () do
n = m * n + s:byte (i)
end
return n
end

Losing the local client when calling dark inside of a distributed-spawned function

I am trying to perform some task operations inside of a function which is sent to a worker through distributed. A simplified version of the code is
client = Client(...)
X_ = dask.array.from_array(...)
X = dask.persist(X_)
def func(X, b):
with distributed.local_client() as c:
with dask.set_options(get=c.get):
return dask.lu_solve(X, b)
client.persist(dask.do(func)(X, b))
The problem is that in doing for several X, b instances, sometimes it works and sometimes I get the Exception Exception: Client not running. Status: closed
any idea on how to address this?
When you pass the inputs dask.array X and b to a dask.delayed function they arrive as numpy arrays. I recommend just using NumPy functions instead.
Alternatively, maybe you're trying to accomplish something else?
If you want to call a dask.array function on dask.arrays you can do it from your normal python session. There is no reason to use a local_client.

Resources