Dask: set multiprocessing method from Python

Is there a way to set the multiprocessing method from Python? I do not see a method in the Client() API docs of Dask.distributed that indicates how to set this property.
Update:
For example, is there:
client = Client(multiprocessing='fork')
or
client = Client(multiprocessing='spawn')
?

Unfortunately, the multiprocessing context is set at import time of dask.distributed. If you want to set this from Python, you can set the config value after you import dask but before you import dask.distributed:
import dask
dask.config.set({'distributed.worker.multiprocessing-method': 'spawn'})
from dask.distributed import Client
However it's probably more robust to just set this in your config file. See configuration documentation for the various ways to set configuration values.
Note: this is using the configuration as of dask.__version__ == '0.18.0'
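For completeness, a minimal end-to-end sketch of the import-order requirement described above; the processes=True keyword and the __main__ guard are illustrative additions (the guard matters because the 'spawn' start method re-imports the main module in child processes):
import dask
dask.config.set({'distributed.worker.multiprocessing-method': 'spawn'})

from dask.distributed import Client  # import only after the config is in place

if __name__ == '__main__':
    # Extra keyword arguments are forwarded to LocalCluster.
    client = Client(processes=True)
    print(dask.config.get('distributed.worker.multiprocessing-method'))  # -> 'spawn'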


Where is MobyLCPSolver?
ImportError: cannot import name 'MobyLCPSolver' from 'pydrake.all' (/home/docker/drake/drake-build/install/lib/python3.8/site-packages/pydrake/all.py)
I have the latest Drake and cannot import it.
Can anyone help?
As of pydrake v1.12.0, the MobyLcp C++ API is not bound in Python.
However, if you feed an LCP into Solve() then Drake can choose Moby to solve it. You can take advantage of this to create an instance of MobyLCP:
import numpy as np
from pydrake.all import (
    ChooseBestSolver,
    MakeSolver,
    MathematicalProgram,
)

prog = MathematicalProgram()
x = prog.NewContinuousVariables(2)
prog.AddLinearComplementarityConstraint(np.eye(2), np.array([1, 2]), x)
moby_id = ChooseBestSolver(prog)
moby = MakeSolver(moby_id)
print(moby.SolverName())
# The output is: "Moby LCP".
# The C++ type of the `moby` object is drake::solvers::MobyLCP.
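If it helps, here is a short continuation, sketched on the assumption that pydrake's usual Solve()/MathematicalProgramResult API is available, that actually solves the LCP set up above and reads back the solution:
from pydrake.all import Solve

result = Solve(prog)          # Drake chooses the Moby LCP solver for this program
print(result.is_success())    # True if the complementarity problem was solved
print(result.GetSolution(x))  # solution vector for x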
That only allows calling Moby via the MathematicalProgram interface, however. To call MobyLCP-specific C++ functions such as SolveLcpFastRegularized, those functions would first need to be added to the bindings code before they could be used from Python.
You can file a feature request on the Drake GitHub page when you need access to C++ classes or functions that aren't bound into Python yet, or even better you can open a pull request with the bindings that you need.

How can I keep a PBSCluster running?

I have access to a cluster running PBS Pro and would like to keep a PBSCluster instance running on the headnode. My current (obviously broken) script is:
import dask_jobqueue
from paths import get_temp_dir

def main():
    temp_dir = get_temp_dir()
    scheduler_options = {'scheduler_file': temp_dir / 'scheduler.json'}
    cluster = dask_jobqueue.PBSCluster(cores=24, memory='100GB', processes=1, scheduler_options=scheduler_options)

if __name__ == '__main__':
    main()
This script is obviously broken because after the cluster is created the main() function exits and the cluster is destroyed.
I imagine I must call some sort of execute_io_loop function, but I can't find anything in the API.
So, how can I keep my PBSCluster alive?
I think the Python API (advanced) section of the docs might be a good way to solve this issue.
Mind you, that section shows how to create Schedulers and Workers, but I'm assuming the same logic can be applied to your case.
import asyncio

import dask_jobqueue
from paths import get_temp_dir

async def create_cluster():
    temp_dir = get_temp_dir()
    scheduler_options = {'scheduler_file': temp_dir / 'scheduler.json'}
    cluster = dask_jobqueue.PBSCluster(cores=24, memory='100GB', processes=1, scheduler_options=scheduler_options)

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(create_cluster())
You might have to change the code a bit, but it should keep your create_cluster coroutine running until it finishes.
Let me know if this works for you.
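For a version that actually stays up, here is a sketch that rests on a couple of assumptions: that PBSCluster accepts the asynchronous=True flag and the async-with support that distributed's cluster classes provide, and that parking on an asyncio.Event is an acceptable way to keep the headnode process alive. The scale(jobs=1) call is illustrative.
import asyncio
import dask_jobqueue
from paths import get_temp_dir

async def run_cluster():
    temp_dir = get_temp_dir()
    scheduler_options = {'scheduler_file': temp_dir / 'scheduler.json'}
    # asynchronous=True asks the cluster to use the already-running event loop
    # instead of starting its own.
    async with dask_jobqueue.PBSCluster(
        cores=24,
        memory='100GB',
        processes=1,
        scheduler_options=scheduler_options,
        asynchronous=True,
    ) as cluster:
        cluster.scale(jobs=1)         # illustrative: request one PBS job's worth of workers
        await asyncio.Event().wait()  # park forever so the cluster is never torn down

if __name__ == '__main__':
    asyncio.run(run_cluster())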

IBM Cloud Functions - Python Actions

I'm learning how to use serverless functions. I'm trying to connect a Watson Assistant through webhooks using a Python action that processes a small dataset, and I'm still struggling to get it working.
I've done my coding in a Jupyter environment, reading the raw CSV dataset from GitHub and handling it with pandas. The issue is that when I invoke the action in IBM Functions it only works about 10% of the time. I debugged in the Jupyter and Visual Studio environments and the code seems to be OK, but once I move it to the IBM Functions environment it doesn't perform.
import sys
import csv
import json
import pandas as pd

location = ('Germany')  # Passing country parameter for testing purpose
data = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-24-2020.csv')

def main(args):
    location = args.get("location")
    for index, row in data.iterrows():
        currentLoc = row['Country/Region']
        if currentLoc == location:
            covid_statistics = {
                "Province/State": row['Province/State'],
                "Country/Region": row['Country/Region'],
                "Confirmed": row['Confirmed'],
                "Deaths": row['Deaths'],
                "Recovered": row['Recovered']
            }
            return {"message": covid_statistics}
        else:
            return {"message": "Data not available"}

Dask Client detect local default cluster already running

from dask.distributed import Client
Client()
Client(do_not_spawn_new_if_default_address_in_use=True) # should not spawn a new default cluster
Is this possible to do somehow?
The non-public function distributed.client._get_global_client() will return the current client if it exists, or None
from distributed.client import _get_global_client

client = _get_global_client() or Client()
Since it is internal, the API may change without notice.
You shouldn't really be creating multiple clients within the same Python session. It may be worth digging deeper into why you are calling Client() more than once.
If you already have a Dask cluster running on the default address you could set the DASK_SCHEDULER_ADDRESS environment variable which will instruct the client to look there instead of creating a local cluster.
>>> import os
>>> os.environ['DASK_SCHEDULER_ADDRESS'] = 'tcp://localhost:8786'
>>> from dask.distributed import Client
>>> Client() # Does not create a cluster
<Client: 'tcp://127.0.0.1:8786' processes=0 threads=0, memory=0 B>

Seeing logs of dask workers

I'm having trouble changing the temporary directory in Dask. When I change temporary-directory in dask.yaml, for some reason Dask still writes to /tmp (which is full). I now want to debug this, but when I use client.get_worker_logs() I only get INFO output.
I start my cluster with
from dask.distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=1, threads_per_worker=4, memory_limit='10gb')
client = Client(cluster)
I already tried adding distributed.worker: debug to distributed.yaml, but this doesn't change the output. I also checked that I am actually changing the configuration by calling dask.config.get('distributed.logging').
What am I doing wrong?
By default LocalCluster silences most logging. Try passing the silence_logs=False keyword
cluster = LocalCluster(..., silence_logs=False)
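As a combined sketch: 'temporary-directory' is the top-level Dask config key behind the dask.yaml setting mentioned in the question, and setting it from Python before the cluster starts sidesteps the question of whether the YAML file is being picked up; silence_logs=False is the keyword from the answer above. The path is a placeholder.
import dask
from dask.distributed import LocalCluster, Client

dask.config.set({'temporary-directory': '/path/with/enough/space'})  # placeholder path

cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=4,
    memory_limit='10gb',
    silence_logs=False,  # do not override the logging levels set in the config
)
client = Client(cluster)

print(dask.config.get('temporary-directory'))  # confirm the setting took effect
print(client.get_worker_logs())                # worker logs, no longer silenced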
