Seeing logs of dask workers

I'm having trouble changing the temporary directory in Dask. When I change temporary-directory in dask.yaml, Dask for some reason still writes to /tmp (which is full). I want to debug this, but when I use client.get_worker_logs() I only get INFO output.
I start my cluster with
from dask.distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=1, threads_per_worker=4, memory_limit='10gb')
client = Client(cluster)
I already tried adding distributed.worker: debug to distributed.yaml, but this doesn't change the output. I also checked that I am actually changing the configuration by calling dask.config.get('distributed.logging').
What am I doing wrong?

By default, LocalCluster silences most logging. Try passing the silence_logs=False keyword:
cluster = LocalCluster(..., silence_logs=False)
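As a minimal sketch of how the pieces fit together (the /scratch/dask-tmp path is only a placeholder, and setting temporary-directory programmatically is just an alternative to editing dask.yaml):

import dask
from dask.distributed import LocalCluster, Client

# Point Dask's temporary/spill files somewhere with free space
# (equivalent to temporary-directory in dask.yaml).
dask.config.set({"temporary-directory": "/scratch/dask-tmp"})

cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=4,
    memory_limit='10gb',
    silence_logs=False,  # don't let LocalCluster suppress worker log output
)
client = Client(cluster)

# With silencing disabled (and the log level raised in distributed.yaml),
# this should now show more than just INFO lines.
print(client.get_worker_logs())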

Related

launching dask-cuda LocalCUDACluster within SLURMCluster

I want to launch a cluster on Slurm, where, on each node, a LocalCUDACluster is launched to use the available GPUs of that node. My sample code looks as follows:
import dask
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from dask_cuda import LocalCUDACluster
from numba import cuda  # needed for cuda.list_devices() below
import os

def test():
    # return(cuda.get_current_device().id)
    return [i.id for i in cuda.list_devices()]

def test_numba_cuda():
    cluster = LocalCUDACluster()
    client = Client(cluster)
    return cluster.cuda_visible_devices

queue = "gpus"       # batch, gpus, develgpus, etc.
project = "deepacf"  # your project: zam, training19xx, etc.
port = 56755

cluster = SLURMCluster(
    n_workers=2,
    cores=1,
    processes=2,
    memory="5GB",
    shebang="#!/usr/bin/env bash",
    queue=queue,
    scheduler_options={"dashboard_address": ":" + str(port)},
    walltime="00:30:00",
    local_directory="/tmp",
    death_timeout="30m",
    log_directory=f'{os.environ["HOME"]}/dask_jobqueue_logs',
    interface="ib0",
    project=project,
    python="/p/home/jusers/elshambakey1/juwels/jupyter/kernels/dg_rr_analytics/bin/python",
    nanny=False,
    job_extra=['--gres gpu:4'],
)

client = Client(cluster)
ts = [dask.delayed(test_numba_cuda)()]
res = client.compute(ts)
res[0].result()
I had to set nanny=False because otherwise I get an error about daemonized tasks that cannot have children. I found a similar problem at https://github.com/dask/distributed/issues/2142, which is why I set nanny=False. This worked fine with n_workers=1 and processes=1, but when I set both n_workers=2 and processes=2, it fails with the following error:
distributed.dask_worker - ERROR - Failed to launch worker. You cannot use the --no-nanny argument when n_workers > 1
I wonder how to solve this problem.
You can get past the error
distributed.dask_worker - ERROR - Failed to launch worker. You cannot use the --no-nanny argument when n_workers > 1
by keeping the nanny and asking it not to daemonize the worker processes. Just add the following to the SLURMCluster constructor call:
job_script_prologue=["export DASK_DISTRIBUTED__WORKER__DAEMON=False"],
nanny=True
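
For reference, a sketch of the relevant part of the constructor from the question with this change applied (every other argument stays as it was):

cluster = SLURMCluster(
    n_workers=2,
    cores=1,
    processes=2,
    memory="5GB",
    # ... all other arguments from the question unchanged ...
    nanny=True,
    job_script_prologue=["export DASK_DISTRIBUTED__WORKER__DAEMON=False"],
)

If your dask-jobqueue version does not recognize job_script_prologue, the older name for this argument was env_extra.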

how to set up proper printout destination for dask multiprocessing in jupyter notebook on linux

I am using Dask in a Jupyter notebook on a Linux server to run Python functions on multiple CPUs. The Python functions contain standard print statements. I would like the printed output to be shown in the Jupyter notebook right below the cell, but the output all ends up in the console instead. Can anyone explain why this happens, and how to make the output of the functions appear in the notebook, or in both the console and the notebook?
The following is a simplified version of the problem:
import dask
import functools
from dask import compute, delayed

iter_list = [0, 1]

def iFunc(item):
    print('Meme', item)
    # calling this function directly prints to the notebook
    # below the cell, as desired

with dask.config.set(scheduler='processes', num_workers=2):
    func1 = functools.partial(iFunc)
    ret = compute([delayed(func1)(item) for item in iter_list])
    # surprisingly, "Meme 0" and "Meme 1" only print to the console,
    # not the notebook. Not desired, hard to debug. Any clue?
The whole point of dask is leveraging multiple threads, processes, or nodes/machines to distribute work. The workers you create are therefore not on the same thread as your client, and may not be in the same process, or even on the same machine (or even in the same country) as your client, depending on how you set up your cluster.
If you start a LocalCluster from your jupyter notebook, whether you're using threads or processes, you should see printed output appearing as output in the cells which execute jobs on the workers:
In [1]: import dask.distributed as dd

In [2]: client = dd.Client(n_workers=4)  # four worker processes

In [3]: def job():
   ...:     print("hello from a worker!")
   ...:

In [4]: client.submit(job).result()
hello from a worker!
However, if a different process is spinning up your workers, it is up to that process to decide how to handle stdout. So if you're spinning up workers using the jupyterlab terminal, stdout will appear there. If you're spinning up workers in a kubernetes pod, stdout will appear in the worker logs. Dask doesn't actively manage standard out, so it's up to you to handle this. Note that this also applies to logging - neither stdout nor logs are captured by dask. This is actually a really important design feature - many distributed systems have their own systems for managing the standard out & logging of nodes, and dask does not want to impose its own parallel/conflicting system for handling output. The main focus of dask is executing the tasks, not managing a distributed logging system.
That said, dask does have the infrastructure for passing around messages, and this is something the package could support. There is an open issue and pull request attempting to add this ability as a feature, but it looks like there are a lot of open design questions that would need to be resolved before this could be added. Many of them revolve around the issues I raised above - how to add a clean distributed logging feature without overburdening the scheduler, complicating the already complex set of configuration options, or overriding the important, existing logging systems users rely on. The dask core team seems to agree that this is a good idea, if the tough design questions can be resolved.
You certainly always have the option of returning messages. For example, the following would work:
In [9]: import time

In [10]: def job():
    ...:     return_blob = {"diagnostics": {}, "messages": [], "return_val": None}
    ...:     start = time.time()
    ...:     return_blob["diagnostics"]["start"] = start
    ...:
    ...:     try:
    ...:         return_blob["messages"].append("raising error")
    ...:         # this causes a ZeroDivisionError
    ...:         return_blob["return_val"] = 1 / 0
    ...:     except Exception as e:
    ...:         return_blob["diagnostics"]["error"] = e
    ...:
    ...:     return_blob["diagnostics"]["end"] = time.time()
    ...:     return return_blob
    ...:

In [11]: client.submit(job).result()
Out[11]:
{'diagnostics': {'start': 1644091274.738912,
  'error': ZeroDivisionError('division by zero'),
  'end': 1644091274.7389162},
 'messages': ['raising error'],
 'return_val': None}
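
As a follow-up sketch of how the client side might consume these blobs (the loop and key names simply mirror the example above), you can gather the results of several such jobs and inspect the collected messages and errors:

# Submit several jobs and collect their return blobs on the client.
futures = [client.submit(job, pure=False) for _ in range(4)]
results = client.gather(futures)

for i, blob in enumerate(results):
    elapsed = blob["diagnostics"]["end"] - blob["diagnostics"]["start"]
    print(f"task {i}: messages={blob['messages']}, took {elapsed:.6f}s")
    if "error" in blob["diagnostics"]:
        print(f"task {i} raised: {blob['diagnostics']['error']!r}")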

DisabledBackend: Erratic Behavior with Celery, Redis & Flask

I've been using Celery for a while now. In production I use RabbitMQ as the broker and Redis as the backend in a K8s cluster with no problems so far. Locally, I run a Docker Compose setup with a few services (Flask API, 2 different workers, Beat, Redis, Flower, Hasura), using Redis as both the broker and the backend.
I haven't experienced problems with this setup for the past months, but yesterday I started getting erratic behavior while accessing task results.
Tasks are sent to the queue, the worker picks them up and performs them, but when querying for the task state I sometimes get DisabledBackend. It usually happens on the first request, and then it works. I couldn't find a pattern of when it works and when it doesn't; it's erratic.
I've read somewhere that Celery doesn't work very well with Flask's built-in server, so I switched to uWSGI with pretty much the same setup I have in production:
[uwsgi]
wsgi-file = app/uwsgi.py
callable = application
http = :8080
processes = 4
threads = 2
master = true
chmod-socket = 660
vacuum = true
die-on-term = true
buffer-size = 32768
enable-threads = true
req-logger = python:uwsgi
I've seen a similar question for Django in which the problem seemed to be mod_wsgi with Apache, which is not my case, but the behavior seems similar. Every other question I've seen was related to a misconfigured backend, which is also not my case.
Any ideas on what might be causing this?
Thanks.
So it seems that I need to access AsyncResult only via my Celery app instance, instead of importing it directly from celery.result, or pass the Celery app instance as an argument.
So, this doesn't work:
from celery.result import AsyncResult

@app.route('/status/<task_id>')
def get_status(task_id):
    task = AsyncResult(task_id)
    return task.state
This works:
from app import my_celery  # your own Celery application instance

@app.route('/status/<task_id>')
def get_status(task_id):
    task = my_celery.AsyncResult(task_id)
    return task.state
This also works:
from app import my_celery
from celery.result import AsyncResult

@app.route('/status/<task_id>')
def get_status(task_id):
    task = AsyncResult(task_id, app=my_celery)
    return task.state
I'm guessing what happens is that by instantiating AsyncResult directly from celery.result, it doesn't pick up my app's configuration, so it thinks there is no backend configured to query results from.
That would only explain a consistent failure, though, not the erratic behavior. My guess is that it depends on the thread handling the request and on whether the app instance happens to have been imported in that context so Celery can find it, but I'm not too sure.
I've run a couple of tests, and it seems to be working fine again after changing how AsyncResult is obtained, but I'll keep digging.
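
For reference, a minimal sketch of what the bound app instance might look like (the module layout and Redis URLs are placeholders from my setup, not anything prescribed by Celery). The point is that the result backend is configured on this instance, which is why results have to be queried through it:

# app.py (placeholder module layout)
from celery import Celery

my_celery = Celery(
    "app",
    broker="redis://redis:6379/0",   # placeholder URLs
    backend="redis://redis:6379/1",
)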

Beam/Dataflow: No session file found: /var/opt/google/dataflow/pickled_main_session

When using Apache Beam (GCP Dataflow) I see the following warning in worker logs:
No session file found: /var/opt/google/dataflow/pickled_main_session.
Functions defined in __main__ (interactive session) may fail.
My Dataflow job seems to be fine regardless, but I'm wondering what this warning is all about.
I have seen the following in some sample code (which I am NOT currently doing):
pipeline_options.view_as(SetupOptions).save_main_session = True
where pipeline_options is the main way of specifying options for the Beam/Dataflow pipeline, as in the following later in the code:
with beam.Pipeline(options=pipeline_options) as p:
# actual pipeline code here
I am curious whether the two are related. Does the presence of the warning mean I should always be saving the main session, or are these two things unrelated?
You should be able to safely ignore this warning. No need to set save_main_session if it's not required for your pipeline.
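
For context, save_main_session matters when code that runs on the workers refers to names defined at module level in __main__ (imports, globals, helper functions). A small sketch of the kind of pipeline where you would want it (the numbers and step labels are placeholders):

import math  # imported at module level in __main__, used inside the lambda below

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

def run():
    pipeline_options = PipelineOptions()
    # Pickle the __main__ session so module-level names like `math`
    # are available on the Dataflow workers.
    pipeline_options.view_as(SetupOptions).save_main_session = True

    with beam.Pipeline(options=pipeline_options) as p:
        (
            p
            | "Create" >> beam.Create([1.0, 2.0, 3.0])
            | "Sqrt" >> beam.Map(lambda x: math.sqrt(x))  # refers to a __main__ global
        )

if __name__ == "__main__":
    run()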

Dask with tls connection can not end the program with to_parquet method

I am using dask to process 10 files, each of which is about 142MB in size. I built a method with the delayed decorator; the following is an example:
import os

import dask
import pandas as pd

@dask.delayed
def process_one_file(input_file_path, save_path):
    res = []
    for line in open(input_file_path):
        res.append(line)
    df = pd.DataFrame(res)
    df.to_parquet(save_path + os.path.basename(input_file_path))

if __name__ == '__main__':
    client = ClusterClient()  # my own helper that creates a TLS-enabled client
    input_dir = ""
    save_dir = ""
    print("start to process")
    csvs = [process_one_file(input_dir + filename, save_dir)
            for filename in os.listdir(input_dir)]
    dask.compute(csvs)
However, dask does not always run successfully. After processing all files, the program often hangs.
I run the program from the command line, and it often hangs after printing start to process. I know the work itself completes, since I can see all the output files after a while.
But the program never exits. If I disable TLS, the program runs to completion.
It is strange that the program cannot finish when I enable the TLS connection. How can I solve this?
I also found that if I include the to_parquet call, the program cannot stop, while if I remove it, the program finishes successfully.
I have found the problem. I had set a 10GB memory limit for each process, i.e. memory-limit=10GB. In total I had 2 workers, each with 2 processes, and each process with 2 threads.
That means each machine runs 4 processes, which together claim 40GB, but my machine only has 32GB. If I lower the memory limit, the program runs successfully!
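
As a small sketch of the budget arithmetic (the numbers come from the answer above; how you actually pass the lower memory-limit depends on how your cluster is launched):

# Memory budget per machine.
workers = 2
processes_per_worker = 2
ram_gb = 32

n_processes = workers * processes_per_worker                      # 4 processes per machine
print(n_processes * 10, "GB requested with memory-limit=10GB")    # 40 GB -> exceeds 32 GB
print(n_processes * 7, "GB requested with memory-limit=7GB")      # 28 GB -> fits
print("largest whole-GB limit that fits:", ram_gb // n_processes, "GB")  # 8 GB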
