I am trying to use the dask.distributed.SLURMCluster to submit batch jobs to a SLURM job scheduler on a supercomputing cluster. The jobs all submit as expected, but after about a minute of running they fail with asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds. How do I get the nanny to connect?
Full Trace:
distributed.nanny - INFO - Start Nanny at: 'tcp://206.76.203.125:38324'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.worker - INFO - Start worker at: tcp://206.76.203.125:37609
distributed.worker - INFO - Listening to: tcp://206.76.203.125:37609
distributed.worker - INFO - dashboard at: 206.76.203.125:35505
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 2.00 GB
distributed.worker - INFO - Local Directory: /home1/06729/tg860286/tests/dask-rsmas-presentation/dask-worker-space/worker-pu937jui
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.worker - INFO - Waiting to connect to: tcp://129.114.63.43:35489
distributed.nanny - INFO - Closing Nanny at 'tcp://206.76.203.125:38324'
distributed.worker - INFO - Stopping worker at tcp://206.76.203.125:37609
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/node.py", line 173, in wait_for
await asyncio.wait_for(future, timeout=timeout)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 440, in <module>
go()
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 436, in go
main()
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 422, in main
loop.run_sync(run)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
return future_cell[0].result()
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 416, in run
await asyncio.gather(*nannies)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/node.py", line 176, in wait_for
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds
It looks like your workers weren't able to connect to the scheduler. My guess is that you need to specify a network interface. You should ask your system administrator which network interface you should use, and then specify that with the interface= keyword.
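For example, a minimal sketch using dask_jobqueue's SLURMCluster (the resource numbers and the "ib0" interface name below are placeholders; substitute whatever your admins recommend, e.g. ib0, eth0, eno1):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    cores=8,
    memory="2GB",
    interface="ib0",   # placeholder: the network interface your admins recommend
)
cluster.scale(2)       # ask SLURM for two workers
client = Client(cluster)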
You might also want to read through https://blog.dask.org/2019/08/28/dask-on-summit , which gives a case study of common problems that arise.
The problem:
I use Appium + Robot Framework to test my app. When I use the keyword Open Application, it always fails with No application is open, even though the app is actually already open. I started the Appium server with: appium -p 4723 --session-override --no-reset.
Environment:
info AppiumDoctor ### Diagnostic for necessary dependencies starting ###
info AppiumDoctor ✔ The Node.js binary was found at: C:\Program Files\nodejs\node.EXE
info AppiumDoctor ✔ Node version is 16.15.1
info AppiumDoctor ✔ ANDROID_HOME is set to: D:\Android_Sdk
info AppiumDoctor ✔ JAVA_HOME is set to: C:\Program Files\Java\jdk1.8.0_60
info AppiumDoctor Checking adb, android, emulator
info AppiumDoctor 'adb' is in D:\Android_Sdk\platform-tools\adb.exe
info AppiumDoctor 'android' is in D:\Android_Sdk\tools\android.bat
info AppiumDoctor 'emulator' is in D:\Android_Sdk\emulator\emulator.exe
info AppiumDoctor ✔ adb, android, emulator exist: D:\Android_Sdk
info AppiumDoctor ✔ 'bin' subfolder exists under 'C:\Program Files\Java\jdk1.8.0_60'
info AppiumDoctor ### Diagnostic for necessary dependencies completed, no fix needed. ###
Log:
In Robot Framework I ran the test in debug mode; here is some of the output:
20220802 18:05:05.399 : DEBUG : Starting new HTTP connection (1): 127.0.0.1:4723
20220802 18:05:14.770 : DEBUG : http://127.0.0.1:4723 "POST /wd/hub/session HTTP/1.1" 200 884
20220802 18:05:14.771 : DEBUG : Remote response: status=200 | data={"value":{"capabilities":{"platform":"LINUX","webStorageEnabled":false,"takesScreenshot":true,"javascriptEnabled":true,"databaseEnabled":false,"networkConnectionEnabled":true,"locationContextEnabled":false,"warnings":{},"desired":{"platformName":"Android","appPackage":"com.cmcc.myhouse.demo","appActivity":"com.cmcc.myhouse.MainActivity","appWaitDuration":60000,"noSign":true},"platformName":"Android","appPackage":"com.cmcc.myhouse.demo","appActivity":"com.cmcc.myhouse.MainActivity","appWaitDuration":60000,"noSign":true,"deviceName":"ed192f0","deviceUDID":"ed192f0","deviceApiLevel":29,"platformVersion":"10","deviceScreenSize":"1080x2160","deviceScreenDensity":380,"deviceModel":"ONEPLUS A5010","deviceManufacturer":"OnePlus","pixelRatio":2.375,"statBarHeight":57,"viewportRect":{"left":0,"top":57,"width":1080,"height":2103}},"sessionId":"312366fe-1008-47f4-9063-1cf0e4a27e0c"}} | headers=HTTPHeaderDict({'X-Powered-By': 'Express', 'Vary': 'X-HTTP-Method-Override', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '884', 'ETag': 'W/"374-cX9IxtSKVtVV/oMPHrqcO0PP2Yg"', 'Date': 'Tue, 02 Aug 2022 10:05:14 GMT', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=600'})
20220802 18:05:14.771 : DEBUG : Finished Request
20220802 18:05:14.774 : FAIL : No application is open
20220802 18:05:14.776 : DEBUG :
Traceback (most recent call last):
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 16, in _run_on_failure_decorator
return method(*args, **kwargs)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\_applicationmanagement.py", line 52, in open_application
application = webdriver.Remote(str(remote_url), desired_caps)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\appium\webdriver\webdriver.py", line 268, in __init__
AppiumConnection(command_executor, keep_alive=keep_alive), desired_capabilities, browser_profile, proxy
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 275, in __init__
self.start_session(capabilities, browser_profile)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\appium\webdriver\webdriver.py", line 361, in start_session
self.capabilities = response.get('value')
AttributeError: can't set attribute
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 21, in _run_on_failure_decorator
raise err
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 16, in _run_on_failure_decorator
return method(*args, **kwargs)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\_screenshot.py", line 31, in capture_page_screenshot
if hasattr(self._current_application(), 'get_screenshot_as_file'):
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\_applicationmanagement.py", line 367, in _current_application
raise RuntimeError('No application is open')
RuntimeError: No application is open
20220802 18:05:14.779 : WARN : Keyword 'Capture Page Screenshot' could not be run on failure: No application is open
20220802 18:05:14.780 : FAIL : AttributeError: can't set attribute
20220802 18:05:14.780 : DEBUG :
Traceback (most recent call last):
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 21, in _run_on_failure_decorator
raise err
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\keywordgroup.py", line 16, in _run_on_failure_decorator
return method(*args, **kwargs)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\AppiumLibrary\keywords\_applicationmanagement.py", line 52, in open_application
application = webdriver.Remote(str(remote_url), desired_caps)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\appium\webdriver\webdriver.py", line 268, in __init__
AppiumConnection(command_executor, keep_alive=keep_alive), desired_capabilities, browser_profile, proxy
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 275, in __init__
self.start_session(capabilities, browser_profile)
File "c:\users\xiangfang\appdata\local\programs\python\python37\lib\site-packages\appium\webdriver\webdriver.py", line 361, in start_session
self.capabilities = response.get('value')
AttributeError: can't set attribute
Ending test: XiriTest.XiriBusinessTest.26MainBusinessTest.2.6CommonCMD
Appium usually opens the app automatically and then makes the necessary connections, so tests can fail if the app is already open.
Are you sure the appPackage and appActivity are correct? It would be worth double checking these.
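One way to double-check them outside Robot Framework is a small standalone script with the Appium Python client (a sketch only; the capability values are copied from the session response in the log above, and newer client versions may expect an options object rather than a raw dict):

from appium import webdriver

# Capabilities copied from the session response in the log above.
desired_caps = {
    "platformName": "Android",
    "deviceName": "ed192f0",
    "appPackage": "com.cmcc.myhouse.demo",
    "appActivity": "com.cmcc.myhouse.MainActivity",
}

# If this opens the app and prints its current activity, the package/activity
# pair is correct and the failure lies elsewhere.
driver = webdriver.Remote("http://127.0.0.1:4723/wd/hub", desired_caps)
print(driver.current_activity)
driver.quit()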
I'm following this guide on how to set up docker with timescale/wale for continuous archiving.
https://docs.timescale.com/timescaledb/latest/how-to-guides/backup-and-restore/docker-and-wale/#run-the-timescaledb-container-in-docker
Everything runs as expected, but when I get to the final step, I'm seeing:
written to stdout
2022-04-13 07:43:33.349 UTC [27] LOG: redo done at 0/50000F8
Connecting to wale (172.18.0.3:80)
writing to stdout
- 100% |********************************| 36 0:00:00 ETA
written to stdout
Connecting to wale (172.18.0.3:80)
wget: server returned error: HTTP/1.0 500 INTERNAL SERVER ERROR
2022-04-13 07:43:34.264 UTC [27] LOG: selected new timeline ID: 2
2022-04-13 07:43:34.282 UTC [27] LOG: archive recovery complete
Connecting to wale (172.18.0.3:80)
wget: server returned error: HTTP/1.0 500 INTERNAL SERVER ERROR
2022-04-13 07:43:34.838 UTC [27] LOG: could not open file "pg_wal/000000010000000000000006": Permission denied
2022-04-13 07:43:34.844 UTC [1] LOG: database system is ready to accept connections
It looks like the wget to wale is failing. It's connected to the same network as timescaledb_recovered, so shouldn't it work? Is there some additional config that the docs are missing, or am I misreading these logs somehow?
Some additional error output from the wale log:
['wal-e', '--terse', 'wal-push', '/var/lib/postgresql/data/pg_wal/000000010000000000000012']
Pushing wal file /var/lib/postgresql/data/pg_wal/000000010000000000000012: ['wal-e', '--terse', 'wal-push', '/var/lib/postgresql/data/pg_wal/000000010000000000000012']
172.18.0.2 - - [13/Apr/2022 14:09:17] "GET /wal-push/000000010000000000000012 HTTP/1.1" 200 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000011', '/var/lib/postgresql/data/pg_wal/000000010000000000000011']
Fetching wal 000000010000000000000011: ['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000011', '/var/lib/postgresql/data/pg_wal/000000010000000000000011']
172.18.0.4 - - [13/Apr/2022 14:09:53] "GET /wal-fetch/000000010000000000000011 HTTP/1.1" 200 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000012', '/var/lib/postgresql/data/pg_wal/000000010000000000000012']
Fetching wal 000000010000000000000012: ['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000012', '/var/lib/postgresql/data/pg_wal/000000010000000000000012']
172.18.0.4 - - [13/Apr/2022 14:09:54] "GET /wal-fetch/000000010000000000000012 HTTP/1.1" 200 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000011', '/var/lib/postgresql/data/pg_wal/000000010000000000000011']
Fetching wal 000000010000000000000011: ['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000011', '/var/lib/postgresql/data/pg_wal/000000010000000000000011']
172.18.0.4 - - [13/Apr/2022 14:09:54] "GET /wal-fetch/000000010000000000000011 HTTP/1.1" 200 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '00000002.history', '/var/lib/postgresql/data/pg_wal/00000002.history']
Fetching wal 00000002.history: ['wal-e', '--terse', 'wal-fetch', '-p=0', '00000002.history', '/var/lib/postgresql/data/pg_wal/00000002.history']
lzop: short read
wal_e.main CRITICAL MSG: An unprocessed exception has avoided all error handling
DETAIL: Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/wal_e/cmd.py", line 657, in main
args.prefetch)
File "/usr/lib/python3.5/site-packages/wal_e/operator/backup.py", line 353, in wal_restore
self.gpg_key_id is not None)
File "/usr/lib/python3.5/site-packages/wal_e/worker/worker_util.py", line 58, in do_lzop_get
return blobstore.do_lzop_get(creds, url, path, decrypt, do_retry=do_retry)
File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/file_util.py", line 52, in do_lzop_get
raise exc
File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/file_util.py", line 64, in write_and_return_error
key.get_contents_to_file(stream)
File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/calling_format.py", line 53, in get_contents_to_file
with open(self.path, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/backups/wal_005/00000002.history.lzo'
STRUCTURED: time=2022-04-13T14:09:55.216294-00 pid=32
Failed to fetch wal 00000002.history: None
Exception on /wal-fetch/00000002.history [GET]
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/flask/app.py", line 2292, in wsgi_app
response = self.full_dispatch_request()
File "/usr/lib/python3.5/site-packages/flask/app.py", line 1816, in full_dispatch_request
return self.finalize_request(rv)
File "/usr/lib/python3.5/site-packages/flask/app.py", line 1831, in finalize_request
response = self.make_response(rv)
File "/usr/lib/python3.5/site-packages/flask/app.py", line 1957, in make_response
'The view function did not return a valid response. The'
TypeError: The view function did not return a valid response. The function either returned None or ended without a return statement.
172.18.0.4 - - [13/Apr/2022 14:09:55] "GET /wal-fetch/00000002.history HTTP/1.1" 500 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '00000001.history', '/var/lib/postgresql/data/pg_wal/00000001.history']
Fetching wal 00000001.history: ['wal-e', '--terse', 'wal-fetch', '-p=0', '00000001.history', '/var/lib/postgresql/data/pg_wal/00000001.history']
lzop: short read
wal_e.main CRITICAL MSG: An unprocessed exception has avoided all error handling
DETAIL: Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/wal_e/cmd.py", line 657, in main
args.prefetch)
File "/usr/lib/python3.5/site-packages/wal_e/operator/backup.py", line 353, in wal_restore
self.gpg_key_id is not None)
File "/usr/lib/python3.5/site-packages/wal_e/worker/worker_util.py", line 58, in do_lzop_get
return blobstore.do_lzop_get(creds, url, path, decrypt, do_retry=do_retry)
File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/file_util.py", line 52, in do_lzop_get
raise exc
File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/file_util.py", line 64, in write_and_return_error
key.get_contents_to_file(stream)
File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/calling_format.py", line 53, in get_contents_to_file
with open(self.path, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/backups/wal_005/00000001.history.lzo'
STRUCTURED: time=2022-04-13T14:09:55.689548-00 pid=38
Failed to fetch wal 00000001.history: None
Exception on /wal-fetch/00000001.history [GET]
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/flask/app.py", line 2292, in wsgi_app
response = self.full_dispatch_request()
File "/usr/lib/python3.5/site-packages/flask/app.py", line 1816, in full_dispatch_request
return self.finalize_request(rv)
File "/usr/lib/python3.5/site-packages/flask/app.py", line 1831, in finalize_request
response = self.make_response(rv)
File "/usr/lib/python3.5/site-packages/flask/app.py", line 1957, in make_response
'The view function did not return a valid response. The'
TypeError: The view function did not return a valid response. The function either returned None or ended without a return statement.
172.18.0.4 - - [13/Apr/2022 14:09:55] "GET /wal-fetch/00000001.history HTTP/1.1" 500 -
I've added some additional logs from the wale container showing the errors produced when timescaledb-recovered runs. I'm guessing there is some issue with the requests timescaledb-recovered is sending, because wget works fine until that container is started.
This is bizarre, but apparently the critical failure and 500 error are intended: they let Postgres know that no further segments need to be recovered. Incredibly frustrating.
When I run
python -m torch.distributed.run --nproc_per_node=8 --master_addr="127.0.0.1" --master_port=$RANDOM ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py --manual_loads_name l2l_resnet12rfs_cifarfs_adam_cl_80k
I get the error:
====> about to start train loop
Starting training!
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/learn2learn/algorithms/maml.py", line 159, in adapt
gradients = grad(loss,
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/autograd/__init__.py", line 226, in grad
return Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 6; 39.59 GiB total capacity; 265.23 MiB already allocated; 10.19 MiB free; 282.00 MiB reserved in total by PyTorch)
learn2learn: Maybe try with allow_nograd=True and/or allow_unused=True ?
Traceback (most recent call last):
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 216, in <module>
main()
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 161, in main
train(args=args)
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 196, in train
meta_train_iterations_ala_l2l(args, args.agent, args.opt, args.scheduler)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/training/meta_training.py", line 149, in meta_train_iterations_ala_l2l
train_loss, train_loss_std, train_acc, train_acc_std = meta_learner(task_dataset, call_backward=True)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/meta_learners/maml_meta_learner.py", line 371, in forward
meta_loss, meta_loss_std, meta_acc, meta_acc_std = forward(meta_learner=self,
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/meta_learners/maml_meta_learner.py", line 312, in forward
loss, acc = fast_adapt(
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/meta_learners/maml_meta_learner.py", line 266, in fast_adapt
learner.adapt(adaptation_error)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/learn2learn/algorithms/maml.py", line 169, in adapt
self.module = maml_update(self.module, self.lr, gradients)
UnboundLocalError: local variable 'gradients' referenced before assignment
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at ../aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f5036ae4a22 in /home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x132 (0x7f50dbec70e2 in /home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f50dbec8d40 in /home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11c (0x7f50dbec975c in /home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0xc71f (0x7f50da8aa71f in /home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x7ea5 (0x7f50e43e6ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f50e410fb0d in /lib64/libc.so.6)
/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 5 (pid: 158275) of binary: /home/miranda9/miniconda3/envs/meta_learning_a100/bin/python
/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 158275 (local_rank 5) FAILED (exitcode -6)
Error msg: Signal 6 (SIGABRT) received by PID 158275
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
**********************************************************************
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/run.py", line 702, in <module>
main()
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
return f(*args, **kwargs)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/run.py", line 698, in main
run(args)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
****************************************************************************************************************************************
/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py FAILED
========================================================================================================================================
Root Cause:
[0]:
time: 2022-02-04_20:26:43
rank: 5 (local_rank: 5)
exitcode: -6 (pid: 158275)
error_file: <N/A>
msg: "Signal 6 (SIGABRT) received by PID 158275"
========================================================================================================================================
Other Failures:
[1]:
time: 2022-02-04_20:26:43
rank: 6 (local_rank: 6)
exitcode: 1 (pid: 158280)
error_file: <N/A>
msg: "Process failed with exitcode 1"
****************************************************************************************************************************************
and different variations of the way I set up the script don't work either.
In summary, how is one supposed to use torch.distributed.run so that TWO or MORE scripts can run at once and they are all distributed?
Related:
main discussion for a solution here: https://github.com/learnables/learn2learn/issues/263
https://discuss.pytorch.org/t/multiprocessing-failed-with-torch-distributed-launch-module/33056
base code: https://github.com/learnables/learn2learn/blob/master/examples/vision/distributed_maml.py
I have thousands of parquet files that I need to process. Before processing the files, I'm trying to get various information about the files using the parquet metadata, such as number of rows in each partition, mins, maxs, etc.
I tried reading the metadata using dask.delayed hoping to distribute the metadata gathering tasks across my cluster, but this seems to lead to instability in Dask. See an example code snippet and an error of a node time out below.
Is there a way to read the parquet metadata from Dask? I know Dask's "read_parquet" function has a "gather_statistics" option, which you can set to false to speed up the file reads. But, I don't see a way to access all of the parquet metadata / statistics if it's set to true.
Example code:
import dask
import fastparquet

@dask.delayed
def get_pf(item_to_read):
    # Open only the parquet footer/metadata, not the data itself
    pf = fastparquet.ParquetFile(item_to_read)
    row_groups = pf.row_groups.copy()
    all_stats = pf.statistics.copy()
    col = pf.info['columns'].copy()
    return [row_groups, all_stats, col]

stats_arr = get_pf(item_to_read)  # a Delayed object; evaluate with dask.compute(stats_arr)
Example error:
2019-10-03 01:43:51,202 - INFO - 192.168.0.167 - distributed.worker - ERROR - Worker stream died during communication: tcp://192.168.0.223:34623
2019-10-03 01:43:51,203 - INFO - 192.168.0.167 - Traceback (most recent call last):
2019-10-03 01:43:51,204 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 218, in connect
2019-10-03 01:43:51,206 - INFO - 192.168.0.167 - quiet_exceptions=EnvironmentError,
2019-10-03 01:43:51,207 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,210 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,211 - INFO - 192.168.0.167 - tornado.util.TimeoutError: Timeout
2019-10-03 01:43:51,212 - INFO - 192.168.0.167 -
2019-10-03 01:43:51,213 - INFO - 192.168.0.167 - During handling of the above exception, another exception occurred:
2019-10-03 01:43:51,214 - INFO - 192.168.0.167 -
2019-10-03 01:43:51,215 - INFO - 192.168.0.167 - Traceback (most recent call last):
2019-10-03 01:43:51,217 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/worker.py", line 1841, in gather_dep
2019-10-03 01:43:51,218 - INFO - 192.168.0.167 - self.rpc, deps, worker, who=self.address
2019-10-03 01:43:51,219 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,220 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,222 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,223 - INFO - 192.168.0.167 - yielded = self.gen.throw(*exc_info) # type: ignore
2019-10-03 01:43:51,224 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/worker.py", line 3029, in get_data_from_worker
2019-10-03 01:43:51,225 - INFO - 192.168.0.167 - comm = yield rpc.connect(worker)
2019-10-03 01:43:51,640 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,641 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,643 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,644 - INFO - 192.168.0.167 - yielded = self.gen.throw(*exc_info) # type: ignore
2019-10-03 01:43:51,645 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/core.py", line 866, in connect
2019-10-03 01:43:51,646 - INFO - 192.168.0.167 - connection_args=self.connection_args,
2019-10-03 01:43:51,647 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,649 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,650 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,651 - INFO - 192.168.0.167 - yielded = self.gen.throw(*exc_info) # type: ignore
2019-10-03 01:43:51,652 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 230, in connect
2019-10-03 01:43:51,653 - INFO - 192.168.0.167 - _raise(error)
2019-10-03 01:43:51,654 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 207, in _raise
2019-10-03 01:43:51,656 - INFO - 192.168.0.167 - raise IOError(msg)
2019-10-03 01:43:51,657 - INFO - 192.168.0.167 - OSError: Timed out trying to connect to 'tcp://192.168.0.223:34623' after 10 s: connect() didn't finish in time
Does dd.read_parquet take a long time? If not, you can follow whatever strategy it uses internally and do the reading in the client.
If the data has a single _metadata file in the root directory, then you can simply open this with fastparquet, which is exactly what Dask would do. It contains all the details of all of the data pieces.
There is no particular reason distributing the metadata reads should be a problem, but you should be aware that in some cases the total metadata items can add up to a substantial size.
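For example, a minimal sketch of reading the dataset-level metadata directly with fastparquet (the path below is a placeholder for your dataset's root _metadata file):

import fastparquet

# Placeholder path: point this at the dataset's root _metadata file.
pf = fastparquet.ParquetFile("/data/my_dataset/_metadata")

row_groups = pf.row_groups       # one entry per row group across all pieces
stats = pf.statistics            # per-row-group min/max/null-count statistics by column
columns = pf.info["columns"]     # column names

print(len(row_groups), columns)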
I'm creating an hourly task in Airflow that schedules a Dataflow job; however, the hook provided by the Airflow library crashes most of the time, even though the Dataflow job actually succeeds.
[2018-05-25 07:05:03,523] {base_task_runner.py:98} INFO - Subtask: [2018-05-25 07:05:03,439] {gcp_dataflow_hook.py:109} WARNING - super(GcsIO, cls).__new__(cls, storage_client))
[2018-05-25 07:05:03,721] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-05-25 07:05:03,725] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/bin/airflow", line 27, in <module>
[2018-05-25 07:05:03,726] {base_task_runner.py:98} INFO - Subtask: args.func(args)
[2018-05-25 07:05:03,729] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-05-25 07:05:03,729] {base_task_runner.py:98} INFO - Subtask: pool=args.pool,
[2018-05-25 07:05:03,731] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-05-25 07:05:03,732] {base_task_runner.py:98} INFO - Subtask: result = func(*args, **kwargs)
[2018-05-25 07:05:03,734] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 1492, in _run_raw_task
[2018-05-25 07:05:03,738] {base_task_runner.py:98} INFO - Subtask: result = task_copy.execute(context=context)
[2018-05-25 07:05:03,740] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/operators/dataflow_operator.py", line 313, in execute
[2018-05-25 07:05:03,746] {base_task_runner.py:98} INFO - Subtask: self.py_file, self.py_options)
[2018-05-25 07:05:03,748] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 188, in start_python_dataflow
[2018-05-25 07:05:03,751] {base_task_runner.py:98} INFO - Subtask: label_formatter)
[2018-05-25 07:05:03,753] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 158, in _start_dataflow
[2018-05-25 07:05:03,756] {base_task_runner.py:98} INFO - Subtask: _Dataflow(cmd).wait_for_done()
[2018-05-25 07:05:03,757] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 129, in wait_for_done
[2018-05-25 07:05:03,759] {base_task_runner.py:98} INFO - Subtask: line = self._line(fd)
[2018-05-25 07:05:03,761] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 110, in _line
[2018-05-25 07:05:03,763] {base_task_runner.py:98} INFO - Subtask: line = lines[-1][:-1]
[2018-05-25 07:05:03,766] {base_task_runner.py:98} INFO - Subtask: IndexError: list index out of range
I looked that file up in the Airflow GitHub repo and the line numbers in the error do not match, which makes me think that the Airflow instance used by Cloud Composer is outdated. Is there any way to update it?
This should be resolved in Airflow 1.10 or 2.0.
Have a look at this PR:
https://github.com/apache/incubator-airflow/pull/3165
It has been merged to master, so in the meantime you can take the PR's code and create your own plugin.
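The kind of defensive change involved looks roughly like the sketch below (not the actual PR code; last_complete_line is a hypothetical stand-in for the hook's _line helper, which the traceback above shows indexing lines[-1] on an empty read):

import io

def last_complete_line(fd):
    """Hypothetical stand-in for the hook's _line helper.

    The traceback above fails on lines[-1][:-1] when nothing was read;
    guarding against an empty read avoids the IndexError.
    """
    lines = fd.readlines()
    if not lines:
        return None
    return lines[-1].rstrip("\n")

# Quick check against an empty stream:
print(last_complete_line(io.StringIO("")))        # None instead of an IndexError
print(last_complete_line(io.StringIO("a\nb\n")))  # "b"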