What is the root cause of distributed.scheduler.KilledWorker exception? - dask

I'm trying to run a Dask job on a YARN cluster. The job reads from and writes to HDFS using the hdfs3 library.
When I run it on a cluster without a Kerberos security layer, it runs fine.
On a cluster with a Kerberos security layer, however, I had to implement the solution here to avoid Kerberos-related errors. Running the same job then led to the following error:
File "/fsstreamdevl/f6_development/acoustics/acoustics_analysis_dask/acoustics_analytics/task_runner/task_runner.py", line 123, in run
dask.compute(tasks)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/dask/base.py", line 446, in compute
results = schedule(dsk, keys, **kwargs)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/client.py", line 2568, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/client.py", line 1822, in gather
asynchronous=asynchronous,
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/client.py", line 753, in sync
return sync(self.loop, func, *args, **kwargs)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/utils.py", line 331, in sync
six.reraise(*error[0])
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/utils.py", line 316, in f
result[0] = yield future
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/client.py", line 1653, in _gather
six.reraise(type(exception), exception, traceback)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
distributed.scheduler.KilledWorker: ('__call__-6af7aa29-2a09-45f3-a5e2-207c06562672', <Worker 'tcp://10.194.211.132:11927', memory: 0, processing: 1>)
Strangely enough, when I run the same solution on the first cluster (the one without a Kerberos security layer), I get the same error.
Looking at the YARN application logs, I see the following traceback, but cannot tell what it means.
distributed.nanny - INFO - Closing Nanny at 'tcp://10.194.211.133:17659'
Traceback (most recent call last):
File "/opt/hadoop/data/05/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_171773/container_e47_1560931326013_171773_01_000003/environment/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/opt/hadoop/data/05/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_171773/container_e47_1560931326013_171773_01_000003/environment/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/opt/hadoop/data/05/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_171773/container_e47_1560931326013_171773_01_000003/environment/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/opt/hadoop/data/05/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_171773/container_e47_1560931326013_171773_01_000003/environment/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
End of LogType:dask.worker.log
I do not see any explicit messages in the logs about low memory. Would anyone know how to diagnose this issue?

hdfs3 is no longer actively maintained. You have two main choices for interacting with HDFS:
pyarrow's HDFS driver (via the libhdfs JNI library), which requires Java and Hadoop to be correctly set up and available to the session calling it
WebHDFS, such as the implementation in fsspec, which does not need Java libraries and can work with Kerberos if HTTP authentication is allowed on your system.
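As a rough illustration of the WebHDFS option, the sketch below opens an fsspec filesystem with Kerberos enabled. The helper names, the example host, and the default port 50070 are assumptions for illustration (newer Hadoop releases often serve WebHDFS on a different port), and Kerberos support in fsspec needs the requests-kerberos package plus a valid ticket.

```python
def webhdfs_options(host, port=50070, use_kerberos=True):
    """Build keyword arguments for fsspec's WebHDFS filesystem.

    The port default is an assumption; use your NameNode's HTTP port.
    """
    return {"host": host, "port": port, "kerberos": use_kerberos}


def open_webhdfs(host, port=50070, use_kerberos=True):
    """Return an fsspec WebHDFS filesystem for the given NameNode."""
    # Imported lazily so this module loads even where fsspec is not installed.
    import fsspec

    return fsspec.filesystem("webhdfs", **webhdfs_options(host, port, use_kerberos))
```

With a hypothetical NameNode address, usage would look like fs = open_webhdfs("namenode.example.com") followed by fs.ls("/user"). Each Dask worker would need to create its own filesystem object inside the task, with credentials available in the worker's environment.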

Related

Why does my Python Dataflow job get stuck at the Write phase?

I wrote a Python Dataflow job that managed to process 300 files. Unfortunately, when I try to run it on 400 files, it gets stuck in the Write phase forever.
The logs aren't really helpful, but I think the issue comes from the writing logic of the code. Initially, I only wanted one output file, so I wrote:
| 'Write' >> beam.io.WriteToText(
    known_args.output,
    file_name_suffix=".json",
    num_shards=1,
    shard_name_template=""
))
Then I removed num_shards=1 and shard_name_template="" and was able to process more files, but the job would still get stuck.
Extra Information
The files to process are small, less than 1 MB each.
Also, when removing the num_shards and shard_name_template fields, I noticed that the data was written to a temporary folder in the target path, but the job never finished.
I get the following DEADLINE_EXCEEDED exception. I tried to solve it by increasing --num_workers to 6 and --disk_size_gb to 30, but that did not help.
Error message from worker:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 638, in do_work
work_executor.execute()
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
op.start()
File "dataflow_worker/shuffle_operations.py", line 63, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "dataflow_worker/shuffle_operations.py", line 64, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "dataflow_worker/shuffle_operations.py", line 79, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "dataflow_worker/shuffle_operations.py", line 80, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "dataflow_worker/shuffle_operations.py", line 82, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/shuffle.py", line 441, in __iter__
for entry in entries_iterator:
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/shuffle.py", line 282, in __next__
return next(self.iterator)
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/shuffle.py", line 240, in __iter__
chunk, next_position = self.reader.Read(start_position, end_position)
File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 133, in shuffle_client.PyShuffleReader.Read
OSError: Shuffle read failed: b'DEADLINE_EXCEEDED: (g)RPC timed out when extract-fields-three-mont-10090801-dlaj-harness-fj4v talking to extract-fields-three-mont-10090801-dlaj-harness-1f7r:12346. Server unresponsive (ping error: Deadline Exceeded, {"created":"@1602260204.931126454","description":"Deadline Exceeded","file":"third_party/grpc/src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}). Typically one can self manage this issue, please read: https://cloud.google.com/dataflow/docs/guides/common-errors#tsg-rpc-timeout'
Can you please recommend ways to troubleshoot this type of issue?
After first trying to add more resources, I got it working by enabling the Dataflow Shuffle service. Please see resource
Just add --experiments=shuffle_mode=service to your PipelineOptions.
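A minimal sketch of how that flag could be added programmatically, assuming the pipeline builds its PipelineOptions from a list of command-line arguments. The helper names with_shuffle_service and build_options are hypothetical; only the flag itself comes from the answer above.

```python
SHUFFLE_FLAG = "--experiments=shuffle_mode=service"


def with_shuffle_service(argv):
    """Return a copy of argv with the shuffle-service experiment added once."""
    args = list(argv)
    if SHUFFLE_FLAG not in args:
        args.append(SHUFFLE_FLAG)
    return args


def build_options(argv):
    """Create Beam PipelineOptions from argv plus the shuffle-service flag."""
    # Imported lazily so the helper above works without Apache Beam installed.
    from apache_beam.options.pipeline_options import PipelineOptions

    return PipelineOptions(with_shuffle_service(argv))
```

The resulting options object would then be passed to beam.Pipeline(options=...) in the usual way.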

Celery with redis: instance state changed (master -> replica?)

I am using Celery for scheduled tasks and a Redis server for data backup, both within Docker containers. My jobs run correctly some of the time, but I randomly get the following error, after which the celery beat task can no longer make progress.
[2020-09-16 21:01:07,863: CRITICAL/MainProcess] Unrecoverable error: ResponseError('UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/worker/worker.py", line 205, in start
self.blueprint.start(self)
File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 599, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python3.6/site-packages/celery/worker/loops.py", line 83, in asynloop
next(loop)
File "/usr/local/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 364, in create_loop
cb(*cbargs)
File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 1088, in on_readable
self.cycle.on_readable(fileno)
File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 359, in on_readable
chan.handlers[type]()
File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 739, in _brpop_read
**options)
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 892, in parse_response
response = connection.read_response()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 752, in read_response
raise response
redis.exceptions.ResponseError: UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)
Any help will be appreciated. Let me know in case you need more details.
As I stated above, the issue happens randomly and disrupts our app in production, so I decided to spend time on a solution. I came across many suggestions, such as hardware issues (memory or CPU), but this one definitively solved the issue: I was not using authentication on the Redis server. Those interested in setting a Redis password easily in Docker can refer to this Docker Tip. After setting a password on Redis, the URL looks like REDIS_URL=redis://user:myPass@localhost:6379
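If the password contains characters that are special in URLs, building the connection string by hand is error-prone. Here is a small sketch, using only the standard library, that assembles a URL of the form redis://user:password@host:port/db. The function name redis_url and the default database number 0 are assumptions, and it relies on the client decoding percent-escapes, which is worth checking for the Redis client version in use.

```python
from urllib.parse import quote


def redis_url(password, host="localhost", port=6379, db=0, user=""):
    """Build a Redis URL with percent-encoded credentials.

    An empty user yields the password-only form redis://:password@host:port/db.
    """
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"redis://{userinfo}@{host}:{port}/{db}"
```

The result can then be exported as REDIS_URL for the Celery containers.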
You can try this answer: https://stackoverflow.com/a/74141982/1635525
TL;DR: Adding restart: unless-stopped to your docker-compose file helps the service recover from Celery crashes, including those caused by Redis downtime or maintenance.

Dataflow jobs fail with: Shuffle close failed: FAILED_PRECONDITION: Precondition check failed

My Dataflow jobs fail with the following error:
INFO:root:2018-10-15T18:55:37.417Z: JOB_MESSAGE_ERROR: Workflow failed.
Causes: S17:fold2/Write/WriteImpl/WindowInto(WindowIntoFn)+write instances fold2/Write/WriteImpl/GroupByKey/Reify+write instances fold2/Write/WriteImpl/GroupByKey/Write failed.,
A work item was attempted 4 times without success.
Each time the worker eventually lost contact with the service. The work item was attempted on:
yuri-nine-gag-recommender-10151140-3kmq-harness-mdgd,
yuri-nine-gag-recommender-10151140-3kmq-harness-mdgd,
yuri-nine-gag-recommender-10151140-3kmq-harness-41dd,
yuri-nine-gag-recommender-10151140-3kmq-harness-mdgd
Digging into the logs shows only one error:
An exception was raised when trying to execute the workitem 6479210647275353150 :
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 158, in execute
op.finish()
File "dataflow_worker/shuffle_operations.py", line 144, in dataflow_worker.shuffle_operations.ShuffleWriteOperation.finish
def finish(self):
File "dataflow_worker/shuffle_operations.py", line 145, in dataflow_worker.shuffle_operations.ShuffleWriteOperation.finish
with self.scoped_finish_state:
File "dataflow_worker/shuffle_operations.py", line 147, in dataflow_worker.shuffle_operations.ShuffleWriteOperation.finish
self.writer.__exit__(None, None, None)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 599, in __exit__
self.writer.Close()
File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 202, in shuffle_client.PyShuffleWriter.Close
IOError: Shuffle close failed: FAILED_PRECONDITION: Precondition check failed.
Any ideas?
I finally figured out the problem by removing various pieces of code, printing tons of logs, and running the job again. It turned out that I had a regular expression that blew up on one particular entry. Unfortunately, the Dataflow logs were not helpful at all.
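One way to surface such an entry sooner is to wrap the per-record step so that a failure is logged together with the offending input instead of only showing up as a shuffle error. The sketch below is an illustration, not the code from the job: PATTERN, parse_entry, and parse_all are hypothetical names, and it only catches exceptions, so a regular expression that hangs through excessive backtracking would need a different safeguard.

```python
import logging
import re

log = logging.getLogger(__name__)

# Hypothetical pattern standing in for the real one.
PATTERN = re.compile(r"(\d+)-(\d+)")


def parse_entry(entry):
    """Hypothetical per-record step; raises if the entry does not match."""
    match = PATTERN.search(entry)
    if match is None:
        raise ValueError("no match")
    return int(match.group(1)), int(match.group(2))


def parse_all(entries):
    """Parse every entry, collecting failures instead of aborting the batch."""
    parsed, failed = [], []
    for entry in entries:
        try:
            parsed.append(parse_entry(entry))
        except Exception:
            # Log the input itself, truncated so one huge record
            # cannot flood the log.
            log.exception("failed to parse entry: %r", entry[:200])
            failed.append(entry)
    return parsed, failed
```

In a Beam pipeline the same try/except could live inside the DoFn's process method, with failed records sent to a separate output for inspection.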

Cannot start dask cluster over SSH

I'm trying to start a dask cluster over SSH, but I am encountering strange errors like this one:
Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/localuser/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/localuser/miniconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/localuser/miniconda3/lib/python3.6/site-packages/distributed/deploy/ssh.py", line 57, in async_ssh
banner_timeout=20) # Helps prevent timeouts when many concurrent ssh connections are opened.
File "/home/localuser/miniconda3/lib/python3.6/site-packages/paramiko/client.py", line 329, in connect
to_try = list(self._families_and_addresses(hostname, port))
File "/home/localuser/miniconda3/lib/python3.6/site-packages/paramiko/client.py", line 200, in _families_and_addresses
hostname, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
File "/home/localuser/miniconda3/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
I'm starting the cluster like this:
$ dask-ssh --ssh-private-key ~/.ssh/cluster_id_rsa \
--hostfile ~/dask-hosts.txt \
--remote-python "~/miniconda3/bin/python3.6"
My dask-hosts.txt looks like this:
localuser@127.0.0.1
remoteuser@10.10.4.200
...
remoteuser@10.10.4.207
I get the same error with/without the localhost line.
I have checked the SSH setup: I can log in to all the nodes using a public key setup (the key is unencrypted, to avoid decryption prompts). What am I missing?
The error indicates that name resolution is the culprit. Most likely this is happening because of the inclusion of usernames in your dask-hosts.txt. According to its documentation, the host file should contain only hostnames/IP addresses:
--hostfile PATH  Textfile with hostnames/IP addresses
You can use --ssh-username to set a username (although only a single one).
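To convert an existing host file, a short script can drop the username part from entries of the form user@host. This is a minimal sketch; the function name strip_usernames is made up for illustration.

```python
def strip_usernames(lines):
    """Return bare hostnames/IP addresses from lines like 'user@host'."""
    hosts = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        # Keep only what follows the last '@'; entries without one pass through.
        hosts.append(entry.rsplit("@", 1)[-1])
    return hosts
```

The cleaned list can be written to a new file, one host per line, and passed via --hostfile, with the shared login name supplied through --ssh-username.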

Broken Pipe - Cannot connect to openERP 6.0.4 server using port 8070

I have an issue whereby no client can connect to the OpenERP 6.0.4 server on port 8070.
It happens once in a while (every 4-6 months). I wonder what the problem is: I checked the network traffic, processor, and memory of the server and found nothing wrong, yet it has happened a few times.
When I check the server logs, the errors are the same each time I hit this issue, as below:
[2013-04-23 12:33:53,258][Server] ERROR:web-services:netrpc: cannot deliver exception message to client
Traceback (most recent call last):
File "/opt/openerp/server/bin/service/netrpc_server.py", line 89, in run
ts.mysend(e, exception=True, traceback=tb_s)
File "/opt/openerp/server/bin/tiny_socket.py", line 64, in mysend
self.sock.sendall('%8d%s%s' % (len(msg), exception and "1" or "0", msg))
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 32] Broken pipe
[2013-04-23 13:45:56,273][Server] ERROR:http:Could not run do_POST
Traceback (most recent call last):
File "/opt/openerp/server/bin/service/websrv_lib.py", line 299, in _handle_one_foreign
method()
File "/usr/lib/python2.7/SimpleXMLRPCServer.py", line 519, in do_POST
self.send_response(200)
File "/usr/lib/python2.7/BaseHTTPServer.py", line 396, in send_response
(self.protocol_version, code, message))
File "/usr/lib/python2.7/socket.py", line 324, in write
self.flush()
File "/usr/lib/python2.7/socket.py", line 303, in flush
self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 104] Connection reset by peer
[2013-04-23 13:45:56,647][Server] ERROR:http:code 500, message Internal error
[2013-04-23 13:45:56,650][Server] ERROR:init:Server error in request from ('192.168.0.132', 1880):
Traceback (most recent call last):
File "/opt/openerp/server/bin/service/websrv_lib.py", line 528, in _handle_request2
self.process_request(request, client_address)
File "/usr/lib/python2.7/SocketServer.py", line 310, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python2.7/SocketServer.py", line 323, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/opt/openerp/server/bin/service/websrv_lib.py", line 246, in __init__
SocketServer.StreamRequestHandler.__init__(self,request,client_address,server)
File "/usr/lib/python2.7/SocketServer.py", line 641, in __init__
self.finish()
File "/usr/lib/python2.7/SocketServer.py", line 694, in finish
self.wfile.flush()
File "/usr/lib/python2.7/socket.py", line 303, in flush
self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
Can anyone help me with this?
A broken pipe is a typical socket-level error: the server tried to write to a connection that the client had already closed. It may occur if the connection from the internet to the server is too slow.
I suggest using an Apache proxy to make the local server available from the internet: map the local server LOCALHOST:8069 to www.wxample.net:9000 using a VirtualHost setting in Apache. It may work for you.
For more information, have a look at this link:
https://bugs.launchpad.net/openerp-web/+bug/927793
