In a distributed dask cluster on GKE, I have one graph that stalls with the traceback below. The worker dashboard keeps reporting the same constant, high CPU value, while the GKE dashboard shows near-zero CPU for the pod, and the worker's "last seen" value grows to many minutes. After 15 minutes I kill the GKE pod, yet the dask scheduler still reports that the worker exists and keeps the task assigned to it. The scheduler appears to be wedged on this task: no progress is made, and the failing unit of work is never cleaned up or restarted.
I am using dask/distributed 2020.12.0, dask-gateway 0.9.0, and xarray 0.16.2.
What can cause a key to go missing?
How does one debug or work around the underlying issue here?
Edit:
For each run of the same graph, a different key is in the traceback. Wedged workers for a single run show the same key in their tracebacks.
With enough patience/retries, the graph can succeed.
I am using an auto-scaling cluster with pre-emptible nodes, though the problem persists even if I remove auto-scaling and set a fixed number of nodes.
I've seen this on different workloads. The current workload I'm struggling with creates a dask dataframe from two arrays like this:
import dask
import dask.dataframe as dd
import numpy as np

# image and labels are dask arrays; stats and df_meta are defined elsewhere.
image_chunks = image.to_delayed().ravel()
labels_chunks = labels.to_delayed().ravel()
results = []
for image_chunk, labels_chunk in zip(image_chunks, labels_chunks):
    # chunk indices from the key, scaled by the chunk size, give the offsets
    offsets = np.array(image_chunk.key[1:]) * np.array(image.chunksize)
    result = dask.delayed(stats)(image_chunk, labels_chunk, offsets, ...)
    results.append(result)
...
dask_df = dd.from_delayed(results, meta=df_meta)
dask_df = dask_df.groupby(['label', 'kind']).sum()
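Since the edit above notes that the graph eventually succeeds given enough retries, here is a hedged workaround sketch (it assumes the failures are transient and that client is the connected dask-gateway Client): ask the scheduler to resubmit failed tasks automatically.
# Workaround sketch, not a confirmed fix; the retry count is arbitrary.
future = client.compute(dask_df, retries=3)  # resubmit a failing task up to 3 times
result = future.result()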
Example traceback #1
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2627, in execute
self.ensure_communicating()
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1880, in ensure_communicating
to_gather, total_nbytes = self.select_keys_for_gather(
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1985, in select_keys_for_gather
total_bytes = self.tasks[dep].get_nbytes()
KeyError: 'xarray-image-bc4e1224600f3930ab9b691d1009ed0c'
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fac3d04f4f0>>, <Task finished name='Task-143' coro=<Worker.execute() done, defined at /opt/conda/lib/python3.8/site-packages/distributed/worker.py:2524> exception=KeyError('xarray-image-bc4e1224600f3930ab9b691d1009ed0c')>)
Example traceback #2
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2146, in gather_dep
await self.query_who_has(dep.key)
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2218, in query_who_has
self.update_who_has(response)
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2227, in update_who_has
self.tasks[dep].who_has.update(workers)
KeyError: "('rechunk-merge-607c9ba97d3abca4de3981b3de246bf3', 0, 0, 4, 4)"
Related
When I ssh into a particular remote machine and start an IPython session, it crashes whenever I hold down a key (e.g. the backspace key) for about half a second.
The error output is pasted below:
File "/home/zach/local/anaconda3/bin/ipython", line 11, in <module>
sys.exit(start_ipython())
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/IPython/__init__.py", line 125, in start_ipython
return launch_new_instance(argv=argv, **kwargs)
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/IPython/terminal/ipapp.py", line 356, in start
self.shell.mainloop()
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/IPython/terminal/interactiveshell.py", line 498, in mainloop
self.interact()
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/IPython/terminal/interactiveshell.py", line 481, in interact
code = self.prompt_for_code()
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/IPython/terminal/interactiveshell.py", line 410, in prompt_for_code
**self._extra_prompt_options())
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/prompt_toolkit/shortcuts/prompt.py", line 738, in prompt
return run_sync()
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/prompt_toolkit/shortcuts/prompt.py", line 727, in run_sync
return self.app.run(inputhook=self.inputhook, pre_run=pre_run2)
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/prompt_toolkit/application/application.py", line 709, in run
return run()
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/prompt_toolkit/application/application.py", line 682, in run
run_until_complete(f, inputhook=inputhook)
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/prompt_toolkit/eventloop/defaults.py", line 123, in run_until_complete
return get_event_loop().run_until_complete(future, inputhook=inputhook)
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/prompt_toolkit/eventloop/posix.py", line 66, in run_until_complete
self._run_once(inputhook)
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/prompt_toolkit/eventloop/posix.py", line 85, in _run_once
self._inputhook_context.call_inputhook(ready, inputhook)
File "/home/zach/local/anaconda3/lib/python3.7/site-packages/prompt_toolkit/eventloop/inputhook.py", line 78, in call_inputhook
threading.Thread(target=thread).start()
File "/home/zach/local/anaconda3/lib/python3.7/threading.py", line 847, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
If you suspect this is an IPython bug, please report it at:
https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev@python.org
You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.
Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
%config Application.verbose_crash=True
From there it drops me into a broken bash session where my keystrokes do not appear on screen, although I can still execute commands such as ls, man, pwd, ipython, etc. I can only kill the bash session by pressing Ctrl-D followed by Ctrl-C. In particular, following the message's suggestion to run %tb and so forth is not possible.
Other programs are not competing for threads. Looking through the traceback, it appears the event loop may be creating a thread to handle every key press, which eventually fails once no more threads can be allocated. It seems a little far-fetched that this would be the issue, though, since holding a key down is surely expected behavior.
This seems potentially similar to the issue https://ipython.org/faq.html#ipython-crashes-under-os-x-when-using-the-arrow-keys.
It appears not to be a Python issue per se, since the problem disappears if I use plain Python rather than IPython. I initially used the Anaconda ipython but switched to the system ipython in /usr/bin/ipython with the same results. A clean reinstall of Anaconda showed the same issue, while a fresh install of Anaconda on a different machine with the same OS did not.
I am looking for ideas to make progress on this issue. Any ideas are appreciated, and I will post follow-up data if needed.
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
IPython 7.5.0
Ubuntu 18.04.2 LTS
It is fixed now, but still somewhat mysterious to me. I followed the stack trace all the way down through CPython to the pthreads library calls. The pthreads documentation indicated that the error can essentially only arise if one is out of memory on the heap or if the max number of threads has been allocated. I used ulimit to set the virtual memory per process to unlimited (it had been ~3 GB). This resolved the issue.
So apparently the virtual memory limit interfered with the ability to allocate a thread. The obvious explanation is that more memory was needed, although it is hard to believe that more than 3 GB is required to respond to a key press. Another possibility is that the amount of virtual memory reserved per thread is a function of the limit; I remember something like that in the pthreads documentation, although it was a bit above my head.
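In case it helps anyone else, a minimal sketch of checking and raising the relevant limits from inside Python on Linux (the decision to raise them mirrors the ulimit change above and is an assumption, not a general recommendation):
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_AS)               # virtual memory cap
print('RLIMIT_AS soft=%s hard=%s' % (soft, hard))
print('RLIMIT_NPROC', resource.getrlimit(resource.RLIMIT_NPROC))  # process/thread cap

# The soft limit can be raised up to the hard limit without privileges; with an
# unlimited hard limit this is the in-process equivalent of `ulimit -v unlimited`.
resource.setrlimit(resource.RLIMIT_AS, (hard, hard))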
When running Celery in a Docker container that receives REST API calls from other containers, I get a RuntimeError: concurrent poll() invocation.
Did anyone face a similar error?
I attach the traceback.
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/opt/www/api/api/training_call.py", line 187, in start_process
result_state.get(on_message=self._on_raw_message, propagate=False)
File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 226, in get
on_message=on_message,
File "/usr/local/lib/python3.5/dist-packages/celery/backends/asynchronous.py", line 188, in wait_for_pending
for _ in self._wait_for_pending(result, **kwargs):
File "/usr/local/lib/python3.5/dist-packages/celery/backends/asynchronous.py", line 255, in _wait_for_pending
on_interval=on_interval):
File "/usr/local/lib/python3.5/dist-packages/celery/backends/asynchronous.py", line 56, in drain_events_until
yield self.wait_for(p, wait, timeout=1)
File "/usr/local/lib/python3.5/dist-packages/celery/backends/asynchronous.py", line 65, in wait_for
wait(timeout=timeout)
File "/usr/local/lib/python3.5/dist-packages/celery/backends/redis.py", line 127, in drain_events
message = self._pubsub.get_message(timeout=timeout)
File "/usr/local/lib/python3.5/dist-packages/redis/client.py", line 3135, in get_message
response = self.parse_response(block=False, timeout=timeout)
File "/usr/local/lib/python3.5/dist-packages/redis/client.py", line 3034, in parse_response
if not block and not connection.can_read(timeout=timeout):
File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 628, in can_read
return self._parser.can_read() or self._selector.can_read(timeout)
File "/usr/local/lib/python3.5/dist-packages/redis/selector.py", line 28, in can_read
return self.check_can_read(timeout)
File "/usr/local/lib/python3.5/dist-packages/redis/selector.py", line 156, in check_can_read
events = self.read_poller.poll(timeout)
RuntimeError: concurrent poll() invocation
The broker connection is not thread-safe, so you need to handle thread-safety in your application code.
@Laizer mentioned the ticket where this error was introduced in the Python standard library.
One way to deal with it is to wrap all the calls that block until task completion in a shared Lock:
import celery
import threading

@celery.shared_task(bind=True)
def debug_task(self):
    print('Hello, world')

def boom(nb_tasks):
    """Not thread safe: raises RuntimeError during concurrent executions."""
    tasks = celery.group([debug_task.s() for _ in range(nb_tasks)])
    pool = tasks.apply_async()
    pool.join()  # raised from here

CELERY_POLL_LOCK = threading.Lock()

def safe(nb_tasks):
    tasks = celery.group([debug_task.s() for _ in range(nb_tasks)])
    pool = tasks.apply_async()
    with CELERY_POLL_LOCK:  # prevents concurrent calls to poll()
        pool.join()

def main(nb_threads, nb_tasks_per_thread):
    for func in (safe, boom):
        threads = [threading.Thread(target=func, args=(nb_tasks_per_thread,)) for _ in range(nb_threads)]
        for a_thread in threads:
            a_thread.start()
        for a_thread in threads:
            a_thread.join()

main(10, 100)
This is a naive approach, that's suitable for me because I don't expect much concurrency and all the tasks are relatively fast (~10s).
If you have a different "profile", you may need something more convoluted (e.g. a single polling task that periodically polls for all pending groups / tasks).
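For example, a rough sketch of that single-poller idea (my own assumption about how it could look, reusing debug_task, celery and threading from the example above; not a drop-in recipe): worker threads only enqueue group results, and one dedicated thread owns every call that touches the result backend, so poll() is never invoked concurrently.
import queue
import time

PENDING = queue.Queue()

def poller():
    # The only thread that ever queries the result backend.
    in_flight = []
    while True:
        try:
            in_flight.append(PENDING.get_nowait())
        except queue.Empty:
            pass
        for group_result in list(in_flight):
            if group_result.ready():               # all tasks in the group finished
                group_result.get(propagate=False)  # fetch results, ignore task errors
                in_flight.remove(group_result)
        time.sleep(0.5)                            # polling interval is arbitrary

threading.Thread(target=poller, daemon=True).start()

def submit(nb_tasks):
    tasks = celery.group([debug_task.s() for _ in range(nb_tasks)])
    PENDING.put(tasks.apply_async())               # never blocks on the backend here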
I had the same error come up in an application that was using Redis pub/sub directly. Firing off many calls to redis.client.PubSub.get_message in quick succession led to this race condition. My solution was to slow down the rate of polling for new messages.
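A minimal sketch of that workaround, assuming direct use of redis-py pub/sub (handle and the channel name are hypothetical placeholders, and the pause length is arbitrary):
import time
import redis

r = redis.Redis()
pubsub = r.pubsub()
pubsub.subscribe('results')

while True:
    message = pubsub.get_message(timeout=1.0)  # wait at most 1s for a message
    if message:
        handle(message)                        # hypothetical message handler
    time.sleep(0.1)                            # throttle successive poll() calls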
I faced the same problem and solved it by
pip install -U "celery[redis]"
Hope it helps you.
https://docs.celeryproject.org/en/latest/getting-started/brokers/redis.html
I am trying to aggregate various columns on a 450 million row data set. When I use Dask's built-in aggregations such as 'min', 'max', 'std', and 'mean', a worker keeps crashing during the computation.
The file that I am using can be found here: https://www.kaggle.com/c/PLAsTiCC-2018/data (look for test_set.csv).
I have a Google Kubernetes cluster consisting of three 8-core machines with a total of 22 GB of RAM.
Since these are just the built-in aggregation functions, I haven't tried much else.
It isn't using much RAM either: usage stays steady around 6 GB total, and I haven't seen anything that would indicate an out-of-memory error.
Below is my basic code and the error log on the evicted worker:
import dask.dataframe as dd
from dask.distributed import Client, progress
from timeit import default_timer as timer

client = Client('google kubernetes cluster address')

test_df = dd.read_csv('gs://filepath/test_set.csv', blocksize=10000000)

def process_flux(df):
    flux_ratio_sq = df.flux / df.flux_err
    flux_by_flux_ratio_sq = df.flux * flux_ratio_sq
    df_flux = dd.concat([df, flux_ratio_sq, flux_by_flux_ratio_sq], axis=1)
    df_flux.columns = ['object_id', 'mjd', 'passband', 'flux', 'flux_err', 'detected',
                       'flux_ratio_sq', 'flux_by_flux_ratio_sq']
    return df_flux

aggs = {
    'flux': ['min', 'max', 'mean', 'std'],
    'detected': ['mean'],
    'flux_ratio_sq': ['sum'],
    'flux_by_flux_ratio_sq': ['sum'],
    'mjd': ['max', 'min'],
}

def featurize(df):
    start_df = process_flux(df)
    agg_df = start_df.groupby(['object_id']).agg(aggs)
    return agg_df

overall_start = timer()
final_df = featurize(test_df).compute()
overall_end = timer()
Error logs:
distributed.core - INFO - Event loop was unresponsive in Worker for 74.42s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 3.30s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 3.75s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
A number of these occur, then:
distributed.core - INFO - Event loop was unresponsive in Worker for 65.16s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - ERROR - Worker stream died during communication: tcp://hidden address
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 180, in read
n_frames = yield stream.read_bytes(8)
File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 441, in read_bytes
self._try_inline_read()
File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 911, in _try_inline_read
self._check_closed()
File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 1112, in _check_closed
raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed
response = yield comm.read(deserializers=deserializers)
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
yielded = next(result)
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 201, in read
convert_stream_closed_error(self, e)
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 127, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: TimeoutError: [Errno 110] Connection timed out
It runs fairly quickly and I'm just looking to get consistent performance without my workers crashing.
Thanks!
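Two mitigation sketches that might help (both are assumptions on my part, not confirmed fixes): spread the groupby output over several partitions instead of one with split_out, and give slow workers a longer connection grace period. Note that the comm timeout needs to be set in the environment where the workers start, not just on the client, to take effect there.
import dask

# Assumed values; tune for the cluster.
dask.config.set({'distributed.comm.timeouts.connect': '60s'})

def featurize_split(df, split_out=8):
    # Same aggregation as featurize(), but the result is spread over
    # several output partitions instead of a single one.
    start_df = process_flux(df)
    return start_df.groupby(['object_id']).agg(aggs, split_out=split_out)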
Recently, a script that uses the py2neo package and connects via the Bolt protocol began failing unexpectedly with ProtocolError: Server closed connection, and I'm struggling to understand why. Note that the script works when using HTTP as the protocol.
The script pulls the Neo4j graph, augments it in Python, and then attempts to push the relevant changes back to the database. The whole process takes about an hour and had been working reliably for quite some time; however, the push stage recently started failing.
If I create a toy example that bypasses the pull and augmentation, the push works, so I suspect the server closed the connection due to a timeout. However, I can't find any timeout-related parameter for Bolt in py2neo. Note that I have set the HTTP socket timeout to 9999 seconds (~2.75 hours),
from py2neo.packages.httpstream import http
http.socket_timeout = 9999
though my understanding is that i) this is unrelated to Bolt, and ii) the timeout applies at the connection level and its value greatly exceeds the time the script ran for anyway.
For reference I'm using Neo4j v3.0.3 and py2neo v3.1.2. The stack-trace was:
File "/usr/local/lib/python2.7/dist-packages/py2neo/database/__init__.py", line 1017, in __exit__
self.commit()
File "/usr/local/lib/python2.7/dist-packages/py2neo/database/__init__.py", line 1059, in commit
self._post(commit=True)
File "/usr/local/lib/python2.7/dist-packages/py2neo/database/__init__.py", line 1291, in _post
self.finish()
File "/usr/local/lib/python2.7/dist-packages/py2neo/database/__init__.py", line 1296, in finish
self._sync()
File "/usr/local/lib/python2.7/dist-packages/py2neo/database/__init__.py", line 1286, in _sync
connection.fetch()
File "/usr/local/lib/python2.7/dist-packages/py2neo/packages/neo4j/v1/bolt.py", line 323, in fetch
raw.writelines(self.channel.chunk_reader())
File "/usr/local/lib/python2.7/dist-packages/py2neo/packages/neo4j/v1/bolt.py", line 174, in chunk_reader
chunk_header = self._recv(2)
File "/usr/local/lib/python2.7/dist-packages/py2neo/packages/neo4j/v1/bolt.py", line 157, in _recv
raise ProtocolError("Server closed connection")
ProtocolError: Server closed connection
and the stripped down Python code is of the form,
import py2neo
from py2neo.packages.httpstream import http

http.socket_timeout = 3600

graph = py2neo.Graph(
    host='localhost',
    bolt=True,
    bolt_port=4096,
    http_port=4095,
)

# Pull the graph from the Neo4j database via graph.run(...) statements,
# augment the graph, etc.
# ...
# Exception is thrown when the following push transaction is executed.
with graph.begin() as tx:
    statement = """
    UNWIND {rows} AS row
    WITH row.source AS source, row.target AS target
    MATCH (s:Entity)
    USING INDEX s:Entity(uuid)
    WHERE s.uuid = source
    MATCH (t:Entity)
    USING INDEX t:Entity(uuid)
    WHERE t.uuid = target
    MATCH (s)-[r:FAVORITED]->(t)
    DELETE r
    """
    rows = [{
        'source': '8267d7d0-a837-11e6-b841-22000bcec6a9',
        'target': 'c6296c97-a837-11e6-b841-22000bcec6a9',
    }]
    tx.run(statement, rows=rows)
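One mitigation I can sketch (an assumption, not a confirmed fix): break the push into smaller transactions so that no single Bolt exchange stays open long enough to hit a server-side idle timeout. statement, rows and graph are as defined above; the chunk size is arbitrary.
CHUNK = 1000  # arbitrary batch size

for i in range(0, len(rows), CHUNK):
    # Each batch gets its own short-lived transaction.
    with graph.begin() as tx:
        tx.run(statement, rows=rows[i:i + CHUNK])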
Does anyone have any suggestions as to how I can further debug this or what causes the connection to close? I've looked through the _recv function but it's not apparent why no data was received by the socket.
Looking through the Neo4j debug.log file the only possible related error was
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
I've also checked the uptime of the service, and confirmed that it was running throughout the entire duration of the script (HH:MM:SS),
> ps -p "14765" -o etime=
16:55:31
I believe the issue is with:
tx.run(statement, rows=rows)
The second argument to CypherResource.run() is a dictionary of parameters. You're passing them in as additional Python keyword arguments. See the py2neo code.
Try:
tx.run(statement, {"rows": rows})
First let me briefly describe our setup before I ask the question proper:
We have a web application server (a virtual machine) running a Django application: nginx at the front, uwsgi under that, then a New Relic application wrapper followed by Django et al. The database is a separate PostgreSQL server located via SmartStack (synapse/nerve).
The issue we face is that occasionally (it happened once two weeks ago, and twice in the last two days), one or two of the uwsgi worker processes will trip up and start producing "django.db.utils.InterfaceError: connection already closed" on most of their requests.
slightly redacted stack trace (user and application_name):
Traceback (most recent call last):
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/newrelic-2.8.0.7/newrelic/api/web_transaction.py", line 863, in __call__
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/newrelic-2.8.0.7/newrelic/api/function_trace.py", line 90, in literal_wrapper
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/newrelic-2.8.0.7/newrelic/api/web_transaction.py", line 752, in __call__
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 194, in __call__
signals.request_started.send(sender=self.__class__)
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 185, in send
response = receiver(signal=self, sender=sender, **named)
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/django/db/__init__.py", line 91, in close_old_connections
conn.abort()
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/django/db/backends/__init__.py", line 374, in abort
self.rollback()
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/django/db/backends/__init__.py", line 177, in rollback
self._rollback()
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/django/db/backends/__init__.py", line 141, in _rollback
return self.connection.rollback()
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/django/db/utils.py", line 99, in __exit__
six.reraise(dj_exc_type, dj_exc_value, traceback)
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/django/db/backends/__init__.py", line 141, in _rollback
return self.connection.rollback()
File "/home/user/webapps/application_name/local/lib/python2.7/site-packages/newrelic-2.8.0.7/newrelic/hooks/database_dbapi2.py", line 82, in rollback
django.db.utils.InterfaceError: connection already closed
The stack trace never gets into our application; it only touches New Relic and Django. Once a worker trips, it doesn't recover, and all further requests result in 500s in the uwsgi logs and 502s on the front side. I assume database connectivity is fine because the sibling workers continue to function normally, and restarting uwsgi instantly fixes the problem.
My question is how one would go about diagnosing this issue to pinpoint the root cause. I have checked everything I know how to check (memory, CPU, logs, database connectivity) and some things I don't fully understand but am trying to read up on (mainly file descriptors).
For now I have updated New Relic (the stack trace above is from the older version), as it's the only thing I felt I could do.
I would appreciate any feedback; many Google searches have proved fruitless.
Replies may be slightly delayed; my timezone says it's time to sleep. Also, apologies if this should be on Server Fault or something; I just figured it's closer to an application debugging issue than a server configuration issue.
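One diagnostic sketch that might help narrow this down (entirely an assumption about the setup, written against the old-style MIDDLEWARE_CLASSES API this Django version uses): log the health of each worker's database connection on every request, so a tripped worker shows up in its logs with is_usable() reporting False and a stable backend PID.
import logging

from django.db import connection

logger = logging.getLogger(__name__)

class ConnectionHealthMiddleware(object):
    """Hypothetical diagnostic middleware; add it to MIDDLEWARE_CLASSES."""

    def process_request(self, request):
        raw = connection.connection  # underlying psycopg2 connection, or None
        usable = connection.is_usable() if raw is not None else None
        pid = raw.get_backend_pid() if raw is not None else None
        logger.info('db connection usable=%s backend_pid=%s', usable, pid)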