Worker crashes during simple aggregation - dask

I am trying to aggregate various columns on a 450 million row data set. When I use Dask's built-in aggregations like 'min', 'max', 'std', and 'mean', they keep crashing a worker in the process.
The file that I am using can be found here: https://www.kaggle.com/c/PLAsTiCC-2018/data (look for test_set.csv).
I have a Google Kubernetes cluster that consists of three 8-core machines with a total of 22GB of RAM.
Since these are just the built-in aggregation functions, I haven't tried much else.
It's not using that much RAM either; total usage stays steady around 6GB, and I haven't seen any errors that would indicate an out-of-memory problem.
Below is my basic code and the error log from the evicted worker:
import dask.dataframe as dd
from dask.distributed import Client, progress
from timeit import default_timer as timer

client = Client('google kubernetes cluster address')

test_df = dd.read_csv('gs://filepath/test_set.csv', blocksize=10000000)

def process_flux(df):
    # derived columns used by the aggregations below
    flux_ratio_sq = df.flux / df.flux_err
    flux_by_flux_ratio_sq = df.flux * flux_ratio_sq
    df_flux = dd.concat([df, flux_ratio_sq, flux_by_flux_ratio_sq], axis=1)
    df_flux.columns = ['object_id', 'mjd', 'passband', 'flux', 'flux_err', 'detected', 'flux_ratio_sq', 'flux_by_flux_ratio_sq']
    return df_flux

aggs = {
    'flux': ['min', 'max', 'mean', 'std'],
    'detected': ['mean'],
    'flux_ratio_sq': ['sum'],
    'flux_by_flux_ratio_sq': ['sum'],
    'mjd': ['max', 'min'],
}

def featurize(df):
    start_df = process_flux(df)
    agg_df = start_df.groupby(['object_id']).agg(aggs)
    return agg_df

overall_start = timer()
final_df = featurize(test_df).compute()
overall_end = timer()
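One variant I'm considering (just a sketch on my side, assuming the same column names): building the derived columns with assign instead of dd.concat, and passing split_out to the groupby aggregation so the result is spread over several partitions instead of landing on a single worker.

def process_flux_v2(df):
    # same derived columns, added with assign instead of concatenating unnamed Series
    flux_ratio_sq = df.flux / df.flux_err
    return df.assign(flux_ratio_sq=flux_ratio_sq,
                     flux_by_flux_ratio_sq=df.flux * flux_ratio_sq)

def featurize_v2(df):
    # split_out spreads the aggregated output over several partitions
    return process_flux_v2(df).groupby('object_id').agg(aggs, split_out=8)

final_df = featurize_v2(test_df).compute()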
Error logs:
distributed.core - INFO - Event loop was unresponsive in Worker for 74.42s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 3.30s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 3.75s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
A number of these occur, then:
distributed.core - INFO - Event loop was unresponsive in Worker for 65.16s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - ERROR - Worker stream died during communication: tcp://hidden address
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 180, in read
n_frames = yield stream.read_bytes(8)
File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 441, in read_bytes
self._try_inline_read()
File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 911, in _try_inline_read
self._check_closed()
File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 1112, in _check_closed
raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed
response = yield comm.read(deserializers=deserializers)
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
yielded = next(result)
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 201, in read
convert_stream_closed_error(self, e)
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 127, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: TimeoutError: [Errno 110] Connection timed out
It runs fairly quickly and I'm just looking to get consistent performance without my workers crashing.
Thanks!

Related

Dask worker hangs after missing dep key

In a distributed GKE Dask cluster, I have one graph that stalls with the traceback below. The worker dashboard keeps reporting the same constant, high CPU value, while the GKE dashboard shows near-zero CPU for the pod. The "last seen" value on the worker dashboard grows to many minutes. After 15 minutes I kill the GKE pod, yet the Dask scheduler still indicates the worker exists and keeps the task assigned to it. The scheduler seems to be wedged on this task: no progress is made, and the failing work unit is never cleaned up or restarted.
I am using dask/distributed 2020.12.0, dask-gateway 0.9.0, and xarray 0.16.2.
What can cause a key to go missing?
How does one debug or workaround the underlying issue here?
Edit:
For each run of the same graph, a different key is in the traceback. Wedged workers for a single run show the same key in their tracebacks.
With enough patience/retries, the graph can succeed.
I am using an auto-scaling cluster with pre-emptible nodes, though the problem persists even if I remove auto-scaling and set a fixed number of nodes.
I've seen this on different workloads. The current workload I'm struggling with creates a dask dataframe from two arrays like this:
import dask
import dask.dataframe as dd
import numpy as np

image_chunks = image.to_delayed().ravel()
labels_chunks = labels.to_delayed().ravel()
results = []
for image_chunk, labels_chunk in zip(image_chunks, labels_chunks):
    # offset of this chunk within the full image, derived from its key
    offsets = np.array(image_chunk.key[1:]) * np.array(image.chunksize)
    result = dask.delayed(stats)(image_chunk, labels_chunk, offsets, ...)
    results.append(result)
...
dask_df = dd.from_delayed(results, meta=df_meta)
dask_df = dask_df.groupby(['label', 'kind']).sum()
Example traceback #1
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2627, in execute
self.ensure_communicating()
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1880, in ensure_communicating
to_gather, total_nbytes = self.select_keys_for_gather(
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1985, in select_keys_for_gather
total_bytes = self.tasks[dep].get_nbytes()
KeyError: 'xarray-image-bc4e1224600f3930ab9b691d1009ed0c'
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fac3d04f4f0>>, <Task finished name='Task-143' coro=<Worker.execute() done, defined at /opt/conda/lib/python3.8/site-packages/distributed/worker.py:2524> exception=KeyError('xarray-image-bc4e1224600f3930ab9b691d1009ed0c')>)
Example traceback #2
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2146, in gather_dep
await self.query_who_has(dep.key)
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2218, in query_who_has
self.update_who_has(response)
File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2227, in update_who_has
self.tasks[dep].who_has.update(workers)
KeyError: "('rechunk-merge-607c9ba97d3abca4de3981b3de246bf3', 0, 0, 4, 4)"

Dask with TLS connection cannot end the program with the to_parquet method

I am using dask to process 10 files; each file is about 142MB. I built a method with the dask.delayed decorator; the following is an example:
import os
import dask
import pandas as pd

@dask.delayed
def process_one_file(input_file_path, save_path):
    res = []
    for line in open(input_file_path):
        res.append(line)
    df = pd.DataFrame(res)  # DataFrame built from the collected lines
    df.to_parquet(save_path + os.path.basename(input_file_path))

if __name__ == '__main__':
    client = ClusterClient()  # connects to the (TLS-enabled) cluster
    input_dir = ""
    save_dir = ""
    print("start to process")
    csvs = [process_one_file(input_dir + filename, save_dir) for filename in os.listdir(input_dir)]
    dask.compute(csvs)
However, dask does not always run successfully. After processing all files, the program often hangs.
I used the command line to run the program. It often hangs after printing "start to process". I know the program runs correctly, since I can see all the output files after a while.
But the program never stops. If I disable TLS, the program runs successfully.
It is strange that dask cannot stop the program when I enable the TLS connection. How can I solve it?
I found that if I add the to_parquet call, the program cannot stop, while if I remove it, the program runs successfully.
I have found the problem. I set 10GB for each process, i.e. memory-limit=10GB. I set 2 workers in total, each with 2 processes, and each process has 2 threads.
Thus, each machine has 4 processes, which together claim 40GB. However, my machine only has 32GB. If I lower the memory limit, the program runs successfully!
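In other words, processes × memory limit has to fit inside physical RAM. A minimal sketch of the budget (the local-cluster keywords below are only for illustration; the same arithmetic applies however the workers are actually launched):

from dask.distributed import Client

# 4 worker processes on a 32GB machine -> keep each limit comfortably under 32 / 4 = 8GB
client = Client(n_workers=4, threads_per_worker=2, memory_limit='7GB')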

RuntimeError: concurrent poll() invocation using celery

When running Celery in a Docker container that receives REST API requests from other containers, I get a RuntimeError: concurrent poll() invocation.
Did anyone face a similar error?
I attach the traceback.
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/opt/www/api/api/training_call.py", line 187, in start_process
result_state.get(on_message=self._on_raw_message, propagate=False)
File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 226, in get
on_message=on_message,
File "/usr/local/lib/python3.5/dist-packages/celery/backends/asynchronous.py", line 188, in wait_for_pending
for _ in self._wait_for_pending(result, **kwargs):
File "/usr/local/lib/python3.5/dist-packages/celery/backends/asynchronous.py", line 255, in _wait_for_pending
on_interval=on_interval):
File "/usr/local/lib/python3.5/dist-packages/celery/backends/asynchronous.py", line 56, in drain_events_until
yield self.wait_for(p, wait, timeout=1)
File "/usr/local/lib/python3.5/dist-packages/celery/backends/asynchronous.py", line 65, in wait_for
wait(timeout=timeout)
File "/usr/local/lib/python3.5/dist-packages/celery/backends/redis.py", line 127, in drain_events
message = self._pubsub.get_message(timeout=timeout)
File "/usr/local/lib/python3.5/dist-packages/redis/client.py", line 3135, in get_message
response = self.parse_response(block=False, timeout=timeout)
File "/usr/local/lib/python3.5/dist-packages/redis/client.py", line 3034, in parse_response
if not block and not connection.can_read(timeout=timeout):
File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 628, in can_read
return self._parser.can_read() or self._selector.can_read(timeout)
File "/usr/local/lib/python3.5/dist-packages/redis/selector.py", line 28, in can_read
return self.check_can_read(timeout)
File "/usr/local/lib/python3.5/dist-packages/redis/selector.py", line 156, in check_can_read
events = self.read_poller.poll(timeout)
RuntimeError: concurrent poll() invocation
The broker connection is not thread-safe, so you need to handle thread-safety in your application code.
@Laizer mentioned the ticket where this error was introduced in the Python core library.
One way to do it is to wrap all the calls that block until task completion in a shared Lock:
import celery
import threading

@celery.shared_task(bind=True)
def debug_task(self):
    print('Hello, world')

def boom(nb_tasks):
    """ not thread safe - raises RuntimeError during concurrent executions """
    tasks = celery.group([debug_task.s() for _ in range(nb_tasks)])
    pool = tasks.apply_async()
    pool.join()  # raised from here

CELERY_POLL_LOCK = threading.Lock()

def safe(nb_tasks):
    tasks = celery.group([debug_task.s() for _ in range(nb_tasks)])
    pool = tasks.apply_async()
    with CELERY_POLL_LOCK:  # prevents concurrent calls to poll()
        pool.join()

def main(nb_threads, nb_tasks_per_thread):
    for func in (safe, boom):
        threads = [threading.Thread(target=func, args=(nb_tasks_per_thread, )) for _ in range(nb_threads)]
        for a_thread in threads:
            a_thread.start()
        for a_thread in threads:
            a_thread.join()

main(10, 100)
This is a naive approach that's suitable for me because I don't expect much concurrency and all the tasks are relatively fast (~10s).
If you have a different "profile", you may need something more convoluted (e.g. a single polling task that periodically polls for all pending groups / tasks).
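For example, a rough sketch of that single-poller idea (assuming Celery group results; not something I have battle-tested):

import queue
import threading

_PENDING = queue.Queue()

def _poller():
    # the only thread that ever blocks on results, so poll() is never invoked concurrently
    while True:
        group_result, done = _PENDING.get()
        group_result.join(propagate=False)
        done.set()

threading.Thread(target=_poller, daemon=True).start()

def wait_for(group_result):
    # called from worker threads; hands the blocking wait over to the single poller
    done = threading.Event()
    _PENDING.put((group_result, done))
    done.wait()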
I had the same error come up with an application that was using Redis pub/sub directly. Firing off many calls to redis.client.PubSub.get_message in quick succession led to this race condition. My solution was to slow down the rate of polling for new messages.
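For illustration, the polling loop ended up looking roughly like this (the channel name is a placeholder; the key part is the sleep between get_message calls):

import time
import redis

r = redis.Redis()
pubsub = r.pubsub()
pubsub.subscribe('results')  # placeholder channel name

while True:
    message = pubsub.get_message(timeout=1.0)
    if message and message['type'] == 'message':
        print(message['data'])  # stand-in for the real message handling
    time.sleep(0.5)  # throttle polling so get_message is not hammered in a tight loop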
I faced the same problem and solved it with:
pip install -U "celery[redis]"
Hope it's helpful to you.
https://docs.celeryproject.org/en/latest/getting-started/brokers/redis.html

SSDT: timeout in create database script

We have recently moved to SSDT as our database management and deployment tool, and we use SqlPackage.exe to deploy the package. We occasionally get timeout errors when deploying it. After looking at the errors, I added /p:CommandTimeout=900 to the SqlPackage command-line parameters. It still fails on some occasions, and when it fails, it fails within a few seconds, so I'm guessing it is not hitting that /p:CommandTimeout.
I couldn't find documentation on any other timeout.
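For reference, the deployment is invoked roughly like this (server, database, and dacpac names are placeholders):
SqlPackage.exe /Action:Publish /SourceFile:PhoenixDB.dacpac /TargetServerName:MyServer /TargetDatabaseName:PhoenixDB /p:CommandTimeout=900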
Here is the detailed error message -
Error SQL72014: .Net SqlClient Data Provider: Msg -2, Level 11, State 0, Line 0
Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Error SQL72045: Script execution error. The executed script:
CREATE DATABASE [$(DatabaseName)]
    ON PRIMARY(NAME = [PhoenixDB], FILENAME = '$(DefaultDataPath)PhoenixDB_Data.mdf', SIZE = 8000 MB, FILEGROWTH = 10 %)
    LOG ON (NAME = [PhoenixDB_log], FILENAME = '$(DefaultLogPath)PhoenixDB_Log.ldf', SIZE = 2000 MB, FILEGROWTH = 10 %) COLLATE SQL_Latin1_General_CP1_CI_AS;
Error SQL72014: .Net SqlClient Data Provider: Msg 1802, Level 16, State 4, Line 1
CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
Error SQL72045: Script execution error. The executed script:
CREATE DATABASE [$(DatabaseName)]
    ON PRIMARY(NAME = [PhoenixDB], FILENAME = '$(DefaultDataPath)PhoenixDB_Data.mdf', SIZE = 8000 MB, FILEGROWTH = 10 %)
    LOG ON (NAME = [PhoenixDB_log], FILENAME = '$(DefaultLogPath)PhoenixDB_Log.ldf', SIZE = 2000 MB, FILEGROWTH = 10 %) COLLATE SQL_Latin1_General_CP1_CI_AS;

IPython.parallel client is hanging while waiting for result of map_async

I am running 7 worker processes on a single machine with 4 cores. I may have made a poor choice with this loop while waiting for the result of map_async:
while not result.ready():
    time.sleep(10)
    for out in result.stdout:
        print out
rec_file_list = result.get()
result.stdout keeps growing with all the printed output from the 7 running processes, and it caused the console that initiated the map to hang. The Activity Monitor on my MacBook Pro shows that the 7 processes are still running, and the terminal running the controller is still active. What are my options here? Is there any way to acquire the result once the processes have completed?
I found an answer:
Remote introspection of AsyncResult objects is possible from another client as long as a 'database backend' has been enabled by the controller with:
ipcontroller --dictdb # or --mongodb or --sqlitedb
Then, it is possible to create a new client instance and retrieve the results with:
client.get_result(task_id)
where the task_ids can be retrieved with:
client.hub_history()
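Putting those pieces together, a sketch (it assumes the controller was started with one of the database backends above):

from IPython.parallel import Client

rc = Client()
task_ids = rc.hub_history()       # msg_ids of previously submitted tasks, as recorded by the hub
ar = rc.get_result(task_ids[-1])  # AsyncResult for the most recent task
rec_file_list = ar.get()          # block until the result is available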
Also, a simple way to avoid the buffer overflow I encountered is to periodically print just the last few lines from each engine's stdout history, and to flush the buffer like:
from IPython.display import clear_output
import sys
while not result.ready():
clear_output()
for stdout in result.stdout:
if stdout:
lines = stdout.split('\n')
for line in lines[-4:-1]:
if line:
print line
sys.stdout.flush()
time.sleep(30)
