Why decouple applications where there is a fast producer and slow consumer? What happens?

I'm reading about SQS and message queues, and I'm wondering what happens when there is a fast producer and a slow consumer. Where is the message buildup stored? Is it held in memory until it overflows and the server crashes? Is the problem that memory is being used as the queue, and memory is finite?

If you continuously consume messages more slowly than they are produced, messages in SQS (or whichever other messaging system you choose) will typically expire and be lost once they exceed the retention period. A slow consumer can cause memory problems when the producer and consumer share an in-memory queue in the same process; in that case you may run out of memory and crash, but that is less typical in a message-based architecture. More commonly messages are stored on disk, though there is usually still a finite retention period.
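For SQS in particular, the retention window is a per-queue attribute (between 60 seconds and 14 days); messages that sit unconsumed longer than that are deleted. A minimal boto3 sketch, with a hypothetical queue name, of setting it explicitly:
import boto3

sqs = boto3.client("sqs")
# MessageRetentionPeriod is given in seconds (60 up to 1209600, i.e. 14 days).
sqs.create_queue(
    QueueName="orders",                                            # hypothetical queue name
    Attributes={"MessageRetentionPeriod": str(4 * 24 * 60 * 60)},  # keep messages for 4 days
)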
[Edit]
If you use an in-memory unbounded queue, you can run into issues. Take the following example:
from threading import Thread
import time
from multiprocessing import Queue

class SlowConsumer(Thread):
    def __init__(self, queue: Queue):
        self.queue = queue
        Thread.__init__(self)

    def run(self):
        while True:
            v = self.queue.get()
            print(f"Processing #{v}")
            time.sleep(1)          # the consumer handles ~1 message per second

q = Queue()                        # unbounded queue
c = SlowConsumer(q)
c.start()

i = 0
while True:
    q.put(i)                       # the producer emits ~10 messages per second
    i += 1
    time.sleep(.1)
The consumer will fall further and further behind the producer, and eventually the process will exhaust available memory. This is a pathological case which should be avoided, generally by using a bounded queue. Note: effectively the same problem can exist with external queues that support limitless retention, such as Kafka, except that instead of running out of memory you run out of disk. For on-disk queues it is also best to set a retention period, if a sensible one exists.
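For contrast, here is a minimal sketch of the same producer/consumer with a bounded queue from the standard library's queue module: once the buffer is full, put() blocks, so the fast producer is throttled to the consumer's pace instead of growing memory without limit.
from threading import Thread
import time
from queue import Queue

def consume(q: Queue):
    while True:
        v = q.get()
        print(f"Processing #{v}")
        time.sleep(1)

q = Queue(maxsize=100)                 # bounded: at most 100 pending items
Thread(target=consume, args=(q,), daemon=True).start()

i = 0
while True:
    q.put(i)                           # blocks when the queue is full (backpressure)
    i += 1
    time.sleep(.1)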

Related

Asynchronous Xarray writing to Zarr

I'm using a Dask Distributed cluster to write Zarr+Dask-backed Xarray Datasets inside a loop, and dataset.to_zarr is blocking. This can really slow things down when there are straggler chunks that block the continuation of the loop. Is there a way to do the .to_zarr asynchronously, so that the loop can continue with the next dataset write without being held up by a few straggler chunks?
With the distributed scheduler, you get async behaviour without any special effort. For example, if you are doing arr.to_zarr, then indeed you are going to wait for completion. However, you could do the following:
from dask.distributed import Client

client = Client(...)
out = arr.to_zarr(..., compute=False)   # build the task graph without executing it
fut = client.compute(out)               # start it running and return immediately
This returns a future, fut, whose status reflects the current state of the whole computation; you can choose whether to wait on it or continue submitting new work. You could also attach a progress bar to it (in the notebook), which will update asynchronously whenever the kernel is not busy.
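A sketch of what that might look like, continuing from the snippet above (progress and wait are the helpers shipped with dask.distributed):
from dask.distributed import progress, wait

progress(fut)    # asynchronous progress bar, e.g. in a notebook
# ... keep submitting other work here ...
wait(fut)        # or fut.result(), once you actually need the write to be finished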

Trigger Dask workers to release memory

I'm distributing the computation of some functions using Dask. My general layout looks like this:
from dask.distributed import Client, LocalCluster, as_completed

cluster = LocalCluster(processes=config.use_dask_local_processes,
                       n_workers=1,
                       threads_per_worker=1,
                       )
client = Client(cluster)
cluster.scale(config.dask_local_worker_instances)

fcast_futures = []
# For each group do work
for group in groups:
    fcast_futures.append(client.submit(_work, group))

# Wait till the work is done
for done_work in as_completed(fcast_futures, with_results=False):
    try:
        result = done_work.result()
    except Exception as error:
        log.exception(error)
My issue is that for a large number of jobs I tend to hit memory limits. I see a lot of:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 1.15 GB -- Worker memory limit: 1.43 GB
It seems that each future isn't releasing its memory. How can I trigger that? I'm using dask==1.2.0 on Python 2.7.
Results are held by the scheduler for as long as there is a future on a client pointing to them. Memory is released when (or shortly after) the last such future is garbage-collected by Python. In your case you are keeping all of your futures in a list throughout the computation. You could try modifying your loop:
for done_work in as_completed(fcast_futures, with_results=False):
    try:
        result = done_work.result()
    except Exception as error:
        log.exception(error)
    done_work.release()   # explicitly drop this future's claim on the result
or replacing the as_completed loop with something that explicitly removes futures from the list once they have been processed.
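A sketch of that second option, reusing the fcast_futures list from the question: drop each future from the list as soon as it has been handled, so nothing long-lived keeps a reference to the finished result.
for done_work in as_completed(fcast_futures, with_results=False):
    try:
        result = done_work.result()
    except Exception as error:
        log.exception(error)
    fcast_futures.remove(done_work)    # drop the list's reference so the result can be released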

Dask scheduler behavior while reading/retrieving large datasets

This is a follow-up to this question.
I'm experiencing problems with persisting a large dataset in distributed memory. I have a scheduler running on one machine and 8 workers each running on their own machines connected by 40 gigabit ethernet and a backing Lustre filesystem.
Problem 1:
ds = DataSlicer(dataset) # ~600 GB dataset
dask_array = dask.array.from_array(ds, chunks=(13507, -1, -1), name=False) # ~22 GB chunks
dask_array = client.persist(dask_array)
When inspecting the Dask status dashboard, I see all 28 tasks being assigned to and processed by one worker while the other workers do nothing. Additionally, when every task has finished processing and the tasks are all in the "In memory" state, only 22 GB of RAM (i.e. the first chunk of the dataset) is actually stored on the cluster. Access to indices within the first chunk is fast, but any other indices force a new round of reading and loading the data before the result returns. This seems contrary to my belief that .persist() should pin the complete dataset across the memory of the workers once it finishes execution. In addition, when I increase the chunk size, one worker often runs out of memory and restarts after being assigned multiple huge chunks of data.
Is there a way to manually assign chunks to workers instead of the scheduler piling all of the tasks on one process? Or is this abnormal scheduler behavior? Is there a way to ensure that the entire dataset is loaded into RAM?
Problem 2:
I found a temporary workaround by treating each chunk of the dataset as its own separate dask array and persisting each one individually.
dask_arrays = [da.from_delayed(lazy_slice, shape, dtype, name=False)
               for lazy_slice, shape in zip(lazy_slices, shapes)]
for i in range(len(dask_arrays)):
    dask_arrays[i] = client.persist(dask_arrays[i])
I tested the bandwidth from persisted and published dask arrays to several parallel readers by calling .compute() on different chunks of the dataset in parallel. I could never achieve more than 2 GB/s aggregate bandwidth from the dask cluster, far below our network's capabilities.
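For reference, each reader in that test looks roughly like the following sketch (the scheduler address and dataset name are placeholders, and the array is assumed to have been shared with client.publish_dataset):
from dask.distributed import Client

client = Client("scheduler-address:8786")      # placeholder scheduler address
arr = client.get_dataset("my_dataset")         # array published earlier with client.publish_dataset
chunk = arr[0:13507].compute()                 # each reader computes a different slice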
Is the scheduler the bottleneck in this situation, i.e. is all data being funneled through the scheduler to my readers? If this is the case, is there a way to get in-memory data directly from each worker? If this is not the case, what are some other areas in dask I may be able to investigate?

DASK - Stopping workers during execution causes completed tasks to be launched twice

I want to use dask to process some 5000 batch tasks that store their results in a relational database. After they are all completed, I want to run a final task that will query the database and generate a result file (which will be stored in AWS S3).
So it's more or less like this:
from dask import bag, delayed
from dask.distributed import Client
batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)
client = Client('the_scheduler:8786')
client.compute(graph)
And this works, but: Near the end of processing, many workers are idle and I would like to be able to turn them off (and save some money on AWS EC2), but if I do that, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.
I understand that this is actually a feature, not a bug, as Dask is trying to keep track of all the results before starting read_database_and_store_bundled_result_into_s3, but: Is there any way that I can tell dask to just orchestrate the distributed processing graph and not worry about state management?
I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.
from dask.distributed import Client, as_completed

client = Client('the_scheduler:8786')
futures = client.map(process_batch_and_store_results_in_database, my_batches())
seq = as_completed(futures)
del futures  # now the only reference to the futures is within seq

for future in seq:
    pass  # let each future be garbage collected as soon as it completes
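Once that loop has drained, all of the batch futures have been released; at that point you could submit the final aggregation step on its own, assuming it can run without the in-memory results (they are already in the database):
final = client.submit(read_database_and_store_bundled_result_into_s3)
final.result()    # wait for the bundled result file to land in S3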

MailboxProcessor usage guidelines?

I've just discovered the MailboxProcessor in F# and its usage as a "state machine" ... but I can't find much on the recommended usage of them.
For example... say I'm making a simple game with 100 on-screen enemies, should I use MailboxProcessors to change enemy position and health, giving me 200 active MailboxProcessors?
Is there any clever thread management going on under the hood? Should I try to limit the number of active MailboxProcessors I have, or can I keep banging them out willy-nilly?
Thanks in advance,
JD.
A MailboxProcessor for enemy simulation might look like this:
MailboxProcessor.Start(fun inbox ->
    async {
        while true do
            let! message = inbox.Receive()
            processMessage(message)
    })
It does not consume a thread while it waits for a message to arrive (the let! message = line). However, once a message arrives it will consume a thread (from the thread pool). If you have 100 mailbox processors that all receive a message simultaneously, they will all attempt to wake up and consume a thread. Since message processing here is CPU bound, hundreds of mailbox processors will all wake up and start spawning (thread pool) threads, which is not great for performance.
One situation mailbox processors excel in is when many concurrent clients all send messages to one processor (imagine several parallel web crawlers all downloading pages and sinking results into a queue). The on-screen enemies case is the opposite: many entities responding to a single source of messages (player movement/time ticks).
Another example where thousands of MailboxProcessors are a great solution is the I/O-bound MailboxProcessor:
MailboxProcessor.Start(fun inbox ->
    async {
        while true do
            let! message = inbox.Receive()
            match message with
            | _ ->
                do! AsyncWrite("something")
                let! response = AsyncResponse()
                ...
    })
Here, after receiving a message, the agent very quickly yields its thread but still needs to maintain state across asynchronous operations. This scales very well in practice: you can run thousands and thousands of such agents, and it is a great way to write a web server.
As per
http://blogs.msdn.com/b/dsyme/archive/2010/02/15/async-and-parallel-design-patterns-in-f-part-3-agents.aspx
you can bang them out willy-nilly. Try it! They use the ThreadPool. I have not tried this for a real-time GUI game app, but I would not be surprised if this is 'good enough'.
say I'm making a simple game with 100 on-screen enemies, should I use MailboxProcessors to change enemy position and health, giving me 200 active MailboxProcessors?
I don't see any reason to try to use MailboxProcessor for that. A serial loop is likely to be much simpler and faster.
Is there any clever thread management going on under the hood?
Yes, lots. But it is designed for asynchronous concurrent programming (particularly scalable IO), and your program isn't really doing that.
should I try to limit the number of active MailboxProcessors I have or can I keep banging them out willy-nilly?
You can bang them out willy-nilly but they are far from optimized and performance is much worse than serial code.
