Is all communication between workers in Dask Distributed via the scheduler?

I'm trying to establish whether all the workers in my cluster need to be able to see each other, or just the scheduler process. When data needs to be transferred between workers, do they communicate directly, or send data via the scheduler?

Workers should ideally be able to communicate directly with each other, so they can copy data (results) between themselves more quickly as needed. You do not want the scheduler to become the single bottleneck for data communication; all messages and task instructions do pass through the scheduler, but those tend to be much smaller than the data itself.
EDIT: docs link: http://distributed.dask.org/en/stable/journey.html#step-5-execute-on-the-worker
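A minimal sketch of how to see this in practice, assuming a running cluster with at least two workers (the scheduler address, worker pinning, and payload size are all illustrative): pinning a producer task and its consumer to different workers forces the intermediate result to move over the direct worker-to-worker channel rather than through the scheduler.

```python
# Illustrative sketch: scheduler address and worker choice are assumptions.
from dask.distributed import Client
import numpy as np

client = Client("tcp://scheduler:8786")               # hypothetical scheduler address
w1, w2 = list(client.scheduler_info()["workers"])[:2]

a = client.submit(np.ones, 2_000_000, workers=[w1])   # ~16 MB result lives on w1
b = client.submit(lambda x: x.sum(), a, workers=[w2]) # w2 fetches it straight from w1
print(b.result())
```

If the workers could not reach each other directly, the dependency fetch for `b` would stall, which makes this a quick (if crude) connectivity check.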

Related

Is it possible to assign worker resources to dask distributed worker after creation?

As per title, if I am creating workers via Helm or Kubernetes, is it possible to assign "worker resources" (https://distributed.readthedocs.io/en/latest/resources.html#worker-resources) after workers have been created?
The use case is tasks that hit a database: I would like to limit the number of processes able to hit the database in a given run, without limiting the total size of the cluster.
As of 2019-04-09 there is no standard way to do this. You've found the Worker.set_resources method, which is reasonable to use. Eventually I would also expect Worker plugins to handle this, but they aren't implemented.
For your application of controlling access to a database, it sounds like what you're really after is a semaphore. You might help build one (it's actually fairly straightforward given the current Lock implementation), or you could use a Dask Queue to simulate one.
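A rough sketch of that Queue-based workaround: preload the queue with N tokens and have each database task check one out around its call. The queue name, the scheduler address, the `query_db` helper, and the concurrency limit are all hypothetical.

```python
# Sketch: simulate a semaphore with a dask Queue (names are hypothetical).
from dask.distributed import Client, Queue

client = Client("tcp://scheduler:8786")   # assumed scheduler address

tokens = Queue("db-tokens")
for i in range(3):                        # allow at most 3 concurrent DB hits
    tokens.put(i)

def query_db(record):
    return record * 2                     # stand-in for the real database call

def limited_task(record):
    q = Queue("db-tokens")
    token = q.get()                       # blocks until a slot frees up
    try:
        return query_db(record)
    finally:
        q.put(token)                      # always return the token

results = client.gather(client.map(limited_task, range(100)))
```

The total cluster size stays unchanged; only the number of tasks inside the `get`/`put` window is capped.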

Do you have to use worker pools in Erlang?

I have a server I am creating (a messaging service) and I am doing some preliminary tests to benchmark it. So far, the fastest ways to process the data are to do it directly on the user's process and to use worker pools; spawning a new process per message, by contrast, is unbelievably slow.
The test just connects 10k users and has each one send 15 kB of data a couple of times at (roughly) the same time, with the server processing the data (total length, headers, and payload).
The issue I have with worker pools is that they are only fast when you have enough workers to offset the number of connections. For example, with 500k or 1 million users, you would need more workers to process all the concurrent data coming in. And in my testing, having 1000 workers made it unusable.
So my question is the following: when does it make sense to use pools of workers? Is there a tipping point where I would have to use workers to process the data to free up the user process? How many workers is too many; is 500,000 too many?
And if workers are the way to go (for those massive concurrent distributed servers), I am guessing you can dynamically create/delete them as needed?
Any literature is also appreciated!
Thanks for your answer!
Maybe worker pools are not the best tool for your problem. If I were you, I would try Jay Nelson's epocxy, which gives you a basic backpressure mechanism while still letting you parallelize your tasks. From that library, I would look at either the concurrency fount or the concurrency control tools.
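epocxy is Erlang-specific, but the backpressure idea it implements is easy to sketch in any language. Here is a Python illustration (the pool size, in-flight cap, and `process` body are all made up): producers block once a fixed number of tasks are in flight, so the pool stays small and fixed instead of growing with the number of connections.

```python
# Python sketch of bounded-concurrency backpressure (all numbers illustrative).
import concurrent.futures
import threading

MAX_IN_FLIGHT = 100                              # the 'tipping point' you would tune
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)
pool = concurrent.futures.ThreadPoolExecutor(max_workers=16)

def process(payload: bytes) -> int:
    return len(payload)                          # stand-in for header/payload parsing

def submit_with_backpressure(payload: bytes):
    slots.acquire()                              # producer waits when saturated
    future = pool.submit(process, payload)
    future.add_done_callback(lambda _: slots.release())
    return future

futures = [submit_with_backpressure(b"x" * 15_000) for _ in range(1_000)]
print(sum(f.result() for f in futures))
```

What scales under load is the queue of waiting producers, which is exactly the pressure signal you want, rather than an ever-growing worker count.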

Sharing data between Elastic Beanstalk web and worker tiers

I have a platform (based on Rails 4/Postgres) running on an auto-scaling Elastic Beanstalk web environment. I'm planning on offloading long-running tasks (syncing with 3rd parties, delivering email, etc.) to a worker tier, which appears simple enough to get up and running.
However, I also want to run periodic batch processes. I've looked into using cron.yaml, and the scheduling seems simple enough; however, the batch process I'm trying to build needs access to the web application's data in order to work.
Does anybody have an opinion on the best way of doing this? Either a shared RDS database between the web and worker tiers, or perhaps a web service that the worker tier can access?
Thanks,
Dan
Note: I've added an extra question, which more broadly describes my requirements, as it struck me that this might not be the best approach.
What's the best way to implement this shared batch process with Elastic Beanstalk?
Unless you need a full relational database management system (RDBMS), consider using S3 for shared persistent data storage across your instances.
Also consider Amazon Simple Queue Service (SQS):

SQS is a fast, reliable, scalable, fully managed message queuing service. SQS makes it simple and cost-effective to decouple the components of a cloud application. You can use SQS to transmit any volume of data, at any level of throughput, without losing messages or requiring other services to be always available.
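As a rough boto3 sketch of that suggestion (the queue name and payload are hypothetical; note also that in an Elastic Beanstalk worker tier the platform's sqsd daemon normally delivers messages to your app via HTTP POST, so the polling loop here just illustrates raw SQS usage):

```python
# Sketch: web tier enqueues batch jobs, a worker drains them (names assumed).
import json
import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="batch-jobs")   # hypothetical queue

# Web tier: hand off work instead of sharing state directly.
queue.send_message(MessageBody=json.dumps({"task": "nightly-sync"}))

# Worker side: long-poll, process, then delete to acknowledge.
for message in queue.receive_messages(WaitTimeSeconds=10):
    job = json.loads(message.body)
    print("processing", job["task"])
    message.delete()
```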

Using SQS for batch running

I am starting a new architecture to support an MDM (master data management) database. For the MDM database I will be using a graph database (Neo4j), so I need to integrate disparate sources into it. I will have sources such as orders, customer signups, likes from Facebook, and so on.
So I imagine I can have one or multiple SQS queues for the sources, and each source will put messages on SQS.
Then I will have many worker nodes responsible for getting the messages from the queue and updating the MDM database.
At the moment I will use just one worker, because I don't know how Neo4j will perform with multiple writers.
I see many people using SQS to send messages that launch a batch process; it's like job queueing.
Is it a common use case to use SQS to send data messages, or would another AWS component be a better fit?
Using Neo4j with a queueing system is pretty common. I've never used SQS myself, but I'm aware of people using ActiveMQ (or other brokers) to feed data into Neo4j.
This is mostly done to decouple the architecture and provide a way to gracefully deal with load peaks.
Depending on the kind of events you put onto the queue, it might be beneficial if the consumer is single-threaded, especially when the events write to the same part of the graph. Otherwise Neo4j's locking mechanism will prevent concurrent writes onto the same nodes/relationships.
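To make the single-threaded-consumer point concrete, here is a sketch using boto3 and the official Neo4j Python driver (the queue name, credentials, event shape, and Cypher statement are all assumptions): one process drains the queue and applies writes serially, so no two writes ever contend for the same node locks.

```python
# Sketch of a single-threaded SQS -> Neo4j consumer (all names assumed).
import json
import boto3
from neo4j import GraphDatabase

queue = boto3.resource("sqs").get_queue_by_name(QueueName="mdm-events")
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # assumed credentials

with driver.session() as session:
    while True:
        for message in queue.receive_messages(WaitTimeSeconds=20):
            event = json.loads(message.body)
            # MERGE keeps redeliveries idempotent: re-running an event
            # updates the same node instead of creating a duplicate.
            session.run("MERGE (c:Customer {id: $id}) SET c += $props",
                        id=event["id"], props=event.get("props", {}))
            message.delete()
```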

Connect resque to other key-value DB than redis?

I just read a little about Resque here and how you use Redis as an "advanced key-value store" for the jobs.
As you might know, you can use Resque on multiple machines to process the jobs:
Workers can be given multiple queues (a "queue list") and run on multiple machines. In fact they can be run anywhere with network access to the Redis server.
Now my question is: is Resque able to connect to any other key-value database, such as SimpleDB or CouchDB? And if yes, does this even make sense?
No, it is not, as it relies on Redis features written specifically for handling queues, such as BRPOP and LPUSH. CouchDB's and SimpleDB's eventual consistency keeps them from being good candidates for queues. AMQP implementations such as RabbitMQ would be well suited to queueing, but they are not usable with Resque either.
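For reference, the queue pattern those Redis commands enable is tiny; here is a sketch with redis-py (the queue key follows Resque's naming convention, and the job payload is made up):

```python
# Sketch of the Redis list primitives Resque builds on (payload illustrative).
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Producer: enqueue a job, much as Resque.enqueue would.
r.lpush("resque:queue:default", json.dumps({"class": "Archive", "args": [42]}))

# Worker: block until a job arrives (short timeout so the demo terminates).
item = r.brpop("resque:queue:default", timeout=1)
if item:
    key, payload = item
    print("got job from", key.decode(), "->", json.loads(payload))
```

Eventually consistent stores have no equivalent atomic, blocking pop, which is the crux of the answer above.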
