I think each task contains an instance of a spout or bolt, and a while or for loop calls it. Is that right?
If so, since every task is mapped to one of the threads running in a worker process, and two or more tasks of the same spout or bolt may be assigned to the same worker, do we need to synchronize in that case (especially if the spout or bolt holds shared resources such as static members)? Why?
Yes, several tasks of the same spout/bolt can be assigned to the same worker and run in the same JVM. I recommend avoiding static members that are not thread-safe; if you do, you won't need to worry about synchronization.
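To make that concrete, here is a minimal sketch (assuming a recent Apache Storm release with the org.apache.storm package layout; the class and field names are made up for illustration) of a bolt that keeps its mutable state in per-instance fields, so several tasks of the same bolt running in one worker JVM never touch unsynchronized shared data, and that falls back to an AtomicLong for the rare case where state really must be shared across tasks:

import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class CountingBolt extends BaseRichBolt {

    // Safe without locking: each task gets its own bolt instance,
    // so plain instance fields are never shared between threads.
    private long localCount;

    // If state really must be shared by all tasks in the same JVM,
    // use a thread-safe type instead of a plain static counter.
    private static final AtomicLong SHARED_COUNT = new AtomicLong();

    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.localCount = 0;
    }

    @Override
    public void execute(Tuple tuple) {
        localCount++;                   // per-task state, no synchronization needed
        SHARED_COUNT.incrementAndGet(); // shared state, updated atomically
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this sketch emits nothing
    }
}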
Looking for guidance on Reactor schedulers.
I want to run certain IO tasks in the background, e.g. sending emails to the tech team. To make this asynchronous I use Mono.fromRunnable subscribed on a scheduler.
I have a choice between Schedulers.elastic() and Schedulers.newElastic(). I prefer the latter because it lets me give the scheduler a unique name, which helps with log analysis.
Is it OK to make a static variable, e.g.
Scheduler emailSched = Schedulers.newElastic("email");
and subscribeOn my Mono to it, or should I create a new Scheduler instance every time?
I only found What is the difference between Schedulers.newElastic and Schedulers.elastic methods?, and that did not help much with my question.
should I create a new Scheduler instance every time?
There's no technical reason why you need to if you don't want to. In most instances it probably doesn't matter.
The key differences are:
You can give it a different name if you want (trivial)
Any individual elastic scheduler will cache and reuse the executors it creates under the hood, with a default timeout of 60 seconds. That caching is not shared between different scheduler instances of the same name, however.
You can dispose of any individual elastic scheduler without affecting other schedulers of the same name.
In the case you describe, none of those really come into play.
Separately from the above, note that Schedulers.boundedElastic() is now the preferred option, especially for wrapping blocking IO (which seems to be what you're doing here).
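For completeness, here's a minimal sketch of the shared-scheduler approach (the class and method names are illustrative, and it assumes a Reactor version that provides Schedulers.newBoundedElastic): one named scheduler created once and reused for every email send, with the blocking call wrapped in Mono.fromRunnable and shifted onto it via subscribeOn.

import reactor.core.publisher.Mono;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

public class EmailNotifier {

    // One shared, named scheduler for the whole application; bounded-elastic
    // schedulers are intended for wrapping blocking IO such as SMTP calls.
    private static final Scheduler EMAIL_SCHEDULER =
            Schedulers.newBoundedElastic(4, 100, "email");

    public void notifyTechTeam(String message) {
        Mono.fromRunnable(() -> sendBlockingEmail(message)) // wrap the blocking call
            .subscribeOn(EMAIL_SCHEDULER)                   // run it on the named scheduler
            .subscribe();                                   // fire and forget
    }

    private void sendBlockingEmail(String message) {
        // placeholder for the actual blocking email-client call
    }
}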
We have jobs that interact with native code, and there are unavoidable memory leaks while the worker is processing tasks. The simple solution to our problem has been to restart the worker after a specified number of tasks.
We are migrating from Python's multiprocessing, which has a useful maxtasksperchild option that shuts down workers after a specified number of tasks.
Is there something built-in in dask that is comparable to maxtasksperchild?
As a workaround, we are keeping track of the workers that have completed a task by appending their worker address to the result payload and manually calling retire_workers on the client side.
No, there is no such equivalent in Dask.
As per the title, if I am creating workers via Helm or Kubernetes, is it possible to assign "worker resources" (https://distributed.readthedocs.io/en/latest/resources.html#worker-resources) after the workers have been created?
The use case is tasks that hit a database: I would like to limit the number of processes able to hit the database in a given run, without limiting the total size of the cluster.
As of 2019-04-09 there is no standard way to do this. You've found the Worker.set_resources method, which is reasonable to use. Eventually I would also expect Worker plugins to handle this, but they aren't implemented.
For your application of controlling access to a database, it sounds like what you're really after is a semaphore. You might help build one (it's actually decently straightforward given the current Lock implementation), or you could use a Dask Queue to simulate one.
I have a scenario where I have long-running jobs that I need to move to a background process. Delayed Job with a single worker would be very simple to implement, but would run very, very slowly as jobs mount up. Much of the work is slow because the thread has to sleep while waiting on various remote API calls, so running multiple workers concurrently is the obvious choice.
Unfortunately, some of these jobs are dependent on each other. I can't run two jobs belonging to the same identifier simultaneously. Order doesn't matter, only that exactly one worker can be working on a given ID's work.
My first thought was named queues, naming each queue after an identifier, but the identifiers are dynamic data. We could be running ID 1 today, 5 tomorrow, 365849 and 645609 the next, and so on. That's too many named queues. Not only would giving each one a single worker probably exceed available system resources (as well as being incredibly wasteful, since most of them won't be active at any given time), but since workers are configured through environment variables rather than in code, I'd wind up with some insane config files. And creating a sane pool of N generic workers could end up with all N workers running on the same queue if that's the only queue with work to do.
So what I need is a way to prevent two jobs sharing a given ID from running at the same time, while allowing any number of jobs not sharing IDs to run concurrently.
If I have a function that can be executed asynchronously, with no dependencies and no other functions requiring its results directly, should I use spawn? In my scenario I want to keep consuming a message queue, so spawning would relieve my blocking loop. But in other situations where I could distribute function calls as much as possible, would that negatively affect my application?
Overall, what would be the pros and cons of using spawn?
Unlike operating system processes or threads, Erlang processes are very lightweight. There is minimal overhead in starting, stopping, and scheduling new processes. You should be able to spawn as many of them as you need (the maximum per VM is in the hundreds of thousands). The Actor model Erlang implements allows you to think about what is actually happening in parallel and write your programs to express that directly. Avoid complicating your logic with work queues if you can.
Spawn a process whenever it makes logical sense, and optimize only when you have to.
The first thing that comes to mind is the size of the parameters. They will be copied from your current process to the new one, and if the parameters are huge this may be inefficient.
Another problem that may arise is bloating the VM with so many processes that your system becomes unresponsive. You can overcome this by using a pool of worker processes, or a special monitor process that allows only a limited number of such processes to run at once.
so spawning would relieve my blocking loop
If you are in a situation where a loop receives many messages requiring independent actions, don't hesitate to spawn a new process for each message; this way you will take advantage of your machine's multicore capabilities (if any). As kjw0188 says, Erlang processes are very lightweight, and if the system hits the limit on the number of processes alive in parallel (assuming your code is reasonable), it is more likely that the application is overloading the capacity of the node.