Can or should Schedulers.newElastic be reused? - project-reactor

Looking for guidance on reactor schedulers.
I want to run certain IO tasks in the background, e.g. sending emails to the tech team. To make them asynchronous, I use Mono.fromRunnable subscribed on a scheduler.
I have a choice to either use Schedulers.elastic() or Schedulers.newElastic(). I prefer the latter because it lets me give the scheduler a unique name, which helps with log analysis.
Is it ok to make a static variable e.g.
Scheduler emailSched = Schedulers.newElastic("email");
and subscribe my Mono on it, or should I create a new Scheduler instance every time?
I found only "What is the difference between Schedulers.newElastic and Schedulers.elastic methods?", which did not really answer my question.

should I create a new Scheduler instance every time?
There's no technical reason why you need to if you don't want to. In most instances it probably doesn't matter.
The key differences are:
You can give it a different name if you want (trivial)
Any individual elastic scheduler will cache and reuse the executors it creates under the hood, with a default timeout of 60 seconds. That caching is not shared between different scheduler instances of the same name, however.
You can dispose any individual elastic scheduler without affecting other schedulers of the same name.
In the case you describe, none of those really factor into play.
Separately from the above, note that Schedulers.boundedElastic() is now the preferred option, especially for wrapping blocking IO (which seems to be what you're doing here).

Related

Is it possible to assign worker resources to dask distributed worker after creation?

As per title, if I am creating workers via helm or kubernetes, is it possible to assign "worker resources" (https://distributed.readthedocs.io/en/latest/resources.html#worker-resources) after workers have been created?
The use case is tasks that hit a database, I would like to limit the amount of processes able to hit the database in a given run, without limiting the total size of the cluster.
As of 2019-04-09 there is no standard way to do this. You've found the Worker.set_resources method, which is reasonable to use. Eventually I would also expect Worker plugins to handle this, but they aren't implemented yet.
For your application of controlling access to a database, it sounds like what you're really after is a semaphore. You might help build one (it's actually decently straightforward given the current Lock implementation), or you could use a Dask Queue to simulate one.

delayed_job: One job per tenant at a time?

I have a multitenant-Rails app with multiple delayed_job workers.
In order to avoid overlapping tenant-specific work, I would like to separate the workers from each other in such a way that each one works on only one tenant-specific task at a time.
I thought about using the (named) queue column and adding "tenant_1", "tenant_2" and so on. Unfortunately the queues have to be named during configuration, so this approach is not flexible enough for many tenants.
Is there a way to customize the way delayed_job picks the next task? Is there another way to define a scope?
Your best bet is probably to spin up a custom solution that implements a distributed lock. Essentially, the workers all run normally and pull from the usual queues, but before performing work they check with another system (Redis, an RDBMS, an API, whatever) to verify that no other worker is currently performing a job for that tenant. If the tenant is not being worked, set the lock for the tenant in question and work the job. If the tenant is locked, don't perform the work. A lot of the implementation details are your call: whether to move on and try another job, re-enqueue the job at the back of the queue, consider it a failure and bind it to your retry limits, or do something else entirely. This is pretty open-ended, so I'll leave the details to you (a rough sketch follows the tips below), but here are some tips:
Inheritance will be your friend; define this behavior on a base job and inherit from it on the jobs you expect your workers to run. This also lets you customize the behavior if "special" cases come up for certain jobs, without breaking everything else.
Assuming you're not running through ActiveJob (since it wasn't mentioned), read up on delayed_job hooks: https://github.com/collectiveidea/delayed_job/#hooks - they may be an appropriate and/or useful tool
Get familiar with some of the differences and tradeoffs in Pessimistic and Optimistic locking strategies - this answer is a good starting point: Optimistic vs. Pessimistic locking
Read up on general practices surrounding the concept of distributed locks so you can choose the best tools and strategies for yourself. It doesn't have to be a crazy complicated solution; a simple table in the database that stores the tenant identifier is sufficient, but you'll want to consider the failure cases - how do you manage abandoned locks, for example?
Seriously consider not doing this; is it really strictly required for the system to operate properly? If so, that's probably indicative of an underlying flaw in your data model or how you've structured transformations around that data. Strive for ACIDity in your application when thinking about operations on the data and you can avoid a lot of these problems. There's a reason it's not a commonly available "out of the box" feature on background job runners. If there is an underlying flaw, it won't just bite you on this problem but on something else - guaranteed!
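As a rough sketch of the locking approach (assuming the redis gem; the class name, key names, and TTL are all made up for illustration):

require 'redis'

# Hypothetical base job implementing a per-tenant distributed lock.
class TenantScopedJob
  REDIS = Redis.new
  LOCK_TTL = 10 * 60 # seconds; expires locks abandoned by crashed workers

  def initialize(tenant_id)
    @tenant_id = tenant_id
  end

  def perform
    # SET with nx: true acquires the lock only if no other worker holds it.
    if REDIS.set("tenant_lock:#{@tenant_id}", "1", nx: true, ex: LOCK_TTL)
      begin
        perform_for_tenant
      ensure
        REDIS.del("tenant_lock:#{@tenant_id}")
      end
    else
      # Tenant is busy: re-enqueue for later (one of the options discussed above).
      Delayed::Job.enqueue(self, run_at: 30.seconds.from_now)
    end
  end

  def perform_for_tenant
    raise NotImplementedError, "subclasses implement the tenant-specific work"
  end
end

Subclasses then only implement perform_for_tenant, and the base class decides whether the job may run at all.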
If you are trying to prevent two different workers from ever working on the same tenant, that's a bad design choice; something smells, fix that first. However, if you want instances of the same kind of worker to each work on a different tenant, the easiest solution is below. The model relationships are my assumptions.
ExpiredOrderCleaner = Struct.new(:tenant_id) do
  def perform
    Order.where(tenant_id: tenant_id).expired.delete_all
  end
end

# One job per tenant; find_each batches the tenant query.
Tenant.find_each do |tenant|
  Delayed::Job.enqueue ExpiredOrderCleaner.new(tenant.id)
end
This will create a unique job for each tenant, and a single worker instance will work on a specific tenant at a time. However, other kinds of jobs can still work on the same tenant, which is good, as it should be. If you need a smaller scope, just pass more arguments to the worker, use them in the query, and use database transactions to avoid collisions.
These best practices apply to any background worker:
Make your jobs idempotent and transactional, meaning a job can safely execute multiple times
Embrace concurrency: design your jobs so you can run lots of them in parallel
Your work will be a lot easier if you use the apartment gem and Active Job wrappers; see the examples in their documentation.

Interval based API access and processing different DSL

Background
I'm currently working on a small Rails 5 project that needs to access and process an external API. There is a ruby wrapper gem available for the API, so accessing the data is not a problem.
Problem description
There are two parts of the equation that I am currently missing, and hoping someone out there can help me with.
1: I need to call the API, via Rails, every 15 minutes. How can I realize this? I was looking towards Active Job for this, but my research kind of stalled after getting no useful results.
2: The external API has different domain models and a different domain-specific language than my application. How can I map the different models without changes in Active Record?
1: I need to call the API, via Rails, every 15 minutes. How can I realize this? I was looking towards Active Job for this, but my research kind of stalled after getting no useful results.
You can solve the first problem using recurring tasks. The main idea is to run a process that performs some operation every x minutes (or days, or whatever fits your problem).
There are several tools you can use. One of them is built into Unix systems: cron. You can read about it in the system's manual, and you can easily manage it using the whenever gem. The main disadvantage is that you need access to the system's cron, which may be non-trivial on non-bare-metal machines (for example, Platform as a Service hosts such as Heroku).
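For example, with whenever the schedule lives in config/schedule.rb and is installed into the crontab with the whenever command (the rake task name here is an assumption about your app):

# config/schedule.rb (whenever gem)
every 15.minutes do
  # Hypothetical rake task that wraps the API call.
  rake "api:fetch_external_data"
end

Running whenever --update-crontab then writes the corresponding cron entry.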
You should also take a look at clockwork, which does not rely on the system's cron. It takes the approach of having a separate process running all the time that keeps an eye on the defined tasks.
In the second approach (having a separate process), you need to remember that time-consuming instructions may "lock" the process and postpone other tasks. In this case, you may want to use background processing such as sidekiq or delayed_job. The idea is to use one process to schedule tasks at a certain time and another process to work those tasks as soon as they appear in the queue, as sketched below.
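A minimal clockwork setup following that idea might look like this (ApiSyncJob is a hypothetical Active Job; the scheduling process only enqueues, and the worker process does the slow API call):

# clock.rb (clockwork gem)
require 'clockwork'
require './config/boot'
require './config/environment'

module Clockwork
  # Enqueue instead of calling the API here, so a slow request
  # cannot block the scheduling process.
  every(15.minutes, 'api.sync') do
    ApiSyncJob.perform_later
  end
end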
2: The external API has different domain models and a different domain-specific language than my application. How can I map the different models without changes in Active Record?
You need to create a client that consumes the API and maps its responses into models that you have in your application. This way, you don't make your models' schema dependent on the API's schema. Take a look at the resource_kit gem - it is a sample solution that uses this approach.
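A minimal sketch of that idea (every class and field name below is invented for illustration): the client consumes the API and returns plain Ruby objects in your own vocabulary, so the external schema never leaks into Active Record.

require 'net/http'
require 'json'

# Plain Ruby value object using your app's naming, not the API's.
Report = Struct.new(:title, :published_at)

class ExternalApiClient
  BASE_URL = "https://api.example.com"

  def reports
    raw = JSON.parse(Net::HTTP.get(URI("#{BASE_URL}/v1/reports")))
    # The mapping from the API's field names to yours happens in one place.
    raw.map { |item| Report.new(item["headline"], item["created"]) }
  end
end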
Hi hdauven, calling the API every 15 minutes will affect your server's performance, so do it with Sidekiq, a background job library, and use Sidetiq, which will help you perform the task every 15 minutes automatically.
You are just accessing an API, so why are you worrying about the different domain?

What's the difference between zerg and cheaper

It seems that cheaper and zerg (with broodlord) do much the same thing: that is, spawning additional workers when needed.
What is the difference, and why would I use one and not the other?
With cheaper you choose the maximum number of processes an instance can spawn; there is no way to modify it without changing the config and reloading.
Zerg mode allows you to run new instances (even with a different config) attached to the same socket used by an already running instance.
This allows various forms of autoscaling.

What available message solutions are there for inter-process communication in ruby?

I have a rails app using delayed_job. I need my jobs to communicate with each other for things like "task 5 is done" or "this is the list of things that need to be processed for task 5".
Right now I have a special table just for this, and I always access the table inside a transaction. It's working fine. I want to build out a cleaner api/dsl for it, but first wanted to check if there were existing solutions for this already. Weirdly, I haven't found a single thing; I'm either googling completely wrong, or the task is so simple (set and get values inside a transaction) that no one has abstracted it out yet.
Am I missing something?
Clarification: I'm not looking for a new queueing system; I'm looking for a way for background tasks to communicate with one another. Basically just safely shared variables. Do the frameworks below offer this facility? It's a shame that delayed_job does not.
use case: "do these 5 tasks in parallel, and then when they are all done, do this 1 final task." So, each of the 5 tasks checks to see if it's the last one, and if it is, it fires off the final task.
I use resque. Also there are lots of plugins, which should make inter-process comms easier.
Using redis has another advantage: you can use the pub-sub channels for communication between workers/services.
Another approach (but untested by me): http://www.zeromq.org/, which also has Ruby bindings. If you like to test new stuff, try zeromq.
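As a concrete illustration of the Redis approach applied to your use case (the key names and the FinalTask class are made up; assumes the redis gem): seed a counter with the number of parallel tasks, have each task decrement it when it finishes, and let whichever task reaches zero enqueue the final one.

require 'redis'

REDIS = Redis.new

# Before enqueueing the 5 parallel tasks:
REDIS.set("batch:42:remaining", 5)

# At the end of each task's perform:
def task_finished(batch_id)
  # DECR is atomic, so exactly one worker observes zero even if
  # several tasks finish at the same moment.
  remaining = REDIS.decr("batch:#{batch_id}:remaining")
  Delayed::Job.enqueue(FinalTask.new(batch_id)) if remaining.zero?
end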
Update
To clarify/explain/extend my comments above:
The reason to switch from DelayedJob to Resque is the advantage mentioned above: queueing and messaging live in one system, because Redis offers both.
Further sources:
https://github.com/blog/542-introducing-resque
https://github.com/defunkt/resque#readme
If I had to stay on DJ, I would extend the worker classes with redis or zeromq/0mq (only examples here) to get messaging into my existing background jobs.
I would not try messaging with ActiveRecord/MySQL (not even queueing, actually!) because this database isn't the best-performing system for this use case, especially if the application has many background workers, huge queues, and countless message exchanges in short periods.
If it is a small app with few workers, you could also implement simple messaging via the DB, but even here I would prefer memcache instead; messages are short-lived data chunks that can be handled entirely in memory.
Shared variables will never be a good solution. Think of the multiple machines that your application and your workers can live on. How would you ensure safe variable transfer between them?
Okay, someone could mention DRb (distributed Ruby), but it doesn't seem to be used much anymore (I've never seen a real-world example so far).
If you want to play around with DRb however, read this short introduction.
My personal preference order: Messaging (real) > Database driven messaging > Variable sharing
memcached
rabbitmq
You can use pipes:
require 'timeout'

reader, writer = IO.pipe

# Child process: writes a marshalled payload twice a second.
fork do
  loop do
    payload = { name: 'Kris' }
    # Fine for this demo, but Marshal output is binary, so a
    # delimiter-safe encoding is safer in general.
    writer.puts Marshal.dump(payload)
    sleep(0.5)
  end
end

# Parent process: reads messages, timing out when none arrive.
loop do
  begin
    Timeout.timeout(1) do
      puts Marshal.load(reader.gets) # => { name: 'Kris' }
    end
  rescue Timeout::Error
    # no-op, no messages to receive
  end
end
Pipes are one-way and are read as a byte stream. A pipe is expressed as a pair, a reader and a writer; to get two-way communication you need two pipes, as sketched below.
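A minimal sketch of two-way communication with two pipes (one per direction):

to_child_r, to_child_w = IO.pipe
to_parent_r, to_parent_w = IO.pipe

fork do
  # Child: close the ends it does not use, then echo one request.
  to_child_w.close
  to_parent_r.close
  request = to_child_r.gets
  to_parent_w.puts "echo: #{request.chomp}"
end

# Parent: close the ends it does not use, send, then receive.
to_child_r.close
to_parent_w.close
to_child_w.puts "hello"
puts to_parent_r.gets # => "echo: hello"
Process.wait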
