OK, I have an issue with worker management.
I have 30 (call it n) clients.
Each has a pile of things/jobs to get done.
Each has their own "schema" in PostgreSQL.
None should ever block another.
So my thinking was to have a queue for each client.
But then I have the problem of how to handle the queues. A worker covering, say, 10 queues would have the same problem: it would never get to client 2's stuff while it was working through client 1's.
A worker for each queue? Okay, that's expensive. We could create a worker per client and leave it running at all times (we've got piles of dough, just like everyone else).
So spinning up workers on the fly seems the best option.
But then we've got the build-up and tear-down cost, and scheduling is also a right pain.
A suggestion was put forward to have a single worker doing nothing but starting and stopping other workers.
The issue I have is that I don't want the build-up and tear-down at all.
So here's my thinking: I might have a worker on Q_A who's only got one job left for client A. Could I switch his queue, i.e. make him work on Q_B's stuff?
I was (for a second) thinking of switching the queue that a job was assigned to over to the existing worker's queue, but then new stuff for Q_A would be stuck behind it.
Any ideas? Alternatives to switching a worker's queue would be much appreciated.
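To make the queue-switching idea concrete, here is a rough sketch of a long-lived worker that has a home queue it always checks first and only borrows a job from another client's queue when its own is empty, so new Q_A work never gets stuck behind borrowed Q_B work. The queue names and the pop_job/run helpers are hypothetical placeholders, not an existing API:

    # Sketch only: pop_job(queue) is assumed to be a non-blocking pop that
    # returns nil when the queue is empty; run(job) executes one job.
    def pop_next(home_queue, other_queues)
      job = pop_job(home_queue)          # the home client always gets first look
      return job if job
      other_queues.each do |q|           # otherwise borrow a single job elsewhere
        job = pop_job(q)
        return job if job
      end
      nil
    end

    def work_loop(home_queue, other_queues)
      loop do
        job = pop_next(home_queue, other_queues)
        job ? run(job) : sleep(1)        # back off briefly when every queue is empty
      end
    end

    work_loop("q_client_a", %w[q_client_b q_client_c])

Because the home queue is re-checked on every pass, the worker effectively switches itself between queues without any build-up or tear-down.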
Related
I am currently using the delayed_job_active_record gem to run some long-running scheduled tasks. The processes run in the background on a separate worker dyno on Heroku and rarely go wrong, but in some cases I would like to be able to stop a process mid-run. I have been running the processes locally, and because of my setup the scheduled tasks only kick off the process, which is essentially a very long loop.
Using
bin/delayed_job stop
only stops new jobs from being picked up; since the process has already started, it doesn't stop it.
Because of this, I can't seem to stop the process once it has got going without restarting the entire dyno. This seems a bit excessive but is my only option at the moment.
Any help is greatly appreciated
I don't think there's any way to interrupt it without essentially killing the process like you are doing. I would usually delete the job record in the database and then terminate the worker running it, so it doesn't just retry the job (if you've got retries enabled for that job).
Another option: since you know it's long-running and, I imagine, has multiple steps, modularize the operation and/or add periodic checks for a 'cancelled' flag you put somewhere in the model(s). If you detect the cancel request, you can give up and do any cleanup needed. This is probably preferable anyway, since it lets you manage what happens on abort more explicitly.
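For example, a minimal sketch of that cancelled-flag approach with a delayed_job custom job, assuming a boolean cancelled column on the model that drives the run (all class, column and helper names here are illustrative, not the asker's actual code):

    class LongImportJob < Struct.new(:import_id)
      def perform
        import = Import.find(import_id)

        import.rows.find_each do |row|
          # Re-read the flag between steps so a cancel request takes effect
          # without killing the worker or restarting the dyno.
          if import.reload.cancelled?
            cleanup(import)   # hypothetical hook for any tidy-up work
            return
          end

          process(row)        # one unit of the long-running loop
        end
      end
    end

    # Enqueue as usual; flipping cancelled to true aborts the run at the next check.
    Delayed::Job.enqueue(LongImportJob.new(42))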
Some time ago I wrote a small Ruby application which uses Sidekiq to convert video files and push them on to a few online video hosting services. I use two workers and two queues: one to actually convert the files and a second to publish the converted files. Jobs are pushed to the first queue by the Rails application for conversion, and after successful processing the conversion worker pushes an upload job to the second queue.
Rails -> Converter Queue -> Uploader Queue
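As a minimal sketch of that pipeline (class, queue and method names are assumptions, not the actual code): Rails enqueues into the converter queue, and on success the conversion worker enqueues an upload job into the uploader queue.

    class ConverterWorker
      include Sidekiq::Worker
      sidekiq_options queue: :converter

      def perform(video_id)
        path = convert(video_id)                  # hypothetical conversion step
        UploaderWorker.perform_async(video_id, path)
      end
    end

    class UploaderWorker
      include Sidekiq::Worker
      sidekiq_options queue: :uploader

      def perform(video_id, path)
        upload(video_id, path)                    # hypothetical publish step
      end
    end

    # From the Rails application:
    ConverterWorker.perform_async(video.id)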
Recently I discovered a massive memory leak in the converter library which appears after every few jobs and overloads the whole server, so I did a little hack to avoid it by stopping the whole Sidekiq worker process with an Interrupt exception and having systemd start it again.
That worked perfectly until yesterday, when I got a notification from my client that files are not being converted. I did some investigating to find out what was failing and found that jobs are not being added to the converter queue. It started failing without any changes in code or services. When Rails adds a job to the Sidekiq queue it receives a proper job ID, with no exception or warning at all, but the job simply never appears in any queue. I checked the Redis logs, the systemd logs, dmesg, every log I could check, and did not find even the slightest warning; it seems the jobs get lost in a vacuum. In fact, after more digging and debugging, I discovered that if a job is pushed rapidly (100 times in a loop) there is a chance Sidekiq will add it to the queue. Of course, sometimes it will add all of the jobs, and sometimes not even a single one.
The second queue works perfectly; it picks up every single job I add to it. When I try to add 1000 new jobs, the second queue queues them all, while the converter queue gets at best 10. Things get really weird when I try to use another queue: I pushed 100 jobs to a new queue, all of them were added properly, and then I pointed the conversion worker at that new queue. It works: I can add new jobs to that queue and all of them seem to be pushed successfully, but as soon as the worker finishes processing the jobs that were pushed before it was assigned to the queue, it starts failing again. Disabling the code that restarts the worker after every job didn't help at all.
The funny thing is that jobs do in fact get pushed to the queue, but only when I push them multiple times, and it seems totally random when a job is added properly. This bug appeared out of nowhere: for a few months everything worked perfectly, and then it started failing without any changes in code or on the server. The logs are perfectly clear, and Sidekiq is used with the same Redis server, without any problems, by a few other applications; it seems that only this particular worker has the problem. I have not found any references to a similar bug on the web, and I have spent two days trying to debug this and find the source of the weird behaviour. I found nothing; everything seems to work, and jobs are simply disappearing somewhere between the push and the Redis database.
We use delayed_job in our web application, and we need multiple delayed_job workers running in parallel, but we don't know how many will be needed.
The solution I'm currently trying is running one worker and calling fork/Process.detach inside the task that needs it.
I previously tried calling fork directly in the Rails application, but it didn't play well with Passenger.
This solution seems to work well. Could there be any caveats in production?
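For reference, a rough sketch of the fork/Process.detach approach described above, assuming a Struct-style delayed_job task; the names are illustrative:

    class ParallelTask < Struct.new(:record_id)
      def perform
        pid = fork do
          # The forked child must not reuse the parent's database connection.
          ActiveRecord::Base.establish_connection
          do_heavy_work(record_id)   # hypothetical long-running work
        end

        # Detach so the finished child is reaped and doesn't linger as a zombie.
        Process.detach(pid)
      end
    end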
One issue that happened to me today, and that anyone trying this should watch out for, was the following:
I noticed that the worker was down, so I started it. Something I didn't think about was that there were 70 jobs waiting in the queue, and since the processes are forked, they pretty much killed our server for around half an hour by all starting almost immediately and eating all the memory in the process. :]
So ensuring that god is watching over the worker is important.
The worker also seems to die often, but I'm not sure yet whether that's connected to the forking.
I need some advice on writing a job scheduler in Erlang which is able to distribute jobs (external OS processes) over a set of worker nodes. A job can last from a few milliseconds to a few hours. The "scheduler" should be a global registry where jobs come in, get sorted, and then get assigned to and executed on connected "worker nodes". Worker nodes should be able to register with the scheduler by telling it how many jobs they can process in parallel (slots). Worker nodes should be able to join and leave at any time.
An Example:
Scheduler has 10 jobs waiting
Worker Node A connects and is able to process 3 jobs in parallel
Worker Node B connects and is able to process 1 job in parallel
Some time later, another worker node joins which is able to process 2 jobs in parallel
Questions:
I seriously spent some time thinking about the problem, but I am still not sure which way to go. My current solution is to have a globally registered gen_server for the scheduler which holds the jobs in its state. Every worker node spawns N worker processes and registers them with the scheduler. The worker processes then pull a job from the scheduler (an infinitely blocking call, implemented by returning {noreply, ...} when no jobs are currently available).
Here are some questions:
Is it a good idea to assign every new job to an existing worker, knowing that I will have to re-assign the job to another worker when new workers connect? (I think this is how the Erlang SMP scheduler does things, but reassigning jobs seems like a big headache to me.)
Should I start a process for every worker processing slot and where should this process live: on the scheduler node or on the worker node? Should the scheduler make rpc calls to the worker node or would it be better for the worker nodes to pull new jobs and then execute them on their own?
And finally: Is this problem already solved and where to find the code for it? :-)
I already tried RabbitMQ for job scheduling but custom job sorting and deployment adds a lot of complexity.
Any advice is highly welcome!
Having read your answer in the comments, I'd still recommend using pool(3):
Spawning 100k processes is not a big deal for Erlang because spawning a process is much cheaper than in other systems.
One process per job is a very good pattern in Erlang: start a new process, run the job in that process keeping all the state in the process, and terminate the process after the job is done.
Don't bother with worker processes that process a job and then wait for a new one. This is the way to go if you are using OS processes or threads, because spawning is expensive, but in Erlang it only adds unnecessary complexity.
The pool facility is useful as a low-level building block; the only thing it lacks for your functionality is the ability to start additional nodes automatically. What I would do is start with pool and a fixed set of nodes to get the basic functionality.
Then add some extra logic that watches the load on the nodes, e.g. like pool itself does with statistics(run_queue). If you find that all nodes are over a certain load threshold, just slave:start/2,3 a new node on an extra machine and use pool:attach/1 to add it to your pool.
This won't rebalance old running jobs, but new jobs will automatically be moved to the newly started node since it's still idle.
With this you get fast, pool-controlled distribution of incoming jobs and a slower, totally separate way of adding and removing nodes.
If you get all this working and still find out, after some real-world benchmarking please, that you need rebalancing of jobs, you can always build something into the jobs' main loops: after a rebalance message, a job can respawn itself via the pool master, passing its current state as an argument.
Most importantly, just go ahead and build something simple that works, and optimize it later.
My solution to the problem:
"distributor" - gen_server,
"worker" - gen_server.
"distributor" starts "workers" using slave:start_link, each "worker" is started with max_processes parameter,
"distributor" behavior:
handle_call(submit, ...)
* put the job on the queue,
* cast check_queue to itself
handle_cast(check_queue, ...)
* gen_server:call all workers for their load (current_processes / max_processes),
* find the least busy,
* if the chosen worker's load is < 1, gen_server:call(submit, ...) that worker with the next job (if any) and remove the job from the queue,
"worker" behavior (trap_exit = true):
handle_call(report_load, ...)
* return current_processes / max_processes,
handle_call(submit, ...)
* spawn_link the job,
handle_info({'EXIT', Pid, Reason}, ...) (with trap_exit set, the exit of a linked job arrives as an ordinary message, so it is handled in handle_info rather than handle_call)
* gen_server:cast the distributor with check_queue
In fact it is more complex than that, as I need to track running jobs and kill them if I need to, but that is easy to implement in such an architecture.
This is not a dynamic set of nodes, though, but you can start a new node from the distributor whenever you need one.
P.S. Looks similar to pool, but in my case I am submitting port processes, so I need to limit them and have better control of what is going where.
Here's my simple ideal case scenario for when I'd like delayed job to run:
When the first application server (whether through mongrel or passenger) starts, it'll start my delayed job workers.
When the last running application server terminates, it'll kill all the delayed job workers.
The first part (starting) is doable, although I'm not sure what the "right" or "best" way to do it is. Just make a conditional (on the process not already running) system call to delayed_job start?
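For the starting half, a rough sketch of that conditional system call, e.g. from a Rails initializer (the pid-file path is an assumption, and delayed_job may append a worker number to it):

    pid_file = Rails.root.join("tmp/pids/delayed_job.pid")

    already_running =
      File.exist?(pid_file) &&
      begin
        Process.kill(0, File.read(pid_file).to_i)  # signal 0 just probes the pid
        true
      rescue Errno::ESRCH                          # stale pid file, process is gone
        false
      rescue Errno::EPERM                          # exists but owned by another user
        true
      end

    system("#{Rails.root}/bin/delayed_job start") unless already_running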
The second part (terminating): well, I'm not sure whether it's doable or not. I definitely have no idea how this effect could be accomplished.
Any thoughts or ideas?
Is there another way that you start/end delayed job workers that you think is best?
Side question:
The main questions above are for the production environment, which is the harder case because there are multiple app servers running at the same time. Could the same thing be done easily in the development environment (where there's guaranteed to be only one application server, not a cluster of them) by forking a child process to run the delayed_job workers, one that would always terminate when the parent terminates? How would I go about doing this?
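For the development-only case, a rough sketch of the terminate-with-the-parent idea, using spawn rather than a raw fork and the stock rake jobs:work foreground worker (the initializer placement and paths are assumptions):

    if Rails.env.development?
      worker_pid = spawn("rake jobs:work", chdir: Rails.root.to_s)
      Process.detach(worker_pid)

      # When the single app server exits, take the worker down with it.
      at_exit do
        begin
          Process.kill("TERM", worker_pid)
        rescue Errno::ESRCH
          # worker already gone
        end
      end
    end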
You could definitely pull the termination off with god.
Simply watch the app processes and god will fire a callback when they're all stopped.
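A rough sketch of what that god config could look like. The start/stop commands and paths are assumptions, and the stop_if hook plus the custom poll condition for "every app server process has exited" should be checked against your god version before relying on them:

    module God
      module Conditions
        # Custom poll condition: true once no app server processes remain.
        class NoAppServers < PollCondition
          def valid?
            true
          end

          def test
            `pgrep -f 'mongrel|passenger'`.strip.empty?
          end
        end
      end
    end

    God.watch do |w|
      w.name     = "delayed_job"
      w.interval = 30.seconds
      w.start    = "cd /var/www/app && RAILS_ENV=production bin/delayed_job start"
      w.stop     = "cd /var/www/app && RAILS_ENV=production bin/delayed_job stop"
      w.pid_file = "/var/www/app/tmp/pids/delayed_job.pid"

      # Assumed hook: once the last app server process has exited, run w.stop.
      w.stop_if do |stop|
        stop.condition(:no_app_servers) do |c|
          c.interval = 30.seconds
        end
      end
    end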