I am looking to automate the starting/restarting of queues with Resque in my Ruby on Rails application. (running on JRuby)
I want to ensure the following criteria are met:
Workers are started after I deploy with capistrano
Workers are restarted if they die for whatever reason
Workers eating too much memory are stopped/restarted and can fire me an email alert
Are there tools that current provide this functionality or at least a subset of it? If there isn't anything that restarts the queue/worker, I would like to be notified at minimum so I can manually do it.
The easiest way to do it would be using a program such as God or Monit to get #2 and #3. For #1, you can just setup your Capistrano script to send a kill -INT to all the Resque workers, then the monitoring program will start them up again.
The advantaged to using kill -INT rather than manually stopping and starting the jobs in the Capistrano script is that your deploy won't have to wait for every worker to stop processing its job to start them back up. It also means if you have a long running job, you will quickly have whatever free workers were running on the new code as quickly as possible.
I'm not especially familiar with it, however I believe the god gem is used frequently for process management.
Related
I am considering if replacing celery with dask. Currently we have a cluster where different jobs are submitted, each one generating multiple tasks that run in parallel. Celery has a killer feature, the "revoke" command: I can kill all the tasks of a given job without interfering with the other jobs that are running at the same time. How can I do that with dask? I only find references saying that this is not possible, but for us it would be a disaster. So don't want to be force the shut down the entire cluster when a calculation goes rogue, thus killing the jobs of the other users.
You can cancel tasks using the Client.cancel command.
If they haven't started yet then they won't start, however if they're already running in a thread somewhere then there isn't much that Python can do to stop them other than to tear down the process.
What is the best way to manage multiple interconnected services for a web applications like:
rails server / unicorn / puma
redis / sidekiq / resque
So that, if one is stopped or started, others get stopped/started too.
Usually a monitoring tool is used for this purpose. One such good tool is God.
The basic idea is to run God as a system service, and configure your sidekiq to be watched by God. When your server restarts, God runs as a service and it will start your sidekiq workers.
You have more benefits by using God, to name just a few:
notifications: you can configure it to send you notifications when your sidekiq worker dies and gets restarted.
resource monitoring: you can configure it to take actions based on predefined rules. For example, restart the job when it consumes too much memory.
Update: just read an article this morning which might be very helpful: Create, run and manage your Ruby background processes with upstart.
Situation: I am using Rails + Unicorn, deploying with Capistrano. Sometimes Rails app fails to start in production mode (though it is not the real production, but a staging env). This usually happens due to errors in deploy scripts or configuration (thus usually not detectable by tests). When this happens, unicorn master process kills the worker that failed and spawns a new one, which also fails and so on and so forth. During all that time unicorn consumes lots of CPU and pollutes logs with the same message.
Manual way (not good): Go to your home page to see if it works. Look at the htop. Tail the logs. Kill unicorn manually. Cons: easy to forget. Logs are polluted, CPU is loaded while you are reacting.
Another solution: Use unicorn's preload_app true. This will cause master process to fail fast. Cons: higher memory consumption in happy scenario.
Best practice: - ???
Is there any way to cleverly detect that unicorn master uselessly tries to spawn failing children and stop it?
You have something like "unicorn start" in your Capistrano script right? Make your Capistrano script ping Unicorn right after invoking that command. If Unicorn does not return an expected response within a timeout, then you know that something went wrong, and you can choose to rollback the deploy or performing some other action.
As for how to ping Unicorn, that depends. If you have Unicorn listening on a TCP socket then you can use curl. If you have Unicorn listening on a Unix domain socket then you have to write a little script that connects to it, like this:
require 'socket'
sock = UNIXSocket.new('/path-to-unicorn.sock')
sock.write("HEAD / HTTP/1.0\r\n")
sock.write("Host: www.foo.com\r\n")
sock.write("Connection: close\r\n")
sock.write("\r\n")
if sock.read !~ /something/
exit 1
end
But it sounds like Phusion Passenger Enterprise solves your problem beautifully. It has this feature called "deployment error resistance". When you deploy a new version and Phusion Passenger detects that it cannot spawn any processes for your new codebase, it will stop trying to spawn your new version and keep the processes for the old versions around indefinitely, until you manually give the signal that it's okay to spawn processes for the new version. In the mean time it will log all errors into the log file so that you can analyze the problem.
I would suggest brushing off your bash skills. The functionality you need is already in Unicorn as it leverages the Unix-y master/worker process.
You need a init.d script. Or at the very least godrb or monit. I recommend the init.d script route AND monitoring. Its more complex, but it can more easily be leveraged by your monitoring software and also gives you an automatic start on reboot.
The gist of it is:
Send the USR2 signal to the unicorn master process, this will fork the master process.
Then send the WINCH to the old master process that gets created, this will kill each worker.
Then you can send the old master process the QUIT signal.
Unicorn Signals
This will spin up a new master process running the new code and label the old one as (old). If it fails the old one should be returned to its prior state and you shouldn't suffer an outage, just a restart error. This is the beauty of unicorn. You can almost get instantaneous deploys of your code.
I'm using a lot of hedge words because I did this work on my apps over a year ago so there are a lot of cobwebs upstairs. Hope this helps!
This is by no mean a correct script. Its a good starting point though ... feel free to update the gist if you can improve upon it! :-)
Example Unicorn Control Script
Suppose I make a little change to my rails app such as changing the html layout. How would I do a rolling restart with Unicorn? Effectively one would like to bring up unicorn processes(or workers instead?) for the newest version of the rails app and then switch traffic from the old unicorn processes/workers to the new ones atomically. From Google searches I couldn't quite get a concrete definitive explanation of how to do this and all the gotchas surrounding it.
There are multiple methods, but one of them is as follows:
Send SIGUSR2 to the master process. Unicorn start a new master with worker processes, that live in parallel with your old master and old worker processes.
Wait until the new master and worker processes have started.
Kill the old master.
Source: http://unicorn.bogomips.org/SIGNALS.html
This is not very memory friendly though. You temporarily need twice the memory usage.
Phusion Passenger Enterprise supports rolling restarts (along with other cool features) but it restarts processes one-by-one and so does not need as much memory. It is possible to script one-by-one rolling restarts in Unicorn using the TTIN and TTOUT signals but Phusion Passenger does everything automatically for you without scripting.
I use delayed_job as a daemon https://github.com/tobi/delayed_job/wiki/Running-Delayed::Worker-as-a-daemon
I can't tell why, but sometimes I see more than one job done by several workers (different pids), and running stop doesn't stop anything. is there a way to kill all daemons of this proc/all workers? Or kill a specific pid (I'm on a shared hosting so kill/killall aren't available for me).
Not having access to "kill" in this setup will quickly become a PITA, and it boggles my mind that you wouldn't have the ability to kill processes you yourself started.
For increased worker dependability, you might want to try the collectiveidea fork of delayed_job, and using the daemon-spawn gem rather than daemons. I've had better luck with that combination.