What is a common approach to making sure that a Rails server is auto-restarted after a serious crash or a process kill? How do you deal with hanging processes? I have nginx and thin running on my production server - would you suggest putting something in between them? Or using another server?
Firstly:
You should identify the cause of a process hang or kill. These are not normal behaviours and indicate a fault somewhere.
Look for:
Insufficient memory or high load before a crash - indicates a configuration problem.
Versions of nginx that are too new.
If you're virtualising, this can cause a number of subtle problems with Linux kernels that may lead to segfaults. If you're using EC2, Amazon Linux gives you the best chance. Ubuntu Server is too bleeding edge for this purpose.
In order to do the restarts, I suggest you use monit, as it is quick, easy and reliable - it's the normal way to do this.
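A minimal monit sketch for one thin instance might look like this (the pid file path, port and memory threshold are assumptions - adjust them to your setup):

check process thin with pidfile /var/run/thin/thin.3000.pid
  start program = "/etc/init.d/thin start"
  stop program = "/etc/init.d/thin stop"
  if failed port 3000 protocol http for 2 cycles then restart
  if totalmem > 300 MB for 3 cycles then restart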
Lastly, I suggest you set up external monitoring as well, using something like Pingdom, as even monit won't catch every type of fault, such as hardware failures.
If you only want to monitor an application, I always use Nagios with Centreon. You can set up email alerts for when your Rails server is down. You have to set up NRPE on every machine you want to monitor.
When an error is detected, you can run a bash script to kill hanging processes and restart the server automatically. Personally, I never do that, because a crash means something went wrong, so I do it manually in order to check everything.
Have a look here: http://www.centreon.com/
I have a Ruby app in production that uses Sidekiq (which uses Redis), and I have discovered that FLUSHALL commands are being called, which wipes the database (thus removing all the processed and scheduled jobs).
I don't know or understand what could be causing this.
Does anyone know how I can begin to trace the call to flushall?
Thanks,
It is most likely that your Redis server is open to the public network without any protection - that is just calling for trouble, because anyone can connect and do much more damage than just a FLUSHALL. If that is the case, use password authentication at the very least, after burning the compromised server - the attacker may have gained access to your server's operating system, and from there, who knows where. More information at: http://antirez.com/news/96
If that isn't the case and you have a rogue application somewhere that randomly calls unwanted commands, you can try tracking it down by combining the MONITOR and CLIENT LIST commands.
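A rough way to do that from the shell (this assumes redis-cli can reach the instance; note that MONITOR adds measurable overhead, so don't leave it running):

redis-cli monitor | grep -i flushall    # each hit shows the caller's addr:port
redis-cli client list                   # match that address against connected clients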
Lastly, you can consider renaming/disabling the FLUSHALL command, at least temporarily, until you get to the bottom of this.
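Renaming or disabling it is a one-line change in redis.conf (the replacement name below is just a placeholder):

rename-command FLUSHALL ""    # disable the command entirely
rename-command FLUSHALL SOME_OBSCURE_NAME    # or hide it behind another name (use one, not both)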
We have a Rails app running on Passenger, and we background process some tasks using a combination of RabbitMQ and Workling. The workling worker process is started using the script/workling_client command. There is always only one worker process started, and script/workling_client has a :multiple => false option, thus allowing only one instance. But sometimes, under mysterious circumstances which I haven't been able to track down, more worklings spawn up. If I let the system run for some time, more and more worklings appear. I'm not sure if these rogue worklings cause any problems, but it is still unsettling not to know why it is happening. We are using Monit to monitor the workling process, so if it dies, Monit will spawn it up again. But this still does not explain how come there are suddenly more than one of them.
So my question is: does anyone know what can be the cause of this, and how to make it stop? Is it possible that workling sometimes dies by itself without deleting its pid file? Could there be something wrong with the Daemons gem that workling_client is built upon?
Not an answer - I have the same problems running RabbitMQ + Workling.
I'm using God to monitor the single workling process as well (:multiple => false)...
I found that the multiple worklings were eating up huge amounts of memory & causing serious resource usage, so it's important that I find a solution for this.
You might find this message thread helpful: http://groups.google.com/group/rubyonrails-talk/browse_thread/thread/ed8edd0368066292/5b17d91cc85c3ada?show_docid=5b17d91cc85c3ada&pli=1
I'm running a Rails application with Apache + Passenger on virtual servers that do not have any swap space configured.
The site gets a decent amount of traffic, with 200K+ daily requests, and sometimes the whole system runs out of memory, causing odd behaviour across the system.
The question is: is there any way to configure Apache or Passenger not to run out of memory (e.g. by gracefully restarting Passenger instances when they start using, say, more than 300 MB of memory)?
Servers have 4 GB of memory, and currently I'm using Passenger's PassengerMaxRequests option, but it does not seem to be the most solid solution here.
At the moment I also cannot switch to nginx, so that is not an option for freeing up some room.
Any clever ideas I'm probably missing are welcome.
Edit: My temporary solution
I did not go with restarting Rails instances when they exceed a certain amount of memory usage. Engine Yard wrote a great blog post on the ActiveRecord memory bloat issue, which is our main suspect here. As I did not have much time to optimize the application, I set PassengerMaxRequests to 300 and added an extra 2 GB of memory to the server. Things have been good since then. At first I was worried that continuously restarting Rails instances would make things slow, but it does not seem to have an impact I should worry about.
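For reference, that setting is a single directive in the Apache vhost (300 is just the value that worked for us):

PassengerMaxRequests 300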
If you mean "limiting" as killing those processes and if this is the only application on the server and it is a Linux, then you have two choices:
Set the maximum amount of memory one process can have (the value is in kilobytes; with no argument, ulimit prints the current limit):
# ulimit -m
unlimited
# ulimit -m 307200
Or use cgroups for similar behavior:
http://en.wikipedia.org/wiki/Cgroups
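A minimal sketch with the cgroups v1 memory controller (this assumes the controller is mounted under /sys/fs/cgroup/memory; replace <pid> with a real process id):

mkdir /sys/fs/cgroup/memory/railsapp
echo 300M > /sys/fs/cgroup/memory/railsapp/memory.limit_in_bytes
echo <pid> > /sys/fs/cgroup/memory/railsapp/tasks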
I would advise against restarting instances (if that is possible) that go over the memory limit, because that may put your system into an infinite loop where a process repeatedly reaches the limit and gets restarted.
Maybe you could write a simple daemon that constantly watches the processes and kills any that go over a certain amount of memory. Be sure to log information about any process it kills so you can fix the issue whenever it comes up.
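A minimal Ruby sketch of such a watchdog (the 300 MB limit, the process name and the log path are all assumptions):

LIMIT_KB = 300 * 1024 # ~300 MB of resident memory

loop do
  # pid, rss (in KB) and command line for every ruby process
  `ps -o pid=,rss=,args= -C ruby`.each_line do |line|
    pid, rss, cmd = line.split(" ", 3)
    next if pid.to_i == Process.pid # don't kill ourselves
    next unless rss.to_i > LIMIT_KB
    File.open("/var/log/memory_watchdog.log", "a") do |log|
      log.puts "#{Time.now} killing #{pid} (#{rss} KB): #{cmd}"
    end
    Process.kill("TERM", pid.to_i)
  end
  sleep 30
end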
I'd also look into getting some real swap space on there... running without it seems like a bad hack.
I have a problem where Passenger processes end up going out of control and consuming too much memory. I wrote the following script, which has been helping to keep things under control until I find the real solution: http://www.codeotaku.com/blog/2012-06/passenger-memory-check/index. It might be helpful.
Passenger web instances don't contain important state (generally speaking), so killing them isn't normally a problem, and Passenger will restart them as and when required.
I don't have a canned solution, but you might want to use two commands that ship with Passenger to keep track of memory usage and the number of processes: passenger-status and sudo passenger-memory-stats; see
Passenger users guide for Nginx or
Passenger users guide for Apache.
I'm running into a problem in a Rails application.
After some hours, the application seems to start hanging, and I wasn't able to find where the problem was. There was nothing relevant in the log files, but when I tried to fetch a URL from a browser, nothing happened (as if Mongrel accepted the request but wasn't able to respond).
What do you think I can test to understand where the problem is?
I might get voted down for dodging the question, but I recently moved from nginx + mongrel to mod_rails and have been really impressed. Moving to a much simpler setup will undoubtedly save me headaches in the future.
It was a really easy transition, I'd highly recommend it.
Are you sure the problem is caused by Mongrel? Have you tried running your application under WEBrick?
There are a few things you can check, but since you say there's nothing in the logs to indicate an error, it sounds like you might be running into a known bug in the log rotation feature of the Logger class, which causes Mongrel to lock up. Instead of relying on Logger to rotate your logs, use logrotate or some other external log rotation service.
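A logrotate entry along these lines would do it (the log path is an assumption; copytruncate matters because it rotates the log without Mongrel having to reopen the file):

/var/www/myapp/log/*.log {
  daily
  rotate 7
  compress
  missingok
  copytruncate
}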
Does this happen at a set number of hours/days every time? How much RAM do you have?
I had this same problem. The couple of options I had narrowed it down to were both MySQL adapter related. I was running on Red Hat Enterprise Linux 4 (or 5), and the app would hang after a given amount of idle time.
One suggested solution was to compile native MySQL bindings, I had been using the pure Ruby one.
The other was to set the timeout on the MySQL adapter higher than what the connection would idle out on. (I don't have the specific configuration recorded, but as I recall it was in environment.rb, and it was some class variable on the MySQL adapter.)
I don't recall if either of those solutions fixed it; we moved to Ubuntu shortly after that and haven't had a problem since.
Check the Mongrel FAQ:
http://mongrel.rubyforge.org/wiki/FAQ
From my experience, Mongrel hangs when:
the log file gets too big (hundreds of megabytes in size) - you have to set up log rotation
the MySQL driver times out
you have to change the timeout settings of your MySQL driver by adding this to your environment.rb:
ActiveRecord::Base.verification_timeout = 14400 # 4 hours, in seconds - keep it below MySQL's wait_timeout (8 hours by default)
(this is further explained in the deployment section of the FAQ)
Unfortunately, Rails (and thereby Mongrel) using up too much memory over time and crashing is a known problem (50K+ Google hits for "Ruby, rails, crashing, memory"). The current Ruby interpreter sometimes simply fails to give memory back to the system - it may reuse the memory it has, but it won't give it up.
There are numerous schemes for monitoring, killing and restarting Mongrel instances in a production environment - for example (choosing at random) rails monitor. Until the problem is fixed more decisively, one of these may be your best bet.
We have experienced this same issue. First off, install the mongrel_proctitle gem
http://github.com/rtomayko/mongrel_proctitle/tree/master
This gem/plugin will let you view the Mongrel processes via "ps", so you can see if a Mongrel is hung. An issue we have seen with Mongrel is that it will happily accept connections and enqueue them, then wedge itself. This plugin will help you see when a Mongrel has become wedged, but then you must use another monitoring app to actually restart it, something like Monit or God.
You might also want to consider putting a reverse proxy with better balancing in front of your Mongrels, something like HAProxy, instead of nginx, Apache or Lighttpd. With a setting of "maxconn 1" in HAProxy you can ensure that the queue is maintained by HAProxy rather than by Mongrel. The other reverse proxies (nginx, Apache, Lighttpd) only do round-robin, which means that they can inadvertently load up your Mongrel queues.
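A fragment of what that looks like in haproxy.cfg (the listen address, server names and ports are assumptions):

listen mongrels 127.0.0.1:8080
    server mongrel1 127.0.0.1:8000 maxconn 1
    server mongrel2 127.0.0.1:8001 maxconn 1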
My personal choice is God as it is much more flexible.
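For comparison, a God watch is plain Ruby, which is where the flexibility comes from (the paths and port here are assumptions):

God.watch do |w|
  w.name = "mongrel-8000"
  w.pid_file = "/var/run/mongrel.8000.pid"
  w.start = "mongrel_rails start -d -p 8000 -P /var/run/mongrel.8000.pid"
  w.stop = "mongrel_rails stop -P /var/run/mongrel.8000.pid"
  w.behavior(:clean_pid_file) # remove stale pid files before starting
  w.keepalive(:memory_max => 300.megabytes) # restart if memory grows past this
end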
tl;dr: install this gem plugin and keep an eye on your Mongrels. Or try Apache + Phusion Passenger.
I have a simple Rails app deployed on a 500 MB Slicehost VPS. I'm the only one who uses the app. When I run it on my laptop, it's fast enough, but the deployed version is insanely slow: it takes 6 to 10 seconds to load the login screen.
I would like to find out why it's so slow. Is it my code? (Don't think so because it's much faster locally, but maybe.) Is it Slicehost's server being overloaded? Is it the Internet?
Can someone suggest a technique or set of steps I can take to help narrow down the cause of this problem?
Update:
Sorry, forgot to mention: I'm running it under CentOS 5 using Phusion Passenger (AKA mod_rails or mod_rack).
If it is only slow the first time you load it, that is probably because Passenger kills the process due to inactivity. I don't remember all the details, but I do recall reading about people who used cron jobs to keep at least one process alive, to avoid the lag that occurs when Passenger needs to reload the environment.
Edit: more details here
Specifically, the pool idle time defaults to 2 minutes, which means that after two minutes of idling, Passenger has to reload the environment to serve the next request.
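The relevant knobs, if you go that route (the URL in the cron line is a placeholder):

# in the Apache config: never shut down idle Passenger processes
PassengerPoolIdleTime 0

# or keep the app warm with a cron ping every 2 minutes
*/2 * * * * curl -s http://example.com/ > /dev/null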
First, find out if there's a particularly slow response from the server. Use Firefox and the Firebug plugin to see how long each component (including JavaScript and graphics) takes to download. Assuming the main page itself is what is taking all the time, you can start profiling the application. You'll need to find a good profiler, and as I don't actually work in Ruby on Rails, I can't suggest any: google "profile ruby on rails" for some options.
As YenTheFirst points out, the server software and config you're using may contribute to a slowdown, but a) Slicehost doesn't choose that, you do - Slicehost just provides very raw server "slices" that you can treat as dedicated machines - and b) you're unlikely to see a script that runs instantly suddenly take 6 seconds just because it's running as CGI. Something else must be going on. Check how much RAM you're using: have you gone into swap? Is the login slow only the first time it's hit, indicating some startup issue, or is it always that slow? Is static content served slowly? That would tend to mean some network issue (either on the Slicehost side or your local network) is slowing things down, assuming you're not in swap.
When you say "fast enough" you're being vague: does the laptop version take 1 second versus Slicehost's 6? That wouldn't be entirely surprising if the laptop is decent: after all, the reason slices are cheap is because they're a fraction of a full server. You're probably using 1/32 of an 8-core machine at Slicehost, as opposed to both cores of a modern laptop. The Slicehost cores are quick, but your laptop could be a screamer compared to 1/4 of a core. :)
Try to pinpoint where the slowness lies:
1/ Is the application slow, or the infrastructure (network + web server)? Put a static file on your web server and access it through your browser.
2/ If the static file is fast, the problem is probably in the application or server configuration - e.g. database access is slow. Try a page with a simple loop: is it slow too?
3/ If the static file is slow, it is probably your infrastructure. You can check:
bad network connection: do a packet capture (with Wireshark, for example) and look for retransmissions, duplicate packets, etc.
is DNS resolution slow?
is the server misconfigured?
etc.
What is Slicehost using to serve it?
Fast options are things like Mongrel, or Apache's mod_rails (also called Phusion Passenger or something like that).
These are dedicated servers (or plugins to servers) which run an instance of your rails app.
If your host isn't using that, then it's probably defaulting to CGI. Rails comes with a simple CGI script that will serve the page, but it reloads the app for every page.
(edit: I suspect that this is the most likely case, that your app is running off of the CGI in /webapp_directory/public/dispatch.cgi, which would explain the slowness. This tends to be a default deployment on many hosts, since it doesn't require extra configuration on their part, but it doesn't give good performance)
If your host supports "Fast CGI", rails supports that too. Fast CGI will open a CGI session, and keep it open for multiple pages, so you get much better performance, but it's not nearly as good as Mongrel or mod_rails.
Secondly, is it in 'production' or 'development' mode? The easy way to tell is to go to a page in your app that gives an error. If it shows you a stack trace, it's in development mode, which is slower than production mode. Mongrel and mod_rails have startup options to determine whether to run the app in production or development mode.
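For the record, here is how the mode is usually set in each case (a sketch; check your version's docs):

# Mongrel: pass the environment when starting
mongrel_rails start -e production

# Passenger: set it in the Apache vhost
RailsEnv production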
Finally, if your database is slow for whatever reason, that will be a big bottleneck as well. If you do have a good deployment (Mongrel/mod_rails/etc.) in production mode, try looking into that.
Do you have a lot of data in your DB? I would double-check that you have indexed all the appropriate columns, because this can make a huge difference. On your local dev system you probably have a lot more memory than on your 500 MB slice, which would make the DB run a lot slower if you have big, unindexed tables. You can also enable the slow query log in MySQL to pinpoint queries that aren't using indexes.
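For example, if logins look users up by email address, a migration like this (the table and column are assumptions) turns a full table scan into an index lookup:

class AddIndexToUsersEmail < ActiveRecord::Migration
  def self.up
    add_index :users, :email
  end

  def self.down
    remove_index :users, :email
  end
end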
Other than that, yes - Passenger will need to spool up a process for you if you have not been using the site recently. If this is the case, you should see a significant speed increase on the second, and especially the third and later, page loads.
You might want to run a local virtual machine with 500 MB. Are you doing a lot of client-server interaction? Delays over the WAN are significant.
You might want to check out New Relic's RPM (there's a free "lite" version too) and/or their Tune Up.
Your CPU time is guaranteed by Slicehost using the Xen virtualization system, so it's not that. Don't have the other answers for you, sorry! Might try 'top' on a console while you're trying to access the page.
If you are using Firefox and doing localhost testing (or maybe even on a LAN), you may want to try editing the network.dns.disableIPv6 setting.
Type about:config in the address bar, filter for network.dns.disableIPv6, and double-click the entry to set it to true.
This bug has been reported mainly on Vista, but on some other systems as well.
You could try running 'top' when you SSH in to see which process is heavy. If you also have problems logging in, you might try getting statistics from the Slicehost manager.
If you discover it is MySQL's fault, consider decreasing the number of threads it can spawn.
512 MB seems decent for a Rails application; you might want to check whether something is misconfigured, too.