We frequently have what we call "apache hangs", where our passenger queue starts to grow like crazy and the application appears not to be responding. We believe it is due to all passenger processes happening to be serving very long-running requests simultaneously.
We are trying to figure out how to identify what the actual current request is that each active passenger process is currently trying to serve. Note that I have read Phusion's frozen process blog, and am still digesting it, experimenting with SIGQUIT and other suggestions there.
My question here is: can passenger-status --show=server tell me what each process is doing, and if so, can anyone here help me understand how to interpret its results?
It looks to me like this command is reporting a ton of stuff that is not actually going on at the moment, in addition to whatever is, and I can't figure out how to work out which of these really are current requests being handled by active passenger processes.
The output of passenger-status --show=server is too long to paste here, so here is a gist, including the base passenger-status output too.
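For reference, the rough procedure from the frozen-process post that I've been experimenting with so far looks like this; the error log path is specific to my setup, so treat it as an assumption:

# get the PIDs of the application processes
passenger-status

# ask one suspect process to dump its current Ruby backtrace; with Passenger the
# dump should show up in the web server's error log
kill -QUIT <pid>

# watch the error log for the backtrace (path is from my setup)
tail -f /var/log/apache2/error_log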
Let me start off by saying I understand performance is hard, and even harder through a forum, but I hope someone can point me towards the next step of figuring out this problem we're having in production on a Ruby on Rails environment. I've been seeing this for a while, but now with our usage down due to Covid, it's become more clear that it's not a user-load issue, but something in the infrastructure.
The day starts off fine, but midday, things slow more and more... until a passenger restart clears everything up for 24/48/96 hours.
Checking the obvious items:
Server memory usage is in check. Memory use increases during this time from 20% to 30%. So some increase, but definitely not swapping (I've seen that happen before).
CPU usage is similar. It increases from 1% at 6am to 18% at worst case, but still well within the capabilities of the server.
Passenger usage. Looking at passenger-status, there's never a backlog of requests. Because of covid, our user-base is down, and even at its slowest, I'm seeing 2-3 passenger processes serving content (out of ~40 max), and no long-running requests.
Delayed Job workers (3) are running, but there are no jobs queued at the time. They don't seem to be using a significant amount of CPU or memory.
Watching the apache/rails logs, nothing looks awry. No DoSes, no unexpected load.
Looking at long-running transactions in NewRelic, there are some that are taking 20-30-50 sec, but after the passenger restart, they are back to 1-3 sec like usual. All the time is in Ruby.
So this is where I'm stuck. I can see where the problem is (185 sec):
but if I restart passenger and re-run, that same code will take less than a second. And that's just the example that I see from today's traces. Yesterday, it looked like it was a different controller having problems.
Any recommendations on my next steps to troubleshoot? I don't know if I should instrument a specific controller, because it's not just one method that's causing the problem (afaik). I think I'm seeing the symptoms, not the cause - and no idea how to see what's really going on.
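One option I'm weighing is wrapping the suspect helper methods in New Relic method tracers (via the newrelic_rpm gem we already use), so the next slow trace shows exactly which call is eating the time. Something like this, where the controller and method names are just stand-ins:

class ReportsController < ApplicationController   # hypothetical controller name
  include ::NewRelic::Agent::MethodTracer

  def show
    @data = build_report_data
  end

  private

  def build_report_data
    # ... the code that shows up as slow in the trace ...
  end

  # report time spent in the helper as its own segment in transaction traces
  add_method_tracer :build_report_data, 'Custom/ReportsController/build_report_data'
end

But I'm not sure that's the right next step, since the slow controller seems to change from day to day.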
Thanks in advance
-Mike
Rails 4.2.11.1
Ruby 2.4.9
Passenger 6.0.5 on Apache 2.4.43
We are running a big server (12 threads, 6 cores, 64 GB RAM, 2 SSDs in RAID 0) for our Rails app, deployed with nginx/Passenger.
Unfortunately, pages are taking forever to load, something like between 10 and 40 seconds. However, the server is under a really light load, with a load average of 0.61 0.56 0.53. Memory usage also looks weird: free -ml reports 57 GB (of 64 GB) used, whereas htop reports only 4 GB (of 64 GB).
We have checked our production log, and a Rails request takes something like 100-200 ms to complete, so almost nothing.
How can we identify the bottleneck?
This question is fairly vague, but I'll see if I can give you some pointers.
My first guess is that your app is spending a lot of time doing database related stuff, see below for my advice.
As for the odd memory usage, are you looking at the correct part of the free -ml output? Just to clarify, you want to be looking at the -/+ buffers/cache: line to get the accurate output.
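For example, on older procps versions the output looks roughly like this (the numbers below are made up to mirror your situation); the "used" column on the -/+ buffers/cache: line is the figure that reflects what your processes actually use, because the Mem: line counts the kernel's page cache as used:

             total       used       free     shared    buffers     cached
Mem:         64400      58200       6200          0        900      53100
-/+ buffers/cache:       4200      60200
Swap:         2047          0       2047

So a box can happily report 57 GB "used" on the Mem: line while only a few GB is actually taken by applications; the rest is cache the kernel will give back under pressure.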
You might also check to see if any of your passenger workers are hanging, as that is a fairly common issue with passenger. You can do this by running strace -p $pid on your passenger workers. If one is hanging, it will have an obvious lack of "doing anything".
As for troubleshooting response time within Rails itself, I would highly suggest looking into New Relic (http://newrelic.com/). You can often see exactly which part of your app is causing the bad response time by breaking down how much time is spent in each part of your app. It's a simple gem to install, and once you get reporting working, it's pretty invaluable for issues like this.
Finally, the bottleneck was Passenger; passenger-status is pretty useful for showing the request queue that is left.
Our server is pretty decent, so we just increased the number of passenger processes in nginx.conf to 600, resulting in:
passenger_root /usr/local/rvm/gems/ruby-2.0.0-p195/gems/passenger-4.0.5;
passenger_ruby /usr/local/rvm/wrappers/ruby-2.0.0-p195/ruby;
passenger_max_pool_size 600;
Recently I ran the command below to start a daemon process which runs once every three days.
RAILS_ENV=production lib/daemons/mailer_ctl start
It was working, but when I came back after a week I found that the process had been killed.
Is this normal or not?
Thanks in advance.
Nope. As far as I understand it, this daemon is supposed to run until you kill it. The idea is for it to work at regular intervals, right? So the daemon is supposed to wake up, do its work, then go back to sleep until needed. So if it was killed, that's not normal.
The question is why was it killed and what can you do about it. The first question is one of the toughest ones to answer when debugging detached processes. Unless your daemon leaves some clues in the logs, you may have trouble finding out when and why it terminated. If you look through your logs (and if you're lucky) there may be some clues -- I'd start right around the time when you suspect it last ran and look at your Rails production.log, any log file the daemon may create but also at the system logs.
Let's assume for a moment that you never can figure out what happened to this daemon. What to do about it becomes an interesting question. The first step is: Log as much as you can without making the logs too bulky or impacting performance. At a minimum log wakeup, processing, and completion events, as well as trapping signals and logging them. Best to log to somewhere other than the Rails production.log. You may also want to run the daemon at a shorter interval than 3 days until you are certain it is stable.
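A minimal sketch of the kind of logging and signal trapping I mean, assuming a simple loop-based daemon (the log path and the interval are placeholders, not your actual setup):

require 'logger'

# hypothetical daemon body, e.g. something like lib/daemons/mailer.rb
logger = Logger.new('/var/log/myapp/mailer_daemon.log')  # placeholder path, separate from production.log

# record any terminating signal so the log shows *why* the daemon went away
shutdown_signal = nil
%w[INT TERM QUIT].each do |sig|
  Signal.trap(sig) { shutdown_signal = sig }  # keep trap handlers tiny; Logger is not trap-safe
end

logger.info 'daemon started'

until shutdown_signal
  logger.info 'waking up'
  begin
    # ... the actual mailing work goes here ...
    logger.info 'processing complete'
  rescue StandardError => e
    logger.error "#{e.class}: #{e.message}"
  end
  sleep 60 * 60  # check hourly while debugging, instead of once every three days
end

logger.info "exiting on SIG#{shutdown_signal}"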
Look into using a process monitoring tool like monit (http://mmonit.com/monit/) or god (http://god.rubyforge.org/). These tools can "watch" the status of daemons and if they are not running can automatically start them. You still need to figure out why they are being killed, but at least you have some safety net.
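As a starting point, a monit stanza for a Daemons-based script might look roughly like this; the pid file location and application paths are guesses that you would need to adapt:

check process mailer_daemon with pidfile /var/www/myapp/shared/pids/mailer.pid
  start program = "/bin/bash -lc 'cd /var/www/myapp/current && RAILS_ENV=production lib/daemons/mailer_ctl start'"
  stop program  = "/bin/bash -lc 'cd /var/www/myapp/current && RAILS_ENV=production lib/daemons/mailer_ctl stop'"
  if totalmem > 300 MB for 3 cycles then restart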
We have a rails app running on passenger and we background process some tasks using a combination of RabbitMQ and Workling. The Workling worker process is started using the script/workling_client command. There is always only one worker process started, and script/workling_client has a :multiple => false option, thus allowing only one instance. But sometimes, under mysterious circumstances which I haven't been able to track down, more worklings spawn up. If I let the system run for some time, more and more worklings appear. I'm not sure if these rogue worklings cause any problems, but it is still unsettling not to know why it is happening. We are using Monit to monitor the workling process. So if it dies, it will spawn it up again. But this still does not explain how come there are suddenly more than one of them.
So my question is: does anyone know what the cause of this can be and how to make it stop? Is it possible that workling sometimes dies by itself, without deleting its pid file? Could there be something wrong with the Daemons gem that workling_client is built upon?
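For reference, here is the kind of check I do by hand to see whether the recorded pid still corresponds to a live process (the pid file path below is just an example, not necessarily where your setup writes it):

pid_file = 'tmp/pids/workling.pid'  # example path - use whatever workling_client actually writes
pid = File.read(pid_file).to_i

begin
  Process.kill(0, pid)  # signal 0 only checks for existence; it does not actually signal the process
  puts "process #{pid} is still running"
rescue Errno::ESRCH
  puts "stale pid file: no process with pid #{pid}"
rescue Errno::EPERM
  puts "a process with pid #{pid} exists but belongs to another user"
end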
Not an answer - I have the same problems running RabbitMQ + Workling.
I'm using God to monitor the single workling process as well (:multiple => false)...
I found that the multiple worklings were eating up huge amounts of memory & causing serious resource usage, so it's important that I find a solution for this.
You might find this message thread helpful: http://groups.google.com/group/rubyonrails-talk/browse_thread/thread/ed8edd0368066292/5b17d91cc85c3ada?show_docid=5b17d91cc85c3ada&pli=1
I'm running into a problem in a Rails application.
After some hours, the application seems to start hanging, and I wasn't able to find where the problem was. There was nothing relevant in the log files, but when I tried to get the URL from a browser nothing happened (it was as if Mongrel accepted the request but wasn't able to respond).
What do you think I can test to understand where the problem is?
I might get voted down for dodging the question, but I recently moved from nginx + mongrel to mod_rails and have been really impressed. Moving to a much simpler setup will undoubtedly save me headaches in the future.
It was a really easy transition, I'd highly recommend it.
Are you sure the problem is caused by Mongrel? Have you tried running your application under WEBrick?
There are a few things you can check, but since you say there's nothing in the logs to indicate error, it sounds like you might be running into a bug when using the log rotation feature of the Logger class. It causes mongrel to lock up. Instead of relying on Logger to rotate your logs, consider using logrotate or some other external log rotation service.
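For example, a basic logrotate stanza for the Rails logs could look something like this (the path and rotation policy are placeholders); copytruncate matters because it lets Mongrel keep writing to the same file handle without a restart:

# /etc/logrotate.d/myapp (hypothetical)
/var/www/myapp/log/*.log {
  daily
  rotate 14
  compress
  missingok
  notifempty
  copytruncate
}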
Does this happen at a set number of hours/days every time? How much RAM do you have?
I had this same problem. The couple options I had narrowed it down to were MySQL adapter related. I was running on Red Hat Enterprise Linux 4 (or 5) and the app would hang after a given amount of idle time.
One suggested solution was to compile native MySQL bindings, I had been using the pure Ruby one.
The other was to set the timeout on the MySQL adapter higher than what the connection would idle out on. (I don't have the specific configuration recorded, but as I recall it was in environment.rb and it was some class variable in the mysql adapter.)
I don't recall if either of those solutions fixed it; we moved to Ubuntu shortly after that and haven't had a problem since.
Check the Mongrel FAQ:
http://mongrel.rubyforge.org/wiki/FAQ
From my experience, mongrel hangs when:
the log file got too big (hundreds of megabytes in size). You have to set up log rotation.
the MySQL driver times out
you have to change the timeout settings of your MySQL driver by adding this to your environment.rb:
ActiveRecord::Base.verification_timeout = 14400
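# 14400 seconds = 4 hours; as I understand it, the idea is to keep this below MySQL's default wait_timeout of 8 hours, so idle connections get re-verified before the server drops them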
(this is further explained in the deployment section of the FAQ)
Unfortunately, Rails (and thereby Mongrel) using up too much memory over time and crashing is a known problem (50K+ Google entries for "Ruby, rails, crashing, memory"). The current ruby interpreter has the property that it sometimes simply fails entirely to give memory back to the system - it may reuse the memory it has but it won't give it up.
There are numerous schemes for monitoring, killing and restarting Mongrel instances in a production environment - for example (choosing at random): rails monitor. Until the problem is fixed more decisively, one of these may be your best bet.
We have experienced this same issue. First off, install the mongrel_proctitle gem
http://github.com/rtomayko/mongrel_proctitle/tree/master
This gem/plugin will allow you to view the mongrel processes via "ps" and you can see if a Mongrel is hung. An issue we have seen with Mongrel is that it will happily accept connections and enqueue them, then wedge itself. This plugin will help you see when a Mongrel has been wedged, but then you must use another monitoring app to actually restart a wedged Mongrel, something like Monit or God.
You might also want to consider putting a more balanced reverse proxy in front of your Mongrels, something like HAProxy, instead of nginx, Apache or Lighttpd. With a setting of "maxconn 1" in HAProxy you can ensure that the queue is maintained by HAProxy rather than by Mongrel. The other reverse proxies (nginx, Apache, Lighttpd) only do round-robin, which means that they can inadvertently load up your Mongrel queue.
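To illustrate, a trimmed-down HAProxy backend along those lines might look like this (the names, ports, and number of Mongrels are made up):

# haproxy.cfg fragment - let HAProxy hold the queue and hand each Mongrel one request at a time
backend mongrels
  balance roundrobin
  server mongrel1 127.0.0.1:8000 maxconn 1 check
  server mongrel2 127.0.0.1:8001 maxconn 1 check
  server mongrel3 127.0.0.1:8002 maxconn 1 check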
My personal choice is God as it is much more flexible.
tl;dr Install this gem plugin and keep an eye on your Mongrels. Try Apache+Phusion Passenger.