I am running a Cowboy Erlang server. My server was generated by following the Getting Started instructions on the 99s site, and I am running it with this command line:
./_rel/myapp_release/bin/myapp_release console
The thing is, after a certain period of no activity, the server crashes and does not recover. The message I am getting is this:
heart: Sat Aug 16 22:33:18 2014: heart-beat time-out, no activity for 1771 seconds
heart: Sat Aug 16 22:33:18 2014: Would reboot. Terminating.
{"Kernel pid terminated",heart,{port_terminated,{heart,loop,[<0.0.0>,#Port<0.25>,[]]}}}
I know about the heart tool that can be used to monitor a service and restart it after a while if it's not getting any requests (I guess the logic is that if nothing is happening with the service, something is wrong), but I can't figure out where in the Cowboy application this configuration lives.
So I would ask:
Can anyone explain why the server is crashing?
If it is indeed crashing "on purpose", where is the configuration to set up things like the time-out period?
Ideally the application would restart itself after it crashes (using a supervisor?). Does Cowboy have a built-in supervisor for the apps it is running?
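For reference, as far as I can tell that timeout is not Cowboy configuration at all: it belongs to heart, which reads an environment variable before the VM starts. A minimal sketch of overriding it for a relx-generated release, reusing the command line from the question (the 120-second value is just a placeholder):
# heart is enabled by the -heart emulator flag (typically via vm.args);
# HEART_BEAT_TIMEOUT is how many seconds heart waits for a heartbeat
# from the VM before running its reboot action.
export HEART_BEAT_TIMEOUT=120
./_rel/myapp_release/bin/myapp_release console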
Phusion Passenger version: 5.0.29
I've read the documentation about Passenger's OOB (Out-of-Band) feature and would like to use it to make a decision out-of-band on whether a process should exit or not. If the process has to exit, then at the end of the OOB work the process calls raise SystemExit.
We've managed to get this working: the process exits and Passenger later spins up a new process to handle new incoming requests. But we're seeing occasional 502s with the following lines in the Passenger log.
[ 2019-03-27 22:25:13.3855 31726/7f78b6c02700 age/Cor/Con/InternalUtils.cpp:112 ]: [Client 1-10] Sending 502 response: application did not send a complete response
[ 2019-03-27 22:25:13.3859 31726/7f78b6201700 age/Cor/CoreMain.cpp:819 ]: Checking whether to disconnect long-running connections for process 10334, application agent
App 16402 stdout:
App 16441 stdout:
App 16464 stdout:
[ 2019-03-27 22:28:05.0320 31726/7f78ba9ff700 age/Cor/App/Poo/AnalyticsCollection.cpp:102 ]: Process (pid=16365, group=agent) no longer exists! Detaching it from the pool.
Is the above behavior due to a race condition between the request handler forwarding the request to the process and the process exiting? Is Passenger designed to handle this scenario? Any workaround / solution for this problem?
Thank you!
Looks like we need to run "passenger-config detach-process [PID]" for graceful termination (a small sketch of wiring this into a periodic check follows the quoted post below).
Improved process termination mechanism
If you wish to terminate a specific application process -- maybe because it's misbehaving -- then you can do that simply by killing it with the kill command. Unfortunately this method has a few drawbacks:
Any requests which the process was handling will be forcefully aborted, causing error responses to be returned to the clients.
For a short amount of time, new requests may be routed to that process. These requests will receive error responses.
In Passenger 5 we've introduced a new, graceful mechanism for terminating application processes: passenger-config detach-process. This command removes the process from the load balancing list, drains all existing requests, then terminates it cleanly.
You can learn more about this tool by running:
passenger-config detach-process --help
Source: https://blog.phusion.nl/2015/03/04/whats-new-in-passenger-5-part-2-better-logging-better-restarting-better-websockets-and-more/
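For completeness, here is a rough sketch of how the drain can be driven from outside instead of calling raise SystemExit in the process itself. Only passenger-config detach-process is real Passenger tooling; the pid file convention and the cron-style check are assumptions of ours:
# The OOB handler writes its own pid to a file when it decides it
# should exit; this check (run from cron or a systemd timer) then
# drains and terminates that process gracefully.
PIDFILE=/tmp/agent.oob.pid
if [ -f "$PIDFILE" ]; then
    passenger-config detach-process "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
fi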
We are running v1.9.1 (the latest stable release) of Neo4j in embedded mode. We have had a couple of situations where the process has shut down unexpectedly and neo4j.shutdown() has not been called.
Note: when this has occurred, we know there were no outstanding updates or changes occurring to the Neo4j DB. Also, this is on a Linux OS.
When the application is started up again and it opens the connection to Neo4j, it begins the recovery process but hangs forever. The messages.log file shows:
2013-07-17 21:05:09.143+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: XaResourceManager[nioneo_logical.log] recovery completed.
2013-07-17 21:05:09.143+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Recovery on log [/opt/pricing/data/database/app/nioneo_logical.log.1] completed.
2013-07-17 21:05:09.156+0000 INFO [o.n.k.i.t.TxManager]: TM opening log: /opt/pricing/data/database/app/tm_tx_log.2
2013-07-17 21:05:09.245+0000 INFO [o.n.b.BackupServer]: BackupServer communication server started and bound to /0.0.0.0:6362
2013-07-17 21:05:09.271+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Non clean shutdown detected on log [/opt/pricing/data/database/app/index/lucene.log.2]. Recovery started ...
2013-07-17 21:05:09.271+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: [/opt/pricing/data/database/app/index/lucene.log.2] logVersion=3 with committed tx=317
What's most interesting: we copied the DB over to a desktop, created a little program that just starts the DB and then shuts it down, and ran it against the DB. It recovered with no problems and in only a couple of seconds. (This may be because the hung process had partially recovered the DB, but we don't think so, because the application does recover the DB if we kill it and try running it again.)
We repeated this on the linux machine with the same successful results.
We are obviously working on trying to ensure shutdown will always be called on an unexpected termination of the application, but the real question is: why does the recovery process hang when starting up?
We did find the following https://groups.google.com/forum/#!msg/neo4j/CBvuMybTRFw/NMIOpBjrIYIJ, but that talks about running the DB as a server and just increasing the timeout, although the point in messages.log where it stalls is exactly the same as ours.
As a temporary solution, if the recovery hangs we can run the little 'dummy' program to see if the DB gets fixed, but we would rather get to the root cause.
Does anybody have any advice?
What does the CPU/memory/disk do when you say it hangs? Is everything quiet?
Also, if you take a jstack dump (or a profile, or similar) of the JVM doing the recovery, what is the "main" thread doing?
Providing answers to these questions would help a great deal.
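One straightforward way to gather that is to capture a few thread dumps of the recovering JVM from the shell. The jps filter on "pricing" (guessed from the log paths above) and the output file names are assumptions:
# Take three thread dumps, ten seconds apart, while recovery is stuck.
PID=$(jps -l | awk '/pricing/ {print $1}')
for i in 1 2 3; do
    jstack -l "$PID" > "recovery-dump-$i.txt"
    sleep 10
done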
Sometimes the Erlang application will go down because it runs out of memory or for other reasons.
Can I configure something so that it restarts when it goes down?
Erlang supports the -heart option for monitoring whether the Erlang process is still alive. If the Erlang process crashes or hangs, heart reboots the whole server (it assumes your Erlang application is started when the server boots). You can read all the details here:
http://www.erlang.org/doc/man/heart.html
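If a full machine reboot is not what you want, the usual pattern is to point HEART_COMMAND at something that restarts just the node. A minimal sketch; the paths and node name are placeholders:
# heart runs HEART_COMMAND when the VM stops answering heartbeats;
# pointing it at your release's start script restarts only the node.
export HEART_COMMAND="/opt/myapp/bin/myapp start"
erl -heart -sname myapp    # plus whatever boot arguments you normally use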
I'm using Tomcat 6 and Hyperic HQ for monitoring via JMX.
The issue is the following:
Hyperic, over time, opens hundreds of JMX connections and never closes them. After a few hours our Tomcat server is using 100% CPU without doing anything.
Once I stop the Hyperic agent, Tomcat goes back to 0-1% CPU.
Here is what we are seeing in VisualVM:
http://forums.hyperic.com/jiveforums/servlet/JiveServlet/download/1-11619-37096-2616/Capture.PNG
I don't know whether this is a Hyperic issue, but I wonder if there is an option to fix it via Tomcat/Java configuration. The reason I'm unsure is that when we use Hyperic against other standard Java daemons we don't see the same connection leak.
The JMX is exposed using Spring, and it's working great when connecting with JMX clients (JConsole/VisualVM). When I close the client, I see that the number of connections drops by one.
Is there anything we can do to fix this via Java configuration (for example, forcing it to close a connection that has been open for more than X seconds)?
One more thing: in Tomcat we see, from time to time, the following message while Hyperic is running:
Mar 7, 2011 11:30:00 AM ServerCommunicatorAdmin reqIncoming
WARNING: The server has decided to close this client connection.
Thanks
My Ruby on Rails app is set up with the usual pack of Mongrels behind an Apache configuration. We've noticed that our Mongrel web server memory usage can grow quite large on certain operations, and we'd really like to be able to dynamically do a graceful restart of selected Mongrel processes at any time.
However, for reasons I won't go into here it can sometimes be very important that we don't interrupt a Mongrel while it is servicing a request, so I assume a simple process kill isn't the answer.
Ideally, I want to send the Mongrel a signal that says "finish whatever you're doing and then quit before accepting any more connections".
Is there a standard technique or best practice for this?
I've done a little more investigation into the Mongrel source, and it turns out that Mongrel installs a signal handler to catch a standard process kill (TERM) and do a graceful shutdown, so I don't need a special procedure after all (a one-liner is sketched after the log excerpt below).
You can see this working from the log output you get when killing a Mongrel while it's processing a request. For example:
** TERM signal received.
Thu Aug 28 00:52:35 +0000 2008: Reaping 2 threads for slow workers because of 'shutdown'
Waiting for 2 requests to finish, could take 60 seconds.
Thu Aug 28 00:52:41 +0000 2008: Reaping 2 threads for slow workers because of 'shutdown'
Waiting for 2 requests to finish, could take 60 seconds.
Thu Aug 28 00:52:43 +0000 2008 (13051) Rendering layoutfalsecontent_typetext/htmlactionindex within layouts/application
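In practice that means a plain TERM against the pid file is enough. A sketch, borrowing the pid file path from the monit example further down (your path will differ):
# Ask the Mongrel on port 8000 to finish its in-flight requests and exit.
kill -TERM "$(cat /var/www/apps/fooapp/current/tmp/pids/mongrel.8000.pid)"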
Look at using monit. You can dynamically restart Mongrel based on memory or CPU usage. Here's an excerpt from a config file that I wrote for a client of mine.
check process mongrel-8000 with pidfile /var/www/apps/fooapp/current/tmp/pids/mongrel.8000.pid
start program = "/usr/local/bin/mongrel_rails cluster::start --only 8000"
stop program = "/usr/local/bin/mongrel_rails cluster::stop --only 8000"
if totalmem is greater than 150.0 MB for 5 cycles then restart # eating up memory?
if cpu is greater than 50% for 8 cycles then alert # send an email to admin
if cpu is greater than 80% for 5 cycles then restart # hung process?
if loadavg(5min) greater than 10 for 3 cycles then restart # bad, bad, bad
if 3 restarts within 5 cycles then timeout # something is wrong, call the sys-admin
if failed host 192.168.106.53 port 8000 protocol http request /monit_stub
with timeout 10 seconds
then restart
group mongrel
You'd then repeat this configuration for all of your mongrel cluster instances. The monit_stub line is just an empty file that monit tries to download. If it can't, it tries to restart the instance as well.
Note: the resource monitoring seems not to work on OS X with the Darwin kernel.
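Once the stanza is in place, the standard monit command line can reload the configuration and drive the group by hand as well. A short sketch; the group name comes from the config above, and the exact CLI syntax may vary between monit versions:
monit reload                  # re-read the control file
monit summary                 # check the state of each watched process
monit -g mongrel restart all  # manually restart everything in the mongrel group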
A better question is how to keep your app from consuming so much memory that you need to reboot Mongrels from time to time.
www.modrails.com reduced our memory footprint significantly.
Boggy:
If you have one process running, it will gracefully shut down (servicing all the requests in its queue, which should only be 1 if you are using proper load balancing). The problem is that you can't start the new server until the old one dies, so your users will queue up in the load balancer.
What I've found successful is a 'cascade' or rolling restart of the Mongrels. Instead of stopping them all and starting them all (and therefore queuing requests until the one Mongrel is done, stopped, restarted and accepting connections), you stop then start each Mongrel sequentially, blocking the call to restart the next Mongrel until the previous one is back up (use a real HTTP check to a /status controller). As your Mongrels roll, only one at a time is down and you are serving across two code bases; if you can't do that, you should throw up a maintenance page for a minute. You should be able to automate this with Capistrano or whatever your deploy tool is (a rough sketch follows the task list below).
So I have 3 tasks:
cap deploy - which does the traditional restart-everything-at-the-same-time method, with a hook that puts up a maintenance page and then takes it down after an HTTP check.
cap deploy:rolling - which does this cascade across the machine (I pull from iClassify to know how many Mongrels are on the given machine) without a maintenance page.
cap deploy:migrations - which does maintenance page + migrations, since it's usually a bad idea to run migrations 'live'.
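Here is the rough shape of the roll as a plain shell sketch; the ports, localhost URL and /status action are assumptions, and in reality this logic lives inside the Capistrano task:
# Restart one Mongrel at a time, waiting until it answers /status
# before moving on to the next port.
for port in 8000 8001 8002; do
    mongrel_rails cluster::stop  --only "$port"
    mongrel_rails cluster::start --only "$port"
    until curl -fs "http://127.0.0.1:$port/status" > /dev/null; do
        sleep 1
    done
done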
Try using:
mongrel_cluster_ctl stop
You can also use:
mongrel_cluster_ctl restart
Got a question: what happens when /usr/local/bin/mongrel_rails cluster::start --only 8000 is triggered?
Are all of the requests served by this particular process to their end, or are they aborted?
I'm curious whether this whole start/restart thing can be done without affecting the end users...