We are running v1.9.1 (the latest stable release) of Neo4j in embedded mode. We have had a couple of situations where the process has shut down unexpectedly and neo4j.shutdown() has not been called.
Note: when this has occurred, we know there were no outstanding updates or changes being made to the Neo4j database. Also, this is on a Linux OS.
When the application is started up again and it opens the connection to Neo4j, it begins the recovery process but hangs forever. The messages.log file shows:
2013-07-17 21:05:09.143+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: XaResourceManager[nioneo_logical.log] recovery completed.
2013-07-17 21:05:09.143+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Recovery on log [/opt/pricing/data/database/app/nioneo_logical.log.1] completed.
2013-07-17 21:05:09.156+0000 INFO [o.n.k.i.t.TxManager]: TM opening log: /opt/pricing/data/database/app/tm_tx_log.2
2013-07-17 21:05:09.245+0000 INFO [o.n.b.BackupServer]: BackupServer communication server started and bound to /0.0.0.0:6362
2013-07-17 21:05:09.271+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Non clean shutdown detected on log [/opt/pricing/data/database/app/index/lucene.log.2]. Recovery started ...
2013-07-17 21:05:09.271+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: [/opt/pricing/data/database/app/index/lucene.log.2] logVersion=3 with committed tx=317
What's most interesting: we copied the DB over to a desktop, created a little program that just starts the DB and then shuts it down, and ran it against the copy. It recovered with no problems and in only a couple of seconds. (This may be because the hung process had already partially recovered the DB, but we don't think so, because the application does recover the DB if we kill it and try running it again.)
We repeated this on the Linux machine with the same successful result.
We are obviously working on ensuring shutdown() is always called on an unexpected termination of the application (a sketch of the hook we are adding is below), but the real question is why the recovery process hangs when starting up.
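For completeness, the hook is just the standard JVM shutdown hook that the Neo4j embedded examples use; a minimal sketch of what we have in mind (the class name and how the store path is obtained are placeholders, and of course it cannot help on a SIGKILL or a hard JVM crash):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class GraphDbLifecycle
{
    public static GraphDatabaseService start( final String storeDir )
    {
        final GraphDatabaseService graphDb =
                new GraphDatabaseFactory().newEmbeddedDatabase( storeDir );

        // Make sure shutdown() runs on a normal JVM exit or a SIGTERM;
        // a SIGKILL or a JVM crash will still skip it.
        Runtime.getRuntime().addShutdownHook( new Thread()
        {
            @Override
            public void run()
            {
                graphDb.shutdown();
            }
        } );

        return graphDb;
    }
}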
We did find the following thread: https://groups.google.com/forum/#!msg/neo4j/CBvuMybTRFw/NMIOpBjrIYIJ, but that talks about running the DB as a server and just increasing the timeout, although the point in messages.log where it stalls is exactly the same as ours.
As a temporary workaround, if the recovery hangs we can run the little 'dummy' program to see if the DB gets fixed, but we would rather get to the root cause.
Does anybody have any advice?
What does the CPU/memory/disk do when you say it hangs? Is everything quiet?
Also, if you do a jstack, a profile, or something similar on the JVM doing the recovery, what is the "main" thread doing?
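For example, something like the following against the JVM that is stuck in recovery (assuming the JDK's jstack is on the PATH; <pid> is the id of the hung Java process):

jstack <pid> > recovery-threads.txt

and then look at what the "main" thread is blocked on.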
Providing answers to these questions would help a great deal.
Related
I was running a large delete query and got an out-of-memory error, so the DB shut down automatically. I restarted it, but it is still showing as 'offline' in Neo4j Desktop.
Here are the log entries from the restart:
2021-08-01 23:47:03.506+0000 INFO Starting...
2021-08-01 23:47:06.804+0000 INFO ======== Neo4j 4.2.1 ========
Exception in thread "neo4j.Scheduler-1" java.lang.OutOfMemoryError: Java heap space
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2021-08-01 23:47:22.505+0000 INFO Sending metrics to CSV file at /Users/my_user/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-########-####-####-####-##########/metrics
2021-08-01 23:47:22.524+0000 INFO Bolt enabled on localhost:7687.
2021-08-01 23:47:23.836+0000 INFO Remote interface available at http://localhost:7474/
2021-08-01 23:47:23.837+0000 INFO Started.
Similarly, when I attempt to connect from a browser it tells me that the Neo4j database is unavailable.
In the log I can see that there is a Java out of memory error. Why would this appear? Does Neo4j queue/cache incomplete queries? And how do I go about clearing it if I can't access the server?
The data is only test data, so I don't need to save it. I do need to understand if I can fix it, and how, since I am putting the product through its paces for a new project.
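If it turns out to be simply a heap-sizing problem, I assume the fix would be to raise the heap for this DBMS in its neo4j.conf (the setting names below are the ones from the 4.x docs; the values are just examples) and then rerun the delete in smaller batches:

dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=2g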
I had a relatively small database that was working fine.
Then the server spiked at 99% CPU for an hour. The only thing I could do was terminate the process.
Now the server will not start, producing the following error:
Starting Neo4j Server failed: Error starting
org.neo4j.kernel.EmbeddedGraphDatabase, c:.......
I have since rebooted the machine.
I am running 2.1.5 Community Edition on a Windows server.
Any help would be greatly appreciated.
It sounds like your data directory is corrupt. Go into db/data and rename graph.db to graph.db.old, then restart.
I am running a Cowboy Erlang server. My server was generated by following the getting-started instructions on the 99s site, and I am running it from the command line:
./_rel/myapp_release/bin/myapp_release console
The thing is, after a while with no activity the server crashes and does not recover. The message I am getting is this:
heart: Sat Aug 16 22:33:18 2014: heart-beat time-out, no activity for 1771 seconds
heart: Sat Aug 16 22:33:18 2014: Would reboot. Terminating.
{"Kernel pid terminated",heart,{port_terminated,{heart,loop,[<0.0.0>,#Port<0.25>,[]]}}}
I know about the heart tool that can be used to monitor a service and restart it after a while if it's not getting any requests (I guess the logic is that if nothing is happening with the service, something is wrong), but I can't figure out where in the Cowboy application this configuration exists.
So I would ask:
Can anyone explain why the server is crashing?
If it is indeed crashing "on purpose", where is the configuration to set up things like the time-out period?
Ideally the application would restart itself if it crashes (using a supervisor?). Does Cowboy have a built-in supervisor for the apps it is running?
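From what I have found in the Erlang documentation so far, heart is switched on with the emulator's -heart flag (so presumably somewhere in the release's startup configuration such as vm.args, not in Cowboy itself), and its timeout is supposed to be configurable through the HEART_BEAT_TIMEOUT environment variable, e.g. something like this (untested on my setup, 300 is just an example value):

HEART_BEAT_TIMEOUT=300 ./_rel/myapp_release/bin/myapp_release console

but I would still like to understand why it fires in the first place.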
I have a Rails app on Heroku that users log in to. I periodically get this exception:
UserSessionsController# (ActiveRecord::StatementInvalid)
"PGError: FATAL: terminating connection due to administrator command\nserver closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbef...
URL
POST http://secure.huckberry.com/user_sessions
What's a likely cause of this? I'd appreciate any help.
Assuming you saw this recently, this is due to a recent bit of high-priority maintenance work to enable continuous backups on shared databases -- involving a server restart. You shouldn't worry about this error, provided it does not reproduce. I don't think that's very likely, so happy hacking!
I had this error happen to me. My application server had an open connection to the database. In my SSH terminal I added an IP address to the pg_hba.conf file and restarted the PostgreSQL server.
That is when this error showed up. I refreshed my web page one time and the error was gone.
This probably means that something sent the server process a SIGTERM signal. This could happen if the postmaster gets a SIGINT from something. However, if you are able to reconnect, that's not the case, because the postmaster would then disallow new connections.
You're probably having a clash of some kind in your application. Enable query logging and check for something unusual.
This error may also appear if you run a test suite that uses a database connection (PostgreSQL in this case) and a test is still running (asynchronously).
A teardown hook may terminate the connection while the test is still running, which results in this error message.
My Ruby on Rails app is set up with the usual pack of Mongrels behind an Apache configuration. We've noticed that our Mongrel web servers' memory usage can grow quite large on certain operations, and we'd really like to be able to do a graceful restart of selected Mongrel processes dynamically, at any time.
However, for reasons I won't go into here it can sometimes be very important that we don't interrupt a Mongrel while it is servicing a request, so I assume a simple process kill isn't the answer.
Ideally, I want to send the Mongrel a signal that says "finish whatever you're doing and then quit before accepting any more connections".
Is there a standard technique or best practice for this?
I've done a little more investigation into the Mongrel source, and it turns out that Mongrel installs a signal handler to catch a standard process kill (TERM) and do a graceful shutdown, so I don't need a special procedure after all.
You can see this working from the log output you get when killing a Mongrel while it's processing a request. For example:
** TERM signal received.
Thu Aug 28 00:52:35 +0000 2008: Reaping 2 threads for slow workers because of 'shutdown'
Waiting for 2 requests to finish, could take 60 seconds.
Thu Aug 28 00:52:41 +0000 2008: Reaping 2 threads for slow workers because of 'shutdown'
Waiting for 2 requests to finish, could take 60 seconds.
Thu Aug 28 00:52:43 +0000 2008 (13051) Rendering layoutfalsecontent_typetext/htmlactionindex within layouts/application
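So a plain TERM against the Mongrel's pid is enough for a graceful stop. For example (the pidfile path is whatever your mongrel_cluster setup writes, and port 8000 is just an example):

kill -TERM $(cat tmp/pids/mongrel.8000.pid)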
Look at using monit. You can dynamically restart mongrel based on memory or CPU usage. Here's a line from a config file that I wrote for a client of mine.
check process mongrel-8000 with pidfile /var/www/apps/fooapp/current/tmp/pids/mongrel.8000.pid
start program = "/usr/local/bin/mongrel_rails cluster::start --only 8000"
stop program = "/usr/local/bin/mongrel_rails cluster::stop --only 8000"
if totalmem is greater than 150.0 MB for 5 cycles then restart # eating up memory?
if cpu is greater than 50% for 8 cycles then alert # send an email to admin
if cpu is greater than 80% for 5 cycles then restart # hung process?
if loadavg(5min) greater than 10 for 3 cycles then restart # bad, bad, bad
if 3 restarts within 5 cycles then timeout # something is wrong, call the sys-admin
if failed host 192.168.106.53 port 8000 protocol http request /monit_stub
with timeout 10 seconds
then restart
group mongrel
You'd then repeat this configuration for all of your mongrel cluster instances. The monit_stub line is just an empty file that monit tries to download. If it can't, it tries to restart the instance as well.
Note: the resource monitoring seems not to work on OS X with the Darwin kernel.
The better question is how to keep your app from consuming so much memory that you have to restart Mongrels from time to time.
www.modrails.com reduced our memory footprint significantly.
Boggy:
If you have one process running, it will gracefully shut down (servicing all the requests in its queue, which should only be one if you are using proper load balancing). The problem is that you can't start the new server until the old one dies, so your users will queue up in the load balancer. What I've found successful is a 'cascade' or rolling restart of the mongrels: instead of stopping them all and starting them all (and therefore queuing requests until the one mongrel is done, stopped, restarted, and accepting connections again), you stop then start each mongrel sequentially, blocking the call to restart the next mongrel until the previous one is back up (use a real HTTP check to a /status controller). As your mongrels roll, only one at a time is down and you are serving across two code bases; if you can't live with that, you should put up a maintenance page for a minute. You should be able to automate this with Capistrano or whatever your deploy tool is; a rough sketch of the loop follows the task list below.
So I have 3 tasks:
cap deploy - does the traditional restart-everything-at-once method, with a hook that puts up a maintenance page and then takes it down after an HTTP check.
cap deploy:rolling - does this cascade across the machine (I pull from iClassify to know how many mongrels are on the given machine) without a maintenance page.
cap deploy:migrations - does maintenance page + migrations, since it's usually a bad idea to run migrations 'live'.
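If it helps, the rolling part can be as simple as a loop like this (the ports and the /status health check are assumptions about your setup; the cluster::stop/start commands are the same ones used in the monit config above):

# rough sketch of a rolling restart, one mongrel at a time
for port in 8000 8001 8002; do
  mongrel_rails cluster::stop --only $port
  mongrel_rails cluster::start --only $port
  # block here until this mongrel answers a real HTTP health check before moving on
  until curl -sf http://127.0.0.1:$port/status > /dev/null; do
    sleep 1
  done
done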
Try using:
mongrel_cluster_ctl stop
You can also use:
mongrel_cluster_ctl restart
I've got a question:
What happens when /usr/local/bin/mongrel_rails cluster::start --only 8000 is triggered?
Are all of the requests served by this particular process allowed to run to completion, or are they aborted?
I'm curious whether this whole start/restart thing can be done without affecting the end users...