Neo4j TransactionNotPresentOnMasterException

I've set up an HA cluster using one standalone instance and one embedded instance.
When the embedded instance is the slave and I run a long-running transaction (commit time about 30s), it throws:
org.neo4j.graphdb.TransactionFailureException: Unable to commit transaction
at org.neo4j.kernel.TopLevelTransaction.finish(TopLevelTransaction.java:143)
... 4 more
Caused by: org.neo4j.com.TransactionNotPresentOnMasterException: Transaction RequestContext[session: 1372752815336, ID:3, eventIdentifier:0, [lucene-index/1311102, nioneodb/7397657]] has either timed out on the master or was not started on this master. There may have been a master switch between the time this transaction started and up to now. This transaction cannot continue since the state from the previous master isn't transferred.
at org.neo4j.kernel.ha.com.master.MasterImpl.suspendOtherAndResumeThis(MasterImpl.java:280)
at org.neo4j.kernel.ha.com.master.MasterImpl.finishTransaction(MasterImpl.java:457)
at org.neo4j.kernel.ha.com.HaRequestType18$14.call(HaRequestType18.java:188)
at org.neo4j.kernel.ha.com.HaRequestType18$14.call(HaRequestType18.java:183)
at org.neo4j.com.Server$4.run(Server.java:559)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
As suggested by the exception, I've checked whether the standalone instance loses its master state, but it doesn't, so no master switch occurs. The instances are running on the same machine, so a network error can't be the cause.
When the embedded instance is the master, or not in any cluster at all, the transaction succeeds.
The code generating this error is quite complex, so I can't post it in full at the moment.
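For illustration, the failing code follows the standard Neo4j 1.x transaction pattern; here is a minimal sketch of that pattern (not the actual code, all names are illustrative):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class LongRunningUpdate {
    // Sketch only: a single transaction on the embedded HA slave whose work and
    // commit take around 30s, long enough for the master to have discarded the
    // slave's transaction state by the time finish() is called.
    static void applyLargeUpdate(GraphDatabaseService db) {
        Transaction tx = db.beginTx();
        try {
            // ... large batch of node, relationship and index writes ...
            tx.success();
        } finally {
            tx.finish(); // the TransactionFailureException surfaces here on the slave
        }
    }
}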

Related

Neo4J HA with Neo4J Spatial

So I have just set up an HA environment where I have a master server and instances of an application using Neo4J embedded talking to that cluster. Everything seems to work if the state of both databases is the same.
However, if I delete all data from my slave instance and have it join the cluster, I expect the data from the cluster to propagate into the slave instance. Instead I get errors from what appears to be Neo4j Spatial. I have Neo4j Spatial in my application, and the server plugin installed on the master server side.
An example of the stack trace I get:
2015-10-19 15:10:27.096+0000 ERROR [org.neo4j]: Exception when stopping org.neo4j.kernel.lifecycle.Lifecycle$Delegate#ae93556 org.neo4j.gis.spatial.indexprovider.SpatialIndexImplementation.stop()V
java.lang.AbstractMethodError: org.neo4j.gis.spatial.indexprovider.SpatialIndexImplementation.stop()V
at org.neo4j.kernel.lifecycle.Lifecycles$1.stop(Lifecycles.java:55)
at org.neo4j.kernel.lifecycle.Lifecycle$Delegate.stop(Lifecycle.java:75)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:527)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:155)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:185)
at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:1160)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:527)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:155)
at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:137)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.stopServicesAndHandleBranchedStore(SwitchToSlave.java:521)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.checkDataConsistency(SwitchToSlave.java:357)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.executeConsistencyChecks(SwitchToSlave.java:316)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.switchToSlave(SwitchToSlave.java:219)
at org.neo4j.kernel.ha.cluster.HighAvailabilityModeSwitcher$2.run(HighAvailabilityModeSwitcher.java:328)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
2015-10-19 15:10:27.102+0000 ERROR [org.neo4j]: Lifecycle exception Failed to transition component 'org.neo4j.kernel.lifecycle.Lifecycle$Delegate#ae93556' from STOPPED to SHUTTING_DOWN. Please see attached cause exception
org.neo4j.kernel.lifecycle.LifecycleException: Failed to transition component 'org.neo4j.kernel.lifecycle.Lifecycle$Delegate#ae93556' from STOPPED to SHUTTING_DOWN. Please see attached cause exception
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:559)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:200)
at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:1160)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:527)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:155)
at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:137)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.stopServicesAndHandleBranchedStore(SwitchToSlave.java:521)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.checkDataConsistency(SwitchToSlave.java:357)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.executeConsistencyChecks(SwitchToSlave.java:316)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.switchToSlave(SwitchToSlave.java:219)
at org.neo4j.kernel.ha.cluster.HighAvailabilityModeSwitcher$2.run(HighAvailabilityModeSwitcher.java:328)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
Caused by: java.lang.AbstractMethodError: org.neo4j.gis.spatial.indexprovider.SpatialIndexImplementation.shutdown()V
at org.neo4j.kernel.lifecycle.Lifecycles$1.shutdown(Lifecycles.java:64)
at org.neo4j.kernel.lifecycle.Lifecycle$Delegate.shutdown(Lifecycle.java:81)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:555)
... 18 more
2015-10-19 15:10:27.103+0000 ERROR [org.neo4j]: Chained lifecycle exception Component 'org.neo4j.kernel.lifecycle.Lifecycle$Delegate#ae93556' failed to stop. Please see attached cause exception.
org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.kernel.lifecycle.Lifecycle$Delegate#ae93556' failed to stop. Please see attached cause exception.
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:532)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:155)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:185)
at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:1160)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:527)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:155)
at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:137)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.stopServicesAndHandleBranchedStore(SwitchToSlave.java:521)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.checkDataConsistency(SwitchToSlave.java:357)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.executeConsistencyChecks(SwitchToSlave.java:316)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.switchToSlave(SwitchToSlave.java:219)
at org.neo4j.kernel.ha.cluster.HighAvailabilityModeSwitcher$2.run(HighAvailabilityModeSwitcher.java:328)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
Caused by: java.lang.AbstractMethodError: org.neo4j.gis.spatial.indexprovider.SpatialIndexImplementation.stop()V
at org.neo4j.kernel.lifecycle.Lifecycles$1.stop(Lifecycles.java:55)
at org.neo4j.kernel.lifecycle.Lifecycle$Delegate.stop(Lifecycle.java:75)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:527)
... 19 more
Does Neo4j Spatial support replication across instances? Or, more specifically, does it support restoring the spatial index to a new, empty instance that joins the cluster for the first time?
Updating Neo4j Spatial to version 0.15-neo4j-2.2.6 fixes this issue, so you need to be running Neo4j 2.2.6 in order to have spatial indexes replicate properly.
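If you pull Neo4j Spatial into your application with Maven, the fix amounts to bumping the dependency version; a sketch, assuming the usual org.neo4j coordinates for the Spatial artifact (verify them against the repository you use):

<dependency>
    <groupId>org.neo4j</groupId>        <!-- assumed coordinates, check your repository -->
    <artifactId>neo4j-spatial</artifactId>
    <version>0.15-neo4j-2.2.6</version> <!-- version from the answer above -->
</dependency>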

Jenkins xvnc plugin: some display numbers stay allocated when a build is stopped abruptly (i.e. a Jenkins restart) and cannot be used again

I am using Jenkins with the Xvnc plugin to run acceptance tests on Firefox on a CentOS slave. I have limited the display numbers to 2-4 since there will be at most 3 test instances that need a display. The tests and plugin worked fine until Jenkins had to be restarted a few times due to issues in other builds. The following error now occurs whenever the build tries to run:
FATAL: All available display numbers are allocated or blacklisted.
allocated: [2, 3, 4]
blacklisted: []
java.lang.RuntimeException: All available display numbers are allocated or blacklisted.
allocated: [2, 3, 4]
blacklisted: []
at hudson.plugins.xvnc.DisplayAllocator.doAllocate(DisplayAllocator.java:59)
at hudson.plugins.xvnc.DisplayAllocator.allocate(DisplayAllocator.java:49)
at hudson.plugins.xvnc.Xvnc.doSetUp(Xvnc.java:99)
at hudson.plugins.xvnc.Xvnc.setUp(Xvnc.java:89)
at jenkins.tasks.SimpleBuildWrapper.setUp(SimpleBuildWrapper.java:146)
at hudson.model.Build$BuildExecution.doRun(Build.java:156)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
at hudson.model.Run.execute(Run.java:1741)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:381)
I checked a working build where I had restarted Jenkins without manually stopping each job and found a potential cause:
Terminating xvnc.
FATAL: hudson.remoting.Channel$OrderlyShutdown
hudson.remoting.RequestAbortedException: hudson.remoting.Channel$OrderlyShutdown
at hudson.remoting.Request.abort(Request.java:296)
at hudson.remoting.Channel.terminate(Channel.java:815)
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1034)
at hudson.remoting.Channel$2.handle(Channel.java:484)
at hudson.remoting.AbstractByteArrayCommandTransport$1.handle(AbstractByteArrayCommandTransport.java:61)
at org.jenkinsci.remoting.nio.NioChannelHub$2.run(NioChannelHub.java:594)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
at ......remote call to jenkinstest.build.thoughtwire.com.test(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1361)
at hudson.remoting.Request.call(Request.java:171)
at hudson.remoting.Channel.call(Channel.java:752)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:954)
at hudson.plugins.xvnc.Xvnc$DisposerImpl.tearDown(Xvnc.java:183)
at jenkins.tasks.SimpleBuildWrapper$EnvironmentWrapper.tearDown(SimpleBuildWrapper.java:175)
at hudson.model.Build$BuildExecution.doRun(Build.java:173)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
at hudson.model.Run.execute(Run.java:1741)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:381)
Caused by: hudson.remoting.Channel$OrderlyShutdown
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1034)
at hudson.remoting.Channel$2.handle(Channel.java:484)
at hudson.remoting.AbstractByteArrayCommandTransport$1.handle(AbstractByteArrayCommandTransport.java:61)
at org.jenkinsci.remoting.nio.NioChannelHub$2.run(NioChannelHub.java:594)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: Command close created at
at hudson.remoting.Command.<init>(Command.java:56)
at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1028)
at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1026)
at hudson.remoting.Channel.close(Channel.java:1109)
at hudson.remoting.Channel.close(Channel.java:1092)
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1033)
at hudson.remoting.Channel$2.handle(Channel.java:484)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:60)
It seems like the job did not close properly and the Xvnc plugin did not get a chance to deallocate the display. I made sure the processes and tests in the slave are properly terminated and nothing is running.
The core issue here is that display numbers 2, 3, and 4 are now permanently allocated and cannot be reused even though no builds are running. If the slave (TEST) is mirrored (TEST2) then TEST2 can use display 2, 3, and 4 but TEST cannot. I have tried reinstalling the plugin but the numbers stay allocated and linked to TEST.
Does anyone know of a way to clear the list of allocated display numbers?
Is this a bug with the plugin?
Is there a way to prevent display numbers from staying allocated if say Jenkins suddenly dies while jobs are running?
The allocated display numbers are saved in the hudson.plugins.xvnc.Xvnc.xml file on the Jenkins master (under the Jenkins home directory). To clear the numbers, you need to stop Jenkins, clean up <allocatedNumbers> in that XML file, and start the Jenkins server again.
It is important to edit the file only after you have stopped the Jenkins server, since Jenkins saves the current numbers when it stops.
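A rough outline of the procedure (the service name and JENKINS_HOME path are examples; adjust them for your installation):

# stop Jenkins first so it does not re-save the stale allocations on shutdown
sudo service jenkins stop
# remove the stale entries inside <allocatedNumbers> for the affected node
vi $JENKINS_HOME/hudson.plugins.xvnc.Xvnc.xml
sudo service jenkins start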
This is a Groovy script I created to clean up the Xvnc display numbers without stopping Jenkins. Note that it might also clean up numbers of jobs that are still running.
https://github.com/sdiepend/jenkins-monitoring/blob/master/cleanXvncDisplayNumbers.groovy
import jenkins.model.Jenkins

Jenkins jenkins = Jenkins.getActiveInstance()
xvncDescriptor = jenkins.getDescriptorByType(hudson.plugins.xvnc.Xvnc.DescriptorImpl.class)
xvncDescriptor.allocators.each {
    allocator = it.value
    // collect() makes numAlloc a new list rather than a reference to the same list object;
    // otherwise removing entries while iterating would throw a ConcurrentModificationException
    numAlloc = allocator.allocatedNumbers.collect()
    numAlloc.each {
        allocator.allocatedNumbers.remove(it)
    }
}
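The script can be pasted into the Jenkins Script Console (Manage Jenkins → Script Console) and run against the live master.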

Jenkins slave job fails after successful job

I have two Jenkins servers because my two builds have some incompatible system requirements.
I set up a new node for one of the servers, migrated the jobs from the other server, and set them up to run on the node.
The node runs the job just fine and even archives the artifacts (they are linked from the job), but the job throws an exception and gets marked as a failure.
Below is the output from the job:
Completed build, now archiving <-- I print this out at the end of my last build step
FATAL: Remote call on ops-1-jenkins-android-10-186.fam.io failed
java.io.IOException: Remote call on ops-1-jenkins-android-10-186.fam.io failed
at hudson.remoting.Channel.call(Channel.java:748)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:940)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:556)
at hudson.model.Run.execute(Run.java:1745)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:89)
at hudson.model.Executor.run(Executor.java:240)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class hudson.slaves.SlaveComputer
at hudson.util.ProcessTree.getKillers(ProcessTree.java:151)
at hudson.util.ProcessTree$OSProcess.killByKiller(ProcessTree.java:212)
at hudson.util.ProcessTree$UnixProcess.kill(ProcessTree.java:557)
at hudson.util.ProcessTree$UnixProcess.killRecursively(ProcessTree.java:564)
at hudson.util.ProcessTree$Unix.killAll(ProcessTree.java:488)
at hudson.Launcher$RemoteLauncher$KillTask.call(Launcher.java:952)
at hudson.Launcher$RemoteLauncher$KillTask.call(Launcher.java:943)
at hudson.remoting.UserRequest.perform(UserRequest.java:118)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:328)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at hudson.remoting.Engine$1$1.run(Engine.java:63)
at java.lang.Thread.run(Thread.java:701)
Thanks Lovato! After I updated all the plugins and Jenkins itself, things worked just fine.
We got this error: FATAL: java.io.IOException: Remote call on pm_jenprodslave_1 failed. We found the slave server was unresponsive and had to reboot it. Once it came back, the builds worked again.

Neo4j HA backup fails after first succeeding

I have successfully set up a Neo4j three-instance HA cluster using version 2.0.2 Enterprise, but I'm having a problem using the built-in backup script (../bin/neo4j-backup).
I manually run:
./bin/neo4j-backup -from ha://10.6.10.48:5001 -to /usr/local/neo4j/backup
...on the master, and it works fine the first time, dumping the data into ../neo4j/backup.
Subsequent tries with the same command yield only this on the command line:
Could not find backup server in cluster neo4j.ha at 10.6.10.48:5001, operation timed out
and this in messages.log:
2014-04-29 17:08:00.919+0000 DEBUG [o.n.c.p.c.ClusterState$4]: ClusterState: entered-[configurationRequest]->entered from:cluster://10.6.10.48:5002 conversation-id:-1/8# payload:-1:cluster://0.0.0.0:5002/?name=Backup
2014-04-29 17:08:00.922+0000 ERROR [o.n.c.c.NetworkSender]: Receive exception:
java.nio.channels.ClosedChannelException: null
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:409) ~[netty-3.6.3.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:127) ~[netty-3.6.3.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:83) ~[netty-3.6.3.Final.jar:na]
at org.jboss.netty.channel.Channels.write(Channels.java:725) ~[netty-3.6.3.Final.jar:na]
at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71) ~[netty-3.6.3.Final.jar:na]
at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59) ~[netty-3.6.3.Final.jar:na]
at org.jboss.netty.channel.Channels.write(Channels.java:704) ~[netty-3.6.3.Final.jar:na]
at org.jboss.netty.channel.Channels.write(Channels.java:671) ~[netty-3.6.3.Final.jar:na]
at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:248) ~[netty-3.6.3.Final.jar:na]
at org.neo4j.cluster.com.NetworkSender$2.run(NetworkSender.java:266) ~[neo4j-cluster-2.0.2.jar:2.0.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_15]
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) ~[na:1.7.0_15]
at java.util.concurrent.FutureTask.run(FutureTask.java:166) ~[na:1.7.0_15]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_15]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_15]
at java.lang.Thread.run(Thread.java:722) [na:1.7.0_15]
(the exception repeats every 5 seconds for a while)
Relevant neo4j.properties values:
online_backup_enabled=true
online_backup_server=127.0.0.1:6362
ha.cluster_server=10.6.10.48:5001
I've checked all firewall settings for all instances.
Any help would be appreciated!
Running an online backup against single://<host> is generally much easier than against ha://<host>, and from a functional point of view ha:// offers no advantage.
So you might change
online_backup_server=10.6.10.48:6362
and then run
./bin/neo4j-backup -from single://10.6.10.48:6362 -to /usr/local/neo4j/backup

Restarting a failed/stalled stream during bootstrap of new node

We are trying to add a new Solr node to our cluster:
DC Cassandra
Cassandra node 1
DC Solr
Solr node 1 <-- new node (actually, a replacement for an old node)
Solr node 2
Solr node 3
Solr node 4
Solr node 5
During the bootstrap process:
The stream from node 3 to node 1 failed with an exception:
ERROR [STREAM-OUT-/IP_OF_NODE1] 2014-04-01 01:14:40,887 CassandraDaemon.java (line 196) Exception in thread Thread[STREAM-OUT-/IP_OF_NODE1,5,main]
java.lang.NullPointerException
at org.apache.cassandra.streaming.ConnectionHandler$MessageHandler.signalCloseDone(ConnectionHandler.java:249)
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:375)
at java.lang.Thread.run(Thread.java:744)
The stream from node 4 to node 1 never started. The last relevant line in node 4's system.log is:
Received streaming plan for Bootstrap.
It should have been followed by:
Prepare completed. Receiving 0 files(0 bytes), sending x files(y bytes)
It seems that the bootstrap process is now stalled because the data file sizes are not changing anymore. How can I force those streams to be retried?
EDIT:
I restarted all nodes today in an attempt to force new node to retry the bootstrap process. Unfortunately, it encountered some stream failures again. This time, the exception in node 1 is as follows:
WARN [STREAM-IN-/IP_OF_NODE3] 2014-04-06 20:48:17,963 StreamSession.java (line 532) [Stream #c84effb0-bda9-11e3-a07d-89325af2f6bf] Retrying for following error
java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/my_keyspace/my_table/my_keyspace-my_table-tmp-jb-1209-Data.db (Too many open files)
at org.apache.cassandra.io.util.SequentialWriter.<init>(SequentialWriter.java:75)
at org.apache.cassandra.io.compress.CompressedSequentialWriter.<init>(CompressedSequentialWriter.java:71)
at org.apache.cassandra.io.compress.CompressedSequentialWriter.open(CompressedSequentialWriter.java:42)
at org.apache.cassandra.io.sstable.SSTableWriter.<init>(SSTableWriter.java:107)
at org.apache.cassandra.io.sstable.SSTableWriter.<init>(SSTableWriter.java:60)
at org.apache.cassandra.streaming.StreamReader.createWriter(StreamReader.java:111)
at org.apache.cassandra.streaming.compress.CompressedStreamReader.read(CompressedStreamReader.java:65)
at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:47)
at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:37)
at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55)
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:283)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.FileNotFoundException: /home/cassandra/data/my_keyspace/my_table/my_keyspace-my_table-tmp-jb-1209-Data.db (Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.cassandra.io.util.SequentialWriter.<init>(SequentialWriter.java:71)
ERROR [STREAM-IN-/78.46.63.218] 2014-04-06 20:48:17,964 StreamSession.java (line 418) [Stream #c84effb0-bda9-11e3-a07d-89325af2f6bf] Streaming error occurred
java.lang.IllegalArgumentException: Unknown type 0
at org.apache.cassandra.streaming.messages.StreamMessage$Type.get(StreamMessage.java:89)
at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54)
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:283)
at java.lang.Thread.run(Thread.java:724)
There are tons of similar errors in the log, e.g.:
ERROR [CompactionExecutor:129] 2014-04-06 20:50:06,401 CassandraDaemon.java (line 196) Exception in thread Thread[CompactionExecutor:129,1,main]
java.lang.RuntimeException: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/my_keyspace/my_table/my_keyspace-my_table-jb-51-Data.db (Too many open files)
at org.apache.cassandra.service.pager.QueryPagers$1.next(QueryPagers.java:154)
at org.apache.cassandra.service.pager.QueryPagers$1.next(QueryPagers.java:137)
at org.apache.cassandra.db.Keyspace.indexRow(Keyspace.java:400)
at org.apache.cassandra.db.index.SecondaryIndexBuilder.build(SecondaryIndexBuilder.java:62)
at org.apache.cassandra.db.compaction.CompactionManager$9.run(CompactionManager.java:833)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/my_keyspace/my_table/my_keyspace-my_table-jb-51-Data.db (Too many open files)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:47)
at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.createReader(CompressedPoolingSegmentedFile.java:48)
at org.apache.cassandra.io.util.PoolingSegmentedFile.getSegment(PoolingSegmentedFile.java:39)
at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:1195)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.<init>(SimpleSliceReader.java:57)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:167)
at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:62)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:250)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:53)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1550)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1379)
at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:327)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:65)
at org.apache.cassandra.service.pager.SliceQueryPager.queryNextPage(SliceQueryPager.java:77)
at org.apache.cassandra.service.pager.AbstractQueryPager.fetchPage(AbstractQueryPager.java:84)
at org.apache.cassandra.service.pager.SliceQueryPager.fetchPage(SliceQueryPager.java:33)
at org.apache.cassandra.service.pager.QueryPagers$1.next(QueryPagers.java:148)
... 10 more
Caused by: java.io.FileNotFoundException: /home/cassandra/data/my_keyspace/my_table/my_keyspace-my_table-jb-51-Data.db (Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:43)
... 28 more
This appears to be very similar to a Cassandra bug/issue:
https://issues.apache.org/jira/browse/CASSANDRA-6965
I'll follow up on that.
Meanwhile, you could run rebuild/repair on that new node.
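For reference, these are the standard nodetool commands for that step (assuming the source datacenter is named "Cassandra" as in the topology above; adjust the names to your cluster):

# stream the node's ranges from the existing datacenter
nodetool rebuild Cassandra
# then repair the node's data
nodetool repair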
EDIT: Another Cassandra issue that appears to be related:
CASSANDRA-6984 - "NullPointerException in Streaming During Repair"
https://issues.apache.org/jira/browse/CASSANDRA-6984
That issue is labeled as a Blocker, so it should get some prompt attention. I've inquired as to whether there is a workaround.
Stay tuned.
(Too many open files)
Looks like you need to increase your ulimit.
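For example (a sketch only; the exact mechanism depends on your OS and on how Cassandra is started, and 100000 is the value commonly recommended for Cassandra):

# check the open-file limit for the user that runs Cassandra
ulimit -n
# example /etc/security/limits.conf entry, assuming Cassandra runs as the "cassandra" user:
#   cassandra  -  nofile  100000
# restart Cassandra afterwards so the new limit takes effect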
