Setting up Causal Cluster Fails - neo4j

I am trying to set up a Neo4j Causal Cluster with 3 cores (core only). I have three Debian servers, all Debian 8.5. I have installed Java 8 and Neo4j Enterprise 3.4.0 (package source deb https://debian.neo4j.org/repo stable/) on each server.
My hosts are 192.168.20.163, 192.168.20.164 and 192.168.20.165. The config is the same on each host, with the obvious IP address changes. The following is for the .163 host:
dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=192.168.20.163
dbms.mode=CORE
causal_clustering.expected_core_cluster_size=3
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3
causal_clustering.initial_discovery_members=192.168.20.163:5000,192.168.20.164:5000,192.168.20.165:5000
causal_clustering.discovery_type=LIST
causal_clustering.discovery_listen_address=192.168.20.163:5000
causal_clustering.transaction_listen_address=192.168.20.163:6000
causal_clustering.raft_listen_address=192.168.20.163:7000
The servers go through the election process, but the LEADER keeps switching back to FOLLOWER and triggering a new election.
The non-leader servers or 'members' each get the following error:
ERROR [o.n.c.c.s.s.CoreStateDownloader] Store copy failed due to store ID mismatch
The server that was started first becomes the LEADER but, as indicated, switches back to FOLLOWER:
2018-05-30 14:58:22.808+0000 INFO [o.n.c.c.c.RaftMachine] Moving to CANDIDATE state after successfully starting election
2018-05-30 14:58:22.825+0000 INFO [o.n.c.m.SenderService] Creating channel to: [192.168.20.165:7000]
2018-05-30 14:58:22.827+0000 INFO [o.n.c.m.SenderService] Creating channel to: [192.168.20.164:7000]
2018-05-30 14:58:22.838+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null
2018-05-30 14:58:22.848+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null
2018-05-30 14:58:22.861+0000 INFO [o.n.c.m.SenderService] Connected: [id: 0x2ee2e930, L:/192.168.20.163:50169 - R:/192.168.20.165:7000]
2018-05-30 14:58:22.862+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Initiating handshake local /192.168.20.163:50169 remote /192.168.20.165:7000
2018-05-30 14:58:22.863+0000 INFO [o.n.c.m.SenderService] Connected: [id: 0x3d670ef3, L:/192.168.20.163:38239 - R:/192.168.20.164:7000]
2018-05-30 14:58:22.863+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Initiating handshake local /192.168.20.163:38239 remote /192.168.20.164:7000
2018-05-30 14:58:22.928+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Installing: ProtocolStack{applicationProtocol=RAFT_1, modifierProtocols=[]}
2018-05-30 14:58:22.929+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Installing: ProtocolStack{applicationProtocol=RAFT_1, modifierProtocols=[]}
2018-05-30 14:58:22.965+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:7000 remote /192.168.20.164:41725
2018-05-30 14:58:23.036+0000 INFO [o.n.c.c.c.RaftMachine] Moving to LEADER state at term 111 (I am MemberId{fbdff840}), voted for by [MemberId{4fe121e0}]
2018-05-30 14:58:23.036+0000 INFO [o.n.c.c.c.s.RaftState] First leader elected: MemberId{fbdff840}
2018-05-30 14:58:23.044+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Starting log shipper: MemberId{f202d023}[matchIndex: -1, lastSentIndex: 0, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:23.045+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Starting log shipper: MemberId{4fe121e0}[matchIndex: -1, lastSentIndex: 0, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:23.045+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Idle{}
2018-05-30 14:58:23.046+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Leader MemberId{fbdff840} updating leader info for database default and term 111
2018-05-30 14:58:24.105+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:6000 remote /192.168.20.164:58041
2018-05-30 14:58:26.841+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:7000 remote /192.168.20.165:48317
2018-05-30 14:58:30.881+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:6000 remote /192.168.20.165:47015
2018-05-30 14:58:38.462+0000 INFO [o.n.c.c.c.m.MembershipWaiter] Leader commit unknown
2018-05-30 14:58:40.411+0000 INFO [o.n.c.c.c.RaftMachine] Moving to FOLLOWER state after not receiving heartbeat responses in this election timeout period. Heartbeats received: []
2018-05-30 14:58:40.411+0000 INFO [o.n.c.c.c.s.RaftState] Leader changed from MemberId{fbdff840} to null
2018-05-30 14:58:40.412+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{f202d023}[matchIndex: -1, lastSentIndex: 3, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:40.413+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{4fe121e0}[matchIndex: -1, lastSentIndex: 3, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:40.413+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Inactive{}
2018-05-30 14:58:40.413+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Step down event detected. This topology member, with MemberId MemberId{fbdff840}, was leader in term 111, now moving to follower.
2018-05-30 14:58:48.342+0000 INFO [o.n.c.c.c.RaftMachine] Election timeout triggered
Eventually the servers fail with:
ERROR [o.n.c.c.c.m.MembershipWaiterLifecycle] Server failed to join cluster within catchup time limit [600000 ms]

Based on the messages you have, I assume you are trying to seed the cluster with a backup from somewhere? Here's what you should do:
1. Check whether the cluster forms correctly with no seeding (so with an empty database). That way you verify that all your settings are correct.
2. When seeding the cluster with a backup, you need to neo4j-admin unbind the database on each of the instances before starting (see the sketch after this list). Check https://neo4j.com/docs/operations-manual/current/clustering/causal-clustering/seed-cluster/ to find the specific instructions for your case. The store ID mismatch is what you get if you don't unbind.
3. If 1 and 2 don't solve your problem, check with Neo4j support (since you are using the Enterprise Edition, I assume you do have support).
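A minimal sketch of step 2, assuming the Debian package layout and the default database; the exact options can differ per version, so check neo4j-admin help unbind first:
# Run on each core server while Neo4j is stopped
sudo systemctl stop neo4j
sudo -u neo4j neo4j-admin unbind   # removes the instance's cluster state so a seeded store can join cleanly
sudo systemctl start neo4j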
Hope this helps.
Regards,
Tom

Related

In EKS, Worker pods going offline abruptly with 'hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection'

Our Environment:
Jenkins version - Jenkins 2.319.1
Jenkins Master image : jenkins/jenkins:2.319.1-lts-alpine
Jenkins worker image: jenkins/inbound-agent:4.11-1-alpine
Installed plugins:
Kubernetes - 1.30.6
Kubernetes Client API - 5.4.1
Kubernetes Credentials Plugin - 0.9.0
JAVA version on master: openjdk 11.0.13
JAVA version on Agent/worker : openjdk 11.0.14
Hi team,
We are facing an issue in Jenkins where the agent disconnects (or goes offline) from the master while a job is still running on the agent/worker. We are getting the error below and have tried the things listed further down, but the issue is still not fully resolved. Jenkins is deployed on EKS.
Error:
5334535:2022-11-02 14:07:54.573+0000 [id=140290] INFO hudson.slaves.NodeProvisioner#update: worker-7j4x4 provisioning successfully completed. We have now 2 computer(s)
5334695:2022-11-02 14:07:54.675+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes done-jenkins/worker-7j4x4
5334828:2022-11-02 14:07:56.619+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes done-jenkins/worker-7j4x4
5334964-2022-11-02 14:07:58.650+0000 [id=140309] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #97 from /100.122.254.111:42648
5335123-2022-11-02 14:09:19.733+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5335275-2022-11-02 14:09:19.733+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5335409-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2608, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5335965-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 1 nodes assigned to this Jenkins instance, which we will check
5336139-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5336279-2022-11-02 14:09:19.734+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5336438-groovy.lang.MissingPropertyException: No such property: envVar for class: groovy.lang.Binding
5336532- at groovy.lang.Binding.getVariable(Binding.java:63)
5336585- at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onGetProperty(SandboxInterceptor.java:271)
--
5394279-2022-11-02 15:09:19.733+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5394431-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5394565-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2620, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5395121-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 3 nodes assigned to this Jenkins instance, which we will check
5395295-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5395435-2022-11-02 15:09:19.734+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5395594-2022-11-02 15:11:59.502+0000 [id=140320] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-254-111.eu-central-1.compute.internal/100.122.254.111:42648.
5395817-java.util.concurrent.TimeoutException: Ping started at 1667401679501 hasn't completed by 1667401919502
5395920- at hudson.remoting.PingThread.ping(PingThread.java:134)
5395977- at hudson.remoting.PingThread.run(PingThread.java:90)
5396032:2022-11-02 15:11:59.503+0000 [id=141914] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5049 for worker-7j4x4 terminated: java.nio.channels.ClosedChannelException
5396231-2022-11-02 15:12:35.579+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started Periodic background build discarder
5396368-2022-11-02 15:12:36.257+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished Periodic background build discarder. 678 ms
5396514-2022-11-02 15:14:15.582+0000 [id=141422] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-237-38.eu-central-1.compute.internal/100.122.237.38:55038.
5396735-java.util.concurrent.TimeoutException: Ping started at 1667401815582 hasn't completed by 1667402055582
5396838- at hudson.remoting.PingThread.ping(PingThread.java:134)
5396895- at hudson.remoting.PingThread.run(PingThread.java:90)
5396950-2022-11-02 15:14:15.584+0000 [id=141915] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5050 for worker-fjf1p terminated: java.nio.channels.ClosedChannelException
5397149-2022-11-02 15:14:19.733+0000 [id=141950] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5397301-2022-11-02 15:14:19.733+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5397435-2022-11-02 15:14:19.734+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2621, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
Any suggestions or resolutions, please?
Things we have tried:
Increased idleMinutes to 180 from default
Verified that resources are sufficient as per the Grafana dashboard
Changed podRetention to onFailure from Never
Changed podRetention to Always from Never
Increased readTimeout
Increased connectTimeout
Increased slaveConnectTimeoutStr
Disabled the ping thread from the UI by unchecking the "response time" checkbox under preventive node monitoring (see the JVM-flag sketch below)
Increased activeDeadlineSeconds
Verified same java version on master and agent
Updated kubernetes and kubernetes API client plugins
The expectation is that the worker/agent should disconnect once the job has run successfully and terminate after the defined idleMinutes, but a few times it terminates while a job is still running on the agent.
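Since the channel is being torn down by ping timeouts, the ping behaviour can also be tuned on the controller JVM rather than per node. A sketch, assuming the stock jenkins/jenkins image where JAVA_OPTS reaches the controller; the ChannelPinger property names come from Jenkins core, so verify them against your version:
# Lengthen the ping interval/timeout (defaults are 300 s and 240 s);
# a negative pingIntervalSeconds disables the ping thread entirely, per the Jenkins docs.
JAVA_OPTS="-Dhudson.slaves.ChannelPinger.pingIntervalSeconds=600 \
  -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=480"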

Unable to start neo4j with systemctl: 'Failed to load from plugin jar'

I've been trying to restart neo4j after adding new data on an EC2 instance. I stopped the neo4j instance, then I called systemctl start neo4j, but when I call cypher-shell it says Connection refused, and connection to the browser port doesn't work anymore.
In the beginning I assumed it was a heap space problem, since looking at the debug.log it said there was a memory issue. I adjusted the heap space and cache settings in neo4j.conf as recommended by neo4j-admin memrec, but still neo4j won't start.
Then I assumed it was because my APOC package was outdated. My neo4j version is 3.5.6, but APOC was 3.5.0.3. I downloaded the latest 3.5.0.4 version, but still neo4j won't start.
At last I tried chmod 777 on every file in the data/database and plugin directories and the directories themselves, but still neo4j won't start.
What's strange is that when I try neo4j console after each of these attempts, both cypher-shell and the neo4j browser port work just fine. However, I would obviously prefer to be able to launch neo4j with systemctl.
Right now the only hint of error I can find in debug.log is the following:
2019-06-19 21:19:55.508+0000 INFO [o.n.i.d.DiagnosticsManager] Storage summary:
2019-06-19 21:19:55.508+0000 INFO [o.n.i.d.DiagnosticsManager] Total size of store: 3.07 GB
2019-06-19 21:19:55.509+0000 INFO [o.n.i.d.DiagnosticsManager] Total size of mapped files: 3.07 GB
2019-06-19 21:19:55.509+0000 INFO [o.n.i.d.DiagnosticsManager] --- STARTED diagnostics for KernelDiagnostics:StoreFiles END ---
2019-06-19 21:19:55.509+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Fulfilling of requirement 'Database available' makes database available.
2019-06-19 21:19:55.509+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database is ready.
2019-06-19 21:19:55.568+0000 INFO [o.n.k.i.DatabaseHealth] Database health set to OK
2019-06-19 21:19:56.198+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.s3.S3URLConnection` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: com/amazonaws/ClientConfiguration
2019-06-19 21:19:56.199+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.s3.S3Aws` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: com/amazonaws/auth/AWSCredentials
2019-06-19 21:19:56.200+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.s3.S3Aws$1` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: com/amazonaws/services/s3/model/S3ObjectInputStream
2019-06-19 21:19:56.207+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.hdfs.HDFSUtils$1` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: org/apache/hadoop/fs/FSDataInputStream
2019-06-19 21:19:56.208+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.hdfs.HDFSUtils` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: org/apache/hadoop/fs/FSDataOutputStream
...
...
...
2019-06-19 21:20:00.678+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutting down database.
2019-06-19 21:20:00.679+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-06-19 21:20:00.679+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database is unavailable.
2019-06-19 21:20:00.684+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Database shutdown" # txId: 1 checkpoint started...
2019-06-19 21:20:00.704+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Database shutdown" # txId: 1 checkpoint completed in 20ms
2019-06-19 21:20:00.705+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 0
2019-06-19 21:20:00.725+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics START ---
2019-06-19 21:20:00.725+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics END ---
2019-06-19 21:20:00.725+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-06-19 21:20:05.875+0000 INFO [o.n.g.f.m.e.CommunityEditionModule] No locking implementation specified, defaulting to 'community'
2019-06-19 21:20:06.080+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Creating database.
2019-06-19 21:20:06.154+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Requirement `Database available` makes database unavailable.
2019-06-19 21:20:06.156+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database is unavailable.
2019-06-19 21:20:06.183+0000 INFO [o.n.i.d.DiagnosticsManager] --- INITIALIZED diagnostics START ---
I think the warning isn't an issue, since it's just a warning and not an error or exception. Also it seems that the database just shuts down automatically, and then restarts, creating an infinite loop. This loop does not happen when I call neo4j console (all the warnings still exist in the logs). All my ports are default.
Any clue why this is happening? I've never encountered this error when I previously launched neo4j on this instance.
If it works with neo4j console but not with systemctl, you should check the permissions on the Neo4j folders.
I'm pretty sure you have a problem there, and that systemctl doesn't run Neo4j as the same user as you.
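A quick way to check, sketched for the Debian/RPM package layout (adjust the paths if your install differs):
# Which user does the service run as?
systemctl cat neo4j | grep -i '^User='
# Who owns the store and plugin files?
ls -l /var/lib/neo4j/data/databases /var/lib/neo4j/plugins
# If the files belong to someone else (e.g. root, after running neo4j console as root),
# hand them back to the service account:
sudo chown -R neo4j:neo4j /var/lib/neo4j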

Kafka stream application stopped with org.apache.kafka.common.errors.TimeoutException

I have installed Kafka 1.0.0 with the help of Docker Compose, and I am running it successfully with two brokers. I created a topic manually with partitions and inserted the events.
Now I am running an application built on Kafka Streams 1.0.0, pointed at this Kafka. After running for some time, the application logged the following messages and stopped. Except for the producer's request.timeout.ms, all config parameters are defaults; request.timeout.ms is 120 seconds.
Before it stopped with the messages below, I observed 'Trying to rejoin the consumer group now. org.apache.kafka.streams.errors.TaskMigratedException:' and 'Caused by: org.apache.kafka.clients.consumer.CommitFailedException:' a couple of times in the log.
What would be the possible reason? Please help me.
Messages before stopping:
2017-12-07 06:17:03,122 WARN o.a.k.c.p.i.Sender [kafka-producer-network-thread | sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] [Producer clientId=sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] Got error produce response with sample id 14099 on topic-partition abc-0, retrying (9 attempts left). Error: NETWORK_EXCEPTION
2017-12-07 06:18:02,675 ERROR o.a.k.s.p.i.RecordCollectorImpl [kafka-producer-network-thread | sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] task [2_0] Error sending record (key 5a12c9ade532af0412fc7bcc.5a12c9ade532af0412fc7bca value com.sample.kafka.streams.SampleEvent#4a56c681 timestamp 1512363589768) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Expiring 9 record(s) for abc-0: 189836 ms has passed since last append; No more records will be sent and no more offsets will be recorded for this task.
2017-12-07 06:18:02,927 INFO o.a.k.c.c.i.AbstractCoordinator [sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1] [Consumer clientId=sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-consumer, groupId=sample-app-0.0.1] Discovered coordinator 1.1.1.1:32775 (id: 2147482645 rack: null)
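For reference, a producer override like the request.timeout.ms change mentioned above is passed to Kafka Streams by prefixing the producer setting with producer. in the Streams configuration. A minimal sketch, with illustrative values rather than a recommendation:
# Streams config (properties form): settings routed to the embedded producer
producer.request.timeout.ms=120000
producer.retries=10
producer.max.in.flight.requests.per.connection=1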

Failed to bind to: spark-master, using a remote cluster with two workers

I managed to get everything working with the local master and two remote workers. Now I want to connect to a remote master that has the same remote workers. I have tried different combinations of settings within /etc/hosts and other recommendations on the Internet, but NOTHING worked.
The Main class is:
public static void main(String[] args) {
    ScalaInterface sInterface = new ScalaInterface(CHUNK_SIZE,
            "awsAccessKeyId",
            "awsSecretAccessKey");

    SparkConf conf = new SparkConf().setAppName("POC_JAVA_AND_SPARK")
            .setMaster("spark://spark-master:7077");

    org.apache.spark.SparkContext sc = new org.apache.spark.SparkContext(conf);

    sInterface.enableS3Connection(sc);
    org.apache.spark.rdd.RDD<Tuple2<Path, Text>> fileAndLine = (RDD<Tuple2<Path, Text>>) sInterface.getMappedRDD(sc, "s3n://somebucket/");
    org.apache.spark.rdd.RDD<String> pInfo = (RDD<String>) sInterface.mapPartitionsWithIndex(fileAndLine);
    JavaRDD<String> pInfoJ = pInfo.toJavaRDD();
    List<String> result = pInfoJ.collect();
    String miscInfo = sInterface.getMiscInfo(sc, pInfo);
    System.out.println(miscInfo);
}
It fails at:
List<String> result = pInfoJ.collect();
The error I am getting is:
1354 [sparkDriver-akka.actor.default-dispatcher-3] ERROR akka.remote.transport.netty.NettyTransport - failed to bind to spark-master/192.168.0.191:0, shutting down Netty transport
1354 [main] WARN org.apache.spark.util.Utils - Service 'sparkDriver' could not bind on port 0. Attempting port 1.
1355 [main] DEBUG org.apache.spark.util.AkkaUtils - In createActorSystem, requireCookie is: off
1363 [sparkDriver-akka.actor.default-dispatcher-3] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
1364 [sparkDriver-akka.actor.default-dispatcher-3] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
1364 [sparkDriver-akka.actor.default-dispatcher-5] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
1367 [sparkDriver-akka.actor.default-dispatcher-4] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
1370 [sparkDriver-akka.actor.default-dispatcher-6] INFO Remoting - Starting remoting
1380 [sparkDriver-akka.actor.default-dispatcher-4] ERROR akka.remote.transport.netty.NettyTransport - failed to bind to spark-master/192.168.0.191:0, shutting down Netty transport
Exception in thread "main" 1382 [sparkDriver-akka.actor.default-dispatcher-6] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
1382 [sparkDriver-akka.actor.default-dispatcher-6] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
java.net.BindException: Failed to bind to: spark-master/192.168.0.191:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
1383 [sparkDriver-akka.actor.default-dispatcher-7] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
1385 [delete Spark temp dirs] DEBUG org.apache.spark.util.Utils - Shutdown hook called
Thank you kindly for your help!
Setting the environment variable SPARK_LOCAL_IP=127.0.0.1 solved this for me.
I had this problem when my /etc/hosts file was mapping the wrong IP address to my local hostname.
The BindException in your logs complains about the IP address 192.168.0.191. I assume your machine's hostname resolves to that address, and that it's not the actual IP address your network interface is using. It should work fine once you fix that.
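For example, a minimal /etc/hosts along these lines (the 192.168.0.42 address is purely illustrative; use the address your interface actually has, e.g. from ip addr):
127.0.0.1      localhost
192.168.0.42   spark-master    # your hostname mapped to the NIC's real address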
I had Spark working on my EC2 instance. I started a new web server, and to meet its requirements I had to change the hostname to the EC2 public DNS name, i.e.
hostname ec2-54-xxx-xxx-xxx.compute-1.amazonaws.com
After that my Spark could not work and showed the error below:
16/09/20 21:02:22 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
16/09/20 21:02:22 ERROR SparkContext: Error initializing SparkContext.
I solved it by setting SPARK_LOCAL_IP as below:
export SPARK_LOCAL_IP="localhost"
then just launched the Spark shell as below:
$SPARK_HOME/bin/spark-shell
Possibly your master is running on a non-default port. Can you post your submit command?
Have a look at https://spark.apache.org/docs/latest/spark-standalone.html#connecting-an-application-to-the-cluster
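For reference, a standalone submit typically looks like the sketch below; the class and jar names are placeholders, and the master URL must match the host/port the master actually advertises:
$SPARK_HOME/bin/spark-submit \
  --class com.example.Main \
  --master spark://spark-master:7077 \
  target/app.jar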

Unable to load webadmin for Neo4j High Availability

I have installed 3 instances of neo4j version 1.9.4 on a Linux machine, in 3 different directories: Neo4j01, neo4j02, neo4j03.
I have updated the configuration files neo4j.properties and neo4j-server.properties as mentioned in the link (http://docs.neo4j.org/chunked/milestone/ha-setup-tutorial.html).
When I start all the neo4j instances one after the other, they start up successfully, but after some time 2 of the 3 neo4j processes/instances disappear. I noticed it via ps -aef | grep neo4j.
When I checked the console logs, I found the errors below:
2013-11-12 16:37:32.512+0000 INFO [Cluster] Checking store consistency with master
2013-11-12 16:37:33.174+0000 INFO [Cluster] Store is consistent
2013-11-12 16:37:33.176+0000 INFO [Cluster] Catching up with master
2013-11-12 16:37:33.276+0000 INFO [Cluster] Now consistent with master
2013-11-12 16:37:34.442+0000 INFO [Cluster] ServerId 2, successfully moved to slave for master ha://localhost.localdomain:6363?serverId=1
2013-11-12 16:37:34.689+0000 INFO [Cluster] Instance 1 is available as backup at backup://localhost.localdomain:6366
2013-11-12 16:37:34.798+0000 INFO [Cluster] Instance 2 (this server) is available as slave at ha://localhost.localdomain:6364?serverId=2
2013-11-12 16:37:35.036+0000 INFO [Cluster] Database available for write transactions
2013-11-12 16:37:35.360+0000 INFO [API] Successfully started database
2013-11-12 16:37:36.079+0000 INFO [API] Starting HTTP on port :7474 with 10 threads available
2013-11-12 16:37:40.596+0000 INFO [Cluster] Instance 3 has failed
2013-11-12 16:37:43.654+0000 INFO [API] Enabling HTTPS on port :7473
2013-11-12 16:38:01.081+0000 INFO [API] Mounted REST API at: /db/manage/
2013-11-12 16:38:01.158+0000 INFO [API] Mounted discovery module at [/]
2013-11-12 16:38:02.375+0000 INFO [API] Loaded server plugin "CypherPlugin"
2013-11-12 16:38:02.449+0000 INFO [API] Loaded server plugin "GremlinPlugin"
2013-11-12 16:38:02.462+0000 INFO [API] Mounted REST API at [/db/data/]
2013-11-12 16:38:02.534+0000 INFO [API] Mounted management API at [/db/manage/]
2013-11-12 16:38:03.568+0000 INFO [API] Mounted webadmin at [/webadmin]
2013-11-12 16:38:06.189+0000 INFO [API] Mounting static content at [/webadmin] from [webadmin-html]
2013-11-12 16:38:30.844+0000 DEBUG [API] Failed to start Neo Server on port [7474], reason [org.mortbay.util.MultiException[java.net.BindException: Address already in use, java.net.BindException: Address already in use]]
2013-11-12 16:38:30.880+0000 DEBUG [API] org.neo4j.server.ServerStartupException: Starting Neo4j Server failed: org.mortbay.util.MultiException[java.net.BindException: Address already in use, java.net.BindException: Address already in use]
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:211) ~[neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.Bootstrapper.start(Bootstrapper.java:86) [neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.Bootstrapper.main(Bootstrapper.java:49) [neo4j-server-1.9.4.jar:1.9.4]
Caused by: java.lang.RuntimeException: org.mortbay.util.MultiException[java.net.BindException: Address already in use, java.net.BindException: Address already in use]
at org.neo4j.server.web.Jetty6WebServer.startJetty(Jetty6WebServer.java:334) ~[neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.web.Jetty6WebServer.start(Jetty6WebServer.java:154) ~[neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.AbstractNeoServer.startWebServer(AbstractNeoServer.java:344) ~[neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:187) ~[neo4j-server-1.9.4.jar:1.9.4]
... 2 common frames omitted
Caused by: org.mortbay.util.MultiException: Multiple exceptions
at org.mortbay.jetty.Server.doStart(Server.java:188) ~[jetty-6.1.25.jar:6.1.25]
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) ~[jetty-util-6.1.25.jar:6.1.25]
at org.neo4j.server.web.Jetty6WebServer.startJetty(Jetty6WebServer.java:330) ~[neo4j-server-1.9.4.jar:1.9.4]
... 5 common frames omitted
2013-11-12 16:38:30.894+0000 DEBUG [API] Failed to start Neo Server on port [7474]
Now only the neo4j01 process is running; the neo4j02 and neo4j03 processes have disappeared. But even though the neo4j01 process is up and running, I am unable to access the webadmin page at http://htname:7474/webadmin/#/info/org.neo4j/High%20Availability/.
Please, can someone shed some light on this?
You might want to take a look at https://github.com/neo-technology/neo4j-enterprise-local-qa. This contains a rakefile that automates a local setup of 3 instances. Clone the repo locally, and use
rake setup_cluster start_cluster
to bring a locally running cluster online. Shutdown can be done via
rake stop_cluster
Find the configs in machine[ABC]/conf/.
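Incidentally, the Address already in use errors in the log above suggest all three instances are competing for the same default web ports (7474/7473). Each instance needs its own ports in its neo4j-server.properties; a sketch with illustrative values for the second instance:
# neo4j-server.properties for instance 2
org.neo4j.server.webserver.port=7475
org.neo4j.server.webserver.https.port=7476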
