I have a cluster of 4 nodes in total (3 servers and 1 management node) that was working properly.
At the beginning of the month we planned to patch the OS and we started from the first server node with this procedure:
Stop service
OS patching
Server restart
Start service
The service on the first patched node, "serverA", fails to restart with this error:
Log entries cluster join:
serverA:
| INFO | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: shared=true ordered=false failed to connect to peer 10.237.110.195( Server serverB:9993):1024 because: java.net.ConnectException: Connection timed out (Connection timed out)
| WARN | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: Attempting reconnect to peer 10.237.110.195( Server serverB:9993):1024
ServerMgmt:
| WARN | pool-3-thread-1 | tributed.internal.ReplyProcessor21 | --> 15 seconds have elapsed while waiting for replies: <CreateRegionProcessor$CreateRegionReplyProcessor 44180 waiting for 1 replies from [10.237.110.194( Server serverA:632):1024]> on 10.237.110.225( Management:6033):1024 whose current membership list is: [[10.237.110.196( Server serverC:16805):1024, 10.237.110.225( Management:6033):1024, 10.237.110.195( Server serverB:9993):1024, 10.237.110.194( Server serverA:632):1024]]
The connection between the systems was verified with tcpdump; UDP traffic on port 1024 is flowing fine.
We have tried redeploying the service and made numerous attempts, but we always get the same error during startup.
Any suggestions? Thank you.
Marco.
For this error message to appear, I think serverA was probably able to send UDP messages to serverB but is failing to create a TCP connection. It's hard to say why, though: a firewall issue, some TCP configuration issue, ...?
Check whether serverB has anything interesting in its logs. Since you are using tcpdump, you should be watching for that TCP connection to serverB:9993, since it looks like that is what failed.
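To separate "UDP works" from "TCP works", a plain TCP connect attempt from serverA to serverB can help. The following is only a minimal sketch and not part of Geode: the class name TcpProbe and the 5 second timeout are made up for illustration, and the host and port are whatever you pass on the command line.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Geode: tries a plain TCP connect with a timeout.
final class TcpProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, unknown hosts and connect timeouts.
            System.err.println("connect to " + host + ":" + port + " failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Usage: java TcpProbe.java <host> <port>
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println(host + ":" + port + " reachable over TCP: "
                + canConnect(host, port, 5000));
    }
}
```

Run it on serverA against serverB and then in the opposite direction. If it times out in the same way as the Geode log while tcpdump shows the SYN leaving serverA, the problem is most likely on the receiving side.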
There is no firewall between the systems. We analyzed the network connection again during startup of node A, and we can see that communication can be established between all systems. What we did detect is that on port 2323, which is configured as the locator port, node A sends packets to nodes B and C but only receives packets back from node C, not from node B. For us this is another sign that node B has an issue. Is there a way to check our assumption from node B?
A node ip .194
B node ip .195
C node ip .196
Management ip .225
I have managed to get everything working with a local master and two remote workers. Now I want to connect to a remote master that has the same remote workers. I have tried different combinations of settings within /etc/hosts and other recommendations from the Internet, but nothing worked.
The Main class is:
public static void main(String[] args) {
    ScalaInterface sInterface = new ScalaInterface(CHUNK_SIZE,
            "awsAccessKeyId",
            "awsSecretAccessKey");
    SparkConf conf = new SparkConf().setAppName("POC_JAVA_AND_SPARK")
            .setMaster("spark://spark-master:7077");
    org.apache.spark.SparkContext sc = new org.apache.spark.SparkContext(
            conf);
    sInterface.enableS3Connection(sc);
    org.apache.spark.rdd.RDD<Tuple2<Path, Text>> fileAndLine = (RDD<Tuple2<Path, Text>>) sInterface.getMappedRDD(sc, "s3n://somebucket/");
    org.apache.spark.rdd.RDD<String> pInfo = (RDD<String>) sInterface.mapPartitionsWithIndex(fileAndLine);
    JavaRDD<String> pInfoJ = pInfo.toJavaRDD();
    List<String> result = pInfoJ.collect();
    String miscInfo = sInterface.getMiscInfo(sc, pInfo);
    System.out.println(miscInfo);
}
It fails at:
List<String> result = pInfoJ.collect();
The error I am getting is:
1354 [sparkDriver-akka.actor.default-dispatcher-3] ERROR akka.remote.transport.netty.NettyTransport - failed to bind to spark-master/192.168.0.191:0, shutting down Netty transport
1354 [main] WARN org.apache.spark.util.Utils - Service 'sparkDriver' could not bind on port 0. Attempting port 1.
1355 [main] DEBUG org.apache.spark.util.AkkaUtils - In createActorSystem, requireCookie is: off
1363 [sparkDriver-akka.actor.default-dispatcher-3] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
1364 [sparkDriver-akka.actor.default-dispatcher-3] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
1364 [sparkDriver-akka.actor.default-dispatcher-5] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
1367 [sparkDriver-akka.actor.default-dispatcher-4] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
1370 [sparkDriver-akka.actor.default-dispatcher-6] INFO Remoting - Starting remoting
1380 [sparkDriver-akka.actor.default-dispatcher-4] ERROR akka.remote.transport.netty.NettyTransport - failed to bind to spark-master/192.168.0.191:0, shutting down Netty transport
Exception in thread "main" 1382 [sparkDriver-akka.actor.default-dispatcher-6] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
1382 [sparkDriver-akka.actor.default-dispatcher-6] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
java.net.BindException: Failed to bind to: spark-master/192.168.0.191:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
1383 [sparkDriver-akka.actor.default-dispatcher-7] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
1385 [delete Spark temp dirs] DEBUG org.apache.spark.util.Utils - Shutdown hook called
Thank you kindly for your help!
Setting the environment variable SPARK_LOCAL_IP=127.0.0.1 solved this for me.
I had this problem when my /etc/hosts file was mapping the wrong IP address to my local hostname.
The BindException in your logs complains about the IP address 192.168.0.191. I assume your machine's hostname resolves to that address and that it is not the actual IP address your network interface is using. It should work fine once you fix that.
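One way to check this assumption on the machine that runs the driver is to compare what the local hostname resolves to with the addresses actually assigned to the network interfaces. This is a rough sketch only, not part of Spark; the class name LocalHostCheck and its helper method are made up for illustration.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;

// Hypothetical diagnostic, not part of Spark.
final class LocalHostCheck {

    /** Returns true if addr is assigned to one of this machine's network interfaces. */
    static boolean isAssignedLocally(InetAddress addr) throws SocketException {
        return NetworkInterface.getByInetAddress(addr) != null;
    }

    public static void main(String[] args) throws UnknownHostException, SocketException {
        // Resolves the local hostname through the system resolver,
        // which typically consults /etc/hosts first.
        InetAddress resolved = InetAddress.getLocalHost();
        System.out.println("hostname:    " + resolved.getHostName());
        System.out.println("resolves to: " + resolved.getHostAddress());
        System.out.println("assigned to a local interface: " + isAssignedLocally(resolved));
    }
}
```

If the last line prints false, the hostname maps to an address this machine does not own, and a process that tries to bind to it would be expected to fail in the way shown in the log.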
I had Spark working on my EC2 instance. I then started a new web server, and to meet its requirements I had to change the hostname to the EC2 public DNS name, i.e.
hostname ec2-54-xxx-xxx-xxx.compute-1.amazonaws.com
After that, Spark stopped working and showed the error below:
16/09/20 21:02:22 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
16/09/20 21:02:22 ERROR SparkContext: Error initializing SparkContext.
I solved it by setting SPARK_LOCAL_IP as below:
export SPARK_LOCAL_IP="localhost"
Then I just launched the Spark shell as below:
$SPARK_HOME/bin/spark-shell
Possibly your master is running on a non-default port. Can you post your submit command?
Have a look at https://spark.apache.org/docs/latest/spark-standalone.html#connecting-an-application-to-the-cluster
When trying to start rabbitmq server, I get the following error:
{error_logger,{{2014,9,26},{15,30,21}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",econnrefused]}
{error_logger,{{2014,9,26},{15,30,21}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,320}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.93>,<0.18.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,987},{stack_size,27},{reductions,799}],[]]}
{error_logger,{{2014,9,26},{15,30,21}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[[rabbitmqprelaunch791,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2014,9,26},{15,30,21}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2014,9,26},{15,30,21}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0>},{registered_name,[]},{error_info,{exit,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,133}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}},{ancestors,[<0.8.0>]},{messages,[{'EXIT',<0.10.0>,normal}]},{links,[<0.8.0>,<0.7.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,376},{stack_size,27},{reductions,117}],[]]}
{error_logger,{{2014,9,26},{15,30,21}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}}"}
I tried the erl -sname foo command and it gave a similar error.
Then (as suggested here: http://permalink.gmane.org/gmane.comp.networking.rabbitmq.general/23204) I tried epmd -debug, and it gave the following output:
epmd: Mon Sep 29 11:56:16 2014: epmd running - daemon = 0
epmd: Mon Sep 29 11:56:16 2014: error opening stream socket: Address family not supported by protocol
I tried to google the epmd error "Address family not supported by protocol", but couldn't find anything.
It might be that you are using an IPv6 address, which might not be supported by epmd in your Erlang version. This mail might shed some light on the issue (or just force IPv4 if you can).
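As a rough way to test the IPv6 hypothesis, you can list the addresses configured on the host and see whether any of them are IPv6. The sketch below is written in Java rather than Erlang and is only an illustration: the class name Ipv6Check and its methods are made up, and a JVM started with IPv4-only settings may hide IPv6 addresses, so treat the result as a hint rather than proof.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic, unrelated to the epmd or RabbitMQ code base.
final class Ipv6Check {

    /** Collects every address configured on every local network interface. */
    static List<InetAddress> localAddresses() throws SocketException {
        List<InetAddress> result = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result; // some JDK versions return null when no interface exists
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            result.addAll(Collections.list(nic.getInetAddresses()));
        }
        return result;
    }

    /** Returns true if at least one of the given addresses is an IPv6 address. */
    static boolean anyIpv6(List<InetAddress> addresses) {
        return addresses.stream().anyMatch(a -> a instanceof Inet6Address);
    }

    public static void main(String[] args) throws SocketException {
        List<InetAddress> addresses = localAddresses();
        for (InetAddress address : addresses) {
            System.out.println(address.getHostAddress());
        }
        System.out.println("any IPv6 address configured: " + anyIpv6(addresses));
    }
}
```

If no IPv6 address shows up at all, IPv6 may be disabled on the host, which would be consistent with the "Address family not supported by protocol" message from epmd.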