Q: Apache Geode - Unable to reconnect a node after SO patching "15 seconds have elapsed while waiting for replies" - timeout

I have a cluster situation consisting of 4 total nodes, 3 servers and 1 management node, working properly.
At the beginning of the month we planned to patch the OS and we started from the first server node with this procedure:
Stop service
S.O. patching
Server restart
Start service
The service of the first patched node named "serverA" fails to restart with this error:
Log entries cluster join:
serverA:
| INFO | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: shared=true ordered=false failed to connect to peer 10.237.110.195( Server serverB:9993):1024 because: java.net.ConnectException: Connection timed out (Connection timed out)
| WARN | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: Attempting reconnect to peer 10.237.110.195( Server serverB:9993):1024
ServerMgmt:
| WARN | pool-3-thread-1 | tributed.internal.ReplyProcessor21 | --> 15 seconds have elapsed while waiting for replies: <CreateRegionProcessor$CreateRegionReplyProcessor 44180 waiting for 1 replies from [10.237.110.194( Server serverA:632):1024]> on 10.237.110.225( Management:6033):1024 whose current membership list is: [[10.237.110.196( Server serverC:16805):1024, 10.237.110.225( Management:6033):1024, 10.237.110.195( Server serverB:9993):1024, 10.237.110.194( Server serverA:632):1024]]
The connection between the systems was verified with tcpdumps, udp 1024 is running fine.
We have tried redeploying the service and making numerous attempts but we always get the same error during startup.
Any suggestions? Thank you.
Marco.

I think to see this error message, serverA was probably able to send UDP messages to serverB but it is failing to create a TCP connection. It's hard to say why though - a firewall issue, some TCP configuration issue, ... ?
Check to see if serverB has anything interesting in its logs. Since you are using TCP dump, you should be watching for that TCP connection for serverB:9993, since it looks like that is wwhat failed.

There is no firewall between the systems, we've analyzed again the network connection, during startup from node a, and we can see that the communication can be established between all systems. But what we detected is, that on port 2323 which is configured as locater, the node sends packages to the b and c node, but only receives back packages from the c node, and not from the b node. This is for us again a sign that the b node has an issue. Does it give a way to check our assumption from the b node?
A node ip .194
B node ip .195
C node ip .196
Management ip .225

Related

Kafka stream application stopped with org.apache.kafka.common.errors.TimeoutException

I have installed Kafka 1.0.0 with help of docker composer and I am running this Kafka successfully with two brokers. I created a topic manually with partition and inserted the events.
Now I am running a application with 1.0.0 Kafka Stream by pointing to this Kafka. After running my application for some time, Following messages were showing in log and stopped from run. Except producer request.timeout.ms, all other config parameters are default parameters and producer request.timeout.ms is 120seconds.
Before stopping with below messages, Couple of times I observed 'Trying to rejoin the consumer group now. org.apache.kafka.streams.errors.TaskMigratedException:' and 'Caused by: org.apache.kafka.clients.consumer.CommitFailedException:' messages in the log.
What would be the possible reason? Please help me.
Messages before stopping:
2017-12-07 06:17:03,122 WARN o.a.k.c.p.i.Sender [kafka-producer-network-thread | sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] [Producer clientId=sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] Got error produce response with sample id 14099 on topic-partition abc-0, retrying (9 attempts left). Error: NETWORK_EXCEPTION
2017-12-07 06:18:02,675 ERROR o.a.k.s.p.i.RecordCollectorImpl [kafka-producer-network-thread | sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] task [2_0] Error sending record (key 5a12c9ade532af0412fc7bcc.5a12c9ade532af0412fc7bca value com.sample.kafka.streams.SampleEvent#4a56c681 timestamp 1512363589768) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Expiring 9 record(s) for abc-0: 189836 ms has passed since last append; No more records will be sent and no more offsets will be recorded for this task.
2017-12-07 06:18:02,927 INFO o.a.k.c.c.i.AbstractCoordinator [sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1] [Consumer clientId=sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-consumer, groupId=sample-app-0.0.1] Discovered coordinator 1.1.1.1:32775 (id: 2147482645 rack: null)

Neo4J bolt connector not working

I am trying to start Neo4J embedded db (3.2.0) and then access this db from another jvm process via bolt driver (1.4.4). Although my code below do print sysouts indicating that db has started,
System.out.println("Starting Neo embedded database at " + databaseFile);
BoltConnector bolt = new BoltConnector("key");
service = new GraphDatabaseFactory().newEmbeddedDatabaseBuilder(new File(databaseFile))
.setConfig(bolt.enabled, "true").setConfig(bolt.type, "BOLT")
.setConfig(bolt.encryption_level, "DISABLED").newGraphDatabase();
System.out.println("Started Neo embedded database");
my another jvm process running on same machine trying to query db via bolt driver fails with below error:
Exception in thread "main" org.neo4j.driver.v1.exceptions.ServiceUnavailableException: Unable to connect to localhost:7687, ensure the database is running and that there is a working network connection to it.
at org.neo4j.driver.internal.net.SocketClient.start(SocketClient.java:132)
at org.neo4j.driver.internal.net.SocketConnection.startSocketClient(SocketConnection.java:92)
at org.neo4j.driver.internal.net.SocketConnection.<init>(SocketConnection.java:67)
at org.neo4j.driver.internal.net.SocketConnector.createConnection(SocketConnector.java:77)
at org.neo4j.driver.internal.net.SocketConnector.connect(SocketConnector.java:50)
at org.neo4j.driver.internal.net.pooling.SocketConnectionPool$ConnectionSupplier.get(SocketConnectionPool.java:216)
at org.neo4j.driver.internal.net.pooling.SocketConnectionPool$ConnectionSupplier.get(SocketConnectionPool.java:198)
at org.neo4j.driver.internal.net.pooling.BlockingPooledConnectionQueue.acquire(BlockingPooledConnectionQueue.java:96)
at org.neo4j.driver.internal.net.pooling.SocketConnectionPool.acquireConnection(SocketConnectionPool.java:149)
at org.neo4j.driver.internal.net.pooling.SocketConnectionPool.acquire(SocketConnectionPool.java:76)
at org.neo4j.driver.internal.DirectConnectionProvider.acquireConnection(DirectConnectionProvider.java:47)
at org.neo4j.driver.internal.DirectConnectionProvider.verifyConnectivity(DirectConnectionProvider.java:67)
at org.neo4j.driver.internal.DirectConnectionProvider.<init>(DirectConnectionProvider.java:41)
at org.neo4j.driver.internal.DriverFactory.createDirectDriver(DriverFactory.java:109)
at org.neo4j.driver.internal.DriverFactory.createDriver(DriverFactory.java:93)
at org.neo4j.driver.internal.DriverFactory.newInstance(DriverFactory.java:67)
at org.neo4j.driver.v1.GraphDatabase.driver(GraphDatabase.java:135)
at org.neo4j.driver.v1.GraphDatabase.driver(GraphDatabase.java:117)
at
Caused by: java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:111)
at org.neo4j.driver.internal.net.ChannelFactory.connect(ChannelFactory.java:79)
at org.neo4j.driver.internal.net.ChannelFactory.create(ChannelFactory.java:41)
at org.neo4j.driver.internal.net.SocketClient.start(SocketClient.java:126)
... 20 more
running netstat on windows doesn't show any process listening to port 7687, can someone please help.

Kafka cannot resolve Zookeper's DNS name

I have a kafka 0.10.1.0 cluster (2 nodes) and zookeeper 3.4.6 (3 nodes)
The clusters are hosted on Kubernetes following this tutorial.
Relevant entries from Kafka's server.properties:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://kafka.internal.<companyname>.com:9092
zookeeper.connect=zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
Upon server startup, each Kafka broker fails quickly with the following. To me, it looks like it cannot resolve the DNS name zookeeper-1. I also attempted removing the ports from zookeeper.connect, although my reading of the relevant code, I don't believe that will make a difference.
Naturally, I confirmed that zookeeper-1 can be resolved from within the cluster. Other containers from within the cluster can resolve the name.
I also attempted with a series of other aliases, including the services' DNS name and Zookeeper's load balancer(s), all of which I independently confirmed working. In each case, Kafka alone reported Name or service not known.
[2016-11-22 19:55:45,506] INFO Initiating client connection, connectString=zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient#7722c3c3 (org.apache.zookeeper.ZooKeeper)
[2016-11-22 19:56:05,571] INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
[2016-11-22 19:56:05,572] FATAL Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1227)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:156)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:130)
at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:76)
at kafka.utils.ZkUtils$.apply(ZkUtils.scala:58)
at kafka.server.KafkaServer.initZk(KafkaServer.scala:327)
at kafka.server.KafkaServer.startup(KafkaServer.scala:200)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:39)
at kafka.Kafka$.main(Kafka.scala:67)
at kafka.Kafka.main(Kafka.scala)
Caused by: java.net.UnknownHostException: zookeeper-1: Name or service not known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:446)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:69)
... 10 more
[2016-11-22 19:56:05,575] INFO shutting down (kafka.server.KafkaServer)
[2016-11-22 19:56:05,616] INFO shut down completed (kafka.server.KafkaServer)
Other info related to the Kafka image: It is based off wurstmeister/kafka-docker but updated to inherit from openjdk:8-jre.
It turns out that this was an issue with Kubernetes itself.
After an unrelated upgrade to v1.4.6 and no other changes, the names were able to resolve normally.

Failed to bind to: spark-master, using a remote cluster with two workers

I am managing to get everything working with the local master and two remote workers. Now, I want to connect to a remote master that has the same remote workers. I have tried different combinations of settings withing the /etc/hosts and other reccomendations on the Internet, but NOTHING worked.
The Main class is:
public static void main(String[] args) {
ScalaInterface sInterface = new ScalaInterface(CHUNK_SIZE,
"awsAccessKeyId",
"awsSecretAccessKey");
SparkConf conf = new SparkConf().setAppName("POC_JAVA_AND_SPARK")
.setMaster("spark://spark-master:7077");
org.apache.spark.SparkContext sc = new org.apache.spark.SparkContext(
conf);
sInterface.enableS3Connection(sc);
org.apache.spark.rdd.RDD<Tuple2<Path, Text>> fileAndLine = (RDD<Tuple2<Path, Text>>) sInterface.getMappedRDD(sc, "s3n://somebucket/");
org.apache.spark.rdd.RDD<String> pInfo = (RDD<String>) sInterface.mapPartitionsWithIndex(fileAndLine);
JavaRDD<String> pInfoJ = pInfo.toJavaRDD();
List<String> result = pInfoJ.collect();
String miscInfo = sInterface.getMiscInfo(sc, pInfo);
System.out.println(miscInfo);
}
It fails at:
List<String> result = pInfoJ.collect();
The error I am getting is:
1354 [sparkDriver-akka.actor.default-dispatcher-3] ERROR akka.remote.transport.netty.NettyTransport - failed to bind to spark-master/192.168.0.191:0, shutting down Netty transport
1354 [main] WARN org.apache.spark.util.Utils - Service 'sparkDriver' could not bind on port 0. Attempting port 1.
1355 [main] DEBUG org.apache.spark.util.AkkaUtils - In createActorSystem, requireCookie is: off
1363 [sparkDriver-akka.actor.default-dispatcher-3] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
1364 [sparkDriver-akka.actor.default-dispatcher-3] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
1364 [sparkDriver-akka.actor.default-dispatcher-5] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
1367 [sparkDriver-akka.actor.default-dispatcher-4] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
1370 [sparkDriver-akka.actor.default-dispatcher-6] INFO Remoting - Starting remoting
1380 [sparkDriver-akka.actor.default-dispatcher-4] ERROR akka.remote.transport.netty.NettyTransport - failed to bind to spark-master/192.168.0.191:0, shutting down Netty transport
Exception in thread "main" 1382 [sparkDriver-akka.actor.default-dispatcher-6] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
1382 [sparkDriver-akka.actor.default-dispatcher-6] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
java.net.BindException: Failed to bind to: spark-master/192.168.0.191:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
1383 [sparkDriver-akka.actor.default-dispatcher-7] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
1385 [delete Spark temp dirs] DEBUG org.apache.spark.util.Utils - Shutdown hook called
Thank you kindly for your help!
Setting the environment variable SPARK_LOCAL_IP=127.0.0.1 solved this for me.
I had this problem when my /etc/hosts file was mapping the wrong IP address to my local hostname.
The BindException in your logs complains about the IP address 192.168.0.191. I assume that resolves to the hostname of your machine and it's not the actual IP address that your network interface is using. It should work fine once you fix that.
I had spark working in my EC2 instance. I started a new web server and to meet its requirement I had to change hostname to ec2 public DNS name i.e.
hostname ec2-54-xxx-xxx-xxx.compute-1.amazonaws.com
After that my spark could not work and showed error as below:
16/09/20 21:02:22 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
16/09/20 21:02:22 ERROR SparkContext: Error initializing SparkContext.
I solve it by setting SPARK_LOCAL_IP to as below:
export SPARK_LOCAL_IP="localhost"
then just launched sparkling shell as below:
$SPARK_HOME/bin/spark-shell
Possily your master is running on non-default port. Can you post your submit command?
Have a look in https://spark.apache.org/docs/latest/spark-standalone.html#connecting-an-application-to-the-cluster

Flume agent throws java.net.ConnectException: Connection refused

I have been using flume for a while now, I have got agent and collector running on same machine.
Configuration
agent: exec("/usr/bin/tail -n +0 -F /path/to/file") | agentE2ESink("hostname", 35855)
collector: collectorSource(35855) | collector(10000) { collectorSink("/hdfs/path/to/sink","name") }
Facing issues in the agent node:
2012-06-04 19:13:33,625 [naive file wal consumer-27] INFO debug.InsistentOpenDecorator: open attempt 0 failed, backoff (1000ms): Failed to open thrift event sink to hostname:35855 : java.net.ConnectException: Connection refused
2012-06-04 19:13:34,625 [logicalNode hostname-19] ERROR connector.DirectDriver: Expected ACTIVE but timed out in state OPENING
2012-06-04 19:13:34,632 [naive file wal consumer-27] INFO debug.InsistentOpenDecorator: open attempt 1 failed, backoff (2000ms): Failed to open thrift event sink to hostname:35855 : java.net.ConnectException: Connection refused
2012-06-04 19:13:36,635 [naive file wal consumer-27] INFO debug.InsistentOpenDecorator: open attempt 2 failed, backoff (4000ms): Failed to open thrift event sink to hostname:35855 : java.net.ConnectException: Connection refused
and then empty ACKs will be sent continuously
2012-06-04 19:19:56,960 [Roll-TriggerThread-0] INFO endtoend.AckListener$Empty: Empty Ack Listener began 20120604-191956958+0530.881565921235084.00000026
2012-06-04 19:20:07,043 [Roll-TriggerThread-0] INFO hdfs.SeqfileEventSink: closed /tmp/flume-user1/agent/hostname/writing/20120604-191956958+0530.881565921235084.00000026
I dont understand why the connection is refused. Are there any system level changes that needs to be done ?
Note: the collector is listening to the port but agent is unable to send data through the 35855 port.
Can anyone help me with this problem.
Thanks
If you are running both the agent and the collector on the same box, you should be using localhost as the address.
agentE2ESink("localhost", 35855)

Resources