HDFS write from Kafka: createBlockOutputStream Exception - Docker

I'm running Hadoop on Docker Swarm with 1 namenode and 3 datanodes (on 3 physical machines).
I'm also using Kafka and Kafka Connect with the HDFS connector to write messages into HDFS in Parquet format.
I'm able to write data to HDFS using HDFS clients (hdfs put).
But when Kafka is writing messages, it works at the very beginning, then it fails with this error:
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.8:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
[2018-05-23 10:30:10,125] INFO Abandoning BP-468254989-172.17.0.2-1527063205150:blk_1073741825_1001 (org.apache.hadoop.hdfs.DFSClient:1265)
[2018-05-23 10:30:10,148] INFO Excluding datanode DatanodeInfoWithStorage[10.0.0.8:50010,DS-cd1c0b17-bebb-4379-a5e8-5de7ff7a7064,DISK] (org.apache.hadoop.hdfs.DFSClient:1269)
[2018-05-23 10:31:10,203] INFO Exception in createBlockOutputStream (org.apache.hadoop.hdfs.DFSClient:1368)
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
After that, the process can no longer reach the datanodes:
[2018-05-23 10:32:10,316] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient:557)
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /topics/+tmp/test_hdfs/year=2018/month=05/day=23/hour=08/60e75c4c-9129-454f-aa87-6c3461b54445_tmp.parquet could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1733)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2496)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:828)
But if I look at the Hadoop web admin console, all the nodes seem to be up and OK.
I've checked hdfs-site, and the "dfs.client.use.datanode.hostname" setting is set to true on both the namenode and the datanodes. All IPs in the Hadoop configuration files are defined using 0.0.0.0 addresses.
I've tried formatting the namenode too, but the error happened again.
Could the problem be that Kafka is writing to HDFS too fast and overwhelming it? That would be odd, as I've tried the same configuration on a smaller cluster and it worked well even with a high throughput of Kafka messages.
Do you have any other idea about the origin of this problem?
Thanks

dfs.client.use.datanode.hostname=true also has to be configured on the client side. Going by your log stack:
java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
I guess 10.0.0.9 is a private network IP, so it seems that the property is not set on your client in hdfs-client.xml.
You can find more detail here.
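As a rough way to verify the point above, the following sketch checks whether a Hadoop-style configuration XML sets dfs.client.use.datanode.hostname to true. The function name and the SAMPLE content are hypothetical and only for illustration; which file your client actually loads (the answer calls it hdfs-client.xml; on many setups it is the client's copy of hdfs-site.xml) is an assumption you would need to confirm for the Kafka Connect worker.

```python
import xml.etree.ElementTree as ET

PROPERTY = "dfs.client.use.datanode.hostname"


def client_uses_datanode_hostname(config_xml: str) -> bool:
    """Return True if the Hadoop-style XML text sets PROPERTY to 'true'."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip().lower()
        if name == PROPERTY:
            return value == "true"
    # Property absent: Hadoop's documented default for this setting is false.
    return False


# Hypothetical client-side configuration content, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(client_uses_datanode_hostname(SAMPLE))
```

Running this kind of check against the configuration visible to the Kafka Connect process (rather than the one on the namenode or datanodes) is the relevant test, since the connector is the HDFS client here.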

Related

Running Kafka connect in standalone mode, having issues with offsets

I am using this GitHub repo and folder path I found: https://github.com/entechlog/kafka-examples/tree/master/kafka-connect-standalone to run Kafka Connect locally in standalone mode. I have made some changes to the Docker Compose file, but mainly changes that pertain to authentication.
The problem I am now having is that when I run the Docker image, I get this error multiple times, for each partition (there are 10 of them, 0 through 9):
[2021-12-07 19:03:04,485] INFO [bq-sink-connector|task-0] [Consumer clientId=connector-consumer-bq-sink-connector-0, groupId=connect-bq-sink-connector] Found no committed offset for partition <topic name here>-<partition number here> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1362)
I don't think there are any issues with authenticating or connecting to the endpoint(s); I think the consumer (the Connect sink) is not sending the offset back.
Am I missing an environment variable? You will see this Docker Compose file has CONNECT_OFFSET_STORAGE_FILE_FILENAME: /tmp/connect.offsets. I tried adding CONNECTOR_OFFSET_STORAGE_FILE_FILENAME: /tmp/connect.offsets (CONNECT_ vs. CONNECTOR_), but then I get the error Failed authentication with <Kafka endpoint here>, so now I'm just going in circles.
I think you are focused on the wrong output.
That is an INFO message, not an error.
The offsets file (or topic, in distributed mode) is for source connectors.
Sink connectors use consumer groups. If no committed offset is found for groupId=connect-bq-sink-connector, then the consumer group has not committed one.
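To make the distinction concrete, here is a small sketch that pulls the log level and the consumer group out of a line like the one in the question. The helper names and the sample topic name my-topic are hypothetical; the "connect-" prefix simply mirrors what the log shows for bq-sink-connector.

```python
import re

# "[timestamp] LEVEL rest-of-line", as in the Connect worker log above.
LOG_PATTERN = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<rest>.*)$")
GROUP_PATTERN = re.compile(r"groupId=([^\]\s,]+)")


def summarize(line):
    """Return the log level and consumer group id found in a log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    group = GROUP_PATTERN.search(match.group("rest"))
    return {
        "level": match.group("level"),
        "group_id": group.group(1) if group else None,
    }


def sink_group_id(connector_name):
    """Group id as it appears in the log: 'connect-' plus the connector name."""
    return "connect-" + connector_name


# Sample line modelled on the question; "my-topic" is a made-up topic name.
SAMPLE_LINE = (
    "[2021-12-07 19:03:04,485] INFO [bq-sink-connector|task-0] "
    "[Consumer clientId=connector-consumer-bq-sink-connector-0, "
    "groupId=connect-bq-sink-connector] Found no committed offset for "
    "partition my-topic-0 "
    "(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1362)"
)

if __name__ == "__main__":
    print(summarize(SAMPLE_LINE))
    print(sink_group_id("bq-sink-connector"))
```

The group id extracted this way is the one to inspect with the usual consumer-group tooling when checking whether the sink has committed anything, rather than the /tmp/connect.offsets file.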

RabbitMQ Unable to Join Cluster

I am trying to learn how to cluster RabbitMQ nodes, and I am following this tutorial as well as the official documentation.
I have 2 physical machines with RabbitMQ deployed on them through Docker. machine1 (192.168.1.2) is to be the cluster, and machine2 (192.168.1.3) is to join it.
When I attempt to run rabbitmqctl join_cluster rabbit@192.168.1.2 from machine2, this fails with the following message.
Clustering node rabbit@node2.rabbit with rabbit@192.168.1.2
Error: unable to perform an operation on node 'rabbit@192.168.1.2'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit@192.168.1.2
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit@192.168.1.2']
rabbit@192.168.1.3:
* connected to epmd (port 4369) on 192.168.1.2
* epmd reports node 'rabbit' uses port 25672 for inter-node and CLI tool traffic
* TCP connection succeeded but Erlang distribution failed
* suggestion: check if the Erlang cookie identical for all server nodes and CLI tools
* suggestion: check if all server nodes and CLI tools use consistent hostnames when addressing each other
* suggestion: check if inter-node connections may be configured to use TLS. If so, all nodes and CLI tools must do that
* suggestion: see the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
Current node details:
* node name: 'rabbitmqcli-1352-rabbit@node2.rabbit'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: XXXXXXXXXXXXX
The error logs on machine1 show nothing related to such a connection attempt. I have verified the md5sum of the cookies on both docker containers and they are exactly the same. So are the permissions.
I assumed perhaps the port 4369 isn't reachable, but it is.
I am unsure what I am doing wrong. Can someone help here?
Additional information:
I am using the rabbitmq:3.85-management image. It uses Erlang/OTP 23 [erts-11.0.3].
I have been checking the troubleshooting guide, but I am unsure what seems wrong here. Please let me know if I can provide more information.
So thanks to @NeoAnderson and @José M, I was able to understand what happened.
The containers running RMQ need to be reachable across the network via the hostname that Erlang uses within the service. Since the containers' hostnames were not reachable from a container on another machine, the clustering failed.
A simple fix would be to edit the /etc/hosts file on the containers so that the hostname of the "leader" node maps to its IP.
I was just doing this to avoid installing RMQ, not because I thought this was the best way to do it. Alternatively, Docker Swarm or k8s would have provided the right networking for me.
But the root cause was definitely the nodename problem.
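The /etc/hosts fix can be sketched as follows: split each node name of the form name@host, then report which hosts have no entry in a given hosts file text. The function names are hypothetical, and node1.rabbit is an assumed hostname for the "leader" container (only node2.rabbit appears in the output above).

```python
def split_node_name(node):
    """Split an Erlang-style node name 'name@host' into (name, host)."""
    name, sep, host = node.partition("@")
    if not sep or not name or not host:
        raise ValueError(f"not a name@host node name: {node!r}")
    return name, host


def hosts_line(ip, hostname):
    """Format one /etc/hosts entry mapping hostname to ip."""
    return f"{ip}\t{hostname}"


def missing_hosts(nodes, hosts_file_text):
    """Return hosts used in node names that have no entry in the hosts text."""
    known = set()
    for raw in hosts_file_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split columns
        known.update(fields[1:])               # everything after the IP is a name
    return [host for _, host in map(split_node_name, nodes) if host not in known]


# Hypothetical data: node1.rabbit is an assumed name for the leader container.
NODES = ["rabbit@node1.rabbit", "rabbit@node2.rabbit"]
HOSTS_TEXT = "127.0.0.1 localhost\n192.168.1.3 node2.rabbit\n"

if __name__ == "__main__":
    for host in missing_hosts(NODES, HOSTS_TEXT):
        print("missing:", host)
    print(hosts_line("192.168.1.2", "node1.rabbit"))
```

This only inspects hosts file content; it does not confirm that the Erlang distribution port is reachable or that the cookie matches, which the diagnostics above list as separate possible causes.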

kafka connect in distributed mode is not generating logs specified via log4j properties

I have been using Kafka Connect in my work setup for a while now and it works perfectly fine.
Recently I thought of dabbling with a few connectors of my own in my Docker-based Kafka cluster with just one broker (ubuntu:18.04 with Kafka installed) and a separate node acting as a client for deploying connector apps.
Here is the problem:
Once my broker is up and running, I log in to the client node (with no broker running, just the vanilla Kafka installation) and set up the classpath to point to my connector libraries. I also set the KAFKA_LOG4J_OPTS environment variable to point to the location of the log file to generate, with debug mode enabled.
Then, every time I start the Kafka worker using the command:
nohup /opt//bin/connect-distributed /opt//config/connect-distributed.properties > /dev/null 2>&1 &
the connector starts running, but I don't see the log file getting generated.
I have tried several changes but nothing works out.
QUESTIONS:
Does this mean that connect-distributed.sh doesn't generate the log file after reading the KAFKA_LOG4J_OPTS variable? And if it does, could someone explain how?
NOTE:
(I have already debugged the connect-distributed.sh script and tried the options with and without daemon mode. By default, if KAFKA_LOG4J_OPTS is not provided, it uses the connect-log4j.properties file in the config directory, but even then no log file is generated.)
OBSERVATION:
Only when I start ZooKeeper/broker on the client node is the provided KAFKA_LOG4J_OPTS value picked up and logs start getting generated, but nothing related to the Kafka connector. I have already verified the connectivity between the client and the broker using kafkacat.
The interesting part is:
I follow the same process at my workplace, and logs are generated every time the worker (connect-distributed.sh) is started, but I haven't been able to replicate that behavior in my own setup. I have no clue what I am missing here.
Could someone provide some reasoning? This is really driving me mad.

Can't mirror messages from Kafka producer in container to consumer

I am trying to mirror a kafka topic from a provider in a container in an ec2 instance to a consumer, and messages are not coming through. I suspect that I am messing things up with the .properties configs, as this is my first time using MirrorMaker. I may have also messed up with pointing to the wrong ports somewhere along the way.
The would-be provider broker is running in a centOS container in an ec2 instance. The provider is receiving data from a remote MySQL server through a custom-configured jdbc source connector to a topic called mysql-jdbc-events. The provider is successfully receiving messages.
The would-be consumer is currently on the host ec2 instance, although that will change once it's successfully been tested. I mapped port 12181 of the host to port 2181 of the container (where zookeeper is running). I am running the MirrorMaker command from the consumer.
I ran the command
./kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config /path/to/config/consumer.properties --producer.config /path/to/config/producer.properties --whitelist "mysql-jdbc-events"
consumer.properties:
# format: host1:port1,host2:port2 ...
zookeeper.connect=(host ip-address):12181
zookeeper.connection.timeout.ms=10000
bootstrap.servers=localhost:9092
# consumer group id
group.id=mirror_group
producer.properties:
# format: host1:port1,host2:port2 ...
zookeeper.connect=(host ip-address):2181
bootstrap.servers=localhost:9092
# specify the compression codec for all data generated: none, gzip, snappy, lz4, zstd
compression.type=none
I tried both with and without the zookeeper.connect parameter in the producer config because I found conflicting information about whether it is necessary. I also got a warning to the effect of WARN The configuration 'zookeeper.connect' was supplied but isn't a known config., but I read elsewhere on SO that this could be ignored.
I did not get any messages populated to the topic at the consumer, but there are messages in the topic at the producer.
If any more information would be helpful, please let me know.
I am also not married to this configuration - if there is a simpler way to keep the jdbc in the container and forward the messages to a kafka instance outside the container, that's good too.

Confluent Control center not showing system health (for a Multi-Cluster Configuration)

I'm having trouble getting Control Center to work. I set up a 3-node Kafka cluster using the following Docker image: confluentinc/cp-enterprise-kafka. On a separate machine I downloaded Confluent Platform v5.0.1 and tried to configure Control Center to monitor the Docker cluster.
The Kafka broker I'm using for the Control Center configuration is the one from the downloaded Confluent Platform v5.0.1 (I start the whole stack via bin/confluent start).
But I keep getting the rocket launching page when clicking Monitoring > System health.
My setup:
3 node kafka cluster using docker images.
docker image used = confluentinc/cp-enterprise-kafka
kafka running on these hostnames for the 3-node cluster :
os0 / running on tcp/29092
os1 / running on tcp/39092
os2 / running on tcp/49092
Control-center is running on a separate machine whose hostname = sb1
Furthermore, the brokers have the following directives defined:
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
confluent.metrics.reporter.bootstrap.servers=sb1:9092
For the control-center I added the 3 node cluster config :
confluent.controlcenter.kafka.osd.bootstrap.servers=os0:29092,os1:39092,os2:49092
I'm expecting the Kafka brokers to write to the topic _confluent-metrics on the Kafka broker @ sb1 (used by Control Center).
What I've tried/checked/debugged so far :
I dumped the topic _confluent-metrics, and messages are being written there.
I don't know if the logs from Control Center (@ /tmp/confluent.QJ2C4BmE/control-center/control-center.stdout) show anything useful (at least from what I can interpret).
I can see HTTP/200 for the cluster I'm trying to monitor written in the log.
In the log from the Kafka brokers I also see the following, which made me think the messages were being written to the topic:
[2018-12-15 07:57:59,893] ERROR Failed to submit metrics to Kafka topic __confluent.support.metrics (due to exception): java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for __confluent.support.metrics-0: 30083 ms has passed since batch creation plus linger time (io.confluent.support.metrics.submitters.KafkaSubmitter)
[2018-12-15 07:58:01,088] INFO Successfully submitted metrics to Confluent via secure endpoint (io.confluent.support.metrics.submitters.ConfluentSubmitter)
I've run out of viable ways to debug this; any help would be appreciated.
Thanks in advance.
I was accessing Control Center via an SSH tunnel. (This was a testing environment I was using to set up CC (Control Center).)
When I accessed CC directly at its ip:port, everything ran smoothly.
