Kafka Connect and HDFS in Docker

I am using the Kafka Connect HDFS sink and Hadoop (for HDFS) in a docker-compose setup.
Hadoop (namenode and datanode) seems to be working correctly.
But I have an error with the Kafka Connect sink:
ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED
(io.confluent.connect.hdfs.TopicPartitionWriter:277)
org.apache.kafka.connect.errors.DataException:
Error creating writer for log file hdfs://namenode:8020/logs/MyTopic/0/log
For information:
Hadoop services in my docker-compose.yml:
namenode:
  image: uhopper/hadoop-namenode:2.8.1
  hostname: namenode
  container_name: namenode
  ports:
    - "50070:50070"
  networks:
    default:
    fides-webapp:
      aliases:
        - "hadoop"
  volumes:
    - namenode:/hadoop/dfs/name
  env_file:
    - ./hadoop.env
  environment:
    - CLUSTER_NAME=hadoop-cluster

datanode1:
  image: uhopper/hadoop-datanode:2.8.1
  hostname: datanode1
  container_name: datanode1
  networks:
    default:
    fides-webapp:
      aliases:
        - "hadoop"
  volumes:
    - datanode1:/hadoop/dfs/data
  env_file:
    - ./hadoop.env
And my kafka-connect file:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=MyTopic
hdfs.url=hdfs://namenode:8020
flush.size=3
EDIT:
I added an environment variable so that Kafka Connect is aware of the cluster name (the CLUSTER_NAME variable, added to the kafka-connect service in the docker-compose file).
The error is no longer the same (so that seems to have solved one problem):
INFO Starting commit and rotation for topic partition scoring-topic-0 with start offsets {partition=0=0} and end offsets {partition=0=2}
(io.confluent.connect.hdfs.TopicPartitionWriter:368)
ERROR Exception on topic partition MyTopic-0: (io.confluent.connect.hdfs.TopicPartitionWriter:403)
org.apache.kafka.connect.errors.DataException: org.apache.hadoop.ipc.RemoteException(java.io.IOException):
File /topics/+tmp/MyTopic/partition=0/bc4cf075-ccfa-4338-9672-5462cc6c3404_tmp.avro
could only be replicated to 0 nodes instead of minReplication (=1).
There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
EDIT2:
The hadoop.env file is:
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
# Configure default BlockSize and Replication for local
# data. Keep it small for experimentation.
HDFS_CONF_dfs_blocksize=1m
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver

Finally, as noted by @cricket_007, I needed to configure hadoop.conf.dir.
The directory should contain hdfs-site.xml.
Since each service is dockerized, I needed to create a named volume in order to share the configuration files between the kafka-connect service and the namenode service.
To do this I added to my docker-compose.yml:
volumes:
  hadoopconf:
Then for the namenode service I added:
    volumes:
      - hadoopconf:/etc/hadoop
And for the kafka-connect service:
    volumes:
      - hadoopconf:/usr/local/hadoop-conf
Finally I set hadoop.conf.dir in my HDFS sink properties file to /usr/local/hadoop-conf.
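For reference, here is a minimal sketch of how the pieces fit together after that change (the kafka-connect service shown here is an assumption about how the Connect worker could be declared, not a copy of my actual service definition):

# docker-compose.yml (excerpt): share the Hadoop client config through a named volume
volumes:
  hadoopconf:

services:
  namenode:
    # ... image, env_file, etc. as above ...
    volumes:
      - hadoopconf:/etc/hadoop            # Hadoop config dir inside the namenode container
  kafka-connect:                          # hypothetical name for the Connect worker service
    # ... image, environment, etc. ...
    volumes:
      - hadoopconf:/usr/local/hadoop-conf

# hdfs-sink.properties (excerpt)
hdfs.url=hdfs://namenode:8020
hadoop.conf.dir=/usr/local/hadoop-conf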

Related

NiFi Cluster Docker Load Balancing configuration

I would like to configure load balancing in the docker-compose.yml file for a NiFi cluster deployed via Docker containers.
The current docker-compose parameters for LB are as follows (for each of the three NiFi nodes):
# load balancing
- NIFI_CLUSTER_LOAD_BALANCE_PORT=6342
- NIFI_CLUSTER_LOAD_BALANCE_HOST=node.name
- NIFI_CLUSTER_LOAD_BALANCE_CONNECTIONS_PER_NODE=4
- NIFI_CLUSTER_LOAD_BALANCE_MAX_THREADS=8
But when I try to use load balancing on queues, I can choose all the parameters there and get no errors, yet LB is not working: everything is done on the primary node (I run GetSFTP on the primary node only, but then want to process the data on all 3 nodes). The NiFi cluster is also configured to work with SSL.
Thanks in advance!
I had to open the load balancing port in my compose file. I also had to specify a hostname in each node's compose file.
Here is my compose file for basic clustering:
version: "3.3"
services:
  nifi_service:
    container_name: "nifi_service"
    image: "apache/nifi:1.11.4"
    hostname: "APPTHLP7"
    environment:
      - TZ=Europe/Istanbul
      - NIFI_CLUSTER_IS_NODE=true
      - NIFI_CLUSTER_NODE_PROTOCOL_PORT=8088
      - NIFI_ZK_CONNECT_STRING=172.16.2.238:2181,172.16.2.240:2181,172.16.2.241:2181
    ports:
      - "8080:8080"
      - "8088:8088"
      - "6342:6342"
    volumes:
      - /home/my/nifi-conf:/opt/nifi/nifi-current/conf
    networks:
      - my_network
    restart: unless-stopped
networks:
  my_network:
    external: true
Please note that you also have to configure the load balance strategy on the downstream connection in your flow.
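If the load balancing properties also need to be set explicitly (as in the question), a sketch of the extra environment entries for this node could look like the following; the values are taken from the question, and the host must match the node's own hostname:

    environment:
      # hypothetical additions for explicit load-balance settings
      - NIFI_CLUSTER_LOAD_BALANCE_PORT=6342
      - NIFI_CLUSTER_LOAD_BALANCE_HOST=APPTHLP7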

Setting up sunbird-telemetry Kafka DRUID and superset

I am trying to create an analytics dashboard based on mobile events. I want to dockerize all the components, deploy them on localhost, and build the analytics dashboard on top.
Sunbird telemetry https://github.com/project-sunbird/sunbird-telemetry-service
Kafka https://github.com/wurstmeister/kafka-docker
Druid https://github.com/apache/incubator-druid/tree/master/distribution/docker
Superset https://github.com/apache/incubator-superset
What I did
Druid
I executed the command docker build -t apache/incubator-druid:tag -f distribution/docker/Dockerfile .
I executed the command docker-compose -f distribution/docker/docker-compose.yml up
After everything is executed, open http://localhost:4008/ and see Druid running
It takes about 3.5 hours to complete both the build and the run
Kafka
Navigate to the kafka folder
Executed the command docker-compose up -d
Issue
When we start Druid, a ZooKeeper starts running, and when we start Kafka its compose file starts another ZooKeeper, so I cannot establish a connection between Kafka and ZooKeeper.
After I start Sunbird telemetry and try to create a topic and connect to Kafka from Sunbird, it does not get connected.
I don't understand what I am doing wrong.
Can we tell Kafka to share the ZooKeeper started by Druid? I am completely new to this environment and these stacks.
I am studying these stacks. Am I doing something wrong? Can anybody point out how to properly connect Kafka and Druid over Docker?
Note: I am running all of this on my Mac.
My Kafka compose file:
version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
  kafka:
    build: .
    ports:
      - "9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: **localhost ip**
      KAFKA_ZOOKEEPER_CONNECT: **localhost ip**:2181
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
Can we tell kafka to share the zookeeper started by DRUID
You would put all services in the same compose file.
Druid's Kafka connection is listed here:
https://github.com/apache/incubator-druid/blob/master/distribution/docker/environment#L31
You can set KAFKA_ZOOKEEPER_CONNECT to the same address, yes
For example, downloading the file above and adding Kafka to the Druid Compose file...
version: "2.2"

volumes:
  metadata_data: {}
  middle_var: {}
  historical_var: {}
  broker_var: {}
  coordinator_var: {}
  overlord_var: {}
  router_var: {}

services:
  # TODO: Add sunbird

  postgres:
    container_name: postgres
    image: postgres:latest
    volumes:
      - metadata_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=FoolishPassword
      - POSTGRES_USER=druid
      - POSTGRES_DB=druid

  # Need 3.5 or later for container nodes
  zookeeper:
    container_name: zookeeper
    image: zookeeper:3.5
    environment:
      - ZOO_MY_ID=1

  druid-coordinator:
    image: apache/incubator-druid
    container_name: druid-coordinator
    volumes:
      - coordinator_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
    ports:
      - "3001:8081"
    command:
      - coordinator
    env_file:
      - environment

  # renamed to druid-broker
  druid-broker:
    image: apache/incubator-druid
    container_name: druid-broker
    volumes:
      - broker_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
      - druid-coordinator
    ports:
      - "3002:8082"
    command:
      - broker
    env_file:
      - environment

  # TODO: add other Druid services

  kafka:
    image: wurstmeister/kafka
    ports:
      - "9092"
    depends_on:
      - zookeeper
    environment:
      KAFKA_ADVERTISED_HOST_NAME: kafka
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181/kafka  # This is the same service that Druid is using
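Once the combined stack is up, a quick way to confirm that the broker registered against the shared ZooKeeper is to create a topic against it, roughly like this (assuming the wurstmeister image puts the Kafka scripts on the PATH and ships a pre-3.0 Kafka that still accepts --zookeeper):

# bring everything up from the single compose file
docker-compose up -d

# create a test topic against the shared ZooKeeper (chrooted at /kafka above)
docker-compose exec kafka kafka-topics.sh \
  --zookeeper zookeeper:2181/kafka \
  --create --topic telemetry-test --partitions 1 --replication-factor 1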
Can we tell kafka to share the zookeeper started by DRUID
Yes, as there's a zookeeper.connect setting for Kafka broker that specifies the Zookeeper address to which Kafka will try to connect. How to do it depends entirely on the docker image you're using. For example, one of the popular images wurstmeister/kafka-docker does this by mapping all environmental variables starting with KAFKA_ to broker settings and adds them to server.properties, so that KAFKA_ZOOKEEPER_CONNECT becomes a zookeeper.connect setting. I suggest taking a look at the official documentation to see what else you can configure.
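For illustration, a sketch of how that mapping plays out for the variables used above (the values are just the ones from the compose example):

# compose environment entry                      -> resulting server.properties line
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181/kafka    # zookeeper.connect=zookeeper:2181/kafka
KAFKA_ADVERTISED_HOST_NAME: kafka                # advertised.host.name=kafka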
and when we start kafka the docker file starts another zookeeper
This is your issue. It's the docker-compose file that starts Zookeeper and Kafka and configures Kafka to use the bundled Zookeeper. You need to modify it by removing the bundled Zookeeper and configuring Kafka to use a different one. Ideally, you should have a single docker-compose file that starts the whole setup.

How to change target of the Spring Cloud Stream Kafka binder?

Using Spring Cloud Stream 2.1.4 with Spring Boot 2.1.10, I'm trying to target a local instance of Kafka.
This is an extract of my project configuration so far:
spring.kafka.bootstrap-servers=PLAINTEXT://localhost:9092
spring.kafka.streams.bootstrap-servers=PLAINTEXT://localhost:9092
spring.cloud.stream.kafka.binder.brokers=PLAINTEXT://localhost:9092
spring.cloud.stream.kafka.binder.zkNodes=localhost:2181
spring.cloud.stream.kafka.streams.binder.brokers=PLAINTEXT://localhost:9092
spring.cloud.stream.kafka.streams.binder.zkNodes=localhost:2181
But the binder keeps calling the wrong target:
java.io.IOException: Can't resolve address: kafka.example.com:9092
How can I specify the target if those properties won't do the trick?
Moreover, I deploy the Kafka instance through a Bitnami Docker image and I'd prefer not to use an SSL configuration (hence the PLAINTEXT protocol), but I can't find properties for a basic credentials login. Does anyone know if this is hopeless?
This is my docker-compose.yml
version: '3'
services:
  zookeeper:
    image: bitnami/zookeeper:latest
    container_name: zookeeper
    environment:
      - ZOO_ENABLE_AUTH=yes
      - ZOO_SERVER_USERS=kafka
      - ZOO_SERVER_PASSWORDS=kafka_password
    networks:
      - kafka-net
  kafka:
    image: bitnami/kafka:latest
    container_name: kafka
    hostname: kafka.example.com
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
    environment:
      - ALLOW_PLAINTEXT_LISTENER=yes
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092
      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://:9092
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_ZOOKEEPER_USER=kafka
      - KAFKA_ZOOKEEPER_PASSWORD=kafka_password
    networks:
      - kafka-net
networks:
  kafka-net:
    driver: bridge
Thanks in advance
The hostname isn't the issue; rather, it's the advertised listeners protocol://host:port mapping that causes the hostname to be advertised by default. You should change that, rather than the hostname.
kafka:
  image: bitnami/kafka:latest
  container_name: kafka
  hostname: kafka.example.com  # <--- Here's what you are getting in the request
  ...
  environment:
    - ALLOW_PLAINTEXT_LISTENER=yes
    - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092
    - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://:9092  # <--- This returns the hostname to the clients
If you plan on running your code outside of another container, you should advertise localhost in addition to, or instead of, the container hostname.
One year later, my comment has still not been merged into the Bitnami README, but I was able to get it working with the following vars (changed to match your deployment):
KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
KAFKA_CFG_LISTENERS=PLAINTEXT://:29092,PLAINTEXT_HOST://:9092
KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka.example.com:29092,PLAINTEXT_HOST://localhost:9092
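With a split listener setup like that, the compose service also needs to publish the host-facing port; here is a minimal sketch of the relevant part of the service, assuming the ports and listener names above:

kafka:
  ports:
    - "9092:9092"   # host clients reach the PLAINTEXT_HOST listener at localhost:9092
  environment:
    - ALLOW_PLAINTEXT_LISTENER=yes
    - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
    - KAFKA_CFG_LISTENERS=PLAINTEXT://:29092,PLAINTEXT_HOST://:9092
    - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka.example.com:29092,PLAINTEXT_HOST://localhost:9092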
All right: I got this to work by looking twice at the "dockerfile" (thanks to cricket_007):
kafka:
  ...
  hostname: localhost
For the record: I could get rid of all the properties above, since the default for Kafka is localhost:9092.
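Under that assumption, the application configuration can shrink to roughly this single line (or be dropped entirely if the localhost:9092 default is acceptable):

# application.properties (minimal sketch, assuming a broker reachable at localhost:9092)
spring.cloud.stream.kafka.binder.brokers=localhost:9092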

Restoration of Hdfs files

We have a Spark cluster which is built with the help of Docker (the singularities/spark image). When we remove the containers, the data stored in HDFS is removed as well. I know that is normal, but how can I solve it so that whenever I start the cluster again, the files in HDFS are restored without uploading them again?
You can bind-mount a host volume as below for the /opt/hdfs directory, for both master and worker:
version: "2"
services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    volumes:
      - "${PWD}/hdfs:/opt/hdfs"
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"
  worker:
    image: singularities/spark
    command: start-spark worker master
    volumes:
      - "${PWD}/hdfs:/opt/hdfs"
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master
This way your HDFS files will always persist at ./hdfs (the hdfs directory in your current working directory) on the host machine.
Ref - https://hub.docker.com/r/singularities/spark/
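A quick way to check the persistence, assuming the hdfs CLI is available on the PATH inside the container (which it should be, since the image bundles Hadoop):

# upload a file, recreate the cluster, then confirm the file is still there
docker-compose exec master hdfs dfs -put test.txt /test.txt
docker-compose down
docker-compose up -d
docker-compose exec master hdfs dfs -ls /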

How to make HDFS work in docker swarm

I am having trouble making my HDFS setup work in Docker Swarm.
To understand the problem I've reduced my setup to the minimum:
1 physical machine
1 namenode
1 datanode
This setup is working fine with docker-compose, but it fails with Docker Swarm, using the same compose file.
Here is the compose file:
version: '3'
services:
  namenode:
    image: uhopper/hadoop-namenode
    hostname: namenode
    ports:
      - "50070:50070"
      - "8020:8020"
    volumes:
      - /userdata/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=hadoop-cluster
  datanode:
    image: uhopper/hadoop-datanode
    depends_on:
      - namenode
    volumes:
      - /userdata/datanode:/hadoop/dfs/data
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://namenode:8020
To test it, I have installed a Hadoop client on my host (physical) machine with only this simple configuration in core-site.xml:
<configuration>
<property><name>fs.defaultFS</name><value>hdfs://0.0.0.0:8020</value></property>
</configuration>
Then I run the following command:
hdfs dfs -put test.txt /test.txt
With docker-compose (just running docker-compose up) it works and the file is written to HDFS.
With Docker Swarm, I'm running:
docker swarm init
docker stack deploy --compose-file docker-compose.yml hadoop
Then when all the services are up and I put my file on HDFS, it fails like this:
INFO hdfs.DataStreamer: Exception in createBlockOutputStream
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/x.x.x.x:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:259)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1692)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1648)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)
18/06/14 17:29:41 WARN hdfs.DataStreamer: Abandoning BP-1801474405-10.0.0.4-1528990089179:blk_1073741825_1001
18/06/14 17:29:41 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[10.0.0.6:50010,DS-d7d71735-7099-4aa9-8394-c9eccc325806,DISK]
18/06/14 17:29:41 WARN hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
If I look in the web UI the datanode seems to be up and no issue is reported...
Update: it seems that depends_on is ignored by Swarm, but it does not seem to be the cause of my problem: I've restarted the datanode once the namenode was up, but it did not work any better.
Thanks for your help :)
The whole mess stems from the interaction between Docker Swarm's overlay networks and how the HDFS namenode keeps track of its datanodes. The namenode records the datanodes' IPs/hostnames based on their overlay network IPs. When the HDFS client asks for read/write operations directly on the datanodes, the namenode reports back the IPs/hostnames of the datanodes based on the overlay network. Since the overlay network is not accessible to external clients, any read/write operation will fail.
The final solution (after lots of struggling to get the overlay network to work) I used was to have the HDFS services use the host network. Here's a snippet from the compose file:
version: '3.7'

x-deploy_default: &deploy_default
  mode: replicated
  replicas: 1
  placement:
    constraints:
      - node.role == manager
  restart_policy:
    condition: any
    delay: 5s

services:
  hdfs_namenode:
    deploy:
      <<: *deploy_default
    networks:
      hostnet: {}
    volumes:
      - hdfs_namenode:/hadoop-3.2.0/var/name_node
    command:
      namenode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0

  hdfs_datanode:
    deploy:
      mode: global
    networks:
      hostnet: {}
    volumes:
      - hdfs_datanode:/hadoop-3.2.0/var/data_node
    command:
      datanode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0

volumes:
  hdfs_namenode:
  hdfs_datanode:

networks:
  hostnet:
    external: true
    name: host
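For completeness, deploying that stack looks roughly like this; hostnet maps onto Docker's built-in host network, so nothing extra has to be created, and the stack name is just an example. PRIMARY_HOST must be a hostname that external clients can also resolve, and to my knowledge docker stack deploy substitutes ${...} from the shell environment (it does not read a .env file):

export PRIMARY_HOST=my-manager-host   # hypothetical hostname of the manager node
docker stack deploy --compose-file docker-compose.yml hdfs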
