I'm having trouble getting my HDFS setup to work in Docker Swarm.
To understand the problem, I've reduced my setup to the minimum:
1 physical machine
1 namenode
1 datanode
This setup works fine with docker-compose, but it fails with Docker Swarm, using the same compose file.
Here is the compose file:
version: '3'
services:
  namenode:
    image: uhopper/hadoop-namenode
    hostname: namenode
    ports:
      - "50070:50070"
      - "8020:8020"
    volumes:
      - /userdata/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=hadoop-cluster
  datanode:
    image: uhopper/hadoop-datanode
    depends_on:
      - namenode
    volumes:
      - /userdata/datanode:/hadoop/dfs/data
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://namenode:8020
To test it, I installed a Hadoop client on my host (physical) machine with only this simple configuration in core-site.xml:
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://0.0.0.0:8020</value></property>
</configuration>
Then I run the following command:
hdfs dfs -put test.txt /test.txt
With docker-compose (just running docker-compose up), it works and the file is written to HDFS.
With Docker Swarm, I'm running:
docker swarm init
docker stack deploy --compose-file docker-compose.yml hadoop
Then, once all services are up, I try to put my file on HDFS and it fails like this:
INFO hdfs.DataStreamer: Exception in createBlockOutputStream
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/x.x.x.x:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:259)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1692)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1648)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)
18/06/14 17:29:41 WARN hdfs.DataStreamer: Abandoning BP-1801474405-10.0.0.4-1528990089179:blk_1073741825_1001
18/06/14 17:29:41 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[10.0.0.6:50010,DS-d7d71735-7099-4aa9-8394-c9eccc325806,DISK]
18/06/14 17:29:41 WARN hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
If I look at the web UI, the datanode seems to be up and no issue is reported...
Update: it seems that depends_on is ignored by swarm, but that does not seem to be the cause of my problem: I restarted the datanode after the namenode was up and it did not work any better.
Thanks for your help :)
The whole mess stems from the interaction between Docker Swarm's overlay networks and how the HDFS namenode keeps track of its datanodes. The namenode records the datanodes' IPs/hostnames based on their overlay network addresses. When the HDFS client asks to read or write blocks directly on the datanodes, the namenode reports back those overlay-network IPs/hostnames. Since the overlay network is not reachable from external clients, any read/write operation fails.
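A quick way to see this in action (a sketch, assuming the stack from the question was deployed as "hadoop" with the default network name and everything runs on one machine):

# The datanode task only has an address on the stack's overlay network,
# and that is the address it registers with the namenode.
docker network inspect hadoop_default --format '{{json .Containers}}'
docker exec "$(docker ps -q -f name=hadoop_datanode)" hostname -i
# The 10.0.x.x address printed above is exactly what the external client is told
# to connect to on port 50010, and it is not routable from outside the overlay.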
The final solution I used (after a lot of struggling to get the overlay network to work) was to have the HDFS services use the host network. Here's a snippet from the compose file:
version: '3.7'
x-deploy_default: &deploy_default
  mode: replicated
  replicas: 1
  placement:
    constraints:
      - node.role == manager
  restart_policy:
    condition: any
    delay: 5s
services:
  hdfs_namenode:
    deploy:
      <<: *deploy_default
    networks:
      hostnet: {}
    volumes:
      - hdfs_namenode:/hadoop-3.2.0/var/name_node
    command:
      namenode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0
  hdfs_datanode:
    deploy:
      mode: global
    networks:
      hostnet: {}
    volumes:
      - hdfs_datanode:/hadoop-3.2.0/var/data_node
    command:
      datanode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0
volumes:
  hdfs_namenode:
  hdfs_datanode:
networks:
  hostnet:
    external: true
    name: host
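For reference, this is roughly how that stack is deployed and tested (the stack name hdfs and the value of PRIMARY_HOST are assumptions: PRIMARY_HOST should be a name or IP of the node running the namenode that external clients can resolve):

export PRIMARY_HOST=$(hostname -f)
docker stack deploy -c docker-compose.yml hdfs
# Because the services share the host's network stack, there is no overlay address
# translation and an external client can reach the datanode ports directly:
hdfs dfs -fs "hdfs://${PRIMARY_HOST}:9000" -put test.txt /test.txt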
Related
I'm trying to create a service that must join an existing stack, so I force the compose file to use the same network.
The network definitely still exists:
docker network ls
NETWORK ID NAME DRIVER SCOPE
oiaxfyeil72z ELK_default overlay swarm
okhs1e1wu73y ELK_elk overlay swarm
My docker-compose.yml
version: '3.3'
services:
  logstash:
    image: docker.elastic.co/logstash/logstash-oss:7.5.1
    ports:
      - "5000:5000"
      - "9600:9600"
    volumes:
      - '/share/elk/logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro'
      - '/share/elk/logstash/pipeline/:/usr/share/logstash/pipeline/:ro'
    environment:
      LS_JAVA_OPTS: "-Xmx512m -Xms256m"
    networks:
      - elk
    deploy:
      mode: replicated
      replicas: 1
networks:
  elk:
    external:
      name: ELK_elk
The other services were created with:
version: '3.3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.1
    ports:
      - "9200:9200"
      - "9300:9300"
    volumes:
      - '/share/elk/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml:ro'
    environment:
      ES_JAVA_OPTS: "-Xmx512m -Xms256m"
      ELASTIC_PASSWORD: changeme
      discovery.type: single-node
    networks:
      - elk
    deploy:
      mode: replicated
      replicas: 1
  kibana:
    image: docker.elastic.co/kibana/kibana:7.5.1
    ports:
      - "5601:5601"
    volumes:
      - '/share/elk/kibana/config/kibana.yml:/usr/share/kibana/config/kibana.yml:ro'
    networks:
      - elk
    deploy:
      mode: replicated
      replicas: 1
networks:
  elk:
    driver: overlay
Checking with docker stack services:
docker stack services ELK
ID NAME MODE REPLICAS IMAGE PORTS
c0rux6mdvzq3 ELK_kibana replicated 1/1 docker.elastic.co/kibana/kibana:7.5.1 *:5601->5601/tcp
j824fd0blxdp ELK_elasticsearch replicated 1/1 docker.elastic.co/elasticsearch/elasticsearch:7.5.1 *:9200->9200/tcp, *:9300->9300/tcp
Then I try to bring the service up with docker-compose up -d. The service is not created, and this error is produced:
docker-compose up -d
WARNING: Some services (logstash) use the 'deploy' key, which will be ignored. Compose does not support 'deploy' configuration - use `docker stack deploy` to deploy to a swarm.
WARNING: The Docker Engine you're using is running in swarm mode.
Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.
To deploy your application across the swarm, use `docker stack deploy`.
Removing tmp_logstash_1
Recreating bbf503fc3eaa_tmp_logstash_1 ... error
ERROR: for bbf503fc3eaa_tmp_logstash_1 Cannot start service logstash: Could not attach to network ELK_elk: rpc error: code = PermissionDenied desc = network ELK_elk not manually attachable
ERROR: for logstash Cannot start service logstash: Could not attach to network ELK_elk: rpc error: code = PermissionDenied desc = network ELK_elk not manually attachable
ERROR: Encountered errors while bringing up the project.
The issue is due to the fact that the elk network is defined as an overlay network. That is a Docker Swarm feature, so docker-compose does not know how to deal with it.
Instead of using docker-compose up, you need to deploy a Docker Swarm stack:
docker stack deploy -c docker-compose.yml <service_name>
You can refer to the Docker documentation for more info:
https://docs.docker.com/network/
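Applied to the files in the question, that would look something like this (the stack name "logstash" is only an example; the external ELK_elk network already exists, as shown by docker network ls above):

# Deploy the logstash compose file as a swarm stack; swarm-managed services
# are allowed to attach to the swarm-scoped ELK_elk overlay network.
docker stack deploy -c docker-compose.yml logstash
docker stack services logstash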
For some reason, non-manager nodes only see overlay networks that have active containers attached to them. Run this on the non-manager node:
docker run --rm -d --name dummy busybox # Run a dummy container
docker network connect [OVERLAY_NETWORK] dummy # Connect to overlay network
Now the network is available on the non-manager node and you can run:
docker compose -f compose.yaml -p project up -d
docker stop dummy # Remove dummy container
Compose file:
networks:
  db:
    external: true
    driver: overlay
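Not part of the workaround above, but if you control how the overlay network is created in the first place, making it attachable also lets standalone docker-compose/docker run containers join it directly and avoids the "not manually attachable" error:

# "db" matches the network name in the compose fragment above; --attachable
# allows non-swarm containers to connect to this swarm-scoped overlay network.
docker network create --driver overlay --attachable db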
I am new to Docker and trying to build a Hadoop cluster with Docker Swarm. I tried building it with docker-compose and it worked perfectly. However, I would like to add other services like Hive, Spark, and HBase to it in the future, so a swarm seems a better idea.
When I tried to run it with a version 3.7 YAML file, the namenode and datanodes started successfully. But when I visited the web UI, it showed that there are no nodes available on the "Datanodes" tab (nor on the "Overview" tab). It seems the datanodes failed to connect to the namenode. I checked the ports on each node with netstat -tuplen and both 7946 and 4789 looked fine.
Here is the yaml file I used:
version: "3.7"
services:
namenode:
image: flokkr/hadoop:latest
hostname: namenode
networks:
- hbase
command: ["hdfs","namenode"]
ports:
- target: 50070
published: 50070
- target: 9870
published: 9870
environment:
- NAMENODE_INIT=hdfs dfs -chmod 777 /
- ENSURE_NAMENODE_DIR=/tmp/hadoop-hadoop/dfs/name
env_file:
- ./compose-config
deploy:
mode: replicated
replicas: 1
restart_policy:
condition: on-failure
placement:
constraints:
- node.role == manager
datanode:
image: flokkr/hadoop:latest
networks:
- hbase
command: ["hdfs","datanode"]
env_file:
- ./compose-config
deploy:
mode: global
restart_policy:
condition: on-failure
volumes:
namenode:
datanode:
networks:
hbase:
name: hbase
Basically I just updated the YAML file from this repo to version 3.7 and tried to run it on GCP. Here is my repo in case you want to reproduce the issue.
The status of the ports on the manager node and on the worker node is shown in the attached screenshots.
Thank you for your help!
It seems to be a network-related issue: the containers are up and running, but they are not registering in your web UI, so network traffic may not be getting through between them. Check your internal firewall rules and the OS firewall, and run some network tests on the specific ports.
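For example, from each node you could check that the standard swarm ports are reachable on the other nodes (a sketch; replace <other-node-ip>, and on GCP also make sure the VPC firewall rules allow these ports between the instances):

# 2377/tcp: cluster management, 7946/tcp+udp: node gossip, 4789/udp: VXLAN overlay traffic
nc -zv  <other-node-ip> 2377
nc -zv  <other-node-ip> 7946
nc -zvu <other-node-ip> 7946
nc -zvu <other-node-ip> 4789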
I am using the Kafka Connect HDFS sink and Hadoop (for HDFS) in a docker-compose setup.
Hadoop (namenode and datanode) seems to work correctly.
But I get an error from the Kafka Connect sink:
ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED
(io.confluent.connect.hdfs.TopicPartitionWriter:277)
org.apache.kafka.connect.errors.DataException:
Error creating writer for log file hdfs://namenode:8020/logs/MyTopic/0/log
For information:
Hadoop services in my docker-compose.yml:
namenode:
  image: uhopper/hadoop-namenode:2.8.1
  hostname: namenode
  container_name: namenode
  ports:
    - "50070:50070"
  networks:
    default:
    fides-webapp:
      aliases:
        - "hadoop"
  volumes:
    - namenode:/hadoop/dfs/name
  env_file:
    - ./hadoop.env
  environment:
    - CLUSTER_NAME=hadoop-cluster
datanode1:
  image: uhopper/hadoop-datanode:2.8.1
  hostname: datanode1
  container_name: datanode1
  networks:
    default:
    fides-webapp:
      aliases:
        - "hadoop"
  volumes:
    - datanode1:/hadoop/dfs/data
  env_file:
    - ./hadoop.env
And my kafka-connect file:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=MyTopic
hdfs.url=hdfs://namenode:8020
flush.size=3
EDIT:
I added an environment variable so that Kafka Connect is aware of the cluster name (the CLUSTER_NAME variable, added to the kafka-connect service in the docker-compose file).
The error is no longer the same (so this seems to have fixed one problem):
INFO Starting commit and rotation for topic partition scoring-topic-0 with start offsets {partition=0=0} and end offsets {partition=0=2}
(io.confluent.connect.hdfs.TopicPartitionWriter:368)
ERROR Exception on topic partition MyTopic-0: (io.confluent.connect.hdfs.TopicPartitionWriter:403)
org.apache.kafka.connect.errors.DataException: org.apache.hadoop.ipc.RemoteException(java.io.IOException):
File /topics/+tmp/MyTopic/partition=0/bc4cf075-ccfa-4338-9672-5462cc6c3404_tmp.avro
could only be replicated to 0 nodes instead of minReplication (=1).
There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
EDIT2:
The hadoop.env file is:
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
# Configure default BlockSize and Replication for local
# data. Keep it small for experimentation.
HDFS_CONF_dfs_blocksize=1m
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
Finally, as noted by cricket_007, I needed to configure hadoop.conf.dir.
The directory should contain hdfs-site.xml.
Since each service is dockerized, I needed to create a named volume in order to share the configuration files between the kafka-connect service and the namenode service.
To do this I added the following to my docker-compose.yml:
volumes:
  hadoopconf:
Then for the namenode service I added:
volumes:
  - hadoopconf:/etc/hadoop
And for the kafka-connect service:
volumes:
  - hadoopconf:/usr/local/hadoop-conf
Finally, I set hadoop.conf.dir in my HDFS sink properties file to /usr/local/hadoop-conf.
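In other words, the only change to the sink configuration shown earlier is one extra property pointing at the mounted directory (the properties file name below is an assumption):

# Append the Hadoop conf dir setting to the existing connector properties file.
echo "hadoop.conf.dir=/usr/local/hadoop-conf" >> hdfs-sink.properties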
I'm currently trying to deploy an application with Docker Swarm on 3 virtual machines. I'm using docker-compose to build the image; my files are the following:
Dockerfile:
FROM openjdk:8-jdk-alpine
WORKDIR /home
ARG JAR_FILE
ARG PORT
VOLUME /tmp
COPY ${JAR_FILE} /home/app.jar
EXPOSE ${PORT}
ENTRYPOINT ["java","-Djava.security.egd=file:/dev/./urandom","-jar","/home/app.jar"]
and my docker-compose is:
version: '3'
services:
  service_colpensiones:
    build:
      context: ./colpensiones-servicio
      dockerfile: Dockerfile
      args:
        JAR_FILE: ColpensionesServicio.jar
        PORT: 8082
    volumes:
      - data:/home
    ports:
      - 8082:8082
volumes:
  data:
I'm using the command docker-compose up -d --build to build the image; the container it creates is removed afterwards. To use Docker Swarm I use the 3 machines, one manager and two workers, and I have another file to deploy the service with 3 replicas:
version: '3'
services:
  service_colpensiones:
    image: deploy_lyra_colpensiones_service_colpensiones
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: "0.1"
          memory: 50M
      restart_policy:
        condition: on-failure
    volumes:
      - data:/home
    ports:
      - 8082:8082
    networks:
      - webnet
  visualizer:
    image: dockersamples/visualizer:stable
    ports:
      - "8080:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    deploy:
      placement:
        constraints: [node.role == manager]
    networks:
      - webnet
networks:
  webnet:
volumes:
  data:
So far I think everything is fine: with the command docker service ls I can see the services created, and the visualizer (dockersamples/visualizer:stable) correctly shows me the nodes on port 8080. But when I make a request to the service's URL in the following way:
curl -4 http://192.168.99.100:8082/colpensiones/msg
this error appears:
curl: (7) Failed to connect to 192.168.99.100 port 8082: Connection refused
The images used by the services are shown in the attached screenshot.
I am following the Docker tutorial, Get Started: https://docs.docker.com/get-started/part5/
I hope you can help, thanks.
I had the same issue, but it was fixed after changing the port mapping of the Spring Boot service to:
ports:
  - "8082:8080"
The actual issue is that the Tomcat server listens on port 8080 by default, not the port mentioned in the compose file. I also increased the memory limit.
FYI: the internal port of the tasks/containers running in a service can be the same as in other services, so using 8080 as the internal port for both the Spring Boot container and the visualizer container is not a problem.
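If in doubt, you can check which port the application actually listens on inside a running task (a sketch; the service and stack names need to be adjusted to your deployment):

# List listening TCP ports inside one of the service's containers.
docker exec "$(docker ps -q -f name=service_colpensiones | head -n1)" netstat -tln
# Show the published port mapping swarm created for the service.
docker service inspect --format '{{json .Endpoint.Ports}}' <stack>_service_colpensiones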
I also faced the same issue with my application. I rebuilt my app after removing the -Djava.security.egd=file:/dev/./urandom property from the java command line in the Dockerfile, and it started working for me.
Please check docker service logs <container-id> (to see container IDs, run docker stack ps <service-name>) for the task that served your request at that time, and see if there is any error message.
PS: I recently started with Docker, so this might not be expert advice. Just in case it helps.
I need to set the service mode to global while using compose files.
Is there any chance we can use this in a compose file?
I have a requirement where, for one service, there should be exactly one container on every node/host.
This doesn't happen with swarm's "spread" strategy: if a node goes down and comes back up, swarm just evens out the number of containers on each host, irrespective of services.
https://github.com/docker/compose/issues/3743
We can do this easily now with compose file version 3, under the deploy section (mode: global).
Prerequisites:
Docker Compose version 1.10.0+
Docker Engine version 1.13.0+
Example compose file:
version: "3"
services:
nginx:
image: nexus3.example.com/prd-nginx-sm:v1
ports:
- "80:80"
networks:
- cheers
volumes:
- logs:/rest/out/
deploy:
mode: global
labels:
feature.description: "Frontend"
update_config:
parallelism: 1
delay: 10s
restart_policy:
condition: any
command: "/usr/sbin/nginx"
networks:
cheers:
volumes:
logs:
data:
Deploy the compose file:
$ docker stack deploy -c sm-deploy-compose.yml --with-registry-auth CHEERS
This will deploy the nginx container on all the nodes participating in the cluster.
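You can verify the placement afterwards:

# MODE should show "global" and there should be one task per node.
docker service ls --filter name=CHEERS_nginx
docker service ps CHEERS_nginx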