Putting file into HDFS using docker-compose - docker

Is there a way to put some file, let's say data.json, into HDFS automatically right from Docker-compose/Dockerfile?
When I start namenode and datanode I can enter into containers with
docker exec -it namenode [datanode] bash, and use
hdfs dfs -put data.json hdfs:/ (when safe mode is finished)
and that works, but I need a way to run this automatically. When I try to build the containers from a Dockerfile with these commands:
FROM bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8
WORKDIR /data
ADD hdfs_writer/data.json /data
# ADD python_script.py /data
CMD ["hdfs dfsadmin -safemode wait && hdfs dfs -put ./data.json hdfs:/"]
# CMD ["python python_script.py"]
Container namenode immediately terminates. I also tried with the python script, that I add to container and run it with CMD.
python_script
import time
import os
os.system("hdfs dfsadmin -safemode wait")
os.system("hdfs dfs -put -f data.json hdfs:/")
while True:
    time.sleep(5)
In that case the container keeps running, but when I check the logs and try to list HDFS with hdfs dfs -ls hdfs:/, I get the following error:
safemode: Call From 662aae005e8b/172.20.0.5 to namenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
19/04/18 14:36:36 WARN ipc.Client: Failed to connect to server: namenode/172.20.0.5:8020: try once and fail.
I read the link recommended in the error log and, to be honest, I am not sure I understand what I should do.
Any suggestions or ideas about a possible solution would be highly valuable to me, as I am new to this field and don't have much experience.
If you need more information, I will be happy to provide it.
docker-compose.yml (just part of it)
namenode:
  # docker-compose.yml and Dockerfile are in the same directory
  build: .
  volumes:
    - ./data/namenode:/hadoop/dfs/name
  environment:
    - CLUSTER_NAME=cluster
  env_file:
    - ./hadoop.env
  ports:
    - 50070:50070
datanode:
  image: bde2020/hadoop-datanode:1.1.0-hadoop2.8-java8
  depends_on:
    - namenode
  volumes:
    - ./data/datanode:/hadoop/dfs/data
  env_file:
    - ./hadoop.env
hadoop.env
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_blocksize=1m
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031

You can't write to networked services in a Dockerfile. Imagine running docker build, running your combined application, tearing it down, and running it again. You'll reuse the same built image without re-running the Dockerfile steps; only the content in the image itself is kept. In most cases you need some minor amount of setup to communicate between services (Docker Compose can do this for you) but that is not set up during a build sequence. This is the same answer as "you can't run database migrations from a Dockerfile", but it applies equally to Hadoop.
A container only does one thing. Your sample Dockerfile sets a different CMD that waits for the namenode to be running and sets it up. This happens instead of starting the namenode process. A Docker container runs one main command and one main command only; there is not a way to run a main command and also a side support script of some form. The container you show would probably work, but you'd need to run it as a separate container alongside the namenode container.
You don't need to be "in Docker" to access Docker-hosted services. You can use a Docker Compose ports: directive to make services visible to the host, at which point you can use ordinary clients to interact with them. The docker exec path is the equivalent of "I ssh to my server as root, and then...", which isn't how you normally deal with any service at all.
Your server containers should only run servers. In your example you're both trying to launch an HDFS namenode and also populate the server from the same container; you'd be better off having the namenode container only be the namenode and running the setup job from another container or from the host. (See the standard postgres image's entrypoint script for some idea of the gyrations required otherwise.)
Docker Compose isn't great for one-off jobs. Every time you run docker-compose up it will discover that your setup container isn't running and try to start it again. Other more powerful orchestrators could be a better fit; for example, a Kubernetes Job is a reasonable fit for what you're describing.
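As a rough illustration of the "run the setup job from another container or from the host" idea, here is a minimal Python sketch. It assumes the hdfs client is installed wherever the script runs, that the namenode is reachable as namenode:8020 (taken from the hadoop.env in the question), and the helper names wait_for_port, upload and main are made up for this example. It waits until the namenode port accepts connections before issuing the same two commands the question runs by hand.

```python
import socket
import subprocess
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or name not resolvable yet: retry after a pause.
            time.sleep(interval)
    return False


def upload(local_path: str, hdfs_uri: str = "hdfs://namenode:8020/") -> None:
    """Wait for safe mode to end, then copy the file (assumes the hdfs CLI is on PATH)."""
    subprocess.run(["hdfs", "dfsadmin", "-safemode", "wait"], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_uri], check=True)


def main() -> None:
    # Hostname and port are assumptions taken from CORE_CONF_fs_defaultFS.
    if not wait_for_port("namenode", 8020):
        raise SystemExit("namenode:8020 never became reachable")
    upload("data.json")
```

Calling main() from a separate container (or from the host, with the port published) keeps the namenode container running only the namenode, and the script exits when the upload is done instead of sleeping forever.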

Related

Docker editing entrypoint of existing container

I have a Docker container built from the debian:latest image.
I need to execute a bash script that will start several services.
My host machine is Windows 10 and I'm using Docker Desktop. I found the configuration files on the
docker-desktop-data WSL2 drive in data\docker\containers\<container_name>
There are two config files there:
config.v2.json and hostconfig.json
I edited the first of them and replaced:
"Entrypoint":null with "Entrypoint":["/bin/bash", "/opt/startup.sh"]
I did this while the container was stopped, but when I restarted it the script was not executed. When I opened config.v2.json again, Entrypoint had been set back to null.
I need to run this script at every container start.
Another strange thing is that this container doesn't show any volume in Docker Desktop. I can check out this container and start another one, but I need to preserve the current state of this container (installed packages, files, DB content). How can I change the entrypoint or run the script in some other way?
Is there any way to export the container to an image along with its configuration? I need to expose several ports and run the startup script. Is there any way to make every new container created from that exported image expose the same ports and run the same startup script?
Docker's typical workflow involves containers that only run a single process, and are intrinsically temporary. You'd almost never create a container, manually set it up, and try to persist it; instead, you'd write a script called a Dockerfile that describes how to create a reusable image, and then launch some number of containers from that.
It's almost always preferable to launch multiple single-process containers than to try to run multiple processes in a single container. You can use a tool like Docker Compose to describe the multiple containers and record the various options you'd need to start them:
# docker-compose.yml
# Describe the file version. Required with the stable Python implementation
# of Compose. Most recent stable version of the file format.
version: '3.8'
# Persistent storage managed by Docker; will not be accessible on the host.
volumes:
  dbdata:
# Actual containers.
services:
  # The database.
  db:
    # Use a stock Docker Hub image.
    image: postgres:15
    # Persist its data.
    volumes:
      - dbdata:/var/lib/postgresql/data
    # Describe how to set up the initial database.
    environment:
      POSTGRES_PASSWORD: passw0rd
    # Make the container accessible from outside Docker (optional).
    ports:
      - '5432:5432' # first port any available host port
                    # second port MUST be standard PostgreSQL port 5432
  # Reverse proxy / static asset server
  nginx:
    image: nginx:1.23
    # Get static assets from the host system.
    volumes:
      - ./static:/usr/share/nginx/html
    # Make the container externally accessible.
    ports:
      - '8000:80'
You can check this file into source control with your application. Also consider adding a third container that uses build: to build an image containing the actual application code; that one probably will not have volumes:.
docker-compose up -d will start this stack of containers (without -d, in the foreground). If you make a change to the docker-compose.yml file, re-running the same command will delete and recreate containers as required. Note that you are never running an unmodified debian image, nor are you manually running commands inside a container; the docker-compose.yml file completely describes the containers, their startup sequences (if not already built into the images), and any required runtime options.
Also see Networking in Compose for some details about how to make connections between containers: localhost from within a container will call out to that same container and not one of the other containers or the host system.
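To make the localhost point concrete, here is a small Python sketch of how an application container might build its database URL. The environment variable names (DB_HOST and friends) and the function name are assumptions for illustration; the default host db is the service name from the Compose file above, and postgres is the stock image's default user and database.

```python
import os
from urllib.parse import quote


def database_url(environ=os.environ) -> str:
    """Build a PostgreSQL URL that points at the Compose service, not localhost."""
    host = environ.get("DB_HOST", "db")        # Compose service name, resolved by Docker's DNS
    port = environ.get("DB_PORT", "5432")      # container-side port, not the published host port
    user = environ.get("DB_USER", "postgres")  # default user of the postgres image
    password = environ.get("DB_PASSWORD", "passw0rd")
    # Percent-encode the password so special characters do not break the URL.
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:{port}/postgres"
```

From the host you would instead use localhost and whichever host port you published; inside another container, the service name and the standard port are what matter.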

cp a file from within a volume to another location in the container - just use a volume, add Dockerfile? Or can I do it within compose.yml?

I have a docker-compose file in my working directory. I don't have a Dockerfile (Yet, I'm unsure if I need one?). Here's my docker-compose file:
version: "3.5"
services:
  ide-rstudio:
    image: rocker/verse:latest
    ports:
      - 8787:8787
      - 3838:3838
    environment:
      PASSWORD: test
      ROOT: "TRUE"
      ADD: "shiny"
    volumes:
      - ${PROJECTS_DIR}/Zen:/home/rstudio/Projects
When I run this, a new container runs as expected. In the volume I have a file /Zen/ide-rstudio/rstudio-prefs.json. I would like to add rstudio-prefs.json into my container at /home/rstudio/.config/rstudio/rstudio-prefs.json. I CAN already do this by using a volume and adding this line to my docker-compose volumes:
    volumes:
      - ${PROJECTS_DIR}/Zen:/home/rstudio/Projects
      - ${PROJECTS_DIR}/Zen/ide-rstudio/rstudio-prefs.json:/home/rstudio/.config/rstudio/rstudio-prefs.json
My question is this: once the first volume, ${PROJECTS_DIR}/Zen:/home/rstudio/Projects, is mounted, the file rstudio-prefs.json already exists in the container at /home/rstudio/Projects/ide-rstudio/rstudio-prefs.json. So I would really just like to run the following shell command after the container has started: cp /home/rstudio/Projects/ide-rstudio/rstudio-prefs.json /home/rstudio/.config/rstudio/rstudio-prefs.json.
Is it possible to run a shell command within a service using docker-compose? Or, must I now create a Dockerfile?
You should use the volumes: approach you show. This works automatically and doesn't require any user intervention. There's no harm to having a second copy of the file in the container, especially a small configuration file.
You could in principle run docker-compose exec after the container starts up. There are a couple of problems with doing this. If the config file is read by the container's main process, that will happen before you have an opportunity to run debug commands like this. You'll need to remember to repeat this command every time you restart the container. If you wind up in a cluster environment like Kubernetes, you'll need to remember to do this on every replica of the container, and arrange for it to happen if the cluster restarts the container without your knowledge (for example, if a node fails).
If you want this to happen reliably, as a shell command, then you need to write an entrypoint wrapper script. This runs whatever first-time setup you need and then execs the image's original entrypoint. This is easier to do reproducibly with a custom Dockerfile, and requires some knowledge of the image's detailed setup.
The one-line volumes: to inject the same file a second time is much easier.
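For completeness, the entrypoint-wrapper idea could look roughly like the Python sketch below. It assumes a Python interpreter is available in the image (which may not hold for rocker/verse; a shell script is the more usual choice), and that the image's original command is passed to the wrapper as arguments. The names copy_prefs and main are made up for this example.

```python
import os
import shutil
from pathlib import Path

# Paths taken from the question.
SRC = Path("/home/rstudio/Projects/ide-rstudio/rstudio-prefs.json")
DST = Path("/home/rstudio/.config/rstudio/rstudio-prefs.json")


def copy_prefs(src: Path = SRC, dst: Path = DST) -> bool:
    """Copy the prefs file into place; return False if the source is missing."""
    if not src.is_file():
        return False
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return True


def main(argv: list[str]) -> None:
    """Do the first-time setup, then hand over to the original command."""
    copy_prefs()
    if not argv:
        raise SystemExit("usage: entrypoint.py COMMAND [ARG...]")
    # Replace this process with the image's original command so that it
    # remains the container's main process and receives signals directly.
    os.execvp(argv[0], argv)
```

The script would end with main(sys.argv[1:]) (plus import sys) and be set as the ENTRYPOINT in a custom Dockerfile, with the base image's own command given as its arguments. That is a fair amount of machinery compared to the extra volumes: line.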

Docker-compose: how to start a container with output suppressed

I have a docker-compose file that spins up, among several others, a couchdb container (https://hub.docker.com/r/klaemo/couchdb/), and the couchdb container spews out a lot of output when I run docker-compose up. Is there a way to suppress that output so I see only the other containers' output?
Maybe
I can run the couchdb in daemon mode somehow?
or
I can override the default command somehow and redirect output to a tmp file?
I am not sure how to do any of the two, and I want to do that within the compose file itself, not by changing my compose file callup command. Any help?
Here is the minimal compose file:
couchdb:
  container_name: couchdb
  image: klaemo/couchdb:2.0.0
  ports:
    - "5984:5984"
and I call that from a makefile with : docker-compose up --abort-on-container-exit --force-recreate && docker-compose down
Note that Docker containers log to stdout and stderr for a reason. It gives commands like docker logs a consistent log interface and lets logging drivers pick up information from containers. In a large container ecosystem, it's easier if everything works the same way.
Runtime
At runtime there are a couple of options.
You can background the couchdb container and start the others in the foreground.
docker-compose up -d couchdb
docker-compose up other container names
You can start everything in the background, and only view the logs for particular containers
docker-compose start # or docker-compose up -d
docker-compose logs -f other container names
Build time
To permanently modify logging you could change CouchDB's log config in an image build
couchdb:
  container_name: couchdb
  image: me/klaemo-couchdb:2.0.0
  build:
    context: .
    dockerfile: Dockerfile.couchdb
  ports:
    - "5984:5984"
Dockerfile.couchdb
FROM klaemo/couchdb:2.0.0
COPY couchdb.ini /opt/couchdb/etc/local.ini
couchdb.ini needs to contain all the original config settings from the container's /opt/couchdb/etc/local.ini, with some of the log settings changed from stderr to a file:
[log]
file = /opt/couchdb/log/couch.log
level = info
You can also set log levels specifically for a module
[log_level_by_module]
couch_httpd = info
couch_replicator = info
couch_query_servers = error
You probably want to mount the /opt/couchdb/log directory as a volume from the container host so you are not writing data into the current container instance all the time.
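If you would rather generate couchdb.ini from the image's original local.ini than edit it by hand, a sketch along these lines might do. It uses Python's standard configparser; the function name is made up, the [log] keys are the ones shown above, and note that configparser drops comments when the result is written back out.

```python
import configparser


def with_file_logging(original_ini: str,
                      log_file: str = "/opt/couchdb/log/couch.log",
                      level: str = "info") -> configparser.ConfigParser:
    """Parse the original ini text and point the [log] section at a file."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(original_ini)
    if not parser.has_section("log"):
        parser.add_section("log")
    parser["log"]["file"] = log_file
    parser["log"]["level"] = level
    return parser
```

The returned parser can be saved with parser.write(open_file) to produce the couchdb.ini that the Dockerfile copies into the image.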

How does one close a dependent container with docker-compose?

I have two containers that are spun up using docker-compose:
web:
  image: personal/webserver
  depends_on:
    - database
  entrypoint: /usr/bin/runmytests.sh
database:
  image: personal/database
In this example, runmytests.sh is a script that runs for a few seconds, then returns with either a zero or non-zero exit code.
When I run this setup with docker-compose, web_1 runs the script and exits. database_1 remains open, because the process running the database is still running.
I'd like to trigger a graceful exit on database_1 when web_1's tasks have been completed.
You can pass the --abort-on-container-exit flag to docker-compose up to have the other containers stop when one exits.
What you're describing is called a Pod in Kubernetes or a Task in AWS. It's a grouping of containers that form a unit. Docker doesn't have that notion currently (Swarm mode has "tasks" which come close but they only support one container per task at this point).
There is a hacky workaround besides scripting it as @BMitch described. You could mount the Docker daemon socket from the host, e.g.:
web:
  image: personal/webserver
  depends_on:
    - database
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  entrypoint: /usr/bin/runmytests.sh
and add the Docker client to your personal/webserver image. That would allow your runmytests.sh script to use the Docker CLI to shut down the database first, e.g. docker kill database.
Edit:
Third option. If you want to stop all containers when one fails, you can use the --abort-on-container-exit option to docker-compose, as @dnephin mentions in another answer.
I don't believe docker-compose supports this use case. However, making a simple shell script would easily resolve this:
#!/bin/sh
docker run -d --name=database personal/database
docker run --rm -it --entrypoint=/usr/bin/runmytests.sh personal/webserver
docker stop database
docker rm database
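The same sequence can be sketched in Python if you also want the test exit code propagated and the database cleaned up even when the test run raises. This is only an illustration: the function names are made up, the docker commands mirror the shell script above (minus -it, which needs a terminal), and the runner is injectable so the logic can be exercised without Docker.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_tests_with_database(run: Runner = _run) -> int:
    """Start the database, run the tests, then always stop and remove the database."""
    started = run(["docker", "run", "-d", "--name=database", "personal/database"])
    if started != 0:
        return started
    try:
        # The exit code of the test container becomes the overall result.
        return run(["docker", "run", "--rm",
                    "--entrypoint=/usr/bin/runmytests.sh", "personal/webserver"])
    finally:
        run(["docker", "stop", "database"])
        run(["docker", "rm", "database"])
```

A caller would typically finish with raise SystemExit(run_tests_with_database()) so that CI sees the test result.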

Setting a policy for RabbitMQ as a part of Dockerfile process

I'm trying to make a Dockerfile based on the RabbitMQ repository with a customized policy set. The problem is that I can't use CMD or ENTRYPOINT, since that would override the base Dockerfile's and I would then have to come up with my own, and I don't want to go down that path. Besides, if I don't use RUN, the policy will be set by run-time commands, and I want it to be included in the image, not just the container.
The other thing I can do is use a RUN command, but the problem with that is that the RabbitMQ server is not running at build time, and there's no --offline flag for the set_policy command of the rabbitmqctl program.
When I use docker's RUN command to set the policy, here's the error I face:
Error: unable to connect to node rabbit@e06f5a03fe1f: nodedown
DIAGNOSTICS
===========
attempted to contact: [rabbit@e06f5a03fe1f]
rabbit@e06f5a03fe1f:
* connected to epmd (port 4369) on e06f5a03fe1f
* epmd reports: node 'rabbit' not running at all
no other nodes on e06f5a03fe1f
* suggestion: start the node
current node details:
- node name: 'rabbitmq-cli-136@e06f5a03fe1f'
- home dir: /var/lib/rabbitmq
- cookie hash: /Rw7u05NmU/ZMNV+F856Fg==
So is there any way I can set a policy for the RabbitMQ without writing my own version of CMD and/or ENTRYPOINT?
You're in a slightly tricky situation with RabbitMQ, as its Mnesia data path is based on the host name of the container.
root@bf97c82990aa:/# ls -1 /var/lib/rabbitmq/mnesia
rabbit@bf97c82990aa
rabbit@bf97c82990aa-plugins-expand
rabbit@bf97c82990aa.pid
For other image builds you could seed the data files, or write a script that RUN calls to launch the application or database and configure it. With RabbitMQ, the container host name will change between image build and runtime so the image's config won't be picked up.
I think you are stuck with doing the config on container creation or at startup time.
Options
Creating a wrapper CMD script to do the policy after startup is a bit complex as /usr/lib/rabbitmq/bin/rabbitmq-server runs rabbit in the foreground, which means you don't have access to an "after startup" point. Docker doesn't really do background processes so rabbitmq-server -detached isn't much help.
If you use something like Ansible, Chef or Puppet to set up the containers, configure a fixed hostname for the container at startup, then start it and configure the policy as the next step. This only needs to be done once, as long as the hostname is fixed and you are not using the --rm flag.
At runtime, systemd could complete the config to a service with ExecStartPost. I'm sure most service managers will have the same feature. I guess you could end up dropping messages, or at least causing errors at every start up if anything came in before configuration was finished?
You can configure the policy as described here.
Docker compose:
rabbitmq:
  image: rabbitmq:3.7.8-management
  container_name: rabbitmq
  volumes:
    - ~/rabbitmq/data:/var/lib/rabbitmq:rw
    - ./rabbitmq/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf
    - ./rabbitmq/definitions.json:/etc/rabbitmq/definitions.json
  ports:
    - "5672:5672"
    - "15672:15672"
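The mounted definitions.json is where the policy would live. As a hedged sketch, the Python below builds only the policies part of such a file; the policy name, pattern and message-ttl value are invented for illustration, and a complete definitions file usually also lists users, vhosts and permissions. To my understanding, rabbitmq.conf then needs a line such as management.load_definitions = /etc/rabbitmq/definitions.json for the management plugin to import it at boot.

```python
import json


def policy_definitions(name: str, pattern: str, definition: dict,
                       vhost: str = "/", apply_to: str = "queues",
                       priority: int = 0) -> dict:
    """Return a definitions document containing a single policy."""
    return {
        "policies": [
            {
                "vhost": vhost,
                "name": name,
                "pattern": pattern,
                "apply-to": apply_to,
                "definition": definition,
                "priority": priority,
            }
        ]
    }


def to_json(definitions: dict) -> str:
    """Serialize the definitions for writing to definitions.json."""
    return json.dumps(definitions, indent=2)
```

For example, to_json(policy_definitions("ttl-example", "^example\\.", {"message-ttl": 60000})) yields text that can be saved as ./rabbitmq/definitions.json next to the Compose file.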

Resources