Running DBT within Airflow through the Docker Operator

Building on my question How to run DBT in airflow without copying our repo, I am currently running Airflow and syncing the DAGs via git. I am considering different options for including dbt in my workflow. One suggestion by louis_guitton is to Dockerize the dbt project and run it in Airflow via the Docker Operator.
I have no prior experience using the Docker Operator in Airflow, or with dbt in general. I am wondering if anyone has tried this, or can share insights from incorporating that workflow; my main questions are:
Should DBT as a whole project be run as one Docker container, or is it broken down? (For example: are tests run as a separate container from dbt tasks?)
Are logs and the UI from DBT accessible and/or still useful when run via the Docker Operator?
How would partial pipelines be run? (example: wanting to run only a part of the pipeline)

Judging by your questions, you would benefit from trying to dockerise dbt on its own, independently of Airflow. A lot of your questions would disappear. But here are my answers anyway.
Should DBT as a whole project be run as one Docker container, or is it broken down? (For example: are tests run as a separate container from dbt tasks?)
I suggest you build one Docker image for the entire project. The image can be based on the official Python image, since dbt is a Python CLI tool. You then use the COMMAND argument of docker run to execute any dbt command you would run outside Docker.
Please remember the syntax of docker run (which has nothing to do with dbt): you can specify any COMMAND you want to run at invocation time:
$ docker run [OPTIONS] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]
Also, the first hit on Google for "docker dbt" is this Dockerfile, which can get you started.
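For orientation, here is a minimal sketch of what such a Dockerfile could look like; the adapter package, project layout, and default command are assumptions you would adapt to your project:
# One image for the whole dbt project
FROM python:3.9-slim
# Install dbt for whichever adapter you use (dbt-postgres, dbt-snowflake, ...)
RUN pip install --no-cache-dir dbt-postgres
# Copy the dbt project (and a profiles.yml, or template it from environment variables) into the image
WORKDIR /dbt
COPY . /dbt
# Default command; any COMMAND passed to docker run overrides it
CMD ["dbt", "run"]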
Are logs and the UI from DBT accessible and/or still useful when run via the Docker Operator?
Again, it's not a dbt question but rather a docker question or an airflow question.
Can you see the logs in the Airflow UI when using a DockerOperator? Yes; see this how-to blog post with screenshots.
Can you access logs from a Docker container? Yes: Docker containers emit logs to the stdout and stderr output streams (which you can see in Airflow, since Airflow picks this up). Logs are also stored as JSON files on the host machine under /var/lib/docker/containers/. If you have more advanced needs, you can pick those logs up with a tool (or a simple BashOperator or PythonOperator) and do whatever you need with them.
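For example (the container ID below is a placeholder), you can also inspect that output directly on the Docker host:
# Stream the stdout/stderr that Docker captured for a container
docker logs <container_id_or_name>
# With the default json-file logging driver, the same output is kept on the host (may require root):
ls /var/lib/docker/containers/<container_id>/
# -> <container_id>-json.log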
How would partial pipelines be run? (example: wanting to run only a part of the pipeline)
See answer 1: you would run your dbt Docker image with a command such as
$ docker run my-dbt-image dbt run -m stg_customers

Related

Isn't docker-compose a replacement for docker run?

I'm having difficulty understanding Docker. No matter how many tutorials I watch or guides I read, to me docker-compose seems like a way to define multiple Dockerfiles, i.e. multiple containers. I can define environment variables, ports, commands, and base images in both.
I read in other questions/discussions that a Dockerfile defines how to build an image and docker-compose defines how to run an image, but I don't understand that: I can build Docker containers without having a Dockerfile.
It's mainly for local development, though. Does the Dockerfile have an important role when deploying to AWS, for example (where it presumably comes out of the box, e.g. for EC2)?
So is the reason I can work locally with docker-compose alone that the base image is my computer (taking care of the job a Dockerfile is supposed to do)?
Think about how you'd run some program, without Docker involved. Usually it's two steps:
Install it using a package manager like apt-get or brew, or build it from source
Run it, without needing any of its source code locally
In plain Docker without Compose, similarly, you have the same two steps:
docker pull a prebuilt image with the software, or docker build it from source
docker run it, without needing any of its source code locally
I'd aim to have a Dockerfile that builds a self-contained, immutable image, with all of your source code and library dependencies as part of the image. The ideal is that you can docker run your image without -v options to inject source code and without providing the command at the docker run command line.
The reality is that there are a lot of moving parts: you probably need to docker network create a network to get containers to communicate with each other, and use docker run -e environment variables to specify host names and database credentials, and launch multiple containers together, and so on. And that's where Compose comes in: instead of running a series of very long docker commands, you can put all of the details you need in a docker-compose.yml file, check it in, and run docker-compose up to get all of those parts put together.
So, do:
Use Compose to start multiple containers together
Use Compose to write down complex runtime options like port mappings or environment variables with host names and credentials
Use Compose to build your image and start a container from it with a single command (a minimal sketch follows this list)
Build your application code, and a standard CMD to run it, into your Dockerfile.
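To make this concrete, here is a hedged sketch of the difference Compose makes; the network name, images, and environment variables below are made up for illustration:
# Without Compose: several long commands every time you bring the stack up
docker network create app-net
docker run -d --name db --net app-net -e POSTGRES_PASSWORD=secret postgres:13
docker run -d --name web --net app-net -p 8000:8000 -e DB_HOST=db my-app-image

# With Compose: the same details live in a checked-in docker-compose.yml
docker-compose up -d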

MLflow run within a docker container - Running with "docker_env" in MLflow project file

We are trying to develop an MLflow pipeline. We have our development environment in a series of Docker containers (no local Python environment whatsoever). This means that we have set up a Docker container with MLflow and all the requirements necessary to run pipelines. The issue is that when we write our MLflow project file, we need to use "docker_env" to specify the environment. This figure illustrates what we want to achieve:
[Figure: MLflow run with docker-in-docker]
MLflow inside the container needs access to the Docker daemon/service so that it can either use the Docker image named in the MLflow project file or pull it from Docker Hub. We are aware of the possibility of using "conda_env" in the MLflow project file, but wish to avoid this.
Our questions are:
Do we need to set some sort of "docker in docker" solution to achieve our goal?
Is it possible to set up the docker container in which MLflow is running so that it can access the "host machine" docker daemon?
I have been all over Google and MLflow's documentation, but I can't seem to find anything that can guide us. Thanks a lot in advance for any help or pointers!
I managed to create my pipeline using Docker and docker_env in MLflow. It is not necessary to run docker-in-docker; the "sibling containers" approach is sufficient. This approach is described here:
https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/
and it is the preferred way to avoid docker-in-docker.
One needs to be very careful when mounting volumes within the primary and secondary Docker environments: all volume mounts are resolved on the host machine.
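As a sketch of that sibling approach (the image name and project path are assumptions), you bind-mount the host's Docker socket into the MLflow container:
# The MLflow container talks to the host's Docker daemon through the mounted socket,
# so any containers it launches run next to it on the host, not inside it.
docker run -it \
  -v /var/run/docker.sock:/var/run/docker.sock \
  my-mlflow-image \
  mlflow run /path/to/project/inside/the/image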
In this case, I would like to suggest a simple alternative:
Install all dependencies in the Docker image
Run the MLflow project with a conda env
MLflow will then reuse everything cached in the container's base environment each time the project runs.

Is there a point in Docker start?

So, is there a point to the command "start", as in "docker start -i alpineContainer"?
If I do this, I can't really do anything with the Alpine image inside the container; I would have to do a docker run and create another container with the "-it" flag and "sh" at the end (or "/bin/bash", I don't remember exactly right now).
Is that how it will go most of the time: delete and rebuild containers, and use "-it" if you want to do stuff in them? Or does it depend more on the Dockerfile and how you define the CMD?
New to Docker in general and trying to understand the basics on how to use it. Thanks for the help.
Running docker run/exec with -it means you run the docker container and attach an interactive terminal to it.
Note that you can also run docker applications without attaching to them, and they will still run in the background.
Docker allows you to run a program (which can be bash, but does not have to be) in an isolated environment.
For example, try running the jenkins docker image: https://hub.docker.com/_/jenkins.
This will create a container without you having to attach to it, and you will still be able to use it.
You can also attach to an existing, running container by using docker exec -it [container_name] bash.
You can also use docker logs to peek at the stdout of a certain docker container, without actually attaching to its shell interactively.
You almost never use docker start. It's really only useful in two unusual circumstances (a short sketch follows this list):
If you've created a container with docker create, then docker start will run the process you named there. (But it's much more common to use docker run to do both things together.)
If you've stopped a container with docker stop, docker start will run its process again. (But typically you'll want to docker rm the container once you've stopped it.)
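A short sketch of both paths, using the nginx image purely as an example:
# The common path: create and start the container in one step
docker run -d --name web -p 8080:80 nginx

# The unusual path the two points above describe
docker create --name web2 -p 8081:80 nginx   # create the container without starting it
docker start web2                            # run the process named at create time
docker stop web2
docker start web2                            # run the same container's process again
docker rm web2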
Your question and other comments hint at using an interactive shell in an unmodified Alpine container. Neither is a typical practice. Usually you'll take some complete application and its dependencies and package it into an image, and docker run will run that complete packaged application. Tutorials like Docker's Build and run your image go through this workflow in reasonable detail.
My general day-to-day workflow involves building and testing a program outside of Docker. Once I believe it works, then I run docker build and docker run, and docker rm the container once I'm done. I rarely run docker exec: it is a useful debugging tool but not the standard way to interact with a process. docker start isn't something I really ever run.

Installing and Running docker in a Docker container running in Openshift

I am currently working on the following scenario:
I am trying to set up a container in OpenShift that runs Jenkins and is itself able to run Docker, to make use of declarative pipelines where the build runs in its own Docker container. This basically makes it necessary to install and run Docker inside this container.
I have been working on it for quite some time now and have checked dozens of posts and threads online, but I have not been able to accomplish it. Basically I got this far:
I can install Docker in my container (from the base image openshift/jenkins-2-centos7:latest)
I can't get Docker to run, as this makes use of systemctl.
Now, I have read that systemctl does not work inside Docker containers, or is at least highly discouraged, as it interferes with PID 1 in the system. Without
systemctl start docker
I am left with Docker being unable to connect to the daemon (as expected) and the error message:
Can't connect to docker daemon. Is 'docker -d' running on this host?
So I tried to set up the daemon myself using the following in my Dockerfile:
RUN usermod -aG docker $(whoami)
RUN dockerd -H unix:///var/run/docker.sock
which will also not work, telling me that cgroups cannot be mounted. After some more research I found that this could be handled with the cgroupfs-mount script from
https://github.com/tianon/cgroupfs-mount/tree/master
But here, too, I had no luck, and was left with the following error:
Error starting daemon: Error initializing network controller: error obtaining controller instance: failed to create NAT chain DOCKER: iptables failed: iptables -t nat -N DOCKER: iptables v1.4.21: can't initialize iptables table `nat': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.
Now, after hours, I am out of ideas. Does anyone have an idea how to make Docker work inside of OpenShift? I would be really grateful.
I am trying to set up a container in OpenShift that runs Jenkins and is itself able to run Docker, to make use of declarative pipelines where the build runs in its own Docker container. This basically makes it necessary to install and run Docker inside this container.
I don't think your conclusion here is the only possibility, and what I'll describe below is an easier approach to get what (I think) you want! :) If you have any use cases other than the three I'll describe, let me know and I'll try to update to cover them:
Pipelines running in their own containers
Running additional containers from Pipelines
Building container images from Pipelines
Pipelines running in their own containers
For this case, there's the excellent Kubernetes plugin.
With this plugin, you add a Kubernetes/OpenShift cloud to the Jenkins global config. This can either be the one in which Jenkins is running (if you use the Jenkins image provided by OpenShift, this gets added by default at least), or an external cluster.
Inside that configuration you can define PodTemplates (again, there are a couple of examples provided in the Jenkins image provided by OpenShift), or I believe you can also specify them directly in your pipeline. When your pipeline requests a node/agent with a label that matches one of these (and there are no long-running agents that match), a pod will be created from that template, and your pipeline execution will happen inside a container in it. Once it's no longer needed, it will be deprovisioned again.
Here are the pipeline steps exposed by this plugin: https://jenkins.io/doc/pipeline/steps/kubernetes/
Running additional containers from Pipelines
As part of your pipeline, you may want to run some tests, and those may expect to be able to interact with e.g. a database. You can create resources for that in your OpenShift project (e.g. a Deployment, exposed with a Service) and tear them down afterwards. The openshift-client plugin is very useful here and has docs on how to interact with OpenShift.
Building container images from Pipelines
If your goal is to build container images from pipelines, remember that OpenShift also exposes this capability (depending on the security configuration) through Builds. Just like in the previous section, you can use the openshift-client plugin to create and trigger builds.
For more information on the Jenkins image that's maintained by OpenShift (and generally how to do useful things in Jenkins on OpenShift), there's this dedicated page in the OpenShift docs.
There is this article by @jpetazzo, from the Docker team, about Docker-in-Docker (DinD). From the article:
The primary purpose of Docker-in-Docker was to help with the development of Docker itself. Many people use it to run CI (e.g. with Jenkins), which seems fine at first, but they run into many “interesting” problems that can be avoided by bind-mounting the Docker socket into your Jenkins container instead.
And from the DinD repo:
This work is now obsolete, thanks to the combined efforts of some amazing people like @jfrazelle and @tianon, who also are black belts in the art of putting IKEA furniture together.
If you want to run Docker-in-Docker today, all you need to do is:
docker run --privileged -d docker:dind
And here is an article using another approach to build Docker containers with Jenkins inside a Docker container, by bind-mounting the host's Docker socket:
docker run -p 8080:8080 \
-v /var/run/docker.sock:/var/run/docker.sock \
--name jenkins \
jenkins/jenkins:lts
So you may want to adapt one of these solutions to your OpenShift scenario. I hope it solves your issue.
You'll need a privileged pod running Jenkins which mounts the OpenShift node's Docker socket. This is a bad idea, as Jenkins will launch containers outside of Kubernetes' semantics and control.
Why not use the S2I (source-to-image) service shipped with OpenShift?
Regards.

New to Docker - how to essentially make a cloneable setup?

My goal is to use Docker to create a mail setup running postfix + dovecot, fully configured and ready to go (on Ubuntu 14.04), so I could easily deploy it on several servers. As far as I understand Docker, the process to do this is:
Spin up a new container (docker run -it ubuntu bash).
Install and configure postfix and dovecot.
If I need to shut down and take a break, I can exit the shell and return to the container via docker start <id> followed by docker attach <id>.
(here's where things get fuzzy for me)
At this point, is it better to export the image to a file, import it on another server, and run it? How do I make sure the container will automatically start postfix, dovecot, and other services when it runs? I also don't quite understand the difference between using a Dockerfile to automate installations versus just installing everything manually and exporting the image.
Configure multiple docker images using Dockerfiles
Each Docker container should run only one service: one container for postfix, one for another service, and so on. Your running containers can communicate with each other.
Build those images
Push those images to a registry so that you can easily pull them on different servers and have the same setup.
Pull those images on your different servers.
You can pass ENV variables when you start a container to configure it.
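As a sketch (the registry, image, and variable names are made up), the same image can then be configured per server at start time:
docker pull registry.example.com/my-postfix-image
docker run -d --name postfix -e MAIL_HOSTNAME=mail1.example.com registry.example.com/my-postfix-image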
You should not install something directly inside a running container.
This defeats the purpose of having a reproducible setup with Docker.
Your step #2 should be a RUN entry inside a Dockerfile, which is then used with docker build to create an image.
This image could then be used to start and stop running containers as needed.
See the Dockerfile RUN entry documentation. This is usually used with apt-get install to install needed components.
The ENTRYPOINT in the Dockerfile should be set to start your services.
In general it is recommended to have just one process in each image.
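A minimal sketch of that idea for one of the services; the package, config file, and foreground flag below are assumptions to adapt:
FROM ubuntu:14.04
# Install the service at build time, not in a running container
RUN apt-get update && apt-get install -y dovecot-imapd && rm -rf /var/lib/apt/lists/*
# Bake the configuration into the image so every server gets the same setup
COPY dovecot.conf /etc/dovecot/dovecot.conf
# Run the single service in the foreground as the container's main process
ENTRYPOINT ["dovecot", "-F"]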
