What is a Docker build stage?

As far as I understand, build stages are fundamental to Docker, and while I have a practical understanding of them, I have trouble coming up with a proper definition, and I can't seem to find one either.
So: what is the definition of a Docker build stage?
Edit: I'm not asking "how do I use a build stage?" or "how do I use multi-stage builds?", which people seem very eager to answer :-)
The reason I have this question is because I saw the following sentences in the docs:
"The FROM instruction initializes a new build stage"
"a name can be given to a new build stage"
Which left me wondering: what exactly is a build stage?

I don't think there will ever be a strict definition of a Docker build stage, because a build stage is in general something conceptual which:
can be defined by you
depends on your case (language / libraries)
In this question: Difference between build and deploy? one of the answers says...
Build means to Compile the project.
I think you can see it this way too. A build stage is any procedure that generates something which can later be taken and used.
The idea with docker multi-stage builds is to:
generate what you are going to need
leave behind what you don't need and use the product of step 1 in a more lightweight way
If you have read the docs, Alex Ellis has a nice example (sketched below) where the same logic applies:
he starts with a golang image, adds libraries, builds his app (Go generates a binary executable file)
after that, he doesn't need golang or the libraries to ship/run it, so he picks an alpine image, adds the executable file from step 1, and ships his app in a much smaller image.
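A minimal sketch of that pattern (not his exact Dockerfile; the image tags and paths are illustrative, and it assumes a Go module in the build context):

FROM golang:1.21-alpine AS build
WORKDIR /src
COPY . .
# produce a single statically linked binary
RUN CGO_ENABLED=0 go build -o /out/app .

FROM alpine:latest
COPY --from=build /out/app /usr/local/bin/app
CMD ["app"]

Only the final alpine stage ends up in the image you ship; the golang toolchain stays behind in the build stage.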

Since version 17.05, Docker supports multiple stages during a single docker build execution.
This means you no longer have to define a single source image in your Dockerfile and do the whole build in one run; you can define multiple stages, each with its own image, using multiple FROM definitions:
# Build stage
FROM microsoft/aspnetcore AS build
# ..do a build with a dev image for creating the ./app artifact

# Publish - use a hardened, production image
FROM alpine:latest
# copy only the ./app artifact produced by the build stage
COPY --from=build ./app ./app
CMD ["./app"]
This gives you the benefit of breaking your image-building process into stages, each optimized for the task it performs - for example, the stages could be (see the sketch after this list):
use an image with extra linting dependencies to check your source
use a dev-image with all development dependencies already installed to build your source
use another image including test frameworks to run various tests on the artifacts
and once everything passes, use a minimal, optimized, hardened image to capture the final artifacts for production
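A hedged sketch of what such a staged Dockerfile could look like for the ASP.NET Core case above (image tags, the test stage wiring, and the project name MyApp.dll are assumptions, not part of the original answer):

# Build stage - full SDK with all development dependencies
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish -c Release -o /app

# Test stage - reuses the build stage and runs the test suite
FROM build AS test
RUN dotnet test

# Publish stage - minimal, hardened runtime image with only the artifacts
FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS final
WORKDIR /app
COPY --from=build /app .
ENTRYPOINT ["dotnet", "MyApp.dll"]

During CI you could then run docker build --target test . to exercise the test stage, and a plain docker build . (or --target final) for the production image.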
Read more details about multi-stage builds:
https://docs.docker.com/develop/develop-images/multistage-build/

A stage is the creation of an image. In a multi-stage build, you go through the process of creating more than one image; however, you typically only tag a single one (exceptions being multiple builds, building a multi-architecture image manifest with a tool like buildx, and anything else docker releases after this answer).
Each stage, building a distinct image, starts from a FROM line in the Dockerfile. One stage doesn't inherit anything done in previous stages; it is based on its own base image. So if you have the following:
FROM alpine as stage1
RUN apk add your_tool
FROM alpine as stage2
RUN your_tool some args
you will get an error since your_tool is not installed in the second stage.
Which stage do you get as output from the build? By default the last stage, but you can change that with docker image build --target stage1 . to build the stage with that name, stage1 in this example. The classic docker build will run from the top of the Dockerfile until it finishes the target stage. BuildKit builds a dependency graph and builds stages concurrently and only if needed, so do not depend on this ordering to control something like a testing workflow in your Dockerfile (BuildKit can see that nothing in the test stage is needed in your release stage and skip building the test).
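For example, building only the first stage of the Dockerfile above might look like this (the image tag is arbitrary):

# classic builder: runs the Dockerfile from the top until stage1 finishes
docker image build --target stage1 -t myimage:stage1 .

# BuildKit: builds only the stages that stage1 actually depends on
DOCKER_BUILDKIT=1 docker image build --target stage1 -t myimage:stage1 .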
What's the value of multiple stages? Typically, it's done to separate the build environment from the runtime environment. It lets you perform the entire build inside of docker. This has two advantages.
First, you don't need an external Makefile and various compilers and other tools installed on the host to compile the binaries that then get copied into the image with a COPY line; anyone with docker can build your image.
And second, the resulting image doesn't include all the compilers or other build-time tooling that isn't needed at runtime, resulting in smaller and more secure images. The typical example is a Java app built with Maven and a full JDK, and a runtime image with just the jar file and the JRE.
If each stage makes a separate image, how do you get the jar file from the build stage to the run stage? That comes from a new option to the COPY command, --from. An oversimplified multi-stage build looks like:
FROM maven as build
COPY src /app/src
WORKDIR /app/src
RUN mvn install
FROM openjdk:jre as release
COPY --from=build /app/src/target/app.jar /app
CMD java -jar /app/app.jar
With that COPY --from=build we are able to take the artifact built in the build stage and add it to the release stage, without including anything else from that first stage (no layers of compile tools like JDK or Maven get added to our second stage).
How do FROM x as y and COPY --from=y /a /b work together? FROM x as y defines an image name for the duration of this build, in this case y. Anywhere later in the Dockerfile where you would put an image name, you can put y and you'll get the result of this stage as your input. So you could say:
FROM upstream as mybuilder
RUN apk add common_tools
FROM mybuilder as stage2
RUN some_tool arg2
FROM mybuilder as stage3
RUN some_tool arg3
FROM minimal_base as release
COPY --from=stage2 /bin2 /
COPY --from=stage3 /bin3 /
Note how stage2 and stage3 are each FROM mybuilder, which is the output of the first stage.
The COPY --from=y allows you to change the context you are copying from to another image instead of the build context. It doesn't have to be another stage. So, for example, you could do the following to get a docker binary into your image:
FROM alpine
COPY --from=docker:stable /usr/local/bin/docker /usr/local/bin/
Further documentation on this is available at: https://docs.docker.com/develop/develop-images/multistage-build/

A build stage starts at a FROM instruction and ends at the step before the next FROM instruction (or at the end of the Dockerfile).

stage | steɪdʒ |
noun
a point, period, or step in a process or development
Take a practical example: you want to build an image which contains a production ready web server with Typescript files compiled to Javascript. You want to build that Typescript within a Docker container to simplify dependency management. So you need:
node.js
Typescript
any dependencies needed for compilation
Webpack or whatever
nginx/Apache/whatever
In your final image you only really need the compiled .js files and, say, nginx. But to get there, you need all that other stuff first. When you upload that final image, it will contain all the intermediate layers, even if they're unnecessary for the final product.
Docker build stages now allow you to actually separate those stages, or steps, into separate images, while still using just one Dockerfile and not needing to glue several Dockerfiles together with external shell scripts or such. E.g.:
FROM node as builder
RUN npm install ...
# whatever you need to build your files
FROM nginx as production
COPY --from=builder /final.js /var/www/html
The final result of this Dockerfile is a small image with nginx as its base plus just the final .js file. It does not contain all the unnecessary stuff like node.js and the npm dependencies.
builder here is the first stage, production is the second stage. In this case the first stage will be discarded at the end of the process, but you can also choose to build a specific stage using docker build --target=builder. A new FROM introduces a new, separate stage. They're essentially separate Dockerfiles, but they can share data using COPY --from.

Related

Single entrypoint for pipeline steps in a docker image

I have a docker image which encapsulates some processing steps: A, B, C with a linear dependency: A -> B -> C. Each step produces some artifacts (files) that will be required for subsequent steps.
What is a robust way of running this pipeline given these constraints?
A simple idea is to write a shell script, running each step like:
# run.sh
python step_a.py [args]
python step_b.py [args]
./step_c [args]
and define run.sh as the ENTRYPOINT of the docker image.
Would this be good-enough? What are some potential caveats? Is there a better approach?
I would have preferred something like docker-compose, but even with depends_on, it's not guaranteed that subsequent steps will run only after former steps are finished.
I think the most robust way to do this with a dockerfile would be to use multi-stage builds.
At its core, multi-stage builds just break up the Dockerfile into multiple smaller images that you can control more granularly; so for your use case, you would have a stage for each part. Then you can copy the artifacts you need between stages. Finally, since you want an output and not a container, you would make the entrypoint the Rust binary and have that spit out whatever you need. This would look a little something like this:
# stage 1: this can be whatever image you want
FROM python:3.8 AS stage-1
COPY requirements_1.txt python_file_1.py ./
# install the reqs for the first python file
RUN pip install -r requirements_1.txt
RUN python python_file_1.py

# stage 2: again, whatever image you want
FROM python:3.8 AS stage-2
COPY requirements_2.txt python_file_2.py ./
# same idea
RUN pip install -r requirements_2.txt
# copy the artifact produced by python_file_1.py to wherever you need it; the paths here are obviously placeholders
COPY --from=stage-1 ./artifact_1 ./destination
RUN python python_file_2.py

FROM rust:1.31
COPY --from=stage-2 ./artifact_2 ./destination
ENTRYPOINT ["./rust_binary"]
Basic gist (a build-and-run sketch follows this list) -
Make some python image, install prereqs for first python file
Run first python file
Make some python image, install prereqs for second python file
Copy needed artifact from first stage to current (second) stage
Run second python file
Make some rust image, install anything needed
Copy needed artifact from second stage to current (third) stage
Entrypoint into the rust binary, which should produce your output
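As a hedged usage sketch (the tag, arguments, and output path are placeholders), building the image and running the pipeline to collect its output could look like:

docker build -t pipeline:latest .
# the entrypoint is the rust binary; mount a host directory if it writes files there
docker run --rm -v "$(pwd)/output:/output" pipeline:latest [args]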

How to use docker images when building artefacts in Actions?

TL;DR: I would like to use specific docker images on a self-hosted Actions runner (itself a docker container on my docker engine) to build artefacts that I would move between the build phases, ending with a standalone executable (not a docker container to be deployed). I do not know how to use docker containers as "building engines" in Actions.
Details: I have a home project consisting of a backend in Go (cross compiled to a standalone binary) and a frontend in Javascript (actually a framework: Quasar).
I develop on my laptop in Windows and use GitHub as the SCM.
The manual steps I do are:
build a static version of the frontend which lands in a directory spa
copy that directory to the backend directory
compile the executable that embeds the spa directory
copy (scp) this executable to the final destination
For development purposes this works fine.
I now would like to use Actions to automate the whole thing. I use docker based self-hosted runners (tcardonne/github-runner).
My problem: the containers do a great job isolating the build environment from the server they run on. They are, however, reused across build jobs, and this may create conflicts. More importantly, the default versions of software provided by these containers are not the right ones (usually, I need the latest).
The solution would be to run the build phases in disposable docker containers (based on the right image, with shorter build times as a nice side effect). Unfortunately, I do not know how to set this up.
Note: I do not want to ultimately create docker containers; I just want to use them as "building engines", extract the artefacts from them, and share these between the jobs (in my specific case, one job would build the frontend with quasar and generate a directory, the other would be a compilation ending with a standalone executable copied elsewhere).
Interesting premise, you can certainly do this!
I think you may be slightly mistaken with regards to:
They are however reused across build jobs and this may create conflicts
If you run a new container from an image, then you will start with a fresh instance of that container. Files, software, etc, all adhering to the original image definition. Which is good, as this certainly aids your efforts. Let me know if I have the wrong end of the stick in regards to the above though.
Base Image
You can define your own image for building, in order to mitigate shortfalls of public images that may not be up to date, or suit your requirements. In fact, this is a common pattern for CI, and Google does something similar with their cloud build configuration. For either approach below, you will likely want to do something like the following to ensure you have all the build tools you may need.
As a rough example:
FROM golang:1.16.7-buster
RUN apt update && apt install -y \
        git \
        make \
        ... \
    && useradd <myuser> \
    && mkdir /dist
USER myuser
You could build and publish this with the following tag:
docker build . -t <containerregistry>:buildr/golang
It would also be recommended that you maintain a separate builder image for other types of projects, such as node, python, etc.
Approaches
Building with layers
If you're looking to leverage build caching for your applications, this will be the better option for you. Caching is only effective if nothing has changed, and since the projects will be built in isolation, it makes it relatively safe.
Building your app may look something like the following:
FROM <containerregistry>:buildr/golang as builder
COPY src/ .
RUN make dependencies
RUN make
RUN mv /path/to/compiled/app /dist
FROM scratch
COPY --from=builder /dist /dist
The gist of this is that you would start building your app within the builder image, such that it includes all the build deps you require, and then use a multi-stage file to publish a final static container that includes your compiled source code, with no dependencies (using the scratch image as the smallest image possible).
Getting the final files out of your image would be a bit harder using this approach, as you would have to run an instance of the container once published in order to mount the files and persist it to disk, or use docker cp to retrieve the files from a running container (not image) to your disk.
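For the docker cp route, you don't actually need a running container; a stopped one created from the image is enough (a sketch; <containerregistry>:myapp stands for whatever you tagged the final image as, and the paths are placeholders):

# create a stopped container from the published image
docker create --name extract <containerregistry>:myapp
# copy the compiled artifacts out of it onto the host
docker cp extract:/dist ./dist
# clean up
docker rm extract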
In GitHub Actions, this would look like running a step that builds a Docker container, where the step can occur anywhere with docker accessibility.
For example:
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      ...
      - name: Build and push
        id: docker_build
        uses: docker/build-push-action@v2
        with:
          push: true
          tags: user/app:latest
Building as a process
This approach cannot leverage build caching as effectively, but you may be able to do clever things like mounting a host npm cache into your container to speed up actions like npm restore.
This approach differs from the former in that the way you build your app will be defined via CI / a purposeful script, as opposed to the Dockerfile.
In this scenario, it would make more sense to define the CMD in the parent image and mount your source code in, thus not maintaining an image per project you are building.
This would shift the responsibility of building your application from the build time of the image to the runtime. Retrieving your code from the container would be doable through volume mounting, for example:
docker run -v /path/to/src:/src -v /path/to/dist:/dist <containerregistry>:buildr/golang
If the CMD was defined in the builder, that single script would execute and build the mounted in source code, and subsequently publish to /dist in the container, which would then be persisted to your host via that volume mapping.
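A minimal sketch of such a builder image, assuming the build is driven by a build.sh script baked into it (the script name and paths are hypothetical):

FROM golang:1.16.7-buster
COPY build.sh /usr/local/bin/build.sh
RUN chmod +x /usr/local/bin/build.sh
# expects the source mounted at /src and writes the compiled output to /dist
CMD ["/usr/local/bin/build.sh"]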
Of course, this applies if you're building locally. It actually becomes a bit nicer in a GitHub Actions context if you wish to keep your build instructions there. You can choose to run steps within your builder container using something like the following:
jobs:
  ...
  container:
    runs-on: ubuntu-latest
    container: <containerregistry>:buildr/golang
    steps:
      - run: |
          echo This job does specify a container.
          echo It runs in the container instead of the VM.
        name: Run in container
Within that run: spec, you could choose to call a build script, or enter the commands that might be present in the script yourself.
What you do with the compiled source once acquired is largely up to you 👍
Chaining (Frontend / Backend)
You mentioned that you build static assets for your site and then embed them into your golang binary to be served.
Something like that introduces complications of course, but nothing untoward. If you do not need to retrieve your web files until you build your golang container, then you may consider taking the first approach, and copying the content from the published image as part of a Docker directive. This makes more sense if you have two separate projects, one for frontend and backend.
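A sketch of that directive-based chaining, assuming the frontend build is published as its own image (all image names and paths here are placeholders):

# pull the pre-built static assets from the published frontend image
FROM <containerregistry>:frontend-assets AS frontend

FROM <containerregistry>:buildr/golang AS build
WORKDIR /src
COPY . .
# embed the static files produced by the frontend build
COPY --from=frontend /dist/spa ./spa
RUN go build -o /dist/server .

FROM scratch
COPY --from=build /dist/server /server
ENTRYPOINT ["/server"]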
If everything is in one folder, then it sounds like you may just want to extend your build image to facilitate go and js, and then take the latter approach and define those build instructions in a script, makefile, or your run: config in your actions file.
Conclusion
This is a lot of info; I hope it's digestible, and more importantly, I hope it gives you some ideas as to how you can tackle your current issue. Let me know in the comments if you would like any clarification.

Re-use Dockerfile with different base image

I have a Dockerfile that currently uses the node:10.21.0-buster-slim as its base. That works well for running in production since I get a nice small image (I'd rather not use Alpine since I've had issues with this code on Alpine previously and I'm already stuck running a MySQL image based on buster-slim in production). However, for development, it would obviously be nice to have an image with more tools for diagnosing issues that crop up (presumably based on either debian:buster or buildpack-deps:buster).
Is there some way I can run the same steps with two different base images without having to duplicate the Dockerfile contents? I assume the answer is yes with some multi-stage build magic, but I haven't figured out how that's supposed to work. In my dream world there are also a few minor differences between the dev and prod build steps (e.g. the --only=production argument to npm install), but I'm willing to sacrifice that if I have to in order to avoid maintaining two nearly identical Dockerfiles.
Multi-stage build magic is one way to do it:
ARG TARGET="prod"
FROM node:10.21.0-buster-slim as prod
# do stuff
FROM debian:buster as dev
# do other stuff, like apt-get install nodejs
FROM ${TARGET}
# anything in common here
Build the image with DOCKER_BUILDKIT=1 docker build --build-arg 'TARGET=dev' [...] to get the development-specific stuff. Build image with DOCKER_BUILDKIT=1 docker build [...] to get the existing "prod" stuff. Switch out the value in the first ARG line to change the default behavior if the --build-arg flag is omitted.
Using the DOCKER_BUILDKIT=1 environment variable is important; if you leave it out, builds will always run all three stages. This becomes a much bigger problem the more stages you have and the more conditional things you do. When you include it, the build executes the last stage in the file and only the previous stages that are necessary to complete the multi-stage build. Meaning, for TARGET=prod, the dev stage never executes, and vice versa.

Docker multistage build without copying from previous image?

Does it have any advantages to use a multi-stage build in Docker if you don't copy any files from the previously built image?
eg.
FROM some_base_image as base
#Some random commands
RUN mkdir /app
RUN mkdir /app2
RUN mkdir /app3
#ETC
#Second stage starts from first stage
FROM base
#Add some files to image
COPY foo.txt /app
Does this result in a smaller image or offer any other advantages compared to a non multi-stage version? Or are multi stage builds only useful for preparing some files and then copying those into another base image?
Or are multi stage builds only useful for preparing some files and then copying those into another base image?
This is the main use-case discussed in "Use multi-stage builds"
The main goal is to reduce the number of layers by copying files from one image to another, without including the build environment needed to produce said files.
But another goal could be to avoid rebuilding the entire Dockerfile, including every stage.
Then your suggestion (not copying) could still apply.
You can specify a target build stage. The following command assumes you are using the previous Dockerfile but stops at the stage named builder:
$ docker build --target builder -t alexellis2/href-counter:latest .
A few scenarios where this might be very powerful are:
Debugging a specific build stage
Using a debug stage with all debugging symbols or tools enabled, and a lean production stage (sketched after this list)
Using a testing stage in which your app gets populated with test data, but building for production using a different stage which uses real data
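To make the debug/production scenario concrete, a minimal sketch (image tags, build flags, and paths are just illustrative, and it assumes a Go module in the build context):

FROM golang:1.21 AS build
WORKDIR /src
COPY . .
# build with optimizations and inlining disabled so the binary is debugger-friendly
RUN CGO_ENABLED=0 go build -gcflags "all=-N -l" -o /out/app .

# debug stage: keeps the full toolchain and a shell around the unstripped binary
FROM build AS debug
CMD ["/out/app"]

# lean production stage: just the binary
FROM alpine:latest AS production
COPY --from=build /out/app /usr/local/bin/app
CMD ["app"]

docker build --target debug . gives you the fat, debuggable image, while the default build (or --target production) gives the lean one.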

Building software in docker - at `build` or at `run` time?

I am currently using docker to create a reproducible build environment (for building Android ROMs). Now I would like to run multiple builds, each with slight variations. Every build consists of several steps, e.g.
Build Linux kernel
Build Android
Include custom apps
Package image
If two builds only vary at step 3, it would be great to be able to reuse the first two steps.
I am thinking of two options:
Enter my docker container, run the build, and save the build artifacts at each step. Later check if I can reuse them. This would require quite a bit of coding, and manual management of build artifacts.
Abuse docker build. Create a dockerfile for each configuration, with one RUN command for each step. I think this will let me use docker's caching - if two builds only differ at step 3, docker will reuse a layer containing steps 1 and 2. I would only ever "run" the container I built to copy out the finished ROM.
Is there a "best" or canonical way to do this? Is there any downside to using docker build in this way?
You could build what's called a "base image" and push that up to a docker registry. Then, for the two branches of that image, you use the FROM keyword. But instead of using a base image like FROM ubuntu:latest, you use your base image:
To use the base image:
FROM repo/base-image:tag
So your base could be:
FROM ubuntu:14.04
# Step 1
COPY /tmp /tmp
# Step 2
ADD /src /src
You build and push that:
docker build -t repo/base-image .
docker push repo/base-image
Then, in your other two Dockerfiles...
Dockerfile1
FROM repo/base-image:tag
# Step 3 specific to this Dockerfile1
ADD /something /somewhere
# Do different things
EXPOSE 443
Dockerfile2
FROM repo/base-image:tag
# Step 3 specific to this Dockerfile2
ADD /something-else /somewhere-else
# Do different things
EXPOSE 80
That way they have the first two layers in common and only differ by the third layer. The instructions in a Dockerfile create the layers, kind of like traversing a tree: the more instructions you have, the more layers/levels you have. The FROM repo/img:tag line tells Docker which image to inherit ALL previous layers from.
The second option (relying on Dockerfiles + docker build) is definitely the way to go.
Indeed as you already mentioned it in your question, this will enable Docker to use caching.
Also, I recall that even if there is only one Dockerfile involved with a single FROM ... command, Docker's caching will already be active. That's the reason why, in a Dockerfile, the order of commands matters (it is preferable to run first the commands that are unlikely to change between builds, and afterwards the commands that are likely to change, such as the compilation of custom apps).
You can thus follow the steps detailed in #JabariDash's answer, but if you notice that an intermediate image repo/some-image is only used once (via a command FROM repo/some-image in another Dockerfile), note that you can avoid defining this repo/some-image in a separate Dockerfile: indeed you can put several FROM ... commands in the same Dockerfile, and rely on the so-called multi-stage builds feature of Docker >= 17.05.
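In other words, the base image and the two variants from the earlier answer could be collapsed into a single Dockerfile along these lines (a sketch reusing that answer's placeholder paths and a hypothetical stage naming):

FROM ubuntu:14.04 AS base
# Steps 1 and 2, shared by both variants
COPY /tmp /tmp
ADD /src /src

FROM base AS variant1
# Step 3 specific to the first variant
ADD /something /somewhere
EXPOSE 443

FROM base AS variant2
# Step 3 specific to the second variant
ADD /something-else /somewhere-else
EXPOSE 80

Each variant can then be built with docker build --target variant1 . or docker build --target variant2 ., and the shared base layers are cached between them.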
