How to debug docker-compose cache miss when building - docker

I'm executing the same docker-compose build command and I see that it misses the cache
Building service1
Step 1/31 : FROM node:10 as build-stage
---> a7366e5a78b2
Step 2/31 : WORKDIR /app
---> Using cache
---> 8a744e522376
Step 3/31 : COPY package.json yarn.lock ./
---> Using cache
---> 66c9bb64a364
Step 4/31 : RUN yarn install --production
---> Running in 707365c332e7
yarn install v1.21.1
..
..
..
As you can see, the cache was missed, but I couldn't understand why.
What is the best method to debug what changed and figure out why?
EDIT: The question is not to debug my specific problem. But how can I generally debug a problem of this kind. How can I know WHY docker-compose thinks things changed (although I'm pretty sure NOTHING changed), which files/commands/results are different?

how can I generally debug a problem of this kind. How can I know WHY docker-compose thinks things changed (although I'm pretty sure NOTHING changed), which files/commands/results are different
In general, as shown here:
I'm a bit bummed that I can't seem to find any way to make the Docker build more verbose
But when it comes to docker-compose, it depends on your version and option used.
moby/moby issue 30081 explains (by Sebastiaan van Stijn, thaJeztah):
Current versions of docker-compose and docker build in many (or all) cases will not share the build cache, or at least not produce the same digest.
The reason for that is that when sending the build context with docker-compose, it will use a slightly different compression (docker-compose is written in Python, whereas the docker cli is written in Go).
There may be other differences due to them being a different implementation (and language).
(that was also discussed in docker/compose issue 883)
The next release of docker compose will have an (currently opt-in) feature to make it use the actual docker cli to perform the build (by setting a COMPOSE_DOCKER_CLI_BUILD=1 environment variable). This was implemented in docker/compose#6865 (1.25.0-rc3+, Oct. 2019)
With that feature, docker compose can also use BuildKit for building images (setting the DOCKER_BUILDKIT=1 environment variable).
I would highly recommend using buildkit for your builds if possible.
When using BuildKit (requires Docker 18.09 or up, and at this point is not supported for building Windows containers), you will see a huge improvement in build-speed, and the duration taken to send the build-context to the daemon in repeated builds (buildkit uses an interactive session to send only those files that are needed during build, instead of uploading the entire build context).
So double-check first if your docker-compose uses BuildKit, and if the issue (caching not reused) persists then:
COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose build
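Before rerunning the build, it can help to confirm that the files feeding a COPY step really are byte-identical to the previous run. The following is only a rough sketch: copy_fingerprint is a hypothetical helper name, it assumes GNU sha256sum is available, and it hashes file contents and names only, so it approximates what the builder compares rather than reproducing it (Docker's build cache documentation states that modification times are not part of the checksum).

```shell
#!/bin/sh
# Hypothetical helper: print a single SHA-256 digest covering the given files.
# Assumption: GNU coreutils sha256sum is installed (on macOS, shasum -a 256 is the usual substitute).
# Only file contents and the names passed in are hashed; permissions and ownership are ignored.
copy_fingerprint() {
    # First pass emits one "digest  name" line per file, second pass hashes that list.
    sha256sum "$@" | sha256sum | cut -d' ' -f1
}
```

Saving the output of copy_fingerprint package.json yarn.lock next to each build makes it possible to tell whether a later cache miss coincided with a real change in those inputs.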
Sebastiaan added in issue 4012:
BuildKit is still opt-in (because there's no Windows support yet), but is production quality, and can be used as the default for building Linux images.
Finally, I realize that for the Azure pipelines, you (probably) don't have control over the versions of Docker and Docker Compose installed, but for your local machines, make sure to update to the latest 19.03 patch release; for example, Docker 19.03.3 and up have various improvements and fixes in the caching mechanisms of BuildKit (see, e.g., docker#373).
Note, in your particular case, even though this is not the main issue in your question, it would be interesting to know if the following helps:
yarnpkg/yarn issue 749 suggests:
You wouldn't mount the Yarn cache directory. Instead, you should make sure you take advantage of Docker's image layer caching.
These are the commands I am using:
COPY package.json yarn.lock ./
RUN yarn --pure-lockfile
Then try your yarn install command, and see if docker still doesn't use its cache.
RUN yarn install --frozen-lockfile --production && yarn cache clean
Don't forget a yarn cache clean in order to prevent the yarn cache from winding up in docker layers.
If the issue persists, switch to buildkit directly (for testing), with a buildctl build --progress=plain to see a more verbose output, and debug the caching situation.
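When the build output has been saved to a file, the first step that was not served from the cache can be located mechanically. This is a minimal sketch that assumes the classic (non-BuildKit) builder output format shown in the question ("Step N/M : ..." followed by "---> Using cache"); first_cache_miss is a hypothetical helper name, and BuildKit's plain progress output uses a different layout that this does not parse.

```shell
#!/bin/sh
# Hypothetical helper: read classic-builder output on stdin and print the first
# step that was not followed by "---> Using cache". FROM steps are skipped
# because they never print that marker.
first_cache_miss() {
    awk '
        function check() {
            if (step != "" && !cached && step !~ /^Step [0-9]+\/[0-9]+ : FROM /) {
                print step
                found = 1
            }
        }
        /^Step [0-9]+\/[0-9]+ : / { check(); if (found) exit; step = $0; cached = 0; next }
        /^ ?---> Using cache/     { cached = 1 }
        END { if (!found) check(); if (!found) print "all steps cached" }
    '
}
```

For example, docker-compose build > build.log followed by first_cache_miss < build.log prints the instruction where caching stopped, which narrows down which inputs to inspect.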
Typically, a multi-stage approach, as shown here, can be useful:
FROM node:alpine
WORKDIR /usr/src/app
COPY . /usr/src/app/
# We don't need to do this cache clean, I guess it wastes time / saves space: https://github.com/yarnpkg/rfcs/pull/53
RUN set -ex; \
    yarn install --frozen-lockfile --production; \
    yarn cache clean; \
    yarn run build
FROM nginx:alpine
WORKDIR /usr/share/nginx/html
COPY --from=0 /usr/src/app/build/ /usr/share/nginx/html
As noted by nairum in the comments:
I just found that it is required to use cache_from to make caching working when I use my multi-stage Dockerfile with Docker Compose.
From the documentation:
cache_from defines a list of sources the Image builder SHOULD use for cache resolution.
Cache location syntax MUST follow the global format [NAME|type=TYPE[,KEY=VALUE]].
Simple NAME is actually a shortcut notation for type=registry,ref=NAME.
build:
  context: .
  cache_from:
    - alpine:latest
    - type=local,src=path/to/cache
    - type=gha

Related

Combining multiple images in docker-compose [duplicate]

I have a few Dockerfiles right now.
One is for Cassandra 3.5, and it is FROM cassandra:3.5
I also have a Dockerfile for Kafka, but it is quite a bit more complex. It is FROM java:openjdk-8-jre and it runs a long command to install Kafka and Zookeeper.
Finally, I have an application written in Scala that uses SBT.
For that Dockerfile, it is FROM broadinstitute/scala-baseimage, which gets me Java 8, Scala 2.11.7, and SBT 0.13.9, which are what I need.
Perhaps, I don't understand how Docker works, but my Scala program has Cassandra and Kafka as dependencies and for development purposes, I want others to be able to simply clone my repo with the Dockerfile and then be able to build it with Cassandra, Kafka, Scala, Java and SBT all baked in so that they can just compile the source. I'm having a lot of issues with this though.
How do I combine these Dockerfiles? How do I simply make an environment with those things baked in?
You can, with the multi-stage builds feature introduced in Docker 17.05.
Take a look at this:
FROM golang:1.7.3
WORKDIR /go/src/github.com/alexellis/href-counter/
RUN go get -d -v golang.org/x/net/html
COPY app.go .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=0 /go/src/github.com/alexellis/href-counter/app .
CMD ["./app"]
Then build the image normally:
docker build -t alexellis2/href-counter:latest .
From : https://docs.docker.com/develop/develop-images/multistage-build/
The end result is the same tiny production image as before, with a significant reduction in complexity. You don’t need to create any intermediate images and you don’t need to extract any artifacts to your local system at all.
How does it work? The second FROM instruction starts a new build stage with the alpine:latest image as its base. The COPY --from=0 line copies just the built artifact from the previous stage into this new stage. The Go SDK and any intermediate artifacts are left behind, and not saved in the final image.
You can't combine dockerfiles as conflicts may occur. What you want to do is to create a new dockerfile or build a custom image.
TL;DR;
If your current development container contains all the tools you need and works, then save it as an image, upload it to a repo, and create a dockerfile that pulls that image from the repo.
Details:
Building a custom image is by far easier than creating a dockerfile using a public image, as you can store whatever hacks and mods you need in the image. To do so, start a blank container with a basic Linux image (or broadinstitute/scala-baseimage), install whatever tools you need and configure them until everything works correctly, then save it (the container) as an image. Create a new container off this image and test to see if you can build your code on top of it via docker-compose (or however you want to do/build it). If it works, then you have a working base image that you can upload to a repo so others can pull it.
To build a dockerfile with a public image, you will need to put all hacks, mods and setup on the dockerfile itself. That is, you will need to place every command line that you used into a text file and reduce whatever hacks, mods and setup into command lines. At the end, your dockerfile will create an image automatically and you don't need to store this image into a repo and all you need to do is to give others the dockerfile and they can spin the image up at their own docker.
Note that once you have a working dockerfile, you can tweak it easily, as it will create a new image every time you use the dockerfile. With a custom image, you may run into issues where you need to rebuild the image due to conflicts. For example, all of your tools work with OpenJDK until you install one that doesn't. The fix may involve uninstalling OpenJDK and using the Oracle JDK instead, which can break the configuration you did for all the tools already installed.
The following answer applies to Docker 17.05 and above:
I would prefer to use COPY --from=NAME together with FROM image AS NAME.
Why?
You can use --from=0 and above, but this gets a little hard to manage when you have many stages in a dockerfile.
Example:
FROM golang:1.7.3 as backend
WORKDIR /backend
RUN go get -d -v golang.org/x/net/html
COPY app.go .
RUN #install some stuff, compile assets....
FROM golang:1.7.3 as assets
WORKDIR /assets
RUN ./getassets.sh
FROM node:latest as frontend
RUN npm install
WORKDIR /assets
COPY --from=assets /assets .
CMD ["./app"]
FROM alpine:latest as mergedassets
WORKDIR /root/
COPY --from=frontend . /
COPY --from=backend ./backend .
CMD ["./app"]
Note: Managing the dockerfile properly will help build a Docker image much faster. Internally, Docker uses layer caching to help with this process in case the image has to be rebuilt.
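One practical benefit of named stages is that a single stage can be built on its own with the --target flag of docker build. The sketch below is illustrative only: build_stage is a hypothetical helper, the stage and tag names are placeholders, and the DOCKER variable is an assumption added here so the command can be dry-run (for example with DOCKER=echo) on a machine without a Docker daemon.

```shell
#!/bin/sh
# Assumption: the Docker CLI is on PATH as "docker"; set DOCKER=echo to only print the arguments.
DOCKER=${DOCKER:-docker}

# Hypothetical helper: build only the named stage ($1) of the Dockerfile in the
# current directory and tag the result as $2.
build_stage() {
    "$DOCKER" build --target "$1" -t "$2" .
}
```

Calling build_stage backend myapp:backend would stop the build at the stage declared as "backend", which keeps iteration on one stage from waiting on the others.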
Yes, you can roll a whole lot of software into a single Docker image (GitLab does this, with one image that includes Postgres and everything else), but generalhenry is right - that's not the typical way to use Docker.
As you say, Cassandra and Kafka are dependencies for your Scala app, they're not part of the app, so they don't all belong in the same image.
Having to orchestrate many containers with Docker Compose adds an extra admin layer, but it gives you much more flexibility:
your containers can have different lifespans, so when you have a new version of your app to deploy, you only need to run a new app container, you can leave the dependencies running;
you can use the same app image in any environment, using different configurations for your dependencies - e.g. in dev you can run a basic Kafka container and in prod have it clustered on many nodes, your app container is the same;
your dependencies can be used by other apps too - so multiple consumers can run in different containers and all work with the same Kafka and Cassandra containers;
plus all the scalability, logging etc. already mentioned.
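To illustrate the first point, a new version of the app can be rolled out while the dependency containers keep running. This is a hedged sketch, not a prescribed workflow: redeploy_service is a hypothetical helper, "app" is a placeholder service name, and the COMPOSE variable is an assumption added here so the command can be dry-run (for example with COMPOSE=echo).

```shell
#!/bin/sh
# Assumption: Compose is invoked as "docker-compose"; set COMPOSE=echo to only print the arguments.
COMPOSE=${COMPOSE:-docker-compose}

# Hypothetical helper: rebuild and restart one service ($1) in the background
# without recreating the services it depends on.
redeploy_service() {
    "$COMPOSE" up -d --no-deps --build "$1"
}
```

With a Compose file that defines app, kafka and cassandra services, redeploy_service app would replace only the app container.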
When might you want to "combine" Docker images?
As others are pointing out here, you typically don't want to put your database and your application into the same Docker image. Ideally you want a Docker image to wrap a "single process"/"runtime". This allows each process to be scaled up/down and restarted individually.
Let's say you want to use some shared C-libraries/executables that are not available in the package manager of the image you are using, but someone else has created an image where they are precompiled - and you might not want to recompile these binaries as part of your build (depending on how long this takes). Is there a way to quickly create a POC-Docker image containing all of these executables/libraries based on the existing images?
Docker and Composition
Relevant discussion: https://github.com/moby/moby/issues/3378
What Docker lacks is a good way of composing images. You can copy individual files or entire file systems from other images into your own using COPY --from=<image> <from-path> <to-path>. There is no builtin way of copying the environment variables from another image into your own.
That said, I have personally created a custom frontend/parser for Dockerfiles that adds an INCLUDE <image>-keyword. This copies the entire filesystem, along with the environment variables into your image:
DOCKER_BUILDKIT=1 docker build -t myimage .
#syntax=bergkvist/includeimage
FROM alpine:3.12.0
INCLUDE rust:1.44-alpine3.12
INCLUDE python:3.8.3-alpine3.12
nixpkgs.dockerTools
If you want truly composable Docker builds, I recommend checking out dockerTools in nixpkgs. This will also result in more reproducible (and typically very small) images. See https://nix.dev/tutorials/building-and-running-docker-images
docker load < $(nix-build docker-image.nix)
# docker-image.nix
let
  pkgs = import <nixpkgs> {};
  python = pkgs.python38;
  rustc = pkgs.rustc;
in pkgs.dockerTools.buildImage {
  name = "myimage";
  tag = "latest";
  contents = [ python rustc ];
}
Docker doesn't merge images, but nothing stops you from combining the Dockerfiles, if available, and rolling them into one fat image that you then build. There are times where this makes sense; however, most Docker dogma points to running multiple processes in one container as less desirable, especially with a microservice architecture (but rules are there to be broken, right?).
You cannot combine Docker images into one container. See the detailed discussion in the Moby issue "How do I combine several images into one via Dockerfile".
For your case, it is better to not include the whole Cassandra and Kafka images. The application would only need the Cassandra Scala driver and Kafka Scala driver. The container should include the drivers only.
I needed docker:latest and python:latest images for Gitlab CI. Here is what I came up with:
FROM ubuntu:latest
RUN apt update
RUN apt install -y sudo
RUN sudo apt install -y docker.io
RUN sudo apt install -y python3-pip
RUN sudo apt install -y python3
RUN docker --version
RUN pip3 --version
RUN python3 --version
After that, I built it and pushed it to my Docker Hub repo:
docker build -t docker-hub-repo/image-name:latest -f path/to/Dockerfile .
docker push docker-hub-repo/image-name:latest
Don't forget to run docker login before pushing.
Hope it helps

Docker: Best practices for installing dependencies - Dockerfile or ENTRYPOINT?

Being relatively new to Docker development, I've seen a few different ways that apps and dependencies are installed.
For example, in the official Wordpress image, the WP source is downloaded in the Dockerfile and extracted into /usr/src and then this is installed to /var/www/html in the entrypoint script.
Other images download and install the source in the Dockerfile, meaning the entrypoint just deals with config issues.
Either way the source scripts have to be updated if a new version of the source is available, so one way versus the other doesn't seem to make updating for a new version any more efficient.
What are the pros and cons of each approach? Is one recommended over the other for any specific sorts of setup?
Generally you should install application code and dependencies exclusively in the Dockerfile. The image entrypoint should never download or install anything.
This approach is simpler (you often don't need an ENTRYPOINT line at all) and more reproducible. You might run across some setups that run commands like npm install in their entrypoint script; this work will be repeated every time the container runs, and the container won't start up if the network is unreachable. Installing dependencies in the Dockerfile only happens once (and generally can be cached across image rebuilds) and makes the image self-contained.
The Docker Hub wordpress image is unusual in that the underlying Wordpress libraries, the custom PHP application, and the application data are all stored in the same directory tree, and it's typical to use a volume mount for that application tree. Its entrypoint script looks for a wp-includes/index.php file inside the application source tree, and if it's not there it copies it in. That's a particularly complex entrypoint script.
A generally useful pattern is to keep an application's data somewhere separate from the application source tree. If you're installing a framework, install it as a library using the host application's ordinary dependency system (for example, list it in a Node package.json file rather than trying to include it in a base image). This is good practice in general; in Docker it specifically lets you mount a volume on the data directory and not disturb the application.
For a typical Node application, for example, you might install the application and its dependencies in a Dockerfile, and not have an ENTRYPOINT declared at all:
FROM node:14
WORKDIR /app
# Install the dependencies
COPY package.json yarn.lock ./
RUN yarn install
# Install everything else
COPY . ./
# Point at some other data directory
RUN mkdir /data
ENV DATA_DIR=/data
# Application code can look at process.env.DATA_DIR
# Usual application metadata
EXPOSE 3000
CMD yarn start
...and then run this with a volume mounted for the data directory, leaving the application code intact:
docker build -t my-image .
docker volume create my-data
docker run -p 3000:3000 -d -v my-data:/data my-image

Cache Cargo dependencies in a Docker volume

I'm building a Rust program in Docker (rust:1.33.0).
Every time code changes, it re-compiles (good), which also re-downloads all dependencies (bad).
I thought I could cache dependencies by adding VOLUME ["/usr/local/cargo"]. Edit: I've also tried moving this dir with CARGO_HOME, without luck.
I thought that making this a volume would persist the downloaded dependencies, which appear to be in this directory.
But it didn't work, they are still downloaded every time. Why?
Dockerfile
FROM rust:1.33.0
VOLUME ["/output", "/usr/local/cargo"]
RUN rustup default nightly-2019-01-29
COPY Cargo.toml .
COPY src/ ./src/
RUN ["cargo", "build", "-Z", "unstable-options", "--out-dir", "/output"]
Built with just: docker build .
Cargo.toml
[package]
name = "mwe"
version = "0.1.0"
[dependencies]
log = { version = "0.4.6" }
Code: just hello world
Output of second run after changing main.rs:
...
Step 4/6 : COPY Cargo.toml .
---> Using cache
---> 97f180cb6ce2
Step 5/6 : COPY src/ ./src/
---> 835be1ea0541
Step 6/6 : RUN ["cargo", "build", "-Z", "unstable-options", "--out-dir", "/output"]
---> Running in 551299a42907
Updating crates.io index
Downloading crates ...
Downloaded log v0.4.6
Downloaded cfg-if v0.1.6
Compiling cfg-if v0.1.6
Compiling log v0.4.6
Compiling mwe v0.1.0 (/)
Finished dev [unoptimized + debuginfo] target(s) in 17.43s
Removing intermediate container 551299a42907
---> e4626da13204
Successfully built e4626da13204
A volume inside the Dockerfile is counter-productive here. That would mount an anonymous volume at each build step, and again when you run the container. The volume during each build step is discarded after that step completes, which means you would need to download the entire contents again for any other step needing those dependencies.
The standard model for this is to copy your dependency specification, run the dependency download, copy your code, and then compile or run your code, in 4 separate steps. That lets docker cache the layers in an efficient manner. I'm not familiar with rust or cargo specifically, but I believe that would look like:
FROM rust:1.33.0
RUN rustup default nightly-2019-01-29
COPY Cargo.toml .
RUN cargo fetch # this should download dependencies
COPY src/ ./src/
RUN ["cargo", "build", "-Z", "unstable-options", "--out-dir", "/output"]
Another option is to turn on some experimental features with BuildKit (available in 18.09, released 2018-11-08) so that docker saves these dependencies in what is similar to a named volume for your build. The directory can be reused across builds, but never gets added to the image itself, making it useful for things like a download cache.
# syntax=docker/dockerfile:experimental
FROM rust:1.33.0
VOLUME ["/output", "/usr/local/cargo"]
RUN rustup default nightly-2019-01-29
COPY Cargo.toml .
COPY src/ ./src/
RUN --mount=type=cache,target=/root/.cargo \
    ["cargo", "build", "-Z", "unstable-options", "--out-dir", "/output"]
Note that the above assumes cargo is caching files in /root/.cargo. You'd need to verify this and adjust as appropriate. I also haven't mixed the mount syntax with a json exec syntax to know if that part works. You can read more about the BuildKit experimental features here: https://github.com/moby/buildkit/blob/master/frontend/dockerfile/docs/experimental.md
Turning on BuildKit from 18.09 and newer versions is as easy as export DOCKER_BUILDKIT=1 and then running your build from that shell.
I would say the nicer solution is to resort to a Docker multi-stage build, as pointed out here and there.
This way you can create a first image that builds both your application and your dependencies, then reuse only the dependency folder from the first image in the second one.
This is inspired by both your comment on @Jack Gore's answer and the two issue comments linked above.
FROM rust:1.33.0 as dependencies
WORKDIR /usr/src/app
COPY Cargo.toml .
RUN rustup default nightly-2019-01-29 && \
    mkdir -p src && \
    echo "fn main() {}" > src/main.rs && \
    cargo build -Z unstable-options --out-dir /output
FROM rust:1.33.0 as application
# Those are the lines instructing this image to reuse the files
# from the previous image that was aliased as "dependencies"
COPY --from=dependencies /usr/src/app/Cargo.toml .
COPY --from=dependencies /usr/local/cargo /usr/local/cargo
COPY src/ src/
VOLUME /output
RUN rustup default nightly-2019-01-29 && \
    cargo build -Z unstable-options --out-dir /output
PS: having only one run will reduce the number of layers you generate; more info here
Here's an overview of the possibilities. (Scroll down for my original answer.)
Add Cargo files, create fake main.rs/lib.rs, then compile dependencies. Afterwards remove the fake source and add the real ones. [Caches dependencies, but several fake files with workspaces].
Add Cargo files, create fake main.rs/lib.rs, then compile dependencies. Afterwards create a new layer with the dependencies and continue from there. [Similar to above].
Externally mount a volume for the cache dir. [Caches everything, relies on caller to pass --mount].
Use RUN --mount=type=cache,target=/the/path cargo build in the Dockerfile in new Docker versions. [Caches everything, seems like a good way, but currently too new to work for me. Executable not part of image. Edit: See here for a solution.]
Run sccache in another container or on the host, then connect to that during the build process. See this comment in Cargo issue 2644.
Use cargo-build-deps. [Might work for some, but does not support Cargo workspaces (in 2019)].
Wait for Cargo issue 2644. [There's willingness to add this to Cargo, but no concrete solution yet].
Using VOLUME ["/the/path"] in the Dockerfile does NOT work, this is per-layer (per command) only.
Note: one can set CARGO_HOME and ENV CARGO_TARGET_DIR in the Dockerfile to control where download cache and compiled output goes.
Also note: cargo fetch can at least cache downloading of dependencies, although not compiling.
Cargo workspaces suffer from having to manually add each Cargo file, and for some solutions, having to generate a dozen fake main.rs/lib.rs. For projects with a single Cargo file, the solutions work better.
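For workspaces, the placeholder sources mentioned above can be generated with a small shell helper instead of by hand. This is only a sketch: the function name stub_crate_sources is made up for illustration, and it assumes every member uses Cargo's default src/main.rs or src/lib.rs layout. It writes both files, so a member should build whether it is a binary or a library.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cargo or Docker): create placeholder
# sources so that `cargo build` can compile the dependencies before the
# real sources are copied into the image.
stub_crate_sources() {
    for crate_dir in "$@"; do
        mkdir -p "$crate_dir/src" || return 1
        # Never overwrite real sources that are already present.
        [ -e "$crate_dir/src/main.rs" ] || echo 'fn main() {}' > "$crate_dir/src/main.rs"
        [ -e "$crate_dir/src/lib.rs" ] || : > "$crate_dir/src/lib.rs"
    done
}

# Example: the workspace root plus two members named db and utils.
# stub_crate_sources . db utils
```

The same commands could be inlined in a RUN step that comes after the COPY of the Cargo.toml files and before the step that compiles the dependencies.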
I've got caching to work for my particular case by adding
ENV CARGO_HOME /code/dockerout/cargo
ENV CARGO_TARGET_DIR /code/dockerout/target
Where /code is the directory where I mount my code.
This is externally mounted, not from the Dockerfile.
EDIT1: I was confused why this worked, but @b.enoit.be and @BMitch cleared up that it's because volumes declared inside the Dockerfile only live for one layer (one command).
You do not need to use an explicit Docker volume to cache your dependencies. Docker will automatically cache the different "layers" of your image. Basically, each command in the Dockerfile corresponds to a layer of the image. The problem you are facing is based on how Docker image layer caching works.
The rules that Docker follows for image layer caching are listed in the official documentation:
Starting with a parent image that is already in the cache, the next
instruction is compared against all child images derived from that
base image to see if one of them was built using the exact same
instruction. If not, the cache is invalidated.
In most cases, simply comparing the instruction in the Dockerfile with
one of the child images is sufficient. However, certain instructions
require more examination and explanation.
For the ADD and COPY instructions, the contents of the file(s) in the
image are examined and a checksum is calculated for each file. The
last-modified and last-accessed times of the file(s) are not
considered in these checksums. During the cache lookup, the checksum
is compared against the checksum in the existing images. If anything
has changed in the file(s), such as the contents and metadata, then
the cache is invalidated.
Aside from the ADD and COPY commands, cache checking does not look at
the files in the container to determine a cache match. For example,
when processing a RUN apt-get -y update command the files updated in
the container are not examined to determine if a cache hit exists. In
that case just the command string itself is used to find a match.
Once the cache is invalidated, all subsequent Dockerfile commands
generate new images and the cache is not used.
So the problem is with the positioning of the command COPY src/ ./src/ in the Dockerfile. Whenever there is a change in one of your source files, the cache will be invalidated and all subsequent commands will not use the cache. Therefore your cargo build command will not use the Docker cache.
Solving your problem is as simple as reordering the commands in your Dockerfile, like this:
FROM rust:1.33.0
RUN rustup default nightly-2019-01-29
COPY Cargo.toml .
RUN ["cargo", "build", "-Z", "unstable-options", "--out-dir", "/output"]
COPY src/ ./src/
Doing it this way, your dependencies will only be reinstalled when there is a change in your Cargo.toml.
Hope this helps.
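To see which file actually invalidates a COPY or ADD layer, the checksum rule quoted above can be approximated on the host. A minimal sketch, assuming GNU coreutils (sha256sum) is available; context_manifest is a made-up name, not a Docker command. It is only an approximation: as the quoted rules note, Docker's own checksum also takes file metadata into account, and this helper does not read .dockerignore.

```shell
#!/bin/sh
# Hypothetical helper: print "<sha256>  <relative path>" for every regular
# file under the given directory, sorted by path. Modification times are
# not part of the output, mirroring the rule for ADD and COPY.
context_manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Example: save a manifest before each build, then compare them.
# context_manifest . > /tmp/manifest-1.txt
# ... run the build, wait for the next cache miss ...
# context_manifest . > /tmp/manifest-2.txt
# diff /tmp/manifest-1.txt /tmp/manifest-2.txt
```

If diff reports a file, its contents changed between the two builds. If the manifests are identical, the miss likely comes from somewhere else, for example a changed instruction or a different base image.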
With the integration of BuildKit into docker, if you are able to avail yourself of the superior BuildKit backend, it's now possible to mount a cache volume during a RUN command, and IMHO, this has become the best way to cache cargo builds. The cache volume retains the data that was written to it on previous runs.
To use BuildKit, you'll mount two cache volumes, one for the cargo dir, which caches external crate sources, and one for the target dir, which caches all of your built artifacts, including external crates and the project bins and libs.
If your base image is rust, $CARGO_HOME is set to /usr/local/cargo, so your command looks like this:
RUN --mount=type=cache,target=/usr/local/cargo,from=rust,source=/usr/local/cargo \
--mount=type=cache,target=target \
cargo build
If your base image is something else, you will need to change the /usr/local/cargo bit to whatever is the value of $CARGO_HOME, or else add an ENV CARGO_HOME=/usr/local/cargo line. As a side note, the clever thing would be to set literally target=$CARGO_HOME and let Docker do the expansion, but it doesn't seem to work right - expansion happens, but buildkit still doesn't persist the same volume across runs when you do this.
Other options for achieving Cargo build caching (including sccache and the cargo wharf project) are described in this github issue.
I figured out how to get this also working with cargo workspaces, using romac's fork of cargo-build-deps.
This example has myapp and two workspace members: utils and db.
FROM rust:nightly as rust
# Cache deps
WORKDIR /app
RUN sudo chown -R rust:rust .
RUN USER=root cargo new myapp
# Install cache-deps
RUN cargo install --git https://github.com/romac/cargo-build-deps.git
WORKDIR /app/myapp
RUN mkdir -p db/src/ utils/src/
# Copy the Cargo tomls
COPY myapp/Cargo.toml myapp/Cargo.lock ./
COPY myapp/db/Cargo.toml ./db/
COPY myapp/utils/Cargo.toml ./utils/
# Cache the deps
RUN cargo build-deps
# Copy the src folders
COPY myapp/src ./src/
COPY myapp/db/src ./db/src/
COPY myapp/utils/src/ ./utils/src/
# Build for debug
RUN cargo build
I'm sure you can adjust this code for use with a Dockerfile, but I wrote a dockerized drop-in replacement for cargo that you can save to a package and run as ./cargo build --release. This just works for (most) development (uses rust:latest), but isn't set up for CI or anything.
Usage: ./cargo build, ./cargo build --release, etc
It will use the current working directory and save the cache to ./.cargo. (You can ignore the entire directory in your version control and it doesn't need to exist beforehand.)
Create a file named cargo in your project's folder, run chmod +x ./cargo on it, and place the following code in it:
#!/bin/bash
# This is a drop-in replacement for `cargo`
# that runs in a Docker container as the current user
# on the latest Rust image
# and saves all generated files to `./.cargo/` and `./target/`.
#
# Be sure to make this file executable: `chmod +x ./cargo`
#
# # Examples
#
# - Running app: `./cargo run`
# - Building app: `./cargo build`
# - Building release: `./cargo build --release`
#
# # Installing globally
#
# To run `cargo` from anywhere,
# save this file to `/usr/local/bin`.
# You'll then be able to use `cargo`
# as if you had installed Rust globally.
sudo docker run \
--rm \
--user "$(id -u)":"$(id -g)" \
--mount type=bind,src="$PWD",dst=/usr/src/app \
--workdir /usr/src/app \
--env CARGO_HOME=/usr/src/app/.cargo \
rust:latest \
cargo "$@"

Is there a way to combine Docker images into 1 container?

I have a few Dockerfiles right now.
One is for Cassandra 3.5, and it is FROM cassandra:3.5
I also have a Dockerfile for Kafka, but it is quite a bit more complex. It is FROM java:openjdk-8-jre and it runs a long command to install Kafka and Zookeeper.
Finally, I have an application written in Scala that uses SBT.
For that Dockerfile, it is FROM broadinstitute/scala-baseimage, which gets me Java 8, Scala 2.11.7, and SBT 0.13.9, which are what I need.
Perhaps, I don't understand how Docker works, but my Scala program has Cassandra and Kafka as dependencies and for development purposes, I want others to be able to simply clone my repo with the Dockerfile and then be able to build it with Cassandra, Kafka, Scala, Java and SBT all baked in so that they can just compile the source. I'm having a lot of issues with this though.
How do I combine these Dockerfiles? How do I simply make an environment with those things baked in?
You can, with the multi-stage builds feature introduced in Docker 17.05.
Take a look at this:
FROM golang:1.7.3
WORKDIR /go/src/github.com/alexellis/href-counter/
RUN go get -d -v golang.org/x/net/html
COPY app.go .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=0 /go/src/github.com/alexellis/href-counter/app .
CMD ["./app"]
Then build the image normally:
docker build -t alexellis2/href-counter:latest .
From: https://docs.docker.com/develop/develop-images/multistage-build/
The end result is the same tiny production image as before, with a significant reduction in complexity. You don’t need to create any intermediate images and you don’t need to extract any artifacts to your local system at all.
How does it work? The second FROM instruction starts a new build stage with the alpine:latest image as its base. The COPY --from=0 line copies just the built artifact from the previous stage into this new stage. The Go SDK and any intermediate artifacts are left behind, and not saved in the final image.
You can't combine dockerfiles as conflicts may occur. What you want to do is to create a new dockerfile or build a custom image.
TL;DR;
If your current development container contains all the tools you need and works, then save it as an image, upload it to a repo, and create a dockerfile that pulls that image from the repo.
Details:
Building a custom image is by far easier than writing a dockerfile on top of a public image, as you can store whatever hacks and mods you like in the image. To do so, start a blank container from a basic Linux image (or broadinstitute/scala-baseimage), install whatever tools you need and configure them until everything works correctly, then save it (the container) as an image. Create a new container off this image and test whether you can build your code on top of it via docker-compose (or however you want to build it). If it works, then you have a working base image that you can upload to a repo so others can pull it.
To build a dockerfile with a public image, you will need to put all hacks, mods and setup on the dockerfile itself. That is, you will need to place every command line that you used into a text file and reduce whatever hacks, mods and setup into command lines. At the end, your dockerfile will create an image automatically and you don't need to store this image into a repo and all you need to do is to give others the dockerfile and they can spin the image up at their own docker.
Note that once you have a working dockerfile, you can tweak it easily, as it creates a new image every time you build from it. With a custom image, you may run into issues where you need to rebuild the image due to conflicts. For example, all of your tools work with openjdk until you install one that doesn't. The fix may involve uninstalling openjdk and using the Oracle one, but that can break all the configuration you did for the tools you had already installed.
The following answer applies to Docker 17.05 and above:
I would prefer to use FROM image AS NAME together with --from=NAME.
Why?
You can also use --from=0 and so on, but that can get a little hard to manage when the dockerfile has many stages.
sample example:
FROM golang:1.7.3 as backend
WORKDIR /backend
RUN go get -d -v golang.org/x/net/html
COPY app.go .
RUN #install some stuff, compile assets....
FROM golang:1.7.3 as assets
WORKDIR /assets
RUN ./getassets.sh
FROM node:latest as frontend
RUN npm install
WORKDIR /assets
COPY --from=assets /assets .
CMD ["./app"]
FROM alpine:latest as mergedassets
WORKDIR /root/
COPY --from=frontend . /
COPY --from=backend ./backend .
CMD ["./app"]
Note: Structuring a dockerfile properly helps images build much faster. Internally, Docker uses layer caching to speed this up in case the image has to be rebuilt.
Yes, you can roll a whole lot of software into a single Docker image (GitLab does this, with one image that includes Postgres and everything else), but generalhenry is right - that's not the typical way to use Docker.
As you say, Cassandra and Kafka are dependencies for your Scala app, they're not part of the app, so they don't all belong in the same image.
Having to orchestrate many containers with Docker Compose adds an extra admin layer, but it gives you much more flexibility:
your containers can have different lifespans, so when you have a new version of your app to deploy, you only need to run a new app container, you can leave the dependencies running;
you can use the same app image in any environment, using different configurations for your dependencies - e.g. in dev you can run a basic Kafka container and in prod have it clustered on many nodes, your app container is the same;
your dependencies can be used by other apps too - so multiple consumers can run in different containers and all work with the same Kafka and Cassandra containers;
plus all the scalability, logging etc. already mentioned.
When might you want to "combine" Docker images?
As others are pointing out here, you typically don't want to put your database and your application into the same Docker image. Ideally you want a Docker image to wrap a "single process"/"runtime". This allows each process to be scaled up/down and restarted individually.
Let's say you want to use some shared C-libraries/executables that are not available in the package manager of the image you are using, but someone else has created an image where they are precompiled - and you might not want to recompile these binaries as part of your build (depending on how long this takes). Is there a way to quickly create a POC-Docker image containing all of these executables/libraries based on the existing images?
Docker and Composition
Relevant discussion: https://github.com/moby/moby/issues/3378
What Docker lacks is a good way of composing images. You can copy individual files or entire file systems from other images into your own using COPY --from=<image> <from-path> <to-path>. There is no builtin way of copying the environment variables from another image into your own.
That said, I have personally created a custom frontend/parser for Dockerfiles that adds an INCLUDE <image>-keyword. This copies the entire filesystem, along with the environment variables into your image:
DOCKER_BUILDKIT=1 docker build -t myimage .
#syntax=bergkvist/includeimage
FROM alpine:3.12.0
INCLUDE rust:1.44-alpine3.12
INCLUDE python:3.8.3-alpine3.12
nixpkgs.dockerTools
If you want truly composable Docker builds, I recommend checking out dockerTools in nixpkgs. This will also result in more reproducible (and typically very small) images. See https://nix.dev/tutorials/building-and-running-docker-images
docker load < $(nix-build docker-image.nix)
# docker-image.nix
let
pkgs = import <nixpkgs> {};
python = pkgs.python38;
rustc = pkgs.rustc;
in pkgs.dockerTools.buildImage {
name = "myimage";
tag = "latest";
contents = [ python rustc ];
}
Docker doesn't merge images, but nothing stops you from combining the dockerfiles, if available, and rolling them into one fat image that you then need to build. There are times where this makes sense; however, as with running multiple processes in a container, most Docker dogma will point to this as less desirable, especially with a microservice architecture (however, rules are there to be broken, right?)
You cannot combine Docker images into 1 container. See the detailed discussion in the Moby issue, How do I combine several images into one via Dockerfile.
For your case, it is better to not include the whole Cassandra and Kafka images. The application would only need the Cassandra Scala driver and Kafka Scala driver. The container should include the drivers only.
I needed docker:latest and python:latest images for Gitlab CI. Here is what I came up with:
FROM ubuntu:latest
RUN apt update
RUN apt install -y sudo
RUN sudo apt install -y docker.io
RUN sudo apt install -y python3-pip
RUN sudo apt install -y python3
RUN docker --version
RUN pip3 --version
RUN python3 --version
After that I built and pushed it to my Docker Hub repo:
docker build -t docker-hub-repo/image-name:latest -f path/to/Dockerfile .
docker push docker-hub-repo/image-name:latest
Don't forget to docker login before push
Hope it helps

Resources