Manage apt-get based dependency's on Docker - docker

We have a large code based on c++ in my company and we are trying to move to a microservice infrastructure based on docker.
We have a couple of library in house that help us with things like helper functions and utility that we regularly use in our code. Our idea was to create a base image for developers with this library's already installed and make it available to use it as our "base" image. This will give us the benefit of all our software always using the latest version of our own library's.
My questions is related to the cache system of Docker in relation with CI and external dependencys. Lets say we have a Docker file like this:
FROM ubuntu:latest
# Install External dependencys
RUN apt update && apt install -y\
boost-libs \
etc...
# Copy our software
...
# Build it
...
# Install it
...
If our code changes we can trigger the CI and docker will understand that it can use the cached image that was created before up to the point were it copy's our software. What happens if one of our external dependency's offers a newer version? Will the cached be automatically be invalidated? How can we trigger a CI build in case any of our packages receives a new version?
In essence how do we make sure we are always using the latest packages available for our external dependency's?
Please keep in mind the above Dockerfile is just an example to illustrate we are trying to use other tricks in the playbook such us using a lighter base image (not Ubuntu) and multistage builds to avoid dev-packages in our production containers.

Docker's caching algorithm is fairly simple. It looks at the previous state of the image build, and the string of the command you are running. If you are performing a COPY or ADD, it also looks at a hash of those files being copied. If a previous build is found on the server with the same previous state and command being run, it will reuse the cache.
That means an external change, e.g. pulling packages from an external repository, will not be detected and the cache will be reused instead of rerunning that line. There are two solutions that I've seen to this:
Option 1: Change your command by adding versions to the dependencies. When one of those dependencies changes, you'll need to update your build. This is added work, but also guarantees that you only pull in the versions that you are ready for. That would look like (fixing boost-libs to a 1.5 version number):
# Install External dependencys
RUN apt update && apt install -y\
boost-libs=1.5 \
etc...
Option 2: change a build arg. These are injected as environment variables into the RUN commands and docker sees a change in the environment as a different command to run. That would look like:
# Install External dependencys
ARG UNIQUE_VAR
RUN apt update && apt install -y\
boost-libs \
etc...
And then you could build the above with the following to trigger the cache to be recreated daily on that line:
docker build --build-arg "UNIQUE_VAR=$(date +%Y%m%d)" ...
Note, there's also the option to build without the cache any time you wish with:
docker build --no-cache ...
This would cause the cache to be ignored for all steps (except the from line).

Related

docker-compose ignores ubuntu:latest in Dockerfile [duplicate]

I want to build a docker image for the Linkurious project on github, which requires both the Neo4j database, and Node.js to run.
My first approach was to declare a base image for my image, containing Neo4j. The reference docs do not define "base image" in any helpful manner:
Base image:
An image that has no parent is a base image
from which I read that I may only have a base image if that image has no base image itself.
But what is a base image? Does it mean, if I declare neo4j/neo4j in a FROM directive, that when my image is run the neo database will automatically run and be available within the container on port 7474?
Reading the Docker reference I see:
FROM can appear multiple times within a single Dockerfile in order to create multiple images. Simply make a note of the last image ID output by the commit before each new FROM command.
Do I want to create multiple images? It would seem what I want is to have a single image that contains the contents of other images e.g. neo4j and node.js.
I've found no directive to declare dependencies in the reference manual. Are there no dependencies like in RPM where in order to run my image the calling context must first install the images it needs?
I'm confused...
As of May 2017, multiple FROMs can be used in a single Dockerfile.
See "Builder pattern vs. Multi-stage builds in Docker" (by Alex Ellis) and PR 31257 by Tõnis Tiigi.
The general syntax involves adding FROM additional times within your Dockerfile - whichever is the last FROM statement is the final base image. To copy artifacts and outputs from intermediate images use COPY --from=<base_image_number>.
FROM golang:1.7.3 as builder
WORKDIR /go/src/github.com/alexellis/href-counter/
RUN go get -d -v golang.org/x/net/html
COPY app.go .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /go/src/github.com/alexellis/href-counter/app .
CMD ["./app"]
The result would be two images, one for building, one with just the resulting app (much, much smaller)
REPOSITORY TAG IMAGE ID CREATED SIZE
multi latest bcbbf69a9b59 6 minutes ago 10.3MB
golang 1.7.3 ef15416724f6 4 months ago 672MB
what is a base image?
A set of files, plus EXPOSE'd ports, ENTRYPOINT and CMD.
You can add files and build a new image based on that base image, with a new Dockerfile starting with a FROM directive: the image mentioned after FROM is "the base image" for your new image.
does it mean that if I declare neo4j/neo4j in a FROM directive, that when my image is run the neo database will automatically run and be available within the container on port 7474?
Only if you don't overwrite CMD and ENTRYPOINT.
But the image in itself is enough: you would use a FROM neo4j/neo4j if you had to add files related to neo4j for your particular usage of neo4j.
Let me summarize my understanding of the question and the answer, hoping that it will be useful to others.
Question: Let’s say I have three images, apple, banana and orange. Can I have a Dockerfile that has FROM apple, FROM banana and FROM orange that will tell docker to magically merge all three applications into a single image (containing the three individual applications) which I could call smoothie?
Answer: No, you can't. If you do that, you will end up with four images, the three fruit images you pulled, plus the new image based on the last FROM image. If, for example, FROM orange was the last statement in the Dockerfile without anything added, the smoothie image would just be a clone of the orange image.
Why Are They Not Merged? I Really Want It
A typical docker image will contain almost everything the application needs to run (leaving out the kernel) which usually means that they’re built from a base image for their chosen operating system and a particular version or distribution.
Merging images successfully without considering all possible distributions, file systems, libraries and applications, is not something Docker, understandably, wants to do. Instead, developers are expected to embrace the microservices paradigm, running multiple containers that talk to each other as needed.
What’s the Alternative?
One possible use case for image merging would be to mix and match Linux distributions with our desired applications, for example, Ubuntu and Node.js. This is not the solution:
FROM ubuntu
FROM node
If we don’t want to stick with the Linux distribution chosen by our application image, we can start with our chosen distribution and use the package manager to install the applications instead, e.g.
FROM ubuntu
RUN apt-get update &&\
apt-get install package1 &&\
apt-get install package2
But you probably knew that already. Often times there isn’t a snap or package available in the chosen distribution, or it’s not the desired version, or it doesn't work well in a docker container out of the box, which was the motivation for wanting to use an image. I’m just confirming that, as far as I know, the only option is to do it the long way, if you really want to follow a monolithic approach.
In the case of Node.js for example, you might want to manually install the latest version, since apt provides an ancient one, and snap does not come with the Ubuntu image. For neo4j we might have to download the package and manually add it to the image, according to the documentation and the license.
One strategy, if size does not matter, is to start with the base image that would be hardest to install manually, and add the rest on top.
When To Use Multiple FROM Directives
There is also the option to use multiple FROM statements and manually copy stuff between build stages or into your final one. In other words, you can manually merge images, if you know what you're doing. As per the documentation:
Optionally a name can be given to a new build stage by adding AS name
to the FROM instruction. The name can be used in subsequent FROM and
COPY --from=<name> instructions to refer to the image built in this
stage.
Personally, I’d only be comfortable using this merge approach with my own images or by following documentation from the application vendor, but it’s there if you need it or you're just feeling lucky.
A better application of this approach though, would be when we actually do want to use a temporary container from a different image, for building or doing something and discard it after copying the desired output.
Example
I wanted a lean image with gpgv only, and based on this Unix & Linux answer, I installed the whole gpg with yum and then copied only the binaries required, to the final image:
FROM docker.io/photon:latest AS builder
RUN yum install gnupg -y
FROM docker.io/photon:latest
COPY --from=builder /usr/bin/gpgv /usr/bin/
COPY --from=builder /usr/lib/libgcrypt.so.20 /usr/lib/libgpg-error.so.0 /usr/lib/
The rest of the Dockerfile continues as usual.
The first answer is too complex, historic, and uninformative for my tastes.
It's actually rather simple. Docker provides for a functionality called multi-stage builds the basic idea here is to,
Free you from having to manually remove what you don't want, by forcing you to allowlist what you do want,
Free resources that would otherwise be taken up because of Docker's implementation.
Let's start with the first. Very often with something like Debian you'll see.
RUN apt-get update \
&& apt-get dist-upgrade \
&& apt-get install <whatever> \
&& apt-get clean
We can explain all of this in terms of the above. The above command is chained together so it represents a single change with no intermediate Images required. If it was written like this,
RUN apt-get update ;
RUN apt-get dist-upgrade;
RUN apt-get install <whatever>;
RUN apt-get clean;
It would result in 3 more temporary intermediate Images. Having it reduced to one image, there is one remaining problem: apt-get clean doesn't clean up artifacts used in the install. If a Debian maintainer includes in his install a script that modifies the system that modification will also be present in the final solution (see something like pepperflashplugin-nonfree for an example of that).
By using a multi-stage build you get all the benefits of a single changed action, but it will require you to manually allowlist and copy over files that were introduced in the temporary image using the COPY --from syntax documented here. Moreover, it's a great solution where there is no alternative (like an apt-get clean), and you would otherwise have lots of un-needed files in your final image.
See also
Multi-stage builds
COPY syntax
Here is probably one of the most fundamental use cases of using multiple FROMs, aka, multi stage builds.
I want want one dockerfile, and I want to change one word and depending on what I set that word to, I get different images depending on whether I want to run, Dev or Publish the application!
Run - I just want to run the app
Dev - I want to edit the code and run the app
Publish - Run the app in production
Lets suppose we're working in the dotnet environment. Heres one single Dockerfile. Without multi stage build, there would be multiple files (builder pattern)
#See https://aka.ms/containerfastmode to understand how Visual Studio uses this Dockerfile to build your images for faster debugging.
FROM mcr.microsoft.com/dotnet/runtime:5.0 AS base
WORKDIR /app
FROM mcr.microsoft.com/dotnet/sdk:5.0 AS build
WORKDIR /src
COPY ["ConsoleApp1/ConsoleApp1.csproj", "ConsoleApp1/"]
RUN dotnet restore "ConsoleApp1/ConsoleApp1.csproj"
COPY . .
WORKDIR "/src/ConsoleApp1"
RUN dotnet build "ConsoleApp1.csproj" -c Release -o /app/build
FROM build AS publish
RUN dotnet publish "ConsoleApp1.csproj" -c Release -o /app/publish
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "ConsoleApp1.dll"]
Want to run the app? Leave FROM base AS final as it currently is in the dockerfile above.
Want to dev the source code in the container? Change the same line to FROM build AS final
Want to release into prod? Change the same line to FROM publish AS final
I agree with the OP, that this feature is useful for docker! Here is a different view into the same problem:
If you had multiple FROMs (or a "FROM" and multiple "MERGE"'s, for example) then you can use the docker registry versioning system for the base docker image AND other container elements, and that is the win here: I have third party development tools which do not exist in .deb format, so these tools must be installed by un-taring a tball and is HUGE, so caching on the docker host will be important but versioning/change control of the image is equally important. I (think I) can simply use "RUN git ....", and docker will deal with the caching of the new layer for me, which is what I want; because another container will have the same base image but a different set of HUGE third party tools, so the caching of the base image and the tools image is really important (the 3rd party tools tar can be as big as the base image of say ubuntu so caching of these is really important too). The (suggested) feature just allows all these elements to be managed in a central repo. versioning system.
Said a different way, why do we use FROM at all? If I were to simply git clone an ubuntu image using the RUN command for my "base image/layer", this would create a new layer and docker would cache this anyway...so is there any difference/advantage in using FROM, other than it uses dockers internal versioning system/syntax?

Can i modify the docker image provided by playwright to add custom node version

I was utilizing Playwright to test my frontend application at work, however, we use node version 16.15.0 specifically. But, while looking at the docker file by Playwright I see that they install the latest node version which is causing issues when running in CircleCi.
Does anyone have any ideas for a workaround? Would I have to create a custom docker image using Playwright's image to tackle this and install the correct node version?
Any help would be appreciated!
https://github.com/microsoft/playwright/blob/main/utils/docker/Dockerfile.focal.
https://playwright.dev/docs/docker.
Yes, that would be the way to go. The best way to go about this would be to patch the Dockerfile.focal with an ARG instruction. You will then be able to pass values to this argument with your docker build command. This is the best approach that will make maintenance easier. Edit the Dockerfile.focal and add this variable as:
# leave this blank or specify a default value.
ARG NODE_VERSION=
Then in docker build you can set the value for this as. The docker build command in the script on the repo will change as follows:
docker build --platform "${PLATFORM}" -t "$3" -f "Dockerfile.$2" --build-arg NODE_VERSION=16.15.0 .
This will inject this variable into the image when it is being built so you can have the correct version. Also, this will make it easier to maintain since you will not have to change the Dockerfile every time you upgrade the version of NodeJS in your image.
Now, finally, you can edit the build.sh script to use the version variable. You can edit the line 13 in the script to something like:
apt-get install -y nodejs="{NODE_VERSION}" && \
You can use apt search nodejs after running the setup script to verify the correct version of the package.

Why are dependencies installed via an ENTRYPOINT and tini?

I have a question regarding an implementation of a Dockerfile on dask-docker.
FROM continuumio/miniconda3:4.8.2
RUN conda install --yes \
-c conda-forge \
python==3.8 \
[...]
&& rm -rf /opt/conda/pkgs
COPY prepare.sh /usr/bin/prepare.sh
RUN mkdir /opt/app
ENTRYPOINT ["tini", "-g", "--", "/usr/bin/prepare.sh"]
prepare.sh is just facilitating installation of additional packages via conda, pip and apt.
There are two things I don't get about that:
Why not just place those instructions in the Dockerfile? Possibly indirectly (modularized) by COPYing dedicated files (requirements.txt, environment.yaml, ...)
Why execute this via tini? At the end it does exec "$#" where one can start a scheduler or worker - that's more what I associate with tini.
This way everytime you run the container from the built image you have to repeat the installation process!?
Maybe I'm overthinking it but it seems rather unusual - but maybe that's a Dockerfile pattern with good reasons for it.
optional bonus questions for Dask insiders:
why copy prepare.sh to /usr/bin (instead of f.x. to /tmp)?
What purpose serves the created directory /opt/app?
It really depends on the nature and usage of the files being installed by the entry point script. In general, I like to break this down into a few categories:
Local files that are subject to frequent changes on the host system, and will be rolled into the final image for production release. This is for things like the source code for an application that is under development and needs to be tested in the container. You want these to be copied into the runtime every time the image is rebuilt. Use a COPY in the Dockerfile.
Files from other places that change frequently and/or are specific to the deployment environment. This is stuff like secrets from a Hashicorp vault, network settings, server configurations, etc.... that will probably be downloaded into the container all the time, even when it goes into production. The entry point script should download these, and it should decide which files to get and from where based on environment variables that are injected by the host.
libraries, executable programs (under /bin, /usr/local/bin, etc...), and things that specifically should not change except during a planned upgrade. Usually anything that is installed using pip, maven or some other program that does dependency management, and anything installed with apt-get or equivalent. These files should not be installed from the Dockerfile or from the entrypoint script. Much, much better is to build your base image with all of the dependencies already installed, and then use that image as the FROM source for further development. This has a number of advantages: it ensures a stable, centrally located starting platform that everyone can use for development and testing (it forces uniformity where it counts); it prevents you from hammering on the servers that host those libraries (constantly re-downloading all of those libraries from pypy.org is really bad form... someone has to pay for that bandwidth); it makes the build faster; and if you have a separate security team, this might help reduce the number of files they need to scan.
You are probably looking at #3, but I'm including all three since I think it's a helpful way to categorize things.

During build, on what basis does docker decide whether to create a new layer or use an existing layer?

The docker file has various command like FROM, RUN, etc. Each of this command creates a layer (intermediary image).
During build, assuming the layer already exists, on what basis does docker decide whether to create a new layer or use an existing layer?
The docker build caching system is pretty simple. For most commands, if the previous layer was cached, and there is a layer that runs the exact same command (RUN, ENV, CMD, ...) then it reuses the cached layer, and repeats this check for the next command. For COPY and ADD commands the decision is based on a hash of the file contents.
This is detailed in Best practices for writing Dockerfiles in the Docker documentation.
Practically, there are a couple of things this means:
You almost never need docker build --no-cache, since if the Dockerfile or any involved files have changed, the cache will be automatically invalidated.
If you have an expensive step to install dependencies (npm install, pip install, bundle install, ...) have a first step that only COPYs the files that list the dependencies, then RUN whatever install, then COPY the rest of your application. This avoids invalidating the cache for the "install" step if only application code has changed.
If you have a Debian- or Ubuntu-based image, RUN apt-get update && apt-get install in a single command. This avoids a problem where the URLs from the "update" step are cached, but the packages in the "install" step change, and the cached URLs are no longer valid.

Selecting different code branches when using a shared base image in Docker

I am containerising a codebase that serves multiple applications. I have created three images;
app-base:
FROM ubuntu
RUN apt-get install package
COPY ./app-code /code-dir
...
app-foo:
FROM app-base:latest
RUN foo-specific-setup.sh
and app-buzz which is very similar to app-foo.
This works currently, except I want to be able to build versions of app-foo and app-buzz for specific code branches and versions. It's easy to do that for app-base and tag appropriately, but app-foo and app-buzz can't dynamically select that tag, they are always pinned to app-base:latest.
Ultimately I want this build process automated by Jenkins. I could just dynamically re-write the Dockerfile, or not have three images and just have two nearly-but-not-quite identical Dockerfiles for each app that would need to be kept in sync manually (later increasing to 4 or 5). Each of those solutions has obvious drawbacks however.
I've seen lots of discussions in the past about things such as an INCLUDE statement, or dynamic tags. None seemed to come to anything.
Does anyone have a working, clean(ish) solution to this problem? As long as it means Dockerfile code can be shared across images, I'd be happy. If it also means that the shared layers of images don't need to be rebuilt for each app, then even better.
You could still use build args to do this.
Dockerfile:
FROM ubuntu
ARG APP_NAME
RUN echo $APP_NAME-specific-setup.sh >> /root/test
ENTRYPOINT cat /root/test
Build:
docker build --build-arg APP_NAME=foo -t foo .
Run:
$ docker run --rm foo
foo-specific-setup.sh
In your case you could run the correct script in the RUN using the argument you just set before. You would have one Dockerfile per app-base variant and run the correct set-up based on the build argument.
FROM ubuntu
RUN apt-get install package
COPY ./app-code /code-dir
ARG APP_NAME
RUN $APP_NAME-specific-setup.sh
Any layers before setting the ARG would not need to be rebuilt when creating other versions.
You can then push the built images to separate docker repositories for each app.
If your apps need different ENTRYPOINT instructions, you can have an APP_NAME-entrypoint.sh per app and rename it to entrypoint.sh within your APP_NAME-specific-setup.sh (or pass it through as an argument to run).

Resources