docker build cache busting with apt-get - docker

My understanding is that if the RUN command string itself does not change (i.e., the list of packages to be installed does not change), the Docker engine reuses the cached layer for that step. This is also my experience:
...
Step 2/6 : RUN apt update && DEBIAN_FRONTEND=noninteractive apt install -y curl git-all locales locales-all python3 python3-pip python3-venv libusb-1.0-0 gosu && rm -rf /var/lib/apt/lists/*
---> Using cache
---> 518e8ff74d4c
...
However, the official Dockerfile best practices document says this about apt-get:
Using RUN apt-get update && apt-get install -y ensures your Dockerfile installs the latest package versions with no further coding or manual intervention. This technique is known as “cache busting”.
This is true if I add a new package to the list, but not if I leave the list unchanged.
Is my understanding correct, or am I missing something here?
If yes, can I assume that I will only get newer packages from apt-get install if the Ubuntu base image has also been updated (which invalidates the whole cache)?

You cut off the quote in the middle. The rest of the quote included a very important condition:
You can also achieve cache-busting by specifying a package version. This is known as version pinning, for example:
RUN apt-get update && apt-get install -y \
package-bar \
package-baz \
package-foo=1.3.*
Therefore the command you run in their example would change each time the pinned version of the package in the list changes. Note that in addition to changing the command that is run, you can change the environment, which has the same effect, using a build arg as described in this answer.
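For illustration, a minimal sketch of that build-arg approach, assuming a made-up argument name CACHE_BUST (this is just an example of the mechanism, not code from the linked answer):
ARG CACHE_BUST=default
# The first RUN that uses CACHE_BUST gets a cache miss whenever its value changes,
# so apt-get runs again even though the command text is otherwise identical.
RUN echo "cache bust: ${CACHE_BUST}" && apt-get update && apt-get install -y curl git
Passing a fresh value, for example docker build --build-arg CACHE_BUST=$(date +%s) ., rebuilds that layer while earlier layers stay cached.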

You are right. The documentation is very poorly written. If you read further you can see what the author is trying to say:
The s3cmd argument specifies a version 1.1.*. If the image previously used an older version, specifying the new one causes a cache bust of apt-get update and ensures the installation of the new version.
It seems the author thinks 'cache busting' means changing the Dockerfile in a way that invalidates the cache. But the usual definition of cache busting is a mechanism that invalidates the cache even when the file itself is unchanged.

Related

Docker SSH-Key looking for a simple solution

I'm trying to copy my SSH keys into my Docker image; it's a very simple image that installs some Linux tools via the package manager. I'm asking because I can't come up with a simple solution: ADD/COPY don't seem to work, and using a Docker volume or Compose seems over the top. Please advise.
FROM fedora:latest
RUN echo 'diskspacecheck=0' >> /etc/dnf/dnf.conf
RUN dnf -y update
RUN dnf -y install sshuttle \
&& dnf -y install git \
&& dnf -y install curl \
&& dnf -y install vim-X11 \
&& dnf -y install the_silver_searcher
RUN dnf -y clean packages
RUN adduser -m -p '' bowler
USER bowler
ADD /home/a/.ssh/id_rsa /home/bowler/.ssh/id_rsa
I would not add the key there during the build; I would mount it when you run the container. You are most likely storing the Dockerfile and other files in version control, and then everyone could see your private key!
Therefore, adding the key when you start your container is probably the better/more secure option :)
-v ${HOME}/.ssh/id_rsa:/home/bowler/.ssh/id_rsa
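For completeness, a rough sketch of the full run command (the image name my-fedora-tools is a placeholder for whatever you tag the build as):
docker run -it -v ${HOME}/.ssh/id_rsa:/home/bowler/.ssh/id_rsa:ro my-fedora-tools
The :ro suffix mounts the key read-only inside the container.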
You can't copy files into a Docker image that live outside of the build context (the directory you pass to docker build). This is for security reasons. What you'll need to do is first copy your id_rsa file into the same directory as your Dockerfile, and then change the ADD to use the copy you just made, instead of trying to copy it from the absolute path that it is currently using.
I would also suggest changing the ADD to COPY, as it is easier to work with and has less unexpected behavior to trip over.
so at your command line:
cp ${HOME}/.ssh/id_rsa [path to dockerfile directory]
then update the dockerfile:
COPY id_rsa /home/bowler/.ssh/id_rsa
you might also need to add a RUN mkdir -p /home/bowler/.ssh
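Putting those pieces together, the tail of the Dockerfile might look roughly like this (a sketch based on the question's paths, not a tested drop-in):
RUN adduser -m -p '' bowler
# create the target directory and copy the key that now sits next to the Dockerfile
RUN mkdir -p /home/bowler/.ssh
COPY id_rsa /home/bowler/.ssh/id_rsa
RUN chown -R bowler:bowler /home/bowler/.ssh && chmod 600 /home/bowler/.ssh/id_rsa
USER bowler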
Update
Based on the comments, I think it's worth adding a disclaimer here that if you go this route, then the image that you create needs to be handled with the same security considerations as your private SSH key.
It is much better to inject authentication credentials like this at runtime. That can be done by setting environment variables as part of the command that is used to start the container. Or, if you have a secure secrets repository, it could be added there and then downloaded by the container when it starts (e.g., using cURL).
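As a rough illustration of the environment-variable approach just mentioned (the variable name SSH_PRIVATE_KEY and the image name my-image are invented for this example):
docker run -e SSH_PRIVATE_KEY="$(cat ${HOME}/.ssh/id_rsa)" my-image
The container's entrypoint can then write the variable out to ~/.ssh/id_rsa and chmod 600 it before anything tries to use SSH.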
The approach of installing SSH keys directly to the image is not completely unreasonable, but proceed with caution and keep in mind that there may be a cleaner alternative.

How does docker cache the layer?

I have a Docker image which runs the following command
RUN apt-get update --fix-missing && apt-get install -y --no-install-recommends build-essential debhelper rpm ruby ruby-dev sudo cmake make gcc g++ flex bison git libpcap-dev libssl-dev ninja-build openssh-client python-dev python3-pip swig zlib1g-dev python3-setuptools python3-requests wget curl unzip zip default-jdk && apt-get clean && rm -rf /var/lib/apt/lists/*
If I run it a couple of times in the same day, the layer appears to be cached. However, Docker thinks the layer has changed when I run the build for the first time each day.
I just wonder what's special about the above command that makes Docker think the layer has changed?
This is not caused by Docker. When Docker sees a RUN command, all it does is a simple string comparison to determine whether the layer is in the cache or not. If it finds it in the cache, it will reuse it; if not, it will run the command.
Since you have mentioned that builds use the cache all day and then stop using it the next day, the most likely explanation is that the cache has been invalidated/deleted in the meantime by someone/something.
I don't know how/where you are running the Docker daemon, but it may be the case that it is running in a VM that is recreated each day from a base image, which would destroy all the cache and force Docker to rebuild the image.
Another explanation is that you have some cleanup process running once a day, maybe some cron that deletes the cache.
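If you want to check for such a cleanup on the build host, a quick sketch (adjust to your setup; the grep pattern is only an example):
# show how much build cache currently exists on this daemon
docker system df
# look for a scheduled job that might be pruning it, e.g. docker builder prune -af or docker system prune -af
crontab -l | grep -i 'docker.*prune'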
Bottom line is that docker will happily reuse that cache for unlimited period of time, as long as the cache actually exists.
I am assuming that the previous layers (if there are any) were built from cache; otherwise, you should check whether COPY/ADD commands are busting the cache due to file changes in your build context.
It's not the command, it's the steps that occur before it. Specifically, if the files being copied to previous layers were modified. I can be more specific if you'll edit the post to show all the steps in the Dockerfile before this one.
According to the docker doc:
Aside from the ADD and COPY commands, cache checking does not look at the files in the container to determine a cache match. For example, when processing a RUN apt-get -y update command the files updated in the container are not examined to determine if a cache hit exists. In that case just the command string itself is used to find a match
For a RUN command, just the command string itself is used to find a match. So maybe some process deleted the cached layer, or maybe you changed your Dockerfile?

DockerFile one-line vs multi-line instruction [duplicate]

This question already has answers here:
Multiple RUN vs. single chained RUN in Dockerfile, which is better?
(4 answers)
Closed 2 years ago.
To my knowledge, the way docker build works is that it creates a separate image/layer for each instruction. However, it is very efficient at reusing layers and avoiding rebuilding them if nothing has changed.
So does it matter if I put below instruction either on same line or multi-line? For convenience, I would prefer the single line option unless it is not an efficient option.
Multi-Line Instruction
RUN apt-get -y update
RUN apt-get -y install ...
Single-Line Instruction
RUN apt-get -y update && apt-get -y install
In this specific case it is important to put apt-get update and apt-get install together. More broadly, fewer layers are considered "better", but it almost never makes a perceptible difference.
In practice I tend to group together "related" commands into the same RUN command. If I need to configure and install a package from source, that can get grouped together, and even if I change make arguments I don't mind re-running configure. If I need to configure and install three packages, they'd go into separate RUN lines.
The important difference in this specific apt-get example is around layer caching. Let's say your Dockerfile has
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install package-a
If you run docker build a second time, it will decide it's already run all three of these commands and the input hasn't changed, so it will run very quickly and you'll get an identical image out.
Now you come back a day or two later and realize you were missing something, so you change
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install package-a package-b
When you run docker build again, Docker decides it's already run apt-get update and can jump straight to the apt-get install line. In this specific case you'll have trouble: Debian and Ubuntu update their repositories fairly frequently, and when they do, the old versions of packages get deleted. So the package index from your apt-get update of two days ago points at package versions that no longer exist, and your build will fail.
You'll avoid this specific problem by always putting the two apt-get commands together in the same RUN line:
FROM ubuntu:18.04
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive \
apt-get install --assume-yes --no-install-recommends \
package-a \
package-b
I would use the single-line instruction. It's considered a best practice for Docker, since it minimizes the number of layers (RUN is one of the instructions that create layers).
As for using multiple instructions to collect dependencies, that is sometimes useful during development, if you're frequently changing the list of packages (or their versions). But for a production image, I would avoid it.

Update of root certificates on docker

If I understand correctly, on standard Ubuntu systems for example, root certificates are provided by ca-certificates package and get updated when the package itself is updated.
But how can the root certificates be updated when using Docker containers? Is there a common preferred way of doing this, or must the containers be redeployed with an up-to-date Docker image?
The containers must be redeployed with an up-to-date image.
The Docker Hub base images like ubuntu actually get updated fairly regularly, and if you look at the tag list you can see that there are several date-stamped variants of the images. So one approach that will get you pretty close to current is to always (have your CI system) pull the base image before you build.
docker pull ubuntu:18.04
docker build .
If you can't do that, or if you're working from some sort of derived image that updates less frequently, you can just manually run apt-get upgrade in your Dockerfile. Doing this in the same place you're otherwise installing packages makes sense. It needs to be in the same RUN line as a matching apt-get update, and you might need some way to force Docker to not cache that update line to get current updates.
FROM python:3.8-slim
# Have an option to force rebuilds; the RUN line won't be
# cacheable if the dependency_stamp option changes
ARG dependency_stamp
ENV dependency_stamp=${dependency_stamp:-unknown}
RUN touch /dependencies.${dependency_stamp}
# Update base OS packages and install other things we need
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive apt-get upgrade --assume-yes \
&& DEBIAN_FRONTEND=noninteractive apt-get install \
--no-install-recommends --assume-yes \
...
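With those ARG/ENV lines in place, it is passing a new value for the build arg that actually forces the upgrade layer to rebuild; for example (a date-based value is just one convenient choice):
docker build --build-arg dependency_stamp=$(date +%Y-%m-%d) .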
If you find yourself doing this routinely, it can be helpful to maintain your own base images that are upgraded to current packages but don't have anything else installed. If you go that route, you have more control over the process and can get smaller images if you build an image FROM ubuntu and install e.g. Python, rather than building an image FROM python and then installing updates over it.
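A minimal sketch of such a base image (the tag and package choices below are only examples, not a recommendation from the answer above):
FROM ubuntu:22.04
# bring the base OS packages up to date and add only what every downstream image needs
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive apt-get upgrade --assume-yes \
 && DEBIAN_FRONTEND=noninteractive apt-get install --assume-yes --no-install-recommends \
    python3 python3-pip ca-certificates \
 && rm -rf /var/lib/apt/lists/*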

ADD and COPY to merge the contents of a dir with the one already on the build

I am trying to assemble an image with some files that are stored along with my Dockerfile. What I expect with COPY or ADD is a merge behavior. However, the file hierarchy is overwritten if I use these instructions. As such, files from the parent image are no longer available if they were within the path (a huge issue if, for instance, I had saved some file under /etc/mycoolstuff/filehere).
When this happens I am not able to use apt-get any more.
I also saw some advice to convert my files to a tar package, but nothing different was observed.
For clarity, this was one of the issues (it builds if I change the order, but it ends up broken anyway because child images will not be able to use apt any longer):
ADD build/image-base.tgz /
RUN apt-get clean -y && \
apt-get update && \
apt-get install unzip -y --no-install-recommends && \
apt-get install gosu -y --no-install-recommends && \
apt-get clean -y && \
apt-get autoclean -y && \
apt-get autoremove -y && \
rm -rf /var/lib/apt/lists/*
Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/InRelease Temporary failure resolving 'archive.ubuntu.com'
Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-backports/InRelease Temporary failure resolving 'archive.ubuntu.com'
Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/InRelease Temporary failure resolving 'security.ubuntu.com'
Failed to fetch http://ppa.launchpad.net/webupd8team/java/ubuntu/dists/xenial/InRelease
Temporary failure resolving 'ppa.launchpad.net'
Some index files failed to download. They have been ignored, or old ones used instead.
After some digging I was able to find a workaround with COPY and a RUN to achieve the behavior I was looking for.
COPY build/image-base.tgz /
RUN tar -xvf image-base.tgz --no-overwrite-dir
Unfortunately this creates a new layer and the base tar will remain there. It is not a big issue if it is just some text config files.
A weird way to get around the COPY instruction is to use wget and untar afterwards, so we can delete the tar file. wget will do the trick if we are in a build environment where artifacts are stored and URLs are created for them; in that scenario we just need one RUN, keeping the house clean.
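A rough sketch of that wget-based variant, assuming wget is available in the base image and that the tarball is published at some internal artifact URL (the URL below is a placeholder):
RUN wget -O /tmp/image-base.tgz https://artifacts.example.com/image-base.tgz \
 && tar -xvf /tmp/image-base.tgz -C / --no-overwrite-dir \
 && rm /tmp/image-base.tgz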
Docker has had multi-stage builds since 17.05: Use multi-stage builds
FROM busybox as builder
ADD build/image-base.tgz /tmproot/
FROM alpine:latest
...
COPY --from=builder /tmproot /
...
Another example:
FROM busybox as builder
COPY src/etc/app /tmproot/etc/app
COPY src/app /tmproot/usr/local/app
FROM alpine:latest
...
COPY --from=builder /tmproot /
...
What are you trying to achieve? Do you really want to overwrite the root directory? Then you'd better build images from scratch. To do this, your Dockerfile must begin with FROM scratch
An official Docker tutorial on the subject, with scripts for multiple distros:
https://docs.docker.com/engine/userguide/eng-image/baseimages/
I'm using Azure DevOps Pipelines (windows agent) as well as Docker Desktop for Windows and find the merge happens automatically. Perhaps you should update your Docker version or try again to see if the merge is indeed happening as desired.
