Re-running Docker only until a certain step using caches? - docker

Is there any option to force Docker to run a build without using caches from that step on?
A particular usecase is something like this:
...
ADD some.cfg some.cfg
RUN do something with some.cfg
While working on the Dockerfile it is often necessary to adjust configurations and test them.
From Docker's point of view the steps remain unmodified. However, as the Dockerfile
writer, I want Docker to run the build using caches up to the ADD operation and from that point on
without caches. Is this possible?

As Mykola suggests, Docker will take a checksum of the files and should invalidate the cache if the content changes. However, you can always force cache invalidation at a given line in a Dockerfile by setting or changing an environment variable at that point e.g:
...
ENV updated-adds-on 14-DEC-14
ADD...
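
A variant of the same idea that avoids editing the Dockerfile by hand is to declare a build argument just before the step and pass a fresh value on the command line. This is only a sketch and not part of the original answer: it swaps ENV for ARG, the names CACHE_BUST and myimage are hypothetical, and it relies on Docker's documented behaviour that a changed ARG value causes a cache miss on the RUN instructions that follow it.

```shell
# Dockerfile fragment this sketch assumes (CACHE_BUST is a hypothetical name):
#
#   ...
#   ARG CACHE_BUST=unset
#   ADD some.cfg some.cfg
#   RUN do something with some.cfg

# Print a --build-arg flag whose value is the current Unix timestamp,
# so it differs from the value used in any earlier build.
cache_bust_flag() {
    printf '%s\n' "--build-arg CACHE_BUST=$(date +%s)"
}

# Usage (needs a Docker daemon, so it is left commented out;
# the tag myimage is a hypothetical example):
#   docker build $(cache_bust_flag) -t myimage .
```

Steps before the ARG line should still come from the cache; only the instructions after it are expected to re-run.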

According to this, you don't have to disable the cache when the added file changes, as Docker will examine the contents and skip the cache on change. Otherwise
you can use the --no-cache=true option on the docker build command.

Related

Is there a way to force a particular line in my Dockerfile to always rebuild, whilst still benefiting from caching on the preceding layers? [duplicate]

I frequently seem to have to write Dockerfiles like this (line numbers added for clarity):
1. FROM somebase
2. RUN cp /some/local/stuff /some/docker/container/path
3. RUN some-other-local-commands
4. RUN wget http://some.remote.server/some.remote.path.for.example.json
5. RUN some-other-local-commands-which-may-depend-on-the-json
On line (4), I'm fetching a remote resource. Let's assume for now that's a JSON file. It might change from time-to-time, maybe not on every build, but perhaps every few hours or days.
What this means is that every time I build my container, I want to ensure the freshest JSON file is fetched. One way to force this is to add the --no-cache parameter to my docker build command, but this forces all of the lines/layers to rebuild, including (1)-(3), where that is likely not necessary. Is there a pattern or technique to automatically 'taint' or 'mark' line (4) so that Docker knows it always has to re-run the wget (presumably this would also have to force a rebuild of line 5), whilst still getting the layer caching behaviour for lines (1)-(3) when Docker detects the pre-req files haven't changed?
If the specific thing you want to trigger rebuilds on is the result of RUN wget ... for a specific URL, Docker does actually have native support for this.
There are two similar commands to copy files into a container. COPY only copies files from the build context. ADD can also fetch external URLs and unpack local archives (but not both at the same time). The general recommendation is to use COPY, unless you need one of the specific things ADD does differently.
So you should be able to say
ADD http://some.remote.server/some.remote.path.for.example.json .
RUN some-other-local-commands-which-may-depend-on-the-json
and the RUN command will use the Docker layer cache based on the contents of the fetched file.
If this approach doesn't work for you (maybe you need special authentication to fetch the file) you can also fetch the file outside of Docker before you run docker build, and then COPY it in. Again, it will work like any other file you COPY in, and layer caching will take effect based on whether the file has changed or not.

How can I replace a set of ENV commands in Dockerfiles with placing them to some other file, which can be reused?

I have two Dockerfiles (maybe will have more) with the list of environment variables, same for the both files. Let's say:
ENV VAR1="value1"
ENV VAR2="value2"
ENV VAR3="value3"
Can I somehow move this setup to a file, which can be used in all the Dockerfiles, where it's required?
I want to remove duplicates and have a common place for setting those variables.
You can split these into a custom base image. That image would look like
# or whatever else you're using
FROM ubuntu:18.04
ENV VAR1="value1"
ENV VAR2="value2"
ENV VAR3="value3"
# and that's all
You would have to manually build this in most situations
docker build -t my/env-base -f Dockerfile.env .
and then you can refer to it in the downstream Dockerfiles
FROM my/env-base
# the rest of the Dockerfile commands as normal
Tooling like Docker Compose won't really be aware of this image layering. There's no good way to list a base image that needs to be built as a dependency of other things, but shouldn't run a container on its own. If you do change these values you'll have to manually rebuild the base image, then rebuild the application images.
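
Since the rebuild order has to be enforced by hand, one option is to wrap both builds in a small script. The following is a minimal sketch under that assumption: the application tag my/app is hypothetical, and the second build is assumed to use a Dockerfile that starts with FROM my/env-base.

```shell
# Rebuild the shared base image first, then an application image that
# starts with "FROM my/env-base". The tag my/app is a hypothetical example.
# The second build only runs if the first one succeeds.
rebuild_images() {
    docker build -t my/env-base -f Dockerfile.env . &&
    docker build -t my/app .
}

# Usage (needs a Docker daemon, so it is left commented out):
#   rebuild_images
```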
You should also consider whether you need all of these environment variables. In other SO questions I see variables used for filesystem paths (which can be fixed in an isolated Docker image), usernames (not a Docker concept really), credentials (keep far away from the image, it's really easy to get them back out), versions, and URLs. You might be able to get away with using fixed values for these (use /app rather than $INSTALL_PATH), or have a sensible default in your application code.

Docker build not using cache when copying Gemfile while using --cache-from

On my local machine, I have built the latest image, and running another docker build uses cache everywhere it should.
Then I upload the image to the registry as latest, and on my CI server I pull the latest image of my app to use it as the build cache for the new version:
docker pull $CONTAINER_IMAGE:latest
docker build --cache-from $CONTAINER_IMAGE:latest \
--tag $CONTAINER_IMAGE:$CI_COMMIT_SHORT_SHA \
.
From the build output we can see that the COPY of the Gemfile is not using the cache from the latest image, even though I haven't updated that file:
Step 15/22 : RUN gem install bundler -v 1.17.3 && ln -s /usr/local/lib/ruby/gems/2.2.0/gems/bundler-1.16.0 /usr/local/lib/ruby/gems/2.2.0/gems/bundler-1.16.1
---> Using cache
---> 47a9ad7747c6
Step 16/22 : ENV BUNDLE_GEMFILE=$APP_HOME/Gemfile BUNDLE_JOBS=8
---> Using cache
---> 1124ad337b98
Step 17/22 : WORKDIR $APP_HOME
---> Using cache
---> 9cd742111641
Step 18/22 : COPY Gemfile $APP_HOME/
---> f7ff0ee82ba2
Step 19/22 : COPY Gemfile.lock $APP_HOME/
---> c963b4c4617f
Step 20/22 : RUN bundle install
---> Running in 3d2cdf999972
Side note: it works perfectly on my local machine.
The "Leverage build cache" section of the Docker documentation doesn't seem to explain the behaviour here: neither the Dockerfile nor the Gemfile has changed, so the cache should be used.
What could prevent Docker from using the cache for the Gemfile?
Update
I tried to copy the files setting the right permissions using COPY --chown=user:group source dest but it still doesn't use the cache.
Opened Docker forum topic: https://forums.docker.com/t/docker-build-not-using-cache-when-copying-gemfile-while-using-cache-from/69186
Let me share with you some information that helped me to fix some issues with Docker build and --cache-from, while optimizing a CI build.
I struggled for several days because I didn't have the correct understanding; I was relying on incorrect explanations found on the web.
So I'm sharing this here hoping it will be useful to you.
When providing multiple --cache-from, the order matters
The order is very important, because at the first match, Docker will stop looking for other matches and it will use that one for all the rest of the commands.
This is explained by the person who implemented the feature in the GitHub PR:
When using multiple --cache-from they are checked for a cache hit in the order that user specified. If one of the images produces a cache hit for a command only that image is used for the rest of the build.
There is also a lengthier explanation in the initial ticket proposal:
Specifying multiple --cache-from images is bit problematic. If both images match there is no way(without doing multiple passes) to figure out what image to use. So we pick the first one(let user control the priority) but that may not be the longest chain we could have matched in the end. If we allow matching against one image for some commands and later switch to a different image that had a longer chain we risk in leaking some information between images as we only validate history and layers for cache. Currently I left it so that if we get a match we only use this target image for rest of the commands.
Using --cache-from is exclusive: the local Docker cache won't be used
This means that it doesn't add new caching sources, the image tags you provide will be the only caching source for the Docker build.
Even if you just built the same image locally, the next time you run docker build for it, in order to benefit from the cache, you need to either:
provide the correct tag(s) with --cache-from (and with the correct precedence); or
not use --cache-from at all (so that it will use the local build cache)
If the parent image changes, the cache will be invalidated
For example, if you have an image based on docker:stable, and docker:stable gets updated, the cached builds of your image will not be valid anymore as the layers of the base image were changed.
This is why, if you're configuring a CI build, it can be useful to docker pull the base image as well and include it in the --cache-from, as mentioned in this comment in yet another GitHub discussion.
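
To make the priority order explicit in a CI script, the --cache-from flags can be generated from a list of images. This is a rough sketch rather than a definitive recipe: the helper name cache_from_flags and the BASE_IMAGE variable are hypothetical, and putting the application image ahead of the base image is a judgment call.

```shell
# Print one --cache-from flag per image, in the order given.
# Per the explanation quoted above, images are checked in this order.
cache_from_flags() {
    for image in "$@"; do
        printf '%s\n' "--cache-from $image"
    done
}

# Usage (needs a Docker daemon and registry access, so it is left
# commented out; the BASE_IMAGE value is a hypothetical example):
#   BASE_IMAGE=docker:stable
#   docker pull "$BASE_IMAGE"
#   docker pull "$CONTAINER_IMAGE:latest"
#   docker build $(cache_from_flags "$CONTAINER_IMAGE:latest" "$BASE_IMAGE") \
#       --tag "$CONTAINER_IMAGE:$CI_COMMIT_SHORT_SHA" .
```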
I struggled with this problem, and in my case I used COPY when the checksum might have changed (but only technically; the content was functionally identical). So I worked around it this way:
Dockerfile:
ARG builder_image=base-builder
# Compilation/build stage
FROM golang:1.16 AS base-builder
RUN echo "build the app" > /go/app
# This step is required to facilitate docker cache. With the definition of a `builder_image` build tag
# we can essentially skip the build stage and use a prebuilt-image directly.
FROM $builder_image AS builder
# myapp docker image
FROM ubuntu:20.04 AS myapp
COPY --from=builder /go/app /opt/my-app/bin/
Then, I can run the following:
# build cache
DOCKER_BUILDKIT=1 docker build --target base-builder -t myapp-builder .
docker push myapp-builder
# use cache
DOCKER_BUILDKIT=1 docker build --target myapp --build-arg=builder_image=myapp-builder -t myapp .
docker push myapp
This way we can force Docker to use a prebuilt image as a cache.
This is for anyone fighting with DockerHub automated builds and --cache-from. I realized that images built by DockerHub always cause a cache bust on COPY commands when pulled and used as a build cache source. This also seems to be the case for @Marcelo (see his comment).
I investigated by creating a very simple image that runs a couple of RUN commands followed by a COPY. Everything uses the cache except the COPY, even though the content and permissions of the copied file are the same in both the pulled image and the one built locally (verified via sha1sum and ls -l).
The solution for me was to publish the image to the registry from CI (Travis in my case) rather than letting the DockerHub automated build do it. Let me emphasize that I'm talking about a specific case where the files are definitely the same and should not bust the cache, but you're using DockerHub automated builds.
I'm not sure why that is, but I know that old docker-engine versions, e.g. prior to 1.8.0, didn't ignore file timestamps when deciding whether to use the cache; see https://docs.docker.com/release-notes/docker-engine/#180-2015-08-11 and https://github.com/moby/moby/pull/12031.
For a COPY command to be cached, the checksum of the source being copied needs to be identical. You can compare the checksum in the docker history output between the cache image and the one you just built. Most importantly, the checksum includes metadata like the file owner and file permissions, in addition to the file contents. Whitespace changes inside a file, such as converting line endings between Linux and Windows styles, will also affect this. If you download the code from a repo, it's likely that the metadata, like the owner, will be different from the cached value.
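
A quick way to look for such differences outside Docker is to compare two copies of the file on content, permission bits, and owner. The helper below is only an approximation of what Docker hashes, not its actual algorithm; the function name and the paths in the usage note are hypothetical, and it assumes GNU coreutils (sha256sum and stat -c).

```shell
# Succeeds only if two files match in content, permission bits, and
# numeric owner/group. Assumes GNU coreutils (sha256sum, stat -c).
same_for_cache() {
    [ "$(sha256sum < "$1")" = "$(sha256sum < "$2")" ] &&
    [ "$(stat -c '%a %u %g' "$1")" = "$(stat -c '%a %u %g' "$2")" ]
}

# Usage: compare a file in the CI checkout with the copy used when the
# cached image was built (both paths are hypothetical):
#   same_for_cache ./Gemfile /tmp/previous-checkout/Gemfile \
#       || echo "Gemfile differs; COPY may miss the cache"
```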

Check if docker used cache for build or not

Is there any way to check whether Docker used the cache on every step of a Docker build?
The return value is 0 for a successful build, which says nothing about whether the steps were performed using the cache or not.
I'm executing docker commands in bash script running in circleci environment and I'd like to skip Docker save, if every build step ran through cache.
Thanks for the answer.
I suspect the easiest way is to compare the image ID before and after the build: if it hasn't changed, the cache must have been used.
One interesting thing about the cache is that once one command invalidates it, all following commands skip it. From the docs:
Once the cache is invalidated, all subsequent Dockerfile commands will generate new images and the cache will not be used.
This means that if your last step is cached, all other steps before it were cached too - and your image has not changed.
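
The image ID comparison could be scripted roughly as follows. This is a sketch under assumptions: the helper names, the tag myapp:latest, and the output file myapp.tar are hypothetical; docker image inspect and docker save are the only Docker commands it relies on.

```shell
# Print the ID of a local image, or nothing if the tag does not exist yet.
image_id() {
    docker image inspect --format '{{.Id}}' "$1" 2>/dev/null || true
}

# Succeeds when the ID after the build differs from the ID before it,
# i.e. when at least one step was not served from the cache.
image_changed() {
    [ "$1" != "$2" ]
}

# Usage (needs a Docker daemon, so it is left commented out;
# the tag and output file are hypothetical):
#   before=$(image_id myapp:latest)
#   docker build -t myapp:latest .
#   after=$(image_id myapp:latest)
#   if image_changed "$before" "$after"; then
#       docker save -o myapp.tar myapp:latest
#   fi
```

On the very first build the "before" ID is empty, so the image counts as changed and is saved.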

Creating a Dockerfile - docker starts from scratch on each new build

I am trying to build a dockerfile - iteratively adding lines and testing. My understanding was that docker will cache the lines that have already been built and start from the new lines I had added. The case seems to be that it just builds from scratch each time I call build on my container. Is this normal? If not - what am I doing wrong?
As demas said, if you're simply appending lines, the previous lines will be cached.
However, if anywhere in your Dockerfile you have a line like
ADD . /some/path
then Docker will treat that line as changed whenever anything in the added folder has changed, even if it was only the Dockerfile itself. That line and everything after it will be rebuilt unless nothing in the folder you're adding has changed.
You should be able to see whether this is happening by paying close attention to the output of the docker build command.
As a side note: a consequence of this is that if you're building a Dockerfile, you generally want to add the files in the directory as late as possible, doing any preparations beforehand. Of course, you will end up having to do things to your files (like some kind of build process) which is unfortunately hard to cache.
If I understood correctly, you can look at Docker as a version control system where each line in your Dockerfile is a commit to the container.
If you add a new line to the end of your Dockerfile, Docker gets the last revision of the container and makes a new commit. If you add a line in the middle of your Dockerfile, Docker gets one of the previous revisions and makes a new commit on that part of the tree.

Resources