"Caching" intermediate Docker build - docker

I'm learning to use Docker and I've come across a minor annoyance. Whenever I make a change to the Dockerfile, I run docker build -t tag ., which goes through the entire Dockerfile as it should. This takes a good 5-6 minutes because of the dependencies in my project. Sometimes a command I run causes an error, or there is a mistake in the Dockerfile. While the fix may take a couple of seconds, I have to rebuild the entire thing, which hurts my productivity. Is there a way to "continue from where the build last failed" after editing the Dockerfile? Thanks.

This is called the "build cache" and it is already a feature of Docker. Docker's builder will only use the cache up until the point where your Dockerfile has changed. There are some edge cases when using COPY or ADD directives that will cause the build cache to be invalidated (since it hashes files to determine if any have changed, and invalidates the cache if so). This means that if you are using COPY foo /foo and you have changed that file, the build cache will be invalidated. Also, if you do COPY . /opt/bar/ (meaning, you copy the entire directory to somewhere), even some small change like a Vim swap file or Dockerfile change will invalidate the cache!
You can disable the build cache entirely by passing --no-cache to your docker build command.
So basically, the cache is there and you're already using it; you're probably just changing the Dockerfile at a very early point, or hitting that lesser-known edge case with a COPY/ADD directive, and the builder is invalidating everything after that point. And to answer the follow-up before you ask it: once a change has invalidated the cache at some line, it is basically impossible to keep using the cache past that point.

Is there a way to "continue from where the build last failed" after editing the Dockerfile?
No (as L0j1k's answer explains well)
That is why the best practice is to organize your Dockerfile from the most stable commands (the ones you will never have to change) to the most volatile commands (the ones you might have to change quite a bit).
That way, your modifications only trigger a rebuild of the last few lines of your Dockerfile, instead of going through everything again because one of the first lines changed.
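For instance, a minimal sketch of this ordering, assuming a hypothetical Python project (requirements.txt and app.py are placeholders for your own files):
FROM python:3.12
WORKDIR /app
# dependency manifest changes rarely, so this expensive layer stays cached
COPY requirements.txt .
RUN pip install -r requirements.txt
# application source changes often; only the layers from here down are rebuilt
COPY . .
CMD ["python", "app.py"]
Editing app.py now only re-runs the last two steps, while editing requirements.txt still triggers the full dependency install.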

Related

Update dependencies in the Dockerfile and create an image without re-downloading the previously mentioned dependencies

Consider the following scenario. There is an application that depends on the libraries "A", "B", and "C" to build and run; otherwise it throws an error. Not knowing about the dependencies "B" and "C", a Dockerfile is created that builds an image with only the dependency "A" installed.
The app is run inside a container started from the image, and the app fails to build, since the container is missing the dependencies "B" and "C".
Now, if the image is destroyed and rebuilt, the previously downloaded dependencies will be re-downloaded. A workaround could be to write a Dockerfile that imports from the existing image (the one with the dependency "A" installed) and installs the dependencies "B" and "C" on top of it.
But this way, every time a new dependency needs to be added, a new Docker image has to be built that imports from the old image, so both the old and the new image remain important.
My questions are:
Is there any way to keep building images that mention new dependencies without re-downloading the old dependencies?
Without importing the dependencies from the old image?
And without the hassle of writing a new "FROM" in the Dockerfile?
What is the cleanest solution for a scenario like this?
1. Is there any way to keep building images that mention new dependencies without re-downloading the old dependencies?
Well, I often optimize Dockerfiles using layer caching. Every command you write in a Dockerfile creates a new layer. Between two builds, Docker compares the Dockerfile's commands top-down and rebuilds from the first command where it detects a change. So I put stable layers (dependencies, environment setup) at the top of the Dockerfile, while instructions I change often, such as EXPOSE or CMD, go at the bottom of the file. Doing this saves a lot of time whenever I rebuild the image.
You can also use multi-stage builds. I don't use them often, but you can read about them here: https://docs.docker.com/develop/develop-images/multistage-build/
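For illustration, a minimal multi-stage sketch (the Go toolchain and module layout here are just an example; the point is that the final image keeps only the build output):
FROM golang:1.22 AS builder
WORKDIR /src
# copy dependency manifests first so the download layer stays cached
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app .

# the final image contains only the compiled binary, not the toolchain
FROM alpine:3.19
COPY --from=builder /out/app /usr/local/bin/app
CMD ["app"]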
2. Without keeping the old image and importing from it into the new one?
Sometimes, when I want to reinstall everything from scratch, I just rebuild the image with the --no-cache option:
docker build --no-cache=true .
3. Without the hassle of writing a new "FROM" in the Dockerfile
Sometimes I use a small base image like Alpine Linux and install everything I need from scratch, so my image is smaller and does not contain things I don't need. FROM just pulls an image from Docker Hub, and those images are built the same way.
For example, the Dockerfile of the nginx-alpine image:
https://github.com/nginxinc/docker-nginx/blob/2ef3fa66f2a434cd5e44e35a02f4ac502cf50808/mainline/alpine/Dockerfile
You can check out Alpine Linux for more details: https://alpinelinux.org/
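As a rough sketch of that approach (the package names and app.py are placeholders, not something from the question):
FROM alpine:3.19
# install only what the application actually needs
RUN apk add --no-cache python3 py3-pip
COPY app.py /app/app.py
CMD ["python3", "/app/app.py"]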

docker build context and sensitive data

The title of this question might suggest that it has already been answered, but trust me, I searched intensively here on SO :-)
As I understand it, when building a Docker image the current folder is packaged up and sent to the Docker daemon as the build context. From this build context the Docker image is built by "ADD"ing or "COPY"ing files and "RUN"ning the commands in the Dockerfile.
And furthermore: if I have sensitive configuration files in the folder of the Dockerfile, these files will be sent to the Docker daemon as part of the build context.
Now my question:
Let's say I did not use any COPY or ADD in my Dockerfile... will these configuration files be included somewhere in the Docker image? I ran a bash inside the image and could not find the configuration files, but maybe they are still somewhere in the deeper layers of the image?
Basically my question is: Will the context of the build be stored in the image?
Only things you explicitly COPY or ADD to the image will be present there. It's common to have lines like COPY . . which will copy the entire context into the image, so it's up to you to check that you're not copying in things you don't want to have persisted and published.
It still is probably a good idea to keep these files from being sent to the Docker daemon at all. If you know which files have this information, you can add them to a .dockerignore file (syntax similar to .gitignore and similar files). There are other ways to more tightly control what's in the build context (by making a shadow install tree that has only the context content) but that's a relatively unusual setup.
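For example, a .dockerignore might look like this (the entries are placeholders for whatever actually holds your sensitive data):
# keep secrets and VCS data out of the build context
.git
*.swp
secrets.env
config/credentials.yml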
As you said, only COPY, ADD, and RUN operations create layers, and therefore only those operations add something to the image.
The build context is only the directory with the resources those operations (specifically COPY and ADD) will have access to while building the image. But it's not anything like a "base layer".
In fact, you said you ran bash and double-checked that nothing sensitive was there. Another way to make sure is to inspect the layers of the image with docker history --no-trunc <image>.

How are layers cached in docker images?

I have this command in my docker file:
ADD static/ /www/static/
I have noticed that re-running docker build reuses the cache, even though the contents of the static/ directory have changed. Is this normal?
How does docker decide when a layer needs to be rebuilt? Just by looking at the command that needs to be executed, or by checking the actual operation performed? I assume it is the former, since the latter would require re-running the operation, defeating the purpose of caching.
The workaround that I am using now is --no-cache but this makes building slower, since no layer is reused. Is there a better way?
I think the best option would be to mark one operation as non-cacheable. Is this possible?
According to Docker's documentation, the cache for a specific layer is invalidated if the instruction has changed.
However, for ADD and COPY, the checksums of the files are compared, and if these have changed, the cache is invalidated.
Therefore it seems that the contents of the files in static/ have not changed. To be sure you are really seeing unexpected behaviour, compute a checksum over the files in static/ before the first build and again before the rebuild with the updated files, and compare the two.
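One quick way to do that with standard coreutils (assuming a Linux shell; on macOS use shasum -a 256 instead of sha256sum):
find static/ -type f -exec sha256sum {} + | sort -k 2 | sha256sum
If the final hash is identical before and after your edit, the file contents really have not changed.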

How can I see which file(s) caused a Dockerfile `COPY` statement to invalidate the cache?

docker build . will rebuild the docker image given the Dockerfile in the current directory, and ignore any paths matched from the .dockerignore file.
Any COPY statements in that Dockerfile will cause the build cache to be invalidated if the files on-disk are different from last time it built.
I've noticed that if you don't ignore the .git dir, simple things like git fetch which have no side effect will cause the build cache to become invalidated (presumably because some tracking information within the .git dir has changed).
It would be very helpful if I knew how to see precisely which files caused the cache to become invalidated... But I've been unable to find a way.
I don't think there is a way to see which file invalidated the cache with the current Docker image design.
Layers and images since v1.10 are 'content addressable'. Their IDs are based on a SHA256 checksum which reflects their content.
The caching code just looks up the ID of the image/layer, which will only exist in Docker Engine if the contents of the entire layer match (or possibly a collision).
So when you run docker build, a cache lookup happens for each command in the Dockerfile: a checksum is calculated for the entire layer that command would produce, and Docker then checks whether an existing layer is available with that checksum and run config.
The only way I can see to get individual file detail back would be to recompute the destination file checksums, which would probably negate most of the caching speed-up. If you did want to do this anyway, the other problem is deciding which layer to check against. You would have to look up a previous image build tree (maybe by tag?) to find what the contents of the previous comparable layer were.
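As a rough workaround for diagnosing it yourself, you can keep a checksum manifest of the build context and diff it between builds (the paths below are placeholders; this only approximates Docker's cache key, which also covers file metadata, but it is usually enough to spot the offending file):
# after a successful build, save a manifest of the context (excluding .git)
find . -type f -not -path './.git/*' -exec sha256sum {} + | sort -k 2 > /tmp/context.prev
# before the next build, regenerate it and diff to see which files changed
find . -type f -not -path './.git/*' -exec sha256sum {} + | sort -k 2 | diff /tmp/context.prev -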

Can I build a Docker image to "cache" a yocto/bitbake build?

I'm building a Yocto image for a project but it's a long process. On my powerful dev machine it takes around 3 hours and can consume up to 100 GB of space.
The thing is that the final image is not "necessarily" the end goal; it's my application that runs on top of it that is important. As such, the yocto recipes don't change much, but my application does.
I would like to run continuous integration (CI) for my app and even continuous delivery (CD). But both are quite hard for now because of the size of the yocto build.
Since the build does not change much, I thought of "caching" it in some way and using it for my application's CI/CD, and I thought of Docker. That would be quite interesting as I could maintain that image, share it with colleagues who need to work on the project, and use it in CI/CD.
Could a custom Docker image be built for that kind of use?
Would it be possible to build such an image completely offline? I don't want to have to upload the 100GB and have to re-download it on build machines...
Thanks!
1. Yes.
I've used docker to build Yocto images for many different reasons, always with positive results.
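For what it's worth, such a build container is usually just a stock distro image with the Yocto host packages plus a non-root user, since bitbake refuses to run as root. A rough sketch, with an illustrative package list (check the Yocto reference manual for your release):
FROM ubuntu:22.04
# host packages required by the Yocto build system (exact list depends on your release)
RUN apt-get update && apt-get install -y \
    gawk wget git diffstat unzip texinfo gcc build-essential \
    chrpath socat cpio python3 python3-pip xz-utils file locales \
 && rm -rf /var/lib/apt/lists/*
RUN locale-gen en_US.UTF-8
ENV LANG=en_US.UTF-8
# bitbake will not run as root, so switch to an unprivileged user
RUN useradd -m builder
USER builder
WORKDIR /home/builder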
2. Yes, with some work.
You want to take advantage of the fact that Yocto caches all the stuff you need to do your build in what it calls "Shared State Cache". This is normally located in your build directory under ${BUILDDIR}/sstate-cache, and it contains exactly what you are looking for in this case. There are a couple of options for how to get these files to your build machines.
Option 1 is using sstate mirrors:
This isn't completely offline, but lets you download a much smaller cache and build from that cache, rather than from source.
Here's what's in my local.conf file:
SSTATE_MIRRORS ?= "\
file://.* http://my.shared-computer.com/some-folder/PATH"
Don't forget the PATH at the end. That is required. The build system substitutes the correct path within the directory structure.
Option 2 lets you keep a local copy of your sstate-cache and build from that locally.
In your dockerfile, create the sstate-cache directory (location isn't important here, I like /opt for my purposes):
RUN mkdir -p /opt/yocto/sstate-cache
Then be sure to bind-mount these directories when you run your build in order to preserve the contents, like this:
docker run ... -v /place/to/save/cache:/opt/yocto/sstate-cache
Edit the local.conf in your build directory so that it points at these folders:
SSTATE_DIR ?= "/opt/yocto/sstate-cache"
In this way, you can get your cache onto your build machines in whatever way is best for you (scp, nfs, sneakernet).
Hope this helps!

Resources