How does docker create a cached layer for RUN command? - docker

I'm new to docker, II read from a book which says:
Dockerfile instruction results in an image layer, but if the
instruction doesn’t change between builds, and the content going into
the instruction is the same, Docker knows it can use the previous
layer in the cache. An example forCOPY instuction, Docker
calculates whether the input has a match in the cache by generating a
hash, which is like a digital fingerprint representing the input. The
hash is made from the Dockerfile instruction and the contents of any
files being copied. If there’s no match for the hash in the existing
image layers, Docker executes the instruction, and that breaks the
cache. As soon as the cache is broken, Docker executes all the
instructions that follow, even if they haven’t changed.
I can understand it, but is every instruction eligible for docker to create a cache layer?
Let's take RUN instruction for example
RUN dotnet build "WebApplication.csproj" -c Release -o /app/build
WebApplication.csproj is most likely the same (we are not adding any third party package, only modify source code). since the content of "WebApplication.csproj" is the same, should Docker use the cache layer generated before? if docker does use cache layer, which can cause problem because we modified the source code, I don't think Docker is smart enough to check every source files in our project?

Related

docker-compose ignores ubuntu:latest in Dockerfile [duplicate]

I want to build a docker image for the Linkurious project on github, which requires both the Neo4j database, and Node.js to run.
My first approach was to declare a base image for my image, containing Neo4j. The reference docs do not define "base image" in any helpful manner:
Base image:
An image that has no parent is a base image
from which I read that I may only have a base image if that image has no base image itself.
But what is a base image? Does it mean, if I declare neo4j/neo4j in a FROM directive, that when my image is run the neo database will automatically run and be available within the container on port 7474?
Reading the Docker reference I see:
FROM can appear multiple times within a single Dockerfile in order to create multiple images. Simply make a note of the last image ID output by the commit before each new FROM command.
Do I want to create multiple images? It would seem what I want is to have a single image that contains the contents of other images e.g. neo4j and node.js.
I've found no directive to declare dependencies in the reference manual. Are there no dependencies like in RPM where in order to run my image the calling context must first install the images it needs?
I'm confused...
As of May 2017, multiple FROMs can be used in a single Dockerfile.
See "Builder pattern vs. Multi-stage builds in Docker" (by Alex Ellis) and PR 31257 by Tõnis Tiigi.
The general syntax involves adding FROM additional times within your Dockerfile - whichever is the last FROM statement is the final base image. To copy artifacts and outputs from intermediate images use COPY --from=<base_image_number>.
FROM golang:1.7.3 as builder
WORKDIR /go/src/github.com/alexellis/href-counter/
RUN go get -d -v golang.org/x/net/html
COPY app.go .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /go/src/github.com/alexellis/href-counter/app .
CMD ["./app"]
The result would be two images, one for building, one with just the resulting app (much, much smaller)
REPOSITORY TAG IMAGE ID CREATED SIZE
multi latest bcbbf69a9b59 6 minutes ago 10.3MB
golang 1.7.3 ef15416724f6 4 months ago 672MB
what is a base image?
A set of files, plus EXPOSE'd ports, ENTRYPOINT and CMD.
You can add files and build a new image based on that base image, with a new Dockerfile starting with a FROM directive: the image mentioned after FROM is "the base image" for your new image.
does it mean that if I declare neo4j/neo4j in a FROM directive, that when my image is run the neo database will automatically run and be available within the container on port 7474?
Only if you don't overwrite CMD and ENTRYPOINT.
But the image in itself is enough: you would use a FROM neo4j/neo4j if you had to add files related to neo4j for your particular usage of neo4j.
Let me summarize my understanding of the question and the answer, hoping that it will be useful to others.
Question: Let’s say I have three images, apple, banana and orange. Can I have a Dockerfile that has FROM apple, FROM banana and FROM orange that will tell docker to magically merge all three applications into a single image (containing the three individual applications) which I could call smoothie?
Answer: No, you can't. If you do that, you will end up with four images, the three fruit images you pulled, plus the new image based on the last FROM image. If, for example, FROM orange was the last statement in the Dockerfile without anything added, the smoothie image would just be a clone of the orange image.
Why Are They Not Merged? I Really Want It
A typical docker image will contain almost everything the application needs to run (leaving out the kernel) which usually means that they’re built from a base image for their chosen operating system and a particular version or distribution.
Merging images successfully without considering all possible distributions, file systems, libraries and applications, is not something Docker, understandably, wants to do. Instead, developers are expected to embrace the microservices paradigm, running multiple containers that talk to each other as needed.
What’s the Alternative?
One possible use case for image merging would be to mix and match Linux distributions with our desired applications, for example, Ubuntu and Node.js. This is not the solution:
FROM ubuntu
FROM node
If we don’t want to stick with the Linux distribution chosen by our application image, we can start with our chosen distribution and use the package manager to install the applications instead, e.g.
FROM ubuntu
RUN apt-get update &&\
apt-get install package1 &&\
apt-get install package2
But you probably knew that already. Often times there isn’t a snap or package available in the chosen distribution, or it’s not the desired version, or it doesn't work well in a docker container out of the box, which was the motivation for wanting to use an image. I’m just confirming that, as far as I know, the only option is to do it the long way, if you really want to follow a monolithic approach.
In the case of Node.js for example, you might want to manually install the latest version, since apt provides an ancient one, and snap does not come with the Ubuntu image. For neo4j we might have to download the package and manually add it to the image, according to the documentation and the license.
One strategy, if size does not matter, is to start with the base image that would be hardest to install manually, and add the rest on top.
When To Use Multiple FROM Directives
There is also the option to use multiple FROM statements and manually copy stuff between build stages or into your final one. In other words, you can manually merge images, if you know what you're doing. As per the documentation:
Optionally a name can be given to a new build stage by adding AS name
to the FROM instruction. The name can be used in subsequent FROM and
COPY --from=<name> instructions to refer to the image built in this
stage.
Personally, I’d only be comfortable using this merge approach with my own images or by following documentation from the application vendor, but it’s there if you need it or you're just feeling lucky.
A better application of this approach though, would be when we actually do want to use a temporary container from a different image, for building or doing something and discard it after copying the desired output.
Example
I wanted a lean image with gpgv only, and based on this Unix & Linux answer, I installed the whole gpg with yum and then copied only the binaries required, to the final image:
FROM docker.io/photon:latest AS builder
RUN yum install gnupg -y
FROM docker.io/photon:latest
COPY --from=builder /usr/bin/gpgv /usr/bin/
COPY --from=builder /usr/lib/libgcrypt.so.20 /usr/lib/libgpg-error.so.0 /usr/lib/
The rest of the Dockerfile continues as usual.
The first answer is too complex, historic, and uninformative for my tastes.
It's actually rather simple. Docker provides for a functionality called multi-stage builds the basic idea here is to,
Free you from having to manually remove what you don't want, by forcing you to allowlist what you do want,
Free resources that would otherwise be taken up because of Docker's implementation.
Let's start with the first. Very often with something like Debian you'll see.
RUN apt-get update \
&& apt-get dist-upgrade \
&& apt-get install <whatever> \
&& apt-get clean
We can explain all of this in terms of the above. The above command is chained together so it represents a single change with no intermediate Images required. If it was written like this,
RUN apt-get update ;
RUN apt-get dist-upgrade;
RUN apt-get install <whatever>;
RUN apt-get clean;
It would result in 3 more temporary intermediate Images. Having it reduced to one image, there is one remaining problem: apt-get clean doesn't clean up artifacts used in the install. If a Debian maintainer includes in his install a script that modifies the system that modification will also be present in the final solution (see something like pepperflashplugin-nonfree for an example of that).
By using a multi-stage build you get all the benefits of a single changed action, but it will require you to manually allowlist and copy over files that were introduced in the temporary image using the COPY --from syntax documented here. Moreover, it's a great solution where there is no alternative (like an apt-get clean), and you would otherwise have lots of un-needed files in your final image.
See also
Multi-stage builds
COPY syntax
Here is probably one of the most fundamental use cases of using multiple FROMs, aka, multi stage builds.
I want want one dockerfile, and I want to change one word and depending on what I set that word to, I get different images depending on whether I want to run, Dev or Publish the application!
Run - I just want to run the app
Dev - I want to edit the code and run the app
Publish - Run the app in production
Lets suppose we're working in the dotnet environment. Heres one single Dockerfile. Without multi stage build, there would be multiple files (builder pattern)
#See https://aka.ms/containerfastmode to understand how Visual Studio uses this Dockerfile to build your images for faster debugging.
FROM mcr.microsoft.com/dotnet/runtime:5.0 AS base
WORKDIR /app
FROM mcr.microsoft.com/dotnet/sdk:5.0 AS build
WORKDIR /src
COPY ["ConsoleApp1/ConsoleApp1.csproj", "ConsoleApp1/"]
RUN dotnet restore "ConsoleApp1/ConsoleApp1.csproj"
COPY . .
WORKDIR "/src/ConsoleApp1"
RUN dotnet build "ConsoleApp1.csproj" -c Release -o /app/build
FROM build AS publish
RUN dotnet publish "ConsoleApp1.csproj" -c Release -o /app/publish
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "ConsoleApp1.dll"]
Want to run the app? Leave FROM base AS final as it currently is in the dockerfile above.
Want to dev the source code in the container? Change the same line to FROM build AS final
Want to release into prod? Change the same line to FROM publish AS final
I agree with the OP, that this feature is useful for docker! Here is a different view into the same problem:
If you had multiple FROMs (or a "FROM" and multiple "MERGE"'s, for example) then you can use the docker registry versioning system for the base docker image AND other container elements, and that is the win here: I have third party development tools which do not exist in .deb format, so these tools must be installed by un-taring a tball and is HUGE, so caching on the docker host will be important but versioning/change control of the image is equally important. I (think I) can simply use "RUN git ....", and docker will deal with the caching of the new layer for me, which is what I want; because another container will have the same base image but a different set of HUGE third party tools, so the caching of the base image and the tools image is really important (the 3rd party tools tar can be as big as the base image of say ubuntu so caching of these is really important too). The (suggested) feature just allows all these elements to be managed in a central repo. versioning system.
Said a different way, why do we use FROM at all? If I were to simply git clone an ubuntu image using the RUN command for my "base image/layer", this would create a new layer and docker would cache this anyway...so is there any difference/advantage in using FROM, other than it uses dockers internal versioning system/syntax?

During build, on what basis does docker decide whether to create a new layer or use an existing layer?

The docker file has various command like FROM, RUN, etc. Each of this command creates a layer (intermediary image).
During build, assuming the layer already exists, on what basis does docker decide whether to create a new layer or use an existing layer?
The docker build caching system is pretty simple. For most commands, if the previous layer was cached, and there is a layer that runs the exact same command (RUN, ENV, CMD, ...) then it reuses the cached layer, and repeats this check for the next command. For COPY and ADD commands the decision is based on a hash of the file contents.
This is detailed in Best practices for writing Dockerfiles in the Docker documentation.
Practically, there are a couple of things this means:
You almost never need docker build --no-cache, since if the Dockerfile or any involved files have changed, the cache will be automatically invalidated.
If you have an expensive step to install dependencies (npm install, pip install, bundle install, ...) have a first step that only COPYs the files that list the dependencies, then RUN whatever install, then COPY the rest of your application. This avoids invalidating the cache for the "install" step if only application code has changed.
If you have a Debian- or Ubuntu-based image, RUN apt-get update && apt-get install in a single command. This avoids a problem where the URLs from the "update" step are cached, but the packages in the "install" step change, and the cached URLs are no longer valid.

Is there a reason why people write everything to the DockerFile instead of a separate shell script?

I somehow don't like the RUN x && y && z ... syntax we currently use in DockerFile. As far as I understand I could just run a shell script instead like RUN xyz.sh and do the same tasks on my favorite language. Does the latter have any disadvantage?
Update:
In additional to the point made by David about the complexity, I believe writing everything to the Dockerfile makes it easier for share (thus creating a survivorship bias for you). Namely on the DockerHub, most of the time, you have a "Dockerfile" tab to quickly get the idea on how the image is built. If the author uses COPY and RUN xyz.sh, he/she would have to host the script elsewhere or the Dockerfile alone becomes meaningless.
CMD is executed at runtime, that is when the container is created from the image. RUN is a build time instruction. So the question is actually why people run things with RUN instead of CMD at runtime. (You can of course COPY script.sh /script.sh then RUN bash /script.sh)
If you do things like installing dependencies, it could take a lot of time, in case of scaling up your service, this would make auto-scaling useless because it can't be fast enough to absorb the peak.
At build time, RUN can be cached, so next time the build will be a lot faster.
Because the way docker file system works, creating 10 containers from the same image takes only a few more space than creating 1 container. So you can save disk space by installing packages in the image, while if you install them at runtime, they will all occupy a part of disk space.
RUN executes commands in a new layer and creates a new image. This happens when you build the image using docker build.
CMD specifies a default command an parameters to be run when a container is launched from the image.
In summary. Run and cmd is not interchangeable, RUN runs when an image is created, CMD when a container is launched.

What does "From image" do in dockerfiles

I see that dockerfiles usually have a line beginning with "from" keywork, for example:
FROM composer/composer:1.1-alpine AS composer
As far as I know, dockerfiles are a set of commands that help to build and run many containers in docker.
As the example above, docker uses a image named composer/composer:1.1-alpine from docker hub. The As composer just make an alias, so we can use it more convenient.
When I looked for the image, I found the link enter link description here and then enter link description here.
The thing I dont really understand is that:
I guess docker will use the image to build something, but how exactly does it use the image? Does docker run the image or just prepare to use it when in need. Sometimes I dont see the dockerfiles use the image in following line (like this example, there are no lines using the keyword "composer" except the first line). It makes me confused.
Any help would be appreciated.
Thanks.
DockerFiles describes layers: Each command creates it's own layer. For example:
RUN touch test.txt
RUN cp test.txt foo.txt
would create two layers - the first one with the file test.txt, the second one without test.txt but with foo.txt
Each layer adds something to a container. When we walk the layers "up" we find that the very first layer is the empty layer, e.g. it contains only the linux (or windows) kernel itself - nothing else. But that's not really useful - we need a lot of tools (e.g. bash) to be able to run an app. So common base images like alpine add suc tools and core os functions.
It would be annoying as hell if we had to do this setup in every container so there a lots of base images, which do exactly this kind of setup.
If you want to see what a base image does, just search the name on hub.docker.com - there you will find the Dockerfile, describing the build process.
Aditionally, containers can be extenend, e.g. you use the elasticsearch container as a base image, and add your own functionality - that's the second use case for base images.
For your second question: You have to decide if you have to replicate the steps in your base image or not. If you inherit from a general OS image like apline - probably not, since linux normally ships with these tools. If you inherit from a more specialized container, it depends - if your machine matches the environment in the container, you don't need to, but if not you will have to apply these steps to your machine, too. E.g. if you don't have elasticsearch installed, you have to install it.
As for multiple froms in one Dockerfile: Please look up the documentation for Multi Stage builds. Essentially, they encapsulate multiple containers in a single dockerfile. Which can be very useful if you need a different set to build an app and to run the app. The first Container is responsible to build your app, while the second one takes the compiled source code and just runs it.
Watch for COPY --from= lines, these are copying files from one container to another.
The FROM instruction initializes a new build stage and sets the Base Image for subsequent instructions. As such, a valid Dockerfile must start with a FROM instruction. The image can be any valid image – it is especially easy to start by pulling an image from the Public Repositories.
FROM can appear multiple times within a single Dockerfile to create multiple images or use one build stage as a dependency for another. Simply make a note of the last image ID output by the commit before each new FROM instruction. Each FROM instruction clears any state created by previous instructions.
Optionally a name can be given to a new build stage by adding AS name to the FROM instruction. The name can be used in subsequent FROM and COPY --from= instructions to refer to the image built in this stage.
The tag or digest values are optional. If you omit either of them, the builder assumes a latest tag by default. The builder returns an error if it cannot find the tag value.
Taken from : https://docs.docker.com/engine/reference/builder/#from

How do I update docker images?

I read that docker works with layers, so when creating a container with a Dockerfile, you start with the base image, then subsequent commands run add a layer to the container, so if you save the state of that new container, you have a new image. There are a couple of things I'm wondering about this.
If I start from a Ubuntu image, which is pretty big and bulky since its a complete OS, then I add a few tools to it and save this as a new image which I upload to the hub. If someone downloads my image, and they already have a Ubuntu image saved in their images folder, does this mean they can skip downloading Ubuntu since they already have the image? If so, how does this work when I modify parts of the original image, does Docker use its cached data to selectively apply those changes to the Ubuntu image after it loads it?
2.) How do I update an image that I built by modifying the Dockerfile? I setup a simple django project with this Dockerfile:
FROM python:3.5
ENV PYTHONBUFFERED 1
ENV APPLICATION_ROOT /app
ENV APP_ENVIRONMENT L
RUN mkdir -p $APPLICATION_ROOT
WORKDIR $APPLICATION_ROOT
ADD requirements.txt $APPLICATION_ROOT
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
ADD . $APPLICATION_ROOT
and used this to create the image in the beginning. So everytime I create a box, it loads all these environment variables, if I rebuild the box completely it reinstalls the packages and all the extras. I need to add a new environment variable, so I added it to the bottom of the Dockerfile, along with a test variable:
ENV COMPOSE_CONVERT_WINDOWS_PATHS 1
ENV TEST_ENV_VAR TEST
When I delete the container and the image, and build a new container, it all seems to go accordingly, it tells me that it creates the new Step 4 : ENV
COMPOSE_CONVERT_WINDOWS_PATHS 1
---> Running in 75551ea311b2
---> b25b60e29f18
Removing intermediate container 75551ea311b2
So its like something gets lost in some of these intermediate container transitions. Is this how the caching system works, every new layer is an intermediate container? So with that in mind, how do you add a new layer, do you always have to add the new data at the bottom of the Dockerfile? Or would it be better to leave the Dockerfile alone once the image is built, and just modify the container and built a new image?
EDIT I just tried installing an image, a package called bwawrik/bioinformatics, which is a CentOS based container which has a wide range of tools installed.
It froze half way through, so I exited it and then ran it again to see if everything was installed:
$ docker pull bwawrik/bioinformatics
Using default tag: latest
latest: Pulling from bwawrik/bioinformatics
a3ed95caeb02: Already exists
a3ed95caeb02: Already exists
7e78dbe53fdd: Already exists
ebcc98113eaa: Already exists
598d3c8fd678: Already exists
12520d1e1960: Already exists
9b4912d2bc7b: Already exists
c64f941884ae: Already exists
24371a4298bf: Already exists
993de48846f3: Already exists
2231b3c00b9e: Already exists
2d67c793630d: Already exists
d43673e70e8e: Already exists
fe4f50dda611: Already exists
33300f752b24: Already exists
b4eec31201d8: Already exists
f34092f697e8: Already exists
e49521d8fb4f: Already exists
8349c93680fe: Already exists
929d44a7a5a1: Already exists
09a30957f0fb: Already exists
4611e742e0b5: Already exists
25aacf0148db: Already exists
74da82504b6c: Already exists
3e0aac083b86: Already exists
f52c7e0ac000: Already exists
35eee92aaf2f: Already exists
5f6d8eb70885: Already exists
536920bfe266: Already exists
98638e678c51: Already exists
9123956b991d: Already exists
1c4c8a29cd65: Already exists
1804bf352a97: Already exists
aa6fe9359956: Already exists
e7e38d1250a9: Already exists
05e935c831dc: Already exists
b7dfc22c26f3: Already exists
1514d4797ffd: Already exists
Digest: sha256:0391808e21b7b5cc0eb44fc2dad0d7f5415115bdaafb4534c0b6a12efd47a88b
Status: Image is up to date for bwawrik/bioinformatics:latest
So it definitely installed the package in pieces, not all in one go. Are these pieces, different images?
image vs. container
First, let me clarify some terminology.
image: A static, immutable object. This is the thing you build when you run docker build using a Dockerfile. An image is not a thing that runs.
Images are composed of layers. an image might have only one layer, or it might have many layers.
container: A running thing. It uses an image as its starting template.
This is similar to a binary program and a process. You have a binary program on disk (such as /bin/sh), and when you run it, it is a process on your system. This is similar to the relationship between images and containers.
Adding layers to a base image
You can build your own image from a base image (such as ubuntu in your example). Some commands in your Dockerfile will create a new layer in the ultimate image. Some of those are RUN, COPY, and ADD.
The very first layer has no parent layer. But every other layer will have a parent layer. In this way they link to one another, stacking up like pancakes.
Each layer has a unique ID (the long hexadecimal hashes you have already seen). They can also have human-friendly names, known as tags (e.g. ubuntu:16.04).
What is a layer vs. an image?
Technically, each layer is also an image. If you build a new image and it has 5 layers, you can use that image and it will contain all 5 layers. If you run a container using the third layer in the stack as your image ID, you can do that too - but it would only contain 3 layers. The one you specify and the two that are its ancestors.
But as a matter of convention, the term "image" generally means the layer that has a tag associated. When you run docker images, it will show you all of the top-level images, and hide the layers beneath (but you can show them all with -a).
What is an intermediate container?
When docker build runs, it does all of its work inside of containers (naturally!) So if it encounters a RUN step, it will create a container from the current top layer, run the specified commands in there, and then save the result as a new layer. Then it will create a container from this new layer, run the next thing... etc.
The intermediate containers are only used for the build process, and are discarded after the build.
How layer filesystems work
You asked whether someone downloading your ubuntu-based image are only doing a partial download, if they already had the ubuntu image locally.
Yes! That's exactly right.
Every layer uses the layer beneath it as a base. The new layer is basically a diff between that layer and a new state. It's not a diff in the same way as a git commit might work, though. It works at the file level, not at a the line level.
Say you started from ubuntu, and you ran this Dockerfile.
FROM: ubuntu:16.04
RUN groupadd dan && useradd -g dan dan
This would result in a two layer image. The first layer would be the ubuntu image. The second would probably have only a handful of changes.
A newer copy of /etc/passwd with user "dan"
A newer copy of /etc/group with group "dan"
A new directory /home/dan
A couple of default files like /home/dan/.bashrc
And that's it. If you start a container from this image, those few files would be in the topmost layer, and everything else would come from the filesystem in the ubuntu image.
The top-most read-write layer in a container
One other point. When you run a container, you can write files in the filesystem. But if you stop the container and run another container from the same image, everything is reset. So where are the files written?
Images are immutable, so once they are created, they can't be changed. You can build a new version, but that's a new image. It would have a different ID and would not be the same image.
A container has a top-level read-write layer which is put on top of the image layers. Any writes happen in that layer. It works just like the other layers. If you need to modify a file (or add one, or delete one), that is done in the top layer, and doesn't affect the lower layers. If the file exists already, it is copied into the read-write layer, and then modified. This is known as copy-on-write (CoW).
Where to add changes
Do you have to add new things to the bottom of Dockerfile? No, you can add anything anywhere (or change anything).
However, how you do things does affect your build times because of how the build caching works.
Docker will try to cache results during builds. If it finds as it reads through Dockerfile that the FROM is the same, the first RUN is the same, the second RUN is the same... it will assume it has already done those steps, and will use cached results. If it encounters something that is different from the last build, it will invalidate the cache. Everything from that point on will be re-run fresh.
Some things will always invalidate the cache. For instance if you use ADD or COPY, those always invalidate the cache. That's because Docker only keeps track of what the build commands are. It doesn't try to figure out "is this version of the file I'm copying the same one as last time?"
So it is a common practice to start with FROM, then put very static things like RUN commands that install packages with e.g. apt-get, etc. Those things tend to not change a lot after your Dockerfile has been initially written. Later in the file is a more convenient place to put things that change more often.
It's hard to concisely give good advice on this, because it really depends on the project in question. But it pays to learn how the build caching works and try to take advantage of it.

Resources