Docker: should I combine my apt-get install / build / cleanup steps into one big RUN?

I have a Dockerfile that looks like this:
FROM debian:stable-slim
RUN apt-get update && \
apt-get install -y --no-install-recommends fp-compiler fp-units-fcl fp-units-net libc6-dev
COPY src /whatwg/wattsi/src
WORKDIR /whatwg/wattsi/src
RUN ./build.sh
RUN rm -rf /whatwg/wattsi/src && \
apt-get purge -y fp-compiler fp-units-fcl fp-units-net libc6-dev && \
apt-get autoremove -y
ENTRYPOINT ["/whatwg/wattsi/bin/wattsi"]
As you can see, there are three separate RUN steps: one to install dependencies, one to build, and one to cleanup after building.
I've been poking around to try to figure out why the resulting image is relatively large, and it seems like it's because, even though I do a cleanup step, a layer is retained containing all the installed dependencies.
Should I restructure my Dockerfile like so?
FROM debian:stable-slim
COPY src /whatwg/wattsi/src
WORKDIR /whatwg/wattsi/src
RUN apt-get update && \
apt-get install -y --no-install-recommends fp-compiler fp-units-fcl fp-units-net libc6-dev && \
./build.sh && \
rm -rf /whatwg/wattsi/src && \
apt-get purge -y fp-compiler fp-units-fcl fp-units-net libc6-dev && \
apt-get autoremove -y
ENTRYPOINT ["/whatwg/wattsi/bin/wattsi"]
This feels a bit "squashed", and I can't find any documentation explicitly recommending it. All the documentation that says "minimize RUN commands" seems to focus on not doing multiple apt-get steps; it doesn't talk about squashing everything into one. But maybe it's the right thing to do?

Each layer in a Docker image is like a commit in version control: it can't change previous layers, just as deleting a file in Git doesn't remove it from history. So deleting a file from a previous layer doesn't make the image smaller.
Since layers are created at the end of RUN, doing what you're doing is indeed one way to make smaller images. The other, as someone mentioned, is multi-stage builds.
The downside of the single RUN variant is that you have to rerun the whole thing every time source code changes. So you need to apt-get all those packages each time instead of relying on Docker's build caching (I wrote a thing explaining the caching here: https://pythonspeed.com/articles/docker-caching-model/).
So multi-stage builds let you have both faster builds via caching and small images. But they're more complicated to get right; what you did is simpler and easier.
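For comparison, a multi-stage variant of your Dockerfile could look roughly like this. This is only a sketch: it assumes build.sh puts its output in /whatwg/wattsi/bin and that the binary needs no extra runtime packages beyond what debian:stable-slim provides.
FROM debian:stable-slim AS build
RUN apt-get update && \
apt-get install -y --no-install-recommends fp-compiler fp-units-fcl fp-units-net libc6-dev
COPY src /whatwg/wattsi/src
WORKDIR /whatwg/wattsi/src
RUN ./build.sh

FROM debian:stable-slim
# Only the build output is copied; the compiler and sources stay behind in the build stage
COPY --from=build /whatwg/wattsi/bin /whatwg/wattsi/bin
ENTRYPOINT ["/whatwg/wattsi/bin/wattsi"]
Because COPY src comes after the apt-get step in the build stage, a source change only re-runs ./build.sh, and the final image never contains the compiler layers at all.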

Related

Docker multistage build vs. keeping artifacts in git

My target container is a build-environment container, so my team can build an app in a uniform environment.
This app doesn't necessarily run as a container; it runs on a physical machine. The container is solely for building.
The app depends on third parties.
Some I can apt-get install with Dockerfile RUN command.
And some I must build myself because they require special building.
I was wondering which way is better.
Using a multi-stage build seems cool; for example, this Dockerfile:
FROM ubuntu:18.04 AS third_party
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
...
ADD http://.../boost.tar.gz /
RUN tar xzf boost.tar.gz && \
... && \
make --prefix /boost_out ...
FROM ubuntu:18.04 AS final
COPY --from=third_party /boost_out/ /usr/
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
...
CMD ["bash"]
...
Pros:
Automatically built when I build my final container
Easy to change third party version (boost in this example)
Cons:
The ADD command downloads the ~100MB file on every build, which makes the image build process slower
I want to use --cache-from so I would be able to cache third_party and build from different docker host machine. Meaning I need to store ~1.6GB image in a docker registry. That's pretty heavy to pull/push.
On the other hand
I could just build boost (with this third_party image) and store its artifacts on some storage, git for example. It would take ~200MB, which is better than storing a 1.6GB image.
Pros:
Smaller disk space
Cons:
Cumbersome build
Manually build and push artifacts to git when changing boost version.
Somehow link Docker build and git to pull newest artifacts and COPY to the final image.
Either way I need a third_party image that uniformly and automatically builds the third parties. In option 1 the image is bigger than in option 2, which would contain just the build tools and not the build artifacts.
Is this the trade-off?
1. is more automatic but consumes more disk space and push/pull time,
2. is cumbersome but consumes less disk space and push/pull time?
Are there any other virtues for any of these ways?
I'd like to propose changing your first attempt to something like this:
FROM ubuntu:18.04 as third_party
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
...
RUN wget http://.../boost.tar.gz -O /boost.tar.gz && \
tar xvf boost.tar.gz && \
... && \
make --prefix /boost_out ... && \
find -name \*.o -delete && \
rm /boost.tar.gz # this is important!
FROM ubuntu:18.04 AS final
COPY --from=third_party /boost_out/ /usr/
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
...
CMD ["bash"]
This way, you are paying for the download of boost only once (when building the image without a cache), and you do not pay for the storage/pull-time of the original tar-ed sources. Additionally, you should remove unneeded target files (.o?) from the build in the same step in which they are generated. Otherwise, they are stored and pulled as well.
If you are at liberty to post the whole Dockerfile, I'll gladly take a deeper look at it and give you some hints.
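On the --cache-from concern: you can also push only the third_party stage and reuse it as a cache source on other hosts, so the final image stays small while the heavy stage is still cached. A rough sketch (the registry name registry.example.com/myapp is a placeholder):
# Build and push the heavy stage once
docker build --target third_party -t registry.example.com/myapp/third_party:latest .
docker push registry.example.com/myapp/third_party:latest

# On another host, reuse it as a cache source for the full build
docker pull registry.example.com/myapp/third_party:latest
docker build --cache-from registry.example.com/myapp/third_party:latest -t myapp:latest .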

Does putting RUN commands in one line make the build faster?

In my Dockerfile, does it really matter if I put RUN commands in one line or not? Does putting them in one line make the build faster?
RUN apt-get update
RUN apt-get -y install --no-install-recommends python3
RUN apt-get -y install --no-install-recommends open-vm-tools
vs.
RUN apt-get update && apt-get -y install --no-install-recommends python3
RUN apt-get -y install --no-install-recommends open-vm-tools
By minimizing the number of layers you're reducing the size of your image and, yes, also the build time. This is also recommended in the best-practices section "Minimize the number of layers":
In older versions of Docker, it was important that you minimized the number of layers in your images to ensure they were performant. The following features were added to reduce this limitation.
Only the instructions RUN, COPY, ADD create layers. Other instructions create temporary intermediate images, and do not increase the size of the build.
...
Practically speaking, the build- and run-time cost of having one RUN command vs. several will be imperceptible, and I wouldn't try to optimize here solely in the name of performance.
In the specific example you show, a couple of things are true:
The Debian APT tool on its own has a non-trivial startup time; separately from the question of one RUN command vs. several, one apt-get install vs. two will be faster.
RUN apt-get install -y --no-install-recommends python3 open-vm-tools
Debian and Ubuntu update their repositories fairly frequently, and when they do, package links that were in last week's apt-get update stop working. Meanwhile, Docker layer caching will try to avoid re-running a step that it's already run. If you did build your image a week ago, Docker will say "I already did this RUN apt-get update and so I don't need to run it again"; but that means it's cached a stale package index. It's important to run apt-get update and apt-get install in the same RUN step.
RUN apt-get update \
&& apt-get install -y --no-install-recommends python3 open-vm-tools
And in general:
If you have some cleanup steps you want to run, it's important to run them in the same RUN step. The sequence below creates a layer after the build step, and so the RUN rm step doesn't actually make the final image smaller.
# All of this example should be combined into a single RUN step
RUN tar xzf package-1.2.3.tar.gz
RUN cd package-1.2.3 && ./configure && make && make install
# There is a layer here including the build tree
RUN rm -rf package-1.2.3
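Combined into a single step, that example would look roughly like this (package-1.2.3 is just the placeholder name from above):
# Extract, build, install, and clean up in one layer so the build tree is never persisted
RUN tar xzf package-1.2.3.tar.gz \
&& cd package-1.2.3 && ./configure && make && make install \
&& cd .. \
&& rm -rf package-1.2.3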
If you're iterating on a Dockerfile, it can be easier to split things out into many small RUN commands while you're debugging, and combine them later. That's perfectly fine, and you should get an identical filesystem tree at the end.
# I'm trying to figure out the configure options, so I might temporarily split the steps up
RUN tar xzf package-1.2.3.tar.gz
RUN cd package-1.2.3 && ./configure --some-option
RUN cd package-1.2.3 && make
If you don't clean the apt cache at the end of your "one liner" you are actually not achieving a lot. A big layer is still added to your image.
The main idea is to put all the installations on one line and end that line with a cleanup, so that when docker saves that layer (before moving on to the next command) it only saves the newly installed software without all the downloads and cache that are not used anymore anyway.
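For example, with the packages from the question, a typical install-and-clean pattern looks like this:
# The rm at the end removes apt's package index so it isn't stored in the layer
RUN apt-get update \
&& apt-get install -y --no-install-recommends python3 open-vm-tools \
&& rm -rf /var/lib/apt/lists/*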
As for speed, you probably gain a little by writing more on one line, but I don't think it is that much. Of course, if you have hundreds of layers, then you would see a difference.

yum -y install not assuming yes in docker build

I'm trying to simply create a dockerfile that installs wget and unzip on a centOS image. This is the file:
FROM centos:latest
EXPOSE 9000
RUN echo "proxy=http://xxx.xxx.xxx.xxx:x" >> /etc/yum.conf \
&& echo "proxy_username=username" >> /etc/yum.conf \
&& echo "proxy_password=password" >> /etc/yum.conf \
&& yum update \
&& yum -y install wget unzip
...
When I run the build it resolves the dependencies just fine but it doesn't seem to be honoring the -y flag and assuming yes for any prompts:
Total download size: 61 M
Is this ok [y/d/N]: Exiting on user command
Your transaction was saved, rerun it with:
yum load-transaction /tmp/yum_save_tx.2018-08-08.21-22.Q7f8LW.yumtx
The command '/bin/sh -c yum update && yum -y install wget unzip' returned
a non-zero code: 1
I've used the -y flag in this situation many times and have never had any trouble. It doesn't seem like this could be a caching issue but I have no idea what's going on. I also tried yum install -y wget unzip just for good measure but still no luck (as expected). I've searched stackoverflow but it seems like anyone with the same issue just wasn't using the -y flag. Any guidance would be appreciated because I don't know what could be going wrong with such a simple file.
It looks like you're missing the -y on the yum update.
Also, you could split those commands out into separate RUN commands. In this case it doesn't make too much difference, but moving the echo commands into their own RUN step will make the Dockerfile clearer.
You should keep the update and install in the same command, though.
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run
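Concretely, the fix is adding -y to the update as well (the proxy values are left as the placeholders from the question):
RUN echo "proxy=http://xxx.xxx.xxx.xxx:x" >> /etc/yum.conf \
&& echo "proxy_username=username" >> /etc/yum.conf \
&& echo "proxy_password=password" >> /etc/yum.conf \
&& yum -y update \
&& yum -y install wget unzip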

How to create the smallest possible Docker image after installing apt dependencies

I've created a Docker image using debian as the parent image. In my Dockerfile I've installed some dependencies using apt and pip.
Now, I want to get rid of everything that is not completely necessary to run my app, which, of course, needs the dependencies installed.
For now I have the following lines in my Dockerfile after installing the dependencies.
RUN rm -rf /var/lib/apt/lists/* \
&& rm -Rf /usr/share/doc && rm -Rf /usr/share/man \
&& apt-get clean
I've also installed the dependencies using the --no-install-recommends option.
Anything else I can do to reduce the footprint of my Docker image?
PS: just in case, this is how I installed the dependencies:
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
sudo systemd \
build-essential libffi-dev libssl-dev \
python-pip python-dev python-setuptools python-wheel
To reduce the size of the image, you need to combine your RUN commands into one. When you create files in one layer and delete them in another, the files still exist on the drive and are shipped over the network. Their existence is just hidden when the layers of the filesystem are assembled for your container.
The Dockerfile best practices explain this in more detail: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run
I'd also recommend building with docker build --rm=false --no-cache . (temporarily) and then reviewing the output of docker diff on each of the created images to see what files are created in each layer.
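As a sketch, the install and cleanup from the question combined into one RUN step would look roughly like this:
# Install and clean up in the same layer so the apt cache, docs, and man pages never persist
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
sudo systemd \
build-essential libffi-dev libssl-dev \
python-pip python-dev python-setuptools python-wheel \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /usr/share/doc /usr/share/man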

How-to: Dockerfile layer(s) reuse / reference

I want to create a few images where each Dockerfile includes some common layers. It's all mixed, and there is no consistent sequence across the images.
E.g. some will need to have Java on them, and I want to define the lines below (taken from the official Java 8 Dockerfile) as a building block (layer) that can be referenced from other Dockerfiles.
In some cases it could be more than one layer that I would want to reuse - making this requirement recursive in nature (building blocks constructed of building blocks).
RUN \
echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | debconf-set-selections && \
add-apt-repository -y ppa:webupd8team/java && \
apt-get update && \
apt-get install -y oracle-java8-installer && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /var/cache/oracle-jdk8-installer
Is that supported by Docker today? Is it a good practice to install e.g. Java separately as a layer that way ... or should I merge it with other installs (to keep the image minimal)? What are the best practices around such a scenario please?
=========== UPDATE ==============
Seems like the feature of referencing/including is not supported. But I am still not sure about the best practices ...
@Sri pointed me to the best practices below:
https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#run
There it is encouraged to use "RUN apt-get update && apt-get install -y" for all package installation.
But does that also mean we are discouraged from using different layers for different packages, like the example below?
RUN apt-get update && apt-get install -y package-foo
RUN apt-get update && apt-get install -y package-bar
Based on further reading:
There is no INCLUDE like feature currently. https://github.com/docker/docker/issues/735
Best practices encourage using "RUN apt-get update && apt-get install -y" for all package installation, but that doesn't mean you cannot use the same technique to separate package installs (e.g. package-foo and package-bar) for maintainability. It is a trade-off against minimizing the number of layers. https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#/minimize-the-number-of-layers (see also how the build cache operates, which treats these as different layers)
Thank you @Sri for the initial pointers.
Docker has the concept of a base image. This is specified in the Dockerfile as "FROM <base-image>", which should be the first line of the Dockerfile. The base image can contain all the common pieces, and the individual containers can deal with their specific functionalities.
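As a minimal sketch (the name my-registry/java8-base and the ubuntu base tag are placeholders; the RUN block is copied from your snippet), the shared Java setup could live in its own image, and the other Dockerfiles would simply start from it:
# java8-base/Dockerfile -- built and pushed once as my-registry/java8-base
FROM ubuntu:16.04
RUN \
echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | debconf-set-selections && \
add-apt-repository -y ppa:webupd8team/java && \
apt-get update && \
apt-get install -y oracle-java8-installer && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /var/cache/oracle-jdk8-installer

# app/Dockerfile -- every image that needs Java reuses those layers
FROM my-registry/java8-base
RUN apt-get update && apt-get install -y package-foo && rm -rf /var/lib/apt/lists/*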
