Dockerfile RUN creates a fat layer even with cleanup

It's generally understood that if a RUN instruction cleans up after large file operations, the resulting layer will be smaller. However, when I execute the following script, the resulting layer is large and roughly corresponds to the size of the files that were fetched and then cleaned up.
How can a layer this large be committed when the files are removed within the same RUN instruction?
RUN mkdir -p /etc/puppet && \
cd /etc/puppet && \
apt-get update && apt-get install -y wget puppet && \
wget -rnH --level=10 -e robots=off --reject "index.html*" http://somefileserver.org/puppet/allpuppetstuff && \
puppet module install puppetlabs/stdlib && \
puppet module install 7terminals-java && \
puppet apply -e "include someserver" && \
apt-get purge -y --auto-remove puppet wget && \
rm -rfv /etc/puppet/* && \
rm -rfv /var/lib/apt/lists/*

Found some details after a few tweaks.
It seems Docker duplicates files from an older layer when a newer layer modifies files persisted in the earlier one. This is what produces the fat layer from the RUN command above, even though the net change in file size at the end of the operation is minimal. I couldn't find a definitive resource to cite for this, but it matches my experience so far.
Basically, file operations related to the image should always be done in a single layer.
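As a minimal illustration of the single-layer rule (the archive name here is hypothetical), compare:
# Two RUN instructions: the archive is committed in the first layer and
# merely hidden by the second, so the image stays fat.
RUN wget http://somefileserver.org/puppet/big-archive.tar.gz
RUN rm big-archive.tar.gz
# One RUN instruction: the archive never survives into a committed layer.
RUN wget http://somefileserver.org/puppet/big-archive.tar.gz && \
rm big-archive.tar.gz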

Related

How can I build a similar docker image based on alpine that works on ubuntu?

I am trying to rewrite a Dockerfile (https://github.com/orangefoil/rcssserver-docker/blob/master/Dockerfile) so that it uses Alpine instead of Ubuntu. The goal is to reduce the image size.
In the original image the RoboCup soccer server is built from source using g++, flex, bison, etc.
FROM ubuntu:18.04 AS build
ARG VERSION=16.0.0
WORKDIR /root
RUN apt update && \
apt -y install autoconf bison clang flex libboost-dev libboost-all-dev libc6-dev make wget
RUN wget https://github.com/rcsoccersim/rcssserver/archive/rcssserver-$VERSION.tar.gz && \
tar xfz rcssserver-$VERSION.tar.gz && \
cd rcssserver-rcssserver-$VERSION && \
./bootstrap && \
./configure && \
make && \
make install && \
ldconfig
I tried to do the same on Alpine and had to swap some packages:
FROM alpine:latest
ARG VERSION=16.0.0
WORKDIR /root
# Add basics first
RUN apk --no-cache update \
&& apk upgrade \
&& apk add autoconf bison clang-dev flex-dev boost-dev make wget automake libtool-dev g++ build-base
RUN wget https://github.com/rcsoccersim/rcssserver/archive/rcssserver-$VERSION.tar.gz
RUN tar xfz rcssserver-$VERSION.tar.gz
RUN cd rcssserver-rcssserver-$VERSION && \
./bootstrap && \
./configure && \
make && \
make install && \
ldconfig
Unfortunately, my version doesn't work yet. It fails with
/usr/lib/gcc/x86_64-alpine-linux-musl/9.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lrcssclangparser
From what I found so far, this can happen if dev packages are not installed (see "ld cannot find an existing library"), but I switched to dev packages wherever I could find them and still had no luck.
So my current assumption is that Ubuntu has some package installed that I need to add to my Alpine image. I would rule out a code problem, since the Ubuntu version works.
Any ideas what could be missing? I would also be happy to learn how to compare the packages myself, but package names differ between Ubuntu and Alpine, so I find it pretty hard to figure out.
You should break this up using a multi-stage build. As it stands, your final image contains the C toolchain and all of the development libraries and headers that those -dev packages install; you don't need any of those to actually run the built application. The basic idea is to build the application exactly as you have it now, but then COPY only the built application into a new image with fewer dependencies.
This would look something like this (untested):
FROM ubuntu:18.04 AS build
# ... exactly what's in the original question ...
FROM ubuntu:18.04
# Install the shared libraries you need to run the application,
# but not -dev headers or the full C toolchain. You may need to
# run `ldd` on the built binary to see what exactly it needs.
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive \
apt-get install --assume-yes --no-install-recommends \
libboost-atomic1.65.1 \
libboost-chrono1.65.1
# ... more libboost-* libraries as required ...
# Get the built application out of the original image.
# Autoconf's default is to install into /usr/local, and in a
# typical Docker base image nothing else will be installed there.
COPY --from=build /usr/local /usr/local
RUN ldconfig
# Describe how to run a container.
EXPOSE 12345
CMD ["/usr/local/bin/rcssserver"]
Compared to the size of the C toolchain, header files, and build-time libraries, the difference between an Alpine and Ubuntu image is pretty small, and Alpine has well-documented library compatibility issues with its minimal libc implementation.

Docker: should I combine my apt-get install / build / cleanup steps into one big RUN?

I have a Dockerfile that looks like this:
FROM debian:stable-slim
RUN apt-get update && \
apt-get install -y --no-install-recommends fp-compiler fp-units-fcl fp-units-net libc6-dev
COPY src /whatwg/wattsi/src
WORKDIR /whatwg/wattsi/src
RUN ./build.sh
RUN rm -rf /whatwg/wattsi/src && \
apt-get purge -y fp-compiler fp-units-fcl fp-units-net libc6-dev && \
apt-get autoremove -y
ENTRYPOINT ["/whatwg/wattsi/bin/wattsi"]
As you can see, there are three separate RUN steps: one to install dependencies, one to build, and one to cleanup after building.
I've been poking around to try to figure out why the resulting image is relatively large, and it seems like it's because, even though I do a cleanup step, a layer is retained containing all the installed dependencies.
Should I restructure my Dockerfile like so?
FROM debian:stable-slim
COPY src /whatwg/wattsi/src
WORKDIR /whatwg/wattsi/src
RUN apt-get update && \
apt-get install -y --no-install-recommends fp-compiler fp-units-fcl fp-units-net libc6-dev && \
./build.sh && \
rm -rf /whatwg/wattsi/src && \
apt-get purge -y fp-compiler fp-units-fcl fp-units-net libc6-dev && \
apt-get autoremove -y
ENTRYPOINT ["/whatwg/wattsi/bin/wattsi"]
This feels a bit "squashed", and I can't find any documentation explicitly recommending it. All the documentation that says "minimize RUN commands" seems to focus on not doing multiple apt-get steps; it doesn't talk about squashing everything into one. But maybe it's the right thing to do?
Each layer in a Docker image is like a commit in version control: it can't change previous layers, just as deleting a file in Git won't remove it from history. So deleting a file from a previous layer doesn't make the image smaller.
Since a layer is committed at the end of each RUN, doing what you're doing is indeed one way to make smaller images. The other, as someone mentioned, is multi-stage builds.
The downside of the single-RUN variant is that you have to rerun the whole thing every time the source code changes. So you have to apt-get all those packages on every build instead of relying on Docker's build caching (I wrote an explanation of the caching model here: https://pythonspeed.com/articles/docker-caching-model/).
So multi-stage builds give you both faster builds via caching and small images, but they're more complicated to get right; what you did is simpler and easier.
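For reference, a multi-stage version of the Dockerfile above might look like this (an untested sketch; it assumes build.sh leaves the binary in /whatwg/wattsi/bin, as the ENTRYPOINT suggests):
FROM debian:stable-slim AS build
RUN apt-get update && \
apt-get install -y --no-install-recommends fp-compiler fp-units-fcl fp-units-net libc6-dev
COPY src /whatwg/wattsi/src
WORKDIR /whatwg/wattsi/src
RUN ./build.sh

FROM debian:stable-slim
# Only the build output is copied forward; the compiler and sources stay
# behind in the discarded build stage, so no purge/cleanup step is needed.
COPY --from=build /whatwg/wattsi/bin /whatwg/wattsi/bin
ENTRYPOINT ["/whatwg/wattsi/bin/wattsi"]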

How to create the smallest possible Docker image after installing apt dependencies

I've created a Docker image using debian as the parent image. In my Dockerfile I've installed some dependencies using apt and pip.
Now I want to get rid of everything that is not strictly necessary to run my app, which, of course, still needs the installed dependencies.
For now I have the following lines in my Dockerfile after installing the dependencies.
RUN rm -rf /var/lib/apt/lists/* \
&& rm -Rf /usr/share/doc && rm -Rf /usr/share/man \
&& apt-get clean
I've also installed the dependencies using the --no-install-recommends option.
Anything else I can do to reduce the footprint of my Docker image?
PS: just in case, this is how I installed the dependencies:
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
sudo systemd \
build-essential libffi-dev libssl-dev \
python-pip python-dev python-setuptools python-wheel
To reduce the size of the image, you need to combine your RUN commands into one. When you create files in one layer and delete them in another, the files still exist on the drive and are shipped over the network. Their existence is just hidden when the layers of the filesystem are assembled for your container.
The Dockerfile best practices explain this in more detail: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run
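Applied to the question's Dockerfile, that would mean folding the install and cleanup into a single RUN, something like this sketch (the pip line is a placeholder for however your app installs its Python dependencies, and exactly which packages you can purge depends on what your app needs at runtime):
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
sudo systemd \
build-essential libffi-dev libssl-dev \
python-pip python-dev python-setuptools python-wheel \
&& pip install -r requirements.txt \
&& apt-get purge -y build-essential libffi-dev libssl-dev python-dev \
&& apt-get autoremove -y \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /usr/share/doc /usr/share/man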
I'd also recommend (temporarily) building with docker build --rm=false --no-cache . and then reviewing the output of docker diff on each of the intermediate containers it leaves behind, to see which files are created in each layer.
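For example (the image tag and container ID are placeholders; the real IDs are printed as " ---> Running in <id>" during the build):
docker build --rm=false --no-cache -t myimage .
# Show the files added, changed, or deleted by one intermediate step:
docker diff 1a2b3c4d5e6f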

Benefits of line continuation in Dockerfiles

I've been working with Docker recently to host an RStudio-server instance. There are a number of packages that need to be installed, and one way to do this in a Dockerfile is this (with many more lines, obviously):
RUN Rscript -e "install.packages('beanplot')"
RUN Rscript -e "install.packages('beeswarm')"
RUN Rscript -e "install.packages('boot')"
RUN Rscript -e "install.packages('caTools')"
I see many instances where this is done in a single line instead:
RUN Rscript -e "install.packages(c('beanplot','beeswarm','boot','caTools'))"
Also, I often see various executable lines chained together like this:
RUN yum -y update \
&& yum -y groupinstall 'Development Tools' \
&& yum -y install epel-release \
vim \
initscripts \
libpng-devel \
mesa-libGL \
mesa-libGL-devel \
mesa-libGLU \
mesa-libGLU-devel \
ypbind \
rpcbind \
authconfig \
&& yum -y install R \
&& mkdir /rhome
rather than having each &&-joined command as a separate RUN line.
I had assumed the benefit was a smaller image, but when I tested a large example, both methods resulted in images of the same size.
What is the advantage of chaining commands rather than having individual RUN commands for each line?
Each RUN command adds a new layer to the image, and there is an upper limit on the number of layers allowed (historically around 125 or so, depending on the storage driver). The limit is enforced for performance reasons.
Every time an application that runs inside the container needs to access a file, the engine searches the file in all these layers, from top to bottom until it finds it. If the application attempts to change a file that is not on the topmost layer then the engine first makes a copy of the file on the topmost layer then handles the application's write requests onto the copy.
The topmost layer is writeable. It is not stored in the image but it is part of the container. The layers stored in the image are read-only.
It is all explained in the documentation. Keeping a small number of layers is recommended as best practice.
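You can see the difference yourself with docker history, which lists every layer in an image along with its size (the image tags and Dockerfile names here are placeholders):
docker build -t rstudio-multi -f Dockerfile.multi .
docker build -t rstudio-single -f Dockerfile.single .
docker history rstudio-multi # many layers, one per RUN instruction
docker history rstudio-single # one layer for the whole chained RUN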

Gaining better performance with inotify-tools and unison

I use inotify-tools and unison to synchronize folders between machines.
Because I have a large folder to synchronize, I simply wrote an inotifywait script to do the job automatically.
Is it sensible to have inotifywait monitor the subdirectories of the large folder to gain better performance?
You should get better performance if you ditch inotify-tools and just use unison's native support for watching your folders for changes. By using inotify-tools and then calling unison when a change occurs, unison has to "re-find" the change before it syncs. You could instead add the line repeat = watch to your unison profile and unison will run continually and sync whenever there is a change. It detects the change with its own file-watcher utility unison-fsmonitor that communicates directly with unison.
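In a unison profile that would look something like this (a sketch; the profile name and root paths are placeholders):
# ~/.unison/sync.prf
root = /home/me/bigfolder
root = ssh://otherhost//home/me/bigfolder
# Run continuously and sync whenever unison-fsmonitor reports a change:
repeat = watch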
For more information, check out the latest changelog for unison 2.48.3 with major changes to unison-fsmonitor.
unison-fsmonitor is still not provided by the Ubuntu package:
https://bugs.launchpad.net/ubuntu/+source/unison/+bug/1558737
https://github.com/bcpierce00/unison/issues/208
If you want it quickly, you can build it locally:
UNISON_VERSION=2.51.2
echo "Install Unison." \
&& apt install -y wget ocaml \
&& pushd /tmp \
&& wget https://github.com/bcpierce00/unison/archive/v$UNISON_VERSION.tar.gz \
&& tar -xzvf v$UNISON_VERSION.tar.gz \
&& rm v$UNISON_VERSION.tar.gz \
&& pushd unison-$UNISON_VERSION \
&& make \
&& cp -t /usr/local/bin ./src/unison ./src/unison-fsmonitor \
&& popd \
&& rm -rf unison-$UNISON_VERSION \
&& popd
