Why are Docker container images so large? [closed] - docker

I made a simple image through a Dockerfile from Fedora (initially 320 MB).
I added nano (a tiny editor of about 1 MB), and the size of the image rose to 530 MB. I added Git on top of that (30-ish MB), and then my image size sky-rocketed to 830 MB.
Isn't that insane?
I've tried exporting and importing the container to remove history/intermediate images. That effort saved at most 25 MB; now my image size is 804 MB. I've also tried combining many commands in one RUN, but I'm still getting the same initial 830 MB.
I'm having doubts about whether it's worth using Docker at all. I mean, I've barely installed anything and I'm already pushing 1 GB. If I have to add something serious like a database, I might run out of disk space.
Does anyone else suffer from ridiculously large images? How do you deal with it?
Or is my Dockerfile horribly incorrect?
FROM fedora:latest
MAINTAINER Me NotYou <email@dot.com>
RUN yum -y install nano
RUN yum -y install git
but it's hard to imagine what could go wrong here.

As @rexposadas said, images include all the layers, and each layer includes all the dependencies for what you installed. It is also important to note that base images (like fedora:latest) tend to be very bare-bones. You may be surprised by the number of dependencies your installed software has.
I was able to make your installation significantly smaller by adding yum -y clean all to each line:
FROM fedora:latest
RUN yum -y install nano && yum -y clean all
RUN yum -y install git && yum -y clean all
It is important to do that for each RUN, before the layer gets committed, or else deletes don't actually remove data. That is, in a union/copy-on-write file system, cleaning at the end doesn't really reduce file system usage because the real data is already committed to lower layers. To get around this you must clean at each layer.
$ docker history bf5260c6651d
IMAGE CREATED CREATED BY SIZE
bf5260c6651d 4 days ago /bin/sh -c yum -y install git; yum -y clean a 260.7 MB
172743bd5d60 4 days ago /bin/sh -c yum -y install nano; yum -y clean 12.39 MB
3f2fed40e4b0 2 weeks ago /bin/sh -c #(nop) ADD file:cee1a4fcfcd00d18da 372.7 MB
fd241224e9cf 2 weeks ago /bin/sh -c #(nop) MAINTAINER Lokesh Mandvekar 0 B
511136ea3c5a 12 months ago 0 B

Docker images are not large; you are just building large images.
The scratch image is 0 B, and you can use it to package up your code if you can compile your code into a static binary. For example, you can compile your Go program and package it on top of scratch to make a fully usable image that is less than 5 MB.
The key is not to use the official Docker images; they are too big. Scratch isn't all that practical either, so I'd recommend using Alpine Linux as your base image. It is ~5 MB; then only add what is required for your app. This post about Microcontainers shows you how to build very small images based on Alpine.
UPDATE: the official Docker images are based on Alpine now, so they are good to use.
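As an illustration of the scratch approach mentioned above, here is a minimal multi-stage sketch for a hypothetical Go program (the golang:1.21 tag and the paths are placeholders, not from the original answer):
# build stage: compile a statically linked binary
FROM golang:1.21 AS builder
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .
# final stage: only the binary goes on top of the empty scratch image
FROM scratch
COPY --from=builder /app /app
ENTRYPOINT ["/app"]
The resulting image contains nothing but the binary, so its size is essentially the size of the compiled program.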

Here are some more things you can do:
Avoid multiple RUN commands where you can. Put as much as possible into one RUN command (using &&).
Clean up unnecessary tools like wget or git (which you only need for downloading or building things, not for running your process).
With both of these AND the recommendations from @Andy and @michau I was able to shrink my Node.js image from 1.062 GB to 542 MB.
Edit:
One more important thing:
"It took me a while to really understand that each Dockerfile command creates a new container with the deltas. [...] It doesn't matter if you rm -rf the files in a later command; they continue exist in some intermediate layer container."
So now I managed to put apt-get install, wget, npm install (with git dependencies) and apt-get remove into a single RUN command, so now my image has only 438 MB.
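A sketch of that single-RUN pattern for a Node.js image, assuming a Debian-based base image (the node:18 tag, package names, and build steps are illustrative, not the author's exact commands):
FROM node:18
WORKDIR /app
COPY package.json .
# install build-time tools, fetch dependencies, then remove the tools in the same layer
RUN apt-get update \
 && apt-get install -y --no-install-recommends git wget \
 && npm install \
 && apt-get purge -y git wget \
 && apt-get autoremove -y \
 && rm -rf /var/lib/apt/lists/*
Because the install and the removal happen in the same layer, the build tools never end up in the shipped image.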
Edit 29/06/17
With Docker v17.06 comes a new feature for Dockerfiles:
You can have multiple FROM statements inside one Dockerfile, and only the stuff from the last FROM will be in your final Docker image. This is useful for reducing image size, for example:
FROM nodejs as builder
WORKDIR /var/my-project
RUN apt-get install ruby python git openssh gcc && \
git clone my-project . && \
npm install
FROM nodejs
COPY --from=builder /var/my-project /var/my-project
This will result in an image containing only the nodejs base image plus the content of /var/my-project from the first stage, but without ruby, python, git, openssh, and gcc!

Yes, those sizes are ridiculous, and I really have no idea why so few people notice that.
I made an Ubuntu image that is actually minimal (unlike other so-called "minimal" images). It's called textlab/ubuntu-essential and is 60 MB.
FROM textlab/ubuntu-essential
RUN apt-get update && apt-get -y install nano
The above image is 82 MB after installing nano.
FROM textlab/ubuntu-essential
RUN apt-get update && apt-get -y install nano git
Git has many more prerequisites, so the image gets larger, about 192 MB. That's still less than the initial size of most images.
You can also take a look at the script I wrote to make the minimal Ubuntu image for Docker. You can perhaps adapt it to Fedora, but I'm not sure how much you will be able to uninstall.

The following helped me a lot:
After removing unused packages inside my container (e.g. Redis, which freed 1200 MB), I did the following:
docker export [containerID] -o containername.tar
docker import -m "commit message here" containername.tar imagename:tag
The layers get flattened, and the new image is smaller because I removed packages from the container as stated above.
It took me a long time to understand this, which is why I've added my comment.

As a best practice, you should execute a single RUN command, because
every RUN instruction in the Dockerfile writes a new layer to the image, and every layer requires extra disk space. To keep the number of layers to a minimum, any file manipulation like installing, moving, extracting, removing, etc. should ideally be done in a single RUN instruction:
FROM fedora:latest
RUN yum -y install nano git && yum -y clean all

Docker Squash is a really nice solution to this. You can run $packagemanager clean in the last step instead of on every line, and then just run docker-squash to get rid of all of the intermediate layers.
https://github.com/jwilder/docker-squash
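If I read the project's README correctly, the tool is used by piping docker save through docker-squash and back into docker load, roughly like this (the image ID and tag are placeholders):
# squash all of an image's layers into one and load the result back under a new tag
docker save <image_id> | docker-squash -t myimage:squashed | docker load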

Yes, the layer system is quite surprising.
If you have a base image and you increment it by doing the following:
# Test
#
# VERSION 1
# use the centos base image provided by dotCloud
FROM centos7/wildfly
MAINTAINER JohnDo
# Build it with: docker build -t "centos7/test" test/
# Change user into root
USER root
# Extract weblogic
RUN rm -rf /tmp/* \
&& rm -rf /wildfly/*
The image has exactly the same size. That essentially means you have to cram a lot of extract, install, and cleanup magic into your RUN steps to keep the image as small as the software installed.
This makes life much harder...
What docker build is missing is a way to run a step without committing a layer.
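The usual workaround is to do the download, extraction, and cleanup within a single RUN step, roughly like this sketch (the URL and paths are purely illustrative, and it assumes curl and tar exist in the base image):
# fetch, extract, and clean up in one layer so the archive never persists in the image
RUN curl -fsSL https://example.com/appserver.tar.gz -o /tmp/appserver.tar.gz \
 && tar -xzf /tmp/appserver.tar.gz -C /opt \
 && rm -rf /tmp/*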

We had a similar issue in our Docker build process. Each image built was significantly larger than the others. As it turned out, we were including tar.gz files in the image. Among these were the compressed images we upload to a server, so each image accidentally contained the prior images. Image sizes were soon in the 8 GB range.
.dockerignore is your friend. Make sure anything in your project not necessary to build the image is in the ignore file.
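A sketch of a .dockerignore that would have kept those archives out of the build context (the patterns are illustrative):
# archives and other build artifacts that should never reach the image
*.tar.gz
*.tgz
build/
.git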

Related

Update of root certificates on docker

If I understand correctly, on standard Ubuntu systems for example, root certificates are provided by the ca-certificates package and get updated when the package itself is updated.
But how can the root certificates be updated when using Docker containers? Is there a common preferred way of doing this, or must the containers be redeployed with an up-to-date Docker image?
The containers must be redeployed with an up-to-date image.
The Docker Hub base images like ubuntu actually get updated fairly regularly, and if you look at the tag list you can see that there are several date-stamped variants of the images. So one approach that will get you pretty close to current is to always (have your CI system) pull the base image before you build.
docker pull ubuntu:18.04
docker build .
If you can't do that, or if you're working from some sort of derived image that updates less frequently, you can just manually run apt-get upgrade in your Dockerfile. Doing this in the same place you're otherwise installing packages makes sense. It needs to be in the same RUN line as a matching apt-get update, and you might need some way to force Docker to not cache that update line to get current updates.
FROM python:3.8-slim
# Have an option to force rebuilds; the RUN line won't be
# cacheable if the dependency_stamp option changes
ARG dependency_stamp
ENV dependency_stamp=${dependency_stamp:-unknown}
RUN touch /dependencies.${dependency_stamp}
# Update base OS packages and install other things we need
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive apt-get upgrade \
&& DEBIAN_FRONTEND=noninteractive apt-get install \
--no-install-recommends --assume-yes \
...
If you find yourself doing this routinely, it can be helpful to maintain your own base images that are upgraded to current packages but have nothing else installed. If you go that route, you may have more control over the process and get smaller images if you build an image FROM ubuntu and install e.g. Python yourself, rather than building an image FROM python and then installing updates over it.
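A minimal sketch of such a base image, assuming Ubuntu 18.04 and Python 3 as the only addition (the tag and package list are illustrative):
FROM ubuntu:18.04
# bring OS packages up to date and add only what our applications need
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive apt-get upgrade --assume-yes \
 && DEBIAN_FRONTEND=noninteractive apt-get install --assume-yes --no-install-recommends \
      python3 \
 && rm -rf /var/lib/apt/lists/*
Application images would then start FROM this image and add only the application itself.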

Purpose of specifying several UNIX commands in a single RUN instruction in Dockerfile

I have noticed that many Dockerfiles try to minimize the number of instructions by combining several UNIX commands into a single RUN instruction. Is there any reason for this?
Also is there any difference in the outcomes between the two Dockerfiles below?
Dockerfile1
FROM ubuntu
MAINTAINER demousr@example.com
RUN apt-get update
RUN apt-get install -y nginx
CMD ["echo", "Image created"]
Dockerfile2
FROM ubuntu
MAINTAINER demousr@example.com
RUN apt-get update && apt-get install -y nginx
CMD ["echo", "Image created"]
Roughly speaking, a Docker image contains some metadata & an array of layers, and a running container is built upon these layers by adding a container layer (read-and-write), the layers from the underlying image being read-only at that point.
These layers can be stored on disk in different ways depending on the configured storage driver. For example, a diagram in the official Docker documentation illustrates how the files changed in these different layers are taken into account with the OverlayFS storage driver.
Next, the Dockerfile instructions RUN, COPY, and ADD create layers, and the best practices mentioned on the Docker website specifically recommend merging consecutive RUN commands into a single RUN command, to reduce the number of layers and thereby reduce the size of the final image:
https://docs.docker.com/develop/dev-best-practices/
[…] try to reduce the number of layers in your image by minimizing the number of separate RUN commands in your Dockerfile. You can do this by consolidating multiple commands into a single RUN line and using your shell’s mechanisms to combine them together. […]
See also: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
Moreover, in your example:
RUN apt-get update -y -q
RUN apt-get install -y nginx
if you run docker build -t your-image-name . on this Dockerfile, then edit the Dockerfile a while later to add another package beyond nginx, and run docker build -t your-image-name . again, then due to the Docker cache mechanism the apt-get update -y -q won't be executed again, so the APT cache will be stale. This is another upside of merging the two RUN commands.
In addition to the space savings, it's also about correctness.
Consider your first Dockerfile (a common mistake when working with Debian-like systems that use apt):
FROM ubuntu
MAINTAINER demousr@example.com
RUN apt-get update
RUN apt-get install -y nginx
CMD ["echo", "Image created"]
If two or more images follow this pattern, a cache hit could cause an image to be unbuildable due to cached metadata:
Let's say I built an image that looks similar to that a few weeks ago.
Now I'm building the same image today. There's a cache present up until the RUN apt-get update line.
The docker build will reuse that cached layer (since the Dockerfile and base image are identical) up to and including RUN apt-get update.
When the RUN apt-get install line runs, it will use the cached apt metadata (which is now weeks out of date and will likely error).
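The fix described above is to merge the two commands into one layer, so a cache hit covers the update and the install together, or neither (a sketch):
FROM ubuntu
# update the package index and install in the same layer so the index can never go stale on its own
RUN apt-get update \
 && apt-get install -y nginx \
 && rm -rf /var/lib/apt/lists/*
CMD ["echo", "Image created"]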

docker build - Avoid ADDing files only needed at build time

I'm trying to build a Docker image while avoiding unnecessary bulk, and I've run into a problem that I think should be common, but so far I haven't found a straightforward solution. (I'm building the image on an Ubuntu 18.04 system, starting with a FROM ubuntu layer.)
In particular, I have a very large .deb file (over 3G) that I need to install in the image. It's easy enough to COPY or ADD it and then RUN dpkg -i, but that results in duplication of several GB of space that I don't need. Of course, just removing the file doesn't reduce the image size.
I'd like to be able to mount a volume to access the .deb file, rather than COPY it, which is easy to do when running a container, but apparently not possible to do when building one?
What I've come up with so far is to build the image up to the point where I would ADD the file, then run it with a volume mounted so I can access the file from the container without COPYing it, then dpkg -i it, and then do a docker commit to create an image from that container. Sure enough, I end up with an image that's over 3 GB smaller than my first try, but that seems like a hack, and it makes scripting the build more complicated.
I'm thinking there must be a more appropriate way to achieve this, but so far my searching has not revealed an obvious answer. Am I missing something?
Relying on docker commit indeed amounts to a hack :) and its use is thus considered inadvisable by some references, such as this blog article.
I only see one possible approach for the kind of use case you mention (copy a one-time .deb package, install it and remove the binary immediately from the image layer):
You could make the .deb you want to install remotely available to the Docker engine that builds your image, and replace the COPY + RUN directives with a single RUN, e.g. relying on curl:
RUN curl -OL https://example.com/foo.deb && dpkg -i foo.deb && rm -f foo.deb
If curl is not yet installed, you could run beforehand the usual APT commands:
RUN apt-get update -y -q \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y -q --no-install-recommends \
ca-certificates \
curl \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
Maybe there is another possible solution (but I don't think Docker's multi-stage build feature would be of any help here, as all permissions would be lost by doing e.g. COPY --from=build / /).

What is the difference between multiples RUN entries in Dockerfile and just one RUN entry?

What is the difference between multiples RUN entries in Dockerfile like:
FROM php:5.6-apache
RUN docker-php-ext-install mysqli
RUN apt update
RUN apt install git -y -q
and just one RUN entry?
FROM php:5.6-apache
RUN docker-php-ext-install mysqli && apt update && apt install git -y -q
Note: I'm not asking which one is better. I want to know all the differences between the two approaches.
Each RUN command creates a layer of the filesystem changes generated by a temporary container started to run that command. (It's effectively running a docker run and then packaging the result of docker diff into a filesystem layer.)
These layers have a few key details to note:
They are immutable. Once you create them you don't change them. You would have to generate a new layer to update your image.
They are reusable between multiple images and running containers. You can do this because of the immutability.
You do not delete files from a parent layer, but you can register that a file is deleted in a later layer. This is a metadata change in that later layer, not a modification to the parent layer.
Layers are reused in docker's build cache. If two different images, or even the same image being rebuilt, perform the same command on top of the same parent layer, docker will reuse the already created layer.
These layers are merged together into the final filesystem you see inside your container.
The main differences between the two approaches are the build cache and deleting files. If you split the download of a source code tgz, the extraction of the tgz, the compilation of a binary, and the deletion of the tgz and source folders into multiple RUN lines, then when you ship the image over the network and store it on disk, you will have all of the source in the layers even though you don't see it in the final container. Your image will be significantly larger.
Caching can also be a bad thing when you cache too much. If you split the apt update and apt install, and then add a new package to the install in your second RUN line months later, docker will reuse the months-old cache of apt update and try to install packages that are months old and possibly no longer available, and your image may fail to build. Many people also run rm -rf /var/lib/apt/lists/* after installing Debian packages. If you do this in a separate step, you will not actually delete the files from the previous layers, so your image will not shrink.
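Putting those points together, a cache-safe and space-efficient version of the single-RUN Dockerfile from the question might look like this sketch (apt-get is used instead of apt, since apt's CLI is intended for interactive use):
FROM php:5.6-apache
# extension build, package index update, install, and list cleanup all in one layer
RUN docker-php-ext-install mysqli \
 && apt-get update \
 && apt-get install -y -q git \
 && rm -rf /var/lib/apt/lists/*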

Update of operating system takes forever, while building Docker image

I'm using Debian:latest as the base image for my Docker containers.
Problem is that on every build I have to update the OS packages, which takes forever. Here is what I do:
FROM debian:latest
RUN apt-get update && apt-get install -y --force-yes --no-install-recommends nginx
...
apt-get update && apt-get install lasts forever. What do I do about this?
Docker images are minimal, including only the absolute necessities to run that base image. For the Debian base images, that means there is no package repo cache. So when you run apt-get update, it downloads the package repo cache from all the repos for the first time. If they included the package repo cache, it would be many megs of package state that would quickly go out of date, resulting in larger base images with little reduction in the work of doing an update later.
The actual debian:latest image is relatively well maintained with commits from last month. You can view the various tags for it here: https://hub.docker.com/_/debian/
To reduce your image build time, I'd recommend not deleting your image every time. Instead, do your new build and tag, and once the new image is built, you can run a docker image prune --force to remove the untagged images from prior builds. This allows docker to reuse the cache from prior image builds.
Alternatively, you can create your own base image that you update less frequently and that has all of your application prerequisites. Build it like any other image, and then change the FROM debian:latest to FROM your_base_image.
One last tip, avoid using latest in your image builds, do something like FROM debian:9 instead so that a major version update in debian doesn't break your build.
Don't delete the image on each build. Just modify your Dockerfile and build again. Docker is "smart" and will keep unmodified layers, rebuilding only from the lines you changed. The intermediate images used for this purpose (Docker creates them automatically) can easily be removed with this command:
docker rmi $(docker images -q -f dangling=true)
You'll save a lot of time with this. Remember, don't delete the image. Just modify the Dockerfile and build again. And once everything is working, launch that command and that's all you need.
Another command that's good to launch for cleaning is:
docker volume prune -f
But this last command is for another kind of cleaning not related to images... it's focused on volumes.
