docker build - Avoid ADDing files only needed at build time


I'm trying to build a docker image avoiding unnecessary bulk, and I've run into a problem that I think should be common, but so far I haven't found a straightforward solution. (I'm building the docker on an ubuntu 18.04 system, and starting with a FROM ubuntu layer.)
In particular, I have a very large .deb file (over 3G) that I need to install in the image. It's easy enough to COPY or ADD it and then RUN dpkg -i, but that results in duplication of several GB of space that I don't need. Of course, just removing the file doesn't reduce the image size.
I'd like to be able to mount a volume to access the .deb file, rather than COPY it, which is easy to do when running a container, but apparently not possible to do when building one?
What I've come up with so far is to build the image up to the point where I would ADD the file, then run a container from it with a volume mounted so I can access the file without COPYing it, then dpkg -i it, then do a docker commit to create an image from that container. Sure enough, I end up with an image that's over 3GB smaller than my first try, but that seems like a hack, and it makes scripting the build more complicated.
I'm thinking there must be a more appropriate way to achieve this, but so far my searching has not revealed an obvious answer. Am I missing something?

Relying on docker commit indeed amounts to a hack :) and its use is thus mentioned as inadvisable by some references such as this blog article.
I see only one possible approach for the kind of use case you mention (bring in a one-off .deb package, install it, and delete the file within the same image layer):
You could make the .deb you want to install available over the network to the Docker engine that builds your image, and replace the COPY + RUN directives with a single RUN, e.g., relying on curl:
RUN curl -OL https://example.com/foo.deb && dpkg -i foo.deb && rm -f foo.deb
If curl is not yet installed, you could run the usual APT commands beforehand:
RUN apt-get update -y -q \
 && DEBIAN_FRONTEND=noninteractive apt-get install -y -q --no-install-recommends \
      ca-certificates \
      curl \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*
Maybe there is another possible solution (but I don't think Docker's multi-stage builds feature would be of any help here, as all permissions would be lost by doing e.g. COPY --from=build / /).
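To make the single-RUN idea concrete, here is a minimal sketch that writes such a Dockerfile into a scratch directory. The DEB_URL build argument, the example.com URL, and the foo.deb file name are placeholders rather than anything from the question. The only point is that download, install, and delete all happen inside one layer, so the .deb never persists in the image.

```shell
#!/bin/sh
# Sketch: generate a Dockerfile that downloads, installs and deletes the .deb
# in ONE layer. DEB_URL and foo.deb are hypothetical placeholders.
workdir="$(mktemp -d)"

cat > "$workdir/Dockerfile" <<'EOF'
FROM ubuntu:18.04
ARG DEB_URL=https://example.com/foo.deb
RUN apt-get update -y -q \
 && DEBIAN_FRONTEND=noninteractive apt-get install -y -q --no-install-recommends \
      ca-certificates curl \
 && curl -fsSL -o /tmp/foo.deb "$DEB_URL" \
 && dpkg -i /tmp/foo.deb \
 && rm -f /tmp/foo.deb \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*
EOF

echo "Dockerfile written to $workdir/Dockerfile"
# To build (requires Docker and a reachable URL):
#   docker build --build-arg DEB_URL=https://example.com/foo.deb "$workdir"
```

If the package has unmet dependencies, dpkg -i will fail and the RUN step will need an extra dependency-resolution step; that part is left out of the sketch.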

Related

Docker SSH-Key looking for a simple solution

I'm trying to copy my SSH keys into my Docker image. It's a very simple image that includes some Linux tools installed via the package manager. I'm asking because I can't come up with a simple solution: ADD/COPY don't seem to work, and using a Docker volume or Compose seems over the top. Please advise.
FROM fedora:latest
RUN echo 'diskspacecheck=0' >> /etc/dnf/dnf.conf
RUN dnf -y update
RUN dnf -y install sshuttle \
&& dnf -y install git \
&& dnf -y install curl \
&& dnf -y install vim-X11 \
&& dnf -y install the_silver_searcher
RUN dnf -y clean packages
RUN adduser -m -p '' bowler
USER bowler
ADD /home/a/.ssh/id_rsa /home/bowler/.ssh/id_rsa

I would not add the key during the build. I would mount it when you run the container, since you are likely to store the Dockerfile and other files in VCS, and then everyone could see your private key!
Adding it when you start your container is therefore probably the better and more secure option :)
-v ${HOME}/.ssh/id_rsa:/home/bowler/.ssh/id_rsa
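Put together as a full command, this might look like the sketch below. The image name my-fedora-tools is hypothetical, and the :ro suffix (read-only mount) is an addition that is not part of the answer above.

```shell
#!/bin/sh
# Sketch: build the -v argument for a read-only bind mount of the private key.
# $1 = path to the key on the host (defaults to ~/.ssh/id_rsa)
key_mount_arg() {
    printf '%s:/home/bowler/.ssh/id_rsa:ro' "${1:-$HOME/.ssh/id_rsa}"
}

run_with_key() {
    # $1 = image name (hypothetical, e.g. my-fedora-tools)
    docker run --rm -it -v "$(key_mount_arg)" "$1"
}

echo "mount argument: $(key_mount_arg)"
# Usage (requires Docker and an existing key file): run_with_key my-fedora-tools
```

Because the key is mounted at run time, it never becomes part of any image layer.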

You can't copy files that live outside of the build context (the directory you pass to docker build) into an image. This is for security reasons. What you'll need to do is first copy your id_rsa file into the same directory as your Dockerfile, and then change the ADD to use the copy you just made, instead of the absolute path it currently uses.
I would also suggest changing the ADD to COPY, as it is easier to work with and has less unexpected behavior to trip over.
so at your command line:
cp ${HOME}/.ssh/id_rsa [path to dockerfile directory]
then update the dockerfile:
COPY id_rsa /home/bowler/.ssh/id_rsa
you might also need to add a RUN mkdir -p /home/bowler/.ssh
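The steps above can be sketched as follows. A dummy file stands in for the real key, the build context is a throwaway directory, and the image tag is hypothetical. The --chown flag on COPY is an assumption added here: COPY creates files owned by root by default, so without it the bowler user would probably not be able to read a key with mode 600.

```shell
#!/bin/sh
# Sketch: stage a key inside the build context and reference it with a relative COPY.
# A dummy file stands in for the real key; all paths are hypothetical.
context="$(mktemp -d)"
printf 'dummy key material\n' > "$context/id_rsa"   # real use: cp "$HOME/.ssh/id_rsa" "$context/"
chmod 600 "$context/id_rsa"

cat > "$context/Dockerfile" <<'EOF'
FROM fedora:latest
RUN adduser -m bowler
USER bowler
RUN mkdir -p /home/bowler/.ssh && chmod 700 /home/bowler/.ssh
# --chown is an assumption added here: without it the copied file is owned by root
COPY --chown=bowler:bowler id_rsa /home/bowler/.ssh/id_rsa
EOF

echo "Build context prepared in $context"
# To build (requires Docker): docker build -t my-fedora-tools "$context"
```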
Update
Based on the comments, I think it's worth adding a disclaimer here that if you go this route, then the image that you create needs to be handled with the same security considerations as your private SSH key.
It is much better to inject authentication credentials like this at runtime. That can be done by setting environment variables as part of the command that is used to start the container. Or, if you have a secure secrets repository, it could be added there and then downloaded by the container when it starts (ex. using cURL).
The approach of installing SSH keys directly to the image is not completely unreasonable, but proceed with caution and keep in mind that there may be a cleaner alternative.

How does docker cache the layer?

I have a Docker image whose build runs the following command:
RUN apt-get update --fix-missing && apt-get install -y --no-install-recommends build-essential debhelper rpm ruby ruby-dev sudo cmake make gcc g++ flex bison git libpcap-dev libssl-dev ninja-build openssh-client python-dev python3-pip swig zlib1g-dev python3-setuptools python3-requests wget curl unzip zip default-jdk && apt-get clean && rm -rf /var/lib/apt/lists/*
If I run the build a couple of times on the same day, the layer seems to be cached. However, Docker thinks the layer has changed when I run it for the first time each day.
I just wonder what's special about the above command that makes Docker think the layer changed?

This is not caused by docker. When docker sees a RUN command, all it does is simple string comparison to determine whether the layer is in the cache or not. If it sees it in cache, it will reuse it and if not, it will run it.
Since you mentioned that it builds from the cache all day and then doesn't the next day, the only possible explanation is that the cache was invalidated or deleted in the meantime by someone or something.
I don't know how/where you are running the docker daemon but it may be the case that it is running in VM that is being recreated each day from a base image which would then destroy all the cache and force docker to rebuild the image.
Another explanation is that you have some cleanup process running once a day, maybe some cron that deletes the cache.
Bottom line is that docker will happily reuse that cache for unlimited period of time, as long as the cache actually exists.
I am assuming that the previous layers (if there are any) were built from cache; otherwise you should check whether COPY/ADD commands are busting the cache because files in your build context changed.
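As a rough mental model of the behaviour described in this answer, a cached RUN layer can be thought of as keyed by its parent layer plus the literal instruction text. The sketch below is only an illustration of that idea, not Docker's actual implementation; the layer ids base-v1 and base-v2 are made up.

```shell
#!/bin/sh
# Simplified model of the build cache: a layer is identified by its parent layer
# plus the literal instruction text. Illustration only, not Docker's real code.
cache_key() {
    # $1 = parent layer id, $2 = instruction string
    printf '%s\n%s\n' "$1" "$2" | cksum | cut -d' ' -f1
}

cmd='RUN apt-get update && apt-get install -y curl'

same_a="$(cache_key base-v1 "$cmd")"
same_b="$(cache_key base-v1 "$cmd")"
changed_parent="$(cache_key base-v2 "$cmd")"

[ "$same_a" = "$same_b" ] && echo "same parent + same string: cache hit"
[ "$same_a" != "$changed_parent" ] && echo "different parent: cache miss"
```

In this model nothing about the date or the contents of the APT repositories enters the key, which is why an unchanged RUN line keeps hitting the cache as long as the cache and the earlier layers still exist.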

It's not the command, it's the steps that occur before it. Specifically, if the files being copied to previous layers were modified. I can be more specific if you'll edit the post to show all the steps in the Dockerfile before this one.

According to the docker doc:
Aside from the ADD and COPY commands, cache checking does not look at the files in the container to determine a cache match. For example, when processing a RUN apt-get -y update command the files updated in the container are not examined to determine if a cache hit exists. In that case just the command string itself is used to find a match
For a RUN command, just the command string itself is used to find a match. So maybe some process deleted the cached layer, or maybe you changed your Dockerfile?

Update of root certificates on docker

If I understand correctly, on standard Ubuntu systems for example, root certificates are provided by ca-certificates package and get updated when the package itself is updated.
But how can the root certificates be updated when using Docker containers? Is there a common preferred way of doing this, or must the containers be redeployed with an up-to-date Docker image?

The containers must be redeployed with an up-to-date image.
The Docker Hub base images like ubuntu actually get updated fairly regularly, and if you look at the tag list you can see that there are several date-stamped variants of the images. So one approach that will get you pretty close to current is to always (have your CI system) pull the base image before you build.
docker pull ubuntu:18.04
docker build .
If you can't do that, or if you're working from some sort of derived image that updates less frequently, you can just manually run apt-get upgrade in your Dockerfile. Doing this in the same place you're otherwise installing packages makes sense. It needs to be in the same RUN line as a matching apt-get update, and you might need some way to force Docker to not cache that update line to get current updates.
FROM python:3.8-slim
# Have an option to force rebuilds; the RUN line won't be
# cacheable if the dependency_stamp option changes
ARG dependency_stamp
ENV dependency_stamp=${dependency_stamp:-unknown}
RUN touch /dependencies.${dependency_stamp}
# Update base OS packages and install other things we need
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive apt-get upgrade --assume-yes \
 && DEBIAN_FRONTEND=noninteractive apt-get install \
      --no-install-recommends --assume-yes \
      ...
If you find yourself doing this routinely, maintaining your own base images that are upgraded to current packages but have nothing else installed can be helpful. In that case you might have more control over the process, and get smaller images, if you build an image FROM ubuntu and install e.g. Python, rather than building an image FROM python and then installing updates over it.
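One way to drive the dependency_stamp argument from a script is to pass the current date, so the update layer is rebuilt at most once a day. This is a sketch under that assumption; the function names and the my-app tag are hypothetical. The --pull flag asks docker build to check for a newer base image, which combines the two ideas from this answer.

```shell
#!/bin/sh
# Sketch: derive a date-based stamp so the RUN layers are rebuilt at most once a day.
dependency_stamp() {
    date +%Y%m%d
}

build_image() {
    # $1 = image tag (hypothetical), $2 = build context directory
    # --pull asks Docker to check for a newer base image before building.
    docker build --pull \
        --build-arg "dependency_stamp=$(dependency_stamp)" \
        -t "$1" "$2"
}

echo "today's stamp: $(dependency_stamp)"
# Usage (requires Docker): build_image my-app .
```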

I found an image on docker hub that I like but doesn't meet my needs. How do I update it and make it my own?

I found an image on docker (https://hub.docker.com/r/realbazso/horizon) that I like. I am trying to update this to where it runs the most current version of this software.
I tested running the image with the arguments provided and it works great, but the version of the VMWare Horizon client that the image has does not have an updated SSL library and cannot connect to the servers I need it to without throwing an SSL error.
I'm super new to docker, but I was wondering if anyone could help me with this. I'm wanting to install it on the ubuntu:14.04 image, but I'm just not able to wrap my head around it.

I am going to add some more information to @user2915097's answer.
The first thing to do when you want to edit/update an already existing image is to see if you can find its Dockerfile. Fortunately, this repo has a Dockerfile attached to it so it makes it easier. I commented the file so that you can understand better what is going on:
# Pulls the ubuntu image. This will serve as the base image for the container. You could change this and use ubuntu:16.04 to get the latest LTS.
FROM ubuntu:14.04
# RUN will execute the commands for you when you build the image from this Dockerfile. This is probably where you will want to change the source
RUN echo "deb http://archive.canonical.com/ubuntu/ trusty partner" >> /etc/apt/sources.list && \
dpkg --add-architecture i386 && \
apt-get update && \
apt-get install -y vmware-view-client
# CMD will execute the command (there can only be one!) when you start/run the container
CMD /usr/bin/vmware-view
A good resource to understand those commands is https://docs.docker.com/engine/reference/builder/. Make sure to visit that page to learn more about Dockerfile!
Once you have a Dockerfile ready to build, navigate to the folder where your Dockerfile is and run:
# Make sure to change the argument of -t
docker build -t yourDockerHubUsername/containerName .
You might need to modify your Dockerfile a few times before it works correctly. If you are having issues with Docker using cached data

Since you have the recipe, which you can see at
https://hub.docker.com/r/realbazso/horizon/~/dockerfile/
you should create a directory, put this Dockerfile in it, modify it, and build another image:
docker build -t tucker/myhorizon .
Then launch it, test it, and modify the Dockerfile again if needed.
Check the doc R0MANARMY listed.

Why are Docker container images so large? [closed]

I made a simple image through Dockerfile from Fedora (initially 320 MB).
Added Nano (this tiny editor of 1MB size), and the size of the image has risen to 530 MB. I've added Git on top of that (30-ish MB), and then my image size sky-rockets to 830 MB.
Isn't that insane?
I've tried to export and import container to remove history/intermediate images. This effort saved up to 25 MB, now my image size is 804 MB. I've also tried to run many commands on one RUN, but still I'm getting the same initial 830MB.
I'm having doubts about whether it is worth using Docker at all. I mean, I barely installed anything and I'm already close to 1 GB. If I have to add some serious stuff like a database and so on, I might run out of disk space.
Does anyone else suffer from ridiculously large images? How do you deal with it?
Unless my Dockerfile is horribly incorrect?
FROM fedora:latest
MAINTAINER Me NotYou <email@dot.com>
RUN yum -y install nano
RUN yum -y install git
but it's hard to imagine what could go wrong in here.

As @rexposadas said, images include all the layers, and each layer includes all the dependencies for what you installed. It is also important to note that the base images (like fedora:latest) tend to be very bare-bones. You may be surprised by the number of dependencies your installed software has.
I was able to make your installation significantly smaller by adding yum -y clean all to each line:
FROM fedora:latest
RUN yum -y install nano && yum -y clean all
RUN yum -y install git && yum -y clean all
It is important to do that for each RUN, before the layer gets committed, or else deletes don't actually remove data. That is, in a union/copy-on-write file system, cleaning at the end doesn't really reduce file system usage because the real data is already committed to lower layers. To get around this you must clean at each layer.
$ docker history bf5260c6651d
IMAGE         CREATED        CREATED BY                                      SIZE
bf5260c6651d  4 days ago     /bin/sh -c yum -y install git; yum -y clean a   260.7 MB
172743bd5d60  4 days ago     /bin/sh -c yum -y install nano; yum -y clean    12.39 MB
3f2fed40e4b0  2 weeks ago    /bin/sh -c #(nop) ADD file:cee1a4fcfcd00d18da   372.7 MB
fd241224e9cf  2 weeks ago    /bin/sh -c #(nop) MAINTAINER Lokesh Mandvekar   0 B
511136ea3c5a  12 months ago                                                  0 B
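The total image size is simply the sum of the layer sizes in the SIZE column. As a small worked example, the sketch below adds up the five values from the history listing above (in MB); the numbers are copied from that output, nothing else is assumed.

```shell
#!/bin/sh
# Sketch: add up the per-layer sizes (in MB) taken from the docker history output above.
total="$(printf '%s\n' 260.7 12.39 372.7 0 0 | awk '{ sum += $1 } END { printf "%.2f", sum }')"
echo "total image size: ${total} MB"
# prints: total image size: 645.79 MB
```

Most of that total is the base image (372.7 MB) and git with its dependencies (260.7 MB); nano itself accounts for only about 12 MB once the yum cache is cleaned in the same layer.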

Docker images are not large, you are just building large images.
The scratch image is 0B and you can use that to package up your code if you can compile your code into a static binary. For example, you can compile your Go program and package it on top of scratch to make a fully usable image that is less than 5MB.
The key is to not use the official Docker images; they are too big. Scratch isn't all that practical either, so I'd recommend using Alpine Linux as your base image. It is ~5MB, and you then add only what is required for your app. This post about Microcontainers shows you how to build very small images based on Alpine.
UPDATE: many official Docker images now offer Alpine-based variants, so those are good to use.

Here are some more things you can do:
Avoid multiple RUN commands where you can. Put as much as possible into one RUN command (using &&).
Clean up unnecessary tools like wget or git (which you only need for downloading or building stuff, but not to run your process).
With both of these AND the recommendations from @Andy and @michau, I was able to shrink my nodejs image from 1.062 GB to 542 MB.
Edit:
One more important thing:
"It took me a while to really understand that each Dockerfile command creates a new container with the deltas. [...] It doesn't matter if you rm -rf the files in a later command; they continue exist in some intermediate layer container."
So I have now managed to put apt-get install, wget, npm install (with git dependencies) and apt-get remove into a single RUN command, and my image is now only 438 MB.
Edit 29/06/17
Docker v17.06 introduces a new feature for Dockerfiles:
You can have multiple FROM statements inside one Dockerfile, and only the stuff from the last FROM will be in your final Docker image. This is useful to reduce image size, for example:
FROM nodejs as builder
WORKDIR /var/my-project
RUN apt-get update && \
    apt-get install -y ruby python git openssh gcc && \
    git clone my-project . && \
    npm install
FROM nodejs
COPY --from=builder /var/my-project /var/my-project
Will result in an image having only the nodejs base image plus the content from /var/my-project from the first steps - but without the ruby, python, git, openssh and gcc!

Yes, those sizes are ridiculous, and I really have no idea why so few people notice that.
I made an Ubuntu image that is actually minimal (unlike other so-called "minimal" images). It's called textlab/ubuntu-essential and is 60 MB.
FROM textlab/ubuntu-essential
RUN apt-get update && apt-get -y install nano
The above image is 82 MB after installing nano.
FROM textlab/ubuntu-essential
RUN apt-get update && apt-get -y install nano git
Git has many more prerequisites, so the image gets larger, about 192 MB. That's still less than the initial size of most images.
You can also take a look at the script I wrote to make the minimal Ubuntu image for Docker. You can perhaps adapt it to Fedora, but I'm not sure how much you will be able to uninstall.

The following helped me a lot:
After removing unused packages inside my container (e.g. redis, which freed 1200 MB), I did the following:
docker export [containerID] -o containername.tar
docker import -m "commit message here" containername.tar imagename:tag
The layers get flattened. The new image is smaller because the packages I removed from the container, as stated above, are no longer kept in any lower layer.
It took me a long time to understand this, and that's why I've added my comment.

As a best practice, you should execute a single RUN command, because
every RUN instruction in the Dockerfile writes a new layer in the image, and every layer requires extra space on disk. In order to keep the number of layers to a minimum, any file manipulation like installing, moving, extracting, removing, etc., should ideally be done in a single RUN instruction:
FROM fedora:latest
RUN yum -y install nano git && yum -y clean all

Docker Squash is a really nice solution to this. You can run $packagemanager clean in the last step instead of in every line, and then just run docker-squash to collapse all of the layers.
https://github.com/jwilder/docker-squash

Yes the layer system is quite surprising.
If you have a base image and you increment it by doing the following:
# Test
#
# VERSION 1
# use the centos base image provided by dotCloud
FROM centos7/wildfly
MAINTAINER JohnDo
# Build it with: docker build -t "centos7/test" test/
# Change user into root
USER root
# Remove temporary files and the contents of /wildfly
RUN rm -rf /tmp/* \
&& rm -rf /wildfly/*
The image has exactly the same size. That essentially means you have to pack a lot of extract, install and cleanup magic into each RUN step to make the image as small as the software installed.
This makes life much harder...
What docker build is missing is a way to run steps without committing a layer.

We had a similar issue in our Docker build process. Each image we built was significantly larger than the previous one. As it turned out, we were getting tar.gz files included in the image. Among these were the compressed images we upload to a server, so each image accidentally contained the prior images. Image sizes were soon in the 8 GB range.
.dockerignore is your friend. Make sure anything in your project not necessary to build the image is in the ignore file.
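A minimal sketch of such an ignore file is shown below, written into a throwaway directory. The patterns are examples chosen to match the situation described above (compressed archives) plus version control metadata; adjust them to your project.

```shell
#!/bin/sh
# Sketch: create a .dockerignore that keeps archives and VCS data out of the build context.
# The patterns are examples; adjust them to the project.
project="$(mktemp -d)"

cat > "$project/.dockerignore" <<'EOF'
# compressed artifacts anywhere in the context, e.g. previously exported images
**/*.tar.gz
# version control metadata
.git
EOF

cat "$project/.dockerignore"
```

Files matched by these patterns are not sent to the Docker daemon, so a COPY . or ADD . in the Dockerfile can no longer pull them into a layer.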
