Update of operating system takes forever while building Docker image

I'm using Debian:latest as the base image for my Docker containers.
Problem is that on every build I have to update the OS packages, which takes forever. Here is what I do:
FROM debian:latest
RUN apt-get update && apt-get install -y --force-yes --no-install-recommends nginx
...
apt-get update && apt-get install lasts forever. What do I do about this?

Docker images are minimal, including only the absolute necessities to run that base image. For the Debian base images, that means there is no package repo cache. So when you run apt-get update, it downloads the package repo cache for the first time from all the repos. If the images included the package repo cache, it would be many megabytes of package state that would quickly go out of date, resulting in larger base images with little time saved on a later update.
The actual debian:latest image is relatively well maintained with commits from last month. You can view the various tags for it here: https://hub.docker.com/_/debian/
To reduce your image build time, I'd recommend not deleting your image every time. Instead, do your new build and tag, and once the new image is built, you can run a docker image prune --force to remove the untagged images from prior builds. This allows docker to reuse the cache from prior image builds.
Alternatively, you can create your own base image that you update less frequently and that has all of your application prerequisites. Build it like any other image, and then change the FROM debian:latest to FROM your_base_image.
One last tip: avoid using latest in your image builds. Use something like FROM debian:9 instead, so that a major version update in Debian doesn't break your build.
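As a rough sketch of the "own base image" idea above, the base image could be built from a Dockerfile like the following. The pinned tag debian:12, the image name my-base, and nginx as the only prerequisite are assumptions chosen for illustration; the answer itself does not prescribe them.

```dockerfile
# Base image Dockerfile (a sketch; rebuild it only when prerequisites change).
# debian:12 is an assumed pinned tag; use whichever release you target.
FROM debian:12

# Update and install in one RUN so the package index and the install
# always live in the same layer, then drop the index to keep the layer small.
RUN apt-get update \
 && apt-get install -y --no-install-recommends nginx \
 && rm -rf /var/lib/apt/lists/*

# Build and tag it once (my-base is a hypothetical name):
#   docker build -t my-base:1 .
# Application Dockerfiles can then start with:
#   FROM my-base:1
```

With this split, day-to-day application builds skip the package download entirely, and the slow step only runs when you decide to refresh the base image.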

Don't delete the image on each build. Just modify your Dockerfile and build again. Docker is "smart": it keeps the unmodified layers and rebuilds only from the first line you changed. The intermediate images used for this purpose (Docker creates them automatically) can easily be removed with this command:
docker rmi $(docker images -q -f dangling=true)
You'll save a lot of time with this. Remember, don't delete the image. Just modify the Dockerfile and build again. Once everything is working, launch this command, and that's all you need.
Another command that is good to run for cleanup is:
docker volume prune -f
But this last command does a different kind of cleanup that is not related to images: it removes unused volumes left behind by containers.

Related

Combining multiple images in docker-compose [duplicate]

I have a few Dockerfiles right now.
One is for Cassandra 3.5, and it is FROM cassandra:3.5
I also have a Dockerfile for Kafka, but it is quite a bit more complex. It is FROM java:openjdk-8-jre and it runs a long command to install Kafka and Zookeeper.
Finally, I have an application written in Scala that uses SBT.
For that Dockerfile, it is FROM broadinstitute/scala-baseimage, which gets me Java 8, Scala 2.11.7, and SBT 0.13.9, which are what I need.
Perhaps, I don't understand how Docker works, but my Scala program has Cassandra and Kafka as dependencies and for development purposes, I want others to be able to simply clone my repo with the Dockerfile and then be able to build it with Cassandra, Kafka, Scala, Java and SBT all baked in so that they can just compile the source. I'm having a lot of issues with this though.
How do I combine these Dockerfiles? How do I simply make an environment with those things baked in?
You can, with the multi-stage builds feature introduced in Docker 17.05.
Take a look at this:
FROM golang:1.7.3
WORKDIR /go/src/github.com/alexellis/href-counter/
RUN go get -d -v golang.org/x/net/html
COPY app.go .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=0 /go/src/github.com/alexellis/href-counter/app .
CMD ["./app"]
Then build the image normally:
docker build -t alexellis2/href-counter:latest .
From : https://docs.docker.com/develop/develop-images/multistage-build/
The end result is the same tiny production image as before, with a significant reduction in complexity. You don’t need to create any intermediate images and you don’t need to extract any artifacts to your local system at all.
How does it work? The second FROM instruction starts a new build stage with the alpine:latest image as its base. The COPY --from=0 line copies just the built artifact from the previous stage into this new stage. The Go SDK and any intermediate artifacts are left behind, and not saved in the final image.
You can't combine dockerfiles as conflicts may occur. What you want to do is to create a new dockerfile or build a custom image.
TL;DR;
If your current development container contains all the tools you need and works, then save it as an image, upload it to a repo, and create a Dockerfile that pulls that image from the repo.
Details:
Building a custom image is by far easier than creating a Dockerfile on top of a public image, as you can store whatever hacks and mods you like in the image. To do so, start a blank container from a basic Linux image (or broadinstitute/scala-baseimage), install whatever tools you need and configure them until everything works correctly, then save the container as an image. Create a new container from this image and test whether you can build your code on top of it via docker-compose (or however you want to build it). If it works, then you have a working base image that you can upload to a repo so others can pull it.
To build a dockerfile with a public image, you will need to put all hacks, mods and setup on the dockerfile itself. That is, you will need to place every command line that you used into a text file and reduce whatever hacks, mods and setup into command lines. At the end, your dockerfile will create an image automatically and you don't need to store this image into a repo and all you need to do is to give others the dockerfile and they can spin the image up at their own docker.
Note that once you have a working Dockerfile, you can tweak it easily, as it will create a new image every time you build from it. With a custom image, you may run into issues where you need to rebuild the image due to conflicts. For example, all of your tools work with OpenJDK until you install one that doesn't. The fix may involve uninstalling OpenJDK and using the Oracle JDK instead, at which point all the configuration you did for the tools already installed is broken.
The following answer applies to Docker 17.05 and above:
I would prefer to use FROM image AS NAME together with --from=NAME.
Why?
You can use --from=0 and so on, but numeric indexes get hard to manage when you have many build stages in a Dockerfile.
Example:
# Stage 1: backend
FROM golang:1.7.3 AS backend
WORKDIR /backend
RUN go get -d -v golang.org/x/net/html
COPY app.go .
RUN #install some stuff, compile assets....
# Stage 2: assets
FROM golang:1.7.3 AS assets
WORKDIR /assets
RUN ./getassets.sh
# Stage 3: frontend, which pulls files from the "assets" stage by name
FROM node:latest AS frontend
RUN npm install
WORKDIR /assets
COPY --from=assets /assets .
CMD ["./app"]
# Final stage: merge the outputs of the named stages
FROM alpine:latest AS mergedassets
WORKDIR /root/
COPY --from=frontend . /
COPY --from=backend ./backend .
CMD ["./app"]
Note: Structuring the Dockerfile properly helps build the image much faster. Internally, Docker uses layer caching to speed up the process in case the image has to be rebuilt.
Yes, you can roll a whole lot of software into a single Docker image (GitLab does this, with one image that includes Postgres and everything else), but generalhenry is right - that's not the typical way to use Docker.
As you say, Cassandra and Kafka are dependencies for your Scala app, they're not part of the app, so they don't all belong in the same image.
Having to orchestrate many containers with Docker Compose adds an extra admin layer, but it gives you much more flexibility:
your containers can have different lifespans, so when you have a new version of your app to deploy, you only need to run a new app container, you can leave the dependencies running;
you can use the same app image in any environment, using different configurations for your dependencies - e.g. in dev you can run a basic Kafka container and in prod have it clustered on many nodes, your app container is the same;
your dependencies can be used by other apps too - so multiple consumers can run in different containers and all work with the same Kafka and Cassandra containers;
plus all the scalability, logging etc. already mentioned.
When might you want to "combine" Docker images?
As others are pointing out here, you typically don't want to put your database and your application into the same Docker image. Ideally you want a Docker image to wrap a "single process"/"runtime". This allows each process to be scaled up/down and restarted individually.
Let's say you want to use some shared C-libraries/executables that are not available in the package manager of the image you are using, but someone else has created an image where they are precompiled - and you might not want to recompile these binaries as part of your build (depending on how long this takes). Is there a way to quickly create a POC-Docker image containing all of these executables/libraries based on the existing images?
Docker and Composition
Relevant discussion: https://github.com/moby/moby/issues/3378
What Docker lacks is a good way of composing images. You can copy individual files or entire file systems from other images into your own using COPY --from=<image> <from-path> <to-path>. There is no builtin way of copying the environment variables from another image into your own.
That said, I have personally created a custom frontend/parser for Dockerfiles that adds an INCLUDE <image>-keyword. This copies the entire filesystem, along with the environment variables into your image:
DOCKER_BUILDKIT=1 docker build -t myimage .
#syntax=bergkvist/includeimage
FROM alpine:3.12.0
INCLUDE rust:1.44-alpine3.12
INCLUDE python:3.8.3-alpine3.12
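For comparison, the plain COPY --from=<image> form described above needs no custom frontend. A minimal sketch, assuming you only want a single file from another image (the nginx image and its config path are illustrative choices, not part of the original answer):

```dockerfile
FROM alpine:3.12.0

# Copy a single file out of another image without building on top of it.
# nginx:latest and the config path are illustrative; any image and path work.
COPY --from=nginx:latest /etc/nginx/nginx.conf /nginx.conf

# Environment variables of the source image are not carried over, so
# anything the copied files rely on has to be set again by hand, e.g.:
# ENV SOME_VAR=value   (hypothetical)
```

This covers the "copy a few precompiled binaries" case; it is the environment variables and other metadata that the INCLUDE keyword adds on top.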
nixpkgs.dockerTools
If you want truly composable Docker builds, I recommend checking out dockerTools in nixpkgs. This will also result in more reproducible (and typically very small) images. See https://nix.dev/tutorials/building-and-running-docker-images
docker load < $(nix-build docker-image.nix)
# docker-image.nix
let
pkgs = import <nixpkgs> {};
python = pkgs.python38;
rustc = pkgs.rustc;
in pkgs.dockerTools.buildImage {
name = "myimage";
tag = "latest";
contents = [ python rustc ];
}
Docker doesn't merge images, but nothing stops you from combining the Dockerfiles, if they are available, and rolling them into one fat image that you then build yourself. There are times when this makes sense. However, running multiple processes in one container is something most Docker dogma treats as less desirable, especially with a microservice architecture (but rules are there to be broken, right?).
You cannot combine Docker images into one container. See the detailed discussion in the Moby issue "How do I combine several images into one via Dockerfile".
For your case, it is better to not include the whole Cassandra and Kafka images. The application would only need the Cassandra Scala driver and Kafka Scala driver. The container should include the drivers only.
I needed docker:latest and python:latest images for Gitlab CI. Here is what I came up with:
FROM ubuntu:latest
RUN apt update
RUN apt install -y sudo
RUN sudo apt install -y docker.io
RUN sudo apt install -y python3-pip
RUN sudo apt install -y python3
RUN docker --version
RUN pip3 --version
RUN python3 --version
After that, I built it and pushed it to my Docker Hub repo:
docker build -t docker-hub-repo/image-name:latest path/to/Dockerfile
docker push docker-hub-repo/image-name:latest
Don't forget to run docker login before pushing.
Hope it helps

What is the difference between --no-cache and --rm when building a Docker image

After I accidentally filled up my disk by building Docker images one after another (iterations of the same image), I started searching for a workaround.
Even after I stopped the containers and deleted them, there was no space left on the HDD. (Here I used docker image prune.)
I was wondering whether adding --rm to future docker build commands would solve my problem.
Can --rm and --no-cache be used in the same build command? What's the difference between them?
--rm removes the intermediate containers after the final image has been built (this is the default behaviour).
--no-cache tells Docker not to use cached intermediate layers and to regenerate them all. Each instruction inside a Dockerfile generates an intermediate layer, for example RUN apt install -y some-package. In that scenario, the default behaviour is to reuse the previously generated layer without downloading and installing some-package again. Sometimes, however, you need to refresh the intermediate layers with more recent content, and that is when you use the --no-cache option. The two options are independent of each other and can be used in the same build command.

Purpose of specifying several UNIX commands in a single RUN instruction in Dockerfile

I have noticed that many Dockerfiles try to minimize the number of instructions by combining several UNIX commands in a single RUN instruction. Is there any reason for this?
Also is there any difference in the outcomes between the two Dockerfiles below?
Dockerfile1
FROM ubuntu
MAINTAINER demousr@example.com
RUN apt-get update
RUN apt-get install -y nginx
CMD ["echo", "Image created"]
Dockerfile2
FROM ubuntu
MAINTAINER demousr@example.com
RUN apt-get update && apt-get install -y nginx
CMD ["echo", "Image created"]
Roughly speaking, a Docker image contains some metadata & an array of layers, and a running container is built upon these layers by adding a container layer (read-and-write), the layers from the underlying image being read-only at that point.
These layers can be stored in the disk in different ways depending on the configured driver. For example, the following image taken from the official Docker documentation illustrates the way the files changed in these different layers are taken into account with the OverlayFS storage driver:
Next, the Dockerfile instructions RUN, COPY, and ADD create layers, and the best practices mentioned on the Docker website specifically recommend merging consecutive RUN commands into a single RUN command, to reduce the number of layers and thereby the size of the final image:
https://docs.docker.com/develop/dev-best-practices/
[…] try to reduce the number of layers in your image by minimizing the number of separate RUN commands in your Dockerfile. You can do this by consolidating multiple commands into a single RUN line and using your shell’s mechanisms to combine them together. […]
See also: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
Moreover, in your example:
RUN apt-get update -y -q
RUN apt-get install -y nginx
If you run docker build -t your-image-name . on this Dockerfile, then some time later edit it to add another package besides nginx and run docker build -t your-image-name . again, the Docker cache means that apt-get update -y -q won't be executed again, so the APT package index will be stale. This is another reason to merge the two RUN commands.
In addition to the space savings, it's also about correctness
Consider your first dockerfile (a common mistake when working with debian-like systems which utilize apt):
FROM ubuntu
MAINTAINER demousr@example.com
RUN apt-get update
RUN apt-get install -y nginx
CMD ["echo", "Image created"]
If two or more images follow this pattern, a cache hit could make the image unbuildable because of cached metadata:
Let's say I built an image similar to that a few weeks ago.
Now I'm building this image today. There's a cache present up to and including the RUN apt-get update line.
The docker build will reuse that cached layer (since the Dockerfile and base image are identical) up to the RUN apt-get update.
When the RUN apt-get install line runs, it will use the cached apt metadata (which is now weeks out of date and will likely cause errors).
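A version of the Dockerfile from the question that avoids the stale-cache problem might look like the following. It is a sketch rather than the one correct form: the rm -rf cleanup is a common convention that the question did not ask for, and MAINTAINER is swapped for a LABEL here because MAINTAINER is deprecated.

```dockerfile
FROM ubuntu
LABEL maintainer="demousr@example.com"

# One RUN means one cache entry: whenever the package list on this line
# changes, apt-get update runs again, so install never sees a stale index.
RUN apt-get update \
 && apt-get install -y nginx \
 && rm -rf /var/lib/apt/lists/*

CMD ["echo", "Image created"]
```

Adding a package later changes the text of the RUN line, which invalidates the cached layer and forces a fresh apt-get update along with the install.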

What is the difference between multiple RUN entries in Dockerfile and just one RUN entry?

What is the difference between multiple RUN entries in a Dockerfile like:
FROM php:5.6-apache
RUN docker-php-ext-install mysqli
RUN apt update
RUN apt install git -y -q
and just one RUN entry?
FROM php:5.6-apache
RUN docker-php-ext-install mysqli && apt update && apt install git -y -q
Note: I'm not asking which one is better. I want to know all the differences between the two approaches.
Each RUN command creates a layer of the filesystem changes generated by a temporary container started to run that command. (It's effectively running a docker run and then packaging the result of docker diff into a filesystem layer.)
These layers have a few key details to note:
They are immutable. Once you create them you don't change them. You would have to generate/recreate a new layer, to update your image.
They are reusable between multiple images and running containers. You can do this because of the immutability.
You do not delete files from a parent layer, but you can register that a file is deleted in a later layer. This is a metadata change in that later layer, not a modification to the parent layer.
Layers are reused in docker's build cache. If two different images, or even the same image being rebuilt, perform the same command on top of the same parent layer, docker will reuse the already created layer.
These layers are merged together into the final filesystem you see inside your container.
The main differences between the two approaches are in the build cache and in deleting files. If you split the download of a source code tgz, the extraction of the tgz, the compiling of a binary, and the deletion of the tgz and source folders into multiple RUN lines, then when you ship the image over the network and store it on disk, you will have all of the source in the layers even though you don't see it in the final container. Your image will be significantly larger.
Caching can also be a bad thing when you cache too much. If you split the apt update and apt install, and then add a new package to install to your second run line months later, docker will reuse the months old cache of apt update and try to install packages that are months old, possibly no longer available, and your image may fail to build. Many people also run a rm -rf /var/lib/apt/lists/* after installing debian packages. And if you do this in a separate step, you will not actually delete the files from the previous layers, so your image will not shrink.
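The download, extract, compile, and delete sequence described above could be kept in one layer roughly as follows. This is a sketch under assumptions: the tarball URL, the make install build step, and the debian:12 base are all hypothetical stand-ins, and the tarball is assumed to contain a single top-level directory.

```dockerfile
FROM debian:12

# Hypothetical tarball location; pass the real one with --build-arg SRC_URL=...
ARG SRC_URL=https://example.com/project.tgz

# Download, extract, build, and clean up in a single RUN. Because the tarball
# and the source tree are deleted before the layer is committed, they never
# end up in any layer of the image.
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates curl gcc libc6-dev make \
 && curl -fsSL -o /tmp/src.tgz "$SRC_URL" \
 && mkdir /tmp/src \
 && tar -xzf /tmp/src.tgz -C /tmp/src --strip-components=1 \
 && make -C /tmp/src install \
 && rm -rf /tmp/src /tmp/src.tgz /var/lib/apt/lists/*
```

The trade-off is the one the answer names: any change to this line reruns the whole sequence, since the single layer is also a single cache entry.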

Is it possible to remove unwanted packages from docker image?

I'm trying to reduce the size of my Docker image, which is based on CentOS 7.2.
The issue is that it's 257MB, which is too large...
I have followed the best practices to write Dockerfile in order to reduce the size...
Is there a way to modify the image after the build and rebuild that image to see the size reduced ?
First of all, if you want a small image, don't start with a big base like CentOS; you can start with Alpine, which is small.
Now if you are still keen on using CentOS, do the following:
docker run -d --name centos_minimal centos:7.2.1511 tail -f /dev/null
This starts a container in the background that simply keeps running. You can then get into the container using
docker exec -it centos_minimal bash
Now start removing the packages that you don't need using yum remove (yum erase is an alias for it). Once you are done, you can commit the image
docker commit centos_minimal centos_minimal:7.2.1511_trial1
Experimental Squash Image
Another option is to use an experimental feature of the build command. For this you can have a Dockerfile like the one below
FROM centos:7
RUN yum -y remove package1 package2 package3
Then build this file using
docker build --squash -t centos_minimal:squash .
For this you need to add "experimental": true to your /etc/docker/daemon.json and then restart the docker server
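If you would rather not enable experimental features, a multi-stage build (Docker 17.05 or later) can give a similar flattening effect. This is a different technique from --squash, sketched here with the same placeholder package names; it is not something the answer itself proposes.

```dockerfile
# Stage 1: start from the full image and remove what you do not need.
# package1 and package2 are placeholders, as in the example above.
FROM centos:7 AS trimmed
RUN yum -y remove package1 package2 \
 && yum clean all

# Stage 2: copy the resulting filesystem into an empty image as one layer,
# so files deleted in stage 1 are not carried along in lower layers.
FROM scratch
COPY --from=trimmed / /

# Image metadata (CMD, ENV, EXPOSE, ...) is not copied and must be restated.
CMD ["/bin/bash"]
```

The resulting image no longer shares layers with centos:7, so the disk savings apply per image rather than across images built on the same base.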
It is possible, but not at all elegant. Just like you can add software to the base image, you could also remove:
FROM centos:7
RUN yum -y update && yum clean all
RUN yum -y install new_software
RUN yum -y remove obsolete_software
Ask yourself: does your OS have to be CentOS? If so, I would recommend you use the default installation and make sure you have enough disk space and memory.
If it does not need to be CentOS, you should rather start with a more minimalistic image. See the discussion here:
Which Docker base image should be used to install Apps in a container without any additional OS?

Resources