Docker layers are additive, meaning that purging packages in a subsequent layer will not remove them from the previous one, and thus from the image.
In my understanding, what happens is that an additional masking layer is created, in which those packages are not shown anymore.
Indeed, if I build the MWE below and then run apt list --installed | grep libpython3.9-minimal after the purging, the package cannot be found.
However, I still don't understand entirely what happens under the hood.
Are the packages effectively still there, but masked?
If one of these packages has a known vulnerability, is purging (i.e. masking) an actual fix, or will we still have the issue without being aware of it, because the package no longer shows up in an image scan but is still present in an earlier layer?
FROM openjdk:11
# Remove packages
RUN apt-get purge -y libpython3.9-minimal
RUN apt-get autoremove -y
ENTRYPOINT ["/bin/bash"]
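For reference, one way I can think of to poke at this from the outside (the image tag is arbitrary):
docker build -t purge-test .
# each instruction adds a layer; the purge/autoremove layers are small,
# and no earlier layer shrinks
docker history purge-test
# export all layers; the base layers' tar archives still contain the
# files of the "removed" package
docker save purge-test -o purge-test.tar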
Related
I need to have a hierarchy of the following Docker images:
A "base"image:
FROM python:3.5-slim-stretch
RUN apt install -y python3-enchant enchant libpq-dev gcc && apt clean
And a child image that inherits from the "base", like so:
FROM myprivaterepo:30999/base-image
ENV PATH /usr/lib/postgresql/9.5/bin:$PATH
COPY requirements.txt .
RUN pip3 install -r requirements.txt
The requirements.txt includes packages that are meant to be built with gcc, and one of them needs to find the pg_config binary provided by the libpq-dev package. The problem is that it cannot find them, even though the child image inherits from the base image and starts building normally. (If I install them in the child image instead, it all works - but that's not what I want.)
Any idea what I'm doing wrong? Many thanks.
Have you ever built the base-image without that software? If so, it might be a Docker image caching problem, i.e. your child image is based on an old cached version of the base-image.
Verify that the following hashes match:
Building your base image prints as its last line:
Successfully built <hash>
Building your child image prints at the beginning:
Step 1/x : FROM myprivaterepo:30999/base-image
---> <hash>
The <hash> should be identical.
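If the hashes differ, rebuilding the base image and forcing the child build to pull it again should fix it; for example (the base tag is taken from the question, the child tag is made up):
# rebuild the base image without cache and push it
docker build --no-cache -t myprivaterepo:30999/base-image .
docker push myprivaterepo:30999/base-image
# force the child build to pull the latest base image instead of a stale local copy
docker build --pull -t myprivaterepo:30999/child-image .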
Currently I am copying pre-downloaded packages into the Docker image and then installing them. The COPY layer ends up the same size as the directory being copied, even though the directory is later erased in another layer. The Dockerfile looks as follows:
COPY python-packages /tmp/python-packages
RUN pip install -f /tmp/python-packages --no-index <pkg-name> \
&& rm -rf /tmp/*
Is there a way to copy files without having a layer the same size as the directory being copied? Any way to reduce COPY layer size?
Unfortunately, as of this time, you cannot reduce the size of or eliminate the layer; RUN, COPY and ADD each create a layer every time.
What you can do is use pip to install directly from your version control
e.g. pip install git+https://git.example.com/MyProject#egg=MyProject
More info: https://pip.pypa.io/en/stable/reference/pip_install/#vcs-support
The downside is that, if your code is private, you will have to give pip access to it, and the build will need network connectivity to your private network or to the internet (depending on where your code lives) at docker build time.
You could also use a multi-stage build: install the Python module with pip in another Docker image and then copy the artifacts into the final image. I strongly recommend against this unless you have no choice and understand the risks, since you would have to copy all the folders and/or files pip touches during the install, maybe create some others that it expects to be present, and get the permissions right in the final image. This is hard to get right without deep-diving into pip internals, and hard to maintain in the long run, since pip might change its file and folder locations and/or structure in the future.
More on multi stage builds: https://docs.docker.com/develop/develop-images/multistage-build/
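If you do go down that route despite the caveats, a rough sketch might look like this (the pip --prefix trick and the paths are my assumption; <pkg-name> is the same placeholder as in the question, and the Python versions of the two stages must match):
FROM python:3.9-slim AS builder
COPY python-packages /tmp/python-packages
# install into a throwaway prefix so everything to copy lives in one place
RUN pip install -f /tmp/python-packages --no-index --prefix=/install <pkg-name>

FROM python:3.9-slim
# only the installed files end up in the final image; the COPY of
# /tmp/python-packages stays behind in the builder stage
COPY --from=builder /install /usr/local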
Two constraints are often important in writing Dockerfiles: image size and image build time.
It's a commonplace observation that time and space usage can often be traded off for one another. However, it can be useful to avoid that choice by going for fast build time in development and small-but-slower builds in production.
For example, if I write something like this in a project I can quickly rebuild the images in development when frequently_changing_source_code changes, because there is a layer with build-essential installed that can be reused in the derived image:
base image:
RUN apt-get install -y build-essential python-dev && \
    pip install some-pypi-project
ADD frequently_changing_source_code .
derived image:
FROM base_image
RUN pip install another-pypi-project-requiring-build-essential
ADD more_stuff .
The above results in larger builds than this next version, which achieves the same functionality but sacrifices build times. Now whenever frequently_changing_source_code changes, rebuilding the derived image results in a re-install of build-essential:
base image:
RUN apt-get install -y build-essential python-dev && \
    pip install some-pypi-project && \
    apt-get remove -y build-essential python-dev
ADD frequently_changing_source_code .
derived image:
FROM base_image
RUN apt-get install -y build-essential python-dev && \
    pip install another-pypi-project-requiring-build-essential && \
    apt-get remove -y build-essential python-dev
ADD more_stuff .
I can imagine ways of solving this: for example, writing a slightly more complicated set of Dockerfiles that are parameterized on some sort of development flag, which has the first behaviour for development builds, and the second for production builds. I suspect that would not result in Dockerfiles that people like to read and use, though.
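For concreteness, that parameterized approach might look roughly like this (the DEV build arg and its semantics are made up):
ARG DEV=false
RUN apt-get install -y build-essential python-dev && \
    pip install some-pypi-project && \
    # skip the removal in development builds so the toolchain stays available
    if [ "$DEV" != "true" ]; then apt-get remove -y build-essential python-dev; fi
ADD frequently_changing_source_code .
A development build would pass --build-arg DEV=true; the default would produce the smaller production layer.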
So, how can I best achieve my ends without surprising other developers: i.e. using Dockerfiles that respect docker conventions as much as I can?
Some notes about answers I've considered:
I'm aware of the layer-caching behaviour of docker (that is why the ADD commands for both images in my example are at the end).
I'm aware that one can mount code using -v. Using -v is my usual practice, but this question is about building images, which is also something that happens in development (from time to time, it happens quite a lot).
One obvious suggestion is to eliminate the base image. However, note that for the projects concerned, the base image is typically a base for multiple images, so merging the base with those would result in a bunch of repeated directives in each of those Dockerfiles. Perhaps this is the least-worst option, though.
Also note that (again, in the projects I'm involved with) the mere presence of frequently_changing_source_code does not by itself significantly contribute to build times: it is re-installs of packages like build-essential that do that. another-pypi-project-requiring-build-essential typically does contribute significantly to build times, but perhaps not enough to need to eliminate that step in development builds too.
Finally, though it is a commonly-cited nice feature of docker that it's possible to use the same configuration in development as in production, this particular source of variation is not a significant concern for us.
In the past there hasn't really been a good answer to this. You either build two different images, one for fast moving developers and the other for compact distribution, or you pick one that's less than ideal for others. There's a potential workaround if the developers compile the code themselves and simply mount their compiled product directly into the container as a volume for testing without a rebuild.
But last week Docker added the ability to do a multi-stage build in 17.05.0-ce-rc1 (see PR 32063). Multi-stage builds let you build parts of the app in separate stages and copy the results into another image at the end; all of the layers are still cached, but the final image only contains the layers of the last stage of the build. So for your scenario, you could have something like:
FROM debian:latest as build-env
# you can split these run lines now since these layers are only used at build
RUN apt-get update && apt-get install -y build-essential python-dev
RUN pip install some-pypi-project
RUN pip install another-pypi-project-requiring-build-essential
# you only need this next remove if the build tools are in the same folders as the app
RUN apt-get remove -y build-essential python-dev
FROM debian:latest
# update this copy command depending on the pip install location
COPY --from=build-env /usr/bin /usr/bin
ADD frequently_changing_source_code .
ADD more_stuff .
All the layers in the first build environment stick around in the cache, letting developers add and remove as they need to without rerunning the build-essential install. But in the final image there are just three layers added: one COPY from build-env and a couple of ADDs, resulting in a small image. And if they only change the files in those ADD commands, then only those steps rerun.
Here's an early blog post going into it in more detail. This is now available as an RC and you can expect it in the 17.05 edge release from docker, hopefully in the next few weeks. If you want to see another example of this really put to use, have a look at the miragesdk Dockerfile.
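As a side note for the development case: if the --target flag is available in your Docker version, developers can stop the build after the first stage and iterate on it alone (the tags are illustrative, the stage name comes from the example above):
# build only the build-env stage, reusing its cached layers
docker build --target build-env -t myapp:build-env .
# full build, producing the small final image
docker build -t myapp:latest .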
I've created a Dockerfile for an application I'm building that has a lot of large apt-get package dependencies. It looks something like this:
FROM ubuntu:15.10
RUN apt-get update && apt-get install -y \
lots-of-big-packages
RUN install_my_code.sh
As I develop my application, I keep coming up with unanticipated package dependencies. However, since all the packages are in one Dockerfile instruction, even adding one more breaks the cache and requires the whole lot to be downloaded and installed, which takes forever. I'm wondering if there's a better way to structure my Dockerfile?
One thought would be to put a separate RUN apt-get update && apt-get install -y command for each package, but running apt-get update lots of times probably eats up any savings.
The simplest solution would be to just add a second RUN apt-get update && apt-get install -y right after the first as a catch-all for all of the unanticipated packages, but that divides the packages in an unintuitive way (i.e., "when I realized I needed it"). I suppose I could combine them when dependencies are more stable, but I find I'm always overly optimistic about when that is.
Anyway, if anyone has a better way to structure it I'd love to hear it. (all of my other ideas run against the Docker principles of reproducibility)
I think you need to run apt-get update only once within the Dockerfile, typically before any other apt-get commands.
You could just first have the large list of known programs to install, and if you come up with a new one then just add a new RUN apt-get install -y abc to your Dockerfile and let docker continue from the previously cached command. Periodically (once a week, once a month?) you could reorganize them as you see fit, or just run everything in a single command.
I suppose I could combine them when dependencies are more stable, but
I find I'm always overly optimistic about when that is.
Oh, you actually mentioned this solution already; anyway, there is no harm in doing these "tweaks" every now and then. Just run apt-get update only once.
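A sketch of what that might look like against the question's Dockerfile (the late package names are placeholders):
FROM ubuntu:15.10
RUN apt-get update && apt-get install -y \
    lots-of-big-packages
# packages discovered later get their own cached layers; no new apt-get update
# is needed while the package lists from the layer above are still present
RUN apt-get install -y newly-discovered-package
RUN apt-get install -y another-late-addition
RUN install_my_code.sh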
It would seem that apt-get is having issues connecting to the repository servers. I suppose it is likely a compatibility issue, as mentioned here, but the proposed solution of apt-get clean does not work for me. I am also surprised, if this is the case, that more people are not having my issue.
MWE
Dockerfile
FROM debian:jessie
RUN apt-get clean && apt-get update && apt-get install -y --no-install-recommends \
git
$ docker build .
Sending build context to Docker daemon 2.048 kB
Step 0 : FROM debian:jessie
---> 4a5e6db8c069
Step 1 : RUN apt-get clean && apt-get update && apt-get install -y --no-install-recommends git
---> Running in 43b93e93feab
Get:1 http://security.debian.org jessie/updates InRelease [63.1 kB]
... some omitted ...
Get:6 http://httpredir.debian.org jessie-updates/main amd64 Packages [3614 B]
Fetched 9552 kB in 7s (1346 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following extra packages will be installed:
... some omitted ...
0 upgraded, 26 newly installed, 0 to remove and 0 not upgraded.
Need to get 13.2 MB of archives.
After this operation, 64.0 MB of additional disk space will be used.
Get:1 http://security.debian.org/ jessie/updates/main libgnutls-deb0-28 amd64 3.3.8-6+deb8u2 [694 kB]
... some omitted ...
Get:5 http://httpredir.debian.org/debian/ jessie/main libnettle4 amd64 2.7.1-5 [176 kB]
Err http://httpredir.debian.org/debian/ jessie/main libffi6 amd64 3.1-2+b2
Error reading from server. Remote end closed connection [IP: 176.9.184.93 80]
... some omitted ...
Get:25 http://httpredir.debian.org/debian/ jessie/main git amd64 1:2.1.4-2.1 [3624 kB]
Fetched 13.2 MB in 10s (1307 kB/s)
E: Failed to fetch http://httpredir.debian.org/debian/pool/main/libf/libffi/libffi6_3.1-2+b2_amd64.deb Error reading from server. Remote end closed connection [IP: 176.9.184.93 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
The command '/bin/sh -c apt-get clean && apt-get update && apt-get install -y --no-install-recommends git' returned a non-zero code: 100
Please note that I also posted here with a different issue. I believe it to be unrelated, but it may well be.
For anyone having an issue with this, here is my attempt to "fix" it by swapping out httpredir for a single working domain whenever the Dockerfile is built:
FROM debian:je...
# Insert this line before "RUN apt-get update" to dynamically
# replace httpredir.debian.org with a single working domain
# in attempt to "prevent" the "Error reading from server" error.
RUN sed -i "s/httpredir.debian.org/`curl -s -D - http://httpredir.debian.org/demo/debian/ | awk '/^Link:/ { print $2 }' | sed -e 's#<http://\(.*\)/debian/>;#\1#g'`/" /etc/apt/sources.list
# Continue with your apt-get update...
RUN apt-get update...
What this command does is:
Curl http://httpredir.debian.org/demo/debian/ from the building machine to get the headers from the Debian demo page (-s is silent mode, -D dumps the headers).
Extract the headers, find the Link header fragment. This contains the best route as recommended by httpredir.
The last sed -e ... extracts the domain name from the link found in step 2.
Finally, the domain found in step 3 is fed into the outer sed command, which replaces httpredir.debian.org in /etc/apt/sources.list.
This is not a fix, but rather a simple hack to (greatly) reduce the chance of a failed build. And pardon me if it looks odd; this is my first attempt at sed and piping.
Edit
On a side note, if the domain it picks is simply too slow or is not responding as it should, you may want to do it manually:
Visit http://httpredir.debian.org/demo.html, and you should see a link there like http://......./debian/. For example, at the time of writing, I saw http://mirrors.tuna.tsinghua.edu.cn/debian/
Instead of the long RUN sed -i ... command, use this:
RUN sed -i "s/httpredir.debian.org/mirrors.tuna.tsinghua.edu.cn/" /etc/apt/sources.list
I added apt-get clean to my Dockerfile before the apt-get update line, and it seems to have done the trick.
I have no way of knowing whether it was the extra command or just luck that fixed my build, but I took the advice from https://github.com/CGAL/cgal-testsuite-dockerfiles/issues/19
The httpredir.debian.org mirror is "magic" in that it load-balances and geo-IPs you to transparently increase performance and availability. I would therefore immediately suspect it of causing your problem, or at least make it the first thing to rule out.
I would check whether you can:
Still reproduce the problem; httpredir.debian.org will throw out "bad" mirrors from its internal lists so your issue may have been temporary.
Reproduce the problem with a different, non-httpredir.debian.org mirror. Try something like ftp.de.debian.org. If it then works with this mirror, do please contact the httpredir.debian.org maintainer and report the issue to them. They are quite responsive and open to bug reports.
For those visiting with similar issues, using the --no-cache flag with docker build may help. Similar issues (though not this exact one) can occur if the apt-get update layer is old and is not re-run due to caching.
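For example (the image tag is arbitrary):
docker build --no-cache -t myimage .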
Not enough reputation to comment on previous answers, so I will (confusingly) add a new answer:
I don't think hardcoding a single mirror is really a viable solution, since, as seen here for example, there's a reason Debian implemented the whole httpredir thing: mirrors go down or fall out of date.
I've dealt with this issue many times, and the logs always indicate that Docker is actually running the apt-get command, which means --no-cache is unlikely to be the fix; it's just that if you rebuild, httpredir is likely to pick a different mirror even if you don't change anything in your Dockerfile, and the build will work.
I was able to solve this problem by adding -o 'Acquire::Retries=3' to the apt-get install command.
RUN apt-get update && apt-get install -o 'Acquire::Retries=3' -y git
It seems that it automatically retries acquiring the package from another mirror.
apt-get acquire documentation
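If you want the retries applied to every apt-get call in the image rather than a single command, the option can also, as far as I know, be written to an apt configuration snippet (the file name is arbitrary):
# persist the retry setting for all subsequent apt-get invocations
RUN echo 'Acquire::Retries "3";' > /etc/apt/apt.conf.d/80-retries \
 && apt-get update && apt-get install -y git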
EDIT: After some investigation I found that my problem was related to the proxy server I was using. I'll leave the answer here anyway in case it helps anyone.