Can anyone help me figure out why installing QIIME2 through Docker took around 20GB of my C drive?
Before installing QIIME2 I had 30GB free on my C drive, but only 8GB remained after installation.
Thank you!
The short answer to that question is: QIIME2 is pretty big. But I'm sure you knew that already, so let's dig into the details.
First, the QIIME image is roughly 12GB when uncompressed. (This raises the question of where the other 8GB went if you lost 20GB in total. I don't have an answer to that.)
Using a tool called dive, I can explore the QIIME image, and see where that disk space is going. There's one entry that stands out in the log:
5.9 GB |1 QIIME2_RELEASE=2022.8 /bin/sh -c chmod -R a+rwx /opt/conda
For reference, chmod is a command that changes the permissions on files and directories without changing their contents. Yet this command is responsible for half the size of the image. It turns out this is due to the way Docker works internally: if a layer changes the metadata or permissions of a file, then the original file must be re-included in that layer. (More information.)
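To see the effect in isolation, here is a minimal sketch (the base image, directory, and sizes are made up for illustration, not taken from the QIIME2 Dockerfile): a layer that only runs chmod still stores a full copy of every file it touches.
FROM debian:bookworm-slim
# Layer 1: create roughly 1GB of data.
RUN mkdir /opt/bigdir && dd if=/dev/zero of=/opt/bigdir/blob bs=1M count=1024
# Layer 2: only changes permissions, but because of Docker's copy-on-write layering
# the whole of /opt/bigdir is stored again, roughly doubling the image size.
RUN chmod -R a+rwx /opt/bigdir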
The remainder is 6GB, which comes mostly from a step where QIIME installs all of its dependencies. That's fairly reasonable for a project packaged with conda.
To summarize, it's an intersection of three factors:
Conda is fairly space-hungry, compared to equivalent pip packages.
QIIME has a lot of features and dependencies.
Every dependency is included twice.
Edit: this is now fixed in version 2022.11.
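For reference, the usual ways to avoid this kind of duplication look something like the sketch below (this is an assumption about the general technique, not a description of what the 2022.11 fix actually did; the install script name is purely illustrative):
# Option 1: run the chmod in the same RUN instruction that creates the files,
# so no second copy is stored in a later layer.
RUN ./install-qiime-env.sh && chmod -R a+rwx /opt/conda
# Option 2: when the files come from an earlier build stage, set permissions
# during the copy itself (requires BuildKit; the octal mode stands in for a+rwx).
COPY --from=builder --chmod=0777 /opt/conda /opt/conda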
Related
I'm trying to set up a Docker container (RHEL8) with Kaniko. In the Dockerfile I install Python 3.8 and use pip3 to install the Python libraries requested for this specific container. requirements.txt lists about 9 libraries (joblib, nltk, numpy, pandas, scikit-learn, scipy, spacy, torch, transformers), some of which are quite large (for example torch: 890M), but then, when I run
RUN python3.8 -m pip install -r requirements.txt
it runs through requirements.txt from top to bottom and downloads those packages, but after the last line it also downloads a lot of other packages, some quite huge in size, like:
nvidia-cublas-cu11 : 317M
nvidia-cudnn-cu11 : 557M
It installs a lot of packages, like: MarkupSafe, blis, catalogue, certifi, charset-normalizer, click, confection, cymem, filelock, huggingface-hub, idna, jinja, langcodes, murmurhash, etc.; the list is quite long.
I had to increase the disk size of the runner by 6G just to cope with the increased amount of downloaded data, but the build still fails at "Taking a snapshot of the full filesystem", due to running out of free disk space.
I increased the free disk space from 8G to 10G and then, in a second attempt, to 14G, but the build still fails. I have also specified the --single-snapshot option for Kaniko so it only takes one snapshot at the end instead of creating separate snapshots at every step (RUN, COPY). I have installed an Nvidia driver in the container, for which I picked a fairly lightweight one (450.102.04) that should not take up too much space either.
My question is: are the packages pip3 installs after the ones listed in requirements.txt basically dependencies that I still must install, or are they optional?
Is there any option to overcome this excessive disk space issue? When I start the build process (via GitLab CI and Kaniko) the available free space on the xfs filesystem is 12G of 14G, so that should be enough, but the build fails with exit code 1 and the message "no space left on drive".
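For context, here is a sketch of the kind of Kaniko invocation described above (the context, Dockerfile path, and destination are placeholders, not taken from the actual pipeline):
/kaniko/executor \
  --context "$CI_PROJECT_DIR" \
  --dockerfile "$CI_PROJECT_DIR/Dockerfile" \
  --destination "$CI_REGISTRY_IMAGE:latest" \
  --single-snapshot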
I have a question regarding an implementation of a Dockerfile on dask-docker.
FROM continuumio/miniconda3:4.8.2
RUN conda install --yes \
-c conda-forge \
python==3.8 \
[...]
&& rm -rf /opt/conda/pkgs
COPY prepare.sh /usr/bin/prepare.sh
RUN mkdir /opt/app
ENTRYPOINT ["tini", "-g", "--", "/usr/bin/prepare.sh"]
prepare.sh is just facilitating installation of additional packages via conda, pip and apt.
There are two things I don't get about that:
Why not just place those instructions in the Dockerfile? Possibly indirectly (modularized) by COPYing dedicated files (requirements.txt, environment.yaml, ...)
Why execute this via tini? At the end it does exec "$@", where one can start a scheduler or worker - that's more what I associate with tini.
This way, every time you run the container from the built image you have to repeat the installation process!?
Maybe I'm overthinking it, but it seems rather unusual - then again, maybe it's a Dockerfile pattern with good reasons behind it.
Optional bonus questions for Dask insiders:
Why copy prepare.sh to /usr/bin (instead of, for example, to /tmp)?
What purpose does the created directory /opt/app serve?
It really depends on the nature and usage of the files being installed by the entry point script. In general, I like to break this down into a few categories:
Local files that are subject to frequent changes on the host system, and will be rolled into the final image for production release. This is for things like the source code for an application that is under development and needs to be tested in the container. You want these to be copied into the runtime every time the image is rebuilt. Use a COPY in the Dockerfile.
Files from other places that change frequently and/or are specific to the deployment environment. This is stuff like secrets from a Hashicorp vault, network settings, server configurations, etc.... that will probably be downloaded into the container all the time, even when it goes into production. The entry point script should download these, and it should decide which files to get and from where based on environment variables that are injected by the host.
Libraries, executable programs (under /bin, /usr/local/bin, etc...), and things that specifically should not change except during a planned upgrade. Usually anything that is installed using pip, maven or some other program that does dependency management, and anything installed with apt-get or equivalent. These files should not be installed from the Dockerfile or from the entrypoint script. Much, much better is to build your base image with all of the dependencies already installed, and then use that image as the FROM source for further development. This has a number of advantages: it ensures a stable, centrally located starting platform that everyone can use for development and testing (it forces uniformity where it counts); it prevents you from hammering on the servers that host those libraries (constantly re-downloading all of those libraries from pypi.org is really bad form... someone has to pay for that bandwidth); it makes the build faster; and if you have a separate security team, this might help reduce the number of files they need to scan.
You are probably looking at #3, but I'm including all three since I think it's a helpful way to categorize things.
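As a sketch of option 3 (all image names and the requirements file are hypothetical): bake the dependencies into a base image, then have the application Dockerfiles start FROM it.
# Dockerfile for the base image, rebuilt only on planned upgrades:
FROM python:3.10-slim
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Dockerfile for the application image: no dependency downloads at build time.
FROM registry.example.com/myteam/python-base:1.0
COPY . /app
CMD ["python", "/app/main.py"]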
I'm running Julia on the Raspberry Pi 4. For what I'm doing, I need Julia 1.5, and thankfully there is a Docker image of it here: https://github.com/Julia-Embedded/jlcross
My challenge is that, because this is a work-in-progress development I find myself adding packages here and there as I work. What is the best way to persistently save the updated environment?
Here are my problems:
I'm having a hard time wrapping my mind around volumes that will save packages from Julia's package manager and keep them around the next time I run the container
It seems kludgy to commit my docker container somehow every time I install a package.
Is there a consensus on the best way or maybe there's another way to do what I'm trying to do?
You can persist the state of downloaded & precompiled packages by mounting a dedicated volume into /home/your_user/.julia inside the container:
$ docker run --mount source=dot-julia,target=/home/your_user/.julia [OTHER_OPTIONS]
Depending on how (and by which user) julia is run inside the container, you might have to adjust the target path above to point to the first entry in Julia's DEPOT_PATH.
You can control this path by setting it yourself via the JULIA_DEPOT_PATH environment variable. Alternatively, you can check whether it is in a nonstandard location by running the following command in a Julia REPL in the container:
julia> println(first(DEPOT_PATH))
/home/francois/.julia
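A sketch combining both ideas (the volume name and image name are placeholders): set the depot path explicitly and mount the volume at that same location, so the first DEPOT_PATH entry is known regardless of which user runs julia inside the container.
docker run \
  -e JULIA_DEPOT_PATH=/depot \
  --mount source=dot-julia,target=/depot \
  your-julia-image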
You can manage packages and their versions via a Julia Project.toml file.
This file keeps both the list of your dependencies and their version constraints (via its [compat] section).
Here is a sample Julia session:
julia> using Pkg
julia> pkg"generate MyProject"
Generating project MyProject:
MyProject\Project.toml
MyProject\src/MyProject.jl
julia> cd("MyProject")
julia> pkg"activate ."
Activating environment at `C:\Users\pszufe\myp\MyProject\Project.toml`
julia> pkg"add DataFrames"
Now the last step is to provide package version information to your Project.toml file. We start by checking the version number that works well:
julia> pkg"st DataFrames"
Project MyProject v0.1.0
Status `C:\Users\pszufe\myp\MyProject\Project.toml`
[a93c6f00] DataFrames v0.21.7
Now you want to edit the [compat] section of the Project.toml file to pin that version number to v0.21.7:
name = "MyProject"
uuid = "5fe874ab-e862-465c-89f9-b6882972cba7"
authors = ["pszufe <pszufe#******.com>"]
version = "0.1.0"
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
[compat]
DataFrames = "= 0.21.7"
Note that in the last line the equals sign appears twice: the one inside the quotes pins the exact version number (see also https://julialang.github.io/Pkg.jl/v1/compatibility/).
Now, in order to reuse that structure (e.g. in a different Docker container, or when moving between systems), all you do is:
cd("MyProject")
using Pkg
pkg"activate ."
pkg"instantiate"
Additional note
Also have a look at the JULIA_DEPOT_PATH variable (https://docs.julialang.org/en/v1/manual/environment-variables/).
When moving installations between containers it can also be convenient to control where all your packages are actually installed. For example, you might want to copy the JULIA_DEPOT_PATH folder between two containers with the same Julia installation to avoid the time spent installing packages, or you might be building the Docker image with no internet connection, etc.
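To tie this back to Docker, here is a sketch of a Dockerfile that reuses the project files so dependencies are resolved once at build time (the base image, paths, and entry script are assumptions, not from the original answer):
FROM julia:1.5
WORKDIR /app
# Copy the project files first so this layer stays cached until the dependency list changes
# (Manifest.toml is assumed to be committed; drop it here if it isn't).
COPY Project.toml Manifest.toml ./
RUN julia --project=/app -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'
COPY . .
CMD ["julia", "--project=/app", "src/MyProject.jl"]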
In my Dockerfile I simply install the packages just like you would do with pip:
FROM jupyter/datascience-notebook
RUN julia -e 'using Pkg; Pkg.add.(["CSV", "DataFrames", "DataFramesMeta", "Gadfly"])'
Here I start with a base data science notebook image which includes Julia, and then call Julia from the command line, instructing it to execute the code needed to install the packages. The only downside for now is that package precompilation is triggered each time I load the container in VS Code.
If I need new packages, I simply add them to the list.
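One possible way to avoid the repeated precompilation at container start (a sketch, not something from the original answer) is to trigger it once during the image build:
# Same install line as above, plus an explicit precompile step baked into the image.
RUN julia -e 'using Pkg; Pkg.add.(["CSV", "DataFrames", "DataFramesMeta", "Gadfly"]); Pkg.precompile()'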
Using multi-stage builds, I want to downsize an image at the end of the Dockerfile, something like this:
FROM ubuntu AS ubuntu_build
RUN # do a lot of build things
FROM alpine
COPY --from=ubuntu_build /app /app
ENTRYPOINT whatever
The alpine image is small, and in theory only the /app contents get copied from the ubuntu image. Is this the best trick in the book, or is there some other way to minimize the size of the final image?
Distroless
Google provides instructions and tools to help make distroless images.
"Distroless" images contain only your application and its runtime dependencies. They do not contain package managers, shells or any other programs you would expect to find in a standard Linux distribution.
Why should I use distroless images?
Restricting what's in your runtime container to precisely what's necessary for your app is a best practice employed by Google and other tech giants that have used containers in production for many years. It improves the signal to noise of scanners (e.g. CVE) and reduces the burden of establishing provenance to just what you need.
If your app is a compiled binary then you could get away with a single binary plus the shared libraries it links against. If you limit the libraries you link against you might only need a couple. Here, for instance, is what a minimal C program compiled with gcc links against on my machine:
$ ldd basic-program
linux-vdso.so.1 (0x00007fffd3fa2000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2e4611b000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2e4670e000)
Heck, you could even statically link the entire program and have no dependencies at all.
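For example, rebuilding that same toy program with static linking removes the runtime dependencies entirely (a sketch; the source file name is made up):
$ gcc -static -o basic-program basic-program.c
$ ldd basic-program
        not a dynamic executable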
Google provides a set of base images targeted at different languages:
gcr.io/distroless/static (Go, Rust, D)
gcr.io/distroless/base (C, C++)
gcr.io/distroless/java (Java)
gcr.io/distroless/cc (Rust, D)
They contain only some bare essential files, far less than what even a minimal distro like Alpine pulls in, since Alpine still has the apk package manager, userspace utilities, etc. You use them the same way you describe in your question: as the last stage in a multi-stage build.
FROM gcr.io/distroless/base
COPY --from=build /go/bin/app /
CMD ["/app"]
FROM scratch
You can also go the full raw-food-living-in-the-woods route and build your final image FROM scratch. It doesn't get any purer than that. There's absolutely nothing that you didn't put there yourself. This is the route traefik chose, for instance.
FROM scratch
COPY certs/ca-certificates.crt /etc/ssl/certs/
COPY traefik /
EXPOSE 80
VOLUME ["/tmp"]
ENTRYPOINT ["/traefik"]
Besides using multi-stage builds, another typical way to reduce the size of the final image is to use docker-slim, like this:
docker-slim build --http-probe your-name/your-app
Details are in this guide.
Here are some other common tips, excerpted from "Five Ways to Slim Your Docker Images", in case you need them:
Think Carefully About Your Application’s Needs
Install only what you really need in the Dockerfile.
Use a Small Base Image
You could use Alpine Linux or even build directly from scratch; see how to create-the-smallest-possible-docker-container.
Use as Few Layers As Possible
Combine RUN instructions: more RUNs mean more layers, and more layers mean more size (see the sketch after this list).
Use .dockerignore files
This avoids copying everything into the image when you use COPY . /. Moreover, if you don't use a .dockerignore file, you need to avoid COPY . / altogether, since it may copy unnecessary things into the image.
Squash Docker Images
The idea here is that after your image is created, you then flatten it as much as possible, using a tool such as docker-squash.
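As mentioned under "Use as Few Layers As Possible", here is a small sketch (curl is just an example package):
# Three separate RUNs: the files deleted in the last step are still stored in the earlier layers.
RUN apt-get update
RUN apt-get install -y curl
RUN rm -rf /var/lib/apt/lists/*

# One combined RUN: the cleanup happens before the layer is committed, so the image stays smaller.
RUN apt-get update \
    && apt-get install -y curl \
    && rm -rf /var/lib/apt/lists/*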
I'm creating an auto-testing service for my university. I need to take student code, put it into the project directory, and run tests.
This needs to be done for multiple different languages in an extensible way.
My initial plan:
Have a "base image" for each language (i.e. install the language runtime on buildpack-deps:stretch)
Take user files & pre-made project structure
Put user files into the correct location in the project
Build an image of the project extending the base image
Run the container. It will compile the project and run tests.
Save test results to the database, stop & delete the image
Rinse repeat for every submission
When testing manually, the image sizes are huge! Almost 1.5GB in size! I'm installing the runtime for one language, and I was testing with Hello World - so the project wasn't big either.
This "works", but feels very inefficient. I'm also very new to Docker – is there a better way to do this?
Cheers
In this specific application, I'd probably compile the program inside a container and not build an image out of it, since you're throwing it away immediately; the compilation and testing are the important parts and, unusually, you don't need the built program for anything afterwards.
If you assume that the input file gets into the container somehow, then you can write a script that does the building and testing:
#!/bin/sh
cd /project/src/student
tar xzf "/app/$1"
cd ../..
make
...
curl ??? # send the test results somewhere
Then your Dockerfile just builds this into an image, without any specific student code in it:
FROM buildpack-deps:stretch
RUN apt-get update && apt-get install ...
RUN adduser user
COPY build_and_test.sh /usr/local/bin
USER user
ADD project-structure.tar.gz /project
Then when you actually go to run it, you can use the docker run -v option to inject the submitted code.
docker run --rm -v $HOME/submissions:/app theimage \
build_and_test.sh student_name.tar.gz
In your original solution, note that the biggest things are likely to be the language runtime, C toolchain, and associated header files, and so while you get an apparently huge image, all of these things come from layers in the base image and so are shared across the individual builds (it's not taking up quite as much space as you think).
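If you want to confirm how much is actually shared, a couple of standard Docker commands make it visible (theimage is the image name from the example above):
$ docker history theimage   # per-layer sizes; the large layers come from the base image
$ docker system df -v       # the per-image SHARED SIZE column shows how much is reused across images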