I've been working with Docker recently to host an RStudio-server instance. There are a number of packages that need to be installed, and one way to do this in a Dockerfile is this (with many more lines, obviously):
RUN Rscript -e "install.packages('beanplot')"
RUN Rscript -e "install.packages('beeswarm')"
RUN Rscript -e "install.packages('boot')"
RUN Rscript -e "install.packages('caTools')"
I see many instances where this is done this way:
RUN Rscript -e "install.packages(c('beanplot','beeswarm','boot','caTools'))"
Also, I often see various executable lines chained together like this:
RUN yum -y update \
&& yum -y groupinstall 'Development Tools' \
&& yum -y install epel-release \
vim \
initscripts \
libpng-devel \
mesa-libGL \
mesa-libGL-devel \
mesa-libGLU \
mesa-libGLU-devel \
ypbind \
rpcbind \
authconfig \
&& yum -y install R \
&& mkdir /rhome
rather than having each && line as a separate RUN line.
I had assumed the benefit was to reduce the size of the docker image, but when I tested a large example, either method resulted in the same size.
What is the advantage of chaining commands rather than having individual RUN commands for each line?
Each RUN command adds a new layer to the image, and there is an upper limit on the number of layers allowed (commonly cited as 127, depending on the storage driver). The limit is enforced for performance reasons.
Every time an application that runs inside the container needs to access a file, the engine looks for the file in these layers, from top to bottom, until it finds it. If the application attempts to change a file that is not in the topmost layer, the engine first copies the file to the topmost layer and then applies the application's writes to the copy.
The topmost layer is writeable. It is not stored in the image but it is part of the container. The layers stored in the image are read-only.
It is all explained in the documentation. Keeping a small number of layers is recommended as best practice.
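As a minimal sketch of the one-layer variant from the question: the r-base base image and the explicit repos argument are assumptions, not something the question specifies.

```dockerfile
# Assumed base image; the question does not say which one is used
FROM r-base

# One RUN instruction, so one layer for all four packages.
# repos is set explicitly (an assumption) because a non-interactive
# Rscript session cannot prompt for a CRAN mirror.
RUN Rscript -e "install.packages(c('beanplot', 'beeswarm', 'boot', 'caTools'), repos = 'https://cloud.r-project.org')"
```

The installed files are the same either way, which fits the observation that the image size did not change; what differs is the number of layers.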
Our embedded system product is built in an Ubuntu 12.04 with some ancient tools that are no longer available. We have the tools in our local git repo.
Setting up the build environment for a newcomer is extremely challenging. I would like to set up the build environment in a Docker container, download the source code onto the host machine, mount the source code into the container, and execute the build there, so that someone starting fresh doesn't have to endure the challenging setup. Is this a reasonable thing to do?
Here is what I have done so far:
Created a dockerfile to set up the env
# Ubuntu 12.04.5 LTS is the standard platform for development
FROM ubuntu:12.04.5
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
build-essential \
dialog \
autoconf \
automake \
libtool \
libgtk-3-dev \
default-jdk \
bison \
flex \
php5 \
php5-pgsql \
libglib2.0-dev \
gperf \
sqlite3 \
txt2man \
libssl-dev \
libudev-dev \
ia32-libs \
git
ENV PATH="$PATH:toolchain/bin"
The last line (ENV ...) sets the path to the toolchain location. Also there are a few more env variables to set.
On my host machine I have my source pulled into my working dir.
Built the docker image using:
docker build --tag=myimage:latest .
And then I mounted the source code as a volume to the container using:
docker run -it --volume /path/to/host/code:/path/in/container myimage
All this works: it mounts the code in the container, I am in the container's terminal, and I can see the code. However, I don't see the path I set to the toolchain in my dockerfile. I was hoping the path would get set and I could call make.
Is this not how it is supposed to work, is there a better way to do this?
I have tried to build a docker image and found that the PATH variable I set has some issues. A Minimal non-working example is:
FROM ubuntu:latest
SHELL ["/bin/bash", "-cu"]
ARG CTAGS_DIR=/root/tools/ctags
# Install common dev tools
RUN apt-get update --allow-unauthenticated \
&& apt-get install --allow-unauthenticated -y git curl autoconf pkg-config zsh
# Compile ctags
RUN cd /tmp \
&& git clone https://github.com/universal-ctags/ctags.git \
&& cd ctags \
&& ./autogen.sh \
&& ./configure --prefix=${CTAGS_DIR} \
&& make -j$(nproc) \
&& make install \
&& rm -rf /tmp/ctags
ENV PATH=$HOME/tools/ctags/bin:$PATH
RUN echo "PATH is $PATH"
RUN which ctags
In the above Dockerfile, the line ENV PATH=$HOME/tools/ctags/bin:$PATH does not work as expected. It seems that $HOME is not correctly expanded. The following two instructions also do not work:
ENV PATH=~/tools/ctags/bin:$PATH
ENV PATH="~/tools/ctags/bin:$PATH"
Only setting the absolute path works:
# the following setting works.
ENV PATH="/root/tools/ctags/bin:$PATH"
I have looked through the Docker reference but cannot find any documentation about this.
In general, when you're building a Docker image, it's okay to install things into the normal "system" directories. Whatever you're building will be isolated inside the image, and it can't conflict with other tools.
The easiest answer to your immediate question is to arrange things so you don't need to set $PATH.
In the example you give, you can safely use Autoconf's default installation directory of /usr/local. That will almost certainly be empty when you start your image build and only things you install will be there.
RUN ... \
&& ./configure \
&& make \
&& make install
(The Python corollary is to not create a virtual environment for your application; just use the system pip to install things into the default Python library directories.)
Don't expect there to be a home directory. If you have to install in some non-default place, /app is common, and /opt/whatever is consistent with non-Docker Linux practice. Avoid $HOME or ~, they aren't generally well-defined in Docker (unless you go out of your way to make them be).
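If a non-default prefix really is needed, a sketch along these lines should behave predictably. It assumes /opt/ctags as the prefix, ubuntu:22.04 as the base, and a guessed set of build packages; none of these come from the question. ENV values are expanded by the builder rather than by a shell, so only variables declared with ENV or ARG in the Dockerfile are substituted, and ~ is never expanded.

```dockerfile
# Assumed base image
FROM ubuntu:22.04

# Hypothetical install prefix: an absolute path, no $HOME or ~ involved
ARG CTAGS_DIR=/opt/ctags

# Assumed package list for building ctags from source
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
      ca-certificates git build-essential autoconf automake pkg-config \
 && rm -rf /var/lib/apt/lists/*

RUN git clone --depth 1 https://github.com/universal-ctags/ctags.git /tmp/ctags \
 && cd /tmp/ctags \
 && ./autogen.sh \
 && ./configure --prefix="${CTAGS_DIR}" \
 && make -j"$(nproc)" \
 && make install \
 && rm -rf /tmp/ctags

# CTAGS_DIR was declared with ARG above, so the builder can substitute it here
ENV PATH="${CTAGS_DIR}/bin:${PATH}"

RUN which ctags
```

HOME is not declared anywhere in the Dockerfile, which is consistent with $HOME expanding to nothing in the original ENV line.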
My target container is a build environment container, so my team would build an app in a uniform environment.
This app doesn't necessarily run as a container - it runs on physical machine. The container is solely for building.
The app depends on third parties.
Some I can apt-get install with Dockerfile RUN command.
And some I must build myself because they require special building.
I was wondering which way is better.
Using multistage build seems cool; Dockerfile for example:
From ubuntu:18.04 as third_party
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
...
ADD http://.../boost.tar.gz /
RUN tar boost.tar.gz && \
... && \
make --prefix /boost_out ...
From ubuntu:18.04 as final
COPY --from=third_party /boost_out/ /usr/
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
...
CMD ["bash"]
...
Pros:
Automatically built when I build my final container
Easy to change third party version (boost in this example)
Cons
ADD command downloads ~100MB file each time, makes image build process slower
I want to use --cache-from so I would be able to cache third_party and build from different docker host machine. Meaning I need to store ~1.6GB image in a docker registry. That's pretty heavy to pull/push.
On the other hand
I could just build boost (with this third_party image) and store its artifacts on some storage, git for example. It would take ~200MB which is better than storing 1.6GB image.
Pros:
Smaller disc space
Cons:
Cumbersome build
Manually build and push artifacts to git when changing boost version.
Somehow link Docker build and git to pull newest artifacts and COPY to the final image.
Either way I need a third_party image that uniformly and automatically builds the third parties. In 1. the image is bigger than in 2., where it will contain just the build tools and not the build artifacts.
Is this the trade-off?
1. is more automatic but consumes more disk space and push/pull time,
2. is cumbersome but consumes less disk space and push/pull time?
Are there any other virtues for any of these ways?
I'd like to propose changing your first attempt to something like this:
FROM ubuntu:18.04 as third_party
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
...
RUN wget http://.../boost.tar.gz -O /boost.tar.gz && \
tar xvf boost.tar.gz && \
... && \
make --prefix /boost_out ... && \
find -name \*.o -delete && \
rm /boost.tar.gz # this is important!
From ubuntu:18.04 as final
COPY --from=third_party /boost_out/ /usr/
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
...
CMD ["bash"]
This way, you are paying for the download of boost only once (when building the image without a cache), and you do not pay for the storage/pull-time of the original tar-ed sources. Additionally, you should remove unneeded target files (.o?) from the build in the same step in which they are generated. Otherwise, they are stored and pulled as well.
If you are at liberty to post the whole Dockerfile, I'll gladly take a deeper look at it and give you some hints.
I'm rather new to Docker and I'm trying to make a simple Dockerfile that combines an alpine image with a python one.
This is what the Dockerfile looks like:
FROM alpine
RUN apk update &&\
apk add -q --progress \
bash \
bats \
curl \
figlet \
findutils \
git \
make \
mc \
nodejs \
openssh \
sed \
wget \
vim
ADD ./src/ /home/src/
WORKDIR /home/src/
FROM python:3.7.4-slim
When running:
docker build -t alp-py .
the image builds as normal.
When I run
docker run -it alp-py bash
I can access the bash, but when I cd to /home/ and ls, it shows an empty directory:
root#5fb77bbc81a1:/# cd home
root#5fb77bbc81a1:/home# ls
root#5fb77bbc81a1:/home#
I've already tried changing ADD to COPY and also trying:
COPY . /home/src/
but nothing works.
What am I doing wrong? Am I missing something?
Thanks!
There is no such thing as "combining 2 images". You should see the images as different virtual machines (only for the purpose of understanding the concept - because they are more than that). You cannot combine them.
In your example you can start directly with the python image and install the tools you need on top of it:
FROM python:3.7.4-slim
RUN apt update &&\
apt-get install -y \
bash \
bats \
curl \
figlet \
findutils \
git \
make \
mc \
nodejs \
openssh \
sed \
wget \
vim
ADD ./src/ /home/src/
WORKDIR /home/src/
I didn't test whether all the packages are available, so you might want to do a bit of research to get them all in case you get errors.
When you use 2 FROM statements in your Dockerfile you are creating a multi-stage build. That is useful if you want to create a final image that doesn't contain your source code, but only the binaries of your product (the first stage builds the source and the second only copies the binaries from the first one).
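A minimal sketch of such a multi-stage build, assuming a hypothetical hello.c in the build context and the gcc:13 and debian:bookworm-slim base images (all of these are illustrative choices):

```dockerfile
# Stage 1: full toolchain, used only to compile
FROM gcc:13 AS build
WORKDIR /src
# hello.c is a hypothetical source file in the build context
COPY hello.c .
# Statically linked so the binary does not depend on libraries from this stage
RUN gcc -O2 -static -o /usr/local/bin/hello hello.c

# Stage 2: the image that actually gets tagged; it starts from a clean base
FROM debian:bookworm-slim
# Only what is copied explicitly crosses over from the first stage
COPY --from=build /usr/local/bin/hello /usr/local/bin/hello
CMD ["hello"]
```

Nothing from the first stage ends up in the final image unless a COPY --from pulls it in, which is why /home/src was missing from the image built in the question.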
It's generally understood that if the RUN instruction cleans up after large file operations, the resulting layer size would be smaller. However when I execute the following script the resulting layer is large and roughly corresponds to the size of the files initially copied and cleaned.
How could this size of a layer be committed if the files copied are cleaned in the same layer?
RUN mkdir -p /etc/puppet && \
cd /etc/puppet && \
apt-get update && apt-get install -y wget puppet && \
wget -rnH --level=10 -e robots=off --reject "index.html*" http://somefileserver.org/puppet/allpuppetstuff && \
puppet module install puppetlabs/stdlib && \
puppet module install 7terminals-java && \
puppet apply -e "include someserver" && \
apt-get purge -y --auto-remove puppet wget && \
rm -rfv /etc/puppet/* && \
rm -rfv /var/lib/apt/lists/*
Found some details after a few tweaks.
It seems Docker duplicates files from an older layer if a newer layer modifies files persisted in the earlier one. This is the cause of the fat layer created by the above RUN command, even when the total file size diff at the end of the operation is minimal. I couldn't find any definitive resources to cite for this, but this is my experience so far.
Basically, file operations related to the image should always be done in one layer.
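The effect can be reproduced with a small sketch like the following. The base image, file names and the 100 MB size are arbitrary assumptions chosen for illustration.

```dockerfile
# Assumed base image
FROM ubuntu:22.04

# Layer 1: a 100 MB file is written and committed with this layer
RUN dd if=/dev/zero of=/opt/big.bin bs=1M count=100

# Layer 2: only metadata changes, but the whole file is copied up into
# the new layer, so this layer is expected to add roughly another 100 MB
RUN chmod 600 /opt/big.bin

# Layer 3: deleting the file only records a whiteout; layers 1 and 2 keep their copies
RUN rm /opt/big.bin

# Same work in one instruction: the file is gone before the layer is committed
RUN dd if=/dev/zero of=/opt/big2.bin bs=1M count=100 \
 && chmod 600 /opt/big2.bin \
 && rm /opt/big2.bin
```

Running docker history on the built image lists the size of each layer, which should make the difference between the three separate instructions and the single combined one visible.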