How does COPY on separate lines help reduce cache invalidations? - docker

Docker documentation suggests the following as best practice.
If you have multiple Dockerfile steps that use different files from
your context, COPY them individually, rather than all at once. This
ensures that each step’s build cache is only invalidated (forcing the
step to be re-run) if the specifically required files change.
For example:
COPY requirements.txt /tmp/
RUN pip install --requirement /tmp/requirements.txt
COPY . /tmp/
This results in fewer cache invalidations for the RUN step than if you put COPY . /tmp/ before it.
My question is: how does it help?
In either case, if the requirements.txt file doesn't change, then pip install would produce the same result, so why does it matter that, in the best-practice scenario, requirements.txt is the only file in the directory when pip install runs?
On the other hand, it creates one more layer in the image, which is
something I would not want.

Say you have a very simple application
$ ls
Dockerfile main.py requirements.txt
With the corresponding Dockerfile
FROM python:3
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["./main.py"]
Now say you only change the main.py script. Since the requirements.txt file hasn't changed, the RUN pip install ... can reuse the Docker image cache. This avoids re-running pip install, which can download a lot of packages and take a while.
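For contrast, here is a rough sketch of the single-COPY layout the question asks about (same example files); because the whole build context is copied before pip install, any edit to main.py invalidates that COPY layer and everything after it, so pip install re-runs:
FROM python:3
WORKDIR /app
# Copying the whole context first means a change to any source file invalidates this layer...
COPY . .
# ...and therefore this step re-runs even though requirements.txt itself is unchanged
RUN pip install -r requirements.txt
CMD ["./main.py"]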

Related

How to rebuild a docker container in an air-gapped environment?

I have a fastapi application which is to be containerised. I created the docker image on a system with internet connectivity and saved it as a tar archive. This image was loaded with the docker load command onto a system that has Docker installed but no internet connectivity, and it is working fine. But now I want to make changes to the application code and rebuild the image. Only the app changes have to be pushed. How can this be achieved from this isolated system?
There are two actions during the build that need internet connection.
The first one is pulling the base image for your Dockerfile.
So for example if your Dockerfile is something like:
FROM python:3.9
WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
COPY ./app /code/app
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "80"]
Then you would need the python:3.9 docker image on the system.
This is easily achievable by moving images using docker load as you described in the question.
The second is pip installing packages (in the previous case the step RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt).
To do this install in a system with no internet connection you would need to download the .whl wheel file for each requirement and install them using --find-links /path/to/wheel/dir/ (and probably --no-index) flags.
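As a rough sketch (the wheels/ directory name is just an example), the wheel approach could look like this:
# On the machine WITH internet access: download a wheel for every requirement
pip download -r requirements.txt -d ./wheels
# On the air-gapped machine (or in the Dockerfile after COPYing ./wheels in): install offline
pip install --no-index --find-links ./wheels -r requirements.txt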
This can become complicated but if your dependencies are more or less fixed you can do the following:
First, on the system that CAN connect to the internet, you build a base image with all your requirements:
FROM python:3.9
WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
Then you can build this image and load it on the system with no internet. On that system you can create a new Dockerfile that starts FROM your newly created image and just adds your code:
FROM your-base-image
COPY ./app /code/app
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "80"]
Then rebuilding this image should not need any internet.
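Assuming image names like these (they are just placeholders), the end-to-end workflow could look roughly like:
# On the connected machine: build the requirements-only base image and export it
docker build -t my-fastapi-base -f Dockerfile.base .
docker save -o my-fastapi-base.tar my-fastapi-base
# On the air-gapped machine: load the base image, then rebuild only the app layers
docker load -i my-fastapi-base.tar
docker build -t my-fastapi-app .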

docker image multistage build: How to create a docker image copying only python packages

I am trying to create a Python-based image with some packages installed, but I want the image layers not to show anything about the packages I installed.
I am trying to use a multi-stage build,
e.g.:
FROM python:3.9-slim-buster as builder
RUN pip install django  # (I don't want this command to be seen when checking the docker image layers, so that's why I'm using a multi-stage build)
FROM python:3.9-slim-buster
# Here i want to copy all the site packages
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages
Now build image
docker build -t python_3.9-slim-buster_custom:latest .
and later check the image layers
dive python_3.9-slim-buster_custom:latest
this will not show the RUN pip install django line
Will this be a good way to achieve what I want (hide all the pip install commands)?
Whether this will be sufficient depends on what you are installing. Some Python libraries add binaries to your system that they rely on.
FROM python:3.9-alpine as builder
# install stuff
FROM python:3.9-alpine
# this is for sure required
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages
# this depends on what you are installing
COPY --from=builder /usr/local/bin /usr/local/bin
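If you're unsure whether a package drops console scripts into /usr/local/bin, one way to check (a sketch, using django from the question as the example) is to list the files pip installed for it:
# Run inside the builder stage, or any environment where the package is installed;
# entries that resolve to a bin/ directory are console scripts you may need to copy
pip show -f django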
The usual approach I see for this is to use a virtual environment in an earlier build stage, then copy the entire virtual environment into the final image. Remember that virtual environments are very specific to a single Python build and installation path.
If your application has its own setup.cfg or setup.py file, then a minimal version of this could look like:
FROM python:3.9-slim-buster as builder
# If you need build-only tools, like build-essential for Python C
# extensions, install them first
# RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install ...
WORKDIR /src
# Create and "activate" the virtual environment
RUN python3 -m venv /app
ENV PATH=/app/bin:$PATH
# Install the application as normal
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
RUN pip install .
FROM python:3.9-slim-buster
# If you need runtime libraries, like a database client C library,
# install them first
# RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install ...
# Copy the entire virtual environment over
COPY --from=builder /app /app
ENV PATH=/app/bin:$PATH
# Run an entry_points script from the setup.cfg as the main command
CMD ["my_app"]
Note that this has only minimal protection against a curious user seeing what's in the image. The docker history or docker inspect output will show the /app container directory, you can docker run --rm the-image pip list to see the package dependencies, and the application and library source will be present in a human-readable form.
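For example, a curious user needs nothing more than the standard commands already mentioned (the-image being the placeholder name used above):
# Shows each layer's command, including the COPY of the /app virtual environment
docker history the-image
# Lists every Python package installed in the image
docker run --rm the-image pip list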
Currently what's working for me is:
FROM python:3.9-slim-buster as builder
# DO ALL YOUR STUFF HERE
FROM python:3.9-slim-buster
COPY --from=builder / /

Docker: is it possible not to build from scratch w/o using cache?

I had a simple Docker file:
FROM python:3.6
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
The problem was - it installs requirements on every build. I have a lot of requirements, but they rarely change.
I searched for solutions and ended up with this:
FROM python:3.6
COPY requirements.txt /app/requirements.txt
WORKDIR /app
RUN pip install -r requirements.txt
COPY . /app
That worked perfectly fine, until the moment it stopped updating the code. E.g., I comment out a couple of lines in some file that goes into /app and build; the lines stay uncommented in the image.
I searched again and found out that this is possibly caused by the cache. I tried the --no-cache build flag, but now the requirements get installed again.
Is there some workaround or right way to do it in my situation?
You should use ADD, not COPY, if you want to invalidate the cache.
FROM python:3.6
COPY requirements.txt /app/requirements.txt
WORKDIR /app
RUN pip install -r requirements.txt
ADD . /app
Try the above docker file.
Have you ever used docker-compose? Docker Compose has volumes, which act as a cache: when you start the container, it will not rebuild your dependencies, and it auto-refreshes when your code changes.
And for your situation, you could do something like this:
FROM python:3.6
WORKDIR /app
COPY . /app
COPY requirements.txt ./
RUN pip install -r requirements.txt
CMD["python","app.py"]
Let try.
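A minimal sketch of that idea (the service name and paths are just examples): bind-mount the source directory so code edits appear in the running container without a rebuild:
services:
  web:
    build: .
    command: python app.py
    volumes:
      - .:/app  # host code is mounted over the image's /app, so edits show up immediately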
Changing a file that you simply copy in (COPY . /app) will not be seen by Docker, so it will use a cached layer *, hence your result. Using --no-cache will force a re-build of every layer, again explaining what you've observed.
The 'docker' way to avoid re-installing all requirements every time would be to put all the static requirements in a base image, then use this image in your FROM line with all the other requirements which do change.
* Although, I'm fairly sure I've observed that if you copy a named file, as opposed to a directory, changes are picked up even without --no-cache
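A hedged sketch of that base-image idea (the my-requirements-base tag is made up): keep the slow-changing requirements in their own image, and build the frequently changing code on top of it:
# Dockerfile.base -- rebuild only when requirements.txt changes
FROM python:3.6
COPY requirements.txt /app/requirements.txt
WORKDIR /app
RUN pip install -r requirements.txt

# Dockerfile -- rebuilt on every code change, but pip install never re-runs here
FROM my-requirements-base
WORKDIR /app
COPY . /app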

Pip compiled as part of dockerfile - fastest way to add a new entry to requirements.txt?

I'm using this Dockerfile as part of this docker compose file.
Right now, every time I want to add a new pip requirement, I stop my containers, add the new pip requirement, run docker-compose -f local.yml build, and then restart the containers with docker-compose -f local.yml up. This takes a long time, and it even looks like it's rebuilding the Postgres container when I just add a pip dependency.
What's the fastest way to add a single pip dependency to a container?
This is related to the fact that the Docker build cache is being invalidated. When you edit requirements.txt, the step RUN pip install --no-cache-dir -r /requirements/production.txt and all subsequent instructions in the Dockerfile get invalidated, so they get re-executed.
As a best practice, you should avoid invalidating the build cache as much as possible. This is achieved by moving the steps that change often to the bottom of the Dockerfile. You can edit the Dockerfile and, while developing, add separate pip installation steps at the end.
...
USER django
WORKDIR /app
RUN pip install --no-cache-dir <new package>
RUN pip install --no-cache-dir <new package2>
...
And once you are sure of all the dependencies needed, add them to the requirements file. That way you avoid invalidating the build cache early on and only rebuild the steps starting from the installation of the new packages onward.

how to use pip to install pkg from requirement file without reinstall

I am trying to build a Docker image. My Dockerfile is like this:
FROM python:2.7
ADD . /code
WORKDIR /code
RUN pip install -r requirement.txt
CMD ["python", "manage.py", "runserver", "0.0.0.0:8300"]
And my requirement.txt file like this:
wheel==0.29.0
numpy==1.11.3
django==1.10.5
django-cors-headers==2.0.2
gspread==0.6.2
oauth2client==4.0.0
Now I have a small change in my code, and I need pandas, so I add it to the requirement.txt file:
wheel==0.29.0
numpy==1.11.3
pandas==0.19.2
django==1.10.5
django-cors-headers==2.0.2
gspread==0.6.2
oauth2client==4.0.0
pip install -r requirement.txt will install all the packages in that file, although almost all of them have been installed before. My question is: how can I make pip install only pandas? That would save time when building the image.
Thank you
If you rebuild your image after changing requirement.txt with docker build -t <your_image> ., I guess it can't be done, because each time docker runs docker build it starts an intermediate container from the base image, and since it's a new environment pip obviously needs to install all of the dependencies.
You can consider building your own base image on top of python:2.7 with the common dependencies pre-installed, then building your application image on your own base image. Once there's a need to add more dependencies, manually rebuild the base image on top of the previous one with only the extra dependencies installed, and then optionally docker push it back to your registry.
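A rough sketch of that layered-base-image idea (all image tags are made up):
# Dockerfile.base (v1) -- the common dependencies, built once
FROM python:2.7
COPY requirement.txt /code/requirement.txt
RUN pip install -r /code/requirement.txt

# Dockerfile.base (v2) -- when pandas is needed, extend the previous base instead of starting over
FROM my-app-base:v1
RUN pip install pandas==0.19.2

# Dockerfile -- the application image, built on whichever base is current
FROM my-app-base:v2
ADD . /code
WORKDIR /code
CMD ["python", "manage.py", "runserver", "0.0.0.0:8300"]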
Hope this could be helpful :-)
