How can squid be used in a Dockerfile to cache downloads in a host directory - docker

I am running a docker build command with a Dockerfile, but it is being held up by a slow, and sometimes aborted, download of a certain package (Google's BoringSSL, as it happens).
I would like to install squid near the start of the Dockerfile, so that subsequent git clones, apt-get installs, etc. (i.e. every kind of download) are cached to a directory outside the Docker image, by defining a volume in the docker build command.
I'm fairly familiar with Docker, and understand the concept of layers. So a reply dealing solely with those will not be useful to me (although possibly to others). But sometimes one has to make a Dockerfile change that disrupts subsequent layers, and also a build error of a subsystem within a layer will mean all the web fetches for that layer will need repeating on the next build. So layers are not the answer to everything cache-related.
Thanks in anticipation!

Related

Why do docker containers rely on uploading (large) images rather than building from the spec files?

Having needed several times in the last few days to upload a 1 GB image after some micro change, I can't help but wonder why there isn't a deploy path built into Docker and related tech (e.g. k8s) to push just the application files (Dockerfile, docker-compose.yml and app-related code) and have it build out the infrastructure from within the (live) Docker host?
In other words, why do I have to upload an entire linux machine whenever I change my app code?
Isn't the whole point of Docker that the configs describe a purely deterministic infrastructure output? I can't even see why one would need to upload the whole container image unless they make changes to it manually, outside of Dockerfile, and then wish to upload that modified image. But that seems like bad practice at the very least...
Am I missing something, or is this just a peculiarity of the system?
Good question.
Short answer:
Because storage is cheaper than processing power, and because building images "live" can be complex, time-consuming, and unpredictable.
On your Kubernetes cluster, for example, you just want to pull the "cached" layers of an image that you know works and run it, in seconds, instead of compiling binaries and downloading things (as you would specify in your Dockerfile).
About building images:
You don't have to build these images locally, you can use your CI/CD runners and run the docker build and docker push from the pipelines that run when you push your code to a git repository.
Also, if the image is too big, look into ways of reducing its size: use multi-stage builds, use lighter/minimal base images, use fewer layers (for example, multiple RUN apt install commands can be grouped into one apt install command listing multiple packages), and use .dockerignore to avoid shipping unnecessary files in your image. Finally, read more about caching in Docker builds, as it may reduce the size of the layers you push when making changes.
Long answer:
Think of the Dockerfile as the source code, and the Image as the final binary. I know it's a classic example.
But just consider how long it would take to build/compile the binary every time you want to use it (either by running it, or importing it as a library in a different piece of software). Then consider how non-deterministic it would be to download that software's dependencies, or compile them, on different machines every time you run it.
You can take for example Node.js's Dockerfile:
https://github.com/nodejs/docker-node/blob/main/16/alpine3.16/Dockerfile
Which is based on Alpine: https://github.com/alpinelinux/docker-alpine
You don't want your application to perform all the operations specified in these files (and their scripts) at runtime before actually starting, as that would be unpredictable, time-consuming, and more complex than it should be (for example, you'd need firewall exceptions for egress traffic from the cluster to the internet to download dependencies that might not even be available).
You would instead just ship an image based on the base image you tested and built your code to run on. That image is built and sent to the registry, and then k8s runs it as a black box, which makes it predictable and deterministic.
Then about your point of how annoying it is to push huge docker images every time:
You can cut that size down by following some best practices and designing your Dockerfile well, for example:
Reduce your layers, for example, pass multiple arguments whenever it's possible to commands, instead of re-running them multiple times.
Use multi-stage building, so you will only push the final image, not the stages you needed to build to compile and configure your application.
Avoid injecting data into your images; you can pass it to the containers later, at runtime.
Order your layers, so you would not have to re-build untouched layers when making changes.
Don't include unnecessary files, and use .dockerignore.
And last but not least:
You don't have to push images from your machine; you can do it with CI/CD runners (for example, the build-push GitHub Action), or you can use your cloud provider's build products (like Cloud Build for GCP and AWS CodeBuild).
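As a rough illustration of the multi-stage and .dockerignore advice above, the following shell sketch writes both files into a scratch directory. The Go toolchain, the image tags (golang:1.22-alpine, alpine:3.20), the binary name app, and the ignore patterns are all assumptions made for the example; substitute whatever your project actually uses.

```shell
#!/bin/sh
# Sketch only: writes a hypothetical multi-stage Dockerfile and .dockerignore.
set -eu
ctx="${CTX:-$(mktemp -d)}"   # build context directory (scratch dir by default)

cat > "$ctx/Dockerfile" <<'EOF'
# Stage 1: full build toolchain. This stage is not part of the pushed image.
FROM golang:1.22-alpine AS build
WORKDIR /src
# Copy dependency manifests first so this layer stays cached until they change.
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o /out/app .

# Stage 2: small runtime image that contains only the compiled binary.
FROM alpine:3.20
COPY --from=build /out/app /usr/local/bin/app
CMD ["app"]
EOF

# Keep files that are not needed for the build out of the build context.
cat > "$ctx/.dockerignore" <<'EOF'
.git
node_modules
*.log
EOF

echo "wrote $ctx/Dockerfile and $ctx/.dockerignore"
```

Only the final stage ends up in the image you push, so the compiler and intermediate build artifacts never travel to the registry.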

Why don't we use CMD apt update instead of RUN apt update in a Dockerfile?

We use RUN apt update to update an image once, at build time. Why don't we use CMD apt update instead, so that every container we create is updated?
As it sounds like you already know, RUN is intended to "execute any commands in a new layer on top of the current image and commit the results", and CMD is intended to "provide defaults for an executing container". So RUN is a build-time instruction, while CMD is a run-time instruction.
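To make the build-time versus run-time split concrete, here is a small shell sketch that writes an example Dockerfile. The base image (debian:bookworm-slim) and the installed package (curl) are assumptions chosen only for illustration.

```shell
#!/bin/sh
# Sketch only: writes a hypothetical Dockerfile contrasting RUN and CMD.
set -eu
ctx="${CTX:-$(mktemp -d)}"   # scratch build context

cat > "$ctx/Dockerfile" <<'EOF'
FROM debian:bookworm-slim
# RUN executes during `docker build`; its result is committed as an image layer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl \
 && rm -rf /var/lib/apt/lists/*
# CMD only records a default command; nothing runs until `docker run`.
CMD ["curl", "--version"]
EOF

echo "wrote $ctx/Dockerfile"
```

Every container started from the resulting image sees the same already-installed packages, and the CMD can still be replaced by passing a different command to docker run.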
There are a few reasons this won't be a good idea:
Containers should be fast
Containers are usually expected to consume as few resources as possible, and to start up and shut down quickly and easily. If we update a container's packages EVERY time we want to run a container, it might take the container many minutes or even hours on a bad network before it can even start running whatever process it is intended for.
Unexpected behavior
Part of the process when developing a new container image is ensuring that the packages that are necessary for the container to work play well together. But if we are upgrading all the packages each time the container is run, on whatever system it is run on, it is possible (if not inevitable) that a package will eventually be published that introduces a breaking change to the container, and this is obviously not ideal.
Now this could be avoided by removing the default repositories and replacing them with your own where you can vet each package upgrade, test them together, and publish them, but this is probably a much greater effort than what would make sense unless the repos would be serving multiple container images.
Image Versioning
Many container images (e.g. Golang) version their images based on the version of Golang they support; however, when the underlying packages in the container are changing, how would you even start to version the image?
Now this isn't necessarily a deal breaker, but it could cause confusion among the container's user base and ultimately undercut their trust in the container.
Unexpected network traffic
Even if it is well documented, most developers would not expect this type of functionality, and it would lead to development issues when your container requires access to the internet. For example, in a K8s environment networking can be extremely strict, and the developer would need to manually open up a route to the internet (or to a set of custom repos).
Additionally, even if the networking is not an issue, if you expected a lot of these containers to be started, you might be clogging the network with the upgrade packages and cause network performance issues.
Wouldn't work for base images
It sounds like you are probably not developing an image intended to serve as a base image for anything else, but if you were, the base image's CMD would likely be overridden by any image built on top of it.

What's the best way to cache downloads done by docker while building containers?

While testing new Docker builds (modifying Dockerfile) it can take quite some time for the image to rebuild due to huge download size (either direct by wget, or indirect using apt, pip, etc)
One way around this that I personally use often is to just split commands I plan to modify into their own RUN instruction. This avoids re-downloading some parts because previous layers are cached. This, however, doesn't cut it if the command that requires "tuning" is early on in the Dockerfile.
Another solution is to use an image that already contains most of the required packages so that it would just be pulled once and cached, but this can come with unnecessary "baggage".
So is there a straightforward way to cache all downloads done by Docker while building/running? I'm thinking of having Memcached on the host machine, but it seems like kind of overkill. Any suggestions?
I'm also aware that I can test in an interactive shell first but sometimes you need to test the Dockerfile and make sure it's production-ready (including arguments and defaults) especially if the only way you will ever see what's going on after that point is ELK or cluster crash logs
This here:
https://superuser.com/questions/303621/cache-for-apt-packages-in-local-network
Is the same question but regarding a local network instead of the same machine. However, the answer can be used in this scenario, it's actually a simpler scenario than a network with multiple machines.
If you install Squid locally you can use it to cache all your downloads including your host-side downloads.
But more specifically, there's also a Squid Docker image!
Heads-up: if you use a squid service in a docker-compose file, don't forget to use the squid service name instead of Docker's subnet gateway: 172.17.0.1:3128 becomes squid:3128.
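A minimal sketch of how the Squid suggestion could be wired up from the host follows. The answer does not name an image, so ubuntu/squid, its /var/spool/squid cache path, and the squid-cache host directory are assumptions; the gateway address and port are the ones quoted above. Docker treats http_proxy and https_proxy as predefined build arguments, so the Dockerfile itself needs no changes.

```shell
#!/bin/sh
# Sketch only: route build-time downloads through a Squid proxy.
# Assumptions: the ubuntu/squid image, its /var/spool/squid cache path,
# and the host directory name are illustrative, not taken from the answer.
PROXY_HOST="${PROXY_HOST:-172.17.0.1}"       # default bridge gateway (from the answer)
PROXY_PORT="${PROXY_PORT:-3128}"
PROXY_URL="http://${PROXY_HOST}:${PROXY_PORT}"
CACHE_DIR="${CACHE_DIR:-$PWD/squid-cache}"   # hypothetical host directory for the cache

# Start Squid in a container, keeping its spool directory on the host.
start_squid() {
  mkdir -p "$CACHE_DIR"
  docker run -d --name squid \
    -p "${PROXY_PORT}:3128" \
    -v "${CACHE_DIR}:/var/spool/squid" \
    ubuntu/squid
}

# Run `docker build` with the proxy passed as predefined build arguments.
build_with_proxy() {
  docker build \
    --build-arg "http_proxy=${PROXY_URL}" \
    --build-arg "https_proxy=${PROXY_URL}" \
    "$@"
}

echo "proxy for builds: ${PROXY_URL}"
```

On a host with Docker, start_squid followed by build_with_proxy -t myimage . would send the build's HTTP traffic through the proxy. Two caveats: whether objects are stored on disk depends on the cache_dir setting in squid.conf, and HTTPS downloads pass through Squid as CONNECT tunnels, so they are not cached unless the proxy is specifically configured to intercept them.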
The way I did this was:
Used the new --mount=type=cache,target=/home_folder/.cache/curl.
Wrote a script which looks into the cache before calling curl (a wrapper over curl with a cache).
Called the script in the Dockerfile during the build.
It is more of a hack, but it works.
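The answerer's script is not shown, so the following is only a guess at what such a cache-first curl wrapper might look like. The function names cache_path and cached_curl are hypothetical, and the file name is derived from the URL with cksum purely for portability (a stronger hash would reduce the chance of collisions).

```shell
#!/bin/sh
# Sketch only: a hypothetical cache-first wrapper around curl.
# This is not the answerer's actual script.
CACHE_DIR="${CACHE_DIR:-/home_folder/.cache/curl}"   # matches the cache mount target

# Derive a stable cache file name from the URL (CRC via POSIX cksum).
cache_path() {
  printf '%s/%s' "$CACHE_DIR" "$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
}

# Usage: cached_curl URL OUTPUT_FILE
cached_curl() {
  url=$1
  out=$2
  cached=$(cache_path "$url")
  if [ ! -s "$cached" ]; then
    mkdir -p "$CACHE_DIR"
    # Download to a temporary name so an aborted transfer never looks like a hit.
    curl -fsSL -o "$cached.part" "$url" && mv "$cached.part" "$cached" || {
      rm -f "$cached.part"
      return 1
    }
  fi
  cp "$cached" "$out"
}
```

In a Dockerfile this could be sourced inside a step such as RUN --mount=type=cache,target=/home_folder/.cache/curl . /usr/local/lib/cached_curl.sh && cached_curl "$URL" /tmp/pkg.tar.gz, where the script path and URL variable are again placeholders. The cache mount persists between builds on the same builder, so a download that completed once is not repeated even when the layer is rebuilt.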

Why shouldn't our work inside the container modify the content of the container itself?

I am reading an article related to docker images and containers.
It says that a container is an instance of an image. Fair enough. It also says that whenever you make some changes to a container, you should create an image of it which can be used later.
But at the same time it says:
Your work inside a container shouldn’t modify the container. Like
previously mentioned, files that you need to save past the end of a
container’s life should be kept in a shared folder. Modifying the
contents of a running container eliminates the benefits Docker
provides. Because one container might be different from another,
suddenly your guarantee that every container will work in every
situation is gone.
What I want to know is: what is the problem with modifying a container's contents? Isn't this what containers are for, where we make our own changes and then create an image which will work every time? Even if we are talking about modifying the container's content itself and not just adding additional packages, how will it harm anything, since the image created from this container will also have these changes, and other containers created from that image will inherit them too?
Treat the container filesystem as ephemeral. You can modify it all you want, but when you delete it, the changes you have made are gone.
This is based on a union filesystem, the most popular/recommended being overlay2 in current releases. The overlay filesystem merges together multiple lower layers of the image with an upper layer of the container. Reads will be performed through those layers until a match is found, either in the container or in the image filesystem. Writes and deletes are only performed in the container layer.
So if you install packages, and make other changes, when the container is deleted and recreated from the same image, you are back to the original image state without any of your changes, including a new/empty container layer in the overlay filesystem.
From a software development workflow, you want to package and release your changes to the application binaries and dependencies as new images, and those images should be created with a Dockerfile. Persistent data should be stored in a volume. Configuration should be injected as either a file, environment variable, or CLI parameter. And temp files should ideally be written to a tmpfs unless those files are large. When done this way, it's even possible to make the root FS of a container read-only, eliminating a large portion of attacks that rely on injecting code to run inside of the container filesystem.
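The run-time side of that advice could look roughly like the sketch below. The image name, volume name, mount points, and the APP_MODE variable are all hypothetical; only the docker run flags themselves are standard.

```shell
#!/bin/sh
# Sketch only: run a container with a read-only root filesystem.
# IMAGE, the volume name, the mount points and APP_MODE are hypothetical.
IMAGE="${IMAGE:-myapp:1.0}"

# --read-only  : root filesystem cannot be written to
# --tmpfs /tmp : temp files go to an in-memory filesystem
# -v appdata:… : persistent data lives in a named volume
# -e APP_MODE  : configuration is injected as an environment variable
run_readonly() {
  docker run --rm \
    --read-only \
    --tmpfs /tmp \
    -v appdata:/var/lib/app \
    -e APP_MODE=production \
    "$IMAGE" "$@"
}
```

Calling run_readonly on a host with Docker starts the container; anything the application tries to write outside /tmp and the volume will fail, which is a quick way to find state that is being kept in the container layer by accident.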
The standard Docker workflow has two parts.
First you build an image:
Check out the relevant source tree from your source control system of choice.
If necessary, run some sort of ahead-of-time build process (compile static assets, build a Java .jar file, run Webpack, ...).
Run docker build, which uses the instructions in a Dockerfile and the content of the local source tree to produce an image.
Optionally docker push the resulting image to a Docker repository (Docker Hub, something cloud-hosted, something privately-run).
Then you run a container based off that image:
docker run the image name from the build phase. If it's not already on the local system, Docker will pull it from the repository for you.
Note that you don't need the local source tree just to run the image; having the image (or its name in a repository you can reach) is enough. Similarly, there's no "get a shell" or "start the service" in this workflow, just docker run on its own should bring everything up.
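The two halves of that workflow can be sketched as a pair of shell functions. The registry host, repository path, tag, and container name are placeholders invented for the example.

```shell
#!/bin/sh
# Sketch only: the build half and the run half of the workflow.
# The registry, image name, tag and container name are hypothetical.
IMAGE="${IMAGE:-registry.example.com/team/myapp:1.0}"

# Build half: run from a checked-out source tree that contains a Dockerfile.
build_and_push() {
  docker build -t "$IMAGE" . && docker push "$IMAGE"
}

# Run half: needs only the image name; Docker pulls the image if it is missing.
run_image() {
  docker run -d --name myapp "$IMAGE"
}
```

build_and_push would typically run on a developer machine or CI runner, while run_image can run on any host that can reach the registry, with no source tree present.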
(It's helpful in this sense to think of an image the same way you think of a Web browser. You don't download the Chrome source to run it, and you never "get a shell in" your Web browser; it's almost always precompiled and you don't need access to its source, or if you do, you have a real development environment to work on it.)
Now: imagine there's some critical widespread security vulnerability in some core piece of software that your application is using (OpenSSL has had a couple, for example). It's prominent enough that all of the Docker base images have already updated. If you're using this workflow, updating your application is very easy: check out the source tree, update the FROM line in the Dockerfile to something newer, rebuild, and you're done.
Note that none of this workflow is "make arbitrary changes in a container and commit it". When you're forced to rebuild the image on a new base, you really don't want to be in a position where the binary you're running in production is something somebody produced by manually editing a container, but they've since left the company and there's no record of what they actually did.
In short: never run docker commit. While docker exec is a useful debugging tool, it shouldn't be part of your core Docker workflow, and if you're routinely running it to set up containers or are thinking of scripting it, it's better to try to move that setup into the ordinary container startup instead.

Is it a bad idea to use docker to run a front end build process during development?

I have an angular project I'm working on containerizing. It currently has enough build tooling that a front-end developer could (and this is how it currently works) just run gulp in the project root, edit source files in src/, and the build tooling handles running traceur, templates and libsass and so forth, spitting content into a build directory. That build directory is served with a minimal server in development, and handled by nginx in production.
My goal is to build a docker-based environment that replicates this workflow. My users are developers who are on different kinds of boxes, so having build dependencies frozen in a dockerfile seems to make sense.
I've gotten close enough to this: Docker mounts the host volume, the developer edits the file on the local disk, and the gulp build watcher is running in the Docker container instance and rebuilds the site (and triggers livereload, etc.).
The issue I have is with wrapping my head around how filesystem layers work. Is this process of rebuilding files in the container's build/frontend directory generating a ton of extraneous saved layers? That's not something I'd really like, because I'm not wild about monotonically growing this instance as developers tweak and rebuild and tweak and rebuild. It'd only grow locally, but having to go through the "okay, time to clean up and start over" process seems tedious.
Is this process of rebuilding files in the container's build/frontend directory generating a ton of extraneous saved layers?
Nope, the only way to stack up extra layers is to commit a container with changes to a new image then use that new image to create the next container. Rinse, repeat.
Filesystem layers are saved when a container is committed to a new image (docker commit ...). When a container is running there will be a single read/write layer on top that contains all of the changes made to the container since it was created.
having to go through the "okay, time to clean up and start over" process seems tedious.
If you run the build container with docker run --rm ... then you'll get the cleanup for free. The build container will be created fresh from the image each time.
Also, data volumes bypass the union filesystem so there's a good chance you won't write to the container's filesystem at all.
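A sketch of such a throwaway build container is shown below. The image name and the container-side paths are assumptions; the src and build directory names and the gulp command follow the question.

```shell
#!/bin/sh
# Sketch only: run the gulp watcher in a disposable container.
# IMAGE and the /app/... paths inside the container are hypothetical.
IMAGE="${IMAGE:-frontend-build:dev}"

# --rm removes the container (and its read/write layer) when gulp exits.
# The bind mounts keep sources and build output on the host, so rebuilds
# do not write into the container's own filesystem.
watch_build() {
  docker run --rm -it \
    -v "$PWD/src:/app/src" \
    -v "$PWD/build:/app/build" \
    "$IMAGE" gulp
}
```

Because each run starts from the unchanged image and is deleted afterwards, there is nothing to clean up between sessions, and no new image layers are created unless someone explicitly runs docker commit.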
