Truly reproducible Docker containers?

Truly reproducible Docker containers? - docker

There is a security trend called reproducible builds, which aims for having a way to create bit-exact copies of output binaries so that the user can verify whether the version found on the internet is trustworthy. Is there a similar movement and tooling across Docker community? The way I understand it, even with version pinning in containerized Linux distributions, timestamps would make this problematic. Is there a way to solve this problem and create a readable Dockerfile that doesn't build from scratch that would describe dependencies deterministically and in a future-proof way?

Related

Why do we build "inside" docker?

When I first learned Docker I expected a config file, image producer, CLI, and options for mounting and networks. That's all there.
I did not expect to put build commands inside a Dockerfile. I thought docker would wrap/tar/include a prebuilt task I made. Why give build commands in Docker?
Surely it can import a task thus keeping Jenkins/Bazel etc. distinct and apart for making an image/container?

I guess we are dealing with a misconception here. Docker is NOT a lighweight version of VMware/Xen/KVM/Parallels/FancyVirtualization.
Disclaimer: The following is heavily simplified for the sake of comprehensiveness.
So what is Docker?
In one sentence: Docker is a system to isolate processes from the other processes within an operating system as much as possible while still providing all means to run them. Put differently:
Docker is a package manager for isolated processes.
One of its closest ancestors are chroot and BSD jails. What those basically do is to isolate (more in the case of BSD, less in the case of chroot) a part of your OS resources and have a complete environment running independently from the rest of the OS - except for the kernel.
In order to be able to do that, a Docker image obviously needs to contain everything except for a kernel. So you need to provide a shell (if you choose to do so), standard libraries like glibc and even resources like CA certificates. For reference: In order to set up chroot jails, you did all this by hand once upon a time, preinstalling your chroot environment with each and every piece of software required. Docker is basically taking the heavy lifting from you here.
The mentioned isolation even down to the installed (and usable software) sounds cumbersome, but it gives you several advantages as a developer. Since you provide basically everything except for a (compatible) kernel, you can develop and test your code in the same environment it will run later down the road. Not a close approximation, but literally the same environment, bit for bit. A rather famous proverb in relation to Docker is:
"Runs on my machine" is no excuse any more.
Another advantage is that can add static resources to your Docker image and access them via quite ordinary file system semantics. While it is true that you can do that with virtualisation images as well, they usually do not come with a language for provisioning. Docker does - the Dockerfile:
FROM alpine
LABEL maintainer="you#example.com"
COPY file/in/host destination/on/image
Ok, got it, now why the build commands?
As described above, you need to provide all dependencies (and transitive dependencies) your application has. The easiest way to ensure that is to build your application inside your Docker image:
FROM somebase
RUN yourpackagemanager install long list of dependencies && \
make yourapplication && \
make install
If the build fails, you know you have missing dependencies. Now you can tweak and tune your Dockerfile until it compiles and is tested. So now your Docker image is finished, you can confidently distribute it, since you know that as long as the docker daemon runs on the machine somebody tries to run your image on, your image will run.
In the Go ecosystem, you basically assure your go.mod and go.sum are up to date and working and your work stay's reproducible.
Again, this works with virtualisation as well, so where is the deal?
A (good) docker image only runs what it needs to run. In the vast majority of docker images, this means exactly one process, for example your Go program.
Side note: It is very bad practise to run multiple processes in one Docker image, say your application and a database server and a cache and whatnot. That is what docker-compose is there for, or more generally container orchestration. But this is far too big of a topic to explain here.
A virtualised OS, however, needs to run a kernel, a shell, drivers, log systems and whatnot.
So the deal basically is that you get all the good stuff (isolation, reproducibility, ease of distribution) with less waste of resources (running 5 versions of the same OS with all its shenanigans).

Because we want to have enviroment for reproducible build. We don't want to depend on version of language, existence of compiler, version of libraires and so on.

Building inside a Dockerfile allows you to have all the tools and environment you need inside independently of your platform and ready to use. In a development perspective is easier to have all you need inside the container.
But you have to think about the objective of building inside a Dockerfile, if you have a very complex build process with a lot of dependencies you have to be worried about having all the tools inside and it reflects on the final size of your resulting image. Because this is not the same building to generate an artifact than building to produce the final container.
Thinking about this two aspects you have to learn to use the multistage build process in Docker here. The main idea is closer to your question because you can have a as many stages as you need depending on your build process and use different FROM images to ensure you have the correct requirements and dependences on each stage, to finally generate the image with the minimum dependencies and smaller size.

I'll add to the answers above:
Doing builds in or out of docker is a choice that depends on your goal. In my case I am more interested in docker containers for kubernetes, and in addition we have mature builds already.
This link shows how you take prebuilt tasks and add them to an image. This strategy together with adding libs, env etc leverages docker well and shows that indeed docker is flexible. https://medium.com/#chemidy/create-the-smallest-and-secured-golang-docker-image-based-on-scratch-4752223b7324

Docker's standardization of environments

I am struggling on a question that nobody seems to answer in detail on the Internet.
"Standardizing service infrastructure across the entire pipeline allows every team member to work in a production parity environment"
This is a key benefit of Docker : it allows everybody to develop, test or whatever in a production-like environment. Because the container that is passed through the pipeline is always the same.
I get that. I understand that this is necessary and that Docker allows this easily.
But what I don't understand, is why was it so hard before Docker ? If I have a production machine and a testing machine, I won't have any problem building a script that installs the right dependencies, no matter what the machine is. So my environment in terms of libraries or frameworks will be the same.
The only thing that I understand with this whole environment-related benefit, is that Docker allows a developer to choose his OS without fear of the platform-related bugs. I've already run into features that worked on Windows and not on Mac. Worst kind of bugs in my opinion. So yeah if I had Docker at the time, I wouldn't have had this problem. But I don't understand why Docker was such a miracle for other environment-related stuff.
I think I am not understanding this because I've only worked on small scale projects. Maybe I also don't realize the full meaning of the word "environment".
What am I missing here ? Why containers were a breakthrough for standardizing environments, whereas scripts can achieve that ?

The following list is not exhaustive, it represents only three important advantages of docker. Please note that docker is not a magical solution and may not be adapted in specific contexts.
Firstly, with containers you don't have conflicts between dependencies.
If you have two programs using the same library at different version you'll have to manually install both versions and specify custom environment variables before executing your programs. (For example LD_LIBRARY_PATH). Please note that some tools exists to address this issue but only in specific cases (virtualenvs in python for example).
Secondly, with containers you don't have persistence.
For example if you write a little bash script to install your development environment based on Nginx and PHP and by mistake I install Apache, my package will still be present even if you run again your script. The thing is Apache will sometimes starts before Nginx and block the 80 port, breaking your development environment.
To sum up, without docker you're not sure about the state of untracked elements and they may break your environment.
Thirdly, docker allows you to reduce the gap between development and production.
The close environment is "everything needed for your code to run". For example libraries, config files, your interpreter (python, php, ...). Docker packages the application with its close environment so you don't have mismatches between what your app needs and the environment you provided.
This is especially important when you update dependencies during development and may forget to update them in production.
A false argument is security and isolation. The security process starts with defining your threat model and then choosing countermeasures. Adding docker because it increases security in a risky environment won't be enough (there is no kernelspace isolation) and adding docker for security if you don't need more is called paranoïa. Docker adds userspace isolation and default seccomp profiles, but this is not a reason to use it, except if it matches your threat model.

Portable docker daemon for deterministic CI builds

We are looking to make use of Docker to run integration tests within CI builds (with Bazel).
We need to support Debian as well as MacOS.
In order to guarantee build correctness, and ensure determinism and portability, we cannot rely on the host having a running docker daemon. The build needs to come with its own docker daemon.
What is the best way to achieve this? Is there a standard “portable” docker binary?
If not, what do you think would be the right approach to implement this?
In linux systems, I imagine this would be relatively simple, as we would just need to download the binaries and run.
In MacOS, I guess we would need to bundle it with hyperkit.
Would love to hear your thoughts on this.

In terms of building Docker images, you should look at bazelbuild/rules_docker (disclaimer: I wrote/own them). They implement the only properly deterministic Docker builds of which I'm aware (at least to Bazel's standard).
They do this by avoiding Dockerfile and the Docker daemon (which most other approaches use), as it is unclear these can produce deterministic artifacts. This avoids the root requirement too, which is nice.
However, you specifically asked about testing, which tl;dr we have not solved.
#ittaiz is also interested in this and started this Github issue for discussing it. Would you mind moving the discussion there?

DevOps Simple Setup

I'm looking to start creating proper isolated environments for django web apps. My first inclination is to use Docker. Also, it's usually recommended to use virtualenv with any python project to isolate dependencies.
Is virtualenv still necessary if I'm isolating projects via Docker images?

If your Docker container is relatively long-lived or your project dependencies change, there is still value in using a Python virtual-environment. Beyond (relatively) isolating dependencies of a codebase from other projects and the underlying system (and notably, the project at a given state), it allows for a certain measure of denoting the state of requirements at a given time.
For example, say that you make a Docker image for your Django app today, and end up using it for the following three weeks. Do you see your requirements.txt file being modified between now and then? Can you imagine a scenario in which you put out a hotpatch that comes with environmental changes?
As of Python 3.3, virtual-env is stdlib, which means it's very cheap to use, so I'd continue using it, just in case the Docker container isn't as disposable as you originally planned. Stated another way, even if your Docker-image pipeline is quite mature and the version of Python and dependencies are "pre-baked", it's such low-hanging fruit that while not explicitly necessary, it's worth sticking with best-practices.

No not really if each Python / Django is going to live in it's own container.

Does Docker reduce or mitigate the need for Puppet/Chef et al?

I'm not au fait with any of these technologies (embarrassing really), but at my present gig, the company badly needs to automate.
So as I begin to read-up on Puppet and Chef and PowerShell DSC, I then remember that Docker and containerisation is coming to Windows.
Does Docker do away with the need for these tools, or do they work together?
I understand that Docker uses virtualisation technology in the OS, so I get the feeling that Docker solves a different problem, and a configuration tool is still needed but I've no certain, practical knowledge.

Does Docker do away with the need for these tools, or do they work together?
They work together: provisioning and containerization solve different issues, and you actually can provision docker containers themselves with a provisioning tool.
See for instance "Docker: Using Puppet"

Tools like Chef & Puppet are important for configuration, but they do have one weakness that Docker helps to shore up. They are not always fully idempotent (hype notwithstanding). In other words, running Chef twice on the same virtual machine may cause unexpected and hard-to-find changes on that machine, and you'd be restoring a backup to get to a known good state.
By contrast, a Docker deployment involves building an entirely new image and swapping it out with your old image. Rollback involves simply unswapping them and comparing them to diagnose the problems in the new image.
Note that you still might very well use Chef to build your Docker container. But you might very well not. Since containers are supposed to run just one process in a particular way, I've found that a series of simple shell commands is way preferable to the overhead entailed by Chef.

In short no, you don't need anything like Chef or Puppet. Of course you can use if like to but it's not required.
If you build your system in such way that everything in containerized then what you need is only a tiny OS like CoreOS or Atomic.
So you just configure your VM via Cloud-Config if needed and deploy your container either with cloud config or Docker cli itself. The idea is your machines should have a static state and they can be created whenever you want new one and destroyed when you don't need.
There are other tools that can help with Docker orchestration which another story by itself.
Tools like Swarm, Kubernetes and Mesosphere.
docker-machine is also very helpful for development purpose. (maybe deployment too).
Here is CoreOS example:
https://coreos.com/os/docs/latest/cloud-config.html
Resource: I do it in production for different apps.
UPDATE:
BTW, Docker is not only a visualization technology. It does some sort of containerization (you can call it virtualization too) and that's only a small part of the what Docker can do. Docker can configure, build, ship and run application whit eliminating its dependencies on host machine. And that's why you don't need those classic configuration tools.

Puppet and Chef are configuration management tools, where as Docker is a virtualization tool such as LXC.
Usually you'd be using Chef or puppet to manage Docker containers. For example take a look at Chef docs.
EDIT as per #ptierno comment.

Docker is three things: a cool way to run a process, a decent image-based deploy system, and a mediocre system image builder.
The first is not related to config management as those tools aren't involved in running a process, at least not directly. The second takes the place of some amount of config management in production by doing it ahead of time when you build the image. There is still often some need for last-mile config for stuff like service discovery and secrets but this can be handled by lighter tools like consul-templates or confd. The last is where the rub lies. docker build is simple, easy to get started with, and mostly unhelpful for complex situations. You get, at most, a single inheritance tree between dockerfiles which makes stuff like multi-axis matrix builds ({app1 app2 app3} x {prod qa dev}) more difficult than it could be. Also building composable abstraction for other groups to use is difficult, though again it isn't impossible. Using something like Packer to drive image builds can produce simpler code sometimes, and supports the full suite of CAPS (Chef, Ansible, Puppet, Salt) tools. This is mostly aimed at the use case where you are treating Docker images like tiny VMs, which I wish fewer people would do, but it's a thing so here we are.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart