Should I Compile My Application Inside of a Docker Image - docker

Although most of the time I am developing Java apps and am simply using Maven so my builds should be reproducible (at least that's what Maven says).
But say you are compiling a C++ program or something a little more involved, should you build inside of docker?
Or ideally use vagrant or another technology to produce reproduce able builds.
How do you manage reproducible build with docker?

You can, but not in your final image, as that would mean a much larger image than necessary: it would include all the compilation tool, instead of limiting to only what you need to execute the resulting binary.
You can see an alternative in "How do I build a Docker image for a Ruby project without build tools?"
I use an image to build,
I commit the resulting stopped container as a new image (with a volume including the resulting binary)
I use an execution image (one which only contain what you need to run), and copy the binary from the other image. I commit again the resulting container.
The final image includes the compiled binary and the execution environment.

I wanted to post an answer to this as well actually because to build on VonC's answer. Actually I just had Redhat Openshift training and they use a tool called Source to Image s2i, which uses docker to create docker images. And actually this strategy is great for managing a private (or public) cloud, where your build may be compiled on different machines, but you need to keep the build environment consistent.

Related

Concurrent build within Docker with regards to multi staging

I have a monolithic repo that contains all of my projects. The current setup I have is to bring up a build container, mount my monolithic repo, and build my projects sequentially. Copy out the binaries, and build their respective runtime (production) containers sequentially.
I find this process quite slow and want to improve the speed. Two main approach I want to take is
Within the build container, build my project binaries concurrently. Instead of sequentially.
Like step 1, also build my runtime (production) containers concurrently.
I did some research and it seems like there are two Docker features that are of my interest:
Multi-stage building. Which allows me to skip worrying about the build container and put everything into one Dockerfiles.
--parallel option for docker-compose, which would solve approach #2, allowing me to build my runtime containers concurrently.
However, there's still two main issues:
How do I glue the two features together?
How do I build my binaries concurrently inside the build Docker? In other words, how can I achieve approach #1?
Clarifications
Regardless of whether multi-stage is used or not, there's two logical phases.
First is the binary building phase. During this phase, the artifacts are the compiled executables (binaries) from the build containers. Since I'm not using multi-stage build, I'm copying these binaries out to the host, so the host serves as an intermediate staging area. Currently, the binaries are being built sequentially, I want to build them concurrently inside the build container. Hence approach #1.
Second is the image building phase. During this phase, the binaries from the previous phase, which are now stored on the host, are used to built my production images. I also want to build these images concurrently, hence approach #2.
Multi-stage allows me to eliminate the need for an intermedia staging area (the host). And --parallel allows me to build the production images concurrently.
What I'm wondering is how I can achieve approach #1 & #2 using multi-stage and --parallel. Because for every project, I can define a separate multi-stage Dockerfiles and call --parallel on all of them to have their images built separately. This would achieve approach #2, but this would spawn a separate build container for each project and take up a lot of resource (I use the same build container for all my projects and it's 6 GB). On the other hand, I can write a script to build my project binaries concurrently inside the build container. This would achieve approach #1, but then I can't use multi-stage if I want to build the production images concurrently.
What I really want is a Dockerfiles like this:
FROM alpine:latest AS builder
RUN concurrent_build.sh binary_a binary_b
FROM builder AS prod_img_a
COPY binary_a .
FROM builder AS prod_img_b
COPY binary_b .
And be able to run a docker-compose command like this (I'm making this up):
docker-compose --parallel prod_img_a prod_img_b
Further clarifications
The run-time binaries and run-time containers are not separate things. I just want to be able to parallel build the binaries AND the production images.
--parallel does not use different hosts, but my build container is huge. If I use multi-stage build and running something like 15 of these build containers in parallel on my local dev machine could be bad.
I'm thinking about compiling the binary and run-time containers separately too but I'm not finding an easy way to do that. I have never used docker commit, would that sacrifice docker cache?
Results
My mono-repo containers 16 projects, some are micro services being a few MBs, some are bigger services that are about 300 to 500 MBs.
The build contains the compilation of two prerequisites, one is gRPC, and the other is XDR. both are trivially small, taking only 1 or 2 seconds to build.
The build contains a node_modules installation phase. NPM install and build is THE bottleneck of the project and by far the slowest.
The strategy I am using is to split the build into two stages:
First stage is to spin up a monolithic build docker, mount the mono-repo to it with cache consistency as a binding volume. And build all of my container's binary dependencies inside of it in parallel using Goroutines. Each Goroutine is calling a build.sh bash script that does the building. The resulting binaries are written to the same mounted volume. There is cache being used in the form of a mounted docker volume, and the binaries are preserved across runs to a best effort.
Second stage is to build the images in parallel. This is done using docker's Go SDK documented here. This is also done in parallel using Goroutines. Nothing else is special about this stage besides some basic optimizations.
I do not have performance data about the old build system, but building all 16 projects easily took the upper bound of 30 minutes. This build was extremely basic and did not build the images in parallel or use any optimizations.
The new build is extremely fast. If everything is cached and there's no changes, then the build takes ~2 minutes. In other words, the overhead of bring up the build system, checking the cache, and building the same cached docker images takes ~2 minutes. If there's no cache at all, the new build takes ~5 minutes. A HUGE improvement from the old build.
Thanks to #halfer for the help.
So, there are several things to try here. Firstly, yes, do try --parallel, it would be interesting to see the effect on your overall build times. It looks like you have no control over the number of parallel builds though, so I wonder if it would try to do them all in one go.
If you find that it does, you could write docker-compose.yml files that only contain a subset of your services, such that you only have five at a time, and then build against each one in turn. Indeed, you could write a script that reads your existing YAML config and splits it up, so that you do not need to maintain your overall config and your split-up configs separately.
I suggested in the comments that multi-stage would not help, but I think now that this is not the case. I was wondering whether the second stage in a Dockerfile would block until the first one is completed, but this should not be so - if the second stage starts from a known image then it should only block when it encounters a COPY --from=first_stage command, which you can do right at the end, when you copy your binary from the compilation stage.
Of course, if it is the case that multi-stage builds are not parallelised, then docker commit would be worth a try. You've asked whether this uses the layer cache, and the answer is I don't think it matters - your operation here would thus:
Spin up the binary container to run a shell or a sleep command
Spin up the runtime container in the same way
Use docker cp to copy the binary from the first one to the second one
Use docker commit to create a new runtime image from the new runtime container
This does not involve any network operations, and so should be pretty quick - you will have benefited greatly from the parallelisation already at this point. If the binaries are of non-trivial size, you could even try parallelising your copy operations:
docker cp binary1:/path/to/binary runtime1:/path/to/binary &
docker cp binary2:/path/to/binary runtime2:/path/to/binary &
docker cp binary3:/path/to/binary runtime3:/path/to/binary &
Note though these are disk-bound operations, so you may find there is no advantage over doing them serially.
Could you give this a go and report back on:
your existing build times per container
your existing build times overall
your new build times after parallelisation
Do it all locally to start off with, and if you get some useful speed-up, try it on your build infrastructure, where you are likely to have more CPU cores.

Docker dealing with images instead of Dockerfiles

Can someone explain to me why the normal Docker process is to build an image from a Dockerfile and then upload it to a repository, instead of just moving the Dockerfile to and from the repository?
Let's say we have a development laptop and a test server with Docker.
If we build the image, that means uploading and downloading all of the packages inside the Dockerfile. Sometimes this can be very large (e.g. PyTorch > 500MB).
Instead of transporting the large imagefile to and from the server, doesn't it make sense to, perhaps compile the image locally to verify it works, but mostly transport the small Dockerfile and build the image on the server?
This started out as a comment, but it got too long. It is likely to not be a comprehensive answer, but may contain useful information regardless.
Often the Dockerfile will form part of a larger build process, with output files from previous stages being copied into the final image. If you want to host the Dockerfile instead of the final image, you’d also have to host either the (usually temporary) processed files or the entire source repo & build script.
The latter is often done for open source projects, but for convenience pre-built Docker images are also frequently available.
One tidy solution to this problem is to write the entire build process in the Dockerfile using multi-stage builds (introduced in Docker CE 17.05 & EE 17.06). But even with the complete build process described in a platform-independent manner in a single Dockerfile, the complete source repository must still be provided.
TL,DR: Think of a Docker image as a regular binary. It’s convenient to download and install without messing around with source files. You could download the source for a C application and build it using the provided Makefile, but why would you if a binary was made available for your system?
Instead of transporting the large imagefile to and from the server,
doesn't it make sense to, perhaps compile the image locally to verify
it works, but mostly transport the small Dockerfile and build the
image on the server?
Absolutely! You can, for example, set up an automated build on Docker Hub which will do just that every time you check in an updated version of your Dockerfile to your GitHub repo.
Or you can set up your own build server / CI pipeline accordingly.
IMHO, one of the reason for building the images concept and putting into repository is sharing with people too. For example we call Python's out of the box image for performing all python related stuff for a python program to run in Dockerfile. Similarly we could create a custom code(let's take example I did for apache installation with some custom steps(like ports changes and additionally doing some steps) I created its image and then finally put it to my company's repository.
I came to know after few days that may other teams are using it too and now when they are sharing it they need NOT to make any changes simply use my image and they should be done.

Using build system to run tests or interact with clusters

What is the purpose of projects like these below that use Bazel for things other than building software?
https://github.com/bazelbuild/rules_webtesting
https://github.com/bazelbuild/rules_k8s
Are they just conveniently providing environment for run command (as opposed to building portable executables) or am I missing something?
The best I can tell is that Bazel could be used to run only subset of E2E tests based on knowledge what changed.
Disclaimer: I have only cursory knowledge about k8s and docker.
Bazel isn't just used for building and testing, it can also deploy, as you've discovered with the rules in those projects.
The best I can tell is that Bazel could be used to run only subset of E2E tests based on knowledge what changed.
Correct, but also extend tests to deployments. If you've only changed a single string in your Go binary that's injected into an image, Bazel is able to use rules_k8s, rules_docker, and rules_go to:
incrementally and reproducibly rebuild the minimum set files to
build the new Go executable
create a new image layer containing the Go executable (without using Docker)
push the image to the registry
redeploy changed pod(s) into cluster
The notable thing is that if you didn't change the source file, Bazel will always create an image with the same digest due to its reproducibility. That means that you can now trust the deployment workflow to not redeploy/restart a pod even if you do a bazel run twice or more.
For more, check out this BazelCon 2017 talk: Using Bazel for Fast, Correct Docker Deployments w/ Databricks
Fun fact: you can also use bazel run to start a REPL for your build target, from 0.15.0 onwards. Haskell and Scala rules use this.

Build chain in the cloud?

(I understand this question is somewhat out of scope for stack overflow, because contains more problems and somewhat vague. Suggestions to ask it in the proper ways are welcome.)
I have some open source projects depending in each other.
The code resides in github, the builds happen in shippable, using docker images which in turn are built on docker hub.
I have set up an artifact repo and a debian repository where shippable builds put the packages, and docker builds use them.
The build chain looks like this in terms of deliverables:
pre-zenta docker image
zenta docker image (two steps of docker build because it would time out otherwise)
zenta debian package
zenta-tools docker image
zenta-tools debian package
xslt docker image
adadocs artifacts
Currently I am triggering the builds by pushing to github and sometimes rerunning failed builds on shippable after the docker build ran.
I am looking for solutions for the following problems:
Where to put Dockerfiles? Now they are in the repo of the package needing the resulting docker image for build. This way all information to build the package are in one place, but sometimes I have to trigger an extra build to have the package actually built.
How to trigger build automatically?
..., in a way supporting git-flow? For example if I change the code in zenta develop branch, I want to make sure that zenta-tools will build and test with the development version of it, before merging with master.
Are there a tool with which I can overview the health of the whole build chain?
Since your question is related to Shippable, I've created a support issue for you here - https://github.com/Shippable/support/issues/2662. If you are interested in discussing the best way to handle your scenario, you can also send me an email at support#shippable.com You can set up your entire flow, including building the docker images, using Shippable.

Where to keep Dockerfile's in a project?

I am gaining knowledge about Docker and I have the following questions
Where are Dockerfile's kept in a project?
Are they kept together with the source?
Are they kept outside of the source? Do you have an own Git repository just for the Dockerfile?
If the CI server should create a new image for each build and run that on the test server, do you keep the previous image? I mean, do you tag the previous image or do you remove the previous image before creating the new one?
I am a Java EE developer so I use Maven, Jenkins etc if that matter.
The only restriction on where a Dockerfile is kept is that any files you ADD to your image must be beneath the Dockerfile in the file system. I normally see them at the top level of projects, though I have a repo that combines a bunch of small images where I have something like
top/
project1/
Dockerfile
project1_files
project2/
Dockerfile
project2_files
The Jenkins docker plugin can point to an arbitrary directory with a Dockerfile, so that's easy. As for CI, the most common strategy I've seen is to tag each image built with CI as 'latest'. This is the default if you don't add a tag to a build. Then releases get their own tags. Thus, if you just run an image with no arguments you get the last image built by CI, but if you want a particular release it's easy to say so.
I'd recommend keeping the Dockerfile with the source as you would a makefile.
The build context issue means most Dockerfiles are kept at or near the top-level of the project. You can get around this by using scripts or build tooling to copy Dockerfiles or source folders about, but it gets a bit painful.
I'm unaware of best practice with regard to tags and CI. Tagging with the git hash or similar might be a good solution. You will want to keep at least one generation of old images in case you need to rollback.

Resources