Concurrent builds within Docker with regard to multi-stage builds

I have a monolithic repo that contains all of my projects. The current setup is to bring up a build container, mount my monolithic repo into it, and build my projects sequentially; I then copy out the binaries and build their respective runtime (production) containers, again sequentially.
I find this process quite slow and want to improve the speed. The two main approaches I want to take are:
Approach #1: within the build container, build my project binaries concurrently instead of sequentially.
Approach #2: likewise, build my runtime (production) containers concurrently.
I did some research and it seems there are two Docker features of interest to me:
Multi-stage builds, which let me stop worrying about a separate build container and put everything into one Dockerfile.
The --parallel option for docker-compose build, which would address approach #2 by letting me build my runtime containers concurrently.
However, there are still two main issues:
How do I glue the two features together?
How do I build my binaries concurrently inside the build container? In other words, how can I achieve approach #1?
Clarifications
Regardless of whether multi-stage builds are used or not, there are two logical phases.
First is the binary-building phase. During this phase, the artifacts are the compiled executables (binaries) produced in the build container. Since I'm not using multi-stage builds, I copy these binaries out to the host, so the host serves as an intermediate staging area. Currently the binaries are built sequentially; I want to build them concurrently inside the build container. Hence approach #1.
Second is the image-building phase. During this phase, the binaries from the previous phase, now stored on the host, are used to build my production images. I also want to build these images concurrently, hence approach #2.
Multi-stage builds allow me to eliminate the need for an intermediate staging area (the host), and --parallel allows me to build the production images concurrently.
What I'm wondering is how I can achieve approaches #1 & #2 using multi-stage builds and --parallel. For every project I could define a separate multi-stage Dockerfile and call --parallel on all of them to have their images built separately. This would achieve approach #2, but it would spawn a separate build container for each project and take up a lot of resources (I use the same build container for all my projects and it's 6 GB). On the other hand, I could write a script to build my project binaries concurrently inside the build container. This would achieve approach #1, but then I can't use multi-stage builds if I want to build the production images concurrently.
What I really want is a Dockerfile like this:
FROM alpine:latest AS builder
RUN concurrent_build.sh binary_a binary_b

FROM alpine:latest AS prod_img_a
COPY --from=builder binary_a .

FROM alpine:latest AS prod_img_b
COPY --from=builder binary_b .
And be able to run a docker-compose command like this (I'm making this up):
docker-compose --parallel prod_img_a prod_img_b
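For what it's worth, Compose file format 3.4+ supports a build target option, so one hedged way of gluing the two features together (reusing the made-up stage names above) might be a compose file like this:
# docker-compose.yml (sketch): both services build different target stages
# of the same multi-stage Dockerfile, so they share the builder stage.
version: "3.4"
services:
  prod_img_a:
    image: prod_img_a
    build:
      context: .
      target: prod_img_a
  prod_img_b:
    image: prod_img_b
    build:
      context: .
      target: prod_img_b
built with something close to the made-up command, e.g. docker-compose build --parallel prod_img_a prod_img_b, assuming a docker-compose version that supports --parallel. Whether the shared builder stage is actually built only once then depends on the builder and its layer cache.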
Further clarifications
The run-time binaries and the run-time containers are not separate things; I just want to be able to build the binaries AND the production images in parallel.
--parallel does not use different hosts, but my build container is huge. If I use multi-stage builds, running something like 15 of these build containers in parallel on my local dev machine could be bad.
I'm also thinking about building the binaries and the run-time containers separately, but I'm not finding an easy way to do that. I have never used docker commit; would that sacrifice the Docker cache?

Results
My mono-repo contains 16 projects; some are microservices of only a few MB, some are bigger services of about 300 to 500 MB.
The build includes the compilation of two prerequisites: one is gRPC and the other is XDR. Both are trivially small, taking only 1 or 2 seconds to build.
The build also contains a node_modules installation phase. npm install and build is THE bottleneck of the project and by far the slowest step.
The strategy I am using is to split the build into two stages:
The first stage is to spin up a monolithic build container and mount the mono-repo into it as a bind volume with cached consistency. All of the binaries my containers depend on are then built inside it in parallel using goroutines, each goroutine calling a build.sh bash script that does the actual building. The resulting binaries are written to the same mounted volume. Caching is used in the form of a mounted Docker volume, and the binaries are preserved across runs on a best-effort basis.
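For illustration, a minimal shell equivalent of this first stage (the goroutine version does the same thing; the project names and the /repo mount path are made up):
#!/bin/sh
# Kick off every project's build.sh inside the single build container at
# once, writing the binaries back to the mounted mono-repo volume.
for project in svc_a svc_b svc_c; do
  ( cd "/repo/$project" && ./build.sh ) &   # one background build per project
done
wait  # block until all background builds have finished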
The second stage is to build the images in parallel. This is done using Docker's Go SDK, documented here, again in parallel using goroutines. Nothing else is special about this stage besides some basic optimizations.
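Similarly, a rough sketch of the second stage using the plain docker CLI instead of the Go SDK (image names and Dockerfile locations are made up):
#!/bin/sh
# Build every production image concurrently from the binaries that the
# first stage left in the mono-repo (one Dockerfile per project assumed).
for project in svc_a svc_b svc_c; do
  docker build -t "$project:latest" -f "/repo/$project/Dockerfile" "/repo/$project" &
done
wait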
I do not have performance data for the old build system, but building all 16 projects easily took on the order of 30 minutes at the upper end. That build was extremely basic and did not build the images in parallel or use any optimizations.
The new build is extremely fast. If everything is cached and there are no changes, the build takes ~2 minutes; in other words, the overhead of bringing up the build system, checking the cache, and rebuilding the same cached Docker images is ~2 minutes. With no cache at all, the new build takes ~5 minutes. A HUGE improvement over the old build.
Thanks to @halfer for the help.

So, there are several things to try here. Firstly, yes, do try --parallel; it would be interesting to see the effect on your overall build times. It looks like you have no control over the number of parallel builds though, so I wonder if it would try to do them all in one go.
If you find that it does, you could write docker-compose.yml files that only contain a subset of your services, such that you only have five at a time, and then build against each one in turn. Indeed, you could write a script that reads your existing YAML config and splits it up, so that you do not need to maintain your overall config and your split-up configs separately.
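For instance, a rough sketch of building in batches, assuming the full config has been split into hypothetical per-batch compose files:
# Build at most one batch of services at a time; each batch file is a
# hand-written (or script-generated) subset of the full docker-compose.yml.
for f in docker-compose.batch1.yml docker-compose.batch2.yml docker-compose.batch3.yml; do
  docker-compose -f "$f" build --parallel
done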
I suggested in the comments that multi-stage would not help, but I think now that this is not the case. I was wondering whether the second stage in a Dockerfile would block until the first one is completed, but this should not be so - if the second stage starts from a known image then it should only block when it encounters a COPY --from=first_stage command, which you can do right at the end, when you copy your binary from the compilation stage.
Of course, if it turns out that multi-stage builds are not parallelised, then docker commit would be worth a try. You've asked whether this uses the layer cache, and the answer is that I don't think it matters - your operation would go like this (sketched below):
Spin up the binary container to run a shell or a sleep command
Spin up the runtime container in the same way
Use docker cp to copy the binary from the first one to the second one
Use docker commit to create a new runtime image from the new runtime container
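A minimal shell sketch of those four steps, with all container, image and path names made up:
# Steps 1 & 2: start the binary container and the runtime container,
# kept alive with a long sleep
docker run -d --name binary1  my-build-image  sleep 3600
docker run -d --name runtime1 my-runtime-base sleep 3600
# Step 3: docker cp cannot copy directly between two containers, so stage
# the binary on the host in between
docker cp binary1:/build/out/app ./app
docker cp ./app runtime1:/usr/local/bin/app
# Step 4: freeze the modified runtime container into a new image
docker commit runtime1 my-runtime-image:latest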
This does not involve any network operations, and so should be pretty quick - you will have benefited greatly from the parallelisation already at this point. If the binaries are of non-trivial size, you could even try parallelising your copy operations:
# docker cp cannot copy container-to-container directly, so go via the host
docker cp binary1:/path/to/binary ./binary1 && docker cp ./binary1 runtime1:/path/to/binary &
docker cp binary2:/path/to/binary ./binary2 && docker cp ./binary2 runtime2:/path/to/binary &
docker cp binary3:/path/to/binary ./binary3 && docker cp ./binary3 runtime3:/path/to/binary &
wait
Note though these are disk-bound operations, so you may find there is no advantage over doing them serially.
Could you give this a go and report back on:
your existing build times per container
your existing build times overall
your new build times after parallelisation
Do it all locally to start off with, and if you get some useful speed-up, try it on your build infrastructure, where you are likely to have more CPU cores.

Related

Docker dealing with images instead of Dockerfiles

Can someone explain to me why the normal Docker process is to build an image from a Dockerfile and then upload it to a repository, instead of just moving the Dockerfile to and from the repository?
Let's say we have a development laptop and a test server with Docker.
If we build the image, that means uploading and downloading all of the packages referenced in the Dockerfile. Sometimes these can be very large (e.g. PyTorch > 500 MB).
Instead of transporting the large image file to and from the server, doesn't it make sense to perhaps compile the image locally to verify it works, but mostly transport the small Dockerfile and build the image on the server?
This started out as a comment, but it got too long. It is likely to not be a comprehensive answer, but may contain useful information regardless.
Often the Dockerfile will form part of a larger build process, with output files from previous stages being copied into the final image. If you want to host the Dockerfile instead of the final image, you’d also have to host either the (usually temporary) processed files or the entire source repo & build script.
The latter is often done for open source projects, but for convenience pre-built Docker images are also frequently available.
One tidy solution to this problem is to write the entire build process in the Dockerfile using multi-stage builds (introduced in Docker CE 17.05 & EE 17.06). But even with the complete build process described in a platform-independent manner in a single Dockerfile, the complete source repository must still be provided.
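For example, a hedged sketch of such a self-contained multi-stage Dockerfile (base images, paths and the make target are all illustrative):
# Build stage: the whole source repository still has to be sent as the
# build context, but none of the toolchain ends up in the final image.
FROM gcc:7 AS build
COPY . /src
WORKDIR /src
RUN make

# Final stage: only the compiled artifact is kept.
FROM debian:stretch-slim
COPY --from=build /src/app /usr/local/bin/app
CMD ["app"]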
TL;DR: Think of a Docker image as a regular binary. It's convenient to download and install without messing around with source files. You could download the source for a C application and build it using the provided Makefile, but why would you if a binary was made available for your system?
Instead of transporting the large image file to and from the server, doesn't it make sense to perhaps compile the image locally to verify it works, but mostly transport the small Dockerfile and build the image on the server?
Absolutely! You can, for example, set up an automated build on Docker Hub which will do just that every time you check in an updated version of your Dockerfile to your GitHub repo.
Or you can set up your own build server / CI pipeline accordingly.
IMHO, one of the reasons for building images and putting them into a repository is sharing with other people. For example, we pull Python's out-of-the-box image to do all the Python-related work needed for a Python program to run, right from the Dockerfile. Similarly, we can create custom images: I once did this for an Apache installation with some custom steps (port changes plus a few additional tweaks), built an image for it, and finally pushed it to my company's repository.
A few days later I came to know that other teams are using it too, and now that it is shared they do NOT need to make any changes - they simply use my image and they are done.

Efficient svn checkout in docker container

I want to checkout some files (specifically, test suite http://llvm.org/svn/llvm-project/test-suite/trunk) in my docker container.
Now I just use RUN svn co http://llvm.org/svn/llvm-project/test-suite/trunk train.out/llvm-test-suite inside Dockerfile.
It works, but it doesn't look efficient: on each docker-compose build I need to wait ~5 minutes while the tests are checked out.
Is there a better way to prevent Docker from checking these files out each time? The only alternative I see for now is including the files in the image.
You generally don’t run source-control tools from inside a Dockerfile. Check them out in a host directory (better still, if you can manage it, add the Dockerfile directly to the repository you’re trying to build) and run docker build with all of its inputs directly on disk.
There are a couple of good reasons for this:
Docker image caching can often mean that Docker won’t repeat a “clone”, “checkout”, or “pull” type operation: it knows it’s done it once and already knows the output of it and skips the step, even if there are new commits you don’t have.
Adding tools like svn or git to the image that you only need to build it makes it unnecessarily larger. (Multi-stage builds can avoid this, but they’re relatively new.)
The more common use case for this is to clone a private repository that needs credentials, and it’s hard to avoid leaking those credentials into the final image. (Again multi-stage builds can avoid this, with some care, but it’s better to not have the security exposure at all.)
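For example, a hedged sketch of the suggested workflow, with the checkout kept on the host (the image tag is made up):
# Check the test suite out (or update it) on the host, outside of Docker...
svn co http://llvm.org/svn/llvm-project/test-suite/trunk train.out/llvm-test-suite
# ...then build with the checkout as part of the local build context; the
# Dockerfile would COPY train.out/llvm-test-suite in instead of running svn.
docker build -t llvm-tests .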

How to stop TeamCity from rebuilding docker dependencies every time?

I have a TeamCity build project that parameterizes a docker-compose.yml template with the build versions of a dozen Docker containers, so in order to get the build_counter from each container, I have them set as snapshot dependencies in the docker-compose build job. Each container's Dockerfile and other files are in their own BitBucket repo, and they have triggers for the appropriate files. In the snapshot dependencies in the docker-compose build I have them set to "Do not run new build if there is a suitable one" but it still tries to run all of the dependent builds even though there aren't any changes in their respective repos.
This makes what should be a very simple and quick build into a very long build. And oftentimes, one of the dependent builds will fail with "could not collect changes: connection refused", and I suspect it has to do with TC trying to hit all of these different repos at once.
Is there something I can do to not trigger a build of every dependency every time the docker-compose build is run?
Edit:
Here's an example of what our docker-compose.yml.j2 looks like: http://termbin.com/b2xy
Obviously, I've sanitized it for sharing, and our real docker-compose template has about a dozen services listed.
Here is an example Dockerfile for one of the services: http://termbin.com/upins
Rather than changing the source of your build (the parameterized docker-compose.yml) and brute-forcing your build every time, you could consider building the containers independently, tagging them with a version increment and labels. After the build, store the images in a local registry. Then use docker-compose to suit your runtime needs: docker-compose can use multiple YAML files, so if you need other images for a particular build, just pull the ones you need, and for production use another YAML file that composes the system to run. Add LABELs to your Dockerfile; see http://label-schema.org//rc1/ for a set of labels that may suit your needs.
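A rough sketch of that flow for a single service (registry address, image name and version are hypothetical):
# Build each container independently, tag it with an incrementing version,
# and push it to a local registry so later docker-compose runs can just pull it.
docker build -t localhost:5000/my-service:1.42 ./my-service
docker push localhost:5000/my-service:1.42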
I know this is an old question, but I have come across this issue, and you can't do what sounds reasonable, i.e. reuse recent green builds without rebuilding. This is partly down to what snapshot dependencies are designed to do by JetBrains.
The basic idea is that dependencies are for synchronized revisions of code: that is, if you build Compose at a certain time, it needs to use not just its own source code from that point in time but also the code of all its dependencies from the same point in time, regardless of whether anything significant has changed.
In my example, there were always changes because the same repo was used for lots of projects and had unrelated changes that would not trigger a build but would make the project appear behind and cause a build.
If your dependencies have NOT changed and show no changes pending, then they shouldn't build. In this case, you need to tick "Do not run new build if there is a suitable one". "Enforce Revisions Synchronization" is slightly confusing. If ticked, it will find older builds that match the first build after your build was triggered. If unticked, it can use newer builds.

Shared build logic with docker-compose and multi-stage Dockerfiles

I am using docker-compose with multi-stage Dockerfiles to build and run multiple services. This works, but the "build" portion of each multi-stage build is largely copy-and-pasted between each service's Dockerfile. I want to reduce the copy-and-paste / centralize the common build logic in one spot.
Reading https://engineering.busbud.com/2017/05/21/going-further-docker-multi-stage-builds/ I could create a local image with the shared build steps and have the service Dockerfiles depend on it, but I want the development experience to be a simple docker-compose up. Creating a local build image means a developer would have to know to run docker build [common_build_image] first so that the build image exists locally, and THEN run docker-compose up to build and run all the services that depend on it.
There doesn't appear to be a way to include a Dockerfile into another Dockerfile. FROM does not appear to support local paths.
Is there a way to accomplish what I want? Of course I can use a shell script to tie everything together, but that is basically what multi-stage builds were trying to solve in the first place.
It turns out you can "compose" docker-compose: https://docs.docker.com/compose/extends/#adding-and-overriding-configuration which is what I was looking for.
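As a rough illustration of the "adding and overriding configuration" approach (file, service and path names are hypothetical) - docker-compose automatically layers docker-compose.override.yml on top of docker-compose.yml, and further files can be added with -f:
# docker-compose.yml - shared/base configuration
version: "3.4"
services:
  api:
    build:
      context: .
      dockerfile: services/api/Dockerfile

# docker-compose.override.yml - development-only additions, merged over the
# base file automatically by a plain `docker-compose up`
version: "3.4"
services:
  api:
    ports:
      - "8080:8080"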

Should I Compile My Application Inside of a Docker Image

Most of the time I am developing Java apps and simply using Maven, so my builds should be reproducible (at least that's what Maven says).
But say you are compiling a C++ program or something a little more involved, should you build inside of docker?
Or should you ideally use Vagrant or another technology to produce reproducible builds?
How do you manage reproducible build with docker?
You can, but not in your final image, as that would mean a much larger image than necessary: it would include all the compilation tools, instead of being limited to only what you need to execute the resulting binary.
You can see an alternative in "How do I build a Docker image for a Ruby project without build tools?"
I use an image to build,
I commit the resulting stopped container as a new image (with a volume including the resulting binary)
I use an execution image (one which only contains what you need to run), and copy the binary from the other image. I commit the resulting container again.
The final image includes the compiled binary and the execution environment.
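A rough sketch of a closely related flow, using docker create and docker cp to extract the binary instead of committing containers (all image, file and path names are hypothetical):
# Build an image that contains the toolchain and the compiled binary
docker build -t myapp-build -f Dockerfile.build .
# Create (but do not start) a container from it and copy the binary out
cid=$(docker create myapp-build)
docker cp "$cid":/src/myapp ./myapp
docker rm "$cid"
# Build the small execution image; Dockerfile.runtime simply COPYs ./myapp in
docker build -t myapp -f Dockerfile.runtime .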
I wanted to post an answer to this as well, to build on VonC's answer. I recently had Red Hat OpenShift training, and they use a tool called Source-to-Image (s2i), which uses Docker to create Docker images. This strategy is great for managing a private (or public) cloud, where your build may be compiled on different machines but you need to keep the build environment consistent.
