Does the extraction (untarring) of Docker image layers by docker pull have to be performed sequentially, or could it be parallelized?
Example
docker pull mirekphd/ml-cpu-r40-base - an image which had to be split into more than 50 layers for build performance reasons. It contains around 4k R packages precompiled as DEBs (the entire contents of the CRAN Task Views), which would be impossible to build in Docker without splitting them into multiple layers of roughly equal size; that split cuts the build time from a whole day to minutes. The extraction stage - if parallelized - could become up to 50 times faster...
Context
When you observe docker pull for a large multi-layer image (gigabytes in size), you will notice that the download of each layer can be performed separately, in parallel. Not so for the subsequent extraction (untarring) of each of these layers, which is performed sequentially. Do we know why?
From my anecdotal observations with such large images, parallelizing the extraction would speed up the docker pull operation considerably.
Moreover, if splitting an image into more layers let you spin up containers faster, people would start writing Dockerfiles that are more readable and faster to both debug/test and pull/run, rather than trying to pile all instructions into a single slow-building, impossibly convoluted and cache-busting string of instructions only to save a few megabytes of extra layer overhead (which would be easily recouped by parallel extraction).
From the discussion at https://github.com/moby/moby/issues/21814, there are two main reasons that layers are not extracted in parallel:
It would not work on all storage drivers.
It would potentially use lots of CPU.
See the related comments below:
Note that not all storage drivers would be able to support parallel extraction. Some are snapshotting filesystems where the first layer must be extracted and snapshotted before the next one can be applied.
#aaronlehmann
We also don't really want a pull operation consuming tons of CPU on a host with running containers.
#cpuguy83
And the user who closed the linked issue wrote the following:
This isn't going to happen for technical reasons. There's no room for debate here. AUFS would support this, but most of the other storage drivers wouldn't support this. This also requires having specific code to implement at least two different code paths: one with this parallel extraction and one without it.
An image is basically something like this graph A->B->C->D and most Docker storage drivers can't handle extracting any layers which depend on layers which haven't been extracted already.
Should you want to speed up docker pull, you most certainly want faster storage and faster network. Go itself will contribute to performance gains once Go 1.7 is out and we start using it.
I'm going to close this right now because any gains from parallel extraction for specific drivers aren't worth the complexity for the code, the effort needed to implement it and the effort needed to maintain this in the future.
#unclejack
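To make the order dependence concrete, here is a rough conceptual sketch in Python (not Docker's actual code; the layer tarball names and the rootfs path are made up). Reading and checksumming the layer tarballs is independent per layer, so it parallelizes just like the download phase, but applying them must happen bottom-up because a later layer can overwrite or delete (white out) files created by an earlier one:

import hashlib
import tarfile
from concurrent.futures import ThreadPoolExecutor

layers = ["layer_A.tar", "layer_B.tar", "layer_C.tar"]  # A -> B -> C, bottom-up

def checksum(path):
    # Per-layer work with no dependency on other layers: safe to run in parallel.
    with open(path, "rb") as f:
        return path, hashlib.sha256(f.read()).hexdigest()

with ThreadPoolExecutor() as pool:
    digests = dict(pool.map(checksum, layers))

# Applying layers is order-dependent: each tarball is extracted on top of the
# result of the previous one, so later files overwrite earlier ones.
for path in layers:
    with tarfile.open(path) as tar:
        tar.extractall("rootfs")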
Related
How can I measure the efficiency of a container image, in terms of what portion of its contents are actually used (accessed) for the processes therein?
There are various forms of wastage that could contribute to excessively large images, such as layers storing files that are superseded in later layers (which can be analysed using dive), or binaries interlaced with unstripped debug information, or the inclusion of extraneous files (or data) that are simply not needed for the process which executes in the container. Here I'm asking about the latter.
Are there docker-specific tools (analogous to dive) for estimating/measuring this kind of wastage/efficiency, or should I just apply general Linux techniques? Can the filesystem access time (atime) be relied upon inside a container (to distinguish which files have/haven't been read since the container was instantiated) or do I need to instrument the image with tools like the Linux auditing system (auditd)?
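To make the atime idea concrete, here is a rough sketch of the kind of measurement I have in mind (run inside the container; it is only meaningful if the mount actually updates atime, which noatime/relatime settings and overlay filesystems may prevent, and the PID 1 trick for the start time is just an approximation):

import os

# Approximate the container start time by the mtime of /proc/1 (PID 1's entry).
start = os.stat("/proc/1").st_mtime
used, unused = [], []

for dirpath, dirnames, filenames in os.walk("/", topdown=True):
    # Skip virtual filesystems that are not part of the image contents.
    dirnames[:] = [d for d in dirnames
                   if os.path.join(dirpath, d) not in ("/proc", "/sys", "/dev")]
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.stat(path, follow_symlinks=False)
        except OSError:
            continue
        (used if st.st_atime >= start else unused).append(path)

print(len(used), "files read since start;", len(unused), "apparently never read")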
I have a base image and several completely orthogonal "dimensions" of completely static overlays that map into data directories, each with several options, that I want to permute to produce final container(s) in my deployments. As a degenerate example, the base image (X) will need one of each of (A,B,C), (P,D,Q), and (K,L,M) at deployment time. What I'm doing now is building separate images for each permutation that I end up actually needing: e.g. XADM, XBDK, etc. The problem is that as the number of dimensions of static data overlays expands and the number of choices inside each dimension gets larger, I run into serious combinatorial explosion issues - it might take 10 minutes for our CI/CD system to build each image (some of the overlays are large) and since it is the base image that changes most often, the layers don't cache well.
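To put numbers on the explosion, with the toy example above (the dimension and option names are just my placeholders):

import itertools

# Each "dimension" of static overlays, one option picked per dimension.
dimensions = [("A", "B", "C"), ("P", "D", "Q"), ("K", "L", "M")]

images = ["X" + "".join(choice) for choice in itertools.product(*dimensions)]
print(len(images), "images to build")  # 3 * 3 * 3 = 27; at ~10 min each, ~4.5 hours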
Thoughts so far:
generate each layer (ABCPDQKLM) as a separate container that populates a volume which then gets mounted by each of my X containers. This is fine, though I NEVER need the layers to be writable and don't especially want to pay for persistent storage associated with volumes that feel like they should be superfluous.
reorder my layers to be slowest-to-fastest changing. I can get some improvement from doing this, but I still hit the combinatorics issue: I probably still have to build all the combinations I need, but at least my CI/CD build time will be improved. I think it results in poorer overall layer caching, but trading off space for time might be reasonable and the result per tenant is still good and doesn't incur any volume storage during deployment.
I'm not happy about either option (or my current solution). Any ideas would be welcome.
Edits/Questions:
"static" means read-only, but as a practical matter, the A/B/C overlays might each be a few 100MB of directory structure to be mounted/present in a specific place in the container's file system. In every case, it is data that is going to be used (even memory-mapped!) by the programs in the base image, so it needs to be at least very effectively cached near each of the CPUs that is going to be using it. I like the performance characteristics of having the data baked into the containers, but perhaps I should be trusting the storage layer more to keep the data properly cached/replicated near the real CPUs. Doing so means trading off registry space charges against PV storage charges, but that may be a minor consideration.
Basically, each "dimension" is a type of trained machine learning model. I need to compose the dimensions by choosing the right set of trained models to fit the domain required for each of many production tenants.
In a current project I have to perform the following tasks (among others):
capture video frames from five IP cameras and stitch a panorama
run machine learning based object detection on the panorama
stream the panorama so it can be displayed in a UI
Currently, the stitching and the streaming run in one docker container, and the object detection runs in another, reading the panorama stream as input.
Since I need to increase the input resolution for the object detector while maintaining the stream resolution for the UI, I have to look for alternative ways of getting the stitched (full resolution) panorama (~10 MB per frame) from the stitcher container to the detector container.
My thoughts regarding potential solutions:
shared volume. Potential downside: One extra write and read per frame might be too slow?
Using a message queue, e.g. Redis. Potential downside: yet another component in the architecture.
merging the two containers. Potential downside(s): Not only does it not feel right, but the two containers have completely different base images and dependencies. Plus I'd have to worry about parallelization.
Since I'm not the sharpest knife in the docker drawer, what I'm asking for are tips, experiences and best practices regarding fast data exchange between docker containers.
Usually most communication between Docker containers is over network sockets. This is fine when you're talking to something like a relational database or an HTTP server. It sounds like your application is a little more about sharing files, though, and that's something Docker is a little less good at.
If you only want one copy of each component, or are still actively developing the pipeline: I'd probably not use Docker for this. Since each container has an isolated filesystem and its own user ID space, sharing files can be unexpectedly tricky (every container must agree on numeric user IDs). But if you just run everything on the host, as the same user, pointing at the same directory, this isn't a problem.
If you're trying to scale this in production: I'd add some sort of shared filesystem and a message queueing system like RabbitMQ. For local work this could be a Docker named volume or bind-mounted host directory; cloud storage like Amazon S3 will work fine too. The setup is like this:
Each component knows about the shared storage and connects to RabbitMQ, but is unaware of the other components.
Each component reads a message from a RabbitMQ queue that names a file to process.
The component reads the file and does its work.
When it finishes, the component writes the result file back to the shared storage, and writes its location to a RabbitMQ exchange.
In this setup each component is totally stateless. If you discover that, for example, the machine-learning component of this is slowest, you can run duplicate copies of it. If something breaks, RabbitMQ will remember that a given message hasn't been fully processed (acknowledged); and again because of the isolation you can run that specific component locally to reproduce and fix the issue.
This model also translates well to larger-scale Docker-based cluster-computing systems like Kubernetes.
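As a rough illustration of one such component, here is a minimal worker sketch, assuming the pika client library, a queue named frames, an exchange named results, and shared storage mounted at /shared (all of these names are placeholders, not a prescribed layout):

import json
import pathlib

import pika  # RabbitMQ client

SHARED = pathlib.Path("/shared")  # named volume or bind mount shared by all components

def do_work(src):
    # Placeholder for the real stitching / detection step.
    return src.read_bytes()

def handle(ch, method, properties, body):
    msg = json.loads(body)                     # e.g. {"file": "frames/0001.png"}
    src = SHARED / msg["file"]

    dst = SHARED / "results" / src.name
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_bytes(do_work(src))              # write the result back to shared storage

    ch.basic_publish(exchange="results", routing_key="",
                     body=json.dumps({"file": str(dst.relative_to(SHARED))}))
    ch.basic_ack(delivery_tag=method.delivery_tag)  # only ack once fully processed

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="frames", durable=True)
channel.exchange_declare(exchange="results", exchange_type="fanout")
channel.basic_consume(queue="frames", on_message_callback=handle)
channel.start_consuming()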
Running this locally, I would absolutely keep separate concerns in separate containers (especially if individual image-processing and ML tasks are expensive). The setup I propose needs both a message queue (to keep track of the work) and a shared filesystem (because message queues tend to not be optimized for 10+ MB individual messages). You get a choice between Docker named volumes and host bind-mounts as readily available shared storage. Bind mounts are easier to inspect and administer, but on some platforms are legendarily slow. Named volumes I think are reasonably fast, but you can only access them from Docker containers, which means needing to launch more containers to do basic things like backup and pruning.
Alright, let's unpack this:
IMHO a shared volume works just fine, but gets way too messy over time, especially if you're handling stateful services.
MQ: this seems like the best option in my opinion. Yes, it's another component in your architecture, but it makes sense to have it rather than maintaining messy shared volumes or handling massive container images (if you manage to combine the two container images).
Yes, you could potentially do this, but it's not a good idea. Considering your use case, I'm going to go ahead and make an assumption that you have a massive list of dependencies which could potentially lead to a conflict. Also, lots of dependencies = larger image = larger attack surface, which from a security perspective is not a good thing.
If you really want to run multiple processes in one container, it's possible. There are multiple ways to achieve that; however, I prefer supervisord.
https://docs.docker.com/config/containers/multi-service_container/
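If you'd rather not pull in supervisord, one alternative is a plain wrapper entrypoint that starts both processes and exits if either dies. A minimal sketch (the two program paths are placeholders for your stitcher and detector binaries):

import os
import subprocess
import sys

# Start both services (paths are placeholders).
procs = [
    subprocess.Popen(["/usr/local/bin/stitcher"]),
    subprocess.Popen(["/usr/local/bin/detector"]),
]

os.wait()              # returns as soon as any child process exits
for p in procs:
    p.terminate()      # stop the rest so the container exits and can be restarted
sys.exit(1)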
In various places I have found the information that a docker image can only consist of up to 42 layers. This seems to be a limitation of the AUFS file system that is used.
Can anybody tell me why this limit exists or does anybody have some documentation explaining this?
I'm starting to suspect that there isn't any such hard limit.
Create the following python script, and name it "makeDockerfile.py"
with open("Dockerfile", "w") as file:
file.write("from alpine:3.8\n")
for i in range(0, 201):
file.write("run echo {i}\n".format(i=i))
then run
python makeDockerfile.py && docker build --tag manylayer . && docker run -it manylayer /bin/sh
You'll see that you are running a working container with > 200 layers.
(note, this was tested with linux containers on linux)
Note that this doesn't mean that this many layers are necessarily SUPPORTED, just that they are possible in some cases.
In fact, I've seen containers fail with far fewer than 42 layers, and removing any arbitrary layer seems to fix it. (see https://github.com/docker/for-win/issues/676#issuecomment-462991673 )
EDIT:
thaJeztah, a maintainer of Docker, has this to say about it:
The "42 layer limit" on aufs was on older versions of aufs, but should no longer be the case.
However, the 127 layer limit is still there. This is due to a restriction of Linux not accepting more than X arguments to a syscall that's used.
Although this limit can be raised in modern kernels, it's not the default, so if we'd go beyond that maximum, an Image built on one machine may not be usable on another machine.
( see https://github.com/docker/docker.github.io/issues/8230 )
From the Docker best practices documentation:
Minimize the number of layers
You need to find the balance between readability (and thus long-term maintainability) of the Dockerfile and minimizing the number of layers it uses. Be strategic and cautious about the number of layers you use.
They also give advice on how to avoid having too many layers:
That way you can delete the files you no longer need after they’ve been extracted and you won’t have to add another layer in your image
[..]
Lastly, to reduce layers and complexity, avoid switching USER back and forth frequently.
TL;DR: the benefit of minimizing the number of layers is like the benefit of having a few bigger files rather than many small ones. A docker pull is also faster (try downloading 2048 files of 1 kB each versus one file of 2 MB), and having fewer layers reduces the complexity of an image, and hence its maintainability.
As for the 42 limit... well, I guess they had to come up with a number and they picked this particular one ;)
This seems to be imposed primarily by AUFS (sfjro/aufs4-linux).
See PR 50, "Prohibit more than 42 layers in the core":
Hey, out of curiosity, what's the detailed rationale behind this?
People have been asking how to bypass the 42 layers limit, and we've always told them "wait for device mapper!" so... what should we tell them now?
We can't allow more than 42 layers just yet. Devicemapper indeed supports it, but then, if you push a 50 layer image to the registry, most people won't be able to use it. We'll enable this once we support more than 42 layers for AUFS.
PR 66 was supposed to remove that limit
This thread reported:
The 42 limit comes from the aufs code itself.
My guess is that it's because of stack depth: if you stack these kernel data structures too many levels deep, you might trigger an overflow somewhere. The hardcoded limit is probably a pragmatic way to avoid that.
There are a number of really tiny Linux Docker images that weigh in at around 4-5 MB, and the "full" distros start around 100 MB and climb to twice that.
Setting aside storage space and download time from a repo, are there runtime considerations to small vs large images? For example, if I have a compiled Go program, one copy running on Busybox and the other on Ubuntu, and I run say 10 of them on a machine, in what ways (if any) does it matter that one image is tiny and the other pretty heavy? Does one consume more runtime resources than the other?
I have never seen any real difference in the consumption of resources other than storage and RAM when the image is bigger; however, since Docker containers should be single-process, why carry the big overhead of unused clutter in your containers?
When trimming things down to small containers, there are some advantages you may consider:
Faster transfer when deploying (esp. important if you want to do rolling upgrades)
Costs: most of the times I used big containers, I ran into storage issues on small VMs
Distributed file systems: when using file storage like GlusterFS or other attached storage, big containers slow things down when booted and updated heavily
Massive overhead of data: if you have 500 MB of clutter, you'll have it on your dev machine, your CI/CD server, your registry AND every node of your production servers. This CAN matter depending on your use case.
I would say: if you just use a handful of containers internally, then the size is less important than if you are running hundreds of containers in production.