What's the reason for the 42 layer limit in Docker?

In various places I have found the information that a Docker image can only consist of up to 42 layers. This seems to be a limitation of the AUFS file system that is used.
Can anybody tell me why this limit exists or does anybody have some documentation explaining this?

I'm starting to suspect that there isn't any such hard limit.
Create the following Python script and name it "makeDockerfile.py":
# generate a Dockerfile with one RUN instruction (and thus one layer) per loop iteration
with open("Dockerfile", "w") as file:
    file.write("from alpine:3.8\n")
    for i in range(0, 201):
        file.write("run echo {i}\n".format(i=i))
then run
python makeDockerfile.py && docker build --tag manylayer . && docker run -it manylayer /bin/sh
You'll see that you are running a working container with > 200 layers.
(Note: this was tested with Linux containers on Linux.)
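If you want to double-check the layer count, you can ask Docker directly (a minimal sketch; manylayer is the tag from the build command above):
docker image inspect --format '{{len .RootFS.Layers}}' manylayer
# or count the build steps recorded in the image history (this also counts empty layers):
docker history -q manylayer | wc -l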
Note that this doesn't mean that this many layers are necessarily SUPPORTED, just that they are possible in some cases.
In fact, I've seen containers fail with far fewer than 42 layers, and removing any arbitrary layer seems to fix it. (see https://github.com/docker/for-win/issues/676#issuecomment-462991673 )
EDIT:
thaJeztah, maintainer of Docker, has this to say about it:
The "42 layer limit" on aufs was on older versions of aufs, but should no longer be the case.
However, the 127 layer limit is still there. This is due to a restriction of Linux not accepting more than X arguments to a syscall that's used.
Although this limit can be raised in modern kernels, it's not the default, so if we'd go beyond that maximum, an Image built on one machine may not be usable on another machine.
( see https://github.com/docker/docker.github.io/issues/8230 )

From the Docker BP documentation:
Minimize the number of layers
You need to find the balance between readability (and thus long-term maintainability) of the Dockerfile and minimizing the number of layers it uses. Be strategic and cautious about the number of layers you use.
They also give advice on how to avoid using too many layers:
That way you can delete the files you no longer need after they’ve been extracted and you won’t have to add another layer in your image
[..]
Lastly, to reduce layers and complexity, avoid switching USER back and forth frequently.
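To make the earlier advice about deleting files in the same layer concrete, here is a minimal Dockerfile sketch (the URL and archive name are made up) that downloads, extracts and removes a temporary archive inside a single RUN instruction, so the archive never becomes a layer of its own:
FROM alpine:3.8
# one layer: fetch, extract and delete the archive in the same RUN,
# instead of three separate instructions that would each add a layer
RUN wget -O /tmp/tool.tar.gz https://example.com/tool.tar.gz \
    && tar -xzf /tmp/tool.tar.gz -C /usr/local \
    && rm /tmp/tool.tar.gz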
TL;DR: the benefit of minimizing the number of layers is much like the benefit of having a few big files rather than many small ones. A docker pull is also faster (try downloading 2048 files of 1 kB each versus one 2 MB file), and having fewer layers reduces the complexity of an image and hence improves its maintainability.
As for the 42 limit... well, I guess they had to come up with a number and they picked this particular one ;)

This seems to be imposed primarily by AUFS (sfjro/aufs4-linux).
See PR 50 "Prohibit more than 42 layers in the core "
Hey, out of curiosity, what's the detailed rationale behind this?
People have been asking how to bypass the 42 layers limit, and we've always told them "wait for device mapper!" so... what should we tell them now?
We can't allow more than 42 layers just yet. Devicemapper indeed supports it, but then, if you push a 50 layer image to the registry, most people won't be able to use it. We'll enable this once we support more than 42 layers for AUFS.
PR 66 was supposed to remove that limit
This thread reported:
The 42 limit comes from the aufs code itself.
My guess is that it's because of stack depth: if you stack these kernel data structures too many levels deep, you might trigger an overflow somewhere. The hardcoded limit is probably a pragmatic way to avoid that.

Related

Why is docker pull not extracting layers in parallel?

Does extracting (untarring) of docker image layers by docker pull have to be conducted sequentially or could it be parallelized?
Example
docker pull mirekphd/ml-cpu-r40-base - an image which had to be split into more than 50 layers for build performance reasons. It contains around 4k R packages precompiled as DEBs (the entire CRAN Task Views contents), which would be impossible to build in Docker without splitting the packages into multiple layers of roughly equal size; that split cuts the build time from a whole day to minutes. The extraction stage - if parallelized - could become up to 50 times faster...
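Roughly, that splitting pattern looks like this Dockerfile sketch (the base image tag and the package list files are illustrative assumptions); each RUN produces its own independently cached layer:
FROM r-base:4.0.2
COPY pkgs-part1.txt pkgs-part2.txt pkgs-part3.txt /tmp/
RUN apt-get update \
 && xargs -a /tmp/pkgs-part1.txt apt-get install -y --no-install-recommends
RUN xargs -a /tmp/pkgs-part2.txt apt-get install -y --no-install-recommends
RUN xargs -a /tmp/pkgs-part3.txt apt-get install -y --no-install-recommends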
Context
When you observe docker pull for a large multi-layer image (gigabytes in size), you will notice that the download of each layer can be performed separately, in parallel. Not so for subsequent extracting (untarring) of each of these layers, which is performed sequentially. Do we know why?
From my anecdotal observations with such large images, parallelizing extraction would speed up the execution of the docker pull operation considerably.
Moreover, if splitting an image into more layers let you spin up containers faster, people would start writing Dockerfiles that are more readable and faster to both debug/test and pull/run, rather than trying to pile all instructions into a single slow-building, impossibly convoluted and cache-busting string of commands only to save a few megabytes of layer overhead (which would easily be recouped by parallel extraction).
From the discussion at https://github.com/moby/moby/issues/21814, there are two main reasons that layers are not extracted in parallel:
It would not work on all storage drivers.
It would potentially use lots of CPU.
See the related comments below:
Note that not all storage drivers would be able to support parallel extraction. Some are snapshotting filesystems where the first layer must be extracted and snapshotted before the next one can be applied.
#aaronlehmann
We also don't really want a pull operation consuming tons of CPU on a host with running containers.
#cpuguy83
And the user who closed the linked issue wrote the following:
This isn't going to happen for technical reasons. There's no room for debate here. AUFS would support this, but most of the other storage drivers wouldn't support this. This also requires having specific code to implement at least two different code paths: one with this parallel extraction and one without it.
An image is basically something like this graph A->B->C->D and most Docker storage drivers can't handle extracting any layers which depend on layers which haven't been extracted already.
Should you want to speed up docker pull, you most certainly want faster storage and faster network. Go itself will contribute to performance gains once Go 1.7 is out and we start using it.
I'm going to close this right now because any gains from parallel extraction for specific drivers aren't worth the complexity for the code, the effort needed to implement it and the effort needed to maintain this in the future.
#unclejack

Best practice to permute static directory structures at deployment of Docker containers (in Kubernetes)?

I have a base image and several completely orthogonal "dimensions" of completely static overlays that map into data directories, each with several options, that I want to permute to produce final container(s) in my deployments. As a degenerate example, the base image (X) will need one of each of (A,B,C), (P,D,Q), and (K,L,M) at deployment time. What I'm doing now is building separate images for each permutation that I end up actually needing: e.g. XADM, XBDK, etc. The problem is that as the number of dimensions of static data overlays expands and the number of choices inside each dimension gets larger, I run into serious combinatorial explosion issues - it might take 10 minutes for our CI/CD system to build each image (some of the overlays are large) and since it is the base image that changes most often, the layers don't cache well.
Thoughts so far:
generate each layer (ABCPDQKLM) as a separate container that populates a volume which then gets mounted by each of my X containers. This is fine, though I NEVER need the layers to be writable and don't especially want to pay for persistent storage associated with volumes that feel like they should be superfluous.
reorder my layers to be slowest-to-fastest changing. I can get some improvement from doing this, but I still hit the combinatorics issue: I probably still have to build all the combinations I need, but at least my CI/CD build time will be improved. I think it results in poorer overall layer caching, but trading off space for time might be reasonable and the result per tenant is still good and doesn't incur any volume storage during deployment.
I'm not happy about either option (or my current solution). Any ideas would be welcome.
Edits/Questions:
"static" means read-only, but as a practical matter, the A/B/C overlays might each be a few 100MB of directory structure to be mounted/present in a specific place in the container's file system. In every case, it is data that is going to be used (even memory-mapped!) by the programs in the base image, so it needs to be at least very effectively cached near each of the CPUs that is going to be using it. I like the performance characteristics of having the data baked into the containers, but perhaps I should be trusting the storage layer more to keep the data properly cached/replicated near the real CPUs. Doing so means trading off registry space charges against PV storage charges, but that may be a minor consideration.
Basically, each "dimension" is a type of trained machine learning model. I need to compose the dimensions by choosing the right set of trained models to fit the domain required for each of many production tenants.
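To make the setup concrete, the "one image per permutation" approach amounts to something like this Dockerfile sketch (all image names, tags and paths here are illustrative), with a separate build per permutation driven by build args:
ARG DIM1_IMAGE=registry.example.com/overlay-a:latest
ARG DIM2_IMAGE=registry.example.com/overlay-d:latest
ARG DIM3_IMAGE=registry.example.com/overlay-m:latest
FROM ${DIM1_IMAGE} AS dim1
FROM ${DIM2_IMAGE} AS dim2
FROM ${DIM3_IMAGE} AS dim3
FROM registry.example.com/base-x:latest
# copy each overlay's static data into the place the base image expects it
COPY --from=dim1 /data /data/dim1
COPY --from=dim2 /data /data/dim2
COPY --from=dim3 /data /data/dim3
# one build per permutation, e.g.:
# docker build --build-arg DIM2_IMAGE=registry.example.com/overlay-q:latest -t x-a-q-m .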

How big can a GKE container image get before it's a problem?

This question is admittedly somewhat vague. If you have suggestions on how to word it better, then by all means, please give me feedback...
I want to understand how big a GKE container image can get before there may be problems, either serious or minor. For example, I've built a docker image (not deployed yet) that is 683 MB.
(As an aside, the reason it's so big is that I'm running a computer vision library licensed from a company with certain attributes: (1) uses native libraries that are not compatible with Alpine; (2) uses Java; (3) uses Node.js to run a required licensing daemon in same container; (4) has some very large machine learning model files.)
Although the service will have auto-scaling enabled, I expect the auto-scaling to be fairly light. It might add a new pod occasionally, but not major spikes up and down.
The size of the container will determine how many resources to assign it, and thus how much CPU, memory and disk space your nodes must have. I have seen containers require over 2 GB of memory and still work fine within the cluster.
There probably is an upper limit, but the containers would have to be enormous; your container size should not pose any issues aside from possibly slower container startup.
In practice, you're going to have issues pushing an image to GCR before you have issues running it on GKE, but there isn't a hard limit outside the storage capabilities of your nodes. You can get away with O(GB) pretty easily.

How file lookup works in a Docker container

According to the Docker docs, every Dockerfile instruction creates a layer, and all the layers are kept when you create a new image based on an old one. So when I create my own image, I might have hundreds of layers involved because of the recursive inheritance of layers from base images.
In my understanding, file lookup in a container works this way:
a process wants to access file a; the lookup starts from the container layer (the thin r/w layer).
UnionFS checks whether this layer has a record for it (either the file itself or a deletion marker). If yes, it returns the file or reports "not found" respectively, ending the lookup. If not, it passes the task to the layer below.
the lookup ends at the bottom layer.
If that is how it works, consider a file that resides in the bottom layer and is unchanged by other layers, /bin/sh maybe: the lookup would have to go through all the layers down to the bottom. Even though the layers might be very lightweight, such a lookup would still take 100x longer than a regular one, which should be noticeable. But from my experience, Docker is pretty fast, almost the same as a native OS. Where am I wrong?
This is all thanks to UnionFS and Union mounts!
Straight from wikipedia:
It allows files and directories of separate file systems, known as
branches, to be transparently overlaid, forming a single coherent file
system.
And from an interesting article:
In the kernel, the filesystems are stacked in order of their mount
sequence, the first mounted filesystem is at the bottom of the mount
stack, and the latest mount is at the top of the stack. Only the files
and directories of the top of the mount stack are visible. With union
mounts, directory entries from the lower filesystems are merged with
the directory entries of upper filesystem, thus making a logical
combination of all mounted filesystems. Files with the same name in a
lower filesystem are masked, as the upper one takes precedence.
So it doesn't "go through layers" in the conventional sense (e.g. one at a time) but rather it knows (at any given time) which file resides on which disk.
Doing this in the filesystem layer also means none of the software has to worry about where the file resides, it knows to ask for /bin/sh and the filesystem knows where to get it.
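You can see this merging behaviour directly with a small experiment (a minimal sketch using overlayfs, one of the union filesystems Docker can use; needs root):
mkdir lower1 lower2 upper work merged
echo "from lower1" > lower1/a.txt
echo "from lower2" > lower2/a.txt
# the leftmost lowerdir is the uppermost branch, so lower2 masks lower1
mount -t overlay overlay -o lowerdir=lower2:lower1,upperdir=upper,workdir=work merged
cat merged/a.txt    # prints "from lower2" - the caller never walks the layers itself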
More info can be found in this webinar.
So to answer your question:
Where am I wrong?
You are thinking that it has to look through the layers one at a time while it doesn't have to do that. (UnionFS is awesome!)
To add to the correct prior answer: copy-on-write (CoW) and union filesystem implementors want near-native performance, so of course they have tuned their implementations and "API" for the best possible lookup/filesystem performance.
That said, it's good to be aware that Docker does not operate on top of only a single 'type' of union/CoW filesystem, but has a small array of available options, with defaults depending on the Linux distro on which it is installed.
AUFS and overlay(fs) are the most common, but Docker also supports devicemapper (Red Hat contributed and supported on Fedora/RHEL/CentOS), btrfs, and zfs. I have a blog post comparing and contrasting the various options that may be of interest.
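To see which storage driver your own installation is using, you can ask the daemon directly:
docker info --format '{{.Driver}}'    # e.g. overlay2, aufs, btrfs, devicemapper or zfs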

How to browse the contents of a docker/btrfs container-specific layer

I have read, and I believe understood, the docker pages on using btrfs, and notably this one
My question is rather simple: I need to be able to navigate (e.g. using cd and ls, but any other means is fine) in what the above link calls the Thin R/W layer attached to a given container.
The reason I need this is that I use an image that I have not built myself - namely jupyter/scipy-notebook:latest - and what I can see is that each container starts with a circa 100-200 MB impact on overall disk usage, even though nothing much should be going on in the container.
So I suspect some rather verbose logs get created that I need to quiet down a bit; however, the whole union fs is huge - circa 5 GB - so it would help me greatly to navigate only the data that is specific to one container so I can pinpoint the problem.
To list the files that are changed/stored since the original image, use
docker diff my-container
This is quite handy if you want to get an idea of what's happening inside, though it doesn't give you the file sizes.
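If the goal is also to find out how large that container-specific layer is, a couple of other commands can help (docker ps -s works with any storage driver; the UpperDir template below only applies to overlay2, since the btrfs driver instead keeps its layers as subvolumes under /var/lib/docker/btrfs/subvolumes):
docker ps -a -s --filter name=my-container    # the SIZE column is the writable layer's size
docker inspect --format '{{.GraphDriver.Name}}' my-container
# on overlay2 the writable directory can be located and browsed directly:
docker inspect --format '{{.GraphDriver.Data.UpperDir}}' my-container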
