How file lookup works in a Docker container - docker

According to the Docker docs, every Dockerfile instruction creates a layer, and all the layers are kept when you create a new image based on an old one. So when I create my own image, hundreds of layers might be involved, because each base image inherits the layers of its own base image in turn.
As I understand it, file lookup in a container works this way:
1. A process wants to access file a, and the lookup starts from the container layer (the thin read/write layer).
2. UnionFS checks whether this layer has a record for the file (either the file itself or a marker saying it was deleted). If so, it returns the file or reports "not found" respectively, and the lookup ends. If not, it passes the task to the layer below.
3. The lookup ends at the bottom layer.
If that is how it works, then a file that resides in the bottom layer and is unchanged by the other layers, /bin/sh for example, would require going through all the layers down to the bottom. Even though the layers might be very lightweight, such a lookup would still take up to 100x as long as a regular one, which would be noticeable. But in my experience Docker is pretty fast, almost the same as a native OS. Where am I wrong?
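The lookup procedure described in the question can be sketched as a toy model. This only illustrates the asker's mental model and is not how any real union filesystem is implemented; the layer dictionaries, the WHITEOUT marker, and the lookup function are all made up for the example.

```python
from typing import Dict, List, Optional, Tuple

WHITEOUT = object()  # hypothetical marker meaning "this layer deleted the file"


def lookup(layers: List[Dict[str, object]], path: str) -> Tuple[Optional[object], int]:
    """Search layers top-down; return (content, number_of_layers_checked)."""
    for checked, layer in enumerate(layers, start=1):
        if path in layer:
            entry = layer[path]
            if entry is WHITEOUT:
                return None, checked  # deleted here: stop and report "not found"
            return entry, checked     # found here: stop
    return None, len(layers)          # reached the bottom without a match


# layers[0] is the container's thin writable layer; the last entry is the base layer.
layers: List[Dict[str, object]] = [{} for _ in range(99)] + [{"/bin/sh": "shell binary"}]
layers[0]["/etc/motd"] = WHITEOUT

print(lookup(layers, "/bin/sh"))    # ('shell binary', 100)
print(lookup(layers, "/etc/motd"))  # (None, 1)
```

In this model, a file that lives only in the base layer costs one check per layer, which is exactly the overhead the question is worried about.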

This is all thanks to UnionFS and union mounts!
Straight from Wikipedia:
It allows files and directories of separate file systems, known as
branches, to be transparently overlaid, forming a single coherent file
system.
And from an interesting article:
In the kernel, the filesystems are stacked in order of their mount
sequence, the first mounted filesystem is at the bottom of the mount
stack, and the latest mount is at the top of the stack. Only the files
and directories of the top of the mount stack are visible. With union
mounts, directory entries from the lower filesystems are merged with
the directory entries of upper filesystem, thus making a logical
combination of all mounted filesystems. Files with the same name in a
lower filesystem are masked, as the upper one takes precedence.
So it doesn't "go through layers" in the conventional sense (e.g. one at a time); rather, it knows at any given time which branch (layer) each file resides in.
Doing this at the filesystem level also means that none of the software has to worry about where a file resides: a program just asks for /bin/sh, and the filesystem knows where to get it.
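One way to picture the merging described in the quote is as a single table that maps each visible path to the layer that provides it. The sketch below is a toy model with made-up layer contents and a hypothetical build_merged_view helper; it is not a description of how a kernel driver is implemented, and real union filesystems are considerably more involved.

```python
from typing import Dict, List, Optional, Set


def build_merged_view(layers: List[Dict[str, Optional[str]]]) -> Dict[str, int]:
    """Map each visible path to the index of the layer that provides it.

    layers[0] is the top layer. A value of None marks a path as deleted,
    which hides the same path in every lower layer.
    """
    merged: Dict[str, int] = {}
    hidden: Set[str] = set()
    for index, layer in enumerate(layers):
        for path, content in layer.items():
            if path in merged or path in hidden:
                continue  # an upper layer already decided this path
            if content is None:
                hidden.add(path)
            else:
                merged[path] = index
    return merged


layers = [
    {"/etc/motd": None, "/app/run.sh": "v2"},    # top layer
    {"/app/run.sh": "v1"},                       # middle layer
    {"/bin/sh": "shell", "/etc/motd": "hello"},  # base layer
]
view = build_merged_view(layers)
print(view)  # {'/app/run.sh': 0, '/bin/sh': 2}
```

Once such a view exists, asking for /bin/sh is a single dictionary lookup no matter how many layers sit above the base layer.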
More info can be found in this webinar.
So to answer your question:
Where am I wrong?
You assume that it has to look through the layers one at a time, but it doesn't have to do that. (UnionFS is awesome!)

To add to the correct prior answer: implementors of copy-on-write (CoW) and union filesystems want near-native performance, so they have naturally tuned their implementations and "API" for the best possible lookup and filesystem performance.
That said, it's good to be aware that Docker does not run on just one type of union/CoW filesystem; it has a small array of available options, with the default depending on the Linux distro on which it is installed.
AUFS and overlay(fs) are the most common, but Docker also supports devicemapper (Red Hat contributed and supported on Fedora/RHEL/CentOS), btrfs, and zfs. I have a blog post comparing and contrasting the various options that may be of interest.

Related

Check how much of a docker image is accessed?

How can I measure the efficiency of a container image, in terms of what portion of its contents are actually used (accessed) for the processes therein?
There are various forms of wastage that could contribute to excessively large images, such as layers storing files that are superseded in later layers (which can be analysed using dive), or binaries interlaced with unstripped debug information, or the inclusion of extraneous files (or data) that are simply not needed for the process which executes in the container. Here I'm asking about the latter.
Are there docker-specific tools (analogous to dive) for estimating/measuring this kind of wastage/efficiency, or should I just apply general Linux techniques? Can the filesystem access time (atime) be relied upon inside a container (to distinguish which files have/haven't been read since the container was instantiated) or do I need to instrument the image with tools like the Linux auditing system (auditd)?
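As a rough illustration of the atime idea, the sketch below walks a directory tree and reports files whose access time is older than a reference timestamp. It assumes access times are actually being recorded; mount options such as noatime or relatime can suppress or delay updates, and the sketch does not check for them. The files_not_read_since name and the demo file names are made up for the example.

```python
import os
import tempfile
from typing import List


def files_not_read_since(root: str, since: float) -> List[str]:
    """Return paths under root whose access time is older than `since`."""
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_atime < since:
                stale.append(path)
    return sorted(stale)


# Demo with explicitly set timestamps, so it does not depend on mount options.
with tempfile.TemporaryDirectory() as root:
    used = os.path.join(root, "used.bin")
    unused = os.path.join(root, "unused.bin")
    for path in (used, unused):
        open(path, "w").close()
    start = 1_000_000.0                         # pretend the container started here
    os.utime(unused, (start - 60, start - 60))  # last read before the start
    os.utime(used, (start + 60, start + 60))    # read after the start
    demo_result = [os.path.basename(p) for p in files_not_read_since(root, start)]

print(demo_result)  # ['unused.bin']
```

Inside a real container the reference timestamp would be the container start time, and the result is only as trustworthy as the atime behaviour of the underlying mount.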

How to inspect contents of different Docker image layers?

My current understanding of a Docker image is that it is a collection of individual layers. Each layer only contains deltas that are merged via the union filesystem (which simply mounts all layers on top of each other). When instantiating an image, another (writable) layer is put on top that will then contain all container-specific changes that are persisted between restarts. Please correct me if I am wrong in any of the above.
I would like to inspect the contents of each of the various layers. I am particularly interested in inspecting the top-most layer to see whether my containerized app writes any data that would bloat the container, like a log or so. I am working on macOS, which does not store all the files in /var/lib/docker/, but seems to store them in a VM. I read about the docker-machine tools that make it easy to connect to the Docker engine via SSH, where one would be able to see and mount all layers. However, this tool seems to be discontinued.
Does anybody have an idea on 1) how to connect to the docker engine to get access to the layers and 2) how to find out what files are contained in a particular layer?
edit: It seems to be possible to use docker diff to see the file differences between the original image and the running container, which is what I mainly wanted to achieve, but the original questions remain.
You can list the layers and their sizes with the docker history command. To inspect the contents of all layers, though, I recommend using the dive tool.
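For the narrower goal mentioned in the edit (spotting what a running container has written), the output of docker diff can be grouped by change type. The sketch below assumes the usual output format of one change per line, prefixed with A (added), C (changed), or D (deleted); the summarize_diff helper and the sample paths are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

KINDS = {"A": "added", "C": "changed", "D": "deleted"}


def summarize_diff(output: str) -> Dict[str, List[str]]:
    """Group `docker diff` output lines by change type."""
    summary: Dict[str, List[str]] = defaultdict(list)
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        kind, _, path = line.partition(" ")  # split on the first space only
        summary[KINDS.get(kind, "unknown")].append(path)
    return dict(summary)


# Hypothetical sample output, as if copied from `docker diff <container>`.
sample = """C /var
C /var/log
A /var/log/app.log
D /tmp/cache.tmp
"""
result = summarize_diff(sample)
print(result["added"])  # ['/var/log/app.log']
```

The "added" and "changed" groups are the ones to watch when checking whether an app bloats the container's writable layer, for example with log files.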

Best practice to permute static directory structures at deployment of Docker containers (in Kubernetes)?

I have a base image and several completely orthogonal "dimensions" of completely static overlays that map into data directories, each with several options, that I want to permute to produce final container(s) in my deployments. As a degenerate example, the base image (X) will need one of each of (A,B,C), (P,D,Q), and (K,L,M) at deployment time. What I'm doing now is building separate images for each permutation that I end up actually needing: e.g. XADM, XBDK, etc. The problem is that as the number of dimensions of static data overlays expands and the number of choices inside each dimension gets larger, I run into serious combinatorial explosion issues - it might take 10 minutes for our CI/CD system to build each image (some of the overlays are large) and since it is the base image that changes most often, the layers don't cache well.
Thoughts so far:
1. Generate each layer (ABCPDQKLM) as a separate container that populates a volume, which then gets mounted by each of my X containers. This is fine, though I NEVER need the layers to be writable and don't especially want to pay for persistent storage associated with volumes that feel like they should be superfluous.
2. Reorder my layers to be slowest-to-fastest changing. I can get some improvement from doing this, but I still hit the combinatorics issue: I probably still have to build all the combinations I need, but at least my CI/CD build time will be improved. I think it results in poorer overall layer caching, but trading off space for time might be reasonable, and the result per tenant is still good and doesn't incur any volume storage during deployment.
I'm not happy about either option (or my current solution). Any ideas would be welcome.
Edits/Questions:
"static" means read-only, but as a practical matter, the A/B/C overlays might each be a few 100MB of directory structure to be mounted/present in a specific place in the container's file system. In every case, it is data that is going to be used (even memory-mapped!) by the programs in the base image, so it needs to be at least very effectively cached near each of the CPUs that is going to be using it. I like the performance characteristics of having the data baked into the containers, but perhaps I should be trusting the storage layer more to keep the data properly cached/replicated near the real CPUs. Doing so means trading off registry space charges against PV storage charges, but that may be a minor consideration.
Basically, each "dimension" is a type of trained machine learning model. I need to compose the dimensions by choosing the right set of trained models to fit the domain required for each of many production tenants.
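To make the combinatorial growth concrete, the degenerate example above can be enumerated with itertools.product. The all_image_names helper is made up for illustration; the dimension letters are the ones from the question.

```python
from itertools import product
from typing import List, Sequence

base = "X"
dimensions = [("A", "B", "C"), ("P", "D", "Q"), ("K", "L", "M")]


def all_image_names(base: str, dimensions: Sequence[Sequence[str]]) -> List[str]:
    """Build one image name per combination: base plus one option per dimension."""
    return [base + "".join(choice) for choice in product(*dimensions)]


names = all_image_names(base, dimensions)
print(len(names))  # 27
print(names[:3])   # ['XAPK', 'XAPL', 'XAPM']

# Adding a fourth dimension with four options multiplies the count by four.
bigger = all_image_names(base, dimensions + [("1", "2", "3", "4")])
print(len(bigger))  # 108
```

Three dimensions of three options already give 27 possible images, and every new dimension multiplies that number by its option count, which is why building only the permutations actually needed stops scaling.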

Docker Container Library Duplication

New to Docker...
I need some help clarifying a basic container concept...
AFAIK, each container includes the app code, libraries, runtime, config files, etc.
If I run N containers for N apps, and each app happens to use the same set of libraries, does that mean my host system literally ends up with N-1 duplicate copies of those libraries?
While containers reduce the OS overhead of the VM approach to virtualization, I am just wondering whether the container approach still has room to improve in terms of resource optimization.
Thanks
Mira
Containers are runtime instances defined by an image. Docker uses a union filesystem to merge multiple layers together to create the root filesystem you see inside your container. Each step in the build of an image is a layer, and the container itself has a copy-on-write layer attached just to that container so that it sees its own changes. Because of this, Docker can point multiple running instances of an image back to the same image files for the unionfs layers. It never copies the layers when you spin up another container; they all point back to the same filesystem bytes.
In short, if you have a 1 GB image and spin up 100 containers all using that image, the disk will hold only the 1 GB image plus any changes made in those 100 containers, not 100 GB.
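The arithmetic behind that claim can be written out as follows. The 10 MB of changes per container is an assumed figure for illustration only, and both function names are made up.

```python
def shared_layer_usage_mb(image_mb: int, containers: int, changes_mb: int) -> int:
    """Disk usage when all containers share the read-only image layers."""
    return image_mb + containers * changes_mb


def full_copy_usage_mb(image_mb: int, containers: int, changes_mb: int) -> int:
    """Disk usage if every container kept its own full copy of the image."""
    return containers * (image_mb + changes_mb)


# 1 GB image, 100 containers, an assumed 10 MB of writes per container.
print(shared_layer_usage_mb(1024, 100, 10))  # 2024
print(full_copy_usage_mb(1024, 100, 10))     # 103400
```

With shared layers, total usage grows only with what each container writes, not with the size of the image.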

Docker images across multiple disks

I'm getting going with Docker, and I've found that I can put the main image repository on a different disk (symlink /var/lib/docker to some other location).
However, now I'd like to see if there is a way to split that across multiple disks.
Specifically, I have an old SSD that is blazingly fast to read from but doesn't have many writes left before it kicks the bucket. It would be awesome if I could store the immutable images on it and keep my writable images in some other location that can handle the writes.
Is this something that is possible? How do you split up the repository?
Maybe you could do this using the AUFS driver and some trickery, such as moving layers to the SSD after initially creating them and pointing symlinks at them. I'm not sure; I never had a proper look at how that storage driver works.
With devicemapper thinp, btrfs and OverlayFS this isn't possible AFAICT:
The Docker dm-thinp and btrfs drivers both build layers one on top of the other using block device snapshot mechanisms. Your best bet here would be to include the SSD in the storage pool and rely on some ability to migrate the r/o snapshots to a specific block device that is part of the pool. I doubt this exists, though.
The OverlayFS driver stacks layers by hard-linking files in independent directory structures. Hard links only work within a filesystem.
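The hard-link constraint is easy to see outside Docker. The sketch below creates a hard link between two directories on the same filesystem and shows that both names share one inode. The same_filesystem helper and the lower/upper directory names are made up for illustration, and nothing here reproduces what the storage driver itself does.

```python
import os
import tempfile


def same_filesystem(path_a: str, path_b: str) -> bool:
    """True if both paths live on the same device (their st_dev values match)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


with tempfile.TemporaryDirectory() as root:
    lower = os.path.join(root, "lower")
    upper = os.path.join(root, "upper")
    os.mkdir(lower)
    os.mkdir(upper)

    src = os.path.join(lower, "sh")
    with open(src, "w") as handle:
        handle.write("binary")

    dst = os.path.join(upper, "sh")
    os.link(src, dst)  # hard link: a second name for the same inode

    shares_inode = os.stat(src).st_ino == os.stat(dst).st_ino
    link_count = os.stat(src).st_nlink
    same_fs = same_filesystem(lower, upper)

print(shares_inode, link_count, same_fs)
```

If the two directories were on different filesystems (say, the SSD and another disk), os.link would fail with an OSError, which is the limitation the answer refers to.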

Resources