Docker Container Library Duplication

New to Docker, and I need some help clarifying a basic container concept.
As far as I know, each container includes the application code, libraries, runtime, config files, etc.
If I run N containers for N applications, and each of those applications happens to use the same set of libraries, does that mean my host system literally ends up with N-1 duplicate copies of those libraries?
While containers reduce the OS overhead of the VM approach to virtualization, I am wondering whether the container approach still has room to improve in terms of resource optimization.
Thanks
Mira

Containers are runtime instances defined by an image. Docker uses a union filesystem to merge multiple layers together into the root filesystem you see inside your container. Each step in the build of an image is a layer, and the container itself gets a copy-on-write layer attached just to that container so that it sees its own changes. Because of this, Docker can point multiple running instances of an image back to the same image files for the union filesystem layers; it never copies a layer when you spin up another container, so they all point back to the same filesystem bytes.
In short, if you have a 1 GB image and spin up 100 containers all using that same image, only the 1 GB image plus any changes made in those 100 containers exist on disk, not 100 GB.
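A quick way to see this on a host (nginx is used here only as a stand-in image; any image behaves the same way):

    docker run -d --name web1 nginx
    docker run -d --name web2 nginx
    docker system df    # the image is counted once on disk, no matter how many containers use it
    docker ps -s        # SIZE shows only each container's copy-on-write layer, not the whole image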

Related

Increasing the storage space of a docker container on Windows to 2-3TB

I'm working on a Windows computer with 5 TB of available space, building an application that processes large amounts of data and uses Docker containers to create replicable environments. Most of the data processing is done in parallel by many smaller containers, but the final tool/container requires all the data to come together in one place. The output area is mounted to a volume, but most of the data is just copied into the container. This will be multiple TB of storage space. RAM luckily isn't an issue in this case.
I'm willing to try any suggestions and make whatever changes I can.
Is this possible?
I've tried increasing disk space for Docker using .wslconfig, but this doesn't help.
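For reference, bind-mounting host storage into a container (rather than copying the data in) keeps the data on the host drive instead of inside Docker Desktop's WSL2 virtual disk; the drive letter, path, and image name below are placeholders:

    docker run --rm -v D:\bigdata:/data my-processing-image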

How to inspect contents of different Docker image layers?

My current understanding of a Docker image is that it is a collection of individual layers. Each layer only contains deltas that are merged via the union filesystem (which simply mounts all layers on top of each other). When instantiating an image, another (writable) layer is put on top that will then contain all container-specific changes that are persisted between restarts. Please correct me if I am wrong in any of the above.
I would like to inspect the contents of each of the various layers. I am particularly interested in inspecting the top-most layer to see whether my containerized app writes any data that would bloat the container, like a log or so. I am working on macOS, which does not store all the files in /var/lib/docker/, but seems to store them in a VM. I read about the docker-machine tool, which makes it easy to connect to the Docker engine via SSH, where one would be able to see and mount all layers. However, this tool seems to be discontinued.
Does anybody have an idea on 1) how to connect to the docker engine to get access to the layers and 2) how to find out what files are contained in a particular layer?
edit: It seems to be possible to use docker diff to see the file differences between the original image and the running container, which is what I mainly wanted to achieve, but the original questions remain.
You can list the layers and their sizes with the docker history command. But to inspect the contents of all layers, I recommend using the dive tool.
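For example (my-image and my-container are placeholder names):

    docker history my-image       # list the layers and their sizes
    docker diff my-container      # files added (A), changed (C) or deleted (D) in the container's writable layer
    dive my-image                 # interactively browse the contents of each layer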

Docker using separate lib/bin

I'm new to Docker and, as I understand it, Docker uses the same libs/bins for multiple containers where possible.
How can I tell Docker not to do that, i.e. to use a fresh lib or bin even if the same lib/bin already exists?
To be concrete:
I use this image and I want to start multiple instances of geth-testnet, but each of them should use its own blockchain.
I don't believe you need to worry about this. Docker hashes the layers beneath the image to maximize reuse. These layers are all read-only and are mounted with the union filesystem under a container-specific read-write layer. The result is very efficient on the filesystem and transparent to the user, who sees the files as writable inside their isolated container. However, if you modify a file in one container, the change will not be visible in any other container and will be lost when the container is removed and replaced with a new instance.
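For the concrete case of one blockchain per container, the usual approach is a separate volume per instance rather than disabling layer sharing. A hedged sketch (the image name and the /root/.ethereum data directory are assumptions; check the image's documentation for its actual data path):

    docker volume create chain1
    docker volume create chain2
    docker run -d --name geth1 -v chain1:/root/.ethereum <geth-testnet-image>
    docker run -d --name geth2 -v chain2:/root/.ethereum <geth-testnet-image>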

One docker container per node or many containers per big node

We have a little farm of docker containers, spread over several Amazon instances.
Would it make sense to have fewer big host instances (in terms of RAM and size) hosting multiple smaller containers at once, or to have one host instance per container, sized according to the container's needs?
EDIT #1
The issue here is that we need to decide up-front. I understand that we can decide later using various monitoring stats, but we need to make some architecture and infrastructure decisions before the system goes into use. Moreover, we do not have control over what content is going to be deployed.
You should read
An Updated Performance Comparison of Virtual Machines and Linux Containers
http://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
and
Resource management in Docker
https://goldmann.pl/blog/2014/09/11/resource-management-in-docker/
You need to check how much memory, CPU, I/O, etc. your containers consume, and then draw your conclusions.
At a minimum, you can easily check a few things with docker stats and docker top my_container.
The associated docs:
https://docs.docker.com/engine/reference/commandline/stats/
https://docs.docker.com/engine/reference/commandline/top/
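For example (my_container is the container name used above):

    docker stats --no-stream     # one-off snapshot of CPU, memory, network and block I/O per container
    docker top my_container      # processes running inside my_container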

What is the impact of using multiple Base Images in Docker?

I understand that Docker containers are portable between Docker hosts, but I am confused about the relationship between the base image and the host.
From the documentation on images, it appears that you would have a much heavier footprint (akin to multiple VMs) on the host machine if you had a variety of base images running. Is this assumption correct?
GOOD: Many containers sharing a single base image.
BAD: Many containers running separate/unique base images.
I'm sure a lot of this confusion comes from my lack of knowledge of LXC.
I am confused about the relationship between the base image and the host.
The only relation between the container and the host is that they use the same kernel. Programs running in Docker can't see the host filesystem at all, only their own filesystem.
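A quick way to see the shared-kernel point on a Linux host (alpine is used only as an arbitrary small image):

    docker run --rm alpine uname -r    # prints the host's kernel version
    uname -r                           # same value when run directly on the host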
it appears that you would have a much heavier footprint (akin to multiple VMs) on the host machine if you had a variety of base images running. Is this assumption correct?
No. The Ubuntu base image is about 150 MB, but you'd be hard-pressed to actually use all of those programs and libraries; you only need a small subset for any particular purpose. In fact, if your container is running memcached, you could just copy the 3 or 4 libraries it needs and it would be about 1 MB; there's no need for a shell, etc. The unused files just sit there patiently on disk, completely ignored. They are not loaded into memory, nor are they copied around on disk. (See the sketch at the end of this answer.)
GOOD: Many containers sharing a single base image.
BAD: Many containers running separate/unique base images.
No. Using multiple base images only uses a tiny bit more RAM. (Obviously, multiple base images will take more disk space, but disk is cheap, so we'll ignore that.) So I'd argue that it's "OK" rather than "BAD".
Example: I start one Ubuntu container with memcached and another CentOS container with Tomcat. If they were both running Ubuntu, they could share the RAM for things like libc. But because they don't share files, each base image must load its own copy of libc. As we've seen, though, we're only talking about 150 MB of files, and you're probably only using a few percent of that, so each extra base image only wastes a few MB of RAM.
(Hint: look at your process in ps. That's how much RAM it's using, including any files from its image.)
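A way to check the "3 or 4 libraries" claim on a Linux machine with memcached installed, plus a hedged sketch of packaging only those files (the lib/ and lib64/ directories are assumed to have been pre-populated with the libraries ldd reports, including the dynamic loader; all paths are illustrative):

    ldd $(which memcached)     # lists the handful of shared libraries the binary actually needs

    FROM scratch
    COPY memcached /usr/bin/memcached
    COPY lib/ /lib/
    COPY lib64/ /lib64/
    ENTRYPOINT ["/usr/bin/memcached", "-u", "nobody"]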
At the moment, Docker uses AUFS, which is a union filesystem with copy-on-write.
When you have multiple base images, those images take disk space, but when you run N containers from those images, no additional disk is used for the image content. Because it is copy-on-write, only modified files take space on the host.
So really, whether you have 1 or N base images changes nothing, no matter how many containers you have.
An image is nothing more than a filesystem you could chroot into; there is absolutely no relation between an image and the host, beside the fact that it needs to contain Linux binaries for the same architecture.
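Two quick checks on a host (the image name is a placeholder):

    docker info | grep "Storage Driver"    # e.g. aufs or overlay2, both copy-on-write union filesystems
    docker image inspect --format '{{json .RootFS.Layers}}' <image>    # the layer digests that every container of this image shares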
I think that multiple base images have just a minor impact on the memory used.
Explanation:
I think that your comparison to VMs is a bit misleading. Sure, with e.g. 3 base images running you will have higher memory requirements than with just 1 base image, but VMs would have even higher memory requirements:
Rough calculation - Docker, for M images and N containers:
with 1 base image: 1 x size of base image + N x container (file system + working memory)
with M base images: M x size of base image + N x container (file system + working memory)
Calculation - VMs:
N x VM image = at least N x size of the base image for the specific VM + N x size of container (file system + working memory)
For Docker to gain an advantage, you need M << N.
For small M and large N, the difference between Docker and multiple VMs is significant.
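An illustrative example with assumed numbers: M = 3 base images of roughly 150 MB each and N = 30 containers, each writing about 10 MB of changes.

    Docker: 3 x 150 MB + 30 x 10 MB  =  750 MB of image/container data on disk
    VMs:    30 x (multi-GB guest OS image) + 30 x 10 MB  =  tens of GB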
