Big newbie to Docker here, so forgive me if the question is unclear:
I am trying to build my own Jupyter notebook image, at the simplest level this is just:
FROM jupyter/scipy-notebook
RUN pip install keras
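which I then build with something like this (assuming the Dockerfile sits in the current directory):
docker build -t keras-notebook .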
When I build this image (let's call it keras-notebook), I understand that I end up with 2 images locally: the parent image jupyter/scipy-notebook and my keras-notebook.
Unfortunately, this leaves me with two 4 GB+ images stored locally - how can I build my keras-notebook without having to keep a local copy of jupyter/scipy-notebook?
The bigger question I also have is: how big is too big for a Docker image? Most people recommend image sizes in the hundreds of MB, and these images are in the multi-GB range - so does that (almost) defeat the point of containerizing this software?
If your base image is big, your custom container will be bigger.
There is no right or wrong way to build images, only more or less optimal ones. To illustrate: think about the download time of a 4 GB image versus a 100 MB one that does the same thing - of course you will use the 100 MB one.
Making images smaller is a challenge for most image builders.
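One small illustration, reusing the Dockerfile from the question (the only assumption is that you do not need pip's download cache baked into the image):
FROM jupyter/scipy-notebook
RUN pip install --no-cache-dir keras
Skipping the cache keeps the added layer a bit smaller, although the base image itself of course stays as big as it is.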
Related
Does extracting (untarring) of docker image layers by docker pull have to be conducted sequentially or could it be parallelized?
Example
docker pull mirekphd/ml-cpu-r40-base - an image which had to be split into more than 50 layers for build performance reasons. It contains around 4k R packages precompiled as DEBs (the entire CRAN Task Views contents), which would be impossible to build in Docker without splitting the packages into multiple layers of roughly equal size; that cuts the build time from a whole day to minutes. The extraction stage - if parallelized - could become up to 50 times faster...
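A rough sketch of that layer-splitting pattern (the base image and package groups below are hypothetical placeholders, not the actual contents of that image):
FROM ubuntu:20.04
# each RUN instruction becomes its own layer, so the groups build and cache independently
RUN apt-get update && apt-get install -y --no-install-recommends r-cran-ggplot2 r-cran-dplyr
RUN apt-get install -y --no-install-recommends r-cran-data.table r-cran-caret
RUN apt-get install -y --no-install-recommends r-cran-shiny r-cran-knitr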
Context
When you observe docker pull for a large multi-layer image (gigabytes in size), you will notice that the download of each layer can be performed separately, in parallel. Not so for subsequent extracting (untarring) of each of these layers, which is performed sequentially. Do we know why?
From my anecdotal observations with such large images, parallel extraction would speed up the execution of the docker pull operation considerably.
Moreover, if splitting an image into more layers let you spin up containers faster, people would start writing Dockerfiles that are more readable and faster to both debug/test and pull/run, rather than piling all instructions into a single slow-building, impossibly convoluted and cache-busting string of instructions just to save a few megabytes of extra layer overhead (which would easily be recouped by parallel extraction).
From the discussion at https://github.com/moby/moby/issues/21814, there are two main reasons that layers are not extracted in parallel:
It would not work on all storage drivers.
It would potentially use lots of CPU.
See the related comments below:
Note that not all storage drivers would be able to support parallel extraction. Some are snapshotting filesystems where the first layer must be extracted and snapshotted before the next one can be applied.
#aaronlehmann
We also don't really want a pull operation consuming tons of CPU on a host with running containers.
#cpuguy83
And the user who closed the linked issue wrote the following:
This isn't going to happen for technical reasons. There's no room for debate here. AUFS would support this, but most of the other storage drivers wouldn't support this. This also requires having specific code to implement at least two different code paths: one with this parallel extraction and one without it.
An image is basically something like this graph A->B->C->D and most Docker storage drivers can't handle extracting any layers which depend on layers which haven't been extracted already.
Should you want to speed up docker pull, you most certainly want faster storage and faster network. Go itself will contribute to performance gains once Go 1.7 is out and we start using it.
I'm going to close this right now because any gains from parallel extraction for specific drivers aren't worth the complexity for the code, the effort needed to implement it and the effort needed to maintain this in the future.
#unclejack
I want to upload a significant number of images (picture files, not Docker images) for processing inside a Docker instance.
From what I have observed, this is normally done with a download script (where the images are downloaded into the instance).
I have several terabytes of images, so I do not want to download them each time. Is there a way to get my images to a specific location in the Docker instance?
What is the standard way of doing this?
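One common approach (a sketch only; the host path /data/images and the image name my-processing-image are hypothetical) is to bind-mount the directory holding the data into the container instead of baking it into the image or downloading it each time:
docker run -v /data/images:/data/images:ro my-processing-image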
I have read, and I believe understood, the docker pages on using btrfs, and notably this one
My question is rather simple: I need to be able to navigate (e.g. using cd and ls, though any other means is fine) what the above link calls the thin R/W layer attached to a given container.
The reason I need this is that I use an image I have not built myself - namely jupyter/scipy-notebook:latest - and what I can see is that each container starts with a circa 100-200 MB impact on overall disk usage, even though nothing much should be going on in the container.
So I suspect some rather verbose logs get created that I need to silence a bit; however, the whole union filesystem is huge - circa 5 GB - so it would help me greatly to navigate only the data specific to one container so I can pinpoint the problem.
To list the files that have been changed or added since the original image, use:
docker diff my-container
This is quite handy if you want to get an idea of what's happening inside; it doesn't give you the file sizes, though.
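If the host happens to use the overlay2 storage driver (an assumption; the question above concerns btrfs, where the on-disk layout differs), the writable layer can also be located and sized directly:
UPPER=$(docker inspect --format '{{ .GraphDriver.Data.UpperDir }}' my-container)
sudo du -sh "$UPPER"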
There are a number of really tiny Linux Docker images that weigh in at around 4-5 MB, and the "full" distros that start around 100 MB and climb to twice that.
Setting aside storage space and download time from a repo, are there runtime considerations to small vs large images? For example, if I have a compiled Go program, one running on Busybox and the other on Ubuntu, and I run say 10 of them on a machine, in what ways (if any) does it matter that one image is tiny and the other pretty heavy? Does one consume more runtime resources than the other?
I have never seen any real difference in resource consumption, other than storage and RAM, when the image is bigger; however, as Docker containers should be single-process, why carry the big overhead of unused clutter in your containers?
When trimming things down to small containers, there are some advantages you may consider:
Faster transfer when deploying (especially important if you want to do rolling upgrades)
Costs: most of the time when I used big containers, I ran into storage issues on small VMs
Distributed file systems: when using file storage like GlusterFS or other attached storage, big containers slow things down when booted and updated heavily
Massive data overhead: if you have 500 MB of clutter, you'll have it on your dev machine, your CI/CD server, your registry AND every node of your production servers. This CAN matter depending on your use case.
I would say: if you just run a handful of containers internally, size matters little, if at all; it is a different story when you run hundreds of containers in production.
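As a sketch of how small such an image can get (assuming a statically linked Go binary called server has already been built; the name is hypothetical):
FROM busybox
COPY server /server
ENTRYPOINT ["/server"]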
I understand that Docker containers are portable between Docker hosts, but I am confused about the relationship between the base image and the host.
From the documentation on Images, it appears that you would have a much heavier footprint (akin to multiple VMs) on the host machine if you had a variety of base images running. Is this assumption correct?
GOOD: Many containers sharing a single base image.
BAD: Many containers running separate/unique base images.
I'm sure a lot of this confusion comes from my lack of knowledge of LXC.
I am confused about the relationship between the base image and the host.
The only relation between the container and the host is that they use the same kernel. Programs running in Docker can't see the host filesystem at all, only their own filesystem.
it appears that you would have a much heavier footprint (akin to multiple VMs) on the host machine if you had a variety of base images running. Is this assumption correct?
No. The Ubuntu base image is about 150 MB. But you'd be hard-pressed to actually use all of those programs and libraries. You only need a small subset for any particular purpose. In fact, if your container is running memcache, you could just copy the 3 or 4 libraries it needs, and it would be about 1 MB. There's no need for a shell, etc. The unused files will just sit there patiently on disk, completely ignored. They are not loaded into memory, nor are they copied around on disk.
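For instance (a sketch; the exact list depends on the build, and the path assumes memcached is installed in its usual location), ldd shows which shared libraries a binary actually needs:
ldd /usr/bin/memcached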
GOOD: Many containers sharing a single base image.
BAD: Many containers running separate/unique base images.
No. Using multiple images will only use a tiny bit of RAM. (Obviously, multiple containers will take more disk space, but disk is cheap, so we'll ignore that). So I'd argue that it's "OK" instead of "BAD".
Example: I start one Ubuntu container with Memcached and another Centos container with Tomcat. If they were BOTH running Ubuntu, they could share the RAM for things like libc. But because they don't share the files, each base image must load its own copy of libc. But as we've seen, we're only talking 150 MB of files, and you're probably only using a few percent of that. So each image only wastes a few MB of RAM.
(Hint: look at your process in ps. That's how much RAM it's using, including any files from its image.)
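A sketch of such a check, assuming the container's process is memcached:
ps -C memcached -o pid,rss,cmd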
For the moment, Docker is using AUFS, which is a union filesystem with copy-on-write.
When you have multiple base images, those images take disk space, but when you run N containers from those images, no additional disk is used: as it is copy-on-write, only modified files take space on the host.
So really, whether you have 1 or N base images changes nothing, no matter how many containers you have.
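You can see this per-container write footprint directly; docker ps -s shows the size of each container's writable layer next to the (shared) virtual image size:
docker ps -s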
An image is nothing more than a filesystem you could chroot into; there is absolutely no relation between an image and the host besides the fact that it has to contain Linux binaries for the same architecture.
I think that multiple base images have just a minor impact on the memory used.
Explanation:
I think your comparison to VMs is a bit misleading. Sure, with e.g. 3 base images running you will have higher memory requirements than with just 1 base image, but VMs will have even higher memory requirements:
Rough calculation - Docker, for M base images and N containers:
1 base image: 1 x size of base image + N x container overhead (file system + working memory)
M base images: M x size of base image + N x container overhead (file system + working memory)
Calculation - VMs:
N x VM image = at least N x size of the base image for the specific VM + N x size of container (file system + working memory)
For Docker to gain an advantage, you have to have M << N.
For small M and large N, the difference between Docker and multiple VMs is significant.
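As a rough illustration (the numbers are assumptions, not measurements): with M = 3 base images of ~150 MB each and N = 100 containers, Docker stores about 3 x 150 MB = 450 MB of image data plus the per-container write layers and working memory, whereas 100 VMs would each carry at least their own ~150 MB base, i.e. 15 GB or more before any workload runs.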