Why isn't Docker more transparent about what it's downloading?

When I download a Docker image, it downloads dependencies, but only displays their hashes. Why does it not display what it is downloading?
For example:
➜ ~ docker run ubuntu:16.04
Unable to find image 'ubuntu:16.04' locally
16.04: Pulling from library/ubuntu
b3e1c725a85f: Downloading 40.63 MB/50.22 MB
4daad8bdde31: Download complete
63fe8c0068a8: Download complete
4a70713c436f: Download complete
bd842a2105a8: Download complete
What's the point in only telling me that it's downloading b3e1c725a85f, etc.?

An image is built from layers of filesystems, each identified by a hash. After its creation, the base image tag may point to a completely different set of hashes without affecting any images built off of it. These layers come from things like RUN commands; a tag such as ubuntu:16.04 is only added after the image is made.
So the best that could be done is to say 4a70713c436f is based on adding some directory based on a hash of an input folder itself, or a multi-line run command, neither of which makes for a decent UI. The result may have no tagged name, or it could have multiple tagged names. So the simplest solution is to output the one thing that is universal across all scenarios: an unchanging hash.
To rephrase that pictorially:
b3e1c725a85f: could be ubuntu:16.04, ubuntu:16, ubuntu:latest, some.other.registry:5000/ubuntu-mirror:16.04
4daad8bdde31: could be completely untagged, just a run command
63fe8c0068a8: could be completely untagged, just a copy file
4a70713c436f: could point to a tagged base image where that tag has since changed
bd842a2105a8: could be created with a docker commit command (eek)
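The relationship between hashes and tags can be sketched in a few lines of Python. This is only an illustration: it assumes a layer ID is a truncated SHA-256 digest of the layer's bytes and models tags as a plain dictionary. The byte strings and the helper names (layer_id, names_for) are made up, and the real registry format is more involved.

```python
import hashlib

def layer_id(blob: bytes) -> str:
    """Return a short content-derived ID, similar in spirit to what docker pull prints."""
    return hashlib.sha256(blob).hexdigest()[:12]

# Hypothetical layer contents; real layers are tar archives.
base_v1 = b"ubuntu rootfs, built in January"
base_v2 = b"ubuntu rootfs, rebuilt with security fixes"

# Tags are mutable pointers to IDs, and several tags may share one ID.
tags = {
    "ubuntu:16.04": layer_id(base_v1),
    "ubuntu:latest": layer_id(base_v1),
}
old_id = tags["ubuntu:16.04"]

# Re-pointing the tag does not change the ID of the old content.
tags["ubuntu:16.04"] = layer_id(base_v2)

def names_for(some_id: str) -> list[str]:
    """All tags currently pointing at an ID (possibly none, possibly several)."""
    return sorted(name for name, value in tags.items() if value == some_id)

print(old_id, names_for(old_id))
print(tags["ubuntu:16.04"], names_for(tags["ubuntu:16.04"]))
```

The ID stays attached to the bytes while the names come and go, which is why the pull output can always print the former but not reliably the latter.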

Related

How to merge Docker Compose in one image? [duplicate]

I'm hoping to use docker to set up some bioinformatic analysis.
I have found two docker images that I would like to use:
jupyter/datascience-notebook
bioconductor/devel_base
I have been successful in running each of these images independently, however I don't know how to merge them together.
Is merging two docker containers possible? Or do you start with one, and then manually install the features of the other?

You can't just merge the images. You have to create your own image based on what was in each of the images you want. You can download both images and reconstruct an approximation of each Dockerfile like this:
docker history --no-trunc=true image1 > image1-dockerfile
docker history --no-trunc=true image2 > image2-dockerfile
Substitute the image1 and image2 with the images you want to see the history for. After this you can use those dockerfiles to build your own image that is the combination of the two.
The fly in the ointment here is that any ADD or COPY commands will not reveal what was copied because you don't have access to the local file system from which the original images were created. With any luck that won't be necessary or you can get any missing bits from the images themselves.

If there are specific files or directories that you want to cherry-pick from one of the two images, you can create a new Dockerfile that builds FROM one of them and copies over specific paths from the other using COPY's --from option. For example:
FROM bioconductor/devel_base
COPY --from=jupyter/datascience-notebook /path/to/something-you-want /path
However, a quick investigation of those images shows that in this specific case there isn't a lot that can easily be cherry picked.
Alternatively, you can just look at the original Dockerfiles and combine them yourself:
https://github.com/jupyter/docker-stacks/blob/master/base-notebook/Dockerfile
https://github.com/Bioconductor/bioc_docker/blob/master/out/devel_base/Dockerfile
Fortunately they are both based on APT-based distros: Ubuntu and Debian. So most of the apt-get install commands should work fine whichever base image you pick.

You start with one, then manually install the features of the other one. Merging would be far too complex, with too many unknowns.

Docker image layer: What does `ADD file:<some_hash> in /` mean?

In Docker Hub images there are lists of commands that are run for each image layer. Here is a golang example.
Some applications also provide their Dockerfile in GitHub. Here is a golang example.
According to the Docker Hub image layer, ADD file:4b03b5f551e3fbdf47ec609712007327828f7530cc3455c43bbcdcaf449a75a9 in / is the first command. The image layer doesn't include any "FROM" command, and it doesn't seem to match the ADD syntax either.
So here are the questions:
What does ADD file:<HASH> in / mean? What is this format?
Is there any way I could trace upwards using the hash? I suppose that hash represents the FROM image, but it seems there is no API for that.
Why is it not possible to build a Dockerfile using the ADD file:<HASH> in / syntax? Is there any way I could build an image using such syntax, or convert between the two formats?

That Docker Hub history view doesn't show the actual Dockerfile; instead, it shows content essentially extracted from the docker history of the image. That doesn't preserve the specific details you're looking for: it doesn't remember the names of base images, or the build-context file names of things that get ADDed or COPYed in.
Chasing through GitHub and Docker Hub links, the golang:*-buster Dockerfile is built FROM buildpack-deps:...-scm; buildpack-deps:buster-scm is FROM buildpack-deps:buster-curl; that is FROM debian:buster; and that has a very simple Dockerfile (quoted here in its entirety):
FROM scratch
ADD rootfs.tar.xz /
CMD ["bash"]
FROM scratch starts from a completely empty image; that is the base of the Docker image tree (and what tells docker history and similar tools to stop). The ADD line unpacks a tar file of a Debian system image.
If you look at docker history or the Docker Hub history view you cite, you should be able to see these same steps happening. The ADD file:4b0... in / corresponds to the ADD rootfs.tar.xz /, and the second line is the CMD ["bash"]. It is not split up by Dockerfile or image, and the original filenames from ADD aren't saved. (You couldn't reproduce the image anyway without the contents of the rootfs.tar.xz, so knowing its filename is slightly helpful but not essential.)
The ADD file:hash in /path syntax is not standard Dockerfile syntax (the word in in particular is not part of it). I'm not sure there's a reliable way to translate from the host file or URL to the hash, but building the image and looking at its docker history would tell you (assuming you've got a perfect match for the file metadata). There's no way to get back to the original filename or syntax, and definitely no way to get back to the file contents.

ADD or COPY means that files are added to the image.
They are plain files, so you cannot "trace" them back to anything.
You cannot just copy the commands, because the hashes are not the original files. See https://forums.docker.com/t/how-to-extract-file-from-image/96987 to get the file.

How docker load a diff image

I know docker save can save an image to a tar file, and docker load can reload it.
For example:
I have two machines, A and B. B can't connect to Docker Hub. A has image:latest and B has image:base.
I have to save multiple images on A as tar files, but the tar files are too big to transfer.
Can I save the diff between tags or image IDs on A and load the diff on B?
That would make an update patch much smaller than saving the whole image.

This isn't possible using standard Docker tooling. The only option docker save takes is an option to write to a file rather than to stdout, and it always contains all parent layers (and base images).
If your only problem is transferring the images, consider either techniques to reduce the image size (for example, use a multi-stage image to not include build-time dependencies in the final image) or using tools like split(1) to break the tar file into smaller parts.
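For the splitting approach, the idea is simply to cut the archive into fixed-size pieces, move them separately, and concatenate them in order on the other side. A minimal Python sketch follows; the helper names split_file and join_files and the .partNNNN naming are made up for illustration, and random bytes stand in for the docker save output.

```python
import os
import tempfile
from pathlib import Path

def split_file(src: Path, chunk_size: int) -> list[Path]:
    """Write src out as numbered part files next to it; return them in order."""
    parts = []
    with src.open("rb") as f:
        while chunk := f.read(chunk_size):
            part = src.with_name(f"{src.name}.part{len(parts):04d}")
            part.write_bytes(chunk)
            parts.append(part)
    return parts

def join_files(parts: list[Path], dest: Path) -> None:
    """Concatenate the parts, in order, back into a single file."""
    with dest.open("wb") as out:
        for part in parts:
            out.write(part.read_bytes())

# Demo on random bytes standing in for the output of docker save.
workdir = Path(tempfile.mkdtemp())
original = workdir / "image.tar"
original.write_bytes(os.urandom(10_000))

parts = split_file(original, chunk_size=4_096)
rejoined = workdir / "rejoined.tar"
join_files(parts, rejoined)
print(len(parts), rejoined.read_bytes() == original.read_bytes())
```

On the command line, split -b and cat do the same job; the rejoined file is what you would then pass to docker load.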
I believe the docker save tar file output is the same as the "Export an image" API call. It might be possible to manually edit that tar file to delete layers, and there might be tools out there that do this. (This is not a particularly mainstream path, though; I've looked into it several years ago but not done it myself, and occasionally see tools mentioned in infrequent SO answers.)
Because docker pull and docker save both only ever produce complete image chains, in principle there is no way to end up with only the "top half" of an image without the base layers below it. Editing the docker save tar files by hand would violate this invariant.
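If you do want to look inside a docker save archive before deciding anything, the layer list is recorded in a manifest.json file at the top of the tar. The sketch below reads it with Python's standard library. It assumes the classic layout, where manifest.json is a JSON array whose entries have RepoTags and Layers keys; the fake archive, its layer paths, and the function names are made up so the example runs without Docker.

```python
import io
import json
import tarfile

def list_layers(archive: bytes) -> dict[str, list[str]]:
    """Map each tagged image in a docker-save style archive to its layer paths."""
    with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
        manifest = json.load(tar.extractfile("manifest.json"))
    result = {}
    for entry in manifest:
        # RepoTags may be null for untagged images.
        for tag in entry.get("RepoTags") or ["<untagged>"]:
            result[tag] = entry["Layers"]
    return result

def make_fake_archive() -> bytes:
    """Build a tiny stand-in archive containing only a manifest.json."""
    manifest = [{
        "Config": "config.json",
        "RepoTags": ["image:latest"],
        "Layers": ["base/layer.tar", "patch/layer.tar"],
    }]
    data = json.dumps(manifest).encode()
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo("manifest.json")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

layers = list_layers(make_fake_archive())
print(layers)
```

Listing the layers this way is read-only and safe; removing entries from the archive is the unsupported part described above.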

How do I update docker images?

I read that docker works with layers, so when creating a container with a Dockerfile, you start with the base image, then subsequent commands run add a layer to the container, so if you save the state of that new container, you have a new image. There are a couple of things I'm wondering about this.
1.) If I start from an Ubuntu image, which is pretty big and bulky since it's a complete OS, then I add a few tools to it and save this as a new image which I upload to the hub. If someone downloads my image, and they already have an Ubuntu image saved in their images folder, does this mean they can skip downloading Ubuntu since they already have the image? If so, how does this work when I modify parts of the original image? Does Docker use its cached data to selectively apply those changes to the Ubuntu image after it loads it?
2.) How do I update an image that I built by modifying the Dockerfile? I set up a simple django project with this Dockerfile:
FROM python:3.5
ENV PYTHONBUFFERED 1
ENV APPLICATION_ROOT /app
ENV APP_ENVIRONMENT L
RUN mkdir -p $APPLICATION_ROOT
WORKDIR $APPLICATION_ROOT
ADD requirements.txt $APPLICATION_ROOT
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
ADD . $APPLICATION_ROOT
and used this to create the image in the beginning. So every time I create a box, it loads all these environment variables, and if I rebuild the box completely it reinstalls the packages and all the extras. I need to add a new environment variable, so I added it to the bottom of the Dockerfile, along with a test variable:
ENV COMPOSE_CONVERT_WINDOWS_PATHS 1
ENV TEST_ENV_VAR TEST
When I delete the container and the image and build a new container, it all seems to go as expected. It tells me that it creates the new step:
Step 4 : ENV COMPOSE_CONVERT_WINDOWS_PATHS 1
---> Running in 75551ea311b2
---> b25b60e29f18
Removing intermediate container 75551ea311b2
So it's like something gets lost in some of these intermediate container transitions. Is this how the caching system works, with every new layer being an intermediate container? With that in mind, how do you add a new layer? Do you always have to add the new data at the bottom of the Dockerfile? Or would it be better to leave the Dockerfile alone once the image is built, and just modify the container and build a new image?
EDIT I just tried installing an image, a package called bwawrik/bioinformatics, which is a CentOS based container which has a wide range of tools installed.
It froze half way through, so I exited it and then ran it again to see if everything was installed:
$ docker pull bwawrik/bioinformatics
Using default tag: latest
latest: Pulling from bwawrik/bioinformatics
a3ed95caeb02: Already exists
a3ed95caeb02: Already exists
7e78dbe53fdd: Already exists
ebcc98113eaa: Already exists
598d3c8fd678: Already exists
12520d1e1960: Already exists
9b4912d2bc7b: Already exists
c64f941884ae: Already exists
24371a4298bf: Already exists
993de48846f3: Already exists
2231b3c00b9e: Already exists
2d67c793630d: Already exists
d43673e70e8e: Already exists
fe4f50dda611: Already exists
33300f752b24: Already exists
b4eec31201d8: Already exists
f34092f697e8: Already exists
e49521d8fb4f: Already exists
8349c93680fe: Already exists
929d44a7a5a1: Already exists
09a30957f0fb: Already exists
4611e742e0b5: Already exists
25aacf0148db: Already exists
74da82504b6c: Already exists
3e0aac083b86: Already exists
f52c7e0ac000: Already exists
35eee92aaf2f: Already exists
5f6d8eb70885: Already exists
536920bfe266: Already exists
98638e678c51: Already exists
9123956b991d: Already exists
1c4c8a29cd65: Already exists
1804bf352a97: Already exists
aa6fe9359956: Already exists
e7e38d1250a9: Already exists
05e935c831dc: Already exists
b7dfc22c26f3: Already exists
1514d4797ffd: Already exists
Digest: sha256:0391808e21b7b5cc0eb44fc2dad0d7f5415115bdaafb4534c0b6a12efd47a88b
Status: Image is up to date for bwawrik/bioinformatics:latest
So it definitely installed the package in pieces, not all in one go. Are these pieces different images?

image vs. container
First, let me clarify some terminology.
image: A static, immutable object. This is the thing you build when you run docker build using a Dockerfile. An image is not a thing that runs.
Images are composed of layers. An image might have only one layer, or it might have many layers.
container: A running thing. It uses an image as its starting template.
This is similar to a binary program and a process. You have a binary program on disk (such as /bin/sh), and when you run it, it is a process on your system. This is similar to the relationship between images and containers.
Adding layers to a base image
You can build your own image from a base image (such as ubuntu in your example). Some commands in your Dockerfile will create a new layer in the ultimate image. Some of those are RUN, COPY, and ADD.
The very first layer has no parent layer. But every other layer will have a parent layer. In this way they link to one another, stacking up like pancakes.
Each layer has a unique ID (the long hexadecimal hashes you have already seen). They can also have human-friendly names, known as tags (e.g. ubuntu:16.04).
What is a layer vs. an image?
Technically, each layer is also an image. If you build a new image and it has 5 layers, you can use that image and it will contain all 5 layers. If you run a container using the third layer in the stack as your image ID, you can do that too - but it would only contain 3 layers. The one you specify and the two that are its ancestors.
But as a matter of convention, the term "image" generally means the layer that has a tag associated. When you run docker images, it will show you all of the top-level images, and hide the layers beneath (but you can show them all with -a).
What is an intermediate container?
When docker build runs, it does all of its work inside of containers (naturally!) So if it encounters a RUN step, it will create a container from the current top layer, run the specified commands in there, and then save the result as a new layer. Then it will create a container from this new layer, run the next thing... etc.
The intermediate containers are only used for the build process, and are discarded after the build.
How layer filesystems work
You asked whether someone downloading your ubuntu-based image is only doing a partial download if they already have the ubuntu image locally.
Yes! That's exactly right.
Every layer uses the layer beneath it as a base. The new layer is basically a diff between that layer and a new state. It's not a diff in the same way as a git commit might work, though. It works at the file level, not at the line level.
Say you started from ubuntu, and you ran this Dockerfile.
FROM ubuntu:16.04
RUN groupadd dan && useradd -g dan dan
This would result in a two layer image. The first layer would be the ubuntu image. The second would probably have only a handful of changes.
A newer copy of /etc/passwd with user "dan"
A newer copy of /etc/group with group "dan"
A new directory /home/dan
A couple of default files like /home/dan/.bashrc
And that's it. If you start a container from this image, those few files would be in the topmost layer, and everything else would come from the filesystem in the ubuntu image.
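The file-level diff idea can be modelled with ordinary dictionaries. This is a simplified sketch, not how the storage driver is implemented: each layer is assumed to be a mapping from path to whole-file contents, with None standing in for a deletion marker, and the file contents are placeholders.

```python
from typing import Optional

# A layer maps file paths to contents; None marks a file deleted in that layer.
Layer = dict[str, Optional[str]]

def merged_view(layers: list[Layer]) -> dict[str, str]:
    """Apply layers bottom to top and return the filesystem a container would see."""
    view: dict[str, str] = {}
    for layer in layers:
        for path, content in layer.items():
            if content is None:
                view.pop(path, None)
            else:
                view[path] = content  # whole-file replacement, no line-level merge
    return view

ubuntu = {
    "/etc/passwd": "root",
    "/etc/group": "root",
    "/bin/bash": "<binary>",
}
add_dan = {
    "/etc/passwd": "root\ndan",
    "/etc/group": "root\ndan",
    "/home/dan/.bashrc": "# defaults",
}

fs = merged_view([ubuntu, add_dan])
print(sorted(fs))
print(len(add_dan), "files stored in the second layer")
```

Note that the second layer carries complete new copies of /etc/passwd and /etc/group rather than a one-line patch, and the base layer itself is never modified.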
The top-most read-write layer in a container
One other point. When you run a container, you can write files in the filesystem. But if you stop the container and run another container from the same image, everything is reset. So where are the files written?
Images are immutable, so once they are created, they can't be changed. You can build a new version, but that's a new image. It would have a different ID and would not be the same image.
A container has a top-level read-write layer which is put on top of the image layers. Any writes happen in that layer. It works just like the other layers. If you need to modify a file (or add one, or delete one), that is done in the top layer, and doesn't affect the lower layers. If the file exists already, it is copied into the read-write layer, and then modified. This is known as copy-on-write (CoW).
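Python's collections.ChainMap happens to behave much like this: lookups fall through a stack of mappings, while writes only ever touch the first one. The sketch below uses it as an analogy, with a read-only mapping standing in for the image and made-up paths; deletions of lower-layer files are not modelled.

```python
from collections import ChainMap
from types import MappingProxyType

# The image's merged filesystem, wrapped read-only to model immutability.
image = MappingProxyType({"/etc/hostname": "base", "/bin/sh": "<binary>"})

def start_container(image_fs):
    """Give each container its own empty writable layer on top of the image."""
    return ChainMap({}, image_fs)

first = start_container(image)
first["/etc/hostname"] = "changed"   # lands in the writable layer only
first["/tmp/scratch"] = "data"

second = start_container(image)      # a fresh container from the same image

print(first["/etc/hostname"], second["/etc/hostname"])
print(dict(first.maps[0]))
```

The second container sees the original file because the first container's changes live only in its own top layer, which is thrown away with the container.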
Where to add changes
Do you have to add new things to the bottom of Dockerfile? No, you can add anything anywhere (or change anything).
However, how you do things does affect your build times because of how the build caching works.
Docker will try to cache results during builds. If it finds as it reads through Dockerfile that the FROM is the same, the first RUN is the same, the second RUN is the same... it will assume it has already done those steps, and will use cached results. If it encounters something that is different from the last build, it will invalidate the cache. Everything from that point on will be re-run fresh.
Some things invalidate the cache more readily than others. For instance, for ADD and COPY Docker does not only compare the instruction text: it also computes a checksum of the files being copied, and if any of them changed since the last build, the cache is invalidated from that step on. A step like ADD . $APPLICATION_ROOT therefore misses the cache whenever anything in the build context changes.
So it is a common practice to start with FROM, then put very static things like RUN commands that install packages with e.g. apt-get, etc. Those things tend to not change a lot after your Dockerfile has been initially written. Later in the file is a more convenient place to put things that change more often.
It's hard to concisely give good advice on this, because it really depends on the project in question. But it pays to learn how the build caching works and try to take advantage of it.
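To make the ordering advice concrete, here is a minimal sketch of the caching idea. It assumes each step's cache key is derived from the previous step's key plus the instruction text, and that ADD/COPY steps also mix in a checksum of the copied files (as Docker's build-cache documentation describes). The function names and the one-bytes-value-per-source file model are made up for illustration; this is not Docker's actual implementation.

```python
import hashlib

def cache_keys(steps, files):
    """Return one cache key per step; each key depends on every earlier step."""
    keys = []
    parent = ""
    for instruction in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        if instruction.startswith(("ADD ", "COPY ")):
            source = instruction.split()[1]
            h.update(files[source])  # content of the copied files matters too
        parent = h.hexdigest()
        keys.append(parent)
    return keys

def reused_steps(old_keys, new_keys):
    """Count leading steps whose keys match, i.e. steps served from cache."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count

dockerfile = [
    "FROM python:3.5",
    "ADD requirements.txt /app",
    "RUN pip install -r requirements.txt",
    "ADD . /app",
]
files = {"requirements.txt": b"django\n", ".": b"source v1"}
baseline = cache_keys(dockerfile, files)

# Editing application code only invalidates the final ADD.
code_changed = cache_keys(dockerfile, {**files, ".": b"source v2"})
# Editing requirements.txt invalidates everything from the first ADD on.
reqs_changed = cache_keys(dockerfile, {**files, "requirements.txt": b"django\nrequests\n"})
# Appending an ENV line at the bottom leaves every earlier step cached.
env_added = cache_keys(dockerfile + ["ENV TEST_ENV_VAR TEST"], files)

print(reused_steps(baseline, code_changed))
print(reused_steps(baseline, reqs_changed))
print(reused_steps(baseline, env_added))
```

This is why the Dockerfile in the question copies requirements.txt and runs pip install before adding the rest of the source: a code-only change then skips the slow install step.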

What is the difference between save and export in Docker?

I am playing around with Docker for a couple of days and I already made some images (which was really fun!). Now I want to persist my work and came to the save and export commands, but I don't fully understand them.
What is the difference between save and export in Docker?

The short answer is:
save will fetch an image: for a VM or a physical server, that would be the installation .ISO image or disk, i.e. the base operating system.
It will pack the layers and metadata of the whole chain required to build the image. You can then load this "saved" image chain into another Docker instance and create containers from these images.
export will fetch the whole container: like a snapshot of a regular VM. It saves the OS of course, but also any change you made and any data file written during the container's life. This one is more like a traditional backup.
It will give you a flat .tar archive containing the filesystem of your container.
Edit: as my explanation may still lead to confusion, I think that it is important to understand that one of these commands works with containers, while the other works with images.
An image has to be considered as 'dead' or immutable, starting 0 or 1000 containers from it won't alter a single byte. That's why I made a comparison with a system install ISO earlier. It's maybe even closer to a live-CD.
A container "boots" the image and adds an additional layer on top of it. This layer stores any change on the container (created/changed/removed files...).

There are two main differences between save and export commands.
The save command saves the whole image with history and metadata, but the export command exports only the file structure (without history and metadata). So the exported tar file will be smaller than the saved one.
When you use an exported filesystem to create a new image, this new image will not contain any USER, EXPOSE, RUN etc. commands from your Dockerfile. Only the file structure will be transferred.
So if you use those keywords in your Dockerfile, you cannot use the export command to transfer the image to another machine; you always need to use the save command.

export: container (filesystem)->image tar.
import: exported image tar-> image. Only one layer.
save: image-> image tar.
load: saved image tar->image. All layers will be recovered.
From Docker in Action, Second Edition p190.
Layered images maintain the history of the image, container-creation metadata, and old files that might have been deleted or overridden.
Flattened images contain only the current set of files on the filesystem.

The exported image will not have any layer or history information saved, so it will be smaller and you will not be able to rollback.
The saved image will have layer and history information, so larger.
If giving this to a customer, the Q is do you want to keep those layers or not?

Technically, save/load works with repositories, which can contain one or more images, also referred to as layers. An image is a single layer within a repo. Finally, a container is an instantiated image (running or not).

docker save produces a tar archive of a repository, containing all parent layers and all tags and versions (or only the specified repo:tag), for each image argument provided.
docker export produces a tar archive with the flattened contents of a container's filesystem, without the contents of its volumes.
docker save is used on an image, while docker export is used on a container (that is, an instance of an image).
save usage
docker save [OPTIONS] IMAGE [IMAGE...]
Save an image(s) to a tar archive (streamed to STDOUT by default)
  --help=false       Print usage
  -o, --output=""    Write to a file, instead of STDOUT
export usage
docker export [OPTIONS] CONTAINER
Export the contents of a container's filesystem as a tar archive
  --help=false       Print usage
  -o, --output=""    Write to a file, instead of STDOUT
