In a deployment bash script, I have two hosts:
localhost, the machine that typically builds the Docker images.
$REMOTE_HOST, which is a production web server.
I need to transfer the locally built Docker image to $REMOTE_HOST in the most efficient way (fast, reliable, private, storage-friendly). So far, I have the following streaming command in my script:
docker save $IMAGE_NAME:latest | ssh -i $KEY_FILE -C $REMOTE_HOST docker load
This has the following PROS:
Compression on the fly (via ssh -C)
No intermediate files are stored on either the source or the destination
It is a direct transfer (the images may be private), which also reduces upload time and stays "green" in a broader sense
However, there are also CONS: when transferring larger images, you don't see any progress, so you have to wait an unknown (but considerable) amount of time that you can't estimate. I've heard that progress can be tracked with something like rsync --progress.
But rsync transfers files, and it doesn't fit well into my old UNIX-style pipe. Of course I could docker load from a file, but how do I avoid the intermediate file?
How can I keep using a pipe and preserve the advantages above? (Or is there another tool for copying a built image to a remote Docker host that shows progress?)
You could invoke pv as part of your pipeline:
docker save $IMAGE_NAME:latest | pv [options...] | ssh -i $KEY_FILE -C $REMOTE_HOST docker load
pv works like cat, in that it reads from its standard input and writes to its standard output. Except that, like the documentation says,
pv allows a user to see the progress of data through a pipeline, by giving information such as time elapsed, percentage completed (with progress bar), current throughput rate, total data transferred, and ETA.
pv has a number of options to control what kind of progress information it prints. You should read the documentation and choose the output that you want. In order to display a percentage complete or an ETA, you will probably need to supply an expected size for the data transfer using the -s option.
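For example, a sketch that feeds pv the size reported by docker image inspect (the saved tar is close to, but not exactly, this size, so treat the percentage as an estimate):

# Rough size estimate in bytes; the docker save stream differs slightly from this
SIZE=$(docker image inspect "$IMAGE_NAME:latest" --format='{{.Size}}')
docker save "$IMAGE_NAME:latest" | pv -s "$SIZE" | ssh -i "$KEY_FILE" -C "$REMOTE_HOST" docker load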
Related
Having needed several times in the last few days to upload a 1 GB image after some micro change, I can't help but wonder why there isn't a deploy path built into Docker and related tech (e.g. k8s) to push just the application files (Dockerfile, docker-compose.yml and app-related code) and have it build out the infrastructure from within the (live) Docker host?
In other words, why do I have to upload an entire linux machine whenever I change my app code?
Isn't the whole point of Docker that the configs describe a purely deterministic infrastructure output? I can't even see why one would need to upload the whole container image unless they make changes to it manually, outside of Dockerfile, and then wish to upload that modified image. But that seems like bad practice at the very least...
Am I missing something, or is this just a peculiarity of the system?
Good question.
Short answer:
Because storage is cheaper than processing power, and building images "live" would be complex, time-consuming and unpredictable.
On your Kubernetes cluster, for example, you just want to pull the "cached" layers of an image you know works and run it, in seconds, instead of compiling binaries and downloading things (as you would specify in your Dockerfile) every time.
About building images:
You don't have to build these images locally; you can use your CI/CD runners and run docker build and docker push from the pipelines that run when you push your code to a git repository.
Also, if the image is too big, you should look into ways of reducing its size: use multi-stage builds, use lighter/minimal base images, use fewer layers (for example, multiple RUN apt install steps can be grouped into one apt install command listing several packages), and use a .dockerignore file so unnecessary files don't get shipped into your image. Finally, read more about caching in Docker builds, as it can reduce the size of the layers you push when making changes.
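On that caching point, one concrete approach is to seed the build cache from the previously pushed image. A sketch (the image name is illustrative, and depending on the builder you may also need inline cache metadata for this to work across machines):

# Pull the last pushed image (ignore failure on the very first build), then reuse its layers as cache
docker pull registry.example.com/myapp:latest || true
docker build --cache-from registry.example.com/myapp:latest -t registry.example.com/myapp:latest .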
Long answer:
Think of the Dockerfile as the source code, and the Image as the final binary. I know it's a classic example.
But just consider how long it would take to build/compile the binary every time you want to use it (either by running it, or by importing it as a library in another piece of software). Then consider how nondeterministic it would be to download that software's dependencies, or compile them, on different machines every time you run them.
You can take for example Node.js's Dockerfile:
https://github.com/nodejs/docker-node/blob/main/16/alpine3.16/Dockerfile
Which is based on Alpine: https://github.com/alpinelinux/docker-alpine
You don't want your application to perform all the operations specified in these files (and their scripts) at runtime, before actually starting, as that would be unpredictable, time-consuming, and more complex than it needs to be (for example, you'd need firewall exceptions for egress traffic from the cluster to the internet to download dependencies that may or may not still be available).
You would instead just ship an image based on the base image you tested and built your code to run on. That image is built once, sent to the registry, and then k8s runs it as a black box, which is predictable and deterministic.
Then, about your point of how annoying it is to push huge Docker images every time:
You can cut that size down by following some best practices and designing your Dockerfile well, for example:
Reduce your layers: pass multiple arguments to a command whenever possible, instead of running it several times.
Use multi-stage builds, so you only push the final image, not the intermediate stages you needed to compile and configure your application.
Avoid baking data into your images; you can pass it to the containers at runtime instead.
Order your layers so that untouched layers don't have to be rebuilt when you make changes.
Don't include unnecessary files, and use .dockerignore.
And last but not least:
You don't have to push images from your machine; you can do it with CI/CD runners (for example the build-push GitHub Action), or you can use your cloud provider's build products (like Cloud Build on GCP or AWS CodeBuild).
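As a sketch of that last point (the registry URL and tag variable are placeholders), a CI job step essentially boils down to:

# Build and push from the CI runner rather than from a developer machine
docker build -t registry.example.com/myapp:"$GIT_SHA" .
docker push registry.example.com/myapp:"$GIT_SHA"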
Is there a simple way of converting my Docker image to a Cloud Foundry droplet?
What did not work:
docker save registry/myapp1 | gzip > myapp1.tgz
cf push myapp1 --droplet myapp1.tgz
LOG: 2021-02-13T12:36:28.80+0000 [APP/PROC/WEB/0] OUT Exit status 1
LOG: 2021-02-13T12:36:28.80+0000 [APP/PROC/WEB/0] ERR /tmp/lifecycle/launcher: no start command specified or detected in droplet
If you want to run your docker image on Cloud Foundry, simply run cf push -o <your/image>. Cloud Foundry can natively run docker images so long as your operations team has enabled that functionality (not a lot of reason to disable it) and you meet the requirements.
You can check whether Docker support is enabled by running cf feature-flags and looking for the line diego_docker enabled. If it says disabled, talk to your operations team about enabling it.
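For reference, assuming a reasonably recent cf CLI, the relevant commands are:

cf feature-flags                      # list all flags and their state
cf enable-feature-flag diego_docker   # enable Docker support (requires admin rights)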
By doing this, you don't need to do any complicated conversion. The image is just run directly on Cloud Foundry.
This doesn't 100% answer your question, but it's what I would recommend if at all possible.
To try and answer your question, I don't think there's an easy way to make this conversion. The output of docker save is a bunch of layers. This is not the same as a droplet which is an archive containing some specific folders (app bits + what's installed by your buildpacks). I suppose you could convert them, but there's not a clear path to doing this.
The way Cloud Foundry uses a droplet is different and more constrained than a Docker image. The droplet gets extracted into /home/vcap on top of an Ubuntu Bionic (cflinuxfs3) root filesystem and the app is then run out of there. Thus your droplet can only contain files that will go into this one place in the filesystem.
For a Docker image, you can literally have a completely custom file system.
So given that difference, I don't think there's a generic way to take a random Docker image and convert it to a droplet. The best you could probably do is take some constrained set of Docker images, like those built from Ubuntu Bionic using certain patterns, extract the files necessary to run your app, stuff them into directories that will unpack on top of /home/vcap (i.e. something that resembles a droplet), tar-gzip it and try to use that.
Starting with the output of docker save is probably a good idea. You'd then just need to extract the files you want from the tar archives of the layers (i.e. dig through each layer, which is itself another tar archive, and extract its files), then move them into a directory structure that resembles this:
./
./deps/
./profile.d/
./staging_info.yml
./tmp/
./logs/
./app/
where ./deps is typically where buildpacks install required dependencies, ./profile.d/ is where you can put scripts that will run before your app starts, and ./app is where your app (most of your files) will end up.
I'm not 100% sure staging_info.yml is required, but it basically breaks down to {"detected_buildpack":"java","start_command":""}. You could fake detected_buildpack, setting it to anything, and start_command is obviously the command to run (you can override this later, though).
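A very rough, untested sketch of that packaging step (the file names, start command, and extracted-files directory here are all hypothetical):

# Assemble a droplet-like layout from files already extracted out of the image layers
mkdir -p droplet/app droplet/deps droplet/profile.d droplet/tmp droplet/logs
cp -r extracted-files/. droplet/app/
echo '{"detected_buildpack":"none","start_command":"./start.sh"}' > droplet/staging_info.yml
tar -czf myapp1-droplet.tgz -C droplet .
cf push myapp1 --droplet myapp1-droplet.tgz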
I haven't tried doing this because cf push -o is much easier, but you could give it a shot if cf push -o isn't an option.
I'm trying to build Docker images and I would like my Docker images to be deterministic. Much to my surprise I found that even a trivial Dockerfile such as
FROM scratch
ENV a b
produces different IDs when built repeatedly using docker build --no-cache .
How could I make my builds deterministic, and what's causing the changes in image IDs? When caching is enabled, the same ID is produced.
The reason I'm trying to get this reproducibility is to enable producing the same layers in a distributed build environment. I can not control where a build is run therefore I can not know what is in the cache.
Also, the Docker build downloads files using wget from an FTP server whose contents may or may not have changed; currently I cannot easily tell Docker, from within a Dockerfile, whether the result of a RUN should invalidate the cache. Therefore, if I could just produce the same ID for identical layers (when no cache is used), these layers would not have to be pushed and pulled again.
Also all the reasons listed here: https://reproducible-builds.org/
AFAIK, Docker images currently do not hash to byte-exact values, since the metadata contains stateful information such as the creation date. You can check out the design doc from 1.10. Unfortunately, it looks like the history metadata is an important part of image validity and identification.
Don't get me wrong, I'm all about reproducible builds. However, I don't believe hash-exactness is the best criterion for measuring the reproducibility of a Docker image. A Docker image isn't a compiled binary. There is no way to guarantee that the results of a stage will ever be reproducible, so even if the datetime metadata were absent, it would not guarantee reproducible builds. Take this pathological example:
RUN curl "https://www.random.org/strings/?num=1&len=20&digits=on&unique=on&format=plain&rnd=new" -o nonce.txt
The image ID is a SHA256 of the image's configuration object (what you get when you do a docker image inspect). Run this with the images you are creating and you will see differences between them.
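A quick way to see exactly which fields differ between two otherwise identical builds (the image names here are placeholders):

# Dump the configuration objects and compare them; typically the "created" timestamp
# and the per-layer "history" entries are what change between runs
docker image inspect myimage:build1 > build1.json
docker image inspect myimage:build2 > build2.json
diff build1.json build2.json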
I use a docker-compose file to get an Elasticsearch/Logstash/Kibana stack. Everything works fine; the
docker-compose build
command creates three images, about 600 MB each, downloading the needed layers from the Docker registry.
Now I need to do the same on a machine with no Internet access, where downloading from repositories is impossible, so I need to create an "offline installer". The best way I found is
docker save image1 image2 image3 -o archivebackup.tar
but the created file is almost 2 GB. During the
docker-compose build
command, some data is downloaded from the Internet, but it is definitely less than 2 GB.
What is a better way to create my "offline installer", to avoid making it so big?
The save command is the way to go for running Docker images offline.
The size difference that you are noticing is because when you are pulling images from a registry, some layers might exist locally and are thus not pulled. So you are not pulling all the image layers, only the ones that you don't have locally.
On the other hand, when you are saving the image to a tar, all the layers need to be stored.
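If you want to see how much overlap there is, you can compare the layer digests of two images (a sketch; the image names are placeholders). Digests that appear in both lists are shared layers:

docker image inspect image1 --format '{{json .RootFS.Layers}}'
docker image inspect image2 --format '{{json .RootFS.Layers}}'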
The best way to create the Docker offline installer is to:
Get the CI/CD pipeline to generate the TAR file as part of the build process.
Later, create a local folder with the required TAR files.
Write a script to load these TAR files on the machine.
The same script can fire the docker-compose up -d command to bring up the whole service ecosystem.
Note: it is important to load the images before bringing up the services (see the sketch below).
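A minimal sketch of such a load script, assuming a hypothetical /opt/offline-installer directory holding the TAR files:

# Load every saved image archive first, then bring up the stack
for f in /opt/offline-installer/*.tar; do
  docker load -i "$f"
done
docker-compose up -d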
Regarding the size issue, the answer by Yamenk specifically points to the reason why the size increases: Docker does not pull the shared layers, whereas docker save has to store all of them.
Is there a limit to the number of parallel Docker push/pulls you can do?
E.g. if you thread Docker pull/push commands such that they are pulling/pushing different images at the same time, what would be the upper limit to the number of parallel pushes/pulls?
Or alternatively
On one terminal you do docker pull ubuntu, on another you do docker pull httpd, etc. - what would be the limit Docker would support?
The options are set in the daemon configuration file (on a Linux-based OS it is located at /etc/docker/daemon.json, and on Windows at C:\ProgramData\docker\config\daemon.json).
Open /etc/docker/daemon.json (if it doesn't exist, create it)
Add the values (for pushes/pulls) and set the parallel operations limit:
{
"max-concurrent-uploads": 1,
"max-concurrent-downloads": 1
}
Restart daemon: sudo service docker restart
The docker daemon (dockerd) has two flags:
--max-concurrent-downloads int Set the max concurrent downloads for each pull
(default 3)
--max-concurrent-uploads int Set the max concurrent uploads for each push
(default 5)
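For illustration only (the values are arbitrary, and most installations would set these in /etc/docker/daemon.json or via their init system rather than launching dockerd by hand):

# Start the daemon with higher concurrency limits
sudo dockerd --max-concurrent-downloads 8 --max-concurrent-uploads 8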
The upper limit will likely depend on the number of open files you permit for the process (ulimit -n). There will be some overhead of other docker file handles, and I expect that each push and pull opens multiple handles, one for the remote connection, and another for the local file storage.
To compound the complication of this, each push and pull of an image will open multiple connections, one per layer, up to the concurrent limit. So if you run a dozen concurrent pulls, you may have 50-100 potential layers to pull.
While docker does allow these limits to be increased, there's a practical limit where you'll see diminishing returns if not a negative return to opening more concurrent connections. Assuming the bandwidth to the remote registry is limited, more connections will simply split that bandwidth, and docker itself will wait until the very first layer finishes before it starts unpacking that transmission. Also any aborted docker pull or push will lose any partial transmissions of a layer, so you increase the potential data you'd need to retransmit with more concurrent connections.
The default limits are well suited for a development environment, and if you find the need to adjust them, I'd recommend measuring the performance improvement before trying to find the max number of concurrent sessions.
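If you do experiment with higher limits, it may be worth checking the daemon's current open-file ceiling first (a Linux-only sketch):

# Show the "Max open files" soft/hard limits of the running dockerd process
cat /proc/"$(pidof dockerd)"/limits | grep -i "open files"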
For anyone using Docker for Windows and WSL2:
You can (and should) set these options on the Settings tab, under Docker Engine.
(Screenshot: Docker for Windows Docker Engine settings)