How to limit Docker filesystem space available to container(s)

The general scenario is that we have a cluster of servers and we want to set up virtual clusters on top of that using Docker.
For that we have created Dockerfiles for different services (Hadoop, Spark etc.).
Regarding the Hadoop HDFS service, however, we have the situation that the disk space available to the docker containers equals the disk space available to the server. We want to limit the available disk space on a per-container basis so that we can dynamically spawn an additional datanode with a given amount of storage to contribute to the HDFS filesystem.
We had the idea to use loopback files formatted with ext4 and mount them on directories that we use as volumes in docker containers (as sketched below). However, this approach carries a significant performance penalty.
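For context, a minimal sketch of that loopback approach (file locations, sizes and the image name are illustrative):
$ truncate -s 10G /srv/dn1.img          # create a 10GB sparse file
$ mkfs.ext4 -F /srv/dn1.img             # format it as ext4 (-F: not a block device)
$ mkdir -p /mnt/dn1
$ mount -o loop /srv/dn1.img /mnt/dn1   # mount it via a loop device
$ docker run -d -v /mnt/dn1:/hadoop/dfs/data my-datanode-image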
I found another question on SO (Limit disk size and bandwidth of a Docker container), but the answers are almost 1.5 years old, which, given how quickly docker develops, is ancient.
Which approach or storage backend would allow us to:
limit storage on a per-container basis,
achieve near bare-metal performance, and
avoid repartitioning the server drives?

You can specify runtime constraints on memory and CPU, but not disk space.
The ability to set constraints on disk space has been requested (issue 12462, issue 3804), but isn't yet implemented, as it depends on the underlying filesystem driver.
This feature is going to be added at some point, but not right away. It's a bit more difficult to add right now because large chunks of code are moving from one place to another. After this work is done, it should be much easier to implement this functionality.
Please keep in mind that quota support can't be added as a hack to devicemapper; it has to be implemented for as many storage backends as possible, in a way that makes it easy to add quota support for other storage backends.
Update August 2016: as shown below (and in this issue 3804 comment), PR 24771 and PR 24807 have since been merged. docker run now allows setting storage driver options per container:
$ docker run -it --storage-opt size=120G fedora /bin/bash
This (size) sets the container rootfs size to 120G at creation time.
This option is only available for the devicemapper, btrfs, overlay2, windowsfilter and zfs graph drivers.
Documentation: docker run / "Set storage driver options per container".
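For example, you can verify the limit from inside a container with df (note that for the overlay2 driver this option additionally requires an xfs backing filesystem mounted with the pquota option):
$ docker run -it --storage-opt size=120G fedora df -h /
The root filesystem should report a size of roughly 120G.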

Related

Share large file with all nodes in docker swarm

Currently, I am migrating to Docker Swarm and have begun to use docker configs to offload most of the configuration files, but I have one file remaining, several GBs in size, that is used by my tileserver. Right now, I have 1 master / 4 workers, and I am looking for a way to share that file with all nodes in the swarm to prepare for a time when the tileserver goes down.
Any ideas?
If you want highly available data, then you need a solution that distributes data amongst the nodes (or servers).
One approach would be deploying an object storage solution onto the swarm: something like minio gives you an S3-compatible REST API, and when deployed with a minimum of 4 disks in erasure-coding mode it tolerates 1 disk down for writing and 2 disks down for reading (assuming you have a node per disk).
If re-jigging your app to work with object storage isn't in scope, then investigate something like glusterfs, which you will want to install on the metal rather than on docker. glusterfs will give you a unified filesystem with decent HA on 3 nodes, and you can add disks on the fly.
Obviously, with minio it's expected that your app would use the S3 API to access its files. With glusterfs you would need to mount gfs volumes at host locations, and containers then mount volumes from those locations to gain access to that network storage (as sketched below).
That is, unless you are willing to go wandering through the world of rex-ray and other community-supported docker volume drivers, which either haven't seen an update in years or are literally maintained by one person for fun; these can bring some first-class support for glusterfs-based docker volumes to your hopefully non-production docker swarm.
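As a sketch of the glusterfs route (the gluster volume, mount point and image name are hypothetical; the volume is assumed to be mounted at the same path on every node):
# on every node: mount the gluster volume on the host
$ sudo mount -t glusterfs gluster1:/tiles /mnt/gfs
# then bind-mount that host path into the swarm service
$ docker service create --name tileserver \
    --mount type=bind,src=/mnt/gfs,dst=/data/tiles \
    my-tileserver-image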

How to bypass memory caching while using FIO inside of a docker container?

I am trying to benchmark I/O performance on my host and in a docker container using the flexible I/O tester (fio) with O_DIRECT enabled in order to bypass memory caching. The result is very suspicious: docker performs almost 50 times better than my host machine, which is impossible. It seems like docker is not bypassing the caching at all, even when I run it in --privileged mode. This is the command I ran inside the container. Any suggestions?
fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=4k --numjobs=1 --size=10G --runtime=600 --group_reporting --output-format=json >/home/docker/docker_seqread_4k.json
(Note this isn't really a programming question, so Stack Overflow is the wrong place to ask it... maybe Super User or Server Fault would be a better choice and get faster answers?)
The result is very suspicious. docker performs almost 50 times better than my host machine, which is impossible. It seems like docker is not bypassing the caching at all.
If your best-case latencies are suspiciously small compared to your worst-case latencies, it is highly likely your suspicions are well founded and that kernel caching is still happening. Asking for O_DIRECT is a hint, not an order, and the filesystem can choose to ignore it and use the cache anyway (see the part about "You're asking for direct I/O to a file in a filesystem but...").
If you have the option and you're interested in disk speed, it is better to do any such test outside of a container (with all the caveats that implies). Another option, when you can't or don't want to disable caching, is to ensure that you do I/O that is at least two to three times the size of RAM (both in terms of amount and the region being used), so the majority of I/O can't be satisfied by buffers/cache (and if you're doing write I/O, then do something like end_fsync=1 too, as in the sketch below).
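As an illustration, on a host with 16GB of RAM a write test along those lines could look like the following (the size and target directory are assumptions to adapt to your machine):
$ fio --name=seqwrite --rw=write --bs=4k --size=48G \
      --end_fsync=1 --group_reporting --directory=/bench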
In summary, the filesystem being used by docker may make it impossible to accurately do what you're requesting (measure the disk speed by bypassing cache while using whatever your default docker filesystem is).
Why a Docker benchmark may give the results you're seeing
The Docker engine uses, by default, the OverlayFS [1][2] driver for data storage in containers. It assembles all of the different layers from the images and makes them readable. Writing is always done to the "top" layer, which is the container storage.
When performing reads and writes to the container's filesystem, you're passing through Docker's overlay2 driver, through the OverlayFS kernel driver, through your filesystem driver (e.g. ext4) and onto your block device. Additionally, as Anon mentioned, DIRECT/O_DIRECT is just a hint, and may not be respected by any of the layers you're passing through.
Getting more accurate results
To get accurate benchmarks within a Docker container, you should write to a volume mount or change your storage driver to one that is not overlaid, such as the Device Mapper driver or the ZFS driver.
Both the Device Mapper driver and the ZFS driver require a dedicated block device (you'll likely need a separate hard drive), so using a volume mount might be the easiest way to do this.
Use a volume mount
Use the -v option with a directory that sits on a block device on your host.
docker run -v /absolute/host/directory:/container_mount_point alpine
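Combining this with the fio test from the question might look like the following (the image name is hypothetical; any image with fio installed will do, and /mnt/test is assumed to sit directly on the block device you want to measure):
$ docker run --rm -v /mnt/test:/bench my-fio-image \
    fio --name=seqread --rw=read --direct=1 --ioengine=libaio \
        --bs=4k --size=10G --directory=/bench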
Use a different Docker storage driver
Note that the storage driver must be changed on the Docker daemon (dockerd) and cannot be set per container. From the documentation:
Important: When you change the storage driver, any existing images and containers become inaccessible. This is because their layers cannot be used by the new storage driver. If you revert your changes, you can access the old images and containers again, but any that you pulled or created using the new driver are then inaccessible.
With that disclaimer out of the way, you can change your storage driver by editing daemon.json and restarting dockerd.
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.directlvm_device=/dev/sd_",
    "dm.thinp_percent=95",
    "dm.thinp_metapercent=1",
    "dm.thinp_autoextend_threshold=80",
    "dm.thinp_autoextend_percent=20",
    "dm.directlvm_device_force=false"
  ]
}
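After saving /etc/docker/daemon.json, restart the daemon and confirm the new driver took effect, for example:
$ sudo systemctl restart docker
$ docker info --format '{{.Driver}}'
devicemapper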
Additional container benchmark notes - kernel
If you are trying to compare different flavors of Linux, keep in mind that Docker is still running on your host machine's kernel.

GlusterFS storage for docker images

I have a docker swarm with 10 docker worker nodes, and I'm experiencing issues with docker image storage (in the thin pool). It keeps getting full, as I have rather small disks (30GB-60GB).
The error:
Thin Pool has 7330 free data blocks which is less than minimum required 8455 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior
Because of that, the cleaning strategy has to be aggressive, meaning deleting all images three times a day. This aggressive cleaning strategy results in broken pulls (when cleaning happens at the same time as someone is pulling an image) and means developers cannot use cached images; instead they need to re-download the images that just got deleted by the cleaning mechanism.
However, there is an option to use GlusterFS storage, and I want to mount GlusterFS volumes on each docker node and use them to back the thin pool for docker images and /var/lib/docker.
I'm looking for a guide on how to do that exactly. Has anyone tried it?
P.S. I did my research on shared storage for docker images between multiple docker nodes, and it seems it's not possible, as stated here. However, mounting a separate volume on each docker node should be possible.
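Not a full guide, but a hedged sketch of the mounting part (the server and volume names are hypothetical). One caveat: docker's overlay2 driver requires the backing filesystem to be xfs or ext4, so pointing /var/lib/docker directly at a glusterfs mount may not be supported by your storage driver:
$ sudo systemctl stop docker
$ sudo mount -t glusterfs gluster1:/dockervol /var/lib/docker
$ sudo systemctl start docker
For a persistent mount, the corresponding /etc/fstab entry would be something like:
gluster1:/dockervol  /var/lib/docker  glusterfs  defaults,_netdev  0  0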

Do Docker containers on the same host machine share the same page cache?

If I have two Docker containers running on the same host machine do they each have their own page cache or do they use the page cache of the host machine?
The page cache is managed by the kernel, and the kernel is shared by all the containers.
See more at moby/moby issue 21759
Docker makes it easy to spawn a lot of containers and get better density, but it also makes it easy to run too many services on one machine or to run services which require way too much RAM.
The official documentation lists devicemapper (direct-lvm) as a production-ready storage driver, but it doesn't have very efficient memory usage (and the official documentation doesn't claim otherwise). Because devicemapper works at the block level, identical files in different containers are cached separately, so multiple identical containers will increase memory usage for the page cache.
To improve this and get better performance, the following should help, in a similar way to how it helps outside of Docker and containers in general:
make containers smaller for long-running services & applications (e.g. smaller binaries, smaller images, optimized memory usage, etc.)
VERY IMPORTANT: use volumes and bind mounts, instead of storing data inside the container
VERY IMPORTANT: make sure to run a system with a maintained kernel, up to date Docker and devicemapper libraries (e.g. fully updated CentOS 7 / RHEL 7 / Ubuntu 14.04 / Ubuntu 16.04)
Current behaviour (January 2020) is that by default containers on the same host share the same page cache.
Current docker documentation explains:
OverlayFS is a modern union filesystem that is similar to AUFS, but faster and with a simpler implementation. Docker provides two storage drivers for OverlayFS: the original overlay, and the newer and more stable overlay2.
The overlay2 driver is supported on Docker Engine - Community, and Docker EE 17.06.02-ee5 and up, and is the recommended storage driver.
Page Caching. OverlayFS supports page cache sharing. Multiple containers accessing the same file share a single page cache entry for that file. This makes the overlay and overlay2 drivers efficient with memory and a good option for high-density use cases such as PaaS.
https://docs.docker.com/storage/storagedriver/overlayfs-driver/
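To check whether your hosts actually use overlay2 (and therefore get this page cache sharing), something like:
$ docker info | grep "Storage Driver"
Storage Driver: overlay2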

Docker AUFS... underlying fs... optimization by mount option?

So I understand that docker uses /var/lib/docker/ to store every container and image... right?
That means the only optimization I can make for my containers is to optimize the underlying fs that /var/lib/docker/ is sitting on?
In that sense, can I assume I should be optimizing the mount options of my underlying system fs? e.g. ext4 with noatime, nodiratime, etc.
Also, can I use a different mount for the /var/lib/docker/ folder? Are there any limitations or optimization settings to consider for the underlying disk docker is sitting on?
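For reference, a dedicated mount for /var/lib/docker with those options might look like this in /etc/fstab (the device name is hypothetical):
/dev/sdb1  /var/lib/docker  ext4  defaults,noatime,nodiratime  0  2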
I would rather squash my images:
https://github.com/jwilder/docker-squash
and prefer Debian, for example as suggested here:
https://docker.cn/p/6-dockerfile-tips-official-images-en
Extract:
The main advantage of the Debian image is the smaller size – it clocks in at around 85.1 MB compared to around 200 MB for Ubuntu.
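For reference, docker-squash is typically used as a pipeline between docker save and docker load, along these lines (image and tag names are illustrative):
$ docker save my-image | docker-squash -t my-image:squashed | docker load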
