I want to set up a Swarm with persistent and replicated volumes through Ceph. I see these options to combine both services, once both are set up:
Configure the host OS to mount a CephFS in /var/lib/docker/volumes.
Use rexray/rbd as a volume driver.
Use rexray/s3fs to access Ceph object store, which is S3-compatible.
I wonder now: which option would deliver the fastest performance? Is there another, better option that I'm missing?
Thanks.
In general, for best performance you should go with rbd, since it gives you direct block access to the Ceph volume, whereas s3fs involves much more machinery, which eventually results in longer response times. Quick responses for random reads/writes are especially important in scenarios like running a PostgreSQL (or MariaDB) database with a mixed read/write load.
This is only general advice based on Ceph rbd itself, but my guess is that it applies just as well through the Docker volume drivers.
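For what it's worth, a minimal sketch of the rbd route with the rexray/rbd plugin might look like the following. The volume and service names are placeholders, and the plugin still has to be configured against your Ceph monitors/keyring on each node, so treat it as an outline rather than a working setup:

# install the plugin on each node (Ceph config/keyring must be present on the host)
docker plugin install rexray/rbd
# create an RBD-backed volume and attach it to a service
docker volume create --driver rexray/rbd pgdata
docker service create --name db \
  --mount type=volume,source=pgdata,target=/var/lib/postgresql/data,volume-driver=rexray/rbd \
  postgres

Note that an RBD image can only be attached to one node at a time, so this pattern suits single-replica services like a database rather than volumes shared by many tasks at once.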
Related
Currently, I am migrating to Docker Swarm and have begun to use docker configs to offload most of the configuration files, but I have one file remaining, several GBs in size, that is used by my tileserver. Right now I have 1 master / 4 workers, and I am looking for a way to share that file with all nodes in the swarm, to prepare for a time when the tileserver goes down.
Any ideas?
If you want highly available data, then you need a solution that distributes the data amongst the nodes (or servers).
One approach would be to deploy an object storage solution onto the swarm - something like MinIO gives you an S3-compatible REST API, and when deployed with a minimum of 4 disks in erasure-coding mode it tolerates 1 disk down for writing and 2 disks down for reading (assuming you have a node per disk).
If re-jigging your app to work with object storage isn't in scope, then investigate something like GlusterFS, which you will want to install on the metal rather than in Docker. GlusterFS will give you a unified filesystem with decent HA on 3 nodes, and you can add disks on the fly.
Obviously with MinIO it's expected that your app would use the S3 API to access its files. With GlusterFS you would need to mount gfs volumes at host locations, where containers can then mount volumes to gain access to that network storage - unless you are willing to go wandering through the world of REX-Ray and other community-supported Docker volume drivers, which either haven't seen an update in years or are literally maintained by one guy for fun, and which can bring some first-class support for GlusterFS-based Docker volumes to your hopefully non-production Docker swarm.
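If you do go the GlusterFS-on-the-metal route, the usual pattern looks roughly like this (a sketch only; the volume name "tiles", the server "gluster1", and the paths are placeholders): mount the gluster volume at the same host path on every node, then bind-mount that path into the service.

# on every swarm node: mount the gluster volume at a fixed host path
mount -t glusterfs gluster1:/tiles /mnt/tiles
# the service then bind-mounts that host path on whichever node runs it
docker service create --name tileserver \
  --mount type=bind,source=/mnt/tiles,target=/data/tiles \
  my-tileserver-image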
I am trying to benchmark I/O performance on my host and in a Docker container using the flexible I/O tool (fio) with O_DIRECT enabled in order to bypass memory caching. The result is very suspicious: Docker performs almost 50 times better than my host machine, which is impossible. It seems like Docker is not bypassing the caching at all, even when I run it in --privileged mode. This is the command I ran inside the container. Any suggestions?
fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=4k --numjobs=1 --size=10G --runtime=600 --group_reporting --output-format=json >/home/docker/docker_seqread_4k.json
(Note this isn't really a programming question, so Stack Overflow is the wrong place to ask it... Maybe Super User or Server Fault would be a better choice and get faster answers?)
The result is very suspicious. Docker performs almost 50 times better than my host machine, which is impossible. It seems like Docker is not bypassing the caching at all.
If your best-case latencies are suspiciously small compared to your worst-case latencies, it is highly likely your suspicions are well founded and that kernel caching is still happening. Asking for O_DIRECT is a hint, not an order, and the filesystem can choose to ignore it and use the cache anyway (see the part about "You're asking for direct I/O to a file in a filesystem but...").
If you have the option and you're interested in disk speed, it is better to do any such test outside of a container (with all the caveats that implies). Another option, when you can't or don't want to disable caching, is to ensure that you do I/O that is at least two to three times the size of RAM (both in terms of amount and the region being used), so the majority of the I/O can't be satisfied by buffers/cache (and if you're doing write I/O, then do something like end_fsync=1 too).
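For example, on a host with around 16 GB of RAM, something along these lines keeps most of the I/O out of reach of the page cache (the size and output path are only illustrative, and this variant is a write test so end_fsync applies):

fio --name=seqwrite --rw=write --ioengine=libaio --bs=4k --numjobs=1 \
  --size=48G --end_fsync=1 --group_reporting \
  --output-format=json >/home/docker/docker_seqwrite_4k.json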
In summary, the filesystem being used by Docker may make it impossible to accurately do what you're requesting (measure the disk speed by bypassing the cache while using whatever your default Docker filesystem is).
Why a Docker benchmark may give the results you expect
The Docker engine uses, by default, the OverlayFS [1][2] driver for data storage in containers. It assembles all of the different layers from the images and makes them readable. Writes always go to the "top" layer, which is the container's storage.
When performing reads and writes to the container's filesystem, you're passing through Docker's overlay2 driver, through the OverlayFS kernel driver, through your filesystem driver (e.g. ext4) and onto your block device. Additionally, as Anon mentioned, DIRECT/O_DIRECT is just a hint, and may not be respected by any of the layers you're passing through.
Getting more accurate results
To get accurate benchmarks within a Docker container, you should write to a volume mount or change your storage driver to one that is not overlay-based, such as the Device Mapper driver or the ZFS driver.
Both the Device Mapper driver and the ZFS driver require a dedicated block device (you'll likely need a separate hard drive), so using a volume mount might be the easiest way to do this.
Use a volume mount
Use the -v option with a directory that sits on a block device on your host.
docker run -v /absolute/host/directory:/container_mount_point alpine
Use a different Docker storage driver
Note that the storage driver must be changed on the Docker daemon (dockerd) and cannot be set per container. From the documentation:
Important: When you change the storage driver, any existing images and containers become inaccessible. This is because their layers cannot be used by the new storage driver. If you revert your changes, you can access the old images and containers again, but any that you pulled or created using the new driver are then inaccessible.
With that disclaimer out of the way, you can change your storage driver by editing daemon.json and restarting dockerd.
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.directlvm_device=/dev/sd_",
    "dm.thinp_percent=95",
    "dm.thinp_metapercent=1",
    "dm.thinp_autoextend_threshold=80",
    "dm.thinp_autoextend_percent=20",
    "dm.directlvm_device_force=false"
  ]
}
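After editing daemon.json you still need to restart the daemon; on a systemd-based host that, plus a quick check of the active driver, looks like:

sudo systemctl restart docker
docker info --format '{{.Driver}}'   # should now print "devicemapper"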
Additional container benchmark notes - kernel
If you are trying to compare different flavors of Linux, keep in mind that Docker is still running on your host machine's kernel.
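You can verify this easily: uname -r inside any container reports the host's kernel version, whatever distribution the image is based on.

uname -r                           # on the host
docker run --rm alpine uname -r    # same kernel version from inside a container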
I have a question regarding MariaDB and Docker. Is it wise to use the volume that is already provided with the official MariaDB Docker image? Or is it better to create a folder shared with the host, for better performance? One of my colleagues was afraid that read/write operations could be too slow on the virtual volume.
In my opinion, read/write should be fast enough on that virtual volume, since Docker only uses the host's Linux kernel, right?
Thank you in advance!
I think you are asking if there is a performance difference between volumes and bind mounts.
The answer is that there shouldn't be. Both types bypass the slow copy-on-write storage drivers and are stored directly on the host:
From Performance best practices:
Use volumes for write-heavy workloads: Volumes provide the best and most predictable performance for write-heavy workloads. This is because they bypass the storage driver and do not incur any of the potential overheads introduced by thin provisioning and copy-on-write...
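For reference, both forms look like this with the official mariadb image (the volume name, host path, and password are placeholders; /var/lib/mysql is the image's data directory):

# named volume, managed by Docker under /var/lib/docker/volumes
docker run -d --name db-volume -e MYSQL_ROOT_PASSWORD=changeme -v mariadb_data:/var/lib/mysql mariadb
# bind mount of an arbitrary host directory
docker run -d --name db-bind -e MYSQL_ROOT_PASSWORD=changeme -v /srv/mariadb:/var/lib/mysql mariadb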
I have three containers that need to run on the same Swarm node/host in order to have access to the same data volume. I don't care which host they are delegated to - since it is running on Elastic AWS instances, they will come up and down without my knowing it.
This last fact makes it tricky, even though it seems like it should be a fairly common need. Is there a placement constraint that would allow this? Obviously node.id and node.hostname are out, as those are not constant. I thought about labels - that would work, but then I have no idea how to have a "replacement" AWS instance automatically get the label.
Swarm doesn't yet have a feature to co-schedule containers onto the same host (given your requirement of not using node ID or hostname). That's what Kubernetes calls "Pods"; Docker Swarm takes a more distributed approach. You could try to hack together a label assignment on new-instance startup, but that isn't ideal.
In Swarm, the way to solve this problem today is to use a different volume driver plugin than the built-in "local" driver. Here's a list of certified ones. The key in Swarm is to not use a node's local storage for volumes - those volumes will be lost when the node dies anyway, so it's best to move your volumes to shared storage.
In AWS I'd suggest you try EFS as the shared storage if you need multiple containers to access it at once, and use either Docker's CloudStor driver (which comes with the Docker for AWS template) or the REX-Ray storage orchestrator, which ensures shared data paths (NFS, EFS, S3, etc.) are connected to the correct node for the correct service task.
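A rough sketch of what that looks like with CloudStor (the driver name assumes the plugin that ships with the Docker for AWS template, backing=shared is the EFS-backed mode if I recall correctly, and the volume, service, and image names are placeholders):

# create one EFS-backed volume visible from every node
docker volume create -d cloudstor:aws --opt backing=shared shared-data
# each of the services mounts the same volume, wherever it gets scheduled
docker service create --name svc-a --mount type=volume,source=shared-data,target=/data,volume-driver=cloudstor:aws image-a
docker service create --name svc-b --mount type=volume,source=shared-data,target=/data,volume-driver=cloudstor:aws image-b

With the data on shared storage, the three containers no longer need to be pinned to the same node at all.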
If I have two Docker containers running on the same host machine do they each have their own page cache or do they use the page cache of the host machine?
The page cache is managed by the kernel and is shared by all containers running on the host.
See more at moby/moby issue 21759
Docker makes it easy to spawn a lot of containers and get better density, but it also makes it easy to run too many services on one machine or to run services which require way too much RAM.
The official documentation lists devicemapper (direct-lvm) as a production-ready storage driver, but it doesn't have very efficient memory usage, and the documentation doesn't claim otherwise either: multiple identical containers will each increase memory usage for the page cache.
In order to make this better and get better performance, the following should help, in a similar way to how it helps outside of Docker and containers in general:
make containers smaller for long running services & applications (e.g. smaller binaries, smaller images, optimize memory usage, etc)
VERY IMPORTANT: use volumes and bind mounts, instead of storing data inside the container
VERY IMPORTANT: make sure to run a system with a maintained kernel, up to date Docker and devicemapper libraries (e.g. fully updated CentOS 7 / RHEL 7 / Ubuntu 14.04 / Ubuntu 16.04)
Current behaviour (January 2020) is that by default containers on the same host share the same page cache.
Current docker documentation explains:
OverlayFS is a modern union filesystem that is similar to AUFS, but faster and with a simpler implementation. Docker provides two storage drivers for OverlayFS: the original overlay, and the newer and more stable overlay2.
The overlay2 driver is supported on Docker Engine - Community, and Docker EE 17.06.02-ee5 and up, and is the recommended storage driver.
Page Caching. OverlayFS supports page cache sharing. Multiple containers accessing the same file share a single page cache entry for that file. This makes the overlay and overlay2 drivers efficient with memory and a good option for high-density use cases such as PaaS
https://docs.docker.com/storage/storagedriver/overlayfs-driver/