Persisting data in a docker swarm with glusterfs - docker

I have a Docker swarm with a lot of containers, but in particular:
mysql
mongodb
fluentd
elasticsearch
My problem is that when a node fails, the manager discards the current container and creates a new one on another node. So every time this happens I lose the persistent data stored in that particular container, even when using Docker volumes.
So I would like to create four distributed GlusterFS volumes across my cluster and mount them as Docker volumes in my containers.
Is this a correct way to resolve my problem?
If it is, what type of filesystem should I use for my GlusterFS volumes?
Are there performance problems with this approach?

GlusterFS would not be the correct way to resolve this for all of your containers since Gluster does not support "structured data", as stated in the GlusterFS Install Guide:
Gluster does not support so called “structured data”, meaning live, SQL databases. Of course, using Gluster to backup and restore the database would be fine - Gluster is traditionally better when using file sizes of at least 16KB (with a sweet spot around 128KB or so).
One solution to this would be master slave replication for the data in your databases.
MySQL and MongoDB both support this (as described here and here), as do most common DBMSs.
Master-slave replication means that, of two or more copies of your database, one is the master and the rest are slaves. All write operations happen on the master, and read operations can be served by the slaves. Any data written to the master is replicated to the slaves.
Some DBMSs also provide a way to check if the master goes down and elect a new master if this happens, but I don't think all DBMSs do this.
You could alternatively set up a Galera Cluster, but as far as I'm aware this only supports MySQL.
I would have thought you could use GlusterFS for Fluentd and Elasticsearch, but I'm not familiar with either of those so I couldn't say for certain. I imagine it would depend on how they store any data they collect (if they collect any at all).

You might want to take a look at Flocker (a volume data manager), which integrates with several container cluster managers, including Docker Swarm.
You will have to create a volume using the flocker driver for each application, as shown in the tutorial:
...
volumes:
  mysql:
    driver: "flocker"
    driver_opts:
      size: "10GiB"
      profile: "bronze"
...

Related

Share volume in docker swarm for many nodes

I'm facing a big challenge: I'm trying to run my app on 2 VPSs in Docker Swarm. Containers that use volumes should share those volumes between nodes.
The options I have considered are:
Use a GlusterFS plugin and mount the volume on every node using NFS. NFS creates a single point of failure, so if something goes wrong my data is gone. (That doesn't look good, but maybe I'm wrong.)
Use Azure Storage and store data as blobs (Azure Data Lake Storage Gen2). But my main problem is: how can I connect to Azure Storage using docker-compose.yaml? I assume I should declare the volume in every service that uses it and also in the volumes section, but I have no idea how to do that.
The Docker documentation about it is gone. It should be here: https://docs.docker.com/docker-for-azure/persistent-data-volumes/.
Another option is to use https://hub.docker.com/r/docker4x/cloudstor/tags?page=1&ordering=last_updated, but the last update was 2 years ago, so it's probably not supported anymore.
Do I have any other options, and which way of sharing volumes between nodes is the best solution?
There are a number of ways of dealing with creating persistent volumes in docker swarm, none of them particularly satisfactory:
First up, a simple way is to use nfs, glusterfs, iscsi, or vmware to multimount the same SAN storage volume onto each docker swarm node. Services just mount volumes as /mnt/volumes/my-sql-workload.
On the one hand it's really simple; on the other hand there is literally no access control, and you can easily and accidentally load services pointing at each other's data.
Next, commercial docker volume plugins for SANs. If you are lucky and possess a Pure Storage, NetApp or other such SAN array, some of them still offer docker volume plugins. Trident for example if you have a NetApp.
Third, if you are in the cloud, the legacy swarm offerings on Azure and AWS included a built-in "cloudstor" volume driver, but you need to dig really deep to find it in their legacy offering.
Fourth, there are a number of open-source or free volume plugins that will mount volumes from nfs, glusterfs or other sources. But most are abandoned or very quiet. The most active I know of is marcelo-ochoa/docker-volume-plugins.
I wasn't particularly happy with how those plugins mounted pre-existing volumes but made operations like docker volume create hard, so I made my own. But really,
Swarm Cluster Volume Support with CSI Plugins is hopefully going to drop in 2021¹, which will hopefully be a solid answer to all the problems above.
¹It's now 2022 and the next version of Docker has not yet gone live with CSI support. Still we wait.
In my opinion, a good solution could be to create a GlusterFS cluster, configure a single volume and mount it on every Docker Swarm node (e.g. at /mnt/swarm-storage).
Then, for every Container that needs persistent storage, bind-mount a subdirectory of the GlusterFS volume inside the container.
Example:
services:
  my-container:
    ...
    volumes:
      - type: bind
        source: /mnt/swarm-storage/my-container
        target: /a/path/inside/the/container
This way, every node shares the same storage, so a given container can be started on any cluster node.
You don't need a Docker plugin for a particular storage driver, because the distributed storage is transparent to the Swarm cluster.
Lastly, GlusterFS is a distributed filesystem designed to avoid a single point of failure, and you can cluster it on as many nodes as you like (unlike NFS).

Do replicated docker containers in swarm mode contain multiple copies of data?

I have recently started learning Docker. When studying swarm mode, I saw that containers can be scaled up. What I would like to know is: once you scale a container in replicated mode, will the data within the container be replicated too, or will just fresh containers be spawned?
For example, let's say I created a mysql service initially with only 1 copy. I create and update tables in that mysql container. Later I scale it to 3. Will the newly spawned containers contain the same table data? Also, will the data be continuously replicated across the 3 Docker instances?
A replicated service will use a fresh container instance for each replica. Swarm does not take care of replicating persistent data stored in volumes.
Depending on the volume plugin (e.g. the local driver with remote NFS shares) you are limited to read-write-once or read-write-many. Even if your volume allows read-write-many, the service replicas might not support that; for instance, mysql will not work if you point n replicas to the same volume. You can leverage swarm service template variables, for instance, to point your volumes to different target folders of the same NFS share.
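As a rough sketch of the template idea above, the stack file below gives each replica its own volume by putting a swarm template variable in the volume name. The service name, image tag, password and volume name are placeholders that do not come from the answer, and it assumes `docker stack deploy` passes the templated name through unchanged, which is worth verifying on your Docker version. It uses separate volume names rather than separate folders on one NFS share, so it only illustrates the mechanism.

```yaml
# Sketch only: names, image tag and password are illustrative placeholders.
version: "3.8"
services:
  db:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: change-me   # placeholder value
    volumes:
      - db-data:/var/lib/mysql
    deploy:
      replicas: 3
volumes:
  db-data:
    # Assumption: swarm expands the template per task, so the task in
    # slot 1 mounts db-data-1, the task in slot 2 mounts db-data-2, etc.
    name: "db-data-{{.Task.Slot}}"
```

Note that this produces three independent MySQL instances with separate data. It does not make them replicate data to each other.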
Also, with swarm you will want storage that is reachable from all nodes, as a container can die and be re-spawned on a different node. So you will need either a remote share based on NFS or CIFS (see example usages nfs cifs), a storage cluster like Ceph or GlusterFS, or cloud native storage like Portworx. While you have to take care of HA yourself for remote share solutions, data replication is built in for storage clusters and cloud native storage.
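For the remote-share case, a minimal sketch of an NFS-backed volume using the built-in local driver could look like the following. The server address, export path, service and image are assumptions made for illustration; the NFS server must already export that path and be reachable from every swarm node.

```yaml
# Sketch only: address, export path and service are illustrative.
version: "3.8"
services:
  web:
    image: nginx:1.25
    volumes:
      - web-content:/usr/share/nginx/html
    deploy:
      replicas: 2
volumes:
  web-content:
    driver: local
    driver_opts:
      type: nfs
      # Hypothetical NFS server address and mount options
      o: "addr=192.0.2.10,rw,nfsvers=4"
      # Hypothetical export path on that server
      device: ":/exports/web-content"
```

Each node creates its own local volume definition, but all of them point at the same export, so a task rescheduled to another node sees the same files.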
If a containerized service is itself cluster/replica aware, it is usually better not to use the swarm replica mechanism, unless all instances can be started with the same set of parameters.

Docker volumes vs nfs

I would like to know if it is logical to use a redundant NFS/GFS share for web content instead of using docker volumes?
I'm trying to build a HA docker environment with the least amount of additional tooling. I would like to stick to 3 servers, each a docker swarm node.
Currently I'm looking into storage: an NFS/GFS filesystem cluster would require additional tooling for a small environment (100 GB max storage). I would like to use only natively supported Docker configurations, so I would prefer to use volumes and share those across containers. However, as far as I know, those volumes are not synchronized to other swarm nodes by default, so if the swarm node that hosts the data volume goes down, the volume will be unavailable to every container across the swarm.
A few things, that together, should answer your question:
Volumes use a driver, and the default driver in docker run and Swarm services is the built-in "local" driver, which only supports file paths that are mounted on that host. For using shared storage with Swarm services, you'll want a 3rd party plugin driver, like REX-Ray. An official list is here: store.docker.com
What you want to look for in a volume driver is one that's "docker swarm aware", meaning it will re-attach volumes to the new task that is created when the old Swarm service task is killed or updated. Tools like REX-Ray are almost like a "persistent data orchestrator" that ensures volumes are attached to the proper node where they are needed.
I'm not sure what web content you're talking about, but if it's code or templates, it should be built into the image. If you're talking about user uploaded content that needs to be backed up, then yep a volume sounds like the right way.

Are Docker Volumes machine-specific

I'm new to Docker Swarm. As I understand it, Docker Swarm lets you abstract away the cluster, meaning you don't care which hardware a container is deployed on.
On the other hand, the standard way to handle a database in Docker is to write data outside the Docker container (to avoid copy-on-write behaviour). That's achieved by mounting a Volume and writing db-related data to it. The important question here: are Volumes machine-specific? Are Docker & Docker Swarm clever enough to mount a Volume on the machine where it's needed?
Example:
I have 3 machines and 3 microservices/containers. All of them are deployed through Docker Swarm. Only one microservice/container must connect to a database. So I need to mount Volume only on one machine. But on which?
Databases and similar stateful applications are still a hard thing to deal with when it comes to Docker swarm and other orchestration frameworks. Ideally, containers should be able to run on any node in the swarm, but the problem comes when you need to persist data beyond the container's lifecycle.
Mounting a volume is the Docker way to persist data; however, this ties the container to a specific node, as volumes are created on specific nodes. There are many projects that try to solve this problem and provide some sort of distributed storage.
There was a project called Flocker that dealt with the above problem (it’s no longer maintained). There is also a newer project called REX-Ray.
Are Docker & Docker Swarm clever enough to mount a Volume on the machine it's needed?
By default, no. Docker swarm will choose one of the nodes and deploy the container on it. However, you can work around this problem:
First, you need to define a named volume in your Stackfile/Composefile under the service definition.
Second, you need to use node Placement Constraints to restrict where the database container should run.
If you do not use a distributed storage tool, then for databases and similar stateful containers that need volumes, you need to restrict the container to a specific node.
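The two steps above can be sketched in a stack file as follows. The service name, image tag, password, volume name and node hostname are placeholders chosen for illustration, not values from the answer.

```yaml
# Sketch only: names, image tag and hostname are illustrative.
version: "3.8"
services:
  db:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: change-me   # placeholder value
    volumes:
      # Step 1: a named volume, created on whichever node runs the task
      - db-data:/var/lib/mysql
    deploy:
      replicas: 1
      placement:
        constraints:
          # Step 2: pin the task to one node so it always finds its volume
          - "node.hostname == swarm-node-1"
volumes:
  db-data:
```

The trade-off is that the database is unavailable while that node is down, since swarm has nowhere else it is allowed to schedule the task.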

Docker swarm NFS volumes

I am playing with Docker 1.12 swarm mode and its built-in orchestration, but there is one issue I am not able to find an answer to:
If you're running a service like nginx or redis, you don't worry about data persistence.
But if you're running a service like a database, you need data persistence. If something happens to your Docker instance, the master will reschedule the container onto one of the available nodes, and by default Docker doesn't move data volumes to other nodes. To address this problem we can use third-party plugins like Flocker (https://github.com/ClusterHQ/flocker) or Rexray (https://github.com/emccode/rexray).
But the problem with this is: when one node fails you lose the data. Flocker and Rexray do not deal with this.
We can solve this if we use something like NFS. I mount the same volume across my nodes, so we don't have to move the data between nodes. If one of the nodes fails, the replacement container needs to remember the Docker mount location. Can we do this? If so, can we achieve it with Docker Swarm's built-in orchestration?
With Rexray, the data is stored outside the Docker swarm nodes (in Amazon S3, OpenStack Cinder, ...). So if you lose a node, you won't lose your persistent data. If your scheduler starts a new container that needs the data on another host, it will retrieve the external volume using the Rexray plugin and you're good to go.
Note: your external provider needs to allow you to perform forced detach of the volume from the now unavailable old nodes.
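As a hedged illustration of what this can look like in a stack file, the sketch below requests a volume from an external driver instead of the local one. It assumes the `rexray/ebs` plugin has already been installed and configured with cloud credentials on every node; the plugin choice, service name, image tag, password and volume name are assumptions for illustration, not part of the answer.

```yaml
# Sketch only: assumes the rexray/ebs plugin is installed on every node.
version: "3.8"
services:
  db:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: change-me   # placeholder value
    volumes:
      - db-data:/var/lib/mysql
    deploy:
      replicas: 1
volumes:
  db-data:
    # The plugin, not the node, owns the storage, so the volume can be
    # re-attached wherever the task is rescheduled.
    driver: rexray/ebs
```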
