Understanding Docker Swarm in terms of high availability

I am currently trying to understand what would be necessary to create a docker swarm to make some service highly available. I read through a lot of the docker swarm documentation, but if my understanding is correct, docker swarm will just execute a service on any host. What would happen if a host fails? Would the swarm manager restart the service(s) running on that host/node on another one? Is there any better explanation of this than in the original documentation found here?

Nothing more complex than that, really. As it says, Swarm (like Kubernetes and most other tooling in this space) is declarative: you tell it the state you want (e.g. 'I want 4 instances of redis') and Swarm converges the system to that state. If you have 3 nodes, it might schedule 1 redis on Node 1, 1 on Node 2, and 2 on Node 3. If Node 2 dies, the system no longer complies with your declared state, and Swarm will schedule another redis on Node 1 or 3 (depending on strategy, etc.).
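To make the "converge to the declared state" idea concrete, here is a minimal sketch of one reconciliation pass in Python. It is not Swarm's actual scheduler: the function name reconcile, the node names, and the least-loaded placement rule are all assumptions made for illustration, and it only handles the "too few tasks" case.

```python
from collections import Counter

def reconcile(desired_replicas, tasks, healthy_nodes):
    """Return a task list that matches the desired replica count.

    tasks is a list of node names, one entry per running task.
    (Hypothetical helper, not part of any Docker API.)
    """
    # Tasks on nodes that are no longer healthy are considered lost.
    running = [node for node in tasks if node in healthy_nodes]
    if not healthy_nodes:
        return running  # nowhere to schedule replacements
    load = Counter({node: 0 for node in healthy_nodes})
    load.update(running)
    # Schedule replacements on the least-loaded healthy node (a simple spread).
    while len(running) < desired_replicas:
        target = min(sorted(healthy_nodes), key=lambda n: load[n])
        running.append(target)
        load[target] += 1
    return running

# Declared state: 4 redis tasks, placed 1 / 1 / 2 across three nodes.
tasks = ["node1", "node2", "node3", "node3"]
# node2 dies; the next pass restores the declared count on the survivors.
tasks = reconcile(4, tasks, {"node1", "node3"})
print(Counter(tasks))
```

With the spread rule assumed here, the replacement task lands on node1, leaving two tasks on each surviving node.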
This dynamic scheduling of containers / tasks / instances brings another problem: discovery. Swarm deals with this by maintaining an internal DNS registry and by creating a VIP (virtual IP) for each service. Instead of having to address and keep track of each redis instance, I can point to a service alias and Swarm will automatically route traffic to where it needs to go.
Of course there are also other considerations:
Can your service support multiple backend instances? Is it stateless? Sessions? Cache? Etc...
What is 'HA'? Multi-node? Multi-AZ? Multi-region? Etc...

Related

Make Docker Swarm nodes prefer their own container before using another node's container in global mode

I have 3 swarm manager nodes and no worker nodes, running a service in global mode,
so each node has exactly one container.
When I send a request to any node, it may be processed by any of the containers, but I want a priority: the request should preferably be handled by the container on the node I sent it to,
and only if that container cannot respond should a container on another node be used.
I asked ChatGPT and it suggested --placement-pref, but that is not working.
Docker Swarm's mesh networking is very simple and does plain round-robin load balancing. --placement-pref only influences where tasks are scheduled, not how requests are routed.
If round robin is not what you want, you can either use host mode networking, or, for service-to-service calls, use tasks.<service> to get the DNSRR record, or use a different mesh implementation like HashiCorp Consul. (Not that HashiCorp Consul necessarily implements your use case.)
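For the service-to-service case, a rough Python sketch of the tasks.<service> lookup could look like the following. The helper names task_ips and order_candidates, the service name web, and the sample IPs are hypothetical; how a caller learns which task IP is "its own" is outside this sketch. The resolver is injectable only so the example can run outside a swarm overlay network, where the name would not resolve.

```python
import socket

def task_ips(service, port=80, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPs behind tasks.<service>."""
    # tasks.<service> resolves to one record per running task, whereas the
    # bare service name resolves to the single VIP in the default endpoint mode.
    infos = resolver(f"tasks.{service}", port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})

def order_candidates(ips, preferred_ip):
    """Put the preferred task first and keep the others as fallbacks."""
    first = [ip for ip in ips if ip == preferred_ip]
    rest = [ip for ip in ips if ip != preferred_ip]
    return first + rest

# Stand-in for real DNS so the sketch runs anywhere (made-up addresses).
def fake_resolver(host, port, proto=0):
    return [(socket.AF_INET, socket.SOCK_STREAM, proto, "", (ip, port))
            for ip in ("10.0.1.7", "10.0.1.5")]

ips = task_ips("web", resolver=fake_resolver)
candidates = order_candidates(ips, "10.0.1.7")
print(candidates)
```

A client would then try the candidates in order and move on to the next one when a connection fails, which is the "prefer local, fall back to others" behaviour the mesh itself does not provide.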

Are services with their own clustering mechanisms suitable for swarm?

I just started learning swarm recently. And I have some questions about the swarm usage scenario.
If I have a simple web server that responds to RESTful HTTP requests, swarm seems to be a good choice: if I need to scale my web server horizontally, I just use docker service scale and swarm does the load balancing for me.
But what about services that have their own clustering mechanism (Redis, Elasticsearch)? I cannot simply expand their capacity with docker service scale.
For instance, if I have a Redis service and run docker service scale redis=2, I get two separate Redis instances. This is obviously not what I need.
Are these services a fit for swarm mode? If so, how do I configure them in swarm mode, and how do I scale them?
Stateful services (e.g. Redis, RabbitMQ) fit swarm mode.
It is your responsibility, though, to configure the cluster yourself, with a pre-deploy/post-deploy script or in the image's entrypoint.
This reconfiguration must also run after every replica restart, regardless of the reason: a single replica failing and restarting, a restart of all service replicas, or new replicas being added by scaling.
The content of such a script varies between clustering solutions, so refer to the relevant documentation of each one. It may be as simple as putting the replicas' virtual IPs into a configuration file, or it may require complex ordered steps.
The general use cases that apply to all solutions are: configuring the cluster inside the service replicas for the first time, reconnecting a single replica to the cluster after a failure, and restarting all replicas after a failure or a planned restart.
Some GitHub projects try to automate the process, for example mariadb-cluster.

Docker Swarm failover behavior seems a bit underwhelming

I am currently trying to use Docker Swarm to set up our application (consisting of both stateless and stateful services) in a highly available fashion on a three-node cluster. By "highly available" I mean "can survive the failure of one out of the three nodes".
We have been doing such installations (by other means, not Docker, let alone Docker Swarm) for quite a while with good success, including acceptable failover behavior, so the application and the services that constitute it have proven that they can be made highly available in such a three-node setup.
With Swarm, I get the application up and running successfully (with all three nodes up) and have taken care that each service is configured redundantly, i.e., more than one instance exists for each of them, they are properly configured for HA, and not all instances of a service are located on the same Swarm node. Of course, I also made sure that all my Swarm nodes joined the Swarm as manager nodes, so that any one of them can become leader of the swarm if the original leader node fails.
In this "good" state, I can reach the services on their exposed ports on any of the nodes, thanks to Swarm's Ingress networking.
Very cool.
In a production environment, we could now put a highly-available loadbalancer in front of our swarm worker nodes, so that clients have a single IP address to connect to and would not even notice if one of the nodes goes down.
So now it is time to test failover behavior...
I would expect that killing one Swarm node (i.e., hard shutdown of the VM) would leave my application running, albeit in "degraded" mode, of course.
Alas, after doing the shutdown, I cannot reach ANY of my services via their exposed (via Ingress) ports anymore for a considerable time. Some do become reachable again and indeed have recovered successfully (e.g., a three node Elasticsearch cluster can be accessed again, of course now lacking one node, but back in "green" state). But others (alas, this includes our internal LB...) remain unreachable via their published ports.
"docker node ls" shows one node as unreachable
$ docker node ls
ID                            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
kma44tewzpya80a58boxn9k4s *   manager1   Ready    Active         Reachable
uhz0y2xkd7fkztfuofq3uqufp     manager2   Ready    Active         Leader
x4bggf8cu371qhi0fva5ucpxo     manager3   Down     Active         Unreachable
as expected.
What could I be doing wrong regarding my Swarm setup that causes these effects? Am I just expecting too much here?

Difference between Docker container and service

I'm wondering whether there are any differences between the following docker setups.
Administering two separate Docker engines via the remote API.
Administering two Docker swarm nodes via one single Docker engine.
If you can administer a swarm and still run a container on a specific node, are there any use cases for having separate Docker engines?
The difference between the two is swarm mode. When a docker engine is running services in swarm mode you get:
Orchestration from the manager to continuously try to correct any differences between the current state and the target state. This can also include HA using the quorum model (as long as a majority of the managers are reachable to make decisions).
Overlay networking, which allows containers on different hosts to talk to each other on their own container network. That can also involve IPsec for security.
Mesh networking for published ports and a VIP for the service that doesn't change the way container IPs do. The latter prevents problems from DNS caching. The former has all nodes in the swarm publish the port and route traffic to a container providing the service.
Rolling upgrades to avoid any downtime with replicated services.
Load balancing across multiple nodes when scaling up a service.
More details on swarm mode are available from docker's documentation.
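The quorum model mentioned above comes down to simple majority arithmetic. The following Python sketch (function names are illustrative, not Docker terminology) shows how many managers must stay reachable, and how many may fail, for a given manager count:

```python
def quorum(managers):
    """Smallest majority of a manager set."""
    return managers // 2 + 1

def tolerated_failures(managers):
    """How many managers can be lost while a majority remains reachable."""
    return (managers - 1) // 2

for n in (1, 2, 3, 5, 7):
    print(f"managers={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Under this arithmetic, three managers tolerate the loss of one, while two managers tolerate none, which is why odd manager counts are the usual choice.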
The downside of swarm mode is that you are one layer removed from the containers when they run on a remote node. You can't run an exec command on a task to investigate a container; you need to run it against the container itself, on the node where it is currently running. Docker also removed some options from services, like --volumes-from, which don't apply when containers may be running on different machines.
If you think you may grow beyond running containers on a single node, need to communicate between the containers on different nodes, or simply want the orchestration features like rolling upgrades, then I would recommend swarm mode. I'd only manage containers directly on the hosts if you have a specific requirement that prevents swarm mode from being an option. And you can always do both, manage some containers directly and others as a service or stack inside of swarm, on the same nodes.

Is it possible to create a docker swarm cluster using nodes on different cloud providers?

Is it possible to create a docker swarm cluster using nodes on different cloud providers?
Let's say some of them on AWS, some on GCE and some on Azure?
As I understand it, as long as your nodes can reach each other, you will be able to create a swarm cluster. It doesn't matter who your cloud providers are or where your nodes are located.
If you read the swarm deployment document carefully, you will find that the critical part of deploying a cluster is "let the compute nodes communicate with the controller node". Assuming you already have a controller node with both swarm and a discovery service (such as consul or etcd) installed, you can add a compute node like this:
$ docker run -d swarm join --advertise=<node_ip>:2375 consul://<consul_ip>:8500
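Where node_ip is the address of the compute node being added, as reachable by the controller, and consul_ip is the IP of the host running the discovery service (here, the controller node).
So the tricky part is: can you make your nodes communicate with each other? This question is actually not easy to answer. You need to care about IP allocation, network design, routers, etc.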
Yes. Look at Docker Machine for a quick way of setting up the basic infrastructure.