How can I prevent more than one container of a service from running on a single node, using Docker, Swarm, or Compose?
For example, I have 5 nodes and I want to deploy 3 replicas of some service, and I want those replicas to land on different nodes.
Docker's swarm mode currently defaults to an HA scheduling strategy, so nothing extra is needed to get your service spread across multiple nodes, as long as there are multiple nodes available to schedule the tasks on. Constraints, memory reservations, and outages may reduce the set of nodes it can schedule a task on. The HA scheduler first searches for candidate nodes with the fewest instances of the task (typically 0, unless you have more replicas than nodes), and then breaks ties by favoring nodes with fewer total tasks.
For further control over spreading tasks across nodes, I'd recommend looking into the placement preferences added in recent versions. I don't believe this has made it into stacks and the compose file yet, but I expect that's only a matter of time (new swarm mode features are typically introduced on the docker service command line first and added to the higher-level abstractions later). Placement preferences let you label your nodes to describe a higher-level topology, e.g. common racks, switches, or datacenters, and then enable topology-aware scheduling that spreads tasks across nodes with distinct values of the chosen label. So if you have two separate racks with 5 nodes each, without topology-aware scheduling everything could be scheduled on a single rack, while with it, half of your tasks would go to each rack.
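As a rough sketch of how that looks on the command line (the label, node, and service names here are placeholders of my own):

```
# label each node with the rack it sits in
docker node update --label-add rack=a node1
docker node update --label-add rack=b node6

# spread the service's tasks across distinct values of the rack label
docker service create --name myservice --replicas 6 \
  --placement-pref 'spread=node.labels.rack' nginx
```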
I have a Jenkins master and two agents. However, the connectivity to one agent (agentA) is a bit shaky, and I want to use the other agent (agentB) when the first one is not available.
I am only using the Jenkins web interface and have not used scripts. I am trying to figure out how this can be done using the "Restrict where this project can be run" option in the job's configuration. I tried using agentA || agentB, but when agentA is not available it hangs, saying "pending - agentA is offline".
Is it possible to have a configuration to achieve what I need?
I can't leave it blank because I have other agents (agentC, agentD) on which I do not want this job to run.
I am not an admin of the Jenkins server, so adding new plugins is not my preferred option, but it can be done.
As noted in the Least Load plugin documentation:
By default Jenkins tries to allocate a job to the last node it was executed on. This can result in nodes being left idle while other nodes are overloaded.
As you generalized the example, I'm not 100% sure whether your situation can be solved simply by better labelling of your nodes, or whether you want to look at the Least Load plugin (it is designed for balancing load across nodes). Your example appears to use node names (i.e., agentA/agentB). If the queue allocation logic is "only A or only B", then Jenkins sticks to it. Load balancing may not address that: while a node (computer) name is also a label, it may have additional logic tied to it.
If you label the pair of nodes in a pool with a common label, say "CapabilityA", and constrain your jobs to run on "CapabilityA" rather than on the node names, you may find jobs float across the pool (to B if A is not available). That's how we have our nodes labelled - by capability - and we see jobs floating across nodes, but only once the first node is full (4 executors each), so it's not balanced.
Nodes can have many labels and you can use label conditions to have complex constraints.
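For instance, assuming the pool nodes carry a "CapabilityA" label (the label name is just an example), the "Restrict where this project can be run" field accepts an expression like:

```
CapabilityA && !(agentC || agentD)
```

That keeps the job off agentC/agentD explicitly while letting it float to whichever labelled node is online.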
I have the following situation:
Twice a day, for about an hour, we receive a huge inflow of messages which currently run through RabbitMQ. The current Rabbit cluster of 3 nodes can't handle the spikes, but otherwise runs smoothly. It's currently set up on plain EC2 instances. The instance type is currently t3.medium, which is very low, but fine for the other 22h per day, where we receive ~5 msg/s. It's also currently set up with ha-mode=all.
After a rather lengthy and revealing read of the RabbitMQ docs, I decided to just try to set up an ECS EC2 cluster and scale out when CPU load rises. So: create a service on it and add that service to service discovery, for example discovery.rabbitmq. If there are three instances, they would all run under the same name, and it would resolve to all three IPs. Joining the cluster would work based on this:
That would be the rabbitmq.conf part:
cluster_formation.peer_discovery_backend = dns
# the backend can also be specified using its module name
# cluster_formation.peer_discovery_backend = rabbit_peer_discovery_dns
cluster_formation.dns.hostname = discovery.rabbitmq
I use a policy with ha-mode=exact and 2 replicas.
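Such a policy can be applied with rabbitmqctl; the policy name and queue pattern below are placeholders, and ha-sync-mode is optional:

```
rabbitmqctl set_policy ha-two "^" \
  '{"ha-mode":"exact","ha-params":2,"ha-sync-mode":"automatic"}'
```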
Our exchanges and queues are created manually upfront for reasons I cannot discuss any further, but that's a given. They can't be removed and they won't be re-created on the fly. We have 3 exchanges with each 4 queues.
So, the idea: during times of high load - add more instances, during times of no load, run with three instances (or even less).
The setup with scale-out/in works fine, until I started using a benchmarking tool and discovered that queues are always created on one single node, which becomes the queue master. That is fine considering the benchmarking tool is connected to a single node. The problem is that after scale-in/out, our manually created queues are not moved to other nodes either. This is in line with what I read on the RabbitMQ 3.8 release page:
One of the pain points of performing a rolling upgrade to the servers of a RabbitMQ cluster was that queue masters would end up concentrated on one or two servers. The new rebalance command will automatically rebalance masters across the cluster.
Here are the problems I ran into; I'm seeking some advice:
If I interpret the docs correctly, scaling out wouldn't help at all, because the new nodes would sit there idle until someone manually ran rabbitmq-queues rebalance all.
What would be the preferred way of scaling out?
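For reference, the rebalance command from the 3.8 release notes can at least be scripted, e.g. as a hook invoked after each scale event (invocation sketch only):

```
# run on any cluster node after new nodes have joined
rabbitmq-queues rebalance all
```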
How do people detect and automate the replacement of a dead Swarm manager?
That seems important considering: "If the swarm loses the quorum of managers, the swarm cannot perform management tasks."
You need to implement this with an external monitoring solution; it's not a built-in capability of docker swarm mode.
Implementing this solution will be non-trivial. First, keep in mind that when you promote a node, you are giving it full administrative access over the swarm, which a normal worker has none of, so make sure your security model is OK with this change. You also need to avoid cascading failures, where an overload of one manager causes it to fail, and automatically promoting other nodes causes them to fail in turn, until there are no more workers because the existing workload keeps being redistributed to fewer and fewer nodes. Lastly, when you add a new manager, you'll need to consider what to do with the reference to the failed manager: if it recovers, do you want it to continue where it left off, or do you want it completely removed from the swarm to reduce the number of nodes needed for quorum?
One last thing to note: when you lose quorum, nodes continue to run the containers they have already started. The only thing you lose is the ability to manage and make changes to that infrastructure. Therefore most places I've seen run 3 or 5 managers, depending on the level of fault tolerance needed, and often make the managers virtual so that if a failure occurs, the VM image can easily be restarted elsewhere in the environment.
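A monitoring loop along these lines could drive the promotion; the node names, and the assumption that a standby worker exists, are mine:

```
# run periodically from a healthy manager
for node in $(docker node ls --filter role=manager \
    --format '{{.Hostname}} {{.Status}}' | awk '$2 == "Down" {print $1}'); do
  docker node demote "$node"          # strip the manager role from the dead node
  docker node promote worker-spare    # promote a standby worker in its place
done
```

All the cascade-failure and security caveats above still apply; this only automates the mechanical steps.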
Is it possible for a Docker task to know which task number it is, and how many tasks in total are running for a particular service?
E.g. I'd like to have a Service that works on different ranges in a job queue. The range of jobs that any one Service instance (i.e. Task) works on is dependent on the total number of Tasks and which Task the current one is. So if it's the 5th task out of 20, then it will work on one range of jobs, but if it's the 12th task out of 20, it will work on a different range.
I'm hoping there is a DOCKER_SERVICE_TASK_NUMBER environment variable or something like that.
Thanks!
I've seen this requested a few times, so you're not alone, but it's not a current feature of docker's swarm mode. Implementing this would be non-trivial because of the need to support scaling a service along with other updates to the swarm service. If you scale a service down and docker cleans up task 2 of 5 because it's on the busiest node, you're left in an odd situation where the task count is lower than the highest task number and there's a hole in the numbering scheme.
If you have this requirement, an external service discovery tool like consul or etcd may be useful. You can also try implementing your own solution using the tasks.$service_name DNS entry that's available inside the container. That gives you the IPs of all the other containers providing this service, just like the round-robin load balancing you had before the swarm mode VIP abstracted it away.
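As a sketch of that approach, a container could resolve tasks.&lt;service&gt; and derive a best-effort index from the sorted IP list. The service name, helper name, and stubbed resolver below are my own, and the caveat from above still holds: the index shifts whenever tasks come and go.

```python
import socket

def task_slot(service, own_ip, resolve=None):
    """Return (index, total) for this container among its service's tasks.

    Resolves tasks.<service> (which lists all task IPs, bypassing the VIP)
    and sorts the result so every task computes the same ordering. The
    index is only best-effort: it changes when tasks are added or removed.
    """
    if resolve is None:
        def resolve(name):
            infos = socket.getaddrinfo(name, None, socket.AF_INET)
            return sorted({info[4][0] for info in infos})
    ips = resolve("tasks." + service)
    return ips.index(own_ip), len(ips)

# Example with a stubbed resolver standing in for swarm's DNS:
fake_dns = lambda name: ["10.0.1.2", "10.0.1.5", "10.0.1.9"]
print(task_slot("jobworker", "10.0.1.5", resolve=fake_dns))  # (1, 3)
```

Inside a real task you would pass the container's own IP and use the default resolver.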
I'm struggling to understand the idea of replica instances in Docker Swarm Mode. I've read that it's a feature that helps with high availability.
However, Docker automatically starts a new task on a different node if a node goes down, even with only 1 replica defined for the service, which also provides high availability.
So what's the advantage of having 3 replica instances rather than 1 for an arbitrary service? My assumption was that with more replicas, Docker spends less time creating a new instance on another node in the event of failure, which aids performance. Is this correct?
What Makes a System Highly Available?
One of the goals of high availability is to eliminate single points of failure in your infrastructure. A single point of failure is a component of your technology stack that would cause a service interruption if it became unavailable.
Let's take your example of a replica that consists of a single instance. Now let's suppose there is a failure. Docker Swarm will notice that the service failed and restart it. The service restarts, but a restart isn't instant. Let's say the restart takes 5 seconds. For those 5 seconds your service is unavailable. Single point of failure.
What if you had a service with 3 replicas? Now when one of them fails (no service is perfect), Docker Swarm will notice that one of the instances is unavailable and create a new one. During that time you still have 2 healthy instances serving requests. To a user of your service, it appears as if there was no downtime. This component is no longer a single point of failure.
ROMANARMY's answer is very good, and I just wanted to mention that the replicas can be on different nodes, so if one of your servers goes down (becomes unavailable), the container (replica) on the other server can keep running without a problem.
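For completeness, creating and resizing a replicated service (which swarm's HA scheduler will spread across nodes, as described above) is a single command each; the service and image names are examples:

```
docker service create --name web --replicas 3 nginx
docker service scale web=5
```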