Is it possible to configure OpsCenter Backup Service differently per datacenter? - datastax-enterprise

DSE 4.5.8, OpsCenter 5.1.3.
We are running a multi-region cluster, with 6 nodes running in one DC and 1 node running as a backup in a remote DC. RF is 3 in DC1 and 1 in DC2.
After enabling the OpsCenter Backup Service, the single node in the remote DC hits high CPU every time a backup runs (it is running /bin/find, of all things).
The question is: why would I want to back up the backup DC (DC2) at all? Is it possible to configure the Backup Service to confine itself to a single datacenter?
A secondary question is: are 3 copies of my data being backed up in DC1?
Thanks!

We are running a multi-region cluster, with 6 nodes running in one DC and 1 node running as a backup in a remote DC. RF is 3 in DC1 and 1 in DC2.
Because there are 6 nodes in DC1 (RF 3) and 1 node in DC2 (RF 1), the node in DC2 holds roughly 2x the data of each node in DC1: each DC1 node holds about 3/6 = half of the dataset, while the DC2 node holds all of it.
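You can confirm the imbalance from any node with nodetool; this is just an illustrative check, assuming nodetool is on the path:
nodetool status
# The "Load" column shows the on-disk data per node. With RF 3 spread over 6 nodes,
# each DC1 node carries roughly half of the dataset; with RF 1 on a single node,
# the DC2 node carries all of it, i.e. about twice the load of any one DC1 node.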
After enabling the OpsCenter Backup Service, the single node in the remote DC hits high CPU every time a backup runs (it is running /bin/find, of all things).
It would make sense that this node (since it has 2x the data) has to work 2x as hard.
A secondary question is: are 3 copies of my data being backed up in DC1?
Yes, the Backup Service takes the sstables from each of your nodes (all the data on that node, including replicas) and backs them up (either in another local directory or in S3).
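You can see what gets picked up on a given node by taking a snapshot by hand; a hedged sketch, with my_keyspace as a placeholder keyspace name and default data paths assumed:
nodetool snapshot -t pre_backup_check my_keyspace
# Hard-links the current sstables under each table's data directory, e.g.
# /var/lib/cassandra/data/my_keyspace/<table>/snapshots/pre_backup_check/
# Everything in there, replicas included, is essentially what the Backup Service copies off-node.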
The question is: why would I want to back up the backup DC (DC2) at all? Is it possible to configure the Backup Service to confine itself to a single datacenter?
No, currently you can configure OpsCenter backups at the keyspace level but not at the datacenter level. Having all the sstables for a particular node backed up lets you quickly bring that node back without bootstrapping if it is lost.
Furthermore, because the Backup Service's restore functionality uses sstableloader, you can also restore into a new cluster with a different topology.
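For example, a restore into a differently sized cluster can be driven with sstableloader directly; a minimal sketch, assuming the backed-up files for one table have been unpacked locally (hosts and paths are placeholders):
sstableloader -d new_node1,new_node2 /path/to/backup/my_keyspace/my_table
# sstableloader streams each row to whichever nodes own it in the target cluster,
# so the source and destination topologies do not need to match.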
Check out Mani's blog post for details on the performance of the backup service.

Related

Couchdb 3.1.0 cluster - database failed to load after restarting one node

Here is the situation: on a CouchDB cluster made of two nodes, each node is a CouchDB Docker instance on a server (ip1 and ip2). I had to reboot one server and restart Docker; after that, both my CouchDB instances display "This database failed to load." for each database.
I can connect with Futon and see the full list of databases, but that's all. On "Verify Couchdb Installation" with Futon I get several errors (only 'Create database' is a green check).
The docker logs for the container give me this error:
"internal_server_error : No DB shards could be opened"
I tried to recover the database locally by copying the .couch and shards/ files to a local instance of CouchDB, but the same problem occurs.
How can I retrieve the data?
PS: I checked the connectivity between my two nodes with erl, no problem there. It looks like Docker messed up some CouchDB config file on restart.
metadata and cloning a node
The individual databases have metadata indicating on which nodes their shards are stored; it is built at creation time based on the cluster options, so copying database files alone does not actually move or mirror the database onto the new node. (If one sets the metadata correctly, the shards are copied by CouchDB itself, so copying the files manually only speeds up the process.)
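You can inspect that shard metadata over HTTP; an illustrative check, where the admin credentials and the database name mydb are placeholders:
curl -s http://admin:password@ip1:5984/mydb/_shards
# Lists each shard range and the node names expected to hold it. If the node you
# copied files onto is not listed here, CouchDB will not open those shard files.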
replica count
A 2-node cluster usually does not make sense. As with file-system RAID, you can stripe for maximal performance at half the reliability, or you can create a mirror, but unless individual node state has perfect consistency detection you cannot automatically decide which of two nodes is incorrect, while deciding which of 3 nodes is incorrect is easy enough to do automatically. Consequently, most clusters are 3 or more nodes, and each shard has 3 replicas spread over any 3 nodes.
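The replica count can be set per database at creation time or as a cluster-wide default; a hedged example (credentials and database name are placeholders):
curl -X PUT 'http://admin:password@ip1:5984/mydb?n=3'
# or set the default for newly created databases in local.ini:
# [cluster]
# n = 3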
Alright, just in case someone makes the same mistake:
When you have a 2-node cluster, couchdb#ip1 and couchdb#ip2, and you created the cluster from couchdb#ip1:
1) If the node couchdb#ip2 stops, the cluster setup is messed up (couchdb#ip1 will no longer work); on restart the node does not reconnect correctly, and the databases show up but are not available.
2) On the other hand, stopping and starting couchdb#ip1 does not cause any problem.
The solution in case 1 is to recreate the cluster with 2 fresh CouchDB instances (couchdb#ip1 and couchdb#ip2), then copy the databases onto one CouchDB instance and all the databases will be back!
Can anyone explain in detail why this happened? It also means that this cluster configuration is absolutely not reliable (if couchdb#ip2 is down, nothing works); I guess it would not be the same with a 3-node cluster?
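For anyone retracing this, it is worth verifying after the rebuild that both nodes really joined; an illustrative check (admin credentials are placeholders):
curl -s http://admin:password@ip1:5984/_membership
# "cluster_nodes" is the configured membership and "all_nodes" is what this node can
# currently reach; on a healthy cluster the two lists match.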

Backup Docker Swarm - How many Manager Nodes Required

In the official Docker docs there is a statement (below) that looks confusing to me. From my understanding, don't we only need to pick any one of the healthy manager nodes to back up for future restoration purposes?
"You must perform a manual backup on each manager node, because logs contain node IP address information and are not transferable to other nodes. If you do not backup the raft logs, you cannot verify workloads or Swarm resource provisioning after restoring the cluster."
Link: https://docs.docker.com/ee/admin/backup/back-up-swarm/
It depends on how you want to recover. If you want to restore a specific node, you need a backup from that node.
If you are rebuilding your swarm cluster from an old backup, then you only need one healthy node's backup. See the following guide for performing a backup and restore:
https://docs.docker.com/engine/swarm/admin_guide/#back-up-the-swarm
If you restore the cluster from a single node's backup, you will need to reset and rejoin the swarm on the other managers, since you are then running a single-node cluster. What is restored in that scenario are the services, stacks, and other definitions, but not the nodes.
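As a rough sketch of what a per-manager backup looks like (paths assume a default Linux install managed by systemd):
systemctl stop docker        # quiesce the raft data; mind your manager quorum while it is down
tar czf /tmp/swarm-$(hostname)-$(date +%F).tar.gz -C /var/lib/docker swarm
systemctl start docker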

How services are distributed in docker swarm

Can I somehow configure how the manager node distributes services in Docker Swarm? I thought it would look at the free resources of the worker nodes and schedule onto the "freest" node.
Currently I have the problem that services end up on one node, which is full (90% RAM) and starts to lag, while at the same time the second node runs few services and could handle another one.
docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
wdkklpy6065zxckxyuj000ei4 * docker-master Ready Drain Leader 18.09.6
sk45rol2whdr5eh2jqozy0035 docker-node01 Ready Active Reachable 18.09.6
o4zwwbwwcrbwo4tsd00pxkfuc docker-node02 Ready Active 18.09.6
Now I have 36 (very similar) services: 28 run on docker-node01 and 8 on docker-node02. I thought the ideal state would be 18 services on each node.
Both docker-nodes are the same.
How does Docker Swarm know where to run a service? What algorithm does it use?
Is it possible to change/update the algorithm for selecting a node?
According to the swarmkit project README, the only available strategy is spread, so it schedules tasks on the least loaded nodes.
Note that the swarm won't move tasks around to maintain this strategy, so if you added node02 after node01 was full, then node02 will remain mostly empty. You could drain both nodes and then activate them again to see if the load distributes better.
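A hedged sketch of that drain/activate cycle, plus forcing a service to redeploy so the spread scheduler can re-place its tasks (my_service is a placeholder):
docker node update --availability drain docker-node01    # existing tasks are rescheduled elsewhere
docker node update --availability active docker-node01   # node01 accepts new tasks again
docker service update --force my_service                 # redeploys the tasks, giving the scheduler a chance to re-spread them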
You can find a more detailed description of the scheduling algorithm in the project documentation: scheduling-algorithm
For the older swarm manager this attribute was configurable:
https://docs.docker.com/swarm/reference/manage/#--strategy--scheduler-placement-strategy
I also found https://docs.docker.com/swarm/scheduler/strategy/, which explains a lot about Docker Swarm strategies.

restore a docker swarm

Let's say we have swarm1 (1 manager and 2 workers). I am going to back up this swarm on a daily basis, so that if there is a problem some day, I can restore the whole swarm to a new one (swarm2 = 1 manager and 2 workers as well).
I followed what is described here, but it seems that while restoring, the new manager gets the same token as the old manager; as a result, the 2 workers get disconnected and I end up with a new swarm2 with 1 manager and 0 workers.
Any ideas / solution?
I don't recommend restoring workers. Assuming you've only lost your single manager, just docker swarm leave on the workers, then join again. Then on the manager you can always clean up the old workers later (it does not affect uptime) with docker node rm.
Note that if you lose manager quorum, this doesn't mean the apps you're running go down, so you'll want to keep your workers up and serving your apps to your users until you fix your manager.
If your last manager fails or you lose quorum, then focus on restoring the raft DB so the swarm manager has quorum again. Then rejoin workers, or create new workers in parallel and only shut down the old workers when the new ones are running your app. Here's a great talk by Laura Frank that goes into it at DockerCon.
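A minimal sketch of that rejoin dance once a manager is back (the token and IP are placeholders printed by the join-token command):
docker swarm join-token worker                                # on the restored manager: prints the join command
docker swarm leave                                            # on each disconnected worker
docker swarm join --token <worker-token> <manager-ip>:2377    # on each worker
docker node rm <stale-node-id>                                # on the manager: remove the old, now-down entries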

What happens to docker service configs or secrets if a swarm is completely stopped?

I'm aware that service configs and secrets are stored in the Raft log and that this log is replicated to the other swarm managers, but what if the entire swarm is stopped? Is the Raft log persistent, or should you always keep local copies?
I eventually found out that if you back up the swarm, you should be able to recover as detailed in the documentation:
Back up the swarm
Docker manager nodes store the swarm state and manager logs in the /var/lib/docker/swarm/ directory. In 1.13 and higher, this data includes the keys used to encrypt the Raft logs. Without these keys, you will not be able to restore the swarm.
You can back up the swarm using any manager. Use the following procedure.
If the swarm has auto-lock enabled, you will need the unlock key in order to restore the swarm from backup. Retrieve the unlock key if necessary and store it in a safe location. If you are unsure, read Lock your swarm to protect its encryption key.
Stop Docker on the manager before backing up the data, so that no data is being changed during the backup. It is possible to take a backup while the manager is running (a “hot” backup), but this is not recommended and your results will be less predictable when restoring. While the manager is down, other nodes will continue generating swarm data that will not be part of this backup.
Note: Be sure to maintain the quorum of swarm managers. During the time that a manager is shut down, your swarm is more vulnerable to losing the quorum if further nodes are lost. The number of managers you run is a trade-off. If you regularly take down managers to do backups, consider running a 5-manager swarm, so that you can lose an additional manager while the backup is running, without disrupting your services.
Back up the entire /var/lib/docker/swarm directory.
Restart the manager.
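The restore side mirrors this; a hedged sketch on a fresh manager, assuming swarm-backup.tar.gz contains the swarm directory archived from /var/lib/docker as above:
systemctl stop docker
rm -rf /var/lib/docker/swarm && tar xzf swarm-backup.tar.gz -C /var/lib/docker
systemctl start docker
docker swarm init --force-new-cluster    # re-creates a single-manager cluster from the restored raft state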
