pgpool node detached for no reason? - network-programming

I have a two-node PostgreSQL cluster running on VMs where each VM runs both the pgpool service and a Postgres server.
The Postgres server crashed because of an insufficient memory configuration, so I bumped the VM memory and changed the Postgres memory settings in postgresql.conf. Since those memory changes, the slave pgpool node detaches every night at a specific time, yet the node_exporter metrics for CPU, load, processes, disk usage and memory show no spikes or sudden changes.
The slave node had detached before, but not day after day. I stumbled upon this thread and read this part of the documentation about failover. However, the Postgres server didn't crash, and the slave node kept serving existing connections (it just didn't take new ones), so network issues seemed irrelevant, especially after I asked our OPS team whether they had noticed any abnormal network or DNS activity that could explain this. Unfortunately, they found nothing of interest.
I have pg_exporter, postgres_exporter and node_exporter on each node to monitor the server and VM behavior. What should I be looking for to debug this? What should I ask our OPS team to check specifically? Our pgpool log file only states the failure to access the other node, with no exact reason. As the aforementioned docs say:
Pgpool-II does not distinguish each case and just decides that the
particular PostgreSQL node is not available if health check fails.
Could it still be a network/DNS issue? And if so, how would I confirm this?
Thanks for reading and taking the time to assist me with this conundrum.

That was interesting.
In short, it was caused by the OPS team's infrastructure backups.
Here is how the whole investigation went.
Setting the scene:
We run on-prem on top of a VMware vCenter cluster. On the infrastructure side, backups use VMware VM snapshots and Veeam VM backup, and the vmdk files/ESXi datastores reside on NFS-based NetApp storage.
When checking the node_exporter metrics in the Node Exporter Full dashboard, I saw network traffic drop to about 2 packets per second at most for about 5 to 15 minutes, consistently over the last few months and always around the same time late at night. The drops had become dramatically longer in the last month.
Rough illustration:
After I checked again with our OPS team, they indicated it could be related to the host configuration or the Veeam backups.
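To line up the quiet periods with a backup schedule, the low-traffic windows can be pulled out of exported packet-rate samples. The following is only a rough sketch: the `(timestamp, packets_per_second)` sample format and the function name are assumptions for illustration, and the thresholds simply mirror the numbers mentioned above (about 2 packets per second for at least 5 minutes).

```python
def find_low_traffic_windows(samples, threshold_pps=2.0, min_duration_s=300):
    """Return (start, end) timestamps of sustained low-traffic periods.

    samples: list of (unix_timestamp, packets_per_second), sorted by time.
    A window is reported only if the rate stayed at or below threshold_pps
    for at least min_duration_s seconds.
    """
    windows = []
    start = None  # first timestamp of the current low-traffic run
    last = None   # most recent timestamp of the current low-traffic run
    for ts, pps in samples:
        if pps <= threshold_pps:
            if start is None:
                start = ts
            last = ts
        else:
            if start is not None and last - start >= min_duration_s:
                windows.append((start, last))
            start = None
    # Close a run that lasts until the end of the data.
    if start is not None and last - start >= min_duration_s:
        windows.append((start, last))
    return windows


if __name__ == "__main__":
    # One sample per minute; the rate collapses between t=120 and t=420.
    rates = [100, 100, 1, 1, 1, 1, 1, 1, 100]
    samples = [(i * 60, r) for i, r in enumerate(rates)]
    print(find_low_traffic_windows(samples))
```

Comparing the reported windows across several nights against the backup job times is what makes the correlation visible.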
It turns out that the storage for the VMs (including the one that runs the Veeam backup) is attached over the network rather than directly to the ESXi hosts, and the final snapshot is saved/consolidated at that late-night time, which matches this line from the question:
node detaches every night at a specific time
The way NFS handles disk locking (limiting IOPS to existing data), combined with the high IOPS requirements of the Veeam backup, causes the server to hang/freeze and, on rare occasions, even causes a VM restart. Here's the quote from the Veeam issue doc:
The snapshot removal process significantly lowers the total IOPS that can be delivered by the VM because of additional locks on the VMFS storage due to the increase in metadata updates
the snapshot removal process will easily push that into the 80%+ mark and likely much higher. Most storage arrays will see a significant latency penalty once IOP's get into the 80%+ mark which will of course be detrimental to application performance.
This issue occurs when the target virtual machine and the backup appliance [proxy] reside on two different hosts, and the NFSv3 protocol is used to mount NFS datastores. A limitation in the NFSv3 locking method causes a lock timeout, which pauses the virtual machine being backed up [during snapshot removal].
Obviously, that would interfere with Postgres at the very least, especially when it is configured as a cluster with replication, which requires a near-constant connection between the Postgres servers. There is a similar thread on SO involving a different DB server.
A solution is suggested in this link, including a fix for the issue described in the last quote. For the time being, though, we have stopped using Veeam backup for sensitive VMs until the solution can be verified locally (I will update this answer if we try it and it fixes the issue).
Additional documentation of similar incidents: a similar issue case, suggested solution info from Veeam, a third-party site solution (a temporary fix for the same problem, as I see it), and a Reddit thread acknowledging the issue and suggesting options.
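One way to check from inside the guest whether the VM is actually being paused, as the quotes above describe, is to record a heartbeat at a fixed interval and look for gaps. This is a minimal sketch, not something from the original investigation: the function names are made up for illustration, and it assumes the guest's wall clock jumps forward after a pause (which depends on the hypervisor's time synchronization settings).

```python
import time


def find_stalls(timestamps, interval_s=1.0, tolerance_s=5.0):
    """Return (gap_start, gap_length) for every suspiciously long gap.

    timestamps: heartbeat times in seconds, in the order they were recorded.
    A gap counts as a stall when it exceeds interval_s + tolerance_s.
    """
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > interval_s + tolerance_s:
            stalls.append((prev, gap))
    return stalls


def record_heartbeats(count, interval_s=1.0):
    """Collect `count` wall-clock timestamps, sleeping interval_s between them."""
    beats = []
    for _ in range(count):
        beats.append(time.time())
        time.sleep(interval_s)
    return beats


if __name__ == "__main__":
    # Short demo run; in practice this would run across the backup window.
    beats = record_heartbeats(count=5, interval_s=1.0)
    print(find_stalls(beats))
```

A stall whose start time matches the pgpool detach time would point at a frozen VM rather than at the network or DNS.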

Related

Trying to understand and debug ETCDv2 Memory usage

I’m trying to understand ETCD’s memory and disk usage within a deployed system using the ETCDv2 API. The system has a file being saved on a regular basis, each time under a new key, and we’re concerned that long-term there’s no clean-up of state leading to both memory and disk usage growing unbounded on each VM in the etcd cluster. We’ve also emulated this, using a large file (several MB) being saved every few minutes.
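To put a rough number on the growth we are worried about, here is a back-of-the-envelope calculation. The 5 MB file size and 5 minute interval are assumed example values standing in for "several MB" and "every few minutes", not measurements from our system.

```python
def daily_growth_mb(file_size_mb, interval_minutes):
    """Data written per day if every write is kept under a new key."""
    writes_per_day = 24 * 60 / interval_minutes
    return file_size_mb * writes_per_day


if __name__ == "__main__":
    # Assumed example: a 5 MB file saved every 5 minutes.
    print(daily_growth_mb(5, 5), "MB per day")
```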
From the etcd docs, I expected the following:
Each insertion would save the file to disk, causing disk usage to grow unbounded.
This matches what I am seeing.
In memory, etcd would save a key-value pair where the value is a lookup address for the file on disk (taking up a very small amount of memory) and a cached version of the file (taking a large amount of memory).
I would then expect that rebooting an etcd pod after several file writes would cause the cache to be (mostly) cleared, meaning a consistently up pod would have memory growing unbounded but if the pod rebooted, the cache would be cleared of all but the active entry (and any specifically requested by e.g. attempted rollbacks) and the memory usage would (mostly) reset with each reboot.
However, in practice we see a very small memory drop with a reboot which is almost immediately returned after the pod recovers (as though all the cache is restored from the peers).
Is my understanding correct? And if so:
Why does the memory usage not reset after an etcd pod reboot? Does the etcd cache get synced with its cluster, as well as the main key-value table and file storage?
Is there a recommended way to keep etcd’s memory and disk usage within bounded limits?
Additional notes:
I’ve tried reducing the snapshot_count configuration setting - this doesn’t seem to have had any impact (unless I’ve reduced it too far - I cut it right down to 5 from the default of 100,000).
I’ve attempted changing our file saving to overwrite a single file with a new version each time, instead of storing a new file. This doesn’t appear to have had any impact (although this may be due to issues in my prototype; I’m still investigating).
We can’t migrate existing deployments to etcd v3 file-systems, so are specifically looking at etcd v2 solutions. I think this rules out compact and defrag steps, which seem to be a core part of the answer to this problem in v3.
Any help or insight very gratefully appreciated.
Thanks!

Why should I run multiple elasticsearch nodes on a single docker host?

There are a lot of articles online about running an Elasticsearch multi-node cluster using docker-compose, including the official documentation for Elasticsearch 8.0. However, I cannot find a reason why you would set up multiple nodes on the same docker host. Is this the recommended setup for a production environment? Or is it an example of theory in practice?
You shouldn't consider this a production environment. The guides are examples, often for lab environments and for testing scenarios with the application. I would not consider them production ready, and compose is often not considered a production-grade tool, since everything it does targets a single Docker node, whereas in production you typically want multiple nodes spread across multiple availability zones.
Since the heap of a single ES node should never be given more than half of the available memory (and should stay below ~30.5GB), one reason it makes sense to have several nodes on a given host is when you have hosts with ample memory (say 128GB+). In that case you could run 2 ES nodes (with 64GB of memory each, 30.5GB heap and the rest for Lucene) on the same host by correctly constraining each Docker container.
Note that the above is not related to Docker, you can always configure several nodes per host, whether Docker or not.
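The sizing rule from this answer can be written down as a small calculation. This is just a sketch of the arithmetic: the function name and the returned field names are made up for illustration, and the 30.5GB ceiling is the approximate figure quoted above.

```python
MAX_HEAP_GB = 30.5  # approximate heap ceiling quoted in the answer


def plan_nodes(host_memory_gb, node_count):
    """Split host memory evenly across nodes and size each node's heap.

    Heap is half of the node's memory, capped at MAX_HEAP_GB.
    Whatever is left stays available to Lucene via the filesystem cache.
    """
    per_node_gb = host_memory_gb / node_count
    heap_gb = min(per_node_gb / 2, MAX_HEAP_GB)
    return {
        "container_memory_gb": per_node_gb,
        "heap_gb": heap_gb,
        "filesystem_cache_gb": per_node_gb - heap_gb,
    }


if __name__ == "__main__":
    # The 128GB host with 2 nodes from the example above.
    print(plan_nodes(128, 2))
```

With a single node on the same 128GB host, the heap would still be capped at about 30.5GB, which is why splitting the host into two nodes lets more of the memory be used as heap.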
Regarding production and given the fact that 2+ nodes would run on the same host, if you lose that host, you lose two nodes, which is not good. However, depending on how many hosts you have, it might be a lesser problem, if and only if, each host is in a different availability zone and you have the appropriate cluster/shard allocation awareness settings configured, which would ensure that your data is redundantly copied in 2+ availability zones. In this case, losing a host (2 nodes) would still keep your cluster running, although in degraded mode.
It's worth noting that Elastic Cloud Enterprise (which powers Elastic Cloud) is designed to run several nodes per host, depending on the sizing of the nodes and the available hardware. You can find more info on hardware prerequisites as well as how medium and large scale deployments make use of one or more large 256GB hosts per availability zone.

Neo4J load distribution

We have Neo4j configured as a causal cluster on a Kubernetes-based setup. Each component is deployed on its own machine of size t2.xlarge on AWS, and we use pod affinity to schedule the deployments. While running the application under stress, we observed considerable system load on only one machine. For example, see this:
First Neo4j machine for core:
and second machine for core in the same cluster:
We have the bolt+routing protocol configured in the backend. I am not sure what is causing this much resource utilization on one machine while the others do minimal work.
I checked memory consumption at the pod level as well. Neo4j-1 takes 9GB of memory, whereas the others take around 4GB. So my question is: is this expected behavior?
I am posting my findings as an answer, though I am not sure they are correct. I checked the status of each instance in Neo4j using the monitoring APIs:
http://localhost:7474/db/manage/server/core/writable
This returns true for the leader and false for followers. That could be the reason the leader uses a lot of resources while the others don't. Ref link: https://neo4j.com/docs/operations-manual/current/monitoring/causal-cluster/
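The same check can be scripted across all core members to see which one is currently the leader. This is a rough sketch under a few assumptions: the host names are hypothetical, the endpoint is assumed to be reachable without authentication, and the follower response is assumed to possibly arrive with a non-200 status, so the body of an HTTP error is read as well.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

WRITABLE_PATH = "/db/manage/server/core/writable"


def fetch_writable(base_url, timeout=5):
    """Return True if the instance at base_url reports itself as writable."""
    url = base_url.rstrip("/") + WRITABLE_PATH
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode()
    except HTTPError as err:
        # Assumption: followers may answer with an error status and "false".
        body = err.read().decode()
    return body.strip().lower() == "true"


def find_leaders(base_urls, fetch=fetch_writable):
    """Return the base URLs whose instance reports itself as writable."""
    return [url for url in base_urls if fetch(url)]


if __name__ == "__main__":
    # Hypothetical pod addresses; replace with the real core members.
    cores = ["http://neo4j-0:7474", "http://neo4j-1:7474", "http://neo4j-2:7474"]
    print(find_leaders(cores))
```

If the instance that reports itself as writable is also the one with the high load, that would support the explanation above.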

Use Docker to keep track of software versions/installations?

I have a data processing application which is updated on a regular basis. This application has a bunch of dependencies which are also updated every now and then. However, different versions of the software (+dependencies) might produce different results (this is expected). The application is run on a remote computer and can be accessed through a Web page. Every time the user uses the Web page to do some processing, she/he also chooses which version of the software to use.
Now I am trying to decide on the best way of keeping track of the different software (+dependencies) versions. The simplest way, of course, is to compile and install each version of my software and its dependencies in a different folder, and then select the appropriate folder based on the request the user sends. However, this sounds very clunky to me. So I thought I could use Docker to keep track of the different software versions. Do you think that is a good idea? If yes, what is most appropriate to do every time I have a new version of the software (and/or dependencies): 1) Create a new container from scratch with the new version (and end up having multiple containers), or 2) Update the existing container and commit the changes? (I suppose I can access the older commits of the container, right?)
PS: Keep in mind that the reason I looked into Docker and not a simple virtual machine solution is that the application I am running is a high-performance GPU-based software.
Docker is a reasonable choice. Your repository would contain all of the app versions you wish to publish. Note, you will only realize savings if you organize the resulting app filesystem into layers, of which the lower layers are the least likely to change between versions. This will keep the storage requirements at a minimum.
Then you have to decide how you will process each job. A robust (but complex) solution would be to have one or more API containers which take in processing jobs from your user and "dole" them out to worker containers (one or more from each release version). This would provide the lowest response latency and be non-blocking. You can look at different service discovery models to see how your "worker" containers can register with your "manager" containers. This is probably more than you'd like to bite off, but consider using a good key-value database (another container!) like etcd or a 3rd party service discovery tool like zookeeper/eureka/consul.
A much simpler model would have a single API container with one each of the release containers created, but not started. The API container would start, direct, and then stop the appropriate release container. You would incur the startup latency, but this is the least resource intensive... and easiest to manage. But this is a blocking operation.
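As a minimal sketch of this start, direct, stop cycle, the API container could shell out to the Docker CLI. The container name and the `submit_job` callable are hypothetical placeholders for however the job is actually handed to the release container; only `docker start` and `docker stop` are real commands here.

```python
import subprocess


def run_release_job(container, submit_job, run=subprocess.run):
    """Start a pre-created release container, run one job, then stop it.

    container:  name of a container that was created but not started.
    submit_job: hypothetical callable that sends the job to the container
                and returns its result.
    run:        command runner, injectable so the flow can be tested
                without Docker installed.
    """
    run(["docker", "start", container], check=True)
    try:
        return submit_job(container)
    finally:
        # Always stop the container, even if the job fails.
        run(["docker", "stop", container], check=True)


if __name__ == "__main__":
    # Dry run that only prints the commands instead of executing them.
    def print_only(cmd, check=False):
        print(" ".join(cmd))

    print(run_release_job("myapp-v1.2", lambda name: "job done on " + name, run=print_only))
```

The call blocks until the job has finished and the container has stopped, which is the blocking behaviour mentioned above.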
Somewhere in the middle, but less user friendly, is to have each release container running but listening on a different host port (the app always sees the same port). The user would connect to the port which is servicing the desired release of the app. You'd have to provide some sort of index to make this useful.
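The index for this port-per-release model can be as simple as a lookup table. The versions, ports and host below are invented purely for illustration.

```python
# Hypothetical mapping of release version to the host port it listens on.
RELEASE_PORTS = {
    "1.0": 8001,
    "1.1": 8002,
    "2.0": 8003,
}


def url_for_release(version, host="localhost"):
    """Return the URL serving the requested release, or raise ValueError."""
    try:
        port = RELEASE_PORTS[version]
    except KeyError:
        raise ValueError(f"unknown release {version!r}") from None
    return f"http://{host}:{port}/"


if __name__ == "__main__":
    print(url_for_release("1.1"))
```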

Why is membase server so slow in response time?

I have a problem: Membase is very slow in my environment.
I am running several production servers (Passenger) on Rails 2.3.10 and Ruby 1.8.7.
Those servers communicate with 2 Membase machines in a cluster.
The Membase machines each have 64G of memory and a 100G EBS volume attached, with 1G of swap.
My problem is that Membase is VERY slow in response time and is currently the slowest part of the whole application lifecycle.
My question is: why?
The Rails gem I am using is memcache-northscale.
The Membase server is 1.7.1 (latest).
The server is doing between 2K and 7K ops per second (for the cluster).
The response time from Membase (based on NewRelic) is 250ms on average, which is HUGE and unreasonable.
Does anybody know why this is happening?
What can I do in order to improve this time?
It's hard to immediately say with the data at hand, but I think I have a few things you may wish to dig into to narrow down where the issue may be.
First of all, do your stats with membase show a significant number of background fetches? This is in the Web UI statistics for "disk reads per second". If so, that's the likely culprit for the higher latencies.
You can read more about the statistics and sizing in the manual, particularly the sections on statistics and cluster design considerations.
Second, you're reporting 250ms on average. Is this a sliding average, or overall? Do you have something like max 90th or max 99th latencies? Some outlying disk fetches can give you a large average, when most requests (for example, those from RAM that don't need disk fetches) are actually quite speedy.
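To illustrate how a handful of slow disk fetches can dominate an average, here is a small sketch with made-up numbers: 95 requests served from RAM in 2ms and 5 outliers at 5000ms. The percentile helper uses the simple nearest-rank method and is not tied to any monitoring tool.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Return the mean next to a few percentiles for comparison."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p90": percentile(latencies_ms, 90),
        "p99": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # Made-up sample: mostly fast RAM hits plus a few very slow disk fetches.
    sample = [2] * 95 + [5000] * 5
    print(summarize(sample))
```

In this invented sample the mean lands around 250ms even though 90% of the requests finish in 2ms, which is why the percentile figures are worth asking for.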
Are your systems spread throughout availability zones? What kind of instances are you using? Are the clients and servers in the same Amazon AWS region? I suspect the answer may be "yes" to the first, which means about 1.5ms overhead when using xlarge instances from recent measurements. This can matter if you're doing a lot of fetches synchronously and in serial in a given method.
I expect it's all in one region, but it's worth double checking since those latencies sound like WAN latencies.
Finally, there is an updated Ruby gem, backwards compatible with Fauna. Couchbase, Inc. has been working to add back to Fauna upstream. If possible, you may want to try the gem referenced here:
http://www.couchbase.org/code/couchbase/ruby/2.0.0
You will also want to look at running Moxi on the client side. To access Membase, you need to go through a proxy (called Moxi). By default, it's installed on the server, which means you might make a request to one of the servers that doesn't actually have the key. Moxi will go get it... but then you're doubling the network traffic.
Installing Moxi on the client-side will eliminate this extra network traffic: http://www.couchbase.org/wiki/display/membase/Moxi
Perry

Resources