I’m trying to understand ETCD’s memory and disk usage within a deployed system using the ETCDv2 API. The system has a file being saved on a regular basis, each time under a new key, and we’re concerned that long-term there’s no clean-up of state leading to both memory and disk usage growing unbounded on each VM in the etcd cluster. We’ve also emulated this, using a large file (several MB) being saved every few minutes.
From the etcd docs, I expected the following:
Each insertion would save the file to disk, causing disk usage to grow unbounded.
This matches what I am seeing.
In memory, etcd would save a key-value pair where the value is a lookup address for the file on disk (taking up a very small amount of memory) and a cached version of the file (taking a large amount of memory).
I would then expect that rebooting an etcd pod after several file writes would cause the cache to be (mostly) cleared, meaning a consistently up pod would have memory growing unbounded but if the pod rebooted, the cache would be cleared of all but the active entry (and any specifically requested by e.g. attempted rollbacks) and the memory usage would (mostly) reset with each reboot.
However, in practice we see a very small memory drop with a reboot which is almost immediately returned after the pod recovers (as though all the cache is restored from the peers).
Is my understanding correct? And if so:
Why does the memory usage reset fully after an etcd pod reboot? Does the etcd cache get synced with its cluster, as well as the main key-value table and file storage?
Is there a recommended way to keep etcd’s memory and disk usage within bounded limits?
Additional notes:
I’ve tried reducing the snapshot_count configuration setting - this doesn’t seem to have had any impact (unless I’ve reduced it too far - I cut it right down to 5 from the default of 100,000).
I’ve attempted changing our file saving to overwrite a single file with a new version each time, instead of storing a new file. This doesn’t appear to have had any impact (although this may be due to issues in my prototype; I’m still investigating).
We can’t migrate existing deployments to etcd v3 file-systems, so are specifically looking at etcd v2 solutions. I think this rules out compact and defrag steps, which seem to be a core part of the answer to this problem in v3.
Any help or insight very gratefully appreciated.
Thanks!
Related
I have a two-node PostgreSQL cluster running on VMs where each VM runs both the pgpool service and a Postgres server.
due to insufficient memory configuration the Postgres server crashed, so I've bumped the VM memory and the changed Postgres memory config in the postgresql.conf file. since that memory changes the slave pgpool node detaches every night at a specific time, though when looking at node_exporter metrics regarding CPU, load, processes disk usage or memory didn't show any spikes or sudden changes.
the slave node detaching happened before but not day after day. I've stumbled upon this thread and read this part of the documentation about the failover but Since the Postgres server didn't crash and existing connections to the slave node were working (it kept serving existing connections but didn't take new ones) so network issues seemed irrelevant, especially after consulting with our OPS team on whether they noticed any abnormal network or DNS activity that could explain that. Unfortunately, they didn't notice any interesting findings.
I have pg_exporter, postgres_exporter and node_exporter on each node to monitor the server and VM behavior, what should I be looking for to debug this? what should I ask of our OPS team to check specifically? our pgpool log file only states the failure to access the other node but no exact reason, as the aforementioned docs say:
Pgpool-II does not distinguish each case and just decides that the
particular PostgreSQL node is not available if health check fails.
could it still be a network\DNS issue? and if so. how would I confirm this?
thnx for reading and taking your time to assist me in this conundrum
that was interesting
If summing the gist of it,
it was part of the OPS team infrastructure backups
Now the entire process went like that:
setting the ambiance:
we run on-prem on top of VMWare vCenter cluster backing up on the infra side with VMWare VM snapshot and Veeamm VM backup where the vmdk files\ESXi datastores reside on a NetApp storage based on NFS.
when checking node exporter metrics in Node Exporter Full Dashboard I saw network traffic drop in the order of up to 2 packets per second for about 5 to 15 minutes consistently through the last few months, increasing dramatically in phenomenon length in the last month (around the same time late at night).
Rough illustration:
After checking again with our OPS team they indicated it could be the host configurations\Veeam Backups.
It turns out that because the storage for the VMs (including the one that runs the Veeam backup) is attached via network and not directly on the ESXi hosts, the final snapshot saved\consolidated at that late-night time -
node detaches every night at a specific time
With the way NFS works with disk locking (limiting IOPs to existing data) along with the high IOPs requirements from the Veeam backup cause the server to hang\freeze and sometimes on rare occasions even a VM restart. here's the quote from the Veeam issue doc:
The snapshot removal process significantly lowers the total IOPS that can be delivered by the VM because of additional locks on the VMFS storage due to the increase in metadata updates
the snapshot removal process will easily push that into the 80%+ mark and likely much higher. Most storage arrays will see a significant latency penalty once IOP's get into the 80%+ mark which will of course be detrimental to application performance.
This issue occurs when the target virtual machine and the backup appliance [proxy] reside on two different hosts, and the NFSv3 protocol is used to mount NFS datastores. A limitation in the NFSv3 locking method causes a lock timeout, which pauses the virtual machine being backed up [during snapshot removal].
Obviously, that would interfere at the very least with Postgres functionality especially configured as a cluster with replication that requires a near-constant connection between the Postgres servers. A similar thread on SO using a different DB server
a solution is suggested including solving the issue presented in the last quote in this link, though for the time being, we removed the usage of Veeam backup for sensitive VMs until the solution can be verified locally (will update in the future if tried and fixed issue)
additional incidents documentation: similar issue case, suggested solution info from Veeam, third party site solution (around the same problem as a temp fix as I see it), Reddit thread acknowledging the issue and suggesting options
I have a single page Angular app that makes request to a Rails API service. Both are running on a t2xlarge Ubuntu instance. I am using a Postgres database.
We had increase in traffic, and my Rails API became slow. Sometimes, I get an error saying Passenger queue full for rails application.
Auto scaling on the server is working; three more instances are created. But I cannot trace this issue. I need root access to upgrade, which I do not have. Please help me with this.
As you mentioned that you are using T2.2xlarge instance type. Firstly I want to tell you should not use T2 instance type for production environment. Cause of T2 instance uses CPU Credit. Lets take a look on this
What happens if I use all of my credits?
If your instance uses all of its CPU credit balance, performance
remains at the baseline performance level. If your instance is running
low on credits, your instance’s CPU credit consumption (and therefore
CPU performance) is gradually lowered to the base performance level
over a 15-minute interval, so you will not experience a sharp
performance drop-off when your CPU credits are depleted. If your
instance consistently uses all of its CPU credit balance, we recommend
a larger T2 size or a fixed performance instance type such as M3 or
C3.
Im not sure you won't face to the out of CPU Credit problem because you are using Xlarge type but I think you should use other fixed performance instance types. So instance's performace maybe one part of your problem. You should use cloudwatch to monitor on 2 metrics: CPUCreditUsage and CPUCreditBalance to make sure the problem.
Secondly, how about your ASG? After scale-out, did your service become stable? If so, I think you do not care about this problem any more because ASG did what it's reponsibility.
Please check the following
If you are opening a connection to Database, make sure you close it.
If you are using jquery, bootstrap, datatables, or other css libraries, use the CDN links like
<link rel="stylesheet" ref="https://cdnjs.cloudflare.com/ajax/libs/bootstrap-select/1.12.4/css/bootstrap-select.min.css">
it will reduce a great amount of load on your server. do not copy the jquery or other external libraries on your own server when you can directly fetch it from other servers.
There are a number of factors that can cause an EC2 instance (or any system) to appear to run slowly.
CPU Usage. The higher the CPU usage the longer to process new threads and processes.
Free Memory. Your system needs free memory to process threads, create new processes, etc. How much free memory do you have?
Free Disk Space. Operating systems tend to thrash when the file systems on system drives run low on free disk space. How much free disk space do you have?
Network Bandwidth. What is the average bytes in / out for your
instance?
Database. Monitor connections, free memory, disk bandwidth, etc.
Amazon has CloudWatch which can provide you with monitoring for everything except for free disk space (you can add an agent to your instance for this metric). This will also help you quickly see what is happening with your instances.
Monitor your EC2 instances and your database.
You mention T2 instances. These are burstable CPUs which means that if you have consistenly higher CPU usage, then you will want to switch to fixed performance EC2 instances. CloudWatch should help you figure out what you need (CPU or Memory or Disk or Network performance).
This is totally independent of AWS Server. Looks like your software needs more juice (RAM, StorageIO, Network) and it is not sufficient with one machine. You need to evaluate the metric using cloudwatch and adjust software needs based on what is required for the software.
It could be memory leaks or processing leaks that may lead to this as well. You need to create clusters or server farm to handle the load.
Hope it helps.
Environment:
Asp Net MVC app(.net framework 4.5.1) hosted on Azure app service with two instances.
App uses Azure SQL server database.
Also, app uses MemoryCache (System.Runtime.Caching) for caching purposes.
Recently, I noticed availability loss of the app. It happens almost every day.
Observations:
The memory counter Page Reads/sec was at a dangerous level (242) on instance RD0003FF1F6B1B. Any value over 200 can cause delays or failures for any app on that instance.
What 'The memory counter Page Reads/sec' means?
How to fix this issue?
What 'The memory counter Page Reads/sec' means?
We could get the answer from this blog. The recommended Page reads/sec value should be under 90. Higher values indicate insufficient memory and indexing issues.
“Page reads/sec indicates the number of physical database page reads that are issued per second. This statistic displays the total number of physical page reads across all databases. Because physical I/O is expensive, you may be able to minimize the cost, either by using a larger data cache, intelligent indexes, and more efficient queries, or by changing the database design.”
How to fix this issue?
Based on my experience, you could have a try to enable Local Cache in App
Service.
You enable Local Cache on a per-web-app basis by using this app setting: WEBSITE_LOCAL_CACHE_OPTION = Always
By default, the local cache size is 300 MB. This includes the /site and /siteextensions folders that are copied from the content store, as well as any locally created logs and data folders. To increase this limit, use the app setting WEBSITE_LOCAL_CACHE_SIZEINMB. You can increase the size up to 2 GB (2000 MB) per web app.
There is some memory performance problems can be listed
excessive paging,
memory shortages,
memory leaks
Memory counter values can be used to detect the presence of various performance problems. Tracking counter values both on a system-wide and a per-process basis helps you to pinpoint the cause in Azure such as in other systems.
Even if there is no change in the process, a change in the system can cause memory problems. the system-wide
researching in the azure:
Shared resources plans (Free and Basic) have memory limits as seen here: https://learn.microsoft.com/en-us/azure/azure-subscription-service-limits#app-service-limits.
Quotas: https://learn.microsoft.com/en-us/azure/app-service-web/web-sites-monitor
Also, you can check in the portal under your web app settings, search for “quotas”, and also check out “Diagnose and solve problems” and hit “metrics per instance (app service plan)” which will show you memory used for the plan.
A MemoryCache bug in .net 4 can also cause this type of behavior
https://stackoverflow.com/a/15715990/914284
I am new to mosquitto and have a few questions I hope you all can help me with:
what determines the limit size of the persistence file in mosquitto? Is it the system momory or disk space?
What happens when the persistence file gets larger than the limit size? Can I transfer it to another server for temporary storage?
How would mosquitto use the transferred file to publish messages when it restarts?
I appreciate any feedback.
Thanks,
Probably a combination of both Filesystem maximum filesize and system/process memory, which ever is smallest. But I would expect the performance problems that would be apparent before you reached these limits to be a bigger problem.
Mosquitto probably crashes. If mossquitto exceeds the system/process memory limits then it's going to get killed by the OS or crash instantly. I doubt there would be any benefit to moving it to a different machine as if mosquitto crashes due to hitting either of those limits the file is likely to be corrupted so unable to be read in even if restarted on the same machine.
See answer 2
In reality you should never come close to these limits, having that many inflight messages means there are some very SERIOUS issues with the design of your whole system.
I am trying to find out what a safe setting for 'maxmemory' would be in the following situation:
write-heavy application
8GB RAM
let's assume other processes take up about 1GB
this means that the redis process' memory usage may never exceed 7GB
memory usage doubles on every BGSAVE event, because:
In the redis docs the following is said about the memory usage increasing on BGSAVE events:
If you are using Redis in a very write-heavy application, while saving an RDB file on disk or rewriting the AOF log Redis may use up to 2 times the memory normally used.
the maxmemory limit is roughly compared to 'used_memory' from redis-cli INFO (as is explained here) and does not take other memory used by redis into account
Am I correct that this means that the maxmemory setting should, in this situation, be set no higher than (8GB - 1GB) / 2 = 3.5GB?
If so, I will create a pull request for the redis docs to reflect this more clearly.
I would recommend in this case a limit of 3GB. Yes, the docs are pretty much correct and running a bgsave will double for a short term the memory requirements. However, I prefer to reserve 2GB of memory for the system, or at a maximum for a persisting master 40% of maximum memory.
You indicate you have a very write heavy application. In this case I would highly recommend a second server do the save operations. I've found during high writes and a bgsave the response time to the client(s) can get high. It isn't Redis per se causing it, but the response of the server itself. This is especially true for virtual machines. Under this setup you would use the second server to slave from the primary and save to disk while the first remains responsive.