EC2 CloudWatch memory metrics don't match what top shows

I have a t2.micro EC2 instance running at about 2% CPU. I know from other posts that the CPU usage shown in top differs from the CPU reported in CloudWatch, and that the CloudWatch value should be trusted.
However, I'm seeing very different values for memory usage between top, CloudWatch, and NewRelic.
There's 1 GB of RAM on the instance, and top shows ~300 MB of Apache processes plus ~100 MB of other processes. The overall memory usage reported by top is 800 MB. I guess there's 400 MB of OS/system overhead?
However, CloudWatch reports 700 MB of usage, and NewRelic reports 200 MB (even though NewRelic shows 300 MB of Apache processes elsewhere, so I'm ignoring its numbers).
The CloudWatch memory metric often goes over 80%, and I'd like to know what the actual value is, so I know when to scale if necessary, or how to reduce memory usage.
Here's the recent memory profile; it seems something is using more memory over time (the big dips are either an Apache restart, or perhaps GC?).
[Screenshot: memory usage over the last 12 days]

AWS doesn't provide memory metrics for EC2 instances out of the box. Because Amazon does all of its monitoring from outside the instance (at the hypervisor level), it cannot see memory usage inside the instance. For complete monitoring, though, you do need memory utilization statistics alongside CPU utilization and network I/O.
However, you can use CloudWatch's custom metrics feature to push instance-level or application-level data to CloudWatch and monitor it with the usual Amazon tools.
You can follow this blog post for more details: http://upaang-saxena.strikingly.com/blog/adding-ec2-memory-metrics-to-aws-cloudwatch
Set up a cron job on the instance at a 5-minute interval, and all the data points will show up in CloudWatch; a sketch of such a job is below.
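As a rough sketch of what that cron job could do, assuming boto3 is installed, the instance role allows cloudwatch:PutMetricData, and the instance metadata service is reachable without a token; the namespace and metric name are arbitrary example choices, not anything AWS defines:

    #!/usr/bin/env python3
    # Sketch: push memory utilization from /proc/meminfo to CloudWatch as a custom metric.
    # Assumes boto3 is installed and the instance role allows cloudwatch:PutMetricData.
    import boto3
    import urllib.request

    def meminfo():
        """Parse /proc/meminfo into a dict of kB values."""
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.strip().split()[0])  # value in kB
        return info

    def instance_id():
        """Look up the instance ID from the EC2 instance metadata service (IMDSv1)."""
        url = "http://169.254.169.254/latest/meta-data/instance-id"
        return urllib.request.urlopen(url, timeout=2).read().decode()

    m = meminfo()
    # MemAvailable requires a reasonably recent kernel.
    used_percent = 100.0 * (m["MemTotal"] - m["MemAvailable"]) / m["MemTotal"]

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Custom/System",  # arbitrary example namespace
        MetricData=[{
            "MetricName": "MemoryUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id()}],
            "Value": used_percent,
            "Unit": "Percent",
        }],
    )

A crontab entry along the lines of */5 * * * * /usr/local/bin/push_mem.py would then give you one data point every five minutes.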

CloudWatch doesn't actually provide memory usage metrics for EC2 instances out of the box; you can confirm this here.
The MemoryUtilization metric that you are referring to is therefore a custom metric, pushed by something you have configured or by some application running on your instance.
So you need to determine what is actually pushing the data for this metric; that data source is either pushing the wrong value or is unreliable.
The behavior you are seeing is not a CloudWatch problem.
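If you're not sure what is publishing it, one way to narrow it down is to list every metric with that name and look at the namespaces and dimensions it appears under (a sketch using boto3; "MemoryUtilization" is the metric name mentioned above):

    # Sketch: find which namespaces/dimensions report a "MemoryUtilization" metric,
    # to help identify the agent or script that is publishing it.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    paginator = cloudwatch.get_paginator("list_metrics")

    for page in paginator.paginate(MetricName="MemoryUtilization"):
        for metric in page["Metrics"]:
            dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
            print(metric["Namespace"], dims)

The namespace usually tells you which tool is responsible; for example, the old Amazon CloudWatch Monitoring Scripts publish under the System/Linux namespace, while other agents use their own namespaces.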

Related

AWS CloudWatch CPUUtilization vs NET-SNMP discrepancies

For a few weeks now, we've noticed that our ScienceLogic monitoring platform, which uses SNMP, is unable to detect CPU spikes that CloudWatch alarms are picking up. Both platforms are configured to poll every 5 minutes, but SL1 is not seeing any CPU spikes above ~20%. Our CloudWatch CPU alarm is set to fire at 90%, and it has gone off twice in the past 12 hours for this EC2 instance. I'm struggling to understand why.
I know that CloudWatch pulls the CPUUtilization metric directly from the hypervisor, but I can't imagine it would differ so much from the CPU percentage captured by SNMP. Anyone have any thoughts? I wonder if I'm just seeing a scaling issue in SNMP?
[Screenshots: SL1 and CloudWatch CPU graphs]
I tried contacting ScienceLogic, and they asked me for the "formula" that AWS uses to capture this metric; I'm not really sure I understand the question, lol.
Monitors running inside versus outside a compute unit (here, a virtual machine) can observe different results, which I think is fairly normal.
The SNMP agent runs inside the VM, so its own execution is heavily affected by high-CPU events (its threads get blocked). You've probably seen the same thing on a personal computer: when one application consumes all the CPU, other applications become slow and unresponsive, and the agent doing the measuring is no exception.
CloudWatch's measurement, on the other hand, happens outside the VM at the hypervisor level, so it is almost never affected by events inside the VM.
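As a sanity check against the SL1 graph, you could pull both the Average and Maximum CPUUtilization from CloudWatch at the same 5-minute period the SNMP poller uses (a sketch with boto3; the instance ID is a placeholder):

    # Sketch: pull CloudWatch CPUUtilization for one instance over the last 12 hours,
    # at a 5-minute period, to compare against the SNMP-based graph.
    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)

    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        StartTime=end - timedelta(hours=12),
        EndTime=end,
        Period=300,  # 5 minutes, matching the SNMP poll interval
        Statistics=["Average", "Maximum"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], f"avg={point['Average']:.1f}%", f"max={point['Maximum']:.1f}%")

If the Maximum series shows the spikes but the Average over the same interval is much lower, part of the discrepancy may simply be that the two tools aggregate the same 5-minute window differently.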

Monitor Google Cloud Run memory usage

Is there any built-in way to monitor memory usage of an application running in managed Google Cloud Run instances?
In the "Metrics" page of a managed Cloud Run service, there is an item called "Container Memory Allocation". However, as far as I understand it, this graph refers to the instance's maximum allocated memory (chosen in the settings), and not to the memory actually used inside the container. (Please correct me if I'm wrong.)
In the Stackdriver Monitoring list of available metrics for managed Cloud Run ( https://cloud.google.com/monitoring/api/metrics_gcp#gcp-run ), there also doesn't seem to be any metric related to the memory usage, only to allocated memory.
Thank you in advance.
Cloud Run now exposes a new metric named "Memory Utilization" in Cloud Monitoring; see more details here.
This metric captures the distribution of container memory utilization across all container instances of the revision. It is recommended to look at the percentiles of this metric (50th, 95th, and 99th) to understand how utilized your instances are.
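If you want those percentiles programmatically rather than in the console, something like the following could work (a sketch; it assumes the google-cloud-monitoring client library and that the metric type is run.googleapis.com/container/memory/utilizations, which is how it appears in the Cloud Run metric list; the project ID is a placeholder):

    # Sketch: query the 95th percentile of Cloud Run memory utilization via the
    # Cloud Monitoring API, aligned to 5-minute windows over the last hour.
    import time
    from google.cloud import monitoring_v3

    project_id = "my-project"  # placeholder
    client = monitoring_v3.MetricServiceClient()

    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
    )
    aggregation = monitoring_v3.Aggregation(
        {
            "alignment_period": {"seconds": 300},
            "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_PERCENTILE_95,
        }
    )

    results = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": 'metric.type = "run.googleapis.com/container/memory/utilizations"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
            "aggregation": aggregation,
        }
    )

    for series in results:
        for point in series.points:
            print(point.interval.end_time, point.value.double_value)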
Currently, there seems to be no way to monitor the memory usage of a Google Cloud Run instance through Stackdriver or on the "Cloud Run" page in the Google Cloud Console.
I have filed a feature request on your behalf to add memory usage metrics to Cloud Run. You can see and track this feature request at the following link.
There is not currently a metric on memory utilization. However, if your service reaches a memory limit, the following log will appear in Stackdriver Logging with ERROR-level severity:
"Memory limit of 256M exceeded with 325M used. Consider increasing the memory limit, see https://cloud.google.com/run/docs/configuring/memory-limits"
(Replace specific numbers accordingly.)
Based on this log message, you could create a logs-based metric that counts memory-limit-exceeded events.
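For example, something along these lines could create such a logs-based metric with the google-cloud-logging client library (a sketch; the metric name and the exact filter string are illustrative, so adjust the filter to match the log text you actually see):

    # Sketch: create a logs-based metric that counts Cloud Run memory-limit-exceeded
    # log entries. The metric name and filter are illustrative choices.
    from google.cloud import logging

    client = logging.Client()
    metric = client.metric(
        "cloud_run_memory_limit_exceeded",
        filter_=(
            'resource.type="cloud_run_revision" '
            'AND severity=ERROR '
            'AND textPayload:"Memory limit of"'
        ),
        description="Counts Cloud Run requests that exceeded the container memory limit",
    )
    if not metric.exists():
        metric.create()

You could then alert on that metric in Cloud Monitoring whenever its count rises above zero.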

AWS server became slow after traffic increase

I have a single-page Angular app that makes requests to a Rails API service. Both are running on a t2.2xlarge Ubuntu instance. I am using a Postgres database.
We had an increase in traffic, and my Rails API became slow. Sometimes I get an error saying the Passenger queue is full for the Rails application.
Auto scaling is working; three more instances were created. But I cannot trace this issue. I would need root access to upgrade, which I do not have. Please help me with this.
You mentioned that you are using the t2.2xlarge instance type. First, I want to point out that you should be careful using the T2 family for a production environment, because T2 instances use CPU credits. Let's take a look at what the documentation says:
What happens if I use all of my credits?
If your instance uses all of its CPU credit balance, performance remains at the baseline performance level. If your instance is running low on credits, your instance's CPU credit consumption (and therefore CPU performance) is gradually lowered to the base performance level over a 15-minute interval, so you will not experience a sharp performance drop-off when your CPU credits are depleted. If your instance consistently uses all of its CPU credit balance, we recommend a larger T2 size or a fixed performance instance type such as M3 or C3.
I'm not certain you are actually running out of CPU credits, since you are using a 2xlarge size, but I think you should consider a fixed-performance instance type, as the instance's performance may be one part of your problem. Use CloudWatch to monitor two metrics, CPUCreditUsage and CPUCreditBalance, to confirm it (see the sketch below).
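A minimal sketch of that check with boto3 (the instance ID is a placeholder):

    # Sketch: check whether the instance has been running out of CPU credits.
    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)

    for metric_name in ("CPUCreditUsage", "CPUCreditBalance"):
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName=metric_name,
            Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
            StartTime=end - timedelta(days=1),
            EndTime=end,
            Period=3600,  # hourly points over the last day
            Statistics=["Average"],
        )
        print(metric_name)
        for p in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
            print(" ", p["Timestamp"], round(p["Average"], 2))

If CPUCreditBalance keeps trending toward zero while traffic is high, the instance is being throttled to its baseline, which matches the suggestion above to move to a fixed-performance type.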
Secondly, how about your Auto Scaling group? After the scale-out, did your service become stable? If so, you may not need to worry about this problem any more, because the ASG did its job.
Please check the following:
If you are opening a connection to the database, make sure you close it.
If you are using jQuery, Bootstrap, DataTables, or other CSS/JS libraries, use CDN links like:
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/bootstrap-select/1.12.4/css/bootstrap-select.min.css">
This will reduce a good amount of load on your server. Don't serve jQuery or other external libraries from your own server when you can fetch them directly from a CDN.
There are a number of factors that can cause an EC2 instance (or any system) to appear to run slowly.
CPU Usage. The higher the CPU usage, the longer it takes to process new threads and processes.
Free Memory. Your system needs free memory to process threads, create new processes, etc. How much free memory do you have?
Free Disk Space. Operating systems tend to thrash when the file systems on system drives run low on free disk space. How much free disk space do you have?
Network Bandwidth. What are the average bytes in/out for your instance?
Database. Monitor connections, free memory, disk bandwidth, etc.
Amazon has CloudWatch, which can provide monitoring for everything except free disk space (you can add an agent to your instance for that metric). This will also help you quickly see what is happening with your instances.
Monitor your EC2 instances and your database.
You mention T2 instances. These have burstable CPUs, which means that if you consistently have higher CPU usage, you will want to switch to fixed-performance EC2 instances. CloudWatch should help you figure out what you need (CPU, memory, disk, or network performance).
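If you do have shell access to the instance, a quick snapshot of several of the factors above can be taken with psutil (a sketch; psutil needs to be installed, and "/" is assumed to be the main filesystem):

    # Sketch: one-shot snapshot of CPU, memory, disk, and network counters on the host.
    import psutil

    cpu = psutil.cpu_percent(interval=1)   # percent CPU over a 1-second sample
    mem = psutil.virtual_memory()          # total/available/percent used
    disk = psutil.disk_usage("/")          # free space on the root filesystem
    net = psutil.net_io_counters()         # bytes in/out since boot

    print(f"CPU: {cpu:.1f}%")
    print(f"Memory: {mem.percent:.1f}% used, {mem.available / 1e6:.0f} MB available")
    print(f"Disk /: {disk.percent:.1f}% used, {disk.free / 1e9:.1f} GB free")
    print(f"Network: {net.bytes_recv / 1e6:.0f} MB in, {net.bytes_sent / 1e6:.0f} MB out since boot")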
This may be largely independent of the AWS server itself. It looks like your software needs more resources (RAM, storage I/O, network) than one machine can provide. Evaluate the metrics using CloudWatch and adjust based on what the software actually requires.
Memory leaks or runaway processes could also lead to this. You may need a cluster or server farm to handle the load.
Hope it helps.

Choosing redis maxmemory size and BGSAVE memory usage

I am trying to find out what a safe setting for 'maxmemory' would be in the following situation:
write-heavy application
8GB RAM
let's assume other processes take up about 1GB
this means that the redis process' memory usage may never exceed 7GB
memory usage doubles on every BGSAVE event, because:
In the redis docs the following is said about the memory usage increasing on BGSAVE events:
If you are using Redis in a very write-heavy application, while saving an RDB file on disk or rewriting the AOF log Redis may use up to 2 times the memory normally used.
the maxmemory limit is roughly compared to 'used_memory' from redis-cli INFO (as is explained here) and does not take other memory used by redis into account
Am I correct that this means that the maxmemory setting should, in this situation, be set no higher than (8GB - 1GB) / 2 = 3.5GB?
If so, I will create a pull request for the redis docs to reflect this more clearly.
I would recommend a limit of 3 GB in this case. Yes, the docs are pretty much correct: running a BGSAVE can briefly double the memory requirements. However, I prefer to reserve 2 GB of memory for the system, or, for a master that persists to disk, to cap maxmemory at 40% of the machine's memory.
You indicate you have a very write-heavy application. In that case I would highly recommend having a second server do the save operations. I've found that during heavy writes combined with a BGSAVE, response times to clients can get high. It isn't Redis per se causing it, but the server itself, and this is especially true for virtual machines. Under this setup you would have the second server replicate from the primary and save to disk, while the first remains responsive.
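As a quick sanity check of the arithmetic, here is the question's "halve what's left" rule next to one reading of the 2 GB reserve and 40% rules of thumb above, using the numbers from the question (8 GB RAM, ~1 GB for other processes):

    # Sketch: compare the questioner's calculation with the answerer's rules of thumb.
    total_ram_gb = 8.0
    other_processes_gb = 1.0
    available_for_redis_gb = total_ram_gb - other_processes_gb   # 7 GB

    # Questioner's reasoning: BGSAVE can double usage, so halve the available memory.
    questioner_limit_gb = available_for_redis_gb / 2             # 3.5 GB

    # Rule of thumb #1: reserve 2 GB for the system, then halve the rest
    # to allow for the BGSAVE doubling.
    reserve_2gb_limit_gb = (total_ram_gb - 2.0) / 2              # 3.0 GB

    # Rule of thumb #2: cap a persisting master at 40% of total memory
    # (the doubling then peaks around 80%, leaving headroom for the OS).
    forty_percent_limit_gb = total_ram_gb * 0.40                 # 3.2 GB

    print(f"Questioner's limit: {questioner_limit_gb:.1f} GB")
    print(f"Reserve-2GB limit:  {reserve_2gb_limit_gb:.1f} GB")
    print(f"40%-of-RAM limit:   {forty_percent_limit_gb:.1f} GB")

The more conservative readings land around 3 GB, which is the recommendation above.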

Using the Web Console, Is it possible to Change the Membase Memory Quota?

Let's say I have a Membase server with a memory quota of 1GB. Is it possible to change it and make it larger, say 8GB, assuming the server gets a hardware upgrade?
Currently, I have the impression that Membase memory quota is unchangeable once set, which is very frustrating.
For 1.6.5 at least, you can change it from the web console: under Manage | Data Buckets, you can reset your RAM quota.
For a cluster it won't be that easy, though, because the same memory quota is required across all cluster nodes.
