AWS CloudWatch CPUUtilization vs NETSNMP discrepancies - monitoring

For a few weeks now, we've noticed our ScienceLogic monitoring platform that uses SNMP is unable to detect CPU spikes that CloudWatch alarms are picking up. Both platforms are configured to poll every 5min, but SL1 is not seeing any CPU spikes more than ~20%. Our CloudWatch CPU alarm is set to fire off at 90%, which has gone off twice in the past 12 hours for this EC2 instance. I'm struggling to understand why?
I know that CloudWatch pulls the CPUUtilization metric direct from the hypervisor, but I can't imagine it would differ so much from the CPU percentage captured by SNMP. Anyone have any thoughts? I wonder if I'm just seeing a scaling issue in SNMP?
SL1:
CloudWatch:
I tried contacting Sciencelogic, and they asked me for the "formula" that AWS uses to capture this metric, which I'm not really sure I understand the question lol.

Monitors running inside/outside of a compute unit (here a virtual machine) can observe different results, which I think is rather normal.
The SNMP agent is inside the VM, so its code execution is affected heavily by high CPU events (threads blocked). You can recall similar high CPU events on your personal computer, where if an application consumed all CPU resources, other applications naturally became slow and not responsive.
While CloudWatch sensors are outside, which are almost never affected by the VM internal events.

Related

AWS server became slow after traffic increase

I have a single page Angular app that makes request to a Rails API service. Both are running on a t2xlarge Ubuntu instance. I am using a Postgres database.
We had increase in traffic, and my Rails API became slow. Sometimes, I get an error saying Passenger queue full for rails application.
Auto scaling on the server is working; three more instances are created. But I cannot trace this issue. I need root access to upgrade, which I do not have. Please help me with this.
As you mentioned that you are using T2.2xlarge instance type. Firstly I want to tell you should not use T2 instance type for production environment. Cause of T2 instance uses CPU Credit. Lets take a look on this
What happens if I use all of my credits?
If your instance uses all of its CPU credit balance, performance
remains at the baseline performance level. If your instance is running
low on credits, your instance’s CPU credit consumption (and therefore
CPU performance) is gradually lowered to the base performance level
over a 15-minute interval, so you will not experience a sharp
performance drop-off when your CPU credits are depleted. If your
instance consistently uses all of its CPU credit balance, we recommend
a larger T2 size or a fixed performance instance type such as M3 or
C3.
Im not sure you won't face to the out of CPU Credit problem because you are using Xlarge type but I think you should use other fixed performance instance types. So instance's performace maybe one part of your problem. You should use cloudwatch to monitor on 2 metrics: CPUCreditUsage and CPUCreditBalance to make sure the problem.
Secondly, how about your ASG? After scale-out, did your service become stable? If so, I think you do not care about this problem any more because ASG did what it's reponsibility.
Please check the following
If you are opening a connection to Database, make sure you close it.
If you are using jquery, bootstrap, datatables, or other css libraries, use the CDN links like
<link rel="stylesheet" ref="https://cdnjs.cloudflare.com/ajax/libs/bootstrap-select/1.12.4/css/bootstrap-select.min.css">
it will reduce a great amount of load on your server. do not copy the jquery or other external libraries on your own server when you can directly fetch it from other servers.
There are a number of factors that can cause an EC2 instance (or any system) to appear to run slowly.
CPU Usage. The higher the CPU usage the longer to process new threads and processes.
Free Memory. Your system needs free memory to process threads, create new processes, etc. How much free memory do you have?
Free Disk Space. Operating systems tend to thrash when the file systems on system drives run low on free disk space. How much free disk space do you have?
Network Bandwidth. What is the average bytes in / out for your
instance?
Database. Monitor connections, free memory, disk bandwidth, etc.
Amazon has CloudWatch which can provide you with monitoring for everything except for free disk space (you can add an agent to your instance for this metric). This will also help you quickly see what is happening with your instances.
Monitor your EC2 instances and your database.
You mention T2 instances. These are burstable CPUs which means that if you have consistenly higher CPU usage, then you will want to switch to fixed performance EC2 instances. CloudWatch should help you figure out what you need (CPU or Memory or Disk or Network performance).
This is totally independent of AWS Server. Looks like your software needs more juice (RAM, StorageIO, Network) and it is not sufficient with one machine. You need to evaluate the metric using cloudwatch and adjust software needs based on what is required for the software.
It could be memory leaks or processing leaks that may lead to this as well. You need to create clusters or server farm to handle the load.
Hope it helps.

EC2 CloudWatch memory metrics don't match what Top shows

I have a t2.micro EC2 instance, running at about 2% CPU. I know from other posts that the CPU usage shown in TOP is different to CPU reported in CloudWatch, and the CloudWatch value should be trusted.
However, I'm seeing very different values for Memory usage between TOP, CloudWatch, and NewRelic.
There's 1Gb of RAM on the instance, and TOP shows ~300Mb of Apache processes, plus ~100Mb of other processes. The overall memory usage reported by TOP is 800Mb. I guess there's 400Mb of OS/system overhead?
However, CloudWatch reports 700Mb of usage, and NewRelic reports 200Mb of usage (even though NewRelic reports 300Mb of Apache processes elsewhere, so I'm ignoring them).
The CloudWatch memory metric often goes over 80%, and I'd like to know what the actual value is, so I know when to scale if necessary, or how to reduce memory usage.
Here's the recent memory profile, seems something is using more memory over time (big dips are either Apache restart, or perhaps GC?)
Screenshot of memory usage over last 12 days
AWS doesn't supports Memory metrics of any EC2 instance. As Amazon does all his monitoring from outside the EC2 instance(servers), it is unable to capture the memory metrics inside the instance. But, for complete monitoring of an instance, you must need Memory Utilisation statistics for any instance, along with his CPU Utilisation and Network IO operations.
But, we can use custom metrics feature of cloudwatch to send any app-level data to Cloudwatch and monitor it using amazon tools.
You can follow this blog for more details: http://upaang-saxena.strikingly.com/blog/adding-ec2-memory-metrics-to-aws-cloudwatch
You can set a cron for 5 min interval in that instance, and all the data points can be seen in Cloudwatch.
CloudWatch doesn't actually provide metrics regarding memory usage for EC2 instance, you can confirm this here.
As a result, the MemoryUtilization metric that you are referring to is obviously a custom metric that is being pushed by something you have configured or some application running on your instance.
As a result, you need to determine what is actually pushing the data for this metric. The data source is obviously pushing the wrong thing, or is unreliable.
The behavior you are seeing is not a CloudWatch problem.

Perfino System Requirements

We're planning to evaluate and eventually potentially purchase perfino. I went quickly through the docs and cannot find the system requirements for the installation. Also I cannot find it's compatibility with JBoss 7.1. Can you provide details please?
There are no hard system requirements for disk space, it depends on the amount of business transactions that you're recording. All data will be consolidated, so the database reaches a maximum size after a while, but it's not possible to say what that size will be. Consolidation times can be configured in the general settings.
There are also no hard system requirements for CPU and physical memory. A low-end machine will have no problems monitoring 100 JVMs, but the exact details again depend on the amount of monitored business transactions.
JBoss 7.1 is supported. "Supported" means that web service and EJB calls can be tracked between JVMs, otherwise all application servers work with perfino.
I haven't found any official system requirements, but this is what we figured out experimentally.
We collect about 10,000 transactions a minute from 8 JVMs. We have a lot of distinct and long SQL queries. We use AWS machine with 2 VCPUs and 8GB RAM.
When the Perfino GUI is not being used, the CPU load is low. However, for the GUI to work properly, we had to modify perfino_service.vmoptions:
-Xmx6000m. Before that we had experienced multiple OutOfMemoryError in Perfino when filtering in the transactions view. After changing the memory settings, the GUI is running fine.
This means that you need a machine with about 8GB RAM. I guess this depends on the number of distinct transactions you collect. Our limit is high, at 30,000.
After 6 weeks of usage, there's 7GB of files in the perfino directory. Perfino can clear old recordings after a configurable time.

Mirrored queue performance factors

We operate two dual-node brokers, each broker having quite different queues and workloads. Each box has 24 cores (H/T) worth of Xeon E5645 # 2.4GHz with 48GB RAM, connected by Gigabit LAN with ~150μs latency, running RHEL 5.6, RabbitMQ 3.1, Erlang R16B with HiPE off. We've tried with HiPE on but it made no noticeable performance impact, and was very crashy.
We appear to have hit a ceiling for our message rates of between 1,000/s and 1,400/s both in and out. This is broker-wide, not per-queue. Adding more consumers doesn't improve throughput overall, just gives that particular queue a bigger slice of this apparent "pool" of resource.
Every queue is mirrored across the two nodes that make up the broker. Our publishers and consumers connect equally to both nodes in a persistant way. We notice an ADSL-like asymmetry in the rates too; if we manage to publish a high rate of messages the deliver rate drops to high double digits. Testing with an un-mirrored queue has much higher throughput, as expected. Queues and Exchanges are durable, messages are not persistent.
We'd like to know what we can do to improve the situation. The CPU on the box is fine, beam takes a core and a half for 1 process, then another 80% each of two cores for another couple of processes. The rest of the box is essentially idle. We are using ~20GB of RAM in userland with system cache filling the rest. IO rates are fine. Network is fine.
Is there any Erlang/OTP tuning we can do? delegate_count is the default 16, could someone explain what this does in a bit more detail please?
This is difficult to answer without knowing more about how your producers and consumers are configured, which client library you're using and so on. As discussed on irc (http://dev.rabbitmq.com/irclog/index.php?date=2013-05-22) a minute ago, I'd suggest you attempt to reproduce the topology using the MulticastMain java load test tool that ships with the RabbitMQ java client. You can configure multiple producers/consumers, message sizes and so on. I can certainly get 5Khz out of a two-node cluster with HA on my desktop, so this may be a client (or application code) related issue.

Monitoring CPU Core Usage on Terminal Servers

I have windows 2003 terminal servers, multi-core. I'm looking for a way to monitor individual CPU core usage on these servers. It is possible for an end-user to have a run-away process (e.g. Internet Explorer or Outlook). The core for that process may spike to near 100% leaving the other cores 'normal'. Thus, the overall CPU usage on the server is just the total of all the cores or if 7 of the cores on a 8 core server are idle and the 8th is running at 100% then 1/8 = 12.5% usage.
What utility can I use to monitor multiple servers ? If the CPU usage for a core is "high" what would I use to determine the offending process and then how could I automatically kill that process if it was on the 'approved kill process' list?
A product from http://www.packettrap.com/ called PT360 would be perfect except they use SMNP to get data and SMNP appears to only give total CPU usage, it's not broken out by an individual core. Take a look at their Dashboard option with the CPU gauge 'gadget'. That's exactly what I need if only it worked at the core level.
Any ideas?
Individual CPU usage is available through the standard windows performance counters. You can monitor this in perfmon.
However, it won't give you the result you are looking for. Unless a thread/process has been explicitly bound to a single CPU then a run-away process will not spike one core to 100% while all the others idle. The run-away process will bounce around between all the processors. I don't know why windows schedules threads this way, presumably because there is no gain from forcing affinity and some loss due to having to handle interrupts on particular cores.
You can see this easily enough just in task manager. Watch the individual CPU graphs when you have a single compute bound process running.
You can give Spotlight on Windows a try. You can graphically drill into all sorts of performance and load indicators. Its freeware.
perfmon from Microsoft can monitor each individual CPU. perfmon also works remote and you can monitor farious aspects of Windows.
I'm not sure if it helps to find run-away processes because the Windows scheduler dos not execute a process always on the same CPU -> on your 8 CPU machine you will see 12.5 % usage on all CPU's if one process runs away.

Resources