Interesting metrics from JMX

What typical metrics do application developers usually find interesting to monitor via JMX, other than:
CPU Utilization
Memory consumption

I would add:
Class loader behaviour
Threads

Memory usage over time (you can see GC runs and detect memory leaks)
Stack trace of a specified thread
JVM uptime and OS information
Any JMX-exposed data from your own application

Garbage Collector Activity (duration and frequency)
Deadlock Detection
Connector Traffic (in / out)
Request processing time
Number of sessions (in relation to max configured)
Average Session duration
Number of sessions rejected
Whether the web module is running
Uptime (if less than 5 minutes, then someone restarted the JVM)
Connector threads relative to the max. available connector threads
Datasource Pools: Usage (relative), Lease time
JMS: Queue size, DLQ size
See also Jmx4Perl's predefined Nagios checks for further metrics.

JMX can expose any of the platform MXBean metrics. Refer to the Method Summary section of the ManagementFactory Java documentation: http://docs.oracle.com/javase/7/docs/api/java/lang/management/ManagementFactory.html
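As a concrete illustration, those platform MXBeans can be read over HTTP if a Jolokia agent (a JMX-over-HTTP bridge in the same family as Jmx4Perl's agent) is deployed alongside the application. A minimal sketch: the endpoint URL below is a hypothetical deployment, while the MBean names are the standard platform MXBean object names.

import requests

# Hypothetical Jolokia endpoint; adjust host/port/context to your deployment.
JOLOKIA = "http://localhost:8080/jolokia"

def read_attribute(mbean, attribute):
    """Read a single JMX attribute via Jolokia's REST read operation."""
    resp = requests.get(f"{JOLOKIA}/read/{mbean}/{attribute}")
    resp.raise_for_status()
    return resp.json()["value"]

# Standard platform MXBean object names (see the ManagementFactory docs).
heap = read_attribute("java.lang:type=Memory", "HeapMemoryUsage")
threads = read_attribute("java.lang:type=Threading", "ThreadCount")
uptime = read_attribute("java.lang:type=Runtime", "Uptime")
loaded = read_attribute("java.lang:type=ClassLoading", "LoadedClassCount")

print(f"heap used: {heap['used']} / {heap['max']} bytes")
print(f"live threads: {threads}, classes loaded: {loaded}, uptime: {uptime} ms")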

Related

Dask Distributed - Plugin for Monitoring Memory Usage

I have a distributed Dask cluster that I send a bunch of work to via Dask Distributed Client.
After sending it all, I'd love to get a report or something that tells me the peak memory usage of each worker.
Is this possible via existing diagnostics tools? https://docs.dask.org/en/latest/diagnostics-distributed.html
Specifically for memory, it's possible to extract information from the scheduler (while it's running) using client.scheduler_info() (this can be dumped as JSON). For peak memory, there would have to be an extra function that compares the current usage with the previous maximum and keeps the larger value; a sketch of that polling approach follows.
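A minimal sketch of that approach, assuming a running cluster; the exact layout of the scheduler_info() result (here workers -> metrics -> memory) may vary between distributed versions:

import time
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

peak = {}  # worker address -> peak memory observed (bytes)
for _ in range(60):  # poll once per second for a minute
    workers = client.scheduler_info()["workers"]
    for addr, info in workers.items():
        peak[addr] = max(peak.get(addr, 0), info["metrics"]["memory"])
    time.sleep(1)

for addr, mem in peak.items():
    print(f"{addr}: peak {mem / 1e6:.0f} MB")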
For a lot of other useful information, but not the peak memory consumption, there's the built-in report:
from dask.distributed import Client, performance_report
import dask.array as da

client = Client()  # or connect to your existing cluster
with performance_report(filename="dask-report.html"):
    da.random.random((10_000, 10_000)).mean().compute()  # some dask computation
(based on the example in the documentation: https://docs.dask.org/en/latest/diagnostics-distributed.html)
Update: there is also a dedicated plugin for Dask that records min/max memory usage per task: https://github.com/itamarst/dask-memusage
Update 2: there is a nice blog post with code for tracking memory usage in Dask: https://blog.dask.org/2021/03/11/dask_memory_usage

EC2 CloudWatch memory metrics don't match what Top shows

I have a t2.micro EC2 instance, running at about 2% CPU. I know from other posts that the CPU usage shown in top differs from the CPU reported in CloudWatch, and that the CloudWatch value should be trusted.
However, I'm seeing very different values for memory usage between top, CloudWatch, and NewRelic.
There's 1 GB of RAM on the instance, and top shows ~300 MB of Apache processes plus ~100 MB of other processes. The overall memory usage reported by top is 800 MB. I guess there's 400 MB of OS/system overhead?
However, CloudWatch reports 700 MB of usage, and NewRelic reports 200 MB of usage (even though NewRelic reports 300 MB of Apache processes elsewhere, so I'm ignoring it).
The CloudWatch memory metric often goes over 80%, and I'd like to know what the actual value is, so I know when to scale if necessary, or how to reduce memory usage.
Here's the recent memory profile; it seems something is using more memory over time (the big dips are either Apache restarts, or perhaps GC?).
[Screenshot of memory usage over the last 12 days]
AWS doesn't provide memory metrics for EC2 instances out of the box. Because Amazon does all of its monitoring from outside the EC2 instance, it cannot capture metrics from inside the guest, such as memory usage. For complete monitoring of an instance, however, you need memory utilisation statistics alongside CPU utilisation and network I/O.
You can use CloudWatch's custom metrics feature to send any instance-level or app-level data to CloudWatch and monitor it with Amazon's tools.
You can follow this blog for more details: http://upaang-saxena.strikingly.com/blog/adding-ec2-memory-metrics-to-aws-cloudwatch
You can set up a cron job with a 5-minute interval on the instance, and all the data points will then be visible in CloudWatch.
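A minimal sketch of such a push, assuming boto3 and psutil are installed, IAM credentials with cloudwatch:PutMetricData are available, and using a hypothetical Custom/EC2 namespace; schedule it from cron every 5 minutes:

import boto3
import psutil

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

mem = psutil.virtual_memory()
cloudwatch.put_metric_data(
    Namespace="Custom/EC2",  # hypothetical namespace
    MetricData=[{
        "MetricName": "MemoryUtilization",
        "Value": mem.percent,  # used memory as a percentage of total
        "Unit": "Percent",
    }],
)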
CloudWatch doesn't actually provide metrics regarding memory usage for EC2 instances; you can confirm this here.
The MemoryUtilization metric that you are referring to is therefore a custom metric, pushed by something you have configured or by some application running on your instance.
You need to determine what is actually pushing the data for this metric; that data source is evidently pushing the wrong values, or is unreliable.
The behavior you are seeing is not a CloudWatch problem.

Passenger server upgrade: Processor (CPU) Cores VS Ram?

I went through the Passenger documentation to find out how many application instances it can run for a given hardware configuration. The documentation only talks about RAM:
The optimal value depends on your system’s hardware and the server’s average load. You should experiment with different values. But generally speaking, the value should be at least equal to the number of CPUs (or CPU cores) that you have. If your system has 2 GB of RAM, then we recommend a value of 30. If your system is a Virtual Private Server (VPS) and has about 256 MB RAM, and is also running other services such as MySQL, then we recommend a value of 2.
It says the minimum value should be the number of CPUs/CPU cores available. I have a VPS with one vCPU and 1 GB RAM, and my service provider offers an option to upgrade only the RAM. How far can I keep upgrading just the RAM? How important is it to upgrade the number of CPUs?
Quick Answer
Depends on what resources are the bottleneck for your app.
Long answer
You'll need to factor in a few things:
How much CPU time does your app need?
How much RAM does any given instance of your app use at peak load?
Does your app spend a lot of time on IO-intensive tasks (i.e. database and file reads/writes, network communication)?
There can be other things to factor in, but your bottlenecks will probably be one of the above. If RAM is your main bottleneck, by all means use your newly available RAM. However, if it turns out that your app is being slowed down by CPU availability or flooded IO, no amount of RAM is going to speed things up.
On the topic of CPU cores: my understanding is that the main Apache process that runs Passenger is single-threaded. Apache spawns new threads to handle concurrency on an as-needed basis. Each additional CPU core theoretically allows you to run x*n threads, where x is the number of threads you can optimally run under a single CPU core and n is the number of CPU cores available to Apache.
Disclaimer: I'm not very well read on Passenger internals; though this logic usually holds true for other kinds of Apache configurations.
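To make the sizing arithmetic concrete, here is a rough sketch of both limits; the 75% RAM headroom factor, the per-process footprint, and the processes-per-core figure are assumptions you would measure for your own app (e.g. with passenger-memory-stats):

# Rough capacity estimate: the effective pool size is the smaller of
# what RAM allows and what the CPUs can usefully run concurrently.
total_ram_mb = 1024        # 1 GB VPS
per_process_mb = 150       # measured per app instance (assumption)
cpu_cores = 1
processes_per_core = 4     # assumption for an IO-heavy app

ram_bound = int(total_ram_mb * 0.75 / per_process_mb)   # -> 5 processes
cpu_bound = cpu_cores * processes_per_core              # -> 4 processes

print("suggested pool size:", min(ram_bound, cpu_bound))

Upgrading only the RAM raises ram_bound, but once cpu_bound is the smaller number, extra RAM stops helping.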

Web application Load Tests: What metrics to look at?

During a stress/load test of an ASP.NET app hosted in IIS, what should I be monitoring on the app server?
For example, the built-in Performance Monitor utility in Windows has a huge list of counters that I can watch, but I don't even know what half of these counters actually mean. I know I want to look at things like memory, processor, and network... but that is pretty general.
How can I successfully find a problem area?
What counters have some of you used in the past?
We watch these metrics to determine whether requests are being serviced promptly and whether volume scales linearly with the applied load:
Queued Requests
Current Requests
Requests Executing
Requests Succeeded
Requests/sec
We also watch these to look for application problems:
Errors/sec
Unhandled Execution Errors/sec
To monitor the VM memory, we look at:
CLR Heap Size
CLR Generation 0, 1 & 2 Garbage collections
CLR Percent Time in GC
For locking conditions, we watch:
CLR Lock Contentions
CLR Lock Contention/sec
CLR Lock Contention Queue Length
Depending on the application we might look at others, like thread counts, but the above are the ones we look at most frequently.
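As one way to capture counters like those above outside the perfmon GUI, the sketch below shells out to Windows' typeperf CLI. The counter paths use the usual category names, but exact instance names and .NET-version-specific categories vary by machine, so treat them as assumptions:

import subprocess

# Representative counters from the lists above; adjust instances as needed.
COUNTERS = [
    r"\ASP.NET\Requests Queued",
    r"\ASP.NET Applications(__Total__)\Requests/Sec",
    r"\.NET CLR Memory(_Global_)\# Bytes in all Heaps",
    r"\.NET CLR Memory(_Global_)\% Time in GC",
    r"\.NET CLR LocksAndThreads(_Global_)\Contention Rate / sec",
]

# Take 10 samples at 5-second intervals and write them to CSV.
subprocess.run(
    ["typeperf", *COUNTERS, "-si", "5", "-sc", "10", "-o", "counters.csv", "-y"],
    check=True,
)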

Monitoring CPU Core Usage on Terminal Servers

I have multi-core Windows 2003 terminal servers. I'm looking for a way to monitor individual CPU core usage on these servers. It is possible for an end user to have a runaway process (e.g. Internet Explorer or Outlook). The core for that process may spike to near 100% while the other cores stay normal. The overall CPU usage on the server is just the average across all cores, so if 7 cores on an 8-core server are idle and the 8th is running at 100%, the server reports 1/8 = 12.5% usage.
What utility can I use to monitor multiple servers? If the CPU usage for a core is "high", what would I use to determine the offending process, and how could I automatically kill that process if it is on the 'approved kill process' list?
A product from http://www.packettrap.com/ called PT360 would be perfect, except they use SNMP to get data, and SNMP appears to only give total CPU usage; it's not broken out by individual core. Take a look at their Dashboard option with the CPU gauge 'gadget'. That's exactly what I need, if only it worked at the core level.
Any ideas?
Individual CPU usage is available through the standard windows performance counters. You can monitor this in perfmon.
However, it won't give you the result you are looking for. Unless a thread/process has been explicitly bound to a single CPU, a runaway process will not spike one core to 100% while all the others idle; it will bounce around between all the processors. I don't know why Windows schedules threads this way; presumably there is no gain from forcing affinity, and some loss due to having to handle interrupts on particular cores.
You can see this easily enough just in task manager. Watch the individual CPU graphs when you have a single compute bound process running.
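For a scripted take on the question's detect-and-kill logic, here is a sketch using Python's psutil, which reads the same per-core data as the performance counters; the threshold and the kill-list process names are assumptions:

import time
import psutil

APPROVED_KILL_LIST = {"iexplore.exe", "outlook.exe"}  # hypothetical list
CORE_THRESHOLD = 95.0  # percent

per_core = psutil.cpu_percent(interval=1, percpu=True)
print("per-core usage:", per_core)

if any(core > CORE_THRESHOLD for core in per_core):
    procs = list(psutil.process_iter(["name"]))
    for p in procs:
        p.cpu_percent()          # prime the per-process counters
    time.sleep(1)
    for p in procs:
        try:
            # Percent of a single core; a runaway thread reads near 100
            # even as the scheduler bounces it between cores.
            if p.cpu_percent() > CORE_THRESHOLD and \
                    (p.info["name"] or "").lower() in APPROVED_KILL_LIST:
                p.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass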
You can give Spotlight on Windows a try. You can graphically drill into all sorts of performance and load indicators. It's freeware.
perfmon from Microsoft can monitor each individual CPU. perfmon also works remotely, and you can monitor various aspects of Windows.
I'm not sure it helps to find runaway processes, because the Windows scheduler does not always execute a process on the same CPU -> on your 8-CPU machine you will see 12.5% usage on all CPUs if one process runs away.
