Using JMX classes to notify on events over time - monitor

I've been looking at JMX for monitoring application and system metrics (partially because MBeans can accessed by various tools such as JConsole). It would seem like the classes included with JMX would be useful for things like notification when metrics have exceeded thresholds. But I'm not sure they fit with the way I want to measure these over a specified time period.
For example, let's say I want to notify an admin when the average CPU load is over 95% for more than 5 minutes. Is that something can be done with a GaugeMonitor? From the docs, it doesn't seem quite suited for this, and I'm wondering if instead I should write my own MBean with the necessary logic.
A more relevant example is when the login times for users exceed 10s over a period of 5 mins. Slightly different would be the last 20 logins took more than 10s on average. Another case would be when a process crashes 4+ times in an hour. Or the request queue exceeds 15 for 5 mins. Are the JMX Monitor classes useful for this kind of thing?

In my opinion, the monitor mbean classes are not particularly useful and while you might be able to tweak them enough to serve your needs, it sounds like you have some diverse requirements. I recommend you take a look at something like Esper, a streaming event engine. Basically, you will inject regular readings into the engine and if a condition that you define occurs, you will get a callback which can easily be converted to a JMX Notification.
The Esper engine is quite efficient, runs entirely in caller threads (no extra threads) and only retains the injected data it needs to satisfy the conditions you register to be notified of.

Related

How to architecture an alarm engine?

I'm looking to develop and/or implement an existing solutions an Alarm Engine preferably around AWS and/or PHP eco-systems.
Precisely I have a stream of messages containing values and I need to define rules like:
Temperature has been greater than 100 F° for a period of 5 minutes
And dispatch a message if the rule matches.
I'm looking to do something very similar to Grafana Alarms or CloudWatch Alarms on AWS.
I've not been able to find a standalone service or library to do that.
The rules engines doesn't seems that hard, however given the fact that there is a period to take care of, it seems somewhat more complicated because we need to take care of a transient state.
Any clues?

Locust - how to delay collection of RPS data until all threads have started

Scenario
locust test with gradual spawn-rate, chart looks like a 45-degree angle.
I would like to know the RPS of the system while all threads are running.
The out-of-the-box RPS value from locust will include RPS values from the beginning of the run when there were fewer threads.
How can I customize my locust script to start calculating RPS from when all threads are running?
Is this a reasonable load-test practice?
An alternative option would be to "simulate reality" as much as possible (and in the real word there is ramp-up when the system starts up). To get a more representative RPS value, run the test longer.
There are many reasons why you want to pay attention to what your system can handle while new load is being added. There can be performance problems accepting the connection, for example, if you have improper or older SSL/TLS settings or libraries. In some instances having new load come up can affect users already connected to and using your system. You might even have additional server logic that happens when a new connection is accepted. In short, you should go with 3) above.
However, enough people like to ignore or gloss over what things look like during ramp up that Locust does have a configuration option --reset-stats that will automatically reset all collected stats once all spawning has completed so it appears as if the load test started with all users connected instantaneously. That should give you what you were asking for.

Should I store a global counter or an aggregated value in a TSDB

This question is really about the data schema. I have a program which has a bunch of discrete events, and I want to get beautiful graphs out.
From my knowledge, I understand that I should really keep a counter of the number of events that have occurred, and on a regular interval, transfer that cumulative counter to the TSDB (as part of a cron job or similar).
What I currently have though is a system where the monitor, on a regular interval, tells the TSB how many events occurred during that interval (a fixed hard coded value!).
Which of these two design patterns is better? What are the factors that affect that decision? Do I have a counter value here or is it just a measurement?
I have various concerns, including but not limited to the efficiency of the monitoring tool.
You tagged the question with InfluxDB but it seems like what you are really asking about is the collection agent. For that I would look at Telegraf.
StatsD is also a really great lightweight API that is available for most major languages now, from which you can efficiently emit different types of stats (counters, timings, etc); either for every event or at a sample rate you define.
I implemented a solution that gather metrics emitted from my app using StatsD, metrics that were pulled (JMX queries), and basic host level stats you get for free with Telegraf. Every host (30+) runs a single telegraf instance which delivers its stats to a centralized InfluxDB server on some interval (i.e. 30 seconds).
So with an approach like that you get a good balance of performance and data precision.

Tool for monitoring QOS

In my project
We crawls x number of server.
Number of user for each server varies from 1 to n.
We crawls 1 to z item for each user.
Currently we are monitoring QOS using graphite. We are storing time taken to crawl the item.
x.time_taken
Problem with this approach is that if only single user is affected we get false alert about QOS.
What will be the correct tool/technique to answer/monitor following points:
Alert only if minimum k user are affected. [Not number of events]
List of user which were affected.
I think graphite and statsd is not correct tool for this. What will be better tool for answering those two question ?
What you are asking for is often called Service Monitoring. For very good reasons you want to know the service impact of an event, rather than just that an event has happened.
The advantage of this approach is exactly as you state in your requirements - you can focus on events which impact a large part of your user base and you have a list of the users affected right away.
The main drawback, IMHO, is that Service Monitoring is usually much more complex than simple performance or event/alert monitoring. It also often relies on a service model, which in my experience is something that is hard to build and even harder to keep up to date.
For example if a server in your system shows a significant slow down or failure, depending on your architecture this may impact all users who use a service that relies on that server, or it may impact a very small subset, or even none at all initially, if there is a load balancing mechanism or redundancy mechanism in place.
You would need to reflect this architecture in your service monitoring model, and also change it every time you update your system architecture or deployment.
If your system is static enough or critical enough to warrant the investment then this may be worth your while. If not then a simple compromise may be just to update the graphing and alerting you are doing to alert when the average response time over a set number of users, or over all users on a server increases by a significant amount.
This may give you most of the benefits you are after without having to invest in the extra complexity of a service monitoring solution.
If you definitely are looking to expand your monitoring approach and want to stick with open source tools then I would start by looking at NAGIOS if your focus is on infrastructure, or there are quite a few web service monitoring solutions with Free Tiers such as pingdom:
http://www.nagios.org
https://www.pingdom.com

NServiceBus appropriate for load distribution of periodic tasks

Would NServiceBus or an equivalent ESB be appropriate for an application that has a bunch of different kinds of background maintenance-type tasks? For example:
Scanning databases for the occurence of certain words in user-generated content
Updating database tables that store the results of relatively expensive queries
Creating/maintaining external indexes for content
Sending event notification emails for a scheduled event.
My idea is to employ some kind of task scheduler (either the Windows builtin one, Quartz.NET, or my own database-based solution) to publish different kinds messages onto the bus periodically. The period may be as short as one minute or as long as a days. The reason I want to use the bus is so that I can scale out the number of subscribers as the system becomes larger and busier and the tasks become either more frequent or more resource-intensive. It would also provide redundancy as long as I always have at least two subscribers running.
The obvious alternative to this would be to write my own Windows Service that is triggered by the scheduler and performs the work, but I feel like making that scale beyond a single machine and provide fault tolerance might be more difficult than using the ESB as that plumbing.
Does this sound like a reasonable approach? Alternative suggestions?
TIA
As the author of NServiceBus, I'm quite probably biased, but there is a tradeoff between learning a new technology and writing (possibly a simpler version of) your own. I would recommend considering the longer term maintainance (and documentation) costs of your own solution as compared to one written in house.
In terms of the feature-set you described, NServiceBus does provide facilities for all of that.

Resources