How to architecture an alarm engine? - iot

I'm looking to develop and/or implement an existing solutions an Alarm Engine preferably around AWS and/or PHP eco-systems.
Precisely I have a stream of messages containing values and I need to define rules like:
Temperature has been greater than 100 F° for a period of 5 minutes
And dispatch a message if the rule matches.
I'm looking to do something very similar to Grafana Alarms or CloudWatch Alarms on AWS.
I've not been able to find a standalone service or library to do that.
The rules engines doesn't seems that hard, however given the fact that there is a period to take care of, it seems somewhat more complicated because we need to take care of a transient state.
Any clues?

Related

Should I store a global counter or an aggregated value in a TSDB

This question is really about the data schema. I have a program which has a bunch of discrete events, and I want to get beautiful graphs out.
From my knowledge, I understand that I should really keep a counter of the number of events that have occurred, and on a regular interval, transfer that cumulative counter to the TSDB (as part of a cron job or similar).
What I currently have though is a system where the monitor, on a regular interval, tells the TSB how many events occurred during that interval (a fixed hard coded value!).
Which of these two design patterns is better? What are the factors that affect that decision? Do I have a counter value here or is it just a measurement?
I have various concerns, including but not limited to the efficiency of the monitoring tool.
You tagged the question with InfluxDB but it seems like what you are really asking about is the collection agent. For that I would look at Telegraf.
StatsD is also a really great lightweight API that is available for most major languages now, from which you can efficiently emit different types of stats (counters, timings, etc); either for every event or at a sample rate you define.
I implemented a solution that gather metrics emitted from my app using StatsD, metrics that were pulled (JMX queries), and basic host level stats you get for free with Telegraf. Every host (30+) runs a single telegraf instance which delivers its stats to a centralized InfluxDB server on some interval (i.e. 30 seconds).
So with an approach like that you get a good balance of performance and data precision.

Tool for monitoring QOS

In my project
We crawls x number of server.
Number of user for each server varies from 1 to n.
We crawls 1 to z item for each user.
Currently we are monitoring QOS using graphite. We are storing time taken to crawl the item.
x.time_taken
Problem with this approach is that if only single user is affected we get false alert about QOS.
What will be the correct tool/technique to answer/monitor following points:
Alert only if minimum k user are affected. [Not number of events]
List of user which were affected.
I think graphite and statsd is not correct tool for this. What will be better tool for answering those two question ?
What you are asking for is often called Service Monitoring. For very good reasons you want to know the service impact of an event, rather than just that an event has happened.
The advantage of this approach is exactly as you state in your requirements - you can focus on events which impact a large part of your user base and you have a list of the users affected right away.
The main drawback, IMHO, is that Service Monitoring is usually much more complex than simple performance or event/alert monitoring. It also often relies on a service model, which in my experience is something that is hard to build and even harder to keep up to date.
For example if a server in your system shows a significant slow down or failure, depending on your architecture this may impact all users who use a service that relies on that server, or it may impact a very small subset, or even none at all initially, if there is a load balancing mechanism or redundancy mechanism in place.
You would need to reflect this architecture in your service monitoring model, and also change it every time you update your system architecture or deployment.
If your system is static enough or critical enough to warrant the investment then this may be worth your while. If not then a simple compromise may be just to update the graphing and alerting you are doing to alert when the average response time over a set number of users, or over all users on a server increases by a significant amount.
This may give you most of the benefits you are after without having to invest in the extra complexity of a service monitoring solution.
If you definitely are looking to expand your monitoring approach and want to stick with open source tools then I would start by looking at NAGIOS if your focus is on infrastructure, or there are quite a few web service monitoring solutions with Free Tiers such as pingdom:
http://www.nagios.org
https://www.pingdom.com

NServiceBus appropriate for load distribution of periodic tasks

Would NServiceBus or an equivalent ESB be appropriate for an application that has a bunch of different kinds of background maintenance-type tasks? For example:
Scanning databases for the occurence of certain words in user-generated content
Updating database tables that store the results of relatively expensive queries
Creating/maintaining external indexes for content
Sending event notification emails for a scheduled event.
My idea is to employ some kind of task scheduler (either the Windows builtin one, Quartz.NET, or my own database-based solution) to publish different kinds messages onto the bus periodically. The period may be as short as one minute or as long as a days. The reason I want to use the bus is so that I can scale out the number of subscribers as the system becomes larger and busier and the tasks become either more frequent or more resource-intensive. It would also provide redundancy as long as I always have at least two subscribers running.
The obvious alternative to this would be to write my own Windows Service that is triggered by the scheduler and performs the work, but I feel like making that scale beyond a single machine and provide fault tolerance might be more difficult than using the ESB as that plumbing.
Does this sound like a reasonable approach? Alternative suggestions?
TIA
As the author of NServiceBus, I'm quite probably biased, but there is a tradeoff between learning a new technology and writing (possibly a simpler version of) your own. I would recommend considering the longer term maintainance (and documentation) costs of your own solution as compared to one written in house.
In terms of the feature-set you described, NServiceBus does provide facilities for all of that.

Using JMX classes to notify on events over time

I've been looking at JMX for monitoring application and system metrics (partially because MBeans can accessed by various tools such as JConsole). It would seem like the classes included with JMX would be useful for things like notification when metrics have exceeded thresholds. But I'm not sure they fit with the way I want to measure these over a specified time period.
For example, let's say I want to notify an admin when the average CPU load is over 95% for more than 5 minutes. Is that something can be done with a GaugeMonitor? From the docs, it doesn't seem quite suited for this, and I'm wondering if instead I should write my own MBean with the necessary logic.
A more relevant example is when the login times for users exceed 10s over a period of 5 mins. Slightly different would be the last 20 logins took more than 10s on average. Another case would be when a process crashes 4+ times in an hour. Or the request queue exceeds 15 for 5 mins. Are the JMX Monitor classes useful for this kind of thing?
In my opinion, the monitor mbean classes are not particularly useful and while you might be able to tweak them enough to serve your needs, it sounds like you have some diverse requirements. I recommend you take a look at something like Esper, a streaming event engine. Basically, you will inject regular readings into the engine and if a condition that you define occurs, you will get a callback which can easily be converted to a JMX Notification.
The Esper engine is quite efficient, runs entirely in caller threads (no extra threads) and only retains the injected data it needs to satisfy the conditions you register to be notified of.

Is it possible to have a stateless timed function

I'm trying to set a reminder in a system to fire at a certain time.
This is a web based app, so it's not like it will be in memory all the time.
Ideally I'd like to avoid using a service or job on the server(mainly out of curiosity, to see if there is a more efficient way to do it)
For example, imagine how many Ebay bids are constantly ending all the times, and emails being sent out seemingly perfectly in time.
Do people recon there is just a big loop going over and over, moving items into a queue etc... Or is there something lower level helping out (stored procedures, triggers etc)
Thanks everyone.
What you have to realize about eBay - and most large database-backed websites - is that the interactions between humans and the database that come through the web server are only a part (sometimes a very small part) of the functionality of the system.
To use eBay as an example, the email that goes out when auctions expire is not handled by a web server. They are far more likely to have that scripted. In other words, there is another program running on a number of their systems that look at the database for ended auctions, do some processing on them, send emails, etc.
If I were doing something similar (albeit on a much smaller scale,) I'd have my web services built in the usual way, but have a job that is run automatically every few minutes to do the maintenance work. It would start up, look at the database for work, process anything that was required, then exit.

Resources