How to write Bosun alerts that handle low traffic volumes

If you are writing a Bosun alert based on a percentage error rate for requests handled by your system, how do you write it in such a way that it handles periods of low traffic?
For example:
If I have an alert that looks back over the last 5 minutes and works out the error rate for requests, $errorRate = $numberErr / $numberReq, and then triggers an alarm if the error rate exceeds a predefined threshold, crit = $errorRate > 0.05, this can work quite well as long as every 5-minute period has a sufficiently large number of requests ($numberReq).
If the number of requests in a 5-minute period was 10,000, then 501 errors would be required to trigger an alarm. However, if the number of requests in a 5-minute period was 100, then only 6 errors would be required to trigger an alarm.
How can I write an alert that handles periods where the number of requests is so low that a small number of errors equates to a large error rate? I had considered a sliding window of time, rather than a fixed 5-minute period, where the window would increase in size until the number of requests was high enough to give some confidence in the alarm, e.g. increase the time period until the number of requests is 10,000.
I can't find a way to achieve this in Bosun, and I don't want to commit to a larger period of time for my alerts because the traffic rate varies so much. A longer period during peak traffic could mean an actual error causes a much larger impact before the alert fires.

I generally pair any percentage-based and/or history-based alert with a static threshold.
For example: crit = $numberErr > 100 && $errorRate > 0.05. That way the percentage part doesn't matter unless the number of errors has also crossed some threshold, because otherwise the entire statement won't be true.
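Put together as a Bosun alert definition, that might look something like the sketch below. The metric names are hypothetical and the queries assume the metrics are per-interval counts rather than cumulative counters; substitute your own series and thresholds.

alert high.error.rate {
    # Hypothetical request and error counters over the last 5 minutes.
    $numberReq = sum(q("sum:app.requests.total{}", "5m", ""))
    $numberErr = sum(q("sum:app.requests.errors{}", "5m", ""))
    $errorRate = $numberErr / $numberReq
    # Static floor: the percentage only matters once at least 100 errors have occurred.
    crit = $numberErr > 100 && $errorRate > 0.05
}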

Related

How much computing time does the kernel need

I wrote a program for an LED display. The program allows the refresh rate to be set via a web configuration. To meet the refresh rate, I measure the processing time of each loop iteration; at the end I calculate the remaining delay and wait until the next iteration.
E.g. a refresh rate of 5 Hz means 200 milliseconds for one loop; 50 milliseconds of computing time results in a 150 millisecond delay.
The ratio of processing time (50 milliseconds) to total time (200 milliseconds) indicates the processor load of my program, here 50 / 200 = 25%. But to find the optimal setting I need the actual total processor load, not only that of my program. Since I don't know the real processor load inside delay() (in which WiFi etc. is handled), I don't really know the overall load. In other words, I don't know how much time the system spends doing system tasks in the delay(150).
Is there a way to find out how much of a delay is actually used for system tasks before the processor truly waits?
In other words, I'm looking for a way to get the kernel time within a certain time frame.
Cheers Gabriel

How to count the number of metrics sent to Datadog over a 24 hour period?

I have a situation where I'm trying to count the number of files loaded into the system I am monitoring. I'm sending a "load time" metric to Datadog each time a file is loaded, and I need to send an alert whenever an expected file does not appear. To do this, I was going to count up the number of "load time" metrics sent to Datadog in a 24 hour period, then use anomaly detection to see whether it was less than the normal number expected. However, I'm having some trouble finding a way to consistently pull out this count for use in the alert.
I can't use the count_nonzero function, as some of my files are empty and have a load time of 0. I do know about .as_count() and count:metric{tags}, but I haven't found a way to include an evaluation interval with either of these. I've tried using .rollup(count, time) to count up the metrics sent, but this call seems to return variable results based on the rollup interval. For instance, if I compare intervals of 2000 and 4000 seconds, I would expect each 4000 second interval to count up about the sum of two 2000 second intervals over the same time period. This does not seem to be what happens at all - the counts for the smaller intervals seem to add up to much more than the count for the larger one. Additionally some rollup intervals display decimal numbers as counts, which does not make any sense to me if this function is doing what I thought it was.
Does anyone have any thoughts on how to accomplish this? I'd really appreciate any new ideas.

How can I calculate the appropriate channel capacity?

I am looking for a solution because the sth-channel is full.
I am having trouble calculating the appropriate channel capacity.
This document has the following description:
In order to calculate the appropriate capacity, just have in consideration the following parameters:
・The amount of events to be put into the channel by the sources per unit time (let's say 1 minute).
・The amount of events to be gotten from the channel by the sinks per unit time.
・An estimation of the amount of events that could not be processed per unit time, and thus to be reinjected into the channel (see next section).
How can I check the values of these parameters?
You can't just check these parameters. They depend on your application.
What they are saying is that you should choose a capacity large enough that the generator doesn't get stuck. This may not be possible in your application.
Say your generator receives one event per second and it takes 2 seconds for a receiver to handle that event. Now let's assume you have 3 receivers. In 1 second, each receiver can process 0.5 events. With 3 receivers, together they are capable of processing 0.5 × 3 = 1.5 events per second, which is more than what you get as input. Your capacity can be 1 or 2; using 2 will greatly increase your chances of not getting blocked.
Let's review another example:
Your generator wants to push 1,000 events per second
Your receivers take 3 seconds to process one event
You would need 1,000 x 3 = 3,000 receivers (3,000 goroutines that can run at full speed in parallel...)
In this example, the total number of receivers is so large that you have to either break up your code to work on multiple computers or optimize your receiver code so it can process the data in an amount of time that makes sense. Say you have 50 processors: your receivers will still get 1,000 events per second, and even with all 50 running at full speed, each receiver needs to do its work in:
50 / 1000 = 0.05 seconds
Now let's assume that in most cases your goroutines take 0.02 seconds, but once in a while one will take 1 second. That means your goroutines can get a little behind. In that case your capacity (so the generator doesn't get blocked) should be a little over 1,000. Again, it will depend on how many of the routines get slowed down, etc. In this last example a single run is 0.02 seconds, so with 50 goroutines processing 1,000 events usually takes about (1,000 / 50) × 0.02 = 0.4 seconds. If you can spread those 1,000 events over the 1-second period, you may not even need the 50 goroutines and could have a smaller capacity. On the other hand, if you have big bursts where you may end up sending many (say 500) events all at once, then more goroutines and a larger capacity are important to avoid getting blocked.
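A minimal Go sketch of the first example above (a generator producing one event per second, three receivers that each need two seconds per event, and a channel capacity of 2 so the generator is unlikely to block); the numbers come from that example, not from any real system:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    // Capacity 2: a little slack for the 1 event/s generator, given that
    // the 3 receivers together can only absorb 1.5 events/s.
    events := make(chan int, 2)

    var wg sync.WaitGroup
    for r := 0; r < 3; r++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for e := range events {
                time.Sleep(2 * time.Second) // simulated processing time per event
                fmt.Printf("receiver %d handled event %d\n", id, e)
            }
        }(r)
    }

    // Generator: one event per second; it blocks only when the buffer is
    // full and every receiver is busy.
    for i := 0; i < 10; i++ {
        events <- i
        time.Sleep(1 * time.Second)
    }
    close(events)
    wg.Wait()
}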

What is the meaning of OneMinuteRate in JMX?

I am trying to calculate the reads/second and writes/second in my Cassandra 2.1 cluster. After searching and reading, I came to know about the JMX bean
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Here I can see OneMinuteRate. I have started a brand new cluster and started collecting these metrics from 0.
After I wrote my first record, I can see
Count = 1
OneMinuteRate = 0.01599111...
Does it mean that my write/s is 0.0159911?
Or does it mean that based on 1 minute data, my write latency is 0.01599 where Write Latency refers to the response time for writing a record?
Please help me understand the value.
Thanks.
It means that in the last minute, your writes were occurring at a rate of 0.01599 writes per second. Think about it this way: the rate of writes in the last 60 seconds would be
WritesInLastMinute ÷ 60
So in your case
1 ÷ 60 ≈ 0.0167
Or, more precisely, 0.01599.
If you observed no further writes after that, the value would decay toward zero over the following minute.
OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate are exponentially weighted moving averages, meaning they are not simply readings divided by elapsed time; instead, as the name implies, they take an exponential series of averages, as below:
result(t) = (1 - w) * result(t - 1) + (w) * event_this_period
where w is the weighting factor and t is the tick time. In other words, they take roughly 20% of the new reading and 80% of the old readings; it's the same way UNIX systems measure CPU load averages.
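A small Go sketch of that update rule; the 0.2 weighting and the per-tick counts are illustrative only, not the exact constants Cassandra/Dropwizard uses:

package main

import "fmt"

// ewma applies result(t) = (1-w)*result(t-1) + w*event_this_period
// over a series of per-tick event counts and returns the smoothed value.
func ewma(perTick []float64, w float64) float64 {
    var result float64
    for _, events := range perTick {
        result = (1-w)*result + w*events
    }
    return result
}

func main() {
    // One write in the first tick, then nothing: the rate spikes and then decays.
    ticks := []float64{1, 0, 0, 0, 0}
    for i := range ticks {
        fmt.Printf("after tick %d: %.5f\n", i+1, ewma(ticks[:i+1], 0.2))
    }
}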
When this is applied to the requests a server receives, charting the metric over time (as measured by Dropwizard, for example) shows that even a single request draws out a curve. That makes these rates really useful for determining trends, but I'm not sure they are great for monitoring live traffic, especially critical traffic.

How do I query Prometheus for the number of times a service is down

I am trying to work with the up metric to determine the number of times the service was down for less than a minute (potentially a network hiccup) during a time range (or per hour). I am sampling at 5-second intervals.
The best I have so far is that up == 0 gives me a series with points only when the service was down, but I am not sure what to do next.
Any help with this type of query would be greatly appreciated
Thanks.
You might try the following: calculate the average of the up metric. If the service goes down, the average (over a sliding window of 1 minute) will decrease over time.
If the job comes up again, and the average is greater than 0, then the service wasn't down for more than one minute.
The following query (it works via the Prometheus web console) delivers one data point for each time the service comes back up before it has been down for more than one minute.
avg_over_time(up{job="jobname"} [1m]) > 0
AND
irate(up{job="jobname"} [1m]) > 0
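If you also want the per-hour count the question asks for, one possible extension (my own addition, not part of the original answer, and it assumes a Prometheus version with subquery support, 2.7 or later) is to count how many evaluation points in the last hour satisfied that condition:

count_over_time((avg_over_time(up{job="jobname"}[1m]) > 0 and irate(up{job="jobname"}[1m]) > 0)[1h:5s])

The 5s subquery resolution matches the 5-second scrape interval mentioned in the question; the result is an approximation of how often the condition fired, not an exact count of outages.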
