How do I query Prometheus for the number of times a service is down?

I am trying to work with the up metric to determine the number of times the service was down for less than a minute (potentially a network hiccup) during a time range (or per hour). I am sampling at 5-second intervals.
The best I have so far is up == 0, which gives me a series with points only when the service was down, but I am not sure what to do next.
Any help with this type of query would be greatly appreciated
Thanks.

You might try the following: calculate the average of the up metric over a sliding window of one minute. If the service goes down, that average will decrease over time.
If the job comes up again and the average is still greater than 0, then the service wasn't down for more than one minute.
The following query (which works in the Prometheus web console) returns one data point each time the service comes back up after being down for less than one minute.
avg_over_time(up{job="jobname"} [1m]) > 0
AND
irate(up{job="jobname"} [1m]) > 0
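If you also want a rough count of how many times the target went down at all within a window, regardless of how long each outage lasted, one option (an addition to the answer above, not part of it) is resets(). It is intended for counters, but it simply counts decreases in a series' value, so every 1-to-0 transition of up is counted:

resets(up{job="jobname"}[24h])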

Related

How to count the number of metrics sent to Datadog over a 24 hour period?

I have a situation where I'm trying to count the number of files loaded into the system I am monitoring. I'm sending a "load time" metric to Datadog each time a file is loaded, and I need to send an alert whenever an expected file does not appear. To do this, I was going to count up the number of "load time" metrics sent to Datadog in a 24 hour period, then use anomaly detection to see whether it was less than the normal number expected. However, I'm having some trouble finding a way to consistently pull out this count for use in the alert.
I can't use the count_nonzero function, as some of my files are empty and have a load time of 0. I do know about .as_count() and count:metric{tags}, but I haven't found a way to include an evaluation interval with either of these. I've tried using .rollup(count, time) to count up the metrics sent, but this call seems to return variable results based on the rollup interval. For instance, if I compare intervals of 2000 and 4000 seconds, I would expect each 4000 second interval to count up about the sum of two 2000 second intervals over the same time period. This does not seem to be what happens at all - the counts for the smaller intervals seem to add up to much more than the count for the larger one. Additionally some rollup intervals display decimal numbers as counts, which does not make any sense to me if this function is doing what I thought it was.
Does anyone have any thoughts on how to accomplish this? I'd really appreciate any new ideas.
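For reference, combining the pieces the question already mentions (.as_count() and .rollup(count, time)), the kind of query being described would look something like the following, where the metric name and tag are placeholders:

sum:data.load_time{env:prod}.as_count().rollup(count, 86400)

Whether this behaves consistently enough for alerting is exactly the open question above; this is only a sketch of the syntax, not a confirmed fix for the rollup inconsistency.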

Log data into Grafana every minute

The code below registers five metrics (count, oneMinuteRate, fiveMinuteRate, fifteenMinuteRate, meanRate) in Graphite every 30 seconds from my application.
public void collectMetric(String metricName, long metricValue) {
    mr.meter(metricName).mark(metricValue);
}
I would like to show in the Grafana dashboard the number of requests received every minute (i.e. if 60 are received in the first minute and 120 in the second). Since the count in the meter metric above just keeps increasing, and all the *Rate values are events per second, I am not sure how to chart the number of requests received per minute in Grafana. Any advice is highly appreciated.
Suppose I use
mr.counter(metricName).inc(metricValue) instead; is there a way to reset the counters every minute?
I had the same problem. The way I found was to resolve this in Grafana.
When you're editing the panel's metrics, you can add a function to your query.
You can try the derivative() or perSecond() functions, but they are not completely reliable; it depends on what you're doing with them in your panel.
With these you'll see the incoming data over time rather than the ever-increasing total.
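As a concrete sketch (the metric path my.app.requests.count is a placeholder for whatever your meter reports into Graphite), a Grafana panel target that turns the ever-growing count into per-minute request counts could look like:

summarize(nonNegativeDerivative(my.app.requests.count), "1m", "sum", false)

nonNegativeDerivative() converts the cumulative count into the increase per reporting interval, and summarize(..., "1m", "sum") adds those increases up into one-minute buckets.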

How to graph individual Summary metric instances in Prometheus?

I'm using Prometheus' Summary metric to collect the latency of an API call. Instead of making an actual API call, I'm simply calling Thread.sleep(1000) to simulate a 1 second api-call latency value -- this makes the Summary hold a value of .01 (for 1 second of latency). But if, for example, I invoke Thread.sleep(1000) twice in the same minute, the Summary metric ends up with a value of .02 (for 2 seconds of latency), instead of two individual instances of .01 latency that just happened to occur within the same minute. My problem is the Prometheus query. The Prometheus query I am currently using is: rate(my_custom_summary_sum[1m]).
What should my Prometheus query be so that I can see the latency of each individual Thread.sleep(1000) invocation? As of right now, the Summary metric collects and displays the total latency sum per minute. How can I display the latency of each individual call to Thread.sleep(1000) (i.e. the API request)?
private static final Summary mySummary = Summary.build()
.name("my_custom_summary")
.help("This is a custom summary that keeps track of latency")
.register();
Summary.Timer requestTimer = mySummary.startTimer(); //starting timer for mySummary 'Summary' metric
Thread.sleep(1000); //sleep for one second
requestTimer.observeDuration(); //record the time elapsed
This is the graph that results from this query: [Prometheus graph]
Prometheus is a metrics-based monitoring system; it cares about overall performance and behaviour, not individual requests.
What you are looking for is a logs-based system, such as Graylog or the ELK stack.
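That said, if an average over a window is good enough (this is an addition to the answer, not a way to recover individual calls), the usual pattern for a Summary is to divide the rate of the _sum series by the rate of the _count series:

rate(my_custom_summary_sum[1m]) / rate(my_custom_summary_count[1m])

This gives the average observed latency per call over the last minute, not per-call values.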

What is the meaning of OneMinuteRate in JMX?

I am trying to calculate reads/second and writes/second in my Cassandra 2.1 cluster. After searching and reading, I came across the JMX bean
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Here I can see OneMinuteRate. I have started a brand new cluster and started collecting these metrics from 0.
After the first write, I can see
Count = 1
OneMinuteRate = 0.01599111...
Does it mean that my write/s is 0.0159911?
Or does it mean that based on 1 minute data, my write latency is 0.01599 where Write Latency refers to the response time for writing a record?
Please help me understand the value.
Thanks.
It means that over the last minute, your writes were occurring at a rate of .01599 writes per second. Think about it this way: the rate of writes over the last 60 seconds would be
WritesInLastMinute ÷ 60
So in your case
1 ÷ 60 = 0.0166
Or more precisely, .01599.
If you observed no further writes after that, the value would descend down to zero over the next minute.
OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate are exponential moving averages, meaning they are not simply dividing readings by time; instead, as the name implies, they take an exponential series of averages, as below:
result(t) = (1 - w) * result(t - 1) + (w) * event_this_period
where w is the weighting factor and t is the tick time. In other words, they take, say, 20% of the new reading and 80% of the old readings; it's the same way UNIX systems measure CPU load.
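As a minimal sketch of that update rule (the weighting factor and readings are illustrative values, not Dropwizard's exact constants):

public class EwmaSketch {
    public static void main(String[] args) {
        double w = 0.2;                 // illustrative weighting factor for the newest reading
        double rate = 0.0;              // result(t - 1), the previous smoothed value
        double eventsThisPeriod = 1.0;  // reading observed in the current tick
        // result(t) = (1 - w) * result(t - 1) + w * event_this_period
        rate = (1 - w) * rate + w * eventsThisPeriod;
        System.out.println(rate);       // 0.2 with these illustrative numbers
    }
}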
However, if this applies to requests that the server receives, below is a chart from a single request to a server, with the measurements taken by Dropwizard.
As you can see, a single request draws a curve over time. This is really useful for determining trends, but I'm not sure it is great for monitoring live traffic, especially critical traffic.

How to alert in Seyren with Graphite if transactions in last 60 minutes are less than x?

I'm using Graphite+Statsd (with Python client) to collect custom metrics from a webapp: a counter for successful transactions. Let's say the counter is stats.transactions.count, that also has a rate/per/second metric available at stats.transactions.rate.
I've also set up Seyren as a monitoring and alerting system and successfully pulled metrics from Graphite. Now I want to set up an alert in Seyren if the number of successful transactions in the last 60 minutes is less than a certain minimum.
Which metric and Graphite function should I use? I tried with summarize(metric, '1h') but this gives me an alert each hour when Graphite starts aggregating the metric for the starting hour.
Note that Seyren also allows you to specify Graphite's from and until parameters, if that helps.
I contributed the Seyren code to support from/until in order to handle this exact situation.
The following configuration should raise a warning if the count for the last hour drops below 50, and an error if it drops below 25.
Target: summarize(nonNegativeDerivative(stats.transactions.count),"1h","sum",true)
From: -1h
To: [blank]
Warn: 50 (soft minimum)
Error: 25 (hard minimum)
Note this will run every minute, so the "last hour" is a sliding window. Also note that the third boolean parameter true for the summarize function tells it to align its 1h bucket to the From, meaning you get a full 1-hour bucket starting 1 hour ago rather than accidentally getting a half bucket. (Newer versions of Graphite may do this automatically.)
Your mileage may vary. I have had problems with this approach when the counter gets set back to 0 on a server restart. But in my case I'm using dropwizard metrics + graphite, not statsd + graphite, so you may not have that problem.
Please let me know if this approach works for you!
