Latency SLO calculation of requests - histogram

I need to calculate and plot a latency SLO graph from Prometheus histogram time series, but I've been unsuccessful at displaying a histogram in Grafana.
A sample metric would be the request time of an nginx server.
Suppose I have a histogram bucket like this:
nginx_request_time_bucket{le="1"} 1
nginx_request_time_bucket{le="10"} 2
nginx_request_time_bucket{le="60"} 2
nginx_request_time_bucket{le="+Inf"} 5
I use the expression below to validate the latency SLO. It returns the fraction of requests served within 10s:
sum(rate(nginx_request_time_bucket{le="10"}[$__range])) / sum(rate(nginx_request_time_count[$__range]))
Now how can I find the percentage of requests that took between 10s and 60s? Is the expression below correct?
(
sum(rate(nginx_request_time_bucket{le="10"}[$__range]))
+
sum(rate(nginx_request_time_bucket{le="60"}[$__range]))
) / 2 / sum(rate(nginx_request_time_count[$__range]))
Any help here is highly appreciated!

All the {le="10"} requests are also included in {le="60"} (and in all the bigger buckets), so in order to know the amount of requests between them you just have to subtract the rates, so something like:
(
sum(rate(nginx_request_time_bucket{le="60"}[$__range]))
-
sum(rate(nginx_request_time_bucket{le="10"}[$__range]))
)
/ sum(rate(nginx_request_time_count[$__range]))
should work.
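As a quick sanity check with the sample buckets from the question (using the raw cumulative counts instead of rates): requests between 10s and 60s = 2 - 2 = 0, so 0 / 5 = 0% of requests fell in that range, while requests between 1s and 10s = 2 - 1 = 1, i.e. 1 / 5 = 20%.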

Related

SLO calculation for 90% of requests under 1000ms

I'm trying to figure out the PromQL for an SLO for latency, where we want 90% of all requests to be served in 1000ms or less.
I can get the 90th percentile of requests with this:
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]) ) )
And I can find what percentage of ALL requests were served in 1000ms or less with this:
((sum(rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]))) / (sum (rate(MyMetric_Request_Duration_count{instance="foo"}[1h])))) *100
Is it possible to combine these into one query that tells me what percentage of requests in the 90th percentile were served in 1000ms or less?
I tried the most obvious (to me anyway) solution, but got no data back.
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]) ) )
The goal is to get a measure that shows: for the 90th percentile of requests, how many of those requests were under 1000ms? Seems like this should be simple, but I can't find a PromQL query that does it.
Welcome to SO.
Out of all the requests, how many are getting served under 1000ms? To find that, I would divide the number of requests under 1000ms by the total number of requests. In my GCP world, it translates to a query like this:
(sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="abcxyz",le="1000"}[1m]))/sum(rate(istio_request_duration_milliseconds_count{reporter="destination",namespace="abcxyz"}[1m])))*100
Once you have a graph set up with the above query in Grafana, you can set up an alert on anything below 93%; that way you are alerted even before you breach your SLO of 90%.
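Two notes on the attempt in the question. First, histogram_quantile needs the full set of le buckets to interpolate; filtering down to the single le="1000" series leaves it fewer than two buckets, so it returns NaN and Grafana shows no data. Second, there is no need to combine the two queries: up to bucket-interpolation error, "90% of requests under 1000ms" holds exactly when the 90th-percentile latency is below 1000ms, so a sketch using the metric names from the question is to alert on
histogram_quantile(0.90, sum by (le) (rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]))) > 1000
which returns a value only while the 90th percentile exceeds 1000ms, i.e. while the SLO is being breached.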
Prometheus doesn't provide a function for calculating the share (aka the percentage) of requests served in under one second from histogram buckets. But such a function exists in VictoriaMetrics - the Prometheus-like monitoring system I work on. The function is histogram_share(). For example, the following query returns the share of requests with durations smaller than one second served during the last hour:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
Then the following query can be used for alerting when the share of requests served in less than one second drops below 90%:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le)) < 0.9
Please note that all functions that work over histogram buckets return estimated results. Their accuracy depends heavily on the chosen bucket boundaries. See this article for details.

How to understand Prometheus query for Grafana - histogram_quantile, sum, and rate functions & WRONG grafana graph data

The query that I am using to grab 99th percentile of API request latency is:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime"}[1m])) by (handler, method, le))
My latency histogram buckets are defined as [0.05, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0] in my code, which matches what I see when hitting the metrics endpoint for a sample API endpoint (the TestController.java class and its testLatencyTime() method):
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="0.05",}
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="0.25",}
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="0.5",}
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="1.0",}
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="2.0",}
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="4.0",}
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="8.0",}
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="+Inf",}
So it's the http_request_duration_seconds_bucket metric, passed to the rate function. Per this Stack Overflow post, the result acts like a cumulative distribution: "rate applied on buckets calculates a set of rates of increments that happened on all the buckets in the span of the last 1 minute. So, to answer your question, it is a cumulative density distribution on the rate of changes calculated in a given time frame". Per this YouTube video, is it correct to assume a cumulative distribution function is the area under the curve to the LEFT of a point of interest? https://www.youtube.com/watch?v=3xAIWiTJCvE
reference:
what's the math behind histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])) in PromQL
Furthermore, this is passed to the sum function, where the values returned by the rate function are aggregated. I'm also trying to understand this sentence from the Prometheus documentation: "The quantile is calculated for each label combination in http_request_duration_seconds".
reference:
https://prometheus.io/docs/prometheus/latest/querying/functions/
My problem is that when I use this dummy REST controller (Spring Boot; the TestController.java class and its testLatencyTime() method) to visualize the data in a locally running Prometheus/Grafana instance using Docker, and I make a "dummy" request with Thread.sleep(4000), the way Grafana plots it doesn't make sense:
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/test")
public class TestController {

    @PostMapping("/latency/{wait}")
    public ResponseEntity<String> testLatencyTime(@PathVariable Long wait) throws InterruptedException {
        Thread.sleep(wait); // block the request thread for `wait` milliseconds
        return new ResponseEntity<>("Request completed!", HttpStatus.OK);
    }
}
For example, the above 4000ms sleep gets marked as an "8 second" spike in Grafana for the 99th percentile query I posted above. Likewise, if I make a mock API call take 3 seconds, it gets marked as 4 seconds. It's almost as if Grafana is plotting the UPPER BOUND of the bucket the request falls into! (The upper bound for a 2s or 3s API call would be 4; the upper bound for a 4s call is 8.) Is my understanding of the statistics behind how Grafana graphs this incorrect, is it being plotted incorrectly, or is my query wrong/inaccurate? Please help! I can add any more information that is requested, but I think I included a good amount.
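For what it's worth, this looks like histogram_quantile's documented estimation behavior rather than a Grafana plotting bug: Prometheus assumes observations are uniformly distributed within a bucket and interpolates linearly between the bucket's bounds. A worked sketch with the buckets above: a Thread.sleep(4000) request presumably takes slightly over 4s once handler overhead is included, so it lands in the (4.0, 8.0] bucket, and with that single observation the 0.99 quantile is estimated as
4.0 + (8.0 - 4.0) * 0.99 = 7.96 ≈ 8 seconds
Likewise, a ~3s request lands in (2.0, 4.0] and is estimated as 2.0 + 2.0 * 0.99 = 3.98 ≈ 4 seconds, which is exactly the "upper bound" effect described above.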

Measure service latency with Prometheus

I am new to Prometheus and Grafana. My primary goal is to get the response time per request.
For me it seemed to be a simple thing - but whatever I do I do not get the results I require.
I need to be able to analyse the service latency in the last minutes/hours/days. The current implementation I found was a simple SUMMARY (without definition of quantiles) which is scraped every 15s.
Is it possible to get the average request latency of the last minute from my Prometheus SUMMARY?
If YES: How? If NO: What should I do?
Currently I am using the following query:
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1m])
I am getting two "datasets". The value of the first is "NaN". I suppose this is the result of a division by zero.
(I am using spring-client).
Your query is correct. The result will be NaN if there have been no requests in the past minute.
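Since a summary's _sum and _count series are plain counters, the same pattern extends to any window. A sketch for the last hour, with the label values from the question (a longer range also makes the NaN case rarer, because some request usually falls inside it):
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1h])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1h])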

What is the meaning of OneMinuteRate in JMX?

I am trying to calculate the Read/second and Write/Second in my Cassandra 2.1 cluster. After searching and reading, I came to know about JMX bean
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Here I can see oneMinuteRate. I have started a brand new cluster and started collecting these metrics from 0.
When I wrote my first record, I saw:
Count = 1
OneMinuteRate = 0.01599111...
Does it mean that my write/s is 0.0159911?
Or does it mean that based on 1 minute data, my write latency is 0.01599 where Write Latency refers to the response time for writing a record?
Please help me understand the value.
Thanks.
It means that over the last minute, your writes were occurring at a rate of 0.01599 writes per second. Think about it this way: the rate of writes in the last 60 seconds would be
WritesInLastMinute ÷ 60
So in your case
1 ÷ 60 = 0.0166
Or more precisely, .01599.
If you observed no further writes after that, the value would decay toward zero over the next minute.
OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate are exponentially weighted moving averages, meaning they are not simply a count divided by elapsed time. Instead, as the name implies, they maintain an exponential series of averages, like this:
result(t) = (1 - w) * result(t - 1) + w * event_this_period
where w is the weighting factor and t is the ticking time. In other words, each tick blends (for example) 20% of the new reading with 80% of the old readings; it's the same way UNIX systems measure CPU load.
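As a worked step (taking w = 0.2 purely for illustration): starting from a rate of 0.0166 with no events in the next period, result = 0.8 * 0.0166 + 0.2 * 0 ≈ 0.0133. Repeating that every tick is what makes the value decay toward zero after writes stop.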
However, since this applies to the requests a server receives, below is a chart from a single request to a server, with the measurements taken by Dropwizard.
As you can see, a single request draws a curve over time. That's really useful for spotting trends, but I'm not sure these rates are great for monitoring live traffic, especially critical traffic.

How do I query Prometheus for number of times a service is down

I am trying to work with the up metric to determine the number of times the service was down for less than a minute (potentially a network hiccup) during a time range (or per hour). I am sampling at 5-second intervals.
The best I got so far is up == 0 would give me a series with points only when the service was down but I am not sure what to do next.
Any help with this type of query would be greatly appreciated
Thanks.
You might try the following: calculate the average of the up metric. If the service goes down, the average (over a sliding window of 1 minute) will decrease over time.
If the job comes up again and the average is greater than 0, then the service wasn't down for more than one minute.
The following query (works via the Prometheus web console) delivers one data point for each time the service comes up before it was down for more than one minute.
avg_over_time(up{job="jobname"} [1m]) > 0
AND
irate(up{job="jobname"} [1m]) > 0
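A related approach, not part of the answer above but worth sketching: PromQL's changes() counts how many times a series' value changed inside the window, and up only flips between 1 and 0, so each down-and-back-up incident contributes two changes. An approximate outage count for the last hour is then:
changes(up{job="jobname"}[1h]) / 2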
