Show response time in Grafana using Prometheus - histogram

I have a Histogram of values, very similar to how ASP.NET Core does it for their duration of HTTP requests.
My histogram is the following: reservations_api_processing_time. I have several key/value pairs in there, and I want to build a dashboard to show the processing time for each key. For example, in the above screenshot, I have two keys search_statistics="load-statistics-InHouse" and basic_search="first-page".
I tried several things, such as:
sum by (le) (rate(reservations_api_processing_time_bucket{search_statistics="load-statistics-InHouse"}[30s]))
Or
reservations_api_processing_time_bucket{search_statistics="load-statistics-InHouse"}
Or
histogram_quantile(0.95, sum(rate(reservations_api_processing_time_bucket{search_statistics="load-statistics-InHouse"}[$__rate_interval])) by (le))
But none of these really shows me the values the way I want to see them. I don't mind using a Histogram or a Heatmap panel, either works for me, as long as I can see the actual processing time history.
My goal is to visualize how long processing took over a given time period. For example, if I had 5 executions and the processing times were [100, 200, 300, 400, 500] ms, I want to be able to see all 5 executions as they are in my graph.

The following query should return the 95th percentile of processing time for requests performed during the last hour (see the 1h lookbehind window in square brackets), grouped by any additional labels other than le that exist on the reservations_api_processing_time histogram:
histogram_quantile(0.95, rate(reservations_api_processing_time_bucket[1h]))
See these docs for more details about Prometheus histograms.
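If you also want one line per key (for example, one per search_statistics value), those labels can be kept alongside le in the aggregation. A possible sketch along those lines (the label names come from the question, so adjust them to whatever your histogram actually carries):
histogram_quantile(0.95, sum by (le, search_statistics) (rate(reservations_api_processing_time_bucket[1h])))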

Related

SLO calculation for 90% of requests under 1000ms

I'm trying to figure out the PromQL for an SLO for latency, where we want 90% of all requests to be served in 1000ms or less.
I can get the 90th percentile of requests with this:
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]) ) )
And I can find what percentage of ALL requests were served in 1000ms or less with this.
((sum(rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]))) / (sum (rate(MyMetric_Request_Duration_count{instance="foo"}[1h])))) *100
Is it possible to combine these into one query that tells me what percentage of requests in the 90th percentile were served in 1000ms or less?
I tried the most obvious (to me anyway) solution, but got no data back.
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]) ) )
The goal is to get a measure that shows, for the 90th percentile of requests, how many of those requests were under 1000ms. It seems like this should be simple, but I can't find a PromQL query that lets me do it.
Welcome to SO.
Out of all the requests, how many are getting served under 1000ms? To find that, I would divide the number of requests served under 1000ms by the total number of requests. You are basically measuring your SLI. In my GCP world, it translates to a query like this:
(sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="abcxyz",le="1000"}[1m]))/sum(rate(istio_request_duration_milliseconds_count{reporter="destination",namespace="abcxyz"}[1m])))*100
Once you have a graph set up with the above query in Grafana, you can set up an alert on anything below 93; that way you are alerted even before you reach your SLO of 90%.
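For the alert itself, the same expression can be compared against the 93% threshold mentioned above; a sketch reusing the query as-is (the threshold and label values are taken from the example above, not a recommendation):
(sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="abcxyz",le="1000"}[1m]))/sum(rate(istio_request_duration_milliseconds_count{reporter="destination",namespace="abcxyz"}[1m])))*100 < 93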
Prometheus doesn't provide a function that could be used for calculating the share (aka the percentage) of requests served in under one second from histogram buckets. But such a function exists in VictoriaMetrics, a Prometheus-like monitoring system I work on: histogram_share(). For example, the following query returns the share of requests with durations smaller than one second served during the last hour:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
Then the following query can be used for alerting when the share of requests served in less than one second drops below 90%:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le)) < 0.9
Please note that all the functions that work over histogram buckets return estimated results. Their accuracy highly depends on the chosen histogram bucket boundaries. See this article for details.
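For comparison, when the threshold happens to line up exactly with a bucket boundary, roughly the same share can be computed in plain PromQL with the bucket-to-count ratio used in the first answer; a sketch assuming your histogram has a one-second bucket (the le="1" value depends on your actual bucket layout):
sum(rate(http_request_duration_seconds_bucket{le="1"}[1h])) / sum(rate(http_request_duration_seconds_count[1h]))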

How long Prometheus timeseries last without an update

If I send a gauge to Prometheus then the payload has a timestamp and a value like:
metric_name {label="value"} 2.0 16239938546837
If I query it on Prometheus I can see a continuous line. Without sending a payload for the same metric, the line stops. Sending the same metric again after some minutes, I get another continuous line, but it is not connected with the old line.
Is it fixed in Prometheus how long a timeseries lasts without getting an update?
I think the first answer by Marc is in a different context.
Any timeseries in Prometheus goes stale in 5m by default if the collection stops - https://www.robustperception.io/staleness-and-promql. In other words, the line stops on the graph (or in Grafana).
So if you resume the metrics collection again within 5 minutes, then it will connect the line by default. But if there is no collection for more than 5 minutes, then it will show a disconnect on the graph. You can tweak that in Grafana to ignore drops, but that's not ideal in some cases, as you do want to see when the collection stopped instead of getting the false impression that there was continuous collection. Alternatively, you can avoid the disconnect using functions like avg_over_time(metric_name[10m]) as needed.
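For example, reusing the gauge from the question, something along these lines smooths over gaps shorter than the window (the 10m window is just an illustrative choice):
avg_over_time(metric_name{label="value"}[10m])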
There are two questions here:
1. How long does Prometheus keep the data?
This depends on the configuration you have for your storage. By default, on local storage, Prometheus has a retention of 15 days. You can find out more in the documentation. You can also change this value with the --storage.tsdb.retention.time option.
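For example, to keep 30 days of data instead of the default (the 30d value is purely illustrative):
prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=30d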
2. When will I have a "hole" in my graph?
The line you see on a graph is made by joining each point from each scrape. Those scrapes are done regularly based on the scrape_interval value you have in your scrape_config. So basically, if you have no data during one scrape, then you'll have a hole.
So there is no definitive answer; this depends essentially on your scrape_interval.
Note that if you're using a function that evaluates metrics over a certain amount of time, then missing one scrape will not alter your graph. For example, using a rate[5m] will not alter your graph if you scrape every 1m (as you'll have 4 other samples to compute the rate).

Measure service latency with Prometheus

I am new to Prometheus and Grafana. My primary goal is to get the response time per request.
For me it seemed to be a simple thing - but whatever I do I do not get the results I require.
I need to be able to analyse the service latency in the last minutes/hours/days. The current implementation I found was a simple SUMMARY (without definition of quantiles) which is scraped every 15s.
Is it possible to get the average request latency of the last minute from my Prometheus SUMMARY?
If YES: How? If NO: What should I do?
Currently I am using the following query:
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1m])
I am getting two "datasets". The value of the first is "NaN". I suppose this is the result of a division by zero.
(I am using spring-client).
Your query is correct. The result will be NaN if there have been no queries in the past minute.
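If the NaN values are mostly a problem during quiet periods, one option is to average over a longer window so that some requests usually fall inside it; a sketch (the 5m window is an illustrative choice, and the average will react more slowly to latency changes):
rate(http_response_time_sum{application="myapp",handler="myHandler",status="200"}[5m]) / rate(http_response_time_count{application="myapp",handler="myHandler",status="200"}[5m])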

What is the meaning of OneMinuteRate in JMX?

I am trying to calculate the Read/second and Write/Second in my Cassandra 2.1 cluster. After searching and reading, I came to know about JMX bean
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Here I can see oneMinuteRate. I have started a brand new cluster and started collecting these metrics from 0.
When I wrote my first record, I can see
Count = 1
OneMinuteRate = 0.01599111...
Does it mean that my write/s is 0.0159911?
Or does it mean that based on 1 minute data, my write latency is 0.01599 where Write Latency refers to the response time for writing a record?
Please help me understand the value.
Thanks.
It means that in the last minute, your writes were occurring at a rate of .01599 writes per second. Think about it this way: the rate of writes in the last 60 seconds would be
WritesInLastMinute ÷ 60
So in your case
1 ÷ 60 = 0.0166
Or more precisely, .01599.
If you observed no further writes after that, the value would descend down to zero over the next minute.
OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate are exponential moving averages, meaning they are not simply dividing readings by time; instead, as the name implies, they take an exponential series of averages, as below:
result(t) = (1 - w) * result(t - 1) + w * event_this_period
where w is the weighting factor and t is the ticking time. In other words, they take roughly 20% of the new reading and 80% of the old readings; it's the same way UNIX systems measure CPU load.
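As a rough worked example using the 20%/80% weighting described above (the real weighting factor used by the metrics library may differ): starting from the 0.0166 rate in the first answer and seeing no new events in the next period,
result = 0.8 * 0.0166 + 0.2 * 0 ≈ 0.0133
and one more quiet period gives 0.8 * 0.0133 ≈ 0.0106, so the value decays towards zero rather than dropping to it immediately.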
However, if this applies to requests that the server receives, below is a chart from one request to a server, with measurements taken by Dropwizard.
As you can see, from one request a curve is drawn over time. It's really useful for determining trends, but I'm not sure it is great for monitoring live traffic, especially critical traffic.

How does OpenTSDB downsample data

I have a 2 part question regarding downsampling on OpenTSDB.
The first is I was wondering if anyone knows whether OpenTSDB takes the last end point inclusive or exclusive when it calculates downsampling, or does it count the end data point twice?
For example, if my time interval is 12:30pm-1:30pm, I get DPs every 5 min starting at 12:29:44pm, and my downsample interval is summing every 10 minute block, does the system take the DPs from 12:30-12:39 and sum them, then 12:40-12:49 and sum them, etc., or does it take the DPs from 12:30-12:40, then from 12:40-12:50, etc.? Yes, I know my data is off by 15 sec, but I don't control that.
I've tried to calculate it by hand but the data I have isn't helping me. The numbers I'm calculating aren't adding up to the above, nor are they matching what the graph is showing. I don't have access to the system that's pushing numbers into OpenTSDB, so I can't set up dummy data to check.
The second question is how downsampling plots its points on the graph from my time range and downsample interval. I set downsample to sum 10 min blocks. I set my range to 12:30pm-1:30pm. The graph shows the first point of the downsampled graph at 12:35pm. That makes logical sense. I changed the range to 12:24pm-1:29pm and expected the first point to start at 12:30, but the first point shown is 12:25pm.
Hopefully someone can answer these questions for me. In the meantime, I'll continue trying to find some data in my system that helps show/prove how downsampling should work.
Thanks in advance for your help.
Downsampling isn't currently working the way you expect, although since this is a reasonable and commonly made expectation, we are thinking of changing this in a later release of OpenTSDB.
You're assuming that if you ask for a "10 min sum", the data points will be summed up within each "round" (or "aligned") 10 minute block (e.g. 12:30-12:39 then 12:40-12:49 in your example), but that's not what happens. What happens is that the code will start a 10-minute block from whichever data point is the first one it finds. So if the first one is at time 12:29:44, then the code will sum all subsequent data points until 600 seconds later, meaning until 12:39:44.
Within each 600 second block, there may be a varying number of data points. Some blocks may have more data points than others. Some blocks may have unevenly spaced data points, e.g. maybe all the data points are within one second of each other at the beginning of the 600s block. So in order to decide what timestamp will result from the downsampling operation, the code uses the average timestamp of all the data points of the block.
So if all your data points are evenly spaced throughout your 600s block, the average timestamp will fall somewhere in the middle of the block. But if, say, all the data points are within one second of each other at the beginning of the 600s block, then the timestamp returned will reflect that by virtue of being an average. Just to be clear, the code takes an average of the timestamps regardless of which downsampling function you picked (sum, min, max, average, etc.).
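As a worked example with the 5-minute spacing from the question: if one 600 s block contains data points at 12:29:44, 12:34:44 and 12:39:43, the downsampled sum of those three values would be plotted at roughly 12:34:44, the average of the three timestamps.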
If you want to experiment quickly with OpenTSDB without writing to your production system, consider setting up a single-node OpenTSDB instance. It's very easy to do as is shown in the getting started guide.
