How to graph individual Summary metric instances in Prometheus?

I'm using Prometheus' Summary metric to collect the latency of an API call. Instead of making an actual API call, I simply call Thread.sleep(1000) to simulate a one-second latency; this makes the Summary show a value of .01 (for 1 second of latency). But if, for example, I invoke Thread.sleep(1000) twice within the same minute, the Summary ends up showing .02 (for 2 seconds of latency) instead of two individual instances of .01 latency that happened to occur in the same minute. My problem is the Prometheus query. The query I am currently using is:
rate(my_custom_summary_sum[1m])
What should my Prometheus query be so that I can see the latency of each individual Thread.sleep(1000) invocation? As it stands, the Summary metric collects and displays the total latency sum per minute. How can I display the latency of each individual call to Thread.sleep(1000) (i.e. each simulated API request)?
import io.prometheus.client.Summary;

private static final Summary mySummary = Summary.build()
    .name("my_custom_summary")
    .help("This is a custom summary that keeps track of latency")
    .register();

Summary.Timer requestTimer = mySummary.startTimer(); // start the timer for the mySummary 'Summary' metric
Thread.sleep(1000);                                  // sleep for one second
requestTimer.observeDuration();                      // record the time elapsed
This is the graph that results from this query:
[Prometheus graph of the rate() query above]

Prometheus is a metrics-based monitoring system; it cares about overall performance and behaviour, not about individual requests.
What you are looking for is a logs-based system, such as Graylog or the ELK stack.
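That said, a Summary can still give you the average latency per request over a window. A minimal PromQL sketch, assuming the my_custom_summary metric above (the Java client exposes matching _sum and _count series):

# average observed duration per call over the last 5 minutes
rate(my_custom_summary_sum[5m]) / rate(my_custom_summary_count[5m])

This yields the mean latency per request across the window, not one data point per call; individual observations are simply not recoverable from a Summary.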

Related

How long do Prometheus timeseries last without an update

If I send a gauge to Prometheus, the payload has a timestamp and a value, like:
metric_name{label="value"} 2.0 16239938546837
If I query it in Prometheus I can see a continuous line. When I stop sending payloads for the metric, the line stops; sending the same metric again after some minutes, I get another continuous line, but it is not connected to the old one.
Is it fixed in Prometheus how long a timeseries lasts without getting an update?
I think the first answer by Marc is in a different context.
Any timeseries in Prometheus goes stale after 5m by default if collection stops: https://www.robustperception.io/staleness-and-promql. In other words, the line stops on the graph (or in Grafana).
So if you resume the metrics collection within 5 minutes, the line is connected by default. But if there is no collection for more than 5 minutes, the graph shows a disconnect. You can tweak Grafana to ignore drops, but that is not ideal in some cases, since you do want to see when collection stopped rather than get the false impression that it was continuous. Alternatively, you can avoid the disconnect with a range function such as avg_over_time(metric_name[10m]), as sketched below.
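A minimal sketch (metric_name stands in for your gauge):

# plot a 10-minute moving average; a point is drawn as long as at least
# one sample falls in the window, so gaps shorter than 10m are bridged
avg_over_time(metric_name[10m])

Note the trade-off: the plotted value is a smoothed average, and genuine outages shorter than the window are hidden.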
There are two questions here:
1. How long does Prometheus keep the data?
This depends on your storage configuration. By default, on local storage, Prometheus has a retention of 15 days. You can find out more in the documentation, and you can change this value with the --storage.tsdb.retention.time option.
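For example, to keep 30 days of data (30d is just an illustrative value):

prometheus --storage.tsdb.retention.time=30d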
2. When will I have a "hole" in my graph?
The line you see on a graph is made by joining the points from each scrape. Scrapes happen regularly, based on the scrape_interval value in your scrape_config. So, basically, if you have no data for one scrape, you'll have a hole.
There is no definitive answer; it depends essentially on your scrape_interval.
Note that if you're using a function that evaluates metrics over a time range, missing one scrape will not alter your graph. For example, using rate(metric[5m]) will not alter your graph if you scrape every 1m, as you'll still have 4 other samples to compute the rate from, as sketched below.
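A minimal PromQL sketch of that last point (the metric and job names are hypothetical):

# with scrape_interval: 1m, the 5m window normally holds ~5 samples;
# one missed scrape still leaves 4, so rate() keeps drawing a line
rate(http_requests_total{job="my_app"}[5m])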

Log data into Grafana every minute

The code below reports five metrics (count, oneMinuteRate, fiveMinuteRate, fifteenMinuteRate, meanRate) from my application into Graphite every 30 seconds.
// mr is a com.codahale.metrics.MetricRegistry
public void collectMetric(String metricName, long metricValue) {
    mr.meter(metricName).mark(metricValue); // mark the meter with the given number of events
}
I would like to show on the Grafana dashboard the number of requests received every minute (i.e. if 60 are received in the first minute and 120 in the second, show 60 and 120). The count of the meter metric above only ever increases, and all the *Rate values are events per second, so I am not sure how to chart the number of requests received per minute. Any advice is highly appreciated.
Alternatively, suppose I use
mr.counter(metricName).inc(metricValue);
Is there a way to reset the counters every 1 minute?
I had the same problem, and the way I found to resolve it was in Grafana.
In the panel's metrics tab, you can add a function to your query.
You can try the derivative() function or the perSecond() function. These functions are not completely reliable; it depends on what you're doing with them in your panel.
But with these you'll see the number of inputs over time rather than the ever-growing total.
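A minimal Graphite sketch of that idea (the metric path myapp.requests.count is hypothetical):

scale(perSecond(myapp.requests.count), 60)

perSecond() converts the ever-increasing counter into a per-second rate (ignoring counter resets), and scale(..., 60) turns that into requests per minute, so there is no need to reset the counter at all.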

Prometheus duration collection

Is it possible to calculate operation duration in Prometheus when the data is collected as a start time (epoch) and an end time (epoch), grouped by some form of transaction trace?
Once the duration is known, the data can then be used to provide a Prometheus Apdex.
Use case:
The API fires (and essentially forgets) a start-operation request
Multiple distributed (message-based) processes take place
A final message fires and indicates the end of the operation
Prometheus is a metrics-based system, not a tracing system. The most you can do is track a summary of how long all your requests are taking using the Summary or Histogram metric types.
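A minimal Java sketch of that suggestion (the metric name and the epoch variables are hypothetical, assuming the simpleclient library):

import io.prometheus.client.Histogram;

static final Histogram operationDuration = Histogram.build()
    .name("operation_duration_seconds")
    .help("End-to-end operation duration in seconds")
    .buckets(0.5, 1, 2, 5, 10, 30)
    .register();

// when the final message arrives, observe end minus start (both in epoch seconds)
operationDuration.observe(endEpochSeconds - startEpochSeconds);

The _bucket series this produces can then back an Apdex-style expression, e.g. the fraction of operations that complete within a chosen threshold.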

Measure service latency with Prometheus

I am new to Prometheus and Grafana. My primary goal is to get the response time per request.
It seemed like a simple thing to me, but whatever I do, I do not get the results I require.
I need to be able to analyse the service latency over the last minutes/hours/days. The current implementation I found was a simple SUMMARY (without quantiles defined) which is scraped every 15s.
Is it possible to get the average request latency of the last minute from my Prometheus SUMMARY?
If YES: How? If NO: What should I do?
Currently I am using the following query:
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1m])
I am getting two "datasets". The value of the first is "NaN"; I suppose this is the result of a division by zero.
(I am using spring-client).
Your query is correct. The result will be NaN if there have been no queries in the past minute.
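If you want fewer NaN gaps, a common tweak is to widen the window (the labels below are copied from the question):

rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[5m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[5m])

With a 5m window the division only returns NaN after five idle minutes, at the cost of a smoother, more lagged average.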

What is the meaning of OneMinuteRate in JMX?

I am trying to calculate the reads/second and writes/second in my Cassandra 2.1 cluster. After searching and reading, I came to know about the JMX bean
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Here I can see OneMinuteRate. I have started a brand-new cluster and collected these metrics from 0.
When I insert my first record, I can see
Count = 1
OneMinuteRate = 0.01599111...
Does it mean that my writes/s value is 0.0159911?
Or does it mean that, based on the last 1 minute of data, my write latency is 0.01599, where write latency refers to the response time for writing a record?
Please help me understand the value.
Thanks.
It means that over the last minute, your writes were occurring at a rate of .01599 writes per second. Think about it this way: the rate of writes in the last 60 seconds would be
WritesInLastMinute ÷ 60
So in your case
1 ÷ 60 = 0.0166
Or more precisely, .01599.
If you observed no further writes after that, the value would descend down to zero over the next minute.
OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate are exponential moving averages, meaning they are not simply readings divided by time; as the name implies, they take an exponentially weighted series of averages, as below:
result(t) = (1 - w) * result(t - 1) + w * event_this_period
where w is the weighting factor and t is the tick time. In other words, they take, say, 20% of the new reading and 80% of the old readings; it's the same way UNIX systems measure CPU load.
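A minimal Java sketch of that decay, assuming Dropwizard-style constants (5-second ticks, so w = 1 - e^(-5/60) ≈ 0.08 for the one-minute rate):

double w = 1 - Math.exp(-5.0 / 60.0); // weighting factor, ≈ 0.08
double rate = 1.0 / 60.0;             // one write in the first minute ≈ 0.0167 writes/s
for (int tick = 0; tick < 12; tick++) { // simulate one idle minute (12 five-second ticks)
    rate = (1 - w) * rate + w * 0.0;    // no new events this period
}
System.out.printf("rate after one idle minute = %.5f%n", rate); // ≈ 0.00613, decaying toward 0

This matches the behaviour described above: with no further writes, the reported rate decays exponentially toward zero rather than dropping instantly.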
However, when this applies to requests that a server receives, here is what a single request looks like, as measured by Dropwizard:
[Chart: OneMinuteRate decaying over time after a single request]
As you can see, one request draws a curve over time. That is really useful for spotting trends, but I am not sure these rates are great for monitoring live, and especially critical, traffic.
