How to count the number of metrics sent to Datadog over a 24 hour period? - monitoring

I have a situation where I'm trying to count the number of files loaded into the system I am monitoring. I'm sending a "load time" metric to Datadog each time a file is loaded, and I need to send an alert whenever an expected file does not appear. To do this, I was going to count up the number of "load time" metrics sent to Datadog in a 24 hour period, then use anomaly detection to see whether it was less than the normal number expected. However, I'm having some trouble finding a way to consistently pull out this count for use in the alert.
I can't use the count_nonzero function, as some of my files are empty and have a load time of 0. I do know about .as_count() and count:metric{tags}, but I haven't found a way to include an evaluation interval with either of these. I've tried using .rollup(count, time) to count up the metrics sent, but this call seems to return variable results depending on the rollup interval. For instance, if I compare intervals of 2000 and 4000 seconds, I would expect each 4000-second interval to show roughly the sum of two 2000-second intervals over the same time period. This does not seem to be what happens at all - the counts for the smaller intervals add up to much more than the count for the larger one. Additionally, some rollup intervals display decimal numbers as counts, which does not make any sense to me if this function is doing what I thought it was.
Does anyone have any thoughts on how to accomplish this? I'd really appreciate any new ideas.
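One way to sidestep the rollup behaviour is to have the loader emit an explicit count metric in addition to the load-time gauge, so the monitor only has to sum it over 24 hours. A minimal sketch, assuming the java-dogstatsd-client library and hypothetical metric, tag, and host/port values:

import com.timgroup.statsd.NonBlockingStatsDClient;
import com.timgroup.statsd.StatsDClient;

public class FileLoadMetrics {
    // hypothetical prefix/host/port; the client class is from java-dogstatsd-client
    private static final StatsDClient statsd =
            new NonBlockingStatsDClient("loader", "localhost", 8125);

    public static void recordLoad(String fileName, double loadTimeSeconds) {
        // the existing metric: a load time of 0 is legitimate for empty files
        statsd.gauge("file.load_time", loadTimeSeconds, "file:" + fileName);
        // one increment per loaded file; summing this over 24 h gives the file count
        statsd.incrementCounter("file.loaded", "file:" + fileName);
    }
}

A counter submitted this way can then be summed directly over the monitor's 24-hour evaluation window (e.g. with .as_count()), independent of any rollup applied to the load-time gauge.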

Related

Log data into Grafana every minute

The code below registers five metrics (count, oneMinuteRate, fiveMinuteRate, fifteenMinuteRate, meanRate) into Graphite every 30 seconds from my application.
public void collectMetric(String metricName, long metricValue) {
    // mr is assumed to be a Dropwizard Metrics MetricRegistry
    mr.meter(metricName).mark(metricValue);
}
I would like to show in the Grafana dashboard the number of requests received every minute (i.e. if 60 are received in the first minute, 120 in the second minute, and so on). Since the count in the meter metric above just keeps increasing and all the *Rate values are events per second, I am not sure how to plot a metric in the Grafana dashboard that displays the number of requests received per minute. Any advice is highly appreciated.
Suppose I use mr.counter(metricName).inc(metricValue) instead. Is there a way to reset the counters every minute?
I had the same problem. The way I found to resolve it was in Grafana.
When you're on the panel's Metrics tab, you can add a function to your query.
You can try the derivative() or perSecond() function, but these functions are not completely reliable; it depends on what you're doing with them in your panel.
With these you'll see the number of inputs over time rather than the running total.
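If you prefer the counter idea from the question over dashboard-side functions, one application-side sketch (assuming mr is a Dropwizard Metrics MetricRegistry; the metric names and the one-minute schedule are made up) is to keep the counter monotonic and publish the per-minute delta as a separate gauge:

import com.codahale.metrics.Counter;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PerMinuteRequests {
    private final MetricRegistry mr = new MetricRegistry();
    private final Counter requests = mr.counter("requests.total");
    private final AtomicLong lastCount = new AtomicLong();
    private final AtomicLong perMinute = new AtomicLong();

    public PerMinuteRequests() {
        // expose the per-minute delta as a gauge that the Graphite reporter ships as-is
        mr.register("requests.perMinute", (Gauge<Long>) perMinute::get);

        // once a minute, diff the monotonic counter instead of resetting it
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            long now = requests.getCount();
            perMinute.set(now - lastCount.getAndSet(now));
        }, 1, 1, TimeUnit.MINUTES);
    }

    public void collectMetric(String metricName, long metricValue) {
        requests.inc(metricValue);
    }
}

Resetting the counter itself would also work, but a reset races with the reporter's push schedule and can lose or double-count increments; diffing a monotonic counter avoids that.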

Grafana + Prometheus: Display single stat of how often an event occurred

How do I use Prometheus + Grafana to tell how many times an event occurred during a given time period?
I have a Prometheus counter that I increment every time this event happens. I would like to display it in a Singlestat number. It seems like this should be as simple as:
sum(increase(some_event_happened{application="example-app"}[$__range]))
And the display set to "Current" value.
However, this gives numbers that are much higher than the actual number of events in the given range. Also, it seems to vary based on how much I offset the range, and how large the range is.
More importantly, it crashes our Prometheus server with an out of memory error when I have three or four of these on a single dashboard.
I've tried setting a recording rule to address the crashes, but I haven't figured out the right way to slice up the recording rule so that it can still be displayed over the Grafana range.
So in summary, I want a Singlestat displaying the number of times an event happened in the current time range set in the Grafana dashboard. It seems like this is a very basic thing for a monitoring system. Am I just using the wrong approach?
I've encountered similar issues and they appear to be due to discrepancies between the query interval (in Prometheus) and the min step (in Grafana). Try using this global, built-in variable for your interval, which will make sure Prometheus is always in sync with the Grafana step: $__interval.
sum(increase(some_event_happened{application="example-app"}[$__interval]))
http://docs.grafana.org/reference/templating/
https://www.stroppykitten.com/technical/prometheus-grafana-statistics
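For context, a hedged sketch of how the counter behind these queries is typically defined and incremented with the Prometheus Java simpleclient (the metric and label names are taken from the question; the choice of client library is an assumption):

import io.prometheus.client.Counter;

public class EventMetrics {
    static final Counter someEventHappened = Counter.build()
            .name("some_event_happened")
            .help("Number of times the event happened")
            .labelNames("application")
            .register();

    public static void onEvent() {
        // monotonically increasing; increase()/rate() in PromQL derive deltas from it
        someEventHappened.labels("example-app").inc();
    }
}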

Prometheus increase not handling process restarts

I am trying to figure out the behavior of Prometheus' increase() querying function with process restarts.
When there is a process restart within a 2m interval and I query:
sum(increase(my_metric_total[2m]))
I get a value less than expected.
For example, in a simple experiment I mock:
3 lcm_restarts
1 process restart
2 lcm_restarts
All within a 2 minute interval.
Upon querying:
sum(increase(lcm_restarts[2m]))
I receive a value of ~4.5 when I am expecting 5.
(Screenshots: the lcm_restarts graph and the sum(increase(lcm_restarts[2m])) result.)
Could someone please explain?
Pretty concise and well-prepared first question here. Please keep this spirit!
When working with counters, functions such as rate(), irate() and also increase() adjust for resets caused by restarts. Contrary to what the name suggests, the increase() function does not calculate the absolute increase in the given time frame; it is just another way of writing rate(metric[interval]) * number_of_seconds_in_interval. The rate() function takes the first and the last measurement inside the window and calculates the per-second increase over that time. This is why you may observe non-integer increases even though you always increment by whole numbers: the measurements almost never fall exactly at the start and end of the interval.
For more details about this, please have a look at the Prometheus docs for the increase() function. There are also some good hints on what to do and what not to do when working with counters in the Robust Perception blog.
Looking at your label dimensions, I also think that counter resets don't apply to your constructed example. There is a label called reason that changed between the restarts and so created a second time series (rather than continuing the existing one). You are therefore summing the increases of two different time series, each of which has its own extrapolation applied.
So basically there isn't really anything wrong with what you are doing; you just shouldn't rely on getting highly precise numbers out of Prometheus for your use case.
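As a toy illustration of that explanation (hypothetical samples, using the simplified first-to-last model described above rather than the exact Prometheus implementation):

public class IncreaseToy {
    public static void main(String[] args) {
        // hypothetical samples inside a 120 s ([2m]) window:
        // value 1 observed at t = 15 s, value 5 observed at t = 105 s
        double first = 1, last = 5;
        double tFirst = 15, tLast = 105;
        double window = 120;

        double perSecond = (last - first) / (tLast - tFirst); // ~0.044 per second
        double extrapolated = perSecond * window;             // ~5.33: non-integer, from integer samples
        System.out.println(perSecond + " per second, increase ~ " + extrapolated);
    }
}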
Prometheus may return unexpected results from the increase() function for the following reasons:
Prometheus may return fractional results from increase() over an integer counter because of extrapolation. See this issue for details.
Prometheus may return lower-than-expected results from increase(m[d]) because it doesn't take into account the possible counter increase between the last raw sample just before the specified lookbehind window [d] and the first raw sample inside that window. See this article and this comment for details.
Prometheus skips the increase for the first sample in a time series. For example, increase() over the series of samples 10 11 11 would return 1 instead of 11. See these docs for details.
These issues are going to be fixed according to this design doc. In the meantime it is possible to use other Prometheus-like systems such as VictoriaMetrics, which are free from these issues.

Measure service latency with Prometheus

I am new to Prometheus and Grafana. My primary goal is to get the response time per request.
For me it seemed to be a simple thing - but whatever I do I do not get the results I require.
I need to be able to analyse the service latency in the last minutes/hours/days. The current implementation I found was a simple SUMMARY (without definition of quantiles) which is scraped every 15s.
Is it possible to get the average request latency of the last minute from my Prometheus SUMMARY?
If YES: How? If NO: What should I do?
Currently I am using the following query:
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1m])
I am getting two "datasets". The value of the first is "NaN". I suppose this is the result of a division by zero.
(I am using spring-client).
Your query is correct. The result will be NaN if there have been no requests in the past minute: both rates are then zero, and 0/0 is NaN.
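For completeness, a sketch of how such a summary is usually registered with the Prometheus Java simpleclient (metric and label names follow the question; whether your spring-client setup wraps this exact API is an assumption), showing where the _sum and _count series in the query come from:

import io.prometheus.client.Summary;

public class LatencyMetrics {
    // exposes http_response_time_sum and http_response_time_count, which the query divides
    static final Summary responseTime = Summary.build()
            .name("http_response_time")
            .help("Response time in seconds")
            .labelNames("application", "handler", "status")
            .register();

    public static void handle() {
        Summary.Timer timer = responseTime.labels("myapp", "myHandler", "200").startTimer();
        try {
            doWork(); // hypothetical request handling
        } finally {
            timer.observeDuration();
        }
    }

    private static void doWork() { /* placeholder */ }
}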

How does OpenTSDB downsample data

I have a 2 part question regarding downsampling on OpenTSDB.
The first is: I was wondering if anyone knows whether OpenTSDB treats the last end point as inclusive or exclusive when it calculates downsampling, or whether it counts the end data point twice.
For example, if my time interval is 12:30pm-1:30pm, I get DPs every 5 min starting at 12:29:44pm, and my downsample interval sums every 10-minute block, does the system take the DPs from 12:30-12:39 and sum them, then 12:40-12:49, and so on, or does it take the DPs from 12:30-12:40, then from 12:40-12:50, etc.? Yes, I know my data is off by 15 sec, but I don't control that.
I've tried to calculate it by hand, but the data I have isn't helping me. The numbers I'm calculating aren't adding up to the above, nor are they matching what the graph is showing. I don't have access to the system that's pushing numbers into OpenTSDB, so I can't set up dummy data to check.
The second question is: how does downsampling plot its points on the graph given my time range and downsample interval? I set downsample to sum 10-minute blocks. I set my range to 12:30pm-1:30pm. The graph shows the first point of the downsampled graph at 12:35pm. That makes logical sense. I changed the range to 12:24pm-1:29pm and expected the first point to start at 12:30, but the first point shown is 12:25pm.
Hopefully someone can answer these questions for me. In the meantime, I'll continue trying to find some data in my system that helps show/prove how downsampling should work.
Thanks in advance for your help.
Downsampling isn't currently working the way you expect, although since this is a reasonable and commonly made expectation, we are thinking of changing this in a later release of OpenTSDB.
You're assuming that if you ask for a "10 min sum", the data points will be summed up within each "round" (or "aligned") 10 minute block (e.g. 12:30-12:39 then 12:40-12:49 in your example), but that's not what happens. What happens is that the code will start a 10-minute block from whichever data point is the first one it finds. So if the first one is at time 12:29:44, then the code will sum all subsequent data points until 600 seconds later, meaning until 12:39:44.
Within each 600 second block, there may be a varying number of data points. Some blocks may have more data points than others. Some blocks may have unevenly spaced data points, e.g. maybe all the data points are within one second of each other at the beginning of the 600s block. So in order to decide what timestamp will result from the downsampling operation, the code uses the average timestamp of all the data points of the block.
So if all your data points are evenly spaced throughout your 600s block, the average timestamp will fall somewhere in the middle of the block. But if, say, all the data points are within one second of each other at the beginning of the 600s block, then the timestamp returned will reflect that by virtue of being an average. Just to be clear, the code takes the average of the timestamps regardless of which downsampling function you picked (sum, min, max, average, etc.).
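A toy model of the behaviour described above (hypothetical timestamps and values, 600-second sum blocks), just to make the block boundaries and averaged timestamps concrete:

import java.util.ArrayList;
import java.util.List;

public class DownsampleSketch {
    public static void main(String[] args) {
        long[] ts =    {  0, 300, 600, 900, 1200, 1500 }; // seconds, hypothetical samples
        double[] val = {  1,   2,   3,   4,    5,    6 };
        long interval = 600;

        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < ts.length) {
            long blockStart = ts[i]; // blocks start at the first point seen, not at aligned boundaries
            double sum = 0;
            long tsSum = 0;
            int n = 0;
            while (i < ts.length && ts[i] < blockStart + interval) {
                sum += val[i];
                tsSum += ts[i];
                n++;
                i++;
            }
            out.add("t=" + (tsSum / n) + " sum=" + sum); // output timestamp = average of inputs
        }
        System.out.println(out); // [t=150 sum=3.0, t=750 sum=7.0, t=1350 sum=11.0]
    }
}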
If you want to experiment quickly with OpenTSDB without writing to your production system, consider setting up a single-node OpenTSDB instance. It's very easy to do as is shown in the getting started guide.
