Grafana: Panel with time of last result

I have an elasticsearch instance that receives logs from multiple backup routines. I'd like to query ES for these logs from Grafana and set up a panel that shows the last time for the different backups. Ideally I would also like to be able to show this in color if the time is longer than a certain threshold.
Basically the idea is to have a display that shows, for instance, green if a certain backup has been completed in the last 24 hours, and red if it hasn't.
How would I do this in Grafana with ES as the datasource?

The exact implementation depends on the panel you use.
Example for the Singlestat panel: write your ES query and then select Stat: Time of last point; you may need to pick a suitable unit/format.
Unfortunately, Grafana doesn't understand thresholds in your requested time format (older than 24 hours). You will need to return the value as a metric (for example, the age of the last backup in seconds), which means writing a query for that. That leaves you with two stats to show (last time + age), so you won't be able to use Singlestat. A Table panel will probably work better: there you can apply thresholds based on the age metric.

In addition to the great answer by Jan Garaj, it looks like there is work being done to make this type of thing much easier in the future.
Check out this issue to track progress.

Related

How long Prometheus timeseries last without an update

If I send a gauge to Prometheus then the payload has a timestamp and a value like:
metric_name {label="value"} 2.0 16239938546837
If I query it in Prometheus I can see a continuous line. Without sending a payload for the same metric, the line stops. Sending the same metric again after some minutes, I get another continuous line, but it is not connected to the old line.
Is there a fixed time in Prometheus for how long a timeseries lasts without getting an update?
I think the first answer by Marc is in a different context.
Any timeseries in Prometheus goes stale after 5m by default if the collection stops - https://www.robustperception.io/staleness-and-promql. In other words, the line stops on the graph (in Grafana).
So if you resume the metrics collection within 5 minutes, it will connect the line by default. But if there is no collection for more than 5 minutes, the graph will show a disconnect. You can tweak Grafana to ignore drops, but that is not ideal in some cases, as you do want to see when the collection stopped instead of getting the false impression that there was continuous collection. Alternatively, you can avoid the disconnect using functions like avg_over_time(metric_name[10m]) as needed.
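For instance, a minimal sketch of that workaround (metric_name is a placeholder; pick a window longer than the longest gap you want to bridge):
# keeps drawing a line as long as at least one sample exists in the
# last 10 minutes, at the cost of smoothing over short gaps
avg_over_time(metric_name[10m])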
There are two questions here:
1. How long does Prometheus keep the data?
This depends on the configuration you have for your storage. By default, on local storage, Prometheus has a retention of 15 days. You can find out more in the documentation. You can also change this value with the --storage.tsdb.retention.time option.
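For example, a minimal sketch of how that flag is passed on startup (the 30d value is only an illustration, not a recommendation):
# keep raw data for 30 days instead of the default 15
prometheus --storage.tsdb.retention.time=30d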
2. When will I have a "hole" in my graph?
The line you see on a graph is made by joining the points from each scrape. Those scrapes are done regularly, based on the scrape_interval value you have in your scrape_config. So basically, if you have no data during one scrape, you'll have a hole.
So there is no definitive answer; it depends essentially on your scrape_interval.
Note that if you're using a function that evaluates metrics over a certain amount of time, then missing one scrape will not alter your graph. For example, using rate(...[5m]) will not alter your graph if you scrape every 1m (as you'll have 4 other samples to compute the rate).
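A minimal sketch of that idea (http_requests_total and the 1m scrape interval are illustrative assumptions, not taken from the question):
# with scrape_interval: 1m, a 5m window normally holds ~5 samples,
# so a single missed scrape still leaves enough points for the rate
rate(http_requests_total[5m])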

Grafana + Prometheus: Display single stat of how often an event occurred

How do I use Prometheus + Grafana to tell how many times an event occurred during a given time period?
I have a Prometheus counter that I increment every time this event happens. I would like to display it in a Singlestat number. It seems like this should be as simple as:
sum(increase(some_event_happened{application="example-app"}[$__range]))
And the display set to "Current" value.
However, this gives numbers that are much higher than the actual number of events in the given range. Also, it seems to vary based on how much I offset the range, and how large the range is.
More importantly, it crashes our Prometheus server with an out of memory error when I have three or four of these on a single dashboard.
I've tried setting up a recording rule to address the crashes, but I haven't figured out the right way to slice up the recording rule and still be able to display the Grafana range.
So in summary, I want a Singlestat displaying the number of times an event happened in the current time range set in the Grafana dashboard. It seems like this is a very basic thing for a monitoring system. Am I just using the wrong approach?
I've encountered similar issues and they appear to be due to discrepancies between the query interval (in Prometheus) and the min step (in Grafana). Try using this global, built-in variable for your interval, which will make sure Prometheus is always in sync with the Grafana step: $__interval.
sum(increase(some_event_happened{application="example-app"}[$__interval]))
http://docs.grafana.org/reference/templating/
https://www.stroppykitten.com/technical/prometheus-grafana-statistics

Grafana Alerting when there is no change in data for x minutes

Been rolling around the web and forums, cannot find a resource on this.
What I aim to achieve is to create an alert for when there is no change in the data for a period of time.
We are monitoring open files for our webserver(s), so this number fluctuates rather often. We've noticed that when the number is stagnant it points to an issue on the server. So what we want is: if the open-file count remains at X for 2 minutes, alert us.
I made such an alert through a small succession of things:
I have an exclusive 'alerting dummy board' for all the alerts, since I can only have one alert per graph (Grafana version 6.6.0).
I use the following query: avg_over_time(delta(Sensor_Data[1m])[20s:]) - this calculates the 20-second average of the delta (last value minus first value) of a 1-minute window.
My data-gathering program feeds into Prometheus, and this in turn into Grafana. If this program freezes, it might continue sending the last value to Prometheus, and the above query will drop to strictly zero.
So I have an alert which goes off if the above query stays within the range (-0.01, 0.01) for a minute (a typical value of the above query with the system running is abs(query) > 0.18).
Thus, Grafana sends an alert if the Sensor_Data value does not change within about 2-3 minutes.
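If it helps, the detection query and the threshold above can be folded into a single PromQL expression, e.g. for a Prometheus alerting rule instead of a Grafana alert. This is only a sketch: it assumes Prometheus 2.7+ for subquery support and reuses the Sensor_Data name and the 0.01 band from the answer above.
# returns a result only while Sensor_Data has stopped changing;
# pair it with a 'for: 1m' clause so brief flat spots don't fire
abs(avg_over_time(delta(Sensor_Data[1m])[20s:])) < 0.01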
If you do use Prometheus and Alertmanager, there is a nice function that worked for me:
changes()
So using something like this will trigger an alert if there are no changes during the time interval:
changes(metric_name[5m]) == 0
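Applied to the open-files scenario from the question, a sketch could look like this (process_open_fds is the standard process metric exposed by Prometheus client libraries; the job label value is hypothetical, so substitute whatever your exporter actually exposes):
# returns a series (and can fire an alert) when the open file
# descriptor count has not changed at all over the last 2 minutes
changes(process_open_fds{job="webserver"}[2m]) == 0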
This has worked for me. Make sure you're using a rate or increase function (no change means it will drop to zero) and filter the query like the following (the 5m window is just an example):
increase(metric_name[5m]) > 0
Then, in Alert Config, set "If no data or all values are null" to "Alerting". That way, when there's no data, the alert will be triggered.

Prometheus increase not handling process restarts

I am trying to figure out the behavior of Prometheus' increase() querying function with process restarts.
When there is a process restart within a 2m interval and I query:
sum(increase(my_metric_total[2m]))
I get a value less than expected.
For example, in a simple experiment I mock:
3 lcm_restarts
1 process restart
2 lcm_restarts
All within a 2 minute interval.
Upon querying:
sum(increase(lcm_restarts[2m]))
I receive a value of ~4.5 when I am expecting 5.
(Screenshots: lcm_restarts graph; sum(increase(lcm_restarts[2m])) result)
Could someone please explain?
Pretty concise and well-prepared first question here. Please keep this spirit!
When working with counters, functions such as rate(), irate() and also increase() adjust for counter resets caused by restarts. Contrary to what the name suggests, the increase() function does not calculate the absolute increase in the given time frame but is a different way to write rate(metric[interval]) * number_of_seconds_in_interval. The rate() function takes the first and the last measurement in the window and calculates the per-second increase over that time. This is the reason why you may observe non-integer increases even if you always increment in whole numbers, as the measurements almost never fall exactly at the start and end of the interval.
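As a concrete illustration of that equivalence using the metric from the question (both expressions apply the same extrapolation and return the same value):
# increase() over 2m is just rate() over 2m scaled by 120 seconds
increase(lcm_restarts[2m])
rate(lcm_restarts[2m]) * 120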
For more details, please have a look at the Prometheus docs for the increase() function. There are also some good hints on what to do and what not to do when working with counters in the Robust Perception blog.
Having a look at your label dimensions, I also think that counter resets don't apply to your constructed example. There is one label called reason that changed between the restarts and so created a second time series (instead of continuing the existing one). So you are basically summing up the increases of two different time series, each of which has its own extrapolation applied.
So basically there isn't really anything wrong with what you are doing; you just shouldn't rely on getting highly precise numbers out of Prometheus for your use case.
Prometheus may return unexpected results from increase() function due to the following reasons:
Prometheus may return fractional results from increase() over an integer counter because of extrapolation. See this issue for details.
Prometheus may return lower than expected results from increase(m[d]) because it doesn't take into account possible counter increase between the last raw sample just before the specified lookbehind window [d] and the first raw sample inside the lookbehind window [d]. See this article and this comment for details.
Prometheus skips the increase for the first sample in a time series. For example, increase() over the following series of samples would return 1 instead of 11: 10 11 11. See these docs for details.
These issues are going to be fixed according to this design doc. In the meantime it is possible to use other Prometheus-like systems such as VictoriaMetrics, which are free from these issues.

How to cluster percentile of events by time delta?

After a mailing at t0, I will have several "delivered" (and open and click) events (schema and example)
mailing_name, timestamp, email_id, event_type
niceattack, 2016-07-14 12:11:00, 42, open
niceattack, 2016-07-14 12:11:08, 842, open
niceattack, 2016-07-14 12:11:34, 847, open
I would like to see, for a mailing, how long it takes to be delivered to half of the recipients. Say I'm sending an email to 1000 addresses now: the first open event comes in 2 minutes, the last one might come in a week (and min/max first/last seems easy to find), but what I'd like to see is that half of the recipients opened it in the first 2 hours after it was sent.
The goal is to be able to compare whether sending now vs. on Saturday morning makes a difference in how fast it's opened on average, or whether one specific mailing gets quicker exposure, and to correlate that with other events (how many click on a link, take a specific action on our site...).
I tried to use a cumulative function (how many open events for the mailing at each point), but it seems that a cumulative function isn't yet implemented: https://github.com/influxdata/influxdb/issues/813
How do you solve this problem with InfluxDB?
Solving this problem with InfluxDB alone is not currently possible; however, if you're willing to add Kapacitor into the mix, then it should be possible. In particular, you'll need to write a User Defined Function (UDF) in Kapacitor for that cumulative function.
The general process will look like the following:
Install and Configure Kapacitor
Create a UDF for the cumulative function you're looking for
Enable that UDF inside of Kapacitor
Write a TICKscript that uses the UDF and writes the results back to InfluxDB
Enable a task defined by the TICKscript you've written
Query the InfluxDB instance to get the results of the cumulative function.
My apologies for being so high-level on this. This is a fairly involved process, but it should give you the result you're looking for.
