Is a counter the right datatype to monitor revenue over time in Prometheus?

I am trying to understand which datatype is the right choice for monitoring a revenue metric in Prometheus.

A Gauge is appropriate because your revenue value (per period of time) may increase or decrease, and you can set a Gauge to any float value.
You can then use sum_over_time over a longer window (e.g. a month) to aggregate the (e.g. daily) revenue represented by the Gauge measurements.
The only alternative metric type would be a Counter, but Counter values may only increase.
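A minimal sketch of that approach (the metric name daily_revenue_total is hypothetical, and exactly one Gauge sample per day is assumed, e.g. written by a nightly batch job):

sum_over_time(daily_revenue_total[30d])

If the gauge is scraped more often than once per day, you would first need to reduce it to one value per day (for example with a recording rule or a subquery) before summing, otherwise every scrape would be counted.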
Prometheus Metric Types

Related

Calculate the InfluxDB average

I want to process values from InfluxDB in Grafana.
The end goal is to show how many miles the current vehicle has traveled in a certain time frame.
The formula I plan to use is: average velocity * time.
Does anyone have a good method for this?
So what I'm thinking is: I use the mean function to get the average speed over a fixed period of time and the corresponding mileage, and then I add all the mileage values together. How do I do that?
What if only SQL can be used?
1.) InfluxDB uses InfluxQL, not SQL.
2.) Your approach average velocity * time is inaccurate.
3.) Use suitable InfluxDB functions; I would say INTEGRAL() is the best function for this case, plus some basic arithmetic (a sketch follows below). Don't expect 100% accuracy. Accuracy depends heavily on the metric sampling: with 1-minute sampling, for example, what if the vehicle drives for 59 seconds and is stationary for the one second when the sample is taken? So don't be surprised when even 10-second sampling is inaccurate.
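A hedged InfluxQL sketch of that idea (measurement and field names are made up, and speed is assumed to be stored in km/h): INTEGRAL(field, unit) computes the area under the curve, so integrating speed over time with a one-hour unit yields kilometres.

SELECT INTEGRAL("speed_kmh", 1h) AS "distance_km"
FROM "vehicle_metrics"
WHERE time > now() - 24h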

Prometheus query for last local peak value

What Prometheus query (PromQl) can be used to identify the last local peak value in the last X minutes in a graph?
A local peak is a point that is larger than both its previous and next datapoint. (So the current time is definitely not a local peak.)
(p: peak point, i: cron job interval, m: missed execution)
I want this value in order to detect anomalies in the execution of a cron job. As you can see in the picture, I have written a query to calculate the elapsed time since the last execution of a job. Now, to set an alert rule that calculates the elapsed time since the last successful execution and catches missed executions, I need the time at which the last execution of the job occurred in that interval. The job's interval is unknown to the query (in other words, it is configured by another program), so I cannot compare the elapsed time with a fixed duration.
Use the z-score to detect anomalies
If you know the average value and standard deviation (σ) of a series, you can use any sample in the series to calculate the z-score. The z-score is measured in the number of standard deviations from the mean, so a z-score of 0 means the sample is identical to the mean of a data set with a normal distribution, a z-score of 1 is 1.0 σ away from the mean, and so on.
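In formula form (x is the sample, μ the mean and σ the standard deviation of the series):

z = (x − μ) / σ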
Calculate the average and standard deviation for the metric using data with a large sample size.
# Long-term average value for the series
- record: job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
  expr: avg_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
# Long-term standard deviation for the series
- record: job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w
  expr: stddev_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
Calculate the z-score for the Prometheus query once you have the average and standard deviation for the aggregation (job:cronjob_duration_time_seconds_count:rate10m below is assumed to be a recording rule for sum(rate(cronjob_duration_time_seconds_count[10m]))).
# Z-Score for aggregation
(
  job:cronjob_duration_time_seconds_count:rate10m -
  job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
) /
job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w
Based on the statistical properties of the normal distribution, you can assume that any value that falls outside the range of roughly +1 to -1 (i.e. more than one standard deviation from the mean) is an anomaly. For example, you can fire an alert when the aggregation is out of this range for more than five minutes.
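A sketch of such an alert rule, built on the recording rules above (the rule names are the ones assumed here, job:cronjob_duration_time_seconds_count:rate10m is the assumed recording rule for the raw aggregation, and the threshold of 1 mirrors the ±1 range just mentioned):

- alert: CronJobDurationAnomaly
  expr: |
    abs(
      (
        job:cronjob_duration_time_seconds_count:rate10m -
        job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
      )
      /
      job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w
    ) > 1
  for: 5m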
If what you want is an alert to be fired when the elapsed time has been longer than a fixed duration, you can set an alert similar to the classic up alert, based on a changes(...) > 0 expression, which is only true (i.e. > 0) while the job is running.
An example would be:
rules:
- alert: CronJobNotRunning
  expr: |
    changes(
      sum(
        rate(
          cronjob_duration_time_seconds_count{
            status="ok", namespace="<namespace>", exported_job="<job>"
          }[1m]
        )
      )[1m:]
    ) == 0
  for: <alert_duration>
Note that subqueries ([1m:]) are expensive, and introducing a recording rule there can help performance, especially in a dashboard.
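A sketch of what that could look like (reusing the labels from the alert above; the recording rule name is illustrative):

- record: job:cronjob_duration_time_seconds_count:rate1m
  expr: |
    sum(
      rate(
        cronjob_duration_time_seconds_count{
          status="ok", namespace="<namespace>", exported_job="<job>"
        }[1m]
      )
    )

The alert expression then becomes changes(job:cronjob_duration_time_seconds_count:rate1m[1m]) == 0, which no longer needs a subquery.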
Also, in your case, the time since the last time the second derivative was non-zero can be used too, as that happens when a job starts/finishes (the drops in the graph, or when it starts to rise).

Why do some Prometheus metric values return +Inf?

Occasionally, when I query Prometheus using the API endpoint, one or more of the metric values will be +Inf. What does +Inf mean, and what causes a metric value to be +Inf?
Additional info:
This data is coming from a Gauge metric type
The query is a simple sum query, i.e. sum(my_metric)
+Inf stands for positive infinity. It would be easier to say if you shared what you are measuring, but most probably you're getting this value because of a division by 0.
In any case, it's a perfectly valid float64 value.
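You can reproduce these special float values directly in PromQL; a quick sketch, not tied to your metric:

vector(1) / 0     # +Inf
vector(-1) / 0    # -Inf
vector(0) / 0     # NaN

An exporter can also expose +Inf directly, in which case a sum over such a series will itself be +Inf.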

Understanding histogram_quantile based on rate in Prometheus

According to the Prometheus documentation, in order to get the 95th percentile from a histogram metric I can use the following query:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Source: https://prometheus.io/docs/practices/histograms/#quantiles
Since each bucket of a histogram is a counter, we can calculate the rate of each bucket, which is defined as:
per-second average rate of increase of the time series in the range vector.
See: https://prometheus.io/docs/prometheus/latest/querying/functions/#rate
So, for instance, if bucket value[t-5m] = 100 and bucket value[t] = 200, then bucket rate[t] = (200-100)/(5*60) ≈ 0.33
And finally, the most confusing part: how can the histogram_quantile function find the 95th percentile for a given metric knowing all the bucket rates?
Is there any code or algorithm I can take a look to better understand it?
A solid example will explain histogram_quantile well.
Assumptions:
ONLY ONE series for simplicity
10 buckets for the metric http_request_duration_seconds (plus the implicit +Inf bucket):
10ms, 50ms, 100ms, 200ms, 300ms, 500ms, 1s, 2s, 3s, 5s
http_request_duration_seconds is a metric type of COUNTER
time  | value | delta | rate (quantity of items)
t-10m | 50    | N/A   | N/A
t-5m  | 100   | 50    | 50 / (5*60)
t     | 200   | 100   | 100 / (5*60)
...   | ...   | ...   | ...
We have at least two scrapes of the series covering 5 minutes, which is what rate() needs to calculate the quantity for each bucket.
rate_xxx(t) = (value_xxx[t] - value_xxx[t-5m]) / (5*60) is the per-second rate over [t-5m, t]; since histogram_quantile only uses the ratios between buckets, we can treat it as the quantity of items for [t-5m, t].
We are looking at two samples (value(t) and value(t-5m)) here.
Assume roughly 10000 HTTP request durations (items) were recorded in that window, that is,
10000 ≈ rate_10ms(t) + rate_50ms(t) + rate_100ms(t) + ... + rate_5s(t) + rate_+Inf(t).
bucket (le) | range      | rate_xxx(t)
10ms        | ~10ms      | 3000
50ms        | 10~50ms    | 3000
100ms       | 50~100ms   | 1500
200ms       | 100~200ms  | 1000
300ms       | 200~300ms  | 800
500ms       | 300~500ms  | 400
1s          | 500ms~1s   | 200
2s          | 1~2s       | 40
3s          | 2~3s       | 30
5s          | 3~5s       | 5
+Inf        | 5s~        | 5
Buckets are the essence of a histogram. We just need these per-bucket numbers in rate_xxx(t) to do the quantile calculation.
Let's take a close look at this expression (aggregation like sum() is omitted for simplicity)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
We are actually looking for the item at the 95th percentile among rate_xxx(t), counting from bucket=10ms up to bucket=+Inf. The 95th percentile means the 9500th item here, since we have 10000 items in total (10000 * 0.95).
From the table above, there are 9300 = 3000+3000+1500+1000+800 items together before bucket=500ms.
So the 9500th item is the 200th item (9500 - 9300) in bucket=500ms (range 300~500ms), which contains 400 items.
Prometheus assumes that the items in a bucket are spread evenly in a linear pattern.
The interpolated value for the 200th item in bucket=500ms is therefore 300 + (500 - 300) * (200/400) = 400ms.
That is, the 95th percentile is 400ms.
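Generalizing the worked example, the interpolation inside the target bucket can be written as (the names here are descriptive, not identifiers from the Prometheus source):

target_rank = q * total_count                                   # 0.95 * 10000 = 9500
quantile    ≈ lower + (upper - lower) * (target_rank - count_below) / count_in_bucket
            = 300ms + (500ms - 300ms) * (9500 - 9300) / 400 = 400ms

where lower and upper are the bounds of the bucket containing the target rank, count_below is the cumulative count of all lower buckets, and count_in_bucket is the count inside that bucket.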
There are a few things to bear in mind:
The underlying metric should be a COUNTER in nature for the histogram metric type
The series used for the quantile calculation must always have the le label defined
Items (data) in a specific bucket are assumed to be spread evenly in a linear pattern (e.g. within 300~500ms)
At least, Prometheus makes this assumption
Quantile calculation requires the buckets to be defined in ascending order (e.g. 1ms < 5ms < 10ms < ...)
The result of histogram_quantile is an approximation
P.S.:
The computed value is not always accurate, because of the assumption that items (data) in a specific bucket are spread evenly in a linear pattern.
Say the maximum duration in reality (e.g. taken from the nginx access log) in bucket=500ms (range 300~500ms) is 310ms; we will still get 400ms from histogram_quantile with the setup above, which can be quite confusing at times.
The smaller the bucket ranges are, the more accurate the approximation is.
So set up bucket boundaries that fit your needs.
You can refer to my reply here
Actually, the rate() function is just used to specify the time window; the denominator (the per-second division) has no effect on the computed percentile value.
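A quick way to see this: increase() is just rate() multiplied by the window length, so every bucket is scaled by the same constant factor, and histogram_quantile only uses the ratios between buckets. The two queries below should therefore return the same value:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(increase(http_request_duration_seconds_bucket[5m])) by (le))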
I believe this is the code for it in Prometheus.
The general idea is that you use the data in the buckets to extrapolate / approximate the quantiles
Elasticsearch also does something similar (yet different/much simpler) in their rollup capabilities
You have to use rate() because counters can be reset; rate() automatically accounts for resets and gives you the right per-second count. Just remember to always apply rate() before otherwise working with counters.

Is high label cardinality but low metric/label count and infrequent sampling an acceptable use-case for Prometheus?

I have a monitoring use-case that I'm not entirely sure is a good match for Prometheus, and I wanted to ask for opinions before I delve deeper.
The numbers of what I'm going to store:
Only 1 metric.
That metric has 1 label with 1,000,000 to 2,000,000 distinct values.
The values are gauges (but does it make a difference if they are counters?)
Sample rate is once every 5 minutes. Retaining data for 180 days.
Estimated storage size if I have 1 million distinct label values:
(According to the formula in the Prometheus documentation: retention_time_seconds * ingested_samples_per_second * bytes_per_sample)
(24*60)/5 = 288 five-minute intervals in a day.
(180*288 samples per label value) * (1,000,000 label values) * (2 bytes per sample) = 103,680,000,000 bytes ~= 100GB
So I assume 100-200GB will be required.
Is this estimation correct?
I read in multiple places about avoiding high-cardinality labels, and I would like to ask about this. Considering I will be looking at one time series at a time, is the problem with high-cardinality labels themselves, or with having a high number of time series (since each label value produces another time series)? I also read in multiple places that Prometheus can handle millions of time series at once, so even if I have 1 label with one million distinct values I should be fine in terms of time-series count; do I still have to worry about the label having high cardinality in this case? I'm aware that it depends on the strength of the server, but assuming average capacity, I would like to know if Prometheus' implementation has a problem handling this case efficiently.
And also, if it's a matter of time-series count, am I correct in assuming
that it will not make a significant difference between the following
options?
1 metric with 1 label of 1,000,000 distinct label values.
10 metrics each with 1 label of 100,000 distinct label values.
X metrics each with 1 label of Y distinct label values.
where X * Y = 1,000,000
Thanks for the help!
That might work, but it's not what Prometheus is designed for and you'll likely run into issues. You probably want a database rather than a monitoring system, maybe Cassandra here.
How the cardinality is split out across metrics won't affect ingestion performance; however, it'll be relatively slow to read 1M series in a single query.
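To make that concrete (metric and label names are illustrative): a selector that pins the high-cardinality label down only has to read one series, while anything that aggregates across the label has to touch all of them.

my_metric{id="123456"}    # reads 1 series
sum(my_metric)            # reads all ~1,000,000 series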
Note that VictoriaMetrics is an easy-to-configure remote storage backend for Prometheus which will reduce storage requirements significantly.
