What Prometheus query (PromQL) can be used to identify the last local peak value over the last X minutes in a graph?
A local peak is a point that is larger than its previous and next datapoint. (So the current time is definitely not a local peak)
(p: peak point, i: cron job interval, m: missed execution)
I want this value to find anomalies in the execution of a cron job. As you can see in the picture, I have written a query that calculates the elapsed time since the last execution of a job. To set an alert rule that compares the elapsed time since the last successful execution and detects missed executions, I need the time at which the last execution occurred within that interval. The interval is unknown to the query (in other words, the job's interval is specified by another program), so I cannot compare the elapsed time against a fixed duration.
Use z-score to detect anomalies
If you know the average value and standard deviation (σ) of a series, you can use any sample in the series to calculate the z-score, z = (sample - mean) / σ. The z-score is measured in the number of standard deviations from the mean: a z-score of 0 means the sample is identical to the mean (in a data set with a normal distribution), a z-score of 1 is 1.0 σ from the mean, and so on.
Calculate the average and standard deviation for the metric using data with a large sample size.
# Long-term average value for the series
- record: job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
expr: avg_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
# Long-term standard deviation for the series
- record: job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w
expr: stddev_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
Calculate the z-score for the aggregation once you have the average and standard deviation (here job:cronjob_duration_time_seconds_count:rate10m is assumed to be a recording rule for sum(rate(cronjob_duration_time_seconds_count[10m]))).
# Z-Score for aggregation
(
job:cronjob_duration_time_seconds_count:rate10m -
job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
) / job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w
Based on the statistical principles of normal distributions, you can assume that any value that falls outside of the range of roughly +1 to -1 is an anomaly. For example, you can fire an alert when the aggregation stays outside this range for more than five minutes.
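As a minimal sketch of the check itself (Python, with hypothetical numbers; in Prometheus the mean and standard deviation come from the recording rules above):

# Z-score check with hypothetical long-term statistics
def z_score(value, mean, stddev):
    return (value - mean) / stddev

long_term_mean = 0.8      # hypothetical avg_over_time result
long_term_stddev = 0.2    # hypothetical stddev_over_time result

for current in (0.85, 1.3):
    z = z_score(current, long_term_mean, long_term_stddev)
    print(f"value={current} z={z:+.2f} anomaly={abs(z) > 1}")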
If what you want is an alert to be fired when the elapsed time has been longer than a fixed duration, you can set an alert similar to the up alert, based on the changes > 0 expression, which is only true (i.e. > 0) when the job is running.
An example would be:
rules:
- alert: CronJobNotRunning
expr: |
changes(
sum(
rate(
cronjob_duration_time_seconds_count{
status="ok", namespace="<namespace>", exported_job="<job>"
}[1m]
)
)[1m:]
) == 0
for: <alert_duration>
Note that subqueries ([1m:]) are expensive, and introducing a recording rule there can help performance, especially in a dashboard.
Also, in your case, the time since the last time the second derivative was non-zero can be used too, as that happens when a job starts/finishes (the drops in the graph, or when it starts to rise).
I want to process values from InfluxDB in Grafana.
The end goal is to show how many miles the vehicle has traveled in a certain time frame.
You can use the formula: average velocity * time.
Does anyone have a good method for this?
What I'm thinking is: I can use the mean function to get the average speed over a fixed period of time and the corresponding mileage, and then add all the mileage values together. How do I do that?
What if only SQL is used?
1.) InfluxDB uses InfluxQL, not SQL
2.) Your approach of average velocity * time is inaccurate
3.) Use suitable InfluxDB functions; I would say INTEGRAL() is the best function for this case, plus some basic arithmetic. Don't expect 100% accuracy. Accuracy depends heavily on the metric sampling, e.g. with 1-minute sampling: what if the vehicle drives for 59 seconds and is not moving at the second when the sample is taken? So don't be surprised when even 10-second sampling turns out to be inaccurate.
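To illustrate the idea of integrating speed over time, here is a sketch in Python with hypothetical samples (not InfluxDB's actual implementation of INTEGRAL()):

# Approximate distance by numerically integrating sampled speed over time
timestamps = [0, 60, 120, 180, 240]          # sample times in seconds (one per minute)
speed_mph  = [30.0, 45.0, 0.0, 50.0, 50.0]   # hypothetical sampled vehicle speed

distance_miles = 0.0
for i in range(1, len(timestamps)):
    dt_hours = (timestamps[i] - timestamps[i - 1]) / 3600.0
    avg_speed = (speed_mph[i] + speed_mph[i - 1]) / 2.0   # trapezoidal rule
    distance_miles += avg_speed * dt_hours

print(f"approx. distance: {distance_miles:.3f} miles")

Whatever the vehicle did between samples is invisible to this calculation, which is exactly the sampling caveat above.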
According to the Prometheus documentation, in order to get the 95th percentile from a histogram metric I can use the following query:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Source: https://prometheus.io/docs/practices/histograms/#quantiles
Since each bucket of a histogram is a counter, we can calculate the rate of each bucket as the:
per-second average rate of increase of the time series in the range vector.
See: https://prometheus.io/docs/prometheus/latest/querying/functions/#rate
So, for instance, if bucket value[t-5m] = 100 and bucket value[t] = 200, then bucket rate[t] = (200-100)/(5*60) ≈ 0.33.
And finally, the most confusing part: how can the histogram_quantile function find the 95th percentile for a given metric knowing all the bucket rates?
Is there any code or algorithm I can take a look at to better understand it?
A solid example will explain histogram_quantile well.
Assumptions:
ONLY ONE series for simplicity
10 buckets for the metric http_request_duration_seconds:
10ms, 50ms, 100ms, 200ms, 300ms, 500ms, 1s, 2s, 3s, 5s
each bucket of http_request_duration_seconds is a COUNTER
time     value   delta   rate (quantity of items)
t-10m    50      N/A     N/A
t-5m     100     50      50 / (5*60)
t        200     100     100 / (5*60)
...      ...     ...     ...
We have at least two scrapes of the series covering 5 minutes, so rate() can calculate the quantity for each bucket.
rate_xxx(t) = (value_xxx[t] - value_xxx[t-5m]) / (5*60) is the quantity of items for [t-5m, t] (the constant division by the window cancels out in the quantile calculation).
We are looking at 2 samples (value(t) and value(t-5m)) here.
Suppose 10000 http request durations (items) were recorded in that window, that is,
10000 = rate_10ms(t) + rate_50ms(t) + rate_100ms(t) + ... + rate_5s(t) + rate_+Inf(t).
bucket(le)   range       rate_xxx(t)
10ms         ~10ms       3000
50ms         10~50ms     3000
100ms        50~100ms    1500
200ms        100~200ms   1000
300ms        200~300ms   800
500ms        300~500ms   400
1s           500ms~1s    220
2s           1~2s        40
3s           2~3s        30
5s           3~5s        5
+Inf         5s~         5
Buckets are the essence of a histogram. We just need the per-bucket numbers rate_xxx(t) to do the quantile calculation.
Let's take a close look at this expression (aggregation like sum() is omitted for simplicity)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
We are actually looking for the item at the 95% position in rate_xxx(t), going from bucket=10ms up to bucket=+Inf. 95% means the 9500th item here, since we have 10000 items in total (10000 * 0.95).
From the table above, there are 9300 = 3000+3000+1500+1000+800 items together before bucket=500ms.
So the 9500th item is the 200th item (9500-9300) in bucket=500ms (range=300~500ms), which contains 400 items.
And Prometheus assumes that items in a bucket are spread evenly, in a linear pattern.
The metric value for the 200th item in bucket=500ms is therefore 300ms + (500ms-300ms) * (200/400) = 400ms.
That is, the 95th percentile is 400ms.
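A minimal Python sketch of this interpolation (not the actual Prometheus implementation; real buckets are cumulative counters labelled by le, here already converted to per-range quantities as in the table above):

# Approximate a quantile from per-bucket quantities, the way the worked example does
import math

def bucket_quantile(q, upper_bounds, per_bucket_counts):
    total = sum(per_bucket_counts)
    rank = q * total                       # 0.95 * 10000 = 9500
    cumulative = 0.0
    lower = 0.0                            # assume the lowest bucket starts at 0
    for upper, count in zip(upper_bounds, per_bucket_counts):
        if cumulative + count >= rank:
            if math.isinf(upper):          # quantile falls into the +Inf bucket:
                return lower               # return the highest finite bound
            # linear interpolation inside the bucket [lower, upper]
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
        lower = upper

bounds = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2, 3, 5, math.inf]     # le, in seconds
counts = [3000, 3000, 1500, 1000, 800, 400, 220, 40, 30, 5, 5]      # items per range
print(bucket_quantile(0.95, bounds, counts))                        # 0.4 -> 400ms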
There are a few things to bear in mind:
The underlying metric should be a COUNTER in nature for the histogram metric type.
Series used for quantile calculation must have the le label defined.
Items (data) in a specific bucket are assumed to be spread evenly in a linear pattern within the bucket range (e.g. 300~500ms); Prometheus makes this assumption, at least.
Quantile calculation requires the buckets to be sorted in ascending order of le (e.g. 1ms < 5ms < 10ms < ...).
The result of histogram_quantile is an approximation.
P.S.:
The returned value is not always accurate, due to the assumption that items (data) in a specific bucket are spread evenly in a linear pattern.
Say the maximum duration observed in reality (e.g. from the nginx access log) within bucket=500ms (range=300~500ms) is 310ms; we will still get 400ms from histogram_quantile with the above setup, which can be quite confusing.
The smaller the bucket ranges are, the more accurate the approximation is.
So set up bucket boundaries that fit your needs.
You can refer to my reply here
Actually the rate() function is just used to specify the time window; the denominator has no effect on the computation of the percentile value.
I believe this is the code for it in Prometheus.
The general idea is that you use the data in the buckets to extrapolate / approximate the quantiles
Elasticsearch also does something similar (yet different/much simpler) in their rollup capabilities
You have to use rate() because counters can be reset; rate() automatically accounts for resets and gives you the right per-second count. Just remember to always apply rate() before doing anything else with counters.
I am tracing multiple signals for a certain period of time and associating them with a timestamp, like the following:
t0 1 10 2 0 1 0 ...
t1 1 10 2 0 1 0 ...
t2 3 0 9 7 1 1 ... // pressed a button to change the mode
t3 3 0 9 7 1 1 ...
t4 3 0 8 7 1 1 ... // pressed a button to adjust a certain characteristic like temperature (signal 3)
where t0 is the time stamp, 1 is the value of signal 1, 10 the value of signal 2, and so on.
The data captured during that period of time should be considered the normal case. Now significant deviations from the normal case should be detected. By significant deviation I do NOT mean that one signal value simply changes to a value that has not been seen during the tracing phase, but rather that a lot of values change which have not previously been related to each other. I do not want to hardcode rules, since in the future more signals might be added or removed and other "modes" with other signal values might be implemented.
Can this be achieved via a certain machine learning algorithm? If a small deviation occurs, I want the algorithm to first see it as a minor change from the training set, and if it occurs multiple times in the future it should be "learned". The major goal is to detect the bigger changes / anomalies.
I hope I explained my problem in enough detail. Thanks in advance.
You could just calculate the nearest neighbor in your feature space and set a threshold for how far it is allowed to be from your test point before the test point counts as an anomaly.
Let's say you have 100 values in your "certain period of time",
so you use a 100-dimensional feature space with your training data (which doesn't contain anomalies).
If you get a new dataset you want to test, you calculate the (k) nearest neighbor(s) and compute the (e.g. Euclidean) distance in your feature space.
If that distance is larger than a certain threshold, it's an anomaly.
What you have to do in order to optimize is find a good k and a good threshold, e.g. by grid search.
(1) Note that something like this probably only works well if your data has a fixed starting and ending point. Otherwise you would need a huge amount of data, and even then it will not perform as well.
(2) Note that it may be worth trying to create a separate detector for every "mode" you mentioned in your question.
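A minimal sketch of this nearest-neighbor check in Python (the data, k, and threshold below are hypothetical; in practice you would tune k and the threshold, e.g. by grid search on held-out data):

# Flag a trace as an anomaly if it is far from its k nearest "normal" traces
import numpy as np

def knn_anomaly_score(train, sample, k=3):
    """Mean Euclidean distance from sample to its k nearest training points."""
    dists = np.linalg.norm(train - sample, axis=1)   # distance to every training trace
    return np.sort(dists)[:k].mean()

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 100))        # 500 normal traces, 100 values each

normal_sample = rng.normal(0.0, 1.0, size=100)        # looks like the training data
anomalous_sample = rng.normal(5.0, 1.0, size=100)     # clearly shifted trace

threshold = 16.0                                      # hypothetical, tune on validation data
for name, sample in (("normal", normal_sample), ("anomalous", anomalous_sample)):
    score = knn_anomaly_score(train, sample)
    print(f"{name}: score={score:.2f} anomaly={score > threshold}")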
According to this article about throughput and latency:
"When you go to buy a water pipe, there are two completely independent parameters that you look at: the diameter of the pipe and its length."
But I think these two parameters are related. Throughput is measured per unit time, so a long latency will affect throughput; say, if the droplets are fast, more of them will pass through the pipe in one second.
Can anyone help me understand this?
EDIT:
The confusion originated from counting queuing time as part of latency, which we should not. Once a request is handled, the latency is independent of throughput.
Let me give you another analogy. Think of a car travelling on a single-lane road from location A to location B. The time taken by that car to travel from A to B is your latency, and the number of cars travelling at a given interval, each maintaining that latency, is your throughput.
The factors that affect these are your medium of travel, i.e. the road, and the number of lanes on the road.
You're thinking about frequency. Say you have a window into the water pipe at some given point, and you send water droplets at some constant interval (say 1 droplet every second). You count how often you see a single droplet pass by, and take the inverse (1/seconds). So if you count 1 second of elapsed time between droplets being observed, then you have a frequency of 1 Hz.
Now say that you keep this frequency constant (1Hz), but you elongate the pipe. You send one droplet down and count how much time elapses before it reaches the end of the pipe. So say it takes 2 seconds for a single drop to travel from the beginning to the end of the pipe, then you have a latency of 2 seconds.
Now say that you widen the diameter of the pipe, and now you are able to send 2 droplets with a frequency of 1Hz. At the end of the pipe you will count 2 droplets coming out every second. So your throughput will be 2 droplets per second.
Here is my take, in terms I can understand.
When you go to buy a water pipe, there are two completely independent parameters that you look at: the diameter of the pipe and its length. The diameter determines the throughput of the pipe and the length determines the latency, i.e., the time it will take for a water droplet to travel across the pipe. The key point to note is that the length and diameter are independent; thus, so are the latency and throughput of a communication channel.
More formally, throughput is defined as the amount of water entering or leaving the pipe every second, and latency is the average time required for a droplet to travel from one end of the pipe to the other.
Let’s do some math:
For simplicity, assume that our pipe is a 4 inch x 4 inch square and its length is 12 inches. Now assume that each water droplet is a 0.1 in x 0.1 in x 0.1 in cube. Thus, in one cross section of the pipe, I will be able to fit 1600 water droplets. Now assume that water droplets travel at a rate of 1 inch/second.
Throughput: Each set of droplets will move into the pipe in 0.1 seconds. Thus, 10 sets will move in 1 second, i.e., 16000 droplets will enter the pipe per second. Note that this is independent of the length of the pipe.
Latency: At one inch/second, it will take 12 seconds for droplet A to get from one end of the pipe to the other, regardless of the pipe's diameter. Hence the latency will be 12 seconds.
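The same arithmetic as a quick Python sketch (same hypothetical pipe: 4 in x 4 in cross section, 12 in long, 0.1 in droplets moving at 1 in/s):

# Throughput depends only on the cross section; latency only on the length
pipe_side_in = 4.0
pipe_length_in = 12.0
droplet_side_in = 0.1
flow_speed_in_per_s = 1.0

droplets_per_cross_section = (pipe_side_in / droplet_side_in) ** 2    # 40 * 40 = 1600
cross_sections_per_second = flow_speed_in_per_s / droplet_side_in     # 10 layers per second

throughput = droplets_per_cross_section * cross_sections_per_second   # 16000 droplets/s
latency = pipe_length_in / flow_speed_in_per_s                        # 12 s

print(f"throughput = {throughput:.0f} droplets/s, latency = {latency:.0f} s")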
I'm working on a player churn prediction model for a game. I have a "number of rounds played" time series for 60 days. Before I feed the time series to classification algorithms, I need to normalize it.
I was thinking about using min-max normalization, transforming x to x/max(x). But max(x) in the 60-day time series doesn't necessarily capture the peak of how many times a player usually plays in a day.
Z-normalization, transforming x to (x-mean(x))/std(x), will not work either, since I need to preserve the information that days with no play are zero. Z-normalization maps 0 to different values, which makes them incomparable.
Is there a normalization scheme which requires no information about the maximum of the time series and still maps 0 to 0?
You can transform your values into probabilities by dividing each value in the array by the sum of the values in the array (a "sum to unity" normalization factor), i.e. transform x to x/sum(x).
That maps 0 values to 0 and requires no information about the maximum value.
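A minimal sketch of this in Python on a hypothetical per-day series (zeros stay zero and no maximum is needed):

# Sum-to-unity normalization of a "rounds played per day" series
import numpy as np

rounds_per_day = np.array([3, 0, 5, 2, 0, 10], dtype=float)   # hypothetical 6-day series

total = rounds_per_day.sum()
normalized = rounds_per_day / total if total > 0 else rounds_per_day

print(normalized)        # zeros stay zero, e.g. [0.15 0. 0.25 0.1 0. 0.5]
print(normalized.sum())  # 1.0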