combined aggregation functions for more efficient plotting - influxdb

My computing cluster monitoring data is stored in an influx DB with the following shape (minus a few columns):
time                            number  parti  user
----                            ------  -----  ----
2017-06-02T06:58:52.854866584Z  59      gr01   user01
2017-06-02T06:58:52.854866584Z  6       gr01   user02
2017-06-02T06:58:52.854866584Z  295     gr02   user03
2017-06-02T06:58:52.854866584Z  904     gr03   user04
Data points arrive every 10 minutes. Right now I am plotting the sum for each "parti" with:
select sum(number) from status_logs where time > now() - 1h group by time(10m), parti
However, this becomes very slow when I show more than a few days, due to the time(10m) grouping. I cannot simply use a varying time window, because the sum() would not make sense anymore.
My question: would there be a way to take the average of the sum over a (variable) time window?
Thanks!
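One possible direction (a sketch only, not a tested answer): InfluxDB 1.2+ supports subqueries, so the original 10-minute sums can be kept in an inner query and averaged by an outer query over a coarser interval that scales with the plotted range. Below, the host, database name, the 7d range and the 1h outer window are all placeholders; the query is issued through the influxdb-python client, but it could just as well be pasted into the CLI.

# Sketch only: average the per-10-minute sums over a coarser (here 1h) window.
# Connection details, time range and window sizes are placeholders.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="monitoring")
query = (
    "SELECT mean(total) FROM ("
    "SELECT sum(number) AS total FROM status_logs GROUP BY time(10m), parti"
    ") WHERE time > now() - 7d GROUP BY time(1h), parti"
)
result = client.query(query)

The outer GROUP BY interval can then be varied with the zoom level (for example 1h for a week, 6h for a month) while the inner 10-minute sums keep their meaning.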

Related

How to automatically detect and correct an abnormal period of data in a time series using Python

I am trying to find a method that allows me to automatically detect and correct abnormal periods of data points in a time series (a sequence of outliers). I've already tried ThymeBoost, but it only corrects point outliers, not outlier periods.
Here is an example of a time series containing a period of outliers.
The time series:
01/02/2018 288.000000
01/03/2018 332.000000
01/04/2018 277.000000
01/05/2018 233.000000
01/06/2018 204.000000
01/07/2018 216.000000
01/08/2018 175.000000
01/09/2018 218.000000
01/10/2018 413.000000
01/11/2018 416.000000
01/12/2018 151.000000
01/01/2019 224.000000
01/02/2019 563.000000
01/03/2019 413.000000
01/04/2019 238.000000
01/05/2019 343.000000
01/06/2019 176.000000
01/07/2019 533.103060
01/08/2019 230.000000
01/09/2019 364.000000
01/10/2019 324.000000
01/11/2019 437.000000
01/12/2019 738.000000
01/01/2020 619.000000
01/02/2020 728.000000
01/03/2020 506.000000
01/04/2020 500.000000
01/05/2020 886.000000
01/06/2020 892.000000
01/07/2020 268.000000
01/08/2020 32.000000
01/09/2020 45.000000
01/10/2020 51.000000
01/11/2020 373.000000
01/12/2020 61.000000
01/01/2021 73.000000
01/02/2021 779.000000
01/03/2021 584.718872
01/04/2021 614.000000
01/05/2021 489.000000
01/06/2021 534.000000
01/07/2021 455.000000
The plot: (image not included here)
I have also tried seasonal decomposition, but it doesn't work since the series doesn't seem to have any seasonality.
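Not an authoritative answer, but a minimal sketch of one way to approach this with pandas, assuming an "abnormal period" can be treated as a run of consecutive points that deviate strongly from a rolling median; the window size, threshold and minimum run length below are made-up parameters that would need tuning.

# Sketch only: flag runs of consecutive strong deviations from a rolling median
# and interpolate across them. All parameters are assumptions.
import numpy as np
import pandas as pd

def correct_outlier_periods(series, window=7, threshold=3.0, min_run=3):
    med = series.rolling(window, center=True, min_periods=1).median()
    mad = (series - med).abs().rolling(window, center=True, min_periods=1).median()
    score = (series - med).abs() / (1.4826 * mad.replace(0, np.nan))
    flagged = score > threshold
    # keep only runs of at least `min_run` consecutive flags: an outlier *period*
    run_id = (flagged != flagged.shift()).cumsum()
    run_len = flagged.groupby(run_id).transform("size")
    period_mask = flagged & (run_len >= min_run)
    # drop the flagged period and interpolate across the gap
    corrected = series.mask(period_mask).interpolate()
    return corrected, period_mask

Here series would be the monthly values above loaded into a pandas Series; whether any given stretch gets flagged depends entirely on the chosen window and threshold.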

Prometheus blackbox probe helpful metrics

I have around 1000 targets that are probed using HTTP.
job="http_2xx", env="prod", instance="x.x.x.x"
job="http_2xx", env="test", instance="y.y.y.y"
job="http_2xx", env="dev", instance="z.z.z.z"
I want to know for the targets:
Rate of failure by env in last 10 minutes.
Increase in rate of failure by env in last 10 minutes.
I am also curious what the following queries do:
sum(increase(probe_success{job="http_2xx"}[10m]))
rate(probe_success{job="http_2xx", env="prod"}[5m]) * 100
The closest I have come is the following, which finds the percentage operational by env over 10 minutes:
avg(avg_over_time(probe_success{job="http_2xx", env="prod"}[10m]) * 100)
Rate of failure by env in the last 10 minutes: the easiest way you can do it is:
sum(rate(probe_success{job="http_2xx"}[10m]) * 100) by (env)
This will return the percentage of successful probes, which you can invert by appending *(-1) + 100.
Calculating the rate over 10m and the increase of that rate over 10m seems redundant; adding an increase function to the above query didn't work for me. You can replace the rate function with increase if you want to.
The first query was pretty close: it will calculate the increase of successful probes over a 10m period. You can make it show the increase of failed probes by adding == 0 and summing it by the env label:
sum(increase(probe_success{job="http_2xx"} == 0 [10m])) by (env)
Your second query will return the percentage of successful requests over 5m for the prod environment.
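As a hedged aside (not part of the original answer): the failure percentage per env can also be read directly by inverting the avg_over_time expression from the question. The sketch below simply issues that query against the Prometheus HTTP API; the server address is a placeholder.

# Sketch only: failure percentage per env over the last 10 minutes,
# obtained by inverting the avg_over_time success ratio.
import requests

promql = '100 * (1 - avg by (env) (avg_over_time(probe_success{job="http_2xx"}[10m])))'
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": promql})
for item in resp.json()["data"]["result"]:
    print(item["metric"].get("env"), item["value"][1])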

How do I properly transform missing datapoints as 0 in Prometheus?

We have an alert we want to fire based on the previous 5m of metrics (say, if it's above 0). However, if the metric is 0 it's not written to Prometheus, and as such it's not returned for that time bucket.
The result is that we may have an example data-set of:
-60m | -57m | -21m | -9m | -3m <<< Relative Time
1 , 0 , 1 , 0 , 1 <<< Data Returned
which ultimately results in the alert firing every time the metric is above 0, not only when it's above 0 for 5m. I've tried writing our query with OR on() vector() appended to the end, but it does funny stuff with the returned dataset:
values:Array[12]
0:Array[1539021420,0.16666666666666666]
1:Array[1539021480,0]
2:Array[1539021540,0]
3:Array[1539021600,0]
4:Array[1539021660,0]
5:Array[1539021720,0]
6:Array[1539021780,0]
7:Array[1539021840,0]
8:Array[1539021900,0]
9:Array[1539021960,0]
10:Array[1539022020,0]
11:Array[1539022080,0]
For some reason it's putting the "real" data at the front of the array (even though my starting time is well before 1539021420) and continuing from that timestamp forward.
What is the proper way to have Prometheus return 0 for data-points which may not exist?
To be clear, this isn't an alertmanager question -- I'm using a different tool for alerting on this data.
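One hedged workaround, given that alerting happens in a separate tool anyway: query the range API and fill the missing buckets with 0 client-side, rather than trying to coerce Prometheus into returning them. The metric name, server address, time range and step below are all placeholders.

# Sketch only: fetch a range and zero-fill any missing buckets on the client.
import requests
import pandas as pd

start, end, step = 1539018000, 1539022080, 60  # placeholder range
resp = requests.get(
    "http://localhost:9090/api/v1/query_range",
    params={"query": "my_metric", "start": start, "end": end, "step": step},
)
values = resp.json()["data"]["result"][0]["values"]  # [[timestamp, "value"], ...]
df = pd.DataFrame(values, columns=["ts", "value"])
df["ts"] = df["ts"].astype(int)
df["value"] = df["value"].astype(float)
df = df.set_index("ts").reindex(range(start, end + step, step), fill_value=0.0)

This sidesteps the OR on() vector() labelling quirks entirely, at the cost of doing the zero-filling outside Prometheus.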

influxdb: calculating duration of boolean events?

I have data in an influxdb database from a door sensor. This is a boolean sensor (either the door is open (value is false) or it is closed (value is true)), and the table looks like:
name: door
--------------
time value
1506026143659488953 true
1506026183699139512 false
1506026751433484237 true
1506026761473122666 false
1506043848850764808 true
1506043887602743375 false
I would like to calculate how long the door was open in a given period of time. The ELAPSED function gets me close, but I'm not sure how to either (a) restrict it to only those intervals for which the initial value is false, or (b) identify "open" intervals from the output of something like select elapsed(value, 1s) from door.
I was hoping I could do something like:
select elapsed(value, 1s), first(value) from door
But that doesn't get me anything useful:
name: door
--------------
time elapsed first
0 true
1506026183699139512 40
1506026751433484237 567
1506026761473122666 10
1506043848850764808 17087
1506043887602743375 38
I was hoping for something more along the lines of:
name: door
--------------
time elapsed first
1506026183699139512 40 true
1506026751433484237 567 false
1506026761473122666 10 true
1506043848850764808 17087 false
1506043887602743375 38 true
Short of extracting the data myself and processing it in e.g. python, is there any way to do this via an influxdb query?
I came across this problem as well. I wanted to sum the durations of times for which a flag is on, which is pretty common in signal processing and time-series libraries, but InfluxDB just doesn't seem to support that very well. I tried INTEGRAL with a flag value of 1, but it just didn't seem to give me correct values. In the end, I resorted to calculating the intervals in my data source, publishing those as a separate field in InfluxDB, and summing them up. It works much better that way.
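For illustration, a minimal sketch of that kind of client-side interval calculation, assuming the rows above have been pulled out of InfluxDB as (epoch-nanosecond, value) pairs and that the door is open while value is false:

# Sketch only: sum the lengths of intervals that start with value == False.
rows = [
    (1506026143659488953, True),
    (1506026183699139512, False),
    (1506026751433484237, True),
    (1506026761473122666, False),
    (1506043848850764808, True),
    (1506043887602743375, False),
]

open_seconds = 0.0
for (t_prev, v_prev), (t_cur, _v_cur) in zip(rows, rows[1:]):
    if not v_prev:  # this interval started with the door open
        open_seconds += (t_cur - t_prev) / 1e9
print(open_seconds)  # total time the door was open, in seconds

For the sample data this counts the 567 s and 17087 s intervals that begin with value false, matching the desired output shown in the question.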
This is the closest I have found so far:
https://community.influxdata.com/t/storing-duration-in-influxdb/4669
The idea is to store the boolean event as 0 or 1, and to record each state change with two entries one unit of time apart. It would look something like this:
name: door
--------------
time value
1506026143659488953 1
1506026183699139511 1
1506026183699139512 0
1506026751433484236 0
1506026751433484237 1
1506026761473122665 1
1506026761473122666 0
1506043848850764807 0
1506043848850764808 1
1506043887602743374 1
1506043887602743375 0
It should then be possible to use a query like this:
SELECT integral(value) FROM "door" WHERE time > x and time < y
Since integral(value) computes the area under the value curve (per second by default), with this 0/1 encoding it should return roughly the number of seconds spent in the 1 state. I'm new to Influx, so let me know if this is a bad way of doing things today. I also haven't tested the example I've written here.
I had this same problem. After running into this wall with InfluxDB and finding no clean solutions here or elsewhere, I ended up switching to TimescaleDB (PostgreSQL-based) and solving it with a SQL window function, using lag() to calculate the delta to the previous time value.
For the OP's dataset, a possible solution looks like this:
SELECT
    "time",
    ("time" - lag("time") OVER (ORDER BY "time")) / 1000000000 AS elapsed,
    value AS first
FROM door
ORDER BY 1
OFFSET 1; -- omit the initial zero value
Input:
CREATE TEMPORARY TABLE "door" (time bigint, value boolean);
INSERT INTO "door" VALUES
(1506026143659488953, true),
(1506026183699139512, false),
(1506026751433484237, true),
(1506026761473122666, false),
(1506043848850764808, true),
(1506043887602743375, false);
Output:
time | elapsed | first
---------------------+---------+-------
1506026183699139512 | 40 | f
1506026751433484237 | 567 | t
1506026761473122666 | 10 | f
1506043848850764808 | 17087 | t
1506043887602743375 | 38 | f
(5 rows)

Generating means of a variable using dummy variables & foreach in Stata

My dataset includes TWO main variables X and Y.
Variable X represents distinct codes (e.g. 001X01, 001X02, etc) for multiple computer items with different brands.
Variable Y represents the tax charged for each code of variable X (e.g. 15 = 15% for 001X01) at a store.
I've created categories for these computer items using dummy variables (e.g. an HD dummy variable for hard drives, which takes the value 1 when variable X represents an HD, etc.). I have a list of over 40 variables (two of them representing X and Y; the rest are dummy variables for the different categories I've created for computer items).
I would like to display the averages of all these categories using a loop in Stata, but I'm not sure how to do this.
For example the code:
mean Y if HD == 1
Mean estimation                     Number of obs = 5
--------------------------------------------------------------
             |      Mean   Std. Err.   [95% Conf. Interval]
-------------+------------------------------------------------
         Tax |       7.1   2.537716     1.154172    15.24583
gives me the mean Tax for the category representing Hard Drives. How can I use a loop in Stata to automatically display all the mean Taxes charged for each category? I would do it by hand without a problem, but I want to repeat this process for multiple years, so I would like to use a loop for each year in order to come up with this output.
My goal is to create a separate Excel file with each of the computer categories I've created (38 total) and the average tax for each category by year.
Why bother with the loop and creating the indicator variables? If I understand correctly, your initial dataset allows the use of a simple collapse:
clear all
set more off
input ///
code tax str10 categ
1 0.15 "hd"
2 0.25 "pend"
3 0.23 "mouse"
4 0.29 "pend"
5 0.16 "pend"
6 0.50 "hd"
7 0.54 "monitor"
8 0.22 "monitor"
9 0.21 "mouse"
10 0.76 "mouse"
end
list
collapse (mean) tax, by(categ)
list
To take the results to Excel you can try export excel or putexcel.
Run help collapse and help export for details.
Edit
Because you insist, below is an example that gives the same result using loops.
I assume the same data input as before. Some testing using this example database
with expand 1000000 shows that speed is virtually the same. But almost surely,
you (including your future you) and your readers will prefer collapse.
It is much clearer, cleaner, and more concise. It is even prettier.
levelsof categ, local(parts)
gen mtax = .
quietly {
    foreach part of local parts {
        summarize tax if categ == "`part'", meanonly
        replace mtax = r(mean) if categ == "`part'"
    }
}
bysort categ: keep if _n == 1
keep categ mtax
Stata has features that make it quite different from other languages. Once you
start getting the hang of it, you will find that many things done with loops elsewhere
can be made loop-less in Stata. In many cases, the latter style will be preferred.
See the corresponding help files using help <command> and, if you are not familiar with saved results (e.g. r(mean)), type help return.
A supplement to Roberto's excellent answer: after collapse, you will need a loop to export the results to Excel.
levelsof categ, local(levels)
foreach x of local levels {
export excel `x', replace
}
I prefer to use numerical codes for variables such as your category variable, and then assign them value labels. Here's a version of Roberto's code which does this and which, for closer correspondence to your problem, adds a "year" variable:
input code tax categ year
1 0.15 1 1999
2 0.25 2 2000
3 0.23 3 2013
4 0.29 1 2010
5 0.16 2 2000
6 0.50 1 2011
7 0.54 4 2000
8 0.22 4 2003
9 0.21 3 2004
10 0.76 3 2005
end
#delim ;
label define catl
1 hd
2 pend
3 mouse
4 monitor
;
#delim cr
label values categ catl
collapse (mean) tax, by(categ year)
levelsof categ, local(levels)
foreach x of local levels {
export excel `:label (categ) `x'', replace
}
The #delim ; command makes it possible to easily list each code on a separate line. The "label" function in the export statement is an extended macro function that inserts a value label into the file name.
