How do I properly transform missing datapoints as 0 in Prometheus? - time-series

We have an alert we want to fire based on the previous 5m of metrics (say, if it's above 0). However, if the metric is 0 it's not written to Prometheus, and as such it's not returned for that time bucket.
The result is that we may have an example data-set of:
-60m | -57m | -21m | -9m | -3m <<< Relative Time
1 , 0 , 1 , 0 , 1 <<< Data Returned
which ultimately results in the alert firing every time the metric is above 0, not only when it's above 0 for 5m. I've tried writing our query with OR on() vector() appended to the end, but it does funny stuff with the returned dataset:
values:Array[12]
0:Array[1539021420,0.16666666666666666]
1:Array[1539021480,0]
2:Array[1539021540,0]
3:Array[1539021600,0]
4:Array[1539021660,0]
5:Array[1539021720,0]
6:Array[1539021780,0]
7:Array[1539021840,0]
8:Array[1539021900,0]
9:Array[1539021960,0]
10:Array[1539022020,0]
11:Array[1539022080,0]
For some reason it's putting the "real" data at the front of the array (even though my starting time is well before 1539021420) and continuing from that timestamp forward.
What is the proper way to have Prometheus return 0 for data-points which may not exist?
To be clear, this isn't an alertmanager question -- I'm using a different tool for alerting on this data.
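Since the alerting happens in a separate tool, one option is to fill the gaps on the client side after the range query returns. Below is a minimal Python sketch of that idea, not a Prometheus feature. It assumes the samples sit on the query's step grid (start + k*step) and that the data has the [timestamp, value] shape shown above. The function names are made up for illustration.

```python
def fill_missing_with_zero(values, start, end, step):
    """Return one [timestamp, value] pair per step, using 0.0 where no sample exists."""
    # Assumption: sample timestamps are aligned to start + k*step.
    by_ts = {int(float(ts)): float(v) for ts, v in values}
    return [[ts, by_ts.get(ts, 0.0)] for ts in range(start, end + 1, step)]


def above_zero_for(filled, points):
    """True only if each of the last `points` samples is above 0."""
    tail = filled[-points:]
    return len(tail) == points and all(v > 0 for _, v in tail)


# One real sample inside a five-step range; the other four steps become 0.0.
sparse = [[1539021420, "0.16666666666666666"]]
filled = fill_missing_with_zero(sparse, 1539021300, 1539021540, 60)
```

With the gaps filled, a check like above_zero_for(filled, 5) only passes when every one of the last five one-minute samples is above 0, which is the "above 0 for 5m" condition rather than "above 0 at any point".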

Related

Dataflow - Approx Unique on unbounded source

I'm getting unexpected results streaming in the cloud.
My pipeline looks like:
SlidingWindow(60min).every(1min)
  .triggering(Repeatedly.forever(
      AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime
          .pastFirstElementInPane()
          .plusDelayOf(Duration.standardSeconds(30)))
    )
  )
  .withAllowedLateness(15sec)
  .accumulatingFiredPanes()
  .apply("Get UniqueCounts", ApproximateUnique.perKey(.05))
  .apply("Window hack filter", ParDo(
      if(window.maxTimestamp.isBeforeNow())
        c.output(element)
    )
  )
  .toJSON()
  .toPubSub()
If that filter isn't there, I get 60 windows per output. Apparently because the pubsub sink isn't window aware.
So in the examples below, if each time period is a minute, I'd expect to see the unique count grow until 60 minutes when the sliding window closes.
Using DirectRunner, I get expected results:
t1: 5
t2: 10
t3: 15
...
tx: growing unique count
In dataflow, I get weird results:
t1: 5
t2: 10
t3: 0
t4: 0
t5: 2
t6: 0
...
tx: wrong unique count
However, if my unbounded source has older data, I'll get normal looking results until it catches up at which point I'll get the wrong results.
I was thinking it had to do with my window filter, but removing that didn't change the results.
If I do a Distinct() then Count().perKey(), it works, but that slows my pipeline considerably.
What am I overlooking?
[Update from the comments]
ApproximateUnique inadvertently resets its accumulated value when the result is extracted. This is incorrect when the value is read more than once, as happens with windows that fire multiple times. Fix (will be in version 2.4): https://github.com/apache/beam/pull/4688
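To illustrate why a reset on extraction breaks accumulating panes, here is a toy Python sketch. It is not Beam code and does not use ApproximateUnique's real sketch structure: the accumulator is a plain exact set, and the class and function names are hypothetical. Each batch stands for the elements that arrive between two early firings of the same window.

```python
class UniqueAccumulator:
    """Toy unique-count accumulator (exact set, standing in for an approximate sketch)."""

    def __init__(self, reset_on_extract):
        self.seen = set()
        self.reset_on_extract = reset_on_extract

    def add(self, element):
        self.seen.add(element)

    def extract(self):
        count = len(self.seen)
        if self.reset_on_extract:
            # The problematic behaviour: reading the result wipes the state.
            self.seen = set()
        return count


def fire_panes(batches, reset_on_extract):
    """Add each batch, then extract once per firing, as an accumulating trigger would."""
    acc = UniqueAccumulator(reset_on_extract)
    counts = []
    for batch in batches:
        for element in batch:
            acc.add(element)
        counts.append(acc.extract())
    return counts


batches = [range(0, 5), range(5, 10), [], [], range(10, 12)]
buggy = fire_panes(batches, reset_on_extract=True)
fixed = fire_panes(batches, reset_on_extract=False)
```

In this toy, the resetting version reports only what arrived since the previous firing (including zeros for quiet periods), while the non-resetting version keeps growing as an accumulating pane should.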

influxdb: calculating duration of boolean events?

I have data in an influxdb database from a door sensor. This is a boolean sensor (either the door is open (value is false) or it is closed (value is true)), and the table looks like:
name: door
--------------
time value
1506026143659488953 true
1506026183699139512 false
1506026751433484237 true
1506026761473122666 false
1506043848850764808 true
1506043887602743375 false
I would like to calculate how long the door was open in a given period of time. The ELAPSED function gets me close, but I'm not sure how to either (a) restrict it to only those intervals for which the initial value is false, or (b) identify "open" intervals from the output of something like select elapsed(value, 1s) from door.
I was hoping I could do something like:
select elapsed(value, 1s), first(value) from door
But that doesn't get me anything useful:
name: door
--------------
time elapsed first
0 true
1506026183699139512 40
1506026751433484237 567
1506026761473122666 10
1506043848850764808 17087
1506043887602743375 38
I was hoping for something more along the lines of:
name: door
--------------
time elapsed first
1506026183699139512 40 true
1506026751433484237 567 false
1506026761473122666 10 true
1506043848850764808 17087 false
1506043887602743375 38 true
Short of extracting the data myself and processing it in e.g. python, is there any way to do this via an influxdb query?
I came across this problem as well. I wanted to sum the durations of times for which a flag is on, which is pretty common in signal processing in time series libraries, but influxdb just doesn't seem to support that very well. I tried INTEGRATE with a flag of value 1, but it just didn't seem to give me correct values. In the end, I resorted to calculating the intervals in my data source, publishing those as a separate field in influxdb and summing them up. It works much better that way.
This is the closest I have found so far:
https://community.influxdata.com/t/storing-duration-in-influxdb/4669
The idea is to store the boolean event as 0 or 1 and to store each state change as two entries one unit of time apart. It would look something like this:
name: door
--------------
time value
1506026143659488953 1
1506026183699139511 1
1506026183699139512 0
1506026751433484236 0
1506026751433484237 1
1506026761473122665 1
1506026761473122666 0
1506043848850764807 0
1506043848850764808 1
1506043887602743374 1
1506043887602743375 0
It should then be possible to use a query like this:
SELECT integral(value) FROM "door" WHERE time > x and time < y
I'm new to influx so let me know if this is a bad way of doing things today. I also haven't tested the example I've written here.
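As a sanity check on the idea, here is a small Python sketch that integrates the doubled points above. It assumes the integral is computed with the trapezoidal rule over nanosecond timestamps; it does not call InfluxDB, and the function name is made up. Note that with the OP's encoding 1 means true (closed), so this gives the closed time; for the open time you would store the inverted value or subtract from the length of the range.

```python
def integral_seconds(points):
    """Trapezoidal integral of (timestamp_ns, value) points, in value * seconds."""
    twice_area_ns = 0  # keep integer arithmetic until the end to avoid float rounding
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        twice_area_ns += (t1 - t0) * (v0 + v1)
    return twice_area_ns / 2 / 1e9


points = [
    (1506026143659488953, 1),
    (1506026183699139511, 1),
    (1506026183699139512, 0),
    (1506026751433484236, 0),
    (1506026751433484237, 1),
    (1506026761473122665, 1),
    (1506026761473122666, 0),
    (1506043848850764807, 0),
    (1506043848850764808, 1),
    (1506043887602743374, 1),
    (1506043887602743375, 0),
]
closed_seconds = integral_seconds(points)  # roughly 40 + 10 + 38 seconds
```

The one-nanosecond ramps between states contribute only half a nanosecond each, so the result is effectively the total time spent at value 1.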
I had this same problem. After running into this wall with InfluxDB and finding no clean solutions here or elsewhere, I ended up switching to TimescaleDB (PostgreSQL-based) and solving it with a SQL window function, using lag() to calculate the delta to the previous time value.
For the OP's dataset, a possible solution looks like this:
SELECT
  "time",
  ("time" - lag("time") OVER (ORDER BY "time"))/1000000000 AS elapsed,
  value AS first
FROM door
ORDER BY 1
OFFSET 1; -- omit the initial zero value
Input:
CREATE TEMPORARY TABLE "door" (time bigint, value boolean);
INSERT INTO "door" VALUES
(1506026143659488953, true),
(1506026183699139512, false),
(1506026751433484237, true),
(1506026761473122666, false),
(1506043848850764808, true),
(1506043887602743375, false);
Output:
time | elapsed | first
---------------------+---------+-------
1506026183699139512 | 40 | f
1506026751433484237 | 567 | t
1506026761473122666 | 10 | f
1506043848850764808 | 17087 | t
1506043887602743375 | 38 | f
(5 rows)
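For anyone who ends up processing the data outside the database after all, the same lag-style calculation is short in Python. This is only a sketch over the OP's six rows, with hypothetical helper names. Unlike the SQL above, it labels each interval with the state at its start (the value of the earlier row), which is what the OP's desired output shows, and it sums the intervals that began with false (door open).

```python
NS_PER_S = 10**9

door = [
    (1506026143659488953, True),
    (1506026183699139512, False),
    (1506026751433484237, True),
    (1506026761473122666, False),
    (1506043848850764808, True),
    (1506043887602743375, False),
]


def intervals(rows):
    """Yield (end_time, elapsed_whole_seconds, state_during_interval) per consecutive pair."""
    for (t0, state), (t1, _) in zip(rows, rows[1:]):
        yield t1, (t1 - t0) // NS_PER_S, state


def total_ns_in_state(rows, wanted):
    """Sum the nanoseconds of every interval that started in the `wanted` state."""
    return sum(t1 - t0 for (t0, state), (t1, _) in zip(rows, rows[1:]) if state == wanted)


elapsed = [seconds for _, seconds, _ in intervals(door)]
states = [state for _, _, state in intervals(door)]
open_seconds = total_ns_in_state(door, False) / NS_PER_S
```

The trailing row only closes the last interval; whatever state it starts is not counted because its end is unknown.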

How to reference a particular row for an existing variable in SPSS syntax?

I have 2 variables, one for raw p-values and another for adjusted p-values. I need to compute a new variable based on the values of these two variables. What I need to do isn't too complicated, but I have a hard time doing it in SPSS because I can't figure out how I can reference a particular row for an existing variable in SPSS syntax.
The first column lists raw p-values in ascending order. The next column lists adjusted p-values, but these adjusted p-values are still incomplete. I need to compare each pair of adjacent values in the adjusted p-values column (e.g., rows 1 and 2, rows 2 and 3, rows 3 and 4, and so forth), take whichever is smaller in each comparison, and enter it into the following column as the value of a new variable.
However, that's not the end of the story. One more condition has to be met. That is, the new p-values have to be in the same order as the raw p-values. However, I cannot ensure this if I start the comparisons from the top row. You can see that (i') is greater than (h') and (g'), and (d') is greater than (c'), (b'), and (a') in the example below (picture).
To solve this, I would need to start comparing the adjusted p-values from the bottom. In addition, I would need to compare each adjusted p-value to the new p-value one row below. The one exception is (a'): I can simply use the value of (a), since (a) should always be the greatest of all the p-values as a rule. Then, for (b'), I need to compare (b) and (a') and enter whichever is smaller as (b'). For (c'), I need to compare (c) and (b') and enter whichever is smaller as (c'), and so forth. Done this way, (d') would be 0.911 and (i') would be 0.017.
Sorry for this long post, but I would really appreciate if I can get some help to do this task in SPSS.
Thank you in advance for your help.
Raw p-values | Adjusted p-values (Temporal)| New p-values (Final)
-------------|-----------------------------|---------------------
0.002 | 0.030 (i) | 0.025 (i')
0.003 | 0.025 (h) | 0.017 (h')
0.004 | 0.017 (g) | 0.017 (g')
0.005 | 0.028 (f) | 0.028 (f')
0.023 | 0.068 (e) | 0.068 (e')
0.450 | 1.061 (d) | 1.061 (d')
0.544 | 1.145 (c) | 0.911 (c')
0.850 | 0.911 (b) | 0.911 (b')
0.974 | 0.974 (a) | 0.974 (a')
Another tool that may be convenient is the SHIFT VALUES command. It can move one or more columns of data either forward or backward.
I wonder whether the purpose of this has to do with adjusting p-values for multiple testing, as with the Benjamini-Hochberg FDR procedure or similar corrections. If that is the case, you might find the STATS PADJUST (Analyze > Descriptives > Calculate adjusted p values) extension command useful. It offers six adjustment methods. You can install it from the Utilities (pre-V24) or Extensions (V24+) menu.
To get you started, here are a few tools that can help you with this task:
The LAG function
You can compare the value in the current line with the value in the previous one. For example, the following compares Pval in each line to Pval in the previous line and puts the smaller of the two in NewPval:
compute NewPVal=min(Pval, lag(Pval)).
If you want to do the same process only start from the bottom, you can easily sort your data in reverse order and do the same.
CREATE + LEAD
If you want to compare to the next line instead of the previous line, you should first create a "lead" variable and then compare to it.
For example, the following syntax creates a new variable that, for each line, contains the value of Pval in the next line, and then chooses the smaller of the two for NewPval:
create /LeadPval=LEAD(Pval 1).
compute NewPVal=min(Pval, LeadPval).
Using case numbers
You can use case numbers (line numbers) in calculations and in conditions. For example, the following syntax will let you make different calculations in the first line and the following ones:
if $casenum=1 NewPval=Pval.
if $casenum>1 NewPVal=min(Pval, lag(Pval)).
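To check the expected numbers independently of SPSS, the bottom-up rule from the question can be sketched in Python. This is only an illustration of the rule as the question describes it (a running minimum from the last row upward, where each row is compared with the new value of the row below, not with the original adjusted value). The function name is made up.

```python
def monotone_from_bottom(adjusted):
    """Running minimum from the last row upward, so values never decrease going down."""
    result = list(adjusted)
    # The last row keeps its own value; every other row takes
    # min(own adjusted value, new value of the row below).
    for i in range(len(result) - 2, -1, -1):
        result[i] = min(result[i], result[i + 1])
    return result


# Adjusted p-values (i) through (a), in the same top-to-bottom order as the table.
adjusted = [0.030, 0.025, 0.017, 0.028, 0.068, 1.061, 1.145, 0.911, 0.974]
final = monotone_from_bottom(adjusted)
```

For the table in the question this gives 0.017 for (i') and 0.911 for (d'), matching the values the question expects.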

combined aggregation functions for more efficient plotting

My computing cluster monitoring data is stored in an influx DB with the following shape (minus a few columns):
time number parti user
---- ------ ----- ----
2017-06-02T06:58:52.854866584Z 59 gr01 user01
2017-06-02T06:58:52.854866584Z 6 gr01 user02
2017-06-02T06:58:52.854866584Z 295 gr02 user03
2017-06-02T06:58:52.854866584Z 904 gr03 user04
Data points arrive every 10 minutes. Right now I am plotting the sum for each "parti" with:
select sum(number) from status_logs where time > now() - 1h group by time(10m), parti
However, this becomes very slow when I show more than a few days due to the time(10m). I cannot use a varying time window because the sum() would not make sense anymore.
My question: would there be a way to take the average of the sum over a (variable) time window?
Thanks!
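To make the intended calculation concrete, here is a Python sketch of "average of the 10-minute sums over a larger window". It is a client-side illustration only, with hypothetical names and epoch-second timestamps as an assumption. If your InfluxDB version supports subqueries, the same two-step shape (inner sum per 10m, outer mean per larger window) may be expressible in a single query, but that is not tested here.

```python
from collections import defaultdict


def mean_of_sums(points, inner=600, outer=3600):
    """Sum `number` per (parti, inner bucket), then average those sums per outer bucket.

    points: iterable of (epoch_seconds, parti, number).
    Returns {(parti, outer_bucket_start): mean of the inner sums}.
    """
    sums = defaultdict(float)
    for ts, parti, number in points:
        sums[(parti, ts - ts % inner)] += number

    grouped = defaultdict(list)
    for (parti, bucket), total in sums.items():
        grouped[(parti, bucket - bucket % outer)].append(total)

    return {key: sum(totals) / len(totals) for key, totals in grouped.items()}


sample = [
    (0, "gr01", 59),
    (0, "gr01", 6),
    (600, "gr01", 35),
    (0, "gr02", 295),
]
result = mean_of_sums(sample)
```

Because the outer step averages sums that each cover the same 10 minutes, the plotted value stays comparable when the outer window changes.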

How to evaluate a search/retrieval engine using trec_eval?

Has anybody here used TREC_EVAL? I need a "TREC_EVAL for dummies".
I'm trying to evaluate a few search engines to compare parameters like Recall-Precision, ranking quality, etc for my thesis work. I can not find how to use TREC_EVAL to send queries to the search engine and get a result file which can be used with TREC_EVAL.
Basically, for trec_eval you need a (human generated) ground truth. That has to be in a special format:
query-number 0 document-id relevance
Given a collection like 101Categories (wikipedia entry) that would be something like
Q1046 0 PNGImages/dolphin/image_0041.png 0
Q1046 0 PNGImages/airplanes/image_0671.png 128
Q1046 0 PNGImages/crab/image_0048.png 0
The query-number therefore identifies a query (e.g. a picture from a certain category for which to find similar ones). The results from your search engine then have to be transformed to look like
query-number Q0 document-id rank score Exp
or in reality
Q1046 0 PNGImages/airplanes/image_0671.png 1 1 srfiletop10
Q1046 0 PNGImages/airplanes/image_0489.png 2 0.974935 srfiletop10
Q1046 0 PNGImages/airplanes/image_0686.png 3 0.974023 srfiletop10
as described here. You might have to adjust the path names for the "document-id". Then you can calculate the standard metrics with trec_eval groundtrouth.qrel results.
trec_eval --help should give you some ideas to choose the right parameters for using the measurements needed for your thesis.
trec_eval does not send any queries; you have to prepare them yourself. trec_eval only does the analysis, given a ground truth and your results.
Some basic information can be found here and here.
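Since you have to produce both files yourself, a small script usually does the job. Here is a minimal Python sketch that formats lines in the two layouts described above. The function names and the example run name are made up, and how you obtain the scored documents from your search engine is up to you.

```python
def qrel_line(query_id, doc_id, relevance):
    """One ground-truth line: query-number 0 document-id relevance."""
    return f"{query_id} 0 {doc_id} {relevance}"


def run_lines(query_id, scored_docs, run_name):
    """Result lines: query-number Q0 document-id rank score Exp, best score first."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [
        f"{query_id} Q0 {doc_id} {rank} {score} {run_name}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]


qrels = [qrel_line("Q1046", "PNGImages/airplanes/image_0671.png", 128)]
results = run_lines(
    "Q1046",
    [("PNGImages/airplanes/image_0489.png", 0.974935),
     ("PNGImages/airplanes/image_0671.png", 1.0)],
    "myrun",
)
```

Write the qrel lines to one file and the result lines to another, then pass both files to trec_eval as shown above.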
