Why is fill(0) creating entries and breaking the group? - influxdb

On a project where we use InfluxDB 1.7 we have a query that behaves strangely. What we're trying to do is:
1. average the data to align the time periods
2. take the max of these time periods
3. sum these max values
Here is an example that reproduces the behavior.
Insert some example data
> INSERT test,source=source1 value=10 1643202000000000000
> INSERT test,source=source1 value=30 1643203800000000000
> INSERT test,source=source2 value=20 1643202000000000000
> INSERT test,source=source2 value=60 1643203800000000000
> INSERT test,source=source2 value=20 1643202900000000000
> INSERT test,source=source2 value=50 1643204700000000000
Change precision for readability and check the data
> precision rfc3339
> select * from test
name: test
time source value
---- ------ -----
2022-01-26T13:00:00Z source1 10
2022-01-26T13:00:00Z source2 20
2022-01-26T13:15:00Z source2 20
2022-01-26T13:30:00Z source1 30
2022-01-26T13:30:00Z source2 60
2022-01-26T13:45:00Z source2 50
Based on what I described, I would expect the final sum to equal max(10, mean(20,20)) + max(30, mean(60,50)) = max(10,20) + max(30,55) = 20 + 55 = 75.
Selecting the mean value with fill(0) versus fill(none) is where things start to get weird: with fill(0), extra rows are added for no apparent reason.
> select mean(value) from test group by time(30m), source fill(0)
name: test
tags: source=source1
time mean
---- ----
2022-01-26T13:00:00Z 10
2022-01-26T13:30:00Z 30
2022-01-26T14:00:00Z 0
2022-01-26T14:30:00Z 0
2022-01-26T15:00:00Z 0
name: test
tags: source=source2
time mean
---- ----
2022-01-26T13:00:00Z 20
2022-01-26T13:30:00Z 55
2022-01-26T14:00:00Z 0
2022-01-26T14:30:00Z 0
2022-01-26T15:00:00Z 0
> select mean(value) from test group by time(30m), source fill(none)
name: test
tags: source=source1
time mean
---- ----
2022-01-26T13:00:00Z 10
2022-01-26T13:30:00Z 30
name: test
tags: source=source2
time mean
---- ----
2022-01-26T13:00:00Z 20
2022-01-26T13:30:00Z 55
But things start to get crazy when we try to select the max value for each time period
> select max(value) from (select mean(value) as value from test group by time(30m), source fill(0)) group by time(30m)
name: test
time max
---- ---
2022-01-26T13:00:00Z 10
2022-01-26T13:30:00Z 0
2022-01-26T14:00:00Z 0
2022-01-26T14:30:00Z 0
2022-01-26T15:00:00Z 0
2022-01-26T13:00:00Z 20
2022-01-26T13:30:00Z 0
2022-01-26T14:00:00Z 0
2022-01-26T14:30:00Z 0
2022-01-26T15:00:00Z 0
2022-01-26T13:30:00Z 30
2022-01-26T14:00:00Z 0
2022-01-26T14:30:00Z 0
2022-01-26T15:00:00Z 0
2022-01-26T13:30:00Z 55
2022-01-26T14:00:00Z 0
2022-01-26T14:30:00Z 0
2022-01-26T15:00:00Z 0
> select max(value) from (select mean(value) as value from test group by time(30m), source fill(none)) group by time(30m)
name: test
time max
---- ---
2022-01-26T13:00:00Z 20
2022-01-26T13:30:00Z 55
2022-01-26T14:00:00Z
2022-01-26T14:30:00Z
2022-01-26T15:00:00Z
Of course, the final results differ:
> select sum(value) from (select max(value) as value from (select mean(value) as value from test group by time(30m), source fill(0)) group by time(30m))
name: test
time sum
---- ---
1970-01-01T00:00:00Z 115
> select sum(value) from (select max(value) as value from (select mean(value) as value from test group by time(30m), source fill(none)) group by time(30m))
name: test
time sum
---- ---
1970-01-01T00:00:00Z 75
The query with fill(0) seems to have broken the grouping, as all the intermediate values are selected instead of one max per time period.
Is fill(0) broken, or am I missing/misusing something? Can anyone explain this behavior?
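For what it's worth, the trailing zero rows on their own look explainable: as far as I understand, fill() emits a row for every empty interval in the query's time range, and when no explicit bounds are given InfluxDB fills from the first data point up to now(). Bounding the query removes them (a sketch using the example timestamps above):
> select mean(value) from test where time >= '2022-01-26T13:00:00Z' and time < '2022-01-26T14:00:00Z' group by time(30m), source fill(0)
What I can't explain is why the nested query stops merging the series into a single max per time period.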

Related

When using the default 'randomForest' algorithm for classification, why doesn't the number of terminal nodes match the number of cases?

According to https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, classification trees are fully grown, meaning node size = 1.
However, if trees are really grown to a maximum, then shouldn't each terminal node contain a single case (data point, species, etc)?
If I run:
library(randomForest)
data(iris) #150 cases
set.seed(352)
rf <- randomForest(Species ~ ., iris)
hist(treesize(rf), main = "number of nodes")
I can see that most "fully grown" trees only have about 10 nodes, meaning node size can't be equal to 1... Right?
For example, status -1 below marks a terminal node in the 134th tree of the forest. Only 8 terminal nodes!?
> getTree(rf,134)
left daughter right daughter split var split point status prediction
1 2 3 3 2.50 1 0
2 0 0 0 0.00 -1 1
3 4 5 4 1.75 1 0
4 6 7 3 4.95 1 0
5 8 9 3 4.85 1 0
6 10 11 4 1.60 1 0
7 12 13 1 6.50 1 0
8 14 15 1 5.95 1 0
9 0 0 0 0.00 -1 3
10 0 0 0 0.00 -1 2
11 0 0 0 0.00 -1 3
12 0 0 0 0.00 -1 3
13 0 0 0 0.00 -1 2
14 0 0 0 0.00 -1 2
15 0 0 0 0.00 -1 3
I would be grateful if someone could explain.
"Fully grown" -> "Nothing left to split": a node of a decision tree is fully grown once all the data records assigned to it yield the same prediction.
In the iris dataset, once you reach a node with 50 setosa records in it, it doesn't make sense to split it into two child nodes of 25 setosas each.
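A minimal illustration with the same data (a sketch; the split values are read off the getTree output above, where variable 3 is Petal.Length): the root split already isolates every setosa, so the left child is pure and terminates holding all ~50 in-bag setosa cases rather than a single one.
data(iris)
# Node 1 above splits on variable 3 (Petal.Length) at 2.50; its left
# daughter (node 2) is terminal with prediction 1, i.e. setosa.
sum(iris$Petal.Length < 2.5)                  # 50 rows fall below the split
table(iris$Species[iris$Petal.Length < 2.5])  # all 50 of them are setosa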

InfluxDB: High cardinality for specific shards

I'm querying data from different shards and used EXPLAIN to check how many series are being fetched for that particular date range.
> SHOW SHARDS
.
.
658 mydb autogen 658 2019-07-22T00:00:00Z 2019-07-29T00:00:00Z 2020-07-27T00:00:00Z
676 mydb autogen 676 2019-07-29T00:00:00Z 2019-08-05T00:00:00Z 2020-08-03T00:00:00Z
.
.
Executing EXPLAIN for data from shard 658 gives the expected result in terms of the number of series: SensorId is the only tag key, and since the date range falls within a single shard, it reports NUMBER OF SERIES: 1.
> EXPLAIN select "kWh" from Reading where (SensorId =~ /^1186$/) AND time >= '2019-07-27 00:00:00' AND time <= '2019-07-28 00:00:00' limit 10;
QUERY PLAN
----------
EXPRESSION: <nil>
AUXILIARY FIELDS: "kWh"::float
NUMBER OF SHARDS: 1
NUMBER OF SERIES: 1
CACHED VALUES: 0
NUMBER OF FILES: 2
NUMBER OF BLOCKS: 4
SIZE OF BLOCKS: 32482
But when I run the same query on a date range that falls into shard 676, the number of series is 13140 instead of just one.
> EXPLAIN select "kWh" from Reading where (SensorId =~ /^1186$/) AND time >= '2019-07-29 00:00:00' AND time < '2019-07-30 00:00:00';
QUERY PLAN
----------
EXPRESSION: <nil>
AUXILIARY FIELDS: "kWh"::float
NUMBER OF SHARDS: 1
NUMBER OF SERIES: 13140
CACHED VALUES: 0
NUMBER OF FILES: 11426
NUMBER OF BLOCKS: 23561
SIZE OF BLOCKS: 108031642
Environment info:
System info: Linux 4.4.0-1087-aws x86_64
InfluxDB version: InfluxDB v1.7.6 (git: 1.7 01c8dd4)
Update - 1
On checking field cardinality, I observed a spike in RAM.
> SHOW FIELD KEY CARDINALITY
Update - 2
I've rebuilt the indexes, but the cardinality is still high.
Update - 3
I found out that the shard has "SensorId" as a tag as well as a field, which is causing the high cardinality when querying with the "SensorId" filter.
> SELECT COUNT("SensorId") from Reading GROUP BY "SensorId";
name: Reading
tags: SensorId=
time count
---- -----
1970-01-01T00:00:00Z 40
But when I check the tag values for key 'SensorId', the empty string present in the query above does not show up.
> show tag values with key = "SensorId"
name: Reading
key value
--- -----
SensorId 10034
SensorId 10037
SensorId 10038
SensorId 10039
SensorId 10040
SensorId 10041
.
.
.
SensorId 9938
SensorId 9939
SensorId 9941
SensorId 9942
SensorId 9944
SensorId 9949
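(The tag/field collision from Update 3 can be double-checked by listing both key types; SHOW FIELD KEYS and SHOW TAG KEYS are standard InfluxQL. If "SensorId" shows up in both outputs, some points were written with SensorId as a field and therefore belong to series with an empty SensorId tag value.)
> SHOW FIELD KEYS FROM Reading
> SHOW TAG KEYS FROM Reading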
Update - 4
I inspected the data using influx_inspect dumptsm and re-validated that null (empty) tag values are present:
$ influx_inspect dumptsm -index -filter-key "" /var/lib/influxdb/data/mydb/autogen/235/000008442-000000013.tsm
Index:
Pos Min Time Max Time Ofs Size Key Field
1 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 5 103 Reading 1001
2 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 108 275 Reading 2001
3 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 383 248 Reading 2002
4 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 631 278 Reading 2003
5 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 909 278 Reading 2004
6 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1187 184 Reading 2005
7 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1371 103 Reading 2006
8 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1474 250 Reading 2007
9 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1724 103 Reading 2008
10 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1827 275 Reading 2012
11 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 2102 416 Reading 2101
12 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 2518 103 Reading 2692
13 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 2621 101 Reading SensorId
14 2019-07-29T00:00:05Z 2019-07-29T05:31:07Z 2722 1569 Reading,SensorId=10034 2005
15 2019-07-29T05:31:26Z 2019-07-29T11:03:54Z 4291 1467 Reading,SensorId=10034 2005
16 2019-07-29T11:04:14Z 2019-07-29T17:10:16Z 5758 1785 Reading,SensorId=10034 2005
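Given that, a way to isolate the offending points is to filter on the empty tag value, since in InfluxQL an empty string matches series where the tag is absent. This is only a sketch; verify on a throwaway database before dropping anything:
> select * from Reading where "SensorId" = '' limit 5
> drop series from Reading where "SensorId" = ''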

Group data by consecutive points in Influxdb

Suppose I have this data:
time value
---- ----
0 28
1 27
2 26
3 25
4 26
5 27
I want to get the values greater than 25, split into groups of consecutive points, as follows:
Group1
time value
---- ----
0 28
1 27
2 26
Group2
time value
---- ----
4 26
5 27
Is there any way to do that with one query, or should I post-process the data?
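(One partial approach in pure InfluxQL, sketched under the assumption of a measurement named m with one point per second: filter first, then use elapsed() to expose the gaps. Any elapsed value larger than the sampling interval marks the start of a new group; the actual splitting into Group1/Group2 still has to happen client-side.)
> select elapsed(value, 1s) from (select value from m where value > 25)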

Count of sub-select in influxdb not returning anything

I've got some data structured as such
select * from rules where time > now() - 1m limit 5
name: rules
time ackrate consumers deliverrate hostname publishrate ready redeliverrate shard unacked version
---- ------- --------- ----------- -------- ----------- ----- ------------- ----- ------- -------
1513012628943000000 864 350 861.6 se-rabbit14 975.8 0 0 14 66 5
1513012628943000000 864.8 350 863 se-rabbit9 920.8 0 0 09 64 5
1513012628943000000 859.8 350 860.2 se-rabbit8 964.2 0 0 08 58 5
1513012628943000000 864.8 350 863.6 se-rabbit16 965.4 0 0 16 64 5
1513012631388000000 859.8 350 860.2 se-rabbit8 964.2 0 0 08 58 5
I want to calculate the percentage of 'up-time' defined as the amount of time when the queue has no ready messages.
I can get the maximum number of ready messages in each minute:
select max(ready) from rules where time > now() - 1h group by time(1m) limit 5
name: rules
time max
---- ---
1513009560000000000 0
1513009620000000000 0
1513009680000000000 0
1513009740000000000 0
1513009800000000000 0
Using a sub-query, I can select only the minutes with a non-zero ready count.
select ready from (select max(ready) as ready from rules where time > now() - 1h group by time(1m)) where ready > 0
name: rules
time ready
---- -----
1513010520000000000 49
1513013280000000000 57
I wanted to get a count of these values and then, with a bit of math, calculate a percentage. In this case, with 2 results in the last hour:
(60 minutes - 2) / 60 minutes = 58/60 ≈ 96.7%
When I try to count this though, I get no response.
select count(ready) from (select max(ready) as ready from rules where time > now() - 1h group by time(1m)) where ready > 0
This is v1.2.2.
How can I return a count of the number of results?
The solution was simply to upgrade from v1.2.2 to v1.3.8. With the later version:
select count(ready) from (select max(ready) as ready from rules where time > now() - 1h group by time(1m)) where ready > 0
name: rules
time count
---- -----
0 6
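(With the 6 non-zero minutes returned here, the same math gives (60 - 6) / 60 = 90% up-time.)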

ERROR while implementing Cox PH model for recurrent event survival analysis using counting process

I have been trying to run a Cox PH model on a sample data set of 10k customers (randomly taken from a 32-million-customer base) to predict the probability of survival at time t (months, in my case). I am using recurrent-event survival analysis with the counting process formulation for e-commerce. For this:
1. Observation starting point: right after a customer makes first purchase
2. Start/Stop times: Months of two consecutive purchases (as in the data)
I have a few independent variables as in the sample data below:
id start stop status tenure orders revenue Quantity
A 0 20 0 0 1 $89.0 1
B 0 17 0 0 1 $556.0 2
C 0 17 0 0 1 $900.0 2
D 32 33 0 1679 9 $357.8 9
D 26 32 1 1497 7 $326.8 7
D 23 26 1 1405 4 $142.9 4
D 17 23 1 1219 3 $63.9 3
D 9 17 1 978 2 $50.0 2
D 0 9 1 694 1 $35.0 1
E 0 15 0 28 2 $156.0 2
F 0 15 0 0 1 $348.0 1
F 12 14 0 375 2 $216.8 3
F 0 12 1 0 1 $67.8 2
G 9 15 0 277 2 $419.0 2
G 0 9 1 0 1 $359.0 1
While running Cox PH using the following code:
library(survival)
fit10 <- coxph(Surv(start, stop, status) ~ orders + tenure + Quantity + revenue, data = test)
I keep getting the following warning:
Warning: X matrix deemed to be singular; variable 1 2 3 4
I tried searching for this warning online, but the answers I found said it could be caused by interacting independent variables, whereas my variables are individual and continuous.
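One thing worth checking before ruling that out (a sketch, assuming the data frame is called test as in the coxph call, and that revenue still carries the $ signs shown above): the singular-matrix warning usually points at linearly dependent columns, and in the sample data orders and Quantity track each other almost exactly.
# Hypothetical check on the predictors: strip the "$" so revenue is
# numeric, then look for near-perfect correlations.
test$revenue <- as.numeric(sub("\\$", "", test$revenue))
cor(test[, c("orders", "tenure", "Quantity", "revenue")])
# A correlation near 1 (e.g. between orders and Quantity) makes the
# model matrix (nearly) singular, which triggers exactly this warning.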
