Combining LAST and Cumulative SUM on influxdb subquery data


I have some telemetry data that looks like this:
> select * from connected_users LIMIT 100
name: connected_users
time event_id value
---- -------- -----
1605485019101319658 13759 2
1605485299566004832 13759 0
1605490011182967926 13760 4
1605490171409428188 13760 0
1605490207031398071 13760 7
1605490246151204709 13760 0
1605491054403726308 13761 1
1605491403050521520 13761 0
1605491433407347399 13762 2
1605491527773331816 13762 3
1605492020976088377 12884 1
1605492219827002782 13761 1
1605492613984911844 13763 1
1605492806683323942 13763 0
...
These writes only occur when something changes on the event (i.e. they are not at fixed intervals). I want to write a query that gives me a per-minute cumulative sum of the current "value" across all event_ids. However, because I can't guarantee that a new data value will have been written in the preceding 60 seconds, I use LAST to get whatever was last set per event_id.
So far I got to:
SELECT SUM(*) FROM
(SELECT LAST("value") FROM
"connected_users" WHERE
time <= now() AND
time >= now() - 3h
GROUP BY time(1m), "event_id" fill(previous))
GROUP BY time(1m)
But this seems to give me a much lower outer "value" than expected, and a lot of duplicate time entries (and thus a lot of duplicate rows in the output data).
I can see that the inner query is correct, because if I run it in isolation, I can stack the output data in Grafana and manually see the correct total value in the graph. However, I want a single series rather than a grouped set of series, and I can't wrap my head around how to transform the data to do that.
EDIT: To give more context:
This is the inner query (Grafana, hence $timeFilter):
SELECT last("value") FROM "connected_users" WHERE $timeFilter GROUP BY time(1m), "event_id" fill(previous)
This produces a chart which I can stack and is correct:
If I then wrap that inner query in a SUM and GROUP BY time(1m) again, I can isolate a single series:
SELECT SUM(*) FROM (SELECT last("value") FROM "connected_users" WHERE $timeFilter AND ("event_id" = '9970') GROUP BY time(1m), "event_id" fill(previous)) GROUP BY time(1m)
However, if I remove the AND and attempt to SUM all series values, I end up with a jumbled mess (presumably because there are duplicate/overlapping time values?) and also a lower max value than expected (expecting 18, got 8):
SELECT SUM(*) FROM (SELECT last("value") FROM "connected_users" WHERE $timeFilter GROUP BY time(1m), "event_id" fill(previous)) GROUP BY time(1m)
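For reference, the result the query is aiming for can be sketched outside InfluxDB. This is a minimal Python sketch, not an InfluxQL fix: it assumes the points are available as (time_ns, event_id, value) tuples, and the function name `summed_last_values` is hypothetical. It keeps the last known value per event_id, carries it forward into minutes without writes, and sums across all event_ids for each minute.

```python
def summed_last_values(points, start_ns, end_ns, step_ns=60 * 10**9):
    """Sum the most recent value of every event_id for each time bucket.

    points: iterable of (time_ns, event_id, value) tuples (assumed layout).
    Returns a list of (bucket_start_ns, total) pairs, one per bucket.
    """
    ordered = sorted(points, key=lambda p: p[0])
    current = {}  # event_id -> last value seen so far
    totals = []
    i = 0
    bucket = start_ns
    while bucket < end_ns:
        bucket_end = bucket + step_ns
        # Apply every write that happened before this bucket closes.
        while i < len(ordered) and ordered[i][0] < bucket_end:
            _, event_id, value = ordered[i]
            current[event_id] = value
            i += 1
        # Events with no write in this bucket keep their previous value.
        totals.append((bucket, sum(current.values())))
        bucket = bucket_end
    return totals
```

Note that this sketch also carries forward values written before `start_ns`, whereas the inner query above only sees points inside its time filter.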

Related

JOIN ON second highest value (Impala)

I don't know how or even if this is possible.... I am trying to JOIN tables on the second highest value. I tried rowNumber, lag, lead & rank but haven't been able to get any of them to do what I need. To summarize, I'm just trying to shift the activitydate table down one row to join on rollDate minus 1 (but can't use -1 because they are not consistent dates, there are days missing.)
Does anyone know a good way to do this? Any suggestions are appreciated!
Select
ds.activitydate
,sum(ws.weeklyTotals / ds.daysBetween) as newRunRates -- getting an average of daily activity from weekly totals
from
(select
fsc.activitydate
,fsc.weekstart
,max(fsc.activitydate) OVER (partition by fsc.weekstart) as rollUpDate
,datediff(to_date(max(fsc.activitydate) OVER (partition by fsc.weekstart)), to_date(fsc.weekstart)) + 1 as daysBetween
from fiscalcalendar fsc
) ds -- used this to get a week-ending date bc that is what I need to join on. I only have a week start in this table
left join
(select
activitydate_iso
,count(distinct assignedmaincomponentid) as weeklyTotals
from activityTable
group by 1
) ws -- weeklySplits -- this gives me my weekly totals by a week ending date
on ds.rollUpDate = ws.activitydate_iso
-- need this join logic to actually be
-- on ds.rollUpDate = (max(ws.activitydate_iso) where activitydate_iso < rollUpDate)
where activitydate between '2020-05-22' and '2020-06-15'
group by 1,2
order by 1,2
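The join condition described in the comment above ("the greatest activitydate_iso strictly below rollUpDate") can be illustrated outside SQL. This is a minimal Python sketch, not an Impala solution: it assumes the dates are ISO-formatted strings (which sort chronologically) held in a sorted list, and `previous_date` is a hypothetical helper name.

```python
from bisect import bisect_left


def previous_date(sorted_dates, roll_up_date):
    """Return the greatest date strictly below roll_up_date, or None.

    sorted_dates must be sorted ascending; ISO 'YYYY-MM-DD' strings
    compare in chronological order, so plain string comparison works.
    """
    idx = bisect_left(sorted_dates, roll_up_date)
    return sorted_dates[idx - 1] if idx > 0 else None
```

Because the lookup is by position rather than by "date minus one day", gaps in the calendar do not matter.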

InfluxDB 1.7.2 - Top X over time

I’m new to InfluxDB. I’m using it to store ntopng timeseries data.
ntopng writes a measurement called asn:traffic that stores how many bytes were sent and received for an ASN.
> show tag keys from "asn:traffic"
name: asn:traffic
tagKey
------
asn
ifid
> show field keys from "asn:traffic"
name: asn:traffic
fieldKey fieldType
-------- ---------
bytes_rcvd float
bytes_sent float
>
I can run a query to see the data rate in bps for a specific ASN:
> SELECT non_negative_derivative(mean("bytes_rcvd"), 1s) * 8 FROM "asn:traffic" WHERE "asn" = '2906' AND time >= now() - 12h GROUP BY time(30s) fill(none)
name: asn:traffic
time non_negative_derivative
---- -----------------------
1550294640000000000 30383200
1550294700000000000 35639600
...
...
...
>
However, what I would like to do is create a query that I can use to return the top N ASNs by data rate and plot that on a Grafana graph. Sort of like this example that is using ELK.
I've tried a few variants from posts here and elsewhere, but I haven't been able to get what I'm after. For example, this query I think gets me closer to where I want to be, but there are no values in asn:
> select top(bps,asn,10) from (SELECT non_negative_derivative(mean(bytes_rcvd), 1s) * 8 as bps FROM "asn:traffic" WHERE time >= now() - 12h GROUP BY time(30s) fill(none))
name: asn:traffic
time top asn
---- --- ---
1550299860000000000 853572800
1550301660000000000 1197327200
1550301720000000000 1666883866.6666667
1550310780000000000 674889600
1550329320000000000 20979431866.666668
1550332740000000000 707015600
1550335920000000000 2066646533.3333333
1550336820000000000 618554933.3333334
1550339280000000000 669084933.3333334
1550340300000000000 704147333.3333334
>
I then thought that perhaps the subquery needs to select asn as well; however, that produces an error about mixing queries:
> select top(bps,asn,10) from (SELECT asn, non_negative_derivative(mean(bytes_rcvd), 1s) * 8 as bps FROM "asn:traffic" WHERE time >= now() - 12h GROUP BY time(30s) fill(none))
ERR: mixing aggregate and non-aggregate queries is not supported
>
Anyone have any thoughts on a solution?
EDIT 1
Per the suggestion by George Shuklin, modifying the query to include asn in GROUP BY displays ASN in the CLI output, but that doesn't translate in Grafana. I'm expecting a stacked graph with each layer of the stacked graph being one of the top 10 asn results.

Try making asn a tag; then you can use GROUP BY time(30s), "asn", and that tag will be available in the outer query.
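If the per-ASN series are fetched with the grouped query, the "top N" selection can also be done client-side. This is a minimal Python sketch under that assumption, not an InfluxQL or Grafana feature: `series_by_asn` is assumed to map each ASN to a non-empty list of bps values, and `top_series` is a hypothetical helper that ranks whole series by their peak rate.

```python
def top_series(series_by_asn, n, key=max):
    """Keep the n series whose score (by default the peak value) is highest.

    series_by_asn: dict mapping an ASN to a non-empty list of rates.
    Returns a dict with the selected ASNs, highest score first.
    """
    ranked = sorted(
        series_by_asn,
        key=lambda asn: key(series_by_asn[asn]),
        reverse=True,
    )
    return {asn: series_by_asn[asn] for asn in ranked[:n]}
```

Passing a different `key` (for example `sum`) ranks by total volume instead of peak rate.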

complex db2/sql query with time-sampling, group, map, join and csv export

I have data in a table (named: TESTING) on a dashDB2 on IBM bluemix (Db2 Warehouse on Cloud) which is looking like this:
ID TIMESTAMP NAME VALUE
abc 2017-12-21 19:55:38.762 test1 123
abc 2017-12-21 19:55:42.762 test2 456
abc 2017-12-21 19:57:38.762 test1 789
abc 2017-12-21 19:58:38.762 test3 345
def 2017-12-21 19:59:38.762 test1 678
I am looking for a query that:
1. samples the data (for each NAME) to a given time format (e.g. to a 1-minute based timestamp)
2. averages VALUEs in the same time range (in the same minute); empty times should be NULL
for 1. and 2. something like this (only working for one NAME):
with dummy(temporaer) as (
select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
union all
select temporaer + 1 MINUTES from dummy where temporaer < TIMESTAMP('2018-02-01')
)
select temporaer, avg(VALUE) as test1 from dummy
LEFT OUTER JOIN TESTING ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='test1'
group by temporaer
ORDER BY temporaer ASC;
3. joins all the different NAMEs column-wise into a matrix, like:
TIMESTAMP test1 test2 test3
2017-12-01 00:00:00 null null null
...
2017-12-21 19:55:00 123 456 null
2017-12-21 19:56:00 null null null
2017-12-21 19:57:00 789 null null
2017-12-21 19:58:00 678 null 345
...
2018-01-31 23:59:00 null null null
4. exports the query result as a CSV, or returns it as a CSV string
Does anybody know how this could be done in one query or in a simple and fast way? Or is it necessary to save the data in another table format? Can you give me a hint?
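The four steps above (bucket to the minute, average per NAME, pivot NAMEs into columns, emit CSV) can be sketched outside the database. This is a minimal Python sketch, not a Db2 query: it assumes the rows for one ID have already been fetched as (timestamp, name, value) tuples, that `start` is aligned to a full minute, and `pivot_to_csv` is a hypothetical function name. Empty cells are written as empty strings rather than NULL.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta


def pivot_to_csv(rows, start, end):
    """Average values per (minute, name) and return a ';'-delimited CSV string.

    rows: list of (timestamp: datetime, name: str, value: number) tuples.
    Every minute in [start, end) gets a line, even when it has no data.
    """
    names = sorted({name for _, name, _ in rows})
    cells = defaultdict(lambda: [0.0, 0])  # (minute, name) -> [sum, count]
    for ts, name, value in rows:
        minute = ts.replace(second=0, microsecond=0)
        cell = cells[(minute, name)]
        cell[0] += value
        cell[1] += 1

    out = io.StringIO()
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(["TIME"] + names)
    minute = start
    while minute < end:
        line = [minute.strftime("%Y-%m-%d %H:%M:%S")]
        for name in names:
            total, count = cells.get((minute, name), (0.0, 0))
            line.append(total / count if count else "")
        writer.writerow(line)
        minute += timedelta(minutes=1)
    return out.getvalue()
```

Doing the pivot on the client avoids generating one row per (minute, name) pair inside the database, at the cost of transferring the raw rows.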

Here is a code snippet that does the job, but takes a very long time:
WITH
-- get all distinct names in table:
header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$') AND DATE(TIMESTAMP)>='2017-12-19' AND DATE(TIMESTAMP)<'2017-12-24'),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
dummie(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$')),
-- generate a range of times from date to date in defined steps:
dummy(time, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM dummy
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- add each name (from header) to each time/row (in dummy):
dumpy(time, names) AS (SELECT Dummy.time, Header.names
FROM Dummy
LEFT OUTER JOIN Header
ON Dummy.time IS NOT NULL),
-- averages values by name and timeinterval and sorts result to dummy:
dummj(time, names, avgvalues) AS (SELECT Dummy.time, Dummie.names, AVG(Dummie.values)
FROM Dummy
LEFT OUTER JOIN Dummie
ON Dummie.time = Dummy.time
GROUP BY Dummie.names, Dummy.time),
-- joins the averages (by time, name) values to the times and names in dumpy (on empty value use -9999):
testo(time, names, avgvalues) AS (SELECT Dumpy.time, Dumpy.names, COALESCE(Dummj.avgvalues,-9999)
FROM Dumpy
LEFT OUTER JOIN Dummj
ON Dummj.time = Dumpy.time AND Dummj.names = Dumpy.names),
-- converts the high amount of rows to less rows with delimited strings:
test(time, names, avgvalues) AS (SELECT time, LISTAGG(names,';') WITHIN GROUP(ORDER BY names), LISTAGG(avgvalues,';') WITHIN GROUP(ORDER BY names)
FROM Testo
GROUP BY time)
SELECT * FROM test ORDER BY time ASC, names ASC;
The performance problem is in the "testo" subquery. Does anybody have an idea what is wrong here, or know how to improve the query?

Well, one problem I see is that you keep using functions on columns, but that shouldn't be too big a drain if id is reasonably unique. If this query is very common, it may also be worth it to permanently build and index the range table. Hmm, you probably need several indices (starting with FieldTest.id), but you might also try this version:
-- let's name things properly, too, to keep them straight.
WITH
-- generate a range of times from date to date in defined steps:
Range (rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM Range
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FieldTest
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
-- just make the white space check part of the regex
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data (rangeStart, name, averaged) AS (SELECT Range.rangeStart, Header.names, COALESCE(AVG(FieldTest.value), -9999)
FROM Range
CROSS JOIN Header
LEFT JOIN FieldTest
ON FieldTest.id = '7b9bbe44d45d8f2ac324849a4951da54'
AND FieldTest.name = Header.names
AND FieldTest.timestamp >= Range.rangeStart
AND FieldTest.timestamp < Range.rangeEnd
GROUP BY Range.rangeStart, Header.names)
-- I can't recall if DB2 allows using the new column name this way, you may need to wrap this again
SELECT rangeStart,
-- converts the high amount of rows to less rows with delimited strings:
LISTAGG(name,';') WITHIN GROUP(ORDER BY name) AS names,
LISTAGG(averaged,';') WITHIN GROUP(ORDER BY name) AS avgvalues
FROM Data
GROUP BY rangeStart
ORDER BY rangeStart, names
(not tested)

The CROSS JOIN was definitely a nice hint. Although I was not able to implement the following LEFT JOIN the way you suggested, I found a workaround which, I am sure, still leaves room for improvement but is acceptable for me at the moment (about 30 times faster than my first query). Here is the current code:
WITH
-- generate a range of times from date to date in defined steps:
TimeRange(rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM TimeRange
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
rawData(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data(rangeStart, name, averaged) AS (SELECT TimeRange.rangeStart, Header.names, COALESCE(AVG(rawData.values), -9999)
FROM TimeRange
CROSS JOIN Header
LEFT JOIN rawData
ON rawData.names = Header.names
AND rawData.time = TimeRange.rangeStart
GROUP BY TimeRange.rangeStart, Header.names),
test(time, names, avgvalues) AS (SELECT Data.rangeStart,
LISTAGG(Data.name,';') WITHIN GROUP(ORDER BY name),
LISTAGG(Data.averaged,';') WITHIN GROUP(ORDER BY name)
FROM Data
GROUP BY Data.rangeStart)
-- build my own delimited export-string:
SELECT CONCAT(CONCAT(SUBSTR(REPLACE(time,'.',':'),1,19),';'), REPLACE(CAST(avgvalues AS VARCHAR(3980)),'-9999',''))
FROM test
UNION ALL
SELECT CONCAT(CAST('TIME;' AS VARCHAR(5)), CAST(LISTAGG(names,';') WITHIN GROUP(ORDER BY names) AS VARCHAR(3980)))
FROM Header;

How to get the number of entries in a measurement

I am a newbie to influxdb. I just started to read the influx documentation.
I can't seem to get the equivalent of 'select count(*) from table' to work in InfluxDB.
I have a measurement called cart:
time status cartid
1456116106077429261 0 A
1456116106090573178 0 B
1456116106095765618 0 C
1456116106101532429 0 D
but when I try to do
select count(cartid) from cart
I get the error
ERR: statement must have at least one field in select clause

I suppose cartId is a tag rather than a field value? count() currently can't be used on tag and time columns. So if your status is a non-tag column (a field), do the count on that.
EDIT:
Reference

This works as long as no field or tag exists with the name count:
SELECT SUM(count) FROM (SELECT *,count::INTEGER FROM MyMeasurement GROUP BY count FILL(1))
If it does, use some other name for the count field. This works by first selecting all entries including an unpopulated field (count), then grouping by the unpopulated field, which does nothing except allow us to use the fill operator to assign 1 to count for each entry. Then we select the sum of the count fields in an outer query. The result should look like this:
name: MyMeasurement
----------------
time sum
0 47799
It's a bit hacky but it's the only way to guarantee a count of all entries when no field exists that is always present in all entries.

How to use joins and averages together in Hive queries

I have two tables in hive:
Table1: uid,txid,amt,vendor
Table2: uid,txid
Now I need to join the tables on txid which basically confirms a transaction is finally recorded. There will be some transactions which will be present only in Table1 and not in Table2.
I need to find the average rate of transaction matches per user (uid) per vendor. Then I need to find the average of these averages, by adding all the per-user averages and dividing the sum by the number of unique users per vendor.
Let's say I have the data:
Table1:
u1,120,44,vend1
u1,199,33,vend1
u1,100,23,vend1
u1,101,24,vend1
u2,200,34,vend1
u2,202,32,vend2
Table2:
u1,100
u1,101
u2,200
u2,202
Example For vendor vend1:
u1-> Avg transaction find rate = 2(matches found in both Tables,Table1 and Table2)/4(total occurrence in Table1) =0.5
u2 -> Avg transaction find rate = 1/1 = 1
Avg of avgs = (0.5 + 1) (sum of avgs) / 2 (total unique users) = 0.75
Required output:
vend1,0.75
vend2,1
I can't seem to get both the count of matches and the count of occurrences in Table1 alone in one Hive query per user per vendor. I have got as far as this query and can't work out how to change it further.
SELECT A.vendor,A.uid,count(*) as totalmatchesperuser FROM Table1 A JOIN Table2 B ON A.uid = B.uid AND B.txid =A.txid group by vendor,A.uid
Any help would be great.

I think you are running into trouble with your JOIN. When you JOIN by txid and uid, you are losing the total number of uid's per group. If I were you I would assign a column of 1's to table2 and name the column something like success or transaction and do a LEFT OUTER JOIN. Then in your new table you will have a column with the number 1 in it if there was a completed transaction and NULL otherwise. You can then do a case statement to convert these NULLs to 0
Query:
select vendor
,(SUM(avg_uid) / COUNT(uid)) as avg_of_avgs
from (
select vendor
,uid
,AVG(complete) as avg_uid
from (
select uid
,txid
,amt
,vendor
,case when success is null then 0
else success
end as complete
from (
select A.*
,B.success
from table1 as A
LEFT OUTER JOIN table2 as B
ON B.txid = A.txid
) x
) y
group by vendor, uid
) z
group by vendor
Output:
vend1 0.75
vend2 1.0
B.success in line 17 is the column of 1's that I put into table2 before the JOIN. If you are curious about case statements in Hive you can find them here
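The LEFT OUTER JOIN approach can be checked against the sample data from the question. This is a minimal sketch that runs an equivalent query in SQLite through Python's sqlite3 module rather than in Hive, so dialect details may differ. It assumes txid is unique in Table2, tests B.txid for NULL instead of adding a success column, and `avg_of_avgs` is a hypothetical helper name.

```python
import sqlite3

TABLE1 = [
    ("u1", 120, 44, "vend1"), ("u1", 199, 33, "vend1"),
    ("u1", 100, 23, "vend1"), ("u1", 101, 24, "vend1"),
    ("u2", 200, 34, "vend1"), ("u2", 202, 32, "vend2"),
]
TABLE2 = [("u1", 100), ("u1", 101), ("u2", 200), ("u2", 202)]


def avg_of_avgs(table1_rows, table2_rows):
    """Return [(vendor, average of per-user match rates)] ordered by vendor."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (uid TEXT, txid INTEGER, amt INTEGER, vendor TEXT)")
    conn.execute("CREATE TABLE table2 (uid TEXT, txid INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", table1_rows)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", table2_rows)
    query = """
        SELECT vendor, AVG(avg_uid) AS avg_of_avgs
        FROM (
            -- per user and vendor: share of Table1 rows with a match in Table2
            SELECT A.vendor AS vendor, A.uid AS uid,
                   AVG(CASE WHEN B.txid IS NULL THEN 0.0 ELSE 1.0 END) AS avg_uid
            FROM table1 AS A
            LEFT OUTER JOIN table2 AS B ON B.txid = A.txid
            GROUP BY A.vendor, A.uid
        )
        GROUP BY vendor
        ORDER BY vendor
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Since the inner query yields exactly one row per (vendor, uid), taking AVG(avg_uid) in the outer query is the same as SUM(avg_uid) / COUNT(uid).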

Amazing and precise answer by GoBrewers14!! Thank you so much. I was looking at it from a wrong perspective.
I made little changes in the query to get things finally done.
I didn't need to add a "success" column to table2. I checked B.txid in the above query instead of B.success. B.txid will be NULL when no match is found and hold some value when a match is found, so it captures the success and failure conditions without adding a new column. I then mapped NULL to 0 and non-NULL to 1 in the CASE expression one level up. I also changed some column names because Hive was reporting them as ambiguous.
The final query looks like :
select vendr
,(SUM(avg_uid) / COUNT(usrid)) as avg_of_avgs
from (
select vendr
,usrid
,AVG(complete) as avg_uid
from (
select usrid
,txnid
,amnt
,vendr
,case when success is null then 0
else 1
end as complete
from (
select A.uid as usrid,A.vendor as vendr,A.amt as amnt,A.txid as txnid
,B.txid as success
from Table1 as A
LEFT OUTER JOIN Table2 as B
ON B.txid = A.txid
) x
) y
group by vendr, usrid
) z
group by vendr;
