Esper output clause with named windows - esper

I have a question about using the output clause in combination with a named window, patterns and the insert into statement.
My goal is to detect the absence of an event, store it in a named window, and, when the events start coming again, select and delete the row and use that as an "online" indicator (see Esper - detect event after absence).
I want to limit the rate of events when there are multiple offline/online events in a short period of time (to suppress false positives). I thought the output clause could help here, but when I use it on the insert into statement, no events are stored in the named window. Is this the right approach, or is there another way to limit the events in this scenario?
This is my code in Esper EPL online:
create schema MonitorStats(id string, time string, host string, alert string);
create window MonitorWindow.win:keepall() as select id, alert, time, host from MonitorStats;
insert into MonitorWindow select a.id as id, 'offline' as alert, a.time as time, a.host as host from pattern
[every a=MonitorStats(id='1234') -> (timer:interval(25 sec) and not MonitorStats(id=a.id))];
on pattern[every b=MonitorStats(id='1234') -> (timer:interval(25 sec) and MonitorStats(id=b.id))]
select and delete b.id as id, 'online' as alert, b.time as time, b.host as host from MonitorWindow as win
where win.id = b.id;

You may insert output events into another stream and use "output first every".
insert into DetectedStream select ....;
select * from DetectedStream output first every 10 seconds;
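To make the rate limiting concrete, here is a minimal Python sketch of how "output first every 10 seconds" is generally understood to behave: the first event passes through immediately, and later events are suppressed until the interval has elapsed. This is an illustration only, not Esper code; the function name output_first_every and the sample timestamps are made up for the example.

```python
def output_first_every(events, interval_sec):
    """Keep the first event, then suppress events until interval_sec has passed.

    events: iterable of (timestamp_sec, payload) tuples in time order.
    """
    emitted = []
    last_emit = None  # timestamp of the most recently emitted event
    for ts, payload in events:
        if last_emit is None or ts - last_emit >= interval_sec:
            emitted.append((ts, payload))
            last_emit = ts
    return emitted


# Hypothetical flapping sequence: several alerts within a few seconds.
alerts = [(0, "offline"), (3, "online"), (4, "offline"), (12, "online")]
result = output_first_every(alerts, 10)
print(result)
```

The alerts at 3 and 4 seconds are dropped because they fall inside the 10-second interval opened by the first alert.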

Related

Set retention time for states individually when joining KTable-KTable using KSQL

When using a KTable join in Kafka Streams, I can set the retention time individually for the changelog topic and RocksDB as follows:
clickStream.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(200)))
.reduce((oldValue, newValue) -> newValue, Materialized.<Integer, String,
WindowStore<Bytes, byte[]>>as("click").withRetention(Duration.ofSeconds(30000)));
Can I do the same for a KTable-KTable join in KSQL?
for example:
select * from clicks
left join conversions on clicks->param = conversions->param
and I would like the retention time set on the clicks KTable and the conversions KTable individually, for instance, 1 week for clicks and 1 month for conversions.
This is not yet supported by KSQL.

How to batch Esper EPL events

I'm trying to batch Notification events like this, but I'm getting one Notifications event containing a single Notification event. Can anyone help me?
Thanks in advance.
Relevant Statements
INSERT INTO Notification SELECT d.id as id,a.stationId as stationId,d.firebaseToken as firebaseToken, d.position as devicePos,a.location as stationPos,a.levelNumber as levelNumber,a.levelName as levelName FROM AirQualityAlert.win:time(3sec) as a, device.win:time(3sec) as d WHERE d.position.distance(a.location) < 300
INSERT INTO Notifications SELECT * FROM Notification.std:groupwin(id).win:time_batch(20sec) for grouped_delivery(id)
This solution delivers one row per 'id', each containing a column with the list of events.
create context Batch20Sec start @now end after 20 sec;
context Batch20Sec select id, window(*) as data
from Notifications#keepall
group by id
output all when terminated;
I think that is what you want.
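As a rough illustration of the result shape (one row per id per 20-second batch, each carrying the list of events), here is a small Python sketch. It is not EPL and does not reproduce Esper's context lifecycle; it assumes timestamps are seconds since the statement started, and the name batch_by_id and the sample data are hypothetical.

```python
from collections import defaultdict


def batch_by_id(events, batch_sec=20):
    """Group payloads by consecutive batch_sec-long bucket, then by id.

    events: iterable of (timestamp_sec, event_id, payload) tuples.
    Returns {bucket_index: {event_id: [payload, ...]}}.
    """
    batches = defaultdict(lambda: defaultdict(list))
    for ts, event_id, payload in events:
        batches[int(ts // batch_sec)][event_id].append(payload)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {bucket: dict(by_id) for bucket, by_id in batches.items()}


# Hypothetical notifications: (seconds since start, device id, payload).
notifications = [(1, "d1", "n1"), (5, "d2", "n2"), (19, "d1", "n3"), (21, "d1", "n4")]
grouped = batch_by_id(notifications)
print(grouped)
```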

Query the most recent timestamp (MAX/Last) for a specific key, in Influx

Using InfluxDB (v1.1), I need to get the timestamp of the last entry for a specific key, regardless of which measurement it is stored in and regardless of its value.
The setup is simple, where I have three measurements: location, network and usage.
There is only one key: device_id.
In pseudo-code, this would be something like:
# notice the lack of a FROM clause on measurement here...
SELECT MAX(time) WHERE 'device_id' = 'x';
The question: What would be the most efficient way of querying this?
The reason why I want this is that there will be a decentralised sync process. Some devices may have been updated in the last hour, whilst others haven't been updated in months. Being able to get a distinct "last updated on" timestamp for a device (key) would allow me to more efficiently store new points to Influx.
I've also noticed there is a similar discussion on InfluxDB's GitHub repo (#5793), but the question there is not filtering by any field/key. And this is exactly what I want: getting the 'last' entry for a specific key.
Unfortunately, there won't be a single query that gets you what you're looking for. You'll have to do a bit of work client side.
The query that you'll want is
SELECT last(<field name>), time FROM <measurement> WHERE device_id = 'x'
You'll need to run this query for each measurement.
SELECT last(<field name>), time FROM location WHERE device_id = 'x'
SELECT last(<field name>), time FROM network WHERE device_id = 'x'
SELECT last(<field name>), time FROM usage WHERE device_id = 'x'
From there, pick the result with the greatest timestamp.
> select last(value), time from location where device_id = 'x'; select last(value), time from network where device_id = 'x'; select last(value), time from usage where device_id = 'x';
name: location
time last
---- ----
1483640697584904775 3
name: network
time last
---- ----
1483640714335794796 4
name: usage
time last
---- ----
1483640783941353064 4
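The client-side step could look like the following minimal Python sketch. It assumes the three query results have already been fetched and reduced to one nanosecond timestamp per measurement; the helper name latest_result is hypothetical, and no InfluxDB client library is involved.

```python
def latest_result(results):
    """Return (measurement, timestamp_ns) for the newest entry.

    results: dict mapping measurement name to its last() timestamp in ns.
    """
    name = max(results, key=results.get)
    return name, results[name]


# Timestamps copied from the query output above.
last_seen = {
    "location": 1483640697584904775,
    "network": 1483640714335794796,
    "usage": 1483640783941353064,
}
newest = latest_result(last_seen)
print(newest)
```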
tl;dr;
The first() and last() selectors will NOT work consistently if the measurement has multiple fields and the fields have NULL values. The most efficient solution is to use these queries:
First:
SELECT * FROM <measurement> [WHERE <tag>=value] LIMIT 1
Last:
SELECT * FROM <measurement> [WHERE <tag>=value] ORDER BY time DESC LIMIT 1
Explanation:
If you have a single field in your measurement, then the suggested solutions will work, but if you have more than one field and values can be NULL then first() and last() selectors won't work consistently and may return different timestamps for each field. For example, let's say that you have the following data set:
time fieldKey_1 fieldKey_2 device
------------------------------------------------------------
2019-09-16T00:00:01Z NULL A 1
2019-09-16T00:00:02Z X B 1
2019-09-16T00:00:03Z Y C 2
2019-09-16T00:00:04Z Z NULL 2
In this case querying
SELECT first(fieldKey_1) FROM <measurement> WHERE device = '1'
will return
time fieldKey_1
---------------------------------
2019-09-16T00:00:02Z X
and the same query for first(fieldKey_2) will return a different time
time fieldKey_2
---------------------------------
2019-09-16T00:00:01Z A
A similar problem will happen when querying with last.
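The discrepancy can be reproduced with a plain-Python simulation of the example data set. This is only a sketch of the described behavior, not InfluxDB itself: None stands in for NULL, and the helpers first_non_null and first_row are hypothetical stand-ins for first(field) and LIMIT 1.

```python
# Rows mirror the example data set: (time, fieldKey_1, fieldKey_2, device).
rows = [
    ("2019-09-16T00:00:01Z", None, "A", "1"),
    ("2019-09-16T00:00:02Z", "X", "B", "1"),
    ("2019-09-16T00:00:03Z", "Y", "C", "2"),
    ("2019-09-16T00:00:04Z", "Z", None, "2"),
]


def first_non_null(rows, field_index, device):
    """Mimic first(field): earliest row for the device where the field is set."""
    for row in rows:  # rows are already in time order
        if row[3] == device and row[field_index] is not None:
            return row[0], row[field_index]
    return None


def first_row(rows, device):
    """Mimic LIMIT 1: earliest row for the device, NULL fields included."""
    for row in rows:
        if row[3] == device:
            return row
    return None


first_f1 = first_non_null(rows, 1, "1")
first_f2 = first_non_null(rows, 2, "1")
earliest = first_row(rows, "1")
print(first_f1, first_f2, earliest)
```

The two per-field results carry different timestamps, while the whole-row lookup returns the single earliest point for the device.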
And in case you are wondering, querying first(*) won't do either, since you'll get an epoch-0 time in the results, such as:
time first_fieldKey_1 first_fieldKey_2
-------------------------------------------------------------
1970-01-01T00:00:00Z X A
So, the solution would be querying using combinations of LIMIT and ORDER BY.
For instance, for the first time value you can use:
SELECT * FROM <measurement> [WHERE <tag>=value] LIMIT 1
and for the last one you can use
SELECT * FROM <measurement> [WHERE <tag>=value] ORDER BY time DESC LIMIT 1
It is safe and fast, since it relies on indexes.
It is curious that this simpler approach was mentioned in the thread linked in the opening post but was discarded. Maybe it was just overlooked.
There is also a thread on the InfluxData blog about the subject that suggests this approach.
I tried this and it worked for me in a single command:
SELECT last(<field name>), time FROM location, network, usage WHERE device_id = 'x'
The result I got :
name: location
time last
---- ----
1483640697584904775 3
name: network
time last
---- ----
1483640714335794796 4
name: usage
time last
---- ----
1483640783941353064 4

Transform SQL JOIN SELECT to Esper EPL syntax

Let's consider a simple object with the same representation in a SQL database, with properties (columns): Id, UserId, Ip.
I would like to prepare a query that generates an event when one user logs in from 2 or more IP addresses within a 1-hour period.
My SQL looks like:
SELECT id,user_id,ip FROM w_log log
LEFT JOIN
(SELECT user_id, count(distinct ip) AS ip_count FROM w_log GROUP BY user_id) ips
ON log.user_id = ips.user_id
WHERE ips.ip_count > 1
Transformation to EPL:
SELECT * FROM LogEntry.win:time(1 hour) logs LEFT INNER join
(select UserId,count(distinct Ip) as IpCount FROM LogEntry.win:time(1 hour)) ips
ON logs.UserId = ips.UserId where ips.IpCount>1
Exception:
Additional information: Incorrect syntax near '(' at line 1 column 100,
please check the outer join within the from clause near reserved keyword 'select'
UPDATE:
I was successfully able to create a schema and a named window and insert data into it (or update it). I would like to increase the counter when a new LogEvent arrives in the .win:time(10 seconds) window and decrease it when the event leaves the 10-second window. Unfortunately, istream() doesn't seem to return false when the event is in the remove stream.
create schema IpCountRec as (ip string, hitCount int)
create window IpCountWindow.win:time(10 seconds) as IpCountRec
on LogEvent.win:time(10 seconds) log
merge IpCountWindow ipc
where ipc.ip = log.ip
when matched and istream()
then update set hitCount = hitCount + 1
when matched and not istream()
then update set hitCount = hitCount - 1
when not matched
then insert select ip, 1 as hitCount
Is there something I missed?
In EPL, I don't think it is possible to put a query into the from clause. You can restructure it using "insert into" instead. Another EPL alternative is a named window or table.
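For reference, the logic the question is after (a user seen from two or more distinct IPs within a sliding one-hour window) can be sketched in plain Python. This is not EPL and says nothing about how Esper evaluates it; the function name detect_multi_ip and the sample logins are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SEC = 3600  # one hour, as in the question


def detect_multi_ip(events, window_sec=WINDOW_SEC):
    """Return (timestamp, user_id, ips) for each login that leaves a user
    with two or more distinct IPs inside the sliding window.

    events: iterable of (timestamp_sec, user_id, ip) in time order.
    """
    recent = defaultdict(deque)  # user_id -> deque of (timestamp, ip)
    alerts = []
    for ts, user_id, ip in events:
        window = recent[user_id]
        window.append((ts, ip))
        # Drop entries that have left the time window.
        while window and ts - window[0][0] >= window_sec:
            window.popleft()
        ips = {entry_ip for _, entry_ip in window}
        if len(ips) > 1:
            alerts.append((ts, user_id, sorted(ips)))
    return alerts


# Hypothetical logins: (seconds, user, ip).
logins = [
    (0, "u1", "10.0.0.1"),
    (600, "u1", "10.0.0.2"),
    (700, "u2", "10.0.0.9"),
    (5000, "u1", "10.0.0.1"),
]
found = detect_multi_ip(logins)
print(found)
```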

Esper: how to find an ID that exists in streamA but does NOT EXIST in streamB

The problem is pretty simple: extract only the records that do not exist in the other stream, given 2 different streams, using the Esper engine.
The ID exists in streamA but does NOT EXIST in streamB.
In SQL it would look like this:
SELECT *
FROM tableA
WHERE NOT EXISTS (SELECT *
FROM tableB
WHERE tableA.Id = tableB.Id)
I've tried it Esper style but it doesn't work:
SELECT *
FROM streamA.win:ext_timed(timestamp, 5 seconds) as stream_A
WHERE NOT EXISTS
(SELECT stream_B.Id
FROM streamB.win:ext_timed(timestamp, 5 seconds) as stream_B
WHERE stream_A.Id = stream_B.Id)
Sadly, if the stream_A event arrives before the stream_B event with the same Id, it satisfies the query condition at that moment, so the query reports it even though a match follows.
Any suggestions on how to identify "ID exists in streamA but does NOT EXIST in streamB" using Esper?
One simple way is to time-order the stream, so that A and B are timestamp ordered before sending events in.
Or you could delay A such as this query:
select * from pattern [every a=streamA -> timer:interval(1 sec)] as delayed_a
where not exists (... where delayed_a.a.id = b.id)
There is no need for an externally timed window for streamA. For externally timed behavior in general use external timer events.
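The idea behind delaying A can be sketched in plain Python: each A event is only checked once the delay has passed, so a B event that arrives slightly later still counts as a match. This is an illustration under simplifying assumptions, not EPL; it ignores window expiry on streamB, and the name a_without_b and the sample events are hypothetical.

```python
def a_without_b(a_events, b_events, delay_sec=1.0):
    """Return ids from A for which no B with the same id was seen by ts + delay_sec.

    a_events, b_events: lists of (timestamp_sec, id) tuples.
    """
    unmatched = []
    for a_ts, a_id in a_events:
        deadline = a_ts + delay_sec  # evaluate the check only after the delay
        seen_in_b = any(
            b_id == a_id and b_ts <= deadline for b_ts, b_id in b_events
        )
        if not seen_in_b:
            unmatched.append(a_id)
    return unmatched


# Hypothetical events: "x" is matched shortly after arrival, "y" far too late.
stream_a = [(0.0, "x"), (0.5, "y"), (2.0, "z")]
stream_b = [(0.3, "x"), (5.0, "y")]
missing = a_without_b(stream_a, stream_b)
print(missing)
```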
