Does anybody have a KSQL query that counts events in a topic on a per-hour basis? - ksqldb

I am new to KSQL and I am trying to get the count of the events in a topic, grouped per hour. Failing that, I would settle for counting the events in the topic; I could then change the query to work on a windowing basis. The timestamps can be seen in the sample message below.
To give more context, let's assume my topic is called messenger and the events are in JSON format. Here is a sample message:
{"name":"Won","message":"This message is from Won","ets":1642703358124}
Partition:0 Offset:69 Timestamp:1642703359427

First create a stream over the topic:
CREATE STREAM my_stream (NAME VARCHAR, MESSAGE VARCHAR)
WITH (KAFKA_TOPIC='my_topic', FORMAT='JSON');
Then use a TUMBLING window aggregation and a dummy GROUP BY field:
SELECT TIMESTAMPTOSTRING(WINDOWSTART,'yyyy-MM-dd HH:mm:ss','Europe/London')
AS WINDOW_START_TS,
COUNT(*) AS RECORD_CT
FROM my_stream
WINDOW TUMBLING (SIZE 1 HOURS)
GROUP BY 1
EMIT CHANGES;
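If you also want the hourly counts materialized as a table you can query later (rather than just a streaming result in the CLI), something along these lines should work; the table name is illustrative, and depending on your ksqlDB version you may need to express the dummy grouping key differently:
CREATE TABLE MESSAGE_COUNT_BY_HOUR AS
  SELECT 1 AS DUMMY_KEY,
         COUNT(*) AS RECORD_CT
  FROM my_stream
  WINDOW TUMBLING (SIZE 1 HOURS)
  GROUP BY 1;
The resulting table is keyed by the window, so the hourly counts are then available to subsequent push or pull queries.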
If you want to override the default behaviour of using the Kafka message timestamp, and instead use a custom timestamp field (I can see ets in your sample), you declare that in the stream definition:
CREATE STREAM my_stream (NAME VARCHAR, MESSAGE VARCHAR, ETS BIGINT)
WITH (KAFKA_TOPIC='my_topic', FORMAT='JSON', TIMESTAMP='ets');
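To sanity-check which timestamp ksqlDB is now using for each record, a quick push query against the stream is handy (a sketch; the column list just reflects the sample schema):
SELECT TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss') AS SOURCE_TS,
       NAME,
       MESSAGE
FROM my_stream
EMIT CHANGES
LIMIT 5;
After the TIMESTAMP='ets' override, ROWTIME reflects the ets value rather than the broker-assigned record timestamp.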
Ref: https://rmoff.net/2020/09/08/counting-the-number-of-messages-in-a-kafka-topic/

Related

How to join a KSQL table and a stream on a non row key column

I am using the community edition of Confluent Platform version 5.4.1. I did not find any CLI command to print the KSQL server version, but what I see when I start KSQL can be found in the attached screenshot.
I have a geofence table -
CREATE TABLE GEOFENCE (GEOFENCEID INT,
FLEETID VARCHAR,
GEOFENCECOORDINATES VARCHAR)
WITH (KAFKA_TOPIC='MONGODB-GEOFENCE',
VALUE_FORMAT='JSON',
KEY= 'GEOFENCEID');
The data comes into the Geofence KSQL table from a Kafka MongoDB source connector whenever an insert or update operation is performed on the geofence MongoDB collection by a web application backed by a REST API. The idea behind making geofence a table is that tables are mutable, so it will hold the latest geofence information: inserts and updates will not be very frequent, and whenever the geofence MongoDB collection changes, the Geofence KSQL table gets updated accordingly, since the key here is GeofenceId.
I have a live stream of vehicle position -
CREATE STREAM VEHICLE_POSITION (VEHICLEID INT,
FLEETID VARCHAR,
LATITUDE DOUBLE,
LONGITUDE DOUBLE)
WITH (KAFKA_TOPIC='VEHICLE_POSITION',
VALUE_FORMAT='JSON');
I want to join table and stream like this -
CREATE STREAM VEHICLE_DISTANCE_FROM_GEOFENCE AS
SELECT GF.GEOFENCEID,
GF.FLEETID,
VP.VEHICLEID,
GEOFENCE_UDF(GF.GEOFENCECOORDINATES, VP.LATITUDE, VP.LONGITUDE)
FROM GEOFENCE GF
LEFT JOIN VEHICLE_POSITION VP
ON GF.FLEETID = VP.FLEETID;
But KSQL will not allow me to do this, because I am performing the join on FLEETID, which is a non-row-key column. This would have been possible in SQL, so how do I achieve it in KSQL?
Note: According to my application's business logic Fleet Id is used to combine Geofences and Vehicles belonging to a fleet.
Sample data for table -
INSERT INTO GEOFENCE (GEOFENCEID, FLEETID, GEOFENCECOORDINATES)
VALUES (10, '123abc', '52.4497_13.3096');
Sample data for stream -
INSERT INTO VEHICLE_POSITION (VEHICLEID, FLEETID, LATITUDE, LONGITUDE)
VALUES (1289, '125abc', 57.7774, 12.7811);
To solve your problem, what you need is a table of FLEETID to GEOFENCECOORDINATES. You could use such a table to join to your VEHICLE_POSITION stream to get the result you need.
So, how do you get a table of FLEETID to GEOFENCECOORDINATES?
The simple answer is that you can't with your current table definition! You declare the table as having only GEOFENCEID as the primary key. Yet a fleetId can have many fences. To be able to model this, both GEOFENCEID and FLEETID would need to be part of the primary key of the table.
Consider the example:
INSERT INTO GEOFENCE VALUES (10, 'fleet-1', 'coords-1');
INSERT INTO GEOFENCE VALUES (10, 'fleet-2', 'coords-2');
After running these two inserts the table would contain only a single row, with key 10 and value 'fleet-2', 'coords-2'.
Even if we could somehow capture the above information in a table, consider what happens if there is a tombstone in the topic, because the first row had been deleted from the source Mongo table. A tombstone is the key, (10), and a null value. ksqlDB would then remove the row from its table with key 10, leaving an empty table.
This is the crux of your problem!
First, you'll need to configure the source connector to get both the fence id and fleet id into the key of the messages.
Next, you'll need to access this in ksqlDB. Unfortunately, ksqlDB, as of version 0.10.0 / CP 6.0.0, doesn't support multiple key columns, though support for this is being worked on.
In the meantime, if your key is a JSON document containing the two key fields, e.g.
{
"GEOFENCEID": 10,
"FLEETID": "fleet-1"
}
Then you can import it into ksqlDB as a STRING:
-- 5.4.1 syntax:
-- ROWKEY will contain the JSON document, containing GEOFENCEID and FLEETID
CREATE TABLE GEOFENCE (
GEOFENCECOORDINATES VARCHAR
)
WITH (
KAFKA_TOPIC='MONGODB-GEOFENCE',
VALUE_FORMAT='JSON'
);
-- 6.0.0 syntax:
CREATE TABLE GEOFENCE (
JSONKEY STRING PRIMARY KEY,
GEOFENCECOORDINATES VARCHAR
)
WITH (
KAFKA_TOPIC='MONGODB-GEOFENCE',
VALUE_FORMAT='JSON'
);
With the table now correctly defined, you can use EXTRACTJSONFIELD to access the data in the JSON key and collect all the fence coordinates using COLLECT_SET. I'm not 100% sure this will work on 5.4.1 (see how you get on), but it will on 6.0.0.
-- 6.0.0 syntax
CREATE TABLE FLEET_COORDS AS
SELECT
EXTRACTJSONFIELD(JSONKEY, '$.FLEETID') AS FLEETID,
COLLECT_SET(GEOFENCECOORDINATES) AS GEOFENCECOORDINATES
FROM GEOFENCE
GROUP BY EXTRACTJSONFIELD(JSONKEY, '$.FLEETID');
This will give you a table of fleetId to a set of fence coordinates. You can use this to join to your vehicle position stream. Of course, your GEOFENCE_UDF will need to accept an ARRAY<STRING> for the fence coordinates, as there may be many.
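To sketch that final step (untested; exact key-column handling of the aggregate table varies between ksqlDB versions, and the result alias GEOFENCE_RESULT is just illustrative), the stream-table join could look something like this, assuming the COLLECT_SET column is aliased GEOFENCECOORDINATES as above:
CREATE STREAM VEHICLE_DISTANCE_FROM_GEOFENCE AS
  SELECT VP.VEHICLEID,
         VP.FLEETID,
         GEOFENCE_UDF(FC.GEOFENCECOORDINATES, VP.LATITUDE, VP.LONGITUDE) AS GEOFENCE_RESULT
  FROM VEHICLE_POSITION VP
  LEFT JOIN FLEET_COORDS FC
    ON VP.FLEETID = FC.FLEETID;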
Good luck!

Set retention time for states individually when joining KTable-KTable using KSQL

When using a KTable join in Kafka Streams, I can set the retention time individually for the changelog topic and RocksDB as follows:
clickStream.groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(200)))
    .reduce((oldValue, newValue) -> newValue,
            Materialized.<Integer, String, WindowStore<Bytes, byte[]>>as("click")
                .withRetention(Duration.ofSeconds(30000)));
Can I do the same when using KSQL for a KTable-KTable join?
for example:
select * from clicks
left join conversions on clicks->param = conversions->param
and I would like the retention time set on the clicks KTable and the conversions KTable individually, for instance 1 week for clicks and 1 month for conversions.
This is not yet supported by KSQL.

Stream Analytics: getting average for 1 year from history

I have a Stream Analytics job with
INPUTS:
1) "InputStreamCSV" - linked to an Event Hub, receiving data.
2) "InputStreamHistory" - an input stream linked to Blob Storage.
OUTPUTS:
1) "AlertOUT" - linked to Table Storage, inserting an alarm event as a row in the table.
I want to calculate the AVERAGE amount of all transactions for the year 2018 (one number, e.g. 5.2) and compare it with each transaction coming in during 2019:
If the new transaction amount is bigger than the average, put that transaction in the "AlertOUT" output.
I am calculating the average as:
SELECT AVG(Amount) AS TresholdAmount
FROM InputStreamHistory
group by TumblingWindow(minute, 1)
Receiving the new transaction as:
SELECT * INTO AlertOUT FROM InputStreamCSV TIMESTAMP BY EventTime
How can I combine these 2 queries to be able to check if the new transaction amount is bigger than the average transaction amount for last year?
Use the JOIN operator in ASA SQL; you can refer to the SQL below to combine the 2 queries.
WITH
t2 AS
(
    SELECT AVG(Amount) AS TresholdAmount
    FROM jsoninput2
    GROUP BY TumblingWindow(minute, 1)
)
SELECT t2.TresholdAmount
FROM jsoninput t1 TIMESTAMP BY EntryTime
JOIN t2
  ON DATEDIFF(minute, t1, t2) BETWEEN 0 AND 5
WHERE t1.Amount > t2.TresholdAmount
If the history data is stable, you could also join the history data as reference data. Please refer to the official sample.
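For completeness, a rough, untested variant using the asker's input and output names that actually routes the offending transactions to the alert output (selecting only the fields known from the question):
WITH t2 AS
(
    SELECT AVG(Amount) AS TresholdAmount
    FROM InputStreamHistory
    GROUP BY TumblingWindow(minute, 1)
)
SELECT t1.Amount, t1.EventTime, t2.TresholdAmount
INTO AlertOUT
FROM InputStreamCSV t1 TIMESTAMP BY EventTime
JOIN t2
  ON DATEDIFF(minute, t1, t2) BETWEEN 0 AND 5
WHERE t1.Amount > t2.TresholdAmount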
If you are comparing last year's average with the current stream, it would be better to use reference data. Compute the averages for 2018, using either ASA itself or a different query engine, and write them to a storage blob. After that you can use the blob as reference data in the ASA query; it will replace the average computation in your example.
You can then do a reference data join with InputStreamCSV to produce alerts, as sketched below.
Even if you would like to update the averages once in a while, the above pattern would work. Based on the refresh frequency, you can use either another ASA job or a batch analytics solution.
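A rough sketch of that reference-data pattern, with all names illustrative: assume a reference-data input called YearlyAverage whose blob holds a single row with columns JoinKey = 1 and TresholdAmount (the 2018 average):
WITH Txn AS
(
    -- add a constant key so the stream can be equi-joined to the single reference row
    SELECT i.Amount, i.EventTime, 1 AS JoinKey
    FROM InputStreamCSV i TIMESTAMP BY EventTime
)
SELECT Txn.Amount, Txn.EventTime
INTO AlertOUT
FROM Txn
JOIN YearlyAverage r
  ON Txn.JoinKey = r.JoinKey
WHERE Txn.Amount > r.TresholdAmount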

How to batch Esper EPL events

I'm trying to batch Notification events like this, but I'm getting one Notifications event containing a single Notification event. Can anyone help me?
Thanks in advance.
Relevant Statements
INSERT INTO Notification
SELECT d.id as id, a.stationId as stationId, d.firebaseToken as firebaseToken,
       d.position as devicePos, a.location as stationPos,
       a.levelNumber as levelNumber, a.levelName as levelName
FROM AirQualityAlert.win:time(3sec) as a, device.win:time(3sec) as d
WHERE d.position.distance(a.location) < 300

INSERT INTO Notifications
SELECT * FROM Notification.std:groupwin(id).win:time_batch(20sec) for grouped_delivery(id)
This solution delivers a row per 'id' that contains a column with the list of events.
create context Batch20Sec start #now end after 20 sec;
context Batch20Sec select id, window(*) as data
from Notifications#keepall
group by id
output all when terminated;
I think that is what you want.

Esper output clause with named windows

I have a question about using the output clause in combination with a named window, patterns and the insert into statement.
My goal is to detect the absence of an event, store it in a named window, and, when the events start coming again, select and delete the row and use that as an "online" indicator (see Esper - detect event after absence).
I somehow want to be able to limit the rate of events when there are multiple offline/online events in a short period of time (to suppress false positives). I thought the output clause could help here, but when I use it on the insert into statement, no events are stored in the named window. Is this the right approach, or is there another way to limit the events in this scenario?
This is my code in Esper EPL online:
create schema MonitorStats(id string, time string, host string, alert string);
create window MonitorWindow.win:keepall() as select id, alert, time, host from MonitorStats;
insert into MonitorWindow select a.id as id, 'offline' as alert, a.time as time, a.host as host from pattern
[every a=MonitorStats(id='1234') -> (timer:interval(25 sec) and not MonitorStats(id=a.id))];
on pattern[every b=MonitorStats(id='1234') -> (timer:interval(25 sec) and MonitorStats(id=b.id))]
select and delete b.id as id, 'online' as alert, b.time as time, b.host as host from MonitorWindow as win
where win.id = b.id;
You may insert output events into another stream and use "output first every".
insert into DetectedStream select ....;
select * from DetectedStream output first every 10 seconds;
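For example, applied to the "online" detection from the question (a sketch that skips the named-window bookkeeping and just shows the rate limiting; the field list mirrors the question's MonitorStats schema):
insert into DetectedStream
select b.id as id, 'online' as alert, b.time as time, b.host as host
from pattern [every b=MonitorStats(id='1234') -> (timer:interval(25 sec) and MonitorStats(id=b.id))];

select * from DetectedStream output first every 10 seconds;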
