In the Esper documentation, section 5.6.2.1 "Hints Pertaining to Group-By", it says:
"As the engine has no means of detecting when aggregation state (sums per symbol) can be
discarded, you may use the following hints to control aggregation state lifetime.
The #Hint("reclaim_group_aged=age_in_seconds") hint instructs the engine to discard
aggregation state that has not been updated for age_in_seconds seconds."
How should I understand "aggregation state"? If an EPL statement stops receiving updates for some events, will those events be removed or dropped?
The passage you are quoting applies only when there are no data windows, i.e. "select sum(xyz) from ABC group by somekey". Since there is no data window in this query, it aggregates all ABC events that have arrived since start. The aggregation state in this example is the total (sum) per group. In this no-data-window aggregation with group-by, when no data has arrived for a given value of "somekey" for N seconds, the engine can be instructed to discard (forget) that total and that "somekey" value.
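To answer the "will these events be removed" part: in this no-data-window case the individual events were never retained in the first place, only the running totals. A toy Python model of what reclaim_group_aged discards (purely illustrative, not Esper's actual implementation; the class and names are hypothetical):

```python
import time

class AgedGroupSum:
    """Toy model of group-by aggregation state with age-based reclamation."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.state = {}  # group key -> (running sum, last update time)

    def update(self, key, value, now=None):
        now = time.time() if now is None else now
        total, _ = self.state.get(key, (0.0, now))
        self.state[key] = (total + value, now)  # only the total is kept, not the event

    def reclaim(self, now=None):
        """Discard state for groups not updated within max_age seconds."""
        now = time.time() if now is None else now
        stale = [k for k, (_, t) in self.state.items() if now - t > self.max_age]
        for k in stale:
            del self.state[k]
        return stale

agg = AgedGroupSum(max_age_seconds=60)
agg.update('IBM', 10.0, now=0)
agg.update('MSFT', 5.0, now=30)
agg.reclaim(now=80)  # 'IBM' untouched for 80s > 60s, so its total is forgotten
```

The point of the sketch: what gets dropped is the per-group total (and the group key), because the source events behind it were never stored.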
I have the following requirement:
read events from a pub sub topic
take a window of duration 30 mins and period 1 minute
in that window, if 3 events for a given id all match some predicate, then I need to raise an event in a different pub sub topic
The event should be raised as soon as the 3rd event comes in for the grouping id, as this is for detecting fraudulent behaviour. In one pane there may be many ids that have 3 events matching my predicate, so I may need to emit multiple events per pane.
I am able to write a function which consumes a PCollection, does the necessary grouping, logic and filtering, and emits events according to my business logic.
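For reference, the per-id detection logic itself is small and independent of Beam/Dataflow. A minimal Python sketch, assuming timestamps in seconds and a hypothetical matches_predicate flag computed upstream by your business logic:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 60   # 30-minute lookback
THRESHOLD = 3              # fire on the 3rd matching event

recent = defaultdict(deque)  # id -> timestamps of matching events, oldest first

def on_event(event_id, timestamp, matches_predicate):
    """Return True exactly when this event is the 3rd match for event_id
    within the 30-minute window."""
    if not matches_predicate:
        return False
    q = recent[event_id]
    while q and timestamp - q[0] > WINDOW_SECONDS:
        q.popleft()          # expire matches older than the window
    q.append(timestamp)
    return len(q) == THRESHOLD
```

Because the check is == THRESHOLD rather than >=, the alert fires once per id as the count reaches 3, rather than again on every subsequent match.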
Questions:
The output PCollection contains duplicates due to the overlapping sliding windows. I understand this is the expected behaviour of sliding windows, but how can I avoid it whilst staying in the same Dataflow pipeline? I realise I could dedupe in an external system, but that just adds complexity to my system.
I also need to write some sort of trigger that fires each and every time my condition is reached in a window.
Is Dataflow suitable for this type of real-time detection scenario?
Many thanks
You can rewindow the output PCollection into the global window (using the regular Window.into()) and dedupe using a GroupByKey.
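The dedup step then amounts to keeping one element per key in the global window. A plain-Python sketch of that semantics (illustrative only; in the actual pipeline this role is played by the GroupByKey, with the key being a stable id you assign to each alert):

```python
def dedupe(events, key_fn):
    """Keep one event per key, mimicking rewindow-into-global + group-by-key."""
    seen = {}
    for e in events:
        seen.setdefault(key_fn(e), e)   # first occurrence per key wins
    return list(seen.values())

# The same alert surfaces in two overlapping panes; after dedupe it appears once.
alerts = [{'id': 'u1', 'pane': 1}, {'id': 'u1', 'pane': 2}, {'id': 'u2', 'pane': 2}]
unique = dedupe(alerts, key_fn=lambda e: e['id'])
```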
It sounds like you're already returning the events of interest as a PCollection. In order to "do something for each event", all you need is a ParDo.of(whatever action you want) applied to this collection. Triggers do something else: they control what happens when a new value V arrives for a particular key K in a GroupByKey<K, V>: whether to drop the value, or buffer it, or to pass the buffered KV<K, Iterable<V>> for downstream processing.
Yes :)
Full disclosure: I also published a variant of this question here.
I have an embedded device as part of a heating system that is publishing two temperature values, each to an individual MQTT topic, every 5 seconds via a mosquitto MQTT broker. "mydevice/sensor1" is the pre-heated temperature, and "mydevice/sensor2" is post-heating temperature. The values are published at almost the same time, so there's typically never more than half a second of delay between the two messages - but they aren't synchronised exactly.
Telegraf is subscribed to the same broker and is happily putting these measurements into an InfluxDB database called "telegraf.autogen". The measurements both appear under a single measurement called "mqtt_consumer" with a field called "value". In InfluxDB I can differentiate between topic-tagged values by filtering with the "topic" tag:
SELECT mean("value") AS "mean_value" FROM "telegraf"."autogen"."mqtt_consumer" WHERE time > now() - 1m AND "topic"='mydevice/sensor1' GROUP BY time(5s)
This all seems to be working correctly.
What I want to do is calculate the difference between these two topic values, for each pair of incoming values, in order to calculate the temperature differential and eventually calculate the energy being transferred by the heating system (the flow rate is constant and known). I tried to do this with InfluxDB queries in Grafana but it seemed quite difficult (I failed), so I thought I'd try and use TICKscript to break down my process into small steps.
I have been putting together a TICKscript to calculate the difference based on this example:
https://docs.influxdata.com/kapacitor/v1.3/guides/join_backfill/#stream-method
However in my case I don't have two separate measurements. Instead, I create two separate streams from the single "mqtt_consumer" measurement, using the topic tag as a filter. Then I attempt to join these with a 1s tolerance (values are always published close enough in time). I'm using httpOut to generate a view for debugging (Aside: this only updates every 10 seconds, missing every second value, even though my stream operates at 5 second intervals - why is that? I can see in the new db that the values are all present though).
Once I have them joined, I would evaluate the difference in values, and store this in a new database under a measurement called "diff".
Here's my script so far:
var sensor1 = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('mqtt_consumer')
        .where(lambda: "topic" == 'mydevice/sensor1')
        .groupBy(*)
    |httpOut('sensor1')

var sensor2 = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('mqtt_consumer')
        .where(lambda: "topic" == 'mydevice/sensor2')
        .groupBy(*)
    |httpOut('sensor2')

sensor1
    |join(sensor2)
        .as('value1', 'value2')
        .tolerance(1s)
    |httpOut('join')
    |eval(lambda: "sensor1.value1" - "sensor1.value2")
        .as('diff')
    |httpOut('diff')
    |influxDBOut()
        .create()
        .database('mydb')
        .retentionPolicy('myrp')
        .measurement('diff')
Unfortunately my script is failing to pass any items through the join node. In "kapacitor show" I can see that the httpOut nodes are both passing items to the join node, but it isn't passing any on. The Kapacitor logs don't show anything obvious either. An HTTP GET on httpOut('join') returns:
{"series":null}
I have two questions:
Is this approach - using Kapacitor with a TICKscript to calculate energy based on the difference between two values in a single measurement - valid? Or is there a better/simpler way to do this?
Why isn't the join node producing any output? What can I do to debug this further?
Try adding a |mean node, to calculate the mean of the field, in both sensors:
var sensor1 = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('mqtt_consumer')
        .where(lambda: "topic" == 'mydevice/sensor1')
        .groupBy(*)
    |mean('value')
    |httpOut('sensor1')
After the join, you should use the newly assigned stream names, not the original ones:
sensor1
    |join(sensor2)
        .as('value1', 'value2')
        .tolerance(1s)
    |httpOut('join')
    |eval(lambda: "value1.mean" - "value2.mean")
        .as('diff')
    |httpOut('diff')
    |influxDBOut()
        .create()
        .database('mydb')
        .retentionPolicy('myrp')
        .measurement('diff')
Here "mean" is the field produced by the |mean node added above. Try it out!
Also, for further debugging, try adding |log nodes wherever you want to keep an eye on the data.
Hope this helps! Regards
I'm thinking about designing an event processing system.
The rules per se are not the problem.
What bogs me down is how to store event data so that I can efficiently answer questions/facts like:
If number of events of type A in the last 10 minutes equals N,
and the average events of type B per minute over the last M hours is Z,
and the current running average of another metric is Y...
then
fire some event (or store a new fact/event).
How do Esper/Drools/MS StreamInsight store their time-dependent data so that they can efficiently calculate event stream properties? Do they just store it in SQL databases and continuously query them?
Do they preprocess the rules so they know beforehand what "knowledge" they need to store?
Thanks
EDIT: I found what I want is called Event Stream Processing, and the wikipedia example shows what I would like to do:
WHEN Person.Gender EQUALS "man" AND Person.Clothes EQUALS "tuxedo"
FOLLOWED-BY
Person.Clothes EQUALS "gown" AND
(Church_Bell OR Rice_Flying)
WITHIN 2 hours
ACTION Wedding
Still the question remains: how do you implement such a data store? The key is the "WITHIN 2 hours" part and the ability to process thousands of events per second.
Esper analyzes the rule and only stores derived state (aggregations etc., if any) and, if required by the rule, also a subset of events. Esper allows defining contexts as described in the book "Event Processing in Action" by Opher Etzion and Peter Niblett, which I recommend reading. By specifying a context, Esper can minimize the amount of state it retains and can make queries easier to read.
It's not difficult to store events happening within a time window of a certain length. The problem gets more difficult if you have to consider additional constraints: here an analysis of the rules is indicated so that you can maintain sets of events matching the constraints.
Storing events in an (external) database will be too slow.
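The basic mechanism behind a "WITHIN 2 hours" constraint is an in-memory buffer ordered by time, with expired entries evicted as new ones arrive. A minimal Python sketch of that idea (illustrative only, not how any particular engine implements it):

```python
from collections import deque

class TimeWindowStore:
    """Keeps only events inside a fixed-length time window; older events are evicted."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, payload) pairs, ordered by time

    def add(self, timestamp, payload):
        self.events.append((timestamp, payload))
        self._evict(timestamp)

    def _evict(self, now):
        # Events arrive in time order, so expired entries are always at the front.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def count(self, now):
        self._evict(now)
        return len(self.events)

store = TimeWindowStore(window_seconds=2 * 3600)   # the "WITHIN 2 hours" window
```

Eviction is O(1) amortized per event, which is what makes thousands of events per second feasible in memory; the harder part, as noted above, is maintaining only the subset of events the rule's other constraints can still match.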
I have a use case where a system transaction happens/completes over a period of time, through multiple "building up" steps. Each step in the process generates one or more events (up to 22 events per transaction). All events within a transaction share a unique (UUID) correlation ID.
As an example, a transaction X will have the building blocks EventA, EventB, EventC..., all tagged with the same unique correlation identifier.
The ultimate goal here is to switch from persisting all the separate events in an RDBMS and querying a consolidated view (lots of joins) to persisting only one encompassing transaction record that consolidates attributes from each step in the transaction.
My research so far has led me toward Esper (Java stack here) and WSO2 CEP. In my case each event is submitted/enqueued into JMS, and I am wondering whether a solution like WSO2 CEP can be used to consolidate JMS events/messages (streams) based on the correlation ID (with a maximum time limit of 30 min), produce one consolidated record, and send it down JMS to ultimately persist in a DB.
Since I am still in research mode, I was wondering if I am on the right path to a solution.
Has anybody achieved such a thing using WSO2 CEP, or is it overkill? Any other recommendations?
Thanks
-S
You can use WSO2 CEP by integrating it with JMS to send and receive events, and by using Siddhi pattern queries [1] to consolidate events arriving from the same transaction.
30 min is a reasonable time period, but it is recommended to test the scenario with a representative data set, because the servers need enough memory for CEP to hold the pattern states. This will greatly depend on the event rate.
AFAIK this is not overkill in an enterprise deployment.
[1]https://docs.wso2.com/display/CEP200/Patterns
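Whatever engine you pick, the consolidation logic described in the question boils down to buffering per correlation ID with a timeout. A minimal, engine-agnostic Python sketch (the class and field names are hypothetical, and "merge" here is a simple attribute union):

```python
class TxConsolidator:
    """Buffers events per correlation ID; flushes when all steps arrive or on timeout."""

    def __init__(self, max_events=22, timeout_seconds=30 * 60):
        self.max_events = max_events
        self.timeout = timeout_seconds
        self.pending = {}   # correlation_id -> (first_seen, [event dicts])

    def on_event(self, corr_id, event, now):
        first_seen, events = self.pending.setdefault(corr_id, (now, []))
        events.append(event)
        if len(events) == self.max_events:
            del self.pending[corr_id]
            return self._merge(corr_id, events)   # complete transaction record
        return None

    def flush_expired(self, now):
        """Emit partial records for transactions older than the time limit."""
        out = []
        for corr_id in list(self.pending):
            first_seen, events = self.pending[corr_id]
            if now - first_seen > self.timeout:
                del self.pending[corr_id]
                out.append(self._merge(corr_id, events))
        return out

    def _merge(self, corr_id, events):
        record = {'correlation_id': corr_id}
        for e in events:
            record.update(e)   # later steps overwrite shared attribute names
        return record
```

A Siddhi pattern query or an Esper context partitioned by correlation ID would play the role of pending here, with the engine managing the 30-minute expiry for you; as the answer notes, that per-transaction state is exactly what drives the memory requirement.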
I would recommend trying Esper patterns. For multi-event systems where particular pieces of information need to be collected, patterns work best.
A sample example:
select * from TemperatureEvent
match_recognize (
    measures A as temp1, B as temp2, C as temp3, D as temp4
    pattern (A B C D)
    define
        A as A.temperature > 100,
        B as (A.temperature < B.temperature),
        C as (B.temperature < C.temperature),
        D as (C.temperature < D.temperature) and D.temperature >
            (A.temperature * 1.5))
Here we have 4 events and 5 conditions involving those events. The example is taken from a demo project.
I am using Informix IDS 10, and I have a simple transaction table of inventory changes, with product ID, transaction time, volume, quantity and price.
Is it possible to determine the FIFO valuation solely with SQL/stored procedures, or do I need to use something like Perl with DBI for the cursor handling?
From my point of view, FIFO valuation requires cursor handling, as I first need to build a temp table with the total volume and then process the sorted transactions to calculate the average over the relevant transactions.
It should certainly be possible to do it in a stored procedure. You can create temporary tables and use cursors via the FOREACH statement. I doubt it is doable in straight SQL.
FIFO valuation - as in, I bought 27 lots of a particular share at various times and prices; now I have sold a bunch of those shares and need to work out the cost basis using FIFO?
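Under that interpretation, the FIFO cost basis is a single ordered pass over the buy lots, consuming each lot until the sold quantity is covered; the same loop maps directly onto a FOREACH cursor over the time-sorted transactions in an Informix stored procedure. A Python sketch of the calculation (function name and shapes are illustrative):

```python
def fifo_cost_basis(buys, sell_qty):
    """buys: list of (quantity, unit_price) tuples in purchase (time) order.
    Returns the FIFO cost basis of selling sell_qty units."""
    remaining = sell_qty
    cost = 0.0
    for qty, price in buys:
        if remaining <= 0:
            break
        take = min(qty, remaining)   # consume the oldest lot first
        cost += take * price
        remaining -= take
    if remaining > 0:
        raise ValueError("sell quantity exceeds total holdings")
    return cost

# Sell 150 units: 100 @ 10.0 from the first lot, then 50 @ 12.0 from the second.
basis = fifo_cost_basis([(100, 10.0), (80, 12.0)], 150)   # 1600.0
```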