Esper discard events - esper

I have a platform that leverages Esper. However, all events are inserted into the Event table and sent to Esper to process. My rules are specific to only around 10% of the data set but the 90% other data going through the engine is bottlenecking my alerts from firing.
Is there a way to tell Esper to discard events that I don't care about on ingest so I have a smaller stream going through the actual alert / rule processing engine?

The insert-into could be handy for you. For example:
insert into FilteredStream select * from UnfilteredStream where ...some filter critera...
and
// the FilteredStream has the filtered events only
select count(*) from FilteredStream
There is an overview of under which conditions Esper holds events in memory at http://espertech.com/esper/faq_esper.php#keep_in_memory

Related

Flink: How can I emits the outputs immediately in the one-to-many streams' join

I have one query stream and one item stream. I want to join these two stream on the query_id, the relation is one-to-many. How can I emit the item immediately to the output stream after its arrival and get some info from the query.
You could do this either using Table API and a simple join or you could implement it yourself using CoFlatMap on stream keyed by the query_id and buffering incoming events in the state. You should consider some retention policy though, to make sure the state won't grow infinitely.

Join of of multiple streams with the Python SDK

I would like to join multiple streams on a common key and trigger a result either as soon as all of the streams have contributed at least one element or at the end of the window. CoGroupByKey seems to be the appropriate building block, but there does not seem to be a way to express the early trigger condition (count trigger applies per input collection)?
I believe CoGroupByKey is implemented as Flatten + GroupByKey under the hood. Once multiple streams are flattened into one, data-driven trigger (or any other triggers) won't have enough control to achieve what you want.
Instead of using CoGroupByKey, you can use Flatten and StatefulDoFn that fills an object backed by State for each key. Also in this case, StatefulDoFn would have the chance to decide what to do when stream A has 2 elements arrived but stream B doesn't have any element yet.
Another potential solution that comes to mind is (a stateless) DoFn that filters the CoGBK results to remove those that don't have at least one occurrence for each joined stream. For the end of window result (which does not have the same restriction), it would then be necessary to have a parallel CoGBK and its result would not go through the filter. I don't think there is a way to tag results with the trigger that emitted it?

How to dedupe across over-lapping sliding windows in apache beam / dataflow

I have the following requirement:
read events from a pub sub topic
take a window of duration 30 mins and period 1 minute
in that window if 3 events for a given id all match match some predicate then i need to raise an event in a different pub sub topic
The event should be raised as soon as the 3rd event comes in for the grouping id as this is for detecting fraudulent behaviour. In one pane there many be many ids that have 3 events that match my predicate so i may need to emit multiple events per pane
I am able to write a function which consumes a PCollection does the necessary grouping, logic and filtering and emit events according to my business logic.
Questions:
The output PCollection contains duplicates due to the overlapping sliding windows. I understand this is the expected behaviour of sliding windows but how can I avoid this whilst staying in the same dataflow pipeline. I realise I could dedupe in an external system but that is just adding complexity to my system.
I also need to write some sort of trigger that fires each and every time my condition is reached in a window
Is dataflow suitable for this type of realtime detection scenario
Many thanks
You can rewindow the output PCollection into the global window (using the regular Window.into()) and dedupe using a GroupByKey.
It sounds like you're already returning the events of interest as a PCollection. In order to "do something for each event", all you need is a ParDo.of(whatever action you want) applied to this collection. Triggers do something else: they control what happens when a new value V arrives for a particular key K in a GroupByKey<K, V>: whether to drop the value, or buffer it, or to pass the buffered KV<K, Iterable<V>> for downstream processing.
Yes :)

How do CEP rules engines store time data?

I'm thinking about designing an event processing system.
The rules per se are not the problem.
What bogs my is how to store event data so that I can efficiently answer questions/facts like:
If number of events of type A in the last 10 minutes equals N,
and the average events of type B per minute over the last M hours is Z,
and the current running average of another metric is Y...
then
fire some event (or store a new fact/event).
How do Esper/Drools/MS StreamInsight store their time dependant data so that they can efficiently calculate event stream properties? ¿Do they just store it in SQL databases and continuosly query them?
Do the preprocess the rules so they can know beforehand what "knowledge" they need to store?
Thanks
EDIT: I found what I want is called Event Stream Processing, and the wikipedia example shows what I would like to do:
WHEN Person.Gender EQUALS "man" AND Person.Clothes EQUALS "tuxedo"
FOLLOWED-BY
Person.Clothes EQUALS "gown" AND
(Church_Bell OR Rice_Flying)
WITHIN 2 hours
ACTION Wedding
Still the question remains: how do you implement such a data store? The key is "WITHIN 2 hours" and the ability to process thousands of events per second.
Esper analyzes the rule and only stores derived state (aggregations etc., if any) and if needed by the rule also a subset of events. Esper allows defining contexts like described in the book by Opher Etzion and Peter Niblet. I recommend reading. By specifying a context Esper can minimize the amount of state it retains and can make queries easier to read.
It's not difficult to store events happening within a time window of a certain length. The problem gets more difficult if you have to consider additional constraints: here an analysis of the rules is indicated so that you can maintain sets of events matching the constraints.
Storing events in an (external) database will be too slow.

wso2/ws02 CEP, ESPER or something else?

I have a use case where a system transaction happen/completed over a period of time and with multiple "building up" steps. each step in the process generates one or more events (up to 22 events per transaction). All events within a transaction have a shared and unique (uuid) correlation ID.
An example is for a transaction X: will have the building blocks of EventA, EventB, EventC... and all tagged with a unique correlation identifier.
The ultimate goal here is to switch from persisting all the separate events in an RDBMS and query a consolidated view (lots of joins) To: be able to persist only 1 encompassing transaction record that will consolidate attributes from each step in the transaction.
My research so far led me toward reading about Esper (Java stack here) and WSo2/WS02 CEP. In my case each event is submitted/enqueued into JMS, and I am wondering if a solution like WS02/WSo2 CEP can be used to consolidate JMS events/messages (streams) and based on correlation ID (and maximum time limit 30 min) produce one consolidated record and send it down JMS to ultimately persist in a DB.
Since I am still in research mode, I was wondering if I am on the right path for a solution?
Anybody achieved such thing using WS02/WSo2 CEP, or is it over kill ? other recommendations?
Thanks
-S
You can use WSO2 CEP by integrating that to JMS to send and receive events and by using Siddhi Pattern queries[1] to consolidate events arriving from the same transaction.
30 min is a reasonable time period and its recommended to test the scenario with some test data set because you must need enough memory in the servers for CEP to handle the states. This will greatly depend on the event rate.
AFAIK this is not an over kill in a enterprise deployment.
[1]https://docs.wso2.com/display/CEP200/Patterns
I would recommend you to try esper patterns. For multievent based system where some particular information is to be collected patterns works the best way.
A sample example would be:
select * from TemperatureEvent
match_recognize (
measures A as temp1, B as temp2, C as temp3, D as temp4
pattern (A B C D)
define
A as A.temperature > 100,
B as (A.temperature < B.value),
C as (B.temperature < C.value),
D as (C.temperature < D.value) and D.value >
(A.value * 1.5))
Here, we have 4 events and 5 conditions involving these events. Example is taken from demo project.

Resources