How do CEP rules engines store time data?

I'm thinking about designing an event processing system.
The rules per se are not the problem.
What bogs me down is how to store event data so that I can efficiently answer questions/facts like:
If number of events of type A in the last 10 minutes equals N,
and the average events of type B per minute over the last M hours is Z,
and the current running average of another metric is Y...
then
fire some event (or store a new fact/event).
How do Esper/Drools/MS StreamInsight store their time-dependent data so that they can efficiently calculate event stream properties? Do they just store it in SQL databases and continuously query them?
Do they preprocess the rules so they can know beforehand what "knowledge" they need to store?
Thanks
EDIT: I found that what I want is called Event Stream Processing, and the Wikipedia example shows what I would like to do:
WHEN Person.Gender EQUALS "man" AND Person.Clothes EQUALS "tuxedo"
FOLLOWED-BY
Person.Clothes EQUALS "gown" AND
(Church_Bell OR Rice_Flying)
WITHIN 2 hours
ACTION Wedding
Still the question remains: how do you implement such a data store? The key is "WITHIN 2 hours" and the ability to process thousands of events per second.

Esper analyzes the rule and stores only derived state (aggregations etc., if any) and, if the rule needs it, a subset of events. Esper allows defining contexts as described in the book by Opher Etzion and Peter Niblett, which I recommend reading. By specifying a context, Esper can minimize the amount of state it retains and can make queries easier to read.

It's not difficult to store events happening within a time window of a certain length. The problem gets more difficult if you have to consider additional constraints: here an analysis of the rules is indicated so that you can maintain sets of events matching the constraints.
Storing events in an (external) database will be too slow.
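To make that concrete, here is a minimal sketch of the kind of state an engine can keep for a rule such as "number of type-A events in the last 10 minutes": a queue of timestamps plus on-access eviction, with no database round trip involved. This is plain Java for illustration only, not Esper's internals, and the class name is made up:

import java.util.ArrayDeque;
import java.util.Deque;

// Sliding count window: retains only the timestamps of events still
// inside the window, because the rule needs nothing else.
// Assumes events arrive in timestamp order.
public class SlidingCountWindow {
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingCountWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    public void onEvent(long eventTimeMillis) {
        timestamps.addLast(eventTimeMillis);
        evict(eventTimeMillis);
    }

    // Current number of events inside the window ending at nowMillis.
    public int count(long nowMillis) {
        evict(nowMillis);
        return timestamps.size();
    }

    private void evict(long nowMillis) {
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
    }
}

A running average works the same way, except each entry also carries the metric value and the window maintains a running sum next to the count; a "WITHIN 2 hours" pattern adds partial-match state per started sequence, discarded once the deadline passes.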

Related

EventStoreDB: temporal queries

Regarding the question asked here:
Suppose we have ProductCreated and ProductRenamed events, both of which contain the title of the product. Now we want to query EventStoreDB for all events of type ProductCreated and ProductRenamed with a given title. I want these events so I can check whether any product in the system has been created with, or renamed to, the given title, so that I can throw a duplicate-title exception in the domain.
I am using MongoDB to build UI reports from all the published events, and everything is fine there. But to check some invariants, such as uniqueness, I have to query the event store for certain events matching certain criteria and, by iterating over them, decide whether there is a product that was created with the same title and never renamed, or a product that was renamed to the same title.
For such queries, the only mechanism the event store provides is creating a one-time projection with the appropriate JavaScript code, which filters the required events and emits them to a new stream. Then all I have to do is fetch the events from the newly generated stream that the projection fills.
Now, the odd thing is that projections are great for subscriptions and for generating new streams, but they seem awkward for real-time queries. Immediately after I create a projection with the HTTP API, I check the resulting stream for the query result, but it seems the workers have not yet had a chance to produce it, and I get a 404 response. After waiting a few seconds, though, the new stream appears and gets filled with the result.
There are too many things wrong with this approach:
First, it seems that if the event store holds millions of events across many streams, it won't be able to process and filter all of them into the resulting stream immediately. It does not even create the stream immediately, let alone populate it, so I have to wait for some time and then check for the result, hoping that the projection is done.
Second, I have to poll, issuing multiple HTTP GET requests, which seems slow. The new JVM client is not ready yet.
Third, I have to delete the resulting stream once I'm done with the result; failing to do so will leave the event store with millions of orphaned query-result streams.
I wish I could pass the JavaScript to some API and get the result page by page, like querying MongoDB, without worrying about the projection, new streams, and timing issues.
I have seen a Query section in the Admin UI, but I don't know what it's for, and unfortunately the documentation doesn't help much.
Am I expecting the event store to do something that is impossible?
Do I have to create a read model inside the bounded context for doing such checks?
I am using my events to rehydrate the aggregates, and I would like to use the same events for such simple queries without bringing in other techniques.
I believe it would not be a separate bounded context since the check you want to perform belongs to the same bounded context where your Product aggregate lives. So, the projection that is solely used to prevent duplicate product names would be a part of the same context.
You can indeed use a custom projection to check it, but I believe the complexity of such a solution would be higher than that of a simple read model in MongoDB.
It is also fine to use an existing projection, if you have one, to do the check. It might not be what you would otherwise prefer if the aim of the existing projection is to show things in the UI.
For the collection that you could use for duplicates check, you can have the document schema limited to the id only (string), which would be the product title. Since collections are automatically indexed by the id, you won't need any additional indexes to support the duplicate check query. When the product gets renamed, you'd need to delete the document for the old title and add a new one.
Again, you will get a small time window when the duplicate can slip in. It's then up to the business to decide if the concern is real (it's not, most of the time) and what's the consequence of the situation if it happens one day. You'd be able to find a duplicate when projecting events quite easily and decide what to do when it happens.
Practically, when you have such a projection, all it takes is to build a simple domain service bool ProductTitleAlreadyExists.
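As a sketch of that service and the projection handlers that feed it, assuming the official MongoDB Java driver and using the product title itself as the document _id (the collection, database, and class names here are made up):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class ProductTitles {
    private final MongoCollection<Document> titles;

    public ProductTitles() {
        // The _id is the product title itself, so the built-in index
        // on _id covers the duplicate check with no extra indexes.
        this.titles = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("catalog")
                .getCollection("productTitles");
    }

    // The domain service described above.
    public boolean productTitleAlreadyExists(String title) {
        return titles.countDocuments(eq("_id", title)) > 0;
    }

    // Projection handlers keep the collection in sync with the event stream.
    public void onProductCreated(String title) {
        titles.insertOne(new Document("_id", title));
    }

    public void onProductRenamed(String oldTitle, String newTitle) {
        titles.deleteOne(eq("_id", oldTitle));
        titles.insertOne(new Document("_id", newTitle));
    }
}

The small time window mentioned above still exists: a duplicate can slip in between the check and the commit of the event, so the projecting side should be prepared to flag one when it sees it.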

Can I perform a single fetch request which returns independent calculations for subsets of the results?

My data model has a ClickerRecord entity with 2 attributes: date (NSDate) and numberOfBiscuits (NSNumber). Every time a new record is added, a different value for numberOfBiscuits can be entered.
To calculate a daily average for the number of biscuits I'm currently doing a fetch request for each day within range and using the corresponding NSExpression to calculate the sum of all numberOfBiscuits values for that day.
The problem: I'm using asynchronous fetch requests to avoid blocking the main thread, so it ends up being quite slow when there are many days between the first and last record. The fetch requests are performed one after another.
I could also load all records into memory and perform the sorting and calculations, but I'm worried that it could become an issue when the number of records becomes very large.
Therefore, my question: Is it possible to use NSExpressions to add something like sub-predicates for each date interval, in order to do a single fetch request and retrieve a dictionary with an entry for each daily sum of numberOfBiscuits?
If not, what would be the recommended approach for this situation?
I've read about subqueries but as far as I've understood they're not intended for this kind of use.
This is the first question I'm asking on SO, so I hope to have written it in a clear way :)
I think what you are looking for is propertiesToGroupBy (see the Apple docs) on NSFetchRequest, though in your case it is not straightforward to implement, for reasons I will discuss later.
Suppose you could specify the category of biscuit consumed on each occasion, and this is stored in a category attribute of your entity. Then to obtain the total number of biscuits of each category (ignoring the date), you could use an NSExpression using #sum and specify:
fetch.propertiesToGroupBy = ["category"]
Core Data will then group the results of the fetch by category and calculate the sum for each group separately.
The problem in your case is that (unless you already strip the time information from your date attribute) there is no attribute representing the date interval you want to group by, and Core Data will not let you group by a computed value. You would need to add a new day attribute to your entity, calculate it whenever you add/update a record, and specify it in the group by. And you face the same problem again if you subsequently want to calculate your average over a different interval - weeks or months, for example. One other downside is that the results will only include days for which there are ClickerRecords: if the user has a day where they consume no biscuits, the fetch will not show a result for that day (i.e. it will not infer an average of 0). You would need to handle this appropriately when using the results.
It might be better either to tune your asynchronous fetch or, as you suggest, just to read the whole lot into memory to perform the calculations. If your entity only has those two attributes, and assuming your users don't live entirely on biscuits, the volumes should not be too problematic.
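If you do take the in-memory route, the aggregation itself is a single group-and-sum pass. Sketched in Java for concreteness rather than Swift/Objective-C (the ClickerRecord mirror type below is hypothetical; in the app it would be your managed objects, with the time component stripped from the date):

import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DailyTotals {
    // In-memory stand-in for the Core Data entity.
    record ClickerRecord(LocalDate day, int numberOfBiscuits) {}

    // One pass over all records instead of one fetch per day.
    static Map<LocalDate, Integer> totalsByDay(List<ClickerRecord> records) {
        return records.stream().collect(Collectors.groupingBy(
                ClickerRecord::day,
                Collectors.summingInt(ClickerRecord::numberOfBiscuits)));
    }
}

The daily average is then the sum of the per-day totals divided by the number of days in range, remembering to count the empty days the map will not contain.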

How should I auto-expire entries in an ETS table, while also limiting its total size?

I have a lot of analytics data which I'm looking to aggregate every so often (let's say every minute). The data is sent to a process which stores it in an ETS table, and every so often a timer sends that process a message telling it to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)
If these two conditions hold, I could basically assume the table has a constant size and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the index is the timestamp. Though I'm not sure if this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-increment field with each insert, and when the table is being trimmed, look at the first and the last index, calculate the difference and, again, use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
You can determine the amount of memory occupied using ets:info(Tab, memory). The result is in words. But there is a catch: if you are storing binaries, only heap binaries are included (large binaries are reference-counted and live outside the table). So if you are storing mostly normal Erlang terms you can use it, and with a timestamp as you described, it is a way to go. For the size in bytes, just multiply by erlang:system_info(wordsize).
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
I would do the following: create a server responsible for
receiving all the data storage messages. These messages should be timestamped by the client process (so it doesn't matter if a message waits a little in the server's queue). The server then stores them in an ETS table configured as ordered_set, using the timestamp, converted to an integer, as the key. (If the timestamps come from erlang:now in one single VM they are guaranteed to be unique; if you are using several nodes, you will need to add some information such as the node name to guarantee uniqueness.)
receiving a tick (using for example timer:send_interval) and then processing the messages received in the last N microseconds: take Key = current time - N, look up ets:next(Table, Key), and continue to the last message. Finally you can discard all the messages via ets:delete_all_objects(Table). If you had to add information such as a node name, it is still possible to use ets:next (for example, if the keys are {TimeStamp :: integer(), Node :: atom()} you can compare against {Time, 0}, since a number is smaller than any atom).
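For the combined policy itself (fixed maximum size plus a maximum age), the structure is the same regardless of language: an ordered map keyed by timestamp, trimmed from the oldest end. A minimal sketch of the idea, in Java for illustration, with the TreeMap standing in for the ordered_set table and unique timestamps assumed as arranged above:

import java.util.Map;
import java.util.TreeMap;

public class BoundedTimedBuffer<V> {
    private final TreeMap<Long, V> entries = new TreeMap<>();
    private final int maxSize;
    private final long maxAgeMillis;

    public BoundedTimedBuffer(int maxSize, long maxAgeMillis) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
    }

    public void put(long timestampMillis, V value) {
        entries.put(timestampMillis, value);
        // Condition 1: fixed size - drop the oldest entries when full.
        while (entries.size() > maxSize) {
            entries.pollFirstEntry();
        }
    }

    // Condition 2: timeout - call this from the periodic tick.
    public void expire(long nowMillis) {
        // headMap(cutoff) is a live view of everything strictly older;
        // clearing it deletes those entries from the backing map.
        entries.headMap(nowMillis - maxAgeMillis).clear();
    }

    public Map<Long, V> snapshot() {
        return Map.copyOf(entries);
    }
}

In ETS terms, expire corresponds to the match-spec (or ets:next-based) delete over old keys, and the size bound corresponds to deleting from ets:first upward while ets:info(Tab, size) exceeds the limit.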

WSO2 CEP, Esper, or something else?

I have a use case where a system transaction happens/completes over a period of time and through multiple "building up" steps. Each step in the process generates one or more events (up to 22 events per transaction). All events within a transaction share a unique (UUID) correlation ID.
For example, a transaction X will have the building blocks EventA, EventB, EventC..., all tagged with the same unique correlation identifier.
The ultimate goal here is to switch from persisting all the separate events in an RDBMS and querying a consolidated view (lots of joins), to persisting only one encompassing transaction record that consolidates attributes from each step in the transaction.
My research so far has led me to reading about Esper (we're on a Java stack) and WSO2 CEP. In my case each event is submitted/enqueued into JMS, and I am wondering whether a solution like WSO2 CEP can be used to consolidate JMS events/messages (streams) by correlation ID (with a maximum time limit of 30 min), produce one consolidated record, and send it down JMS to ultimately persist in a DB.
Since I am still in research mode, am I on the right path to a solution?
Has anybody achieved such a thing using WSO2 CEP, or is it overkill? Any other recommendations?
Thanks
-S
You can use WSO2 CEP by integrating it with JMS to send and receive events, and by using Siddhi pattern queries [1] to consolidate events arriving from the same transaction.
30 min is a reasonable time period, but it's recommended to test the scenario with a realistic test data set, because the servers must have enough memory for CEP to hold the state; this will depend greatly on the event rate.
AFAIK this is not overkill in an enterprise deployment.
[1]https://docs.wso2.com/display/CEP200/Patterns
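To get a feel for why memory is the constraint here, this is roughly the state such a consolidation has to hold per open transaction, as a hand-rolled plain-Java sketch (not the WSO2/Siddhi implementation; ConsolidatedRecord and its fields are made up):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TransactionConsolidator {
    public static final class ConsolidatedRecord {
        final String correlationId;
        final Map<String, Object> attributes = new ConcurrentHashMap<>();
        ConsolidatedRecord(String id) { this.correlationId = id; }
        void merge(Map<String, Object> more) { attributes.putAll(more); }
    }

    private final Map<String, ConsolidatedRecord> open = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // Called for each of the up-to-22 events of a transaction.
    public void onEvent(String correlationId, Map<String, Object> attributes) {
        open.computeIfAbsent(correlationId, id -> {
            // Enforce the 30-minute limit: abandon transactions that never complete.
            timer.schedule(() -> open.remove(id), 30, TimeUnit.MINUTES);
            return new ConsolidatedRecord(id);
        }).merge(attributes);
    }

    // Called when the final event of a transaction arrives; the result
    // is what you would send down JMS and persist.
    public ConsolidatedRecord complete(String correlationId) {
        return open.remove(correlationId);
    }
}

The CEP engine does the equivalent bookkeeping for you, so the memory needed is roughly (attributes per transaction) x (transactions open in any 30-minute span), which is why testing with a realistic event rate matters.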
I would recommend trying Esper patterns. For a multi-event system where particular pieces of information have to be collected across events, patterns work best.
A sample query would be:
select * from TemperatureEvent
match_recognize (
    measures A as temp1, B as temp2, C as temp3, D as temp4
    pattern (A B C D)
    define
        A as A.temperature > 100,
        B as (A.temperature < B.temperature),
        C as (B.temperature < C.temperature),
        D as (C.temperature < D.temperature) and D.temperature > (A.temperature * 1.5))
Here we have 4 events and 5 conditions involving those events. The example is taken from the Esper demo project.
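For completeness, wiring such a statement into a Java application is short. A sketch against the Esper 5.x/6.x-style API (Esper 8 replaced this with a compile-and-deploy API; TemperatureEvent here is a hypothetical POJO):

import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class TemperatureRule {
    public static class TemperatureEvent {
        private final double temperature;
        public TemperatureEvent(double temperature) { this.temperature = temperature; }
        public double getTemperature() { return temperature; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("TemperatureEvent", TemperatureEvent.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        String epl = "select * from TemperatureEvent match_recognize ("
                + " measures A as temp1, B as temp2, C as temp3, D as temp4"
                + " pattern (A B C D)"
                + " define A as A.temperature > 100,"
                + " B as (A.temperature < B.temperature),"
                + " C as (B.temperature < C.temperature),"
                + " D as (C.temperature < D.temperature) and D.temperature > (A.temperature * 1.5))";

        EPStatement stmt = engine.getEPAdministrator().createEPL(epl);
        // Fires once per completed A B C D sequence; Esper itself retains
        // only the partial-match state, not the full event history.
        stmt.addListener((newEvents, oldEvents) ->
                System.out.println("Rising run starting at " + newEvents[0].get("temp1")));

        engine.getEPRuntime().sendEvent(new TemperatureEvent(101.0));
    }
}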

Best way to store time series in Rails

I have a table in my database that stores event totals, something like:
event1_count
event2_count
event3_count
I would like to transition from these simple aggregates to more of a time series, so the user can see on which days these events actually happened (like how Stack Overflow shows daily reputation gains).
Elsewhere in my system I already did this by creating a separate table with one record for each daily value; then, to assemble a time series, you end up with a huge database table and the need to query tens or hundreds of records. It works, but I'm not convinced it's the best way.
What is the best way of storing these individual events along with their dates so I can do a daily plot for any of my users?
When building tables like this, the real key is having effective indexes. Test your queries with the EXPLAIN statement or the equivalent in your database of choice.
If you want summary tables, you can build a view that represents the query, or roll the daily results up into a new table on a regular schedule. Summary tables are often the best way to go, as they are quick to query.
The best way to implement this is to use Redis. If you haven't worked with Redis before, I suggest you start: you will be surprised how fast this can get :). The way I would do it is with the Hash data structure Redis provides. Give every user their own Hash (with a unique key per user, like "user:23:counters"). Inside this Hash, use a daily date such as "05/06/2011" as the field, and increment its counter every time an event happens.
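A sketch of that scheme, assuming the Jedis client (key and field layout as described; ISO dates such as "2011-06-05" avoid the day/month ambiguity of "05/06/2011"):

import java.time.LocalDate;
import java.util.Map;
import redis.clients.jedis.Jedis;

public class DailyEventCounters {
    private final Jedis redis = new Jedis("localhost", 6379);

    // One hash per user, one field per day; HINCRBY is atomic,
    // so concurrent events never lose counts.
    public void recordEvent(long userId) {
        redis.hincrBy("user:" + userId + ":counters", LocalDate.now().toString(), 1);
    }

    // Everything needed for the daily plot, in a single round trip.
    public Map<String, String> dailyCounts(long userId) {
        return redis.hgetAll("user:" + userId + ":counters");
    }
}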
A good start would be this thread; it has a simple, beginner-level solution: Time Series Starter. If you are OK with Rails models, that is one way it could work, for what is called an "irregular" time series: an event here and there, not at a regular interval, like a sensor that sends data whenever your door is opened.
The other case, and the one I was looking for in this thread, is a regular time-series DB: values arrive at a fixed interval, say 60/minute, i.e. 1 per second, from a temperature sensor for example. This all boils down to datasets with "buckets", as you rightly suspect: a time-series table gets long, indexes degrade at some point, and so on. Here is one "bucket" approach using Postgres arrays that would be a feasible idea.
It's not available as "plug and play", as far as my research of the web goes.
