EPL Every Limit Subexpression Lifetime - esper

I have a particular case for the "every" pattern; this is the example code:
SELECT * FROM PATTERN [
every(
e1=Event(device_ip IN ( '10.10.10.1' ) AND category IS NOT NULL )) where timer:within(1800 Sec)
->
e2=Event(device_ip IN ( '10.10.10.1' ) AND event_cat_name.toLowerCase() IN ('User') AND log_session_id = e1.log_session_id)
->
e3=Event(device_ip IN ( '10.10.10.1' ) AND result.toLowerCase() IN ('Block') AND log_session_id = e1.log_session_id )
->
e4=Event(device_ip IN ( '10.10.10.1' ) AND category IS NOT NULL AND log_session_id != e1.log_session_id)
->
e5=Event(device_ip IN ( '10.10.10.1' ) AND event_cat_name.toLowerCase() IN ('User') AND log_session_id = e4.log_session_id)
->
e6=Event(device_ip IN ( '10.10.10.1' ) AND result.toLowerCase() IN ('Block') AND log_session_id = e4.log_session_id )
->
e7=Event(device_ip IN ( '10.10.10.1' ) AND category IS NOT NULL AND log_session_id != e4.log_session_id)
->
e8=Event(device_ip IN ( '10.10.10.1' ) AND event_cat_name.toLowerCase() IN ('User') AND log_session_id = e7.log_session_id)
->
e9=Event(device_ip IN ( '10.10.10.1' ) AND result.toLowerCase() IN ('Block') AND log_session_id = e7.log_session_id )
];
Let me explain it:
We know that the "every" pattern restarts after a match is found and creates another "window" looking for the same event; in this case we are applying "every" just to the "e1" event.
The above code is for 9 events, "grouped" in threes: as you can see, "e1" to "e3" correspond to one unique event, "e4" to "e6" to another unique event, and so on. We know each group of 3 events belongs to the same unique event because only those 3 events share the same ID in "log_session_id", and we know that the events "e1" to "e3" are different from "e4" to "e6" and from "e7" to "e9" because every unique event has a different ID in "log_session_id".
So when we have the following sequence of events: e1, e2, e3, e4, e5, e6, e7, e8, e9
When "e1" to "e3" are detected, the pattern every reset and search for the same event of "e1"... all is ok until now, but when "e4" event arrives, because of "e1" and "e4" are almost same conditions, the first window match and also match for the new window created when every restarted. We have in "e4" distint conditions but they are not known in "e1" positions yet, so because e1 does not know that e4 exists at first, in the new window is match and 2 Windows are opened until now:
The first one have 2 unique events (e1 to e6)
The second one have 1 unique window (e4 to e6).
And when "e7" to "e9" arrives, are generated another window, in total until now we have 3 Windows:
The first one have 3 unique events (e1 to e9),
The second one have 2 unique events (e4 to e9)
The third one have 1 unique event (e7 to e9).
So, Do you know how to limit just to the first Window using every? we have tried with AND NOT but we cannot manage to do it.

You could make that pattern much smaller and more readable by using insert-into for the filters that repeat the same expression:
// use this for all the same-expression filters
insert into FilteredStream select * from Event(device_ip in ('10.10.10.1'));
select * from pattern [every(
e1=FilteredStream(category IS NOT NULL )) where timer:within(1800 Sec)
....)];
For removing overlapping matches, there are multiple choices, such as:
Let the overlapping match happen and use a subquery on the output event to see if it overlaps
Use match-recognize instead, which automatically skips past the last match (a minimal sketch follows)
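For illustration, here is a minimal match-recognize sketch covering just one three-step sequence within a single log_session_id; the measures, variable names, and lowercase literals are assumptions, not a drop-in replacement for the full nine-event chain:
select * from Event(device_ip in ('10.10.10.1'))
match_recognize (
  partition by log_session_id
  measures A as a, B as b, C as c
  pattern (A B C)
  define
    A as A.category is not null,
    B as B.event_cat_name.toLowerCase() = 'user',
    C as C.result.toLowerCase() = 'block'
)
Because match-recognize skips past the last row of a completed match by default, a finished sequence is not reused as the start of an overlapping one.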

Related

Arrayformula to preserve a running/cumulative balance when inserting new rows

The top right cell (Natwest) is a list from a range using data validation.
The Opening Balance 1,000.00 is sourced from another sheet using a lookup formula.
Using simple IF statements, the cumulative balance is then produced, according to the Amount column and whether the Natwest account occurs in the Dr (+) or Cr (-) column,
i.e. =if(B4=$D$1,D3+A4,if(C4=$D$1,D3-A4,D3)) and copied down.
                                   Natwest
Amount      Dr          Cr         Balance
                                   1,000.00
100.00      Natwest     Account 1  1,100.00
200.00      Account 2   Natwest      900.00
400.00      Natwest     Account 1  1,300.00
It works fine, except that when a new row is inserted, the if statement formula is not copied into the new row.
I am looking for an arrayformula solution (or another in-cell formula solution), so that the cumulative balance still works but doesn't need to be copied into the new row in column D when new rows are inserted.
(I don't mind moving the Natwest drop-down or the 1,000.00 opening balance elsewhere if a solution requires it.)
Thanks for your help.
Summing up within the same range that the arrayformula fills is always going to be tricky because of circular dependencies. I suggest taking the initial value, then adding the SUMIF of the Dr column and subtracting the SUMIF of the Cr column up to each row. With BYROW you can do it like this:
=BYROW(A4:A,LAMBDA(each,SUMIF(INDIRECT("B4:B"&ROW(each)),D1,A4:each)-SUMIF(INDIRECT("C4:C"&ROW(each)),D1,A4:each)+D3))
Alternate solution:
You can use this custom function, written in Apps Script, to calculate the cumulative balance automatically.
Code:
function customFunction(startnum, key, range) {
  var res = [];
  var current = startnum;
  // Each row x: column 0 = amount, column 1 = Dr account, column 2 = Cr account.
  range.forEach((x) => {
    res.push(x.map((y, index) => {
      // Add the amount when the key account is in Dr, subtract when it is in Cr.
      return y == key && index == 1 ? current = (current + x[0])
           : (y == key && index == 2 ? current = (current - x[0]) : null);
    }).filter(c => c)); // keep only the updated running balance for matching rows
  });
  return res;
}
Custom Function Parameters:
=customFunction(startnum, key, range)
startnum = opening balance
key = Account name
range = cell range
Sample output:
=customFunction(D3,D1,A4:C)

Flink - Join same stream in order to filter some events

I have a stream of data that looks like this:
impressionId | id | name | eventType | timestamp
I need to filter out (ignore) events of type "click" that don't have a matching 'impressionId' of type 'impression' (so basically ignore click events that don't have an impression)
and then count how many impressions in total I have and how many clicks I have (for an id/name pair) for a particular time window.
This is how I approached the solution:
[...]
Table eventsTable = tEnv.fromDataStream(eventStreamWithTimeStamp, "impressionId, id, name, eventType, eventTime.rowtime");
tEnv.registerTable("Events", eventsTable);
Table clicksTable = eventsTable
.where("eventType = 'click'")
.window(Slide.over("24.hour").every("1.minute").on("eventTime").as("minuteWindow"))
.groupBy("impressionId, id, name, eventType, minuteWindow")
.select("impressionId as clickImpressionId, eventType as clickEventType, concat(concat(id,'_'), name) as concatClickId, id as clickId, name as clickName, minuteWindow.rowtime as clickMinute");
Table impressionsTable = eventsTable
.where("eventType = 'impression'")
.window(Slide.over("24.hour").every("1.minute").on("eventTime").as("minuteWindow"))
.groupBy("impressionId, id, name, eventType, minuteWindow")
.select("impressionId as impressionImpressionId, eventType as impressionEventType, concat(concat(id,'_'), name) as concatImpId, id as impId, name as impName, minuteWindow.rowtime as impMinute");
Table filteredClickCount = clicksTable
.join(impressionsTable, "clickImpressionId = impressionImpressionId && concatClickId = concatImpId && clickMinute = impMinute")
.window(Slide.over("24.hour").every("1.minute").on("clickMinute").as("minuteWindow"))
.groupBy("concatClickId, clickMinute")
.select("concatClickId, concatClickId.count as clickCount, clickMinute as eventTime");
DataStream<Test3> result = tEnv.toAppendStream(filteredClickCount, Test3.class);
result.print();
What I'm trying to do is simply create two tables, one with clicks and one with impressions, 'inner' join clicks to impressions, and the ones that join are the clicks that have a matching impression.
Now this doesn't work and I don't know why!?
The counts produced by the last joined table are not correct. It works for the first minute, but after that the counts are off by almost double.
I have then tried to modify the last table like this:
Table clickWithMatchingImpression2 = clicksTable
.join(impressionsTable, "clickImpressionId = impressionImpressionId && concatClickId = concatImpId && clickMinute = impMinute")
.groupBy("concatClickId, clickMinute")
.select("concatClickId, concatClickId.count as clickCount, clickMinute as eventTime");
DataStream<Tuple2<Boolean, Test3>> result2 = tEnv.toRetractStream(clickWithMatchingImpression2, Test3.class);
result2.print();
And... this works!? However I don't know why, and I don't know what to do with this DataStream<Tuple2<Boolean, Test3>> format... Flink refuses to use toAppendStream when the table doesn't have a window.
I would like a simple structure with only the final numbers.
1) Is my approach correct? Is there an easier way of filtering clicks that don't have impressions?
2) Why are the counts not correct in my solution?
I am not entirely sure if I understood your use case correctly; an example with some data points would definitely help here.
Let me explain what your code is doing. First the two tables calculate how many clicks/impressions there were in the last 24 hours.
For an input
new Event("1", "1", "ABC", "...", 1),
new Event("1", "2", "ABC", "...", 2),
new Event("1", "3", "ABC", "...", 3),
new Event("1", "4", "ABC", "...", 4)
You will get windows (array<eventId>, window_start, window_end, rowtime):
[1], 1969-12-31T00:01:00.000, 1970-01-01T00:01:00.000, 1970-01-01T00:00:59.999
[1, 2], 1969-12-31T00:02:00.000, 1970-01-01T00:02:00.000, 1970-01-01T00:01:59.999
[1, 2, 3], 1969-12-31T00:03:00.000, 1970-01-01T00:03:00.000, 1970-01-01T00:02:59.999
...
Therefore when you group both on id and name you get something like:
1, '...', '1_ABC', 1, 'ABC', 1970-01-01T00:00:59.999
1, '...', '1_ABC', 1, 'ABC', 1970-01-01T00:01:59.999
1, '...', '1_ABC', 1, 'ABC', 1970-01-01T00:02:59.999
...
which, if you group again into 24-hour windows, means you count each event with the same id multiple times.
If I understand your use case correctly and you are looking for how many impressions happened in a 1 minute period around an occurrence of a click, an interval join might be what you are looking for. You could implement your case with the following query:
Table clicks = eventsTable
.where($("eventType").isEqual("click"))
.select(
$("impressionId").as("clickImpressionId"),
concat($("id"), "_", $("name")).as("concatClickId"),
$("id").as("clickId"),
$("name").as("clickName"),
$("eventTime").as("clickEventTime")
);
Table impressions = eventsTable
.where($("eventType").isEqual("impression"))
.select(
$("impressionId").as("impressionImpressionId"),
concat($("id"), "_", $("name")).as("concatImpressionId"),
$("id").as("impressionId"),
$("name").as("impressionName"),
$("eventTime").as("impressionEventTime")
);
Table table = impressions.join(
clicks,
$("clickImpressionId").isEqual($("impressionImpressionId"))
.and(
$("clickEventTime").between(
$("impressionEventTime").minus(lit(1).minutes()),
$("impressionEventTime"))
))
.select($("concatClickId"), $("impressionEventTime"));
table
.window(Slide.over("24.hour").every("1.minute").on("impressionEventTime").as("minuteWindow"))
.groupBy($("concatClickId"), $("minuteWindow"))
.select($("concatClickId"), $("concatClickId").count())
.execute()
.print();
As for why Flink sometimes cannot produce an append stream but only a retract stream: very briefly, if an operation does not work based on a time attribute, there is no single point in time when the result is "valid". Therefore it must emit a stream of changes instead of a single appended value. The first field in the tuple tells you whether the record is an insertion (true) or a retraction/deletion (false).
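As a small illustration (a sketch assuming the Test3 POJO from the question), you can keep only the insertions from the retract stream before printing:
DataStream<Tuple2<Boolean, Test3>> retractStream =
    tEnv.toRetractStream(clickWithMatchingImpression2, Test3.class);

retractStream
    .filter(change -> change.f0)   // true = insertion, false = retraction/deletion
    .print();
Keep in mind that with a non-windowed aggregation each new event produces an updated count, so downstream consumers should treat later values for the same key as replacing earlier ones.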

Creating relationships based on nested list

I'm building a graph, and some of the relationships are based on information in nested lists. The relevant nodes are (b:Bundle) and (o:Object); the bundles require certain objects with different quantities and qualities. The nested list that contains these requirements has the format [ [object1, quantity1, quality1], [object2, quantity2, quality2], ... ]
but in the .csv file that I'm using the field has the format
o1,qn1,ql1|o2,qn2,ql2|... The relationship I want to create is
(b)-[r:REQUIRES {quantity, quality}]->(o).
I've tried using various combinations of SPLIT, UNWIND, and FOREACH. A minimal example from my data set:
id: 1
requirements: 24,1,0|188,1,0|190,1,0|192,1,0
That is to say, (b:Bundle {id:1}) -[r:REQUIRES {quantity:1, quality:0}]-> (o:Object {id:24}) and so on.
LOAD CSV WITH HEADERS FROM 'file:///bundles.csv' AS line
WITH SPLIT( UNWIND SPLIT ( line.requirements, '|' ), ',') as reqList
MATCH ( o:Object { id:TOINTEGER(reqList[0]) } )
MATCH ( b:Bundle { id:TOINTEGER(line.id) } )
MERGE (b) -[r:REQUIRES]-> (o)
ON CREATE SET r.quantity = TOINTEGER(reqList[1]),
r.quality = TOINTEGER(reqList[2]);
The error this query gives is
Neo.ClientError.Statement.SyntaxError: Invalid input 'P': expected 't/T' (line 2, column 22 (offset: 78))
" WITH SPLIT( UNWIND SPLIT ( line.requirements, '|' ), ',') as reqList"
^
(The error occurs because UNWIND is a clause, not a function, so it cannot be nested inside a SPLIT expression.) Assuming your CSV file actually looks like this:
id requirements
1 24,1,0|188,1,0|190,1,0|192,1,0
then this query should work:
LOAD CSV WITH HEADERS FROM 'file:///bundles.csv' AS line FIELDTERMINATOR ' '
WITH line.id AS id, SPLIT(line.requirements, '|' ) AS reqsList
UNWIND reqsList AS reqsString
WITH id, SPLIT(reqsString, ',') AS reqs
MATCH ( o:Object { id:TOINTEGER(reqs[0]) } )
MATCH ( b:Bundle { id:TOINTEGER(id) } )
MERGE (b) -[r:REQUIRES]-> (o)
ON CREATE SET r.quantity = TOINTEGER(reqs[1]),
r.quality = TOINTEGER(reqs[2]);

Spark join hangs

I have a table with n columns that I'll call A. In this table there are three columns that I'll need:
vat -> String
tax -> String
card -> String
vat or tax can be null, but not at the same time.
For every unique couple of vat and tax there is at least one card.
I need to alter this table, adding a column card_count in which I put a text value based on the number of cards each unique combination of tax and vat has.
So I've done this:
val cardCount = A.groupBy("tax", "vat").count
val sqlCard = udf((count: Int) => {
if (count > 1)
"MULTI"
else
"MONO"
})
val B = cardCount.withColumn(
"card_count",
sqlCard(cardCount.col("count"))
).drop("count")
In the table B I have three columns now:
vat -> String
tax -> String
card_count -> String
and every operation on this DataFrame is smooth.
Now, because I wanted to add the new column to table A, I performed the following join:
val result = A.join(B,
B.col("tax")<=>A.col("tax") and
B.col("vat")<=>A.col("vat")
).drop(B.col("tax"))
.drop(B.col("vat"))
Expecting to have the original table A with the column card_count.
The problem is that the join hangs, consuming all system resources and freezing the PC.
Additional details:
Table A has ~1.5M elements and is read from parquet file;
Table B has ~1.3M elements.
The system has 8 threads and 30 GB of RAM.
Let me know what I'm doing wrong.
In the end, I didn't find out what the issue was, so I changed approach:
val cardCount = A.groupBy("tax", "vat").count
val cardCountSet = cardCount.filter(cardCount.col("count") > 1)
.rdd.map(r => r(0) + " " + r(1)).collect().toSet
val udfCardCount = udf((tax: String, vat:String) => {
if (cardCountSet.contains(tax + " " + vat))
"MULTI"
else
"MONO"
})
val result = A.withColumn("card_count",
udfCardCount(A.col("tax"), A.col("vat")))
If someone knows a better approach, please let me know.
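One possible alternative (a sketch, untested against the original data) is to compute the per-(tax, vat) count with a window function, so no join is needed at all; nulls in tax or vat form their own partition, matching the groupBy behaviour:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Count rows per (tax, vat) combination without ever leaving table A.
val byTaxVat = Window.partitionBy("tax", "vat")
val result = A.withColumn(
  "card_count",
  when(count(lit(1)).over(byTaxVat) > 1, "MULTI").otherwise("MONO")
)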

Get corresponding values to a select in a group for PostgreSQL

(Background: I'm attempting to find the "peak" hour of activity in a series of Cameraapis, defined as the one-hour period (starting on the hour) that contains the most entries with a start and end date inside it. For example, 1:00 to 2:00 may have 8 entries within that timeframe, but 2:00 to 3:00 has 12 entries, so I would want it to return the 12-entry timeframe.)
I'm having trouble getting associated data from a SELECT query of a group. Here is the code:
def reach_peak_hour_by_date_range(start_date, end_date)
placement_self_device_id = self.device_id
query = <<-SQL
SELECT max(y.num_entries) AS max_entries
FROM
(
SELECT x.starting_hour, count(*) AS num_entries
FROM
(
SELECT date_trunc('hour', visitor_start_time) starting_hour
FROM Cameraapis WHERE device_id = '#{placement_self_device_id}'::text AND visitor_start_time > '#{start_date}'::timestamp AND visitor_end_time < '#{end_date}'::timestamp
) AS x
GROUP BY x.starting_hour
) AS y
SQL
results = Placement.connection.execute(query)
binding.pry
end
Cameraapis have a device_id, visitor_start_time, and visitor_end_time, referenced in the code.
This code successfully returns the max_entries in a 1 hour period, but I can't figure out what to SELECT to get the starting_hour associated with that max_entries. Because it is a GROUP BY, it requires aggregate functions, which I don't actually need. Any advice?
I didn't quite understand the question... use window functions:
select starting_hour , num_entries from (
SELECT starting_hour ,y.num_entries, max(y.num_entries) over() AS max_entries
FROM
(
SELECT x.starting_hour, count(*) AS num_entries
FROM
(
SELECT date_trunc('hour', visitor_start_time) starting_hour
FROM Cameraapis WHERE device_id = '#{placement_self_device_id}'::text AND visitor_start_time > '#{start_date}'::timestamp AND visitor_end_time < '#{end_date}'::timestamp
) AS x
GROUP BY x.starting_hour
) AS y
) as u
where num_entries = max_entries
This next query returns all entries associated with the peak hour; you can modify it to return only the entry count with its associated hour by selecting the hour and count using DISTINCT or grouping:
select * from
(
select x.*, max(num_entries) over()as max_num_entries from
(
SELECT Cameraapis.* ,date_trunc('hour', visitor_start_time) as starting_hour, count(*) over( partition by date_trunc('hour', visitor_start_time)) as num_entries
FROM Cameraapis WHERE device_id = '#{placement_self_device_id}'::text AND visitor_start_time > '#{start_date}'::timestamp AND visitor_end_time < '#{end_date}'::timestamp
) as x
) as x where max_num_entries = num_entries
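For example, one possible modification (a sketch along the same lines) that returns just the peak hour and its entry count:
SELECT DISTINCT starting_hour, num_entries
FROM
(
  SELECT x.*, max(num_entries) OVER () AS max_num_entries
  FROM
  (
    SELECT date_trunc('hour', visitor_start_time) AS starting_hour,
           count(*) OVER (PARTITION BY date_trunc('hour', visitor_start_time)) AS num_entries
    FROM Cameraapis
    WHERE device_id = '#{placement_self_device_id}'::text
      AND visitor_start_time > '#{start_date}'::timestamp
      AND visitor_end_time < '#{end_date}'::timestamp
  ) AS x
) AS y
WHERE max_num_entries = num_entries;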
