NEsper memory usage of the "output" keyword

I have many EPL statements that output once per period (1 to 24 hours). The following is one of my statements:
"SELECT MessageID, VName, count(VName) as count FROM DDIEvent(MajorType=4).std:groupwin(VName).win:time(3 hour).win:length(10) group by VName having count(VName) >= 10 output last every 3 hour"
Without the length window, my case retains around 700K records in 3 hours.
In the statement above, my test case has only 100 different VName values. As I understand it, at most 1000 records (100 * 10 [length]) should be kept in memory at the same time. Am I right?
But the memory usage of my application keeps growing until the output is delivered to the listener.
The memory usage is almost the same as in the sample without the length window.
After the output is delivered to the listener, memory usage drops significantly.
Then another cycle begins, and memory grows slowly until 3 hours later.
I have already checked the documentation and did not find any memory-related topic for the "output" keyword.
Does anyone know the real root cause, and how to avoid it? Or is it just a problem with my EPL?
Thank you~

If you remove "MessageID" from the select clause, the query becomes a regular aggregation query. You could use "last(MessageID)" instead. Because "MessageID" is in the select clause, the engine delivers a row for each event rather than a row for each aggregation group.


What factors determine the memory used in lambda functions?

=SUM(SEQUENCE(10000000))
The formula above can sum up to 10 million virtual array elements. We know that 10 million is the limit according to this question and answer. Now, if the same is implemented with LAMBDA using the lambda helper function REDUCE:
=REDUCE(,SEQUENCE(10000000),LAMBDA(a,c,a+c))
We get,
Calculation limit was reached when trying to calculate this formula
Official documentation says
This can happen in 2 cases:
The computation for the formula takes too long.
It uses too much memory.
To resolve it, use a simpler formula to reduce complexity.
So, it says the reason is space and time complexity. But what is the exact space used to throw this error? How is this determined?
In the REDUCE function above, the limit was at around 66k for a virtual array:
=REDUCE(,SEQUENCE(66660),LAMBDA(a,c,a+c))
However, if we remove the addition criteria and make it return only the current value c, the allowed virtual array size seems to increase to 190k:
=REDUCE(,SEQUENCE(190000),LAMBDA(a,c,c))
After that, it throws an error. So, what factors determine the memory limit here? I think it's a memory limit, because it throws the error almost within a few seconds.
If you're affected by this issue, you can send feedback to Google:
Open a spreadsheet, preferably one where you bumped into the issue.
Replace any sensitive information with anonymized but realistic-looking data. Remove any sensitive information that is not needed to reproduce the issue.
Choose Help > Report a Problem or Help > Help Sheets Improve. If you are on a paid Google Workspace Domain, see Contact Google Workspace support.
Explain why the calculation limit is an issue for you.
Request:
Justice: Removing arbitrary limits on lambda functions
Equality: Avoiding discrimination against lambda functions
Transparency: Documenting the said discrimination in more clarity and detail
Include a link to this Stack Overflow answer post.
Update Oct '22 (Credit to MaxMarkhov)
The limit is now 10x higher, at 1999992 (roughly 2 million). This is still less than 1/5th of the 10 million virtual array limit of non-lambda formulas, but much better than before. Also, the limit for non-lambda formulas doesn't shrink with the number of operations, whereas the limit for lambda helper formulas still does. So, even though it's 10x higher, that just means ~5 extra operations inside the lambda (see table below).
A partial answer
We know for a fact that the following factors decide the calculation limit (drum roll):
Number of operations
(Nested)LAMBDA() function calls
The base number for 1 operation seems to be 199992 (=REDUCE(,SEQUENCE(199992),LAMBDA(a,c,c))). But for a zero-op or no-op (=REDUCE(,SEQUENCE(10000000),LAMBDA(a,c,0))), the memory limit is much higher, but you'll still run into the time limit. We also know the number of operations is a factor, because
=REDUCE(,SEQUENCE(66664/1),LAMBDA(a,c,a+c)) fails
=REDUCE(,SEQUENCE(66664),LAMBDA(a,c,a+c)) works.
=REDUCE(,SEQUENCE(66664),LAMBDA(a,c,a+c+0)) fails
Note that the size of operands doesn't matter. If =REDUCE(,SEQUENCE(39998),LAMBDA(a,c,a+c+0)) works, =REDUCE(,SEQUENCE(39998),LAMBDA(a,c,a+c+100000)) will also work.
For each additional operation inside the lambda function, the maximum allowed array size falls to 1/(2n-1) of the base, where n is the number of operations (credit to @OlegValter for actually figuring out there's a factor multiple here):
| Maximum sequence | Number of operations (inside lambda) | Reduction (from 199992) | Formula |
| --- | --- | --- | --- |
| 199992 | 1 | 1 | REDUCE(,SEQUENCE(199992),LAMBDA(a,c,c)) |
| 66664 | 2 | 1/3 | REDUCE(,SEQUENCE(66664),LAMBDA(a,c,a+c)) |
| 39998 | 3 | 1/5 | REDUCE(,SEQUENCE(39998),LAMBDA(a,c,a+c+10000)) |
| 28570 | 4 | 1/7 | REDUCE(,SEQUENCE(28570),LAMBDA(a,c,a+c+10000+0)) |
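The pattern in the table can be written as a small calculation. The sketch below uses Python purely for illustration; the function name and the floor rounding are assumptions, not anything Google documents. It divides the observed base of 199992 by 2n-1 and reproduces the four measured values.

```python
BASE_LIMIT = 199992  # observed base limit for one operation (before the Oct '22 increase)


def max_sequence(operations: int, base: int = BASE_LIMIT) -> int:
    """Estimate the largest SEQUENCE size for `operations` operations inside the lambda.

    Assumption: the limit is floor(base / (2n - 1)), which matches the table above.
    """
    if operations < 1:
        raise ValueError("operations must be >= 1")
    return base // (2 * operations - 1)


# Print the estimate for 1 to 4 operations, the range covered by the table.
for n in range(1, 5):
    print(n, max_sequence(n))
```

Passing a different `base` (for example 1999992) gives the corresponding estimate for the raised limit, under the same untested assumption that the 2n-1 pattern still holds.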
Operations outside the LAMBDA functions also count. For example, =REDUCE(,SEQUENCE(199992/1),LAMBDA(a,c,c)) will fail due to the extra /1 operation, but you only need to reduce the array size linearly by 1 or 2 per operation, i.e., =REDUCE(,SEQUENCE(199990/1),LAMBDA(a,c,c)) will work.
In addition, LAMBDA function calls themselves cost more. So refactoring your code doesn't eliminate the memory limit but lowers it even further. For example, if your code uses LAMBDA(a,c,(a-1)+(a-1)) and you add another lambda like this: LAMBDA(a,c,LAMBDA(aminus,aminus+aminus)(a-1)), it errors out with far fewer array elements than before (~20% fewer). A LAMBDA call is much more expensive than repeating the expression.
There are many other factors at play, especially with other LAMBDA functions. Google might change their mind about these arbitrary limits later. But this gives a start.
Possible workarounds:
LAMBDA itself isn't restricted. You can nest as much as you want to. Only LAMBDA Helper Functions are restricted. (Credit to player0)
Named functions which don't use LAMBDA(helper functions) themselves, aren't subjected to the same restrictions. But they're subject to maximum recursion restrictions.
Another workaround is to avoid using the lambda as an array formula and use the autofill or drag-fill feature instead, by making the lambda function return only one value per call. Note that this might actually make your sheet slow. But apparently, Google is OK with that: multiple individual calls instead of a single array call. For example, I've written a permutations function here to get a list of all permutations. While it complains about a "memory limit" for an array with more than 6 items, it works easily with autofill/drag-fill/copy+paste and relative ranges.
not even an answer
by brute-forcing a few ideas it looks like there are more hidden variables than previously thought. it is probably safe to say that the upper limit is a result of "running out of memory" especially when calculation time does not play any role. the thing is that there are factors even outside of LAMBDA that affect the computational capabilities of the formula. here is a brief summary of the issue in layman's terms:
WHY WERE/ARE LAMBDA'S MINIONS STUPID?!
UPDATE: limit boundaries were moved 10-fold higher, so none of the below testing formulae limits represent the actual up-to-date state, however, lambda minions are still not limitless!
let's imagine a memory buffer from the 1999 era with a limited size of 30 units that kicks in only when you use LAMBDA with friends (MAP, SCAN,BYCOL, BYROW, REDUCE, MAKEARRAY). keep in mind that in google sheets when we use any other formula, the limiting factor is usually the cell count limit.
example 1
output capability: 199995 cells!
reduction from 199995: 1/1 (meh, but ok)
example 2
output capability: 49998 cells!
reduction from 199995: 1/~4 (*double-checking the calendar if the year is really 2022*)
example 3
output capability: 995 cells!
reduction from 199995: 1/201 !! (*remembering this company built a quantum computer*)
further testing
establishing the baseline: all below formulae are maxed out so they work as "one step before erroring out". please keep noticing the numbers as a direct representation of row (not cell) processing abilities
starting with a simple:
=ROWS(BYROW(SEQUENCE(99994), LAMBDA(x, AVERAGE(x))))
by adding one more x the following would error out so even the length of strings matters:
=ROWS(BYROW(SEQUENCE(99994), LAMBDA(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, AVERAGE(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx))))
doubling the array brings no issues:
=ROWS(BYROW({SEQUENCE(99994), SEQUENCE(99994)}, LAMBDA(x, AVERAGE(x))))
but additional "stuff" will reduce the output by 1:
=ROWS(BYROW({SEQUENCE(99993), SEQUENCE(99993, 1, 5)}, LAMBDA(x, AVERAGE(x))))
interestingly this one runs with no problem so now even the complexity of input matters (?):
=ROWS(BYROW(SEQUENCE(99994, 6, 0, 5), LAMBDA(x, AVERAGE(x))))
and with this one, it seems that even choice of formula selection matters:
=ROWS(BYROW(RANDARRAY(99996, 2), LAMBDA(x, AVERAGE(x))))
but what if we move from virtual input to real input... A1 cell being set to =RANDARRAY(105000, 3) we can have:
=ROWS(BYROW(A1:B99997, LAMBDA(x, AVERAGE(x))))
again, it's not a matter of cells because even with 8 columns we can get the same:
=ROWS(BYROW(A1:H99997, LAMBDA(x, AVERAGE(x))))
not bad, however, indirecting the range will put us back to 99995:
=ROWS(BYROW(INDIRECT("A1:B"&99995), LAMBDA(x, AVERAGE(x))))
another fact is that LAMBDA as a standalone function runs flawlessly even with an array 105000×8 (that's solid 840K cells)
=LAMBDA(x, AVERAGE(x))(A1:H105000)
so is this really the memory issue of LAMBDA (?) or the factors that determine the memory used in LAMBDA are limits of unknown origin bestowed upon LAMBDA by individual incapabilities of:
MAP
SCAN
BYCOL
BYROW
REDUCE
MAKEARRAY
and their unoptimized memory demands shaken by a vast variety of yet unknown variables within our spacetime
Edit 2022/10/26
It seems the Google Sheets team has just increased the max. limit 10x 😍.
1999992 from 199992
My original formula assumed a base of 199992 cells, but as you can see, the logic behind the limit changes and may change again in the future.
LAMBDA+Friends Limit
The maximum number of rows you can use in the formula (guess):
Limit = 1999992/(1 + inside_lambdas) - outside_lambdas
inside_lambdas and outside_lambdas are the numbers of functions and parameters inside and outside the lambda; each of the following counts as 1:
+ / * -
5, A1, "text",
MOD, AVERAGE, etc.
{"array element"}
The limit is about cells operated by the "lambda+" formula: reduce, byrow, etc.
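As a rough illustration, the guessed formula above can be evaluated with a few lines of Python. The function name and the floor rounding are assumptions made for this sketch, and the formula itself is only the guess stated above, not documented behavior.

```python
BASE_CELLS = 1999992  # limit observed after the October 2022 increase


def estimated_limit(inside_lambdas: int, outside_lambdas: int,
                    base: int = BASE_CELLS) -> int:
    """Evaluate Limit = base / (1 + inside_lambdas) - outside_lambdas.

    Assumption: the division is rounded down to a whole number of rows.
    """
    if inside_lambdas < 0 or outside_lambdas < 0:
        raise ValueError("counts must be non-negative")
    return base // (1 + inside_lambdas) - outside_lambdas


# Two counted items inside the lambda, none outside.
print(estimated_limit(2, 0))
# One counted item inside the lambda and one outside.
print(estimated_limit(1, 1))
```

Counting which functions and parameters belong to each bucket is still a manual step, following the list above.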
My tests are here:
Lambda Limits \ Sample Sheet
Steps to fix:
Do Not use Lambda if possible :(
Do most of the calculations outside lambda if possible
Split formulas to multiple cells, having the limit in mind. Copy formulas, each one has its own limit.
Ask Google to Fix this. In Sheets use the menu Help > Help Sheets Improve
Write to the support if you have a paid account.
Final notes:
my formula for the limit is a guess, and it works for my examples and tests. Please try it and comment on this answer if you find an error.
the formula does not answer how long variable names affect the limit (=ROWS(BYROW(SEQUENCE(99994), LAMBDA(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, AVERAGE(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx))))) Need more tests to figure out the correct effect on the limit. As this does not break: =ROWS(BYROW(SEQUENCE(199992), LAMBDA(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, AVERAGE(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)))), my suggestion is this is the max. length of the variable name, and it does not change the cells limit.
The Google Sheets team may change the logic behind the limit, so all of these tests may become invalid over time.

How to space out influxdb continuous query execution?

I have many InfluxDB continuous queries (CQs) that downsample data over various time periods. At one point the load became high and InfluxDB ran out of memory while executing the continuous queries.
Say I have 10 CQs and all 10 execute in InfluxDB at the same time. That impacts memory heavily. I am not sure whether there is a way to space them out evenly or to add some delay so that each CQ executes one by one. My speculation is that executing all the CQs at the same time makes InfluxDB crash. All the CQs are specified in the InfluxDB config. I hope there is a way to include a time delay between the CQs in the config, but I don't know exactly how to do that. One sample CQ:
CREATE CONTINUOUS QUERY "cq_volume_reads" ON "metrics"
BEGIN
SELECT sum(reads) as reads INTO rollup1.tire_volume FROM
"metrics".raw.tier_volume GROUP BY time(10m),*
END
And also I don't know whether this is the best way to resolve the problem. Any thoughts on this approach or suggesting any better approach will be much appreciated. It would be great to get suggestions in using debugging tools for influxdb as well. Thanks!
@Rajan - A few comments:
The canonical documentation for CQs is here. Much of what I'm suggesting is from there.
Are you using back-referencing? I see your example CQ uses GROUP BY time(10m),* - the * wildcard is usually used with backreferences. Otherwise, I don't believe you need to include the * to indicate grouping by all tags - it should already be grouped by all tags.
If you are using backreferences, that runs the CQ for each measurement in the metrics database. This is potentially very many CQ executions at the same time, especially if you have many CQ defined this way.
You can set offsets with GROUP BY time(10m, <offset>) but this also impacts the time interval used for your aggregation function (sum in your example) so if your offset is 1 minute then timestamps will be a sum of data between e.g. 13:11->13:21 instead of 13:10 -> 13:20. This will offset execution but may not work for your downsampling use case. From a signal processing standpoint, a 1 minute offset wouldn't change the validity of the downsampled data, but it might produce unwanted graphical display problems depending on what you are doing. I do suggest trying this option.
Otherwise, you can try to reduce the number of downsampling CQs to reduce memory pressure or downsample on a larger timescale (e.g. 20m) or lastly, increase the hardware resources available to InfluxDB.
For managing memory usage, look at this post. There are not many adjustments in 1.8 but there are some.
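To make the effect of the GROUP BY time(10m, <offset>) option concrete, here is a minimal Python sketch of how a time bucket shifts when an offset is applied. It only models the bucket arithmetic described above; the function name is hypothetical and this is not InfluxDB code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_bounds(ts: datetime, interval: timedelta,
                  offset: timedelta = timedelta(0)):
    """Return (start, end) of the interval-sized bucket containing ts.

    Buckets are aligned to the Unix epoch and then shifted by `offset`.
    """
    # Whole number of intervals between the shifted epoch and ts.
    n = (ts - EPOCH - offset) // interval
    start = EPOCH + n * interval + offset
    return start, start + interval


ts = datetime(2024, 1, 1, 13, 15, tzinfo=timezone.utc)
ten_minutes = timedelta(minutes=10)

# Without an offset the bucket runs 13:10 -> 13:20.
print(bucket_bounds(ts, ten_minutes))
# With a 1 minute offset the same point falls into 13:11 -> 13:21.
print(bucket_bounds(ts, ten_minutes, timedelta(minutes=1)))
```

Giving each CQ a different offset would stagger the bucket boundaries in this way, at the cost of the shifted aggregation windows mentioned above.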

How to count the number of metrics sent to Datadog over a 24 hour period?

I have a situation where I'm trying to count the number of files loaded into the system I am monitoring. I'm sending a "load time" metric to Datadog each time a file is loaded, and I need to send an alert whenever an expected file does not appear. To do this, I was going to count up the number of "load time" metrics sent to Datadog in a 24 hour period, then use anomaly detection to see whether it was less than the normal number expected. However, I'm having some trouble finding a way to consistently pull out this count for use in the alert.
I can't use the count_nonzero function, as some of my files are empty and have a load time of 0. I do know about .as_count() and count:metric{tags}, but I haven't found a way to include an evaluation interval with either of these. I've tried using .rollup(count, time) to count up the metrics sent, but this call seems to return variable results based on the rollup interval. For instance, if I compare intervals of 2000 and 4000 seconds, I would expect each 4000 second interval to count up about the sum of two 2000 second intervals over the same time period. This does not seem to be what happens at all - the counts for the smaller intervals seem to add up to much more than the count for the larger one. Additionally some rollup intervals display decimal numbers as counts, which does not make any sense to me if this function is doing what I thought it was.
Does anyone have any thoughts on how to accomplish this? I'd really appreciate any new ideas.

Influx index and high cardinality

I have a high-throughput system. I found out that, since many events have the same timestamp, Influx had overwritten many events.
Therefore I tried moving from milliseconds to nanoseconds, but since I am using Java, I couldn't get real clock-based nanoseconds.
I came up with this solution:
I created a new tag called "descriptor", and for each event I insert a random number between 1 and 1000. These values are fixed, and the probability of the same timestamp having the same random descriptor value is very low. This fixes my problem and I can see all the events.
My question is whether it is OK to use these 1000 values, since this is a tag and I understand it can mess up my index and my performance.
Regards, Ido
As the random "descriptors" are completely uncorrelated to other event tags, in the worst case this could increase your series cardinality by 3 orders of magnitude. This is because each existing series (s) will potentially split into up to 1000 unique series (s,1),(s,2),...,(s,1000).
How much of a problem this is will depend on your existing series cardinality. Increasing from 10 to 10,000 is probably no big deal. Increasing from 100,000 to 100,000,000 is more likely to be an issue. You would need to experiment and profile to see.
An alternative approach might be to encode the "descriptor" in the microsecond and/or nanosecond component(s) of the timestamp (as you're not using them anyway) to make them unique.
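A minimal sketch of that alternative, in Python for illustration only: a per-writer sequence number is placed in the sub-millisecond digits of a nanosecond timestamp. The class and function names are hypothetical, and the sketch assumes a single writer per series whose millisecond clock does not go backwards.

```python
NS_PER_MS = 1_000_000


def unique_timestamp_ns(epoch_ms: int, sequence: int) -> int:
    """Combine a millisecond timestamp with a sequence number into nanoseconds."""
    if not 0 <= sequence < NS_PER_MS:
        raise ValueError("sequence must fit in the sub-millisecond digits")
    return epoch_ms * NS_PER_MS + sequence


class TimestampSequencer:
    """Hands out distinct nanosecond timestamps for events in the same millisecond."""

    def __init__(self) -> None:
        self._last_ms = None
        self._seq = 0

    def next(self, epoch_ms: int) -> int:
        if epoch_ms == self._last_ms:
            self._seq += 1      # same millisecond: bump the counter
        else:
            self._last_ms = epoch_ms
            self._seq = 0       # new millisecond: restart the counter
        return unique_timestamp_ns(epoch_ms, self._seq)


sequencer = TimestampSequencer()
print(sequencer.next(1700000000000))
print(sequencer.next(1700000000000))
```

Unlike the random tag, this keeps the series cardinality unchanged, because the uniqueness lives in the timestamp instead of in a tag value.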

Issue in GraphAware TimeTree - bad insertions for minute resolution

EDIT: Solved problem.
TL;DR: TimeTree wants milliseconds since epoch. I was using seconds since epoch as my time values.
Versions:
Neo4j community : 3.0.3
GraphAware / TimeTree community server plugins : 3.0.3.39
I recently started using a time tree to search my graph by time ranges. I noticed some funny behavior the other day when I made a query like this:
"
WITH ({start:1350542000,end:1350543000}) as tr
CALL ga.timetree.events.range(tr) YIELD node as n
RETURN n
LIMIT 5;
"
Note that the time range here is only 1000 seconds apart. What was strange is that my return nodes (which are all of the same type) looked like this:
Node[343421]{gtype:1,bbox:[121.01454162597656,20.602155685424805,121.01454162597656,20.602155685424805],meta:"KAOU_20110613v20001_1422",time:1308026580,lat:20.602155685424805,lon:121.01454162597656}
Specifically, note that the value time:1308026580 is NOT within the bounds I provided. Now, I made this example up (because the query takes forever to run right now), but I was getting similar results the last time I ran the query.
So I investigated a little. First off, this is how I insert my data into the TimeTree:
MATCH (r:record {meta:"KAOU_20110613v20001_1422"})
WITH r
CALL ga.timetree.events.attach({node: r, time: r.time, relationshipType: "observedOn", resolution:"Minute"})
YIELD node
RETURN node.meta;
Notice the resolution: "Minute". When I first wrote this query as a function, I forgot to specify the resolution. So when I added about 4-5 records with this method, the resolution defaulted to "Day".
I didn't think this was an issue, so I just left these records in the graph at "Day" resolution and everything following would be at resolution "Minute".
So I decided to check out the graph using the Neo4J browser to see if anything weird is going on. From here, I executed the following query:
MATCH p=(:TimeTreeRoot)-[:CHILD*5]-()-[]-(:record) RETURN p LIMIT 25;
Ah ha! I noticed that all those records attached to the Minute node are consecutive records in terms of their time value. For example:
KAOU_20110613v20001_0956 has time:1307998620
KAOU_20110613v20001_0957 has time:1307998680
These consecutive records are all 1 minute apart. (i.e. time1 - time2 == 60)
So why are they being added to the same Minute node? I used an epoch time converter to verify that these time stamps are in fact 1 minute apart and represent the dates they are intended to.
I believe this problem is contributing to my performance lag since all my records are globbing up on Minute nodes.
So, either I missed something regarding the time values and how the Time Tree handles them, or something else fishy is going on.
I figured out my problem. Going back through some documentation, I found the following in the Examples section:
"time instant represented by {time} which is a long (the number of milliseconds since 1/1/1970)."
I must have misread this somewhere, and assumed the value was represented in seconds. That explains the behavior I am experiencing perfectly. All I need to do is multiply all my time values by 1000 to get milliseconds instead.
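The effect can be checked with a little arithmetic. The Python sketch below is illustrative only and the helper names are hypothetical. It shows that the two sample values from the question land in the same minute when read as milliseconds, and in consecutive minutes once multiplied by 1000.

```python
MS_PER_MINUTE = 60_000


def minute_bucket(epoch_ms: int) -> int:
    """Index of the minute (counted from the epoch) containing a millisecond timestamp."""
    return epoch_ms // MS_PER_MINUTE


def seconds_to_millis(epoch_seconds: int) -> int:
    """Convert seconds since the epoch to milliseconds since the epoch."""
    return epoch_seconds * 1000


t1, t2 = 1307998620, 1307998680  # seconds since epoch, one minute apart

# Read as milliseconds, the two values are only 60 ms apart and share a minute.
print(minute_bucket(t1) == minute_bucket(t2))

# Converted to milliseconds first, they fall into consecutive minutes.
print(minute_bucket(seconds_to_millis(t2)) - minute_bucket(seconds_to_millis(t1)))
```

The same conversion applies to the start and end values passed to the range query, which also need to be in milliseconds.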
