Issue in GraphAware TimeTree - bad insertions for minute resolution - neo4j

EDIT: Problem solved.
TL;DR: TimeTree wants milliseconds since epoch. I was using seconds since epoch as my time values.
Versions:
Neo4j community: 3.0.3
GraphAware/TimeTree community server plugins: 3.0.3.39
I recently started using a time tree to search my graph by time ranges. I noticed some funny behavior the other day when I made a query like this:
"
WITH ({start:1350542000,end:1350543000}) as tr
CALL ga.timetree.events.range(tr) YIELD node as n
RETURN n
LIMIT 5;
"
Note that the start and end of this range are only 1000 seconds apart. What was strange was that the returned nodes (which are all of the same type) looked like this:
Node[343421]{gtype:1,bbox:[121.01454162597656,20.602155685424805,121.01454162597656,20.602155685424805],meta:"KAOU_20110613v20001_1422",time:1308026580,lat:20.602155685424805,lon:121.01454162597656}
Specifically, note that the value time:1308026580 is NOT within the bounds I provided. (I made this example up because the query takes forever to run right now, but I was getting similar results the last time I ran it.)
So I investigated a little. First off, this is how I insert my data into the TimeTree:
MATCH (r:record {meta:"KAOU_20110613v20001_1422"})
WITH r
CALL ga.timetree.events.attach({node: r, time: r.time, relationshipType: "observedOn", resolution:"Minute"})
YIELD node
RETURN node.meta;
Notice the resolution: "Minute". When I first wrote this query as a function, I forgot to specify the resolution, so the first 4-5 records I added with this method defaulted to "Day" resolution.
I didn't think this was an issue, so I left those records in the graph at "Day" resolution; everything added afterwards would be at "Minute" resolution.
So I decided to inspect the graph in the Neo4j browser to see if anything weird was going on. From there, I executed the following query:
MATCH p=(:TimeTreeRoot)-[:CHILD*5]-()-[]-(:record) RETURN p LIMIT 25;
Aha! I noticed that all the records attached to a given Minute node are consecutive in terms of their time value. For example:
KAOU_20110613v20001_0956 has time:1307998620
KAOU_20110613v20001_0957 has time:1307998680
These consecutive records are all 1 minute apart (i.e., time2 - time1 == 60).
So why are they being added to the same Minute node? I used an epoch time converter to verify that these timestamps really are 1 minute apart and represent the dates they are intended to.
I believe this problem is contributing to my performance lag, since all my records are clumping onto the same Minute nodes.
So, either I missed something regarding the time values and how the Time Tree handles them, or something else fishy is going on.

I figured out my problem. Going back through some documentation, I found the following in the Examples section:
"time instant represented by {time} which is a long (the number of milliseconds since 1/1/1970)."
I must have misread this somewhere and assumed the value was in seconds. That explains the behavior I was seeing perfectly. All I need to do is multiply my time values by 1000 to get milliseconds instead.
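As a sanity check, here is a small Python sketch (not part of my Neo4j setup) using the two record times from above. It shows why seconds-since-epoch values collapse onto a single Minute node when TimeTree interprets them as milliseconds, and that multiplying by 1000 fixes it; in the attach call that just means passing time: r.time * 1000.

from datetime import datetime, timezone

# Two consecutive record times from above, currently stored as *seconds* since epoch.
t1 = 1307998620  # KAOU_20110613v20001_0956
t2 = 1307998680  # KAOU_20110613v20001_0957

def as_millis_instant(value):
    """Interpret value as milliseconds since epoch, which is what TimeTree expects."""
    return datetime.fromtimestamp(value / 1000, tz=timezone.utc)

# Interpreted as milliseconds, the raw values land in January 1970, only 60 ms apart,
# so TimeTree files both events under the same Minute node:
print(as_millis_instant(t1))  # 1970-01-16 03:19:58.620000+00:00
print(as_millis_instant(t2))  # 1970-01-16 03:19:58.680000+00:00

# Multiplied by 1000 they become proper millisecond timestamps, one minute apart:
print(as_millis_instant(t1 * 1000))  # 2011-06-13 20:57:00+00:00
print(as_millis_instant(t2 * 1000))  # 2011-06-13 20:58:00+00:00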

Related

How to count the number of metrics sent to Datadog over a 24 hour period?

I have a situation where I'm trying to count the number of files loaded into the system I am monitoring. I'm sending a "load time" metric to Datadog each time a file is loaded, and I need to send an alert whenever an expected file does not appear. To do this, I was going to count up the number of "load time" metrics sent to Datadog in a 24 hour period, then use anomaly detection to see whether it was less than the normal number expected. However, I'm having some trouble finding a way to consistently pull out this count for use in the alert.
I can't use the count_nonzero function, as some of my files are empty and have a load time of 0. I do know about .as_count() and count:metric{tags}, but I haven't found a way to include an evaluation interval with either of these. I've tried using .rollup(count, time) to count up the metrics sent, but this call seems to return variable results depending on the rollup interval. For instance, if I compare intervals of 2000 and 4000 seconds, I would expect each 4000-second interval to count roughly the sum of two 2000-second intervals over the same time period. That does not seem to be what happens at all - the counts for the smaller intervals add up to much more than the count for the larger one. Additionally, some rollup intervals display decimal numbers as counts, which makes no sense to me if this function is doing what I thought it was.
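To make the goal concrete, this is the count I am ultimately trying to reproduce in a Datadog monitor, sketched in Python over hypothetical raw (timestamp, load_time) points; the values are made up, but zero-valued load times must still be counted, which is why count_nonzero doesn't work for me.

# Hypothetical raw submissions of the "load time" metric:
# (timestamp_seconds, load_time_seconds). Empty files legitimately report 0.
points = [
    (1_650_000_000, 12.4),
    (1_650_003_600, 0.0),   # empty file, must still count as a load
    (1_650_007_200, 8.1),
]

def loads_in_window(points, window_start, window_seconds=86_400):
    """Count every submitted point in a 24-hour window, including zero values."""
    window_end = window_start + window_seconds
    return sum(1 for ts, _ in points if window_start <= ts < window_end)

print(loads_in_window(points, 1_650_000_000))  # 3 (count_nonzero would give 2)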
Does anyone have any thoughts on how to accomplish this? I'd really appreciate any new ideas.

Prometheus increase not handling process restarts

I am trying to figure out the behavior of Prometheus' increase() querying function with process restarts.
When there is a process restart within a 2m interval and I query:
sum(increase(my_metric_total[2m]))
I get a value less than expected.
For example, in a simple experiment I mock:
3 lcm_restarts
1 process restart
2 lcm_restarts
All within a 2 minute interval.
Upon querying:
sum(increase(lcm_restarts[2m]))
I receive a value of ~4.5 when I am expecting 5.
(Screenshots attached: the lcm_restarts graph and the sum(increase(lcm_restarts[2m])) result.)
Could someone please explain?
Pretty concise and well-prepared first question here. Please keep this spirit!
When working with counters, functions such as rate(), irate(), and increase() adjust for resets caused by restarts. Despite its name, the increase() function does not calculate the absolute increase over the given time frame; it is just another way of writing rate(metric[interval]) * number_of_seconds_in_interval. The rate() function takes the first and last measurements in a series and calculates the per-second increase over that time. This is why you may observe non-integer increases even though the counter only ever increments by whole numbers: the measurements almost never fall exactly at the start and end of the interval.
For more details, have a look at the Prometheus docs for the increase() function. There are also some good hints on what to do and what not to do when working with counters in the Robust Perception blog.
Looking at your label dimensions, I also think counter resets don't actually apply to your constructed example. There is a label called reason that changed between the restarts, which created a second time series rather than continuing the existing one. So you are effectively summing the increases of two different time series, each of which has its own extrapolation applied.
So there isn't really anything wrong with what you are doing; you just shouldn't rely on getting highly precise numbers out of Prometheus for this use case.
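To make that concrete, here is a minimal Python sketch of the simplified model just described (not Prometheus' actual extrapolation code): the increase between the first and last sample is reset-adjusted, turned into a per-second rate, and then scaled to the full window, which is how non-integer results such as ~4.5 can appear.

def reset_adjusted_delta(samples):
    """Counter increase between the first and last sample, adjusted for resets:
    a drop in value means the process restarted, so the post-restart value is
    all new increase. This is the same adjustment rate()/increase() apply."""
    delta = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        delta += curr if curr < prev else curr - prev
    return delta

def simple_increase(samples, window_seconds):
    """Simplified model of increase(m[d]): the per-second rate between the first
    and last sample, scaled up to the full window length."""
    spanned = samples[-1][0] - samples[0][0]
    return reset_adjusted_delta(samples) / spanned * window_seconds

# Scrapes (t_seconds, counter_value) inside a 2-minute window; the counter
# resets after t=80 because the process restarted.
samples = [(20, 3), (50, 4), (80, 5), (100, 1)]
print(simple_increase(samples, 120))  # 4.5, even though the counter only advanced by 3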
Prometheus may return unexpected results from the increase() function for the following reasons:
Prometheus may return fractional results from increase() over an integer counter because of extrapolation. See this issue for details.
Prometheus may return lower-than-expected results from increase(m[d]) because it doesn't take into account the possible counter increase between the last raw sample just before the specified lookbehind window [d] and the first raw sample inside the lookbehind window [d]. See this article and this comment for details.
Prometheus skips the increase for the first sample in a time series. For example, increase() over the following series of samples would return 1 instead of 11: 10 11 11. See these docs for details (a tiny sketch of this point follows below).
These issues are going to be fixed according to this design doc. In the meantime it is possible to use other Prometheus-like systems such as VictoriaMetrics, which are free from these issues.
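A tiny Python sketch of that last point, under the simplifying assumption that the first sample only establishes a baseline and its own value is otherwise ignored:

def increase_ignoring_first_sample(values):
    # The first sample of a new series is treated as a baseline; its own value
    # never contributes to the reported increase.
    return values[-1] - values[0]

print(increase_ignoring_first_sample([10, 11, 11]))  # 1, not 11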

What is the meaning of OneMinuteRate in JMX?

I am trying to calculate the reads/second and writes/second in my Cassandra 2.1 cluster. After searching and reading, I came to know about the JMX bean
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Here I can see OneMinuteRate. I started a brand new cluster and began collecting these metrics from 0.
When I wrote my first record, I saw:
Count = 1
OneMinuteRate = 0.01599111...
Does it mean that my write/s is 0.0159911?
Or does it mean that based on 1 minute data, my write latency is 0.01599 where Write Latency refers to the response time for writing a record?
Please help me understand the value.
Thanks.
It means that in the last minute, your writes were occurring at a rate of 0.01599 writes per second. Think about it this way: the rate of writes over the last 60 seconds would be
WritesInLastMinute ÷ 60
So in your case
1 ÷ 60 = 0.0166
Or more precisely, .01599.
If you observed no further writes after that, the value would descend down to zero over the next minute.
OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate are exponential moving averages. They are not simply a reading divided by time; instead, as the name implies, they take an exponentially weighted series of averages, as below:
result(t) = (1 - w) * result(t - 1) + w * event_this_period
where w is the weighting factor and t is the ticking time. In other words, they take roughly 20% of the new reading and 80% of the old readings; it's the same way UNIX systems measure CPU load.
However, when this is applied to requests that a server receives, the chart below (taken from a single request to a server, as measured by Dropwizard) shows what happens.
As you can see, a single request draws out a curve over time. This is really useful for determining trends, but I'm not sure these rates are great for monitoring live traffic, especially critical traffic.
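For illustration, here is a small Python sketch of the formula above; the weighting factor and tick interval below are simplified assumptions, not the exact values Cassandra's Dropwizard metrics use.

def ewma_rate(events_per_tick, w=0.2, tick_seconds=5):
    """Exponentially weighted moving average of a per-second rate.

    events_per_tick: events observed in each successive tick.
    w: weighting factor, roughly "20% new reading, 80% old readings".
    Returns the smoothed rate after each tick, in events per second.
    """
    result = 0.0
    rates = []
    for events in events_per_tick:
        instant_rate = events / tick_seconds
        result = (1 - w) * result + w * instant_rate
        rates.append(round(result, 5))
    return rates

# One write in the first tick, then nothing: the rate jumps up slightly,
# then decays toward zero instead of dropping to zero immediately.
print(ewma_rate([1] + [0] * 11))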

Neo4j - query performance degradation

Here is my query:
MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)-[s:AUTHORED]-(a:Author)
MATCH (l:Language)-[t:USED]->(w:Woka)
WHERE (a.author_name =~ '.*Camus.*' and a.author_name =~ '.*Albert.*')
RETURN p.publisher_name, w.woka_title, a.author_name, l.language_name;
The first time this executes, the result is returned in 3.8 seconds. For the second execution, a couple of minutes later, the result is returned in 15.1 seconds. The more I execute it, the longer the response time gets: by the third execution the response time has increased again, and shortly after that I am getting results in anywhere from 30 to 90 seconds.
I am the only user of this (development) database. No data is added, deleted, or changed there, and no indexes are dropped or created either.
When I close two out of three connections to the database, the response time goes back to 15 seconds.
Memory is set to 4GB initial and up to 8GB max. The server has 16GB of total memory.
What is happening here? Why does the response time differ so much?
How big is your graph? It could be that a lot of heap is allocated for caching, and then there is not enough space left to run the queries without garbage collection.
I presume your relationships are all 1:n; if not, add a WITH distinct p,w,a in between the two matches.
Your query is also suboptimal and will probably create a lot of intermediate results; you can use PROFILE to look at the query plan.
Try this:
PROFILE
MATCH (a:Author)
WHERE (a.author_name =~ '.*Camus.*' and a.author_name =~ '.*Albert.*')
MATCH (p:Publisher)-[:PUBLISHED]->(w:Woka)<-[:AUTHORED]-(a)
MATCH (l:Language)-[:USED]->(w:Woka)
RETURN p.publisher_name, w.woka_title, a.author_name, l.language_name;
I borrowed from other solutions and changed the parameters as follows in conf/neo4j.properties (the server was restarted afterwards):
neostore.nodestore.db.mapped_memory=512M
neostore.relationshipstore.db.mapped_memory=512M
neostore.propertystore.db.mapped_memory=512M
neostore.propertystore.db.strings.mapped_memory=512M
neostore.propertystore.db.arrays.mapped_memory=512M
The results now return, on average, about three times faster, narrowing the time from 90 seconds down to 30 seconds, and I am no longer getting disconnected from the database's web interface.

How does OpenTSDB downsample data

I have a 2 part question regarding downsampling on OpenTSDB.
The first is: does anyone know whether OpenTSDB treats the last end point as inclusive or exclusive when it calculates downsampling, or does it count the end data point twice?
For example, suppose my time interval is 12:30pm-1:30pm, I get DPs every 5 min starting at 12:29:44pm, and my downsample interval sums every 10-minute block. Does the system take the DPs from 12:30-12:39 and sum them, then 12:40-12:49 and sum them, etc., or does it take the DPs from 12:30-12:40, then from 12:40-12:50, etc.? Yes, I know my data is off by about 15 seconds, but I don't control that.
I've tried to calculate it by hand, but the data I have isn't helping me. The numbers I'm calculating aren't adding up to either interpretation above, nor are they matching what the graph is showing. I don't have access to the system that's pushing numbers into OpenTSDB, so I can't set up dummy data to check.
The second question is how downsampling plots its points on the graph, given my time range and downsample interval. I set downsampling to sum 10-minute blocks and set my range to 12:30pm-1:30pm. The graph shows the first point of the downsampled series at 12:35pm, which makes logical sense. I then changed the range to 12:24pm-1:29pm and expected the first point to be at 12:30pm, but the first point shown is 12:25pm.
Hopefully someone can answer these questions for me. In the meantime, I'll continue trying to find some data in my system that helps show/prove how downsampling should work.
Thanks in advance for your help.
Downsampling isn't currently working the way you expect, although since this is a reasonable and commonly made expectation, we are thinking of changing it in a later release of OpenTSDB.
You're assuming that if you ask for a "10 min sum", the data points will be summed up within each "round" (or "aligned") 10 minute block (e.g. 12:30-12:39 then 12:40-12:49 in your example), but that's not what happens. What happens is that the code will start a 10-minute block from whichever data point is the first one it finds. So if the first one is at time 12:29:44, then the code will sum all subsequent data points until 600 seconds later, meaning until 12:39:44.
Within each 600 second block, there may be a varying number of data points. Some blocks may have more data points than others. Some blocks may have unevenly spaced data points, e.g. maybe all the data points are within one second of each other at the beginning of the 600s block. So in order to decide what timestamp will result from the downsampling operation, the code uses the average timestamp of all the data points of the block.
So if all your data points are evenly spaced throughout your 600s block, the average timestamp will fall somewhere in the middle of the block. But if, say, all the data points are within one second of each other at the beginning of the 600s block, then the returned timestamp will reflect that, by virtue of being an average. Just to be clear, the code takes an average of the timestamps regardless of which downsampling function you picked (sum, min, max, average, etc.).
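Here is a small Python sketch of that behaviour as just described (my simplified reading, not the actual OpenTSDB code): blocks are anchored at the first data point rather than at aligned boundaries, and each downsampled point gets the average timestamp of its block.

def downsample_sum(points, interval_seconds=600):
    """points: list of (timestamp_seconds, value), sorted by timestamp.

    Blocks are anchored at the first data point seen rather than at "round"
    boundaries, each block spans interval_seconds, the values are summed, and
    the resulting point is stamped with the block's average timestamp."""
    blocks = []
    block_start = None
    for ts, value in points:
        if block_start is None or ts >= block_start + interval_seconds:
            block_start = ts
            blocks.append([])          # start a new block anchored at this point
        blocks[-1].append((ts, value))
    # Each block becomes one point: (average timestamp, sum of values).
    return [(sum(t for t, _ in b) / len(b), sum(v for _, v in b)) for b in blocks]

# Data points every 5 minutes starting at 12:29:44, given as seconds after 12:00:00.
points = [(1784, 1), (2084, 1), (2384, 1), (2684, 1), (2984, 1)]
print(downsample_sum(points))  # blocks start at 12:29:44, not at the aligned 12:30:00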
If you want to experiment quickly with OpenTSDB without writing to your production system, consider setting up a single-node OpenTSDB instance. It's very easy to do as is shown in the getting started guide.
