What is the meaning of OneMinuteRate in JMX?

I am trying to calculate the Read/second and Write/Second in my Cassandra 2.1 cluster. After searching and reading, I came to know about JMX bean
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Here I can see OneMinuteRate. I started a brand new cluster and began collecting these metrics from zero.
When I wrote my first record, I saw
Count = 1
OneMinuteRate = 0.01599111...
Does it mean that my write/s is 0.0159911?
Or does it mean that based on 1 minute data, my write latency is 0.01599 where Write Latency refers to the response time for writing a record?
Please help me understand the value.
Thanks.

It means that over the last minute, your writes were occurring at a rate of 0.01599 per second. Think about it this way: the rate of writes in the last 60 seconds would be
WritesInLastMinute ÷ 60
So in your case
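1 ÷ 60 ≈ 0.0167
The reported value, 0.01599, is slightly lower because the rate is an exponentially weighted average rather than a plain division.
If you observed no further writes after that, the value would decay toward zero over the following minutes.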

OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate are exponential moving averages. They do not simply divide a count by elapsed time; instead, as the name implies, they compute an exponentially weighted series of averages, as below:
result(t) = (1 - w) * result(t - 1) + (w) * event_this_period
where w is the weighting factor and t is the tick. In other words, with w = 0.2 for example, each tick takes 20% of the new reading and 80% of the previous result. It's the same way UNIX systems compute load averages.
Applied to the requests a server receives, this gives the behaviour in the chart below, taken from a single request to a server and measured by Dropwizard.
[Chart: rate over time after a single request, measured by Dropwizard]
As you can see, a single request draws a curve over time. That is really useful for spotting trends, but I'm not sure these rates are great for monitoring live traffic, especially critical traffic.
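To make the formula concrete, here is a minimal Python sketch of the same update rule. The 5-second tick and the weighting factor w = 1 - e^(-5/60) are assumptions on my part (I believe they match what Dropwizard Metrics uses for its one-minute rate), and the function name is made up for illustration. Under those assumptions, a single write in the first tick gives a rate of about 0.01599 per second, which then decays tick by tick.

```python
import math

TICK_SECONDS = 5       # assumed tick interval
WINDOW_SECONDS = 60    # one-minute window

# Assumed weighting factor for a one-minute window sampled every 5 seconds.
ALPHA = 1 - math.exp(-TICK_SECONDS / WINDOW_SECONDS)


def ewma_ticks(events_per_tick, alpha=ALPHA, tick_seconds=TICK_SECONDS):
    """Return the per-second rate after each tick, starting from a rate of 0."""
    rate = 0.0
    rates = []
    for events in events_per_tick:
        instant_rate = events / tick_seconds
        # result(t) = (1 - w) * result(t - 1) + w * event_this_period
        rate = (1 - alpha) * rate + alpha * instant_rate
        rates.append(rate)
    return rates


# One write in the first tick, then nothing for the next minute.
rates = ewma_ticks([1] + [0] * 12)
for tick, rate in enumerate(rates, start=1):
    print(f"t={tick * TICK_SECONDS:3d}s  rate={rate:.8f}")
```

The printed values trace the same kind of curve a single request produces: a small jump followed by a slow decay.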

Related

SLO calculation for 90% of requests under 1000ms

I'm trying to figure out the PromQL for an SLO for latency, where we want 90% of all requests to be served in 1000ms or less.
I can get the 90th percentile of requests with this:
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]) ) )
And I can find what percentage of ALL requests were served in 1000ms or less with this.
((sum(rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]))) / (sum (rate(MyMetric_Request_Duration_count{instance="foo"}[1h])))) *100
Is it possible to combine these into one query that tells me what percentage of requests in the 90th percentile were served in 1000ms or less?
I tried the most obvious (to me anyway) solution, but got no data back.
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]) ) )
The goal is to get a measure that shows, for the 90th percentile of requests, how many of those requests were under 1000ms. It seems like this should be simple, but I can't find a PromQL query that allows me to do it.
Welcome to SO.
You are basically measuring, out of all the requests, how many are served in under 1000ms. To find that, I would divide the number of requests served in under 1000ms by the total number of requests. In my GCP world, that translates to a query like this:
(sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="abcxyz",le="1000"}[1m]))/sum(rate(istio_request_duration_milliseconds_count{reporter="destination",namespace="abcxyz"}[1m])))*100
Once you have a graph set up with the above query in Grafana, you can set up an alert on anything below 93, so that you are alerted before you reach your SLO threshold of 90%.
Prometheus doesn't provide a function for calculating the share (that is, the percentage) of requests served in under one second from histogram buckets. Such a function does exist in VictoriaMetrics, a Prometheus-like monitoring system I work on: histogram_share(). For example, the following query returns the share of requests with durations smaller than one second served during the last hour:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
Then the following query can be used for alerting when the share of requests served in less than one second drops below 90%:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le)) < 0.9
Please note that all functions that work over histogram buckets return estimated results. Their accuracy depends heavily on the bucket boundaries used. See this article for details.
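For intuition about what these bucket-based calculations do, here is a rough Python sketch that works on cumulative bucket counts. The bucket values are made up, the function names are hypothetical, and the linear interpolation is only my understanding of how histogram_quantile estimates a value inside a bucket, so treat it as an illustration rather than the exact implementation.

```python
def share_le(buckets, threshold):
    """Share of observations at or below `threshold`.

    `buckets` maps each upper bound (the `le` label) to its cumulative count
    and must include float("inf") for the total.
    """
    total = buckets[float("inf")]
    return buckets[threshold] / total if total else float("nan")


def quantile(buckets, q):
    """Estimate the q-quantile by linear interpolation inside one bucket."""
    rank = q * buckets[float("inf")]
    lower_bound, lower_count = 0.0, 0.0
    for upper in sorted(buckets):
        count = buckets[upper]
        if count >= rank:
            if upper == float("inf"):
                return lower_bound  # quantile falls in the open-ended bucket
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper
            width = upper - lower_bound
            return lower_bound + width * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper, count
    return float("nan")


# Hypothetical cumulative counts (made-up numbers), bounds in milliseconds.
buckets = {250: 400, 500: 700, 1000: 930, 2500: 990, float("inf"): 1000}

print(share_le(buckets, 1000) * 100)  # percentage served within 1000ms
print(quantile(buckets, 0.90))        # estimated 90th percentile in ms
```

The two numbers answer the same SLO question from different directions: the share at 1000ms being at least 90% is equivalent to the estimated 90th percentile being at most 1000ms.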

Store machine status on Graphite time-series to later extract KPIs

We have a machine that sends its status value (0, 1, or 2) at irregular intervals, and we store it in Graphite. The status values mean:
0 - stopped
1 - working
2 - stopped by anomaly
The requested KPIs are the classic ones: how much time was spent in status 0, 1, or 2 in a day or a week? Before reinventing the wheel, we're looking for the best way to compute those KPIs, and whether Graphite (or possibly another time-series solution) already has functions that sum the time during which the data point value meets a condition. Clearly the time intervals to sum are not stored; each one is the time elapsed between a data point and the next one.
Or should the data be pre-processed to compute the time intervals and then stored as three data sets such as status.working, status.stopped, and status.alarm, each recording when the specific "event" started and how long it lasted?
There are other KPIs, for example the number of alarms in a day. Receiving two status data points in a row both indicating status "2" is actually a single alarm condition and must count as 1.
So, is there a best way to store such data without pre-processing it? It sounds like a common pattern but (shame on us?) we have not found this topic well explored.
Thanks.
Graphite has a number of functions that could help you here. One that stands out is the summarize() function, to which you can pass an aggregation method (in this case sum) and a duration (in minutes, hours, days, weeks, etc.). Take a look here.
isNonNull is another useful function: it can be used to determine the existence of a datapoint regardless of the value.
When you say that the machine reports a value of 0 to indicate it has stopped, does it actually send that value or does it report nothing? This is an important detail and will have some bearing on the end result of your solution.
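If pre-processing outside Graphite turns out to be necessary, the two KPIs from the question could be computed along these lines. This is only a Python sketch with made-up data and hypothetical function names; it assumes every status (including 0) is actually sent, and that a status holds until the next data point arrives.

```python
from collections import defaultdict


def time_in_status(points, end_time):
    """Sum the seconds spent in each status.

    `points` is a list of (unix_timestamp, status) sorted by time; each status
    is assumed to hold until the next data point (or `end_time` for the last).
    """
    totals = defaultdict(float)
    following = points[1:] + [(end_time, None)]
    for (ts, status), (next_ts, _) in zip(points, following):
        totals[status] += next_ts - ts
    return dict(totals)


def count_alarms(points, alarm_status=2):
    """Count alarm episodes; consecutive alarm points count as one alarm."""
    alarms = 0
    previous = None
    for _, status in points:
        if status == alarm_status and previous != alarm_status:
            alarms += 1
        previous = status
    return alarms


# Made-up sample: (seconds since the start of the day, status)
points = [(0, 1), (3600, 2), (3900, 2), (4200, 0), (7200, 1), (9000, 2)]
print(time_in_status(points, end_time=10000))
print(count_alarms(points))
```

Two status-2 points in a row (at 3600 and 3900) are counted as a single alarm, as the question requires.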

How can I calculate the appropriate amount of channel capacity?

I am looking for a solution because the sth-channel is full.
I am having trouble calculating the appropriate channel capacity.
This document has the following description.
In order to calculate the appropriate capacity, just have in consideration the following parameters:
・The amount of events to be put into the channel by the sources per unit time (let's say 1 minute).
・The amount of events to be gotten from the channel by the sinks per unit time.
・An estimation of the amount of events that could not be processed per unit time, and thus to be reinjected into the channel (see next section).
How can I check the values of these parameters?
You can't just check these parameters. They depend on your application.
What they are saying is that you should have a size which is large enough so the generator doesn't get stuck. This may not be possible in your application.
Say your generator receives one event per second and it takes 2 seconds for a receiver to handle that event. Now let's assume you have 3 receivers. In 1 second, each receiver can process 0.5 events. You have 3 receivers, so together they are capable of processing 0.5 × 3 = 1.5 events per second, which is more than what you get as input. Your capacity can be 1 or 2; using 2 will greatly increase your chances of not getting blocked.
Let's review another example:
Your generator wants to push 1,000 events per second
Your receivers take 3 seconds to process one event
You would need 1,000 x 3 = 3,000 receivers (3,000 goroutines that can run at full speed in parallel...)
In this example, the total number of receivers is so large that you have to either break up your code to work on multiple computers or optimize your receiver code so it can process the data in an amount of time that makes sense. Say you have 50 processors and your receivers get 1,000 events per second. If all 50 can run at full speed, each receiver needs to do its work in:
50 / 1000 = 0.05 seconds
Now let's assume that in most cases your goroutines take 0.02 seconds, but once in a while one will take 1 second. That means your goroutines can get a little behind. In that case your capacity (so the generator doesn't get blocked) should be a little over 1,000. Again, it will depend on how many of the routines get slowed down, etc. In this last example, a run is 0.02 seconds, so 50 goroutines usually process 1,000 events in about 1,000 × 0.02 ÷ 50 = 0.4 seconds. If you can spread those 1,000 events over the 1 second period, you may not even need the 50 goroutines and could have a smaller capacity. On the other hand, if you have big bursts where you may end up sending many (say 500) events all at once, then more goroutines and a larger capacity are important to avoid getting blocked.
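The arithmetic in these examples boils down to two small formulas, sketched here in Python. The function names are made up, and the sketch assumes each receiver handles one event at a time at a steady rate, which real traffic with bursts will not respect.

```python
import math


def receivers_needed(events_per_second, seconds_per_event):
    """Receivers required to keep up with a steady input rate."""
    return math.ceil(events_per_second * seconds_per_event)


def max_seconds_per_event(receivers, events_per_second):
    """Longest processing time per event that `receivers` can sustain."""
    return receivers / events_per_second


print(receivers_needed(1, 2))           # first example: 1 event/s, 2 s each
print(receivers_needed(1000, 3))        # second example: 1,000 events/s, 3 s each
print(max_seconds_per_event(50, 1000))  # 50 processors, 1,000 events/s
```

Anything above these steady-state numbers is headroom, and the channel capacity is what absorbs bursts and the occasional slow run.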

How do I query Prometheus for number of times a service is down

I am trying to use the up metric to determine the number of times the service was down for less than a minute (potentially a network hiccup) during a time range (or per hour). I am sampling at 5-second intervals.
The best I have so far is up == 0, which gives me a series with points only when the service was down, but I am not sure what to do next.
Any help with this type of query would be greatly appreciated.
Thanks.
You might try the following: calculate the average of the up metric. If the service goes down, the average (over a sliding window of 1 minute) will decrease over time.
If the job comes up again and the average is still greater than 0, then the service wasn't down for more than one minute.
The following query (it works in the Prometheus web console) delivers one data point each time the service comes back up before it has been down for more than one minute.
avg_over_time(up{job="jobname"} [1m]) > 0
AND
irate(up{job="jobname"} [1m]) > 0
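To check what such a query should return, the same count can be done offline on exported samples. This is a Python sketch with made-up data and a hypothetical function name; it assumes samples arrive in time order and treats an outage as lasting from the first 0 sample to the next 1 sample.

```python
def count_short_outages(samples, max_down_seconds=60):
    """Count outages that ended in less than `max_down_seconds`.

    `samples` is a list of (unix_timestamp, up) sorted by time, where up is
    1 or 0. Outages still in progress at the end of the data are not counted.
    """
    short = 0
    down_since = None
    for ts, up in samples:
        if up == 0 and down_since is None:
            down_since = ts  # outage starts
        elif up == 1 and down_since is not None:
            if ts - down_since < max_down_seconds:
                short += 1
            down_since = None  # outage is over
    return short


# Made-up samples every 5 seconds: a 15-second dip, then a 90-second outage.
samples = [(t, 1) for t in range(0, 30, 5)]
samples += [(t, 0) for t in range(30, 45, 5)]
samples += [(t, 1) for t in range(45, 100, 5)]
samples += [(t, 0) for t in range(100, 190, 5)]
samples += [(t, 1) for t in range(190, 220, 5)]
print(count_short_outages(samples))
```

Only the 15-second dip is counted; the 90-second outage is longer than a minute and is ignored.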

How does OpenTSDB downsample data

I have a two-part question regarding downsampling in OpenTSDB.
The first: does anyone know whether OpenTSDB treats the end point of a block as inclusive or exclusive when it downsamples, or whether it counts the end data point twice?
For example, if my time interval is 12:30pm-1:30pm, I get DPs every 5 min starting at 12:29:44pm, and my downsample interval sums every 10-minute block, does the system take the DPs from 12:30-12:39 and sum them, then 12:40-12:49 and sum them, etc., or does it take the DPs from 12:30-12:40, then from 12:40-12:50, etc.? Yes, I know my data is off by 15 sec but I don't control that.
I've tried to calculate it by hand but the data I have isn't helping me. The numbers I'm calculating don't add up to either of the above, nor do they match what the graph is showing. I don't have access to the system that's pushing numbers into OpenTSDB, so I can't set up dummy data to check.
The second question: how does downsampling place its points on the graph given my time range and downsample interval? I set downsample to sum 10 min blocks and my range to 12:30pm-1:30pm. The graph shows the first point of the downsampled graph at 12:35pm. That makes logical sense. I then changed the range to 12:24pm-1:29pm and expected the first point to be at 12:30, but the first point shown is at 12:25pm.
Hopefully someone can answer these questions for me. In the meantime, I'll continue trying to find some data in my system that helps show/prove how downsampling should work.
Thanks in advance for your help.
Downsampling doesn't currently work the way you expect, although since this is a reasonable and commonly held expectation, we are thinking of changing this in a later release of OpenTSDB.
You're assuming that if you ask for a "10 min sum", the data points will be summed up within each "round" (or "aligned") 10 minute block (e.g. 12:30-12:39 then 12:40-12:49 in your example), but that's not what happens. What happens is that the code will start a 10-minute block from whichever data point is the first one it finds. So if the first one is at time 12:29:44, then the code will sum all subsequent data points until 600 seconds later, meaning until 12:39:44.
Within each 600 second block, there may be a varying number of data points. Some blocks may have more data points than others. Some blocks may have unevenly spaced data points, e.g. maybe all the data points are within one second of each other at the beginning of the 600s block. So in order to decide what timestamp will result from the downsampling operation, the code uses the average timestamp of all the data points of the block.
So if all your data points are evenly spaced throughout your 600s block, the average timestamp will fall somewhere in the middle of the block. But if, say, all the data points are within one second of each other at the beginning of the 600s block, then the timestamp returned will reflect that by virtue of being an average. Just to be clear, the code takes an average of the timestamps regardless of what downsampling function you picked (sum, min, max, average, etc.).
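The behaviour described above can be mimicked with a few lines of Python. This is a sketch of the description, not OpenTSDB's actual code; the function name and sample values are made up, and it assumes the end of each block is exclusive (a point exactly 600 seconds after the block's first point starts the next block), which the description does not spell out.

```python
def downsample(points, interval, aggregate=sum):
    """Downsample (timestamp, value) pairs sorted by time.

    Each block starts at the first data point not yet consumed and covers the
    next `interval` seconds. The output timestamp is the average timestamp of
    the points in the block, whatever the aggregation function is.
    """
    result = []
    i = 0
    while i < len(points):
        block_start = points[i][0]
        block = []
        while i < len(points) and points[i][0] < block_start + interval:
            block.append(points[i])
            i += 1
        avg_ts = sum(ts for ts, _ in block) / len(block)
        result.append((avg_ts, aggregate(v for _, v in block)))
    return result


# Made-up data: one point every 300 seconds, timestamps relative to the first.
points = [(0, 10), (300, 20), (600, 30), (900, 40), (1200, 50), (1500, 60)]
print(downsample(points, interval=600))
print(downsample(points, interval=600, aggregate=max))
```

With evenly spaced points, each output timestamp lands in the middle of its block rather than on a round 10-minute boundary.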
If you want to experiment quickly with OpenTSDB without writing to your production system, consider setting up a single-node OpenTSDB instance. It's very easy to do as is shown in the getting started guide.
