Track Events with Prometheus Counters - monitoring

Using Prometheus for things that are per second works really great and I've had great success with rate and irate. I am just at a loss how to graph something that's happening very rarely and is a big deal.
So I have a counter I am incrementing that's called job_failed. Whenever that happens it shows up in my instant-vector. If I graph it directly it always goes up and I see a bump in the graph, but this isn't giving me clear enough indication that a job has failed. So I'd like to have it be a spike in a zeroed graph.
If I do a rate(job_failed[15s]) I get my spike - but it's a per second spike so it's value is 0.1 although the change I want is 1.
I tried increase(job_failed[1m]) but that is also not adding up correctly, occasionally leaving me with values like 2.18 etc.
Is there a way to only see a single spike? This seems like a rather trivial thing but I can't figure it out.

Prometheus is suited more to high volume than low volume events, as at low volumes artifacts from how we keep things accurate on average show up.
So for example rate(job_failed[15s]) with an increase of 1 over the 15 seconds is 1/15 = 0.066/s. Rounding could make that show as 0.1.
https://www.youtube.com/watch?v=67Ulrq6DxwA goes into more detail as to how this all works.
The short version is what you're doing now is the way to do it.

For similar requirement, I was using delta function with threshold configured as per requirement.
https://prometheus.io/docs/querying/functions/#delta

Related

How to avoid data averaging when logging to metric across multiple runs?

I'm trying to log data points for the same metric across multiple runs (wandb.init is called repeatedly in between each data point) and I'm unsure how to avoid the behavior seen in the attached screenshot...
Instead of getting a line chart with multiple points, I'm getting a single data point with associated statistics. In the attached e.g., the 1st data point was generated at step 1,470 and the 2nd at step 2,940...rather than seeing two points, I'm instead getting a single point that's the average and appears at step 2,205.
My hunch is that using the resume run feature may address my problem, but even testing out this hunch is proving to be cumbersome given the constraints of the system I'm working with...
Before I invest more time in my hypothesized solution, could someone confirm that the behavior I'm seeing is, indeed, the result of logging data to the same metric across separate runs without using the resume feature?
If this is the case, can you confirm or deny my conception of how to use resume?
Initial run:
run = wandb.init()
wandb_id = run.id
cache wandb_id for successive runs
Successive run:
retrieve wandb_id from cache
wandb.init(id=wandb_id, resume="must")
Is it also acceptable / preferable to replace 1. and 2. of the initial run with:
wandb_id = wandb.util.generate_id()
wandb.init(id=wandb_id)
It looks like you’re grouping runs so that could be why it’s appearing as averaging across step - this might not be the case but it’s worth trying. Turn off grouping by clicking the button in the centre above your runs table on the left - it’s highlighted in purple in the image below.
Both of the ways you’re suggesting resuming runs seem fine.
My hunch is that using the resume run feature may address my problem,
Indeed, providing a cached id in combination with resume="must" fixed the issue.
Corresponding snippet:
import wandb
# wandb run associated with evaluation after first N epochs of training.
wandb_id = wandb.util.generate_id()
wandb.init(id=wandb_id, project="alrichards", name="test-run-3/job-1", group="test-run-3")
wandb.log({"mean_evaluate_loss_epoch": 20}, step=1)
wandb.finish()
# wandb run associated with evaluation after second N epochs of training.
wandb.init(id=wandb_id, resume="must", project="alrichards", name="test-run-3/job-2", group="test-run-3")
wandb.log({"mean_evaluate_loss_epoch": 10}, step=5)
wandb.finish()

How to space out influxdb continuous query execution?

I have many influxdb continuous queries(CQ) used to downsample data over a period of time on several occasions. At one point, the load became high and influxdb went to out of memory at the time of executing continuous queries.
Say I have 10 CQ and all the 10 CQ execute in influxdb at a time. That impacts the memory heavily. I am not sure whether there is any way to evenly space out or have some delay in executing each CQ one by one. My speculation is executing all the CQ at the same time makes a influxdb crash. All the CQ are specified in influxdb config. I hope there may be a way to include time delay between the CQ in the influx config. I didn't know exactly how to include the time delay in the config. One sample CQ:
CREATE CONTINUOUS QUERY "cq_volume_reads" ON "metrics"
BEGIN
SELECT sum(reads) as reads INTO rollup1.tire_volume FROM
"metrics".raw.tier_volume GROUP BY time(10m),*
END
And also I don't know whether this is the best way to resolve the problem. Any thoughts on this approach or suggesting any better approach will be much appreciated. It would be great to get suggestions in using debugging tools for influxdb as well. Thanks!
#Rajan - A few comments:
The canonical documentation for CQs is here. Much of what I'm suggesting is from there.
Are you using back-referencing? I see your example CQ uses GROUP BY time(10m),* - the * wildcard is usually used with backreferences. Otherwise, I don't believe you need to include the * to indicate grouping by all tags - it should already be grouped by all tags.
If you are using backreferences, that runs the CQ for each measurement in the metrics database. This is potentially very many CQ executions at the same time, especially if you have many CQ defined this way.
You can set offsets with GROUP BY time(10m, <offset>) but this also impacts the time interval used for your aggregation function (sum in your example) so if your offset is 1 minute then timestamps will be a sum of data between e.g. 13:11->13:21 instead of 13:10 -> 13:20. This will offset execution but may not work for your downsampling use case. From a signal processing standpoint, a 1 minute offset wouldn't change the validity of the downsampled data, but it might produce unwanted graphical display problems depending on what you are doing. I do suggest trying this option.
Otherwise, you can try to reduce the number of downsampling CQs to reduce memory pressure or downsample on a larger timescale (e.g. 20m) or lastly, increase the hardware resources available to InfluxDB.
For managing memory usage, look at this post. There are not many adjustments in 1.8 but there are some.

Prometheus increase not handling process restarts

I am trying to figure out the behavior of Prometheus' increase() querying function with process restarts.
When there is a process restart within a 2m interval and I query:
sum(increase(my_metric_total[2m]))
I get a value less than expected.
For example, in a simple experiment I mock:
3 lcm_restarts
1 process restart
2 lcm_restarts
All within a 2 minute interval.
Upon querying:
sum(increase(lcm_restarts[2m]))
I receive a value of ~4.5 when I am expecting 5.
lcm_restarts graph
sum(increase(lcm_restarts[2m])) result
Could someone please explain?
Pretty concise and well-prepared first question here. Please keep this spirit!
When working with counters, functions as rate(), irate() and also increase() are adjusting on resets due to restarts. Other than the name suggests, the increase() function does not calculate the absolute increase in the given time frame but is a different way to write rate(metric[interval]) * number_of_seconds_in_interval. The rate() function takes the first and the last measurement in a series and calculates the per-second increase in the given time. This is the reason why you may observe non-integer increases even if you always increase in full numbers as the measurements are almost never exactly at the start and end of the interval.
For more details about this, please have a look at the prometheus docs for the increase() function. There are also some good hints on what and what not to do when working with counters in the robust perception blog.
Having a look at your label dimensions, I also think that counter resets don't apply to your constructed example. There is one label called reason that changed between the restarts and so created a second time series (not continuing the existing one). Here you are also basically summing up the rates of two different time series increases that (for themselves) both have their extrapolation happening.
So basically there isn't really anything wrong what you are doing, you just shouldn't rely on getting highly precise numbers out of prometheus for your use case.
Prometheus may return unexpected results from increase() function due to the following reasons:
Prometheus may return fractional results from increase() over integer counter because of extrapolation. See this issue for details.
Prometheus may return lower than expected results from increase(m[d]) because it doesn't take into account possible counter increase between the last raw sample just before the specified lookbehind window [d] and the first raw sample inside the lookbehind window [d]. See this article and this comment for details.
Prometheus skips the increase for the first sample in a time series. For example, increase() over the following series of samples would return 1 instead of 11: 10 11 11. See these docs for details.
These issues are going to be fixed according to this design doc. In the mean time it is possible to use other Prometheus-like systems such as VictoriaMetrics, which are free from these issues.

Beaglebone Black sampling rate too slow and gives false voltage libpruio

I'm pretty much a noob when it comes to this kind of thing, so if you guys could either help me or direct me to a place to learn what I need to know, I would greatly appreciate it.
Basically my problem is that I am using the libpruio library to continuously sample analog values from the board. 2 things are going wrong here.
The first is that whenever the BB is sampling the voltages, the voltage of the wire that is hooked up to the AIN pin goes up. I've observed this through hooking up an oscilloscope to the same wire the pin is sampling. What I see is that whenever the BB starts sampling, the entire signal (just a sound wave from an amplified mic) is shifted up .8-.9 volts. This is also reflected in the values that I get from the BB, which are around 30,000 (when they should be 0). Hooking the pin up to ground gets me 0, which is correct, and hooking it up to 1.8 volts gets me something like 65520, which is also correct. So maybe it has something to do with the signal being weak?
The second issue is that even though I am receiving values at a rate of like 500khz-900khz, the actual rate seems to be around 11khz. What I mean by this is I only get a new value every 88us, and the rest of the values I get are stay the same as the new value until the next 88us passes, when I get a new value. These times correspond to the voltage shift up, which I mentioned in the previous paragraph. So actually what I see on the oscilloscope is that whenever I sample with the BB, there is a saw wave, with the frequency at the 11khz I was mentioning earlier.
In conclusion, whenever the BB samples, it first increases the voltage at the pin by .9volts, takes a sample of that voltage, and the voltage dies down for the next 88us, all the while the BB spits back the sample it took at the beginning of the period. I do not want this. I want it to not affect the voltage significantly, and take new samples every time the code runs.
As for the code I'm using, it's basically a slightly modified version of the IO_Input example in the libpruio library, with the values being stored in an array for later use instead of being printed immediately.
If you guys need any more information, I will gladly post it here, but for now I'm wondering if it is something super obvious that I'm missing.
Hooking the pin up to ground gets me 0, which is correct, and hooking
it up to 1.8 volts gets me something like 65520, which is also
correct. So maybe it has something to do with the signal being weak?
The BBB and libpruio seem to work OK. Check your wiring.
Regarding the sampling rate, the io_input example uses IO mode. If you need accurate timing for the samples use MM mode or RB mode.
Your target isn't very clear, so I cannot give detailed advices. (Some code also would help to understand what you're trying to do.)
BR

How does OpenTSDB downsample data

I have a 2 part question regarding downsampling on OpenTSDB.
The first is I was wondering if anyone knows whether OpenTSDB takes the last end point inclusive or exclusive when it calculates downsampling, or does it count the end data point twice?
For example, if my time interval is 12:30pm-1:30pm and I get DPs every 5 min starting at 12:29:44pm and my downsample interval is summing every 10 minute block, does the system take the DPs from 12:30-12:39 and summing them, 12:40-12:49 and sum them, etc or does it take the DPs from 12:30-12:40, then from 12:40-12:50, etc. Yes, I know my data is off by 15 sec but I don't control that.
I've tried to calculate it by hand but the data I have isn't helping me. The numbers I'm calculating aren't adding up to the above, nor is it matching what the graph is showing. I don't have access to the system that's pushing numbers into OpenTSDB so I can't setup dummy data to check.
The second question is how does downsampling plot its points on the graph from my time range and downsample interval? I set downsample to sum 10 min blocks. I set my range to be 12:30pm-1:30pm. The graph shows the first point of the downsampled graph to start at 12:35pm. That makes logical sense.I change the range to be 12:24pm-1:29pm and expected the first point to start at 12:30 but the first point shown is 12:25pm.
Hopefully someone can answer these questions for me. In the meantime, I'll continue trying to find some data in my system that helps show/prove how downsampling should work.
Thanks in advance for your help.
Downsampling isn't currently working the way you expect, although since this is a reasonable and commonly made expectations, we are thinking of changing this in a later release of OpenTSDB.
You're assuming that if you ask for a "10 min sum", the data points will be summed up within each "round" (or "aligned") 10 minute block (e.g. 12:30-12:39 then 12:40-12:49 in your example), but that's not what happens. What happens is that the code will start a 10-minute block from whichever data point is the first one it finds. So if the first one is at time 12:29:44, then the code will sum all subsequent data points until 600 seconds later, meaning until 12:39:44.
Within each 600 second block, there may be a varying number of data points. Some blocks may have more data points than others. Some blocks may have unevenly spaced data points, e.g. maybe all the data points are within one second of each other at the beginning of the 600s block. So in order to decide what timestamp will result from the downsampling operation, the code uses the average timestamp of all the data points of the block.
So if all your data points are evenly spaced throughout your 600s block, the average timestamp will fall somewhere in the middle of the block. But if you have, say, all the data points are within one second of each other at the beginning of the 600s block, then the timestamp returned will reflect that by virtue of being an average. Just to be clear, the code takes an average of the timestamps regardless of what downsampling function you picked (sum, min, max, average, etc.).
If you want to experiment quickly with OpenTSDB without writing to your production system, consider setting up a single-node OpenTSDB instance. It's very easy to do as is shown in the getting started guide.

Resources