I'm trying to collect statsd metrics in an influxdb/telegraf/grafana server. What I'm seeing is that there is a continuous stream of entries in influxdb every 10 seconds from telegraf. How can I configure telegraf to only send an update to influxdb whenever it receives a statsd metric over UDP. I don't want a continuously updating value because I want to see the discrete event counts over time periods in grafana.
For example, if I send exactly one counter metric (value=1) at time t0 and no more events for 10 minutes (say), I expect to see exactly one data point for the 10 minute time period I'm aggregating over in Grafana. However, what I see is that every 10s there is an entry in the influxdb telegraph table for the measurement with the value of 1. Grafana would then show me a continuous value of 1 over each 10 minute period. What I really want is that in the 10 minute period where t0 existed, that the value 1 would be shown, whereas in all subsequent time periods (until the next metric, of course), the value would be 0.
How can I achieve that? I see nothing in the telegraf documentation for the statsd plugin that says it will continuously update influxdb with the aggregated value (since the beginning of time) that telegraf has cached.
In telegraf.conf, change the following to true:
[[input.statsd]]
delete_counters = true
Related
I extract statistics from Jenkins with the Prometheus Metrics plugin.
I have created a query in PromQL to check if a Jenkins job build time has increased by 50% from the average successful build time:
default_jenkins_builds_last_build_duration_milliseconds > 1.5 * (avg_over_time(default_jenkins_builds_last_build_duration_milliseconds[180d]) and default_jenkins_builds_last_build_result_ordinal == 0)
However there is a problem with this query. The result is getting diluted over time because the query keeps adding each value from the time-series to the total average result. There may be values that haven't changed but they keep adding.
I expected to create a query that calculates the 'delta' from the current successful build time against the previous one, but there doesn't seem to be a metric that represents the previous build (or I can't find it), so I ended up using the average_over_time.
I have also tried to calculate the delta with the offset modifier by one minute (because the Prometheus scrapes the Jenkins exporter every 1 minute), but the problem there is that some times the time-series returns Nan results and it cant calculate the time delta from each build. I was expecting that in a graph i would see a line with some ups and downs if the build time increased or decreased but NaN values break this graph.
How can this query be refactored in order to yield the expected result ???
I have a SpringBoot application that is under moderate load. I want to collect metric data for a few of the operations of my app. I am majorly interested in Counters and Timers.
I want to count the number of times a method was invoked (# of invocation over a window, for example, #invocation over last 1 day, 1 week, or 1 month)
If the method produces any unexpected result increase failure count and publish a few tags with that metric
I want to time a couple of expensive methods, i.e. I want to see how much time did that method took, and also I want to publish a few tags with metrics to get more context
I have tried StatsD-SignalFx and Micrometer-InfluxDB, but both these solutions have some issues I could not solve
StatsD aggregates the data over flush window and due to aggregation metric tags get messed up. For example, if I send 10 events in a flush window with different tag values, and the StatsD agent aggregates those events and publishes only one event with counter = 10, then I am not sure what tag values it's sending with aggregated data
Micrometer-InfluxDB setup has its own problems, one of them being micrometer sending 0 values for counters if no new metric is produced and in that fake ( 0 value counter) it uses same tag values from last valid (non zero counter)
I am not sure how, but Micrometer also does some sort of aggregation at the client-side in MeterRegistry I believe, because I was getting a few counters with a value of 0.5 in InfluxDB
Next, I am planning to explore Micrometer/StatsD + Telegraf + Influx + Grafana to see if it suits my use case.
Questions:
How to avoid metric aggregation till it reaches the data store (InfluxDB). I can do the required aggregation in Grafana
Is there any standard solution to the problem that I am trying to solve?
Any other suggestion or direction for my use case?
If my interpretation is correct, according to the documentation provided here:InfluxDB Downsampling when we down-sample data using a Continuous Query running every 30 minutes, it runs only for the previous 30 minutes data.
Relevant part of the document:
Use the CREATE CONTINUOUS QUERY statement to generate a CQ:
CREATE CONTINUOUS QUERY "cq_30m" ON "food_data" BEGIN
SELECT mean("website") AS "mean_website",mean("phone") AS "mean_phone"
INTO "a_year"."downsampled_orders"
FROM "orders"
GROUP BY time(30m)
END
That query creates a CQ called cq_30m in the database food_data.
cq_30m tells InfluxDB to calculate the 30-minute average of the two
fields website and phone in the measurement orders and in the DEFAULT
RP two_hours. It also tells InfluxDB to write those results to the
measurement downsampled_orders in the retention policy a_year with the
field keys mean_website and mean_phone. InfluxDB will run this query
every 30 minutes for the previous 30 minutes.
When I create a Continuous Query it actually runs on the entire dataset, and not on the previous 30 minutes. My question is, does this happen only the first time after which it runs on the previous 30 minutes of data instead of the entire dataset?
I understand that the query itself uses GROUP BY time(30m) which means it'll return all data grouped together but does this also hold true for the Continuous Query? If so, should I then include a filter to only process the last 30 minutes of data in the Continuous Query?
What you have described is expected functionality.
Schedule and coverage
Continuous queries operate on real-time data. They use the local server’s timestamp, the GROUP BY time() interval, and InfluxDB database’s preset time boundaries to determine when to execute and what time range to cover in the query.
CQs execute at the same interval as the cq_query’s GROUP BY time() interval, and they run at the start of the InfluxDB database’s preset time boundaries. If the GROUP BY time() interval is one hour, the CQ executes at the start of every hour.
When the CQ executes, it runs a single query for the time range between now() and now() minus the GROUP BY time() interval. If the GROUP BY time() interval is one hour and the current time is 17:00, the query’s time range is between 16:00 and 16:59.999999999.
So it should only process the last 30 minutes.
Its a good point about the first run.
I did manage to find a snippet from an old document
Backfilling Data
In the event that the source time series already has data in it when you create a new downsampled continuous query, InfluxDB will go back in time and calculate the values for all intervals up to the present. The continuous query will then continue running in the background for all current and future intervals.
https://influxdbcom.readthedocs.io/en/latest/content/docs/v0.8/api/continuous_queries/#backfilling-data
Which would explain the behaviour you have found
I'm sending metrics in StatsD format to Telegraf, which forwards them to InfluxDB 0.9.
I'm measuring execution times (of some event) from multiple hosts. The measurement is called "execTime", and the tag is "host". Once Telegraf gets these numbers, it calculates mean/upper/lower/count, and stores them in separate measurements.
Sample data looks like this in influxdb:
TIME...FIELD..............HOST..........VALUE
t1.....execTime.count.....VM1...........3
t1.....execTime.mean......VM1...........15
t1.....execTime.count.....VM2...........6
t1.....execTime.mean......VM2...........22
(So at time t1, there were 3 events on VM1, with mean execution time 15ms, and on VM2 there were 6 events, and the mean execution time was 22ms)
Now I want to calculate the mean of the operation execution time across both hosts at time t1. Which is (3*15 + 6*22)/(3+6) ms.
But since the count and mean values are in two different series, I can't simply use "select mean(value) from execTime.mean"
Do I need to change my schema, or can I do this with the current setup?
What I need is essentially a new series, which is a combination of the execTime.count and execTime.mean across all hosts. Instead of calculating this on-the-fly, the best approach seems to be to actually create the series along with the others.
So now I have two timer stats being generated on each host for each event:
1. one event with actual hostname for the 'host' tag
2. second event with one tag "host=all"
I can use the first set of series to check mean execution times per host. And the second series gives me the mean time for all hosts combined.
It is possible to do mathematical operations on fields from two different series, provided both series are members of the same measurement. I suspect your schema is non-optimized for your use case.
For my case, I need to capture 15 performance metrics for devices and save it to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way. Here I only show one as an example
new Serie.Builder("perfmetric1")
.columns("time", "value", "id", "type")
.values(getTime(), getPerf1(), getId(), getType())
.build()
Writing data is fast and easy. But I saw bad performance when I run query. I'm trying to get all 15 metric values for the last one hour.
select value from perfmetric1, perfmetric2, ..., permetric15
where id='testdeviceid' and time > now() - 1h
For an hour, each metric has 120 data points, in total it's 1800 data points. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's idle.
I believe InfluxDB can do better. Is this a problem of my schema design, or is it something else? Would splitting the query into 15 parallel calls go faster?
As #valentin answer says, you need to build an index for the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., permetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on id column. Seems that he engine uses full scan on table to retrieve data. By splitting your query in 15 threads, the engine will use 15 full scans and the performance will be much worse.