How to monitor a systemd service using Telegraf?

I created a systemd service that runs on our system, and I want to monitor it with the Telegraf agent that is already installed on the instance.
The agent currently monitors the basic infrastructure metrics, and I need to add monitoring for the new service.
I couldn't find any example of how to do this, which is strange; I would expect Telegraf to have a plugin for something that basic.
My service runs a Python script that doesn't expose any port, so I can't do a normal HTTP health check.
Any help would be appreciated.

So I found that there is indeed a plugin that monitors systemd services.
Its name is systemd_units.
This is the configuration I've implemented:
# Gather systemd units state
[[inputs.systemd_units]]
## Set timeout for systemctl execution
timeout = "1s"
# Filter for a specific unit type, default is "service", other possible
# values are "socket", "target", "device", "mount", "automount", "swap",
# "timer", "path", "slice" and "scope ":
unittype = "service"
# Filter for a specific pattern, default is "" (i.e. all), other possible
# values are valid pattern for systemctl, e.g. "a*" for all units with
# names starting with "a"
pattern = ""
## pattern = "telegraf* influxdb*"
## pattern = "a*"
After getting the metrics into InfluxDB, this is the query I used to extract the data I needed:
from(bucket: "veeva")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_field"] == "active_code")
|> filter(fn: (r) => r["_measurement"] == "systemd_units")
|> filter(fn: (r) => r["active"] == "active")
|> filter(fn: (r) => r["host"] == "10.192.21.66")
|> filter(fn: (r) => r["name"] == "myservice.service")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> yield(name: "mean")
And this is how it looks in Grafana: (screenshot omitted)
Plugin documentation: https://docs.influxdata.com/telegraf/v1.22/plugins/#systemd_timings
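If you only want the service's current state rather than a windowed mean, a minimal variant of the query above (same bucket, field, and tag names) would be:
from(bucket: "veeva")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "systemd_units")
|> filter(fn: (r) => r["_field"] == "active_code")
|> filter(fn: (r) => r["name"] == "myservice.service")
|> last()
The plugin encodes the unit's state numerically in active_code (0 meaning active), alongside the active tag that the original query filters on.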

Related

Take the median of a grouped set

I am quite new to Flux and want to solve an issue:
I have a bucket containing measurements, which are generated by a worker service.
Each measurement belongs to a site and has an identifier (uuid). Each measurement contains three measurement points, each holding a value.
What I want to achieve is the following: create a graph/list/table of measurements for a specific site and aggregate the median value of each of the three measurement points per measurement.
TL;DR:
Get all measurement points that belong to the specific site uuid.
As each measurement has a uuid and contains three measurement points, group by measurement and take the median for each measurement.
Return a result that only contains the median value for each measurement.
This does not work:
from(bucket: "test")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "lighthouse")
|> filter(fn: (r) => r["_field"] == "speedindex")
|> filter(fn: (r) => r["site"] == "1d1a13a3-bb07-3447-a3b7-d8ffcae74045")
|> group(columns: ["measurement"])
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> yield(name: "mean")
This does not throw an error, but of course it does not take the median of the specific groups.
If I understand your question correctly, you want a single number per group to be returned.
In that case you'll want to use an aggregate function such as |> mean() (a median() variant follows the explanation below):
from(bucket: "test")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "lighthouse")
|> filter(fn: (r) => r["_field"] == "speedindex")
|> filter(fn: (r) => r["site"] == "1d1a13a3-bb07-3447-a3b7-d8ffcae74045")
|> group(columns: ["measurement"])
|> mean()
|> yield(name: "mean")
The aggregateWindow function aggregates your values over (multiple) windows of time. The script you posted computes the mean over each v.windowPeriod (in this case 20 minutes).
I am not entirely sure what v.windowPeriod represents, but I usually use time literals for all times (including start and stop); I find it easier to understand how the query relates to the result that way.
On a side note: the yield function only names your result and allows you to have multiple returning queries; it does not compute anything.
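Since the question asks for the median specifically: Flux has a built-in median() aggregate, so a minimal variant of the answer above (same schema assumed) would be:
from(bucket: "test")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "lighthouse")
|> filter(fn: (r) => r["_field"] == "speedindex")
|> filter(fn: (r) => r["site"] == "1d1a13a3-bb07-3447-a3b7-d8ffcae74045")
|> group(columns: ["measurement"])
|> median()
|> yield(name: "median")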

InfluxDB: apply function on a 5 minute slice of data to generate the output stream

I'm a beginner with InfluxDB, trying to solve the following problem:
I have two streams of data from two temperature sensors. I need to get a stream of correlations, one for each 5-minute slice, i.e. an output stream with 36 values in this case (the last 3 hours) that I can plot on a graph.
My script (that I tried in the script editor) is:
t1 = from(bucket: "sensor1")
|> range(start: -3h)
|> filter(fn: (r) => r["_measurement"] == "temp1" and r["_field"] == "avg")
t2 = from(bucket: "sensor2")
|> range(start: -3h)
|> filter(fn: (r) => r["_measurement"] == "temp2" and r["_field"] == "avg")
t3 = cov(x: t1, y: t2, on: ["_time"], pearsonr: true)
|> yield(name: "cov")
If I execute the above (in the InfluxDB 2.4 script editor), I get the Pearson correlation calculated over the whole range (a single value).
I tried to figure out the windowing syntax by using the query builder and then switching to the script editor to see the generated code, but I failed.
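One approach that may work, sketched under the assumption that cov() joins matching windows when both inputs are windowed identically: window both streams into 5-minute tables before correlating, then un-window the result for plotting.
t1 = from(bucket: "sensor1")
|> range(start: -3h)
|> filter(fn: (r) => r["_measurement"] == "temp1" and r["_field"] == "avg")
|> window(every: 5m)
t2 = from(bucket: "sensor2")
|> range(start: -3h)
|> filter(fn: (r) => r["_measurement"] == "temp2" and r["_field"] == "avg")
|> window(every: 5m)
cov(x: t1, y: t2, on: ["_time"], pearsonr: true)
// Give each per-window result a plottable timestamp, then merge the windows back into one table
|> duplicate(column: "_stop", as: "_time")
|> window(every: inf)
|> yield(name: "cov")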

Difference in performance or execution between single vs multiple, chained lambdas in Flux

In Influx Flux, is there a technical difference (like in execution or performance) between setting a filter operation in a single statement vs. using multiple, chained statements?
For example, the single statement:
from(bucket: "example-bucket")
|> range(start: -1h)
|> filter(fn: (r) =>
r._measurement == "example-measurement" and
r._field == "example-field" and
r.tag == "example-tag"))
... versus using multiple, chained lambdas:
from(bucket: "example-bucket")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "example-measurement")
|> filter(fn: (r) => r._field == "example-field")
|> filter(fn: (r) => r.tag == "example-tag")
Perhaps both operations are executed equally, but I cannot find anything canonical in the docs, although the examples seem to prefer the first form.
I understand that a logical OR can't be expressed by chaining separate filter calls as in the second case. Let's assume for this question it's all AND.
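For completeness, a small sketch of that OR caveat: chained filter calls intersect their results, so an OR between values of the same column must stay inside one predicate (the second field name here is hypothetical):
from(bucket: "example-bucket")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "example-measurement")
// "other-field" is a made-up name; two chained filters on _field would match nothing
|> filter(fn: (r) => r._field == "example-field" or r._field == "other-field")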

Grouping by increasing stateDuration resets using Flux in InfluxDB

I am recording the period between application heartbeats into InfluxDB.
The "target" period is 2000ms.
If the period is above 2750ms, then it is defined as a "lag event".
My end objective is to run statistics on "how long" we are running without lag events.
I switched to Flux from InfluxQL so that I could use the stateDuration() function.
Using the query below, I am able to collect the increasing durations. At lag events, state_duration is reset to -1.
from (bucket: "sampledb/autogen")
|> range(start: -1h)
|> filter(fn: (r) =>
r._measurement == "timers" and
r._field == "HeartbeatMs" and
r.character == "Tarek"
)
|> stateDuration(fn: (r) =>
r._value<=2750,
column: "state_duration",
unit: 1s
)
|> keep(columns: ["_time","state_duration"])
At this point, I would like to be able to collect max(state_duration) for each period between lag events, and this is where I get stuck: trying to "group by every new stateDuration sequence" / "group by increasing stateDurations"...
I was thinking that it might be possible to use reduce() or map() to inject a sequence number that I can use to group by, somehow increasing that sequence number whenever there is a -1 in state_duration.
A graph of state_duration from the Flux query shows a sawtooth pattern (image omitted); I am basically trying to capture the value at the top of each peak.
Any help is appreciated, including doing this e.g. in InfluxQL or with Continuous Queries.
The data looks like this when exported to CSV:
"time","timers.HeartbeatMs","timers.character"
"2021-01-12T14:49:34.000+01:00","2717","Tarek"
"2021-01-12T14:49:36.000+01:00","1282","Tarek"
"2021-01-12T14:49:38.000+01:00","2015","Tarek"
"2021-01-12T14:49:40.000+01:00","1984","Tarek"
"2021-01-12T14:49:42.000+01:00","2140","Tarek"
"2021-01-12T14:49:44.000+01:00","1937","Tarek"
"2021-01-12T14:49:46.000+01:00","2405","Tarek"
"2021-01-12T14:49:48.000+01:00","2312","Tarek"
"2021-01-12T14:49:50.000+01:00","1453","Tarek"
"2021-01-12T14:49:52.000+01:00","1890","Tarek"
"2021-01-12T14:49:54.000+01:00","2077","Tarek"
"2021-01-12T14:49:56.000+01:00","2250","Tarek"
"2021-01-12T14:49:59.000+01:00","2360","Tarek"
"2021-01-12T14:50:00.000+01:00","1453","Tarek"
"2021-01-12T14:50:02.000+01:00","1952","Tarek"
"2021-01-12T14:50:04.000+01:00","2108","Tarek"
"2021-01-12T14:50:06.000+01:00","2485","Tarek"
"2021-01-12T14:50:08.000+01:00","1437","Tarek"
"2021-01-12T14:50:10.000+01:00","2421","Tarek"
"2021-01-12T14:50:12.000+01:00","1483","Tarek"
"2021-01-12T14:50:14.000+01:00","2344","Tarek"
"2021-01-12T14:50:17.000+01:00","2437","Tarek"
"2021-01-12T14:50:18.000+01:00","1092","Tarek"
"2021-01-12T14:50:20.000+01:00","1969","Tarek"
"2021-01-12T14:50:22.000+01:00","2359","Tarek"
"2021-01-12T14:50:24.000+01:00","2140","Tarek"
"2021-01-12T14:50:27.000+01:00","2421","Tarek"
There are two ways I can think of. One is to look for the inverted state. The other is to use elapsed() to find the interval, plus timeShift() to emulate LAG().
I don't like the latter, though the first is not intuitive either :-(. I really hope features like LAG() or a current-record index become available in Flux.
from(bucket: "sampledb/autogen")
|> range(start: -1h)
|> filter(fn: (r) =>
r._measurement == "timers" and
r._field == "HeartbeatMs" and
r.character == "Tarek"
)
|> stateDuration(fn: (r) =>
r._value > 2750, // Look for the inverted state: lag events rather than normal heartbeats
column: "inverted_state_duration",
unit: 1s
)
|> keep(columns: ["_time", "inverted_state_duration"])
// Keep only the lag-event records, clearing out the lag-free periods you are after
|> filter(fn: (r) => r["inverted_state_duration"] != -1)
// The gap between consecutive lag events is one lag-free duration; calculate it with elapsed()
|> elapsed(unit: 1s, columnName: "state_duration")
// Drop consecutive lag events from the same run; the threshold must exceed
// max(stateDuration unit, record interval), so with ~2s heartbeats 3 seconds is a reasonable guess
|> filter(fn: (r) => r["state_duration"] > 3)

Why, after using the InfluxDB v2.0 join() function, can't it write to a bucket?

Here is my Flux script. When I run it, there is no error, but there is no data in bucket "output-test-3", while there is data in bucket "output-test-4" :(
I have been troubled by this problem for a long time. Can anyone solve it?
option task = {name: "join-test-1", every: 5m, offset: 5s}
max_connections = from(bucket: "Node-exporter")
|> range(start: -task.every)
|> filter(fn: (r) =>
(r["_measurement"] == "go_info"))
|> last()
|> to(bucket: "output-test-4")
used_connections = from(bucket: "Node-exporter")
|> range(start: -task.every)
|> filter(fn: (r) =>
(r["_measurement"] == "go_goroutines"))
|> last()
|> to(bucket: "output-test-4")
a = join(tables: {max_connections: max_connections, used_connections: used_connections}, on:
["_time", "_start", "_measurement", "_stop", "_field"])
|> to(bucket: "output-test-3")
When you use the join() function to connect two queries a and b, the columns _field, _measurement, and _value are automatically renamed to _field_a, _field_b, _value_a, _value_b, and so on. When InfluxDB writes to a bucket, the three columns _field, _measurement, and _value must be present, but because of the renaming above those columns have disappeared, so to() writes nothing. The easiest way to solve this is to use the map() function to recreate these three columns. Their content can be whatever you specify; just avoid relying on those placeholder values when you later consume the data.
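A minimal sketch of that fix, assuming join() suffixes the conflicting columns with the table keys used above (the recreated _measurement, _field, and the ratio are illustrative placeholders, not from the original post):
a = join(
tables: {max_connections: max_connections, used_connections: used_connections},
// Join on time columns only; _measurement and _field differ between the two streams
on: ["_time", "_start", "_stop"]
)
|> map(fn: (r) => ({r with
// Recreate the three columns that to() needs
_measurement: "connections",
_field: "used_per_max",
_value: r._value_used_connections / r._value_max_connections
}))
|> to(bucket: "output-test-3")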
