interval vs flush_interval in telegraf - influxdb

I have the following Telegraf configuration:
[agent]
interval = "5s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "5s"
flush_jitter = "0s"
precision = ""
debug = false
quiet = false
logfile = ""
hostname = "$HOSTNAME"
omit_hostname = false
[[outputs.influxdb]]
urls = ["http://influxdb:8086"]
database = "telegraf"
username = ""
password = ""
retention_policy = ""
write_consistency = "any"
timeout = "5s"
[[inputs.docker]]
endpoint = "unix:///var/run/docker.sock"
container_names = []
timeout = "5s"
perdevice = true
total = false
[[inputs.cpu]]
[[inputs.system]]
[[inputs.influxdb]]
urls = ["http://influxdb:8086/debug/vars"]
[[inputs.syslog]]
# ## Specify an ip or hostname with port - eg., tcp://localhost:6514, tcp://10.0.0.1:6514
# ## Protocol, address and port to host the syslog receiver.
# ## If no host is specified, then localhost is used.
# ## If no port is specified, 6514 is used (RFC5425#section-4.1).
server = "tcp://localhost:6514"
[[inputs.socket_listener]]
# ## URL to listen on
service_address = "udp4://:8094"
data_format = "influx"
I need to get data into my InfluxDB as fast as possible. I understand this has something to do with the interval and flush_interval settings (here is what I've been reading), but I can't figure out what the difference between interval and flush_interval is. Would someone be able to help me out?

I've asked the maintainer of Telegraf to provide an answer to this question. Here it is:
You can think of inputs as falling into two categories: polling
(regular) and event-driven (service inputs). The interval is the
frequency at which polling input plugins look for data. Most of the
event-driven plugins do not use the interval, although the statsd
plugin is a notable exception.
All collected data has a timestamp; this means that regardless of how
long it takes to send the data, it will carry the timestamp of when it
was sampled.
The flush_interval is how frequently the outputs write data. This is
the longest you will have to wait under normal circumstances for the
data to be written. The metric_batch_size setting also comes into play
here: if this many new metrics are collected, the output flushes
immediately. In other words, Telegraf triggers a write when either
metric_batch_size new metrics have been collected or flush_interval
has elapsed, whichever comes first.
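For the original goal of getting data into InfluxDB as fast as possible, a minimal sketch of the relevant [agent] settings could look like the following; the specific values are illustrative assumptions, not recommendations, and flush_interval should not be set below interval:
[agent]
## polling inputs (cpu, system, docker, ...) collect every second
interval = "1s"
## outputs write their buffered metrics at least once per second
flush_interval = "1s"
## a write is also triggered early once this many new metrics have accumulated
metric_batch_size = 1000
## metrics kept in memory if the output is slow or unreachable
metric_buffer_limit = 10000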

Related

Influxdb Json (MQTT) rename field with wildcards

I need help with an issue I am having sending data from my Zigbee thermometers to InfluxDB via Telegraf.
This is the path:
Zigbee Sonoff SNZB-02 --> Tasmota ZBBridge --> MQTT --> Telegraf --> InfluxDB
The ID of the Zigbee thermometer, 0x4EF9, might change, since it is randomly assigned to the device. In Tasmota I am able to assign a "friendly name", in this case ZB_Sonoff_Temp01.
With simple Tasmota devices I have no issues: each device has its own entry in the MQTT topic and Telegraf plays nicely with those.
My issue is with the data from the Zigbee bridge, since it uses a single topic and the output in InfluxDB is a bit difficult to work with.
Example MQTT message for Zigbee thermometer:
tele/tasmota_ABDCEF/ZB_Sonoff_Temp01/SENSOR {"ZbReceived":{"0x4EF9":{"Device":"0x4EF9","Name":"ZB_Sonoff_Temp01","Humidity":94.84,"Endpoint":1,"LinkQuality":34}}}
The data is in JSON format, as you can see.
In Telegraf I am using mqtt_consumer; here is the config:
/etc/telegraf/telegraf.d/mqtt.conf
[[inputs.mqtt_consumer]]
servers = ["tcp://192.168.10.10:1883"]
## Topics that will be subscribed to.
topics = [
"tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR"
]
qos = 0
connection_timeout = "30s"
username = "user"
password = "password"
data_format = "json"
[[outputs.influxdb]]
urls = ["http://localhost:8086"]
database = "test_temp"
# skip_database_creation = true
and this is my /etc/telegraf/telegraf.conf:
[global_tags]
[agent]
logfile = "/var/log/telegraf/telegraf.log"
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
hostname = ""
omit_hostname = false
This is the data in InfluxDB:
time ZbReceived_0x4EF9_Endpoint ZbReceived_0x4EF9_Humidity ZbReceived_0x4EF9_LinkQuality ZbReceived_0x4EF9_Temperature host topic
---- -------------------------- -------------------------- ----------------------------- ----------------------------- ---- -----
2021-12-24T15:43:55.26962955Z 1 99.99 26 32.24 influxdb-test tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR
2021-12-24T15:43:55.560162845Z 1 21 25.18 influxdb-test tele/tasmota_ABDCEF/ZB_Sonoff_Temp01/SENSOR
which could be OK, since I am able to choose the device via the "topic" field, but the problem is that the field name "ZbReceived_0x4EF9_Temperature" is not stable if the device gets a new ID when re-associating with Zigbee, which might happen.
The workaround I found is to add a rename for the fields:
[[processors.rename]]
[[processors.rename.replace]]
field = "ZbReceived_0x4EF9_Temperature"
dest = "Temperature"
[[processors.rename.replace]]
field = "ZbReceived_0x4EF9_Humidity"
dest = "Humidity"
[[processors.rename.replace]]
field = "ZbReceived_0x4EF9_Endpoint"
dest = "Endpoint"
[[processors.rename.replace]]
field = "ZbReceived_0x4EF9_LinkQuality"
dest = "LinkQuality"
which renames the fields the way I want (Humidity is missing here, but it is not always pushed, so that's OK; I am dropping the database between changes):
time Endpoint LinkQuality Temperature host topic
---- -------- ----------- ----------- ---- -----
2021-12-24T15:47:09.992947108Z 1 21 23 influxdb-test tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR
2021-12-24T15:47:25.868416967Z 1 13 27.06 influxdb-test tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR
I don't like this solution very much since it hardcodes the device IDs in the Telegraf config, so whenever I add or change a sensor I need to edit Telegraf.
What I would like to find is a wildcard or some other method to rename the fields independently of the device ID, like:
[[processors.rename]]
[[processors.rename.replace]]
field = "*_Temperature"
dest = "Temperature"
but I am not able to find it; I have read all the docs for the processors (including strings) but could not find a way to achieve that.
Do you have any tip that could help me?
Thank you very much and happy holidays!
I needed to use processors.regex.field_rename, but I was running an older version of Telegraf (1.20.2-1), where field_rename was not yet available, so I upgraded to the latest (1.21.1-1).
The regex I used:
[[processors.regex]]
[[processors.regex.field_rename]]
pattern = '(^ZbReceived_)\w+_'
replacement = "${2}"
the result:
time Endpoint Humidity LinkQuality Temperature host topic
---- -------- -------- ----------- ----------- ---- -----
2021-12-24T23:51:48.289677669Z 1 0 25.26 influxdb-test tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR
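For reference, since the pattern above defines only one capture group, the ${2} reference expands to an empty string, which is what strips the ZbReceived_<id>_ prefix. An equivalent, untested sketch with an explicit capture group (same assumptions about the field naming) would be:
[[processors.regex]]
[[processors.regex.field_rename]]
## keep only what follows the "ZbReceived_<device id>_" prefix
pattern = '^ZbReceived_\w+_(.+)$'
replacement = "${1}"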

Can we nicely convert MQTT topic descriptions into InfluxDB tags, using the mqtt_consumer Telegraf input plugin?

I am publishing data to MQTT topics with the Mosquitto broker, and am trying to pipe the data points through to my InfluxDB database. Currently I am trying to do this using the mqtt_consumer input plugin for Telegraf.
A simplified version of my telegraf.conf file looks like this:
# Global Agent Configuration
[agent]
interval = "1m"
# Input Plugins
[[inputs.mqtt_consumer]]
name_prefix = "sensors_"
servers = ["localhost:1883"]
qos = 0
connection_timeout = "30s"
topics = [
"/inside/kitchen/temperature"
]
persistent_session = false
data_format = "value"
data_type = "float"
# Output Plugins
[[outputs.influxdb]]
database = "telegraf"
urls = [ "http://localhost:8086" ]
Now, I can manually publish a data point using MQTT via the following:
~ $ mosquitto_pub -d -t /inside/kitchen/temperature -m 23.5
where I have used the MQTT topic /inside/kitchen/temperature and a value of 23.5.
Examining my InfluxDB database, I see the following data points:
name: sensors_mqtt_consumer
time topic value
---- ----- -----
2020-06-27T20:08:50 /inside/kitchen/temperature 23.5
2020-06-27T20:08:40 /inside/kitchen/temperature 23.5
Is there any way that I can use the MQTT topic name description to properly allocate Influx tags? For example, I would like something like:
name: temperature
time location room value
---- ----- ----- -----
2020-06-27T20:08:50 inside kitchen 23.5
2020-06-27T20:08:40 inside kitchen 23.5
Is there a way to do this with the InfluxDB/Mosquitto/Telegraf configurations? Later I will be adding more topics (locations, etc) and measurements (humidity, voltage, etc), so the solution should allow for this.
(I know that this can be done by by choosing data_format = "influx" as described here, where Telegraf interprets the message as InfluxDB line protocol and passes it through directly. However, then I would have to publish to the topic in this way:
mosquitto_pub -d -t /inside/kitchen/temperature -m "temperature,location=inside,room=kitchen value=23.5"
where you can see that most of the information has been input twice, even though it already existed. The same can be said for the data_format="json" option. What I need is more of a mapping).
I found a comprehensive answer in the Telegraf repo's issue threads; it is copied here for convenience. Since I had control over the message source, I used the last option.
You have a few options; the example configs below are untested and might need tweaking.
You can have multiple plugins, one for each type, and use name_override to vary the measurement name:
[[inputs.mqtt_consumer]]
name_override = "my_integer"
topics = ["/XXX"]
data_format = "value"
data_type = "integer"
[[inputs.mqtt_consumer]]
name_override = "my_string"
topics = ["/SSS"]
data_format = "value"
data_type = "string"
You can have a single plugin reading all values as strings, and rename the fields and convert the integers using processors:
[[inputs.mqtt_consumer]]
topics = ["/#"]
data_format = "value"
data_type = "string"
[[processors.rename]]
order = 1
[processors.rename.tagpass]
topic = ["/SSS"]
[[processors.rename.replace]]
field = "value"
dest = "my_string"
[[processors.rename]]
order = 2
[processors.rename.tagpass]
topic = ["/XXX"]
[[processors.rename.replace]]
field = "value"
dest = "my_integer"
[[processors.converter]]
order = 3
[processors.converter.tagpass]
topic = ["/XXX"]
[processors.converter.fields]
integer = ["my_integer"]
Last "option" is to write a format that can be converted automatically on read, such as InfluxDB Line Protocol which will give you the most control over how the data looks. I recommend this if it is possible to modify the incoming data:
[[inputs.mqtt_consumer]]
topics = ["/#"]
data_format = "influx"
I'm new to Telegraf, but I have had some success with something along the lines of:
[[inputs.mqtt_consumer]]
name_override = "sensors"
servers = ["localhost:1883"]
qos = 0
connection_timeout = "30s"
topics = [
"/inside/kitchen/temperature"
]
persistent_session = false
data_format = "value"
data_type = "float"
[[processors.regex]]
namepass = ["sensors"]
# Base topic present
[[processors.regex.tags]]
key = "topic"
pattern = "(.*)/.*/.*/"
replacement = "${1}"
result_key = "location"
[[processors.regex.tags]]
key = "topic"
pattern = ".*/(.*)/.*"
replacement = "${1}"
result_key = "room"
[[processors.regex.tags]]
key = "topic"
pattern = ".*/.*/(.*)"
replacement = "${1}"
result_key = "value"
The regex processor uses Go (RE2) regular-expression syntax.
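If the topics look exactly like the ones in the question (/inside/kitchen/temperature: a leading slash and three segments, no base topic), anchored patterns make it unambiguous which segment lands in which tag. This is an untested sketch under that assumption:
[[processors.regex]]
namepass = ["sensors"]
## topic is assumed to be "/<location>/<room>/<measurement>"
[[processors.regex.tags]]
key = "topic"
pattern = "^/([^/]+)/[^/]+/[^/]+$"
replacement = "${1}"
result_key = "location"
[[processors.regex.tags]]
key = "topic"
pattern = "^/[^/]+/([^/]+)/[^/]+$"
replacement = "${1}"
result_key = "room"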

twitter source to hive sink using flume

I am trying to connect a Twitter source to a Hive sink using Flume.
My properties file is given below:
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = k1
# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
#TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxx
TwitterAgent.sources.Twitter.keywords = kafka, flume, hadoop, hive
# Describing/Configuring the sink
TwitterAgent.channels = MemChannel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.sinks = k1
TwitterAgent.sinks.k1.type = hive
TwitterAgent.sinks.k1.channel = MemChannel
TwitterAgent.sinks.k1.hive.metastore = thrift://xxxx:9083
TwitterAgent.sinks.k1.hive.database = sample
TwitterAgent.sinks.k1.hive.table = tweets_twitter
#TwitterAgent.sinks.k1.hive.partition = user_location
TwitterAgent.sinks.k1.useLocalTimeStamp = false
TwitterAgent.sinks.k1.round = true
TwitterAgent.sinks.k1.roundValue = 10
TwitterAgent.sinks.k1.roundUnit = minute
TwitterAgent.sinks.k1.serializer = DELIMITED
TwitterAgent.sinks.k1.serializer.delimiter = "\t"
TwitterAgent.sinks.k1.serializer.serdeSeparator = '\t'
#TwitterAgent.sinks.k1.serializer.fieldnames =user_friends_count,user_location,user_email
# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 100000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
TwitterAgent.channels.MemChannel.byteCapacity = 6912212
# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.k1.channel = MemChannel
I am not creating any database or table in Hive here. Do I need to create the database name, table name, partition column, and field names before starting the agent?
If so, where should I get the schema of the Twitter streaming data?
I am starting the Flume agent using the command below:
bin/flume-ng agent --conf ./conf/ -f conf/twitter_hive.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent --classpath "/usr/hdp/2.6.3.0-235/hive-hcatalog/share/hcatalog/*":"/usr/hdp/2.6.3.0-235/hive/lib/*"
Where should I get the schema of the Twitter data to create the Hive tables that have to be mentioned in the twitter_hive.conf properties file?
HiveSink was introduced in Flume 1.6 and, as per the documentation, yes: the metastore, database name, and table name are mandatory. The partition, however, is optional, as Flume can create missing partitions.
As for the schema for Twitter, it seems to be a problem others have also faced, and I found this link quite useful (you may have already come across it). It mentions some of the data structures available in Hive that you may need in order to work with data arriving in JSON format. You may need to alter some bits and pieces for your scenario, but this should give you a good start.
I hope this helps.
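The Hive streaming API that HiveSink relies on only writes into bucketed, ORC-backed, transactional tables, so the table has to exist in that form before the agent starts. A rough, untested sketch (the column names are taken from the commented-out fieldnames line purely as an illustration, not as the real Twitter schema):
CREATE DATABASE IF NOT EXISTS sample;
CREATE TABLE sample.tweets_twitter (
  user_friends_count INT,
  user_location STRING,
  user_email STRING
)
CLUSTERED BY (user_location) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
The serializer.fieldnames property in the Flume config would then map the incoming delimited fields onto those columns.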

Telegraf phpfpm not storing all tag measurements to influxdb

I have Telegraf configured and running with -input-filter phpfpm
Input filter configured:
[phpfpm]
urls = ["http://127.0.0.1:8080/fpmstats"]
This URL works and returns correct php-fpm stats:
pool: www
process manager: dynamic
start time: 03/Sep/2016:13:25:25 +0000
start since: 1240
accepted conn: 129
listen queue: 0
max listen queue: 0
listen queue len: 0
idle processes: 2
active processes: 1
total processes: 3
max active processes: 1
max children reached: 0
slow requests: 0
The Telegraf output is configured for InfluxDB as follows:
[[outputs.influxdb]]
urls = ["udp://172.17.0.16:8089"] # Stick to UDP
database = "telegraf"
precision = "s"
retention_policy = "autogen"
write_consistency = "any"
timeout = "5s"
username = "telegraf"
password = "password"
user_agent = "telegraf"
udp_payload = 1024
This is 'almost' working, and data is being received by InfluxDB - but only a couple of the measurements.
SHOW TAG KEYS FROM "phpfpm"
shows only the following tag keys:
host
pool
I expected to see values for accepted conn, listen queue, idle processes, and so on. I cannot see any 'useful' data being posted to InfluxDB.
Am I missing something in terms of where to look for the phpfpm values stored in InfluxDB?
Or is this a configuration problem?
I had a problem getting the HTTP output to work, so I stuck with UDP - is this a bad idea?
Data in InfluxDB is separated into measurements, tags, and fields.
Measurements are a high-level bucketing of data.
Tags are index values.
Fields are the actual data.
The data that you're working with has the measurement phpfpm and two tags host and pool.
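In InfluxDB line protocol, a single phpfpm point would therefore look roughly like this (the tag values and the snake_case field names below are assumptions for illustration, based on the pool status output in the question):
phpfpm,host=myhost,pool=www accepted_conn=129i,idle_processes=2i,active_processes=1i,slow_requests=0i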
I expected to see values for accepted conn, listen queue, idle processes, and so on. I cannot see any 'useful' data being posted to InfluxDB.
The values that you're looking for are most likely fields. To verify that this is the case run the query
SHOW FIELD KEYS FROM "phpfpm"
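Once the field keys are confirmed, the per-pool values can be queried directly; the field names in this example are assumptions based on the status labels above:
SELECT "accepted_conn", "active_processes", "slow_requests" FROM "phpfpm" WHERE "pool" = 'www'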

How to increase the processing rate of flume agent

I have a Flume agent that ingests data into Elasticsearch. The agent uses a spooldir source. There is another agent that writes files into the spooling directory of the Elasticsearch agent.
Over time the number of files grows, and the gap between processed and unprocessed files keeps increasing.
I want to increase the number of events processed by the Flume agent in order to speed up ingestion.
Here is the configuration of the Flume agent:
agent04.sources = s1
agent04.channels = ch1
agent04.channels = memoryChannel
agent04.channels.memoryChannel.type = memory
agent04.channels.memoryChannel.capacity=100000
agent04.channels.memoryChannel.transactionCapacity=1000
agent04.sources.s1.channels = memoryChannel
agent04.sources.s1.type = spooldir
agent04.sources.s1.spoolDir = /DataCollection/Flume_Cleaner_Output/Json_Elastic/
agent04.sources.s1.deserializer.maxLineLength = 100000
agent04.sinks = elasticsearch
agent04.sinks.elasticsearch.channel = memoryChannel
agent04.sinks.elasticsearch.type=org.css.cssElasticsearchSink
agent04.sinks.elasticsearch.batchSize=400
agent04.sinks.elasticsearch.hostNames = elastic-node01.css.org
agent04.sinks.elasticsearch.indexName = all_collections
agent04.sinks.elasticsearch.indexType = live_tweets
agent04.sinks.elasticsearch.indexNameBuilder= org.css.sa.flume.elasticsearch.sink.indexNameBuilder.HeaderValueBasedIndexNameBuilder
agent04.sinks.elasticsearch.clusterName = css_rai_social
agent04.sinks.elasticsearch.serializer = org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer
agent04.sinks.elasticsearch.cache_period_ms=90d
Why are you chaining two Flume agents using the spooldir? That'll be really slow and is a surprising configuration. You're incurring the cost of frequent fsyncs as each batch gets processed.
I recommend you chain them using the Avro sink and Avro source, as sketched below. I would also raise the batch size to at least 1000 (computers really like batches, and Flume is set up to do that).
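A sketch of that chaining, with a hypothetical agent name, host, and port: the producing agent replaces its file-writing sink with an Avro sink, and agent04 replaces its spooldir source with an Avro source listening on the same port.
# on the producing agent (hypothetical name agent01), send events over Avro RPC
agent01.sinks = avroSink
agent01.sinks.avroSink.type = avro
agent01.sinks.avroSink.channel = memoryChannel
agent01.sinks.avroSink.hostname = elastic-ingest-host
agent01.sinks.avroSink.port = 4545
agent01.sinks.avroSink.batch-size = 1000
# on agent04, replace the spooldir source with an Avro source
agent04.sources.s1.type = avro
agent04.sources.s1.bind = 0.0.0.0
agent04.sources.s1.port = 4545
agent04.sources.s1.channels = memoryChannel
# and raise the Elasticsearch sink batch size as suggested
agent04.sinks.elasticsearch.batchSize = 1000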
