Telegraf agent configuration to avoid metric buffer overflow - influxdb

After getting some metric buffer overflow warning messages, I am trying to understand better how the fundamental agent parameters interval, metric_batch_size, metric_buffer_limit and flush_interval impact each other.
Looking at the documentation, these four parameters are defined as:
interval : Default data collection interval for all inputs
metric_batch_size : Telegraf will send metrics to the output in batches of at most metric_batch_size metrics.
metric_buffer_limit : Telegraf will cache metric_buffer_limit metrics for each output, and will flush this buffer on a successful write. This should be a multiple of metric_batch_size and cannot be less than 2 times metric_batch_size .
flush_interval : Default data flushing interval for all outputs. You should not set this below interval . Maximum flush_interval will be flush_interval + flush_jitter
What I understand is that
Data is only written to the output at each flush_interval .
If not all the data can be written, the remaining metrics are stored in the buffer.
What I am not sure I understand is when the buffered data will be written to the output again. Will it be at the next flush_interval occurrence?
Thanks in advance for your help on this!

Answered and solved thanks to Jay_Clifford
Yes. Data will be sent at the next flush interval.
(cf. Telegraf community post)
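For illustration, a minimal [agent] section tying these settings together might look like the following (the values here are just placeholders, not recommendations):
[agent]
## Collect from all inputs every 10s
interval = "10s"
## Attempt to write to each output every 10s (plus up to flush_jitter)
flush_interval = "10s"
flush_jitter = "0s"
## Send metrics to each output in batches of at most 1000
metric_batch_size = 1000
## Buffer up to 10000 metrics per output while writes are failing;
## a multiple of metric_batch_size and at least 2x its value
metric_buffer_limit = 10000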

Related

Flink Checkpoint Failure - Checkpoints time out after 10 mins

We get one or two checkpoint failures while processing data every day. The data volume is low (under 10k records), and our checkpoint interval is set to 2 minutes. (The reason processing is slow is that at the end of the Flink job we need to sink the data to another API endpoint, which takes some time, so the total time is streaming the data + sinking to the external API endpoint.)
The root issue is:
Checkpoints time out after 10 minutes because the data processing takes longer than 10 minutes. We could increase the parallelism to speed up processing, but if the data volume grows we would have to increase the parallelism again, so we don't want to rely on that approach.
Suggested solution:
I saw someone suggest setting a pause between the old and new checkpoint, but my question is: if I set a pause time, will the new checkpoint miss the state changes that happen during the pause?
Aim:
How to avoid this issue and record the correct state that doesn't miss any data?
(Screenshots omitted: a failed checkpoint, where a subtask didn't respond, and a completed checkpoint.)
Thanks
There are several related configuration variables you can set -- such as the checkpoint interval, the pause between checkpoints, and the number of concurrent checkpoints. No combination of these settings will result in data being skipped for checkpointing.
Setting a pause between checkpoints means that Flink won't initiate a new checkpoint until some time has passed since the completion (or failure) of the previous checkpoint -- but this has no effect on the timeout.
Sounds like you should extend the timeout, which you can do like this:
env.getCheckpointConfig().setCheckpointTimeout(n);
where n is measured in milliseconds. See the section of the Flink docs on enabling and configuring checkpointing for more details.
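A rough sketch of how these related settings fit together, assuming env is your StreamExecutionEnvironment (the values are only examples):
env.enableCheckpointing(2 * 60 * 1000);            // trigger a checkpoint every 2 minutes
CheckpointConfig config = env.getCheckpointConfig();
config.setCheckpointTimeout(15 * 60 * 1000);       // allow each checkpoint up to 15 minutes
config.setMinPauseBetweenCheckpoints(60 * 1000);   // wait at least 1 minute after one completes (or fails)
config.setMaxConcurrentCheckpoints(1);             // never run two checkpoints at once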

What is the best way to performance test an SQS consumer to find the max TPS that one host can handle?

I have an SQS consumer running in EventConsumerService that needs to handle up to 3K TPS successfully, sometimes upwards of 20K TPS (or 1.2 million messages per minute). For each message processed, I make a REST call to DataService's TCP VIP. I'm trying to perform a load test to find the max TPS that one host can handle in EventConsumerService without overstraining any of the following:
Request volume on dependencies, DynamoDB storage, etc
CPU utilization in both EventConsumerService and DataService
Network connections per host
IO stats due to overlogging
DLQ size must be minimal; currently I am seeing my DLQ grow to 500K messages due to 500 Service Unavailable exceptions thrown from DataService, so something must be wrong.
Approximate age of oldest message. I do not want a message sitting in the queue for over X minutes.
Fatals and latency of the REST call to DataService
Active threads
This is how I am performing the performance test:
I set up both my consumer and the other service on one host, the reason being I want to understand the load on both services per host.
I use a TPS generator to fill the SQS queue with a million messages
EventConsumerService is already running in production. Once messages started filling the SQS queue, I could immediately see requests being sent to DataService.
Here are the parameters I am tuning to find messagesPolledPerSecond:
messagesPolledPerSecond = (numberOfHosts * numberOfPollers * messageFetchSize) * (1000/(sleepTimeBetweenPollsPerMs+receiveMessageTimePerMs))
messagesInSurge / messagesPolledPerSecond = ageOfOldestMessageSLA
ageOfOldestMessage + settingsUpdatedLatency < latencySLA
The variables for SqsConsumer which I kept constant are:
numberOfHosts = 1
receiveMessageTimePerMs ≈ 60 ms (it's out of my control)
Max thread pool size: 300
The other factors are all in play:
Number of pollers (default 1), I set to 150
Sleep time between polls (default 100 ms), I set to 0 ms
Sleep time when no messages (default 1000 ms), ???
message fetch size (default 1), I set to 10
However, with the above parameters I am seeing a large number of messages sent to the DLQ due to server errors, so clearly I have set the values too high. This testing methodology seems highly inefficient, and I am unable to find the optimal TPS that avoids sending such a tremendous number of messages to the DLQ and driving up the approximate age of the oldest message.
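For reference, plugging the values above into the polling formula (taking receiveMessageTimePerMs to be roughly 60 ms) gives:
messagesPolledPerSecond = (1 * 150 * 10) * (1000 / (0 + 60)) ≈ 25,000
so a single host is polling on the order of 25K messages per second, well above even the 20K TPS peak, which is presumably why DataService is returning 500s.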
Any guidance on how best I should test is appreciated. It'd be very helpful if we could set up a time to chat. PM me directly

How to set the time precision of the telegraf statsd (influxdb)?

I'm using Telegraf with InfluxDB, and in Telegraf I'm using the statsd input plugin.
The statsd_input.conf:
[[inputs.statsd]]
## Address and port to host UDP listener on
service_address = ":8126"
## The following configuration options control when telegraf clears its cache
## of previous values. If set to false, then telegraf will only clear its
## cache when the daemon is restarted.
## Reset gauges every interval (default=true)
delete_gauges = true
## Reset counters every interval (default=true)
delete_counters = true
## Reset sets every interval (default=true)
delete_sets = true
## Reset timings & histograms every interval (default=true)
delete_timings = true
## Percentiles to calculate for timing & histogram stats
percentiles = [90]
## separator to use between elements of a statsd metric
metric_separator = "."
## Parses tags in the datadog statsd format
## http://docs.datadoghq.com/guides/dogstatsd/
parse_data_dog_tags = true
## Statsd data translation templates, more info can be read here:
## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md#graphite
# templates = [
# "cpu.* measurement*"
# ]
## Number of UDP messages allowed to queue up, once filled,
## the statsd server will start dropping packets
allowed_pending_messages = 10000
## Number of timing/histogram values to track per-measurement in the
## calculation of percentiles. Raising this limit increases the accuracy
## of percentiles but also increases the memory usage and cpu time.
percentile_limit = 1000
I'm trying to set the time precision to seconds. I tried to accomplish this in the telegraf.conf file, but it's written in the notes that the precision setting does not affect the statsd plugin:
## By default, precision will be set to the same timestamp order as the
## collection interval, with the maximum being 1s.
## Precision will NOT be used for service inputs, such as logparser and statsd.
## Valid values are "ns", "us" (or "µs"), "ms", "s".
precision = ""
I haven't seen a precision setting in the statsd_input.conf file.
What is the correct way to accomplish this?
Unfortunately this isn't supported by the Telegraf statsd plugin. The workaround is to send the information using the socket_listener input with the correct timestamp.
This information is per the issues logged against Telegraf in the influxdata GitHub organization.
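For example, a minimal socket_listener input that accepts InfluxDB line protocol might look roughly like this (the address and format below are only an illustration); with line protocol, the client supplies the timestamp itself, at whatever precision it writes:
[[inputs.socket_listener]]
## Listen for metrics over UDP
service_address = "udp://:8094"
## Influx line protocol carries the client-supplied timestamp
data_format = "influx"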

Measure service latency with Prometheus

I am new to Prometheus and Grafana. My primary goal is to get the response time per request.
It seemed like a simple thing to me, but whatever I do, I do not get the results I require.
I need to be able to analyse the service latency over the last minutes/hours/days. The current implementation I found is a simple Summary (with no quantiles defined) which is scraped every 15s.
Is it possible to get the average request latency of the last minute from my Prometheus SUMMARY?
If YES: How? If NO: What should I do?
Currently I am using the following query:
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1m])
I am getting two "datasets". The value of the first is "NaN". I suppose this is the result of a division by zero.
(I am using spring-client).
Your query is correct. The result will be NaN if there have been no requests in the past minute.
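If the NaN gaps are a problem on dashboards, one option is to compute the average over a longer window, so the denominator is less likely to be zero, e.g.:
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[5m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[5m])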

Plot event values in Graphite

We would like to use Graphite to plot values related to events such as "a packet of N messages has been published". When no packet is published, no code is run at all and so we cannot send zero to Graphite.
Essentially, we would like to compute some kind of publication rate per second.
Here are some sample data that we send to Graphite (with added timestamps):
2016-11-28 14:46:33.6338Z api.message.publication.count:100
2016-11-28 15:01:36.0780Z api.message.publication.count:12
2016-11-28 15:01:36.9911Z api.message.publication.count:1
2016-11-28 15:01:37.0679Z api.message.publication.count:100
Between 14:46:33 and 15:01:36, no messages were sent. However, between 15:01:36 and 15:01:37, 13 messages were sent (reported as two values, 12 and 1).
I've tried the summarize() function but it does not give results that make sense to me, i.e. I cannot correlate what I'm sending to Graphite and what is displayed by Graphite. Moreover, it seems that summarize() does not support 1-second intervals (I've tried "1second" and "1s" for the interval parameter).
The perSecond() function computes a rate of change (i.e. a derivative) but what we're sending is already a kind of derivative (maybe it's closer to a Dirac delta?) so it doesn't make sense in our context.
Are we completely off, or is there a way to make this work with Graphite?
Edit: I guess we need to add an aggregation stage to our data. Would Carbon aggregation fit the bill here?
It turns out that we were already sending our metrics to statsd, which supports aggregation via the c metric type, and a few other nifty things: https://github.com/etsy/statsd/blob/master/docs/metric_types.md
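For example, the same events can be sent as statsd counters (the |c suffix marks the counter type), and statsd will aggregate them and flush a rate to Graphite every flush interval:
api.message.publication.count:100|c
api.message.publication.count:12|c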
