Receiving "buffer flush took longer time than slow_flush_log_threshold" warnings in Fluentd on Azure AKS

We deployed our Fluentd application on AKS.
For the past few days we have been seeing multiple warnings in our AKS pods:
[warn]: #6 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=34.518092500045896 slow_flush_log_threshold=20.0
Recently we have been receiving more logs than on previous days, which I guess might be the reason for these warnings. However, I also see a decrease in the output to Event Hubs, which is a concern.
<match udp.**>
@type azureeventhubs_e
connection_string "#{ENV['EH_CONNECTIONSTRING']}"
hub_name "#{ENV['EH_ENTITY']}"
batch true
print_records false
<buffer>
retry_forever false
total_limit_size 2GB
chunk_full_threshold 0.1
flush_interval 1s
flush_thread_count 36
retry_max_times 2
overflow_action block
</buffer>
</match>
Can anyone help? How can I avoid these warnings?

Related

fluentd TimeParser Error - Invalid Time Format

I'm trying to get some Cisco Meraki MX firewall logs pointed to our Kubernetes cluster using Fluentd pods. I'm using the syslog source plugin and I am able to get the logs generated, but I keep getting this error:
2022-06-30 16:30:39 -0700 [error]: #0 invalid input data="<134>1 1656631840.701989724 838071_MT_DFRT urls src=10.202.11.05:39802 dst=138.128.172.11:443 mac=90:YE:F6:23:EB:T0 request: UNKNOWN https://f3wlpabvmdfgjhufgm1xfd6l2rdxr.b3-4-eu-w01.u5ftrg.com/..." error_class=Fluent::TimeParser::TimeParseError error="invalid time format: value = 1 1656631840.701989724 838071_ME_98766, error_class = ArgumentError, error = string doesn't match"
Everything else seems to be fine, but it looks as though the Meraki is sending its logs with epoch timestamps, and the Fluentd syslog plugin does not accept them.
I have a vanilla config:
<source>
@type syslog
port 5140
tag meraki
</source>
Is there a way to transform the time strings into something Fluentd will accept? Or what am I missing here?
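The value the parser rejects (1656631840.701989724) looks like Unix epoch seconds with a nanosecond fraction. The following is only a minimal Python sketch, outside Fluentd, of what that value decodes to; the function name epoch_to_iso is hypothetical and the fraction is truncated to microseconds because that is what Python's datetime can hold.

```python
from datetime import datetime, timezone

def epoch_to_iso(value: str) -> str:
    """Convert 'seconds[.fraction]' epoch text to an ISO 8601 UTC string."""
    seconds, _, fraction = value.partition(".")
    # Pad or truncate the fractional part to exactly six digits (microseconds).
    micros = int((fraction + "000000")[:6])
    dt = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return dt.replace(microsecond=micros).isoformat()

# Timestamp taken from the rejected Meraki message above.
print(epoch_to_iso("1656631840.701989724"))  # 2022-06-30T23:30:40.701989+00:00
```

The decoded time is one second after the 16:30:39 -0700 wall-clock time in the Fluentd error line, which supports the guess that the field is an epoch timestamp rather than a formatted date.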

How to inject `time` attribute based on certain json key value?

I am still new to Fluentd. I've tried various configurations, but I am stuck.
Suppose I have this record pushed to Fluentd, with an _epoch field holding the epoch time at which the record was created.
{"data":"dummy", "_epoch": <epochtime_in_second>}
Instead of using the time attribute assigned by Fluentd, I want to override the time with this _epoch field. How can I produce Fluentd output with the time overridden?
I've tried this:
# TCP input to receive logs from the forwarders
<source>
@type forward
bind 0.0.0.0
port 24224
</source>
# HTTP input for the liveness and readiness probes
<source>
@type http
bind 0.0.0.0
port 9880
</source>
# rds2fluentd_test
<filter rds2fluentd_test.*>
@type parser
key_name _epoch
reserve_data true
<parse>
@type regexp
expression /^(?<time>.*)$/
time_type unixtime
utc true
</parse>
</filter>
<filter rds2fluentd_test.*>
@type stdout
</filter>
<match rds2fluentd_test.*>
@type s3
#log_level debug
aws_key_id "#{ENV['AWS_ACCESS_KEY']}"
aws_sec_key "#{ENV['AWS_SECRET_KEY']}"
s3_bucket foo-bucket
s3_region ap-southeast-1
path ingestion-test-01/${_db}/${_table}/%Y-%m-%d-%H-%M/
#s3_object_key_format %{path}%{time_slice}_%{index}.%{file_extension}
# if you want to use ${tag} or %Y/%m/%d/ like syntax in path / s3_object_key_format,
# need to specify tag for ${tag} and time for %Y/%m/%d in <buffer> argument.
<buffer time,_db,_table>
@type file
path /var/log/fluent/s3
timekey 1m # 1 minute partition
timekey_wait 10s
timekey_use_utc true # use utc
chunk_limit_size 256m
</buffer>
time_slice_format %Y%m%d%H
store_as json
</match>
But upon receiving data like the above, it shows a warning like this:
#0 dump an error event: error_class=Fluent::Plugin::Parser::ParserError error="parse failed no implicit conversion of Integer into Hash" location="/usr/local/bundle/gems/fluentd-1.10.4/lib/fluent/plugin/filter_parser.rb:110:in `rescue in filter_with_time'" tag="rds2fluentd_test." time=1590578507 record={....
I was getting the same warning message. Setting hash_value_field parsed in the filter section solved the issue.
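For readers unsure what "overriding the time" means here: an event can be thought of as a (time, record) pair, and the goal is to replace the first element with the record's _epoch value. The Python sketch below is only a conceptual model of that intent, not Fluentd's implementation; the function name override_time and the sample _epoch value are made up for illustration.

```python
def override_time(event_time, record, key="_epoch"):
    """Return (time, record), taking the time from record[key] when it is numeric."""
    value = record.get(key)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return event_time, record  # keep the time assigned on arrival
    return value, record

# 1590578507 is the arrival time from the warning above; the _epoch value is hypothetical.
arrival_time = 1590578507
record = {"data": "dummy", "_epoch": 1590578000}
print(override_time(arrival_time, record))
```

A record without a usable _epoch keeps its arrival time, which mirrors the fallback one would usually want from a time-overriding filter.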

fluentd not capturing milliseconds for time

I am using fluentd version 0.14.6. I want to have the milliseconds (or better) captured by fluentd and then passed on to ElasticSearch, so that the entries are shown in the correct order. Here is my fluentd.conf:
<source>
@type tail
path /home/app/rails/current/log/development.log
pos_file /home/app/fluentd/rails.pos
tag rails.access
format /^\[(?<time>\S{10}T\S{8}.\d{3})\] \[(?<remoteIP>\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3})\] \[(?<requestUUID>\w+)\] \[AWSELB=(?<awselb>\w+)\] (?<message>.*)/
time_format %Y-%m-%dT%H:%M:%S.%L
</source>
<match rails.access>
@type stdout
time_as_integer false
</match>
And here is a sample log entry from Rails
[2016-09-27T19:10:05.732] [xxx.xxx.xxx.46] [46171c9870ab2d06bc3a9a0bb02] [AWSELB=97B1C1B51866B68887CF7F5B8C352C45CA31592743CF389F006C541D59ED5E01852E7EF67C807B1CFC8BC145D569BCB9859AFCA73D10A87920CF2269DE5A47D16536B33873DEEF4A24967661232B38E564] Completed 200 OK in 39.8ms (Views: 0.5ms | ActiveRecord: 14.9ms)
This all parses fine, except the milliseconds are dropped. Here is a result from STDOUT
2016-09-27 19:43:56 +0000 rails.access: {"remoteIP":"xxx.xxx.xxx.46","requestUUID":"0238cb3d812534487181b2c54bd20","awselb":"97B1C1B51866B68887CF7F5B8C352C43CA21592743CF389F006C541D59ED5E01852E7EF67C807B1CFC8BC145D569BCB9859AFCA73D10A87920CF2269DE5A47D16536B33873DEEF4A24967661232B38E564","message":""}
I have searched SO, but the two posts listed are from a time before this PR, which is supposed to add in milliseconds. It is merged. The PR mentions adding a time_as_integer option, which I have done. I tried setting it to both true and false, as there is some confusion in the PR, but it made no difference. I also tried putting it into the source, but that threw an error.
I also looked at this post, which is trying to get to nano second, which I don't need. It also is not a good solution for me, as the time would then come from fluentd, not Rails.
Thanks for your help !
Your source is properly configured, and the milliseconds are available to the output plugin. The stdout output plugin simply does not print milliseconds (per its source code).
You can test the milliseconds availability by using the file output plugin.
<match rails.access>
@type file
path example.out.log
time_format %Y-%m-%dT%H:%M:%S.%L
</match>
The Elasticsearch output plugin takes milliseconds into account (per its source code).
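To see why sub-second precision matters for ordering, here is a small Python sketch (not Fluentd code). It parses timestamps in the Rails format from the question and compares them with and without the fractional part. The second timestamp is invented for the comparison, and UTC is an assumption because the log line carries no zone.

```python
from datetime import datetime, timezone

def parse_rails_time(text: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SS.mmm' as a UTC datetime (the zone is an assumption)."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)

later = parse_rails_time("2016-09-27T19:10:05.732")    # from the sample log entry
earlier = parse_rails_time("2016-09-27T19:10:05.120")  # hypothetical second entry

# With integer seconds the two events are indistinguishable.
print(int(later.timestamp()) == int(earlier.timestamp()))  # True
# With the fraction preserved they sort correctly.
print(earlier < later)  # True
```

This is the same distinction the time_as_integer setting controls: once the time is truncated to whole seconds, entries within the same second can no longer be ordered reliably.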

How do you forward an event from one match block to the next in fluentd?

I came across Fluentd last week. I liked it at first (still do), but there seem to be a few holes that are preventing me from using it.
I'm trying to forward our logs to two different locations - an S3 bucket to archive, and an Elasticsearch database for analytics with kibana. I looked at the fluent-forest-plugin, but I realize that won't work because of this. I tried using the copy plugin, but I'm getting this error:
[error]: config error file="/etc/td-agent/td-agent.conf" error="Other 's3' plugin already use same buffer_path: type = s3, buffer_path = /tmp/fluent-plugin-s3"
with this config
<source>
type tail
path /var/log/nginx/web__error.log
pos_file /var/tmp/nginx_web__error.pos
tag web__error
format /^(?<time>[^ ]+ [^ ]+) \[(?<log_level>.*)\] (?<pid>\d*).(?<tid>[^:]*): (?<message>.*)$/
</source>
<match web__error>
type copy
<store>
type s3
aws_key_id ACC_KEY
aws_sec_key SEC_KEY
s3_bucket log-bucket
path web__error/
buffer_path /tmp/fluent-plugin-s3
s3_object_key_format %{path}%{time_slice}_%{index}.%{file_extension}
time_slice_format %Y-%m-%d/%H
flush_interval 15s
utc
</store>
<store>
type elasticsearch
logstash_format true
logstash_prefix web__error
flush_interval 15s
include_tag_key true
utc_index true
</store>
</match>
From what I've read, once an event is caught in one match block, it can't be caught by any subsequent ones. As a last resort, I need to know if there is any way to do this that I haven't found yet?
This is a non-issue - I forgot I was using the same buffer_path in other config files, which caused this error.
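Since the error came from the same buffer_path being reused across several config files, a quick scan can catch that before Fluentd starts. The Python sketch below is a hypothetical helper, not part of Fluentd or td-agent; it takes config text keyed by file name and reports every buffer_path value that occurs more than once. The file names in the example are invented.

```python
import re
from collections import defaultdict

# Matches lines such as "buffer_path /tmp/fluent-plugin-s3".
BUFFER_PATH = re.compile(r"^\s*buffer_path\s+(\S+)", re.MULTILINE)

def find_duplicate_buffer_paths(configs):
    """Map each buffer_path used more than once to the files that use it."""
    seen = defaultdict(list)
    for name, text in configs.items():
        for path in BUFFER_PATH.findall(text):
            seen[path].append(name)
    return {path: names for path, names in seen.items() if len(names) > 1}

configs = {
    "td-agent.conf": "<store>\ntype s3\nbuffer_path /tmp/fluent-plugin-s3\n</store>\n",
    "other.conf": "<match app>\ntype s3\nbuffer_path /tmp/fluent-plugin-s3\n</match>\n",
    "unique.conf": "<match api>\ntype s3\nbuffer_path /tmp/fluent-plugin-s3-api\n</match>\n",
}
print(find_duplicate_buffer_paths(configs))
```

Giving each s3 output its own buffer_path, for example by appending the tag name, avoids the collision.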

Elasticsearch Memory Issue - ES Process Consuming ALL RAM

We are having an issue on our production Elasticsearch cluster where Elasticsearch seems to consume, over time, all of the RAM on each server. Each box has 128GB of RAM, so we run two instances with 30GB allocated to each for the JVM heap. The remaining 68GB is left for the OS and Lucene. We rebooted each of the servers last week, and right after the reboot each Elasticsearch process was using about 24% of the RAM, as expected. It has now been almost a week and memory consumption has gone up to around 40% per Elasticsearch instance. I have attached our config file in the hope that someone can help figure out why Elasticsearch is growing past the memory limit we have set.
Currently we are running ES 1.3.2 but will be upgrading to 1.4.2 next week with our next release.
Here is a view of top (extra fields removed for clarity) from right after the reboot:
PID USER %MEM TIME+
2178 elastics 24.1 1:03.49
2197 elastics 24.3 1:07.32
and one today:
PID USER %MEM TIME+
2178 elastics 40.5 2927:50
2197 elastics 40.1 3000:44
elasticsearch-0.yml:
cluster.name: PROD
node.name: "PROD6-0"
node.master: true
node.data: true
node.rack: PROD6
cluster.routing.allocation.awareness.force.rack.values: PROD4,PROD5,PROD6,PROD7,PROD8,PROD9,PROD10,PROD11,PROD12
cluster.routing.allocation.awareness.attributes: rack
node.max_local_storage_nodes: 2
path.data: /es_data1
path.logs: /var/log/elasticsearch
bootstrap.mlockall: true
transport.tcp.port: 9300
http.port: 9200
http.max_content_length: 400mb
gateway.recover_after_nodes: 17
gateway.recover_after_time: 1m
gateway.expected_nodes: 18
cluster.routing.allocation.node_concurrent_recoveries: 20
indices.recovery.max_bytes_per_sec: 200mb
discovery.zen.minimum_master_nodes: 10
discovery.zen.ping.timeout: 3s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: XXX
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s
action.auto_create_index: .marvel-*
action.disable_delete_all_indices: true
indices.cache.filter.size: 10%
index.refresh_interval: -1
threadpool.search.type: fixed
threadpool.search.size: 48
threadpool.search.queue_size: 10000000
cluster.routing.allocation.cluster_concurrent_rebalance: 6
indices.store.throttle.type: none
index.reclaim_deletes_weight: 4.0
index.merge.policy.max_merge_at_once: 5
index.merge.policy.segments_per_tier: 5
marvel.agent.exporter.es.hosts: ["1.1.1.1:9200","1.1.1.1:9200"]
marvel.agent.enabled: true
marvel.agent.interval: 30s
script.disable_dynamic: false
and here is /etc/sysconfig/elasticsearch-0 :
# Directory where the Elasticsearch binary distribution resides
ES_HOME=/usr/share/elasticsearch
# Heap Size (defaults to 256m min, 1g max)
ES_HEAP_SIZE=30g
# Heap new generation
#ES_HEAP_NEWSIZE=
# max direct memory
#ES_DIRECT_SIZE=
# Additional Java OPTS
#ES_JAVA_OPTS=
# Maximum number of open files
MAX_OPEN_FILES=65535
# Maximum amount of locked memory
MAX_LOCKED_MEMORY=unlimited
# Maximum number of VMA (Virtual Memory Areas) a process can own
MAX_MAP_COUNT=262144
# Elasticsearch log directory
LOG_DIR=/var/log/elasticsearch
# Elasticsearch data directory
DATA_DIR=/es_data1
# Elasticsearch work directory
WORK_DIR=/tmp/elasticsearch
# Elasticsearch conf directory
CONF_DIR=/etc/elasticsearch
# Elasticsearch configuration file (elasticsearch.yml)
CONF_FILE=/etc/elasticsearch/elasticsearch-0.yml
# User to run as, change this to a specific elasticsearch user if possible
# Also make sure, this user can write into the log directories in case you change them
# This setting only works for the init script, but has to be configured separately for systemd startup
ES_USER=elasticsearch
# Configure restart on package upgrade (true, every other setting will lead to not restarting)
#RESTART_ON_UPGRADE=true
Please let me know if there is any other data I can provide. Thanks in advance for any help.
             total       used       free     shared    buffers     cached
Mem:        129022     119372       9650          0        219      46819
-/+ buffers/cache:      72333      56689
Swap:        28603          0      28603
What you are seeing isn't heap blowout; the heap will always be restricted to what you set in the config. free -m and top report OS-level memory use, so the usage there is most likely the OS caching filesystem calls.
This will not cause a Java OOM.
If you are experiencing a Java OOM, which is directly related to the Java heap running out of space, then something else is at play. Your logs may provide some information about that.
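The free -m output in the question already separates cache from process memory. As a rough Python check of that arithmetic (the function name is hypothetical; the figures are the ones posted above, in MB):

```python
def used_excluding_cache(used_mb: int, buffers_mb: int, cached_mb: int) -> int:
    """Memory held by processes once buffers and page cache are subtracted."""
    return used_mb - buffers_mb - cached_mb

total, used, buffers, cached = 129022, 119372, 219, 46819

process_mb = used_excluding_cache(used, buffers, cached)
print(process_mb)  # 72334
print(f"processes: {100 * process_mb / total:.1f}% of RAM")
print(f"page cache: {100 * cached / total:.1f}% of RAM")
```

The result is within 1 MB of the 72333 on the "-/+ buffers/cache" line (free rounds each column separately). Roughly 46GB of the "used" figure is therefore page cache that the kernel can reclaim, not memory held by the Elasticsearch processes.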
