How can I drop corrupt events in Flume-NG?

I am using the Flume file channel with an Avro source-sink topology to transfer logs. Along with the logs, Flume is also producing a lot of corrupt logs. How can I drop such logs?
Corruption is basically happening due to the merging of some logs. Let's say I have 10 logs coming from machine A and 10 logs coming from machine B. What happens is that Flume gives me 21 logs: 10 each from machines A and B, plus 1 log which is a combination of a log from machine A and one from machine B. However, this is not that frequent; I am getting around 1 corrupt log in 10,000, but at our scale this is still a problem.

You can use an interceptor to discard events of your choice. For example, you can inspect the body of each event to check for corruption and drop the ones that fail the check.
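If the corrupt (merged) lines can be matched with a regular expression, the built-in regex_filter interceptor with excludeEvents = true will drop them at the source. A rough sketch (the agent/source names and the regex are placeholders):

# Sketch only: a1/r1 and the regex are placeholders; the regex must match
# only your corrupt (merged) events.
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = REPLACE_WITH_CORRUPTION_PATTERN
a1.sources.r1.interceptors.i1.excludeEvents = true

If the check can't be expressed as a regex (e.g. the body has to be parsed to detect a merged record), you would need a custom interceptor class and set the source's interceptor type to its Builder's fully qualified class name.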

Related

Fluent Bit only receiving logs on first connection

I'm trying to use Serilog (the Serilog.Sinks.Network sink) to write to a Fluent Bit instance (just running locally).
I'm finding that if I start Fluent Bit and then run my app, the logs get processed as expected. However, without restarting Fluent Bit, if I then run my app again, the logs no longer appear. Restarting Fluent Bit makes it work again (but, again, only for that first run). I'm not seeing any errors (even with Fluent Bit turned up to trace logging).
The code is here in case it's useful:
https://github.com/dracan/FluentBitProblem
I'm seeing similar issues when trying other Serilog sinks too - e.g. Serilog.Sinks.Fluentd. It feels like there's something I'm missing about the way Fluent Bit connections work.
A bit more information
Digging a bit further and getting out Wireshark, I can see that the TCP data is being sent for both the working and failing attempts...
If I use a bogus port number, Wireshark recognises this...
And if I log in a while loop and then kill the Fluent Bit process, Wireshark also picks this up as an error...
This suggests to me that Fluent Bit is getting the logs, but ignoring them for some reason.
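For anyone debugging something similar: a minimal Fluent Bit configuration that only accepts TCP input and echoes events to stdout is a handy way to isolate the problem (a sketch only; the port and Format are assumptions and must match what the Serilog sink actually sends):

# Sketch: Port/Format are assumptions; Format must match what the sink emits
# (json, or none for raw lines).
[INPUT]
    Name   tcp
    Listen 0.0.0.0
    Port   5170
    Format json

[OUTPUT]
    Name   stdout
    Match  *

With the stdout output, every event Fluent Bit actually accepts is printed, which makes it easy to see whether the second run's data is parsed or silently dropped.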

Switch between WriteTo and AuditTo at runtime

AuditTo is a Serilog feature that ensures a synchronous write to the sink, with an exception thrown if flushing the change fails. As the name implies, it is ideal for making sure audit data is actually stored. So far I have found the File, Seq and RabbitMQ sinks supporting AuditTo writes. I couldn't find it in the SQLite sink that I'm interested in ... :(
On the other side we have WriteTo, which batches the log entries and writes them asynchronously. There are no exceptions; it's kind of fire-and-forget. No one knows whether log entries are dropped because of a connection failure or the target system being unavailable.
I would like to send the audit logs via AuditTo, but also be able to switch the logging configuration to WriteTo at runtime. In the meantime, the app might still be writing logs.
I saw that Serilog offers dynamic switching of the logging level via LoggingLevelSwitch.
Any suggestions, ideas, or solutions for such requirements?

Fargate logging issue using log4j2

We have a Fargate service running. On CloudWatch we can see the ECS/ContainerInsights -> StorageWriteBytes metric growing every hour, and at some point it stops increasing, probably because the task has run out of disk space. We then start to see log errors unless we force a new deployment of the ECS service. The error looks like:
error: org.apache.logging.log4j.core.appender.AppenderLoggingException: Error
writing to RandomAccessFile /apollo/env/ReaverFeatureGating/var/output/logs/application.log.%d{yyyy-MM-dd-HH}
Questions:
Is this normal for all Fargate services? Did we set something up wrong?
Can we remove all the AmazonRollingRandomAccessFile appenders and just use STDOUT in log4j2-container.xml? Will that still send our events to CloudWatch, just without writing to disk?
After some research, this is what I found:
Because the default template includes AmazonRollingRandomAccessFile, the logs are generated locally but never cleaned up. There are suggestions about adding a cron job to delete the logs, but in our case we don't need the local logs at all.
Yes, CloudWatch only needs STDOUT (see the sketch below).
Also, StorageWriteBytes only represents how many bytes are read/written to storage; it is not the same as the used disk space. To monitor disk space, we can build the CloudWatch agent into the container image and then use the disk_used metric:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/metrics-collected-by-CloudWatch-agent.html
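For illustration, a minimal log4j2 configuration that writes only to STDOUT could look like the sketch below (this is not the stock log4j2-container.xml; the pattern is arbitrary). On Fargate, whatever goes to STDOUT is shipped to CloudWatch Logs by the awslogs log driver (or FireLens), so no local files are needed:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <!-- Everything goes to STDOUT; the log driver forwards it to CloudWatch -->
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} [%t] %-5level %logger{36} - %msg%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>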

Neo4j is storing arbitrary files on drive C?

My C drive is filling up, and my server is not running anything but Neo4j,
even though I configured Neo4j to store database information on another drive.
The node count might be irrelevant, but for the record, I have almost 10 million nodes and database traffic of about 200 requests/minute.
Is there anything else written by Neo4j that I should be aware of?
dbms.directories.data=E:/MyNeoDB4/
dbms.directories.logs=E:/MyNeoDb4
dbms.jvm.additional=-Dunsupported.dbms.udc.source=zip
dbms.memory.heap.initial_size=15
dbms.memory.heap.max_size=15G
dbms.security.procedures.unrestricted=apoc.*
dbms.memory.pagecache.size=8G
Update 1:
Things I have checked already:
- my debug log is being written somewhere other than drive C
- metrics.enabled=false
Update 2:
- as @InverseFalcon suggested, I also checked the transaction logs early on; they are being written to another directory.
(Note: Answer was written before original question was updated to say that neither metrics nor logs were the likely culprits)
Logs, and possibly metrics
I'm not sure what your logging needs have been like, but a major source of disk consumption other than the data itself is the writing of log files. They typically do not grow extremely quickly, but it depends entirely on your setup.
I suspect that your drive may be filling up with logs, although I am surprised it's filling up so quickly. I would check your log files and see if they are full of long chains of exceptions.
It could also be metrics being exported to CSV on the local disk, although I do not believe that Neo4j will do that without being explicitly configured to do so.
More info on metrics is at the official docs:
https://neo4j.com/docs/operations-manual/current/monitoring/metrics/
As a variant on Rebecca Nelson's answer, you might want to check for transaction log files.
Transaction logs are the source of truth for changes made to a database, and they are not the same kinds of logs as the readable log files (debug.log, neo4j.log) that live in the logs folder.
You can find transaction logs in your graph.db folder (or whatever name you've given to your graph database folder) using the naming pattern neostore.transaction.db.0 (with incremental numbering of the log files starting with 0).
Transaction logs are a stage of data persistence. Transactions affecting the database first write to these logs. When criteria are met, a checkpoint operation occurs which flushes the contents of the transaction logs to the datastore files (some of the other files in the graph.db folder) and the transaction logs are pruned and/or rotated.
While you should not modify or delete transaction log files yourself, you can add configuration parameters in neo4j.conf to control how these files are handled.
Here are the docs dealing with transaction logs.
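For example, retention and rotation of transaction logs can be constrained in neo4j.conf (a sketch using the 3.x setting names matching the question's config; make sure the retention policy still satisfies your backup and recovery needs):

# Keep at most 1 GB of transaction logs (other forms: "2 days", "10 files", ...)
dbms.tx_log.rotation.retention_policy=1G size
# Rotate individual transaction log files at 250 MB (the default)
dbms.tx_log.rotation.size=250M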

Erlang Binary Leak?

We have an Erlang/Elixir application (on Erlang 18 / ERTS 7.3.1) that processes large JSON payloads.
Here's a typical workflow:
A listener gets a token from RabbitMQ and sends it to a gen_server.
The gen_server puts the token into an ETS table with a future time (current + n secs). A scheduled job in the gen_server picks up expired tokens from ETS and launches several short-lived processes with these tokens.
These short-lived processes download 30-50k JSON payloads from Elasticsearch (using hackney), process them, and upload the results back to Elasticsearch; after that, the process dies immediately. We keep track of these processes and have confirmed that they die. We process 5-10 of these requests per second.
The Problem: We see an ever-growing binary space, and within 48 hours this grows to a couple of GB (seen via observer and debug prints). A manual GC also has no impact.
We already added "recon" and ran recon:bin_leak; however, this only frees up a few KB and has no impact on the ever-growing binary space.
Stack: Erlang 18/ERTS 7.3.1, Elixir 1.3.4, hackney 1.4.4, poison 2.2.0, timex 3.1.13, etc.; none of these apps is holding the memory either.
Has anyone come across a similar issue in the past? Would appreciate any solutions.
Update 9/15/2017:
We updated our app to Erlang 19/ERTS 8.3 and the hackney and poison libs to the latest versions, but still there is no progress. Here is a log from a GenServer which periodically sends a message to itself, using spawn/receive or send_after. At each handle_info, it looks up an ETS table and, if it finds any "eligible" entries, it spawns new processes; if not, it just returns {:noreply, state}. We print the VM's binary space info (in KB) at entry to the function; the log is listed below. This is a "quiet" time of the day. You can see the gradual increase of the binary space. Once again, :recon.bin_leak(N) or :erlang.garbage_collect() had no impact on this growth.
11:40:19.896 [warn] binary 1: 3544.1328125
11:40:24.897 [warn] binary 1: 3541.9609375
11:40:29.901 [warn] binary 1: 3541.9765625
11:40:34.903 [warn] binary 1: 3546.2109375
--- some processing ---
12:00:47.307 [warn] binary 1: 7517.515625
--- some processing ---
12:20:38.033 [warn] binary 1: 15002.1328125
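(For reference, the periodic measurement described above amounts to something like the sketch below; the module name and interval are made up, and the number logged is just :erlang.memory(:binary) converted to KB.)

defmodule BinarySpaceLogger do
  use GenServer
  require Logger

  # Interval between samples, in ms (made up; roughly the ~5 s spacing above).
  @interval 5_000

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  def init(:ok) do
    Process.send_after(self(), :tick, @interval)
    {:ok, %{}}
  end

  def handle_info(:tick, state) do
    # Total bytes used by binaries on the VM, reported in KB like the log above.
    kb = :erlang.memory(:binary) / 1024
    Logger.warn("binary 1: #{kb}")
    Process.send_after(self(), :tick, @interval)
    {:noreply, state}
  end
end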
We never had a situation like this in our old Scala/Akka app, which processes 30x the volume and has run for years without an issue or a restart. I wrote both apps.
We found that the memory leak came from a private reusable library that sends messages to Graylog and uses the function below to compress the data before sending it via gen_udp.
defp compress(data) do
  zip = :zlib.open()
  :zlib.deflateInit(zip)
  output = :zlib.deflate(zip, data, :finish)
  :zlib.deflateEnd(zip)
  :zlib.close(zip) # <--- was missing, hence the slow memory leak.
  output
end
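(A slightly more defensive variant of the same fix, in case deflate ever raises; the close then still runs:)

defp compress(data) do
  zip = :zlib.open()

  try do
    :zlib.deflateInit(zip)
    output = :zlib.deflate(zip, data, :finish)
    :zlib.deflateEnd(zip)
    output
  after
    # Always release the zlib stream, even on an exception.
    :zlib.close(zip)
  end
end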
Had I used term_to_binary(data, [:compressed]) instead, I could have saved myself some headaches.
Thanks for all the inputs and comments. Much appreciated!