Fluentd regex filter removes other keys

I'm getting a message into fluentd with a few keys already populated from previous stages (fluent-bit on another host). I'm trying to parse the content of the log field as follows:
# Parse app_logs
<filter filter.app.backend.app_logs>
@type parser
key_name log
<parse>
@type regexp
expression /^(?<module>[^ ]*) *(?<time>[\d ,-:]*) (?<severity>[^ ]*) *(?<file>[\w\.]*):(?<function>[\w_]*) (?<message>.*)$/
time_format %Y-%m-%d %H:%M:%S,%L
</parse>
</filter>
It works (kind of), as it extracts the fields as expected. That said, it removes all the other fields that were there before.
Example message before the filter:
filter.app.backend.app_logs: {"docker.container_name":"intranet-worker","docker.container_id":"98b7784f27f93a056c05b4c5066c06cb5e23d7eeb436a6e4a66cdf8ff045d29f","time":"2022-06-10T17:00:00.248932151Z","log":"org-worker 2022-06-10 19:00:00,248 INFO briefings.py:check_expired_registrations Checking for expired registrations\n","docker.container_image":"registry.my-org.de/org-it-infrastructure/org-fastapi-backend/backend-worker:v0-7-11","stream":"stdout","docker.container_started":"2022-06-10T14:57:27.925959889Z"}
After the filter, the message looks like this (it's a slightly different message, but from the same stream):
filter.app.backend.app_logs: {"module":"mksp-api","severity":"DEBUG","file":"authToken.py","function":"verify_token","message":"Token is valid, checking permission"}
So only the parsed fields are kept; everything else is removed. Can I somehow use that filter to add the fields to the message instead of replacing it?

This scenario is actually described in the documentation; it's not part of the regexp parser documentation but of the corresponding parser filter documentation:
reserve_data
Keeps the original key-value pair in the parsed result.
Therefore, the following configuration works:
<filter filter.app.backend.app_logs>
@type parser
key_name log
reserve_data true
<parse>
@type regexp
expression /^(?<module>[^ ]*) *(?<time>[\d ,-:]*) (?<severity>[^ ]*) *(?<file>[\w\.]*):(?<function>[\w_]*) (?<message>.*)$/
time_format %Y-%m-%d %H:%M:%S,%L
</parse>
</filter>
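With reserve_data true, the parsed keys are merged into the existing record instead of replacing it (the captured time group becomes the event timestamp rather than a record key, so the original time field is untouched). Sketching from the example record above, the output should look roughly like this:
filter.app.backend.app_logs: {"docker.container_name":"intranet-worker","docker.container_id":"98b7784f27f93a056c05b4c5066c06cb5e23d7eeb436a6e4a66cdf8ff045d29f","time":"2022-06-10T17:00:00.248932151Z","log":"org-worker 2022-06-10 19:00:00,248 INFO briefings.py:check_expired_registrations Checking for expired registrations\n","docker.container_image":"registry.my-org.de/org-it-infrastructure/org-fastapi-backend/backend-worker:v0-7-11","stream":"stdout","docker.container_started":"2022-06-10T14:57:27.925959889Z","module":"org-worker","severity":"INFO","file":"briefings.py","function":"check_expired_registrations","message":"Checking for expired registrations"}
If you also want to drop the original log field once it has been parsed, the parser filter's remove_key_name_field true option does that.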

Related

Can fluent-bit parse multiple types of log lines from one file?

I have a fairly simple Apache deployment in k8s using fluent-bit v1.5 as the log forwarder. My setup is nearly identical to the one in the repo below. I'm running AWS EKS and outputting the logs to AWS ElasticSearch Service.
https://github.com/fluent/fluent-bit-kubernetes-logging
The ConfigMap is here: https://github.com/fluent/fluent-bit-kubernetes-logging/blob/master/output/elasticsearch/fluent-bit-configmap.yaml
The Apache access (-> /dev/stdout) and error (-> /dev/stderr) log lines are both in the same container logfile on the node.
The problem I'm having is that fluent-bit doesn't seem to autodetect which Parser to use (I'm not sure if it's even supposed to), and we can only specify one parser in the deployment's annotation section; I've specified apache.
So in the end, the error log lines, which are written to the same file but come from stderr, are not parsed.
Should I be sending the logs from fluent-bit to fluentd to handle the error files, assuming fluentd can handle this, or should I somehow pump only the error lines back into fluent-bit, for parsing?
Am I missing something?
Thanks!
I was able to apply a second (and third) parser to the logs by using the Fluent Bit FILTER with the parser plugin (Name parser), like below.
Documented here: https://docs.fluentbit.io/manual/pipeline/filters/parser
[FILTER]
Name parser
Match kube.*
Parser apache_error_custom
Parser apache_error
Preserve_Key On
Reserve_Data On
Key_Name log
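For this to work, both Parser names must exist in your parsers file: apache_error ships with Fluent Bit's default parsers.conf, while apache_error_custom is a custom parser that isn't shown above. A hypothetical definition, splitting Apache 2.4's module:level field into separate keys (the stock apache_error parser captures that whole bracket as a single level value), might look like:
[PARSER]
    Name   apache_error_custom
    Format regex
    Regex  ^\[(?<time>[^\]]*)\] \[(?<module>[^:\]]*):(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?(?: \[client (?<client>[^\]]*)\])? (?<message>.*)$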
Fluent Bit is able to run multiple parsers on an input.
If you add multiple parsers to your Parser filter as separate lines (for non-multiline parsing; multiline parsing supports a comma-separated list), e.g.
[Filter]
Name Parser
Match *
Parser parse_common_fields
Parser json
Key_Name log
The 1st parser parse_common_fields will attempt to parse the log, and only if it fails will the 2nd parser json attempt to parse these logs.
If you want to parse a log and then parse the result again (for example, when only part of your log is JSON), you'll want to chain two filters after each other, like:
[Filter]
Name Parser
Match *
Parser parse_common_fields
Key_Name log
[Filter]
Name Parser
Match *
Parser json
# This is the key from the parse_common_fields regex that we expect to contain JSON
Key_Name log
Here is an example you can run to test this out:
Example
Attempting to parse log lines whose payload is sometimes JSON and sometimes plain text.
Example log lines
2022-07-28T22:03:44.585+0000 [http-nio-8080-exec-3] [2a166faa-dbba-4210-a328-774861e3fdef][0ed32f19-47bb-4c1f-92c2-c9b7c43aa91f] INFO SomeService:000 - Using decorator records threshold: 0
2022-07-29T11:36:59.236+0000 [http-nio-8080-exec-3] [][] INFO CompleteOperationLogger:25 - {"action":"Complete","operation":"healthcheck","result":{"outcome":"Succeeded"},"metrics":{"delayBeforeExecution":0,"duration":0},"user":{},"tracking":{}}
parser.conf
[PARSER]
Name parse_common_fields
Format regex
Regex ^(?<timestamp>[^ ]+)\..+ \[(?<log_type>[^ \[\]]+)\] \[(?<transaction_id>[^ \[\]]*)\]\[(?<transaction_id2>[^ \[\]]*)\] (?<level>[^ ]*)\s+(?<service_id>[^ ]+) - (?<log>.+)$
Time_Format %Y-%m-%dT%H:%M:%S
Time_Key timestamp
[PARSER]
Name json
Format json
fluentbit.conf
[SERVICE]
Flush 1
Log_Level info
Parsers_File parser.conf
[INPUT]
NAME dummy
Dummy {"log": "2022-07-28T22:03:44.585+0000 [http-nio-8080-exec-3] [2a166faa-dbba-4210-a328-774861e3fdef][0ed32f19-47bb-4c1f-92c2-c9b7c43aa91f] INFO AnonymityService:245 - Using decorator records threshold: 0"}
Tag testing.deanm.non-json
[INPUT]
NAME dummy
Dummy {"log": "2022-07-29T11:36:59.236+0000 [http-nio-8080-exec-3] [][] INFO CompleteOperationLogger:25 - {\"action\":\"Complete\",\"operation\":\"healthcheck\",\"result\":{\"outcome\":\"Succeeded\"},\"metrics\":{\"delayBeforeExecution\":0,\"duration\":0},\"user\":{},\"tracking\":{}}"}
Tag testing.deanm.json
[Filter]
Name Parser
Match *
Parser parse_common_fields
Key_Name log
[Filter]
Name Parser
Match *
Parser json
Key_Name log
[OUTPUT]
Name stdout
Match *
Results
After the parse_common_fields filter runs on the log lines, the common fields are parsed successfully and the remaining log key holds either a plain string or an escaped JSON string.
First Pass
[0] testing.deanm.non-json: [1659045824.000000000, {"log_type"=>"http-nio-8080-exec-3", "transaction_id"=>"2a166faa-dbba-4210-a328-774861e3fdef", "transaction_id2"=>"0ed32f19-47bb-4c1f-92c2-c9b7c43aa91f", "level"=>"INFO", "service_id"=>"AnonymityService:245", "log"=>"Using decorator records threshold: 0"}]
[0] testing.deanm.json: [1659094619.000000000, {"log_type"=>"http-nio-8080-exec-3", "level"=>"INFO", "service_id"=>"CompleteOperationLogger:25", "log"=>"{"action":"Complete","operation":"healthcheck","result":{"outcome":"Succeeded"},"metrics":{"delayBeforeExecution":0,"duration":0},"user":{},"tracking":{}}"}]
Once the json filter parses the logs, the JSON payload is also parsed correctly.
Second Pass
[0] testing.deanm.non-json: [1659045824.000000000, {"log_type"=>"http-nio-8080-exec-3", "transaction_id"=>"2a166faa-dbba-4210-a328-774861e3fdef", "transaction_id2"=>"0ed32f19-47bb-4c1f-92c2-c9b7c43aa91f", "level"=>"INFO", "service_id"=>"AnonymityService:245", "log"=>"Using decorator records threshold: 0"}]
[0] testing.deanm.json: [1659094619.000000000, {"action"=>"Complete", "operation"=>"healthcheck", "result"=>{"outcome"=>"Succeeded"}, "metrics"=>{"delayBeforeExecution"=>0, "duration"=>0}, "user"=>{}, "tracking"=>{}}]
Didn't see this for FluentBit, but for Fluentd:
https://github.com/fluent/fluentd-kubernetes-daemonset
https://github.com/repeatedly/fluent-plugin-multi-format-parser#configuration
Note that format none as the last option means the log line is kept as-is (plain text) if nothing else matched.
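A sketch of what that looks like with fluent-plugin-multi-format-parser (the path, tag, and pattern list are illustrative; patterns are tried top to bottom, with format none as the catch-all):
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kube.*
  format multi_format
  <pattern>
    format apache2
  </pattern>
  <pattern>
    format json
  </pattern>
  <pattern>
    format none
  </pattern>
</source>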
You can also use Fluent Bit as a pure log collector and then have a separate Deployment with Fluentd that receives the stream from Fluent Bit, parses it, and handles all the outputs. In this case, use the forward output in Fluent Bit and a forward source (@type forward) in Fluentd. Docs: https://docs.fluentbit.io/manual/pipeline/outputs/forward
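A minimal sketch of that split (the Fluentd host name and port are assumptions):
Fluent Bit side:
[OUTPUT]
    Name   forward
    Match  *
    Host   fluentd.logging.svc.cluster.local
    Port   24224
Fluentd side:
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>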

parse the date with format yyyy.MM.dd HH.mm.ss.SSSSSS with ingest node processor in ELK

I need to parse yyyy.MM.dd HH.mm.ss.SSSSSS format as date field type with ingest node processors.
I couldn't find any grok pattern suitable for parsing the above format, and the date processor did not work either.
example log:
[2020.08.31 14:30:23.823121] (INFO) Execution is DONE!
How can I parse this log into time:date, severity:string, message:string?
Please find below the grok pattern that will parse your logline:
\[(?<timestamp>%{YEAR}.%{MONTHNUM}.%{MONTHDAY} %{TIME})\] \(%{LOGLEVEL:log_level}\) %{GREEDYDATA:message}
I have used Grok Debugger to debug the grok pattern.
Screenshot of the output after parsing the logline:
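For completeness, an ingest pipeline that chains that grok pattern with a date processor might look like the sketch below (the pipeline name and target field names are illustrative, chosen to match the question; the date format assumes the colons seen in the example line):
PUT _ingest/pipeline/parse-app-log
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "\\[(?<timestamp>%{YEAR}\\.%{MONTHNUM}\\.%{MONTHDAY} %{TIME})\\] \\(%{LOGLEVEL:severity}\\) %{GREEDYDATA:message}"
        ]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": ["yyyy.MM.dd HH:mm:ss.SSSSSS"]
      }
    }
  ]
}
The date processor writes the parsed value to @timestamp by default; set target_field if you want it somewhere else.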

Stackdriver custom multiline logging, time format

I've been trying to set up a custom multiline log parser to get logs into Stackdriver with some readable fields. Currently it looks like this:
<source>
type tail
read_from_head true
path /root/ansible.log
pos_file /var/lib/google-fluentd/pos/ansible.pos
time_format "%a %b %e %T %Z %Y"
format multiline
format_firstline /Started ansible run at/
format1 /Started ansible run at (?<timestart>[^\n]+)\n(?<body>.*)/
format2 /PLAY RECAP.*/
format3 /ok=(?<ok>\d+)\s+changed=(?<changed>\d+)\s+unreachable=(?<unreachable>\d+)\s+failed=(?<failed>\d+).*/
format4 /Finished ansible run at (?<timeend>[^\n]+)/
tag ansible
</source>
It's done to the specifications at http://docs.fluentd.org/v0.12/articles/parser_multiline, and it works. But it works without a proper timestamp: timestart and timeend are just plain fields in the JSON. So in this current state, the time_format setting is useless, because I don't have a time variable among the regexes. This does aggregate all the variables I need, logs show up in Stackdriver when I run the fluentd service, and all is almost happy.
However, when I change one of those time variables' name to time, trying to actually assign a Stackdriver timestamp to the entry, it doesn't work. The fluentd log on the machine says that the worker started and parsed everything, but logs don't show up in the Stackdriver console at all.
timestart and timeend look like Fri Jun 2 20:39:58 UTC 2017 or something along those lines. The time format specifications are at http://ruby-doc.org/stdlib-2.4.1/libdoc/time/rdoc/Time.html#method-c-strptime and I've checked and double checked them too many times and I can't figure out what I'm doing wrong.
EDIT: another detail: when I try to parse out the time variable, while the logs don't show up in the Stackdriver console, the appropriate tag (in this case ansible) shows up in the list of tags. It's just that the results are empty.
You're correct that the Stackdriver logging agent looks for the timestamp in the 'time' field, but it uses Ruby's Time.iso8601 to parse that value (falling back on Time.at on error). The string you quoted (Fri Jun 2 20:39:58 UTC 2017) is not in either of those formats, so it fails to parse it (you could probably see the error in /var/log/google-fluentd/google-fluentd.log).
You could add a record_transformer plugin to your config to change your parsed date to the right format (hint: enable_ruby is your friend). Something like:
<filter foo.bar>
@type record_transformer
enable_ruby
<record>
time ${Time.strptime(record['time'], '%a %b %d %T %Z %Y').iso8601}
</record>
</filter>
should work...

Flex prints newline to stdout on default rule match - want to alter that behavior

I have the following flex rules in place.
"#"{name} {printf(" HASH | %s\n", yytext);}
. {}
It works great for my purposes and outputs upon a match to the first rule;
HASH | some matched string
What's bothering me is that flex is also printing a newline on each match of the second rule, so I get a stdout filled with newlines. Is there a do-nothing op in C? Am I implicitly telling flex to print a newline with an empty rule action? Omitting the "{}" results in the same behavior. I can use sed or whatever to filter out the newlines, but I'd rather just tell flex to stop printing them.
I'm happy to provide follow-up examples and data.
You need to add \n to your default rule:
.|\n {}
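For context, here is a minimal, self-contained .l file with that fix applied (the name definition is an assumption, since the original definition isn't shown in the question):
%{
#include <stdio.h>
%}
%option noyywrap
name    [A-Za-z_][A-Za-z0-9_]*
%%
"#"{name}   { printf(" HASH | %s\n", yytext); }
.|\n        { /* swallow everything else, including newlines, instead of echoing */ }
%%
int main(void) { yylex(); return 0; }
Without the \n alternative, newlines fall through to flex's default rule, which echoes them to stdout; that is where the stray blank lines come from.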

Fuzzy search with Solr and sunspot

I have installed Solr and the Sunspot gem for my Rails 3.0 app.
My goal is to do fuzzy search.
For example, I want the search term "Chatuea Marguxa" to match "Château Margaux".
At the moment, only exact word matches are found, so fuzzy search isn't working at all.
My model:
searchable do
text :winery
end
My controller:
search = Wine.search do
fulltext 'Chatuea Marguxa'
end
The Solr schemas I tried, first with n-grams:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
</fieldType>
I also tried with double metaphone:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
</analyzer>
In both cases, I got 0 response. (after reindexing of course).
What did I do wrong?
Try adding the '~' character after each word in the query, like this: Chatuea~ Marguxa~. This is the fuzzy operator implemented in Lucene: http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Fuzzy%20Searches
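Following that suggestion on the Sunspot side, a small sketch (fuzzify is a made-up helper, and this assumes the query parser in use, e.g. edismax, passes Lucene's ~ operator through):
def fuzzify(query)
  # Append the Lucene fuzzy operator to every whitespace-separated term.
  query.split.map { |term| "#{term}~" }.join(' ')
end

search = Wine.search do
  fulltext fuzzify('Chatuea Marguxa')
end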
Some searching around revealed the fuzzily gem:
Anecdotal benchmark: against our whole Geonames-derived table of locations (3.2M records, about 1GB of data), on my development machine (a 2011 MacBook Pro):
- searching for the top 10 matching records takes 6ms ±1
- preparing the index for all records takes about 10min
- the DB query overhead when changing a record is at 3ms ±2
- the memory overhead (footprint of the trigrams table index) is about 300MB
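If you go the fuzzily route, usage is roughly as follows, based on the gem's README (treat this as a sketch; the gem also needs its trigrams table migration in place before indexing):
class Wine < ActiveRecord::Base
  fuzzily_searchable :winery
end

Wine.find_by_fuzzy_winery('Chatuea Marguxa', :limit => 10)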
