I'm having issues connecting Fluentd to Kafka for a centralized logging PoC I'm working on.
I'm currently using the following configuration:
Minikube
Fluentd
fluent/fluentd-kubernetes-daemonset:v1.14.3-debian-kafka2-1.0 (docker)
Configuration: I have FLUENT_KAFKA2_BROKERS=<INTERNAL KAFKA BOOTSTRAP IP>:9092 and FLUENT_KAFKA2_DEFAULT_TOPIC=logs set as environment variables in the YAML for my Fluentd daemonset.
Kafka
I was sort of expecting to see the logs appear in a Kafka consumer running against the same broker listening on the "logs" topic. No dice.
Could anyone recommend next steps for troubleshooting and/or a good reference? I've done a good bit of searching and have only found a few people posting about setting up with the fluentd-kafka plugin. Also, would it make sense for me to explore a Fluent Bit Kafka setup as an alternative?
In general, to forward log events to a Kafka topic you need to use a Fluentd output plugin.
Fluentd ships the fluent-plugin-kafka plugin, as specified in the Fluentd docs, for both input and output use cases. For the output case, the plugin provides Kafka producer functions that publish messages to topics. The kafka-connect-fluentd plugin can also be used as an alternative.
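With the kafka2 variant of that plugin, a minimal output section looks roughly like this (the daemonset images generate something equivalent from the FLUENT_KAFKA2_* environment variables; the broker placeholder and buffering values here are illustrative):

```
<match **>
  @type kafka2
  # comma-separated list of brokers
  brokers <INTERNAL KAFKA BOOTSTRAP IP>:9092
  default_topic logs
  <format>
    @type json
  </format>
  <buffer topic>
    flush_interval 10s   # records reach the topic only after a flush
  </buffer>
</match>
```

One practical troubleshooting step: because output is buffered, records are published in batches, so check the Fluentd pod's own logs (`kubectl logs <fluentd-pod>`) for broker connection errors before concluding that nothing is flowing.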
Fluent Bit, being a sub-project of Fluentd, is a good lightweight alternative to Fluentd, but which one to use depends on your particular use case.
Fluent Bit has a more limited set of filtering options and is not as pluggable and flexible as Fluentd. The latter has more configuration options and filters and can be integrated with a much larger number of input and output sources. It is essentially designed to deal with heavy throughput: aggregating from multiple inputs, processing data, and routing to different outputs. More on the comparison here and here.
Related
I am new to Docker, and I hope to monitor some QoS metrics of a service running in a Docker container. Does Docker provide an API I can use to count the HTTP requests to each container without deploying code inside the containers?
I believe Docker's only metrics API is the /containers/{id}/stats endpoint, and it looks like that only publishes interface-level statistics (generally the total number of bytes that have gone in and out of a container). That interface doesn't provide port-level or HTTP-level metrics.
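For illustration, here's how you might read the network counters out of one stats snapshot. The JSON below is a fabricated sample, but the `networks`/`rx_bytes`/`tx_bytes` field names match what the stats endpoint returns:

```python
import json

# Sample of the "networks" section returned by GET /containers/{id}/stats
# (one-shot, stream=false). The byte values are made up for illustration.
sample = json.loads("""
{
  "networks": {
    "eth0": {"rx_bytes": 5338, "tx_bytes": 1200},
    "eth1": {"rx_bytes": 0,    "tx_bytes": 0}
  }
}
""")

def totals(stats):
    """Sum bytes in/out across all of the container's interfaces."""
    rx = sum(i["rx_bytes"] for i in stats["networks"].values())
    tx = sum(i["tx_bytes"] for i in stats["networks"].values())
    return rx, tx

rx, tx = totals(sample)
print(rx, tx)  # 5338 1200
```

Note that these are cumulative interface-level counters since the container started; there is no way to recover per-request or per-port numbers from them.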
You might check to see whether the HTTP framework you're using has metric instrumentation built-in. If it does, you can use open-source tools like Prometheus and Grafana to collect and display those metrics.
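As a sketch of the instrument-your-own-code approach (standard library only; in practice you'd use a Prometheus client library instead of hand-rolling this), here is a tiny HTTP server that counts requests per path and exposes the counts on a `/metrics` endpoint:

```python
import threading
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

request_counts = Counter()  # per-path request counter: the "metric"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        request_counts[self.path] += 1  # instrument every request
        if self.path == "/metrics":
            # expose counts in a Prometheus-style text format
            body = "".join(
                f'http_requests_total{{path="{p}"}} {n}\n'
                for p, n in request_counts.items()
            ).encode()
        else:
            body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence the default access log
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

for _ in range(3):
    urlopen(f"http://127.0.0.1:{port}/hello").read()
print(request_counts["/hello"])  # 3

server.shutdown()
```

A Prometheus server would then scrape `/metrics` periodically, and Grafana would graph the resulting time series.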
At a technical level, decoding the HTTP stream is trickier than just forwarding packets along, and you need a more involved proxy setup. If you're using Kubernetes, service meshes like Istio often provide these statistics for you, by running a proxy (Envoy) that handles all traffic; but especially for a small application, instrumenting your own code will be much easier than installing Kubernetes and Istio and trying to deploy there.
We're trying to eliminate Datadog agents from our infrastructure. I am trying to find a solution to forward the containers' standard output logs to be visualised on Datadog, but without the agents and without changing the Dockerfiles, because there are hundreds of them.
I was thinking about trying to centralize the logs with rsyslog, but I don't know if it's a good idea. Any suggestions?
This doc will show you a comprehensive list of all integrations that involve log collection. Some of these include other common log shippers, which can also be used to forward logs to Datadog. Among these you'd find...
Fluentd
Logstash
Rsyslog (for Linux)
Syslog-ng (for Linux, Windows)
NXLog (for Windows)
That said, you can still use the Datadog agent to collect logs only (they want you to collect everything with their agent, which is why they warn you against using it for logs alone).
If you want to collect logs from docker containers, the Datadog agent is an easy way to do that, and it has the benefit of adding lots of relevant docker-metadata as tags to your logs. (Docker log collection instructions here.)
If you don't want to do that, I'd look at Fluentd first on the list above -- it has a good reputation for containerized log collection, promotes JSON log formatting (for easier processing), and scales reasonably well.
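A hedged sketch of the agentless approach with Fluentd: tail the Docker JSON log files on the host and ship them with the fluent-plugin-datadog output. The API key is a placeholder, the log path assumes Docker's default json-file logging driver, and the plugin gem must be installed:

```
<source>
  @type tail
  path /var/lib/docker/containers/*/*-json.log
  pos_file /var/log/fluentd-docker.pos
  tag docker.*
  <parse>
    @type json
  </parse>
</source>

<match docker.**>
  @type datadog            # requires fluent-plugin-datadog
  api_key <YOUR_API_KEY>
</match>
```

This avoids touching any Dockerfile, since the logs are read from the host filesystem.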
Question:
Is it possible to use stdout/stderr as fluentd source?
If not, are there some sort of workaround to implement this?
Background:
I have to containerize a NodeJS web server that uses json-log as a logging resource.
Since containers are ephemeral, I want to extract the server's logs for debugging purposes.
To do this, I've decided to use EFK stack.
However, since...
The philosophy of json-log is...
Write to stdout/err
I can only get the logs of the web server from stdout.
After going through the fluentd documentation, I didn't find a way to use stdout/stderr as a source.
Related question:
Is it possible to use stdout as a fluentd source to capture specific logs for write to elasticsearch?
The question has an answer but it is inapplicable in my case.
See https://www.npmjs.com/package/json-log#write-to-stdouterr
You can send logs from json-log to syslog.
So you can use fluent-plugin-syslog to receive the logs from json-log into Fluentd, and forward them on from there.
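A minimal sketch of what that could look like, assuming json-log is pointed at a local syslog listener on port 5140 and the destination is Elasticsearch (host name, port, and tag are placeholders):

```
<source>
  @type syslog          # Fluentd's syslog input
  port 5140
  bind 0.0.0.0
  tag app.json-log
</source>

<match app.**>
  @type elasticsearch   # requires fluent-plugin-elasticsearch
  host elasticsearch
  port 9200
  logstash_format true
</match>
```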
The Kubernetes documentation states it's possible to use Elasticsearch and Kibana for cluster level logging.
Is this possible to do this on the instance of Kubernetes that's shipped with Docker for Windows as per the documentation? I'm not interested in third party Kubernetes manifests or Helm charts that mimic this behavior.
Kubernetes is an open-source system for automating deployment, scaling,
and management of containerized applications.
It is a complex environment with a huge amount of information regarding the state of the cluster and the events
processed during the pod lifecycle, along with health checking of all nodes and the whole Kubernetes cluster.
I do not have hands-on experience with Docker for Windows, so my point of view is based on a
Kubernetes-with-Linux-containers perspective.
To collect and analyze all of this information there are tools like Fluentd and Logstash,
accompanied by tools such as Elasticsearch and Kibana.
Such cluster-level log aggregation can be realized using the Kubernetes orchestration framework.
So we can expect some running containers to take care of gathering data while other containers
take care of other layers of abstraction, such as analysis and presentation.
Please notice that some solutions depend on cloud platform features where Kubernetes environment
is running. For example, GCP offers Stackdriver Logging.
We can distinguish several layers of log probing and analysis:
monitoring a pod is the most rudimentary form of viewing Kubernetes logs.
You use kubectl commands to fetch log data for each pod individually.
These logs are stored in the pod, and when the pod dies, the logs die with it.
monitoring a node. Collected logs for each node are stored in a JSON file, which can get really large.
Node-level logs are more persistent than pod-level ones.
monitoring a cluster.
Kubernetes doesn’t provide a default logging mechanism for the entire cluster, but leaves this up
to the user and third-party tools to figure out. One approach is to build on the node-level logging.
This way, you can run a logging agent on every node and combine their output.
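The node-agent pattern above is usually deployed as a DaemonSet, so exactly one agent pod runs on every node. A minimal illustrative sketch (the image tag, namespace, and Elasticsearch address are placeholders):

```yaml
# Illustrative DaemonSet running a Fluentd agent on each node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.14.3-debian-elasticsearch7-1.0
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log    # read node-level container logs
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
```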
As you see, there is a gap at the cluster-monitoring level, so there is a reason to aggregate the current logs and
offer a practical way to analyze and present the results.
On the node level, a popular log aggregator is Fluentd. It is implemented as a Docker container
and runs in parallel with the pod lifecycle. Fluentd does not store the logs itself.
Instead, it sends them to an Elasticsearch cluster that stores the log information in a replicated set of nodes.
It looks like Elasticsearch is used as a data store for the aggregated logs of the worker nodes.
This aggregator cluster consists of a pod with two instances of Elasticsearch.
The aggregated logs in the Elasticsearch cluster can be viewed using Kibana.
This presents a web interface, which provides a more convenient interactive method for querying the ingested logs.
The Kibana pods are also monitored by the Kubernetes system to ensure they are running healthily and the expected
number of replicas are present.
The lifecycle of these pods is controlled by a replication-controller specification similar in nature to how the
Elasticsearch cluster was configured.
Back to your question: I'm pretty sure that what is described above also works with Kubernetes and Docker
for Windows. On the other hand, I think a cloud platform or an on-premises Linux environment
is the natural place for these tools to live.
Answer was inspired by Cluster-level Logging of Containers with Containers and Kubernetes Logging articles.
I also like the Configuring centralized logging from Kubernetes page, and I used An Introduction
to logging in Kubernetes when I was getting started with Kubernetes.
I am currently trying to break into data engineering, and I figured the best way to do this was to get a basic understanding of the Hadoop stack (I played around with the Cloudera quickstart VM and went through the tutorial) and then try to build my own project. I want to build a data pipeline that ingests Twitter data, stores it in HDFS or HBase, and then runs some sort of analytics on the stored data. I would also prefer to use real-time streaming data, not historical/batch data. My data flow would look like this:
Twitter Stream API --> Flume --> HDFS --> Spark/MapReduce --> Some DB
Does this look like a good way to bring in my data and analyze it?
Also, how would you guys recommend I host/store all this?
Would it be better to have one instance on AWS EC2 for Hadoop to run on, or should I run it all in a local VM on my desktop?
I plan to have only one node cluster to start.
First of all, Spark Streaming can read from Twitter, and in CDH, I believe that is the streaming framework of choice.
Your pipeline is reasonable, though I might suggest using Apache NiFi (which is in the Hortonworks HDF distribution), or Streamsets, which is installable in CDH easily, from what I understand.
Note, these are running completely independently of Hadoop. Hint: Docker works great with them. HDFS and YARN are really the only complex components that I would rely on a pre-configured VM for.
Both NiFi and StreamSets give you a drag-and-drop UI for hooking Twitter up to HDFS and the "other DB".
Flume can work, and a single pipeline is easy to set up, but it just hasn't matured to the level of the other streaming platforms. Personally, I like a Logstash -> Kafka -> Spark Streaming pipeline better, for example because Logstash configuration files are nicer to work with (a Twitter input plugin is built in), and Kafka works with a bunch of tools.
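For example, a hedged sketch of the Logstash half of that pipeline (the credentials, keywords, broker address, and topic name are all placeholders; both plugins are available in standard Logstash installs):

```
# Illustrative Logstash pipeline: Twitter stream into a Kafka topic.
input {
  twitter {
    consumer_key       => "<KEY>"
    consumer_secret    => "<SECRET>"
    oauth_token        => "<TOKEN>"
    oauth_token_secret => "<TOKEN_SECRET>"
    keywords           => ["hadoop", "spark"]
  }
}
output {
  kafka {
    bootstrap_servers => "localhost:9092"
    topic_id          => "tweets"
  }
}
```

Spark Streaming (or anything else) can then consume from the "tweets" topic independently of the ingestion side.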
You could also try out Kafka with Kafka Connect, or use Apache Flink for the whole pipeline.
The primary takeaway: you can bypass Hadoop here, or at least have something like this
Twitter > Streaming Framework > HDFS
.. > Other DB
... > Spark
Regarding running locally or not, as long as you are fine with paying for idle hours on a cloud provider, go ahead.