Monitoring Tools for Kafka

I have my Kafka (0.9.0.0) and ZooKeeper setup working: 3 Kafka nodes and 3 ZooKeeper nodes, and it is all running absolutely fine.
Now I am looking for a set of monitoring tools to monitor topics, the load on each node, and memory usage. Are there any good tools?
1) Kafka Manager -- I have tried exploring it, but it only supports Kafka versions up to 0.8.2.2.
2) Ganglia -- it gives an overview of some things, but it puts too much load on the Kafka nodes and needs to be installed on each node.
3) Nagios -- it works OK, but it doesn't provide much detail and also needs to be installed on each node.
So, what I am looking for is a tool that gives me insight into health, performance, memory consumption, and node status for each Kafka and ZooKeeper node, with notifications if something goes wrong, without needing to be installed on every node. I was planning to have a central monitoring server hosting all the tools used for this.
What set of tools do people use for monitoring their Kafka clusters?

SPM for Kafka monitoring can do everything you mentioned. It works with all recent versions of Kafka and pulls some 100-200 metrics. You also mention ZooKeeper, which SPM can monitor for you, too. Note that you will still want the monitoring agent installed on each node. Why? Because you will want to get the OS metrics, too.
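Whatever tool you choose, most of them ultimately read the brokers' JMX metrics. As a quick, agent-free sanity check from your central monitoring server, you can poll a broker's JMX endpoint with the JmxTool class that ships with Kafka; the hostname kafka-node1 and JMX port 9999 below are placeholders for your setup:

# On each broker, expose JMX before starting Kafka (port is an example):
export JMX_PORT=9999
# From the central monitoring server, sample a broker metric every 10 seconds:
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://kafka-node1:9999/jmxrmi \
  --object-name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec \
  --reporting-interval 10000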

Related

Does it make sense to manage Docker containers on one or a few single hosts with Kubernetes?

I'm using Docker on a bare-metal server, and I'm pretty happy with docker-compose for configuring and setting up applications.
Still, some features are missing, like configuration management and monitoring. Maybe there are other solutions to these issues, but I'm a bit overwhelmed by the feature set of Kubernetes and can't judge whether it would help me here.
I'm also open for recommendations to solve the requirements separately:
Configuration / Secret management
Monitoring of my Docker-hosted applications (e.g. having some kind of dashboard)
Remote container control (SSH is okay with only one server)
Being ready to scale my environment (based on multiple different Dockerized applications) to more than one server in the future -- I'm already thinking about networking/service discovery issues with a pure docker-compose setup
I'm sure Kubernetes covers some of these features, but I have the feeling that it's too focused on cloud platforms where machines are created on the fly (since I only have at most a few bare-metal servers).
I hope the question's scope is not too broad; otherwise, please use the comment section and help me narrow it down.
Thanks.
I think Kubernetes absolutely matches your requirements and is what you need.
Let's go through them one by one.
I have the feeling that it's too focused on cloud platforms where machines are created on the fly (since I only have at most a few bare-metal servers)
No, it is not focused on clouds. Kubernetes can be installed on almost any bare-metal platform (including ARM), and there are plenty of tools and instructions to help you do it. Also, it is easy to try it on your local PC using Minikube, which will prepare a local cluster for you inside a VM or directly in your OS (Linux only).
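For example, getting a local playground cluster is just a couple of commands, assuming minikube and kubectl are already installed:

minikube start     # boots a single-node cluster inside a local VM
kubectl get nodes  # verify the node reports Ready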
Configuration / Secret management
Kubernetes has powerful configuration and secret management based on dedicated objects (ConfigMaps and Secrets) which can be attached to your containers. You can read more about configuration management in that article.
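As a minimal sketch (the names and values here are examples, not from your setup):

# Plain settings go into a ConfigMap, sensitive values into a Secret:
kubectl create configmap myapp-config --from-literal=LOG_LEVEL=info
kubectl create secret generic myapp-secret --from-literal=DB_PASSWORD=changeme
# Both can then be exposed to your pods as environment variables or files:
kubectl get configmap myapp-config -o yaml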
Moreover, tools like Helm can give you more automation and a range of preconfigured applications which you can install with a single command. And you can prepare your own charts for it.
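For instance, with the Helm CLI of that era (Helm 2), installing a preconfigured chart is a single command; the chart and release names below are examples:

helm install stable/mysql --name my-database   # deploys MySQL from the stable chart repository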
Monitoring of my Docker-hosted applications (e.g. having some kind of dashboard)
Kubernetes has its own dashboard where you can see many kinds of information: current application status, configuration, statistics, and more. Also, Kubernetes integrates well with Heapster, which can be combined with Grafana for powerful visualization of almost anything.
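Assuming the dashboard add-on is deployed in your cluster, you can reach it through the API server proxy; the /ui shortcut below matches the dashboard versions of that era:

kubectl proxy   # opens a local proxy to the API server on port 8001
# then browse to http://localhost:8001/ui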
Remote container control (SSH is okay with only one server)
Kubernetes' control tool kubectl can fetch logs from and connect to containers in the cluster without any problems. For example, to connect to a container in a pod called "myapp" you just call kubectl exec -it myapp sh, and you will get an sh session in the container. Also, you can reach any application inside your cluster using the kubectl port-forward command, which will forward a port you need to your PC.
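A few concrete commands as a sketch, assuming a pod named myapp that listens on port 8080:

kubectl logs -f myapp                 # stream the container's logs
kubectl exec -it myapp sh             # open an sh session inside the container
kubectl port-forward myapp 8080:8080  # forward the pod's port 8080 to localhost:8080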
Being ready to scale my environment (based on multiple different Dockerized applications) to more than one server in the future -- I'm already thinking about networking/service discovery issues with a pure docker-compose setup
Kubernetes can be scaled up to thousands of nodes -- or can have only one; it is your choice. Independent of cluster size, you get production-grade networking, service discovery, and load balancing.
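Scaling a workload out is likewise a single command; the deployment name and replica count are examples:

kubectl scale deployment myapp --replicas=5   # Kubernetes spreads the pods across available nodes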
So, don't be afraid -- just try it locally with Minikube. It will make many operational tasks simpler, not more complex.

Advantages of dockerizing a Java Spring Boot application?

We are working with a dockerized Kafka environment. I would like to know the best practices for deploying Kafka Connect connectors and Kafka Streams applications in such a scenario. Currently we deploy each connector and stream as a Spring Boot application started as a systemd service (via systemctl). I do not see a significant advantage in dockerizing each Kafka connector and stream. Please share your insights on this.
To me the Docker vs non-Docker thing comes down to "what does your operations team or organization support?"
Dockerized applications have an advantage in that they all look and act the same: you docker run a Java app the same way you docker run a Ruby app. Whereas with an approach of running programs with systemd, there's usually no common abstraction layer around "how do I run this thing?"
Dockerized applications may also abstract away some small operational details, like port management -- i.e. making sure all your apps' management.ports don't clash with each other. An application in a Docker container listens on one port inside the container, and you can expose that port as some other number outside (either random, or one of your choosing).
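For example (image names and host ports are illustrative):

# Both apps listen on 8080 inside their containers;
# each is exposed on a different host port, so they never clash:
docker run -d -p 9001:8080 myorg/connector-a
docker run -d -p 9002:8080 myorg/connector-b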
Depending on the infrastructure support, a normal Docker scheduler may auto-scale a service when that service reaches some capacity. However, in Kafka Streams applications the concurrency is limited by the number of partitions in the Kafka topics, so scaling up will just mean some consumers in your consumer groups go idle (if there are more consumers than partitions).
But it also adds complications: if you use RocksDB as your local store, you'll likely want to persist that outside the (disposable, and maybe read-only!) container. So you'll need to figure out how to do volume persistence, operationally / organizationally. With plain ol' JARs with systemd... well, you always have the hard drive, and if the server crashes it will either restart (physical machine) or hopefully be restored by some instance block storage thing.
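A sketch of the volume approach, assuming the app keeps its RocksDB state under Kafka Streams' default state.dir of /tmp/kafka-streams (the host path and image name are examples):

# Bind-mount a host directory over the state dir so it survives container restarts:
docker run -d -v /data/kstreams-state:/tmp/kafka-streams myorg/my-kstreams-app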
By this I mean to say: Kafka Streams apps are not stateless web apps serving HTTP traffic, where auto-scaling always buys you more capacity. The people making these decisions at an organization or operations level may not fully know this. Then again, if everyone ships Docker images, the organization / operations team "just" has some Docker scheduler clusters (like a Kubernetes cluster or an Amazon ECS cluster) to manage, and doesn't have to manage VMs as directly anymore.
Dockerizing plus clustering with Kubernetes provides many benefits, like auto-healing and automatic horizontal scaling.
Auto-healing: if a Spring application crashes, Kubernetes will automatically start another instance and ensure the required number of containers is always up.
Automatic horizontal scaling: if you get a burst of messages, you can tune your Spring applications to scale up or down automatically using the Horizontal Pod Autoscaler (HPA), which can also use custom metrics.
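A minimal CPU-based autoscaling rule as a sketch (the deployment name and thresholds are examples; custom-metric HPAs need extra setup):

kubectl autoscale deployment my-spring-app --min=1 --max=5 --cpu-percent=80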

Any recommendation about running Prometheus in a Docker container or not?

Our team decided to switch to Prometheus monitoring, so I wonder how to set up a highly available, fault-tolerant Prometheus installation.
We have a bunch of small projects running on AWS ECS, and almost all services are containerized. So I have some questions.
Should we containerize Prometheus?
That would mean running 2 EC2 instances with one Prometheus container per instance and one node_exporter per instance, plus a highly available Alertmanager in containers with Weave Mesh, one per instance, on separate instances.
Or should we just install the Prometheus binary and the other pieces directly on EC2 and forget about containerizing them?
Any ideas? Do any best practices exist for a highly available Prometheus setup?
Don't run node_exporter inside a container, as you'll greatly limit the number of metrics exposed.
There is also an HA guide relating to Prometheus setups that may be of use to you.
Also, this question would be better suited to the Prometheus users mailing list.
Running Prometheus inside a container works if you configure some additional options, especially for node_exporter. The challenge with this approach is that node_exporter gathers metrics from what it sees as the local machine -- the container, in this case -- while we want it to gather metrics from the host instead. node_exporter offers options to override the filesystem mount points from which this data is gathered.
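A hedged example along the lines of the node_exporter documentation: run the container in the host's network and PID namespaces and bind-mount the host root so the exporter reads host metrics instead of the container's:

docker run -d --net="host" --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter \
  --path.rootfs=/host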
See "Step 2" in https://www.digitalocean.com/community/tutorials/how-to-install-prometheus-using-docker-on-ubuntu-14-04 for detailed instructions.

Google Cloud Platform: how to monitor memory usage of VM instances

I have recently performed a migration to Google Cloud Platform, and I really like it.
However, I can't find a way to monitor the memory usage of the Dataproc VM instances. As you can see in the attachment, the console provides utilization info about CPU, disk, and network, but not about memory.
Without knowing how much memory is being used, how can you tell whether extra memory is needed?
By installing the Stackdriver agent in GCE VMs, additional metrics like memory usage can be monitored. Stackdriver also offers alerting and notification features. Note, however, that agent metrics are only available for premium-tier accounts.
See this answer for Dataproc VMs.
The Stackdriver agent only supports monitoring of RAM for the E2 machine family at the moment. Other instance types such as N1, N2, ... are not supported.
See the latest documentation of what is supported: https://cloud.google.com/monitoring/api/metrics_gcp#gcp-compute
Well, you can use the /proc/meminfo virtual file system to get information on current memory usage. You can create a simple bash script that reads the memory usage information from /proc/meminfo, run it periodically as a cron job, and have it send an alert email if memory usage exceeds a given threshold.
See this link: https://pakjiddat.netlify.app/posts/monitoring-cpu-and-memory-usage-on-linux
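A minimal sketch of such a script (the threshold and recipient address are placeholders, and mail must be configured on the host):

#!/bin/bash
# Alert when used memory exceeds the threshold, computed from /proc/meminfo.
THRESHOLD=90
USED=$(awk '/MemTotal/{t=$2} /MemAvailable/{a=$2} END{printf "%d", (t-a)*100/t}' /proc/meminfo)
if [ "$USED" -gt "$THRESHOLD" ]; then
  echo "Memory usage is ${USED}% on $(hostname)" | mail -s "Memory alert" admin@example.com
fi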
The most up-to-date answer here.
How to see memory usage in GCP?
Install the agent on your virtual machine. Takes less than 5 minutes.
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh
sudo apt-get update
sudo apt-get install stackdriver-agent
The code snippet above should install the most recent version of the agent, but for an up-to-date guide you can always refer to
https://cloud.google.com/monitoring/agent/installation#joint-install.
After it's installed, within a minute or two, you should see the additional metrics in the Monitoring section of GCP.
https://console.cloud.google.com/monitoring
Explanation: why is it invisible by default?
Metrics such as CPU usage or memory usage can be collected in different places. For instance, CPU usage is a piece of information that the host (the machine with special software running your virtual machine) can collect.
The thing with memory usage and virtual machines is that it's the guest operating system (the operating system of your virtual machine) that manages it. The host cannot really know how much is used; all it can see in the memory given to that virtual machine is a stream of bytes.
That's why the idea is to install an agent inside the virtual machine that collects the metrics from the inside and ships them somewhere they can be interpreted. There are many types of agents available out there, but Google promotes its own -- the Monitoring Agent -- and it integrates well with the entire GCP suite.
The agent metrics page may be useful:
https://cloud.google.com/monitoring/api/metrics_agent
You'll need to install Stackdriver. See: https://app.google.stackdriver.com/?project="your project name"
The Stackdriver metrics page will provide some guidance. You will need to change the project name (e.g. sinuous-dog-133823) to suit your account:
https://app.google.stackdriver.com/metrics-explorer?project=sinuous-dog-133823&timeSelection={"timeRange":"6h"}&xyChart={"dataSets":[{"timeSeriesFilter":{"filter":"metric.type=\"agent.googleapis.com/memory/bytes_used\" resource.type=\"gce_instance\"","perSeriesAligner":"ALIGN_MEAN","crossSeriesReducer":"REDUCE_NONE","secondaryCrossSeriesReducer":"REDUCE_NONE","minAlignmentPeriod":"60s","groupByFields":[],"unitOverride":"By"},"targetAxis":"Y1","plotType":"LINE"}],"options":{"mode":"COLOR"},"constantLines":[],"timeshiftDuration":"0s","y1Axis":{"label":"y1Axis","scale":"LINEAR"}}&isAutoRefresh=true
This REST call will get you the memory usage. You will need to modify the parameters to suit your project name (e.g. sinuous-dog-133823) and other params to suit your needs.
GET /v3/projects/sinuous-cat-233823/timeSeries
    ?filter=metric.type="agent.googleapis.com/memory/bytes_used" resource.type="gce_instance"
    &aggregation.crossSeriesReducer=REDUCE_NONE
    &aggregation.alignmentPeriod=+60s
    &aggregation.perSeriesAligner=ALIGN_MEAN
    &secondaryAggregation.crossSeriesReducer=REDUCE_NONE
    &interval.startTime=2019-03-06T20:40:00Z
    &interval.endTime=2019-03-07T02:51:00Z
    &$unique=gc673 HTTP/1.1
Host: content-monitoring.googleapis.com
authorization: Bearer <your token>
cache-control: no-cache
VM memory metrics are not available by default; they require the Cloud Monitoring agent.
The UI you are showing is Dataproc, which already has the agent installed but disabled by default, so you don't have to reinstall it. To enable the Cloud Monitoring agent for Dataproc clusters, set --properties dataproc:dataproc.monitoring.stackdriver.enable=true when creating the cluster. Then you can monitor VM memory and create alerts in the Cloud Monitoring UI (not integrated with the Dataproc UI yet).
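For example, at cluster creation time (the cluster name and region are placeholders):

gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties=dataproc:dataproc.monitoring.stackdriver.enable=true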
Also see this related question: Dataproc VM memory and local disk usage metrics
This answer is now out of date, as Stackdriver is now a legacy agent; it has been replaced by the Ops Agent. Please read the latest GCP articles about migrating to the Ops Agent.

Cluster management and service discovery

I want to introduce a service discovery / cluster management solution into my deployments. As far as I can see, Mesos is one solution, but I'm worried about how much RAM it consumes once the agents for Marathon, Chronos, Mesos, etc. are installed; my boxes have at most 512 MB of RAM.
Is it feasible to install Mesos on boxes with low resources?
Is Consul a replacement for Mesos?
Your question is really a number of questions:
Mesos is a great solution for cluster management. It is tested in production at high scale at Twitter.
Mesos doesn't provide a service discovery mechanism.
Mesos requires other components in order to provide a full solution. There is no single solution for all environments / topologies. The leading supplements are provided by Mesosphere and include Marathon (at a minimum).
Memory requirements will vary based on the number of slaves. The starting requirement is 3 MB each for the master and the slave, making it feasible to install on nodes with low resources.
Consul is a service discovery component and does not replace Mesos; they are complementary. In fact, Keen Labs has modified Marathon to integrate Mesos with Consul. See: https://github.com/keenlabs/marathon/commit/290036e34337dcd6483550b7ab7d723bc4378d5f
