Differentiate Databricks streaming queries in Datadog - monitoring

I am trying to set up a Datadog dashboard that shows the streaming metrics for my streaming job. The job contains two tasks: one has 2 streaming queries and the other has 4 (both tasks use the same cluster). I followed the instructions here to install Datadog on the driver node. However, when I try to create a dashboard in Datadog, there is no way to differentiate between the 6 streaming queries, so they are all lumped together (none of the metric tags differ per query).

After some digging I found an option that can be enabled via the init script, enable_query_name_tag. It is disabled by default because it can create a very large number of tags when you are not using query names.
The modification is shown here:
instances:
  - spark_url: http://\$DB_DRIVER_IP:\$DB_DRIVER_PORT
    spark_cluster_mode: spark_standalone_mode
    cluster_name: \${hostip}
    streaming_metrics: true
    enable_query_name_tag: true   # <---- added
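Since the option only helps when the queries carry names, here is a minimal sketch of starting a query with an explicit name through the Structured Streaming writer's queryName method. The helper name, the console sink, and the checkpoint path are assumptions for illustration, and how the name ends up in the Datadog tag should be checked against the integration's documentation.

```python
def start_named_query(stream_df, query_name, checkpoint_dir):
    """Start a Structured Streaming query with an explicit name.

    stream_df is assumed to be a streaming pyspark.sql.DataFrame.
    The console sink and checkpoint path are placeholders, not the
    original job's configuration.
    """
    return (
        stream_df.writeStream
        .queryName(query_name)  # a stable, human-readable name per query
        .option("checkpointLocation", checkpoint_dir)
        .format("console")
        .start()
    )
```

Giving each of the 6 queries a distinct, stable name (rather than leaving them unnamed) also keeps the number of distinct tag values small.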

Related

Displaying slowest Jenkins pipelines in Grafana

I set up a monitoring system to track our Jenkins pipelines using Prometheus and Grafana (see Jenkins Prometheus Plugin). I am building some dashboards and while doing so I tried to create a table graph that displays the 5 slowest pipelines. This is the Prometheus query I used:
topk (5, max by (jenkins_job) (default_jenkins_builds_last_build_duration_milliseconds / 60000))
Grafana Table Visual
However, instead of displaying 5 lines, the table shows numerous timestamps for every pipeline. Does anybody have an idea how to solve this? I tried several approaches described on Stack Overflow (e.g. this), without success. Thanks in advance!
5 slowest Jenkins pipelines, one record each
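To make the intended result concrete, here is a small sketch in Python (not PromQL) of what "one record each" means: keep only the most recent sample per job, convert it to minutes, and take the top 5. The sample tuples and function name are hypothetical. In Grafana this most likely corresponds to evaluating the query at a single point in time (an instant query) rather than over the dashboard's time range, but that is an assumption to verify against your Grafana version.

```python
from typing import Dict, List, Tuple

# (jenkins_job, unix_timestamp, last_build_duration_ms) -- hypothetical samples
Sample = Tuple[str, int, float]


def slowest_pipelines(samples: List[Sample], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k slowest jobs as (job, minutes), one row per job."""
    latest: Dict[str, Tuple[int, float]] = {}
    for job, ts, duration_ms in samples:
        # Keep only the newest sample for each job.
        if job not in latest or ts > latest[job][0]:
            latest[job] = (ts, duration_ms)
    in_minutes = [(job, ms / 60000) for job, (_, ms) in latest.items()]
    # Sort descending by duration and keep the first k rows.
    return sorted(in_minutes, key=lambda row: row[1], reverse=True)[:k]
```

A range query, by contrast, evaluates the expression at every step, which is why a table fed by it shows one row per timestamp per pipeline.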

Can Heroku Postgres dynos talk with Datadog?

I have a Postgres dyno on Heroku and I use Datadog.
Datadog provides two Postgres dashboards by default: Metrics and Overview.
Metrics is working (CPU usage, memory, I/O, ...) but Overview is not (deadlocks, index usage).
Are Heroku Postgres dynos and Datadog fully compatible?
There are a number of ways you can get metrics out of the box into datadog for your application or service.
Assuming you are using a backend service, since you mentioned Postgres, you can use one of the many Datadog libraries. One I use for Node applications is dd-trace; it has a number of plugins, including one for Postgres (connections made with the pg library), out of the box. Although these give you a lot of metrics about the queries run and help identify application-level bottlenecks, you will need to do additional work to get other things, such as deadlocks, index usage, and connected users, into Datadog. There are two main ways to go about this:
Create a custom metric and query the DB from the application using dd-trace, mentioned above.
Use a custom buildpack to launch a separate Datadog agent on Heroku; this runs another process alongside the application in the Heroku instance whose sole purpose is to ship metrics.

Any way of monitoring Airflow DAG's execution time?

I'd like to use Airflow with StatsD and DataDog to monitor whether a DAG takes, e.g., twice as long as its previous execution(s). So I need some kind of real-time timer for a DAG (or operator).
I'm aware that Airflow supports some metrics.
However, to my understanding all of the metrics are related to finished tasks/DAGs, right? So it's not the solution, because I'd like to monitor running DAGs.
I also considered the execution_timeout/SLA features, but they are not suitable for this use case.
I'd like to be notified that some DAG hangs, but I don't want to kill it.
There are a number of different ways you could handle this:
In the past I've configured a telemetry DAG which would collect the current state of all tasks/DAGs by querying the metadata tables. I'd collect these metrics and push them up to CloudWatch. This became problematic as these internal fields change often so we would run into issues when trying to upgrade to newer versions of Airflow.
There are also some well-maintained Prometheus exporters that some companies have open sourced. By setting these up you could poll the exposed export path as frequently as you wanted to (DataDog supports Prometheus).
These are just some of your options. Since the Airflow webserver is just a Flask app you can really expose metrics in whatever way you see fit.
As I understand it, you can monitor running tasks in DAGs using DataDog; refer to the Airflow integration docs.
You can send metrics via DogStatsD (see its docs). This page is also useful for understanding what to monitor.
E.g., metrics such as the following:
airflow.operator_failures: number of operator failures.
airflow.operator_successes: number of operator successes.
airflow.dag_processing.processes: number of currently running DAG parsing processes.
airflow.scheduler.tasks.running: number of tasks running in the executor (shown as task).
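For the original goal (being notified when a running DAG takes about twice as long as before, without killing it), one option is to emit a custom gauge yourself. The sketch below assumes you already obtain the elapsed time of the running DAG run and the duration of a previous run from somewhere, for example the telemetry approach described above. The metric name, tag, and helper names are hypothetical; only the DogStatsD datagram layout (name:value|g|#tags over UDP, port 8125 by default) is taken from the protocol.

```python
import socket
from typing import Dict, Optional


def slowdown_ratio(elapsed_s: float, previous_s: float) -> Optional[float]:
    """Ratio of the running DAG run's elapsed time to a previous run's duration."""
    if previous_s <= 0:
        return None  # no usable baseline yet
    return elapsed_s / previous_s


def gauge_line(metric: str, value: float, tags: Dict[str, str]) -> str:
    """Format a DogStatsD gauge datagram: <name>:<value>|g|#<k>:<v>,..."""
    tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return f"{metric}:{value:.3f}|g|#{tag_str}"


def send_gauge(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram over UDP to a local agent (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

A monitor on this gauge with a threshold of 2 would then notify you while the DAG is still running, and nothing in this path touches the DAG itself.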

Second and Third Distributed Kafka Connector workers failing to work correctly

With a Kafka cluster of 3 and a Zookeeper cluster of the same size, I brought up one distributed connector node. This node ran successfully with a single task. I then brought up a second connector; it seemed to run, as some of the code in the task definitely ran. However, it then didn't seem to stay alive (no errors were thrown; I inferred this from a lack of expected activity, while the first connector continued to function correctly). When I call the URL http://localhost:8083/connectors/mqtt/tasks on each connector node, it tells me the connector has one task. I would expect two tasks, one for each node/worker. (Currently the worker configuration says tasks.max = 1, but I've also tried setting it to 3.)
When I try and bring up a third connector, I get the error:
"POST /connectors HTTP/1.1" 500 90 5
(org.apache.kafka.connect.runtime.rest.RestServer:60)
ERROR IO error forwarding REST request:
(org.apache.kafka.connect.runtime.rest.RestServer:241)
java.net.ConnectException: Connection refused
Trying to call the connector POST method again from the shell returns the error:
{"error_code":500,"message":"IO Error trying to forward REST request:
Connection refused"}
I also tried upgrading to Apache Kafka 0.10.1.1, which was released today. I'm still seeing the problems. The connectors each run in an isolated Docker container, all defined by a single image. They should be identical.
The problem could be that I'm trying to run the POST request to http://localhost:8083/connectors on each worker, when I only need to run it once on a single worker and then the tasks for that connector will automatically distribute to the other workers. If this is the case, how do I get the tasks to distribute? I currently have the max set to three, but only one appears to be running on a single worker.
Update
I ultimately got things running using essentially the same approach that Yuri suggested. I gave each worker a unique group ID, then gave each connector task the same name. This allowed the three connectors and their single tasks to share a single offset, so that in the case of sink connectors the messages they consumed from Kafka were not duplicated. They are basically running as standalone connectors since the workers have different group ids and thus won't communicate with each other.
If the connector workers have the same group ID, you can't add more than one connector with the same name. If you give the connectors different names, they will have different offsets and consume duplicate messages. If you have three workers in the same group, one connector and three tasks, you would theoretically have an ideal situation where the tasks share an offset and the workers make sure the tasks are always running and well distributed (with each task consuming a unique set of partitions). In practice the connector framework doesn't create more than one task, even with tasks.max set to 3 and when the topic tasks are consuming has 25 partitions.
If anyone knows why I'm seeing this behaviour, please let me know.
I've encountered a similar issue in the same situation as yours.
tasks.max is configured for a topic, and the distributed workers automatically decide which nodes handle the topic. So if you have 3 workers in a cluster and your topic configuration says tasks.max=2, then only 2 of the 3 workers will process the topic. In theory, if one of the workers fails, the 3rd should pick up the workload. But...
The distributed connector turned out to be very unreliable: once you add or remove nodes, the cluster breaks down and the workers do nothing but try, and fail, to elect a leader. The only fix was to restart the whole cluster, preferably all workers simultaneously.
I chose another way: I used standalone workers, which work like a charm for me because load distribution is implemented at the Kafka client level. Once a worker drops, the cluster rebalances automatically and clients connect to the unoccupied topics.
PS. Maybe this will be useful for you too: the Confluent connector does not tolerate invalid payloads that do not match the topic's schema. Once the connector gets an invalid message, it silently dies. The only way to find out is to analyze metrics.
I'm posting an answer to an old question, since Kafka Connect has moved on a lot in three years.
In the latest version (2.3.1) there is incremental rebalancing which massively improves the behaviour of Kafka Connect.
It's also worth noting that when configuring Kafka Connect, rest.advertised.host.name must be set correctly; if it is not, you will see errors including the one quoted:
{"error_code":500,"message":"IO Error trying to forward REST request: Connection refused"}
See this post for more details.
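On the question's guess that the connector only needs to be created once on a single worker: in distributed mode the connector is submitted to one worker's REST endpoint and the group takes care of the rest. Below is a minimal sketch of building that single POST /connectors request. The connector class name is a made-up placeholder, and the helper name is hypothetical. Note that tasks.max is an upper bound: the connector implementation decides how many task configurations it actually returns, so some connectors only ever start one task.

```python
import json
import urllib.request


def build_create_connector_request(worker_url, name, connector_class, tasks_max):
    """Build (but do not send) a POST /connectors request for one worker."""
    payload = {
        "name": name,
        "config": {
            "connector.class": connector_class,  # placeholder class name
            "tasks.max": str(tasks_max),          # upper bound, not a guarantee
        },
    }
    return urllib.request.Request(
        url=f"{worker_url.rstrip('/')}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Submit once, to a single worker only:
# req = build_create_connector_request(
#     "http://localhost:8083", "mqtt", "com.example.MqttSourceConnector", 3)
# urllib.request.urlopen(req)
```

If the worker that receives the request is not the leader, it forwards the request to the leader using the advertised host name, which is where the "IO Error trying to forward REST request" message comes from when that name is not reachable.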

Run a large amount of tasks on a cluster

I'm looking for a solution for running a large number of tasks on a cluster and monitoring their status.
In detail: each task consists of 3-4 processes, each running in a Docker container (each process is a docker run command). All of a task's processes have to run on the same server.
The tasks arrive in bursts of several hundred at a time.
I've looked into several solutions, all of them based on Mesos:
Chronos - Seems like it would falter under high load, and in any case it is geared towards recurring (cron) jobs, while I need one-time (heavy) jobs.
Custom Mesos FW - Seems too low-level for my needs; it would require me to write scheduling and retry mechanisms. I'd save this as a last resort.
Aurora - This seems promising, as each task runs on a single node and is composed of several processes. I am missing a couple of things here though: Aurora does not seem to be able to run several tasks as part of a single job. Since my tasks are all similar with different input, I could use a single job with many (say 400) instances, and the first process of each task (whose role is to download the input from S3) could download a different set based on the instance ID. Which brings me to another problem: I can't find a working example of using {{ mesos.instance }} in .aurora files. Can anyone give me an example?
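The "different set based on the instance ID" idea can be sketched independently of Aurora. The snippet below assumes the download process receives its instance ID and the total instance count as plain integers (for example, substituted into its command line); how that is wired up in the .aurora file is not shown here, and the function name and keys are hypothetical.

```python
from typing import List


def inputs_for_instance(keys: List[str], instance_id: int, total_instances: int) -> List[str]:
    """Pick the subset of input keys that one instance should download.

    Keys are sorted first so every instance sees the same ordering, then
    assigned round-robin: key i goes to instance i % total_instances.
    """
    if not 0 <= instance_id < total_instances:
        raise ValueError("instance_id must be in [0, total_instances)")
    return [
        key
        for index, key in enumerate(sorted(keys))
        if index % total_instances == instance_id
    ]
```

With 400 instances, each instance calls this with its own ID and the same key listing, so the subsets are disjoint and together cover every key.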
Thanks for all the fish people
You could also have a look at Kubernetes (which can also run as a framework on Mesos). Kubernetes has the concept of Pods, which are basically sets of co-located containers. So in your case a pod would consist of your 3-4 processes/containers, and these pods can then be scaled up/down.
Short comments regarding the other solutions you mentioned:
Chronos: Not really targeting your use case
Custom FW: Actually not so difficult, but good call to save this as a last resort.
Aurora: Very powerful but also complex framework.
Marathon (which you didn't mention): targeted at long-running applications that can easily be scaled up and down.
In addition to the excellent other answer, you could check out Two Sigma's Cook which they have only recently open sourced but have been using in prod at scale for a while.
