Solution for Data pipeline,ETL load monitoring - monitoring

My team is looking for a solution (both in house or tools) for performance monitoring and operation management for 500 plus SSIS ETL loads ( with varied run frequencies- daily, weekly, monthly etc) and 100 plus data pipelines ( currently Hadoop is used as data lake storage layer but the plan is to migrate to data bricks hosted on AWS). Data pipes will increase as ML And AI needs evolve over time. The solution should be able to handle input from SSIS as well as from data pipelines. Primary goals for this solution are:
Generate a dashboard that shows data quality anomalies ( ETL source sent 100 but destination received only 90, For pipeline x -- avg data volume received is reduced by 90%).
Send alerts on failures. Can integrate with service now to create tickets for severe failure etc.
Allow the operation team to quickly figure out important performance metrics -- Execution run time, any slowness in a particular pipeline/bottlenecks.
Ability to customize the dashoards.
Right now we are thinking a SQL server based centralized logging table which will get metrics from ETL and pipelines and then expose this data to powerbi and create custom dashboards. Write some API to integrate with service now to create alerts. But this solution might be hard to scale.
Can someone suggest me some good Data monitoring tools which can serve our needs. My google search came up with these 3 tools :
Data Dog, Data fold, Dyna trace and ELK stack.

Related

How does apache beam access bigtable data?

If BigtableIO.Read is run in dataflow, is the data being accessed via a bigtable node or going directly to bigtable tablets?
Bigtable architecture has:
client requests go through a front-end server before they are sent to a Cloud Bigtable node
and goes on to say:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets to help balance the workload of queries... Tablets are stored on Colossus, Google's file system, in SSTable format
(The concern is if there is a dataflow job running at the same as users are making individual request that definitely go through the nodes, whether there will be a small or large amount of contention from the dataflow job. I would guess that if the dataflow job went through the nodes there would be significantly more contention as opposed to hitting the tablets directly.)
Beam BigTable connector uses the Cloud BigTable's public API hence requests will be going through the BigTable front end server nodes.
See here for bit more detail regarding BigTable client API usage of the Beam connector.

Is there a way to limit the performance data being recorded by AKS clusters?

I am using azure log analytics to store monitoring data from AKS clusters. 72% of the data stored is performance data. Is there a way to limit how often AKS reports performance data?
At this point we do not provide a mechanism to change performance metric collection frequency. It is set to 1 minute and cannot be changed.
We were actually thinking about adding an option to make more frequent collection as was requested by some customers.
Given the number of objects (pods, containers, etc) running in the cluster collecting even a few perf metrics may generate noticeable amount of data... You need that data in order to figure out what is going on in case of a problem.
Curious: you say your perf data is 72% of total - how much is it in terms om Gb/day, do you now? Do you have any active applications running on the cluster generating tracing? What we see is that once you stand up a new cluster, perf data is "the king" of volume, but once you start ading active apps that trace, logs become more and more of a factor in telemetry data volume...

What is recommended solution for monitoring heterogeneous infrastructure?

I am looking for monitoring tool for the following use cases:
Collect basic metrics about virtual machine (cpu usage, memory usage, i/o, available space)
Extract metrics from SQL Server (probably running some queries)
Extract information from external service about processing i.e how many processing are currently running and for how long. I am thinking about writing python scripts, but don't know how to combine with monitoring tool
Have the ability to plot charts and manage alerts and it will nice to have ability to send not only mails, but send message to slack/ms teams.
I was thing about Prometheus, because it has wmi_exporter, node_exporter, sql exporter, alert manager with possibility to send notifications to multiple destinations, but I don't know what to do with this external service and python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can definitely fill in the blanks yourself.
E.g. you can get machine metrics basically out of the box by firing up a node_exporter and having it scraped by Prometheus, but I don't think it has e.g. information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there exists a Python client library to help with that. Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short lived batch jobs.
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.

Fetch data subset from gmond

This is in the context of a small data-center setup where the number of servers to be monitored are only in double-digits and may grow only slowly to few hundreds (if at all). I am a ganglia newbie and have just completed setting up a small ganglia test bed (and have been reading and playing with it). The couple of things I realise -
gmetad supports interactive queries on port 8652 using which I can get metric data subsets - say data of particular metric family in a specific cluster
gmond seems to always return the whole dump of data for all metrics from all nodes in a cluster (on doing 'netcat host 8649')
In my setup, I dont want to use gmetad or RRD. I want to directly fetch data from the multiple gmond clusters and store it in a single data-store. There are couple of reasons to not use gmetad and RRD -
I dont want multiple data-stores in the whole setup. I can have one dedicated machine to fetch data from the multiple, few clusters and store them
I dont plan to use gweb as the data front end. The data from ganglia will be fed into a different monitoring tool altogether. With this setup, I want to eliminate the latency that another layer of gmetad could add. That is, gmetad polls say every minute and my management tool polls gmetad every minute will add 2 minutes delay which I feel is unnecessary for a relatively small/medium sized setup
There are couple of problems in the approach for which I need help -
I cannot get filtered data from gmond. Is there some plugin that can help me fetch individual metric/metric-group information from gmond (since different metrics are collected in different intervals)
gmond output is very verbose text. Is there some other (hopefully binary) format that I can configure for export?
Is my idea of eliminating gmetad/RRD completely a very bad idea? Has anyone tried this approach before? What should I be careful of, in doing so from a data collection standpoint.
Thanks in advance.

Capture / Monitor system data of application server in Graphite

I am using graphite server to capture my metrics data and bring down to graphs. I have 4 application servers which is load balancer setup. My aim is capture system data such as cpu usage, memory usage, disk load, etc., for all the 4 application servers. I setup an graphite environment in a separate server and i wanted to push the system data for all the applications servers to graphite and get it display as graphs. I don't know what needs to be done for feeding system data to graphite. My thinking was to install statsd in all application servers and feed the system data to graphite but looks like statsd does not support system data rather application data.
Can anyone help me to catch the right track. Thanks in advance.
Running collectd with a graphite agent would be an excellent start to gather the information your after.
There is an almost unlimited amount of ways to get your data into graphite.
You can find a list of tools that have known to work very well with graphite on the readthedocs.org page: http://graphite.readthedocs.org/en/0.9.10/tools.html
There is also an example script that gathers load average from the system in the carbon project: example-client.py

Resources