When BigtableIO.Read is run in Dataflow, is the data accessed via a Bigtable node, or does it go directly to the Bigtable tablets?
The Bigtable architecture documentation says:
client requests go through a front-end server before they are sent to a Cloud Bigtable node
and goes on to say:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets to help balance the workload of queries... Tablets are stored on Colossus, Google's file system, in SSTable format
(The concern is that if a Dataflow job is running at the same time as users are making individual requests, which definitely go through the nodes, there could be a small or large amount of contention from the Dataflow job. I would guess that if the Dataflow job went through the nodes there would be significantly more contention than if it hit the tablets directly.)
The Beam Bigtable connector uses Cloud Bigtable's public API, so requests will go through the Bigtable front-end server nodes.
See here for a bit more detail on the Beam connector's use of the Bigtable client API.
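To illustrate the point, every read issued through the public client API takes that front-end path. Here is a minimal read sketch using the google-cloud-bigtable Python client, which sits on the same API surface the Beam connector builds on (the project, instance, and table IDs are placeholders):

# Minimal read sketch via the Cloud Bigtable public API.
# Project, instance, and table IDs below are placeholders.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

# Each read_rows call is served through the Bigtable front-end nodes;
# clients never read tablets on Colossus directly.
for row in table.read_rows(limit=10):
    print(row.row_key)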
My team is looking for a solution (either in-house or a tool) for performance monitoring and operations management of 500+ SSIS ETL loads (with varied run frequencies: daily, weekly, monthly, etc.) and 100+ data pipelines (Hadoop is currently used as the data lake storage layer, but the plan is to migrate to Databricks hosted on AWS). The number of data pipelines will grow as ML and AI needs evolve over time. The solution should be able to handle input from SSIS as well as from the data pipelines. The primary goals for this solution are:
Generate a dashboard that shows data quality anomalies (e.g., an ETL source sent 100 rows but the destination received only 90; for pipeline X the average data volume received dropped by 90%).
Send alerts on failures, with the ability to integrate with ServiceNow to create tickets for severe failures, etc.
Allow the operations team to quickly see important performance metrics: execution run time, slowness in a particular pipeline, bottlenecks.
Ability to customize the dashboards.
Right now we are considering a SQL Server-based centralized logging table that would collect metrics from the ETL loads and pipelines, expose that data to Power BI for custom dashboards, and use some API integration with ServiceNow to create alerts, as sketched below. But this solution might be hard to scale.
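To make the idea concrete, here is a rough sketch of what a pipeline step would log; the connection string, table, and column names are placeholders we are still designing, not a final schema:

# Rough sketch of the centralized logging idea: each ETL load / pipeline run
# writes one row of metrics into a shared SQL Server table that Power BI reads.
# Connection string, table, and column names are placeholders.
import datetime
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=metrics-db;DATABASE=EtlMonitoring;Trusted_Connection=yes;"
)

def log_run(pipeline_name, rows_sent, rows_received, duration_seconds, status):
    """Insert one run's metrics into the central logging table."""
    with pyodbc.connect(CONN_STR) as conn:
        conn.execute(
            "INSERT INTO dbo.PipelineRunLog "
            "(PipelineName, RowsSent, RowsReceived, DurationSeconds, Status, LoggedAt) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            pipeline_name, rows_sent, rows_received, duration_seconds, status,
            datetime.datetime.utcnow(),
        )

# Example: a load that sent 100 rows but landed only 90 (a data quality anomaly).
log_run("customer_load", rows_sent=100, rows_received=90,
        duration_seconds=345, status="WARNING")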
Can someone suggest some good data monitoring tools that can serve our needs? My Google search came up with these tools:
Datadog, Datafold, Dynatrace, and the ELK stack.
Can anyone explain the benefit of adopting the Google Cloud Pub/Sub service in a streaming pipeline?
I saw an event streaming pipeline example showcased that used Pub/Sub to ingest the event data before connecting it to the Google Cloud Dataflow service for transformation. Why not connect to the event data directly through Dataflow?
Thanks.
Dataflow needs a source to get the data from. If you are using a streaming pipeline, you can choose from several source options, and each has its own characteristics that may fit your scenario.
With Pub/Sub you can easily publish events to a topic using a client library or the API directly, and it guarantees at-least-once delivery of each message.
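For example, here is a minimal publisher sketch with the Python client library (the project and topic names are placeholders):

# Minimal publish sketch using the google-cloud-pubsub client library.
# Project and topic IDs are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events")

# Message payloads are bytes; attributes are optional string key/value pairs.
future = publisher.publish(topic_path, b'{"event": "click"}', source="web")
print(future.result())  # blocks until Pub/Sub has accepted the message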
When you connect it to a Dataflow streaming pipeline, you get a resilient architecture (Pub/Sub will keep redelivering a message until Dataflow acknowledges that it has processed it) and near real-time processing. In addition, Dataflow can use Pub/Sub metrics to scale up or down depending on the number of messages in the backlog.
Finally, the Dataflow runner uses an optimized version of the PubSubIO connector, which provides additional features. I suggest checking this documentation, which describes some of them.
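To illustrate the Dataflow side, here is a minimal streaming pipeline sketch that reads from one Pub/Sub topic and writes to another (the topic names and the transform are placeholders):

# Minimal Beam streaming sketch: read from Pub/Sub, transform, write back.
# Topic names are placeholders; running it on the DataflowRunner gives you the
# optimized Pub/Sub implementation mentioned above.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Transform" >> beam.Map(str.upper)  # stand-in for real processing
        | "Encode" >> beam.Map(lambda s: s.encode("utf-8"))
        | "WriteResults" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/results")
    )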
I am looking for a monitoring tool for the following use cases:
Collect basic metrics about virtual machines (CPU usage, memory usage, I/O, available disk space).
Extract metrics from SQL Server (probably by running some queries).
Extract information from an external service about processing, i.e., how many processes are currently running and for how long. I am thinking about writing Python scripts, but I don't know how to combine them with a monitoring tool.
Plot charts and manage alerts; it would be nice to be able to send not only emails but also messages to Slack/MS Teams.
I was thinking about Prometheus, because it has wmi_exporter, node_exporter, a SQL exporter, and Alertmanager with the ability to send notifications to multiple destinations, but I don't know what to do about the external service and the Python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can definitely fill in the blanks yourself.
E.g., you can get machine metrics basically out of the box by firing up node_exporter and having it scraped by Prometheus, but I don't think it has, for example, information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there is a Python client library to help with that. Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short-lived batch jobs.
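For instance, here is a small exporter sketch with the Python client library (the metric names and the external-service call are made-up placeholders):

# Tiny exporter sketch using the prometheus_client library.
# poll_external_service() and the metric names are made-up placeholders.
import random
import time

from prometheus_client import Gauge, start_http_server

RUNNING_JOBS = Gauge("external_running_jobs",
                     "Jobs currently running in the external service")
LONGEST_JOB_SECONDS = Gauge("external_longest_job_seconds",
                            "Age of the oldest running job in seconds")

def poll_external_service():
    """Placeholder for whatever API or query your external service offers."""
    return {"running": random.randint(0, 10),
            "longest_seconds": random.uniform(0, 3600)}

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
    while True:
        stats = poll_external_service()
        RUNNING_JOBS.set(stats["running"])
        LONGEST_JOB_SECONDS.set(stats["longest_seconds"])
        time.sleep(30)

For short-lived batch jobs, the same library's push_to_gateway function covers the Pushgateway route instead.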
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.
I'm implementing an Apache Beam application running on Dataflow that consumes data from a Cloud Pub/Sub topic, transforms its format, and sends the results to another Cloud Pub/Sub topic. It loads definitions of the streaming data that describe the names and types of the keys and how each record should be transformed. The definitions are stored in GCS and loaded when the application starts.
My question is how to update the definitions and notify each PTransform object running on Dataflow of the changes. Is it possible to do that online, or do we have to drain and recreate the Dataflow job?
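For context, here is a simplified sketch of how the definitions are loaded today (the bucket, object name, and transform logic are placeholders for the real application):

# Simplified sketch of the current setup: definitions are read from GCS once,
# at pipeline construction time, and baked into the DoFn.
# Bucket/object names and the transform logic are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from google.cloud import storage

def load_definitions():
    bucket = storage.Client().bucket("my-config-bucket")
    return json.loads(bucket.blob("definitions.json").download_as_text())

class TransformByDefinition(beam.DoFn):
    def __init__(self, definitions):
        self.definitions = definitions  # fixed for the lifetime of the job

    def process(self, message):
        record = json.loads(message.decode("utf-8"))
        # Rename keys according to the loaded definitions (real logic is richer).
        out = {self.definitions.get(key, key): value for key, value in record.items()}
        yield json.dumps(out).encode("utf-8")

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | beam.io.ReadFromPubSub(topic="projects/my-project/topics/input")
        | beam.ParDo(TransformByDefinition(load_definitions()))
        | beam.io.WriteToPubSub(topic="projects/my-project/topics/output")
    )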
I am using a Graphite server to capture my metrics data and turn it into graphs. I have 4 application servers in a load-balanced setup. My aim is to capture system data such as CPU usage, memory usage, disk load, etc., for all 4 application servers. I set up a Graphite environment on a separate server, and I want to push the system data from all the application servers to Graphite and display it as graphs. I don't know what needs to be done to feed system data into Graphite. My thinking was to install StatsD on all the application servers and feed the system data to Graphite, but it looks like StatsD is meant for application data rather than system data.
Can anyone help me get on the right track? Thanks in advance.
Running collectd with a Graphite agent would be an excellent start for gathering the information you're after.
There is an almost unlimited number of ways to get your data into Graphite.
You can find a list of tools that are known to work very well with Graphite on the readthedocs.org page: http://graphite.readthedocs.org/en/0.9.10/tools.html
There is also an example script in the carbon project that gathers the system load average: example-client.py
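If you want to see what goes over the wire before settling on collectd, here is a minimal sketch that pushes one metric using Carbon's plaintext protocol (the hostname and metric path are assumptions):

# Minimal sketch of Carbon's plaintext protocol:
#   "<metric.path> <value> <timestamp>\n" sent to the line receiver (port 2003 by default).
# The hostname and metric path below are placeholders.
import os
import socket
import time

CARBON_HOST = "graphite.example.com"  # your Graphite server
CARBON_PORT = 2003

def send_metric(path, value, timestamp=None):
    line = f"{path} {value} {int(timestamp or time.time())}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. report the 1-minute load average for this app server
load1, _, _ = os.getloadavg()
send_metric("servers.app1.loadavg.1min", load1)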