Dataflow doesn't show Streaming Engine metrics for bigger machine types - google-cloud-dataflow

We have a Dataflow job with Streaming Engine enabled (enableStreamingEngine set to true, Beam version 2.36.0).
But we don't get Streaming Engine metrics (no data available). We use the n2-standard-8 machine type. We tried switching to n2-standard-2 and the metrics appeared, but that machine type couldn't handle our load, so we switched back and the metrics disappeared again.
gcloud dataflow jobs describe shows that our parameter is present in the job.
I can't find any machine-type limitations in the docs. Does anybody know what the deal is?
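For reference, this is roughly how the same options look in the Beam Python SDK (the Java flag is enableStreamingEngine); the project, region, and bucket names below are placeholders, and this only shows how the option is set, not why the metrics disappear:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: substitute your own project, region, and bucket.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    streaming=True,
    enable_streaming_engine=True,  # Python-SDK spelling of enableStreamingEngine
    machine_type='n2-standard-8',
)
```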

Related

How do you change the default topic subscription options while using Rosbridge for Foxglove?

We are using Foxglove as a visualization tool for our ROS2 Foxy system on Ubuntu 20, but we are running into bandwidth issues with the rosbridge websocket. We have plans to switch to the foxglove_bridge websockets, since they advertise performance improvements, but are waiting until we migrate to ROS Humble.
When a client initiates a subscription to a topic, it can also pass along options to the server to throttle the message rate for each topic.
Where do I change those options? They must be set within the client, but I couldn't find anything in the GUI to set them.
I'm running foxglove-studio from binaries installed through apt. The only source code I have for the foxglove-studio is a few custom extension panels.
My temporary fix is to filter out the topics I want to throttle and hard-code the throttle_rate option within the rosbridge server before the options are passed to the subscriber handler.
This will work for the demo we have coming up, but I'm searching for a better solution.
Foxglove Studio currently uses a hard-coded set of parameters for creating the roslib Topic object and so does not support throttling. To achieve this you'd currently need to either:
patch the Studio source and build it yourself
patch the server as you've currently done, or
create a separate downsampled topic (e.g. using topic_tools/throttle).
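For completeness, the rosbridge protocol itself does support per-subscription throttling, so a custom client outside Studio can request it. A minimal sketch using roslibpy (the Python rosbridge client); the host, topic, and rate are hypothetical:

```python
import time
import roslibpy

# Connect to the rosbridge websocket server (host and port are placeholders).
ros = roslibpy.Ros(host='localhost', port=9090)
ros.run()

# throttle_rate is the minimum interval between messages in milliseconds,
# per the rosbridge protocol's subscribe operation.
listener = roslibpy.Topic(
    ros,
    '/camera/image_raw',      # hypothetical topic
    'sensor_msgs/Image',
    throttle_rate=500,        # at most one message every 500 ms
    queue_length=1,
)
listener.subscribe(lambda msg: print('received one frame'))

try:
    while ros.is_connected:
        time.sleep(1)
except KeyboardInterrupt:
    listener.unsubscribe()
    ros.terminate()
```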

How can I access real-time driver logs (without the 5-minute lag) other than through the Azure Databricks Spark UI

I am trying to integrate the driver logs with the Control-M scheduler.
How can I access the real-time driver logs (without the 5-minute lag) other than through the Azure Databricks Spark UI, e.g. using some API or reading the location where the logs are written in real time?
I am also planning to do Elastic analysis on top of them.
Such things (real-time collection of metrics or logs) are usually done by installing an agent (for example, Filebeat) via init scripts (global or cluster-level init scripts).
The actual script content heavily depends on the type of agent used, but Databricks' documentation contains some examples:
Blog post on setting up the Datadog integration
Notebook that shows how to set up an init script for Datadog
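As an illustration of what such an agent boils down to, here is a minimal Python sketch that tails the driver log and forwards each new line. It assumes the log lives at the usual /databricks/driver/logs/log4j-active.log path (verify on your runtime), and the shipping step is left as a stub:

```python
# tail_driver_log.py - started on the driver by a cluster-scoped init script
import time

LOG_PATH = "/databricks/driver/logs/log4j-active.log"  # common location; verify on your runtime

def follow(path):
    """Yield new lines as they are appended, like `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)  # jump to the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

if __name__ == "__main__":
    for line in follow(LOG_PATH):
        # Ship the line to your collector here (HTTP endpoint, Kafka,
        # Elasticsearch bulk API, ...); print is just a stand-in.
        print(line, end="")
```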

Problem: empty graphs in GKE cluster node detail ("No data for this time interval"). How can I fix it?

I have a cluster in Google Cloud, and I need information about resource usage.
In the interface of each node there are three graphs, for CPU, memory, and disk usage. But on every node all of these graphs show the warning "No data for this time interval", for any time interval.
I upgraded all clusters and nodes to the latest version, 1.15.4-gke.22, and changed "Legacy Stackdriver Logging" to "Stackdriver Kubernetes Engine Monitoring".
But it didn't help.
In the Stackdriver Workspace only "disk_read_bytes" shows a graph; any other query in Metrics Explorer only shows the message "No data for this time interval".
If I run kubectl top nodes on the command line, I see current data for CPU and memory. But I need to see it on the node detail page to understand the peak load. How can I configure that?
In my case, I was missing permissions on the IAM service account associated with the cluster - make sure it has these roles:
Monitoring Metrics Writer (roles/monitoring.metricWriter)
Logs Writer (roles/logging.logWriter)
Stackdriver Resource Metadata Writer (roles/stackdriver.resourceMetadata.writer)
This is documented here.
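Once the roles are in place, one way to check whether node metrics are actually being written is to query Cloud Monitoring directly. A small diagnostic sketch with the google-cloud-monitoring Python client; the project ID is a placeholder, and the metric type is one of the standard GKE node metrics:

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project = "projects/my-project"  # placeholder project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# kubernetes.io/node/cpu/core_usage_time is a standard GKE node metric;
# if this returns nothing while `kubectl top nodes` does show data, the
# write path (agent or permissions) is the likely culprit.
results = client.list_time_series(
    request={
        "name": project,
        "filter": 'metric.type = "kubernetes.io/node/cpu/core_usage_time"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    print(series.resource.labels.get("node_name"), len(series.points))
```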
Actually, it sounds strange: if you can get the metrics on the command line but the Stackdriver interface doesn't show them, it may be a bug.
I recommend this: if you are able to, create a cluster with minimal resources and check the same Stackdriver metrics. If the metrics are there, it is likely a bug, and you can report it in the appropriate GCP channel.
Check the documentation about how to get support within GCP:
Best Practices for Working with Cloud Support
Getting support for Google Cloud

Rationale behind appending versions to Service/Deployment names on k8s with Spring Cloud Skipper

I am kind of new to the Spring Cloud Data Flow world. While playing around with the framework, I created a stream 'test-stream' with one application called 'app'. When I deploy it using Skipper to Kubernetes, I see that it creates the pod/deployment and service on Kubernetes with the name
test-stream-app-v1.
My question is: why do we need to have v1 in the service/deployment names on k8s? What role does it play in the overall workflow using Spring Cloud Data Flow?
------ Follow-up ------
Just wanted to confirm a few points to make sure I am on the right track in understanding the flow.
My understanding is that with a traditional stream (bound through Kafka topics), the service (object on Kubernetes) does not play a significant role.
The rolling update (red/black) pattern is implemented in Skipper in the following way, and versioning in the deployment/service names plays its role as follows:
Let's assume that the app-v1 deployment already exists and an upgrade is requested. Skipper creates the app-v2 deployment and waits for it to be ready. Once it is ready, Skipper destroys app-v1.
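In pseudo-code, the pattern I am describing would look roughly like this (a sketch only, not Skipper's actual implementation, using the kubernetes Python client and the hypothetical names above):

```python
import time
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def red_black_upgrade(namespace, old_name, new_deployment):
    """Create the v2 deployment, wait until it is ready, then delete v1."""
    apps.create_namespaced_deployment(namespace=namespace, body=new_deployment)
    while True:
        dep = apps.read_namespaced_deployment(new_deployment.metadata.name, namespace)
        if (dep.status.ready_replicas or 0) == dep.spec.replicas:
            break  # v2 is fully ready; safe to retire v1
        time.sleep(5)
    apps.delete_namespaced_deployment(old_name, namespace)

# e.g. red_black_upgrade("default", "test-stream-app-v1", v2_deployment_body)
```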
If my understanding above is right, I have the following follow-up questions...
I see that Skipper can deploy any package (and it does not have to be a traditional stream). Is that the longer-term plan, or is Skipper only intended to work with Spring Cloud Data Flow streams?
In the case of a non-traditional stream package, where a package has multiple apps (REST microservices) in a group, how will this model of versioning work? I mean, when I want to call one microservice from another microservice, I cannot possibly know (and it is less than ideal to have to know) the release version of the app.
@Anand: Congrats on the first post!
The naming convention follows the idea that each stream application is "versioned" when Skipper is used with SCDF. The version gets bumped when, as a user, you rolling-upgrade or rolling-downgrade the streaming-application versions or the application-specific properties, either on demand or via CI/CD automation.
It is very relevant for continuous-delivery and continuous-deployment workflows, and we provide native options in SCDF through commands such as stream update .. and stream rollback .. respectively. For any of these operations, the applications will be rolling-updated in K8s, and each action will bump the number in the application name. In your example, you'd see them as test-stream-app-v1, test-stream-app-v2, etc.
With all the historical versions in a central place (i.e., Skipper's database), you'd be able to interact with them via the stream history .. and stream manifest .. commands in SCDF.
To learn more about all this, watch this demo webinar (starts at ~41:25), and also have a look at the samples in the reference guide.
I hope this helps.

What is the recommended solution for monitoring heterogeneous infrastructure?

I am looking for a monitoring tool for the following use cases:
Collect basic metrics about virtual machines (CPU usage, memory usage, I/O, available space)
Extract metrics from SQL Server (probably by running some queries)
Extract information from an external service about processing, i.e. how many processes are currently running and for how long. I am thinking about writing Python scripts, but I don't know how to combine them with a monitoring tool
Have the ability to plot charts and manage alerts; it would also be nice to be able to send not only emails but also messages to Slack/MS Teams
I was thinking about Prometheus, because it has wmi_exporter, node_exporter, an SQL exporter, and Alertmanager with the ability to send notifications to multiple destinations, but I don't know what to do about the external service and the Python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can definitely fill in the blanks yourself.
E.g. you can get machine metrics basically out of the box by firing up a node_exporter and having it scraped by Prometheus, but I don't think it has e.g. information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there is a Python client library to help with that. Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short-lived batch jobs.
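A minimal sketch of such an exporter with the official prometheus_client library; the metric names, port, and the poll_external_service() stub are all hypothetical:

```python
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names for the external processing service.
RUNNING = Gauge('external_jobs_running', 'Jobs the external service reports as running')
LONGEST = Gauge('external_job_longest_seconds', 'Age of the longest-running job in seconds')

def poll_external_service():
    # Stand-in: replace with a real call to the external service's API.
    return {'running': 3, 'longest_seconds': 127.0}

if __name__ == '__main__':
    start_http_server(9101)  # Prometheus then scrapes http://<host>:9101/metrics
    while True:
        stats = poll_external_service()
        RUNNING.set(stats['running'])
        LONGEST.set(stats['longest_seconds'])
        time.sleep(15)
```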
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.
