Allowing users to view GPU utilization in GCP Vertex AI training jobs - google-cloud-iam

I am running custom training jobs on Google Cloud Vertex AI. When I open a custom training job's page, the GPU utilization chart is not shown; instead, there is a message saying "You don't have access to this data."
I would appreciate help finding the right IAM role which will allow me to view the GPU utilization.
Thanks!

You can grant the monitoring.viewer IAM role, in addition to the aiplatform.viewer role, to display both CPU and GPU utilization for GCP Vertex AI training jobs.
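If you manage IAM from the gcloud CLI, granting both roles might look like the following sketch (the project ID and user email are placeholders):

```shell
# Grant the Vertex AI viewer role (hypothetical project and member values)
gcloud projects add-iam-policy-binding my-project-id \
  --member="user:alice@example.com" \
  --role="roles/aiplatform.viewer"

# Grant the Monitoring viewer role so the CPU/GPU utilization charts render
gcloud projects add-iam-policy-binding my-project-id \
  --member="user:alice@example.com" \
  --role="roles/monitoring.viewer"
```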

Related

Solution for data pipeline / ETL load monitoring

My team is looking for a solution (either in-house or a commercial tool) for performance monitoring and operations management of 500+ SSIS ETL loads (with varied run frequencies: daily, weekly, monthly, etc.) and 100+ data pipelines (currently Hadoop is used as the data lake storage layer, but the plan is to migrate to Databricks hosted on AWS). The number of pipelines will grow as ML and AI needs evolve over time. The solution should be able to handle input from SSIS as well as from the data pipelines. Primary goals for this solution are:
Generate a dashboard that shows data quality anomalies (e.g., the ETL source sent 100 rows but the destination received only 90; for pipeline X, the average data volume received dropped by 90%).
Send alerts on failures, and integrate with ServiceNow to create tickets for severe failures.
Allow the operations team to quickly surface important performance metrics: execution run time, slowness in a particular pipeline, bottlenecks.
Ability to customize the dashboards.
Right now we are considering a SQL Server-based centralized logging table that would collect metrics from the ETL loads and pipelines, expose that data to Power BI for custom dashboards, and use a custom API to integrate with ServiceNow for alerts. But this solution might be hard to scale.
Can someone suggest some good data monitoring tools that could serve our needs? My Google search came up with these four: Datadog, Datafold, Dynatrace, and the ELK stack.
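Whatever tool is chosen, the first goal (reconciling row counts between source and destination) reduces to a simple check that any of these platforms can host. A minimal sketch in Python, with a hypothetical input format and a hypothetical 50% volume-drop threshold:

```python
def check_load(source_rows, destination_rows, volume_history=None, drop_threshold=0.5):
    """Return a list of anomaly messages for one ETL load.

    source_rows / destination_rows: row counts reported by the load.
    volume_history: optional list of recent daily volumes, used to flag
    a sudden drop in received volume (hypothetical heuristic).
    """
    anomalies = []
    # Goal 1: source sent N rows but destination received fewer
    if destination_rows < source_rows:
        anomalies.append(
            f"row mismatch: source sent {source_rows}, "
            f"destination received {destination_rows}"
        )
    # Goal 1, second example: received volume far below the recent average
    if volume_history:
        avg = sum(volume_history) / len(volume_history)
        if destination_rows < avg * (1 - drop_threshold):
            anomalies.append(
                f"volume drop: received {destination_rows}, "
                f"recent average {avg:.0f}"
            )
    return anomalies

# Example from the question: source sent 100 rows, destination received 90
print(check_load(100, 90))
```

The same logic could run as a scheduled query against the centralized logging table, feeding both the Power BI dashboard and the ServiceNow alert API.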

Google Cloud Platform charging $10 for loading a 2 GB dataset

I started a VM instance for an ML task that needs to train a model on a 2 GB dataset. I connected the VM to Google Datalab and loaded the dataset from a GCS bucket. The VM has the standard "n1-highmem-16" machine type.
Datalab automatically disconnects after 1-2 hours, but I was charged $10 simply for loading the 2 GB dataset into memory. I wondered whether it was because I did not shut down the VM soon enough, leaving an ongoing charge, so I reloaded the same dataset and monitored the charges. I found I was charged $2 in 2 minutes for that task. I expect the ongoing charges to accumulate fast.
These confusing charges basically make it impossible for me to finish a project entirely on GCP. Does anyone have suggestions about anything I may have done wrong in creating the VM or handling the task that led to charges this high? If not, can anyone suggest a more reasonably priced cloud computing option?
You can reach out to GCP Cloud Billing Support regarding your billing issue. In the meantime, you can review the GCP Pricing documentation to better understand how specific resources are priced.
It's better to open an Issue Tracker case or contact GCP's billing team for a better overview of the incurred charges.

Monitor Google Cloud Run memory usage

Is there any built-in way to monitor memory usage of an application running in managed Google Cloud Run instances?
In the "Metrics" page of a managed Cloud Run service, there is an item called "Container Memory Allocation". However, as far as I understand it, this graph refers to the instance's maximum allocated memory (chosen in the settings), and not to the memory actually used inside the container. (Please correct me if I'm wrong.)
In the Stackdriver Monitoring list of available metrics for managed Cloud Run ( https://cloud.google.com/monitoring/api/metrics_gcp#gcp-run ), there also doesn't seem to be any metric related to the memory usage, only to allocated memory.
Thank you in advance.
Cloud Run now exposes a new metric named "Memory Utilization" in Cloud Monitoring; see more details here.
This metric captures the distribution of container memory utilization across all container instances of the revision. It is recommended to look at percentiles of this metric (50th, 95th, and 99th) to understand how utilized your instances are.
Currently, there seems to be no way to monitor the memory usage of a Google Cloud Run instance through Stackdriver or on the "Cloud Run" page in the Google Cloud Console.
I have filed a feature request on your behalf, in order to add memory usage metrics to Cloud Run. You can see and track this feature request in the following link.
There is not currently a metric on memory utilization. However, if your service reaches a memory limit, the following log will appear in Stackdriver Logging with ERROR-level severity:
"Memory limit of 256M exceeded with 325M used. Consider increasing the memory limit, see https://cloud.google.com/run/docs/configuring/memory-limits"
(Replace specific numbers accordingly.)
Based on this log message, you could create a log-based metric for memory-limit errors.
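As a sketch, creating such a log-based metric with the gcloud CLI might look like this (the metric name is a placeholder, and the filter wording is an assumption; adjust it to match the actual log text you see):

```shell
# Hypothetical log-based metric counting Cloud Run memory-limit errors
gcloud logging metrics create cloud-run-memory-exceeded \
  --description="Cloud Run revisions that exceeded their memory limit" \
  --log-filter='resource.type="cloud_run_revision" AND severity>=ERROR AND textPayload:"Memory limit of"'
```

You can then attach a Cloud Monitoring alerting policy to that metric to be notified when instances hit their memory limit.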

Will Google Cloud Run support GPU/TPU some day?

So far, Google Cloud Run supports CPU only. Is there any plan to support GPUs? It would be super cool if GPUs were available; then I could demo the DL project without actually running a super expensive GPU instance.
I seriously doubt it. GPUs/TPUs are specialized hardware. Cloud Run is a managed container service that:
Enables you to run stateless containers invokable via HTTP requests. This means CPU-intensive applications are not well supported: between HTTP request/response cycles, the CPU is throttled to near zero, so your expensive GPUs/TPUs would sit idle.
Autoscales based on the number of requests per second. Launching 10,000 instances in seconds is easy to achieve. Imagine the billing-support nightmare for Google if customers could launch that many GPUs/TPUs, and the size of the resulting bills.
Is billed in 100 ms intervals. Most requests complete in a few hundred milliseconds of execution. This is not a good execution or business model for GPU/TPU integration.
Provides a billing model that reduces the cost of web services to near zero when not in use: you pay only for storing your container images. When an HTTP request arrives at the service URL, the container image is loaded into an execution unit and request processing resumes. Once requests stop, billing and resource usage also stop.
GPU/TPU types of data processing are best delivered by backend instances that protect and manage the processing power and costs that these processor devices provide.
You can use GPUs with Cloud Run for Anthos:
https://cloud.google.com/anthos/run/docs/configuring/compute-power-gpu
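On Cloud Run for Anthos (Knative serving on GKE), GPU access is typically requested through the container's resource limits. A hedged sketch of what the Service manifest might look like (the service name, image, and GPU count are placeholders; check the linked docs for the exact fields your cluster requires):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: gpu-demo                             # hypothetical service name
spec:
  template:
    spec:
      containers:
        - image: gcr.io/my-project/my-dl-app # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "1"            # request one GPU per instance
```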

Is there a way to limit the performance data being recorded by AKS clusters?

I am using Azure Log Analytics to store monitoring data from AKS clusters. 72% of the stored data is performance data. Is there a way to limit how often AKS reports performance data?
At this point we do not provide a mechanism to change the performance metric collection frequency. It is set to 1 minute and cannot be changed.
We were actually considering adding an option for more frequent collection, as requested by some customers.
Given the number of objects (pods, containers, etc.) running in the cluster, collecting even a few perf metrics may generate a noticeable amount of data... and you need that data to figure out what is going on in case of a problem.
Curious: you say your perf data is 72% of the total. How much is that in GB/day, do you know? Do you have any active applications running on the cluster generating traces? What we see is that once you stand up a new cluster, perf data is "the king" of volume, but once you start adding active apps that trace, logs become more and more of a factor in the telemetry data volume...
