Monitor throughput bandwidth GCE - monitoring

I'm new to Google Cloud Platform and have set up a VM that is working. I'm expecting a significant amount of bandwidth use through this server.
I'd like to know what the bandwidth usage is so that I can anticipate the impact on this project over time.
Please advise.

You could track the bandwidth usage in terms of billing by enabling the daily billing export [1], which will populate a CSV or JSON file in your Cloud Storage bucket with a well-detailed breakdown of your charges.
[1] https://cloud.google.com/billing/docs/how-to/export-data-file
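As a rough illustration, here is a minimal Python sketch that downloads one of those daily export files from the bucket and sums the network-related line items. The bucket name, object name, and the "Line Item"/"Cost" column names are assumptions; check them against your own export before relying on this.

```python
# Sketch only: download a daily billing export file from Cloud Storage and
# sum the network-related line items. Bucket, object, and column names
# ("Line Item", "Cost") are assumptions -- verify against your own export.
import io

import pandas as pd
from google.cloud import storage

BUCKET = "my-billing-export-bucket"   # placeholder bucket name
OBJECT = "billing-2024-01-15.csv"     # placeholder daily export object

client = storage.Client()
data = client.bucket(BUCKET).blob(OBJECT).download_as_bytes()
df = pd.read_csv(io.BytesIO(data))

# Keep only rows whose line item mentions network/egress charges.
network = df[df["Line Item"].str.contains("Network|Egress", case=False, na=False)]
print(network.groupby("Line Item")["Cost"].sum())
```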

Related

Do Idle Snowflake Connections Use Cloud Services Credits?

Motivation | Suppose one wanted to execute two SQL queries against a Snowflake DB, ~20 minutes apart.
Optimization Problem | Which would cost fewer cloud services credits:
Approach 1: re-using one connection, and allowing that connection to idle in the interim.
Approach 2: connecting once per query.
The documentation indicates that authentication incurs cloud services credit usage, but does not indicate whether idle connections incur credit usage.
Question | Does anyone know whether idle connections incur cloud services credit usage?
Snowflake connections are stateless. They do not occupy a resource, and they do not need to keep the TCP/IP connection alive like other database connections.
Therefore idle connections do not consume any Cloud Services Layer credits unless you enable "CLIENT_SESSION_KEEP_ALIVE".
https://docs.snowflake.com/en/sql-reference/parameters.html#client-session-keep-alive
When you set CLIENT_SESSION_KEEP_ALIVE, the client will periodically refresh the session token (the default heartbeat frequency is 1 hour).
https://docs.snowflake.com/en/sql-reference/parameters.html#client-session-keep-alive-heartbeat-frequency
As Peter mentioned, the CSL usage up to 10% of daily warehouse usage is free, so refreshing the tokens will not cost you anything in practice.
About your approaches: I do not know how many queries you are planning to run daily, but creating a new connection for each query can be a performance killer. From a cost perspective, an idle connection will make at most 24 authorization requests per day, so if you are planning to run more than 24 queries a day, I suggest you pick the first approach.
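For reference, here is a minimal sketch of the first approach using the snowflake-connector-python package; the credentials, warehouse, and queries are placeholders. CLIENT_SESSION_KEEP_ALIVE is set on the connection so the session token is refreshed while the connection idles between the two queries.

```python
# Sketch of approach 1: one connection reused for both queries, idling
# ~20 minutes in between. Credentials, warehouse, and queries are placeholders.
import time

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",            # placeholder
    user="my_user",                  # placeholder
    password="my_password",          # placeholder
    warehouse="my_wh",               # placeholder
    client_session_keep_alive=True,  # refresh the session token while idle
)

try:
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM my_table")          # first query (placeholder)
    print(cur.fetchone())

    time.sleep(20 * 60)  # idle ~20 minutes; only token refreshes happen here

    cur.execute("SELECT MAX(updated_at) FROM my_table")   # second query (placeholder)
    print(cur.fetchone())
    cur.close()
finally:
    conn.close()
```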
Even if idle connections do not cost anything with respect to Cloud Services, is your warehouse left running while connections sit idle, giving you other costs to consider? I am guessing there are more factors to consider overall, which you can discuss with your Snowflake Account Team. Not trying to dodge your question, but trying to give a more complete answer!
In general, Cloud Services costs are typically on the lower side compared to your other costs. Here are the main drivers for Cloud Services costs and how to minimize them: https://community.snowflake.com/s/article/Cloud-Services-Billing-Update-Understanding-and-Adjusting-Usage
The best advice you may get is to test your connections/workflows and compare the costs over time. The overall costs are going to depend on several factors. Even if there's a difference in costs between two workflows, you may still have to analyze the cost/output ratio and your business needs to determine if it's worth the savings.
Approach 1 will incur less cloud services usage, but more data transfer charges (to keep the connection alive). Only the Auth event incurs cloud services usage.
Approach 2 will incur more cloud services usage, but less data transfer charges.
However, the amount of cloud services usage or data transfer charges are extremely small in either case.
Note - any cloud services used (up to 10% of daily warehouse usage) are free, whereas there is no free bandwidth allocation, so using #2 may save you a few pennies.

Google Cloud Platform Charging $10 for loading a Dataset of 2G

I started a VM instance for an ML task that needs to train a model on a 2 GB data set. I connected the VM to Google's Datalab and loaded the 2 GB dataset from a GCS bucket. The VM has a standard "n1-highmem-16" machine type.
Datalab automatically disconnects in 1-2 hrs, but I was charged $10 for simply loading the 2 GB of data into memory. I wondered whether it was because I did not shut down the VM soon enough, so that there was an ongoing charge, so I reloaded the same dataset and monitored the charges. I found that I was charged $2 in 2 minutes for that task. I expect the ongoing charges to accumulate fast.
These confusing charges basically make it impossible for me to finish a project entirely on GCP. Does anyone have suggestions on anything I may have done wrong in creating the VM or handling the task that caused me to be charged this much? If not, does anyone have suggestions for more reasonably priced cloud computing options?
You can reach out to GCP Cloud Billing Support regarding your issue with the billing of charges for GCP resources. In the meantime, you can look into GCP Pricing in order to have a better understanding of the specific pricing for different resources.
It's better to open an Issue Tracker case or contact the GCP billing team for a better overview of the incurred charges.
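Building on the billing-export idea from the first answer above: if the billing export to BigQuery is also enabled on the billing account, a query like the following sketch can break the charges down by service and SKU so you can see what the $10 actually covered. The table name is a placeholder for your own export table.

```python
# Sketch only: break down recent charges by service/SKU from the BigQuery
# billing export. The table name is a placeholder, and billing export to
# BigQuery must already be enabled on the billing account.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  service.description AS service,
  sku.description AS sku,
  ROUND(SUM(cost), 2) AS cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`  -- placeholder table
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY service, sku
ORDER BY cost DESC
"""

for row in client.query(QUERY).result():
    print(row.service, row.sku, row.cost)
```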

diagnosing bandwidth from dask dashboard

This may be a very silly problem, but I cannot diagnose bandwidth from the dask dashboard. I am under the impression the line is always so low that it is not visible, cf. the screen grab.
Can I use the dashboard to get a value in such a situation?
Yes, there are a few places in the dashboard where bandwidth is mentioned:
workers: the read/write network bandwidth is listed in real time per worker
bandwidth-per-workers: the aggregate bandwidth is accumulated per worker pair
bandwidth-per-type: the aggregate bandwidth is accumulated per type of data
Some of these are more accessible in the performance_report, which may interest you. See https://docs.dask.org/en/latest/diagnostics-distributed.html#capture-diagnostics
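For example, here is a minimal sketch of capturing such a report with dask.distributed's performance_report; the cluster address and the workload below are placeholders. The resulting HTML file includes the bandwidth tables even when the live dashboard line is too small to read.

```python
# Sketch: wrap the computation in performance_report so bandwidth
# (per worker pair and per data type) is captured in a standalone
# HTML file you can inspect afterwards.
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # or Client("scheduler-address:8786") for an existing cluster

with performance_report(filename="dask-report.html"):
    # Placeholder workload that moves data between workers.
    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    x.T.dot(x).sum().compute()
```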

Will Google Cloud Run support GPU/TPU some day?

So far Google Cloud Run supports CPU only. Is there any plan to support GPUs? It would be super cool if GPUs were available; then I could demo the DL project without running a really expensive GPU instance.
I seriously doubt it. GPU/TPUs are specialized hardware. Cloud Run is a managed container service that:
Enables you to run stateless containers that are invokable via HTTP requests. This means that CPU-intensive applications are not supported. In between HTTP request/response, the CPU is idled to near zero. Your expensive GPU/TPUs would sit idle.
Autoscales based upon the number of requests per second. Launching 10,000 instances in seconds is easy to achieve. Imagine the billing support nightmare for Google if customers could launch that many GPU/TPUs and the size of the bills.
Is billed in 100 ms time intervals. Most requests fit into a few hundred milliseconds of execution. This is not a good execution or business model for CPU/GPU/TPU integration.
Provides a billing model which significantly reduces the cost of web services to near zero when not in use. You just pay for the costs to store your container images. When an HTTP request is received at the service URL, the container image is loaded into an execution unit and processing requests resume. Once requests stop, billing and resource usage also stop.
GPU/TPU types of data processing are best delivered by backend instances that protect and manage the processing power and costs that these processor devices provide.
You can use GPUs with Cloud Run for Anthos:
https://cloud.google.com/anthos/run/docs/configuring/compute-power-gpu

Is there a way to limit the performance data being recorded by AKS clusters?

I am using azure log analytics to store monitoring data from AKS clusters. 72% of the data stored is performance data. Is there a way to limit how often AKS reports performance data?
At this point we do not provide a mechanism to change performance metric collection frequency. It is set to 1 minute and cannot be changed.
We were actually thinking about adding an option to make more frequent collection as was requested by some customers.
Given the number of objects (pods, containers, etc.) running in the cluster, collecting even a few perf metrics may generate a noticeable amount of data... You need that data in order to figure out what is going on in case of a problem.
Curious: you say your perf data is 72% of the total - how much is that in terms of GB/day, do you know? Do you have any active applications running on the cluster generating tracing? What we see is that once you stand up a new cluster, perf data is "the king" of volume, but once you start adding active apps that trace, logs become more and more of a factor in telemetry data volume...
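To answer the GB/day question, one hedged option is to query the workspace's Usage table, e.g. with the azure-monitor-query Python package as in the sketch below. The workspace ID is a placeholder, and the Usage-table column names should be checked against your own workspace.

```python
# Sketch: estimate billable volume per data type (Perf, ContainerLog, ...)
# over the last day. The workspace ID is a placeholder.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Usage reports Quantity in MB per data type; divide to get GB.
KQL = """
Usage
| where IsBillable == true
| summarize GB = sum(Quantity) / 1024.0 by DataType
| order by GB desc
"""

response = client.query_workspace(
    workspace_id="00000000-0000-0000-0000-000000000000",  # placeholder
    query=KQL,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```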
