Stackdriver latency for monitoring Google Cloud Pub/Sub

From the documentation at https://cloud.google.com/monitoring/api/v3/metrics#time-series:
Metric data is collected on schedules that vary across monitored resources. Some data is regularly "pulled" by Stackdriver Monitoring from the monitored resources, and some data is "pushed" by applications, services, or the Stackdriver Monitoring agent.
I'd like to know how Stackdriver collects data from Google Cloud Pub/Sub, and what the promised latency bound is. I've tried creating a topic/subscription, publishing messages, and watching how long it takes until the metrics show up in Stackdriver. On average it's about 1-2 minutes, but sometimes it's very slow, up to 5-8 minutes.
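As a rough way to measure this yourself, here is a sketch that publishes one message and polls until the subscription's num_undelivered_messages metric reflects it. It assumes the google-cloud-pubsub and google-cloud-monitoring client libraries with default credentials; the project, topic, and subscription names are placeholders, and the subscription must already exist with no active subscriber.

    # Rough sketch: publish a message, then poll Cloud Monitoring until the
    # subscription's num_undelivered_messages metric reflects it, and report how
    # long that took. Names below are placeholders.
    import time
    from google.cloud import monitoring_v3, pubsub_v1

    PROJECT_ID = "my-project"
    TOPIC_ID = "latency-test-topic"
    SUBSCRIPTION_ID = "latency-test-sub"

    publisher = pubsub_v1.PublisherClient()
    publisher.publish(publisher.topic_path(PROJECT_ID, TOPIC_ID), b"ping").result()
    published_at = time.time()

    monitoring = monitoring_v3.MetricServiceClient()
    metric_filter = (
        'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
        f'AND resource.labels.subscription_id = "{SUBSCRIPTION_ID}"'
    )

    while True:
        now = int(time.time())
        interval = monitoring_v3.TimeInterval(
            {"start_time": {"seconds": now - 600}, "end_time": {"seconds": now}}
        )
        series = monitoring.list_time_series(
            request={
                "name": f"projects/{PROJECT_ID}",
                "filter": metric_filter,
                "interval": interval,
                "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
            }
        )
        # Stop as soon as a data point written after the publish shows the backlog.
        done = False
        for ts in series:
            for point in ts.points:
                if (point.value.int64_value >= 1
                        and point.interval.end_time.timestamp() > published_at):
                    print(f"metric visible after ~{time.time() - published_at:.0f}s")
                    done = True
        if done:
            break
        time.sleep(15)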

We don't currently document what the expectations are for this, in part because there's not a single answer and it depends on different factors. But we are aware that this is important to have, and are working on a clear way to communicate it. Stay tuned.

I'm also facing high latencies in Pub/Sub monitoring with KEDA. It takes 2+ minutes for KEDA to start scaling up the pods based on the Pub/Sub undelivered-messages count provided by GCP Monitoring.

Related

AWS CloudWatch CPUUtilization vs NETSNMP discrepancies

For a few weeks now, we've noticed that our ScienceLogic monitoring platform, which uses SNMP, is unable to detect CPU spikes that CloudWatch alarms are picking up. Both platforms are configured to poll every 5 minutes, but SL1 is not seeing any CPU spikes above ~20%. Our CloudWatch CPU alarm is set to fire at 90%, and it has gone off twice in the past 12 hours for this EC2 instance. I'm struggling to understand why.
I know that CloudWatch pulls the CPUUtilization metric directly from the hypervisor, but I can't imagine it would differ so much from the CPU percentage captured by SNMP. Anyone have any thoughts? I wonder if I'm just seeing a scaling issue in SNMP?
(SL1 and CloudWatch graph screenshots omitted.)
I tried contacting ScienceLogic, and they asked me for the "formula" that AWS uses to capture this metric, which is a question I'm not really sure I understand, lol.
Monitors running inside and outside of a compute unit (here, a virtual machine) can observe different results, which I think is rather normal.
The SNMP agent runs inside the VM, so its code execution is heavily affected by high-CPU events (its threads get blocked). You may recall similar high-CPU events on a personal computer: when one application consumed all the CPU, other applications naturally became slow and unresponsive.
CloudWatch's sensors, on the other hand, sit outside the VM and are almost never affected by events inside it.
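One way to narrow this down is to pull the raw CloudWatch datapoints for the same window and compare them against the SNMP samples; if the alarm evaluates the Maximum statistic while SL1 charts something closer to an Average, a short spike can appear in one and not the other. A rough boto3 sketch, assuming working AWS credentials; the region and instance ID are placeholders:

    # Sketch: fetch both the Average and Maximum CPUUtilization statistics from
    # CloudWatch for the last 12 hours, to compare against what the SNMP-based
    # poller recorded. Region and instance ID are placeholders.
    from datetime import datetime, timedelta, timezone
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=12)

    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute buckets, matching the poll interval
        Statistics=["Average", "Maximum"],
    )

    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], f"avg={point['Average']:.1f}%", f"max={point['Maximum']:.1f}%")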

Do Idle Snowflake Connections Use Cloud Services Credits?

Motivation | Suppose one wanted to execute two SQL queries against a Snowflake DB, ~20 minutes apart.
Optimization Problem | Which would cost fewer cloud services credits:
1. Re-using one connection, and allowing that connection to idle in the interim.
2. Connecting once per query.
The documentation indicates that authentication incurs cloud services credit usage, but does not indicate whether idle connections incur credit usage.
Question | Does anyone know whether idle connections incur cloud services credit usage?
Snowflake connections are stateless. They do not occupy a resource, and they do not need to keep the TCP/IP connection alive like other database connections.
Therefore idle connections do not consume any of the Cloud Services Layer credits unless you enable "CLIENT_SESSION_KEEP_ALIVE".
https://docs.snowflake.com/en/sql-reference/parameters.html#client-session-keep-alive
When you set CLIENT_SESSION_KEEP_ALIVE, the client will periodically refresh the session token (the default heartbeat frequency is 1 hour).
https://docs.snowflake.com/en/sql-reference/parameters.html#client-session-keep-alive-heartbeat-frequency
As Peter mentioned, CSL usage up to 10% of daily warehouse usage is free, so refreshing the tokens will not cost you anything in practice.
About your approaches: I do not know how many queries you are planning to run daily, but creating a new connection for each query can be a performance killer. From a cost perspective, an idle connection will make at most 24 authorization requests per day, so if you plan to run more than 24 queries a day, I suggest picking the first approach.
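For reference, here is a minimal sketch of the two approaches using snowflake-connector-python. The connection parameters are placeholders, and client_session_keep_alive corresponds to the CLIENT_SESSION_KEEP_ALIVE behaviour discussed above:

    # Sketch of the two approaches with snowflake-connector-python.
    # Connection parameters below are placeholders.
    import time
    import snowflake.connector

    CONN_PARAMS = dict(account="xy12345", user="ME", password="***", warehouse="WH")

    # Approach 1: one connection, reused, idling between the two queries.
    conn = snowflake.connector.connect(client_session_keep_alive=True, **CONN_PARAMS)
    conn.cursor().execute("SELECT CURRENT_TIMESTAMP()")
    time.sleep(20 * 60)  # idle ~20 minutes; only periodic token refreshes happen
    conn.cursor().execute("SELECT CURRENT_TIMESTAMP()")
    conn.close()

    # Approach 2: connect once per query, paying the authentication cost each time.
    for query_number in range(2):
        conn = snowflake.connector.connect(**CONN_PARAMS)
        conn.cursor().execute("SELECT CURRENT_TIMESTAMP()")
        conn.close()
        if query_number == 0:
            time.sleep(20 * 60)  # wait ~20 minutes before the second query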
Even if idle connections do not cost anything on the Cloud Services side, is your warehouse kept running by those idle connections, giving you other costs to consider? I am guessing there are more factors to consider overall, which you could discuss with your Snowflake Account Team. Not trying to dodge your question, just trying to give a more complete answer!
In general, the Cloud Services costs are typically on the lower side compared to your other costs. Here are the main drivers for cloud services costs and how to minimize them: https://community.snowflake.com/s/article/Cloud-Services-Billing-Update-Understanding-and-Adjusting-Usage
The best advice you may get is to test your connections/workflows and compare the costs over time. The overall costs are going to depend on several factors. Even if there's a difference in costs between two workflows, you may still have to analyze the cost/output ratio and your business needs to determine if it's worth the savings.
Approach 1 will incur less cloud services usage, but more data transfer charges (to keep the connection alive). Only the Auth event incurs cloud services usage.
Approach 2 will incur more cloud services usage, but less data transfer charges.
However, the amount of cloud services usage or data transfer charges are extremely small in either case.
Note - any cloud services used (up to 10% of daily warehouse usage) are free, whereas there is no free bandwidth allocation, so using #2 may save you a few pennies.

Is there a GCP equivalent to AWS SQS?

I'm curious to understand the implementation of GCP's Pub/Sub. Although Pub/Sub appears to follow a publish-subscribe design pattern, it seems closer to AWS's SQS (a queue) than to AWS SNS (which uses the publish-subscribe model). Here is why I think this; GCP's Pub/Sub:
Allows up to 10,000 subscriptions per project.
Allows filtering on subscriptions.
Even allows ordering (beta), which should involve a FIFO queue somewhere.
Exposes a synchronous API for the request/response pattern.
It makes me wonder if subscriptions in Pub/Sub are essentially SQS-style queues.
I would like your opinions on this comparison. The confusion is due to the lack of implementation details on Pub/Sub and the obvious name indicating a certain design pattern.
Regards,
The division for messaging in GCP is along slightly different lines than what you may see in AWS. GCP breaks down messaging into three categories:
Torrents: Messaging pipelines that are designed to handle large amounts of throughput on pipes that are persistent. In other words, one creates a new pipeline rarely and sends messages over it for long periods of time. The scaling pattern for torrents is a relatively small number of pipelines transmitting a lot of data. For this category, Cloud Pub/Sub is the right product.
Trickles: Messaging pipelines that are largely ephemeral or require broadcast to a very large number of end-user devices. These pipelines have a low throughput but the number of pipelines can be extremely large. Firebase Cloud Messaging is the product that fits into this category.
Queues: Messaging pipelines where one has more control over the end-to-end message delivery. These pipelines are not really high throughput nor is the number of pipelines large, but more advanced properties are supported, e.g., the ability to delay or cancel the delivery of a message. Cloud Tasks fits in this category, though Cloud Pub/Sub is also adopting features that make it more and more viable for this use case.
So Cloud Pub/Sub is the publish/subscribe aspects of SQS+SNS, where SNS is used as a means to distribute messages to different SQS queues. It also serves as the big-data ingestion mechanism a la Kinesis. Firebase Cloud Messaging covers the portions of SNS designed to reach end user devices. Cloud Tasks (and Cloud Pub/Sub, more and more) provide functionality of a single queue in SQS.
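To make the comparison concrete: a Pub/Sub subscription can be consumed with the synchronous pull API in much the same way an SQS queue is consumed with ReceiveMessage/DeleteMessage. A minimal sketch, assuming the google-cloud-pubsub client library and an existing subscription (project and subscription names are placeholders):

    # Sketch: consuming a Pub/Sub subscription with synchronous pull, which looks
    # very much like the SQS ReceiveMessage / DeleteMessage loop.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "my-subscription")

    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})

    ack_ids = []
    for received in response.received_messages:
        print(received.message.data)  # process the message here
        ack_ids.append(received.ack_id)

    if ack_ids:
        # Acknowledging is the Pub/Sub analogue of deleting a message from an SQS queue.
        subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})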
You are correct to say that GCP Pub/Sub is close to AWS SQS. As far as I know, there is no exact SNS equivalent available in GCP, but I think the closest tool is GCM (Google Cloud Messaging). You are not the only one who has had this question:
AWS SNS equivalent in GCP stack

Monitor Google Cloud Run memory usage

Is there any built-in way to monitor memory usage of an application running in managed Google Cloud Run instances?
In the "Metrics" page of a managed Cloud Run service, there is an item called "Container Memory Allocation". However, as far as I understand it, this graph refers to the instance's maximum allocated memory (chosen in the settings), and not to the memory actually used inside the container. (Please correct me if I'm wrong.)
In the Stackdriver Monitoring list of available metrics for managed Cloud Run ( https://cloud.google.com/monitoring/api/metrics_gcp#gcp-run ), there also doesn't seem to be any metric related to the memory usage, only to allocated memory.
Thank you in advance.
Cloud Run now exposes a new metric named "Memory Utilization" in Cloud Monitoring; see more details here.
This metric captures the container memory utilization distribution across all container instances of the revision. It is recommended to look at the percentiles of this metric (50th, 95th, and 99th) to understand how utilized your instances are.
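As a rough illustration, this is how that distribution could be read with the Cloud Monitoring API, aligned to the 95th percentile. It assumes the metric type is run.googleapis.com/container/memory/utilizations (as listed on the Cloud Run metrics page) and the google-cloud-monitoring client library; the project and service names are placeholders:

    # Sketch: read the Cloud Run memory utilization distribution for the last hour
    # and align it to the 95th percentile. Project and service names are placeholders.
    import time
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
    )
    aggregation = monitoring_v3.Aggregation(
        {
            "alignment_period": {"seconds": 300},
            "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_PERCENTILE_95,
        }
    )

    series = client.list_time_series(
        request={
            "name": "projects/my-project",
            "filter": (
                'metric.type = "run.googleapis.com/container/memory/utilizations" '
                'AND resource.labels.service_name = "my-service"'
            ),
            "interval": interval,
            "aggregation": aggregation,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    for ts in series:
        for point in ts.points:
            print(point.interval.end_time, f"p95 utilization = {point.value.double_value:.2%}")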
Currently, there seems to be no way to monitor the memory usage of a Google Cloud Run instance through Stackdriver or on "Cloud Run" page in Google Cloud Console.
I have filed a feature request on your behalf, in order to add memory usage metrics to Cloud Run. You can see and track this feature request in the following link.
There is not currently a metric on memory utilization. However, if your service reaches a memory limit, the following log will appear in Stackdriver Logging with ERROR-level severity:
"Memory limit of 256M exceeded with 325M used. Consider increasing the memory limit, see https://cloud.google.com/run/docs/configuring/memory-limits"
(Replace specific numbers accordingly.)
Based on this log message, you could create a log-based metric that counts memory-limit-exceeded events.
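If it helps, here is a small sketch that lists the matching log entries with the google-cloud-logging client before turning the filter into a log-based metric. The filter wording is an assumption based on the message quoted above and may need adjusting; the project name is a placeholder:

    # Sketch: list Cloud Run memory-limit-exceeded log entries, i.e. the entries a
    # log-based metric would count. The filter is an assumption based on the log
    # message quoted above.
    from google.cloud import logging

    client = logging.Client(project="my-project")
    log_filter = (
        'resource.type = "cloud_run_revision" '
        'AND severity >= ERROR '
        'AND textPayload:"Memory limit of"'
    )

    for entry in client.list_entries(filter_=log_filter, max_results=20):
        print(entry.timestamp, entry.payload)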

Will Google Cloud Run support GPU/TPU some day?

So far Google Cloud Run supports only CPU. Is there any plan to support GPUs? It would be super cool if GPUs were available; then I could demo the DL project without actually running a super expensive GPU instance.
I seriously doubt it. GPU/TPUs are specialized hardware. Cloud Run is a managed container service that:
Enables you to run stateless containers that are invokable via HTTP requests. This means that CPU-intensive applications are not supported. In between the HTTP request and response, the CPU is idled to near zero, so your expensive GPU/TPUs would sit idle.
Autoscales based upon the number of requests per second. Launching 10,000 instances in seconds is easy to achieve. Imagine the billing support nightmare for Google if customers could launch that many GPU/TPUs and the size of the bills.
Is billed in 100 ms time intervals. Most requests fit into a few hundred milliseconds of execution. This is not a good execution or business model for CPU/GPU/TPU integration.
Provides a billing model which significantly reduces the cost of web services to near zero when not in use. You just pay for the costs to store your container images. When an HTTP request is received at the service URL, the container image is loaded into an execution unit and processing requests resume. Once requests stop, billing and resource usage also stop.
GPU/TPU types of data processing are best delivered by backend instances that protect and manage the processing power and costs that these processor devices provide.
You can use GPUs with Cloud Run for Anthos:
https://cloud.google.com/anthos/run/docs/configuring/compute-power-gpu
