What factors determine "catch-up speed" on IoT Edge after a network outage - azure-iot-edge

I have a customer with IoT Edge deployed to manufacturing plants in remote areas with spotty internet. They have leaf devices sending messages to IoT Edge and then on to IoT Hub. They frequently have small outages (5, 10, 15 minutes), and they often need to make timely decisions based on the data that makes it to IoT Hub from the plants. They've noticed that after a 15-minute outage, it can take anywhere from 15 to 30 minutes for IoT Edge to catch up.
Besides network speed itself, what factors would influence that? For example:
- If we were hitting throttling based on their number of IoT Hub units, would that be surfaced in the edgeHub logs?
- If disk, network, etc. can keep up, does edgeHub pretty much upload data as fast as possible (subject to throttling), or are there other limits imposed by default?
- What is the default connection retry policy in edgeHub? Is it the same exponential backoff policy as in the C# SDK? If so, could it be that after a 15-minute outage, edgeHub takes a while after network recovery to 'try again'? And is that policy configurable in edgeHub (via an environment variable or something)?
Any other things to check?
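One concrete thing to watch during catch-up is edgeHub's built-in queue-length metric. Below is a minimal sketch that polls it; the URL and port are assumptions based on the default metrics endpoint in recent IoT Edge releases, so adjust them to your deployment. A queue length that stays flat for a while after connectivity returns would point at retry backoff or throttling rather than raw bandwidth.

```python
# Minimal sketch: poll edgeHub's built-in Prometheus metrics and watch the
# upstream queue drain after an outage. URL/port are assumptions based on
# the default metrics endpoint; adjust to your deployment.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:9600/metrics"  # assumed default endpoint

QUEUE_RE = re.compile(r'^edgehub_queue_length(\{[^}]*\})?\s+(\S+)', re.MULTILINE)

while True:
    body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
    for labels, value in QUEUE_RE.findall(body):
        print(f"{time.strftime('%H:%M:%S')} edgehub_queue_length{labels} = {value}")
    time.sleep(30)  # a value that stays flat after the link returns suggests
                    # retry backoff or throttling rather than raw bandwidth
```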

Related

How much impact does the network delay have on IoT Edge throughput?

We have a customer who has deployed a number of IoT Edge transparent gateways and routes data from a large number of leaf devices to the cloud.
Recently they noticed that on some of the edge devices the output (edge to IoT Hub) cannot keep up with the input, which is causing severe latency for their messages.
Here are the built-in edgeHub metrics for the affected device, named 8B:
- edgehub_queue_length: 981061
- edgehub_message_send_duration_seconds (~110 ms median):
  - {quantile="0.1"} 0.0632608
  - {quantile="0.5"} 0.1136008
  - {quantile="0.9"} 0.127605
  - {quantile="0.99"} 0.2449048
- edgehub_message_process_duration_seconds: 0.5-2.0 ms
We would like to clarify two questions:
What is the recommended network latency for an IoT Edge gateway?
Are there any other methods we can use to improve the output throughput of edgeHub?
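As a rough sanity check on those numbers, here is a back-of-envelope estimate, assuming sends are effectively serial (one in-flight message at a time, the worst case; batching or overlapping sends would shorten it):

```python
# Back-of-envelope drain time for device 8B, using the metrics quoted above.
# Assumption: sends are serial, so drain time = queue length x send duration.
queue_length = 981_061        # edgehub_queue_length
send_duration_s = 0.1136008   # median edgehub_message_send_duration_seconds

drain_hours = queue_length * send_duration_s / 3600
print(f"~{drain_hours:.1f} hours to drain, ignoring new input")  # ~31 hours
```

At ~110 ms per serial send, the ceiling is under 10 messages/s, so per-message round-trip latency, not bandwidth, dominates; and if input keeps arriving faster than that, the queue never drains at all.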

Do Idle Snowflake Connections Use Cloud Services Credits?

Motivation | Suppose one wanted to execute two SQL queries against a Snowflake DB, ~20 minutes apart.
Optimization Problem | Which would cost fewer cloud services credits:
1. Re-using one connection, and allowing that connection to idle in the interim.
2. Connecting once per query.
The documentation indicates that authentication incurs cloud services credit usage, but does not indicate whether idle connections incur credit usage.
Question | Does anyone know whether idle connections incur cloud services credit usage?
Snowflake connections are stateless. They do not occupy a resource, and they do not need to keep the TCP/IP connection alive like other database connections.
Therefore, idle connections do not consume any Cloud Services Layer credits unless you enable CLIENT_SESSION_KEEP_ALIVE.
https://docs.snowflake.com/en/sql-reference/parameters.html#client-session-keep-alive
When you set CLIENT_SESSION_KEEP_ALIVE, the client will periodically refresh the token for the session (the default heartbeat frequency is 1 hour).
https://docs.snowflake.com/en/sql-reference/parameters.html#client-session-keep-alive-heartbeat-frequency
As Peter mentioned, CSL usage up to 10% of daily warehouse usage is free, so refreshing the tokens will not cost you anything in practice.
About your approaches: I do not know how many queries you are planning to run daily, but creating a new connection for each query can be a performance killer. From a cost perspective, an idle connection will make at most 24 authorization requests per day, so if you plan to run more than 24 queries a day, I suggest you pick the first approach.
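For concreteness, here is a sketch of the two approaches using the Snowflake Python connector; the credentials are placeholders, and client_session_keep_alive is the connector-level switch corresponding to the CLIENT_SESSION_KEEP_ALIVE parameter discussed above:

```python
# Sketch of the two approaches with the Snowflake Python connector.
# Credentials are placeholders; client_session_keep_alive corresponds to
# the CLIENT_SESSION_KEEP_ALIVE parameter discussed above.
import time
import snowflake.connector

CREDS = dict(account="my_account", user="my_user", password="...")  # placeholders

# Approach 1: one connection reused across queries ~20 minutes apart.
# A single auth event; the heartbeat refreshes the session token hourly.
conn = snowflake.connector.connect(client_session_keep_alive=True, **CREDS)
try:
    conn.cursor().execute("SELECT 1")
    time.sleep(20 * 60)  # idle interval between the two queries
    conn.cursor().execute("SELECT 2")
finally:
    conn.close()

# Approach 2: a fresh connection per query -> one auth event per query.
for sql in ("SELECT 1", "SELECT 2"):
    c = snowflake.connector.connect(**CREDS)
    try:
        c.cursor().execute(sql)
    finally:
        c.close()
```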
Even if idle connections do not cost anything in the Cloud Services respect, is your warehouse running while connections sit idle, giving you other costs to consider? I am guessing there are more factors to consider overall, which you can discuss with your Snowflake account team. Not trying to dodge your question, just trying to give a more complete answer!
In general, Cloud Services costs are typically on the lower side compared to your other costs. Here are the main drivers of cloud services costs and how to minimize them: https://community.snowflake.com/s/article/Cloud-Services-Billing-Update-Understanding-and-Adjusting-Usage
The best advice you may get is to test your connections/workflows and compare the costs over time. The overall costs are going to depend on several factors. Even if there's a difference in costs between two workflows, you may still have to analyze the cost/output ratio and your business needs to determine if it's worth the savings.
Approach 1 will incur less cloud services usage, but more data transfer charges (to keep the connection alive). Only the Auth event incurs cloud services usage.
Approach 2 will incur more cloud services usage, but less data transfer charges.
However, the amount of cloud services usage or data transfer charges are extremely small in either case.
Note - any cloud services used (up to 10% of daily warehouse usage) are free, whereas there is no free bandwidth allocation, so using #2 may save you a few pennies.

Will Google Cloud Run support GPU/TPU some day?

So far, Google Cloud Run supports only CPUs. Is there any plan to support GPUs? It would be super cool if GPUs were available; then I could demo my DL project without actually running a super expensive GPU instance.
I seriously doubt it. GPU/TPUs are specialized hardware. Cloud Run is a managed container service that:
- Enables you to run stateless containers that are invokable via HTTP requests. In between request and response, the CPU is throttled to near zero, so CPU-intensive background work is not supported; your expensive GPU/TPU would sit idle.
- Autoscales based upon the number of requests per second. Launching 10,000 instances in seconds is easy to achieve. Imagine the billing-support nightmare for Google if customers could launch that many GPUs/TPUs, and the size of the bills.
- Is billed in 100 ms time intervals. Most requests fit into a few hundred milliseconds of execution. This is not a good execution or business model for GPU/TPU integration.
- Provides a billing model which reduces the cost of web services to near zero when not in use; you just pay to store your container images. When an HTTP request arrives at the service URL, the container image is loaded into an execution unit and request processing resumes. Once requests stop, billing and resource usage also stop.
GPU/TPU types of data processing are best delivered by backend instances that protect and manage the processing power and costs that these processor devices provide.
You can use GPUs with Cloud Run for Anthos:
https://cloud.google.com/anthos/run/docs/configuring/compute-power-gpu

How do IoT Edge "internal" messages count against my message quota?

IoT Hub is billed based on the number of messages per day (including updates and retrievals of twins, etc.). We know that IoT Edge uses some internal messages to operate, such as the reported health/status updates that appear in the portal for its modules, retrieval of its own device twin, module twins, etc.
How does this traffic count against my daily quota? i.e., what "counts"? My expectation would be that explicit twin updates/retrievals from custom modules would count, but does the edgeAgent/edgeHub traffic count? If it does, how often does that happen?
It doesn't seem like a lot of traffic, but it affects pricing and sizing of IoT solutions, so it needs to be factored in.
--Steve
IoT Edge is "free" with IoT Hub (i.e. the features are available on all IoT hubs; you don't have to bring in/pay for a separate resource), but you do pay for all traffic. Mostly that will just be your traffic (messages your devices/modules are sending/receiving), but Edge Agent and Edge Hub do twin operations when the edge device is starting up, and when things change. So if you deploy a new module to your edge device you'll see some Edge Agent twin traffic related to that. If you change some routes, you'll see the corresponding Edge Hub twin traffic.
As the product nears general availability, you can expect to see documentation that outlines how the Agent and Hub are using their twins, so you know what to expect.

Mirrored queue performance factors

We operate two dual-node brokers, each broker having quite different queues and workloads. Each box has 24 cores (H/T) worth of Xeon E5645 @ 2.4GHz with 48GB RAM, connected by Gigabit LAN with ~150μs latency, running RHEL 5.6, RabbitMQ 3.1, Erlang R16B with HiPE off. We've tried with HiPE on but it made no noticeable performance impact, and was very crashy.
We appear to have hit a ceiling for our message rates of between 1,000/s and 1,400/s both in and out. This is broker-wide, not per-queue. Adding more consumers doesn't improve throughput overall, just gives that particular queue a bigger slice of this apparent "pool" of resource.
Every queue is mirrored across the two nodes that make up the broker. Our publishers and consumers connect equally to both nodes in a persistent way. We notice an ADSL-like asymmetry in the rates too: if we manage to publish at a high rate, the delivery rate drops to the high double digits. Testing with an unmirrored queue gives much higher throughput, as expected. Queues and exchanges are durable; messages are not persistent.
We'd like to know what we can do to improve the situation. The CPU on the box is fine: beam takes a core and a half for one process, then another 80% each of two cores for another couple of processes. The rest of the box is essentially idle. We are using ~20GB of RAM in userland, with the system cache filling the rest. IO rates are fine. Network is fine.
Is there any Erlang/OTP tuning we can do? delegate_count is the default 16, could someone explain what this does in a bit more detail please?
This is difficult to answer without knowing more about how your producers and consumers are configured, which client library you're using, and so on. As discussed on IRC (http://dev.rabbitmq.com/irclog/index.php?date=2013-05-22) a minute ago, I'd suggest you attempt to reproduce the topology using the MulticastMain Java load-test tool that ships with the RabbitMQ Java client. You can configure multiple producers/consumers, message sizes, and so on. I can certainly get 5k msg/s out of a two-node cluster with HA on my desktop, so this may be a client (or application code) related issue.
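If a Java toolchain isn't handy, a minimal Python sketch with pika can approximate the same topology described above: durable queue, non-persistent messages. The host name and payload size are placeholders, and queue mirroring itself is assumed to be configured server-side via an HA policy, since clients don't control it.

```python
# Minimal load sketch mirroring the setup described above: durable queue,
# non-persistent messages. Queue mirroring is configured server-side (e.g.
# an "ha-mode: all" policy), not by the client. Host name is a placeholder.
import time
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-node-1"))
ch = conn.channel()
ch.queue_declare(queue="mirror-test", durable=True)

N = 10_000
start = time.time()
for i in range(N):
    ch.basic_publish(
        exchange="",
        routing_key="mirror-test",
        body=b"x" * 256,  # 256-byte payload; vary to match real traffic
        properties=pika.BasicProperties(delivery_mode=1),  # non-persistent
    )
elapsed = time.time() - start
print(f"published {N} msgs in {elapsed:.1f}s ({N / elapsed:.0f}/s)")
conn.close()
```

Comparing the observed publish rate against the same run on an unmirrored queue isolates how much of the ceiling is due to mirroring versus the client or application code.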
