IoT Edge Hub: testing for offline storage - azure-iot-edge

We use Azure IoT Edge with the edgeHub and edgeAgent runtime modules. We want to verify that offline storage is configured correctly in our environment.
We have a custom simulator module connected to a custom publisher module that publishes to an API in the cloud. The simulator continuously produces a message of around 10 KB every 2 seconds. The publisher module cannot talk to the outside because of a blocking firewall rule. I want to verify that all the memory/RAM allocated to edgeHub is used and that messages later overflow to disk, using a reasonable amount of the available disk space.
This exercise is taking a long time to complete, even when I run multiple instances of the simulator module.
Queries:
How can I control the amount of memory allocated to edgeHub? What are the correct createOptions to control/reduce the allocated memory? It is currently around 1.8 GB.
During the exercise, RAM usage generally keeps increasing, but at some point it drops a little and then keeps increasing again. Is there some kind of GC or optimization happening inside edgeHub?
b0fb729d25c8 edgeHub 1.36% 547.8MiB / 1.885GiB 28.38% 451MB / 40.1MB 410MB / 2.92GB 221
How can I ensure that none of the messages produced by the simulator are lost? Is there a way to count the number of messages in edgeHub?
I have configured a directory mount from the VM into the container for persistent storage. Is there a specific folder inside the edgeHub directory under which messages are stored when they overflow?

I will document the answers based on the input I have received from the Azure IoT Hub/IoT Edge maintainers.
To control the memory limit of containers/modules on IoT Edge, specify createOptions as below with an appropriate maximum memory limit in bytes:
"createOptions": "{\"HostConfig\": { \"Memory\": 536870912 }}"
Messages are persistent because the underlying implementation is RocksDB, which stores to disk by default.
To keep messages persistent even after the edgeHub container is recreated, mount a directory from the host into the container. In the example below, edgeHub is told to store its data under /data inside the container:
"env": {
"storageFolder": {
"value": "/data"
}
}
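Note that storageFolder is the path inside the container; the host directory still has to be mounted into it via createOptions. A minimal sketch of the combined edgeHub settings, assuming a hypothetical host directory /srv/edgehub that exists and is writable by the container user (the image tag is illustrative):

"settings": {
  "image": "mcr.microsoft.com/azureiotedge-hub:1.0",
  "createOptions": "{\"HostConfig\":{\"Binds\":[\"/srv/edgehub:/data\"],\"Memory\":536870912}}"
}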

To monitor memory usage, use the metrics feature, which will be generally available in release 1.0.10:
Monitor IoT Edge Devices At-scale
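Once metrics are available, queued messages can also be counted: among edgeHub's built-in Prometheus metrics there is an edgehub_queue_length gauge reporting how many messages are sitting in each endpoint's queue, which is one way to confirm nothing is being lost while offline. A sketch of createOptions that publishes the metrics endpoint, assuming edgeHub serves it on port 9600 as in 1.0.10 (mapping it to host port 9600 is my assumption; any free port works):

"createOptions": "{\"ExposedPorts\":{\"9600/tcp\":{}},\"HostConfig\":{\"PortBindings\":{\"9600/tcp\":[{\"HostPort\":\"9600\"}]}}}"

The metrics can then be read with an HTTP GET against http://<device-ip>:9600/metrics.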

Related

Trying to understand and debug ETCDv2 Memory usage

I’m trying to understand ETCD’s memory and disk usage within a deployed system using the ETCDv2 API. The system has a file being saved on a regular basis, each time under a new key, and we’re concerned that long-term there’s no clean-up of state leading to both memory and disk usage growing unbounded on each VM in the etcd cluster. We’ve also emulated this, using a large file (several MB) being saved every few minutes.
From the etcd docs, I expected the following:
Each insertion would save the file to disk, causing disk usage to grow unbounded.
This matches what I am seeing.
In memory, etcd would save a key-value pair where the value is a lookup address for the file on disk (taking up a very small amount of memory) and a cached version of the file (taking a large amount of memory).
I would then expect that rebooting an etcd pod after several file writes would (mostly) clear the cache: a consistently-up pod would see memory grow unbounded, but if the pod rebooted, the cache would be cleared of all but the active entry (and any specifically requested, e.g. by attempted rollbacks), and memory usage would (mostly) reset with each reboot.
However, in practice we see only a very small memory drop on reboot, which is regained almost immediately after the pod recovers (as though the entire cache is restored from the peers).
Is my understanding correct? And if so:
Why doesn't the memory usage fully reset after an etcd pod reboot? Does the etcd cache get synced from the cluster, as well as the main key-value table and file storage?
Is there a recommended way to keep etcd’s memory and disk usage within bounded limits?
Additional notes:
I’ve tried reducing the snapshot_count configuration setting - this doesn’t seem to have had any impact (unless I’ve reduced it too far - I cut it right down to 5 from the default of 100,000).
I’ve attempted changing our file saving to overwrite a single file with a new version each time, instead of storing a new file. This doesn’t appear to have had any impact (although this may be due to issues in my prototype; I’m still investigating).
We can’t migrate existing deployments to etcd v3 file-systems, so are specifically looking at etcd v2 solutions. I think this rules out compact and defrag steps, which seem to be a core part of the answer to this problem in v3.
Any help or insight would be gratefully appreciated.
Thanks!

How to get host system information like memory, CPU utilization, etc. on an Azure IoT Edge device

How can I get host system information, such as memory and CPU utilization metrics, on an Azure IoT Edge device?
Is there an Azure Java SDK available for this? What Azure tools can be used? Does the Azure agent have these details?
Regards
Jayashree
You can use your own solution to access these metrics, or you can use the metrics-collector module, which collects the built-in metrics and sends them to Azure Monitor or Azure IoT Hub. For more information, see Collect and transport metrics and Access built-in metrics.
Also, you can declare how much of the host's resources a module can use. This control helps ensure that one module can't consume too much memory or CPU and prevent other processes from running on the device. You can manage these settings with Docker container create options in the HostConfig group, including the following (see the sketch after the reference below):
Memory: Memory limit in bytes. For example, 268435456 bytes = 256 MB.
MemorySwap: Total memory limit (memory + swap). For example, 536870912 bytes = 512 MB.
NanoCpus: CPU quota in units of 10⁻⁹ (one billionth) of a CPU. For example, 250000000 NanoCpus = 0.25 CPU.
Reference: Restrict module memory and CPU usage
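For illustration, a minimal sketch of how these three options could be combined in a module's createOptions in the deployment manifest (the values are the examples above; the image tag is illustrative):

"settings": {
  "image": "mcr.microsoft.com/azureiotedge-hub:1.0",
  "createOptions": "{\"HostConfig\":{\"Memory\":268435456,\"MemorySwap\":536870912,\"NanoCpus\":250000000}}"
}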

pgpool node detached for no reason?

I have a two-node PostgreSQL cluster running on VMs, where each VM runs both the pgpool service and a Postgres server.
Due to an insufficient memory configuration, the Postgres server crashed, so I bumped the VM memory and changed the Postgres memory config in the postgresql.conf file. Since that memory change, the slave pgpool node detaches every night at a specific time, even though node_exporter metrics for CPU, load, processes, disk usage, and memory show no spikes or sudden changes.
The slave node has detached before, but not day after day. I stumbled upon this thread and read this part of the documentation about failover, but since the Postgres server didn't crash and existing connections to the slave node kept working (it kept serving existing connections but didn't take new ones), network issues seemed irrelevant, especially after consulting our OPS team on whether they had noticed any abnormal network or DNS activity that could explain it. Unfortunately, they didn't notice any interesting findings.
I have pg_exporter, postgres_exporter, and node_exporter on each node to monitor server and VM behavior. What should I be looking for to debug this? What should I ask our OPS team to check specifically? Our pgpool log file only states the failure to access the other node, with no exact reason; as the aforementioned docs say:
Pgpool-II does not distinguish each case and just decides that the
particular PostgreSQL node is not available if health check fails.
Could it still be a network/DNS issue? And if so, how would I confirm this?
Thanks for reading and taking the time to assist me with this conundrum.
That was interesting.
To sum up the gist of it: it was part of the OPS team's infrastructure backups.
The entire process went like this:
Setting the scene:
We run on-prem on top of a VMware vCenter cluster, backed up on the infra side with VMware VM snapshots and Veeam VM backups, where the vmdk files/ESXi datastores reside on NFS-based NetApp storage.
When checking node_exporter metrics in the Node Exporter Full dashboard, I saw network traffic drop to the order of 2 packets per second for about 5 to 15 minutes, consistently over the last few months, with the episodes growing dramatically longer in the last month (around the same time late at night).
After checking again with our OPS team, they indicated it could be the host configuration/Veeam backups.
It turns out that because the storage for the VMs (including the one that runs the Veeam backup) is attached over the network rather than directly to the ESXi hosts, the final snapshot is saved/consolidated at that late-night time, hence:
node detaches every night at a specific time
The way NFS handles disk locking (limiting IOPS to existing data), together with the high IOPS requirements of the Veeam backup, causes the server to hang/freeze and, on rare occasions, even a VM restart. Here's the quote from the Veeam issue doc:
The snapshot removal process significantly lowers the total IOPS that can be delivered by the VM because of additional locks on the VMFS storage due to the increase in metadata updates
the snapshot removal process will easily push that into the 80%+ mark and likely much higher. Most storage arrays will see a significant latency penalty once IOP's get into the 80%+ mark which will of course be detrimental to application performance.
This issue occurs when the target virtual machine and the backup appliance [proxy] reside on two different hosts, and the NFSv3 protocol is used to mount NFS datastores. A limitation in the NFSv3 locking method causes a lock timeout, which pauses the virtual machine being backed up [during snapshot removal].
Obviously, that would interfere at the very least with Postgres functionality, especially when configured as a cluster with replication, which requires a near-constant connection between the Postgres servers. A similar thread on SO covers the same problem with a different DB server.
A solution is suggested there, including a fix for the issue presented in the last quote, in this link. For the time being, though, we have removed Veeam backup for sensitive VMs until the solution can be verified locally (I will update in the future if it is tried and fixes the issue).
Additional incident documentation: a similar issue case, suggested solution info from Veeam, a third-party site solution (a temporary fix for the same problem, as I see it), and a Reddit thread acknowledging the issue and suggesting options.

Azure disk alert for AKS dynamic disk

In AKS we dynamically create and use persistent volume with azure disk.
We set up an initial size of 20 GB for a service, and we ran out of space.
Is there any way to set up a disk alert for dynamic disks so that we are informed?
Starting with agent version ciprod10052020, the Container insights integrated agent supports monitoring PV (persistent volume) usage: Reference
Create metric alerts:
To alert on system resources when they are experiencing peak demand and running near capacity, with Container insights you would create a log alert based on performance data stored in Azure Monitor Logs.
For reference, see alerts-metric-overview.
Container insights includes "Average disk usage %" and "Average persistent volume usage %" among its many alerts.
See the prerequisites and steps to enable the metric alerts in Azure Monitor from the Azure portal.
Log-based alerts:
Log alerts run queries on Log Analytics data: see Log alerts and its prerequisites.
To create an alert rule for container resource utilization, refer to: create an alert rule.
Note: Creating an alert rule for container resource utilization requires you to switch to a new log alerts API, as described in switch API preference for log alerts.
To query and alert when disk usage on cluster nodes exceeds a threshold, see the last section of resource-utilization-log-search-queries.

Does Google Cloud Run memory limit apply to the container size?

Regarding Cloud Run's memory usage, from the docs (https://cloud.google.com/run/docs/configuring/memory-limits):
Cloud Run applications that exceed their allowed memory limit are terminated.
When you configure memory limit settings, the memory allocation you are specifying is used for:
Operating your service
Writing files to disk
Running binaries or other processes in your container, such as the nginx web server.
Does the size of the container count towards "operating your service", and thus towards the memory limit?
We intend to use images whose size could already approach the memory limit, so we would like to know whether the service itself will only have access to what is left after subtracting the container size from the limit.
Cloud Run PM here.
Only what you load in memory counts toward your memory usage. So for example, if you have a 2GB container but only execute a very small binary inside it, then only this one will count as used memory.
This means that if your image contains a lot of OS packages that will never be loaded (for example, because you inherited from a big base image), this is fine.
Size of the container image you deploy to Cloud Run does not count towards the memory limit. For example, if your container image is 3 GiB, you can still run on a 256 MiB memory environment.
Writing new files to the local filesystem, or (obviously) allocating more memory within your app, will count towards the memory usage of your container. Perhaps also obvious, but worth mentioning: the operating system will load your container's entrypoint executable into memory in order to execute it, and that counts towards the available memory as well.
