I am trying to figure out how best to monitor usage of our HPC resources. Specifically, I am trying to identify cpu usage, disk space consumed, and number of jobs run by group.
PBS supports the "-W group_list=" flag to identify the group a job belongs to. I want to use this to monitor cluster usage, but I can't find documentation on how to track it over time.
gmond and gmetric offer some of this functionality - I can see the parameters I'm interested in, but I can't figure out how to group them by the -W group_list value, by user, or by some other dimension.
Any advice?
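One low-tech starting point, in case it helps frame answers: PBS writes per-job accounting records (under $PBS_HOME/server_priv/accounting/ on TORQUE/PBS Pro), and the end-of-job "E" records carry user=, group= and resources_used.* fields, so the logs can be aggregated by group directly. A minimal sketch in Python - the exact field names and sample lines below are assumptions; check your PBS flavor's accounting log format:

```python
import re
from collections import defaultdict

def hms_to_seconds(hms):
    """Convert a PBS 'HH:MM:SS' duration to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def summarize(lines):
    """Aggregate job counts and CPU time per group from PBS 'E' (end-of-job) records."""
    usage = defaultdict(lambda: {"jobs": 0, "cput_seconds": 0})
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 4 or fields[1] != "E":  # only end-of-job records
            continue
        attrs = dict(re.findall(r"(\S+)=(\S+)", fields[3]))
        group = attrs.get("group", "unknown")
        usage[group]["jobs"] += 1
        cput = attrs.get("resources_used.cput")
        if cput:
            usage[group]["cput_seconds"] += hms_to_seconds(cput)
    return dict(usage)

# Hypothetical accounting lines for illustration:
sample = [
    "04/10/2023 12:00:01;E;123.srv;user=alice group=physics resources_used.cput=01:00:00",
    "04/10/2023 12:05:02;E;124.srv;user=bob group=physics resources_used.cput=00:30:00",
    "04/10/2023 12:06:03;E;125.srv;user=carol group=bio resources_used.cput=00:10:00",
]
print(summarize(sample))
```

Running this nightly over the accounting directory and pushing the per-group totals into gmetric (or any time-series store) would give the over-time view.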
For the project I'm working on, I need to measure the total network traffic for a specific container over a period of time. These periods are generally about 20 seconds, and the precision needed is realistically only kilobytes. Ideally the solution would not involve additional software either in the containers or on the host machine, and would be suitable for Linux/Windows hosts.
Originally I had planned to use the 'NET I/O' attribute of the 'docker stats' command, but the field is automatically formatted into a more human-readable form (e.g. '200 MB'), which means that for containers that have been running for some time I can't get the precision I need.
Is there any way to get the raw value of 'NET I/O', or to reset the running count? Beyond this I've explored using something like tshark or iptables, but as I said above, ideally the solution would not require additional programs. If there aren't any good solutions that meet those criteria, any other suggestions would be welcome. Thank you!
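The Docker Engine API does expose the raw counters behind 'NET I/O': GET /containers/&lt;id&gt;/stats?stream=false (e.g. via curl --unix-socket /var/run/docker.sock) returns JSON where networks.&lt;iface&gt;.rx_bytes and tx_bytes are plain integers, so sampling at the start and end of the 20-second window and subtracting gives the traffic without resetting anything. A sketch of just the delta calculation, using hypothetical payloads shaped like the stats endpoint's response:

```python
import json

def net_bytes(stats_json):
    """Sum rx+tx bytes across all interfaces in one docker stats payload."""
    nets = json.loads(stats_json).get("networks", {})
    return sum(n["rx_bytes"] + n["tx_bytes"] for n in nets.values())

def traffic_kb(before_json, after_json):
    """Traffic in KB between two samples of the stats endpoint."""
    return (net_bytes(after_json) - net_bytes(before_json)) / 1024

# Two hypothetical samples taken ~20 s apart:
before = json.dumps({"networks": {"eth0": {"rx_bytes": 1000000, "tx_bytes": 500000}}})
after = json.dumps({"networks": {"eth0": {"rx_bytes": 1512000, "tx_bytes": 700000}}})
print(traffic_kb(before, after))  # 695.3125
```

This uses nothing beyond the daemon that is already running the containers, which may satisfy the "no additional software" requirement.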
Is there any built-in way to monitor memory usage of an application running in managed Google Cloud Run instances?
In the "Metrics" page of a managed Cloud Run service, there is an item called "Container Memory Allocation". However, as far as I understand it, this graph refers to the instance's maximum allocated memory (chosen in the settings), and not to the memory actually used inside the container. (Please correct me if I'm wrong.)
In the Stackdriver Monitoring list of available metrics for managed Cloud Run ( https://cloud.google.com/monitoring/api/metrics_gcp#gcp-run ), there also doesn't seem to be any metric related to the memory usage, only to allocated memory.
Thank you in advance.
Cloud Run now exposes a new metric named "Memory Utilization" in Cloud Monitoring; see more details here.
This metric captures the distribution of container memory utilization across all container instances of the revision. It is recommended to look at the percentiles of this metric (50th, 95th and 99th) to understand how utilized your instances are.
Currently, there seems to be no way to monitor the memory usage of a Google Cloud Run instance through Stackdriver or on the "Cloud Run" page in the Google Cloud Console.
I have filed a feature request on your behalf to add memory usage metrics to Cloud Run. You can see and track this feature request at the following link.
There is not currently a metric on memory utilization. However, if your service reaches a memory limit, the following log will appear in Stackdriver Logging with ERROR-level severity:
"Memory limit of 256M exceeded with 325M used. Consider increasing the memory limit, see https://cloud.google.com/run/docs/configuring/memory-limits"
(Replace specific numbers accordingly.)
Based on this log message, you could create a Log-based Metric for memory exceeded.
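For reference, such a log-based counter metric can also be created from the command line; the filter below is a guess at matching the message above and may need adjusting for your project:

```shell
gcloud logging metrics create cloud_run_memory_exceeded \
  --description="Cloud Run container hit its memory limit" \
  --log-filter='resource.type="cloud_run_revision" AND severity>=ERROR AND textPayload:"Memory limit of"'
```

You can then alert on that metric in Cloud Monitoring.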
I am using Azure Log Analytics to store monitoring data from AKS clusters. 72% of the data stored is performance data. Is there a way to limit how often AKS reports performance data?
At this point we do not provide a mechanism to change the performance metric collection frequency. It is set to 1 minute and cannot be changed.
We were actually thinking about adding an option for more frequent collection, as requested by some customers.
Given the number of objects (pods, containers, etc.) running in the cluster, collecting even a few perf metrics may generate a noticeable amount of data... You need that data in order to figure out what is going on in case of a problem.
Curious: you say your perf data is 72% of the total - how much is that in terms of GB/day, do you know? Do you have any active applications running on the cluster generating tracing? What we see is that once you stand up a new cluster, perf data is "the king" of volume, but once you start adding active apps that trace, logs become more and more of a factor in telemetry data volume...
I am looking for monitoring tool for the following use cases:
Collect basic metrics about virtual machine (cpu usage, memory usage, i/o, available space)
Extract metrics from SQL Server (probably running some queries)
Extract information from an external service about processing, i.e. how many processing jobs are currently running and for how long. I am thinking about writing Python scripts, but I don't know how to combine them with a monitoring tool
Have the ability to plot charts and manage alerts; it would also be nice to be able to send not only emails but also messages to Slack/MS Teams.
I was thinking about Prometheus, because it has wmi_exporter, node_exporter, sql_exporter, and Alertmanager with the possibility of sending notifications to multiple destinations, but I don't know what to do with the external service and Python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can definitely fill in the blanks yourself.
E.g. you can get machine metrics basically out of the box by firing up a node_exporter and having it scraped by Prometheus, but I don't think it has e.g. information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there exists a Python client library to help with that. Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short lived batch jobs.
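To illustrate what such an exporter boils down to - in practice you would use the official prometheus_client Python library, which handles all of this for you - here is a standard-library-only sketch that serves one made-up metric (running_jobs is a placeholder you would replace with a real query against your external service):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def running_jobs():
    """Placeholder: replace with a real query against your external service."""
    return 3

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus text exposition format: HELP/TYPE comments, then samples.
        body = (
            "# HELP external_running_jobs Jobs currently running in the external service\n"
            "# TYPE external_running_jobs gauge\n"
            f"external_running_jobs {running_jobs()}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# To run: HTTPServer(("", 9101), MetricsHandler).serve_forever()
# then point a Prometheus scrape job at http://<host>:9101/metrics
```

Each scrape simply re-reads the current value, so Prometheus takes care of the "over time" part for you.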
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.
Say I want to build a hosting environment using Docker (so that I can scale up if needed).
I want:
Each user to be able to execute arbitrary code and
Each user to not see or affect other users
Is this something more concerning Docker, or some other tool like AppArmor?
I want users to be able to run, say, PHP code. If one user gets a lot of hits and is using a lot of CPU, I want it not to affect another user whom I've promised a certain amount of CPU. Perhaps I'm missing the concept that governs this type of thing altogether?
You can limit the memory and CPU usage of Docker containers using the --memory and --cpus flags with docker run, so each user has a maximum amount of resources they are limited to. For all such constraints, see the following documentation:
https://docs.docker.com/engine/admin/resource_constraints/
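For example (the image name and limits below are arbitrary):

```shell
# Cap this container at 512 MB of RAM and 1.5 CPUs;
# setting --memory-swap equal to --memory disallows swap on top of the limit.
docker run -d --name user1-php \
  --memory=512m --memory-swap=512m \
  --cpus=1.5 \
  php:8-apache
```

Under the hood these flags configure cgroup limits, which is also what answers the isolation question: the kernel, not AppArmor, enforces the CPU/memory quotas (AppArmor is about restricting what a container may access, not how much it may consume).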