How can we collect performance metrics from CAdvisor docker container? - docker

Sorry I just started to learn docker. My question may seem stupid for some of you.
In fact, I would like to know if there is a way to collect performance metrics from "CAdvisor" container (not from cgroup) at runtime ? I mean, extract performance values from the curves designed by cadvisor like memory usage or network traffic.
I need to record this values and save them in a database so that, I can perform a statistic analyzes upon these generated values (like comparing memory consumption for two docker containers at t=50s).
Thanks in advance.

As other answers mention, cAdvisor doesn't provide its own performance data API, instead it exposes metrics which are typically handled in a separate database if one wants to derive performance data beyond "real time". For example, cAdvisor exports Prometheus metrics natively:
http://prometheus.io/docs/instrumenting/exporters/
The Prometheus metric types:
http://prometheus.io/docs/concepts/metric_types/
Prometheus supports a fairly rich functional expression language that can be used for querying and visualization:
http://prometheus.io/docs/querying/basics/

cAdvisor does provide a rest endpoint to get any stats in real time. By default, it keeps latest two minute of data. You can configure it to keep more or less. It also supports a storage backend to keep dumping stats to an influxdb database.
REST Api:
eg. /api/v1.3/containers
doc: https://github.com/google/cadvisor/blob/master/docs/api.md
Doc on setting up InfluxDB:
https://github.com/google/cadvisor/blob/master/docs/influxdb.md

I think you could use https://github.com/tutumcloud/container-metrics for this. Basically what that would be doing is using influxdb http://influxdb.com/ as a time series data store.
There is some more information available here: http://blog.tutum.co/2014/08/25/panamax-docker-application-template-with-cadvisor-elasticsearch-grafana-and-influxdb/
A couple of people seemed to be looking into the ELK stack (Elastic Search, Logstash, Kibana) for visualising some of this data here: https://github.com/google/cadvisor/issues/634

Related

Prometheus CPU Usage Histogram Metrics

my goal is to observe metrics (like CPU, Memory usage etc.) with Prometheus on a server and on its running docker containers. Before sending an alarm, I would like to compare the certain values of those metrics with e.g. an 0.95 quantile. However, over several weeks of search in the internet I still struggle to create metrics for the certain quantiles. Therefore I ask in this thread for your help/advice, how a quantile for certain metrics can be created.
Background
The code base is a fork of the docprom repository. This code relies on Prometheus for monitoring. Prometheus retrieves its data from a running cAdvisor container. The provided metrics of cAdvisor for Prometheus can be seen on the following page. However, it provides only Gauge and Counter metric types. During my research I was not able to find parameters that would enable modifications/extensions of those provided metrics.
Problem
According to my current understanding, the metric type should be a Histogram or Summary in order to observe the quantiles. What is the best approach to use the histogram_quantile query on the metrics provided by cAdvisor?
My current idea is to
create a custom server
fetch the desired data from Prometheus
calculate the desired data
provide it as a metric from the server, so that Prometheus can scrape it
Run histogram_quantile on the custom metric
Is it the right approach in order to create a metric that can be used with quantiles?
For example I would like to fire an alarm if a certain containers' CPU usage exceeds a 0,95 quantile. The code for the CPU usage can be seen exemplary below:
sum(rate(container_cpu_usage_seconds_total{name="CONTAINER_NAME"}[10m]))) / count(node_cpu_seconds_total{mode="system"}) * 100
What would be the best approach to create the desired quantiles? Am I on the right path or am I missing something simple here? Because it looks way too hard for me in order to get a simple query with a quantile.
I am thankful for all help and information.

What is recommended solution for monitoring heterogeneous infrastructure?

I am looking for monitoring tool for the following use cases:
Collect basic metrics about virtual machine (cpu usage, memory usage, i/o, available space)
Extract metrics from SQL Server (probably running some queries)
Extract information from external service about processing i.e how many processing are currently running and for how long. I am thinking about writing python scripts, but don't know how to combine with monitoring tool
Have the ability to plot charts and manage alerts and it will nice to have ability to send not only mails, but send message to slack/ms teams.
I was thing about Prometheus, because it has wmi_exporter, node_exporter, sql exporter, alert manager with possibility to send notifications to multiple destinations, but I don't know what to do with this external service and python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can definitely fill in the blanks yourself.
E.g. you can get machine metrics basically out of the box by firing up a node_exporter and having it scraped by Prometheus, but I don't think it has e.g. information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there exists a Python client library to help with that. Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short lived batch jobs.
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.

Collect metrics from dockers in Kubernetes

I work with Docker and Kubernetes.
I would like to collect application specific metrics from each Docker.
There are various applications, each running in one or more Dockers.
I would like to collect the metrics in JSON format in order to perform further processing on each type of metrics.
I am trying to understand what is the best practice, if any and what tools can I use to achieve my goal.
Currently I am looking into several options, none looks too good:
Connecting to kubectl, getting a list of pods, performing a command (exec) at each pod to cause the application to print/send JSON with metrics. I don't like this option as it means that I need to be aware to which pods exist and access each, while the whole point of having Kubernetes is to avoid dealing with this issue.
I am looking for Kubernetes API HTTP GET request that will allow me to pull a specific file.
The closest I found is GET /api/v1/namespaces/{namespace}/pods/{name}/log and it seems it is not quite what I need.
And again, it forces me to mention each pop by name.
I consider to use ExecAction in Probe to send JSON with metrics periodically. It is a hack (as this is not the purpose of Probe), but it removes the need to handle each specific pod
I can't use Prometheus for reasons that are out of my control but I wonder how Prometheus collects metric. Maybe I can use similar approach?
Any possible solution will be appreciated.
From an architectural point of view you have 2 options here:
1) pull model: your application exposes metrics data through a mechanisms (for instance using the HTTP protocol on a different port) and an external tool scrapes your pods at a timed interval (getting pod addresses from the k8s API); this is the model used by prometheus for instance.
2) push model: your application actively pushes metrics to an external server, tipically a time series database such as influxdb, when it is most relevant to it.
In my opinion, option 2 is the easiest to implement, because:
you don't need to deal with k8s APIs in order to discover pods addresses;
you don't need to create a local storage layer to store your metrics (because you push them one by one as they occour);
But there is a drawback: you need to be careful how you implement this, it could cause your API to become slower and you might need to deal with asynchronous calls to your metrics server.
This is obviously a very generic answer, but I hope it could point you in the right direction.
Pity you can not use Prometheus, but it's a good lead for what can be done in this scope. What Prom does is as follows :
1: it assumes that metrics you want to scrape (collect) are exposed with some http endpoint that Prometheus can access.
2: it connects to kubernetes api to perform a discovery of endpoints to scrape metrics from (there is a config for it, but generaly it means it has to be able to connect to the API and list services/deployments/pods and analyze their annotations (as they have info about metrics endpoints) to compose a list of places to scrape data from
3: periodicaly (15s, 60s etc.) it connects to the endpoints and collects the exposed metrics.
That's it. Rest is storage/postprocessing. The kube related part might be a significant amount of work to do though, so it would be way better to go with something that already exists.
Sidenote: while this is generaly a pull based model, there are cases where pull is not possible (vide short lived scripts like php), that is where Prometheus pushgateway comes into play to allow pushing metrics to a place where Prometheus will pull from.

bosun and telegraf metrics meta information

hello i really want to use bosun/tsdbrelay/opentsdb with the telegraf collector, as it gets all the metrics we want to monitor out of the box.
i allready have a small setup to push metrics from 5 servers to bosun for indexing and opentsdb for storage.
i used the haproxy configs from kyle brandts bosun infrastructure blog to make the tsdbs ha-ready
but bosun is showing that it cannot use the auto-type for metrics, and also in the primary stats view does not show any graphs for cpu / mem etc.
what can i provide that the graphs show up.
kind regards.
Both of these features are mostly scollector specific. The "host" view (I've considered ripping that out, it was done in the early days, better to use something like grafana) depends on scollector specific metrics such as os.cpu.
As far as "Auto" for rate vs gauge, that is also metadata that comes from scollector and sent to bosun. If you want to try to mimic the behavior see https://github.com/bosun-monitor/bosun/blob/master/metadata/metadata.go#L30 and https://github.com/bosun-monitor/bosun/blob/master/metadata/metadata.go#L195 - you would need to create at least the "rate" key for each metric you are getting from telegraph.

Bosun HA and scalability

I have a minor bosun setup, and its collecting metrics from numerous services, and we are planning to scale these services on the cloud.
This will mean more data coming into bosun and hence, the load/efficiency/scale of bosun is affected.
I am afraid of losing data, due to network overhead, and in case of failures.
I am looking for any performance benchmark reports for bosun, or any inputs on benchmarking/testing bosun for scale and HA.
Also, any inputs on good practices to be followed to scale bosun will be helpful.
My current thinking is to run numerous bosun binaries as a cluster, backed by a distributed opentsdb setup.
Also, I am thinking is it worthwhile to run some bosun executors as plain 'collectors' of scollector data (with bosun -n command), and some to just calculate the alerts.
The problem with this approach is it that same alerts might be triggered from multiple bosun instances (running without option -n). Is there a better way to de-duplicate the alerts?
The current best practices are:
Use https://godoc.org/bosun.org/cmd/tsdbrelay to forward metrics to opentsdb. This gets the bosun binary out of the "critical path". It should also forward the metrics to bosun for indexing, and can duplicate the metric stream to multiple data centers for DR/Backups.
Make sure your hadoop/opentsdb cluster has at least 5 nodes. You can't do live maintenance on a 3 node cluster, and hadoop usually runs on a dozen or more nodes. We use Cloudera Manager to manage the hadoop cluster, and others have recommended Apache Ambari.
Use a load balancer like HAProxy to split the /api/put write traffic across multiple instances of tsdbrelay in an active/passive mode. We run one instance on each node (with tsdbrelay forwarding to the local opentsdb instance) and direct all write traffic at a primary write node (with multiple secondary/backup nodes).
Split the /api/query traffic across the remaining nodes pointed directly at opentsdb (no need to go thru the relay) in an active/active mode (aka round robin or hash based routing). This improves query performance by balancing them across the non-write nodes.
We only run a single bosun instance in each datacenter, with the DR site using the read only flag (any failover would be manual). It really isn't designed for HA yet, but in the future may allow two nodes to share a redis instance and allow active/active or active/passive HA.
By using tsdbrelay to duplicate the metric streams you don't have to deal with opentsdb/hbase replication and instead can setup multiple isolated monitoring systems in each datacenter and duplicate the metrics to whichever sites are appropriate. We have a primary and a DR site, and choose to duplicate all metrics to both data centers. I actually use the DR site daily for Grafana queries since it is closer to where I live.
You can find more details about production setups at http://bosun.org/resources including copies of all of the haproxy/tsdbrelay/etc configuration files we use at Stack Overflow.

Resources