We are using Telegraf, InfluxDB and Grafana to monitor our environment. We have two datacenters, dc1 and dc2, and each datacenter runs one InfluxDB pod. We want a way to replicate data between the two InfluxDB instances so that, if dc1 goes down, dc2 still holds the data of both datacenters (dc1 and dc2). We are using open-source InfluxDB. Can anyone suggest some approaches to achieve this?
We tried the replication-during-ingest approach, configuring the InfluxDB URLs of both datacenters in telegraf.conf as described in https://www.influxdata.com/blog/multiple-data-center-replication-influxdb/. But what happens if one of the InfluxDB instances is down? After it recovers, the two instances will hold different data, so we do not want to follow this approach.
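One way to reason about the "one instance is down" case is a per-target backlog: every point goes to both datacenters, and points that could not be delivered to one of them are kept and retried later. The sketch below is only an illustration of that idea in Python, not Telegraf's implementation; the class name BufferedFanout, the sender callables and the buffer size are hypothetical. Telegraf itself keeps a per-output buffer controlled by the agent setting metric_buffer_limit, which follows a similar principle.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

# A sender takes a batch of line-protocol strings and raises OSError on failure.
# (Hypothetical interface, used only for this sketch.)
Sender = Callable[[List[str]], None]


class BufferedFanout:
    """Write every point to all targets, keeping a separate backlog per target."""

    def __init__(self, senders: Dict[str, Sender], max_buffer: int = 10000):
        self.senders = senders
        # One bounded queue per target; when full, the oldest points are dropped.
        self.backlog: Dict[str, Deque[str]] = {
            name: deque(maxlen=max_buffer) for name in senders
        }

    def write(self, points: List[str]) -> None:
        for name, send in self.senders.items():
            queue = self.backlog[name]
            queue.extend(points)
            try:
                # Send the backlog plus the new points in one batch.
                send(list(queue))
            except OSError:
                # Target unreachable: keep the backlog and retry on the next write.
                continue
            queue.clear()
```

With this shape, a datacenter that was down receives its missed points on the first successful write after recovery, as long as the outage fits within max_buffer points. Longer outages still lose data, which is the limitation the question describes.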
I need your help with Telegraf monitoring of an InfluxDB instance and a behavior I cannot explain.
The configuration is the following:
Two independent instances of InfluxDB v1.7.10 are running on separate servers, say server A and server B
Two telegraf services v1.13.4 are running with the same configuration:
One output: a "monitoring" database created in InfluxDB
Several inputs (system, disk, ping, ...)
Grafana is used on both servers to explore the values stored by Telegraf
On server A, which is running fine, the monitoring shard size and cardinality are quite regular. On server B, on the other hand, the monitoring shard size and cardinality are much larger (by a factor of 10).
I cannot explain this difference and I have already checked:
Tag and field cardinality of the inputs used on both servers
Telegraf configuration on both servers
Any idea where to look to explain this behavior?
Thanks for your help!
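One place to look is the number of series per measurement on each server. The following Python sketch assumes the series keys of both servers have already been exported as plain strings (for example from the output of SHOW SERIES on the monitoring database) and reports the measurements whose series counts differ. The function names are hypothetical, and the parsing is simplified: it assumes measurement names contain no escaped commas.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def series_per_measurement(series_keys: Iterable[str]) -> Counter:
    """Count series per measurement from keys like 'disk,host=a,path=/'."""
    # The measurement name is everything before the first comma
    # (simplification: escaped commas in measurement names are not handled).
    return Counter(key.split(",", 1)[0] for key in series_keys)


def cardinality_diff(
    keys_a: Iterable[str], keys_b: Iterable[str]
) -> Dict[str, Tuple[int, int]]:
    """Return {measurement: (count on A, count on B)} where the counts differ."""
    count_a = series_per_measurement(keys_a)
    count_b = series_per_measurement(keys_b)
    measurements = sorted(set(count_a) | set(count_b))
    return {
        m: (count_a[m], count_b[m])
        for m in measurements
        if count_a[m] != count_b[m]
    }
```

A measurement that shows many more series on server B than on server A points at the input (and usually the tag) responsible for the extra cardinality.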
I have a small bosun setup that collects metrics from numerous services, and we are planning to scale these services in the cloud.
This will mean more data coming into bosun, which will affect its load, efficiency and scalability.
I am afraid of losing data due to network overhead and in case of failures.
I am looking for any performance benchmark reports for bosun, or any inputs on benchmarking/testing bosun for scale and HA.
Also, any inputs on good practices to be followed to scale bosun will be helpful.
My current thinking is to run numerous bosun binaries as a cluster, backed by a distributed opentsdb setup.
I am also wondering whether it is worthwhile to run some bosun instances as plain 'collectors' of scollector data (with the bosun -n command), and others just to calculate the alerts.
The problem with this approach is that the same alerts might be triggered from multiple bosun instances (those running without the -n option). Is there a better way to de-duplicate the alerts?
The current best practices are:
Use https://godoc.org/bosun.org/cmd/tsdbrelay to forward metrics to opentsdb. This gets the bosun binary out of the "critical path". It should also forward the metrics to bosun for indexing, and can duplicate the metric stream to multiple data centers for DR/Backups.
Make sure your hadoop/opentsdb cluster has at least 5 nodes. You can't do live maintenance on a 3 node cluster, and hadoop usually runs on a dozen or more nodes. We use Cloudera Manager to manage the hadoop cluster, and others have recommended Apache Ambari.
Use a load balancer like HAProxy to split the /api/put write traffic across multiple instances of tsdbrelay in an active/passive mode. We run one instance on each node (with tsdbrelay forwarding to the local opentsdb instance) and direct all write traffic at a primary write node (with multiple secondary/backup nodes).
Split the /api/query traffic across the remaining nodes, pointed directly at opentsdb (no need to go through the relay), in an active/active mode (aka round robin or hash based routing). This improves query performance by balancing the queries across the non-write nodes.
We only run a single bosun instance in each datacenter, with the DR site using the read only flag (any failover would be manual). It really isn't designed for HA yet, but in the future may allow two nodes to share a redis instance and allow active/active or active/passive HA.
By using tsdbrelay to duplicate the metric streams you don't have to deal with opentsdb/hbase replication and can instead set up multiple isolated monitoring systems in each datacenter and duplicate the metrics to whichever sites are appropriate. We have a primary and a DR site, and choose to duplicate all metrics to both data centers. I actually use the DR site daily for Grafana queries since it is closer to where I live.
You can find more details about production setups at http://bosun.org/resources including copies of all of the haproxy/tsdbrelay/etc configuration files we use at Stack Overflow.
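The write/read split described above can be illustrated with a small routing sketch. This is Python written only to show the decision logic (active/passive for /api/put, round robin for /api/query); it is not an HAProxy configuration and not the setup files linked above. The Router class and the node names are hypothetical.

```python
from itertools import cycle
from typing import List, Set


class Router:
    """Pick a backend node per request path, mimicking the split described above."""

    def __init__(self, write_nodes: List[str], query_nodes: List[str]):
        # Ordered list: primary write node first, then the backups.
        self.write_nodes = write_nodes
        self.query_count = len(query_nodes)
        self._query_cycle = cycle(query_nodes)
        # Nodes currently failing their health check.
        self.down: Set[str] = set()

    def route(self, path: str) -> str:
        if path.startswith("/api/put"):
            # Active/passive: always the first healthy node in the ordered list.
            for node in self.write_nodes:
                if node not in self.down:
                    return node
            raise RuntimeError("no write node available")
        # Active/active: round robin over the query nodes, skipping unhealthy ones.
        for _ in range(self.query_count):
            node = next(self._query_cycle)
            if node not in self.down:
                return node
        raise RuntimeError("no query node available")
```

All writes land on a single node until it fails, at which point the next backup takes over, while queries rotate across the remaining nodes.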
I'm going to use InfluxDB to store a lot of IoT data from sensors.
As the latest cluster version of InfluxDB (v0.11) is not ready for production use, and the Relay HA is still too young as well, is there another way to scale out InfluxDB?
E.g.:
How mature is the latest cluster version of InfluxDB v0.11? Should I customize v0.11 or try some other cost-saving way?
How about using Kafka in front of InfluxDB to buffer data when InfluxDB goes down?
How about sharding? Is there any detailed documentation about sharding in InfluxDB (https://influxdata.com/high-availability/)?
Anyway, I just want to find a free InfluxDB setup with working clustering.
Other than InfluxDB Relay there isn't a free way to scale out InfluxDB.
Sorry, I just started learning Docker. My question may seem stupid to some of you.
In fact, I would like to know whether there is a way to collect performance metrics from the cAdvisor container (not from cgroups) at runtime. I mean extracting the performance values behind the graphs drawn by cAdvisor, such as memory usage or network traffic.
I need to record these values and save them in a database so that I can perform statistical analysis on them (like comparing the memory consumption of two Docker containers at t=50s).
Thanks in advance.
As other answers mention, cAdvisor doesn't provide its own performance data API; instead, it exposes metrics which are typically handled in a separate database if one wants to derive performance data beyond "real time". For example, cAdvisor exports Prometheus metrics natively:
http://prometheus.io/docs/instrumenting/exporters/
The Prometheus metric types:
http://prometheus.io/docs/concepts/metric_types/
Prometheus supports a fairly rich functional expression language that can be used for querying and visualization:
http://prometheus.io/docs/querying/basics/
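If you only want to read the exported values yourself, without running a Prometheus server, the text format served on cAdvisor's /metrics endpoint can be parsed directly. The following is a minimal Python sketch of such a parser, not a full implementation of the exposition format: it ignores comment lines and leaves escape sequences in label values untouched. The function name parse_metrics is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# metric_name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$'
)
# label="value" pairs; escaped characters inside the value are kept as-is.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Filtering the result by metric name (for example container_memory_usage_bytes) and by the name label gives per-container values that can be written to any database.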
cAdvisor does provide a REST endpoint to get stats in real time. By default, it keeps the latest two minutes of data; you can configure it to keep more or less. It also supports a storage backend that continuously dumps stats to an InfluxDB database.
REST API:
eg. /api/v1.3/containers
doc: https://github.com/google/cadvisor/blob/master/docs/api.md
Doc on setting up InfluxDB:
https://github.com/google/cadvisor/blob/master/docs/influxdb.md
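As a rough sketch of using the REST endpoint from a script, the Python below fetches a container's info and extracts the memory usage samples. The JSON field names used here (stats, timestamp, memory.usage) are assumptions based on my reading of the API doc linked above, so check them against the response of your cAdvisor version. The function names and the base URL are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.request import urlopen


def fetch_container_info(base_url: str, container: str = "/") -> Dict[str, Any]:
    """GET <base_url>/api/v1.3/containers<container> and decode the JSON body."""
    url = f"{base_url}/api/v1.3/containers{container}"
    with urlopen(url, timeout=5) as response:
        return json.load(response)


def memory_usage_series(info: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Return (timestamp, memory usage in bytes) pairs from a container info dict.

    Assumes each entry of info["stats"] has a "timestamp" string and a
    "memory" object with a "usage" field.
    """
    return [
        (sample["timestamp"], sample["memory"]["usage"])
        for sample in info.get("stats", [])
    ]
```

For example, memory_usage_series(fetch_container_info("http://localhost:8080")) would return the recent samples for the root container, assuming cAdvisor listens on port 8080. The pairs can then be stored in a database and compared between containers.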
I think you could use https://github.com/tutumcloud/container-metrics for this. Basically, it uses InfluxDB (http://influxdb.com/) as a time series data store.
There is some more information available here: http://blog.tutum.co/2014/08/25/panamax-docker-application-template-with-cadvisor-elasticsearch-grafana-and-influxdb/
A couple of people seem to have been looking into the ELK stack (Elasticsearch, Logstash, Kibana) for visualising some of this data here: https://github.com/google/cadvisor/issues/634
I'm trying to monitor my network usage with Graphite but I can't figure out how to do this, could you help me please?
In addition to this, I would like to monitor some other services, such as nginx, mysql, etc.
Thanks for your help!
Best.
Of course there are multiple solutions possible. The solution I'm using is collectd. With collectd you can collect statistical data through plugins and store it in RRD files. It has a lot of plugins, such as network, nginx and mysql.
It does not generate graphs by itself, but there are multiple ways to generate graphs. One of them is to send the collected data to graphite with the graphite plugin.
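To see what actually arrives at graphite, it helps to know that carbon accepts a plaintext protocol with one metric per line, in the form "path value timestamp". The Python sketch below formats and sends such a line by hand; it is only an illustration and not how collectd is implemented. The function names and the metric path are hypothetical, and port 2003 is assumed to be carbon's plaintext listener (its usual default).

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one line of Graphite's plaintext protocol: 'path value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_metric(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send a formatted line to a carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))
```

For example, send_metric(format_metric("servers.web1.net.eth0.rx_bytes", 1024)) would record one network counter sample. In practice collectd does this for you on a fixed interval.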