I have a bunch of microservices hosted on AWS. I am using StatsD, Graphite and Grafana to monitor them. Now I want to extend this setup to monitor the SQS queues through which these microservices talk to each other. How can I leverage Graphite/Grafana to do this? Or a better approach if there isn't any support/plugin for the same. Thanks :)
PS: If it has to be Zipkin, please tell me whether the two can coexist, or whether there is a catch to using multiple tracers.
Alright, so I'm going to answer this based on what you said here:
Or a better approach if there isn't any support/plugin for the same.
The way that I do it is through Prometheus, in combination with cloudwatch_exporter and alertmanager.
The configuration for cloudwatch_exporter to monitor SQS is going to be something like (this is only two metrics, you'll need to add more based on what you're looking to monitor):
tasks:
  - name: ec2_cloudwatch
    default_region: us-west-2
    metrics:
      - aws_namespace: "AWS/SQS"
        aws_dimensions: [QueueName]
        aws_metric_name: NumberOfMessagesReceived
        aws_statistics: [Sum]
        range_seconds: 600
      - aws_namespace: "AWS/SQS"
        aws_dimensions: [QueueName]
        aws_metric_name: ApproximateNumberOfMessagesDelayed
        aws_statistics: [Sum]
You'll then need to configure Prometheus to scrape the cloudwatch_exporter endpoint at an interval. For example, this is what I do:
- job_name: 'somename'
  scrape_timeout: 60s
  dns_sd_configs:
    - names:
        - "some-endpoint"
  metrics_path: /scrape
  params:
    task: [ec2_cloudwatch]
    region: [us-east-1]
  relabel_configs:
    - source_labels: [__param_task]
      target_label: task
    - source_labels: [__param_region]
      target_label: region
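To make the params and relabel_configs sections of that job more concrete, here is a minimal Python sketch of how the scrape request and the __param_* labels come about. It only imitates Prometheus' behaviour (URL parameters are appended to metrics_path, and the first value of each parameter is exposed as __param_<name>). The port 9042 and the helper names are assumptions for illustration, so use whatever port your exporter actually listens on.

```python
from urllib.parse import urlencode


def scrape_url(target, metrics_path, params):
    """Build the URL that a scrape of `target` would request."""
    query = urlencode(params, doseq=True)
    return f"http://{target}{metrics_path}?{query}"


def param_labels(params):
    """Return the __param_<name> labels (first value of each URL parameter)."""
    return {f"__param_{name}": values[0] for name, values in params.items()}


# Values copied from the scrape config; the port is a hypothetical placeholder.
params = {"task": ["ec2_cloudwatch"], "region": ["us-east-1"]}

print(scrape_url("some-endpoint:9042", "/scrape", params))
print(param_labels(params))
```

The two relabel rules then simply copy __param_task and __param_region into the permanent task and region labels, so every scraped series records which task and region it came from.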
You would then configure alertmanager to alert based on those scraped metrics; I do not alert on those metrics, so I cannot give you an example. But, to give you an idea of this architecture, a diagram is below:
[Diagram of the architecture; image not included]
If you need to use something like StatsD, you can use statsd_exporter. And, just in case you were wondering: yes, Grafana supports Prometheus.
I'm using Traefik 2.0 with the Docker provider (swarm mode), and I want to provide a default way for services to publish themselves on Traefik while avoiding conflicts.
I managed to create a default rule matching my needs, but I'm now struggling because I don't see a way to provide a default middleware to strip away prefixes.
Is there a way to add a docker service label without having to provide a specific router name, but still adding a middleware to whatever router was implicitly created by traefik?
Or is there a way to define a default middleware as there is for the default rule?
What I'm trying to achieve is to remove all the variable substitutions in the following labels, reducing the verbosity of the whole definition without exposing myself to naming conflicts:
- traefik.enable=true
- traefik.http.services.${ENV:-dev}_${STACK}_whoami.loadbalancer.server.port=80
- traefik.http.middlewares.${ENV:-dev}_${STACK}_whoami.stripprefix.prefixes=/${STACK}
- traefik.http.routers.${ENV:-dev}_${STACK}_whoami.entrypoints=http
- traefik.http.routers.${ENV:-dev}_${STACK}_whoami.rule=PathPrefix(`/${STACK}/whoami`)
- traefik.http.routers.${ENV:-dev}_${STACK}_whoami.middlewares=${ENV:-dev}_${STACK}_whoami#docker
Hoping it could become something like the following, where default is the magic word for using the implicit service name assigned by Docker when deploying the stack:
- traefik.enable=true
- traefik.http.services.default.loadbalancer.server.port=80
- traefik.http.middlewares.default.stripprefix.prefixes=/${STACK}
- traefik.http.routers.default.entrypoints=http
- traefik.http.routers.default.rule=PathPrefix(`/${STACK}/whoami`)
- traefik.http.routers.default.middlewares=default#docker
I tried the following, but apparently the go template doesn't get replaced:
- traefik.enable=true
- traefik.http.services.{{ .Name }}.loadbalancer.server.port=80
- traefik.http.middlewares.{{ .Name }}.stripprefix.prefixes=/${STACK}
- traefik.http.routers.{{ .Name }}.entrypoints=http
- traefik.http.routers.{{ .Name }}.rule=PathPrefix(`/${STACK}/whoami`)
- traefik.http.routers.{{ .Name }}.middlewares={{ .Name }}#docker
I haven't tested it, but according to this doc, Go templating is only supported in the dynamic (YAML/TOML) configuration file.
So I suggest you try adding a dynamic configuration file (see here) and writing something like:
http:
  routers:
    {{range $i, $e := until 100 }}
    router{{ $e }}:
      middlewares:
        - "{{ $e }}"
    {{end}}
Hope this helps.
Using Docker and Traefik as a reverse proxy, I want to dispatch several paths on the same host to two different backend servers, like this:
1. traefik.test/ -> app1/
2. traefik.test/post/blabla -> app1/post/blabla
3. traefik.test/user/blabla -> app2/user/blabla
If the rules were only #2 and #3, I could do it like this in docker-compose.yml:
app1:
  image: akky/app1
  labels:
    - "traefik.backend=app1"
    - "traefik.frontend.rule=Host:traefik.test;PathPrefix:/post,/comment"
app2:
  image: akky/app2
  labels:
    - "traefik.backend=app2"
    - "traefik.frontend.rule=Host:traefik.test;PathPrefix:/user,/group"
However, adding the root '/' to the first PathPrefix seems to shadow /user on app2. The following does not work; everything goes to the app1 backend.
- "traefik.frontend.rule=Host:traefik.test;PathPrefix:/,/post,/group"
The "Host:" and "PathPrefix:" rules seem to be combined with 'AND', but I wanted 'OR' (exactly /, OR starting with /post). I searched and learned that multiple rules can be defined since version 1.3.0, according to pull request #1257, by writing multiple label lines with service names added.
Knowing that, this is what I did:
app1:
  image: akky/app1
  labels:
    - "traefik.app1_subfolder.backend=app1"
    - "traefik.app1_subfolder.frontend.rule=Host:traefik.test;PathPrefix:/post,/group"
    - "traefik.app1_rootfolder.backend=app1"
    - "traefik.app1_rootfolder.frontend.rule=Host:traefik.test;Path:/"
app2:
  image: akky/app2
  labels:
    - "traefik.backend=app2"
    - "traefik.frontend.rule=Host:traefik.test;PathPrefix:/user"
Now it works as required; root access is dispatched to app1/.
My question is: is this the proper way? It does not look like it to me, as this root-and-subfolder dispatch should be a typical use case.
You might consider adding priority labels so the app2 rules take precedence over app1 rules. Then you should be able to simplify the app1 config.
app1:
  image: akky/app1
  labels:
    - "traefik.backend=app1"
    - "traefik.frontend.priority=10"
    - "traefik.frontend.rule=Host:traefik.test;PathPrefix:/,/post,/group"
app2:
  image: akky/app2
  labels:
    - "traefik.backend=app2"
    - "traefik.frontend.priority=50"
    - "traefik.frontend.rule=Host:traefik.test;PathPrefix:/user"
Update: I had the priorities in the wrong order. Larger priority values take precedence over smaller priority values. According to the docs, it's based on (priority + rule length), and the larger value wins.
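As a rough illustration of that ordering, here is a toy Python model. It is only a sketch of the "(priority + rule length), larger value wins" description above, not Traefik's actual matching code; the pick_backend helper, the FRONTENDS structure and the simple prefix matching are assumptions made for the example.

```python
def pick_backend(frontends, path):
    """Pick the backend whose matching frontend has the highest score.

    Score = priority + length of the rule string, as described above.
    """
    matching = [
        f for f in frontends
        if any(path.startswith(prefix) for prefix in f["prefixes"])
    ]
    if not matching:
        return None
    best = max(matching, key=lambda f: f["priority"] + len(f["rule"]))
    return best["backend"]


# The two frontends from the compose file above.
FRONTENDS = [
    {
        "backend": "app1",
        "priority": 10,
        "rule": "Host:traefik.test;PathPrefix:/,/post,/group",
        "prefixes": ["/", "/post", "/group"],
    },
    {
        "backend": "app2",
        "priority": 50,
        "rule": "Host:traefik.test;PathPrefix:/user",
        "prefixes": ["/user"],
    },
]

for path in ["/", "/post/blabla", "/user/blabla"]:
    print(path, "->", pick_backend(FRONTENDS, path))
```

In this model, setting both priorities to 0 sends /user/blabla to app1, because the app1 rule string is longer. That matches the behaviour described in the question, where adding '/' to the app1 rule swallowed the /user requests.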
I would like to employ Prometheus' relabeling for adding a label hostname, which should be a more concise version of instance as provided by targets. This should allow more compact legends in Grafana dashboards.
For instance, when __address__ has been set to myhost.mydomain.com:8080, hostname should be set to myhost. I am using __address__ rather than instance as source_label, because the second is apparently not yet set when relabeling occurs.
The relevant excerpt of my prometheus.yaml looks as follows (it is meant to employ a lazy regular expression):
- job_name: 'node_exporter'
  static_configs:
    - targets: ['myhost1.mydomain.com:8080',
                'myhost2.mydomain.com:8080']
  relabel_configs:
    - source_labels: ['__address__']
      regex: '^([^\.:]+?)'
      replacement: ${1}
      target_label: 'hostname'
The expected new label hostname is not yet added. What could be wrong in my setup?
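It works with this regex, which uses a non-capturing group: '(.+?)(?:[\\.:].+)?'. Prometheus anchors relabeling regexes at both ends, so the pattern has to match the whole __address__ value, not just its beginning.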
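The difference between the two patterns can be checked with a small Python sketch. It approximates Prometheus' fully anchored matching with re.fullmatch; Python's re module is not the RE2 engine Prometheus uses, so treat this as an approximation, and note that the relabel helper is a hypothetical name for illustration.

```python
import re


def relabel(value, regex):
    """Imitate a 'replace' relabel step with replacement ${1}.

    The regex must match the entire value; otherwise no label is written.
    """
    match = re.fullmatch(regex, value)
    return match.group(1) if match else None


address = "myhost.mydomain.com:8080"
original = r"^([^\.:]+?)"        # regex from the question
fixed = r"(.+?)(?:[\\.:].+)?"    # regex that works

print(relabel(address, original))  # no match, so hostname is never set
print(relabel(address, fixed))     # first capture group is the short host name
```

The original pattern can only consume characters up to the first dot or colon, so it never reaches the end of the address and the relabel rule is silently skipped.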
I use Prometheus, together with cAdvisor to monitor my environment.
Now, I tried to use Prometheus' "target relabeling" to create a label whose value is the Docker container's image name, without the tag. It is based on the originally scraped image label.
It doesn't work for some reason, and no errors are shown when running at debug log level. I can see metrics scraped from cAdvisor (for example container_last_seen), but my newly created label isn't there.
My job configuration:
- job_name: "cadvisor"
  scrape_interval: "5s"
  dns_sd_configs:
    - names: ['cadvisor.marathon.mesos']
  relabel_configs:
    - source_labels: ['image']
      # [REGISTRYHOST/][USERNAME/]NAME[:TAG]
      regex: '([^/]+/)?([^/]+/)?([^:]+)(:.+)?'
      target_label: 'image_tagless'
      replacement: '${1}${2}${3}'
My label - image_tagless - is missing from the scraped metrics.
Any help would be much appreciated.
The image label is not a target label; it's on the metrics themselves. Thus you should use metric_relabel_configs rather than relabel_configs.
My blog on Life of a Label explains how this works.
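Once the rule runs against the scraped samples, the regex from the question should produce the tagless name. The following Python sketch shows what the replacement '${1}${2}${3}' yields for a few image names; it uses re.fullmatch to approximate Prometheus' anchored matching, and the image_tagless function and the sample image names are assumptions for illustration.

```python
import re

# Regex from the question: [REGISTRYHOST/][USERNAME/]NAME[:TAG]
IMAGE_REGEX = r"([^/]+/)?([^/]+/)?([^:]+)(:.+)?"


def image_tagless(image):
    """Return groups 1-3 joined together, i.e. the image name without its tag."""
    match = re.fullmatch(IMAGE_REGEX, image)
    if match is None:
        return None
    # Optional groups that did not take part in the match become empty strings.
    return "".join(group or "" for group in match.groups()[:3])


for image in ["registry.example.com/user/app:1.2", "user/app", "nginx:latest"]:
    print(image, "->", image_tagless(image))
```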
I would like to monitor elasticsearch using nagios.
Basically, I want to know if Elasticsearch is up.
I think I can use the Elasticsearch Cluster Health API (see here)
and use the 'status' that I get back (green, yellow or red), but I still don't know how to use Nagios for that (Nagios is on one server and Elasticsearch is on another).
Is there another way to do that?
EDIT :
I just found that - check_http_json. I think I'll try it.
After a while, I managed to monitor Elasticsearch using NRPE.
I wanted to use the Elasticsearch Cluster Health API, but I couldn't reach it from another machine due to security restrictions.
So, on the monitoring server I created a new service whose check_command is check_nrpe!check_elastic. Then, on the remote server where Elasticsearch runs, I edited the nrpe.cfg file to add the following:
command[check_elastic]=/usr/local/nagios/libexec/check_http -H localhost -u /_cluster/health -p 9200 -w 2 -c 3 -s green
This is allowed, since the command runs locally on the remote server, so there are no security issues here.
It works!!!
I'll still try the check_http_json command that I posted in my question, but for now my solution is good enough.
After playing around with the suggestions in this post, I wrote a simple check_elasticsearch script. It returns the status as OK, WARNING, and CRITICAL corresponding to the "status" parameter in the cluster health response ("green", "yellow", and "red" respectively).
It also grabs all the other parameters from the health page and dumps them out in the standard Nagios format.
Enjoy!
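For readers who want to see the idea without opening the script, the mapping can be sketched in a few lines of Python. This is not the script mentioned above, just a minimal illustration of the approach: the nagios_result function and the sample response are made up, while the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the "text | perfdata" output layout follow the usual Nagios plugin conventions.

```python
# Map the cluster health "status" field to a Nagios exit code and label.
NAGIOS_STATES = {
    "green": (0, "OK"),
    "yellow": (1, "WARNING"),
    "red": (2, "CRITICAL"),
}


def nagios_result(health):
    """Turn a parsed /_cluster/health response into (exit_code, output_line)."""
    status = str(health.get("status", "missing")).lower()
    code, label = NAGIOS_STATES.get(status, (3, "UNKNOWN"))
    # Report every numeric field as performance data (booleans are skipped).
    perfdata = " ".join(
        f"{key}={value}"
        for key, value in sorted(health.items())
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    )
    message = f"{label} - cluster status is {status}"
    if perfdata:
        message = f"{message} | {perfdata}"
    return code, message


# Hypothetical, abbreviated health response used only for demonstration.
sample = {
    "cluster_name": "demo",
    "status": "yellow",
    "timed_out": False,
    "number_of_nodes": 3,
    "active_shards": 10,
}

code, line = nagios_result(sample)
print(line)
```

A real plugin would fetch the JSON from /_cluster/health, print the line, and exit with the returned code so that Nagios picks up the state.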
Shameless plug: https://github.com/jersten/check-es
You can use it with ZenOSS/Nagios to monitor cluster health, data indices, and individual node heap usage.
You can use this cool Python script for monitoring your Elasticsearch cluster. The script checks the Elasticsearch status at the given IP and port. This and other Python scripts for monitoring Elasticsearch can be found here.
#!/usr/bin/python
from nagioscheck import NagiosCheck, Status

try:
    # Python 3
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
except ImportError:
    # Python 2
    from urllib2 import urlopen, HTTPError, URLError

try:
    import json
except ImportError:
    import simplejson as json


class ESClusterHealthCheck(NagiosCheck):

    def __init__(self):
        NagiosCheck.__init__(self)
        self.add_option('H', 'host', 'host', 'The cluster to check')
        self.add_option('P', 'port', 'port', 'The ES port - defaults to 9200')

    def check(self, opts, args):
        host = opts.host
        port = int(opts.port or '9200')

        # Query the cluster health API. An HTTP error means the API answered
        # with a failure; a URL error means the node could not be reached.
        try:
            response = urlopen(r'http://%s:%d/_cluster/health'
                               % (host, port))
        except HTTPError as e:
            raise Status('unknown', ("API failure", None,
                                     "API failure:\n\n%s" % str(e)))
        except URLError as e:
            raise Status('critical', str(e.reason))

        response_body = response.read()

        try:
            es_cluster_health = json.loads(response_body)
        except ValueError:
            raise Status('unknown', ("API returned nonsense",))

        # Map the cluster status (green/yellow/red) to a Nagios state.
        cluster_status = es_cluster_health['status'].lower()

        if cluster_status == 'red':
            raise Status("CRITICAL", "Cluster status is currently reporting as "
                                     "Red")
        elif cluster_status == 'yellow':
            raise Status("WARNING", "Cluster status is currently reporting as "
                                    "Yellow")
        else:
            raise Status("OK",
                         "Cluster status is currently reporting as Green")


if __name__ == "__main__":
    ESClusterHealthCheck().run()
I wrote this a million years ago, and it might still be useful: https://github.com/radu-gheorghe/check-es
But it really depends on what you want to monitor. The above measures:
if Elasticsearch responds to HTTP
if ingestion rate drops under the defined levels
if total number of documents drops under the defined levels
But of course there's much more that might be interesting. From query time to JVM heap usage. We wrote a blog post about the most important ones here: https://sematext.com/blog/top-10-elasticsearch-metrics-to-watch/
Elasticsearch has APIs for all these, so you may be able to use a generic check_http_json to get the needed metrics. Alternatively, you may want to use something like Sematext Monitoring for Elasticsearch, which gets these metrics out of the box, then forward threshold/anomaly alerts to Nagios. (disclosure: I work for Sematext)