Alerts firing on Prometheus but not on Alertmanager - monitoring

I can't figure out why Alertmanager is not receiving alerts from Prometheus, and I'd appreciate some help. I'm fairly new to Prometheus and Alertmanager. I'm using a webhook for MS Teams to push the notifications from Alertmanager.
Alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['critical','severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'alert_channel'

receivers:
  - name: 'alert_channel'
    webhook_configs:
      - url: 'http://localhost:2000/alert_channel'
        send_resolved: true
prometheus.yml - (Just a part of it)
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - alert_rules.yml

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'kafka'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'
    static_configs:
      - targets: ['localhost:8080']
        labels:
          service: 'Kafka'
alertmanager.service
[Unit]
Description=Prometheus Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/data/alertmanager \
    --web.listen-address=127.0.0.1:9093
Restart=always

[Install]
WantedBy=multi-user.target
alert_rules
groups:
  - name: alert_rules
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "Service {{ $labels.service }} down!"
          description: "{{ $labels.service }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 25% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 40
        for: 1s
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 40% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
The alerts show up as firing on the Prometheus Alerts page, but I don't see them in Alertmanager.
I'm out of ideas at this point and have been stuck on this since last week. Any help is appreciated.

You have a mistake in your Alertmanager configuration: group_by expects a collection of label names, and from what I can see, critical is a label value, not a label name. So simply remove critical and you should be good to go.
Also check out this blog post, it's quite helpful: https://www.robustperception.io/whats-the-difference-between-group_interval-group_wait-and-repeat_interval
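In other words, with critical removed, the route section of the alertmanager.yml above would look something like this (everything else unchanged):

route:
  group_by: ['severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'alert_channel'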
Edit 1
If you want the receiver alert_channel to receive only alerts that have the severity critical, you have to create a sub-route with a match attribute.
Something along these lines:
route:
  group_by: ['...']   # good if the alert volume is very low
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - match:
        severity: critical
      receiver: alert_channel
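Assuming the sub-route above is merged into your full config (the root route still needs its own receiver, as in Edit 2 below), you can double-check how an alert would be routed with amtool, which ships with Alertmanager; a quick sketch:

# validate the config file syntax
amtool check-config /etc/alertmanager/alertmanager.yml
# show which receiver an alert with these labels would be routed to
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical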
Edit 2
If that does not work either, try this:
route:
  group_by: ['...']
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  receiver: alert_channel
This should work. Also check your Prometheus logs and see if you find any hints there.
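If the routing looks right but nothing arrives, it's worth confirming that Prometheus can actually reach Alertmanager. A few checks, assuming the addresses from the configs above and that both services run under systemd units named prometheus and alertmanager:

# is Alertmanager reachable on the address Prometheus is pointed at?
curl -s http://127.0.0.1:9093/api/v2/status
# which Alertmanagers does Prometheus think it is talking to?
curl -s http://localhost:9090/api/v1/alertmanagers
# any alerts currently held by Alertmanager?
curl -s http://127.0.0.1:9093/api/v2/alerts
# recent logs of both services
journalctl -u alertmanager -n 50
journalctl -u prometheus -n 50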

Related

argocd pass dynamic variables to a helm release

I have a set of applications (Prometheus, Grafana and others) that I would like to deploy on several EKS clusters.
I have this set up inside one Git repo with an app of apps that each cluster can reference.
My issue is handling small differences in the values for these deployments; for example, for the Grafana deployment I want a unique URL per cluster:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grafana
  namespace: argocd
spec:
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - PrunePropagationPolicy=foreground
      - CreateNamespace=true
    retry:
      limit: 2
      backoff:
        duration: 5s
        maxDuration: 3m0s
        factor: 2
  destination:
    server: "https://kubernetes.default.svc"
    namespace:
  source:
    repoURL:
    targetRevision:
    chart:
    helm:
      releaseName: grafana
      values: |
        ...
        ...
        hostname/url: {cluster_name}.grafana....   <-----
        ...
        ...
So far the only way I can see to do this is by having multiple values files. Is there a way to make it read values from ConfigMaps, or maybe pass a variable down through the app of apps to make this work?
Any help is appreciated.
I'm afraid there is not yet a good generic solution for templating values.yaml for Helm charts in Argo CD.
Still, for your exact case, Argo CD already has everything you need.
Your "I have a set of applications" should naturally bring you to the ApplicationSet controller and its features.
For iterating over a set of clusters, I'd recommend looking at ApplicationSet generators, and in particular at the Cluster generator. Your example would then look something like:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: 'grafana'
  namespace: 'argocd'
  finalizers:
    - 'resources-finalizer.argocd.argoproj.io'
spec:
  generators:
    - clusters: # select only "remote" clusters
        selector:
          matchLabels:
            'argocd.argoproj.io/secret-type': 'cluster'
  template:
    metadata:
      name: 'grafana-{{ name }}'
    spec:
      project: 'default'
      destination:
        server: '{{ server }}'
        namespace: 'grafana'
      source:
        path:
        repoURL:
        targetRevision:
        helm:
          releaseName: grafana
          values: |
            ...
            ...
            hostname/url: {{ name }}.grafana....   <-----
            ...
            ...
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - PrunePropagationPolicy=foreground
          - CreateNamespace=true
        retry:
          limit: 2
          backoff:
            duration: 5s
            maxDuration: 3m0s
            factor: 2
Also check the full Application definition for examples of how to override particular parameters through:
...
helm:
  # Extra parameters to set (same as setting them through values.yaml, but these take precedence)
  parameters:
    - name: "nginx-ingress.controller.service.annotations.external-dns\\.alpha\\.kubernetes\\.io/hostname"
      value: mydomain.example.com
    - name: "ingress.annotations.kubernetes\\.io/tls-acme"
      value: "true"
      forceString: true # ensures that the value is treated as a string
  # Use the contents of files as parameters (uses Helm's --set-file)
  fileParameters:
    - name: config
      path: files/config.json
You can also combine inline values with valueFiles: for common options.
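As a rough sketch of that combination (the repo URL, path and file names here are placeholders, not from the original setup), the source block of an Application could look like:

source:
  repoURL: https://github.com/example/deployments.git   # placeholder Git repo containing the chart
  path: charts/grafana
  targetRevision: main
  helm:
    releaseName: grafana
    # shared defaults checked into the repo
    valueFiles:
      - values-common.yaml
    # cluster-specific overrides inlined (these should take precedence over the value files)
    values: |
      ingress:
        hosts:
          - my-cluster.grafana.example.com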

Kubernetes Affinity Prevent Jenkins Workers From Running On Main

Jenkins is running on EKS and there are affinity rules in place on both the Jenkins main and worker pods.
The idea is to prevent the Jenkins worker pods from running on the same EKS worker nodes, where the Jenkins main pod is running.
The following rules work until resource limits are pushed, at which point the Jenkins worker pods get scheduled onto the same EKS worker nodes as the Jenkins main pod.
Are there affinity / anti-affinity rules to prevent this from happening?
The rules in place for Jenkins main:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions: # assign to eks apps worker group
            - key: node.app/group
              operator: In
              values:
                - apps
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # don't assign to a node running jenkins main
            - key: app.kubernetes.io/name
              operator: In
              values:
                - jenkins
            - key: app.kubernetes.io/component
              operator: In
              values:
                - main
        topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions: # try not to assign to a node already running a jenkins worker
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - jenkins
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - worker
          topologyKey: kubernetes.io/hostname
The rules in place for Jenkins worker:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions: # assign to eks apps worker group
            - key: node.app/group
              operator: In
              values:
                - apps
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # don't assign to a node running jenkins main
            - key: app.kubernetes.io/name
              operator: In
              values:
                - jenkins
            - key: app.kubernetes.io/component
              operator: In
              values:
                - main
        topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions: # try not to assign to a node already running a jenkins worker
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - jenkins
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - worker
          topologyKey: kubernetes.io/hostname
So lo and behold, guess what... the main pod's labels weren't set correctly.
Now you can see the selector labels displaying here:
> aws-vault exec nonlive-build -- kubectl get po -n cicd --show-labels
NAME READY STATUS RESTARTS AGE LABELS
jenkins-6597db4979-khxls 2/2 Running 0 4m8s app.kubernetes.io/component=main,app.kubernetes.io/instance=jenkins
To achieve this, new entries were added to the values file:
main:
  metadata:
    labels:
      app.kubernetes.io/name: jenkins
      app.kubernetes.io/component: main
And the Helm _helpers.tpl template was updated accordingly:
{{- define "jenkins.selectorLabels" -}}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- if .Values.main.metadata.labels }}
{{- range $k, $v := .Values.main.metadata.labels }}
{{ $k }}: {{ $v }}
{{- end }}
{{- end }}
{{- end }}
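To verify the fix, something like the following can confirm that worker pods now land on nodes other than the one running the main pod (namespace and labels taken from the output above):

# node hosting the Jenkins main pod
kubectl get po -n cicd -l app.kubernetes.io/component=main -o wide
# nodes hosting the Jenkins worker pods; the NODE column should differ from the one above
kubectl get po -n cicd -l app.kubernetes.io/component=worker -o wide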

Configure basic_auth for Prometheus Target

One of the targets in static_configs in my prometheus.yml config file is secured with basic authentication. As a result, a "Connection refused" error is always displayed against that target on the Prometheus Targets page.
I have researched how to set up Prometheus to provide the credentials when scraping that particular target but couldn't find a solution. What I did find was how to set it up in the scrape_config section in the docs; that won't work for me, because I have other targets that are not protected with basic_auth.
Please help me out with this. Here is the relevant part of my .yml config:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    scrape_timeout: 5s
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:5000']
        labels:
          service: 'Auth'
      - targets: ['localhost:5090']
        labels:
          service: 'Approval'
      - targets: ['localhost:6211']
        labels:
          service: 'Credit Assessment'
      - targets: ['localhost:6090']
        labels:
          service: 'Sweep'
      - targets: ['localhost:6500']
        labels:
I would like to add more details to #PatientZro's answer.
In my case, I needed to create another job (as specified), but basic_auth needs to be at the same level of indentation as job_name (see the example here).
Also, my basic_auth cases require a metrics path, as they are not exposed at the root of my domain.
Here is an example with an API endpoint specified:
- job_name: 'myapp_health_checks'
  scrape_interval: 5m
  scrape_timeout: 30s
  static_configs:
    - targets: ['mywebsite.org']
  metrics_path: "/api/health"
  basic_auth:
    username: 'email#username.me'
    password: 'cfgqvzjbhnwcomplicatedpasswordwjnqmd'
Create another job for the target that needs auth.
So just under what you've posted, add another job:
- job_name: 'prometheus-basic_auth'
  scrape_interval: 5s
  scrape_timeout: 5s
  static_configs:
    - targets: ['localhost:5000']
      labels:
        service: 'Auth'
  basic_auth:
    username: foo
    password: bar
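Since basic_auth is a per-job setting, the unauthenticated targets can stay in the original job and only the protected target moves into the new one. After editing, the file can be validated before reloading Prometheus; a quick sketch, assuming promtool is on the PATH and the config lives at /etc/prometheus/prometheus.yml:

# check the whole config, including every scrape_config, for errors
promtool check config /etc/prometheus/prometheus.yml
# optionally confirm the protected endpoint answers with those credentials
curl -u foo:bar http://localhost:5000/metrics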

Why do I have so many duplicated processes?

I'm stress-testing my Kubernetes API and found out that every request creates a process inside the worker node.
Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${KUBE_APP_NAME}-deployment
  namespace: ${KUBE_NAMESPACE}
  labels:
    app_version: ${KUBE_APP_VERSION}
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app_name: ${KUBE_APP_NAME}
  template:
    metadata:
      labels:
        app_name: ${KUBE_APP_NAME}
    spec:
      containers:
        - name: ${KUBE_APP_NAME}
          image: XXX:${KUBE_APP_VERSION}
          imagePullPolicy: Always
          env:
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb
                  key: uri
            - name: JWT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: jwt
                  key: password
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"
      imagePullSecrets:
        - name: regcred
Apache Bench command used: ab -p payload.json -T application/json -c 10 -n 2000
Why is this happening?
It's hard to answer your question without knowing whether it's normal for these requests to be kept open.
We don't know what exactly your payload is or how big it is, and we also don't know whether the image you are using handles those requests correctly.
You should run ab with verbosity enabled (ab -v 2 <host>) and check what is taking so long.
You are using Apache Bench with the -c 10 -n 2000 options, which means there will be:
-c 10 concurrent connections at a time,
-n 2000 requests in total.
You could use -k to enable HTTP KeepAlive:
-k
Enable the HTTP KeepAlive feature, i.e., perform multiple requests within one HTTP session. Default is no KeepAlive.
It would be easier if you provided the output of your ab run.
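For reference, a re-run combining the flags above might look like this (the URL is a placeholder for the actual service endpoint):

ab -v 2 -k -p payload.json -T application/json -c 10 -n 2000 http://<service-host>/<endpoint>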
As for the Kubernetes part: we can read a definition of a Pod at Viewing Pods and Nodes:
A Pod is a Kubernetes abstraction that represents a group of one or more application containers (such as Docker or rkt), and some shared resources for those containers
...
The containers in a Pod share an IP Address and port space, are always co-located and co-scheduled, and run in a shared context on the same Node.
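If the goal is to see what those duplicated processes actually are, they can be inspected inside a running container directly; a sketch (the pod name is a placeholder, and ps may not be present in very minimal images):

# list the processes inside the pod's default container
kubectl exec -it <pod-name> -- ps aux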

Prometheus scrape from unknown number of (docker-)hosts

I have a Docker Swarm with a Prometheus container and 1-n containers for a specific microservice.
The microservice containers can be reached via a URL, and I suppose requests to that URL are load-balanced (of course...).
Currently I have spawned two microservice-container. Querying the metrics now seems to toggle between the two containers. Example: Number of total requests: 10, 13, 10, 13, 10, 13,...
This is my Prometheus configuration. What do I have to do? I do not want to adjust the Prometheus config each time I kill or start a microservice-container.
scrape_configs:
  - job_name: 'myjobname'
    metrics_path: '/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['the-service-url:8080']
        labels:
          application: myapplication
UPDATE 1
I changed my configuration as follows, which seems to work. This configuration uses a DNS lookup inside the Docker Swarm and finds all instances running the specified service.
scrape_configs:
  - job_name: 'myjobname'
    metrics_path: '/prometheus'
    scrape_interval: 15s
    dns_sd_configs:
      - names: ['tasks.myServiceName']
        type: A
        port: 8080
The question here is: Does this configuration recognize that a Docker instance is stopped and another one is started?
UPDATE 2
There is a parameter for what I am asking for:
scrape_configs:
  - job_name: 'myjobname'
    metrics_path: '/prometheus'
    scrape_interval: 15s
    dns_sd_configs:
      - names: ['tasks.myServiceName']
        type: A
        port: 8080
        # The time after which the provided names are refreshed
        [ refresh_interval: <duration> | default = 30s ]
That should do the trick.
So the answer is very simple:
There are multiple documented ways to scrape.
I am using the DNS-lookup way:
scrape_configs:
  - job_name: 'myjobname'
    metrics_path: '/prometheus'
    scrape_interval: 15s
    dns_sd_configs:
      - names: ['tasks.myServiceName']
        type: A
        port: 8080
        refresh_interval: 15s
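To confirm that scaling the service up or down is actually picked up, the discovered targets can be inspected through the Prometheus HTTP API (assuming Prometheus listens on localhost:9090; jq is optional):

# list everything Prometheus currently scrapes and grep for the job
curl -s http://localhost:9090/api/v1/targets | grep myjobname
# or, with jq, print just the discovered instance addresses for the job
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="myjobname") | .labels.instance'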
