Prometheus/Alertmanager inhibit rules

I have the following inhibit rule:
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'high'
    equal: ['alertname']
and two corresponding alerts, one with severity high and one with severity critical.
Alert 1
- alert: ContainerCpuUsage
  expr: ContainerCpuUsage > 90
  for: 30m
  labels:
    severity: high
    topic: container
  annotations:
    summary: "Container CPU usage for pod '{{ $labels.pod }}' is above 90% for the last 30 minutes."
    description: "Container CPU usage (name {{ $labels.pod }})\nMeasuredValue={{ printf \"%.2f\" $value }}%"
Alert 2
- alert: ContainerCpuUsage
  expr: ContainerCpuUsage > 98
  for: 30m
  labels:
    severity: critical
    topic: container
  annotations:
    summary: "Container CPU usage for pod '{{ $labels.pod }}' is above 98% for the last 30 minutes."
    description: "Container CPU usage (name {{ $labels.pod }})\nMeasuredValue={{ printf \"%.2f\" $value }}%"
The idea is that when CPU usage jumps suddenly from, let's say, 20% to 99%, a critical alert should fire and a high alert should not. With the inhibit rule above this works perfectly.
But when CPU usage jumps suddenly from 20% to 91%, a high alert fires, which is correct. If after some minutes the CPU usage climbs further to 99%, a second, critical alert also fires. So in total I have two open alerts, high and critical.
What I want is that if CPU usage is above 98%, the high alert should be closed and only the critical one should remain open. Why is the high alert not closed/inhibited?
If an alert has already fired, can inhibit rules close it?

The "inhibit_rules" just mutes the alerts, in other words, it prevents sending new notifications (emails, messages, etc) to recipients about the alerts, but does not inactive them.

Related

SystemMemoryExceedsReservation

Hey what does this rule say?
expr: |
  sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.95)
for: 15m
labels:
  severity: warning
We are receiving SNOW tickets saying "System memory usage of 1.161G on exceeds 95% of the reservation". But upon checking there is no memory spike; as a matter of fact, the node limits have crossed 131%.
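The rule fires when the RSS of everything under the /system.slice cgroup on a node stays above 95% of that node's system reservation (capacity minus allocatable) for 15 minutes. To see which side of the comparison is driving the tickets, you can graph the two halves of the expression separately; these queries are just the rule above split in two:
# left-hand side: system.slice RSS per node
sum by (node) (container_memory_rss{id="/system.slice"})
# right-hand side: 95% of the system reservation (capacity minus allocatable) per node
(sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.95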

Timestamps taken by rosbag play (and rqt_bag) and by rosbag.Bag's read_messages() differ

There is something very strange happening with some rosbags I have.
These rosbags contain messages of type sensor_msgs/Image among other topics.
So I do:
First scenario
On one terminal I run rostopic echo /the_image/header because I am not interested in the actual data, just the header info.
In another terminal I run rosbag play --clock the_bag.bag
With this I get
seq: 7814
stamp:
secs: 1625151029
nsecs: 882629359
frame_id: ''
---
seq: 7815
stamp:
secs: 1625151029
nsecs: 934761166
frame_id: ''
---
seq: 7816
stamp:
secs: 1625151029
nsecs: 986241550
frame_id: ''
---
seq: 7817
stamp:
secs: 1625151030
nsecs: 82884301
frame_id: ''
---
Second scenario
I do the same as in the previous scenario, but instead of rosbag play I run rqt_bag the_bag.bag and, once there, I right-click the messages to publish them.
With that I get similar values, but (I have reported this problem before) the first messages are skipped. (This is not the problem of this question.)
Third scenario
Here comes the weird part. Instead of doing as above, I have a Python script that does:
import rosbag

# collect (index, bag record time) for every image message
timestamps = []
image_idx = 0
for topic, msg, t in rosbag.Bag("the_bag.bag").read_messages():
    if topic == '/the_image':
        timestamps.append((image_idx, t))
        image_idx += 1

with open("timestamps.txt", 'w') as f:
    for idx, t in timestamps:
        f.write('{0},{1},{2}\n'.format(str(idx).zfill(8), t.secs, t.nsecs))
So as you can see I open the bag and get a list of timestamps and record it in a text file.
Which gives:
00000000,1625151029,987577614
00000001,1625151030,33818541
00000002,1625151030,88932237
00000003,1625151030,170311084
00000004,1625151030,232427083
00000005,1625151030,279726253
00000006,1625151030,363255375
00000007,1625151030,463079346
00000008,1625151030,501315763
00000009,1625151030,566104245
00000010,1625151030,586694806
As you can see the values are totally different!!!
What could be happening here?
This is a known "issue" with rostopic echo and bag files. I put "issue" in quotes because it's not necessarily a bug, just a product of how rostopic works. To spare you the somewhat obscure implementation details: this essentially happens because rospy.rostime does not get initialized correctly when you just play a bag file and echo it, even if you set /use_sim_time to true.
To give some clarity on what you're seeing: the timestamps coming out of your Python script are correct and the rostopic ones are not. If you need the timestamps to be 100% correct with rostopic, you can use the -b flag, like: rostopic echo -b the_bag.bag /my_image_topic
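If you also want to compare the header stamps (what rostopic echo /the_image/header prints) against the record times returned by read_messages(), a small variation of the script above can log both side by side. This is only a sketch, assuming the images carry a standard std_msgs/Header and the same bag and topic names as above:
import rosbag

# Sketch: print the bag record time (t) next to the publisher-set header stamp
# for each image message.
with rosbag.Bag("the_bag.bag") as bag:
    for topic, msg, t in bag.read_messages(topics=['/the_image']):
        record_time = (t.secs, t.nsecs)                                  # when the message was written to the bag
        header_stamp = (msg.header.stamp.secs, msg.header.stamp.nsecs)   # stamp filled in by the publisher
        print(record_time, header_stamp)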

Prometheus blackbox probe helpful metrics

I have around 1000 targets that are probed using HTTP.
job="http_2xx", env="prod", instance="x.x.x.x"
job="http_2xx", env="test", instance="y.y.y.y"
job="http_2xx", env="dev", instance="z.z.z.z"
I want to know for the targets:
Rate of failure by env in last 10 minutes.
Increase in rate of failure by env in last 10 minutes.
Curious what the following does:
sum(increase(probe_success{job="http_2xx"}[10m]))
rate(probe_success{job="http_2xx", env="prod"}[5m]) * 100
The closest I have come is the following, which finds the operational percentage by env over 10 minutes:
avg(avg_over_time(probe_success{job="http_2xx", env="prod"}[10m]) * 100)
Rate of failure by env in the last 10 minutes: the easiest way you can do it is
sum(rate(probe_success{job="http_2xx"}[10m]) * 100) by (env)
This will return the percentage of successful probes, which you can reverse by adding * (-1) + 100.
Calculating the rate over 10m and the increase of that rate over 10m seems redundant; adding an increase function to the above query didn't work for me. You can replace the rate function with increase if you want to.
Your first query is pretty close: it will calculate the increase of successful probes over a 10m period. You can make it show the increase of failed probes by adding == 0 and summing it by the "env" label:
sum(increase(probe_success{job="http_2xx"} == 0 [10m])) by (env)
Your second query will return the percentage of successful requests over 5m for the prod environment.
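Building on your avg_over_time query, the failure rate by env over the last 10 minutes can also be expressed as 100 minus the success percentage, since probe_success is a 0/1 gauge. A sketch using the same labels as above:
100 - (avg(avg_over_time(probe_success{job="http_2xx"}[10m])) by (env) * 100)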

The memory of cgroup rss is much higher than the sum of the memory usage of all processes in the docker container

I have a Redis running in a container.
Inside the container, the cgroup rss shows about 1283MB of memory in use.
The kmem memory usage is 30.75MB.
The sum of the memory usage of all processes in the docker container is 883MB.
How can I figure out the "disappeared memory" (1296-883-30=383MB)? The "disappeared memory" keeps growing as time passes. Finally the container gets OOM killed.
The environment info is:
redis version:4.0.1
docker version:18.09.9
k8s version:1.13
**the memory usage is 1283MB**
root@redis-m-s-rbac-0:/opt# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
1346289664 >>>> 1283.921875 MB
the kmem memory usage is 30.75MB
root@redis-m-s-rbac-0:/opt# cat /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes
32194560 >>> 30.703125 MB
root@redis-m-s-rbac-0:/opt# cat /sys/fs/cgroup/memory/memory.stat
cache 3358720
rss 1359073280 >>> 1296.11328125 MB
rss_huge 515899392
shmem 0
mapped_file 405504
dirty 0
writeback 0
swap 0
pgpgin 11355630
pgpgout 11148885
pgfault 25710366
pgmajfault 0
inactive_anon 0
active_anon 1359245312
inactive_file 2351104
active_file 1966080
unevictable 0
hierarchical_memory_limit 4294967296
hierarchical_memsw_limit 4294967296
total_cache 3358720
total_rss 1359073280
total_rss_huge 515899392
total_shmem 0
total_mapped_file 405504
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 11355630
total_pgpgout 11148885
total_pgfault 25710366
total_pgmajfault 0
total_inactive_anon 0
total_active_anon 1359245312
total_inactive_file 2351104
total_active_file 1966080
total_unevictable 0
**the sum of the memory usage of all processes in the docker container is 883MB**
root@redis-m-s-rbac-0:/opt# ps aux | awk '{sum+=$6} END {print sum / 1024}'
883.609
This is happening because usage_in_bytes does not show the exact value of memory and swap usage. memory.usage_in_bytes shows the current memory (RSS + cache) usage.
From the kernel documentation, 5.5 usage_in_bytes:
For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz value for efficient access. (Of course, when necessary, it's synchronized.) If you want to know more exact memory usage, you should use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
Reference:
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
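Following that advice, a quick way to get the RSS+CACHE(+SWAP) figure the documentation recommends is to sum those fields straight out of memory.stat (a sketch; the field names are the ones shown in the output above, and the values are in bytes):
# Sum rss + cache + swap from memory.stat and print the result in MB
awk '$1=="rss" || $1=="cache" || $1=="swap" {sum += $2} END {print sum/1024/1024 " MB"}' /sys/fs/cgroup/memory/memory.stat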

All allocated Tarantool 2.3 memtx space is occupied

Today the space allocated for Tarantool memtx ran out (memtx_memory = 5GB); the RAM was really at 5GB, and after restarting Tarantool more than 4GB was freed.
What could be clogging the RAM? What settings could this be related to?
box.slab.info()
---
- items_size: 1308568936
items_used_ratio: 91.21%
quota_size: 5737418240
quota_used_ratio: 13.44%
arena_used_ratio: 89.2%
items_used: 1193572600
quota_used: 1442840576
arena_size: 1442840576
arena_used: 1287551224
box.info()
---
- version: 2.3.2-26-g38e825b
id: 1
ro: false
uuid: d9cb7d78-1277-4f83-91dd-9372a763aafa
package: Tarantool
cluster:
uuid: b6c32d07-b448-47df-8967-40461a858c6d
replication:
1:
id: 1
uuid: d9cb7d78-1277-4f83-91dd-9372a763aafa
lsn: 89759968433
2:
id: 2
uuid: 77557306-8e7e-4bab-adb1-9737186bd3fa
lsn: 9
3:
id: 3
uuid: 28bae7dd-26a8-47a7-8587-5c1479c62311
lsn: 0
4:
id: 4
uuid: 6a09c191-c987-43a4-8e69-51da10cc3ff2
lsn: 0
signature: 89759968442
status: running
vinyl: []
uptime: 606297
lsn: 89759968433
sql: []
gc: []
pid: 32274
memory: []
vclock: {2: 9, 1: 89759968433}
cat /etc/tarantool/instances.available/my_app.lua
...
memtx_memory = 5 * 1024 * 1024 * 1024,
...
Tarantool vesrion 2.3.2, OS CentOs 7
https://i.stack.imgur.com/onV44.png
It's the result of a process called fragmentation.
The simple reason for this process is the following situation:
you have some area allocated for tuples
you put one tuple there, and then you put another one
when the first tuple needs to grow, the database has to relocate it to another place with enough capacity. After that, the place where the first tuple was becomes free, but a new place has been taken for the extended tuple.
You can decrease the fragmentation factor by increasing the tuple size for your case.
Choose the size by estimating your typical data, or just find the optimal size from metrics of your workload over time.
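If "increasing the tuple size" here refers to the allocator's minimum tuple size, the relevant knob would be memtx_min_tuple_size in box.cfg (this is an assumption on my part about which setting is meant; check the box.cfg reference for your Tarantool version). A sketch extending the my_app.lua above:
-- Sketch only: assumes "tuple size" means the allocator's minimum tuple size.
box.cfg{
    memtx_memory = 5 * 1024 * 1024 * 1024,
    memtx_min_tuple_size = 64,  -- hypothetical value; pick it from your typical tuple size
}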
