Duplicate entries in prometheus - jenkins

I'm using the Prometheus plugin for Jenkins in order to pass data to the Prometheus server and subsequently have it displayed in Grafana.
With the default setup I can see the metrics at http://:8080/prometheus
But in the list I also find some duplicate entries for the same job:
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api/com.xxxxxx.yyy:yyy-web",repo="NA",} 217191.0
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api",repo="NA",} 526098.0
Both entries refer to the same Jenkins job, spring_api, but the metrics have different values. Why do I see two entries for the same metric?

Possibly one is a subset of the other.
In the Kubernetes world you will have the resource consumption for each container in a pod, and the pod's overall resource usage.
Suppose I query the metric "container_cpu_usage_seconds_total" for {pod="X"}.
Pod X has two containers, so I'll get back four series:
{pod="X",container="container1"}
{pod="X",container="container2"}
{pod="X",container="POD"} <- some weird "pause" image with very low usage
{pod="X"} <- sum of container1 and container2
There might also be a discrepancy where the series with no container label is greater than the sum of the per-container consumption. That might be some "not accounted for" overhead, like pod DNS lookups or something; I'm not sure.
I guess my point is that Prometheus exporters will often use combinations and omissions of labels to show how a metric is broken down.
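If you only want the per-container breakdown, you can exclude the aggregate and pause-container series with label matchers. This is just a sketch against the series shown above; the pod name and label values are taken from that example:
# Per-container CPU rate for pod X, excluding the series with no container
# label (the pod-level aggregate) and the "POD" pause container.
sum by (container) (
  rate(container_cpu_usage_seconds_total{pod="X", container!="", container!="POD"}[5m])
)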

Related

How to use cAdvisor data to calculate network bandwidth usage (per month) in grafana?

I’m using Prometheus (incl. cAdvisor) and Grafana to monitor my server, on which docker containers are running. cAdvisor gives me the data for my docker containers.
I’m trying to monitor the network bandwidth usage for the selected time (on the top right corner of grafana). It should output a value like 15 GB (for the selected month) or 500 MB (for the selected day).
My approach so far:
In Grafana I am using the Stat panel with Value options > Calculation set to Total, together with the following query:
sum(rate(container_network_receive_bytes_total{instance=~"$host",name=~"$container",name=~".+"}[1m]))
(FYI: I have a container variable to filter the values for the selected container. This is why you can find the part ,name=~"$container" in the query above.)
The problem with the approach above is that the output values do not seem to be right: if I change the time range to a smaller one, I get a bigger value. For instance, if I select Last 2 days the output is 1.19 MB, while selecting Last 24 hours gives me 2.38 MB. Of course, this does not make sense, because yesterday + today can't be smaller than just today.
What am I overlooking? How can I get it to output correct values?
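One likely source of the confusion is that the query returns a per-second rate, so the Stat panel's Total adds up rate samples rather than bytes, and the result depends on how many points the panel requests rather than on actual traffic. A common alternative, sketched here under the assumption that the same dashboard variables are in place, is to let Prometheus compute the byte increase over the whole selected range using Grafana's built-in $__range variable:
# Total bytes received over the dashboard's selected time range.
# $host and $container are the dashboard variables from the question.
sum(increase(container_network_receive_bytes_total{instance=~"$host", name=~"$container", name=~".+"}[$__range]))
With a query like this, the Stat calculation would typically be Last (or Last non-null) rather than Total, since each returned point already covers the full range.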

Why are Kubernetes pods failing randomly when their limits overlap?

I have a single-node Kubernetes cluster which shows 10 Gi memory and 3 CPU as available (of a total 16 Gi, 4 CPU) for running pods after cluster startup. I am then trying two different scenarios:
Scenario 1:
Running 3 pods individually with configs (Request, Limit) as:
Pod-A: (1 Gi,3.3Gi) and (1 cpu,1 cpu)
Pod-B: (1 Gi,3.3Gi) and (1 cpu,1 cpu)
Pod-C: (1 Gi,3.3Gi) and (1 cpu,1 cpu)
In this scenario, the apps come up in their corresponding pods and work fine as expected.
Scenario 2:
Running 3 pods individually with configs (Request, Limit) as:
Pod-A: (1 Gi,10 Gi) and (1 cpu,3 cpu)
Pod-B: (1 Gi,10 Gi) and (1 cpu,3 cpu)
Pod-C: (1 Gi,10 Gi) and (1 cpu,3 cpu)
In the second scenario, the apps come up in their corresponding pods but fail randomly once some load is put on any of them, i.e. sometimes Pod-A goes down, at other times Pod-B or Pod-C. At no point am I able to run all three pods together.
The only event I can see for the failed pod is the warning below, from the node logs:
Warning CheckLimitsForResolvConf 1m (x32 over 15m) kubelet, xxx.net Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
With only this information in the logs, I am not able to figure out the actual reason for the random pod failures.
Can anyone help me understand whether there is anything wrong with the configs, or whether there is something else I am missing?
Thanks
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on.
Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.
Note: although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
So, after scheduling, if a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory. In your second scenario the requests still fit (3 × 1 Gi against 10 Gi available), but the memory limits add up to about 30 Gi, so under load the containers can together use far more memory than the node has, and pods start getting killed or evicted.
Refer to the default hard eviction threshold values. The kubelet has the following default hard eviction thresholds:
memory.available<100Mi
nodefs.available<10%
nodefs.inodesFree<5%
imagefs.available<15%
You should track your node conditions while the load is running.
The kubelet maps one or more eviction signals to a corresponding node condition.
If a hard eviction threshold has been met, or a soft eviction threshold has been met independent of its associated grace period, the kubelet reports a condition that reflects the node is under pressure, i.e. MemoryPressure or DiskPressure.
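If the cluster is already scraped by Prometheus together with kube-state-metrics (an assumption; neither is mentioned in the question), a query along these lines can show when the node goes under pressure during the load test:
# Nodes currently reporting MemoryPressure or DiskPressure
# (kube_node_status_condition is exposed by kube-state-metrics).
kube_node_status_condition{condition=~"MemoryPressure|DiskPressure", status="true"} == 1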

Can Telegraf combine/add value of metrics that are per-node, say for a cluster?

Let's say I have some software running on a VM that is emitting two metrics that are fed through Telegraf to be written into InfluxDB. Let's say the metrics are the number of successfully handled HTTP requests (S) and the number of failed HTTP requests (F) on that VM. However, I might configure three such VMs, each emitting those 2 metrics.
Now, I would like to have computed metrics which are the sum of S from each VM and the sum of F from each VM, stored as new metrics at various instants of time. Is this something that can be achieved using Telegraf, or is there a better, more efficient, more elegant way?
Kindly note that my knowledge of Telegraf and InfluxDB is theoretical, as I've only recently started reading up about them, so I have not actually tried any of the above yet.
This isn't something Telegraf would be responsible for.
With InfluxDB 1.x, you'd use a TICKscript or continuous queries to calculate the sum and write out the new sampled value.
Roughly, this would look like:
CREATE CONTINUOUS QUERY "sum_sample_daily" ON "database"
BEGIN
SELECT sum(*) INTO "daily_measurement" FROM "measurement" GROUP BY time(1d)
END
CQ docs

How do I get the number of running instances per docker swarm service as a prometheus metric?

For me it seems impossible to get a reliable metric containing all services and their container states (and counts).
Using the "last seen" metric from cAdvisor does not work; it is unreliable and there are some open bugs... Using the Docker metrics I only get the total number of instances running, stopped, and so on.
Does anyone have an idea?
Maybe the query below can help:
count(count(container_tasks_state{container_label_com_docker_swarm_service_name=~".+", container_label_com_docker_swarm_node_id=~"$node_id"}) by (container_label_com_docker_swarm_service_name))
Use the above query in Grafana, with Prometheus as the data source.
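If what you actually need is the number of running container instances per service rather than the number of distinct services, a variation along these lines may work. This is only a sketch; it assumes, like the query above, that cAdvisor exposes container_tasks_state with the swarm service-name container label, and that the metric carries a state label:
# container_tasks_state has one series per container and task state;
# counting the state="running" series per service approximates the number
# of container instances cAdvisor currently sees for each service.
count by (container_label_com_docker_swarm_service_name) (
  container_tasks_state{container_label_com_docker_swarm_service_name=~".+", state="running"}
)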

Sharing the metrics in the caches of two nodes within a Graphite cluster

I have a Graphite cluster with 2 nodes under an ELB. Both of them share the same NFS to store the metrics. I don't have a problem accessing metrics that are already written to the NFS. The problem arises when node 1 has some metrics in its cache that have not yet been written to the NFS and node 2 tries to access those metrics. One solution I have in mind is to include the IPs of both servers in local_settings.py:
#########################
# Cluster Configuration #
#########################
#CLUSTER_SERVERS = ["10.x.x.1:80", "10.x.x.2:80"]
Is there any other way, or a better solution, to access the cache on node 1 from node 2 under the same ELB?
Graphite uses files on disk to resolve globs (e.g. '*') in metric names. If a metric has not yet been written to disk, it will not be visible in Graphite.
Adding CLUSTER_SERVERS will not help, because those should be other graphite-web instances, not caches. You can add both caches to CARBONLINK_HOSTS, i.e.
CARBONLINK_HOSTS = ['10.x.x.1:7002', '10.x.x.2:7002']
but I doubt that helps because of what I said above.
