Docker - Prometheus container dies immediately

I have cAdvisor running with port mapping 4000:8080, and I need to link it with a Prometheus container.
My prometheus.yml is:
scrape_configs:
  # Scrape Prometheus itself every 2 seconds.
  - job_name: 'prometheus'
    scrape_interval: 2s
    target_groups:
      - targets: ['localhost:9090', 'cadvisor:8080']
This file is at /home/test/prometheus.yml.
To run the Prometheus container, I do:
docker run -d -p 42047:9090 --name=prometheus -v /home/test/prometheus.yml:/etc/prometheus/prometheus.yml --link cadvisor:cadvisor prom/prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/prometheus -storage.local.memory-chunks=10000
The container is created, but it dies immediately.
Can you tell me where the problem is?
Messages from docker events&:
2016-11-21T11:43:04.922819454+01:00 container start 69d03c68525c5955cc40757dc973073403b13fdd41c7533f43b7238191088a25 (image=prom/prometheus, name=prometheus)
2016-11-21T11:43:05.152141981+01:00 container die 69d03c68525c5955cc40757dc973073403b13fdd41c7533f43b7238191088a25 (exitCode=1, image=prom/prometheus, name=prometheus)

The config format has changed: targets now go under static_configs in the latest version.
scrape_configs:
  # Scrape Prometheus itself every 2 seconds.
  - job_name: 'prometheus'
    scrape_interval: 2s
    static_configs:
      - targets: ['localhost:9090', 'cadvisor:8080']
See the Prometheus documentation for further help.

Yes, target_groups has been renamed to static_configs. Please use the latest Prometheus image with the following.
static_configs:
  - targets: ['localhost:9090', 'cadvisor:8080']
The above worked for me.

I think target_groups has been deprecated in scrape_configs in the latest version of Prometheus.
You can try static_configs or file_sd_config instead (a file_sd sketch follows the example below). See the documentation for:
scrape_config
static_config
file_sd_config
scrape_configs:
  - job_name: node_exporter
    static_configs:
      - targets:
          - "stg-elk-app-01:9100"
          - "stg-app-02:9100"

The indentation isn't correct, try:
scrape_configs:
  # Scrape Prometheus itself every 2 seconds.
  - job_name: 'prometheus'
    scrape_interval: 2s
    target_groups:
      - targets: ['localhost:9090', 'cadvisor:8080']

As you said in your earlier comment:
from logs: time="2016-11-21T11:21:40Z" level=error msg="Error loading config: couldn't load configuration (-config.file=/etc/prometheus/prometheus.yml): unknown fields in scrape_config: target_groups" source="main.go:149"
Which clearly means that the field "target_groups" is causing the problem. This is because newer versions of Prometheus (v1.5 onwards) have dropped the "target_groups" field and take the targets directly. I faced this issue myself about 6 months ago. Please try it with a newer version; docker pull prom/prometheus might be giving you an old image.
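To double-check which image you actually have, something along these lines should print the version (assuming the image's entrypoint is the prometheus binary, as in the official prom/prometheus image):
docker pull prom/prometheus
docker run --rm prom/prometheus --version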
Hope this helps...!!!

The name of the container is prometheus.
Generally, when a container exits immediately after it starts, I would recommend adding -log.level=debug right after -config.file.
docker run -d -p 42047:9090 --name=prometheus -v /home/test/prometheus.yml:/etc/prometheus/prometheus.yml --link cadvisor:cadvisor prom/prometheus -config.file=/etc/prometheus/prometheus.yml -log.level=debug -storage.local.path=/prometheus -storage.local.memory-chunks=10000
Next, see the logs for the container:
docker logs prometheus
Any issues with the configuration will be there.

Related

Prometheus with Dockerfile

I have the following Dockerfile:
FROM prom/prometheus
ADD prometheus.yml /etc/prometheus/
with prometheus.yml:
global:
  scrape_interval: 15s
  external_labels:
    monitor: 'codelab-monitor'
scrape_configs:
  - job_name: 'prometheus'
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'auth-service'
    scrape_interval: 15s
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['localhost:8080']
And run it with the following command:
docker build -t prometheus .
docker run -d -p 9090:9090 --rm prometheus
prometheus has status up
auth-service has status down (Get "http://localhost:8080/actuator/prometheus": dial tcp 127.0.0.1:8080: connect: connection refused)
How can I solve the problem with auth-service? From my local machine I can get metrics from http://localhost:8080/actuator/prometheus:
v.balun@macbook-vbalun Trainter-Prometheus % curl -X GET http://localhost:8080/actuator/prometheus
# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="heap",id="G1 Survivor Space",} 4194304.0
jvm_memory_committed_bytes{area="heap",id="G1 Old Gen",} 3.145728E7
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 3.0982144E7
jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 2555904.0
jvm_memory_committed_bytes{area="heap",id="G1 Eden Space",} 2.7262976E7
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 4325376.0
jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 6291456.0
The issue you are having doesn't seem to be related to Prometheus; it looks like a problem at the Docker network level.
Inside your Prometheus container you are saying this:
static_configs:
  - targets: ['localhost:8080']
But remember that localhost is NOT your physical host anymore (as it was when you ran the service locally outside Docker); it now refers to the inside of the container, and inside that container your service is most likely not running.
With the information provided I suggest the following:
1. Instead of localhost, try your host's real IP first; depending on the network configuration you are using for your container, that may be enough.
2. Alternatively, use the IP address of your auth-service container, the one assigned by Docker; you can run a docker inspect... to get it.
3. If #1 and #2 didn't work and auth-service is running in another container on the same physical host, you can use a bridge network to make communication between the containers possible; more details here: https://docs.docker.com/network/bridge/
Once both containers are running in the same network, you can use the container name to reference it instead of localhost, something like:
static_configs:
  - targets: ['auth-service:8080']
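A minimal sketch of option 3, assuming the Prometheus image built above is tagged prometheus and the Spring Boot service image is called auth-service-image (a placeholder name):
docker network create monitoring                               # user-defined bridge network (name is arbitrary)
docker run -d --name auth-service --network monitoring auth-service-image
docker run -d --name prometheus --network monitoring -p 9090:9090 prometheus
With both containers on the same network, the auth-service:8080 target above resolves by container name.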

Collecting metrics from multiple telegraf to prometheus

Continuing from the question Sending metrics from telegraf to prometheus, which covers the case of a single Telegraf agent, what's the suggested setup for collecting metrics from multiple Telegraf agents into Prometheus?
In the end, I want Prometheus to chart (on the same graph) the CPU usage of server-1, server-2, ..., server-n, each as its own line.
Taking the configuration from the original post, you can simply add targets to your telegraf job, supposing that the same Telegraf config is used on each server.
scrape_configs:
  - job_name: 'telegraf'
    scrape_interval: 5s
    static_configs:
      - targets: ['server-1:9126', 'server-2:9126', ...]
It will produce the metrics (e.g. cpu_time_user) with a different instance label corresponding to each configured target. Typing the metric name in Prometheus will display all of them.
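For the per-server graph from the question, a query along these lines gives one line per server (assuming cpu_time_user is the cumulative CPU counter exposed by your Telegraf cpu input; substitute whatever metric name your setup actually produces):
rate(cpu_time_user{job="telegraf"}[5m])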
If you really want to see only the name of the server, you can use metric_relabel_configs to generate an additional label:
scrape_configs:
  - job_name: 'telegraf'
    ...
    metric_relabel_configs:
      - source_labels: [instance]
        regex: '(.*):\d+'
        target_label: server
Automatically adding servers to your Prometheus config is a matter of service discovery.
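For example, if all your Telegraf hosts are registered under a single DNS A record, a dns_sd_configs sketch like the following discovers them automatically (telegraf.example.com is a placeholder name):
scrape_configs:
  - job_name: 'telegraf'
    scrape_interval: 5s
    dns_sd_configs:
      - names:
          - 'telegraf.example.com'
        type: A
        port: 9126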

cAdvisor prometheus integration returns container_cpu_load_average_10s as 0

I have configured Prometheus to scrape metrics from cAdvisor. However, the metric "container_cpu_load_average_10s" only returns 0. I am able to see the CPU metrics correctly in the cAdvisor web UI, but Prometheus receives only 0. It is working fine for other metrics like "container_cpu_system_seconds_total". Could someone point out if I am missing something here?
Prometheus version: 2.1.0
Prometheus config:
scrape_configs:
  - job_name: cadvisor
    scrape_interval: 5s
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - 172.17.0.2:8080
cAdvisor version: 0.29.0
In order to get the metric container_cpu_load_average_10s, cAdvisor must run with the option
--enable_load_reader=true
which is set to false by default. This is described here.
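A hedged example of passing that flag when running the cAdvisor container (the volume mounts are the usual ones from the cAdvisor README; adjust the image tag to your environment):
docker run -d --name=cadvisor -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  google/cadvisor:v0.29.0 \
  --enable_load_reader=true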
If the value is zero, it means the container is idle.
You don't need 'enable_load_reader'. I don't enable it as it may make cAdvisor unstable.
Some useful links:
Linux Load Averages: Solving the Mystery
High CPU utilization but low load average question
cAdvisor enable_load_reader

Prometheus create label from metric label

We are running the node-exporter in containers. To quickly identify on which host each node-exporter is running, I created a metric that looks like this: host{host="$HOSTNAME",node="$CONTAINER_ID"} 1
I'm looking for a way to extract the hostname from the host label and attach it to each node-exporter instance as a hostname label. I tried numerous configurations and none seem to work. The current Prometheus config looks like this:
scrape_configs:
  - job_name: 'node'
    scrape_interval: 10s
    scrape_timeout: 5s
    metrics_path: /metrics
    scheme: http
    dns_sd_configs:
      - names:
          - tasks.master-nodeexporter
        refresh_interval: 30s
        type: A
        port: 9100
    relabel_configs:
      - source_labels: ['host']
        regex: '"(.*)".*'
        target_label: 'hostname'
        replacement: '$1'
This is not possible, as target relabelling happens before the scrape.
What you want to do here is use service discovery to have the right hostname in the first place, which is not possible with dns_sd_configs. You might look at something like Consul and https://www.robustperception.io/controlling-the-instance-label/
If someone comes across this:
Create this as docker-entrypoint.sh and make it executable.
#!/bin/sh -e
# Must be executable by others
NODE_NAME=$(cat /etc/nodename)
echo "node_meta{node_id=\"$NODE_ID\", container_label_com_docker_swarm_node_id=\"$NODE_ID\", node_name=\"$NODE_NAME\"} 1" > /etc/node-exporter/node-meta.prom
set -- /bin/node_exporter "$@"
exec "$@"
Then create a Dockerfile like this
FROM prom/node-exporter:latest
ENV NODE_ID=none
USER root
COPY conf /etc/node-exporter/
ENTRYPOINT [ "/etc/node-exporter/docker-entrypoint.sh" ]
CMD [ "/bin/node_exporter" ]
Then build it and you will always get the hostname as a node_meta metric.
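Note that node_exporter only reads the generated node-meta.prom file if its textfile collector is pointed at that directory; a sketch of running the resulting image with that flag (the image name and the /etc/hostname mount are illustrative; on older node_exporter releases the flag takes a single dash):
docker run -d --name node-exporter \
  -v /etc/hostname:/etc/nodename:ro \
  my-node-exporter-image \
  --collector.textfile.directory=/etc/node-exporter/
The arguments after the image name replace the Dockerfile's CMD, so the entrypoint above execs /bin/node_exporter with the extra flag.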
This answer explains how to export the node name via the node_name label on the node_meta metric. It is then possible to add the node_name label to any metric exposed by node_exporter with the group_left() modifier at query time. For example, the following PromQL query adds the node_name label from the node_meta metric to the node_memory_Active_bytes metric:
node_memory_Active_bytes
* on(job,instance) group_left(node_name)
node_meta
See more details about group_left() modifier in these docs and this article.

Not seeing certificate attributes when not running locally

The Problem
When I deploy 4 peer nodes with PBFT or NOOPS in the cloud, user certificate attributes are not seen; the values are blank.
Observations
Everything works locally. This suggests that I am calling the API correctly, and the chaincode is accessing attributes correctly.
When I attach to the membership container, I see the correct membersrvc.yaml with aca.enabled set to true. This is the same yaml that works locally. For good measure, I'm also passing the ENV variable MEMBERSRVC_CA_ACA_ENABLED=true.
I can see the attributes for the users in the membership service's ACA database. (suggesting that the users were created with attributes)
When I look at the actual certificate from the log (Bytes to Hex then Base64 decode) I see the attributes. (Appending certificate [30 82 02 dd 30 8....)
All attributes are blank when deployed. No errors.
Membership Service Logs
I enabled debug logging, and see that Membership services thinks it's enabled ACA:
19:57:46.421 [server] main -> DEBU 049 ACA was enabled [aca.enabled == true]
19:57:46.421 [aca] Start -> INFO 04a Staring ACA services...
19:57:46.421 [aca] startACAP -> INFO 04b ACA PUBLIC gRPC API server started
19:57:46.421 [aca] Start -> INFO 04c ACA services started
This looks good. What am I missing?
Guess
Could it be that the underlying docker container the chaincode deploys into doesn't have security enabled? Does it use the ENV passed to the parent peer? One difference is that locally I'm using "dev mode" without the base-image shenanigans.
Membership Service
membersrvc:
  container_name: membersrvc
  image: hyperledger/fabric-membersrvc
  volumes:
    - /home/ec2-user/membership:/user/membership
    - /var/hyperledger:/var/hyperledger
  command: sh -c "cp /user/membership/membersrvc.yaml /opt/gopath/src/github.com/hyperledger/fabric/membersrvc && membersrvc"
  restart: unless-stopped
  environment:
    - MEMBERSRVC_CA_ACA_ENABLED=true
  ports:
    - 7054:7054
Root Peer Service
rootpeer:
  container_name: root-peer
  image: hyperledger/fabric-peer
  restart: unless-stopped
  environment:
    - CORE_VM_ENDPOINT=unix:///var/run/docker.sock
    - CORE_LOGGING_LEVEL=DEBUG
    - CORE_PEER_ID=vp1
    - CORE_SECURITY_ENROLLID=vp1
    - CORE_SECURITY_ENROLLSECRET=xxxxxxxx
    - CORE_SECURITY_ENABLED=true
    - CORE_SECURITY_ATTRIBUTES_ENABLED=true
    - CORE_PEER_PKI_ECA_PADDR=members.x.net:7054
    - CORE_PEER_PKI_TCA_PADDR=members.x.net:7054
    - CORE_PEER_PKI_TLSCA_PADDR=members.x.net:7054
    - CORE_PEER_VALIDATOR_CONSENSUS_PLUGIN=NOOPS
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /var/hyperledger:/var/hyperledger
  command: sh -c "peer node start"
  ports:
    - 7051:7051
    - 7050:7050
Here's the request:
{
  "jsonrpc": "2.0",
  "method": "query",
  "params": {
    "chaincodeID": {
      "name": "659cb5dcc3063054e4c90908050eebf68eb2bd193cc1520f1f2d198f0ff42268"
    },
    "ctorMsg": {
      "args": ["get_results", "{\"Id\":\"abc123\"}"]
    },
    "secureContext": "user123",
    "attributes": ["account_id", "role"]
  },
  "id": 2
}
Edited: I previously thought this was just PBFT, but it's also happening with NOOPS in the cloud. I reduced the example to NOOPS.
My problem was that the fabric version inside the fabric-baseimage docker container was a good bit newer. This is my fault, because I populated that image with a fabric version manually.
Background
If one is using non-vagrant with 0.6 and not in DEV mode, deploying chaincode will have a "cannot find :latest tag" error. To solve this, I pulled a fabric-baseimage version, and populated it with what I needed, including a git-clone of fabric. I should have pulled the 0.6 branch, but instead it was pulling master.
So essentially, my fabric-peer, node-sdk deployer, and baseimage were using slightly different hyperledger versions.
After about 48 hours of configuration hell, I think I have it straightened out by sending everything back to 0.6. I have terraform spinning everything up successfully now.
I do wish the documentation included something about deploying in a non-dev multi-node environment.
