Context Deadline Exceeded - prometheus - monitoring

I have a Prometheus configuration with many jobs where I scrape metrics over HTTP. But I have one job where I need to scrape the metrics over HTTPS.
When I access:
https://ip-address:port/metrics
I can see the metrics.
The job that I have added in the prometheus.yml configuration is:
- job_name: 'test-jvm-metrics'
  scheme: https
  static_configs:
    - targets: ['ip:port']
When I restart Prometheus, I can see an error on my target that says:
context deadline exceeded
I have read that the scrape_timeout might be the problem, but I have set it to 50 seconds and still have the same problem.
What can cause this problem, and how can I fix it?
Thank you!

Probably the default scrape_timeout value is too short for you
[ scrape_timeout: <duration> | default = 10s ]
Set a bigger value for scrape_timeout.
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5m
    scrape_timeout: 1m
Take a look here https://github.com/prometheus/prometheus/issues/1438

I had the same problem in the past. In my case the problem was with the certificates, and I fixed it by adding:
tls_config:
  insecure_skip_verify: true
You can try it, maybe it will work.
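For context, a minimal sketch of how that option fits into the job from the question (the job name and target are the placeholders from the question):
- job_name: 'test-jvm-metrics'
  scheme: https
  tls_config:
    insecure_skip_verify: true   # skips certificate verification - fine for testing, weaker security
  static_configs:
    - targets: ['ip:port']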

I had a similar problem, so I tried to extend my scrape_timeout, but it didn't do anything. Running promtool, however, explained the problem.
My problematic job looked like this:
- job_name: 'slow_fella'
  scrape_interval: 10s
  scrape_timeout: 90s
  static_configs:
    - targets: ['192.168.1.152:9100']
      labels:
        alias: sloooow
Check your config in the /etc/prometheus dir by typing:
promtool check config prometheus.yml
Result explains the problem and indicates how to solve it:
Checking prometheus.yml
FAILED: parsing YAML file prometheus.yml: scrape timeout greater than scrape interval for scrape config with job name "slow_fella"
Just ensure that your scrape_timeout does not exceed your scrape_interval: either raise the interval or lower the timeout.
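For example, the job above passes the check once the interval is at least as long as the timeout (the values here are only illustrative):
- job_name: 'slow_fella'
  scrape_interval: 2m
  scrape_timeout: 90s
  static_configs:
    - targets: ['192.168.1.152:9100']
      labels:
        alias: sloooow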

This can happen when the Prometheus server can't reach the scrape endpoints, for example because of firewall deny rules. Try hitting the URL in a browser at <url>:9100 (here 9100 is the port the node_exporter service is running on) and check whether you can still access it.
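It is also worth checking from the Prometheus host itself, not just from a browser on your workstation; for example (reusing <url> as the target placeholder):
curl -v http://<url>:9100/metrics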

I was facing this issue because the maximum number of connections had been reached. I increased the max_connections parameter in the database and released some connections.
Then Prometheus was able to scrape metrics again.
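For illustration only - the answer doesn't say which database is involved - with MySQL, for example, checking and raising the limit looks like this:
SHOW VARIABLES LIKE 'max_connections';  -- show the current limit
SET GLOBAL max_connections = 500;       -- example value; pick one suited to your server and database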

In my case it was an issue with IPv6. I had blocked IPv6 with ip6tables, but that also blocked Prometheus traffic. Correcting the IPv6 settings solved the issue for me.
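For illustration, a rule of this shape lets the scrape traffic through ip6tables (9100 assumes node_exporter; use your exporter's port):
ip6tables -I INPUT -p tcp --dport 9100 -j ACCEPT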

In my case I had accidentally put a different port in my Kubernetes Deployment manifest than the one defined in the associated Service and in the Prometheus target.
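For anyone hitting the same thing, these are the three places that have to agree; the names and ports below are made up for illustration:
# Deployment pod spec - the container must actually listen on this port
containers:
  - name: my-app
    ports:
      - containerPort: 8080
# Service - targetPort must match the containerPort above
ports:
  - port: 80
    targetPort: 8080
# Prometheus - the target must point at a port that is really served
static_configs:
  - targets: ['my-app-service:80']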

Increasing the timeout to 1m helped me fix a similar issue.

We started facing a similar issue when we reconfigured the istio-system namespace and its Istio components.
We also had Prometheus installed via prometheus-operator in the monitoring namespace, where istio-injection was enabled.
Restarting the Prometheus components in the monitoring (istio-injection enabled) namespace resolved the issue.

On AWS, opening the port (for Prometheus) in the security group worked for me.
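For illustration, with the AWS CLI that would look roughly like this (the security group ID, port, and source CIDR are placeholders):
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 9100 \
  --cidr 10.0.0.5/32   # the Prometheus server's address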

Related

429 Rate exceeded responses for no apparent reason

We're using Contentful to manage CMS content. When you save content in Contentful, it sends webhooks to a service we've set up on Cloud Run, which in turn ensures the updated content is built and deployed.
Previously, the Cloud Run service was limited to 1 container max, with an 80-concurrent-request limit. This should be plenty for the few webhooks we get occasionally.
Now, when debugging complaints about content not being updated, I bumped into a very persistent and irritating issue: Google Cloud Run does not try to process the 2 webhooks sent by Contentful, but instead responds to one of the 2 with status 429 and "Rate exceeded." in the response body.
This response does not come from our backend; in the Cloud Run Logs tab I can see the message generated by Google: "The request was aborted because there was no available instance."
I've tried:
Increasing the number of processes on the container from 1 to 2 (should not be necessary due to the use of an async framework)
Increasing the number of containers from 1 to 2
The issue persists for the webhooks from Contentful.
If I try making requests from my local machine with hey, which defaults to 200 requests with 50 concurrency, they all go through without any 429 status codes returned.
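For reference, those hey defaults correspond roughly to the following invocation (the URL is a placeholder for the Cloud Run service URL):
hey -n 200 -c 50 https://<service-url>/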
What is going on that generates 429 status codes when a specific client - in this case Contentful - makes ONLY 2 requests in quick succession? How do we disable or bypass this behavior?
gcloud run services describe <name> gives me these details of the deployment:
+ Service [redacted] in region europe-north1
URL: https://[redacted].a.run.app
Ingress: all
Traffic:
100% LATEST (currently [redacted])
Last updated on 2021-01-19T13:48:46.172388Z by [redacted]:
Revision [redacted]
Image: eu.gcr.io/[redacted]/[redacted]:c0a2e7a6-56d5-4f6f-b241-1dd9ed96dd30
Port: 8080
Memory: 256Mi
CPU: 1000m
Service account: [redacted]-compute@developer.gserviceaccount.com
Env vars:
WEB_CONCURRENCY 2
Concurrency: 80
Max Instances: 2
Timeout: 300s
This is more speculation than an answer, but I would try re-deploying your Cloud Run service with min-instances set to 1 (or more).
Here is why.
In the Cloud Run troubleshooting docs they write (emphasis mine):
This error can also be caused by a sudden increase in traffic, a long container startup time or a long request processing time.
Your Cloud Run service receives webhook events from a CMS (Contentful). And, as you wrote, these updates are rather sporadic. So I think that your situation could be the same as the one described in this comment on Medium:
I tested “max-instances: 2” and the conclusion is I got 429 — Rate exceeded responses from Google frontend proxy because no container was running. It seems that a very low count of instances will deregister your service completely from the load-balancer until a second request was made.
If Google Cloud did indeed de-register your Cloud Run service completely because it was not receiving any traffic, re-deploying the service with at least one container instance could fix your issue. Another way would be to call your Cloud Run service every once in a while just to keep it "warm".
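For reference, min-instances can be set on an existing service with something like the following; the service name is a placeholder, and the region is taken from the describe output above:
gcloud run services update <service-name> --region europe-north1 --min-instances 1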

Azure Kubernetes Service pods not starting, health check is timing out, and no error logs

I am using Azure Kubernetes Service and I'm having a huge problem detecting why pods (of a specific type) aren't starting. The only thing that happens is that when new pods start, the health check times out and AKS silently goes back to the old deployed services that worked. I have added a lot of trace output to the service to detect where it fails (e.g. whether external calls are blocked), and I have a global try/catch in Program.cs, but no information comes out. AKS listens on stdout, grabs the logs there, and pushes them to an external tool. I have tried to increase the values for when the health check should start, etc., as below, but with no result:
livenessProbe:
  .
  .
  initialDelaySeconds: 60
  periodSeconds: 10
readinessProbe:
  .
  .
  initialDelaySeconds: 50
  periodSeconds: 15
When running the service locally, it's up in 10-15 seconds.
Obviously things seem to fail before the service has started, or something is timing out, and I'm wondering:
Can I fetch logs or monitor what is happening and why pods are so slow to start in AKS?
Is it possible to monitor what comes out on stdout on a virtual machine that belongs to the AKS cluster?
It feels like I have tested everything, but I can't find any reason why the health monitoring is refusing requests.
Thanks!
If you enabled Azure Monitor for Containers when you created your cluster, the logs of your application will be pushed to a Log Analytics workspace, in the ContainerLog table. If Azure Monitor is not enabled, you can use kubectl to see what is output to stdout and stderr with the following command:
kubectl logs {pod-name} -n {namespace}
You can also check the Kubernetes events; you'll see events saying that the probes failed, if this is really the problem:
kubectl get events -n {namespace}
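Describing the failing pod also shows its recent events, including the exact probe error:
kubectl describe pod {pod-name} -n {namespace}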

Failed to connect Hyperledger Explorer to Fabric project

I have a Fabric project up and running with a 7-org/5-channel setup, with each org having 2 peers. Everything is up and running. Now I am trying to connect Hyperledger Explorer to view the blockchain data. However, there is an issue I am facing in the configuration part.
Steps I performed:
Pulled the images and added the following containers in a single docker-compose.yaml file for startup: hyperledger/explorer-db:latest, hyperledger/explorer:latest, prom/prometheus:latest, grafana/grafana:latest
Edited the created containers with the respective configurations needed and volume mounts.
volumes:
  - ./config.json:/opt/explorer/app/platform/fabric/config.json
  - ./connection-profile:/opt/explorer/app/platform/fabric/connection-profile/
  - ./crypto-config:/tmp/crypto
  - walletstore:/opt/wallet
Since it's a multi-org setup, I edited the config.json files and accordingly pointed them to the respective connection profiles as per the organization setup:
{
  "network-configs": {
    "org1-network": {
      "name": "Sample-1",
      "profile": "./connection-profile/org1-network.json"
    }, and so on for other orgs
Edited the prometheus.yml to put in the static configurations
static_configs:
  - targets: ['localhost:8443','localhost:8444', and so on for every peer service]
  - targets: ['orderer0-service:8443','orderer1-service:8444', and so on for every orderer service]
Edited the peer services in my docker-compose.yaml file to add in the below values on each peer config
CORE_OPERATIONS_LISTENADDRESS=0.0.0.0:9449 # RESTful API for Hyperledger Explorer
CORE_METRICS_PROVIDER=prometheus # Prometheus will pull metrics
Issue: (Now resolved - see below)
It seems that Explorer isn't able to find my Admin@org1-cert.pem path in the given location. But I double-checked everything, and that particular path is present and accessible. All permissions on that path are also open, to avoid any permissioning issue.
Path in question [the full path is provided, not the relative path]: /home/auro/Desktop/HLF/fabricapp/crypto-config/peerOrganizations/org1/users/Admin@org1/msp/signcerts/Admin@org1-cert.pem
The config files are also set up properly. I am unable to find a way to correct this. I would be really glad if someone could tell me what is going on with this path issue, because I have tried everything I could think of and am still not able to get it working.
Other details:
Using Hyperledger Explorer v1.1.0 - pulling the latest Docker image
Using Hyperledger Fabric v1.4.6 - pulling this specific version from Docker Hub
Update: Okay, I managed to solve this. Apparently the path to be given in the config file isn't that of the local system but of the Docker container. I replaced the path with the path inside my Docker container where the files are placed, and it worked.
New Problem 1 (now solved): Now I am getting an error as shown below.
I had a look at the peer-0-org-1-service node logs when this happened, and this is the error it had logged:
2020-07-20 04:38:15.995 UTC [core.comm] ServerHandshake -> ERRO 028 TLS handshake failed with error tls: first record does not look like a TLS handshake server=PeerServer remoteaddress=172.18.0.53:33300
Update: Okay, I managed to solve this too. There were 2 issues. The TLS handshake wasn't happening because the TLS setting wasn't set to true in the config. The second issue, of STREAM removed, happened because the URL in the config wasn't specified as grpc. Once the changes were done, it was resolved.
New Problem 2 (current issue):
It seems there is a channel issue. Somehow it still shows "not assigned to this channel", along with a new error, "Error: 14 UNAVAILABLE: failed to connect to all addresses". This same error happened for all the peers across the 7 orgs.
And not to mention, suddenly the peers are not able to talk to each other.
Error received: Could not connect to Endpoint: peer0-org2-service:7051, InternalEndpoint: peer0-org2-service:7051, PKI-ID: , Metadata: : context deadline exceeded
I checked the peer channel connection details and everything seems to be in order. I am stuck on this for now. Let me know if anyone has any ideas.
As you can see from the edits, I got one problem solved before another came along. After banging my head against this many times, I removed the entire build, rebuilt it with the corrections given above, and it simply started working.
You seem to be using an old Explorer image. I strongly recommend using the latest one, v1.1.1. Note: there are some updates to the settings format in the connection profile (e.g. the login credentials of Explorer). Please refer to README-CONFIG for details.

GridGain Web Console with Docker: 404 Not Found

I'm trying to deploy GridGain Web Console 2020.03.01 on RHEL7 x86_64 with Docker, following the documentation here.
However, there is a 404 Not Found error when accessing the http://localhost:3000/swagger-ui.html page, which is used as a healthcheck. The backend logs show no errors. The last version I'm able to get containers running with is 2019.12.02 (which in fact refuses to show a connected cluster, but that's another issue). Starting with 2020.01.00, all backend healthchecks fail. That looks suspicious, considering that the 2020.01.00 release notes include updates of io.springfox and swagger-ui-dist.
Besides that, the 2020.03.01 release notes say that the Console's default port has changed to 8008, but the server still starts on 3000.
Anyone had any luck deploying dockerized Web Console?
The Web Console consists of a backend and a frontend. The backend is started on port 3000, which is printed in the log, while the frontend is indeed started on port 8008 - and you most probably want to use that one.
The docker-compose.yml given on the documentation site maps the container's port 8008 to the host's port 80; feel free to replace that with any port you want.
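For example, the relevant mapping in the compose file looks something like this (the service name is illustrative), assuming the frontend container exposes 8008 as described:
frontend:
  ports:
    - "80:8008"   # host:container - replace 80 with any host port you prefer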
Regarding the healthcheck, the endpoint is now changed to /health.
Swagger was removed in 2020.01.00 due to security concerns (the same GG-26726 issue mentioned in the release notes). You are right to be suspicious; I'll ask the right people to update the release notes and the docs. Sorry about the confusion, and thanks for pointing the issue out. Swagger was supposed to be an internal feature for the Web Console (WC) developer team only.
As you pointed out, starting with 2020.01.00 the Swagger-based health check won't work. Internally, the WC team uses dockerize to wait for the backend to start; here's an example from our E2E test suite compose:
entrypoint: dockerize -wait http://backend:3000/health -timeout 2m -wait-retry-interval 5s node ./index.js --target=${TARGET:-on-premise}
This might work for you too, with some adaptation. You will most likely have to remove the "healthcheck" sections from docker-compose.yml too, or modify them, if the "http://backend:3000/health" URL can indeed serve as a direct replacement for the old "http://localhost:3000/swagger-ui.html" URL, which I am not sure about.
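If you would rather adapt the healthcheck than remove it, here is a sketch for the backend service, assuming /health really is a drop-in replacement and that curl is available inside the image:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health"]   # assumes curl exists in the container
  interval: 30s
  timeout: 10s
  retries: 5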

Realm database: synchronization with server

I'm testing the Realm database using the test application RealmTasks and found out that synchronization with the server doesn't work. Authentication works well, but sync does not. The Realm server is installed on a CentOS 7 server. The default port 9080 is busy, so I changed the Realm server config file:
http:
  enable: true
  listen_address: '0.0.0.0'
  listen_port: 6666
network:
  http:
    listen_address: '0.0.0.0'
    listen_port: 27080
As a result, I can connect to 27080 from outside but cannot connect to port 6666. All ports are open for outside connections. Is it possible that such a configuration doesn't allow the database to synchronize?
Update
That config file is just wrong, if that's exactly what you have. The YAML is nested wrongly because your first http block is not nested.
Experimenting with the Mac Developer Edition, here's a minimal working configuration.yml file:
storage:
  root_path: 'root_dir'
auth:
  public_key_path: 'keys/token-signature.pub'
  private_key_path: 'keys/token-signature.key'
proxy:
  http:
    listen_address: '::'
    listen_port: 9666
Important - it seems port numbers are constrained: the configuration documentation (https://realm.io/docs/realm-object-server/#configuring-the-server) mentions the need to use 1024 or higher, as the server doesn't run as root. I am not sure why I could not get 6666 to run, although that port is supposedly commonly used for IRC; multiple failure messages appear in the Terminal window of the process launching the server with that port.
Earlier questions
Are you telling the RealmTasks app to connect to that port? (Obvious question but I had to ask.)
Please supply logs from the server, or look at the logs yourself; you can view them and adjust their level in the web dashboard, e.g. at http://localhost:9080/#!/logs
